Scraping Web News (Project Setup for Pythoners)
优采云 Published: 2021-09-16 08:23
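Recently I have been using Scrapy to scrape web pages, which is very convenient for anyone working in Python. The details are as follows:
To scrape page data with Scrapy, first create a new project: scrapy startproject myproject
Once the project is created, there is a subdirectory myproject/myproject containing items.py (where you define the fields you want to scrape) and pipelines.py (for processing the scraped data, for example saving it to a database), plus a spiders folder where you write the crawler scripts.
Here we take scraping book information from a website as an example:
items.py looks like this: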
from scrapy.item import Item, Field


class BookItem(Item):
    # define the fields for your item here like:
    name = Field()
    author = Field()  # assigned by the spider in parse_item, so it must be declared here
    publisher = Field()
    publish_date = Field()
    price = Field()
Everything we want to scrape is defined above, including the name, publisher, publication date, and price.
Next we write the spider, which goes out onto the web and fetches the information.
spiders/book.py looks like this:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from myproject.items import BookItem


class BookSpider(CrawlSpider):
    name = 'bookspider'
    # must cover the domain of the URLs below (fictional; replace it in real use)
    allowed_domains = ['test.com']
    start_urls = [
        "http://test_url.com",  # the page to start crawling from (this URL is fictional; replace it in real use)
    ]
    rules = (
        # pages matching this pattern are not scraped; only their links are followed (fictional URL; replace it in real use)
        Rule(LinkExtractor(allow=r'http://test_url/test\?page_index=\d+')),
        # pages matching this pattern have their content extracted (fictional URL; replace it in real use)
        Rule(LinkExtractor(allow=r'http://test_url/test\?product_id=\d+'), callback="parse_item"),
    )

    def parse_item(self, response):
        item = BookItem()
        # get(default='') and re_first(default='') return '' instead of failing when nothing matches
        item['name'] = response.xpath('//div[@class="h1_title book_head"]/h1/text()').get(default='')
        item['author'] = response.xpath('//div[@class="book_detailed"]/p[1]/a/text()').getall()
        item['publisher'] = response.xpath('//div[@class="book_detailed"]/p[2]/a/text()').get(default='')
        # the date follows a CJK label and a full-width colon (\uff1a)
        item['publish_date'] = response.xpath('//div[@class="book_detailed"]/p[3]/text()').re_first(
            r"[\u2e80-\u9fffh]+\uff1a([\d-]+)", default='')
        item['price'] = response.xpath('//p[@class="price_m"]/text()').re_first(
            r"\d+(?:\.\d+)?", default='')
        return item
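Then, once the information has been scraped, it needs to be saved. For this we write pipelines.py (Scrapy is built on Twisted, so see the Twisted documentation for the details of database operations; what follows is a brief look at how to save the items to a database):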
import logging

import MySQLdb.cursors
from twisted.enterprise import adbapi

logger = logging.getLogger(__name__)


class MySQLStorePipeline(object):
    def __init__(self):
        # adbapi runs the blocking MySQLdb calls in a thread pool so the crawl is not stalled
        self.dbpool = adbapi.ConnectionPool(
            'MySQLdb',
            db='test',
            user='user',
            passwd='******',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=False,
        )

    def process_item(self, item, spider):
        # run the insert inside a database transaction and log any failure
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # only store items that actually have a name
        if item.get('name'):
            tx.execute(
                "insert into book (name, publisher, publish_date, price) "
                "values (%s, %s, %s, %s)",
                (item['name'], item['publisher'], item['publish_date'],
                 item['price'])
            )

    def handle_error(self, failure):
        # called with a Twisted Failure when the insert raises; log it and carry on
        logger.error(failure)
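The conditional insert can be tried without a MySQL server. The following is a rough sketch that swaps in the standard library's sqlite3 module as a stand-in for MySQLdb and drops the Twisted connection pool; the table schema, the function name, and the sample items are assumptions made for the illustration, and sqlite3 uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3


def conditional_insert(conn, item):
    """Insert the item into the book table only if it has a name.

    Returns True when a row was written, False when the item was skipped.
    """
    if not item.get('name'):
        return False
    conn.execute(
        "INSERT INTO book (name, publisher, publish_date, price) "
        "VALUES (?, ?, ?, ?)",
        (item['name'], item['publisher'], item['publish_date'], item['price']),
    )
    conn.commit()
    return True


# In-memory database with an assumed schema (the original does not show one)
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE book (name TEXT, publisher TEXT, publish_date TEXT, price TEXT)"
)

# Invented sample items: one complete, one without a name
good = {'name': 'Example Book', 'publisher': 'Example Press',
        'publish_date': '2013-05-01', 'price': '45.00'}
nameless = {'name': '', 'publisher': 'Example Press',
            'publish_date': '', 'price': ''}

inserted = conditional_insert(conn, good)
skipped = conditional_insert(conn, nameless)
rows = conn.execute("SELECT name, price FROM book").fetchall()
print(inserted, skipped, rows)
```

Passing the values as a parameter tuple, rather than formatting them into the SQL string, lets the database driver handle quoting.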
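After that, register the pipeline in settings.py:
ITEM_PIPELINES = {'myproject.pipelines.MySQLStorePipeline': 300}  # the number (0-1000) sets the order in which pipelines run; lower runs first
Finally, run scrapy crawl bookspider to start crawling.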