Scraping web pages with Scrapy (project setup for Python developers)

优采云 · Published: 2021-09-16 08:23

Recently I have been using Scrapy to scrape web pages, which is very convenient for Python developers. The details are as follows:

To scrape page data with Scrapy, first create a new project:

    scrapy startproject myproject

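Once the project is created, there is a subdirectory myproject/myproject containing items.py (where you define the fields you want to extract), pipelines.py (where scraped data is processed, for example saved to a database), and a spiders folder, where you write the spider scripts.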

Here, scraping book information from a website serves as the example:

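items.py looks like this:

    from scrapy.item import Item, Field


    class BookItem(Item):
        # Define the fields for your item here.
        name = Field()
        author = Field()  # assigned by the spider, so it must be declared here
        publisher = Field()
        publish_date = Field()
        price = Field()

Everything we want to extract is defined above: name, author, publisher, publish date, and price.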

Next, we write the spider that visits the site and scrapes the information.

spiders/book.py looks like this:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    from myproject.items import BookItem


    class BookSpider(CrawlSpider):
        name = 'bookspider'
        # All URLs below are fictitious; replace them before real use.
        allowed_domains = ['test_url.com']
        start_urls = [
            'http://test_url.com',  # page where crawling starts
        ]

        rules = (
            # Pages matching this rule are not parsed for content; only their links are followed.
            Rule(LinkExtractor(allow=(r'http://test_url\.com/test\?page_index=\d+',))),
            # Pages matching this rule are passed to parse_item for extraction.
            Rule(LinkExtractor(allow=(r'http://test_url\.com/test\?product_id=\d+',)),
                 callback='parse_item'),
        )

        def parse_item(self, response):
            detail = '//div[@class="book_detailed"]'
            item = BookItem()
            item['name'] = response.xpath('//div[@class="h1_title book_head"]/h1/text()').get()
            item['author'] = response.xpath(detail + '/p[1]/a/text()').getall()
            # Fall back to an empty string when a field is missing from the page.
            item['publisher'] = response.xpath(detail + '/p[2]/a/text()').get(default='')
            item['publish_date'] = response.xpath(detail + '/p[3]/text()').re_first(
                r'[\u2e80-\u9fffh]+\uff1a([\d-]+)', default='')
            # Require at least one digit so the pattern cannot match an empty string.
            item['price'] = response.xpath('//p[@class="price_m"]/text()').re_first(
                r'(\d+(?:\.\d+)?)', default='')
            return item
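The regular expressions in a spider like this can be checked without running a crawl. The following is a minimal sketch using only Python's `re` module, assuming the fictitious host `test_url.com` and made-up sample strings for the date and price text; the helper names `classify` and `first_match` are hypothetical. It illustrates two details worth checking: `?` in a URL has to be escaped in a link pattern, and a price pattern made only of optional parts can match an empty string.

```python
import re

# Hypothetical patterns modelled on the two crawl rules (the host is fictitious).
LIST_PAGE = re.compile(r"http://test_url\.com/test\?page_index=\d+")
DETAIL_PAGE = re.compile(r"http://test_url\.com/test\?product_id=\d+")
# The same list pattern with "?" left unescaped: "?" then makes the "t" optional
# instead of matching a literal question mark.
UNESCAPED = re.compile(r"http://test_url.com/test?page_index=\d+")


def classify(url):
    """Return 'parse' for detail pages, 'follow' for list pages, else None."""
    if DETAIL_PAGE.search(url):
        return "parse"
    if LIST_PAGE.search(url):
        return "follow"
    return None


# Field patterns: the date pattern is the one used in parse_item; the price
# pattern requires at least one digit.
DATE_RE = re.compile(r"[\u2e80-\u9fffh]+\uff1a([\d-]+)")
PRICE_RE = re.compile(r"(\d+(?:\.\d+)?)")


def first_match(pattern, text, default=""):
    """Return the first captured group, or the default when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else default


if __name__ == "__main__":
    print(classify("http://test_url.com/test?page_index=3"))
    print(classify("http://test_url.com/test?product_id=42"))
    print(UNESCAPED.search("http://test_url.com/test?page_index=3"))
    # Sample strings below are invented for illustration.
    print(first_match(DATE_RE, "出版时间：2012-05-01"))
    print(first_match(PRICE_RE, "¥45.00"))
    # A pattern built only from optional parts matches "" before the currency sign.
    print(re.findall(r"(\d*\.*\d*)", "¥45.00"))
```

Testing patterns this way against a few real URLs and text snippets from the target site is a cheap check before starting a long crawl.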

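After the information has been scraped, it needs to be saved. For that, write pipelines.py. (Scrapy is built on Twisted, so see the Twisted documentation for the details of the database operations; what follows is a brief example of saving to a database.)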

    import logging

    import MySQLdb.cursors
    from twisted.enterprise import adbapi

    logger = logging.getLogger(__name__)


    class MySQLStorePipeline(object):

        def __init__(self):
            # adbapi runs the blocking MySQLdb calls in a thread pool.
            self.dbpool = adbapi.ConnectionPool(
                'MySQLdb',
                db='test',
                user='user',
                passwd='******',
                cursorclass=MySQLdb.cursors.DictCursor,
                charset='utf8',
                use_unicode=False,
            )

        def process_item(self, item, spider):
            # Schedule the insert without blocking the crawl; failures go to handle_error.
            query = self.dbpool.runInteraction(self._conditional_insert, item)
            query.addErrback(self.handle_error)
            return item

        def _conditional_insert(self, tx, item):
            # Only store items that have a name.
            if item.get('name'):
                tx.execute(
                    "insert into book (name, publisher, publish_date, price) "
                    "values (%s, %s, %s, %s)",
                    (item['name'], item['publisher'], item['publish_date'],
                     item['price']),
                )

        def handle_error(self, failure):
            # Log database errors instead of letting them pass silently.
            logger.error(failure)
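The conditional insert can be tried out without a MySQL server. The sketch below swaps in Python's built-in `sqlite3` module purely for illustration; it is not the MySQLdb/Twisted setup used in the pipeline. The `book` table schema and the sample items are assumptions. Note that `sqlite3` uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3


def conditional_insert(conn, item):
    """Insert the item only if it has a non-empty name; return True if stored."""
    if not item.get("name"):
        return False
    # Parameterized query: values are passed separately, never formatted into the SQL.
    conn.execute(
        "INSERT INTO book (name, publisher, publish_date, price) "
        "VALUES (?, ?, ?, ?)",
        (
            item["name"],
            item.get("publisher", ""),
            item.get("publish_date", ""),
            item.get("price", ""),
        ),
    )
    conn.commit()
    return True


# In-memory database with an assumed schema for the book table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE book (name TEXT, publisher TEXT, publish_date TEXT, price TEXT)"
)

# Made-up items: one complete, one without a name.
inserted = conditional_insert(
    conn,
    {"name": "Example Book", "publisher": "Example Press",
     "publish_date": "2012-05-01", "price": "45.00"},
)
skipped = conditional_insert(conn, {"name": "", "price": "10.00"})

rows = conn.execute("SELECT name, price FROM book").fetchall()
print(inserted, skipped, rows)
```

Passing the values as parameters rather than building the SQL string by hand lets the driver handle quoting, which matters when book titles contain quotes or other special characters.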

Then register the pipeline in settings.py:

    ITEM_PIPELINES = {'myproject.pipelines.MySQLStorePipeline': 300}
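In current Scrapy releases, ITEM_PIPELINES is a dict that maps each pipeline's import path to an integer, customarily in the 0-1000 range, and items pass through pipelines from the lowest value to the highest. The sketch below only illustrates that ordering in plain Python; `CleanBookPipeline` is a hypothetical second pipeline, and the numbers are arbitrary.

```python
# Settings fragment: lower numbers run earlier.
ITEM_PIPELINES = {
    "myproject.pipelines.MySQLStorePipeline": 300,
    "myproject.pipelines.CleanBookPipeline": 100,  # hypothetical cleaning step
}

# Sorting by the assigned number shows the order in which items are processed.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With these values, a cleaning step would see each item before the database pipeline stores it.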

Finally, run scrapy crawl bookspider to start crawling.

