The Scrapy Crawler Framework: Scraping Tmall and Taobao Data

优采云 Published: 2020-05-05 08:05

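With the groundwork from the previous two articles in place, this one walks through scraping Tmall and Taobao data to show in detail how to crawl the content you want with Scrapy. Complete code: [version without a database] [version with a database].

Starting from a Tmall search, we collect the sales volume, favorites count, and price of every product in the results.

So the goal comes down to fetching two kinds of pages. The first is the search results page, from which we extract the detail URL of each product. The second is the product detail page, from which we read the sales volume, price, and so on.

With that plan in mind, we first download the search results page and then download the detail page of each item listed on it.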

    def _parse_handler(self, response):
        """Download the page."""
        # Load the URL in the selenium-driven browser so that JavaScript runs.
        self.driver.get(response.url)

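This is straightforward: self.driver.get(response.url) downloads the page through selenium, so content rendered by JavaScript is included. The page content carried by response itself is only the static HTML.

That covers downloading. Once the content is downloaded, we need to pull the useful information out of it, and that is what selectors are for. A selector can be constructed in several ways; only one of them is shown here: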

    >>> from scrapy.selector import Selector
    >>> body = '<html><body><span>good</span></body></html>'
    >>> Selector(text=body).xpath('//span/text()').extract()
    ['good']

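This uses XPath to extract the word good. A dedicated XPath tutorial covers the syntax in more depth.

Besides XPath, Selector also supports CSS selectors and regular expressions. We will not go into the details here; it is enough to know that these let us quickly extract the content we need.

Having briefly covered how to extract content, we can now pull the product detail links out of the first page of search results. The page source shows that a product link looks like this: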

    ...
    <a class="J_ClickStat" data-nid="523242229702" href="//detail.tmall.com/item.htm?spm=a230r.1.14.46.Mnbjq5&id=523242229702&ns=1&abbucket=14" target="_blank" trace="msrp_auction" traceidx="5" trace-pid="" data-spm-anchor-id="a230r.1.14.46">WD/西部数据 WD30EZRZ台式机3T电脑<span class="H">硬盘</span> 西数蓝盘3TB 替绿盘</a>
    </p>
    ...

Applying the rules above, the href attribute of the a element is the content we need:

    # Do not omit "text=": the first positional argument of Selector is a
    # response object, not an HTML string. self.driver.page_source is the
    # HTML of the loaded page.
    selector = Selector(text=self.driver.page_source)
    selector.css(".title").css(".J_ClickStat").xpath("./@href").extract()

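In short, the CSS selector first picks the p element whose class is title, then the a element whose class is J_ClickStat, and finally an XPath rule reads the href attribute of that a element. One more note: to select by id instead, write selector.css("#title"), exactly as in ordinary CSS.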
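The href extracted above starts with "//", so it has no scheme and cannot be requested as it stands. A minimal sketch of resolving it against the page URL with the standard library follows. SEARCH_PAGE_URL and the helper names to_absolute and item_id are assumptions made for illustration; they do not come from the original project.

```python
from urllib.parse import parse_qs, urljoin, urlparse

# Assumed URL of the search results page the links were extracted from.
SEARCH_PAGE_URL = "https://list.tmall.com/search_product.htm"


def to_absolute(href, base_url=SEARCH_PAGE_URL):
    """Resolve a protocol-relative or relative href against the page URL."""
    return urljoin(base_url, href)


def item_id(url):
    """Return the value of the `id` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("id")
    return values[0] if values else None


# The href taken from the sample markup above.
href = "//detail.tmall.com/item.htm?spm=a230r.1.14.46.Mnbjq5&id=523242229702&ns=1&abbucket=14"
detail_url = to_absolute(href)
print(detail_url)
print(item_id(detail_url))
```

Inside a Scrapy callback, response.urljoin(href) does the same resolution using the URL of the response.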

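In the same way, once we have the product detail page we can extract the fields from it. Taking the sales volume as an example, look at the source: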

    <li class="tm-ind-item tm-ind-emPointCount" data-spm="1000988"><div class="tm-indcon"><a href="//vip.tmall.com/vip/index.htm" target="_blank"><span class="tm-label">送天猫积分</span><span class="tm-count">55</span></a></div></li>
    </ul>

Get the monthly sales volume:

    selector.css(".tm-ind-sellCount").xpath("./div/span[@class='tm-count']/text()").extract_first()

Get the cumulative review count:

    selector.css(".tm-ind-reviewCount").xpath("./div[@class='tm-indcon']/span[@class='tm-count']/text()").extract_first()

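Finally, wrap the extracted data in an Item and return it. Tmall and Taobao pages are laid out differently, so the extraction rules differ too and each site needs its own set.

An Item is the result of a Scrapy extraction, and later stages can process it.

Items are usually defined in items.py: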

    import scrapy


    class Product(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        stock = scrapy.Field()
        last_updated = scrapy.Field(serializer=str)

    >>> product = Product(name='Desktop PC', price=1000)
    >>> print(product)
    Product(name='Desktop PC', price=1000)

    >>> product['name']
    'Desktop PC'
    >>> product.get('name')
    'Desktop PC'
    >>> product['price']
    1000
    >>> product['last_updated']
    Traceback (most recent call last):
    ...
    KeyError: 'last_updated'
    >>> product.get('last_updated', 'not set')
    'not set'
    >>> product['lala'] # getting unknown field
    Traceback (most recent call last):
    ...
    KeyError: 'lala'
    >>> product.get('lala', 'unknown field')
    'unknown field'
    >>> 'name' in product # is name field populated?
    True
    >>> 'last_updated' in product # is last_updated populated?
    False
    >>> 'last_updated' in product.fields # is last_updated a declared field?
    True
    >>> 'lala' in product.fields # is lala a declared field?
    False

    >>> product['last_updated'] = 'today'
    >>> product['last_updated']
    'today'
    >>> product['lala'] = 'test' # setting unknown field
    Traceback (most recent call last):
    ...
    KeyError: 'Product does not support field: lala'

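There is one thing to watch out for: a field cannot be read as product.name, nor can it be set with product.name = "name". Fields are only accessible with the dictionary syntax shown above.

After an Item has been collected in the Spider, it is passed to the Item Pipeline, where a series of components process it in a defined order.

Each item pipeline component (sometimes just called an "Item Pipeline") is a Python class that implements a simple method. It receives an Item, acts on it, and also decides whether the Item continues through the pipeline or is dropped and processed no further.

Typical uses of an item pipeline include cleaning, validating, de-duplicating, and storing the scraped data.

Here we implement an Item filter: fields that came back as None are set to 0, and if the Item object itself is None the item is dropped.

Pipelines are usually defined in pipelines.py: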

    from scrapy.exceptions import DropItem

    # text_utils is a helper module from the author's project (not shown here).


    class TTDataHandlerPipeline(object):

        def process_item(self, item, spider):
            if item is not None:
                # If one of the two prices is missing, fall back to the other.
                if item["p_standard_price"] is None:
                    item["p_standard_price"] = item["p_shop_price"]
                if item["p_shop_price"] is None:
                    item["p_shop_price"] = item["p_standard_price"]

                # Normalize the counters to integers and the prices to strings.
                item["p_collect_count"] = text_utils.to_int(item["p_collect_count"])
                item["p_comment_count"] = text_utils.to_int(item["p_comment_count"])
                item["p_month_sale_count"] = text_utils.to_int(item["p_month_sale_count"])
                item["p_sale_count"] = text_utils.to_int(item["p_sale_count"])
                item["p_standard_price"] = text_utils.to_string(item["p_standard_price"], "0")
                item["p_shop_price"] = text_utils.to_string(item["p_shop_price"], "0")

                # Compare by value: "is not" tests identity, which is unreliable for strings.
                item["p_pay_count"] = item["p_pay_count"] if item["p_pay_count"] != "-" else "0"
                return item
            else:
                raise DropItem("Item is None %s" % item)
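The pipeline above relies on text_utils.to_int and text_utils.to_string, which belong to the author's project and are not shown in the article. The following is only a guess at what such helpers might look like, written as a self-contained sketch. The signatures and the "first run of digits" rule are assumptions, not the original implementation.

```python
import re


def to_int(value, default=0):
    """Return the first run of digits in `value` as an int, else `default`."""
    if value is None:
        return default
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else default


def to_string(value, default=""):
    """Return `value` as a stripped string, or `default` for None or blank input."""
    if value is None:
        return default
    text = str(value).strip()
    return text if text else default


print(to_int("55"), to_int(None), to_int("-"))
print(to_string(" 99.00 ", "0"), to_string(None, "0"))
```

With helpers like these, a missing counter becomes 0 and a missing price becomes "0", which matches the filtering behavior described for the pipeline.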

Finally, register the pipeline in settings.py:

    ITEM_PIPELINES = {
        'TaoBao.pipelines.TTDataHandlerPipeline': 250,
        'TaoBao.pipelines.MysqlPipeline': 300,
    }

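The smaller the number after each entry, the earlier that pipeline runs. Here the data is filtered and cleaned first, and only once it is correct does TaoBao.pipelines.MysqlPipeline write it to the database.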
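To make the ordering rule concrete, the sketch below sorts the same dictionary by its priority values. It only illustrates the rule; Scrapy performs the ordering internally, and the variable name execution_order is introduced here for illustration.

```python
# Same entries as in settings.py, deliberately listed out of order.
ITEM_PIPELINES = {
    "TaoBao.pipelines.MysqlPipeline": 300,
    "TaoBao.pipelines.TTDataHandlerPipeline": 250,
}

# Sort by the priority value, lowest first.
execution_order = [
    path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(execution_order)
```

The position of an entry in the dictionary does not matter; only the number does.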

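Complete code: [version without a database] [version with a database].

Everything so far started the crawler with the command scrapy crawl tts. How can an IDE's debugger be used instead? Simply start the crawler from a main function: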

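    # Put this at the bottom of the spider module
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    if __name__ == "__main__":
        settings = get_project_settings()
        process = CrawlerProcess(settings)
        spider = TmallAndTaoBaoSpider
        process.crawl(spider)
        process.start()  # blocks until the crawl finishes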

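While fetching data you will often run into page redirects. If Scrapy hands back the 302 response instead of following it to the new address, check the redirect setting: with redirects enabled, Scrapy automatically follows a redirected URL to its final address and fetches the content there.

Solution: add REDIRECT_ENABLED = True to settings.py. This is already the default in current Scrapy releases, so it only matters if redirects were switched off.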

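Crawlers often need custom input. For example, the keyword 硬盘 (hard disk) was hard-coded earlier; how can it be passed in as a parameter instead?

Solution: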
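The original solution is missing here. Scrapy's usual mechanism is the -a option: running scrapy crawl tts -a keyword=硬盘 passes keyword="硬盘" to the spider's __init__ as a keyword argument. The sketch below shows that argument handling on a plain Python class so that it runs without Scrapy. The class name KeywordSpiderArgs, the SEARCH_URL endpoint, and the q parameter name are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumed search endpoint and query parameter name, for illustration only.
SEARCH_URL = "https://list.tmall.com/search_product.htm"


class KeywordSpiderArgs:
    """Plain-Python stand-in for the __init__ of a scrapy.Spider subclass."""

    def __init__(self, keyword="硬盘", **kwargs):
        # `scrapy crawl tts -a keyword=...` would arrive here as keyword="...".
        self.keyword = keyword
        # Build the start URL from the keyword instead of hard-coding it.
        self.start_urls = [SEARCH_URL + "?" + urlencode({"q": keyword})]


spider_args = KeywordSpiderArgs(keyword="ssd")
print(spider_args.start_urls)
```

In a real spider the same __init__ would sit on the scrapy.Spider subclass and call super().__init__(**kwargs) so that Scrapy's own setup still runs.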

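Most of the time we get the complete page. But when a page fires many ajax requests and the network is slow, selenium has no way of knowing when those requests have finished. If we then load the page with self.driver.get(response.url) and immediately extract data with Selector, the content may well not have finished loading and the data will be missing.

Solution: use the waiting tools that selenium provides to delay extraction until the data is present or a timeout expires.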
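Selenium ships this as WebDriverWait in selenium.webdriver.support.ui, used as WebDriverWait(driver, timeout).until(condition). To show the underlying idea without requiring a browser, here is a library-independent sketch of the same polling loop. The names wait_until, FakeDriver, and loaded_source are hypothetical, and FakeDriver merely imitates a page that fills in after a few polls.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


class FakeDriver:
    """Stand-in for a browser whose page fills in after a few polls."""

    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    @property
    def page_source(self):
        self.polls += 1
        if self.polls >= self.ready_after:
            return '<span class="tm-count">55</span>'
        return "<html></html>"


driver = FakeDriver(ready_after=3)


def loaded_source():
    """Return the page source once the sales counter is present, else None."""
    source = driver.page_source
    return source if "tm-count" in source else None


html = wait_until(loaded_source, timeout=2.0, poll_interval=0.01)
print(html)
```

With a real driver, the condition would check for the element that carries the data, and the Selector would be built from the page source only after the wait returns.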
