Scraping Baidu search results by keyword (Crawling Baidu search results with Python and saving them, Tencent Cloud Community)

Published by 优采云: 2021-10-02 07:21

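Learned from: "How to scrape Baidu search results with Python and save them" (云+社区, Tencent Cloud)

Also: "How to simulate a Baidu search with Python", Python discussion, technical *敏*感*词*, FishC forum (鱼C论坛)

Goal: given a keyword, run a Baidu search, save the search results, and record the title and link of each result.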

Approach:

First page: * (replace * with the keyword)

Other pages: *&pn=n (the actual page number is n/10+1)
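
The page arithmetic above can be sketched as a small helper. This is only an illustration: the function name `build_search_url` is hypothetical, and the base URL is the `http://www.baidu.com/s?wd=` prefix that the spider code later in this article uses. It assumes 10 results per page, which is what the `n/10+1` rule implies.

```python
from urllib.parse import quote

# Same prefix the spider in this article builds its start URL from.
BASE_URL = "http://www.baidu.com/s?wd="


def build_search_url(keyword, page=1):
    """Return the search URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    # Percent-encode the keyword so spaces and non-ASCII text are URL-safe.
    url = BASE_URL + quote(keyword)
    # Inverse of "page = n/10 + 1": the pn offset is (page - 1) * 10.
    offset = (page - 1) * 10
    if offset:
        url += "&pn=" + str(offset)
    return url


if __name__ == "__main__":
    print(build_search_url("python"))
    print(build_search_url("python", page=3))
```

Page 1 carries no `pn` parameter, matching the "first page" form above; later pages append `&pn=n`.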

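1. Build the Baidu search URL from the keyword.

2. Have the crawler fetch that URL.

3. Work out the XPath of each search result and record each result's title and URL.

4. Each search result matches the XPath //*[@class="t"]/a. Its title is the text content of that element and its link is the element's href attribute.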

# XPath for each search result
//*[@class="t"]/a

# XPath for a result's title
.  # just a single dot

# XPath for a result's link
./@href
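
To see what these three expressions select, here is a minimal sketch that runs outside Scrapy. It swaps in the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, so the path is written relative to the root as `.//*[@class='t']/a`. The sample markup and the name `extract_results` are made up for illustration and are not real Baidu output.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed fragment that imitates the structure described above.
SAMPLE = """
<div>
  <h3 class="t"><a href="http://example.com/1">Learn <em>Python</em> fast</a></h3>
  <h3 class="t"><a href="http://example.com/2"><em>Python</em> docs</a></h3>
  <h3 class="other"><a href="http://example.com/3">ignored</a></h3>
</div>
"""


def extract_results(markup):
    """Return (title, link) pairs for every <a> directly under a class="t" element."""
    root = ET.fromstring(markup.strip())
    results = []
    for a in root.findall(".//*[@class='t']/a"):
        # itertext() walks nested nodes, so text inside child tags is included.
        title = "".join(a.itertext()).strip()
        results.append((title, a.get("href")))
    return results


if __name__ == "__main__":
    for title, link in extract_results(SAMPLE):
        print(title, link)
```

Real search pages are rarely well-formed XML, which is one reason to use Scrapy's own selectors in the spider itself.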

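5. After extracting the title, filter it with a regular expression. The page source wraps parts of the title in <em> and </em> tags, and these have to be removed. For that reason the XPath text() function cannot be used directly here: on the link element it returns only the text nodes that sit directly under <a>, so any words inside <em> would be left out. Instead, use extract() to get the element's source markup, then use a regular expression to pull out the part you need.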

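eles = response.xpath('//*[@class="t"]/a')  # select every search result
for ele in eles:
    item = BdItem()  # a fresh item for each result
    name = ele.xpath('.').extract()  # source markup of the title element; extract() returns a list
    name = ''.join(name).strip()  # join the list entries into one string
    name = name.replace('<em>', '').replace('</em>', '')  # remove the <em> and </em> tags
    re_bd = re.compile(r'>(.*)</a>')  # capture the text between the opening tag and </a>
    item['name'] = re_bd.search(name).group(1)  # keep the captured title as a string
    item['link'] = ele.xpath('./@href').extract()[0]  # take the link directly
    yield item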
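
The title clean-up in step 5 can be tried on its own, without Scrapy. The sketch below assumes the tags to strip are `<em>` and `</em>`; the sample string and the name `clean_title` are invented for illustration. Note that `match.group(1)` returns the captured string, whereas `match.groups()` returns a tuple of all captures.

```python
import re

# Same pattern as in the article: everything between the first '>' and the last '</a>'.
TITLE_RE = re.compile(r'>(.*)</a>')


def clean_title(anchor_html):
    """Strip <em> tags from an <a> element's markup and return its text."""
    text = anchor_html.replace('<em>', '').replace('</em>', '')
    match = TITLE_RE.search(text)
    # Return an empty string instead of raising when the markup does not match.
    return match.group(1).strip() if match else ''


if __name__ == "__main__":
    sample = '<a href="http://example.com" target="_blank"><em>Python</em> tutorial</a>'
    print(clean_title(sample))
```

Guarding against a failed match matters in practice, because calling `.group(1)` on `None` raises `AttributeError` and stops the parse of that page.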

6. The complete code is as follows:

import re

import scrapy
from scrapy import Request

from BD.items import BdItem


class BdsSpider(scrapy.Spider):
    name = 'BDS'
    allowed_domains = ['www.baidu.com']
    key = input('Enter keyword: ')  # prompts once, when the spider class is loaded
    url = 'http://www.baidu.com/s?wd=' + key
    start_urls = [url]

    def parse(self, response):
        eles = response.xpath('//*[@class="t"]/a')
        for ele in eles:
            item = BdItem()  # one item per search result
            name = ele.xpath('.').extract()
            name = ''.join(name).strip()
            name = name.replace('<em>', '').replace('</em>', '')
            re_bd = re.compile(r'>(.*)</a>')
            item['name'] = re_bd.search(name).group(1)
            item['link'] = ele.xpath('./@href').extract()[0]
            yield item
        # Request the second page (pn=10). Scrapy's duplicate filter drops the
        # repeated request when this method runs again for that page.
        next_url = self.url + '&pn=10'
        yield Request(url=next_url)
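
The spider above only ever requests the page at `pn=10`. If more pages are wanted, one possible approach is to derive each next URL from the current one and stop at a limit. The following is a sketch under assumptions: `next_page_url` and `max_pages` are hypothetical names, and it relies on the `pn` step of 10 described earlier.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url, max_pages=3):
    """Return the URL of the following results page, or None past max_pages."""
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)
    # A missing pn parameter means the first page (offset 0).
    pn = int(query.get("pn", ["0"])[0])
    next_pn = pn + 10
    # Page number is pn/10 + 1, as described in the approach section.
    if next_pn // 10 + 1 > max_pages:
        return None
    query["pn"] = [str(next_pn)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


if __name__ == "__main__":
    url = "http://www.baidu.com/s?wd=python"
    while url:
        print(url)
        url = next_page_url(url)
```

Inside a spider, the returned value could be passed to `Request` whenever it is not `None`, which gives the crawl a clear stopping point.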

7. Run it:

scrapy crawl BDS -O baidu.csv

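Other notes

Set a User-Agent in the project settings (settings.py) so that Baidu does not identify the crawler as a bot and reject the request.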
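
As a minimal sketch of that setting: `USER_AGENT` is the Scrapy setting name, while the browser string below is only an example value and can be replaced with any current browser's User-Agent.

```python
# settings.py (fragment)
# Example browser User-Agent string; the exact value is an assumption, not from the article.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/94.0.4606.61 Safari/537.36"
)
```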
