PHP: scraping web pages in a loop (the first step is to find out which cities there are; test the XPath with the XPath plugin)

Published by 优采云: 2021-10-24 20:11


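The start URL is reused later to build full links, so it is assigned to a variable named base_url.

The auto-generated spider already has a name. The name is needed when starting the spider; it is not used yet.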

name = 'area_spider'
allowed_domains = ['aqistudy.cn']  # domain to crawl; requests outside it are not followed
base_url = "https://www.aqistudy.cn/historydata/"
start_urls = [base_url]

City information

The home page lists a large number of cities, so the first step is to find out which cities there are.

def parse(self, response):
    print('Crawling city information...')
    url_list = response.xpath("//div[@class='all']/div[@class='bottom']/ul/div[2]/li/a/@href").extract()  # all links
    city_list = response.xpath("//div[@class='all']/div[@class='bottom']/ul/div[2]/li/a/text()").extract()  # city names
    for url, city in zip(url_list, city_list):
        url = self.base_url + url  # the hrefs are relative, so prepend the base URL
        yield scrapy.Request(url=url, callback=self.parse_month, meta={'city': city})

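The XPath Helper browser plugin can be used to test an XPath expression and check that it selects the right content.

[Image: xpath.png]

Click any region and the URL changes to that city's page (北京, Beijing, in this example).

So url_list holds the part that has to be joined to the base URL, in the form monthdata.php?city=<city name>.

The city_list expression ends in text(), so it returns the link text itself, that is, the city name.

Each url and city pair is passed to scrapy.Request. url is the address of the page to crawl; city is a value the item will need later, so it is stored in meta for now and carried along to the next callback, self.parse_month.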
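
To make the joining and pairing step concrete, here is a small standard-library sketch. The sample lists are made up and only follow the monthdata.php?city=<city name> pattern described above; urljoin is shown as one possible way to build the absolute URL and is not necessarily what the spider itself does.

```python
from urllib.parse import urljoin

base_url = "https://www.aqistudy.cn/historydata/"

# Hypothetical sample values shaped like the two XPath results described above.
url_list = ["monthdata.php?city=北京", "monthdata.php?city=上海"]
city_list = ["北京", "上海"]


def build_requests(base, urls, cities):
    """Pair each relative link with its city name and make the link absolute."""
    return [(urljoin(base, url), city) for url, city in zip(urls, cities)]


pairs = build_requests(base_url, url_list, city_list)
for full_url, city in pairs:
    print(city, full_url)
```

Because base_url ends with a slash, the relative link is appended to the historydata/ path rather than replacing it.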

Month information

def parse_month(self, response):
    print('Crawling months for {}...'.format(response.meta['city']))
    url_list = response.xpath('//tbody/tr/td/a/@href').extract()
    for url in url_list:
        url = self.base_url + url
        yield scrapy.Request(url=url, callback=self.parse_day, meta={'city': response.meta['city']})

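This step collects every month available for each city by extracting the URL of each month's page. The city received from the previous step is passed on again.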
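
How a value travels through the callback chain can be illustrated with a toy model. This is not Scrapy's engine: FakeRequest, FakeResponse, run, and the "&month=" link format are all hypothetical stand-ins that exist only to show the meta dict being handed from one callback to the next.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Stand-in for scrapy.Request: a URL, a callback, and a meta dict."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Stand-in for a response; it exposes the meta of the request behind it."""
    url: str
    meta: dict


def parse_month(response):
    # Hypothetical month links; in the spider they come from the page's table.
    for month in ("2021-01", "2021-02"):
        yield FakeRequest(url=response.url + "&month=" + month,
                          callback=parse_day,
                          meta={"city": response.meta["city"]})


def parse_day(response):
    # The city set two steps earlier is still available here.
    yield {"city": response.meta["city"], "url": response.url}


def run(start_requests):
    """Follow every request and collect the dicts the callbacks yield."""
    results, pending = [], list(start_requests)
    while pending:
        request = pending.pop()
        response = FakeResponse(url=request.url, meta=request.meta)
        for output in request.callback(response):
            if isinstance(output, FakeRequest):
                pending.append(output)
            else:
                results.append(output)
    return results


items = run([FakeRequest("monthdata.php?city=北京", parse_month, {"city": "北京"})])
```

Each callback only sees what its own request carried in meta, which is why the city has to be copied forward at every step.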

Final data

Once the final URL is reached, instantiate the item, fill in its fields, and yield it.

def parse_day(self, response):
    print('Crawling final data...')
    node_list = response.xpath('//tr')[1:]  # skip the header row
    for node in node_list:
        item = AirHistoryItem()  # a new item per row, so yielded items do not share state
        item['data'] = node.xpath('./td[1]/text()').extract_first()
        item['city'] = response.meta['city']
        item['aqi'] = node.xpath('./td[2]/text()').extract_first()
        item['level'] = node.xpath('./td[3]/text()').extract_first()
        item['pm2_5'] = node.xpath('./td[4]/text()').extract_first()
        item['pm10'] = node.xpath('./td[5]/text()').extract_first()
        item['so2'] = node.xpath('./td[6]/text()').extract_first()
        item['co'] = node.xpath('./td[7]/text()').extract_first()
        item['no2'] = node.xpath('./td[8]/text()').extract_first()
        item['o3'] = node.xpath('./td[9]/text()').extract_first()
        yield item
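
The column-to-field mapping can be tried without Scrapy. The sketch below uses the standard library's ElementTree on a made-up table; the header labels and values are assumptions for illustration, and only the field names and their column order come from the callback above.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: one header row, then one row per day. Values are made up.
SAMPLE = """
<table>
  <tr><th>date</th><th>AQI</th><th>level</th><th>PM2.5</th><th>PM10</th><th>SO2</th><th>CO</th><th>NO2</th><th>O3</th></tr>
  <tr><td>2021-01-01</td><td>50</td><td>A</td><td>30</td><td>60</td><td>5</td><td>0.8</td><td>40</td><td>20</td></tr>
  <tr><td>2021-01-02</td><td>80</td><td>B</td><td>55</td><td>90</td><td>7</td><td>1.0</td><td>45</td><td>25</td></tr>
</table>
"""

# Field names in column order, matching td[1] .. td[9] in the callback.
FIELDS = ["data", "aqi", "level", "pm2_5", "pm10", "so2", "co", "no2", "o3"]


def rows_to_items(table_xml, city):
    """Turn every data row into its own dict, skipping the header row."""
    rows = ET.fromstring(table_xml.strip()).findall(".//tr")[1:]
    items = []
    for row in rows:
        cells = [td.text for td in row.findall("td")]
        item = dict(zip(FIELDS, cells))  # a new dict for every row
        item["city"] = city
        items.append(item)
    return items


items = rows_to_items(SAMPLE, "北京")
```

Building a new dict (or item) inside the loop matters: reusing one object across rows would leave every collected result pointing at the same, last-written values.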

Using a middleware to drive Selenium

Open the middleware file, middlewares.py.

Because the crawl runs on a server, Chrome is used in headless mode (chrome-headless).

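from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--headless')  # run Chrome in headless mode
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
# Path to the chromedriver binary. Selenium 4 takes it through Service and uses options=;
# Selenium 3 used executable_path= and chrome_options= instead.
# Inside the middleware the driver is kept as self.driver.
self.driver = webdriver.Chrome(service=Service('/root/zx/spider/driver/chromedriver'), options=chrome_options)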

Next, get the page source after rendering.

request.url is the URL passed to the middleware. The home page is static, so Selenium is not used for it.

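# Runs inside the downloader middleware's process_request(self, request, spider);
# needs `import time` and `import scrapy` at the top of middlewares.py.
if request.url != 'https://www.aqistudy.cn/historydata/':
    self.driver.get(request.url)
    time.sleep(1)  # give the page's JavaScript time to render
    html = self.driver.page_source
    self.driver.quit()  # the driver is closed after every page, so each request needs a new one
    return scrapy.http.HtmlResponse(url=request.url, body=html.encode('utf-8'), encoding='utf-8', request=request)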

The rest is simple: the rendered page is encoded correctly and returned, so the spider's next callback receives it.

Full middleware code
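
A minimal sketch of how the fragments above might fit together in one class. The class name AreaSpiderMiddleware is the one registered in settings.py; everything else is an assumption: the HOME_URL and CHROMEDRIVER_PATH constants are introduced here for readability, the driver is created once per request, the Selenium 4 Service/options call style is used, and the Scrapy and Selenium imports are deferred into the method so the static-page branch can be exercised without a browser.

```python
import time

HOME_URL = 'https://www.aqistudy.cn/historydata/'
CHROMEDRIVER_PATH = '/root/zx/spider/driver/chromedriver'


class AreaSpiderMiddleware:
    """Downloader middleware that renders every page except the home page in headless Chrome."""

    def process_request(self, request, spider):
        if request.url == HOME_URL:
            # Static page: returning None lets Scrapy download it the normal way.
            return None

        # Deferred imports (an assumption of this sketch, see the lead-in).
        from scrapy.http import HtmlResponse
        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options
        from selenium.webdriver.chrome.service import Service

        options = Options()
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
        options.add_argument('--no-sandbox')

        driver = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH), options=options)
        try:
            driver.get(request.url)
            time.sleep(1)  # fixed wait for the page's JavaScript to finish
            html = driver.page_source
        finally:
            driver.quit()  # always release the browser, even if loading fails

        # Returning a response here skips Scrapy's own download of this URL.
        return HtmlResponse(url=request.url, body=html.encode('utf-8'),
                            encoding='utf-8', request=request)
```

Starting a browser for every request is slow; keeping one driver for the whole crawl is a possible alternative, at the cost of having to close it when the spider finishes.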

Saving items with a pipeline

Modify pipelines.py to store the items in a file.

import json


class AirHistoryPipeline(object):
    def open_spider(self, spider):
        # utf-8 so the city names written with ensure_ascii=False are stored safely
        self.file = open('area.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line
        context = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(context)
        return item

    def close_spider(self, spider):
        self.file.close()
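
Since the pipeline writes one JSON object per line, the resulting file is not a single JSON document and has to be read back line by line. A small round-trip sketch, using made-up records and a temporary file rather than the real area.json:

```python
import json
import os
import tempfile


def write_items(path, items):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_items(path):
    """Parse the file line by line; json.load on the whole file would fail."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records, only to exercise the round trip.
sample = [{"city": "北京", "aqi": "50"}, {"city": "上海", "aqi": "80"}]
path = os.path.join(tempfile.mkdtemp(), "area.json")
write_items(path, sample)
loaded = read_items(path)
```

With ensure_ascii=False the city names appear in the file as written, not as \uXXXX escapes, which is why the file is opened with an explicit utf-8 encoding.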

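Modify the settings file to enable the middleware and the pipeline

Open settings.py.

Change the following: DOWNLOADER_MIDDLEWARES points to the middleware class written above, and ITEM_PIPELINES points to the class in pipelines.py.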

BOT_NAME = 'air_history'

SPIDER_MODULES = ['air_history.spiders']
NEWSPIDER_MODULE = 'air_history.spiders'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'

DOWNLOADER_MIDDLEWARES = {
    'air_history.middlewares.AreaSpiderMiddleware': 543,
}

ITEM_PIPELINES = {
    'air_history.pipelines.AirHistoryPipeline': 300,
}

Running

Run the spider with scrapy crawl area_spider.

[Image: 结果.png]

Full spider code

# -*- coding: utf-8 -*-
import scrapy

from air_history.items import AirHistoryItem


class AreaSpiderSpider(scrapy.Spider):
    name = 'area_spider'
    allowed_domains = ['aqistudy.cn']  # domain to crawl; requests outside it are not followed
    base_url = "https://www.aqistudy.cn/historydata/"
    start_urls = [base_url]

    def parse(self, response):
        print('Crawling city information...')
        url_list = response.xpath("//div[@class='all']/div[@class='bottom']/ul/div[2]/li/a/@href").extract()  # all links
        city_list = response.xpath("//div[@class='all']/div[@class='bottom']/ul/div[2]/li/a/text()").extract()  # city names
        for url, city in zip(url_list, city_list):
            url = self.base_url + url
            yield scrapy.Request(url=url, callback=self.parse_month, meta={'city': city})

    def parse_month(self, response):
        print('Crawling months for {}...'.format(response.meta['city']))
        url_list = response.xpath('//tbody/tr/td/a/@href').extract()
        for url in url_list:
            url = self.base_url + url
            yield scrapy.Request(url=url, callback=self.parse_day, meta={'city': response.meta['city']})

    def parse_day(self, response):
        print('Crawling final data...')
        node_list = response.xpath('//tr')[1:]  # skip the header row
        for node in node_list:
            item = AirHistoryItem()  # a new item per row, so yielded items do not share state
            item['data'] = node.xpath('./td[1]/text()').extract_first()
            item['city'] = response.meta['city']
            item['aqi'] = node.xpath('./td[2]/text()').extract_first()
            item['level'] = node.xpath('./td[3]/text()').extract_first()
            item['pm2_5'] = node.xpath('./td[4]/text()').extract_first()
            item['pm10'] = node.xpath('./td[5]/text()').extract_first()
            item['so2'] = node.xpath('./td[6]/text()').extract_first()
            item['co'] = node.xpath('./td[7]/text()').extract_first()
            item['no2'] = node.xpath('./td[8]/text()').extract_first()
            item['o3'] = node.xpath('./td[9]/text()').extract_first()
            yield item
