Scraping js/ajax dynamically loaded web pages and extracting page information with Scrapy + Spynner
There are several kinds of web pages you may want to collect:
1. Static pages
2. Dynamic pages (pages whose data is loaded dynamically via js/ajax)
3. Pages that require a simulated login before collection
4. Encrypted pages
Solutions and ideas for 3 and 4 will be covered in later blog posts; for now, here are solutions and ideas for 1 and 2 only:
I. Static pages
There are many, many ways to collect static pages! Both Java and Python provide plenty of toolkits and frameworks: Java has HttpClient, HtmlUnit, Jsoup, HtmlParser and so on, while Python has urllib, urllib2, BeautifulSoup, Scrapy and so on. I won't go into detail here; there is plenty of material online.
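As a quick illustration, a minimal static-page fetch might look like this (a sketch in the Python 2 style used by the rest of this post; the URL is just a placeholder):

# minimal sketch: fetch a static page and pull out its title
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://example.com').read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)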
II. Dynamic pages
For collection purposes, dynamic pages are those whose data must be loaded dynamically via js and ajax. There are two ways to collect such data:
1. Use a packet-capture tool to analyze the js/ajax requests, then simulate those requests to fetch the loaded data directly (a sketch follows below).
2. Drive a browser engine to obtain the rendered page source, then parse that source.
Anyone who studies crawlers will have some familiarity with js, and there is plenty of learning material online, so I won't list it all; I describe approach 1 here only for the sake of completeness.
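To sketch approach 1: once the packet-capture tool reveals the ajax endpoint behind a page, you can often call that endpoint directly and skip the browser entirely. The endpoint URL and the 'items' key below are hypothetical, just to show the shape of the technique:

# approach 1, sketched: call the ajax endpoint found via packet capture
# (the URL and the 'items' key are hypothetical examples)
import json
import urllib2

req = urllib2.Request('http://example.com/ajax/list?page=1',
                      headers={'X-Requested-With': 'XMLHttpRequest'})
data = json.loads(urllib2.urlopen(req).read())
for entry in data.get('items', []):
    print(entry)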
There are also several toolkits for driving a browser engine, but they are not today's focus. Today's focus is the topic in the title: using the Scrapy framework together with Spynner to collect pages that need js/ajax dynamic loading, and extracting the page information (using a WeChat public account article list as the example).
Let's begin...
1. Create the WeChat article-list collection project (hereafter simply "weixin"):
scrapy startproject weixin
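If the command succeeds, scrapy generates its standard project skeleton; items.py, settings.py and the spiders directory are the pieces used in the steps below:

weixin/
    scrapy.cfg
    weixin/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py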
2. Create a collection spider file under the spiders directory:
vim weixinlist.py
and write the following code into it:
import sys
sys.path.insert(0, '..')

import scrapy
from scrapy import Spider

from weixin.items import WeixinItem

class MySpider(Spider):
    name = 'weixinlist'
    allowed_domains = []
    start_urls = [
        'http://weixin.sogou.com/gzh?openid=oIWsFt5QBSP8mn4Jx2WSGw_rCNzQ',
    ]
    download_delay = 1
    print('start init....')

    def parse(self, response):
        sel = scrapy.Selector(response)
        print('hello,world!')
        print(response)
        print(sel)
        # each article title sits in an <h4> under div.txt-box
        article_list = sel.xpath('//div[@class="txt-box"]/h4')
        items = []
        for single in article_list:
            data = WeixinItem()
            title = single.xpath('a/text()').extract()
            link = single.xpath('a/@href').extract()
            data['title'] = title
            data['link'] = link
            if len(title) > 0:
                print(title[0].encode('utf-8'))
                print(link)
            items.append(data)  # collect the item so Scrapy can process it
        return items
3. Add the WeixinItem class to items.py.
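The class itself is not shown in this post, but given the fields the spider fills in (title and link), a minimal version would be:

# minimal WeixinItem matching the fields the spider assigns
import scrapy

class WeixinItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()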
4. In the same directory as items.py, create a downloader middleware file downloadwebkit.py and write the following code into it:
import spynner
import pyquery
from scrapy.http import HtmlResponse

class WebkitDownloaderTest(object):
    def process_request(self, request, spider):
        # if spider.name in settings.WEBKIT_DOWNLOADER:
        #     if type(request) is not FormRequest:
        browser = spynner.Browser()
        browser.create_webview()
        browser.set_html_parser(pyquery.PyQuery)
        browser.load(request.url, 20)   # load with a 20s timeout
        try:
            browser.wait_load(10)       # wait up to 10s for js/ajax to finish
        except Exception:
            pass
        string = browser.html
        string = string.encode('utf-8')
        renderedBody = str(string)
        # hand the rendered source back to Scrapy as the response body
        return HtmlResponse(request.url, body=renderedBody)
This code drives the browser engine, waits for the page to finish loading, and then grabs the rendered source.
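Note that this middleware starts a fresh spynner.Browser for every single request, which is slow; for anything beyond a demo you would probably want to reuse one browser instance across requests. Also, spynner drives a real webkit view, which is why step 5 below sets the DISPLAY environment variable.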
5. Configure and declare the downloader middleware in settings.py.
Add the following code at the bottom:
# which spiders should use the webkit downloader
WEBKIT_DOWNLOADER=['weixinlist']
DOWNLOADER_MIDDLEWARES = {
'weixin.downloadwebkit.WebkitDownloaderTest': 543,
}
import os
os.environ["DISPLAY"] = ":0"
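The commented-out check in step 4 hints at how WEBKIT_DOWNLOADER is meant to be used: only the spiders listed there should go through the webkit download. A minimal sketch of wiring that up with Scrapy's from_crawler hook (my own wiring, not shown in the original code):

# sketch: gate the webkit download on the WEBKIT_DOWNLOADER setting
import spynner
import pyquery
from scrapy.http import HtmlResponse

class WebkitDownloaderTest(object):
    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        mw.webkit_spiders = crawler.settings.getlist('WEBKIT_DOWNLOADER')
        return mw

    def process_request(self, request, spider):
        if spider.name not in self.webkit_spiders:
            return None  # not listed: let the default downloader handle it
        browser = spynner.Browser()
        browser.create_webview()
        browser.set_html_parser(pyquery.PyQuery)
        browser.load(request.url, 20)
        try:
            browser.wait_load(10)
        except Exception:
            pass
        return HtmlResponse(request.url, body=browser.html.encode('utf-8'))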
6. Run the program:
Run the command:
scrapy crawl weixinlist
Output:
kevinflynndeMacBook-Pro:spiders kevinflynn$ scrapy crawl weixinlist
start init....
2015-07-28 21:13:55 [scrapy] INFO: Scrapy 1.0.1 started (bot: weixin)
2015-07-28 21:13:55 [scrapy] INFO: Optional features available: ssl, http11
2015-07-28 21:13:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'weixin.spiders', 'SPIDER_MODULES': ['weixin.spiders'], 'BOT_NAME': 'weixin'}
2015-07-28 21:13:55 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
2015-07-28 21:13:55 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-28 21:13:55 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, WebkitDownloaderTest, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-28 21:13:55 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-28 21:13:55 [scrapy] INFO: Enabled item pipelines:
2015-07-28 21:13:55 [scrapy] INFO: Spider opened
2015-07-28 21:13:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-28 21:13:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
QFont::setPixelSize: Pixel size <= 0