Dynamic Web Page Scraping (Selenium and GeckoDriver installation with practice examples)
The raw result obtained above looks confusing. To extract the data we want from this JSON, we need the json library to parse it.
# coding: utf-8
import requests
import json

# Real API address of the comments, found via the browser's Network panel
url = "https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=0&status=open"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
r = requests.get(url, headers=headers)

# Parse the JSON body; each comment lives in the 'data' list
json_data = json.loads(r.text)
comments_list = json_data['data']
for eachone in comments_list:
    message = eachone['content']
    print(message)
The code above only scrapes a single page. To scrape more content, we need to find the pattern in the URL.
Real URL of the first page: https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=0&status=open
Real URL of the second page: https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=20&status=open
Comparing the two URLs, we find two particularly important parameters: offset and limit. limit stays at 20 (the number of comments per page), while offset grows by 20 from one page to the next.
# coding: utf-8
import requests
import json

def single_page(url):
    # Fetch one page of comments and print each comment's content
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    json_data = json.loads(r.text)
    comments_list = json_data['data']
    for eachone in comments_list:
        message = eachone['content']
        print(message)

# Loop over the first two pages; offset advances by 20 (the page size) each time
for page in range(0, 2):
    link1 = "https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset="
    link2 = "&status=open"
    page_str = str(page * 20)
    link = link1 + page_str + link2
    single_page(link)
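As a side note, requests can also assemble the query string itself through its params argument, which avoids hand-encoding characters such as %5B and %2C. Below is a small sketch of the same two-page loop in that style (the shortened User-Agent is a placeholder; reuse the full one from above):
# Same pagination, letting requests URL-encode the query string:
# "[" becomes %5B and "," becomes %2C automatically.
base = "https://www.zhihu.com/api/v4/answers/270916221/comments"
include = ("data[*].author,collapsed,reply_to_author,disliked,content,"
           "voting,vote_count,is_parent_author,is_author,algorithm_right")
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; use a full User-Agent string
for page in range(0, 2):
    params = {"include": include, "order": "normal",
              "limit": 20, "offset": page * 20, "status": "open"}
    r = requests.get(base, params=params, headers=headers)
    for comment in r.json()['data']:
        print(comment['content'])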
3. Simulating a browser with Selenium
For some complex websites, the methods above no longer work. Moreover, the real URLs behind some data are long and convoluted, and some sites deliberately obfuscate these addresses, making the query variables hard to work out.
Therefore, we turn to Selenium, which drives a real browser rendering engine: the browser displays the page directly and handles parsing the HTML, JavaScript, and CSS for us.
(1) Installing Selenium and a brief introduction
Reference post: how to install GeckoDriver, the Firefox browser driver.
# coding: utf-8
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False

# Windows: geckodriver must be installed; point FirefoxBinary at your firefox.exe
binary = FirefoxBinary(r'D:\Program Files (x86)\Mozilla Firefox\firefox.exe')
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("https://www.baidu.com")
The following example uses the Chrome browser instead:
# coding: utf-8
from selenium import webdriver

# Windows: pass the path to chromedriver.exe (a raw string keeps the backslashes intact)
driver = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get("https://www.baidu.com")
(2) A Selenium practice example
Now let's use browser rendering to scrape the comment data from before.
First, locate the HTML tag of the first comment in the browser's Inspect panel, and try to fetch it.
The code is as follows:
# coding: utf-8
from selenium import webdriver

driver = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get("https://www.zhihu.com/question/22913650")
# The CSS selector was left blank in the original; fill in the selector of the
# first comment as found via the Inspect panel.
comment = driver.find_element_by_css_selector('')
print(comment.text)
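On a dynamically rendered page the comment node may not exist yet the moment the page loads, so a direct find_element call can fail. Here is a minimal sketch with an explicit wait; the selector 'div.comment-content' is an assumed placeholder, not the page's real selector:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds for the comment node to be rendered before reading it.
# 'div.comment-content' is a placeholder selector; replace it with the real one.
wait = WebDriverWait(driver, 10)
comment = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.comment-content')))
print(comment.text)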
(3) Fetching all of an article's comments with Selenium
To fetch every comment, the script must be able to click controls such as "load more", "all comments", and "next page" automatically.
The code is as follows:
# coding: utf-8
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
# Change the path above to wherever Firefox is installed on your machine
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
time.sleep(5)  # give the page and the comment iframe time to load

# The comments live inside the LiveRe iframe; switch into it first
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    content = eachcomment.find_element_by_tag_name('p')
    print(content.text)
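The snippet above only reads the comments already rendered. To trigger more, locate the paging control and click it before collecting again. A sketch of that step, reusing the time import from above; the selector 'button.more-btn' is a hypothetical placeholder, so find the real one via Inspect:
# Hypothetical "load more" button selector; replace with the real one.
load_more = driver.find_element_by_css_selector('button.more-btn')
load_more.click()
time.sleep(2)  # give the newly loaded comments time to render
comments = driver.find_elements_by_css_selector('div.reply-content')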
Selenium's methods for selecting elements:
# by CSS selector
find_element_by_css_selector('div.body_inner')
# by XPath
find_element_by_xpath("//form[@id='loginForm']")
# by id attribute
find_element_by_id('loginForm')
# by name attribute
find_element_by_name('myname')
# by exact link text
find_element_by_link_text('text')
# by partial link text
find_element_by_partial_link_text('te')
# by tag name
find_element_by_tag_name('div')
# by class name
find_element_by_class_name('body_inner')
To find multiple elements, change "element" to "elements" in any of the methods above (e.g. find_elements_by_css_selector), which returns a list of matches. The first two are the most commonly used.
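For instance, reusing the selector from the comment example above, a plural call collects every matching node:
# find_elements_* returns a list of all matches (an empty list if none match)
comments = driver.find_elements_by_css_selector('div.reply-content')
for comment in comments:
    print(comment.text)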
(4) Advanced Selenium operations
To speed up Selenium scraping, the following techniques are commonly used:
(1) Limit CSS loading
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.stylesheet", 2)  # 2 = block stylesheets
(2) Limit image loading
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.image", 2)  # 2 = block images
(3) Disable JavaScript execution
fp = webdriver.FirefoxProfile()
fp.set_preference("javascript.enabled", False)  # disable JavaScript
For the Chrome browser:
options = webdriver.ChromeOptions()
prefs = {
    'profile.default_content_setting_values': {
        'images': 2,      # 2 = block images
        'javascript': 2   # 2 = block JavaScript
    }
}
options.add_experimental_option('prefs', prefs)
browser = webdriver.Chrome(chrome_options=options)
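Note that in newer Selenium releases the chrome_options keyword is deprecated in favor of options:
# Newer Selenium versions expect the options keyword instead:
browser = webdriver.Chrome(options=options)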
4. Selenium scraping in practice: Shenzhen short-term rental data
(1) Site analysis
(2) Project implementation