Dynamic Web Page Scraping (Selenium and GeckoDriver installation with practice examples)
The raw result obtained above looks confusing. To extract the data we want from this JSON, we need the json library to parse it.
# coding: utf-8
import requests
import json

# Real API address of the comments, found via the browser's Network panel
url = "https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=0&status=open"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
r = requests.get(url, headers=headers)

# Parse the JSON body; each comment lives in the 'data' list
json_data = json.loads(r.text)
comments_list = json_data['data']
for eachone in comments_list:
    message = eachone['content']
    print(message)
The code above only scrapes a single page. To scrape more content, we need to find the pattern in the URL.
Real URL of the first page: https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=0&status=open
Real URL of the second page: https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=20&status=open
Comparing the two URLs, we find two particularly important parameters: offset and limit. limit stays at 20 (the number of comments per page), while offset grows by 20 from one page to the next.
# coding: utf-8
import requests
import json

def single_page(url):
    # Fetch one page of comments and print each comment's content
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    json_data = json.loads(r.text)
    comments_list = json_data['data']
    for eachone in comments_list:
        message = eachone['content']
        print(message)

# Loop over the first two pages; offset advances by 20 (the page size) each time
for page in range(0, 2):
    link1 = "https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset="
    link2 = "&status=open"
    page_str = str(page * 20)
    link = link1 + page_str + link2
    single_page(link)
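As a side note, requests can also assemble the query string itself through its params argument, which avoids hand-encoding characters such as %5B and %2C. Below is a small sketch of the same two-page loop in that style (the shortened User-Agent is a placeholder; reuse the full one from above):
# Same pagination, letting requests URL-encode the query string:
# "[" becomes %5B and "," becomes %2C automatically.
base = "https://www.zhihu.com/api/v4/answers/270916221/comments"
include = ("data[*].author,collapsed,reply_to_author,disliked,content,"
           "voting,vote_count,is_parent_author,is_author,algorithm_right")
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; use a full User-Agent string
for page in range(0, 2):
    params = {"include": include, "order": "normal",
              "limit": 20, "offset": page * 20, "status": "open"}
    r = requests.get(base, params=params, headers=headers)
    for comment in r.json()['data']:
        print(comment['content'])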
3. Simulating a browser with Selenium
For some complex websites, the methods above no longer work. Moreover, the real URLs behind some data are long and convoluted, and some sites deliberately obfuscate these addresses, making the query variables hard to work out.
Therefore, we turn to Selenium, which drives a real browser rendering engine: the browser displays the page directly and handles parsing the HTML, JavaScript, and CSS for us.
(1) Installing Selenium and a brief introduction
Reference post: how to install GeckoDriver, the Firefox browser driver.
# coding: utf-8
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False

# Windows: geckodriver must be installed; point FirefoxBinary at your firefox.exe
binary = FirefoxBinary(r'D:\Program Files (x86)\Mozilla Firefox\firefox.exe')
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("https://www.baidu.com")
The following example uses the Chrome browser instead:
# coding: utf-8
from selenium import webdriver

# Windows: pass the path to chromedriver.exe (a raw string keeps the backslashes intact)
driver = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get("https://www.baidu.com")
(2) A Selenium practice example
Now let's use browser rendering to scrape the comment data from before.
First, locate the HTML tag of the first comment in the browser's Inspect panel, and try to fetch it.
The code is as follows:
# coding: utf-8
from selenium import webdriver

driver = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get("https://www.zhihu.com/question/22913650")
# The CSS selector was left blank in the original; fill in the selector of the
# first comment as found via the Inspect panel.
comment = driver.find_element_by_css_selector('')
print(comment.text)
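On a dynamically rendered page the comment node may not exist yet the moment the page loads, so a direct find_element call can fail. Here is a minimal sketch with an explicit wait; the selector 'div.comment-content' is an assumed placeholder, not the page's real selector:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds for the comment node to be rendered before reading it.
# 'div.comment-content' is a placeholder selector; replace it with the real one.
wait = WebDriverWait(driver, 10)
comment = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.comment-content')))
print(comment.text)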
(3) Fetching all of an article's comments with Selenium
To fetch every comment, the script must be able to click controls such as "load more", "all comments", and "next page" automatically.
The code is as follows:
# coding: utf-8
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
# Change the path above to wherever Firefox is installed on your machine
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
time.sleep(5)  # give the page and the comment iframe time to load

# The comments live inside the LiveRe iframe; switch into it first
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    content = eachcomment.find_element_by_tag_name('p')
    print(content.text)
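The snippet above only reads the comments already rendered. To trigger more, locate the paging control and click it before collecting again. A sketch of that step, reusing the time import from above; the selector 'button.more-btn' is a hypothetical placeholder, so find the real one via Inspect:
# Hypothetical "load more" button selector; replace with the real one.
load_more = driver.find_element_by_css_selector('button.more-btn')
load_more.click()
time.sleep(2)  # give the newly loaded comments time to render
comments = driver.find_elements_by_css_selector('div.reply-content')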
Selenium's methods for selecting elements:
# by CSS selector
find_element_by_css_selector('div.body_inner')
# by XPath
find_element_by_xpath("//form[@id='loginForm']")
# by id attribute
find_element_by_id('loginForm')
# by name attribute
find_element_by_name('myname')
# by exact link text
find_element_by_link_text('text')
# by partial link text
find_element_by_partial_link_text('te')
# by tag name
find_element_by_tag_name('div')
# by class name
find_element_by_class_name('body_inner')
To find multiple elements, change "element" to "elements" in any of the methods above (e.g. find_elements_by_css_selector), which returns a list of matches. The first two are the most commonly used.
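For instance, reusing the selector from the comment example above, a plural call collects every matching node:
# find_elements_* returns a list of all matches (an empty list if none match)
comments = driver.find_elements_by_css_selector('div.reply-content')
for comment in comments:
    print(comment.text)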
(4) Advanced Selenium operations
To speed up Selenium scraping, the following techniques are commonly used:
(1) Limit CSS loading
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.stylesheet", 2)  # 2 = block stylesheets
(2) Limit image loading
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.image", 2)  # 2 = block images
(3) Disable JavaScript execution
fp = webdriver.FirefoxProfile()
fp.set_preference("javascript.enabled", False)  # disable JavaScript
For the Chrome browser:
options = webdriver.ChromeOptions()
prefs = {
    'profile.default_content_setting_values': {
        'images': 2,      # 2 = block images
        'javascript': 2   # 2 = block JavaScript
    }
}
options.add_experimental_option('prefs', prefs)
browser = webdriver.Chrome(chrome_options=options)
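Note that in newer Selenium releases the chrome_options keyword is deprecated in favor of options:
# Newer Selenium versions expect the options keyword instead:
browser = webdriver.Chrome(options=options)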
4. Selenium scraping in practice: Shenzhen short-term rental data
(1) Site analysis
(2) Project implementation