Dynamic Web Scraping with Selenium: GeckoDriver Installation and Practical Examples


  The result obtained above looks confusing. To extract the data we want from this JSON response, we need the json library to parse it.

# coding: utf-8
import requests
import json

url = "https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=0&status=open"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

r = requests.get(url, headers=headers)
json_data = json.loads(r.text)
comments_list = json_data['data']  # the comments live under the 'data' key
for eachone in comments_list:
    message = eachone['content']
    print(message)

  The code above only scrapes a single page; to scrape more comments we need to work out the pattern in the URL.

  The real URL of the first page: https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=0&status=open

  The real URL of the second page: https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=20&status=open

  Comparing these two URLs, we find two especially important parameters: offset and limit. limit fixes the page size at 20 comments, and offset advances by 20 for each new page.

# coding: utf-8
import requests
import json

def single_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    json_data = json.loads(r.text)
    comments_list = json_data['data']
    for eachone in comments_list:
        message = eachone['content']
        print(message)

# fetch the first two pages: offset = 0 and offset = 20
for page in range(0, 2):
    link1 = "https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset="
    link2 = "&status=open"
    page_str = str(page * 20)
    link = link1 + page_str + link2
    single_page(link)
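
  The loop above hard-codes two pages. As a hedged sketch (not from the original), we can instead keep increasing offset until the API returns an empty data list, so the scraper stops on its own once the comments run out:

# Sketch: page through the comments until the API returns no more data
import requests
import json

headers = {'User-Agent': 'Mozilla/5.0'}  # use the full UA string from the script above
link1 = "https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset="
link2 = "&status=open"

offset = 0
while True:
    r = requests.get(link1 + str(offset) + link2, headers=headers)
    data = json.loads(r.text).get('data', [])
    if not data:
        break  # an empty 'data' list means there are no comments left
    for eachone in data:
        print(eachone['content'])
    offset += 20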

  3. Simulating a browser with Selenium

  For some complex websites, the methods above no longer work. Moreover, some data's real URLs are long and convoluted, and some sites even obfuscate these addresses to deter scraping, making the parameters hard to figure out.

  So instead we use the Selenium browser rendering approach: the browser displays the page directly and parses the HTML, JS and CSS for us.

  (1) Installing Selenium and a basic introduction

  Reference post: how to install GeckoDriver, the Firefox browser driver

# coding: utf-8
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

# Python 2 legacy encoding workaround; not needed on Python 3:
# import sys
# reload(sys)
# sys.setdefaultencoding("utf-8")

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False
# Windows: geckodriver must also be installed
binary = FirefoxBinary(r'D:\Program Files (x86)\Mozilla Firefox\firefox.exe')
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("https://www.baidu.com")

  The following examples use the Chrome browser:

# coding: utf-8
from selenium import webdriver

# Windows: point Selenium at the chromedriver executable
# (a raw string keeps the backslashes from being treated as escapes)
driver = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get("https://www.baidu.com")

  (2) A practical Selenium example

  Now let's use browser rendering to scrape the comment data from before.

  First locate the HTML tag of a comment on the "Inspect" page, then try to fetch the first comment.

  The code is as follows:

# coding: utf-8
from selenium import webdriver

driver = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get("https://www.zhihu.com/question/22913650")
# the CSS selector was left blank in the original; fill in the one found via Inspect
comment = driver.find_element_by_css_selector('')
print(comment.text)
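
  Because the comments are rendered by JavaScript, the element may not exist the instant the page loads. Here is a minimal sketch using Selenium's explicit-wait API; the 'div.CommentContent' selector is a hypothetical placeholder, not taken from the original, so substitute the selector you find via Inspect:

# coding: utf-8
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get("https://www.zhihu.com/question/22913650")
# wait up to 10 seconds for the (hypothetical) comment element to appear
comment = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.CommentContent'))  # hypothetical selector
)
print(comment.text)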

  (3) Getting all of an article's comments with Selenium

  To get every comment, the script needs to click "load more", "all comments", "next page" and similar buttons automatically.

  The code is as follows:

# coding: utf-8
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
# change the path above to wherever Firefox is installed on your machine
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")

# the comments sit inside the livere iframe, so switch into it first
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    content = eachcomment.find_element_by_tag_name('p')
    print(content.text)
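
  The block above only reads the comments already rendered on the first page. As a hedged sketch of the "click next page" idea, the loop below continues from the driver above (still inside the livere iframe); the 'button.page-btn-next' selector is a hypothetical placeholder, so look up the real one via Inspect.

# Continues from the script above; 'button.page-btn-next' is a hypothetical selector
while True:
    try:
        next_button = driver.find_element_by_css_selector('button.page-btn-next')
    except Exception:
        break  # no next-page button left: every page has been visited
    next_button.click()
    time.sleep(2)  # give the next page of comments time to render
    for eachcomment in driver.find_elements_by_css_selector('div.reply-content'):
        print(eachcomment.find_element_by_tag_name('p').text)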

  Selenium's methods for selecting elements:

# by CSS selector
find_element_by_css_selector('div.body_inner')
# by XPath
find_element_by_xpath("//form[@id='loginForm']")
# by the id attribute
find_element_by_id('loginForm')
# by the name attribute
find_element_by_name('myname')
# by link text (exact match)
find_element_by_link_text('text')
# by link text (partial match)
find_element_by_partial_link_text('te')
# by tag name
find_element_by_tag_name('div')
# by class name
find_element_by_class_name('body_inner')

  To find multiple elements, add an 's' after 'element' in the method names above (for example find_elements_by_css_selector). The first two methods are the most commonly used.

  (4) Advanced Selenium operations

  To speed up Selenium's scraping, the usual approach is to stop the browser from loading things it doesn't need:

  (1) Controlling CSS loading

fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.stylesheet", 2)  # 2 = block stylesheets

  (2) Controlling image loading

fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.image", 2)  # 2 = block images

  (3) Controlling JavaScript execution

fp = webdriver.FirefoxProfile()
fp.set_preference("javascript.enabled", False)  # disable JavaScript

  For the Chrome browser:

options = webdriver.ChromeOptions()
prefs = {
    'profile.default_content_setting_values': {
        'images': 2,      # 2 = block images
        'javascript': 2   # 2 = block JavaScript
    }
}
options.add_experimental_option('prefs', prefs)
browser = webdriver.Chrome(chrome_options=options)

  4. Selenium scraping in practice: Shenzhen short-term rental data

  (1) Site analysis

  (2) Project implementation
