Scraping dynamic web pages with Python (what to do when a Python crawler hits an AJAX-driven page)

优采云 · Published: 2021-10-15 03:14

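Normally, when a Python crawler runs into an AJAX-driven page, you analyze the AJAX request and replay it directly to get the data. Today, though, I hit a site (whose URL I won't disclose, for certain reasons) that behaves differently. After you click the search button, the browser first goes to page A, jumps from A to page B, and then jumps from B back to A. Only after both jumps complete does the AJAX request submitted to page A return results.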

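I also suspected a cookie or Referer problem, but that turned out not to be the cause. Even when page A is requested with forged headers, the response is not the real result page but a snippet of JS that redirects to page B.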
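For reference, the usual "replay the AJAX request" approach with forged headers might look roughly like the sketch below. It uses only the standard library and never sends anything. The endpoint URL, the form field name `kw`, and the header values are placeholders, not the real site's. `looks_like_js_redirect` is a hypothetical heuristic for spotting the kind of response described above, a short JS stub that jumps elsewhere instead of a result page.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder values: the real site is not disclosed in this article.
AJAX_URL = "http://www.example.com/search"  # hypothetical endpoint
FORGED_HEADERS = {
    "User-Agent": "myua",
    "Referer": "http://www.example.com/",
    "X-Requested-With": "XMLHttpRequest",
}


def build_ajax_request(keyword):
    """Build (but do not send) a POST request imitating the page's AJAX call."""
    body = urlencode({"kw": keyword}).encode("utf-8")  # "kw" is an assumed field name
    return Request(AJAX_URL, data=body, headers=FORGED_HEADERS)


# Rough heuristic: a short body whose script assigns to or replaces location.
_REDIRECT_RE = re.compile(
    r"(window\.|document\.)?location(\.href)?\s*=|location\.replace\("
)


def looks_like_js_redirect(html):
    """Return True if the body looks like a JS redirect stub, not a result page."""
    return len(html) < 2048 and bool(_REDIRECT_RE.search(html))


if __name__ == "__main__":
    req = build_ajax_request("xxx")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=10).read()
    print(looks_like_js_redirect('<script>window.location.href="/b";</script>'))
```

If a check like `looks_like_js_redirect` keeps returning true no matter which headers are forged, replaying the request is probably not enough, and driving a real browser engine is the more practical route.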

Since we don't know what the site does during those jumps, let's just bring out the heavy weaponry.

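PhantomJS can be thought of, loosely, as a JS interpreter (more precisely, a scriptable headless WebKit browser). Selenium needs no introduction; install it with pip.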

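Download a prebuilt PhantomJS package (or fetch the source and build it yourself); the bin directory of the unpacked archive is what we need.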
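To avoid hard-coding where the executable lives, the path could be resolved at run time. The sketch below is an assumption, not part of the original script: `find_phantomjs` is a hypothetical helper that looks in a given bin directory first and then falls back to the system PATH.

```python
import os
import shutil


def find_phantomjs(bin_dir=".", name="phantomjs"):
    """Return a path to the PhantomJS executable, or None if it cannot be found.

    Looks in bin_dir first (the unpacked package's bin directory), then on PATH.
    """
    # The Windows build ships the binary as phantomjs.exe.
    for candidate in (name, name + ".exe"):
        path = os.path.join(os.path.abspath(bin_dir), candidate)
        if os.path.isfile(path):
            return path
    return shutil.which(name)


if __name__ == "__main__":
    print(find_phantomjs("."))
```

The returned path can then be passed as `executable_path` when creating the driver.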

# Note: webdriver.PhantomJS and find_element_by_id exist only in older
# Selenium releases (3.x and earlier).
import io
import time

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

if __name__ == "__main__":
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    # PhantomJS reads resourceTimeout in milliseconds; raise it if resources time out
    dcap["phantomjs.page.settings.resourceTimeout"] = 5
    dcap["phantomjs.page.settings.loadImages"] = False
    # Spoof the User-Agent
    dcap["phantomjs.page.settings.userAgent"] = "myua"
    # Add a request header
    # dcap["phantomjs.page.customHeaders.Referer"] = "https://www.google.com/"
    # Proxy
    service_args = [
        '--proxy=127.0.0.1:8080',
        #'--proxy-type=http',
        #'--proxy-type=socks5',
        #'--proxy-auth=username:password'
    ]
    driver = webdriver.PhantomJS(
        executable_path='./phantomjs',
        service_args=service_args,
        desired_capabilities=dcap
    )
    try:
        driver.get("http://www.xxx.cn/")
        driver.find_element_by_id('kw').send_keys("xxx")  # fill in the search box
        driver.find_element_by_id("btn_ci").click()       # click the search button
        time.sleep(7)  # wait for the redirects and the AJAX result to finish
        page = driver.page_source
        # Write with an explicit encoding (works on Python 2 and 3)
        with io.open("res.html", "w", encoding="utf-8") as f:
            f.write(page)
    finally:
        driver.quit()

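Here my script sits directly in the bin directory, so executable_path points to the current directory. The IDs passed to find_element_by_id can be found by viewing the target site's page source, and the sleep duration should be adjusted to your actual situation. driver.implicitly_wait(30) could also be used here, but this site generates the IDs of its data elements randomly, so I simply used sleep.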
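A middle ground between a fixed sleep and an element-based wait is to poll for some condition that does not depend on the random IDs. The sketch below is an assumption rather than what the original script does: `wait_until` is a hypothetical helper, and the marker text in the usage comment is a placeholder for something stable on the result page.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds elapse.

    Returns the truthy value, or None if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Possible use with the driver from the script above (placeholder marker text):
# loaded = wait_until(lambda: "result-marker" in driver.page_source, timeout=30)
```

Unlike `time.sleep(7)`, this returns as soon as the condition holds and gives a clear signal (`None`) when the page never finishes loading.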
