Fetching web pages like a browser (how to bypass a site's anti-scraping mechanisms and get the correct page)

优采云, published 2022-02-24 13:25


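When writing a crawler, you will find that some sites have anti-scraping mechanisms that refuse to respond to non-browser clients, and that crawling too frequently in a short period can trigger such mechanisms and get your IP blocked. To get around this and retrieve the correct page, either modify the crawler's request headers so the request looks like it comes from a browser, or send the request through a proxy.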

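This article uses Python 3.6, the widely used HTTP library requests, and the browser automation library selenium.

For how to use these two libraries, see their official documentation or my other blog post: "How to fetch a page's HTML and download attachments with a Python crawler".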

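1. Sending requests with spoofed headers using requests

In the headers of a request sent by requests, the User-Agent identifies the request as coming from a Python program, as the following example shows: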

import requests

url = 'https://httpbin.org/headers'
response = requests.get(url)
# Print the body only if the request succeeded
if response.status_code == 200:
    print(response.text)

Response:

{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  }
}

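Note: httpbin.org is an open-source site for testing HTTP requests. For example, the /headers endpoint used above returns the headers of the request it received. See its official site for details.

User-Agent: the User-Agent (user agent) string is the identifier that client software sends on the user's behalf to describe itself. It identifies the browser type and version, the operating system and version, the browser engine, and similar information. For details, see the Wikipedia entry "User agent".

A site with anti-scraping measures inspects the request headers and refuses to return the correct page when they do not look like a browser's. In that case, the request has to be sent with browser-like headers.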
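To make the idea concrete, here is a minimal sketch of the kind of check a site might run on incoming headers. It is purely illustrative: the function name `looks_like_script` and the marker list are assumptions made for this example, and real sites use their own, usually more elaborate, rules.

```python
# Hypothetical server-side check (an assumption for illustration only).
# Substrings that commonly appear in the default User-Agent of HTTP tools.
DEFAULT_CLIENT_MARKERS = ("python-requests", "curl", "wget", "scrapy")


def looks_like_script(headers):
    """Return True if the User-Agent is missing or names a common HTTP tool."""
    user_agent = headers.get("User-Agent", "").lower()
    if not user_agent:
        return True
    return any(marker in user_agent for marker in DEFAULT_CLIENT_MARKERS)


# The default requests User-Agent is flagged, a browser-style one is not.
print(looks_like_script({"User-Agent": "python-requests/2.20.1"}))
print(looks_like_script({"User-Agent": "Mozilla/5.0 (Macintosh) Safari/605.1.15"}))
```

A check like this is why replacing the User-Agent alone is often enough for simple sites, and also why it may not be enough for sites that look at more than one header.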

Open the site in a browser and the page it shows lists the browser's own headers.

[Screenshot: the request headers displayed in the browser]

Copy those request headers and pass them to requests.get() to make the request look like it came from a browser.

import requests

url = 'https://httpbin.org/headers'
# Headers copied from a real browser (Safari on macOS)
myheaders = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    # Advertising "br" assumes a Brotli decoder (e.g. the brotli package) is installed
    "Accept-Encoding": "br, gzip, deflate",
    "Accept-Language": "zh-cn",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15"
}
response = requests.get(url, headers=myheaders)
if response.status_code == 200:
    print(response.text)

The response now becomes:

{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "br, gzip, deflate",
    "Accept-Language": "zh-cn",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15"
  }
}

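When running a real crawler, you can switch randomly among several User-Agent strings to reduce the chance of triggering anti-scraping measures.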
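Random switching can be sketched as follows. This is only one possible approach: the helper name `random_headers` and the `USER_AGENTS` list are assumptions for this example, and the two strings in the list are simply the Safari and Chrome values that appear elsewhere in this article.

```python
import random

# Example pool (an assumption); in practice, collect strings from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.0.3 Safari/605.1.15",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
]


def random_headers(base_headers=None, rng=random):
    """Return a copy of base_headers with a randomly chosen User-Agent."""
    headers = dict(base_headers or {})
    headers["User-Agent"] = rng.choice(USER_AGENTS)
    return headers


headers = random_headers({"Accept-Language": "zh-cn"})
print(headers["User-Agent"])
```

The returned dictionary can be passed to requests.get(url, headers=headers), with a fresh call to `random_headers` before each request.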

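2. Spoofing headers with Selenium browser automation

The automated testing tool selenium can drive a real browser to visit a site. This article uses selenium 3.14.0, which supports Chrome and Firefox. Each browser needs the matching version of its driver (chromedriver for Chrome, geckodriver for Firefox), which has to be downloaded separately.

First, use webdriver to see the headers the browser sends by default.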

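from selenium import webdriver

url = 'https://httpbin.org/headers'
driver_path = '/path/to/chromedriver'
# executable_path is the selenium 3 way of pointing at the driver binary
browser = webdriver.Chrome(executable_path=driver_path)
browser.get(url)
print(browser.page_source)
# quit() closes the browser and also shuts down the driver process
browser.quit()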

The returned page source is printed:

{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "httpbin.org",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
  }
}

The browser driver can also spoof the User-Agent. Just pass the setting in when creating the webdriver browser object:

from selenium import webdriver

url = 'https://httpbin.org/headers'
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15"
driver_path = '/path/to/chromedriver'
# Override the User-Agent through a Chrome command-line switch
opt = webdriver.ChromeOptions()
opt.add_argument('--user-agent=%s' % user_agent)
browser = webdriver.Chrome(executable_path=driver_path, options=opt)
browser.get(url)
print(browser.page_source)
# quit() closes the browser and also shuts down the driver process
browser.quit()

The returned User-Agent is now the value that was passed in:

{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "httpbin.org",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15"
  }
}

The Firefox driver is configured differently:

from selenium import webdriver

url = 'https://httpbin.org/headers'
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15"
driver_path = '/path/to/geckodriver'
# Firefox takes the override as a profile preference instead of a command-line switch
profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override", user_agent)
browser = webdriver.Firefox(executable_path=driver_path, firefox_profile=profile)
browser.get(url)
print(browser.page_source)
# quit() closes the browser and also shuts down the driver process
browser.quit()

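3. Sending requests through a proxy IP with requests

When a single IP accesses a site too frequently within a short time, it triggers the site's anti-scraping mechanism: the site may demand a CAPTCHA or even block the IP outright. In that case, the request has to be sent through a proxy IP to get the correct page.

Visit https://httpbin.org/ip to see your own IP.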

import requests

url = 'https://httpbin.org/ip'
response = requests.get(url)
print(response.text)

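This returns the IP of the local network.

Using a proxy IP works much like spoofing headers in section 1. For example, given the proxy 58.58.213.55:8888, use it as follows: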

import requests

# Both plain-HTTP and HTTPS traffic are routed through the same HTTP proxy
proxy = {
    'http': 'http://58.58.213.55:8888',
    'https': 'http://58.58.213.55:8888'
}
response = requests.get('https://httpbin.org/ip', proxies=proxy)
print(response.text)

The response shows the proxy IP:

{
  "origin": "58.58.213.55, 58.58.213.55"
}
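When several proxies are available, the mapping above can be built from a plain "host:port" string and one proxy picked at random per request. The sketch below is one way to do it: `make_proxies`, `pick_proxies`, and `PROXY_POOL` are names invented for this example, and the pool contains only the sample address from this article.

```python
import random

# Example pool (an assumption); replace with proxies you actually have access to.
PROXY_POOL = ["58.58.213.55:8888"]


def make_proxies(address):
    """Build the mapping that requests expects from a 'host:port' string."""
    proxy_url = "http://" + address
    return {"http": proxy_url, "https": proxy_url}


def pick_proxies(pool=PROXY_POOL, rng=random):
    """Choose one proxy from the pool and return its requests mapping."""
    return make_proxies(rng.choice(pool))


print(pick_proxies())
```

The result can be passed as requests.get(url, proxies=pick_proxies(), timeout=10). Setting a timeout is worth considering because free proxies are often slow or dead.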

4. Using a proxy IP with selenium webdriver

With the Chrome driver, a proxy IP is set in much the same way as the spoofed User-Agent:

from selenium import webdriver

url = 'https://httpbin.org/ip'
proxy = '58.58.213.55:8888'
driver_path = '/path/to/chromedriver'
# Route the browser's traffic through the proxy via a Chrome command-line switch
opt = webdriver.ChromeOptions()
opt.add_argument('--proxy-server=' + proxy)
browser = webdriver.Chrome(executable_path=driver_path, options=opt)
browser.get(url)
print(browser.page_source)
# quit() closes the browser and also shuts down the driver process
browser.quit()

Output:

{
  "origin": "58.58.213.55, 58.58.213.55"
}
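Since both the User-Agent and the proxy are set through Chrome command-line switches, the two can be combined. The helper below is a small sketch that only builds the list of switches. The name `chrome_arguments` is an assumption for this example, while the two switches themselves are the ones used in this article.

```python
def chrome_arguments(user_agent=None, proxy=None):
    """Return the Chrome switches for the given User-Agent and proxy, if any."""
    args = []
    if user_agent:
        args.append("--user-agent=%s" % user_agent)
    if proxy:
        args.append("--proxy-server=" + proxy)
    return args


for arg in chrome_arguments(user_agent="Mozilla/5.0 (example)", proxy="58.58.213.55:8888"):
    print(arg)
```

Each returned string can then be handed to opt.add_argument() on the ChromeOptions object before the browser is created.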

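The Firefox driver is set up slightly differently: you need to download the browser extension close_proxy_authentication to dismiss the proxy authentication prompt. The extension can be found through a Google search. This article uses Firefox 65.0.1 (64-bit), for which the working extension file is close_proxy_authentication-1.1-sm+tb+fx.xpi.

The code is as follows: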

from selenium import webdriver

url = 'https://httpbin.org/ip'
proxy_ip = '58.58.213.55'
proxy_port = 8888
xpi = '/path/to/close_proxy_authentication-1.1-sm+tb+fx.xpi'
driver_path = '/path/to/geckodriver'

profile = webdriver.FirefoxProfile()
# 1 selects manual proxy configuration
profile.set_preference('network.proxy.type', 1)
# Use the same proxy for HTTP and HTTPS (ssl) traffic
profile.set_preference('network.proxy.http', proxy_ip)
profile.set_preference('network.proxy.http_port', proxy_port)
profile.set_preference('network.proxy.ssl', proxy_ip)
profile.set_preference('network.proxy.ssl_port', proxy_port)
# Hosts that should bypass the proxy
profile.set_preference('network.proxy.no_proxies_on', 'localhost, 127.0.0.1')
profile.add_extension(xpi)

browser = webdriver.Firefox(executable_path=driver_path, firefox_profile=profile)
browser.get(url)
print(browser.page_source)
# quit() closes the browser and also shuts down the driver process
browser.quit()

This also prints the proxy IP:

{
  "origin": "58.58.213.55, 58.58.213.55"
}
