Scraping web page data with a crawler: three ways to scrape data from a web page

Published by 优采云: 2022-01-15 22:15

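0. Preface

0.1 Downloading the page

This article describes three ways to scrape data from a web page: regular expressions, BeautifulSoup, and lxml.

For details of the code that fetches the page content, see the post "Python Web Crawler - Your First Crawler" on my blog. The code below downloads the whole page.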

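    import requests

    def download(url, num_retries=2, user_agent='wswp', proxies=None):
        '''Download the given URL and return the page content.

        Args:
            url (str): the URL to download

        Keyword args:
            user_agent (str): user agent (default: wswp)
            proxies (dict): proxies, keyed by 'http' / 'https',
                with string values of the form 'http(s)://IP'
            num_retries (int): number of retries on 5xx errors (default: 2)
                5xx server errors mean the server failed to fulfil an
                apparently valid request.
                https://zh.wikipedia.org/wiki/HTTP%E7%8A%B6%E6%80%81%E7%A0%81
        '''
        print('==========================================')
        print('Downloading:', url)
        # Set the User-Agent header; the default one is sometimes blocked by anti-scraping measures
        headers = {'User-Agent': user_agent}
        try:
            resp = requests.get(url, headers=headers, proxies=proxies)  # simple and direct: .get(url)
            html = resp.text  # page content as a string
            if resp.status_code >= 400:  # error handling: 4xx client errors return None
                print('Download error:', resp.text)
                html = None
                # NOTE: the source is cut off after "500" on the next line; the rest of
                # this function is a reconstruction based on the docstring above.
                if num_retries and 500 <= resp.status_code < 600:
                    # retry on 5xx server errors
                    return download(url, num_retries - 1, user_agent, proxies)
        except requests.exceptions.RequestException as e:
            print('Download error:', e)
            html = None
        return html

    #lxml_css
    tree.cssselect('table > tr#places_area__row > td.w2p_fw')[0].text_content()

    #lxml_xpath
    tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0].text_content()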

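Chrome makes it easy to copy these selector expressions.

With the download function above and the different expressions, we can scrape the data in three different ways.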
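The retry rule stated in the download function's docstring (retry only on 5xx errors) can be checked without touching the network. The following is a minimal sketch, assuming a hypothetical fetch callable that returns a (status_code, text) pair in place of requests.get; fetch_with_retries and make_flaky_fetch are illustrative names and are not part of the original code.

```python
def fetch_with_retries(fetch, url, num_retries=2):
    """Call fetch(url) -> (status_code, text); retry only while the status is 5xx."""
    status, text = fetch(url)
    if status >= 400:
        if num_retries and 500 <= status < 600:
            # Server-side error: try again with one fewer retry left.
            return fetch_with_retries(fetch, url, num_retries - 1)
        # Client error (4xx) or retries exhausted: give up.
        return None
    return text


def make_flaky_fetch(failures):
    """Hypothetical stand-in for a server: answers 503 `failures` times, then 200."""
    calls = {'count': 0}

    def fetch(url):
        calls['count'] += 1
        if calls['count'] <= failures:
            return 503, 'Service Unavailable'
        return 200, '<html>ok</html>'

    return fetch, calls


fetch, calls = make_flaky_fetch(failures=2)
print(fetch_with_retries(fetch, 'http://example.com/'))
print('calls made:', calls['count'])
```

With num_retries=2 the function makes at most three attempts in total, so a server that fails twice still yields a page, while one that fails three times yields None.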

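1. Scraping data in different ways

1.1 Scraping the page with regular expressions

Regular expressions are well supported in Python and in other languages. They use a small set of special symbols to describe patterns of strings, concisely and efficiently, and are well worth learning. Python's re module is built in, so nothing extra needs to be installed.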

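    import re

    targets = ('area', 'population', 'iso', 'country', 'capital', 'continent',
               'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format',
               'postal_code_regex', 'languages', 'neighbours')

    def re_scraper(html):
        results = {}
        for target in targets:
            # The HTML tags in this pattern were stripped from the source text; they are
            # restored here to match the row id and cell class used by the other scrapers.
            results[target] = re.search(r'<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>'
                                        % target, html).groups()[0]
        return results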
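One thing to watch with this approach is that re.search returns None when a field is not found, so calling .groups() on the result raises AttributeError. The following is a minimal sketch of a more defensive variant, assuming a pattern keyed on the places_<field>__row id and the w2p_fw cell class; the name re_scraper_safe, the simplified sample_html fragment, and the use of re.DOTALL (so a row may span several lines) are assumptions made for illustration.

```python
import re

# Pattern template: %s is replaced by the field name, e.g. 'area'.
ROW_PATTERN = r'<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>'


def re_scraper_safe(html, targets):
    """Return {field: captured cell content, or None if the row is missing}."""
    results = {}
    for target in targets:
        match = re.search(ROW_PATTERN % target, html, re.DOTALL)
        results[target] = match.group(1) if match else None
    return results


# Simplified, hypothetical fragment in the shape the pattern expects.
sample_html = (
    '<table>'
    '<tr id="places_area__row"><td class="w2p_fl">Area: </td>'
    '<td class="w2p_fw">7,686,850 square kilometres</td></tr>'
    '<tr id="places_iso__row"><td class="w2p_fl">ISO: </td>'
    '<td class="w2p_fw">AU</td></tr>'
    '</table>'
)

print(re_scraper_safe(sample_html, ('area', 'iso', 'capital')))
```

The lazy quantifier .*? matters here: a greedy .* would run on to the last matching cell in the document instead of stopping at the first one after the row tag.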

1.2 Scraping data with BeautifulSoup

For how to use BeautifulSoup, see the post "Python Web Crawler - Scraping Web Data with BeautifulSoup".

The code is as follows:

    from bs4 import BeautifulSoup

    targets = ('area', 'population', 'iso', 'country', 'capital', 'continent',
               'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format',
               'postal_code_regex', 'languages', 'neighbours')

    def bs_scraper(html):
        soup = BeautifulSoup(html, 'html.parser')
        results = {}
        for target in targets:
            results[target] = soup.find('table').find('tr', id='places_%s__row' % target) \
                .find('td', class_="w2p_fw").text
        return results
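To see what each step of the find chain returns, it can help to run it on a small fragment first. The following is a minimal sketch, assuming the beautifulsoup4 package is installed; sample_html is a simplified, hypothetical fragment and not the real page.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical fragment with the same id/class naming as the scraper expects.
sample_html = (
    '<table>'
    '<tr id="places_area__row"><td class="w2p_fl">Area: </td>'
    '<td class="w2p_fw">7,686,850 square kilometres</td></tr>'
    '</table>'
)

soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching element, or None if nothing matches.
row = soup.find('tr', id='places_area__row')
cell = row.find('td', class_='w2p_fw')
print(cell.text)

# A field that is not in the fragment: find() gives None, so chaining a
# second .find() onto it would raise AttributeError.
missing_row = soup.find('tr', id='places_capital__row')
print(missing_row)
```

Because find() returns None for a missing element, a chained call such as the one in bs_scraper fails with AttributeError when a row is absent; checking the intermediate result is one way to guard against that.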

1.3 Scraping data with lxml

    from lxml.html import fromstring

    def lxml_scraper(html):
        tree = fromstring(html)
        results = {}
        for target in targets:
            results[target] = tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % target)[0].text_content()
        return results

    def lxml_xpath_scraper(html):
        tree = fromstring(html)
        results = {}
        for target in targets:
            results[target] = tree.xpath('//tr[@id="places_%s__row"]/td[@class="w2p_fw"]' % target)[0].text_content()
        return results
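Unlike BeautifulSoup's find(), an XPath query in lxml returns a list, which is why the scrapers index it with [0]. The following is a minimal sketch of that behaviour, assuming the lxml package is installed; sample_html is a simplified, hypothetical document. It uses only xpath(), since cssselect() additionally needs the separate cssselect package in recent lxml versions.

```python
from lxml.html import fromstring

# Simplified, hypothetical document with the same id/class naming as the scraper expects.
sample_html = (
    '<html><body><table>'
    '<tr id="places_area__row"><td class="w2p_fl">Area: </td>'
    '<td class="w2p_fw">7,686,850 square kilometres</td></tr>'
    '</table></body></html>'
)

tree = fromstring(sample_html)

# xpath() returns a list of matching elements (possibly empty).
cells = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')
print(len(cells), cells[0].text_content())

# A field that is not in the document: the result is an empty list,
# so indexing it with [0] would raise IndexError.
missing = tree.xpath('//tr[@id="places_capital__row"]/td[@class="w2p_fw"]')
print(missing)
```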

1.4 Results

    scrapers = [('re', re_scraper), ('bs', bs_scraper), ('lxml', lxml_scraper), ('lxml_xpath', lxml_xpath_scraper)]

    html = download('http://example.webscraping.com/places/default/view/Australia-14')

    for name, scraper in scrapers:
        print(name, "=================================================================")
        result = scraper(html)
        print(result)

    ==========================================
    Downloading: http://example.webscraping.com/places/default/view/Australia-14
    re =================================================================
    {'area': '7,686,850 square kilometres', 'population': '21,515,754', 'iso': 'AU', 'country': 'Australia', 'capital': 'Canberra', 'continent': 'OC', 'tld': '.au', 'currency_code': 'AUD', 'currency_name': 'Dollar', 'phone': '61', 'postal_code_format': '####', 'postal_code_regex': '^(\\d{4})$', 'languages': 'en-AU', 'neighbours': ' '}
    bs =================================================================
    {'area': '7,686,850 square kilometres', 'population': '21,515,754', 'iso': 'AU', 'country': 'Australia', 'capital': 'Canberra', 'continent': 'OC', 'tld': '.au', 'currency_code': 'AUD', 'currency_name': 'Dollar', 'phone': '61', 'postal_code_format': '####', 'postal_code_regex': '^(\\d{4})$', 'languages': 'en-AU', 'neighbours': ' '}
    lxml =================================================================
    {'area': '7,686,850 square kilometres', 'population': '21,515,754', 'iso': 'AU', 'country': 'Australia', 'capital': 'Canberra', 'continent': 'OC', 'tld': '.au', 'currency_code': 'AUD', 'currency_name': 'Dollar', 'phone': '61', 'postal_code_format': '####', 'postal_code_regex': '^(\\d{4})$', 'languages': 'en-AU', 'neighbours': ' '}
    lxml_xpath =================================================================
    {'area': '7,686,850 square kilometres', 'population': '21,515,754', 'iso': 'AU', 'country': 'Australia', 'capital': 'Canberra', 'continent': 'OC', 'tld': '.au', 'currency_code': 'AUD', 'currency_name': 'Dollar', 'phone': '61', 'postal_code_format': '####', 'postal_code_regex': '^(\\d{4})$', 'languages': 'en-AU', 'neighbours': ' '}

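The results show that, for some fields, the regular expression returns extra markup rather than plain text. The page structure differs for those fields, for example where a cell contains a link or an image, so a single pattern cannot capture the same content everywhere. BeautifulSoup and lxml provide dedicated text-extraction functions, so they do not have this problem.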
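The difference is easy to reproduce on a cell that wraps its value in a link. The following is a minimal sketch, assuming the lxml package is installed; sample_html, including the href value, is a hypothetical fragment written for illustration.

```python
import re
from lxml.html import fromstring

# Hypothetical row whose value is wrapped in a link.
sample_html = (
    '<html><body><table>'
    '<tr id="places_continent__row"><td class="w2p_fl">Continent: </td>'
    '<td class="w2p_fw"><a href="/continent/OC">OC</a></td></tr>'
    '</table></body></html>'
)

# The regular expression captures everything inside the cell, markup included.
pattern = r'<tr id="places_continent__row">.*?<td class="w2p_fw">(.*?)</td>'
regex_value = re.search(pattern, sample_html).group(1)

# text_content() walks the cell's subtree and keeps only the text.
tree = fromstring(sample_html)
cell = tree.xpath('//tr[@id="places_continent__row"]/td[@class="w2p_fw"]')[0]
lxml_value = cell.text_content()

print('regex:', regex_value)
print('lxml: ', lxml_value)
```

The regular expression yields the whole anchor element as a string, while text_content() yields only OC.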

Given these three different approaches, how do they differ? When is each one appropriate? How should you choose between them?
