Scraping web page data with Python (method overview, installation guide, run results, and analysis)

优采云 · Published: 2022-02-13 17:12

1. Method overview

In Python 3, there are three main ways to extract data from a downloaded web page: regular expressions, BeautifulSoup, and lxml. Each has its own characteristics.

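A regular expression (often abbreviated in code as regex, regexp, or RE) is a concept from computer science. Regular expressions are typically used to find and replace text that matches a particular pattern (rule).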
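As a small illustration of searching and replacing with a pattern, the sketch below uses Python's built-in re module. The sample string and both patterns are made up for this example and are not taken from the article's target page.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Area: 244,820 square kilometres; Population: 62,348,447"

# Search: capture the digits (with thousands separators) that follow "Area: ".
match = re.search(r"Area: ([\d,]+)", text)
area = match.group(1)

# Replace: remove every comma that sits between two digits.
cleaned = re.sub(r"(?<=\d),(?=\d)", "", text)

print(area)
print(cleaned)
```

The lookbehind and lookahead in the second pattern keep the replacement from touching commas that are ordinary punctuation.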

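BeautifulSoup is an HTML/XML parser written in Python that copes well with non-standard markup and builds a parse tree from it. It provides simple, commonly used operations for navigating, searching, and modifying that tree, which can save a good deal of programming time.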
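To make the point about non-standard markup concrete, here is a minimal sketch that feeds BeautifulSoup a fragment with unquoted attributes and no closing tags. The fragment is invented for this example; the id and class names only mirror the selector that appears in the article's code. It assumes the bs4 package is installed and uses Python's bundled "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical broken HTML: unquoted attribute values and no closing tags.
broken_html = "<table><tr id=places_area__row><td class=w2p_fw>244,820 square kilometres"

# "html.parser" ships with Python, so no additional parser needs installing.
soup = BeautifulSoup(broken_html, "html.parser")

# Navigate the parse tree: locate the row by id, then the cell by class.
row = soup.find("tr", id="places_area__row")
cell = row.find("td", class_="w2p_fw")
area = cell.get_text()

print(area)
```

Different parser backends can repair broken markup in different ways, so results on messier input may vary with the backend chosen.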

lxml is a Python wrapper around the XML parsing library libxml2. The module is written in C and parses faster than BeautifulSoup, but it is also more complicated to install.
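A minimal sketch of the lxml workflow follows, assuming the lxml package is installed. It uses an XPath query rather than the CSS selector seen in the article's code, because lxml's cssselect method depends on the separate cssselect package. The HTML string is a made-up stand-in for a downloaded page.

```python
import lxml.html

# Hypothetical fragment standing in for a downloaded page.
html = (
    '<table>'
    '<tr id="places_area__row"><td class="w2p_fw">244,820 square kilometres</td></tr>'
    '</table>'
)

# Parse the string into an element tree.
tree = lxml.html.fromstring(html)

# XPath: the td with class w2p_fw directly inside the row with the given id.
cells = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')
area = cells[0].text_content()

print(area)
```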

2. Performance comparison code

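import re
import time
from urllib import request

import lxml.html
from bs4 import BeautifulSoup


def download(url, user_agent="wsap", num=2):
    """Download a page, retrying up to `num` times on 5xx server errors."""
    print("Downloading:" + url)
    try:
        req = request.Request(url)
        # The HTTP header is named "User-Agent", not "user_agent".
        req.add_header('User-Agent', user_agent)
        html = request.urlopen(req).read()
    except Exception as e:
        print('Download error:', e)
        html = None
        if num > 0:
            # The source line is cut off after "500"; a 5xx range check followed
            # by a retry is assumed here.
            if hasattr(e, "code") and 500 <= e.code < 600:
                return download(url, user_agent, num - 1)
    return html


# The source text is truncated at this point: the field list, re_scraper,
# beautiful_soup_scraper, and the opening lines of lxml_scraper are missing.
# FIELDS is an assumption; 'area' is the only field the timing loop checks.
FIELDS = ('area',)


def lxml_scraper(html):
    # Only the selector expression and "return results" survive in the source;
    # the lines around them are assumed.
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect(
            'table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content()
    return results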

NUM_ITERATIONS = 1000

html_copy = download('http://example.webscraping.com/places/default/view/United-Kingdom-239')

for name, scraper in [('Regular expressions', re_scraper),
                      ('BeautifulSoup', beautiful_soup_scraper),
                      ('Lxml', lxml_scraper)]:
    start = time.time()
    for i in range(NUM_ITERATIONS):
        if scraper == re_scraper:
            # Clear the regex cache so every iteration pays the compile cost.
            re.purge()
        result = scraper(html_copy)
        assert result['area'] == '244,820 square kilometres'
    end = time.time()
    print('%s: %.2f seconds' % (name, end - start))
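The timing loop above calls re_scraper and beautiful_soup_scraper, but their definitions do not survive in this text. The following is a hedged sketch of what such functions might look like, not the author's code. It assumes the row id and cell class implied by the surviving lxml selector (places_<field>__row and w2p_fw), a FIELDS tuple containing only 'area', and a made-up sample string in place of the downloaded page.

```python
import re

from bs4 import BeautifulSoup

# Assumption: 'area' is the only field confirmed by the article's assertion.
FIELDS = ('area',)


def re_scraper(html):
    # urlopen().read() returns bytes; decode so a str pattern can be applied.
    if isinstance(html, bytes):
        html = html.decode('utf-8', errors='replace')
    results = {}
    for field in FIELDS:
        pattern = r'<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field
        # re.DOTALL lets ".*?" span line breaks between the row and the cell.
        results[field] = re.search(pattern, html, re.DOTALL).group(1)
    return results


def beautiful_soup_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        row = soup.find('tr', id='places_%s__row' % field)
        results[field] = row.find('td', class_='w2p_fw').get_text()
    return results


# Hypothetical stand-in for the downloaded page.
sample = (
    '<table>'
    '<tr id="places_area__row"><td class="w2p_fw">244,820 square kilometres</td></tr>'
    '</table>'
)
print(re_scraper(sample))
print(beautiful_soup_scraper(sample))
```

The regular expression is tied to the exact attribute order and quoting of the markup, so it breaks more easily than the parser-based version if the page layout changes.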

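3. Results

4. Analysis of the results

If the bottleneck in your crawler is downloading pages rather than extracting data, then using a slower method such as BeautifulSoup is not a problem. If you only need to scrape a small amount of data and want to avoid extra dependencies, regular expressions may be the better fit. In general, though, lxml is usually the best choice for scraping because it is both fast and robust, whereas regular expressions and BeautifulSoup are useful only in certain situations.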
