Scraping Web Page Data with Python (from 《用Python写网络爬虫》, Section 2.2: Three Approaches to Scraping a Web Page)

Published by 优采云: 2021-12-20 18:09

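Abstract: This article covers three ways to scrape web page data with Python: regular expressions (re), the BeautifulSoup module, and the lxml module. All code in this article was run on Python 3.5.

This article scrapes the headline items on the home page of the [中央气象台](http://www.nmc.cn/) (National Meteorological Center):

The HTML hierarchy is:

For each item we extract the href, the title attribute, and the link text.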

1. Regular expressions

Copy the element's outer HTML:

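<a target="_blank" href="/publish/country/warning/megatemperature.html" title="中央气象台7月13日18时继续发布高温橙色预警">高温预警</a>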

Code:

# coding=utf-8
import re
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
html = html.decode('utf-8')  # needed in Python 3, where read() returns bytes

# Each pattern captures one field of <a target="_blank" href="..." title="...">text</a>
links = re.findall('<a target="_blank" href="(.+?)" title', html)
titles = re.findall('<a target="_blank" .+? title="(.+?)"', html)
tags = re.findall('<a target="_blank" .+? title=.+?>(.+?)</a>', html)

for link, title, tag in zip(links, titles, tags):
    print(tag, url + link, title)

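In these patterns, '.' matches any single character except a newline (\n); '+' matches one or more repetitions of the preceding expression; and '?' matches zero or one repetition of the preceding expression. When '?' follows another quantifier, as in '.+?', it makes that quantifier non-greedy, so it matches as few characters as possible. For more detail, see a Python regular expression tutorial.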
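The effect of the non-greedy '.+?' is easiest to see on a small string. The sketch below uses a made-up fragment (the `sample` string is an assumption for illustration, not markup copied from the real page) and compares a greedy and a non-greedy capture of the href value.

```python
import re

# Hypothetical fragment with two links on one line (not taken from the real page).
sample = ('<a href="/a.html" title="A">first</a>'
          '<a href="/b.html" title="B">second</a>')

# Greedy: '.+' takes as much as it can, so the capture runs up to the LAST '" title'.
greedy = re.findall(r'href="(.+)" title', sample)

# Non-greedy: '.+?' stops at the FIRST '" title' after each href.
lazy = re.findall(r'href="(.+?)" title', sample)

print(greedy)  # one long capture spanning both links
print(lazy)    # one capture per link
```

This is why the patterns in this section use '.+?' rather than '.+': on a page where several links share a line, the greedy form would likely swallow everything between the first and the last link.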

The output is as follows:

高温预警 http://www.nmc.cn/publish/country/warning/megatemperature.html 中央气象台7月13日18时继续发布高温橙色预警
山洪灾害气象预警 http://www.nmc.cn/publish/mountainflood.html 水利部和中国气象局7月13日18时联合发布山洪灾害气象预警
强对流天气预警 http://www.nmc.cn/publish/country/warning/strong_convection.html 中央气象台7月13日18时继续发布强对流天气蓝色预警
地质灾害气象风险预警 http://www.nmc.cn/publish/geohazard.html 国土资源部与中国气象局7月13日18时联合发布地质灾害气象风险预警
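The code in this article builds each full URL by concatenating the site root with the href, which works because every href here starts with '/'. As a hedged alternative, the standard library's `urllib.parse.urljoin` also copes with hrefs that are already absolute. In the sketch below, the absolute href and the href without a leading slash are assumed cases for illustration and were not observed on the page.

```python
from urllib.parse import urljoin

base = 'http://www.nmc.cn/'

# Root-relative href, the form that appears in the scraped output.
full = urljoin(base, '/publish/mountainflood.html')

# Assumed case: an href that is already absolute is returned unchanged,
# whereas plain concatenation would produce a broken URL.
absolute = urljoin(base, 'http://www.nmc.cn/publish/geohazard.html')

# Assumed case: a relative href without a leading slash.
relative = urljoin(base, 'publish/geohazard.html')

print(full)
print(absolute)
print(relative)
```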

2. The BeautifulSoup module

Beautiful Soup is a very popular Python module. It parses a web page and provides a convenient interface for locating content.

Copy the selector:

#alarmtip > ul > li.waring > a:nth-child(1)

Because we want to scrape every item here, not just the first one, we change it to:

#alarmtip > ul > li.waring > a

Code:

from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package
content = soup.select('#alarmtip > ul > li.waring > a')

for n in content:
    link = n.get('href')
    title = n.get('title')
    tag = n.text
    print(tag, url + link, title)

The output is the same as above.
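To see what dropping `:nth-child(1)` changes without hitting the live site, the two selectors can be tried on an inline snippet. This is a minimal sketch: the `sample` markup is hypothetical and only shaped like the selector in this section, and it uses Python's built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the selector above; the real page may differ.
sample = '''
<div id="alarmtip">
  <ul>
    <li class="waring">
      <a target="_blank" href="/one.html" title="Title one">Tag one</a>
      <a target="_blank" href="/two.html" title="Title two">Tag two</a>
    </li>
  </ul>
</div>
'''

soup = BeautifulSoup(sample, 'html.parser')

# With :nth-child(1), only an <a> that is the first child of its <li> matches.
first_only = soup.select('#alarmtip > ul > li.waring > a:nth-child(1)')

# Without it, every <a> directly under li.waring matches.
all_links = soup.select('#alarmtip > ul > li.waring > a')

first_hrefs = [a.get('href') for a in first_only]
all_hrefs = [a.get('href') for a in all_links]
print(first_hrefs)
print(all_hrefs)
```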

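3. The lxml module

lxml is a Python package built on libxml2 (an XML parsing library). Because the parsing is done in C, it is faster than Beautiful Soup, but installation is more involved.

Code: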

import urllib.request
import lxml.html

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
tree = lxml.html.fromstring(html)
content = tree.cssselect('li.waring > a')  # cssselect() requires the cssselect package

for n in content:
    link = n.get('href')
    title = n.get('title')
    tag = n.text
    print(tag, url + link, title)

The output is the same as above.
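If the separate cssselect package is not installed, the same elements can probably be reached with lxml's built-in XPath support. The sketch below is an alternative to the code in this section, not a copy of it, and it runs on a hypothetical inline snippet instead of the live page.

```python
import lxml.html

# Hypothetical markup for illustration; the real page may differ.
sample = '''
<ul>
  <li class="waring"><a href="/one.html" title="Title one">Tag one</a></li>
  <li class="waring"><a href="/two.html" title="Title two">Tag two</a></li>
</ul>
'''

tree = lxml.html.fromstring(sample)

# XPath counterpart of the CSS selector 'li.waring > a' for this sample.
# Note: [@class="waring"] is an exact attribute match, while the CSS class
# selector would also match elements carrying additional classes.
rows = []
for a in tree.xpath('.//li[@class="waring"]/a'):
    rows.append((a.text, a.get('href'), a.get('title')))

print(rows)
```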

4. Storing the scraped data in a list or dictionary

Using the BeautifulSoup module as an example:

from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
content = soup.select('#alarmtip > ul > li.waring > a')

######### Store in lists
link = []
title = []
tag = []
for n in content:
    link.append(url + n.get('href'))
    title.append(n.get('title'))
    tag.append(n.text)

######## Store in dictionaries
records = []  # collect each dictionary so it is not overwritten on the next iteration
for n in content:
    data = {
        'tag': n.text,
        'link': url + n.get('href'),
        'title': n.get('title')
    }
    records.append(data)
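Once each item is a dictionary, keeping the dictionaries in a list makes the result easy to serialize. The following is a minimal sketch under stated assumptions: the `scraped` tuples are placeholder values standing in for real scraped data, and JSON is just one possible storage format, not one the article prescribes.

```python
import json

# Placeholder (tag, link, title) triples standing in for scraped values.
scraped = [
    ('Tag one', 'http://www.nmc.cn/one.html', 'Title one'),
    ('Tag two', 'http://www.nmc.cn/two.html', 'Title two'),
]

# One dictionary per item, collected in a list.
items = [{'tag': tag, 'link': link, 'title': title}
         for tag, link, title in scraped]

# ensure_ascii=False keeps non-ASCII text (such as Chinese titles) readable.
text = json.dumps(items, ensure_ascii=False, indent=2)

# Round trip: parsing the JSON text gives back the same list of dictionaries.
restored = json.loads(text)
print(text)
```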

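5. Summary

Table 2.1 summarizes the advantages and disadvantages of each scraping approach.

Source code link

References:

《用Python写网络爬虫》 (Web Scraping with Python), Section 2.2: Three Approaches to Scraping a Web Page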
