
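Scraping data from web pages (how to order except clauses: catch the subclass exception before the parent class exception)

Published by 优采云: 2022-01-08 07:08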


from urllib import request, error

try:
    # Request a page that does not exist
    response = request.urlopen('https://cuiqingcai.com/indea.html')
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)

　　

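2) Because URLError is the parent class of HTTPError, catch the subclass exception first and the parent class exception afterwards. If URLError were listed first, it would also match every HTTPError and the more specific handler would never run. The code is written like this:

from urllib import request, error

try:
    response = request.urlopen('https://cuiqingcai.com/indea.html')
# Catch the subclass exception first
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
# Then catch the parent class exception
except error.URLError as e:
    print(e.reason)
# Use else for the normal (no exception) path
else:
    print("Request Successfully")

The example above is a good template for exception handling.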
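The ordering rule can be checked without any network access. The sketch below is a minimal illustration rather than part of the original example: `describe` is a hypothetical helper, and both exception objects are built by hand with made-up URLs and messages instead of coming from a real request.

```python
from urllib import error


def describe(exc):
    """Run an exception through the same except chain as the example above."""
    try:
        raise exc
    except error.HTTPError as e:   # subclass first
        return f"HTTP error {e.code}: {e.reason}"
    except error.URLError as e:    # parent class second
        return f"URL error: {e.reason}"


# Hand-built exceptions (hypothetical values, no request is sent)
http_exc = error.HTTPError("http://example.invalid/", 404, "Not Found", None, None)
url_exc = error.URLError("connection refused")

print(issubclass(error.HTTPError, error.URLError))  # True
print(describe(http_exc))  # HTTP error 404: Not Found
print(describe(url_exc))   # URL error: connection refused
```

Swapping the two except clauses would send the HTTPError into the URLError branch, and the status code would never be reported.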

3) Sometimes the reason attribute is not a string but an object.

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('Time Out')

Setting the timeout to 0.01 seconds forces a timeout exception. The output shows that reason is an instance of the socket.timeout class, so isinstance() can be used to check its type and handle the timeout specifically.
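The same type check can be wrapped in a small predicate and exercised offline. This is only a sketch: `is_timeout` is a hypothetical helper name, and the exceptions are constructed by hand to stand in for what urlopen() might raise.

```python
import socket
from urllib.error import URLError


def is_timeout(exc):
    """Return True if exc is a URLError whose reason is a socket timeout."""
    return isinstance(exc, URLError) and isinstance(exc.reason, socket.timeout)


# Hand-built exceptions standing in for real network failures
timeout_exc = URLError(socket.timeout("timed out"))
other_exc = URLError("name or service not known")

print(is_timeout(timeout_exc))  # True
print(is_timeout(other_exc))    # False
```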

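The parse module

The parse module of the urllib library defines a standard interface for working with URLs, such as extracting and combining the parts of a URL and converting links. It supports URLs with the following schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, prospero, rsync, rtsp, rtspu, sftp, sip, sips, snews, svn, svn+ssh, telnet, wais.

1. urlparse()

The urlparse() function identifies a URL and splits it into its components. Its signature is:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Only urlstring is required; the other parameters are optional.

urlparse() returns a ParseResult object with six parts: scheme (protocol), netloc (domain name), path (access path), params (parameters), query (query string, typically used in GET URLs) and fragment (anchor). A general URL is made up of these six parts, in the form scheme://netloc/path;params?query#fragment.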

1) Basic usage of urlparse() to parse a URL.

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)
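To make the six fields visible one by one, the result can be listed by field name. This sketch relies on ParseResult being a named tuple, so `_asdict()` is available; the formatting of the printed lines is just a choice made here.

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

# ParseResult is a named tuple, so _asdict() lists the six fields by name
for name, value in result._asdict().items():
    print(f"{name:9}= {value!r}")
```

For this URL the fields are scheme 'http', netloc 'www.baidu.com', path '/index.html', params 'user', query 'id=5' and fragment 'comment'.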

　　

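2) If the scheme argument is given and the URL itself contains no scheme, the scheme argument is used as the default.

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='http')
print(result)

Because the URL contains no scheme, the result takes the protocol from the scheme argument. However, without a leading scheme:// (or //) urlparse() cannot recognise the host, so netloc is empty and www.baidu.com ends up in path.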

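3) If the URL contains a scheme and the scheme argument is also given, the scheme in the URL takes precedence.

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

The result shows that the scheme from the URL (http) is used, not the scheme argument.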
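The two scheme cases can be compared side by side. The sketch below uses shortened URLs of my own choosing, and adds one variation not shown in the article: a leading `//`, which lets urlparse() recognise the host even when no scheme is written.

```python
from urllib.parse import urlparse

# No scheme in the URL: the scheme argument is used, but netloc stays empty
no_scheme = urlparse('www.baidu.com/index.html', scheme='http')
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)
# http '' www.baidu.com/index.html

# A leading // marks the start of netloc, so the host is recognised
with_slashes = urlparse('//www.baidu.com/index.html', scheme='http')
print(with_slashes.scheme, with_slashes.netloc, with_slashes.path)
# http www.baidu.com /index.html

# A scheme written in the URL wins over the scheme argument
explicit = urlparse('http://www.baidu.com/index.html', scheme='https')
print(explicit.scheme)  # http
```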

4) If allow_fragments is False, the fragment in the URL is not split off; it is parsed as part of the path, params or query instead.

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

Here the fragment text is parsed as part of query, and fragment is empty. If the URL contains neither params nor query:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)

When the URL contains no params and no query, the fragment text is parsed as part of path, and fragment is empty.
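Both outcomes can be confirmed by reading the individual fields. A minimal sketch using the two URLs discussed above; the variable names are illustrative.

```python
from urllib.parse import urlparse

with_query = urlparse('http://www.baidu.com/index.html;user?id=5#comment',
                      allow_fragments=False)
print(with_query.query)           # id=5#comment
print(repr(with_query.fragment))  # ''

without_query = urlparse('http://www.baidu.com/index.html#comment',
                         allow_fragments=False)
print(without_query.path)            # /index.html#comment
print(repr(without_query.fragment))  # ''
```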

5) The ParseResult returned by urlparse() is actually a tuple, so its values can be read either by index or by attribute name.

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
# Prints: http http www.baidu.com www.baidu.com
print(result.scheme, result[0], result.netloc, result[1])

2. urlunparse()

urlunparse() is the inverse of urlparse() and is used to build a URL. It takes an iterable (a list, a tuple, etc.) whose length must be exactly 6; otherwise an error is raised because there are too few or too many values.

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
# Prints: http://www.baidu.com/index.html;user?a=6#comment
print(urlunparse(data))
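A quick way to see that the two functions are inverses is a round trip, followed by a deliberately short list to trigger the length error. In CPython the length error appears to surface as a ValueError; that exact exception type is an assumption here, since the text only says that an error is raised.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?a=6#comment'

# Parsing and then un-parsing gives back the original URL
round_trip = urlunparse(urlparse(url))
print(round_trip == url)  # True

# Too few components: the call fails instead of guessing the missing parts
try:
    urlunparse(['http', 'www.baidu.com', 'index.html'])
    failed = False
except ValueError as exc:
    failed = True
    print('rejected:', exc)
```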

3. urlsplit()

urlsplit() is similar to urlparse(), except that it does not split out params separately; params stays part of path.

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
# Prints: SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
print(result)

It returns a SplitResult object, which is also a tuple, so values can be read either by attribute or by index.

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
# Prints: http
print(result.scheme)
# Prints: http
print(result[0])

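4. urlunsplit()

urlunsplit() is the inverse of urlsplit() and is used to build a URL. It takes an iterable (a list, a tuple, etc.) whose length must be exactly 5; otherwise an error is raised because there are too few or too many values.

from urllib.parse import urlunsplit

data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
# Prints: http://www.baidu.com/index.html?a=6#comment
print(urlunsplit(data))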
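The difference between the two families (six parts versus five) is easiest to see on one URL. A small comparison sketch; the variable names are illustrative.

```python
from urllib.parse import urlparse, urlsplit, urlunsplit

url = 'http://www.baidu.com/index.html;user?id=5#comment'

parsed = urlparse(url)
split = urlsplit(url)

print(len(parsed), parsed.path, parsed.params)  # 6 /index.html user
print(len(split), split.path)                   # 5 /index.html;user

# urlunsplit() reverses urlsplit()
print(urlunsplit(split) == url)  # True
```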

5. urljoin()

Besides urlunparse() and urlunsplit(), a URL can also be built with urljoin(). It takes a base_url (base URL) as the first argument and a new URL as the second. The function analyses the scheme, netloc and path of base_url, uses them to fill in whatever is missing from the new URL, and returns the completed URL.

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

base_url provides three things: scheme, netloc and path. If any of them is missing from the new URL, it is filled in from base_url; if the new URL already has it, the value in the new URL is kept. In the examples above, the query and fragment of base_url do not carry over into the result.
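Three representative cases, with their results, are sketched below. The third base URL (with `?x=1#top`) is a made-up variation added here to show the base's query and fragment being dropped.

```python
from urllib.parse import urljoin

# Relative URL: scheme and netloc come from the base
relative = urljoin('http://www.baidu.com', 'FAQ.html')
print(relative)  # http://www.baidu.com/FAQ.html

# Absolute URL: nothing is taken from the base
absolute = urljoin('http://www.baidu.com/about.html',
                   'https://cuiqingcai.com/FAQ.html?question=2')
print(absolute)  # https://cuiqingcai.com/FAQ.html?question=2

# The base's query and fragment are dropped when the new URL has its own path
dropped = urljoin('http://www.baidu.com/about.html?x=1#top', 'FAQ.html')
print(dropped)  # http://www.baidu.com/FAQ.html
```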

6. urlencode()

urlencode() is used to build request parameters.

1) Building the parameters of a GET request.

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
# Prints: http://www.baidu.com?name=germey&age=22
print(url)

2) Building the parameters of a POST request.

import urllib.request
from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
# urlopen() expects the request body as bytes
data = bytes(urlencode(params), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
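The GET and POST cases differ only in where the encoded string goes. The sketch below builds both as urllib.request.Request objects without sending them, so it runs offline; the `http://httpbin.org/get` URL is an assumption added for symmetry with the POST example.

```python
import urllib.request
from urllib.parse import urlencode

params = {'name': 'germey', 'age': 22}
query = urlencode(params)
print(query)  # name=germey&age=22

# GET: the encoded parameters go into the URL (assumed endpoint, not sent)
get_req = urllib.request.Request('http://httpbin.org/get?' + query)
print(get_req.get_method())  # GET

# POST: the encoded parameters become a bytes body (not sent)
post_req = urllib.request.Request('http://httpbin.org/post',
                                  data=query.encode('utf-8'))
print(post_req.get_method())  # POST
```

Passing either object to urllib.request.urlopen() would perform the actual request.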

7. parse_qs()

parse_qs() converts a query string back into a dictionary. Each value in the dictionary is a list of strings.

from urllib.parse import parse_qs

query = 'name=germey&age=22'
# Prints: {'name': ['germey'], 'age': ['22']}
print(parse_qs(query))

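8. parse_qsl()

parse_qsl() is similar to parse_qs(), but converts the query string into a list of tuples.

from urllib.parse import parse_qsl

query = 'name=germey&age=22'
# Prints: [('name', 'germey'), ('age', '22')]
print(parse_qsl(query))

The result is a list in which every element is a tuple: the first item is the parameter name and the second is the parameter value.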
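The two parsers can be compared on a string produced by urlencode(). This is a small sketch; the `tag=a&tag=b` query is a made-up input added to show why parse_qs() returns lists.

```python
from urllib.parse import urlencode, parse_qs, parse_qsl

query = urlencode({'name': 'germey', 'age': 22})

as_lists = parse_qs(query)   # each value is a list of strings
as_pairs = parse_qsl(query)  # list of (name, value) tuples
as_dict = dict(as_pairs)     # plain dict; the last value wins per name

print(as_lists)  # {'name': ['germey'], 'age': ['22']}
print(as_dict)   # {'name': 'germey', 'age': '22'}

# Repeated names are why parse_qs() returns lists
repeated = parse_qs('tag=a&tag=b')
print(repeated)  # {'tag': ['a', 'b']}
```

Note that the integer 22 comes back as the string '22'; type information is not preserved by the encoding.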

9. quote()

quote() converts text into URL-encoded (percent-encoded) form. Chinese characters in URL parameters can end up garbled; in that case, use quote() to convert them to URL encoding.

from urllib.parse import quote

keyword = '你好'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
# Prints: https://www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD
print(url)

10. unquote()

unquote() is the counterpart of quote() and decodes a URL-encoded string.

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD'
# Prints: https://www.baidu.com/s?wd=你好
print(unquote(url))
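A round trip shows that the two functions undo each other. The sketch also illustrates quote()'s `safe` parameter, which the text above does not cover: by default `/` is left unencoded, and `safe=''` encodes it too. The `'a b/c'` input is a made-up example.

```python
from urllib.parse import quote, unquote

keyword = '你好'
encoded = quote(keyword)
print(encoded)                      # %E4%BD%A0%E5%A5%BD
print(unquote(encoded) == keyword)  # True

# '/' is left alone by default; pass safe='' to encode it as well
default_safe = quote('a b/c')
no_safe = quote('a b/c', safe='')
print(default_safe)  # a%20b/c
print(no_safe)       # a%20b%2Fc
```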

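The robotparser module

The robotparser module of the urllib library parses a website's Robots protocol rules.

1. The Robots protocol

The Robots protocol, also called the crawler protocol, is formally named the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be crawled and which may not. It usually takes the form of a text file named robots.txt placed in the root directory of the website. When a search crawler visits a site, it first checks whether a robots.txt file exists in the site root. If it does, the crawler crawls within the scope defined there; if the file is not found, the crawler visits every page it can reach directly.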

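1) robots.txt example 1: no crawler may crawl any page, except that the public directory may be crawled.

User-agent: *
Disallow: /
Allow: /public/

2) robots.txt example 2: crawlers are forbidden from crawling any directory.

User-agent: *
Disallow: /

3) robots.txt example 3: crawlers are allowed to crawl any page.

User-agent: *
Disallow:

4) robots.txt example 4: only the crawler named WebCrawler may crawl any page; all other crawlers may not crawl anything.

User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /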
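Rules like these can be tried out offline by handing the text directly to Python's RobotFileParser. The sketch below is an illustration under stated assumptions: `build_parser` is a hypothetical helper, the example.com URLs and the OtherBot name are made up, and in the second rule set the Allow line is deliberately placed before the Disallow line.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Example 4: only WebCrawler may crawl
only_webcrawler = build_parser(
    "User-agent: WebCrawler\n"
    "Disallow:\n"
    "\n"
    "User-agent: *\n"
    "Disallow: /\n"
)
print(only_webcrawler.can_fetch('WebCrawler', 'http://example.com/a.html'))  # True
print(only_webcrawler.can_fetch('OtherBot', 'http://example.com/a.html'))    # False

# Example 1, with Allow listed first (see the note below)
public_only = build_parser(
    "User-agent: *\n"
    "Allow: /public/\n"
    "Disallow: /\n"
)
print(public_only.can_fetch('*', 'http://example.com/public/a.html'))   # True
print(public_only.can_fetch('*', 'http://example.com/private/a.html'))  # False
```

The reordering matters for this particular parser: RobotFileParser checks the rules of a group in file order and uses the first one that matches, so with Disallow: / written first it would reject /public/ as well. Other crawlers may resolve such conflicts differently, for example by preferring the most specific path.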

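2. Crawler names

Crawlers have names. Some common search crawlers and the websites they belong to:

Crawler name    Website
BaiduSpider     Baidu
Googlebot       Google
360Spider       360 Search
YodaoBot        Youdao
ia_archiver     Alexa
Scooter         AltaVista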

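3. Using the robotparser module

The robotparser module provides the RobotFileParser class, which uses a site's robots.txt file to decide whether a crawler is allowed to fetch a given page. The class is simple to use: pass the URL of robots.txt to the constructor.

urllib.robotparser.RobotFileParser(url='')

The url argument is optional. If it is not passed to the constructor, it can be set later with set_url(). The examples below use the class's common methods: set_url(), read(), parse() and can_fetch().

1) Check whether certain Jianshu pages may be crawled.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Set the URL of the robots.txt file
rp.set_url('https://www.jianshu.com/robots.txt')
# Fetch and parse the file
rp.read()
# Check whether each page may be crawled
print(rp.can_fetch('*', 'https://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'https://www.jianshu.com/search?q=python&page=1&type=collections'))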

　　

2) Use the parse() method to do the reading and parsing.

from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

rp = RobotFileParser()
# Open Baidu's robots.txt directly with urlopen() and hand its lines to RobotFileParser
rp.parse(urlopen('https://www.baidu.com/robots.txt').read().decode('utf-8').split('\n'))
# Check whether each page may be crawled
print(rp.can_fetch('*', 'https://www.baidu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'https://www.baidu.com/search?q=python&page=1&type=collections'))

Sometimes can_fetch() returns False, or the crawler gets a 403 (Forbidden) response when it runs. Setting User-Agent and Host in the request headers when sending the request may make the check return True.
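One way to apply that advice is to download robots.txt yourself with an explicit User-Agent header and pass the text to parse(). This matters because, in CPython, read() appears to treat a 401 or 403 response for robots.txt itself as "disallow everything". The sketch below is an assumption-laden illustration: `parser_from_text` and `fetch_robots` are hypothetical helpers, the rules and example.com URLs are made up, and `fetch_robots` is defined but not called because it needs network access.

```python
import urllib.request
from urllib.robotparser import RobotFileParser


def parser_from_text(robots_txt):
    """Build a RobotFileParser from robots.txt content already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_robots(robots_url, user_agent):
    """Download robots.txt with an explicit User-Agent header, then parse it.

    Not called below because it needs network access.
    """
    request = urllib.request.Request(robots_url,
                                     headers={'User-Agent': user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        text = response.read().decode('utf-8', errors='replace')
    return parser_from_text(text)


# Offline demonstration with made-up rules
demo = parser_from_text("User-agent: *\nDisallow: /search\n")
print(demo.can_fetch('*', 'https://example.com/p/123'))            # True
print(demo.can_fetch('*', 'https://example.com/search?q=python'))  # False
```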
