Article Scraping Calls (Chapter 1: A First Look at Web Scrapers; 1.1 Network Connections; 1.2 Introduction to BeautifulSoup)

Published by 优采云 on 2022-03-21 01:06


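Chapter 1: Your First Web Scraper

1.1 Network Connections

This section explains the basic principle by which a browser retrieves information, then gives an example of fetching the source code of a web page with Python.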

    # Import the urlopen function from the request module of the urllib library
    from urllib.request import urlopen

    # Open the target page with urlopen and assign the response to html
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')

    # Read the contents of html and print them
    print(html.read())
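One detail worth noting about the example above: `html.read()` returns `bytes`, so `print` shows a `b'...'` literal rather than readable text. The sketch below decodes a response into a string. To stay runnable offline it fetches a `data:` URL built from a made-up page (an assumption for illustration, not part of the original example); with a real `http://` address the decoding step would look the same.

```python
from urllib.parse import quote
from urllib.request import urlopen

# A made-up page, embedded in a data: URL so that no network access is needed.
page = '<html><body><h1>An Interesting Title</h1></body></html>'
url = 'data:text/html;charset=utf-8,' + quote(page)

with urlopen(url) as response:
    raw = response.read()  # bytes, not str
    # Use the charset declared in the Content-Type header, falling back to UTF-8.
    charset = response.headers.get_content_charset() or 'utf-8'

text = raw.decode(charset)  # now a regular Python string
print(type(raw).__name__, type(text).__name__)
print(text)
```

Decoding explicitly also makes it obvious when a page declares an unexpected encoding, which is a common source of garbled output.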

1.2 Introduction to BeautifulSoup

BeautifulSoup formats and organizes messy web content by locating HTML tags, and presents the underlying XML structure through easy-to-use Python objects.

1.2.1 Installing BeautifulSoup

I am on Windows 10, so I simply entered the following in PowerShell:

    pip install bs4

That is all it takes.

1.2.2 Running BeautifulSoup

This example fetches the same page as the first one, but this time it uses BeautifulSoup to process the result.

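    # Import the urlopen function from the request module of the urllib library
    from urllib.request import urlopen
    # Import the BeautifulSoup class from the bs4 library (note the capitalization)
    from bs4 import BeautifulSoup

    # Open the target page with urlopen and assign the response to html
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
    # Parse the contents of html with BeautifulSoup and assign the result to bsObj;
    # naming the parser explicitly avoids the "no parser was explicitly specified" warning
    bsObj = BeautifulSoup(html.read(), 'html.parser')
    # Print the h1 tag of bsObj
    print(bsObj.h1)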

The main point is that BeautifulSoup can extract specific pieces of information from a web page.
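To make this concrete without a network connection, here is a small sketch that parses a hand-written HTML string. The markup and variable names are invented for illustration; `find`, `find_all`, and `get_text` are standard BeautifulSoup methods.

```python
from bs4 import BeautifulSoup

# Hand-written markup standing in for a downloaded page (illustrative only).
html = """
<html>
  <head><title>A Useful Page</title></head>
  <body>
    <h1>An Interesting Title</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Attribute access returns the first matching tag anywhere in the document.
heading = soup.h1.get_text()
title = soup.title.get_text()

# find_all returns every matching tag; find returns only the first one.
paragraphs = [p.get_text() for p in soup.find_all('p')]
intro = soup.find('p', class_='intro').get_text()

print(title)
print(heading)
print(paragraphs)
print(intro)
```

Attribute access such as `soup.h1` is convenient for a quick look, while `find` and `find_all` allow filtering by attributes such as `class`.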

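1.2.3 Reliable Network Connections

The gist of this section is to anticipate the unreliable conditions a scraper may run into and to guard against them.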

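First, consider this line:

    html = urlopen('http://www.pythonscraping.com/pages/page1.html')

Two main things can go wrong on this line: the page does not exist on the server (or the server fails while retrieving it), or the server itself does not exist.

In the first case, urlopen raises an HTTPError, for example "404 Page Not Found" or "500 Internal Server Error". We can handle it as follows: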

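    from urllib.request import urlopen
    from urllib.error import HTTPError

    # Try to run this line
    try:
        html = urlopen('http://www.pythonscraping.com/pages/page1.html')
    # If an HTTPError is raised
    except HTTPError as e:
        # Print the error
        print(e)
        # Inside a function you would return None here; otherwise stop the program or fall back to another plan
    # Otherwise
    else:
        # The program continues. Note: this else block does not run if the error above was raised.
        pass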

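If the server does not exist at all, meaning the domain cannot be opened, urlopen does not return None; it raises a URLError. HTTPError is a subclass of URLError, so a handler for the more specific HTTPError should come before one for URLError. We can catch the exception like this (urlopen is imported as in the earlier examples):

    from urllib.error import URLError

    try:
        html = urlopen('http://www.pythonscraping.com/pages/page1.html')
    except URLError as e:
        print('The server could not be found')
    else:
        pass  # The program continues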
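As a hedged sketch of how both failure modes might be handled in one place: the helper below (the name `safe_open` and its `opener` parameter are hypothetical, not from the original text) catches `HTTPError` first and `URLError` second. The order matters because `HTTPError` is a subclass of `URLError`, and `urlopen` raises `URLError` when no server can be reached. The `opener` parameter exists only so the behavior can be exercised without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def safe_open(url, opener=urlopen):
    """Return (response, None) on success, or (None, message) on failure."""
    try:
        return opener(url), None
    except HTTPError as e:
        # A server answered, but with an error status such as 404 or 500.
        return None, 'HTTP error {}: {}'.format(e.code, e.reason)
    except URLError as e:
        # No server answered at all (unknown domain, refused connection, ...).
        return None, 'Server not reached: {}'.format(e.reason)


# Stand-in openers that simulate each failure without touching the network.
def fake_missing_page(url):
    raise HTTPError(url, 404, 'Not Found', None, None)


def fake_missing_server(url):
    raise URLError('name or service not known')


print(safe_open('http://example.com/nope', opener=fake_missing_page))
print(safe_open('http://no-such-host.invalid/', opener=fake_missing_server))
```

In real use the function would be called as `safe_open(url)` with the default opener; returning a message alongside `None` lets the caller decide whether to log, retry, or give up.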

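Accessing a tag that does not exist on a BeautifulSoup object returns None. If you then try to access a child tag on that None, Python raises an AttributeError.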

    try:
        badContent = bsObj.nonExistingTag.anotherTag
    except AttributeError as e:
        print('Tag was not found')
    else:
        if badContent is None:
            print('Tag was not found')
        else:
            print(badContent)
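The same behavior can be reproduced offline. The sketch below parses a tiny hand-written page and shows both outcomes; the helper `find_nested` is a hypothetical convenience function, not part of BeautifulSoup, that checks for None at each step instead of relying on the exception.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used here only for illustration.
soup = BeautifulSoup('<html><body><h1>Title</h1></body></html>', 'html.parser')


def find_nested(root, *names):
    """Follow a chain of tag names, returning None as soon as one is missing."""
    node = root
    for name in names:
        if node is None:
            return None
        node = node.find(name)
    return node


# A missing tag yields None rather than raising an error...
print(soup.nonExistingTag)

# ...but going one level deeper on that None raises AttributeError.
try:
    soup.nonExistingTag.anotherTag
except AttributeError:
    print('Tag was not found')

# The helper checks for None at every step, so no exception handling is needed.
print(find_nested(soup, 'body', 'h1'))
print(find_nested(soup, 'nonExistingTag', 'anotherTag'))
```

Whether to catch the exception or check for None explicitly is largely a matter of taste; the explicit check makes it easier to see which level of the chain was missing.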

Combining the code above into a single function makes it easier to read:

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup


    def getTitle(url):
        try:
            html = urlopen(url)
        except HTTPError as e:
            # The server returned an error status
            return None
        try:
            bsObj = BeautifulSoup(html.read(), 'html.parser')
            title = bsObj.body.h1
        except AttributeError as e:
            # The page has no body tag
            return None
        return title


    title = getTitle('http://www.pythonscraping.com/pages/page1.html')
    if title is None:
        print('Title could not be found')
    else:
        print(title)
