Extracting the links in a web page with regular expressions (Windows 10 64-bit, Python 2.7.10)

Published by 优采云: 2021-11-23 17:18

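Use a regular expression to collect all the URL links in a web page, then download the source of each linked page.

Operating system: Windows 10, 64-bit

Python version: Python 2.7.10

Python IDE: PyCharm 2016 04

urllib version: urllib2

Note: I am using Python 2 here, not Python 3.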

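1. Introduction

Over the previous two sections (crawling web pages, and fixing garbled text in the downloaded pages) we arrived at the final version of the download() function.

In earlier sections we crawled every page of the target site by parsing the URLs in its sitemap, and introduced a method for crawling all of the pages linked from a site. In this section we use a regular expression to collect all the URL links in a web page and download the source of each of them.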

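2. Overview

So far we have implemented two simple crawlers that exploit the structure of the target site. Whenever those two techniques are available they should be used, because they keep the number of downloaded pages to a minimum. For some sites, however, the crawler has to behave more like an ordinary user: follow links to reach the content of interest.

By following every link we can easily download an entire site, but that also downloads many pages we do not need. For example, if we want to scrape the user account detail pages of an online forum, we only need the account pages, not the discussion thread pages. The link crawler in this post uses a regular expression to decide which pages to download.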
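The filtering idea can be sketched on its own before looking at the crawler. The forum paths and the `/user/` pattern below are invented for illustration and do not come from any real site.

```python
import re

# Hypothetical links found on a forum page (invented for this example)
links = ['/user/alice', '/thread/42', '/user/bob', '/thread/43?page=2']

# Hypothetical pattern: only account pages are of interest
account_regex = r'/user/'

# re.match() anchors at the start of the string, so only links that
# begin with /user/ are kept
wanted = [link for link in links if re.match(account_regex, link)]
print(wanted)
```

Because `re.match()` only matches at the beginning of the string, a pattern written this way selects links by their leading path segment.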

3. Main code

import re

# download() is the function completed in the previous sections

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex
    """
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                crawl_queue.append(link)

def get_links(html):
    """Return a list of links from html
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)

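4. Explaining the initial code

1.

def link_crawler(seed_url, link_regex):

This is the function meant to be called from outside. It first downloads the source of the seed_url page and extracts every link URL in it. Each extracted link is then tested against link_regex; if it matches, the link is added to the queue, and the next pass of the `while crawl_queue:` loop applies the same steps to that link. This repeats until crawl_queue is empty, at which point the function returns.

2.

get_links(html) returns all the link URLs found in the html page.

3.

webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

This compiles the pattern and stores it in the webpage_regex object. It matches strings of the form <a href="xxx"> and extracts the xxx part, which is the URL.

4.

return webpage_regex.findall(html)

This applies the webpage_regex pattern to the html source, finds every string that fits the format, and returns the xxx part of each.

For more detail on regular expressions, see this website: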
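To make the behaviour of the pattern concrete, here is a small self-contained check of get_links() on a made-up HTML fragment. The fragment is an assumption for illustration only.

```python
import re

def get_links(html):
    """Return the href value of every <a> tag found in html."""
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

# Made-up HTML fragment, not taken from the target site
sample_html = """
<ul>
  <li><a href="/index/1">Next</a></li>
  <li><A class="c" HREF='/view/Zambia-251'>Zambia</A></li>
  <li><img src="/static/flag.png"></li>
</ul>
"""

links = get_links(sample_html)
print(links)
```

The re.IGNORECASE flag is what lets the upper-case `<A ... HREF=` tag match, and the `<img>` tag is skipped because the pattern requires the tag to start with `<a`. A regular expression like this is a quick approximation and can be fooled by unusual markup.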

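5. Running the code

First start an interactive Python session by running the following command in PyCharm's terminal window or in a Windows command prompt:

C:\Python27\python.exe -i 1-4-4-regular_expression.py

Call the link_crawler() function:

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output: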

Downloading: http://example.webscraping.com
Downloading: /index/1
Traceback (most recent call last):
  File "1-4-4-regular_expression.py", line 50, in <module>
    link_crawler('http://example.webscraping.com', '/(index|view)')
  File "1-4-4-regular_expression.py", line 36, in link_crawler
    html = download(url)
  File "1-4-4-regular_expression.py", line 13, in download
    html = urllib2.urlopen(request).read()
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 423, in open
    protocol = req.get_type()
  File "C:\Python27\lib\urllib2.py", line 285, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: /index/1

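The run fails with an error while downloading the URL /index/1. /index/1 is a relative link on the target site: it is only the path part of the full page URL and contains neither the protocol nor the server. The download() function cannot fetch it. In a browser a relative link works because the browser knows which page it was found on, but urllib2 has no such context, so the download fails.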
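One way to see what is missing is to split both forms of the URL into their components with the standard library. This is only a sketch; the try/except import is an assumption added so the snippet also runs on Python 3, where the module is named urllib.parse.

```python
try:
    from urlparse import urlparse        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urlparse    # Python 3 equivalent

for url in ['http://example.webscraping.com/index/1', '/index/1']:
    parts = urlparse(url)
    print('%r -> scheme=%r, host=%r, path=%r'
          % (url, parts.scheme, parts.netloc, parts.path))

# The relative link has an empty scheme and an empty host
relative = urlparse('/index/1')
```

With an empty scheme, urllib2 has no way to tell which protocol handler to use, which is what the "unknown url type" error reports.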

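7. Improving the code

For urllib2 to be able to locate the page, we need to convert relative links into absolute links, which solves the problem.

Python has a module that does this: urlparse.

The link_crawler() function is improved as follows: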

import urlparse

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex
    """
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urlparse.urljoin(seed_url, link)
                crawl_queue.append(link)
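The effect of urljoin() can be checked in isolation. The snippet below is a minimal sketch; the import fallback is an assumption so that it also runs on Python 3, and the page-relative link `'2'` is a made-up case that does not occur in the crawl above.

```python
try:
    from urlparse import urljoin        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urljoin    # Python 3 equivalent

seed_url = 'http://example.webscraping.com'

# A root-relative link is resolved against the scheme and host of the base
absolute = urljoin(seed_url, '/index/1')

# A link that is already absolute is returned unchanged
unchanged = urljoin(seed_url, 'http://example.webscraping.com/view/Zambia-251')

# Hypothetical page-relative link: the result depends on the base page
page_relative = urljoin('http://example.webscraping.com/index/1', '2')

print(absolute)
print(unchanged)
print(page_relative)
```

Joining against seed_url is enough here because the pattern '/(index|view)' only accepts links that start with a slash. If a site used page-relative links, joining against the URL of the page being processed would probably be the safer choice.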

8. Run:

Run the program:

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output:

Downloading: http://example.webscraping.com
Downloading: http://example.webscraping.com/index/1
Downloading: http://example.webscraping.com/index/2
Downloading: http://example.webscraping.com/index/3
Downloading: http://example.webscraping.com/index/4
Downloading: http://example.webscraping.com/index/5
Downloading: http://example.webscraping.com/index/6
Downloading: http://example.webscraping.com/index/7
Downloading: http://example.webscraping.com/index/8
Downloading: http://example.webscraping.com/index/9
Downloading: http://example.webscraping.com/index/10
Downloading: http://example.webscraping.com/index/11
Downloading: http://example.webscraping.com/index/12
Downloading: http://example.webscraping.com/index/13
Downloading: http://example.webscraping.com/index/14
Downloading: http://example.webscraping.com/index/15
Downloading: http://example.webscraping.com/index/16
Downloading: http://example.webscraping.com/index/17
Downloading: http://example.webscraping.com/index/18
Downloading: http://example.webscraping.com/index/19
Downloading: http://example.webscraping.com/index/20
Downloading: http://example.webscraping.com/index/21
Downloading: http://example.webscraping.com/index/22
Downloading: http://example.webscraping.com/index/23
Downloading: http://example.webscraping.com/index/24
Downloading: http://example.webscraping.com/index/25
Downloading: http://example.webscraping.com/index/24
Downloading: http://example.webscraping.com/index/25
Downloading: http://example.webscraping.com/index/24
Downloading: http://example.webscraping.com/index/25
Downloading: http://example.webscraping.com/index/24

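The output shows that pages now download without errors, but the same pages are downloaded over and over. The reason is that these pages link to one another: whenever two pages link to each other, this program keeps bouncing between them in an infinite loop.

So the program needs one more improvement: it must avoid crawling the same link twice. To do that we record which links have already been seen and skip any link that is already recorded.

9. Further improving the link_crawler() function: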

def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # form absolute link
                link = urlparse.urljoin(seed_url, link)
                # check if we have already seen this link
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
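The effect of the seen set can be tried without touching the network. In the sketch below download() is replaced by a stand-in that serves canned HTML from a dictionary; the three-page site, the example.test host, and the downloaded list are all assumptions made for this illustration.

```python
import re
try:
    from urlparse import urljoin        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urljoin    # Python 3 equivalent

# Hypothetical site: /index/1 and /index/2 link to each other
FAKE_SITE = {
    'http://example.test': '<a href="/index/1">1</a>',
    'http://example.test/index/1':
        '<a href="/index/2">2</a> <a href="/about">about</a>',
    'http://example.test/index/2': '<a href="/index/1">back</a>',
}
downloaded = []

def download(url):
    """Stand-in for the real download(): record the URL, return canned HTML."""
    downloaded.append(url)
    return FAKE_SITE.get(url, '')

def get_links(html):
    return re.findall('<a[^>]+href=["\'](.*?)["\']', html, re.IGNORECASE)

def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)
                # only queue links that have not been seen yet
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)

link_crawler('http://example.test', '/index')
print(downloaded)
```

Even though the two index pages link to each other, each URL is downloaded once and the loop ends when the queue is empty. The /about link is never followed because it does not match the pattern.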

10. Run:

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output:

Downloading: http://example.webscraping.com
Downloading: http://example.webscraping.com/index/1
Downloading: http://example.webscraping.com/index/2
Downloading: http://example.webscraping.com/index/3
Downloading: http://example.webscraping.com/index/4
Downloading: http://example.webscraping.com/index/5
Downloading: http://example.webscraping.com/index/6
Downloading: http://example.webscraping.com/index/7
Downloading: http://example.webscraping.com/index/8
Downloading: http://example.webscraping.com/index/9
Downloading: http://example.webscraping.com/index/10
Downloading: http://example.webscraping.com/index/11
Downloading: http://example.webscraping.com/index/12
Downloading: http://example.webscraping.com/index/13
Downloading: http://example.webscraping.com/index/14
Downloading: http://example.webscraping.com/index/15
Downloading: http://example.webscraping.com/index/16
Downloading: http://example.webscraping.com/index/17
Downloading: http://example.webscraping.com/index/18
Downloading: http://example.webscraping.com/index/19
Downloading: http://example.webscraping.com/index/20
Downloading: http://example.webscraping.com/index/21
Downloading: http://example.webscraping.com/index/22
Downloading: http://example.webscraping.com/index/23
Downloading: http://example.webscraping.com/index/24
Downloading: http://example.webscraping.com/index/25
Downloading: http://example.webscraping.com/view/Zimbabwe-252
Downloading: http://example.webscraping.com/view/Zambia-251
Downloading: http://example.webscraping.com/view/Yemen-250
Downloading: http://example.webscraping.com/view/Western-Sahara-249

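The program now works as intended: it crawls all of the index and view pages and stops when expected. We finally have a usable crawler.

Summary:

We have now walked through three versions of the code for crawling all the link URLs in a site or page. These are still only basic programs. Going forward we may run into the following problems:

1. Some sites declare URLs that must not be crawled. To respect the site's rules, the crawler has to be designed around its robots.txt file.

2. Google is not reachable from China, so if we want to reach Google through a proxy, the crawler needs proxy support.

3. If the crawler hits a site too fast, it may get blocked by the target server, so the download rate needs to be limited.

4. Some pages contain things like a calendar in which every date is a URL link. Crawling these is pointless, and since dates never end, this is a crawler trap that we need to avoid falling into.

We need to solve these four problems to arrive at the final version of the crawler.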
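As a rough preview of problems 1 and 4, the sketch below checks URLs against robots.txt rules and caps how deep a crawl may go. It is not the final crawler: the robots.txt content, the user agent names, the endless calendar site, and the helper names allowed(), next_links() and crawl() are all assumptions made for this illustration.

```python
try:
    import robotparser                        # Python 2, as used in this post
except ImportError:
    import urllib.robotparser as robotparser  # Python 3 equivalent

# Made-up robots.txt content; a real crawler would fetch it from the site
ROBOTS_TXT = """
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def allowed(user_agent, url):
    """Return True if robots.txt permits user_agent to fetch url."""
    return rp.can_fetch(user_agent, url)

def next_links(url):
    """Hypothetical endless calendar: every page links to the next day."""
    day = int(url.rsplit('/', 1)[1])
    return ['http://example.test/calendar/%d' % (day + 1)]

def crawl(seed_url, max_depth):
    """Visit pages from seed_url, following links at most max_depth levels."""
    seen = {seed_url: 0}      # URL -> depth at which it was first found
    queue = [seed_url]
    visited = []
    while queue:
        url = queue.pop()
        visited.append(url)
        depth = seen[url]
        if depth == max_depth:
            continue          # do not follow links any deeper
        for link in next_links(url):
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return visited

print(allowed('BadCrawler', 'http://example.webscraping.com/index/1'))
print(allowed('GoodCrawler', 'http://example.webscraping.com/index/1'))
print(crawl('http://example.test/calendar/1', 3))
```

Turning the seen set into a dictionary that maps each URL to its depth is one simple way to stop following an endless chain of links. Proxy support and rate limiting would need to be handled inside download() and are not shown here.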
