Web page phone number scraping program (three ways to download a web page with Python urllib.request)

优采云 Published: 2021-10-21 14:05

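1 What is a web crawler

A web crawler (also called a web spider, web robot, web chaser, automatic indexer, or emulator) is a program or script that automatically fetches information from the Internet according to a set of rules and extracts the valuable parts. Note: a crawler retrieves web pages automatically, downloading them from the World Wide Web for use by search engines, and is an important component of a search engine.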

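A crawler design has to address three things:

(1) a description or definition of the crawl target;

(2) analysis and filtering of web pages or data;

(3) a URL search strategy.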
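To make point (3) concrete, here is a minimal sketch of two common URL search strategies run over a small in-memory link graph instead of real pages. The graph, the page names, and the function name crawl_order are made up for illustration; a real crawler would discover links by downloading and parsing pages.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def crawl_order(start, links, breadth_first=True):
    """Return the order in which pages would be visited from `start`."""
    frontier = deque([start])
    seen = {start}  # never queue the same page twice
    order = []
    while frontier:
        # Taking from the front (FIFO) gives a breadth-first crawl;
        # taking from the back (LIFO) follows the newest links first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order


print(crawl_order("A", LINKS))                       # breadth-first
print(crawl_order("A", LINKS, breadth_first=False))  # stack-based, depth-first style
```

Breadth-first order tends to reach pages close to the start URL first, which is why it is a common default; the stack-based variant dives along the most recently discovered links.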

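2 Python crawler architecture

A Python crawler consists of five main parts: the scheduler, the URL manager, the page downloader, the page parser, and the application (the valuable data that was scraped).

[Figure: how the scheduler coordinates the other components]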
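The way these parts fit together can be sketched in a few lines. This is only an illustration under stated assumptions: the names UrlManager, parse, and run, the naive regular expressions, and the two fake pages are all hypothetical, and the downloader is passed in as a function so that the sketch runs without network access.

```python
import re
from collections import deque


class UrlManager:
    """URL manager: tracks URLs waiting to be crawled and URLs already seen."""

    def __init__(self):
        self.pending = deque()
        self.seen = set()

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def has_pending(self):
        return bool(self.pending)

    def pop(self):
        return self.pending.popleft()


def parse(html):
    """Page parser: return (links, title) using deliberately naive regexes."""
    links = re.findall(r'href="([^"]+)"', html)
    match = re.search(r"<title>(.*?)</title>", html)
    return links, (match.group(1) if match else None)


def run(start_url, download, max_pages=10):
    """Scheduler: drives the URL manager, the downloader, and the parser."""
    urls = UrlManager()
    urls.add(start_url)
    results = {}  # the "application": the data we decided to keep
    while urls.has_pending() and len(results) < max_pages:
        url = urls.pop()
        html = download(url)  # downloader: returns HTML text or None
        if html is None:
            continue
        links, title = parse(html)
        results[url] = title
        for link in links:
            urls.add(link)
    return results


# Fake pages standing in for the network (hypothetical content).
PAGES = {
    "http://example.com/": '<title>Home</title><a href="http://example.com/a">a</a>',
    "http://example.com/a": '<title>Page A</title><a href="http://example.com/">home</a>',
}

print(run("http://example.com/", PAGES.get))
```

Swapping PAGES.get for a function built on urllib.request.urlopen would turn the same loop into a real crawler.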

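3 Three ways to download a web page with urllib.request

Method 1: call urllib.request.urlopen(url) to issue the most basic request (that is, to open the URL)

The function prototype is: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None). Note that cafile, capath, and cadefault are deprecated in favor of context and were removed in Python 3.13.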
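As a small illustration of the prototype, the sketch below wraps urlopen in a helper that sets a timeout and handles failures. The helper name fetch and the 10-second default are assumptions made for this example, not part of urllib.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return (status_code, body_bytes), or None if the request fails."""
    try:
        # The response object is a context manager, so it is closed for us.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.getcode(), response.read()
    except OSError as exc:
        # urllib.error.URLError, HTTPError, and socket timeouts are all
        # subclasses of OSError, so one handler covers them.
        print("request failed:", exc)
        return None


# Example call (needs network access):
# result = fetch("http://www.baidu.com")
# if result is not None:
#     status, body = result
#     print(status, len(body))
```

Without a timeout, urlopen falls back to the global socket default, which normally means it can wait indefinitely on an unresponsive server.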

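Method 2: build a request with request = urllib.request.Request(url), then pass it to urllib.request.urlopen(request)

request = urllib.request.Request(url) creates a request object for the target URL; it can also carry data, headers, and a method.

urllib.request.urlopen(request) takes that request object (built in the previous step) as its argument and returns the response.

Tips: To build a complete request, for example one that adds headers, use the more powerful Request class. A Request object lets you pass such extra information along with the request, which calling urlopen with a bare URL string does not.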
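To show what a Request can carry, the following sketch builds one with a header, optional form data, and an explicit method, then inspects it without sending anything. The helper name build_request and the form field q are hypothetical.

```python
import urllib.parse
import urllib.request


def build_request(url, params=None, user_agent="Mozilla/5.0"):
    """Build (but do not send) a Request with headers and optional form data."""
    data = None
    if params is not None:
        # Form fields must be URL-encoded and converted to bytes.
        data = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"User-Agent": user_agent},
        method="POST" if data is not None else "GET",
    )


req = build_request("http://www.baidu.com", params={"q": "python"})
# Request stores header names capitalized, hence "User-agent" here.
print(req.get_method(), req.full_url, req.get_header("User-agent"), req.data)

# Sending it would be: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it also makes the request easy to log or test before any traffic leaves the machine.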

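Method 3: give urllib.request the ability to handle cookies, then fetch the page with urllib.request.urlopen(url)

Note: in Python 2, use urllib2 instead of urllib.request, cookielib instead of http.cookiejar, and the print statement instead of print()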

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import http.cookiejar
import urllib.request

url = "http://www.baidu.com"

response1 = urllib.request.urlopen(url)
print("Method 1")
# Get the status code; 200 means success
print(response1.getcode())
# Get the length of the page content
print(len(response1.read()))

print("Method 2")
request = urllib.request.Request(url)
# Send a Mozilla-style User-Agent so the request looks like a browser
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

print("Method 3")
cookie = http.cookiejar.CookieJar()
# Give urllib.request the ability to handle cookies
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(len(response3.read()))
print(cookie)

[Figure: output of running the script]
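The third method installs the cookie-aware opener globally with install_opener, which affects every later urlopen call in the process. A possible variation, sketched here with the hypothetical helpers make_cookie_opener and show_cookies, keeps the opener local and calls its open method directly.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener(user_agent="Mozilla/5.0"):
    """Return (opener, jar); the opener stores received cookies in jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Headers sent with every request made through this opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def show_cookies(jar):
    """Return the cookies currently held as (name, value, domain) tuples."""
    return [(c.name, c.value, c.domain) for c in jar]


opener, jar = make_cookie_opener()
print(show_cookies(jar))  # empty until a response sets a cookie

# Example call (needs network access):
# with opener.open("http://www.baidu.com", timeout=10) as response:
#     print(response.getcode(), len(response.read()))
# print(show_cookies(jar))
```

Keeping the opener local means two crawls with different cookie jars can run in the same program without sharing session state.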

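4 Parsing HTML files with the third-party library Beautiful Soup

4.1 Installing Beautiful Soup

Beautiful Soup is a third-party Python library for extracting data from XML and HTML; see its official website.

Open cmd (Command Prompt), change to the Scripts folder under the Python (Python 3) installation directory, and run dir to check whether pip.exe is there. If it is, install Beautiful Soup with Python's bundled pip by running:

pip install beautifulsoup4

[Figure: running the install command]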

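2. Test whether the installation succeeded

Create a Python file test.py containing:

import bs4
print(bs4)

Run the file. If it prints the bs4 module information without raising an error, the installation succeeded.

[Figure: output of test.py]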

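4.2 Parsing an HTML file with Beautiful Soup

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import re

from bs4 import BeautifulSoup

# Assumption: the markup was stripped from the source text. The tags, and in
# particular the three <a> elements, are reconstructed here so that the
# queries further down have something to match.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://www.baidu.com" class="sister" id="link1">Baidu</a>,
<a href="http://news.baidu.com" class="sister" id="link2">Baidu News</a> and
<a href="http://www.hao123.com" class="sister" id="link3">hao123</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

# Create a BeautifulSoup parser object
soup = BeautifulSoup(html_doc, "html.parser")

# Get all the links
links = soup.find_all('a')
print("All links")
for link in links:
    print(link.name, link['href'], link.get_text())

print("Get a specific URL")
link_node = soup.find('a', href="http://news.baidu.com")
print(link_node.name, link_node['href'], link_node['class'], link_node.get_text())

print("Match with a regular expression")
link_node = soup.find('a', href=re.compile(r"hao"))
print(link_node.name, link_node['href'], link_node['class'], link_node.get_text())

print("Get the text of a <p> paragraph")
p_node = soup.find('p', class_='story')
print(p_node.name, p_node['class'], p_node.get_text())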

[Figure: output of running the script]
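Once the page text has been extracted (for example with get_text(), or by decoding the bytes returned by read()), phone numbers can be pulled out with a regular expression. The following is a minimal sketch, assuming mainland-China mobile numbers written as 11 consecutive digits that start with 1 followed by a digit from 3 to 9. The pattern, the function name extract_mobile_numbers, and the sample text are all illustrative assumptions; numbers written with spaces, dashes, or a country code would need a looser pattern.

```python
import re

# Assumed format: 11 digits, first digit 1, second digit 3-9.
# The lookarounds stop the pattern from matching inside a longer digit run.
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def extract_mobile_numbers(text):
    """Return the unique matches in order of first appearance."""
    seen = set()
    result = []
    for number in MOBILE_RE.findall(text):
        if number not in seen:
            seen.add(number)
            result.append(number)
    return result


# Made-up sample text; the long order number must not produce a match.
sample = (
    "Sales: 13800138000, support: 13800138000 or 15912345678. "
    "Order no. 2021102113800138000."
)
print(extract_mobile_numbers(sample))
```

Before collecting phone numbers from real pages, check the site's terms and the applicable privacy rules.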

---- Tanwheey ----

Love life, love work.
