Crawling web pages with Python urllib (from Python 3 Data Analysis and Machine Learning in Practice: a Python crawler roundup)
优采云 Published: 2021-09-22 10:17
1. Fetching a web page
This section uses the urllib module, which provides an interface for reading web data: resources on the WWW and on FTP servers can be read much as if they were local files. urllib is a URL-handling package that bundles several modules for working with URLs:
- urllib.request: opens and reads URLs.
- urllib.error: contains the exceptions raised by urllib.request; they can be caught and handled with try/except.
- urllib.parse: contains methods for parsing URLs.
- urllib.robotparser: parses robots.txt files; its RobotFileParser class provides a can_fetch() method to test whether a crawler is allowed to download a given page.
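The urllib.parse and urllib.robotparser helpers listed above can be exercised offline. A minimal sketch (the robots.txt rules below are a made-up sample fed in as a string; in practice RobotFileParser.set_url() and read() would fetch the site's real robots.txt over the network):

```python
import urllib.parse
import urllib.robotparser

# urllib.parse: split a URL into its components
parts = urllib.parse.urlparse("https://www.douban.com/search?q=python")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# urllib.robotparser: parse robots.txt rules (hypothetical sample rules here,
# parsed from a local string instead of fetched from the site)
rp = urllib.robotparser.RobotFileParser()
rp.parse("User-agent: *\nDisallow: /private/".splitlines())
print(rp.can_fetch("*", "https://www.douban.com/"))           # allowed
print(rp.can_fetch("*", "https://www.douban.com/private/x"))  # disallowed
```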
The following code fetches a web page:
import urllib.request
url = "https://www.douban.com/"
# A browser must be simulated here (via User-Agent) or the site may reject the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
data = response.read()
# The bytes must be decoded before they display correctly
print(str(data, 'utf-8'))
# The following lines print various information about the fetched page
print(type(response))
print(response.geturl())
print(response.info())
print(response.getcode())
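As noted above, failures from urllib.request surface as exceptions in urllib.error that can be caught with try/except. A small sketch (the fetch helper is a hypothetical wrapper, not part of the original code; ".invalid" is a reserved TLD, so the sample call always fails with URLError):

```python
import urllib.error
import urllib.request

# HTTPError (a subclass of URLError) carries the server's status code;
# URLError covers network-level problems such as DNS failure.
def fetch(url):
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        print("server returned an error status:", e.code)
    except urllib.error.URLError as e:
        print("failed to reach the server:", e.reason)
    return None

print(fetch("http://example.invalid/"))
```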
2. Fetching pages for a search keyword
The code is as follows:
import urllib.request
data = {'word': '海贼王'}
url_values = urllib.parse.urlencode(data)
url = "http://www.baidu.com/s?"
full_url = url + url_values
data = urllib.request.urlopen(full_url).read()
print(str(data, 'utf-8'))
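The urlencode() call above is what makes a non-ASCII keyword such as 海贼王 safe to put in a URL: each key/value pair is percent-encoded and the pairs are joined with "&". A small offline sketch (the extra 'pn' parameter is added here just for illustration):

```python
import urllib.parse

# urlencode percent-encodes each value (UTF-8 by default) and joins pairs with "&"
params = {'word': '海贼王', 'pn': 0}
query = urllib.parse.urlencode(params)
print(query)                        # word=%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=0
print(urllib.parse.unquote(query))  # word=海贼王&pn=0 (decoded back)
```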
3. Downloading images from a Tieba post
The code is as follows:
import re
import urllib.request
# Fetch the page's HTML source
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

# Extract all image URLs from the page
def getImg(html):
    # Non-greedy capture of .jpg URLs in Tieba's image markup
    reg = r'src="(.*?\.jpg)" pic_ext="jpeg"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist
html = getHtml('https://tieba.baidu.com/p/3205263090')
html = html.decode('utf-8')
imgList = getImg(html)
imgName = 0
# Loop over the extracted URLs and save each image
for imgPath in imgList:
    f = open(str(imgName) + ".jpg", 'wb')
    f.write(urllib.request.urlopen(imgPath).read())
    f.close()
    imgName += 1
    print('Downloading image %s' % imgName)
print('All images from this page have been downloaded')
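The extraction step in getImg can be tested offline against a small HTML fragment. A sketch (the fragment below is a made-up sample imitating Tieba's markup; the pic_ext attribute is specific to Tieba's image tags):

```python
import re

# Hypothetical HTML fragment in the style of a Tieba post
html = '''
<img src="https://imgsa.baidu.com/forum/pic/item/abc123.jpg" pic_ext="jpeg">
<img src="https://imgsa.baidu.com/forum/pic/item/def456.jpg" pic_ext="jpeg">
'''

# The non-greedy group captures just the .jpg URL before the pic_ext attribute
pattern = re.compile(r'src="(.*?\.jpg)" pic_ext="jpeg"')
for img_url in pattern.findall(html):
    print(img_url)
```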
4. Scraping stock data
The code is as follows:
import random
import re
import time
import urllib.request
# A pool of User-Agent strings to rotate between requests
user_agent = ["Mozilla/5.0 (Windows NT 10.0; WOW64)", 'Mozilla/5.0 (Windows NT 6.3; WOW64)',
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko)',
'Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1',
'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3',
'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12',
'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E)',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1 ',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; QQBrowser/7.3.9825.400)',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0 ',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser 2.x)',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11']
stock_total = []
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'}
for page in range(1, 8):
    url = 'http://quote.stockstar.com/stock/ranklist_a_3_1_' + str(page) + '.html'
    request = urllib.request.Request(url=url, headers={"User-Agent": random.choice(user_agent)})
    response = urllib.request.urlopen(request)
    # The site serves GBK-encoded pages, so decode with 'gbk'
    content = str(response.read(), 'gbk')
    pattern = re.compile('