Web News Scraping (A Basic Method for Extracting News Information from the Page Source of an Alibaba Search)

优采云 Published: 2021-10-01 02:23


　　Scraping Baidu News Information


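　　Preface

　　By scraping the titles, links, dates, and sources of Baidu News results, this walkthrough shows the basic method for scraping a small amount of data with Python.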

　　Getting the page source of a Baidu News search for "阿里巴巴" (Alibaba)

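　　The request needs a User-Agent header. To find the value your browser sends, enter about:version in the Chrome address bar and copy the User-Agent string shown there.

　　Besides the request headers, we also need to construct the URL.

　　Search for 阿里巴巴 on Baidu News, then copy the URL from the address bar. Stripping the parameters that are not needed gives the URL used in the code below: https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴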
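
　　Instead of editing the address-bar URL by hand, the query string can be assembled from its parameters. The following is a minimal sketch using the standard library's urllib.parse. The helper name build_news_url is hypothetical, and the parameter values are simply copied from the URL used in this article, not taken from any Baidu documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.baidu.com/s"


def build_news_url(keyword, rtt=1):
    """Build a Baidu News search URL for a keyword (hypothetical helper)."""
    # Parameter names and values are copied from the article's URL
    params = {"tn": "news", "rtt": rtt, "bsst": 1, "cl": 2, "wd": keyword}
    # urlencode percent-encodes non-ASCII keywords as UTF-8
    return BASE_URL + "?" + urlencode(params)


url = build_news_url("阿里巴巴")
print(url)
```

　　requests also accepts the raw Chinese keyword in a URL and encodes it itself, so this step is optional; it mainly makes the individual parameters easier to change.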

　　With the request headers and the URL in hand, we can write the basic scraper code.

import requests

# User-Agent string copied from Chrome's about:version page
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36'}
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'

# Fetch the page and keep the response body as text
res = requests.get(url, headers=headers).text
print(res)

　　Part of the output looks like this:

　　// feedback
　　setTimeout(function(){
　　var s = document.createElement("script");
　　s.charset="utf-8";
　　s.src="";
　　document.body.appendChild(s);
　　},0);

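　　Writing regular expressions to extract the news information

　　With the page source available, the next step is to analyze it so that the source and date of each news item can be extracted.

　　The source and date of each item both sit inside a <p class="c-author"> element, which makes them straightforward to extract.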

import requests
import re

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36'}
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'
res = requests.get(url, headers=headers).text

# Source and date sit inside <p class="c-author">...</p>
p_info = '<p class="c-author">(.*?)</p>'
# re.S lets . match newlines, because each element spans several lines
info = re.findall(p_info, res, re.S)
print(info)

　　The raw matches still contain \n, \t, and &nbsp;, which are cleaned up in a later step.
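
　　To see where the stray whitespace and entities come from, the pattern can be tried offline on a small fragment. The HTML below is a made-up sample that only imitates the p class="c-author" markup described above; it is not real Baidu output, and the source names and dates in it are placeholders.

```python
import re

# Made-up fragment imitating the markup described in the text
sample_html = '''
<p class="c-author">
\t\tSourceA&nbsp;&nbsp;2021-09-30 10:00
</p>
<p class="c-author">
\t\tSourceB&nbsp;&nbsp;2021-09-29 08:30
</p>
'''

p_info = '<p class="c-author">(.*?)</p>'

# With re.S the dot also matches newlines, so multi-line elements are captured
info = re.findall(p_info, sample_html, re.S)
print(info)

# Without re.S the dot stops at the first newline and nothing matches here
info_without_flag = re.findall(p_info, sample_html)
print(info_without_flag)
```

　　Each captured string keeps the leading newline and tabs and the &nbsp; entities, which is why a cleaning pass is needed before printing.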

　　In the same way, regular expressions can pull out the title and link of each news item.

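import requests
import re

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36'}
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'
res = requests.get(url, headers=headers).text

# Assumes each headline is wrapped as <h3 class="c-title"><a href="...">title</a></h3>;
# adjust both patterns if the page markup differs
p_href = '<h3 class="c-title">.*?<a href="(.*?)"'
p_title = '<h3 class="c-title">.*?>(.*?)</a>'
href = re.findall(p_href, res, re.S)
title = re.findall(p_title, res, re.S)

print("Links:", '\n', href)
print("Titles:", '\n', title)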
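
　　The title and link patterns can also be checked offline. This is a sketch under a stated assumption: that each headline is wrapped as h3 class="c-title" containing a single a element. The sample HTML and the example.com links are invented for illustration only.

```python
import re

# Invented fragment; the c-title markup is an assumption about the page layout
sample_html = '''
<h3 class="c-title">
  <a href="https://example.com/news/1" target="_blank">
    <em>Alibaba</em> example headline one
  </a>
</h3>
<h3 class="c-title">
  <a href="https://example.com/news/2" target="_blank">
    Example headline two about <em>Alibaba</em>
  </a>
</h3>
'''

p_href = '<h3 class="c-title">.*?<a href="(.*?)"'
p_title = '<h3 class="c-title">.*?>(.*?)</a>'

href = re.findall(p_href, sample_html, re.S)
title = re.findall(p_title, sample_html, re.S)

# Strip surrounding whitespace and remove inline tags such as <em>
title = [re.sub('<.*?>', '', t.strip()) for t in title]

# zip pairs each title with its link and stops at the shorter list
for t, h in zip(title, href):
    print(t, '->', h)
```

　　Pairing with zip avoids an IndexError when one pattern matches fewer items than the other, which index-based loops do not guard against.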


　　Cleaning and printing the data

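import requests
import re

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36'}
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'
res = requests.get(url, headers=headers).text

p_info = '<p class="c-author">(.*?)</p>'
info = re.findall(p_info, res, re.S)

# Clean the source-and-date strings: remove any nested tags.
# Source and date are still joined in one string at this stage.
for i in range(len(info)):
    info[i] = re.sub('<.*?>', '', info[i])

# Assumes each headline is wrapped as <h3 class="c-title"><a href="...">title</a></h3>
p_href = '<h3 class="c-title">.*?<a href="(.*?)"'
p_title = '<h3 class="c-title">.*?>(.*?)</a>'
href = re.findall(p_href, res, re.S)
title = re.findall(p_title, res, re.S)

# Clean the titles: strip() removes surrounding spaces and newlines;
# the pattern <.*?> matches each tag so that re.sub can delete it
for i in range(len(title)):
    title[i] = title[i].strip()
    title[i] = re.sub('<.*?>', '', title[i])

print("Source and date:", '\n', info)
print("Links:", '\n', href)
print("Titles:", '\n', title)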

　　Limitation of this version: the date is not yet cleaned.
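
　　One way to finish the cleaning is to split each raw string into a source and a date. The sketch below uses a hypothetical helper, split_source_and_date, and assumes that the source name contains no spaces and is separated from the date by &nbsp; entities or ordinary whitespace. The sample string is invented.

```python
import re


def split_source_and_date(raw):
    """Return (source, date) from one raw c-author match (hypothetical helper)."""
    text = re.sub('<.*?>', '', raw)       # delete any nested tags
    text = text.replace('&nbsp;', ' ')    # turn the entity into a plain space
    parts = text.split()                  # split on any run of whitespace
    if not parts:
        return '', ''
    # Assumption: the first token is the source, the rest is the date and time
    return parts[0], ' '.join(parts[1:])


raw = '\n\t\t<img src="x.png">SourceA&nbsp;&nbsp;2021-09-30 10:00\n'
source, date = split_source_and_date(raw)
print(source, '|', date)
```

　　Returning empty strings for an empty match keeps a loop over many items from failing on one malformed entry.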

　　Complete working code

　　First, the complete code for scraping news titles, dates, and links across several result pages:

import requests
import re

# 1. Scrape several result pages for one company
def baidu(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
    num = (page - 1) * 10  # result offset: page 1 -> 0, page 2 -> 10, ...
    url = 'https://www.baidu.com/s?tn=news&rtt=4&bsst=1&cl=2&wd=阿里巴巴&pn=' + str(num)
    res = requests.get(url, headers=headers).text

    # Assumes source/date in <p class="c-author"> and headlines in <h3 class="c-title">
    p_info = '<p class="c-author">(.*?)</p>'
    p_href = '<h3 class="c-title">.*?<a href="(.*?)"'
    p_title = '<h3 class="c-title">.*?>(.*?)</a>'
    info = re.findall(p_info, res, re.S)
    href = re.findall(p_href, res, re.S)
    title = re.findall(p_title, res, re.S)

    source = []  # two empty lists for the source and date split out of info
    date = []
    for i in range(len(info)):
        title[i] = title[i].strip()
        title[i] = re.sub('<.*?>', '', title[i])
        info[i] = re.sub('<.*?>', '', info[i])
        # Assumes source and date are separated by two &nbsp; entities
        source.append(info[i].split('&nbsp;&nbsp;')[0])
        date.append(info[i].split('&nbsp;&nbsp;')[1])
        source[i] = source[i].strip()
        date[i] = date[i].strip()
        print(str(i + 1) + '.' + title[i] + '(' + date[i] + '-' + source[i] + ')')
        print(href[i])

for i in range(10):  # i starts at 0, so pass i + 1 as the page number
    baidu(i + 1)
    print('Page ' + str(i + 1) + ' scraped successfully')
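
　　The page-to-offset arithmetic used for paging can be isolated and checked on its own. A minimal sketch, assuming ten results per page as in the code above; the function name page_to_offset is hypothetical.

```python
def page_to_offset(page, page_size=10):
    """Convert a 1-based page number to a result offset (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * page_size


for page in range(1, 4):
    print('page', page, '-> offset', page_to_offset(page))
```

　　Keeping the conversion in one place makes the off-by-one between the loop index and the page number harder to get wrong.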


　　Next, the code for scraping news titles, dates, and links for several companies:

import requests
import re

def baidu(company):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
    url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=' + company
    res = requests.get(url, headers=headers).text

    # Assumes source/date in <p class="c-author"> and headlines in <h3 class="c-title">
    p_info = '<p class="c-author">(.*?)</p>'
    p_href = '<h3 class="c-title">.*?<a href="(.*?)"'
    p_title = '<h3 class="c-title">.*?>(.*?)</a>'
    info = re.findall(p_info, res, re.S)
    href = re.findall(p_href, res, re.S)
    title = re.findall(p_title, res, re.S)

    source = []  # two empty lists for the source and date split out of info
    date = []
    for i in range(len(info)):
        title[i] = title[i].strip()
        title[i] = re.sub('<.*?>', '', title[i])
        info[i] = re.sub('<.*?>', '', info[i])
        # Assumes source and date are separated by two &nbsp; entities
        source.append(info[i].split('&nbsp;&nbsp;')[0])
        date.append(info[i].split('&nbsp;&nbsp;')[1])
        source[i] = source[i].strip()
        date[i] = date[i].strip()
        print(str(i + 1) + '.' + title[i] + '(' + date[i] + '-' + source[i] + ')')
        print(href[i])

while True:  # scrape around the clock
    companys = ['华能信托', '阿里巴巴', '万科集团', '百度', '腾讯', '京东']
    for i in companys:
        try:
            baidu(i)
            print(i + ': Baidu News scraped successfully')
        except Exception:  # keep going if one company fails
            print(i + ': Baidu News scrape failed')
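
　　A loop that never pauses sends requests as fast as the network allows. The sketch below shows one possible shape for a gentler polling loop: a bounded number of rounds, a pause between rounds, and a record of which companies failed. The names poll and fake_fetch are hypothetical, and fake_fetch stands in for the real scraping function so the example runs without network access.

```python
import time


def poll(companies, fetch, rounds=1, pause_seconds=0.0):
    """Call fetch(company) for every company, `rounds` times; return failures."""
    failures = []
    for round_no in range(rounds):
        for company in companies:
            try:
                fetch(company)
                print(company + ': ok')
            except Exception as exc:
                # Record the failure and continue with the next company
                failures.append((company, repr(exc)))
                print(company + ': failed')
        if round_no < rounds - 1:
            time.sleep(pause_seconds)  # wait before starting the next round
    return failures


def fake_fetch(company):
    """Stand-in for the real scraper; fails for one made-up name."""
    if company == 'bad-name':
        raise ValueError('no results')


failures = poll(['阿里巴巴', 'bad-name', '腾讯'], fake_fetch, rounds=2)
print(failures)
```

　　For continuous monitoring, the pause would be set to minutes or hours and the real scraping function passed in place of fake_fetch; the right interval depends on how often the results actually change.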
