Scraping Web News (Python: Scraping Winning-Bid Documents and Saving Them as PDF; Scraping Blog Articles)

优采云 · Published: 2022-03-12 14:09


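Preface

The text and images in this article come from the internet and are for learning and exchange only, not for any commercial use. If there is any problem, please contact us promptly so that we can handle it.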


Basic development environment

import parsel
import requests
import re

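Target page analysis

Today we scrape the international news section of the news site (chinanews.com, as used in the code further down).

Click the button that shows more news to load additional items.

In the browser's network panel you can then see the related data API. Its response contains each news title and the URL of the corresponding detail page.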

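How to extract the URLs

1. Parse the response as JSON and read the values by key;

2. Use a regular expression to match the URLs.

Either method works; which one to use is a matter of preference.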
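The two options above can be sketched side by side. This is only an illustration: the `SAMPLE` payload, its `docs` key, and the helper names `urls_via_json` and `urls_via_regex` are assumptions made for the example, and the real API response may be nested differently or wrapped in extra text that has to be stripped before `json.loads` will accept it.

```python
import json
import re

# Hypothetical sample payload; the real API's field names and nesting may differ.
SAMPLE = (
    '{"docs":[{"title":"A","url":"https://example.com/a.shtml"},'
    '{"title":"B","url":"https://example.com/b.shtml"}]}'
)


def urls_via_json(text):
    """Method 1: parse the payload as JSON and read each item's "url" value."""
    data = json.loads(text)
    return [item["url"] for item in data["docs"]]


def urls_via_regex(text):
    """Method 2: match every "url":"..." pair directly in the raw text."""
    return re.findall(r'"url":"(.*?)"', text)


print(urls_via_json(SAMPLE))
print(urls_via_regex(SAMPLE))
```

The regular expression does not care how the payload is nested, but it only matches when there is no space after the colon. The JSON route is stricter about the input and more robust once the structure is known.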

Paging follows the pager parameter in the API URL: its value corresponds to the page number.
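As a small sketch of that paging scheme, the list-page URLs can be generated by filling in `pager`. The URL template is the one used in this article's code; the helper name `page_urls` is made up for the example, and the other query parameters are simply left as they appear in the article.

```python
# URL template taken from the article's code; only `pager` changes between pages.
BASE_URL = 'https://channel.chinanews.com/cns/cjs/gj.shtml?pager={}&pagenum=9&t=5_58'


def page_urls(first_page, last_page):
    """Build the list-page URLs for pages first_page..last_page (inclusive)."""
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


for url in page_urls(1, 3):
    print(url)
```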

On the detail page, the news text sits in p tags inside a div tag, so it can be extracted with the usual page-parsing approach.
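To make the "p tags inside a div" idea concrete, here is a dependency-free sketch that uses the standard library's `html.parser` instead of the parsel selector used in this article's code. The sample HTML, the class `ParagraphText`, and the helper `extract_paragraphs` are assumptions for illustration; only the `left_zw` class name comes from the article's selector.

```python
from html.parser import HTMLParser

# Hypothetical detail-page fragment; the real page is much larger.
SAMPLE_HTML = (
    '<div id="wrapper">'
    '<div class="left_zw"><p>First.</p><p>Second.</p></div>'
    '<p>Footer</p>'
    '</div>'
)


class ParagraphText(HTMLParser):
    """Collect text inside <p> tags that are nested in <div class="left_zw">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while we are inside the target div
        self.in_p = False    # True while we are inside a <p> of the target div
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            # Enter the target div, or track divs nested inside it.
            if self.div_depth or ('class', 'left_zw') in attrs:
                self.div_depth += 1
        elif tag == 'p' and self.div_depth:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth:
            self.div_depth -= 1
        elif tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data.strip())


def extract_paragraphs(html):
    """Return the paragraph texts of the target div joined by newlines."""
    parser = ParagraphText()
    parser.feed(html)
    return '\n'.join(parser.parts)


print(extract_paragraphs(SAMPLE_HTML))
```

With parsel the same selection is a single CSS expression, which is why the article's code takes that route; the version above only spells out what such a selector does.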

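Storage options

1. Save as a plain txt file

2. Save as a PDF

I have also written about scraping article content and saving it as PDF. See these related articles for that approach:

Python: scraping winning-bid documents and saving them as PDF

Python: scraping CSDN blog articles and turning them into PDF files

In this article, the content is saved as txt files.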
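A minimal sketch of the txt option follows. The helper name `save_txt`, the demo folder `news_demo`, and the sample text are assumptions for the example. Unlike the `download` function in the code below, this version creates the folder if it is missing and overwrites the file instead of appending to it.

```python
import os
import tempfile


def save_txt(content, title, folder):
    """Write `content` to <folder>/<title>.txt and return the file path."""
    os.makedirs(folder, exist_ok=True)  # create the target folder if needed
    path = os.path.join(folder, title + '.txt')
    with open(path, mode='w', encoding='utf-8') as f:
        f.write(content)
    return path


# Demo: write into a throwaway folder under the system temp directory.
demo_dir = os.path.join(tempfile.gettempdir(), 'news_demo')
saved = save_txt('First paragraph.\nSecond paragraph.', 'sample_title', demo_dir)
print(saved)
```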

Summary of the overall scraping approach

Code implementation

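# Request headers; this User-Agent value is an assumption, adjust as needed
headers = {'User-Agent': 'Mozilla/5.0'}


def get_html(html_url):
    """
    Fetch a page and return the response object.
    :param html_url: URL of the page
    :return: the response for that page
    """
    response = requests.get(url=html_url, headers=headers)
    return response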

def get_page_url(html_data):
    """
    Extract the URL of every news article.
    :param html_data: response.text of the list API
    :return: list of news article URLs
    """
    page_url_list = re.findall('"url":"(.*?)"', html_data)
    return page_url_list

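def file_name(name):
    """
    File names must not contain special characters.
    :param name: news title
    :return: title with special characters replaced by '_'
    """
    # Characters that are not allowed in Windows file names
    replace = re.compile(r'[\\/:*?"<>|]')
    new_name = re.sub(replace, '_', name)
    return new_name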
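As a quick check of what this kind of sanitizing does, the sketch below applies the same idea to a made-up title. The helper name `safe_name` and the sample title are assumptions for the example, and the character class also covers `<` and `>`, which Windows rejects in file names as well.

```python
import re


def safe_name(name):
    """Replace characters that are invalid in Windows file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', '_', name)


print(safe_name('A/B: "what?"'))  # prints A_B_ _what__
```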

def download(content, title):
    """
    Save the news content to a txt file with open().
    :param content: news content
    :param title: news title
    :return:
    """
    # The 新闻 folder must already exist next to the script
    path = '新闻/' + title + '.txt'
    with open(path, mode='a', encoding='utf-8') as f:
        f.write(content)
        print('Saving', title)

def main(url):
    """
    Main function.
    :param url: URL of the news list API
    :return:
    """
    html_data = get_html(url).text  # API data, response.text
    lis = get_page_url(html_data)  # list of news article URLs
    for li in lis:
        page_data = get_html(li).content.decode('utf-8', 'ignore')  # HTML of the news detail page
        selector = parsel.Selector(page_data)
        # News title; matching the <title> tag is an assumption, adjust it to the page
        title = re.findall('<title>(.*?)</title>', page_data, re.S)[0]
        new_title = file_name(title)
        new_data = selector.css('#cont_1_1_2 div.left_zw p::text').getall()
        content = ''.join(new_data)
        download(content, new_title)

if __name__ == '__main__':
    for page in range(1, 101):
        url_1 = 'https://channel.chinanews.com/cns/cjs/gj.shtml?pager={}&pagenum=9&t=5_58'.format(page)
        main(url_1)

Screenshot of the run
