Scraping dynamic web pages with Python (Using Python to scrape a Zhihu trending topic, part 1: page analysis, Guangming Online)

Published by 优采云: 2021-11-08 21:05

　　Preface

　　This post uses Python to scrape a trending Zhihu topic. Without further ado,

　　let's get started.

　　Development tools

　　Python version: 3.6.4

　　Required modules:

　　requests;

　　re;

　　pandas;

　　lxml;

　　random;

　　plus a few other modules that ship with Python.

　　Environment setup

　　Install Python and add it to your PATH environment variable, then use pip to install the required modules.

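　　Approach

　　As its example, this article scrapes the answers to a trending Zhihu question about a Tencent intern who reportedly asked Tencent executives to issue rules on refusing to drink at business dinners.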

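　　Target URL

　　Page analysis

　　Inspecting the page source shows that the answers are loaded dynamically, so the requests have to be captured in the browser's developer tools. Open Network → XHR and scroll down the page; the data packets we need will appear in the list.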
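　　Once a packet shows up in the XHR tab, its response body can be pasted into a short script to see what it contains before writing the scraper. The following is only a minimal sketch: `summarize_payload` is a hypothetical helper, the sample string is a made-up, heavily shortened stand-in for a real response, and it assumes the body is a JSON object whose `data` key holds the list of answers.

```python
import json


def summarize_payload(text):
    """Report the overall shape of a captured JSON response body."""
    payload = json.loads(text)
    records = payload.get("data", [])  # assumed: answers live under "data"
    first = records[0] if records else {}
    return {
        "top_level_keys": sorted(payload),
        "record_count": len(records),
        "record_keys": sorted(first),
    }


if __name__ == "__main__":
    # Made-up sample, not a real Zhihu response.
    captured = (
        '{"data": [{"author": {"name": "demo"}, '
        '"content": "<p>text</p>", "voteup_count": 3}], '
        '"paging": {"is_end": false}}'
    )
    print(summarize_payload(captured))
```

　　Looking at the record keys first makes it easier to decide which fields are worth extracting later.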

　　Get the actual request URLs

　　https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=0&platform=desktop&sort_by=default

https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=5&platform=desktop&sort_by=default

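　　The URL carries many parameters that are not needed; you can try deleting them in the browser. The two URLs differ only in the trailing offset parameter: it is 0 in the first URL and 5 in the second, so the offset increases in steps of 5. The response data is in JSON format.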
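　　The offset pattern described above can be sketched with the standard library instead of pasting the full URL by hand. This is a sketch under assumptions, not the article's code: `build_url` and `page_offsets` are hypothetical helpers, and the shortened `include` field list is only a guess at what could remain after removing unneeded parameters, so it should be checked in the browser before relying on it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_BASE = "https://www.zhihu.com/api/v4/questions/478781972/answers"

# Assumed shortened field list; the captured URL requests many more fields.
INCLUDE = "data[*].comment_count,content,voteup_count,created_time"


def build_url(offset, limit=5):
    """Build the answers URL for one page; urlencode percent-encodes [ * ] and commas."""
    params = {
        "include": INCLUDE,
        "limit": limit,
        "offset": offset,
        "platform": "desktop",
        "sort_by": "default",
    }
    return f"{API_BASE}?{urlencode(params)}"


def page_offsets(total, limit=5):
    """Offsets 0, limit, 2*limit, ... strictly below total."""
    return range(0, total, limit)


if __name__ == "__main__":
    for offset in page_offsets(15):
        url = build_url(offset)
        # Parse the query string back to confirm the offset that was encoded.
        print(parse_qs(urlsplit(url).query)["offset"], url)
```

　　Keeping the parameters in a dictionary makes it obvious that only offset changes from one request to the next.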

　　Code

import requests
import pandas as pd
import re
import time
import random

df = pd.DataFrame()
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
# The offset starts at 0 and grows by 5 (one page of 5 answers) per request.
for page in range(0, 1360, 5):
    url = f'https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={page}&platform=desktop&sort_by=default'
    response = requests.get(url=url, headers=headers).json()
    data = response['data']
    for list_ in data:
        name = list_['author']['name']  # answer author
        id_ = list_['author']['id']  # author id
        created_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(list_['created_time']))  # time of the answer
        voteup_count = list_['voteup_count']  # number of upvotes
        comment_count = list_['comment_count']  # number of comments under the answer
        content = list_['content']  # answer body (HTML)
        # Keep only Chinese characters and common Chinese punctuation; this also drops the HTML tags.
        content = ''.join(re.findall("[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b\u4e00-\u9fa5]", content))
        print(name, id_, created_time, comment_count, content, sep='|')
        dataFrame = pd.DataFrame(
            {'知乎作者': [name], '作者id': [id_], '回答时间': [created_time], '赞同数': [voteup_count], '底下评论数': [comment_count],
             '回答内容': [content]})
        df = pd.concat([df, dataFrame])
    time.sleep(random.uniform(2, 3))  # pause 2 to 3 seconds between requests
df.to_csv('知乎回答.csv', encoding='utf-8', index=False)
print(df.shape)
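
　　The field extraction and paging in the script can also be tried without touching the network. The sketch below is an illustration under assumptions rather than the article's method: `clean_content`, `parse_answer` and `iter_answers` are hypothetical helpers, `fetch_page` stands for any callable that returns the decoded JSON of one page, the sample records are made up, and stopping at the first empty `data` list is a swapped-in alternative to the fixed upper bound of 1360.

```python
import re
import time

# Chinese characters plus the Chinese punctuation marks kept by the script's regex.
CJK_PATTERN = re.compile(
    "[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b\u4e00-\u9fa5]"
)


def clean_content(html):
    """Keep only Chinese text and punctuation; tags and Latin text are dropped."""
    return "".join(CJK_PATTERN.findall(html))


def parse_answer(item):
    """Flatten one answer record into the fields the script stores."""
    created = time.localtime(item["created_time"])
    return {
        "name": item["author"]["name"],
        "id": item["author"]["id"],
        "created_time": time.strftime("%Y-%m-%d %H:%M:%S", created),
        "voteup_count": item["voteup_count"],
        "comment_count": item["comment_count"],
        "content": clean_content(item["content"]),
    }


def iter_answers(fetch_page, limit=5, max_pages=1000):
    """Yield parsed answers page by page until a page comes back empty."""
    offset = 0
    for _ in range(max_pages):  # hard cap so a misbehaving source cannot loop forever
        data = fetch_page(offset).get("data", [])
        if not data:
            break
        for item in data:
            yield parse_answer(item)
        offset += limit


if __name__ == "__main__":
    # Made-up record standing in for one entry of response['data'].
    record = {
        "author": {"name": "demo_user", "id": "abc123"},
        "created_time": 1636376700,
        "voteup_count": 12,
        "comment_count": 3,
        "content": "<p>你好，世界。Hello</p>",
    }
    fake_pages = {0: {"data": [record]}, 5: {"data": []}}
    for row in iter_answers(lambda offset: fake_pages[offset]):
        print(row)
```

　　In a real run, `fetch_page` could wrap the `requests.get(...).json()` call from the script, and the collected rows could be passed to `pd.DataFrame` in one go instead of concatenating a one-row frame per answer.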

　　Results
