Web Page Data Scraping, Baidu Baike (The Beauty of Big Data: The Difficulties of Obtaining Baidu Index Data and How to Solve Them)

优采云 | Published: 2022-02-12 11:26


Author | 叶庭云

Source | AI庭云君

1. Introduction

In real-world work, we may need a crawler that fetches the historical Baidu search index for a given keyword so that the data can be analyzed.

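Baidu Index lets you experience the beauty of big data. Obtaining its data is not straightforward, though. As the analysis below shows, the difficulties are that requests only return data with a logged-in cookie, and that the index values come back encrypted and must be decrypted with a key fetched in a separate request.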

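This article takes the keywords 北京冬奥会 (Beijing Winter Olympics) and 冬奥会开幕式 (Winter Olympics opening ceremony) as examples and walks through using a crawler to fetch the recent Baidu search index history for a keyword. The data is then used to visualize the Winter Olympics search index over the last 90 days and to build a word cloud from collected media coverage.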

2. Page Analysis

If you do not have a Baidu account, register one first, then open the Baidu Index site: https://index.baidu.com

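Search for 冬奥会 (Winter Olympics) and select the last 90 days. The page shows a line chart of the Winter Olympics search index for that period.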

The end goal is to fetch this search index data and save it to a local Excel file.

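First, after logging in, you need to obtain your cookie. It is mandatory: without it the interface returns no data. It can typically be copied from the request headers shown in the browser's developer tools.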
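Because the cookie is tied to a logged-in account, it may be preferable to keep it out of the script itself. The following is a minimal sketch, not the author's code: `build_headers` and the environment variable name `BAIDU_INDEX_COOKIE` are hypothetical, and the shortened `User-Agent` value is a placeholder.

```python
import os


def build_headers(cookie: str) -> dict:
    """Return request headers carrying the logged-in cookie (hypothetical helper)."""
    cookie = cookie.strip()
    if not cookie:
        # The article states that no data is returned without a cookie.
        raise ValueError("a logged-in Baidu cookie is required")
    return {
        "Cookie": cookie,
        "Host": "index.baidu.com",
        "Referer": "https://index.baidu.com/v2/main/index.html",
        "User-Agent": "Mozilla/5.0",  # placeholder value
    }


# BAIDU_INDEX_COOKIE is an assumed variable name, chosen for this example only.
cookie_value = os.environ.get("BAIDU_INDEX_COOKIE", "")
if cookie_value:
    headers = build_headers(cookie_value)
```

Failing early on an empty cookie gives a clearer error than a response without the expected data.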

Inspecting the network requests reveals the interface that returns the JSON data.

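In the Request URL, the word parameter carries the search keyword (only the Chinese characters are percent-encoded), and days=90 requests the last 90 days of data, counted back from the day before the current date. Change days to fetch more or less data. Paste the Request URL into a browser to inspect the response (a JSON viewer extension such as JSON Handle makes this easier):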

https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22%E5%86%AC%E5%A5%A5%E4%BC%9A%22,%22wordType%22:1%7D]]&days=90
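As a rough illustration of how such a URL can be assembled, the sketch below percent-encodes only the keyword and leaves the rest of the query string as it is observed in the browser. The helper name `build_data_url` is hypothetical; the fixed prefix and suffix follow the request described in this section.

```python
from urllib.parse import quote

# Fixed parts of the query string, as observed in the browser's Request URL.
PREFIX = "https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22"
SUFFIX = "%22,%22wordType%22:1%7D]]&days={days}"


def build_data_url(keyword: str, days: int = 90) -> str:
    """Return the search index request URL for one keyword (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of the Chinese characters.
    return PREFIX + quote(keyword) + SUFFIX.format(days=days)


print(build_data_url("冬奥会", 90))
```

Changing the `days` argument is the only thing needed to request a longer or shorter window.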

The response is a JSON object that contains the encrypted all, pc, and wise series together with a uniqid.

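After decrypting the data under all, pc, and wise and comparing it with the line chart, the all series turns out to be the search index shown in the chart. Everything the request returns is here, including a uniqid. Both the encrypted data and the uniqid change on every refresh.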

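Further analysis shows that the uniqid returned by the data request appears as a parameter in the URL of a second request, the ptbk interface, which returns the decryption key.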
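To make the response layout concrete, here is a small sketch that pulls the uniqid and one encrypted series out of a mock payload. The nesting (`data`, `userIndexes`, `all`, `startDate`, `endDate`, `data`, `uniqid`) follows the keys used in this article's Python listing; every value in `sample` is invented, and the assumption that `pc` and `wise` share the layout of `all` is mine. `extract_series` is a hypothetical helper.

```python
# Mock payload: same nesting as the keys used by the crawler, invented values.
sample = {
    "data": {
        "uniqid": "example-uniqid",
        "userIndexes": [
            {
                "all": {"startDate": "2022-01-01", "endDate": "2022-01-03", "data": "abc"},
                "pc": {"startDate": "2022-01-01", "endDate": "2022-01-03", "data": "def"},
                "wise": {"startDate": "2022-01-01", "endDate": "2022-01-03", "data": "ghi"},
            }
        ],
    }
}


def extract_series(payload: dict, source: str = "all") -> tuple:
    """Return (uniqid, start_date, end_date, encrypted_data) for one series."""
    data = payload["data"]
    series = data["userIndexes"][0][source]
    return data["uniqid"], series["startDate"], series["endDate"], series["data"]


print(extract_series(sample))
```

Parsing the payload once and passing the pieces along avoids re-decoding the JSON body for every field.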

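So the crawler needs to request the data URL, parse the encrypted search index data and the uniqid out of the response, append the uniqid to the key URL to fetch the key, and finally call the decryption method to recover the search index values.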

　　https://index.baidu.com/Interface/ptbk?uniqid=b92927de43cc02fcae9fbc0cee99e3a9

With the URLs identified, the crawler follows the usual pattern: send the request, get the response, parse the data, then decrypt and save it.
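The decryption step is a plain character substitution, which a toy example may make easier to follow. The sketch below mirrors the scheme used by the `decryption` function in this article's listing: the first half of the key holds the cipher characters and the second half the characters they stand for. The key and data here are made up for illustration and are not real Baidu values.

```python
def decrypt(key: str, data: str) -> str:
    """Decode `data` with a substitution table built from `key`."""
    half = len(key) // 2
    # Character i of the first half maps to character i of the second half.
    table = dict(zip(key[:half], key[half:]))
    return "".join(table[ch] for ch in data)


# Made-up key: 11 cipher symbols followed by the 11 characters they represent.
toy_key = "abcdefghijk0123456789,"
toy_data = "bcdkef"

values = decrypt(toy_key, toy_data).split(",")
print(values)  # ['123', '45']
```

The decoded string is a comma-separated list, so splitting on the comma yields one value per data point.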

3. Data Collection

Python code:

# -*- coding: UTF-8 -*-
"""
@Author : 叶庭云
@WeChat official account : AI庭云君
@CSDN : https://yetingyun.blog.csdn.net/
"""
import datetime

import pandas as pd
import requests
from colorama import Fore, init

init()  # enable colored terminal output (needed on Windows)

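# Python implementation of the search index decryption
def decryption(keys, data):
    # The first half of `keys` holds the cipher characters and the
    # second half holds the plaintext characters they map to.
    half = len(keys) // 2
    dec_dict = dict(zip(keys[:half], keys[half:]))
    # Substitute every character of the encrypted string
    return ''.join(dec_dict[ch] for ch in data)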

if __name__ == "__main__":
    # Example keywords: 北京冬奥会, 冬奥会开幕式
    keyword = '北京冬奥会'  # keyword to look up in Baidu Index
    period = 90  # last 90 days

    start_str = 'https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22'
    end_str = '%22,%22wordType%22:1%7D]]&days={}'.format(period)
    # The keyword is passed as-is; requests percent-encodes the non-ASCII characters
    dataUrl = start_str + keyword + end_str
    keyUrl = 'https://index.baidu.com/Interface/ptbk?uniqid='

    # Request headers
    header = {
        'Accept': 'application/json, text/plain, */*',
        # 'br' is left out: requests only decodes Brotli if a brotli package is installed
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Cookie': 'NOTE: replace with your own Cookie',
        'Host': 'index.baidu.com',
        'Referer': 'https://index.baidu.com/v2/main/index.html',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
    }

    # Request the encrypted data with a 16-second timeout
    resData = requests.get(dataUrl, timeout=16, headers=header)
    resData.raise_for_status()
    resJson = resData.json()  # parse the body once and reuse it
    uniqid = resJson['data']['uniqid']
    print(Fore.RED + "uniqid: {}".format(uniqid))

    # Request the decryption key that belongs to this uniqid
    keyData = requests.get(keyUrl + uniqid, timeout=16, headers=header)
    keyData.raise_for_status()
    keyData.encoding = keyData.apparent_encoding

    # Parse the JSON data
    allIndex = resJson['data']['userIndexes'][0]['all']
    startDate = allIndex['startDate']
    print(Fore.RED + "startDate: {}".format(startDate))
    endDate = allIndex['endDate']
    print(Fore.RED + "endDate: {}".format(endDate))
    source = allIndex['data']  # encrypted data
    print(Fore.RED + "encrypted data: {}".format(source))
    key = keyData.json()['data']  # decryption key
    print(Fore.RED + "key: {}".format(key))

    res = decryption(key, source)
    resArr = res.split(",")

    # Build datetime objects for the date range
    dateStart = datetime.datetime.strptime(startDate, '%Y-%m-%d')
    dateEnd = datetime.datetime.strptime(endDate, '%Y-%m-%d')
    dataLs = []

    # NOTE: the source listing breaks off after "while dateStart". The loop
    # condition and everything below are a reconstruction based on the comment
    # above the loop and the stated goal of saving the data to Excel.
    # Every day from the start date to the end date
    while dateStart <= dateEnd:
        dataLs.append(dateStart.strftime('%Y-%m-%d'))
        dateStart += datetime.timedelta(days=1)

    # Assumed column names and file name; to_excel needs openpyxl installed
    df = pd.DataFrame({'keyword': keyword, 'date': dataLs, 'search_index': resArr})
    df.to_excel('baidu_index_{}d.xlsx'.format(period), index=False)
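The step that walks from startDate to endDate and lines the dates up with the decrypted values can be sketched on its own with the standard library. This is an illustration under one assumption of mine: the interface returns exactly one value per day in the requested range. `daily_rows` is a hypothetical helper and the numbers are invented.

```python
from datetime import date, timedelta


def daily_rows(start: str, end: str, values: list) -> list:
    """Pair each day from start to end (inclusive) with one value."""
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    if len(days) != len(values):
        # A mismatch suggests the data is not one value per day.
        raise ValueError(f"expected {len(days)} values, got {len(values)}")
    return [(d.isoformat(), v) for d, v in zip(days, values)]


rows = daily_rows("2022-02-01", "2022-02-03", ["10", "20", "30"])
print(rows)
```

Checking the lengths before pairing catches the case where the number of decrypted values does not match the number of days, instead of silently misaligning dates and values.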
