Python web data scraping (Python learning materials; 2. Scrape results; 3. Detailed step analysis (Part 1))

优采云, published: 2021-09-26 17:12


1. Target:

Baidu image search results for "NBA"

2. Scrape results

3. Detailed step analysis

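(1) The key to telling whether a page is loaded dynamically is to watch the XHR requests in the browser's developer tools while scrolling the mouse wheel. If new requests keep appearing there as you scroll, the page is very likely fetching its content dynamically. By this test, Baidu Images is dynamically loaded.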

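(2) Once the dynamically loaded request has been found, analyze it. The hard part is the query parameters. Compare at least two captured requests, find which parameters differ between them, and work out how they change. (A quiet hint: look for the query parameter named pn.) The whole dynamic loading process is in fact controlled by that one parameter. After identifying the request, send it yourself, parse the response, and extract the image URLs from the data. (When saving images, be sure to write them in binary mode!)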
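The comparison described in step (2) can be sketched in code. This is a minimal illustration, not the author's tooling: it takes two request URLs as they might be copied from the browser's Network panel and reports which query parameters differ. The helper name diff_query_params is an assumption made for this example, and both sample URLs are shortened, hypothetical stand-ins for real captures.

```python
from urllib.parse import urlsplit, parse_qs


def diff_query_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for each query parameter that differs."""
    # keep_blank_values=True keeps parameters such as "is=" that carry no value.
    qa = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    qb = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    changed = {}
    for name in sorted(set(qa) | set(qb)):
        if qa.get(name) != qb.get(name):
            changed[name] = (qa.get(name), qb.get(name))
    return changed


# Hypothetical, shortened captures of two consecutive XHR requests.
first = "https://image.baidu.com/search/acjson?tn=resultjson_com&word=NBA&pn=30&rn=30&is="
second = "https://image.baidu.com/search/acjson?tn=resultjson_com&word=NBA&pn=60&rn=30&is="

print(diff_query_params(first, second))  # {'pn': (['30'], ['60'])}
```

In this made-up pair only pn changes, and it advances by the value of rn. That matches the way the complete script in section 4 steps through range(30, 601, 30) with rn fixed at 30.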

4. Complete source code

This scraper needs the requests and json packages.

import json
import os
import time

import requests as rq

count = 1  # running index used to name the saved files


def crawl(page):
    """Fetch one batch of results starting at offset `page` and save the thumbnails."""
    global count
    # Create the output folder on first use.
    os.makedirs('E://桌面/NBA', exist_ok=True)
    url = 'https://image.baidu.com/search/acjson?'
    header = {
        # 'Referer': 'https://image.baidu.com/search/index?ct=201326592&cl=2&st=-1&lm=-1&nc=1&ie=utf-8&tn=baiduimage&ipn=r&rps=1&pv=&fm=rs4&word',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
    }
    # Query parameters copied from the captured request; only "pn" changes between batches.
    param = {
        "tn": "resultjson_com",
        "logid": "11007362803069082764",
        "ipn": "rj",
        "ct": "201326592",
        "is": "",
        "fp": "result",
        "queryWord": "NBA",
        "cl": "2",
        "lm": "-1",
        "ie": "utf-8",
        "oe": "utf-8",
        "adpicid": "",
        "st": "-1",
        "z": "",
        "ic": "",
        "hd": "",
        "latest": "",
        "copyright": "",
        "word": "NBA",
        "s": "",
        "se": "",
        "tab": "",
        "width": "",
        "height": "",
        "face": "0",
        "istype": "2",
        "qc": "",
        "nc": "1",
        "fr": "",
        "expermode": "",
        "force": "",
        "pn": page,
        "rn": "30",
        "gsm": "1e",
        "1615565977798": "",
    }
    response = rq.get(url, headers=header, params=param)
    result = response.text
    # print(response.status_code)
    j = json.loads(result)
    # print(j)
    # Collect the thumbnail URL of every entry that has one.
    img_list = []
    for i in j['data']:
        if 'thumbURL' in i:
            # print(i['thumbURL'])
            img_list.append(i['thumbURL'])
    # print(len(img_list))
    for n in img_list:
        r = rq.get(n, headers=header)
        # Images must be written in binary mode.
        with open(f'E://桌面/NBA/{count}.jpg', 'wb') as f:
            f.write(r.content)
        count += 1


if __name__ == '__main__':
    # pn advances in steps of 30 (the value of rn): 20 batches in total.
    for i in range(30, 601, 30):
        t1 = time.time()
        crawl(i)
        t2 = time.time()
        t = t2 - t1
        print('page {0} is over!!! took {1:.2f}s'.format(i // 30, t))
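The parsing and saving steps can also be separated from the network call, which makes them easier to check in isolation. The following is a hedged sketch rather than a replacement for the script above. extract_thumb_urls and save_image are hypothetical helper names, the sample payload is made up to mimic the shape the script relies on (a data list whose entries may carry thumbURL), and a temporary directory stands in for the real output folder.

```python
import json
import tempfile
from pathlib import Path


def extract_thumb_urls(payload):
    """Collect thumbURL values from a parsed response, skipping entries without one."""
    return [item["thumbURL"] for item in payload.get("data", []) if "thumbURL" in item]


def save_image(content, folder, index):
    """Write raw image bytes to <folder>/<index>.jpg and return the resulting path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{index}.jpg"
    # Binary write: image data is bytes, not text, so no encoding is applied.
    path.write_bytes(content)
    return path


# Made-up payload; the trailing empty dict mimics an entry without thumbURL.
sample = json.loads(
    '{"data": [{"thumbURL": "https://example.com/a.jpg"},'
    ' {"thumbURL": "https://example.com/b.jpg"}, {}]}'
)
urls = extract_thumb_urls(sample)
print(urls)

with tempfile.TemporaryDirectory() as tmp:
    fake_bytes = b"\xff\xd8\xff\xe0 placeholder bytes"
    saved = save_image(fake_bytes, tmp, 1)
    saved_name = saved.name
    saved_size = saved.stat().st_size
    print(saved_name, saved_size)
```

In the real script, the bytes passed to save_image would be r.content from the image request and the folder would be the E://桌面/NBA path. Passing a timeout to rq.get and checking response.status_code before parsing are further precautions worth considering, though the original code does neither.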
