Java crawler: scraping dynamic web pages (free online video tutorials on Python crawlers, data analysis, and web development)
优采云 · Published: 2022-03-20 00:05
Free online video tutorials on Python crawlers, data analysis, and web development:
https://space.bilibili.com/523606542
1. Target
Baidu image search results for "NBA"
2. Results
3. Step-by-step analysis
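(1) To tell whether a page loads its content dynamically, open the browser's developer tools and watch the XHR tab while scrolling with the mouse wheel. If new request packets keep appearing there as you scroll, the page is very likely fetching its content through dynamic requests. The Baidu image results page analyzed here is loaded this way.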
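(2) Once you have found the dynamically loaded packet, analyze its request. The hard part is understanding the query parameters. Capture at least two packets, compare them, and work out which query parameters differ and how they change from one packet to the next (a quiet hint from the author: look for a query parameter named pn). The whole dynamic loading process is in fact controlled by that single parameter; in the code below it advances in steps of 30, matching rn = 30. After identifying the packet, send a request to its URL, parse the returned data, and extract the image URLs. (Images must be written to disk in binary mode!)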
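The comparison described in step (2) can also be done in code rather than by eye. The following is only a minimal sketch: the helper name diff_query_params is made up for illustration, and the two URLs are shortened, hypothetical stand-ins for request URLs copied from the XHR tab, not real captures.

```python
from urllib.parse import parse_qsl, urlsplit


def diff_query_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for every query parameter that differs."""
    # keep_blank_values=True keeps parameters such as "is=" that carry no value
    params_a = dict(parse_qsl(urlsplit(url_a).query, keep_blank_values=True))
    params_b = dict(parse_qsl(urlsplit(url_b).query, keep_blank_values=True))
    changed = {}
    for key in sorted(params_a.keys() | params_b.keys()):
        if params_a.get(key) != params_b.get(key):
            changed[key] = (params_a.get(key), params_b.get(key))
    return changed


# Hypothetical, shortened request URLs for two consecutive scroll loads
first = "https://image.baidu.com/search/acjson?tn=resultjson_com&word=NBA&pn=30&rn=30"
second = "https://image.baidu.com/search/acjson?tn=resultjson_com&word=NBA&pn=60&rn=30"

changed = diff_query_params(first, second)
print(changed)
```

With real captures, a few other parameters may differ as well (for example a timestamp-like key), so it helps to compare more than two packets before deciding which parameter drives the paging.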
4. Complete source code
This crawl needs the requests and json packages.
import json
import os
import time

import requests as rq

# Output folder used by the author (a desktop folder on drive E:); change it to suit your machine
SAVE_DIR = 'E://桌面/NBA'

count = 1  # running index used to name the saved images


def crawl(page):
    """Download every thumbnail on one result page; `page` is the pn offset."""
    global count
    # makedirs also creates missing parent folders and tolerates an existing folder
    os.makedirs(SAVE_DIR, exist_ok=True)
    url = 'https://image.baidu.com/search/acjson?'
    header = {
        # 'Referer': 'https://image.baidu.com/search/index?ct=201326592&cl=2&st=-1&lm=-1&nc=1&ie=utf-8&tn=baiduimage&ipn=r&rps=1&pv=&fm=rs4&word',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
    }
    # Query parameters copied from the captured packet; only pn changes between pages
    param = {
        "tn": "resultjson_com",
        "logid": "11007362803069082764",
        "ipn": "rj",
        "ct": "201326592",
        "is": "",
        "fp": "result",
        "queryWord": "NBA",
        "cl": "2",
        "lm": "-1",
        "ie": "utf-8",
        "oe": "utf-8",
        "adpicid": "",
        "st": "-1",
        "z": "",
        "ic": "",
        "hd": "",
        "latest": "",
        "copyright": "",
        "word": "NBA",
        "s": "",
        "se": "",
        "tab": "",
        "width": "",
        "height": "",
        "face": "0",
        "istype": "2",
        "qc": "",
        "nc": "1",
        "fr": "",
        "expermode": "",
        "force": "",
        "pn": page,
        "rn": "30",
        "gsm": "1e",
        "1615565977798": "",
    }
    response = rq.get(url, headers=header, params=param)
    # Parse the JSON body returned by the acjson endpoint
    j = json.loads(response.text)
    # Collect the thumbnail URLs; entries without a 'thumbURL' key are skipped
    img_list = []
    for i in j['data']:
        if 'thumbURL' in i:
            img_list.append(i['thumbURL'])
    for n in img_list:
        r = rq.get(n, headers=header)
        # Image data is bytes, so the file must be opened in binary mode ('wb')
        with open(f'{SAVE_DIR}/{count}.jpg', 'wb') as f:
            f.write(r.content)
        count += 1


if __name__ == '__main__':
    # pn runs from 30 to 600 in steps of 30, i.e. 20 pages of 30 results each
    for i in range(30, 601, 30):
        t1 = time.time()
        crawl(i)
        t2 = time.time()
        t = t2 - t1
        print('page {0} is over!!! Took {1:.2f} seconds!'.format(i // 30, t))
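
The two steps that matter most in the script, pulling the thumbURL values out of the parsed response and writing the image bytes in binary mode, can be tried in isolation without touching the network. This is a rough sketch under stated assumptions: the helper names extract_thumb_urls and save_bytes are invented for illustration, the sample payload is a made-up, trimmed-down stand-in for a real acjson response, and the example.com URLs are placeholders.

```python
import json
import tempfile
from pathlib import Path


def extract_thumb_urls(payload):
    """Return the 'thumbURL' values from a decoded response, skipping entries without one."""
    return [item["thumbURL"] for item in payload.get("data", []) if "thumbURL" in item]


def save_bytes(content, folder, index):
    """Write raw bytes to <folder>/<index>.jpg in binary mode and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{index}.jpg"
    target.write_bytes(content)  # write_bytes opens the file in binary mode
    return target


# Made-up sample payload; the empty dict mimics an entry that has no 'thumbURL' key
sample = json.loads(
    '{"data": [{"thumbURL": "https://example.com/a.jpg"},'
    ' {"thumbURL": "https://example.com/b.jpg"}, {}]}'
)
urls = extract_thumb_urls(sample)

# Placeholder bytes stand in for r.content from a real image request
saved = save_bytes(b"\xff\xd8\xff", tempfile.mkdtemp(), 1)
print(urls, saved.name)
```

Keeping the parsing and the file writing in small functions like these makes it easier to check each piece before running the full crawl against the live site.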