Scraping dynamic web pages with a Java crawler (Python crawler, data analysis, and web development tutorial videos, free to watch online)

优采云 Published: 2022-03-20 00:05


  Python crawler, data analysis, and web development tutorial videos, free to watch online:

  https://space.bilibili.com/523606542

  1. Scraping target

  NBA images from Baidu Image Search

  

  2. Results

  

  3. Step-by-step analysis

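  (1) To tell whether a page is loaded dynamically, open the XHR tab in the browser's developer tools and watch the list of requests while scrolling with the mouse wheel. If new requests keep appearing as you scroll, the page is very likely fetching its content dynamically. Baidu Image Search, the site analyzed here, is loaded dynamically.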

  

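  (2) Once the dynamically loaded request has been found, analyze it. The hard part is working out the query parameters. Compare at least two captured requests, find the parameters whose values differ, and look at how they change (a hint: look for the query parameter named pn). The dynamic loading is in fact controlled by that single parameter. After that, send the request, parse the returned data, and extract the image URLs. The images must be written to disk in binary mode!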
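  The comparison of two captured requests can also be done in code. The following is a minimal sketch, not part of the original script: the helper name diff_query_params is hypothetical, and the two URLs are shortened, illustrative versions of the acjson request with most parameters omitted. It uses only the standard library.

```python
from urllib.parse import urlsplit, parse_qsl


def diff_query_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for query parameters that differ."""
    # keep_blank_values=True keeps parameters such as "is=" that have no value
    qa = dict(parse_qsl(urlsplit(url_a).query, keep_blank_values=True))
    qb = dict(parse_qsl(urlsplit(url_b).query, keep_blank_values=True))
    changed = {}
    for name in sorted(set(qa) | set(qb)):
        if qa.get(name) != qb.get(name):
            changed[name] = (qa.get(name), qb.get(name))
    return changed


# Illustrative URLs (assumption): real requests carry many more parameters.
first = "https://image.baidu.com/search/acjson?tn=resultjson_com&word=NBA&pn=30&rn=30"
second = "https://image.baidu.com/search/acjson?tn=resultjson_com&word=NBA&pn=60&rn=30"

print(diff_query_params(first, second))  # {'pn': ('30', '60')}
```

  With real captured URLs, other parameters may differ as well; pn is the one that moves in steps of rn (30 here).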

  

  4. Complete source code

  The packages needed for this crawl are requests and json.

import requests as rq
import json
import time
import os

SAVE_DIR = 'E://桌面/NBA'  # output folder; change this to a path on your machine
count = 1  # running index used to name the saved images


def crawl(page):
    """Fetch the batch of results starting at offset `page` and save the thumbnails."""
    global count
    # Create the output folder if it does not exist yet
    os.makedirs(SAVE_DIR, exist_ok=True)
    url = 'https://image.baidu.com/search/acjson?'
    header = {
        # 'Referer': 'https://image.baidu.com/search/index?ct=201326592&cl=2&st=-1&lm=-1&nc=1&ie=utf-8&tn=baiduimage&ipn=r&rps=1&pv=&fm=rs4&word',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
    }
    # Query parameters copied from the captured XHR request; only "pn" changes per batch
    param = {
        "tn": "resultjson_com",
        "logid": "11007362803069082764",
        "ipn": "rj",
        "ct": "201326592",
        "is": "",
        "fp": "result",
        "queryWord": "NBA",
        "cl": "2",
        "lm": "-1",
        "ie": "utf-8",
        "oe": "utf-8",
        "adpicid": "",
        "st": "-1",
        "z": "",
        "ic": "",
        "hd": "",
        "latest": "",
        "copyright": "",
        "word": "NBA",
        "s": "",
        "se": "",
        "tab": "",
        "width": "",
        "height": "",
        "face": "0",
        "istype": "2",
        "qc": "",
        "nc": "1",
        "fr": "",
        "expermode": "",
        "force": "",
        "pn": page,
        "rn": "30",
        "gsm": "1e",
        "1615565977798": "",
    }
    response = rq.get(url, headers=header, params=param, timeout=10)
    result = response.text
    # print(response.status_code)
    j = json.loads(result)
    # print(j)
    # Collect the thumbnail URL of every entry that has one
    img_list = []
    for i in j['data']:
        if 'thumbURL' in i:
            # print(i['thumbURL'])
            img_list.append(i['thumbURL'])
    # print(len(img_list))
    for n in img_list:
        r = rq.get(n, headers=header, timeout=10)
        # Images are binary data, so the file is opened in 'wb' mode
        with open(f'{SAVE_DIR}/{count}.jpg', 'wb') as f:
            f.write(r.content)
        count += 1


if __name__ == '__main__':
    # pn is the result offset: 30, 60, ..., 600 (20 batches of 30 images)
    for i in range(30, 601, 30):
        t1 = time.time()
        crawl(i)
        t2 = time.time()
        t = t2 - t1
        print('page {0} is over!!! took {1:.2f} s!'.format(i//30, t))
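  The parsing and saving steps can be tried without touching the network. The sketch below is an illustration under stated assumptions, not the author's code: the function names extract_thumb_urls and save_image are hypothetical, and the sample payload is made up to mimic the shape the script relies on (a "data" list whose entries may carry a thumbURL key).

```python
import json
import tempfile
from pathlib import Path


def extract_thumb_urls(payload_text):
    """Return the thumbURL of every entry in the "data" list that has one."""
    data = json.loads(payload_text)
    return [item["thumbURL"] for item in data.get("data", []) if item.get("thumbURL")]


def save_image(content, folder, index):
    """Write raw image bytes to <folder>/<index>.jpg and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{index}.jpg"
    path.write_bytes(content)  # binary write, the equivalent of open(..., 'wb')
    return path


# Made-up payload (assumption): two entries with a URL and one empty entry.
sample = json.dumps({
    "data": [
        {"thumbURL": "https://example.com/a.jpg"},
        {"thumbURL": "https://example.com/b.jpg"},
        {},
    ]
})
urls = extract_thumb_urls(sample)
print(urls)

# Write a few placeholder bytes to a temporary folder to exercise save_image.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(b"\xff\xd8\xff", tmp, 1)
    print(saved.name, saved.stat().st_size)
```

  Keeping these two steps as separate functions makes it easier to check the parsing logic against a saved response before running the full crawl.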
