Python Web Data Scraping (Target, Scraping Results, Detailed Step Analysis, Part 1)

优采云 Published: 2021-09-26 17:12


  1. Target:

  NBA images from Baidu Images

  


  2. Scraping results

  

  3. Detailed step analysis

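  (1) To decide whether a page is loaded dynamically, scroll the mouse wheel and watch whether the requests listed under XHR change. If the number of requests there grows as you scroll, the page is very likely fetching its content through dynamic requests. Analysis shows that Baidu Images is loaded dynamically.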

  

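  (2) Once the dynamically loaded request is found, analyze it; the hard part is the query parameters. Compare the requests for at least two search keywords, and the differences between successive requests, to see which parameters change and how (a quiet hint from the author: look for a query parameter named pn). The whole dynamic loading is in fact controlled by that one parameter. After that, send the request yourself, parse the response, and extract the image URLs from the data. (When saving the images, be sure to write them in binary mode!)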
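  The parameter comparison described in step (2) can be sketched with the standard library. This is only an illustration: the two URLs are shortened, hypothetical stand-ins for requests copied from the XHR panel, and diff_query_params is a helper name introduced here, not something from the original script.

```python
from urllib.parse import urlsplit, parse_qs


def diff_query_params(url_a, url_b):
    """Return {name: (values_in_a, values_in_b)} for each query parameter that differs."""
    # keep_blank_values=True so that empty parameters such as "z=" are not dropped
    qa = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    qb = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    changed = {}
    for name in sorted(set(qa) | set(qb)):
        if qa.get(name) != qb.get(name):
            changed[name] = (qa.get(name), qb.get(name))
    return changed


# Hypothetical, shortened request URLs (real captured requests carry many more parameters)
first = "https://image.baidu.com/search/acjson?tn=resultjson_com&word=NBA&pn=30&rn=30"
second = "https://image.baidu.com/search/acjson?tn=resultjson_com&word=NBA&pn=60&rn=30"

for name, (a, b) in diff_query_params(first, second).items():
    print(name, a, b)
```

  With these two sample URLs only pn is reported, which matches the hint above. Running the same comparison on requests for two different keywords would instead surface the keyword-carrying parameters.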

  

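  4. Complete source code

  The scraper relies on the requests and json packages (time and os from the standard library are used as well).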

import requests as rq
import json
import time
import os

# Running counter used to name the saved files: 1.jpg, 2.jpg, ...
count = 1


def crawl(page):
    global count
    # Create the output folder on first use (桌面 is the author's Desktop folder)
    if not os.path.exists('E://桌面/NBA'):
        os.mkdir('E://桌面/NBA')
    url = 'https://image.baidu.com/search/acjson?'
    header = {
        # 'Referer': 'https://image.baidu.com/search/index?ct=201326592&cl=2&st=-1&lm=-1&nc=1&ie=utf-8&tn=baiduimage&ipn=r&rps=1&pv=&fm=rs4&word',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
    }
    # Query parameters copied from the request seen in the XHR panel
    param = {
        "tn": "resultjson_com",
        "logid": "11007362803069082764",
        "ipn": "rj",
        "ct": "201326592",
        "is": "",
        "fp": "result",
        "queryWord": "NBA",
        "cl": "2",
        "lm": "-1",
        "ie": "utf-8",
        "oe": "utf-8",
        "adpicid": "",
        "st": "-1",
        "z": "",
        "ic": "",
        "hd": "",
        "latest": "",
        "copyright": "",
        "word": "NBA",
        "s": "",
        "se": "",
        "tab": "",
        "width": "",
        "height": "",
        "face": "0",
        "istype": "2",
        "qc": "",
        "nc": "1",
        "fr": "",
        "expermode": "",
        "force": "",
        "pn": page,  # the paging parameter identified in step 3 (2)
        "rn": "30",
        "gsm": "1e",
        "1615565977798": "",
    }
    response = rq.get(url, headers=header, params=param)
    result = response.text
    # print(response.status_code)
    j = json.loads(result)
    # print(j)
    # Collect the thumbnail URL of every entry that has one
    img_list = []
    for i in j['data']:
        if 'thumbURL' in i:
            # print(i['thumbURL'])
            img_list.append(i['thumbURL'])
    # print(len(img_list))
    # Download each image and write it in binary mode
    for n in img_list:
        r = rq.get(n, headers=header)
        with open(f'E://桌面/NBA/{count}.jpg', 'wb') as f:
            f.write(r.content)
        count += 1


if __name__ == '__main__':
    # pn runs from 30 to 600 in steps of 30, i.e. 20 requests
    for i in range(30, 601, 30):
        t1 = time.time()
        crawl(i)
        t2 = time.time()
        t = t2 - t1
        print('page {0} is over!!! took {1:.2f} s'.format(i//30, t))
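
  The extraction and saving steps inside crawl can be tried offline, without contacting Baidu. The sketch below is a minimal illustration under stated assumptions: the JSON payload and its example.com URLs are made up, shaped only like the 'data' list that crawl iterates over, and extract_thumb_urls and save_binary are hypothetical helper names that do not appear in the original script.

```python
import json
import os
import tempfile


def extract_thumb_urls(payload_text):
    """Collect every 'thumbURL' value from the 'data' list of a JSON payload."""
    payload = json.loads(payload_text)
    # Entries without a 'thumbURL' key are skipped, mirroring the check in crawl()
    return [item["thumbURL"] for item in payload.get("data", []) if "thumbURL" in item]


def save_binary(path, content):
    """Write raw bytes to disk; mode 'wb' stores the image bytes unmodified."""
    with open(path, "wb") as f:
        f.write(content)


# Hypothetical payload; the URLs are placeholders, not real Baidu results
sample = (
    '{"data": [{"thumbURL": "https://example.com/a.jpg"},'
    ' {"thumbURL": "https://example.com/b.jpg"}, {}]}'
)
urls = extract_thumb_urls(sample)
print(urls)

# Stand-in bytes instead of a downloaded image, saved into a temporary folder
target = os.path.join(tempfile.mkdtemp(), "1.jpg")
save_binary(target, b"\xff\xd8\xff\xe0 fake image bytes")
```

  Separating parsing from downloading this way makes it easier to check the parsing logic on a saved response before running the full loop.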
