Scraping Baidu Images with a Web Crawler (What a Web Crawler Is, robots.txt, Illustrated)

优采云 · Published: 2022-03-22 01:15

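What is a web crawler

A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, autoindex, emulator, and worm. (Source: Baidu Baike)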

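The crawler protocol

The Robots Protocol (also called the crawler protocol or robot protocol) is formally the "Robots Exclusion Protocol". Through it, a website tells search engines which pages may be crawled and which may not.

robots.txt is a plain text file that can be created and edited with any common text editor (for example Notepad, which ships with Windows). robots.txt is a convention, not an enforced command. It is the first file a search engine looks at when visiting a site, and it tells the spider which files on the server it may view. (Source: Baidu Baike)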
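
As a rough illustration of how a crawler might consult these rules before fetching a page, the sketch below uses Python's standard urllib.robotparser module. The rules in SAMPLE_ROBOTS_TXT, the example.com URLs, and the user agent name "my-crawler" are all made up for illustration; a real crawler would read the target site's own /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; a real crawler would fetch the
# site's own /robots.txt instead of hard-coding them.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def allowed(url, user_agent="my-crawler"):
    """Return True if the sample rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/images/cat.jpg"))
print(allowed("https://example.com/private/data.json"))
```

The parser applies the first rule whose path prefix matches, which is why the Disallow line is listed before the catch-all Allow line in this sample.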

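Crawling Baidu Images

Goal: crawl images from Baidu Images and save them to the local computer.

First question: is the data public, and can it be downloaded?

As the figure shows, images on Baidu can be downloaded directly, which means they can also be fetched by a crawler.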

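Next, what is an image?

An image is a visible representation of something; the word is a general term for pictures, photographs, rubbings, and the like. In technical drawing, a drawing is a basic term for a form that uses points, lines, symbols, text, and numbers to describe the geometric features, shape, position, and size of an object. With the development of digital acquisition technology and signal processing theory, more and more images are stored in digital form.

So where are the images kept?

The images are stored on cloud servers.

Every image has its own URL. Send a request to that URL with the requests module, then save the response body to a file opened in wb+ mode.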

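    import requests

    # Fetch a single image by URL and write the raw bytes to disk
    r = requests.get('http://pic37.nipic.com/20140113/8800276_184927469000_2.png')
    r.raise_for_status()  # stop here if the server did not return the image
    with open('demo.png', 'wb+') as f:  # the URL points at a PNG, so save it as .png
        f.write(r.content)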

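But nobody writes code to fetch a single image; downloading it by hand is quicker. The point of a crawler is batch downloading, and that is what makes it a real crawler.

First, a look at JSON

JSON (JavaScript Object Notation) is a lightweight data interchange format. It is based on a subset of ECMAScript (the JavaScript specification developed by the European Computer Manufacturers Association) and uses a text format that is completely independent of any programming language to store and represent data. Its concise, clear hierarchy makes JSON an ideal data interchange language.

JSON borrows JavaScript's object syntax; its job is to carry data that a program can then read.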

A JSON string

    {
        "name": "毛利",
        "age": 18,
        "feature": ["tall", "rich", "handsome"]
    }

A Python dict

    {
        'name': '毛利',
        'age': 18,
        'feature': ['tall', 'rich', 'handsome']
    }

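In Python, however, a JSON string is just text, so you cannot look up a value by its key directly. That is where Python's dict comes in.

Import json in Python and call json.loads(s) to convert JSON data into Python data (a dict).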
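
A minimal sketch of that conversion, using only the standard json module. The string s is a made-up sample in the same shape as the JSON string shown earlier in this section.

```python
import json

# A JSON document held as plain text; keys and strings use double quotes
s = '{"name": "毛利", "age": 18, "feature": ["tall", "rich", "handsome"]}'

# json.loads parses the text into Python objects (here, a dict)
data = json.loads(s)

print(type(data).__name__)   # the parsed top-level object is a dict
print(data["name"])          # values can now be looked up by key
print(data["feature"][0])    # JSON arrays become Python lists

# json.dumps goes the other way: Python data back to a JSON string
back = json.dumps(data, ensure_ascii=False)
print(back)
```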

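Ajax stands for "Asynchronous JavaScript And XML" and refers to a web development technique for building interactive web applications.

The images are loaded via Ajax: as I scroll down, more images load automatically because the page itself sends further requests in the background.

So we build the Ajax request URL ourselves, convert the JSON response into a dict, and read each image's URL out of the dict by key.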

    import requests
    import json

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
    # One Ajax request returns one JSON document; its 'data' key holds the image entries
    r = requests.get('https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&1561022599290=', headers=headers).text
    res = json.loads(r)['data']
    for index, i in enumerate(res):
        url = i.get('hoverURL')
        if not url:  # skip entries that carry no image URL
            continue
        print(url)
        # Download the image and save it under its index in the result list
        with open('{}.jpg'.format(index), 'wb+') as f:
            f.write(requests.get(url).content)

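One JSON response contains 30 images, so a single request yields 30 images. That is still not enough.

Start by comparing the requests behind two different JSON responses: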

    https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=60&rn=30&gsm=3c&1561022599355=

    https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&1561022599290=

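Comparing the two, the key difference between successive requests is pn, which grows by 30 each time, matching the 30 images per response. (gsm and the trailing numeric parameter also differ, but pn is the one that selects which batch comes back.)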
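
One way to generate the successive request URLs is to rewrite only the pn parameter of a known-good URL. The sketch below does this with the standard urllib.parse module. The helper names with_offset and page_urls are hypothetical, and BASE is an abbreviated stand-in for the full request URL, shortened for readability; whether the server accepts such a reduced parameter set is not something this article establishes.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_offset(url, pn):
    """Return a copy of url whose pn query parameter is set to pn."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as "is="
    query = [(k, str(pn) if k == 'pn' else v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(base_url, pages, page_size=30):
    """Build one URL per page: offsets 0, page_size, 2 * page_size, ..."""
    return [with_offset(base_url, page_size * i) for i in range(pages)]


# Abbreviated, illustrative URL; the real request carries many more parameters
BASE = ('https://image.baidu.com/search/acjson'
        '?tn=resultjson_com&word=%E5%9B%BE%E7%89%87&pn=30&rn=30')

urls = page_urls(BASE, 3)
for u in urls:
    print(u)
```

Rewriting the query string this way leaves every other parameter untouched, so the same helper works on the full URL as well.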

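Finally, wrap the code into functions: one acts as the producer, collecting the image URLs into a list as the pages come in, and the other acts as the consumer, taking that list and saving the images.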

    # -*- coding: utf-8 -*-
    # time: 2019/6/20 17:07
    # author: 毛利
    import requests
    import json
    import os


    def get_pic_url(num):
        # Producer: collect image URLs from num result pages (30 images per page)
        pic_url = []
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
        for i in range(num):
            # pn is the result offset, so page i starts at 30 * i
            page_url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30&gsm=1e&1561022599290='.format(30 * i)
            r = requests.get(page_url, headers=headers).text
            res = json.loads(r)['data']
            if res:
                print(res)
                for j in res:
                    url = j.get('hoverURL')
                    if url:
                        pic_url.append(url)
                    else:
                        print('No URL for this image')

        print(len(pic_url))
        return pic_url


    def down_img(num):
        # Consumer: download every collected URL into the target folder
        pic_url = get_pic_url(num)

        # Raw string so the backslash is not treated as an escape character
        path = r'D:\图片'
        os.makedirs(path, exist_ok=True)  # create the folder if it is missing

        for index, i in enumerate(pic_url):
            filename = os.path.join(path, str(index) + '.jpg')
            print(filename)
            with open(filename, 'wb+') as f:
                f.write(requests.get(i).content)


    if __name__ == '__main__':
        num = int(input('How many pages to crawl (30 images per page): '))
        down_img(num)
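
The producer/consumer split described above could also be expressed with a queue and a worker thread, so that saving can start before all URLs have been collected. This is a different technique from the list-based script in this article, shown only as a sketch. The names producer, consumer, run, and fake_save are hypothetical, and fake_save merely stands in for a real download so that the example needs no network access.

```python
import queue
import threading


def producer(urls, q):
    """Put every URL on the queue, then a None sentinel to signal the end."""
    for url in urls:
        q.put(url)
    q.put(None)


def consumer(q, save, results):
    """Take URLs off the queue until the sentinel arrives, saving each one."""
    while True:
        url = q.get()
        if url is None:
            break
        results.append(save(url))


def run(urls, save):
    """Run one consumer thread while the main thread produces URLs."""
    q = queue.Queue()
    results = []
    worker = threading.Thread(target=consumer, args=(q, save, results))
    worker.start()
    producer(urls, q)
    worker.join()
    return results


def fake_save(url):
    # Stand-in for a real download: just derive a file name from the URL
    return url.rsplit('/', 1)[-1]


saved = run(['http://example.com/a.jpg', 'http://example.com/b.jpg'], fake_save)
print(saved)
```

With a single consumer thread the queue preserves order, so files are saved in the same order the URLs were produced.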

The crawl in progress

The crawl results
