Scraping Baidu Images with a Web Crawler (Basics of Crawling Images from Baidu Image Search, Part 1)

优采云 · Published: 2022-01-04 07:08

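I. Preparation

Use Python to scrape and save images from an image search results page. The example query is 情绪图片 ("emotion pictures"); searching for it returns a page of image results (screenshot omitted). Note that although the title mentions Baidu, the code in this article requests Bing image search (cn.bing.com).

Press F12 to open the browser's developer tools and inspect the page source.

There you can see that the information we need for each image, its URL, sits in the src attribute of the <img> tags.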
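To make the img src observation concrete, here is a minimal sketch that lists the src attribute of every <img> tag in an HTML fragment. It uses only the standard library's html.parser so it runs without extra packages. The sample markup, the class name ImgSrcCollector, and the example.com URLs are made up for illustration and are not taken from the actual results page.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Made-up fragment standing in for a search results page
SAMPLE_HTML = """
<div class="results">
  <img class="thumb" src="https://example.com/a.jpg" alt="first">
  <img class="thumb" src="https://example.com/b.jpg" alt="second">
  <img class="spacer" alt="no source here">
</div>
"""

collector = ImgSrcCollector()
collector.feed(SAMPLE_HTML)
print(collector.sources)
```

The third tag has no src attribute, so it is skipped. On a real page the same check tells you which tags carry a usable link.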

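II. Implementation

The script uses the following libraries. requests and bs4 (BeautifulSoup) are third-party packages; re, time, and os ship with Python.

import os
import re
import time

import requests
from bs4 import BeautifulSoup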

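The approach breaks down into three small parts:

1. Fetch the page content

2. Parse the page

3. Save the images to the target location

Start with the first part, fetching the page content: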

baseurl = 'https://cn.bing.com/images/search?q=%E6%83%85%E7%BB%AA%E5%9B%BE%E7%89%87&qpvt=%e6%83%85%e7%bb%aa%e5%9b%be%e7%89%87&form=IGRE&first=1&cw=418&ch=652&tsc=ImageBasicHover'
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36 Edg/92.0.902.67"
}
response = requests.get(baseurl, headers=head)  # request the search results page
html = response.text  # response body decoded as text
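The long q= value in baseurl is simply the search keyword 情绪图片 percent-encoded as UTF-8. As a rough sketch, the URL for any keyword could be assembled with the standard library as shown here. The helper name build_search_url is hypothetical, and keeping only the q, form, and first parameters is an assumption: whether the site needs the remaining parameters from the article's URL has not been verified.

```python
from urllib.parse import urlencode


def build_search_url(keyword, first=1):
    """Build an image search URL for a keyword (hypothetical helper).

    Assumption: only q, form and first are kept from the article's URL;
    whether the site needs the other parameters has not been verified here.
    """
    params = {"q": keyword, "form": "IGRE", "first": first}
    # urlencode percent-encodes non-ASCII text as UTF-8
    return "https://cn.bing.com/images/search?" + urlencode(params)


url = build_search_url("情绪图片")
print(url)
```

The encoded keyword in the printed URL matches the q= value used in baseurl above, which confirms what the script is searching for.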

That part is straightforward.

The second part, parsing the page, is where most of the work is.

Here is the code:

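Img = re.compile(r'img.*src="(.*?)"')  # regex that captures the value of a src-like attribute
soup = BeautifulSoup(html, "html.parser")  # parse the HTML with BeautifulSoup
data = []  # list that collects the image links
for item in soup.find_all('img'):  # iterate over every <img> tag on the page
    item = str(item)  # convert the tag back to a string
    Picture = re.findall(Img, item)  # combine re with BeautifulSoup to keep only the link
    for b in Picture:
        if b.startswith('http'):  # skip empty or inline (data:) sources that cannot be downloaded
            data.append(b)
# In the complete script this logic lives inside getdata(), which returns the list of links.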

This step relies on BeautifulSoup and regular expressions (re), so some familiarity with both is assumed.
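One detail of the pattern img.*src="(.*?)" is easy to miss: the .* is greedy, so on a tag with several src-like attributes it captures the last one. The sketch below shows this on a made-up tag (the attribute values are illustrative only) and contrasts it with a stricter pattern, offered as one possible alternative rather than as the author's method.

```python
import re

# Made-up tag; attribute names mirror a lazy-loading pattern but are illustrative only
tag = '<img class="mimg" src="a.jpg" data-src="https://example.com/b.jpg"/>'

# Pattern from the article: the greedy .* runs to the last src=" in the string
greedy = re.compile(r'img.*src="(.*?)"')
greedy_result = greedy.findall(tag)
print(greedy_result)

# Alternative that requires whitespace right before src, so data-src is ignored
strict = re.compile(r'<img[^>]*?\ssrc="(.*?)"')
strict_result = strict.findall(tag)
print(strict_result)
```

The article's pattern returns the data-src value here, while the stricter one returns the plain src value. Which of the two holds the real image link depends on the page, so check the source in the developer tools before choosing.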

The third part saves the images:

i = 0  # counter used to number the saved files
for m in getdata(
        baseurl='https://cn.bing.com/images/search?q=%E6%83%85%E7%BB%AA%E5%9B%BE%E7%89%87&qpvt=%e6%83%85%e7%bb%aa%e5%9b%be%e7%89%87&form=IGRE&first=1&cw=418&ch=652&tsc=ImageBasicHover'):
    resp = requests.get(m)  # download the image
    byte = resp.content  # raw bytes of the response
    print(os.getcwd())  # print the current working directory
    i = i + 1  # increment the counter
    with open("path{}.jpg".format(i), "wb") as f:  # write the bytes to path<i>.jpg
        f.write(byte)
    time.sleep(0.5)  # pause 0.5 s between downloads; files land in the working directory (D://情绪图片测试 in the full script)
    print("Image {} downloaded successfully!".format(i))
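The saving step can also be written without changing the working directory. The following is a minimal sketch, assuming a hypothetical helper save_image that takes the bytes, a target folder, and the counter. It is demonstrated with placeholder bytes in a temporary directory so it runs without network access.

```python
import tempfile
from pathlib import Path


def save_image(data, folder, index):
    """Write image bytes to <folder>/path<index>.jpg and return the path (hypothetical helper)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    target = folder / "path{}.jpg".format(index)
    target.write_bytes(data)
    return target


# Demonstration with placeholder bytes in a temporary directory (no network access)
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(b"not a real image", Path(tmp) / "images", 1)
    saved_name = saved.name
    saved_size = saved.stat().st_size
    print(saved_name, saved_size)
```

In the download loop, resp.content would be passed as data. Creating the folder on demand avoids a failure when the target directory does not exist yet.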

Each line is explained in the code comments. If anything is unclear, send a private message or leave a comment.

The complete code follows:

import os
import re
import time

import requests
from bs4 import BeautifulSoup


def getdata(baseurl):
    """Return the list of image links found on the search results page."""
    Img = re.compile(r'img.*src="(.*?)"')  # regex that captures the value of a src-like attribute
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36 Edg/92.0.902.67"
    }
    response = requests.get(baseurl, headers=head)  # request the search results page
    html = response.text  # response body decoded as text
    soup = BeautifulSoup(html, "html.parser")  # parse the HTML with BeautifulSoup
    datalist = []  # image links, at most one per <img> tag
    for item in soup.find_all('img'):  # iterate over every <img> tag
        Picture = re.findall(Img, str(item))  # combine re with BeautifulSoup to keep only the link
        # if a tag yields several matches, keep the last one; skip links that cannot be downloaded
        if Picture and Picture[-1].startswith('http'):
            datalist.append(Picture[-1])
    return datalist  # list of image links


def main():
    baseurl = 'https://cn.bing.com/images/search?q=%E6%83%85%E7%BB%AA%E5%9B%BE%E7%89%87&qpvt=%e6%83%85%e7%bb%aa%e5%9b%be%e7%89%87&form=IGRE&first=1&cw=418&ch=652&tsc=ImageBasicHover'
    os.makedirs("D://情绪图片测试", exist_ok=True)  # create the target folder if it is missing
    os.chdir("D://情绪图片测试")
    print(os.getcwd())  # print the current working directory
    i = 0  # counter used to number the saved files
    for m in getdata(baseurl):  # fetch the page once and loop over its image links
        resp = requests.get(m)  # download the image
        byte = resp.content  # raw bytes of the response
        i = i + 1
        with open("path{}.jpg".format(i), "wb") as f:  # write the bytes to path<i>.jpg
            f.write(byte)
        time.sleep(0.5)  # pause 0.5 s between downloads
        print("Image {} downloaded successfully!".format(i))


if __name__ == '__main__':
    main()

Screenshot of the final run (image omitted).

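III. Summary

This run saved only 29 images. The same method applies to other pages: the key is to inspect each page's source and adapt the URL and parsing rules to it. Some sites also have anti-scraping measures, so take care when crawling. If anything is still unclear, send a private message.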
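One common courtesy before crawling a site is to check its robots.txt rules. This is separate from the anti-scraping measures mentioned above, but it is a cheap first check. The sketch below uses the standard library's urllib.robotparser on made-up rules. The sample rules and example.com URLs are illustrative assumptions; a real site publishes its own file at /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration; a real site publishes its own at /robots.txt
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /private/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS)  # parse rules from lines instead of fetching them

allowed = parser.can_fetch("*", "https://example.com/images/a.jpg")
blocked = parser.can_fetch("*", "https://example.com/private/a.jpg")
print(allowed, blocked)
```

Against a live site, set_url() followed by read() would load the real rules instead of the sample lines.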
