Scraping data from web pages (How do you download images from a web page in Python? A guide to using requests)

优采云 · Published: 2021-11-21 14:14

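How do you download images from a web page in Python? This article shows how to use the requests and BeautifulSoup libraries to extract and download the images on a single web page.

Have you ever wanted to download every image on a web page? In this tutorial, you will build a Python scraper that retrieves all image URLs from a given page and downloads them using requests and BeautifulSoup.

First, we need a few dependencies. Let's install them: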

　　pip3 install requests bs4 tqdm

Open a new Python file and import the necessary modules:

import requests
import os
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse

First, we create a URL validator to make sure that each URL passed in is valid. Some websites put encoded data where a URL would normally go, so we need to skip those:

def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

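The urlparse() function splits a URL into six components. We only need to check that netloc (the domain name) and scheme (the protocol) are both present.

Next, I'll write the core function that collects all image URLs on a web page: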
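To make the check concrete, here is a small sketch that runs the same validator against three sample strings. The sample values are hypothetical and were chosen only to show one absolute URL, one relative path, and one inline data URI (the "encoded data" case mentioned above).

```python
from urllib.parse import urlparse


def is_valid(url):
    """Return True if `url` has both a scheme and a network location."""
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)


# Hypothetical sample values, used only for illustration.
samples = [
    "https://example.com/logo.png",        # absolute URL: scheme and netloc
    "/static/logo.png",                    # relative path: neither
    "data:image/png;base64,iVBORw0KGgo=",  # data URI: scheme but no netloc
]

for sample in samples:
    parsed = urlparse(sample)
    print(f"{sample!r}: scheme={parsed.scheme!r}, "
          f"netloc={parsed.netloc!r}, valid={is_valid(sample)}")
```

Only the first sample passes. The data URI has a scheme but an empty netloc, which is why checking both fields filters it out.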

def get_all_images(url):
    """
    Returns all image URLs on a single `url`
    """
    soup = bs(requests.get(url).content, "html.parser")

The HTML content of the page is now in the soup object. To extract all img tags from the HTML, we use the soup.find_all("img") method. Let's see it in action:

    urls = []
    for img in tqdm(soup.find_all("img"), "Extracting images"):
        img_url = img.attrs.get("src")
        if not img_url:
            # if img does not contain src attribute, just skip
            continue

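This retrieves all img elements as a Python list.

I wrapped the list in a tqdm object only to print a progress bar. The URL of an img tag is held in its src attribute. Some tags have no src attribute, however, so the continue statement above skips them.

Now we need to make sure the URL is absolute: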
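The skipping behavior can be tried without a network request. The sketch below parses a small, made-up HTML snippet (an assumption for illustration, not a real page) and collects only the non-empty src values, leaving out the tqdm progress bar for brevity.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
html = """
<html><body>
  <img src="/static/logo.png" alt="logo">
  <img alt="placeholder without src">
  <img src="" alt="empty src">
  <img src="photos/cat.jpg">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
img_tags = soup.find_all("img")  # list-like collection of all <img> tags

srcs = []
for img in img_tags:
    src = img.attrs.get("src")  # None when the attribute is missing
    if not src:
        # Covers both a missing src and an empty one.
        continue
    srcs.append(src)

print(len(img_tags), srcs)
```

Because the test is `if not src`, a tag with an empty src attribute is skipped along with tags that have no src at all.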

        # make the URL absolute by joining domain with the URL that is just extracted
        img_url = urljoin(url, img_url)
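As an illustration of what urljoin does with the different forms a src value can take, the sketch below joins several hypothetical src values against a hypothetical page URL. None of these values come from the original tutorial.

```python
from urllib.parse import urljoin

# Hypothetical page URL and src values, used only for illustration.
page_url = "https://example.com/gallery/index.html"

src_values = [
    "img/cat.png",                      # relative to the page's directory
    "/static/logo.png",                 # relative to the site root
    "../up.png",                        # one directory up
    "//cdn.example.com/a.jpg",          # protocol-relative
    "https://other.example.org/b.gif",  # already absolute
]

resolved = [urljoin(page_url, src) for src in src_values]

for src, absolute in zip(src_values, resolved):
    print(f"{src!r} -> {absolute!r}")
```

A src that is already absolute comes back unchanged, so it is safe to pass every extracted value through urljoin.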

Some URLs contain HTTP GET key-value pairs that we don't want (for example, a URL ending in "/image.png?c=3.2.5"). Let's remove them:

        try:
            pos = img_url.index("?")
            img_url = img_url[:pos]
        except ValueError:
            pass

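We find the position of the '?' character and drop everything after it. If the URL contains no '?', index() raises ValueError, which is why the code is wrapped in a try/except block (there are certainly cleaner ways to do this; if you know one, please share it in the comments below).

Now let's make sure each URL is valid and return all the image URLs: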
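Two possible alternatives to the try/except approach are sketched below. This is not the tutorial's own method: `strip_query` is a hypothetical helper name, and the first variant also drops any "#fragment" part, which the original code does not do.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return `url` without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def strip_query_simple(url):
    """Keep everything before the first '?'; no exception handling needed."""
    return url.partition("?")[0]


# Hypothetical URLs, used only for illustration.
with_query = "https://example.com/image.png?c=3.2.5"
without_query = "https://example.com/image.png"

print(strip_query(with_query))
print(strip_query_simple(with_query))
print(strip_query(without_query))
```

str.partition returns the whole string as its first element when the separator is absent, so a URL without '?' passes through unchanged. Keep in mind that some servers choose the image based on query parameters, so removing them may not always yield the same file.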

        # finally, if the url is valid
        if is_valid(img_url):
            urls.append(img_url)
    return urls

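Now that we have a function that collects all the image URLs, we also need a function that downloads a file from the web with Python. I brought in the following function from an earlier tutorial: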

def download(url, pathname):
    """
    Downloads a file given a URL and puts it in the folder `pathname`
    """
    # if path doesn't exist, make that path dir
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    # download the body of response by chunk, not immediately
    response = requests.get(url, stream=True)
    # get the total file size
    file_size = int(response.headers.get("Content-Length", 0))
    # get the file name
    filename = os.path.join(pathname, url.split("/")[-1])
    # progress bar, changing the unit to bytes instead of iteration (default by tqdm)
    progress = tqdm(response.iter_content(1024), f"Downloading {filename}", total=file_size, unit="B", unit_scale=True, unit_divisor=1024)
    with open(filename, "wb") as f:
        for data in progress.iterable:
            # write data read to the file
            f.write(data)
            # update the progress bar manually
            progress.update(len(data))

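The function above takes the URL of the file to download and the path of the folder to save it in.

Related: How to convert HTML tables to CSV files in Python.

Finally, here is the main function: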
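The download function derives the local file name from the last "/"-separated piece of the URL. The sketch below shows that derivation on its own, next to a hypothetical, more defensive helper (`safe_filename` and its `default` parameter are assumptions, not part of the original code) for URLs that end in a slash or still carry a query string.

```python
import os
from urllib.parse import urlparse


def naive_filename(url):
    """Same expression the download function uses for the file name."""
    return url.split("/")[-1]


def safe_filename(url, default="image"):
    """Hypothetical helper: use the URL path's last segment, or `default` if empty."""
    name = os.path.basename(urlparse(url).path)
    return name or default


# Hypothetical URLs, used only for illustration.
urls = [
    "https://example.com/img/cat.png",
    "https://example.com/img/cat.png?c=1",
    "https://example.com/images/",
]

for url in urls:
    print(f"{url!r}: naive={naive_filename(url)!r}, safe={safe_filename(url)!r}")
```

For a URL that ends in "/", the naive expression yields an empty name, so the path passed to open() would be the folder itself and the write would most likely fail. In get_all_images the query string is already stripped before download is called, so the second case mainly matters if download is used on its own.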

def main(url, path):
    # get all images
    imgs = get_all_images(url)
    for img in imgs:
        # for each image, download it
        download(img, path)

This gets the URLs of all images on the page and downloads them one by one. Let's test it:

　　main("https://yandex.com/images/", "yandex-images")

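This downloads all images from that URL and stores them in the folder "yandex-images", which is created automatically.

Note, however, that some websites load their data with JavaScript. In that case, you should use the requests_html library instead. I wrote another script that makes a few adjustments to the original one and handles JavaScript rendering.

All right, we're done! Here are some ideas you can implement to extend the code: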
