Python web data scraping (a code implementation of the fetching functionality using the urllib module)

优采云  Published: 2021-10-19 02:16

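This article shows how to write two Python 3 scripts with the urllib module: one that fetches a web page, and one that picks out only the images on a page. Readers who need this can use it as a reference.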

The most basic code for fetching the content of a web page:

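    #!/usr/bin/env python3
    from urllib.request import urlretrieve


    def firstNonBlank(lines):
        # Return the first line that is not empty or whitespace-only.
        for eachLine in lines:
            if not eachLine.strip():
                continue
            else:
                return eachLine


    def firstLast(webpage):
        # Print the first and the last non-blank line of the downloaded file.
        with open(webpage, errors='replace') as f:
            lines = f.readlines()
        print(firstNonBlank(lines), end='')
        lines.reverse()
        print(firstNonBlank(lines), end='')


    def download(url='http://www', process=firstLast):
        # urlretrieve saves the page to a temporary file and returns
        # (filename, headers); only the file name is needed here.
        try:
            retval = urlretrieve(url)[0]
        except OSError:
            retval = None
        if retval:
            process(retval)


    if __name__ == '__main__':
        download()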
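As a hedged variation on the script above, the page can also be read straight into memory with urllib.request.urlopen instead of going through a temporary file. The sketch below is only an illustration: the names first_last_nonblank and fetch_text are hypothetical, and decoding as UTF-8 is an assumption, since a real page may declare a different charset. The sample text is local so the snippet runs without network access.

```python
from urllib.request import urlopen


def first_last_nonblank(text):
    """Return (first, last) non-blank lines of text, or (None, None) if there are none."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return None, None
    return lines[0], lines[-1]


def fetch_text(url, timeout=10):
    """Download url and decode the body as UTF-8 (an assumption), replacing bad bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')


# Local sample standing in for a downloaded page.
sample = "\n<html>\n<body>hello</body>\n</html>\n\n"
first, last = first_last_nonblank(sample)
print(first)  # <html>
print(last)   # </html>
```

To apply this to a live page, the result of fetch_text(url) would be passed to first_last_nonblank in place of the sample string.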

Using the urllib module to grab the images on a web page:

    import os
    import re
    import urllib.request

    targetDir = r"H:\pic"


    def destFile(path):
        # Build the local save path: targetDir plus the file name,
        # i.e. everything after the last '/' in the URL.
        if not os.path.isdir(targetDir):
            os.mkdir(targetDir)
        pos = path.rindex('/')
        t = os.path.join(targetDir, path[pos+1:])
        return t


    if __name__ == "__main__":
        hostname = "http://www.douban.com/"
        req = urllib.request.Request(hostname)
        webpage = urllib.request.urlopen(req)
        contentBytes = webpage.read()
        # Decode the bytes (UTF-8 is assumed here); str(contentBytes) would
        # give the b'...' repr rather than the page text.
        content = contentBytes.decode('utf-8', errors='replace')
        # The pattern contains two nested pairs of parentheses, so there are
        # two groups. findall returns a list of tuples, and only the text
        # matched inside the parentheses appears in it.
        match = re.findall(r'(http:[^\s]*?(jpg|png|gif))', content)
        for picname, picType in match:
            print(picname)
            print(picType)

        # Output:
        # http://img3.douban.com/pics/blank.gif
        # gif
        # http://img3.douban.com/icon/g111328-1.jpg
        # jpg
        # http://img3.douban.com/pics/blank.gif
        # gif
        # http://img3.douban.com/icon/g197523-19.jpg
        # jpg
        # http://img3.douban.com/pics/blank.gif
        # gif
        # ...
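The script above only prints the matched URLs; destFile is defined but never called in the portion shown. One way the download step might be completed is sketched below. This is an illustration under stated assumptions, not the author's code: the names find_image_urls, dest_file, and save_images are hypothetical, the target directory is passed as a parameter instead of the hard-coded H:\pic, and the pattern is widened to accept https and to stop at quotes and angle brackets.

```python
import os
import re
from urllib.request import urlretrieve

# Assumed pattern: http or https, no whitespace/quotes/angle brackets,
# ending in a .jpg, .png, or .gif extension.
IMAGE_URL = re.compile(r'https?:[^\s"<>]*?\.(?:jpg|png|gif)')


def find_image_urls(html):
    """Return the image URLs found in html, in page order, without duplicates."""
    seen = []
    for url in IMAGE_URL.findall(html):
        if url not in seen:
            seen.append(url)
    return seen


def dest_file(url, target_dir):
    """Local path for url: target_dir joined with the text after the last '/'."""
    return os.path.join(target_dir, url.rsplit('/', 1)[-1])


def save_images(html, target_dir):
    """Download every image URL in html into target_dir; return the saved paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in find_image_urls(html):
        path = dest_file(url, target_dir)
        try:
            urlretrieve(url, path)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            print('skipped', url, exc)
            continue
        saved.append(path)
    return saved


# Local sample so the matching step can be tried without network access.
sample = ('<img src="http://img3.douban.com/pics/blank.gif">'
          '<img src="http://img3.douban.com/icon/g111328-1.jpg">'
          '<img src="http://img3.douban.com/pics/blank.gif">')
print(find_image_urls(sample))
```

Calling save_images(content, r"H:\pic") with the decoded page text would mirror the original targetDir. Note that images sharing the same file name overwrite one another in this sketch.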

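That covers the Python 3 scripts for fetching a web page and for picking out only the images on a page. For more details, see the other related articles on html中文网.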
