PHP web content scraping (garbled text when scraping and saving an HTML page with Python)

优采云 | Published: 2022-02-09 22:07

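When you scrape an HTML page with Python and save it, the saved content often comes out garbled. There are two possible causes. Either the encoding settings in your own code are wrong, or your settings are correct but the page's actual encoding does not match the encoding it declares. The declared encoding is the charset the page states about itself, typically in a <meta> charset declaration in its markup or in the charset parameter of the HTTP Content-Type header.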
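To make the mismatch concrete, here is a minimal Python 3 sketch, not taken from the original article. It reads the charset a page declares in its <meta> tag and checks whether the raw bytes are actually valid in that encoding. The helper names `declared_charset` and `decodes_cleanly` and the sample page are hypothetical, introduced only for illustration.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE
)


def declared_charset(page_bytes):
    """Return the charset named in the page's <meta> tag, or None."""
    match = META_CHARSET.search(page_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def decodes_cleanly(page_bytes, encoding):
    """Return True if the bytes are valid in the given encoding."""
    try:
        page_bytes.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# Hypothetical page: it declares utf-8 but was actually saved as GB18030.
page = (
    '<html><head><meta charset="utf-8"></head>'
    "<body>中文网页</body></html>"
).encode("gb18030")

declared = declared_charset(page)
print(declared, decodes_cleanly(page, declared))    # utf-8 False
print("gb18030", decodes_cleanly(page, "gb18030"))  # gb18030 True
```

A page like this is exactly the case where trusting the declared charset produces garbled output or a decode error.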

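Here is a simple workaround: use chardet to detect the page's actual encoding, and read the declared encoding from the information returned by the URL request. If the two encodings differ, parse the content with the bs4 module, widening the source encoding to GB18030. If they are the same, write the content straight to the file (the system default encoding is set to utf-8):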

    # Python 2 script (urllib2, print statements, sys.setdefaultencoding).
    import sys
    import urllib2

    import bs4
    import chardet

    # Make utf-8 the default codec for implicit str/unicode conversions.
    reload(sys)
    sys.setdefaultencoding('utf-8')


    def download(url):
        htmlfile = open('test.html', 'w')
        try:
            result = urllib2.urlopen(url)
            content = result.read()
            info = result.info()
            result.close()
        except Exception, e:
            print 'download error!!!'
            print e
        else:
            if content != None:
                charset1 = (chardet.detect(content))['encoding']  # real encoding type
                charset2 = info.getparam('charset')  # declared encoding type
                print charset1, ' ', charset2
                # case1: charset is not None.
                if charset1 != None and charset2 != None and charset1.lower() != charset2.lower():
                    newcont = bs4.BeautifulSoup(content, from_encoding='GB18030')  # coding: GB18030
                    for cont in newcont:
                        htmlfile.write('%s\n' % cont)
                # case2: either charset is None, or charset is the same.
                else:
                    # print sys.getdefaultencoding()
                    htmlfile.write(content)  # default coding: utf-8
        htmlfile.close()


    if __name__ == "__main__":
        url = 'https://www.php1.cn'
        download(url)
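The script above targets Python 2. As a rough Python 3 sketch of the same idea, the version below uses only the standard library. Note the swapped-in technique: instead of chardet, it tries a strict decode with the declared charset, then UTF-8, and only then falls back to GB18030. The function names `decode_page` and `download` (in this form) are assumptions for illustration, not the article's code.

```python
import urllib.request


def decode_page(content, declared):
    """Decode raw page bytes, preferring the declared charset.

    Falls back to UTF-8 and then GB18030 when the declared charset is
    missing, unknown, or does not match the bytes.
    """
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return content.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            pass
    # Last resort, mirroring the GB18030 fallback in the original script.
    return content.decode("gb18030", errors="replace")


def download(url, path="test.html"):
    """Fetch url and save the page to path as UTF-8 text."""
    with urllib.request.urlopen(url) as response:
        content = response.read()
        # Charset from the HTTP Content-Type header, or None if absent.
        declared = response.headers.get_content_charset()
    text = decode_page(content, declared)
    with open(path, "w", encoding="utf-8") as htmlfile:
        htmlfile.write(text)
    return text


# Offline check of the fallback: GB18030 bytes wrongly declared as utf-8.
raw = "中文网页".encode("gb18030")
print(decode_page(raw, "utf-8"))  # 中文网页
```

Calling `download("https://www.php1.cn")` would fetch and save the page used in the article. One limitation of the trial decode: a single-byte declared charset such as ISO-8859-1 accepts any byte sequence, so this sketch cannot catch that kind of mismatch, whereas a statistical detector such as chardet might.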

Opening the resulting test.html shows that it is stored as UTF-8 without a BOM, which is the default encoding we set.
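The same check can be done programmatically instead of in an editor. The following is a small sketch under the assumption that the output file is the test.html written by the script. The helper names `has_utf8_bom` and `is_valid_utf8` are hypothetical.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the 3-byte UTF-8 BOM."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For a file saved as described above, `is_valid_utf8("test.html")` should be true and `has_utf8_bom("test.html")` should be false.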

