Content share: scraping WeChat official account articles in practice

Published by 优采云: 2020-09-06 05:54

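Scraping WeChat official account articles: a hands-on walkthrough

In the previous crawler exercise we searched WeChat official account articles by keyword and collected a list of matching article titles, links, and so on. Once an article has been found, it needs to be saved. The goal of this experiment is to scrape the body content of a WeChat official account article.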

Environment

Python 3

Main libraries: requests and pyquery

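Step-by-step analysis

This walkthrough uses an article from the CSDN official account as its example: "Scraping Beijing second-hand housing data with Python to analyze whether Beijing drifters can afford a home (complete source code attached)".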

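After requesting this page, we extract the article title, the author, the official account information, and the article body. Because we want to display the article, we extract the body as HTML so that its formatting is preserved. Finally, we merge everything we extracted into a single HTML file, so that a browser renders the article in its original format.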
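As a rough sketch of the request step, the snippet below fetches a page with a randomly chosen User-Agent header. It uses urllib from the standard library in place of requests so that it runs without extra installs. The header pool HEADER_POOL, the helper names pick_headers and fetch_html, and the User-Agent strings are illustrative assumptions, not the article's own code.

```python
import random
import urllib.request

# Hypothetical pool of request headers; the article's own `headers` list is not shown.
HEADER_POOL = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 "
                   "(KHTML, like Gecko) Version/13.1.2 Safari/605.1.15"},
]


def pick_headers(pool=HEADER_POOL):
    """Return a copy of one randomly chosen header dict from the pool."""
    return dict(random.choice(pool))


def fetch_html(url, timeout=10):
    """Download a page and return its body decoded as UTF-8 text."""
    request = urllib.request.Request(url, headers=pick_headers())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The returned string can then be handed to pyquery, for example `doc = PyQuery(fetch_html(url))`, to obtain the `doc` object used in the extraction steps.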

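Note that all images in the article are hosted remotely and will not load when the saved HTML is opened locally. We therefore extract the image links, download the images to the local disk, and replace the image links in the HTML with the local paths.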

Experiment procedure

Locating the article title

The article title can be located as follows:

# doc is the PyQuery object built from the downloaded page
title = doc.find('.rich_media_title').text()
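Because the title is later used as the name of the output directory, characters that are not allowed in file names can make directory creation fail. A minimal sketch of sanitizing it follows. The helper safe_dirname and its replacement rules are assumptions for illustration and are not part of the original script.

```python
import re


def safe_dirname(title, max_length=80):
    """Turn an article title into a string that is safe to use as a directory name."""
    # Replace characters that Windows or POSIX file systems reject, plus control characters.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title)
    # Collapse runs of whitespace and trim spaces and dots at both ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned[:max_length] or "untitled"
```

Calling `safe_dirname(title)` before building the directory path keeps the rest of the script unchanged.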

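Locating the official account information

The author, the official account name, and the account's WeChat ID and description can all be extracted with pyquery:

# Official account information
author = doc.find('#meta_content .rich_media_meta_text').text()
source = doc.find('#js_name').text()
source_info = doc.find('.profile_meta_value').text()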

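Locating the article body

The body should not be extracted with text(): that returns only the text, without any formatting, and the result displays poorly. Instead we use html() to keep the HTML elements of the body:

# Article body, kept as a PyQuery object so its img tags can still be edited;
# call content.html() to get the markup as a string
content = doc.find('.rich_media_content')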

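Extracting all image links

Every image link is stored in the data-src attribute of an img element:

# All image links
pic = []
pics_src = content.find('img').items()
for each in pics_src:
    src = each.attr('data-src')
    # Skip images without data-src; keep only links containing '=',
    # since the file extension is later read from the text after '='
    if src and '=' in src:
        pic.append(src)
# print(pic)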
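To illustrate the data-src lookup without any third-party dependency, here is a small sketch that uses html.parser from the standard library instead of pyquery. The class name ImageLinkCollector and the sample HTML fragment are made up for the example.

```python
from html.parser import HTMLParser


class ImageLinkCollector(HTMLParser):
    """Collect the data-src value of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        src = dict(attrs).get("data-src")
        if src:
            self.links.append(src)


# Made-up fragment that mimics lazily loaded images in an article body.
sample = (
    '<div class="rich_media_content">'
    '<img data-src="https://mmbiz.qpic.cn/mmbiz_png/abc123/640?wx_fmt=png">'
    '<img class="placeholder">'
    '<img data-src="https://mmbiz.qpic.cn/mmbiz_jpg/def456/640?wx_fmt=jpeg">'
    "</div>"
)

collector = ImageLinkCollector()
collector.feed(sample)
print(collector.links)
```

The img tag without a data-src attribute is skipped, which mirrors the guard needed before testing `'=' in ...` on the attribute value.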

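Downloading the images

Download every image in the article and save it to a local directory:

import os
import random

import requests

# The function name is missing in the source; download_pic is a placeholder.
# headers is expected to be a list of header dicts defined elsewhere in the script.
def download_pic(title, url):
    print(url)
    # The fifth '/'-separated piece of the URL becomes the file name,
    # and the text after the first '=' becomes the extension
    pic_name = url.split('/')[4]
    pic_type = url.split('=')[1]
    try:
        response = requests.get(url, headers=random.choice(headers))
        if response.status_code == 200:
            file_dir = "{0}/{1}".format(os.getcwd(), title)
            if not os.path.isdir(file_dir):
                os.mkdir(file_dir)
            path = os.path.join(file_dir, pic_name + '.' + pic_type)
            if not os.path.exists(path):
                with open(path, 'wb') as f:
                    f.write(response.content)
    except (requests.RequestException, OSError) as exc:
        # Skip images that fail to download or save, but report why
        print('failed to save {0}: {1}'.format(url, exc))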
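The file name logic above depends on the exact position of each '/' and '=' in the URL, so an extra query parameter would end up inside the extension. A slightly more defensive sketch parses the URL with urllib.parse instead. The helper name local_image_name, the fallback extension, and the sample URLs in the test are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def local_image_name(url, default_ext="jpg"):
    """Build '<id>.<ext>' from an image URL, mirroring url.split('/')[4] and the wx_fmt value."""
    parts = urlsplit(url)
    # Path segments without empty strings, e.g. ['mmbiz_png', 'abc123', '640'].
    segments = [s for s in parts.path.split("/") if s]
    # url.split('/')[4] in the original corresponds to the second path segment.
    if len(segments) > 1:
        name = segments[1]
    elif segments:
        name = segments[0]
    else:
        name = "image"
    # Read the format from the query string instead of splitting on '='.
    ext = parse_qs(parts.query).get("wx_fmt", [default_ext])[0]
    return "{0}.{1}".format(name, ext)
```

Using the same helper in both the download step and the link replacement step keeps the saved file names and the rewritten src values in agreement.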

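Replacing the image links

Replace the image links in the body with links to the local images. Note that each img element stores its real link in data-src rather than in src, so we add a src attribute to the img element and set it to the location of the saved image.

for item in content.find('img').items():
    pic_url = item.attr('data-src')
    # Skip images without a usable data-src
    if pic_url and '=' in pic_url:
        pic_name = pic_url.split('/')[4]
        pic_type = pic_url.split('=')[1]
        image = pic_name + '.' + pic_type
        # Point src at the local file saved next to index.html
        item.attr('src', image)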
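The same rewrite can be sketched on the serialized HTML string with regular expressions from the standard library, as a stand-in for pyquery's attr(). This is an illustration only: regular expressions are fragile on general HTML, the sketch assumes double-quoted attributes, and the names point_images_to_local and local_names are hypothetical.

```python
import re

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
DATA_SRC = re.compile(r'\bdata-src="([^"]*)"')


def point_images_to_local(html, local_names):
    """Add src="<local file>" to every <img> whose data-src is a key of local_names."""

    def rewrite(match):
        tag = match.group(0)
        found = DATA_SRC.search(tag)
        if not found or found.group(1) not in local_names:
            return tag  # leave unrelated images untouched
        local = local_names[found.group(1)]
        # Drop any existing src attribute so the tag does not end up with two.
        tag = re.sub(r'\ssrc="[^"]*"', "", tag)
        # Insert the new src right after "<img"; other attributes are kept.
        return tag[:4] + ' src="{0}"'.format(local) + tag[4:]

    return IMG_TAG.sub(rewrite, html)
```

Here local_names maps each remote image URL to the file name it was saved under, so images that were never downloaded keep their original tag.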

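Generating index.html

Wrap the previously extracted title, author, and official account information in HTML, append the article body, and write the result to index.html. Opening that file in a browser shows the article.

file_dir = "{0}/{1}".format(os.getcwd(), title)
path = os.path.join(file_dir, 'index.html')
# The HTML tags in these strings were lost in the source; <h1> and <p> are placeholders
index = '<h1>' + title + '</h1>'
index += '<p>' + author + ' ' + source + '</p>'
index += '<p>' + source_info + '</p>'
# content is a PyQuery object, so serialize it to an HTML string before concatenating
index += content.html()
with open(path, 'wb') as f:
    f.write(index.encode('utf-8'))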
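One possible refinement, sketched below under stated assumptions, is to escape the plain-text fields and to declare the encoding in the page, so the browser does not have to guess that the file is UTF-8. The function names build_index and write_index and the tag layout are illustrative, not the article's markup.

```python
import html
from pathlib import Path


def build_index(title, author, source, source_info, body_html):
    """Return a complete HTML page; the tag layout is a guess, not the article's markup."""
    # Escape the plain-text fields; body_html is already markup and is inserted verbatim.
    parts = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        "<title>{0}</title></head><body>".format(html.escape(title)),
        "<h1>{0}</h1>".format(html.escape(title)),
        "<p>{0} | {1}</p>".format(html.escape(author), html.escape(source)),
        "<p>{0}</p>".format(html.escape(source_info)),
        body_html,
        "</body></html>",
    ]
    return "\n".join(parts)


def write_index(directory, page):
    """Write the page as UTF-8 to <directory>/index.html and return the path."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    path = target / "index.html"
    path.write_text(page, encoding="utf-8")
    return path
```

With the variables from the earlier steps, this would be called as `write_index(file_dir, build_index(title, author, source, source_info, content.html()))`.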

Results and discussion

For the complete code, see 微信2.py.

Run the code:

python 微信2.py https://mp.weixin.qq.com/s/QAwrisNuu1dThFbs__wF_Q
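The source does not show how the script reads the article URL from the command line. One common approach, sketched here with argparse, is a single positional argument. The function name parse_args and the help text are assumptions for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line: one positional argument, the article URL."""
    parser = argparse.ArgumentParser(
        description="Save a WeChat official account article as local HTML."
    )
    parser.add_argument("url", help="URL of the article to save")
    return parser.parse_args(argv)
```

Inside the script's entry point, `args = parse_args()` would then give the URL as `args.url`, and running the script without a URL prints a usage message instead of failing later.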

The generated index.html looks like this:

[Screenshot of the generated index.html]
