How to batch-collect high-quality articles (the article before last: how a crawler scrapes WeChat official account articles (II))

优采云 · Published: 2021-11-20 07:07


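How a Python crawler can scrape the articles pushed by WeChat official accounts every day

Previous article: how to scrape WeChat official account articles

First article: how to scrape WeChat official account articles with a Python crawler (II)

The articles above covered how to collect the URLs of an official account's past articles in bulk, and how to scrape those articles in bulk, extract the required data, and store it in a database.

This article describes how to automatically capture the articles an official account pushes each day, extract the data, and store it in a database.

First, meet wxpy, a WeChat interface built on itchat. It improves the module's usability through extensive interface refinements and adds a rich set of extensions.

wxpy can receive the articles pushed by official accounts, but it only exposes each article's title (title), summary (summary), link (url), and cover image (cover). I added two more attributes on top of that: the publication time (pub_time) and the source (source).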

    @property
    def articles(self):
        """
        List of articles in an official account push (the title/URL of the
        first article are the same as the text/url of the message)

        Each article has the following attributes:

        * `title`: title
        * `summary`: summary
        * `url`: article URL
        * `cover`: cover or thumbnail URL
        * `pub_time`: publication time (added)
        * `source`: name of the publishing account (added)
        """

        from wxpy import MP
        if self.type == SHARING and isinstance(self.sender, MP):
            tree = ETree.fromstring(self.raw['Content'])
            # noinspection SpellCheckingInspection
            items = tree.findall('.//mmreader/category/item')

            article_list = list()

            for item in items:
                def find_text(tag):
                    # Return the tag's text, or None when the tag is absent
                    found = item.find(tag)
                    if found is not None:
                        return found.text

                article = Article()
                article.title = find_text('title')
                article.summary = find_text('digest')
                article.url = find_text('url')
                article.cover = find_text('cover')
                article.pub_time = find_text('pub_time')
                article.source = find_text('.//name')
                article_list.append(article)

            return article_list

These two attributes also need to be added to article.py:

        # Publication time
        self.pub_time = None
        # Source
        self.source = None
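Since pub_time comes back from the XML as a string, it may help to convert it before use. The sketch below assumes the value is a Unix timestamp in seconds, which a value such as 1566993086 suggests; the helper name parse_pub_time is hypothetical and not part of wxpy.

```python
from datetime import datetime, timezone


def parse_pub_time(pub_time):
    """Convert a Unix-epoch-seconds string to a timezone-aware UTC datetime.

    Assumption: pub_time holds epoch seconds. Returns None when the value
    is missing or is not an integer string.
    """
    if pub_time is None:
        return None
    try:
        seconds = int(pub_time)
    except ValueError:
        return None
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(parse_pub_time("1566993086"))  # 2019-08-28 11:51:26+00:00
```

Keeping the datetime in UTC and converting to local time only for display avoids depending on the time zone of the machine that runs the bot.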

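In fact, each item carries several other fields as well, and they can be read with ElementTree as needed. The XML tag names of the original listing are missing here; the values that remain are: 5, 1, 1566993086, 100020868, 1, 963025857335934976, 2, 1.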
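Reading extra fields with ElementTree can be sketched as follows. The tag names title, digest, url, cover, pub_time and name are the ones used in the articles property; the surrounding nesting (msg/appmsg, sources/source) and all values in SAMPLE are assumptions made up for illustration, and parse_items is a hypothetical helper.

```python
import xml.etree.ElementTree as ETree

# Hypothetical payload: the real push XML may be nested differently.
SAMPLE = """
<msg>
  <appmsg>
    <mmreader>
      <category>
        <item>
          <title>Sample title</title>
          <digest>Sample summary</digest>
          <url>https://example.com/a</url>
          <cover>https://example.com/a.jpg</cover>
          <pub_time>1566993086</pub_time>
          <sources><source><name>KMTV</name></source></sources>
        </item>
      </category>
    </mmreader>
  </appmsg>
</msg>
""".strip()

# Output key -> ElementTree path, relative to each <item>
FIELDS = {
    "title": "title",
    "summary": "digest",
    "url": "url",
    "cover": "cover",
    "pub_time": "pub_time",
    "source": ".//name",
}


def parse_items(xml_text, fields):
    """Return one dict per <item>; a missing tag yields None."""
    tree = ETree.fromstring(xml_text)
    rows = []
    for item in tree.findall('.//mmreader/category/item'):
        row = {}
        for key, path in fields.items():
            found = item.find(path)
            row[key] = found.text if found is not None else None
        rows.append(row)
    return rows


for row in parse_items(SAMPLE, FIELDS):
    print(row["title"], row["pub_time"], row["source"])
```

Adding another field then only means adding one entry to FIELDS once its tag name is known.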

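After the changes above, the function that implements the main logic in the earlier article "How a Python crawler scrapes WeChat official account articles (II)" also has to change: the URL and the publication time are now passed in as parameters rather than as a list.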

    def wechat_run(self, url, pub_time):  # main logic; the module needs `import re` and `import pymysql`
        # Open the database connection (host / user / password / database name)
        db = pymysql.connect(host="localhost", user="root",
                             password="root", database="weixin_database")
        # Create a cursor object with cursor()
        cursor = db.cursor()
        html_str = self.parse_url(url)
        content_list = self.get_content_list(html_str)
        title = ''.join(content_list[0]["title"])
        other = '\n'.join(content_list[0]["other"])
        create_time = pub_time

        # Case number, of the form "（20xx）…号"
        p1 = re.compile(r'\s*[（|(]20\d+[）|)]\s*[\u4e00-\u9fa5]*[\d]*[\u4e00-\u9fa5]+[\d]+号', re.S)
        anhao = p1.search(other)
        anhao = anhao.group().replace("\n", "") if anhao else ""

        # Key points of the ruling ("裁判要旨" / "裁判要点")
        p2 = re.compile(r'\s[【]*裁判要[\u4e00-\u9fa5]\s*.*?(?=[【]|裁判文)', re.S)
        zhaiyao = ''.join(re.findall(p2, other)).replace("\n", "")

        # NOTE: this pattern appears truncated (its HTML tags seem to be missing);
        # as written it matches the empty string, so restore the full pattern before use.
        p3 = re.compile('.*?', re.S)
        html = p3.search(html_str)
        if html:
            html = html.group().replace("\n", "")
        else:
            html = html_str.replace("\n", "")

        # Use placeholders so the driver escapes the values; building the
        # statement with str.format breaks on quotes and allows SQL injection.
        sql = """INSERT INTO weixin_table(title,url,anhao,yaozhi,other,html,create_time,type_id)
                 VALUES (%s,%s,%s,%s,%s,%s,%s,%s)"""
        try:
            # Execute the SQL statement
            cursor.execute(sql, (title, url, anhao, zhaiyao, other, html, create_time, 4))
            # Commit the transaction
            db.commit()
            print("Insert succeeded")
        except Exception as exc:
            print("Insert failed:", type(exc).__name__, ":", exc)
            # Roll back on error
            db.rollback()

        # 3. Save the HTML
        page_name = title
        self.save_html(html, page_name)
        # Close the database connection
        db.close()
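To see what the case-number pattern in wechat_run picks up, here is a small standalone sketch. The pattern is copied from the function; the sample text and the helper name extract_case_number are made up for illustration.

```python
import re

# Same pattern as p1 in wechat_run
CASE_NO = re.compile(
    r'\s*[（|(]20\d+[）|)]\s*[\u4e00-\u9fa5]*[\d]*[\u4e00-\u9fa5]+[\d]+号', re.S)


def extract_case_number(text):
    """Return the first case number found in text, or "" when there is none."""
    match = CASE_NO.search(text)
    return match.group().replace("\n", "") if match else ""


# Hypothetical article body
sample = "案号：（2019）京01行终123号\n裁判要旨：……"
print(extract_case_number(sample))  # （2019）京01行终123号
```

The pattern accepts both full-width and ASCII parentheses around the year, which matters because pushed articles do not use them consistently.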

Next, write the function that receives WeChat messages and official account pushes:

# -*- coding: utf-8 -*-
# @Time : 2019/8/29 8:31 AM
# @Author : jingyoushui
# @Email : jingyoushui@163.com
# @File : wechat.py
# @Software: PyCharm
from beijing import WeixinSpider_1
from wxpy import *
import pandas as pd

bot = Bot(cache_path=True, console_qr=True)


# Print messages from other friends, group chats, and official accounts
@bot.register()
def print_others(msg):
    print('msg:' + str(msg))
    articles = msg.articles
    if articles is not None:
        for article in articles:
            a = str(article.source)
            print('title:' + str(article.title))
            print('url:' + str(article.url))
            # str() guards against a missing pub_time (None)
            print('pub_time:' + str(article.pub_time))
            print('source:' + a)
            # Only handle the official accounts we want to scrape
            if a not in ("KMTV", "北京行政裁判观察"):
                continue
            url = str(article.url)
            pub_time = article.pub_time
            content_list = [[str(article.title), url, pub_time]]
            name = ['title', 'link', 'create_time']
            test = pd.DataFrame(columns=name, data=content_list)
            if a == "KMTV":
                test.to_csv("everyday_url/kmtv.csv", mode='a', encoding='utf-8')
                print("Saved")
            if a == "北京行政裁判观察":
                test.to_csv("everyday_url/beijing.csv", mode='a', encoding='utf-8')
                print("Saved")
            weixin_spider_1 = WeixinSpider_1()
            weixin_spider_1.wechat_run(url, pub_time)


if __name__ == '__main__':
    # Block the thread so the bot keeps running
    bot.join()

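The program first collects the title, URL, publication time, and source of each article pushed by the target official accounts and saves them to a CSV file. It then calls the wechat_run function of the WeixinSpider_1 class to parse the URL, extract the data, and save it to the database.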
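One detail of the CSV step: calling to_csv with mode='a' writes the header row again on every append. A minimal sketch of an alternative follows, swapping pandas for the standard csv module and writing the header only once; the helper name append_article is hypothetical, and the column names are the ones used in wechat.py.

```python
import csv
import os

FIELDS = ['title', 'link', 'create_time']


def append_article(csv_path, title, link, create_time):
    """Append one row; write the header only when the file is new or empty."""
    directory = os.path.dirname(csv_path)
    if directory:
        # Create the target folder (e.g. everyday_url/) if it does not exist
        os.makedirs(directory, exist_ok=True)
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(FIELDS)
        writer.writerow([title, link, create_time])
```

Inside print_others this could be called as append_article("everyday_url/kmtv.csv", str(article.title), url, pub_time) in place of building a one-row DataFrame.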

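Run the program in a terminal. It prints a QR code; scan it with WeChat on your phone to log in.

Because the program blocks the thread, it stays logged in. If you log in to the same account on WeChat for the web, however, the session is kicked off and the program exits.

Addendum, August 30: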
