Keyword-based article collection system (scraping the titles and content of multiple articles for multiple keywords)

Published by 优采云: 2021-12-18 05:05


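Python crawler: scraping the titles and content of multiple articles that contain given keywords (optimized version)

This article should be read together with the previous one, "Python crawler: scraping the titles and content of multiple articles that contain a keyword".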

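What it does

The script scrapes article titles and content for several keywords. The keywords are placed in an array and iterated over with a for loop. A folder is created for each keyword, and each article is saved in its own txt file. For my own test run I limited it to 2 result pages and 2 keywords in the array.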
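The output layout described here (one folder per keyword, one txt file per article) can be sketched without any network access. This is only a minimal illustration: `article_path` and `save_article` are hypothetical helper names that do not appear in the original script, and the character class is copied from the script's `reg` pattern.

```python
import os
import re
import tempfile

# Same character class as the script's `reg`: everything that is not an ASCII
# letter, a digit, or a common CJK character is stripped from the title.
NON_NAME_CHARS = re.compile(r"[^0-9A-Za-z\u4e00-\u9fa5]")


def article_path(base_dir, keyword, title):
    """Return base_dir/<keyword>/<sanitized title>.txt (hypothetical helper)."""
    safe_title = NON_NAME_CHARS.sub("", title)
    return os.path.join(base_dir, keyword, safe_title + ".txt")


def save_article(base_dir, keyword, title, body):
    """Create the keyword folder if needed and write one article into it."""
    path = article_path(base_dir, keyword, title)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(body)
    return path


if __name__ == "__main__":
    # Write into a temporary directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_article(tmp, "软件和信息技术服务业", "Hello, 世界!", "sample text")
        print(os.path.relpath(saved, tmp))
```

Using `os.makedirs(..., exist_ok=True)` replaces the separate existence check, which is a small design choice of this sketch rather than something the original script does.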

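Code design

See the previous Python crawler article on scraping the titles and content of articles that contain a keyword. This code is a further optimization of that one, so the design is largely the same.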
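The first step of that design, building one search-result URL per keyword and page, can be sketched as follows. The endpoint and query parameters are taken from the script's own URL string. `search_urls` is a hypothetical name, and percent-encoding the keyword explicitly with `urlencode` is an assumption of this sketch (the original concatenates the raw keyword into the URL).

```python
from urllib.parse import urlencode

# Endpoint as it appears in the script's search URL.
SEARCH_URL = "http://www.ofweek.com/newquery.action"


def search_urls(keywords, pages):
    """Yield (keyword, url) for every keyword and result page 1..pages."""
    for keyword in keywords:
        for page in range(1, pages + 1):
            query = urlencode({"keywords": keyword, "type": 1, "pagenum": page})
            yield keyword, SEARCH_URL + "?" + query


if __name__ == "__main__":
    for keyword, url in search_urls(["通用设备制造业", "软件和信息技术服务业"], pages=2):
        print(keyword, url)
```

The number of search requests is the number of keywords times the number of pages, and every result found then costs one more request for the article itself, which is why a long keyword list makes the run slow.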

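Source code

(With too many keywords the crawl becomes very slow. Keywords can be added to the data array, but it is best not to add too many. I tried it myself and it ran very slowly.)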

import os
import re

import requests
from bs4 import BeautifulSoup

reg = "[^0-9A-Za-z\u4e00-\u9fa5]"  # matches punctuation and other characters unsafe in file names
data = ['通用设备制造业', '软件和信息技术服务业', '金属制品、机械和设备修理业']  # keywords to search for

for keyword in data:
    # Reset per keyword so earlier keywords' articles are not downloaded again
    titles = []
    urls = []
    path = "./" + keyword
    if not os.path.exists(path):  # create the keyword folder if it does not exist
        os.mkdir(path, 0o755)  # the mode must be octal; the original passed decimal 755
    pagenum = '2'
    # keyword = input("Enter the keyword to search for on OFweek: ")
    # pagenum = input("Enter how many result pages to fetch (2 means the first 2 pages): ")
    # Disabled: write a per-keyword summary file
    # txt_name = keyword + "/关键词：" + keyword + "前" + pagenum + "页具体内容.txt"
    # with open(txt_name, 'w', encoding='utf-8') as f:
    #     f.write(txt_name + '\r')

    # Collect article links and titles from each search-result page
    for i in range(1, int(pagenum) + 1):
        html = "http://www.ofweek.com/newquery.action?keywords=" + keyword + "&type=1&pagenum=" + str(i)  # tech news
        resp = requests.get(html)
        resp.encoding = 'utf-8'
        content = resp.text
        bs = BeautifulSoup(content, 'html.parser')
        for news in bs.select('div.zx-tl'):  # this script selects each result's title link from div.zx-tl
            url = news.select('a')[0]['href']
            urls.append(url)
            title = news.select('a')[0].text
            titles.append(title)

    # Download each article and save it to its own txt file
    for i in range(len(urls)):
        resp = requests.get(urls[i])
        resp.encoding = 'utf-8'
        content = resp.text
        bs = BeautifulSoup(content, 'html.parser')
        page_content = bs.select('div.artical-content')[0].text
        time = bs.select("div.time.fl")[0].text
        time = time.replace("\n", "")
        time = time.replace("\t", "")
        title2 = titles[i]
        title2 = re.sub(reg, '', title2)  # strip punctuation from the title to avoid file-name errors
        txt_name = keyword + "/" + title2 + ".txt"
        time = time[0:17]
        with open(txt_name, 'w', encoding='utf-8') as f:  # the with block closes the file
            f.write(titles[i])
            f.write(time)
            f.write(page_content)
    print(keyword + ": files saved successfully!")
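One case the script does not cover: two titles that differ only in punctuation are reduced to the same file name, so the later article overwrites the earlier one, and a title made up entirely of punctuation produces an empty name. Below is a minimal sketch of one way to guard against this. `unique_name` and the `untitled` fallback are hypothetical and not part of the original script.

```python
import re

# Same character class as the script's `reg` pattern.
NON_NAME_CHARS = re.compile(r"[^0-9A-Za-z\u4e00-\u9fa5]")


def unique_name(title, used):
    """Sanitize title, then append _2, _3, ... until the name is unused."""
    base = NON_NAME_CHARS.sub("", title) or "untitled"
    name, n = base, 1
    while name in used:
        n += 1
        name = f"{base}_{n}"
    used.add(name)  # remember the name so later titles cannot reuse it
    return name


if __name__ == "__main__":
    seen = set()
    for t in ["5G: what next?", "5G what next", "!!!"]:
        print(unique_name(t, seen) + ".txt")
```

In the script, the `used` set would be created once per keyword folder, since names only need to be unique within one folder.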
