Scraping Web News (Can a NetEase News Crawler Built on the Scrapy Framework Do the Job?)

Published by 优采云: 2022-04-02 19:01

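3.2 Defining the Item

An Item is a container for scraped data. It is used much like a Python dictionary, with one extra safeguard: assigning to a field that was never declared raises an error, which catches misspelled field names early.

The fields to scrape are declared in the items file (items.py).

For example, if we only need four fields (news title, news content, news URL, and publication time), the Item can be defined like this: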

import scrapy


class NewsRecItem(scrapy.Item):
    # Fields collected for each news article
    title = scrapy.Field()    # headline
    pubtime = scrapy.Field()  # publication time
    content = scrapy.Field()  # article body (HTML)
    url = scrapy.Field()      # article URL
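
To make the typo protection concrete without requiring Scrapy to be installed, here is a minimal stand-in written as a plain dict subclass. `StrictRecord` and its `fields` tuple are hypothetical names invented for this sketch; it only imitates the behaviour described above and is not how Scrapy implements `Item`.

```python
class StrictRecord(dict):
    """Dict that only accepts a fixed set of keys (illustrative stand-in for an Item)."""

    # Hypothetical field list mirroring the four fields used in this article
    fields = ("title", "pubtime", "content", "url")

    def __setitem__(self, key, value):
        if key not in self.fields:
            # A misspelled key fails loudly instead of silently creating a new entry
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


record = StrictRecord()
record["title"] = "Sample headline"

error = None
try:
    record["titel"] = "typo"  # misspelled on purpose
except KeyError as exc:
    error = str(exc)
    print("rejected:", error)
```

With an ordinary dict the misspelled key would simply be stored, and the mistake would only surface later, for instance as an empty column in the database.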

3.3 Writing the Spider

Next, we write a spider for NetEase News.

Create news163Spider.py in the news_rec/spiders/ directory:

# -*- coding: utf-8 -*-
import json

from bs4 import BeautifulSoup
from scrapy import Spider, Request

from news_rec.items import NewsRecItem


class News163Spider(Spider):
    name = 'news_163_spider'
    allowed_domains = ['163.com']
    # JS file that carries the news list as a JSON payload
    start_urls = ["http://news.163.com/special/0001220O/news_json.js"]
    # Headers sent with the list request (passed per request instead of
    # mutating Scrapy's global DEFAULT_REQUEST_HEADERS)
    list_headers = {
        'Accept': '*/*',
        'Host': 'news.163.com',
        'Referer': 'http://news.163.com/',
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse_list, headers=self.list_headers,
                          meta={"title": "ContentList"}, encoding='utf-8')

    def parse_list(self, response):
        # The body from 163.com arrives gzip-decompressed and is not transcoded
        # automatically, so decode it by hand and adjust the codec if needed.
        try:
            j_str = response.body.decode("gb18030")
        except UnicodeDecodeError:
            j_str = response.body.decode("utf-8")
            self.logger.warning("gb18030 decoding failed for 163.com, fell back to utf-8")
        # Drop the 9-character prefix (presumably "var data=") and the trailing ";"
        json_str = j_str[9:-1]
        list_json = json.loads(json_str)
        for i in range(0, 3):
            for con in list_json["news"][i]:
                # Build a fresh meta dict for every article request
                msg = {"url": con["l"], "title": con["t"], "pubtime": con["p"]}
                yield Request(msg["url"], callback=self.parse_content, meta=msg)

    def parse_content(self, response):
        try:
            if "news.163.com" not in response.request.url:
                self.logger.info("URL under 163.com is not a news.163.com article: %s",
                                 response.request.url)
                return
            soup = BeautifulSoup(response.text, 'html.parser')
            source_content = soup.find("div", id="endText")
            # Remove the "further reading" block and the original-title line
            related = source_content.find("div", class_="related_article related_special")
            if related is not None:
                related.extract()
            otitle = source_content.find("p", class_="otitle")
            if otitle is not None:
                otitle.extract()

            item = NewsRecItem()
            item['title'] = response.meta.get("title", "")
            item['pubtime'] = response.meta.get("pubtime", "")
            item["content"] = source_content.prettify()
            item["url"] = response.request.url
            yield item
        except Exception:
            # Log the URL together with the full traceback
            self.logger.exception("Failed to extract content: %s", response.request.url)

That is the code for the NetEase spider. Next, we can run it with Scrapy and write the parsed results to a file.
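
The list-parsing step of the spider (decode the raw bytes, strip the JavaScript wrapper, parse the JSON) can be tried out on its own, without Scrapy. The sketch below is only an illustration: `decode_body` and `extract_json` are hypothetical helper names, the sample payload is made up, and the real feed's layout may differ. Instead of slicing a fixed number of characters, it locates the outermost braces, which tolerates small changes to the wrapper.

```python
import json


def decode_body(body: bytes) -> str:
    """Try gb18030 first, then fall back to utf-8, mirroring the spider."""
    try:
        return body.decode("gb18030")
    except UnicodeDecodeError:
        return body.decode("utf-8")


def extract_json(js_text: str) -> dict:
    """Cut the JSON object out of a `var data={...};` style wrapper."""
    start = js_text.index("{")
    end = js_text.rindex("}") + 1
    return json.loads(js_text[start:end])


# Made-up sample payload, encoded the way the spider expects the feed to be
sample = (
    'var data={"news":[[{"l":"http://news.163.com/a.html",'
    '"t":"示例标题","p":"2022-04-02 19:01:00"}]]};'
).encode("gb18030")

data = extract_json(decode_body(sample))
first = data["news"][0][0]
print(first["t"], first["l"])
```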

　　scrapy crawl news_163_spider -o item.json

Running the command above writes the scraped items to a JSON file.
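
The JSON export is a single array with one object per item, so it can be read back with the standard library. The sketch below first writes a made-up stand-in file so that it runs on its own; in practice `path` would point at the item.json produced by the crawl. `load_items` is a hypothetical helper name.

```python
import json
import os
import tempfile

# Stand-in for the file written by `-o item.json`; the values are made up
sample_items = [
    {
        "title": "Sample headline",
        "pubtime": "2022-04-02 19:01:00",
        "content": "<div id=\"endText\">...</div>",
        "url": "http://news.163.com/sample.html",
    }
]
path = os.path.join(tempfile.mkdtemp(), "item.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_items, f)


def load_items(path):
    """Load the exported items as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


items = load_items(path)
print(len(items), items[0]["title"])
```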

In addition, if we want to store the results in a database, we first create a table:

CREATE TABLE `news_table` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(255) DEFAULT NULL,
  `pubtime` datetime DEFAULT NULL,
  `content` text,
  `url` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=126695 DEFAULT CHARSET=utf8;

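Once the table exists, add the database insert code to the pipeline.

A pipeline can be thought of as a conduit for items: every scraped item passes through it. Any filtering of the text or writing to a database can be implemented in the pipeline.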

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from scrapy import Item


class NewsRecPipeline(object):
    def process_item(self, item, spider):
        self.insert_mysql(item)
        return item

    # Open the database connection when the spider starts
    def open_spider(self, spider):
        # Keyword arguments: recent PyMySQL versions no longer accept positional ones
        self.db = pymysql.connect(host='localhost', user='root',
                                  password='root', database='rec',
                                  charset='utf8')
        self.cursor = self.db.cursor()
        self.ori_table = 'news_table'

    # Close the cursor and the connection when the spider finishes
    def close_spider(self, spider):
        print("Closing spider " + spider.name + "...")
        self.cursor.close()
        self.db.close()

    # Convert an Item to a plain dict (placeholder, currently unused)
    def insert_db(self, item):
        if isinstance(item, Item):
            item = dict(item)
        return item

    # Insert one item into MySQL
    def insert_mysql(self, item):
        # %s placeholders let the driver quote and escape the values
        sql = 'insert into {0} (pubtime,title,content,url) VALUES (%s,%s,%s,%s)'.format(self.ori_table)
        args = (item.get('pubtime', ''), item.get('title', ''),
                item.get('content', ''), item.get('url', ''))
        try:
            self.cursor.execute(sql, args)
            self.db.commit()
            print('Insert succeeded')
        except Exception as e:
            self.db.rollback()
            print("Insert failed:", e, "| sql:", sql)
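
Passing the values as query parameters, rather than formatting them into the SQL string, lets the database driver handle quoting, so a title that contains an apostrophe cannot break the statement. The sketch below shows the idea with the standard library's sqlite3 as a stand-in for MySQL, so that it runs without a database server. That substitution is an assumption of this example: sqlite3 uses `?` placeholders where PyMySQL uses `%s`, and the column types are simplified. `insert_news` is a hypothetical helper name.

```python
import sqlite3

# In-memory database standing in for the MySQL table in this article
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news_table ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title TEXT, pubtime TEXT, content TEXT, url TEXT)"
)


def insert_news(conn, item):
    """Insert one item; the values travel separately from the SQL text."""
    conn.execute(
        "INSERT INTO news_table (pubtime, title, content, url) VALUES (?, ?, ?, ?)",
        (
            item.get("pubtime", ""),
            item.get("title", ""),
            item.get("content", ""),
            item.get("url", ""),
        ),
    )
    conn.commit()


# Made-up item whose title contains quotes on purpose
item = {
    "title": "It's a 'quoted' title",
    "pubtime": "2022-04-02 19:01:00",
    "content": "<p>body</p>",
    "url": "http://news.163.com/a.html",
}
insert_news(conn, item)
rows = conn.execute("SELECT title, url FROM news_table").fetchall()
print(rows)
```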

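After adding the code, remember to enable the pipeline in the settings file. By default the ITEM_PIPELINES entry in settings.py is commented out, so it needs to be uncommented. The number after each pipeline ranges from 0 to 1000 and only sets the order in which pipelines run: the lower the number, the higher the priority.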
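
As a sketch, the enabled entry in settings.py might look like the fragment below. The module path assumes the default project layout, with the pipeline class living in news_rec/pipelines.py, and 300 is simply the value that Scrapy's project template uses.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # Assumed module path: news_rec/pipelines.py in the default project layout
    "news_rec.pipelines.NewsRecPipeline": 300,
}
```

If several pipelines are registered, items pass through them in ascending order of these numbers.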

Run the spider from the command line:

　　scrapy crawl news_163_spider

For convenience, I wrote a run script. Whenever we need to crawl articles, we simply run that script.
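
The article does not show that script, so the following is only a guess at what such a launcher could look like. It relies on `scrapy.cmdline.execute`, which accepts the same argument list as the `scrapy` command; the file name run.py and the helper `build_crawl_argv` are hypothetical.

```python
import sys


def build_crawl_argv(spider_name, output=None):
    """Return the argument list equivalent to `scrapy crawl <spider> [-o <file>]`."""
    argv = ["scrapy", "crawl", spider_name]
    if output:
        argv += ["-o", output]
    return argv


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: python run.py <spider_name> [output_file]")
    else:
        # Imported here so the helper above can be used without Scrapy installed
        from scrapy.cmdline import execute

        execute(build_crawl_argv(*sys.argv[1:3]))
```

Saved as run.py in the project root (next to scrapy.cfg), it would be started with `python run.py news_163_spider`.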

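Once the program has finished, the scraped news can be seen in the database.

At this point, a news crawler based on the Scrapy framework is up and running. Large crawling projects have many more details to take care of, such as keyword filtering, resumable crawling, and proxy IP pools. If you are interested in crawlers, feel free to leave a comment.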
