Scraping web news: a small Scrapy crawler for NetEase News

Published by 优采云: 2022-03-28 23:08


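It has been a while since I last wrote a crawler, so I wrote a small Scrapy spider that scrapes NetEase News. The code is based on a crawler found on GitHub. To get to the point: a Scrapy crawler has a handful of files that need to be modified. This crawler requires the MongoDB database and pymongo to be installed. Once the crawl has run, you can open the database and use a find statement to inspect its contents. A crawled document looks like this: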

{
    "_id" : ObjectId("5577ae44745d785e65fa8686"),
    "from_url" : "http://tech.163.com/",
    "news_body" : [
        "科技讯 6月9日凌晨消息2015",
        "全球开发者大会（WWDC 2015）在旧",
        "召开，网易科技进行了全程图文直播。最新",
        "9操作系统在",
        "上性能得到极大提升，可以实现分屏显示，也可以支持画中画功能。",
        "新版iOS 9 增加了QuickType 键盘，让输入和编辑都更简单快捷。在搭配外置键盘使用 iPad 时，用户可以用快捷键来进行操作，例如在不同 app 之间进行切换。",
        "而且，iOS 9 重新设计了 app 间的切换。iPad的分屏功能可以让用户在不离开当前 app 的同时就能打开第二个 app。这意味着两个app在同一屏幕上，同时开启、并行运作。两个屏幕的比例可以是5：5，也可以是7：3。",
        "另外，iPad还支持“画中画”功能，可以将正在播放的视频缩放到一角，然后利用屏幕其它空间处理其他的工作。",
        "据透露分屏功能只支持iPad Air2；画中画功能将只支持iPad Air, iPad Air2, iPad mini2, iPad mini3。",
        "\r\n"
    ],
    "news_from" : "网易科技报道",
    "news_thread" : "ARKR2G22000915BD",
    "news_time" : "2015-06-09 02:24:55",
    "news_title" : "iOS 9在iPad上可实现分屏功能",
    "news_url" : "http://tech.163.com/15/0609/02/ARKR2G22000915BD.html"
}
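In the sample document, news_body is a list of text fragments rather than one string, and some words are missing between fragments. That is probably because an XPath ending in p/text() returns only the text nodes that sit directly inside each paragraph, so text wrapped in links or other inline tags is skipped. As a minimal sketch, the fragments could be joined into one string after the crawl; join_news_body is a hypothetical helper, not part of the original project.

```python
def join_news_body(fragments):
    """Join the text fragments stored in news_body into a single string."""
    # Strip surrounding whitespace from every fragment.
    cleaned = [part.strip() for part in fragments]
    # Drop fragments that are empty after stripping (for example "\r\n").
    return "".join(part for part in cleaned if part)


# A shortened copy of the news_body list shown above.
sample = [
    "科技讯 6月9日凌晨消息2015",
    "全球开发者大会（WWDC 2015）在旧",
    "\r\n",
]
print(join_news_body(sample))
```

Joining without a separator suits Chinese text; for space-delimited languages a " " separator would be the more natural choice.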

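These are the files that need to be modified:

1. The spider file, which defines the crawl rules, mainly using XPath.

2. items.py, which declares the fields to be scraped.

3. pipelines.py, which routes and stores the data. Here we also add a store.py file that sets up the MongoDB database.

4. settings.py, the configuration file, which mainly configures proxies, the User-Agent, the crawl interval, delays, and so on.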

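Those are the main files. Compared with my earlier crawlers, this Scrapy project adds two features. First, it is hooked up to a database for storage instead of writing json or txt files. Second, the spider sets follow = True, which means the crawler keeps following links found on the pages it has already crawled, which amounts to a depth-wise search. Let's look at the source code.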

We usually start with the items.py file:

# -*- coding: utf-8 -*-
import scrapy


class Tech163Item(scrapy.Item):
    news_thread = scrapy.Field()
    news_title = scrapy.Field()
    news_url = scrapy.Field()
    news_time = scrapy.Field()
    news_from = scrapy.Field()
    from_url = scrapy.Field()
    news_body = scrapy.Field()

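Next we write the spider file. The file can have any name, because when we launch the crawler we only need the name of the spider defined inside it, that is, the attribute name = "news"; the spider here is called news. If you want to use this spider, you will probably need to change the date in the allow argument of the Rule below, because NetEase News does not keep news older than a year. For example, if it is now August 2015, you can change it to /15/08.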

# encoding: utf-8
import scrapy
import re
from scrapy.selector import Selector
from tech163.items import Tech163Item
# scrapy.contrib.* was the old location of these classes; current Scrapy
# releases expose them from scrapy.linkextractors and scrapy.spiders.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Spider(CrawlSpider):
    name = "news"
    allowed_domains = ["tech.163.com"]
    start_urls = ['http://tech.163.com/']
    rules = (
        Rule(
            # The regex /15/06\d+/\d+/* roughly means: crawl news URLs that contain
            # /15/06 followed by digits, a slash, more digits and optional slashes.
            LinkExtractor(allow=r"/15/06\d+/\d+/*"),
            callback="parse_news",
            # follow=True keeps following links found on the pages already crawled.
            follow=True,
        ),
    )

    def parse_news(self, response):
        item = Tech163Item()
        # The thread ID is the last URL segment without the 5-character ".html" suffix.
        item['news_thread'] = response.url.strip().split('/')[-1][:-5]
        self.get_title(response, item)
        self.get_source(response, item)
        self.get_url(response, item)
        self.get_news_from(response, item)
        self.get_from_url(response, item)
        self.get_text(response, item)
        return item

    def get_title(self, response, item):
        title = response.xpath("/html/head/title/text()").extract()
        if title:
            # Drop the last 5 characters of the page title.
            item['news_title'] = title[0][:-5]

    def get_source(self, response, item):
        source = response.xpath("//div[@class='ep-time-soure cDGray']/text()").extract()
        if source:
            # Slice the timestamp out of the surrounding text; the offsets
            # depend on the page layout at the time of writing.
            item['news_time'] = source[0][9:-5]

    def get_news_from(self, response, item):
        news_from = response.xpath("//div[@class='ep-time-soure cDGray']/a/text()").extract()
        if news_from:
            item['news_from'] = news_from[0]

    def get_from_url(self, response, item):
        from_url = response.xpath("//div[@class='ep-time-soure cDGray']/a/@href").extract()
        if from_url:
            item['from_url'] = from_url[0]

    def get_text(self, response, item):
        # Text nodes directly inside each <p> of the article body.
        news_body = response.xpath("//div[@id='endText']/p/text()").extract()
        if news_body:
            item['news_body'] = news_body

    def get_url(self, response, item):
        news_url = response.url
        if news_url:
            item['news_url'] = news_url
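To see what the allow pattern and the thread-ID slicing in the spider do, they can be tried on the sample URL from the stored document using only the standard library. This is a rough sketch: it assumes the link extractor applies allow as a regular-expression search against the absolute URL, and allow_pattern_for and thread_id_from_url are hypothetical helpers introduced here for illustration.

```python
import re

# Same pattern as the allow argument of the Rule in the spider.
ALLOW_PATTERN = r"/15/06\d+/\d+/*"


def allow_pattern_for(year, month):
    """Build the equivalent pattern for another year and month (hypothetical helper)."""
    # URLs use a two-digit year and a two-digit month, e.g. /15/0609/.
    return r"/%02d/%02d\d+/\d+/*" % (year % 100, month)


def thread_id_from_url(url):
    """Take the last path segment and drop the 5-character '.html' suffix."""
    return url.strip().split("/")[-1][:-5]


url = "http://tech.163.com/15/0609/02/ARKR2G22000915BD.html"
print(bool(re.search(ALLOW_PATTERN, url)))  # whether the June 2015 pattern matches
print(allow_pattern_for(2015, 8))           # pattern for August 2015
print(thread_id_from_url(url))              # value stored as news_thread
```

Note that the slice [:-5] only works for URLs ending in .html; a URL with a different suffix would produce a truncated ID.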

Then we create a store.py file in which we set up the database. The pipeline file imports this database and stores the data in it. Let's look at the source code.

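import pymongo
import random

# Address of the local MongoDB server (27017 is MongoDB's default port).
HOST = "127.0.0.1"
PORT = 27017

client = pymongo.MongoClient(HOST, PORT)
# MongoDB creates the NewsDB database lazily, on the first write.
NewsDB = client.NewsDB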

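In the pipelines.py file we import the NewsDB database and use an update statement to insert each news item into it. There are two checks: one tests whether the spider's name is news, and the other tests whether the thread ID (news_thread) is missing. The most important statement is NewsDB.new.update(spec,{"$set":dict(item)},upsert = True), which writes the item's data, converted to a dictionary, into the database. Because of upsert = True, a document with the same news_thread is updated if it exists and inserted otherwise.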

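from store import NewsDB


class Tech163Pipeline(object):
    def process_item(self, item, spider):
        # Only handle items produced by the "news" spider.
        if spider.name != "news":
            return item
        # Skip items that carry no thread ID.
        if item.get("news_thread", None) is None:
            return item
        spec = {"news_thread": item["news_thread"]}
        # Insert the item, or update the stored copy with the same news_thread.
        # The original called NewsDB.new.update(...), which was removed in
        # PyMongo 4; update_one is its replacement for a single document.
        NewsDB.new.update_one(spec, {"$set": dict(item)}, upsert=True)
        # Return the item (the original returned None) so that any later
        # pipeline stage still receives it.
        return item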
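The effect of the upsert in the pipeline is that crawling the same article twice leaves a single document, keyed by news_thread. The sketch below illustrates this without a running MongoDB server. InMemoryCollection is a hypothetical stand-in that imitates only the one call used here (a $set update with upsert), and save_item is likewise an illustrative helper; neither belongs to pymongo or to the original project.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a MongoDB collection; supports only $set updates."""

    def __init__(self):
        self.docs = []

    def update_one(self, spec, update, upsert=False):
        # Update the first document whose fields equal every key/value in spec.
        for doc in self.docs:
            if all(doc.get(key) == value for key, value in spec.items()):
                doc.update(update["$set"])
                return
        # No match: with upsert=True, insert a new document built from spec and $set.
        if upsert:
            new_doc = dict(spec)
            new_doc.update(update["$set"])
            self.docs.append(new_doc)


def save_item(collection, item):
    """Insert the item, or overwrite the stored copy with the same news_thread."""
    spec = {"news_thread": item["news_thread"]}
    collection.update_one(spec, {"$set": dict(item)}, upsert=True)


collection = InMemoryCollection()
save_item(collection, {"news_thread": "ARKR2G22000915BD", "news_title": "first crawl"})
save_item(collection, {"news_thread": "ARKR2G22000915BD", "news_title": "second crawl"})
print(len(collection.docs))              # number of stored documents
print(collection.docs[0]["news_title"])  # title kept after the second save
```

With a real database, save_item could be handed a pymongo collection instead of the stand-in, since it only relies on update_one.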

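Finally, we change the settings file to set USER_AGENT. The crawler should imitate a browser's behavior as closely as possible so that it can successfully fetch the content you want.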

BOT_NAME = 'tech163'

SPIDER_MODULES = ['tech163.spiders']
NEWSPIDER_MODULE = 'tech163.spiders'

# Current Scrapy releases expect a dict that maps each pipeline to an order
# value (0-1000); the original used the older list form.
ITEM_PIPELINES = {'tech163.pipelines.Tech163Pipeline': 300}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tech163 (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'

# Give up on a download after 15 seconds.
DOWNLOAD_TIMEOUT = 15
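The file list earlier mentions that settings.py is also the place to configure the crawl interval and delays, although the settings above do not set them. As a hedged sketch, throttling could be added with standard Scrapy settings such as the following; the values are illustrative assumptions, not taken from the original project.

```python
# Hypothetical additions to settings.py; all values are illustrative assumptions.
DOWNLOAD_DELAY = 2                  # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # vary each wait between 0.5x and 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap on parallel requests to a single domain
ROBOTSTXT_OBEY = True               # respect the site's robots.txt rules
```

A longer delay slows the crawl down but reduces the load placed on the news site.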
