Web crawler: scraping Baidu images (using Fiddler packet capture to find where the news URLs and other information are hidden)


  First, some analysis.

  

  After opening the site, open the page source: the first few news headlines can be found in the source, but the headlines further down the page cannot.

  

  At this point we need Fiddler to capture the traffic and work out where these news URLs and the other information are hidden.

  

  These responses contain the information we are looking for.

  

  If we copy the page URL and open it in the browser, we find it is not the source data we want.

  

  Copy the URL captured by Fiddler, locate our source data, and compare the difference between the two URLs.

  The t parameter is just a timestamp, so we simplify the second URL to:

  http://news.baidu.com/widget?id=LocalNews&ajax=json

  We can access the source data through the URL above.

  Guided by the blue-highlighted entries in the capture below, we only need to splice together URLs of this pattern to get the source data for every section.
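  As a quick sanity check before building the Scrapy project, the idea can be verified with a few lines of plain Python. This is a sketch of my own using the requests library (not part of the original article), and the widget endpoint may behave differently today:

import re
import requests

# Fetch one section block's widget data and pull out the news URLs,
# mirroring what the spider below does with the "url"/"m_url" fields.
resp = requests.get('http://news.baidu.com/widget?id=LocalNews&ajax=json')
urls = re.findall(r'"url":"(.*?)"', resp.text)
urls = [u.replace('\\/', '/') for u in urls]  # the widget data escapes slashes as \/
print(len(urls), urls[:3])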

  

  1. First, define which fields we want to crawl. This is done in items.py:

import scrapy


class BaidunewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    """1. Define what to crawl: the page title, its URL and the raw page body."""
    title = scrapy.Field()
    link = scrapy.Field()
    content = scrapy.Field()

  2. Write the spider file:

# -*- coding: utf-8 -*-
import scrapy
from baidunews.items import BaidunewsItem
from scrapy.http import Request
import re


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['news.baidu.com']
    start_urls = ['http://news.baidu.com/widget?id=LocalNews&ajax=json']

    # Build the widget URL of every section block.
    all_id = ['LocalNews', 'civilnews', 'InternationalNews', 'EnterNews', 'SportNews', 'FinanceNews', 'TechNews',
              'MilitaryNews', 'InternetNews', 'DiscoveryNews', 'LadyNews', 'HealthNews', 'PicWall']
    all_url = []
    for i in range(len(all_id)):
        current_id = all_id[i]
        current_url = 'http://news.baidu.com/widget?id=' + current_id + '&ajax=json'
        all_url.append(current_url)

    def parse(self, response):
        """Request the widget data of every section block."""
        for i in range(0, len(self.all_url)):
            print("Section block " + str(i))
            yield Request(self.all_url[i], callback=self.next)

    def next(self, response):
        """Extract the URLs of all news items under one section block."""
        data = response.body.decode('utf-8', 'ignore')
        pattern1 = '"url":"(.*?)"'
        pattern2 = '"m_url":"(.*?)"'
        url1 = re.compile(pattern1, re.S).findall(data)
        url2 = re.compile(pattern2, re.S).findall(data)
        if len(url1) != 0:
            url = url1
        else:
            url = url2
        for i in range(0, len(url)):
            thisurl = re.sub(r"\\\/", '/', url[i])  # the widget data escapes slashes as \/
            # thisurl is a news page URL; the next2 callback handles its response
            yield Request(thisurl, callback=self.next2, dont_filter=True)

    def next2(self, response):
        """Fill an item from each news page."""
        item = BaidunewsItem()
        item['link'] = response.url
        item['title'] = response.xpath("/html/head/title/text()").extract()
        item['content'] = response.body
        yield item
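  If you want to eyeball what a widget URL actually returns before trusting the two regexes above, scrapy shell can be used. This is a usage sketch, not from the original article; civilnews is just one of the ids in all_id:

scrapy shell 'http://news.baidu.com/widget?id=civilnews&ajax=json'
>>> import re
>>> data = response.body.decode('utf-8', 'ignore')
>>> re.findall(r'"m_url":"(.*?)"', data)[:3]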

  A few points worth summarizing:

  1. In yield Request(), the first argument is the value the generator yields back (the URL to fetch), and the second, callback=XXX, names the function that will handle the response.

  2. In yield Request(thisurl, callback=self.next2, dont_filter=True), if dont_filter=True is omitted, no items come back.

  3. The mentor who taught me explained that allowed_domains = ['news.baidu.com'] restricts crawling to URLs on this domain, for example current_url = 'http://news.baidu.com/widget?id=' + current_id + '&ajax=json', while the actual news pages under each section block are not on this domain. Only with the dont_filter=True parameter above do our item results come back (see the sketch below for an alternative).
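  For what it's worth, an alternative to dont_filter=True (a sketch of my own, not what the article's code does): it is Scrapy's offsite filtering that drops requests whose domain is not listed in allowed_domains, so a spider that simply does not declare allowed_domains lets the article pages through as well:

import scrapy


class NewsSpiderNoOffsite(scrapy.Spider):
    """Hypothetical variant: without allowed_domains the offsite filter
    never drops the article pages, so dont_filter=True is not needed."""
    name = 'news_no_offsite'
    start_urls = ['http://news.baidu.com/widget?id=LocalNews&ajax=json']

    def parse(self, response):
        # same extraction logic as the spider above would go here
        pass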

  3. We can write a pipeline to inspect the returned items:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class BaidunewsPipeline(object):
    def process_item(self, item, spider):
        # item['title'] is a list of strings returned by extract(), and
        # item['content'] is the raw response body (bytes), so join/measure
        # them instead of concatenating everything directly.
        print(''.join(item['title']) + ' ' + item['link'])
        print('content length:', len(item['content']))
        return item
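  Printing is enough for debugging; if you also want to keep the results, a pipeline can append each item to a JSON-lines file instead. This is a sketch of my own, not part of the original project, and it would need to be registered in ITEM_PIPELINES as well:

import json


class JsonWriterPipeline(object):
    """Hypothetical extra pipeline: append one JSON object per item to a file."""

    def open_spider(self, spider):
        self.file = open('news_items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        record = {
            'title': ''.join(item['title']),
            'link': item['link'],
            'content_length': len(item['content']),
        }
        self.file.write(json.dumps(record, ensure_ascii=False) + '\n')
        return item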

  4. We also need to register the pipeline in settings.py before running the code:

  # -*- coding: utf-8 -*-

# Scrapy settings for baidunews project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

# https://doc.scrapy.org/en/latest/topics/settings.html

# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'baidunews'

SPIDER_MODULES = ['baidunews.spiders']

NEWSPIDER_MODULE = 'baidunews.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'baidunews (+http://www.yourdomain.com)'

# Obey robots.txt rules

ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

#DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

#DEFAULT_REQUEST_HEADERS = {

# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

# 'Accept-Language': 'en',

#}

# Enable or disable spider middlewares

# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

# 'baidunews.middlewares.BaidunewsSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

# 'baidunews.middlewares.BaidunewsDownloaderMiddleware': 543,

#}

# Enable or disable extensions

# See https://doc.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

# 'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Register the item pipeline
ITEM_PIPELINES = {
    'baidunews.pipelines.BaidunewsPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

  5. Then we can run the crawl from a command-line window and fetch the news successfully.
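  Assuming the project was generated with scrapy startproject baidunews and the spider is named news as above, the crawl is started from the project root like this:

scrapy crawl news
# optionally dump the items to a file at the same time:
scrapy crawl news -o news.json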

  6. Below is our project's directory structure.
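  The original screenshot of the directory tree is not reproduced here; for a project created with scrapy startproject baidunews, the layout should look roughly like this (news.py being the spider written in step 2):

baidunews/
    scrapy.cfg
    baidunews/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            news.py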

  
