How to scrape web page data (install MySQL-python; if the steps below run without errors, the installation succeeded)

Published by 优采云: 2022-01-15 21:01


　　Install MySQL-python

[root@centos7vm ~]# pip install MySQL-python

　　If the import below raises no error, the installation succeeded:

[root@centos7vm ~]# python
Python 2.7.5 (default, Nov 20 2015, 02:00:19)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import MySQLdb
>>>
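　　The interactive check above can also be scripted. The following is a minimal sketch; the helper name is_importable is an assumption of this example and not part of MySQL-python. Keep in mind that MySQL-python targets Python 2, which matches the interpreter shown above.

```python
import importlib


def is_importable(module_name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "MySQLdb" is the module name that the MySQL-python package installs.
    status = "ok" if is_importable("MySQLdb") else "missing"
    print("MySQLdb: " + status)
```

　　A "missing" result usually means pip installed the package into a different Python than the one running the script.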

　　Create the page table

　　To store the scraped pages, create a table named page in the MySQL database with the following SQL statement:

CREATE TABLE `page` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `post_date` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `post_user` varchar(255) COLLATE utf8_unicode_ci DEFAULT '',
  `body` longtext COLLATE utf8_unicode_ci,
  `content` longtext COLLATE utf8_unicode_ci,
  PRIMARY KEY (`id`),
  UNIQUE KEY `title` (`title`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

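　　Here title is the article title, post_date is the time the article was published, post_user is the publisher (the official account), body is the raw content of the page, and content is the plain text extracted from it.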

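　　Create the item structure

　　Edit the item.py file in our scrapy project (scrapy's project template names this file items.py) so that it holds the extracted structured data: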

import scrapy


class WeixinItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    post_date = scrapy.Field()
    post_user = scrapy.Field()
    body = scrapy.Field()
    content = scrapy.Field()

　　Populate the item

　　Edit the spider script and add the following statements to the parse function:

def parse_profile(self, response):
    # Pull the title, publish date and account name out of the article page.
    title = response.xpath('//title/text()').extract()[0]
    post_date = response.xpath('//em[@id="post-date"]/text()').extract()[0]
    post_user = response.xpath('//a[@id="post-user"]/text()').extract()[0]
    # Keep the raw page as well as the tag-free article text.
    body = response.body
    tag_content = response.xpath('//div[@id="js_content"]').extract()[0]
    content = remove_tags(tag_content).strip()

    item = WeixinItem()
    item['title'] = title
    item['post_date'] = post_date
    item['post_user'] = post_user
    item['body'] = body
    item['content'] = content
    return item
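　　One thing to watch in the parse function: extract() returns a list, so indexing it with [0] raises IndexError whenever an XPath matches nothing, for example on a page without a post-user element. A minimal sketch of a guard follows. The helper first_or_default and the sample lists are hypothetical and stand in for real extract() results.

```python
def first_or_default(values, default=""):
    """Return the first element of a list, or `default` if the list is empty."""
    return values[0] if values else default


# Stand-ins for what response.xpath(...).extract() might return.
matched = [u"Some article title"]
unmatched = []

title = first_or_default(matched)
post_user = first_or_default(unmatched, default=u"unknown")
```

　　Scrapy selectors also provide extract_first(), which returns None instead of raising when there is no match.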

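　　Note: if you are not sure how to write the spider script, see the previous article in this series, 《教你成为全栈开发者（Full Stack Developer）31》, which covers scraping official account articles through WeChat search.

　　Also: content here is the plain text that remains after the tags are stripped with remove_tags, which has to be imported first: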

from w3lib.html import remove_tags
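　　To see what "plain text with the tags removed" means without installing anything, here is a rough approximation built on the Python 3 standard library. It is a stand-in for illustration and not how w3lib implements remove_tags; the names TextOnly and strip_tags and the sample fragment are invented for this sketch.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


fragment = '<div id="js_content"><p>Hello <b>world</b></p></div>'
plain = strip_tags(fragment).strip()
```

　　Here plain ends up as the string "Hello world", which is the kind of value stored in the content field.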

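　　Create the pipeline

　　scrapy persists data through pipelines. Open-source crawlers each offer their own persistence options; pyspider, for example, provides built-in ways to write to MySQL, MongoDB, files and so on. scrapy, a veteran among crawlers, exposes an interface instead: we can define our own pipelines and choose among them flexibly through configuration.

　　The pipeline mechanism relies on pipelines.py and settings.py working together.

　　Change pipelines.py in the scrapy project to the following: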

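# -*- coding: utf-8 -*-
import sys

# Python 2 only: make utf8 the default encoding for implicit str/unicode conversion.
reload(sys)
sys.setdefaultencoding('utf8')

import MySQLdb


class WeixinPipeline(object):
    def __init__(self):
        # Open one connection for the lifetime of the pipeline.
        self.conn = MySQLdb.connect(host="127.0.0.1", user="myname", passwd="mypasswd", db="mydbname", charset="utf8")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # "insert ignore" skips rows whose title already exists (unique key on title).
        sql = "insert ignore into page(title, post_user, body, content) values(%s, %s, %s, %s)"
        param = (item['title'], item['post_user'], item['body'], item['content'])
        self.cursor.execute(sql, param)
        self.conn.commit()
        # Return the item so that any later pipeline still receives it.
        return item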

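　　Adjust the database settings to match your own setup. process_item is called automatically during the crawl, and each item returned by the spider script is passed in through the item parameter. Here an insert statement writes the structured item data into the MySQL database.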
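　　Because the page table has a unique key on title and the statement is "insert ignore", an article that is crawled twice is stored only once. The sketch below reproduces that behavior with Python's built-in sqlite3 module as a stand-in, so it runs without a MySQL server. sqlite spells the statement INSERT OR IGNORE and uses ? placeholders where MySQLdb uses %s. The simplified table and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " title TEXT NOT NULL UNIQUE,"
    " post_user TEXT DEFAULT '',"
    " body TEXT,"
    " content TEXT)"
)

sql = "INSERT OR IGNORE INTO page(title, post_user, body, content) VALUES (?, ?, ?, ?)"
rows = [
    ("First post", "account_a", "<html>1</html>", "one"),
    # Same title as the row above, so this insert is silently skipped.
    ("First post", "account_a", "<html>1 again</html>", "one again"),
    ("Second post", "account_b", "<html>2</html>", "two"),
]
for row in rows:
    conn.execute(sql, row)
conn.commit()

titles = [r[0] for r in conn.execute("SELECT title FROM page ORDER BY id")]
```

　　Only two rows are stored, and the first version of the duplicated article is the one that is kept.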


　　Next, enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'weixin.pipelines.WeixinPipeline': 300,
}
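　　The number attached to each entry in ITEM_PIPELINES sets the order in which pipelines run: lower values run first, and values are conventionally chosen between 0 and 1000. Each process_item is expected to return the item so that the next pipeline receives it. The following is a simplified simulation of that flow in plain Python and not scrapy's actual implementation; both pipeline classes and run_pipelines are hypothetical.

```python
class CleanTitlePipeline(object):
    """Hypothetical pipeline that trims whitespace from the title."""

    def process_item(self, item, spider):
        item["title"] = item["title"].strip()
        return item  # hand the item on to the next pipeline


class CollectPipeline(object):
    """Hypothetical pipeline that stores items in a list instead of a database."""

    def __init__(self):
        self.saved = []

    def process_item(self, item, spider):
        self.saved.append(dict(item))
        return item


def run_pipelines(item, pipelines_with_order, spider=None):
    """Pass the item through each pipeline, lowest order number first."""
    for pipeline, _order in sorted(pipelines_with_order, key=lambda pair: pair[1]):
        item = pipeline.process_item(item, spider)
    return item


collector = CollectPipeline()
# Listed out of order on purpose: 100 still runs before 300.
configured = [(collector, 300), (CleanTitlePipeline(), 100)]
result = run_pipelines({"title": "  Hello  "}, configured)
```

　　By the time the item reaches the collector its title has already been cleaned, which is why a storage pipeline is usually given a higher number than the pipelines that tidy up the data.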

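　　After the spider runs, the scraped articles are stored in the page table.

　　The result looks good. The plan is to use this data as training samples for machine learning, to predict what will happen next; that is the subject of the next installment.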
