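Scraping web pages with a browser (using the RSelenium package or Rwebdriver in R to simulate a browser and scrape asynchronously loaded and other hard-to-scrape pages)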

优采云, published: 2021-11-21 00:21

Using Python and selenium to simulate a browser and scrape asynchronously loaded and other hard-to-scrape pages

Background

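My earlier post, "Using the RSelenium package or Rwebdriver in R to simulate a browser and scrape asynchronously loaded and other hard-to-scrape pages", mentioned a Python implementation.

This post supplies that Python implementation. The background and the packages involved were covered in the earlier post and are not repeated here.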

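Program notes

The script scrapes book information from Qidian (起点中文网) and stores it in a local MySQL database. A few implementation details are worth mentioning:

1. Some books have no rating data. These cases are handled with a try...except...pass statement, which avoids errors and inconsistent data formats.

2. For reasons I could not determine, Firefox always crashed after scraping 500-odd books (never more than 1000). The script therefore restarts the browser every 300 books. This costs some time but avoids the crash. Chrome, for its part, kept failing at startup when used for scraping, and trying several versions did not help, so it was less convenient than Firefox.

3. Writing records to the database one at a time is too slow, and writing everything in a single pass is not practical either. As in item 2, the script writes in batches of 300 records.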
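The three points above can be tried out without Selenium or MySQL. The following is only a minimal sketch of the two patterns, not the code used in the script: `text_or_default` and `flush_in_batches` are hypothetical helper names, and `write_batch` stands in for whatever writes to the database (and, if desired, restarts the browser).

```python
def text_or_default(read_text, default):
    """Return read_text(), or default if it raises (e.g. the element is missing)."""
    try:
        return read_text()
    except Exception as exc:
        print("falling back to default:", exc)
        return default


def flush_in_batches(records, write_batch, batch_size=300):
    """Call write_batch on consecutive chunks of at most batch_size records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            write_batch(batch)
            batch = []
    if batch:
        # Write whatever is left over after the last full batch.
        write_batch(batch)


def missing_element():
    # Stand-in for a lookup on a page that has no rating element.
    raise LookupError("no rating element on this page")


votes = text_or_default(missing_element, 0)

written = []
flush_in_batches(range(7), written.append, batch_size=3)
print(votes, written)
```

Here a batch size of 3 is used only to keep the demonstration short; with the default of 300, `write_batch` would be called once per 300 records and once more for the remainder.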

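Code

The complete code is posted below for reference. Once you are comfortable simulating a browser, most pages can be scraped. The remaining issue is speed: where a page can be fetched without a browser, that is of course the better option.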

# -*- coding: utf-8 -*-
"""
Created on Fri Apr 28 11:32:42 2017

@author: tiger
"""

import datetime
import random

import MySQLdb
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

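
# Collect the link to each listed book's detail page from one listing page.
def get_link(soup_page):
    # soup(tag, css_class) is shorthand for soup.find_all(tag, css_class).
    items = soup_page('div', 'book-mid-info')
    links = []
    for item in items:
        # The hrefs come without a scheme, so prepend one.
        links.append('https:' + item.h4.a.get('href'))
    return links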


# Open each link and extract the required fields.
# Relies on the module-level `driver` created further down.
def get_book_info(link):
    driver.get(link)
    #soup = BeautifulSoup(driver.page_source)
    # Pseudo-id: current timestamp plus four random digits.
    book_id = datetime.datetime.now().strftime("%Y%m%d%H%M%S") + str(random.randint(1000, 9999))
    # Most fields live under the same container.
    base = "//div[@class='book-information cf']/div"
    # Locator strategies are passed as plain strings ("xpath", "id");
    # the find_element_by_* helpers were removed in Selenium 4.3.
    title = driver.find_element("xpath", base + "/h1/em").text           # title
    author = driver.find_element("xpath", base + "/h1/span/a").text      # author
    types = driver.find_element("xpath", base + "/p[1]/a").text          # genre
    status = driver.find_element("xpath", base + "/p[1]/span[1]").text   # status
    words = driver.find_element("xpath", base + "/p[3]/em[1]").text      # word count
    cliks = driver.find_element("xpath", base + "/p[3]/em[2]").text      # clicks
    recoms = driver.find_element("xpath", base + "/p[3]/em[3]").text     # recommendations
    # Rating count; some books have none, so fall back to 0.
    try:
        votes = driver.find_element("xpath", "//p[@id='j_userCount']/span").text
    except Exception as e:
        votes = 0
        print(e)
    # Score
    score = driver.find_element("id", "j_bookScore").text
    # Book introduction, flattened to a single line
    info = driver.find_element("xpath", "//div[@class='book-intro']").text.replace('\n', '')
    return (book_id, title, author, types, status, words, cliks, recoms, votes, score, info)


# Save the data to MySQL.
def to_sql(data):
    conn = MySQLdb.connect("localhost", "root", "tiger", "test", charset="utf8")
    cursor = conn.cursor()
    # No-op when `test` exists; connect() above already requires it.
    cursor.execute('create database if not exists test')
    # try:
    #     cursor.select_db('test')
    # except Exception as e:
    #     print(e)
    #cursor.execute("set names gb2312")
    cursor.execute('''create table if not exists test.tiger_book2(
        book_id bigint(80),
        title varchar(50),
        author varchar(50),
        types varchar(30),
        status varchar(20),
        words numeric(8,2),
        cliks numeric(10,2),
        recoms numeric(8,2),
        votes varchar(20),
        score varchar(20),
        info varchar(3000),
        primary key (book_id));''')
    # Batch insert; rows whose book_id already exists are skipped.
    cursor.executemany(
        'insert ignore into test.tiger_book2 values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s);',
        data)
    # Sanity-check query; its result is not used.
    cursor.execute('select * from test.tiger_book2 limit 5;')
    conn.commit()
    cursor.close()
    conn.close()


# Collect the book links from every listing page.
base_url = "http://a.qidian.com/?size=-1&sign=-1&tag=-1&chanId=-1&subCateId=-1&orderId=&update=-1&page="
links = []
Max_Page = 30090
rank = 0
for page in range(1, Max_Page + 1):
    print("Processing page %d. Please wait..." % page)
    CurrentUrl = base_url + str(page) + '&month=-1&style=1&action=-1&vip=-1'
    CurrentSoup = BeautifulSoup(requests.get(CurrentUrl).text, "lxml")
    # One list of links per page, so links[i][j] is book j on page i + 1.
    links.append(get_link(CurrentSoup))
    #sleep(1)
# Spot check: the 20th link on page 10.
print(links[9][19])

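
# Scrape the details of every book.
books = []
rate = 1
driver = webdriver.Firefox()
for i in range(0, Max_Page):
    for j in range(0, 20):
        try:
            print("Getting information of the %d-th book." % rate)
            books.append(get_book_info(links[i][j]))
            #sleep(0.8)
        except Exception as e:
            print(e)
        rate += 1
    # After the first page and then every 15 pages (15 x 20 = 300 books):
    # close the browser, write the batch to the database, start a new browser.
    if i % 15 == 0:
        driver.quit()
        to_sql(books)
        books = []
        driver = webdriver.Firefox()
driver.quit()
# Write whatever is left after the last full batch.
to_sql(books)

# Add a sequential id (disabled)
#n=len(books)
#books=zip(*books)
#books.insert(0,range(1,n+1))
#books=zip(*books)
##print books[198]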
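
On the speed point: the listing pages in the script are already fetched with requests rather than a browser, and only the detail pages go through Firefox. The parsing step in `get_link` can be tried offline with nothing but the standard library. The sketch below is an illustration under assumptions, not a replacement for the script: `BookLinkParser`, the sample HTML and the example.com URLs are made up, and it takes the first link inside each `book-mid-info` div rather than specifically the one under `h4`. It also uses `urljoin` to resolve scheme-less hrefs instead of prepending `'https:'` by hand.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BookLinkParser(HTMLParser):
    """Collect the first link inside each <div class="book-mid-info">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._div_depth = 0   # nesting depth inside the current book-mid-info div
        self._taken = False   # whether the current div already produced a link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._div_depth > 0:
                self._div_depth += 1
            elif "book-mid-info" in (attrs.get("class") or "").split():
                self._div_depth = 1
                self._taken = False
        elif tag == "a" and self._div_depth > 0 and not self._taken:
            href = attrs.get("href")
            if href:
                # urljoin borrows the scheme from base_url for "//host/path" hrefs.
                self.links.append(urljoin(self.base_url, href))
                self._taken = True

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth > 0:
            self._div_depth -= 1


# Made-up listing markup that mimics the structure get_link expects.
SAMPLE_HTML = """
<div class="book-mid-info">
  <h4><a href="//book.example.com/info/1">Book one</a></h4>
  <p class="author"><a href="//example.com/author/9">Someone</a></p>
</div>
<div class="book-mid-info">
  <h4><a href="//book.example.com/info/2">Book two</a></h4>
</div>
"""

parser = BookLinkParser("https://a.example.com/")
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Whether a given field can be read this way depends on whether it is present in the raw HTML; fields filled in by JavaScript after loading still need a browser.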

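4. Comparison

Selenium is easier to set up in Python than in R: there is no need to start selenium from the command prompt. However, with no performance tuning on either side, the R version ran faster and had relatively fewer encoding problems.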
