R in the Left Hand, Python in the Right: Hands-On CSS Web Page Parsing

优采云, published: 2022-06-02 16:48

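　　Du Yu (杜雨) is a member of the EasyCharts team and a columnist for the R语言中文社区 (R Language Chinese Community). His interests are Excel business charts, data visualization in R, and geographic data visualization. He is the founder of 数据小魔方, his personal WeChat public account (WeChat ID: datamofang).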

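　　I have previously written several posts on how to use CSS and XPath parsing tools in web scraping, together with hands-on applications. This post wraps up the series. It walks through parsing HTML text with CSS expressions, using the rvest package in R and the requests library in Python.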

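　　CSS and XPath each have strengths and weaknesses in the parsing workflow. Combining them and applying them flexibly can greatly improve the efficiency of web data scraping!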

　　R:

　　library("rvest")

url <- …

…
    … %>% html_text() %>% c(title, .)
    ### Categories: collect every category label
    category = result %>% html_nodes(".category") %>% html_text() %>% c(category, .)
    ### Extract author, subtitle, review count, rating, and price
    author_text = subtext = eveluate_text = rating_text = price_text = rep('', length)
    for (i in seq_len(length)) {
      ### An item may have more than one author: join them with "/"
      author_text[i] = result %>%
        html_nodes(sprintf("ol li:nth-of-type(%d) div.info > p:nth-of-type(1) a,ol li:nth-of-type(%d) .author a", i, i)) %>%
        html_text() %>% paste(collapse = '/')
      ### The subtitle may be missing
      if (result %>% html_nodes(sprintf("ol li:nth-of-type(%d) .subtitle", i)) %>% length() != 0) {
        subtext[i] = result %>% html_nodes(sprintf("ol li:nth-of-type(%d) .subtitle", i)) %>% html_text()
      }
      ### The review count may be missing
      if (result %>% html_nodes(sprintf("ol > li:nth-of-type(%d) a.ratings-link span", i)) %>% length() != 0) {
        eveluate_text[i] = result %>% html_nodes(sprintf("ol > li:nth-of-type(%d) a.ratings-link span", i)) %>% html_text()
      }
      ### The rating may be missing
      if (result %>% html_nodes(sprintf("ol > li:nth-of-type(%d) span.rating-average", i)) %>% length() != 0) {
        rating_text[i] = result %>% html_nodes(sprintf("ol > li:nth-of-type(%d) span.rating-average", i)) %>% html_text()
      }
      ### The price may be missing
      if (result %>% html_nodes(sprintf("ol > li:nth-of-type(%d) span.price-tag", i)) %>% length() != 0) {
        price_text[i] = result %>% html_nodes(sprintf("ol > li:nth-of-type(%d) span.price-tag", i)) %>% html_text()
      }
    }
    ### Append this page's values to the running vectors
    author = c(author, author_text)
    subtitle = c(subtitle, subtext)
    eveluate_nums = c(eveluate_nums, eveluate_text)
    rating = c(rating, rating_text)
    price = c(price, price_text)
    ### Report per-page progress
    print(sprintf("page %d is over!!!", page + 1))
  }
  ### Report overall completion
  print("everything is OK")
  myresult = data.frame(title, subtitle, author, category, price, rating, eveluate_nums)
  return(myresult)
}
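The per-item checks in the function above (look inside one list item, and fall back to an empty string when a field is absent) can be sketched in Python as well. This is only an illustration under stated assumptions, not the author's Python code. It swaps CSS selectors for the small XPath subset in the standard library's `xml.etree.ElementTree`, because the standard library has no CSS selector engine. The sample markup and the helper names `first_text` and `parse_items` are hypothetical. The class names are borrowed from the selectors in the R code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample standing in for one result page.
SAMPLE = """
<ol>
  <li>
    <div class="title"><a>Book A</a></div>
    <p class="subtitle">A subtitle</p>
    <span class="author"><a>Ann</a><a>Bob</a></span>
    <span class="rating-average">8.5</span>
    <span class="price-tag">12.00元</span>
  </li>
  <li>
    <div class="title"><a>Book B</a></div>
    <span class="author"><a>Cid</a></span>
    <span class="price-tag">免费</span>
  </li>
</ol>
"""


def first_text(node, path):
    """Return the text of the first match under `node`, or '' if there is none."""
    found = node.find(path)
    if found is None:
        return ""
    return (found.text or "").strip()


def parse_items(markup):
    """Build one dict per <li>, keeping missing fields as empty strings."""
    root = ET.fromstring(markup.strip())
    rows = []
    for li in root.findall("li"):
        # An item may have several authors: join them with "/".
        authors = [a.text or "" for a in li.findall(".//span[@class='author']/a")]
        rows.append({
            "title": first_text(li, ".//div[@class='title']/a"),
            "subtitle": first_text(li, ".//p[@class='subtitle']"),
            "author": "/".join(authors),
            "rating": first_text(li, ".//span[@class='rating-average']"),
            "price": first_text(li, ".//span[@class='price-tag']"),
        })
    return rows


rows = parse_items(SAMPLE)
```

Because every lookup is scoped to a single `li`, each field vector keeps one entry per item even when a field is missing. That is the same reason the R code indexes items with `nth-of-type`.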

　　Run the scraping function:

　　myresult = getcontent(url)

　　Check the data structure and fix the column types:

　　str(myresult)

　　myresult$price <- myresult$price %>% sub("元|免费", "", .) %>% as.numeric()

myresult$rating
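The price cleanup can be sketched in Python too. This is a minimal illustration, not taken from the original post. The `clean_price` helper and the sample strings are hypothetical, and `None` stands in for the `NA` that `as.numeric()` produces in R when nothing numeric is left.

```python
import re


def clean_price(text):
    """Strip the first '元' or '免费' marker and convert the rest to a float."""
    stripped = re.sub("元|免费", "", text, count=1).strip()
    try:
        return float(stripped)
    except ValueError:
        # Nothing numeric remains (free or missing price).
        return None


prices = [clean_price(p) for p in ["12.00元", "免费", ""]]
```

With `count=1`, only the first match is replaced, which mirrors `sub()` in R.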
