c Scraping Web Page Data (Python Web Data in Practice Series, Part 16: XPath and Web Page Parsing Libraries)

Published on 优采云: 2021-11-09 00:25

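Friends often ask me what to do about empty, missing, or nonexistent values when scraping web data with R.

Most of the data we scrape from the web is relational, so fields and records have to line up one to one. HTML documents, however, vary widely in structure and the markup is messy, so it is hard to guarantee that the extracted values are strictly aligned from the start. You end up needing many checks for missing and nonexistent content.

If the source data is relational but the fields you scrape come out misaligned, so that records can no longer be matched one to one, the data is usually of little value. Today I will use a small case (the same one as yesterday) to show how to add logical checks inside page traversal and nested loops, filling in a default value as soon as a value is missing or absent, so that your scraper is more robust and its output more regular.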
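To make the alignment problem concrete, here is a minimal sketch on a made-up HTML fragment (the markup, class names, and variable names are assumptions for illustration, not the target site's). A page-level XPath query silently skips records that lack a field, while a per-record query that starts from a vector of defaults keeps exactly one value per record.

```r
library(XML)

# Hypothetical fragment: the second book has no subtitle
html <- "<ol>
  <li><h4 class='title'>Book A</h4><p class='subtitle'>Sub A</p></li>
  <li><h4 class='title'>Book B</h4></li>
  <li><h4 class='title'>Book C</h4><p class='subtitle'>Sub C</p></li>
</ol>"
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Page-level query: fewer values than records, so positions no longer match
subtitle_all <- xpathSApply(doc, "//ol/li/p[@class='subtitle']", xmlValue)

# Per-record query: start from defaults and overwrite only when something is found
n <- length(getNodeSet(doc, "//ol/li"))
subtitle <- rep("", n)
for (i in seq_len(n)) {
  hit <- xpathSApply(doc, sprintf("//ol/li[%d]/p[@class='subtitle']", i), xmlValue)
  if (length(hit) != 0) subtitle[i] <- hit[1]
}

print(length(subtitle_all))  # number of subtitles found page-wide
print(subtitle)              # one entry per record, "" where the subtitle is absent
```

The second vector can be bound to the titles in a data frame safely; the first cannot, because nothing records which book the gap belongs to.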

Load the extension packages:

# Load the packages:

library("XML")

library("stringr")

library("RCurl")

library("dplyr")

library("rvest")

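# Provide the target URL / request header parameters

# [Code missing from this copy: everything after `url` up to this point, including the URL and header values, the opening of getcontent(), the page loop, and the title extraction.]

# Extract the category (the left-hand side `category = content` is inferred from the statements below)

category = content %>% xpathSApply(.,"//span[@class='category']/span[2]/span | //p[@class='category']/span[@class='labled-text'] | //div[@class='category']",xmlValue) %>% c(category,.)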

### Extract the author / subtitle / number of ratings / rating / price:

author_text=subtitle_text=eveluate_nums_text=rating_text=price_text=rep('',length)

for (i in 1:length){

### Extract the author(s); multiple authors are joined with "/"

author_text[i]=content %>% xpathSApply(.,sprintf("//li[%d]//p[@class]//span/following-sibling::span/a | //li[%d]//div[@class='author']/a",i,i),xmlValue) %>% paste(.,collapse='/')

### Check whether a subtitle exists; otherwise keep the default ''

if (content %>% xpathSApply(.,sprintf("//ol/li[%d]//p[@class='subtitle']",i),xmlValue) %>% length!=0){

subtitle_text[i]=content %>% xpathSApply(.,sprintf("//ol/li[%d]//p[@class='subtitle']",i),xmlValue)

}

### Check whether a rating count exists:

if (content %>% xpathSApply(.,sprintf("//ol/li[%d]//a[@class='ratings-link']/span",i),xmlValue) %>% length!=0){

eveluate_nums_text[i]=content %>% xpathSApply(.,sprintf("//ol/li[%d]//a[@class='ratings-link']/span",i),xmlValue)

}

### Check whether a rating exists:

if (content %>% xpathSApply(.,sprintf("//ol/li[%d]//div[@class='rating list-rating']/span[2]",i),xmlValue) %>% length!=0){

rating_text[i]=content %>% xpathSApply(.,sprintf("//ol/li[%d]//div[@class='rating list-rating']/span[2]",i),xmlValue)

}

### Check whether a price exists:

if (content %>% xpathSApply(.,sprintf("//ol/li[%d]//span[@class='price-tag ']",i),xmlValue) %>% length!=0){

price_text[i]=content %>% xpathSApply(.,sprintf("//ol/li[%d]//span[@class='price-tag ']",i),xmlValue)

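}

}  # end of the per-book loop (this closing brace was missing)

# Append this page's records, collected by index, to the running vectors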

author=c(author,author_text)

subtitle=c(subtitle,subtitle_text)

eveluate_nums=c(eveluate_nums,eveluate_nums_text)

rating=c(rating,rating_text)

price=c(price,price_text)

# Print the status of the single-page task

print(sprintf("page %d is over!!!",page))

}

# Build the data frame

myresult=data.frame(title,subtitle,author,category,price,rating,eveluate_nums)

# Print the status of the overall task

print("everything is OK")

# Return the final combined data frame

return(myresult)

}
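The four existence checks in the loop all follow one pattern: run the query, and keep the default if nothing comes back. As a sketch of how that pattern could be folded into a helper (the name `xpath_or_default`, its arguments, and the sample markup are hypothetical, not part of the original code), which also evaluates each XPath once instead of twice:

```r
library(XML)

# Hypothetical helper: return the matched text, or `default` when the XPath
# matches nothing. Multiple matches are joined with `collapse`, mirroring the
# way the loop above joins several authors with "/".
xpath_or_default <- function(doc, path, default = "", collapse = "/") {
  hit <- xpathSApply(doc, path, xmlValue)
  if (length(hit) == 0) return(default)
  paste(hit, collapse = collapse)
}

# Made-up sample: the second record has no price
html <- "<ol>
  <li><div class='author'><a>Ann</a><a>Bob</a></div><span class='price'>9.99</span></li>
  <li><div class='author'><a>Cid</a></div></li>
</ol>"
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

n <- length(getNodeSet(doc, "//ol/li"))
author <- price <- rep("", n)
for (i in seq_len(n)) {
  author[i] <- xpath_or_default(doc, sprintf("//ol/li[%d]//div[@class='author']/a", i))
  price[i]  <- xpath_or_default(doc, sprintf("//ol/li[%d]//span[@class='price']", i),
                                default = NA_character_)
}

result <- data.frame(author, price, stringsAsFactors = FALSE)
print(result)
```

Two small design choices here: `seq_len(n)` does nothing when a page has no records, whereas `1:n` would iterate over `1` and `0`; and using `NA` as the default rather than `''` is an option when the column will later be converted to a number.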

Supply the URL and run the scraping function we built:

　　myresult=getcontent(url)

[1] "page 0 is over!!!"

[1] "page 1 is over!!!"

[1] "page 2 is over!!!"

[1] "page 3 is over!!!"

[1] "everything is OK"

Inspect the structure of the result:

　　str(myresult)

Normalize the variable types:

myresult$price <- myresult$price %>% sub("元|免费","",.) %>% as.numeric()

myresult$rating
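As a sketch of what the price cleaning does, here it is applied to made-up values (the sample strings and variable names are assumptions; the pattern `元|免费` is the one used above). Stripping the pattern turns "免费" (free) into an empty string, which `as.numeric()` converts to `NA`, so whether a free book should end up as `NA` or `0` is a choice worth making explicitly.

```r
# Made-up raw price strings, including a free book and an absent price
price_raw <- c("12.00元", "免费", "3.5元", "")

# Strip the currency unit and the word for "free", then convert to numeric;
# empty strings become NA
price_num <- as.numeric(sub("元|免费", "", price_raw))

# Optional: treat books marked as free as price 0 rather than NA
price_num[price_raw == "免费"] <- 0

print(price_num)
```

Only the record whose price was absent from the page stays `NA`, which keeps "free" and "unknown" distinguishable in later analysis.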
