Scraping web page data
优采云 · Published: 2022-03-04 12:17
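In this post I show how to scrape Google Scholar pages. The example uses the rvest package to collect academic data about the author's PhD advisor: who his coauthors are, how often his papers have been cited, and which institutions the coauthors are affiliated with. Hadley Wickham wrote on the RStudio blog that rvest is inspired by libraries like Beautiful Soup, which make it easy to scrape data from HTML pages. Because rvest is designed to work with magrittr, a complex operation can be expressed as a pipeline of simple, easy-to-read steps.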
Load the R packages (ggplot2 is used for plotting):
library(rvest)
library(ggplot2)

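How many times have his papers been cited?
Use SelectorGadget to find the CSS selector for the "Cited by" column, then pipe the matching nodes through html_text() and as.numeric():
# scholar_url: the advisor's Google Scholar profile URL (not shown here)
page <- read_html(scholar_url)
# cited_by_selector: the CSS selector found with SelectorGadget (not shown here)
citations <- page %>% html_nodes(cited_by_selector) %>% html_text() %>% as.numeric()
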
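The same pipe pattern can be tried offline before pointing it at a live page. The following is a minimal sketch on a made-up HTML snippet; the markup, the `#papers` id, and the `.cited` class are assumptions for illustration, not Google Scholar's real structure.

```r
library(rvest)

# Hypothetical markup: a tiny table with one "cited by" cell per paper.
html <- '<table id="papers">
  <tr><td class="title">Paper A</td><td class="cited">148</td></tr>
  <tr><td class="title">Paper B</td><td class="cited">96</td></tr>
  <tr><td class="title">Paper C</td><td class="cited">79</td></tr>
</table>'

# read_html() also accepts a literal HTML string, so no network is needed.
page <- read_html(html)

# Select the nodes, take their text, convert the text to numbers.
demo_citations <- page %>%
  html_nodes("#papers .cited") %>%
  html_text() %>%
  as.numeric()

print(demo_citations)
```

Once the pipeline behaves as expected on a small snippet, only the input and the selector need to change for the real page.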
Here are the citation counts:
citations
148 96 79 64 57 57 57 55 52 50 48 37 34 33 30 28 26 25 23 22

Draw a bar plot of the citation counts:
barplot(citations, main="How many times has each paper been cited?", ylab='Number of citations', col="skyblue", xlab="")

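Coauthors, their affiliations, and their citation counts
Again, we use SelectorGadget to find the CSS selector that matches the coauthors:
# coauthors_url: the Google Scholar page listing his coauthors (not shown here)
page <- read_html(coauthors_url)
Coauthors = page %>% html_nodes(css=".gsc_1usr_name a") %>% html_text()
Coauthors = as.data.frame(Coauthors)
names(Coauthors)='Coauthors'
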
Now take a look at the coauthors:
head(Coauthors)
                  Coauthors
1               Jason Evans
2             Mutlu Ozdogan
3            Rasmus Houborg
4          M. Tugrul Yilmaz
5 Joseph A. Santanello, Jr.
6              Seth Guikema

dim(Coauthors)
[1] 27  1

As of January 1, 2016, he had 27 coauthors.
How many times have his coauthors been cited?
# page is the coauthor listing page read earlier
citations = page %>% html_nodes(css = ".gsc_1usr_cby") %>% html_text()

citations
 [1] "Cited by 2231" "Cited by 1273" "Cited by 816" "Cited by 395" "Cited by 652" "Cited by 1531"
 [7] "Cited by 674" "Cited by 467" "Cited by 7967" "Cited by 3968" "Cited by 2603" "Cited by 3468"
[13] "Cited by 3175" "Cited by 121" "Cited by 32" "Cited by 469" "Cited by 50" "Cited by 11"
[19] "Cited by 1187" "Cited by 1450" "Cited by 12407" "Cited by 1939" "Cited by 9" "Cited by 706"
[25] "Cited by 336" "Cited by 186" "Cited by 192"

Extract the numeric part of each string with a global substitution:
citations = gsub('Cited by','', citations)

citations
 [1] " 2231" " 1273" " 816" " 395" " 652" " 1531" " 674" " 467" " 7967" " 3968" " 2603" " 3468" " 3175"
[14] " 121" " 32" " 469" " 50" " 11" " 1187" " 1450" " 12407" " 1939" " 9" " 706" " 336" " 186"
[27] " 192"

Convert the strings to numeric values, then put them in a data frame, the format ggplot2 expects:
citations = as.numeric(citations)
citations = as.data.frame(citations)

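The cleanup and conversion steps can be checked on a few of the values shown above. This sketch also tries a swapped-in alternative that strips every non-digit character in one pass; that regex is not the original code, only another way to reach the same numbers.

```r
# A few of the raw strings from the output above.
raw <- c("Cited by 2231", "Cited by 1273", "Cited by 816")

# Approach used in the text: drop the literal prefix.
# as.numeric() tolerates the leading space that is left behind.
by_prefix <- as.numeric(gsub("Cited by", "", raw))

# Alternative (an assumption, not the original code): drop all non-digits.
by_digits <- as.numeric(gsub("[^0-9]", "", raw))

print(by_prefix)
print(identical(by_prefix, by_digits))
```

The prefix version depends on the exact "Cited by" wording, while the digit-only version would also survive a change in the label text.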
Coauthor affiliations
Create a data frame of coauthors, citation counts, and affiliations (affilation is the data frame of affiliations, with the column name Affilation):
cauthors=cbind(Coauthors, citations, affilation)

cauthors
                  Coauthors citations Affilation
1               Jason Evans      2231 University of New South Wales
2             Mutlu Ozdogan      1273 Assistant Professor of Environmental Science and Forest Ecology, University of Wisconsin
3            Rasmus Houborg       816 Research Scientist at King Abdullah University of Science and Technology
4          M. Tugrul Yilmaz       395 Assistant Professor, Civil Engineering Department, Middle East Technical University, Turkey
5 Joseph A. Santanello, Jr.       652 NASA-GSFC Hydrological Sciences Laboratory
.....

Reordering coauthors by citation count
Reorder the coauthors by citation count so that the plot is drawn in descending order:
cauthors$Coauthors
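One way to do this kind of reordering is to turn the coauthor names into a factor whose levels follow the citation counts, since ggplot2 draws a discrete axis in level order. The sketch below is an assumption about how the step could look, not the original code; it uses the five coauthors listed above, and the data frame name `demo` is hypothetical.

```r
library(ggplot2)

# First five coauthors and their citation counts, as listed above.
demo <- data.frame(
  Coauthors = c("Jason Evans", "Mutlu Ozdogan", "Rasmus Houborg",
                "M. Tugrul Yilmaz", "Joseph A. Santanello, Jr."),
  citations = c(2231, 1273, 816, 395, 652),
  stringsAsFactors = FALSE
)

# Levels sorted by increasing citations; after coord_flip() the
# most-cited coauthor ends up at the top of the chart.
demo$Coauthors <- factor(
  demo$Coauthors,
  levels = demo$Coauthors[order(demo$citations)]
)

p <- ggplot(demo, aes(x = Coauthors, y = citations)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  labs(x = NULL, y = "Number of citations")

print(levels(demo$Coauthors))
```

Calling `print(p)` draws the chart. Without the factor step, ggplot2 would order the names alphabetically instead of by citation count.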