Scraping web content with JS (scraping a Google Scholar page that has a "Show more" button)

Published on 优采云: 2022-03-16 06:08

Scraping a JavaScript-driven web page with Selenium

Tags: python, selenium, web-scraping

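I want to scrape a Google Scholar page that has a "Show more" button. From an earlier question I learned that the extra rows are loaded by JavaScript rather than being part of the initial HTML, and that there are several ways to scrape such a page. I tried Selenium with the following code: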

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
chrome_path = r"....path....."
driver = webdriver.Chrome(chrome_path)
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
driver.find_element_by_xpath('/html/body/div/div[13]/div[2]/div/div[4]/form/div[2]/div/button/span/span[2]').click()
soup = BeautifulSoup(driver.page_source,'html.parser')
papers = soup.find_all('tr',{'class':'gsc_a_tr'})
for paper in papers:
    title = paper.find('a',{'class':'gsc_a_at'}).text
    author = paper.find('div',{'class':'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]
    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)

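The browser now clicks the "Show more" button and displays the whole page. However, I still only get the information for the first 20 papers, and I don't understand why. Please help.

Thanks.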

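I think the problem is that the new elements have not finished loading when your program reads the page. Try importing time and sleeping for a few seconds after each click, like this (I removed the headless option so that you can watch the program work):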

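import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.page_load_strategy = 'normal'
driver = webdriver.Chrome(options=options)
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")

# Note: the find_element(s)_by_* helpers are Selenium 3 API; they were removed
# in Selenium 4.3, where find_element(By.CSS_SELECTOR, ...) is used instead.

# Clumsy approach:
# load all available articles first, then iterate over them
for i in range(1, 3):
    driver.find_element_by_css_selector('#gsc_bpf_more').click()
    # wait for the new elements to load
    time.sleep(3)

# Container that holds all the data
for result in driver.find_elements_by_css_selector('#gsc_a_b .gsc_a_t'):
    title = result.find_element_by_css_selector('.gsc_a_at').text
    authors = result.find_element_by_css_selector('.gsc_a_at+.gs_gray').text
    publication = result.find_element_by_css_selector('.gs_gray+.gs_gray').text
    print(title)
    print(authors)
    print(publication)
    # just a separator between entries
    print()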

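Partial output:

Tax/subsidy policies in the presence of environmentally aware consumers
S Bansal, S Gangopadhyay
Journal of Environmental Economics and Management 45 (2), 333-355

Choice and design of regulatory instruments in the presence of green consumers
S Bansal
Resource and Energy Economics 30 (3), 345-368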
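A fixed sleep and a fixed number of clicks only work while the page happens to respond within that time and the profile happens to need exactly that many clicks. As a rough sketch of an alternative, the helper below keeps clicking "Show more" and polls the row count until new rows appear. It assumes the button id `gsc_bpf_more` and the row class `gsc_a_tr` seen in the code on this page, and it assumes (not confirmed by the source) that the page disables the button once everything is loaded. The names `load_all_rows` and `fetch_full_page` are made up for this illustration.

```python
import time

# Selectors taken from the code on this page; they may change if Google changes its markup.
ROW_SELECTOR = "#gsc_a_b .gsc_a_tr"     # one <tr> per paper
MORE_BUTTON_SELECTOR = "#gsc_bpf_more"  # the "Show more" button
CSS = "css selector"                    # string value of selenium's By.CSS_SELECTOR


def load_all_rows(driver, max_clicks=50, timeout=10.0, poll=0.25):
    """Click "Show more" until it is disabled or no new rows arrive.

    Returns the number of rows present when clicking stopped.
    """
    for _ in range(max_clicks):
        # Re-locate the button each time to avoid holding a stale reference.
        button = driver.find_element(CSS, MORE_BUTTON_SELECTOR)
        if not button.is_enabled():
            break  # assumption: the page disables the button after the last batch
        before = len(driver.find_elements(CSS, ROW_SELECTOR))
        button.click()
        deadline = time.monotonic() + timeout
        # Poll until the row count grows, instead of sleeping a fixed time.
        while len(driver.find_elements(CSS, ROW_SELECTOR)) <= before:
            if time.monotonic() > deadline:
                return before  # nothing new arrived in time; stop clicking
            time.sleep(poll)
    return len(driver.find_elements(CSS, ROW_SELECTOR))


def fetch_full_page(url):
    """Open url in Chrome, expand the list, and return the final HTML."""
    from selenium import webdriver  # lazy import: only needed for a real browser run

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        load_all_rows(driver)
        return driver.page_source
    finally:
        driver.quit()
```

Calling `fetch_full_page("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")` would return HTML that can be handed to BeautifulSoup as in the question. The helper only uses `find_element`, `find_elements`, `is_enabled` and `click`, so it should work with the Selenium 4 API; Selenium's own `WebDriverWait` is another way to express the same polling.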

# Alternative: pause with time.sleep() after loading the page and after clicking
# "Show more", then parse the resulting HTML with BeautifulSoup.
# Note: find_element_by_id is Selenium 3 API (removed in Selenium 4.3).
from selenium import webdriver
import time
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
# Pass the options to the driver; otherwise the arguments above have no effect
driver = webdriver.Chrome(options=options)
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
time.sleep(3)
driver.find_element_by_id("gsc_bpf_more").click()
time.sleep(4)
soup = BeautifulSoup(driver.page_source, 'html.parser')
papers = soup.find_all('tr', {'class': 'gsc_a_tr'})
for paper in papers:
    title = paper.find('a', {'class': 'gsc_a_at'}).text
    author = paper.find('div', {'class': 'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]
    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)
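The parsing step can also be tried out without a browser. The sketch below puts it into a function and runs it on a small hand-written HTML sample, which makes it easy to check the selectors before pointing Selenium at the real page. The sample markup is hypothetical and only mimics the class names used in the code above (`gsc_a_tr`, `gsc_a_at`, `gs_gray`); the real page may be structured differently. The function name `parse_papers` is also made up for this example.

```python
from bs4 import BeautifulSoup


def parse_papers(html):
    """Return one dict per paper row found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    papers = []
    for row in soup.select("tr.gsc_a_tr"):
        title = row.select_one("a.gsc_a_at")
        # Assumption: the first gs_gray div holds the authors, the second the venue.
        grays = row.select("div.gs_gray")
        papers.append({
            "title": title.get_text(strip=True) if title else "",
            "authors": grays[0].get_text(strip=True) if len(grays) > 0 else "",
            "journal": grays[1].get_text(strip=True) if len(grays) > 1 else "",
        })
    return papers


# Hypothetical sample that mimics the row structure assumed above.
SAMPLE = """
<table><tbody id="gsc_a_b">
<tr class="gsc_a_tr"><td class="gsc_a_t">
  <a class="gsc_a_at">Example title</a>
  <div class="gs_gray">A Author, B Author</div>
  <div class="gs_gray">Example Journal 1 (2), 3-4</div>
</td></tr>
<tr class="gsc_a_tr"><td class="gsc_a_t">
  <a class="gsc_a_at">Title only</a>
</td></tr>
</tbody></table>
"""

for paper in parse_papers(SAMPLE):
    print(paper)
```

Unlike calling `.text` directly on the result of `find`, this version returns an empty string when a field is missing instead of raising `AttributeError`, which helps when some rows are incomplete.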
