Scraping dynamic web pages (can you help me get results from the following link?)

优采云 Posted: 2022-01-04 10:16


I would be grateful for any help getting results from the following link:

I am using Python 3.7, beautifulsoup4, and Selenium.

I wrote a program to extract the features of hotel user reviews: reviewer name, review date, reviewer score, reviewer country, stay date, review title, and the review text itself (which in this case is split into positive and negative parts). I use beautifulsoup4 to extract the text from the HTML tags, and rely on Selenium to click the cookie-notification button and to loop through the result pages.
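The extraction step described above can be checked against static HTML, independent of the browser. A minimal sketch — the HTML fragment below is made up to mimic the review markup (`li.review_item` and the class names used in the snippet), not copied from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the review-list markup
html = """
<ul>
  <li class="review_item">
    <p class="review_item_date">January 1, 2020</p>
    <p class="reviewer_name">Alice</p>
    <span class="reviewer_country">Germany</span>
    <span class="review-score-badge">8.5</span>
    <p class="review_pos">Great location</p>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
reviews = []
for line in soup.find_all("li", class_="review_item"):
    review = {
        "review_date": line.find("p", class_="review_item_date").text.strip(),
        "reviewer_name": line.find("p", class_="reviewer_name").text.strip(),
        "reviewer_country": line.find("span", class_="reviewer_country").text.strip(),
        "reviewer_score": line.find("span", class_="review-score-badge").text.strip(),
    }
    # Optional fields are only present on some reviews
    pos = line.find("p", class_="review_pos")
    if pos is not None:
        review["review_pos"] = pos.text.strip()
    reviews.append(review)

print(reviews[0]["reviewer_name"])  # Alice
```

If this works on static markup but the scraper still returns first-page data, the parsing logic is not the problem — the `page_source` being parsed is.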

While I can successfully loop through the result pages, I am not extracting the content retrieved after the first page: every one of the N pages yields the same content as the first result page. I suspect this is because the content is loaded via jQuery. At this point I am not sure what the behaviour is, what to look for in the page source, or how to go about finding a solution.

Any hints or suggestions would be greatly appreciated!

My code snippet:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome('/Users/admin/Desktop/chrome_driver/chromedriver')

# Initiate driver/browser via Selenium - with original url
driver.get('link1')

def acceptCookies():
    time.sleep(3)
    element = driver.find_elements_by_xpath("//button[@class='cookie-warning-v2__banner-cta bui-button bui-button--wide bui-button--secondary close_warning']")
    if element:  # find_elements returns a (possibly empty) list, never None
        element[0].click()

def getData(count, soup):
    try:
        for line in soup.find_all('li', class_='review_item'):
            count += 1
            review = {}
            review["review_metadata"] = {}
            review["review_metadata"]["review_date"] = line.find('p', class_='review_item_date').text.strip()
            if line.find('p', class_='review_staydate') is not None:
                review["review_metadata"]["review_staydate"] = line.find('p', class_='review_staydate').text.strip()
            review["review_metadata"]["reviewer_name"] = line.find('p', class_='reviewer_name').text.strip()
            print(review["review_metadata"]["reviewer_name"])
            review["review_metadata"]["reviewer_country"] = line.find('span', class_='reviewer_country').text.strip()
            review["review_metadata"]["reviewer_score"] = line.find('span', class_='review-score-badge').text.strip()
            if line.find('p', class_='review_pos') is not None:
                review["review_metadata"]["review_pos"] = line.find('p', class_='review_pos').text.strip()
            if line.find('p', class_='review_neg') is not None:
                review["review_metadata"]["review_neg"] = line.find('p', class_='review_neg').text.strip()
            scoreword = line.find('span', class_='review_item_header_scoreword')
            if scoreword is not None:
                review["review_metadata"]["review_header"] = scoreword.text.strip()
            else:
                review["review_metadata"]["review_header"] = ""
            hotel_reviews[count] = review
        return hotel_reviews
    except Exception as e:
        return print('the error is', e)

# Finds max range of pagination (number of result pages retrieved)
def find_max_pages():
    max_pages = driver.find_elements_by_xpath("//div[@class='bui-pagination__list']//div//span")
    max_pages = max_pages[-1].text
    max_pages = max_pages.split()
    max_pages = int(max_pages[1])
    return max_pages

hotel_reviews = {}
count = 0
review_page = {}
hotel_reviews_2 = []

# Accept the cookie notification
acceptCookies()

# Find max pages
max_pages = find_max_pages()

# Find every pagination link in order to loop through each review-page carousel
element = driver.find_elements_by_xpath("//a[@class='bui-pagination__link']")
soup = BeautifulSoup(driver.page_source, 'lxml')  # parse the first result page before looping
for item in range(max_pages - 1):
    review_page = getData(count, soup)
    hotel_reviews_2.extend(review_page)
    time.sleep(2)
    element = driver.find_elements_by_xpath("//a[@class='bui-pagination__link']")
    element[item].click()
    driver.get(url=driver.current_url)
    print(driver.page_source)
    print(driver.current_url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
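A common cause of "every page returns the first page's content" is re-parsing `driver.page_source` before the new results have actually replaced the old ones in the DOM. In Selenium the usual remedy is an explicit wait, e.g. grabbing a reference to an element of the current review list before clicking and then calling `WebDriverWait(driver, 10).until(EC.staleness_of(old_element))` before rebuilding the soup. Stripped of Selenium, an explicit wait is just a poll loop; a minimal sketch with a stand-in predicate (no browser involved, so `page_changed` here is a fake condition, not a real DOM check):

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")

# Stand-in for "the old page content went stale": the condition
# flips to True after three polls instead of inspecting a real DOM.
state = {"polls": 0}

def page_changed():
    state["polls"] += 1
    return state["polls"] >= 3

assert wait_until(page_changed, timeout=5.0, interval=0.01)
```

With the real driver, the `time.sleep(2)` in the loop above plays the role of this poll, but a fixed sleep can fire before the jQuery request finishes; waiting on staleness (or on a locator from the next page) ties the re-parse to the actual content change.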
