Dynamic Web Page Scraping (Selenium Example: Airbnb Short-Term Rental Data)

优采云 · Published: 2021-10-16 09:48

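　　Dynamic Web Page Scraping

　　Last time I scraped the titles of the Douban Books Top 250 from a static page. This time, following the same book, I look at scraping dynamic pages.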

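　　Introduction to Dynamic Web Pages

　　The difference between the two is that a static page's visible content is all present in its HTML source, whereas a dynamic page typically uses AJAX to exchange data with the server in the background, so that part of the page can be updated without reloading the whole page.

　　AJAX stands for Asynchronous JavaScript And XML. It makes web applications faster and lighter: repeated content is not downloaded again, which saves bandwidth, but it also makes scraping more involved.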
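　　To make the difference concrete, here is a minimal sketch using only the standard library. The HTML shell, the JSON payload, and the names `SHELL_HTML`, `AJAX_PAYLOAD`, and `ListingCounter` are invented for illustration and are not taken from any real site. The point is only that the listings are absent from the static source and arrive separately as data.

```python
import json
from html.parser import HTMLParser

# Hypothetical static HTML as a plain HTTP fetch would return it:
# the container exists, but JavaScript has not filled it yet.
SHELL_HTML = (
    '<html><body><div id="listings"></div>'
    '<script src="app.js"></script></body></html>'
)

# Hypothetical JSON that the page's JavaScript would request via AJAX.
AJAX_PAYLOAD = (
    '{"listings": [{"name": "Loft A", "price": 288},'
    ' {"name": "Studio B", "price": 199}]}'
)


class ListingCounter(HTMLParser):
    """Counts <div class="listing"> elements in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "div" and ("class", "listing") in attrs:
            self.count += 1


def count_static_listings(html):
    """Return how many listing elements appear in the raw HTML."""
    parser = ListingCounter()
    parser.feed(html)
    return parser.count


def listings_from_ajax(payload):
    """Return the listing records carried by the JSON payload."""
    return json.loads(payload)["listings"]


static_count = count_static_listings(SHELL_HTML)
ajax_listings = listings_from_ajax(AJAX_PAYLOAD)
print("listings in static HTML:", static_count)
print("listings delivered by AJAX:", len(ajax_listings))
```

　　A scraper that only reads the HTML source sees the empty container. The data has to come either from the request that carries the payload or from a browser that has already run the JavaScript.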

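　　How to Scrape a Dynamic Page

　　There are two ways to scrape content that a dynamic page loads through AJAX:

　　Finding the real data address with the browser's Inspect tool

　　Use Chrome's Inspect feature and open the Network panel, which lists every file the browser fetches from the web server, then locate the address that actually serves the data. This process is called "packet capture". It runs into trouble fairly often: some sites encrypt or obfuscate their requests to discourage scraping, which makes the real address hard to find.

　　Simulating a browser with Selenium

　　This method uses the browser's own rendering engine: the browser parses the HTML, applies the CSS, and executes the JavaScript just as it does when displaying a page. Selenium drives the browser through the pages automatically and collects data from the rendered result, which turns scraping a dynamic page into scraping a static one.

　　Installing Selenium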

　　Like other Python libraries, Selenium can be installed with pip:

　　pip install selenium

　　Once pip reports success, the installation is complete.
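　　A quick way to confirm the installation from Python, sketched with the standard library only. The helper name `selenium_version` is made up for this example. It returns `None` when the package is missing rather than raising.

```python
from importlib import metadata


def selenium_version():
    """Return the installed Selenium version string, or None if it is missing."""
    try:
        return metadata.version("selenium")
    except metadata.PackageNotFoundError:
        return None


version = selenium_version()
if version is None:
    print("Selenium is not installed; run: pip install selenium")
else:
    print("Selenium", version, "is installed")
```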

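　　Selenium Example: Airbnb Short-Term Rental Data

　　Goal: collect the name, price, number of reviews, house type, number of beds, and number of guests for short-term rentals in Changsha, Hunan, over the first 10 pages of results.

　　URL: []=%2Fhomes&query=长沙&place_id=ChIJxWQcnvM1JzQRgKbxoZy75bE&s_tag=R2PBwazh

　　Open the Airbnb short-term rental listings page for Changsha and use "Inspect" to see where the data sits, as shown in the figure:

　　The element holding all the data for one listing is: div._gig1e7

　　Within it, the price is located at: div._18gk84h

　　The review count, listing name, and house type can be found the same way. They are summarized in the table below: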

　　Data                        Element   Class
　　All data for one listing    div       _gig1e7
　　Price                       div       _18gk84h
　　Number of reviews           div       _13o4q7nw
　　Name                        div       _qhtkbey
　　House type                  span      _fk7kh10
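　　Class names such as `_gig1e7` look machine-generated and may change whenever the site is rebuilt, so it can help to keep the element and class pairs in one place in code. A small sketch: the `SELECTORS` dictionary and the `selector` helper are hypothetical names, and the values are just the tag and class pairs listed above.

```python
# Tag and class for each field, copied from the summary of elements and classes.
SELECTORS = {
    "listing": ("div", "_gig1e7"),
    "price": ("div", "_18gk84h"),
    "reviews": ("div", "_13o4q7nw"),
    "name": ("div", "_qhtkbey"),
    "details": ("span", "_fk7kh10"),
}


def selector(field):
    """Build a CSS selector string such as 'div._gig1e7' for a field."""
    tag, css_class = SELECTORS[field]
    return f"{tag}.{css_class}"


for field in SELECTORS:
    print(field, "->", selector(field))
```

　　If the page markup changes, only the dictionary needs updating, not every lookup in the scraping loop.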

　　With the locations of the data known, Selenium can fetch the data from the first page of Airbnb results. The code is as follows:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Search-results URL for Changsha
url = 'https://www.airbnb.cn/s/homes?refinement_paths%5B%5D=%2Fhomes&query=%E9%95%BF%E6%B2%99&place_id=ChIJxWQcnvM1JzQRgKbxoZy75bE&s_tag=R2PBwazh'

# Start Chrome and load the page
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)  # crude wait for the AJAX content to finish loading

# Each div._gig1e7 holds all the data for one listing.
# Selenium 4 removed find_element(s)_by_css_selector;
# find_element(s)(By.CSS_SELECTOR, ...) is the current form.
rent_list = driver.find_elements(By.CSS_SELECTOR, 'div._gig1e7')

for eachhouse in rent_list:
    # Number of reviews
    comment = eachhouse.find_element(By.CSS_SELECTOR, 'div._13o4q7nw').text

    # Price, with the "每晚" (per night) and "价格" (price) labels stripped
    price = eachhouse.find_element(By.CSS_SELECTOR, 'div._18gk84h').text
    price = price.replace("每晚", "").replace("价格", "").replace("\n", "")

    # Listing name
    name = eachhouse.find_element(By.CSS_SELECTOR, 'div._qhtkbey').text

    # Other details: house type and number of beds, separated by " · "
    details = eachhouse.find_element(By.CSS_SELECTOR, 'span._fk7kh10').text
    house_type = details.split(" · ")[0]
    bed_number = details.split(" · ")[1]

    print(comment, price, name, house_type, bed_number)

driver.quit()

　　Running it prints one line per listing: review count, price, name, house type, and number of beds.
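　　The text clean-up inside the loop can be pulled into small functions that are easy to check without a browser. This is only a sketch: `clean_price` and `split_details` are hypothetical helper names, and the sample strings are invented stand-ins for what the page might return. Unlike indexing `split(" · ")[1]` directly, `split_details` returns an empty bed count instead of raising `IndexError` when the separator is missing.

```python
def clean_price(raw):
    """Strip the '每晚' (per night) and '价格' (price) labels and newlines."""
    return raw.replace("每晚", "").replace("价格", "").replace("\n", "")


def split_details(raw):
    """Split 'house type · beds' into a (house_type, bed_number) pair."""
    parts = raw.split(" · ")
    house_type = parts[0]
    # Fall back to an empty string when the bed count is not present.
    bed_number = parts[1] if len(parts) > 1 else ""
    return house_type, bed_number


# Invented sample strings, for illustration only.
print(clean_price("价格\n￥288\n每晚"))
print(split_details("整套公寓 · 2张床"))
print(split_details("整套公寓"))
```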

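　　This only fetches one page, and the goal is the first 10 pages. Looking at the address of the second page, it becomes: []=%2Fhomes&section_offset=6&items_offset=18&s_tag=mt59xV_D

　　The third page's address is: []=%2Fhomes&section_offset=6&items_offset=36&s_tag=mt59xV_D

　　The difference is the items_offset value, which increases in multiples of 18, so looping over it yields the first ten pages. The code can be modified as follows: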

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start Chrome once and reuse it for every page
driver = webdriver.Chrome()

for i in range(0, 10):
    # items_offset advances by 18 listings per page: 0, 18, 36, ...
    url = 'https://www.airbnb.cn/s/homes?refinement_paths%5B%5D=%2Fhomes&toddlers=0&query=%E9%95%BF%E6%B2%99&s_tag=qevSKrvy&section_offset=6&items_offset=' + str(i * 18) + '&place_id=ChIJxWQcnvM1JzQRgKbxoZy75bE'
    driver.get(url)
    time.sleep(3)  # crude wait for the AJAX content to finish loading

    # Each div._gig1e7 holds all the data for one listing.
    # Selenium 4 removed find_element(s)_by_css_selector;
    # find_element(s)(By.CSS_SELECTOR, ...) is the current form.
    rent_list = driver.find_elements(By.CSS_SELECTOR, 'div._gig1e7')

    for eachhouse in rent_list:
        # Number of reviews
        comment = eachhouse.find_element(By.CSS_SELECTOR, 'div._13o4q7nw').text

        # Price, with the "每晚" (per night) and "价格" (price) labels stripped
        price = eachhouse.find_element(By.CSS_SELECTOR, 'div._18gk84h').text
        price = price.replace("每晚", "").replace("价格", "").replace("\n", "")

        # Listing name
        name = eachhouse.find_element(By.CSS_SELECTOR, 'div._qhtkbey').text

        # Other details: house type and number of beds, separated by " · "
        details = eachhouse.find_element(By.CSS_SELECTOR, 'span._fk7kh10').text
        house_type = details.split(" · ")[0]
        bed_number = details.split(" · ")[1]

        print(comment, price, name, house_type, bed_number)

driver.quit()

　　The output now covers the first 10 pages of listings for the Changsha area on Airbnb.
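　　The per-page URL can also be assembled with `urllib.parse.urlencode` instead of string concatenation, which keeps the offset arithmetic visible. A sketch under assumptions: `build_page_url`, `BASE_URL`, and `PAGE_SIZE` are hypothetical names, and the query parameters are copied from the URL used in the loop. Whether the site still accepts these parameters is not something this sketch can confirm.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airbnb.cn/s/homes"
PAGE_SIZE = 18  # items_offset grows by 18 from one page to the next


def build_page_url(page_index):
    """Return the search URL for a zero-based page index."""
    params = {
        "refinement_paths[]": "/homes",
        "toddlers": 0,
        "query": "长沙",
        "s_tag": "qevSKrvy",
        "section_offset": 6,
        "items_offset": page_index * PAGE_SIZE,
        "place_id": "ChIJxWQcnvM1JzQRgKbxoZy75bE",
    }
    # urlencode percent-encodes keys and values, including the Chinese query.
    return BASE_URL + "?" + urlencode(params)


for page_index in range(3):
    print(build_page_url(page_index))
```

　　With this helper, the loop body would call `driver.get(build_page_url(i))` in place of the concatenated string.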

　　It took me this long to get around to this because I am lazy, so I am off to face the wall and reflect.
