Scraping web page data with Python: usage notes and reference code (Part 1)

优采云 · Published: 2021-12-03 22:27

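　　This article shows how to collect doctor data from the WeDoctor (微医) appointment-registration site with Python. It should be a useful reference for anyone who needs to do something similar, so let's walk through it.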

　　1. Introduction

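　　The site we are crawling today is WeDoctor (微医); the crawler code below requests pages under https://www.guahao.com. We will crawl it with a Python 3 crawler and store the data in a CSV file, in preparation for a later analysis tutorial. The main libraries used in this article are pyppeteer and pyquery.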

　　First, find the doctor list page. Its URL path ends with:

　　全国/all/不限/p5

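　　The page reports 75,952 records. In practice, though, data stops loading once you reach page 38; it appears the backend simply returns nothing beyond that point. Since this is a learning exercise, we will live with it.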

　　2. Page URLs

　　全国/all/不限/p1

　　全国/all/不限/p2

　　...

　　全国/all/不限/p38
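　　The page URLs above differ only in the trailing page number, so they can be generated in a loop. The following is a minimal sketch, assuming the URL pattern used by the crawler code in this article and the 38-page limit observed in testing. The helper name `build_page_urls` and the optional percent-encoding step are illustrative assumptions, not part of the original code.

```python
from urllib.parse import quote

# URL pattern taken from the crawler code in this article; {} is the page number.
BASE = "https://www.guahao.com/expert/all/全国/all/不限/p{}"
LAST_PAGE = 38  # pages beyond 38 returned no data in the author's test


def build_page_urls(first=1, last=LAST_PAGE, encode=False):
    """Return the list-page URLs for pages first..last (inclusive)."""
    urls = [BASE.format(n) for n in range(first, last + 1)]
    if encode:
        # Percent-encode the non-ASCII path segments, leaving ':' and '/' intact.
        urls = [quote(u, safe=":/") for u in urls]
    return urls


if __name__ == "__main__":
    for url in build_page_urls(last=3):
        print(url)
```

　　Passing `encode=True` is only needed when a client does not percent-encode non-ASCII paths on its own.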

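　　That comes to 38 pages in total, which is not much data, so almost any library will do for scraping it. For this post I picked a fairly obscure one, pyppeteer.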

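　　While using pyppeteer I found, somewhat awkwardly, that there is very little material about it, and the official documentation is not well written either. If you are interested, have a look for yourself. Installation instructions for the library are in the official documentation as well.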

　　The simplest usage, shown briefly in the official documentation, is the following, which saves a web page as an image.

    import asyncio
    from pyppeteer import launch

    async def main():
        browser = await launch()  # start a headless browser
        page = await browser.newPage()  # open a new tab
        await page.goto('http://www.baidu.com')  # load a page
        await page.screenshot({'path': 'baidu.png'})  # save a screenshot of the page
        await browser.close()

    asyncio.get_event_loop().run_until_complete(main())  # run the coroutine to completion

　　Below are some reference snippets I have collected.

    # These calls belong inside a coroutine; `page` is a tab from browser.newPage().
    browser = await launch(headless=False)  # launch with a visible browser window
    await page.click('#login_user')  # click an element
    await page.type('#login_user', 'admin')  # type text into it
    await page.click('#password')
    await page.type('#password', '123456')
    await page.click('#login-submit')
    await page.waitForNavigation()
    # set the browser viewport size
    await page.setViewport({
        'width': 1350,
        'height': 850
    })
    content = await page.content()  # get the page HTML
    cookies = await page.cookies()  # get the page cookies

　　3. Fetching the pages

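　　Run the code below and the console will print the source of each page in turn. Once you have the source, you can parse it and save the data. If the console prints nothing, change

　　await launch(headless=True) to await launch(headless=False)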

    import asyncio
    from pyppeteer import launch

    LAST_PAGE = 38  # the site returned no data beyond page 38 in testing


    class DoctorSpider(object):
        async def main(self, num):
            browser = await launch(headless=True)
            try:
                page = await browser.newPage()
                # Visit the list pages one by one, reusing a single tab.
                while num <= LAST_PAGE:
                    print(f"Crawling page {num}")
                    try:
                        await page.goto(f"https://www.guahao.com/expert/all/全国/all/不限/p{num}")
                        content = await page.content()
                        print(content)
                    except Exception as e:
                        print(e.args)
                    num += 1
            finally:
                await browser.close()  # always release the browser process

        def run(self):
            asyncio.get_event_loop().run_until_complete(self.main(1))


    if __name__ == '__main__':
        doctor = DoctorSpider()
        doctor.run()
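　　Because the site stops returning data around page 38, a crawler like this needs an explicit stopping rule. The following is a minimal sketch of one possible rule, using only asyncio. Here `fetch_page` is a hypothetical stand-in for the pyppeteer calls, and both the "stop at the first empty page" rule and the `max_pages` cap are assumptions for illustration, not something the original code does.

```python
import asyncio


async def crawl_until_empty(fetch_page, first=1, max_pages=100):
    """Call fetch_page(n) for n = first, first+1, ... until a page is empty.

    fetch_page is a hypothetical coroutine returning a list of records
    (an empty list means the server sent no data for that page).
    """
    results = []
    for num in range(first, first + max_pages):
        records = await fetch_page(num)
        if not records:
            break  # first empty page: assume nothing further will load
        results.extend(records)
    return results


async def fake_fetch(num):
    """Stand-in for a real page fetch: pages 1-38 each hold two records."""
    await asyncio.sleep(0)  # yield control the way real I/O would
    return [f"p{num}-a", f"p{num}-b"] if num <= 38 else []


if __name__ == "__main__":
    rows = asyncio.run(crawl_until_empty(fake_fetch))
    print(len(rows))
```

　　The `max_pages` cap keeps the loop bounded even if a site never returns an empty page.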

　　4. Parsing the data

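　　The data is parsed with pyquery. That library was already used in earlier posts, so it can be applied to this case directly. The resulting data is saved to a CSV file with pandas.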

    import asyncio
    from pyppeteer import launch
    from pyquery import PyQuery as pq
    import pandas as pd  # used to write the CSV file

    LAST_PAGE = 38  # the site returned no data beyond page 38 in testing


    class DoctorSpider(object):
        def __init__(self):
            self._data = list()

        async def main(self, num):
            browser = await launch(headless=True)
            try:
                page = await browser.newPage()
                # Visit the list pages one by one, reusing a single tab.
                while num <= LAST_PAGE:
                    print(f"Crawling page {num}")
                    try:
                        await page.goto(f"https://www.guahao.com/expert/all/全国/all/不限/p{num}")
                        content = await page.content()
                        self.parse_html(content)
                        print("Saving data....")
                        data = pd.DataFrame(self._data)
                        data.to_csv("微医数据.csv", encoding='utf_8_sig')
                    except Exception as e:
                        print(e.args)
                    num += 1
            finally:
                await browser.close()  # always release the browser process

        def parse_html(self, content):
            doc = pq(content)
            items = doc(".g-doctor-item").items()
            for item in items:
                # doctor_name = item.find(".seo-anchor-text").text()
                name_level = item.find(".g-doc-baseinfo>dl>dt").text()  # name and title
                department = item.find(".g-doc-baseinfo>dl>dd>p:eq(0)").text()  # department
                address = item.find(".g-doc-baseinfo>dl>dd>p:eq(1)").text()  # hospital address
                star = item.find(".star-count em").text()  # rating
                inquisition = item.find(".star-count i").text()  # number of consultations
                expert_team = item.find(".expert-team").text()  # expert team
                service_price_img = item.find(".service-name:eq(0)>.fee").text()
                service_price_video = item.find(".service-name:eq(1)>.fee").text()

                # Split on the first space; level is empty if no title is present.
                name, _, level = name_level.partition(" ")
                one_data = {
                    "name": name,
                    "level": level,
                    "department": department,
                    "address": address,
                    "star": star,
                    "inquisition": inquisition,
                    "expert_team": expert_team,
                    "service_price_img": service_price_img,
                    "service_price_video": service_price_video
                }
                self._data.append(one_data)

        def run(self):
            asyncio.get_event_loop().run_until_complete(self.main(1))


    if __name__ == '__main__':
        doctor = DoctorSpider()
        doctor.run()
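　　Two details in this step are easy to get wrong: splitting the "name level" text when the level is missing, and writing the CSV with the `utf_8_sig` encoding, which adds a byte-order mark so that spreadsheet programs such as Excel can detect UTF-8. The sketch below illustrates both using the standard library `csv` module in place of pandas (a swapped-in technique, chosen only to keep the example dependency-free). The names `split_name_level` and `save_rows`, and the sample record, are hypothetical.

```python
import csv


def split_name_level(text):
    """Split 'name level' on the first space; level is '' when absent."""
    name, _, level = text.strip().partition(" ")
    return name, level.strip()


def save_rows(rows, path):
    """Write a list of dicts to CSV with a UTF-8 byte-order mark."""
    if not rows:
        return
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    name, level = split_name_level("张三 主任医师")
    save_rows([{"name": name, "level": level, "department": "内科"}], "sample.csv")
```

　　Using `partition` avoids the IndexError that `split(" ")[1]` raises when a record has a name but no title.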

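　　In summary, I did not find this library very pleasant to use, though that may be because I have not studied it closely. My overall impression is lukewarm. It is worth experimenting with further to see whether overall efficiency can be improved.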

