Scraping web data with Python (fetching and parsing pages with Python's requests and BeautifulSoup packages)

优采云 · Published: 2021-10-05 10:12

  Scraping web pages with Python tools

  Date: 2015-10-25

  I have recently been working on a text-analysis project that requires crawling related web pages. I used Python's requests and BeautifulSoup packages to fetch and parse the pages. The crawl surfaced many problems I had not anticipated before starting. For example, different pages may not parse the same way, which can make parsing fail; and requesting the server's resources too frequently can trigger a "remote host closed the connection" error. The code below accounts for both problems.
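  The connection-reset problem mentioned above can also be handled by retrying the request with a growing back-off delay instead of a single fixed sleep. A minimal sketch of that idea (the function name, retry counts, and URL are placeholders, not part of the original project):

```python
import time

import requests


def fetch_with_retry(url, retries=3, delay=2.0):
    """Fetch a URL, retrying with a growing delay when the server drops the connection."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except (requests.ConnectionError, requests.Timeout):
            # back off before retrying so we stop hammering the server
            time.sleep(delay * (attempt + 1))
    return None  # give up after the last attempt
```

  Returning None instead of raising lets the caller record a "NULL" row and keep crawling, which matches how the snippet below handles parse failures.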

import requests
import bs4
import time

# output file name
output = open("C:\\result.csv", 'w', encoding="utf-8")

# start request
request_link = "http://where-you-want-to-crawl-from"
response = requests.get(request_link)

# parse the html
soup = bs4.BeautifulSoup(response.text, "html.parser")

# try to get the href of the 31st link on the page
try:
    link = str((soup.find_all('a')[30]).get('href'))
except Exception as e_msg:
    link = 'NULL'

# find the related app
if link.startswith("/somewords"):
    # sleep so the server is not hit too frequently
    time.sleep(2)
    # request the sub link
    response = requests.get("some_websites" + link)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    # get the info you want: the div whose class is o-content
    info_you_want = str(soup.find("div", {"class": "o-content"}))
    try:
        # the second split separator here and below was lost when this article
        # was scraped; '<' (the start of the next tag) is assumed
        sub_link = str(soup.find("div", {"class": "crumb clearfix"})).split('</a>')[2].split('<')[0].strip()
    except Exception as e_msg:
        sub_link = "NULL_because_exception"
    try:
        info_you_want = info_you_want.split('"o-content">')[1].split('<')[0].strip()
    except Exception as e_msg:
        info_you_want = "NULL_because_exception"
    info_you_want = info_you_want.replace('\n', '')
    info_you_want = info_you_want.replace('\r', '')
    # write the result into the file
    output.writelines(info_you_want + "\n" + "\n")
# the aimed link was not found
else:
    # e and app_name come from an enclosing loop in the original project,
    # which is not shown in this snippet
    output.writelines(str(e) + "," + app_name[e] + "\n")

output.close()
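  Splitting the raw HTML on strings such as '"o-content">', as above, breaks as soon as the markup changes slightly. BeautifulSoup can extract the same text directly with get_text(), which survives attribute or whitespace changes. A sketch using the same class name as above (the HTML here is a stand-in, not the real page):

```python
import bs4

html = '<div class="o-content">  Some article text\nspread over lines  </div>'
soup = bs4.BeautifulSoup(html, "html.parser")

content_div = soup.find("div", {"class": "o-content"})
if content_div is not None:
    # get_text() drops the tags; split/join collapses newlines and extra spaces
    info_you_want = " ".join(content_div.get_text().split())
else:
    info_you_want = "NULL_because_not_found"

print(info_you_want)
```

  Checking for None replaces the broad try/except used above, so a missing div is handled explicitly rather than swallowed with the rest of the exceptions.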
