Scraping Web Page Data with Python (fetching and parsing web pages with Python's requests and BeautifulSoup packages)
优采云 Published: 2021-10-05 10:12

Scraping Web Pages with Python Tools
Date: 2015-10-25
I have recently been working on a text-analysis project that requires crawling related web pages. I used Python's requests and BeautifulSoup packages to fetch and parse them. The crawl surfaced a number of problems I had not anticipated before starting. For example, different pages may not share the same structure, so a parsing step that works on one page can fail on another; and requesting the server too frequently can trigger a "remote host closed the connection" error. The code below accounts for both issues.
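Before the full script, here is a minimal sketch of the retry idea for the connection-reset problem. The function name `fetch_with_retry` and its parameters are my own illustration, not part of the original code; it wraps any zero-argument fetch callable and backs off between attempts:

```python
import time

def fetch_with_retry(fetch, retries=3, delay=2.0):
    """Call fetch(); on exception, wait and retry up to `retries` times.

    `fetch` is any zero-argument callable, e.g. lambda: requests.get(url).
    The delay doubles after each failure (simple exponential backoff).
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            # re-raise on the last attempt so the caller sees the error
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2
```

With `fetch=lambda: requests.get(request_link)`, a transient connection reset is retried a few times instead of aborting the whole crawl.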
import requests
import bs4
import time

# output file name
output = open("C:\\result.csv", 'w', encoding="utf-8")

# start request
request_link = "http://where-you-want-to-crawl-from"
response = requests.get(request_link)

# parse the html
soup = bs4.BeautifulSoup(response.text, "html.parser")

# try to get the href of the 31st <a> tag on the page
try:
    link = str((soup.find_all('a')[30]).get('href'))
except Exception as e_msg:
    link = 'NULL'

# find the related app
if link.startswith("/somewords"):
    # sleep so the server is not hit too frequently
    time.sleep(2)
    # request the sub link
    response = requests.get("some_websites" + link)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    # get the info you want: a div whose class is o-content
    info_you_want = str(soup.find("div", {"class": "o-content"}))
    try:
        # the second separator was lost when this article was extracted;
        # splitting on '<' cuts the text off at the next tag
        sub_link = str(soup.find("div", {"class": "crumb clearfix"})).split('</a>')[2].split('<')[0].strip()
    except Exception as e_msg:
        sub_link = "NULL_because_exception"
    try:
        info_you_want = info_you_want.split('"o-content">')[1].split('<')[0].strip()
    except Exception as e_msg:
        info_you_want = "NULL_because_exception"
    info_you_want = info_you_want.replace('\n', '')
    info_you_want = info_you_want.replace('\r', '')
    # write results into file
    output.writelines(info_you_want + "\n" + "\n")
# the aimed link was not found
else:
    # the original wrote str(e) + "," + app_name[e] here, referring to
    # loop variables defined outside this excerpt
    output.writelines("link_not_found" + "\n")
output.close()
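The string splitting above is fragile: it breaks whenever the markup changes slightly, which is exactly the parse-inconsistency problem described earlier. A steadier alternative, sketched here on made-up HTML (the class name `o-content` is taken from the script above), is to let BeautifulSoup extract the text directly:

```python
import bs4

# stand-in HTML mimicking the page structure the script expects
html = '<div class="o-content">  Example app description.  </div>'
soup = bs4.BeautifulSoup(html, "html.parser")

# find() returns None when the element is missing, so guard before
# calling get_text() instead of splitting the raw HTML string
content = soup.find("div", {"class": "o-content"})
info = content.get_text(strip=True) if content else "NULL"
```

This replaces both `split('"o-content">')` and the newline cleanup: `get_text(strip=True)` already trims surrounding whitespace.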