What Are the Popular Python Libraries and Frameworks? (Part 2)
优采云 Published: 2021-10-14 15:07
# loop through all links
for idx, link in enumerate(links_list):
    # fetch the title of the post
    post_title = link.get_text()
    # fetch the link of the post
    post_href = link.get('href')
    # fetch the point text using the index of the link
    # convert the points to an integer
    post_points = int(points_list[idx].get_text().replace(' points', ''))
    # append to popular posts as a dictionary object if points is at least 100
    if post_points >= 100:
        popular_posts.append(
            {'title': post_title, 'link': post_href, 'points': post_points})
The script above only fetches popular posts from the first page of Hacker News. Depending on your goal, however, you may want to fetch the list from the first five pages, or from any number of pages given as input. You can modify the script accordingly:
import requests
from bs4 import BeautifulSoup
import pprint
import time

BASE_URL = 'https://news.ycombinator.com'


def get_lists_and_points(soup):
    # extract all the links using the class selector
    links_list = soup.select('.storylink')
    # extract all the points using the class selector
    points_list = soup.select('.score')
    return (links_list, points_list)


def parse_response(response):
    # extract the text content of the web page
    response_text = response.text
    # parse HTML
    soup = BeautifulSoup(response_text, 'html.parser')
    return soup


def get_paginated_data(pages):
    total_links_list = []
    total_points_list = []
    for page in range(pages):
        URL = BASE_URL + f'?p={page+1}'
        response = requests.get(URL)
        soup = parse_response(response)
        links_list, points_list = get_lists_and_points(soup)
        for link in links_list:
            total_links_list.append(link)
        for point in points_list:
            total_points_list.append(point)
        # add a 30-second delay as per the Hacker News robots.txt rules
        time.sleep(30)
    return (total_links_list, total_points_list)


def generate_popular_posts(links_list, points_list):
    # create an empty popular posts list
    popular_posts = []
    # loop through all links
    for idx, link in enumerate(links_list):
        # fetch the title of the post
        post_title = link.get_text()
        # fetch the link of the post
        post_href = link.get('href')
        # fetch the point text using the index of the link
        # convert the points to an integer
        # if points data is not available, assign it a default of 0
        try:
            post_points = int(
                points_list[idx].get_text().replace(' points', ''))
        except (IndexError, ValueError):
            post_points = 0
        # append to popular posts as a dictionary object if points is at least 100
        if post_points >= 100:
            popular_posts.append(
                {'title': post_title, 'link': post_href, 'points': post_points})
    return popular_posts


def sort_posts_by_points(posts):
    return sorted(posts, key=lambda x: x['points'], reverse=True)


def main():
    total_links_list, total_points_list = get_paginated_data(5)
    popular_posts = generate_popular_posts(total_links_list, total_points_list)
    sorted_posts = sort_posts_by_points(popular_posts)
    # print posts sorted from highest to lowest points
    pprint.pprint(sorted_posts)


if __name__ == '__main__':
    main()
With this script, we no longer even need to visit Hacker News and search for popular posts; we can run it from the console and get the latest news. Feel free to adapt the script to your needs, experiment with it, or try fetching data from your favorite website.
There is a lot more we can do with the data above.
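For instance, a minimal sketch of one such use (not part of the original script): saving the scraped posts to a CSV file for later analysis. The `save_posts_to_csv` helper and the sample data are assumptions made for this example; the dictionaries mirror the shape produced by `generate_popular_posts`.

```python
import csv

def save_posts_to_csv(posts, path):
    # write the list of post dictionaries to a CSV file,
    # one row per post with title, link and points columns
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'link', 'points'])
        writer.writeheader()
        writer.writerows(posts)

# example usage with dummy data shaped like the scraper's output
sample = [{'title': 'A post', 'link': 'https://example.com', 'points': 120}]
save_posts_to_csv(sample, 'popular_posts.csv')
```

The same list of dictionaries could just as easily be dumped to JSON with `json.dump`, or loaded into a pandas DataFrame.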
Popular scraping libraries
Beautiful Soup has its limitations when it comes to fetching data from websites. It is very simple to use, but for scraping data from complex, client-side-rendered websites (Angular or React-based sites), the HTML tags you want will not be present when the page first loads. To fetch data from such websites, you can use more advanced libraries. Here are some popular Python libraries and frameworks.
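Browser-automation tools such as Selenium or Playwright are the usual answer here, since they execute the page's JavaScript before you scrape it. As a lighter-weight alternative (a sketch added here, not from the original article), many client-rendered pages ship their data as JSON embedded in a script tag, which the standard library alone can extract. The HTML below is a made-up example of such a page:

```python
import json
import re

# a made-up page whose content is rendered client-side from embedded JSON;
# the empty #app div is all a plain HTML parser would see
html = """
<html><body>
<div id="app"></div>
<script>window.__DATA__ = {"posts": [{"title": "A post", "points": 120}]};</script>
</body></html>
"""

# pull the JSON assigned to window.__DATA__ out of the script tag
match = re.search(r'window\.__DATA__\s*=\s*(\{.*?\});', html, re.DOTALL)
data = json.loads(match.group(1))
print(data['posts'][0]['title'])
```

This only works when the page embeds its data up front; pages that fetch data over the network after loading still need a browser-automation library.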
Web scraping is a vast field, and with Beautiful Soup we have probably only scratched the surface. There are many possibilities in this area, and I will explore more of them as I dig further into data analysis with Python. I hope I have covered the basic concepts needed for further exploration.
Tomorrow I will discuss concepts of web development with Python.