Java Crawler for Scraping Web Page Data (Popular Python Libraries and Frameworks, Part 2)

优采云 Published: 2021-10-14 15:07

  The snippet below (an excerpt from the full script shown later) loops over the story links and score elements extracted from Hacker News and keeps every post with at least 100 points:

# loop through all links
for idx, link in enumerate(links_list):
    # fetch the title of the post
    post_title = link.get_text()
    # fetch the link of the post
    post_href = link.get('href')
    # fetch the point text using the index of the link,
    # then convert the points to an integer
    post_points = int(points_list[idx].get_text().replace(' points', ''))
    # append to popular posts as a dictionary if the post has at least 100 points
    if post_points >= 100:
        popular_posts.append(
            {'title': post_title, 'link': post_href, 'points': post_points})

  The script above only fetches popular posts from the first page of Hacker News. Depending on the goal, however, we may need to fetch listings from the first five pages, or from any number of pages given as input. You can modify the script accordingly.

import requests
from bs4 import BeautifulSoup
import pprint
import time

BASE_URL = 'https://news.ycombinator.com'

def get_lists_and_points(soup):
    # extract all the links using the class selector
    links_list = soup.select('.storylink')
    # extract all the points using the class selector
    points_list = soup.select('.score')
    return (links_list, points_list)

def parse_response(response):
    # extract the text content of the web page
    response_text = response.text
    # parse HTML
    soup = BeautifulSoup(response_text, 'html.parser')
    return soup

def get_paginated_data(pages):
    total_links_list = []
    total_points_list = []
    for page in range(pages):
        URL = BASE_URL + f'?p={page + 1}'
        response = requests.get(URL)
        soup = parse_response(response)
        links_list, points_list = get_lists_and_points(soup)
        for link in links_list:
            total_links_list.append(link)
        for point in points_list:
            total_points_list.append(point)
        # add a 30-second delay between requests, as per the Hacker News robots.txt rules
        time.sleep(30)
    return (total_links_list, total_points_list)

def generate_popular_posts(links_list, points_list):
    # create an empty popular posts list
    popular_posts = []
    # loop through all links
    for idx, link in enumerate(links_list):
        # fetch the title of the post
        post_title = link.get_text()
        # fetch the link of the post
        post_href = link.get('href')
        # fetch the point text using the index of the link and convert it to
        # an integer; if points data is not available, default to 0
        try:
            post_points = int(
                points_list[idx].get_text().replace(' points', ''))
        except (IndexError, ValueError):
            post_points = 0
        # append to popular posts as a dictionary if the post has at least 100 points
        if post_points >= 100:
            popular_posts.append(
                {'title': post_title, 'link': post_href, 'points': post_points})
    return popular_posts

def sort_posts_by_points(posts):
    return sorted(posts, key=lambda x: x['points'], reverse=True)

def main():
    total_links_list, total_points_list = get_paginated_data(5)
    popular_posts = generate_popular_posts(total_links_list, total_points_list)
    sorted_posts = sort_posts_by_points(popular_posts)
    # print posts sorted from highest to lowest points
    pprint.pprint(sorted_posts)

if __name__ == '__main__':
    main()

  Now, with this script, we don't even need to visit Hacker News and search for popular news ourselves; we can run the script from the console and get the latest headlines. Feel free to adjust the script to your needs, experiment with it, or try fetching data from your favorite website.

  There is a lot more we could do with the data above.
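  For instance, here is a minimal sketch (our own illustration, not from the original article; save_posts is a hypothetical helper) of persisting the posts returned by generate_popular_posts() to a JSON file so that other tools can consume them:

import json

# hypothetical helper: persist the list of post dictionaries
# produced by generate_popular_posts() to a JSON file
def save_posts(posts, filename='popular_posts.json'):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(posts, f, ensure_ascii=False, indent=2)

# usage: save_posts(sorted_posts)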

  Popular scraping libraries

  Beautiful Soup has its limitations when fetching data from websites. It is very simple to use, but it falls short when scraping complex, client-side rendered websites (Angular- or React-based sites), because the HTML tags you want are not yet present when the initial page loads. To fetch data from such websites, you can use more advanced Python libraries and frameworks.
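  One popular option for such sites is Selenium, which drives a real browser so that JavaScript-rendered content becomes available. Here is a minimal sketch, assuming Selenium 4 and ChromeDriver are installed; the URL and the h1 selector are placeholders:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# run Chrome headlessly so no browser window opens
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

try:
    driver.get('https://example.com')  # placeholder URL for a JS-rendered page
    # wait until JavaScript has rendered the element we are interested in
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'h1')))
    # now the rendered DOM can be queried much like static HTML
    for heading in driver.find_elements(By.CSS_SELECTOR, 'h1'):
        print(heading.text)
finally:
    driver.quit()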

  Web scraping is a vast field, and with Beautiful Soup we have probably only scratched the surface. There are many possibilities in this space, and I will explore more of them as I dig further into data analysis with Python. I hope I have covered the basic concepts you need to explore further on your own.

  Tomorrow I will discuss the concepts of web development with Python.
