What Are the Popular Python Libraries and Frameworks? (Part 2)
优采云 Published: 2021-10-14 15:07
# loop through all links
for idx, link in enumerate(links_list):
    # fetch the title of the post
    post_title = link.get_text()
    # fetch the link of the post
    post_href = link.get('href')
    # fetch the point text using the index of the link
    # convert the points to an integer
    post_points = int(points_list[idx].get_text().replace(' points', ''))
    # append to popular posts as a dictionary object if points is at least 100
    if post_points >= 100:
        popular_posts.append(
            {'title': post_title, 'link': post_href, 'points': post_points})
The script above only fetches popular posts from the first page of Hacker News. Depending on your goal, however, you may want to fetch the list from the first five pages, or from any number of pages given as input. You can modify the script accordingly:
import requests
from bs4 import BeautifulSoup
import pprint
import time

BASE_URL = 'https://news.ycombinator.com'


def get_lists_and_points(soup):
    # extract all the links using the class selector
    links_list = soup.select('.storylink')
    # extract all the points using the class selector
    points_list = soup.select('.score')
    return (links_list, points_list)


def parse_response(response):
    # extract the text content of the web page
    response_text = response.text
    # parse HTML
    soup = BeautifulSoup(response_text, 'html.parser')
    return soup


def get_paginated_data(pages):
    total_links_list = []
    total_points_list = []
    for page in range(pages):
        URL = BASE_URL + f'?p={page+1}'
        response = requests.get(URL)
        soup = parse_response(response)
        links_list, points_list = get_lists_and_points(soup)
        for link in links_list:
            total_links_list.append(link)
        for point in points_list:
            total_points_list.append(point)
        # add a 30-second delay as per the Hacker News robots.txt rules
        time.sleep(30)
    return (total_links_list, total_points_list)


def generate_popular_posts(links_list, points_list):
    # create an empty popular posts list
    popular_posts = []
    # loop through all links
    for idx, link in enumerate(links_list):
        # fetch the title of the post
        post_title = link.get_text()
        # fetch the link of the post
        post_href = link.get('href')
        # fetch the point text using the index of the link
        # convert the points to an integer
        # if points data is not available, assign it a default of 0
        try:
            post_points = int(
                points_list[idx].get_text().replace(' points', ''))
        except (IndexError, ValueError):
            post_points = 0
        # append to popular posts as a dictionary object if points is at least 100
        if post_points >= 100:
            popular_posts.append(
                {'title': post_title, 'link': post_href, 'points': post_points})
    return popular_posts


def sort_posts_by_points(posts):
    return sorted(posts, key=lambda x: x['points'], reverse=True)


def main():
    total_links_list, total_points_list = get_paginated_data(5)
    popular_posts = generate_popular_posts(total_links_list, total_points_list)
    sorted_posts = sort_posts_by_points(popular_posts)
    # print posts sorted from highest to lowest points
    pprint.pprint(sorted_posts)


if __name__ == '__main__':
    main()
With this script, we no longer even need to visit Hacker News and search for popular posts; we can run it from the console and get the latest news. Feel free to adapt the script to your needs, experiment with it, or try fetching data from your favorite website.
There is a lot more we can do with the data above.
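For instance, a minimal sketch of one such use (not part of the original script): saving the scraped posts to a CSV file for later analysis. The `save_posts_to_csv` helper and the sample data are assumptions made for this example; the dictionaries mirror the shape produced by `generate_popular_posts`.

```python
import csv

def save_posts_to_csv(posts, path):
    # write the list of post dictionaries to a CSV file,
    # one row per post with title, link and points columns
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'link', 'points'])
        writer.writeheader()
        writer.writerows(posts)

# example usage with dummy data shaped like the scraper's output
sample = [{'title': 'A post', 'link': 'https://example.com', 'points': 120}]
save_posts_to_csv(sample, 'popular_posts.csv')
```

The same list of dictionaries could just as easily be dumped to JSON with `json.dump`, or loaded into a pandas DataFrame.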
Popular scraping libraries
Beautiful Soup has its limitations when it comes to fetching data from websites. It is very simple to use, but for scraping data from complex, client-side-rendered websites (Angular or React-based sites), the HTML tags you want will not be present when the page first loads. To fetch data from such websites, you can use more advanced libraries. Here are some popular Python libraries and frameworks.
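Browser-automation tools such as Selenium or Playwright are the usual answer here, since they execute the page's JavaScript before you scrape it. As a lighter-weight alternative (a sketch added here, not from the original article), many client-rendered pages ship their data as JSON embedded in a script tag, which the standard library alone can extract. The HTML below is a made-up example of such a page:

```python
import json
import re

# a made-up page whose content is rendered client-side from embedded JSON;
# the empty #app div is all a plain HTML parser would see
html = """
<html><body>
<div id="app"></div>
<script>window.__DATA__ = {"posts": [{"title": "A post", "points": 120}]};</script>
</body></html>
"""

# pull the JSON assigned to window.__DATA__ out of the script tag
match = re.search(r'window\.__DATA__\s*=\s*(\{.*?\});', html, re.DOTALL)
data = json.loads(match.group(1))
print(data['posts'][0]['title'])
```

This only works when the page embeds its data up front; pages that fetch data over the network after loading still need a browser-automation library.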
Web scraping is a vast field, and with Beautiful Soup we have probably only scratched the surface. There are many possibilities in this area, and I will explore more of them as I dig further into data analysis with Python. I hope I have covered the basic concepts needed for further exploration.
Tomorrow I will discuss concepts of web development with Python.