Scraping web page data with a crawler (extract the title from the a element and save it to first_spider.txt)

Published by 优采云: 2022-02-11 14:27

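I have also put the code I downloaded on a network drive: link

Extraction code: cxta

I intend to keep this series up to date (at least while I am still learning).

The main text starts here: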

Workflow

Fetch the page → parse the page → save the data

A simple crawler example: fetching the page

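# coding: utf-8
import requests

# URL to crawl
link="http://www.santostang.com/"
# Request headers that make the request look like it comes from a browser
headers={'User-Agent':'Mozilla/5.0 (Windows;U;Windows NT6.1;en-US;rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
# Send the GET request; r is the response object
r=requests.get(link,headers=headers)
print(r.text)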

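First, link holds the URL to crawl. headers defines the request headers so that the request poses as a browser: the key "User-Agent" names the browser user agent, and the value is the agent string itself. r is the response object, that is, the server's reply to the request. The get function is called here with two arguments: the URL and the request headers. r.text is the body of the response as text, which here is HTML.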
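To make the two arguments more concrete, here is a minimal sketch that builds the same request without sending it, so the method, URL and User-Agent can be inspected offline. The helper name fetch_html and the 10-second timeout are assumptions added for illustration; they are not part of the original example.

```python
import requests

link = "http://www.santostang.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows;U;Windows NT6.1;en-US;rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"}

# Build the request without sending it, to see what would go on the wire.
prepared = requests.Request("GET", link, headers=headers).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])


def fetch_html(url, headers, timeout=10):
    """Hypothetical helper: fetch a page and return its HTML as text."""
    r = requests.get(url, headers=headers, timeout=timeout)
    r.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return r.text
```

Calling fetch_html(link, headers) would perform the actual download. Setting a timeout and checking the status before reading r.text is one way to avoid hanging on a slow server or parsing an error page by mistake.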

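You can view the page source directly, or right-click the page and choose "Inspect", to find the information you need, including but not limited to the request headers and the HTML source.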

# coding: utf-8
import requests
from bs4 import BeautifulSoup

# Fetch
link="http://www.santostang.com/"
headers={'User-Agent':'Mozilla/5.0 (Windows;U;Windows NT6.1;en-US;rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r=requests.get(link,headers=headers)

# Parse the HTML with BeautifulSoup, using the built-in html.parser
soup = BeautifulSoup(r.text, "html.parser")
# Find the h1 element whose class is post-title, take the text of the a element inside it, and strip surrounding whitespace
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)

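Once the text has been fetched, BeautifulSoup is used to parse it. BeautifulSoup is a library dedicated to parsing HTML. In BeautifulSoup(r.text, "html.parser"), r.text is the string to parse and "html.parser" is the parser to use. "html.parser" ships with Python; there is also "lxml", a C-based parser, but it has to be installed separately. find is called here with two arguments. The first is the tag name, here h1; p is another common one. In general h1 is a heading and p is a paragraph. The second is the class name, here "post-title". In the HTML, the title is written inside an a element, which is a hyperlink element.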
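The behaviour of find can be tried without any network access. The sketch below parses a small HTML string that is made up for illustration (the real page's markup may differ) and also shows that find returns None when nothing matches, which is why chaining .a.text directly can fail.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<html><body>
  <h1 class="post-title"><a href="http://example.com/post-1">  First post  </a></h1>
  <p>Some paragraph text.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# First argument: tag name. class_ has a trailing underscore
# because "class" is a reserved word in Python.
heading = soup.find("h1", class_="post-title")
if heading is not None and heading.a is not None:
    title = heading.a.text.strip()  # text inside the a element, whitespace removed
    url = heading.a["href"]         # the hyperlink target
else:
    title, url = None, None

# find returns None when no element matches.
missing = soup.find("h1", class_="no-such-class")
print(title, url, missing)
```

Checking for None before reading .a.text is a small safeguard in case the page layout changes and the h1 is no longer present.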

# coding: utf-8
import requests
from bs4 import BeautifulSoup

# Fetch
link="http://www.santostang.com/"
headers={'User-Agent':'Mozilla/5.0 (Windows;U;Windows NT6.1;en-US;rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r=requests.get(link,headers=headers)

# Extract
# Parse the HTML with BeautifulSoup, using the built-in html.parser
soup = BeautifulSoup(r.text, "html.parser")
# Find the h1 element whose class is post-title, take the text of the a element inside it, and strip surrounding whitespace
title = soup.find("h1", class_="post-title").a.text.strip()

# Save: "a+" opens the file for appending and creates it if it does not exist
with open("first_spider.txt", "a+") as f:
    f.write(title)

This saves the title to first_spider.txt.
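Because the file is opened in append mode, running the script more than once adds each new title directly after the previous one with no separator. A minimal sketch of one way to handle this follows, writing one title per line with an explicit encoding. The helper name append_title, the temporary directory and the sample titles are assumptions made for illustration.

```python
import os
import tempfile


def append_title(path, title):
    """Hypothetical helper: append one title per line, encoded as UTF-8."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")


# Write into a temporary directory so the demo leaves the working directory untouched.
demo_path = os.path.join(tempfile.mkdtemp(), "first_spider.txt")
append_title(demo_path, "First post")
append_title(demo_path, "Second post")

with open(demo_path, encoding="utf-8") as f:
    saved = f.read().splitlines()
print(saved)
```

Passing encoding="utf-8" explicitly avoids depending on the platform's default encoding, which can matter when the titles contain non-ASCII characters.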
