Baidu web keyword scraping (parse the Baidu search URL, fetch the pages, then segment and save: a detailed walkthrough)

Published by 优采云: 2021-09-09 20:05

This article summarizes what I learned from blogs and other material found online. It describes an entry-level crawler.

Tools and environment

Python 3 or later

urllib

BeautifulSoup

jieba (word segmentation)

url2io (extracts the main text of a web page)

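Overall flow

Build the Baidu search URL, fetch the page with urllib.request, and parse it with BeautifulSoup. Analyze the search page to find where the results sit in its structure, extract them, and resolve each result's real URL. Finally, extract the main text of each page, segment it into words, and save the output.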

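Detailed steps

1. Build the Baidu search URL and fetch the page

When we use Baidu, we type a keyword, click search, and see a long string of characters in the page URL. A crawler does not need all of that. The URL we actually use looks like this: 'http://www.baidu.com.cn/s?wd=' + keyword + '&pn=' + offset. wd is the keyword you search for and pn controls paging. Baidu shows ten results per page (the topmost entries may be ads rather than search results), so pn=0 is the first page, pn=10 the second, and so on. For example, http://www.baidu.com.cn/s?wd=周杰伦&pn=20 returns the third page of results for 周杰伦.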

import urllib.parse
import urllib.request

word = '周杰伦'
# wd is the keyword; pn is the result offset Baidu uses for paging
url = 'http://www.baidu.com.cn/s?wd=' + urllib.parse.quote(word) + '&pn=0'
response = urllib.request.urlopen(url)
page = response.read()

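The lines above are a minimal crawler that fetches one page of Baidu search results. The keyword is passed through the wd parameter. If it contains Chinese characters, it must go through urllib.parse.quote first, because a URL may only contain ASCII characters and Chinese text cannot appear in it directly.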
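The paging rule and the quoting step can be combined into one small helper. This is only a sketch: the function name `build_search_url` and the constants are made up for illustration, and the host is simply the one used in the snippet above.

```python
from urllib.parse import quote

BASE_URL = "http://www.baidu.com.cn/s"  # host taken from the article's snippet
RESULTS_PER_PAGE = 10  # the article states ten results per page


def build_search_url(word: str, page: int) -> str:
    """Return the search URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    offset = (page - 1) * RESULTS_PER_PAGE
    # quote() percent-encodes the UTF-8 bytes of any non-ASCII character
    return f"{BASE_URL}?wd={quote(word)}&pn={offset}"


print(build_search_url("周杰伦", 3))
```

Keeping the offset arithmetic in one place avoids repeating `(page - 1) * 10` wherever a page is requested.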

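2. Analyze the page's HTML structure, locate the result links, and get the real URLs

Open the developer tools in Google Chrome (F12 or Fn+F12), click the arrow in the top-left corner, then click one of the search results. You can see that every result sits in a div with class="result c-container". Each div contains an h3 tag with class="t", the h3 contains an a tag, and the result link is in its href attribute.

Once we know where the URL is, the rest is easy. We parse the page with BeautifulSoup and the lxml parser (pip install beautifulsoup4, pip install lxml; if pip fails, look up an installation guide online).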

import requests
from bs4 import BeautifulSoup

# Request headers that make the client look like a browser
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, compress',
    'Accept-Language': 'en-us;q=0.5,en;q=0.3',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
}

soup = BeautifulSoup(page, 'lxml')
with open('D:\\111\\test.txt', 'a') as out:
    for h3 in soup.find_all('h3'):
        a = h3.find('a')
        if a is None or not a.get('href'):
            continue  # skip an h3 that carries no link
        href = a.get('href')
        # Do not follow the redirect; its Location header holds the real address
        baidu_url = requests.get(url=href, headers=headers, allow_redirects=False)
        real_url = baidu_url.headers.get('Location', '')
        if real_url.startswith('http'):
            out.write(real_url + '\n')

Because the page contains no h3 tags other than those of the search results, we simply collect every h3 tag with BeautifulSoup and loop over them to get each result's URL.
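The structure described above (result div, then h3, then a) can be checked offline against a small hand-written snippet. The sketch below deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The sample HTML, the class name `ResultLinkParser`, and the placeholder link targets are all made up for illustration.

```python
from html.parser import HTMLParser

# Hand-written sample that imitates the structure described in the article.
SAMPLE_HTML = """
<div class="result c-container">
  <h3 class="t"><a href="http://www.baidu.com/link?url=AAA">First result</a></h3>
</div>
<div class="result c-container">
  <h3 class="t"><a href="http://www.baidu.com/link?url=BBB">Second result</a></h3>
</div>
<div class="other"><a href="http://example.com/ignored">not a result</a></div>
"""


class ResultLinkParser(HTMLParser):
    """Collect the href of every <a> that sits inside an <h3> element."""

    def __init__(self):
        super().__init__()
        self.h3_depth = 0  # greater than 0 while we are inside an <h3>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.h3_depth += 1
        elif tag == "a" and self.h3_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h3" and self.h3_depth > 0:
            self.h3_depth -= 1


parser = ResultLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

The link outside any h3 is ignored, which mirrors the article's reliance on h3 tags appearing only in search results.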

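requests, used above, is another HTTP package for crawling; install it with pip if you do not have it. Its get method returns the response headers for the page, and the Location header holds the real URL of the result. We filter out useless URLs (in the code, anything that does not start with http) and save the rest.

Note that the Accept-Encoding entry in the disguised headers sometimes leads to garbled text; it can be removed.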
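Reading the Location header with direct indexing raises a KeyError when a response is not a redirect. A more defensive lookup might look like the sketch below. The function name `real_url_from_headers` is hypothetical, and plain dicts stand in for the header mapping that requests returns.

```python
from typing import Mapping, Optional


def real_url_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the redirect target if it looks like an absolute http(s) URL."""
    location = headers.get("Location")  # None when the header is absent
    if location and location.startswith("http"):
        return location
    return None


# Plain dicts standing in for response.headers
print(real_url_from_headers({"Location": "https://example.com/page"}))
print(real_url_from_headers({"Content-Type": "text/html"}))
```

With requests, the same function could be given `response.headers` from a call made with `allow_redirects=False`, as in the article's code.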

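3. Extract the main text of each page and segment it

# Excerpt from the body of get_address(url), listed in full further down
api = url2io.API('bjb4w0WATrG7Lt6PVx_TrQ')
try:
    ret = api.article(url=url, fields=['text', 'next'])
    text = ret['text']
except Exception:
    return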

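We can use the third-party package url2io to extract a page's main text and URL. Note, however, that the package targets Python 2.7. The urllib2 module it uses has been merged into urllib in Python 3, so you have to adapt the package yourself. basestring no longer exists in Python 3 either; replacing it with str is enough. The package can extract most pages that contain text, and the try statement handles the cases it cannot.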
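The two porting changes mentioned above can be illustrated in isolation. The helpers below are hypothetical and do not reproduce url2io's internals; they only show where `str` replaces `basestring` and where `urllib.request` replaces `urllib2`.

```python
from collections.abc import Iterable
from urllib.request import Request  # Python 2 code would use urllib2.Request


def normalize_fields(fields):
    """Return a comma-separated string from a str or an iterable of str."""
    if isinstance(fields, str):  # Python 2 code would test basestring here
        return fields
    if isinstance(fields, Iterable):
        return ",".join(fields)
    raise TypeError("fields must be a string or an iterable of strings")


def make_request(url):
    """Build (but do not send) a request object the Python 3 way."""
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})


print(normalize_fields(["text", "next"]))
print(make_request("http://example.com/").full_url)
```

Joining the fields with commas is an assumption made for the example, not a statement about how the url2io service expects them.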

We use jieba to segment the extracted text. For how to use jieba, see its documentation.

# -*- coding:utf-8 -*-
import jieba
import jieba.posseg as pseg
import url2io
from pymongo import MongoClient

conn = MongoClient('localhost', 27017)
db = conn.test
count = db.count
count.delete_many({})  # clear counts from earlier runs (remove() is gone in PyMongo 4)

def test():
    filename = 'C:\\xxx\\include.txt'
    jieba.load_userdict(filename)
    # Accurate mode is the default
    seg_list = jieba.cut("我家住在青山区博雅豪庭大华南湖公园世家五栋十三号")
    print(", ".join(seg_list))
    fff = "我家住在青山区博雅豪庭大.华南湖公园世家啊说,法撒撒打算武汉工商学院五栋十三号"
    result = pseg.cut(fff)
    for w in result:
        print(w.word, '/', w.flag, ',')

def get_address(url):
    api = url2io.API('bjb4w0WATrG7Lt6PVx_TrQ')
    try:
        ret = api.article(url=url, fields=['text', 'next'])
        text = ret['text']
        filename = 'C:\\xxx\\include.txt'
        jieba.load_userdict(filename)
        result = pseg.cut(text)
        for w in result:
            if w.flag == 'wh':  # 'wh' is a tag from the custom dictionary
                print(w.word)
                res = count.find_one({"name": w.word})
                if res:
                    count.update_one({"name": w.word}, {"$set": {"sum": res['sum'] + 1}})
                else:
                    count.insert_one({"name": w.word, "sum": 1})
    except Exception:
        # Pages that url2io cannot extract are skipped
        return

I use a custom dictionary together with jieba for segmentation.
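The counting step can be tried without jieba or MongoDB. In the sketch below, hand-written (word, flag) pairs stand in for the output of `pseg.cut()`, and a `Counter` stands in for the MongoDB collection. The flag `wh` is assumed to be a custom tag defined in the author's user dictionary (include.txt), which the article does not show; jieba user dictionary lines have the form `word freq tag`, with the frequency and tag optional.

```python
from collections import Counter

# Hand-written stand-ins for the (word, flag) pairs that pseg.cut() yields.
tagged_words = [
    ("我家", "n"),
    ("青山区", "wh"),
    ("博雅豪庭", "wh"),
    ("青山区", "wh"),
    ("五栋", "m"),
]


def count_flagged(pairs, flag="wh"):
    """Count how often each word carrying the given flag occurs."""
    counts = Counter()
    for word, word_flag in pairs:
        if word_flag == flag:
            counts[word] += 1
    return counts


counts = count_flagged(tagged_words)
print(counts.most_common())
```

Each increment here corresponds to the find-then-update-or-insert sequence that the article performs against the count collection.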

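4. Use multiple processes (a Pool) to speed up crawling

Why not multithreading? Because I find Python's multithreading of little use; a quick search gives the details (the global interpreter lock is the reason usually cited). Below is the complete code. It contains both ways of saving the addresses: to a txt file and to a MongoDB database.

baidu.py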

# -*- coding:utf-8 -*-
'''
Crawl the URLs found in the first 10 pages of Baidu search results and save them
'''
import multiprocessing  # process pool for parallel crawling
# from threading import Thread  # multithreading alternative
import time
import urllib.parse
from urllib import request

import requests
from bs4 import BeautifulSoup  # parses the fetched pages
from pymongo import MongoClient

conn = MongoClient('localhost', 27017)
db = conn.test  # database name
urls = db.cache  # collection name

'''
all = open('D:\\111\\test.txt', 'a')
all.seek(0)  # move to the start of the file
all.truncate()  # empty the file
'''

# Request headers that make the client look like a browser
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, compress',
    'Accept-Language': 'en-us;q=0.5,en;q=0.3',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
}

def getfromBaidu(word):
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    # word is the keyword; pn is the offset Baidu uses for paging
    url = 'http://www.baidu.com.cn/s?wd=' + urllib.parse.quote(word) + '&pn='
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for k in range(1, 5):  # pages 1 to 4; use range(1, 11) for ten pages
        result = pool.apply_async(geturl, (url, k))  # run in a worker process
    pool.close()
    pool.join()
    end = time.perf_counter()
    print(end - start)

def geturl(url, k):
    path = url + str((k - 1) * 10)
    response = request.urlopen(path)
    page = response.read()
    soup = BeautifulSoup(page, 'lxml')
    tagh3 = soup.find_all('h3')
    for h3 in tagh3:
        a = h3.find('a')
        if a is None:
            continue  # skip an h3 that carries no link
        href = a.get('href')
        # print(href)
        baidu_url = requests.get(url=href, headers=headers, allow_redirects=False)
        real_url = baidu_url.headers.get('Location', '')  # the page's original address
        if real_url.startswith('http'):
            urls.insert_one({"url": real_url})
            # all.write(real_url + '\n')

if __name__ == '__main__':
    # Clear earlier results once, in the parent process only
    urls.delete_many({})
    getfromBaidu('周杰伦')

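pool = multiprocessing.Pool(multiprocessing.cpu_count())

The number of processes in the pool is set from the number of CPU cores. For details on multiprocessing and Pool, see the Python multiprocessing documentation.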
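A stripped-down version of the pool pattern, with the network call replaced by plain string building, might look like this. All function names are made up for illustration. Unlike the article's code, the sketch uses `pool.map`, which returns the workers' results and re-raises their exceptions in the parent; results from `apply_async` that are never collected hide such errors.

```python
import multiprocessing


def page_offset(page: int, per_page: int = 10) -> int:
    """Translate a 1-based page number into Baidu's pn offset."""
    return (page - 1) * per_page


def describe_page(args):
    """Worker: build the URL for one page (stands in for fetching it)."""
    base_url, page = args
    return base_url + str(page_offset(page))


def build_page_urls(base_url, pages, processes=None):
    """Fan the pages out over a pool sized by the CPU count by default."""
    if processes is None:
        processes = multiprocessing.cpu_count()
    with multiprocessing.Pool(processes) as pool:
        # map() blocks until every worker has returned its result
        return pool.map(describe_page, [(base_url, p) for p in pages])
```

To try it, call `build_page_urls('http://www.baidu.com.cn/s?wd=test&pn=', range(1, 5))` from inside an `if __name__ == '__main__':` block, which multiprocessing needs on platforms that start workers by re-importing the main module.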

The modified url2io.py

#coding: utf-8
#
# This program is free software. It comes without any warranty, to
# the extent permitted by applicable law. You can redistribute it
# and/or modify it under the terms of the Do What The Fuck You Want
# To Public License, Version 2, as published by Sam Hocevar. See
# http://sam.zoy.org/wtfpl/COPYING (copied as below) for more details.
#
# DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
# Version 2, December 2004
#
# Everyone is permitted to copy and distribute verbatim or modified
# copies of this license document, and changing it is allowed as long
# as the name is changed.
#
# DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
# TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
#
# 0. You just DO WHAT THE FUCK YOU WANT TO.

"""a simple url2io sdk

example:

api = API(token)
api.article(url='http://www.url2io.com/products', fields=['next', 'text'])
"""

__all__ = ['APIError', 'API']

DEBUG_LEVEL = 1

import sys
import socket
import json
import urllib
from urllib import request
import time
from collections.abc import Iterable  # collections.Iterable was removed in Python 3.10

# Request headers that make the client look like a browser
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
}

class APIError(Exception):
    code = None
    """HTTP status code"""

    url = None
    """request URL"""

    body = None
    """server response body; or detailed error information"""

    def __init__(self, code, url, body):
        self.code = code
        self.url = url
        self.body = body

    def __str__(self):
        return 'code={s.code}\nurl={s.url}\n{s.body}'.format(s = self)

    __repr__ = __str__

class API(object):
    token = None
    server = 'http://api.url2io.com/'

    decode_result = True
    timeout = None
    max_retries = None
    retry_delay = None

    def __init__(self, token, srv = None,
                 decode_result = True, timeout = 30, max_retries = 5,
                 retry_delay = 3):
        """:param srv: The API server address
        :param decode_result: whether to json_decode the result
        :param timeout: HTTP request timeout in seconds
        :param max_retries: maximal number of retries after catching URL error
            or socket error
        :param retry_delay: time to sleep before retrying"""
        self.token = token
        if srv:
            self.server = srv
        self.decode_result = decode_result
        # Test for None first: None >= 0 raises TypeError in Python 3
        assert timeout is None or timeout >= 0
        assert max_retries >= 0
        self.timeout = timeout
        self.max_retries = max_retries
        self.retry_delay = retry_delay

        _setup_apiobj(self, self, [])

    def update_request(self, request):
        """overwrite this function to update the request before sending it to
        server"""
        pass

def _setup_apiobj(self, apiobj, path):
    if self is not apiobj:
        self._api = apiobj
        self._urlbase = apiobj.server + '/'.join(path)

    lvl = len(path)
    done = set()
    for i in _APIS:
        if len(i)
