Searching the content of a specified website ([Day 1] word roots + illustrated memory + audio search + results)

Published by 优采云: 2022-01-19 10:07

Source code and results: %E7%99%BE%E5%BA%A6%E7%88%AC%E8%99%AB%E7%B3%BB%E5%88%97

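In [Baidu crawler search I] Single-keyword URL result summary (given a keyword and a page count), we obtained the result URLs for a keyword query and stored them in save_file_name.txt. The next step reads the collected URLs from save_file_name.txt line by line and collects every URL found in each of those pages.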

[Baidu series II] Keyword search URL result summary (given a keyword and a page count)

[Baidu series III] Deep search (collect all URLs from a given URL)
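The workflow described in the introduction, reading the previously collected URLs line by line and collecting the links of each page, can be sketched roughly as follows. This is only an illustration: `crawl_from_file` and `collect_links` are hypothetical names that do not appear in the original code, and the demonstration uses a stand-in collector instead of real network requests.

```python
import os
import tempfile


def crawl_from_file(url_file, collect_links):
    """Read one URL per line from url_file and merge the links found for each.

    collect_links is any callable mapping a URL to an iterable of links
    (hypothetical; it could wrap the article's requests_for_url).
    """
    all_links = set()
    with open(url_file) as f:
        for line in f:
            page_url = line.strip()
            if not page_url:
                continue  # ignore blank lines
            all_links.update(collect_links(page_url))
    return all_links


if __name__ == "__main__":
    # Offline demonstration: a fake collector replaces real HTTP requests.
    def fake_collect(page_url):
        return {page_url + "/a", page_url + "/b"}

    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("http://example.com\n\nhttp://example.org\n")
    try:
        for link in sorted(crawl_from_file(tmp.name, fake_collect)):
            print(link)
    finally:
        os.remove(tmp.name)
```

Returning a set means a link that appears on several pages is kept only once.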

Purpose

Given a URL and an output file, collect every URL found in the page and save it to the file. The file write mode can be specified.
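To make the "file write mode" concrete, here is a small sketch of the difference between overwriting ("w") and appending ("a") when saving URLs. The helper names `save_urls` and `read_lines` are assumptions made for this illustration and are not part of the original code.

```python
import os
import tempfile


def save_urls(urls, save_file_name, file_mode):
    """Write one URL per line; file_mode is "w" (overwrite) or "a" (append)."""
    with open(save_file_name, file_mode) as f:
        for u in urls:
            f.write(u + "\n")


def read_lines(path):
    """Return the file's lines without trailing newlines."""
    with open(path) as f:
        return f.read().splitlines()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "urls.txt")
    save_urls(["http://example.com/1"], path, "w")  # creates the file
    save_urls(["http://example.com/2"], path, "a")  # keeps the first line
    print(read_lines(path))
    save_urls(["http://example.com/3"], path, "w")  # discards earlier content
    print(read_lines(path))
```

Appending suits repeated runs over many pages; overwriting suits a fresh crawl.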

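Approach

Use lxml to parse the text of the response and look at where URLs appear in the page and in which tags. Three cases occur:

1. The href is already a well-formed URL.
2. The href is a relative path such as "/game", which must be completed into the "" format.
3. The href starts with "javascript" and is skipped.

Code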

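#coding:utf-8
# Web page URL crawler: given a URL and an output file, collect every URL
# in the page; the file write mode can be specified.
#author@ 许娜
#os : ubuntu16.04
#python: python2
import requests,time
from lxml import etree
"""
url: the given URL
save_file_name: the file in which the collected URLs are stored
"""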

def Redirect(url):
    # Follow redirects and return the final URL; on failure, return the input URL unchanged.
    try:
        res = requests.get(url,timeout=10)
        url = res.url
    except Exception as e:
        print("4",e)
        time.sleep(1)
    return url

def requests_for_url(url, save_file_name, file_model):
    # Fetch url, write every link found in the page to save_file_name
    # (opened with mode file_model), and return the links as a set.
    headers = {
        'pragma': "no-cache",
        'accept-encoding': "gzip, deflate, br",
        'accept-language': "zh-CN,zh;q=0.8",
        'upgrade-insecure-requests': "1",
        'user-agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
        'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        'cache-control': "no-cache",
        'connection': "keep-alive",
    }
    return_set = set()
    try:
        response = requests.request("GET", url, headers=headers)
        selector = etree.HTML(response.text, parser=etree.HTMLParser(encoding='utf-8'))
    except Exception as e:
        print("Page failed to load", e)
        return return_set  # nothing to parse, so return the empty set
    with open(save_file_name,file_model) as f:
        try:
            context = selector.xpath('//a/@href')
            for i in context:
                try:
                    # Case 3: skip empty hrefs and javascript pseudo-links
                    if not i or i.startswith("javascript"):
                        continue
                    # Case 2: relative path such as "/game"; prepend the page URL
                    if i[0] == "/":
                        print(i)
                        i = url.rstrip("/") + i
                    f.write(i)
                    f.write("\n")
                    return_set.add(i)
                    print(len(context),context[0],i)
                except Exception as e:
                    print("1",e)
        except Exception as e:
            print("2",e)
    return return_set

if __name__ == '__main__':
    # Collect every URL in the given page and store it in the given file
    url = "http://news.baidu.com/"
    save_file_name = "save_url_2.txt"
    return_set = requests_for_url(url,save_file_name,"a") # "a": append
    print(len(return_set))
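The three href cases from the Approach section can also be handled with the standard library's `urljoin`, which resolves relative paths against the page URL instead of concatenating strings by hand. The sketch below is an alternative to the article's own logic, not the author's method; `normalize_href` is a hypothetical helper name, and the import is the Python 3 form (on Python 2, `urljoin` lives in the `urlparse` module).

```python
from urllib.parse import urljoin


def normalize_href(page_url, href):
    """Return an absolute URL for href, or None if the link should be skipped."""
    href = href.strip()
    # Case 3: empty values and javascript pseudo-links carry no URL
    if not href or href.lower().startswith("javascript"):
        return None
    # Cases 1 and 2: urljoin leaves absolute URLs untouched and
    # resolves relative paths against page_url
    return urljoin(page_url, href)


if __name__ == "__main__":
    base = "http://news.baidu.com/"
    print(normalize_href(base, "http://example.com/x"))  # http://example.com/x
    print(normalize_href(base, "/game"))                 # http://news.baidu.com/game
    print(normalize_href(base, "javascript:void(0)"))    # None
```

Unlike plain concatenation, this also resolves multi-segment paths such as "/a/b" and protocol-relative links such as "//www.baidu.com/s".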
