Scraping Baidu pages by keyword (web crawler scale and crawl-speed sensitivity)
优采云 Published: 2021-12-30 07:13
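Web crawlers fall into three classes by scale:
Small scale: small data volume, not sensitive to crawl speed. Tool: the Requests library. Target: fetching and working with individual web pages.
Medium scale: larger data volume, sensitive to crawl speed. Tool: the Scrapy library. Target: a website or a series of websites.
Large scale: search engines, where crawl speed is critical. Tool: custom development. Target: the entire web.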
General code framework for fetching a web page
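import requests

def getHTMLText(url):
    """Fetch a page and return its text, or an error message on failure."""
    try:
        headers = {'user-agent': 'Mozilla/5.0'}  # example value that mimics a browser
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # decode with the encoding detected from the content
        return r.text
    except requests.RequestException:
        return 'An exception occurred'

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))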
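The step `r.encoding = r.apparent_encoding` is there because the encoding taken from the response headers may not match the page content; when a text response declares no charset, requests falls back to ISO-8859-1, which garbles Chinese text. The following is a minimal standard-library sketch of that failure mode, not of how requests detects encodings. The sample string and variable names are assumptions chosen for illustration.

```python
# Illustration only: the sample text and variable names are assumptions.
sample = "百度一下"
raw = sample.encode("utf-8")  # the bytes as they would arrive over the network

# Decoding with the wrong charset does not fail, it silently yields mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the charset that matches the content recovers the text.
decoded = raw.decode("utf-8")

print(repr(garbled))
print(decoded)
```

Because the wrong decoding raises no error, the only symptom is unreadable output, which is why the framework sets the encoding explicitly before reading `r.text`.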
Fetching a JD.com product page
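import requests

def getHTMLText(url):
    """Fetch a page and return the first 1000 characters of its text."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding
        return r.text[:1000]  # a preview is enough to check that the fetch worked
    except requests.RequestException:
        return 'An exception occurred'

if __name__ == "__main__":
    url = "https://item.jd.com/100004404920.html"
    print(getHTMLText(url))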
Submitting a search keyword to Baidu
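import requests

def getHTMLText(url, keyword):
    """Submit the keyword to the search URL and return the length of the result page."""
    try:
        kv = {'wd': keyword}  # Baidu takes the search term in the 'wd' parameter
        r = requests.get(url, params=kv, timeout=30)
        r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding
        return len(r.text)
    except requests.RequestException:
        return 'An exception occurred'

if __name__ == "__main__":
    keyword = 'python'
    url = "https://www.baidu.com/s"
    print(getHTMLText(url, keyword))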
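The `params` argument in the code above is turned into a query string and appended to the URL. The sketch below reproduces that idea with the standard library only, so the resulting URL can be inspected without sending a request. `build_search_url` is a hypothetical helper introduced for illustration, and this shows the general mechanism rather than the exact internals of requests.

```python
from urllib.parse import urlencode

def build_search_url(base_url, keyword):
    """Hypothetical helper: append the keyword as the `wd` query parameter."""
    # urlencode percent-encodes non-ASCII characters as UTF-8 by default.
    return base_url + "?" + urlencode({"wd": keyword})

print(build_search_url("https://www.baidu.com/s", "python"))
print(build_search_url("https://www.baidu.com/s", "网络爬虫"))
```

The first call prints `https://www.baidu.com/s?wd=python`; the second shows that a Chinese keyword is percent-encoded rather than sent as raw characters.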
Automatic IP address location lookup
import requests

url = "http://m.ip138.com/ip.asp?ip="  # IP lookup via www.ip138.com
try:
    r = requests.get(url + '202.204.80.112')
    r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    r.encoding = r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print('Crawl failed')
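The lookup URL above is built by plain string concatenation, so a malformed address would be sent to the server as-is. One possible refinement, sketched here under the assumption that the endpoint accepts a dotted IPv4 address exactly as in the example, is to validate the input with the standard `ipaddress` module first. `BASE_URL` and `build_lookup_url` are hypothetical names introduced for illustration.

```python
import ipaddress

BASE_URL = "http://m.ip138.com/ip.asp?ip="  # endpoint taken from the code above

def build_lookup_url(ip_text):
    """Hypothetical helper: validate the address, then append it to BASE_URL."""
    address = ipaddress.ip_address(ip_text)  # raises ValueError on malformed input
    return BASE_URL + str(address)

print(build_lookup_url("202.204.80.112"))
```

A call such as `build_lookup_url("999.1.1.1")` raises `ValueError` before any request is made, which keeps obviously bad input away from the network.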