Web Scraping with Python 4: Using APIs

Published by 优采云: 2020-08-07 21:31

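Typically, a programmer sends a request to an API over HTTP to fetch some piece of information, and the API returns the server's response as XML or JSON.

APIs are not usually thought of as web scraping, but in practice the two rely on many of the same techniques (both send HTTP requests) and produce similar results (both retrieve information); the two often complement each other.

For example, you can combine Wikipedia edit histories (which include editors' IP addresses) with an IP address geolocation API to find out where the editors of a Wikipedia article are located.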

4.1 API Overview

Google API

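4.2 Common API Conventions

APIs follow a fairly standard set of rules to produce data, and the data they produce is organized in a fairly standard way.

Four methods: GET, POST, PUT, DELETE

Authentication: some APIs require the client to authenticate before they will answer.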
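The four methods and the authentication step can be sketched with the standard library alone. This is a minimal illustration, not any particular API's scheme: the endpoint `http://example.com/api/users`, the token value, the bearer-token header, and the helper `build_request` are all assumptions made for the example. The requests are only built, never sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and token, used only for illustration
BASE_URL = "http://example.com/api/users"
API_TOKEN = "my-secret-token"


def build_request(method, path="", payload=None):
    """Build (but do not send) a request for one of the four common methods."""
    data = None
    # Assumed authentication scheme: a bearer token in the Authorization header
    headers = {"Authorization": "Bearer " + API_TOKEN}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


requests_to_send = [
    build_request("GET", "/1234"),                              # read a record
    build_request("POST", payload={"name": "Ada"}),             # create a record
    build_request("PUT", "/1234", payload={"name": "Ada L."}),  # update a record
    build_request("DELETE", "/1234"),                           # delete a record
]

for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would actually send it; real APIs document which methods, paths, and credentials they accept.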

4.3 Server Responses

Most APIs return their data as XML or JSON.
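To make the difference between the two formats concrete, here is a small sketch that parses the same record from a JSON body and from an XML body. Both sample bodies are made up for illustration, and the helper names `parse_json_user` and `parse_xml_user` are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample responses describing the same record in both formats
json_body = '{"user": {"id": 1234, "name": "Ada", "languages": ["Python", "C"]}}'
xml_body = (
    "<user><id>1234</id><name>Ada</name>"
    "<languages><language>Python</language><language>C</language></languages>"
    "</user>"
)


def parse_json_user(body):
    """JSON maps directly onto Python dicts, lists, strings, and numbers."""
    user = json.loads(body)["user"]
    return {"id": user["id"], "name": user["name"], "languages": user["languages"]}


def parse_xml_user(body):
    """XML values arrive as text, so numeric fields must be converted by hand."""
    root = ET.fromstring(body)
    return {
        "id": int(root.findtext("id")),
        "name": root.findtext("name"),
        "languages": [el.text for el in root.findall("languages/language")],
    }


print(parse_json_user(json_body))
print(parse_xml_user(xml_body))
```

Both functions return the same dictionary, which is one reason JSON tends to be more convenient from Python: it needs no manual type conversion.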

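In the past, server-side programs such as PHP and .NET were typically on the receiving end of API calls. Today, JavaScript frameworks such as Angular or Backbone also send and receive API calls.

API calls: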

4.4 The Echo Nest

The Echo Nest is a music data website.

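4.5 Twitter API

pip install twitter

from twitter import Twitter, OAuth

# Replace the four placeholders with your own credentials, in this order:
# access token, access token secret, consumer key, consumer secret
t = Twitter(auth=OAuth("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET", "CONSUMER_KEY", "CONSUMER_SECRET"))

# Search for tweets that contain the #python hashtag
pythonTweets = t.search.tweets(q="#python")
print(pythonTweets)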
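The search call returns a large nested structure, so it usually helps to pull out only the fields you need. The sketch below assumes the result has the shape of a Twitter v1.1 search response (a `statuses` list whose items carry `text` and `user.screen_name`); the sample data is made up, and `summarize_tweets` is a hypothetical helper.

```python
# A made-up, heavily trimmed response in the shape of a tweet search result
sample_response = {
    "statuses": [
        {"text": "Learning #python today", "user": {"screen_name": "alice"}},
        {"text": "#python 3 tips", "user": {"screen_name": "bob"}},
    ],
    "search_metadata": {"count": 2},
}


def summarize_tweets(response):
    """Return (screen_name, text) pairs for every status in the response."""
    return [
        (status["user"]["screen_name"], status["text"])
        for status in response.get("statuses", [])
    ]


for name, text in summarize_tweets(sample_response):
    print("@" + name + ": " + text)
```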

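Posting tweets

4.6 Google API

Whatever kind of information you want to work with, including language translation, geolocation, calendars, and even genetic data, Google offers an API for it. Google also provides APIs for some of its best-known applications, such as Gmail, YouTube, and Blogger.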

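4.7 Parsing JSON

import json
from urllib.request import urlopen

def getCountry(ipAddress):
    # Query the geolocation API and decode the JSON body it returns.
    # Note: the freegeoip.net endpoint may no longer be in service.
    response = urlopen("http://freegeoip.net/json/" + ipAddress).read().decode("utf-8")
    responseJson = json.loads(response)
    # Return the country code, or None if the field is missing
    return responseJson.get("country_code")

print(getCountry("50.78.253.58"))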
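Since the example above depends on a live service, here is an offline sketch of the same parsing step. The JSON text is made up (it uses a documentation-only IP address and invented field values), and it only illustrates how `json.loads` maps JSON objects, arrays, and missing keys onto Python values.

```python
import json

# Made-up JSON text, used only to show how parsed values map to Python types
json_string = (
    '{"ip": "203.0.113.7", "country_code": "US", '
    '"location": {"latitude": 42.35, "longitude": -71.05}, '
    '"tags": ["residential", "ipv4"]}'
)

data = json.loads(json_string)

country = data.get("country_code")        # JSON string -> str
latitude = data["location"]["latitude"]   # nested JSON object -> dict
first_tag = data["tags"][0]               # JSON array -> list
missing = data.get("region_name")         # absent key -> None, no exception

print(country, latitude, first_tag, missing)
```

Using `get` instead of square brackets is what lets `getCountry` return `None` rather than raise `KeyError` when a field is absent.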

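4.8 Back to the Topic

Combine several data sources into a new form, or use an API as a tool for looking at scraped data from a new angle.

Start with a basic program that crawls Wikipedia, finds each article's edit history page, and then extracts the IP addresses from that history.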

# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import datetime
import random
import re
import json

# Seed with a float timestamp; recent Python versions reject a datetime object
random.seed(datetime.datetime.now().timestamp())

# https://en.wikipedia.org/wiki/Python_(programming_language)
def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only internal article links (no colon in the path)
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

def getHistoryIPs(pageUrl):
    # Edit history pages have URLs of the form:
    # https://en.wikipedia.org/w/index.php?title=Python_(programming_language)&action=history
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = "https://en.wikipedia.org/w/index.php?title=" + pageUrl + "&action=history"
    print("history url is: " + historyUrl)
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Find links whose class is "mw-anonuserlink";
    # they show an IP address in place of a username
    ipAddresses = bsObj.findAll("a", {"class": "mw-anonuserlink"})
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
    return addressList

def getCountry(ipAddress):
    try:
        response = urlopen("http://freegeoip.net/json/" + ipAddress).read().decode("utf-8")
    except HTTPError:
        # The lookup failed for this address; skip it
        return None
    responseJson = json.loads(response)
    return responseJson.get("country_code")

links = getLinks("/wiki/Python_(programming_language)")

while len(links) > 0:
    for link in links:
        print("-------------------")
        historyIPs = getHistoryIPs(link.attrs["href"])
        for historyIP in historyIPs:
            country = getCountry(historyIP)
            if country is not None:
                print(historyIP + " is from " + country)

    # Pick a random link from the current page and continue the crawl from there
    newLink = links[random.randint(0, len(links) - 1)].attrs["href"]
    links = getLinks(newLink)
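One way to look at this data from a new angle is to count editors per country instead of printing each address. The following is only a sketch: `tally_countries` and `fake_lookup` are hypothetical names, and the lookup table is made-up data standing in for `getCountry` so the example runs without network access.

```python
from collections import Counter


def tally_countries(ip_addresses, lookup):
    """Count how many addresses resolve to each country code."""
    counts = Counter()
    for ip in ip_addresses:
        country = lookup(ip)
        if country is not None:  # skip addresses the lookup could not resolve
            counts[country] += 1
    return counts


# Made-up table standing in for getCountry, so no network access is needed
fake_table = {"203.0.113.7": "US", "198.51.100.23": "DE", "192.0.2.44": "US"}


def fake_lookup(ip):
    return fake_table.get(ip)


counts = tally_countries(
    ["203.0.113.7", "198.51.100.23", "192.0.2.44", "not-an-ip"], fake_lookup
)
for country, n in counts.most_common():
    print(country, n)
```

In the crawler above, the same function could be called as `tally_countries(historyIPs, getCountry)` for each article.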

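4.9 More About APIs

RESTful Web APIs, by Leonard Richardson, Mike Amundsen, and Sam Ruby, offers a thorough guide to both the theory and practice of using web APIs. In addition, Mike Amundsen's excellent video course "Designing APIs for the Web" teaches you how to create your own APIs. His videos are very useful if you want an easy way to share the data you have collected.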
