Scraping dynamic web pages (the lineup request you are probably interested in: +Season+J (figure))

优采云 Published: 2022-02-06 16:18

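  In this example, JavaScript is only used to send, receive, and display content on the page without reloading the page for each request. You therefore do not need to parse the JavaScript at all: find the request that carries the data, reproduce that request, and parse the response. To find it, you can use Firebug in Firefox or the Developer Tools in Chrome (Ctrl+Shift+J on Windows, Cmd+Opt+J on Mac). In Chrome, open the Network tab; as you click around the site, the requests and their responses appear there.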

  In this particular example, when you ask for the Cleveland team's statistics for "2008-09", the JavaScript issues several requests. The one you are probably interested in, for lineups, is: +Season&Season=2008-09&PaceAdjust=N&DateFrom=&sortOrder=DES&VsConference=&OpponentTeamID=0&DateTo=&GameSegment=&LastNGames=0&VsDivision=&LeagueID=00&Outcome=&GameScope=&MeasureType=Base&PerMode=Per48&sortPeriod=0N&SeasonSegment=& =0&rowsPerPage=100
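  As a rough illustration of "reproduce that request", the sketch below turns a query string copied from the Network tab into a dictionary, changes one parameter, and rebuilds the URL. The query string is a shortened subset of the parameters shown above, and the season value "2009-10" is only an illustrative assumption; whether the endpoint accepts a reduced parameter set is not something this sketch verifies.

```python
from urllib.parse import parse_qsl, urlencode

# Shortened subset of the query string shown above (illustration only).
copied_query = "Season=2008-09&PaceAdjust=N&DateFrom=&sortOrder=DES&LeagueID=00&rowsPerPage=100"

# keep_blank_values=True preserves empty parameters such as "DateFrom=".
params = dict(parse_qsl(copied_query, keep_blank_values=True))

# Change whichever parameter you need (hypothetical value), then rebuild the URL.
params["Season"] = "2009-10"
url = "http://stats.nba.com/stats/teamdashlineups?" + urlencode(params)
print(url)
```

  Keeping the parameters in a dictionary makes it easy to loop over seasons or teams later without editing a long URL by hand.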

  Below is a small example spider. You only need to define LineupItem, and then you can run it with scrapy crawl stats -o output.json.

import json
from urllib.parse import urlencode

from scrapy import Spider
from scrapy.http import Request

from nba.items import LineupItem


class StatsSpider(Spider):
    name = "stats"
    allowed_domains = ["stats.nba.com"]
    start_urls = (
        'http://stats.nba.com/',
    )

    def parse(self, response):
        # Request the 2008-09 lineups for team 1610612739 (Cleveland).
        yield self.get_lineup('1610612739', '2008-09')

    def get_lineup(self, team_id, season):
        # Query parameters copied from the request seen in the browser.
        params = {
            'Season': season,
            'SeasonType': 'Regular Season',
            'LeagueID': '00',
            'TeamID': team_id,
            'MeasureType': 'Base',
            'PerMode': 'Per48',
            'PlusMinus': 'N',
            'PaceAdjust': 'N',
            'Rank': 'N',
            'Outcome': '',
            'Location': '',
            'Month': '0',
            'SeasonSegment': '',
            'DateFrom': '',
            'DateTo': '',
            'OpponentTeamID': '0',
            'VsConference': '',
            'VsDivision': '',
            'GameSegment': '',
            'Period': '0',
            'LastNGames': '0',
            'GroupQuantity': '5',
            'GameScope': '',
            'GameID': '',
            'pageNo': '1',
            'rowsPerPage': '100',
            'sortField': 'MIN',
            'sortOrder': 'DES'
        }
        return Request(
            url="http://stats.nba.com/stats/teamdashlineups?" + urlencode(params),
            dont_filter=True,
            callback=self.parse_lineup
        )

    def parse_lineup(self, response):
        # The endpoint returns JSON; the lineup rows are in the second result set.
        data = json.loads(response.text)
        for lineup in data['resultSets'][1]['rowSet']:
            item = LineupItem()
            item['group_set'] = lineup[0]
            item['group_id'] = lineup[1]
            item['group_name'] = lineup[2]
            item['gp'] = lineup[3]
            item['w'] = lineup[4]
            item['l'] = lineup[5]
            item['w_pct'] = lineup[6]
            item['min'] = lineup[7]
            item['fgm'] = lineup[8]
            item['fga'] = lineup[9]
            item['fg_pct'] = lineup[10]
            item['fg3m'] = lineup[11]
            item['fg3a'] = lineup[12]
            item['fg3_pct'] = lineup[13]
            item['ftm'] = lineup[14]
            item['fta'] = lineup[15]
            item['ft_pct'] = lineup[16]
            item['oreb'] = lineup[17]
            item['dreb'] = lineup[18]
            item['reb'] = lineup[19]
            item['ast'] = lineup[20]
            item['tov'] = lineup[21]
            item['stl'] = lineup[22]
            item['blk'] = lineup[23]
            item['blka'] = lineup[24]
            item['pf'] = lineup[25]
            item['pfd'] = lineup[26]
            item['pts'] = lineup[27]
            item['plus_minus'] = lineup[28]
            yield item
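  The positional indexes in parse_lineup depend on the column order staying fixed. One possible alternative, sketched below, pairs each row with column names instead. It assumes every result set carries a "headers" list parallel to "rowSet"; the original does not show this, so confirm it in the Network tab first. The helper name rows_to_dicts and the miniature payload are made up for illustration.

```python
import json


def rows_to_dicts(payload, result_index=1):
    """Pair each row with lower-cased column names (hypothetical helper)."""
    result_set = payload["resultSets"][result_index]
    # Assumption: "headers" lists the column names in the same order as each row.
    keys = [header.lower() for header in result_set["headers"]]
    return [dict(zip(keys, row)) for row in result_set["rowSet"]]


# Made-up miniature payload with the assumed shape; not real API output.
sample = json.loads("""
{"resultSets": [
    {"headers": [], "rowSet": []},
    {"headers": ["GROUP_SET", "GP", "W_PCT"],
     "rowSet": [["Lineups", 30, 0.833]]}
]}
""")

records = rows_to_dicts(sample)
print(records)
```

  If the assumption holds, each resulting dictionary could be passed to LineupItem(**record), provided the lower-cased header names match the declared fields.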

  This produces JSON records such as:

  {"gp": 30, "fg_pct": 0.491, "group_name": "Ilgauskas,Zydrunas - James,LeBron - Wallace,Ben - West,Delonte - Williams,Mo", "group_set": "Lineups", "w_pct": 0.833, "pts": 103.0, "min": 484.9866666666667, "tov": 13.3, "fta": 21.6, "pf": 16.0, "blk": 7.7, "reb": 44.2, "blka": 3.0, "ftm": 16.6, "ft_pct": 0.771, "fg3a": 18.7, "pfd": 17.2, "ast": 23.3, "fg3m": 7.4, "fgm": 39.5, "fg3_pct": 0.397, "dreb": 32.0, "fga": 80.4, "plus_minus": 18.4, "stl": 8.3, "l": 5, "oreb": 12.3, "w": 25, "group_id": "980 - 2544 - 1112 - 2753 - 2590"}
