Crawling Dynamic Web Pages (Methods for Crawling Dynamic Websites)

Published by 优采云: 2021-09-28 07:26


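The previous post covered crawling static web pages. This post covers crawling dynamic websites, which is harder than crawling static pages; the main technologies involved are Ajax and dynamic HTML. A plain page request does not return the complete data, so the data-loading process has to be analyzed. Concrete examples are used to introduce different ways of crawling dynamic pages. This post focuses on fetching the data directly through the Ajax request.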

Page analysis

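This post uses the MTime movie site as an example and mainly crawls movie ratings, box office figures, and similar information. Start by inspecting the page with the Firefox console.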

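The box office information on the page is not present in the HTML; it is loaded dynamically by JavaScript. To find the corresponding response, look through the JavaScript requests for the ones that contain Ajax fields. The relevant request looks like this (fragment): %3A%2F%%2F242129%2F&t=27406&Ajax_CallBackArgument0=242129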

Click the request to view the returned data:

Once the link and its returned data have been found, analyze how the link is constructed and what the returned data contains.

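(1) The link has 7 parameters in total. First determine which parameters stay the same and which differ between movies. Comparing the links for two different movies shows that 4 parameters do not change and 3 change dynamically: Ajax_RequestRrl, t, and Ajax_CallBackArgument0. Further analysis shows that these three represent the current page URL, the time of the request, and the movie's ID number, respectively.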
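
The comparison described in (1) can be sketched with the standard library. The two URLs below are hypothetical examples modeled on the request format used later in this post, not captured traffic; the movie IDs and `t` values are placeholders, and `diff_query_params` is an illustrative helper name.

```python
from urllib.parse import urlsplit, parse_qs


def diff_query_params(url_a, url_b):
    """Split the query parameter names of two URLs into constant and varying."""
    params_a = parse_qs(urlsplit(url_a).query)
    params_b = parse_qs(urlsplit(url_b).query)
    constant, varying = [], []
    for name in sorted(set(params_a) | set(params_b)):
        if params_a.get(name) == params_b.get(name):
            constant.append(name)
        else:
            varying.append(name)
    return constant, varying


# Hypothetical request URLs for two different movies (placeholders).
BASE = ('http://service.library.mtime.com/Movie.api'
        '?Ajax_CallBack=true'
        '&Ajax_CallBackType=Mtime.Library.Services'
        '&Ajax_CallBackMethod=GetMovieOverviewRating'
        '&Ajax_CrossDomain=1')
url_a = BASE + ('&Ajax_RequestRrl=http%3A%2F%2Fmovie.mtime.com%2F111111%2F'
                '&t=201601011200003282&Ajax_CallBackArgument0=111111')
url_b = BASE + ('&Ajax_RequestRrl=http%3A%2F%2Fmovie.mtime.com%2F222222%2F'
                '&t=201601011205003282&Ajax_CallBackArgument0=222222')

constant, varying = diff_query_params(url_a, url_b)
print('constant:', constant)
print('varying:', varying)
```

With these placeholder URLs, four names land in the constant list and three in the varying list, which matches the 4/3 split described above.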

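(2) Extract the response data. The responses fall into three categories: movies that are currently showing, movies that are about to be released, and movies whose release is still a long way off. See the code for details.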
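
The three-way split in (2) can be sketched as a small function. This is a minimal sketch that mirrors the branching in the parser code of this post (the `isRelease` and `hotValue` fields); the function name `classify_response`, the sample dictionaries, and the wording of the labels for categories 2 and 0 are assumptions made for illustration.

```python
def classify_response(value):
    """Return 1 (released), 2 (release coming up), or 0 (everything else)."""
    data = value.get('value') or {}
    if data.get('isRelease'):
        # In the parser code, a response without hotValue is treated as released
        return 1 if data.get('hotValue') is None else 2
    return 0


# Made-up payloads, reduced to the two fields that drive the decision
released = {'value': {'isRelease': True}}
upcoming = {'value': {'isRelease': True, 'hotValue': {'Ranking': 5}}}
far_off = {'value': {'isRelease': False}}

for payload in (released, upcoming, far_off):
    print(classify_response(payload))
```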

Implementation

The code in this post builds on an earlier blog post; only the parts that need to change are shown here.

Page parser

Define a parser_url method in the HtmlParser class:

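# The HtmlParser module needs "import re" and "import json" at the top.
def parser_url(self, page_url, response):
    # Each match is a tuple: (full movie URL, movie ID)
    pattern = re.compile(r'(http://movie\.mtime\.com/(\d+)/)')
    urls = pattern.findall(response)
    # findall returns an empty list when nothing matches, never None
    if urls:
        return list(set(urls))
    return None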
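
To make the return value of parser_url concrete, here is a small standalone sketch of what `re.findall` produces for a pattern with two capture groups. The HTML fragment and the movie IDs are made up for illustration.

```python
import re

# Same pattern as parser_url, with the dots escaped so they match literally
MOVIE_URL = re.compile(r'(http://movie\.mtime\.com/(\d+)/)')

# Made-up HTML fragment for illustration
html = '''
<a href="http://movie.mtime.com/111111/">Movie A</a>
<a href="http://movie.mtime.com/222222/">Movie B</a>
<a href="http://movie.mtime.com/111111/">Movie A again</a>
'''

# With two groups, findall returns (full_url, movie_id) tuples;
# set() removes the duplicate link, sorted() makes the order stable
matches = sorted(set(MOVIE_URL.findall(html)))
for full_url, movie_id in matches:
    print(full_url, movie_id)
```

The scheduler later relies on this tuple shape: index 0 is the page URL and index 1 is the movie ID.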

Extract the useful data from the response:

def parser_json(self, page_url, response):
    """
    Parse the Ajax response.
    :param page_url: URL of the request that produced the response
    :param response: raw response text
    :return: a tuple of movie data, or None if parsing fails
    """
    # Extract the content between "=" and ";"
    pattern = re.compile(r'=(.*?);')
    matches = pattern.findall(response)
    if not matches:
        return None
    try:
        value = json.loads(matches[0])
        isRelease = value.get('value').get('isRelease')
    except Exception as e:
        print(e)
        return None
    if isRelease:
        if value.get('value').get('hotValue') is None:
            return self._parser_release(page_url, value)
        else:
            return self._parser_no_release(page_url, value, isRelease=2)
    else:
        return self._parser_no_release(page_url, value)

def _parser_release(self, page_url, value):
    """
    Parse a movie that has already been released.
    :param page_url: URL of the request that produced the response
    :param value: decoded JSON response
    :return: a tuple of movie data, or None if parsing fails
    """
    try:
        isRelease = 1
        movieRating = value.get('value').get('movieRating')
        boxOffice = value.get('value').get('boxOffice')
        movieTitle = value.get('value').get('movieTitle')
        RPictureFinal = movieRating.get('RPictureFinal')
        RStoryFinal = movieRating.get('RStoryFinal')
        RDirectorFinal = movieRating.get('RDirectorFinal')
        ROtherFinal = movieRating.get('ROtherFinal')
        RatingFinal = movieRating.get('RatingFinal')
        MovieId = movieRating.get('MovieId')
        Usercount = movieRating.get('Usercount')
        AttitudeCount = movieRating.get('AttitudeCount')
        TotalBoxOffice = boxOffice.get('TotalBoxOffice')
        TotalBoxOfficeUnit = boxOffice.get('TotalBoxOfficeUnit')
        TodayBoxOffice = boxOffice.get('TodayBoxOffice')
        TodayBoxOfficeUnit = boxOffice.get('TodayBoxOfficeUnit')
        showDays = boxOffice.get('ShowDays')
        # dict.get does not raise on a missing key, so use a default of 0
        Rank = boxOffice.get('Rank', 0)
        return (MovieId, movieTitle, RatingFinal,
                ROtherFinal, RPictureFinal, RDirectorFinal,
                RStoryFinal, Usercount, AttitudeCount,
                TotalBoxOffice + TotalBoxOfficeUnit,
                TodayBoxOffice + TodayBoxOfficeUnit,
                Rank, showDays, isRelease)
    except Exception as e:
        print(e, page_url, value)
        return None

def _parser_no_release(self, page_url, value, isRelease=0):
    try:
        movieRating = value.get('value').get('movieRating')
        movieTitle = value.get('value').get('movieTitle')
        RPictureFinal = movieRating.get('RPictureFinal')
        RStoryFinal = movieRating.get('RStoryFinal')
        RDirectorFinal = movieRating.get('RDirectorFinal')
        ROtherFinal = movieRating.get('ROtherFinal')
        RatingFinal = movieRating.get('RatingFinal')
        MovieId = movieRating.get('MovieId')
        Usercount = movieRating.get('Usercount')
        AttitudeCount = movieRating.get('AttitudeCount')
        try:
            # hotValue may be missing, in which case .get on None raises
            Rank = value.get('value').get('hotValue').get('Ranking')
        except Exception:
            Rank = 0
        # '无' means "none": there is no box office data before release
        return (MovieId, movieTitle, RatingFinal,
                ROtherFinal, RPictureFinal, RDirectorFinal,
                RStoryFinal, Usercount, AttitudeCount, '无',
                '无', Rank, 0, isRelease)
    except Exception as e:
        print(e, page_url, value)
        return None
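
The extraction step at the start of parser_json can be tried in isolation. The sketch below assumes the response has the shape `var name = {...};`, which is what the "between = and ;" rule implies; the variable name and the field values in `sample` are made up, and `extract_payload` is an illustrative helper, not part of the original classes.

```python
import json
import re

# Made-up response in the "var name = {...};" shape that parser_json expects
sample = ('var result_201601011200003282 = '
          '{"value": {"isRelease": true, "movieTitle": "Example"}};')


def extract_payload(text):
    """Return the JSON found between the first "=" and the next ";", or None."""
    matches = re.findall(r'=(.*?);', text)
    if not matches:
        return None
    try:
        return json.loads(matches[0])
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        return None


payload = extract_payload(sample)
print(payload['value']['movieTitle'])
```

One limitation of this rule: the non-greedy match stops at the first ";", so a response whose JSON contains a semicolon inside a string would be cut short and fail to decode.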

Crawler scheduler

import time

def dynamic_crawl(self, root_url):
    content = self.downloader.download(root_url)
    # parser_url returns None when no movie links are found
    urls = self.parser.parser_url(root_url, content) or []
    for url in urls:
        try:
            # Current time followed by the literal suffix "3282"
            t = time.strftime("%Y%m%d%H%M%S3282", time.localtime())
            # url[0] is the movie page URL, url[1] is the movie ID
            rank_url = ('http://service.library.mtime.com/Movie.api'
                        '?Ajax_CallBack=true'
                        '&Ajax_CallBackType=Mtime.Library.Services'
                        '&Ajax_CallBackMethod=GetMovieOverviewRating'
                        '&Ajax_CrossDomain=1'
                        '&Ajax_RequestRrl=%s'
                        '&t=%s'
                        '&Ajax_CallBackArgument0=%s' % (url[0], t, url[1]))
            rank_content = self.downloader.download(rank_url)
            data = self.parser.parser_json(rank_url, rank_content)
            print(data)
        except Exception as e:
            print('Crawl failed', url, e)

if __name__ == "__main__":
    spider_main = SpiderMain()
    spider_main.dynamic_crawl("http://theater.mtime.com/China_Beijing/")
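
As an alternative to `%` string formatting, the request URL could be assembled with `urllib.parse.urlencode`, which percent-encodes each value. This is a sketch under assumptions: `build_rank_url` is a hypothetical helper, the movie URL and ID are placeholders, and the post does not say whether the server requires the page URL to be encoded (the `%3A%2F` fragment seen during page analysis only suggests that the browser sends it encoded).

```python
import time
from urllib.parse import urlencode, urlsplit, parse_qs

API = 'http://service.library.mtime.com/Movie.api'


def build_rank_url(movie_url, movie_id, now=None):
    """Build the rating request URL for one movie, percent-encoding each value."""
    # Same timestamp format as dynamic_crawl, including the literal "3282"
    t = time.strftime('%Y%m%d%H%M%S3282', now or time.localtime())
    params = [
        ('Ajax_CallBack', 'true'),
        ('Ajax_CallBackType', 'Mtime.Library.Services'),
        ('Ajax_CallBackMethod', 'GetMovieOverviewRating'),
        ('Ajax_CrossDomain', '1'),
        ('Ajax_RequestRrl', movie_url),
        ('t', t),
        ('Ajax_CallBackArgument0', movie_id),
    ]
    return API + '?' + urlencode(params)


# Placeholder movie URL and ID, in the tuple shape returned by parser_url
rank_url = build_rank_url('http://movie.mtime.com/111111/', '111111')
print(rank_url)
```

Keeping the parameters in a list also makes it easy to see the four fixed values and the three that change per movie.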
