PHP: scraping web page content in a loop (the loop inside the function run(self)) (figure)

Published by 优采云: 2022-01-16 22:02

def send_result(self, type, task, result):
    if self.outqueue:
        self.outqueue.put((task, result))

This final function puts the result on the output queue, where it waits for the content processor to read it.
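The hand-off described above can be sketched with the standard-library queue module. This is a minimal illustration, not pyspider's actual code: MiniFetcher and the sample task and result dictionaries are hypothetical, and sharing one queue between the two components is an assumption. Only send_result and outqueue come from the excerpt.

```python
import queue


class MiniFetcher:
    """Hypothetical stand-in for the fetcher; only the hand-off is modelled."""

    def __init__(self, outqueue=None):
        self.outqueue = outqueue

    def send_result(self, type, task, result):
        # Same shape as the excerpt: forward the pair only if a queue is attached.
        if self.outqueue:
            self.outqueue.put((task, result))


# Assumption: the fetcher's output queue is the processor's input queue.
shared = queue.Queue()
fetcher = MiniFetcher(outqueue=shared)
fetcher.send_result("http", {"taskid": "t1"}, {"status_code": 200})

# Consumer side: the processor would wait here until a result arrives.
task, result = shared.get(timeout=1)
```

With no queue attached, send_result silently does nothing, which is why the guard on self.outqueue matters.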

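Content processor

The content processor's job is to analyze the pages that have been crawled. Its process is also one big loop, but it has three output queues (status_queue, newtask_queue and result_queue) and only one input queue (inqueue).

Let's look more closely at the loop in the function run().

Function run(self)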

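def run(self):
    # The loop header is missing from the excerpt; "while True" is an assumed stand-in.
    while True:
        try:
            # Wait up to one second for the next (task, response) pair.
            task, response = self.inqueue.get(timeout=1)
            self.on_task(task, response)
            self._exceptions = 0  # a successful task resets the counter
        except KeyboardInterrupt:
            break
        except Exception as e:
            self._exceptions += 1
            if self._exceptions > self.EXCEPTION_LIMIT:
                break
            continue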

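This function is short and easy to follow: it takes the next task to analyze from the queue and analyzes it with on_task(task, response). The loop listens for an interrupt signal; as soon as we send one to Python, the loop terminates. Finally, the loop counts the exceptions raised since the last successful task (the counter resets to 0 on success), and once that count exceeds EXCEPTION_LIMIT the loop terminates.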
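The exception-counting behaviour can be tried out in isolation. The sketch below is a simplified model, not pyspider's implementation: MiniProcessor, its handled list, the placeholder on_task, and the limit of 3 are all hypothetical. It also stops when the queue is empty so that the demo ends, which the excerpt does not do.

```python
import queue


class MiniProcessor:
    """Hypothetical processor; only the loop control flow is modelled."""

    EXCEPTION_LIMIT = 3  # assumed value for the demo

    def __init__(self, inqueue):
        self.inqueue = inqueue
        self._exceptions = 0
        self.handled = []

    def on_task(self, task, response):
        # Placeholder analysis step: a missing response counts as a failure.
        if response is None:
            raise ValueError("empty response")
        self.handled.append(task)

    def run(self):
        while True:
            try:
                task, response = self.inqueue.get(timeout=0.1)
                self.on_task(task, response)
                self._exceptions = 0  # a success resets the failure streak
            except queue.Empty:
                break  # demo-only assumption: stop once the queue drains
            except KeyboardInterrupt:
                break
            except Exception:
                self._exceptions += 1
                if self._exceptions > self.EXCEPTION_LIMIT:
                    break
                continue


q = queue.Queue()
q.put(("ok-1", "<html>"))
for n in range(5):
    q.put((f"bad-{n}", None))
q.put(("ok-2", "<html>"))

p = MiniProcessor(q)
p.run()
```

Here the fourth consecutive failure pushes the counter past the limit, so the loop exits with two items still queued and "ok-2" is never processed.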

Function on_task(self, task, response)

def on_task(self, task, response):
    response = rebuild_response(response)
    project = task['project']
    # updatetime is not defined anywhere in this excerpt.
    project_data = self.project_manager.get(project, updatetime)
    # The arguments to run() are cut off in the source.
    ret = project_data['instance'].run(...)

    status_pack = {
        'taskid': task['taskid'],
        'project': task['project'],
        'url': task.get('url'),
        ...
    }
    self.status_queue.put(utils.unicode_obj(status_pack))
    if ret.follows:
        self.newtask_queue.put(
            [utils.unicode_obj(newtask) for newtask in ret.follows])

    for project, msg, url in ret.messages:
        self.inqueue.put(({...},{...}))

    return True

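on_task() is the method that does the actual work.

It uses the incoming task to look up the project the task belongs to, then runs the project's custom script. Finally, it analyzes the response returned by the custom script. If all goes well, it builds a dictionary containing all the information obtained from the web page and puts that dictionary on status_queue, where the scheduler picks it up later.

If the analyzed page contains new links to process, they are put on newtask_queue for the scheduler to use later.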

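Next, if needed, pyspider sends messages to other projects: each (project, msg, url) entry in ret.messages is put back on inqueue.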
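The three routing steps can be sketched as follows. This is an illustration under stated assumptions, not pyspider's code: MiniProcessor and the Result type are hypothetical, the project script run is replaced by a precomputed ret, and the layout of the pair put on inqueue is invented because the excerpt elides it.

```python
import queue
from collections import namedtuple

# Hypothetical result type; the excerpt only shows that ret has
# follows and messages attributes.
Result = namedtuple("Result", ["follows", "messages"])


class MiniProcessor:
    def __init__(self):
        self.inqueue = queue.Queue()
        self.status_queue = queue.Queue()
        self.newtask_queue = queue.Queue()

    def on_task(self, task, ret):
        # 1. Report what happened to this task (read later by the scheduler).
        status_pack = {
            "taskid": task["taskid"],
            "project": task["project"],
            "url": task.get("url"),
        }
        self.status_queue.put(status_pack)

        # 2. Hand newly discovered links over as one batch.
        if ret.follows:
            self.newtask_queue.put(list(ret.follows))

        # 3. Re-enqueue messages addressed to other projects.
        for project, msg, url in ret.messages:
            # Assumed layout: the excerpt shows only ({...}, {...}).
            self.inqueue.put(({"project": project, "url": url}, {"save": msg}))
        return True


p = MiniProcessor()
ret = Result(
    follows=[{"url": "http://example.com/a"}, {"url": "http://example.com/b"}],
    messages=[("other_project", "hello", "http://example.com/")],
)
ok = p.on_task(
    {"taskid": "t1", "project": "demo", "url": "http://example.com/"}, ret
)
```

Note that new links go on the queue as a single list rather than one item per link, mirroring the list comprehension in the excerpt.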

Finally, if something goes wrong, for example the page returns an error, the error message is added to the log.

The end!
