Scraping dynamic web pages with Node.js (a basic crawler that extracts data from a page's source code and saves it to a JSON file)

Published by 优采云: 2021-11-04 03:18

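Writing crawlers is an activity that demands fairly broad computing skills.

First, you need a basic understanding of network protocols, HTTP in particular, so that you can analyze a website's requests and responses. Learn a few tools; for simple cases the Network panel of Chrome DevTools is enough. I usually pair it with Postman or Charles. For more complex situations you may need a dedicated packet-capture tool such as Wireshark. The better you understand a site, the easier it is to find a simple way to scrape the information you want.

Beyond networking knowledge, you also need some string-processing skills, regular expressions in particular. In typical scenarios regular expressions do not call for much advanced knowledge; the commonly used features are enough. The slightly trickier parts are grouping and non-greedy matching. As the saying goes, once you have learned regular expressions well, string processing holds no fear.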
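As a small illustration of grouping and non-greedy matching, here is a minimal sketch. The HTML fragment is made up for the example and does not reflect any real page's markup.

```javascript
// Hypothetical HTML fragment; real pages will have different markup.
const html = '<a href="/vuejs/vue">vuejs/vue</a><a href="/facebook/react">facebook/react</a>';

// Greedy: `.*` runs up to the LAST `">`, so both links end up in a single match.
const greedy = html.match(/<a href="(.*)">/)[1];

// Non-greedy: `.*?` stops at the first place the rest of the pattern can match,
// so each link is matched separately. The two capture groups split the path
// into an author part and a repository-name part.
const linkPattern = /<a href="\/(.*?)\/(.*?)">/g;
const repos = [...html.matchAll(linkPattern)].map(([, author, name]) => ({ author, name }));

console.log(greedy);
console.log(repos);
```

The greedy version captures far more than intended, which is the usual reason to reach for `.*?` when pulling values out of markup.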

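Finally, you should know some techniques for getting past anti-crawler measures. You will run into all sorts of problems when writing crawlers, but don't be afraid: as complicated as 12306 is, people still manage to crawl it, so what could stop us? Common obstacles include server checks on cookies, checks on the Host and Referer headers, hidden form fields, CAPTCHAs, rate limits, the need for proxies, and SPA sites. In practice, most of these problems can ultimately be solved by driving a real browser.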

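This is the second part of my series on writing crawlers with Node.js: a hands-on small crawler that scrapes popular GitHub projects. The goals:

- Learn to extract data from a page's source code.
- Use a JSON file to store the scraped data in this basic crawler.
- Get familiar with some of the modules I introduced in the previous article.
- Learn how to handle user input in Node.

Analyzing the requirements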

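Our requirement is to scrape popular project data from GitHub, that is, the projects with the most stars. However, GitHub does not seem to have a page that ranks the top projects. A site's search feature is often where a crawler author should focus the analysis.

A while ago, while browsing V2EX, I came across a thread discussing 996 that happened to show a way to list top repositories by star count. It is simple: add a star-count filter to the GitHub search, for example stars:>60000, and you get every repository on GitHub with more than 60000 stars. Analyze the screenshot below, paying attention to the annotations in the image: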

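The analysis yields the following:

- The search results page is an HTML document returned by a GET request; I can tell because I selected the Doc filter in the Network panel.
- The URL carries 3 request parameters: p (page) is the page number, q (query) is the search content, and type is the type of content being searched.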
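The three parameters can be assembled with Node's built-in URL API, which also takes care of percent-encoding the query. This is only a sketch; `buildSearchUrl` is a hypothetical helper name, not something from the project.

```javascript
// Build the search URL from its three query parameters (hypothetical helper).
const buildSearchUrl = (minStars, page = 1) => {
    const url = new URL('https://github.com/search');
    url.searchParams.set('p', String(page));          // p: page number
    url.searchParams.set('q', `stars:>${minStars}`);  // q: search query
    url.searchParams.set('type', 'Repositories');     // type: kind of result
    return url.href;
};

console.log(buildSearchUrl(60000, 2));
```

Letting `URLSearchParams` do the encoding avoids hand-writing sequences such as `%3A%3E` for `:>`.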

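Next I wanted to know whether GitHub checks cookies and other request headers such as Referer and Host, and decides whether to return the page based on their presence.

A simple way to test this is with the command-line tool curl. In Git Bash, enter the following command, i.e. curl "the requested URL":

curl "https://github.com/search?p=2&q=stars%3A%3E60000&type=Repositories"

As expected, the page source comes back normally, so our crawler script does not need to add request headers or cookies.

Using Chrome's search function, you can see that the page source contains the project information we need.

That concludes the analysis. This is really a very simple crawler: configure the query parameters, fetch the page source over HTTP, parse it with a parsing library to get the project information, collect the data into an array, and finally serialize the array to a JSON string and store it in a JSON file.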
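The last step of that pipeline, serializing an array to a JSON file, can be sketched with Node's built-in modules alone. The sample record and the temp-file location are assumptions made for this example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Made-up sample record standing in for the parsed results.
const repositories = [
    { name: 'example-repo', author: 'example-user', starCount: '61.2k' },
];

// Serialize with 2-space indentation and write to a file in the temp directory.
const outFile = path.join(os.tmpdir(), 'repositories-example.json');
fs.writeFileSync(outFile, JSON.stringify(repositories, null, 2), 'utf8');

// Read the file back to confirm the data survives the round trip.
const restored = JSON.parse(fs.readFileSync(outFile, 'utf8'));
console.log(restored.length, restored[0].name);
```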

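Implementing the crawler

Fetching the source code

To fetch the source with Node, first fill in the URL parameters, then request the resulting URL with superagent, a module for sending HTTP requests.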

'use strict';

const requests = require('superagent');
const cheerio = require('cheerio');
const constants = require('../config/constants');
const logger = require('../config/log4jsConfig').log4js.getLogger('githubHotProjects');
const requestUtil = require('./utils/request');
const models = require('./models');

/**
 * Fetch the source code of the given page of projects with more than starCount k stars.
 * @param {number} starCount lower bound on the star count, in thousands
 * @param {number} page page number
 */
const crawlSourceCode = async (starCount, page = 1) => {
    // The lower bound is given in k (thousands of stars), so multiply by 1000, not 1024
    starCount = starCount * 1000;
    // Fill in the placeholders in the URL
    const url = constants.searchUrl.replace('${starCount}', starCount).replace('${page}', page);
    // response.text is the returned source code
    const { text: sourceCode } = await requestUtil.logRequest(requests.get(encodeURI(url)));
    return sourceCode;
};

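The constants module in the code above holds constant configuration for the project. When a constant needs to change, edit this one configuration file; keeping the configuration in one place makes it easy to review.

module.exports = {
    searchUrl: 'https://github.com/search?q=stars:>${starCount}&p=${page}&type=Repositories',
};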
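Note that the placeholders sit inside a single-quoted string, so they are literal text rather than a template literal, and each one has to be substituted explicitly. A more general way to do that substitution might look like the sketch below; `fillTemplate` is a hypothetical helper, not part of the project.

```javascript
// Single quotes: '${starCount}' here is literal text, not a template literal.
const searchUrlTemplate = 'https://github.com/search?q=stars:>${starCount}&p=${page}&type=Repositories';

// Replace every ${key} placeholder with the matching value.
// Placeholders without a matching key are left untouched.
const fillTemplate = (template, values) =>
    template.replace(/\$\{(\w+)\}/g, (placeholder, key) =>
        (key in values ? String(values[key]) : placeholder));

// encodeURI escapes characters such as ">" but keeps ":", "?", "&" and "=" intact.
const url = encodeURI(fillTemplate(searchUrlTemplate, { starCount: 60000, page: 2 }));
console.log(url);
```

Compared with chaining `replace` calls, a single regular expression handles any number of placeholders and every occurrence of each.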

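Parsing the source code to extract project information

Here I abstract the project information into a Repository class, which lives in Repository.js under the project's models directory.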

const fs = require('fs-extra');
const path = require('path');

module.exports = class Repository {
    // Write the repositories to out/repositories.json, indented for readability
    static async saveToLocal(repositories, indent = 2) {
        await fs.writeJSON(path.resolve(__dirname, '../../out/repositories.json'), repositories, { spaces: indent });
    }

    constructor({
        name,
        author,
        language,
        digest,
        starCount,
        lastUpdate,
    } = {}) {
        this.name = name;
        this.author = author;
        this.language = language;
        this.digest = digest;
        this.starCount = starCount;
        this.lastUpdate = lastUpdate;
    }

    // Print a human-readable summary of the repository
    display() {
        console.log(`   Project: ${this.name} Author: ${this.author} Language: ${this.language} Stars: ${this.starCount}
Digest: ${this.digest}
Last update: ${this.lastUpdate}
`);
    }
};

To parse the fetched source we use cheerio, a parsing library that closely resembles jQuery.

/**
 * Get the list of projects with more than starCount k stars on the given page.
 * @param {number} starCount lower bound on the star count, in thousands
 * @param {number} page page number
 */
const crawlProjectsByPage = async (starCount, page = 1) => {
    const sourceCode = await crawlSourceCode(starCount, page);
    const $ = cheerio.load(sourceCode);

    // If you know jQuery, cheerio should pose no difficulty; if not, its GitHub repository documents the API, which is small.
    // In the Elements panel, each repository's information sits inside one li tag.
    // Keep the Elements panel of the developer tools open for reference while reading the code below.
    const repositoryLiSelector = '.repo-list-item';
    const repositoryLis = $(repositoryLiSelector);
    const repositories = [];
    repositoryLis.each((index, li) => {
        const $li = $(li);

        // Get the link that holds the repository author and name
        const nameLink = $li.find('h3 a');
        // Extract the author and the repository name
        const [author, name] = nameLink.text().split('/');

        // Get the project digest
        const digestP = $($li.find('p')[0]);
        const digest = digestP.text().trim();

        // Get the language:
        // first find the span with class .repo-language-color, then take its parent div, which contains the language text.
        // Some repositories have no language, so the span is not found and language is an empty string.
        const languageDiv = $li.find('.repo-language-color').parent();
        // Use String.prototype.trim() to strip the surrounding whitespace
        const language = languageDiv.text().trim();

        // Get the star count
        const starCountLinkSelector = '.muted-link';
        const links = $li.find(starCountLinkSelector);
        // The .muted-link selector may also match the issues link
        const starCountLink = $(links.length === 2 ? links[1] : links[0]);
        const starCount = starCountLink.text().trim();

        // Get the last update time
        const lastUpdateElementSelector = 'relative-time';
        const lastUpdate = $li.find(lastUpdateElementSelector).text().trim();

        const repository = new models.Repository({
            name,
            author,
            language,
            digest,
            starCount,
            lastUpdate,
        });
        repositories.push(repository);
    });
    return repositories;
};
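The parser above keeps starCount as the text scraped from the page. If numeric values are needed, for sorting say, a conversion along these lines could be applied afterwards. This is a sketch under the assumption that the page abbreviates thousands with a "k" suffix (for example "61.2k"); `parseStarCount` is a hypothetical helper.

```javascript
// Convert a display string such as "61.2k" or "987" into a number.
// Assumes a trailing "k" means thousands; returns NaN for unrecognized input.
const parseStarCount = (text) => {
    const match = /^([\d.,]+)\s*(k?)$/i.exec(text.trim());
    if (!match) return NaN;
    // Drop thousands separators such as the comma in "1,234"
    const value = Number(match[1].replace(/,/g, ''));
    // Math.round guards against floating-point noise after multiplying
    return match[2] ? Math.round(value * 1000) : value;
};

// Sort made-up records by star count, highest first.
const sample = [{ name: 'a', starCount: '987' }, { name: 'b', starCount: '61.2k' }];
const sorted = [...sample].sort((x, y) => parseStarCount(y.starCount) - parseStarCount(x.starCount));
console.log(sorted.map((repo) => repo.name));
```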

Search results sometimes span many pages, so I wrote another function that fetches the repositories from a given number of pages.

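const crawlProjectsByPagesCount = async (starCount, pagesCount) => {
    if (pagesCount === undefined) {
        // getPagesCount is a project helper that is not shown in this article
        pagesCount = await getPagesCount(starCount);
        logger.warn(`No page count specified; scraping all repositories, ${pagesCount} pages in total`);
    }

    const allRepositories = [];
    const tasks = Array.from({ length: pagesCount }, (ele, index) => {
        // Page numbers start at 1, hence index + 1
        return crawlProjectsByPage(starCount, index + 1);
    });

    // Use Promise.all to run the requests concurrently
    const resultRepositoriesArray = await Promise.all(tasks);
    resultRepositoriesArray.forEach(repositories => allRepositories.push(...repositories));
    return allRepositories;
};

// Exported because the startup script requires both functions from this module
module.exports = { crawlProjectsByPage, crawlProjectsByPagesCount };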
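Promise.all starts every page request at once, which can run into the rate limits mentioned earlier when there are many pages. One possible variation, sketched here with a stand-in for the real page crawler, is to process the pages in small batches. `crawlInBatches` and `fakeCrawlPage` are hypothetical names, and the batch size is arbitrary.

```javascript
// Stand-in for a real page crawler: resolves with fake records after a short delay.
const fakeCrawlPage = (page) =>
    new Promise((resolve) => {
        setTimeout(() => resolve([`repo-${page}-a`, `repo-${page}-b`]), 10);
    });

// Crawl pages 1..pagesCount with at most batchSize requests in flight at a time.
const crawlInBatches = async (pagesCount, batchSize, crawlPage) => {
    const all = [];
    for (let start = 1; start <= pagesCount; start += batchSize) {
        const end = Math.min(start + batchSize - 1, pagesCount);
        const pages = Array.from({ length: end - start + 1 }, (_, i) => start + i);
        // Promise.all returns results in input order, regardless of completion order
        const results = await Promise.all(pages.map((page) => crawlPage(page)));
        results.forEach((repositories) => all.push(...repositories));
    }
    return all;
};

const resultPromise = crawlInBatches(5, 2, fakeCrawlPage);
resultPromise.then((repos) => console.log(repos.length, repos[0], repos[repos.length - 1]));
```

Batches trade some speed for a gentler request pattern; the results still come back in page order.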

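Making the crawler more user-friendly

Writing a throwaway script, hard-coding the parameters, and running it is a bit crude. Here I use readline-sync, a library that reads user input synchronously, to add a little user interaction. In later tutorials in this series I may use Electron to build a simple UI. Below is the program's startup code.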

const readlineSync = require('readline-sync');
const { crawlProjectsByPage, crawlProjectsByPagesCount } = require('./crawlHotProjects');
const models = require('./models');
const logger = require('../config/log4jsConfig').log4js.getLogger('githubHotProjects');

const main = async () => {
    let isContinue = true;
    do {
        const starCount = readlineSync.questionInt('Enter the lower bound on stars for the GitHub projects to scrape, in k: ', { encoding: 'utf-8' });
        const crawlModes = [
            'Scrape a single page',
            'Scrape a given number of pages',
            'Scrape all pages',
        ];
        // keyInSelect returns the zero-based index of the chosen item, or -1 on cancel
        const index = readlineSync.keyInSelect(crawlModes, 'Choose a scraping mode');

        let repositories = [];
        switch (index) {
            case 0: {
                const page = readlineSync.questionInt('Enter the page number to scrape: ');
                repositories = await crawlProjectsByPage(starCount, page);
                break;
            }
            case 1: {
                const pagesCount = readlineSync.questionInt('Enter the number of pages to scrape: ');
                repositories = await crawlProjectsByPagesCount(starCount, pagesCount);
                break;
            }
            // The third menu item has index 2 (the original had case 3, which never matched)
            case 2: {
                repositories = await crawlProjectsByPagesCount(starCount);
                break;
            }
        }

        repositories.forEach(repository => repository.display());

        const isSave = readlineSync.keyInYN('Save to a local file (JSON format)?');
        if (isSave) {
            // Wait for the write to finish before prompting again
            await models.Repository.saveToLocal(repositories);
        }
        isContinue = readlineSync.keyInYN('Continue (y) or quit (n)?');
    } while (isContinue);
    logger.info('Program exited normally...');
};

main();

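Let's look at the final result.

I'd like to mention a readline-sync bug here. When using Git Bash inside VS Code on Windows, Chinese text comes out garbled whether or not your file is UTF-8. After searching through some issues, I found that it displays correctly in PowerShell once the encoding is switched to UTF-8, that is, once the code page is changed to 65001.

The complete source for this project and for later tutorials will live in my GitHub repository: Spiders. If this tutorial helped you, please don't be stingy with your stars. Later tutorials may cover more complex cases, such as analyzing AJAX requests and calling the API directly.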
