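Scraping dynamic web pages with a Java crawler (scraping Dongchedi (懂车帝); technical research only, no data content involved, proceed with caution)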

Published by 优采云: 2022-01-01 09:11

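Friendly reminder: this walkthrough of scraping Dongchedi (懂车帝) is technical research only and does not involve the data content itself. Proceed with caution.

This article uses the Accord (雅阁) owners' circle as the example. Address: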

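Analysis

First, look at how the site delivers its content. It is easy to see that the feed pages by scrolling down, and that the page is not reloaded when a new page of content appears.

Press F12 to open the browser's Network panel. Each page of content is delivered to the front end through an XHR request, as shown in the figure below:

After paging several times, you can see that every page is fetched from this same URL, then parsed and displayed.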

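Questions

1. Each request carries three parameters: min_behot_time, max_behot_time, and max_cursor. How are they set on each page turn?

Search for these keywords in the browser's developer tools to locate them. They all appear in community.js, but that file is bundled by webpack, so the original source is not readable.

Even so, searching by keyword and using the browser's built-in JS formatting (pretty-print) feature gives a rough picture of where the three parameters come from, as shown in the figure below:

By now the idea should be clear.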
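As a rough illustration of the paging state, the sketch below follows what the sample code later in this article does: min_behot_time is set once to the current time in seconds, while max_behot_time and max_cursor start at 0 and are then overwritten with the behot_time and cursor values of the last entry read. The class name FeedCursor and the numeric values in main are hypothetical, and this is not the logic recovered from community.js, which the article does not reproduce.

```java
/** Hypothetical helper: tracks the three paging parameters between requests. */
final class FeedCursor {
    private final long minBehotTime;  // set once at the first request, in seconds
    private long maxBehotTime = 0L;   // 0 on the first request
    private long maxCursor = 0L;      // 0 on the first request

    FeedCursor(long nowSeconds) {
        this.minBehotTime = nowSeconds;
    }

    /** Record the values carried by the last entry of the page just read. */
    void advance(long lastBehotTime, long lastCursor) {
        this.maxBehotTime = lastBehotTime;
        this.maxCursor = lastCursor;
    }

    /** Render the three parameters as a query-string fragment. */
    String toQuery() {
        return "min_behot_time=" + minBehotTime
                + "&max_behot_time=" + maxBehotTime
                + "&max_cursor=" + maxCursor;
    }

    public static void main(String[] args) {
        FeedCursor cursor = new FeedCursor(System.currentTimeMillis() / 1000);
        System.out.println(cursor.toQuery());          // first page
        cursor.advance(1640999000L, 1640999000123L);   // made-up values for illustration
        System.out.println(cursor.toQuery());          // next page
    }
}
```

Keeping the three values in one small object makes it harder to update one of them and forget the other when the loop moves to the next page.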

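2. Each page turn returns 20 entries, but fewer than 20 appear on the page, and the count varies from page to page. Why?

To answer that, analyze the returned JSON. Notice that the entries have different type values?

Compare several responses to work out the details. For reference only: type=2328 deserves close attention (the sample code below skips these entries), and type=2312 marks featured posts.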
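To make the "20 returned, fewer shown" effect concrete, here is a minimal sketch of filtering entries by type. It assumes, as the article's sample code does, that entries with type 2328 are skipped and that type 2312 marks a featured post. The class FeedTypeFilter, its Entry type, and the placeholder type value 0 are hypothetical and stand in for the real JSON objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: why a response of N entries yields fewer than N posts. */
final class FeedTypeFilter {
    static final int TYPE_SKIPPED = 2328;   // entries the sample code skips
    static final int TYPE_FEATURED = 2312;  // featured posts

    /** Minimal stand-in for one element of the response's "list" array. */
    static final class Entry {
        final int type;
        final String uniqueId;

        Entry(int type, String uniqueId) {
            this.type = type;
            this.uniqueId = uniqueId;
        }

        boolean isFeatured() {
            return type == TYPE_FEATURED;
        }
    }

    /** Keep every entry except those of the skipped type. */
    static List<Entry> visiblePosts(List<Entry> raw) {
        List<Entry> posts = new ArrayList<>();
        for (Entry entry : raw) {
            if (entry.type != TYPE_SKIPPED) {
                posts.add(entry);
            }
        }
        return posts;
    }

    public static void main(String[] args) {
        // Type 0 is a made-up placeholder for "any other type".
        List<Entry> raw = Arrays.asList(
                new Entry(TYPE_FEATURED, "a"),
                new Entry(TYPE_SKIPPED, "b"),
                new Entry(0, "c"));
        for (Entry post : visiblePosts(raw)) {
            System.out.println(post.uniqueId + " featured=" + post.isFeatured());
        }
    }
}
```

Because the number of skipped entries differs from one response to the next, the number of posts left after filtering varies as well.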

Code snippet

Maven dependencies: jsoup, fastjson

Java sample code snippet

import java.nio.charset.StandardCharsets;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.springframework.util.DigestUtils;

// The statements below belong inside a method. motor_id, sector, headers(), logger,
// TimeTools and dcdapp.carTopicService come from the author's project and are not shown here.
long min_behot_time = 0L;
long max_behot_time = 0L;
long max_cursor = 0L;
int page = 1;
// Adjust as needed: for example, break out of the loop once page > 10.
while (true) {
    boolean flag = false; // set when a post that is already stored shows up
    if (min_behot_time == 0) {
        min_behot_time = System.currentTimeMillis() / 1000;
    }
    String url = "https://www.dcdapp.com/motor/discuss_ugc/cheyou_feed_list_v3/v1/?motor_id=" + motor_id + "&min_behot_time=" + min_behot_time + "&max_behot_time=" + max_behot_time + "&max_cursor=" + max_cursor + "&channel=m_web&device_platform=wap&category=dongtai&cmg_flag=dongtai&web_id=0&device_id=0&impression_info=%7B%22page_id%22%3A%22page_forum_home%22%2C%22product_name%22%3A%22pc%22%7D&tt_from=load_more&_t=" + System.currentTimeMillis();
    try {
        Connection.Response response = Jsoup.connect(url).headers(headers())
                .timeout(50000)
                .ignoreHttpErrors(true).ignoreContentType(true)
                .execute();
        String json = response.body();
        JSONObject jsonObject = JSON.parseObject(json);
        JSONObject dataObj = jsonObject.getJSONObject("data");
        if (dataObj != null) {
            JSONArray listArray = dataObj.getJSONArray("list");
            if (listArray != null) {
                for (int i = 0; i < listArray.size(); i++) {
                    JSONObject entry = listArray.getJSONObject(i);
                    int type = entry.getIntValue("type");
                    if (2328 == type) continue;    // skipped entries
                    boolean prime = 2312 == type;  // featured post
                    JSONObject items = entry.getJSONObject("info");
                    if (items == null) continue;   // nothing to read from this entry
                    String uniqueIdStr = entry.getString("unique_id_str");
                    String link = "https://www.dcdapp.com/ugc/article/" + uniqueIdStr;
                    if (dcdapp.carTopicService.isExsit(DigestUtils.md5DigestAsHex(link.getBytes(StandardCharsets.UTF_8)))) {
                        flag = true;
                    }
                    // Drop line breaks and characters outside the Basic Multilingual Plane (e.g. emoji).
                    String title = items.getString("title").replaceAll("[\r\n]+", "").replaceAll("[^\\u0000-\\uFFFF]", "");
                    if (title.length() > 100) {
                        title = title.substring(0, 100) + "……";
                    }
                    // Read count
                    int hit = items.getIntValue("read_count");
                    // Comment count
                    int reply = items.getIntValue("comment_count");
                    String author = items.getJSONObject("user_info").getString("name").replaceAll("[^\\u0000-\\uFFFF]", "");
                    String displayTime = TimeTools.timeFormat(items.getLongValue("display_time") * 1000, "");
                    boolean image = items.getJSONArray("image_list") != null && !items.getJSONArray("image_list").isEmpty();
                    // These are the parameters from question 1: the last entry read sets the next request.
                    max_behot_time = items.getLongValue("behot_time");
                    max_cursor = items.getLongValue("cursor");
                    logger.warn(title + "\t" + author + "\t" + hit + "\t" + displayTime + "\t" + uniqueIdStr + "\t" + image);
                }
            } else {
                break;
            }
        } else {
            break;
        }
        Thread.sleep(5000L); // pause between pages
    } catch (Exception ex) {
        logger.error("dcd error:", ex);
        break; // stop instead of retrying the same URL in a tight loop
    }
    logger.info("Finished scraping " + sector + ", page " + page);
    page++;
    if (flag) break;
}

With the code above you can retrieve the data.
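The loop above stops once it meets a post whose link is already known, identified by the MD5 hex digest of the link. The following is a minimal, JDK-only sketch of that idea. It assumes an in-memory set in place of the author's carTopicService, which is not shown in the article. The class SeenLinks and its methods are hypothetical, and unlike the isExsit call in the sample code, seenBefore also records the link.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: remember links by MD5 digest to detect already-scraped posts. */
final class SeenLinks {
    private final Set<String> digests = new HashSet<>();

    /** Lower-case hex MD5 of the UTF-8 bytes of the text. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform implementation.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Returns true if the link was recorded earlier; otherwise records it and returns false. */
    boolean seenBefore(String link) {
        return !digests.add(md5Hex(link));
    }

    public static void main(String[] args) {
        SeenLinks seen = new SeenLinks();
        String link = "https://www.dcdapp.com/ugc/article/" + "123"; // made-up id
        System.out.println(seen.seenBefore(link)); // first sighting
        System.out.println(seen.seenBefore(link)); // repeat sighting
    }
}
```

Storing a fixed-length digest rather than the full link keeps the lookup key uniform, which is presumably why the sample code hashes the link before checking it.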
