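Scraping dynamic web pages with a Java crawler (simulating the requests seen in Chrome's Network panel, and using the page source returned after login)

优采云 Published: 2022-01-10 11:11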

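Java Crawler (4): Using Jsoup to fetch content that requires login (login without a CAPTCHA)

1. How it works

After logging in, analyze the response to work out where the data actually lives. The code from the previous article gives us not only the cookies but also the source of the page returned after login. From there, three cases are possible: (1) If the data we need is in the source returned after login, parse that source with Jsoup and use its selectors to pick out the information we want. (2) If the data has to be fetched through a link found in that source, parse the source first, locate the URL, then request that URL with the cookies attached. (3) If the data we need is not in the source at all, the returned source cannot help us.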
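
Case (2) could be sketched with the JDK's own java.net.http client (Java 11+) rather than Jsoup. The class name CookieRequestSketch, both helper names, and any URL or cookie name passed in are assumptions made for illustration; only the idea of replaying the login cookies on the follow-up request comes from the text above.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the article's original code.
final class CookieRequestSketch {

    // Join cookies into one Cookie header value: "name1=value1; name2=value2".
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    // Build (but do not send) a GET request for a URL found in the post-login
    // page, attaching the cookies returned by the login response.
    static HttpRequest withCookies(String url, Map<String, String> cookies) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Cookie", toCookieHeader(cookies))
                .GET()
                .build();
    }
}
```

The request can then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). With Jsoup, the equivalent is the Jsoup.connect(url).cookies(cookies) call used later in this article.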

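When I first started writing simulated logins, I assumed the data always had to come from the page source, so I was stumped when a page turned out to be built from a pile of JavaScript. If you want the source of the rendered page, you can try Selenium; I plan to learn it later.

2. Detailed implementation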

package debug;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class test {

    public static final String LOGIN_URL = "http://authserver.tjut.edu.cn/authserver/login";
    public static final String USER_AGENT = "User-Agent";
    public static final String USER_AGENT_VALUE = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0";

    public static void main(String[] args) throws Exception {
        // Page to fetch after logging in. The variant with a fragment is kept for
        // reference; the part after '#' is never sent to the server.
        // String url = "http://ehall.tjut.edu.cn/publicapp/sys/zxzxapp/index.do#/consultingList";
        String url = "http://ehall.tjut.edu.cn/publicapp/sys/zxzxapp/index.do";
        get_html_num(url);
    }

    /**
     * Logs in and returns the session cookies.
     *
     * @param userName user name
     * @param pwd      password
     * @return cookies set by the login response
     * @throws Exception if a request fails or the login form cannot be found
     */
    public static Map<String, String> simulateLogin(String userName, String pwd) throws Exception {
        /*
         * First request: fetch the login page to get the form fields, then fill in
         * the data to submit (username, password).
         */
        Connection con = Jsoup.connect(LOGIN_URL); // open the connection
        con.header(USER_AGENT, USER_AGENT_VALUE); // identify as a browser
        Response rs = con.execute(); // rs.cookies() is sent back with the POST below
        Document d1 = Jsoup.parse(rs.body()); // parse the returned HTML into a DOM tree
        // The login form; its id can be found by viewing the page source
        Elements eleList = d1.select("#casLoginForm");
        if (eleList.isEmpty()) {
            throw new IllegalStateException("Login form #casLoginForm not found");
        }

        // Build a map of every parameter found in the form and its value
        Map<String, String> datas = new HashMap<>();
        for (Element e : eleList.get(0).getAllElements()) {
            // Note 2: match the field names exactly. "username" and "password" must be
            // the name attributes of the inputs on your own login page.
            if (e.attr("name").equals("username")) {
                e.attr("value", userName);
            }
            // Set the password
            if (e.attr("name").equals("password")) {
                e.attr("value", pwd);
            }
            // Skip elements that have no name attribute
            if (e.attr("name").length() > 0) {
                datas.put(e.attr("name"), e.attr("value"));
            }
        }

        /*
         * Second request: POST the form data together with the cookies.
         */
        Connection con2 = Jsoup.connect(LOGIN_URL);
        con2.header(USER_AGENT, USER_AGENT_VALUE);
        // Attach the cookies and POST the map built above
        Response login = con2.ignoreContentType(true).followRedirects(true).method(Method.POST).data(datas)
                .cookies(rs.cookies()).execute();
        // If this throws org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500,
        // the cause is described in Note 2 above.

        // Print the page returned after a successful login
        // System.out.println(login.body());

        // Cookies from the successful login. They can be saved locally so that later
        // runs reuse them instead of logging in again.
        Map<String, String> map = login.cookies();
        // for (String s : map.keySet()) {
        //     System.out.println(s + " : " + map.get(s));
        // }
        return map;
    }

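    // Return the text between the first occurrence of strstrat and the next
    // occurrence of strend after it, or "" if either marker is missing.
    public static String findstr(String str1, String strstrat, String strend) {
        int strStartIndex = str1.indexOf(strstrat);
        if (strStartIndex < 0) {
            return "";
        }
        int contentStart = strStartIndex + strstrat.length();
        // Search for the end marker only after the start marker
        int strEndIndex = str1.indexOf(strend, contentStart);
        if (strEndIndex < 0) {
            return "";
        }
        return str1.substring(contentStart, strEndIndex);
    }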

    // Step one: fetch the complete page content with the crawler
    public static void get_html_num(String url) throws Exception {
        try {
            // The real password is redacted in the original article
            Map<String, String> cookies = simulateLogin("203128301", "<password>");
            // Document doc = Jsoup.connect(url).get();
            Document doc = Jsoup.connect(url).cookies(cookies).post();
            // Get the element whose id is consultingListDetail
            Element ele = doc.getElementById("consultingListDetail");
            // Extract the details inside it: the text between "共" and "条" is the total count
            // Elements tag = ele.getElementsByTag("td");
            // for (Element e : tag) {
            //     String title = e.getElementsByTag("td").text();
            //     String Totals = findstr(title, "共", "条");
            //     System.out.println(Totals);
            // }
            System.out.println(doc);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
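
A comment in simulateLogin notes that the cookies from a successful login can be saved locally and reused. The following is a minimal sketch of one way to do that with java.util.Properties; the class name CookieStoreSketch, the method names, and the choice of a .properties file are assumptions, not something the original code does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not part of the article's original code.
final class CookieStoreSketch {

    // Write the cookies to a .properties file so a later run can reuse them.
    static void save(Map<String, String> cookies, Path file) {
        Properties props = new Properties();
        props.putAll(cookies);
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "session cookies");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the cookies back; returns an empty map if the file does not exist.
    static Map<String, String> load(Path file) {
        Map<String, String> cookies = new HashMap<>();
        if (!Files.exists(file)) {
            return cookies;
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String name : props.stringPropertyNames()) {
            cookies.put(name, props.getProperty(name));
        }
        return cookies;
    }
}
```

A caller might try load(...) first and fall back to simulateLogin only when the map is empty or the server rejects the saved session. Session cookies usually expire, so the saved file should be treated as a cache rather than a permanent credential.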

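3. The current problem

The target page loads its content dynamically through AJAX, so Jsoup alone cannot retrieve the target information.

What is AJAX

AJAX (Asynchronous JavaScript And XML) lets a page update asynchronously by exchanging small amounts of data with the server in the background, so parts of a page can be updated without reloading the whole page. A traditional page (one that does not use AJAX) must reload entirely whenever its content changes. The name mentions XML because XML was the traditional format for the transferred data; in practice, AJAX exchanges are almost always JSON. Data loaded through AJAX is rendered into the page by JavaScript, so it does not appear under right-click -> View Page Source; there you see only the HTML that the URL itself returned.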

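Solutions:

① Analyze the interface that the AJAX call hits directly, then request that interface from code.

② Use Selenium to simulate clicks in a real browser.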
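
Solution ① might look roughly like the sketch below, which uses only the JDK (Java 11+). The class name AjaxEndpointSketch, the endpoint URL, the JSON field name, and the X-Requested-With header are assumptions; the real values have to be read from the request shown in the browser's Network panel. The regex extraction is a deliberately crude stand-in for a real JSON library.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the article's original code.
final class AjaxEndpointSketch {

    // Build (but do not send) a request that mimics the browser's AJAX call.
    // cookieHeader is the "name=value; name2=value2" string from the login step.
    static HttpRequest buildRequest(String endpointUrl, String cookieHeader) {
        return HttpRequest.newBuilder(URI.create(endpointUrl))
                .header("Cookie", cookieHeader)
                .header("Accept", "application/json")
                // Assumption: some servers expect this conventional AJAX marker
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    // Pull one integer field (at most 9 digits) out of a JSON body by name.
    // Returns empty if the field is absent or is not a plain integer.
    static OptionalInt extractInt(String json, String field) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(field) + "\"\\s*:\\s*(-?\\d{1,9})\\b");
        Matcher m = p.matcher(json);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }
}
```

The request can be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), and the body passed to extractInt. If the Network panel shows the call as a POST with form data, the builder's POST(...) method would replace GET() here.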

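For the implementation, see the following two articles:

Java Crawler (5): Using Selenium to simulate clicks and get dynamic page content

Java Crawler (6): Parsing AJAX interfaces to get dynamic page content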
