Scraping dynamic web pages with HtmlUnit (adding request parameters and request headers to GET and POST requests with HttpClient)

Published by 优采云: 2021-11-09 19:04


The drawback of the HttpClient approach is that you have to find the URL of the POST request and its parameters by hand.

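References:

1. Adding request parameters and request headers to GET and POST requests (HttpClient, Java)

2. Scraping content loaded by JavaScript (follow the workflow in that blog post, e.g. how to find the actual request URL)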

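Taking one news article as an example:

1. Press F12 to open the developer tools, find the page itself in the Network request list, and double-click it to open the details. The "Body" tab shows the HTML that was returned; the article text displayed in the browser is not in it, which means the text is loaded afterwards.

2. Search the captured responses for a keyword to see which request actually returns the data. For example, searching for the first words of the article, "海关总署" (General Administration of Customs), may match several files, so you have to inspect them and pick the right one.

The request body of the matching request is "id:98212"; this is the request parameter.

3. Check the "Headers" tab

The important parts are the request URL and the request method; sometimes the User-Agent also has to be set. This request has to be sent with POST.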
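Before moving on to the HttpClient code, it may help to see what the "id:98212" body looks like on the wire. The following is a minimal sketch, assuming the server expects an ordinary URL-encoded form body. It uses only the JDK (Java 10 or later) rather than Apache HttpClient, and the class and method names (`FormBodySketch`, `encode`) are hypothetical, introduced purely for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FormBodySketch {

    /** Joins the parameters as name=value pairs separated by '&', URL-encoding each part. */
    static String encode(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order
        Map<String, String> params = new LinkedHashMap<>();
        params.put("id", "98212"); // the parameter found in the browser's developer tools
        System.out.println(encode(params)); // prints id=98212
    }
}
```

In Apache HttpClient the same encoding is handled by `UrlEncodedFormEntity`, so this step rarely has to be written by hand; the sketch only makes the format visible.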

4. Write the code

Create a Java Maven project and add the dependencies:

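<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.6</version>
</dependency>
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.2.4</version>
</dependency>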

The downloaded jars are shown in the screenshot.

The code is below. It only extracts the source and the body of the article:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import com.google.gson.Gson;

/**
 * http://news.cqcoal.com/blank/nc.jsp?mid=98212
 * The news content on this page is generated dynamically; the goal is to fetch it.
 * @author yangc_cong
 */
public class TestNewContent {

    /**
     * Sends a POST request to the given URL and returns the parsed response.
     * @param urlStr the request URL
     * @return the JSON response parsed into a Map, or null if the request failed
     */
    private Map<?, ?> getPageContByHttpCl(String urlStr) {
        HttpPost post = new HttpPost(urlStr);
        String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362";
        post.setHeader("User-Agent", userAgent);

        // Build the request parameters and send them as a URL-encoded form body
        List<NameValuePair> params = new ArrayList<>();
        params.add(new BasicNameValuePair("id", "98212"));
        post.setEntity(new UrlEncodedFormEntity(params, StandardCharsets.UTF_8));

        String result = null;
        // try-with-resources closes the response and the client, even when execute() fails
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(post)) {
            result = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(result);

        // Gson returns null when the input string is null
        return new Gson().fromJson(result, Map.class);
    }

    private void parse_content(Map<?, ?> map) {
        if (map == null) {
            System.out.println("No response to parse");
            return;
        }
        // "rows" is a JSON array, so Gson maps it to a List. Casting it to Map fails with
        // java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.util.Map
        List<?> rows = (List<?>) map.get("rows");
        Map<?, ?> innerMap = (Map<?, ?>) rows.get(0);
        String source = (String) innerMap.get("source");
        String bodyhtml = (String) innerMap.get("body");
        System.out.println("source: " + source);
        System.out.println("bodyhtml:" + '\n' + bodyhtml);
    }

    public static void main(String[] args) {
        TestNewContent test1 = new TestNewContent();
        String urlStr = "http://news.cqcoal.com/manage/newsaction.do?method:getNewsArchives";
        Map<?, ?> map = test1.getPageContByHttpCl(urlStr);
        test1.parse_content(map);
    }
}
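The ClassCastException noted in `parse_content` comes from guessing the shape of the parsed JSON wrongly. As a hedged illustration, the sketch below walks the same shape (`rows` is a list whose first element is a map) with `instanceof` checks instead of blind casts. It uses only the JDK (Java 9 or later for `Map.of`); the class name `RowsSketch`, the method `firstRowField`, and the sample values are hypothetical and not taken from the real response.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class RowsSketch {

    /** Returns a string field of the first element of "rows", or empty if the shape differs. */
    static Optional<String> firstRowField(Map<?, ?> root, String field) {
        if (root == null) {
            return Optional.empty();
        }
        Object rows = root.get("rows");
        if (!(rows instanceof List) || ((List<?>) rows).isEmpty()) {
            return Optional.empty(); // "rows" is missing, not an array, or empty
        }
        Object first = ((List<?>) rows).get(0);
        if (!(first instanceof Map)) {
            return Optional.empty(); // the first row is not a JSON object
        }
        Object value = ((Map<?, ?>) first).get(field);
        return value instanceof String ? Optional.of((String) value) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hand-built stand-in for what Gson produces from {"rows":[{"source":...,"body":...}]}
        Map<String, Object> row = Map.of("source", "example source", "body", "<p>example</p>");
        Map<String, Object> root = Map.of("rows", List.of(row));
        System.out.println(firstRowField(root, "source").orElse("(missing)"));
    }
}
```

With this helper a changed or empty response yields an empty `Optional` rather than an exception, which is usually easier to handle in a scraper that runs unattended.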

[Screenshot of the program output]
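The example above covers POST only. For completeness, here is a rough sketch of the GET case, where the parameter travels in the query string and the header is set the same way. To keep it runnable without extra dependencies it swaps in the JDK's own `java.net.http` API (Java 11 or later) instead of Apache HttpClient; the class and method names (`GetRequestSketch`, `buildGet`) are hypothetical. The sketch only builds the request and does not send it.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class GetRequestSketch {

    /** Builds a GET request with one URL-encoded query parameter and a User-Agent header. */
    static HttpRequest buildGet(String baseUrl, String name, String value, String userAgent) {
        String query = URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(baseUrl + "?" + query))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // The page URL from the article, split into base URL and query parameter
        HttpRequest request = buildGet(
                "http://news.cqcoal.com/blank/nc.jsp", "mid", "98212", "Mozilla/5.0");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().firstValue("User-Agent").orElse(""));
    }
}
```

With Apache HttpClient 4.x the same idea is normally expressed with `URIBuilder` to add the query parameters and `HttpGet` with `setHeader` for the headers.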

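Supplement: using HtmlUnit to scrape the dynamically loaded body of a page (a simple application)

Reference: HtmlUnit + Jsoup study notes

1. Configuration in the Maven project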

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.27</version>
</dependency>

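HtmlUnit pulls in a large number of jars (shown in the screenshot), so managing them with Maven is recommended.

2. The code (written by following the referenced blog post)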

import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitTest {

    public static void main(String[] args) {
        String url = "http://news.cqcoal.com/blank/nc.jsp?mid=98212";

        // 1. Create the WebClient; try-with-resources closes it at the end
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // 2. Enable JavaScript
            webClient.getOptions().setJavaScriptEnabled(true);
            // 3. Disable CSS to avoid extra requests for stylesheets that are only needed for rendering
            webClient.getOptions().setCssEnabled(false);
            // 4. Follow redirects
            webClient.getOptions().setRedirectEnabled(true);
            // 5. Do not throw an exception when a script on the page fails
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            // 6. Set the timeout in milliseconds
            webClient.getOptions().setTimeout(50000);

            // Fetch the page
            HtmlPage htmlPage = webClient.getPage(url);

            // Wait for background JavaScript to finish updating the DOM
            webClient.waitForBackgroundJavaScript(10000);

            // Page content, including tags
            String pageHtml = htmlPage.asXml();
            System.out.println(pageHtml);
            System.out.println("\n------\n");

            // Page content as plain text
            String pageText = htmlPage.asText();
            System.out.println(pageText);

            // Page title
            String title = htmlPage.getTitleText();
            System.out.println(title);
        } catch (FailingHttpStatusCodeException | IOException e) {
            // Without a page there is nothing to print, so report the error and stop
            e.printStackTrace();
        }
    }
}

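3. Results (after the JavaScript has loaded, the code prints the page content including tags, then the plain text of the page, then the page title, i.e. the value of the title element. Only part of the plain-text output is shown here.)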
