Scraping Data from Web Pages with Java (almost everything already exists on the Internet; the problem is organizing it into what you need)

优采云 Published: 2021-10-08 21:36

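Almost any information you want already exists somewhere on the Internet. The problem is how to organize it into the form you need: for example, scraping the names, websites, phone numbers, and email addresses of every company in an industry, then saving them to Excel for analysis. This is what makes web scraping increasingly useful.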
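As a rough illustration of the "save it to Excel" step, the sketch below turns already-scraped records into CSV lines, a plain-text format that Excel can open. It is not part of the original article: the class name `CsvExport`, its methods, and the sample company record are all hypothetical, and it assumes the fields have already been extracted from the page.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvExport {

    // Quotes a field when it contains a comma, a double quote, or a line break,
    // and doubles any embedded double quotes (the usual CSV convention).
    public static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Joins the fields of one record into a single CSV line.
    public static String toCsvLine(List<String> fields) {
        List<String> escaped = new ArrayList<>();
        for (String f : fields) {
            escaped.add(escape(f));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) {
        // Hypothetical sample record, for illustration only.
        System.out.println(toCsvLine(List.of("Name", "Website", "Phone", "Email")));
        System.out.println(toCsvLine(List.of(
                "Example Co., Ltd.", "http://example.com", "555-0100", "info@example.com")));
    }
}
```

Writing each line to a `.csv` file is enough for a quick analysis; a dedicated spreadsheet library would only be needed for native Excel features such as formulas or multiple sheets.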

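With a traditional web page, the server returns the HTML directly. Such pages are easy to scrape: whatever method you use, you only need to fetch the HTML page and then parse the DOM. Pages generated by JavaScript are not so easy. Zhang Yu (the author) has not yet found a good solution to that problem; readers with experience scraping JavaScript-generated pages are welcome to share advice.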
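The "fetch the HTML, then parse the DOM" approach can be sketched with nothing but the JDK's built-in XML parser. This is only an illustration under a strong assumption: the page is well-formed XHTML. Real-world HTML is often not well-formed, in which case a lenient HTML parser would be needed instead. The class name `LinkExtractor` and the sample markup are hypothetical, and fetching the page is assumed to have happened already.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LinkExtractor {

    // Parses a well-formed XHTML string into a DOM tree and returns the
    // href value of every <a> element that has one, in document order.
    public static List<String> extractLinks(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> links = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                if (a.hasAttribute("href")) {
                    links.add(a.getAttribute("href"));
                }
            }
            return links;
        } catch (Exception e) {
            // Markup that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not parse page as XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page content, standing in for a fetched HTML response.
        String page = "<html><body><div id=\"u\">"
                + "<a href=\"/news\">News</a><a href=\"/maps\">Maps</a>"
                + "</div></body></html>";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

Walking the tree by tag name works, but it is noticeably more verbose than a CSS-style selector such as `#u a`.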

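So today's topic is scraping information from traditional HTML pages. As I said above, this is not technically difficult, but is there a simpler way to do it? Anyone who has used a JavaScript framework such as jQuery may feel that JavaScript is a natural helper for scraping page information, since it was born to work with web pages. Of course, it has many more applications nowadays, such as server-side JavaScript with NodeJs.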

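Being able to use jQuery to scrape web pages from inside our own application, a Java program for instance, would be exciting. Ready-made pieces for this do exist: a JavaScript engine, and an environment that can support jQuery.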

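Tools: Java, Rhino, and envJs. Rhino is the open-source JavaScript engine provided by Mozilla, and envJs simulates a browser environment, such as the Window object. The code is as follows: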

package stony.zhang.scrape;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;

import org.mozilla.javascript.Context;
import org.mozilla.javascript.ContextFactory;
import org.mozilla.javascript.Scriptable;
import org.mozilla.javascript.ScriptableObject;

/**
 * @author MyBeautiful
 * @Email: zhangyu0182@sina.com
 * @date Mar 7, 2012
 */
public class RhinoScaper {
    private String url;
    private String jsFile;
    private Context cx;
    private Scriptable scope;

    public String getUrl() {
        return url;
    }

    public String getJsFile() {
        return jsFile;
    }

    // Also exposes the URL to scripts as the global variable "url".
    // init() must be called first, because it creates the scope.
    public void setUrl(String url) {
        this.url = url;
        putObject("url", url);
    }

    public void setJsFile(String jsFile) {
        this.jsFile = jsFile;
    }

    public void init() {
        cx = ContextFactory.getGlobal().enterContext();
        scope = cx.initStandardObjects(null);
        // -1 selects interpreted mode (no bytecode compilation).
        cx.setOptimizationLevel(-1);
        cx.setLanguageVersion(Context.VERSION_1_5);
        // Load the simulated browser environment first, then jQuery on top of it.
        String[] files = { "./lib/env.rhino.1.2.js", "./lib/jquery.js" };
        for (String f : files) {
            evaluateJs(f);
        }
        // ExtendUtil is not listed in this article. Since the object is created
        // with newObject(scope, "util"), it is presumably a ScriptableObject
        // subclass whose getClassName() returns "util"; test.js calls its log().
        try {
            ScriptableObject.defineClass(scope, ExtendUtil.class);
        } catch (IllegalAccessException e1) {
            e1.printStackTrace();
        } catch (InstantiationException e1) {
            e1.printStackTrace();
        } catch (InvocationTargetException e1) {
            e1.printStackTrace();
        }
        ExtendUtil util = (ExtendUtil) cx.newObject(scope, "util");
        scope.put("util", scope, util);
    }

    // Evaluates a JavaScript file in the shared scope.
    protected void evaluateJs(String f) {
        // try-with-resources closes the reader even if evaluation fails.
        try (FileReader in = new FileReader(f)) {
            cx.evaluateReader(scope, in, f, 1, null);
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }

    // Makes a Java object visible to scripts under the given global name.
    public void putObject(String name, Object o) {
        scope.put(name, scope, o);
    }

    public void run() {
        evaluateJs(this.jsFile);
    }
}

Test code:

package stony.zhang.scrape;

import java.util.HashMap;
import java.util.Map;

import junit.framework.TestCase;

public class RhinoScaperTest extends TestCase {
    public RhinoScaperTest(String name) {
        super(name);
    }

    public void testRun() {
        RhinoScaper rs = new RhinoScaper();
        rs.init();
        rs.setUrl("http://www.baidu.com");
        rs.setJsFile("test.js");
        // Map o = new HashMap();
        // rs.putObject("result", o);
        rs.run();
        // System.out.println(o.get("imgurl"));
    }
}

The test.js file is as follows:

$.ajax({
    url: "http://www.baidu.com",
    context: document.body,
    success: function(data) {
        // util.log(data);
        var result = parseHtml(data);
        var $v = jQuery(result);
        // util.log(result);
        $v.find('#u a').each(function(index) {
            util.log(index + ': ' + $(this).attr("href"));
            // arr.add($(this).attr("href"));
        });
    }
});

function parseHtml(html) {
    // Create an iFrame object that will be used to render the HTML in order to get the DOM objects
    // created - this is a far quicker way of achieving the HTML to DOM conversion than trying
    // to transform the HTML objects one-by-one
    var oIframe = document.createElement('iframe');
    // Hide the iFrame from view
    oIframe.style.display = 'none';
    if (document.body)
        document.body.appendChild(oIframe);
    else
        document.documentElement.appendChild(oIframe);
    // Open the iFrame DOM object and write in our HTML
    oIframe.contentDocument.open();
    oIframe.contentDocument.write(html);
    oIframe.contentDocument.close();
    // Return the document body object containing the HTML that was just
    // added to the iFrame as DOM objects
    var oBody = oIframe.contentDocument.body;
    // TODO: Remove the iFrame object created to cleanup the DOM
    return oBody;
}

When we run the unit test, the three Baidu links scraped from the page are printed to the console.

0:

1:

2:

The test succeeds, which shows that using jQuery to scrape web pages from a Java program is feasible.

----------------------------------------------------------------------

Zhang Yu (MyBeautiful)
