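Scraping web page content with jQuery (almost any information already exists on the internet; the problem is how to organize it into what you need)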

优采云 Published: 2022-01-16 07:16

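Almost any information you want already exists on the web; the problem is organizing it into the form you need. For example, you might collect the name, website, phone number, and email address of every company in an industry and save them to Excel for analysis. Web scraping is becoming more and more useful.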
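As a rough illustration of the "save to Excel" step, the sketch below turns scraped records into CSV text, which Excel can open directly. The class name `CsvExport`, its methods, and the sample company data are hypothetical and not from the original article; the quoting follows the common RFC 4180 convention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the original article): serializes
// scraped records as CSV so they can be opened in Excel.
class CsvExport {

    // Wraps a field in quotes when it contains a comma, quote, or line break,
    // doubling any embedded quotes.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins one record (name, website, phone, email, ...) into a CSV line.
    static String toCsvLine(List<String> fields) {
        List<String> escaped = new ArrayList<>();
        for (String f : fields) {
            escaped.add(escape(f));
        }
        return String.join(",", escaped);
    }

    // Serializes all records, one CRLF-terminated line per record.
    static String toCsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            sb.append(toCsvLine(row)).append("\r\n");
        }
        return sb.toString();
    }
}
```

The resulting string could then be written to a file such as `companies.csv` (an assumed name) with `java.nio.file.Files.writeString`.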

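For a traditional web page, the web server returns the HTML directly. Pages of this kind are easy to scrape: whatever method you use, you only need to fetch the HTML and then parse the DOM. Pages whose content is generated by JavaScript are much harder. Zhang Yu (the author) has not yet found a good solution to that problem, and advice from anyone with experience scraping JavaScript-driven pages is welcome.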
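To make the "fetch the HTML, then parse it" idea concrete, here is a minimal sketch that extracts link targets from an HTML string using only the JDK. It relies on the Swing HTML parser (`ParserDelegator`), which is callback-based rather than a full DOM, so it is a stand-in for the DOM parsing described above and not the approach the article goes on to use. The class name `LinkExtractor` is an assumption made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of the original article).
class LinkExtractor {

    // Returns the href value of every <a> element, in document order.
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }
}
```

In practice the HTML string would come from an HTTP request; the parsing step stays the same regardless of how the page was downloaded.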

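So today's topic is scraping information from traditional HTML pages. As noted above, this poses no technical difficulty, but is there a relatively simple way to do it? Anyone who has used a JS library such as jQuery may feel that JavaScript is a natural fit for scraping, as if it were made for parsing web pages. Of course, JavaScript now has many other uses as well, such as server-side applications with Node.js.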

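It would be exciting to be able to use jQuery for scraping inside our own applications, for example in a Java program. A ready-made solution does exist: a JavaScript engine together with an environment in which jQuery can run.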

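Tools: Java, Rhino, and envJs. Rhino is the open-source JavaScript engine from Mozilla, and envJs simulates a browser environment, such as the window object. The code is as follows: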

package stony.zhang.scrape;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;

import org.mozilla.javascript.Context;
import org.mozilla.javascript.ContextFactory;
import org.mozilla.javascript.Scriptable;
import org.mozilla.javascript.ScriptableObject;

/**
 * @author MyBeautiful
 * @Email: zhangyu0182@sina.com
 * @date Mar 7, 2012
 */
public class RhinoScaper {
    private String url;
    private String jsFile;
    private Context cx;
    private Scriptable scope;

    public String getUrl() {
        return url;
    }

    public String getJsFile() {
        return jsFile;
    }

    // Call after init(): the URL is also published to the JavaScript scope.
    public void setUrl(String url) {
        this.url = url;
        putObject("url", url);
    }

    public void setJsFile(String jsFile) {
        this.jsFile = jsFile;
    }

    public void init() {
        cx = ContextFactory.getGlobal().enterContext();
        scope = cx.initStandardObjects(null);
        // Interpreted mode (-1), set before env.js and jQuery are evaluated.
        cx.setOptimizationLevel(-1);
        cx.setLanguageVersion(Context.VERSION_1_5);

        // Load the simulated browser environment first, then jQuery.
        String[] file = { "./lib/env.rhino.1.2.js", "./lib/jquery.js" };
        for (String f : file) {
            evaluateJs(f);
        }

        // ExtendUtil is not shown in this article. It must be a ScriptableObject
        // subclass registered under the name "util"; test.js calls util.log(...).
        try {
            ScriptableObject.defineClass(scope, ExtendUtil.class);
        } catch (IllegalAccessException e1) {
            e1.printStackTrace();
        } catch (InstantiationException e1) {
            e1.printStackTrace();
        } catch (InvocationTargetException e1) {
            e1.printStackTrace();
        }
        ExtendUtil util = (ExtendUtil) cx.newObject(scope, "util");
        scope.put("util", scope, util);
    }

    // Evaluates a JavaScript file in the shared scope.
    protected void evaluateJs(String f) {
        // try-with-resources closes the reader, which the original listing left open.
        try (FileReader in = new FileReader(f)) {
            cx.evaluateReader(scope, in, f, 1, null);
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }

    // Exposes a Java object to the scripts under the given name.
    public void putObject(String name, Object o) {
        scope.put(name, scope, o);
    }

    public void run() {
        evaluateJs(this.jsFile);
    }
}

Test code:

package stony.zhang.scrape;

import java.util.HashMap;
import java.util.Map;

import junit.framework.TestCase;

public class RhinoScaperTest extends TestCase {

    public RhinoScaperTest(String name) {
        super(name);
    }

    public void testRun() {
        RhinoScaper rs = new RhinoScaper();
        rs.init();
        rs.setUrl("http://www.baidu.com");
        rs.setJsFile("test.js");
        // Map o = new HashMap();
        // rs.putObject("result", o);
        rs.run();
        // System.out.println(o.get("imgurl"));
    }
}

The test.js file is as follows:

$.ajax({
    url: "http://www.baidu.com",
    context: document.body,
    success: function(data) {
        // util.log(data);
        var result = parseHtml(data);
        var $v = jQuery(result);
        // util.log(result);
        $v.find('#u a').each(function(index) {
            util.log(index + ': ' + $(this).attr("href"));
            // arr.add($(this).attr("href"));
        });
    }
});

function parseHtml(html) {
    // Create an iframe object that will be used to render the HTML in order to get the DOM objects
    // created - this is a far quicker way of achieving the HTML to DOM conversion than trying
    // to transform the HTML objects one-by-one
    var oIframe = document.createElement('iframe');

    // Hide the iframe from view
    oIframe.style.display = 'none';
    if (document.body)
        document.body.appendChild(oIframe);
    else
        document.documentElement.appendChild(oIframe);

    // Open the iframe DOM object and write in our HTML
    oIframe.contentDocument.open();
    oIframe.contentDocument.write(html);
    oIframe.contentDocument.close();

    // Return the document body object containing the HTML that was just
    // added to the iframe as DOM objects
    var oBody = oIframe.contentDocument.body;

    // TODO: Remove the iframe object created to clean up the DOM
    return oBody;
}

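When we run the unit test, the three Baidu links scraped from the page are printed to the console:

0:

1:

2:

The test passes, which shows that scraping web pages with jQuery inside a Java program is feasible.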

----------------------------------------

Zhang Yu, MyBeautiful
