Scraping web page content with JS (downloading a content extractor for a specified web page)

Published by 优采云: 2021-09-21 21:20

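1. Introduction

This article shows how to download a content extractor through the GooSeeker API using Java and JavaScript. It is one of a series of sample programs. What is a content extractor, and why obtain one this way? The idea comes from the open-source Python instant web crawler project: generating content extractors automatically saves programmers a great deal of time. For details, see "Definition of the content extractor".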

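2. Downloading the content extractor with Java

This is one of a series of sample programs. Judging by how programming languages are developing, Java is a poor fit for web content extraction: the language is rigid and inconvenient, the ecosystem is not very active, and the choice of libraries grows slowly. Extracting content from JavaScript-driven dynamic pages is also awkward in Java, because a JavaScript engine is required. If you want to download the content extractor with JavaScript, skip directly to Section 3.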

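Implementation

Notes: jsoup fetches the page and parses it into a DOM; the GooSeeker API returns the content extractor as an XSLT stylesheet; the JDK's TransformerFactory applies that stylesheet to the DOM to produce the result XML.

The source code is as follows: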

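import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.ProtocolException;
import java.net.URL;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

// The class name is a placeholder; the original listing shows only the methods.
public class GsExtractorDemo
{
    public static void main(String[] args)
    {
        InputStream xslt = null;
        try
        {
            String grabUrl = "http://m.58.com/cs/qiuzu/22613961050143x.shtml"; // URL to scrape
            String resultPath = "F:/temp/xslt/result.xml"; // path where the result file is saved
            // Get the XSLT through the GooSeeker API
            xslt = getGsExtractor();
            // Scrape the page and write the converted result file
            convertXml(grabUrl, xslt, resultPath);
        } catch (Exception e)
        {
            e.printStackTrace();
        } finally
        {
            try
            {
                if (xslt != null)
                    xslt.close();
            } catch (IOException e)
            {
                e.printStackTrace();
            }
        }
    }

    /**
     * @description DOM conversion: applies the XSLT to the scraped page
     */
    public static void convertXml(String grabUrl, InputStream xslt, String resultPath) throws Exception
    {
        // try-with-resources closes the page stream and the output file even if the transform fails
        try (InputStream page = new URL(grabUrl).openStream();
             OutputStream out = new FileOutputStream(resultPath))
        {
            // doc is a jsoup Document
            org.jsoup.nodes.Document doc = Jsoup.parse(page, "UTF-8", grabUrl);
            W3CDom w3cDom = new W3CDom();
            // w3cDoc is a W3C Document
            org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(doc);
            Source srcSource = new DOMSource(w3cDoc);
            TransformerFactory tFactory = TransformerFactory.newInstance();
            Transformer transformer = tFactory.newTransformer(new StreamSource(xslt));
            transformer.transform(srcSource, new StreamResult(out));
        }
    }

    /**
     * @description fetches the API response (the XSLT); returns null if the request fails
     */
    public static InputStream getGsExtractor()
    {
        // API endpoint
        String apiUrl = "http://www.gooseeker.com/api/getextractor";
        // Request parameters
        Map<String, String> params = new HashMap<String, String>();
        params.put("key", "xxx"); // API KEY applied for in the GooSeeker member center
        params.put("theme", "xxx"); // extractor name, i.e. the rule name defined with MS 谋数台
        params.put("middle", "xxx"); // rule number; required if several rules are defined under the same rule name
        params.put("bname", "xxx"); // bin name; required if the rule contains several bins
        String httpArg = urlparam(params);
        apiUrl = apiUrl + "?" + httpArg;
        InputStream is = null;
        try
        {
            URL url = new URL(apiUrl);
            HttpURLConnection urlCon = (HttpURLConnection) url.openConnection();
            urlCon.setRequestMethod("GET");
            is = urlCon.getInputStream();
        } catch (ProtocolException e)
        {
            e.printStackTrace();
        } catch (IOException e)
        {
            e.printStackTrace();
        }
        return is;
    }

    /**
     * @description builds the URL-encoded request parameter string
     */
    public static String urlparam(Map<String, String> data)
    {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : data.entrySet())
        {
            try
            {
                sb.append(entry.getKey()).append("=").append(URLEncoder.encode(entry.getValue() + "", "UTF-8")).append("&");
            } catch (UnsupportedEncodingException e)
            {
                e.printStackTrace();
            }
        }
        return sb.toString();
    }
}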
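The core of convertXml is the XSLT transformation itself. As a minimal sketch that runs without network access or jsoup, the following applies a made-up stylesheet to an in-memory document using only JDK classes. The class name XsltSketch, the stylesheet, and the sample markup are illustrative assumptions; a real extractor returned by the GooSeeker API will look different.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class for illustration; not part of the original listing.
final class XsltSketch {

    // Made-up stylesheet: copies the text of every <h1> into a <title> element.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/\">"
      + "<result><xsl:for-each select=\"//h1\">"
      + "<title><xsl:value-of select=\".\"/></title>"
      + "</xsl:for-each></result>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet text to the XML text and returns the result as a string.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Sample markup standing in for a scraped page (must be well-formed XML here).
        String page = "<html><body><h1>Room wanted</h1><p>details</p></body></html>";
        System.out.println(transform(page, XSLT));
    }
}
```

The sketch feeds the transformer a StreamSource built from a string, whereas the listing above passes a DOMSource built from the jsoup document; the Transformer call is the same in both cases.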

The returned result is as follows:

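3. Downloading the content extractor with JavaScript

Note that if the JavaScript code in this example runs inside an ordinary web page, cross-origin restrictions prevent it from scraping pages from other origins. It therefore has to run in a privileged JavaScript environment, such as a browser extension, a self-developed browser, or a JavaScript engine embedded in your own program.

To keep the experiment simple, this example still runs inside a web page. To avoid the cross-origin problem, the target page is saved locally, edited, and the JavaScript is inserted into it. These manual steps are for experimentation only; production use calls for a different approach.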

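Implementation

Notes: jQuery's ajax call requests the content extractor (an XSLT stylesheet) from the GooSeeker API, and the browser's XSLTProcessor applies it to the page's DOM.

The source code is as follows: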

// The target page is http://m.58.com/cs/qiuzu/22613961050143x.shtml; save it beforehand as a local HTML file and insert the code below
// (jQuery must be loaded on the page)
$(document).ready(function(){
    $.ajax({
        type: "get",
        // Replace the placeholders with your own API key and rule theme name
        url: "http://www.gooseeker.com/api/getextractor?key=your-api-key&theme=your-rule-theme-name",
        dataType: "xml",
        success: function(xslt)
        {
            var result = convertXml(xslt, window.document);
            alert("result:" + result);
        }
    });

    /* Use the XSLT to transform the DOM into an XML object */
    function convertXml(xslt, dom)
    {
        // Create the XSLTProcessor object
        var xsltProcessor = new XSLTProcessor();
        xsltProcessor.importStylesheet(xslt);
        // Use transformToDocument
        var result = xsltProcessor.transformToDocument(dom);
        return result;
    }
});
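The JavaScript request above sends only key and theme, whereas the Java listing in Section 2 always sends all four parameters, and its comments say that middle and bname are needed only in some cases. A hedged Java sketch of a URL builder that leaves out blank optional parameters might look like this. The class name ExtractorUrlSketch and the method buildUrl are hypothetical, and the sketch assumes the API accepts requests without the unused parameters, as the JavaScript call suggests.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the original listing.
final class ExtractorUrlSketch {

    static final String API_URL = "http://www.gooseeker.com/api/getextractor";

    // Builds the request URL, skipping parameters that are null or empty.
    static String buildUrl(String key, String theme, String middle, String bname) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("key", key);
        params.put("theme", theme);
        params.put("middle", middle);
        params.put("bname", bname);

        StringBuilder sb = new StringBuilder(API_URL);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String value = entry.getValue();
            if (value == null || value.isEmpty()) {
                continue; // assumption: unused parameters can simply be left out
            }
            sb.append(separator).append(entry.getKey()).append('=').append(encode(value));
            separator = '&';
        }
        return sb.toString();
    }

    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // Placeholder values; prints the URL with only key and theme
        System.out.println(buildUrl("my-key", "my theme", null, ""));
    }
}
```

Unlike urlparam in the Java listing, this variant does not leave a trailing "&" at the end of the query string.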

A screenshot of the returned result is shown below.

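4. Outlook

Python can be used in the same way to fetch the content of a specified web page, and its syntax feels more concise. A Python example may be added later; anyone interested is welcome to join the work.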

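5. Related documents

1. Python instant web crawler: API description

2. Definition of the content extractor

6. GooSeeker open-source code download

1. GooSeeker open-source Python web crawler: GitHub source code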

7. Document revision history

1. 2016-06-20: V1.0
