j2crawler: a distributed crawler engine for Java

Published by 优采云: 2020-07-02 08:01

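j2crawler is a general-purpose Java crawler engine that keeps third-party dependencies to a minimum, lets you extend its components flexibly, and works out of the box. It is simple to use, supports the mainstream parsing syntaxes, offers flexible real-time and offline fetch modes, follows Spring Boot conventions, and supports distributed deployment. Its goal is to lower the barrier for a crawler beginner to build a highly available, high-performance crawler application and to make crawler development more efficient: you only need some basic web page parsing syntax and have to follow a small number of j2crawler development conventions.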

j2crawler engine architecture diagram:

j2crawler internal component architecture diagram:

Add the starter dependency

<dependency>
    <groupId>com.saas.jplogiccloud</groupId>
    <artifactId>jplogiccloud-starter-j2crawler</artifactId>
</dependency>

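Example setup in a Spring Boot application (demo)

Create a FetchJob that follows the engine's conventions (see the architecture diagrams above for how it works). When the engine starts, the job is added to the engine context and scheduled according to your own customization.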

package com.saas.jplogiccloud.crawler.jobs;

import com.saas.jplogiccloud.starter.j2crawler.annotation.FetchJob;
import com.saas.jplogiccloud.starter.j2crawler.core.*;
import lombok.extern.slf4j.Slf4j;
import org.seimicrawler.xpath.JXDocument;

import java.util.ArrayList;
import java.util.List;

@FetchJob(fetchTimeOut = 60000, jobName = "demoFetchJob")
@Slf4j
public class DemoFetchJob extends BaseFetchJob {

    // Build the initial fetch requests for this job.
    @Override
    public List<FetchReq> initFetchReqs() {
        List<FetchReq> fetchReqs = new ArrayList<>();
        FetchReq fetchReq = FetchReq.builder()
                .reqUrl("http://www.ip3366.net/?stype=1&page=1")
                .onFetchBack("onFetch")              // name of the callback method below
                .fetcherType(FetcherType.WEBDRIVER)
                .build();
        fetchReqs.add(fetchReq);
        return fetchReqs;
    }

    // This demo supplies no plain URL list; it uses initFetchReqs() instead.
    @Override
    public String[] initFetchUrls() {
        return null;
    }

    // Callback for the response of each fetch request.
    @Override
    public void onFetch(FetchResp resp) {
        try {
            JXDocument doc = FetchParser.getJXDoc(resp);
            String url = resp.getUrl();
            if (url.contains("www.ip3366.net")) {
                getCloudProxyIp(resp, doc);
            }
        } catch (Exception e) {
            // Pass the exception as the last argument so the stack trace is logged.
            log.error(">>>> demoFetchJob-> error while fetching data: {}", e.getMessage(), e);
        }
    }

    // Parse proxy IP entries from the document (body left empty in the original).
    private void getCloudProxyIp(FetchResp resp, JXDocument doc) {
    }
}
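The demo job registers its callback by name with onFetchBack("onFetch"). The article does not show how the engine resolves that string, so the following is only a minimal sketch of one common way to do it, reflective lookup by method name. SketchResp, SketchJob, and CallbackDispatcher are hypothetical stand-ins invented for this illustration and are not part of j2crawler.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a fetch response; NOT j2crawler's FetchResp.
class SketchResp {
    final String url;
    final String body;

    SketchResp(String url, String body) {
        this.url = url;
        this.body = body;
    }
}

// Hypothetical job whose callback is looked up by its method name.
class SketchJob {
    String lastUrl; // records the last URL seen, for demonstration only

    public void onFetch(SketchResp resp) {
        lastUrl = resp.url;
    }
}

// Hypothetical dispatcher (an assumption, not the engine's actual code):
// resolves a public callback by name and invokes it with the response.
class CallbackDispatcher {
    static void dispatch(Object job, String callbackName, SketchResp resp) {
        try {
            Method callback = job.getClass().getMethod(callbackName, SketchResp.class);
            callback.invoke(job, resp);
        } catch (ReflectiveOperationException e) {
            // Covers a misspelled callback name as well as a failure inside the callback.
            throw new IllegalStateException("Cannot invoke callback: " + callbackName, e);
        }
    }
}
```

With a design like this, a typo in the callback name only surfaces at run time, which is one reason to keep the string in onFetchBack and the method name side by side, as the demo does.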

Configure the engine in the Spring Boot application.yml

j2crawler:
  application:
    enabled: true
    jobnames: "poxyIpFetchJob"
    threadunit: 2
  driver:
    driverKey: "webdriver.chrome.driver"
    driverPath: "C://Users//Administrator//AppData//Local//Google//Chrome//Application//chromedriver.exe"

Everything else is ordinary Spring Boot application configuration and is omitted here.
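The driverKey and driverPath settings look like a JVM system property name and its value: webdriver.chrome.driver is the property Selenium's ChromeDriver reads to locate the chromedriver binary. How j2crawler applies the two settings is not shown in the article, so the sketch below simply assumes they are registered as a system property at startup. DriverSettings is a hypothetical helper, not a j2crawler class.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; j2crawler's real wiring of these settings is not shown in the article.
class DriverSettings {
    /**
     * Registers the driver location as a JVM system property.
     * Returns true if driverPath points at an existing regular file, false otherwise,
     * so a caller can warn about a wrong path before any browser is launched.
     */
    static boolean apply(String driverKey, String driverPath) {
        System.setProperty(driverKey, driverPath);
        Path path = Paths.get(driverPath);
        return Files.isRegularFile(path);
    }
}
```

Checking the path up front is a design choice of this sketch: a missing chromedriver binary is then reported at startup rather than on the first WEBDRIVER fetch.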

1. PoxyIpFetchJob ===> fetches free proxy IPs;

2. NCoVFetchJob ===> fetches 2019-nCoV novel coronavirus outbreak information in real time;
