Scraping web page data with a Java crawler (web crawler, fetching the page source, test class)
优采云 Published: 2022-02-02 13:05
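1. Web crawler
A web crawler, also called a web spider, is an internet bot that automatically browses the World Wide Web, typically to build a web index. Put simply, it fetches the source of a requested page and then extracts the content you need with regular expressions. The implementation breaks down into roughly two steps:
(1) Fetch the page source
(2) Use regular expressions to extract the content you need (here, the question and the answers below it)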
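The two steps above can be sketched on an in-memory string, so the example runs without network access. The class name `ExtractSketch`, the helper `extractFirst`, and the `<h1 class="title">` sample markup are hypothetical and are not taken from any real page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractSketch {
    // Step 2: return the first captured group found in the page source, or "" if nothing matches.
    static String extractFirst(String source, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(source);
        return matcher.find() ? matcher.group(1) : "";
    }

    public static void main(String[] args) {
        // Step 1 would normally download this string; here it is hard-coded sample markup.
        String source = "<html><body><h1 class=\"title\">How do I start?</h1></body></html>";
        System.out.println(extractFirst(source, "<h1 class=\"title\">(.+?)</h1>"));
    }
}
```

Returning an empty string on a failed match mirrors how the fields in this article default to `""`; a stricter variant might return `Optional<String>` instead.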
2. Fetching the page source
package JavaSpider.spider;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class Spider {
    /**
     * @author: Ragty
     * @describe: Fetches the source of a web page
     * @param: [url]
     * @return: java.lang.String
     * @date: 2019/1/23
     */
    public static String getSource(String url) {
        // StringBuilder avoids re-copying the whole string on every line
        StringBuilder result = new StringBuilder();
        try {
            URL realUrl = new URL(url);
            URLConnection conn = realUrl.openConnection(); // connect to the external URL
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    result.append(line);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result.toString();
    }
}
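As a hedged variation on `getSource`, the reading loop can be separated from the network call, which makes it testable against any `InputStream`. The class name `StreamReadSketch`, the helper `readAll`, and the choice of UTF-8 are assumptions; `getSource` itself decodes with the platform default charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamReadSketch {
    // Reads the stream line by line and joins the lines without separators,
    // as getSource does. Decoding as UTF-8 is an assumption of this sketch.
    static String readAll(InputStream in) {
        StringBuilder result = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.append(line);
            }
        } catch (IOException e) {
            // Surface the failure instead of silently returning a partial page.
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        byte[] page = "<html>\n<body>hi</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page)));
    }
}
```

Because line breaks are dropped, the regular expressions applied later see the page as a single line, which is why `.` can span what were originally separate lines.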
3. Extraction rules and entity class
package JavaSpider.bean;

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import JavaSpider.spider.Spider;

public class Imooc {
    public String question;
    public String quesUrl;
    public String quesDescription;
    public Map<String, String> answers;
    public String nextUrl;

    /**
     * @author: Ragty
     * @describe: Scrapes the question and its answers from an imooc Q&A page
     * @param: [url]
     * @return:
     * @date: 2019/1/23
     */
    public Imooc(String url) {
        question = "";
        quesUrl = url;
        quesDescription = "";
        answers = new HashMap<>();
        nextUrl = "";
        // Fetch the source of the single question page
        String codeSource = Spider.getSource(url);

        // NOTE: the closing delimiter after the final (.+?) in the next three patterns
        // appears to have been lost from this listing; as written, that lazy group
        // captures only a single character.

        // Extract the question title
        Pattern pattern = Pattern.compile("js-qa-wenda-title.+?>(.+?)");
        Matcher matcher = pattern.matcher(codeSource);
        if (matcher.find()) {
            question = matcher.group(1);
        }

        // Extract the question description and strip paragraph tags
        pattern = Pattern.compile("js-qa-wenda.+?rich-text\">(.+?)");
        matcher = pattern.matcher(codeSource);
        if (matcher.find()) {
            quesDescription = matcher.group(1).replace("<p>", "").replace("</p>", "");
        }

        // Extract the answer list: group 1 is the nickname, group 2 the answer body
        pattern = Pattern.compile("nickname.+?>(.+?)</a>.+?answer-desc rich-text aimgPreview.+?>(.+?)");
        matcher = pattern.matcher(codeSource);
        while (matcher.find()) {
            String answer = matcher.group(2).replace("<p>", "");
            answer = answer.replace("</p>", "");
            answer = answer.replace("<br />", "");
            String name = matcher.group(1);
            answers.put(name.trim(), answer.trim());
        }

        // Extract the next URL to crawl from the related-questions links
        pattern = Pattern.compile("class=\"r relwenda\".+?href=\"(.+?)\".+?</a>");
        matcher = pattern.matcher(codeSource);
        while (matcher.find()) {
            nextUrl = "http://www.imooc.com" + matcher.group(1);
            // Keep only the first recommendation that is not the current page
            if (!nextUrl.equals(quesUrl)) {
                break;
            }
        }
    }

    @Override
    public String toString() {
        return "Question: " + question + "\nQuestion URL: " + quesUrl
                + "\nQuestion description: " + quesDescription + "\n"
                + "Answers: " + answers + "\nNext link: " + nextUrl + "\n";
    }
}
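A lazy group such as `(.+?)` only captures a useful span when something follows it in the pattern. The sketch below shows that technique for collecting name and answer pairs in a loop. The class name `AnswerListSketch`, the `class="nickname"` and `class="answer"` markup, and the `</div>` terminator are hypothetical; they are not imooc's actual markup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnswerListSketch {
    // Collects name -> answer pairs; each lazy group stops at the delimiter that follows it.
    static Map<String, String> parseAnswers(String source) {
        Pattern pattern = Pattern.compile(
                "class=\"nickname\">(.+?)</a>.+?class=\"answer\">(.+?)</div>");
        Matcher matcher = pattern.matcher(source);
        // LinkedHashMap keeps the answers in page order.
        Map<String, String> answers = new LinkedHashMap<>();
        while (matcher.find()) {
            String answer = matcher.group(2).replace("<p>", "").replace("</p>", "");
            answers.put(matcher.group(1).trim(), answer.trim());
        }
        return answers;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for a downloaded page.
        String source =
                "<a class=\"nickname\">alice</a><div class=\"answer\"><p>Use a loop.</p></div>"
              + "<a class=\"nickname\">bob</a><div class=\"answer\"><p>Use recursion.</p></div>";
        System.out.println(parseAnswers(source));
    }
}
```

Keying the map by nickname means a second answer from the same user overwrites the first; a list of pairs would preserve both if that matters.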
4. Test class
package JavaSpider.main;

import JavaSpider.bean.Imooc;

public class Main {
    public static void main(String[] args) {
        String url = "http://www.imooc.com/wenda/detail/351144";
        Imooc imooc;
        for(int i=0; i