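Scraping dynamic web pages with a Java crawler (what a crawler is, and the code, starting with a helper method)

优采云, published: 2022-01-10 11:12

An introduction to Java crawlers with a worked example

What is a crawler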

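A crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

How to crawl

Whenever we open a web page we can look at its link. Take 阶梯阅读 () as an example.

That is the link of the home page. When we click through to everything listed for first grade, the link changes and more items appear; clicking any lesson changes the link again. Pressing F12 and tracing the page source shows that the audio link is written rather obscurely: it turns out to be assigned from JavaScript, and what we finally capture as the audio link is: Nfu0sVOTopaEPyBt299hxFv7R_k=

So the idea is this: if our code can capture that snippet, it can download the audio. But that is only one resource, so the code has to simulate opening the pages layer by layer and then collect all of the resources.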
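The layer-by-layer idea can be sketched without touching the network. The block below is only an illustration under stated assumptions: the `SITE` map and its paths (`/grade1`, `/source/1`, and so on) are hypothetical stand-ins for real pages, and `crawl` is a name introduced here, not something from the original code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LayeredCrawlSketch {
    // Hypothetical stand-in for a site: each page maps to the links found on it.
    static final Map<String, List<String>> SITE = Map.of(
            "/", List.of("/grade1", "/grade2"),
            "/grade1", List.of("/source/1", "/source/2"),
            "/grade2", List.of("/source/3"));

    // Opens pages one level at a time, down to maxDepth, and returns every page visited in order.
    static Set<String> crawl(String start, int maxDepth) {
        Set<String> visited = new LinkedHashSet<>();
        List<String> currentLevel = List.of(start);
        for (int depth = 0; depth <= maxDepth && !currentLevel.isEmpty(); depth++) {
            List<String> nextLevel = new ArrayList<>();
            for (String page : currentLevel) {
                if (visited.add(page)) { // skip pages that were already opened
                    nextLevel.addAll(SITE.getOrDefault(page, List.of()));
                }
            }
            currentLevel = nextLevel;
        }
        return visited;
    }

    public static void main(String[] args) {
        // prints [/, /grade1, /grade2, /source/1, /source/2, /source/3]
        System.out.println(crawl("/", 2));
    }
}
```

In a real crawler the map lookup would be replaced by fetching the page and extracting its links; the visited set is what keeps the crawl from opening the same page twice.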

The code follows.

First, define a method that collects the links of the a tags found in a page's HTML.

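import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns the distinct href values of the <a> tags found in the given HTML.
public static Set<String> getHtmlToA(String html) {
    // The pattern printed in the post was lost in extraction. This one is a
    // reconstruction (an assumption) that keeps the href value in group 2,
    // which is what the loop below reads.
    Pattern p = Pattern.compile("<a\\s+href\\s*=\\s*(\"|')?(.*?)[\"'>]", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(html);
    Set<String> hashSet = new HashSet<>();
    while (m.find()) {
        String link = m.group(2).trim();
        hashSet.add(link);
    }
    return hashSet;
}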

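As the code shows, a regular expression picks out the link that follows each a tag. The Pattern class (together with its Matcher) lets us scan the HTML and keep every fragment that fits the expression.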
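To make the regular-expression step concrete, here is a small self-contained sketch that runs against a hard-coded HTML string. The pattern is an assumption written for this example (it also accepts attributes before `href`), and the class name, method name and sample paths are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractionSketch {
    // Assumed pattern: an <a> tag whose href value is wrapped in single or double quotes.
    // Group 1 is the opening quote, group 2 is the link itself.
    private static final Pattern A_HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static Set<String> extractLinks(String html) {
        Set<String> links = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a class=\"book-link\" href=\"/grade/1\">Grade 1</a>"
                + "<A HREF='/grade/2'>Grade 2</A>"
                + "<a href=\"/grade/1\">duplicate</a>";
        // prints [/grade/1, /grade/2]
        System.out.println(extractLinks(html));
    }
}
```

A regular expression is adequate for simple, predictable markup like this; for irregular HTML a real parser is usually the safer choice.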

Next is the method that fetches the full HTML of a page.

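import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

// Opens the page at the given address and returns a reader over its HTML.
// Failures are propagated instead of printed, because after an error the
// original went on to wrap a null stream and failed with a NullPointerException.
public static BufferedReader getBR(String html) throws IOException {
    URL urls = new URL(html); // a malformed address throws MalformedURLException, an IOException
    InputStream in = urls.openStream();
    // Decodes with the platform default charset, as the original did.
    InputStreamReader isr = new InputStreamReader(in);
    return new BufferedReader(isr); // closing this reader also closes isr and in
}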

The BufferedReader returned here gives us the whole source of the page we want.
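Since the crawler only cares about a few lines of each page, a typical next step is to read the BufferedReader line by line and keep the lines that contain a marker. The following is a minimal sketch of that step; a StringReader stands in for the network stream so it runs offline, and `linesContaining` and the sample page text are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineFilterSketch {
    // Reads the reader to the end and keeps only the lines that contain the marker.
    static List<String> linesContaining(BufferedReader reader, String marker) {
        List<String> matches = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(marker)) { // also matches when the marker starts the line
                    matches.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        String page = "<div class=\"book-link\"><a href=\"/grade/1\">Grade 1</a></div>\n"
                + "<p>unrelated</p>\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(page))) {
            System.out.println(linesContaining(reader, "book-link"));
        }
    }
}
```

Using `contains` rather than `indexOf(...) > 0` matters here: the latter silently skips a line that begins with the marker, because the index is then 0.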

Here is the code that runs it.

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public static void main(String[] args) throws Exception {
    String url = "http://pati.edu-china.com";
    String audioPrefix = "http://prvstatic.edu-china.com/upload/pati/audio/";
    // Level 1: on the home page, keep the "book-link" lines from "年级" (grade) onward.
    StringBuilder http = new StringBuilder();
    try (BufferedReader bufr = getBR(url)) {
        String str;
        while ((str = bufr.readLine()) != null) {
            int gradeIdx = str.indexOf("年级");
            if (str.contains("book-link") && gradeIdx >= 0) {
                http.append(str.substring(gradeIdx)).append("\n");
            }
        }
    }
    for (String gradeLink : getHtmlToA(http.toString())) {
        // Level 2: on each grade page, keep the lines that link to "/source".
        StringBuilder http1 = new StringBuilder();
        try (BufferedReader bufr = getBR(url + gradeLink)) {
            String str;
            while ((str = bufr.readLine()) != null) {
                if (str.contains("/source")) {
                    http1.append(str).append("\n");
                }
            }
        }
        for (String name : getHtmlToA(http1.toString())) {
            String newUrl1 = url + name;
            System.out.println(newUrl1);
            // Level 3: on each lesson page, find the line that carries the audio link.
            try (BufferedReader br = getBR(newUrl1)) {
                String strBR;
                while ((strBR = br.readLine()) != null) {
                    if (strBR.contains(audioPrefix)) {
                        // Offsets kept from the original; they depend on how the page writes the JS assignment.
                        String endHttp = strBR.substring(7, strBR.length() - 1);
                        // The link path contains '/', so it is flattened into a valid file name.
                        String fileName = name.replaceAll("[\\\\/:*?\"<>|]", "_") + ".mp3";
                        try (InputStream ins = new URL(endHttp).openConnection().getInputStream(); // open the connection and input stream
                             FileOutputStream f = new FileOutputStream("D:/MyPaChong/" + fileName)) { // open the output file
                            byte[] bb = new byte[1024]; // receive buffer
                            int len;
                            while ((len = ins.read(bb)) != -1) { // read until the end of the stream
                                f.write(bb, 0, len); // write to the file
                            }
                            System.out.println(name + " downloaded successfully ^_^");
                        } catch (IOException e) {
                            e.printStackTrace(); // one failed download should not stop the crawl
                        }
                    }
                }
            }
        }
    }
}
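The fixed offsets in `substring(7, strBR.length() - 1)` only work while the page writes the audio line in exactly one way. As a hedged alternative, the URL can be located by its known prefix. This is a sketch rather than the author's method; the sample line and the names `AudioUrlSketch` and `findAudioUrl` are hypothetical, and the real page's JavaScript may look different.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AudioUrlSketch {
    // Prefix taken from the crawler code above.
    static final String AUDIO_PREFIX = "http://prvstatic.edu-china.com/upload/pati/audio/";

    // Matches the prefix plus everything up to the next quote or whitespace.
    private static final Pattern AUDIO_URL =
            Pattern.compile(Pattern.quote(AUDIO_PREFIX) + "[^\"'\\s]+");

    static Optional<String> findAudioUrl(String line) {
        Matcher m = AUDIO_URL.matcher(line);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical line of page source containing a JS assignment.
        String line = "  src = \"" + AUDIO_PREFIX + "lesson01.mp3\";";
        System.out.println(findAudioUrl(line).orElse("no audio URL on this line"));
    }
}
```

Because the match is anchored on the prefix rather than on character positions, extra indentation or a different variable name on the line does not change the result.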

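This is the site-scraping example I just finished. There are many places where the code could be factored out or improved, but that is too much trouble, so I have left it as it is. Anyone interested is welcome to rewrite it.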
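One small improvement of that kind concerns how links are joined to the site address. The crawler concatenates `url + link`, which only works for links that start with `/`. A hedged sketch using `java.net.URI.resolve` handles relative and absolute links as well; the page and link paths below are hypothetical.

```java
import java.net.URI;

class LinkResolveSketch {
    // Resolves a link found on a page against that page's own address.
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://pati.edu-china.com/grade/1";
        System.out.println(resolve(page, "/source/42"));   // root-relative link
        System.out.println(resolve(page, "lesson.html"));  // link relative to the current page
        System.out.println(resolve(page, "http://prvstatic.edu-china.com/a.mp3")); // already absolute
    }
}
```

Note that `URI.create` throws `IllegalArgumentException` for links containing characters that are illegal in a URI, such as spaces, so links scraped from real pages may need encoding first.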

I am learning to write crawlers and am recording it here.

