Scraping dynamic web pages with a Java crawler (why fetching free proxy IP data runs into a JS-encrypted cookie)

优采云 · Published: 2021-10-12 07:14


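  1. Preface:

  Anyone who scrapes data sooner or later has to deal with proxies and CAPTCHA recognition. This article summarizes the JS-encrypted cookie problem I ran into while scraping free proxy IP data.

  2. The problem:

  For ordinary static pages, parsing with jsoup is the usual approach.

  But fetching this site directly with jsoup throws an error.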

  org.jsoup.HttpStatusException: HTTP error fetching URL. Status=521, URL=http://www.kuaidaili.com/ops/proxylist/1

at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:679)

at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:628)

at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:260)

at org.jsoup.helper.HttpConnection.get(HttpConnection.java:249)

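  3. Analysis and solution:

  A browser can open the page without trouble, so let's watch what the browser actually does, using Chrome as the example.

  The first request clearly fails: its HTTP status code is 521, not 200.

  The second request returns 200, but it carries an extra cookie: _ydclearance=a3fd46bd1a232b52d7313218-72dc-4427-aa33-5690668af31d-1506323606

  That cookie is the cause of the problem.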
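  The two requests seen in Chrome suggest a simple decision rule for a crawler: a 521 is a challenge rather than a plain failure, and the page is only served once `_ydclearance` is sent. The sketch below encodes that rule. `ChallengeFlow`, `Step`, `hasCookie`, and `nextStep` are hypothetical names, and treating every 521 as a cookie challenge is an assumption based only on the two requests observed here.

```java
final class ChallengeFlow {
    enum Step { PARSE_PAGE, SOLVE_CHALLENGE, GIVE_UP }

    // True if a Cookie request header ("a=1; b=2") contains a cookie with this name.
    static boolean hasCookie(String cookieHeader, String name) {
        if (cookieHeader == null) {
            return false;
        }
        for (String part : cookieHeader.split(";")) {
            String pair = part.trim();
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).trim().equals(name)) {
                return true;
            }
        }
        return false;
    }

    // Decide what to do after a response, given the Cookie header that was sent.
    static Step nextStep(int status, String sentCookieHeader) {
        if (status == 200) {
            return Step.PARSE_PAGE;
        }
        // Assumption: 521 without the clearance cookie means "solve the challenge first".
        if (status == 521 && !hasCookie(sentCookieHeader, "_ydclearance")) {
            return Step.SOLVE_CHALLENGE;
        }
        // A 521 despite sending the cookie (or any other status) is not handled here.
        return Step.GIVE_UP;
    }
}
```

  Returning `GIVE_UP` when a 521 arrives even though the clearance cookie was sent keeps the crawler from looping forever on a stale or rejected cookie.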

  Reproduce the request in code and print what comes back:

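  import org.apache.http.HeaderIterator;
  import org.apache.http.client.config.RequestConfig;
  import org.apache.http.client.methods.CloseableHttpResponse;
  import org.apache.http.client.methods.HttpGet;
  import org.apache.http.client.protocol.HttpClientContext;
  import org.apache.http.config.Registry;
  import org.apache.http.config.RegistryBuilder;
  import org.apache.http.cookie.CookieSpecProvider;
  import org.apache.http.impl.client.CloseableHttpClient;
  import org.apache.http.impl.client.HttpClients;
  import com.gargoylesoftware.htmlunit.WebClient;
  import com.gargoylesoftware.htmlunit.WebRequest;

  CloseableHttpClient httpClient = HttpClients.createDefault();
  // Register a custom CookieSpec (MyCookieSpec is the author's own class, not shown)
  Registry<CookieSpecProvider> cookieSpecProviderRegistry = RegistryBuilder.<CookieSpecProvider>create()
          .register("myCookieSpec", context -> new MyCookieSpec()).build();
  String url = baseUrl + i; // baseUrl and the page index i come from the enclosing loop
  HttpGet get = new HttpGet(url);
  HttpClientContext context = HttpClientContext.create();
  context.setCookieSpecRegistry(cookieSpecProviderRegistry);
  get.setConfig(RequestConfig.custom().setCookieSpec("myCookieSpec").build());
  WebRequest request = null;
  WebClient wc = null;
  try {
      // 1. Capture the Set-Cookie header returned with the 521 status
      CloseableHttpResponse response = httpClient.execute(get, context);
      // Response status
      System.out.println("status:" + response.getStatusLine());
      System.out.println(">>>>>>headers:");
      HeaderIterator iterator = response.headerIterator();
      while (iterator.hasNext()) {
          System.out.println("\t" + iterator.next());
      }
      System.out.println(">>>>>>cookies:");
      // context.getCookieStore().getCookies().forEach(System.out::println);
      String cookie = getCookie(context); // getCookie is the author's helper, not shown
      System.out.println("cookie=" + cookie);
      response.close();
      // ... the try block continues below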

  Output log:

  status:HTTP/1.1 521

>>>>>>headers:

Date: Mon, 25 Sep 2017 07:10:25 GMT

Content-Type: text/html

Connection: keep-alive

Set-Cookie: yd_cookie=fa424be4-70a9-4478226851c4b3f3e8e031e4ed7860052980; Expires=1506330625; Path=/; HttpOnly

Cache-Control: no-cache, no-store

Server: WAF/2.4-12.1

>>>>>>cookies:

cookie=yd_cookie=fa424be4-70a9-4478226851c4b3f3e8e031e4ed7860052980;Expires=Mon Sep 25 17:10:25 CST 2017;Path=/
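  Two details in this log are worth a closer look. The printed cookie string still carries `Expires` and `Path`, which are response-side attributes; a `Cookie` request header normally contains only `name=value` pairs. Also, `Expires=1506330625` looks like a Unix timestamp in seconds rather than an HTTP date, which is consistent with the 17:10:25 CST expiry printed on the last line. The sketch below shows one way to handle both with the JDK alone. `CookieHeaders`, `toCookieHeader`, and `epochExpires` are hypothetical names, not the author's `getCookie` helper.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CookieHeaders {
    // Build a Cookie request header from raw Set-Cookie values,
    // keeping only the leading name=value pair of each one.
    static String toCookieHeader(List<String> setCookieValues) {
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String raw : setCookieValues) {
            String first = raw.split(";", 2)[0].trim();
            int eq = first.indexOf('=');
            if (eq > 0) {
                pairs.put(first.substring(0, eq).trim(), first.substring(eq + 1).trim());
            }
        }
        StringBuilder header = new StringBuilder();
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            if (header.length() > 0) {
                header.append("; ");
            }
            header.append(e.getKey()).append('=').append(e.getValue());
        }
        return header.toString();
    }

    // Read an all-digit Expires attribute as epoch seconds (assumption: this
    // server sends seconds since 1970). Returns null if no such attribute exists.
    static Instant epochExpires(String setCookieValue) {
        String[] parts = setCookieValue.split(";");
        for (int i = 1; i < parts.length; i++) {
            String attr = parts[i].trim();
            int eq = attr.indexOf('=');
            if (eq > 0 && attr.substring(0, eq).trim().equalsIgnoreCase("Expires")) {
                String value = attr.substring(eq + 1).trim();
                if (value.matches("\\d{1,18}")) {
                    return Instant.ofEpochSecond(Long.parseLong(value));
                }
            }
        }
        return null;
    }
}
```

  For the header in the log above, `toCookieHeader` yields just `yd_cookie=fa424be4-70a9-4478226851c4b3f3e8e031e4ed7860052980`, and `epochExpires` yields `2017-09-25T09:10:25Z`, which is 17:10:25 in CST (UTC+8).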

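  That is a seriously sneaky trick. As you would expect from a site that serves proxies to crawlers, it has anti-crawler defenses of its own.

  Take this cookie and request the target URL again, then look at what comes back: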

  HttpGet secGet = new HttpGet(url);
  secGet.setHeader("Cookie", cookie);
  // For testing: compare against the first response
  CloseableHttpResponse secResponse = httpClient.execute(secGet, context);
  System.out.println("secstatus:" + secResponse.getStatusLine());
  String content = EntityUtils.toString(secResponse.getEntity());
  System.out.println(content);
  secResponse.close();

  I expected the normal page data to come back, but that was naive: the response is a piece of JavaScript, and an obfuscated one at that.

window.onload=setTimeout("dv(43)", 200); function dv(VC) {var qo, mo="", no="", oo = [0x43,0xe5,0xb0,0x27,0x71,0x6f,0xe9,0x58,0xd8,0x21,0x55,0x56,0xd0,0x4f,0xcd,0x91,0x1c,0x9e,0x09,0xe7,0x80,0x6f,0x8d,0xf3,0x60,0x73,0xe9,0x66,0xd4,0x47,0x1e,0x76,0xec,0x69,0xe3,0xbc,0x27,0x02,0x70,0xe0,0xf2,0x65,0xd1,0xac,0x19,0xf3,0x6c,0xe4,0x57,0x3c,0xc3,0xa8,0x13,0xe1,0xb4,0x37,0x0e,0xf2,0x5f,0x32,0x43,0x0e,0x88,0xfe,0xd7,0x99,0x68,0xe0,0xdb,0xa6,0xd2,0xab,0x80,0x57,0x52,0xd6,0xa3,0x7c,0x5d,0xd5,0x29,0x28,0xa0,0x75,0x48,0x3d,0x18,0x13,0xdd,0x4a,0x21,0xf1,0xbe,0x89,0x70,0x72,0xe0,0x4b,0x16,0xee,0xa7,0x74,0x41,0x40,0x13,0xb3,0x7e,0x53,0x24,0xfa,0x12,0xec,0xbd,0x8e,0x5b,0x37,0x02,0xec,0xdd,0x4c,0xf8,0x59,0xad,0x34,0x8c,0x93,0xfd,0x58,0x33,0x6d,0xc6,0x45,0xc5,0xc2,0xb7,0xac,0x81,0x50,0x4b,0x61,0x98,0x03,0x57,0x52,0x25,0x8d,0x60,0x51,0x26,0x09,0xa2,0x8b,0x5e,0x33,0x1c,0xaa,0x77,0x42,0x37,0x65,0x35,0x73,0x7f,0x66,0x5b,0xae,0x1b,0x99,0x18,0x8a,0x18,0x9a,0x1b,0xf5,0xf6,0x66,0xf0,0x3b,0xad,0x34,0x50,0xbc,0x2f,0xb1,0x2e,0xd9,0x60,0x5d,0xd7,0x56,0x5f,0xd9,0xc4,0xb5,0x0a,0xf9,0x70,0xbc,0x3d,0x1c,0xae,0xad,0xa0,0x87,0x7c,0x47,0x95,0x1c,0x9c,0x05,0x45,0xc7,0x16,0x17,0x83,0x33,0xb1,0x2c,0x76,0xf4,0x2e,0x9c,0x19,0x65,0x66,0x3d,0xb9,0x3c,0xb2,0x25,0xbd,0x0a,0x90,0x0f,0x8f,0xce,0xa9,0x16,0x98,0x0f,0xc5,0x14,0x8e,0xfc,0x79,0xe1,0x2e,0x2f,0x39,0x51,0x42,0x94,0x3b];qo = "qo=251; do{oo[qo]=(-oo[qo])&0xff; oo[qo]=(((oo[qo]>>2)|((oo[qo]

>>>>>>headers:

Date: Mon, 25 Sep 2017 07:10:25 GMT

Content-Type: text/html

Connection: keep-alive

Set-Cookie: yd_cookie=fa424be4-70a9-4478226851c4b3f3e8e031e4ed7860052980; Expires=1506330625; Path=/; HttpOnly

Cache-Control: no-cache, no-store

Server: WAF/2.4-12.1

>>>>>>cookies:

cookie=yd_cookie=fa424be4-70a9-4478226851c4b3f3e8e031e4ed7860052980;Expires=Mon Sep 25 17:10:25 CST 2017;Path=/

15:10:27.023 [main] INFO c.g.htmlunit.WebClient - statusCode=[521] contentType=[text/html]

15:10:27.033 [main] INFO c.g.htmlunit.WebClient - window.onload=setTimeout("kt(180)", 200); function kt(OD) {var qo, mo="", no="", oo = [0xe7,0xd1,0x34,0xe1,0x60,0xfd,0x51,0x2f,0xc4,0x2b,0xc2,0x90,0xb5,0x43,0xf0,0x7e,0xfb,0xd9,0x22,0x42,0x32,0x5f,0x5d,0x23,0xe6,0xb4,0x5a,0x38,0xf5,0x2c,0xd6,0x94,0x2a,0xf7,0xd5,0xf5,0x99,0xe9,0x52,0x92,0x68,0x46,0x45,0x4d,0x03,0x53,0x8b,0x61,0xc6,0x2f,0x0d,0x45,0x13,0xe0,0xc7,0x08,0x80,0x80,0xb8,0x9e,0xf2,0x53,0x53,0xa3,0x81,0x21,0xdc,0xb2,0x88,0xc8,0x01,0xc0,0x68,0xd0,0x29,0x61,0xd1,0x71,0x01,0xde,0x1f,0xdc,0x1d,0xbc,0x04,0x6c,0xa4,0xec,0x35,0x3d,0x67,0xcf,0x28,0xf5,0xab,0xe3,0x08,0x78,0x4e,0xed,0x2e,0x8e,0x34,0x7c,0xd4,0x05,0x55,0x9d,0x32,0x8a,0xc2,0x3b,0x2b,0xf2,0x89,0x67,0x6d,0xb3,0x31,0x67,0x78,0x56,0x84,0xa4,0x41,0xee,0x7e,0x14,0xbb,0x63,0xbb,0x1c,0xd4,0x74,0xa1,0x9f,0xc5,0x65,0x25,0x65,0xd5,0x9d,0xc5,0xc5,0xa4,0xbc,0xfc,0x25,0x3d,0x75,0x50,0xa8,0x70,0x5d,0xf9,0x5f,0x37,0x27,0xee,0xd4,0x62,0x20,0xe7,0xa5,0x23,0xd8,0xf8,0x90,0xdd,0x4b,0xa9,0x67,0xe4,0xca,0xdd,0x9b,0x19,0xbe,0x3c,0xd3,0x9f,0x4d,0xfa,0x98,0x88,0x50,0x14,0x5a,0x18,0x7e,0xe3,0x04,0x0a,0xb9,0x89,0x99,0x41,0xaf,0x98,0x16,0xab,0x91,0x3f,0x8d,0x21,0xd8,0xbe,0x4c,0x1a,0x78,0x47,0xe4,0xc2,0x78,0xde,0x76,0xd2,0x78,0x06,0xd3,0x91,0xf7,0xf0,0x6e,0x1c,0xb1,0xd1,0xb7,0x3e,0xcb,0x99,0xf7,0x95,0x73,0x6e,0x04,0x6a,0x02,0x7f,0xb4,0x22,0x87,0x3b];qo = "qo=241; do{oo[qo]=(-oo[qo])&0xff; oo[qo]=(((oo[qo]>>5)|((oo[qo]>5)|((oo[qo]>2)|((oo[qo]>2)|((oo[qo]
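  Both challenge scripts shown here start the same way: `setTimeout("dv(43)", 200)` in one response and `setTimeout("kt(180)", 200)` in the other, so the function name and its argument change from request to request and cannot be hard-coded. A first step toward handling the script outside a browser could be to pull that entry call out with a regular expression. This is a minimal sketch under that assumption; `ChallengeEntry` and `find` are hypothetical names, and the pattern is derived only from the two samples in this article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChallengeEntry {
    // Matches setTimeout("name(123)" and captures the function name and its numeric argument.
    private static final Pattern ENTRY =
            Pattern.compile("setTimeout\\(\\s*\"([A-Za-z_$][\\w$]*)\\((\\d+)\\)\"");

    // Returns {functionName, argument}, or null if the body is not a challenge script.
    static String[] find(String script) {
        Matcher m = ENTRY.matcher(script);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }
}
```

  With the name and argument known, the function could be invoked directly in a JavaScript-capable client instead of waiting for the 200 ms timer. The `c.g.htmlunit.WebClient` lines in the log indicate that the original program hands the page to HtmlUnit, but the remaining steps are not shown in this article.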
