Scraping web page data with a crawler (HttpClient returns the source of the static index.html page)

优采云 · Published: 2021-10-04 00:09

1. HttpClient

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.http.HttpEntity;
import org.apache.http.ParseException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.junit.Test;

// Imports assume Apache HttpClient 4.x and JUnit 4; the class name is illustrative.
public class HttpClientCrawlTest {
    @Test
    public void crawSignHtmlTest() {
        // try-with-resources closes the client and the response, releasing the connection
        try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
            // Create the GET request
            HttpGet httpget = new HttpGet("http://127.0.0.1:8080/index.html?companyName=testCompany");

            httpget.setHeader("Accept", "text/html, */*; q=0.01");
            httpget.setHeader("Accept-Encoding", "gzip, deflate,sdch");
            httpget.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
            httpget.setHeader("Connection", "keep-alive");
            httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36");

            //System.out.println("executing request " + httpget.getURI());
            // Execute the GET request
            try (CloseableHttpResponse response = httpclient.execute(httpget)) {
                // Get the response entity
                HttpEntity entity = response.getEntity();
                // Response status
                System.out.println(response.getStatusLine());
                if (entity != null) {
                    // Response content length
                    //System.out.println("response length: " + entity.getContentLength());
                    // Response content; UTF-8 is used only when the response declares no charset
                    System.out.println("response content: ");
                    System.out.println(EntityUtils.toString(entity, StandardCharsets.UTF_8));
                }
            }
        } catch (ParseException | IOException e) {
            // ClientProtocolException is a subclass of IOException, so it is handled here too
            e.printStackTrace();
        }
    }
}

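What HttpClient returns is the source of the static index.html page exactly as the server sent it. If the page contains JS, that JS is not executed, so anything the script would have rendered is missing from the captured HTML.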
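This behaviour can be checked without any third-party library. The sketch below is an illustration only, not the article's code: it swaps Apache HttpClient for the JDK's built-in java.net.http.HttpClient (Java 11 or later) and serves a made-up page from a throwaway local com.sun.net.httpserver.HttpServer. The class name StaticFetchDemo, the page content and the imgDom script are all assumptions chosen for the demo.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class StaticFetchDemo {
    // Hypothetical page: the <div> stays empty until a browser runs the script.
    public static final String PAGE =
        "<html><body><div id=\"imgDom\"></div>"
        + "<script>document.getElementById('imgDom').textContent = 'rendered';</script>"
        + "</body></html>";

    /** Serves PAGE on a free local port, fetches it once, and returns the body as received. */
    public static String fetchStaticSource() {
        try {
            // Port 0 asks the OS for any free port.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/index.html", exchange -> {
                byte[] body = PAGE.getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().add("Content-Type", "text/html; charset=UTF-8");
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            try {
                URI uri = URI.create(
                    "http://127.0.0.1:" + server.getAddress().getPort() + "/index.html");
                HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
                HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
                return response.body();
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = fetchStaticSource();
        // The script arrives as plain text and the div it would fill is still empty.
        System.out.println("script tag present: " + html.contains("<script>"));
        System.out.println("div still empty: " + html.contains("<div id=\"imgDom\"></div>"));
    }
}
```

The fetched string is identical to what the server sent. A plain HTTP client never runs the script, which is why a JS-capable tool is needed when the interesting content is produced by JS.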

To capture the HTML source after the JS has rendered it, you can use HtmlUnit.

2. HtmlUnit

Add the htmlunit jar to the project, then fetch the page source after the JS has executed.

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Test;

// Imports assume HtmlUnit 2.x (com.gargoylesoftware packages) and JUnit 4; the class name is illustrative.
public class HtmlUnitCrawlTest {
    @Test
    public void htmlUnitSignTest() throws Exception {
        WebClient wc = new WebClient(BrowserVersion.CHROME);
        try {
            wc.setJavaScriptTimeout(5000);
            wc.getOptions().setUseInsecureSSL(true); // Accept any host, whether or not its certificate is valid
            wc.getOptions().setJavaScriptEnabled(true); // Enable JavaScript
            wc.getOptions().setCssEnabled(false); // Disable CSS support
            wc.getOptions().setThrowExceptionOnScriptError(false); // Do not throw when a script fails
            wc.getOptions().setTimeout(100000); // Connection timeout, in milliseconds
            wc.getOptions().setDoNotTrackEnabled(false);
            wc.getOptions().setActiveXNative(true);

            wc.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3");
            wc.addRequestHeader("Accept-Encoding", "gzip, deflate, br");
            wc.addRequestHeader("Accept-Language", "zh-CN,zh;q=0.9");
            wc.addRequestHeader("Connection", "keep-alive");
            wc.addRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36");

            //HtmlPage htmlpage = wc.getPage("http://127.0.0.1:8081/demo.html?companyName=testCompany");
            HtmlPage htmlpage = wc.getPage("http://127.0.0.1:8081/sign.html?companyName=testCompany&p=1");
            String res = htmlpage.asXml();
            // Process the source
            System.out.println(res);

            // HtmlForm form = htmlpage.getFormByName("f");
            // HtmlButton button = form.getButtonByName("btnDomName"); // Get the submit button
            // HtmlPage nextPage = button.click();
            // System.out.println("Waiting 2 seconds");
            // Thread.sleep(2000);
            // System.out.println(nextPage.asText());
        } finally {
            wc.close(); // Always release the client, even if the fetch throws
        }
    }
}

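HtmlUnit builds a simulated browser with new WebClient(), runs the page's JS against the fetched HTML, and returns the HTML source as it stands after the JS has executed.

In some special scenarios, however, the result is wrong. When capturing Base64 data drawn by a canvas, for example, the data did not match what the same page produces when run directly in a browser (a nasty pitfall that cost a lot of time).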

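3. Selenium

Add the Selenium jar. You also need to download chromedriver.exe. With both in place, the page can likewise be fetched after its JS has executed: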

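import java.io.File;
import java.io.IOException;
import java.util.Base64;

import org.apache.commons.io.FileUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class SeleniumCrawl { // class name is illustrative
    public static void main(String[] args) throws IOException {
        // Path to the chromedriver executable
        System.setProperty("webdriver.chrome.driver", "/srv/chromedriver.exe");
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");

        // Pass the options to the driver; without them the --headless flag is never applied
        WebDriver driver = new ChromeDriver(options);
        try {
            String url = "http://127.0.0.1:8080/index.html?companyName=testCompany";
            driver.get(url); // Open the target page

            // Print information about the current page
            System.out.println("Title:" + driver.getTitle());
            System.out.println("currentUrl:" + driver.getCurrentUrl());

            WebElement imgDom = driver.findElement(By.id("imgDom"));
            String text = imgDom.getText();
            System.out.println(text);

            // Drop a "data:image/png;base64," style prefix if present; with no comma,
            // indexOf returns -1 and the whole string is kept.
            String imgBase64 = text.substring(text.indexOf(',') + 1);
            // java.util.Base64 stands in for the author's Base64Util helper, which is not shown
            byte[] imageBytes = Base64.getDecoder().decode(imgBase64);
            FileUtils.writeByteArrayToFile(new File("/srv/charter44.png"), imageBytes);
        } finally {
            driver.quit(); // Close the browser and stop the chromedriver process
        }
    }
}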

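Selenium also works through a browser object, here created with new ChromeDriver(). Because it drives a real Chrome instance, the page's JS runs against the fetched HTML and the HTML source can be read after the JS has executed. The poor canvas support seen with HtmlUnit above is also resolved here.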
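The Base64 step in the Selenium example can be tried in isolation. The following is a minimal sketch using only java.util.Base64 from the JDK; it is not the article's Base64Util helper, whose source is not shown. The class name DataUrlDecoder, the decode method and the sample input are assumptions for illustration. It assumes the element text is either bare Base64 or a data URL of the kind canvas.toDataURL() produces, where the payload follows the first comma.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DataUrlDecoder {
    /** Strips an optional "data:...;base64," prefix and decodes the remaining Base64 payload. */
    public static byte[] decode(String dataUrl) {
        String trimmed = dataUrl.trim();
        int comma = trimmed.indexOf(',');
        // Only treat the text before the comma as a prefix when it really is a data URL.
        String payload = (trimmed.startsWith("data:") && comma >= 0)
                ? trimmed.substring(comma + 1)
                : trimmed;
        return Base64.getDecoder().decode(payload);
    }

    public static void main(String[] args) {
        // Illustrative input: the text "hello" as Base64, wrapped in a data URL.
        byte[] bytes = decode("data:text/plain;base64,aGVsbG8=");
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

For a real canvas capture the decoded bytes would be the PNG file content, which can then be written to disk as in the example above. Base64.getDecoder() throws IllegalArgumentException on malformed input, which gives an early signal when the captured text is not what was expected.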
