Browser Page Scraping (One of Several Ways to Get the Current Page Content from IE)

优采云 · Published: 2022-03-08 00:02

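There are probably many ways to get the current page content from the IE browser; this post covers one of them. The basic idea: when the user clicks on the current IE page, read the mouse coordinates, use them to find the window handle of the page under the cursor, and then use that handle with Win32 calls to retrieve the page content. The code: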

    private void timer1_Tick(object sender, EventArgs e)
    {
        lock (currentLock)
        {
            // Current cursor position in screen coordinates.
            System.Drawing.Point MousePoint = System.Windows.Forms.Form.MousePosition;
            if (_leftClick)
            {
                // A left click was detected: stop polling and resolve the window under the cursor.
                timer1.Stop();
                _leftClick = false;

                _lastDocument = GetHTMLDocumentFormHwnd(GetPointControl(MousePoint, false));
                if (_lastDocument != null)
                {
                    // Only process the page when capture has been requested
                    // (_getDocument is presumably set elsewhere in the form).
                    if (_getDocument)
                    {
                        _getDocument = true;
                        try
                        {
                            string url = _lastDocument.url;
                            string html = _lastDocument.documentElement.outerHTML;
                            string cookie = _lastDocument.cookie;
                            string domain = _lastDocument.domain;

                            var resolveParams = new ResolveParam
                            {
                                Url = new Uri(url),
                                Html = html,
                                PageCookie = cookie,
                                Domain = domain
                            };

                            RequetResove(resolveParams);
                        }
                        catch (Exception ex)
                        {
                            System.Windows.MessageBox.Show(ex.Message);
                            Console.WriteLine(ex.Message);
                            Console.WriteLine(ex.StackTrace);
                        }
                    }
                }
                else
                {
                    new MessageTip().Show("xx", "The current page is not an IE page, or a browser with a non-IE engine (such as Firefox or Sogou) is in use. Please open the page in IE.");
                }

                _getDocument = false;
            }
            else
            {
                // No click yet: keep the indicator form next to the cursor.
                _pointFrm.Left = MousePoint.X + 10;
                _pointFrm.Top = MousePoint.Y + 10;
            }
        }
    }

The call GetHTMLDocumentFormHwnd(GetPointControl(MousePoint, false)) in the handler above breaks down into two steps. First, get the page's window handle from the mouse coordinates:

    public static IntPtr GetPointControl(System.Drawing.Point p, bool allControl)
    {
        // Window under the given screen point.
        IntPtr handle = Win32APIsFull.WindowFromPoint(p);
        if (handle != IntPtr.Zero)
        {
            System.Drawing.Rectangle rect = default(System.Drawing.Rectangle);
            if (Win32APIsFull.GetWindowRect(handle, out rect))
            {
                // Make the point relative to the window's top-left corner,
                // then look up the child window at that position.
                return Win32APIsFull.ChildWindowFromPointEx(
                    handle,
                    new System.Drawing.Point(p.X - rect.X, p.Y - rect.Y),
                    allControl ? Win32APIsFull.CWP.ALL : Win32APIsFull.CWP.SKIPINVISIBLE);
            }
        }
        return IntPtr.Zero;
    }
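The handle found this way is not guaranteed to belong to an IE document view, which is why the handler falls back to a warning message. As a hedged sketch, one could check the window class before asking for the document. The helper name IsIeDocumentWindow and the class IeWindowCheck are hypothetical and not part of the original code. It assumes IE hosts its pages in a window of class "Internet Explorer_Server".

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper, not part of the original post.
public static class IeWindowCheck
{
    [DllImport("user32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern int GetClassName(IntPtr hWnd, StringBuilder lpClassName, int nMaxCount);

    // Returns true when the window's class name matches the class IE
    // uses for its document view (assumption stated in the lead-in).
    public static bool IsIeDocumentWindow(IntPtr hwnd)
    {
        if (hwnd == IntPtr.Zero)
        {
            return false;
        }

        var className = new StringBuilder(256);
        int length = GetClassName(hwnd, className, className.Capacity);
        return length > 0 && className.ToString() == "Internet Explorer_Server";
    }
}
```

A caller could skip the document lookup when this check returns false instead of relying only on a null result.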

Next, get the page content from the handle:

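    public static HTMLDocument GetHTMLDocumentFormHwnd(IntPtr hwnd)
    {
        // Unmanaged buffer that receives the message result. Four bytes matches
        // the int-based declarations used here, which assume a 32-bit process.
        IntPtr result = Marshal.AllocHGlobal(4);
        Object obj = null;
        try
        {
            // AllocHGlobal does not zero the buffer, so clear it before use.
            Marshal.WriteInt32(result, 0);

            // fuFlags = 2 is SMTO_ABORTIFHUNG; wait at most 1000 ms for a reply.
            Console.WriteLine(Win32APIsFull.SendMessageTimeoutA(hwnd, HTML_GETOBJECT_mid, 0, 0, 2, 1000, result));
            if (Marshal.ReadInt32(result) != 0)
            {
                // Convert the message result into a COM object reference.
                Console.WriteLine(Win32APIsFull.ObjectFromLresult(Marshal.ReadInt32(result), ref IID_IHTMLDocument, 0, out obj));
            }
        }
        finally
        {
            // Release the buffer even if one of the calls throws.
            Marshal.FreeHGlobal(result);
        }

        return obj as HTMLDocument;
    }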

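How it works:

A message is sent to the IE window, and the result that comes back refers to an unmanaged object inside the IE browser process. The HTMLDocument object is then obtained from that result.

This approach relies on two Win32 functions: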

    // Requires: using System; using System.Runtime.InteropServices;

    [System.Runtime.InteropServices.DllImportAttribute("user32.dll", EntryPoint = "SendMessageTimeoutA")]
    public static extern int SendMessageTimeoutA(
        [InAttribute()] System.IntPtr hWnd,
        uint Msg, uint wParam, int lParam,
        uint fuFlags,
        uint uTimeout,
        System.IntPtr lpdwResult);

    [System.Runtime.InteropServices.DllImportAttribute("oleacc.dll", EntryPoint = "ObjectFromLresult")]
    public static extern int ObjectFromLresult(
        int lResult,
        ref Guid riid,
        int wParam,
        [MarshalAs(UnmanagedType.IDispatch), Out] out Object pObject);
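The code in this post uses HTML_GETOBJECT_mid and IID_IHTMLDocument without showing where they are defined. A minimal sketch of how they might be set up follows. It assumes HTML_GETOBJECT_mid is the id of the registered window message "WM_HTML_GETOBJECT", and that IID_IHTMLDocument is the interface ID of IHTMLDocument. The holder class IeInteropConstants is hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical holder class; the original post does not show where these values live.
public static class IeInteropConstants
{
    [DllImport("user32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern uint RegisterWindowMessage(string lpString);

    // Assumption: the message id is obtained at run time by registering
    // the well-known message name rather than being a fixed constant.
    public static readonly uint HTML_GETOBJECT_mid = RegisterWindowMessage("WM_HTML_GETOBJECT");

    // Interface ID of IHTMLDocument. Left non-readonly so it can be
    // passed with `ref` to ObjectFromLresult.
    public static Guid IID_IHTMLDocument = new Guid("626FC520-A41E-11CF-A731-00A0C9082637");
}
```

Registering the message name once and caching the id keeps the per-click path to a single SendMessageTimeoutA call.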
