Scraping Web Pages from the Browser (One Way to Read the Current Page from IE)
优采云 Published: 2021-11-07 03:26
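There are probably many ways to get the content of the page currently open in IE; this post covers one of them. The basic idea: when the user clicks on the IE page, read the mouse coordinates, find the window handle under that point, and then use Win32 calls on that handle to retrieve the page content. The code: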
private void timer1_Tick(object sender, EventArgs e)
{
    lock (currentLock)
    {
        System.Drawing.Point MousePoint = System.Windows.Forms.Form.MousePosition;
        if (_leftClick)
        {
            // A click was detected: stop polling and resolve the window under the cursor.
            timer1.Stop();
            _leftClick = false;

            _lastDocument = GetHTMLDocumentFormHwnd(GetPointControl(MousePoint, false));
            if (_lastDocument != null)
            {
                if (_getDocument)
                {
                    _getDocument = true;
                    try
                    {
                        // Read everything needed from the live document in one place.
                        string url = _lastDocument.url;
                        string html = _lastDocument.documentElement.outerHTML;
                        string cookie = _lastDocument.cookie;
                        string domain = _lastDocument.domain;

                        var resolveParams = new ResolveParam
                        {
                            Url = new Uri(url),
                            Html = html,
                            PageCookie = cookie,
                            Domain = domain
                        };

                        RequetResove(resolveParams);
                    }
                    catch (Exception ex)
                    {
                        System.Windows.MessageBox.Show(ex.Message);
                        Console.WriteLine(ex.Message);
                        Console.WriteLine(ex.StackTrace);
                    }
                }
            }
            else
            {
                new MessageTip().Show("xx", "The current page is not an IE page, or the browser does not use the IE engine (for example Firefox or Sogou). Please open the page in IE.");
            }

            _getDocument = false;
        }
        else
        {
            // No click yet: keep the small indicator form next to the cursor.
            _pointFrm.Left = MousePoint.X + 10;
            _pointFrm.Top = MousePoint.Y + 10;
        }
    }
}
The call GetHTMLDocumentFormHwnd(GetPointControl(MousePoint, false)) breaks down into two steps. First, get the window handle from the mouse coordinates:
public static IntPtr GetPointControl(System.Drawing.Point p, bool allControl)
{
    // Top-level lookup: the window that contains the screen point.
    IntPtr handle = Win32APIsFull.WindowFromPoint(p);
    if (handle != IntPtr.Zero)
    {
        System.Drawing.Rectangle rect = default(System.Drawing.Rectangle);
        if (Win32APIsFull.GetWindowRect(handle, out rect))
        {
            // Make the point relative to the window's top-left corner, then look for a child under it.
            // ChildWindowFromPointEx expects client coordinates, so this offset is exact only
            // when the window has no non-client border.
            return Win32APIsFull.ChildWindowFromPointEx(handle, new System.Drawing.Point(p.X - rect.X, p.Y - rect.Y), allControl ? Win32APIsFull.CWP.ALL : Win32APIsFull.CWP.SKIPINVISIBLE);
        }
    }
    return IntPtr.Zero;
}
Next, get the page content from the handle:
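public static HTMLDocument GetHTMLDocumentFormHwnd(IntPtr hwnd)
{
    // SendMessageTimeout writes a pointer-sized value, so reserve IntPtr.Size bytes
    // (4 is too small in a 64-bit process) and clear them before use.
    IntPtr result = Marshal.AllocHGlobal(IntPtr.Size);
    Object obj = null;
    try
    {
        Marshal.WriteIntPtr(result, IntPtr.Zero);

        // 2 = SMTO_ABORTIFHUNG, 1000 = timeout in milliseconds.
        Console.WriteLine(Win32APIsFull.SendMessageTimeoutA(hwnd, HTML_GETOBJECT_mid, 0, 0, 2, 1000, result));
        if (Marshal.ReadInt32(result) != 0)
        {
            // The declared signature takes an int, so only the low 32 bits are passed on.
            Console.WriteLine(Win32APIsFull.ObjectFromLresult(Marshal.ReadInt32(result), ref IID_IHTMLDocument, 0, out obj));
        }
    }
    finally
    {
        // Release the unmanaged buffer even if one of the calls throws.
        Marshal.FreeHGlobal(result);
    }

    return obj as HTMLDocument;
}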
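General principle:
Send a message to the IE window to obtain an unmanaged reference to a block of memory inside the IE process, then turn that reference into an HTMLDocument object.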
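The listings use HTML_GETOBJECT_mid and IID_IHTMLDocument without showing where they come from. The sketch below is one plausible way to define them, not necessarily what the original project does: it assumes the message is the registered window message "WM_HTML_GETOBJECT" and that the interface ID is the standard IID_IHTMLDocument GUID. The class name HtmlObjectConstants is hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical holder for the two values the listings reference but never define.
internal static class HtmlObjectConstants
{
    [DllImport("user32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern uint RegisterWindowMessage(string lpString);

    // Assumption: the message is the one registered under the name "WM_HTML_GETOBJECT".
    // RegisterWindowMessage returns the same ID for the same string within a session,
    // so resolving it on each access is safe.
    public static uint HTML_GETOBJECT_mid
    {
        get { return RegisterWindowMessage("WM_HTML_GETOBJECT"); }
    }

    // IID_IHTMLDocument. Left non-readonly so it can be passed with "ref".
    public static Guid IID_IHTMLDocument = new Guid("626FC520-A41E-11CF-A731-00A0C9082637");
}
```

The field is deliberately not readonly, because a readonly static field cannot be passed by ref to ObjectFromLresult.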
GetHTMLDocumentFormHwnd relies on two Win32 functions:
[System.Runtime.InteropServices.DllImportAttribute("user32.dll", EntryPoint = "SendMessageTimeoutA")]
public static extern int SendMessageTimeoutA(
    [InAttribute()] System.IntPtr hWnd,
    uint Msg,
    uint wParam,
    int lParam,
    uint fuFlags,
    uint uTimeout,
    System.IntPtr lpdwResult);

[System.Runtime.InteropServices.DllImportAttribute("oleacc.dll", EntryPoint = "ObjectFromLresult")]
public static extern int ObjectFromLresult(
    int lResult,
    ref Guid riid,
    int wParam,
    [MarshalAs(UnmanagedType.IDispatch), Out] out Object pObject);
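The declarations above use 32-bit int for values that Win32 defines as pointer-sized (WPARAM, LPARAM, LRESULT), which can truncate them in a 64-bit process. As a hedged alternative, the sketch below declares the same two functions with IntPtr and wraps them in one helper. The names IeDocumentSketch and GetDocumentObject are hypothetical; the message ID and interface ID are passed in by the caller, and the helper returns a plain object so that it does not depend on the mshtml interop assembly.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical pointer-size-safe variant of the two P/Invoke declarations.
internal static class IeDocumentSketch
{
    private const uint SMTO_ABORTIFHUNG = 0x0002;

    [DllImport("user32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr SendMessageTimeout(
        IntPtr hWnd, uint Msg, IntPtr wParam, IntPtr lParam,
        uint fuFlags, uint uTimeout, out IntPtr lpdwResult);

    [DllImport("oleacc.dll")]
    private static extern int ObjectFromLresult(
        IntPtr lResult, ref Guid riid, IntPtr wParam,
        [MarshalAs(UnmanagedType.IUnknown)] out object ppvObject);

    // Returns the COM object behind the window, or null if the window does not answer.
    public static object GetDocumentObject(IntPtr hwnd, uint getObjectMessage, Guid iid)
    {
        IntPtr lResult;
        IntPtr sent = SendMessageTimeout(
            hwnd, getObjectMessage, IntPtr.Zero, IntPtr.Zero,
            SMTO_ABORTIFHUNG, 1000, out lResult);

        // A zero return means the call failed or timed out; a zero lResult means
        // the window did not hand back an object reference.
        if (sent == IntPtr.Zero || lResult == IntPtr.Zero)
        {
            return null;
        }

        object doc;
        int hr = ObjectFromLresult(lResult, ref iid, IntPtr.Zero, out doc);

        // Negative HRESULT values indicate failure.
        return hr >= 0 ? doc : null;
    }
}
```

A caller would cast the result, for example `GetDocumentObject(hwnd, HTML_GETOBJECT_mid, IID_IHTMLDocument) as HTMLDocument`, which mirrors the cast at the end of GetHTMLDocumentFormHwnd.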