Scraping Web Pages from the Browser (One of Several Ways to Get the Current Page Content from IE)

优采云 Published: 2021-11-05 06:11


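  There are probably many ways to get the content of the current page from Internet Explorer; this article covers one of them. The basic idea: when the user clicks on an IE page, read the mouse coordinates, use them to find the handle of the window under the cursor, and then pass that handle to Win32 calls to retrieve the page content. Readers who are interested can use this article as a reference.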

    private void timer1_Tick(object sender, EventArgs e)
    {
        lock (currentLock)
        {
            System.Drawing.Point MousePoint = System.Windows.Forms.Form.MousePosition;
            if (_leftClick)
            {
                // A click was detected: stop polling and resolve the page under the cursor.
                timer1.Stop();
                _leftClick = false;
                _lastDocument = GetHTMLDocumentFormHwnd(GetPointControl(MousePoint, false));
                if (_lastDocument != null)
                {
                    // Only extract when the _getDocument flag is set
                    // (presumably assigned outside this listing).
                    if (_getDocument)
                    {
                        _getDocument = true;
                        try
                        {
                            string url = _lastDocument.url;
                            string html = _lastDocument.documentElement.outerHTML;
                            string cookie = _lastDocument.cookie;
                            string domain = _lastDocument.domain;
                            var resolveParams = new ResolveParam
                            {
                                Url = new Uri(url),
                                Html = html,
                                PageCookie = cookie,
                                Domain = domain
                            };
                            RequetResove(resolveParams);
                        }
                        catch (Exception ex)
                        {
                            System.Windows.MessageBox.Show(ex.Message);
                            Console.WriteLine(ex.Message);
                            Console.WriteLine(ex.StackTrace);
                        }
                    }
                }
                else
                {
                    new MessageTip().Show("xx", "The current page is not an IE page, or it is rendered by a non-IE browser engine such as Firefox or Sogou. Please open the page in IE.");
                }
                _getDocument = false;
            }
            else
            {
                // No click yet: keep the pointer indicator form next to the cursor.
                _pointFrm.Left = MousePoint.X + 10;
                _pointFrm.Top = MousePoint.Y + 10;
            }
        }
    }

  Breaking down the call GetHTMLDocumentFormHwnd(GetPointControl(MousePoint, false)) from the handler: first, the window handle is obtained from the mouse coordinates:

    public static IntPtr GetPointControl(System.Drawing.Point p, bool allControl)
    {
        // Window that contains the given screen point.
        IntPtr handle = Win32APIsFull.WindowFromPoint(p);
        if (handle != IntPtr.Zero)
        {
            System.Drawing.Rectangle rect = default(System.Drawing.Rectangle);
            if (Win32APIsFull.GetWindowRect(handle, out rect))
            {
                // Make the point relative to the window's top-left corner,
                // then look for a child window at that position.
                return Win32APIsFull.ChildWindowFromPointEx(
                    handle,
                    new System.Drawing.Point(p.X - rect.X, p.Y - rect.Y),
                    allControl ? Win32APIsFull.CWP.ALL : Win32APIsFull.CWP.SKIPINVISIBLE);
            }
        }
        return IntPtr.Zero;
    }
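  The article does not show the Win32APIsFull class that GetPointControl calls into. The following is a minimal sketch of what such declarations could look like, not the author's actual code. It uses explicit POINT and RECT structs rather than System.Drawing.Point and Rectangle, so call sites in the listing would need converting. The helper ToWindowRelative is a hypothetical name added for illustration.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical reconstruction; the article's real Win32APIsFull is not shown.
public static class Win32APIsFull
{
    // Flags for ChildWindowFromPointEx (values from winuser.h).
    [Flags]
    public enum CWP : uint
    {
        ALL = 0x0000,
        SKIPINVISIBLE = 0x0001,
        SKIPDISABLED = 0x0002,
        SKIPTRANSPARENT = 0x0004
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct POINT { public int X; public int Y; }

    // Native RECT stores edges (left, top, right, bottom), not width and height.
    [StructLayout(LayoutKind.Sequential)]
    public struct RECT { public int Left; public int Top; public int Right; public int Bottom; }

    [DllImport("user32.dll")]
    public static extern IntPtr WindowFromPoint(POINT point);

    [DllImport("user32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool GetWindowRect(IntPtr hWnd, out RECT rect);

    [DllImport("user32.dll")]
    public static extern IntPtr ChildWindowFromPointEx(IntPtr hWndParent, POINT point, CWP flags);

    // Illustrative helper: offset a screen point by the window's top-left corner,
    // mirroring the subtraction done in GetPointControl.
    public static POINT ToWindowRelative(POINT screen, RECT windowRect)
    {
        return new POINT { X = screen.X - windowRect.Left, Y = screen.Y - windowRect.Top };
    }
}
```

  One thing worth noting: if GetWindowRect is declared with a System.Drawing.Rectangle, as the listing suggests, the X and Y values line up with the native left and top fields, but Width and Height would receive the right and bottom edges. The listing only reads X and Y, so it is unaffected.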

  Next, the page content is obtained from the handle:

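    public static HTMLDocument GetHTMLDocumentFormHwnd(IntPtr hwnd)
    {
        // The native result slot is pointer-sized, so allocate IntPtr.Size bytes
        // (a fixed 4 bytes would be overrun in a 64-bit process).
        IntPtr result = Marshal.AllocHGlobal(IntPtr.Size);
        Object obj = null;
        try
        {
            // Zero the slot so a failed or timed-out send is not read as a result.
            Marshal.WriteIntPtr(result, IntPtr.Zero);
            // fuFlags 2 = SMTO_ABORTIFHUNG; timeout is 1000 ms.
            Console.WriteLine(Win32APIsFull.SendMessageTimeoutA(hwnd, HTML_GETOBJECT_mid, 0, 0, 2, 1000, result));
            if (Marshal.ReadInt32(result) != 0)
            {
                // Convert the message result into a COM reference to the document.
                Console.WriteLine(Win32APIsFull.ObjectFromLresult(Marshal.ReadInt32(result), ref IID_IHTMLDocument, 0, out obj));
            }
        }
        finally
        {
            Marshal.FreeHGlobal(result);
        }
        return obj as HTMLDocument;
    }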
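  The listing uses HTML_GETOBJECT_mid and IID_IHTMLDocument without defining them. A common way to obtain them is sketched below, assuming the message id comes from RegisterWindowMessage("WM_HTML_GETOBJECT") and the GUID is the IHTMLDocument interface id from mshtml.h (worth verifying against the header). The class name IeInterop and the helper IsIeServerWindow are hypothetical; the latter checks for the "Internet Explorer_Server" window class before any message is sent.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper class; names other than the Win32 functions are assumptions.
public static class IeInterop
{
    // The literal 2 passed as fuFlags in the listing.
    public const uint SMTO_ABORTIFHUNG = 0x0002;

    // Window class of the control that hosts an IE document.
    public const string IeServerClassName = "Internet Explorer_Server";

    // Interface id of IHTMLDocument; not readonly so it can be passed by ref.
    public static Guid IID_IHTMLDocument = new Guid("626FC520-A41E-11CF-A731-00A0C9082637");

    [DllImport("user32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern uint RegisterWindowMessage(string lpString);

    [DllImport("user32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern int GetClassName(IntPtr hWnd, StringBuilder lpClassName, int nMaxCount);

    // Registering the same string again returns the same id within a session.
    public static uint GetHtmlGetObjectMessageId()
    {
        return RegisterWindowMessage("WM_HTML_GETOBJECT");
    }

    // True when the handle belongs to an IE document host window.
    public static bool IsIeServerWindow(IntPtr hwnd)
    {
        if (hwnd == IntPtr.Zero) return false;
        var name = new StringBuilder(256);
        return GetClassName(hwnd, name, name.Capacity) > 0
            && name.ToString() == IeServerClassName;
    }
}
```

  With something like this in place, HTML_GETOBJECT_mid could be initialized once from GetHtmlGetObjectMessageId(), and the class-name check could run before the message is sent, so that non-IE windows are rejected early.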

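  How it works:

  A message is sent to the IE window, and the reply is written to an unmanaged memory block. That reply is a value referring to the document inside IE, and ObjectFromLresult converts it into an HTMLDocument object.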

  This approach relies on two Win32 functions:

    // Note: in the native API, wParam, lParam and the return value are pointer-sized;
    // the 32-bit types here match the calls in GetHTMLDocumentFormHwnd.
    [System.Runtime.InteropServices.DllImportAttribute("user32.dll", EntryPoint = "SendMessageTimeoutA")]
    public static extern int SendMessageTimeoutA(
        [InAttribute()] System.IntPtr hWnd,
        uint Msg,
        uint wParam,
        int lParam,
        uint fuFlags,
        uint uTimeout,
        System.IntPtr lpdwResult);

    // Converts the result of WM_HTML_GETOBJECT into a COM object reference.
    [System.Runtime.InteropServices.DllImportAttribute("oleacc.dll", EntryPoint = "ObjectFromLresult")]
    public static extern int ObjectFromLresult(
        int lResult,
        ref Guid riid,
        int wParam,
        [MarshalAs(UnmanagedType.IDispatch), Out] out Object pObject);
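  The declarations above use 32-bit integers where the native signatures take pointer-sized values. As a hedged alternative, the sketch below declares both functions with IntPtr and UIntPtr so that the same code could run in a 64-bit process. The class name Win32Portable and the wrapper GetDocumentObject are hypothetical, and the wrapper returns a plain object that the caller would still cast to HTMLDocument.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical pointer-sized variants of the two declarations; not the author's code.
public static class Win32Portable
{
    [DllImport("user32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    public static extern IntPtr SendMessageTimeout(
        IntPtr hWnd, uint Msg, UIntPtr wParam, IntPtr lParam,
        uint fuFlags, uint uTimeout, out IntPtr lpdwResult);

    [DllImport("oleacc.dll")]
    public static extern int ObjectFromLresult(
        IntPtr lResult, ref Guid riid, UIntPtr wParam,
        [MarshalAs(UnmanagedType.IUnknown)] out object ppvObject);

    // Sends the given registered message and converts the reply into a COM object.
    // Returns null when the handle is zero, the send fails, or the conversion fails.
    public static object GetDocumentObject(IntPtr hwnd, uint messageId, Guid iid)
    {
        if (hwnd == IntPtr.Zero) return null;

        IntPtr lResult;
        // 0x0002 = SMTO_ABORTIFHUNG, 1000 ms timeout, as in the listing.
        IntPtr sent = SendMessageTimeout(
            hwnd, messageId, UIntPtr.Zero, IntPtr.Zero, 0x0002, 1000, out lResult);
        if (sent == IntPtr.Zero || lResult == IntPtr.Zero) return null;

        object document;
        int hr = ObjectFromLresult(lResult, ref iid, UIntPtr.Zero, out document);
        return hr >= 0 ? document : null; // negative HRESULT means failure
    }
}
```

  Because the result is returned through an out parameter, this variant needs no manual AllocHGlobal or FreeHGlobal call.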

  That covers how to get the current page from IE in C#.
