Scraping web data with Excel VBA (how to pull data from a frame source into a spreadsheet automatically without using IE)

优采云 Published: 2022-03-28 21:18


How do I parse HTML in VBA without creating an Internet Explorer object?

Tags: html, vba, excel, internet-explorer


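I don't have Internet Explorer on any of my work computers, so I can't create an Internet Explorer object and use ie.navigate to parse HTML and search for tags. My question is: how can I automatically pull specific tagged data from a frame source into a spreadsheet without using IE? A code example in the answer would be very useful :) Thanks

You can use XMLHTTP to retrieve the HTML source of a web page: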

Function GetHTML(url As String) As String
    ' Synchronous GET request; returns the response body as text
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .Send
        GetHTML = .ResponseText
    End With
End Function

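I don't recommend using this as a worksheet function, because the site URL would be queried again every time the sheet recalculates. Some websites have logic that detects scraping from frequent repeated calls, and your IP could be banned temporarily or permanently, depending on the site.

Once you have the source HTML string (preferably stored in a variable to avoid unnecessary repeated calls), you can use basic text functions to parse the string and search for tags.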

This basic function returns the value between <tag> and </tag>:

Public Function getTag(url As String, ByVal tag As String, Optional ByVal occurNum As Integer) As String
    Dim html As String, pStart As Long, pEnd As Long, o As Integer
    html = GetHTML(url)

    ' remove <> if they exist so we can add our own
    If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then
        tag = Mid(tag, 2, Len(tag) - 2)
    End If

    ' default to occurrence #1
    If occurNum = 0 Then occurNum = 1
    pEnd = 1

    For o = 1 To occurNum
        ' find start <tag> beginning at 1 (or after previous occurrence)
        pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare)
        If pStart = 0 Then
            getTag = "{Not Found}"
            Exit Function
        End If
        pStart = pStart + Len("<" & tag & ">")

        ' find first end </tag> after start <tag>
        pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
        If pEnd = 0 Then   ' no closing tag: avoid a negative length in Mid
            getTag = "{Not Found}"
            Exit Function
        End If
    Next o

    ' return string between start <tag> & end </tag>
    getTag = Mid(html, pStart, pEnd - pStart)
End Function
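Because getTag calls GetHTML on every invocation, extracting several tags from the same page means several requests, which is exactly what the warning above advises against. A minimal sketch of one way around that, assuming the page has already been downloaded into a string: getTagFromHtml and getTagFromHtmlExample are hypothetical names, not part of the original answer, and the HTML literal is made-up test data.

```vba
' Hypothetical helper (not in the original answer): the same text search as
' getTag, but run against HTML that has already been downloaded.
Public Function getTagFromHtml(ByVal html As String, ByVal tag As String, _
                               Optional ByVal occurNum As Long = 1) As String
    Dim openTag As String, closeTag As String
    Dim pStart As Long, pEnd As Long, o As Long

    openTag = "<" & tag & ">"
    closeTag = "</" & tag & ">"
    If occurNum < 1 Then occurNum = 1
    pEnd = 1

    For o = 1 To occurNum
        ' find the next opening tag, starting after the previous match
        pStart = InStr(pEnd, html, openTag, vbTextCompare)
        If pStart = 0 Then
            getTagFromHtml = "{Not Found}"
            Exit Function
        End If
        pStart = pStart + Len(openTag)

        ' find the first closing tag after it
        pEnd = InStr(pStart, html, closeTag, vbTextCompare)
        If pEnd = 0 Then
            getTagFromHtml = "{Not Found}"
            Exit Function
        End If
    Next o

    getTagFromHtml = Mid$(html, pStart, pEnd - pStart)
End Function

Sub getTagFromHtmlExample()
    Dim html As String
    ' In real use this would be a single call: html = GetHTML(someUrl)
    html = "<html><head><title>Demo</title></head>" & _
           "<body><p>one</p><p>two</p></body></html>"

    Debug.Print getTagFromHtml(html, "title")   ' prints Demo
    Debug.Print getTagFromHtml(html, "p", 2)    ' prints two
End Sub
```

Passing the string rather than the URL keeps the network call in one place, so the number of lookups no longer affects how often the site is queried.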


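This only finds basic <tag> elements (an opening tag with no attributes), but you can add, remove or change the text functions to suit your needs.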
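As one illustration of such a change, the sketch below also accepts opening tags that carry attributes, such as <p class="a">. It is a hypothetical variant (getTagWithAttrs is not from the original answer) and it stays a plain text search: it would be confused by a ">" inside an attribute value or by nested tags of the same name.

```vba
' Hypothetical variant (not in the original answer): returns the text between
' the first <tag ...> and the following </tag>, allowing attributes.
Public Function getTagWithAttrs(ByVal html As String, ByVal tag As String) As String
    Dim pOpen As Long, pStart As Long, pEnd As Long
    Dim nextChar As String

    getTagWithAttrs = "{Not Found}"

    pOpen = InStr(1, html, "<" & tag, vbTextCompare)
    Do While pOpen > 0
        ' The character after "<tag" must end the tag name,
        ' so that searching for "p" does not match "<pre".
        nextChar = Mid$(html, pOpen + Len(tag) + 1, 1)
        If nextChar = ">" Or nextChar = " " Or nextChar = vbTab _
           Or nextChar = vbCr Or nextChar = vbLf Then Exit Do
        pOpen = InStr(pOpen + 1, html, "<" & tag, vbTextCompare)
    Loop
    If pOpen = 0 Then Exit Function

    ' skip to the end of the opening tag
    pStart = InStr(pOpen, html, ">")
    If pStart = 0 Then Exit Function
    pStart = pStart + 1

    pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
    If pEnd = 0 Then Exit Function

    getTagWithAttrs = Mid$(html, pStart, pEnd - pStart)
End Function

Sub getTagWithAttrsExample()
    ' made-up test data
    Debug.Print getTagWithAttrs("<pre>x</pre><p class=""a"">hello</p>", "p")   ' prints hello
End Sub
```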

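Usage example for getTag:

Sub findTagExample()
    Const testURL = "https://en.wikipedia.org/wiki/Web_scraping"

    ' search for the 2nd occurrence of the tag <h2>, i.e. "Contents":
    Debug.Print getTag(testURL, "<h2>", 2)

    ' ...this returns the 8th occurrence, "Navigation menu":
    Debug.Print getTag(testURL, "<h2>", 8)

    ' ...and this returns HTML containing the title of the "Legal issues" section:
    Debug.Print getTag("https://en.wikipedia.org/wiki/Web_scraping", "<h2>", 4)
End Sub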

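Anyone who has done web scraping will be familiar with creating an Internet Explorer (IE) instance, navigating to a web address and then, once the page is ready, using the "Microsoft HTML Object Library" (MSHTML) type library to navigate the DOM. The question is what to do if IE is not available. I have the same situation on my box, which runs Windows 10.

I had suspected that it was possible to create an instance of MSHTML.HTMLDocument independently of IE, but how to create it was not obvious. Thanks to the questioner for asking this now. The answer lies in the IHTMLDocument4.createDocumentFromUrl method. It needs a local file to work with (edit: you can actually pass it a web URL as well!), but we have a nice, tidy Windows API function, URLDownloadToFile, to download a file.

This code runs on my Windows 10 machine, which runs Microsoft Edge rather than Internet Explorer. This is an important find; thanks to the questioner for raising it.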

Option Explicit

'* Tools -> References -> Microsoft HTML Object Library
'* MSDN - URLDownloadToFile function - https://msdn.microsoft.com/en-us/library/ms775123(v=vs.85).aspx
'* pCaller and lpfnCB are pointers, so they are declared As LongPtr (VBA7)
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" _
    (ByVal pCaller As LongPtr, ByVal szURL As String, ByVal szFileName As String, _
     ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long

Sub Test()
    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")

    Dim sLocalFilename As String
    sLocalFilename = Environ$("TMP") & "\urlmon.html"

    Dim sURL As String
    sURL = "https://stackoverflow.com/users/3607273/s-meaden"

    '* URLDownloadToFile returns 0 (S_OK) on success
    Dim bOk As Boolean
    bOk = (URLDownloadToFile(0, sURL, sLocalFilename, 0, 0) = 0)

    If bOk Then
        If fso.FileExists(sLocalFilename) Then
            Dim oHtml4 As MSHTML.IHTMLDocument4
            Set oHtml4 = New MSHTML.HTMLDocument

            Dim oHtml As MSHTML.HTMLDocument
            Set oHtml = Nothing

            '* IHTMLDocument4.createDocumentFromUrl
            '* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
            Set oHtml = oHtml4.createDocumentFromUrl(sLocalFilename, "")

            '* need to wait a little whilst the document parses
            '* because it is multithreaded
            While oHtml.readyState <> "complete"
                DoEvents  '* do not comment this out; it is required to break into the code if in an infinite loop
            Wend
            Debug.Assert oHtml.readyState = "complete"

            Dim sTest As String
            sTest = Left$(oHtml.body.outerHTML, 100)
            Debug.Assert Len(Trim(sTest)) > 50  '* just testing we got a substantial block of text, feel free to delete

            '* page specific logic goes here
            Dim htmlAnswers As Object  'MSHTML.DispHTMLElementCollection
            Set htmlAnswers = oHtml.getElementsByClassName("answer-hyperlink")

            Dim lAnswerLoop As Long
            Dim vAnswerLoop As Object
            For lAnswerLoop = 0 To htmlAnswers.Length - 1
                Set vAnswerLoop = htmlAnswers.Item(lAnswerLoop)
                Debug.Print vAnswerLoop.outerText
            Next
        End If
    End If
End Sub

