Excel VBA web scraping (how to pull data from a frame source into a spreadsheet automatically without using IE)
优采云 Published: 2022-03-28 21:18
How can I parse HTML in VBA without creating an Internet Explorer object?
Tags: html, vba, excel, internet-explorer
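I don't have Internet Explorer on any of the computers at work, so I can't create an Internet Explorer object and use ie.navigate to parse HTML and search for tags. My question is: how can I automatically pull specific tagged data from a frame source into a spreadsheet without using IE? A code example in the answer would be very useful :) Thanks

You can use XMLHTTP to retrieve a web page's HTML source: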
Function GetHTML(url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .Send
        GetHTML = .ResponseText
    End With
End Function
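I don't recommend using this as a worksheet function, because the site URL would be re-queried every time the sheet recalculates. Some websites have logic that detects scraping from frequent repeated calls, and your IP could be banned temporarily or permanently, depending on the site.
Once you have the source HTML string (preferably stored in a variable to avoid unnecessary repeat calls), you can use basic text functions to parse the string and search for your tag.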
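The advice about avoiding repeat calls could be sketched roughly as follows. This is only an illustration, not part of the original answer: the names GetHTMLCached and htmlCache are hypothetical, and it assumes the Scripting.Dictionary object from the Windows scripting runtime is available.

```vba
Option Explicit

' Hypothetical module-level cache: maps a URL to its downloaded HTML source.
Private htmlCache As Object

' Returns the HTML for url, sending a request only the first time a URL is seen.
Public Function GetHTMLCached(url As String) As String
    If htmlCache Is Nothing Then
        Set htmlCache = CreateObject("Scripting.Dictionary")
    End If

    If Not htmlCache.Exists(url) Then
        ' Same synchronous XMLHTTP request as in GetHTML
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", url, False
            .Send
            htmlCache.Add url, .ResponseText
        End With
    End If

    GetHTMLCached = htmlCache.Item(url)
End Function
```

Calling GetHTMLCached twice with the same URL sends a single request. The cache lives in a module-level variable, so it lasts only until the VBA project is reset.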
The basic function below, getTag, returns the value between <tag> and </tag>:
Public Function getTag(url As String, tag As String, Optional occurNum As Integer) As String
    Dim html As String, pStart As Long, pEnd As Long, o As Integer
    html = GetHTML(url)

    ' remove <> if they exist so we can add our own
    If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then
        tag = Mid(tag, 2, Len(tag) - 2)
    End If

    ' default to occurrence #1
    If occurNum = 0 Then occurNum = 1
    pEnd = 1

    For o = 1 To occurNum
        ' find the opening tag, starting at 1 (or after the previous occurrence)
        pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare)
        If pStart = 0 Then
            getTag = "{Not Found}"
            Exit Function
        End If
        pStart = pStart + Len("<" & tag & ">")

        ' find the first closing tag after the start
        pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
        If pEnd = 0 Then
            ' no closing tag: avoid calling Mid with a negative length
            getTag = "{Not Found}"
            Exit Function
        End If
    Next o

    ' return the string between start & end
    getTag = Mid(html, pStart, pEnd - pStart)
End Function
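This will only find basic <tag>s (opening tags with no attributes), but you can add, remove or change the text functions to suit your needs.
Example usage: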
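Sub findTagExample()
    Const testURL = "https://en.wikipedia.org/wiki/Web_scraping"

    ' search for the 2nd occurrence of tag <h2>, which is "Contents":
    Debug.Print getTag(testURL, "<h2>", 2)

    ' ...this returns the 8th occurrence, "Navigation menu":
    Debug.Print getTag(testURL, "<h2>", 8)

    ' ...and this returns HTML containing the title of the "Legal issues" section:
    Debug.Print getTag("https://en.wikipedia.org/wiki/Web_scraping", "<h2>", 4)
End Sub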
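As one example of changing the text functions, here is a hedged sketch of a variant that works on HTML already held in a string (so the page is not downloaded again for every lookup) and that also matches opening tags carrying attributes. GetTagFromHtml is a hypothetical name, not part of the original answer. Like getTag, it is plain string matching rather than a real HTML parser: it stops at the first closing tag, so nested tags of the same name, or a ">" inside an attribute value, will confuse it.

```vba
Option Explicit

' Hypothetical variant of getTag: searches an in-memory HTML string and
' accepts opening tags with attributes, e.g. <div class="x">.
' Returns "" when the requested occurrence is not found.
Public Function GetTagFromHtml(html As String, tagName As String, _
                               Optional occurNum As Long = 1) As String
    Dim pStart As Long, pEnd As Long, pos As Long, o As Long
    Dim nextChar As String

    If occurNum < 1 Then occurNum = 1
    pos = 1

    For o = 1 To occurNum
        ' Find "<tagName" followed by ">" or whitespace, so "<p" does not match "<pre"
        Do
            pStart = InStr(pos, html, "<" & tagName, vbTextCompare)
            If pStart = 0 Then Exit Function
            nextChar = Mid$(html, pStart + Len(tagName) + 1, 1)
            pos = pStart + 1
        Loop Until nextChar = ">" Or nextChar = " " Or nextChar = vbTab _
                Or nextChar = vbCr Or nextChar = vbLf

        ' Skip past the end of the opening tag
        pStart = InStr(pStart, html, ">")
        If pStart = 0 Then Exit Function
        pStart = pStart + 1

        ' Find the first closing tag after the opening tag
        pEnd = InStr(pStart, html, "</" & tagName & ">", vbTextCompare)
        If pEnd = 0 Then Exit Function
        pos = pEnd
    Next o

    GetTagFromHtml = Mid$(html, pStart, pEnd - pStart)
End Function

Sub GetTagFromHtmlExample()
    Dim sample As String
    sample = "<div class=""a"">one</div><div>two</div>"
    Debug.Print GetTagFromHtml(sample, "div")      ' prints: one
    Debug.Print GetTagFromHtml(sample, "div", 2)   ' prints: two
End Sub
```

In practice the html argument would be the string returned by GetHTML, fetched once and then reused for each lookup.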
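Anyone who has done web scraping will be familiar with creating an instance of Internet Explorer (IE), navigating to a web address and then, once the page is ready, using the "Microsoft HTML Object Library" (MSHTML) type library to navigate the DOM. The question asks what to do if IE is unavailable. I am in the same situation on my box, which runs Windows 10.

I had suspected it was possible to create an instance of MSHTML.HTMLDocument independently of IE, but how to create it is not obvious. Thanks to the questioner for asking this now. The answer lies in the IHTMLDocument4.createDocumentFromUrl method. It needs a local file to work on (EDIT: you can actually pass it a web URL as well!), but there is a nice, tidy Windows API function, URLDownloadToFile, for downloading a file.

This code runs on my Windows 10 machine, which runs Microsoft Edge rather than Internet Explorer. This is an important find, and thanks to the questioner for raising it.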
Option Explicit

'* Tools->References Microsoft HTML Object Library

'* MSDN - URLDownloadToFile function - https://msdn.microsoft.com/en-us/library/ms775123(v=vs.85).aspx
'* pCaller and lpfnCB are pointers, so they are declared As LongPtr to stay correct on 64-bit Office
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" _
    (ByVal pCaller As LongPtr, ByVal szURL As String, ByVal szFileName As String, _
    ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long

Sub Test()
    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")

    Dim sLocalFilename As String
    sLocalFilename = Environ$("TMP") & "\urlmon.html"

    Dim sURL As String
    sURL = "https://stackoverflow.com/users/3607273/s-meaden"

    '* URLDownloadToFile returns 0 (S_OK) on success
    Dim bOk As Boolean
    bOk = (URLDownloadToFile(0, sURL, sLocalFilename, 0, 0) = 0)
    If bOk Then
        If fso.FileExists(sLocalFilename) Then

            '* Tools->References Microsoft HTML Object Library
            Dim oHtml4 As MSHTML.IHTMLDocument4
            Set oHtml4 = New MSHTML.HTMLDocument

            Dim oHtml As MSHTML.HTMLDocument
            Set oHtml = Nothing

            '* IHTMLDocument4.createDocumentFromUrl
            '* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
            Set oHtml = oHtml4.createDocumentFromUrl(sLocalFilename, "")

            '* need to wait a little whilst the document parses
            '* because it is multithreaded
            While oHtml.readyState <> "complete"
                DoEvents  '* do not comment this out it is required to break into the code if in infinite loop
            Wend
            Debug.Assert oHtml.readyState = "complete"

            Dim sTest As String
            sTest = Left$(oHtml.body.outerHTML, 100)
            Debug.Assert Len(Trim(sTest)) > 50  '* just testing we got a substantial block of text, feel free to delete

            '* page specific logic goes here
            Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection
            Set htmlAnswers = oHtml.getElementsByClassName("answer-hyperlink")

            Dim lAnswerLoop As Long
            For lAnswerLoop = 0 To htmlAnswers.Length - 1
                Dim vAnswerLoop
                Set vAnswerLoop = htmlAnswers.Item(lAnswerLoop)
                Debug.Print vAnswerLoop.outerText
            Next

        End If
    End If
End Sub