Scraping Web Pages Online (How to Use This Class to Extract the Information You Need from a Web Page, with Results)

Published by 优采云: 2021-09-23 20:27

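While working on an earlier project that included a news-reading feature, I wrote a class that extracts information from web pages (for example the latest headlines, the news source, the title, and the content). This article shows how to use that class to capture the information you need from a page. The example extracts the post titles and links from the cnblogs (博客园) home page: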

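The figure shows the DOM tree of the cnblogs home page. Clearly, it is enough to extract every div whose class is post_item, and then extract from each of them the a tag whose class is titlelnk. This can be done with the following functions: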

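    /// <summary>
    /// Finds, in the HTML text, every tag named tagName whose attribute attrName has the value attrValue.
    /// Example: FindTagByAttr(html, "div", "class", "demo")
    /// returns all div tags whose class is demo.
    /// </summary>
    public static List<HtmlTag> FindTagByAttr(String html, String tagName, String attrName, String attrValue)
    {
        // NOTE: the original pattern was stripped when this page was extracted.
        // The pattern below is an assumed reconstruction: an opening tag that
        // carries attrName="attrValue" (single or double quotes).
        String format = String.Format(@"<{0}\s[^<>]*{1}\s*=\s*(\x27|\x22){2}(\x27|\x22)[^<>]*>",
            tagName, attrName, Regex.Escape(attrValue));
        return FindTag(html, tagName, format);
    }

    public static List<HtmlTag> FindTag(String html, String name, String format)
    {
        Regex reg = new Regex(format, RegexOptions.IgnoreCase);
        // Also an assumed reconstruction: any opening or closing tag with this name.
        // Group 1 captures the "/" of a closing tag.
        Regex tagReg = new Regex(String.Format(@"<(/?){0}(\s[^<>]*)?>", name), RegexOptions.IgnoreCase);
        List<HtmlTag> tags = new List<HtmlTag>();
        int start = 0;
        while (true)
        {
            Match match = reg.Match(html, start);
            if (match.Success)
            {
                start = match.Index + match.Length;
                Match tagMatch = null;
                int beginTagCount = 1; // the opening tag that was just matched
                // Scan forward, counting nested tags of the same name,
                // until the matching closing tag is reached.
                while (true)
                {
                    tagMatch = tagReg.Match(html, start);
                    if (!tagMatch.Success)
                    {
                        tagMatch = null;
                        break;
                    }
                    start = tagMatch.Index + tagMatch.Length;
                    if (tagMatch.Groups[1].Value == "/") beginTagCount--;
                    else beginTagCount++;
                    if (beginTagCount == 0) break;
                }
                if (tagMatch != null)
                {
                    // Inner HTML is everything between the opening tag and its closing tag.
                    HtmlTag tag = new HtmlTag(name, match.Value,
                        html.Substring(match.Index + match.Length, tagMatch.Index - match.Index - match.Length));
                    tags.Add(tag);
                }
                else
                {
                    break; // unbalanced markup: stop searching
                }
            }
            else
            {
                break; // no further matches
            }
        }
        return tags;
    }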
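To make the nesting counter easier to follow, here is a minimal, self-contained sketch of the same idea that returns only the inner HTML strings. The class name `NestingDemo`, the method `InnerHtmlOf`, and both regular expressions are assumptions made for illustration; they are not taken from the article's downloadable source.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestingDemo
{
    // Returns the inner HTML of every <tagName> whose attrName equals attrValue.
    // Both patterns are illustrative assumptions, not the article's originals.
    public static List<string> InnerHtmlOf(string html, string tagName, string attrName, string attrValue)
    {
        var open = new Regex(
            string.Format(@"<{0}\s[^<>]*{1}\s*=\s*[\x27\x22]{2}[\x27\x22][^<>]*>",
                Regex.Escape(tagName), Regex.Escape(attrName), Regex.Escape(attrValue)),
            RegexOptions.IgnoreCase);
        // Group 1 is "/" for a closing tag and empty for an opening tag.
        var any = new Regex(
            string.Format(@"<(/?){0}(\s[^<>]*)?>", Regex.Escape(tagName)),
            RegexOptions.IgnoreCase);

        var results = new List<string>();
        int start = 0;
        while (true)
        {
            Match first = open.Match(html, start);
            if (!first.Success) break;

            int contentStart = first.Index + first.Length;
            int pos = contentStart;
            int depth = 1;          // the opening tag that was just matched
            Match close = null;
            while (depth > 0)
            {
                Match m = any.Match(html, pos);
                if (!m.Success) { close = null; break; }   // unbalanced markup
                pos = m.Index + m.Length;
                depth += m.Groups[1].Value == "/" ? -1 : 1;
                close = m;
            }
            if (close == null) break;

            results.Add(html.Substring(contentStart, close.Index - contentStart));
            start = pos;            // continue after the matching closing tag
        }
        return results;
    }
}
```

For the input `<div class="post_item">A<div>inner</div>B</div>`, the counter rises to 2 at the nested `<div>`, falls back to 1 at its `</div>`, and reaches 0 only at the outer `</div>`, so the whole string `A<div>inner</div>B` is captured. Note that this regex-based approach does not handle self-closing tags, comments, or a class attribute with several values; a real HTML parser is more robust for such pages.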

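With the functions above you can extract the HTML tags you need. To do the actual scraping, you also need a function that downloads the web page: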

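    public static String GetHtml(string url)
    {
        try
        {
            HttpWebRequest req = WebRequest.Create(url) as HttpWebRequest;
            req.Timeout = 30 * 1000; // 30 seconds
            using (HttpWebResponse response = req.GetResponse() as HttpWebResponse)
            using (Stream stream = response.GetResponseStream())
            using (MemoryStream buffer = new MemoryStream())
            {
                Byte[] temp = new Byte[4096];
                int count = 0;
                // Copy the response body into memory in 4 KB chunks.
                while ((count = stream.Read(temp, 0, 4096)) > 0)
                {
                    buffer.Write(temp, 0, count);
                }
                // ToArray() returns only the bytes written; GetBuffer() would also
                // include the unused capacity and append garbage to the text.
                return Encoding.GetEncoding(response.CharacterSet).GetString(buffer.ToArray());
            }
        }
        catch
        {
            // Any network or decoding failure yields an empty string.
            return String.Empty;
        }
    }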
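One fragile spot in the download function is the decoding step: `Encoding.GetEncoding` throws when the response carries no charset or one the runtime does not know, and the function then returns an empty string. The sketch below shows one possible way to soften that by falling back to UTF-8. The class `BodyDecoder` and its `Decode` method are hypothetical names introduced here, and the UTF-8 default is an assumption that will not suit every site.

```csharp
using System;
using System.Text;

public static class BodyDecoder
{
    // Hypothetical helper: decodes a downloaded body, falling back to UTF-8
    // when the charset is missing or not recognized by the runtime.
    public static string Decode(byte[] body, string charset)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrWhiteSpace(charset))
        {
            try
            {
                // Some servers quote the value, e.g. charset="utf-8".
                encoding = Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
            }
            catch (ArgumentException)
            {
                // Unknown or empty charset name: keep the UTF-8 default.
            }
        }
        return encoding.GetString(body);
    }
}
```

Inside `GetHtml`, the return statement could then become something like `return BodyDecoder.Decode(buffer.ToArray(), response.CharacterSet);`. On newer .NET runtimes, legacy code pages such as gb2312 are only available after a code page provider is registered, so pages in those encodings would come out garbled under the UTF-8 fallback.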

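The following example fetches the post titles and links from the cnblogs home page to show how the HtmlTag class is used to scrape information from a web page: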

    class Program
    {
        static void Main(string[] args)
        {
            String html = HtmlTag.GetHtml("http://www.cnblogs.com");
            // The container that holds the post list: <div id="post_list">
            List<HtmlTag> tags = HtmlTag.FindTagByAttr(html, "div", "id", "post_list");
            if (tags.Count > 0)
            {
                // Each post is a <div class="post_item"> inside the container.
                List<HtmlTag> item_tags = tags[0].FindTagByAttr("div", "class", "post_item");
                foreach (HtmlTag item_tag in item_tags)
                {
                    // The title link is an <a class="titlelnk"> inside each post.
                    List<HtmlTag> a_tags = item_tag.FindTagByAttr("a", "class", "titlelnk");
                    if (a_tags.Count > 0)
                    {
                        Console.WriteLine("标题:{0}", a_tags[0].InnerHTML);
                        Console.WriteLine("链接:{0}", a_tags[0].GetAttribute("href"));
                        Console.WriteLine("");
                    }
                }
            }
        }
    }
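The program above relies on members of HtmlTag that the article does not list: the three-argument constructor, `InnerHTML`, and `GetAttribute`. As a rough guide, a minimal class with that shape might look like the sketch below. The name `HtmlTagSketch`, the `BeginTag` property, and the attribute pattern are assumptions for illustration only; the real class in the downloadable source may differ.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical minimal shape of the HtmlTag class used in the article.
public class HtmlTagSketch
{
    public string Name { get; private set; }
    public string BeginTag { get; private set; }   // e.g. <a class="titlelnk" href="...">
    public string InnerHTML { get; private set; }

    public HtmlTagSketch(string name, string beginTag, string innerHtml)
    {
        Name = name;
        BeginTag = beginTag;
        InnerHTML = innerHtml;
    }

    // Reads an attribute value from the opening tag; returns "" when absent.
    public string GetAttribute(string attrName)
    {
        Match m = Regex.Match(
            BeginTag,
            string.Format(@"\s{0}\s*=\s*(?:""([^""]*)""|'([^']*)')", Regex.Escape(attrName)),
            RegexOptions.IgnoreCase);
        if (!m.Success) return string.Empty;
        // Group 1 holds a double-quoted value, group 2 a single-quoted one.
        return m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
    }
}
```

The instance method `FindTagByAttr(tagName, attrName, attrValue)` called in the program presumably forwards to the static version with `InnerHTML` as the text to search, which is what lets the example narrow the search from the list container to each post and then to its title link.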

The result is as follows:

Source code download
