网页抓取数据(_)

优采云发布时间: 2022-02-05 08:04

　　网页抓取数据(_)

　　curl_setopt($ch, CURLOPT_HEADER, 1);

　　//我们不需要页面内容

　　// curl_setopt($ch, CURLOPT_NOBODY, 1);

　　// 转到结果而不是输出它

　　curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

　　$html = curl_exec($ch);

　　$info = curl_getinfo($ch);

　　如果（$html === 假）{

　　回显“卷曲错误：”。 curl_error($ch);

　　}

　　curl_close($ch);

　　$linkarr = _striplinks($html);

　　//主机本地，待补

　　$host = '#39;;

　　if (is_array($linkarr)) {

　　foreach ($linkarr as $k => $v) {

　　$linkresult[$k] = _expandlinks($v, $host);

　　}

　　printf("

　　此页面上的所有链接是：

　　%s

　　n", var_export($linkresult , true));

　　function.php的内容如下（即前两篇文章中两个函数的集合）：

　　函数 _striplinks($document) {

　　preg_match_all("']+))'isx", $document, $links);

　　// 连接条件子模式中的非空匹配项

　　while (list($key, $val) = each($links[2])) {

　　如果 (!empty($val))

　　$match[] = $val;

　　} while (list($key, $val) = each($links[3])) {

　　如果 (!empty($val))

　　$match[] = $val;

　　}

　　//返回链接

　　返回$匹配；

　　}

　　/*============================================== = ========================*

　　函数：_expandlinks

　　目的：将每个链接扩展成一个完全限定的 URL

　　输入：$链接要限定的链接

　　$URI 获取基础的完整 URI

　　输出：$expandedLinks 展开的链接

　　*=============================================== ==== ========================*/

　　函数_expandlinks($links,$URI)

　　{

　　$URI_PARTS = parse_url($URI);

　　$host = $URI_PARTS["host"];

　　preg_match("/^[^?]+/",$URI,$match);

　　$match = preg_replace("|/[^/.]+.[^/.]+$|","",$match[0]);

　　$match = preg_replace("|/$|","",$match);

　　$match_part = parse_url($match);

　　$match_root =

　　$match_part["scheme"]."://".$match_part["host"];

　　$search = array("|^".preg_quote($host)."|i",

　　"|^(/)|i",

　　"|^(?!)(?!mailto:)|i",

　　"|/./|",

　　"|/[^/]+/../|"

　　);

　　$replace = array("",

　　$match_root."/",

　　$match."/",

　　"/",

　　"/"

　　);

　　$expandedLinks = preg_replace($search,$replace,$links);

　　返回 $expandedLinks;

　　}

　　如果想和file_get_contents做个详细对比，可以使用linux下的time命令查看两者执行的时间。根据目前的测试，CURL 更快。第一个链接与以上两个功能的介绍有关。

0

2022-02-05

网页抓取数据

0 个评论

要回复文章请先登录或注册

AI时代内容工厂

网页抓取数据(_)

0 个评论

发起人