C++ crawler: fetching web page data (two key packages, preprocessor definitions, m_szbuffer)

Published by 优采云: 2022-03-09 01:15


Fetching web page content over HTTP from C++ on Windows:

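First, two important packages (usually distributed as open-source packages on Linux and as DLLs on Windows): curl and pthreads_dll. curl can be thought of as a command-line browser; by calling its built-in functions such as curl_easy_setopt, a program can fetch the content of a specific web page (to compile and link the curl library correctly, one more package, C-ares, is required). pthreads is a multithreading package that covers, among other things, locking and unlocking mutexes and creating threads.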

To import the external libraries correctly in Visual Studio:

1. Project -> Properties -> Configuration Properties -> C/C++ -> General -> Additional Include Directories: add the include path.
2. Project -> Properties -> Configuration Properties -> Linker -> General -> Additional Library Directories: add the lib path.
3. Linker -> Input -> Additional Dependencies: add libcurld.lib;pthreadVC2.lib;ws2_32.lib;winmm.lib;wldap32.lib;areslib.lib
4. C/C++ -> Preprocessor -> Preprocessor Definitions: add _CONSOLE;BUILDING_LIBCURL;HTTP_ONLY

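Implementation walkthrough:

1: A custom hashTable structure that records the strings already fetched. It is implemented as a HashTable class containing a set of hash values, the add and find operations, and several commonly used string hash functions.

Code: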

// HashTable.h

#ifndef HashTable_H
#define HashTable_H

#include <set>
#include <string>
#include <vector>

class HashTable
{
public:
    HashTable(void);
    ~HashTable(void);

    // Records the hash of str and returns it.
    unsigned int ForceAdd(const std::string& str);
    // Returns the hash of str if it was recorded, otherwise (unsigned int)-1.
    unsigned int Find(const std::string& str);

    /* Common string hash functions */
    unsigned int RSHash(const std::string& str);
    unsigned int JSHash(const std::string& str);
    unsigned int PJWHash(const std::string& str);
    unsigned int ELFHash(const std::string& str);
    unsigned int BKDRHash(const std::string& str);
    unsigned int SDBMHash(const std::string& str);
    unsigned int DJBHash(const std::string& str);
    unsigned int DEKHash(const std::string& str);
    unsigned int BPHash(const std::string& str);
    unsigned int FNVHash(const std::string& str);
    unsigned int APHash(const std::string& str);

private:
    std::set<unsigned int> HashFunctionResultSet;
    std::vector<unsigned int> hhh; // element type is missing in the source; unsigned int is an assumption
};

#endif
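The interface above is easier to follow with a small usage sketch. ForceAdd and Find both rely on ELFHash, whose body does not appear in this article, so the version below uses the widely published form of the ELF hash and is only assumed to match. The names ElfHashSketch and SeenUrls are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>

// ELF hash in its commonly published form (an assumption: the article's
// own ELFHash body is not shown, so this may differ in detail).
unsigned int ElfHashSketch(const std::string& str)
{
    unsigned int hash = 0;
    for (std::size_t i = 0; i < str.length(); ++i)
    {
        hash = (hash << 4) + static_cast<unsigned char>(str[i]);
        const unsigned int high = hash & 0xF0000000u;
        if (high != 0)
        {
            hash ^= (high >> 24); // fold the top nibble back into the low bits
        }
        hash &= ~high;            // clear the top nibble
    }
    return hash;
}

// Hypothetical helper: remembers which URLs were seen, by hash value only.
class SeenUrls
{
public:
    // Returns true if the URL had not been recorded before.
    bool MarkVisited(const std::string& url)
    {
        return m_hashes.insert(ElfHashSketch(url)).second;
    }

    // Returns true if a URL with the same hash was recorded.
    bool Contains(const std::string& url) const
    {
        return m_hashes.count(ElfHashSketch(url)) != 0;
    }

private:
    std::set<unsigned int> m_hashes;
};
```

A crawler would call MarkVisited before queuing a link and skip the link when it returns false. Because only hash values are stored, two different URLs that collide would be treated as the same page; storing the full strings would avoid that at the cost of more memory.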

// HashTable.cpp

#include "HashTable.h"

HashTable::HashTable(void)
{
}

HashTable::~HashTable(void)
{
}

// Stores the ELF hash of str in the set and returns it.
unsigned int HashTable::ForceAdd(const std::string& str)
{
    unsigned int i=ELFHash(str);
    HashFunctionResultSet.insert(i);
    return i;
}

// Returns the ELF hash of str if it is already in the set;
// otherwise returns -1, which wraps to the largest unsigned int.
unsigned int HashTable::Find(const std::string& str)
{
    const unsigned int i=ELFHash(str);
    if(HashFunctionResultSet.find(i)==HashFunctionResultSet.end())
        return -1;
    return i;
}

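/* Implementations of several common string hash functions */

unsigned int HashTable::APHash(const std::string& str)
{
    unsigned int hash=0xAAAAAAAA;
    for(std::size_t i=0;i<str.length();i++)
    {
        // Alternate between two mixing steps on even and odd positions.
        hash^=((i&1)==0) ? ((hash<<7)^str[i]*(hash>>3)) :
                           (~((hash<<11)+(str[i]^(hash>>5))));
    }
    return hash;
}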

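unsigned int HashTable::BKDRHash(const std::string& str)
{
    unsigned int seed=131; //31 131 1313 13131 131313 etc
    unsigned int hash=0;
    for(std::size_t i=0;i<str.length();i++)
    {
        hash=(hash*seed)+str[i];
    }
    return hash;
}

/* ... the listing breaks off here in the source: the remaining hash functions, the declaration of the Http class, and the start of its implementation are missing. Only the tail of the write callback survives: ... */

...->setBuffer((char*)buffer,size,nmemb);
}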

bool Http::InitCurl(const std::string& url, std::string& szbuffer)
{
    if(!m_pcurl)
        return false; // no curl handle, nothing to perform

    pthread_mutex_init(&m_http_mutex,NULL);
    Http::m_szUrl=url;

    curl_easy_setopt(m_pcurl, CURLOPT_ERRORBUFFER, Http::m_errorBuffer);
    curl_easy_setopt(m_pcurl, CURLOPT_URL,m_szUrl.c_str());
    curl_easy_setopt(m_pcurl, CURLOPT_HEADER, 0L);         // body only, no response headers
    curl_easy_setopt(m_pcurl, CURLOPT_FOLLOWLOCATION, 1L); // follow redirects
    curl_easy_setopt(m_pcurl, CURLOPT_WRITEFUNCTION,Http::writer);
    curl_easy_setopt(m_pcurl, CURLOPT_WRITEDATA,this);

    // Blocks until the transfer finishes; writer fills m_szbuffer.
    const CURLcode result = curl_easy_perform(m_pcurl);

    if(result==CURLE_OK)
        szbuffer=m_szbuffer;

    // Clean up on success and on failure so the mutex is always destroyed.
    m_szbuffer.clear();
    m_szUrl.clear();
    pthread_mutex_destroy(&m_http_mutex);
    return result==CURLE_OK;
}

bool Http::DeInitCurl()
{
    curl_easy_cleanup(m_pcurl);
    curl_global_cleanup();
    m_pcurl = NULL;
    return true;
}

const std::string Http::getBuffer()
{
    return m_szbuffer;
}

// Returns the current URL (this overload acts as the getter despite its name).
std::string Http::setUrl()
{
    return Http::m_szUrl;
}

void Http::setUrl(const std::string& url)
{
    Http::m_szUrl = url;
}

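Here, m_szbuffer holds the content of the web page. The content of the initial page is handed back to the caller through the szbuffer parameter of InitCurl.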
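The listing registers Http::writer with CURLOPT_WRITEFUNCTION and passes this as CURLOPT_WRITEDATA, but the bodies of writer and setBuffer do not appear in full in this article. The sketch below shows one plausible shape for that pair, assuming the parameter layout libcurl documents for write callbacks (char* ptr, size_t size, size_t nmemb, void* userdata). PageBuffer, Append, and WriteChunk are hypothetical names, and the code deliberately avoids linking against libcurl so that it can be compiled on its own.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the buffer-handling part of the article's Http class.
class PageBuffer
{
public:
    // Appends one chunk of the response and returns the number of bytes consumed.
    std::size_t Append(const char* data, std::size_t size, std::size_t nmemb)
    {
        const std::size_t bytes = size * nmemb;
        m_buffer.append(data, bytes);
        return bytes;
    }

    const std::string& Get() const { return m_buffer; }

private:
    std::string m_buffer;
};

// Free function with the parameter layout of a libcurl write callback.
// libcurl treats a return value different from size * nmemb as an error
// and aborts the transfer.
std::size_t WriteChunk(char* data, std::size_t size, std::size_t nmemb, void* userdata)
{
    if (userdata == nullptr)
    {
        return 0; // no buffer to write into: signal failure
    }
    return static_cast<PageBuffer*>(userdata)->Append(data, size, nmemb);
}
```

libcurl may invoke the callback many times for a single page, so the buffer is appended to rather than overwritten. In the article's design the userdata pointer is the Http object itself, which is why the callback has to cast it back before it can reach the member buffer.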
