Crawling Sina Weibo Data in Bulk: An Approach and a Workaround

优采云, published 2021-08-11 06:01

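A recent project required collecting a large amount of Sina Weibo data. The well-known route is the official API, but it has limitations: it can only return a given user's home_timeline and cannot search by keyword, and when the volume is large (hundreds of thousands or even tens of millions of records) the API's cap on requests per hour becomes a bottleneck. Of course, every website also has its own protective measures. When an IP sends a large number of requests in a short period, the site treats the traffic as suspicious and either demands a CAPTCHA or blocks the IP outright, unblocking it a few days later. This mechanism is necessary, because it keeps a stream of malicious requests from bringing the servers down. Without further ado, here is the core code: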

import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Fragment: place inside a method declared with "throws Exception".
// Search terms: China, Japan, South Korea, USA, Russia, UK and USA.
String[] termString = { "中国", "日本", "韩国", "美国", "俄国", "英美" };
String userAgent = "User-Agent";
String userAgentValue = "Mozilla/12.0 (compatible; MSIE 8.0; Windows NT)";
String contentType = "Content-Type";
String contentTypeValue = "application/x-www-form-urlencoded";
Random random = new Random();

for (int i = 0; i < termString.length; i++) {
    for (int j = 0; j < 50; j++) { // fetch 50 result pages for each search term
        String pathString = "http://s.weibo.com/weibo/"
                + URLEncoder.encode(termString[i], "UTF-8") + "&page=" + (j + 1);
        try {
            URL url = new URL(pathString);
            HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
            httpConn.setRequestMethod("POST");
            httpConn.setConnectTimeout(60000);
            httpConn.setReadTimeout(60000);
            httpConn.setRequestProperty(userAgent, userAgentValue);
            httpConn.setRequestProperty(contentType, contentTypeValue);
            httpConn.setDoOutput(true);
            httpConn.setDoInput(true);
            // Send an empty POST body and release the stream.
            try (OutputStream os = httpConn.getOutputStream()) {
                os.flush();
            }
            StringBuilder content = new StringBuilder();
            // Assumes a UTF-8 response; the original relied on the platform default charset.
            try (InputStreamReader isr = new InputStreamReader(
                    httpConn.getInputStream(), StandardCharsets.UTF_8)) {
                int c;
                while ((c = isr.read()) != -1) {
                    content.append((char) c);
                }
            }
            // decodeUnicode is the author's helper; it is not shown in this post.
            System.out.println(decodeUnicode(content.toString()));
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Pause 40 to 60 s between requests so the site does not flag the IP.
        int slpmillsecs = 40000 + random.nextInt(20000) + 1;
        Thread.sleep(slpmillsecs);
    }
}
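
The code above calls decodeUnicode, which the post does not show. Here is a minimal sketch, assuming the helper turns literal backslash-u escape sequences in the response (for example the six characters \u4e2d) back into the characters they encode. The class name UnicodeDecoder and the regex-based approach are my assumptions, not the author's implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeDecoder {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical stand-in for the helper used in the post.
    public static String decodeUnicode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Convert the four hex digits into the UTF-16 code unit they encode.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The argument contains escaped text, not the decoded characters.
        System.out.println(decodeUnicode("\\u4e2d\\u56fd weibo"));
    }
}
```

Text without escape sequences passes through unchanged, so the helper is safe to apply to every response.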

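At first I used an interval of about 10 s and found that Sina could still detect the crawler, so I promptly raised it to 40 to 60 s. Under normal conditions an interval somewhat below 40 s should also work, but I did not test what the minimum is. Anyone interested can try.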
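
The randomized pause can be factored into a small helper so that the interval is easy to tune. This is a minimal sketch, assuming the 40 to 60 s window used in the code above. The class and method names (PoliteDelay, nextDelayMillis, pause) are hypothetical.

```java
import java.util.Random;

public class PoliteDelay {
    private final Random random = new Random();
    private final int minMillis;
    private final int maxMillis;

    public PoliteDelay(int minMillis, int maxMillis) {
        if (minMillis < 0 || maxMillis < minMillis) {
            throw new IllegalArgumentException("require 0 <= minMillis <= maxMillis");
        }
        this.minMillis = minMillis;
        this.maxMillis = maxMillis;
    }

    // Returns a delay drawn uniformly from [minMillis, maxMillis].
    public int nextDelayMillis() {
        return minMillis + random.nextInt(maxMillis - minMillis + 1);
    }

    // Sleeps for one randomly chosen delay.
    public void pause() throws InterruptedException {
        Thread.sleep(nextDelayMillis());
    }

    public static void main(String[] args) {
        PoliteDelay delay = new PoliteDelay(40_000, 60_000);
        System.out.println("Next pause (ms): " + delay.nextDelayMillis());
    }
}
```

Varying the pause rather than using a fixed value makes the request pattern less regular. Whether that helps against Sina's detection was not tested in this post.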

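PS: The page-crawling method presented here is for reference only. I estimated the time it takes to collect data this way: it is roughly the same as using the API, or even longer. This post is just one way of thinking about collecting Sina Weibo data. If you don't like it, please go easy on the criticism.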
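
The time cost mentioned in the PS can be approximated with simple arithmetic. This is a rough sketch, assuming the figures from the code (6 search terms, 50 pages each, an average pause of about 50 s) and ignoring the time spent on the requests themselves. The class name CrawlTimeEstimate is hypothetical.

```java
public class CrawlTimeEstimate {
    // Total pause time in seconds for terms * pagesPerTerm requests.
    public static long estimateSeconds(int terms, int pagesPerTerm, int avgPauseSeconds) {
        return (long) terms * pagesPerTerm * avgPauseSeconds;
    }

    public static void main(String[] args) {
        int terms = 6;
        int pagesPerTerm = 50;
        long seconds = estimateSeconds(terms, pagesPerTerm, 50);
        System.out.println(terms * pagesPerTerm + " requests, " + seconds + " s of pauses");
    }
}
```

With these numbers the pauses alone add up to 15,000 s, a little over four hours for 300 result pages.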
