Weibo crawler source code: different crawl URLs and approaches

Published by 优采云: 2021-07-04 04:29


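A Python crawler for Weibo. Given a Weibo user's id, it fetches the content of that user's home page and extracts each post's text, publication time, like count, repost count and other data. Of course, I got the code here by copying and modifying code found online!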

Weibo URL to crawl:

BUT the URL we actually request is the mobile Weibo address:

A favorite of LSPs: all kinds of pretty girls, crawl whatever you like, so start collecting!

Capturing the traffic in the browser reveals several important parameters:

    type: uid
    value: 5118612601
    containerid: 1005055118612601

There is actually one more, even more important parameter, which controls pagination: 'page': page!
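The parameters above can be assembled into a request URL roughly as follows. This is only a minimal sketch: the endpoint is the one that appears in `get_containerid` in this post, while `API_URL` and `build_page_url` are hypothetical names introduced here for illustration.

```python
from urllib.parse import urlencode

# Endpoint taken from get_containerid() in this post
API_URL = "https://m.weibo.cn/api/container/getIndex"


def build_page_url(uid, containerid, page):
    """Return the mobile API URL for one page of a user's posts (hypothetical helper)."""
    params = {
        "type": "uid",
        "value": uid,
        "containerid": containerid,
        "page": page,
    }
    return f"{API_URL}?{urlencode(params)}"


# Values from the packet capture shown above
print(build_page_url("5118612601", "1005055118612601", 1))
```

Passing the same dictionary as `params=` to `requests.get` should produce an equivalent query string, which is what the crawler code in this post does.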

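There is also an SSL error problem, which you can handle yourself! Here the certificate check is turned off with verify=False and the resulting warnings are silenced: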

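    import logging

    import requests

    # Route warnings into logging so they are not printed to the console
    logging.captureWarnings(True)
    # Silence the warnings raised because of verify=False
    requests.packages.urllib3.disable_warnings()

    # Inside the crawler class: verify=False skips the certificate check
    html = requests.get(self.url, headers=self.headers, params=params, timeout=5, verify=False).content.decode('utf-8')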
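For context, `verify=False` works around the SSL error by turning certificate verification off altogether. In standard-library terms this corresponds roughly to the context built below. It is a sketch to show what is being given up, not what `requests` does internally, and `insecure_context` is a hypothetical name.

```python
import ssl


def insecure_context():
    """Build an SSL context that skips certificate checks (hypothetical helper)."""
    ctx = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


ctx = insecure_context()
print(ctx.verify_mode == ssl.CERT_NONE)
```

With verification off, the connection is still encrypted but the server's identity is no longer checked, so this is best kept to throwaway scraping scripts.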

A few key points

    import json

    import requests

    # Method of the crawler class (the class definition is not shown here)
        def get_containerid(self):
            url = f'https://m.weibo.cn/api/container/getIndex?type=uid&value={self.uid}'
            data = requests.get(url, headers=self.headers, timeout=5, verify=False).content.decode('utf-8')
            content = json.loads(data).get('data')
            # Look for the tab that lists the user's posts and keep its containerid
            for tab in content.get('tabsInfo').get('tabs'):
                if tab.get('tab_type') == 'weibo':
                    self.containerid = tab.get('containerid')
                    break
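The lookup in `get_containerid` can be tried without any network access by separating it from the request. The sketch below assumes the response has the shape the code above reads (`data` → `tabsInfo` → `tabs`, each tab a dict); `find_weibo_containerid` and the sample payload are made up for illustration and are not real API output.

```python
def find_weibo_containerid(payload):
    """Return the containerid of the tab whose tab_type is 'weibo', or None."""
    data = payload.get("data") or {}
    tabs = (data.get("tabsInfo") or {}).get("tabs") or []
    for tab in tabs:
        if tab.get("tab_type") == "weibo":
            return tab.get("containerid")
    return None


# Made-up payload that only mimics the fields the post reads
sample = {
    "data": {
        "tabsInfo": {
            "tabs": [
                {"tab_type": "profile", "containerid": "example-profile-id"},
                {"tab_type": "weibo", "containerid": "example-weibo-id"},
            ]
        }
    }
}

print(find_weibo_containerid(sample))
print(find_weibo_containerid({"data": {}}))
```

Returning `None` when no matching tab exists lets the caller decide what to do, instead of failing later on a missing attribute.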

    import json
    import os
    import time

    import requests

    # Method of the crawler class (the class definition is not shown here)
        def get_content(self, i):
            params = {
                'type': 'uid',
                'value': self.uid,
                'containerid': self.containerid,
                'page': i,
            }
            html = requests.get(self.url, headers=self.headers, params=params, timeout=5, verify=False).content.decode('utf-8')
            data = json.loads(html)['data']
            cards = data['cards']
            # print(cards)
            # The summary file is written into a folder named after the uid, so create it first
            os.makedirs(str(self.uid), exist_ok=True)
            j = 1
            for card in cards:
                mblog = card.get('mblog')
                if not mblog:
                    continue  # skip cards that carry no post
                raw_text = mblog.get('raw_text', '')  # post text
                print(raw_text)
                scheme = card.get('scheme')  # link to the post
                attitudes_count = mblog.get('attitudes_count')  # likes
                comments_count = mblog.get('comments_count')  # comments
                created_at = mblog.get('created_at')  # publication time
                reposts_count = mblog.get('reposts_count')  # reposts
                print(scheme)
                img_path = f'{self.path}{i}/{j}'
                os.makedirs(img_path, exist_ok=True)
                with open(f'{img_path}/{j}.txt', 'a', encoding='utf-8') as f:
                    f.write(raw_text)
                # Collect the large-size URL of every attached picture
                img_urls = [pic['large']['url'] for pic in mblog.get('pics') or []]
                print(img_urls)
                if img_urls:
                    # Download the pictures with multiple threads
                    self.get_imgs(img_urls, img_path)
                    # Download the pictures with multiple processes
                    # self.get_pimgs(img_urls)
                with open(f'{self.uid}/{self.uid}.txt', 'a', encoding='utf-8') as fh:
                    fh.write(f"----Page {i}, post {j}----\n")
                    fh.write(f"Post URL: {scheme}\nPost text: {raw_text}\n"
                             f"Published: {created_at}\nReposts: {reposts_count}\n"
                             f"Likes: {attitudes_count}\nComments: {comments_count}\n\n")
                j += 1
                time.sleep(2)  # pause between posts to avoid hammering the server
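The field extraction in `get_content` can also be pulled out into a small pure function, which makes it easy to check against a hand-written card. This is a sketch under the assumption that cards look the way the code above expects; `parse_card` and the sample card are hypothetical and the values are invented.

```python
def parse_card(card):
    """Return a dict of post fields, or None if the card carries no post."""
    mblog = card.get("mblog")
    if not mblog:
        return None
    return {
        "scheme": card.get("scheme"),
        "raw_text": mblog.get("raw_text", ""),
        "created_at": mblog.get("created_at"),
        "reposts_count": mblog.get("reposts_count"),
        "attitudes_count": mblog.get("attitudes_count"),
        "comments_count": mblog.get("comments_count"),
        # Large-size URL of every attached picture; empty list if there are none
        "img_urls": [pic["large"]["url"] for pic in mblog.get("pics") or []],
    }


# Invented card that only mimics the fields read in get_content
sample_card = {
    "scheme": "https://example.com/post/1",
    "mblog": {
        "raw_text": "hello",
        "created_at": "example-date",
        "reposts_count": 1,
        "attitudes_count": 2,
        "comments_count": 3,
        "pics": [{"large": {"url": "https://example.com/1.jpg"}}],
    },
}

record = parse_card(sample_card)
print(record["raw_text"], record["img_urls"])
```

Keeping parsing separate from file writing and downloading means a change in the response format only has to be handled in one place.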

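    import threading

    import requests

    # Methods of the crawler class (the class definition is not shown here)
        # Download pictures with multiple threads
        def get_imgs(self, img_urls, img_path):
            threads = []
            # Start one thread per picture
            for img_url in img_urls:
                t = threading.Thread(target=self.get_img, args=(img_url, img_path))
                threads.append(t)
                t.start()
            # Wait until every download has finished
            for t in threads:
                t.join()
            print("Multithreaded picture download finished")

        def get_img(self, img_url, img_path):
            img_name = img_url.split('/')[-1]  # file name is the last URL segment
            print(f'>> Downloading picture: {img_name} ..')
            r = requests.get(img_url, timeout=8, headers=self.headers, verify=False)
            r.raise_for_status()  # do not save an error page as a picture
            with open(f'{img_path}/{img_name}', 'wb') as f:
                f.write(r.content)
            print(f'>> Picture: {img_name} downloaded!')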
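Starting one thread per picture works for a handful of images, but a post with many pictures opens that many connections at once. One possible alternative, not what the code above does, is a bounded pool from `concurrent.futures`. In this sketch `download_all`, `image_name` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def image_name(img_url):
    """Derive the file name the same way get_img does: last path segment."""
    return img_url.split("/")[-1]


def download_all(img_urls, fetch, max_workers=4):
    """Run fetch(url) for every URL on at most max_workers threads.

    Returns a dict mapping each URL to fetch's result, or to the
    exception it raised, so one failure does not abort the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(fetch, url) for url in img_urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results


def fake_fetch(url):
    """Stand-in for a real download; rejects anything that is not https."""
    if not url.startswith("https://"):
        raise ValueError(f"unexpected URL: {url}")
    return image_name(url)


urls = ["https://example.com/a/1.jpg", "ftp://example.com/2.jpg"]
results = download_all(urls, fake_fetch)
print(results[urls[0]])
print(type(results[urls[1]]).__name__)
```

In the crawler, `fetch` would be a function that performs the request and writes the file, much like `get_img`; the pool size of 4 is an arbitrary choice for illustration.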

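I originally wanted to add a multiprocess version as well, but it crashed with all sorts of headache-inducing errors, so I dropped it!!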

There are two sets of Weibo crawler source code, using different crawl URLs and approaches; I am sharing them for reference only!

One of them also includes a GUI; of course, that one is the main source code I referred to!

Tested and working!!
