A Python crawler for CSDN blog posts
Last night, wanting to download and save every post from a certain CSDN blogger, I wrote a crawler that automatically fetches the articles and saves them as txt files (they could just as easily be saved as HTML pages). That way Ctrl+C and Ctrl+V both work on the text, which is convenient, and scraping other sites works much the same way.
To parse the fetched pages I use the third-party module BeautifulSoup, which is very handy for picking apart HTML. You could also parse the pages yourself with regular expressions, but that is more tedious.
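As a minimal illustration of what BeautifulSoup does here, the small sketch below parses a made-up page fragment and pulls out the article div the same way the crawler does; the sample HTML and variable names are invented for the example, not taken from CSDN.

# -*- encoding: utf-8 -*-
from bs4 import BeautifulSoup

# A made-up fragment standing in for a real article page.
sample_html = ('<html><body>'
               '<div id="article_content" class="article_content">'
               '<p>first paragraph</p><p>second paragraph</p>'
               '</div></body></html>')

soup = BeautifulSoup(sample_html)
# findAll returns every matching tag; the article body is looked up by id and class.
body = soup.findAll('div', {'id': 'article_content', 'class': 'article_content'})[0]
for p in body.findAll('p'):
    print(p.get_text())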
Because CSDN's robots.txt disallows all crawlers, the crawler has to masquerade as a browser, and it must not fetch pages too frequently: sleep for a while between requests, otherwise the IP gets blocked. Alternatively, proxy IPs can be used.
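A minimal sketch of the "sleep between requests" idea, assuming a Python 2 environment with urllib2; the article URLs and the 2-5 second delay range are made-up example values, not from the original post.

import random
import time
import urllib2

# Hypothetical list of article URLs to crawl one by one.
urls = ['http://blog.csdn.net/mangoer_ys/article/details/1',
        'http://blog.csdn.net/mangoer_ys/article/details/2']

for url in urls:
    req = urllib2.Request(url)
    # Pretend to be a browser instead of a script.
    req.add_header('User-Agent',
                   'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7')
    page = urllib2.urlopen(req).read()
    print('fetched %d bytes from %s' % (len(page), url))
    # Pause a random 2-5 seconds so the requests do not look like a flood.
    time.sleep(random.uniform(2, 5))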
# -*- encoding: utf-8 -*-
'''
Created on 2014-09-18 21:10:39
@author: Mangoer
@email: 2395528746@qq.com
'''
import urllib2
import re
from bs4 import BeautifulSoup
import random
import time
class CSDN_Blog_Spider:
    def __init__(self, url):
        print('\n')
        print('Web crawler started...')
        print('Page URL: ' + url)

        # Pool of browser User-Agent strings; one is picked at random per request
        # so the crawler looks like an ordinary browser rather than a script.
        user_agents = [
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
            'Opera/9.25 (Windows NT 5.1; U; en)',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
            'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
            'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
            'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
            'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0',
        ]

        # Optional: route requests through a random proxy IP (uncomment to use).
        # ips_list = ['60.220.204.2:63000', '123.150.92.91:80', '121.248.150.107:8080',
        #             '61.185.21.175:8080', '222.216.109.114:3128', '118.144.54.190:8118',
        #             '1.50.235.82:80', '203.80.144.4:80']
        # ip = random.choice(ips_list)
        # print('Using proxy IP: ' + ip)
        # proxy_support = urllib2.ProxyHandler({'http': 'http://' + ip})
        # opener = urllib2.build_opener(proxy_support)
        # urllib2.install_opener(opener)

        # Build the request with browser-like headers.
        agent = random.choice(user_agents)
        req = urllib2.Request(url)
        req.add_header('User-Agent', agent)
        req.add_header('Host', 'blog.csdn.net')
        req.add_header('Accept', '*/*')
        req.add_header('Referer', 'http://blog.csdn.net/mangoer_ys?viewmode=list')
        req.add_header('GET', url)  # note: 'GET' is not a standard header and has no effect here

        # Fetch the page, decode the raw bytes as GBK (ignoring errors) and re-encode as UTF-8.
        html = urllib2.urlopen(req)
        page = html.read().decode('gbk', 'ignore').encode('utf-8')
        self.page = page

        self.title = self.getTitle()
        self.content = self.getContent()
        self.saveFile()
    def printInfo(self):
        print('Article title: ' + self.title + '\n')
        print('The content has been saved to out.txt!')
    def getTitle(self):
        # Extract the title from the raw HTML; matching the <title> element here
        # is an assumption about the page markup.
        rex = re.compile('<title>(.*?)</title>', re.DOTALL)
        match = rex.search(self.page)
        if match:
            return match.group(1)
        return 'NO TITLE'
    def getContent(self):
        # Locate the article body div by its id/class and keep its HTML.
        bs = BeautifulSoup(self.page)
        html_content_list = bs.findAll('div', {'id': 'article_content', 'class': 'article_content'})
        html_content = str(html_content_list[0])
        rex_p = re.compile(r'(?:.*?)>(.*?)