Web scraping encrypted HTML (working through Liao Xuefeng's Python tutorial: the common built-in module HTMLParser)
优采云 Published: 2022-04-16 06:12
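I am working through Liao Xuefeng's Python tutorial, in the section on the commonly used built-in module HTMLParser.
Exercise: find a web page (for example, the Python events page fetched in the script below), view its source in a browser and copy it, then parse the HTML and print the time, name, and location of each conference listed on the official Python site.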
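Before tackling the exercise, it can help to see which callbacks HTMLParser fires and in what order. The following is a minimal Python 3 sketch, not part of the tutorial: `EventTracer` is a hypothetical name, and the HTML fragment is made up for illustration rather than copied from python.org.

```python
from html.parser import HTMLParser


class EventTracer(HTMLParser):
    """Record each callback HTMLParser fires, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.events.append(("data", data))


tracer = EventTracer()
# Hypothetical fragment, written only for this illustration
tracer.feed('<h3 class="event-title"><a href="#">PyCon Example</a></h3>')
tracer.close()
for event in tracer.events:
    print(event)
```

The text of the title arrives in `handle_data` after the start tags of both `h3` and `a`, which is why a parser for this exercise has to remember which tag it is currently inside.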
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date : 2017-06-01 09:08:30
# @Author : kk (zwk.patrick@foxmail.com)
# @Link : blog.csdn.net/PatrickZheng
# Python 2 script: HTMLParser and urllib are the Python 2 module names.
import HTMLParser
import urllib


class MyHTMLParser(HTMLParser.HTMLParser):
    # Each list keeps an "inside this tag" flag at index 0,
    # followed by the values collected so far.
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self._title = [False]
        self._time = [False]
        self._place = [False]
        self.time = ''  # accumulates the pieces of one <time> element

    def _attr(self, attrlist, attrname):
        # Return the value of attribute attrname, or None if it is absent.
        for attr in attrlist:
            if attr[0] == attrname:
                return attr[1]
        return None

    def handle_starttag(self, tag, attrs):
        # print('<%s>' % tag)
        if tag == 'h3' and self._attr(attrs, 'class') == 'event-title':
            self._title[0] = True
        if tag == 'time':
            self._time[0] = True
        if tag == 'span' and self._attr(attrs, 'class') == 'event-location':
            self._place[0] = True

    def handle_endtag(self, tag):
        # </time> ends the concatenation
        if tag == 'time':
            self._time.append(self.time)  # store the complete time string
            self.time = ''  # reset for the next event
            self._time[0] = False

    def handle_startendtag(self, tag, attrs):
        # print('<%s/>' % tag)
        pass

    def handle_data(self, data):
        # print('data: %s' % data)
        if self._title[0]:
            self._title.append(data)
            self._title[0] = False
        if self._time[0]:
            self.time += data  # append this piece of the time
        if self._place[0]:
            self._place.append(data)
            self._place[0] = False

    def handle_comment(self, comment):
        # print('<!--%s-->' % comment)
        pass

    def handle_entityref(self, name):
        # Any named entity inside <time> is written as '-';
        # on this page that entity is &ndash;.
        if self._time[0]:
            self.time += '-'

    def handle_charref(self, name):
        # print('&#%s;' % name)
        pass

    def show_content(self):
        # Index 0 holds the flag, so the collected values start at 1.
        for n in range(1, len(self._title)):
            print 'Title: %s' % self._title[n]
            print 'Time: %s' % self._time[n]
            print 'Place: %s' % self._place[n]
            print '--------------------------------------'
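

# Open the page before entering try, so that page is always defined
# when the finally clause runs.
page = urllib.urlopen('https://www.python.org/events/python-events/')
try:
    html = page.read()  # read the page content
finally:
    page.close()  # close the connection even if read() fails

parser = MyHTMLParser()
parser.feed(html)
parser.show_content()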
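
The script above targets Python 2 (the `HTMLParser` module, `urllib.urlopen`, and the `print` statement). The following is a minimal Python 3 sketch of the same idea, not the author's code: `EventParser`, `fetch`, and `SAMPLE` are hypothetical names, the sample markup is an assumption modeled on the tags the script looks for, and the page is assumed to be UTF-8. In Python 3, `HTMLParser` decodes entities before calling `handle_data` by default, so `&ndash;` shows up as an en dash instead of going through `handle_entityref`.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Download a page and decode it as UTF-8 (assumed encoding)."""
    with urlopen(url) as page:  # the connection is closed on exit
        return page.read().decode("utf-8")


class EventParser(HTMLParser):
    """Collect one dict per event with title, time, and place."""

    def __init__(self):
        super().__init__()
        self.events = []       # finished and in-progress records
        self._field = None     # field being read, or None
        self._end_tag = None   # closing tag that finishes the field
        self._buffer = []      # text pieces of the current field

    def _begin(self, field, end_tag):
        self._field, self._end_tag, self._buffer = field, end_tag, []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "h3" and cls == "event-title":
            # A title starts a new record; time and place attach to it.
            self.events.append({"title": "", "time": "", "place": ""})
            self._begin("title", "h3")
        elif tag == "time" and self.events:
            self._begin("time", "time")
        elif tag == "span" and cls == "event-location" and self.events:
            self._begin("place", "span")

    def handle_data(self, data):
        if self._field:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._field and tag == self._end_tag:
            text = "".join(self._buffer)
            # Collapse runs of whitespace left by nested tags
            self.events[-1][self._field] = " ".join(text.split())
            self._field = self._end_tag = None


# Hypothetical markup for illustration; the live page may differ.
SAMPLE = """
<li>
  <h3 class="event-title"><a href="#">PyCon Example 2017</a></h3>
  <p>
    <time datetime="2017-06-06">06 June &ndash; 12 June
      <span class="say-no-more"> 2017</span></time>
    <span class="event-location">Taipei, Taiwan</span>
  </p>
</li>
"""

parser = EventParser()
parser.feed(SAMPLE)
parser.close()
for event in parser.events:
    print("Title: %(title)s\nTime: %(time)s\nPlace: %(place)s" % event)
```

To run it against the live page, replace `SAMPLE` with `fetch('https://www.python.org/events/python-events/')`. Keeping each event in its own dict avoids relying on three parallel lists staying the same length.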
Output:
Title: PyCon Taiwan 2017
Time: 06 June - 12 June 2017
Place: Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan
--------------------------------------
Title: PyCon CZ 2017
Time: 09 June - 12 June 2017
Place: Prague, Czechia
--------------------------------------
Title: PythonDay Mexico
Time: 10 June - 11 June 2017
Place: Isabel la Católica 51, Centro, 06010 Mexico City, Mexico
--------------------------------------
Title: PyParis 2017
Time: 12 June - 14 June 2017
Place: Paris, France
--------------------------------------
Title: PyCon Israel 2017
Time: 12 June - 15 June 2017
Place: Wahl Center, Max VeAnna Webb st., Ramat Gan, Israel
--------------------------------------
Title: PyData Berlin 2017
Time: 30 June - 03 July 2017
Place: Treskowallee 8, 10318 Berlin, Germany
--------------------------------------
Title: PyConWEB 2017
Time: 27 May - 29 May 2017
Place: Munich, Germany
--------------------------------------
Title: PyDataBCN 2017
Time: 19 May - 22 May 2017
Place: Barcelona, Spain
--------------------------------------
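
The `-` in each time above comes from `handle_entityref`, which writes `-` for every named entity it sees inside `<time>`. If other entities could appear there, one option is to look the name up instead. This is a small Python 3 sketch under that assumption; `entity_to_char` is a hypothetical helper, while `html.entities.name2codepoint` and `html.unescape` are standard library names (in Python 2 the table lives in `htmlentitydefs`).

```python
import html
from html.entities import name2codepoint


def entity_to_char(name):
    """Map an entity name such as 'ndash' to its character.

    Unknown names are returned in their literal &name; form.
    """
    codepoint = name2codepoint.get(name)
    return chr(codepoint) if codepoint is not None else "&%s;" % name


print(repr(entity_to_char("ndash")))   # the en dash, U+2013
print(repr(entity_to_char("nosuch")))  # left as '&nosuch;'
# html.unescape decodes every entity in a string in one call
print(html.unescape("06 June &ndash; 12 June 2017"))
```

Note that this yields a real en dash rather than the ASCII hyphen shown in the output above.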