Web scraping encrypted HTML (working through Liao Xuefeng's Python tutorial: the common built-in module HTMLParser)

优采云 Published: 2022-04-16 06:12

  I was working through the HTMLParser section (common built-in modules) of Liao Xuefeng's Python tutorial:

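  Exercise: pick a web page, such as https://www.python.org/events/python-events/, view its source in a browser and copy it, then try parsing the HTML to print the time, name, and location of each conference listed on the official Python site.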
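  Before the full solution, it may help to see how the parser dispatches its callbacks. The sketch below is written for Python 3, where the module is named html.parser; TraceParser and trace are illustrative names that are not part of the tutorial or of the script in this post. It only records which handler fires for each piece of markup.

```python
from html.parser import HTMLParser


class TraceParser(HTMLParser):
    """Records every callback so the dispatch order can be inspected."""

    def __init__(self):
        # convert_charrefs=False routes &name; references to handle_entityref
        # instead of folding them into the text passed to handle_data
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_entityref(self, name):
        self.events.append(('entity', name))


def trace(html):
    """Feed a complete HTML string and return the recorded callbacks."""
    parser = TraceParser()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == '__main__':
    for event in trace('<time>06 June &ndash; 12 June</time>'):
        print(event)
```

  The text inside one element can arrive in several pieces, with entity references reported separately in between, which is why a parser for this exercise has to concatenate the pieces until the closing tag is seen.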

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Date : 2017-06-01 09:08:30
# @Author : kk (zwk.patrick@foxmail.com)
# @Link : blog.csdn.net/PatrickZheng

from html.parser import HTMLParser
from urllib.request import urlopen


class MyHTMLParser(HTMLParser):

    def __init__(self):
        # convert_charrefs=False keeps entity references such as &ndash;
        # going through handle_entityref instead of handle_data
        super().__init__(convert_charrefs=False)
        # In each list, index 0 is an "inside the tag" flag; collected values follow it
        self._title = [False]
        self._time = [False]
        self._place = [False]
        self.time = ''  # accumulates the pieces of one <time> element

    def _attr(self, attrlist, attrname):
        # Return the value of the named attribute, or None if it is absent
        for attr in attrlist:
            if attr[0] == attrname:
                return attr[1]
        return None

    def handle_starttag(self, tag, attrs):
        if tag == 'h3' and self._attr(attrs, 'class') == 'event-title':
            self._title[0] = True
        if tag == 'time':
            self._time[0] = True
        if tag == 'span' and self._attr(attrs, 'class') == 'event-location':
            self._place[0] = True

    def handle_endtag(self, tag):
        # End of <time>: stop accumulating
        if tag == 'time':
            self._time.append(self.time)  # store the complete time string
            self.time = ''  # reset the accumulator
            self._time[0] = False

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags are ignored

    def handle_data(self, data):
        if self._title[0]:
            self._title.append(data)
            self._title[0] = False
        if self._time[0]:
            self.time += data  # append this piece of the time text
        if self._place[0]:
            self._place.append(data)
            self._place[0] = False

    def handle_comment(self, comment):
        pass  # comments are ignored

    def handle_entityref(self, name):
        if self._time[0]:
            self.time += '-'  # &ndash; -> '-'

    def handle_charref(self, name):
        pass  # numeric character references are ignored

    def show_content(self):
        for n in range(1, len(self._title)):
            print('Title: %s' % self._title[n])
            print('Time: %s' % self._time[n])
            print('Place: %s' % self._place[n])
            print('--------------------------------------')


# Open the page and read its content; the with block closes the
# connection even if reading fails
with urlopen('https://www.python.org/events/python-events/') as page:
    html = page.read().decode('utf-8')

parser = MyHTMLParser()
parser.feed(html)
parser.show_content()

  Output:

Title: PyCon Taiwan 2017
Time: 06 June - 12 June 2017
Place: Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan
--------------------------------------
Title: PyCon CZ 2017
Time: 09 June - 12 June 2017
Place: Prague, Czechia
--------------------------------------
Title: PythonDay Mexico
Time: 10 June - 11 June 2017
Place: Isabel la Católica 51, Centro, 06010 Mexico City, Mexico
--------------------------------------
Title: PyParis 2017
Time: 12 June - 14 June 2017
Place: Paris, France
--------------------------------------
Title: PyCon Israel 2017
Time: 12 June - 15 June 2017
Place: Wahl Center, Max VeAnna Webb st., Ramat Gan, Israel
--------------------------------------
Title: PyData Berlin 2017
Time: 30 June - 03 July 2017
Place: Treskowallee 8, 10318 Berlin, Germany
--------------------------------------
Title: PyConWEB 2017
Time: 27 May - 29 May 2017
Place: Munich, Germany
--------------------------------------
Title: PyDataBCN 2017
Time: 19 May - 22 May 2017
Place: Barcelona, Spain
--------------------------------------

***Repl Closed***
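  The script depends on the live page, so its output changes over time. To experiment offline, here is a rough alternative sketch, not the approach used in this post: it targets Python 3 (html.parser), collects one dict per event instead of three parallel lists, and runs on an inline sample. SAMPLE_HTML is hypothetical markup that only borrows the tag and class names the script looks for (h3.event-title, time, span.event-location); EventParser and parse_events are illustrative names as well.

```python
from html.parser import HTMLParser

# Hypothetical markup modeled on the tag and class names the script checks;
# the structure of the real page may differ.
SAMPLE_HTML = """
<ul>
  <li>
    <h3 class="event-title"><a href="#">PyCon Taiwan 2017</a></h3>
    <p><time datetime="2017-06-06">06 June &ndash; 12 June 2017</time>
    <span class="event-location">Taipei, Taiwan</span></p>
  </li>
  <li>
    <h3 class="event-title"><a href="#">PyParis 2017</a></h3>
    <p><time datetime="2017-06-12">12 June &ndash; 14 June 2017</time>
    <span class="event-location">Paris, France</span></p>
  </li>
</ul>
"""


class EventParser(HTMLParser):
    """Collects one dict per event, keyed by title, time, and place."""

    def __init__(self):
        # Keep &ndash; flowing through handle_entityref
        super().__init__(convert_charrefs=False)
        self.events = []
        self._field = None        # field currently being filled, if any
        self._closing_tag = None  # end tag that finishes that field

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get('class')
        if tag == 'h3' and css_class == 'event-title':
            # A title starts a new event record
            self.events.append({'title': '', 'time': '', 'place': ''})
            self._field, self._closing_tag = 'title', 'h3'
        elif tag == 'time' and self.events:
            self._field, self._closing_tag = 'time', 'time'
        elif tag == 'span' and css_class == 'event-location' and self.events:
            self._field, self._closing_tag = 'place', 'span'

    def handle_endtag(self, tag):
        if tag == self._closing_tag:
            self._field = self._closing_tag = None

    def handle_data(self, data):
        if self._field:
            self.events[-1][self._field] += data

    def handle_entityref(self, name):
        if self._field and name == 'ndash':
            self.events[-1][self._field] += '-'


def parse_events(html):
    """Parse a complete HTML string and return the list of event dicts."""
    parser = EventParser()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == '__main__':
    for event in parse_events(SAMPLE_HTML):
        print(event)
```

  Keeping each event in its own dict means a stray time tag elsewhere on the page cannot shift the times out of step with the titles, which is a risk with index-aligned lists.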
