Scraping data from web pages (a from-scratch Python web scraping walkthrough, starting with installing Python)

优采云 Published: 2021-12-17 13:09

Open-source web scraping programs exist in many different languages.

This post walks through scraping a site with Python from scratch.

Step 1: Install Python

Download a suitable version.

I chose to install Python 2.7.11.

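Step 2: Optionally install a Python IDE; PyCharm is used here

Download address: #section=windows

After downloading and installing it, you can create a new project and put the .py files you want to run into that project.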

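Step 3: Install the required packages

When you first run the script, the imports of BeautifulSoup and xlwt will fail because neither package is installed yet. The former is a library for parsing HTML tags; the latter exports the parsed data to an Excel file.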

BeautifulSoup download

xlwt download

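Both packages are installed the same way, much like installing dependency packages on Linux.

General installation steps

1. Add the Python installation directory to the system PATH environment variable.

2. Unpack the package to be installed, open a CMD window, change to the package directory, and run python setup.py build followed by python setup.py install.

Both packages are now installed.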
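One quick way to confirm that the installation worked is to ask Python whether both packages can be found. The sketch below assumes Python 3.4 or later (for importlib.util.find_spec), which differs from the Python 2.7.11 used in this post; missing_packages is a hypothetical helper name, and bs4 is the import name of BeautifulSoup 4.

```python
import importlib.util


def missing_packages(names):
    """Return the entries of `names` that Python cannot locate for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# bs4 is the import name of BeautifulSoup 4; xlwt writes .xls files.
missing = missing_packages(["bs4", "xlwt"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("bs4 and xlwt are both importable.")
```

On Python 2.7, typing import bs4 and import xlwt in the interpreter serves the same purpose: an ImportError means the package is not installed.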

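Step 4: Run the script

Below is the scraping code; adapt it to your own needs. Once reading a page works, extracting the data from it is straightforward.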

# coding: utf-8
# Python 2 script: urllib2 and the print statement do not exist in Python 3.
import urllib2

from bs4 import BeautifulSoup  # HTML parsing
import xlwt  # Excel export

row = 0
style0 = xlwt.easyxf('font: name Times SimSun')
wb = xlwt.Workbook(encoding='utf-8')
ws = wb.add_sheet('Sheet1')

for num in range(1, 100):  # listing pages 1 to 99
    # URL of the current listing page
    url = "http://www.xxx.com/Suppliers.asp?page=" + str(num) + "&hdivision="
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0",
        "Referer": "http://www.xxx.com/suppliers.asp"
    }
    req = urllib2.Request(url, data=None, headers=header)
    ope = urllib2.urlopen(req)
    # request issued; parse the listing page
    soup = BeautifulSoup(ope.read(), 'html.parser')

    for _ in soup.find_all("td", class_="a_blue"):
        # company name
        companyname = _.a.string.encode('utf-8').replace("\r\n", " ").replace('|', '')
        detailc = ''  # basic vendor details
        a_href = 'http://www.xxx.com/' + _.a['href']  # second-level (detail) page
        temphref = _.a['href'].encode('utf-8')
        if temphref.find("otherproduct") == -1:  # skip "otherproduct" links
            print companyname
            print a_href
            reqs = urllib2.Request(a_href.encode('utf-8'), data=None, headers=header)
            opes = urllib2.urlopen(reqs)
            deatilsoup = BeautifulSoup(opes.read(), 'html.parser')
            # first kind of detail table; the last matching table wins
            for content in deatilsoup.find_all("table", class_="zh_table"):
                detailc = content.text.encode('utf-8').replace("\r\n", "")
                # print detailc  # print the detail text
            row = row + 1  # move to the next row (row 0 stays empty)
            ws.write(row, 0, companyname, style0)  # row index, then column 0, 1, ... n
            ws.write(row, 1, detailc, style0)
            print 'Scraping row ' + str(row)

wb.save('bio-equip11-20.xls')
print 'Done!'
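The script above targets Python 2. In Python 3 the urllib2 module no longer exists, and its Request and urlopen are found in urllib.request instead. As a rough sketch of how the request-building step might look there, assuming the same placeholder host www.xxx.com and a shortened User-Agent string (build_page_request, LIST_URL and HEADERS are illustrative names, not part of the original script):

```python
from urllib.request import Request

# Placeholder host and query parameters copied from the article's script.
LIST_URL = "http://www.xxx.com/Suppliers.asp?page={page}&hdivision="
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)",  # shortened for the example
    "Referer": "http://www.xxx.com/suppliers.asp",
}


def build_page_request(page):
    """Return a Request for one listing page; nothing is sent until urlopen()."""
    return Request(LIST_URL.format(page=page), data=None, headers=HEADERS)


# Build the first three page requests and show their URLs (no network access).
requests_to_send = [build_page_request(n) for n in range(1, 4)]
for req in requests_to_send:
    print(req.full_url)
```

Each Request would then be passed to urllib.request.urlopen, whose response offers the same read() method the Python 2 script calls.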

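When the scraping script finishes, an Excel file containing the collected data (bio-equip11-20.xls, as named in the wb.save call) is created in the PycharmProjects project directory.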
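The link-filtering logic of the script can be tried offline before pointing it at a live site. The sketch below is a Python 3 variant that swaps BeautifulSoup for the standard library's html.parser so it runs with no extra packages; it is not the article's method. CompanyLinkParser and extract_company_links are hypothetical names, the sample HTML is made up, and www.xxx.com is the article's placeholder host. It keeps links inside td cells with class a_blue, skips any whose href contains "otherproduct", and resolves the rest against the base URL with urljoin rather than string concatenation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.xxx.com/"  # placeholder host from the article


class CompanyLinkParser(HTMLParser):
    """Collect (name, absolute URL) pairs for links inside <td class="a_blue">."""

    def __init__(self):
        super().__init__()
        self.in_cell = False  # True while inside a <td class="a_blue">
        self.href = None      # href of the <a> currently being read, if any
        self.text = []        # text fragments of that <a>
        self.links = []       # collected (name, url) results

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.in_cell = "a_blue" in (attrs.get("class") or "").split()
        elif tag == "a" and self.in_cell and attrs.get("href"):
            self.href = attrs["href"]
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            # Skip "otherproduct" links, mirroring the check in the script.
            if "otherproduct" not in self.href:
                raw = "".join(self.text).replace("|", "")
                name = " ".join(raw.split())  # collapse line breaks and spaces
                self.links.append((name, urljoin(BASE_URL, self.href)))
            self.href = None
        elif tag == "td":
            self.in_cell = False


def extract_company_links(html):
    """Parse an HTML string and return the filtered (name, url) pairs."""
    parser = CompanyLinkParser()
    parser.feed(html)
    return parser.links


# Made-up listing fragment: one normal link and one "otherproduct" link.
sample = (
    '<table><tr>'
    '<td class="a_blue"><a href="show.asp?id=1">Acme Labs</a></td>'
    '<td class="a_blue"><a href="otherproduct.asp?id=2">Other</a></td>'
    '</tr></table>'
)
for name, url in extract_company_links(sample):
    print(name, url)
```

Running the filter on a saved copy of a listing page first makes it easier to check the selection rules without sending repeated requests to the site.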
