Web page video scraping script (text can also be recognized with the OCR API of the Baidu AI Open Platform)

Published by 优采云 on 2022-02-22 13:22

I. Taking a web page screenshot with Selenium, then cropping it to a precise region

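The third-party module "selenium" lets Python automate interaction with a web browser.

1. Install the module: pip install selenium

2. Install the driver that matches your browser version

In Chrome, open "chrome://version/" to check the browser version.

Download the Chrome driver build for that version.

Note: after unpacking, place the driver in a directory that is listed in the PATH environment variable.

3. The code is as follows: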

# -*- coding:utf-8 -*-
from selenium import webdriver
import easyocr
import time
from PIL import Image

def screenshots():  # Open the page and take a screenshot
    driver = webdriver.Chrome()  # Start a Chrome browser instance
    driver.maximize_window()  # Maximize the window
    driver.get("http://quote.eastmoney.com/sh600797.html")  # Load the page
    js = "var q=document.documentElement.scrollTop=500"  # Scroll down 500 pixels
    driver.execute_script(js)  # Run the scroll
    time.sleep(3)
    driver.get_screenshot_as_file(
        r"C:\Zzlong\%s.png" % time.strftime('%Y-%m-%d %H-%M', time.localtime(time.time()))
    )  # Saved as, for example, C:\Zzlong\2022-02-20 17-30.png
    driver.quit()  # Close the browser

    # To look up an element's position and size, run these before driver.quit().
    # Selenium 4 form shown; older releases used find_element_by_id / find_element_by_class_name.
    # imgelement = driver.find_element("id", "rgt1")
    # imgelement = driver.find_element("class name", "line24")
    # location = imgelement.location
    # print(location)  # {'x': 1104, 'y': 917}
    # size = imgelement.size
    # print(size)  # {'height': 12, 'width': 26}

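def crop():  # Crop the region of interest out of the screenshot
    # The file name is rebuilt from the current time, so this must run
    # within the same minute as the screenshot to find the file.
    picture = Image.open(
        r"C:\Zzlong\%s.png" % time.strftime('%Y-%m-%d %H-%M', time.localtime(time.time()))
    )  # Open the first screenshot
    picture = picture.crop((1320, 520, 1450, 550))  # Crop the target region
    # Note: crop() takes a box (left, upper, right, lower) in pixels, with (0, 0) at the
    # top-left corner; the result is (right - left) wide and (lower - upper) high.
    picture.save(
        r"C:\Zzlong\img%s.png" % time.strftime('%Y-%m-%d %H-%M', time.localtime(time.time()))
    )  # Save the cropped image
    # print(picture.size)  # On the full screenshot this prints (width, height), e.g. (1920, 888)
    # picture = picture.crop((0, 0, 1920, 888))  # Whole image: (left, upper, right, lower)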
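The commented-out lines in the screenshot function suggest a less brittle way to choose the crop region than hard-coded coordinates: read the element's location and size and convert them into a crop box. The following is only a sketch under stated assumptions: `element_box`, `scroll_y` and `scale` are hypothetical names introduced here, the location is assumed to be reported in page coordinates, and the screenshot is assumed to cover the viewport at the given scroll offset.

```python
def element_box(location, size, scroll_y=0, scale=1.0):
    """Convert Selenium-style location/size dicts into a Pillow crop box.

    location: {'x': ..., 'y': ...}, assumed to be in page coordinates
    size:     {'width': ..., 'height': ...}
    scroll_y: pixels the page was scrolled down when the screenshot was taken
    scale:    screenshot pixels per CSS pixel (assumed 1.0 unless the display is scaled)
    """
    left = location["x"] * scale
    upper = (location["y"] - scroll_y) * scale
    right = left + size["width"] * scale
    lower = upper + size["height"] * scale
    # Pillow's Image.crop expects (left, upper, right, lower), origin at top-left.
    return tuple(int(round(v)) for v in (left, upper, right, lower))


# Values from the commented-out prints above, combined with the 500 px scroll.
box = element_box({"x": 1104, "y": 917}, {"height": 12, "width": 26}, scroll_y=500)
print(box)  # (1104, 417, 1130, 429)
```

The returned tuple can be passed directly to `picture.crop(box)`. Whether the scroll offset and scale need adjusting depends on the browser and display, so it is worth checking the first crop by eye.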

II. Extracting text from the image with EasyOCR

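Python has a good OCR library, EasyOCR, which had about 9,700 stars on GitHub at the time of writing. It can be called from Python to recognize the text in an image and return it as text.

Installation is simple, using either pip or conda.

pip install easyocr

Installing from the default PyPI index may take a while. Using the Tsinghua mirror (-i https://pypi.tuna.tsinghua.edu.cn/simple) is recommended; installation then takes only a few seconds.

Usage

Using EasyOCR is simple and takes three steps: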

# Import easyocr
import easyocr

# Create the reader object
reader = easyocr.Reader(['ch_sim','en'])

# Read the image
result = reader.readtext('test.jpg')

# Show the result
print(result)

# If easyocr fails with "Unknown C++ exception from OpenCV code" after printing
# "CUDA not available - defaulting to CPU. Note: This module is much faster with a GPU.":
# the Python and CUDA versions do not match, so the OpenCV version installed for Python
# does not match the CUDA version. Installing this OpenCV version resolved it:
# pip install opencv-python==4.3.0.38 -i https://pypi.tuna.tsinghua.edu.cn/simple

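This code takes the argument ['ch_sim','en'], the list of languages to recognize (the full list of supported languages is given at the end of the article). Because the image contains road signs in both Chinese and English, the list includes ch_sim (Simplified Chinese) and en (English).

Recognition accuracy is quite high. The next step is to extract just the text part of each result.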

# Each item is (bounding box, text, confidence); index 1 is the text
for i in result:
    word = i[1]
    print(word)
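Since each result also carries a confidence score, the extraction loop can drop low-confidence readings before they reach later steps. A minimal sketch, assuming the default detailed output of `reader.readtext()` (a list of bounding box, text, confidence triples); `extract_words` and `min_confidence` are hypothetical names, and the sample data is hand-written rather than real OCR output.

```python
def extract_words(result, min_confidence=0.5):
    """Return recognized strings whose confidence is at least min_confidence.

    Each item is assumed to be a (bounding_box, text, confidence) triple.
    """
    return [text for _bbox, text, confidence in result if confidence >= min_confidence]


# Hand-written sample in the same shape as readtext() output.
sample = [
    ([[0, 0], [60, 0], [60, 20], [0, 20]], "12.34", 0.98),
    ([[0, 30], [60, 30], [60, 50], [0, 50]], "l2.3A", 0.21),
]
print(extract_words(sample))  # ['12.34']
```

The threshold of 0.5 is an arbitrary starting point; a suitable value depends on the images being read.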

III. Extracting text from images with the OCR API of the Baidu AI Open Platform

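First you need a Baidu account. Open the Baidu AI Open Platform and log in, click "Console", find "Text Recognition" in the left sidebar, create an application, and note down your AppID, API Key, and Secret Key.

Then install the Baidu AI client library from a cmd window.

pip install baidu-aip

That completes the setup. Next comes the core part, recognizing and extracting the text: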

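def baiduOCR(picfile, outfile):
    # To recognize text with the Baidu API we need:
    # 1. the image file name: picfile
    # 2. the output file: outfile
    filename = path.basename(picfile)
    # Fill in the ID, API key and secret key obtained above
    APP_ID = '****'  # The ID obtained above; likewise for the keys below
    API_KEY = '****'
    SECRET_KEY = '****'
    client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
    # Open the image and send it for recognition
    with open(picfile, 'rb') as i:
        img = i.read()
    print("Recognizing image:\t" + filename)
    # Two recognition methods are available: general and high-accuracy
    message = client.basicGeneral(img)  # General text recognition, 50,000 free calls per day
    # message = client.basicAccurate(img)  # High-accuracy text recognition, 800 free calls per day
    print("Recognition succeeded!")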

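That is the recognition part of text extraction with the Baidu API. What remains is to pull the recognized text out of the response.

To extract the recognized text, continue the function as follows: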

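    # Continues inside baiduOCR: append the recognized text to the output file
    with open(outfile, 'a+', encoding='utf-8') as fo:
        fo.writelines("+" * 60 + '\n')
        fo.writelines("Image:\t" + filename + "\n" * 2)
        fo.writelines("Text content:\n")
        # Write out each recognized line
        for text in message.get('words_result', []):
            fo.writelines(text.get('words') + '\n')
        fo.writelines('\n' * 2)
    print("Text exported!")
    print()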
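The write-out step takes for granted that the call succeeded and the response contains a words_result list. A failed call (for example an exhausted daily quota) is assumed here to return a dict with error_code and error_msg instead, so a small check before writing can make failures visible. This is only a sketch: `words_from_response` is a hypothetical helper, and the sample response is hand-written in the shape used above, not real API output.

```python
def words_from_response(message):
    """Return the recognized lines from a Baidu OCR response dict.

    Raises RuntimeError if the response carries an error_code instead of
    a words_result list (assumed error shape: error_code / error_msg).
    """
    if "error_code" in message:
        raise RuntimeError(
            "OCR failed: %s %s" % (message["error_code"], message.get("error_msg", ""))
        )
    return [item.get("words", "") for item in message.get("words_result", [])]


# Hand-written sample response, not real API output.
sample = {"words_result_num": 2, "words_result": [{"words": "hello"}, {"words": "world"}]}
print(words_from_response(sample))  # ['hello', 'world']
```

Calling this on the response before opening the output file keeps error responses from being written out as empty results.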

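Now we feed in a photo taken with a phone:

Recognition result:

The result shows that recognition accuracy is very high and the output is very good.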

For the detailed steps, refer to the code and comments:

import glob
from os import path
import os
from aip import AipOcr
from PIL import Image

def convertimg(picfile, outdir):
    '''Resize the image, shrinking images that are too large.

    picfile: path of the image
    outdir: output directory for the image
    '''
    img = Image.open(picfile)
    width, height = img.size
    while width * height > 4000000:  # Below this size the saved image is a little over 200 KB
        width = width // 2
        height = height // 2
    new_img = img.resize((width, height), Image.BILINEAR)
    new_img.save(path.join(outdir, os.path.basename(picfile)))

def baiduOCR(picfile, outfile):
    # To recognize text with the Baidu API we need:
    # 1. the image file name: picfile
    # 2. the output file: outfile
    filename = path.basename(picfile)

    # Fill in the ID, API key and secret key obtained above
    APP_ID = '****'  # The ID obtained above; likewise for the keys below
    API_KEY = '****'
    SECRET_KEY = '****'
    client = AipOcr(APP_ID, API_KEY, SECRET_KEY)

    # Open the image and send it for recognition
    with open(picfile, 'rb') as i:
        img = i.read()
    print("Recognizing image:\t" + filename)
    # Two recognition methods are available: general and high-accuracy
    message = client.basicGeneral(img)  # General text recognition, 50,000 free calls per day
    # message = client.basicAccurate(img)  # High-accuracy text recognition, 800 free calls per day
    print("Recognition succeeded!")
    # That is the recognition step.

    # Append the recognized text to the output file
    with open(outfile, 'a+', encoding='utf-8') as fo:
        fo.writelines("+" * 60 + '\n')
        fo.writelines("Image:\t" + filename + "\n" * 2)
        fo.writelines("Text content:\n")
        # Write out each recognized line
        for text in message.get('words_result', []):
            fo.writelines(text.get('words') + '\n')
        fo.writelines('\n' * 2)
    print("Text exported!")
    print()

if __name__ == "__main__":
    outfile = 'export.txt'
    outdir = 'tmp'
    if path.exists(outfile):
        os.remove(outfile)
    if not path.exists(outdir):
        os.mkdir(outdir)

    print("Compressing oversized images...")
    # First shrink oversized images to speed up recognition; the results are saved in a temporary folder
    for picfile in glob.glob("picture/*"):
        convertimg(picfile, outdir)

    print("Recognizing images...")
    for picfile in glob.glob("tmp/*"):
        baiduOCR(picfile, outfile)
        os.remove(picfile)

    print('Image text extraction finished! The output is in the file %s.' % outfile)
    os.removedirs(outdir)
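The halving loop in the resize function can be checked without opening any image, which helps when choosing a different pixel limit. A minimal sketch: `target_size` and `max_pixels` are hypothetical names introduced here, and the default limit is the 4,000,000 pixels used in the script above.

```python
def target_size(width, height, max_pixels=4000000):
    """Halve both dimensions until width * height is at most max_pixels."""
    while width * height > max_pixels:
        width //= 2
        height //= 2
    return width, height


print(target_size(4000, 3000))  # (2000, 1500)
print(target_size(1920, 888))   # (1920, 888)
```

Because both sides are halved together, each pass cuts the pixel count to roughly a quarter, so the result can land well below the limit rather than just under it.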
