Python + Streamlit: Extracting Text and Table Objects from a PDF in a Web Page

优采云 · Published: 2022-06-19 13:11

Hello everyone. Today's post shows how to use Streamlit to extract content from a PDF document: its basic information, its text, and its tables.
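As a rough illustration of the three extraction targets named above, here is a minimal sketch built on `pdfplumber`. The function names `extract_pdf_content` and `table_to_records` are hypothetical helpers introduced for this example, not part of the original post, and treating the first row of each table as its header is an assumption that will not hold for every PDF.

```python
from typing import Any, Dict, List, Optional


def table_to_records(raw_table: List[List[Optional[str]]]) -> List[Dict[str, str]]:
    """Treat the first row as the header and return the remaining rows as dicts."""
    if not raw_table:
        return []
    # pdfplumber reports empty cells as None, so normalize them to "".
    header = [cell or "" for cell in raw_table[0]]
    return [
        {name: (cell or "") for name, cell in zip(header, row)}
        for row in raw_table[1:]
    ]


def extract_pdf_content(file_obj: Any) -> Dict[str, Any]:
    """Collect metadata, per-page text, and tables from a PDF path or file object."""
    import pdfplumber  # imported lazily so table_to_records works without it

    with pdfplumber.open(file_obj) as pdf:
        texts, tables = [], []
        for page in pdf.pages:
            # extract_text() can return None for pages without a text layer.
            texts.append(page.extract_text() or "")
            for raw_table in page.extract_tables():
                tables.append(table_to_records(raw_table))
        return {"metadata": pdf.metadata, "texts": texts, "tables": tables}
```

In a Streamlit app, the object returned by `st.file_uploader` is file-like, so it could plausibly be passed straight in as `file_obj`; scanned PDFs without a text layer would yield empty strings here.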

Result and implementation code

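import streamlit as st
import pdfplumber
import io
from pandas import DataFrame
import pandas as pd
import fitz
import streamlit.components.v1 as components

st.set_page_config(page_title="PDF Operations", layout="wide")

# Hide Streamlit's default menu and footer, and restyle the download button.
# The <style> wrapper is an assumption: it is absent from the source text, but
# st.markdown only applies CSS that sits inside a <style> element.
css = """<style>
#MainMenu {visibility:hidden;}
footer {visibility:hidden;}
.stDownloadButton>button {
    background-color: #0099ff;
    color:#ffffff;
}
.stDownloadButton>button:hover {
    background-color: #00ff00;
    color:#ff0000;
}
</style>"""
st.markdown(css, unsafe_allow_html=True)


def convert_df(df):
    # Offer the table as a CSV download; GBK encoding keeps Chinese text
    # readable when the file is opened in Excel on a Chinese-locale system.
    st.download_button(
        label="Download table",
        data=df.to_csv().encode('gbk'),
        file_name='table.csv',
        mime='text/csv',
    )


def draw_table(df, theme, table_height):
    # The HTML tag literals in this function are empty in the source text;
    # they presumably held the table markup (header, row, and cell tags).
    columns = df.columns
    thead1 = """"""
    thead_temp = []
    for k in range(len(list(columns))):
        thead_temp.append("""""" + str(list(columns)[k]) + """""")
    header = thead1 + "".join(thead_temp) + """"""
    rows = []
    rows_temp = []
    for i in range(df.shape[0]):
        rows.append("""""" + str(i + 1) + """""")
        rows_temp.append(df.iloc[i].values.tolist())
    # Flatten every cell into one list.
    td_temp = []
    for j in range(len(rows_temp)):
        for m in range(len(rows_temp[j])):
            td_temp.append("""""" + str(rows_temp[j][m]) + """""")
    # Take a window of df.shape[1] cells at every offset...
    td_temp2 = []
    for n in range(len(td_temp)):
        td_temp2.append(td_temp[n:n + df.shape[1]])
    # ...and keep only the windows that start on a row boundary.
    td_temp3 = []
    for x in range(len(td_temp2)):
        if int(x % (df.shape[1])) == 0:
            td_temp3.append(td_temp2[x])
    # Join the cells of each row into one string.
    td_temp4 = []
    for y in range(len(td_temp3)):
        td_temp4.append("".join(td_temp3[y]))
    # Prefix each row with its 1-based row number.
    td_temp5 = []
    for v in range(len(td_temp4)):
        td_temp5.append("""""" + str(v + 1) + """""" + str(td_temp4[v]) + """""")
    table_html = """"""+\
    """"""+\
    """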
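The `draw_table` function above appears to build a header row from the column names and then a numbered body row for each record. A compact stand-alone sketch of that idea follows. The function name `rows_to_html` and the plain `<table>`, `<th>`, and `<td>` tags are assumptions made for illustration, since the markup in the listing above did not survive; the original most likely used styled or themed markup instead.

```python
from html import escape
from typing import Iterable, List, Sequence


def rows_to_html(columns: Sequence[object], rows: Iterable[Sequence[object]]) -> str:
    """Render a header row plus numbered body rows as a bare HTML table."""
    # The leading "#" column holds the 1-based row number.
    head_cells = ["<th>#</th>"] + [f"<th>{escape(str(c))}</th>" for c in columns]
    body_rows: List[str] = []
    for number, row in enumerate(rows, start=1):
        # Escape cell values so text such as "<" cannot break the markup.
        cells = "".join(f"<td>{escape(str(v))}</td>" for v in row)
        body_rows.append(f"<tr><th>{number}</th>{cells}</tr>")
    return (
        "<table><thead><tr>" + "".join(head_cells) + "</tr></thead>"
        "<tbody>" + "".join(body_rows) + "</tbody></table>"
    )
```

With a DataFrame `df`, this could be called as `rows_to_html(df.columns, df.values.tolist())`, and the result could then be shown with `components.html(..., height=table_height, scrolling=True)`. Iterating row by row avoids the flatten-and-reslice steps, at the cost of dropping whatever theming the original applied.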
