Reading PDF Text by Line and by Page

Source: https://blog.csdn.net/WIK_7264/article/details/137102591 (2024/4/27 18:50:06)

- Introduction
  - Apache PDFBox
  - iText
  - Other libraries
- pom file
- Code
  - Apache PDFBox
    - 1. Reading by line
    - 2. Reading by page
  - iText

Introduction

Apache PDFBox

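Apache PDFBox is a powerful open-source Java library for working with PDF documents. It can read, create, modify, and convert PDF files. With PDFBox you can extract text from a PDF, for example line by line, and also read form data.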

iText

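iText is another popular PDF library in the Java ecosystem. It can read PDFs as well as generate and modify them. Its focus is more on authoring PDF content, but it can also extract text.
Reading text with iText is less direct than with PDFBox: extraction goes through its parser API (PdfTextExtractor with an extraction strategy), which usually means working at a lower level.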

Other libraries

jPod
PDFsam SDK
PdfSmartCopy (a class that ships with iText 5 rather than a separate library, used mainly for copying PDF pages)

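Note: Apache PDFBox is usually the first choice for reading text from a PDF, because extracting plain text with it is simple and direct. For more complex operations, such as form filling or digital signatures, iText offers more functionality.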

pom file

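<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>latest 2.x version number</version>
</dependency>

The code in this article calls PDDocument.load, which is the PDFBox 2.x API. PDFBox 3.x replaced it with Loader.loadPDF, so either stay on a 2.x version or adapt the loading call.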

Link: Maven Central Repository

Code

Apache PDFBox

1. Reading by line

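import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.springframework.core.io.UrlResource;
import org.springframework.util.StringUtils;

// Assumes an enclosing class with an SLF4J logger named log (for example via Lombok's @Slf4j).
public static List<String> readRowPdf(String pdfUrl) {
    // Result list
    List<String> result = new ArrayList<>();
    // Open the PDF from the URL and load it into a PDDocument; both are closed automatically
    try (InputStream inputStream = new UrlResource(pdfUrl).getInputStream();
         PDDocument pdDocument = PDDocument.load(inputStream)) {
        PDFTextStripper stripper = new PDFTextStripper();
        // Extract from the first page to the last page
        stripper.setStartPage(1);
        stripper.setEndPage(pdDocument.getNumberOfPages());
        // Extract the whole document as one string
        String text = stripper.getText(pdDocument);
        // Split on \r and \n (covers Windows and Linux line endings);
        // tokenizeToStringArray also trims each line and drops empty ones
        String[] split = StringUtils.tokenizeToStringArray(text, "\r\n");
        result.addAll(Arrays.asList(split));
    } catch (IOException e) {
        // On failure, log the cause and return whatever was collected (an empty list)
        log.warn("Reading PDF by line failed, cause: {}", e.getMessage());
        return result;
    }
    return result;
}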
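The split step in readRowPdf relies on Spring's StringUtils.tokenizeToStringArray, which treats every character of "\r\n" as a delimiter, trims each token, and discards empty tokens, so blank lines never reach the result. As a minimal sketch of the same step without Spring, the class below uses only the JDK. The names LineSplitter, splitNonBlankLines, and splitKeepingBlankLines are hypothetical and not part of the original code. Note that the regex \R also matches a few other line terminators, such as form feed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class LineSplitter {

    // Splits on any line break, trims each line, and drops lines that are
    // empty after trimming (similar in effect to tokenizeToStringArray).
    static List<String> splitNonBlankLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Splits on any line break but keeps interior blank lines and the
    // original whitespace; trailing empty lines are dropped by String.split.
    static List<String> splitKeepingBlankLines(String text) {
        return new ArrayList<>(Arrays.asList(text.split("\\R")));
    }
}
```

Keeping blank lines can matter when they separate paragraphs in the extracted text; dropping them is more convenient when each line is processed on its own.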

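Note: text in a PDF is stored as positioned fragments rather than as lines, so reading by line can give unexpected results, such as lines that are split or merged differently from how the page looks.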
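To make the note above concrete, here is a simplified sketch of how an extractor might rebuild lines from positioned fragments. It is not PDFBox's actual algorithm. The names LineGrouper, Fragment, toLines, and yTolerance are hypothetical, and the sketch assumes that y grows downward from the top of the page and that fragments whose y values differ by at most yTolerance belong to the same line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of line reconstruction; not a PDFBox API.
final class LineGrouper {

    // A piece of text with the position of its start on the page.
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Groups fragments into lines from top to bottom, then orders each line left to right.
    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        for (Fragment f : byY) {
            // Start a new line when the fragment sits clearly below the current line's first fragment
            if (!current.isEmpty() && f.y - current.get(0).y > yTolerance) {
                lines.add(joinLeftToRight(current));
                current = new ArrayList<>();
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(joinLeftToRight(current));
        }
        return lines;
    }

    private static String joinLeftToRight(List<Fragment> line) {
        line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }
}
```

With a tolerance that is too small, one visual line can break into several; with one that is too large, neighbouring lines merge. That trade-off is one reason line-by-line output from real documents, especially multi-column layouts and tables, may not match the visual layout.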

2. Reading by page

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

/**
 * Reads a PDF from a URL and returns the text of each page.
 *
 * @param pdfUrl URL of the .pdf file
 * @return one string per page, in page order
 * @throws Exception if the URL is malformed or the PDF cannot be read
 */
public static List<String> readPdf(String pdfUrl) throws Exception {
    // Whether to sort the text by its position on the page
    boolean sort = false;
    // Treat the address as a URL; a malformed URL is reported to the caller
    // instead of being swallowed and causing a NullPointerException later
    URLConnection con = new URL(pdfUrl).openConnection();
    con.setConnectTimeout(3 * 1000);
    // The input stream and the in-memory PDDocument are both closed automatically
    try (InputStream inputStream = con.getInputStream();
         PDDocument pdDocument = PDDocument.load(inputStream)) {
        // Number of pages
        int endPage = pdDocument.getNumberOfPages();
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(sort);
        List<String> texts = new ArrayList<>();
        for (int page = 1; page <= endPage; page++) {
            // Restrict extraction to a single page (page numbers are 1-based)
            stripper.setStartPage(page);
            stripper.setEndPage(page);
            texts.add(stripper.getText(pdDocument));
        }
        return texts;
    }
}
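Reading by page and reading by line can be combined: given the per-page strings that readPdf returns, each line can be tagged with its page number and its position on that page. The sketch below does this with the JDK only. PageLines, Located, and locate are hypothetical names, and the sketch assumes that blank lines should be skipped and not counted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class PageLines {

    // One non-blank line together with its 1-based page and line numbers.
    static final class Located {
        final int page;
        final int line;
        final String text;

        Located(int page, int line, String text) {
            this.page = page;
            this.line = line;
            this.text = text;
        }

        @Override
        public String toString() {
            return "p" + page + ":l" + line + " " + text;
        }
    }

    // pages.get(0) is expected to hold the text of page 1, and so on.
    static List<Located> locate(List<String> pages) {
        List<Located> out = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            int lineNo = 0;
            for (String raw : pages.get(i).split("\\R")) {
                String text = raw.trim();
                if (text.isEmpty()) {
                    continue; // blank lines are skipped and not numbered
                }
                out.add(new Located(i + 1, ++lineNo, text));
            }
        }
        return out;
    }
}
```

A call such as PageLines.locate(readPdf(pdfUrl)) would then let a search hit be reported as "page 2, line 5" instead of as an offset into one long string.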

iText

pom

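<dependencies>
    <!-- iText 7 Core: an aggregator POM, so type pom is required; it already pulls in kernel, layout, and the other modules -->
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itext7-core</artifactId>
        <version>7.2.0</version>
        <type>pom</type>
    </dependency>
    <!-- Layout and styling (the artifactId is layout; add it only if you declare individual modules instead of itext7-core) -->
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>layout</artifactId>
        <version>7.2.0</version>
    </dependency>
</dependencies>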

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Extracts the text of the PDF at the given path line by line and returns it as a List<String>.
 *
 * @param filePath path of the PDF file
 * @return all text lines of the PDF, in page order
 * @throws IOException if an I/O error occurs while reading or processing the PDF
 */
public List<String> readPdfWithItext(String filePath) throws IOException {
    List<String> contentLines = new ArrayList<>();
    // try-with-resources closes the PdfDocument (and its PdfReader) when done
    try (PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath))) {
        // Iterate over all pages (page numbers are 1-based)
        for (int pageNum = 1; pageNum <= pdfDoc.getNumberOfPages(); pageNum++) {
            // iText 7 extracts text one page at a time; a fresh strategy is used for each page
            String pageText = PdfTextExtractor.getTextFromPage(
                    pdfDoc.getPage(pageNum), new LocationTextExtractionStrategy());
            // Split the page text into lines and add each one to the result
            for (String line : pageText.split("\\r?\\n")) {
                contentLines.add(line);
            }
        }
    }
    return contentLines;
}