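As plinth and David van Driessche have already pointed out in their answers, extracting text from PDF files is non-trivial. Fortunately, the classes in the parser package of iText do most of the heavy lifting for you. You have already found at least one class from that package, PdfTextExtractor,
but that class is essentially a convenience utility on top of iText's parser functionality, meant for cases where you are only interested in the plain text of a page. In your case you have to look more closely at the other classes in that package.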
A starting point for information on text extraction with iText is section 15.3, Parsing PDFs, of iText in Action, 2nd Edition http://itextpdf.com/book/index.php, especially the method extractText
of the sample ParsingHelloWorld.java http://itextpdf.com/examples/iia.php?id=275:
public void extractText(String src, String dest) throws IOException
{
    PrintWriter out = new PrintWriter(new FileOutputStream(dest));
    PdfReader reader = new PdfReader(src);
    RenderListener listener = new MyTextRenderListener(out);
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
    PdfDictionary pageDic = reader.getPageN(1);
    PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
    processor.processContent(ContentByteUtils.getContentBytesForPage(reader, 1), resourcesDic);
    out.flush();
    out.close();
}
It makes use of the RenderListener
implementation MyTextRenderListener.java http://itextpdf.com/examples/iia.php?id=282:
public class MyTextRenderListener implements RenderListener
{
    [...]
    /**
     * @see RenderListener#renderText(TextRenderInfo)
     */
    public void renderText(TextRenderInfo renderInfo) {
        out.print("<");
        out.print(renderInfo.getText());
        out.print(">");
    }
}
While this RenderListener
implementation merely outputs the text, the TextRenderInfo http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html object it inspects offers more information:
public LineSegment getBaseline(); // the baseline for the text (i.e. the line that the text 'sits' on)
public LineSegment getAscentLine(); // the ascentline for the text (i.e. the line that represents the topmost extent that a string of the current font could have)
public LineSegment getDescentLine(); // the descentline for the text (i.e. the line that represents the bottom most extent that a string of the current font could have)
public float getRise(); // the rise which represents how far above the nominal baseline the text should be rendered
public String getText(); // the text to render
public int getTextRenderMode(); // the text render mode
public DocumentFont getFont(); // the font
public float getSingleSpaceWidth(); // the width, in user space units, of a single space character in the current font
public List<TextRenderInfo> getCharacterRenderInfos(); // details useful if a listener needs access to the position of each individual glyph in the text render operation
Thus, if your RenderListener, in addition to inspecting the text with getText(), also considers getBaseline() or even getAscentLine() and getDescentLine(), you have all the coordinates you are likely to need.
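To illustrate what a listener can do with those coordinates, here is a minimal sketch in plain Java with no iText dependency. The names Chunk, ChunkSorter, toLines and the tolerance parameter are my own inventions for illustration, not part of iText. The assumption is that your renderText implementation stores one Chunk per call, taking the text from renderInfo.getText() and the position from renderInfo.getBaseline().getStartPoint() (in iText 5 that is a Vector whose x and y can be read with get(Vector.I1) and get(Vector.I2)). The sketch then groups the recorded chunks into lines by baseline height and orders each line from left to right.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkSorter {

    /** Hypothetical holder for what a listener could record per renderText call. */
    static final class Chunk {
        final String text; // assumed to come from renderInfo.getText()
        final float x;     // assumed: x of the baseline start point
        final float y;     // assumed: y of the baseline start point

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups chunks whose baseline y values differ from the first chunk of a
     * line by at most the tolerance, then joins each group left to right.
     */
    static List<String> toLines(List<Chunk> chunks, float tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // In default PDF user space y grows upward, so larger y means higher on the page.
        sorted.sort((a, b) -> Float.compare(b.y, a.y));

        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= tolerance) {
                last.add(c); // same line as the previous chunk
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line); // start a new line
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            line.sort((a, b) -> Float.compare(a.x, b.x)); // left to right
            StringBuilder sb = new StringBuilder();
            for (Chunk c : line) {
                sb.append(c.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

This only handles unrotated, horizontally written text; for rotated text or pages with a transformed coordinate system you would have to look at the whole baseline segment rather than just its start point.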
PS: There is a wrapper class for the code in ParsingHelloWorld.extractText(), PdfReaderContentParser http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/PdfReaderContentParser.html, which allows you to simply write the following, given a PdfReader reader, an int page, and a RenderListener renderListener:
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(page, renderListener);