Optimize PDF Word Search

I have an application that iterates over a directory of PDF files and searches for a string. I am using PDFBox to extract the text from the PDFs, and the code is pretty straightforward. At first it was taking a minute and a half to load the results for 13 files, but I noticed that PDFBox was writing a lot of entries to the log file. Changing the logging level helped a lot, but it is still taking over 30 seconds to load a page. Does anybody have suggestions on how I can optimize the code, or another way to determine how many hits are in a document? I played around with Lucene, but it seems to only give you the number of hits in a directory, not the number of hits in a particular file.

Here is my code to get the text out of a PDF:

```java
public static String parsePDF(String filename) throws IOException {
    FileInputStream fi = new FileInputStream(new File(filename));
    try {
        PDFParser parser = new PDFParser(fi);
        parser.parse();
        COSDocument cd = parser.getDocument();
        PDFTextStripper stripper = new PDFTextStripper();
        String pdfText = stripper.getText(new PDDocument(cd));
        cd.close();
        return pdfText;
    } finally {
        fi.close(); // close the stream even if parsing throws
    }
}
```
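For the "hits per document" part, one option that sidesteps Lucene entirely is to count occurrences in the extracted text yourself. The sketch below assumes case-insensitive, non-overlapping matching is what you want; `countOccurrences` is a hypothetical helper of mine, not a PDFBox or Lucene API — you would feed it the string returned by `parsePDF`:

```java
// Minimal sketch: count non-overlapping, case-insensitive occurrences of a
// search term in a block of text (e.g. the output of parsePDF).
// countOccurrences is a hypothetical helper, not part of PDFBox or Lucene.
public class HitCounter {
    public static int countOccurrences(String text, String term) {
        if (text == null || term == null || term.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase();
        String needle = term.toLowerCase();
        int count = 0;
        int index = haystack.indexOf(needle);
        while (index != -1) {
            count++;
            // advance past this match so matches never overlap
            index = haystack.indexOf(needle, index + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        String pdfText = "The cat sat on the mat. Another CAT ran by.";
        System.out.println(countOccurrences(pdfText, "cat")); // prints 2
    }
}
```

Since this is a plain scan of an in-memory string, it adds essentially nothing to the runtime; the expensive part in my timings is the PDFBox parse itself.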
