Can't get report reader to read a pdf

pbarber
Joined: 07/22/2019 - 16:56

Hello,
 
I'm having trouble getting IDEA to recognize the text in PDFs. It has happened across several different files.
 
The PDF looks fine when I open it in Adobe Reader and the text can be highlighted. I can't use OCR because the text is already selectable.
 
But when I go to import the PDF into IDEA, it just gives me blank lines. Any idea why that is?
 
 

Brian Element
Joined: 07/11/2012 - 19:57

Is this an image file or a text PDF? Can you copy the text? Also, is it password protected? That could be a problem. Report Reader doesn't handle images. Any chance you can share the file so I or someone else can have a look at it?

pbarber
Joined: 07/22/2019 - 16:56

It's a report printed out of our timekeeping system. It has selectable and copyable text, so it's not just an image.
 
The file doesn't need a password to open. But when I print it to a new PDF (one of the things I tried to see if that would work), IDEA shows an error on the newly saved copy asking me to enter a password... and the file itself doesn't have a password! The security settings in the PDF say "no security".
 
I'm trying to see what I can do to strip it of any hidden protections or something like that.
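
One way to look for hidden protection is to check the raw file for an `/Encrypt` entry, which encrypted PDFs carry in their trailer even when no password is needed to open them. This is only a rough sketch: the function name `looks_encrypted` is made up for illustration, and a plain byte search can give a false positive if the same characters happen to appear elsewhere in the file.

```python
import re


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: report whether the raw PDF bytes contain an /Encrypt key.

    The word boundary keeps /EncryptMetadata on its own from matching.
    """
    return re.search(rb"/Encrypt\b", pdf_bytes) is not None


# Example use on a file from disk (the path is a placeholder):
# with open("timesheet.pdf", "rb") as fh:
#     print(looks_encrypted(fh.read()))
```

If this returns True for a file whose viewer reports "no security", the file likely carries permission restrictions of some kind that are worth investigating further.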

osaajah
Joined: 05/25/2018 - 02:33

Hi pbarber,
One of my clients had the same problem in the past. After searching on Google, I found that it was caused by a PDF that Report Reader does not support. My client got the PDF from a web application that generates it using an Apache component called "Apache FOP". The component version was 0.20.4, which generates PDF files with an old PDF encoding. "Apache FOP" version 1.0 and later uses an encoding that Report Reader recognizes. So you should check the PDF properties first, as shown in the picture.
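
If you have many files to check, the same properties can be read with a small script instead of opening each file in a viewer. The sketch below is an assumption-laden shortcut, not a full PDF parser: it reads the version from the `%PDF-x.y` header and looks for a literal `/Producer (...)` string in the raw bytes. It will return None when the metadata is stored in a compressed object or as a hex string. The function names are made up for illustration.

```python
import re
from typing import Optional


def pdf_header_version(pdf_bytes: bytes) -> Optional[str]:
    """Return the version from the '%PDF-x.y' header, or None if absent."""
    m = re.search(rb"%PDF-(\d+\.\d+)", pdf_bytes[:1024])
    return m.group(1).decode("ascii") if m else None


def pdf_producer(pdf_bytes: bytes) -> Optional[str]:
    """Return the first literal /Producer string found, or None.

    Only handles plain '(...)' strings without nested parentheses.
    """
    m = re.search(rb"/Producer\s*\(((?:\\.|[^\\()])*)\)", pdf_bytes)
    return m.group(1).decode("latin-1") if m else None


# Example use (the path is a placeholder):
# with open("report.pdf", "rb") as fh:
#     data = fh.read()
# print(pdf_header_version(data), pdf_producer(data))
```

The producer text tells you which tool generated the file, so you can compare it against the component and version mentioned above.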

klmi
Joined: 02/13/2019 - 08:41

You can try the following workarounds:
1) Print the PDF with a PDF printer to a new PDF file.
2) Select the whole text in your PDF viewer (e.g. Adobe Acrobat Viewer) by pressing CTRL+A, then copy and paste it into Notepad. In my experience, though, table layouts are often destroyed (probably because of tabs). I could read the text file with Report Reader but had problems defining the mask.
3) I had better results than with 2) using tools like PDF2TXT and Some PDF to TXT Converter.
4) Use OCR software to read the PDF again and save it to a new PDF, a text file or an Excel file (try different output options).
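
For workaround 2), if tabs really are what breaks the layout (that is only a guess), replacing them with spaces before importing may make the columns line up well enough to define a mask. A minimal sketch, assuming the pasted text was saved as a plain text file; the function name `tabs_to_columns` and the tab width of 8 are illustrative choices, and columns will still drift when a field is longer than the tab width.

```python
def tabs_to_columns(text: str, tabsize: int = 8) -> str:
    """Expand tabs to spaces at fixed tab stops and trim trailing blanks."""
    lines = text.splitlines()
    return "\n".join(line.expandtabs(tabsize).rstrip() for line in lines)


# Example use (file names are placeholders):
# with open("pasted.txt", encoding="utf-8") as src:
#     cleaned = tabs_to_columns(src.read())
# with open("pasted_fixed.txt", "w", encoding="utf-8") as dst:
#     dst.write(cleaned)
```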