Converting Difficult Formatted PDF

audit55
Joined: 10/16/2018 - 16:26

I'm having trouble converting a PDF with difficult formatting. It is laid out like the example below. I can grab the first line of personal info easily. I just need the number total for both before tax and after tax as separate columns. The issue is that nothing in either total line differentiates it from the other. So I've grabbed the number total using a blank trap at the start of the line plus a number trap where there is always a number. This highlights both totals. However, when you import this, it only takes the first total and ignores the second.
How can I have both totals in separate columns during import if they both get highlighted/trapped every time? Also note that the number of funds per person and per type always varies.
I need columns to import like this:
Name      Dob    doh   id     Before tax total number       after tax total number
                                              
This is the original format:
                                                    Contribution
John, Smith          DOB  1/1/2000              1/2/200 DOH                 55555 ID
Before Tax
Fund 1
Fund 2
                           ---------------------------------------------------------------------------------
                                             (Number total right here)
After Tax
Fund 1
Fund 2
Fund 3
Fund 4
                 -------------------------------------------------------------------------------------------------
                                         (Number total right here)

Brian Element
Joined: 07/11/2012 - 19:57

Hi Audit55,

I think I came up with a solution for you.

I first created a text file that looks somewhat like your file:

I opened this file in the Report Reader. I used the total line for the Before and After Tax sections as the base layer. For the trap, I used a run of spaces at the beginning of the line to exclude the fund lines, plus the decimal point in the total amount.

So this will capture both the Before and After tax totals which is what I want.

I next created a second layer that captures whether the amount is the Before or the After amount.

It doesn't really matter if this layer also picks up the name of the person, as that will be ignored: the 800 is the base layer, and Before is the value closest to the 800, so it will overwrite the name if it is picked up.

For the third layer I pick up the name, DOB, DOH and ID.
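For anyone who wants to check the layering logic outside IDEA, here is a rough Python sketch of the same idea. It is not how Report Reader works internally. The sample text, the regular expressions and the `parse_report` function are all assumptions made up for illustration: the total line (leading blanks plus a decimal amount) acts as the base layer, and the most recent Before/After line and header line are attached to it.

```python
import re

# Hypothetical sample that mimics the layout described in this thread.
SAMPLE = """\
                                                    Contribution
John, Smith          DOB  1/1/2000              1/2/2000 DOH                 55555 ID
Before Tax
Fund 1                        500.00
Fund 2                        300.00
                           ------------------
                                             800.00
After Tax
Fund 1                        100.00
Fund 2                        100.00
Fund 3                         50.00
                 ------------------
                                         250.00
"""

# "Base layer": a run of leading blanks followed by a decimal amount.
TOTAL = re.compile(r"^\s{10,}([\d,]+\.\d{2})\s*$")
# Second layer: whether the section is Before or After tax.
TYPE = re.compile(r"^(Before|After) Tax\s*$")
# Third layer: name, DOB, DOH and ID from the header line.
HEADER = re.compile(
    r"^(?P<name>.+?)\s{2,}DOB\s+(?P<dob>\S+)\s+(?P<doh>\S+)\s+DOH"
    r"\s+(?P<id>\d+)\s+ID\s*$"
)

def parse_report(text):
    """Return one row per total line, tagged with the nearest header and type."""
    rows, header, tax_type = [], None, None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            header, tax_type = m.groupdict(), None
            continue
        m = TYPE.match(line)
        if m:
            tax_type = m.group(1)
            continue
        m = TOTAL.match(line)
        if m and header and tax_type:
            amount = float(m.group(1).replace(",", ""))
            rows.append({**header, "type": tax_type, "amount": amount})
    return rows

rows = parse_report(SAMPLE)
for row in rows:
    print(row)
```

Because every total row carries its own Before/After tag, both totals survive the import instead of the second one being dropped.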

I then bring in the file into IDEA and get the following:

I next use the Key Value Extraction to create two files, one containing the Before and one the After.  You could also use Direct Extraction to do this.

So you will have an After file that looks like this. I renamed the amount field in the two files to AFTER_AMT and BEFORE_AMT.

You then want to join the Before and After files together. In this case it doesn't really matter which is primary and which is secondary. For the match I use Name, DOB, DOH and ID, as they will be the same in both the Before and After files.

In the secondary file I clear all the fields except for the amount field.

After I do the join I have a file that looks like this:

which is what I think you are hoping to get. You could probably get rid of the field that holds the "Before" label in the primary file, as I don't think you need it in the final file.
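The extract-then-join steps can be sketched in plain Python as well. This is only an illustration under assumptions: the rows are hypothetical, the `extract` and `join_matches` helpers are made-up names rather than IDEA functions, and the join keeps matching records only, which the post above does not state explicitly.

```python
KEY = ("name", "dob", "doh", "id")

# Hypothetical rows, shaped like the imported file described above.
rows = [
    {"name": "John, Smith", "dob": "1/1/2000", "doh": "1/2/2000",
     "id": "55555", "type": "Before", "amount": 800.0},
    {"name": "John, Smith", "dob": "1/1/2000", "doh": "1/2/2000",
     "id": "55555", "type": "After", "amount": 250.0},
    {"name": "Jane, Doe", "dob": "2/2/1990", "doh": "3/3/2015",
     "id": "66666", "type": "Before", "amount": 120.0},
    {"name": "Jane, Doe", "dob": "2/2/1990", "doh": "3/3/2015",
     "id": "66666", "type": "After", "amount": 40.0},
]

def extract(rows, tax_type, amt_field):
    """Keep one tax type and rename the amount field."""
    return [
        {**{k: r[k] for k in KEY}, amt_field: r["amount"]}
        for r in rows
        if r["type"] == tax_type
    ]

def join_matches(primary, secondary, amt_field):
    """Join on KEY, taking only amt_field from the secondary file."""
    lookup = {tuple(r[k] for k in KEY): r[amt_field] for r in secondary}
    joined = []
    for r in primary:
        key = tuple(r[k] for k in KEY)
        if key in lookup:  # assumption: matches only
            joined.append({**r, amt_field: lookup[key]})
    return joined

before = extract(rows, "Before", "BEFORE_AMT")
after = extract(rows, "After", "AFTER_AMT")
joined = join_matches(before, after, "AFTER_AMT")
for record in joined:
    print(record)
```

Each person ends up on one row with BEFORE_AMT and AFTER_AMT side by side, which is the column layout asked for in the first post.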

Hopefully this helps you out.

Brian

 

audit55
Joined: 10/16/2018 - 16:26

Thank you! I never realized that changing which layer is the base layer would make such a big difference. Your first result, which had the person's name listed multiple times (once for each type), is actually fine.
Unfortunately I did run into an error that I'm not sure is fixable (I was able to get 90-95% of people your way). It appears the format changes for a few people on this report, who therefore don't get captured. Basically these people have just 1 fund, so they don't have a totals line. However, if I try to capture the lines for these people, it captures all of the fund names for the entire report, ruining the other working format you showed above.
I know this is a long shot, but is there any way to only grab the lines with a fund name if there is only 1 fund?
The new format is:
John, Smith          DOB  1/1/2000              1/2/200 DOH                 55555 ID Before Tax
Fund 1                                (Number total is here)                   

Brian Element
Joined: 07/11/2012 - 19:57

The base layer is the most important layer to select as it is the layer that should hold the transactional data and all the other layers get added to the base layer.  In your example the base layer is the total amount as you want the amount for each record and the other information gets added to that layer.

I think you might have to do this in two passes, as I am not seeing an easy way to grab all the information in one pass. In the second pass you would only grab the information for the Fund 1 transactions, so your base layer would be the Fund 1 layer and you would add the name, DOB and other items to that layer.
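As a rough Python sketch of that second pass (again an illustration, not Report Reader itself): the Fund 1 line becomes the base record, and the header is attached to it. The sample text, the patterns and `parse_fund1` are assumptions. The header pattern allows for the tax type appearing either at the end of the header line, as in the new format, or on its own line.

```python
import re

# Hypothetical sample: one person with a totals line, one with a single fund.
SAMPLE = """\
John, Smith          DOB  1/1/2000              1/2/2000 DOH                 55555 ID
Before Tax
Fund 1                        500.00
Fund 2                        300.00
                                             800.00
Jane, Doe            DOB  2/2/1990              3/3/2015 DOH                 66666 ID Before Tax
Fund 1                                 120.00
"""

# Header line, with the tax type optionally appended after "ID".
HEADER = re.compile(
    r"^(?P<name>.+?)\s{2,}DOB\s+(?P<dob>\S+)\s+(?P<doh>\S+)\s+DOH"
    r"\s+(?P<id>\d+)\s+ID(?:\s+(?P<type>Before|After) Tax)?\s*$"
)
TYPE = re.compile(r"^(Before|After) Tax\s*$")
# "Base layer" for this pass: the Fund 1 line and its amount.
FUND1 = re.compile(r"^Fund 1\s+([\d,]+\.\d{2})\s*$")

def parse_fund1(text):
    """Return one row per Fund 1 line, tagged with the nearest header and type."""
    rows, header, tax_type = [], None, None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            header = {k: m.group(k) for k in ("name", "dob", "doh", "id")}
            tax_type = m.group("type")  # None when the type is on its own line
            continue
        m = TYPE.match(line)
        if m:
            tax_type = m.group(1)
            continue
        m = FUND1.match(line)
        if m and header and tax_type:
            amount = float(m.group(1).replace(",", ""))
            rows.append({**header, "type": tax_type, "amount": amount})
    return rows

fund1_rows = parse_fund1(SAMPLE)
for row in fund1_rows:
    print(row)
```

This pass deliberately picks up Fund 1 for everyone, including people who already have a totals line, so the unwanted rows still need to be filtered out afterwards.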

You would then bring it into IDEA and do a join with the first file, with the Fund 1 file as the primary file and your first file as the secondary. The reason for this is that you will have the Fund 1 transactions for every account, and you want to get rid of the ones you already have. So your join would be records with no secondary match, and the match fields would be your Name, DOB, DOH and ID. The resulting file would only contain the funds for people who are not in the first file. You can then do an append to create one file from your first file and the file with only the Fund 1 records.
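The "no secondary match" join followed by an append can be sketched like this in Python. The rows and the `no_secondary_match` helper are hypothetical and only illustrate the logic, not IDEA's own implementation.

```python
KEY = ("name", "dob", "doh", "id")

# Hypothetical first-pass file: people who have a totals line.
totals_file = [
    {"name": "John, Smith", "dob": "1/1/2000", "doh": "1/2/2000",
     "id": "55555", "type": "Before", "amount": 800.0},
]

# Hypothetical second-pass file: the Fund 1 line for every person.
fund1_file = [
    {"name": "John, Smith", "dob": "1/1/2000", "doh": "1/2/2000",
     "id": "55555", "type": "Before", "amount": 500.0},
    {"name": "Jane, Doe", "dob": "2/2/1990", "doh": "3/3/2015",
     "id": "66666", "type": "Before", "amount": 120.0},
]

def no_secondary_match(primary, secondary, key=KEY):
    """Keep primary records whose key does not appear in the secondary file."""
    seen = {tuple(r[k] for k in key) for r in secondary}
    return [r for r in primary if tuple(r[k] for k in key) not in seen]

# Fund 1 rows for people who are not already in the totals file.
single_only = no_secondary_match(fund1_file, totals_file)

# Append: the first file plus the single-fund records.
combined = totals_file + single_only
for record in combined:
    print(record)
```

John's Fund 1 row is dropped because he already appears in the totals file, while Jane's single-fund row is kept and appended.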

Hopefully that makes sense.  If not let me know and I can go through it step by step for you.

Brian

suaizai89
Joined: 01/07/2019 - 21:18

Hi All,
I think I am having a similar difficulty importing a difficult PDF. The information in the PDF is not aligned from column to column. I've read through some advice and watched videos online suggesting that data sorting is one of the solutions: create a standard layer, then extract the standard information from that layer.
But because the PDF files are so messy, it seems like I will have to use a script to extract the information I need. Do any of you have recommendations for my scenario?
