[GNC-dev] Import PDF to GnuCash

Thu Jul 26 17:38:13 EDT 2018

Hello, Delta Tango:

I am not one of the GnuCash developers, but I am a software engineer who 
used to work for Adobe Systems (the creator of PDF), and so my ears 
perked up at your question.

On 2018-07-26 12:56, deltatango wrote:
> Hello,
>
> Very interested in the possibility of importing PDF statements into GnuCash.
>
> I know Quickbooks now has this functionality.
Fascinating!  Could you perhaps send a link to a Quicken page explaining 
what PDF import functionality Quicken has?  I see this page 
<https://www.quicken.com/support/how-do-i-import-data-quicken-windows> 
which says, "Please be aware that Quicken cannot import … PDF … files."
> …I was envisioning a system where you select a PDF statement to be imported.
>
> The program then asks you to select the area of the statement which contains
> the transactions, much like a photoshop selection. (And perhaps you could
> save templates of selections for different statements).
>
> Then some kind of OCR scanning reads the columns and data and convert it to
> columns/rows.
>
> Is this in the realm of possibility for some future release?
> …

I am not a GnuCash developer, so I can't speak about what is in the 
realm of possibility for GnuCash.

I can speak about importing PDF files, in general.

PDF is a container; it can contain different kinds of content which 
might look the same to a human reading the PDF file. It might have a 
collection of commands, "use this font, draw this number '€2.500,00' at 
this location on the page". It might have those same commands, with 
annotations saying, "this is a subtotal". Or, it might have a bitmapped 
image which is a picture of a printed page with those numbers.  
Importing each of those kinds of content is a very different task.

It is like the answer to the question, "is it easy to pour out the 
contents of a cup?"  If the cup contains water: very easy. If the cup 
contains paint: easy for 80%, and you have to use a scraper to get out 
the other 20%. If the cup contains dried concrete: very hard.

If a PDF has a special kind of content which is marked up for easy 
extraction, then maybe it would be less effort to make an importer for 
GnuCash. If the PDF does not have markup, but does have commands to draw 
numbers at specific places, then my guess is that any importer would be 
doing the same thing as a tool that converts the PDF file to a CSV file, 
then imports the CSV file into GnuCash.
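
To make that second case concrete, here is a minimal sketch in Python. It assumes some PDF library has already handed you the drawn text as fragments with page coordinates. The (x, y, text) tuples, the function names, and the tolerance value are all hypothetical illustrations of mine, not the API of any real tool. The conversion then amounts to grouping fragments into rows by vertical position and ordering each row left to right:

```python
import csv
import io


def fragments_to_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows of text, top to bottom.

    A fragment joins the current row if its y coordinate is within
    y_tolerance of the first fragment placed in that row. Both the
    tuple layout and the tolerance are assumptions for illustration.
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    # PDF y coordinates grow upward, so sort descending to read top-down.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells left to right by x.
    return [
        [text for _, text in sorted(cells, key=lambda c: c[0])]
        for _, cells in rows
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes cells with commas."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Real statements are messier than this: columns wrap, amounts use different decimal conventions, and headers repeat on every page. That is why I expect such an importer to be no simpler than a general PDF-to-CSV converter.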

If the PDF file contains an image, the best you can do is perform OCR, 
then correct the OCR mistakes. Then you have a PDF file with commands to 
draw numbers at specific places, which you handle as in the case above.

It is conceivable to write a machine learning / artificial intelligence 
program to convert PDF files with statements into a data format which is 
practical to import to GnuCash. But the starting requirement for this is 
tens of thousands of PDF files with example statements, and perhaps 
tens of thousands of corresponding data files to use as references 
for the learning.

Now, a lot has changed since I worked on PDF. Maybe something new is 
possible that I don't know about. But this is the situation as I see it.

Best regards,

          —Jim DeLaHunt, Vancouver, Canada
