[GNC-dev] Import PDF to GnuCash
Jim DeLaHunt
from.gnucash at jdlh.com
Thu Jul 26 17:38:13 EDT 2018
Hello, Delta Tango:
I am not one of the GnuCash developers, but I am a software engineer who
used to work for Adobe Systems (the creator of PDF), and so my ears
perked up at your question.
On 2018-07-26 12:56, deltatango wrote:
> Hello,
>
> Very interested in the possibility of importing PDF statements into GnuCash.
>
> I know Quickbooks now has this functionality.
Fascinating! Could you perhaps send a link to a Quicken page explaining
what PDF import functionality Quicken has? I see this page
<https://www.quicken.com/support/how-do-i-import-data-quicken-windows>
which says, "Please be aware that Quicken cannot import … PDF … files."
> …I was envisioning a system where you select a PDF statement to be imported.
>
> The program then asks you to select the area of the statement which contains
> the transactions, much like a photoshop selection. (And perhaps you could
> save templates of selections for different statements).
>
> Then some kind of OCR scanning reads the columns and data and convert it to
> columns/rows.
>
> Is this in the realm of possibility for some future release?
> …
I am not a GnuCash developer, so I can't speak about what is in the
realm of possibility for GnuCash.
I can speak about importing PDF files, in general.
PDF is a container; it can contain different kinds of content which
might look the same to a human reading the PDF file. It might have a
collection of commands, "use this font, draw this number '€2.500,00' at
this location on the page". It might have those same commands, with
annotations saying, "this is a subtotal". Or, it might have a bitmapped
image which is a picture of a printed page with those numbers.
Importing each of those kinds of content is a very different task.
It is like the answer to the question, "is it easy to pour out the
contents of a cup?" If the cup contains water: very easy. If the cup
contains paint: easy for 80%, and you have to use a scraper to get out
the other 20%. If the cup contains dried concrete: very hard.
If a PDF has a special kind of content which is marked up for easy
extraction, then maybe it would be less effort to make an importer for
GnuCash. If the PDF does not have markup, but does have commands to draw
numbers at specific places, then my guess is that any importer would be
doing the same thing as a tool that converts the PDF file to a CSV file,
then imports the CSV file into GnuCash.
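To make the "numbers at specific places" case concrete, here is a minimal sketch in Python of the conversion step. It assumes the text fragments and their page coordinates have already been pulled out of the PDF by some other tool; the `fragments` list, the function names, and the `row_tolerance` value are all made up for illustration and are not part of GnuCash or any PDF library.

```python
import csv
import io

# Hypothetical input: (x, y, text) fragments, as if already extracted from
# a PDF's text-drawing commands. Units are points; y grows up the page.
fragments = [
    (300.0, 700.2, "Amount"),
    (72.0, 700.0, "Date"),
    (150.0, 700.1, "Description"),
    (72.0, 680.0, "2018-07-01"),
    (300.0, 680.3, "-42.50"),
    (150.0, 679.8, "Grocery store"),
    (72.0, 660.0, "2018-07-03"),
    (150.0, 660.0, "Salary"),
    (300.0, 660.0, "2500.00"),
]


def fragments_to_rows(fragments, row_tolerance=2.0):
    """Group fragments into rows by similar y, then order each row by x."""
    rows = []  # each entry: (y of first fragment in row, [(x, text), ...])
    # Walk the page top to bottom (largest y first).
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= row_tolerance:
            rows[-1][1].append((x, text))   # same visual line
        else:
            rows.append((y, [(x, text)]))   # start a new line
    # Within each row, read left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def rows_to_csv(rows):
    """Serialize rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(fragments_to_rows(fragments)), end="")
```

This sketch glosses over the hard parts: a real statement has empty cells, wrapped descriptions, and page headers, so a row's fragments do not map cleanly onto columns without knowing where each column begins.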
If the PDF file contains an image, the best you can do is perform OCR,
then correct the OCR mistakes. Then you have a PDF file with commands to
draw numbers at specific places, which you handle as in the case above.
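The three cases above can be summarized as a small dispatch, sketched here in Python. The names `ContentKind` and `import_plan` and the step descriptions are hypothetical, chosen only to illustrate the point that the image case reduces to the drawn-text case once OCR is done.

```python
from enum import Enum


class ContentKind(Enum):
    """Hypothetical labels for the three kinds of PDF content."""
    MARKED_UP = "content marked up for extraction"
    DRAWN_TEXT = "commands that draw text at positions"
    IMAGE = "bitmapped image of a printed page"


# Steps shared by the drawn-text case and the image case after OCR.
CSV_STEPS = ["convert positioned text to CSV", "import CSV into GnuCash"]


def import_plan(kind):
    """Return the ordered list of steps for a given kind of content."""
    if kind is ContentKind.MARKED_UP:
        return ["read the marked-up values", "import into GnuCash"]
    if kind is ContentKind.DRAWN_TEXT:
        return list(CSV_STEPS)
    if kind is ContentKind.IMAGE:
        # OCR first, then the same path as drawn text.
        return ["perform OCR", "correct OCR mistakes"] + CSV_STEPS
    raise ValueError(f"unknown content kind: {kind!r}")


if __name__ == "__main__":
    for kind in ContentKind:
        print(kind.name, "->", "; ".join(import_plan(kind)))
```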
It is conceivable to write a machine learning / artificial intelligence
program to convert PDF files with statements into a data format which is
practical to import to GnuCash. But the starting requirement for this is
tens of thousands of PDF files with example statements, and perhaps
tens of thousands of corresponding data files to use as references
for the learning.
Now, a lot has changed since I worked on PDF. Maybe something new is
possible that I don't know about. But this is the situation as I see it.
Best regards,
—Jim DeLaHunt, Vancouver, Canada
--
--Jim DeLaHunt, jdlh at jdlh.com http://blog.jdlh.com/ (http://jdlh.com/)
multilingual websites consultant
355-1027 Davie St, Vancouver BC V6E 4L2, Canada
Canada mobile +1-604-376-8953