Pruning Bayes Import Map

Helge Ramthun HRamthun at gmx.de
Tue Mar 3 14:12:01 EST 2015


Hi,

I had a problem with Bayes not recognising some fairly common
transactions upon import. So eventually I looked into the XML file at
the data stored there. Before that, I knew nothing about what is stored.
I found a bunch of stuff I didn't like:

- Accounts are stored as strings. That is, if you change the name of an
account, delete it or whatever, you lose the "learning process" of
Bayes. Also, you have dead wood in the data, as the accounts don't even
exist anymore. And I do wonder if the problem is related to Bayes
recognising this as a really perfect match for a non-existing account?

- There are very many entries in there with weight 1 or 2. This is fine
if you just started using the import. But I have used it now for 4 or 5
years. Everything that has a weight of 1 or 2 is simply not significant.
This is also plausible if you look at the corresponding tokens: often
they even contain a time of day ("9:43") or are just some kind of
transaction number which is unique.

- Some really good information is lost, since only spaces are used as
separators. However, I have quite a few companies sending me
descriptions like customerid/date, e.g. 123456/201501. Now, the customer
id never changes and would, if used as a token separated by "/", have
really high weight. Obviously, the date changes each month, so with
monthly trx this means the whole token always has weight 1 and the good
information of the customer id is lost. I understand that for some, "/"
may rather be part of a token. After having looked for hours at my raw
XML, I didn't find any such cases. So if, for compatibility reasons, "/"
should stay part of a token rather than a separator by default, maybe it
would be possible to make this a property the user can choose: use "/"
as a separator (and spaces, of course).
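
To illustrate the separator point, here is a minimal sketch. It is not
the GnuCash tokenizer: tokenize() and its $splitOnSlash switch are
hypothetical names, and only the description 123456/201501 comes from
the example above.

```perl
use strict;
use warnings;

# Hypothetical helper, not GnuCash code: split a transaction description
# into tokens. With $splitOnSlash set, "/" separates tokens as well.
sub tokenize {
    my ($description, $splitOnSlash) = @_;
    my $separator = $splitOnSlash ? qr{[\s/]+} : qr{\s+};
    # Drop the empty leading field that split yields for leading separators.
    return grep { length } split $separator, $description;
}

# The date part changes every month, the customer id does not.
print "spaces only: ", join(", ", tokenize("123456/201501", 0)), "\n";
print "with slash:  ", join(", ", tokenize("123456/201501", 1)), "\n";
```

With spaces only, every month produces a brand-new token. With "/" as a
separator, the customer id is a token of its own and can collect weight
across imports.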

Well, to fix my problem and in particular the obsolete accounts, I have
created a perl script, which allows pruning of the XML.

It can take a file containing a list of (parent) accounts which should
never appear (e.g. obsolete accounts or things like "Ausgleichskonto-EUR").
It also allows you to define a minimal weight. If an account does not have
sufficient weight for a token, it is removed from that token.
If, after all this, a token has no accounts left, it is removed.
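
The rules above can be sketched on a small in-memory map. This is only
an illustration under assumptions: prune_map() and the
token => { account => weight } layout are hypothetical, and the account
names are made up. The actual script further down works line by line on
the XML instead.

```perl
use strict;
use warnings;

# Hypothetical in-memory form of the import map: token => { account => weight }.
# Returns the number of removed accounts and the number of removed tokens.
sub prune_map {
    my ($map, $obsolete, $minWeight) = @_;
    my ($accsRemoved, $tokensRemoved) = (0, 0);
    for my $token (keys %$map) {
        my $accs = $map->{$token};
        for my $acc (keys %$accs) {
            # Prefix match, so naming a parent account covers its children.
            my $isObsolete = grep { index($acc, $_) == 0 } @$obsolete;
            if ($isObsolete || $accs->{$acc} < $minWeight) {
                delete $accs->{$acc};
                $accsRemoved++;
            }
        }
        # A token with no accounts left carries no information.
        if (!%$accs) {
            delete $map->{$token};
            $tokensRemoved++;
        }
    }
    return ($accsRemoved, $tokensRemoved);
}

# Made-up sample data.
my %map = (
    "123456" => { "Expenses:Phone" => 12, "Old:Phone" => 4 },
    "9:43"   => { "Expenses:Misc"  => 1 },
);
my ($accs, $tokens) = prune_map(\%map, ["Old"], 3);
print "Accounts removed: $accs, tokens removed: $tokens\n";
```

Here "Old:Phone" goes because of the obsolete list, "Expenses:Misc" goes
because its weight is below 3, and the token "9:43" goes because nothing
is left in it.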

I set the minimal weight to 3 and gave it a list of obsolete accounts,
resulting in:
- 5858 Accounts removed.
- 3277 Tokens removed.
- 1.8 MB uncompressed size reduction (of 16.6 MB originally)

Then I deleted my last imports and redid them. There are a few surprises
that didn't get matched properly, but the original problems were gone
and I'm fairly confident Bayes will relearn the few surprises. Overall,
the import looked really good. Not sure if this will be true for
everyone; I still wonder how much the obsolete accounts hurt the algo.

I also tested removing very short tokens (e.g. everything with less than
4 chars). This was not a good idea. First off, it wasn't even that many
more tokens being removed. That is due to the fact that tokens which
contain only accounts with low weight will be removed no matter what.
And vice versa, this rule would actually delete significant accounts
with high weight. It is in the script and you can enable it at your own
risk, but I would discourage you from doing so.

The script is at the bottom. Don't expect anything too clever...
Here is the relevant documentation:
$pruneObsoleteAccs: Set to 0 to not use this pruning method.

$obsoleteAccsFile: Name of file with obsolete accounts, or obsolete
top-accounts. One acc per row. Do not have an empty last row, or
everything gets deleted.
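
The reason an empty row is dangerous: the script matches each account
against /^$obsAcc/, and an empty $obsAcc matches every account name. A
minimal sketch of a more forgiving reader follows. It is not part of the
script below; read_obsolete_accounts() is a hypothetical name, and the
second account name is made up.

```perl
use strict;
use warnings;

# Sketch: read one account name per row, dropping empty rows so that a
# trailing blank line cannot turn into a pattern that matches everything.
sub read_obsolete_accounts {
    my ($fh) = @_;
    my @accounts;
    while (my $row = <$fh>) {
        $row =~ s/\r?\n\z//;          # strip LF or CRLF line endings
        next if $row =~ /^\s*$/;      # skip empty or whitespace-only rows
        push @accounts, $row;
    }
    return @accounts;
}

# Demonstration on an in-memory file ending in an empty row.
my $content = "Ausgleichskonto-EUR\nExpenses:Old Phone\n\n";
open my $fh, '<', \$content or die "in-memory file: $!";
my @obsolete = read_obsolete_accounts($fh);
close $fh;
print scalar(@obsolete), " obsolete accounts loaded\n";
```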

$pruneMinimalWeight = 1: Set to 0 to not enforce a minimal weight. Use
this option only if you have been training Bayes for a while; otherwise
you will keep it from learning.

$minimalWeight = 3: Minimal weight enforced. For me, 3 worked fine.
Lower means more prudent, but also less pruning.

$pruneShortTokens = 0: Set to 1 to remove short tokens, independent of
weight of contained accounts. This is not recommended!

$minimalTokenLenght = 2: Yes, there is a typo. Anyway, this is the
minimal length for the above option. Anything shorter will be removed.
Not recommended.

$inFile: Name of the uncompressed original XML.
$outFile: Name of the output file.
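
If your data file was saved with compression enabled, it has to be
gunzipped before the script can read it. A minimal sketch using the
IO::Uncompress::Gunzip module that ships with Perl; uncompress_book() is
a hypothetical helper, and the XML string is just a placeholder.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Sketch: decompress a gzip-compressed file so it can be read line by
# line. Both arguments may be file names or references to scalars.
sub uncompress_book {
    my ($source, $target) = @_;
    gunzip($source => $target)
        or die "gunzip failed: $GunzipError\n";
    return;
}

# Round trip in memory, so the example needs no files on disk.
my $xml = "<book>placeholder</book>\n";
gzip(\$xml => \my $compressed) or die "gzip failed: $GzipError\n";
uncompress_book(\$compressed, \my $restored);
print "round trip ok\n" if $restored eq $xml;
```

With real files you would pass two file names instead of the scalar
references, and then point $inFile at the uncompressed copy.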

The tool assumes there is only one account using Bayes. It should be
straightforward to extend it if you have more than one.

So, usual disclaimers:
- Use at your own risk.
- The author will not be held liable.
- It works for me, no guarantee for anyone else.
- Make a backup, duh!

Best,

YeOldHinnerk
---

use strict;
use warnings;

my $pruneObsoleteAccs = 1;
my $obsoleteAccsFile = "obsoleteAccs.txt";
my @obsoleteAccs;
my $obsAcc;
my $pruneMinimalWeight = 1;
my $minimalWeight = 3;   # value given in the documentation above
my $pruneShortTokens = 0;
my $minimalTokenLenght = 2;

my $inFile = "Haushaltsbuch v11.xml";
my $outFile = "Haushaltsbuch v11 pruned.xml";
my @lines;
my $line;
my $row;

# Read in obsolete account names.
if ($pruneObsoleteAccs) {
    open my $oac, '<', $obsoleteAccsFile or die "$obsoleteAccsFile: $!";
    push @obsoleteAccs, <$oac>;
    close $oac or die "$obsoleteAccsFile: $!";
    chomp(@obsoleteAccs);
}

# Read in XML
open my $ifh, '<', $inFile or die "$inFile: $!";
push @lines, <$ifh>;
close $ifh or die "$inFile: $!";
chomp @lines;

# Create output XML
open my $ofh, '>', $outFile or die "$outFile: $!";
$row = 0;
$line = $lines[$row];
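# Copy everything before the Bayes import map unchanged.
while (defined $line && $line !~ /import-map-bayes/) {
    print $ofh ($line . "\n");
    $row = $row +1;
    $line = $lines[$row];
}
# Stop instead of looping forever if the file has no Bayes data.
defined $line or die "$inFile: no import-map-bayes slot found\n";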
print $ofh ($line . "\n");  # This line contains "import-map-bayes"
$row = $row +1;
$line = $lines[$row];
print $ofh ($line . "\n");  # "Frame" row of import-map.
$row = $row +1;
$line = $lines[$row];

my $currentToken;
my $currentAcc;
my $currentWght;
# If this value is negative, continue with writing everything out.
my $cntSlotValue = 0;

my $importOpen = 1;
my $tokenOpen = 0;
my $accOpen = 0;
my %tokenAccs;
my $value;

my $cntPrunedTokens = 0;
my $cntPrunedAccs = 0;

while ($importOpen && defined $line) {
    if ($line =~ /<slot>/) {            # slot starts
        if ($tokenOpen) {
            $accOpen = 1;
        } else {
            $tokenOpen = 1;
            $accOpen = 0;
        }
    } elsif ($line =~ /<\/slot>/) {    # slot ends
        if ($accOpen) {
            $accOpen = 0;
        } elsif ($tokenOpen) {
            # add pruning of @tokenAccs and @tokenWghts here.

            # Short Tokens
            if ($pruneShortTokens) {
                if (length($currentToken) < $minimalTokenLenght) {
                    %tokenAccs = ();
                }                           
            }
           
            # Obsolete Accounts.
            if ($pruneObsoleteAccs) {
                foreach $currentAcc (keys %tokenAccs) {
                    foreach $obsAcc (@obsoleteAccs) {
                        # \Q...\E: treat the account name literally, not as a regex.
                        if ($currentAcc =~ /^\Q$obsAcc\E/) {
                            delete $tokenAccs{$currentAcc};
                            $cntPrunedAccs = $cntPrunedAccs +1;
                            last;
                        }
                    }
                }
            }
           
            # Minimal Weight
            if ($pruneMinimalWeight) {
                foreach $currentAcc (keys %tokenAccs) {
                    if ($tokenAccs{$currentAcc} < $minimalWeight) {
                        delete $tokenAccs{$currentAcc};
                        $cntPrunedAccs = $cntPrunedAccs +1;
                        # print STDOUT "Minimal Weight: Deleted $currentAcc\n";
                    }
                }
            }
           
           
            # write all remaining entries of %tokenAccs to $ofh.
            if (%tokenAccs) { # only if accounts remain in tokenAccs.
                print $ofh "        <slot>\n";
                print $ofh "          <slot:key>$currentToken<\/slot:key>\n";
                print $ofh "          <slot:value type=\"frame\">\n";
       
                # Write remaining accs.
                foreach $currentAcc (keys %tokenAccs) {
                    print $ofh "            <slot>\n";
                    print $ofh "              <slot:key>$currentAcc<\/slot:key>\n";
                    $currentWght = $tokenAccs{$currentAcc};
                    print $ofh "              <slot:value type=\"integer\">$currentWght<\/slot:value>\n";
                    print $ofh "            <\/slot>\n";
                }
               
                # close Token
                print $ofh "          <\/slot:value>\n";
                print $ofh "        <\/slot>\n";
            } else {
                # token gets dropped.
                $cntPrunedTokens = $cntPrunedTokens + 1;
            }
            # reset variables
            %tokenAccs = ();
            $accOpen = 0;
            $tokenOpen = 0;
        } else {
            $importOpen = 0;
            print $ofh "<\/slot:value>\n"; # Bayes ends
            print $ofh "<\/slot>\n";
        }
    } elsif ($line =~ /<slot:value type="integer">(.*)<\/slot:value>/) {
# <slot:value type="integer">1</slot:value>
        $tokenAccs{$currentAcc} = $1;
    } elsif ($line =~ /^<slot:value type="frame">/){
        # ignore the slot:value="Frame"
    } elsif ($line =~ /^<\/slot:value>/){
        # ignore the slot:value="Frame"
        if (!$tokenOpen) {
            $importOpen = 0;
            print $ofh ($line . "\n");  # ending "Frame" row of import-map.
        }
    } elsif ($line =~ /<slot:key>(.*)<\/slot:key>/) {
        if ($accOpen) {
            $currentAcc = $1;
            $tokenAccs{$currentAcc}= 0;
        } else {
            $currentToken = $1;
        }
    }
    $row = $row +1;
    $line = $lines[$row];
}

# Write back everything
while ($row <= $#lines) {
    print $ofh ($line . "\n");
    $row = $row +1;
    $line = $lines[$row];
}
close $ofh or die "$outFile: $!";

print STDOUT "Accounts removed: $cntPrunedAccs\n";
print STDOUT "Tokens removed:   $cntPrunedTokens\n";










