Successful data recovery (was Re: File signatures??)

John Ralls jralls at ceridwen.us
Sat Jul 1 10:51:09 EDT 2017


Matt,

I’m glad that you managed to recover your data.

Now go set up a proper backup program so that you never have to go through that again. Do it today!

Regards,
John Ralls


> On Jun 30, 2017, at 2:01 PM, matt at considine.net wrote:
> 
> Max, all,
> 
> Thank you for the pointers and help.  I'm pleased to say that I seem to have recovered my data.  Still remaining to be found would be customizations I made to the standard report.  But if what I have found stands up to some checks against known bank balances, etc, then I won't be too far off of where I was a month ago.
> 
> What I have been trying to sift through (to recap a bit) is the result of a recovery done with testdisk/photorec, which left a blizzard of files and file fragments on a multi-terabyte hard drive.  By and large the filenames were lost (though not in all cases), and photorec uses a list of known file signatures to try to append the appropriate file extension.  This largely works, but not always.  Finally, if I had known of a definitive file signature *before* I started the recovery, that might have helped.  But for text-oriented files (vs. JPEGs, PDFs, executables, etc.) such a signature isn't always reliable or available.
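As a rough sketch of what such a signature check could look like (drawing on Max's observations below that gzip files start with the bytes 0x1f 0x8b and that GnuCash XML contains ``<gnc-v2'' near the start of the file; the function name and the 200-byte window are my own assumptions, not anything photorec or GnuCash defines):

```shell
# Hypothetical helper: classify a recovered file as a *possible* GnuCash
# data file by its leading bytes, without trusting any file extension.
is_gnucash_candidate() {
    f="$1"
    # read the first two bytes as hex; gzip files begin with 1f 8b
    magic=$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        # compressed: peek at the decompressed start for the <gnc-v2 root element
        gunzip -c "$f" 2>/dev/null | head -c 200 | grep -q '<gnc-v2'
    else
        # uncompressed: <gnc-v2 appears shortly after the XML declaration
        head -c 200 "$f" | grep -q '<gnc-v2'
    fi
}
```

Used as, e.g., `is_gnucash_candidate recup_dir.1/f0123456 && echo candidate`, it should flag both compressed and uncompressed candidates regardless of what extension photorec guessed.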
> 
> Fortunately, photorec seems to recognize XML and xml.gz formatted files.  Diving head first into a pool I hadn't been in before, I came up with bash scripts (this is a Linux machine I'm working on) to do recursive searches.  Basically, I would open a terminal window and
>  gedit ~/.bashrc
> 
> and add the following to the end:
> 
> 
> function odsgrep(){
> term="$1"
> echo Start search : $term
> OIFS="$IFS"
> IFS=$'\n'
> for file in $(find . -name "*.ods"); do
>    echo "$file";
>    unzip -p "$file" content.xml | tidy -q -xml 2> /dev/null | grep -i -F "$term" > /dev/null;
>    if [ $? -eq 0 ]; then
>       echo FOUND FILE "$file";
>       echo "$file";
>    fi;
> done
> IFS="$OIFS"
> echo Finished search : $term
> }
> 
> function mattpdfgrep(){
> term="$1"
> echo Start search : $term
> OIFS="$IFS"
> IFS=$'\n'
> for file in $(find . -name "*.pdf"); do
>    #echo $file;
>    pdftotext -htmlmeta "$file" - | grep --with-filename --label="$file" --color -i -F "$term" ;
>    if [ $? -eq 0 ]; then
>      echo "$file";
>      pdfinfo "$file";
>    fi;
> done
> IFS="$OIFS"
> echo Finished search : $term
> }
> 
> function mattxlsgrep(){
> term="$1"
> echo Start search : $term
> OIFS="$IFS"
> IFS=$'\n'
> for file in $(find . -name "*.xlsx"); do
>    #echo $file;
>    xlsx2csv "$file" | grep --with-filename --label="$file" --color -i -F "$term" ;
>    if [ $? -eq 0 ]; then
>      echo "$file";
>    fi;
> done
> for file in $(find . -name "*.xls"); do
>    #echo $file;
>    xls2csv "$file" | grep --with-filename --label="$file" --color -i -F "$term" ;
>    if [ $? -eq 0 ]; then
>      echo "$file";
>    fi;
> done
> IFS="$OIFS"
> echo Finished search : $term
> }
> 
> function mattxmlgzgrep(){
> term="$1"
> echo Start search : $term
> OIFS="$IFS"
> IFS=$'\n'
> for file in $(find . -name "*.xml.gz"); do
>    #echo $file;
>    gunzip -c "$file" | tidy -q -xml 2> /dev/null | grep -i -F "$term" > /dev/null;
>    if [ $? -eq 0 ]; then
>       echo FOUND FILE "$file";
>       #echo $file;
>    fi;
> done
> IFS="$OIFS"
> echo Finished search : $term
> }
> 
> function matttxtgrep(){
> term="$1"
> echo Start search : $term
> OIFS="$IFS"
> IFS=$'\n'
> for file in $(find . -name "*.txt"); do
>    #echo $file;
>    grep -i -F "$term" "$file" > /dev/null;
>    if [ $? -eq 0 ]; then
>       echo FOUND FILE "$file";
>       #echo $file;
>    fi;
> done
> IFS="$OIFS"
> echo Finished search : $term
> }
> 
> These custom commands (built from a 'net search that turned up a variant of the first one) allow for recursive file searches as well as subsequent unzipping and string-search operations.  Importantly, they attempt to look inside spreadsheets and PDFs, which aren't otherwise "grep-able".
> 
> To find the data, I used the mattxmlgzgrep routine to search *backwards* in time for the following
>  <ts:date>2017-06
> 
> It found no files, which was expected, since I had last worked on this account in March or April, around US tax season.  The next search, for
>  <ts:date>2017-05
> also turned up nothing.  But searching for <ts:date>2017-04 turned up one hit, and <ts:date>2017-03 turned up a large number.  So even though the file's timestamp reflected the recovery date rather than its original modification time, searching backwards through the entries let me narrow things down.
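That month-by-month walk backwards could be automated along these lines (a sketch only: it assumes GNU date's relative-date syntax, that transaction dates appear as <ts:date>YYYY-MM in the uncompressed XML, and a 12-month search window; the function name is mine):

```shell
# Walk backwards one month at a time, stopping at the first month for
# which any recovered *.xml.gz file contains a matching transaction date.
scan_backwards() {
    prefix='<ts:date>'
    for months_ago in $(seq 0 12); do
        ym=$(date -d "-${months_ago} months" +%Y-%m)   # GNU date
        echo "Searching for ${prefix}${ym}"
        # decompress each candidate and print its name on a match,
        # mirroring the mattxmlgzgrep loop above
        found=$(find . -name '*.xml.gz' -exec sh -c \
            'gunzip -c "$1" 2>/dev/null | grep -q -F "$2" && echo "$1"' _ {} "${prefix}${ym}" \;)
        if [ -n "$found" ]; then
            echo "Most recent entries found in:"
            echo "$found"
            return 0
        fi
    done
    return 1
}
```

Run from the directory holding the recovered files, it stops at the newest month that has any hits, which is exactly the narrowing-down step described above.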
> 
> Examining the file in gnucash (it seemed to have been pulled in cleanly) showed all the categories, accounts, data, etc that I expected to see.
> 
> It would be great to find the files related to the standard report customizations and I'll spend a little time trying to do that.  Not sure what would be a suitable "marker" yet but I think I have a candidate or two.  But after that I need to find the other records that made up some of this workflow.  Fortunately, they were all digital to begin with and I believe I still have access online.
> 
> Thanks again to everyone who helped.  If there's anything I can share in return, let me know.
> 
> Matt
> 
> 
> On 2017-06-30 14:06, max at hyre.net wrote:
>> Dear Matt:
>>> The problem is that the recovery operation (using
>>> Testdisk/Photorec) results in files and file fragments
>>> that may or may not be correctly identified by file
>>> extensions.
>>   It sounds like what you want is a magic number (file-format ID:
>> https://en.wikipedia.org/wiki/File_format#Magic_number) for .gnucash
>> files.  Looking at my file it appears that ``<gnc-v2'' starting at the
>> 41st character in the file would do it.  (I presume the `2' in
>> ``-v2'' is a version number, and could change at some future date, but
>> for now that's not a problem.)
>>   It would be nice if the recovery program let you add to the
>> file-ID list; otherwise you're back to grep.  I hope that it
>> recognizes gzipped files (possible GnuCash files, compressed), but if
>> not, you want to look for the first two bytes = 0x1f 0x8b.  Of
>> course, then you'll have to unzip them to see whether they're really
>> what you want.  :-/
>>   Gurus:  Is this right?  For future-proofing, can we assume the
>> magic number will always be in position 41?  Is there an actual,
>> designated magic number for GnuCash files somewhere?
>>   Hope this makes sense/helps...
>>       Best wishes,
>>           Max Hyre
> _______________________________________________
> gnucash-user mailing list
> gnucash-user at gnucash.org
> https://lists.gnucash.org/mailman/listinfo/gnucash-user
> -----
> Please remember to CC this list on all your replies.
> You can do this by using Reply-To-List or Reply-All.


