Successful data recovery (was Re: File signatures??)

matt at considine.net matt at considine.net
Fri Jun 30 17:01:31 EDT 2017


Max, all,

Thank you for the pointers and help.  I'm pleased to say that I seem to 
have recovered my data.  Still to be found are the customizations I 
made to the standard report, but if what I have recovered stands up to 
some checks against known bank balances, etc., then I won't be too far 
off from where I was a month ago.

What I have been trying to sift through (to recap a bit) is the result 
of a recovery done with testdisk/photorec, which left a blizzard of 
files and file fragments on a multi-terabyte hard drive.  By and large 
the filenames were lost (though not in all cases), and photorec uses a 
list of known file signatures to try to append the appropriate file 
extension.  This largely works, but not always.  Finally, if I had 
known of a definitive file signature *before* I started the recovery, 
that might have helped; but for text-oriented files (vs. JPEGs, PDFs, 
executables, etc.) such a signature isn't always reliable or even 
available.
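
(For anyone following along, here is a rough sketch of the kind of 
by-hand signature check I mean; the checksig name is just something I 
made up.  It looks for the gzip magic bytes 0x1f 0x8b and for the 
<gnc-v2 marker Max mentions in his reply below, without assuming the 
marker sits at any fixed offset.)

function checksig(){
  f="$1"
  # gzip streams begin with the two bytes 0x1f 0x8b
  if [ "$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]; then
     echo "$f: gzip-compressed (possibly a compressed gnucash file)"
  # an uncompressed gnucash XML file carries <gnc-v2 near the top
  elif head -c 200 "$f" | grep -q '<gnc-v2'; then
     echo "$f: looks like uncompressed gnucash XML"
  else
     file "$f"    # fall back to file(1)'s generic signature database
  fi
}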

Fortunately, photorec seems to recognize XML and xml.gz formatted 
files.  Diving head first into a pool I hadn't been in before, I came 
up with bash functions (this is a Linux machine I'm working on) to do 
recursive searches.  Basically, I would open a terminal window, run
   gedit ~/.bashrc

and add the following to the end:


function odsgrep(){
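  # Recursively search every .ods under the current directory: unzip the
  # embedded content.xml, run it through tidy, and grep it for "$1".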
  term="$1"
  echo Start search : $term
  OIFS="$IFS"
  IFS=$'\n'
  for file in $(find . -name "*.ods"); do
     echo $file;
     unzip -p "$file" content.xml | tidy -q -xml 2> /dev/null | grep -i -F "$term" > /dev/null;
     if [ $? -eq 0 ]; then
        echo FOUND FILE $file;
        echo $file;
     fi;
  done
  IFS="$OIFS"
  echo Finished search : $term
}

function mattpdfgrep(){
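  # Recursively search every .pdf: convert each to text with pdftotext,
  # grep for "$1", and print pdfinfo metadata for any file that matches.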
  term="$1"
  echo Start search : $term
  OIFS="$IFS"
  IFS=$'\n'
  for file in $(find . -name "*.pdf"); do
     #echo $file;
     pdftotext -htmlmeta "$file" - | grep --with-filename --label="$file" --color -i -F "$term" ;
     if [ $? -eq 0 ]; then
       echo $file;
       pdfinfo "$file";
     fi;
  done
  IFS="$OIFS"
  echo Finished search : $term
}

function mattxlsgrep(){
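  # Recursively search .xlsx and .xls spreadsheets by converting each to
  # CSV (xlsx2csv / xls2csv) and grepping the output for "$1".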
  term="$1"
  echo Start search : $term
  OIFS="$IFS"
  IFS=$'\n'
  for file in $(find . -name "*.xlsx"); do
     #echo $file;
     xlsx2csv "$file" | grep --with-filename --label="$file" --color -i -F "$term" ;
     if [ $? -eq 0 ]; then
       echo $file;
     fi;
  done
  for file in $(find . -name "*.xls"); do
     #echo $file;
     xls2csv "$file" | grep --with-filename --label="$file" --color -i -F "$term" ;
     if [ $? -eq 0 ]; then
       echo $file;
     fi;
  done
  IFS="$OIFS"
  echo Finished search : $term
}

function mattxmlgzgrep(){
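  # Recursively search *.xml.gz files (compressed XML, which is where the
  # gnucash data turned up): gunzip to stdout, tidy, and grep for "$1".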
  term="$1"
  echo Start search : $term
  OIFS="$IFS"
  IFS=$'\n'
  for file in $(find . -name "*.xml.gz"); do
     #echo $file;
     gunzip -c "$file" | tidy -q -xml 2> /dev/null | grep -i -F "$term" > /dev/null;
     if [ $? -eq 0 ]; then
        echo FOUND FILE $file;
        #echo $file;
     fi;
  done
  IFS="$OIFS"
  echo Finished search : $term
}

function matttxtgrep(){
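  # Recursively search plain .txt files for "$1" with an ordinary grep.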
  term="$1"
  echo Start search : $term
  OIFS="$IFS"
  IFS=$'\n'
  for file in $(find . -name "*.txt"); do
     #echo $file;
     grep -i -F "$term" "$file" > /dev/null;
     if [ $? -eq 0 ]; then
        echo FOUND FILE $file;
        #echo $file;
     fi;
  done
  IFS="$OIFS"
  echo Finished search : $term
}

These custom commands (built from a 'net search that turned up a 
variant of the first one) allow for recursive file searches followed by 
the appropriate unzipping/conversion and string-search steps.  
Importantly, they attempt to look inside spreadsheets and PDFs, which 
aren't otherwise "grep-able".

To find the data, I used the mattxmlgzgrep routine to search *backwards* 
in time for the following
   <ts:date>2017-06

It found no files, which was expected, since I had last worked on this 
account in March or April, around US tax season.  The next search for
   <ts:date>2017-05
also turned up nothing.  But searching for <ts:date>2017-04 turned up 
one hit, and <ts:date>2017-03 turned up a large number.  So even though 
the recovered files all carry timestamps from the recovery itself 
rather than their original save dates, searching backwards for dated 
entries let me narrow things down.
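
The same month-by-month walk backwards could be scripted in a single 
pass; a rough sketch (the date range here is arbitrary) would be:

   for month in 2017-06 2017-05 2017-04 2017-03 2017-02; do
      echo "== $month =="
      mattxmlgzgrep "<ts:date>$month"
   done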

Examining the file in gnucash (it seemed to have been pulled in 
cleanly) showed all the categories, accounts, data, etc. that I 
expected to see.

It would be great to find the files related to the standard report 
customizations, and I'll spend a little time trying to do that.  I'm 
not sure yet what would be a suitable "marker", but I think I have a 
candidate or two.  After that I need to find the other records that 
made up some of this workflow.  Fortunately, they were all digital to 
begin with, and I believe I still have access to them online.

Thanks again to everyone who helped.  If there's anything I can share 
in return, let me know.

Matt


On 2017-06-30 14:06, max at hyre.net wrote:
> Dear Matt:
> 
>> The problem is that the recovery operation (using
>> Testdisk/Photorec) results in files and file fragments
>> that may or may not be correctly identified by file
>> extensions.
> 
>    It sounds like what you want is a magic number (file-format ID:
> https://en.wikipedia.org/wiki/File_format#Magic_number) for .gnucash
> files.  Looking at my file it appears that ``<gnc-v2'' starting at the
> 41st character in the file would do it.  (I presume the `2' in
> ``-v2'' is a version number, and could change at some future date, but
> for now that's not a problem.)
> 
>    It would be nice if the recovery program lets you add to the
> file-ID list, otherwise you're back to grep.  I hope that it
> recognizes gzipped files (possible GNUCash files, compressed), but if
> not you want to look for the first two characters = 0x1f 0x8b.  Of
> course, then you'll have to unzip them to see whether they're really
> what you want.  :-/
> 
>    Gurus:  Is this right?  For future-proofing, can we assume the
> magic number will always be in position 41?  Is there an actual,
> designated, magic number for GNUCash files somewhere?
> 
>    Hope this makes sense/helps...
> 
> 
>        Best wishes,
> 
>            Max Hyre

