Introduction
------------
Back when I was young and innocent, I created my GPG key for Debian with
subkeys, following this article: https://wiki.debian.org/Subkeys
I then stored the master key pair on a USB drive and (just in case) printed a
base64 encoded gzipped tarball of the $GNUPGHOME containing the master key onto
five pages of A4 paper. Back then I couldn't find a better way to store binary
data on paper, and since I didn't expect to ever need it anyway, I went for the
simplest solution.

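In today's terms, the printed pages came out of a pipeline roughly like the
following sketch (a reconstruction for illustration, not the exact commands I
used back then; the output file name is made up):

```python
import base64
import io
import os
import tarfile

# Reconstructed sketch: tar up $GNUPGHOME, gzip it, base64-encode it for printing.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    tar.add(os.environ["GNUPGHOME"], arcname=".")

# encodebytes wraps the output at 76 characters per line, ready for printing.
with open("printme.txt", "wb") as f:
    f.write(base64.encodebytes(buf.getvalue()))
```
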
Fast forward four years: I wanted to sign the key of a friend, for which I need
my master key. So I got the USB stick out of its closet and plugged it in...
The USB stick was bust. No reaction anymore. No idea what happened.

So now I was left with five pages of base64 as the only remaining artifact of
my secret master key.

Key Recovery
------------
Manually typing five pages with 4332 characters each is no fun, so even though
it would probably take longer (but be more fun), I investigated how to automate
the task.
Complete OCR solutions like tesseract already exist, and while they perform
incredibly well on real text, they do not perform well on my input data,
because they expect the text to be real human language and because they are
trained on a wide variety of fonts, which makes them less precise for the
specific font I used. I could have trained tesseract on my specific font, but
then I would have learned more about how to use tesseract than about how to
teach a machine to read base64 text.
After some trial and error I found out that I had apparently printed the five
pages in question in "Liberation Mono". After more experimentation I settled
on the following method (steps 1 to 3 are sketched in code after the list):

1. find the minimum rectangle around the text
2. rotate the image so that the text is exactly evenly aligned
3. divide the text into the required raster of letters (luckily I had printed with a monospaced font)
4. attempt to recognize each letter
5. present the result to the user, one letter at a time, comparing the scanned original and the guessed letter, and allow corrections to be made
6. store the result as a text file
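
Here is a minimal OpenCV sketch of steps 1 to 3, assuming a clean black-on-white
scan; the function name, the `rows`/`cols` parameters and the deskewing details
are illustrative, not the exact code I used:

```python
import cv2
import numpy as np

def deskew_and_raster(path, rows, cols):
    """Deskew a scanned page and cut it into a rows x cols grid of letter cells."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Dark letters on white paper: invert so that text pixels are non-zero.
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # Step 1: minimum rotated rectangle around all text pixels.
    ys, xs = np.where(binary > 0)
    points = np.column_stack((xs, ys)).astype(np.float32)
    (cx, cy), (w, h), angle = cv2.minAreaRect(points)
    if angle < -45:
        angle += 90  # normalize; OpenCV's angle convention varies by version
    # Step 2: rotate the page so that the text is exactly axis-aligned.
    M = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
    rotated = cv2.warpAffine(binary, M, (img.shape[1], img.shape[0]))
    ys, xs = np.where(rotated > 0)
    text = rotated[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # Step 3: with a monospaced font the letters form a uniform grid.
    ch, cw = text.shape[0] / rows, text.shape[1] / cols
    return [[text[int(r * ch):int((r + 1) * ch),
                  int(c * cw):int((c + 1) * cw)]
             for c in range(cols)]
            for r in range(rows)]
```
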
Improving character recognition
-------------------------------
I started by training my k-nearest neighbors model with blurred synthetic
images that I generated from SVGs, each showing one base64 letter in
"Liberation Mono". There seems to be no simple way to render letters into a
bitmap while preserving the font baseline, so I used SVG for the task.

With that training set, the k-nearest neighbors algorithm was able to correctly
detect 40 out of 43 characters. The detection rate increased dramatically once
I fed the actually scanned characters from the first page into the training set
for detecting characters on the second page. After all five pages had been
scanned, the k-nearest neighbors algorithm made fewer than 5 errors per 10000
characters. Specifically, it had problems differentiating 0 (zero) from O
(capital o).
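
For reference, here is a minimal sketch of such a classifier using OpenCV's
current `cv2.ml` module (the code back then used the older CvKNearest
interface); the cell size and the value of k are arbitrary choices for the
example:

```python
import cv2
import numpy as np

CELL = (16, 24)  # (width, height) each letter cell is resized to

def to_sample(cell_img):
    # Flatten a letter cell into a single float32 feature row.
    return cv2.resize(cell_img, CELL).astype(np.float32).reshape(1, -1)

def train_knn(cells, labels):
    # cells: list of letter images, labels: the corresponding characters.
    samples = np.vstack([to_sample(c) for c in cells])
    responses = np.array([ord(ch) for ch in labels], dtype=np.float32)
    knn = cv2.ml.KNearest_create()
    knn.train(samples, cv2.ml.ROW_SAMPLE, responses)
    return knn

def classify(knn, cell_img, k=5):
    _, results, _, _ = knn.findNearest(to_sample(cell_img), k)
    return chr(int(results[0][0]))
```

Feeding corrected pages back into the model then simply means retraining with
the scanned cells appended to the synthetic ones.
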
Finding the needle in the haystack
----------------------------------
Even with such a high recognition rate, the project still fails unless every
last character has been read in correctly. A single swapped bit will lead to a
useless secret key. The problem here was how to determine when I was actually
done with the transcription. The OCR engine can make errors, but so can I when
I check the result. In the worst case I'd never get any indication that
something was still wrong.

I didn't end up storing any checksum of the data I printed, which was a big
mistake. On the other hand, I was lucky that what I stored was gzipped data and
the gzip format contains a CRC. Thus, using `gzip -dtv` I was able to check
whether the data I had decoded was correct or whether an error was still hiding
somewhere.
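
The same check can be done programmatically: Python's gzip module verifies the
CRC-32 stored in the gzip trailer when it decompresses, so a sketch like the
following (the file name is made up) reports whether an error is still present:

```python
import base64
import gzip
import zlib

# Read the transcribed text; b64decode discards newlines and other whitespace.
with open("transcribed.txt") as f:
    raw = base64.b64decode(f.read())

try:
    # gzip.decompress checks the CRC-32 and length stored in the trailer,
    # so any remaining transcription error surfaces here.
    data = gzip.decompress(raw)
    print("CRC ok, %d bytes recovered" % len(data))
except (OSError, EOFError, zlib.error) as exc:
    print("corruption still present:", exc)
```
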
It would've helped a lot if I had stored my data in a format that tolerates a
few errors and still allows the original data to be recovered.

Storing binary data on paper
----------------------------
Now I'm done recovering my key, and I have some software that does the job
nearly automatically. The underlying problem, though, is still not solved
today, four years later.

A bunch of software exists that promises to solve this problem, but none of it
is very popular, and if I trust some data to a medium that is going to survive
decades, then I want to be sure that I can still process the data decades later
and that the tool to do so hasn't vanished by then.

Using base64 is attractive because one can be certain that the letters can
still be read by *some* method at any point in the future. Unfortunately, some
characters are hard to distinguish from each other, so using a much smaller
subset would probably make more sense, but it would also be much more wasteful.
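
For example, base32 (RFC 4648) uses only the uppercase letters and the digits
2 to 7, avoiding the 0/O and 1/l ambiguities, at the cost of a 1.6x instead of
a 1.33x size expansion:

```python
import base64

data = bytes(range(256))        # placeholder binary data
b64 = base64.b64encode(data)    # 4 output chars per 3 input bytes (1.33x)
b32 = base64.b32encode(data)    # 8 output chars per 5 input bytes (1.6x)
print(len(b64), len(b32))       # 344 vs. 416 characters
```
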
Then there is the problem of error correction. The PAR2 format and the
pyFileFixity tool exist, but both require knowledge of the exact algorithm to
make sense of the stored codes. There are qrcodes, but they are limited in the
amount of data they can store, and it seems as if most tools accept only text
(no arbitrary bytes, even though the QR format itself has a byte mode), so the
data has to be encoded again before handing it to the qrcode generator.
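
To illustrate what such error correction buys, here is a sketch using the
reedsolo Python library (a pure-Python Reed-Solomon implementation); the 32
parity bytes are an arbitrary choice for the example:

```python
from reedsolo import RSCodec

rsc = RSCodec(32)  # 32 parity bytes: corrects up to 16 corrupted bytes per chunk
message = b"secret key material"
protected = rsc.encode(message)

# Simulate transcription errors by corrupting a few bytes.
damaged = bytearray(protected)
damaged[0] ^= 0xFF
damaged[7] ^= 0xFF

decoded, _, _ = rsc.decode(bytes(damaged))  # reedsolo 1.x returns a 3-tuple
assert bytes(decoded) == message            # the original survives the errors
```
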
Some helpful resources:
- http://blog.liw.fi/posts/qr-backup/
- http://www.ollydbg.de/Paperbak/index.html
- https://github.com/lrq3000/pyFileFixity

Lessons learned
---------------
- A backup is useless if you never verified that you can read it back in.
- Do not encode information as base64 on paper. Some characters are just too similar.
- Using a monospaced font helps a lot.
- OpenCV character recognition using CvKNearest performs exceptionally well, given a good training set.
- Always store a checksum of the data you print.
- Use error correction methods (Reed-Solomon).
- There is still no good way to store arbitrary binary data on paper in a reliable and future-proof manner.