Johannes 'josch' Schauer
||6 years ago|
|README.md||6 years ago|
|base64.conf||6 years ago|
|create_training_set.sh||6 years ago|
|letter_template.svg||6 years ago|
|run.py||6 years ago|
Back when I was young and innocent, I created my GPG key for Debian using subkeys according to this article: https://wiki.debian.org/Subkeys
I then stored the master key pair on a usb drive and (just in case) printed a base64 encoded gzipped tarball of the $GNUPGHOME containing the master key to five pages of A4 paper. Back then I didn't find any better solution to store binary data on paper so since I didn't expect I needed it ever anyways, I went for the simplest solution.
Fast forward four years I wanted to sign the key of a friend for which I need my master key. So I got the usb stick out of its closet and plugged it in... The usb stick was bust. No reaction anymore. No idea what happened.
So now I was lost with five pages of base64 as the only remaining artifact of my secret master key.
Manually typing five pages with 4332 characters each is no fun, so even though it would probably take longer to do so (but would be more fun) I investigated how to automatically do it.
Complete OCR solutions like tesseract already exist and while they are performing incredibly well for real text, they do not perform well on my input data because they expect the text to be real human language and because they are trained for a wide variety of fonts, making them less precise for the specific font I used. I could train tesseract with my specific font but then I would learn more about how to use tesseract than how to teach a machine how to read base64 text.
After some trial and error I found out that I apparently had printed the five pages in question with "Liberation Mono". After lots of trial and error to find the right approach, I settled on the following method:
- find the minimum rectangle around the text
- rotate the image so that the text is exactly evenly aligned
- divide the text into the required raster of letters (gladly I used a monospaced font to print)
- attempt to recognize each letter
- present the result to the user, one letter at a time, comparing the scanned original and the guessed letter, allowing to make corrections
- store the result as a text file
Improving character recognition
I started with training my k-nearest neighbors model with blurred synthetic images that I generated from SVGs that showed one base64 letter in "Liberation Mono" each. It seems that there is no simple way to put letters into a bitmap which preserves the same font baseline, thus I used SVG for the task.
With that training set, the k-nearest neighbors algorithm was able to correctly detect 40 out of 43 characters. The detection rate dramatically increased once I fed the actually scanned characters from the first page into the training set for detecting characters on the second page. After having scanned all five pages, the k-nearest algorithm made less than 5 errors per 10000 pages. Specifically it had problems differentiating 0 from O.
Finding the needle in the haystack
Even with such a high recognition rate, the project still fails unless every last character has been read in correctly. Only one bit swapped will lead to a useless secret key. The problem here was how to determine when I was actually done with the translation. The OCR engine can make errors but so can I when I check the result. In the worst case I'd never get any indication that there is still something wrong.
I didn't end up storing any checksum of the data I printed, which was a big
mistake. On the other hand, I was lucky that what I stored was gzipped data and
the gzip format contains a CRC. Thus, using
gzip -dtv I was able to check
whether the data I had decoded was correct or whether an error was still hiding
It would've helped a lot if I had stored my data in a way that would allow few errors to be present and still be able to recover the original data.
Storing binary data on paper
Now I'm done with recovering my key and I have some software that does the job nearly automatically. The underlying problem though is still not solved today, four years later.
A bunch of software exists which promises to solve this problem but they are each not very popular and if I trust some data to a medium that is gonna surive decades then I want to be sure that I can still process the data decades later and that the tool to do so hasn't vanished by then.
Using base64 is attractive because one can be certain that the letters can still be read by some method at any point in the future. Unfortunately, some characters are hard to distinguish from each other, so using a much smaller subset would probably make more sense but also be much more wasteful.
There is the problem of error correction. There exists the PAR2 format and the pyFileFixity tool but they both require knowledge of the exact algorithm to make sense of the stored codes. There are qrcodes but they are limited in the amount of data that they can store and it seems as if they are also limited to store text only (no arbitrary bytes) and thus data has to be encoded again before handing them to the qrcode generator.
Some helpful resources:
A backup for which you didn't make sure that you can read it back in is useless.
Do not encode information as base64 on paper. Some characters are just too similar.
Using a monospaced font helps a lot.
OpenCV character recognition using CvKNearest performs exceptionally well, given a good training set
Always store a checksum of the data you print
Use error correction methods (reed-solomon)
There still is no good way to store arbitrary binary data reliably and future-proof on paper