You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
126 lines
5.8 KiB
Markdown
126 lines
5.8 KiB
Markdown
Introduction
|
|
------------
|
|
|
|
Back when I was young and innocent, I created my GPG key for Debian using
|
|
subkeys according to this article: https://wiki.debian.org/Subkeys
|
|
|
|
I then stored the master key pair on a usb drive and (just in case) printed a
|
|
base64 encoded gzipped tarball of the $GNUPGHOME containing the master key to
|
|
five pages of A4 paper. Back then I didn't find any better solution to store
|
|
binary data on paper so since I didn't expect I needed it ever anyways, I went
|
|
for the simplest solution.
|
|
|
|
Fast forward four years I wanted to sign the key of a friend for which I need
|
|
my master key. So I got the usb stick out of its closet and plugged it in...
|
|
The usb stick was bust. No reaction anymore. No idea what happened.
|
|
|
|
So now I was lost with five pages of base64 as the only remaining artifact of
|
|
my secret master key.
|
|
|
|
Key Recovery
|
|
------------
|
|
|
|
Manually typing five pages with 4332 characters each is no fun, so even though
|
|
it would probably take longer to do so (but would be more fun) I investigated
|
|
how to automatically do it.
|
|
|
|
Complete OCR solutions like tesseract already exist and while they are
|
|
performing incredibly well for real text, they do not perform well on my input
|
|
data because they expect the text to be real human language and because they
|
|
are trained for a wide variety of fonts, making them less precise for the
|
|
specific font I used. I could train tesseract with my specific font but then I
|
|
would learn more about how to use tesseract than how to teach a machine how to
|
|
read base64 text.
|
|
|
|
After some trial and error I found out that I apparently had printed the five
|
|
pages in question with "Liberation Mono". After lots of trial and error to find
|
|
the right approach, I settled on the following method:
|
|
|
|
1. find the minimum rectangle around the text
|
|
2. rotate the image so that the text is exactly evenly aligned
|
|
3. divide the text into the required raster of letters (gladly I used a monospaced font to print)
|
|
4. attempt to recognize each letter
|
|
5. present the result to the user, one letter at a time, comparing the scanned original and the guessed letter, allowing to make corrections
|
|
6. store the result as a text file
|
|
|
|
Improving character recognition
|
|
-------------------------------
|
|
|
|
I started with training my k-nearest neighbors model with blurred synthetic
|
|
images that I generated from SVGs that showed one base64 letter in "Liberation
|
|
Mono" each. It seems that there is no simple way to put letters into a bitmap
|
|
which preserves the same font baseline, thus I used SVG for the task.
|
|
|
|
With that training set, the k-nearest neighbors algorithm was able to correctly
|
|
detect 40 out of 43 characters. The detection rate dramatically increased once
|
|
I fed the actually scanned characters from the first page into the training set
|
|
for detecting characters on the second page. After having scanned all five
|
|
pages, the k-nearest algorithm made less than 5 errors per 10000 pages.
|
|
Specifically it had problems differentiating 0 from O.
|
|
|
|
Finding the needle in the haystack
|
|
----------------------------------
|
|
|
|
Even with such a high recognition rate, the project still fails unless every
|
|
last character has been read in correctly. Only one bit swapped will lead to a
|
|
useless secret key. The problem here was how to determine when I was actually
|
|
done with the translation. The OCR engine can make errors but so can I when I
|
|
check the result. In the worst case I'd never get any indication that there is
|
|
still something wrong.
|
|
|
|
I didn't end up storing any checksum of the data I printed, which was a big
|
|
mistake. On the other hand, I was lucky that what I stored was gzipped data and
|
|
the gzip format contains a CRC. Thus, using `gzip -dtv` I was able to check
|
|
whether the data I had decoded was correct or whether an error was still hiding
|
|
somewhere.
|
|
|
|
It would've helped a lot if I had stored my data in a way that would allow few
|
|
errors to be present and still be able to recover the original data.
|
|
|
|
Storing binary data on paper
|
|
----------------------------
|
|
|
|
Now I'm done with recovering my key and I have some software that does the job
|
|
nearly automatically. The underlying problem though is still not solved today,
|
|
four years later.
|
|
|
|
A bunch of software exists which promises to solve this problem but they are
|
|
each not very popular and if I trust some data to a medium that is gonna surive
|
|
decades then I want to be sure that I can still process the data decades later
|
|
and that the tool to do so hasn't vanished by then.
|
|
|
|
Using base64 is attractive because one can be certain that the letters can
|
|
still be read by *some* method at any point in the future. Unfortunately, some
|
|
characters are hard to distinguish from each other, so using a much smaller
|
|
subset would probably make more sense but also be much more wasteful.
|
|
|
|
There is the problem of error correction. There exists the PAR2 format and the
|
|
pyFileFixity tool but they both require knowledge of the exact algorithm to
|
|
make sense of the stored codes. There are qrcodes but they are limited in the
|
|
amount of data that they can store and it seems as if they are also limited to
|
|
store text only (no arbitrary bytes) and thus data has to be encoded again
|
|
before handing them to the qrcode generator.
|
|
|
|
Some helpful resources:
|
|
|
|
- http://blog.liw.fi/posts/qr-backup/
|
|
- http://www.ollydbg.de/Paperbak/index.html
|
|
- https://github.com/lrq3000/pyFileFixity
|
|
|
|
Lessons learned
|
|
---------------
|
|
|
|
- A backup for which you didn't make sure that you can read it back in is useless.
|
|
|
|
- Do not encode information as base64 on paper. Some characters are just too similar.
|
|
|
|
- Using a monospaced font helps a lot.
|
|
|
|
- OpenCV character recognition using CvKNearest performs exceptionally well, given a good training set
|
|
|
|
- Always store a checksum of the data you print
|
|
|
|
- Use error correction methods (reed-solomon)
|
|
|
|
- There still is no good way to store arbitrary binary data reliably and future-proof on paper
|