This file explains why and how we use Unicode in gocr. It is probably only
interesting if you intend to do something similar in a project of your own or
to develop gocr.

History
0.1 initial version

---
Why use Unicode? At this early stage of development gocr doesn't recognize
much more than the ASCII characters, but we hope that someday it will support
many different languages with different character sets, recognize mathematical
expressions, and so on. Even now we are trying to support other Latin-script
languages, that is, accented characters. Once we go beyond ASCII and use the
codes 0x80-0xFF, their meaning depends on the character set loaded on the
machine, so we had to solve that problem.

Bearing in mind what Andrew Tanenbaum once said, "The good thing about
standards is that there are so many to choose from", we decided not to invent
a new one but to stick to an existing one; Unicode is the best known, so we
chose it.

To my dismay, Unicode support, at this time, sucks. Contrary to what one would
expect, there are few libraries around to deal with it. The libraries I found,
though very good, did not provide the kind of support we needed in gocr: to
work internally with hundreds of different characters. They all focused on
handling external files and user interfaces (i18n, in short), something that
I'm sure is much more needed and used than what gocr needs.

That's why we wrote our own Unicode code. We implemented only what we needed,
in a way that is practical for the developer: composing characters, etc. Since
no one I know wants the output of a scanned and OCRed document in Unicode or
UTF-8 format (though I hope that one such format will eventually be used in
every OS and computer around and ASCII will go to a museum, and though gocr
can output in these formats too), we had to output in some friendlier format.
The initial choices are existing character maps, TeX, SGML and HTML, and who
knows what else later. Once we can recognize the text and keep its formatting,
these formats will be desired even more.

How to implement it (careful: developer's view)? Fortunately, there is partial
support for it now. The wchar_t type defined in <stddef.h> is standard, though
its width is not: sometimes 16 bits, sometimes 32, perhaps even 64 somewhere.
Do we need libc's string functions? If we do, they also exist for wchar_t
(declared in <wchar.h>). Some conversion functions were needed: ASCII ->
Unicode, and Unicode -> everything else.

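As a small illustration of the wchar_t points above, the following C sketch
reports the width of wchar_t on the current platform and compares two wide
strings with wcscmp(), the wide-character counterpart of strcmp(). The helper
names wchar_bits() and same_word() are made up for this example and are not
part of gocr.

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* wchar_t */
#include <wchar.h>   /* wcscmp */

/* Width of wchar_t in bits on this platform (16 and 32 are both common). */
static int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: nonzero if the two wide strings are identical.
 * wcscmp() behaves like strcmp(), but on wchar_t strings. */
static int same_word(const wchar_t *text, const wchar_t *word)
{
    return wcscmp(text, word) == 0;
}
```

Because the width differs between platforms, code that stores Unicode values
in wchar_t should not assume more than the width that wchar_bits() reports.
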
The ASCII -> Unicode conversion (done by the compose() function) is meant to
be called by the OCR engine when it recognizes a character. You can also use
the Unicode codes #defined in unicode.h directly, but compose() is simpler to
use. For ASCII codes we recommend using the character itself (there is no need
to write LATIN_CAPITAL_LETTER_A; use 'A').

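To make the idea of composing concrete, here is a minimal sketch of how a
base letter and an accent could be merged into one precomposed code point. It
is not gocr's compose(): the function name compose_sketch(), its signature,
and the macro names are assumptions chosen for this example, and only a
handful of pairs are covered. The code point values themselves come from the
Unicode standard.

```c
#include <stddef.h>  /* wchar_t */

/* Illustrative macro names; gocr's unicode.h may name these differently. */
#define COMBINING_GRAVE_ACCENT 0x0300
#define COMBINING_ACUTE_ACCENT 0x0301

/* Hypothetical sketch: merge an ASCII base letter and an accent into one
 * precomposed code point. Unknown pairs return the base unchanged. */
static wchar_t compose_sketch(wchar_t base, wchar_t modifier)
{
    switch (modifier) {
    case COMBINING_ACUTE_ACCENT:
        switch (base) {
        case 'A': return 0x00C1;  /* LATIN CAPITAL LETTER A WITH ACUTE */
        case 'E': return 0x00C9;  /* LATIN CAPITAL LETTER E WITH ACUTE */
        case 'a': return 0x00E1;  /* LATIN SMALL LETTER A WITH ACUTE */
        case 'e': return 0x00E9;  /* LATIN SMALL LETTER E WITH ACUTE */
        }
        break;
    case COMBINING_GRAVE_ACCENT:
        switch (base) {
        case 'A': return 0x00C0;  /* LATIN CAPITAL LETTER A WITH GRAVE */
        case 'E': return 0x00C8;  /* LATIN CAPITAL LETTER E WITH GRAVE */
        case 'a': return 0x00E0;  /* LATIN SMALL LETTER A WITH GRAVE */
        case 'e': return 0x00E8;  /* LATIN SMALL LETTER E WITH GRAVE */
        }
        break;
    }
    return base;  /* no accent, or a pair this sketch does not know */
}
```

The caller passes plain character constants for the base, for example
compose_sketch('e', COMBINING_ACUTE_ACCENT), which matches the advice above to
write 'A' rather than a long symbolic name for ASCII codes.
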
The Unicode -> everything else conversion (done by decode()) is sometimes a
bit more difficult, since previous symbols may interact with the current one.
For example, when converting to TeX, two consecutive math-mode characters each
open math mode separately, giving "\( \pi \) \( \iota \)" instead of
"\( \pi \iota \)". A wider conversion function, decode_text(), which deals
with the entire text at once, should possibly be provided; it would also
create headers, etc.

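The math-mode problem can be sketched as follows. This is only one possible
shape for a whole-text conversion, not gocr's decode() or decode_text(): the
names tex_math_name(), append() and decode_text_tex() are invented here, only
two Greek letters are known, non-ASCII characters without a mapping become
'?', and no headers are generated. The point is the in_math flag, which opens
math mode once per run of math symbols instead of once per symbol.

```c
#include <stddef.h>  /* wchar_t, size_t, NULL */
#include <string.h>  /* strlen, memcpy */

/* Hypothetical lookup: TeX math command for a code point, or NULL. */
static const char *tex_math_name(wchar_t c)
{
    switch (c) {
    case 0x03B9: return "\\iota";  /* GREEK SMALL LETTER IOTA */
    case 0x03C0: return "\\pi";    /* GREEK SMALL LETTER PI */
    default:     return NULL;
    }
}

/* Append src to the string in dst; returns 0, or -1 if it does not fit. */
static int append(char *dst, size_t dstsize, const char *src)
{
    size_t used = strlen(dst), need = strlen(src);

    if (need >= dstsize - used)
        return -1;
    memcpy(dst + used, src, need + 1);  /* copies the terminating NUL too */
    return 0;
}

/* Convert a whole wide string to TeX; returns 0, or -1 if out is too small. */
static int decode_text_tex(const wchar_t *text, char *out, size_t outsize)
{
    int in_math = 0;

    if (outsize == 0)
        return -1;
    out[0] = '\0';
    for (; *text != L'\0'; text++) {
        const char *name = tex_math_name(*text);

        if (name != NULL) {
            /* Open math mode only at the start of a run of math symbols. */
            if (append(out, outsize, in_math ? " " : "\\( ") != 0 ||
                append(out, outsize, name) != 0)
                return -1;
            in_math = 1;
        } else {
            char plain[2] = { '?', '\0' };

            if ((unsigned long)*text < 0x80)
                plain[0] = (char)*text;  /* ASCII is copied unchanged */
            /* A non-math character ends the current math run, if any. */
            if (in_math && append(out, outsize, " \\)") != 0)
                return -1;
            in_math = 0;
            if (append(out, outsize, plain) != 0)
                return -1;
        }
    }
    if (in_math && append(out, outsize, " \\)") != 0)
        return -1;
    return 0;
}
```

Given the two code points 0x03C0 and 0x03B9, this sketch writes
"\( \pi \iota \)" into the output buffer, which is the merged form described
above.
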