Tesseract-ocr seems to be able to encapsulate 1 bit/indexed/gray PNGs to PDFs without increasing size #41

Closed
opened 3 years ago by josch · 0 comments
josch commented 3 years ago
Owner

By Ren Young on 2018-03-14T17:06:00.145Z

I've a minor obsession with creating small PDFs from small PNGs, which led me to img2pdf.
But I was disappointed to find its results were the same as other tools I've tried (ImageMagick, poppler, ghostscript).

Then I read this and was surprised:

> Other raster graphics formats are losslessly stored in a zip/flate encoding of
> their RGB representation. This might increase file size and does not store
> transparency. There is nothing that can be done about that until the PDF format
> allows embedding other image formats like PNG.

That surprised me as I've found the PDFs created by tesseract-ocr seem to have encapsulated 1 bit/indexed/gray PNGs without increasing the size (all non-interlaced and no alpha transparency PNGs.)
'pdfimages' tool confirms the images are 1 bit/indexed/gray.

The (small) downside is the PDFs have a hidden OCR text layer.
But the tesseract project seems to be doing what you say can't be done or am I missing something?

$ tesseract --version
tesseract 3.04.01
 leptonica-1.74.4
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.2.0

I think this might be the bit of code tesseract uses to make PDFs (but I'm not a developer.)
https://github.com/DanBloomberg/leptonica/blob/f3d7ee868b4864cdca7ea57349d49f3d2b4a63ec/src/pdfio1.c


By Ren Young on 2018-03-14T19:33:09.494Z


tesseract PDF sizes
-------------------
 20473 bw.png
 23546 bw.pdf
179645 col8.png
183139 col8.pdf
488477 grey.png
490795 grey.pdf
187013 rgb.png
189774 rgb.pdf

img2pdf PDF sizes
-----------------
   20473 bw.png
  122551 bw.png.pdf
  179645 col8.png
  447027 col8.png.pdf
  488477 grey.png
 3730001 grey.png.pdf
  187013 rgb.png
36234677 rgb.png.pdf
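
To make the gap in the two listings above easier to see, here is a small sketch that turns the byte counts into size ratios relative to the input PNG. The numbers are copied from the listings; the dictionary layout and variable names are only illustrative.

```python
# Byte counts copied from the two listings above:
# name -> (PNG size, tesseract PDF size, img2pdf PDF size)
sizes = {
    "bw": (20473, 23546, 122551),
    "col8": (179645, 183139, 447027),
    "grey": (488477, 490795, 3730001),
    "rgb": (187013, 189774, 36234677),
}

# Ratio of PDF size to PNG size for each tool.
ratios = {
    name: (tesseract_pdf / png, img2pdf_pdf / png)
    for name, (png, tesseract_pdf, img2pdf_pdf) in sizes.items()
}

for name, (tesseract_ratio, img2pdf_ratio) in ratios.items():
    print(f"{name:5} tesseract x{tesseract_ratio:.2f}   img2pdf x{img2pdf_ratio:.2f}")
```

The tesseract PDFs stay close to the size of the PNG, while the img2pdf PDFs from that version grow by a large factor, most of all for the RGB image.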

By josch on 2018-03-14T20:07:01.873Z


Thanks for going through the trouble of uploading these files!

I now know what's happening. The PDF spec has an entry for flate-encoded data streams called DecodeParms that lets you specify a predictor. See section 7.4.4.4 in ISO 32000-1:2008 (https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf). With these predictors it is possible to achieve PNG-like compression using the flate encoding even in PDF!

Thanks for curing my ignorance! So it is indeed possible to also compress raster graphics down to what PNG is able to achieve while still being lossless!
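
A rough illustration of why a predictor helps, using only Python's standard library. It applies the PNG "Sub" filter (each byte minus its left neighbour) to synthetic grayscale rows and compares the deflate output with and without it. The test data (a seeded random walk) and the helper names are assumptions made for this sketch; real images will behave differently, and PNG encoders normally choose a filter per row.

```python
import random
import zlib


def png_sub_filter(rows, bytes_per_pixel=1):
    """Apply PNG filter type 1 (Sub) to each row and prefix the filter-type byte."""
    out = bytearray()
    for row in rows:
        out.append(1)  # filter-type byte: 1 = Sub
        for i, value in enumerate(row):
            left = row[i - bytes_per_pixel] if i >= bytes_per_pixel else 0
            out.append((value - left) % 256)
    return bytes(out)


def make_smooth_rows(width=256, height=64, seed=0):
    """Synthetic 8-bit grayscale rows whose neighbouring pixels differ by at most 1."""
    rng = random.Random(seed)
    value = 128
    rows = []
    for _ in range(height):
        row = bytearray()
        for _ in range(width):
            value = (value + rng.choice((-1, 0, 1))) % 256
            row.append(value)
        rows.append(bytes(row))
    return rows


rows = make_smooth_rows()
raw_size = len(zlib.compress(b"".join(rows), 9))
filtered_size = len(zlib.compress(png_sub_filter(rows), 9))
print("deflate only:        ", raw_size)
print("Sub filter + deflate:", filtered_size)
```

In a PDF, a /Predictor value of 10 or more in DecodeParms tells the reader that every row starts with such a filter-type byte, which is the same layout PNG uses.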


By Ren Young on 2018-03-14T20:18:26.362Z


No, thank you and developers like you for making great free open source software.
And thanks for confirming my little hack.
I was beginning to doubt it myself :)


By josch on 2018-03-14T20:23:38.898Z


Don't celebrate yet. I seem to be unable to find an existing implementation that encodes a raster image this way. I am only able to find decoders, for example in pdfrw: https://github.com/pmaupin/pdfrw/blob/HEAD/pdfrw/uncompress.py, and I also suspect that any encoder I can come up with might make img2pdf quite slow because it will all be in Python... 😟


By josch on 2018-03-14T20:41:54.558Z


Ooooor..... I just do what tesseract does and use libleptonica.... I'd just have to hook into the shared library with Python somehow for example using ctypes which is what pyleptonica already does... hrm...


By josch on 2018-03-14T22:14:35.379Z


Turns out that PDF is able to directly embed the IDAT chunk and interpret it correctly if the DecodeParms are set correctly. I have a local version that is able to embed RGB PNG images into a PDF. It is just as fast as before (because no data is transformed), but there is no increase in file size (again, because the PNG data is pasted as is, minus the header and other unneeded chunks). Now I have to add support for other PNG types.

@monobot can I just take the images you pasted into this bug report and add them to the img2pdf testsuite? I assume that because of their simplicity there is no special license attached?
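
For readers wondering what "pasting the IDAT data as is" involves, here is a minimal sketch of pulling the image data out of a PNG with the standard library. The function names are hypothetical and this is not img2pdf's actual code. A PNG may split its image data over several IDAT chunks, so the sketch concatenates all of them into one zlib stream.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_chunks(png_bytes):
    """Yield (chunk_type, payload) for every chunk in a PNG byte string."""
    if png_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[pos:pos + 4])
        chunk_type = png_bytes[pos + 4:pos + 8]
        payload = png_bytes[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + payload + 4 CRC
        yield chunk_type, payload


def idat_stream(png_bytes):
    """Concatenate all IDAT payloads into a single zlib stream."""
    return b"".join(p for t, p in iter_chunks(png_bytes) if t == b"IDAT")


def make_chunk(chunk_type, payload):
    """Serialize one PNG chunk (used only to build the demo image below)."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)


def tiny_png():
    """Build a 2x1 8-bit RGB PNG whose image data is split over two IDAT chunks."""
    ihdr = struct.pack(">IIBBBBB", 2, 1, 8, 2, 0, 0, 0)
    scanline = b"\x00" + b"\xff\x00\x00" + b"\x00\x00\xff"  # filter 0, red, blue
    data = zlib.compress(scanline)
    half = len(data) // 2
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", data[:half]) + make_chunk(b"IDAT", data[half:])
            + make_chunk(b"IEND", b""))


stream = idat_stream(tiny_png())
print(len(stream), "bytes of zlib data")
print(zlib.decompress(stream))  # the filtered scanlines, filter byte first
```

The concatenated stream is what would become the body of the PDF image stream; nothing is decompressed or re-encoded along the way.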


By josch on 2018-03-14T23:40:46.924Z


There is a problem with palette PNG images: https://github.com/pmaupin/pdfrw/issues/128


By Ren Young on 2018-03-15T10:18:41.324Z


> Ooooor..... I just do what tesseract does and use libleptonica

That's just what I came here to suggest.

> @monobot can I just take the images you pasted into this bug report and add them to the img2pdf testsuite? I assume that because of their simplicity there is no special license attached?

It's actually not my image. It's from Wikimedia Commons (https://commons.wikimedia.org/wiki/File:16777216colors.png).
It's by User:Marc_Mongenet (https://commons.wikimedia.org/wiki/User:Marc_Mongenet) and it's licensed CC-BY-SA-2.5.

(This is probably not what you're talking about in pdfrw issue 128 above, but Evince displays col8.pdf with no problem for me.)


By josch on 2018-03-15T10:35:42.318Z


Turns out that libleptonica is not even needed! PDF directly supports exactly the same filters that the PNG format uses. So what I can do is directly dump the PNG IDAT chunk into the PDF, and by adding the right metadata to the DecodeParms dictionary, PDF viewers will be able to make sense of it! This even means that PNG input is now as fast as JPEG input, because nothing needs to be re-encoded: the data is just copied as is into the PDF. For formats other than PNG I could use libleptonica, but I went for the simpler method of using PIL to turn the image into a PNG and then extracting the IDAT chunk from the result. It's not pretty, but it's fast and doesn't add any dependencies. I pushed my proof of concept to the master branch in commit 1d9a25dfd2 in case you want to have a look!
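
As a sketch of the "right metadata" mentioned above, the following builds the text of a PDF image XObject dictionary for PNG-filtered flate data. The helper name and its parameters are hypothetical and this is not taken from the img2pdf source. It assumes non-interlaced grayscale or RGB input; palette images would additionally need an /Indexed colour space built from the PLTE chunk, which is left out here.

```python
def image_xobject_dict(width, height, color_type, bit_depth, stream_length):
    """Return the dictionary text for an image XObject holding PNG IDAT data.

    color_type follows the PNG IHDR field: 0 = grayscale, 2 = RGB.
    """
    if color_type == 0:
        colorspace, colors = "/DeviceGray", 1
    elif color_type == 2:
        colorspace, colors = "/DeviceRGB", 3
    else:
        raise ValueError("this sketch only handles grayscale (0) and RGB (2)")
    return (
        "<< /Type /XObject /Subtype /Image"
        f" /Width {width} /Height {height}"
        f" /ColorSpace {colorspace} /BitsPerComponent {bit_depth}"
        " /Filter /FlateDecode"
        # Predictor >= 10 selects the PNG predictors; the filter actually used
        # for each row is read from that row's leading filter-type byte.
        f" /DecodeParms << /Predictor 15 /Colors {colors}"
        f" /BitsPerComponent {bit_depth} /Columns {width} >>"
        f" /Length {stream_length} >>"
    )


print(image_xobject_dict(width=640, height=480, color_type=2,
                         bit_depth=8, stream_length=187013))
```

The values for /Colors, /BitsPerComponent and /Columns have to match the PNG's IHDR fields, otherwise the viewer cannot undo the row filters correctly.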


By Ren Young on 2018-03-15T17:36:08.301Z


Wow. That was quick! Great work 👍


By Ren Young on 2018-03-15T17:43:17.741Z


Does it work with transparent and/or interlaced PNGs? I don't think Leptonica does.


By josch on 2018-03-15T17:56:16.745Z


img2pdf removes the alpha channel of its input. I'm considering changing this behaviour because technically this means that img2pdf is not always lossless. Instead, I would just simply forbid any input with an alpha channel. The reason is that I don't see what img2pdf should do about images with transparency. If you can tell me a use case, I'm all ears!

I didn't come across interlaced PNGs yet. If you have some to test on, then we can see what to do about them.


By Ren Young on 2018-03-15T19:38:21.715Z


> Instead, I would just simply forbid any input with an alpha channel.

Perfect. That's what I'd do too.

I don't think I've ever come across an interlaced PNG in the wild.
I don't even know what it's for. I had to look it up.
The answers here (https://stackoverflow.com/questions/13449314/when-to-interlace-an-image) seem to say it's not a feature you'd actually want to use. (And definitely not useful in PDFs.)

I think I'd just forbid them as well.


By Ren Young on 2018-03-15T19:49:56.674Z


Here is an interlaced example if you still want it.
Curiously, it's slightly smaller than the original. I thought it should be bigger.
(attached image: rgb-interlace.png)


By Ren Young on 2018-03-15T19:57:13.171Z


Here are interlaced versions of the others too. All except the 1 bit b&w one are smaller.
The greyscale one is significantly smaller!? (Maybe interlace might be useful after all?)
interlaced.zip


By josch on 2018-03-24T19:00:35.073Z


The latest commits forbid images with alpha and interlaced PNGs. Since we now use PNG encoding for any non-JPEG input, I'm closing this bug. Thanks for your help!
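
A minimal sketch of how such a check could look, reading only the PNG's IHDR chunk with the standard library. The function names are hypothetical and this is not the code from the commits mentioned above. It also ignores transparency stored in a tRNS chunk, which a real check would probably have to consider as well.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def check_embeddable(png_bytes):
    """Raise ValueError for PNGs with an alpha channel or Adam7 interlacing.

    Returns (width, height, bit_depth, color_type) when the image passes.
    """
    if png_bytes[:8] != PNG_SIGNATURE or png_bytes[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    (width, height, bit_depth, color_type,
     _compression, _filter, interlace) = struct.unpack(">IIBBBBB", png_bytes[16:29])
    if color_type in (4, 6):  # grayscale+alpha, RGB+alpha
        raise ValueError("PNG has an alpha channel")
    if interlace != 0:  # 1 = Adam7
        raise ValueError("PNG is interlaced")
    return width, height, bit_depth, color_type


def fake_header(color_type, interlace):
    """Build just the signature and IHDR fields of a 4x4 8-bit PNG for the demo."""
    ihdr = struct.pack(">IIBBBBB", 4, 4, 8, color_type, 0, 0, interlace)
    return PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + b"IHDR" + ihdr


print(check_embeddable(fake_header(color_type=2, interlace=0)))
for color_type, interlace in ((6, 0), (2, 1)):
    try:
        check_embeddable(fake_header(color_type, interlace))
    except ValueError as exc:
        print("rejected:", exc)
```

Interlaced data cannot simply be copied into the PDF because Adam7 stores the image as seven reduced passes, while the PDF predictor expects full-width rows in top-to-bottom order.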


By josch on 2018-03-24T19:00:35.244Z


Status changed to closed


By Ren Young on 2018-06-28T09:30:44.735Z


Mentioned in issue #45


By josch on 2018-07-18T11:41:34.137Z


@monobot Since I can very much understand your "minor obsession with creating small PDFs from small PNGs", I just wanted to let you know that I finally managed to fix the testsuite and make a release of img2pdf that includes the improvements I added thanks to your bug report.
https://pypi.org/project/img2pdf/
Thanks!


By Ren Young on 2018-08-21T16:05:15.910Z


Nice work. Now that that minor obsession has been put in its place, it's time to find a new one... my work is never done 😃
