josch/img2pdf

Tesseract-ocr seems to be able to encapsulate 1 bit/indexed/gray PNGs to PDFs without increasing size #41

Closed

opened 2021-04-25 19:58:02 +00:00 by josch · 0 comments

By Ren Young on 2018-03-14T17:06:00.145Z

I've a minor obsession with creating small PDFs from small PNGs, which led me to img2pdf.
But I was disappointed to find its results were the same as other tools I've tried (ImageMagick, poppler, ghostscript).

Then I read this and was surprised:

Other raster graphics formats are losslessly stored in a zip/flate encoding of
their RGB representation. This might increase file size and does not store
transparency. There is nothing that can be done about that until the PDF format
allows embedding other image formats like PNG.

That surprised me as I've found the PDFs created by tesseract-ocr seem to have encapsulated 1 bit/indexed/gray PNGs without increasing the size (all non-interlaced and no alpha transparency PNGs.)
'pdfimages' tool confirms the images are 1 bit/indexed/gray.

The (small) downside is the PDFs have a hidden OCR text layer.
But the tesseract project seems to be doing what you say can't be done or am I missing something?

$ tesseract --version
tesseract 3.04.01
 leptonica-1.74.4
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.2.0

I think this might be the bit of code tesseract uses to make PDFs (but I'm not a developer.)
https://github.com/DanBloomberg/leptonica/blob/f3d7ee868b4864cdca7ea57349d49f3d2b4a63ec/src/pdfio1.c

By Ren Young on 2018-03-14T19:33:09.494Z

tesseract PDF sizes
-------------------
 20473 bw.png
 23546 bw.pdf
179645 col8.png
183139 col8.pdf
488477 grey.png
490795 grey.pdf
187013 rgb.png
189774 rgb.pdf

img2pdf PDF sizes
-----------------
   20473 bw.png
  122551 bw.png.pdf
  179645 col8.png
  447027 col8.png.pdf
  488477 grey.png
 3730001 grey.png.pdf
  187013 rgb.png
36234677 rgb.png.pdf

By josch on 2018-03-14T20:07:01.873Z

Thanks for going through the trouble of uploading these files!

I now know what's happening. The PDF spec has an entry for flate-encoded data streams called DecodeParms that lets you specify a predictor. See section 7.4.4.4 in ISO 32000-1:2008 (https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf). With these predictors it is possible to achieve PNG-like compression using the flate encoding even in PDF!

Thanks for curing my ignorance! So it is indeed possible to also compress raster graphics down to what PNG is able to achieve while still being lossless!
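To illustrate what a predictor does, here is a minimal sketch in Python (not code from img2pdf or leptonica). It applies the PNG "Up" filter, one of the row filters a PDF viewer undoes when DecodeParms names a PNG predictor, to a synthetic 8-bit grayscale image and compresses both variants with zlib. The test image, the function names and the choice of the Up filter are assumptions made for illustration only. On smooth image data the filtered stream often compresses better, but the sketch prints the two sizes rather than promising a result.

```python
import zlib

def png_up_filter(rows):
    """Prefix each row with filter type 2 (Up) and store row-to-row differences."""
    out = bytearray()
    prev = bytes(len(rows[0]))  # the row above the first row counts as all zeros
    for row in rows:
        out.append(2)  # PNG filter type byte for "Up"
        out.extend((cur - above) & 0xFF for cur, above in zip(row, prev))
        prev = row
    return bytes(out)

def png_up_unfilter(data, width):
    """Reverse png_up_filter; this is the work a decoder does per row."""
    stride = width + 1  # one filter type byte plus the row itself
    prev = bytes(width)
    rows = []
    for start in range(0, len(data), stride):
        if data[start] != 2:
            raise ValueError("this sketch only handles the Up filter")
        diffs = data[start + 1:start + stride]
        row = bytes((d + above) & 0xFF for d, above in zip(diffs, prev))
        rows.append(row)
        prev = row
    return rows

# Hypothetical test image: 8-bit grayscale, one byte per pixel.
width, height = 64, 64
rows = [bytes((3 * x + 5 * y) & 0xFF for x in range(width)) for y in range(height)]

filtered = png_up_filter(rows)
plain_size = len(zlib.compress(b"".join(rows), 9))
predicted_size = len(zlib.compress(filtered, 9))
print("flate only:", plain_size, "bytes; Up filter + flate:", predicted_size, "bytes")
```

A real PNG encoder picks a filter per row (None, Sub, Up, Average or Paeth), which is why the filter type byte is stored at the start of every row.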

By Ren Young on 2018-03-14T20:18:26.362Z

No, thank you and developers like you for making great free open source software.
And thanks for confirming my little hack.
I was beginning to doubt it myself :)

By josch on 2018-03-14T20:23:38.898Z

Don't celebrate yet. I seem to be unable to find an existing implementation that encodes a raster image this way. I am only able to find decoders, for example in pdfrw: https://github.com/pmaupin/pdfrw/blob/HEAD/pdfrw/uncompress.py and I also suspect that any encoder I can come up with might make img2pdf quite slow because it will all be in Python... 😟

By josch on 2018-03-14T20:41:54.558Z

Ooooor..... I just do what tesseract does and use libleptonica.... I'd just have to hook into the shared library with Python somehow for example using ctypes which is what pyleptonica already does... hrm...

By josch on 2018-03-14T22:14:35.379Z

Turns out that PDF is able to directly embed the IDAT chunk and interpret it correctly if the DecodeParms are set correctly. I have a local version that is able to embed RGB PNG images into a PDF. It is just as fast as before (because no data is transformed) but there is no increase in file size (again, because the PNG data is pasted as-is, just minus the header and other unneeded chunks). Now I have to add support for other PNG types.

@monobot can I just take the images you pasted into this bug report and add them to the img2pdf testsuite? I assume because of their simplicity there is no special license attached?

By josch on 2018-03-14T23:40:46.924Z

There is a problem with palette PNG images: https://github.com/pmaupin/pdfrw/issues/128

By Ren Young on 2018-03-15T10:18:41.324Z

Ooooor..... I just do what tesseract does and use libleptonica

That's just what I came here to suggest.

@monobot can I just take the images you pasted into this bug report and add them to the img2pdf testsuite? I assume because of their simplicity there is no special license attached?

It's actually not my image. It's from Wikimedia Commons: https://commons.wikimedia.org/wiki/File:16777216colors.png
It's by User:Marc_Mongenet (https://commons.wikimedia.org/wiki/User:Marc_Mongenet) and it's licensed CC-BY-SA-2.5.

(This is probably not what you're talking about in pdfrw issue 128 above, but Evince displays the col8.pdf with no problem for me.)

By josch on 2018-03-15T10:35:42.318Z

Turns out that libleptonica is not even needed! PDF directly supports exactly the same filters that the PNG format uses. So what I can do is dump the PNG IDAT chunk directly into the PDF, and by adding the right metadata to the DecodeParms dictionary, PDF viewers will be able to make sense of it! This even means that PNG input is now as fast as JPEG input, because nothing needs to be re-encoded; the data is just copied as-is into the PDF. For formats other than PNG I could use libleptonica, but I went for the simpler method of using PIL to turn the image into a PNG and then extracting the IDAT chunk from the result. It's not pretty but it's fast and doesn't add any dependencies. I pushed my proof-of-concept to the master branch in commit 1d9a25dfd2 in case you want to have a look!
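The approach described above can be sketched roughly as follows. This is not the code from commit 1d9a25dfd2. It is a minimal illustration that assumes a non-interlaced 8-bit grayscale or RGB PNG, and all function names are hypothetical. It collects the IDAT payloads and writes an image dictionary whose DecodeParms tell the viewer to undo PNG row filters (Predictor 15 is the value the PDF spec assigns to per-row "PNG optimum" prediction).

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def iter_chunks(png):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        length, ctype = struct.unpack(">I4s", png[pos:pos + 8])
        yield ctype, png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC

def png_to_pdf_image_object(png):
    """Return (dictionary text, stream bytes) for a gray or RGB PNG."""
    ihdr, idat = None, bytearray()
    for ctype, data in iter_chunks(png):
        if ctype == b"IHDR":
            ihdr = struct.unpack(">IIBBBBB", data)
        elif ctype == b"IDAT":
            idat += data  # several IDAT chunks form one zlib stream
    if ihdr is None:
        raise ValueError("missing IHDR chunk")
    width, height, depth, color_type, _comp, _filt, interlace = ihdr
    if color_type not in (0, 2) or interlace != 0:
        raise ValueError("sketch only covers non-interlaced gray/RGB")
    colors = 1 if color_type == 0 else 3
    colorspace = "/DeviceGray" if colors == 1 else "/DeviceRGB"
    dictionary = (
        "<< /Type /XObject /Subtype /Image "
        f"/Width {width} /Height {height} "
        f"/ColorSpace {colorspace} /BitsPerComponent {depth} "
        "/Filter /FlateDecode "
        f"/DecodeParms << /Predictor 15 /Colors {colors} "
        f"/BitsPerComponent {depth} /Columns {width} >> "
        f"/Length {len(idat)} >>"
    )
    return dictionary, bytes(idat)

def make_chunk(ctype, data):
    """Build one PNG chunk (used only to create the demo image below)."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)

# Demo: a 2x1 RGB image (one red pixel, one blue pixel), filter type 0.
scanlines = b"\x00" + bytes([255, 0, 0, 0, 0, 255])
demo_png = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 2, 1, 8, 2, 0, 0, 0))
    + make_chunk(b"IDAT", zlib.compress(scanlines))
    + make_chunk(b"IEND", b"")
)
pdf_dict, pdf_stream = png_to_pdf_image_object(demo_png)
print(pdf_dict)
```

The stream bytes are the PNG's compressed data unchanged, which is why no re-encoding is needed. Palette images would additionally need an /Indexed color space built from the PLTE chunk, which this sketch leaves out.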

By Ren Young on 2018-03-15T17:36:08.301Z

Wow. That was quick! Great work 👍

By Ren Young on 2018-03-15T17:43:17.741Z

Does it work with transparent and/or interlaced PNGs? I don't think Leptonica does.

By josch on 2018-03-15T17:56:16.745Z

img2pdf removes the alpha channel of its input. I'm considering changing this behaviour because technically this means that img2pdf is not always lossless. Instead, I would simply forbid any input with an alpha channel. The reason is that I don't see what img2pdf should do about images with transparency. If you can tell me a use case I would be all ears!

I haven't come across interlaced PNGs yet. If you have some to test on, then we can see what to do about them.

By Ren Young on 2018-03-15T19:38:21.715Z

Instead, I would simply forbid any input with an alpha channel.

Perfect. That's what I'd do too.

I don't think I've ever come across an interlaced PNG in the wild.
I don't even know what it's for. I had to look it up.
The answers at https://stackoverflow.com/questions/13449314/when-to-interlace-an-image seem to say it's not a feature you'd actually want to use. (And definitely not useful in PDFs.)

I think I'd just forbid them as well.

By Ren Young on 2018-03-15T19:49:56.674Z

Here is an interlaced example if you still want it.
Curiously, it's slightly smaller than the original. I thought it should be bigger.

By Ren Young on 2018-03-15T19:57:13.171Z

Here are interlaced versions of the others too. All except the 1 bit b&w one are smaller.
The greyscale one is significantly smaller!? (Maybe interlace might be useful after all?)
interlaced.zip

By josch on 2018-03-24T19:00:35.073Z

The latest commits forbid images with alpha and interlaced PNGs. Since we now use PNG encoding for any non-JPEG input, I'm closing this bug. Thanks for your help!
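For readers wondering how such a check can be made without decoding the image, here is a minimal sketch that reads only the IHDR chunk. It is an assumption about one way to do it, not the check img2pdf actually performs, and the function names are hypothetical. It also ignores transparency declared through a tRNS chunk.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_rejection_reason(png):
    """Return why a PNG would be refused, or None if it looks acceptable."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, ctype = struct.unpack(">I4s", png[8:16])
    if ctype != b"IHDR" or length != 13:
        raise ValueError("IHDR must be the first chunk")
    (_width, _height, _depth, color_type,
     _comp, _filt, interlace) = struct.unpack(">IIBBBBB", png[16:29])
    if color_type in (4, 6):  # gray+alpha and RGB+alpha
        return "alpha channel"
    if interlace != 0:  # 1 means Adam7 interlacing
        return "interlaced"
    return None

def demo_header(color_type, interlace):
    """Build just the signature and IHDR fields of a 1x1, 8-bit PNG."""
    return PNG_SIGNATURE + struct.pack(
        ">I4sIIBBBBB", 13, b"IHDR", 1, 1, 8, color_type, 0, 0, interlace)

print(png_rejection_reason(demo_header(2, 0)))  # plain RGB
print(png_rejection_reason(demo_header(6, 0)))  # RGB with alpha
print(png_rejection_reason(demo_header(0, 1)))  # interlaced grayscale
```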

By josch on 2018-03-24T19:00:35.244Z

Status changed to closed

By Ren Young on 2018-06-28T09:30:44.735Z

Mentioned in issue #45

By josch on 2018-07-18T11:41:34.137Z

@monobot Since I can very much understand your "minor obsession with creating small PDFs from small PNGs" I just wanted to let you know that I finally managed to fix the testsuite and make a release of img2pdf that includes the improvements that I added due to your bug report.
https://pypi.org/project/img2pdf/
Thanks!

By Ren Young on 2018-08-21T16:05:15.910Z

Nice work. Now that that minor obsession has been put in its place, it's time to find a new one... my work is never done 😃
