Tesseract-ocr seems to be able to encapsulate 1 bit/indexed/gray PNGs to PDFs without increasing size #41
By Ren Young on 2018-03-14T17:06:00.145Z
I've a minor obsession with creating small PDFs from small PNGs which led me to img2pdf.
But I was disappointed to find its results were the same as other tools I've tried (ImageMagick, poppler, ghostscript).
Then I read this and was surprised:
That surprised me as I've found the PDFs created by tesseract-ocr seem to have encapsulated 1 bit/indexed/gray PNGs without increasing the size (all non-interlaced and no alpha transparency PNGs.)
'pdfimages' tool confirms the images are 1 bit/indexed/gray.
The (small) downside is the PDFs have a hidden OCR text layer.
But the tesseract project seems to be doing what you say can't be done or am I missing something?
I think this might be the bit of code tesseract uses to make PDFs (but I'm not a developer.)
f3d7ee868b/src/pdfio1.c
By Ren Young on 2018-03-14T19:33:09.494Z
By josch on 2018-03-14T20:07:01.873Z
Thanks for going through the trouble of uploading these files!
I now know what's happening. The PDF spec has an entry for flate encoded data streams called DecodeParms that allows specifying a predictor. See section 7.4.4.4 in ISO 32000-1:2008. With these predictors it is possible to achieve PNG-like compression using the flate encoding even in PDF!
Thanks for curing my ignorance! So it is indeed possible to also compress raster graphics down to what PNG is able to achieve while still being lossless!
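To illustrate what such a predictor does, here is a minimal Python sketch of the PNG "Sub" filter, one of the PNG predictors that PDF selects with Predictor values 10 to 15. It is not taken from img2pdf or tesseract; the function names and the synthetic test image are made up for illustration. Each byte is replaced by its difference to the byte one pixel to the left, which tends to give deflate more repetitive input on smooth images.

```python
import zlib


def png_sub_filter(raw: bytes, width: int, bpp: int) -> bytes:
    """Apply PNG filter type 1 (Sub) to every row of unfiltered pixel data.

    bpp is the number of bytes per pixel; each row is width * bpp bytes.
    """
    stride = width * bpp
    out = bytearray()
    for start in range(0, len(raw), stride):
        row = raw[start:start + stride]
        out.append(1)  # per-row filter-type byte: 1 means Sub
        for i, value in enumerate(row):
            left = row[i - bpp] if i >= bpp else 0
            out.append((value - left) & 0xFF)
    return bytes(out)


def png_sub_unfilter(data: bytes, width: int, bpp: int) -> bytes:
    """Reverse png_sub_filter, as a PDF viewer would after inflating."""
    stride = width * bpp
    out = bytearray()
    for start in range(0, len(data), stride + 1):
        if data[start] != 1:
            raise ValueError("this sketch only handles the Sub filter")
        row = bytearray(data[start + 1:start + 1 + stride])
        for i in range(bpp, stride):
            # row[i - bpp] has already been reconstructed at this point
            row[i] = (row[i] + row[i - bpp]) & 0xFF
        out += row
    return bytes(out)


# Hypothetical 64x64 8-bit grayscale gradient, used only as test input.
width, height = 64, 64
raw = bytes((3 * x + 5 * y) % 256 for y in range(height) for x in range(width))
filtered = png_sub_filter(raw, width, 1)

print("deflate without predictor:", len(zlib.compress(raw)))
print("deflate with Sub predictor:", len(zlib.compress(filtered)))
```

How much the predictor helps depends on the image; real PNG encoders pick a filter type per row rather than using Sub everywhere.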
By Ren Young on 2018-03-14T20:18:26.362Z
No, thank you and developers like you for making great free open source software.
And thanks for confirming my little hack.
I was beginning to doubt it myself :)
By josch on 2018-03-14T20:23:38.898Z
Don't celebrate yet. I seem to be unable to find an existing implementation that encodes a raster image this way. I am only able to find decoders, for example in pdfrw: https://github.com/pmaupin/pdfrw/blob/HEAD/pdfrw/uncompress.py. I also suspect that any encoder I can come up with might make img2pdf quite slow because it will all be in Python... 😟
By josch on 2018-03-14T20:41:54.558Z
Ooooor..... I just do what tesseract does and use libleptonica.... I'd just have to hook into the shared library with Python somehow for example using ctypes which is what pyleptonica already does... hrm...
By josch on 2018-03-14T22:14:35.379Z
Turns out that PDF is able to directly embed the IDAT chunk and interpret it correctly if the DecodeParms are set correctly. I have a local version that is able to embed RGB PNG images into a PDF which is just as fast as before (because no data is transformed) but there is no increase in file size (again, because the PNG data is pasted as is, just minus the header and other unneeded chunks). Now I have to add support for other PNG types.
@monobot can I just take the images you pasted into this bug report and add them to the img2pdf testsuite? I assume that because of their simplicity there is no special license attached?
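As a rough sketch of that idea (not the actual img2pdf code), the following Python concatenates the IDAT chunks of a non-interlaced gray or RGB PNG and builds a matching image dictionary. The helper names `iter_png_chunks`, `png_to_pdf_image` and `_chunk` are hypothetical, and palette images, alpha and interlacing are deliberately left out.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_png_chunks(png: bytes):
    """Yield (type, body) for every chunk in a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        length, ctype = struct.unpack(">I4s", png[pos:pos + 8])
        yield ctype, png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + body + 4 CRC


def png_to_pdf_image(png: bytes):
    """Return (image dictionary text, stream bytes) without re-encoding."""
    header = None
    idat = bytearray()
    for ctype, body in iter_png_chunks(png):
        if ctype == b"IHDR":
            header = struct.unpack(">IIBBBBB", body)
        elif ctype == b"IDAT":
            idat += body  # all IDAT chunks together form one zlib stream
    if header is None:
        raise ValueError("missing IHDR chunk")
    width, height, depth, color_type, _, _, interlace = header
    if interlace != 0:
        raise ValueError("interlaced PNGs are not handled in this sketch")
    if color_type == 0:
        colors, colorspace = 1, "/DeviceGray"
    elif color_type == 2:
        colors, colorspace = 3, "/DeviceRGB"
    else:
        raise ValueError("only gray and RGB are handled in this sketch")
    dictionary = (
        f"<< /Type /XObject /Subtype /Image /Width {width} /Height {height} "
        f"/ColorSpace {colorspace} /BitsPerComponent {depth} "
        f"/Filter /FlateDecode /DecodeParms << /Predictor 15 /Colors {colors} "
        f"/BitsPerComponent {depth} /Columns {width} >> /Length {len(idat)} >>"
    )
    return dictionary, bytes(idat)


def _chunk(ctype: bytes, body: bytes) -> bytes:
    """Build one PNG chunk (used only to create the demo input below)."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


# A made-up 2x2 RGB image; each row starts with filter-type byte 0 (None).
raw = b"\x00" + bytes([255, 0, 0, 0, 255, 0]) + b"\x00" + bytes([0, 0, 255, 255, 255, 255])
compressed = zlib.compress(raw)
demo_png = (
    PNG_SIGNATURE
    + _chunk(b"IHDR", struct.pack(">IIBBBBB", 2, 2, 8, 2, 0, 0, 0))
    + _chunk(b"IDAT", compressed[:5])  # split to show chunk concatenation
    + _chunk(b"IDAT", compressed[5:])
    + _chunk(b"IEND", b"")
)
pdf_dict, pdf_stream = png_to_pdf_image(demo_png)
print(pdf_dict)
```

The stream bytes are copied verbatim, which is why no time is spent on re-compression; only the dictionary has to describe the row layout so that a viewer can undo the per-row filters.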
By josch on 2018-03-14T23:40:46.924Z
There is a problem with palette PNG images: https://github.com/pmaupin/pdfrw/issues/128
By Ren Young on 2018-03-15T10:18:41.324Z
That's just what I came here to suggest.
It's actually not my image. It's from here on Wikimedia Commons.
It's by User:Marc_Mongenet and it's licensed CC-BY-SA-2.5.
(This is probably not what you're talking about in pdfrw issue 128 above, but Evince displays the col8.pdf with no problem for me.)
By josch on 2018-03-15T10:35:42.318Z
Turns out that libleptonica is not even needed! PDF directly supports exactly the same filters that the PNG format uses. So what I can do is directly dump the PNG IDAT chunk into the PDF, and by adding the right metadata to the DecodeParms dictionary, PDF viewers will be able to make sense of it! This even means that PNG input is now as fast as JPEG input, because nothing needs to be re-encoded and the data is just copied as-is into the PDF. For formats other than PNG I could use libleptonica, but I went for the simpler method of using PIL to turn the image into a PNG and then extracting the IDAT chunk from the result. It's not pretty but it's fast and doesn't add any dependencies. I pushed my proof-of-concept to the master branch in commit 1d9a25dfd2 in case you want to have a look!
By Ren Young on 2018-03-15T17:36:08.301Z
Wow. That was quick! Great work 👍
By Ren Young on 2018-03-15T17:43:17.741Z
Does it work with transparent and/or interlaced PNGs? I don't think Leptonica does.
By josch on 2018-03-15T17:56:16.745Z
img2pdf removes the alpha channel of its input. I'm considering changing this behaviour because technically this means that img2pdf is not always lossless. Instead, I would simply forbid any input with an alpha channel. The reason is that I don't see what img2pdf should do with images that have transparency. If you can tell me a use case, I'm all ears!
I didn't come across interlaced PNGs yet. If you have some to test on, then we can see what to do about them.
By Ren Young on 2018-03-15T19:38:21.715Z
Perfect. That's what I'd do too.
I don't think I've ever come across an interlaced PNG in the wild.
I don't even know what it's for. I had to look it up.
The answers here seem to say it's not a feature you'd actually want to use. (And definitely not useful in PDFs.)
I think I'd just forbid them as well.
By Ren Young on 2018-03-15T19:49:56.674Z
Here is an interlaced example if you still want it.
![rgb-interlace](/josch/img2pdf/uploads/d0db6d33a9ad5979b2601456ec23210a/rgb-interlace.png)
Curiously, it's slightly smaller than the original. I thought it should be bigger.
By Ren Young on 2018-03-15T19:57:13.171Z
Here are interlaced versions of the others too. All except the 1 bit b&w one are smaller.
The greyscale one is significantly smaller!? (Maybe interlace might be useful after all?)
interlaced.zip
By josch on 2018-03-24T19:00:35.073Z
The latest commits forbid images with alpha and interlaced PNGs. Since we now use PNG encoding for any non-JPEG input, I'm closing this bug. Thanks for your help!
By josch on 2018-03-24T19:00:35.244Z
Status changed to closed
By Ren Young on 2018-06-28T09:30:44.735Z
Mentioned in issue #45
By josch on 2018-07-18T11:41:34.137Z
@monobot Since I can very much understand your "minor obsession with creating small PDFs from small PNGs", I just wanted to let you know that I finally managed to fix the testsuite and make a release of img2pdf that includes the improvements that I added due to your bug report.
https://pypi.org/project/img2pdf/
Thanks!
By Ren Young on 2018-08-21T16:05:15.910Z
Nice work. Now that that minor obsession has been put in its place, it's time to find a new one... my work is never done 😃