feature request: compress image losslessly within pdf #199
e.g.
There are many online PDF shrink tools that can shrink my PDF further, losslessly. But I can't find any offline tool that can do this. So this is a feature request for img2pdf. Or does anyone know an offline tool that can do this?
Can you link to some of them? I'm interested in how they achieve that. Are they just re-compressing the jpeg stream with gzip? Usually that has barely any effect on the file size.
Are you able to share your example pdf or jpeg?
I think this should rather be a feature request for one of the pdf-creation libraries supported by img2pdf. For example, pikepdf has in its save() method:
Or:
But even those do not re-compress jpeg -- and why should they? Maybe you should approach the pikepdf developers with your plea and ask if they would be willing to add a flag that allows re-compressing jpeg streams?
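For illustration, a minimal sketch of what re-compressing the existing Flate streams with pikepdf could look like, assuming its documented `recompress_flate` save() option (this does not touch jpeg streams; file names are placeholders):

```python
import pikepdf

# recompress_flate asks the underlying qpdf library to run existing
# Flate (zlib) streams through compression again; DCTDecode (JPEG)
# streams are left untouched.
with pikepdf.open("input.pdf") as pdf:
    pdf.save("recompressed.pdf", recompress_flate=True)
```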
I am unable to find one that also re-compresses jpeg streams. I think this makes sense because usually you can only shave 1% or 2% off the size of a jpeg file by sending it through gzip, so why spend cpu time on this?
For example there is this:
Or this:
I'm sorry, yesterday I used one of the lossless web tools and it seemed to work. But they are not telling the truth. I only checked the size, not the content. They reduced the resolution, which is no longer lossless. Now I have tried a couple of those lossless online tools and they are all cheaters.
I can mail it to you, but not publicly. Or next week I can produce another example to share.
Thanks for the hint, I will have a look at pikepdf.
I already checked qpdf before. It reduces the size by just 1 kB.

Based on your hint, I also installed mupdf-tools. But again, only a 1 kB reduction.

Sure, most jpeg files cannot be compressed much by gzip. But these behave differently. I'm able to losslessly reduce the size to ~600 kB using the 'jpegoptim' tool, but not to 488 kB like gzip or the source pdf.
Keeping up this promise is not easy. This is why I will always refuse to even put the option of being lossy into img2pdf. If you want lossy conversion, there are lots of other tools available (see README.md). Being lossless is img2pdf's reason for existence.
I don't think this is necessary anymore since it turns out that the tools cheated and actually changed your jpeg, no? If you do find a jpeg that compresses well with gzip, I'd be very interested in seeing that.
Right, but that also does a few other tricks. :)
I have already sent you the PDF file of my examples by email. This JPEG file can be compressed as shown.
But now I have actually found a tool that compresses the JPEG file when putting it back into a PDF. See this example:

You can see that ocrmypdf can "deflate" the jpeg file. However, there is a 2 byte difference in the JPEG files. In terms of quality, there is no noticeable difference at all. Perhaps the deflating method from ocrmypdf could also be used for img2pdf? It's also Python: https://github.com/ocrmypdf/OCRmyPDF/tree/main/src/ocrmypdf
The irony: ocrmypdf depends on img2pdf for its functionality ;)
How does ocrmypdf do it? Like this:

So after DCTDecode (that's the jpeg encoding) it sends the data through gzip. I was, though, unable to find a single jpeg where this had any great effect.
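For instance, a minimal sketch of that measurement in plain Python, assuming some example JPEG on disk (the file name is a placeholder):

```python
import zlib

with open("example.jpg", "rb") as f:  # placeholder file
    jpeg = f.read()

flated = zlib.compress(jpeg)
print(f"DCT stream: {len(jpeg)} bytes, after zlib: {len(flated)} bytes")
# In the PDF, such a doubly-encoded image stream is declared with
# /Filter [/FlateDecode /DCTDecode]: the viewer inflates first and
# then decodes the JPEG.
```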
I fail to find that mail in my inbox. What subject or message id was that?
From: Dominik d0m1n1k@geekmail.de
To: "josch" gitlab@mister-muffin.de
Subject: Re: [josch/img2pdf] feature request: compress image losslessly within pdf (#199)
Date: Sun, 30 Jun 2024 08:24:15 +0200
But... who is gitlab@mister-muffin.de?
josch?
Sorry, what is your mailbox address?
The gitlab@mister-muffin.de address is the one that you can use to talk to this instance of Gitea. It used to be a GitLab instance, hence the name. But my personal email is a different one. You can get my contact details either from `author_email` in `setup.py` or from `src/img2pdf.py`.
It's either josch@debian.org or josch@mister-muffin.de
Thank you, I have now found the solution to this problem. The jpeg format apparently compresses badly for large areas of equal color. Steps to reproduce:
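Roughly, and assuming Pillow is installed, something like this shows the effect (the flat white image is only an illustrative stand-in, not my original example):

```python
from io import BytesIO
import zlib

from PIL import Image

# A JPEG that is mostly one flat color compresses poorly with DCT alone.
img = Image.new("RGB", (2000, 2000), color=(255, 255, 255))
buf = BytesIO()
img.save(buf, format="JPEG", quality=90)
jpeg = buf.getvalue()

flated = zlib.compress(jpeg)
print(f"JPEG: {len(jpeg)} bytes, after zlib: {len(flated)} bytes")
# The DCT output for flat areas is highly repetitive, so zlib removes
# a noticeable fraction here, unlike for typical photographs.
```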
I now wonder whether img2pdf should add gzip compression to jpegs by default or whether that is just a waste of cpu cycles...
To save cpu cycles by default, an optional flag for img2pdf like `--gzip` or similar would be useful.
I feared you would suggest that. But would yet another command line option be a good interface? I'd argue that `--gzip` is too general an option, as that could mean anything and maybe is needed for something else in the future. So it would be named `--jpeg-gzip` or similar and then be applied to all images that use DCTDecode. But then how many people would use that, versus making the man page and documentation longer and harder to read?

I am also the maintainer of this software: https://manpages.debian.org/bookworm/sbuild/sbuild.1.en.html If you have a glance at the SYNOPSIS section you might get an idea why I dread adding tons of command line flags to my software. I think it was a mistake to do so for sbuild, but once a flag exists it can never again be removed because people will start relying on it.
Or in other words: I would rather make gzip compression the default than add yet another command line option.
Sorry, I'm not a developer. You have a lot more experience and I'm sure you know what's best. I've been using Debian for over 20 years (since woody - before that slackware) and I generally only look in man pages when I need something more specific. Then I'm glad that a lot of things can still be set optionally without having to use the source!
Since systemd, everything has become so complicated for me that I can't understand it anymore. But I'm not that young anymore! For a while I continued to use Devuan with sysvinit until I couldn't hold out any longer.
In that respect, the good old KISS principle is gone anyway, in my opinion.
I think img2pdf is used less for time-critical problems. In that respect, it's certainly okay if it requires a bit more CPU load on modern hardware. On the other hand, data storage is no longer as limited as it used to be, so a few extra kilobytes wouldn't be the end of the world. Across several thousand pages of documents, though, this can add up. And documents often have large areas of equal color.
Fewer and fewer documents are black and white and can simply be saved as monochrome jbig2. BTW: Is there a plan to be able to repack jbig2 files extracted with pdfimages using img2pdf? Or is that already possible somehow? Oh I see, that is not possible: #112
No problem, I can re-encode jbig2 pdfs and merge them with pdftk, as I always do.
Thank you for your development!
Just an unrefined thought in passing, in this interesting discussion...
Why not trigger gzip compression (if it doesn't already exist) based on a "sufficiently simple" (== low CPU) statistical criterion for the image?
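One hypothetical criterion (not something img2pdf does; the sample size and threshold below are arbitrary illustrative values) would be a byte-entropy estimate over a prefix of the stream:

```python
import math
from collections import Counter

def probably_compressible(data: bytes, sample_size: int = 65536,
                          threshold: float = 7.5) -> bool:
    """Cheap heuristic: estimate the byte entropy of a sample.

    Streams close to 8 bits/byte (typical photographic JPEG data) are
    unlikely to shrink under gzip; repetitive streams score much lower.
    """
    sample = data[:sample_size] or b"\0"
    counts = Counter(sample)
    total = len(sample)
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return entropy < threshold
```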
You are not alone. In the #debian-devel IRC channel it happens quite a few times that nobody knows the answer to very simple questions because the software stack has changed quite a bit compared to 20 years ago and the current stack is just much more complicated as the number of lines of code grows...
I'm tempted to just turn on gzip compression for jpegs by default and wait until/if somebody complains and then decide again. :)
It may become possible soon! !184
gzip compression is very fast. My laptop runs an ARM Cortex A53. To give you an idea on how slow it is: opening a page on youtube, it takes 30 seconds before the video finally starts playing. Even on that slow of a platform, I barely notice gzip compression of a few MB of data. That being said: computing whether gzip would make sense would likely take longer than applying gzip by default. :)
That's exactly what the table says. PNG files (if they are interlaced or if they contain transparency) are zlib compressed image data processed by the Paeth filter. TIFF images are either CCITT Group 4 compressed if they are monochrome or they are gzip compressed if they are just raw pixel data.
Everything else, like JPEGs, PNGs without interlacing or transparency, or TIFF files that are already CCITT Group 4 compressed, gets copied in without modification.
No. It uses the default used by `zlib.compress()`.

Why? Even if we want to compress all object streams by default (currently this is the choice of the PDF library), PDF 1.3 allows for this as well.
JPEG does not use gzip to compress image data. JPEG uses DCT encoding to do lossy compression, and that fares badly in my tests. If you then apply gzip compression afterwards, you get a huge size reduction because the DCT compression compressed badly.
If I do that, will you handle the bug reports of those people who started using img2pdf because of how fast it used to be? You know that people use img2pdf to convert whole directories of images?
I have tried converting the same non-transparent PNG image, which was saved as interlaced and not interlaced, and in both cases it was zlib compressed.
Have you done any tests to see how much slower compression gets when the compression level is increased from 6 to 9? Maybe the impact is not that huge and it is worth increasing the compression level?
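A quick, rough way to measure that (the input file is a placeholder for some uncompressed pixel data):

```python
import time
import zlib

with open("raw_pixels.bin", "rb") as f:  # placeholder: uncompressed image data
    data = f.read()

for level in (1, 6, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    print(f"level {level}: {len(out)} bytes in {dt:.3f} s")
```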
I have read that "what object Streams allow you to do is to put lots of PDF objects together inside a single binary stream. The binary stream still has a text header, telling the PDF parser how to find and extract the PDF objects, but all the PDF objects themselves can be compressed. This makes the PDF smaller, potentially more secure and possibly faster to load." Also that "Version 1.5 of the PDF format introduces a new type of stream, an object stream (ObjStm). It is a collection of many PDF objects together inside a single binary stream. The purpose of this type of object is to allow the compression of PDF objects not of the stream type. This process considerably reduces the size of PDF files."
So, I was wondering why img2pdf does not do that - why it does not put all data into a single stream and compress it?
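For comparison, a hedged sketch of how an existing PDF could be rewritten with object streams using pikepdf's documented `object_stream_mode` option (this is not something img2pdf does itself; file names are placeholders):

```python
import pikepdf

# Group non-stream objects into compressed object streams (PDF 1.5+).
with pikepdf.open("input.pdf") as pdf:
    pdf.save("packed.pdf", object_stream_mode=pikepdf.ObjectStreamMode.generate)
```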
I don't understand your argument here. You suggested turning on gzip compression for jpegs by default, and to do that because gzip compression is very fast.

My suggestion was just to add one extra step: after compression, compare the compressed and uncompressed image sizes. That extra step most definitely will not have any significant impact on the whole conversion process.
Of course, that's how the PNG format works.
If I'd change the compression level, I'd rather go the other way and decrease it. Storage is cheap. CPU time is not.
Doing this is the job of the pdf engine. If you want img2pdf to do it, you should for example contact the pikepdf developers and suggest that they add this feature.
But why? Why add even further complexity for the very unlikely situation where applying gzip actually makes the compressed data bigger than the original? Maybe I'm wrong and it's actually not that bad. Can you prepare an MR with this feature so that I can see what your proposal would look like?
I still do not understand this part. I expected to see one image being compressed and the other just being directly embedded.
I guess it is a matter of taste. I would do the opposite given how rapidly the CPU performance is improving.
Why one extra IF statement increases complexity in any significant way is beyond my understanding. I do not code in Python, but here is a proof of concept: https://dotnetfiddle.net/hoODop. jpeg2 is made of random bytes, hence compression is not effective, while jpeg1's size could be reduced by 15%, so compression could be used.
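Translated into Python terms, the extra check could look roughly like this (the helper name is invented for illustration):

```python
import zlib

def maybe_flate(jpeg_bytes: bytes):
    """Apply Flate on top of the DCT stream only if it actually helps.

    Returns (data, extra_filter); extra_filter is None when zlib would
    not make the stream smaller, so the original bytes are kept.
    """
    flated = zlib.compress(jpeg_bytes)
    if len(flated) < len(jpeg_bytes):
        return flated, "/FlateDecode"
    return jpeg_bytes, None
```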
The PDF format allows images to be encoded using the Paeth filter, which is also used by PNG. So if img2pdf receives a PNG image that is not interlaced and has no transparency, it can directly embed the Paeth-encoded data without computing it from scratch. That encoding is then compressed with gzip.
It is far from being as simple as your code example makes it look. The function in img2pdf that does most of the magic and contains all the special-casing depending on what kind of image one throws at it is `read_images`, which is 490 lines long. The output of that function has a certain type which then has to be communicated to its caller, the `convert` function, which is itself 190 lines long and which sticks the result of `read_images` into `pdf.add_imagepage`, a function with 23 arguments and 268 lines long. If you think this is easy, please show me a patch, then we can talk about it.

Please add the part where you compress all jpegs, then I will show you how I imagine doing that one extra check.