feature request: compress image losslessly within pdf #199

Open
opened 2024-06-29 18:44:43 +00:00 by d0m1n1k · 23 comments

e.g.

--> ls -lh a.pdf
-rw------- 1 user user 490K 29. Jun 08:23 a.pdf

--> pdfimages -list a.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    4672  6624  rgb     3   8  jpeg   no         6  0   400   400  488K 0.5%
   
--> pdfimages -j a.pdf a
 
--> ls -lh a-000.jpg
-rw-r--r-- 1 user user 898K 29. Jun 08:53 a-000.jpg

--> img2pdf a-000.jpg > b.pdf
--> ls -lh b.pdf    
-rw-r--r-- 1 user user 899K 29. Jun 08:55 b.pdf

--> gzip a-000.jpg
--> ls -lh a-000.jpg.gz 
-rw-r--r-- 1 user user 488K 29. Jun 20:31 a-000.jpg.gz

There are many online PDF shrink tools that can shrink my PDF losslessly again. But I can't find any offline tool that can do this. So this is a feature request for img2pdf. Or does anyone know an offline tool that can do this?

Owner

There are many online PDF shrink tools that can shrink my PDF losslessly again

Can you link to some of them? I'm interested in how they achieve that. Are they just re-compressing the jpeg stream with gzip? Usually that has barely any effect on the file size.

Are you able to share your example pdf or jpeg?

So this is a feature request to img2pdf

I think this should rather be a feature request for one of the pdf-creation libraries supported by img2pdf. For example, pikepdf has in its save() method:

recompress_flate=True, object_stream_mode=pikepdf.ObjectStreamMode.generate

Or:

pikepdf.settings.set_flate_compression_level(9)

But even those do not re-compress jpeg -- and why should they? Maybe you should approach the pikepdf developers with your plea and ask if they would be willing to add a flag that allows re-compressing jpeg streams?
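For reference, here is a minimal sketch (not img2pdf code, just an illustration) of applying those pikepdf save() options to the b.pdf from the example above. It recompresses flate streams and packs objects into object streams, but it does not touch the JPEG data itself:

```python
# Sketch only: post-process an existing PDF with pikepdf's recompression options.
import pikepdf

# Ask pikepdf to use the strongest flate setting for streams it (re)compresses.
pikepdf.settings.set_flate_compression_level(9)

with pikepdf.open("b.pdf") as pdf:
    pdf.save(
        "b-small.pdf",
        recompress_flate=True,
        object_stream_mode=pikepdf.ObjectStreamMode.generate,
    )
```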

Or does anyone know an offline tool that can do this?

I am unable to find one that also re-compresses jpeg streams. I think this makes sense because usually you can only shave 1% or 2% off a jpeg file's size by sending it through gzip, so why spend cpu time on this?

For example there is this:

qpdf input.pdf --recompress-flate --compression-level=9 --object-streams=generate output.pdf

Or this:

mutool clean -gggggz in.pdf out.pdf
Author

There are many online PDF shrink tools that can shrink my PDF losslessly again

Can you link to some of them? I'm interested in how they achieve that. Are they just re-compressing the jpeg stream with gzip? Usually that has barely any effect on the file size.

I'm sorry,
yesterday I used one of the lossless web tools and it seemed to work. But they are not telling the truth. I had only checked the size and not the content. They reduced the resolution, which is no longer lossless. I have now tried a couple of those 'lossless' online tools and they are all cheating.

Are you able to share your example pdf or jpeg?

I can mail it to you, but not publicly. Or next week I can produce another example to share.

But even those do not re-compress jpeg -- and why should they? Maybe you should approach the pikepdf developers with your plea and ask if they would be willing to add a flag that allows re-compressing jpeg streams?

Thanks for the hint, I will have a look at pikepdf.

Or does anyone know an offline tool that can do this?

I am unable to find one that also re-compresses jpeg streams. I think this makes sense because usually you can only shave 1% or 2% off a jpeg file's size by sending it through gzip, so why spend cpu time on this?

For example there is this:

qpdf input.pdf --recompress-flate --compression-level=9 --object-streams=generate output.pdf

Or this:

mutool clean -gggggz in.pdf out.pdf

I had already checked qpdf before. It reduces the file by just 1 kB.
Based on your hint, I also installed mupdf-tools. But again only a 1 kB reduction.
Sure, most jpeg files cannot be compressed much by gzip. But this one behaves differently. I'm able to losslessly reduce its size to ~600 kB using the 'jpegoptim' tool, but not to 488 kB like gzip or the source pdf.

Owner

yesterday I used one of the lossless web tools and it seemed to work. But they are not telling the truth. I had only checked the size and not the content. They reduced the resolution, which is no longer lossless. I have now tried a couple of those 'lossless' online tools and they are all cheating.

Keeping this promise is not easy. This is why I will always refuse to even add the option of being lossy to img2pdf. If you want lossy conversion, there are lots of other tools available (see README.md). Being lossless is img2pdf's reason for existence.

I can mail it to you, but not publicly. Or next week I can produce another example to share.

I don't think this is necessary anymore since it turns out that the tools cheated and actually changed your jpeg, no? If you do find a jpeg that compresses well with gzip, I'd be very interested in seeing that.

I'm able to losslessly reduce its size to ~600 kB using the 'jpegoptim' tool

Right, but that also does a few other tricks. :)

Author

I don't think this is necessary anymore since it turns out that the tools cheated and actually changed your jpeg, no? If you do find a jpeg that compresses well with gzip, I'd be very interested in seeing that.

I have already sent you the PDF file of my example by email. This JPEG file can be compressed as shown.

But now I have actually found a tool to compress the JPEG file back into a PDF. See this example:

--> ls -lh a.pdf
-rw------- 1 user user 490K 29. Jun 08:23 a.pdf

--> pdfimages -list a.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    4672  6624  rgb     3   8  jpeg   no         6  0   400   400  488K 0.5%
   
--> pdfimages -j a.pdf a

--> img2pdf a-000.jpg > b.pdf

--> pdfimages -list b.pdf         
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    4672  6624  rgb     3   8  jpeg   no         7  0   400   400  898K 1.0%

--> ocrmypdf b.pdf bocr.pdf 
[..]
Deflating JPEGs: 100%|█████████████████████████| 1/1 [00:00<00:00, 35.65image/s]
Optimize ratio: 1.81 savings: 44.7%
[..]


--> pdfimages -list bocr.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    4672  6624  rgb     3   8  jpeg   no        13  0   400   400  488K 0.5%

--> pdfimages -j bocr.pdf b 

--> ls -l a-000.jpg b-000.jpg 
-rw-r--r-- 1 user user 919106  6. Jul 06:54 a-000.jpg
-rw-r--r-- 1 user user 919108  6. Jul 06:57 b-000.jpg

--> ls -l a.pdf bocr.pdf b.pdf
-rw------- 1 user user 501191 26. Jun 08:23 a.pdf
-rw-r--r-- 1 user user 518514  6. Jul 06:55 bocr.pdf
-rw-r--r-- 1 user user 920573  6. Jul 06:54 b.pdf

You can see that ocrmypdf can "deflate" the jpeg file. There is, however, a 2-byte difference between the extracted JPEG files; in terms of quality there is no noticeable difference at all. Perhaps the deflating method from ocrmypdf can also be used for img2pdf? It's also Python: https://github.com/ocrmypdf/OCRmyPDF/tree/main/src/ocrmypdf

Owner

Perhaps the deflating method from ocrmypdf can also be used for img2pdf

The irony: ocrmypdf depends on img2pdf for its functionality ;)

How does ocrmypdf do it? Like this:

<< /BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [ /FlateDecode /DCTDecode ] /Height 480 /Subtype /Image /Width 640 /Length 56398 >>

So after DCTDecode (that's the jpeg encoding) it sends the stream through gzip. I was, however, unable to find a single jpeg where this had any great effect.
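For illustration, a rough sketch (not ocrmypdf's actual code) of adding that extra FlateDecode layer to an existing PDF with pikepdf. It assumes the image stream is stored with a plain /DCTDecode filter, as in the example above, and that pikepdf's Stream.write() accepts an array of filter names:

```python
# Sketch only: wrap already-encoded JPEG streams in an additional FlateDecode layer.
import zlib

import pikepdf
from pikepdf import Array, Name

with pikepdf.open("b.pdf") as pdf:
    for page in pdf.pages:
        for name in page.images.keys():
            image = page.images[name]
            if image.get("/Filter") != Name.DCTDecode:
                continue                      # only handle plain JPEG streams
            jpeg = image.read_raw_bytes()     # the JPEG bytes as stored in the PDF
            deflated = zlib.compress(jpeg, 9)
            if len(deflated) < len(jpeg):     # keep the extra layer only if it helps
                image.write(deflated, filter=Array([Name.FlateDecode, Name.DCTDecode]))
    pdf.save("b-deflated.pdf")
```

A PDF viewer then applies the filters in order, first FlateDecode and then DCTDecode, so the decoded pixels stay bit-for-bit identical.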

Owner

I have already sent you the PDF file of my examples by email.

I fail to find that mail in my inbox. What subject or message id was that?

Author

I fail to find that mail in my inbox. What subject or message id was that?

From: Dominik <d0m1n1k@geekmail.de>
To: "josch" <gitlab@mister-muffin.de>
Subject: Re: [josch/img2pdf] feature request: compress image losslessly within pdf (#199)
Date: Sun, 30 Jun 2024 08:24:15 +0200

Owner

But... who is gitlab@mister-muffin.de?

Author

josch?

From: "josch" <gitlab@mister-muffin.de>
To: d0m1n1k@geekmail.de
Subject: Re: [josch/img2pdf] feature request: compress image losslessly  within pdf (#199)
Date: Sun, 07 Jul 2024 06:54:14 +0000
X-Mailer: Gitea

But... who is gitlab@mister-muffin.de?

Sorry, what is your mailbox address?

Owner

The gitlab@mister-muffin.de address is the one that you can use to talk to this instance of gitea. It used to be a gitlab instance, hence the name. But my personal email is a different one. You can get my contact details from:

* https://pypi.org/project/img2pdf/ under "Author:"
* author_email in setup.py
* the git commit metadata
* the copyright line at the top of src/img2pdf.py

It's either josch@debian.org or josch@mister-muffin.de

Owner

Thank you, I have now found the explanation for this. The jpeg format apparently compresses badly for large areas of equal color. Steps to reproduce:

$ convert -size 8000x8000 xc:white white.jpg
$ gzip --keep white.jpg
$ ls -lha white.jpg*
-rw-r--r-- 1 josch josch 245K Jul  7 10:34 white.jpg
-rw-r--r-- 1 josch josch  411 Jul  7 10:34 white.jpg.gz

I now wonder whether img2pdf should add gzip compression to jpegs by default or whether that is just a waste of cpu cycles...

Author

To save cpu cycles by default, an optional flag for img2pdf like --gzip or similar would be useful.

Owner

I feared you would suggest that. But would yet another command line option be a good interface? I'd argue that --gzip is too general an option, as it could mean anything and might be needed for something else in the future. So it would have to be named --jpeg-gzip or similar and then be applied to all images that use DCTDecode. But then how many people would use that, versus making the man page and documentation longer and harder to read?

I am also the maintainer of this software: https://manpages.debian.org/bookworm/sbuild/sbuild.1.en.html If you glance at the SYNOPSIS section you might get an idea why I dread adding tons of command line flags to my software. I think it was a mistake to do so for sbuild, but once a flag exists it can never be removed again because people will start relying on it.

Or in other words: I'd rather make gzip compression the default than add yet another command line option.

Author

Sorry, I'm not a developer. You have a lot more experience and I'm sure you know what's best. I've been using Debian for over 20 years (since woody, before that Slackware) and I generally only look in man pages when I need something more specific. Then I'm glad that a lot of things can still be set optionally without having to use the source!

Since systemd, everything has become so complicated for me that I can't understand it anymore. But I'm not that young anymore! For a while I continued to use devuan with sysvinit until I couldn't resist anymore.
In that respect, the good old KISS principle is gone anyway, in my opinion.

I think img2pdf is rarely used for time-critical tasks. In that respect, it's certainly okay if it requires a bit more CPU load on modern hardware. On the other hand, data storage is no longer as limited as it used to be, so a few more kilobytes wouldn't be the end of the world either. Still, over several thousand pages of documents this can add up. And documents often have large areas of equal color.
Fewer and fewer documents are black and white and can simply be saved as monochrome jbig2. BTW: Is there a plan to be able to repack jbig2 files extracted with pdfimages using img2pdf? Or is that already possible somehow? Oh I see, that is not possible: #112
No problem, I can re-encode jbig2 pdfs and merge them with pdftk, as I always do.

Thank you for your development!


Just an unrefined thought in passing, in this interesting discussion...

Why not trigger gzip compression (if it doesn't already exist) based on a "sufficiently simple" (== low CPU) statistical criterion for the image?

Owner

Since systemd, everything has become so complicated for me that I can't understand it anymore

You are not alone. In the #debian-devel IRC channel it happens quite often that nobody knows the answer to very simple questions, because the software stack has changed quite a bit compared to 20 years ago and the current stack just keeps getting more complicated as the number of lines of code grows...

I think img2pdf is rarely used for time-critical tasks. In that respect, it's certainly okay if it requires a bit more CPU load on modern hardware. On the other hand, data storage is no longer as limited as it used to be, so a few more kilobytes wouldn't be the end of the world either. Still, over several thousand pages of documents this can add up. And documents often have large areas of equal color.

I'm tempted to just turn on gzip compression for jpegs by default and wait until/if somebody complains and then decide again. :)

Is there a plan to be able to repack jbig2 files extracted with pdfimages using img2pdf? Or is that already possible somehow? Oh I see, that is not possible: #112

It may become possible soon! !184

Why not trigger gzip compression (if it doesn't already exist) based on a "sufficiently simple" (== low CPU) statistical criterion for the image?

gzip compression is very fast. My laptop runs an ARM Cortex A53. To give you an idea of how slow it is: when opening a page on YouTube, it takes 30 seconds before the video finally starts playing. Even on a platform that slow, I barely notice gzip compression of a few MB of data. That being said: computing whether gzip would make sense would likely take longer than just applying gzip by default. :)
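To get a rough number for a specific file, one can time zlib on the extracted JPEG from the example above (a quick sketch; the file name is taken from that example and timings will vary by machine):

```python
# Sketch only: measure how long deflating one JPEG stream takes.
import time
import zlib

data = open("a-000.jpg", "rb").read()
start = time.perf_counter()
compressed = zlib.compress(data)
elapsed = time.perf_counter() - start
print(f"{len(data)} -> {len(compressed)} bytes in {elapsed * 1000:.1f} ms")
```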

  1. Do I understand correctly that img2pdf currently compresses almost all image types (png, tiff etc.) and only jpegs are left uncompressed? The README file suggests the opposite, but Adobe Acrobat identifies "filters: zlib/deflate" for embedded png and tiff images.
  2. Does img2pdf always use the highest compression level (9)?
  3. img2pdf produces PDF V1.3 files. If I understand correctly, with PDF V1.5 and above all object streams should be compressed by default. Maybe it would be useful to upgrade to V1.5?
  4. You have said "The jpeg format apparently compresses badly for large areas of equal color", but your data shows the opposite: the image size was reduced by 99.8% (from 245K to 411 bytes).
  5. You have suggested: "I'm tempted to just turn on gzip compression for jpegs by default and wait until/if somebody complains and then decide again. :)". My advice would be to compare the compressed and uncompressed JPEG sizes and then decide whether it is worth using compression or not. In extreme cases, a compressed image can become larger than the uncompressed one; more about that in paragraph 3.3.3 here: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.5_v6.pdf
Owner

Do I understand correctly that img2pdf currently compresses almost all image types (png, tiff etc.) and only jpegs are left uncompressed? The README file suggests the opposite, but Adobe Acrobat identifies "filters: zlib/deflate" for embedded png and tiff images.

That's exactly what the table says. PNG files (if they are interlaced or if they contain transparency) are zlib compressed image data processed by the Paeth filter. TIFF images are either CCITT Group 4 compressed if they are monochrome or they are gzip compressed if they are just raw pixel data.

Everything else, like JPEGs, PNGs without interlacing or transparency, or TIFF files that are already CCITT Group 4 compressed, gets copied in without modification.

Does img2pdf always use the highest compression level (9)?

No. It uses the default used by zlib.compress().
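For context, zlib.compress() without an explicit level uses Z_DEFAULT_COMPRESSION, which the zlib library maps to level 6. A quick way to compare that against level 9 on a given file (the file name is taken from the example above):

```python
# Sketch only: compare zlib's default level against the maximum level.
import zlib

data = open("a-000.jpg", "rb").read()
print(len(zlib.compress(data)))     # default level (Z_DEFAULT_COMPRESSION, i.e. level 6)
print(len(zlib.compress(data, 9)))  # maximum level, usually only slightly smaller but slower
```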

img2pdf produces PDF V1.3 files. If I understand correctly, with PDF V1.5 and above all object streams should be compressed by default. Maybe it would be useful to upgrade to V1.5?

Why? Even if we want to compress all object streams by default (currently this is the choice of the PDF library), PDF 1.3 allows for this as well.

You have said "The jpeg format apparently compresses badly for large areas of equal color", but your data shows the opposite: the image size was reduced by 99.8% (from 245K to 411 bytes).

JPEG does not use gzip to compress image data. JPEG uses DCT encoding to do lossy compression, and that fares badly in my tests. If you then apply gzip compression afterwards, you get a huge size reduction precisely because the DCT encoding compressed so badly.

You have suggested: "I'm tempted to just turn on gzip compression for jpegs by default and wait until/if somebody complains and then decide again. :)". My advice would be to compare the compressed and uncompressed JPEG sizes and then decide whether it is worth using compression or not. In extreme cases, a compressed image can become larger than the uncompressed one; more about that in paragraph 3.3.3 here.

If I do that, will you handle the bug reports from those people who started using img2pdf because of how fast it used to be? You know that people use img2pdf to convert whole directories of images?


PNG files (if they are interlaced or if they contain transparency) are zlib compressed image data

I have tried converting the same non-transparent PNG image, which was saved as interlaced and not interlaced, and in both cases it was zlib compressed.

Does img2pdf always use the highest compression level (9)?

No. It uses the default used by zlib.compress().

Have you done any tests to see how much slower compression gets when compression level is increased from 6 to 9? Maybe impact is not that huge and it is worth increasing compression level?

img2pdf produces PDF V1.3 files. If I understand correctly, with PDF V1.5 and above all object streams should be compressed by default. Maybe it would be useful to upgrade to V1.5?

Why? Even if we want to compress all object streams by default (currently this is the choice of the PDF library), the PDF 1.3 allows for this as well.

I have read that "what object Streams allow you to do is to put lots of PDF objects together inside a single binary stream. The binary stream still has a text header, telling the PDF parser how to find and extract the PDF objects, but all the PDF objects themselves can be compressed. This makes the PDF smaller, potentially more secure and possibly faster to load." Also that "Version 1.5 of the PDF format introduces a new type of stream, an object stream (ObjStm). It is a collection of many PDF objects together inside a single binary stream. The purpose of this type of object is to allow the compression of PDF objects not of the stream type. This process considerably reduces the size of PDF files."
So, I was wondering why img2pdf does not do that - why it does not put all data into a single stream and compress it?

If I do that, will you handle the bug reports of those people who started using img2pdf because of how fast it used to be?

I don't understand your argument here. You have suggested turning on gzip compression for jpegs by default, and doing that because gzip compression is very fast.
My suggestion was just to add one extra step: after compression, compare the compressed and uncompressed image sizes. That extra step most definitely will not have any significant impact on the whole conversion process.
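For illustration, the proposed check might look roughly like this in Python (a sketch only; the helper name is made up and this is not img2pdf code):

```python
# Sketch only: deflate the JPEG stream and keep the result only if it is smaller.
import zlib

def choose_jpeg_encoding(jpeg_bytes):
    deflated = zlib.compress(jpeg_bytes)
    if len(deflated) < len(jpeg_bytes):
        # the extra FlateDecode layer pays off
        return deflated, ["/FlateDecode", "/DCTDecode"]
    # otherwise keep the JPEG stream exactly as it is
    return jpeg_bytes, ["/DCTDecode"]
```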

Owner

I have tried converting the same non-transparent PNG image, which was saved as interlaced and not interlaced, and in both cases it was zlib compressed.

Of course, that's how the PNG format works.

Have you done any tests to see how much slower compression gets when compression level is increased from 6 to 9? Maybe impact is not that huge and it is worth increasing compression level?

If I'd change the compression level, I'd rather go the other way and decrease it. Storage is cheap. CPU time is not.

So, I was wondering why img2pdf does not do that - why it does not put all data into a single stream and compress it?

Doing this is the job of the pdf engine. If you want img2pdf to do it, you could for example contact the pikepdf developers and suggest that they add this feature.

I don't understand your argument here. You have suggested turning on gzip compression for jpegs by default, and doing that because gzip compression is very fast. My suggestion was just to add one extra step: after compression, compare the compressed and uncompressed image sizes. That extra step most definitely will not have any significant impact on the whole conversion process.

But why? Why add even further complexity for the very unlikely situation where applying gzip actually makes the compressed data bigger than the original? Maybe I'm wrong and it's actually not that bad. Can you prepare an MR with this feature so that I can see what your proposal would look like?


I have tried converting the same non-transparent PNG image, which was saved as interlaced and not interlaced, and in both cases it was zlib compressed.

Of course, that's how the PNG format works.

I still do not understand this part. I expected to see one image being compressed and the other just being directly embedded.

If I'd change the compression level, I'd rather go the other way and decrease it. Storage is cheap. CPU time is not.

I guess it is a matter of taste. I would do the opposite, given how rapidly CPU performance is improving.

But why? Why add even further complexity for the very unlikely situation where applying gzip actually makes the compressed data bigger than the original? Maybe I'm wrong and it's actually not that bad. Can you prepare an MR with this feature so that I can see what your proposal would look like?

Why one extra IF statement is in any significant way increasing complexity is beyond my understanding. I do not code in Python, but here is a proof of concept: https://dotnetfiddle.net/hoODop. jpeg2 is made of random bytes, hence compression is not effective, while jpeg1's size could be reduced by 15%, so compression could be used.

Owner

I still do not understand this part. I expected to see one image being compressed and the other just being directly embedded.

The PDF format allows images to be encoded using the Paeth filter, which is also used by PNG. So if img2pdf receives a PNG image that is not interlaced and has no transparency, it can directly embed the Paeth-encoded data without computing it from scratch. That encoding is then compressed with gzip.
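For illustration, the image dictionary for such a pass-through PNG might look roughly like this (the numbers are placeholders); a /Predictor value of 15 tells the PDF reader to undo the per-row PNG filters, including Paeth:

```
<< /Type /XObject /Subtype /Image
   /Width 640 /Height 480
   /ColorSpace /DeviceRGB /BitsPerComponent 8
   /Filter /FlateDecode
   /DecodeParms << /Predictor 15 /Colors 3 /BitsPerComponent 8 /Columns 640 >>
   /Length 56398 >>
```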

Why one extra IF statement is in any significant way increasing complexity is beyond my understanding.

It is far from being as simple as your code example makes it look. The function in img2pdf that does most of the magic and contains all the special-casing depending on what kind of image one throws at it is read_images, which is 490 lines long. The output of that function has a certain type which then has to be communicated to its caller, the convert function, which is itself 190 lines long and which sticks the result of read_images into pdf.add_imagepage, a function with 23 arguments that is 268 lines long. If you think this is easy, please show me a patch, then we can talk about it.


If you think this is easy, please show me a patch, then we can talk about it.

Please add the part where you compress all jpegs, and then I will show you how I imagine doing that one extra check.
