*By John on 2020-12-18T19:21:16.345Z*
---
I notice that (fairly recently) support for group4-encoded tiffs was added, which is great. However other tiffs seem to be converted to png. Most of the time, that doesn't matter to me, but sometimes it does. Any chance support for other tiffs can be added?
Thanks.
---
*By josch on 2020-12-18T21:55:34.016Z*
---
I don't understand. What is the bug in img2pdf?
---
*By John on 2020-12-18T22:24:15.079Z*
---
`img2pdf image.tiff -o new.pdf`
results in a pdf of a different file size than the original tiff, unless the tiff is Group4 compressed (maybe under other conditions as well). If I use, say, pdfimages to get the image back out of the pdf, it's not a tiff, but a png. JPG remains the same.
Could be that I'm not entirely understanding what's going on, but why does the format change on the tiff?
---
*By josch on 2020-12-18T22:37:41.079Z*
---
The format changes because PDF does not understand the tiff format.
What both PDF and TIFF understand is group4 encoding, so if your tiff is compressed with ccitt group4, the image in the pdf will use the same encoding. But all other ways of storing raster image data in tiff are not supported by pdf, so the pixel data has to be moved to a different format. The most space-saving lossless option is paeth encoding (as used in png).
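That re-packing can be sketched with Python's `zlib` module, the same compression behind PDF's `/FlateDecode` filter; the pixel bytes below are a synthetic stand-in, not real TIFF data:

```python
# Sketch: losslessly re-packing raw pixel data with zlib, the compression
# behind PDF's /FlateDecode filter. The "pixels" are synthetic stand-ins.
import zlib

width, height = 4, 2
raw = bytes(range(width * height * 3))  # 24-bit RGB rows of a tiny 4x2 image

stream = zlib.compress(raw, 9)          # what would end up in the PDF stream
assert zlib.decompress(stream) == raw   # the round trip loses nothing
```

The container changes, but the decompressed pixel data is byte-identical to the input.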
Where is the bug?
---
*By John on 2020-12-18T22:42:45.918Z*
---
Thanks, that's helpful. I didn't know about the problems pdf has with tiff.
---
*By John on 2020-12-18T22:42:47.468Z*
---
Status changed to closed
---
*By josch on 2020-12-18T22:51:34.056Z*
---
I wouldn't call it a "problem". I mean, jpeg also doesn't support ccitt group4 or the paeth filter, and we don't call that a problem. It's just one of the properties of the pdf format. The only image encodings pdf supports are jpeg, plain uncompressed pixel data, the paeth filter and ccitt group4. What I would rather call a "problem" with what pdf supports is that it cannot store anything with an alpha channel. ;)
---
*By John on 2020-12-19T00:15:49.740Z*
---
Well, jpeg is a file format and pdf is...uh...something else.
I'd also say anytime it throws away data (like it always does with dpi) is a problem.
If there's a readable document that explains what happens to images in PDF, I'd be interested. For instance, I've been using imagemagick to convert images to pdf and figured I'd switch to img2pdf, which seems noticeably faster. img2pdf pretty much produces pdfs with the same file size (format permitting), but IM will sometimes be smaller (as with some pngs) or larger (as with tiff, where it seems not to use any compression if it's not ccitt). Some png info clearly gets lost on the way back out of the pdf using pdfimages.
---
*By josch on 2020-12-19T08:47:13.844Z*
---
PDF is also a file format but it's more of a container format. JPEG and PNG only contain one kind of data. PDF and TIFF are both able to store data of many different kinds, similar to how MP4 or MKV are video containers able to store their data in many different ways (codecs).
By a document explaining what happens with images in pdf, do you mean a document on how pdf works? That's here: https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
Based on that, programs producing PDF have a choice in how they encode their images. What img2pdf does is choose an encoding of the pixel data that is lossless while at the same time resulting in a small output file. In the case of JPEG, the whole JPEG file is just dumped into the PDF. For all other raster images, only the pixel data is retained, so for all formats other than JPEG the metadata will be lost.
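A rough sketch of what that JPEG pass-through amounts to (not img2pdf's actual code; the dimensions and JPEG bytes here are placeholders): the unmodified file becomes the stream of an image XObject whose filter is `/DCTDecode`, which tells PDF readers the stream is plain JPEG data.

```python
# Sketch only -- not img2pdf's real code. A JPEG file can be embedded in a
# PDF as-is: its bytes become the stream of an image XObject, and the
# /DCTDecode filter marks the stream as plain JPEG data.
jpeg_bytes = b"\xff\xd8 placeholder jpeg data \xff\xd9"  # stand-in, not a real file

xobject = (
    b"<< /Type /XObject /Subtype /Image"
    b" /Filter /DCTDecode"
    b" /Width 650 /Height 827"  # in reality read from the JPEG header
    b" /ColorSpace /DeviceRGB /BitsPerComponent 8"
    b" /Length " + str(len(jpeg_bytes)).encode() + b" >>\nstream\n"
    + jpeg_bytes + b"\nendstream"
)
```

No decoding or re-encoding happens; the JPEG bytes appear verbatim inside the PDF, which is why that path is both fast and lossless.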
The reason img2pdf is faster is that, in most cases, it is able to just copy the image data into the pdf without re-encoding it. Look at the README for a table that explains which inputs allow direct inclusion and in which cases some computation has to happen.
IM will be smaller, yes. But it will also change the pixel data by re-encoding.
---
*By John on 2020-12-20T00:24:27.987Z*
---
> IM will be smaller, yes. But it will also change the pixel data by re-encoding.
I'm not sure there's re-encoding that modifies the data in this case, or ever. If I magick a png to pdf and then extract the image with pdfimages, I get the same thing back, even though the pdf file size is smaller than the original png. I'm asking over at the IM boards.
---
*By josch on 2020-12-20T08:59:29.392Z*
---
Oooh? That would be new! Can you share the png image and the imagemagick command you used?
---
*By John on 2020-12-20T20:39:32.177Z*
---
I attach an example png. If I convert it to pdf via:
`magick orig.png demo.pdf`
the 458k png becomes a 379k PDF. Extracting the png gives me back a 452k file:
`pdfimages -all demo.pdf demo`
The two png files seem to be pixel-identical.
![orig](/uploads/c74b62ffaf5bbc3fbddd88631aa011f2/orig.png)
It's also the case that if I use pngcrush on the file and then use IM to embed it, it has exactly the same file size as the pdf IM makes from the original png.
---
*By josch on 2020-12-20T22:14:16.110Z*
---
Thank you! The mystery is solved. :)
You managed to find a PNG (the first one I've seen) where the paeth filter makes the zlib compression worse. The better option is actually to not apply the paeth filter at all and to compress the pixel data with zlib directly. This is what the pdf does, and that's why it's smaller than your original png. You can see this in the pdf:
```
<<
/Type /XObject
/Subtype /Image
/Name /Im0
/Filter [ /FlateDecode ]
/Width 650
/Height 827
/ColorSpace 10 0 R
/BitsPerComponent 8
/Length 9 0 R
>>
```
The `/FlateDecode` filter means that the raw pixel data is compressed with zlib. Your input png image, on the other hand, uses all kinds of filters more or less randomly -- no idea which software encoded it. You can see this in the output of pngcheck:
```
File: orig.png (458405 bytes)
chunk IHDR at offset 0x0000c, length 13
650 x 827 image, 24-bit RGB, non-interlaced
chunk gAMA at offset 0x00025, length 4: 0.45455
chunk cHRM at offset 0x00035, length 32
White x = 0.3127 y = 0.329, Red x = 0.64 y = 0.33
Green x = 0.3 y = 0.6, Blue x = 0.15 y = 0.06
chunk bKGD at offset 0x00061, length 6
red = 0x00ff, green = 0x00ff, blue = 0x00ff
chunk pHYs at offset 0x00073, length 9: 7874x7874 pixels/meter (200 dpi)
chunk IDAT at offset 0x00088, length 32768
zlib: deflated, 32K window, maximum compression
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
1 4 4 4 4 4 4 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 1 1 1 1 1 3 4 2 2 2 2 2 2 2 3 1 4 4 4 4 4 4
4 1 2 2 2 2 4 4 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 1 4 1 4 4 4 4 2 2 4 4 2 2 4 2 3 1 3 1 3 2 4 3 2
2 (226 out of 827)
chunk IDAT at offset 0x08094, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
4 2 3 1 1 3 1 2 2 4 2 2 2 2 4 3 1 3 1 3 2 4 4 2 2
4 2 3 1 3 3 3 2 4 4 2 2 4 2 (265 out of 827)
chunk IDAT at offset 0x100a0, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
3 1 1 3 3 2 2 2 2 2 2 4 2 3 1 3 3 2 2 4 3 2 2 2 2
3 1 3 1 3 2 4 4 2 4 4 (301 out of 827)
chunk IDAT at offset 0x180ac, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
2 3 1 3 3 3 2 2 4 3 2 2 2 3 3 1 3 1 2 2 2 3 2 4 4
2 3 1 1 4 3 3 4 4 2 4 4 4 3 1 4 1 1 3 4 3 3 4 4 2
(351 out of 827)
chunk IDAT at offset 0x200b8, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
3 1 1 3 4 3 4 4 3 4 4 4 4 1 1 3 1 3 4 4 4 2 4 4 4
3 1 4 1 4 3 4 4 4 4 4 4 (388 out of 827)
chunk IDAT at offset 0x280c4, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
3 1 1 1 3 3 4 4 2 4 4 4 3 1 1 4 1 4 4 4 3 4 4 4 4
1 1 3 3 3 3 4 3 3 4 4 4 (425 out of 827)
chunk IDAT at offset 0x300d0, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
3 1 4 3 4 3 3 4 3 4 4 4 3 1 1 3 3 3 4 4 3 2 4 4 2
3 1 3 4 3 2 4 4 2 4 4 2 (462 out of 827)
chunk IDAT at offset 0x380dc, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
3 1 3 3 4 4 3 4 2 4 4 4 3 3 1 3 4 2 2 4 3 4 4 2 2
3 1 3 4 3 4 4 4 2 4 4 4 3 1 (501 out of 827)
chunk IDAT at offset 0x400e8, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
1 1 3 2 3 3 3 2 2 4 3 4 4 3 1 3 3 3 2 4 2 2 4 1 1
1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2 2
2 4 1 2 4 4 4 4 4 3 3 2 2 4 2 4 4 3 (569 out of 827)
chunk IDAT at offset 0x480f4, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
1 3 4 2 2 4 2 1 4 3 1 2 4 2 3 2 1 3 3 3 2 2 4 2 4
2 3 1 3 3 2 2 4 2 0 2 3 1 3 3 4 2 4 (612 out of 827)
chunk IDAT at offset 0x50100, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
2 4 2 3 1 3 3 2 2 2 2 4 2 3 1 4 3 2 4 4 2 4 4 3 1
3 3 2 2 4 3 4 3 3 4 3 3 2 4 3 3 4 (654 out of 827)
chunk IDAT at offset 0x5810c, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
3 1 1 2 2 2 3 2 4 4 3 1 3 3 4 2 4 3 3 4 3 1 2 4 2
2 2 3 4 2 3 1 2 2 2 2 4 3 4 (693 out of 827)
chunk IDAT at offset 0x60118, length 32768
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
2 3 1 3 2 2 2 2 3 4 2 3 4 2 3 2 4 2 3 1 3 3 2 4 2
2 2 3 1 1 3 3 2 4 2 2 2 3 1 3 3 (734 out of 827)
chunk IDAT at offset 0x68124, length 32011
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
3 2 4 2 2 2 3 1 3 3 3 3 2 2 4 2 3 1 3 3 3 4 2 2 2
3 1 1 3 3 2 3 2 2 2 3 1 3 3 1 3 2 1 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 (827 out of 827)
chunk tEXt at offset 0x6fe3b, length 37, keyword: date:create
chunk tEXt at offset 0x6fe6c, length 37, keyword: date:modify
chunk IEND at offset 0x6fe9d, length 0
No errors detected in orig.png (22 chunks, 71.6% compression).
```
As you can see, filters 1, 2, 3 and 4 get used but never filter 0, which would be no filter at all. For this input data, the best compression is to use no filter. This is also what `pngcrush` figures out, so after running your image through `pngcrush` you get:
```
File: pngout.png (371230 bytes)
chunk IHDR at offset 0x0000c, length 13
650 x 827 image, 24-bit RGB, non-interlaced
chunk gAMA at offset 0x00025, length 4: 0.45455
chunk cHRM at offset 0x00035, length 32
White x = 0.3127 y = 0.329, Red x = 0.64 y = 0.33
Green x = 0.3 y = 0.6, Blue x = 0.15 y = 0.06
chunk bKGD at offset 0x00061, length 6
red = 0x00ff, green = 0x00ff, blue = 0x00ff
chunk pHYs at offset 0x00073, length 9: 7874x7874 pixels/meter (200 dpi)
chunk IDAT at offset 0x00088, length 370976
zlib: deflated, 32K window, maximum compression
row filters (0 none, 1 sub, 2 up, 3 avg, 4 paeth):
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 (827 out of 827)
chunk tEXt at offset 0x5a9b4, length 37, keyword: date:create
chunk tEXt at offset 0x5a9e5, length 37, keyword: date:modify
chunk IEND at offset 0x5aa16, length 0
No errors detected in pngout.png (9 chunks, 77.0% compression).
```
All those zeros mean that no filter was applied. The resulting png image is then also smaller than the pdf created by imagemagick. Funnily enough, when using `pdfimages -all`, the newly created png image also picks all kinds of filters in the hope that this improves compression -- it doesn't. :)
Thanks for this input! This is the first time I'm seeing this. :)
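The paeth filter under discussion is small enough to sketch. This is the predictor from the PNG specification applied to one synthetic grayscale row (one byte per pixel, all-zero previous row), showing the usual case where filtering helps zlib; the thread's image is the unusual case where it doesn't.

```python
# Paeth predictor (PNG filter type 4), per the PNG specification, applied
# to one grayscale row (1 byte per pixel) with an all-zero previous row.
import zlib

def paeth(a, b, c):
    # a = left neighbour, b = above, c = upper-left
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    return b if pb <= pc else c

def paeth_filter_row(row, prev):
    out = bytearray()
    for i, x in enumerate(row):
        a = row[i - 1] if i else 0
        c = prev[i - 1] if i else 0
        out.append((x - paeth(a, prev[i], c)) & 0xFF)
    return bytes(out)

# On a smooth gradient the filtered row becomes almost all 1s, which
# deflates far better than the raw bytes -- the case the filter is built for.
row, prev = bytes(range(64)), bytes(64)
assert len(zlib.compress(paeth_filter_row(row, prev))) < len(zlib.compress(row))
```

Whether the filter helps depends entirely on how predictable each pixel is from its neighbours, which is why a per-image choice (as pngcrush makes) can beat a fixed strategy.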
---
*By John on 2020-12-21T20:09:22.861Z*
---
To get the png, I ran the attached jpg through a basic IM command:
`magick orig.jpg orig.png`
Running through all the IM combinations for compression filter, level, and strategy, I can't get down to the pngcrush size. The best I can do is by setting those values, respectively, to 130, 131, 230, 231, 330, 331, 430, or 431, all of which give identically sized files: 357630 bytes. The pngcrush output was 343905 bytes.
If I understand it correctly, pngcheck shows multiple data chunks with no compression, so presumably it's the chunking that's using up the extra 14k?
![orig](/uploads/c65a87339671a4f332128defd7332cea/orig.jpg)
---
*By josch on 2020-12-21T21:05:12.670Z*
---
You are probably talking about the options `png:compression-filter`, `png:compression-level` and `png:compression-strategy`? Those only control the zip compression. What makes png compress so well compared to bmp or gif is the paeth filter. That filter is *not* a compression. It just transforms the data in a way that makes it really well suited for zip compression. In most cases, that is. You found one example where the paeth filter makes it worse. So you would have to somehow tell imagemagick not to use the paeth filter but to hand the data to the zip compressor without filtering. I don't know if there is an option for that.
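That "filter, then compress" split can be illustrated with the simpler sub filter (PNG filter type 1). The point is the same as for paeth: the filter is an invertible transform, so it never loses data; it only changes how compressible the bytes are.

```python
# Sub filter (PNG filter type 1): each byte is replaced by its difference
# from the byte to its left. Invertible, so no pixel data is ever lost.
import zlib

def sub_filter(row):
    return bytes((row[i] - (row[i - 1] if i else 0)) & 0xFF for i in range(len(row)))

def sub_unfilter(filt):
    out = bytearray()
    for i, d in enumerate(filt):
        out.append((d + (out[i - 1] if i else 0)) & 0xFF)
    return bytes(out)

row = bytes(range(0, 200, 2))                  # a smooth ramp: 0, 2, 4, ..., 198
assert sub_unfilter(sub_filter(row)) == row    # the filter is fully reversible
# On this ramp the filtered bytes are a constant run, so filtering helps zlib:
assert len(zlib.compress(sub_filter(row))) < len(zlib.compress(row))
```

On an image like the one in this thread, that second comparison flips: the unfiltered bytes deflate better, which is exactly what the all-zero filter column in the pngcrush output reflects.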
---
*By John on 2020-12-21T21:31:52.523Z*
---
I'm pretty sure I covered paeth in there; I think it's a filter value of "4". The other way png compression is controlled is the -quality option, and I tried all of those as well. See details here: https://legacy.imagemagick.org/Usage/formats/#png_quality
I'm hoping to find out on the IM boards whether IM uses pngcrush before putting a png into a PDF.