use /LZWDecode filter for GIF and matching TIFF images #174

Open
opened 7 months ago by frechefuchs · 4 comments

The GIF image format uses LZW compression, which is also a valid compression format for TIFF images. The PDF format provides an /LZWDecode filter that seems like a perfect match to embed GIF and certain TIFF images without transcoding image data.

A test with img2pdf 3.3.0 (Ubuntu 20.04) shows that img2pdf always transcodes GIF and LZW-compressed TIFF images into /FlateDecode streams.

$ gm convert logo: GIF:- | img2pdf | grep -ia filter
    /Filter /FlateDecode
$ gm convert logo: -compress lzw TIFF:- | img2pdf | grep -ia filter
    /Filter /FlateDecode

Wouldn't it make sense to preserve LZW encoded image data in these cases?

There's one caveat, though. The PDF/A standard explicitly forbids /LZWDecode data streams. So when the --pdfa option is given, image data should still be transcoded into /FlateDecode streams.
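
The rule above could be sketched roughly as follows. This is a hypothetical helper, not img2pdf code: the name `plan_lzw_embedding` and its return shape are assumptions made for illustration, and it only covers input that is already LZW-compressed.

```python
def plan_lzw_embedding(pdfa: bool) -> tuple[str, bool]:
    """Return (pdf_filter, passthrough) for an LZW-compressed input image.

    Hypothetical helper for illustration only.
    """
    if pdfa:
        # PDF/A forbids /LZWDecode, so the LZW data has to be decoded
        # and re-compressed as a /FlateDecode stream.
        return "/FlateDecode", False
    # Otherwise the LZW payload could be copied into the PDF unchanged.
    return "/LZWDecode", True
```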

As an example, a PDF file using the /LZWDecode filter can be created using GraphicsMagick like this:

$ gm convert logo: -compress lzw PDF:- | grep -ia filter
/Filter [ /LZWDecode ]
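
To run the same check without grep, a rough Python equivalent could look like this. `list_filters` is a hypothetical helper, and it only sees /Filter entries that are stored as plain text in the file, not ones hidden inside compressed object streams.

```python
import re

# Matches "/Filter /Name" as well as the array form "/Filter [ /A /B ]".
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")


def list_filters(pdf_bytes: bytes) -> list[str]:
    """Return the filter names of every /Filter entry found in raw PDF bytes."""
    names = []
    for match in FILTER_RE.finditer(pdf_bytes):
        # A single entry may name one filter or an array of filters.
        for name in re.findall(rb"/([A-Za-z0-9]+)", match.group(1)):
            names.append(name.decode("ascii"))
    return names
```

Usage would be along the lines of `list_filters(open("out.pdf", "rb").read())`, where `out.pdf` is a placeholder file name.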
josch commented 7 months ago
Owner

Yes, that would make sense if it is possible.

Do you have very large GIF images where re-encoding them slows you down?

One problem is that Pillow does not give me access to the compressed data, so img2pdf would need to learn how to access and extract just the right bits from the input image.
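
To give an idea of what "extracting just the right bits" could mean for GIF, here is a minimal standard-library sketch that pulls the LZW payload of the first frame out of a GIF file. The function names are made up for illustration and this is not img2pdf code. One thing worth verifying against the specs before relying on it: GIF appears to pack LZW codes least-significant bit first with a variable initial code size, while /LZWDecode follows the TIFF convention (most-significant bit first, 8-bit symbols), so the extracted bytes would most likely still need repacking before they could be embedded.

```python
def _read_sub_blocks(data: bytes, pos: int) -> tuple[bytes, int]:
    """Concatenate length-prefixed sub-blocks; return (payload, next_pos)."""
    chunks = []
    while True:
        size = data[pos]
        pos += 1
        if size == 0:  # a zero-length sub-block terminates the sequence
            return b"".join(chunks), pos
        chunks.append(data[pos:pos + size])
        pos += size


def first_gif_lzw_stream(data: bytes) -> tuple[int, bytes]:
    """Return (lzw_min_code_size, raw_lzw_bytes) of the first frame in a GIF."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    packed = data[10]
    pos = 13  # 6-byte header + 7-byte logical screen descriptor
    if packed & 0x80:  # skip the global color table if present
        pos += 3 * (1 << ((packed & 0x07) + 1))
    while True:
        block_type = data[pos]
        pos += 1
        if block_type == 0x21:  # extension: label byte, then sub-blocks
            _, pos = _read_sub_blocks(data, pos + 1)
        elif block_type == 0x2C:  # image descriptor (9 bytes after separator)
            local_packed = data[pos + 8]
            pos += 9
            if local_packed & 0x80:  # skip the local color table if present
                pos += 3 * (1 << ((local_packed & 0x07) + 1))
            min_code_size = data[pos]
            payload, _ = _read_sub_blocks(data, pos + 1)
            return min_code_size, payload
        else:  # trailer (0x3B) or something unexpected
            raise ValueError("no image data found")
```

The sketch does no bounds checking, so a truncated file would surface as an `IndexError`.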

Would you like to propose a patch?

Poster

Well, I don't really have a use case for /LZWDecode streams in PDF files. In fact, I'm trying to avoid them, because they're forbidden in PDF/A-compliant files. I stumbled across this issue while testing which error my PDF/A validation software (veraPDF, https://verapdf.org/) reports when given a file that uses the LZWDecode filter. Knowing that img2pdf preserves image data as far as possible, I fired up img2pdf, only to find myself wondering why veraPDF didn't raise an alarm. Only after inspecting the output did I find out that img2pdf uses /FlateDecode streams.

Also, I'm neither a Python nor a C guy. Sorry, all I can contribute is an idea for an enhancement.

Poster

The libtiff-tools package contains a tool, tiff2pdf, that seems to serve a similar purpose to img2pdf, but just for TIFF images. It doesn't seem to preserve LZW compression, though.

However, it is able to preserve image data of some sub-formats without transcoding. From the man page:

If the input TIFF contains single strip CCITT G4 Fax compressed information, then that is written to the PDF file without transcoding, unless the options of no compression and no passthrough are set, -d and -n.

If the input TIFF contains JPEG or single strip Zip/Deflate compressed information, and they are configured, then that is written to the PDF file without transcoding, unless the options of no compression and no passthrough are set.

Maybe the code can be used as a reference.
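
For the TIFF case, a similar single-strip check can be sketched with the standard library alone. The helper names below are hypothetical and not taken from tiff2pdf or img2pdf. The sketch only handles classic (non-BigTIFF) files and SHORT/LONG tag values, ignores tiled images, and does not look at other tags (such as Predictor or FillOrder) that a real pass-through would also have to map into the PDF image dictionary.

```python
import struct
from typing import Optional

TAG_COMPRESSION = 259
TAG_STRIP_OFFSETS = 273
TAG_STRIP_BYTE_COUNTS = 279
COMPRESSION_LZW = 5


def read_first_ifd(data: bytes) -> dict[int, tuple[int, ...]]:
    """Parse the first IFD of a classic TIFF into {tag: values}."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or struct.unpack(order + "H", data[2:4])[0] != 42:
        raise ValueError("not a classic TIFF file")
    (ifd_offset,) = struct.unpack(order + "I", data[4:8])
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes long
        tag, typ, n = struct.unpack(order + "HHI", data[entry:entry + 8])
        fmt = {3: "H", 4: "I"}.get(typ)  # SHORT and LONG only
        if fmt is None:
            continue
        size = struct.calcsize(order + fmt) * n
        if size <= 4:  # small values are stored inline in the entry
            raw = data[entry + 8:entry + 8 + size]
        else:  # larger values are stored elsewhere, at the given offset
            (value_offset,) = struct.unpack(order + "I", data[entry + 8:entry + 12])
            raw = data[value_offset:value_offset + size]
        tags[tag] = struct.unpack(order + fmt * n, raw)
    return tags


def single_lzw_strip(data: bytes) -> Optional[bytes]:
    """Return the raw strip if the TIFF has exactly one LZW strip, else None."""
    tags = read_first_ifd(data)
    if tags.get(TAG_COMPRESSION) != (COMPRESSION_LZW,):
        return None
    offsets = tags.get(TAG_STRIP_OFFSETS, ())
    counts = tags.get(TAG_STRIP_BYTE_COUNTS, ())
    if len(offsets) != 1 or len(counts) != 1:
        return None
    return data[offsets[0]:offsets[0] + counts[0]]
```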

josch added the "enhancement" and "help wanted" labels 7 months ago
josch commented 7 months ago
Owner

Thank you! I've tagged this issue as "enhancement" and "help wanted", so if somebody finds some time to implement this I'd be happy to review patches. 😄

Reference: josch/img2pdf#174