Transcoding to Monochrome Fails with Pillow >= 8.3.0 #122

Closed
opened 3 years ago by jellopuddingstick · 12 comments

It appears that python-pillow/Pillow#5514, first included in 8.3.0, added a limit to the TIFF strip size.

Currently, the transcode_monochrome() method assumes that Pillow will write a TIFF file with a single strip, but that assumption no longer holds, resulting in the

raise NotImplementedError("Transcoding multiple strips not supported")

error being hit.

It appears that [python-pillow/Pillow#5514](https://github.com/python-pillow/Pillow/pull/5514), first included in [8.3.0](https://github.com/python-pillow/Pillow/releases/tag/8.3.0), added a limit to the TIFF strip size. Currently, the `transcode_monochrome()` method assumes that Pillow will write a TIFF file with a single strip, but that assumption no longer holds, resulting in the ```python raise NotImplementedError("Transcoding multiple strips not supported") ``` error being hit.

An unfortunate consequence of this that I just realized is that this exception handler is always hit, meaning that providing a non-PNG binary image as input will result in it being converted to grayscale.

As a test, I saved a binary PBM file as PNG and TIFF (with Group 4 compression), and then ran all three through img2pdf. The resulting sizes were:

Input Format Input Size Output PDF Size
PBM 268 KB 54 KB
PNG 36 KB 37 KB
TIFF (Group4) 20 KB 54 KB

So, for the time being, img2pdf is not able to directly embed binary images unless they are PNG, and the best possible space savings requires you to provide a binary PNG.

An unfortunate consequence of this that I just realized is that [this exception handler](https://gitlab.mister-muffin.de/josch/img2pdf/src/commit/f483638b1702218a4fa69e3a5ccb56d6e1d59b1b/src/img2pdf.py#L1717-L1721) is always hit, meaning that providing a non-PNG binary image as input will result in it being converted to grayscale. As a test, I saved a binary PBM file as PNG and TIFF (with Group 4 compression), and then ran all three through `img2pdf`. The resulting sizes were: | Input Format | Input Size | Output PDF Size | |---------------|------------|-----------------| | PBM | 268 KB | 54 KB | | PNG | 36 KB | 37 KB | | TIFF (Group4) | 20 KB | 54 KB | So, for the time being, `img2pdf` is _not_ able to directly embed binary images unless they are PNG, and the best possible space savings requires you to provide a binary PNG.
josch commented 3 years ago
Owner

Ouch! Thanks a lot for bringing up this problem.

It might be possible to just add multi-strip support to ccitt_payload_location_from_pil but a better solution would be to drop that hack altogether and do a CCITT Group4 encoding in a different way.

Ouch! Thanks a lot for bringing up this problem. It might be possible to just add multi-strip support to `ccitt_payload_location_from_pil` but a better solution would be to drop that hack altogether and do a CCITT Group4 encoding in a different way.

Thanks a lot for bringing up this problem.

No problem!

It might be possible to just add multi-strip support to ccitt_payload_location_from_pil but a better solution would be to drop that hack altogether and do a CCITT Group4 encoding in a different way.

Although, if you were able to read multi-strip TIFFs, would you be able to handle any TIFF Group4 encoded files without re-encoding them? As it is, the single strip check seems like a significant caveat to the README's claim that:

For [...] TIFF images with CCITT Group 4 encoded data, img2pdf directly embeds the image data into the PDF without re-encoding it.


One other thought I had: in the interim, do you think it makes sense to remove the conversion to grayscale in the exception handler? That way the binary PNG format would be stored, which at least has better space savings than the grayscale.

> Thanks a lot for bringing up this problem. No problem! > It might be possible to just add multi-strip support to `ccitt_payload_location_from_pil` but a better solution would be to drop that hack altogether and do a CCITT Group4 encoding in a different way. Although, if you _were_ able to read multi-strip TIFFs, would you be able to handle any TIFF Group4 encoded files without re-encoding them? As it is, the [single strip check](https://gitlab.mister-muffin.de/josch/img2pdf/src/commit/f483638b1702218a4fa69e3a5ccb56d6e1d59b1b/src/img2pdf.py#L1635) seems like a significant caveat to the README's claim that: > For [...] TIFF images with CCITT Group 4 encoded data, img2pdf directly embeds the image data into the PDF without re-encoding it. --- One other thought I had: in the interim, do you think it makes sense to remove the [conversion to grayscale](https://gitlab.mister-muffin.de/josch/img2pdf/src/commit/f483638b1702218a4fa69e3a5ccb56d6e1d59b1b/src/img2pdf.py#L1720) in the exception handler? That way the binary PNG format would be stored, which at least has better space savings than the grayscale.
josch commented 3 years ago
Owner

It might be possible to just add multi-strip support to ccitt_payload_location_from_pil but a better solution would be to drop that hack altogether and do a CCITT Group4 encoding in a different way.

Although, if you were able to read multi-strip TIFFs, would you be able to handle any TIFF Group4 encoded files without re-encoding them? As it is, the single strip check seems like a significant caveat to the README's claim that:

For [...] TIFF images with CCITT Group 4 encoded data, img2pdf directly embeds the image data into the PDF without re-encoding it.

Indeed I have never seen a multi-strip CCITT Group 4 image since I started writing img2pdf in 2012.

One other thought I had: in the interim, do you think it makes sense to remove the conversion to grayscale in the exception handler? That way the binary PNG format would be stored, which at least has better space savings than the grayscale.

What do you mean? The PNG format doesn't support "binary" images. It supports grayscale with only black and white but those still take up one byte per pixel and not one bit per pixel before being passed to the paeth filter.

> > It might be possible to just add multi-strip support to `ccitt_payload_location_from_pil` but a better solution would be to drop that hack altogether and do a CCITT Group4 encoding in a different way. > > Although, if you _were_ able to read multi-strip TIFFs, would you be able to handle any TIFF Group4 encoded files without re-encoding them? As it is, the [single strip check](https://gitlab.mister-muffin.de/josch/img2pdf/src/commit/f483638b1702218a4fa69e3a5ccb56d6e1d59b1b/src/img2pdf.py#L1635) seems like a significant caveat to the README's claim that: > > > For [...] TIFF images with CCITT Group 4 encoded data, img2pdf directly embeds the image data into the PDF without re-encoding it. Indeed I have never seen a multi-strip CCITT Group 4 image since I started writing img2pdf in 2012. > One other thought I had: in the interim, do you think it makes sense to remove the [conversion to grayscale](https://gitlab.mister-muffin.de/josch/img2pdf/src/commit/f483638b1702218a4fa69e3a5ccb56d6e1d59b1b/src/img2pdf.py#L1720) in the exception handler? That way the binary PNG format would be stored, which at least has better space savings than the grayscale. What do you mean? The PNG format doesn't support "binary" images. It supports grayscale with only black and white but those still take up one byte per pixel and not one bit per pixel before being passed to the paeth filter.

Indeed I have never seen a multi-strip CCITT Group 4 image since I started writing img2pdf in 2012.

Ah, then the single strip limit is not as significant as I thought.

I should mention that I've got very little experience in this space; I assumed that multi-strip was common, since the spec says that:

Use of a single strip is not recommended. Choose RowsPerStrip such that each strip is about 8K bytes, even if the data is not compressed, since it makes buffering simpler for readers.

Indeed, I've just tested with my Brother scanner, and it produced a single strip as you say! So much for the spec 😛.


What do you mean? The PNG format doesn't support "binary" images. It supports grayscale with only black and white but those still take up one byte per pixel and not one bit per pixel before being passed to the paeth filter.

I'm not sure how the underlying data representation changes, but my observation has been that converting a binary image to grayscale results in a larger output PDF, as shown in the table I shared above.

Here is some simple code to demonstrate what I'm observing.

> Indeed I have never seen a multi-strip CCITT Group 4 image since I started writing img2pdf in 2012. Ah, then the single strip limit is not as significant as I thought. I should mention that I've got very little experience in this space; I assumed that multi-strip was common, since [the spec](https://www.adobe.io/content/dam/udp/en/open/standards/tiff/TIFF6.pdf) says that: > Use of a single strip is not recommended. Choose RowsPerStrip such that each strip is about 8K bytes, even if the data is not compressed, since it makes buffering simpler for readers. Indeed, I've just tested with my Brother scanner, and it produced a single strip as you say! So much for the spec 😛. --- > What do you mean? The PNG format doesn't support "binary" images. It supports grayscale with only black and white but those still take up one byte per pixel and not one bit per pixel before being passed to the paeth filter. I'm not sure how the underlying data representation changes, but my observation has been that converting a binary image to grayscale results in a larger output PDF, as shown in the table I shared above. [Here](https://gist.github.com/jellopuddingstick/3921316d79f332015d43255e39311ff3) is some simple code to demonstrate what I'm observing.
josch commented 3 years ago
Owner

I reported this as https://github.com/python-pillow/Pillow/issues/5740

Alternatively, if anybody knows of another group4 encoder, I'm all ears.

I reported this as https://github.com/python-pillow/Pillow/issues/5740 Alternatively, if anybody knows of another group4 encoder, I'm all ears.
josch closed this issue 3 years ago

Hi,
I'd like to experiment with the code related to this issue, but apparently it's not covered by tests. Could you please share a sample file that passes the default strip size limit, or explain how I could find/create one?
Thanks!

Hi, I'd like to experiment with the code related to this issue, but apparently it's not covered by tests. Could you please share a sample file that passes the default strip size limit, or explain how I could find/create one? Thanks!
josch commented 2 years ago
Owner

Since Pillow 8.4.0, TiffImagePlugin has the attribute STRIP_SIZE with the default value of 65536. You can probably create such a TIFF file by setting STRIP_SIZE to a value above 65536 and then saving the tiff.

I also should amend the changes I made in 6eec05c11c to make use of TiffImagePlugin.STRIP_SIZE instead of monkey patching TiffImagePlugin.ImageFileDirectory_v2.__getitem__ which is only necessary for Pillow 8.3.x.

If you could contribute code creating a test case that fails without 6eec05c11c that would be great and would allow me to implement the STRIP_SIZE based method that works for Pillow 8.4.0 and later.

Thanks!

Since Pillow 8.4.0, `TiffImagePlugin` has the attribute STRIP_SIZE with the default value of 65536. You can probably create such a TIFF file by setting STRIP_SIZE to a value above 65536 and then saving the tiff. I also should amend the changes I made in 6eec05c11c7e1cb2f2ea21aa502ebd5f88c5828b to make use of `TiffImagePlugin.STRIP_SIZE` instead of monkey patching `TiffImagePlugin.ImageFileDirectory_v2.__getitem__` which is only necessary for Pillow 8.3.x. If you could contribute code creating a test case that fails without 6eec05c11c7e1cb2f2ea21aa502ebd5f88c5828b that would be great and would allow me to implement the STRIP_SIZE based method that works for Pillow 8.4.0 and later. Thanks!

Thanks for the quick response! I'll take a look.

Thanks for the quick response! I'll take a look.

Ok, I've created a test file (attached, derived from sample_1920×1280.tiff).
When removing the __getitem__ override workaround, the 1 mode image is converted to L:

Transcoding multiple strips is not supported by the PDF format
Converting colorspace 1 to L

However, it seems to work correctly when adding this line:

TiffImagePlugin.STRIP_SIZE = (imgdata.size[0] + 7) // 8 * imgdata.size[1]
Converting monochrome to CCITT Group4
TIFF strip_offsets: 8
TIFF strip_bytes: 722861
Ok, I've created a test file (attached, derived from [`sample_1920×1280.tiff`]( https://filesamples.com/formats/tiff)). When removing the `__getitem__` override workaround, the 1 mode image is converted to L: ``` Transcoding multiple strips is not supported by the PDF format Converting colorspace 1 to L ``` However, it seems to work correctly when adding this line: ```python3 TiffImagePlugin.STRIP_SIZE = (imgdata.size[0] + 7) // 8 * imgdata.size[1] ``` ``` Converting monochrome to CCITT Group4 TIFF strip_offsets: 8 TIFF strip_bytes: 722861 ```
josch commented 2 years ago
Owner

Suppose the following diff:

@@ -1410,27 +1420,35 @@ def transcode_monochrome(imgdata):
     # into putting everything into a single strip. Thanks to Andrew Murray for
     # the hack.
     #
-    # This can be dropped once this gets merged:
-    # https://github.com/python-pillow/Pillow/pull/5744
-    pillow__getitem__ = TiffImagePlugin.ImageFileDirectory_v2.__getitem__
-
-    def __getitem__(self, tag):
-        overrides = {
-            TiffImagePlugin.ROWSPERSTRIP: imgdata.size[1],
-            TiffImagePlugin.STRIPBYTECOUNTS: [
-                (imgdata.size[0] + 7) // 8 * imgdata.size[1]
-            ],
-            TiffImagePlugin.STRIPOFFSETS: [0],
-        }
-        return overrides.get(tag, pillow__getitem__(self, tag))
-
-    # use try/finally to make sure that __getitem__ is reset even if save()
-    # raises an exception
-    try:
-        TiffImagePlugin.ImageFileDirectory_v2.__getitem__ = __getitem__
+    # Since version 8.4.0 Pillow allows us to modify the strip size explicitly
+    if hasattr(TiffImagePlugin, "STRIP_SIZE"):
+        # we are using Pillow 8.4.0 or later
+        oldstripsize = TiffImagePlugin.STRIP_SIZE
+        TiffImagePlugin.STRIP_SIZE = (imgdata.size[0] + 7) // 8 * imgdata.size[1]
         im.save(newimgio, format="TIFF", compression="group4")
-    finally:
-        TiffImagePlugin.ImageFileDirectory_v2.__getitem__ = pillow__getitem__
+        TiffImagePlugin.STRIP_SIZE = oldstripsize
+    else:
+        # only needed for Pillow 8.3.x but works for versions before that as
+        # well
+        pillow__getitem__ = TiffImagePlugin.ImageFileDirectory_v2.__getitem__
+
+        def __getitem__(self, tag):
+            overrides = {
+                TiffImagePlugin.ROWSPERSTRIP: imgdata.size[1],
+                TiffImagePlugin.STRIPBYTECOUNTS: [
+                    (imgdata.size[0] + 7) // 8 * imgdata.size[1]
+                ],
+                TiffImagePlugin.STRIPOFFSETS: [0],
+            }
+            return overrides.get(tag, pillow__getitem__(self, tag))
+
+        # use try/finally to make sure that __getitem__ is reset even if save()
+        # raises an exception
+        try:
+            TiffImagePlugin.ImageFileDirectory_v2.__getitem__ = __getitem__
+            im.save(newimgio, format="TIFF", compression="group4")
+        finally:
+            TiffImagePlugin.ImageFileDirectory_v2.__getitem__ = pillow__getitem__
 
     # Open new image in memory
     newimgio.seek(0)

Using your long_strip.tiff as input and then running:

src/img2pdf.py --engine=internal --nodate ~/Downloads/long_strip.tiff -o out.pdf

The resulting PDF files are bit-by-bit identical no matter whether the first or the second branch of the if is used.

Suppose the following diff: ```diff @@ -1410,27 +1420,35 @@ def transcode_monochrome(imgdata): # into putting everything into a single strip. Thanks to Andrew Murray for # the hack. # - # This can be dropped once this gets merged: - # https://github.com/python-pillow/Pillow/pull/5744 - pillow__getitem__ = TiffImagePlugin.ImageFileDirectory_v2.__getitem__ - - def __getitem__(self, tag): - overrides = { - TiffImagePlugin.ROWSPERSTRIP: imgdata.size[1], - TiffImagePlugin.STRIPBYTECOUNTS: [ - (imgdata.size[0] + 7) // 8 * imgdata.size[1] - ], - TiffImagePlugin.STRIPOFFSETS: [0], - } - return overrides.get(tag, pillow__getitem__(self, tag)) - - # use try/finally to make sure that __getitem__ is reset even if save() - # raises an exception - try: - TiffImagePlugin.ImageFileDirectory_v2.__getitem__ = __getitem__ + # Since version 8.4.0 Pillow allows us to modify the strip size explicitly + if hasattr(TiffImagePlugin, "STRIP_SIZE"): + # we are using Pillow 8.4.0 or later + oldstripsize = TiffImagePlugin.STRIP_SIZE + TiffImagePlugin.STRIP_SIZE = (imgdata.size[0] + 7) // 8 * imgdata.size[1] im.save(newimgio, format="TIFF", compression="group4") - finally: - TiffImagePlugin.ImageFileDirectory_v2.__getitem__ = pillow__getitem__ + TiffImagePlugin.STRIP_SIZE = oldstripsize + else: + # only needed for Pillow 8.3.x but works for versions before that as + # well + pillow__getitem__ = TiffImagePlugin.ImageFileDirectory_v2.__getitem__ + + def __getitem__(self, tag): + overrides = { + TiffImagePlugin.ROWSPERSTRIP: imgdata.size[1], + TiffImagePlugin.STRIPBYTECOUNTS: [ + (imgdata.size[0] + 7) // 8 * imgdata.size[1] + ], + TiffImagePlugin.STRIPOFFSETS: [0], + } + return overrides.get(tag, pillow__getitem__(self, tag)) + + # use try/finally to make sure that __getitem__ is reset even if save() + # raises an exception + try: + TiffImagePlugin.ImageFileDirectory_v2.__getitem__ = __getitem__ + im.save(newimgio, format="TIFF", compression="group4") + finally: + TiffImagePlugin.ImageFileDirectory_v2.__getitem__ = pillow__getitem__ # Open new image in memory newimgio.seek(0) ``` Using your `long_strip.tiff` as input and then running: src/img2pdf.py --engine=internal --nodate ~/Downloads/long_strip.tiff -o out.pdf The resulting PDF files are bit-by-bit identical no matter whether the first or the second branch of the `if` is used.

Thanks for confirming!

Thanks for confirming!
Sign in to join this conversation.
No Milestone
No project
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: josch/img2pdf#122
Loading…
There is no content yet.