Transcoding to Monochrome Fails with Pillow >= 8.3.0 #122
It appears that python-pillow/Pillow#5514, first included in 8.3.0, added a limit to the TIFF strip size.
Currently, the `transcode_monochrome()` method assumes that Pillow will write a TIFF file with a single strip, but that assumption no longer holds, resulting in the error being hit.
An unfortunate consequence of this that I just realized is that this exception handler is always hit, meaning that providing a non-PNG binary image as input will result in it being converted to grayscale.
As a test, I saved a binary PBM file as PNG and TIFF (with Group 4 compression), and then ran all three through `img2pdf`. The resulting sizes were:

So, for the time being, `img2pdf` is not able to directly embed binary images unless they are PNG, and the best possible space savings requires you to provide a binary PNG.

Ouch! Thanks a lot for bringing up this problem.
It might be possible to just add multi-strip support to `ccitt_payload_location_from_pil`, but a better solution would be to drop that hack altogether and do a CCITT Group4 encoding in a different way.

No problem!
Although, if you were able to read multi-strip TIFFs, would you be able to handle any TIFF Group4 encoded files without re-encoding them? As it is, the single strip check seems like a significant caveat to the README's claim that:
One other thought I had: in the interim, do you think it makes sense to remove the conversion to grayscale in the exception handler? That way the binary PNG format would be stored, which at least has better space savings than the grayscale.
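For anyone wanting to check whether a particular Group 4 TIFF is single-strip or multi-strip, the strip layout can be read straight from the file's first image file directory. The following is only a minimal sketch, not the code img2pdf uses: `read_strips` is a hypothetical helper, and it assumes a classic (non-BigTIFF) file whose StripOffsets (tag 273) and StripByteCounts (tag 279) entries are stored as SHORT or LONG values.

```python
import struct

# TIFF field types handled by this sketch: 3 = SHORT (2 bytes), 4 = LONG (4 bytes)
TYPE_FORMATS = {3: ("H", 2), 4: ("I", 4)}


def read_strips(data: bytes):
    """Return (offset, byte_count) pairs for every strip in the first IFD."""
    if data[:2] == b"II":
        endian = "<"  # little-endian file
    elif data[:2] == b"MM":
        endian = ">"  # big-endian file
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    (num_entries,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = {}
    for i in range(num_entries):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes long
        tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry)
        if tag not in (273, 279):  # StripOffsets, StripByteCounts
            continue
        fmt, size = TYPE_FORMATS[field_type]
        if count * size <= 4:
            value_pos = entry + 8  # values fit inline in the entry itself
        else:
            # otherwise the entry holds an offset to where the values live
            (value_pos,) = struct.unpack_from(endian + "I", data, entry + 8)
        tags[tag] = struct.unpack_from(f"{endian}{count}{fmt}", data, value_pos)
    return list(zip(tags[273], tags[279]))
```

A file with a single strip yields a one-element list; a result with more than one pair means the single-strip assumption does not hold for that file.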
Indeed I have never seen a multi-strip CCITT Group 4 image since I started writing img2pdf in 2012.
What do you mean? The PNG format doesn't support "binary" images. It supports grayscale with only black and white but those still take up one byte per pixel and not one bit per pixel before being passed to the paeth filter.
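To put rough numbers on the difference between one bit and one byte per pixel, here is a small sketch that computes uncompressed raster sizes only. It says nothing about what any encoder produces after filtering and compression; `raw_size` is a hypothetical helper, and the 1920×1280 dimensions are borrowed from the sample file name mentioned later in this thread.

```python
def raw_size(width: int, height: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes, with each row padded to a whole byte."""
    row_bytes = (width * bits_per_pixel + 7) // 8
    return row_bytes * height


one_bit = raw_size(1920, 1280, 1)   # bilevel: 240 bytes per row
one_byte = raw_size(1920, 1280, 8)  # 8-bit grayscale: 1920 bytes per row
print(one_bit, one_byte, one_byte // one_bit)  # prints: 307200 2457600 8
```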
Ah, then the single strip limit is not as significant as I thought.
I should mention that I've got very little experience in this space; I assumed that multi-strip was common, since the spec says that:
Indeed, I've just tested with my Brother scanner, and it produced a single strip as you say! So much for the spec 😛.
I'm not sure how the underlying data representation changes, but my observation has been that converting a binary image to grayscale results in a larger output PDF, as shown in the table I shared above.
Here is some simple code to demonstrate what I'm observing.
I reported this as https://github.com/python-pillow/Pillow/issues/5740
Alternatively, if anybody knows of another group4 encoder, I'm all ears.
Hi,
I'd like to experiment with the code related to this issue, but apparently it's not covered by tests. Could you please share a sample file that passes the default strip size limit, or explain how I could find/create one?
Thanks!
Since Pillow 8.4.0, `TiffImagePlugin` has the attribute `STRIP_SIZE` with the default value of 65536. You can probably create such a TIFF file by setting `STRIP_SIZE` to a value above 65536 and then saving the TIFF.

I also should amend the changes I made in 6eec05c11c to make use of `TiffImagePlugin.STRIP_SIZE` instead of monkey patching `TiffImagePlugin.ImageFileDirectory_v2.__getitem__`, which is only necessary for Pillow 8.3.x.

If you could contribute code creating a test case that fails without 6eec05c11c, that would be great and would allow me to implement the `STRIP_SIZE`-based method that works for Pillow 8.4.0 and later.

Thanks!
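As a rough guide to how large `STRIP_SIZE` has to be for a given image, the arithmetic can be sketched as below. This is an estimate under a stated assumption, namely that the writer packs as many whole uncompressed rows into a strip as fit within the limit; Pillow's actual logic may differ in details. `estimate_strips` is a hypothetical helper, not part of Pillow or img2pdf.

```python
def estimate_strips(width, height, bits_per_pixel=1, strip_size=65536):
    """Estimate (rows_per_strip, strip_count) for a given strip size limit.

    Assumes strips are sized by uncompressed bytes and hold whole rows only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    stride = (width * bits_per_pixel + 7) // 8  # uncompressed bytes per row
    rows_per_strip = max(1, min(strip_size // stride, height))
    strip_count = (height + rows_per_strip - 1) // rows_per_strip
    return rows_per_strip, strip_count


# A 1920x1280 bilevel image under the default 64 KiB limit
print(estimate_strips(1920, 1280))
# The same image once the limit covers the whole uncompressed raster
print(estimate_strips(1920, 1280, strip_size=240 * 1280))
```

Under this assumption, a limit at least as large as the uncompressed raster (stride times height) gives a single strip, while the default limit splits a 1920×1280 bilevel image into several.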
Thanks for the quick response! I'll take a look.
Ok, I've created a test file (attached, derived from `sample_1920×1280.tiff`).

When removing the `__getitem__` override workaround, the 1 mode image is converted to L:

However, it seems to work correctly when adding this line:
Suppose the following diff:
Using your `long_strip.tiff` as input and then running:

The resulting PDF files are bit-by-bit identical no matter whether the first or the second branch of the `if` is used.

Thanks for confirming!