Hi, I am using img2pdf with jp2 images. My input is not intended to have alpha channel, but sometimes one slips through that does.
If I try img2pdf (0.4.4) on it, I get an error "This function must not be called on images with alpha"
If I understand right, img2pdf can now handle images with alpha for other formats, just not jp2? I'm not sure if it does so by preserving the alpha, or flattening it, or what.
Is there any chance it could, even with an opt-in argument, do something for jp2 with alpha, even if that means flattening over a white bg? If not actualy do the same thing it does for other formats (are there limits in OpenJPEG or whatever other library is used for jp2? Are there issues that could be filed against those libraries?)
And, otherwise, would it be helpful to change this error message to be clear that it only applies to jp2, it's not universally true that img2pdf can't handle alpha?
Hi, I am using img2pdf with jp2 images. My input is not intended to have alpha channel, but sometimes one slips through that does.
If I try `img2pdf` (0.4.4) on it, I get an error "This function must not be called on images with alpha"
If I understand right, img2pdf can now handle images with alpha for _other_ formats, just not jp2? I'm not sure if it does so by preserving the alpha, or flattening it, or what.
Is there any chance it could, even with an opt-in argument, do _something_ for jp2 with alpha, even if that means flattening over a white bg? If not actualy do the same thing it does for other formats (are there limits in OpenJPEG or whatever other library is used for jp2? Are there issues that could be filed against those libraries?)
And, otherwise, would it be helpful to change this error message to be clear that it only applies to jp2, it's not universally true that img2pdf can't handle alpha?
Ooh, that's way better response than I hoped, I thought I was seeing expected behavior! I'm glad I filed it, I almost didn't file it, worried I was being annoying! Thank you so much for your quick response!
Yes, I easily can give you an example, thank you!
I'm not sure if I can attach a JPEG2000 here in this ticket without it being altered, but I'll try that first, attached.
This jp2 was created from a TIFF source (that had an alpha), using the vips package, with a command like:
Ooh, that's way better response than I hoped, I thought I was seeing expected behavior! I'm glad I filed it, I almost didn't file it, worried I was being annoying! Thank you so much for your quick response!
Yes, I easily can give you an example, thank you!
I'm not sure if I can attach a JPEG2000 here in this ticket without it being altered, but I'll try that first, attached.
This jp2 was created from a TIFF source (that had an alpha), using the vips package, with a command like:
vipsthumbnail tiff-with-alpha.tiff --size 1200x65500 --export-profile srgb -o jp2-with-alpha.jp2[Q=40,subsample-mode=off]
Yes, after downloading the attachment above, it does still reproduce for me.
jrochkind-shi scihist_digicoll (master ?) $ img2pdf --version
img2pdf 0.4.4
jrochkind-shi scihist_digicoll (master ?) $ img2pdf /Users/jrochkind/Downloads/jp2-with-alpha.jp2 -o output.pdf
error: This function must not be called on images with alpha
Yes, after downloading the attachment above, it does still reproduce for me.
```
jrochkind-shi scihist_digicoll (master ?) $ img2pdf --version
img2pdf 0.4.4
jrochkind-shi scihist_digicoll (master ?) $ img2pdf /Users/jrochkind/Downloads/jp2-with-alpha.jp2 -o output.pdf
error: This function must not be called on images with alpha
```
This patch should fix your problem. Can you confirm?
diff --git a/src/img2pdf.py b/src/img2pdf.py
index 06f2e7b..99ca557 100755
--- a/src/img2pdf.py
+++ b/src/img2pdf.py
@@ -827,8 +827,10 @@ class pdfdoc(object):
artborder=None,
iccp=None,
):
- assert (color != Colorspace.RGBA and color != Colorspace.LA) or (
- imgformat == ImageFormat.PNG and smaskdata is not None
+ assert (
+ (color != Colorspace.RGBA and color != Colorspace.LA)
+ or (imgformat == ImageFormat.PNG and smaskdata is not None)
+ or imgformat == ImageFormat.JPEG2000
)
if self.engine == Engine.pikepdf:
@@ -1312,7 +1314,7 @@ def get_imgmetadata(
ics = imgdata.mode
# GIF and PNG files with transparency are supported
- if (imgformat == ImageFormat.PNG or imgformat == ImageFormat.GIF) and (
+ if imgformat in [ImageFormat.PNG, ImageFormat.GIF, ImageFormat.JPEG2000] and (
ics in ["RGBA", "LA"] or "transparency" in imgdata.info
):
# Must check the IHDR chunk for the bit depth, because PIL would lossily
@@ -1810,7 +1812,7 @@ def read_images(
raise JpegColorspaceError("jpeg can't be monochrome")
if color == Colorspace["P"]:
raise JpegColorspaceError("jpeg can't have a color palette")
- if color == Colorspace["RGBA"]:
+ if color == Colorspace["RGBA"] and imgformat != ImageFormat.JPEG2000:
raise JpegColorspaceError("jpeg can't have an alpha channel")
logger.debug("read_images() embeds a JPEG")
cleanup()
This patch should fix your problem. Can you confirm?
```patch
diff --git a/src/img2pdf.py b/src/img2pdf.py
index 06f2e7b..99ca557 100755
--- a/src/img2pdf.py
+++ b/src/img2pdf.py
@@ -827,8 +827,10 @@ class pdfdoc(object):
artborder=None,
iccp=None,
):
- assert (color != Colorspace.RGBA and color != Colorspace.LA) or (
- imgformat == ImageFormat.PNG and smaskdata is not None
+ assert (
+ (color != Colorspace.RGBA and color != Colorspace.LA)
+ or (imgformat == ImageFormat.PNG and smaskdata is not None)
+ or imgformat == ImageFormat.JPEG2000
)
if self.engine == Engine.pikepdf:
@@ -1312,7 +1314,7 @@ def get_imgmetadata(
ics = imgdata.mode
# GIF and PNG files with transparency are supported
- if (imgformat == ImageFormat.PNG or imgformat == ImageFormat.GIF) and (
+ if imgformat in [ImageFormat.PNG, ImageFormat.GIF, ImageFormat.JPEG2000] and (
ics in ["RGBA", "LA"] or "transparency" in imgdata.info
):
# Must check the IHDR chunk for the bit depth, because PIL would lossily
@@ -1810,7 +1812,7 @@ def read_images(
raise JpegColorspaceError("jpeg can't be monochrome")
if color == Colorspace["P"]:
raise JpegColorspaceError("jpeg can't have a color palette")
- if color == Colorspace["RGBA"]:
+ if color == Colorspace["RGBA"] and imgformat != ImageFormat.JPEG2000:
raise JpegColorspaceError("jpeg can't have an alpha channel")
logger.debug("read_images() embeds a JPEG")
cleanup()
```
+ assert (
+ color not in (Colorspace.RGBA, Colorspace.LA)
+ or (imgformat in (ImageFormat.PNG, ImageFormat.JPEG2000) and smaskdata is not None)
)
instead of
+ assert (
+ (color != Colorspace.RGBA and color != Colorspace.LA)
+ or (imgformat == ImageFormat.PNG and smaskdata is not None)
+ or imgformat == ImageFormat.JPEG2000
as a style improvement?
How about
```patch
+ assert (
+ color not in (Colorspace.RGBA, Colorspace.LA)
+ or (imgformat in (ImageFormat.PNG, ImageFormat.JPEG2000) and smaskdata is not None)
)
```
instead of
```patch
+ assert (
+ (color != Colorspace.RGBA and color != Colorspace.LA)
+ or (imgformat == ImageFormat.PNG and smaskdata is not None)
+ or imgformat == ImageFormat.JPEG2000
```
as a style improvement?
I'm not much of a pythonista, nor am I used to dealing with patch files, it may take me a bit of time to figure out how to apply the patch locally.
I'm on leave tomorrow, but I'll look at this next week and do my best to figure out how to test it (hints welcome).
Very much appreciate the quick work! This will be super helpful if it allows the PDF to be generated from these JP2's!
Thank you!
I'm not much of a pythonista, nor am I used to dealing with patch files, it may take me a bit of time to figure out how to apply the patch locally.
I'm on leave tomorrow, but I'll look at this next week and do my best to figure out how to test it (hints welcome).
Very much appreciate the quick work! This will be super helpful if it allows the PDF to be generated from these JP2's!
@mara0004 replacing color != Colorspace.RGBA and color != Colorspace.LA with color not in (Colorspace.RGBA, Colorspace.LA) is a good improvement. The other though is incorrect as no smask will be generated for a JPEG2000.
> How about
@mara0004 replacing `color != Colorspace.RGBA and color != Colorspace.LA` with `color not in (Colorspace.RGBA, Colorspace.LA)` is a good improvement. The other though is incorrect as no smask will be generated for a JPEG2000.
I'm not much of a pythonista, nor am I used to dealing with patch files, it may take me a bit of time to figure out how to apply the patch locally.
After cloning the git, you can run this to apply the patch locally:
git apply /path/to/patch
Then you can run img2pdf directly from your clone by executing
src/img2pdf.py input.jp2 -o output.pdf
> I'm not much of a pythonista, nor am I used to dealing with patch files, it may take me a bit of time to figure out how to apply the patch locally.
After cloning the git, you can run this to apply the patch locally:
git apply /path/to/patch
Then you can run img2pdf directly from your clone by executing
src/img2pdf.py input.jp2 -o output.pdf
I'm sorry, I'm not trying to get basic python getting started help from you, but when I try running the patched img2pdf directly from clone, I get:
Traceback (most recent call last):
File "/Users/jrochkind/code/img2pdf/src/img2pdf.py", line 25, in <module>
from PIL import Image, TiffImagePlugin, GifImagePlugin, ImageCms
ModuleNotFoundError: No module named 'PIL'
It's not finding my existing pip dependencies I guess. (I am on MacOS, which doesn't help).
I will work on it more next week -- but also if you were able to reproduce my problem with that file, and the patch resolves it for you with that file, it does seem to me too safe to determine you have resolved it.
I appreciate it!
I'm sorry, I'm not trying to get basic python getting started help from you, but when I try running the patched img2pdf directly from clone, I get:
```
Traceback (most recent call last):
File "/Users/jrochkind/code/img2pdf/src/img2pdf.py", line 25, in <module>
from PIL import Image, TiffImagePlugin, GifImagePlugin, ImageCms
ModuleNotFoundError: No module named 'PIL'
```
It's not finding my existing pip dependencies I guess. (I am on MacOS, which doesn't help).
I will work on it more next week -- but also if you were able to reproduce my problem with that file, and the patch resolves it for you with that file, it does seem to me too safe to determine you have resolved it.
I appreciate it!
On what schedule would you anticipate a release including this fix?
Actually, I figured it out.
And have verified it produced output pdf that looks right, without error!
Awesome!
```
$ pip3 install image
$ python3 src/img2pdf.py jp2-with-alpha.jp2 -o output.pdf
```
Thank you!
On what schedule would you anticipate a release including this fix?
I can make a new release without #159 and #152 fixed but I need to create a unit test for transparent jpeg2000 images. If you have any idea for a solution I'm all ears.
> On what schedule would you anticipate a release including this fix?
Depends. If possible I'd like to hear from @monobot on #159 and from @gms on #152.
I'm also having trouble writing a test case for jpeg2000 images with transparency because of https://stackoverflow.com/questions/76833368/how-to-use-imagemagick-compare-so-that-the-order-does-not-matter
I can make a new release without #159 and #152 fixed but I need to create a unit test for transparent jpeg2000 images. If you have any idea for a solution I'm all ears.
@mara0004 replacing color != Colorspace.RGBA and color != Colorspace.LA with color not in (Colorspace.RGBA, Colorspace.LA) is a good improvement. The other though is incorrect as no smask will be generated for a JPEG2000.
Yeah, sorry. I should have looked more thoroughly before commenting.
Came to my mind too during the night, for if we split an alpha channel off a JPX image, we would lose the JPX encoding, which would increase file size and make it impossible to reconstruct the original image data.
I was on the wrong track because, for the other formats, transparency is only supported through masks. But great if PDF supports transparency natively for JPX.
> @mara0004 replacing color != Colorspace.RGBA and color != Colorspace.LA with color not in (Colorspace.RGBA, Colorspace.LA) is a good improvement. The other though is incorrect as no smask will be generated for a JPEG2000.
Yeah, sorry. I should have looked more thoroughly before commenting.
Came to my mind too during the night, for if we split an alpha channel off a JPX image, we would lose the JPX encoding, which would increase file size and make it impossible to reconstruct the original image data.
I was on the wrong track because, for the other formats, transparency is only supported through masks. But great if PDF supports transparency natively for JPX.
Yeah, sorry. I should have looked more thoroughly before commenting.
No worries! Better a few comments that have issues than not giving input at all. I like it when people would tell me about small improvements like this. The code is getting very old and while working on this issue I discovered even more problematic cases which work in practice but are wrong anyways:
@@ -903,12 +911,12 @@ class pdfdoc(object):
colorspace = [PdfName.ICCBased, iccpdict]
# either embed the whole jpeg or deflate the bitmap representation
- if imgformat is ImageFormat.JPEG:
+ if imgformat == ImageFormat.JPEG:
ofilter = PdfName.DCTDecode
- elif imgformat is ImageFormat.JPEG2000:
+ elif imgformat == ImageFormat.JPEG2000:
ofilter = PdfName.JPXDecode
self.output_version = "1.5" # jpeg2000 needs pdf 1.5
- elif imgformat is ImageFormat.CCITTGroup4:
+ elif imgformat == ImageFormat.CCITTGroup4:
ofilter = [PdfName.CCITTFaxDecode]
else:
ofilter = PdfName.FlateDecode
> Yeah, sorry. I should have looked more thoroughly before commenting.
No worries! Better a few comments that have issues than not giving input at all. I like it when people would tell me about small improvements like this. The code is getting very old and while working on this issue I discovered even more problematic cases which work in practice but are wrong anyways:
```patch
@@ -903,12 +911,12 @@ class pdfdoc(object):
colorspace = [PdfName.ICCBased, iccpdict]
# either embed the whole jpeg or deflate the bitmap representation
- if imgformat is ImageFormat.JPEG:
+ if imgformat == ImageFormat.JPEG:
ofilter = PdfName.DCTDecode
- elif imgformat is ImageFormat.JPEG2000:
+ elif imgformat == ImageFormat.JPEG2000:
ofilter = PdfName.JPXDecode
self.output_version = "1.5" # jpeg2000 needs pdf 1.5
- elif imgformat is ImageFormat.CCITTGroup4:
+ elif imgformat == ImageFormat.CCITTGroup4:
ofilter = [PdfName.CCITTFaxDecode]
else:
ofilter = PdfName.FlateDecode
```
@josch Thank you. May I ask the next possibly silly question: Why do you consider is wrong and == correct? Is it just about consistency?
I think the official enum howto actually recommends comparing to enum attributes by identity, and PEP 8 also says singletons should be compared by identity, not by value (some say it is more idiomatic).
I personally don't like == for enums because it suggests (myenum.ATTR == myenum.ATTR.value) is True, but in fact it is False ("Comparisons against non-enumeration values will always compare not equal").
@josch Thank you. May I ask the next possibly silly question: Why do you consider `is` wrong and `==` correct? Is it just about consistency?
I think the [official enum howto](https://docs.python.org/3/howto/enum.html#comparisons) actually recommends comparing to enum attributes by identity, and PEP 8 also says singletons should be compared by identity, not by value (some say it is more idiomatic).
I personally don't like `==` for enums because it suggests `(myenum.ATTR == myenum.ATTR.value) is True`, but in fact it is `False` ("Comparisons against non-enumeration values will always compare not equal").
Aha, so past-me was smart and knew how to correctly compare enums and present-me forgot about it. Thanks! I'll adjust the code such that enums are always compared by identity.
Today we learned that silly messages can be very useful. :)
Aha, so past-me was smart and knew how to correctly compare enums and present-me forgot about it. Thanks! I'll adjust the code such that enums are always compared by identity.
Today we learned that silly messages can be very useful. :)
@mara0004 things are a bit more complicated concerning the enum comparison. There are good reasons to use the is operator instead of ==. But above you also suggested to use:
imgformat in (ImageFormat.PNG, ImageFormat.JPEG2000)
But the in operator does not check for identity but value, so it uses the slightly wrong comparison for enums. So either this should be re-written to say:
imgformat is ImageFormat.PNG or imgformat is ImageFormat.JPEG2000
or, for longer lists:
any({imgformat is v for v in [ImageFormat.PNG, ImageFormat.JPEG2000]})
This is suboptimal. :(
EDIT Furthermore, I'd also like to switch a few pieces of code to using match/case from PEP636 but when comparing enums, it uses == instead of is (you can check this by overriding Enum.__eq__). So even if I switch all cases where I compare two enums against each other from == to is, there will always remain cases where enums are compared using == -- either because of the in operator or (in the future) because of match/case. So I create a situation in which comparison is inconsistent. This bugs me. Either I always use one or always the other but there is no benefit of mixing both.
@mara0004 things are a bit more complicated concerning the enum comparison. There are good reasons to use the `is` operator instead of `==`. But above you also suggested to use:
imgformat in (ImageFormat.PNG, ImageFormat.JPEG2000)
But the `in` operator does not check for identity but value, so it uses the slightly wrong comparison for enums. So either this should be re-written to say:
imgformat is ImageFormat.PNG or imgformat is ImageFormat.JPEG2000
or, for longer lists:
any({imgformat is v for v in [ImageFormat.PNG, ImageFormat.JPEG2000]})
This is suboptimal. :(
**EDIT** Furthermore, I'd also like to switch a few pieces of code to using match/case from PEP636 but when comparing enums, it uses `==` instead of `is` (you can check this by overriding `Enum.__eq__`). So even if I switch all cases where I compare two enums against each other from `==` to `is`, there will always remain cases where enums are compared using `==` -- either because of the `in` operator or (in the future) because of match/case. So I create a situation in which comparison is inconsistent. This bugs me. Either I always use one or always the other but there is no benefit of mixing both.
Hmm, yes, that is correct, in and match/case will use ==, but I suppose Enum.__eq__ just maps to an identity comparison. With that implementation, both patterns are exchangeable, it is only a question of style. (To guarantee this on the caller side, one could also subclass Enum and override __eq__.)
Personally, I don't think it a problem to use is for explicit comparisons while implicit comparisons go through ==. The effect is the same, and in visible code is feels a bit more direct/idiomatic. Apparently an opinion shared by enum/pep8 officials.
However, I do see your point. It's two distinct code paths in principle, and as such could hypothetically behave differently. If this matters to you, consistently sticking with == is fine. That's a very reasonable argument.
Hmm, yes, that is correct, `in` and `match/case` will use `==`, but I suppose `Enum.__eq__` just maps to an identity comparison. With that implementation, both patterns are exchangeable, it is only a question of style. (To guarantee this on the caller side, one could also subclass Enum and override `__eq__`.)
Personally, I don't think it a problem to use `is` for explicit comparisons while implicit comparisons go through `==`. The effect is the same, and in visible code `is` feels a bit more direct/idiomatic. Apparently an opinion shared by enum/pep8 officials.
However, I do see your point. It's two distinct code paths in principle, and as such could hypothetically behave differently. If this matters to you, consistently sticking with `==` is fine. That's a very reasonable argument.
I forgot which python version added this. But it's older than match/case, if you intend to require python >= 3.10 anyway, then it should be fine.
EDIT: On the other hand, if returned parameters change, this can potentially capture a different variable than desired. So I'm not sure if it's advisable, although it looks nicer at first glance.
BTW, a nit suggestion on commit [acc25a4](https://gitlab.mister-muffin.de/josch/img2pdf/commit/acc25a49265effbbffa36e053ae2a3aa633eddbf):
```
_, _, _, _, _, _, depth = parsejp2(rawdata)
```
can be simplified to
```
*_, depth = parsejp2(rawdata)
```
I forgot which python version added this. But it's older than `match/case`, if you intend to require python >= 3.10 anyway, then it should be fine.
**EDIT**: On the other hand, if returned parameters change, this can potentially capture a different variable than desired. So I'm not sure if it's advisable, although it looks nicer at first glance.
Indeed I forgot about that syntax. It looks funny but it's the same mechanism as when writing
def myfun(foo, bar, *args):
[...]
And *args will catch any additional positional arguments. The underscore is just a placeholder variable name.
I tried this in Python 3.4 from 2014 and it works fine so nothing new about this syntax. I'll use it right away.
Indeed I forgot about that syntax. It looks funny but it's the same mechanism as when writing
def myfun(foo, bar, *args):
[...]
And `*args` will catch any additional positional arguments. The underscore is just a placeholder variable name.
I tried this in Python 3.4 from 2014 and it works fine so nothing new about this syntax. I'll use it right away.
@jrochkind you asked for an estimation for a release including this fix. It turns out that the unit test I wrote for JPEG2000 with alpha channel fails on Fedora. So I cannot release until I figure out what is going on there...
@jrochkind you asked for an estimation for a release including this fix. It turns out that the unit test I wrote for JPEG2000 with alpha channel fails on Fedora. So I cannot release until I figure out what is going on there...
Thanks @josch! I didn't mean to pressure or anything, just plain curious about your plans, since I'm unfamiliar with this project's release practices!
I am sorry the tests are giving you trouble. I wanted to find some time this week to try to contribute to tests, but I'm not sure I'm going to have much to offer on mysterious fedora-specific errors.
I appreciate your work on this tool, and your quick response to to this report, it is all very generous of you, this is an amazing tool. I work in non-profit academic cultural heritage area.
Thanks @josch! I didn't mean to pressure or anything, just plain curious about your plans, since I'm unfamiliar with this project's release practices!
I am sorry the tests are giving you trouble. I wanted to find some time this week to try to contribute to tests, but I'm not sure I'm going to have much to offer on mysterious fedora-specific errors.
I appreciate your work on this tool, and your quick response to to this report, it is all very generous of you, this is an amazing tool. I work in non-profit academic cultural heritage area.
You are not giving me trouble -- quite the opposite! Thanks to you, img2pdf now supports JPEG2000 with transparency. 😄
If your organization is fine with that, maybe I can add it to a new section in the README listing some notable parties that use img2pdf? I've already had contact with libraries or universities that use img2pdf for things like turning scans into PDF for long-term archival storage of historical documents. It's pretty awesome to hear about these things! ❤️
You are not giving me trouble -- quite the opposite! Thanks to you, img2pdf now supports JPEG2000 with transparency. 😄
If your organization is fine with that, maybe I can add it to a new section in the README listing some notable parties that use img2pdf? I've already had contact with libraries or universities that use img2pdf for things like turning scans into PDF for long-term archival storage of historical documents. It's pretty awesome to hear about these things! :heart:
I am trying to use an unreleased git copy of img2pdf to get the alpha fix temporarily... which seems to work mostly... but is oddly failing only on our Github Actions CI server, with weird error no width in jp2 header for some of our test images. But I can't reproduce that error using the same commit from unreleased main branch anywhere else but Github Actions, it does not really seem to be a problem with unreleased main vs 0.4.4... very strange.
Rather than try to debug a very mysterious problem that may just be an artifact of temporarily trying to use an unreleased git commit in requirements.txt, I may just wait until a release.
Thanks @josch!
We are not live yet using img2pdf. But plan to use img2pdf to help power our PDF downloads at https://digital.sciencehistory.org.
I am trying to use an unreleased git copy of img2pdf to get the alpha fix temporarily... which seems to work mostly... but is oddly failing only on our Github Actions CI server, with weird error `no width in jp2 header` for some of our test images. But I can't reproduce that error using the same commit from unreleased main branch anywhere else but Github Actions, it does not *really* seem to be a problem with unreleased `main` vs 0.4.4... *very strange*.
Rather than try to debug a very mysterious problem that may just be an artifact of temporarily trying to use an unreleased git commit in `requirements.txt`, I may just wait until a release.
Hrm... I had to touch the code around the "no width in jp2 header" error message when I implemented support for your images because not only did your image have transparency, it also was 16 bits per pixel instead of the usual 8. If you are able to share the image that gives you trouble I can also take a look.
Hrm... I had to touch the code around the "no width in jp2 header" error message when I implemented support for your images because not only did your image have transparency, it also was 16 bits per pixel instead of the usual 8. If you are able to share the image that gives you trouble I can also take a look.
Oh wow, I actually hadn't noticed that original was 16 bits per pixel either. That is also not normally what we intend.
But when you have 100k plus digitizations of historical material done by different staff over a decade, weird things can pile up, mistakes can be made. There are always a few images in our collection of 100K that are weird mistakes on some dimension.
The current image that is giving me trouble is just a simple TEST image.
OK, I have reproduced it locally, this may indeed be a new bug. Let me know if you want me to file a separate issue?
This is a WEIRD file, since it's part of our test suite, it's not a real file. (it was created dynamically in our test suite, from a miniaturized TIFF version of a real file).
jrochkind-shi img2pdf (main ?) $ ./src/img2pdf.py ~/code/scihist_digicoll/BAD_WIDTH.jp2 -o out.pdf
error: no width in jp2 header
Not sure if there's really something wrong with that jp2? I don't know if our code for deriving it from a TIFF could be bad, or the original test suite TIFF could be bad, very curious what you have to say about this!
Oh wow, I actually hadn't noticed that original was 16 bits per pixel either. That is also not normally what we intend.
But when you have 100k plus digitizations of historical material done by different staff over a decade, weird things can pile up, mistakes can be made. There are always a few images in our collection of 100K that are weird mistakes on some dimension.
The current image that is giving me trouble is just a simple TEST image.
OK, I have reproduced it locally, this may indeed be a new bug. Let me know if you want me to file a separate issue?
This is a WEIRD file, since it's part of our test suite, it's not a real file. (it was created dynamically in our test suite, from a miniaturized TIFF version of a real file).
Running from main branch at 2f736d78
```
jrochkind-shi img2pdf (main ?) $ ./src/img2pdf.py ~/code/scihist_digicoll/BAD_WIDTH.jp2 -o out.pdf
error: no width in jp2 header
```
Not sure if there's really something wrong with that jp2? I don't know if our code for deriving it from a TIFF could be bad, or the original test suite TIFF could be bad, very curious what you have to say about this!
It turns out, that this is a file without the jp2 metadata around it but just a raw jpeg2000 bitstream. I added rudimentary support for those in 09064e8e70. Tell me if this doesn't work for you.
It turns out, that this is a file without the jp2 metadata around it but just a raw jpeg2000 bitstream. I added rudimentary support for those in 09064e8e70a0549147db172eb5b122e50479fd36. Tell me if this doesn't work for you.
It turns out, that this is a file without the jp2 metadata around it but just a raw jpeg2000 bitstream
wow, thanks for the quick diagnosis -- and fix!
That is interesting, because it was definitely not intentional. (I don't know if I have a requirement for handling raw jpeg2000 streams, I didn't mean to provide one!)
That file was produced with vips, using a process that I think ordinarily creates full jp2 files
I want to report it to vips, can you suggest any method I can easily use to determine if a given file is a full jp2 or raw jpeg2000, to make my report as accurate as possible?
update: If I use the file command on my computer, it says JPEG 2000 Part 1 (JP2) for full jp2, and JPEG 2000 codestream for raw JPEG2000, at least for this example.
It's interesting that file was handled properly in 0.4.4 -- by accident I guess?
Confirmed that for me too it is handled properly at 09064e8e70
> It turns out, that this is a file without the jp2 metadata around it but just a raw jpeg2000 bitstream
wow, thanks for the quick diagnosis -- and fix!
That is interesting, because it was definitely not intentional. (I don't know if I have a requirement for handling raw jpeg2000 streams, I didn't mean to provide one!)
That file was produced with `vips`, using a process that I think ordinarily creates full jp2 files
* I want to report it to vips, can you suggest any method I can easily use to determine if a given file is a full jp2 or raw jpeg2000, to make my report as accurate as possible?
* update: If I use the `file` command on my computer, it says `JPEG 2000 Part 1 (JP2)` for full jp2, and `JPEG 2000 codestream` for raw JPEG2000, at least for this example.
* It's interesting that file _was_ handled properly in 0.4.4 -- by accident I guess?
* Confirmed that for me too it is handled properly at 09064e8e70
I want to report it to vips, can you suggest any method I can easily use to determine if a given file is a full jp2 or raw jpeg2000, to make my report as accurate as possible?
I don't think there is anything to report. As far as I understood it, the raw jpeg2000 bytestream is a valid file. It's just missing the container but i have some image viewers installed that display such a file just fine so I guess it's okay?
The way to detect which is which is to look at the first four bytes of the file:
It's interesting that file was handled properly in 0.4.4 -- by accident I guess?
Yes, the commit that broke it was acc25a4926 so that one where I implemented support for transparent jpeg2000. The reason is that this commit also supports bit depth other than 8 and thus tries to parse the JPEG2000 file. This parsing did not happen before as the code assumed that all jpeg2000 files would be 8 bit.
> I want to report it to vips, can you suggest any method I can easily use to determine if a given file is a full jp2 or raw jpeg2000, to make my report as accurate as possible?
I don't think there is anything to report. As far as I understood it, the raw jpeg2000 bytestream is a valid file. It's just missing the container but i have some image viewers installed that display such a file just fine so I guess it's okay?
The way to detect which is which is to look at the first four bytes of the file:
https://gitlab.mister-muffin.de/josch/img2pdf/src/branch/main/src/jp2.py#L135
> It's interesting that file was handled properly in 0.4.4 -- by accident I guess?
Yes, the commit that broke it was acc25a49265effbbffa36e053ae2a3aa633eddbf so that one where I implemented support for transparent jpeg2000. The reason is that this commit also supports bit depth other than 8 and thus tries to parse the JPEG2000 file. This parsing did not happen before as the code assumed that all jpeg2000 files would be 8 bit.
I was using the vips tool intending to convert a TIFF to a .JP2, and the invocation I used normally outputs a .JP2, but in this case output a raw JPEG2000 stream -- at least that's what it looks like to me now, have to investigate more, that part is not an issue for img2pdf though!
Thank you, makes sense!
I was using the `vips` tool intending to convert a TIFF to a .JP2, and the invocation I used normally outputs a .JP2, but in this case output a raw JPEG2000 stream -- at least that's what it looks like to me now, have to investigate more, that part is not an issue for img2pdf though!
Hi, I am using img2pdf with jp2 images. My input is not intended to have alpha channel, but sometimes one slips through that does.
If I try
img2pdf
(0.4.4) on it, I get an error "This function must not be called on images with alpha"If I understand right, img2pdf can now handle images with alpha for other formats, just not jp2? I'm not sure if it does so by preserving the alpha, or flattening it, or what.
Is there any chance it could, even with an opt-in argument, do something for jp2 with alpha, even if that means flattening over a white bg? If not actualy do the same thing it does for other formats (are there limits in OpenJPEG or whatever other library is used for jp2? Are there issues that could be filed against those libraries?)
And, otherwise, would it be helpful to change this error message to be clear that it only applies to jp2, it's not universally true that img2pdf can't handle alpha?
This should not happen. Can you provide an example image so that I can reproduce your problem?
Ooh, that's way better response than I hoped, I thought I was seeing expected behavior! I'm glad I filed it, I almost didn't file it, worried I was being annoying! Thank you so much for your quick response!
Yes, I easily can give you an example, thank you!
I'm not sure if I can attach a JPEG2000 here in this ticket without it being altered, but I'll try that first, attached.
This jp2 was created from a TIFF source (that had an alpha), using the vips package, with a command like:
Yes, after downloading the attachment above, it does still reproduce for me.
This patch should fix your problem. Can you confirm?
How about
instead of
as a style improvement?
Thank you!
I'm not much of a pythonista, nor am I used to dealing with patch files, it may take me a bit of time to figure out how to apply the patch locally.
I'm on leave tomorrow, but I'll look at this next week and do my best to figure out how to test it (hints welcome).
Very much appreciate the quick work! This will be super helpful if it allows the PDF to be generated from these JP2's!
@mara0004 replacing
color != Colorspace.RGBA and color != Colorspace.LA
withcolor not in (Colorspace.RGBA, Colorspace.LA)
is a good improvement. The other though is incorrect as no smask will be generated for a JPEG2000.After cloning the git, you can run this to apply the patch locally:
Then you can run img2pdf directly from your clone by executing
I'm sorry, I'm not trying to get basic python getting started help from you, but when I try running the patched img2pdf directly from clone, I get:
It's not finding my existing pip dependencies I guess. (I am on MacOS, which doesn't help).
I will work on it more next week -- but also if you were able to reproduce my problem with that file, and the patch resolves it for you with that file, it does seem to me too safe to determine you have resolved it.
I appreciate it!
Yes, it works for me with the file you posted.
Actually, I figured it out.
And have verified it produced output pdf that looks right, without error!
Awesome!
Thank you!
On what schedule would you anticipate a release including this fix?
Depends. If possible I'd like to hear from @monobot on #159 and from @gms on #152.
I'm also having trouble writing a test case for jpeg2000 images with transparency because of https://stackoverflow.com/questions/76833368/how-to-use-imagemagick-compare-so-that-the-order-does-not-matter
I can make a new release without #159 and #152 fixed but I need to create a unit test for transparent jpeg2000 images. If you have any idea for a solution I'm all ears.
Yeah, sorry. I should have looked more thoroughly before commenting.
Came to my mind too during the night, for if we split an alpha channel off a JPX image, we would lose the JPX encoding, which would increase file size and make it impossible to reconstruct the original image data.
I was on the wrong track because, for the other formats, transparency is only supported through masks. But great if PDF supports transparency natively for JPX.
No worries! Better a few comments that have issues than not giving input at all. I like it when people would tell me about small improvements like this. The code is getting very old and while working on this issue I discovered even more problematic cases which work in practice but are wrong anyways:
Hi, I've replied on #159.
@josch Thank you. May I ask the next possibly silly question: Why do you consider
is
wrong and==
correct? Is it just about consistency?I think the official enum howto actually recommends comparing to enum attributes by identity, and PEP 8 also says singletons should be compared by identity, not by value (some say it is more idiomatic).
I personally don't like
==
for enums because it suggests(myenum.ATTR == myenum.ATTR.value) is True
, but in fact it isFalse
("Comparisons against non-enumeration values will always compare not equal").Aha, so past-me was smart and knew how to correctly compare enums and present-me forgot about it. Thanks! I'll adjust the code such that enums are always compared by identity.
Today we learned that silly messages can be very useful. :)
@mara0004 things are a bit more complicated concerning the enum comparison. There are good reasons to use the
is
operator instead of==
. But above you also suggested to use:But the
in
operator does not check for identity but value, so it uses the slightly wrong comparison for enums. So either this should be re-written to say:or, for longer lists:
This is suboptimal. :(
EDIT Furthermore, I'd also like to switch a few pieces of code to using match/case from PEP636 but when comparing enums, it uses
==
instead ofis
(you can check this by overridingEnum.__eq__
). So even if I switch all cases where I compare two enums against each other from==
tois
, there will always remain cases where enums are compared using==
-- either because of thein
operator or (in the future) because of match/case. So I create a situation in which comparison is inconsistent. This bugs me. Either I always use one or always the other but there is no benefit of mixing both.Hmm, yes, that is correct,
in
andmatch/case
will use==
, but I supposeEnum.__eq__
just maps to an identity comparison. With that implementation, both patterns are exchangeable, it is only a question of style. (To guarantee this on the caller side, one could also subclass Enum and override__eq__
.)Personally, I don't think it a problem to use
is
for explicit comparisons while implicit comparisons go through==
. The effect is the same, and in visible codeis
feels a bit more direct/idiomatic. Apparently an opinion shared by enum/pep8 officials.However, I do see your point. It's two distinct code paths in principle, and as such could hypothetically behave differently. If this matters to you, consistently sticking with
==
is fine. That's a very reasonable argument.BTW, a nit suggestion on commit acc25a4:
can be simplified to
I forgot which python version added this. But it's older than
match/case
, if you intend to require python >= 3.10 anyway, then it should be fine.EDIT: On the other hand, if returned parameters change, this can potentially capture a different variable than desired. So I'm not sure if it's advisable, although it looks nicer at first glance.
Indeed I forgot about that syntax. It looks funny but it's the same mechanism as when writing
And
*args
will catch any additional positional arguments. The underscore is just a placeholder variable name.I tried this in Python 3.4 from 2014 and it works fine so nothing new about this syntax. I'll use it right away.
@jrochkind you asked for an estimation for a release including this fix. It turns out that the unit test I wrote for JPEG2000 with alpha channel fails on Fedora. So I cannot release until I figure out what is going on there...
Thanks @josch! I didn't mean to pressure or anything, just plain curious about your plans, since I'm unfamiliar with this project's release practices!
I am sorry the tests are giving you trouble. I wanted to find some time this week to try to contribute to tests, but I'm not sure I'm going to have much to offer on mysterious fedora-specific errors.
I appreciate your work on this tool, and your quick response to to this report, it is all very generous of you, this is an amazing tool. I work in non-profit academic cultural heritage area.
You are not giving me trouble -- quite the opposite! Thanks to you, img2pdf now supports JPEG2000 with transparency. 😄
If your organization is fine with that, maybe I can add it to a new section in the README listing some notable parties that use img2pdf? I've already had contact with libraries or universities that use img2pdf for things like turning scans into PDF for long-term archival storage of historical documents. It's pretty awesome to hear about these things! ❤️
Thanks @josch!
We are not live yet using img2pdf. But plan to use img2pdf to help power our PDF downloads at https://digital.sciencehistory.org.
I am trying to use an unreleased git copy of img2pdf to get the alpha fix temporarily... which seems to work mostly... but is oddly failing only on our Github Actions CI server, with weird error
no width in jp2 header
for some of our test images. But I can't reproduce that error using the same commit from unreleased main branch anywhere else but Github Actions, it does not really seem to be a problem with unreleasedmain
vs 0.4.4... very strange.Rather than try to debug a very mysterious problem that may just be an artifact of temporarily trying to use an unreleased git commit in
requirements.txt
, I may just wait until a release.Hrm... I had to touch the code around the "no width in jp2 header" error message when I implemented support for your images because not only did your image have transparency, it also was 16 bits per pixel instead of the usual 8. If you are able to share the image that gives you trouble I can also take a look.
Oh wow, I actually hadn't noticed that original was 16 bits per pixel either. That is also not normally what we intend.
But when you have 100k plus digitizations of historical material done by different staff over a decade, weird things can pile up, mistakes can be made. There are always a few images in our collection of 100K that are weird mistakes on some dimension.
The current image that is giving me trouble is just a simple TEST image.
OK, I have reproduced it locally, this may indeed be a new bug. Let me know if you want me to file a separate issue?
This is a WEIRD file, since it's part of our test suite, it's not a real file. (it was created dynamically in our test suite, from a miniaturized TIFF version of a real file).
Running from main branch at
2f736d78
Not sure if there's really something wrong with that jp2? I don't know if our code for deriving it from a TIFF could be bad, or the original test suite TIFF could be bad, very curious what you have to say about this!
It turns out, that this is a file without the jp2 metadata around it but just a raw jpeg2000 bitstream. I added rudimentary support for those in
09064e8e70
. Tell me if this doesn't work for you.wow, thanks for the quick diagnosis -- and fix!
That is interesting, because it was definitely not intentional. (I don't know if I have a requirement for handling raw jpeg2000 streams, I didn't mean to provide one!)
That file was produced with
vips
, using a process that I think ordinarily creates full jp2 filesI want to report it to vips, can you suggest any method I can easily use to determine if a given file is a full jp2 or raw jpeg2000, to make my report as accurate as possible?
file
command on my computer, it saysJPEG 2000 Part 1 (JP2)
for full jp2, andJPEG 2000 codestream
for raw JPEG2000, at least for this example.It's interesting that file was handled properly in 0.4.4 -- by accident I guess?
Confirmed that for me too it is handled properly at
09064e8e70
I don't think there is anything to report. As far as I understood it, the raw jpeg2000 bytestream is a valid file. It's just missing the container but i have some image viewers installed that display such a file just fine so I guess it's okay?
The way to detect which is which is to look at the first four bytes of the file:
https://gitlab.mister-muffin.de/josch/img2pdf/src/branch/main/src/jp2.py#L135
Yes, the commit that broke it was
acc25a4926
so that one where I implemented support for transparent jpeg2000. The reason is that this commit also supports bit depth other than 8 and thus tries to parse the JPEG2000 file. This parsing did not happen before as the code assumed that all jpeg2000 files would be 8 bit.Thank you, makes sense!
I was using the
vips
tool intending to convert a TIFF to a .JP2, and the invocation I used normally outputs a .JP2, but in this case output a raw JPEG2000 stream -- at least that's what it looks like to me now, have to investigate more, that part is not an issue for img2pdf though!