Add support for JBIG2 (generic coding) #184
No reviewers
Labels
No labels
bug
duplicate
enhancement
forwarded
help wanted
invalid
question
wontfix
No milestone
No project
No assignees
4 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: josch/img2pdf#184
Loading…
Reference in a new issue
No description provided.
Delete branch "ooBJ3u/img2pdf:main"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Implements the proposal detailed at #112 (comment).
This is a limited implementation of JBIG2, which can be extended to support multiple pages, symbol tables, and other features of the format in the future.
To test, I included a test fixture. You can also download 042.bmp (the same one as @josch already downloaded in #112 (comment) from https://git.ghostscript.com/?p=tests.git;a=blob_plain;f=jbig2/042.bmp;hb=HEAD and run the following command:
This results in a small PDF, just as @josch originally found in the comment mentioned above.
This is my first contribution to this repository so let me know if something else is needed. Thanks for a great library!
154a61a88f
toee42963164
Wow, thank you! I read through your diff without trying it out yet and it looks really, really good!
My biggest gripe right now is src/tests/input/042.jb2. Why did you use the scan of a page instead the "TEST" image used in the other test cases? One problem with using a "real" test image like you chose in form of the scan of a page is the copyright situation. Even if that page is available in the public domain (is it?) you have to now write that down and keep track of this somewhere.
No problem, I'll swap it out.
2c00f3b66b
tob23d82c45e
Fixed in
085dd192f6
.@ -1820,7 +1842,41 @@ def read_images(
if rawdata[:12] == b"\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A":
# image is jpeg2000
imgformat = ImageFormat.JPEG2000
if rawdata[:14].lower() == b"id=imagemagick":
I wasn't sure why this was
if
instead ofelif
. Won't that make it so JPEG2000 still crashes? I fixed it but wanted to double-check.@ -33,12 +33,14 @@ input file format and image color space.
| JPEG2000 | any | direct |
| PNG (non-interlaced, no transparency) | any | direct |
| TIFF (CCITT Group 4) | monochrome | direct |
| JBIG2 (single-page generic coding) | bi-level | direct |
the other entries seem to use the term
monochrome
for 1 bit per pixel images.Monochrome is also often used for greyscale images, however. See e.g. https://en.wikipedia.org/wiki/Monochrome
Bi-level is pretty standard terminology, though "binary images" or perhaps even "1-bit images" might be clearer. https://en.wikipedia.org/wiki/Binary_image
I'm fine with choosing another term, all I mean is the table should be consistent.
Apologies for the delay. I've updated the README to consistently say "1-bit monochrome" (to differentiate it from the other meaning of "grayscale"). Does this look good?
@josch Would you like to have another look at this? All comments should be addressed now.
Nice!
I have a question. Why does this happen:
Apologies for the delay.
jbigtopnm
only supports JBIG1, not JBIG2. JBIG1 is still used in fax machines, but is not supported by PDF, so not too relevant for us.@josch is this still blocked on anything? Anything I can do to get it merged?
Thank you for the ping and sorry to not have come back to you earlier. If in doubt, please feel free to ping me until I explicitly say otherwise. There are a lot of FOSS projects I'm taking are of and unfortunately, I haven't spent as much time on img2pdf recently as I should've. This is also due to my frustration with imagemagick which @gms recently summarized well in #204. It is very tiring of trying to do the right thing and being backwards compatible with old versions and then being shot in the foot by another imagemagick change which breaks behaviour...
In any case, your MR looks good. I'd just like to to squash commit 8c5541f41728d8ec87bef49f7f7134a4727a157a into 4901fa202e32d7dc2afb964a09b603257fa1bbe1 because otherwise, your file
src/tests/input/042.jb2
will be part of the git history and as we discussed this is problematic due to copyright reasons.Thanks!
May I ask the same question regarding MRs #200, #201, #202, #203?
Take your time, just wondering if there's anything left to do from my side? Sorry for the noise...
150a23169b
toe2369eb59a
@josch I know just what you mean, which is why this also took me a while, sorry about that! I've squashed all commits into one, and updated the commit message to reflect all the changes that we made. Should be ready to merge now! Thanks again for such a fantastic project.
FWIW, I tested the changes in this MR by converting some TIFF files to JBIG2 (with
jbig2 file.tif > file.jb2
) and running img2pdf on them, and the resulting PDF seems to work just fine.However, I'm noticing a potential problem when extracting the images...
If using
pdfimages -tiff ...
orpdfimages -png ...
(from poppler) to extract/convert one or more of the JBIG2 files from the PDF, the tool complains with:Though note that
pdfimages
exits with status code 0, and the TIFF or PNG files appear valid. I did the same extraction test on a PDF (with JBIG2 images) created by another tool, andpdfimages
does not print the error message, so maybe poppler sees something slightly off with the embedded JBIG2 files but it's not fatal to the conversion...Without reading much about the JBIG2 format, I did notice from the
jbig2enc
man page that it supports a-p/--pdf
flag. The JBIG2 support as implemented in this MR does not appear to support JBIG2 files created with that flag set. Could this be related to the syntax error poppler complains about?@phmccarty that sounds like a bug in
pdfimages
. I've created and distributed many files with this MR, and they all open properly in PDF readers, and haven't heard any issues from users with opening the files, so I'm fairly confident that the format is correct. In fact, we don't modify the raw output fromjbig2
at all, except stripping the file header, in accordance with the pdf spec.I suspect that
pdfimages
is missing a segment type case, because JBIG2 is a very extensive format, and they might have not fully implemented it?I haven't encountered any issues with PDFs opening either, thankfully... Just the poppler (non-fatal) complaint about syntax of the stream.
I took a closer look at the difference in output between files generated with
jbig2 ...
andjbig2 --pdf ...
. The latter output omits the file header (13 bytes), and also omits a few more bytes at the end of the file, which appear to be the end-of-page and end-of-file segments.Perhaps those two segments should also be excluded? From the PDF spec I see it documented as:
That could explain it. Do you happen to have the byte strings that should be removed at the end handy? We can check to make sure those are present, and then remove those in that case.
Sure, here are the trailing 22 bytes in hexdump format:
I ran some more side-by-side comparisons of the
jbig2 ...
andjbig2 --pdf
output for various sizes of input images, extracting trailing bytes, and the trailing bytes appear to be identical, at least for the tests I ran.If there's a chance the bytes in these trailing segments can change, I suppose the jbig2enc source code will provide hints...
Apologies again for the delay.
@phmccarty I fixed your issue in
244600065d
— see the commit message for more details.@ooBJ3u Thanks! I re-tested with your latest commit, and I no longer see that syntax error.
Great news!
@josch this PR should now really-really be ready to merge then, when you have a moment. 🙏
@ -1824,0 +1876,4 @@
raise ImageOpenError(
"Unsupported JBIG2 format; only single-page generic coding is supported (e.g. from `jbig2enc`)."
)
if rawdata[-22:] != b"\x00\x00\x00\x021\x00\x01\x00\x00\x00\x00\x00\x00\x00\x033\x00\x01\x00\x00\x00\x00":
One question about the style of the code here:
Do you think it would be better to use hex instead of the character value for the 5th and 16th bytes (
1
and3
)? IMO, I like that better for the consistency, but because the value is accurate as-is, I'm not opposed to leaving it.This is how Python prints the bytearray by default, so I figured that is fine.
Okay, that makes sense then. No objection from me.
Changes look good to me
View command line instructions
Checkout
From your project repository, check out a new branch and test the changes.Merge
Merge the changes and update on Forgejo.