Add support for JBIG2 (generic coding) #184

Open
ooBJ3u wants to merge 2 commits from ooBJ3u/img2pdf:main into main
First-time contributor

Implements the proposal detailed at #112 (comment).

This is a limited implementation of JBIG2, which can be extended to support multiple pages, symbol tables, and other features of the format in the future.

To test, I included a test fixture. You can also download 042.bmp (the same one as @josch already downloaded in #112 (comment) from https://git.ghostscript.com/?p=tests.git;a=blob_plain;f=jbig2/042.bmp;hb=HEAD and run the following command:

jbig2 042.bmp | img2pdf > 042.pdf

This results in a small PDF, just as @josch originally found in the comment mentioned above.

This is my first contribution to this repository so let me know if something else is needed. Thanks for a great library!

Implements the proposal detailed at https://gitlab.mister-muffin.de/josch/img2pdf/issues/112#issuecomment-1304. This is a limited implementation of JBIG2, which can be extended to support multiple pages, symbol tables, and other features of the format in the future. To test, I included a test fixture. You can also download 042.bmp (the same one as @josch already downloaded in https://gitlab.mister-muffin.de/josch/img2pdf/issues/112#issuecomment-307 from https://git.ghostscript.com/?p=tests.git;a=blob_plain;f=jbig2/042.bmp;hb=HEAD and run the following command: ```sh jbig2 042.bmp | img2pdf > 042.pdf ``` This results in a small PDF, just as @josch originally found in the comment mentioned above. This is my first contribution to this repository so let me know if something else is needed. Thanks for a great library!
ooBJ3u added 1 commit 2023-11-29 02:12:52 +00:00
Implements the proposal detailed at
#112 (comment)

This is a limited implementation of JBIG2, which can be extended to
support multiple pages, symbol tables, and other features of the format
in the future.

To test, I included a test fixture. You can also download 042.bmp (the same
one as @josch already downloaded in #112 (comment)
from https://git.ghostscript.com/?p=tests.git;a=blob_plain;f=jbig2/042.bmp;hb=HEAD
and run the following command:

  jbig2 042.bmp | img2pdf > 042.pdf

This results in a small PDF, just as @josch originally found in the
comment mentioned above.

This is my first contribution to this repository so let me know if
something else is needed. Thanks for a great library!
ooBJ3u force-pushed main from 154a61a88f to ee42963164 2023-11-29 02:27:47 +00:00 Compare
Owner

Wow, thank you! I read through your diff without trying it out yet and it looks really, really good!

My biggest gripe right now is src/tests/input/042.jb2. Why did you use the scan of a page instead the "TEST" image used in the other test cases? One problem with using a "real" test image like you chose in form of the scan of a page is the copyright situation. Even if that page is available in the public domain (is it?) you have to now write that down and keep track of this somewhere.

Wow, thank you! I read through your diff without trying it out yet and it looks really, really good! My biggest gripe right now is src/tests/input/042.jb2. Why did you use the scan of a page instead the "TEST" image used in the other test cases? One problem with using a "real" test image like you chose in form of the scan of a page is the copyright situation. Even if that page is available in the public domain (is it?) you have to now write that down and keep track of this somewhere.
Author
First-time contributor

No problem, I'll swap it out.

No problem, I'll swap it out.
ooBJ3u added 1 commit 2023-11-29 04:30:43 +00:00
This also uncovered a bug in jbig2enc where it uses the wrong unit
for resolution.
ooBJ3u force-pushed main from 2c00f3b66b to b23d82c45e 2023-11-29 04:33:20 +00:00 Compare
Author
First-time contributor

Fixed in 085dd192f6.

Fixed in 085dd192f6e14fe0d6384dc661e1e38794bb1507.
ooBJ3u reviewed 2023-11-29 04:39:40 +00:00
src/img2pdf.py Outdated
@ -1820,7 +1842,41 @@ def read_images(
if rawdata[:12] == b"\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A":
# image is jpeg2000
imgformat = ImageFormat.JPEG2000
if rawdata[:14].lower() == b"id=imagemagick":
Author
First-time contributor

I wasn't sure why this was if instead of elif. Won't that make it so JPEG2000 still crashes? I fixed it but wanted to double-check.

I wasn't sure why this was `if` instead of `elif`. Won't that make it so JPEG2000 still crashes? I fixed it but wanted to double-check.
mara0004 reviewed 2023-12-05 14:04:19 +00:00
README.md Outdated
@ -33,12 +33,14 @@ input file format and image color space.
| JPEG2000 | any | direct |
| PNG (non-interlaced, no transparency) | any | direct |
| TIFF (CCITT Group 4) | monochrome | direct |
| JBIG2 (single-page generic coding) | bi-level | direct |
Contributor

the other entries seem to use the term monochrome for 1 bit per pixel images.

the other entries seem to use the term `monochrome` for 1 bit per pixel images.
Author
First-time contributor

Monochrome is also often used for greyscale images, however. See e.g. https://en.wikipedia.org/wiki/Monochrome

Bi-level is pretty standard terminology, though "binary images" or perhaps even "1-bit images" might be clearer. https://en.wikipedia.org/wiki/Binary_image

Monochrome is also often used for greyscale images, however. See e.g. https://en.wikipedia.org/wiki/Monochrome Bi-level is pretty standard terminology, though "binary images" or perhaps even "1-bit images" might be clearer. https://en.wikipedia.org/wiki/Binary_image
Contributor

I'm fine with choosing another term, all I mean is the table should be consistent.

I'm fine with choosing another term, all I mean is the table should be consistent.
Author
First-time contributor

Apologies for the delay. I've updated the README to consistently say "1-bit monochrome" (to differentiate it from the other meaning of "grayscale"). Does this look good?

Apologies for the delay. I've updated the README to consistently say "1-bit monochrome" (to differentiate it from the other meaning of "grayscale"). Does this look good?
ooBJ3u added 1 commit 2024-04-05 01:01:54 +00:00
Author
First-time contributor

@josch Would you like to have another look at this? All comments should be addressed now.

@josch Would you like to have another look at this? All comments should be addressed now.
Owner

Nice!

I have a question. Why does this happen:

$ jbigtopnm mono.jb2
jbigtopnm: Invalid contents of input file.  Input data stream contains invalid data
Nice! I have a question. Why does this happen: ``` $ jbigtopnm mono.jb2 jbigtopnm: Invalid contents of input file. Input data stream contains invalid data ```
Author
First-time contributor

@josch
I have a question. Why does this happen:

$ jbigtopnm mono.jb2
jbigtopnm: Invalid contents of input file.  Input data stream contains invalid data

Apologies for the delay. jbigtopnm only supports JBIG1, not JBIG2. JBIG1 is still used in fax machines, but is not supported by PDF, so not too relevant for us.

> @josch > I have a question. Why does this happen: > > ``` > $ jbigtopnm mono.jb2 > jbigtopnm: Invalid contents of input file. Input data stream contains invalid data > ``` Apologies for the delay. `jbigtopnm` only supports JBIG1, not JBIG2. JBIG1 is still used in fax machines, but is not supported by PDF, so not too relevant for us.
Author
First-time contributor

@josch is this still blocked on anything? Anything I can do to get it merged?

@josch is this still blocked on anything? Anything I can do to get it merged?
Owner

@josch is this still blocked on anything? Anything I can do to get it merged?

Thank you for the ping and sorry to not have come back to you earlier. If in doubt, please feel free to ping me until I explicitly say otherwise. There are a lot of FOSS projects I'm taking are of and unfortunately, I haven't spent as much time on img2pdf recently as I should've. This is also due to my frustration with imagemagick which @gms recently summarized well in #204. It is very tiring of trying to do the right thing and being backwards compatible with old versions and then being shot in the foot by another imagemagick change which breaks behaviour...

In any case, your MR looks good. I'd just like to to squash commit 8c5541f41728d8ec87bef49f7f7134a4727a157a into 4901fa202e32d7dc2afb964a09b603257fa1bbe1 because otherwise, your file src/tests/input/042.jb2 will be part of the git history and as we discussed this is problematic due to copyright reasons.

Thanks!

> @josch is this still blocked on anything? Anything I can do to get it merged? Thank you for the ping and sorry to not have come back to you earlier. If in doubt, please feel free to ping me until I explicitly say otherwise. There are a lot of FOSS projects I'm taking are of and unfortunately, I haven't spent as much time on img2pdf recently as I should've. This is also due to my frustration with imagemagick which @gms recently summarized well in #204. It is very tiring of trying to do the right thing and being backwards compatible with old versions and then being shot in the foot by another imagemagick change which breaks behaviour... In any case, your MR looks good. I'd just like to to squash commit 8c5541f41728d8ec87bef49f7f7134a4727a157a into 4901fa202e32d7dc2afb964a09b603257fa1bbe1 because otherwise, your file `src/tests/input/042.jb2` will be part of the git history and as we discussed this is problematic due to copyright reasons. Thanks!
Contributor

@josch is this still blocked on anything? Anything I can do to get it merged?

May I ask the same question regarding MRs #200, #201, #202, #203?
Take your time, just wondering if there's anything left to do from my side? Sorry for the noise...

> @josch is this still blocked on anything? Anything I can do to get it merged? May I ask the same question regarding MRs #200, #201, #202, #203? Take your time, just wondering if there's anything left to do from my side? Sorry for the noise...
ooBJ3u force-pushed main from 150a23169b to e2369eb59a 2024-09-25 19:06:14 +00:00 Compare
Author
First-time contributor

There are a lot of FOSS projects I'm taking are of and unfortunately, I haven't spent as much time on img2pdf recently as I should've.

@josch I know just what you mean, which is why this also took me a while, sorry about that! I've squashed all commits into one, and updated the commit message to reflect all the changes that we made. Should be ready to merge now! Thanks again for such a fantastic project.

> There are a lot of FOSS projects I'm taking are of and unfortunately, I haven't spent as much time on img2pdf recently as I should've. @josch I know just what you mean, which is why this also took me a while, sorry about that! I've squashed all commits into one, and updated the commit message to reflect all the changes that we made. Should be ready to merge now! Thanks again for such a fantastic project.
Contributor

FWIW, I tested the changes in this MR by converting some TIFF files to JBIG2 (with jbig2 file.tif > file.jb2) and running img2pdf on them, and the resulting PDF seems to work just fine.

However, I'm noticing a potential problem when extracting the images...

If using pdfimages -tiff ... or pdfimages -png ... (from poppler) to extract/convert one or more of the JBIG2 files from the PDF, the tool complains with:

Syntax Error (9625): Unknown segment type in JBIG2 stream

Though note that pdfimages exits with status code 0, and the TIFF or PNG files appear valid. I did the same extraction test on a PDF (with JBIG2 images) created by another tool, and pdfimages does not print the error message, so maybe poppler sees something slightly off with the embedded JBIG2 files but it's not fatal to the conversion...

Without reading much about the JBIG2 format, I did notice from the jbig2enc man page that it supports a -p/--pdf flag. The JBIG2 support as implemented in this MR does not appear to support JBIG2 files created with that flag set. Could this be related to the syntax error poppler complains about?

FWIW, I tested the changes in this MR by converting some TIFF files to JBIG2 (with `jbig2 file.tif > file.jb2`) and running img2pdf on them, and the resulting PDF seems to work just fine. However, I'm noticing a potential problem when extracting the images... If using `pdfimages -tiff ...` or `pdfimages -png ...` (from poppler) to extract/convert one or more of the JBIG2 files from the PDF, the tool complains with: ``` Syntax Error (9625): Unknown segment type in JBIG2 stream ``` Though note that `pdfimages` exits with status code 0, and the TIFF or PNG files appear valid. I did the same extraction test on a PDF (with JBIG2 images) created by another tool, and `pdfimages` does not print the error message, so maybe poppler sees something slightly off with the embedded JBIG2 files but it's not fatal to the conversion... Without reading much about the JBIG2 format, I did notice from the `jbig2enc` man page that it supports a `-p/--pdf` flag. The JBIG2 support as implemented in this MR does not appear to support JBIG2 files created with that flag set. Could this be related to the syntax error poppler complains about?
Author
First-time contributor

@phmccarty that sounds like a bug in pdfimages. I've created and distributed many files with this MR, and they all open properly in PDF readers, and haven't heard any issues from users with opening the files, so I'm fairly confident that the format is correct. In fact, we don't modify the raw output from jbig2 at all, except stripping the file header, in accordance with the pdf spec.

I suspect that pdfimages is missing a segment type case, because JBIG2 is a very extensive format, and they might have not fully implemented it?

@phmccarty that sounds like a bug in `pdfimages`. I've created and distributed many files with this MR, and they all open properly in PDF readers, and haven't heard any issues from users with opening the files, so I'm fairly confident that the format is correct. In fact, we don't modify the raw output from `jbig2` at all, except stripping the file header, in accordance with the pdf spec. I suspect that `pdfimages` is missing a segment type case, because JBIG2 is a very extensive format, and they might have not fully implemented it?
Contributor

@phmccarty that sounds like a bug in pdfimages. I've created and distributed many files with this MR, and they all open properly in PDF readers, and haven't heard any issues from users with opening the files, so I'm fairly confident that the format is correct. In fact, we don't modify the raw output from jbig2 at all, except stripping the file header, in accordance with the pdf spec.

I haven't encountered any issues with PDFs opening either, thankfully... Just the poppler (non-fatal) complaint about syntax of the stream.

I took a closer look at the difference in output between files generated with jbig2 ... and jbig2 --pdf .... The latter output omits the file header (13 bytes), and also omits a few more bytes at the end of the file, which appear to be the end-of-page and end-of-file segments.

Perhaps those two segments should also be excluded? From the PDF spec I see it documented as:

The JBIG2 file header, end-of-page segments, and end-of-file segment shall not be used in PDF.
> @phmccarty that sounds like a bug in `pdfimages`. I've created and distributed many files with this MR, and they all open properly in PDF readers, and haven't heard any issues from users with opening the files, so I'm fairly confident that the format is correct. In fact, we don't modify the raw output from `jbig2` at all, except stripping the file header, in accordance with the pdf spec. I haven't encountered any issues with PDFs opening either, thankfully... Just the poppler (non-fatal) complaint about syntax of the stream. I took a closer look at the difference in output between files generated with `jbig2 ...` and `jbig2 --pdf ...`. The latter output omits the file header (13 bytes), and also omits a few more bytes at the end of the file, which appear to be the end-of-page and end-of-file segments. Perhaps those two segments should also be excluded? From the PDF spec I see it documented as: ``` The JBIG2 file header, end-of-page segments, and end-of-file segment shall not be used in PDF. ```
Author
First-time contributor

That could explain it. Do you happen to have the byte strings that should be removed at the end handy? We can check to make sure those are present, and then remove those in that case.

That could explain it. Do you happen to have the byte strings that should be removed at the end handy? We can check to make sure those are present, and then remove those in that case.
Contributor

That could explain it. Do you happen to have the byte strings that should be removed at the end handy? We can check to make sure those are present, and then remove those in that case.

Sure, here are the trailing 22 bytes in hexdump format:

00000000  00 00 00 02 31 00 01 00  00 00 00 00 00 00 03 33  |....1..........3|
00000010  00 01 00 00 00 00                                 |......|
00000016

I ran some more side-by-side comparisons of the jbig2 ... and jbig2 --pdf output for various sizes of input images, extracting trailing bytes, and the trailing bytes appear to be identical, at least for the tests I ran.

If there's a chance the bytes in these trailing segments can change, I suppose the jbig2enc source code will provide hints...

> That could explain it. Do you happen to have the byte strings that should be removed at the end handy? We can check to make sure those are present, and then remove those in that case. Sure, here are the trailing 22 bytes in hexdump format: ``` 00000000 00 00 00 02 31 00 01 00 00 00 00 00 00 00 03 33 |....1..........3| 00000010 00 01 00 00 00 00 |......| 00000016 ``` I ran some more side-by-side comparisons of the `jbig2 ...` and `jbig2 --pdf` output for various sizes of input images, extracting trailing bytes, and the trailing bytes appear to be identical, at least for the tests I ran. If there's a chance the bytes in these trailing segments can change, I suppose the jbig2enc source code will provide hints...
ooBJ3u added 1 commit 2024-10-30 04:51:19 +00:00
As noted by @phmccarty in
#184 (comment)
and subsequent comments, we were not properly stripping end-of-page and
end-of-file segments. These are valid segments in a JBIG2 file, but not
when embedded in PDF.

From the PDF spec:
> The JBIG2 file header, end-of-page segments, and end-of-file segment
> shall not be used in PDF.

We were already stripping out the JBIG2 file header, but not yet the
end-of-page and end-of-file segments.

For this, I'm expanding the approach that we were already taking, of
only supporting a narrow subset of JBIG2 files. We assert that the input
file has such a footer, and then we strip it.

We validated that the issue raised by @phmccarty is indeed resolved by
running the following code before and after applying this commit:

```sh
src/img2pdf.py src/tests/input/mono.jb2 > test.pdf
pdfimages -tiff test.pdf img
```

Before this commit, this returned "Syntax Error (1143): Unknown segment
type in JBIG2 stream". After this commit, the error is gone.
Author
First-time contributor

Apologies again for the delay.

@phmccarty I fixed your issue in 244600065d — see the commit message for more details.

Apologies again for the delay. @phmccarty I fixed your issue in https://gitlab.mister-muffin.de/josch/img2pdf/commit/244600065d7eee6b8365c224bd27dcb575557590 — see the commit message for more details.
Contributor

@ooBJ3u Thanks! I re-tested with your latest commit, and I no longer see that syntax error.

@ooBJ3u Thanks! I re-tested with your latest commit, and I no longer see that syntax error.
Author
First-time contributor

Great news!

@josch this PR should now really-really be ready to merge then, when you have a moment. 🙏

Great news! @josch this PR should now really-really be ready to merge then, when you have a moment. 🙏
phmccarty reviewed 2024-11-01 18:53:01 +00:00
@ -1824,0 +1876,4 @@
raise ImageOpenError(
"Unsupported JBIG2 format; only single-page generic coding is supported (e.g. from `jbig2enc`)."
)
if rawdata[-22:] != b"\x00\x00\x00\x021\x00\x01\x00\x00\x00\x00\x00\x00\x00\x033\x00\x01\x00\x00\x00\x00":
Contributor

One question about the style of the code here:

Do you think it would be better to use hex instead of the character value for the 5th and 16th bytes (1 and 3)? IMO, I like that better for the consistency, but because the value is accurate as-is, I'm not opposed to leaving it.

One question about the style of the code here: Do you think it would be better to use hex instead of the character value for the 5th and 16th bytes (`1` and `3`)? IMO, I like that better for the consistency, but because the value is accurate as-is, I'm not opposed to leaving it.
Author
First-time contributor

This is how Python prints the bytearray by default, so I figured that is fine.

This is how Python prints the bytearray by default, so I figured that is fine.
Contributor

Okay, that makes sense then. No objection from me.

Okay, that makes sense then. No objection from me.
phmccarty approved these changes 2024-11-05 01:53:09 +00:00
phmccarty left a comment
Contributor

Changes look good to me

Changes look good to me
This pull request can be merged automatically.
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u main:ooBJ3u-main
git checkout ooBJ3u-main

Merge

Merge the changes and update on Forgejo.
git checkout main
git merge --no-ff ooBJ3u-main
git checkout main
git merge --ff-only ooBJ3u-main
git checkout ooBJ3u-main
git rebase main
git checkout main
git merge --no-ff ooBJ3u-main
git checkout main
git merge --squash ooBJ3u-main
git checkout main
git merge --ff-only ooBJ3u-main
git checkout main
git merge ooBJ3u-main
git push origin main
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
4 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: josch/img2pdf#184
No description provided.