Add support for JBIG2 (generic coding) #184

ooBJ3u · 2023-11-29T02:12:51Z

ooBJ3u commented

2023-11-29 02:12:51 +00:00

First-time contributor

Implements the proposal detailed at #112 (comment).

This is a limited implementation of JBIG2, which can be extended to support multiple pages, symbol tables, and other features of the format in the future.

To test, I included a test fixture. You can also download 042.bmp (the same one as @josch already downloaded in #112 (comment)) from https://git.ghostscript.com/?p=tests.git;a=blob_plain;f=jbig2/042.bmp;hb=HEAD and run the following command:

jbig2 042.bmp | img2pdf > 042.pdf

This results in a small PDF, just as @josch originally found in the comment mentioned above.

This is my first contribution to this repository so let me know if something else is needed. Thanks for a great library!


👍 1

ooBJ3u added 1 commit 2023-11-29 02:12:52 +00:00

Add support for JBIG2 (generic coding) 154a61a88f

Implements the proposal detailed at
#112 (comment)

This is a limited implementation of JBIG2, which can be extended to
support multiple pages, symbol tables, and other features of the format
in the future.

To test, I included a test fixture. You can also download 042.bmp (the same
one as @josch already downloaded in #112 (comment))
from https://git.ghostscript.com/?p=tests.git;a=blob_plain;f=jbig2/042.bmp;hb=HEAD
and run the following command:

  jbig2 042.bmp | img2pdf > 042.pdf

This results in a small PDF, just as @josch originally found in the
comment mentioned above.

This is my first contribution to this repository so let me know if
something else is needed. Thanks for a great library!

ooBJ3u force-pushed main from 154a61a88f to ee42963164

2023-11-29 02:27:47 +00:00


josch commented

2023-11-29 03:55:33 +00:00

Wow, thank you! I read through your diff without trying it out yet and it looks really, really good!

My biggest gripe right now is src/tests/input/042.jb2. Why did you use the scan of a page instead of the "TEST" image used in the other test cases? One problem with using a "real" test image, like the page scan you chose, is the copyright situation. Even if that page is in the public domain (is it?), you now have to write that down and keep track of it somewhere.


ooBJ3u commented

2023-11-29 03:56:58 +00:00

First-time contributor

No problem, I'll swap it out.

ooBJ3u added 1 commit 2023-11-29 04:30:43 +00:00

Use mono.jb2 for tests 2c00f3b66b

This also uncovered a bug in jbig2enc where it uses the wrong unit
for resolution.

ooBJ3u force-pushed main from 2c00f3b66b to b23d82c45e

2023-11-29 04:33:20 +00:00


ooBJ3u commented

2023-11-29 04:34:07 +00:00

First-time contributor

Fixed in 085dd192f6.


ooBJ3u reviewed 2023-11-29 04:39:40 +00:00

src/img2pdf.py Outdated

					
@@ -1820,7 +1842,41 @@ def read_images(
        if rawdata[:12] == b"\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A":
            # image is jpeg2000
            imgformat = ImageFormat.JPEG2000
        if rawdata[:14].lower() == b"id=imagemagick":

ooBJ3u commented

2023-11-29 04:39:40 +00:00

First-time contributor

I wasn't sure why this was `if` instead of `elif`. Won't that make it so JPEG2000 still crashes? I fixed it but wanted to double-check.
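To illustrate the concern, here is a minimal sketch. It is not the actual img2pdf code: the function names, the string labels, and the final `else` fallback are assumptions made up for the example. With a separate `if`, the `else` of the later check also runs for input that an earlier check already matched.

```python
JP2_MAGIC = b"\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A"


def sniff_with_if(rawdata):
    """Hypothetical sniffing chain that uses two independent `if` statements."""
    imgformat = None
    if rawdata[:12] == JP2_MAGIC:
        imgformat = "JPEG2000"
    # Independent `if`: its `else` branch also runs for JPEG2000 input
    # and overwrites the format detected above.
    if rawdata[:14].lower() == b"id=imagemagick":
        imgformat = "MIFF"
    else:
        imgformat = "unknown"  # assumed fallback, for illustration only
    return imgformat


def sniff_with_elif(rawdata):
    """Same checks as one `if`/`elif` chain: the first match wins."""
    if rawdata[:12] == JP2_MAGIC:
        imgformat = "JPEG2000"
    elif rawdata[:14].lower() == b"id=imagemagick":
        imgformat = "MIFF"
    else:
        imgformat = "unknown"
    return imgformat


sample = JP2_MAGIC + b"rest of file"
print(sniff_with_if(sample))    # the earlier match is lost
print(sniff_with_elif(sample))  # the earlier match is kept
```

Whether the real chain ends in such a fallback is exactly what the question above is asking; the sketch only shows why the distinction can matter.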


mara0004 reviewed 2023-12-05 14:04:19 +00:00

README.md Outdated

					
@@ -33,12 +33,14 @@ input file format and image color space.
| JPEG2000                              | any                            | direct        |
| PNG (non-interlaced, no transparency) | any                            | direct        |
| TIFF (CCITT Group 4)                  | monochrome                     | direct        |
| JBIG2 (single-page generic coding)    | bi-level                       | direct        |

mara0004 commented

2023-12-05 14:04:19 +00:00

the other entries seem to use the term `monochrome` for 1 bit per pixel images.


ooBJ3u commented

2023-12-06 05:33:24 +00:00

First-time contributor

Monochrome is also often used for greyscale images, however. See e.g. https://en.wikipedia.org/wiki/Monochrome

Bi-level is pretty standard terminology, though "binary images" or perhaps even "1-bit images" might be clearer. https://en.wikipedia.org/wiki/Binary_image


mara0004 commented

2023-12-06 12:11:52 +00:00

I'm fine with choosing another term, all I mean is the table should be consistent.

ooBJ3u commented

2024-04-05 01:02:55 +00:00

First-time contributor

Apologies for the delay. I've updated the README to consistently say "1-bit monochrome" (to differentiate it from the other meaning of "monochrome", i.e. grayscale). Does this look good?

👍 1

ooBJ3u added 1 commit 2024-04-05 01:01:54 +00:00

Update 'README.md' 150a23169b

Per comment #184/files (comment)

ooBJ3u commented

2024-05-15 13:49:46 +00:00

First-time contributor

@josch Would you like to have another look at this? All comments should be addressed now.

josch commented

2024-05-18 12:19:45 +00:00

Nice!

I have a question. Why does this happen:

$ jbigtopnm mono.jb2
jbigtopnm: Invalid contents of input file.  Input data stream contains invalid data


ooBJ3u commented

2024-06-06 16:34:00 +00:00

First-time contributor

> @josch
> I have a question. Why does this happen:
>
> $ jbigtopnm mono.jb2
> jbigtopnm: Invalid contents of input file.  Input data stream contains invalid data

Apologies for the delay. jbigtopnm only supports JBIG1, not JBIG2. JBIG1 is still used in fax machines, but is not supported by PDF, so not too relevant for us.
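As a quick way to check that a file really is a standalone JBIG2 file before blaming a decoder, one can look for the 8-byte ID string that opens the JBIG2 file header. This is only a sketch; the function name `looks_like_jbig2` is made up for the example and is not part of img2pdf.

```python
# ID string at the start of a standalone JBIG2 file: 0x97 "JB2" CR LF 0x1A LF.
JBIG2_ID = b"\x97JB2\r\n\x1a\n"


def looks_like_jbig2(data: bytes) -> bool:
    """Return True if `data` starts with the JBIG2 file header ID string."""
    return data[:8] == JBIG2_ID


# Usage with a real file (path is a placeholder):
#   with open("mono.jb2", "rb") as f:
#       print(looks_like_jbig2(f.read(8)))
```

A file that passes this check is JBIG2, so a JBIG1-only tool is expected to reject it.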


josch referenced this pull request

2024-07-07 12:31:57 +00:00

feature request: compress image losslessly within pdf #199

ooBJ3u commented

2024-09-11 07:52:41 +00:00

First-time contributor

@josch is this still blocked on anything? Anything I can do to get it merged?

josch commented

2024-09-11 09:45:26 +00:00

> @josch is this still blocked on anything? Anything I can do to get it merged?

Thank you for the ping and sorry for not getting back to you earlier. If in doubt, please feel free to ping me until I explicitly say otherwise. There are a lot of FOSS projects I'm taking care of and unfortunately, I haven't spent as much time on img2pdf recently as I should've. This is also due to my frustration with imagemagick which @gms recently summarized well in #204. It is very tiring to try to do the right thing and stay backwards compatible with old versions, only to be shot in the foot by another imagemagick change which breaks behaviour...

In any case, your MR looks good. I'd just like you to squash commit 8c5541f417 into 4901fa202e because otherwise, your file src/tests/input/042.jb2 will be part of the git history and as we discussed this is problematic due to copyright reasons.

Thanks!


mara0004 commented

2024-09-13 21:42:16 +00:00

> @josch is this still blocked on anything? Anything I can do to get it merged?

May I ask the same question regarding MRs #200, #201, #202, #203?
Take your time, just wondering if there's anything left to do from my side? Sorry for the noise...


ooBJ3u force-pushed main from 150a23169b to e2369eb59a

2024-09-25 19:06:14 +00:00


ooBJ3u commented

2024-09-25 19:08:21 +00:00

First-time contributor

> There are a lot of FOSS projects I'm taking care of and unfortunately, I haven't spent as much time on img2pdf recently as I should've.

@josch I know just what you mean, which is why this also took me a while, sorry about that! I've squashed all commits into one, and updated the commit message to reflect all the changes that we made. Should be ready to merge now! Thanks again for such a fantastic project.


phmccarty commented

2024-10-10 20:54:51 +00:00

FWIW, I tested the changes in this MR by converting some TIFF files to JBIG2 (with jbig2 file.tif > file.jb2) and running img2pdf on them, and the resulting PDF seems to work just fine.

However, I'm noticing a potential problem when extracting the images...

If using `pdfimages -tiff ...` or `pdfimages -png ...` (from poppler) to extract/convert one or more of the JBIG2 files from the PDF, the tool complains with:

Syntax Error (9625): Unknown segment type in JBIG2 stream

Though note that pdfimages exits with status code 0, and the TIFF or PNG files appear valid. I did the same extraction test on a PDF (with JBIG2 images) created by another tool, and pdfimages does not print the error message, so maybe poppler sees something slightly off with the embedded JBIG2 files but it's not fatal to the conversion...

Without reading much about the JBIG2 format, I did notice from the jbig2enc man page that it supports a -p/--pdf flag. The JBIG2 support as implemented in this MR does not appear to support JBIG2 files created with that flag set. Could this be related to the syntax error poppler complains about?


ooBJ3u commented

2024-10-10 22:21:49 +00:00

First-time contributor

@phmccarty that sounds like a bug in pdfimages. I've created and distributed many files with this MR, and they all open properly in PDF readers, and haven't heard any issues from users with opening the files, so I'm fairly confident that the format is correct. In fact, we don't modify the raw output from jbig2 at all, except stripping the file header, in accordance with the pdf spec.

I suspect that pdfimages is missing a segment type case, because JBIG2 is a very extensive format, and they might not have fully implemented it?


phmccarty commented

2024-10-10 23:05:20 +00:00

> @phmccarty that sounds like a bug in pdfimages. I've created and distributed many files with this MR, and they all open properly in PDF readers, and haven't heard any issues from users with opening the files, so I'm fairly confident that the format is correct. In fact, we don't modify the raw output from jbig2 at all, except stripping the file header, in accordance with the pdf spec.

I haven't encountered any issues with PDFs opening either, thankfully... Just the poppler (non-fatal) complaint about syntax of the stream.

I took a closer look at the difference in output between files generated with `jbig2 ...` and `jbig2 --pdf ...`. The latter output omits the file header (13 bytes), and also omits a few more bytes at the end of the file, which appear to be the end-of-page and end-of-file segments.

Perhaps those two segments should also be excluded? From the PDF spec I see it documented as:

The JBIG2 file header, end-of-page segments, and end-of-file segment shall not be used in PDF.


ooBJ3u commented

2024-10-10 23:36:46 +00:00

First-time contributor

That could explain it. Do you happen to have the byte strings that should be removed at the end handy? We can check to make sure those are present, and then remove those in that case.

phmccarty commented

2024-10-11 00:33:48 +00:00

> That could explain it. Do you happen to have the byte strings that should be removed at the end handy? We can check to make sure those are present, and then remove those in that case.

Sure, here are the trailing 22 bytes in hexdump format:

00000000  00 00 00 02 31 00 01 00  00 00 00 00 00 00 03 33  |....1..........3|
00000010  00 01 00 00 00 00                                 |......|
00000016

I ran some more side-by-side comparisons of the `jbig2 ...` and `jbig2 --pdf` output for various sizes of input images, extracting trailing bytes, and the trailing bytes appear to be identical, at least for the tests I ran.

If there's a chance the bytes in these trailing segments can change, I suppose the jbig2enc source code will provide hints...
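For what it's worth, the 22 bytes can be read as two 11-byte segment headers. The sketch below decodes them under the assumption that each header uses the short layout (4-byte segment number, 1 flags byte whose low 6 bits are the segment type, 1 byte for referred-to segments, 1-byte page association, 4-byte data length), which only holds when a segment refers to no other segments and the page-association field is one byte wide. The helper name is made up for the example.

```python
import struct

# The trailing 22 bytes from the hexdump above.
TRAILER = bytes.fromhex(
    "00 00 00 02 31 00 01 00 00 00 00"
    "00 00 00 03 33 00 01 00 00 00 00"
)

# Segment type numbers for the two segment kinds of interest.
SEGMENT_TYPES = {49: "end of page", 51: "end of file"}


def parse_short_segment_header(chunk):
    """Decode one 11-byte segment header (short layout, see assumptions above)."""
    number, flags, _referred, page, length = struct.unpack(">IBBBI", chunk)
    return {
        "number": number,
        "type": flags & 0x3F,  # low 6 bits hold the segment type
        "page": page,
        "data_length": length,
    }


for offset in (0, 11):
    hdr = parse_short_segment_header(TRAILER[offset:offset + 11])
    print(hdr["number"], SEGMENT_TYPES[hdr["type"]], hdr["data_length"])
# prints:
# 2 end of page 0
# 3 end of file 0
```

Both segments carry no data, so the only fields that could plausibly vary between files are the segment numbers and the page association.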


ooBJ3u added 1 commit 2024-10-30 04:51:19 +00:00

Strip end-of-page and end-of-file segments from JBIG2 244600065d

As noted by @phmccarty in
#184 (comment)
and subsequent comments, we were not properly stripping end-of-page and
end-of-file segments. These are valid segments in a JBIG2 file, but not
when embedded in PDF.

From the PDF spec:
> The JBIG2 file header, end-of-page segments, and end-of-file segment
> shall not be used in PDF.

We were already stripping out the JBIG2 file header, but not yet the
end-of-page and end-of-file segments.

For this, I'm expanding the approach that we were already taking, of
only supporting a narrow subset of JBIG2 files. We assert that the input
file has such a footer, and then we strip it.

We validated that the issue raised by @phmccarty is indeed resolved by
running the following code before and after applying this commit:

```sh
src/img2pdf.py src/tests/input/mono.jb2 > test.pdf
pdfimages -tiff test.pdf img
```

Before this commit, this returned "Syntax Error (1143): Unknown segment
type in JBIG2 stream". After this commit, the error is gone.
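The approach described in this commit message can be sketched roughly as follows. This is an illustration, not the code from the commit: the function name `strip_for_pdf` and the use of `ValueError` are assumptions, and the fixed 13-byte header length only holds for the narrow kind of file discussed in this thread.

```python
JBIG2_ID = b"\x97JB2\r\n\x1a\n"  # ID string at the start of the file header
FILE_HEADER_LEN = 13             # header length observed in this thread

# End-of-page and end-of-file segments as emitted by jbig2enc in the tests above.
TRAILER = bytes.fromhex(
    "00 00 00 02 31 00 01 00 00 00 00"
    "00 00 00 03 33 00 01 00 00 00 00"
)


def strip_for_pdf(data: bytes) -> bytes:
    """Return the JBIG2 segments without file header and trailing segments.

    Raises ValueError when the input is not in the narrow supported form.
    """
    if data[:8] != JBIG2_ID:
        raise ValueError("missing JBIG2 file header")
    if data[-len(TRAILER):] != TRAILER:
        raise ValueError("missing expected end-of-page/end-of-file segments")
    return data[FILE_HEADER_LEN:-len(TRAILER)]
```

Rejecting input that lacks the expected footer keeps the supported subset explicit instead of silently embedding segments that PDF does not allow.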

ooBJ3u commented

2024-10-30 04:52:35 +00:00

First-time contributor

Apologies again for the delay.

@phmccarty I fixed your issue in 244600065d — see the commit message for more details.


👍 1

phmccarty commented

2024-10-30 17:02:30 +00:00

@ooBJ3u Thanks! I re-tested with your latest commit, and I no longer see that syntax error.

🎉 1

ooBJ3u commented

2024-10-30 17:25:49 +00:00

First-time contributor

Great news!

@josch this PR should now really-really be ready to merge then, when you have a moment. 🙏


phmccarty reviewed 2024-11-01 18:53:01 +00:00

src/img2pdf.py

					
@@ -1824,0 +1876,4 @@
                raise ImageOpenError(
                    "Unsupported JBIG2 format; only single-page generic coding is supported (e.g. from `jbig2enc`)."
                )
            if rawdata[-22:] != b"\x00\x00\x00\x021\x00\x01\x00\x00\x00\x00\x00\x00\x00\x033\x00\x01\x00\x00\x00\x00":

phmccarty commented

2024-11-01 18:53:01 +00:00

One question about the style of the code here:

Do you think it would be better to use hex instead of the character value for the 5th and 16th bytes (`1` and `3`)? IMO, I like that better for the consistency, but because the value is accurate as-is, I'm not opposed to leaving it.


ooBJ3u commented

2024-11-02 04:03:50 +00:00

First-time contributor

This is how Python prints the bytearray by default, so I figured that is fine.
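A small check of that point: the two spellings are the same bytes object, and Python's `repr` shows printable ASCII bytes as characters. The variable names are just for illustration.

```python
# Same seven bytes, written with a hex escape and with the ASCII character "1".
hex_form = b"\x00\x00\x00\x02\x31\x00\x01"
char_form = b"\x00\x00\x00\x021\x00\x01"

print(hex_form == char_form)  # True
print(repr(hex_form))         # b'\x00\x00\x00\x021\x00\x01'
```

So the choice between the two is purely about readability.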

phmccarty commented

2024-11-05 01:48:52 +00:00

Okay, that makes sense then. No objection from me.

phmccarty approved these changes 2024-11-05 01:53:09 +00:00

phmccarty left a comment

Changes look good to me

🚀 1

ooBJ3u commented

2025-02-04 10:48:53 +00:00

First-time contributor

@josch it looks like this has been good to merge for a while now. :) I know you're busy with lots of open source projects, but want to hit the button?

josch commented

2025-02-15 08:34:28 +00:00

Hi again! Sorry for the long wait. Indeed, sadly, img2pdf has been rather at the bottom of my TODO list in the past few months but since the next Debian stable release is coming up, I want to make another release before that and have merged your changes into main. Thanks a lot for your contributions and sorry for the very long wait.

There is one remaining problem in case you manage to find some time to fix it (I'm looking into it as well). When I run the testsuite on my computer, I get this difference in the `test_general[mono.jb2-internal]` test:

  Full diff:
    {
        '/Pages': {
            '/Count': Decimal('1'),
            '/Kids': [
                {
                    '/Contents': {
                        'stream': b'q\n115.0000 0 0 48.0000 0.0000 0.0000 cm\n/Im0 Do\nQ',
                    },
                    '/MediaBox': [
                        Decimal('0'),
                        Decimal('0'),
                        Decimal('115'),
                        Decimal('48'),
                    ],
                    '/Resources': {
                        '/XObject': {
                            '/Im0': {
                                '/BitsPerComponent': Decimal('1'),
                                '/ColorSpace': '/DeviceGray',
                                '/Filter': '/JBIG2Decode',
                                '/Height': Decimal('48'),
                                '/Subtype': '/Image',
                                '/Type': '/XObject',
                                '/Width': Decimal('115'),
                                'stream': b'\x00\x00\x00\x000\x00\x01\x00\x00\x00\x13\x00'
                                b'\x00\x00s\x00\x00\x000\x00\x00\x00H\x00'
                                b'\x00\x00H\x01\x00\x00\x00\x00\x00\x01&\x00'
                                b'\x01\x00\x00\x00\x81\x00\x00\x00s\x00\x00\x00'
                                b'0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03'
                                b'\xff\xfd\xff\x02\xfe\xfe\xfe\xab.3I\xbd'
                                b'\x10\xf6\xb7\x9e\xa6U\xb4\xba{\xa3\x95\x80'
                                b'\x8e4\x0f\xe3\xc0\xbe\x1e\x02\xebIu\x1c'
                                b'\x07\xb1\x9f\xd3U\xf5\x8bQ\xb5|o\xbdy\xe7\xdb\xc4'
                                b'V\xa6\xac\xc4.^\xf2\xb7\x01N&\x9c|\xcb\xfe('
                                b')\x0e\x07nS\xe3\x15\x8a\xef\x14\xc6\x18&@\x9c\x9b'
                                b'#_La\x96\x1e\xa6\x07\xb2A\xe1\xa5@\xfc\xd8Q'
  -                             b'$\xd3\xbb\xd6]\x99hS\xff\xac\x00\x00\x00\x021\x00'
  ?                                                           ---------------------
  +                             b'$\xd3\xbb\xd6]\x99hS\xff\xac',
  ?                                                            +
  -                             b'\x01\x00\x00\x00\x00\x00\x00\x00\x033\x00\x01'
  -                             b'\x00\x00\x00\x00',
                            },
                        },
                    },
                    '/Type': '/Page',
                },
            ],
            '/Type': '/Pages',
        },
        '/Type': '/Catalog',
    }

The likely cause is that there are multiple ways to encode the same pixel information and different versions of the encoder library produce different output.

I think the correct way to handle this is to do the same as is already done with FlateDecode (there are also multiple ways to compress the same data with the deflate algorithm) in line 7157 of `src/img2pdf_test.py`. There should probably be an `elif ret.get("/Filter") == "/JBIG2Decode":` and that should be decoding the jbig2 data into raw pixel values.

I'm currently looking for the best way to do this decoding from Python.
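To make the FlateDecode analogy concrete, here is a minimal sketch of comparing decoded data instead of encoded bytes. It is not the code from `src/img2pdf_test.py`; `normalize_stream` is a hypothetical helper, and the JBIG2 branch is left out because no decoder has been chosen yet.

```python
import zlib


def normalize_stream(filter_name, stream):
    """Return a form of `stream` that does not depend on the encoder version."""
    if filter_name == "/FlateDecode":
        return zlib.decompress(stream)
    # A "/JBIG2Decode" branch would decode to raw pixel values here.
    return stream


pixels = bytes(range(256)) * 16
fast = zlib.compress(pixels, 1)  # two encodings of the same data;
best = zlib.compress(pixels, 9)  # the compressed bytes are free to differ

print(normalize_stream("/FlateDecode", fast) == normalize_stream("/FlateDecode", best))  # True
```

Comparing the normalized form makes the test independent of which encoder version produced the stream.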

Thanks!


josch commented

2025-02-15 08:36:02 +00:00

Oh no, wait! Is what I'm seeing in the diff not just the end-of-page and end-of-file segments from JBIG2 which got stripped in your last commit? Did you just forget to update the test data?