Support for JBIG2 images ~~/ streams~~ #112

Closed
opened 3 years ago by geer0Eec · 40 comments

An example: when extracting images from a `pdf` file with `pdfimages -all`, the result contains some JBIG2 images. It is then natural to select some of them to assemble another `pdf` file. Currently, it seems to me that one first needs to decode them with `jbig2dec` and then encode them again with `jbig2` to produce a pdf, which seems lossy and cumbersome. I hope that one could assemble these JBIG2 ~~streams~~ images directly via `img2pdf`.

~~It seems also reasonable to support not only JBIG2 streams, but also image files themselves.~~

Could you please share an example PDF so that we comprehend your problem better?
I tend to believe this is rather an issue with the tool you're using to extract images, since it should reconstruct the actual image rather than just save a jbig2 stream.

Poster

I am not permitted to share the `pdf` that I am dealing with (it consists of many scanned pages along with some decorations and frames), but the result of `pdfimages -list` looks like

```
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   4     3 image    3790  5447  gray    1   1  jbig2  no      1050  0   600   600 68.7K 2.7%
   5     4 image    3790  5447  gray    1   1  jbig2  no      1052  0   600   600 50.6K 2.0%
```

and `pdfimages -all` produces `jb2e` files, which are, if I understand correctly, JBIG2 streams (without header). I would like to extract some of these scanned pages (i.e. `jb2e` files) to get a new `pdf` file.

Then maybe you could just try a different tool that extracts real images rather than jbig2 streams? I'm using a custom script with the PdfImage helper model of pikepdf to extract images from PDFs.


This is my script in case it helps you. It also searches for images inside Form XObjects and applies a possible `/SMask`. However, it does not work for some types of images, since they are not yet supported by the PdfImage model (e.g. CMYK).
(Had to append the .txt extension to the file as gitea seems not to accept .py)

Poster

Thanks. Let me first open a feature request at `pdfimages`. However, it seems to me that real JBIG2 images are not supported by `img2pdf` either?

geer0Eec changed title from Support for JBIG2 images / streams to Support for JBIG2 images ~~/ streams~~ 3 years ago
geer0Eec changed title from Support for JBIG2 images ~~/ streams~~ to Support for JBIG2 images ~~/ streams~~ 3 years ago
josch commented 3 years ago
Owner

Hi,

I want to understand the problem. What is a "jbig2 image"? Which software produces those? jbig2 is not an image format but a way to encode bilevel data. Why would converting a jbig2 stream be lossy? Why don't you just use `pdfimages -png` instead?

Poster

Maybe I misunderstood something. In the manpage of `pdfimages`, the "formats" JPEG, JPEG2000, JBIG2 and CCITT are listed in parallel. I think that specifying `-png` would induce a conversion (everything to `png`), and I suppose that any of these kinds of conversion might be lossy (i.e. if I convert back and forth, I will get a different file). I imagine that it should be possible to avoid conversions altogether.

josch commented 3 years ago
Owner

JBIG2 and CCITT are ways to encode bilevel image data, but they are not "formats" in the same sense as JPEG or PNG: they have no header that identifies what kind of file it is, what the dimensions are, or other metadata. For example, a JPEG image starts with the bytes 0xFF 0xD8, and this tells the program reading the file that it is a JPEG image. JBIG2 and CCITT do not have such a header and thus, to understand the data from a JBIG2 or CCITT file, you need to somehow know that it is a JBIG2 or CCITT file. This is why image manipulation programs like gimp or photoshop will not let you save or open JBIG2 and CCITT files. Also, when you run `magick identify -list format` from imagemagick, JBIG2 and CCITT will not be listed. Those two are just ways to encode bilevel data, but they are missing a container. Without knowing that the file contains JBIG2 or CCITT data, the file just contains junk. So if we wanted img2pdf to support JBIG2 and CCITT, then we would need some way to tell img2pdf that the file we pass to it contains JBIG2 or CCITT data.

I cannot come up with a situation in which storing JBIG2 or CCITT data as PNG would be lossy. Can you?
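
The magic-byte check described above can be sketched in a few lines of Python. This is only an illustration: `MAGIC` and `sniff_format` are hypothetical names, not part of img2pdf. The JPEG and PNG signatures are the well-known ones. The JBIG2 entry is the 8-byte ID string of a *standalone* JBIG2 file, which later comments in this thread discuss; the embedded streams that `pdfimages -all` writes do not start with it.

```python
from typing import Optional

# Leading byte signatures. The JBIG2 entry only applies to standalone
# JBIG2 files; streams embedded in a PDF have this header stripped.
MAGIC = {
    "jpeg": b"\xff\xd8",
    "png": b"\x89PNG\r\n\x1a\n",
    "jbig2": b"\x97JB2\r\n\x1a\n",
}


def sniff_format(data: bytes) -> Optional[str]:
    """Return a format name if data starts with a known signature."""
    for name, magic in MAGIC.items():
        if data.startswith(magic):
            return name
    # No signature matched: the caller has to be told what the data is.
    return None
```

A header-less stream makes `sniff_format` return `None`, which is exactly the situation described above: without outside knowledge the bytes cannot be attributed to a format.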

Poster

Thanks for the explanations. However, it seems to me that [this answer](https://stackoverflow.com/a/27713306) claims that a "normal" JBIG2 file has a "header" and a "tail" which are "stripped" in the PDF stream. This looks to me like an identification, similar to the `0xFF 0xD8` that you mentioned.

Furthermore, if I understand correctly, no further metadata remains after conversion to `png` or `pbm` (I believe that "normal" JPEG images contain more metadata, such as EXIF, in addition to `0xFF 0xD8`). Thus what you really need is just the identification that this piece of data is encoded by JBIG2?
josch commented 3 years ago
Owner

Yes, looks like there should be a header and if that also contains the size of the image then that should be enough to support JBIG2 as input.

But there seems to be no JBIG2 encoder in the operating system I'm using (Debian) so unless you can provide a JBIG2 file I don't see how I can add support for it to img2pdf.

Poster

I did some tests and seemingly the size data `(height, width, xppi, yppi)` is contained in the JBIG2 stream (not the header). I passed the JBIG2 stream generated by `pdfimages` to the code https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741#file-pdfsimp-py-L142

A JBIG2 encoder is available [here](https://github.com/agl/jbig2enc). It is not included in GNU/Linux because JBIG2 encoding was [long patented and possibly still covered by unknown patents](https://github.com/agl/jbig2enc).

I will open a feature request asking for a conservative container for JBIG2 and CCITT (for the latter, simply add a layer of tiff). "Conservative" means that the wrapping costs nothing and, in particular, that no essential conversion is performed.
josch commented 3 years ago
Owner

The next issue: with jbig2enc not being present in Linux distros and being encumbered by patents and no JBIG2 support in imagemagick or PIL, I cannot add any testcases for JBIG2 support to the img2pdf testsuite.

Again: why don't you just run pdfimages -png?


> Furthermore, if I understand correctly, there is no more metadata after conversion to png or pbm (I believe that "normal" JPEG images contain more metadata, such as EXIF, in addition to 0xFF 0xD8)

If you are dealing with a PDF that contains JPEG images, you can just use `pdfimages -j` / `pdfimages -jp2`, and you will get the original JPEG files with all metadata. However, your issue is about jbig2 streams, where you shouldn't lose anything when converting to PNG.
The [PdfImage helper](https://pikepdf.readthedocs.io/en/latest/topics/images.html) of pikepdf handles the different ways images can be included in PDFs and will automatically choose the best format.
Poster

> The next issue: with jbig2enc not being present in Linux distros and being encumbered by patents and no JBIG2 support in imagemagick or PIL, I cannot add any testcases for JBIG2 support to the img2pdf testsuite.
>
> Again: why don't you just run `pdfimages -png`?

I compile `jbig2enc` myself on my computer (any patent, if still existent, should not affect personal use). For the public test suite, I believe that the [test suite of `ocrmypdf`](https://ocrmypdf.readthedocs.io/en/latest/docker.html?highlight=test%20suite#executing-the-test-suite) should be closely related: it has `jbig2enc` compression options if `jbig2enc` is installed.

There is at least a computational expense in converting to `png` and then using `jbig2enc` to compress again, which is in fact redundant. The extra compression is not that cheap: I spent around an hour compressing the images in big PDF files (around 100 MB) via [pdfsizeopt](https://github.com/pts/pdfsizeopt). `ocrmypdf` is more efficient but it still costs time.
josch commented 3 years ago
Owner

If it's somehow possible then yes, img2pdf should support jbig2 as input. One major reason is the one you cite in your last message: we avoid useless encoding computations in the same way that we avoid those when embedding JPEG or PNG images into the PDF container without re-encoding them.

But why do you insist on using jbig2enc in the first place? I cannot find any bilevel image where using jbig2 leads to any significant space reduction compared to the compression that img2pdf uses by default.

Are you somehow able to share an example image where using jbig2 over the alternatives really has a positive impact on the file size?

josch commented 3 years ago
Owner

Okay, I obtained a PDF containing a JBIG2 encoded image. This is becoming more troublesome... Instead of just containing one blob per image, the PDF contains two blobs for each image. One is the /JBIG2Globals object and one is the /XObject itself. So if I run pdfimages -all on that PDF I get a jb2e file and a jb2g file for each image.

According to the answer in https://stackoverflow.com/questions/27709913/jbig2-data-in-pdf-is-not-valid-jbig2-data-wrong-magic/27713306#27713306 it seems that indeed the PDF does not contain the header containing the crucial information that this is a JBIG2 file (the magic) and the image dimensions.

And then there is the problem that we have to somehow teach img2pdf to treat two input images as one. I do not see how to possibly do this in practice.

Any suggestions?


> Okay, I obtained a PDF containing a JBIG2 encoded image.

Could you please share the link/file?
Poster

> Okay, I obtained a PDF containing a JBIG2 encoded image. This is becoming more troublesome... Instead of just containing one blob per image, the PDF contains two blobs for each image. One is the `/JBIG2Globals` object and one is the `/XObject` itself. So if I run `pdfimages -all` on that PDF I get a jb2e file and a jb2g file for each image.
>
> And then there is the problem that we have to somehow teach img2pdf to treat *two* input images as one. I do not see how to possibly do this in practice.
>
> Any suggestions?

Theoretically, it should be the job of `pdfimages` to produce a JBIG2 file with header (instead of two files containing raw streams). Meanwhile, you could provide a script to do so, but in my opinion, it should not be integrated into the `img2pdf` executable itself.

To be clear, for my PDF, I have only extracted a `jb2e` file, the embedded stream (`e`=`embedded`), without "global data" (`g`=`global`).

> it seems that indeed the PDF does not contain the header containing the crucial information that this is a JBIG2 file (the magic) and the image dimensions.

It seems that the JBIG2 stream *contains* the image dimensions. There are two pieces of evidence:

1. Stronger: the image dimensions can be computed from the JBIG2 stream in constant time. I have mentioned this before:

   > I did some tests and seemingly the size data (height, width, xppi, yppi) is contained in the JBIG2 stream (not the header). I passed the JBIG2 stream generated by pdfimages to the code https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741#file-pdfsimp-py-L142

   I have not learned Python, but I don't think that this code is about reading the last segment as mentioned in https://stackoverflow.com/a/27713306

2. Weaker: the image dimensions are computable from the JBIG2 stream. It is possible to convert a header-less JBIG2 embedded stream into a `png` file, therefore the dimension data should be computable from the raw stream per se: https://unix.stackexchange.com/a/591790
Poster

> But why do you insist on using jbig2enc in the first place? I cannot find any bilevel image where using jbig2 leads to any significant space reduction compared to the compression that img2pdf uses by default.
>
> Are you somehow able to share an example image where using jbig2 over the alternatives really has a positive impact on the file size?

I encountered an extreme example just yesterday: https://www.e-periodica.ch/cntmng?pid=ens-001:1968:14::46

You could extract the images, then

1. use `img2pdf` to produce a merged pdf;
2. pass it to `jbig2enc` losslessly as described in https://github.com/agl/jbig2enc/issues/24#issuecomment-204697193

The difference is significant in this case. In general, I find that `jbig2` via `ocrmypdf` usually reduces the size of scanned monochrome documents by around 20%.

When I pass this document to imageextractor.py, I obtain 5 jpg images, which together have the size of 4.846.095 B (merged back into a PDF, it takes 4.849.576 B), with the original PDF being 4.872.211 B (first page is non-image). With the first page merged in using pdftk, I get a file of 4.851.092 B, which is a very minor increase in size compared to the original. From looking at the images, I cannot see a visual difference to the input PDF.
An example PDF with JBIG2-encoded images together with the original image files would be really interesting, because then I could verify whether extraction is truly lossless...

Poster

Oh, I see, it is not lossless: `JBIG2` has only 1 bit per pixel, while the original `JPEG` has 8 bits per pixel. Still, a ratio of 1/20 seems strange.


Ah, I was confused. I thought the file you linked was supposed to contain JBIG2 streams already, but this was wrong. I didn't read carefully, sorry... With the compressed PDF you uploaded, I now understand your problem. The original file is 231 KiB, and the five images extracted as PNGs are 690 KiB, which indeed is a considerable increase in file size.


> It is then natural to select some of them to assemble another pdf file.

Unless you need to edit the images, it might be easiest to work with the PDF document and a tool like pdftk or similar to remove or add pages. This would avoid the increase in size caused by extracting images and merging back.
And if you do need to edit the images, then you can't work with jbig2 anyway...
Poster

> > It is then natural to select some of them to assemble another pdf file.
>
> Unless you need to edit the images, it might be easiest to work with the PDF document and a tool like pdftk or similar to remove or add pages. This would avoid the increase in size caused by extracting images and merging back.

I know that extracting pages could be done by `qpdf` (some seem to object to `pdftk`).
geer0Eec closed this issue 3 years ago
geer0Eec reopened this issue 3 years ago
josch commented 3 years ago
Owner

> I encountered an extreme example just yesterday: https://www.e-periodica.ch/cntmng?pid=ens-001:1968:14::46
>
> You could extract the images, then
>
> 1. use `img2pdf` to produce a merged pdf;
> 2. pass it to `jbig2enc` losslessly as described in https://github.com/agl/jbig2enc/issues/24#issuecomment-204697193
>
> The difference is significant in this case. In general, I find that `jbig2` via `ocrmypdf` usually reduces the size of scanned monochrome documents by around 20%.

The images in that PDF are not monochrome but grayscale. JBIG2 is for bilevel images, so this will be a lossy conversion.
josch commented 3 years ago
Owner

> > it seems that indeed the PDF does not contain the header containing the crucial information that this is a JBIG2 file (the magic) and the image dimensions.
>
> It seems that the JBIG2 stream *contains* the image dimensions. There are two pieces of evidence:
>
> 1. Stronger: the image dimensions can be computed from the JBIG2 stream in constant time. I have mentioned this before:
>
>    > I did some tests and seemingly the size data (height, width, xppi, yppi) is contained in the JBIG2 stream (not the header). I passed the JBIG2 stream generated by pdfimages to the code https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741#file-pdfsimp-py-L142
>
>    I have not learned Python, but I don't think that this code is about reading the last segment as mentioned in https://stackoverflow.com/a/27713306
>
> 2. Weaker: the image dimensions are computable from the JBIG2 stream. It is possible to convert a header-less JBIG2 embedded stream into a `png` file, therefore the dimension data should be computable from the raw stream per se: https://unix.stackexchange.com/a/591790

Then I think maybe you should first approach the Pillow project to add support for reading JBIG2 images? That seems to be a better place than img2pdf for code that parses JBIG2 files and can extract information like image dimensions.
josch commented 3 years ago
Owner

I also just confirmed that lossless JBIG2 compresses some output better than CCITT4. I got some bilevel PDF from here:

https://www.jbig2dec.com/tests/index.html

Then converted 042.bmp to JBIG2 by using:

./src/jbig2 -p -v 042.bmp > 042.jb2

And created a PDF from it using https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741 like so:

python2 pdfsimp.py 042.jb2 > out.pdf

The resulting PDF is only 46K in size. If I use img2pdf to convert 042.bmp to a PDF using CCITT4, the resulting PDF is 68K in size.

I used pdfimages to extract the embedded images from both pdfs and then compared them using:

compare -metric AE jb2-000.pbm img2pdf-000.pbm diff.png

Indeed there is not a single pixel difference even though the PDF containing jb2 data is much smaller.

Looking at the code of pdfsimp.py, the width, height and horizontal as well as vertical resolution can indeed be obtained from bytes 11 to 27 of the jb2 file.
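
A minimal sketch of that lookup, assuming the stream begins with a page information segment (type 48) whose header is 11 bytes long: a 4-byte segment number, one flag byte, one referred-to byte, one page-association byte and a 4-byte data length. That is what the start of 042.jb2 in the hex dump in this comment looks like. The function name `read_page_info` is made up for illustration, and this is not necessarily how pdfsimp.py is structured:

```python
import struct

# First 32 bytes of 042.jb2, copied from the hex dump in this comment.
HEAD = bytes.fromhex(
    "0000 0000 3000 0100 0000 1300 0006 c000"
    "0009 2300 0000 0000 0000 0001 0000 0000"
)


def read_page_info(stream):
    """Return (width, height, xres, yres) from a leading page information segment.

    Assumption: the segment header is exactly 11 bytes (short segment number
    form, no referred-to segments, 1-byte page association).
    """
    if len(stream) < 27:
        raise ValueError("stream too short for a page information segment")
    segment_type = stream[4] & 0x3F  # low six bits of the flag byte
    if segment_type != 48:
        raise ValueError("first segment is not a page information segment")
    # Four big-endian 32-bit integers: width, height, x and y resolution.
    return struct.unpack(">IIII", stream[11:27])


print(read_page_info(HEAD))
```

For the bytes above this yields a width of 1728 and a height of 2339, with both resolution fields set to 0, which presumably means the resolution is unknown. A stream whose first segment is something else (for example a symbol dictionary) would need a real segment parser.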

My only remaining problem now is that I still don't know how to identify the files produced by the jbig2 command above. The file starts with:

00000000: 0000 0000 3000 0100 0000 1300 0006 c000  ....0...........
00000010: 0009 2300 0000 0000 0000 0001 0000 0000  ..#.............

So there seems to be no magic byte sequence identifying the file type. If I run the tool `file`, it just tells me that the file contains "data", meaning that it cannot identify the file.

So now the only remaining piece we need is some way to identify the file as JBIG2. Any ideas?


> So now the only remaining piece we need is some way to identify the file as JBIG2. Any ideas?

Not exactly a creative or reliable idea, but perhaps just the file extensions .jb2 / .jbig2 / .jb2e?

josch commented 3 years ago
Owner

> > So now the only remaining piece we need is some way to identify the file as JBIG2. Any ideas?
>
> Not exactly a creative or reliable idea, but perhaps just the file extensions .jb2 / .jbig2 / .jb2e?

No. This is not MS Windows.


Then it might be hard, given that not even the `file` command is able to detect JBIG2...

Poster

> So now the only remaining piece we need is some way to identify the file as JBIG2. Any ideas?

This seems quite impossible; it is really the job of `pdfimages`. I don't know whether they are reluctant (https://gitlab.freedesktop.org/poppler/poppler/-/issues/1106#note_1012073) to add this extra magic code. Note that their logic of distinguishing the outputs by their extensions (while keeping the data as is) is kind of the MS Windows approach that you mentioned above.

Yet another unqualified idea of mine: Maybe the format of bytes 11 to 27 could be used to identify jbig2 files?

josch commented 3 years ago
Owner

> Yet another unqualified idea of mine: Maybe the format of bytes 11 to 27 could be used to identify jbig2 files?

No. Those contain the size and dpi and can thus be arbitrary integers.

josch commented 2 years ago
Owner

Closing, since there seems to be no good way to identify the format without a proper file magic of the non-existing container.

josch closed this issue 2 years ago

jbig2enc can give two types of outputs. The one you've been using is "symbol mode", which finds identical looking "symbols", and deduplicates them. This is useful when compressing files directly generated from a word processor, where letters often look exactly the same, pixel-by-pixel. For this mode it outputs a ".sym" file with all the common symbols, and a ".000", ".001", etc file for each image.

Indeed, when using symbol mode there is no good way to detect that these files are JBIG2 files.

However, jbig2enc also has a "generic coder", which does not do this deduplication between images. This often still yields very good results, especially for scanned documents where the same letter is not identical between different instances, and so the deduplication of symbol mode doesn't help much. This is the mode that you get when running jbig2 example.png > example.jb2.

For the generic coder, you can detect the file, since it has a magic header 0x97, 0x4a, 0x42, 0x32, 0x0d, 0x0a, 0x1a, 0x0a. This is defined in the spec, as seen on page 131 of https://github.com/agl/jbig2enc/blob/master/fcd14492.pdf.
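
A minimal sketch of such a check, using only the eight magic bytes quoted above. The function name `looks_like_jbig2_file` is hypothetical and not part of img2pdf:

```python
# The eight-byte ID string quoted above: 0x97 'J' 'B' '2' CR LF 0x1A LF.
JBIG2_MAGIC = bytes([0x97, 0x4A, 0x42, 0x32, 0x0D, 0x0A, 0x1A, 0x0A])


def looks_like_jbig2_file(data):
    """Return True if data starts with the JBIG2 file header ID string."""
    return data[:8] == JBIG2_MAGIC


# The headerless stream from the earlier hex dump is, as expected, not recognised.
embedded = bytes.fromhex("0000 0000 3000 0100 0000 1300 0006 c000")
print(looks_like_jbig2_file(embedded))
print(looks_like_jbig2_file(JBIG2_MAGIC + b"\x01\x00\x00\x00\x01"))
```

This only says that a file claims to be JBIG2. It says nothing about which coding mode was used or how many pages the file holds.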

We could therefore support this "generic coder" variant of JBIG2. I'll see if I can make a basic implementation.


I'm learning more and more every minute I'm looking into this... What I said above is not entirely accurate: you got the different files because you used their "PDF mode". Both generic coding and symbol mode are supported in the JBIG2 file format (that starts with the magic header).

In fact, the JBIG2 file format supports multiple pages, to accommodate a common symbol lookup table between the pages.

We don't have to support all that to start. We could start with only the simple case of a JBIG2 file with a single page.
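
As a rough sketch of how the single-page case could be recognised from the file header alone: the layout assumed here (one flag byte after the magic, with bit 1 meaning "number of pages unknown", followed by a 4-byte big-endian page count when that bit is clear) is my reading of the file header description in the spec and should be double-checked against it. The function names are hypothetical:

```python
import struct

JBIG2_MAGIC = b"\x97JB2\r\n\x1a\n"


def jbig2_page_count(data):
    """Return the page count declared in the file header, or None if unknown."""
    if data[:8] != JBIG2_MAGIC:
        raise ValueError("not a JBIG2 file (magic header missing)")
    if len(data) < 9:
        raise ValueError("truncated JBIG2 file header")
    flags = data[8]
    if flags & 0x02:
        # Assumed meaning of bit 1: the page count field is absent.
        return None
    if len(data) < 13:
        raise ValueError("truncated JBIG2 file header")
    (pages,) = struct.unpack(">I", data[9:13])
    return pages


def is_single_page(data):
    """True only when the header explicitly declares exactly one page."""
    return jbig2_page_count(data) == 1


print(is_single_page(JBIG2_MAGIC + b"\x01" + struct.pack(">I", 1)))
```

Files that do not declare a page count would have to be either rejected or scanned segment by segment, which this sketch does not attempt.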


Indeed, you're right: JBIG2 is a standalone image format after all.
Sorry about my old comments above; with the current state of knowledge, they were clearly quite wrong.

Given the chance for considerably better compression I agree support for JBIG2 in img2pdf would be nice to have. Thanks @ooBJ3u for working on this. It might even make sense to set lossless JBIG2 as default codec for monochrome input in the future, if jbig2enc is available?

As a separate matter, I believe a useful complement to support in img2pdf might be a utility to merge the "stripped form" as stored in PDF back into an actual JBIG2 file, similar to fax2tiff for CCITT, and also have pikepdf just re-create the JBIG2 wrapper instead of transcoding to PNG, to allow for seamless use of the format in a PDF image extract and rewrap pipeline.

Poster

> As a separate matter, I believe a useful complement to support in img2pdf might be a utility to merge the "stripped form" as stored in PDF back into an actual JBIG2 file, similar to the fax2tiff utility for CCITT, and also pikepdf just re-creating the JBIG2 wrapper instead of transcoding to PNG, to allow for seamless use of the format in a PDF image extract and rewrap pipeline.

Incidentally, there were very recent comments at https://gitlab.freedesktop.org/poppler/poppler/-/issues/1106#note_2180790 which refer to the same JBIG2 spec. Unfortunately, it seems quite improbable that pdfimages would implement such a thing unless somebody volunteers to submit a PR.


> Unfortunately, it seems quite improbable that pdfimages would implement such a thing unless somebody volunteers to submit a PR.

Yeah, I doubt if this will ever be implemented in poppler. I was more thinking of some python script the caller could use on pdfimages' stripped jbig2 output.
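
A very rough sketch of what the core of such a script might look like. It only prepends a file header (the magic, a flag byte of 0x01 for sequential organisation with a known page count, and a page count of 1) to the stripped stream, optionally preceded by the globals stream. All of this is an assumption based on the file header layout in the spec. Some decoders may also expect end-of-page and end-of-file segments, which this sketch does not add, and the function name is hypothetical:

```python
import struct

JBIG2_MAGIC = b"\x97JB2\r\n\x1a\n"


def wrap_embedded_jbig2(stream, globals_data=b""):
    """Prepend a one-page, sequential JBIG2 file header to an embedded stream.

    stream:       the stripped JBIG2 data as stored in the PDF image object
    globals_data: contents of the shared globals stream, if the PDF has one
    """
    # Assumed header: 8-byte ID string, flag byte 0x01, 4-byte page count of 1.
    header = JBIG2_MAGIC + b"\x01" + struct.pack(">I", 1)
    return header + globals_data + stream


# Usage with the first bytes of the headerless stream shown earlier in the thread.
embedded = bytes.fromhex("0000 0000 3000 0100 0000 1300 0006 c000")
wrapped = wrap_embedded_jbig2(embedded)
print(wrapped[:13].hex())
```

Whether the result decodes would need to be checked with an actual decoder such as jbig2dec before relying on it.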


What adds complication with pikepdf is the possible shared globals stream. I'm not sure how one would handle that, given the current API is operating on image level, not document level. Embedding shared globals in every individual output image doesn't seem elegant.

Reference: josch/img2pdf#112