An example: when extracting images from a `pdf` file by `pdfimages -all`, the result contains some JBIG2 images. It is then natural to select some of them to assemble another `pdf` file. Currently, it seems to me that one needs to first use `jbig2dec` to decode them, then encode them back again via `jbig2` to produce a pdf, which seems to be lossy and cumbersome. I hope that one could assemble these JBIG2 ~~streams~~ images directly via `img2pdf`.
~~It seems also reasonable to support not only JBIG2 streams, but also image files themselves.~~
Could you please share an example PDF so that we comprehend your problem better?
I tend to believe this might rather be an issue with the tool you're using to extract images, for it should reconstruct the actual image rather than just save a jbig2 stream.
The `pdf` that I am dealing with is not permitted to share (which however consists of many scanned pages along with some decorations and frames), but the result of `pdfimages -list` looks like
```
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
4 3 image 3790 5447 gray 1 1 jbig2 no 1050 0 600 600 68.7K 2.7%
5 4 image 3790 5447 gray 1 1 jbig2 no 1052 0 600 600 50.6K 2.0%
```
and the output of `pdfimages -all` will produce `jb2e` files, which are, if I understand correctly, JBIG2 streams (without header). I would like to extract some of these scanned pages (i.e. `jb2e` files) to get a new `pdf` file.
Then maybe you could just try a different tool that extracts real images rather than jbig2 streams? I personally use a self-written script employing the PdfImage helper model of pikepdf to extract images from PDFs.
This is my script in case it helps you. It also searches for images inside Form XObjects and applies a possible `/SMask`. However, it does not work for some types of images, since they are not supported by the PdfImage model yet (e.g. CMYK).
(Had to append the .txt extension to the file as gitea seems not to accept .py)
Hi,
I want to understand the problem. What is a "jbig2 image"? Which software produces those? jbig2 is not an image format but a way to encode bilevel data. Why would converting a jbig2 stream be lossy? Why don't you just use `pdfimages -png` instead?
Maybe I misunderstood something. In the manpage of `pdfimages`, the "formats" `JPEG`, `JPEG2000`, `JBIG2` and `CCITT` are listed in parallel. I think that if I specify `-png`, it would induce a conversion (everything to `png`), and I suppose that any of these conversions might be lossy (i.e. if I convert back and forth, I will get a different file). I imagine that it should be possible to avoid conversions altogether.
JBIG2 and CCITT are ways to encode bilevel image data, but they are not "formats" in the same sense as JPEG or PNG: they have no header that identifies which kind of file it is, what the dimensions are, and other metadata. For example, a JPEG image will start with the bytes 0xFF 0xD8, and this tells the program reading the file that this is a JPEG image. JBIG2 and CCITT have no such header, and thus, to understand the data from a JBIG2 or CCITT file, you need to somehow know that it is a JBIG2 or CCITT file. This is why image manipulation programs like gimp or photoshop will not let you save or open JBIG2 and CCITT files. Also, when you run `magick identify -list format` from imagemagick, JBIG2 and CCITT will not be listed. Those two are just ways to encode bilevel data, but they are missing a container. Without knowing that the file contains JBIG2 or CCITT data, the file just contains junk. So if we wanted img2pdf to support JBIG2 and CCITT, then we would need some way to tell img2pdf that the file we pass to it contains JBIG2 or CCITT data.
I cannot come up with a situation in which storing JBIG2 or CCITT data as PNG would be lossy. Can you?
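To illustrate the point about headers, here is a minimal sketch of sniffing a file by magic bytes (the JPEG and PNG signatures are the standard ones; the last input below is the start of a raw JBIG2 stream, which matches nothing):

```python
# Identify a file by its leading magic bytes. An embedded JBIG2 or
# CCITT stream extracted from a PDF starts with arbitrary data, so it
# cannot be recognized this way.
MAGICS = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
}

def sniff(data):
    for magic, name in MAGICS.items():
        if data.startswith(magic):
            return name
    return None  # unidentifiable, e.g. a raw JBIG2/CCITT stream

print(sniff(bytes.fromhex("ffd8ffe0")))          # JPEG
print(sniff(bytes.fromhex("89504e470d0a1a0a")))  # PNG
print(sniff(bytes.fromhex("0000000030000100")))  # None
```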
Thanks for the explanations. However, it seems to me that [this answer](https://stackoverflow.com/a/27713306) claims that there are a "header" and a "tail" in a "normal" JBIG2 file which are "stripped" in the PDF stream. That looks like an identification mark, similar to the `0xFF 0xD8` you mentioned.
Furthermore, if I understand correctly, there is no more metadata after conversion to `png` or `pbm` (I believe that "normal" JPEG images contain more metadata, such as EXIF, in addition to `0xFF 0xD8`). Thus what you really need is just the identification that this piece of data is encoded by JBIG2?
Yes, looks like there should be a header and if that also contains the size of the image then that should be enough to support JBIG2 as input.
But there seems to be no JBIG2 encoder in the operating system I'm using (Debian) so unless you can provide a JBIG2 file I don't see how I can add support for it to img2pdf.
I did some tests, and seemingly the size data `(height, width, xppi, yppi)` is contained in the JBIG2 stream (not the header). I pass the JBIG2 stream generated by `pdfimages` to the code https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741#file-pdfsimp-py-L142
A JBIG2 encoder is available [here](https://github.com/agl/jbig2enc). It is not included in GNU/Linux because JBIG2 encoding was [long patented and possibly still covered by unknown patents](https://github.com/agl/jbig2enc).
I will open a feature request to ask for a conservative container for JBIG2 and CCITT (for the latter, simply add a TIFF layer). "Conservative" means that this is performed at no cost; in particular, no actual conversion is performed.
The next issue: with jbig2enc not being present in Linux distros and being encumbered by patents and no JBIG2 support in imagemagick or PIL, I cannot add any testcases for JBIG2 support to the img2pdf testsuite.
Again: why don't you just run `pdfimages -png`?
> Furthermore, if I understand correctly, there is no more metadata after conversion to png or pbm (I believe that "normal" JPEG images contain more metadata, such as EXIF, in addition to 0xFF 0xD8)
If you are dealing with a PDF that contains JPEG images, you can just use `pdfimages -j` / `pdfimages -jp2`, and you will get the original JPEG files with all metadata. However, your issue is about jbig2 streams, where you shouldn't lose anything when converting to PNG.
The [PdfImage helper](https://pikepdf.readthedocs.io/en/latest/topics/images.html) of pikepdf handles the different ways how images can be included in PDFs and will automatically choose the best format.
> The next issue: with jbig2enc not being present in Linux distros and being encumbered by patents and no JBIG2 support in imagemagick or PIL, I cannot add any testcases for JBIG2 support to the img2pdf testsuite.
>
> Again: why don't you just run `pdfimages -png`?
I compile `jbig2enc` myself on my computer (any patent, if still existent, should not apply to personal use). For the public test suite, I believe that the [test suite of `ocrmypdf`](https://ocrmypdf.readthedocs.io/en/latest/docker.html?highlight=test%20suite#executing-the-test-suite) should be closely related - they have `jbig2enc` compression options if installed.
There is at least a computational expense in converting to `png` and then using `jbig2enc` to compress again, which is in fact redundant. The extra compression is not that cheap - I spent around an hour compressing the images in big PDF files (around 100 MB) via [pdfsizeopt](https://github.com/pts/pdfsizeopt). `ocrmypdf` is more efficient, but it still costs time.
If it's somehow possible then yes, img2pdf should support jbig2 as input. One major reason is the one you cite in your last message: we avoid useless encoding computations in the same way that we avoid those when embedding JPEG or PNG images into the PDF container without re-encoding them.
But why do you insist on using jbig2enc in the first place? I cannot find any bilevel image where using jbig2 leads to any significant space reduction compared to the compression that img2pdf uses by default.
Are you somehow able to share an example image where using jbig2 over the alternatives really has a positive impact on the file size?
Okay, I obtained a PDF containing a JBIG2 encoded image. This is becoming more troublesome... Instead of just containing one blob per image, the PDF contains two blobs for each image. One is the `/JBIG2Globals` object and one is the `/XObject` itself. So if I run `pdfimages -all` on that PDF I get a jb2e file and a jb2g file for each image.
According to the answer in https://stackoverflow.com/questions/27709913/jbig2-data-in-pdf-is-not-valid-jbig2-data-wrong-magic/27713306#27713306 it seems that indeed the PDF does not contain the header containing the crucial information that this is a JBIG2 file (the magic) and the image dimensions.
And then there is the problem that we have to somehow teach img2pdf to treat *two* input images as one. I do not see how to possibly do this in practice.
Any suggestions?
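For context, this is roughly how a PDF embeds a JBIG2 image (a sketch only; object numbers, dimensions and `/Length` are illustrative). The stream carries the raw segments without the JBIG2 file header, and shared segments, if any, live in a separate stream referenced as `/JBIG2Globals`:

```
10 0 obj
<< /Type /XObject /Subtype /Image
   /Width 3790 /Height 5447
   /ColorSpace /DeviceGray /BitsPerComponent 1
   /Filter /JBIG2Decode
   /DecodeParms << /JBIG2Globals 11 0 R >>   % only present if there is a jb2g blob
   /Length 70312 >>
stream
...raw embedded JBIG2 segments (the jb2e blob)...
endstream
endobj
```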
> Okay, I obtained a PDF containing a JBIG2 encoded image. This is becoming more troublesome... Instead of just containing one blob per image, the PDF contains two blobs for each image. One is the `/JBIG2Globals` object and one is the `/XObject` itself. So if I run `pdfimages -all` on that PDF I get a jb2e file and a jb2g file for each image.
> And then there is the problem that we have to somehow teach img2pdf to treat *two* input images as one. I do not see how to possibly do this in practice.
>
> Any suggestions?
Theoretically, it should be the job of `pdfimages` to produce a JBIG2 file with a header (instead of two files containing raw streams). Meanwhile, you could provide a script to do so, but in my opinion, it should not be integrated into the `img2pdf` executable itself.
To be clear, for my PDF, I have only extracted a `jb2e` file, the embedded stream (`e`=`embedded`), without "global data" (`g`=`global`).
> it seems that indeed the PDF does not contain the header containing the crucial information that this is a JBIG2 file (the magic) and the image dimensions.
It seems that the JBIG2 stream *contains* the image dimensions. There are two pieces of evidence:
1. Stronger: the image dimensions can be computed from the JBIG2 stream in constant time. I have mentioned this before:
> I did some tests and seemingly the size data (height, width, xppi, yppi) is contained in the JBIG2 stream (not the header). I pass the JBIG2 stream generated by pdfimages to the code https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741#file-pdfsimp-py-L142
I haven't learned Python, but I don't think that this code is about reading the last segment as mentioned in https://stackoverflow.com/a/27713306
2. Weaker: the image dimensions are computable from the JBIG2 stream. It is possible to convert a headerless embedded JBIG2 stream into a `png` file, therefore the dimension data should be computable from the raw stream per se: https://unix.stackexchange.com/a/591790
> But why do you insist on using jbig2enc in the first place? I cannot find any bilevel image where using jbig2 leads to any significant space reduction compared to the compression that img2pdf uses by default.
>
> Are you somehow able to share an example image where using jbig2 over the alternatives really has a positive impact on the file size?
I have just encountered an extreme example yesterday: https://www.e-periodica.ch/cntmng?pid=ens-001:1968:14::46
You could extract the images, then
1. use `img2pdf` to produce a merged pdf;
2. pass to `jbig2enc` losslessly as described in https://github.com/agl/jbig2enc/issues/24#issuecomment-204697193
The difference is significant in this case. In general, I find that `jbig2` usually reduces the size of scanned monochrome documents by around 20% via `ocrmypdf`.
When I pass this document to `imageextractor.py`, I obtain 5 jpg images, which together have the size of 4.846.095 B (merged back into a PDF, it takes 4.849.576 B), with the original PDF being 4.872.211 B (first page is non-image). With the first page merged in using pdftk, I get a file of 4.851.092 B, which is a very minor increase in size compared to the original. From looking at the images, I cannot see a visual difference to the input PDF.
An example PDF with JBIG2-encoded images together with the original image files would be really interesting, because then I could verify whether extraction is truly lossless...
Oh, I see, it is not lossless - `JBIG2` has only 1 bit per pixel, while the original `JPEG` has 8 bits per pixel, but a ratio of 1/20 still seems strange.
Ah, I was confused. I thought the file you linked was supposed to contain JBIG2 streams already, but this was wrong. I didn't read carefully, sorry... With the compressed PDF you uploaded, I now understand your problem. The original file is 231 KiB, and the five images extracted as PNGs are 690 KiB, which indeed is a considerable increase in file size.
> It is then natural to select some of them to assemble another pdf file.
Unless you need to edit the images, it might be easiest to work with the PDF document and a tool like pdftk or similar to remove or add pages. This would avoid the increase in size caused by extracting images and merging back.
And if you do need to edit the images, then you can't work with jbig2 anyway...
> > It is then natural to select some of them to assemble another pdf file.
>
> Unless you need to edit the images, it might be easiest to work with the PDF document and a tool like pdftk or similar to remove or add pages. This would avoid the increase in size caused by extracting images and merging back.
I know that extracting pages can be done by `qpdf` (some seem to object to `pdftk`).
> I have just encountered an extreme example yesterday: https://www.e-periodica.ch/cntmng?pid=ens-001:1968:14::46
>
> You could extract the images, then
>
> 1. use `img2pdf` to produce a merged pdf;
> 2. pass to `jbig2enc` losslessly as described in https://github.com/agl/jbig2enc/issues/24#issuecomment-204697193
>
> The difference is significant in this case. In general, I find that `jbig2` usually reduces the size of scanned monochrome documents by around 20% via `ocrmypdf`.
The images in that PDF are not monochromatic but grayscale. But JBIG2 is for bilevel images, so this will be a lossy conversion.
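As a small illustration (assuming Pillow is available; the four pixel values below are made up), reducing 8-bit grayscale to the single bit per pixel that JBIG2 encodes necessarily discards information:

```python
from PIL import Image

# a tiny 8-bit grayscale "scan" with four distinct gray levels
gray = Image.new("L", (4, 1))
gray.putdata([0, 85, 170, 255])

# mode "1" is bilevel; threshold explicitly so that Pillow's default
# dithering does not obscure the loss
bilevel = gray.point(lambda v: 255 if v >= 128 else 0).convert("1")

print(list(gray.getdata()))     # [0, 85, 170, 255]
print(list(bilevel.getdata()))  # [0, 0, 255, 255]
```

The two middle gray levels collapse onto black and white; converting back to grayscale cannot recover them.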
> > it seems that indeed the PDF does not contain the header containing the crucial information that this is a JBIG2 file (the magic) and the image dimensions.
>
> It seems that the JBIG2 stream *contains* the image dimensions. There are two pieces of evidence:
>
> 1. Stronger: the image dimensions can be computed from the JBIG2 stream in constant time. I have mentioned this before:
>
> > I did some tests and seemingly the size data (height, width, xppi, yppi) is contained in the JBIG2 stream (not the header). I pass the JBIG2 stream generated by pdfimages to the code https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741#file-pdfsimp-py-L142
>
> I haven't learned Python, but I don't think that this code is about reading the last segment as mentioned in https://stackoverflow.com/a/27713306
>
> 2. Weaker: the image dimensions are computable from the JBIG2 stream. It is possible to convert a headerless embedded JBIG2 stream into a `png` file, therefore the dimension data should be computable from the raw stream per se: https://unix.stackexchange.com/a/591790
Then I think maybe you should first approach the Pillow project to add support for reading JBIG2 images? That seems to be a better place than img2pdf for code that parses JBIG2 files and can extract information like image dimensions.
I also just confirmed that lossless JBIG2 compresses some images better than CCITT4. I got some bilevel test images from here:
https://www.jbig2dec.com/tests/index.html
Then I converted 042.bmp to JBIG2 using:
```
./src/jbig2 -p -v 042.bmp > 042.jb2
```
And created a PDF from it using https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741 like so:
```
python2 pdfsimp.py 042.jb2 > out.pdf
```
The resulting PDF is only 46K. If I use img2pdf to convert 042.bmp to a PDF using CCITT4, the resulting PDF is 68K in size.
I used pdfimages to extract the embedded images from both PDFs and then compared them using:
```
compare -metric AE jb2-000.pbm img2pdf-000.pbm diff.png
```
Indeed there is not a single pixel difference even though the PDF containing jb2 data is much smaller.
Looking at the code of pdfsimp.py, the width, height and horizontal as well as vertical resolution can indeed be obtained from bytes 11 to 27 of the jb2 file.
My only remaining problem now is that I still don't know how to identify the files produced by the `jbig2` command above. The file starts with:
```
00000000: 0000 0000 3000 0100 0000 1300 0006 c000  ....0...........
00000010: 0009 2300 0000 0000 0000 0001 0000 0000  ..#.............
```
So there seems to be no magic byte sequence identifying the file type. If I run the tool `file`, then it just tells me that the file contains "data", meaning that it cannot identify the file.
So now the only remaining piece we need is some way to identify the file as JBIG2. Any ideas?
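Reading those bytes can be sketched in a few lines of Python, assuming (as pdfsimp.py seems to) that the file starts with an 11-byte segment header followed by the page information data; the blob constructed below is synthetic, not taken from a real file:

```python
import struct

def jbig2_page_info(data):
    """Read width, height, xres and yres from a jbig2enc generic-region
    file: an 11-byte segment header (segment number, flags, referred-to
    segments, page association, data length) is followed by four
    big-endian 32-bit integers, i.e. bytes 11 to 27."""
    return struct.unpack(">IIII", data[11:27])

# build a synthetic blob: 11 arbitrary header bytes, then the four integers
blob = b"\x00" * 11 + struct.pack(">IIII", 3790, 5447, 600, 600)
print(jbig2_page_info(blob))  # (3790, 5447, 600, 600)
```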
> So now the only remaining piece we need is some way to identify the file as JBIG2. Any ideas?
Not exactly a creative or reliable idea, but perhaps just the file extensions `.jb2` / `.jbig2` / `.jb2e`?
> > So now the only remaining piece we need is some way to identify the file as JBIG2. Any ideas?
>
> Not exactly a creative or reliable idea, but perhaps just the file extensions `.jb2` / `.jbig2` / `.jb2e`?
No. This is not MS Windows.
> So now the only remaining piece we need is some way to identify the file as JBIG2. Any ideas?
This seems quite impossible, which is a job of `pdfimages`. I don't know whether they are [reluctant](https://gitlab.freedesktop.org/poppler/poppler/-/issues/1106#note_1012073) to add this extra magic code. Note that their logic of distinguishing the output by their extensions (while keeping the data as is) is kinda MS Windows that you mentioned above.
> Yet another unqualified idea of mine: Maybe the format of bytes 11 to 27 could be used to identify jbig2 files?
No. Those contain the size and dpi and can thus be arbitrary integers.
The `pdf` that I am dealing with is not permitted to share (which however consists of many scanned pages along with some decorations and frames), but the result of `pdfimages -list` looks like the table above, and the output of `pdfimages -all` will produce `jb2e` files, which are, if I understand correctly, JBIG2 streams (without header). I would like to extract some of these scanned pages (i.e. `jb2e` files) to get a new `pdf` file.
Then maybe you could just try a different tool that extracts real images rather than jbig2 streams? I personally use a self-written script employing the PdfImage helper model of pikepdf to extract images from PDFs.
This is my script in case it helps you. It also searches for images inside Form XObjects and applies a possible `/SMask`. However, it does not work for some types of images, since they are not supported by the PdfImage model yet (e.g. CMYK). (Had to append the .txt extension to the file as gitea seems not to accept .py)
Thanks. Let me first open a feature request at `pdfimages`. However, it seems to me that the real `JBIG2` images are not supported by `img2pdf` either?
Hi,
I want to understand the problem. What is a "jbig2 image"? Which software produces those? jbig2 is not an image format but a way to encode bilevel data. Why would converting a jbig2 stream be lossy? Why don't you just use `pdfimages -png` instead?
Maybe I misunderstood something. In the manpage of `pdfimages`, the "formats" `JPEG`, `JPEG2000`, `JBIG2` and `CCITT` are listed in parallel, and I think that if I specify `-png`, it would induce a conversion (everything to `png`). I suppose that any of these kinds of conversion might be lossy (i.e. if I convert back and forth, I will get a different file), and I imagine that it should be possible to avoid conversions at all.
JBIG2 and CCITT are ways to encode bilevel image data, but they are not "formats" in the same sense as JPEG or PNG: they have no header that identifies which kind of file it is, what the dimensions are, and other metadata. For example, a JPEG image will start with the bytes 0xFF 0xD8, and this tells the program reading the file that this is a JPEG image. JBIG2 and CCITT do not have such a header, and thus, to understand the data from a JBIG2 or CCITT file, you need to somehow know that it is a JBIG2 or CCITT file. This is why, when you use any image manipulation program like gimp or photoshop, they will not let you save or open JBIG2 and CCITT files. Also, when you run `magick identify -list format` from imagemagick, JBIG2 and CCITT will not be listed. Those two are just ways to encode bilevel data, but they are missing a container. Without knowing that the file contains JBIG2 or CCITT data, the file just contains junk. So if we wanted img2pdf to support JBIG2 and CCITT, then we would need some way to tell img2pdf that the file we pass to it contains JBIG2 or CCITT data.
I cannot come up with a situation in which storing JBIG2 or CCITT data as PNG would be lossy. Can you?
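To illustrate the point about headers, here is a small sketch of magic-byte sniffing. The JPEG and PNG signatures are standard; the point is that a raw JBIG2 or CCITT stream offers no comparable signature to test against.

```python
def sniff(data: bytes):
    """Identify a few container formats by their magic bytes (sketch).

    JBIG2 embedded streams, as extracted by pdfimages, carry no such
    signature, so a check of this kind necessarily fails for them.
    """
    if data[:2] == b"\xff\xd8":           # JPEG start-of-image marker
        return "jpeg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":  # PNG signature
        return "png"
    return None  # e.g. a raw JBIG2/CCITT stream: indistinguishable from junk
```

Running this on the first bytes of the jb2 dump shown later in the thread returns `None`, exactly the problem being discussed.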
Thanks for the explanations. However, it seems to me that this answer claims that there are a "header" and a "tail" in a "normal" JBIG2 file which are "stripped" in the PDF stream. It seems to me that this looks like an identification, similar to the `0xFF 0xD8` that you mentioned.
Furthermore, if I understand correctly, there is no more metadata after conversion to `png` or `pbm` (I believe that "normal" JPEG images contain more metadata, such as EXIF, in addition to `0xFF 0xD8`). Thus what you really need is just the identification that this piece of data is encoded by JBIG2?
Yes, it looks like there should be a header, and if that also contains the size of the image then that should be enough to support JBIG2 as input.
But there seems to be no JBIG2 encoder in the operating system I'm using (Debian), so unless you can provide a JBIG2 file I don't see how I can add support for it to img2pdf.
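A stand-alone JBIG2 file format with a magic sequence does exist at the specification level (ITU-T T.88, Annex D). As a hedged sketch based on my reading of that annex (untested against real decoders), prepending such a file header could look like:

```python
JBIG2_MAGIC = b"\x97JB2\r\n\x1a\n"  # 8-byte file header ID string from T.88

def wrap_embedded_stream(jb2e: bytes, n_pages: int = 1) -> bytes:
    """Prepend a JBIG2 file header to an embedded stream (sketch).

    Per T.88 Annex D the header is the ID string, one flags byte
    (bit 0 set = sequential organisation, bit 1 set = page count unknown)
    and, when the count is known, a 4-byte big-endian number of pages.
    """
    flags = 0x01  # sequential file, page count known and present
    return JBIG2_MAGIC + bytes([flags]) + n_pages.to_bytes(4, "big") + jb2e
```

This alone would not repair pdfimages output, since the thread notes that the "tail" (trailing segments) is reportedly stripped as well, but it shows that a detectable magic exists at the file-format level.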
I did some tests and seemingly the size data (height, width, xppi, yppi) is contained in the JBIG2 stream (not the header). I pass the JBIG2 stream generated by `pdfimages` to the code at https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741#file-pdfsimp-py-L142
A JBIG2 encoder is available here. It is not included in GNU/Linux because JBIG2 encoding was long patented and possibly still covered by unknown patents.
I will open a feature request to ask for a conservative container for JBIG2 and CCITT (for the latter, simply add a layer of tiff). "Conservative" means that the wrapping is performed at no cost and, in particular, no essential conversion is performed.
The next issue: with jbig2enc not being present in Linux distros and being encumbered by patents and no JBIG2 support in imagemagick or PIL, I cannot add any testcases for JBIG2 support to the img2pdf testsuite.
Again: why don't you just run `pdfimages -png`?
If you are dealing with a PDF that contains JPEG images, you can just use `pdfimages -j` / `pdfimages -jp2`, and you will get the original JPEG files with all metadata. However, your issue is about jbig2 streams, where you shouldn't lose anything when converting to PNG. The PdfImage helper of pikepdf handles the different ways in which images can be included in PDFs and will automatically choose the best format.
I compile `jbig2enc` myself on my computer (any patent, if still existent, should not go after personal usage). For the public test suite, I believe that the test suite of `ocrmypdf` should be closely related: they have `jbig2enc` compression options if it is installed.
There is at least a computational expense in converting to `png` and then using `jbig2enc` to compress again, which is in fact redundant. The extra compression is not that cheap: I spent around an hour compressing the images in big PDF files (around 100 MB) via pdfsizeopt. `ocrmypdf` is more efficient but it still costs time.
If it's somehow possible then yes, img2pdf should support jbig2 as input. One major reason is the one you cite in your last message: we avoid useless encoding computations in the same way that we avoid those when embedding JPEG or PNG images into the PDF container without re-encoding them.
But why do you insist on using jbig2enc in the first place? I cannot find any bilevel image where using jbig2 leads to any significant space reduction compared to the compression that img2pdf uses by default.
Are you somehow able to share an example image where using jbig2 over the alternatives really has a positive impact on the file size?
Okay, I obtained a PDF containing a JBIG2 encoded image. This is becoming more troublesome... Instead of just containing one blob per image, the PDF contains two blobs for each image. One is the `/JBIG2Globals` object and one is the `/XObject` itself. So if I run `pdfimages -all` on that PDF I get a jb2e file and a jb2g file for each image.
According to the answer in https://stackoverflow.com/questions/27709913/jbig2-data-in-pdf-is-not-valid-jbig2-data-wrong-magic/27713306#27713306 it seems that indeed the PDF does not contain the header containing the crucial information that this is a JBIG2 file (the magic) and the image dimensions.
And then there is the problem that we have to somehow teach img2pdf to treat two input images as one. I do not see how to possibly do this in practice.
Any suggestions?
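For context, the two blobs correspond to PDF objects shaped roughly like this. This is a hand-written sketch following ISO 32000's `JBIG2Decode` filter, not an excerpt from the actual file; object numbers and lengths are made up, while the dimensions are taken from the `pdfimages -list` output earlier in the thread.

```
5 0 obj            % the image XObject -> extracted as the .jb2e file
<< /Type /XObject /Subtype /Image
   /Width 3790 /Height 5447
   /BitsPerComponent 1 /ColorSpace /DeviceGray
   /Filter /JBIG2Decode
   /DecodeParms << /JBIG2Globals 6 0 R >>
   /Length ... >>
stream
...embedded JBIG2 segments...
endstream
endobj

6 0 obj            % the shared globals -> extracted as the .jb2g file
<< /Length ... >>
stream
...global JBIG2 segments...
endstream
endobj
```

Note that `/Width` and `/Height` are required entries of the image dictionary, so a PDF reader never needs to parse the JBIG2 data itself to learn the dimensions.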
Could you please share the link/file?
Theoretically, it should be the job of `pdfimages` to produce a JBIG2 file with header (instead of two files containing raw streams). Meanwhile, you could provide a script to do so, but in my opinion, it should not be integrated into the `img2pdf` executable itself.
To be clear, for my PDF, I have only extracted a `jb2e` file, the embedded stream (e = embedded), without "global data" (g = global).
It seems that the JBIG2 stream contains the image dimensions. There are two pieces of evidence:
- I did not learn Python, but I don't think that this code is about reading the last segment as mentioned in https://stackoverflow.com/a/27713306
- a `png` file without header, therefore the dimension data should be computable from the raw stream per se: https://unix.stackexchange.com/a/591790
I have just encountered an extreme example yesterday: https://www.e-periodica.ch/cntmng?pid=ens-001:1968:14::46
You could extract the images, then use `img2pdf` to produce a merged pdf, and recompress with `jbig2enc` losslessly as described in https://github.com/agl/jbig2enc/issues/24#issuecomment-204697193. The difference is significant in this case. In general, I find that `jbig2` usually reduces the size of scanned monochromic documents by around 20% via `ocrmypdf`.
When I pass this document to `imageextractor.py`, I obtain 5 jpg images, which together have the size of 4.846.095 B (merged back into a PDF, it takes 4.849.576 B), with the original PDF being 4.872.211 B (the first page is non-image). With the first page merged in using pdftk, I get a file of 4.851.092 B, which is a very minor increase in size compared to the original. From looking at the images, I cannot see a visual difference to the input PDF.
An example PDF with JBIG2-encoded images together with the original image files would be really interesting, because then I could verify whether extraction is truly lossless...
Oh, I see, it is not lossless: `JBIG2` has only 1 bit per pixel, while the original `JPEG` has 8 bits per pixel, but a ratio of 1/20 still seems strange.
Ah, I was confused. I thought the file you linked was supposed to contain JBIG2 streams already, but this was wrong. I didn't read carefully, sorry... With the compressed PDF you uploaded, I now understand your problem. The original file is 231 KiB, and the five images extracted as PNGs are 690 KiB, which indeed is a considerable increase in file size.
Unless you need to edit the images, it might be easiest to work with the PDF document and a tool like pdftk or similar to remove or add pages. This would avoid the increase in size caused by extracting images and merging back.
And if you do need to edit the images, then you can't work with jbig2 anyway...
I know that extracting pages could be done by `qpdf` (some seem to object to `pdftk`).
The images in that PDF are not monochromatic but grayscale. But JBIG2 is for bilevel images, so this will be a lossy conversion.
Then I think maybe you should first approach the Pillow project to add support for reading JBIG2 images? That seems to be a better place than img2pdf for code that parses JBIG2 files and can extract information like image dimensions.
I also just confirmed that lossless JBIG2 compresses some output better than CCITT4. I got some bilevel PDF from here:
https://www.jbig2dec.com/tests/index.html
Then converted 042.bmp to JBIG2 by using:
And created a PDF from it using https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741 like so:
The resulting PDF is only 46K. If I use img2pdf to convert 042.bmp to a PDF using CCITT4, the resulting PDF is 68K in size.
I used pdfimages to extract the embedded images from both pdfs and then compared them using:
Indeed there is not a single pixel difference even though the PDF containing jb2 data is much smaller.
Then it might be hard, if not even the `file` command is able to detect jbig2...
Closing, since there seems to be no good way to identify the format without a proper file magic for the non-existent container.