Support for JBIG2 images ~~/ streams~~ #112
An example: when extracting images from a pdf file by pdfimages -all, the result contains some JBIG2 images. It is then natural to select some of them to assemble another pdf file. Currently, it seems to me that one needs to first use jbig2dec to decode them, then encode them back again via jbig2 to produce a PDF, which seems to be lossy and cumbersome. I hope that one could assemble these JBIG2 ~~streams~~ images directly via img2pdf. It seems also reasonable to support not only JBIG2 streams, but also image files themselves.

Could you please share an example PDF so that we can understand your problem better?
I tend to believe this might rather be an issue with the tool you're using to extract images, for it should reconstruct the actual image rather than just save a JBIG2 stream.
The pdf that I am dealing with is not permitted to be shared (it consists of many scanned pages along with some decorations and frames), but pdfimages -list reports JBIG2-encoded images, and the output of pdfimages -all produces jb2e files, which are, if I understand correctly, JBIG2 streams (without header). I would like to extract some of these scanned pages (i.e. jb2e files) to get a new pdf file.

Then maybe you could just try a different tool that extracts real images rather than JBIG2 streams? I'm using a custom script with the PdfImage helper model of pikepdf to extract images from PDFs.
This is my script in case it helps you. It also searches for images inside Form XObjects and applies a possible /SMask. However, it does not work for some types of images, since they are not supported by the PdfImage model yet (e.g. CMYK). (I had to append the .txt extension to the file, as Gitea seems not to accept .py.)
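For readers without the attachment, a minimal sketch of that approach: walk a page's /XObject resources, recurse into Form XObjects (which carry their own /Resources) and hand every Image XObject to pikepdf's PdfImage model. The input file name and output naming here are placeholders, and the real attached script differs in detail (in particular, applying a possible /SMask is omitted from this sketch).

```python
# Sketch: extract Image XObjects, recursing into Form XObjects.
# "input.pdf" and the output prefix are placeholders.
from pikepdf import Name, Pdf, PdfImage

def iter_images(resources):
    """Yield (name, xobject) for every Image XObject reachable from resources."""
    for name, xobj in resources.get("/XObject", {}).items():
        subtype = xobj.get("/Subtype")
        if subtype == Name.Form:
            # Form XObjects carry their own resource dictionary
            yield from iter_images(xobj.get("/Resources", {}))
        elif subtype == Name.Image:
            yield name, xobj

with Pdf.open("input.pdf") as pdf:
    for pageno, page in enumerate(pdf.pages, start=1):
        for i, (name, xobj) in enumerate(iter_images(page.obj.get("/Resources", {}))):
            try:
                PdfImage(xobj).extract_to(fileprefix=f"page{pageno:03d}-{i:03d}")
            except Exception as exc:  # e.g. colour spaces PdfImage cannot handle yet
                print(f"skipping {name} on page {pageno}: {exc}")
```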
Thanks. Let me first open a feature request at pdfimages. However, it seems to me that the real JBIG2 images are not supported by img2pdf either?

Hi,
I want to understand the problem. What is a "jbig2 image"? Which software produces those? jbig2 is not an image format but a way to encode bilevel data. Why would converting a jbig2 stream be lossy? Why don't you just use pdfimages -png instead?

Maybe I misunderstood something. In the manpage of pdfimages, the "formats" JPEG, JPEG2000, JBIG2 and CCITT are listed in parallel, and I think that if I specify -png, it would induce a conversion (everything to png), and I suppose that any of these kinds of conversion might be lossy (i.e. if I convert back and forth, I will get a different file), and I imagine that it should be possible to avoid conversions at all.

JBIG2 and CCITT are ways to encode bilevel image data, but they are not "formats" in the same sense as JPEG or PNG, because they have no header that identifies which kind of file it is, what the dimensions are and other metadata. For example, a JPEG image will start with the bytes 0xFF 0xD8, and this tells the program reading the file that this is a JPEG image. JBIG2 and CCITT do not have such a header, and thus, to understand the data from a JBIG2 or CCITT file, you need to somehow know that it is a JBIG2 or CCITT file. This is why, when you use any image manipulation program like GIMP or Photoshop, they will not let you save or open JBIG2 and CCITT files. Also, when you run magick identify -list format from ImageMagick, JBIG2 and CCITT will not be listed. Those two are just ways to encode bilevel data, but they are missing a container. Without knowing that the file contains JBIG2 or CCITT data, the file just contains junk. So if we wanted img2pdf to support JBIG2 and CCITT, then we would need some way to tell img2pdf that the file we pass to it contains JBIG2 or CCITT data.

I cannot come up with a situation in which storing JBIG2 or CCITT data as PNG would be lossy. Can you?
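To illustrate the "no signature" point with a toy sketch (the file name is made up, and only signatures I am sure of are listed):

```python
# Toy illustration: well-known raster formats can be recognised from a
# fixed signature at the start of the file; a raw JBIG2 or CCITT stream
# offers nothing comparable to match on.
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
}

def sniff(path):
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, fmt in SIGNATURES.items():
        if head.startswith(magic):
            return fmt
    return None  # raw JBIG2/CCITT data ends up here

print(sniff("image.jb2e"))  # prints None: nothing identifies the stream
```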
Thanks for the explanations. However, it seems to me that this answer claims that there are a "header" and a "tail" in a "normal" JBIG2 file which are "stripped" in the PDF stream. It seems to me that this looks like an identification, similar to the 0xFF 0xD8 that you mentioned.

Furthermore, if I understand correctly, there is no more metadata after conversion to png or pbm (I believe that "normal" JPEG images contain more metadata, such as EXIF, in addition to 0xFF 0xD8). Thus what you really need is just the identification that this piece of data is encoded as JBIG2?

Yes, it looks like there should be a header, and if that also contains the size of the image, then that should be enough to support JBIG2 as input.
But there seems to be no JBIG2 encoder in the operating system I'm using (Debian) so unless you can provide a JBIG2 file I don't see how I can add support for it to img2pdf.
I did some tests, and seemingly the size data (height, width, xppi, yppi) is contained in the JBIG2 stream (not the header). I pass the JBIG2 stream generated by pdfimages to the code at https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741#file-pdfsimp-py-L142

A JBIG2 encoder is available here. It is not included in GNU/Linux distributions because JBIG2 encoding was long patented and is possibly still covered by unknown patents.
I will open a feature request to ask for a conservative container for JBIG2 and CCITT (for the latter, simply add a TIFF layer). "Conservative" means that this is done essentially for free and, in particular, no actual conversion is performed.
The next issue: with jbig2enc not being present in Linux distros and being encumbered by patents, and with no JBIG2 support in ImageMagick or PIL, I cannot add any test cases for JBIG2 support to the img2pdf test suite.
Again: why don't you just run pdfimages -png?

If you are dealing with a PDF that contains JPEG images, you can just use pdfimages -j / pdfimages -jp2, and you will get the original JPEG files with all metadata. However, your issue is about JBIG2 streams, where you shouldn't lose anything when converting to PNG.

The PdfImage helper of pikepdf handles the different ways images can be included in PDFs and will automatically choose the best output format.
I compile jbig2enc myself on my computer (any patent, if it still exists, should not apply to personal use). For the public test suite, I believe that the test suite of ocrmypdf should be closely related - they have jbig2enc compression options if it is installed.

There is at least a computational expense in converting to png and then using jbig2enc to compress again, which is in fact redundant. The extra compression is not that cheap - I spent around an hour compressing the images in big PDF files (around 100 MB) via pdfsizeopt. ocrmypdf is more efficient, but it still costs time.

If it's somehow possible, then yes, img2pdf should support JBIG2 as input. One major reason is the one you cite in your last message: we avoid useless encoding computations, in the same way that we avoid them when embedding JPEG or PNG images into the PDF container without re-encoding them.
But why do you insist on using jbig2enc in the first place? I cannot find any bilevel image where using jbig2 leads to any significant space reduction compared to the compression that img2pdf uses by default.
Are you somehow able to share an example image where using jbig2 over the alternatives really has a positive impact on the file size?
Okay, I obtained a PDF containing a JBIG2-encoded image. This is becoming more troublesome... Instead of just containing one blob per image, the PDF contains two blobs for each image. One is the /JBIG2Globals object and one is the /XObject itself. So if I run pdfimages -all on that PDF, I get a jb2e file and a jb2g file for each image.

According to the answer in https://stackoverflow.com/questions/27709913/jbig2-data-in-pdf-is-not-valid-jbig2-data-wrong-magic/27713306#27713306 it seems that indeed the PDF does not contain the header carrying the crucial information that this is a JBIG2 file (the magic) and the image dimensions.

And then there is the problem that we would have to somehow teach img2pdf to treat two input files as one image. I do not see how to possibly do this in practice.

Any suggestions?
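Not a solution for img2pdf itself, but for completeness, a pikepdf sketch of how the two blobs relate. It assumes /Filter and /DecodeParms hold single values rather than arrays, and the file name is a placeholder:

```python
# Sketch: for every JBIG2-encoded Image XObject, grab the embedded stream
# (what pdfimages writes to .jb2e) and the shared /JBIG2Globals stream
# (what pdfimages writes to .jb2g). Assumes /Filter and /DecodeParms are
# single values, not arrays of filters.
from pikepdf import Name, Pdf

with Pdf.open("input.pdf") as pdf:
    for pageno, page in enumerate(pdf.pages, start=1):
        for name, xobj in page.images.items():
            if xobj.get("/Filter") != Name.JBIG2Decode:
                continue
            embedded = xobj.read_raw_bytes()  # still JBIG2-encoded
            parms = xobj.get("/DecodeParms")
            glob = parms.get("/JBIG2Globals") if parms is not None else None
            globals_data = glob.read_raw_bytes() if glob is not None else b""
            print(pageno, name, len(embedded), len(globals_data))
```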
Could you please share the link/file?
Theoretically, it should be the job of pdfimages to produce a JBIG2 file with a header (instead of two files containing raw streams). Meanwhile, you could provide a script to do so, but in my opinion, it should not be integrated into the img2pdf executable itself.

To be clear, for my PDF, I have only extracted a jb2e file, the embedded stream (e = embedded), without "global data" (g = global).

It seems that the JBIG2 stream contains the image dimensions. There are two pieces of evidence:

1. I did not learn Python, but I don't think that this code is about reading the last segment as mentioned in https://stackoverflow.com/a/27713306
2. The stream can be decoded to a png file without the header being present, therefore the dimension data should be computable from the raw stream per se: https://unix.stackexchange.com/a/591790

I have just encountered an extreme example yesterday: https://www.e-periodica.ch/cntmng?pid=ens-001:1968:14::46
You could extract the images, then use img2pdf to produce a merged pdf, and then compress it losslessly with jbig2enc as described in https://github.com/agl/jbig2enc/issues/24#issuecomment-204697193. The difference is significant in this case. In general, I find that jbig2 usually reduces the size of scanned monochrome documents by around 20% via ocrmypdf.
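For the record, the merge step with img2pdf's Python API looks roughly like this (file names are placeholders):

```python
# Sketch of the merge step only: assemble already-extracted image files
# into one PDF with img2pdf, without re-encoding the image data.
import img2pdf

pages = ["scan-000.png", "scan-001.png", "scan-002.png"]
with open("merged.pdf", "wb") as out:
    out.write(img2pdf.convert(pages))
```

img2pdf already embeds JPEG data without re-encoding it, which is the behaviour this thread wants extended to JBIG2.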
When I pass this document to imageextractor.py, I obtain 5 jpg images, which together have a size of 4.846.095 B (merged back into a PDF, it takes 4.849.576 B), with the original PDF being 4.872.211 B (the first page is non-image). With the first page merged in using pdftk, I get a file of 4.851.092 B, which is a very minor increase in size compared to the original. From looking at the images, I cannot see a visual difference to the input PDF.

An example PDF with JBIG2-encoded images together with the original image files would be really interesting, because then I could verify whether extraction is truly lossless...
Oh, I see, it is not lossless - JBIG2 has only 1 bit per pixel, while the original JPEG has 8 bits per pixel, but a ratio of 1/20 still seems strange.

Ah, I was confused. I thought the file you linked was supposed to contain JBIG2 streams already, but this was wrong. I didn't read carefully, sorry... With the compressed PDF you uploaded, I now understand your problem. The original file is 231 KiB, and the five images extracted as PNGs are 690 KiB, which indeed is a considerable increase in file size.
Unless you need to edit the images, it might be easiest to work with the PDF document and a tool like pdftk or similar to remove or add pages. This would avoid the increase in size caused by extracting images and merging back.
And if you do need to edit the images, then you can't work with jbig2 anyway...
I know that extracting pages could be done by qpdf (some seem to object to pdftk).

The images in that PDF are not monochrome but grayscale. But JBIG2 is for bilevel images, so this would be a lossy conversion.
Then I think maybe you should first approach the Pillow project to add support for reading JBIG2 images? That seems to be a better place than img2pdf for code that parses JBIG2 files and can extract information like image dimensions.
I also just confirmed that lossless JBIG2 compresses some output better than CCITT4. I got some bilevel PDF from here:
https://www.jbig2dec.com/tests/index.html
Then converted 042.bmp to JBIG2 by using:
And created a PDF from it using https://gist.github.com/kmlyvens/b532c7aec2fe2bd8214ae2b3faf8f741 like so:
The resulting PDF is 46K in size. If I use img2pdf to convert 042.bmp to a PDF using CCITT4, the resulting PDF is 68K in size.
I used pdfimages to extract the embedded images from both pdfs and then compared them using:
Indeed there is not a single pixel difference even though the PDF containing jb2 data is much smaller.
Looking at the code of pdfsimp.py, the width, height and horizontal as well as vertical resolution can indeed be obtained from bytes 11 to 27 of the jb2 file.
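In Python terms, that read is something like the following, assuming (as pdfsimp.py appears to) that the embedded stream begins with the page information segment preceded by an 11-byte segment header; the file name is a placeholder:

```python
# Sketch: read width, height and x/y resolution from an embedded JBIG2
# stream (.jb2e), assuming it starts with the page information segment
# preceded by an 11-byte segment header, so the four big-endian 32-bit
# values sit in bytes 11..27.
import struct

def jbig2_page_info(path):
    with open(path, "rb") as f:
        data = f.read(27)
    width, height, xres, yres = struct.unpack(">IIII", data[11:27])
    return width, height, xres, yres

print(jbig2_page_info("image-000.jb2e"))
```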
My only remaining problem now is that I still don't know how to identify the files produced by the jbig2 command above. The file starts with:

So there seems to be no magic byte sequence identifying the file type. If I run the tool file on it, then it just tells me that the file contains "data", meaning that it cannot identify the file.

So now the only remaining piece we need is some way to identify the file as JBIG2. Any ideas?
Not exactly a creative or reliable idea, but perhaps just the file extensions .jb2/.jbig2/.jb2e?

No. This is not MS Windows.
Then it might be hard, if not even the file command is able to detect JBIG2... This seems quite impossible then; it is rather a job for pdfimages. I don't know whether they are reluctant to add this extra magic code. Note that their logic of distinguishing the output by file extensions (while keeping the data as is) is kind of the MS Windows approach that you mentioned above.

Yet another unqualified idea of mine: maybe the format of bytes 11 to 27 could be used to identify JBIG2 files?
No. Those contain the size and dpi and can thus be arbitrary integers.
Closing, since there seems to be no good way to identify the format without a proper file magic for the non-existent container.
jbig2enc can give two types of output. The one you've been using is "symbol mode", which finds identical-looking "symbols" and deduplicates them. This is useful when compressing files generated directly from a word processor, where letters often look exactly the same, pixel by pixel. For this mode it outputs a ".sym" file with all the common symbols, and a ".000", ".001", etc. file for each image. Indeed, when using symbol mode there is no good way to detect that these files are JBIG2 files.

However, jbig2enc also has a "generic coder", which does not do this deduplication between images. This often still yields very good results, especially for scanned documents where the same letter is not identical between different instances, so the deduplication of symbol mode doesn't help much. This is the mode that you get when running jbig2 example.png > example.jb2.

For the generic coder, you can detect the file, since it has the magic header 0x97, 0x4a, 0x42, 0x32, 0x0d, 0x0a, 0x1a, 0x0a. This is defined in the spec, as seen on page 131 of https://github.com/agl/jbig2enc/blob/master/fcd14492.pdf.

We could therefore support this "generic coder" variant of JBIG2. I'll see if I can make a basic implementation.
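A detection sketch along those lines; the interpretation of the flags byte beyond the magic follows my reading of the spec and should be double-checked:

```python
# Sketch: recognise a standalone JBIG2 file by its 8-byte magic and read
# the file header flags. The header layout assumed here is: magic,
# 1 flags byte, then an optional 4-byte page count.
import struct

JBIG2_MAGIC = b"\x97\x4a\x42\x32\x0d\x0a\x1a\x0a"

def parse_jbig2_header(path):
    with open(path, "rb") as f:
        head = f.read(13)
    if not head.startswith(JBIG2_MAGIC):
        return None                      # not a standalone JBIG2 file
    flags = head[8]
    sequential = bool(flags & 0x01)      # bit 0: sequential organisation
    pages_known = not (flags & 0x02)     # bit 1: number of pages unknown
    pages = struct.unpack(">I", head[9:13])[0] if pages_known else None
    return {"sequential": sequential, "pages": pages}

print(parse_jbig2_header("example.jb2"))
```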
I'm learning more and more every minute I'm looking into this... What I said above is not entirely accurate: you got the different files because of using their "PDF mode". Both generic coding and symbol mode are supported in the JBIG2 file format (the one that starts with the magic header).
In fact, the JBIG2 file format supports multiple pages, to accommodate a common symbol lookup table between the pages.
We don't have to support all that to start. We could start with only the simple case of a JBIG2 file with a single page.
Indeed, you're right, JBIG2 is a standalone image format after all.
Sorry about my old comments above; with the current state of knowledge, they were clearly quite wrong.
Given the chance for considerably better compression, I agree support for JBIG2 in img2pdf would be nice to have. Thanks @ooBJ3u for working on this. It might even make sense to set lossless JBIG2 as the default codec for monochrome input in the future, if jbig2enc is available?
As a separate matter, I believe a useful complement to support in img2pdf might be a utility to merge the "stripped form" as stored in the PDF back into an actual JBIG2 file, similar to fax2tiff for CCITT, and also to have pikepdf simply re-create the JBIG2 wrapper instead of transcoding to PNG, to allow for seamless use of the format in a PDF image extract-and-rewrap pipeline.

Incidentally, there were very recent comments at https://gitlab.freedesktop.org/poppler/poppler/-/issues/1106#note_2180790 which refer to the same JBIG2 spec. Unfortunately, it seems quite improbable that pdfimages would implement such a thing unless somebody volunteers to submit a PR.

Yeah, I doubt this will ever be implemented in poppler. I was more thinking of some Python script the caller could use on pdfimages' stripped JBIG2 output.

What adds complication with pikepdf is the possible shared globals stream. I'm not sure how one would handle that, given that the current API operates on the image level, not the document level. Embedding the shared globals in every individual output image doesn't seem elegant.
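A very rough sketch of what such a rewrap helper could look like, under the assumption that prepending the standard file header (magic, flags, page count) to the globals plus embedded segments yields a sequentially organised single-page file. Segment renumbering, page association and trailing end-of-page/end-of-file segments are not handled, and the file names are placeholders, so treat this as a starting point rather than a working converter:

```python
# Rough sketch of a rewrap helper: turn the stripped PDF form (optional
# /JBIG2Globals data plus the embedded stream) back into a standalone,
# sequentially organised JBIG2 file by prepending the file header.
# Segment renumbering, page association and end-of-page/end-of-file
# segments are NOT dealt with here; some decoders may require them.
import struct

JBIG2_MAGIC = b"\x97\x4a\x42\x32\x0d\x0a\x1a\x0a"

def rewrap_jbig2(embedded: bytes, globals_data: bytes = b"", pages: int = 1) -> bytes:
    header = JBIG2_MAGIC
    header += bytes([0x01])               # flags: sequential, page count known
    header += struct.pack(">I", pages)    # number of pages
    return header + globals_data + embedded

# Example with the kind of file names pdfimages produces (placeholders):
with open("image-000.jb2e", "rb") as e, open("image-000.jb2g", "rb") as g:
    data = rewrap_jbig2(e.read(), g.read())
with open("image-000.jb2", "wb") as out:
    out.write(data)
```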