forked from josch/img2pdf
README.md: comparison to other software
This commit is contained in:
parent
f5d8d86dff
commit
9395b6fbbe
1 changed files with 91 additions and 21 deletions
112
README.md
112
README.md
|
@ -17,9 +17,9 @@ Conventional conversion software (like ImageMagick) would either:
|
||||||
2. not be small because using wasteful flate encoding of raw pixel data
|
2. not be small because using wasteful flate encoding of raw pixel data
|
||||||
3. not be fast because input data gets re-encoded
|
3. not be fast because input data gets re-encoded
|
||||||
|
|
||||||
Another advantage of not having to re-encode the input in most common
|
Another advantage of not having to re-encode the input (in most common
|
||||||
situations is, that img2pdf is able to handle much larger input than other
|
situations) is, that img2pdf is able to handle much larger input than other
|
||||||
software.
|
software, because the raw pixel data never has to be loaded into memory.
|
||||||
|
|
||||||
The following table shows how img2pdf handles different input depending on the
|
The following table shows how img2pdf handles different input depending on the
|
||||||
input file format and image color space.
|
input file format and image color space.
|
||||||
|
@ -56,34 +56,35 @@ descriptor.
|
||||||
If no output file is specified with the `-o`/`--output` option, output will be
|
If no output file is specified with the `-o`/`--output` option, output will be
|
||||||
done to stdout. A typical invocation is:
|
done to stdout. A typical invocation is:
|
||||||
|
|
||||||
img2pdf img1.png img2.jpg -o out.pdf
|
$ img2pdf img1.png img2.jpg -o out.pdf
|
||||||
|
|
||||||
The detailed documentation can be accessed by running:
|
The detailed documentation can be accessed by running:
|
||||||
|
|
||||||
img2pdf --help
|
$ img2pdf --help
|
||||||
|
|
||||||
Bugs
|
Bugs
|
||||||
----
|
----
|
||||||
|
|
||||||
If you find a JPEG, JPEG2000 or PNG file that, when embedded into the PDF
|
- If you find a JPEG, JPEG2000, PNG or CCITT Group 4 encoded TIFF file that,
|
||||||
cannot be read by the Adobe Acrobat Reader, please contact me.
|
when embedded into the PDF cannot be read by the Adobe Acrobat Reader,
|
||||||
|
please contact me.
|
||||||
|
|
||||||
I have not yet figured out how to determine the colorspace of JPEG2000 files.
|
- I have not yet figured out how to determine the colorspace of JPEG2000
|
||||||
Therefore JPEG2000 files use DeviceRGB by default. For JPEG2000 files with
|
files. Therefore JPEG2000 files use DeviceRGB by default. For JPEG2000
|
||||||
other colorspaces, you must explicitly specify it using the `--colorspace`
|
files with other colorspaces, you must explicitly specify it using the
|
||||||
option.
|
`--colorspace` option.
|
||||||
|
|
||||||
Input images with alpha channels are not allowed. PDF doesn't support alpha
|
- Input images with alpha channels are not allowed. PDF doesn't support alpha
|
||||||
channels in images and thus, the alpha channel of the input would have to be
|
channels in images and thus, the alpha channel of the input would have to be
|
||||||
discarded. But img2pdf will always be lossless and thus, input images must not
|
discarded. But img2pdf will always be lossless and thus, input images must
|
||||||
carry transparency information.
|
not carry transparency information.
|
||||||
|
|
||||||
img2pdf uses PIL (or Pillow) to obtain image meta data and to convert the input
|
- img2pdf uses PIL (or Pillow) to obtain image meta data and to convert the
|
||||||
if necessary. To prevent decompression bomb denial of service attacks, Pillow
|
input if necessary. To prevent decompression bomb denial of service attacks,
|
||||||
limits the maximum number of pixels an input image is allowed to have. If you
|
Pillow limits the maximum number of pixels an input image is allowed to
|
||||||
are sure that you know what you are doing, then you can disable this safeguard
|
have. If you are sure that you know what you are doing, then you can disable
|
||||||
by passing the `--pillow-limit-break` option to img2pdf. This allows one to
|
this safeguard by passing the `--pillow-limit-break` option to img2pdf. This
|
||||||
process even very large input images.
|
allows one to process even very large input images.
|
||||||
|
|
||||||
Installation
|
Installation
|
||||||
------------
|
------------
|
||||||
|
@ -146,3 +147,72 @@ The package can also be used as a library:
|
||||||
layout_fun = img2pdf.get_layout_fun(a4inpt)
|
layout_fun = img2pdf.get_layout_fun(a4inpt)
|
||||||
with open("name.pdf","wb") as f:
|
with open("name.pdf","wb") as f:
|
||||||
f.write(img2pdf.convert('test.jpg', layout_fun=layout_fun))
|
f.write(img2pdf.convert('test.jpg', layout_fun=layout_fun))
|
||||||
|
|
||||||
|
Comparison to ImageMagick
|
||||||
|
-------------------------
|
||||||
|
|
||||||
|
Create a large test image:
|
||||||
|
|
||||||
|
$ convert logo: -resize 8000x original.jpg
|
||||||
|
|
||||||
|
Convert it into PDF using ImageMagick and img2pdf:
|
||||||
|
|
||||||
|
$ time img2pdf original.jpg -o img2pdf.pdf
|
||||||
|
$ time convert original.jpg imagemagick.pdf
|
||||||
|
|
||||||
|
Notice how ImageMagick took an order of magnitude longer to do the conversion
|
||||||
|
than img2pdf. It also used twice the memory.
|
||||||
|
|
||||||
|
Now extract the image data from both PDF documents and compare it to the
|
||||||
|
original:
|
||||||
|
|
||||||
|
$ pdfimages -all img2pdf.pdf tmp
|
||||||
|
$ compare -metric AE original.jpg tmp-000.jpg null:
|
||||||
|
0
|
||||||
|
$ pdfimages -all imagemagick.pdf tmp
|
||||||
|
$ compare -metric AE original.jpg tmp-000.jpg null:
|
||||||
|
118716
|
||||||
|
|
||||||
|
To get lossless output with ImageMagick we can use Zip compression but that
|
||||||
|
unnecessarily increases the size of the output:
|
||||||
|
|
||||||
|
$ convert original.jpg -compress Zip imagemagick.pdf
|
||||||
|
$ pdfimages -all imagemagick.pdf tmp
|
||||||
|
$ compare -metric AE original.jpg tmp-000.png null:
|
||||||
|
0
|
||||||
|
$ stat --format="%s %n" original.jpg img2pdf.pdf imagemagick.pdf
|
||||||
|
1535837 original.jpg
|
||||||
|
1536683 img2pdf.pdf
|
||||||
|
9397809 imagemagick.pdf
|
||||||
|
|
||||||
|
Comparison to pdfLaTeX
|
||||||
|
----------------------
|
||||||
|
|
||||||
|
pdfLaTeX performs a lossless conversion from included images to PDF by default.
|
||||||
|
If the input is a JPEG, then it simply embeds the JPEG into the PDF in the same
|
||||||
|
way as img2pdf does it. But for other image formats it uses flate compression
|
||||||
|
of the plain pixel data and thus needlessly increases the output file size:
|
||||||
|
|
||||||
|
$ convert logo: -resize 8000x original.png
|
||||||
|
$ cat << END > pdflatex.tex
|
||||||
|
\documentclass{article}
|
||||||
|
\usepackage{graphicx}
|
||||||
|
\begin{document}
|
||||||
|
\includegraphics{original.png}
|
||||||
|
\end{document}
|
||||||
|
END
|
||||||
|
$ pdflatex pdflatex.tex
|
||||||
|
$ stat --format="%s %n" original.png pdflatex.pdf
|
||||||
|
4500182 original.png
|
||||||
|
9318120 pdflatex.pdf
|
||||||
|
|
||||||
|
Comparison to Tesseract OCR
|
||||||
|
---------------------------
|
||||||
|
|
||||||
|
Tesseract OCR comes closest to the functionality img2pdf provides. It is able
|
||||||
|
to convert JPEG and PNG input to PDF without needlessly increasing the filesize
|
||||||
|
and is at the same time lossless. So if your input is JPEG and PNG images, then
|
||||||
|
you should safely be able to use Tesseract instead of img2pdf. For other input,
|
||||||
|
Tesseract might not do a lossless conversion. For example it converts CMYK
|
||||||
|
input to RGB and removes the alpha channel from images with transparency. For
|
||||||
|
multipage TIFF or animated GIF, it will only convert the first frame.
|
||||||
|
|
Loading…
Reference in a new issue