Compare commits

30 commits
0.5.0 ... main

Author SHA1 Message Date
bb188a3eaf
release version 0.6.1 2025-04-27 18:54:38 +02:00
69c3ac6b25
src/img2pdf_test.py: do not unlink temporary files -- let pytest take care of that for us
This way, we can export all the artifacts for later retrieval when
pytest runs on CI systems, making debugging of issues far easier as it
avoids having to recreate the artifacts locally first.
2025-04-27 18:50:11 +02:00
dffc0dbe16
src/img2pdf.py: fix camelCase -> snake_case change of pymupdf
Thanks: Blair Chintella
2025-04-27 18:48:50 +02:00
Art Gabdullin
b91007fef8
README.md - Fix Windows bin URL 2025-03-26 03:54:38 +01:00
a8cb28ba31
src/img2pdf_test.py: skip test_miff_cmyk16 on s390x because of https://github.com/ImageMagick/ImageMagick/issues/8055 2025-03-23 15:37:32 +01:00
c6d12d6239
src/img2pdf_test.py: skip test_tiff_float on s390x because of https://github.com/ImageMagick/ImageMagick/issues/8054 2025-03-23 15:37:32 +01:00
59132f20f8
src/img2pdf_test.py: exiftool -all= now sets the unit to Undefined since version 13.23
Thanks: gregor herrmann <gregoa@debian.org>
2025-03-23 15:37:32 +01:00
3ba7d17e15
HACKING: document gitea release 2025-02-21 00:40:17 +01:00
43c16ac369
HACKING: add final git push 2025-02-16 17:45:09 +01:00
08c4d9beec
release version 0.6.0 2025-02-15 15:07:29 +01:00
9e6eba9f40
reformat with black 2025-02-15 14:59:04 +01:00
5aeb628506
Extract an API to predict the DPI used by img2pdf 2025-02-15 14:48:33 +01:00
b6dbfdb481
Slightly simplify imgformat retrieval
No need for a loop here - we can access the enum like a dictionary,
which should be more efficient.
2025-02-15 14:40:22 +01:00
23436114f8
Slightly simplify the getexif procedure if PIL is new enough
The getexif() procedure is available since Pillow 6.0.0. If it's
available, change the algorithm to a simplified version.

In the future, the _getexif() branch can be deleted.
2025-02-15 14:36:37 +01:00
2d5e4e3cb7
break out convert_to_docobject() from convert() which returns a document handle 2025-02-15 14:35:57 +01:00
5e515abb6f
src/tests/output/mono.jb2.pdf: strip off the last 22 bytes (end-of-page and end-of-file segments) 2025-02-15 09:49:39 +01:00
a2e2998fb1
Strip end-of-page and end-of-file segments from JBIG2
As noted by @phmccarty in
#184 (comment)
and subsequent comments, we were not properly stripping end-of-page and
end-of-file segments. These are valid segments in a JBIG2 file, but not
when embedded in PDF.

From the PDF spec:
> The JBIG2 file header, end-of-page segments, and end-of-file segment
> shall not be used in PDF.

We were already stripping out the JBIG2 file header, but not yet the
end-of-page and end-of-file segments.

For this, I'm expanding the approach that we were already taking, of
only supporting a narrow subset of JBIG2 files. We assert that the input
file has such a footer, and then we strip it.

We validated that the issue raised by @phmccarty is indeed resolved by
running the following code before and after applying this commit:

```sh
src/img2pdf.py src/tests/input/mono.jb2 > test.pdf
pdfimages -tiff test.pdf img
```

Before this commit, this returned "Syntax Error (1143): Unknown segment
type in JBIG2 stream". After this commit, the error is gone.
2025-02-15 08:12:51 +01:00
14948e7ba8
Add support for JBIG2 (generic coding)
Implements the proposal detailed at
#112 (comment)

This is a limited implementation of JBIG2, which can be extended to
support multiple pages, symbol tables, and other features of the format
in the future.

Added a test case based on mono.tif.

Updated the README.md based on
#184/files (comment)
2025-02-15 08:12:51 +01:00
bcfdf8b54e
src/img2pdf_test.py: test_jpg_2000_rgba8 no longer works with compare_poppler() 2025-02-15 08:12:08 +01:00
9f74740c95
src/img2pdf_test.py: test_miff_cmyk8 now compares exactly 2025-02-15 08:12:07 +01:00
cbc3d50c63
src/img2pdf_test.py: support None for tiff:alpha 2025-02-15 08:12:07 +01:00
4b549592bf
README.md: add example of how to use img2pdf together with scanimage 2024-09-11 11:35:07 +02:00
5540365cfd
add example for how to specify custom dpi 2024-09-11 11:34:14 +02:00
819b366bf5
release version 0.5.1 2023-11-26 06:33:10 +01:00
cc8c708295
HACKING: how to bisect 2023-11-25 09:47:53 +01:00
fb9537d8b7
src/img2pdf.py: allow PNG input without dpi units but non-square dpi aspect ratio
Closes: #181
2023-11-25 09:47:52 +01:00
7678435eb7
validate icc profile and no default location on windows
closes: #179
2023-11-07 18:50:07 +01:00
ba7a360866
release version 0.5.0 2023-10-28 08:35:54 +02:00
7f0bf47ff3
src/img2pdf.py: reformat with black 2023-10-28 08:35:53 +02:00
Leo
5cd0918d50 Issue #175 related. The original was SmartAlbums, but another case with 'Adobe PS', so delete the exif_software check part 2023-10-18 13:33:44 +08:00
8 changed files with 626 additions and 409 deletions
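
The two JBIG2 commits above (14948e7ba8 and a2e2998fb1) come down to three steps: check a fixed 24-byte prefix, check a fixed 22-byte footer, and embed everything between the 13-byte file header and that footer. The following is a minimal standalone sketch of those steps, reusing the byte constants and the dpi heuristic from the src/img2pdf.py diff in this comparison. The function name `split_jbig2` and the use of `ValueError` are assumptions made for illustration; img2pdf itself raises `ImageOpenError` and spreads this logic over `read_images()` and `get_imgmetadata()`.

```python
import struct

# Byte constants as they appear in the src/img2pdf.py diff of this comparison.
JBIG2_PREFIX = (
    b"\x97\x4a\x42\x32\x0d\x0a\x1a\x0a"  # magic bytes
    b"\x01\x00\x00\x00\x01"              # org/unk flag byte, page count (1)
    b"\x00\x00\x00\x00\x30\x00\x01"      # segment number, type, refs, page
    b"\x00\x00\x00\x13"                  # segment size
)
JBIG2_FOOTER = (
    b"\x00\x00\x00\x021\x00\x01\x00\x00\x00\x00"  # end-of-page segment
    b"\x00\x00\x00\x033\x00\x01\x00\x00\x00\x00"  # end-of-file segment
)
INCH_PER_METER = 39.370079


def split_jbig2(rawdata, default_dpi=96.0):
    """Return (width, height, (hdpi, vdpi), streamdata) for a supported file."""
    if rawdata[:24] != JBIG2_PREFIX:
        raise ValueError("only single-page generic coding is supported")
    if rawdata[-22:] != JBIG2_FOOTER:
        raise ValueError("expected end-of-page and end-of-file segments")

    # The page information segment starts right after the 24-byte prefix.
    width, height, xres, yres = struct.unpack(">IIII", rawdata[24:40])

    def to_dpi(res):
        if res == 0:
            return default_dpi
        if res < 1000:
            # Very small values are probably dpi already rather than
            # dots per meter (same heuristic as in the diff).
            return res
        return int(float(res) / INCH_PER_METER)

    # Strip the 13-byte file header and the 22-byte footer; the segment
    # header of the page information segment stays in the stream.
    return width, height, (to_dpi(xres), to_dpi(yres)), rawdata[13:-22]
```

Keeping the check strict (exact prefix and footer match) follows the commit message's stated approach of supporting only a narrow subset of JBIG2 files rather than parsing arbitrary segment sequences.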

@ -2,6 +2,34 @@
CHANGES
=======
0.6.1 (2025-04-27)
------------------
- testsuite fixes
0.6.0 (2025-02-15)
------------------
- Add support for JBIG2 (generic coding)
- Add convert_to_docobject() broken out from convert()
- Add pil_get_dpi() broken out from get_imgmetadata()
0.5.1 (2023-11-26)
------------------
- no default ICC profile location for PDF/A-1b on Windows
- workaround for PNG input without dpi units but non-square dpi aspect ratio
0.5.0 (2023-10-28)
------------------
- support MIFF for 16 bit CMYK input
- accept pathlib.Path objects as input
- don't store RGB ICC profiles from bilevel or grayscale TIFF, PNG and JPEG
- thumbnails are no longer included by default and --include-thumbnails has to
be used if you want them
- support for pikepdf (>= 6.2.0)
0.4.4 (2022-04-07)
------------------

55
HACKING
@ -27,6 +27,57 @@ Making a new release
- Build and upload to pypi:
$ rm dist/*
$ rm -rf dist/*
$ python3 setup.py sdist
$ twine upload --sign dist/*
$ twine upload dist/*
- Push everything to git forge
$ git push
- Push to github
$ git push github
- Obtain img2pdf.exe from appveyor:
https://ci.appveyor.com/project/josch/img2pdf/
- Create new release:
https://gitlab.mister-muffin.de/josch/img2pdf/releases/new
Using debbisect to find regressions
-----------------------------------
$ debbisect --cache=./cache --depends="git,ca-certificates,python3,
ghostscript,imagemagick,mupdf-tools,poppler-utils,python3-pil,
python3-pytest,python3-numpy,python3-scipy,python3-pikepdf" \
--verbose 2023-09-16 2023-10-24 \
'chroot "$1" sh -c "
git clone https://gitlab.mister-muffin.de/josch/img2pdf.git
&& cd img2pdf
&& pytest \"src/img2pdf_test.py::test_jpg_2000_rgba8[internal]\""'
Using debbisect cache
---------------------
$ mmdebstrap --variant=apt --aptopt='Acquire::Check-Valid-Until "false"' \
--include=git,ca-certificates,python3,ghostscript,imagemagick \
--include=mupdf-tools,poppler-utils,python3-pil,python3-pytest \
--include=python3-numpy,python3-scipy,python3-pikepdf \
--hook-dir=/usr/share/mmdebstrap/hooks/file-mirror-automount \
--setup-hook='mkdir -p "$1/home/josch/git/devscripts/cache/pool/"' \
--setup-hook='mount -o ro,bind /home/josch/git/devscripts/cache/pool/ "$1/home/josch/git/devscripts/cache/pool/"' \
--chrooted-customize-hook=bash \
unstable /dev/null \
file:///home/josch/git/devscripts/cache/archive/debian/20231022T090139Z/
Bisecting imagemagick
---------------------
$ git clean -fdx && git reset --hard
$ ./configure --prefix=$(pwd)/prefix
$ make -j$(nproc)
$ make install
$ LD_LIBRARY_PATH=$(pwd)/prefix/lib prefix/bin/compare ...

@ -27,18 +27,20 @@ software, because the raw pixel data never has to be loaded into memory.
The following table shows how img2pdf handles different input depending on the
input file format and image color space.
| Format | Colorspace | Result |
| ------------------------------------- | ------------------------------ | ------------- |
| JPEG | any | direct |
| JPEG2000 | any | direct |
| PNG (non-interlaced, no transparency) | any | direct |
| TIFF (CCITT Group 4) | monochrome | direct |
| any | any except CMYK and monochrome | PNG Paeth |
| any | monochrome | CCITT Group 4 |
| any | CMYK | flate |
| Format | Colorspace | Result |
| ------------------------------------- | ------------------------------------ | ------------- |
| JPEG | any | direct |
| JPEG2000 | any | direct |
| PNG (non-interlaced, no transparency) | any | direct |
| TIFF (CCITT Group 4) | 1-bit monochrome | direct |
| JBIG2 (single-page generic coding) | 1-bit monochrome | direct |
| any | any except CMYK and 1-bit monochrome | PNG Paeth |
| any | 1-bit monochrome | CCITT Group 4 |
| any | CMYK | flate |
For JPEG, JPEG2000, non-interlaced PNG and TIFF images with CCITT Group 4
encoded data, img2pdf directly embeds the image data into the PDF without
For JPEG, JPEG2000, non-interlaced PNG, TIFF images with CCITT Group 4
encoded data, and JBIG2 with single-page generic coding (e.g. using `jbig2enc`),
img2pdf directly embeds the image data into the PDF without
re-encoding it. It thus treats the PDF format merely as a container format for
the image data. In these cases, img2pdf only increases the filesize by the size
of the PDF container (typically around 500 to 700 bytes). Since data is only
@ -47,7 +49,7 @@ solutions for these input formats.
For all other input types, img2pdf first has to transform the pixel data to
make it compatible with PDF. In most cases, the PNG Paeth filter is applied to
the pixel data. For monochrome input, CCITT Group 4 is used instead. Only for
the pixel data. For 1-bit monochrome input, CCITT Group 4 is used instead. Only for
CMYK input no filter is applied before finally applying flate compression.
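
The paragraph above names the PNG Paeth filter without showing it. As a rough illustration, here is the Paeth predictor as defined in the PNG specification, applied to a single row of pixel bytes. This is a from-scratch sketch and not img2pdf's code path; `paeth_filter_row`, `paeth_unfilter_row`, and the `bpp` (bytes per pixel) parameter are names chosen for this example.

```python
def paeth_predictor(a, b, c):
    """Pick whichever of left (a), above (b), upper-left (c) is closest to a + b - c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def paeth_filter_row(row, prev_row, bpp):
    """Replace each byte by its difference to the Paeth prediction (mod 256)."""
    out = bytearray(len(row))
    for i, x in enumerate(row):
        a = row[i - bpp] if i >= bpp else 0
        b = prev_row[i]
        c = prev_row[i - bpp] if i >= bpp else 0
        out[i] = (x - paeth_predictor(a, b, c)) & 0xFF
    return bytes(out)


def paeth_unfilter_row(filtered, prev_row, bpp):
    """Invert paeth_filter_row using the already reconstructed bytes."""
    out = bytearray(len(filtered))
    for i, x in enumerate(filtered):
        a = out[i - bpp] if i >= bpp else 0
        b = prev_row[i]
        c = prev_row[i - bpp] if i >= bpp else 0
        out[i] = (x + paeth_predictor(a, b, c)) & 0xFF
    return bytes(out)
```

The filter itself does not shrink the data. It only turns smooth gradients into runs of small, repetitive values, which the subsequent flate compression can then encode more compactly.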
Usage
@ -65,6 +67,12 @@ The detailed documentation can be accessed by running:
$ img2pdf --help
With no command line arguments supplied, img2pdf will read a single image from
standard input and write the resulting PDF to standard output. Here is an
example of how to scan directly to PDF using scanimage(1) from SANE:
$ scanimage --mode=Color --resolution=300 | pnmtojpeg -quality 90 | img2pdf > scan.pdf
Bugs
----
@ -118,7 +126,7 @@ You can then test the converter using:
$ ve/bin/img2pdf -o test.pdf src/tests/test.jpg
If you don't want to set up Python on Windows, then head to the
[releases](/josch/img2pdf/releases) section and download the latest
[releases](https://gitlab.mister-muffin.de/josch/img2pdf/releases) section and download the latest
`img2pdf.exe`.
GUI

@ -1,7 +1,7 @@
import sys
from setuptools import setup
VERSION = "0.4.4"
VERSION = "0.6.1"
INSTALL_REQUIRES = (
"Pillow",

@ -22,7 +22,7 @@ import sys
import os
import zlib
import argparse
from PIL import Image, TiffImagePlugin, GifImagePlugin, ImageCms
from PIL import Image, TiffImagePlugin, GifImagePlugin, ImageCms, ExifTags
if hasattr(GifImagePlugin, "LoadingStrategy"):
# Pillow 9.0.0 started emitting all frames but the first as RGB instead of
@ -62,7 +62,7 @@ try:
except ImportError:
have_pikepdf = False
__version__ = "0.4.4"
__version__ = "0.6.1"
default_dpi = 96.0
papersizes = {
"letter": "8.5inx11in",
@ -128,7 +128,7 @@ PageOrientation = Enum("PageOrientation", "portrait landscape")
Colorspace = Enum("Colorspace", "RGB RGBA L LA 1 CMYK CMYK;I P PA other")
ImageFormat = Enum(
"ImageFormat", "JPEG JPEG2000 CCITTGroup4 PNG GIF TIFF MPO MIFF other"
"ImageFormat", "JPEG JPEG2000 CCITTGroup4 PNG GIF TIFF MPO MIFF JBIG2 other"
)
PageMode = Enum("PageMode", "none outlines thumbs")
@ -918,6 +918,11 @@ class pdfdoc(object):
self.output_version = "1.5" # jpeg2000 needs pdf 1.5
elif imgformat is ImageFormat.CCITTGroup4:
ofilter = [PdfName.CCITTFaxDecode]
elif imgformat is ImageFormat.JBIG2:
ofilter = PdfName.JBIG2Decode
# JBIG2Decode requires PDF 1.4
if self.output_version < "1.4":
self.output_version = "1.4"
else:
ofilter = PdfName.FlateDecode
@ -1075,7 +1080,7 @@ class pdfdoc(object):
self.tostream(stream)
return stream.getvalue()
def tostream(self, outputstream):
def finalize(self):
if self.engine == Engine.pikepdf:
PdfArray = pikepdf.Array
PdfDict = pikepdf.Dictionary
@ -1267,7 +1272,9 @@ class pdfdoc(object):
self.writer.addobj(metadata)
self.writer.addobj(iccstream)
# now write out the PDF
def tostream(self, outputstream):
# write out the PDF
# this assumes that finalize() has been invoked beforehand by the caller
if self.engine == Engine.pikepdf:
kwargs = {}
if pikepdf.__version__ >= "6.2.0":
@ -1276,6 +1283,8 @@ class pdfdoc(object):
outputstream, min_version=self.output_version, linearize=True, **kwargs
)
elif self.engine == Engine.pdfrw:
from pdfrw import PdfName, PdfArray
self.writer.trailer.Info = self.writer.docinfo
# setting the version attribute of the pdfrw PdfWriter object will
# influence the behaviour of the write() function
@ -1295,51 +1304,27 @@ class pdfdoc(object):
raise ValueError("unknown engine: %s" % self.engine)
def get_imgmetadata(
imgdata, imgformat, default_dpi, colorspace, rawdata=None, rotreq=None
):
if imgformat == ImageFormat.JPEG2000 and rawdata is not None and imgdata is None:
# this codepath gets called if the PIL installation is not able to
# handle JPEG2000 files
imgwidthpx, imgheightpx, ics, hdpi, vdpi, channels, bpp = jp2.parse(rawdata)
def pil_get_dpi(imgdata, imgformat, default_dpi):
ndpi = imgdata.info.get("dpi")
if ndpi is None:
# the PNG plugin of PIL adds the undocumented "aspect" field instead of
# the "dpi" field if the PNG pHYs chunk unit is not set to meters
if imgformat == ImageFormat.PNG and imgdata.info.get("aspect") is not None:
aspect = imgdata.info["aspect"]
# make sure not to go below the default dpi
if aspect[0] > aspect[1]:
ndpi = (default_dpi * aspect[0] / aspect[1], default_dpi)
else:
ndpi = (default_dpi, default_dpi * aspect[1] / aspect[0])
else:
ndpi = (default_dpi, default_dpi)
if hdpi is None:
hdpi = default_dpi
if vdpi is None:
vdpi = default_dpi
ndpi = (hdpi, vdpi)
else:
imgwidthpx, imgheightpx = imgdata.size
ndpi = imgdata.info.get("dpi", (default_dpi, default_dpi))
# In python3, the returned dpi value for some tiff images will
# not be an integer but a float. To make the behaviour of
# img2pdf the same between python2 and python3, we convert that
# float into an integer by rounding.
# Search online for the 72.009 dpi problem for more info.
ndpi = (int(round(ndpi[0])), int(round(ndpi[1])))
ics = imgdata.mode
# GIF and PNG files with transparency are supported
if imgformat in [ImageFormat.PNG, ImageFormat.GIF, ImageFormat.JPEG2000] and (
ics in ["RGBA", "LA"] or "transparency" in imgdata.info
):
# Must check the IHDR chunk for the bit depth, because PIL would lossily
# convert 16-bit RGBA/LA images to 8-bit.
if imgformat == ImageFormat.PNG and rawdata is not None:
depth = rawdata[24]
if depth > 8:
logger.warning("Image with transparency and a bit depth of %d." % depth)
logger.warning("This is unsupported due to PIL limitations.")
logger.warning(
"If you accept a lossy conversion, you can manually convert "
"your images to 8 bit using `convert -depth 8` from imagemagick"
)
raise AlphaChannelError(
"Refusing to work with multiple >8bit channels."
)
elif ics in ["LA", "PA", "RGBA"] or "transparency" in imgdata.info:
raise AlphaChannelError("This function must not be called on images with alpha")
# In python3, the returned dpi value for some tiff images will
# not be an integer but a float. To make the behaviour of
# img2pdf the same between python2 and python3, we convert that
# float into an integer by rounding.
# Search online for the 72.009 dpi problem for more info.
ndpi = (int(round(ndpi[0])), int(round(ndpi[1])))
# Since commit 07a96209597c5e8dfe785c757d7051ce67a980fb or release 4.1.0
# Pillow retrieves the DPI from EXIF if it cannot find the DPI in the JPEG
@ -1356,11 +1341,112 @@ def get_imgmetadata(
imgdata.tag_v2.get(TiffImagePlugin.Y_RESOLUTION, default_dpi),
)
return ndpi
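
As a reading aid, the fallback order at the start of `pil_get_dpi()` can be reduced to a small function over a plain dictionary. This sketch takes an `info` mapping shaped like PIL's `Image.info` and a boolean in place of the `ImageFormat` enum; both simplifications, and the name `dpi_from_info`, are assumptions for illustration. It leaves out the EXIF and TIFF handling that the real function performs afterwards.

```python
def dpi_from_info(info, is_png, default_dpi=96.0):
    """Mimic the dpi fallback order: "dpi" field, PNG "aspect" field, default."""
    ndpi = info.get("dpi")
    if ndpi is None:
        aspect = info.get("aspect") if is_png else None
        if aspect is not None:
            # Scale the larger axis up so neither axis drops below the default.
            if aspect[0] > aspect[1]:
                ndpi = (default_dpi * aspect[0] / aspect[1], default_dpi)
            else:
                ndpi = (default_dpi, default_dpi * aspect[1] / aspect[0])
        else:
            ndpi = (default_dpi, default_dpi)
    # Round to integers, as the original does for the "72.009 dpi" case.
    return (int(round(ndpi[0])), int(round(ndpi[1])))
```

The point of the aspect branch is that a PNG whose pHYs chunk has no unit still encodes a pixel aspect ratio, so the page shape can be preserved even though the absolute resolution is unknown.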
def get_imgmetadata(
imgdata, imgformat, default_dpi, colorspace, rawdata=None, rotreq=None
):
if imgformat == ImageFormat.JPEG2000 and rawdata is not None and imgdata is None:
# this codepath gets called if the PIL installation is not able to
# handle JPEG2000 files
imgwidthpx, imgheightpx, ics, hdpi, vdpi, channels, bpp = jp2.parse(rawdata)
if hdpi is None:
hdpi = default_dpi
if vdpi is None:
vdpi = default_dpi
ndpi = (hdpi, vdpi)
elif imgformat == ImageFormat.JBIG2:
imgwidthpx, imgheightpx, xres, yres = struct.unpack(">IIII", rawdata[24:40])
INCH_PER_METER = 39.370079
if xres == 0:
hdpi = default_dpi
elif xres < 1000:
# If xres is very small, it's likely accidentally expressed in dpi instead
# of dpm. See e.g. https://github.com/agl/jbig2enc/issues/86
hdpi = xres
else:
hdpi = int(float(xres) / INCH_PER_METER)
if yres == 0:
vdpi = default_dpi
elif yres < 1000:
vdpi = yres
else:
vdpi = int(float(yres) / INCH_PER_METER)
ndpi = (hdpi, vdpi)
ics = "1"
else:
imgwidthpx, imgheightpx = imgdata.size
ndpi = pil_get_dpi(imgdata, imgformat, default_dpi)
ics = imgdata.mode
logger.debug("input dpi = %d x %d", *ndpi)
# GIF and PNG files with transparency are supported
if imgformat in [ImageFormat.PNG, ImageFormat.GIF, ImageFormat.JPEG2000] and (
ics in ["RGBA", "LA"]
or (imgdata is not None and "transparency" in imgdata.info)
):
# Must check the IHDR chunk for the bit depth, because PIL would lossily
# convert 16-bit RGBA/LA images to 8-bit.
if imgformat == ImageFormat.PNG and rawdata is not None:
depth = rawdata[24]
if depth > 8:
logger.warning("Image with transparency and a bit depth of %d." % depth)
logger.warning("This is unsupported due to PIL limitations.")
logger.warning(
"If you accept a lossy conversion, you can manually convert "
"your images to 8 bit using `convert -depth 8` from imagemagick"
)
raise AlphaChannelError(
"Refusing to work with multiple >8bit channels."
)
elif ics in ["LA", "PA", "RGBA"] or (
imgdata is not None and "transparency" in imgdata.info
):
raise AlphaChannelError("This function must not be called on images with alpha")
rotation = 0
if rotreq in (None, Rotation.auto, Rotation.ifvalid):
if hasattr(imgdata, "_getexif") and imgdata._getexif() is not None:
if hasattr(imgdata, "getexif") and imgdata.getexif() is not None:
exif_dict = imgdata.getexif()
o_key = ExifTags.Base.Orientation.value  # 274, i.e. 0x112
if exif_dict and o_key in exif_dict:
# Detailed information on EXIF rotation tags:
# http://impulseadventure.com/photo/exif-orientation.html
value = exif_dict[o_key]
if value == 1:
rotation = 0
elif value == 6:
rotation = 90
elif value == 3:
rotation = 180
elif value == 8:
rotation = 270
elif value in (2, 4, 5, 7):
if rotreq == Rotation.ifvalid:
logger.warning(
"Unsupported flipped rotation mode (%d): use "
"--rotation=ifvalid or "
"rotation=img2pdf.Rotation.ifvalid to ignore",
value,
)
else:
raise ExifOrientationError(
"Unsupported flipped rotation mode (%d): use "
"--rotation=ifvalid or "
"rotation=img2pdf.Rotation.ifvalid to ignore" % value
)
else:
if rotreq == Rotation.ifvalid:
logger.warning("Invalid rotation (%d)", value)
else:
raise ExifOrientationError(
"Invalid rotation (%d): use --rotation=ifvalid "
"or rotation=img2pdf.Rotation.ifvalid to ignore" % value
)
elif hasattr(imgdata, "_getexif") and imgdata._getexif() is not None:
for tag, value in imgdata._getexif().items():
if TAGS.get(tag, tag) == "Orientation":
# Detailed information on EXIF rotation tags:
@ -1395,6 +1481,7 @@ def get_imgmetadata(
"Invalid rotation (%d): use --rotation=ifvalid "
"or rotation=img2pdf.Rotation.ifvalid to ignore" % value
)
elif rotreq in (Rotation.none, Rotation["0"]):
rotation = 0
elif rotreq == Rotation["90"]:
@ -1443,7 +1530,7 @@ def get_imgmetadata(
logger.debug("input colorspace = %s", color.name)
iccp = None
if "icc_profile" in imgdata.info:
if imgdata is not None and "icc_profile" in imgdata.info:
iccp = imgdata.info.get("icc_profile")
# GIMP saves bilevel TIFF images and palette PNG images with only black and
# white in the palette with an RGB ICC profile which is useless
@ -1481,22 +1568,16 @@ def get_imgmetadata(
# SmartAlbums old version (found 2.2.6) exports JPG with only 1 compone
# with an RGB ICC profile which is useless.
# This produces an error in Adobe Acrobat, so we ignore it with a warning.
# Update: Found another case, the JPG is created by Adobe PhotoShop, so we
# don't check software anymore.
if iccp is not None and (
(color == Colorspace["L"] and imgformat == ImageFormat.JPEG)
):
exifsoft = None
if hasattr(imgdata, "_getexif") and imgdata._getexif() is not None:
for tag, value in imgdata._getexif().items():
if TAGS.get(tag, tag) == "Software":
exifsoft = value
with io.BytesIO(iccp) as f:
prf = ImageCms.ImageCmsProfile(f)
if (prf.profile.model and "sRGB" in prf.profile.model) and (
exifsoft and "SmartAlbums" in exifsoft
):
logger.warning(
"Ignoring RGB ICC profile in Grayscale JPG created by SmartAlbums"
)
if prf.profile.xcolor_space not in ("GRAY"):
logger.warning("Ignoring non-GRAY ICC profile in Grayscale JPG")
iccp = None
logger.debug("width x height = %dpx x %dpx", imgwidthpx, imgheightpx)
@ -1618,6 +1699,7 @@ miff_re = re.compile(
re.VERBOSE,
)
# https://imagemagick.org/script/miff.php
# turn off black formatting until python 3.10 is available on more platforms
# and we can use match/case
@ -1799,8 +1881,6 @@ def parse_miff(data):
results.extend(parse_miff(rest[lenpal + lenimgdata :]))
return results
# fmt: on
def read_images(
rawdata, colorspace, first_frame_only=False, rot=None, include_thumbnails=False
):
@ -1814,7 +1894,51 @@ def read_images(
if rawdata[:12] == b"\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A":
# image is jpeg2000
imgformat = ImageFormat.JPEG2000
if rawdata[:14].lower() == b"id=imagemagick":
elif rawdata[:8] == b"\x97\x4a\x42\x32\x0d\x0a\x1a\x0a":
# For now we only support single-page generic coding of JBIG2, for example as generated by
# https://github.com/agl/jbig2enc
#
# In fact, you can pipe an example image like `src/tests/input/mono.png` directly into img2pdf:
# jbig2 src/tests/input/mono.png | img2pdf -o src/tests/output/mono.png.pdf
#
# For this we assume that the first 13 bytes are the JBIG2 file header describing a document with one page,
# followed by a "page information" segment describing the dimensions of that page.
#
# The following annotated `hexdump -C 042.jb2` shows the first 40 bytes that we inspect directly.
# The first 24 bytes (until "||") have to match exactly, while the following 16 bytes are read by get_imgmetadata.
#
# 97 4a 42 32 0d 0a 1a 0a 01 00 00 00 01 00 00 00
# \_____________________/ | \_________/ \______
# magic-bytes org/unk pages seg-num
#
# 00 30 00 01 00 00 00 13 || 00 00 00 73 00 00 00 30
# _/ | | | \_________/ || \_________/ \_________/
# type refs page seg-size || width-px height-px
#
# 00 00 00 48 00 00 00 48
# \_________/ \_________/
# xres yres
#
# For more information on the data format, see:
# * https://github.com/agl/jbig2enc/blob/ea05019/fcd14492.pdf
# For more information about the generic coding, see:
# * https://github.com/agl/jbig2enc/blob/ea05019/src/jbig2enc.cc#L898
imgformat = ImageFormat.JBIG2
if (
rawdata[:24]
!= b"\x97\x4a\x42\x32\x0d\x0a\x1a\x0a\x01\x00\x00\x00\x01\x00\x00\x00\x00\x30\x00\x01\x00\x00\x00\x13"
):
raise ImageOpenError(
"Unsupported JBIG2 format; only single-page generic coding is supported (e.g. from `jbig2enc`)."
)
if (
rawdata[-22:]
!= b"\x00\x00\x00\x021\x00\x01\x00\x00\x00\x00\x00\x00\x00\x033\x00\x01\x00\x00\x00\x00"
):
raise ImageOpenError(
"Unsupported JBIG2 format; we expect end-of-page and end-of-file segments at the end (e.g. from `jbig2enc`)."
)
elif rawdata[:14].lower() == b"id=imagemagick":
# image is in MIFF format
# this is useful for 16 bit CMYK because PNG cannot do CMYK and thus
# we need PIL but PIL cannot do 16 bit
@ -1826,12 +1950,7 @@ def read_images(
)
else:
logger.debug("PIL format = %s", imgdata.format)
imgformat = None
for f in ImageFormat:
if f.name == imgdata.format:
imgformat = f
if imgformat is None:
imgformat = ImageFormat.other
imgformat = getattr(ImageFormat, imgdata.format, ImageFormat.other)
def cleanup():
if imgdata is not None:
@ -2060,6 +2179,28 @@ def read_images(
)
]
if imgformat == ImageFormat.JBIG2:
color, ndpi, imgwidthpx, imgheightpx, rotation, iccp = get_imgmetadata(
imgdata, imgformat, default_dpi, colorspace, rawdata, rot
)
streamdata = rawdata[13:-22] # Strip file header and footer
return [
(
color,
ndpi,
imgformat,
streamdata,
None,
imgwidthpx,
imgheightpx,
[],
False,
1,
rotation,
iccp,
)
]
if imgformat == ImageFormat.MIFF:
return parse_miff(rawdata)
@ -2599,14 +2740,11 @@ def find_scale(pagewidth, pageheight):
return 10 ** ceil(log10(oversized))
# given one or more input images, depending on outputstream, either return a
# string containing the whole PDF if outputstream is None or write the PDF
# data to the given file-like object and return None
#
# Input images can be given as file-like objects (they must implement read()),
# as a binary string representing the image content or as filenames to the
# images.
def convert(*images, **kwargs):
# Convert the image(s) to a `pdfdoc` object.
# The `.writer` attribute holds the underlying engine document handle, and
# `.output_version` the minimum version the caller should use when saving.
# The main convert() wraps this implementation function.
def convert_to_docobject(*images, **kwargs):
_default_kwargs = dict(
engine=None,
title=None,
@ -2627,7 +2765,6 @@ def convert(*images, **kwargs):
viewer_fit_window=False,
viewer_center_window=False,
viewer_fullscreen=False,
outputstream=None,
first_frame_only=False,
allow_oversized=True,
cropborder=None,
@ -2790,10 +2927,22 @@ def convert(*images, **kwargs):
iccp,
)
if kwargs["outputstream"]:
pdf.tostream(kwargs["outputstream"])
return
pdf.finalize()
return pdf
# given one or more input images, depending on outputstream, either return a
# string containing the whole PDF if outputstream is None or write the PDF
# data to the given file-like object and return None
#
# Input images can be given as file-like objects (they must implement read()),
# as a binary string representing the image content or as filenames to the
# images.
def convert(*images, outputstream=None, **kwargs):
pdf = convert_to_docobject(*images, **kwargs)
if outputstream:
pdf.tostream(outputstream)
return
return pdf.tostring()
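
The split above separates building the document from writing it. A toy model of that control flow, with no PDF logic at all, might look like the following. `SketchDoc` is a hypothetical stand-in for `pdfdoc` that simply concatenates byte strings; only the call order (`finalize()` inside `convert_to_docobject()`, then `tostream()` or `tostring()` in `convert()`) mirrors the diff.

```python
import io


class SketchDoc:
    """Hypothetical stand-in for pdfdoc; it only concatenates byte strings."""

    def __init__(self):
        self.pages = []
        self.finalized = False

    def add_page(self, data):
        self.pages.append(data)

    def finalize(self):
        # In the real class this builds the document-level structures once.
        self.finalized = True

    def tostream(self, outputstream):
        # Like the diff, this assumes the caller invoked finalize() beforehand.
        if not self.finalized:
            raise RuntimeError("finalize() must be called before writing")
        for page in self.pages:
            outputstream.write(page)

    def tostring(self):
        stream = io.BytesIO()
        self.tostream(stream)
        return stream.getvalue()


def convert_to_docobject(*images, **kwargs):
    doc = SketchDoc()
    for image in images:
        doc.add_page(image)
    doc.finalize()
    return doc


def convert(*images, outputstream=None, **kwargs):
    doc = convert_to_docobject(*images, **kwargs)
    if outputstream:
        doc.tostream(outputstream)
        return None
    return doc.tostring()
```

Making `outputstream` a keyword-only parameter of the wrapper keeps it out of the `kwargs` that are forwarded, which matches its removal from `_default_kwargs` in the diff.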
@ -3316,10 +3465,10 @@ def gui():
f.seek(0)
doc = fitz.open(stream=f, filetype="pdf")
for page in doc:
if page.getDisplayList().rect.width > maxpagewidth:
maxpagewidth = page.getDisplayList().rect.width
if page.getDisplayList().rect.height > maxpageheight:
maxpageheight = page.getDisplayList().rect.height
if page.get_displaylist().rect.width > maxpagewidth:
maxpagewidth = page.get_displaylist().rect.width
if page.get_displaylist().rect.height > maxpageheight:
maxpageheight = page.get_displaylist().rect.height
draw()
def save_pdf(stream):
@ -3471,9 +3620,9 @@ def gui():
mat_0 = fitz.Matrix(zoom, zoom)
canvas.image = tkinter.PhotoImage(
data=doc[pagenum]
.getDisplayList()
.getPixmap(matrix=mat_0, alpha=False)
.getImageData("ppm")
.get_displaylist()
.get_pixmap(matrix=mat_0, alpha=False)
.tobytes("ppm")
)
canvas.create_image(
(canvas.size[0] - maxpagewidth * zoom) / 2,
@ -3820,14 +3969,31 @@ def gui():
app.mainloop()
def file_is_icc(fname):
with open(fname, "rb") as f:
data = f.read(40)
if len(data) < 40:
return False
return data[36:] == b"acsp"
def validate_icc(fname):
if not file_is_icc(fname):
raise argparse.ArgumentTypeError('"%s" is not an ICC profile' % fname)
return fname
def get_default_icc_profile():
for profile in [
"/usr/share/color/icc/sRGB.icc",
"/usr/share/color/icc/OpenICC/sRGB.icc",
"/usr/share/color/icc/colord/sRGB.icc",
]:
if os.path.exists(profile):
return profile
if not os.path.exists(profile):
continue
if not file_is_icc(profile):
continue
return profile
return "/usr/share/color/icc/sRGB.icc"
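
The check in `file_is_icc()` relies on the ICC profile header carrying the signature `acsp` at byte offsets 36 to 39. Below is a small sketch that applies the same test to in-memory bytes, which is convenient for experimenting without profile files on disk. The helper name `looks_like_icc` is made up for this example.

```python
def looks_like_icc(data):
    """Return True if data starts with a plausible ICC profile header."""
    header = data[:40]
    if len(header) < 40:
        # Too short to contain the signature field at offsets 36..39.
        return False
    return header[36:] == b"acsp"


# A synthetic header: 36 arbitrary bytes followed by the signature.
fake_profile = bytes(36) + b"acsp" + bytes(88)
not_a_profile = b"\x89PNG\r\n\x1a\n" + bytes(120)

print(looks_like_icc(fake_profile))
print(looks_like_icc(not_a_profile))
```

This only inspects the signature, so it guards against passing an unrelated file to `--pdfa` but does not validate the rest of the profile.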
@ -3936,6 +4102,10 @@ Examples:
$ img2pdf --output out.pdf page1.jpg page2.jpg
Use a custom dpi value for the input images:
$ img2pdf --output out.pdf --imgsize 300dpi page1.jpg page2.jpg
Convert a directory of JPEG images into a PDF with printable A4 pages in
landscape mode. On each page, the photo takes the maximum amount of space
while preserving its aspect ratio and a print border of 2 cm on the top and
@ -4098,17 +4268,29 @@ RGB.""",
% Image.MAX_IMAGE_PIXELS,
)
outargs.add_argument(
"--pdfa",
nargs="?",
const=get_default_icc_profile(),
default=None,
help="Output a PDF/A-1b compliant document. By default, this will "
"embed either /usr/share/color/icc/sRGB.icc, "
"/usr/share/color/icc/OpenICC/sRGB.icc or "
"/usr/share/color/icc/colord/sRGB.icc as the color profile, whichever "
"is found to exist first.",
)
if sys.platform == "win32":
# on Windows, there are no default paths to search for an ICC profile
# so make the argument required instead of optional
outargs.add_argument(
"--pdfa",
type=validate_icc,
help="Output a PDF/A-1b compliant document. The argument to this "
"option is the path to the ICC profile that will be embedded into "
"the resulting PDF.",
)
else:
outargs.add_argument(
"--pdfa",
nargs="?",
const=get_default_icc_profile(),
default=None,
type=validate_icc,
help="Output a PDF/A-1b compliant document. By default, this will "
"embed either /usr/share/color/icc/sRGB.icc, "
"/usr/share/color/icc/OpenICC/sRGB.icc or "
"/usr/share/color/icc/colord/sRGB.icc as the color profile, whichever "
"is found to exist first.",
)
sizeargs = parser.add_argument_group(
title="Image and page size and layout arguments",

File diff suppressed because it is too large

BIN
src/tests/input/mono.jb2 Normal file

Binary file not shown.

Binary file not shown.