Compare commits

2 commits

Author SHA1 Message Date
5100507403 Address discrepancies between PDF and XMP timestamps
The PDF format and XMP metadata specs define different syntax for dates,
so account for these discrepancies by more carefully constructing the
final timestamps by post-processing strftime() output.
2023-05-29 19:55:05 -07:00
1dd05cc36b Treat default creation/mod dates as UTC (fixes #155)
(Tested with Python 3.11.3 on Arch Linux.)

Without passing a tzinfo object to `datetime.now()`, a "naive" datetime
object is created, which is not timezone-aware. To fix the default
date/time detection for non-UTC local timezones, pass
`datetime.timezone.utc` to convert the value to UTC and make the
datetime object "aware".

Also, adjust the strftime() wrappers to use the UTC offsets instead of a
literal `Z`; using the literal `Z` at the end appears to be valid for
ISO 8601, but for some reason it does not successfully convert, whereas
the `%z` placeholder substitutes the UTC offset and successfully
converts.
2023-05-29 14:17:56 -07:00
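The two commit messages above can be illustrated with a small standalone sketch. It is not the code from the diff; the helper names (`utc_offset_parts`, `to_pdf_date`, `to_xmp_date`) are hypothetical and exist only for this example. It assumes what the messages describe: `strftime("%z")` yields an empty string for a naive datetime and `[+-]HHMM...` for an aware one, and the offset is then rewritten as `[+-]HH'MM'` for PDF and `[+-]HH:MM` for XMP.

```python
from datetime import datetime, timedelta, timezone


def utc_offset_parts(dt):
    """Return (sign, hours, minutes) of dt's UTC offset, or None if dt is naive."""
    tz = dt.strftime("%z")  # "" for naive datetimes, otherwise [+-]HHMM[SS[.ffffff]]
    if not tz:
        return None
    return tz[0], tz[1:3], tz[3:5]


def to_pdf_date(dt):
    # PDF style: YYYYMMDDHHMMSS followed by [+-]HH'MM' (hypothetical helper)
    base = dt.strftime("%Y%m%d%H%M%S")
    parts = utc_offset_parts(dt)
    return base if parts is None else base + "%s%s'%s'" % parts


def to_xmp_date(dt):
    # XMP style: YYYY-MM-DDTHH:MM:SS followed by [+-]HH:MM (hypothetical helper)
    base = dt.strftime("%Y-%m-%dT%H:%M:%S")
    parts = utc_offset_parts(dt)
    return base if parts is None else base + "%s%s:%s" % parts


# A naive datetime carries no offset; an aware one does.
naive = datetime(2023, 5, 29, 14, 17, 56)
local = datetime(2023, 5, 29, 14, 17, 56, tzinfo=timezone(timedelta(hours=-7)))
utc = datetime(2023, 5, 29, 21, 17, 56, tzinfo=timezone.utc)

print(to_pdf_date(naive))  # no offset suffix
print(to_pdf_date(local))
print(to_xmp_date(local))
print(to_pdf_date(utc))
```

With `datetime.now(tz=timezone.utc)` the default timestamp is always aware, so the offset suffix is always present and reads `+00'00'` (PDF) or `+00:00` (XMP).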
9 changed files with 402 additions and 1031 deletions


@ -2,29 +2,6 @@
CHANGES CHANGES
======= =======
0.6.0 (2025-02-15)
------------------
- Add support for JBIG2 (generic coding)
- Add convert_to_docobject() broken out from convert()
- Add pil_get_dpi() broken out from get_imgmetadata()
0.5.1 (2023-11-26)
------------------
- no default ICC profile location for PDF/A-1b on Windows
- workaround for PNG input without dpi units but non-square dpi aspect ratio
0.5.0 (2023-10-28)
------------------
- support MIFF for 16 bit CMYK input
- accept pathlib.Path objects as input
- don't store RGB ICC profiles from bilevel or grayscale TIFF, PNG and JPEG
- thumbnails are no longer included by default and --include-thumbnails has to
be used if you want them
- support for pikepdf (>= 6.2.0)
0.4.4 (2022-04-07) 0.4.4 (2022-04-07)
------------------ ------------------

HACKING

@ -27,57 +27,6 @@ Making a new release
- Build and upload to pypi: - Build and upload to pypi:
$ rm -rf dist/* $ rm dist/*
$ python3 setup.py sdist $ python3 setup.py sdist
$ twine upload dist/* $ twine upload --sign dist/*
- Push everything to git forge
$ git push
- Push to github
$ git push github
- Obtain img2pdf.exe from appveyor:
https://ci.appveyor.com/project/josch/img2pdf/
- Create new release:
https://gitlab.mister-muffin.de/josch/img2pdf/releases/new
Using debbisect to find regressions
-----------------------------------
$ debbisect --cache=./cache --depends="git,ca-certificates,python3,
ghostscript,imagemagick,mupdf-tools,poppler-utils,python3-pil,
python3-pytest,python3-numpy,python3-scipy,python3-pikepdf" \
--verbose 2023-09-16 2023-10-24 \
'chroot "$1" sh -c "
git clone https://gitlab.mister-muffin.de/josch/img2pdf.git
&& cd img2pdf
&& pytest 'src/img2pdf_test.py::test_jpg_2000_rgba8[internal]"'
Using debbisect cache
---------------------
$ mmdebstrap --variant=apt --aptopt='Acquire::Check-Valid-Until "false"' \
--include=git,ca-certificates,python3,ghostscript,imagemagick \
--include=mupdf-tools,poppler-utils,python3-pil,python3-pytest \
--include=python3-numpy,python3-scipy,python3-pikepdf \
--hook-dir=/usr/share/mmdebstrap/hooks/file-mirror-automount \
--setup-hook='mkdir -p "$1/home/josch/git/devscripts/cache/pool/"' \
--setup-hook='mount -o ro,bind /home/josch/git/devscripts/cache/pool/ "$1/home/josch/git/devscripts/cache/pool/"' \
--chrooted-customize-hook=bash
unstable /dev/null
file:///home/josch/git/devscripts/cache/archive/debian/20231022T090139Z/
Bisecting imagemagick
---------------------
$ git clean -fdx && git reset --hard
$ ./configure --prefix=$(pwd)/prefix
$ make -j$(nproc)
$ make install
$ LD_LIBRARY_PATH=$(pwd)/prefix/lib prefix/bin/compare ...


@ -27,20 +27,18 @@ software, because the raw pixel data never has to be loaded into memory.
The following table shows how img2pdf handles different input depending on the The following table shows how img2pdf handles different input depending on the
input file format and image color space. input file format and image color space.
| Format | Colorspace | Result | | Format | Colorspace | Result |
| ------------------------------------- | ------------------------------------ | ------------- | | ------------------------------------- | ------------------------------ | ------------- |
| JPEG | any | direct | | JPEG | any | direct |
| JPEG2000 | any | direct | | JPEG2000 | any | direct |
| PNG (non-interlaced, no transparency) | any | direct | | PNG (non-interlaced, no transparency) | any | direct |
| TIFF (CCITT Group 4) | 1-bit monochrome | direct | | TIFF (CCITT Group 4) | monochrome | direct |
| JBIG2 (single-page generic coding) | 1-bit monochrome | direct | | any | any except CMYK and monochrome | PNG Paeth |
| any | any except CMYK and 1-bit monochrome | PNG Paeth | | any | monochrome | CCITT Group 4 |
| any | 1-bit monochrome | CCITT Group 4 | | any | CMYK | flate |
| any | CMYK | flate |
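The table rows can be read as a lookup from (format, colorspace) to the "Result" column. The following is only a sketch of that reading, not img2pdf's API: `expected_result` and its string labels are hypothetical, and it covers only the rows that appear on both sides of the diff (the JBIG2 row is present on one side only and is left out).

```python
def expected_result(fmt, colorspace):
    """Return the 'Result' column of the table for a (format, colorspace) pair.

    Both arguments are plain strings used only in this sketch.
    """
    # Formats that are embedded directly regardless of colorspace.
    direct_formats = {"JPEG", "JPEG2000", "PNG (non-interlaced, no transparency)"}
    if fmt in direct_formats:
        return "direct"
    # CCITT Group 4 TIFF is embedded directly only for monochrome input.
    if fmt == "TIFF (CCITT Group 4)" and colorspace == "monochrome":
        return "direct"
    # Everything else is re-encoded depending on the colorspace.
    if colorspace == "CMYK":
        return "flate"
    if colorspace == "monochrome":
        return "CCITT Group 4"
    return "PNG Paeth"


for pair in [("JPEG", "CMYK"), ("GIF", "RGB"), ("GIF", "monochrome"), ("TIFF", "CMYK")]:
    print(pair, "->", expected_result(*pair))
```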
For JPEG, JPEG2000, non-interlaced PNG, TIFF images with CCITT Group 4 For JPEG, JPEG2000, non-interlaced PNG and TIFF images with CCITT Group 4
encoded data, and JBIG2 with single-page generic coding (e.g. using `jbig2enc`), encoded data, img2pdf directly embeds the image data into the PDF without
img2pdf directly embeds the image data into the PDF without
re-encoding it. It thus treats the PDF format merely as a container format for re-encoding it. It thus treats the PDF format merely as a container format for
the image data. In these cases, img2pdf only increases the filesize by the size the image data. In these cases, img2pdf only increases the filesize by the size
of the PDF container (typically around 500 to 700 bytes). Since data is only of the PDF container (typically around 500 to 700 bytes). Since data is only
@ -49,7 +47,7 @@ solutions for these input formats.
For all other input types, img2pdf first has to transform the pixel data to For all other input types, img2pdf first has to transform the pixel data to
make it compatible with PDF. In most cases, the PNG Paeth filter is applied to make it compatible with PDF. In most cases, the PNG Paeth filter is applied to
the pixel data. For 1-bit monochrome input, CCITT Group 4 is used instead. Only for the pixel data. For monochrome input, CCITT Group 4 is used instead. Only for
CMYK input no filter is applied before finally applying flate compression. CMYK input no filter is applied before finally applying flate compression.
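For readers unfamiliar with the PNG Paeth filter mentioned above, here is a minimal sketch of the predictor as defined by the PNG specification, applied to one row of bytes. It is illustrative only and makes no claim about how img2pdf applies the filter internally; `paeth_predictor` and `paeth_filter_row` are names chosen for this example.

```python
def paeth_predictor(a, b, c):
    """Paeth predictor: a = left, b = above, c = upper-left byte."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    # Pick the neighbour closest to the initial estimate p; ties prefer a, then b.
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def paeth_filter_row(row, prev_row, bpp):
    """Filter one scanline; bpp is the number of bytes per complete pixel."""
    out = bytearray()
    for i, x in enumerate(row):
        a = row[i - bpp] if i >= bpp else 0       # byte to the left
        b = prev_row[i]                           # byte above
        c = prev_row[i - bpp] if i >= bpp else 0  # byte above and to the left
        out.append((x - paeth_predictor(a, b, c)) & 0xFF)
    return bytes(out)


# A smooth gradient turns into a run of identical bytes, which compresses
# well in the subsequent flate step.
filtered = paeth_filter_row(bytes([10, 20, 30]), bytes(3), bpp=1)
print(list(filtered))
```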
Usage Usage
@ -67,12 +65,6 @@ The detailed documentation can be accessed by running:
$ img2pdf --help $ img2pdf --help
With no command line arguments supplied, img2pdf will read a single image from
standard input and write the resulting PDF to standard output. Here is an
example for how to scan directly to PDF using scanimage(1) from SANE:
$ scanimage --mode=Color --resolution=300 | pnmtojpeg -quality 90 | img2pdf > scan.pdf
Bugs Bugs
---- ----


@ -1,7 +1,7 @@
import sys import sys
from setuptools import setup from setuptools import setup
VERSION = "0.6.0" VERSION = "0.4.4"
INSTALL_REQUIRES = ( INSTALL_REQUIRES = (
"Pillow", "Pillow",


@ -22,7 +22,7 @@ import sys
import os import os
import zlib import zlib
import argparse import argparse
from PIL import Image, TiffImagePlugin, GifImagePlugin, ImageCms, ExifTags from PIL import Image, TiffImagePlugin, GifImagePlugin
if hasattr(GifImagePlugin, "LoadingStrategy"): if hasattr(GifImagePlugin, "LoadingStrategy"):
# Pillow 9.0.0 started emitting all frames but the first as RGB instead of # Pillow 9.0.0 started emitting all frames but the first as RGB instead of
@ -36,8 +36,9 @@ if hasattr(GifImagePlugin, "LoadingStrategy"):
# TiffImagePlugin.DEBUG = True # TiffImagePlugin.DEBUG = True
from PIL.ExifTags import TAGS from PIL.ExifTags import TAGS
from datetime import datetime, timezone from datetime import datetime
import jp2 from datetime import timezone
from jp2 import parsejp2
from enum import Enum from enum import Enum
from io import BytesIO from io import BytesIO
import logging import logging
@ -46,7 +47,6 @@ import platform
import hashlib import hashlib
from itertools import chain from itertools import chain
import re import re
import io
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@ -62,7 +62,7 @@ try:
except ImportError: except ImportError:
have_pikepdf = False have_pikepdf = False
__version__ = "0.6.0" __version__ = "0.4.4"
default_dpi = 96.0 default_dpi = 96.0
papersizes = { papersizes = {
"letter": "8.5inx11in", "letter": "8.5inx11in",
@ -128,7 +128,7 @@ PageOrientation = Enum("PageOrientation", "portrait landscape")
Colorspace = Enum("Colorspace", "RGB RGBA L LA 1 CMYK CMYK;I P PA other") Colorspace = Enum("Colorspace", "RGB RGBA L LA 1 CMYK CMYK;I P PA other")
ImageFormat = Enum( ImageFormat = Enum(
"ImageFormat", "JPEG JPEG2000 CCITTGroup4 PNG GIF TIFF MPO MIFF JBIG2 other" "ImageFormat", "JPEG JPEG2000 CCITTGroup4 PNG GIF TIFF MPO MIFF other"
) )
PageMode = Enum("PageMode", "none outlines thumbs") PageMode = Enum("PageMode", "none outlines thumbs")
@ -722,7 +722,16 @@ class pdfdoc(object):
self.writer.docinfo = PdfDict(indirect=True) self.writer.docinfo = PdfDict(indirect=True)
def datetime_to_pdfdate(dt): def datetime_to_pdfdate(dt):
return dt.astimezone(tz=timezone.utc).strftime("%Y%m%d%H%M%SZ") time_no_tz = dt.strftime("%Y%m%d%H%M%S")
tz_pdf = ""
# Format for `%z` specifier is [+-]HHMM(SS(\.ffffff)?)?, but the
# PDF format only accepts the [+-]HHMM part, and it must be
# formatted as [+-]HH'MM'. Note that PDF 1.7 removed the need for
# the trailing apostrophe (after MM), but earlier specs require it.
tz = dt.strftime("%z")
if tz:
tz_pdf = "%s%s'%s'" % (tz[0], tz[1:3], tz[3:5])
return time_no_tz + tz_pdf
for k in ["Title", "Author", "Creator", "Producer", "Subject"]: for k in ["Title", "Author", "Creator", "Producer", "Subject"]:
v = locals()[k.lower()] v = locals()[k.lower()]
@ -732,7 +741,7 @@ class pdfdoc(object):
v = PdfString.encode(v) v = PdfString.encode(v)
self.writer.docinfo[getattr(PdfName, k)] = v self.writer.docinfo[getattr(PdfName, k)] = v
now = datetime.now().astimezone() now = datetime.now(tz=timezone.utc)
for k in ["CreationDate", "ModDate"]: for k in ["CreationDate", "ModDate"]:
v = locals()[k.lower()] v = locals()[k.lower()]
if v is None and nodate: if v is None and nodate:
@ -752,7 +761,15 @@ class pdfdoc(object):
) )
def datetime_to_xmpdate(dt): def datetime_to_xmpdate(dt):
return dt.astimezone(tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ") time_no_tz = dt.strftime("%Y-%m-%dT%H:%M:%S")
tz_xmp = ""
# Format for `%z` specifier is [+-]HHMM(SS(\.ffffff)?)?, but the
# XMP metadata only accepts the [+-]HHMM part, and it must be
# formatted as [+-]HH:MM.
tz = dt.strftime("%z")
if tz:
tz_xmp = "%s%s:%s" % (tz[0], tz[1:3], tz[3:5])
return time_no_tz + tz_xmp
self.xmp = b"""<?xpacket begin='\xef\xbb\xbf' id='W5M0MpCehiHzreSzNTczkc9d'?> self.xmp = b"""<?xpacket begin='\xef\xbb\xbf' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'> <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>
@ -827,10 +844,8 @@ class pdfdoc(object):
artborder=None, artborder=None,
iccp=None, iccp=None,
): ):
assert ( assert (color != Colorspace.RGBA and color != Colorspace.LA) or (
color not in [Colorspace.RGBA, Colorspace.LA] imgformat == ImageFormat.PNG and smaskdata is not None
or (imgformat == ImageFormat.PNG and smaskdata is not None)
or imgformat == ImageFormat.JPEG2000
) )
if self.engine == Engine.pikepdf: if self.engine == Engine.pikepdf:
@ -854,13 +869,7 @@ class pdfdoc(object):
if color == Colorspace["1"] or color == Colorspace.L or color == Colorspace.LA: if color == Colorspace["1"] or color == Colorspace.L or color == Colorspace.LA:
colorspace = PdfName.DeviceGray colorspace = PdfName.DeviceGray
elif color == Colorspace.RGB or color == Colorspace.RGBA: elif color == Colorspace.RGB or color == Colorspace.RGBA:
if color == Colorspace.RGBA and imgformat == ImageFormat.JPEG2000: colorspace = PdfName.DeviceRGB
# there is no DeviceRGBA and for JPXDecode it is okay to have
# no colorspace as the pdf reader is supposed to get this info
# from the jpeg2000 payload itself
colorspace = None
else:
colorspace = PdfName.DeviceRGB
elif color == Colorspace.CMYK or color == Colorspace["CMYK;I"]: elif color == Colorspace.CMYK or color == Colorspace["CMYK;I"]:
colorspace = PdfName.DeviceCMYK colorspace = PdfName.DeviceCMYK
elif color == Colorspace.P: elif color == Colorspace.P:
@ -918,11 +927,6 @@ class pdfdoc(object):
self.output_version = "1.5" # jpeg2000 needs pdf 1.5 self.output_version = "1.5" # jpeg2000 needs pdf 1.5
elif imgformat is ImageFormat.CCITTGroup4: elif imgformat is ImageFormat.CCITTGroup4:
ofilter = [PdfName.CCITTFaxDecode] ofilter = [PdfName.CCITTFaxDecode]
elif imgformat is ImageFormat.JBIG2:
ofilter = PdfName.JBIG2Decode
# JBIG2Decode requires PDF 1.4
if self.output_version < "1.4":
self.output_version = "1.4"
else: else:
ofilter = PdfName.FlateDecode ofilter = PdfName.FlateDecode
@ -936,8 +940,7 @@ class pdfdoc(object):
image[PdfName.Filter] = ofilter image[PdfName.Filter] = ofilter
image[PdfName.Width] = imgwidthpx image[PdfName.Width] = imgwidthpx
image[PdfName.Height] = imgheightpx image[PdfName.Height] = imgheightpx
if colorspace is not None: image[PdfName.ColorSpace] = colorspace
image[PdfName.ColorSpace] = colorspace
image[PdfName.BitsPerComponent] = depth image[PdfName.BitsPerComponent] = depth
smask = None smask = None
@ -1080,7 +1083,7 @@ class pdfdoc(object):
self.tostream(stream) self.tostream(stream)
return stream.getvalue() return stream.getvalue()
def finalize(self): def tostream(self, outputstream):
if self.engine == Engine.pikepdf: if self.engine == Engine.pikepdf:
PdfArray = pikepdf.Array PdfArray = pikepdf.Array
PdfDict = pikepdf.Dictionary PdfDict = pikepdf.Dictionary
@ -1272,9 +1275,7 @@ class pdfdoc(object):
self.writer.addobj(metadata) self.writer.addobj(metadata)
self.writer.addobj(iccstream) self.writer.addobj(iccstream)
def tostream(self, outputstream): # now write out the PDF
# write out the PDF
# this assumes that finalize() has been invoked beforehand by the caller
if self.engine == Engine.pikepdf: if self.engine == Engine.pikepdf:
kwargs = {} kwargs = {}
if pikepdf.__version__ >= "6.2.0": if pikepdf.__version__ >= "6.2.0":
@ -1283,8 +1284,6 @@ class pdfdoc(object):
outputstream, min_version=self.output_version, linearize=True, **kwargs outputstream, min_version=self.output_version, linearize=True, **kwargs
) )
elif self.engine == Engine.pdfrw: elif self.engine == Engine.pdfrw:
from pdfrw import PdfName, PdfArray
self.writer.trailer.Info = self.writer.docinfo self.writer.trailer.Info = self.writer.docinfo
# setting the version attribute of the pdfrw PdfWriter object will # setting the version attribute of the pdfrw PdfWriter object will
# influence the behaviour of the write() function # influence the behaviour of the write() function
@ -1304,27 +1303,47 @@ class pdfdoc(object):
raise ValueError("unknown engine: %s" % self.engine) raise ValueError("unknown engine: %s" % self.engine)
def pil_get_dpi(imgdata, imgformat, default_dpi): def get_imgmetadata(
ndpi = imgdata.info.get("dpi") imgdata, imgformat, default_dpi, colorspace, rawdata=None, rotreq=None
if ndpi is None: ):
# the PNG plugin of PIL adds the undocumented "aspect" field instead of if imgformat == ImageFormat.JPEG2000 and rawdata is not None and imgdata is None:
# the "dpi" field if the PNG pHYs chunk unit is not set to meters # this codepath gets called if the PIL installation is not able to
if imgformat == ImageFormat.PNG and imgdata.info.get("aspect") is not None: # handle JPEG2000 files
aspect = imgdata.info["aspect"] imgwidthpx, imgheightpx, ics, hdpi, vdpi = parsejp2(rawdata)
# make sure not to go below the default dpi
if aspect[0] > aspect[1]:
ndpi = (default_dpi * aspect[0] / aspect[1], default_dpi)
else:
ndpi = (default_dpi, default_dpi * aspect[1] / aspect[0])
else:
ndpi = (default_dpi, default_dpi)
# In python3, the returned dpi value for some tiff images will if hdpi is None:
# not be an integer but a float. To make the behaviour of hdpi = default_dpi
# img2pdf the same between python2 and python3, we convert that if vdpi is None:
# float into an integer by rounding. vdpi = default_dpi
# Search online for the 72.009 dpi problem for more info. ndpi = (hdpi, vdpi)
ndpi = (int(round(ndpi[0])), int(round(ndpi[1]))) else:
imgwidthpx, imgheightpx = imgdata.size
ndpi = imgdata.info.get("dpi", (default_dpi, default_dpi))
# In python3, the returned dpi value for some tiff images will
# not be an integer but a float. To make the behaviour of
# img2pdf the same between python2 and python3, we convert that
# float into an integer by rounding.
# Search online for the 72.009 dpi problem for more info.
ndpi = (int(round(ndpi[0])), int(round(ndpi[1])))
ics = imgdata.mode
# GIF and PNG files with transparency are supported
if (imgformat == ImageFormat.PNG or imgformat == ImageFormat.GIF) and (
ics in ["RGBA", "LA"] or "transparency" in imgdata.info
):
# Must check the IHDR chunk for the bit depth, because PIL would lossily
# convert 16-bit RGBA/LA images to 8-bit.
if imgformat == ImageFormat.PNG and rawdata is not None:
depth = rawdata[24]
if depth > 8:
logger.warning("Image with transparency and a bit depth of %d." % depth)
logger.warning("This is unsupported due to PIL limitations.")
raise AlphaChannelError(
"Refusing to work with multiple >8bit channels."
)
elif ics in ["LA", "PA", "RGBA"] or "transparency" in imgdata.info:
raise AlphaChannelError("This function must not be called on images with alpha")
# Since commit 07a96209597c5e8dfe785c757d7051ce67a980fb or release 4.1.0 # Since commit 07a96209597c5e8dfe785c757d7051ce67a980fb or release 4.1.0
# Pillow retrieves the DPI from EXIF if it cannot find the DPI in the JPEG # Pillow retrieves the DPI from EXIF if it cannot find the DPI in the JPEG
@ -1341,112 +1360,11 @@ def pil_get_dpi(imgdata, imgformat, default_dpi):
imgdata.tag_v2.get(TiffImagePlugin.Y_RESOLUTION, default_dpi), imgdata.tag_v2.get(TiffImagePlugin.Y_RESOLUTION, default_dpi),
) )
return ndpi
def get_imgmetadata(
imgdata, imgformat, default_dpi, colorspace, rawdata=None, rotreq=None
):
if imgformat == ImageFormat.JPEG2000 and rawdata is not None and imgdata is None:
# this codepath gets called if the PIL installation is not able to
# handle JPEG2000 files
imgwidthpx, imgheightpx, ics, hdpi, vdpi, channels, bpp = jp2.parse(rawdata)
if hdpi is None:
hdpi = default_dpi
if vdpi is None:
vdpi = default_dpi
ndpi = (hdpi, vdpi)
elif imgformat == ImageFormat.JBIG2:
imgwidthpx, imgheightpx, xres, yres = struct.unpack(">IIII", rawdata[24:40])
INCH_PER_METER = 39.370079
if xres == 0:
hdpi = default_dpi
elif xres < 1000:
# If xres is very small, it's likely accidentally expressed in dpi instead
# of dpm. See e.g. https://github.com/agl/jbig2enc/issues/86
hdpi = xres
else:
hdpi = int(float(xres) / INCH_PER_METER)
if yres == 0:
vdpi = default_dpi
elif yres < 1000:
vdpi = yres
else:
vdpi = int(float(yres) / INCH_PER_METER)
ndpi = (hdpi, vdpi)
ics = "1"
else:
imgwidthpx, imgheightpx = imgdata.size
ndpi = pil_get_dpi(imgdata, imgformat, default_dpi)
ics = imgdata.mode
logger.debug("input dpi = %d x %d", *ndpi) logger.debug("input dpi = %d x %d", *ndpi)
# GIF and PNG files with transparency are supported
if imgformat in [ImageFormat.PNG, ImageFormat.GIF, ImageFormat.JPEG2000] and (
ics in ["RGBA", "LA"]
or (imgdata is not None and "transparency" in imgdata.info)
):
# Must check the IHDR chunk for the bit depth, because PIL would lossily
# convert 16-bit RGBA/LA images to 8-bit.
if imgformat == ImageFormat.PNG and rawdata is not None:
depth = rawdata[24]
if depth > 8:
logger.warning("Image with transparency and a bit depth of %d." % depth)
logger.warning("This is unsupported due to PIL limitations.")
logger.warning(
"If you accept a lossy conversion, you can manually convert "
"your images to 8 bit using `convert -depth 8` from imagemagick"
)
raise AlphaChannelError(
"Refusing to work with multiple >8bit channels."
)
elif ics in ["LA", "PA", "RGBA"] or (
imgdata is not None and "transparency" in imgdata.info
):
raise AlphaChannelError("This function must not be called on images with alpha")
rotation = 0 rotation = 0
if rotreq in (None, Rotation.auto, Rotation.ifvalid): if rotreq in (None, Rotation.auto, Rotation.ifvalid):
if hasattr(imgdata, "getexif") and imgdata.getexif() is not None: if hasattr(imgdata, "_getexif") and imgdata._getexif() is not None:
exif_dict = imgdata.getexif()
o_key = ExifTags.Base.Orientation.value # 274 rsp. 0x112
if exif_dict and o_key in exif_dict:
# Detailed information on EXIF rotation tags:
# http://impulseadventure.com/photo/exif-orientation.html
value = exif_dict[o_key]
if value == 1:
rotation = 0
elif value == 6:
rotation = 90
elif value == 3:
rotation = 180
elif value == 8:
rotation = 270
elif value in (2, 4, 5, 7):
if rotreq == Rotation.ifvalid:
logger.warning(
"Unsupported flipped rotation mode (%d): use "
"--rotation=ifvalid or "
"rotation=img2pdf.Rotation.ifvalid to ignore",
value,
)
else:
raise ExifOrientationError(
"Unsupported flipped rotation mode (%d): use "
"--rotation=ifvalid or "
"rotation=img2pdf.Rotation.ifvalid to ignore" % value
)
else:
if rotreq == Rotation.ifvalid:
logger.warning("Invalid rotation (%d)", value)
else:
raise ExifOrientationError(
"Invalid rotation (%d): use --rotation=ifvalid "
"or rotation=img2pdf.Rotation.ifvalid to ignore" % value
)
elif hasattr(imgdata, "_getexif") and imgdata._getexif() is not None:
for tag, value in imgdata._getexif().items(): for tag, value in imgdata._getexif().items():
if TAGS.get(tag, tag) == "Orientation": if TAGS.get(tag, tag) == "Orientation":
# Detailed information on EXIF rotation tags: # Detailed information on EXIF rotation tags:
@ -1481,7 +1399,6 @@ def get_imgmetadata(
"Invalid rotation (%d): use --rotation=ifvalid " "Invalid rotation (%d): use --rotation=ifvalid "
"or rotation=img2pdf.Rotation.ifvalid to ignore" % value "or rotation=img2pdf.Rotation.ifvalid to ignore" % value
) )
elif rotreq in (Rotation.none, Rotation["0"]): elif rotreq in (Rotation.none, Rotation["0"]):
rotation = 0 rotation = 0
elif rotreq == Rotation["90"]: elif rotreq == Rotation["90"]:
@ -1530,55 +1447,8 @@ def get_imgmetadata(
logger.debug("input colorspace = %s", color.name) logger.debug("input colorspace = %s", color.name)
iccp = None iccp = None
if imgdata is not None and "icc_profile" in imgdata.info: if "icc_profile" in imgdata.info:
iccp = imgdata.info.get("icc_profile") iccp = imgdata.info.get("icc_profile")
# GIMP saves bilevel TIFF images and palette PNG images with only black and
# white in the palette with an RGB ICC profile which is useless
# https://gitlab.gnome.org/GNOME/gimp/-/issues/3438
# and produces an error in Adobe Acrobat, so we ignore it with a warning.
# imagemagick also used to (wrongly) include an RGB ICC profile for bilevel
# images: https://github.com/ImageMagick/ImageMagick/issues/2070
if iccp is not None and (
(color == Colorspace["1"] and imgformat == ImageFormat.TIFF)
or (
imgformat == ImageFormat.PNG
and color == Colorspace.P
and rawdata is not None
and parse_png(rawdata)[1]
in [b"\x00\x00\x00\xff\xff\xff", b"\xff\xff\xff\x00\x00\x00"]
)
):
with io.BytesIO(iccp) as f:
prf = ImageCms.ImageCmsProfile(f)
if (
prf.profile.model == "sRGB"
and prf.profile.manufacturer == "GIMP"
and prf.profile.profile_description == "GIMP built-in sRGB"
):
if imgformat == ImageFormat.TIFF:
logger.warning(
"Ignoring RGB ICC profile in bilevel TIFF produced by GIMP."
)
elif imgformat == ImageFormat.PNG:
logger.warning(
"Ignoring RGB ICC profile in 2-color palette PNG produced by GIMP."
)
logger.warning("https://gitlab.gnome.org/GNOME/gimp/-/issues/3438")
iccp = None
# SmartAlbums old version (found 2.2.6) exports JPG with only 1 component
# with an RGB ICC profile which is useless.
# This produces an error in Adobe Acrobat, so we ignore it with a warning.
# Update: Found another case, the JPG is created by Adobe PhotoShop, so we
# don't check software anymore.
if iccp is not None and (
(color == Colorspace["L"] and imgformat == ImageFormat.JPEG)
):
with io.BytesIO(iccp) as f:
prf = ImageCms.ImageCmsProfile(f)
if prf.profile.xcolor_space not in ("GRAY"):
logger.warning("Ignoring non-GRAY ICC profile in Grayscale JPG")
iccp = None
logger.debug("width x height = %dpx x %dpx", imgwidthpx, imgheightpx) logger.debug("width x height = %dpx x %dpx", imgwidthpx, imgheightpx)
@ -1699,7 +1569,6 @@ miff_re = re.compile(
re.VERBOSE, re.VERBOSE,
) )
# https://imagemagick.org/script/miff.php # https://imagemagick.org/script/miff.php
# turn off black formatting until python 3.10 is available on more platforms # turn off black formatting until python 3.10 is available on more platforms
# and we can use match/case # and we can use match/case
@ -1798,7 +1667,7 @@ def parse_miff(data):
elif hdata["colorspace"] == "Gray": elif hdata["colorspace"] == "Gray":
numchannels = 1 numchannels = 1
colorspace = Colorspace.L colorspace = Colorspace.L
if hdata.get("matte"): if hdata["matte"]:
numchannels += 1 numchannels += 1
if hdata.get("profile"): if hdata.get("profile"):
# there is no key encoding the length of icc or exif data # there is no key encoding the length of icc or exif data
@ -1848,7 +1717,7 @@ def parse_miff(data):
# case "PseudoClass": # case "PseudoClass":
elif hdata["class"] == "PseudoClass": elif hdata["class"] == "PseudoClass":
assert "colors" in hdata assert "colors" in hdata
if hdata.get("matte"): if hdata["matte"]:
numchannels = 2 numchannels = 2
else: else:
numchannels = 1 numchannels = 1
@ -1881,9 +1750,9 @@ def parse_miff(data):
results.extend(parse_miff(rest[lenpal + lenimgdata :])) results.extend(parse_miff(rest[lenpal + lenimgdata :]))
return results return results
# fmt: on # fmt: on
def read_images(
rawdata, colorspace, first_frame_only=False, rot=None, include_thumbnails=False
): def read_images(rawdata, colorspace, first_frame_only=False, rot=None):
im = BytesIO(rawdata) im = BytesIO(rawdata)
im.seek(0) im.seek(0)
imgdata = None imgdata = None
@ -1894,51 +1763,7 @@ def read_images(
if rawdata[:12] == b"\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A": if rawdata[:12] == b"\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A":
# image is jpeg2000 # image is jpeg2000
imgformat = ImageFormat.JPEG2000 imgformat = ImageFormat.JPEG2000
elif rawdata[:8] == b"\x97\x4a\x42\x32\x0d\x0a\x1a\x0a": if rawdata[:14].lower() == b"id=imagemagick":
# For now we only support single-page generic coding of JBIG2, for example as generated by
# https://github.com/agl/jbig2enc
#
# In fact, you can pipe an example image like `src/tests/input/mono.png` directly into img2pdf:
# jbig2 src/tests/input/mono.png | img2pdf -o src/tests/output/mono.png.pdf
#
# For this we assume that the first 13 bytes are the JBIG file header describing a document with one page,
# followed by a "page information" segment describing the dimensions of that page.
#
# The following annotated `hexdump -C 042.jb2` shows the first 40 bytes that we inspect directly.
# The first 24 bytes (until "||") have to match exactly, while the following 16 bytes are read by get_imgmetadata.
#
# 97 4a 42 32 0d 0a 1a 0a 01 00 00 00 01 00 00 00
# \_____________________/ | \_________/ \______
# magic-bytes org/unk pages seg-num
#
# 00 30 00 01 00 00 00 13 || 00 00 00 73 00 00 00 30
# _/ | | | \_________/ || \_________/ \_________/
# type refs page seg-size || width-px height-px
#
# 00 00 00 48 00 00 00 48
# \_________/ \_________/
# xres yres
#
# For more information on the data format, see:
# * https://github.com/agl/jbig2enc/blob/ea05019/fcd14492.pdf
# For more information about the generic coding, see:
# * https://github.com/agl/jbig2enc/blob/ea05019/src/jbig2enc.cc#L898
imgformat = ImageFormat.JBIG2
if (
rawdata[:24]
!= b"\x97\x4a\x42\x32\x0d\x0a\x1a\x0a\x01\x00\x00\x00\x01\x00\x00\x00\x00\x30\x00\x01\x00\x00\x00\x13"
):
raise ImageOpenError(
"Unsupported JBIG2 format; only single-page generic coding is supported (e.g. from `jbig2enc`)."
)
if (
rawdata[-22:]
!= b"\x00\x00\x00\x021\x00\x01\x00\x00\x00\x00\x00\x00\x00\x033\x00\x01\x00\x00\x00\x00"
):
raise ImageOpenError(
"Unsupported JBIG2 format; we expect end-of-page and end-of-file segments at the end (e.g. from `jbig2enc`)."
)
elif rawdata[:14].lower() == b"id=imagemagick":
# image is in MIFF format # image is in MIFF format
# this is useful for 16 bit CMYK because PNG cannot do CMYK and thus # this is useful for 16 bit CMYK because PNG cannot do CMYK and thus
# we need PIL but PIL cannot do 16 bit # we need PIL but PIL cannot do 16 bit
@ -1950,7 +1775,12 @@ def read_images(
) )
else: else:
logger.debug("PIL format = %s", imgdata.format) logger.debug("PIL format = %s", imgdata.format)
imgformat = getattr(ImageFormat, imgdata.format, ImageFormat.other) imgformat = None
for f in ImageFormat:
if f.name == imgdata.format:
imgformat = f
if imgformat is None:
imgformat = ImageFormat.other
def cleanup(): def cleanup():
if imgdata is not None: if imgdata is not None:
@ -1976,13 +1806,10 @@ def read_images(
raise JpegColorspaceError("jpeg can't be monochrome") raise JpegColorspaceError("jpeg can't be monochrome")
if color == Colorspace["P"]: if color == Colorspace["P"]:
raise JpegColorspaceError("jpeg can't have a color palette") raise JpegColorspaceError("jpeg can't have a color palette")
if color == Colorspace["RGBA"] and imgformat != ImageFormat.JPEG2000: if color == Colorspace["RGBA"]:
raise JpegColorspaceError("jpeg can't have an alpha channel") raise JpegColorspaceError("jpeg can't have an alpha channel")
logger.debug("read_images() embeds a JPEG") logger.debug("read_images() embeds a JPEG")
cleanup() cleanup()
depth = 8
if imgformat == ImageFormat.JPEG2000:
*_, depth = jp2.parse(rawdata)
return [ return [
( (
color, color,
@ -1994,7 +1821,7 @@ def read_images(
imgheightpx, imgheightpx,
[], [],
False, False,
depth, 8,
rotation, rotation,
iccp, iccp,
) )
@ -2011,77 +1838,6 @@ def read_images(
if imgformat == ImageFormat.MPO: if imgformat == ImageFormat.MPO:
result = [] result = []
img_page_count = 0 img_page_count = 0
assert len(imgdata._MpoImageFile__mpoffsets) == len(imgdata.mpinfo[0xB002])
num_frames = len(imgdata.mpinfo[0xB002])
# An MPO file can be a main image together with one or more thumbnails
# if that is the case, then we only include all frames if the
# --include-thumbnails option is given. If it is not, such an MPO file
# will be embedded as is, so including its thumbnails but showing up
# as a single image page in the resulting PDF.
num_main_frames = 0
num_thumbnail_frames = 0
for i, mpent in enumerate(imgdata.mpinfo[0xB002]):
# check only the first frame for being the main image
if (
i == 0
and mpent["Attribute"]["DependentParentImageFlag"]
and not mpent["Attribute"]["DependentChildImageFlag"]
and mpent["Attribute"]["RepresentativeImageFlag"]
and mpent["Attribute"]["MPType"] == "Baseline MP Primary Image"
):
num_main_frames += 1
elif (
not mpent["Attribute"]["DependentParentImageFlag"]
and mpent["Attribute"]["DependentChildImageFlag"]
and not mpent["Attribute"]["RepresentativeImageFlag"]
and mpent["Attribute"]["MPType"]
in [
"Large Thumbnail (VGA Equivalent)",
"Large Thumbnail (Full HD Equivalent)",
]
):
num_thumbnail_frames += 1
logger.debug(f"number of frames: {num_frames}")
logger.debug(f"number of main frames: {num_main_frames}")
logger.debug(f"number of thumbnail frames: {num_thumbnail_frames}")
# this MPO file is a main image plus zero or more thumbnails
# embed as-is unless the --include-thumbnails option was given
if num_frames == 1 or (
not include_thumbnails
and num_main_frames == 1
and num_thumbnail_frames + 1 == num_frames
):
color, ndpi, imgwidthpx, imgheightpx, rotation, iccp = get_imgmetadata(
imgdata, imgformat, default_dpi, colorspace, rawdata, rot
)
if color == Colorspace["1"]:
raise JpegColorspaceError("jpeg can't be monochrome")
if color == Colorspace["P"]:
raise JpegColorspaceError("jpeg can't have a color palette")
if color == Colorspace["RGBA"]:
raise JpegColorspaceError("jpeg can't have an alpha channel")
logger.debug("read_images() embeds an MPO verbatim")
cleanup()
return [
(
color,
ndpi,
ImageFormat.JPEG,
rawdata,
None,
imgwidthpx,
imgheightpx,
[],
False,
8,
rotation,
iccp,
)
]
# If the control flow reaches here, the MPO has more than a single
# frame but was not detected to be a main image followed by multiple
# thumbnails. We thus treat this MPO as we do other multi-frame images
# and include all its frames as individual pages.
for offset, mpent in zip( for offset, mpent in zip(
imgdata._MpoImageFile__mpoffsets, imgdata.mpinfo[0xB002] imgdata._MpoImageFile__mpoffsets, imgdata.mpinfo[0xB002]
): ):
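Review note: the MPO hunk above decides whether a file is "one main image plus thumbnails" by counting index entries. The sketch below restates that decision as two pure functions, which makes it testable without an actual MPO file. The names `count_mpo_frames` and `embed_verbatim` are hypothetical and do not appear in the diff; the input is assumed to be a list of dicts shaped like the entries of `imgdata.mpinfo[0xB002]` used in the hunk.

```python
# Hypothetical helpers that mirror the frame-counting logic in the hunk above.
MAIN_TYPE = "Baseline MP Primary Image"
THUMBNAIL_TYPES = (
    "Large Thumbnail (VGA Equivalent)",
    "Large Thumbnail (Full HD Equivalent)",
)


def count_mpo_frames(entries):
    """Return (num_main_frames, num_thumbnail_frames) for MP index entries."""
    num_main = 0
    num_thumb = 0
    for i, entry in enumerate(entries):
        attr = entry["Attribute"]
        # Only the first frame is considered as a candidate main image.
        if (
            i == 0
            and attr["DependentParentImageFlag"]
            and not attr["DependentChildImageFlag"]
            and attr["RepresentativeImageFlag"]
            and attr["MPType"] == MAIN_TYPE
        ):
            num_main += 1
        elif (
            not attr["DependentParentImageFlag"]
            and attr["DependentChildImageFlag"]
            and not attr["RepresentativeImageFlag"]
            and attr["MPType"] in THUMBNAIL_TYPES
        ):
            num_thumb += 1
    return num_main, num_thumb


def embed_verbatim(entries, include_thumbnails=False):
    """True if the MPO would be embedded as a single JPEG page."""
    num_frames = len(entries)
    num_main, num_thumb = count_mpo_frames(entries)
    return num_frames == 1 or (
        not include_thumbnails
        and num_main == 1
        and num_thumb + 1 == num_frames
    )
```

With this shape, a file holding two full-size frames (for example a stereo pair) is not embedded verbatim, because the second frame is counted neither as a main image nor as a thumbnail.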
@ -2179,28 +1935,6 @@ def read_images(
) )
] ]
if imgformat == ImageFormat.JBIG2:
color, ndpi, imgwidthpx, imgheightpx, rotation, iccp = get_imgmetadata(
imgdata, imgformat, default_dpi, colorspace, rawdata, rot
)
streamdata = rawdata[13:-22] # Strip file header and footer
return [
(
color,
ndpi,
imgformat,
streamdata,
None,
imgwidthpx,
imgheightpx,
[],
False,
1,
rotation,
iccp,
)
]
if imgformat == ImageFormat.MIFF: if imgformat == ImageFormat.MIFF:
return parse_miff(rawdata) return parse_miff(rawdata)
@ -2369,16 +2103,7 @@ def read_images(
) )
) )
else: else:
if color in [Colorspace.P, Colorspace.PA] and iccp is not None: if (
# PDF does not support palette images with icc profile
if color == Colorspace.P:
newcolor = Colorspace.RGB
newimg = newimg.convert(mode="RGB")
elif color == Colorspace.PA:
newcolor = Colorspace.RGBA
newimg = newimg.convert(mode="RGBA")
smaskidat = None
elif (
color == Colorspace.RGBA color == Colorspace.RGBA
or color == Colorspace.LA or color == Colorspace.LA
or color == Colorspace.PA or color == Colorspace.PA
@ -2392,21 +2117,25 @@ def read_images(
newcolor = color newcolor = color
l, a = newimg.split() l, a = newimg.split()
newimg = l newimg = l
elif color == Colorspace.PA or (
color == Colorspace.P and "transparency" in newimg.info
):
newcolor = color
a = newimg.convert(mode="RGBA").split()[-1]
else: else:
newcolor = Colorspace.RGBA newcolor = Colorspace.RGBA
r, g, b, a = newimg.convert(mode="RGBA").split() r, g, b, a = newimg.convert(mode="RGBA").split()
newimg = Image.merge("RGB", (r, g, b)) newimg = Image.merge("RGB", (r, g, b))
smaskidat, *_ = to_png_data(a) smaskidat, _, _ = to_png_data(a)
logger.warning( logger.warning(
"Image contains an alpha channel. Computing a separate " "Image contains an alpha channel. Computing a separate "
"soft mask (/SMask) image to store transparency in PDF." "soft mask (/SMask) image to store transparency in PDF."
) )
elif color in [Colorspace.P, Colorspace.PA] and iccp is not None:
# PDF does not support palette images with icc profile
if color == Colorspace.P:
newcolor = Colorspace.RGB
newimg = newimg.convert(mode="RGB")
elif color == Colorspace.PA:
newcolor = Colorspace.RGBA
newimg = newimg.convert(mode="RGBA")
smaskidat = None
else: else:
newcolor = color newcolor = color
smaskidat = None smaskidat = None
@ -2740,11 +2469,14 @@ def find_scale(pagewidth, pageheight):
return 10 ** ceil(log10(oversized)) return 10 ** ceil(log10(oversized))
# Convert the image(s) to a `pdfdoc` object. # given one or more input image, depending on outputstream, either return a
# The `.writer` attribute holds the underlying engine document handle, and # string containing the whole PDF if outputstream is None or write the PDF
# `.output_version` the minimum version the caller should use when saving. # data to the given file-like object and return None
# The main convert() wraps this implementation function. #
def convert_to_docobject(*images, **kwargs): # Input images can be given as file like objects (they must implement read()),
# as a binary string representing the image content or as filenames to the
# images.
def convert(*images, **kwargs):
_default_kwargs = dict( _default_kwargs = dict(
engine=None, engine=None,
title=None, title=None,
@ -2765,6 +2497,7 @@ def convert_to_docobject(*images, **kwargs):
viewer_fit_window=False, viewer_fit_window=False,
viewer_center_window=False, viewer_center_window=False,
viewer_fullscreen=False, viewer_fullscreen=False,
outputstream=None,
first_frame_only=False, first_frame_only=False,
allow_oversized=True, allow_oversized=True,
cropborder=None, cropborder=None,
@ -2773,7 +2506,6 @@ def convert_to_docobject(*images, **kwargs):
artborder=None, artborder=None,
pdfa=None, pdfa=None,
rotation=None, rotation=None,
include_thumbnails=False,
) )
for kwname, default in _default_kwargs.items(): for kwname, default in _default_kwargs.items():
if kwname not in kwargs: if kwname not in kwargs:
@ -2866,7 +2598,6 @@ def convert_to_docobject(*images, **kwargs):
kwargs["colorspace"], kwargs["colorspace"],
kwargs["first_frame_only"], kwargs["first_frame_only"],
kwargs["rotation"], kwargs["rotation"],
kwargs["include_thumbnails"],
): ):
pagewidth, pageheight, imgwidthpdf, imgheightpdf = kwargs["layout_fun"]( pagewidth, pageheight, imgwidthpdf, imgheightpdf = kwargs["layout_fun"](
imgwidthpx, imgheightpx, ndpi imgwidthpx, imgheightpx, ndpi
@ -2927,22 +2658,10 @@ def convert_to_docobject(*images, **kwargs):
iccp, iccp,
) )
pdf.finalize() if kwargs["outputstream"]:
return pdf pdf.tostream(kwargs["outputstream"])
# given one or more input image, depending on outputstream, either return a
# string containing the whole PDF if outputstream is None or write the PDF
# data to the given file-like object and return None
#
# Input images can be given as file like objects (they must implement read()),
# as a binary string representing the image content or as filenames to the
# images.
def convert(*images, outputstream=None, **kwargs):
pdf = convert_to_docobject(*images, **kwargs)
if outputstream:
pdf.tostream(outputstream)
return return
return pdf.tostring() return pdf.tostring()
@ -3254,7 +2973,7 @@ def valid_date(string):
else: else:
try: try:
return parser.parse(string) return parser.parse(string)
except: except TypeError:
pass pass
# as a last resort, try the local date utility # as a last resort, try the local date utility
try: try:
@ -3267,7 +2986,7 @@ def valid_date(string):
except subprocess.CalledProcessError: except subprocess.CalledProcessError:
pass pass
else: else:
return datetime.fromtimestamp(int(utime)) return datetime.utcfromtimestamp(int(utime))
raise argparse.ArgumentTypeError("cannot parse date: %s" % string) raise argparse.ArgumentTypeError("cannot parse date: %s" % string)
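Review note: the two sides of this hunk differ in `datetime.fromtimestamp()` (naive, local time) versus `datetime.utcfromtimestamp()` (naive, UTC). Both return naive objects, and `utcfromtimestamp()` is deprecated as of Python 3.12. Since the commit messages talk about making datetimes "aware", a possible third option is sketched below. The helper name `timestamp_to_utc` is hypothetical and is not part of either commit.

```python
from datetime import datetime, timezone


def timestamp_to_utc(utime):
    """Convert a Unix timestamp (int or numeric string) to an aware UTC datetime."""
    # Passing tz= makes the result timezone-aware instead of naive,
    # so later strftime("%z") calls have an offset to render.
    return datetime.fromtimestamp(int(utime), tz=timezone.utc)
```

Whether this fits depends on how the rest of `valid_date()` and the strftime() wrappers treat naive values, which this diff does not show in full.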
@ -3969,35 +3688,7 @@ def gui():
app.mainloop() app.mainloop()
def file_is_icc(fname): def main(argv=sys.argv):
with open(fname, "rb") as f:
data = f.read(40)
if len(data) < 40:
return False
return data[36:] == b"acsp"
def validate_icc(fname):
if not file_is_icc(fname):
raise argparse.ArgumentTypeError('"%s" is not an ICC profile' % fname)
return fname
def get_default_icc_profile():
for profile in [
"/usr/share/color/icc/sRGB.icc",
"/usr/share/color/icc/OpenICC/sRGB.icc",
"/usr/share/color/icc/colord/sRGB.icc",
]:
if not os.path.exists(profile):
continue
if not file_is_icc(profile):
continue
return profile
return "/usr/share/color/icc/sRGB.icc"
def get_main_parser():
rendered_papersizes = "" rendered_papersizes = ""
for k, v in sorted(papersizes.items()): for k, v in sorted(papersizes.items()):
rendered_papersizes += " %-8s %s\n" % (papernames[k], v) rendered_papersizes += " %-8s %s\n" % (papernames[k], v)
@ -4038,9 +3729,7 @@ Paper sizes:
the value in the second column has the same effect as giving the short hand the value in the second column has the same effect as giving the short hand
in the first column. Appending ^T (a caret/circumflex followed by the letter in the first column. Appending ^T (a caret/circumflex followed by the letter
T) turns the paper size from portrait into landscape. The postfix thus T) turns the paper size from portrait into landscape. The postfix thus
symbolizes the transpose. Note that on Windows cmd.exe the caret symbol is symbolizes the transpose. The values are case insensitive.
the escape character, so you need to put quotes around the option value.
The values are case insensitive.
%s %s
@ -4102,16 +3791,12 @@ Examples:
$ img2pdf --output out.pdf page1.jpg page2.jpg $ img2pdf --output out.pdf page1.jpg page2.jpg
Use a custom dpi value for the input images:
$ img2pdf --output out.pdf --imgsize 300dpi page1.jpg page2.jpg
Convert a directory of JPEG images into a PDF with printable A4 pages in Convert a directory of JPEG images into a PDF with printable A4 pages in
landscape mode. On each page, the photo takes the maximum amount of space landscape mode. On each page, the photo takes the maximum amount of space
while preserving its aspect ratio and a print border of 2 cm on the top and while preserving its aspect ratio and a print border of 2 cm on the top and
bottom and 2.5 cm on the left and right hand side. bottom and 2.5 cm on the left and right hand side.
$ img2pdf --output out.pdf --pagesize "A4^T" --border 2cm:2.5cm *.jpg $ img2pdf --output out.pdf --pagesize A4^T --border 2cm:2.5cm *.jpg
On each A4 page, fit images into a 10 cm times 15 cm rectangle but keep the On each A4 page, fit images into a 10 cm times 15 cm rectangle but keep the
original image size if the image is smaller than that. original image size if the image is smaller than that.
@ -4246,17 +3931,6 @@ RGB.""",
"input image be converted into a page in the resulting PDF.", "input image be converted into a page in the resulting PDF.",
) )
outargs.add_argument(
"--include-thumbnails",
action="store_true",
help="Some multi-frame formats like MPO carry a main image and "
"one or more scaled-down copies of the main image (thumbnails). "
"In such a case, img2pdf will only include the main image and "
"not create additional pages for each of the thumbnails. If this "
"option is set, img2pdf will instead create one page per frame and "
"thus store each thumbnail on its own page.",
)
outargs.add_argument( outargs.add_argument(
"--pillow-limit-break", "--pillow-limit-break",
action="store_true", action="store_true",
@ -4268,29 +3942,14 @@ RGB.""",
% Image.MAX_IMAGE_PIXELS, % Image.MAX_IMAGE_PIXELS,
) )
if sys.platform == "win32": outargs.add_argument(
# on Windows, there are no default paths to search for an ICC profile "--pdfa",
# so make the argument required instead of optional nargs="?",
outargs.add_argument( const="/usr/share/color/icc/sRGB.icc",
"--pdfa", default=None,
type=validate_icc, help="Output a PDF/A-1b compliant document. By default, this will "
help="Output a PDF/A-1b compliant document. The argument to this " "embed /usr/share/color/icc/sRGB.icc as the color profile.",
"option is the path to the ICC profile that will be embedded into " )
"the resulting PDF.",
)
else:
outargs.add_argument(
"--pdfa",
nargs="?",
const=get_default_icc_profile(),
default=None,
type=validate_icc,
help="Output a PDF/A-1b compliant document. By default, this will "
"embed either /usr/share/color/icc/sRGB.icc, "
"/usr/share/color/icc/OpenICC/sRGB.icc or "
"/usr/share/color/icc/colord/sRGB.icc as the color profile, whichever "
"is found to exist first.",
)
sizeargs = parser.add_argument_group( sizeargs = parser.add_argument_group(
title="Image and page size and layout arguments", title="Image and page size and layout arguments",
@ -4579,11 +4238,8 @@ and left/right, respectively. It is not possible to specify asymmetric borders.
action="store_true", action="store_true",
help="Instruct the PDF viewer to open the PDF in fullscreen mode", help="Instruct the PDF viewer to open the PDF in fullscreen mode",
) )
return parser
args = parser.parse_args(argv[1:])
def main(argv=sys.argv):
args = get_main_parser().parse_args(argv[1:])
if args.verbose: if args.verbose:
logging.basicConfig(level=logging.DEBUG) logging.basicConfig(level=logging.DEBUG)
@ -4610,7 +4266,7 @@ def main(argv=sys.argv):
print( print(
"Reading image from standard input...\n" "Reading image from standard input...\n"
"Re-run with -h or --help for usage information.", "Re-run with -h or --help for usage information.",
file=sys.stderr, file=sys.stderr
) )
try: try:
images = [sys.stdin.buffer.read()] images = [sys.stdin.buffer.read()]
@ -4672,7 +4328,6 @@ def main(argv=sys.argv):
artborder=args.art_border, artborder=args.art_border,
pdfa=args.pdfa, pdfa=args.pdfa,
rotation=args.rotation, rotation=args.rotation,
include_thumbnails=args.include_thumbnails,
) )
except Exception as e: except Exception as e:
logger.error("error: " + str(e)) logger.error("error: " + str(e))

File diff suppressed because it is too large


@ -37,8 +37,9 @@ def getBox(data, byteStart, noBytes):
def parse_ihdr(data): def parse_ihdr(data):
height, width, channels, bpp = struct.unpack(">IIHB", data[:11]) height = struct.unpack(">I", data[0:4])[0]
return width, height, channels, bpp + 1 width = struct.unpack(">I", data[4:8])[0]
return width, height
def parse_colr(data): def parse_colr(data):
@ -58,8 +59,8 @@ def parse_colr(data):
def parse_resc(data): def parse_resc(data):
hnum, hden, vnum, vden, hexp, vexp = struct.unpack(">HHHHBB", data) hnum, hden, vnum, vden, hexp, vexp = struct.unpack(">HHHHBB", data)
hdpi = ((hnum / hden) * (10**hexp) * 100) / 2.54 hdpi = ((hnum / hden) * (10 ** hexp) * 100) / 2.54
vdpi = ((vnum / vden) * (10**vexp) * 100) / 2.54 vdpi = ((vnum / vden) * (10 ** vexp) * 100) / 2.54
return hdpi, vdpi return hdpi, vdpi
@ -84,13 +85,13 @@ def parse_jp2h(data):
while byteStart < noBytes and boxLengthValue != 0: while byteStart < noBytes and boxLengthValue != 0:
boxLengthValue, boxType, byteEnd, boxContents = getBox(data, byteStart, noBytes) boxLengthValue, boxType, byteEnd, boxContents = getBox(data, byteStart, noBytes)
if boxType == b"ihdr": if boxType == b"ihdr":
width, height, channels, bpp = parse_ihdr(boxContents) width, height = parse_ihdr(boxContents)
elif boxType == b"colr": elif boxType == b"colr":
colorspace = parse_colr(boxContents) colorspace = parse_colr(boxContents)
elif boxType == b"res ": elif boxType == b"res ":
hdpi, vdpi = parse_res(boxContents) hdpi, vdpi = parse_res(boxContents)
byteStart = byteEnd byteStart = byteEnd
return (width, height, colorspace, hdpi, vdpi, channels, bpp) return (width, height, colorspace, hdpi, vdpi)
def parsejp2(data): def parsejp2(data):
@ -101,9 +102,7 @@ def parsejp2(data):
while byteStart < noBytes and boxLengthValue != 0: while byteStart < noBytes and boxLengthValue != 0:
boxLengthValue, boxType, byteEnd, boxContents = getBox(data, byteStart, noBytes) boxLengthValue, boxType, byteEnd, boxContents = getBox(data, byteStart, noBytes)
if boxType == b"jp2h": if boxType == b"jp2h":
width, height, colorspace, hdpi, vdpi, channels, bpp = parse_jp2h( width, height, colorspace, hdpi, vdpi = parse_jp2h(boxContents)
boxContents
)
break break
byteStart = byteEnd byteStart = byteEnd
if not width: if not width:
@ -113,41 +112,13 @@ def parsejp2(data):
if not colorspace: if not colorspace:
raise Exception("no colorspace in jp2 header") raise Exception("no colorspace in jp2 header")
# retrieving the dpi is optional so we do not error out if not present # retrieving the dpi is optional so we do not error out if not present
return (width, height, colorspace, hdpi, vdpi, channels, bpp) return (width, height, colorspace, hdpi, vdpi)
def parsej2k(data):
lsiz, rsiz, xsiz, ysiz, xosiz, yosiz, _, _, _, _, csiz = struct.unpack(
">HHIIIIIIIIH", data[4:42]
)
ssiz = [None] * csiz
xrsiz = [None] * csiz
yrsiz = [None] * csiz
for i in range(csiz):
ssiz[i], xrsiz[i], yrsiz[i] = struct.unpack(
"BBB", data[42 + 3 * i : 42 + 3 * (i + 1)]
)
assert ssiz == [7, 7, 7]
return xsiz - xosiz, ysiz - yosiz, None, None, None, csiz, 8
def parse(data):
if data[:4] == b"\xff\x4f\xff\x51":
return parsej2k(data)
else:
return parsejp2(data)
if __name__ == "__main__": if __name__ == "__main__":
import sys import sys
width, height, colorspace, hdpi, vdpi, channels, bpp = parse( width, height, colorspace = parsejp2(open(sys.argv[1]).read())
open(sys.argv[1], "rb").read() sys.stdout.write("width = %d" % width)
) sys.stdout.write("height = %d" % height)
print("width = %d" % width) sys.stdout.write("colorspace = %s" % colorspace)
print("height = %d" % height)
print("colorspace = %s" % colorspace)
print("hdpi = %s" % hdpi)
print("vdpi = %s" % vdpi)
print("channels = %s" % channels)
print("bpp = %s" % bpp)
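Review note: the `parse_ihdr()` variants in this file differ in whether they unpack only height and width or also the channel count and bit depth. A quick way to sanity-check the four-field variant is to feed it a synthetic payload built with `struct.pack`. The sketch below copies the four-field version from the diff; the sample values (640x480, 3 channels, 8 bits) are made up for illustration.

```python
import struct


def parse_ihdr(data):
    # Layout used in the diff: height (u32), width (u32),
    # channel count (u16), bit depth minus one (u8), all big-endian.
    height, width, channels, bpp = struct.unpack(">IIHB", data[:11])
    return width, height, channels, bpp + 1


# Synthetic ihdr payload: height=480, width=640, 3 channels, stored depth 7.
sample = struct.pack(">IIHB", 480, 640, 3, 7)
width, height, channels, bpp = parse_ihdr(sample)
```

Note that the function returns width before height even though the header stores height first.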

Binary file not shown.

Binary file not shown.