diff --git a/CHANGES.rst b/CHANGES.rst index 69a9b36..ec2a745 100644 --- a/CHANGES.rst +++ b/CHANGES.rst @@ -2,7 +2,43 @@ CHANGES ======= -1.0.0 (unreleased) +0.1.6 +----- + + - replace -x and -y option by combined option -s (or --pagesize) and use -S + for --subject + - correctly encode and escape non-ascii metadata + - explicitly store date in UTC and allow parsing all date formats understood + by dateutil and `date --date` + +0.1.5 +----- + +- Enable support for CMYK images +- Rework test suite +- support file objects as input + +0.1.4 +----- + +- add Python 3 support +- make output reproducible by sorting and --nodate option + +0.1.3 +----- + +- Avoid leaking file descriptors +- Convert unrecognized colorspaces to RGB + +0.1.1 +----- + +- allow running src/img2pdf.py standalone +- license change from GPL to LGPL +- Add pillow 2.4.0 support +- add options to specify pdf dimensions in points + +0.1.0 (unreleased) ------------------ - Initial PyPI release. diff --git a/README.md b/README.md index 6061d8e..74f24b6 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,16 @@ img2pdf ======= -Lossless conversion of images to PDF without unnecessarily re-encoding JPEG and -JPEG2000 files. Thus, no loss of quality and no unnecessary large output file. +Losslessly convert images to PDF without unnecessarily re-encoding JPEG and +JPEG2000 files. Image quality is retained without unnecessarily increasing +file size. Background ---------- -PDF is able to embed JPEG and JPEG2000 images as they are without re-encoding -them (and hence loosing quality) but I was missing a tool to do this -automatically, thus I wrote this piece of python code. +Quality loss can be avoided when converting JPEG and JPEG2000 images to +PDF by embedding them without re-encoding. I wrote this piece of Python code +because I was missing a tool to do this automatically. 
If you know how to embed JPEG and JPEG2000 images into a PDF container without recompression, using existing tools, please contact me so that I can put this @@ -18,110 +19,173 @@ code into the garbage bin :D Functionality ------------- -The program will take image filenames from commandline arguments and output a -PDF file with them embedded into it. If the input image is a JPEG or JPEG2000 -file, it will be included as-is without any processing. If it is in any other -format, the image will be included as zip-encoded RGB. As a result, this tool -will be able to lossless wrap any image into a PDF container while performing -better (in terms of quality/filesize ratio) than existing tools in case the -input image is a JPEG or JPEG2000 file. +This program will take a list of images and produce a PDF file with the images +embedded in it. JPEG and JPEG2000 images will be included without +recompression. Images in other formats will be included with zip/flate +encoding which usually leads to an increase in the resulting size because +formats like png compress better than PDF which just zip/flate compresses the +RGB data. As a result, this tool is able to losslessly wrap images into a PDF +container with a quality-filesize ratio that is typically better (in case of +JPEG and JPEG2000 images) or equal (in case of other formats) than that of +existing tools. 
-For example, imagemagick will re-encode the input JPEG image and thus change -its content: +For example, imagemagick will re-encode the input JPEG image (thus changing +its content): $ convert img.jpg img.pdf $ pdfimages img.pdf img.extr # not using -j to be extra sure there is no recompression $ compare -metric AE img.jpg img.extr-000.ppm null: 1.6301e+06 -If one wants to do a lossless conversion from any format to PDF with -imagemagick then one has to use zip-encoding: +If one wants to losslessly convert from any format to PDF with +imagemagick, one has to use zip compression: $ convert input.jpg -compress Zip output.pdf $ pdfimages img.pdf img.extr # not using -j to be extra sure there is no recompression $ compare -metric AE img.jpg img.extr-000.ppm null: 0 -The downside is, that using imagemagick like this will make the resulting PDF -files a few times bigger than the input JPEG or JPEG2000 file and can also not -output a multipage PDF. +However, this approach will result in PDF files that are a few times larger +than the input JPEG or JPEG2000 file. -img2pdf is able to output a PDF with multiple pages if more than one input -image is given, losslessly embed JPEG and JPEG2000 files into a PDF container -without adding more overhead than the PDF structure itself and will save all -other graphics formats using lossless zip-compression. +img2pdf is able to losslessly embed JPEG and JPEG2000 files into a PDF +container without additional overhead (aside from the PDF structure itself), +save other graphics formats using lossless zip compression, +and produce multi-page PDF files when more than one input image is given. -Another nifty advantage: Since no re-encoding is done in case of JPEG images, -the conversion is many (ten to hundred) times faster with img2pdf compared to -imagemagick. While a run of above convert command with a 2.8MB JPEG takes 27 -seconds (on average) on my machine, conversion using img2pdf takes just a -fraction of a second. 
+Also, since JPEG and JPEG2000 images are not re-encoded, conversion with +img2pdf is several times faster than with other tools. -Commandline Arguments ---------------------- -At least one input file argument must be given as img2pdf needs to seek in the -file descriptor which would not be possible for stdin. +Usage +----- -Specify the dpi with the -d or --dpi options instead of reading it from the -image or falling back to 96.0. +#### General Notes -Specify the output file with -o or --output. By default output will be done to -stdout. +The images must be provided as files because img2pdf needs to seek +in the file descriptor. Input cannot be piped through stdin. -Specify metadata using the --title, --author, --creator, --producer, ---creationdate, --moddate, --subject and --keywords options (or their short -forms). +If no output file is specified with the `-o`/`--output` option, +output is written to stdout. -Specify -C or --colorspace to force a colorspace using PIL short handles like -'RGB', 'L' or '1'. +Descriptions of the options should be self-explanatory. +They are available by running: -More help is available with the -h or --help option. + img2pdf --help + + +#### Controlling Page Size + +The PDF page size can be manipulated. By default, the image will be sized "into" the given dimensions with the aspect ratio retained. For instance, to size an image into a page that is at most 500pt x 500pt, use: + + img2pdf -s 500x500 -o output.pdf input.jpg + +To "fill" out a page that is at least 500pt x 500pt, follow the dimensions with a `^`: + + img2pdf -s 500x500^ -o output.pdf input.jpg + +To output pages that are exactly 500pt x 500pt, follow the dimensions with an `!`: + + img2pdf -s 500x500\! -o output.pdf input.jpg + +Notice that the default unit is points. Units may also be specified and mixed: + + img2pdf -s 8.5inx27.94cm -o output.pdf input.jpg + +If either width or height is omitted, the other will be calculated +to preserve aspect ratio. 
+ + img2pdf -s x280mm -o output1.pdf input.jpg + img2pdf -s 280mmx -o output2.pdf input.jpg + +Some standard page sizes are recognized: + + img2pdf -s letter -o output1.pdf input.jpg + img2pdf -s a4 -o output2.pdf input.jpg + +#### Colorspace + +Currently, the colorspace must be forced for JPEG 2000 images that are +not in the RGB colorspace. Available colorspace options are based on +Python Imaging Library (PIL) short handles. + + * `RGB` = RGB color + * `L` = Grayscale + * `1` = Black and white (internally converted to grayscale) + * `CMYK` = CMYK color + * `CMYK;I` = CMYK color with inversion + +For example, to encode a grayscale JPEG2000 image, use: + + img2pdf -C L -o output.pdf input.jp2 Bugs ---- -If you find a JPEG or JPEG2000 file that, when embedded can not be read by the -Adobe Acrobat Reader, please contact me. +If you find a JPEG or JPEG2000 file that, when embedded cannot be read +by the Adobe Acrobat Reader, please contact me. + +For lossless conversion of formats other than JPEG or JPEG2000, zip/flate +encoding is used. This choice is based on tests I did with a number of images. +I converted them into PDF using the lossless variants of the compression +formats offered by imagemagick. In all my tests, zip/flate encoding performed +best. You can verify my findings using the test_comp.sh script with any input +image given as a commandline argument. If you find an input file that is +outperformed by another lossless compression method, contact me. + +I have not yet figured out how to determine the colorspace of JPEG2000 files. +Therefore JPEG2000 files use DeviceRGB by default. For JPEG2000 files with +other colorspaces, you must force it using the `--colorspace` option. -For lossless conversion of other formats than JPEG or JPEG2000 files, zip/flate -encoding is used. This choice is based on a number of tests I did on images. 
-I converted them into PDF using imagemagick and all compressions it has to -offer and then compared the output size of the lossless variants. In all my -tests, zip/flate encoding performed best. You can verify my findings using the -test_comp.sh script with any input image given as a commandline argument. If -you find an input file that is outperformed by another lossless compression, -contact me. +It might be possible to store transparency using masks but it is not clear +what the utility of such a functionality would be. -I have not yet figured out how to read the colorspace from jpeg2000 files. -Therefor jpeg2000 files use DeviceRGB per default. If your jpeg2000 files are -of any other colorspace you must force it using the --colorspace option. -Like -C L for DeviceGray. +Most vector graphic formats can be losslessly turned into PDF (minus some of +the features unsupported by PDF) but img2pdf will currently turn vector +graphics into their lossy raster representations. + +Acrobat is able to store a hint for the PDF reader of how to present the PDF +when opening it. Things like automatic fullscreen or the zoom level can be +configured. + +It would be nice if a single input image could be read from standard input. Installation ------------ -You can install the package using: +On Debian- and Ubuntu-based systems, dependencies may be installed +with the following command: + + apt-get install python python-pil python-setuptools + +Or for Python 3: + + apt-get install python3 python3-pil python3-setuptools -If you want to install from source code simply use: +You can then install the package using: - $ pip install img2pdf + $ pip install img2pdf + +If you prefer to install from source code, use: + + $ cd img2pdf/ + $ pip install . To test the console script without installing the package on your system, -simply use virtualenv: +use virtualenv: - $ cd img2pdf/ - $ virtualenv ve - $ ve/bin/pip install . 
+ $ cd img2pdf/ + $ virtualenv ve + $ ve/bin/pip install . You can then test the converter using: - $ ve/bin/img2pdf -o test.pdf src/tests/test.jpg + $ ve/bin/img2pdf -o test.pdf src/tests/test.jpg + +The package can also be used as a library: -Note that the package can also be used as a library as follows: + import img2pdf + pdf_bytes = img2pdf.convert(['test.jpg']) - import img2pdf - pdf_bytes = img2pdf('test.jpg', dpi=150) + file = open("name.pdf","wb") + file.write(pdf_bytes) diff --git a/setup.cfg b/setup.cfg new file mode 100644 index 0000000..b88034e --- /dev/null +++ b/setup.cfg @@ -0,0 +1,2 @@ +[metadata] +description-file = README.md diff --git a/setup.py b/setup.py index 2b490dc..1ad815c 100644 --- a/setup.py +++ b/setup.py @@ -1,9 +1,12 @@ from setuptools import setup +VERSION="0.1.6~git" + setup ( name='img2pdf', - version='0.1.0', + version=VERSION, author = "Johannes 'josch' Schauer", + author_email = 'j.schauer@email.de', description = "Convert images to PDF via direct JPEG inclusion.", long_description = open('README.md').read(), license = "LGPL", @@ -15,12 +18,15 @@ setup ( 'Programming Language :: Python :: 2', 'Programming Language :: Python :: 2.6', 'Programming Language :: Python :: 2.7', + 'Programming Language :: Python :: 3', + 'Programming Language :: Python :: 3.4', 'Programming Language :: Python :: Implementation :: CPython', 'License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)', 'Programming Language :: Python', 'Natural Language :: English', 'Operating System :: OS Independent'], - url = 'http://pypi.python.org/pypi/img2pdf', + url = 'https://github.com/josch/img2pdf', + download_url = 'https://github.com/josch/img2pdf/archive/'+VERSION+'.tar.gz', package_dir={"": "src"}, py_modules=['img2pdf', 'jp2'], include_package_data = True, diff --git a/src/img2pdf.py b/src/img2pdf.py index 91dd0e2..fd45a1a 100755 --- a/src/img2pdf.py +++ b/src/img2pdf.py @@ -1,3 +1,5 @@ +#!/usr/bin/env python2 + # Copyright (C) 2012-2014 
Johannes 'josch' Schauer # # This program is free software: you can redistribute it and/or @@ -15,13 +17,20 @@ # License along with this program. If not, see # . +__version__ = "0.1.6~git" +default_dpi = 96.0 + +import re import sys import zlib import argparse -import struct from PIL import Image from datetime import datetime from jp2 import parsejp2 +try: + from cStringIO import cStringIO +except ImportError: + from io import BytesIO as cStringIO # XXX: Switch to use logging module. def debug_out(message, verbose=True): @@ -34,19 +43,28 @@ def error_out(message): def warning_out(message): sys.stderr.write("W: "+message+"\n") +def datetime_to_pdfdate(dt): + return dt.strftime("%Y%m%d%H%M%SZ") + def parse(cont, indent=1): if type(cont) is dict: - return "<<\n"+"\n".join( - [4 * indent * " " + "%s %s" % (k, parse(v, indent+1)) - for k, v in cont.items()])+"\n"+4*(indent-1)*" "+">>" - elif type(cont) is int or type(cont) is float: - return str(cont) + return b"<<\n"+b"\n".join( + [4 * indent * b" " + k + b" " + parse(v, indent+1) + for k, v in sorted(cont.items())])+b"\n"+4*(indent-1)*b" "+b">>" + elif type(cont) is int: + return str(cont).encode() + elif type(cont) is float: + return ("%0.4f"%cont).encode() elif isinstance(cont, obj): - return "%d 0 R"%cont.identifier - elif type(cont) is str: + return ("%d 0 R"%cont.identifier).encode() + elif type(cont) is str or type(cont) is bytes: + if type(cont) is str and type(cont) is not bytes: + raise Exception("parse must be passed a bytes object in py3") return cont elif type(cont) is list: - return "[ "+" ".join([parse(c, indent) for c in cont])+" ]" + return b"[ "+b" ".join([parse(c, indent) for c in cont])+b" ]" + else: + raise Exception("cannot handle type %s"%type(cont)) class obj(object): def __init__(self, content, stream=None): @@ -56,56 +74,56 @@ class obj(object): def tostring(self): if self.stream: return ( - "%d 0 obj " % self.identifier + + ("%d 0 obj " % self.identifier).encode() + parse(self.content) + - 
"\nstream\n" + self.stream + "\nendstream\nendobj\n") + b"\nstream\n" + self.stream + b"\nendstream\nendobj\n") else: - return "%d 0 obj "%self.identifier+parse(self.content)+" endobj\n" + return ("%d 0 obj "%self.identifier).encode()+parse(self.content)+b" endobj\n" class pdfdoc(object): def __init__(self, version=3, title=None, author=None, creator=None, producer=None, creationdate=None, moddate=None, subject=None, - keywords=None): + keywords=None, nodate=False): self.version = version # default pdf version 1.3 now = datetime.now() self.objects = [] info = {} if title: - info["/Title"] = "("+title+")" + info[b"/Title"] = b"("+title+b")" if author: - info["/Author"] = "("+author+")" + info[b"/Author"] = b"("+author+b")" if creator: - info["/Creator"] = "("+creator+")" + info[b"/Creator"] = b"("+creator+b")" if producer: - info["/Producer"] = "("+producer+")" + info[b"/Producer"] = b"("+producer+b")" if creationdate: - info["/CreationDate"] = "(D:"+creationdate.strftime("%Y%m%d%H%M%S")+")" - else: - info["/CreationDate"] = "(D:"+now.strftime("%Y%m%d%H%M%S")+")" + info[b"/CreationDate"] = b"(D:"+datetime_to_pdfdate(creationdate).encode()+b")" + elif not nodate: + info[b"/CreationDate"] = b"(D:"+datetime_to_pdfdate(now).encode()+b")" if moddate: - info["/ModDate"] = "(D:"+moddate.strftime("%Y%m%d%H%M%S")+")" - else: - info["/ModDate"] = "(D:"+now.strftime("%Y%m%d%H%M%S")+")" + info[b"/ModDate"] = b"(D:"+datetime_to_pdfdate(moddate).encode()+b")" + elif not nodate: + info[b"/ModDate"] = b"(D:"+datetime_to_pdfdate(now).encode()+b")" if subject: - info["/Subject"] = "("+subject+")" + info[b"/Subject"] = b"("+subject+b")" if keywords: - info["/Keywords"] = "("+",".join(keywords)+")" + info[b"/Keywords"] = b"("+b",".join(keywords)+b")" self.info = obj(info) # create an incomplete pages object so that a /Parent entry can be # added to each page self.pages = obj({ - "/Type": "/Pages", - "/Kids": [], - "/Count": 0 + b"/Type": b"/Pages", + b"/Kids": [], + b"/Count": 0 }) 
self.catalog = obj({ - "/Pages": self.pages, - "/Type": "/Catalog" + b"/Pages": self.pages, + b"/Type": b"/Catalog" }) self.addobj(self.catalog) self.addobj(self.pages) @@ -115,71 +133,70 @@ class pdfdoc(object): obj.identifier = newid self.objects.append(obj) - def addimage(self, color, width, height, dpi, imgformat, imgdata): + def addimage(self, color, width, height, imgformat, imgdata, pdf_x, pdf_y): if color == 'L': - color = "/DeviceGray" + colorspace = b"/DeviceGray" elif color == 'RGB': - color = "/DeviceRGB" + colorspace = b"/DeviceRGB" + elif color == 'CMYK' or color == 'CMYK;I': + colorspace = b"/DeviceCMYK" else: error_out("unsupported color space: %s"%color) exit(1) - # pdf units = 1/72 inch - pdf_x, pdf_y = 72.0*width/dpi[0], 72.0*height/dpi[1] - - print(pdf_x) - print(pdf_y) - if pdf_x < 3.00 or pdf_y < 3.00: - warning_out("pdf width or height is below 3.00 - decrease the dpi") - elif pdf_x > 14400.0 or pdf_y > 14400.0: - #error_out(("pdf width or height is above 200.00 - increase the dpi") + warning_out("pdf width or height is below 3.00\" - decrease the dpi") + elif pdf_x > 200.0 or pdf_y > 200.0: warning_out("pdf width or height would be above 200\" - squeezed inside") - x_scale = 14400.0 / pdf_x - y_scale = 14400.0 / pdf_y + x_scale = 200.0 / pdf_x + y_scale = 200.0 / pdf_y scale = min(x_scale, y_scale) * 0.999 pdf_x *= scale pdf_y *= scale # either embed the whole jpeg or deflate the bitmap representation if imgformat is "JPEG": - ofilter = [ "/DCTDecode" ] - elif imgformat is "JP2": - ofilter = [ "/JPXDecode" ] + ofilter = [ b"/DCTDecode" ] + elif imgformat is "JPEG2000": + ofilter = [ b"/JPXDecode" ] self.version = 5 # jpeg2000 needs pdf 1.5 else: - ofilter = [ "/FlateDecode" ] + ofilter = [ b"/FlateDecode" ] image = obj({ - "/Type": "/XObject", - "/Subtype": "/Image", - "/Filter": ofilter, - "/Width": width, - "/Height": height, - "/ColorSpace": color, - # hardcoded as PIL doesnt provide bits for non-jpeg formats - "/BitsPerComponent": 8, - 
"/Length": len(imgdata) + b"/Type": b"/XObject", + b"/Subtype": b"/Image", + b"/Filter": ofilter, + b"/Width": width, + b"/Height": height, + b"/ColorSpace": colorspace, + # hardcoded as PIL doesn't provide bits for non-jpeg formats + b"/BitsPerComponent": 8, + b"/Length": len(imgdata) }, imgdata) - text = "q\n%f 0 0 %f 0 0 cm\n/Im0 Do\nQ"%(pdf_x, pdf_y) + if color == 'CMYK;I': + # Inverts all four channels + image.content[b'/Decode'] = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0] + + text = ("q\n%0.4f 0 0 %0.4f 0 0 cm\n/Im0 Do\nQ"%(pdf_x, pdf_y)).encode() content = obj({ - "/Length": len(text) + b"/Length": len(text) }, text) page = obj({ - "/Type": "/Page", - "/Parent": self.pages, - "/Resources": { - "/XObject": { - "/Im0": image + b"/Type": b"/Page", + b"/Parent": self.pages, + b"/Resources": { + b"/XObject": { + b"/Im0": image } }, - "/MediaBox": [0, 0, pdf_x, pdf_y], - "/Contents": content + b"/MediaBox": [0, 0, pdf_x, pdf_y], + b"/Contents": content }) - self.pages.content["/Kids"].append(page) - self.pages.content["/Count"] += 1 + self.pages.content[b"/Kids"].append(page) + self.pages.content[b"/Count"] += 1 self.addobj(page) self.addobj(content) self.addobj(image) @@ -190,35 +207,43 @@ class pdfdoc(object): xreftable = list() - result = "%%PDF-1.%d\n"%self.version + result = ("%%PDF-1.%d\n"%self.version).encode() - xreftable.append("0000000000 65535 f \n") + xreftable.append(b"0000000000 65535 f \n") for o in self.objects: - xreftable.append("%010d 00000 n \n"%len(result)) + xreftable.append(("%010d 00000 n \n"%len(result)).encode()) result += o.tostring() xrefoffset = len(result) - result += "xref\n" - result += "0 %d\n"%len(xreftable) + result += b"xref\n" + result += ("0 %d\n"%len(xreftable)).encode() for x in xreftable: result += x - result += "trailer\n" - result += parse({"/Size": len(xreftable), "/Info": self.info, "/Root": self.catalog})+"\n" - result += "startxref\n" - result += "%d\n"%xrefoffset - result += "%%EOF\n" + result += b"trailer\n" + result 
+= parse({b"/Size": len(xreftable), b"/Info": self.info, b"/Root": self.catalog})+b"\n" + result += b"startxref\n" + result += ("%d\n"%xrefoffset).encode() + result += b"%%EOF\n" return result -def convert(images, dpi, title=None, author=None, creator=None, producer=None, - creationdate=None, moddate=None, subject=None, keywords=None, - colorspace=None, verbose=False): +def convert(images, dpi=None, pagesize=(None, None, None), title=None, + author=None, creator=None, producer=None, creationdate=None, + moddate=None, subject=None, keywords=None, colorspace=None, + nodate=False, verbose=False): + + pagesize_options = pagesize[2] pdf = pdfdoc(3, title, author, creator, producer, creationdate, - moddate, subject, keywords) + moddate, subject, keywords, nodate) - for im in images: - rawdata = im.read() - im.seek(0) + for imfilename in images: + debug_out("Reading %s"%imfilename, verbose) + try: + rawdata = imfilename.read() + except AttributeError: + with open(imfilename, "rb") as im: + rawdata = im.read() + im = cStringIO(rawdata) try: imgdata = Image.open(im) except IOError as e: @@ -229,14 +254,11 @@ def convert(images, dpi, title=None, author=None, creator=None, producer=None, exit(1) # image is jpeg2000 width, height, ics = parsejp2(rawdata) - imgformat = "JP2" + imgformat = "JPEG2000" - if dpi: - ndpi = dpi, dpi - debug_out("input dpi (forced) = %d x %d"%ndpi, verbose) - else: - ndpi = (96, 96) # TODO: read real dpi - debug_out("input dpi = %d x %d"%ndpi, verbose) + # TODO: read real dpi from input jpeg2000 image + ndpi = (default_dpi, default_dpi) + debug_out("input dpi = %d x %d" % ndpi, verbose) if colorspace: color = colorspace @@ -248,26 +270,45 @@ def convert(images, dpi, title=None, author=None, creator=None, producer=None, width, height = imgdata.size imgformat = imgdata.format - if dpi: - ndpi = dpi, dpi - debug_out("input dpi (forced) = %d x %d"%ndpi, verbose) - else: - ndpi = imgdata.info.get("dpi", (96, 96)) - debug_out("input dpi = %d x %d"%ndpi, 
verbose) + ndpi = imgdata.info.get("dpi", (default_dpi, default_dpi)) + # in python3, the returned dpi value for some tiff images will + # not be an integer but a float. To make the behaviour of + # img2pdf the same between python2 and python3, we convert that + # float into an integer by rounding. + # Search online for the 72.009 dpi problem for more info. + ndpi = (int(round(ndpi[0])),int(round(ndpi[1]))) + debug_out("input dpi = %d x %d" % ndpi, verbose) if colorspace: color = colorspace debug_out("input colorspace (forced) = %s"%(color), verbose) else: color = imgdata.mode + if color == "CMYK" and imgformat == "JPEG": + # Adobe inverts CMYK JPEGs for some reason, and others + # have followed suit as well. Some software assumes the + # JPEG is inverted if the Adobe tag (APP14) is present, while + # other software assumes all CMYK JPEGs are inverted. I don't + # have enough experience with these to know which is + # better for images currently in the wild, so I'm going + # with the first approach for now. 
+ if "adobe" in imgdata.info: + color = "CMYK;I" debug_out("input colorspace = %s"%(color), verbose) debug_out("width x height = %d x %d"%(width,height), verbose) debug_out("imgformat = %s"%imgformat, verbose) + if dpi: + ndpi = dpi, dpi + debug_out("input dpi (forced) = %d x %d" % ndpi, verbose) + elif pagesize_options: + ndpi = get_ndpi(width, height, pagesize) + debug_out("calculated dpi (based on pagesize) = %d x %d" % ndpi, verbose) + # depending on the input format, determine whether to pass the raw # image or the zlib compressed color information - if imgformat is "JPEG" or imgformat is "JP2": + if imgformat is "JPEG" or imgformat is "JPEG2000": if color == '1': error_out("jpeg can't be monochrome") exit(1) @@ -275,16 +316,61 @@ def convert(images, dpi, title=None, author=None, creator=None, producer=None, else: # because we do not support /CCITTFaxDecode if color == '1': + debug_out("Converting colorspace 1 to L", verbose) imgdata = imgdata.convert('L') color = 'L' - imgdata = zlib.compress(imgdata.tostring()) + elif color in ("RGB", "L", "CMYK", "CMYK;I"): + debug_out("Colorspace is OK: %s"%color, verbose) + else: + debug_out("Converting colorspace %s to RGB"%color, verbose) + imgdata = imgdata.convert('RGB') + color = imgdata.mode + img = imgdata.tobytes() + # the python-pil version 2.3.0-1ubuntu3 in Ubuntu does not have the close() method + try: + imgdata.close() + except AttributeError: + pass + imgdata = zlib.compress(img) + im.close() - pdf.addimage(color, width, height, ndpi, imgformat, imgdata) + if pagesize_options and pagesize_options['exact'][1]: + # output size exactly to specified dimensions + # pagesize[0], pagesize[1] already checked in valid_size() + pdf_x, pdf_y = pagesize[0], pagesize[1] + else: + # output size based on dpi; point = 1/72 inch + pdf_x, pdf_y = 72.0*width/float(ndpi[0]), 72.0*height/float(ndpi[1]) - im.close() + pdf.addimage(color, width, height, imgformat, imgdata, pdf_x, pdf_y) return pdf.tostring() +def get_ndpi(width, 
height, pagesize): + pagesize_options = pagesize[2] + + if pagesize_options and pagesize_options['fill'][1]: + if float(width)/height < pagesize[0]/pagesize[1]: + tmp_dpi = 72.0*width/pagesize[0] + else: + tmp_dpi = 72.0*height/pagesize[1] + elif pagesize[0] and pagesize[1]: + # if both height and width given with no specific pagesize_option, + # resize to fit "into" page + if float(width)/height < pagesize[0]/pagesize[1]: + tmp_dpi = 72.0*height/pagesize[1] + else: + tmp_dpi = 72.0*width/pagesize[0] + elif pagesize[0]: + # if width given, calculate dpi based on width + tmp_dpi = 72.0*width/pagesize[0] + elif pagesize[1]: + # if height given, calculate dpi based on height + tmp_dpi = 72.0*height/pagesize[1] + else: + tmp_dpi = default_dpi + + return tmp_dpi, tmp_dpi def positive_float(string): value = float(string) @@ -294,58 +380,276 @@ def positive_float(string): return value def valid_date(string): - return datetime.strptime(string, "%Y-%m-%dT%H:%M:%S") + # first try parsing in ISO8601 format + try: + return datetime.strptime(string, "%Y-%m-%d") + except ValueError: + pass + try: + return datetime.strptime(string, "%Y-%m-%dT%H:%M") + except ValueError: + pass + try: + return datetime.strptime(string, "%Y-%m-%dT%H:%M:%S") + except ValueError: + pass + # then try dateutil + try: + from dateutil import parser + except ImportError: + pass + else: + try: + return parser.parse(string) + except (TypeError, ValueError): + pass + # as a last resort, try the local date utility + try: + import subprocess + except ImportError: + pass + else: + try: + utime = subprocess.check_output(["date", "--date", string, "+%s"]) + except subprocess.CalledProcessError: + pass + else: + return datetime.utcfromtimestamp(int(utime)) + raise argparse.ArgumentTypeError("cannot parse date: %s"%string) + +def get_standard_papersize(string): + papersizes = { + "11x17" : "792x792^", # "792x1224", + "ledger" : "792x792^", # "1224x792", + "legal" : "612x612^", # "612x1008", + "letter" : "612x612^", 
# "612x792", + "arche" : 
"2592x2592^", # "2592x3456", + "archd" : "1728x1728^", # "1728x2592", + "archc" : "1296x1296^", # "1296x1728", + "archb" : "864x864^", # "864x1296", + "archa" : "648x648^", # "648x864", + "a0" : "2380x2380^", # "2380x3368", + "a1" : "1684x1684^", # "1684x2380", + "a2" : "1190x1190^", # "1190x1684", + "a3" : "842x842^", # "842x1190", + "a4" : "595x595^", # "595x842", + "a5" : "421x421^", # "421x595", + "a6" : "297x297^", # "297x421", + "a7" : "210x210^", # "210x297", + "a8" : "148x148^", # "148x210", + "a9" : "105x105^", # "105x148", + "a10" : "74x74^", # "74x105", + "b0" : "2836x2836^", # "2836x4008", + "b1" : "2004x2004^", # "2004x2836", + "b2" : "1418x1418^", # "1418x2004", + "b3" : "1002x1002^", # "1002x1418", + "b4" : "709x709^", # "709x1002", + "b5" : "501x501^", # "501x709", + "c0" : "2600x2600^", # "2600x3677", + "c1" : "1837x1837^", # "1837x2600", + "c2" : "1298x1298^", # "1298x1837", + "c3" : "918x918^", # "918x1298", + "c4" : "649x649^", # "649x918", + "c5" : "459x459^", # "459x649", + "c6" : "323x323^", # "323x459", + "flsa" : "612x612^", # "612x936", + "flse" : "612x612^", # "612x936", + "halfletter" : "396x396^", # "396x612", + "tabloid" : "792x792^", # "792x1224", + "statement" : "396x396^", # "396x612", + "executive" : "540x540^", # "540x720", + "folio" : "612x612^", # "612x936", + "quarto" : "610x610^", # "610x780" + } + + string = string.lower() + return papersizes.get(string, string) + +def valid_size(string): + # conversion factors from units to points + units = { + 'in' : 72.0, + 'cm' : 72.0/2.54, + 'mm' : 72.0/25.4, + 'pt' : 1.0 + } + + pagesize_options = { + 'exact' : ['\!', False], + 'shrink' : ['\>', False], + 'enlarge' : ['\<', False], + 'fill' : ['\^', False], + 'percent' : ['\%', False], + 'count' : ['\@', False], + } + + string = get_standard_papersize(string) + + pattern = re.compile(r""" + ([0-9]*\.?[0-9]*) # tokens.group(1) == width; may be empty + ([a-z]*) # tokens.group(2) == units; may be empty + x + ([0-9]*\.?[0-9]*) # 
tokens.group(3) == height; may be empty + ([a-zA-Z]*) # tokens.group(4) == units; may be empty + ([^0-9a-zA-Z]*) # tokens.group(5) == extra options + """, re.VERBOSE) + + tokens = pattern.match(string) + + # tokens.group(0) should match entire input string + if tokens.group(0) != string: + msg = ('Input size needs to be of the format AuxBv#, ' + 'where A is width, B is height, u and v are units, ' + '# are options. ' + 'You may omit either width or height, but not both. ' + 'Units may be specified as (in, cm, mm, pt). ' + 'You may omit units, which will default to pt. ' + 'Available options include (! = exact ; ^ = fill ; default = into).') + raise argparse.ArgumentTypeError(msg) + + # temporary list to loop through to process width and height + pagesize_size = { + 'x' : [0, tokens.group(1), tokens.group(2)], + 'y' : [0, tokens.group(3), tokens.group(4)] + } + + for key, value in pagesize_size.items(): + try: + value[0] = float(value[1]) + value[0] *= units[value[2]] # convert to points + except ValueError: + # assign None if width or height not provided + value[0] = None + except KeyError: + # if units unrecognized, raise error + # otherwise default to pt because units not provided + if value[2]: + msg = "unrecognized unit '%s'." 
% value[2] + raise argparse.ArgumentTypeError(msg) + + x = pagesize_size['x'][0] + y = pagesize_size['y'][0] + + # parse options for resize methods + if tokens.group(5): + for key, value in pagesize_options.items(): + if re.search(value[0], tokens.group(5)): + value[1] = True + + if pagesize_options['fill'][1]: + # if either width or height is not given, try to fill in missing value + if not x: + x = y + elif not y: + y = x + + if pagesize_options['exact'][1]: + if not x or not y: + msg = ('exact size requires both width and height.') + raise argparse.ArgumentTypeError(msg) + + if not x and not y: + msg = ('width and height cannot both be omitted.') + raise argparse.ArgumentTypeError(msg) + + return (x, y, pagesize_options) + +# in python3, the received argument will be a unicode str() object which needs +# to be encoded into a bytes() object +# in python2, the received argument will be a binary str() object which needs +# no encoding +# we check whether we use python2 or python3 by checking whether the argument +# is both, type str and type bytes (only the case in python2) +def pdf_embedded_string(string): + if type(string) is str and type(string) is not bytes: + # py3 + pass + else: + # py2 + string = string.decode("utf8") + string = b"\xfe\xff"+string.encode("utf-16-be") + string = string.replace(b'\\', b'\\\\') + string = string.replace(b'(', b'\\(') + string = string.replace(b')', b'\\)') + return string parser = argparse.ArgumentParser( description='Lossless conversion/embedding of images (in)to pdf') parser.add_argument( - 'images', metavar='infile', type=argparse.FileType('rb'), + 'images', metavar='infile', type=str, nargs='+', help='input file(s)') parser.add_argument( '-o', '--output', metavar='out', type=argparse.FileType('wb'), - default=sys.stdout, help='output file (default: stdout)') -parser.add_argument( + default=getattr(sys.stdout, "buffer", sys.stdout), + help='output file (default: stdout)') + +sizeopts = parser.add_mutually_exclusive_group() 
+sizeopts.add_argument( '-d', '--dpi', metavar='dpi', type=positive_float, - help='dpi for pdf output (default: 96.0)') + help=('dpi for pdf output. ' + 'If input image does not specify dpi the default is %.2f. ' + 'Must not be used with -s/--pagesize.') % default_dpi +) + +sizeopts.add_argument( + '-s', '--pagesize', metavar='size', type=valid_size, + default=(None, None, None), + help=('size of the pdf pages in format AuxBv#, ' + 'where A is width, B is height, u and v are units, # are options. ' + 'You may omit either width or height, but not both. ' + 'Some common page sizes, such as letter and a4, are also recognized. ' + 'Units may be specified as (in, cm, mm, pt). ' + 'Units default to pt when absent. ' + 'Available options include (! = exact ; ^ = fill ; default = into). ' + 'Must not be used with -d/--dpi.') +) + parser.add_argument( - '-t', '--title', metavar='title', type=str, + '-t', '--title', metavar='title', type=pdf_embedded_string, help='title for metadata') parser.add_argument( - '-a', '--author', metavar='author', type=str, + '-a', '--author', metavar='author', type=pdf_embedded_string, help='author for metadata') parser.add_argument( - '-c', '--creator', metavar='creator', type=str, + '-c', '--creator', metavar='creator', type=pdf_embedded_string, help='creator for metadata') parser.add_argument( - '-p', '--producer', metavar='producer', type=str, + '-p', '--producer', metavar='producer', type=pdf_embedded_string, help='producer for metadata') parser.add_argument( '-r', '--creationdate', metavar='creationdate', type=valid_date, - help='creation date for metadata in YYYY-MM-DDTHH:MM:SS format') + help='UTC creation date for metadata in YYYY-MM-DD or YYYY-MM-DDTHH:MM or YYYY-MM-DDTHH:MM:SS format or any format understood by python dateutil module or any format understood by `date --date`') parser.add_argument( '-m', '--moddate', metavar='moddate', type=valid_date, - help='modification date for metadata in YYYY-MM-DDTHH:MM:SS format') + help='UTC 
modification date for metadata in YYYY-MM-DD or YYYY-MM-DDTHH:MM or YYYY-MM-DDTHH:MM:SS format or any format understood by python dateutil module or any format understood by `date --date`') parser.add_argument( - '-s', '--subject', metavar='subject', type=str, + '-S', '--subject', metavar='subject', type=pdf_embedded_string, help='subject for metadata') parser.add_argument( - '-k', '--keywords', metavar='kw', type=str, nargs='+', + '-k', '--keywords', metavar='kw', type=pdf_embedded_string, nargs='+', help='keywords for metadata') parser.add_argument( - '-C', '--colorspace', metavar='colorspace', type=str, - help='force PIL colorspace (one of: RGB, L, 1)') + '-C', '--colorspace', metavar='colorspace', type=pdf_embedded_string, + help='force PIL colorspace (one of: RGB, L, 1, CMYK, CMYK;I)') +parser.add_argument( + '-D', '--nodate', help='do not add timestamps', action="store_true") parser.add_argument( '-v', '--verbose', help='verbose mode', action="store_true") +parser.add_argument( + '-V', '--version', action='version', version='%(prog)s '+__version__, + help="Print version information and exit") def main(args=None): if args is None: args = sys.argv[1:] args = parser.parse_args(args) + args.output.write( convert( - args.images, args.dpi, args.title, args.author, + args.images, args.dpi, args.pagesize, args.title, args.author, args.creator, args.producer, args.creationdate, args.moddate, - args.subject, args.keywords, args.colorspace, args.verbose)) + args.subject, args.keywords, args.colorspace, args.nodate, + args.verbose)) if __name__ == '__main__': main() diff --git a/src/jp2.py b/src/jp2.py index addfe5d..c897e5f 100644 --- a/src/jp2.py +++ b/src/jp2.py @@ -85,6 +85,6 @@ def parsejp2(data): if __name__ == "__main__": import sys width, height, colorspace = parsejp2(open(sys.argv[1]).read()) - print "width = %d"%width - print "height = %d"%height - print "colorspace = %s"%colorspace + sys.stdout.write("width = %d"%width) + sys.stdout.write("height = %d"%height) 
+ sys.stdout.write("colorspace = %s"%colorspace) diff --git a/src/tests/__init__.py b/src/tests/__init__.py index 8fd6866..15c9328 100644 --- a/src/tests/__init__.py +++ b/src/tests/__init__.py @@ -1,7 +1,109 @@ import unittest -import test_img2pdf + +import os +import img2pdf +import zlib +from PIL import Image + +HERE = os.path.dirname(__file__) + +#convert +set date:create +set date:modify -define png:exclude-chunk=time def test_suite(): + class TestImg2Pdf(unittest.TestCase): + pass + + for test_name in os.listdir(os.path.join(HERE, "input")): + inputf = os.path.join(HERE, "input", test_name) + if not os.path.isfile(inputf): + continue + outputf = os.path.join(HERE, "output", test_name+".pdf") + assert os.path.isfile(outputf) + def handle(self, f=inputf, out=outputf): + with open(f, "rb") as inf: + orig_imgdata = inf.read() + pdf = img2pdf.convert([f], nodate=True) + imgdata = b"" + instream = False + imgobj = False + colorspace = None + imgfilter = None + width = None + height = None + length = None + # ugly workaround to parse the created pdf + for line in pdf.split(b'\n'): + if instream: + if line == b"endstream": + break + else: + imgdata += line + b'\n' + else: + if imgobj and line == b"stream": + instream = True + elif b"/Subtype /Image" in line: + imgobj = True + elif b"/Width" in line: + width = int(line.split()[-1]) + elif b"/Height" in line: + height = int(line.split()[-1]) + elif b"/Length" in line: + length = int(line.split()[-1]) + elif b"/Filter" in line: + imgfilter = line.split()[-2] + elif b"/ColorSpace" in line: + colorspace = line.split()[-1] + # remove trailing \n + imgdata = imgdata[:-1] + # test if the length field is correct + self.assertEqual(len(imgdata), length) + # test if the filter is valid: + self.assertIn(imgfilter, [b"/DCTDecode", b"/JPXDecode", b"/FlateDecode"]) + # test if the colorspace is valid + self.assertIn(colorspace, [b"/DeviceGray", b"/DeviceRGB", b"/DeviceCMYK"]) + # test if the image has correct size + orig_img = 
Image.open(f) + self.assertEqual(width, orig_img.size[0]) + self.assertEqual(height, orig_img.size[1]) + # if the input file is a jpeg then it should've been copied + # verbatim into the PDF + if imgfilter in [b"/DCTDecode", b"/JPXDecode"]: + self.assertEqual(imgdata, orig_imgdata) + elif imgfilter == b"/FlateDecode": + # otherwise, the data is flate encoded and has to be equal to + # the pixel data of the input image + imgdata = zlib.decompress(imgdata) + if colorspace == b"/DeviceGray": + colorspace = 'L' + elif colorspace == b"/DeviceRGB": + colorspace = 'RGB' + elif colorspace == b"/DeviceCMYK": + colorspace = 'CMYK' + else: + raise Exception("invalid colorspace") + im = Image.frombytes(colorspace, (width, height), imgdata) + if orig_img.mode == '1': + orig_img = orig_img.convert("L") + elif orig_img.mode not in ("RGB", "L", "CMYK", "CMYK;I"): + orig_img = orig_img.convert("RGB") + self.assertEqual(im.tobytes(), orig_img.tobytes()) + # the python-pil version 2.3.0-1ubuntu3 in Ubuntu does not have the close() method + try: + im.close() + except AttributeError: + pass + # lastly, make sure that the generated pdf matches bit by bit the + # expected pdf + with open(out, "rb") as outf: + out = outf.read() + self.assertEqual(pdf, out) + # the python-pil version 2.3.0-1ubuntu3 in Ubuntu does not have the close() method + try: + orig_img.close() + except AttributeError: + pass + setattr(TestImg2Pdf, "test_%s"%test_name, handle) + return unittest.TestSuite(( - unittest.makeSuite(test_img2pdf.TestImg2Pdf), + unittest.makeSuite(TestImg2Pdf), )) diff --git a/src/tests/input/CMYK.jpg b/src/tests/input/CMYK.jpg new file mode 100644 index 0000000..44213a8 Binary files /dev/null and b/src/tests/input/CMYK.jpg differ diff --git a/src/tests/input/CMYK.tif b/src/tests/input/CMYK.tif new file mode 100644 index 0000000..8e3803e Binary files /dev/null and b/src/tests/input/CMYK.tif differ diff --git a/src/tests/test.jpg b/src/tests/input/normal.jpg similarity index 100% rename from 
src/tests/test.jpg rename to src/tests/input/normal.jpg diff --git a/src/tests/test.png b/src/tests/input/normal.png similarity index 100% rename from src/tests/test.png rename to src/tests/input/normal.png diff --git a/src/tests/output/CMYK.jpg.pdf b/src/tests/output/CMYK.jpg.pdf new file mode 100644 index 0000000..2a00022 Binary files /dev/null and b/src/tests/output/CMYK.jpg.pdf differ diff --git a/src/tests/output/CMYK.tif.pdf b/src/tests/output/CMYK.tif.pdf new file mode 100644 index 0000000..54c0b4e Binary files /dev/null and b/src/tests/output/CMYK.tif.pdf differ diff --git a/src/tests/test.pdf b/src/tests/output/normal.jpg.pdf similarity index 93% rename from src/tests/test.pdf rename to src/tests/output/normal.jpg.pdf index c3a3154..1b891a0 100644 Binary files a/src/tests/test.pdf and b/src/tests/output/normal.jpg.pdf differ diff --git a/src/tests/output/normal.png.pdf b/src/tests/output/normal.png.pdf new file mode 100644 index 0000000..5538634 Binary files /dev/null and b/src/tests/output/normal.png.pdf differ diff --git a/src/tests/test_img2pdf.py b/src/tests/test_img2pdf.py deleted file mode 100644 index 82bc316..0000000 --- a/src/tests/test_img2pdf.py +++ /dev/null @@ -1,20 +0,0 @@ -import datetime -import os -import unittest -import img2pdf - -HERE = os.path.dirname(__file__) -moddate = datetime.datetime(2014, 1, 1) - -class TestImg2Pdf(unittest.TestCase): - def test_jpg2pdf(self): - with open(os.path.join(HERE, 'test.jpg'), 'r') as img_fp: - with open(os.path.join(HERE, 'test.pdf'), 'r') as pdf_fp: - self.assertEqual( - img2pdf.convert([img_fp], 150, - creationdate=moddate, moddate=moddate), - pdf_fp.read()) - - def test_png2pdf(self): - with open(os.path.join(HERE, 'test.png'), 'r') as img_fp: - self.assertRaises(SystemExit, img2pdf.convert, [img_fp], 150)
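Note on the metadata change in this diff: `pdf_embedded_string` turns each metadata argument into a UTF-16BE byte string with a byte order mark and then backslash-escapes `\`, `(` and `)`. The sketch below restates that step as a standalone function so it can be tried outside of argparse. The names `pdf_text_string` and `decode_pdf_text_string` are hypothetical and not part of img2pdf, and the decoder assumes its input was produced by the encoder shown here (it handles only those three escapes).

```python
def pdf_text_string(text):
    """Encode text the way pdf_embedded_string in the diff does.

    Hypothetical standalone restatement: UTF-16BE with a leading BOM,
    then backslash-escaping of the bytes for backslash and parentheses.
    """
    if isinstance(text, bytes):
        # Python 2 str (or raw bytes): assume UTF-8 input, as the diff does
        text = text.decode("utf8")
    raw = b"\xfe\xff" + text.encode("utf-16-be")
    # Escape the backslash first so the escapes added below are not doubled
    raw = raw.replace(b"\\", b"\\\\")
    raw = raw.replace(b"(", b"\\(")
    raw = raw.replace(b")", b"\\)")
    return raw


def decode_pdf_text_string(raw):
    """Inverse of pdf_text_string, for round-trip checks only (assumption:
    the input contains no escapes other than the three produced above)."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i:i + 1] == b"\\":
            i += 1  # skip the escape byte, keep the byte that follows
        out += raw[i:i + 1]
        i += 1
    # Drop the two BOM bytes before decoding
    return bytes(out)[2:].decode("utf-16-be")


if __name__ == "__main__":
    title = u"Jos\u00e9 (draft) \\ v2"
    encoded = pdf_text_string(title)
    assert decode_pdf_text_string(encoded) == title
```

The escaping runs on the encoded bytes rather than on the characters, so a byte such as 0x28 is escaped even when it is one half of a different UTF-16 code unit (for example U+0128). That matches how the function in the diff behaves, since it also calls `replace` after `encode`.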