Merge branch 'master' of http://gitlab.mister-muffin.de/josch/img2pdf

# Conflicts:
#	src/img2pdf.py
2016-01-16 03:05:31 -08:00
commit 1261741136
parent 50e22888ca fdee171d40
16 changed files with 711 additions and 217 deletions
--- a/CHANGES.rst
+++ b/CHANGES.rst
@@ -2,7 +2,43 @@
 CHANGES
 =======

-1.0.0 (unreleased)
+0.1.6
+-----
+
+ - replace -x and -y option by combined option -s (or --pagesize) and use -S
+   for --subject
+ - correctly encode and escape non-ascii metadata
+ - explicitly store date in UTC and allow parsing all date formats understood
+   by dateutil and `date --date`
+
+0.1.5
+-----
+
+- Enable support for CMYK images
+- Rework test suite
+- support file objects as input
+
+0.1.4
+-----
+
+- add Python 3 support
+- make output reproducible by sorting and --nodate option
+
+0.1.3
+-----
+
+- Avoid leaking file descriptors
+- Convert unrecognized colorspaces to RGB
+
+0.1.1
+-----
+
+- allow running src/img2pdf.py standalone
+- license change from GPL to LGPL
+- Add pillow 2.4.0 support
+- add options to specify pdf dimensions in points
+
+0.1.0 (unreleased)
 ------------------

 - Initial PyPI release.
--- a/README.md
+++ b/README.md
@@ -1,15 +1,16 @@
 img2pdf
 =======

-Lossless conversion of images to PDF without unnecessarily re-encoding JPEG and
-JPEG2000 files. Thus, no loss of quality and no unnecessary large output file.
+Losslessly convert images to PDF without unnecessarily re-encoding JPEG and
+JPEG2000 files.  Image quality is retained without unnecessarily increasing
+file size.

 Background
 ----------

-PDF is able to embed JPEG and JPEG2000 images as they are without re-encoding
-them (and hence loosing quality) but I was missing a tool to do this
-automatically, thus I wrote this piece of python code.
+Quality loss can be avoided when converting JPEG and JPEG2000 images to
+PDF by embedding them without re-encoding.  I wrote this piece of Python code
+because I was missing a tool to do this automatically.

 If you know how to embed JPEG and JPEG2000 images into a PDF container without
 recompression, using existing tools, please contact me so that I can put this
@@ -18,100 +19,160 @@ code into the garbage bin :D
 Functionality
 -------------

-The program will take image filenames from commandline arguments and output a
-PDF file with them embedded into it. If the input image is a JPEG or JPEG2000
-file, it will be included as-is without any processing. If it is in any other
-format, the image will be included as zip-encoded RGB. As a result, this tool
-will be able to lossless wrap any image into a PDF container while performing
-better (in terms of quality/filesize ratio) than existing tools in case the
-input image is a JPEG or JPEG2000 file.
+This program will take a list of images and produce a PDF file with the images
+embedded in it.  JPEG and JPEG2000 images will be included without
+recompression.  Images in other formats will be included with zip/flate
+encoding, which usually increases the resulting size because formats like
+PNG compress better than the plain zip/flate compression that PDF applies to
+the RGB data.  As a result, this tool is able to losslessly wrap images into a
+PDF container with a quality-filesize ratio that is typically better than (in
+case of JPEG and JPEG2000 images) or equal to (in case of other formats) that
+of existing tools.

-For example, imagemagick will re-encode the input JPEG image and thus change
-its content:
+For example, imagemagick will re-encode the input JPEG image (thus changing
+its content):

 	$ convert img.jpg img.pdf
 	$ pdfimages img.pdf img.extr # not using -j to be extra sure there is no recompression
 	$ compare -metric AE img.jpg img.extr-000.ppm null:
 	1.6301e+06

-If one wants to do a lossless conversion from any format to PDF with
-imagemagick then one has to use zip-encoding:
+If one wants to losslessly convert from any format to PDF with
+imagemagick, one has to use zip compression:

 	$ convert input.jpg -compress Zip output.pdf
 	$ pdfimages img.pdf img.extr # not using -j to be extra sure there is no recompression
 	$ compare -metric AE img.jpg img.extr-000.ppm null:
 	0

-The downside is, that using imagemagick like this will make the resulting PDF
-files a few times bigger than the input JPEG or JPEG2000 file and can also not
-output a multipage PDF.
+However, this approach will result in PDF files that are a few times larger
+than the input JPEG or JPEG2000 file.

-img2pdf is able to output a PDF with multiple pages if more than one input
-image is given, losslessly embed JPEG and JPEG2000 files into a PDF container
-without adding more overhead than the PDF structure itself and will save all
-other graphics formats using lossless zip-compression.
+img2pdf is able to losslessly embed JPEG and JPEG2000 files into a PDF
+container without additional overhead (aside from the PDF structure itself),
+save other graphics formats using lossless zip compression,
+and produce multi-page PDF files when more than one input image is given.

-Another nifty advantage: Since no re-encoding is done in case of JPEG images,
-the conversion is many (ten to hundred) times faster with img2pdf compared to
-imagemagick. While a run of above convert command with a 2.8MB JPEG takes 27
-seconds (on average) on my machine, conversion using img2pdf takes just a
-fraction of a second.
+Also, since JPEG and JPEG2000 images are not re-encoded, conversion with
+img2pdf is several times faster than with other tools.

-Commandline Arguments
---------------------

-At least one input file argument must be given as img2pdf needs to seek in the
-file descriptor which would not be possible for stdin.
+Usage
+-----

-Specify the dpi with the -d or --dpi options instead of reading it from the
-image or falling back to 96.0.
+#### General Notes

-Specify the output file with -o or --output. By default output will be done to
-stdout.
+The images must be provided as files because img2pdf needs to seek
+in the file descriptor.  Input cannot be piped through stdin.

-Specify metadata using the --title, --author, --creator, --producer,
--creationdate, --moddate, --subject and --keywords options (or their short
-forms).
+If no output file is specified with the `-o`/`--output` option,
+output will be to stdout.

-Specify -C or --colorspace to force a colorspace using PIL short handles like
-'RGB', 'L' or '1'.
+Descriptions of the options should be self-explanatory.
+They are available by running:

-More help is available with the -h or --help option.
+	img2pdf --help
+
+
+#### Controlling Page Size
+
+The PDF page size can be manipulated.  By default, the image will be sized "into" the given dimensions with the aspect ratio retained.  For instance, to size an image into a page that is at most 500pt x 500pt, use:
+
+	img2pdf -s 500x500 -o output.pdf input.jpg
+
+To "fill" out a page that is at least 500pt x 500pt, follow the dimensions with a `^`:
+
+	img2pdf -s 500x500^ -o output.pdf input.jpg
+
+To output pages that are exactly 500pt x 500pt, follow the dimensions with an `!`:
+
+	img2pdf -s 500x500\! -o output.pdf input.jpg
+
+Notice that the default unit is points.  Units may also be specified and mixed:
+
+	img2pdf -s 8.5inx27.94cm -o output.pdf input.jpg
+
+If either width or height is omitted, the other will be calculated
+to preserve aspect ratio.
+
+	img2pdf -s x280mm -o output1.pdf input.jpg
+	img2pdf -s 280mmx -o output2.pdf input.jpg
+
+Some standard page sizes are recognized:
+
+	img2pdf -s letter -o output1.pdf input.jpg
+	img2pdf -s a4 -o output2.pdf input.jpg
+
+#### Colorspace
+
+Currently, the colorspace must be forced for JPEG2000 images that are
+not in the RGB colorspace.  Available colorspace options are based on
+Python Imaging Library (PIL) short handles.
+
+ * `RGB` = RGB color
+ * `L` = Grayscale
+ * `1` = Black and white (internally converted to grayscale)
+ * `CMYK` = CMYK color
+ * `CMYK;I` = CMYK color with inversion
+
+For example, to encode a grayscale JPEG2000 image, use:
+
+	img2pdf -C L -o output.pdf input.jp2

 Bugs
 ----

-If you find a JPEG or JPEG2000 file that, when embedded can not be read by the
-Adobe Acrobat Reader, please contact me.
+If you find a JPEG or JPEG2000 file that, when embedded, cannot be read
+by the Adobe Acrobat Reader, please contact me.

-For lossless conversion of other formats than JPEG or JPEG2000 files, zip/flate
-encoding is used.  This choice is based on a number of tests I did on images.
-I converted them into PDF using imagemagick and all compressions it has to
-offer and then compared the output size of the lossless variants. In all my
-tests, zip/flate encoding performed best. You can verify my findings using the
-test_comp.sh script with any input image given as a commandline argument. If
-you find an input file that is outperformed by another lossless compression,
-contact me.
+For lossless conversion of formats other than JPEG or JPEG2000, zip/flate
+encoding is used.  This choice is based on tests I did with a number of images.
+I converted them into PDF using the lossless variants of the compression
+formats offered by imagemagick.  In all my tests, zip/flate encoding performed
+best.  You can verify my findings using the test_comp.sh script with any input
+image given as a commandline argument.  If you find an input file that is
+outperformed by another lossless compression method, contact me.

-I have not yet figured out how to read the colorspace from jpeg2000 files.
-Therefor jpeg2000 files use DeviceRGB per default. If your jpeg2000 files are
-of any other colorspace you must force it using the --colorspace option.
-Like -C L for DeviceGray.
+I have not yet figured out how to determine the colorspace of JPEG2000 files.
+Therefore JPEG2000 files use DeviceRGB by default. For JPEG2000 files with
+other colorspaces, you must force it using the `--colorspace` option.
+
+It might be possible to store transparency using masks but it is not clear
+what the utility of such a functionality would be.
+
+Most vector graphic formats can be losslessly turned into PDF (minus some of
+the features unsupported by PDF) but img2pdf will currently turn vector
+graphics into their lossy raster representations.
+
+Acrobat is able to store a hint for the PDF reader of how to present the PDF
+when opening it. Things like automatic fullscreen or the zoom level can be
+configured.
+
+It would be nice if a single input image could be read from standard input.

 Installation
 ------------

-You can install the package using:
+On Debian- and Ubuntu-based systems, dependencies may be installed
+with the following command:
+
+	apt-get install python python-pil python-setuptools
+
+Or for Python 3:
+
+	apt-get install python3 python3-pil python3-setuptools
+
+You can then install the package using:

 	$ pip install img2pdf

-If you want to install from source code simply use:
+If you prefer to install from source code, use:

 	$ cd img2pdf/
 	$ pip install .

 To test the console script without installing the package on your system,
-simply use virtualenv:
+use virtualenv:

 	$ cd img2pdf/
 	$ virtualenv ve
@@ -121,7 +182,10 @@ You can then test the converter using:

 	$ ve/bin/img2pdf -o test.pdf src/tests/test.jpg

-Note that the package can also be used as a library as follows:
+The package can also be used as a library:

 	import img2pdf
-  pdf_bytes = img2pdf('test.jpg', dpi=150)
+	pdf_bytes = img2pdf.convert(['test.jpg'])
+
+	with open("name.pdf", "wb") as f:
+	    f.write(pdf_bytes)
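
The `-s`/`--pagesize` syntax described in the README changes above (points by default, optional units, a trailing `^` to fill and `!` for an exact size, either dimension optional) can be illustrated with a small parser. This is only a minimal sketch under stated assumptions, not the `valid_size` function from `src/img2pdf.py`: the name `parse_pagesize`, the regular expression and the returned mode strings are hypothetical, and only the `pt`, `in`, `cm` and `mm` units are handled.

```python
import re

# Conversion factors from each unit to PDF points (1 pt = 1/72 inch).
POINTS_PER_UNIT = {"pt": 1.0, "in": 72.0, "cm": 72.0 / 2.54, "mm": 72.0 / 25.4}

# Hypothetical grammar: [width[unit]] "x" [height[unit]] ["^" | "!"]
_SIZE_RE = re.compile(
    r"^(?:(?P<w>\d+(?:\.\d+)?)(?P<wu>pt|in|cm|mm)?)?"
    r"x"
    r"(?:(?P<h>\d+(?:\.\d+)?)(?P<hu>pt|in|cm|mm)?)?"
    r"(?P<flag>[\^!])?$"
)


def parse_pagesize(text):
    """Return (width_pt, height_pt, mode); an omitted dimension is None."""
    match = _SIZE_RE.match(text.strip().lower())
    if match is None:
        raise ValueError("cannot parse page size: %r" % text)

    def to_points(number, unit):
        # A missing number means the dimension was omitted on purpose.
        if number is None:
            return None
        return float(number) * POINTS_PER_UNIT[unit or "pt"]

    width = to_points(match.group("w"), match.group("wu"))
    height = to_points(match.group("h"), match.group("hu"))
    if width is None and height is None:
        raise ValueError("at least one dimension is required: %r" % text)

    # "^" fills the page, "!" forces the exact size, no suffix fits "into" it.
    mode = {"^": "fill", "!": "exact", None: "into"}[match.group("flag")]
    return width, height, mode
```

With this sketch, `parse_pagesize("8.5inx27.94cm")` yields roughly letter size in points, and `parse_pagesize("x280mm")` leaves the width as `None` so that it can later be derived from the aspect ratio.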
--- a/setup.cfg
+++ b/setup.cfg
@@ -0,0 +1,2 @@
+[metadata]
+description-file = README.md
--- a/setup.py
+++ b/setup.py
@@ -1,9 +1,12 @@
 from setuptools import setup

+VERSION="0.1.6~git"
+
 setup (
    name='img2pdf',
-    version='0.1.0',
+    version=VERSION,
    author = "Johannes 'josch' Schauer",
+    author_email = 'j.schauer@email.de',
    description = "Convert images to PDF via direct JPEG inclusion.",
    long_description = open('README.md').read(),
    license = "LGPL",
@@ -15,12 +18,15 @@ setup (
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.6',
        'Programming Language :: Python :: 2.7',
+        'Programming Language :: Python :: 3',
+        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: Implementation :: CPython',
        'License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)',
        'Programming Language :: Python',
        'Natural Language :: English',
        'Operating System :: OS Independent'],
-    url = 'http://pypi.python.org/pypi/img2pdf',
+    url = 'https://github.com/josch/img2pdf',
+    download_url = 'https://github.com/josch/img2pdf/archive/'+VERSION+'.tar.gz',
    package_dir={"": "src"},
    py_modules=['img2pdf', 'jp2'],
    include_package_data = True,
--- a/src/img2pdf.py
+++ b/src/img2pdf.py
@@ -1,3 +1,5 @@
+#!/usr/bin/env python2
+
 # Copyright (C) 2012-2014 Johannes 'josch' Schauer <j.schauer at email.de>
 #
 # This program is free software: you can redistribute it and/or
@@ -15,13 +17,20 @@
 # License along with this program.  If not, see
 # <http://www.gnu.org/licenses/>.

+__version__ = "0.1.6~git"
+default_dpi = 96.0
+
+import re
 import sys
 import zlib
 import argparse
-import struct
 from PIL import Image
 from datetime import datetime
 from jp2 import parsejp2
+try:
+    from cStringIO import StringIO as cStringIO
+except ImportError:
+    from io import BytesIO as cStringIO

 # XXX: Switch to use logging module.
 def debug_out(message, verbose=True):
@@ -34,19 +43,28 @@ def error_out(message):
 def warning_out(message):
    sys.stderr.write("W: "+message+"\n")

+def datetime_to_pdfdate(dt):
+    return dt.strftime("%Y%m%d%H%M%SZ")
+
 def parse(cont, indent=1):
    if type(cont) is dict:
-        return "<<\n"+"\n".join(
-            [4 * indent * " " + "%s %s" % (k, parse(v, indent+1))
-             for k, v in cont.items()])+"\n"+4*(indent-1)*" "+">>"
-    elif type(cont) is int or type(cont) is float:
-        return str(cont)
+        return b"<<\n"+b"\n".join(
+            [4 * indent * b" " + k + b" " + parse(v, indent+1)
+             for k, v in sorted(cont.items())])+b"\n"+4*(indent-1)*b" "+b">>"
+    elif type(cont) is int:
+        return str(cont).encode()
+    elif type(cont) is float:
+        return ("%0.4f"%cont).encode()
    elif isinstance(cont, obj):
-        return "%d 0 R"%cont.identifier
-    elif type(cont) is str:
+        return ("%d 0 R"%cont.identifier).encode()
+    elif type(cont) is str or type(cont) is bytes:
+        if type(cont) is str and type(cont) is not bytes:
+            raise Exception("parse must be passed a bytes object in py3")
        return cont
    elif type(cont) is list:
-        return "[ "+" ".join([parse(c, indent) for c in cont])+" ]"
+        return b"[ "+b" ".join([parse(c, indent) for c in cont])+b" ]"
+    else:
+        raise Exception("cannot handle type %s"%type(cont))

 class obj(object):
    def __init__(self, content, stream=None):
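
The byte-oriented `parse` function in the hunk above turns nested Python values into PDF object syntax. A standalone sketch of the same idea may make the output format easier to see. It assumes dictionary keys and strings are already `bytes`, sorts keys for reproducible output as the new code does, and leaves out the indirect-reference case (`obj`); the name `serialize` is hypothetical.

```python
def serialize(value, indent=1):
    """Serialize nested Python values into PDF object syntax (bytes)."""
    if isinstance(value, dict):
        pad = 4 * indent * b" "
        # Sorting the keys makes the output independent of dict ordering.
        lines = [pad + key + b" " + serialize(val, indent + 1)
                 for key, val in sorted(value.items())]
        closing = 4 * (indent - 1) * b" " + b">>"
        return b"<<\n" + b"\n".join(lines) + b"\n" + closing
    if isinstance(value, int):
        return str(value).encode()
    if isinstance(value, float):
        # Fixed precision keeps the output identical across Python versions.
        return ("%0.4f" % value).encode()
    if isinstance(value, bytes):
        return value
    if isinstance(value, list):
        return b"[ " + b" ".join(serialize(v, indent) for v in value) + b" ]"
    raise TypeError("cannot handle type %s" % type(value))
```

For example, `serialize([0, 0, 612.0, 792.0])` produces the kind of `/MediaBox` array that ends up in a page object.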
@@ -56,56 +74,56 @@ class obj(object):
    def tostring(self):
        if self.stream:
            return (
-                "%d 0 obj " % self.identifier +
+                ("%d 0 obj " % self.identifier).encode() +
                parse(self.content) +
-                "\nstream\n" + self.stream + "\nendstream\nendobj\n")
+                b"\nstream\n" + self.stream + b"\nendstream\nendobj\n")
        else:
-            return "%d 0 obj "%self.identifier+parse(self.content)+" endobj\n"
+            return ("%d 0 obj "%self.identifier).encode()+parse(self.content)+b" endobj\n"

 class pdfdoc(object):

    def __init__(self, version=3, title=None, author=None, creator=None,
                 producer=None, creationdate=None, moddate=None, subject=None,
-                 keywords=None):
+                 keywords=None, nodate=False):
        self.version = version # default pdf version 1.3
        now = datetime.now()
        self.objects = []

        info = {}
        if title:
-            info["/Title"] = "("+title+")"
+            info[b"/Title"] = b"("+title+b")"
        if author:
-            info["/Author"] = "("+author+")"
+            info[b"/Author"] = b"("+author+b")"
        if creator:
-            info["/Creator"] = "("+creator+")"
+            info[b"/Creator"] = b"("+creator+b")"
        if producer:
-            info["/Producer"] = "("+producer+")"
+            info[b"/Producer"] = b"("+producer+b")"
        if creationdate:
-            info["/CreationDate"] = "(D:"+creationdate.strftime("%Y%m%d%H%M%S")+")"
-        else:
-            info["/CreationDate"] = "(D:"+now.strftime("%Y%m%d%H%M%S")+")"
+            info[b"/CreationDate"] = b"(D:"+datetime_to_pdfdate(creationdate).encode()+b")"
+        elif not nodate:
+            info[b"/CreationDate"] = b"(D:"+datetime_to_pdfdate(now).encode()+b")"
        if moddate:
-            info["/ModDate"] = "(D:"+moddate.strftime("%Y%m%d%H%M%S")+")"
-        else:
-            info["/ModDate"] = "(D:"+now.strftime("%Y%m%d%H%M%S")+")"
+            info[b"/ModDate"] = b"(D:"+datetime_to_pdfdate(moddate).encode()+b")"
+        elif not nodate:
+            info[b"/ModDate"] = b"(D:"+datetime_to_pdfdate(now).encode()+b")"
        if subject:
-            info["/Subject"] = "("+subject+")"
+            info[b"/Subject"] = b"("+subject+b")"
        if keywords:
-            info["/Keywords"] = "("+",".join(keywords)+")"
+            info[b"/Keywords"] = b"("+b",".join(keywords)+b")"

        self.info = obj(info)

        # create an incomplete pages object so that a /Parent entry can be
        # added to each page
        self.pages = obj({
-            "/Type": "/Pages",
-            "/Kids": [],
-            "/Count": 0
+            b"/Type": b"/Pages",
+            b"/Kids": [],
+            b"/Count": 0
        })

        self.catalog = obj({
-            "/Pages": self.pages,
-            "/Type": "/Catalog"
+            b"/Pages": self.pages,
+            b"/Type": b"/Catalog"
        })
        self.addobj(self.catalog)
        self.addobj(self.pages)
@@ -115,71 +133,70 @@ class pdfdoc(object):
        obj.identifier = newid
        self.objects.append(obj)

-    def addimage(self, color, width, height, dpi, imgformat, imgdata):
+    def addimage(self, color, width, height, imgformat, imgdata, pdf_x, pdf_y):
        if color == 'L':
-            color = "/DeviceGray"
+            colorspace = b"/DeviceGray"
        elif color == 'RGB':
-            color = "/DeviceRGB"
+            colorspace = b"/DeviceRGB"
+        elif color == 'CMYK' or color == 'CMYK;I':
+            colorspace = b"/DeviceCMYK"
        else:
            error_out("unsupported color space: %s"%color)
            exit(1)

-        # pdf units = 1/72 inch
-        pdf_x, pdf_y = 72.0*width/dpi[0], 72.0*height/dpi[1]
-
-        print(pdf_x)
-        print(pdf_y)
-
        if pdf_x < 3.00 or pdf_y < 3.00:
-            warning_out("pdf width or height is below 3.00 - decrease the dpi")
-        elif pdf_x > 14400.0 or pdf_y > 14400.0:
-            #error_out(("pdf width or height is above 200.00 - increase the dpi")
+            warning_out("pdf width or height is below 3.00\" - decrease the dpi")
+        elif pdf_x > 200.0 or pdf_y > 200.0:
            warning_out("pdf width or height would be above 200\" - squeezed inside")
-            x_scale = 14400.0 / pdf_x
-            y_scale = 14400.0 / pdf_y
+            x_scale = 200.0 / pdf_x
+            y_scale = 200.0 / pdf_y
            scale = min(x_scale, y_scale) * 0.999
            pdf_x *= scale
            pdf_y *= scale

        # either embed the whole jpeg or deflate the bitmap representation
        if imgformat is "JPEG":
-            ofilter = [ "/DCTDecode" ]
-        elif imgformat is "JP2":
-            ofilter = [ "/JPXDecode" ]
+            ofilter = [ b"/DCTDecode" ]
+        elif imgformat == "JPEG2000":
+            ofilter = [ b"/JPXDecode" ]
            self.version = 5 # jpeg2000 needs pdf 1.5
        else:
-            ofilter = [ "/FlateDecode" ]
+            ofilter = [ b"/FlateDecode" ]
        image = obj({
-            "/Type": "/XObject",
-            "/Subtype": "/Image",
-            "/Filter": ofilter,
-            "/Width": width,
-            "/Height": height,
-            "/ColorSpace": color,
-            # hardcoded as PIL doesnt provide bits for non-jpeg formats
-            "/BitsPerComponent": 8,
-            "/Length": len(imgdata)
+            b"/Type": b"/XObject",
+            b"/Subtype": b"/Image",
+            b"/Filter": ofilter,
+            b"/Width": width,
+            b"/Height": height,
+            b"/ColorSpace": colorspace,
+            # hardcoded as PIL doesn't provide bits for non-jpeg formats
+            b"/BitsPerComponent": 8,
+            b"/Length": len(imgdata)
        }, imgdata)

-        text = "q\n%f 0 0 %f 0 0 cm\n/Im0 Do\nQ"%(pdf_x, pdf_y)
+        if color == 'CMYK;I':
+            # Inverts all four channels
+            image.content[b'/Decode'] = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
+
+        text = ("q\n%0.4f 0 0 %0.4f 0 0 cm\n/Im0 Do\nQ"%(pdf_x, pdf_y)).encode()

        content = obj({
-            "/Length": len(text)
+            b"/Length": len(text)
        }, text)

        page = obj({
-            "/Type": "/Page",
-            "/Parent": self.pages,
-            "/Resources": {
-                "/XObject": {
-                    "/Im0": image
+            b"/Type": b"/Page",
+            b"/Parent": self.pages,
+            b"/Resources": {
+                b"/XObject": {
+                    b"/Im0": image
                }
            },
-            "/MediaBox": [0, 0, pdf_x, pdf_y],
-            "/Contents": content
+            b"/MediaBox": [0, 0, pdf_x, pdf_y],
+            b"/Contents": content
        })
-        self.pages.content["/Kids"].append(page)
-        self.pages.content["/Count"] += 1
+        self.pages.content[b"/Kids"].append(page)
+        self.pages.content[b"/Count"] += 1
        self.addobj(page)
        self.addobj(content)
        self.addobj(image)
@@ -190,35 +207,43 @@ class pdfdoc(object):

        xreftable = list()

-        result = "%%PDF-1.%d\n"%self.version
+        result = ("%%PDF-1.%d\n"%self.version).encode()

-        xreftable.append("0000000000 65535 f \n")
+        xreftable.append(b"0000000000 65535 f \n")
        for o in self.objects:
-            xreftable.append("%010d 00000 n \n"%len(result))
+            xreftable.append(("%010d 00000 n \n"%len(result)).encode())
            result += o.tostring()

        xrefoffset = len(result)
-        result += "xref\n"
-        result += "0 %d\n"%len(xreftable)
+        result += b"xref\n"
+        result += ("0 %d\n"%len(xreftable)).encode()
        for x in xreftable:
            result += x
-        result += "trailer\n"
-        result += parse({"/Size": len(xreftable), "/Info": self.info, "/Root": self.catalog})+"\n"
-        result += "startxref\n"
-        result += "%d\n"%xrefoffset
-        result += "%%EOF\n"
+        result += b"trailer\n"
+        result += parse({b"/Size": len(xreftable), b"/Info": self.info, b"/Root": self.catalog})+b"\n"
+        result += b"startxref\n"
+        result += ("%d\n"%xrefoffset).encode()
+        result += b"%%EOF\n"
        return result

-def convert(images, dpi, title=None, author=None, creator=None, producer=None,
-            creationdate=None, moddate=None, subject=None, keywords=None,
-            colorspace=None, verbose=False):
+def convert(images, dpi=None, pagesize=(None, None, None), title=None,
+            author=None, creator=None, producer=None, creationdate=None,
+            moddate=None, subject=None, keywords=None, colorspace=None,
+            nodate=False, verbose=False):
+
+    pagesize_options = pagesize[2]

    pdf = pdfdoc(3, title, author, creator, producer, creationdate,
-                 moddate, subject, keywords)
+                 moddate, subject, keywords, nodate)

-    for im in images:
+    for imfilename in images:
+        debug_out("Reading %s"%imfilename, verbose)
+        try:
+            rawdata = imfilename.read()
+        except AttributeError:
+            with open(imfilename, "rb") as im:
                rawdata = im.read()
-        im.seek(0)
+        im = cStringIO(rawdata)
        try:
            imgdata = Image.open(im)
        except IOError as e:
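
The `tostring` method changed above writes the file header, the objects, a cross-reference table with the byte offset of every object, and a trailer pointing at that table. A reduced sketch of the bookkeeping follows. It assumes the objects are already serialized as `bytes` and that the first one is the catalog; the function name `assemble_pdf` is hypothetical, and the `/Info` entry is omitted.

```python
def assemble_pdf(objects, version=3):
    """Concatenate serialized objects and append an xref table and trailer."""
    result = ("%%PDF-1.%d\n" % version).encode()

    # Entry 0 is always the head of the free list.
    xref = [b"0000000000 65535 f \n"]
    for body in objects:
        # Record where this object starts before appending it.
        xref.append(("%010d 00000 n \n" % len(result)).encode())
        result += body

    xref_offset = len(result)
    result += b"xref\n"
    result += ("0 %d\n" % len(xref)).encode()
    result += b"".join(xref)
    result += b"trailer\n"
    result += ("<< /Size %d /Root 1 0 R >>\n" % len(xref)).encode()
    result += b"startxref\n"
    result += ("%d\n" % xref_offset).encode()
    # No % formatting is applied here, so this is the literal "%%EOF" marker.
    result += b"%%EOF\n"
    return result
```

Because every offset is taken from `len(result)` on a bytes value, the table stays correct for binary image data, which is one reason the merge moves all string handling to `bytes`.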
@@ -229,14 +254,11 @@ def convert(images, dpi, title=None, author=None, creator=None, producer=None,
                exit(1)
            # image is jpeg2000
            width, height, ics = parsejp2(rawdata)
-            imgformat = "JP2"
+            imgformat = "JPEG2000"

-            if dpi:
-                ndpi = dpi, dpi
-                debug_out("input dpi (forced) = %d x %d"%ndpi, verbose)
-            else:
-                ndpi = (96, 96) # TODO: read real dpi
-                debug_out("input dpi = %d x %d"%ndpi, verbose)
+            # TODO: read real dpi from input jpeg2000 image
+            ndpi = (default_dpi, default_dpi)
+            debug_out("input dpi = %d x %d" % ndpi, verbose)

            if colorspace:
                color = colorspace
@@ -248,26 +270,45 @@ def convert(images, dpi, title=None, author=None, creator=None, producer=None,
            width, height = imgdata.size
            imgformat = imgdata.format

-            if dpi:
-                ndpi = dpi, dpi
-                debug_out("input dpi (forced) = %d x %d"%ndpi, verbose)
-            else:
-                ndpi = imgdata.info.get("dpi", (96, 96))
-                debug_out("input dpi = %d x %d"%ndpi, verbose)
+            ndpi = imgdata.info.get("dpi", (default_dpi, default_dpi))
+            # in python3, the returned dpi value for some tiff images will
+            # not be an integer but a float. To make the behaviour of
+            # img2pdf the same between python2 and python3, we convert that
+            # float into an integer by rounding
+            # search online for the 72.009 dpi problem for more info
+            ndpi = (int(round(ndpi[0])),int(round(ndpi[1])))
+            debug_out("input dpi = %d x %d" % ndpi, verbose)

            if colorspace:
                color = colorspace
                debug_out("input colorspace (forced) = %s"%(color), verbose)
            else:
                color = imgdata.mode
+                if color == "CMYK" and imgformat == "JPEG":
+                    # Adobe inverts CMYK JPEGs for some reason, and others
+                    # have followed suit as well. Some software assumes the
+                    # JPEG is inverted if the Adobe tag (APP14) is present, while other
+                    # software assumes all CMYK JPEGs are inverted. I don't
+                    # have enough experience with these to know which is
+                    # better for images currently in the wild, so I'm going
+                    # with the first approach for now.
+                    if "adobe" in imgdata.info:
+                        color = "CMYK;I"
                debug_out("input colorspace = %s"%(color), verbose)

        debug_out("width x height = %d x %d"%(width,height), verbose)
        debug_out("imgformat = %s"%imgformat, verbose)

+        if dpi:
+            ndpi = dpi, dpi
+            debug_out("input dpi (forced) = %d x %d" % ndpi, verbose)
+        elif pagesize_options:
+            ndpi = get_ndpi(width, height, pagesize)
+            debug_out("calculated dpi (based on pagesize) = %d x %d" % ndpi, verbose)
+
        # depending on the input format, determine whether to pass the raw
        # image or the zlib compressed color information
-        if imgformat is "JPEG" or imgformat is "JP2":
+        if imgformat == "JPEG" or imgformat == "JPEG2000":
            if color == '1':
                error_out("jpeg can't be monochrome")
                exit(1)
@@ -275,16 +316,61 @@ def convert(images, dpi, title=None, author=None, creator=None, producer=None,
        else:
            # because we do not support /CCITTFaxDecode
            if color == '1':
+                debug_out("Converting colorspace 1 to L", verbose)
                imgdata = imgdata.convert('L')
                color = 'L'
-            imgdata = zlib.compress(imgdata.tostring())
-
-        pdf.addimage(color, width, height, ndpi, imgformat, imgdata)
-
+            elif color in ("RGB", "L", "CMYK", "CMYK;I"):
+                debug_out("Colorspace is OK: %s"%color, verbose)
+            else:
+                debug_out("Converting colorspace %s to RGB"%color, verbose)
+                imgdata = imgdata.convert('RGB')
+                color = imgdata.mode
+            img = imgdata.tobytes()
+            # the python-pil version 2.3.0-1ubuntu3 in Ubuntu does not have the close() method
+            try:
+                imgdata.close()
+            except AttributeError:
+                pass
+            imgdata = zlib.compress(img)
        im.close()

+        if pagesize_options and pagesize_options['exact'][1]:
+            # output size exactly to specified dimensions
+            # pagesize[0], pagesize[1] already checked in valid_size()
+            pdf_x, pdf_y = pagesize[0], pagesize[1]
+        else:
+            # output size based on dpi; point = 1/72 inch
+            pdf_x, pdf_y = 72.0*width/float(ndpi[0]), 72.0*height/float(ndpi[1])
+
+        pdf.addimage(color, width, height, imgformat, imgdata, pdf_x, pdf_y)
+
    return pdf.tostring()

+def get_ndpi(width, height, pagesize):
+    pagesize_options = pagesize[2]
+
+    if pagesize_options and pagesize_options['fill'][1]:
+        if float(width)/height < pagesize[0]/pagesize[1]:
+            tmp_dpi = 72.0*width/pagesize[0]
+        else:
+            tmp_dpi = 72.0*height/pagesize[1]
+    elif pagesize[0] and pagesize[1]:
+        # if both height and width given with no specific pagesize_option,
+        # resize to fit "into" page
+        if float(width)/height < pagesize[0]/pagesize[1]:
+            tmp_dpi = 72.0*height/pagesize[1]
+        else:
+            tmp_dpi = 72.0*width/pagesize[0]
+    elif pagesize[0]:
+        # if width given, calculate dpi based on width
+        tmp_dpi = 72.0*width/pagesize[0]
+    elif pagesize[1]:
+        # if height given, calculate dpi based on height
+        tmp_dpi = 72.0*height/pagesize[1]
+    else:
+        tmp_dpi = default_dpi
+
+    return tmp_dpi, tmp_dpi

 def positive_float(string):
    value = float(string)
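
The "into" and "fill" branches of `get_ndpi` above differ only in which image dimension is pinned to the requested box. The sketch below restates that choice and returns the resulting page size in points rather than a dpi value. It is an illustration under stated assumptions, not the code from this commit: the name `page_size_points` is hypothetical, both box dimensions are assumed to be given, and the ratios are computed with explicit floats so that integer pixel sizes do not truncate under Python 2.

```python
def page_size_points(width_px, height_px, box_w, box_h, fill=False):
    """Return (pdf_x, pdf_y) in points for an image sized into or filling a box."""
    image_ratio = float(width_px) / height_px
    box_ratio = float(box_w) / box_h

    if fill:
        # Fill: pin the dimension that leaves the other one at least box-sized.
        pin_width = image_ratio < box_ratio
    else:
        # Into: pin the dimension that keeps the other one inside the box.
        pin_width = image_ratio >= box_ratio

    if pin_width:
        dpi = 72.0 * width_px / box_w
    else:
        dpi = 72.0 * height_px / box_h

    # One point is 1/72 inch, so pixels / dpi * 72 gives points.
    return 72.0 * width_px / dpi, 72.0 * height_px / dpi
```

For a 1000 x 500 pixel image and a 500pt x 500pt box, this gives a 500 x 250 page in "into" mode and a 1000 x 500 page in "fill" mode, with the aspect ratio unchanged in both cases.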
@@ -294,58 +380,276 @@ def positive_float(string):
    return value

 def valid_date(string):
+    # first try parsing in ISO8601 format
+    try:
+        return datetime.strptime(string, "%Y-%m-%d")
+    except ValueError:
+        pass
+    try:
+        return datetime.strptime(string, "%Y-%m-%dT%H:%M")
+    except ValueError:
+        pass
+    try:
        return datetime.strptime(string, "%Y-%m-%dT%H:%M:%S")
+    except ValueError:
+        pass
+    # then try dateutil
+    try:
+        from dateutil import parser
+    except ImportError:
+        pass
+    else:
+        try:
+            return parser.parse(string)
+        except TypeError:
+            pass
+    # as a last resort, try the local date utility
+    try:
+        import subprocess
+    except ImportError:
+        pass
+    else:
+        try:
+            utime = subprocess.check_output(["date", "--date", string, "+%s"])
+        except subprocess.CalledProcessError:
+            pass
+        else:
+            return datetime.utcfromtimestamp(int(utime))
+    raise argparse.ArgumentTypeError("cannot parse date: %s"%string)
+
+def get_standard_papersize(string):
+    papersizes = {
+        "11x17"       : "792x792^",     # "792x1224",
+        "ledger"      : "792x792^",     # "1224x792",
+        "legal"       : "612x612^",     # "612x1008",
+        "letter"      : "612x612^",     # "612x792",
+        "arche"       : "2592x2592^",   # "2592x3456",
+        "archd"       : "1728x1728^",   # "1728x2592",
+        "archc"       : "1296x1296^",   # "1296x1728",
+        "archb"       : "864x864^",     # "864x1296",
+        "archa"       : "648x648^",     # "648x864",
+        "a0"          : "2380x2380^",   # "2380x3368",
+        "a1"          : "1684x1684^",   # "1684x2380",
+        "a2"          : "1190x1190^",   # "1190x1684",
+        "a3"          : "842x842^",     # "842x1190",
+        "a4"          : "595x595^",     # "595x842",
+        "a5"          : "421x421^",     # "421x595",
+        "a6"          : "297x297^",     # "297x421",
+        "a7"          : "210x210^",     # "210x297",
+        "a8"          : "148x148^",     # "148x210",
+        "a9"          : "105x105^",     # "105x148",
+        "a10"         : "74x74^",       # "74x105",
+        "b0"          : "2836x2836^",   # "2836x4008",
+        "b1"          : "2004x2004^",   # "2004x2836",
+        "b2"          : "1418x1418^",   # "1418x2004",
+        "b3"          : "1002x1002^",   # "1002x1418",
+        "b4"          : "709x709^",     # "709x1002",
+        "b5"          : "501x501^",     # "501x709",
+        "c0"          : "2600x2600^",   # "2600x3677",
+        "c1"          : "1837x1837^",   # "1837x2600",
+        "c2"          : "1298x1298^",   # "1298x1837",
+        "c3"          : "918x918^",     # "918x1298",
+        "c4"          : "649x649^",     # "649x918",
+        "c5"          : "459x459^",     # "459x649",
+        "c6"          : "323x323^",     # "323x459",
+        "flsa"        : "612x612^",     # "612x936",
+        "flse"        : "612x612^",     # "612x936",
+        "halfletter"  : "396x396^",     # "396x612",
+        "tabloid"     : "792x792^",     # "792x1224",
+        "statement"   : "396x396^",     # "396x612",
+        "executive"   : "540x540^",     # "540x720",
+        "folio"       : "612x612^",     # "612x936",
+        "quarto"      : "610x610^",     # "610x780"
+    }
+
+    string = string.lower()
+    return papersizes.get(string, string)
+
+def valid_size(string):
+    # conversion factors from units to points
+    units = {
+        'in'  : 72.0,
+        'cm'  : 72.0/2.54,
+        'mm'  : 72.0/25.4,
+        'pt' : 1.0
+    }
+
+    pagesize_options = {
+        'exact'   : [r'\!', False],
+        'shrink'  : [r'\>', False],
+        'enlarge' : [r'\<', False],
+        'fill'    : [r'\^', False],
+        'percent' : [r'\%', False],
+        'count'   : [r'\@', False],
+    }
+
+    string = get_standard_papersize(string)
+
+    pattern = re.compile(r"""
+            ([0-9]*\.?[0-9]*)   # tokens.group(1) == width; may be empty
+            ([a-z]*)            # tokens.group(2) == units; may be empty
+            x
+            ([0-9]*\.?[0-9]*)   # tokens.group(3) == height; may be empty
+            ([a-zA-Z]*)         # tokens.group(4) == units; may be empty
+            ([^0-9a-zA-Z]*)     # tokens.group(5) == extra options
+        """, re.VERBOSE)
+
+    tokens = pattern.match(string)
+
+    # the pattern must match, and tokens.group(0) must cover the entire input
+    if tokens is None or tokens.group(0) != string:
+        msg = ('Input size needs to be of the format AuxBv#, '
+            'where A is width, B is height, u and v are units, '
+            '# are options.  '
+            'You may omit either width or height, but not both.  '
+            'Units may be specified as (in, cm, mm, pt).  '
+            'You may omit units, which will default to pt.  '
+            'Available options include (! = exact ; ^ = fill ; default = into).')
+        raise argparse.ArgumentTypeError(msg)
+
+    # temporary list to loop through to process width and height
+    pagesize_size = {
+        'x' : [0, tokens.group(1), tokens.group(2)],
+        'y' : [0, tokens.group(3), tokens.group(4)]
+    }
+
+    for key, value in pagesize_size.items():
+        try:
+            value[0] = float(value[1])
+            value[0] *= units[value[2]]     # convert to points
+        except ValueError:
+            # assign None if width or height not provided
+            value[0] = None
+        except KeyError:
+            # if units unrecognized, raise error
+            # otherwise default to pt because units not provided
+            if value[2]:
+                msg = "unrecognized unit '%s'." % value[2]
+                raise argparse.ArgumentTypeError(msg)
+
+    x = pagesize_size['x'][0]
+    y = pagesize_size['y'][0]
+
+    # parse options for resize methods
+    if tokens.group(5):
+        for key, value in pagesize_options.items():
+            if re.search(value[0], tokens.group(5)):
+                value[1] = True
+
+    if pagesize_options['fill'][1]:
+        # if either width or height is not given, try to fill in missing value
+        if not x:
+            x = y
+        elif not y:
+            y = x
+
+    if pagesize_options['exact'][1]:
+        if not x or not y:
+            msg = ('exact size requires both width and height.')
+            raise argparse.ArgumentTypeError(msg)
+
+    if not x and not y:
+        msg = ('width and height cannot both be omitted.')
+        raise argparse.ArgumentTypeError(msg)
+
+    return (x, y, pagesize_options)
+
+# in python3, the received argument will be a unicode str() object which needs
+# to be encoded into a bytes() object
+# in python2, the received argument will be a binary str() object which needs
+# no encoding
+# we check whether we use python2 or python3 by checking whether the argument
+# is of both type str and type bytes (only the case in python2)
+def pdf_embedded_string(string):
+    if type(string) is str and type(string) is not bytes:
+        # py3
+        pass
+    else:
+        # py2
+        string = string.decode("utf8")
+    string = b"\xfe\xff"+string.encode("utf-16-be")
+    string = string.replace(b'\\', b'\\\\')
+    string = string.replace(b'(', b'\\(')
+    string = string.replace(b')', b'\\)')
+    return string

 parser = argparse.ArgumentParser(
    description='Lossless conversion/embedding of images (in)to pdf')
 parser.add_argument(
-    'images', metavar='infile', type=argparse.FileType('rb'),
+    'images', metavar='infile', type=str,
    nargs='+', help='input file(s)')
 parser.add_argument(
    '-o', '--output', metavar='out', type=argparse.FileType('wb'),
-    default=sys.stdout, help='output file (default: stdout)')
-parser.add_argument(
+    default=getattr(sys.stdout, "buffer", sys.stdout),
+    help='output file (default: stdout)')
+
+sizeopts = parser.add_mutually_exclusive_group()
+sizeopts.add_argument(
    '-d', '--dpi', metavar='dpi', type=positive_float,
-    help='dpi for pdf output (default: 96.0)')
+    help=('dpi for pdf output. '
+        'If input image does not specify dpi the default is %.2f.  '
+        'Must not be used with -s/--pagesize.') % default_dpi
+)
+
+sizeopts.add_argument(
+    '-s', '--pagesize', metavar='size', type=valid_size,
+    default=(None, None, None),
+    help=('size of the pdf pages in format AuxBv#, '
+        'where A is width, B is height, u and v are units, # are options. '
+        'You may omit either width or height, but not both.  '
+        'Some common page sizes, such as letter and a4, are also recognized.  '
+        'Units may be specified as (in, cm, mm, pt).  '
+        'Units default to pt when absent.  '
+        'Available options include (! = exact ; ^ = fill ; default = into).  '
+        'Must not be used with -d/--dpi.')
+)
+
 parser.add_argument(
-    '-t', '--title', metavar='title', type=str,
+    '-t', '--title', metavar='title', type=pdf_embedded_string,
    help='title for metadata')
 parser.add_argument(
-    '-a', '--author', metavar='author', type=str,
+    '-a', '--author', metavar='author', type=pdf_embedded_string,
    help='author for metadata')
 parser.add_argument(
-    '-c', '--creator', metavar='creator', type=str,
+    '-c', '--creator', metavar='creator', type=pdf_embedded_string,
    help='creator for metadata')
 parser.add_argument(
-    '-p', '--producer', metavar='producer', type=str,
+    '-p', '--producer', metavar='producer', type=pdf_embedded_string,
    help='producer for metadata')
 parser.add_argument(
    '-r', '--creationdate', metavar='creationdate', type=valid_date,
-    help='creation date for metadata in YYYY-MM-DDTHH:MM:SS format')
+    help='UTC creation date for metadata in YYYY-MM-DD, YYYY-MM-DDTHH:MM or YYYY-MM-DDTHH:MM:SS format, or any format understood by the python dateutil module or by `date --date`')
 parser.add_argument(
    '-m', '--moddate', metavar='moddate', type=valid_date,
-    help='modification date for metadata in YYYY-MM-DDTHH:MM:SS format')
+    help='UTC modification date for metadata in YYYY-MM-DD, YYYY-MM-DDTHH:MM or YYYY-MM-DDTHH:MM:SS format, or any format understood by the python dateutil module or by `date --date`')
 parser.add_argument(
-    '-s', '--subject', metavar='subject', type=str,
+    '-S', '--subject', metavar='subject', type=pdf_embedded_string,
    help='subject for metadata')
 parser.add_argument(
-    '-k', '--keywords', metavar='kw', type=str, nargs='+',
+    '-k', '--keywords', metavar='kw', type=pdf_embedded_string, nargs='+',
    help='keywords for metadata')
 parser.add_argument(
-    '-C', '--colorspace', metavar='colorspace', type=str,
-    help='force PIL colorspace (one of: RGB, L, 1)')
+    '-C', '--colorspace', metavar='colorspace', type=str,
+    help='force PIL colorspace (one of: RGB, L, 1, CMYK, CMYK;I)')
+parser.add_argument(
+    '-D', '--nodate', help='do not add timestamps', action="store_true")
 parser.add_argument(
    '-v', '--verbose', help='verbose mode', action="store_true")
+parser.add_argument(
+    '-V', '--version', action='version', version='%(prog)s '+__version__,
+    help="Print version information and exit")

 def main(args=None):
    if args is None:
        args = sys.argv[1:]
    args = parser.parse_args(args)
+
    args.output.write(
        convert(
-            args.images, args.dpi, args.title, args.author,
+            args.images, args.dpi, args.pagesize, args.title, args.author,
            args.creator, args.producer, args.creationdate, args.moddate,
-            args.subject, args.keywords, args.colorspace, args.verbose))
+            args.subject, args.keywords, args.colorspace, args.nodate,
+            args.verbose))

 if __name__ == '__main__':
    main()
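
The unit handling in `valid_size` above can be looked at in isolation. What follows is a minimal sketch, not the code from this commit: `size_to_points` is a hypothetical helper, it only covers the `AuxBv` part of the format, it ignores the trailing option characters and the named paper sizes, and it raises `ValueError` instead of `argparse.ArgumentTypeError`.

```python
import re

# Conversion factors from units to PDF points (72 points per inch),
# mirroring the `units` table in valid_size.
UNITS = {"in": 72.0, "cm": 72.0 / 2.54, "mm": 72.0 / 25.4, "pt": 1.0}

# Each dimension is an optional number followed by an optional unit.
SIZE_RE = re.compile(r"([0-9]*\.?[0-9]*)([a-z]*)x([0-9]*\.?[0-9]*)([a-z]*)")


def size_to_points(string):
    """Hypothetical helper: parse 'AuxBv' into (width, height) in points.

    An omitted dimension is returned as None; units default to pt.
    """
    match = SIZE_RE.fullmatch(string.lower())
    if match is None:
        raise ValueError("cannot parse size: %r" % string)
    dims = []
    for number, unit in (match.group(1, 2), match.group(3, 4)):
        if unit and unit not in UNITS:
            raise ValueError("unrecognized unit %r" % unit)
        if not number:
            # dimension omitted, to be derived from the image later
            dims.append(None)
            continue
        dims.append(float(number) * UNITS.get(unit, 1.0))
    if dims[0] is None and dims[1] is None:
        raise ValueError("width and height cannot both be omitted")
    return tuple(dims)


if __name__ == "__main__":
    for example in ("8.5inx11in", "210mmx", "x842"):
        print(example, "->", size_to_points(example))
```

Using `fullmatch` rejects inputs without an `x` outright, so there is no separate comparison of the matched text against the input string.
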
--- a/src/jp2.py
+++ b/src/jp2.py
@ -85,6 +85,6 @@ def parsejp2(data):
 if __name__ == "__main__":
    import sys
    width, height, colorspace = parsejp2(open(sys.argv[1]).read())
-    print "width = %d"%width
-    print "height = %d"%height
-    print "colorspace = %s"%colorspace
+    sys.stdout.write("width = %d\n"%width)
+    sys.stdout.write("height = %d\n"%height)
+    sys.stdout.write("colorspace = %s\n"%colorspace)
--- a/src/tests/__init__.py
+++ b/src/tests/__init__.py
@ -1,7 +1,109 @@
 import unittest
-import test_img2pdf
+
+import os
+import img2pdf
+import zlib
+from PIL import Image
+
+HERE = os.path.dirname(__file__)
+
+#convert +set date:create +set date:modify -define png:exclude-chunk=time

 def test_suite():
+    class TestImg2Pdf(unittest.TestCase):
+        pass
+
+    for test_name in os.listdir(os.path.join(HERE, "input")):
+        inputf = os.path.join(HERE, "input", test_name)
+        if not os.path.isfile(inputf):
+            continue
+        outputf = os.path.join(HERE, "output", test_name+".pdf")
+        assert os.path.isfile(outputf)
+        def handle(self, f=inputf, out=outputf):
+            with open(f, "rb") as inf:
+                orig_imgdata = inf.read()
+            pdf = img2pdf.convert([f], nodate=True)
+            imgdata = b""
+            instream = False
+            imgobj = False
+            colorspace = None
+            imgfilter = None
+            width = None
+            height = None
+            length = None
+            # ugly workaround to parse the created pdf
+            for line in pdf.split(b'\n'):
+                if instream:
+                    if line == b"endstream":
+                        break
+                    else:
+                        imgdata += line + b'\n'
+                else:
+                    if imgobj and line == b"stream":
+                        instream = True
+                    elif b"/Subtype /Image" in line:
+                        imgobj = True
+                    elif b"/Width" in line:
+                        width = int(line.split()[-1])
+                    elif b"/Height" in line:
+                        height = int(line.split()[-1])
+                    elif b"/Length" in line:
+                        length = int(line.split()[-1])
+                    elif b"/Filter" in line:
+                        imgfilter = line.split()[-2]
+                    elif b"/ColorSpace" in line:
+                        colorspace = line.split()[-1]
+            # remove trailing \n
+            imgdata = imgdata[:-1]
+            # test if the length field is correct
+            self.assertEqual(len(imgdata), length)
+            # test if the filter is valid:
+            self.assertIn(imgfilter, [b"/DCTDecode", b"/JPXDecode", b"/FlateDecode"])
+            # test if the colorspace is valid
+            self.assertIn(colorspace, [b"/DeviceGray", b"/DeviceRGB", b"/DeviceCMYK"])
+            # test if the image has correct size
+            orig_img = Image.open(f)
+            self.assertEqual(width, orig_img.size[0])
+            self.assertEqual(height, orig_img.size[1])
+            # if the input file is a jpeg then it should've been copied
+            # verbatim into the PDF
+            if imgfilter in [b"/DCTDecode", b"/JPXDecode"]:
+                self.assertEqual(imgdata, orig_imgdata)
+            elif imgfilter == b"/FlateDecode":
+                # otherwise, the data is flate encoded and has to be equal to
+                # the pixel data of the input image
+                imgdata = zlib.decompress(imgdata)
+                if colorspace == b"/DeviceGray":
+                    colorspace = 'L'
+                elif colorspace == b"/DeviceRGB":
+                    colorspace = 'RGB'
+                elif colorspace == b"/DeviceCMYK":
+                    colorspace = 'CMYK'
+                else:
+                    raise Exception("invalid colorspace")
+                im = Image.frombytes(colorspace, (width, height), imgdata)
+                if orig_img.mode == '1':
+                    orig_img = orig_img.convert("L")
+                elif orig_img.mode not in ("RGB", "L", "CMYK", "CMYK;I"):
+                    orig_img = orig_img.convert("RGB")
+                self.assertEqual(im.tobytes(), orig_img.tobytes())
+                # the python-pil version 2.3.0-1ubuntu3 in Ubuntu does not have the close() method
+                try:
+                    im.close()
+                except AttributeError:
+                    pass
+            # lastly, make sure that the generated pdf matches bit by bit the
+            # expected pdf
+            with open(out, "rb") as outf:
+                out = outf.read()
+            self.assertEqual(pdf, out)
+            # the python-pil version 2.3.0-1ubuntu3 in Ubuntu does not have the close() method
+            try:
+                orig_img.close()
+            except AttributeError:
+                pass
+        setattr(TestImg2Pdf, "test_%s"%test_name, handle)
+
    return unittest.TestSuite((
-            unittest.makeSuite(test_img2pdf.TestImg2Pdf),
+            unittest.makeSuite(TestImg2Pdf),
            ))
--- a/src/tests/input/CMYK.jpg
+++ b/src/tests/input/CMYK.jpg
--- a/src/tests/input/CMYK.tif
+++ b/src/tests/input/CMYK.tif
--- a/src/tests/input/normal.jpg
+++ b/src/tests/input/normal.jpg
--- a/src/tests/input/normal.png
+++ b/src/tests/input/normal.png
--- a/src/tests/output/CMYK.jpg.pdf
+++ b/src/tests/output/CMYK.jpg.pdf
--- a/src/tests/output/CMYK.tif.pdf
+++ b/src/tests/output/CMYK.tif.pdf
--- a/src/tests/output/normal.jpg.pdf
+++ b/src/tests/output/normal.jpg.pdf
--- a/src/tests/output/normal.png.pdf
+++ b/src/tests/output/normal.png.pdf
--- a/src/tests/test_img2pdf.py
+++ b/src/tests/test_img2pdf.py
@ -1,20 +0,0 @@
-import datetime
-import os
-import unittest
-import img2pdf
-
-HERE = os.path.dirname(__file__)
-moddate = datetime.datetime(2014, 1, 1)
-
-class TestImg2Pdf(unittest.TestCase):
-    def test_jpg2pdf(self):
-        with open(os.path.join(HERE, 'test.jpg'), 'r') as img_fp:
-            with open(os.path.join(HERE, 'test.pdf'), 'r') as pdf_fp:
-                self.assertEqual(
-                    img2pdf.convert([img_fp], 150,
-                                    creationdate=moddate, moddate=moddate),
-                    pdf_fp.read())
-
-    def test_png2pdf(self):
-        with open(os.path.join(HERE, 'test.png'), 'r') as img_fp:
-            self.assertRaises(SystemExit, img2pdf.convert, [img_fp], 150)