MPO JPEGs from digital camera with thumbnails inserted as multiple frames #135

Closed
opened 2 years ago by j_barlow · 6 comments

I found that img2pdf does not work properly on JPEGs produced by my digital camera.

As far as Pillow is concerned, the files it produces are in MPO format (Multi-Picture Object). But they're fully valid JPEGs too, if that makes any sense at all. They start with a standard JPEG header, but have both EXIF and TIFF metadata, which in turn contain the additional thumbnails.

In [7]: im.format_description
Out[7]: 'MPO (CIPA DC-007)'

When img2pdf is used on this type of JPEG, it creates a PDF containing the original image first, as it should; it then appends all of the thumbnails as separate pages which is not particularly helpful.

Pillow's im.mpinfo seems to contain metadata that could be used to determine that the images are thumbnails that can be ignored.
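
To illustrate the idea, here is a minimal sketch of how the MP entries could be sorted into "main image", "thumbnail" and "other". It assumes the entry list lives at `im.mpinfo[0xB002]` and that each entry carries an `"Attribute"` dictionary with the flag names used below; the function names (`classify_entry`, `classify_entries`, `entries_from_file`) are made up for this example and are not part of Pillow or img2pdf.

```python
# Sketch only: classify MP entries by their attribute flags.
# The helper names are hypothetical; the dictionary keys are assumed to match
# what Pillow stores in im.mpinfo[0xB002] for an MPO file.

THUMBNAIL_TYPES = {
    "Large Thumbnail (VGA Equivalent)",
    "Large Thumbnail (Full HD Equivalent)",
}


def classify_entry(index, entry):
    """Return 'main', 'thumbnail' or 'other' for a single MP entry."""
    attr = entry["Attribute"]
    # Only the first entry is considered a candidate for the main image.
    if (
        index == 0
        and attr["DependentParentImageFlag"]
        and not attr["DependentChildImageFlag"]
        and attr["RepresentativeImageFlag"]
        and attr["MPType"] == "Baseline MP Primary Image"
    ):
        return "main"
    # A thumbnail depends on a parent and is not the representative image.
    if (
        not attr["DependentParentImageFlag"]
        and attr["DependentChildImageFlag"]
        and not attr["RepresentativeImageFlag"]
        and attr["MPType"] in THUMBNAIL_TYPES
    ):
        return "thumbnail"
    return "other"


def classify_entries(entries):
    """Classify every entry of an MP entry list, in order."""
    return [classify_entry(i, entry) for i, entry in enumerate(entries)]


def entries_from_file(path):
    """Read the MP entry list with Pillow (only works for files opened as MPO)."""
    from PIL import Image  # imported lazily so the classifier needs no Pillow

    with Image.open(path) as im:
        return list(im.mpinfo[0xB002])
```

With something like `classify_entries(entries_from_file("input.jpg"))`, a file that yields one `'main'` followed only by `'thumbnail'` entries would be a candidate for being treated as a single page.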

The output file can be produced with img2pdf -o output.pdf input.jpg.

I used img2pdf 0.4.3. (img2pdf 0.4.0 additionally re-encodes all of the JPEGs with PNG compression, but 0.4.3 does not have that issue.)

Poster

Here are the files.

josch commented 2 years ago
Owner

Thanks for your bug report and the test input! If I understand the problem correctly, then this is a feature and not a bug. img2pdf by default outputs one page for each frame in a multi-frame input image. MPO files are normal JPEG files for parsers that only understand JPEG but they appear as multi-frame images for parsers that understand the MPO format like PIL. Thus, MPO files are treated the same way by img2pdf as multi-frame GIF images, for example.

I think what you want is the --first-frame-only option.
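
As a rough illustration of why the same file looks like a plain JPEG to one parser and like a multi-frame image to another: to my knowledge, the MP format stores its index in an APP2 segment whose payload starts with `MPF\0`, which a JPEG-only parser simply skips. The sketch below walks the header segments of a JPEG byte string and reports whether such a segment is present. The function name `has_mpf_segment` is invented for this example, and the walker is deliberately simplistic (it does not handle fill bytes or unusual marker layouts).

```python
import struct


def has_mpf_segment(data: bytes) -> bool:
    """Return True if the JPEG header has an APP2 segment tagged 'MPF\\0'."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return False  # not positioned at a marker; give up instead of guessing
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):  # SOS or EOI: header segments are over
            return False
        # The two-byte big-endian length includes the length field itself.
        (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        payload = data[pos + 4 : pos + 2 + length]
        # APP2 is also used for ICC profiles, so the identifier must be checked.
        if marker == 0xE2 and payload[:4] == b"MPF\x00":
            return True
        pos += 2 + length
    return False
```

A parser that never looks at this segment sees one ordinary JPEG; one that follows the index in it finds the additional frames.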

josch commented 2 years ago
Owner

Does this solve this issue? If yes, please close it. Thanks! :)

Poster

I don't think it's reasonable behavior to include both an image and its thumbnails in the final PDF. I can't think of any reason a user would want that behavior.

I understand that, in general, a multi-frame image should be unpacked into multiple pages. But if the extra frames serve some special function, like thumbnails, they should be discarded.

josch commented 2 years ago
Owner

I think you are right. I think I want to add another command line option called --include-thumbnails. By default, thumbnails will not be included because they are redundant. So with the new behaviour MPO files will be copied into the PDF as they are and thus show up as a single page. The old behaviour can be triggered by supplying --include-thumbnails. One can argue that the old behaviour was kinda buggy as it chopped up the JPEG into multiple individual JPEGs where the first JPEG still retained the MPO information but was missing the thumbnails.
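
The proposed behaviour boils down to a small decision, sketched here as a standalone function. The name `pages_for_mpo` and its parameters are hypothetical and only model the page count; the sketch ignores `--first-frame-only` and everything else img2pdf does.

```python
def pages_for_mpo(num_frames, num_main_frames, num_thumbnail_frames,
                  include_thumbnails=False):
    """Return how many PDF pages an MPO file would produce (sketch only)."""
    # The file is "a main image plus thumbnails" if exactly one main frame
    # was found and every remaining frame was recognised as a thumbnail.
    main_plus_thumbnails = (
        num_main_frames == 1
        and num_main_frames + num_thumbnail_frames == num_frames
    )
    # Single-frame files, and main-plus-thumbnail files unless the user
    # opted in, are embedded as they are and show up as one page.
    if num_frames == 1 or (main_plus_thumbnails and not include_thumbnails):
        return 1
    # Anything else is handled like other multi-frame images: one page per frame.
    return num_frames
```

For a camera file with one main image and two thumbnails this gives one page by default and three pages with `include_thumbnails=True`. An MPO whose extra frames are not thumbnails still yields one page per frame either way.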

Does anybody know of another image format where thumbnails are represented by Pillow as additional image frames?

Owner

The following diff implements the --include-thumbnails option and does not include thumbnails by default. Could somebody try out if this does the right thing for them?

diff --git a/src/img2pdf.py b/src/img2pdf.py
index 35141d0..4300999 100755
--- a/src/img2pdf.py
+++ b/src/img2pdf.py
@@ -1750,7 +1750,9 @@ def parse_miff(data):
 # fmt: on
 
 
-def read_images(rawdata, colorspace, first_frame_only=False, rot=None):
+def read_images(
+    rawdata, colorspace, first_frame_only=False, rot=None, include_thumbnails=False
+):
     im = BytesIO(rawdata)
     im.seek(0)
     imgdata = None
@@ -1836,6 +1838,77 @@ def read_images(rawdata, colorspace, first_frame_only=False, rot=None):
     if imgformat == ImageFormat.MPO:
         result = []
         img_page_count = 0
+        assert len(imgdata._MpoImageFile__mpoffsets) == len(imgdata.mpinfo[0xB002])
+        num_frames = len(imgdata.mpinfo[0xB002])
+        # An MPO file can be a main image together with one or more thumbnails
+        # if that is the case, then we only include all frames if the
+        # --include-thumbnails option is given. If it is not, such an MPO file
+        # will be embedded as is, so including its thumbnails but showing up
+        # as a single image page in the resulting PDF.
+        num_main_frames = 0
+        num_thumbnail_frames = 0
+        for i, mpent in enumerate(imgdata.mpinfo[0xB002]):
+            # check only the first frame for being the main image
+            if (
+                i == 0
+                and mpent["Attribute"]["DependentParentImageFlag"]
+                and not mpent["Attribute"]["DependentChildImageFlag"]
+                and mpent["Attribute"]["RepresentativeImageFlag"]
+                and mpent["Attribute"]["MPType"] == "Baseline MP Primary Image"
+            ):
+                num_main_frames += 1
+            elif (
+                not mpent["Attribute"]["DependentParentImageFlag"]
+                and mpent["Attribute"]["DependentChildImageFlag"]
+                and not mpent["Attribute"]["RepresentativeImageFlag"]
+                and mpent["Attribute"]["MPType"]
+                in [
+                    "Large Thumbnail (VGA Equivalent)",
+                    "Large Thumbnail (Full HD Equivalent)",
+                ]
+            ):
+                num_thumbnail_frames += 1
+        logger.debug(f"number of frames: {num_frames}")
+        logger.debug(f"number of main frames: {num_main_frames}")
+        logger.debug(f"number of thumbnail frames: {num_thumbnail_frames}")
+        # this MPO file is a main image plus zero or more thumbnails
+        # embed as-is unless the --include-thumbnails option was given
+        if num_frames == 1 or (
+            not include_thumbnails
+            and num_main_frames == 1
+            and num_thumbnail_frames + 1 == num_frames
+        ):
+            color, ndpi, imgwidthpx, imgheightpx, rotation, iccp = get_imgmetadata(
+                imgdata, imgformat, default_dpi, colorspace, rawdata, rot
+            )
+            if color == Colorspace["1"]:
+                raise JpegColorspaceError("jpeg can't be monochrome")
+            if color == Colorspace["P"]:
+                raise JpegColorspaceError("jpeg can't have a color palette")
+            if color == Colorspace["RGBA"]:
+                raise JpegColorspaceError("jpeg can't have an alpha channel")
+            logger.debug("read_images() embeds an MPO verbatim")
+            cleanup()
+            return [
+                (
+                    color,
+                    ndpi,
+                    ImageFormat.JPEG,
+                    rawdata,
+                    None,
+                    imgwidthpx,
+                    imgheightpx,
+                    [],
+                    False,
+                    8,
+                    rotation,
+                    iccp,
+                )
+            ]
+        # If the control flow reaches here, the MPO has more than a single
+        # frame but was not detected to be a main image followed by multiple
+        # thumbnails. We thus treat this MPO as we do other multi-frame images
+        # and include all its frames as individual pages.
         for offset, mpent in zip(
             imgdata._MpoImageFile__mpoffsets, imgdata.mpinfo[0xB002]
         ):
@@ -2509,6 +2582,7 @@ def convert(*images, **kwargs):
         artborder=None,
         pdfa=None,
         rotation=None,
+        include_thumbnails=False,
     )
     for kwname, default in _default_kwargs.items():
         if kwname not in kwargs:
@@ -2601,6 +2675,7 @@ def convert(*images, **kwargs):
             kwargs["colorspace"],
             kwargs["first_frame_only"],
             kwargs["rotation"],
+            kwargs["include_thumbnails"],
         ):
             pagewidth, pageheight, imgwidthpdf, imgheightpdf = kwargs["layout_fun"](
                 imgwidthpx, imgheightpx, ndpi
@@ -3936,6 +4011,17 @@ RGB.""",
         "input image be converted into a page in the resulting PDF.",
     )
 
+    outargs.add_argument(
+        "--include-thumbnails",
+        action="store_true",
+        help="Some multi-frame formats like MPO carry a main image and "
+        "one or more scaled-down copies of the main image (thumbnails). "
+        "In such a case, img2pdf will only include the main image and "
+        "not create additional pages for each of the thumbnails. If this "
+        "option is set, img2pdf will instead create one page per frame and "
+        "thus store each thumbnail on its own page.",
+    )
+
     outargs.add_argument(
         "--pillow-limit-break",
         action="store_true",
@@ -4333,6 +4419,7 @@ and left/right, respectively. It is not possible to specify asymmetric borders.
             artborder=args.art_border,
             pdfa=args.pdfa,
             rotation=args.rotation,
+            include_thumbnails=args.include_thumbnails,
         )
     except Exception as e:
         logger.error("error: " + str(e))
josch closed this issue 11 months ago
Reference: josch/img2pdf#135