Multi Picture jpegs are erroneously converted to multi-paged pdf #123

Closed
opened 3 years ago by johannesnussbaum · 11 comments

What I have

I have an image which seems to be a normal one-paged jpeg file, but under the hood, it contains a thumbnail. The image is attached.

What I want

I want to create a pdf file with only one page, which contains my image. My Python code looks as follows:

import img2pdf
filepath = '(path)/DSC00059.JPG'
with open(filepath + '.pdf', 'w+b') as f:
    f.write(img2pdf.convert(filepath))

What I get

When I convert the image into a pdf, I get 2 pages: first the thumbnail, then the actual image. The pdf is attached. This behaviour is problematic for me, and I cannot find out how to get rid of the thumbnail.
I know that I could delete the pdf pages by hand, but I need a scalable solution because I have a lot of files.

What I tried to solve the problem

With exiftool, I analysed the jpeg and then tried to extract the thumbnail. It worked, but I don't know how to extract the actual image. Below is what I did:

exiftool '(path)/DSC00059.JPG'
(...)
ThumbnailOffset                 : 38996
ThumbnailLength                 : 6189
MPFVersion                      : 0100
NumberOfImages                  : 2
MPImageFlags                    : Dependent child image
MPImageFormat                   : JPEG
MPImageType                     : Large Thumbnail (full HD equivalent)
MPImageLength                   : 601248
MPImageStart                    : 3863040
DependentImage1EntryNumber      : 0
DependentImage2EntryNumber      : 0
ImageWidth                      : 4896
ImageHeight                     : 2752
(...)
ThumbnailImage                  : (Binary data 6189 bytes, use -b option to extract)
PreviewImage                    : (Binary data 601248 bytes, use -b option to extract)  


exiftool -PreviewImage -b '(path)/DSC00059.JPG' > '(path)/DSC00059 (PreviewImage).JPG'

References:
https://exiftool.org/forum/index.php?topic=6328.0

Suggestion how to improve img2pdf

http://fileformats.archiveteam.org/wiki/Multi-Picture_Format says: "MP files are valid JPEG files, except that they have additional data after the EOI marker."

So I suggest that img2pdf ignores everything after the EOI marker.

In the meantime, does anyone have a workaround for me?
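
Before converting a large batch, it can help to know which files are affected. The following is a minimal sketch, not part of img2pdf: it walks the JPEG header segments and reports whether an APP2 segment carrying the `MPF\0` identifier is present, which is how the Multi-Picture Format announces itself as far as I understand the format. The function name `has_mpf_segment` is made up for this example, and the parser deliberately handles only the common header layout.

```python
import struct


def has_mpf_segment(jpeg_bytes: bytes) -> bool:
    """Return True if the JPEG header has an APP2 segment with the MPF identifier."""
    if jpeg_bytes[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return False
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            return False  # not positioned at a marker: give up (sketch only)
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:  # SOS: compressed image data follows, header is over
            return False
        # The segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2 : pos + 4])
        payload = jpeg_bytes[pos + 4 : pos + 2 + length]
        if marker == 0xE2 and payload.startswith(b"MPF\x00"):
            return True
        pos += 2 + length
    return False
```

Reading the file with `open(path, 'rb').read()` and passing the bytes to this function would flag candidates; it says nothing about the order of the embedded pictures.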

josch commented 3 years ago
Owner

From the --help output:

  --first-frame-only    By default, img2pdf will convert multi-frame images
                        like multi-page TIFF or animated GIF images to one
                        page per frame. This option will only let the first
                        frame of every multi-frame input image be converted
                        into a page in the resulting PDF.

If you use img2pdf via Python, you might try passing first_frame_only=True to convert().

Does that solve your problem?
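
For a whole folder of files, the suggestion above could be applied roughly as follows. This is only a sketch: the helper names (`pdf_path_for`, `convert_first_frame`, `convert_folder`) and the output naming scheme are assumptions made for the example, and the only img2pdf feature used is the `first_frame_only=True` keyword mentioned above.

```python
from pathlib import Path


def pdf_path_for(image_path: Path) -> Path:
    """Hypothetical naming scheme: DSC00059.JPG -> DSC00059.pdf in the same folder."""
    return image_path.with_suffix(".pdf")


def convert_first_frame(image_path: Path) -> Path:
    """Write a PDF that contains only the first frame of image_path."""
    import img2pdf  # imported here so pdf_path_for works without img2pdf installed

    out_path = pdf_path_for(image_path)
    # first_frame_only=True is the library counterpart of --first-frame-only.
    pdf_bytes = img2pdf.convert(str(image_path), first_frame_only=True)
    out_path.write_bytes(pdf_bytes)
    return out_path


def convert_folder(folder: Path) -> list[Path]:
    """Convert every .jpg/.JPG file directly inside folder, one PDF per image."""
    images = sorted(p for p in folder.iterdir() if p.suffix.lower() == ".jpg")
    return [convert_first_frame(p) for p in images]
```

Calling `convert_folder(Path('(path)'))` would then produce one single-page PDF per JPEG, provided the full image really is the first frame in each file.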


> When I convert the image into a pdf, I get 2 pages: first the thumbnail, then the actual image.

I already had some MPO images but for me it was the other way round: first the actual image, then the thumbnail. In this case one can simply do img2pdf.convert(..., first_frame_only=True).

Until the img2pdf maintainer has decided about your case, I would suggest using pikepdf to delete the superfluous thumbnail pages. However, this may be less efficient because img2pdf wastes time converting images you don't actually need.

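import io
import img2pdf
import pikepdf

def image_to_pdf(image_file):
    # Convert into an in-memory buffer so the PDF can be edited before saving.
    outbytes = io.BytesIO()
    img2pdf.convert(
        image_file,
        outputstream=outbytes,
        engine=img2pdf.Engine.pikepdf,
        rotation=img2pdf.Rotation.ifvalid,
    )
    outbytes.seek(0)
    pdf = pikepdf.Pdf.open(outbytes)
    # Return the buffer as well: it has to stay open while the PDF is in use.
    return pdf, outbytes

pdf, outbytes = image_to_pdf("/path/to/your/mpo.jpeg")
# Drop the first page (the thumbnail, if it comes first in the file).
del pdf.pages[0]

pdf.save("/path/to/your/output.pdf")
outbytes.close()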

(Note that a few older versions of pikepdf are unable to delete pages due to an issue with qpdf, so I recommend using the latest pikepdf release.)


@josch Whoops, I didn't see your comment before posting.

I think first_frame_only doesn't help @johannesnussbaum because apparently his desired image is on the second frame.


What about an --include-frames option followed by a comma-separated list of frames to convert (and include_frames: Sequence[int] = ... for the library interface)?
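
To make the proposal concrete, here is a rough sketch of how such an option might be parsed and applied. Everything in it is hypothetical: img2pdf has no `--include-frames` option at the time of this discussion, the function names are invented, and zero-based frame indices are an assumption.

```python
from typing import Sequence


def parse_include_frames(spec: str) -> list[int]:
    """Parse a comma-separated frame list such as "0,2,5" into [0, 2, 5]."""
    frames = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "0,,2"
        index = int(part)  # raises ValueError on non-numeric input
        if index < 0:
            raise ValueError(f"frame index must be non-negative: {index}")
        frames.append(index)
    return frames


def select_frames(frames: Sequence, include_frames: Sequence[int]) -> list:
    """Return the requested frames in the requested order."""
    total = len(frames)
    for index in include_frames:
        if index >= total:
            raise IndexError(f"frame {index} requested, only {total} available")
    return [frames[i] for i in include_frames]
```

With this shape, `--first-frame-only` would be equivalent to passing `0` as the only frame index.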

josch commented 3 years ago
Owner

> I think first_frame_only doesn't help @johannesnussbaum because apparently his desired image is on the second frame.

I do not see an attached image, so I cannot check. But @johannesnussbaum also writes "So I suggest that img2pdf ignores everything after the EOI marker." which suggests to me that only the first frame is desired.

In all MPO files that I've seen so far, the actual image comes first and only then comes the thumbnail, so that MPO-unaware applications show the full image instead of the thumbnail. If it's really the other way round in this case, then another solution might be to fix the software that created these images or to write a script that fixes them.

> What about an --include-frames option followed by a comma-separated list of frames to convert (and include_frames: Sequence[int] = ... for the library interface)?

Before I add another option, I want to know whether those MPO files were produced by buggy software or not. There is no sense in adding CLI options just to fix another broken piece of software.


> Before I add another option, I want to know whether those MPO files were produced by buggy software or not. There is no sense in adding CLI options just to fix another broken piece of software.

I understand you want to be careful not to make img2pdf bloated, but I think this option would be useful anyway, like for multi-frame GIFs where the user might only want to convert certain frames. Moreover, --include-frames could just replace --first-frame-only.

Poster

Thank you very much for your help, @josch and @mara0004!

  • Sorry that the attached files were lost. I uploaded them, but somehow they disappeared.
  • --first-frame-only actually solves my problem, because the thumbnail is in the second frame. I didn't realise that, because the thumbnail appears bigger in the pdf, but it has a smaller resolution.
  • I suggest changing img2pdf's default for JPEGs to take only the first frame, because 99% of users don't know that JPEGs can have frames, so they will be very surprised to see the resulting PDF with more than 1 page.
  • It might make sense to leave the default for GIFs, TIFFs and so on. In those cases, users are more likely to know about multi-page/multi-frame files.

> --first-frame-only actually solves my problem, because the thumbnail is in the second frame. I didn't realise that, because the thumbnail appears bigger in the pdf, but it has a smaller resolution.

That's good news :)

josch commented 3 years ago
Owner
> • --first-frame-only actually solves my problem, because the thumbnail is in the second frame. I didn't realise that, because the thumbnail appears bigger in the pdf, but it has a smaller resolution.

Yes, that's because img2pdf respects the dpi value in the images.
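
The effect can be illustrated with a little arithmetic: a PDF point is 1/72 inch, so a page sized from the image's dpi measures `pixels * 72 / dpi` points per side. The sketch below is not img2pdf's code. The full-image dimensions come from the exiftool output earlier in the thread, while the thumbnail dimensions and both dpi values are assumptions chosen only to show how a lower-resolution frame can end up on a larger page.

```python
def page_size_pt(width_px: int, height_px: int, dpi: float) -> tuple[float, float]:
    """Page size in PDF points (1 pt = 1/72 inch) for an image at a given dpi."""
    return (width_px * 72 / dpi, height_px * 72 / dpi)


# 4896x2752 is taken from the exiftool output; 350 dpi is an assumed value.
full = page_size_pt(4896, 2752, dpi=350)
# 1920x1080 ("full HD equivalent") and 72 dpi are assumed values as well.
thumb = page_size_pt(1920, 1080, dpi=72)

print("full image page:", full)
print("thumbnail page: ", thumb)
```

Under these assumed values the thumbnail page comes out wider and taller than the page for the full image, even though it holds far fewer pixels.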

> • I suggest changing img2pdf's default for JPEGs to take only the first frame, because 99% of users don't know that JPEGs can have frames, so they will be very surprised to see the resulting PDF with more than 1 page.
> • It might make sense to leave the default for GIFs, TIFFs and so on. In those cases, users are more likely to know about multi-page/multi-frame files.

I think that's a bad idea:

  • img2pdf supports over 40 different image formats as input. Is the user supposed to look up which of those formats default to --first-frame-only and which ones do not and then pass the options accordingly?
  • if some formats default to --first-frame-only then we need an option to negate that, like --no-first-frame-only
  • what if the user runs img2pdf on a whole directory of files and expects the same behaviour for all its input?
  • do you have a source for the 99% above or is that just your opinion? I get a lot of mail from people claiming opposite things to be "normal" for them. No matter what I implement, there will be people who dislike that default.
  • defaulting to "put all images into the PDF" aligns well with the "always lossless" idea of img2pdf
Poster

Thanks for your thoughts, @josch. I understand your considerations and respect them. Now that I'm informed about multiframed JPEGs, it won't be a problem for me in the future.

The 99% is an opinion I'm quite sure of, but I understand that you'll never be able to satisfy everyone. And of course you prefer consistency with the other file formats.

You can close the issue now, and thanks again for your help.

josch commented 3 years ago
Owner

Sure thing! Don't hesitate to file another issue if you run into any other problem. :)

josch closed this issue 3 years ago
Reference: josch/img2pdf#123