Multi Picture jpegs are erroneously converted to multi-paged pdf #123
What I have
I have an image which seems to be a normal one-paged jpeg file, but under the hood, it contains a thumbnail. The image is attached.
What I want
I want to create a pdf file with only one page, which contains my image. My Python code looks as follows:
What I get
When I convert the image into a pdf, I get 2 pages: first the thumbnail, then the actual image. The pdf is attached. This behaviour is problematic for me, and I cannot find out how to get rid of the thumbnail.
I know that I could delete the pdf pages by hand, but it must be a scalable solution, because I have a lot of files.
What I tried to solve the problem
With exiftool, I analysed the jpeg and then tried to extract the thumbnail. It worked, but I don't know how to extract the actual image. Below is what I did:
References:
https://exiftool.org/forum/index.php?topic=6328.0
Suggestion how to improve img2pdf
http://fileformats.archiveteam.org/wiki/Multi-Picture_Format says: "MP files are valid JPEG files, except that they have additional data after the EOI marker."
So I suggest that img2pdf ignores everything after the EOI marker.
In the meantime, does anyone have a workaround for me?
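To make the "ignore everything after the EOI marker" idea concrete, here is a minimal sketch in plain Python. It is not how img2pdf works; the names first_jpeg_frame and _skip_scan are made up for illustration. It walks the JPEG segments instead of searching for the first FF D9 byte pair, because an EXIF thumbnail embedded in an APP1 segment carries its own EOI marker. It does not rewrite the MP index stored in the first frame, so whether a given reader then treats the result as a single-frame file is an open assumption.

```python
def _skip_scan(data: bytes, pos: int) -> int:
    """Skip entropy-coded scan data; return the offset of the next real marker."""
    while True:
        pos = data.find(b"\xff", pos)
        if pos < 0 or pos + 1 >= len(data):
            return len(data)  # ran off the end without finding a marker
        nxt = data[pos + 1]
        if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
            pos += 2  # stuffed 0xFF byte or restart marker: still scan data
        elif nxt == 0xFF:
            pos += 1  # fill byte before a marker
        else:
            return pos


def first_jpeg_frame(data: bytes) -> bytes:
    """Return the bytes from SOI up to and including the first top-level EOI."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        if data[pos + 1] == 0xFF:
            pos += 1  # fill byte, keep looking for the marker code
            continue
        marker = data[pos + 1]
        pos += 2
        if marker == 0xD9:  # EOI: the first image ends here
            return data[:pos]
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:
            continue  # standalone markers carry no length field
        length = int.from_bytes(data[pos:pos + 2], "big")
        if length < 2:
            raise ValueError(f"invalid segment length at offset {pos}")
        pos += length  # skip the segment, including any embedded thumbnail
        if marker == 0xDA:  # SOS: compressed data follows the scan header
            pos = _skip_scan(data, pos)
    raise ValueError("truncated JPEG: no EOI marker found")
```

A possible use would be to read each file as bytes, pass it through first_jpeg_frame, and hand the result to the converter, which avoids touching the files on disk.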
From the --help output:
If you use img2pdf via Python, you might try passing first_frame_only=True to convert().
Does that solve your problem?
I have already come across some MPO images, but for me it was the other way round: first the actual image, then the thumbnail. In this case one can simply do img2pdf.convert(..., first_frame_only=True).
Until the img2pdf maintainer has decided about your case, I would suggest using pikepdf to delete the superfluous thumbnail pages. However, this may be less efficient because img2pdf wastes time converting images you don't actually need.
(Note that a few older versions of pikepdf are unable to delete pages due to an issue with qpdf, so I recommend using the latest pikepdf release.)
@josch Whoops, I didn't see your comment before posting.
I think first_frame_only doesn't help @johannesnussbaum because apparently his desired image is on the second frame.
What about an --include-frames option followed by a comma-separated list of frames to convert (and include_frames: Sequence[int] = ... for the library interface)?
I do not see an attached image, so I cannot check. But @johannesnussbaum also writes "So I suggest that img2pdf ignores everything after the EOI marker.", which suggests to me that only the first frame is desired.
In all MPO files that I've seen so far, the actual image comes first and only then comes the thumbnail, so that MPO-unaware applications show the full image instead of the thumbnail. If it's really the other way round in this case, then another solution might be to fix the software that created these images or to write a script that fixes them.
Before I add another option, I want to know whether those MPO files were produced by buggy software or not. There is no sense in adding CLI options just to fix another broken piece of software.
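For the sake of discussion, this is roughly what the proposed --include-frames option could look like on the parsing side. It is a sketch of a hypothetical option that does not exist in img2pdf; the helper names parse_frames and select_frames, the zero-based indexing, and the demo parser are all assumptions.

```python
import argparse
from typing import List, Sequence


def parse_frames(value: str) -> List[int]:
    """Turn a string like '0,2' into [0, 2]; reject anything else."""
    frames = []
    for part in value.split(","):
        try:
            index = int(part.strip())
        except ValueError:
            raise argparse.ArgumentTypeError(
                f"invalid frame index: {part!r}"
            ) from None
        if index < 0:
            raise argparse.ArgumentTypeError(
                f"frame index must not be negative: {index}"
            )
        frames.append(index)
    return frames


def select_frames(frames: Sequence, include: Sequence[int]) -> list:
    """Keep only the requested frames (zero-based, assumed convention)."""
    return [frames[i] for i in include]  # IndexError if a frame is missing


def build_parser() -> argparse.ArgumentParser:
    # Stand-alone demo parser, not img2pdf's real command line.
    parser = argparse.ArgumentParser(prog="frames-demo")
    parser.add_argument(
        "--include-frames",
        type=parse_frames,
        default=None,
        metavar="N[,N...]",
        help="comma-separated list of frames to convert (hypothetical)",
    )
    return parser


args = build_parser().parse_args(["--include-frames", "0,2"])
picked = select_frames(["frame0", "frame1", "frame2"], args.include_frames)
```

With this shape, the behaviour of --first-frame-only would correspond to --include-frames 0.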
I understand you want to be careful not to make img2pdf bloated, but I think this option would be useful anyway, for example for multi-frame GIFs where the user might only want to convert certain frames. Moreover, --include-frames could just replace --first-frame-only.
Thank you very much for your help, @josch and @mara0004!
--first-frame-only actually solves my problem, because the thumbnail is in the second frame. I didn't realise that because the thumbnail appears bigger in the pdf, although it has a smaller resolution.
I would suggest changing img2pdf's default for JPEGs to take only the first frame, because 99% of users don't know that JPEGs can have frames, so they will be very surprised to see a resulting PDF with more than one page.
That's good news :)
Yes, that's because img2pdf respects the dpi value in the images.
I think that's a bad idea:
[…] --first-frame-only and which ones do not and then pass the options accordingly?
[…] --first-frame-only then we need an option to negate that, like --no-first-frame-only
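On the dpi remark above, a small worked example may help show why a lower-resolution thumbnail can end up on a larger page. It assumes the usual conversion of 72 PDF points per inch; the pixel sizes and dpi values are invented for illustration and are not taken from the attached file, and page_size_pt is a made-up helper name.

```python
def page_size_pt(width_px: int, height_px: int, dpi: float) -> tuple:
    """Page size in PDF points when the image is placed at its stated dpi."""
    # 1 inch = 72 pt, so points = pixels * 72 / dpi
    return (width_px * 72 / dpi, height_px * 72 / dpi)


# Hypothetical full image: many pixels, but tagged with a high dpi value.
full = page_size_pt(4000, 3000, 300)    # (960.0, 720.0)

# Hypothetical thumbnail: fewer pixels, but tagged with a low dpi value.
thumb = page_size_pt(1600, 1200, 72)    # (1600.0, 1200.0)

thumbnail_page_is_larger = thumb[0] > full[0] and thumb[1] > full[1]
```

So the page size says nothing about which frame has more pixels; it only reflects pixels divided by the dpi stored in each frame.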
Thanks for your thoughts, @josch. I understand your considerations and respect them. Now that I'm informed about multiframed JPEGs, it won't be a problem for me in the future.
The 99% is an opinion I'm quite sure of, but I understand that you'll never be able to satisfy everyone. And of course you prefer consistency with the other file formats.
You can close the issue now, and thanks again for your help.
Sure thing! Don't hesitate to file another issue if you run into any other problem. :)