Problems converting jpeg with embedded thumbnail #93

Closed
opened 3 years ago by josch · 0 comments
josch commented 3 years ago
Owner

By m-holger on 2021-04-06T09:54:34.769Z

When converting this 5.7MB jpeg (https://angelika-schwarz.net/nextcloud/index.php/s/ZAmq954deBPNKg6/download), img2pdf
creates a 20.5MB pdf and produces the following diagnostics:

DEBUG:root:imgformat = other
DEBUG:root:Converting frame: 0
DEBUG:root:input dpi = 350 x 350
DEBUG:root:rotation = 0°
DEBUG:root:input colorspace = RGB
DEBUG:root:width x height = 4912px x 3264px
DEBUG:root:Colorspace is OK: Colorspace.RGB
DEBUG:root:read_images() encoded an image as PNG
DEBUG:root:Converting frame: 1
DEBUG:root:input dpi = 350 x 350
DEBUG:root:rotation = 0°
DEBUG:root:input colorspace = RGB
DEBUG:root:width x height = 1616px x 1080px
DEBUG:root:Colorspace is OK: Colorspace.RGB
DEBUG:root:read_images() encoded an image as PNG
DEBUG:PIL.Image:Error closing: 'NoneType' object has no attribute 'close'

Originally reported at https://github.com/pdfarranger/pdfarranger/issues/457
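
The size increase is consistent with the diagnostics: both frames were re-encoded as PNG, which is lossless, instead of being embedded as the original JPEG data. A rough sketch of the arithmetic follows, assuming 8-bit RGB (3 bytes per pixel) for both frames. The helper name `raw_rgb_bytes` is made up for illustration and is not part of img2pdf.

```python
def raw_rgb_bytes(width, height, channels=3):
    """Uncompressed size in bytes of an 8-bit image with these dimensions."""
    return width * height * channels


# Frame dimensions taken from the debug output above.
frames = [(4912, 3264), (1616, 1080)]
total = sum(raw_rgb_bytes(w, h) for w, h in frames)
print(total)  # 53334144 bytes of raw pixel data before PNG compression
```

About 53 MB of raw pixels compressing losslessly to roughly 20 MB is plausible for a photograph, whereas the lossy JPEG source needs only 5.7MB.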


By josch on 2021-04-06T11:04:01.281Z


Awesome, thank you for this test case! This is indeed a type of image I have never seen before. PIL identifies it as MPO: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#mpo

img2pdf needs to add support for this kind of image.


By josch on 2021-04-12T22:01:25.351Z


Status changed to closed by commit d29c596fe7


By m-holger on 2021-04-13T08:52:20.936Z


@josch Thanks, that works.

One thought. Would it make sense to only extract the main image? Or at least exclude any preview images?

I assume this will go into 0.4.1. Any idea when it will be released?


By josch on 2021-04-13T09:23:02.156Z


As far as PIL is concerned, MPO files are more similar to a GIF animation than to a JPEG, and in the case of a GIF animation, img2pdf turns every frame into its own pdf page. This makes sense because img2pdf tries to be lossless, so by default it will not omit pixel data contained in the image.

Note that an MPO image is not the same as a JPEG with a thumbnail in its exif data. For JPEG images with exif thumbnails, img2pdf already does not add an extra page for the thumbnail. Some camera manufacturers apparently abuse the MPO format for thumbnails. If you don't like that, complain to the camera manufacturer that they should use exif thumbnails as intended.

Now for your specific problem, the --first-frame-only option might help. Here is the relevant part from the --help output:

  --first-frame-only    By default, img2pdf will convert multi-frame images
                        like multi-page TIFF or animated GIF images to one
                        page per frame. This option will only let the first
                        frame of every multi-frame input image be converted
                        into a page in the resulting PDF.

MPO files are multi-frame images, so with this option you will only see the first frame (aka the main image) in the resulting pdf.
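
The difference between the default and `--first-frame-only` can be modelled with a small sketch. This is not img2pdf's actual implementation; it only assumes a Pillow-style image object with `seek()`, `size` and (for multi-frame formats) `n_frames`. The helper names `frames_to_convert` and `describe` are hypothetical.

```python
def frames_to_convert(img, first_frame_only=False):
    """Yield (index, size) for each frame that would become a PDF page.

    `img` is any object with Pillow-style `seek()` and `size`; objects
    without an `n_frames` attribute are treated as single-frame images.
    """
    n_frames = getattr(img, "n_frames", 1)
    if first_frame_only:
        n_frames = min(n_frames, 1)
    for index in range(n_frames):
        img.seek(index)  # make frame `index` the current frame
        yield index, img.size


def describe(path, first_frame_only=False):
    """Open `path` with Pillow and list the frames that would be converted."""
    from PIL import Image  # third-party, imported lazily: pip install Pillow

    with Image.open(path) as img:
        return img.format, list(frames_to_convert(img, first_frame_only))
```

For a file like the one in this report, `describe()` would be expected to report the format as MPO with two frames by default, and a single frame when `first_frame_only=True`.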


By m-holger on 2021-04-13T10:52:31.778Z


img2pdf is used as a library by PDF Arranger to import images. I don't think having the odd extra preview image imported is a problem, and in this case using the first-frame-only option probably would cause more problems than benefits.

Could you give me an idea as to when you will release a version of img2pdf incorporating the fix?

As to my query regarding suppressing the preview, I have no opinion as to whether this is desirable as my knowledge of jpegs and exif would easily fit on the back of a stamp. I only raised the point because it seemed to me (perhaps wrongly) that the 1616x1080 image was flagged as a (large) thumbnail in the exif data:

File Type                       : JPEG
File Type Extension             : jpg
MIME Type                       : image/jpeg
...
Preview Image Size              : 1616x1080
...
MPF Version                     : 0100
Number Of Images                : 2
MP Image Flags                  : Dependent child image
MP Image Format                 : JPEG
MP Image Type                   : Large Thumbnail (full HD equivalent)
MP Image Length                 : 697614
MP Image Start                  : 4938752



By josch on 2021-04-13T12:15:04.424Z


That is correct. It is flagged as a large thumbnail. The problem is that we need to solve the general case and not just the case of this particular image. An MPO file can contain any number of images. What if there are multiple "Large Thumbnail (full HD equivalent)"? Do we omit them all? How do we know which thumbnail belongs to which full image? What if the user wants the image marked as thumbnail? We could add tons more options to img2pdf to deal with all the particularities of the MPO format, but there are other tools that already do this much better. The unix philosophy is to have tools that each do one thing and do that one thing well. If somebody has an MPO file and some special use-case for it in mind, then there exist much better tools to extract exactly the one single image the user wants from the MPO and then pass the result to img2pdf. If we add options exclusively for MPO-specific stuff, then we might as well ask ourselves why we do not do that for other formats as well. But img2pdf is not an image manipulation tool; it just converts images to pdf. If you want to convert a different image to pdf, then use an image manipulation tool before handing the image to img2pdf.
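
A caller that does want such filtering can apply its own policy before handing anything to img2pdf. The following is a minimal sketch of one possible policy, assuming the per-frame metadata has already been parsed into dicts keyed like the tool output quoted above. The function name `select_frame_indices` and the "Baseline MP Primary Image" label for the main frame are assumptions for illustration, not something img2pdf provides.

```python
def select_frame_indices(entries, keep_thumbnails=False):
    """Return the indices of MPO frames a caller might pass on to img2pdf.

    `entries` holds one dict per frame, in file order, each with an
    "MP Image Type" key. Frames whose type mentions "thumbnail" are
    dropped unless `keep_thumbnails` is true.
    """
    selected = []
    for index, entry in enumerate(entries):
        is_thumbnail = "thumbnail" in entry.get("MP Image Type", "").lower()
        if keep_thumbnails or not is_thumbnail:
            selected.append(index)
    return selected


entries = [
    {"MP Image Type": "Baseline MP Primary Image"},  # assumed label for the main image
    {"MP Image Type": "Large Thumbnail (full HD equivalent)"},  # from the output above
]
print(select_frame_indices(entries))  # [0]
```

Even this small policy has to pick an answer to the questions raised above (drop every thumbnail, or keep them on request), which is why it fits better in the calling application than in img2pdf.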

I plan to make a new release in the next few days.


By m-holger on 2021-04-13T12:34:23.224Z


> I plan to make a new release in the next few days.

Thanks

As for the other matter, as I said, I do not have a view on this and I was not trying to change your mind. Sorry if it came across like this.


By josch on 2021-04-13T16:21:11.283Z


Even if you were trying to change my mind, that would be a good thing, because others may see things that I do not, and I'm interested in hearing their arguments. Please don't worry, you never came across as bothersome or impolite to me. Thank you for your bug report!

josch closed this issue 3 years ago
Reference: josch/img2pdf#93