josch/img2pdf

Problem converting a very large number of images into a single PDF #3

Closed

opened 2021-04-25 19:57:26 +00:00 by josch · 0 comments

By josch on 2015-03-15T09:41:46.023Z

Created by: sahwar

This is an awesome program, but I'm having problems converting a very large number of images into a single PDF file. The images are about 350 in number, 2 of them are color JPGs (the front cover and the back cover), while all the other images are monochrome (black-and-white) or greyscale PNGs.

The file size of each image ranges from about 500 KB to 2 MB or more.

I think that the problem could lie in the fact that the dimensions of the images are very large (8,600x6,071 px for each of the PNGs and 2,616x3,753 px for the 2 color JPGs), or in the fact that the 2 color JPGs are smaller in terms of dimensions than the PNGs (Maybe img2pdf needs all input images to be of the same dimensions? I was left with the impression that this isn't the case).

However, I tried to convert ONLY the PNGs to a PDF and the error persisted. I then tried to convert just 1 of the color JPGs and img2pdf worked flawlessly... Weird.

Here's the error that I get:

Traceback (most recent call last):
  File "/usr/local/bin/img2pdf", line 9, in <module>
    load_entry_point('img2pdf==0.1.5', 'console_scripts', 'img2pdf')()
  File "/usr/local/lib/python2.7/dist-packages/img2pdf.py", line 402, in main
    args.verbose))
  File "/usr/local/lib/python2.7/dist-packages/img2pdf.py", line 313, in convert
    imgdata.close()
  File "/usr/lib/python2.7/dist-packages/PIL/Image.py", line 528, in __getattr__
    raise AttributeError(name)
AttributeError: close

Here's a hyperlink to the command that I use for the conversion: the paste in PasteBin.com (http://pastebin.com/BJPcMPZx).

Other things that could be causing this: maybe the fact that I haven't specified the DPI (-d) or the -x and -y options, or maybe just the naming of the files is causing trouble...

It would be nice if img2pdf's README.md were to include a mention of the theoretical limit of the number of input images that can be made into a single PDF and the limit of their filesize and image dimensions. Assuming that such limits exist, of course.

P.S. I should note that my PC is powerful enough so the problem is probably not related to insufficient hardware capabilities.

Two off-topic suggestions:
It would have been nice if img2pdf had the ability to resize (all or some of) the images to common paper sizes (like A4, etc.) for the pages of the output PDF file. The ability to make all (or some) of the pages in portrait or landscape PDF page mode is also good to have. I don't know how difficult it is to implement things like these, but they would definitely be very useful. Maybe ImageMagick could be used for this. Something similar is available in GhostScript.

Thanks in advance for your generous help!

Imported comments:

By josch on 2015-03-06 06:49:48 UTC

Hi,

Thanks for your bug report! Unfortunately I am unable to reproduce your error here.

But the problem seems to be that the PIL version you have does not offer the close() method. Can you tell me which version of PIL you have installed on your system?
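For readers hitting the same traceback: a missing close() can be tolerated defensively. The following is only a minimal sketch of that idea, not necessarily the change that was later made in img2pdf. The names `safe_close`, `OldStyleImage` and `NewStyleImage` are hypothetical stand-ins used for illustration; no real PIL objects are involved.

```python
def safe_close(obj):
    """Call obj.close() if the object offers it; report whether it was called."""
    close = getattr(obj, "close", None)  # None when the attribute is missing
    if callable(close):
        close()
        return True
    return False


class OldStyleImage(object):
    """Hypothetical stand-in for an image object that has no close()."""

    def __getattr__(self, name):
        # Mirrors the behaviour seen in the traceback: unknown names raise.
        raise AttributeError(name)


class NewStyleImage(object):
    """Hypothetical stand-in for an image object that does have close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


print(safe_close(OldStyleImage()))
print(safe_close(NewStyleImage()))
```

The same effect could be had with a try/except around the call; the getattr form just keeps the check in one place.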

Could you also copypaste your feature request to a separate bug report?

By sahwar on 2015-03-06 13:47:42 UTC

@josch
First of all, thanks for the quick reply!

Secondly, if you mean the Python Imaging Library (Pillow fork) for 'PIL', then the information
for my system is as follows:

python-imaging = 2.3.0-1ubuntu3
python-pil = 2.3.0-1ubuntu3

Thirdly, tell me if you need to know the version of any other related packages to figure out
this problem (though you're probably on the right track with your guess).

My system is Linux Mint 17 KDE (32-bit), I'm actually using a LiveDVD version and
don't have it installed to the HDD of my computer, but that shouldn't be a problem.
Also, this is the vanilla Linux Mint 17 KDE (32-bit) with just a couple of packages
manually added (like the requirements/dependencies of img2pdf).

P.S. I'll now paste the feature request to a separate bug report, thanks for pointing that out.

By josch on 2015-03-06 14:15:12 UTC

Thanks, if you use Linux Mint 17, then it's easy for me to create a chroot environment of that and do the test myself :)

By josch on 2015-03-07 02:03:55 UTC

The exception should be gone now. @sahwar: Can you confirm?

By sahwar on 2015-03-07 15:37:59 UTC

@josch
Well, the exception really is gone and after waiting 5-10 minutes for the output PDF to be completed (because the input files are huge and are many in terms of their total number), I did get an output PDF and it seems to open fine with Okular.

However, I find it strange that 341 input files (with a total filesize of 176.5 MiB) gave a 303.7 MiB output PDF. Is the extra filesize due to the PDF overhead?

There's something else odd, though it isn't directly related to img2pdf and that is the fact that KDE's Dolphin reports the filesize of the output PDF as 0 B while Okular (KDE's PDF viewer) reports the correct filesize. Weird.

And there's an additional oddity which is related to the output PDF: the first input image's page is scaled to be big while the last input image's page is shown at normal size, even though the last input image's dimensions are larger than those of the first input image. You can see the difference here (I've edited out the contents of the pages with KolourPaint because they contain private information): http://imgur.com/a/NjhOQ#0. I haven't tested the PDF with any PDF viewer other than Okular, so I don't know if that's a problem with Okular itself or if img2pdf scaled the first page incorrectly. I thought that img2pdf puts the input images in as pages without scaling them (and does so losslessly) and that this is the default behavior. Correct me if I'm wrong about that 'no scaling by default' thing.
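One thing that can produce this kind of difference is resolution metadata. PDF page sizes are expressed in points (1/72 inch), so if a converter derives the page size from the pixel dimensions and a DPI value, an image with fewer pixels can still end up on a larger page. The sketch below assumes that derivation; it is not a diagnosis of this report. The pixel dimensions come from the report above, but the DPI values are invented for illustration (the thread does not state them), and `page_size_points` is a hypothetical helper.

```python
def page_size_points(width_px, height_px, dpi):
    """Return the (width, height) in PDF points for an image at the given DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    # 72 points per inch; pixels / dpi gives inches.
    return (width_px * 72.0 / dpi, height_px * 72.0 / dpi)


# Assumed DPI values, chosen only to show the effect.
png_page = page_size_points(8600, 6071, 600)   # large scan, high DPI
jpg_page = page_size_points(2616, 3753, 72)    # smaller image, low DPI

print("PNG page (pt):", png_page)
print("JPG page (pt):", jpg_page)
```

Under these made-up DPI values the JPEG, despite having fewer pixels, gets the larger page.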

By josch on 2015-03-07 17:22:53 UTC

You can get some progress on it by processing your input images in batches or individually and then joining the individual PDF files using pdftk or other tools.
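The batch approach can be sketched as follows. This only builds the command lines; it does not run them. It assumes img2pdf's `-o` output option and pdftk's `cat output` syntax, and the part file names and the helpers `chunk` and `build_commands` are hypothetical.

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_commands(images, batch_size, final_pdf):
    """Return one img2pdf command per batch, followed by one pdftk join."""
    commands = []
    parts = []
    for n, batch in enumerate(chunk(images, batch_size)):
        part = "part-%03d.pdf" % n  # hypothetical intermediate file name
        parts.append(part)
        commands.append(["img2pdf"] + batch + ["-o", part])
    # Join the per-batch PDFs in order into the final document.
    commands.append(["pdftk"] + parts + ["cat", "output", final_pdf])
    return commands


for cmd in build_commands(["a.png", "b.png", "c.png"], 2, "book.pdf"):
    print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.check_call` to actually run the conversion, one batch at a time.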

Your output is very much bigger than the input because your input seems to be PNG images (if the file extension in the command you pasted in one of your earlier messages is an indication of the file type). To losslessly put your PNGs into the PDF, they will be unpacked into their raw data and then gzip compressed. This compression is quite a bit less efficient than what the PNG format is able to achieve on the same image data. This is why your file size explodes.
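One likely contributor to that gap is that PNG runs a per-scanline prediction filter before its deflate step, whereas plainly deflating the raw pixel data does not. The sketch below illustrates the effect on synthetic data only. It is not img2pdf's code, the image is a made-up 8-bit greyscale random walk, and `make_rows` and `sub_filter` are hypothetical helpers. A real PNG also stores a filter-type byte per row, which is omitted here.

```python
import random
import zlib

WIDTH, HEIGHT = 1024, 64  # hypothetical greyscale image, 1 byte per pixel


def make_rows(seed=0):
    """Build smooth synthetic scanlines: each pixel differs by at most 1."""
    rng = random.Random(seed)
    value = 128
    rows = []
    for _ in range(HEIGHT):
        row = bytearray()
        for _ in range(WIDTH):
            value = (value + rng.choice((-1, 0, 1))) % 256
            row.append(value)
        rows.append(bytes(row))
    return rows


def sub_filter(row):
    """PNG-style 'Sub' filter: store each byte minus its left neighbour."""
    out = bytearray(len(row))
    prev = 0
    for i, cur in enumerate(bytearray(row)):
        out[i] = (cur - prev) & 0xFF
        prev = cur
    return bytes(out)


rows = make_rows()
raw_size = len(zlib.compress(b"".join(rows), 9))
filtered_size = len(zlib.compress(b"".join(sub_filter(r) for r in rows), 9))

print("deflate on raw pixels:     ", raw_size, "bytes")
print("deflate after Sub filter:  ", filtered_size, "bytes")
```

On this kind of smooth data the filtered stream compresses noticeably better, because the differences take only a few distinct values. Conversion of greyscale input to RGB, if it happens, would add to the size as well.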

I have no idea why Dolphin reports the wrong file size. You should report a bug about that in the dolphin bug tracker.

About the scaling issue, please find some images that produce the same problem and which you are able to share. Then open a bug report with that issue, explaining in detail what input file you used, what the output is and what you expected to happen instead.

I understood that this bug report was about the Python exception you experienced. Since this is solved, I'll close this bug now.

By sahwar on 2015-03-07 17:42:11 UTC

> You can get some progress on it by processing your input images in batches or individually and then joining the individual PDF files using pdftk or other tools.

Yeah, I know, but I don't mind waiting a bit longer to get the output PDF in one go.

> Your output is very much bigger than the input because your input seems to be PNG images (if the file extension in the command you pasted in one of your earlier messages is an indication of the file type).

You're right, almost all input images that I used in this case are PNG images.

> To losslessly put your PNGs into the PDF, they will be unpacked into their raw data and then gzip compressed. This compression is quite a bit less efficient than what the PNG format is able to achieve on the same image data. This is why your file size explodes.

I see, I didn't realize that. It would be great if this were mentioned in the README.md, as it really matters to know that the output PDF will be much bigger than the combined size of the input images if you use PNGs as input. Maybe you should consider adding this explanation there. I now saw that you did say:

> If it is in any other format, the image will be included as zip-encoded RGB. As a result, this tool will be able to lossless wrap any image into a PDF container while performing better (in terms of quality/filesize ratio) than existing tools in case the input image is a JPEG or JPEG2000 file.

but it would be nice to explicitly say it in a simpler way for us non-programmers.

> I have no idea why Dolphin reports the wrong file size. You should report a bug about that in the dolphin bug tracker.

I closed Dolphin and reopened it and this bug is now gone so I won't bother reporting it for just this one case since Dolphin does display the filesize correctly after reloading the folder where the output PDF was...

> About the scaling issue, please find some images that produce the same problem and which you are able to share. Then open a bug report with that issue, explaining in detail what input file you used, what the output is and what you expected to happen instead.

OK, thanks, I'll do that soon.

> I understood that this bug report was about the Python exception you experienced. Since this is solved, I'll close this bug now.

Yes, this bug is closed. Thanks for the fix that resolved the issue!

P.S. It's a pleasure to work with you, @josch, cheers!

By josch on 2015-03-08 06:41:55 UTC

I say multiple times in the README that everything that is not JPEG will be zip encoded:

> If it is in any other format, the image will be included as zip-encoded RGB

or

> will save all other graphics formats using lossless zip-compression.

or

> For lossless conversion of other formats than JPEG or JPEG2000 files, zip/flate encoding is used.

So what information is missing?

I expect people to stumble across img2pdf after they have used any of the million other tools that can convert images to PDF, like ImageMagick, GIMP, Photoshop, Ghostscript, LaTeX, LibreOffice or Microsoft Office, and after they have noticed that the resulting PDF files are either huge in size or small but have lost quality.

You will then see that the README mainly talks about JPEG images as these are the kind that can be embedded into PDF without increase of file size or loss of quality.

If your goal is lossless embedding of images into PDF, then img2pdf is exactly for you. If it isn't, then just use any of the existing image-to-PDF converters, which will do a lossy conversion of your input images to JPEG and thus save you tons of space.

By sahwar on 2015-03-08 15:49:36 UTC

> If your goal is lossless embedding of images into PDF, then img2pdf is exactly for you. If it isn't, then just use any of the existing image-to-PDF converters, which will do a lossy conversion of your input images to JPEG and thus save you tons of space.

Yes, that's my goal and I do know that this is what img2pdf does. Sorry if I didn't express myself clearly enough.
