Filesize #37

Closed

opened 2021-04-25 19:57:58 +00:00 by josch · 0 comments

josch commented

2021-04-25 19:57:58 +00:00

Owner

By Francisco on 2017-10-19T00:09:00.806Z

I am trying to write a tool that gets the images from a Wikimedia Commons category and creates a PDF from them. We are probably going to have a lot of these categories, as we are uploading images from an archive that digitized them in JPEG format.

It is very important to keep the quality of the original images, so I discarded a few approaches that would re-encode the images before the conversion.

Now, this approach works, but it gives me a huge file size. For the first PDF I have 39 files that amount to a total of 21.5 MB. My converted PDF came to 901.6 MB. Is this expected?

The code I am using is this:

with open(cat.title().replace(' pages', '.pdf'), 'wb') as pdf_file:
    for page in pages_list:
        pdf_file.write(img2pdf.convert(pages_list))

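You can check the files (http://paws-public.wmflabs.org/paws-public/12256150/Category%3ABrazilian%20Constitution%20of%201891%20pages/Category%3ABrazilian%20Constitution%20of%201891%20pages/) and the code in context (http://paws-public.wmflabs.org/paws-public/12256150/Category%20to%20pdf.ipynb) if necessary.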

By josch on 2017-10-19T04:27:22.254Z

Your code doesn't look right. You loop over pages_list, but each iteration converts the entire list rather than a single page. So if pages_list contains these 39 entries totalling 21.5 MiB, then you are writing 21.5 MiB 39 times, for a total of 838.5 MiB, which in turn explains the 901.6 MiB that you see. Instead you probably want:

for page in pages_list:
    pdf_file.write(img2pdf.convert([page]))

or

pdf_file.write(img2pdf.convert(pages_list))
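
The size blow-up described above can be reproduced without any real images. The following is a minimal sketch, not code from the thread: `write_looped`, `write_once`, and `fake_convert` are hypothetical names, and `fake_convert` is only a stand-in for `img2pdf.convert` that concatenates its inputs, so that the output size is easy to reason about.

```python
import io


def write_looped(pages_list, pdf_file, convert):
    """Mirror the loop from the question: the whole list is converted once per page."""
    for page in pages_list:
        pdf_file.write(convert(pages_list))


def write_once(pages_list, pdf_file, convert):
    """Convert the whole list in a single call and write the result once."""
    pdf_file.write(convert(pages_list))


def fake_convert(pages):
    """Hypothetical stand-in for img2pdf.convert: concatenates the input bytes."""
    return b"".join(pages)


# 39 dummy "images" of 1000 bytes each (assumed sizes, for illustration only)
pages = [bytes(1000) for _ in range(39)]

looped, once = io.BytesIO(), io.BytesIO()
write_looped(pages, looped, fake_convert)
write_once(pages, once, fake_convert)

looped_size = len(looped.getvalue())
once_size = len(once.getvalue())

# The looped variant is larger by a factor equal to the number of pages.
print(looped_size, once_size, looped_size // once_size)
```

With the real library, one would presumably pass `img2pdf.convert` as the `convert` argument and a file opened in `'wb'` mode as `pdf_file`. The single-call form hands all images to one `convert` call, which produces one PDF containing all pages.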

By Francisco on 2017-10-19T10:50:32.394Z

I feel silly now. Thank you.

By Francisco on 2017-10-19T10:50:32.731Z

Status changed to closed
