Re-evaluating Default DPI and handling DPI-less source images #189

Open
opened 3 months ago by F04 · 5 comments
F04 commented 3 months ago

I write a lot of little programs and the last thing I want is for them to not work by default, or not do what's intended quickly and with the fewest arguments given.

And I know I'm some stranger coming out of nowhere, but I think it might be better to not have a default dpi, than to have a default dpi.

Or alternately (assuming the goal is to produce an accurate pdf) img2pdf should generate a clear visible warning on stderr when used on source files missing dpi information because:

Quietly using a hard-coded default dpi can create unexpected bogus document dimensions in the output file.

Working with inaccurate data coming in is unfortunately common, even expected. But allowing inaccurate data out, without even a warning, feels wrong.

I have a basically modern scanner that scans at 300dpi. I convert to pdf with img2pdf and all of a sudden the dimensions are roughly 3x larger than they should be. weird.... but sort of okay... whatever... I can zoom out, scale on print. I'm still super thankful this tool exists at all Josch!

I realize that if the source image doesn't specify a dpi, it may be necessary to use SOMETHING to do the conversion to pdf. But if that something is picked out of the air, it is most likely not correct. Right? So why are we doing that?

e.g.: DPIs range from 50ish to 1200-ish today, so the chance that 96 is correct is about 1/1150. Maybe better than that, because it sounds like a common-ish number. But scanners are usually 150, 300, 600 or 1200 and I assume img2pdf is often used with scanned images, but perhaps that's my confirmation bias. Presumably the optical element has 1200 sensors per inch and the common options wisely halve that to avoid interpolation. 96, 48 and 32 sound like icon sizes to me, or like sizes for computer generated content.

So I propose that when no dpi is specified and the source image lacks that information, img2pdf should:

  1. either not produce output and exit nonzero, or
  2. produce the output file as it does currently (because changing behavior of a program used by many, sucks. And I know it really, really, feels ugly to exit nonzero and not make a pdf) but sternly warn the user on stderr that a) dimensions/dpi could not be detected in the source image b) 96 was chosen by default c) the output dimensions may be wrong

Note on #2a: It could be slightly more helpful to tell the user PNM files don't SUPPORT dpi. vs JPG files support it but your files don't have it. (mine don't)

One other idea. Maybe it's possible to produce output pdfs that lack any DPI or size information. That might be slightly better than producing output pdfs that contain wrong information. It would cause less mystery to the user "Where in the heck is 96 coming from?" And it's probably better if whatever consumes the PDFs knows "there is no DPI" vs "this pdf is certain that it's nine feet long"
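
The warning proposed in option 2 could look roughly like the sketch below. This is not img2pdf's actual code: `resolve_dpi`, its parameters, and the `metadata` dict with an optional `"dpi"` entry are all hypothetical names made up for illustration. The 96 default is the value mentioned in this thread.

```python
import sys

DEFAULT_DPI = 96.0  # the default discussed in this thread


def resolve_dpi(metadata, filename, format_supports_dpi=True, stream=sys.stderr):
    """Return (x_dpi, y_dpi); warn on `stream` when falling back to the default.

    `metadata` is a hypothetical dict that may hold a "dpi" tuple.
    """
    dpi = metadata.get("dpi")
    if dpi is not None:
        x, y = dpi
        if x > 0 and y > 0:
            return (x, y)
        reason = f"has broken dpi values {dpi!r}"
    elif format_supports_dpi:
        reason = "has no dpi information"
    else:
        # e.g. PNM, which stores nothing but pixel data
        reason = "is in a format that cannot store dpi information"
    print(
        f"warning: {filename} {reason}; assuming {DEFAULT_DPI:g} dpi, "
        "so the output page dimensions may be wrong. "
        "Specify the dpi by hand to get the correct size.",
        file=stream,
    )
    return (DEFAULT_DPI, DEFAULT_DPI)
```

The program still produces output and exits zero in this sketch. The only change is that the fallback is no longer silent, and the message separates "format cannot store dpi" from "file does not have it" as suggested in the note on #2a.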

F04 changed title from Re-evaluating Default DPI and how to handle DPI-less source images to Re-evaluating Default DPI and handling DPI-less source images 3 months ago
josch commented 3 months ago
Owner

I think exiting with a non-zero exit status is a bad idea. I got lots of complaints in the past when I insisted that, if the input is garbage, img2pdf should just refuse to work.

But I'm all in favor of showing a big fat warning if the image did not come with dpi values or only broken values and thus a default was used. The warning can then also include instructions of what to do to specify the correct dpi by hand.

I'm afraid that "no dpi" doesn't work because the input image has pixels as its size and the output pdf has a physical size. So we somehow need to decide what physical size the pdf pages should have, based on the number of pixels width and height of the image. That is what dots per inch or pixel per centimeter decide.
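
To make the pixel-to-physical-size relationship concrete, here is a small worked example. It assumes a US Letter page scanned at 300 dpi, which gives 2550 x 3300 pixels; the function name is made up for illustration.

```python
def physical_size_inches(width_px, height_px, dpi):
    """Physical size (in inches) implied by a pixel size at a given dpi."""
    return width_px / dpi, height_px / dpi


# Assumed example: a US Letter page (8.5 x 11 in) scanned at 300 dpi.
scan_px = (2550, 3300)

true_size = physical_size_inches(*scan_px, dpi=300)    # the real paper size
assumed_size = physical_size_inches(*scan_px, dpi=96)  # size with the 96 dpi default
scale_error = 300 / 96                                 # how much too large the page gets

print(true_size)     # (8.5, 11.0)
print(assumed_size)  # (26.5625, 34.375)
print(scale_error)   # 3.125
```

This matches the "roughly 3x larger" observation from the original report: the same pixels interpreted at 96 dpi instead of 300 dpi come out 3.125 times larger in each direction.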

F04 commented 3 months ago
Poster

It sounds like you can set a pdf's unit to "pixels per inch=x" or "pixels per centimeter=x". Too bad it can't be tricked into "pixels per pixel=1"!

If you are certain there's absolutely no way to generate a pdf without declaring physical (non-pixel) dimensions, I'd default the output pdf to be 8.5x11 (because yay America). Having PDFs that are too large to come out of a printer is unhelpful.

And once you know what output page size you want to hit, scale the image as large as possible to fit within that and centered. Maybe instead of outputting 8.5x11 you output maximum 8.5 by maximum 11. So the images won't be matted, but are guaranteed to fit on one page.
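
The "maximum 8.5 by maximum 11" idea amounts to an aspect-preserving fit. A minimal sketch, assuming sizes are expressed in PDF points (1/72 inch, so Letter is 612 x 792); `fit_within` is a hypothetical helper, not an img2pdf function:

```python
def fit_within(width_px, height_px, max_w_pt=612.0, max_h_pt=792.0):
    """Scale an image as large as possible inside a maximum page size.

    Returns (page_w_pt, page_h_pt, implied_dpi). The page is shrunk to the
    image's aspect ratio, so there is no matting around the image.
    """
    # Points per pixel: the smaller ratio is the one that still fits both ways.
    scale = min(max_w_pt / width_px, max_h_pt / height_px)
    # 72 points per inch divided by points per pixel gives pixels per inch.
    implied_dpi = 72.0 / scale
    return width_px * scale, height_px * scale, implied_dpi


# A 2550 x 3300 px scan fills the whole Letter page.
print(fit_within(2550, 3300))
# A narrow 600 x 1800 px receipt is limited by the page height.
print(fit_within(600, 1800))
```

Note that this still picks a dpi, it just derives it from the target page instead of hard-coding it, which is why the small receipt ends up enlarged to the full page height.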

If you have to assume a DPI, any assumption is probably going to be wrong. So maximizing the utility of the common printable size is probably the next best thing. Even if I'd feel really stupid carrying around a gas pump receipt or a UPS label enlarged to 8.5x11. I've done that before...

I feel like this level of decision making should be left to the user's --args but if they complain when they don't give --args and don't get a pdf out, you're in a tough position.

Some folks like A4, or legal over letter. More relevantly, A4 is the standard for much of the world outside the US. It's hard for me to say whether A4 or Letter are more common in humanity or which are more common amongst people with computers. I do concede the A system is more intelligent. But we have more nukes and moon rocks. so....

Maybe there's a system default paper-size variable you could read, like /etc/timezone but I would be quite surprised.

Big fat warning is also great because it will increase user awareness of poor input data, while preventing the complaints that no output file would lead to.

I'm not an expert in pdf or jpg or pnm. But I switched my scanning from pnm to jpg when I learned that pnm can not contain dpi/dimensions. Only to learn that scanimage doesn't save dimensions in the jpg format either!

josch commented 3 months ago
Owner

What would "pixel per pixel=1" mean?

A pdf is not a raster image. It is more like a vector graphic and its native unit is 1/72 inch. So you have to convert your measurement into inches one way or another. You cannot give measurements in pixels ever. That's not how pdf works.

You are making the assumption that people use img2pdf to print things. I know some people do. I have personally never used it for that. I think defaulting to some dpi value and printing a warning if that value gets used is the right thing to do.

There is a way to figure out whether the user might prefer letter or A4: by using the locale the user has set. If you want to send a patch that enables this, I'd happily review it.

Yes, pnm can store nothing but pixel data. It does not contain any metadata.

I will not default to the letter format. You can compute the number of people who use letter versus those who use the ISO 216 sizes using this handy graphic:

https://upload.wikimedia.org/wikipedia/commons/1/1a/Prevalent_default_paper_size.svg

F04 commented 3 months ago
Poster

I seldom print. But I scan and use img2pdf and ocrmypdf and other tools to preserve paper documents in a reproducible form.

pixel per pixel=1 is me looking for a way to have a pdf without physical dimensions when such dimensions are unknown to begin with. That would be hacking the format, which is not ideal.

I think the Big Fat Warning on stderr when input is without dpi is better though.

I found there IS an /etc/papersize and it's set to (gasp) a4 on my system!
#to set it to letter:
paperconfig -p letter

x360:~# paperconf
letter
x360:~# paperconf -s
612 792
x360:~# paperconf -s -m
215.9 mm 279.4 mm

There's a libpaper library apparently for this.

Supposedly /etc/papersize is used by groff, troff, and cups. LC_PAPER exists, but it's just set to en_US.UTF-8, which would need another table to look up from the country code to a papersize.

josch commented 3 months ago
Owner

The "table lookup" can be done by the locale utility. Try running this on your system:

$ locale width; locale height

It will give you width and height of the default paper format according to your LC_PAPER setting in mm. If this doesn't work, img2pdf could fall back to using the information in /etc/papersize, for example by running paperconf -s.
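
The lookup order described above (locale first, then `paperconf -s`) could be sketched as follows. The function names are hypothetical. The sketch assumes `locale width` and `locale height` each print a number of millimetres and that `paperconf -s` prints two numbers in points, as in the output shown earlier in this thread. It returns None if neither tool is available.

```python
import subprocess

MM_PER_INCH = 25.4
PT_PER_INCH = 72.0


def mm_to_pt(mm):
    """Convert millimetres to PDF points (1/72 inch)."""
    return mm / MM_PER_INCH * PT_PER_INCH


def run_or_none(cmd):
    """Return the stdout of cmd, or None if it is missing, fails, or hangs."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=5
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout


def parse_locale_mm(width_out, height_out):
    """Parse the outputs of `locale width` and `locale height` into points."""
    try:
        return mm_to_pt(float(width_out)), mm_to_pt(float(height_out))
    except (TypeError, ValueError):
        return None


def parse_paperconf_pt(out):
    """Parse `paperconf -s` output such as '612 792' into points."""
    try:
        width, height = (float(value) for value in out.split())
    except (AttributeError, ValueError):
        return None
    return width, height


def default_paper_size_pt():
    """Try LC_PAPER via `locale`, then fall back to `paperconf -s`."""
    size = parse_locale_mm(
        run_or_none(["locale", "width"]), run_or_none(["locale", "height"])
    )
    if size is None:
        size = parse_paperconf_pt(run_or_none(["paperconf", "-s"]))
    return size
```

Keeping the parsing separate from the subprocess calls means the parsers can be checked against sample output without either tool installed.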

Reference: josch/img2pdf#189