Output is not deterministic unless using --engine=internal #150

Closed
opened 2 years ago by zkemwjcy · 3 comments

Using img2pdf 0.4.4 from debian bookworm. The issue also occurs on img2pdf 0.4.0 from debian bullseye. The issue does not occur on img2pdf 0.3.3 from debian buster.

The manpage says --nodate "makes the output deterministic between individual runs".

However:

wget 'https://upload.wikimedia.org/wikipedia/commons/0/0e/Felis_silvestris_silvestris.jpg'
img2pdf --nodate --output=output1.pdf Felis_silvestris_silvestris.jpg
img2pdf --nodate --output=output2.pdf Felis_silvestris_silvestris.jpg
diff -s output1.pdf output2.pdf

produces:

Binary files output1.pdf and output2.pdf differ

But if I instead use --engine=internal:

wget 'https://upload.wikimedia.org/wikipedia/commons/0/0e/Felis_silvestris_silvestris.jpg'
img2pdf --engine=internal --nodate --output=output1.pdf Felis_silvestris_silvestris.jpg
img2pdf --engine=internal --nodate --output=output2.pdf Felis_silvestris_silvestris.jpg
diff -s output1.pdf output2.pdf

I get:

Files output1.pdf and output2.pdf are identical

The need to use --engine=internal to produce pdf files deterministically is not documented anywhere. Version 0.3.3 used to produce deterministic pdf files just using --nodate as documented.

josch commented 2 years ago
Owner

Thank you. This is absolutely a bug.

The statement about deterministic output comes from the time when img2pdf only supported the internal engine as well as pdfrw. Today, pdfrw is unmaintained and has been removed from the tests. It might be completely broken. Instead, we now have the new pikepdf engine, which has since become the default. The pikepdf engine is the problem because it produces non-deterministic /ID values.

I will investigate.
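
To check whether the /ID entry really is the only thing that differs between two runs, one could mask it out and compare the rest. This is only a rough sketch: the helper names `mask_ids` and `identical_apart_from_id` are made up for illustration, and it assumes the /ID array is written as two hex strings in plain text, which may not hold for every PDF writer.

```python
import re

# Assumption: the /ID array appears in plain text as two hex strings,
# e.g. /ID [<ab12...><cd34...>]. Other encodings are not handled.
ID_PATTERN = re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]")


def mask_ids(pdf_bytes: bytes) -> bytes:
    """Replace every /ID array with a fixed placeholder."""
    return ID_PATTERN.sub(b"/ID [<> <>]", pdf_bytes)


def identical_apart_from_id(path_a: str, path_b: str) -> bool:
    """Return True if both files are byte-identical once /ID is masked."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return mask_ids(fa.read()) == mask_ids(fb.read())
```

Calling `identical_apart_from_id("output1.pdf", "output2.pdf")` on the two files from the report would then show whether anything besides /ID changes between runs.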


I think one can pass static_id=True to pikepdf.Pdf.save() (https://pikepdf.readthedocs.io/en/latest/api/main.html#pikepdf.Pdf.save).
However, --nodate only covers dates, so maybe a separate option should be added for this.
josch commented 2 years ago
Owner

Yes, this should be independent of --nodate. It would be wrong to overload --nodate with additional functionality other than not embedding the current date and time.

But it would also be wrong to pass static_id=True to pikepdf.Pdf.save() because then the /ID metadata would not be generated at all anymore, which would make the PDF files generated by img2pdf no longer uniquely identifiable according to PDF 1.7 reference section 10.3 "File Identifiers".

I think this is the correct solution:

https://github.com/pikepdf/pikepdf/pull/400
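
The idea of an identifier that is still unique per file but reproducible between runs can be illustrated by deriving it from the file contents instead of from random data or the current time. The following is only a sketch of that idea and not necessarily how the linked pull request implements it; the function names and the choice of a truncated SHA-256 digest are assumptions made for the example.

```python
import hashlib


def content_derived_id(pdf_body: bytes) -> str:
    """Return a 32-digit hex identifier that depends only on the given bytes.

    Assumption: a SHA-256 digest truncated to 16 bytes stands in for whatever
    hash a real PDF writer would use.
    """
    return hashlib.sha256(pdf_body).hexdigest()[:32]


def id_entry(pdf_body: bytes) -> str:
    """Format the identifier as a trailer /ID array with two equal parts."""
    ident = content_derived_id(pdf_body)
    return f"/ID [<{ident}><{ident}>]"
```

Two runs over the same input then yield the same /ID, while different inputs still get different identifiers.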

josch closed this issue 2 years ago
Reference: josch/img2pdf#150