Keywords argument adds extra quotation marks #194
On a Windows machine I get unexpected behaviour when the keywords argument has commas in it.
The first command works as expected, while the second one adds double quotation marks. It looks like a bug to me, but if not, is there any way to avoid this behaviour?
I suspect this is your shell. Can you try the same commands with a different shell?
Same result with both cmd and PowerShell. I don't think it is shell related. --title works as expected when "One, two" is used.
Huh, that's super weird! The problem indeed seems to be Windows-specific. I cannot reproduce it on my Linux box. What does this do on your system:
Output:
foo, bar
Another idea: download this image:
https://mister-muffin.de/mister-muffin.png
And then run:
I will do the same and then we can compare the results. Thanks!
Here you go.
In your PDF I see this:
So there are no quotes. Maybe this is something your PDF viewer does?
Can you please upload your file?
I have tried Adobe Acrobat and Foxit PDF Reader.
I have removed double quotation marks in Adobe Acrobat and saved the file. What does it show to you now? I no longer see them after manually removing.
No need, my file is bit-by-bit identical (same hashes) compared to yours.
That pdf still contains these lines (after uncompressing the respective streams):
But it also contains this:
So apparently, when you edit the keywords in acrobat, it leaves the original untouched but adds them to an rdf section.
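A rough way to check both storage locations yourself is sketched below. It assumes the relevant parts of the file are uncompressed, and the sample bytes and the function name find_keywords are made up for illustration; a real PDF usually needs its streams decompressed first, and the simple pattern does not handle escaped parentheses or hex strings.

```python
import re

# Hypothetical, uncompressed PDF fragment for illustration only.
SAMPLE = (
    b"1 0 obj\n<< /Keywords (One, two) /Producer (example) >>\nendobj\n"
    b"<rdf:Description><pdf:Keywords>One, two</pdf:Keywords></rdf:Description>\n"
)


def find_keywords(data: bytes) -> dict:
    """Return the keyword string found in each storage location, or None."""
    # Literal string following /Keywords in the document information dictionary.
    info = re.search(rb"/Keywords\s*\(([^)]*)\)", data)
    # pdf:Keywords element inside the XMP (RDF) metadata.
    xmp = re.search(rb"<pdf:Keywords>(.*?)</pdf:Keywords>", data, re.DOTALL)
    return {
        "info_dict": info.group(1).decode("latin-1") if info else None,
        "xmp_pdf_keywords": xmp.group(1).decode("utf-8") if xmp else None,
    }


print(find_keywords(SAMPLE))
```

If neither value contains a literal quotation mark, the marks shown in the viewer are not coming from the file.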
Here is another theory: if you have more than one keyword, how do your viewers display them? If they separate multiple keywords by comma, then maybe your viewer adds the quotation marks to indicate for you which ones belong together?
I am not sure I understand your question. But if you are asking what happens in this scenario:
img2pdf.exe mister-muffin.png --nodate --engine=internal --keywords "One two" -o out2.pdf
then it works as expected, i.e. no double quotation marks are added.
Btw, different app, but same issue, maybe it will give you some ideas about what is going on:
https://exiftool.org/forum/index.php?topic=4696.0
And here is a bit more info about this issue:
https://acrobatusers.com/forum/javascript/document-properties-keywords-field-scripts-add-double-quotes-multiple-values/
Any updates on this issue?
Sorry, I'm maintaining img2pdf in my free time, my current real-life job ends at the end of March, and things are a bit hectic with figuring out how things continue in April.
According to that page, the quotation marks come from Adobe Acrobat. And as I suspected earlier, the thing that "fixes" this is to use RDF instead of the PDF /Keywords key as documented in PDF spec 1.7 TABLE 10.2 "Entries in the document information dictionary".
That link also suggests adding metadata as RDF.
So I guess this feature request is to add RDF support to img2pdf for the metadata?
Yes. It would be best if end users could choose where keywords are stored. One way would be to add an extra argument to img2pdf. For example, --keywords would store data in the /Keywords key, while --keywords2 or --keywordsRdf would store it as RDF.
What is the use-case for only storing it in one or the other? Why would a user want to have control over that?
I was more thinking along the lines of what adobe acrobat apparently does and always storing them in both places.
I can confirm that not only Adobe Acrobat but also Foxit PDF adds quotation marks, so this behavior is not isolated. Nitro PDF, PDF-XChange and Slim PDF, on the other hand, do not add those marks.
RDF will be stored within XMP. I like how img2pdf creates "pure" PDF files with no hidden/extra metadata included. In most cases I would not like to have XMP entries stored in PDF files unless there is absolutely no other way. That’s why I have suggested two methods of storing keywords.
Okay, but that software adds the quotation marks in its interface. The quotation marks are not stored in the PDF itself. They are added by your PDF viewer when presenting the keywords to you, so that you know that "One, two" is one keyword and not two.
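To make that theory concrete, here is a small sketch of how a viewer might render a keyword list. This is only a guess at the display logic, not code from Acrobat or Foxit, and display_keywords is a made-up name.

```python
def display_keywords(keywords):
    """Join keywords for display, quoting any keyword that contains a comma.

    Hypothetical model of a viewer's behaviour: the quotes exist only in the
    returned display string, never in the stored keywords.
    """
    shown = []
    for keyword in keywords:
        if "," in keyword:
            # Quote so the reader can tell this is one keyword, not several.
            shown.append(f'"{keyword}"')
        else:
            shown.append(keyword)
    return ", ".join(shown)


print(display_keywords(["One, two"]))
print(display_keywords(["One two"]))
print(display_keywords(["foo", "bar"]))
```

Under this model a single keyword "One, two" is shown with quotation marks while "One two" is not, which matches what was reported above.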
Correct me if I am wrong, but img2pdf doesn't allow adding separate keywords to a PDF file; they will always be stored as one string. If you implement that feature by always storing keywords in both places, PDF files will always have XMP entries (when one or more keywords are set), which is undesirable.
What I am trying to suggest is: don't add XMP entries to PDF files by default unless the user specifically wants or needs them. One way to figure out whether the user wants XMP entries is to use separate arguments for keywords.
I am not sure myself either. I've read the PDF spec on the /Keywords entry and it just says:
It does not talk about whether there can be more than one keyword and, if so, how those would get stored (if not via XMP). I also had another look at your out_edited.pdf from above where you removed the quotation marks, and that one still contains:
And I guess the quotation marks are removed because Acrobat prefers XMP entries over /Keywords?
My thought on the XMP metadata is that, if compressed, it would only be a few hundred bytes long. I find it a bit odd to only add the full XMP blob if more than one keyword is used. That looks like an unexpected surprise for the user to me. I see little reason not to always add it, given that it is comparatively small. Maybe I would add it by default but add a --no-xmp switch for a user who really doesn't want to have it.
It looks like there is a real mess with keywords. You can find some useful info in these topics:
https://exiftool.org/forum/index.php?topic=15469.0
https://exiftool.org/forum/index.php?topic=12086.0
So, keywords can be stored in three places and you can see that in out_edited.pdf:
/Keywords (One, two)
<pdf:Keywords>One, two</pdf:Keywords>
Different PDF readers treat keyword tags differently, but Adobe apparently does the following:
Adobe Reader ignores PDF:Keywords. It fills the Keywords Property by combining two tags, XMP-pdf:Keywords and XMP-dc:Subject.
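For reference, the two XMP tags mentioned there could be written roughly as in the sketch below. It is a simplified illustration, not what img2pdf or Acrobat emits: build_xmp is a made-up helper, and the xpacket wrapper and all other properties of a full XMP packet are left out.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_xmp(keywords):
    """Return a minimal XMP fragment carrying pdf:Keywords and dc:subject."""
    # pdf:Keywords is one flat string; dc:subject keeps each keyword separate.
    joined = escape(", ".join(keywords))
    items = "".join(f"<rdf:li>{escape(k)}</rdf:li>" for k in keywords)
    return (
        '<x:xmpmeta xmlns:x="adobe:ns:meta/">'
        f'<rdf:RDF xmlns:rdf="{NS["rdf"]}">'
        f'<rdf:Description rdf:about="" xmlns:pdf="{NS["pdf"]}" xmlns:dc="{NS["dc"]}">'
        f"<pdf:Keywords>{joined}</pdf:Keywords>"
        f"<dc:subject><rdf:Bag>{items}</rdf:Bag></dc:subject>"
        "</rdf:Description></rdf:RDF></x:xmpmeta>"
    )


root = ET.fromstring(build_xmp(["One, two"]))
print(root.find(".//pdf:Keywords", NS).text)
print([li.text for li in root.findall(".//dc:subject//rdf:li", NS)])
```

The flat pdf:Keywords string cannot distinguish one keyword "One, two" from two keywords "One" and "two", whereas the dc:subject bag can, which may be why a viewer quotes entries that contain a comma.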
I think it is a good idea, but I would suggest not adding XMP by default and using an inverted switch instead (e.g. --add-xmp, --pdf-xmp or alike).
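A minimal sketch of what such an opt-in switch could look like with argparse follows. This is not img2pdf's actual parser: --add-xmp is only the name proposed above, and the parser and program name here are made up for illustration.

```python
import argparse


def build_parser():
    """Build a toy parser with an opt-in XMP switch (hypothetical option)."""
    parser = argparse.ArgumentParser(prog="keywords-sketch")
    parser.add_argument("--keywords", help="keyword string for the /Keywords entry")
    # Off by default, so files stay free of XMP unless the user asks for it.
    parser.add_argument(
        "--add-xmp",
        action="store_true",
        help="also write the keywords into an XMP metadata stream",
    )
    return parser


args = build_parser().parse_args(["--keywords", "One, two"])
print(args.keywords, args.add_xmp)

args = build_parser().parse_args(["--keywords", "One, two", "--add-xmp"])
print(args.keywords, args.add_xmp)
```

With store_true the default is False, so existing command lines keep producing files without XMP; the --no-xmp variant discussed earlier would simply invert that default.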