Keywords argument adds extra quotation marks #194

Open
opened 2 months ago by soheday · 20 comments

On a Windows machine I get unexpected behaviour when the keywords argument has commas in it.

img2pdf.exe in.png -S A4 --keywords "One two" -o out1.pdf
img2pdf.exe in.png -S A4 --keywords "One, two" -o out2.pdf

The first command works as expected, while the second one adds double quotation marks. It looks like a bug to me, but if not, is there any way to avoid this behaviour?

![](https://i.ibb.co/M7F9790/image.png)
josch commented 2 months ago
Owner

I suspect this is your shell. Can you try the same commands with a different shell?

Poster

Same result with both cmd and PowerShell. I don't think it is shell related. --title works as expected when "One, two" is used.

![](https://i.ibb.co/2g2nmNS/image.png)
josch commented 2 months ago
Owner

Huh, that's super weird! The problem indeed seems to be Windows-specific. I cannot reproduce it on my Linux box. What does this do on your system:

python3 -c "import sys; print(sys.argv[1])" "foo, bar"
Poster

Output: foo, bar

josch commented 2 months ago
Owner

Another idea: download this image:

https://mister-muffin.de/mister-muffin.png

And then run:

img2pdf.exe mister-muffin.png --nodate --engine=internal --keywords "One, two" -o out.pdf

I will do the same and then we can compare the results. Thanks!

Poster

Here you go.

josch commented 2 months ago
Owner

In your PDF I see this:

1 0 obj
<<
    /Keywords (One, two)
    /Producer (img2pdf 0.5.1)
>>

So there are no quotes. Maybe this is something your PDF viewer does?
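
To double-check what is actually stored, one can look for the /Keywords entry directly in the file bytes. This is a minimal sketch, assuming the information dictionary is not inside a compressed object stream and that the value is a simple literal string without escapes or nested parentheses (as in the excerpt above). `find_keywords` is a hypothetical helper name, not part of img2pdf.

```python
import re

def find_keywords(pdf_bytes: bytes) -> list[str]:
    """Return every simple literal /Keywords string found in raw PDF bytes."""
    # Only handles "/Keywords (text)" with no escapes or nested parentheses;
    # hex strings and compressed object streams are deliberately not covered.
    pattern = re.compile(rb"/Keywords\s*\(([^()\\]*)\)")
    return [m.group(1).decode("latin-1") for m in pattern.finditer(pdf_bytes)]

# Sample bytes modelled on the object shown above.
sample = b"1 0 obj\n<<\n    /Keywords (One, two)\n    /Producer (img2pdf 0.5.1)\n>>\n"
print(find_keywords(sample))
```

If the list contains the string without quotation marks, any quotes seen in a viewer are not coming from this entry.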

Poster

Can you please upload your file?
I have tried Adobe Acrobat and Foxit PDF Reader.

Poster

I have removed double quotation marks in Adobe Acrobat and saved the file. What does it show to you now? I no longer see them after manually removing.

josch commented 2 months ago
Owner

> Can you please upload your file?

No need, my file is bit-by-bit identical (same hashes) compared to yours.
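
The "same hashes" check can be reproduced along these lines. This is a sketch only: the thread does not say which hash was used, so SHA-256 is an assumption, and the helper names and file names are hypothetical.

```python
import hashlib
from pathlib import Path

def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def same_file_contents(path_a: str, path_b: str) -> bool:
    """True if both files produce the same digest."""
    return sha256_of(path_a) == sha256_of(path_b)

# Example call with hypothetical file names:
# same_file_contents("out.pdf", "out_from_other_machine.pdf")
```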

> I have removed double quotation marks in Adobe Acrobat and saved the file. What does it show to you now? I no longer see them after manually removing.

That pdf still contains these lines (after uncompressing the respective streams):

6 0 obj
<<
  /CreationDate (D:20240311095922+02'00')
  /Keywords (One, two)
  /ModDate (D:20240311095922+02'00')
  /Producer (img2pdf 0.5.1)
>>
endobj

But it also contains this:

1 0 obj
<<
  /Length 3371
  /Subtype /XML
  /Type /Metadata
>>
stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 9.1-c001 79.2a0d8d9, 2023/03/14-11:19:46        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
         <pdf:Keywords>One, two</pdf:Keywords>
         <pdf:Producer>img2pdf 0.5.1</pdf:Producer>
         <dc:format>application/pdf</dc:format>
         <dc:creator>
            <rdf:Bag/>
         </dc:creator>
         <dc:subject>
            <rdf:Bag>
               <rdf:li>One</rdf:li>
               <rdf:li>two</rdf:li>
            </rdf:Bag>
         </dc:subject>
         <xmp:ModifyDate>2024-03-11T09:59:22+02:00</xmp:ModifyDate>
         <xmp:CreateDate>2024-03-11T09:59:22+02:00</xmp:CreateDate>
         <xmp:MetadataDate>2024-03-11T09:59:22+02:00</xmp:MetadataDate>
         <xmpMM:DocumentID>uuid:988cf6bf-d02d-447a-a48d-ccfbcae070d8</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:005049d8-c4d5-461d-a387-7763f9adcee0</xmpMM:InstanceID>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

So apparently, when you edit the keywords in Acrobat, it leaves the original /Keywords entry untouched but adds them to an RDF section.
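
The two XMP locations in that packet can be read out with the standard library. This is a minimal sketch, assuming the XMP stream has already been extracted and uncompressed. The sample is a shortened copy of the packet above without the xpacket wrapper, and `xmp_keywords` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Namespace URIs as they appear in the packet above.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def xmp_keywords(xmp_xml: str) -> Tuple[Optional[str], List[str]]:
    """Return (pdf:Keywords text, list of dc:subject items) from XMP XML."""
    root = ET.fromstring(xmp_xml)
    pdf_kw = root.find(".//pdf:Keywords", NS)
    subjects = [li.text or "" for li in root.findall(".//dc:subject/rdf:Bag/rdf:li", NS)]
    return (pdf_kw.text if pdf_kw is not None else None, subjects)

sample = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <pdf:Keywords>One, two</pdf:Keywords>
      <dc:subject>
        <rdf:Bag>
          <rdf:li>One</rdf:li>
          <rdf:li>two</rdf:li>
        </rdf:Bag>
      </dc:subject>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

print(xmp_keywords(sample))
```

Note that the packet stores the keywords twice: as one flat string in pdf:Keywords and as separate items in the dc:subject bag.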

Here is another theory: if you have more than one keyword, how do your viewers display them? If they separate multiple keywords by comma, then maybe your viewer adds the quotation marks to indicate for you which ones belong together?

Poster

> Here is another theory: if you have more than one keyword, how do your viewers display them? If they separate multiple keywords by comma, then maybe your viewer adds the quotation marks to indicate for you which ones belong together?

I am not sure I understand your question. But if you are asking what happens in this scenario:
img2pdf.exe mister-muffin.png --nodate --engine=internal --keywords "One two" -o out2.pdf
then it works as expected, i.e. no double quotation marks are added.

Btw, different app, but same issue, maybe it will give you some ideas about what is going on:
https://exiftool.org/forum/index.php?topic=4696.0

And here is a bit more info about this issue:
https://acrobatusers.com/forum/javascript/document-properties-keywords-field-scripts-add-double-quotes-multiple-values/

Poster

Any updates on this issue?

josch commented 1 month ago
Owner

Sorry, I'm maintaining img2pdf in my free time. My current real-life job ends at the end of March, and things are a bit hectic with figuring out how things continue in April.

> https://exiftool.org/forum/index.php?topic=4696.0

According to that page, the quotation marks come from Adobe Acrobat. And as I suspected earlier, the thing that "fixes" this is to use RDF instead of the PDF /Keywords key, which is documented in PDF spec 1.7, Table 10.2, "Entries in the document information dictionary".

> https://acrobatusers.com/forum/javascript/document-properties-keywords-field-scripts-add-double-quotes-multiple-values/

That link also suggests adding the metadata as RDF.

So I guess this feature request is to add RDF support to img2pdf for the metadata?

Poster

> So I guess this feature request is to add RDF support to img2pdf for the metadata?

Yes. It would be best if end users could choose where keywords are stored. One way would be to add an extra argument to img2pdf. For example, --keywords would store the data in the /Keywords key, while --keywords2 or --keywordsRdf would store it as RDF.

josch commented 4 weeks ago
Owner

What is the use-case for only storing it in one or the other? Why would a user want to have control over that?

I was thinking more along the lines of what Adobe Acrobat apparently does: always storing them in both places.

Poster

I can confirm that not only Adobe Acrobat but also Foxit PDF adds quotation marks, so this behavior is not isolated. Nitro PDF, PDF-XChange and Slim PDF, on the other hand, do not add those marks.

RDF will be stored within XMP. I like how img2pdf creates "pure" PDF files with no hidden/extra metadata included. In most cases I would not like to have XMP entries stored in PDF files unless there is absolutely no other way. That’s why I have suggested two methods of storing keywords.

josch commented 4 weeks ago
Owner

Okay, but that software adds the quotation marks in its interface. The quotation marks are not stored in the PDF itself. They are something that is added by your PDF viewer when presenting the keywords to you, so that you know that "One, two" is one keyword and not two.
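
The display behaviour described here can be illustrated with a small sketch. It is only a guess at what such a viewer might do, not the actual logic of Acrobat or Foxit, and `display_keywords` is a hypothetical name.

```python
from typing import List

def display_keywords(keywords: List[str]) -> str:
    """Join keywords with ', ', quoting any keyword that itself contains a comma."""
    shown = ['"' + kw + '"' if "," in kw else kw for kw in keywords]
    return ", ".join(shown)

print(display_keywords(["One two"]))      # single keyword without a comma
print(display_keywords(["One, two"]))     # single keyword containing a comma
print(display_keywords(["One", "two"]))   # two separate keywords
```

Under that assumption, the quotes are what lets a reader tell the single keyword "One, two" apart from the two keywords One and two, which would otherwise be displayed identically.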

Poster

Correct me if I am wrong, but img2pdf doesn't allow adding separate keywords to a PDF file; they will always be stored as one string. If you implement that feature by always storing keywords in both places, PDF files will always have XMP entries (when one or more keywords are set), which is undesirable.

What I am trying to suggest is: don't add XMP entries to PDF files by default unless the user specifically wants or needs them. One way to figure out whether the user wants XMP entries is to use separate arguments for keywords.

josch commented 4 weeks ago
Owner

I am not sure myself either. I've read the PDF spec on the /Keywords entry and it just says:

> Keywords associated with the document

It does not say whether there can be more than one keyword and, if so, how those would get stored (if not via XMP). I also had another look at your out_edited.pdf from above where you removed the quotation marks and that one still contains:

/Keywords (One, two)

And I guess the quotation marks are removed because Acrobat prefers XMP entries over /Keywords?

My thought on the XMP metadata is that, if compressed, it would only be a few hundred bytes long. I find it a bit odd to only add the full XMP blob if more than one keyword is used. That looks like an unexpected surprise for the user to me. I see little reason not to always add it, given that it is comparatively small. Maybe I would add it by default but add a --no-xmp switch for a user who really doesn't want to have it.

Poster

> It does not talk about if there can be more than one keyword and if yes, how those would get stored (if not via XMP). I also had another look at your out_edited.pdf from above where you removed the quotation marks and that one still contains:
>
> /Keywords (One, two)

It looks like there is a real mess with keywords. You can find some useful info in these topics:
https://exiftool.org/forum/index.php?topic=15469.0
https://exiftool.org/forum/index.php?topic=12086.0

So, keywords can be stored in three places and you can see that in out_edited.pdf:

  1. PDF:Keywords (a.k.a. PDF-specific tag): /Keywords (One, two)
  2. XMP-pdf:Keywords (a.k.a. XMP PDF tag): <pdf:Keywords>One, two</pdf:Keywords>
  3. XMP-dc:Subject (a.k.a. Dublin Core XMP tag):
     <dc:subject>
        <rdf:Bag>
           <rdf:li>One</rdf:li>
           <rdf:li>two</rdf:li>
        </rdf:Bag>
     </dc:subject>
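
The three locations can be produced from one keyword list roughly as follows. This is a sketch, not img2pdf code: the helper names are hypothetical, the PDF string escaping only covers backslashes and parentheses in ASCII text, and joining with ", " for the two flat fields is an assumption taken from the example above.

```python
from typing import Dict, List
from xml.sax.saxutils import escape

def pdf_literal(text: str) -> str:
    """Wrap text as a PDF literal string, escaping backslash and parentheses."""
    # The backslash must be handled first so later escapes are not doubled.
    for ch in ("\\", "(", ")"):
        text = text.replace(ch, "\\" + ch)
    return "(" + text + ")"

def keyword_entries(keywords: List[str]) -> Dict[str, str]:
    """Render the same keywords for each of the three storage locations."""
    joined = ", ".join(keywords)
    items = "".join("<rdf:li>" + escape(kw) + "</rdf:li>" for kw in keywords)
    return {
        "PDF:Keywords": "/Keywords " + pdf_literal(joined),
        "XMP-pdf:Keywords": "<pdf:Keywords>" + escape(joined) + "</pdf:Keywords>",
        "XMP-dc:Subject": "<dc:subject><rdf:Bag>" + items + "</rdf:Bag></dc:subject>",
    }

for name, value in keyword_entries(["One", "two"]).items():
    print(name, "->", value)
```

Only the dc:subject form keeps the keywords as separate items. The two flat fields cannot distinguish ["One", "two"] from ["One, two"].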

> And I guess the quotation marks are removed because acrobat prefers XMP entries over /Keywords?

Different PDF readers treat keyword tags differently, but Adobe apparently does the following:
Adobe Reader ignores PDF:Keywords. It fills the Keywords Property by combining two tags, XMP-pdf:Keywords and XMP-dc:Subject.

> Maybe I would add it by default but add a --no-xmp switch for a user who really doesn't want to have it.

I think it is a good idea, but I would suggest not adding XMP by default and using an inverted switch instead (e.g. --add-xmp, --pdf-xmp or the like).
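
The two proposals differ only in the default. A minimal argparse sketch of both variants follows. Neither --no-xmp nor --add-xmp exists in img2pdf at the time of this thread, and `build_parser` and the program name are hypothetical.

```python
import argparse

def build_parser(xmp_by_default: bool) -> argparse.ArgumentParser:
    """Build a toy parser for either the opt-out or the opt-in proposal."""
    parser = argparse.ArgumentParser(prog="img2pdf-sketch")
    parser.add_argument("--keywords", help="keywords for the /Keywords entry")
    if xmp_by_default:
        # Opt-out proposal: XMP is written unless --no-xmp is given.
        parser.add_argument("--no-xmp", dest="xmp", action="store_false")
    else:
        # Opt-in proposal: XMP is written only if --add-xmp is given.
        parser.add_argument("--add-xmp", dest="xmp", action="store_true")
    return parser

opt_out = build_parser(True).parse_args(["--keywords", "One, two"])
opt_in = build_parser(False).parse_args(["--keywords", "One, two"])
print(opt_out.xmp, opt_in.xmp)
```

With identical command lines, the opt-out variant writes XMP and the opt-in variant does not, which is the whole disagreement in this part of the thread.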

Reference: josch/img2pdf#194