convert(): add option to get engine document (e.g. pikepdf.Pdf) #203

Open
mara0004 wants to merge 7 commits from mara0004/img2pdf:return_engine_doc into main
Contributor

This allows the caller to get the engine document handle (e.g. pikepdf.Pdf) instead of saving to a file or memory buffer, which avoids overhead if you intend to continue processing the document with that engine anyway (e.g. add pages from other PDFs, or do N-up composition with the image pages).

This is achieved by adding a new option return_engine_doc to the convert() API.
Beside the document handle, this also returns the minimum output version, which may be needed for saving once post-processing has been done.

Related to #97

This allows the caller to get the engine document handle (e.g. `pikepdf.Pdf`) instead of saving to a file or memory buffer, which avoids overhead if you intend to continue processing the document with that engine anyway (e.g. add pages from other PDFs, or do N-up composition with the image pages). ~~This is achieved by adding a new option `return_engine_doc` to the `convert()` API. Beside the document handle, this also returns the minimum output version, which may be needed for saving once post-processing has been done.~~ Related to #97
mara0004 added 3 commits 2024-08-26 12:54:38 +00:00
mara0004 added 1 commit 2024-08-26 13:05:33 +00:00
output seems to be part of the git repository and used for comparisons
Owner

I'm not happy with the design of adding another function argument which then, if set, changes the output type of the function. I think a better pattern would be to factor out the relevant code into a new function which can then be called by convert or by the 3rd party utility which needs the underlying engine object.

I'm not happy with the design of adding another function argument which then, if set, changes the output type of the function. I think a better pattern would be to factor out the relevant code into a new function which can then be called by `convert` or by the 3rd party utility which needs the underlying engine object.
Author
Contributor

Fair enough, that might be cleaner.

To confirm, you only want me to factor out a new function from pdfdoc.tostream() but keep the option + if-check in convert(), right?
Or do you also want me to factor out a convert_to_enginedoc() from convert()?

Fair enough, that might be cleaner. To confirm, you only want me to factor out a new function from `pdfdoc.tostream()` but keep the option + if-check in `convert()`, right? Or do you also want me to factor out a `convert_to_enginedoc()` from `convert()`?
mara0004 added 1 commit 2024-08-27 20:43:39 +00:00
mara0004 force-pushed return_engine_doc from 8aca925ee3 to c1c77a7453 2024-08-27 20:44:47 +00:00 Compare
Author
Contributor

I think a better pattern would be to factor out the relevant code into a new function which can then be called by convert or by the 3rd party utility which needs the underlying engine object.

I think I updated the code accordingly. Kindly let me know if you are OK with it now, or if there's anything else to change.

> I think a better pattern would be to factor out the relevant code into a new function which can then be called by convert or by the 3rd party utility which needs the underlying engine object. I think I updated the code accordingly. Kindly let me know if you are OK with it now, or if there's anything else to change.
Author
Contributor

(In case you decide to merge any of my 4 PRs, please --squash the commits into one. Thanks!)

(In case you decide to merge any of my 4 PRs, please --squash the commits into one. Thanks!)
Owner

Kindly let me know if you are OK with it now, or if there's anything else to change.

What i had in mind was: take the convert function and put all the code that generates the pdf variable and put it in its own function. Let the convert function call that function. Now your code can call this new function and can do with the pdf variable whatever it wants.

> Kindly let me know if you are OK with it now, or if there's anything else to change. What i had in mind was: take the convert function and put all the code that generates the `pdf` variable and put it in its own function. Let the convert function call that function. Now your code can call this new function and can do with the `pdf` variable whatever it wants.
Author
Contributor

Aah, that's what I was wondering in Comment 2, whether you want the factoring out only for pdfdoc.tostream() or also for convert() itself.
I had presumed you meant only pdfdoc.tostream(), but this clarifies you actually want it for convert(). So I'll change that.

Aah, that's what I was wondering in [Comment 2](https://gitlab.mister-muffin.de/josch/img2pdf/pulls/203#issuecomment-2661), whether you want the factoring out only for `pdfdoc.tostream()` or also for `convert()` itself. I had presumed you meant only `pdfdoc.tostream()`, but this clarifies you actually want it for `convert()`. So I'll change that.
mara0004 added 1 commit 2024-08-27 22:32:41 +00:00
Author
Contributor

OK, updated. Please take a look at the new patch set. Thanks :)

OK, updated. Please take a look at the new patch set. Thanks :)
mara0004 added 1 commit 2024-08-27 22:39:38 +00:00
josch requested changes 2024-09-15 06:03:03 +00:00
josch left a comment
Owner

Thank you for your work and for your patience with me. I'm going to be on a business trip next week (I'm switching jobs end of september/beginning of october) so my next feedback might take a while but please feel free to ping me if you like.

Thank you for your work and for your patience with me. I'm going to be on a business trip next week (I'm switching jobs end of september/beginning of october) so my next feedback might take a while but please feel free to ping me if you like.
@ -1076,3 +1076,3 @@
return stream.getvalue()
def tostream(self, outputstream):
def finalize(self):
Owner

You split tostream() into finalize() and tostream() but then why does the new tostream() not call finalize()?

You split `tostream()` into `finalize()` and `tostream()` but then why does the new `tostream()` not call `finalize()`?
Author
Contributor

Because I though the embedder of convert_to_docobject() should not have to invoke finalize().
Instead, finalize() technically belongs into convert_to_docobject() itself, after all image pages have been added. So tostream() cannot also finalize() as that would result in a double call.

Because I though the embedder of `convert_to_docobject()` should not have to invoke `finalize()`. Instead, `finalize()` technically belongs into `convert_to_docobject()` itself, after all image pages have been added. So `tostream()` cannot also `finalize()` as that would result in a double call.
@ -2803,0 +2806,4 @@
# Input images can be given as file like objects (they must implement read()),
# as a binary string representing the image content or as filenames to the
# images.
def convert(*images, outputstream=None, **kwargs):
Owner

Please do not change the signature of the convert() function. This is necessary to preserve API stability. You have to extract "outputstream" from kwargs.

Please do not change the signature of the convert() function. This is necessary to preserve API stability. You have to extract "outputstream" from kwargs.
Author
Contributor

It's not clear to me why this is supposed to break the API -- can you explain?
AFAICS, kwargs is internal, and no callee expects kwargs["outputstream"], right?

It's not clear to me why this is supposed to break the API -- can you explain? AFAICS, `kwargs` is internal, and no callee expects `kwargs["outputstream"]`, right?
Author
Contributor

Relatedly, it looks like the _default_kwargs strategy silently ignores nonexistent parameters, which is problematic. (And of course it breaks IDE completion.)
I see you're using kwargs to unify access of {crop,bleed,trim,art}border, which is fine, but why can't any others params be in the signature directly?

Relatedly, it looks like the `_default_kwargs` strategy silently ignores nonexistent parameters, which is problematic. (And of course it breaks IDE completion.) ~~I see you're using kwargs to unify access of `{crop,bleed,trim,art}border`, which is fine, but why can't any others params be in the signature directly?~~
Author
Contributor

It's not clear to me why this is supposed to break the API -- can you explain?

Okay, self-answering the question: a SO search yielded that specifying optional params after a *capture raises a SyntaxError on Python 2. Never having written code for Python 2, I did not know that. I'll change to extract from kwargs as you said, then.

The issue with invalid kwargs assumably being ignored without error still stands, though.

> It's not clear to me why this is supposed to break the API -- can you explain? Okay, self-answering the question: a SO search yielded that specifying optional params after a `*capture` raises a `SyntaxError` on Python 2. Never having written code for Python 2, I did not know that. I'll change to extract from kwargs as you said, then. The issue with invalid kwargs assumably being ignored without error still stands, though.
This pull request can be merged automatically.
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u return_engine_doc:mara0004-return_engine_doc
git checkout mara0004-return_engine_doc

Merge

Merge the changes and update on Forgejo.
git checkout main
git merge --no-ff mara0004-return_engine_doc
git checkout main
git merge --ff-only mara0004-return_engine_doc
git checkout mara0004-return_engine_doc
git rebase main
git checkout main
git merge --no-ff mara0004-return_engine_doc
git checkout main
git merge --squash mara0004-return_engine_doc
git checkout main
git merge --ff-only mara0004-return_engine_doc
git checkout main
git merge mara0004-return_engine_doc
git push origin main
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: josch/img2pdf#203
No description provided.