Is there any value to using pdfrw? #74

Closed
opened 2021-04-25 19:58:42 +00:00 by josch · 0 comments

By Michał Górny on 2020-04-20T10:31:15.600Z

I'm trying to figure out whether using pdfrw over built-in PDF writer has any advantages. In fact, both help output and a quick glance at the code suggests that it has at least a few limitations which require using --without-pdfrw. However, are there any cases when you should use pdfrw over --without-pdfrw?

In a wider context, I'd like to remove pdfrw from Gentoo since it's dead and stinking (and broken with Python 3.7+) and img2pdf is the last program using it. I'm wondering whether it'd be feasible to just rely on non-pdfrw behavior unconditionally.


By josch on 2020-04-20T13:30:03.745Z


Hi!

The reason for including pdfrw is that I think it's a good idea to rely on existing functionality instead of re-implementing the world. Writing pdf files is tricky and I would like it if somebody else took care of all the corner cases. Yes, pdfrw cannot do some things that img2pdf needs, so I opened issues about the missing functionality with pdfrw in the hope that those would be fixed in the future and I could then drop the --without-pdfrw flag. If pdfrw is really dead as you say (last commit was more than two years ago) then that's really shitty indeed. I am not aware of a good alternative that I could switch to.

What you can do in gentoo is to just patch img2pdf so that it unconditionally operates without pdfrw.

What is the problem with python 3.7+? I'm on python 3.8 and didn't find any problems yet.

Thanks for packaging the software in gentoo!


By Michał Górny on 2020-04-20T14:12:09.737Z


I agree that it's better to reuse existing code and let somebody else maintain it. Except that in this case nobody is maintaining pdfrw, so effectively you're either forced to fork it (which could be a good idea if you're willing to maintain it and the code has a low F-factor) or duplicate it. Right now you do the latter, so you not only have to reinvent the wheel but also maintain two code branches, which IMHO doesn't serve your goal much.

What I'm wondering is whether you'd be interested in removing pdfrw support altogether. That would save us the work, and possibly save you some too in the future ;-).

I've fought pdfrw's tests today and, of all the tests that passed with py2.7, 1 fails with py3.6 and 5 fail with py3.7. The latter are cases of raise StopIteration inside a generator, which py3.7 turns into a RuntimeError (PEP 479). I've found one explicit case in the code that's trivial to fix but it apparently isn't tested at all. The actual failure comes from some indirect call which I can't really figure out, and it seems that PDF files that somehow worked with earlier Python versions start exploding on newer versions, and the code lacks proper error handling.
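For readers unfamiliar with the py3.7 change, here is a minimal sketch of both failure modes described above: an explicit raise StopIteration and an indirect one leaking out of a bare next() call. The function names are hypothetical and none of this is taken from pdfrw's code; it only illustrates the PEP 479 behaviour and the usual fix.

```python
def take_legacy(items, limit):
    """Old style: stop the generator by raising StopIteration explicitly."""
    for index, item in enumerate(items):
        if index >= limit:
            raise StopIteration  # converted to RuntimeError on Python 3.7+
        yield item


def take_fixed(items, limit):
    """PEP 479-compatible style: end the generator with a plain return."""
    for index, item in enumerate(items):
        if index >= limit:
            return
        yield item


def pairs_legacy(items):
    """Indirect case: a bare next() lets StopIteration escape the generator."""
    iterator = iter(items)
    while True:
        yield next(iterator), next(iterator)


def pairs_fixed(items):
    """Catch the StopIteration from next() and return instead."""
    iterator = iter(items)
    while True:
        try:
            first, second = next(iterator), next(iterator)
        except StopIteration:
            return  # input exhausted; a trailing unpaired item is dropped
        yield first, second


def outcome(generator):
    """Return the produced list, or the name of the exception that was raised."""
    try:
        return list(generator)
    except RuntimeError as error:
        return type(error).__name__
```

On Python 3.7 or newer, outcome(take_legacy("abcdef", 3)) and outcome(pairs_legacy([1, 2, 3, 4])) both report a RuntimeError, while the fixed variants return their items normally. The indirect form is the harder one to spot, since the raise is hidden inside next().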

I'm not an expert on pdfrw's code and I don't really have the time to fix it or even figure out what's really wrong, I'm afraid. However, the evidence so far suggests it's badly written and making a lot of untested assumptions.


By josch on 2020-04-20T18:42:03.354Z


Did you report the bugs you found in their github issue tracker? I would like to refer to actual problems when I remove pdfrw support.

What I can do for you for now is to make it such that pdfrw support is automatically off if pdfrw is not installed. That should work for you, right?
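A rough sketch of what "automatically off if pdfrw is not installed" could look like is below. This is an assumption for illustration, not the code from img2pdf: load_optional and choose_pdf_backend are hypothetical helpers, and the strings "pdfrw" and "internal" are placeholder backend names.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def choose_pdf_backend(without_pdfrw=False, module_name="pdfrw"):
    """Pick the pdfrw backend only if it is importable and not disabled.

    without_pdfrw mirrors the --without-pdfrw flag; module_name is a
    parameter only so the fallback can be exercised without pdfrw.
    """
    if without_pdfrw:
        return "internal"
    if load_optional(module_name) is None:
        return "internal"  # silently fall back to the built-in writer
    return "pdfrw"
```

With this shape, a distribution that does not ship pdfrw gets the built-in writer without passing any flag, and --without-pdfrw still forces it where pdfrw is present.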


By Michał Górny on 2020-04-20T20:20:15.449Z


That one happens with all Python versions I've tested: https://github.com/pmaupin/pdfrw/issues/197

This one with py3.6+: https://github.com/pmaupin/pdfrw/issues/198

This one with py3.7+: https://github.com/pmaupin/pdfrw/issues/199

The StopIteration problem has been reported already as https://github.com/pmaupin/pdfrw/issues/145

> What I can do for you for now is to make it such that pdfrw support is automatically off if pdfrw is not installed. That should work for you, right?

Yes, that would be very helpful. Also please make sure tests skip it gracefully. Thank you.
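One way tests could skip gracefully is sketched here, assuming a unittest-based suite; the test class and its method are hypothetical and not taken from img2pdf's test suite. The only pdfrw name used is pdfrw.PdfWriter.

```python
import importlib.util
import unittest

# True only when pdfrw can be imported in this environment.
HAVE_PDFRW = importlib.util.find_spec("pdfrw") is not None


class PdfrwBackendTest(unittest.TestCase):
    @unittest.skipUnless(HAVE_PDFRW, "pdfrw is not installed")
    def test_pdfrw_writer_available(self):
        # Imported inside the test so that collecting the module never fails.
        import pdfrw

        self.assertTrue(hasattr(pdfrw, "PdfWriter"))


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PdfrwBackendTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Without pdfrw the test is reported as skipped rather than failed, so the suite still passes on systems that have dropped the package.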


By josch on 2020-04-20T21:36:06.946Z


Thank you for your input. I also marked pdfrw for removal in Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=95836


By Michał Górny on 2020-04-21T05:36:22.041Z


Thank you. I presume you meant https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=958362. Hopefully this will get some attention and the package can be revived.

The only other consumer we have (besides img2pdf) is rst2pdf, which I've queued for removal since it was needed only to build the PDF manual for mpv (no clue why we did that, to be honest).


By josch on 2020-04-23T04:59:11.107Z


Status changed to closed by commit 0bbbc7a31a66a892d1730da45c6453a86f339a63

josch closed this issue 2021-04-25 19:58:43 +00:00
Reference: josch/img2pdf#74