Memory 'leak' #127
img2pdf is blazing fast and we didn't have any issues with our Django app that converts multipage TIFF to PDF until we had many concurrent users doing quick document switches.
The initial memory usage is minimal (around 50 MB), but it keeps adding up until all PDF pages come out blank (the page count is correct, but every page is entirely white).
Below is a simple test; just watch the memory usage go up:
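The test snippet itself did not survive in this copy of the thread; below is a minimal sketch of the kind of loop described, with "sample.tif" standing in for the real multipage TIFF:

```python
import img2pdf
from tqdm import tqdm  # optional progress bar; see the note about it below

# Convert the same multipage TIFF over and over and watch the process
# memory climb with each iteration.
for _ in tqdm(range(1000)):
    img2pdf.convert("sample.tif")
```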
Thanks for the bug report. Which Django app are you using img2pdf with?
It's a Document Manager app I created myself for internal use at our company.
We're using Django 3.2.6 with Whitenoise for static files, and it is hosted locally on our server with Waitress.
We deliver the file to the browser using FileResponse with a BytesIO object.
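A hypothetical sketch of that delivery path (the view and parameter names are illustrative, not the actual app code):

```python
import io
import img2pdf
from django.http import FileResponse

def document_pdf(request, tiff_path):
    # Convert the multipage TIFF in memory and stream the result to the browser.
    buf = io.BytesIO(img2pdf.convert(tiff_path))
    return FileResponse(buf, content_type="application/pdf", filename="document.pdf")
```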
The mentioned issue also occurs if you just run the test code from my first comment stand-alone; you might want to remove line #3 (the tqdm import) if you don't have tqdm installed.
Edit: Running Python 3.9.6 on Windows 10 20H2 / Windows Server 2012 R2.
Okay, cool! Unfortunately this issue is very low on my priority list, so if you do this for some paid job, maybe you can find out what the issue is and send a patch that I can apply? I only maintain img2pdf in my free time as a hobby project, and my own usage is as an executable and not as a module, so this issue does not affect me personally.
I'm currently using the PIL method above as a workaround. I won't be able to look into img2pdf right now, but since img2pdf is just so much better, I'll definitely take a look later, thanks!
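For reference, a minimal sketch of what such a PIL-based workaround can look like (not the actual code; unlike img2pdf, Pillow typically re-encodes the images rather than embedding them losslessly):

```python
from PIL import Image

# Write every frame of a multipage TIFF into a single PDF.
with Image.open("sample.tif") as im:
    im.save("sample.pdf", save_all=True)
```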
To be frank, I find this reaction quite disappointing. A memory leak is a serious issue that ought to be addressed, regardless of the personal use case.
I'm sorry that you feel that way. I agree that this issue should be addressed, but I also have a 10-month-old daughter to take care of and thus don't have much time for my hobby projects.
This is an unpaid free software project. If somebody doesn't like it, then they don't have to use it. If somebody has a really important issue they want to see fixed, then they are always free to spend the time and implement it. I will gladly merge patches. But as an unpaid volunteer I don't owe anything to the users of my software.
That's very understandable, naturally. I also don't have as much time for my hobby projects as I'd like to...
However, what your comment was about (or at least how I understood it) is ranking the priority of different img2pdf issues, and in that ranking, addressing a memory leak should be more important than, say, working with viewer properties or adding new features.
Of course, you don't have to do this. It's your project, as you say. However, this is really the only image-to-PDF converter that is both lossless and fast, and a replacement is nowhere in sight. That's why I would be glad if img2pdf could keep being maintained in a way that cares about others' use cases as well.
The trouble is that the img2pdf code is highly complex and working with it requires in-depth knowledge of image format internals. The number of people who are up to this task is very small.
Thanks for the awesome library. I just started playing around with it to convert a proprietary TIFF wrapper to PDF, and I am also seeing the memory issue.
One solution, of course, is to run the conversion for each file in a new process (sketched below), certainly a workable approach for some use cases.
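A minimal sketch of that per-process idea (helper and file names are illustrative): the child process exits after each conversion, so whatever memory it grabbed goes back to the OS.

```python
import img2pdf
from multiprocessing import Process

def _convert(src, dst):
    with open(dst, "wb") as out:
        out.write(img2pdf.convert(src))

def convert_isolated(src, dst):
    # Run a single conversion in a short-lived child process.
    p = Process(target=_convert, args=(src, dst))
    p.start()
    p.join()

if __name__ == "__main__":
    convert_isolated("sample.tif", "sample.pdf")
```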
However, I plan to perform some debugging to see where the memory leak is; I am pretty sure it's in the way the image is inspected. I have tried two different methods:
if filebased:
    pages.append(dumpTiff(file, basename, seq, image, offset))
else:
    pages.append(readTiff(file, image, offset))
In the first case, pages is a list of temporary filenames to which I write the TIFF images.
In the second case, pages is a list of bytes.
In a third case, not shown, it was an object that had the filename, start, and length, plus a read method to return bytes.
In all cases there is no memory leak until I call the img2pdf.convert method:
with open(basename + ".pdf", "wb") as out:
    out.write(img2pdf.convert(pages))
Thanks again; when I find the leak, I'll post back here and we can figure out how to incorporate the fix into a release.
-tim
Thank you @tpurves for looking into this! I hope you find the culprit! Maybe another thing you can try is whether the problem occurs with different rendering engines or with different input types. The code paths are quite different depending on whether the input is a JPG, a PNG with or without transparency, a black/white TIF, and so on.
Thanks! 👍
Yes, thanks @tpurves for starting to analyse the issue! I also tried to do some investigation but am unfortunately not familiar enough with the code to determine the cause.
Well, I don't believe there is a memory "leak" as such, but rather something in the way that Python or the libraries pull memory from the OS, to the point that we know that once a block of memory is allocated to the process, it is not released until the process is terminated.
To test, I created/used the following wrappers; for the profile wrapper:
from memory_profiler import profile
And the following test function (convert below calls the img2pdf library):
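The wrapper and test-function code did not survive in this copy of the thread; below is a hypothetical reconstruction along the lines described, assuming guppy3 for the heap dump (the helper names and "sample.tif" are illustrative):

```python
import functools
import img2pdf
from guppy import hpy                 # guppy3: inspects the Python-managed heap
from memory_profiler import profile   # reports process memory line by line

def convert(src, dst):
    # Stand-in for the conversion call shown earlier in the thread.
    with open(dst, "wb") as out:
        out.write(img2pdf.convert(src))

def heapdump(func):
    # Print a summary of the Python heap after the wrapped call returns.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        print(hpy().heap())
        return result
    return wrapper

@heapdump
@profile
def test(loops=5):
    for _ in range(loops):
        convert("sample.tif", "sample.pdf")

if __name__ == "__main__":
    test()
```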
The results were interesting.
Regardless of how many times I run the loop (i.e., the range command), the heap dump remains the same. However, the memory consumed by the process goes up by ~60 MB with each loop, while the heap at the end stays consistent at ~224 KB regardless of the number of loops.
TL;DR - I think what is happening (without looking at the Python runtime or the libraries that are used) is that something is allocating directly from the OS rather than using the Python allocators, and we know that once a block is requested and allocated from the OS to the process, it will not be released until the process is terminated. I have tested this code on Windows, Linux, and OSX, and the behavior is exactly the same.
TL;DR2 - What I am planning: fork the conversion as a sub-process. I realize this will be a little slower, but it will address the memory issue.
Thoughts? What am I missing?
-tim
Sorry, I don't fully understand this yet. Why should more and more memory be allocated from the OS without incrementally releasing it again? Do you think the "leak" could be happening in the Pillow library?
It's pretty clear that the allocation is not coming from the standard Python heap allocator, as there is no significant increase in the (Python-runtime-managed) heap usage. So this means something is allocating from OS memory, and we know that once memory is allocated to the process, it will not be released until termination.
https://docs.python.org/3/c-api/memory.html
I am pretty sure the memory "leak" is in the raw domain, and quite possibly in one of the other libraries; it's not in img2pdf (as far as I can tell). Pillow is probably the only other library that is dealing with large amounts of data.
See the last post in this thread of memory-leak issues (2016-2021):
https://github.com/python-pillow/Pillow/issues/2019
Need to validate which version of Pillow is installed.
I think it's the engine. Compare this (the original example):
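The snippet was lost in this copy of the thread; a hypothetical stand-in, mirroring the loop from the first post with the default engine (which renders through pikepdf):

```python
import img2pdf

for _ in range(100):
    img2pdf.convert("sample.tif")
```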
with this:
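And a stand-in for the second snippet, identical except that it forces img2pdf's internal rendering engine:

```python
import img2pdf

for _ in range(100):
    img2pdf.convert("sample.tif", engine=img2pdf.Engine.internal)
```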
In the first case I see the memory growing larger and larger. The second case stays constant.
Can you confirm that observation?
It's pikepdf. Minimal reproducer:
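The reproducer itself was not captured here; a hypothetical sketch of what a pikepdf-only loop of that shape could look like (not necessarily the snippet that was actually posted):

```python
import io
import pikepdf

# Build and serialize a trivial PDF repeatedly; watch the process memory grow
# even though each Pdf object is closed.
for _ in range(1000):
    with pikepdf.new() as pdf:
        pdf.add_blank_page()
        pdf.save(io.BytesIO())
```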
Indeed, I can confirm it's pikepdf. Thanks for finding this out!
I reported this as https://github.com/pikepdf/pikepdf/issues/329
Thank you @Gistix for reporting this, @tpurves for investigating further, and @mara0004 for insisting I look into this. :)
@Gistix, @tpurves: until this is fixed in pikepdf, you can, as a workaround, pass engine=img2pdf.Engine.internal to convert() so that pikepdf isn't used as the rendering engine.
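Spelled out, the workaround call looks like this (file names are illustrative):

```python
import img2pdf

# Force img2pdf's internal rendering engine instead of the default pikepdf one.
with open("out.pdf", "wb") as out:
    out.write(img2pdf.convert("input.tif", engine=img2pdf.Engine.internal))
```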
I was trying to update the memory_profiler package to allow wrapping class methods so I could find what was "leaking" memory in the img2pdf.pdfdoc/add_imagepage method, and was striking out.
I can confirm that the memory usage seems to behave when using the engine=img2pdf.Engine.internal option on convert. Thanks @josch, et al.
@mara0004 looks like we can close this issue?
https://github.com/pikepdf/pikepdf/issues/329#issuecomment-1186473594
63f7955274
Yes, indeed. Thanks and sorry for the excessive noise!
At least I did not perceive any noise. Thank you for keeping track of things!