Memory 'leak' #127

Closed
opened 2 years ago by Gistix · 22 comments
Gistix commented 2 years ago

img2pdf is blazing fast and we didn't have any issues with our Django app that converts multipage TIFF to PDF until we had many concurrent users doing quick document switches.

The initial memory usage is minimal (around 50 MB), but it keeps adding up until all PDF pages come out blank (the page count is correct, but every page is entirely white).

Below is a simple test; just watch the memory usage go up:

```
from io import BytesIO
import img2pdf
from tqdm import tqdm
import gc
from PIL import Image, ImageSequence

path = '13.tif'

# minimum memory usage for the code below is 500 MB
def pil_test():
    for i in range(10000):
        file = BytesIO()

        with Image.open(path) as image:
            images = []

            for page in ImageSequence.Iterator(image):
                images.append(page.convert("RGB"))

            if len(images) == 1:
                images[0].save(file, format='PDF')
            else:
                images[0].save(file, format='PDF', save_all=True, append_images=images[1:])

            del images

        file.seek(0)
        #file.close()
        # with collect, memory usage never goes over 700 MB; without it, memory
        # can go over 3 GB before rubber-banding back to 500 MB
        gc.collect()

# img2pdf is much faster and uses 10x less memory for multipage TIFF, but after
# a while the memory adds up and pages start coming out blank (white)
def img2pdf_test():
    with open(path, 'rb') as f:
        for i in range(10000):
            img2pdf.convert(f)
            f.seek(0)
            gc.collect()  # Doesn't work

#pil_test()
img2pdf_test()
```

josch commented 2 years ago
Owner

Thanks for the bug report. Which Django app are you using img2pdf with?

Gistix commented 2 years ago
Poster

It's a Document Manager app I created to be used internally at our company.
We're using Django 3.2.6 with Whitenoise for static files, and it is hosted locally on our server with Waitress.

We deliver the file to the browser using `FileResponse` with a BytesIO object.
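For reference, a minimal sketch of that delivery path (the view name and input path are made up for illustration):

```
from io import BytesIO

import img2pdf
from django.http import FileResponse

def serve_pdf(request):  # hypothetical view
    buf = BytesIO(img2pdf.convert('13.tif'))
    return FileResponse(buf, content_type='application/pdf', filename='document.pdf')
```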

The mentioned issue also occurs if you just run the above code stand-alone; you might want to remove line #3 (the `from tqdm import tqdm` import) if you don't have tqdm installed.

Edit: Running Python 3.9.6 on Windows 10 20H2/Windows Server 2012R2

josch commented 2 years ago
Owner

Okay, cool! Unfortunately this issue is very low on my priority list, so if you do this for some paid job, maybe you can find out what the issue is and send a patch that I can apply? I only maintain img2pdf in my free-time as a hobby project and my own usage is as an executable and not as a module, so this issue does not affect me personally.

Gistix commented 2 years ago
Poster

I'm currently using the PIL method above as a workaround. I won't be able to look into img2pdf right now, but since img2pdf is just so much better, I'll definitely take a look later. Thanks!


mara0004 commented 2 years ago

> Unfortunately this issue is very low on my priority list [...] and my own usage is as an executable and not as a module, so this issue does not affect me personally.

To be frank, I find this reaction quite disappointing. A memory leak is a serious issue that ought to be addressed, regardless of the personal use case.

josch commented 2 years ago
Owner

> > Unfortunately this issue is very low on my priority list [...] and my own usage is as an executable and not as a module, so this issue does not affect me personally.
>
> To be frank, I find this reaction quite disappointing. A memory leak is a serious issue that ought to be addressed, regardless of the personal use case.

I'm sorry that you feel that way. I agree that this issue should be addressed, but I also have a 10-month-old daughter to take care of, and thus I don't have much time for my hobby projects.

This is an unpaid free software project. If somebody doesn't like it, then they don't have to use it. If somebody has a really important issue they want to see fixed, then they are always free to spend the time and implement it. I will gladly merge patches. But as an unpaid volunteer I don't owe anything to the users of my software.


mara0004 commented 2 years ago

That's very understandable, naturally. I also don't have as much time for my hobby projects as I'd like to...

However, what your comment was about (or at least how I understood it) is how to prioritize different img2pdf issues, and there, addressing a memory leak should rank higher than, say, working on viewer properties or adding new features.

Of course, you don't have to do this. It's your project, as you say. However, this is really the only image-to-PDF converter that is both lossless and fast, and a replacement is nowhere in sight. That's why I would be glad if img2pdf could keep being maintained in a way that cares about others' use cases as well.

> If somebody has a really important issue they want to see fixed, then they are always free to spend the time and implement it.

The trouble is that the img2pdf code is highly complex and working with it requires in-depth knowledge of image format internals. The number of people who are up to this task is very small.


tpurves commented 2 years ago

Thanks for the awesome library. I just started playing around with it to convert a proprietary TIFF wrapper to PDF, and I am also seeing the memory issue.

One solution, of course, is to run the conversion for each file in a new process; that is certainly workable for some use cases.
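A rough sketch of that isolation approach (hypothetical file names; each conversion runs in a multiprocessing child, so all memory it allocates is returned to the OS when the child exits):

```
import multiprocessing

import img2pdf

def convert_one(src, dst):
    # runs in a child process; its memory is freed when the process exits
    with open(dst, "wb") as out:
        out.write(img2pdf.convert(src))

if __name__ == "__main__":
    for path in ["a.tif", "b.tif"]:  # hypothetical inputs
        p = multiprocessing.Process(target=convert_one, args=(path, path + ".pdf"))
        p.start()
        p.join()
```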

However, I plan on doing some debugging to see where the memory leak is; I am pretty sure it's in the way the image is inspected. I have tried two different methods:

```
if filebased:
    pages.append(dumpTiff(file, basename, seq, image, offset))
else:
    pages.append(readTiff(file, image, offset))
```

In the first case, pages is a list of temporary filenames where I write the TIFF images.
In the second case, pages is a list of bytes.
In a third case, not shown, it was an object that had the filename, start, and length, plus a read method to return bytes.

In all cases there is no memory leak until I call the img2pdf.convert method:

```
with open(basename + ".pdf", 'wb') as out:
    out.write(img2pdf.convert(pages))
```

Thanks again. When I find the leak, I'll post back here and we can figure out how to incorporate the fix back into the release.

-tim

josch commented 2 years ago
Owner

Thank you @tpurves for looking into this! I hope you find the culprit! Another thing you could try is whether the problem occurs with different rendering engines or with different input types. The code paths are quite different depending on whether the input is a JPG, a PNG with or without transparency, a black/white TIFF, and so on.

Thanks! 👍


mara0004 commented 2 years ago

Yes, thanks @tpurves for starting to analyse the issue! I also tried to do some investigation but am unfortunately not familiar enough with the code to determine the cause.


tpurves commented 2 years ago

Well, I don't believe there is a memory "leak" as such, but rather something about the way that Python or the libraries pull memory from the OS. The point being: once a block of memory is allocated to the process, it is not released until the process is terminated.

To test, I created/used the following wrappers:

```
from guppy import hpy
import gc

hp = hpy()  # guppy heap inspector used by the wrapper below

def heap(func):
    def wrapper(*args, **kwargs):
        hp.setrelheap()
        heap_before = hp.heap()
        result = func(*args, **kwargs)
        gc.collect()
        heap_after = hp.heap()
        leftover = heap_after - heap_before
        print(leftover.bytype)
        print(leftover.byrcs)
        return result
    return wrapper
```

and for the profile wrapper:

```
from memory_profiler import profile
```

And the following test function (the `convert` helper below calls into the img2pdf library):

```
@heap
@profile
def test():
    files = ["sample", "10159ASN2003274536DS", "10159ASN2003274539DS", "10159ASN2003274544DS"]
    for loop in range(0, 1):
        [convert('testdata/' + file) for file in files]
```

The results were interesting.

Regardless of how many times I run the loop (i.e., the range command), the heap dump remains the same. However, the memory consumed by the process goes up by ~60 MB on each loop:

```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74     37.4 MiB     37.4 MiB           1   @heap
    75                                         @profile
    76                                         def test():
    77
    78     37.4 MiB      0.0 MiB           1    files = ["sample","10159ASN2003274536DS","10159ASN2003274539DS","10159ASN2003274544DS"]
    79     60.3 MiB      0.0 MiB           2    for loop in range(0,1):
    80     60.3 MiB     22.9 MiB           7            [convert('testdata/' + file) for file in files]
```

Meanwhile, the heap at the end stays consistent at ~224 KB, regardless of the number of loops:

```
Partition of a set of 1993 objects. Total size = 224995 bytes.
 Index  Count   %     Size   % Cumulative  % Type
     0    669  34    48234  21     48234  21 str
     1    516  26    43760  19     91994  41 tuple
     2    290  15    30461  14    122455  54 bytes
     3     63   3    24480  11    146935  65 dict
     4    109   5    19200   9    166135  74 types.CodeType
     5     13   1    13832   6    179967  80 type
     6     96   5    13056   6    193023  86 function
     7     33   2    12008   5    205031  91 list
     8      4   0     8844   4    213875  95 re.Pattern
     9    115   6     3308   1    217183  97 int
```

TL;DR - I think what is happening (without looking at the Python runtime or the libraries that are used) is that something is allocating directly from the OS rather than using the Python allocators, and we know that once a block is requested and allocated from the OS to the process, it will not be released until the process is terminated. I have tested this code on Windows, Linux, and OSX, and the behavior is exactly the same.

TL;DR2 - What I am planning: fork the conversion as a sub-process (as sketched above). I realize this will be a little slower, but it will address the memory issue.

Thoughts? What am I missing?

-tim


mara0004 commented 2 years ago

> I think what is happening [...] is that something is allocating directly from the OS rather than using the Python allocators, and we know that once a block is requested and allocated from the OS to the process, it will not be released until the process is terminated.

Sorry, I don't fully understand this yet. Why should more and more memory be allocated from the OS without incrementally releasing it again? Do you think the "leak" could be happening in the Pillow library?


tpurves commented 2 years ago

It's pretty clear that the allocation is not coming from the standard Python heap allocator, as there is no significant increase in the (Python-runtime-managed) heap usage. So this means something is allocating from OS memory, and we know that once memory is allocated to the process, it will not be released until termination.

https://docs.python.org/3/c-api/memory.html

```
The three allocation domains are:

Raw domain: intended for allocating memory for general-purpose memory buffers where the allocation must go to the system allocator or where the allocator can operate without the GIL. The memory is requested directly to the system.

“Mem” domain: intended for allocating memory for Python buffers and general-purpose memory buffers where the allocation must be performed with the GIL held. The memory is taken from the Python private heap.

Object domain: intended for allocating memory belonging to Python objects. The memory is taken from the Python private heap.
```

I am pretty sure the memory "leak" is in the raw domain, and quite possibly in one of the other libraries; it's not in img2pdf (as far as I can tell). Pillow is probably the only other library that is dealing with large amounts of data.

See the last post in this (2016-2021) memory leak issue:
https://github.com/python-pillow/Pillow/issues/2019

We need to validate which version of Pillow is installed.
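One way to test that hypothesis is to compare the Python-tracked heap against the process RSS (a sketch, reusing the '13.tif' input from the original report; psutil is a third-party dependency used here only to read the RSS):

```
import os
import tracemalloc

import img2pdf
import psutil  # third-party; used only to read the process RSS

tracemalloc.start()
proc = psutil.Process(os.getpid())

with open("13.tif", "rb") as f:
    for i in range(1000):
        img2pdf.convert(f)
        f.seek(0)

traced, peak = tracemalloc.get_traced_memory()
rss = proc.memory_info().rss
# If RSS keeps growing while the traced (Python-heap) figure stays flat, the
# growth comes from raw/C allocations outside the Python object allocator.
print(f"python-traced: {traced / 1e6:.1f} MB, process RSS: {rss / 1e6:.1f} MB")
```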

> Sorry, I don't fully understand this yet. Why should more and more memory be allocated from the OS without incrementally releasing it again? Do you think the "leak" could be happening in the Pillow library?

josch commented 2 years ago
Owner

I think it's the engine. Compare this (the original example):

```py
import img2pdf

with open("./src/tests/input/normal.jpg", 'rb') as f:
    for i in range(100000):
        img2pdf.convert(f)
        f.seek(0)
```

with this:

```py
import img2pdf

with open("./src/tests/input/normal.jpg", 'rb') as f:
    for i in range(100000):
        img2pdf.convert(f, engine=img2pdf.Engine.internal)
        f.seek(0)
```

In the first case I see the memory growing larger and larger. The second case stays constant.

Can you confirm that observation?

josch commented 2 years ago
Owner

It's pikepdf. Minimal reproducer:

```py
import pikepdf

for i in range(100000):
    with pikepdf.new() as pdf:
        pdf.add_blank_page()
```
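One way to watch the growth without external tools (a sketch; the resource module is Unix-only, and ru_maxrss reports the peak RSS, which climbs steadily if the leak is present):

```py
import resource

import pikepdf

for i in range(100000):
    with pikepdf.new() as pdf:
        pdf.add_blank_page()
    if i % 10000 == 0:
        # peak RSS so far (kilobytes on Linux, bytes on macOS)
        print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
```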

mara0004 commented 2 years ago

Indeed, I can confirm it's pikepdf. Thanks for finding this out!

josch commented 2 years ago
Owner

I reported this as https://github.com/pikepdf/pikepdf/issues/329

Thank you @Gistix for reporting this, @tpurves for investigating further, and @mara0004 for insisting I look into this. :)

josch commented 2 years ago
Owner

@Gistix, @tpurves until this is fixed in pikepdf, as a workaround, you can pass `engine=img2pdf.Engine.internal` to `convert()` so that pikepdf isn't used as the rendering engine.
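For example (filenames are illustrative):

```py
import img2pdf

with open("out.pdf", "wb") as out, open("input.tif", "rb") as f:
    out.write(img2pdf.convert(f, engine=img2pdf.Engine.internal))
```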


tpurves commented 2 years ago

> @Gistix, @tpurves until this is fixed in pikepdf, as a workaround, you can pass `engine=img2pdf.Engine.internal` to `convert()` so that pikepdf isn't used as the rendering engine.

I was trying to update the memory_profiler package to allow wrapping class methods so I could find what was "leaking" memory in the img2pdf pdfdoc/add_imagepage method, and was striking out.

I can confirm that the memory usage seems to behave when using the `engine=img2pdf.Engine.internal` option on `convert()`.

Thanks @josch, et al.

josch commented 2 years ago
Owner
@mara0004 looks like we can close this issue? https://github.com/pikepdf/pikepdf/issues/329#issuecomment-1186473594 https://github.com/pikepdf/pikepdf/commit/63f7955274119d9da2caabc6cf73706c55284f51

mara0004 commented 2 years ago

Yes, indeed. Thanks and sorry for the excessive noise!

josch commented 2 years ago
Owner

At least I did not perceive any noise. Thank you for keeping track of things!

josch closed this issue 2 years ago