Python 中的垃圾回收机制是如何工作的?

翻译 Summer ⋅ 于 1个月前 ⋅ 275 阅读 ⋅ 原文地址
这是一篇协同翻译的文章,目前翻译进度为 20%,你可以点击『我来翻译』按钮来 参与翻译

我将在这篇文章中讨论一下在CPython中是如何实现垃圾回收机制的。

CPython中垃圾回收的主要思路

  1. 维护引用计数器 。对于每一个对象,都有一个对于该对象的引用次数的计数器。如果这个计数器的值减为了 0 ,这就代表这个对象在程序中已经没用了,那么该对象所占用的内存就会被释放。

  2. 定期检测是否循环引用。 当引用计数器的值下降到 0 时来释放内存的机制并不适用于所有的情况。假如两个对象 AB ,其中 A 拥有对 B 的引用,B 拥有对 A 的引用。 这就称之为循环引用。在这种情况下,这两个对象也没有存在的价值了,此时 AB 都应该被垃圾回收处理。但是,这两个对象的引用计数值不为零, 所以内存会一直被占用。为了解决这个问题,CPython 通过使用算法来检测是否存在循环引用并释放循环引用中的对象。

  3. 通过启发式算法提升性能。 越晚创建的对象更可能需要被回收。 CPython引入了一个 分代回收 的概念来判断一个对象使用的相对年龄。年轻一代是指最新被创建出来的对象,而老一代则代表早前创建的对象。每个对象都确定的属于某一代。 当垃圾回收机制执行时, CPython 会优先尝试回收年轻一代的对象。CPython会定期回收老一代的对象(由启发式算法确定该回收执行的效率).
Vimiix 翻译于 1个月前

The Garbage Collection Lifecycle

It's probably most instructive to go through the lifecycle of how the CPython garbage collector would typically get run. Let's create a new object and see what happens with the garbage collector:

  • Python wants to allocate a new object. To do this, it makes a call to _PyObject_GC_Malloc. This method assigns the object some memory locations and adds the object to the garbage collector's first generation (we'll call it generation 0). The method then checks to see if the number of objects in generation 0 is greater than some threshold. If it is and the garbage collector is not currently running, then a call to collect_generations is made to begin garbage collection. Otherwise, the object is just allocated normally.

  • Python starts to do garbage collection when collect_generations gets called. This method figures out what generation to do garbage collection on (by default CPython has 3 generations but this can be modified with the GC module. In addition, younger generations have lower indices, so generation 0 is the youngest generation). Python will loop over all generations (from oldest to youngest) and detect whether a particular generation's object count is greater than some threshold. If it is, then it will merge all younger generations with the current generation and perform garbage collection on that generation by calling collect. NOTE: Python wants to do garbage collection on generation 0 for better performance, because this has the newest objects and also will have the fewest objects to iterate over. Doing garbage collection on the oldest generation is equivalent to collecting over all objects because doing garbage collection on generation i will use all objects in generations 0 through i.
  • The collect method will run garbage collection on a specified generation. What this amounts to is running the reference cycle detection algorithm (explained later) and finding a set of reachable and unreachable objects in a particular generation. The reachable objects will be merged into the next higher generation (i.e. if collect was run on generation i, then the reachable objects from generation i would be merged into generation i+1). For the unreachable objects, CPython will make all necessary finalizer callbacks, make weak ref callbacks, and finally deallocate the objects.

  • Finally, the internal state of the garbage collection module will be updated as collect finishes performing its duties.

CPython's Algorithm for Detecting Reference Cycles

Python attempts to find reference cycles within a generation. Confining the search for reference cycles to a single generation decreases the amount of work that has to be done in a single collection (if it is one of the generations that holds younger objects).

To find reference cycles, Python uses young, the pointer to the head of the list of objects for the generation that's being garbage collected, and runs the following:

update_refs(young)
subtract_refs(young)
gc_init_list(&unreachable)
move_unreachable(young,  &unreachable)

The update_refs method makes a copy of the reference count for every object in the generation so that the garbage collector can mutate its own version of the reference count without messing with the real reference count.

Then subtract_refs goes through each of the objects i in the generation being garbage collected, and decrements the reference counts on any objects j in the generation list that are referenced by object i. After this method has run, the reference count on an object in the generation should equal the number of references to that object from objects which do not belong to that generation (since all references from objects within the same generation have been removed).

Now comes the fun part. The move_unreachable method scans through the young list and moves objects with a reference count of 0 into the unreachable list and changes their reference count to GC_TENTATIVELY_UNREACHABLE. Objects with a non-zero reference count are marked as GC_REACHABLE and the objects they reference are traversed and marked as GC_REACHABLE then moved to the end of the young list so they too can be traversed later.

The reason that objects with a reference count of 0 are tentatively unreachable is as follows. Suppose object A has been marked as tentatively unreachable and is referenced by some object B. Suppose that B is in the same generation as A and is actually reachable from outside the generation, but that B comes later in the young list than A. Then A would be sent to the unreachable list when move_unreachable scans over it. However, when move_unreachable scans over B, it will notice that it's reference count is non-zero, mark it as GC_REACHABLE, and traverse B's references and mark them as reachable. Now, A has become GC_REACHABLE as well and has been moved to the end of the young list so that it's references can also be marked as GC_REACHABLE. Thus, we only know that an object is unreachable after move_unreachable has scanned over the entire young list.

Once the entire young list has been traversed, then all the items left in the unreachable list are definitely unreachable and so they can be deallocated. The items in the young list are then merged into older generations.

Performance Notes:

  • CPython's garbage collector still stops the world, but does so more infrequently than other implementations. Reference count deallocation increases the time between collections because the number of objects in a generation decreases whenever an object's reference count falls to 0 and gets deallocated. Since collections are triggered when the number of objects in a generation are above a threshold, reference count deallocation decreases the number of collections as long as there aren't too many reference cycles.
  • If the hypothesis that younger objects are the objects more likely to need to be garbage collected is true, then running the garbage collector on younger generations can significantly reduce total runtime.
  • Memory fragmentation occurs. Reference counting will naturally cause memory to get fragmented since the deallocation of an object means there will be small chunks of memory that get added back onto the heap. Some garbage collectors deal with fragmentation by copying all live objects into a different section of memory and freeing up an entire section of memory, but CPython doesn't.
  • Normal execution is slower. There is an extra check whenever an object gets allocated, referenced, or dereferenced. This means that instead of running normally, then performing a garbage collection all at once like some tracing garbage collectors do, CPython's reference count garbage collector will spread the work across normal activities and perform collection less frequently. Unfortunately, this means the total amount of work that goes into garbage collection is higher (because of the added reference count checks).
本文中的所有译文仅用于学习和交流目的,转载请务必注明文章译者、出处、和本文链接
我们的翻译工作遵照 CC 协议,如果我们的工作有侵犯到您的权益,请及时联系我们。

文章译者

回复数量: 0
    暂无评论~~
      请勿发布不友善或者负能量的内容。与人为善,比聪明更重要!
    Ctrl+Enter