jpauclair

Because standing still is going backward…

Tamarin part III – Garbage Collector in Flash (10.0)

with 6 comments

I did a lot of very processing intensive application during the last few years. Applications that required both CPU optimization and memory optimization. And I had a LOT of problem with the Garbage Collector of Flash. It was often holding the CPU for too long thus creating visible pause. That kind of behavior was of course unacceptable for real time application. I tried a lot of things in FlashIDE and FlexSDK to optimize what I could but most of that work was based on trial an error or misunderstanding of how GC really worked.

So I decided to take this one step further on my own time and read the whole Tamarin source.

When I first started thinking of this article, I had no idea what I was getting into.
During my first reading of Tamarin a few months ago, I saw some things about a Mark and Sweep Algorithm… Some ZCT algorithm and I thought everything was kind of trivial. No way!

*everything that I’m going to talk about is based on my own reading of the open-source framework Tamarin and public news from Adobe. I’m not representing Adobe in any way and if my interpretation is bad about anything, please tell me!*

As you know, Adobe as worked on bringing flash to portable device for some time now and obviously, memory management and performances of flash was kind of an issue for smaller device. In last march Lars Hansen did a GREAT review of what were the actual Issues of the GC and some actions that should be taken to fix theses.

I also started a discussion recently with Lars about the F10.1 GC . That GC is more focused on small device and less on application that require a LOT of memory. With some test I did the GC was going nuts and taking 1.6Go of memory with visible freeze of a few seconds. I’ll be filling a bug soon, but if F10.1 is released as the official player soon, a lot of games will fail!

By the way, If anyone have interesting management solution for memory consumed (allocated and freed) by visual assets created on the timeline. Please post your solutions!

But enough with this… Let’s start!

Tamarin GC is a mix of:

    1. Conservative collection with incremental Mark/Finalize/Sweep algorithm
    2. Deferred Reference Counting (DRC) with ZeroCountTable (ZCT)

 

Conservative collection with incremental Mark/Finalize/Sweep

This implies many things:

    1. Mark / Sweep: Every object in the system must have a “mark bit”
    2. Finalize: Object can have a destructor function called before being deleted
    3. Incremental: The Algorithm is processing over multiple frames.
    4. Conservative: The finalize/ sweep phase is scanning the whole Heap for object pointers

Marking:
When the Mark phase start, the garbage collector is aware of “roots”, which are starting points from which all “live” application data should be reachable. The collector starts scanning objects, starting at the roots and fanning outwards. For every object it encounters, it sets the mark bit.
The marking algorithm is incremental and can be processed over multiple frames. To do this the GC is using a “Mark Stack” of item to process and spend some time each frame marking the object in the stack. At the beginning all the GC roots are pushed onto the stack. Items on the stack (grey) are conservatively marked and unmarked (white) GC pointers discovered while processing each item are pushed on to the stack. When the stack is empty all the marking is complete.

//StartIncrementalMark() finishes any outstanding lazy sweeping and pushes all the roots onto the mark stack and calls IncrementalMark

 

StartIncrementalMarking()

		// FIXME (policy): arguably a bug to allow MarkItem to go recursive here without checking for
		// the time it takes.  Probably better just to push the roots and let IncrementalMark handle
		// the marking.

		{
			MMGC_LOCK(m_rootListLock);
			GCRoot *r = m_roots;
			while(r) {
				GCWorkItem item = r->GetWorkItem();
				if(item.ptr)
					MarkItem(item, m_incrementalWork);
				r = r->next;
			}
		}

 

//IncrementalMark() processes items on the mark stack until it's empty or until the time budget for the increment is exhausted. If the mark stack is empty on entry to IncrementalMark then it calls FinishIncrementalMark().

 

//FinishIncrementalMark marks from all the roots again, then creates a work item for the entire stack and marks that, and then calls Sweep

But what happens if an object is created and linked to another object that have already been marked (black). The logical answer would be: “the mark bit was not set so the object is available for garbage collection”. To handle this (bad) behavior, the GC is using what is called a “Write barrier”. The algorithm uses a tri-color flags with write barriers. Every object has 3 states: black, gray and white.

    1. Black means the object has been marked and is no longer in the work queue
    2. Gray means the object is in the work queue but not yet marked
    3. White means the object isn’t in the work queue and hasn’t been marked

The first increment will push all the roots on to the queue, thus after this step all the roots are gray and everything else is white. As the queue is processed all live objects go through two steps, from gray to black and white to gray. Whenever a pointer to a white object is written to a black object we have to intercept that write and remember to go back and put the white object in the work queue, that’s what a write barrier does.

Finalizing:
A finalizer is a method which will be invoked by the GC when an unreachable object is about to be destroyed.
In fact, Finalize() is called just before sweeping all objects. It can be considered as a sub-part of sweeping.
To be eligible for finalization, an object must inherits from GCFinalizedObject or RCFinalizedObject. For these objects, iIt’s similar to a (c++) destructor. It differs from a destructor in that it is usually called nondeterministically, i.e. in whatever random order the GC decides to destroy objects. C++ destructors are usually invoked in a comparatively predictable order, since they’re invoked explicitly by the application code. But since the sweeping processing does not traverse objects in a “logical” order it’s impossible to tell what object will be finalized first.

The finalization process examine the bit vector of every heap object to find dead objects, so it touches every block in the heap (even if only to retrieve the address of the bit vector).
Touching every block in the heap may be very slow and may result in visible pauses. Plus, each finalizer can take arbitrarily long to run, there can be lots of finalizers (every reference counted object is finalized), and every finalizable dead object is finalized at the end of the collection.

To experience that, you can create small app that play an animation. Add to the scene an array of 5 Million small strings or XML Node. If you let the animation play, there is a good chance you will see a visible pause in the animation. Even if you have no CPU intensive code, when the GC will be triggered and the finalization / Sweep will run, there will be a freeze.

		void GCAlloc::Finalize()
	{
		m_finalized = true;
		// Go through every item of every block.  Look for items
		// that are in use but not marked as reachable, and delete
		// them.

		GCBlock *next = NULL;
		for (GCBlock* b = m_firstBlock; b != NULL; b = next)

Sweeping:
In the Sweep phase, every object that wasn’t marked in the Mark phase is destroyed and its memory reclaimed. If an object didn’t have its mark bit set during the Mark phase, that means it wasn’t reachable from the roots anymore, and thus was not reachable from anywhere in the application code. But the algorithm is conservative; hence the collector assumes that every memory location might potentially contain a GC pointer.
That means that it might occasionally turn up a “false positive.” A false positive is a memory location that looks like it contains a pointer to a GC object, but it’s really just some JPEG image data or an integer variable or some other unrelated data. When the GC encounters a false positive, it has to assume that it MIGHT be a pointer since it doesn’t have an exact description of whether that memory is a pointer or not. So, the not-really-pointed-to object will be leaked. With a conservative GC, the leaks tend to be random, such that they don’t grow over time.

//Sweep() calls Finalize() and then sweeps any entirely empty pages

 

//Finalize() calls the Finalize method on all the allocators owned by the GC; this causes each to traverse its block list and run the finalizer of each unmarked finalizable object, and compute the number of live items per block. Blocks without live objects are added to lists that are processed by Sweep Non-empty pages are not swept during collection, but on demand as more memory is needed, and sometimes eagerly before a new collection starts. Since marking and sweeping are interleaved with mutator work, the main latency in the collection process comes in at the beginning, when StartIncrementalMark sweeps unswept pages, and at the end, when FinishIncrementalMark scans the stack and examines every page in the system and runs the finalizers for all objects that are going to be reclaimed (through GC::Finalize). 

 

Sweep()

		// ISSUE: this could be done lazily at the expense other GC's potentially expanding
		// unnecessarily, not sure its worth it as this should be pretty fast
		GCAlloc::GCBlock *b = smallEmptyPageList;
		while(b) {
			GCAlloc::GCBlock *next = b->next;
#ifdef _DEBUG
			b->alloc->SweepGuts(b);
#endif
			b->alloc->FreeChunk(b);

			sweepResults++;
			b = next;
		}		

 

Deferred Reference Counting (DRC)

Reference counting is a technique of keeping the number of references (pointers) to a an object. When a reference count get to zero, it’s safe to deallocate the object (which are no longer referenced). Classic Reference counting are hard to maintain and need explicit count management each time an object is linked/unlinked.

The main problem with RC is Circular References. If two object point to each other, they both have at least one reference and can never be collected! (One more reason to have the Mark/Sweep algorithm ready to clean everything up!)

Tamarin use Deferred Reference Counting (DRC) with Zero Count Table (ZCT)
In Deferred Reference Counting, a distinction is made between heap and stack references.
Stack references to objects tend to be very temporary in nature. Stack frames come and go very quickly. So, performance can be gained by not performing reference counting on the stack references.

Heap references are different since they can persist for long periods of time. So, in a DRC scheme, we continue to maintain reference counts in heap-based objects. So, reference counts are only maintained to heap-to-heap references.
We basically ignore the stack and registers. They are considered stack memory.
But since we ignore the stack, we can’t just delete an object when it’s reference count drop to zero! There could be (ignored) pointers in the stack to that object.

To deal with this, there is a mechanism called the Zero Count Table (ZCT).
When an object reaches zero reference count, it is not immediately destroyed; instead, it is put in the ZCT and when the ZCT is full, it is “reaped” to destroy some objects.
If an object is in the ZCT, it is known that there are no heap references to it. So, there can only be stack references to it. When reaped, the GC scans the stack to see if there are any stack references to ZCT objects. Any objects in the ZCT that are not found on the stack are deleted.

 

But when is the GC triggered?

The algorithm that triggers the GC collection (or Heap grow) is simple and is based on memory allocation since last collect and the fraction of the total heap it represent.

	 /* When an allocation fails because a suitable memory block is
	 * not available, the garbage collector decides either to garbage
	 * collect to free up space, or to expand the heap.  The heuristic
	 * used to make the decision to collect or expand is taken from the
	 * Boehm-Demers-Weiser (BDW) garbage collector and memory allocator.
	 * The BDW algorithm is (pseudo-code):
	 *
	 *    if (allocs since collect >= heap size / FSD)
	 *      collect
	 *    else
	 *      expand(request size + heap size / FSD)
	 *
	 * The FSD is the "Free Space Divisor."

The first thing to determine is when we decide to start marking. Now that we know when we start marking there are two conflicting goals to achieve in selecting the marking time slice:

    1. Maintain the frame rate
    2. Make sure the collector gets to the sweep stage soon enough to avoid too much heap expansion

If we don’t maintain the frame rate the movie will appear to pause and if we don’t mark fast enough the mutator could get ahread of the collector and allocate memory so fast that the collection never finishes and memory grows unbounded. The ideal solution will result in only one mark incremental per frame unless the mutator is allocating memory so fast we need to mark more aggressively to get to the sweep. So the frequency of the incremental marking will be based on two factors: the rate at which we can trace memory and the rate at which the mutator is requesting more memory.


void* GC::AllocBlock(int size, int pageType, bool zero)
{
GCAssert(size > 0);

// perform gc if heap expanded due to fixed memory allocations
if(!marking && !collecting && policy.queryStartCollectionAfterHeapExpansion())
{
if(incremental && !nogc)
StartIncrementalMark();
else
Collect();
}

 


bool GCPolicyManager::queryStartCollectionAfterHeapExpansion()
{
// this is the existing policy but it's not a good idea
return (blocksInHeapAfterPreviousAllocation > lowerLimitHeapBlocks() &&
blocksInHeapAfterPreviousAllocation < heap->GetTotalHeapSize() &&
querySufficientTimeSinceLastCollection());
}

 

//Issue: The intent of the free space divisor is not documented, as with too much else about MMgc. Given that block allocation updates both allocsSinceCollect and totalGCPages by the same amount when a block is allocated or freed, the test is that a>(t+a)/4, or a>t/3. I have no idea whether that was the intent, but I'm inclined to think not.

 

	uint64_t GCPolicyManager::zctNewReapThreshold(uint64_t zctSize, uint64_t zctOccupancy)
	{
		// how many objects trigger a reap, should be high
		// enough that the stack scan time is noise but low enough
		// so that objects to away in a timely manner

		const int ZCT_REAP_THRESHOLD = 512;

		uint64_t zctReapThreshold = zctOccupancy + ZCT_REAP_THRESHOLD;
		if(zctReapThreshold > zctSize * GCHeap::kBlockSize/sizeof(uintptr_t))
			zctReapThreshold = zctSize * GCHeap::kBlockSize/sizeof(uintptr_t);
		return zctReapThreshold;
	}

There is also the well known AS3 localConnection hack:


try {
new LocalConnection().connect('foo');
new LocalConnection().connect('foo');
} catch (e:*) {}
// the GC will perform a full mark/sweep on the second call.

So… How to use the flash GC?

But in general, since the GC is scanning the whole object range before sweeping (and finalize some objects like Strings, XML Nodes etc.) it’s best to keep a minimum heap size and object number.
If you have tens of thousands objects instantiated in your project, there is a good chance of having visible pause.
Preventing that can’t really be done with a “magic trick” and require a good engine architecture supporting optimized live object serialization to make the number of instance lower.
For instance, you can use AMF or your own serialization to write most object into a mapped ByteArray if they don’t require fast access.

But for many projects, the real problem is coming from assets created on the timeline (Flash IDE) because we have no control over theses. We can’t tell FlashIDE to “re-use” a sprite that is created a lot of time. We can’t make “real” object pooling with the timeline. And we have no information on rules determining memory allocation for these “timeline graphics” from Adobe. Using good tweening library could be a solution for some basic animations.

The GC of flash caused me a lot of problem in the past but it’s not because it’s bad. It’s just because we are doing very big application. Since we use a lot of visual object (sprites) in flash, there is often a LOT of memory added to the heap and the triggers are harder to manage because of the objects size.
For a game engine, having access to explicit collection could be great. But Adobe made a good call by making it conservative and hence accessible to a lot more people.

 

Interesting block of code:

	void GCHashtable::put(const void *key, const void *value)
	{
		GCAssert(table != NULL);
		int i = find(key, table, tableSize);
		if (!equals(table[i], key)) {
			// .75 load factor, note we don't take numDeleted into account
			// numValues includes numDeleted
			if(numValues * 8 >= tableSize * 3)
			{
				grow();
				// grow rehashes
				i = find(key, table, tableSize);
			}

void GCHashtable::grow()
	{
		int newTableSize = tableSize;

		unsigned int occupiedSlots = numValues - numDeleted;
		GCAssert(numValues >= numDeleted);

		// grow or shrink as appropriate:
		// if we're greater than %50 full grow
		// if we're less than %10 shrink
		// else stay the same

 

Sources and References:

Can take more?

Next post

Interpreting the future of Flash from the sources!


Written by jpauclair

December 23, 2009 at 12:53 am

6 Responses

Subscribe to comments with RSS.

  1. That was downright fascinating! I wonder: in your experience studying the GC and implementing real-time projects for it, what techniques would you recommend so as to avoid noticeable pauses due to GC? I’d especially be interested if the article was tailored to games, since you and I both develop Flash-based MMOs. In the meantime, I’ll keep reading these articles. Every one has been great! Keep up the good work!

    Jackson Dunstan

    December 23, 2009 at 4:48 pm

    • I prefer to keep specific implementation from my job out of this blog. I’m sure you can understand why.
      But in general, since the GC is scanning the whole object range before sweeping plus doing finalization on certain objects (like String and XML Node etc.) it’s best to keep a minimum heap size and object number.
      If you have dozens of thousands objects instantiated in your project, there is a good chance of having visible pause.
      Preventing that can’t really be done with a “magic trick” and require a good engine architecture supporting optimized live object serialization to make the number of instance lower.
      For most project, the real problem is coming from assets created on the timeline because we have no control over theses. We can’t tell FlashIDE to “re-use” a sprite that is created a lot of time. We can’t make “real” object pooling with the timeline.
      I also forgot to conclude my post!
      The GC of flash caused me a lot of problem in the past but it’s not because it’s bad. It’s just because we are doing very big application. Since we use a lot of visual object (sprites) in flash, there is often a LOT of memory added to the heap and the triggers are harder to manage because of the objects size.
      For a game engine, having access to explicit collection can be great. But Adobe made a good call by making it conservative and hence accessible to a lot more people.

      jpauclair

      December 24, 2009 at 12:16 am

      • I understand wanting to keep specific implementation out of the blog. That’s been my policy too. I was asking for some general ideas… and you’ve provided them! Thanks again for the article and for the followup ideas and conclusion. It certainly gives me a lot to think about both architecturally and as far as day-to-day practices go.

        Jackson Dunstan

        December 24, 2009 at 3:12 am

  2. It’s indeed useful when you want to do “serious work” to understand the GC. But as you stated, there’s a whole unknown part of the Flash Player which is how the whole timeline and sprites are handled.

    For one of our game projects, we had to rewrite the whole timeline player in haXe and with some reuse optimizations we were able to get more performances than native implementation…

    Nicolas

    January 5, 2010 at 4:27 am

  3. [...] DisableIncrementalGC = 1|0 Lets you enable/disable Garbage Collector Incremental policies (more info on the GC) [...]

  4. [...] The flash GC doesn’t guarantee that all “dead” objects will be collected in a single p… But having those blue rectangle collected almost randomly is really fun to see. [...]


Leave a Reply