Thursday, October 30, 2008

The cost of [ThreadStatic] attribute

First of all, raw results:

Instance field: Operations: 2.650 B/s.
Static field: Operations: 2.630 B/s.
Volatile instance field: Operations: 2.649 B/s.
[ThreadStatic] field: Operations: 43.611 M/s.
Thread data slot: Operations: 4.068 M/s.

The test is actually quite simple: we read the specified field in a loop. As before, it runs on a Core 2 Duo @ 2.66 GHz. The code can be found in the DataObjects.Net 4.0 test suite, see Xtensive.Core\Xtensive.Core.Tests\DotNetFramework\FieldTypeTest.cs.

- Reading a regular, static or volatile field is quite cheap: ~ 0.2x in the previously introduced metrics, i.e. 20% of virtual method call time
- [ThreadStatic] fields are quite costly in comparison to the others: ~ 14x.

Now the main question: why? It isn't obvious why [ThreadStatic] access is ~ 60 times slower than static field access.

JITted [ThreadStatic] access code always consists of two parts:
- A call to a system routine returning the address of the [ThreadStatic] field by its token
- A regular field access instruction.

Obviously, the first part (the call) "eats" almost the whole execution time: there is no more efficient way to get the address of such a field by its token than a hash table lookup. As I've mentioned before, reading from a system hash table takes ~ 10x, which is nearly what we see in this case.

Why is it implemented this way in .NET? I can't imagine why they didn't use some faster approach. E.g. I suspect calculating the lowest stack boundary (as well as the upper one) from the current stack pointer value is quite a simple operation - something like a bitwise AND. Why can't we store the address of the first [ThreadStatic] location at a fixed address nearby, and use a constant offset for each [ThreadStatic] field relative to the address of the first one? In that case it would take ~ 1x to access it...

OK, that's what could be; in reality we have 14x.

Finally, there are thread data slots as well. But they're 10 times slower than [ThreadStatic] fields, so it's always better to simulate their behavior with e.g. a Dictionary stored in a [ThreadStatic] field.
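Such a simulation might look like the following sketch. PerThreadCache and its members are illustrative names, not from the test suite; the point is that the hot path pays one [ThreadStatic] read (~ 14x) plus a dictionary lookup instead of a data slot lookup (~ 140x):

```csharp
using System.Collections.Generic;

public static class PerThreadCache
{
    // Each thread sees its own instance of this field. The field is
    // initialized lazily, because [ThreadStatic] initializers run only
    // for the first thread that touches the type.
    [System.ThreadStatic]
    private static Dictionary<string, object> slots;

    public static void Set(string name, object value)
    {
        if (slots == null)
            slots = new Dictionary<string, object>();
        slots[name] = value;
    }

    public static object Get(string name)
    {
        object value = null;
        if (slots != null)
            slots.TryGetValue(name, out value);
        return value;
    }
}
```

Unlike Thread.GetData/SetData, this also gives typed, allocation-free access once the dictionary exists.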

Wednesday, October 29, 2008

The cost of method calls

Here I'll talk about the cost (or performance) of various ways of method invocation in .NET.

First of all, let's assume:
- Virtual method call time is 1x. This is about 600M calls / sec. on a Core 2 Duo @ 2.66 GHz. To be precise, we're talking about an instance method taking no arguments and returning a single Int32 value (i.e. an average property getter).

- Delegate method call time is 1.5x (the same method invoked via a delegate pointing to it)
- Interface method call time is 2x (the same method invoked on a reference of interface type)

A bit surprising, yes? The explanation is here. Briefly, interface method dispatch is more complex than delegate dispatch, since we must first locate the appropriate interface method table for the instance we have. If the count of interfaces implemented by a particular type is quite large (which is almost never the case), the only good way to do this is a hash table lookup. But if it is small (the most frequent case), it's just a search in a small [possibly ordered] list. Either way, this is more complex than the delegate case, since a delegate already contains the exact method address (this isn't true for open delegates).
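The difference can be exploited directly: if the same interface method will be invoked many times on the same instance, the interface dispatch can be paid once by capturing the method into a delegate. A sketch under these assumptions (IValueProvider and Provider are illustrative types, not from the test suite):

```csharp
using System;

public interface IValueProvider
{
    int GetValue();
}

public class Provider : IValueProvider
{
    public int GetValue() { return 42; }
}

public static class Demo
{
    public static int SumViaInterface(IValueProvider p, int count)
    {
        int sum = 0;
        for (int i = 0; i < count; i++)
            sum += p.GetValue();          // interface dispatch (~ 2x) on each call
        return sum;
    }

    public static int SumViaDelegate(IValueProvider p, int count)
    {
        Func<int> getValue = p.GetValue;  // dispatch resolved once, here
        int sum = 0;
        for (int i = 0; i < count; i++)
            sum += getValue();            // delegate call (~ 1.5x) on each call
        return sum;
    }
}
```

Keep in mind that creating a delegate over an interface method is itself expensive (see the delegate construction metrics in this post), so this pays off only for long loops.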

Now one more fact to think about: all the tests I'm talking about are loop-based, performed in the Release configuration, and their code looks nearly like the following:

for (int i = 0; i < ...; i += 10) {
  o = c.GetValue();
  o = c.GetValue();
  o = c.GetValue();
  o = c.GetValue();
  o = c.GetValue();
  o = c.GetValue();
  o = c.GetValue();
  o = c.GetValue();
  o = c.GetValue();
  o = c.GetValue();
}

.NET is able to cache resolved interface method tables (and possibly even method addresses), so calling an interface method in a loop must be a bit faster than calling it once. So in general the cost of an interface method call in comparison to a delegate call is even bigger.

This explains why we have such types as Hasher(of T) (Comparer, Arithmetic, etc.). They cache the delegates performing a set of operations on type T in their fields, which is faster than going through a similar interface. See e.g. the Hasher(of T).GetHash field. Certainly, such an approach is used only where performance is essential - i.e. when it's well known these operations will be invoked many times, e.g. on any index seek.
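A minimal sketch of this pattern (SimpleHasher is an illustrative name; see the actual Hasher(of T) sources for the real implementation):

```csharp
using System;

// Caches the hashing operation for T as a public delegate field, so the
// hot path pays a delegate call (~ 1.5x) instead of an interface call
// (~ 2x, or worse when the interface table isn't cached).
public sealed class SimpleHasher<T>
{
    public readonly Func<T, int> GetHash;

    public SimpleHasher(Func<T, int> getHash)
    {
        GetHash = getHash;
    }
}
```

Usage: construct it once (paying the delegate creation cost up front), then invoke the field in tight loops, e.g. `new SimpleHasher<string>(s => s.Length)`.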

Let's look at a few more metrics:
- Creating a delegate pointing to a non-virtual method takes 7.5x
- Creating a delegate pointing to a virtual method takes 50x
- Creating a delegate pointing to an interface method takes 150x

So delegate construction isn't cheap at all.

Now let's turn to virtual generic methods:
- Virtual generic method call time is 10x - regardless of whether it is called via an interface or not. There is almost no dependency on the count of generic arguments either - each additional argument makes the call longer by ~ 0.5%.
- Creating a delegate pointing to a generic virtual method takes 1000x. I'm not sure why - it looks like a bug in .NET. Since delegates in .NET may point to fully parameterized methods only (they store a method address), the time of calling such a delegate is 1.5x, as before.

Why is it so costly to call a virtual generic method? Because generally there can be any number of its implementations, depending on the argument substitution, so the .NET framework resolves its address using an internal hash table bound to the corresponding virtual (or interface) method table.

So we may also note that:
- Internal hash table seek time is ~ 10x - I'll use this value in my future posts to show how to estimate the cost of virtually any basic operation in .NET.

And a few conclusions related to generic virtual methods:
- The smallest heap allocation time is 3.5x; Int32 boxing time is 4x (including its heap allocation). So it's almost always cheaper to have a non-virtual generic method replaced by a virtual method with an object-typed argument, i.e. boxing is cheaper than generic virtual dispatch.
- If a generic virtual method anyway seems preferable, you might think about implementing "call caching" for it with delegates. E.g. we use the DelegateHelper.CreateDelegates and DelegateHelper.ExecuteDelegates methods to perform operations on Tuples of the same structure faster.

Sunday, October 26, 2008

Open delegates

Do you know what an open delegate is? I was quite surprised when I learned about them. They have existed since .NET 2.0, but I suspect they were not documented in MSDN initially, so I learned about them much later.

- An example
- Official documentation.

So what is an open delegate? It is a delegate bound only to a method, but not to an instance, so you can use the same delegate to call the method it is bound to on many instances. That's it.
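For example, here is an open instance delegate bound to String.ToUpperInvariant: no target instance is supplied at creation time, and the hidden "this" parameter surfaces as the delegate's first argument (OpenDelegateDemo is just a wrapper for illustration):

```csharp
using System;

public static class OpenDelegateDemo
{
    public static string Run()
    {
        // Passing null as the firstArgument makes this an open delegate:
        // the same delegate instance works for any string.
        var toUpper = (Func<string, string>)Delegate.CreateDelegate(
            typeof(Func<string, string>),
            null,
            typeof(string).GetMethod("ToUpperInvariant", Type.EmptyTypes));
        return toUpper("abc") + toUpper("def");
    }
}
```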

And a few more notes:
- Open delegate invocation must take at least as long as the underlying virtual or interface method invocation, since .NET can't use the same invocation path as for a regular delegate (there is no single method call address to use). We haven't tested this yet, but if we do, I'll publish the results here.
- An open delegate bound to a virtual generic method must be the slowest one - for the same reason. Do you know why virtual generic methods are the slowest ones? My next post here will explain this.
- Btw, the various ways of invocation in .NET are perfectly described here. The article covers regular, virtual and interface methods, as well as delegates.

Friday, October 17, 2008

The cost of having a finalizer

It's well-known that having a finalizer is costly. But how much does it cost to have one?

Allocation to nothing:
- SlimObject: Operations: 77.880 M/s.
- FinalizableSlimObject: Operations: 1.628 M/s.

Allocation to an array:
- SlimObject: Operations: 6.996 M/s.
- FinalizableSlimObject: Operations: 3.256 M/s.

"Allocation to nothing" here is allocation without storing a reference anywhere. This must be the most frequent scenario, since most allocated objects don't survive the next generation 0 GC.

There is no additional GC cost for SlimObject in this case - there are no references to any of these objects, so the garbage collector won't mark any of them, and consequently won't spend any time on them at all.

But what about FinalizableSlimObject?
- Its allocation is costly, since the object is registered for further finalization (added into the finalization queue). The registration cost must be similar to acquiring a lock and adding an object to a Hashtable.
- All such objects are processed (i.e. marked) during the GC.
- Then they're removed from the finalization queue (~ finding & removing an entry in a Hashtable).
- And added to the FReachable queue (cheap, ~ adding to a list).
- They're also processed during the heap compaction phase. Here we get an additional cost proportional to the size of the object.
- A separate finalization thread will spend some time finalizing them. This must not be very noticeable in this test, since the test was performed on a multicore CPU.
- And finally, if some of them survive until the next generation 1 GC, we'll add further overhead similar to the above.

So as you see, allocation of a short-lived finalizable object is 50 times more expensive than of a regular one!

Now let's turn to long-living objects: "Allocation to an array" here is allocation with storing a reference to the allocated object in an array. The test was running for about a second, so there must have been several generation 1 GCs. Certainly SlimObject wins here as well (about 2 times), but not as much as in the above case. Why?
- Because all the objects survive many GCs, and GC overhead is the most significant factor for both types of them.
- One more factor slowing down the test with SlimObject is less efficient usage of the L1/L2 caches: in the first case all the data were "sitting" mainly in cache, but here pages must "flow" through the caches. But I suspect this is a much less important factor than the first one here.

In any case, it's probably time to finalize some finalizers ;) At least the ones that aren't really necessary.
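One standard way to do that without losing safety is the Dispose pattern: an object disposed deterministically calls GC.SuppressFinalize, removing itself from the finalization queue, so the 50x penalty is paid only when someone forgets to call Dispose. A minimal sketch (ResourceHolder is an illustrative type):

```csharp
using System;

public class ResourceHolder : IDisposable
{
    private bool disposed;

    public void Dispose()
    {
        Dispose(true);
        // The object leaves the finalization queue here, so a disposed
        // instance costs the GC no more than a regular one.
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (disposed)
            return;
        // Release unmanaged resources here.
        disposed = true;
    }

    ~ResourceHolder()
    {
        Dispose(false); // safety net: runs only if Dispose() was never called
    }
}
```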

Thursday, October 16, 2008

Dictionary: surprising fact about Remove and Clear methods

Neither of these methods ever shrinks the internal bucket list of the Dictionary - this means the bucket list can only grow during the dictionary's lifetime, and the only way to "shrink" it is to recreate the dictionary (create a new one, copy the content of the old one into it and forget the old one).

To confirm this fact, study the source code of both methods - e.g. with .NET Reflector.
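So "shrinking" boils down to recreation; the copy constructor allocates a bucket list sized for the current count rather than the historical maximum. A sketch (DictionaryShrink is an illustrative wrapper):

```csharp
using System.Collections.Generic;

public static class DictionaryShrink
{
    public static Dictionary<int, string> BuildAndShrink()
    {
        var dictionary = new Dictionary<int, string>();
        for (int i = 0; i < 1000; i++)
            dictionary.Add(i, i.ToString());
        for (int i = 10; i < 1000; i++)
            dictionary.Remove(i);  // bucket list still holds ~ 1000 slots

        // The only way to release the buckets: copy into a fresh instance,
        // whose capacity is based on the current Count (10 items here).
        return new Dictionary<int, string>(dictionary);
    }
}
```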

When Hashtable is better than Dictionary

We all know the Hashtable class - it has existed in .NET from the very beginning. .NET 2.0 brought a generic replacement for it: Dictionary. So we usually prefer it over Hashtable. But it isn't well known that Hashtable has a very nice feature related to concurrent usage, which Dictionary does not have (at least according to the documentation):

- Hashtable supports a single-writer, multiple-readers concurrent access policy
- But Dictionary supports only the standard single-writer or multiple-readers policy

This makes Hashtable perfect e.g. for concurrent caching scenarios: only write operations on it require locking to ensure thread safety, while with Dictionary both read and write operations require locking.
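A cache built on this policy locks only the write path. A sketch, assuming ComputeValue stands for whatever expensive operation is being cached (here it's a placeholder):

```csharp
using System.Collections;

public class ConcurrentCache
{
    private readonly Hashtable items = new Hashtable();
    private readonly object writeLock = new object();

    public object Get(string key)
    {
        // Lock-free read: safe under Hashtable's single-writer,
        // multiple-readers policy.
        object value = items[key];
        if (value != null)
            return value;
        lock (writeLock) {
            value = items[key];        // double check under the lock
            if (value == null) {
                value = ComputeValue(key);
                items[key] = value;    // the only mutating operation
            }
        }
        return value;
    }

    private static object ComputeValue(string key)
    {
        return key.ToUpperInvariant(); // placeholder for real work
    }
}
```

The same code over a Dictionary would need the lock in the read path too.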

The official confirmation of this fact is here.

And a fly in the ointment:
- It's difficult to find out whether this was fully correct for .NET 2.0, since this fact was mentioned only recently. The related issue report is here.
- This is not completely true: here is the report.

Possible workarounds:
- Never clear or remove items from the Hashtable at all. In many cases this is ok, e.g. when you cache something for the lifetime of the cache or the whole application. An example is caching of Reflection.Emit results or any other code loaded into an AppDomain: you can't unload the code anyway, so there is no reason to clean up any references to it from caches.
- Implement item removal by dropping the old Hashtable and copying its content (without the item to remove) to a new one. Or implement a wrapper doing the same using two Hashtables (one containing just-added keys and values, the other just-removed keys) and periodically doing "garbage collection" by re-creating both of them as described above. But actually the cost of looking into two Hashtables must be close to the cost of locking.

In any case, this feature is very nice even with all the mentioned problems. We utilize it e.g. in our ThreadSafeDictionary class.