First of all, raw results:
Instance field: Operations: 2,650 B/s.
Static field: Operations: 2,630 B/s.
Volatile instance field: Operations: 2,649 B/s.
[ThreadStatic] field: Operations: 43,611 M/s.
Thread data slot: Operations: 4,068 M/s.
The test is actually quite simple: we read specified field in a loop. As before, on Core 2 Duo @2.66GHz. The code can be found in DataObjects.Net 4.0 test suite, see Xtensive.Core\Xtensive.Core.Tests\DotNetFramework\FieldTypeTest.cs.
Conclusions:
- Reading regular, static or volatile field is quite cheap: ~ 0.2x in previously introduced metrics, or 20% of virtual method call time
- [ThreadStatic] fields are actually quite costly in comparison to others: ~ 14x.
Now the main question: why? It isn't so obvious [ThreadStatic] is ~ 60 times slower than static.
JITted [ThreadStatic] access code actually always consists of two parts:
- Call to a system routine returning address of [ThreadStatic] field by its token
- Regular field access instruction.
Obviously, the first part (call) "eats" almost the whole execution time: there is no more efficient way to get the address of such a field by its token rather than using hash table. As I've mentioned before, reading from a system hash table takes ~ 10x. So that's nearly what we have in this case.
Why they're implemented this way in .NET? I can't imagine why they didn't use some faster approach. E.g. I suspect calculating the lowest stack boundary (as well as the upper one) from the current stack pointer value is quite simple operation - something like bitwise and. Why we can't store the address of the first [ThreadStatic] location as fixed address nearby it, and use constant offset for each [ThreadStatic] field relatively to the address of the first one? In this case it would take ~ 1x to access it...
Ok, this is what could be, but in reality we have 14x.
Finally, there are thread data slots as well. But they're 10 times slower than [ThreadStatic] fields, so it's always better to simulate their behavior with e.g. a Dictionary stored in [ThreadStatic] field.
Thursday, October 30, 2008
The cost of [ThreadStatic] attribute
Wednesday, October 29, 2008
The cost of method calls
Here I'll talk about the cost (or performance) of various ways of method invocation in .NET.
First of all, let's assume:
- Virtual method call time is 1x. This is about ~ 600M calls / sec. on Core 2 Duo @ 2.66 GHz. To be precise, we're talking about instance method getting no arguments and returning a single Int32 value (i.e. it is an average property getter).
So:
- Delegate method call time is 1.5x (the same method invoked by delegate pointing to it)
- Interface method call time is 2x (the same method invoked on reference of interface type)
A bit surprising, yes? The explanation is here. Briefly, interface method dispatch is more complex than delegate dispatch, since we must locate appropriate interface method table for the instance we have first. If count of implemented interfaces on a particular type is quite large (that's actually almost impossible), the only good way to do this is to use hash table. But if it is small (that's the most frequent case), there must be just a search in small [possibly - ordered] list. But this is anyway more complex than in case with delegate, since delegate already contains the exact method address (this isn't correct for open delegates).
Now one more fact to think about: all the tests I'm talking about are loop-based, performed in Release configuration, and their code is nearly the following:
for (int i = 0; i<...; i+=10) {
o = c.GetValue();
o = c.GetValue();
o = c.GetValue();
o = c.GetValue();
o = c.GetValue();
o = c.GetValue();
o = c.GetValue();
o = c.GetValue();
o = c.GetValue();
o = c.GetValue();
}
.NET is able to cache resolved interface method tables (and possibly - even method addresses), so calling interface method in a loop must be a bit faster, than calling it once. So in general the cost of interface method call in comparison to delegate call is even bigger.
This explains why we have such types as Hasher(of T) (Comparer, Arithmetic, etc.). They cache the delegates performing a set of operations on T type faster than with use of similar interface in their fields. See e.g. Hasher(of T).GetHash field. Certainly, such an approach is used only when performance is essential - i.e. when it's well known these operations will be invoked many times e.g. on any index seek.
Let's look of few more metrics:
- Creating a delegate pointing to non-virtual method time is 7.5x
- Creating a delegate pointing to virtual method time is 50x
- Creating a delegate pointing to interface method time is 150x
So delegate construction isn't cheap at all.
Now we're turning to virtual generic methods:
- Virtual generic method call time is 10x - independently on of it is called on interface or not. Dependency on count of generic arguments almost absents as well - adding one more argument makes the call longer by ~ 0.5%.
- Creating a delegate pointing to generic virtual method time is 1000x. Not sure, why - it looks like because of some bug in .NET. Since delegates in .NET may point to fully parameterized methods only (they store method address), so the time of calling such a delegate is 1.5x, as before.
Why it's so costly to call virtual generic method? Because there can be generally any number of its implementations, dependently on the argument substitution, so .NET framework resolves its address using internal hash table, that must be bound to corresponding virtual (or interface) method table.
So we may also take that:
- Internal hash table seek time is ~ 10x - I'll use this time in my future posts to show how to estimate the cost of generally any basic operation in .NET.
And few conclusions related to generic virtual methods:
- The smallest heap allocation time is 3.5x; Int32 boxing time is 4x (with its heap allocation). So it's almost always cheaper to have non-generic virtual method with object type argument, rather than generic virtual method.
- If generic virtual method seems anyway preferable, you might think about implementing its "call caching" with use of delegates. E.g. we use DelegateHelper.CreateDelegates and DelegateHelper.ExecuteDelegates methods to perform the operations on Tuples of the same structure faster.
Sunday, October 26, 2008
Open delegates
Do you know what is open delegate? I was quite surprised when I knew this. They exists since .NET 2.0, but I suspect they were not documented in MSDN initially, so e.g. I have known this much later.
Links:
- An example
- Official documentation.
So what is open delegate? It is a delegate bound only to a method, but not to an instance, so you can use the same delegate to call the method it is bound to on many instances. That's it.
And few more notes:
- Open delegate invocation must take at least the same time as the underlying virtual or interface method invocation, since .NET can't use the same way of invocation as for the regular one (there is no single method call address to use). We didn't test this yet, but if we will, I publish the results here.
- An open delegate bound to virtual generic method must be the slowest one - by the same reason. Do you know why virtual generic methods are the slowest ones? My next post here will explain this.
- Btw, various ways of invocation in .NET are perfectly described here. The article covers i.e. regular, virtual, interface methods and delegates.
Friday, October 17, 2008
The cost of having a finalizer
It's well-known that having a finalizer is costly. But how much does it cost - to have a finalizer?
Allocation to nothing:
- SlimObject: Operations: 77.880 M/s.
- FinalizableSlimObject: Operations: 1.628 M/s.
Allocation to an array:
- SlimObject: Operations: 6.996 M/s.
- FinalizableSlimObject: Operations: 3.256 M/s.
"Allocation to nothing" here is allocation without storing a reference anywhere. This case must be the most frequent scenario, since most of allocated objects don't survive the next GC in generation 0.
There is no any additional GC cost for SlimObject in this case - there are no references to any of such objects, so garbage collector won't mark any of them, and consequently, won't spent any time on them at all.
But what's with FinalizeableSlimObject?
- It's allocation is costly, since it is registered for further finalization (added into a finalization queue). Registration cost must be similar to getting a lock and adding an object to Hashtable.
- All of such objects are processed (i.e. marked) during the GC
- Then they're removed from finalization queue (~ finding & removing an entry from a Hashtable)
- And added to FReachable queue (cheap, ~ adding to a list).
- They're also processed on heap compaction phase. We get an additional cost proportional to the size of an object here.
- A separate finalization thread will spend some time on them to finalize them. This must be not quite noticeable in this test, since the test have been performed on multicore CPU.
- And finally, if some of them will survive until the next GC in generation 1, we'll will add further overhead similar to described.
So as you see, allocation of short living finalizable object is 50 times more expensive than of the regular one!
Now let's turn to long-living objects: "Allocation to an array" here is allocation with storing a reference to allocated object in array. The test was running for about a second, so there must be several generation 1 GCs. Certainly SlimObject wins here as well (about 2 times), but not as mach as in above case. Why?
- Because all the objects survive many GCs, and GC overhead is the most significant factor for both types of them.
- One more factor slowing down the test with SlimObject is less efficient usage of L1\L2 caches: in first case all the data were "laying" mainly in cache, but here pages must "flow" through the caches. But I suspect this is much less important factor than the first one here.
In any case, probably it's time to finalize some finalizers ;) At least the ones that aren't really necessary.
Thursday, October 16, 2008
Dictionary: surprising fact about Remove and Clear methods
Both these methods never resize internal bucket list of the Dictionary
To get the approval of this fact, study the source code of both methods - e.g. with .NET Reflector.
When Hashtable is better than Dictionary
We all know Hashtable class - it exists in .NET from the very beginning. .NET 2.0 has brought a generic replacement to it: Dictionary.