标签:operating system thread vtune
1 ENVIRONMENTS
Table 1 ENV A
CPU |
Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz 10-core |
Memory |
100G |
Linux Version |
3.11.0 |
Ubuntu Version |
12.04 |
3 EXPLANATIONS
We need to explain why MENCODER with PARROT is faster than that without PARROT over four threads. [1] tells us that the Synchronization Context Switches with PARROT is much fewer than that without PARROT. And the reason is that
PARROT’s round-robin scheduling reduces contention and its synchronizations use a more efficient wait that combines spin- and block- waits. So we should do the following steps to find the detailed reasons.
(a) First, is Syn Context Switches the main factor that leads to the result?
(b) If it is, please see (c-e); otherwise jump to (f).
(c) What’ s the relationship between Syn Context Switches and threads contention(a.k.a. locks contention)?
(d) What causes the contention?
(e) How does PARROT’ s round-robin scheduling reduce contention?
(f) We need to find other factors.
3.1 Preparations
We can’t read the source code of MENCODER, and don’t know where PARROT works. So I want to find a simple model which is similar to MENCODER. It is located in xtern/apps/mplayer/, and contains speedup-example.cpp(with PARROT) and speedup-example-mutex-only.cpp(without PARROT).
We call it EXAMPLE, and I write the mk of speedup-example.cpp.
#!/bin/bash
cd $XTERN_ROOT/apps/mplayer
gcc speedup-example.cpp -o speedup-example -O2 -g \
-I$XTERN_ROOT/include -L$XTERN_ROOT/dync_hook -Wl,--rpath,$XTERN_ROOT/dync_hook
-lxtern-annot \
-lpthread
So we could run the model. Before this, however, we should prove that we could use EXAMPLE to explain MENCODER. We could regard MENCODER as the common producer-consumer idiom. The producer reads blocks from mp4, and the consumers convert into avi. When many consumers
strive for the same lock, it will causes performance loss. This is the key point. If we want to explore the reasons of the speedup of MENCODER with PARROT, we need to simplify it as the above. And EXAMPLE is a good choice. It lets many threads access the same
critical resource, and uses PARROT to get a speedup. Additionally, EXAMPLE and MENCODER both enjoy big speedups. So I think if we can find the detailed reasons of the speedup of the EXAMPLE with PARROT, the obtained reasons also apply to MENCODER.
3.2 Is Syn Context Switches the main factor that leads to the result?
Now we could use EXAMPLE to explain MENCODER. We run EXAMPLE over four threads, ENV A. Results are as follows.
Table 2 Results of EXAMPLE over four threads, ENV A
WITHOUTPARROT |
WITH PARROT |
|
Elapsed Time |
7.85s |
4.36s |
Table 3 Results of EXAMPLE without PARROT over 1 thread, ENV A
WITHOUTPARROT |
|
Elapsed Time |
1.38s |
speedup-example-mutex-only.cpp
void*thread_func(void*arg){ ... for (int j = 0; j < M; j++) { pthread_mutex_lock(&mutex); for (long i = 0; i < loops; i++) // This is the key of speedup for parrot: the mutex needs to be a little bit congested. sum += i; pthread_mutex_unlock(&mutex); ... } } int main(int argc, char* argv[]){ ... for(unsigned i=0; i<N; ++i) { ret = pthread_create(&th[i], NULL, thread_func, (void*)i); assert(!ret && "pthread_create() failed!"); } ... exit(0); }
Threads=1, without PARROT
Threads=4, without PARROT
Threads=4, with PARROT
Fig.2. and Fig.1. show that with the increase of the number of threads, Synchronization Context Switches is larger and larger. Fig.2. and Fig.3. show that PARROT reduces the number of Context Switches. Namely, with the increase of Synchronization Context Switches, the Elapsed Time increases, and with the reduce of it, the Elapsed Time reduces. Above all, we could prove that Syn Context Switches is the main factor that leads to the result that we try to explain.
3.3 What’s the relationship between Syn Context Switches and threads contention?
We have known that Syn Context Switches is the main factor. And we also know that with the increase of the number of threads, there is also increased threads contention.What’s the relationship between Syn Context Switches and threads contention? Contention
not only causes serialization, but also brings the overhead of thread switches[2]. From VTune Fig.4., we could see that threads spend more Wait Time on the Mutex lock, which caused by threads contention.
So if we could reduce the contention, or use a more efficient wait, we will get a high speedup. So we need to know how it is generated.
3.4 What causes the contention?
Let’s see the source code of EXAMPLE. Every thread executes the following codes:
pthread_mutex_lock(&mutex); for (long i = 0; i < loops; i++) sum += i; pthread_mutex_unlock(&mutex);
3.5 How does PARROT’s round-robin scheduling reduce contention?
First, let’s see the Locks and Waits of EXAMPLE with PARROT and without PARROT from VTune.
With PARROT
Without PARROT
Without PARROT, threads spend more Wait Time on the Mutex lock. While with PARROT, threads spend more wait time on Condition Variable rather than Mutex. It is because PARROT adds
to wake up all threads together. Let’s try to explain the benefit. Thread A finishes its access, and unlock the shared resources. Next, there will be many threads striving for the lock, including A itself. While PARROT lets A wait by (2), and lets B, C, D wait subsequently.
So the contention has been reduced. When all four threads lie in the wait queue, (3) will let the four threads run again. The following graph describes the number of threads that strive for the lock as time goes on.
So PARROT could reduce contention by schedule threads in order.
4 CONCLUSION
PARROT tells thread A, B, C and D: When you finish a access, you should wait for others, because you are cooperators. So the enemy, Syn Context Switches, reduces gradually. Although they need to wait, in fact the performance
will be improved as a whole.
5 REFERENCES
[1] Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum, Xinan Xu, Junfeng Yang, Garth Gibson, and Randy Bryant. "Parrot: a Practical Runtime for Deterministic, Stable, and Reliable Threads". Proceedings of the 24th ACM Symposium
on Operating Systems Principles (SOSP ’13)
[2] Weiming Zhou, Multi-core Computing and Programming, Wuhan: Huazhong University of Science and Technology Press, 2009(in Chinese)
The Performance Analysis of MENCODER with PARROT,布布扣,bubuko.com
The Performance Analysis of MENCODER with PARROT
标签:operating system thread vtune
原文地址:http://blog.csdn.net/bluecloudmatrix/article/details/27711489