Compiler: Visual C++ 2010
Operating System: Windows 7 32-bit
Tested machine CPU: Intel Core i3
Download:
preVSpost (demo project)
A recent comment from the Visual C++ team on twitter.com reminded me of a hot topic in the C++ programming world: the long-running discussion about using pre- versus post-increment operators, especially for iterators. I have witnessed such a discussion myself; it started from a FAQ I wrote on www.codexpert.ro.
The reason for preferring the pre-increment operator is simple: every post-increment has to create a temporary object.
The Visual C++ STL implementation looks similar to the following code:
_Myt_iter operator++(int) {
    // post-increment: copy the current iterator, advance the original, return the copy
    _Myt_iter _Tmp = *this;
    ++*this;
    return (_Tmp);
}
For the pre-increment operator, this temporary object is not needed:
_Myt_iter& operator++() {
    // pre-increment: advance in place and return a reference to *this
    ++(*(_Mybase_iter *)this);
    return (*this);
}
In the discussion I mentioned above, somebody came up with a dummy application and tried to prove that things have changed thanks to newer compiler optimizations (that code is included in the attached file as well). That sample is too simple and too far from real code. Real code usually has many more lines that eat CPU time, even when compiled with /O2 settings.
Based on the VC++ team's tweet pointing to viva64.com's research, I decided to create my own benchmark for single-core and multi-core architectures. For those who don't know it, Viva64 is a company specialized in static code analysis.
Starting from their project, I extended the tests to other STL containers: std::vector, std::list, std::map, and std::unordered_map (the VC++ 2010 hash table implementation).
For the multi-core tests I used Microsoft's new Parallel Patterns Library (PPL).
1. How the tests were made
1.1. Code stuff
In order to measure execution time I used the same timer as the Viva64 team (with a few changes). Each container instance was populated with 100,000 elements of the same random data. An effective computing function was repeated 10 times; inside this function, some template functions are called 300 times. The single-core computing function contains loops like this:
for (size_t i = 0; i != Count; ++i) {
    x += FooPre(arr);
}

// where FooPre looks like
template <typename T>
size_t FooPre(const T &arr) {
    size_t sum = 0;
    for (auto it = arr.begin(); it != arr.end(); ++it)
        sum += *it;
    return sum;
}
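To make the setup concrete, a measurement loop for one container could look roughly like the sketch below. It uses the standard clock() as a stand-in for the Viva64 timer, and the constants and output format are illustrative, not taken from the demo project:

#include <vector>
#include <cstdlib>
#include <ctime>
#include <cstdio>

const size_t ElementCount = 100000;  // elements per container
const size_t Count        = 300;     // FooPre calls per measured run
const size_t Repeats      = 10;      // measured runs

template <typename T>
size_t FooPre(const T &arr) {        // same helper as shown above
    size_t sum = 0;
    for (auto it = arr.begin(); it != arr.end(); ++it)
        sum += *it;
    return sum;
}

int main() {
    std::vector<size_t> arr;
    for (size_t i = 0; i != ElementCount; ++i)
        arr.push_back(static_cast<size_t>(rand()));   // same random data for every container type

    for (size_t r = 0; r != Repeats; ++r) {
        clock_t start = clock();
        size_t x = 0;
        for (size_t i = 0; i != Count; ++i)
            x += FooPre(arr);
        double seconds = double(clock() - start) / CLOCKS_PER_SEC;
        printf("run %u: %.3f s (checksum %u)\n", unsigned(r), seconds, unsigned(x));
    }
    return 0;
}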
For the parallel computation, the first simple for loop changes into:
parallel_for(size_t(0), Count,
    [&cnt, &arr](size_t i) {
        cnt.local() += FooPre(arr);
    });
where cnt is an instance of the combinable class, and the sum of the partially computed elements is obtained by calling its combine() method:
cnt.combine(plus<size_t>());
As you can see, the parallel_for function uses one of the new C++ standard features: a lambda function. This lambda function and the combinable class implement the so-called parallel aggregation pattern, which helps you avoid the usual shared-resource issues of multithreaded code. The work is executed in independent tasks. This approach is fast because it needs very little synchronization: calculating the per-task local results uses no shared variables and therefore requires no locks, and the combine operation is a separate sequential step that does not require locks either.
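Put together, a minimal self-contained sketch of this parallel aggregation pattern with PPL might look like this (it assumes a std::vector<size_t> and the FooPre template shown earlier; the function name is illustrative):

#include <ppl.h>        // Concurrency::parallel_for, Concurrency::combinable
#include <functional>   // std::plus
#include <vector>

size_t ParallelSum(const std::vector<size_t> &arr, size_t Count) {
    Concurrency::combinable<size_t> cnt;          // one private accumulator per task

    Concurrency::parallel_for(size_t(0), Count,
        [&cnt, &arr](size_t i) {
            cnt.local() += FooPre(arr);           // no locks: each task updates its own copy
        });

    // sequential reduction of the per-task partial sums
    return cnt.combine(std::plus<size_t>());
}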
1.2. Effective results
The tests were run on an Intel Core i3 machine (4 cores) running Windows 7 32-bit. I tested debug and release mode for single-core and multi-core computation. The test application was built with VC++ 2010, one of the first compilers with C++11 support.
The X axis represents the repeated execution runs, and the Y axis represents time in seconds.
1.2.1. Single core computation
Debug
Release
1.2.2. Multi cores computation
As you know, multi-core programming is the future. For C++ programmers, Microsoft proposes a very interesting library called the Parallel Patterns Library.
The overall goal is to decompose the problem into independent tasks that do not share data, while providing a sufficient number of tasks to occupy the number of cores available.
This is how my Task Manager looks when the demo application runs in parallel mode.
Isn't it nice compared to single-core usage? 🙂
Debug
Release
1.2.3. Speedup
Speedup is a performance metric that shows how much faster a parallel algorithm runs compared to its serial counterpart (speedup = serial execution time / parallel execution time).
Debug
Release
Conclusions:
The biggest differences appear in debug builds, where pre-increment is clearly "the champion".
With primitive types (like int and pointers) the opposite might be true because of the pipelining a CPU does; with post-increment on these simple types, release-mode optimizations eliminate the copy that would otherwise have to be returned.
Based on these results I have to agree with the Viva64 team. Even though the results are very close in the release version, I keep my opinion that the pre-increment operator should be preferred over the post-increment operator. We all know how long the debugging period takes and how important every second we win is during long debugging days.
If you still have doubts about using the pre-increment operator, or you need a flexible way of switching these operators in your code, you can easily implement macros like these:
#define VECTOR_ITERATOR(type, var_iter) std::vector<type>::iterator var_iter;
#define VECTOR_FOR(vect, var_iter) for (var_iter = vect.begin(); var_iter != vect.end(); ++var_iter)
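For example, a hypothetical usage of these macros could look like this:

#include <vector>

#define VECTOR_ITERATOR(type, var_iter) std::vector<type>::iterator var_iter;
#define VECTOR_FOR(vect, var_iter) for (var_iter = vect.begin(); var_iter != vect.end(); ++var_iter)

int main() {
    std::vector<int> values(10, 1);
    int sum = 0;
    VECTOR_ITERATOR(int, it)     // declares std::vector<int>::iterator it;
    VECTOR_FOR(values, it)       // expands to a for loop that uses pre-increment
        sum += *it;
    return sum == 10 ? 0 : 1;
}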