Detailed example of execution reordering by the compiler

Silly Fibonacci implementation

Say we have a silly function like fib below, a small main function to calculate the value for 42 and print it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#include <iostream>

int fib(int x)
{
    if (x < 2) return 1;
    return fib(x - 1) + fib(x - 2);
}

int main()
{
    int result = fib(42);
    std::cout << result << '\n';
    return 0;
}

Then using Visual Studio 2017 we compile it optimized (Release, not Debug) and run it.

1
2
433494437
Press any key to continue . . .

It prints the expected value, but it’s slow. It takes about 1 to 2 seconds. We know how to implement a better one. But we’d like to first find out how long it takes to run the silly version.

Measuring the duration using chrono

So we use the standard chrono library to take start and stop timestamps around the call to fib(42).

Then we print the duration using a weird looking duration_cast. Given that it takes seconds, a resolution of milliseconds should do.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <chrono>
#include <iostream>

int fib(int x)
{
    if (x < 2) return 1;
    return fib(x - 1) + fib(x - 2);
}

int main()
{
    auto start = std::chrono::steady_clock::now();
    int result = fib(42);
    auto stop = std::chrono::steady_clock::now();
    std::cout << result << '\n';
    std::cout << "It took: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count() << '\n';
    return 0;
}

Then using Visual Studio 2017 we compile it optimized (Release, not Debug) and run it.

1
2
3
433494437
It took: 0
Press any key to continue . . .

And the problem is that it prints a duration of 0. But when we run it, it obviously takes more than 0 milliseconds by about three orders of magnitude.

And some look at the weird looking duration_cast and claim that there must be something wrong with it: look, it uses templates, so it must be wrong, oh why don’t we use the good ol’ GetTickCount?.

Measuring the duration using GetTickCount

So we use the good ol’ GetTickCount. It does overflow after about 48 days, but it will do for a quick test.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <Windows.h>
#include <iostream>

int fib(int x)
{
    if (x < 2) return 1;
    return fib(x - 1) + fib(x - 2);
}

int main()
{
    auto start = ::GetTickCount();
    int result = fib(42);
    auto stop = ::GetTickCount();
    std::cout << result << '\n';
    std::cout << "It took: " << (stop - start) << '\n';
    return 0;
}

Then using Visual Studio 2017 we compile it optimized (Release, not Debug) and run it.

1
2
3
433494437
It took: 0
Press any key to continue . . .

Oh, it still prints a duration of 0, and it obviously still takes more like 1 to 2 seconds.

Generated code

So then we look at the assembly in the executable that the compiler generates.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
    12:     auto start = ::GetTickCount();
 mov         esi,dword ptr [__imp__GetTickCount@0 (09B3000h)]
 push        edi
 call        esi
 mov         edi,eax
    14:     auto stop = ::GetTickCount();
 call        esi
    13:     int result = fib(42);
 mov         ecx,29h
    14:     auto stop = ::GetTickCount();
 mov         esi,eax
    13:     int result = fib(42);
 call        fib (09B1000h)
 mov         ecx,28h
 mov         edx,eax
 call        fib (09B1000h)
    15:     std::cout << result << '\n';
 mov         ecx,dword ptr [_imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (09B3050h)]
    13:     int result = fib(42);
 add         edx,eax
    15:     std::cout << result << '\n';
 push        edx
 call        dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (09B3038h)]
 mov         ecx,eax
 call        std::operator<<<std::char_traits<char> > (09B13A0h)
    16:     std::cout << "It took: " << (stop - start) << '\n';
 sub         esi,edi
 push        esi
 push        ecx
 mov         ecx,dword ptr [_imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (09B3050h)]
 call        std::operator<<<std::char_traits<char> > (09B1160h)
 add         esp,4
 mov         ecx,eax
 call        dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (09B303Ch)]
 mov         ecx,eax
 call        std::operator<<<std::char_traits<char> > (09B13A0h)

The compiler reordered the code it generated.

In particular it called GetTickCount twice in succession (lines from 1 to 7 above) to obtain start and stop, and the calculation for fib(42) is only done later (starting at line 8). That explains why the difference (stop - start) is almost always 0.

In this case it seems that the compiler chose to defer the calculation of fib(42) so that the second call to GetTickCount at line 7 can reuse the value of esi set at line 2 to the pointer to GetTickCount.

Functions like GetTickCount and printing the output could have visible side effects and their relative order is preserved by the compiler. But for others, like fib(42), the compiler has full visibility, and it has significant flexibility, as long as it’s calculated before it’s printed.

The calculation for fib(42) was unrolled a bit. 29h at line 9 is equal to 41 decimal and will be used to calculate fib(42-1). 28h at line 14 is equal to 40 decimal and will be used to calculate fib(42-2). The results will be added to calculate fib(42) = fib(42-1) + fib(42-2). This unrolling saves one test for the argument 42 being less than 2, but in this particular case it’s not a significant optimisation.

We can’t really complain about the optimisation for GetTickCount. Ideally the compiler would have gone further and figured out the result of fib(42) at compile time and just printed the constant (a constexpr decoration would probably helped, but it’s not a requirement for such optimisations).

As a side note there is also some interleaving of operations. One example: ecx is set to 29h at line 9, before calling fib at line 13 (part of eventually calculating fib(42)), but in between the result of GetTickCount is moved from eax to esi at line 11 (which is the assignment of stop).

Compiler operations graph

Another way of looking at it is that the compiler generates a graph of operations. Ignoring details like the lower level interleaving, the graph looks like this:

Operations Graph

The compiler in effect performs a topological sort and there is more than one possible ordering, some of which executes fib(42) after assigning stop.

Conclusion

We looked at a counter-intuitive example where the compiler reordered operations. Although inconvenient in this particular example, it is the result of optimisations with the goal of improving runtime performance.

References