Fibonacci Implementation Experiments
Lessons learned while calculating the 1 millionth entry in the Fibonacci sequence
Introduction
Ever since reading Chapter 3 in Elements of Programming, calculating Fibonacci, pretending it’s has high runtime complexity compared with other functions, looked silly.
Later Sean Parent had a presentation on concurrency in which he asks people to stop giving it as example for long running function, because here: he could easily calculate Fibonacci of 1 million in 0.72 seconds, on a normal consumer computer.
And yet, recently I indulged myself recently in this silly occupation, and here’s what I learned.
Exponential naive
The naive implementation, follows the definition to the letter. It handles the base case, then uses repeated recursion to calculate Fibonacci:
1
2
3
4
5
std::uint64_t exponential_naive(std::uint32_t x)
{
if (x < 2) return x;
return exponential_naive(x - 1) + exponential_naive(x - 2);
}
The problem with it is the exponential runtime growth, on my computer that takes:
- about 1 second for
42
- about 6 seconds for
46
After that it takes quite a long time, which makes it unusable (except for using it as an example of compiler re-ordering of operations.
Linear naive
It’s easy to implement a function that has linear runtime growth:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
std::uint64_t linear_naive(std::uint32_t x)
{
if (x < 2) return x;
std::uint64_t a = 0;
std::uint64_t b = 1;
for (--x; x > 0; --x)
{
std::uint64_t c = a + b;
a = b;
b = c;
}
return b;
}
It performs better, on my computer that takes:
- about nothing for
42
- about nothing for
46
- about nothing for
96
If we want to calculate the value for 1 million, the argument x
can fit into
an unsigned 32 bit integer, but the result won’t.
Fibonacci of 96
is 12200160415121876738
and is the biggest result that we
can store on an unsigned 64 bit integer variable.
Big O notation limitations
The big O
notation is only part of the story. It works on assumptions of
infinity, while engineering works with finite resources. A great example is
this one we’ve just seen: 96
is nowhere near infinite, for such small values
we can implement the function as a lookup into an array with 96 values.
Unsigned binary
So we need a big number. Sean Parent uses boost::multiprecision::cpp_int
. How
hard can it be to write our own?
We can define our own arbitrary precision unsigned number as a vector of
unsigned 32 bit integers. An empty vector is zero, value at index 0
is the
least significant part.
1
2
3
4
5
6
using unit = std::uint32_t;
struct unsigned_binary
{
std::vector<unit> units_;
// ...
};
One implementation choice is between making it a concrete type (as above) or a
template. I think short term, we save a lot of effort by making it a concrete
type. What we loose is the ability to precisely count the operations by
substituting with instrumented types for the unit
.
Then we can add arithmetical operators. We start with +
, which it turn out,
is defined in a funny asymmetric way, that takes left hand side by value, which
might help with expressions like a + b + c
.
1
2
3
4
5
6
7
8
9
10
11
12
using unit = std::uint32_t;
struct unsigned_binary
{
std::vector<unit> units_;
unsigned_binary & operator+=(const unsigned_binary & rhs);
friend unsigned_binary operator+(unsigned_binary lhs, const unsigned_binary & rhs)
{
lhs += rhs;
return lhs;
}
};
To implement that we need a larger data type, where we can add the
corresponding units and the carry. The carry is 0
or 1
.
The only catch is that in the implementation the left hand side is *this
which mutates: we both read and write, some care is required e.g. use indexes
instead of iterators that might get invalidated as we resize.
1
using double_unit = std::uint64_t;
The addition for the arbitrary precision has complexity O(n)
worst case where
n
is the maximum of the length of the two terms of the addition (for the case
that carry propagates all the way. The average case is probably O(n)
, but
this case the minimum of the length of the two terms, because the chances the
carry propagates decrease exponentially.
Similarly we can implement multiplication using the long multiplication
algorithm from primary school. That algorithm has a complexity if O(n^2)
(i.e. quadratic).
Unsigned decimal
The next problem to solve is to display the number. My plan is to have an associated type, that stores decimal digits in a vector, and convert between the two types.
1
2
3
4
5
6
7
8
9
struct unsigned_decimal
{
// vector of digits between 0 and 9
std::vector<unsigned char> digits_;
};
unsigned_decimal make_unsigned_decimal(const unsigned_binary & value);
std::string to_string(const unsigned_decimal & value);
Printing is trivial: reverse iterate through digits_
add '0'
(which is 48
in ASCII) to each value.
To do the conversion from unsigned_binary
, keep a value for the current power
of 2 in decimal, go through each bit in units_
and accumulate if the bit is
1, doubling (add to itself) the current power of two in each step. It has the
simplicity that it only needs addition for unsigned_decimal
.
First go
We’re ready to copy/paste Sean’s Parent Fibonacci implementation.
With a naive implementation for unsigned_binary
and conversion, for Fibonacci
of 1,000,000
it took:
- about 2 seconds to calculate Fibonacci in a
unsigned_binary
- but a very long time to convert to
unsigned_decimal
(and print the decimal result).
Improve conversion to decimal
It turns out that the conversion I used is O(n^2)
operations. So my quick
plan was to try to reduce the value n
to which the squaring applies.
I changed the approach used for unsigned_decimal
and stored instead a vector
of 32 bit unsigned integers in a vector, where each value represents a number
in base 1,000,000,000
(from 0
to 999,999,999
).
1
2
3
4
5
6
7
using big_digit = std::uint32_t;
struct unsigned_decimal
{
// vector of digits between 0 and 999,999,999
std::vector<big_digit> digits_;
};
To convert I then repeatedly divided the unsigned_binary
by 1,000,000,000
to obtain one big_digit
at a time.
It also turns out that the functions from_chars
and to_chars
from the
<charconv>
header might not be useful in this case, as all the values from
the vector (except the one at the highest index) need to be padded with 0
s,
and those functions don’t do padding.
In this approach I used a lot of info that’s not impossible, but not trivial to
deduce programatically either. E.g. the fact that base 1,000,000,000
is the
best fit for a power of 10
into a 32 bit unsigned integer. Also used switches
up to 9
to calculate the length of the serialized string.
With an improved conversion, for Fibonacci of 1,000,000
it took:
- about 2 seconds to calculate Fibonacci in a
unsigned_binary
(code unchanged) - and about 1 second to convert to
unsigned_decimal
(and print the decimal result, though most of the time is taken by the conversion).
Karatsuba multiplication
Addition is O(n)
, multiplication on the other side when implemented as
learned in school (long multiplication) is O(n^2)
. There are algorithms that
do a bit better, one such is the Karatsuba algorithm that has a complexity of
about O(n^1.58)
. It might not seem much of a difference but compare n = 10
:
10^2 = 100
, while 10^1.58 ~= 38
, which is less the half.
Unfortunately it also comes with constants due to additional allocations
(unless one is willing to go into custom allocators territory), so empirically
it seems that the threshold for using Karatsuba vs. long multiplication is
around 100
for the size of vectors of the operands.
Using Karatsuba multiplication, for Fibonacci of 1,000,000
it took:
- about 0.5 seconds to calculate Fibonacci in a
unsigned_binary
- and about 1 second to convert to
unsigned_decimal
(code unchanged).
Elements of Programming chapter 3 - revisited
It turns out that Sean Parent’s example is from “From Mathematics to Generic Programming”, not from Elements of Programming.
In Elements of Programming the implementation uses half of matrix.
Using half matrix, for Fibonacci of 1,000,000
it took:
- about 0.3 seconds to calculate Fibonacci in a
unsigned_binary
- and about 1 second to convert to
unsigned_decimal
(code unchanged).
So now the calculation is dominated by the conversion (and print).
Interestingly the power in EOP uses same type for argument and return value.
1
2
3
4
5
template<typename I>
requires(Integer(I)
I fibonacci(I n) {
// ...
}
That is nice from the point of view of deducing the arguments. But we’ve seen
that in practice the argument is much smaller than the return value, that’s why
I used different types: unsigned 32 bit for the argument of 1 million and
unsigned_binary
for the return value. For this case in particular it would
likely not make much of a difference.
Could have kept on going on
I stopped there. Could have tried to see what difference it makes by using a more optimised power algorithm. Could also spend more effort on the conversion issue. Could have also try to look at (and reduce) allocations.
More lessons learned
I did not always go for the absolute minimum steps that I needed. I could have
saved some effort there. For example I also implemented less than operators for
the arbitrary precision numbers, where equality was enough (for tests). I also
implemented conversion from unsigned_decimal
to unsigned_binary
, where I
only needed conversion the other way.
Also at the beginning I spend what looked like a disproportionate amount of time setting up build, test, debugging environment so that it’s trivially easy and convenient to perform those activities. This did pay off as when I refactored (e.g. multiplication), test showed me regressions and after a quick debug, 5 minutes later the issue was fixed.
References
- Alexander A. Stepanov and Paul McJones: Elements of Programming
- Sean Parent’s concurrency presentation
- More notes on Fibonacci
- Compiler reordering