While writing generic code that can be reused is a great ideal, I claim that in practice there is a need to repeatedly implement similar functionality.

Introduction

I’m going to use the following two common tasks as examples: read the whole binary contents of a file and find a value in your data.

APIs for such tasks would be functions with declarations like:

std::vector<char> read_contents(const std::string & file_name);

template<typename It, typename T>
It find(It first, It last, const T & value);

Ideally, for each common task you would design one true API, implement it once, and invoke it whenever you need to perform the corresponding task. We want to reuse code, not repeatedly implement the same functionality. Code reuse reduces the amount of code that needs to be maintained, leading to increased productivity. For very common tasks such as the ones above, you might even use implementations from standard or third-party libraries, so you don’t have to implement the API at all. The STL is a great canonical example of the ‘implement once, reuse many times’ principle.

However, I find that in complex projects there are orthogonal low level requirements (i.e. unrelated to each other) that each require additional APIs for what appears to be the same high level task. The relation is exponential, i.e. the number of APIs grows exponentially with the number of these low level requirements.

I call this the fragmentation of APIs.

Language

The above functions are fine if your language is C++. For other languages (Python, Java, etc.) you’ll have a different API and you might duplicate implementations per language, even if conceptually the APIs are the same.

Platform dependencies

By platform I mean OS (operating system), version of OS, bitness (32 or 64 bit).

As soon as your implementation uses something platform dependent, you might duplicate some parts of the implementation, even if you maintain the same API.

Exceptions vs. error codes

The implicit assumption in the read_contents function above is that if it encounters an error, it will throw.

That is fine if you read a configuration file that is required for operation of the program and it is the responsibility of a separate setup program to have made it available.

However there are domains where failing to read files happens routinely (maybe the setup program itself has to be resilient against bad configurations).

For that you need some additional strategy to handle the disappointment, for example adding a version that does not throw and instead returns an error code that needs to be checked.

std::vector<char> read_contents(const std::string & file_name, std::error_code & ec) noexcept;

You will probably delegate the implementation of the throwing version to the non-throwing version of the API, but now there are two versions of the API to choose from.
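
A minimal sketch of that delegation, assuming the error code overload above does the actual work (std::system_error comes from <system_error>):

std::vector<char> read_contents(const std::string & file_name) {
  std::error_code ec;
  std::vector<char> result = read_contents(file_name, ec);
  if (ec) {
    // convert the reported error into an exception
    throw std::system_error(ec, "read_contents failed for " + file_name);
  }
  return result;
}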

NOTE: Allocations make it particularly difficult to have noexcept versions (see the Issues section in this article).

Another related situation is functions that are called from destructors (say, closing a file). You might have a version that is called as a normal function (which can throw), and a version that is called from a destructor. The version called from a destructor (or from a move operation) does not throw; instead it either ignores the failure, just logs it, or even terminates the process (depending on the situation).
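
A hypothetical wrapper illustrating the two flavours (the class and member names are just for illustration):

class file {
public:
  void close();                               // normal call, can throw on failure
  void close(std::error_code & ec) noexcept;  // reports failure through ec instead

  ~file() {
    std::error_code ec;
    close(ec);    // must not throw from a destructor
    if (ec) {
      // ignore, log, or terminate, depending on the situation
    }
  }
};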

Performance vs. easy/local reasoning

It’s very convenient to return the contents of the file, but if you open a large number of files sequentially you might want to avoid repeated allocations. So you might then pass a buffer by reference. See for example std::getline for a similar approach.

void read_contents(const std::string & file_name, std::vector<char> & buffer);
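
A hypothetical caller can then reuse one allocation across many files (file_names and process are placeholders):

std::vector<char> buffer;
for (const std::string & file_name : file_names) {
  read_contents(file_name, buffer);   // reuses the buffer's capacity across iterations
  process(buffer);                    // hypothetical consumer of the data
}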

As far as find is concerned, if you know that the data is sorted you might want to use something like std::lower_bound instead (never mind that different containers have their own find method).
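
As a sketch (the name sorted_find is mine), such a find for sorted data might be built on std::lower_bound:

#include <algorithm>

template<typename It, typename T>
// requires [first, last) is sorted with respect to <
It sorted_find(It first, It last, const T & value) {
  It it = std::lower_bound(first, last, value);
  return (it != last && !(value < *it)) ? it : last;  // last if not found
}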

But even for a sequential find there are faster options if you’re willing to break the default/usual/basic rules, e.g. if last can be dereferenced and written to, you can plant the searched value there as a sentinel and reduce the number of operations in the loop. See Knuth for this approach.

template<typename It, typename T>
// requires *last is writable
It mutable_find(It first, It last, const T & value) {
  *last = value;      // plant the value as a sentinel, so the loop needs no bounds check
  while (*first != value) {
    ++first;
  }
  return first;       // equals last if the value was not in [first, last)
}
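
The cost is that the caller must own a writable slot at last, e.g. by keeping a spare element at the end (a hypothetical usage):

std::vector<int> data = {3, 1, 4, 1, 5, 0};               // the final element is the spare sentinel slot
auto it = mutable_find(data.begin(), data.end() - 1, 4);  // searches [begin, end - 1)
bool found = (it != data.end() - 1);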

Blocking vs. asynchronous

The implicit assumption in the read_contents function above is that it will block the thread until the file is opened and read. If your program performs several activities you’ll want an additional version that returns as soon as the operation is initiated and calls a callback when completed.

template<typename Callback>
void async_read_contents(const std::string & file_name, std::vector<char> & buffer, Callback cb);
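
A hypothetical use might look like the following; the callback signature is an assumption here (a real API would have to pin it down, including how errors are reported):

std::vector<char> buffer;
async_read_contents("config.ini", buffer,
  [&buffer](std::error_code ec) {     // assumed callback signature
    if (!ec) {
      process(buffer);                // hypothetical consumer
    }
  });
// returns immediately; buffer must outlive the pending operation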

Parallelism variations

For find above, the assumption is that the search will be single threaded and sequential. In some cases it might be desirable to exploit parallelism. Here is a version that takes the type of parallelism to use as a parameter.

template<typename ExecutionPolicy, typename It, typename T>
It find(ExecutionPolicy && policy, It first, It last, const T & value);
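
This mirrors the execution policy overloads added to the standard library in C++17, e.g.:

#include <algorithm>
#include <execution>
#include <vector>

void find_example(const std::vector<int> & data) {
  // sequential
  auto it1 = std::find(data.begin(), data.end(), 42);
  // potentially parallel: the implementation may use multiple threads
  auto it2 = std::find(std::execution::par, data.begin(), data.end(), 42);
}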

Similar versions might appear for other thread related reasons, e.g. you might have a thread safe version of a data structure.

Blob vs. chunk

The return from read_contents is a vector. That is appropriate for small files that you trust will fit in memory. You might need a lower level API that returns some RAII wrapper for the file and a read function that reads a buffer at a time. read_contents should be implemented using the lower level API, but this still adds an additional choice for the user who wants to read the contents of a file.
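
A sketch of what that lower level API might look like (the names input_file and read_some are just for illustration):

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical lower level API: RAII ownership of the file handle plus chunked reads.
class input_file {
public:
  explicit input_file(const std::string & file_name);   // opens, throws on failure
  ~input_file();                                        // closes, does not throw
  // reads up to buffer_size bytes, returns the number of bytes read, 0 at end of file
  std::size_t read_some(char * buffer, std::size_t buffer_size);
};

// read_contents built on top of the lower level API
std::vector<char> read_contents(const std::string & file_name) {
  input_file file(file_name);
  std::vector<char> result;
  char chunk[4096];
  while (std::size_t n = file.read_some(chunk, sizeof(chunk))) {
    result.insert(result.end(), chunk, chunk + n);
  }
  return result;
}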

Eager vs. deferred

To filter a sequence, copy_if can be used to copy, then and there, the values that meet a predicate (the eager approach), or you can provide a view into the sequence, i.e. return an iterator that applies the predicate only when incremented, advancing until it reaches a value that satisfies it (the deferred approach).
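
For example, with the standard library (the view version uses C++20 ranges):

#include <algorithm>
#include <iterator>
#include <ranges>
#include <vector>

void filter_examples(const std::vector<int> & values) {
  auto is_even = [](int x) { return x % 2 == 0; };

  // eager: matching values are copied here and now
  std::vector<int> evens;
  std::copy_if(values.begin(), values.end(), std::back_inserter(evens), is_even);

  // deferred: a view; the predicate runs only while iterating
  for (int x : values | std::views::filter(is_even)) {
    // use x ...
  }
}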

Compile time vs. run time

The API can look different if you choose to do things at compile time. E.g. for a run time sized array you use std::vector and provide the desired size as an argument to the constructor; for a fixed compile time size you might use std::array and provide the size as a template parameter.
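
For example:

#include <array>
#include <vector>

std::vector<int> v(10);   // size is a run time constructor argument
std::array<int, 10> a{};  // size is a compile time template parameter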

What types should we accept?

For a function like min, do we have to have overloads with const arguments?

The const variation is just the start. How much do we constrain the types we use in the API?

  • Does file_name need to be a std::string? How about string literals like “config.ini”? This can be addressed with something like gsl::zstring_span.
  • file_name will probably be wide (16 bits per character) on Windows. A template can help, but you might still need to split part of the implementation.
  • vector<char> could instead be vector<unsigned char> or vector<std::byte>.
  • Does it even have to be a vector?
  • To support sentinels: do first and last have to have the same type?
  • To support ranges: do we need an overload receiving a range instead of first and last? (See the sketch after this list.)
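
A sketch of how those last two bullets change the shape of find (the range overload simply forwards to the iterator/sentinel one):

#include <iterator>

// iterator + sentinel: first and last may have different types
template<typename It, typename Sentinel, typename T>
It find(It first, Sentinel last, const T & value) {
  while (first != last && *first != value) {
    ++first;
  }
  return first;
}

// range overload, forwarding to the iterator/sentinel version
template<typename Range, typename T>
auto find(Range && range, const T & value) {
  return find(std::begin(range), std::end(range), value);
}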

But this can’t be generalized forever:

  • the file_name likely has to be zero terminated, so that it can be passed to the OS API, which is typically a C function expecting a zero terminated string.
  • When file_name is a string: what encoding do we accept?

Conclusion

This article shows that there are barriers to code reuse due to independent concerns that have different requirements.

For some of these issues we can find options that keep the number of APIs for a common task from multiplying exponentially.

But in general the consequence is a fragmentation of the APIs, that is, for a common task we need to provide multiple APIs and carefully choose which one to use each time.

Swiss army knife approaches like CreateFile, where a single function has additional parameters or flags that fundamentally change its behaviour (e.g. see the FILE_FLAG_OVERLAPPED flag), are really multiple APIs in disguise.

Sometimes the term bifurcation is used instead, e.g. with regard to dealing with exceptions, but I think that the term fragmentation better captures the many axes of splitting an API.

References