Finding a value by linear traversal

Introduction

find is a function that performs linear traversal to find a value. Essentially we’re going to implement std::find from the standard C++ library (see the article on min as to why).

There are some subtly different choices, mainly because the focus here is on what’s ideal, without any pressure to be backwards compatible.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
namespace algs {
  struct equal
  {
    template<typename T, typename U>
    bool operator()(T && a, U && b) const {
      return std::forward<T>(a) == std::forward<U>(b);
    }
  };

  struct identity
  {
    template<typename T>
    T && operator()(T && x) const {
      return std::forward<T>(x);
    }
  };

  // generalized find taking comparison and projection
  template<typename I, typename S, typename T, typename Cmp, typename Proj>
  // requires I is an InputIterator,
  //   S is a sentinel for I,
  //   Cmp is an equivalence relation between T and projection Proj of ValueType(I)
  I find(I f, S l, const T & v, Cmp cmp, Proj proj) {
    while (f != l) {
      if (cmp(v, std::invoke(proj, *f))) {
        return f;
      }
      ++f;
    }
    return f;
  }

  // generalized find taking comparison
  template<typename I, typename S, typename T, typename Cmp>
  // requires I is an InputIterator,
  //   S is a sentinel for I,
  //   Cmp is an equivalence relation between T and ValueType(I)
  I find(I f, S l, const T & v, Cmp cmp) {
    return find(f, l, v, cmp, algs::identity());
  }

  // straight forward find
  template<typename I, typename S, typename T>
  // requires I is an InputIterator,
  //   S is a sentinel for I,
  //   T is equality comparable with ValueType(I)
  I find(I f, S l, const T & v) {
    return find(f, l, v, algs::equal());
  }

  namespace range {
    // range find

    // generalized range find taking comparison and projection
    template<typename Range, typename T, typename Cmp, typename Proj>
    // requires R is an InputRange,
    //   Cmp is an equivalence relation between T and projection Proj of ValueType(Range)
    auto find(Range & r, const T & v, Cmp cmp, Proj proj) {
      return algs::find(std::begin(r), std::end(r), v, cmp, proj);
    }

    // generalized range find taking comparison
    template<typename Range, typename T, typename Cmp>
    // requires R is an InputRange,
    //   Cmp is an equivalence relation between T and ValueType(Range)
    auto find(Range & r, const T & v, Cmp cmp) {
      return algs::find(std::begin(r), std::end(r), v, cmp);
    }

    // straight forward
    template<typename Range, typename T>
    // requires R is an InputRange,
    //   T is equality comparable with ValueType(Range)
    auto find(Range & r, const T & v) {
      return algs::find(std::begin(r), std::end(r), v);
    }
  }
}

The rest of the article is about how did we get there.

Basic usage

Basic usage of find looks like this:

1
2
3
4
5
6
7
8
9
  std::vector<int> v{3, 5, 42, 4};
  auto it = algs::range::find(v, 42);
  if (it == v.end()) {
    std::cout << "Not found\n";
  }
  else {
    std::cout << "Found: " << *it << '\n';
  }
  // Prints: Found: 42

Tricks from min

I’ve reused some techniques that are explained in the min article such as:

using templates for a generic solution
algs namespace to address unwanted ADL
equal struct is similar to less from min; identity struct is identical
comparison and projection for more flexibility

Representing ranges

There are a few good ways to represent ranges.

One is by using a pointer to the element of an array and a count for the number of the elements in the range. In some situations this is good way. But for the situations when we consume from a range, this approach has the disadvantage that the pointer has to advance and the counter has to be decremented in sync.

Another way is to use two pointers. One pointer to the first element in the range. The second pointer points one past the last element in the range. This second pointer is traditionally called last, although strictly speaking it is one past the last. The rules inherited from C guarantee that one can point one past the end of an array as long it is not dereferenced.

A common usage scenario is to increment the first pointer until it becomes equal to the last:

1
2
3
4
  while (f != l) {
    // some code here using the dereferenced value *f
    ++f;
  }

This two pointers approach creates a closed-open range i.e. it includes the element pointed by the first pointer, but it does not include the element pointed by the last pointer. The advantage of a closed-open range is that it can represent empty ranges (when the two pointers are equal).

A generalization of the two pointers approach is to use iterators instead of pointers. We got to the end when the two iterators are equal.

A further generalization is to allow the types two be different so that only the first one is an iterator, the last one can act as a sentinel. For example the equality operator between the iterator and the sentinel can compare if the value pointer by the iterator is zero; in this case the sentinel is just an empty type such that the right equality operator is called. I took this approach in this implementation of find.

For convenience the range can be represented by a single object that groups the two values of either of the two approaches above by exposing methods like begin, end, size. That’s a range object.

Versions accepting range objects

When creating versions of find that accept range objects there is the problem of how to disambiguate between functions with the same number of arguments e.g. both functions below have three arguments:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
namespace algs {
  template<typename I, typename S, typename T>
  I find(I f, S l, const T & v) {
    // ...
  }

  namespace range {
    template<typename Range, typename T, typename Cmp>
    auto find(Range & r, const T & v, Cmp cmp) {
      // ...
    }

  }
}

The choice I’ve made above is to use an additional namespace for the functions receiving a range object. That’s maybe not ideal.

Passing range objects as arguments

Standard containers such as std::vector, std::list, std::set etc. can be used as range objects. That’s what I’ve shown above for the basic usage sample.

We don’t want to pass such containers when they are temporaries (rvalues) to find because the returned iterator will be dangling.

In the range::find implementation above, I avoid this passing the range by lvalue reference: Range & r.

However sometimes it’s desirable to accept temporaries for range objects. That’s the case when the range object provides a view into a container. The range::find implementation above does not handle this, which is not ideal. See ranges work for proposal on how this will be dealt with by the standard libraries.

Return value

The return value of find is an iterator.

When first and last have the same type, either can be returned when they become equal. But when last plays the role of a sentinel, it can have a different type from first. This leads to the decision to return first (or a value reached from first) when first becomes equal to last. Therefore the return type of find is the same as the type for first.

The user to either use returned iterator to get the found value, or use it to build a range e.g.: either the range from the value to the end or from the beginning to the value (if the iterator is ForwardIterator).

Non-homogeneous comparison

In the example above we searched for 42 in a vector of int. The values in the container have the same type as the value we’re looking for. That equality comparison is homogeneous.

But there are cases where the value we’re searching for is not of the same type as the values in the container (or the values pointed by the iterator, i.e. the value type of the container and/or iterator):

1
2
3
4
5
6
7
8
9
void foo(const std::vector<std::string>& v) {
  auto it = algs::range::find(v, "bar");
  if (it == v.end()) {
    std::cout << "Not found\n";
  }
  else {
    std::cout << "Found: " << *it << '\n';
  }
}

In the example above we’re looking for a literal string in a vector of std::string. The reason this works is that there is an equality operator that allows comparing a std::string with a literal string. This equality comparison is heterogeneous. This is the meaning of the comment meant to describe the requirement that T is equality comparable with ValueType(Range).

This is the same idea as in the method find for a std::set which was made a template function in C++14 to allow e.g. searching using a literal string in a std::set<std::string>.

find_if, find_if_not (which take a predicate instead of a value)
lower_bound, upper_bound, equal_range (which work on sorted ranges)
search (which takes two ranges)

Faster linear find

There are faster options for linear find (even ignoring hardware speed-ups, or sorted input).

To start, if last can de dereferenced and the searched value stored, the version below saves an extra comparison (of first with last) for every iteration in the loop.

1
2
3
4
5
6
7
8
9
10
11
12
  template<typename I, typename T>
  // requires I is an ForwardIterator
  //   T is equality comparable,
  //   T is the same as ValueType(I)
  I find_by_setting_sentinel(I f, I l, const T & v) {
    // precondition: l can be dereferenced to store value as sentinel
    *l = v;
    while (*f != v) {
      ++f;
    }
    return f;
  }

This is an option for rare situations however, because it’s usually unexpected to a user to allow dereferencing last and mutating the input to find a value.

References

Elements of Programming (book by Alexander A. Stepanov and Paul McJones)
Article on implementing min
Donald Knuth: “Structured Programming with Goto Statements” (December 1974)