Archives | Categories

The Interpolation Search Algorithm with Python

Table of Contents

<2019-01-30 Wed>

One of the most important operations for all data structures is searching for the elements from the stored data. There are various methods to search for an element in the data structures; in this article, we shall explore the interpolation search strategy that can be used to find elements in a collection of items.

The source code used in this article is available at https://github.com/PacktPublishing/Hands-On-Data-Structures-and-Algorithms-with-Python-3.7-Second-Edition/tree/master/Chapter09.

1. Interpolation search

The interpolation searching algorithm is an improved version of the binary search algorithm. It performs very efficiently when there are uniformly distributed elements in the sorted list. In a binary search, we always start searching from the middle of the list, whereas in the interpolation search we determine the starting position depending on the item to be searched. In the interpolation search algorithm, the starting search position is most likely to be the closest to the start or end of the list depending on the search item. If the search item is near to the first element in the list, then the starting search position is likely to be near the start of the list.

The interpolation search is another variant of the binary search algorithm that is quite similar to how humans perform the search on any list of items. It is based on trying to make a good guess of the index position where a search item is likely to be found in a sorted list of items. It works in a similar way to the binary search algorithm except for the method to determine the splitting criteria to divide the data in order to reduce the number of comparisons. In the case of an interpolation search, we divide the data using the following formula:

mid_point = lower_bound_index + \
    (upper_bound_index - lower_bound_index) * \
    (search_value - input_list[lower_bound_index]) // \
    (input_list[upper_bound_index] - input_list[lower_bound_index])

In the preceding formula, the lower_bound_index variable is the lower-bound index, which is the index of the smallest value in the input_list list, denoting the index position of the lowest value in the list. The input_list[lower_bound_index] and input_list[upper_bound_index] variables are the lowest and highest values respectively in the list. The search_value variable contains the value of the item that is to be searched.

Let's consider an example to understand how the interpolation searching algorithm works using the following list of 7 items:


To find 120, we know that we should look at the right-hand portion of the list. Our initial treatment of binary search would typically examine the middle element first in order to determine if it matches the search term.

A more human-like method would be to pick a middle element in such a way as to not only split the array in half but to get as close as possible to the search term. The middle position was calculated using the following rule:

mid_point = (index_of_first_element + index_of_last_element) // 2

We shall replace this formula with a better one that brings us closer to the search term in the case of the interpolation search algorithm. The mid_point will receive the return value of the nearest_mid function, which is computed using the following method:

def nearest_mid(input_list, lower_bound_index, upper_bound_index, search_value):
    return lower_bound_index + \
        (upper_bound_index - lower_bound_index) * \
        (search_value - input_list[lower_bound_index]) // \
        (input_list[upper_bound_index] - input_list[lower_bound_index])

The nearest_mid function takes, as arguments, the lists on which to perform the search. The lower_bound_index and upper_bound_index parameters represent the bounds in the list within which we are hoping to find the search term. Furthermore, search_value represents the value being searched for.

Given our search list, 44, 60, 75, 100, 120, 230, and 250, nearest_mid will be computed with the following values:

lower_bound_index = 0
upper_bound_index = 6
input_list[upper_bound_index] = 250
input_list[lower_bound_index] = 44
search_value = 230

Let's compute the mid_point value:

mid_point = 0 + (6-0) * (230-44) // (250-44)
          = 5

It can now be seen that the mid_point value will receive the value 5. So, in the case of an interpolation search, the algorithm will start searching from the index position 5, which is the index of the location of our search term. Thus, the item to be searched will be found in the first comparison, whereas in the case of a binary search, we would have chosen the position of 100 as mid_point, which would have required another run of the algorithm.

A more visual illustration of how a typical binary search differs from an interpolation is given as follows. In a typical binary search, it finds the midpoint that looks like it's in the middle of the list:


One can see that the midpoint is actually standing approximately in the middle of the preceding list. This is as a result of dividing the list by two.

In the case of an interpolation search, on the other hand, the midpoint is moved to the most likely position where the item can be matched:


In an interpolation search, the midpoint is generally more to the left or right. This is caused by the effect of the multiplier being used when dividing to obtain the midpoint. In the preceding diagram, our midpoint has been skewed to the right.

The implementation of the interpolation algorithm remains the same as that of the binary search except for the way we compute the midpoint.

Here, we provide the implementation of the interpolation search algorithm, as shown in the following code:

def interpolation_search(ordered_list, term):

    size_of_list = len(ordered_list) - 1

    index_of_first_element = 0
    index_of_last_element = size_of_list

    while index_of_first_element <= index_of_last_element:
        mid_point = nearest_mid(ordered_list, index_of_first_element, index_of_last_element, term)

        if mid_point > index_of_last_element or mid_point < index_of_first_element:
            return None

        if ordered_list[mid_point] == term:
            return mid_point

        if term > ordered_list[mid_point]:
            index_of_first_element = mid_point + 1
        else:
            index_of_last_element = mid_point - 1

    if index_of_first_element > index_of_last_element:
        return None

The nearest_mid function makes use of a multiplication operation. This can produce values that are greater than upper_bound_index or lower than lower_bound_index. When this occurs, it means the search term, term, is not in the list. None is, therefore, returned to represent this.

So, what happens when ordered_list[mid_point] does not equal the search term? Well, we must now readjust index_of_first_element and index_of_last_element so that the algorithm will focus on the part of the array that is likely to contain the search term.

if term > ordered_list[mid_point]:
    index_of_first_element = mid_point + 1

If the search term is greater than the value stored at ordered_list[mid_point], then we only adjust the index_of_first_element variable to point to the mid_point + 1 index.

The following diagram shows how the adjustment occurs. The index_of_first_element is adjusted and pointed to the mid_point + 1 index:


On the other hand, if the search term is less than the value stored at ordered_list[mid_point], then we only adjust the index_of_last_element variable to point to the index mid_point - 1. This logic is captured in the else part of the if statement index_of_last_element = mid_point - 1:


The diagram shows the effect of the recalculation of index_of_last_element on the position of the midpoint.

Let's use a more practical example to understand the inner workings of both the binary search and interpolation algorithms.

Consider for example the following list of elements:

[ 2, 4, 5, 12, 43, 54, 60, 77]

At index 0, the value 2 is stored, and at index 7, the value 77 is stored. Now, assume that we want to find the element 2 in the list. How will the two different algorithms go about it?

If we pass this list to the interpolation search function, then the nearest_mid function will return a value equal to 0 using the formula of mid_point computation which is as follows:

mid_point = 0 + (7-0) * (2-2) // (77-2)
          = 0

As we get the mid_point value 0, we start the interpolation search with the value at index 0. Just with one comparison, we have found the search term.

On the other hand, the binary search algorithm needs three comparisons to arrive at the search term, as illustrated in the following diagram:


The first mid_point value calculated is 3. The second mid_point value is 1 and the last mid_point value where the search term is found is 0.

Therefore, it is clear that the interpolation search algorithm performs better than binary search in most cases.

2. Choosing a search algorithm

The binary search and interpolation search algorithms are better in performance compared to both ordered and unordered linear search functions. Because of the sequential probing of elements in the list to find the search term, ordered and unordered linear searches have a time complexity of O(n). This gives a very poor performance when the list is large.

The binary search operation, on the other hand, slices the list in two anytime a search is attempted. On each iteration, we approach the search term much faster than in a linear strategy. The time complexity yields O(log n). Despite the speed gain in using a binary search, the main disadvantage of it is that it cannot be applied on an unsorted list of items, neither is it advised to be used for a list of small size due to an overhead of sorting.

The ability to get to the portion of the list that holds a search term determines, to a large extent, how well a search algorithm will perform. In the interpolation search algorithm, the midpoint is computed in such a way that it gives a higher probability of obtaining our search term faster. The average-case time complexity of the interpolation search is O(log(log n)), whereas the worst-case time complexity of the interpolation search algorithm is O(n). This shows that interpolation search is better than binary search and provides faster searching in most cases.

3. Further Reading

If you found this article interesting, you can explore Hands-On Data Structures and Algorithms with Python to learn to implement complex data structures and algorithms using Python. Hands-On Data Structures and Algorithms with Python teaches you the essential Python data structures and the most common algorithms for building easy and maintainable applications.

Copyright © KDr2, SOME RIGHTS RESERVED UNDER CC BY-NC 4.0.

Built with Emacs 28.2 (Org mode 9.5.5).

Last updated: 2022-01-28 Fri 13:42.