Wednesday, 2 April 2014

Sorting and Efficiency

Sorting data is probably one of the most basic things you might ask a computer to do; it's only a step or two above asking it to store data or perform arithmetic. The problem seems simple and intuitive to all of us, so you might imagine that it is equally simple in computer science. Well, yes and no; it really depends on what you're looking for. Yes, writing an algorithm to sort a list is very simple; you can brute-force it if you like (please don't!). But sorting efficiently? Now that's a non-trivial problem, and one that I want to address here.

Asymptotic Run Time

To begin, we first need some tools for describing the efficiency of an algorithm. The number of steps, or the time it takes, for a sorting algorithm to run depends heavily on the machine it's being run on and on the details of the implementation being used. When we talk about an algorithm's efficiency, however, we want to detach our analysis from these specifics, and a good way to do this is to look at its asymptotic running time, i.e. big O. Say that the number of steps it takes for an algorithm to sort a list of n elements is T(n). Informally, we say T(n) is in O(g(n)) if, as n becomes really large, T(n) can be bounded from above by c*g(n), where c is some constant (more precisely: there exist constants c and n0 such that T(n) <= c*g(n) for all n >= n0). In other words, g(n) provides an upper limit on the asymptotic growth of T(n). If T(n) is in O(n) (linear time), then it cannot grow faster than a linear function; for example, it cannot grow like a quadratic or an exponential.
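To get a feel for how different these growth rates really are, here's a tiny Python sketch (with the constant c taken to be 1, purely for illustration) that tabulates n, n log n, and n^2 side by side:

    import math

    # Compare growth rates as n gets large (c = 1 for illustration)
    for n in (10, 100, 1000, 10000):
        print(f"n = {n:>6}:  n log n ~= {n * math.log2(n):>10.0f},  n^2 = {n ** 2:>10}")

Even by n = 10000, n^2 is already over a thousand times larger than n log n, which is why these distinctions matter so much for large inputs.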

Sorting Algorithms

Now that we're able to describe the efficiency of a sorting algorithm, let's look at a few and see what each has to offer. Note: this is in no way meant to be comprehensive.

Selection Sort

This algorithm is quite simple: you find the smallest element of the original list, delete it, and append it to a second list. You keep doing this until the original list is empty and then return the second list. On a single pass the algorithm makes n-1 comparisons, where n is the current size of the original list. This gives a total of (n-1) + (n-2) + ... + 1 comparisons, which means T(n) is in O(n^2). One characteristic of this algorithm is that all lists of size n take the same amount of time to sort.
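Here's a minimal Python sketch of the version described above, which builds a new list rather than sorting in place (the function name is my own):

    def selection_sort(lst):
        """Return a new sorted list by repeatedly extracting the minimum."""
        remaining = list(lst)           # copy so the original list is untouched
        result = []
        while remaining:
            smallest = min(remaining)   # n-1 comparisons on a list of size n
            remaining.remove(smallest)  # delete it from the original
            result.append(smallest)     # append it to the second list
        return result

Note that min and remove each scan the list, which is exactly where the quadratic total comes from.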

Insertion Sort

This is another simple algorithm, which builds a sorted list one element at a time by swapping each new element backwards through a sorted sublist until that sublist is sorted again. The worst case for this algorithm is the same as for selection sort, i.e. O(n^2), but it is very fast on nearly-sorted lists, i.e. O(n), and it is one of the most efficient sorting algorithms for small lists.
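A minimal in-place Python sketch of the idea (names are my own):

    def insertion_sort(lst):
        """Sort lst in place by growing a sorted prefix one element at a time."""
        for i in range(1, len(lst)):
            j = i
            # Swap lst[j] backwards until the prefix lst[:i+1] is sorted again
            while j > 0 and lst[j - 1] > lst[j]:
                lst[j - 1], lst[j] = lst[j], lst[j - 1]
                j -= 1

You can see why nearly-sorted input is the best case: if each element is already close to its final position, the inner while loop exits almost immediately, leaving just the O(n) outer loop.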

Merge Sort

This algorithm splits the list in half and then calls itself recursively on those halves. It then combines the sorted halves using a linear-time merge operation. Since the list can only be halved log2(n) times, the algorithm performs log2(n) rounds of merging, each costing O(n) in total, making it an O(n log n) sorting algorithm and therefore much better on large lists than both selection and insertion sort. Merge sort also shares with selection sort the characteristic that the order of the elements in the presorted list does not affect its runtime.
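Here's a Python sketch of merge sort returning a new list (again, the names are my own):

    def merge_sort(lst):
        """Return a new sorted list: split, recurse, then merge in linear time."""
        if len(lst) <= 1:
            return lst
        mid = len(lst) // 2
        left = merge_sort(lst[:mid])
        right = merge_sort(lst[mid:])
        # Linear-time merge of the two sorted halves
        merged = []
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])   # one of these is empty; the other is
        merged.extend(right[j:])  # the sorted tail that remains
        return merged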

Quick Sort

Quick sort works by choosing a pivot in the list, moving all values smaller than the pivot to its left and all larger values to its right, and then recursively sorting the two sublists. The efficiency of quick sort depends heavily on the choice of pivot. At its worst it is an O(n^2) algorithm like insertion sort and selection sort (though this is extremely unlikely if the pivot is chosen at random), but on average it is an O(n log n) algorithm like merge sort. In fact, given a wise choice of pivot, its performance in practice is typically better than merge sort's.
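A simple Python sketch of the idea, building new lists rather than partitioning in place (real implementations usually partition in place; this version just makes the structure obvious):

    def quick_sort(lst):
        """Return a new sorted list by partitioning around a pivot."""
        if len(lst) <= 1:
            return lst
        pivot = lst[len(lst) // 2]   # pivot choice matters; middle element here
        smaller = [x for x in lst if x < pivot]
        equal = [x for x in lst if x == pivot]
        larger = [x for x in lst if x > pivot]
        return quick_sort(smaller) + equal + quick_sort(larger)

The worst case happens when the pivot is always the smallest or largest value, so one sublist holds nearly everything and the recursion degenerates to n levels.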

Counting Sort

This algorithm reads the value of each integer in a list and increments a counter in a bookkeeping array for each copy of that integer found. The sorted list is then produced by reading the counters back off the bookkeeping array. You might have already realized that because you only go over the list once, this is an O(n) algorithm (more precisely O(n + k), where k is the range of values; more on that below)! Every sorting algorithm mentioned so far is an example of a comparison-based sorting algorithm, and theoretically the average runtime of such algorithms cannot be faster than O(n log n). Counting sort, however, is not comparison-based and is therefore not subject to this lower bound. The problem with counting sort is that it can only be implemented for lists of integers (or things easily mapped to integers), and it cannot be implemented efficiently when the difference between the max and min values of the list grows much larger than the size of the list itself, since the algorithm also takes linear time with respect to that difference.
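A minimal Python sketch for a list of integers (names are my own; the bookkeeping array is sized by the range k = max - min):

    def counting_sort(lst):
        """Return a new sorted list of integers using a counter per value."""
        if not lst:
            return []
        low, high = min(lst), max(lst)
        counts = [0] * (high - low + 1)  # bookkeeping array: one counter per value
        for x in lst:                    # single pass over the input
            counts[x - low] += 1
        result = []
        for offset, count in enumerate(counts):  # read the counters back off
            result.extend([low + offset] * count)
        return result

The second loop runs over the whole range of values, which is why the runtime is O(n + k) and why a huge range relative to the list size ruins the efficiency.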

The Ideal Solution

As you can see from the above, no algorithm has it all; each has its own strengths and weaknesses. Insertion sort is by far the best sorting algorithm for small lists but, as an O(n^2) algorithm, it is quickly out-scaled by both merge sort and quick sort. Quick sort and merge sort may both be O(n log n) on average, but quick sort has a worst case of O(n^2) whereas merge sort is always O(n log n). Counting sort's O(n + k) runtime is superior to all the other algorithms, but it is much more limited in the inputs it can handle. In the end, there probably won't be an ideal solution which works for every case, but there is certainly no shortage of algorithms well-suited to whatever specification you desire.