This is a famous algorithm challenge, it’s also a very tough one. I’ve seen this problem so many times around and never took the time to actually solve it. Practicing this stuff is good to keep the mind sharp so this time I invested some time to solve it, actually more time than expected. I’m writing this to help future seekers of knowledge that will come by.
Here’s the challenge:
You are given a set of integers coming from user input. The first number you get is the amount of integer values in the set. The next numbers you get are the integers in the set. You have to find the running median of the set of integers. The running median is the midpoint value of the sorted set, or the average of the two mid points if the number of elements in the set is even.
So if your data set is [3, 2, 4, 1, 5] you first need to sort it into [1, 2, 3, 4, 5] and then find the element with such a position that an equal number of elements are present on both left and right of that position.
If the set contains an even amount of elements, such as [1, 2, 3, 4] then there is no such position, in this case the median is the average of the two elements that, when taken together, satisfy the property, in this case the median will be 2.5.
What makes this challenge interesting is not how to find the median of a sorted array, everybody can do that. The real issue is how to solve the problem in the most efficient way when it comes to execution time, note that there is a sorting action that needs to be done for every element received. I’m going to provide here the worst possible approach and then explain why something better is needed.
The idea is straightforward, read a new element, insert the element in a List, sort the List and finally get the median out of it. But the idea is also highly inefficient, there is a call to Collections.sort() for every new element in the data set. This means that the time complexity of that code is something like N, number of elements in the set multiplied by Nlog(N) which is probably the complexity of Collections.sort(), or at least is the best possible complexity for the sorting.
Can this problem be solved with a faster algorithm? Yes.
A better approach using Heaps
To find the median we need to keep some sort of information about where the center of the data set is, possibly without sorting the set every time a new element is added. Introducing Heaps, a data structure that has an interesting property, the first element of a heap, called the head, is always the biggest or the smallest value in the whole heap. See it as a magical array that always has the smallest or biggest element at position 0. A min heap will always have the smallest element as head, a max heap will always have the biggest element as head. There is plenty of documentation on heaps out there so I won’t be explaining the internals here. Adding elements to a heap takes log(N) time, if we combine this property with our for loop that reads the elements from the user then we end up with a time complexity of N*log(N), which is a lot better than N*N*log(N).
We need two heaps, a left heap and a right heap.
- The left heap is a max heap, it contains roughly half of the elements in the set, the half that falls to the left of the median. The left heap’s head is its biggest element.
- The right heap is a min heap, it contain roughly half of the elements in the set, the half that falls to the right of the median. The right heap’s head is its smallest element.
The idea is to update left and right heap in such a way that we can always figure out the median value from the two heads. This is the algorithm step by step:
- When the first element is received, add it to the left heap.
The median is the element.
Wait for the next element.
- If the new element is bigger than the left heap’s head then add the new element to the right heap.
Otherwise add the new element to the left heap.
With this logic we try to split elements which are smaller than the median from those which are bigger. However there is a problem, consider the dataset [1,2,3,4,5], in this case our left heap would contain only 1 while the right heap would contain everything else because all the elements after 1 are bigger than 1. We need a way to keep the same amount of elements in both heaps. The next step addresses this.
- Take the head from the heap that has more elements and move it to the other heap. If the right heap has more elements then remove the head from the right heap and add it to the left heap. Otherwise, if left heap has more elements then remove the head from the left heap and add it to the right heap. Doing this at each step guarantees the balancing.
- At this point the two heaps are balanced, their sizes will differ by zero after receiving an even number of elements, and by one after receiving an odd number of elements. To find the median with an even number of elements we simply average the two heads. To find the median with an odd number of elements we take the head of the longest heap.
That’s our efficient algorithm, here is my implementation in java. It uses PriorityQueue which is a java implementation of heap.
Hope you liked the explanation, please let me know if something still sounds foggy so I can improve it.