The heap
A heap is a special type of tree that needs to satisfy two requirements:
- The heap is a complete binary tree
- The value of each node in the heap must be greater than or equal to (or less than or equal to) the value of every node in its subtrees
A heap in which every node's value is greater than or equal to the value of every node in its subtrees is called a big top heap; a heap in which every node's value is less than or equal to the value of every node in its subtrees is called a small top heap.
How do you implement a heap?
To implement a heap, we need to know how to store a heap and what operations the heap supports.
How do I store a heap?
Because a heap is a complete binary tree, and a complete binary tree is well suited to array storage: storing it in an array wastes no space on pointers.
The operation of the heap
1. Insert an element into the heap
After an element is inserted into the heap, the position of the element is adjusted to meet the characteristics of the heap. This is called heapification.
There are actually two ways to heapify, bottom-up and top-down; let's start with the bottom-up approach.
The newly inserted data is placed at the end of the array and heapified from the bottom up.
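As a sketch of bottom-up heapification, here is a minimal insert for a big top heap stored in an array from index 1 (the `insert` name and standalone-function form are illustrative, not from the text above):

```python
from typing import List

def insert(a: List[int], value: int) -> None:
    """Append value and sift it up until its parent is no smaller (big top heap).

    a[0] is unused; real data starts at index 1.
    """
    a.append(value)
    i = len(a) - 1
    while i > 1 and a[i] > a[i // 2]:  # the parent of node i is node i // 2
        a[i], a[i // 2] = a[i // 2], a[i]
        i //= 2
```

After each insert, the new element has bubbled up just far enough that every parent is at least as large as its children.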
2. Delete the heap top element
After removing the top, we move the last node of the heap to the top, then repeatedly compare parent and child nodes: whenever a pair does not satisfy the parent-child size relationship, we swap the two nodes, and repeat until every node does. This is top-down heapification.
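A minimal sketch of this deletion for a big top heap stored from index 1 (the `remove_top` name is illustrative):

```python
from typing import List, Optional

def remove_top(a: List[int]) -> Optional[int]:
    """Pop the largest element: move the last node to the top, then sift down.

    a[0] is unused; real data starts at index 1.
    """
    if len(a) < 2:  # the heap is empty
        return None
    top = a[1]
    a[1] = a[-1]          # move the last node to the top
    a.pop()
    i, n = 1, len(a) - 1
    while True:           # top-down heapification
        largest = i
        left, right = 2 * i, 2 * i + 1
        if left <= n and a[left] > a[largest]:
            largest = left
        if right <= n and a[right] > a[largest]:
            largest = right
        if largest == i:
            break
        a[i], a[largest] = a[largest], a[i]
        i = largest
    return top
```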
Complexity analysis
For a complete binary tree with n nodes, the height of the tree does not exceed log2n. Heapification swaps nodes along a path between a node and the root (or a leaf), so its time complexity is proportional to the height of the tree, that is, O(logn). The main work of inserting data and of removing the heap top element is heapification, so the time complexity of both operations is O(logn).
Heap sort
The sorting method we implement with the help of the heap data structure is called heap sort. Heap sort has a very stable time complexity of O(nlogn), and it is an in-place sorting algorithm. How does it do that?
We can roughly break down the heap sort process into two big steps: build heap and sort.
1. Build the heap
We start by building the array into a heap in place, and there are two ways to think about it.
The first way uses the idea of inserting an element into the heap. Although the array already contains n data, we can assume that initially the heap contains only one element, the one with subscript 1; we then call the insert operation above to insert the elements with subscripts 2 through n one by one.
The second way heapifies the data in place, from back to front. Since leaf nodes have nothing below them to compare against, we only need to heapify the non-leaf nodes: starting from the last non-leaf node and working back to the root, we heapify each one from the top down.
Code implementation:
@classmethod
def build_heap(cls, a: List[int]) -> None:
    """Data in a needs to start from index 1."""
    # The last non-leaf node has index (len(a) - 1) // 2; walk back to the root.
    for i in range((len(a) - 1) // 2, 0, -1):
        cls._siftdown(a, len(a) - 1, i)

@classmethod
def _siftdown(cls, a: List[int], count: int, root_index: int = 1) -> None:
    i = larger_child_index = root_index
    while True:
        left, right = cls._left(i), cls._right(i)
        if left <= count and a[i] < a[left]:
            larger_child_index = left
        if right <= count and a[larger_child_index] < a[right]:
            larger_child_index = right
        if larger_child_index == i:
            break
        a[i], a[larger_child_index] = a[larger_child_index], a[i]
        i = larger_child_index

@staticmethod
def _left(i: int) -> int:
    return i * 2

@staticmethod
def _right(i: int) -> int:
    return i * 2 + 1

(These methods live in a class and need `from typing import List` at the top of the file.)
Heap build time complexity: O(n)
2. Sorting
After the heap is built, the data in the array is organized according to the characteristics of the big top heap. The first element in the array is the top of the heap, that is, the largest element. We swap it with the last element, so the largest element lands at subscript n, and then heapify the remaining n-1 elements to rebuild the heap. Once heapification is done, we again take the top element, place it at subscript n-1, and repeat the process until only the element with subscript 1 is left in the heap, at which point we are done.
Code implementation:
@classmethod
def sort(cls, a: List[int]) -> None:
    """Data in a needs to start from index 1."""
    cls.build_heap(a)
    k = len(a) - 1
    while k > 1:
        a[1], a[k] = a[k], a[1]
        k -= 1
        cls._siftdown(a, k)
Performance analysis
The whole process of heap sort requires very little temporary storage, so heap sort is an in-place sort algorithm.
Heap sort consists of two operations: heap build and sort. The time complexity of heap build is O(n) and the time complexity of sort is O(nlogn), so the overall time complexity of heap sort is O(nlogn).
Heap sort is not a stable sorting algorithm because it is possible to change the original relative order of data with the same value by swapping the last node of the heap with the top node.
Why does quicksort perform better than heap sort in real development?
First, heapsort data access is not as friendly as quicksort.
For quicksort, data is accessed sequentially, while heap sort accesses data in jumps, which makes heap sort unfriendly to the CPU cache.
Second, heapsort exchanges more data than quicksort does for the same data.
The application of the heap
Priority queue
A priority queue, as the name implies, should first and foremost be a queue. As we said earlier, the biggest feature of a queue is first in, first out (FIFO). However, in a priority queue, data is not dequeued in first-in, first-out order, but in priority order: the data with the highest priority is dequeued first.
How do you implement a priority queue? There are many ways to do this, but using the heap is the most straightforward and efficient, because a heap and a priority queue are very similar. A heap can be considered a priority queue; most of the time, the distinction is only conceptual. Adding an element to the priority queue is equivalent to inserting an element into the heap; removing the highest-priority element from the priority queue is equivalent to removing the top element of the heap.
Don’t underestimate this priority queue, it can be used in many different scenarios. A lot of the data structures and algorithms that we’ll talk about later depend on it. For example, Huffman coding, graph shortest path, minimum spanning tree algorithm and so on.
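In Python, for example, the standard-library heapq module turns a plain list into a small top heap, which is the usual way to get a priority queue (the task tuples below are illustrative):

```python
import heapq

# heapq maintains a min heap over a list, so the smallest
# (highest-priority) entry always pops first.
tasks = []
heapq.heappush(tasks, (2, "write report"))   # (priority, task)
heapq.heappush(tasks, (1, "fix outage"))
heapq.heappush(tasks, (3, "reply to email"))

priority, task = heapq.heappop(tasks)  # the priority-1 task comes out first
```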
1. Merge ordered small files
Suppose we have 100 small files, each 100MB in size, and each file stores strings in sorted order. We want to merge these 100 small files into one sorted large file. That's where the priority queue comes in.
The whole idea is a little bit like the merge function in merge sort. We take the first string from each of the 100 files, put it in the array, compare the size, put the smallest string in the merged large file, and delete it from the array.
Assuming that the smallest string comes from a small file called 13.txt, we take another string from this small file and place it in the array, re-size it, and select the smallest one to place in the merged larger file and delete it from the array. And so on, until all the data in the file is in the larger file.
Here we're using an array as the data structure to store the strings pulled from the small files. Every time we want the smallest string, we have to loop through the whole array, which is obviously not very efficient. Is there a more efficient way?
This is where you use priority queues, or heaps. We take the string from the small file and put it into the small top heap, and the element at the top of the heap, the element at the top of the priority queue, is the smallest string. We put the string into a large file and remove it from the heap. Then it takes the next string from the small file and puts it in the heap. By repeating this process, you can place the data from 100 smaller files into a larger file.
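As a sketch of this process, with three small in-memory lists standing in for the files (the variable names are illustrative), using Python's heapq:

```python
import heapq

# In-memory sorted lists stand in for the sorted small files.
files = [["apple", "dog"], ["bird", "cat"], ["ant", "zebra"]]

# The heap holds one (value, file_index, position) entry per file at a time.
heap = [(f[0], idx, 0) for idx, f in enumerate(files)]
heapq.heapify(heap)

merged = []
while heap:
    value, idx, pos = heapq.heappop(heap)  # smallest current string
    merged.append(value)                   # "write" it to the big file
    if pos + 1 < len(files[idx]):          # pull the next string from that file
        heapq.heappush(heap, (files[idx][pos + 1], idx, pos + 1))
```

The standard library also offers `heapq.merge(*files)`, which implements the same idea.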
2. High-performance timer
Suppose we have a timer that maintains many scheduled tasks, each of which has a point in time at which it should be triggered. Every time a small unit of time passes (say, 1 second), the timer scans the task list to see whether any task has reached its scheduled execution time. If one has, it takes it out and executes it.
However, scanning the task list every second is inefficient for two reasons. First, the scheduled execution time of the task may be a long time away from the current time, so many previous scans are useless. Second, scanning the entire task list each time can be time-consuming if the task list is large.
To solve these problems, we can use a priority queue. We store tasks in a priority queue ordered by their execution time, with the first task to be executed at the head of the queue (the top of the small top heap). This way, the timer does not need to scan the task list every second. It takes the execution time of the head task and subtracts the current time from it to get an interval T: how long from now until the first task needs to run. The timer can then simply wait T seconds before executing that task, and does nothing between the current point in time and T seconds from now. When T seconds have passed, the timer takes the head task out of the priority queue and executes it, then computes the difference between the execution time of the new head task and the current time, and uses that value as the next waiting interval.
This way, the timer does not have to poll every second, nor does it have to traverse the entire task list, which improves performance.
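A minimal sketch of this idea with heapq, assuming each task is a `(run_at, name)` tuple with an illustrative integer timestamp:

```python
import heapq

# Hypothetical task queue: each entry is (run_at_timestamp, task_name).
task_queue = []
heapq.heappush(task_queue, (1000, "send reminder"))
heapq.heappush(task_queue, (300, "rotate logs"))
heapq.heappush(task_queue, (600, "flush cache"))

now = 100
run_at, name = task_queue[0]   # peek at the earliest task; no scan needed
wait = run_at - now            # sleep for T = wait seconds instead of polling
```

A real timer would sleep for `wait` seconds, pop and run the task, then repeat with the new head of the queue.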
3. Find Top K
I’ve abstracted this problem of finding Top K into two categories. One is for a static data set, which means that the data set is predetermined and never changes. The other is for dynamic data sets, that is, the data set is not determined in advance, and there is data added to the set dynamically.
For static data, how do we find the largest K elements in an array containing n data? We can maintain a small top heap of size K and traverse the array sequentially, comparing each element with the top element of the heap. If it is larger than the heap top element, we remove the heap top element and insert the new one into the heap; if it is smaller, we do nothing and continue traversing. After the whole array has been traversed, the data in the heap is exactly the top K.
Traversing the array takes O(n) time, and one heapify operation takes O(logK) time, so in the worst case, where every element passes through the heap, the total time complexity is O(nlogK).
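The static approach can be sketched with Python's heapq (the `top_k` name is ours; `heapq.nlargest` gives the same result directly):

```python
import heapq
from typing import List

def top_k(data: List[int], k: int) -> List[int]:
    """Return the k largest values using a size-k small top heap."""
    heap = data[:k]
    heapq.heapify(heap)                 # small top heap of the first k items
    for x in data[k:]:
        if x > heap[0]:                 # larger than the current heap top:
            heapq.heapreplace(heap, x)  # pop the top and push x in one step
    return sorted(heap, reverse=True)
```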
Computing Top K for dynamic data means maintaining a real-time Top K. How do we understand that? Let me give you an example. A data set supports two operations: one adds data, and the other asks for the current largest K elements.
If we recalculated from scratch every time the largest K elements were requested, the time complexity would be O(nlogK), where n is the current size of the data. Instead, we can keep a small top heap of size K alive the whole time. When data is added to the collection, we compare it with the element at the top of the heap: if it is larger than the heap top element, we remove the heap top element and insert the new one into the heap; if it is smaller, we do nothing. This way, whenever the current top K is queried, we can return it immediately.
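A sketch of the dynamic version, keeping the size-K small top heap alive between insertions (the `TopK` class name is illustrative):

```python
import heapq
from typing import List

class TopK:
    """Maintain the k largest values seen so far (illustrative sketch)."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.heap: List[int] = []      # small top heap of the current top k

    def add(self, x: int) -> None:
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, x)
        elif x > self.heap[0]:         # beats the smallest of the top k
            heapq.heapreplace(self.heap, x)

    def query(self) -> List[int]:
        return sorted(self.heap, reverse=True)
```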
4. Find the median
The median, as the name suggests, is the number in the middle. Suppose we sort the n data in ascending order. If n is odd, the median is the (n/2 + 1)-th number. If n is even, there are two numbers in the middle, the (n/2)-th and the (n/2 + 1)-th, and we can take either one as the median; here we take the (n/2)-th.
For a static data set, the median is fixed. We can sort first, and the (n/2)-th number is the median; every time we ask for the median, we just return this fixed value. So, although the cost of sorting is high, the marginal cost of each query is small. However, if we're dealing with a dynamic data set, the median is constantly changing, and it's not efficient to sort first every time we ask for the median.
The heap is a data structure that allows us to do medians very efficiently without sorting. Let’s see, how does it do that?
We need to maintain two heaps, one big top heap and one small top heap. The big top heap stores the first half of the data, and the small top heap stores the second half of the data, and the data in the small top heap is larger than the data in the big top heap.
That is, if there are n data and n is even, then after sorting in ascending order, the first n/2 data are stored in the big top heap and the last n/2 in the small top heap. So the top element of the big top heap is the median we're looking for. If n is odd, the situation is similar: the big top heap stores n/2 + 1 data and the small top heap stores n/2 data.
As we mentioned earlier, data changes dynamically, so when a new data is added, how do we adjust the two heaps so that the top element in the big top heap continues to be the median?
If the new data is less than or equal to the top element of the big top heap, we insert it into the big top heap; otherwise, we insert it into the small top heap.
After an insertion, the number of data in the two heaps may no longer satisfy our convention: when n is even, both heaps should hold n/2 data; when n is odd, the big top heap should hold n/2 + 1 data and the small top heap n/2. Whenever the counts drift from this, we can keep moving the top element of one heap into the other until both heaps satisfy the convention again.
Thus, we can use two heaps, a big top heap and a small top heap, to find the median in a dynamic data set. Inserting data takes O(logn) time because it involves heapification, but finding the median only requires returning the top element of the big top heap, which takes O(1) time.
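A sketch of the two-heap scheme, assuming the convention above; since Python's heapq is a min heap, the big top heap stores negated values (the `MedianFinder` name is illustrative):

```python
import heapq
from typing import List

class MedianFinder:
    """Two-heap median for a dynamic data set (illustrative sketch)."""

    def __init__(self) -> None:
        self.lower: List[int] = []   # big top heap (negated values): first half
        self.upper: List[int] = []   # small top heap: second half

    def add(self, x: int) -> None:
        if not self.lower or x <= -self.lower[0]:
            heapq.heappush(self.lower, -x)
        else:
            heapq.heappush(self.upper, x)
        # Rebalance: lower holds n/2 elements (n even) or n/2 + 1 (n odd).
        if len(self.lower) > len(self.upper) + 1:
            heapq.heappush(self.upper, -heapq.heappop(self.lower))
        elif len(self.upper) > len(self.lower):
            heapq.heappush(self.lower, -heapq.heappop(self.upper))

    def median(self) -> int:
        return -self.lower[0]        # O(1): the top of the big top heap
```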