I. Complexity
Complexity analysis is a fundamental skill for implementing data structures and algorithms; the goal is to write code that uses less space and runs in less time.
Time complexity
- Big O notation:
T(n) = O(f(n))
- Represents the trend of code execution time as data size increases.
- Lower-order terms, constants, and coefficients are generally ignored when calculating complexity, because Big O only describes the growth trend
- There are several common levels of time complexity:
Polynomial magnitudes:
- Constant order: O(1)
- Logarithmic order: O(log n)
- Linear order: O(n)
- Linearithmic order: O(n log n)
- Polynomial orders: O(n²), O(n³), …, O(n^k)
Non-polynomial magnitudes (execution time explodes as n grows, so performance degrades sharply):
- Exponential order: O(2^n)
- Factorial order: O(n!)
- The addition rule and the multiplication rule
- Rule of addition: The total complexity is equal to the complexity of the code with the highest order of magnitude
- Product rule: The complexity of nested code is equal to the product of the complexity of both inside and outside the nested code
- Average time complexity:
- Also called weighted average time complexity or expected time complexity.
- Be able to calculate best-case, worst-case, and average-case time complexity
- Amortized time complexity
- A special case of average time complexity
- In other words, the algorithm follows a pattern: when analyzing the time, the cost of an occasional expensive operation can be spread evenly over the many cheap operations around it (a small sketch follows).
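For example (a minimal sketch of my own, not from the original text): appending to a dynamic array is expensive only when the array must grow; spread over the cheap appends that precede each growth, every append costs O(1) amortized.

// Sketch: amortized O(1) append to a dynamic array.
// The occasional O(n) copy when the array is full is "paid for"
// by the many O(1) appends that came before it.
public class DynamicArray {
    private int[] data = new int[8];
    private int size = 0;

    public void append(int value) {
        if (size == data.length) {                   // rare, expensive case: O(n)
            int[] bigger = new int[data.length * 2];
            System.arraycopy(data, 0, bigger, 0, size);
            data = bigger;
        }
        data[size++] = value;                        // common case: O(1)
    }
}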
II. Data structures
1. Arrays
An array is a linear-table data structure. It uses a contiguous block of memory to store a set of data of the same type. Features:
- The linear table
- Contiguous memory space
- Same type of data
- Random access
- Insertion and deletion are inefficient; the average time complexity is O(n)
Why do arrays start at zero?
- Because arrays are a linear-table data structure that uses a contiguous block of memory to store data of the same type.
So:
- If subscripts start at 0:
- The formula for calculating the address of an object with subscript k is:
a[k]_address = base_address + k * type_size
- If the subscript starts at 1:
- The formula for calculating the address of an object with subscript k is:
a[k]_address = base_address + (k-1) * type_size
- For the CPU, it’s one more subtraction instruction.
- C language designers started counting array subscripts with 0, and later Java, JavaScript and other high-level languages followed C’s example.
Can containers completely replace arrays?
Take Java’s ArrayList, for example: its biggest advantage is that it encapsulates many of the details of array operations, such as the data shifting needed for insertion and removal, and it also supports dynamic resizing.
So, as a high-level-language programmer, are arrays useless? Of course not; there are times when arrays are more appropriate. In general, for business development, using containers directly is enough and saves time and effort. After all, a small loss of performance will not affect the overall performance of the system. But if you are doing something very low-level, such as developing a network framework where performance needs to be squeezed to the extreme, arrays are preferable to containers.
2. Linked Lists
Instead of a contiguous memory space, it uses “Pointers” to connect a set of discrete memory blocks in series.
Several common forms of linked list: 1. Singly linked list 2. Circular linked list 3. Doubly (circular) linked list
Contrast with arrays:
The contrast between arrays and linked lists is not limited to time complexity: arrays support O(1) random access but O(n) insertion and deletion, while linked lists support O(1) insertion and deletion but O(n) access. Moreover, in real software development, complexity analysis alone cannot determine which data structure to use to store data.
Tips for writing linked list code: 1. Understand the meaning of pointers or references, and beware of pointer loss and memory leaks. 2. Use sentinel nodes to simplify the implementation. 3. Pay attention to the handling of boundary conditions.
Writing linked-list code is a real test of logical thinking, because it is full of pointer manipulation and boundary-condition handling, and the slightest slip introduces bugs. Well-written linked-list code shows whether a person is careful, considers the problem comprehensively, and thinks rigorously. That is why many interviewers like to ask candidates to write linked-list code by hand. So, write your own code to implement the exercises listed in this section.
- Reverse a singly linked list (a sketch follows this list)
- Detect a cycle in a linked list
- Merge two sorted linked lists
- Delete the nth node from the end of a list
- Find the middle node of a linked list
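As one example, here is a minimal sketch of iterative reversal of a singly linked list (the node class and method names are my own, not from the original text):

// Sketch: iterative reversal of a singly linked list in O(n) time, O(1) space.
class ListNode {
    int val;
    ListNode next;
    ListNode(int val) { this.val = val; }
}

class ListUtils {
    static ListNode reverse(ListNode head) {
        ListNode prev = null;
        ListNode curr = head;
        while (curr != null) {
            ListNode next = curr.next; // save the rest of the list before re-pointing
            curr.next = prev;          // reverse one pointer
            prev = curr;               // advance prev
            curr = next;               // advance curr
        }
        return prev;                   // prev is the new head
    }
}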
3. Stacks
- A sequential stack is implemented with an array
- A chained stack is implemented with a linked list
- Push and pop are O(1) in both time and space
- Last in, first out (LIFO); a minimal array-based sketch follows the application list below
Application:
- 1. Storing and destroying the temporary variables of function calls (the call stack)
- 2. Expression evaluation
- 3. Browser forward and backward
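Here is a minimal sketch of a sequential (array-based) stack; the class and method names are illustrative, not from the original text. Push and pop each touch only the top of the array, so both are O(1).

// Sketch: a fixed-capacity sequential stack backed by an array.
class ArrayStack {
    private final int[] items;
    private int count = 0;

    ArrayStack(int capacity) { items = new int[capacity]; }

    boolean push(int item) {
        if (count == items.length) return false; // stack full
        items[count++] = item;
        return true;
    }

    Integer pop() {
        if (count == 0) return null;             // stack empty
        return items[--count];
    }
}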
4. Queues
Features: First in, first out
- Implement sequential queues with arrays
- Use linked lists to implement chained queues
Queue expansion:
- Circular queue
- Avoids the data moving that a plain array-based queue requires
- Empty: head == tail
- Queue full: (tail + 1) % n == head (a sketch appears at the end of this queue section)
- Blocking queue
- When the queue is full, insertion blocks until space frees up; when it is empty, removal blocks until data arrives.
- Producer-consumer model
- Concurrent queue
- Thread-safe queues are called concurrent queues
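Below is a minimal sketch of a circular queue showing the empty and full conditions above (the class name and the choice to sacrifice one slot are my own illustration):

// Sketch: circular queue backed by an array of size n.
// Empty: head == tail.  Full: (tail + 1) % n == head (one slot stays unused).
class CircularQueue {
    private final int[] items;
    private final int n;
    private int head = 0, tail = 0;

    CircularQueue(int capacity) { n = capacity + 1; items = new int[n]; }

    boolean enqueue(int item) {
        if ((tail + 1) % n == head) return false; // full
        items[tail] = item;
        tail = (tail + 1) % n;
        return true;
    }

    Integer dequeue() {
        if (head == tail) return null;            // empty
        int item = items[head];
        head = (head + 1) % n;
        return item;
    }
}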
5. Skip lists
We know that arrays support fast random access while linked lists do not, so binary search cannot be applied directly to a linked list. In fact, we only need to modify the linked list slightly to make it support a binary-search-like lookup. The modified data structure is called a skip list.
A skip list adds multiple levels of “index” on top of an ordered linked list. Out of every two nodes (or some other number), one is promoted to the level above, which we call an index or index level. You can see this in my drawing, where down indicates the down pointer to the corresponding node in the level below.
Now, if we want to find some node, let’s say 16. We can go through the index level first, and when we get to 13 in the index level, we find that the next node is 17, so 16 must be somewhere between these two nodes. Then we go down to the original list level through the down pointer of the index layer node and continue iterating. At this point, we only have to go through two more nodes, and we can find the node that is equal to 16. So if you wanted to find 16, you had to go through 10 nodes, now you only have to go through 7.
The example above has a small data volume, so the improvement in search efficiency is not obvious. To really show how an index improves query efficiency, I drew a linked list of 64 nodes and built five levels of index on it in the same way.
As you can see from the graph, before there was no index, 62 nodes were traversed to find 62. Now we only need to traverse 11 nodes. Isn’t that much faster? Therefore, when the length n of the linked list is large, such as 1000 or 10000, the search efficiency will be significantly improved after the index is built.
Time complexity:
What is the time complexity of querying a piece of data in a skip list?
According to what we just said, one node is promoted out of every two as an index node for the level above, so the first-level index has about n/2 nodes, the second level about n/4, the third level about n/8, and so on. In other words, the k-th level index has half as many nodes as the (k-1)-th level, so the number of nodes in the k-th level index is n/2^k.
Suppose the index has h levels and the highest level has two nodes. Using the formula above, n/2^h = 2, so h = log₂n − 1. Including the original list level, the total height of the skip list is log₂n. If m nodes are traversed at each level, the time complexity of querying one piece of data in a skip list is O(m·log n).
So what is the value of m? According to the above index structure, we only need to traverse 3 nodes at most at each level of index, that is, m=3.
So the time to query any data in a skip list is O(log n), the same as binary search. In other words, we have effectively implemented binary search on top of a singly linked list. Isn’t that neat? However, there is no free lunch: this query-efficiency boost requires building many levels of index, which is the trade-space-for-time design idea we talked about in section 6.
Space complexity:
Aren’t skip lists a waste of memory? A skip list needs to store multiple levels of index and consumes more storage than a plain singly linked list. How much extra storage does it take? Let’s analyze the space complexity of a skip list.
It is not difficult. As said earlier, assuming the original linked list has n nodes, the first-level index has about n/2 nodes, the second-level index about n/4, and so on, halving at each level until only 2 nodes remain. Writing down the number of nodes at each index level gives a geometric sequence.
The sum of these index nodes is n/2 + n/4 + n/8 + … + 8 + 4 + 2 = n − 2. So the space complexity of a skip list is O(n). In other words, turning a singly linked list of n nodes into a skip list requires nearly n additional nodes of storage. Is there a way to reduce the memory footprint of the index?
Previously we promoted one node out of every two to the level above; if instead we promoted one out of every three or five nodes, we would not need nearly as many index nodes.
The first-level index then needs about n/3 nodes and the second-level index about n/9; at each step up, the number of index nodes is divided by 3. For ease of calculation, assume the highest level has 1 node. Writing down the number of nodes at each index level again gives a geometric sequence.
The total number of index nodes is approximately n/3 + n/9 + n/27 + … + 9 + 3 + 1 ≈ n/2. The space complexity is still O(n), but this halves the index storage compared with promoting one node out of every two.
In practice, we usually don’t need to worry much about the extra space taken by the index. When discussing data structures and algorithms we habitually treat the data as plain integers, but in real software the original list typically stores objects, while an index node only needs to store a key and a few pointers, not the object itself. So when the objects are much larger than the index nodes, the extra space taken by the index is negligible.
Dynamically updating the skip list index
If we keep inserting data into a skip list without updating the index, we may end up with too many data nodes between two index nodes. In the extreme case, the skip list degenerates into a singly linked list.
As a dynamic data structure, a skip list needs some way to keep the index in balance with the size of the original list: if the list gains nodes, index nodes should be added accordingly, so that the complexity of lookup, insertion, and deletion does not degrade.
When we insert data into a skip list, we can choose to insert it into some of the index levels at the same time. How do we choose which levels?
We use a random function to decide which levels the new node is inserted into: if the random function returns K, we add the node to every index from level 1 up to level K.
The choice of random function matters a great deal. Probabilistically, it keeps the index size and the data size of the skip list in balance, so performance does not degrade excessively.
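A common choice (my own sketch, not taken from the original text) is a coin-flip style random function: each additional level is granted with probability 1/2, which keeps the expected index sizes at roughly n/2, n/4, … as described above.

// Sketch: pick a random level for a newly inserted skip-list node.
// With probability 1/2 the node is promoted one more level, up to MAX_LEVEL.
import java.util.concurrent.ThreadLocalRandom;

class SkipListLevel {
    static final int MAX_LEVEL = 16;

    static int randomLevel() {
        int level = 1;
        while (ThreadLocalRandom.current().nextBoolean() && level < MAX_LEVEL) {
            level++;
        }
        return level;
    }
}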
Skip list features:
- The premise is ordered linked lists
- Dynamic data structure
- Supports fast query, insert, and delete operations in O(logn) time.
- On the surface, the space complexity is O(n), but because an index node only needs to store a key and a few pointers rather than the whole object, when the objects are much larger than the index nodes, the extra space occupied by the index can be ignored.
- Advantage over red-black trees: for range queries, a skip list can locate the start of the interval in O(log n) time and then simply iterate through the original linked list.
- The code implementation is much easier than red-black trees.
6. Hash tables
Features:
- Built on arrays, so elements can be accessed quickly by index
- A hash function maps the key to a non-negative integer, which is used as the array index for fast lookup
- Insertion, lookup, and deletion all take O(1) on average
Hash conflict:
- Different keys can hash to the same value, so hash collisions are essentially unavoidable
- There are two ways to resolve hash conflicts:
- Open addressing: linear detection, quadratic detection, double hashing
- Advantages: The data in the hash table is stored in an array, which can effectively use the CPU cache to speed up the query. Moreover, the hash table implemented in this method is relatively easy to serialize.
- Disadvantages: 1. Deleting data is troublesome; deleted slots need a special marker. 2. The load factor cannot be allowed to get too high, so this method wastes more memory than the linked-list method.
- Conclusion: When the data volume is small and the loading factor is small, the open addressing method is suitable. This is why ThreadLocalMap in Java uses open addressing to resolve hash conflicts.
- The linked list method
- Advantages: 1. The utilization ratio of memory is higher than that of open addressing method. 2. Higher tolerance to large loading factor; 3. Skip lists and red-black trees can be used instead of ordinary linked lists, so that even in extreme cases, the time complexity is only O(logn).
- Conclusion: Hash tables are more suitable for storing large objects and large amounts of data. Moreover, it is more flexible than open addressing and supports more optimization strategies, such as replacing linked lists with red-black trees.
- How full the table is can be expressed by the load factor
- Load factor = number of elements filled in/length of hash table
- The larger the load factor, the fewer free positions, the more conflicts, and the worse hash table performance.
Industrial-level hash tables:
Final requirements:
- Supports quick query, insert, and delete operations.
- Reasonable memory occupancy, can not waste too much memory space;
- Performance is stable, and in extreme cases, hash table performance does not degrade to unacceptable levels.
Specific design direction:
- Hash function requirements:
- Hash values should be distributed as evenly as possible
- You can’t make it too complicated and take too long to compute
- Dynamic capacity Expansion
- Dynamic expansion is triggered by the load factor: when the load factor exceeds a threshold, the table is resized (a sketch follows this list).
- Set the load factor threshold sensibly: too large and there will be too many collisions; too small and a lot of memory is wasted.
- The threshold is a trade-off between time and space. If memory is plentiful and execution efficiency matters most, lower the threshold; conversely, if memory is tight and efficiency is less critical, raise it, even above 1 (possible with the linked-list method).
- Choose appropriate conflict resolution methods
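As a rough sketch of the expansion rule (the threshold value and names here are only for illustration, not from the original text):

// Sketch: trigger dynamic expansion when the load factor exceeds a threshold.
class ResizePolicy {
    static final double LOAD_FACTOR_THRESHOLD = 0.75; // illustrative value

    // Returns the new capacity, or the old one if no resize is needed.
    static int maybeGrow(int elementCount, int capacity) {
        double loadFactor = (double) elementCount / capacity;
        if (loadFactor > LOAD_FACTOR_THRESHOLD) {
            return capacity * 2; // caller rehashes all entries into the larger table
        }
        return capacity;
    }
}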
Hash table and linked list combined application
LRU cache elimination algorithm
With the help of hash tables and linked lists, we can reduce the time complexity of LRU cache elimination algorithm to O(1).
With the hash table, finding a given piece of data in the linked list takes O(1); deleting and inserting nodes in the (doubly) linked list itself also takes O(1).
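For instance, Java’s LinkedHashMap (discussed below) already combines the two structures, so a small LRU cache can be sketched like this (the capacity handling is my own illustration, not the original text’s implementation):

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: an LRU cache built on LinkedHashMap's access-order mode.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);   // true = order entries by access time
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }
}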
Redis sorted sets
For example, a user score leaderboard works like this: we can look up score information by user ID, and we can also look up user IDs or names by score range. Here a member object contains the user ID, name, and score; the user ID is the key, and the score is the score.
So, if we distill the operations of a Redis sorted set, they look like this:
- Add a member object;
- Delete a member object by key;
- Find a member object by key;
- Search the data according to the score interval, for example, search the member object whose integral is between [100, 356];
- Sort member objects by score from smallest to largest;
If we organize member objects into a skip list keyed only by score, deleting and querying member objects by key would be slow. Similar to the LRU-cache solution, we can additionally build a hash table keyed by the member key, so that deleting and finding a member object by key takes O(1). At the same time, thanks to the skip list structure, the other operations remain very efficient.
Java LinkedHashMap
If you’re familiar with Java, you probably use this container almost every day. As mentioned earlier, the underlying implementation of HashMap is a hash table. So you might guess that LinkedHashMap is simply a hash table that resolves hash collisions with linked lists.
In fact, LinkedHashMap is not that simple, and the “Linked” in its name does not mean that it resolves hash collisions using linked lists.
As you might have guessed, LinkedHashMap is also implemented by combining hash and linked lists. Let’s start with the following code:
// 10 is the initial capacity, 0.75f is the load factor, and true means order entries by access time
Map<Integer, Integer> m = new LinkedHashMap<>(10, 0.75f, true);
m.put(3, 11);
m.put(1, 12);
m.put(5, 23);
m.put(2, 22);
m.put(3, 26);
m.get(5);
for (Map.Entry<Integer, Integer> e : m.entrySet()) {
    System.out.println(e.getKey());
}
This code prints 1, 2, 3, 5.
In fact, a LinkedHashMap ordered by access time is itself a cache that supports LRU eviction; the two work in exactly the same way.
To summarize, LinkedHashMap is actually implemented through a combination of two data structures, a two-way linked list and a hash table. The “Linked” in LinkedHashMap actually refers to bidirectional Linked lists, not to resolving hash conflicts using Linked lists.
Why are hash tables and linked lists often used together?
A hash table is a data structure that supports very efficient insertion, deletion, and lookup operations, but the data in a hash table is stored irregularly after being scrambled by the hash function. That is, it does not support fast traversal of data in some order. If we want to iterate over the hash table in order, we need to copy the hash table into an array, sort it, and iterate over it again.
Since a hash table is a dynamic data structure, with data constantly being inserted and deleted, every ordered traversal would require a fresh sort, which is bound to be inefficient. To solve this problem, we use hash tables together with linked lists (or skip lists).
III. Algorithms
1. Recursion
Conditions under which recursion can be used:
- The solution of a problem can be decomposed into several subproblems
- This problem and the decomposition of the sub-problem, except the data scale is different, the solution idea is exactly the same
- There are recursive termination conditions
Approach to writing recursive code:
- Derive the recurrence relation
- Find the termination condition
Disadvantages of recursive code:
- Risk of stack overflow
- Possible repeated computation of the same subproblems (see the memoization sketch below)
- Function-call overhead adds up
- High space complexity
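A classic illustration (my own sketch, not from the original text) is the “climbing stairs” recurrence f(n) = f(n-1) + f(n-2): a plain recursive version recomputes the same subproblems over and over, and a small memo table removes that repeated computation.

import java.util.HashMap;
import java.util.Map;

class ClimbStairs {
    private static final Map<Integer, Long> memo = new HashMap<>();

    // Recurrence: f(n) = f(n-1) + f(n-2); termination: f(1) = 1, f(2) = 2.
    static long ways(int n) {
        if (n == 1) return 1;
        if (n == 2) return 2;
        if (memo.containsKey(n)) return memo.get(n); // avoid repeated computation
        long result = ways(n - 1) + ways(n - 2);
        memo.put(n, result);
        return result;
    }
}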
2. Sorting
There are three factors to measure a good sorting algorithm:
- Execution efficiency
- Best time complexity
- Worst time complexity
- Average time complexity
- Coefficients, constants, and lower-order terms of the time complexity (because the data sets being sorted are often not very large)
- The number of times to compare or exchange
- Extra memory consumption (memory consumption O(1) is called in-place sort)
- Stability, whether stable sort (that is, if there are elements of equal value in the sequence to be sorted, the original order of equal elements remains unchanged after sorting)
Classification by time complexity:
- O(n²) : bubble sort, insert sort, select sort
- O(nlogn) : merge sort and quicksort
- O(n) : Bucket sort, count sort, radix sort (harsh conditions, applicable to some scenarios)
2.1 Bubble sort
Principle: Repeatedly compare adjacent elements and swap them when they are out of order; each pass bubbles the largest remaining element to the end (sketched after the list below).
- Best time complexity O(n)
- Worst time complexity O(n^2)
- Average time complexity O(n^2)
- Sort in place, stable sort
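A minimal sketch of bubble sort (class and method names are my own). The early exit when a pass makes no swaps is what gives the O(n) best case on already-sorted input.

// Sketch: bubble sort.
class BubbleSort {
    static void sort(int[] a) {
        for (int i = 0; i < a.length - 1; i++) {
            boolean swapped = false;
            for (int j = 0; j < a.length - 1 - i; j++) {
                if (a[j] > a[j + 1]) {        // swap only when strictly greater keeps it stable
                    int tmp = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = tmp;
                    swapped = true;
                }
            }
            if (!swapped) break;              // no swaps: already sorted
        }
    }
}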
2.2 Insertion Sort
Principle: The array is divided into a sorted region and an unsorted region; each step takes the first element of the unsorted region and inserts it into its correct position in the sorted region (sketched after the list below).
- Best time complexity O(n)
- Worst time complexity O(n^2)
- Average time complexity O(n^2)
- Sort in place, stable sort
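A minimal sketch of insertion sort (names are my own):

// Sketch: insertion sort. The sorted region grows from the left;
// each new element is shifted into its correct position.
class InsertionSort {
    static void sort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int value = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > value) { // shift larger sorted elements right
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = value;                // insert into the gap
        }
    }
}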
2.3 Selection Sort
Principle: The array is divided into a sorted region and an unsorted region; each step selects the smallest element from the unsorted region and appends it to the end of the sorted region.
- Best time complexity O(n^2)
- Worst time complexity O(n^2)
- Average time complexity O(n^2)
- In-place, but not a stable sorting algorithm
- For this reason, selection sort is generally not the first choice.
2.4 Merge Sort
How it works: The core idea of merge sort is quite simple. To sort an array, split it into two halves, sort each half separately, and then merge the two sorted halves so that the whole array is sorted (a sketch follows this list).
- Non-in-place sort, order n space.
- A stable sort
- Use the divide-and-conquer recursion idea
- Recursive formula:
merge_sort(p…r) = merge(merge_sort(p…q), merge_sort(q+1…r)), where q = (p + r) / 2
- Best, worst, and average time complexity are all O(nlogn).
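A minimal sketch following the recurrence above (names are my own):

// Sketch: merge sort; O(n log n) time, O(n) extra space, stable.
class MergeSort {
    static void sort(int[] a) { mergeSort(a, 0, a.length - 1); }

    private static void mergeSort(int[] a, int p, int r) {
        if (p >= r) return;
        int q = p + (r - p) / 2;
        mergeSort(a, p, q);
        mergeSort(a, q + 1, r);
        merge(a, p, q, r);
    }

    private static void merge(int[] a, int p, int q, int r) {
        int[] tmp = new int[r - p + 1];
        int i = p, j = q + 1, k = 0;
        while (i <= q && j <= r) {
            // "<=" keeps equal elements in their original order, so the sort is stable
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        }
        while (i <= q) tmp[k++] = a[i++];
        while (j <= r) tmp[k++] = a[j++];
        System.arraycopy(tmp, 0, a, p, tmp.length);
    }
}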
2.5 QuickSort
- In situ sorting
- Unstable sort
- Recursive formula (a sketch follows this list):
quick_sort(p…r) = quick_sort(p…q-1) + quick_sort(q+1…r), where q = partition(p…r) is the pivot's final position
- Best time complexity: O(nlogn)
- Worst time complexity: O(n^2) (extremely rare)
- Average time complexity: O(nlogn)
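A minimal sketch of quicksort using the last element as the pivot (names and pivot choice are my own illustration):

// Sketch: quicksort; in-place but not stable.
class QuickSort {
    static void sort(int[] a) { quickSort(a, 0, a.length - 1); }

    private static void quickSort(int[] a, int p, int r) {
        if (p >= r) return;
        int q = partition(a, p, r);
        quickSort(a, p, q - 1);
        quickSort(a, q + 1, r);
    }

    // Moves elements smaller than the pivot to the left, returns the pivot's final index.
    private static int partition(int[] a, int p, int r) {
        int pivot = a[r];
        int i = p;
        for (int j = p; j < r; j++) {
            if (a[j] < pivot) {
                swap(a, i, j);
                i++;
            }
        }
        swap(a, i, r);
        return i;
    }

    private static void swap(int[] a, int x, int y) {
        int tmp = a[x]; a[x] = a[y]; a[y] = tmp;
    }
}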
2.6 Bucket Sort
Bucket sort, as the name implies, uses “buckets”. The core idea is that the data to be sorted is distributed into several buckets that are themselves ordered relative to one another; the data in each bucket is then sorted separately, and finally the buckets are emptied in order to form a sorted sequence.
Why is bucket sort O(n)? Let’s analyze it. If we have n numbers to sort and divide them evenly into m buckets, each bucket holds k = n/m elements. Sorting each bucket internally with quicksort takes O(k log k), so sorting all m buckets takes O(m · k · log k). Since k = n/m, this is O(n · log(n/m)). When the number of buckets m is close to the number of elements n, log(n/m) is a very small constant, and the time complexity of bucket sort approaches O(n).
Harsh conditions:
- The data to be sorted needs to be easily divided into m buckets
- There is a natural order of size between buckets
- Data is distributed fairly evenly across buckets
2.7 Counting Sort
Counting sort is actually a special case of bucket sort in which the data range is small (for example, ages or exam scores) and the number of buckets is therefore limited. Take ranking college entrance exam candidates as an example: the maximum score is 900 and the minimum is 0, corresponding to 901 buckets. All candidates nationwide are placed into these 901 buckets; the data inside each bucket all have the same score, so they need no further sorting.
Special requirements:
- It can only be used when the data range is small. Counting sort is not suitable if the data range k is much larger than the number of items n to be sorted (a sketch follows this list).
- Counting sort can only sort non-negative integers. Data of other types must be converted to non-negative integers without changing relative order. For example, if the data to be sorted contains negatives in the range [-1000, 1000], first add 1000 to every value to make them non-negative. If exam scores are precise to one decimal place, multiply all scores by 10 to turn them into integers.
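A minimal sketch of counting sort for values in a small range [0, max] (names are my own; for the exam-score example, max would be 900):

// Sketch: counting sort for non-negative integers in [0, max].
// Prefix sums in count[] give each value's final position; the backward scan keeps it stable.
class CountingSort {
    static void sort(int[] a, int max) {
        int[] count = new int[max + 1];
        for (int v : a) count[v]++;                               // count occurrences
        for (int i = 1; i <= max; i++) count[i] += count[i - 1];  // prefix sums
        int[] result = new int[a.length];
        for (int i = a.length - 1; i >= 0; i--) {                 // scan backwards for stability
            result[--count[a[i]]] = a[i];
        }
        System.arraycopy(result, 0, a, 0, a.length);
    }
}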
2.8 Radix Sort
So let’s look at this sort of sorting problem. Suppose we have 100,000 mobile phone numbers and we want to sort them from smallest to largest. What’s a quick way to sort them?
We talked about quick sorting, time order nlogn, is there a more efficient sorting algorithm? Can bucket sort, count sort come in handy? The 11-digit range of mobile phone numbers is too large to use either sorting algorithm. Is there an O(n) algorithm for this sorting problem? Now I’m going to introduce a new sort algorithm, radix sort.
If you compare two mobile phone numbers a and b, and a is already larger than b in the leading digits, you don’t need to look at the remaining digits.
A stable sorting algorithm gives us a clever way to realize this idea. Remember the order example in Section 11, where we explained the stability of sorting algorithms? We can use the same trick here: sort the phone numbers by the last digit, then by the second-to-last, and so on, finally sorting by the first digit. After 11 passes, the phone numbers are in order.
The phone number is a little bit long, so it’s a little bit hard to see, but I’ve drawn a breakdown of radix sort using the string sort example, so you can look at it.
Note that the per-digit sort must be stable, otherwise this method will not be correct. With an unstable sorting algorithm, the final pass would only consider the order of the highest digit and ignore all the lower digits, making the earlier passes meaningless.
Sorting by each digit can use bucket sort or counting sort, which are O(n). If the data to be sorted has k digits, we need k passes of bucket sort or counting sort, for a total time of O(k·n). When k is small, as with phone numbers where k is at most 11, the time complexity of radix sort is approximately O(n).
In fact, the data to be sorted does not always have the same length. For example, when sorting the 200,000 English words in the Oxford Dictionary, the shortest has only 1 letter, and the longest, which I looked up specially, has 45 letters (the name of a lung disease). Does radix sort still work for data of unequal lengths?
We can pad all the words to the same length: words that are too short are padded with ‘0’ characters at the end. Because, by ASCII value, every letter is greater than ‘0’, the padding does not affect the original ordering, and we can proceed with radix sort as before.
To summarize, radix sort places the following requirements on the data (a sketch follows this list):
- Independent “bits” need to be separated for comparison, and there is a progressive relationship between the bits. If the high position of data A is larger than that of data B, the remaining low position need not be compared
- In addition, the data range of each bit should not be too large, so that the linear sorting algorithm can be used to sort, otherwise, the time complexity of radix sort cannot be O(n).
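A minimal sketch of LSD radix sort for fixed-length digit strings such as 11-digit phone numbers (names are my own; each pass is a stable counting sort on one character position):

// Sketch: least-significant-digit radix sort over fixed-length digit strings.
class RadixSort {
    static void sort(String[] phones, int width) {
        for (int pos = width - 1; pos >= 0; pos--) {   // last digit first
            countingSortByDigit(phones, pos);
        }
    }

    private static void countingSortByDigit(String[] a, int pos) {
        int[] count = new int[10 + 1];
        String[] out = new String[a.length];
        for (String s : a) count[s.charAt(pos) - '0' + 1]++;      // tally each digit
        for (int i = 0; i < 10; i++) count[i + 1] += count[i];    // starting index per digit
        for (String s : a) out[count[s.charAt(pos) - '0']++] = s; // stable placement
        System.arraycopy(out, 0, a, 0, a.length);
    }
}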
3. Searching
3.1 Binary search
- Depends on the array structure (very large data sets are not suitable, because contiguous storage is required)
- The data must be sorted
- Time complexity: O(log n) (a sketch follows)
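A minimal sketch of iterative binary search over a sorted array (names are my own):

// Sketch: binary search; O(log n) time.
class BinarySearch {
    static int indexOf(int[] sorted, int target) {
        int low = 0, high = sorted.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;  // avoids int overflow of (low + high) / 2
            if (sorted[mid] == target) return mid;
            if (sorted[mid] < target) low = mid + 1;
            else high = mid - 1;
        }
        return -1;                             // not found
    }
}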
If you found this useful, please give it a thumbs up. Follow my personal WeChat official account “Grade GUI”.