This article was first published on the Nuggets community as a contracted article; reprinting without authorization is prohibited.
Preface
This is a special topic on data structures and algorithms, and it will be divided into three parts:
- Basic data structures: besides the structures named in the chapter titles, this part will cover hash tables, trees, heaps, and others.
- Sorting algorithms: an introduction to common algorithms such as bubble sort, selection sort, insertion sort, merge sort, quicksort, and heap sort.
- Advanced data structures: "advanced" here does not mean more sophisticated; these are mainly extensions of the basic structures above, such as B+ trees, red-black trees, and improved hash schemes such as cuckoo hashing.
Of course, this is just my current writing plan; more may be added, so stay tuned.
Before starting this chapter, let me share my thoughts on data structures.
Program = data structure + algorithm.
How important is it for us programmers to learn data structures and algorithms? Let me give you a couple of examples.
First of all, remember that data structures and algorithms exist to solve real problems. For example, suppose our database holds 10 million records, and I want to retrieve a particular record instantly.
If you are unfamiliar with data structures and algorithms, your first instinct for a query is to iterate, but traversing 10 million records is extremely slow.
Is there a better way to solve this problem?
Yes: we can organize the data into a hash table, which lets us look up any record in O(1) time. Whether there are 10 million records or 20 million, the query returns in a fraction of a second, which is amazing. Lookups in the Redis middleware and in TiDB, a newer database from China, are built on hash tables.
Besides hash tables, we can also use tree structures, such as the B+ tree index commonly used in relational databases. It is an N-ary tree that answers queries in O(logN) time. That is not as fast as a hash table, but it supports range queries, which is very important for databases, so databases generally choose it for the primary key index.
Those were two simple examples of data structures, but what about algorithms? In many cases, algorithms and data structures complement each other.
Take the tree: why are its queries fast?
It embodies the divide-and-conquer idea: during a query, binary search repeatedly halves the data, bringing the time complexity down to O(logN).
A variation of the tree also gave birth to the heap, which is what we usually call a priority queue; the heap's properties in turn gave rise to heap sort, which works very well for top-K problems.
So much for trees. What about hash tables?
Their performance depends largely on how good the hash algorithm is. Almost every language has a built-in hash table, such as HashMap in Java and dict in various scripting languages. The Python 2.7 dictionary even had bugs caused by implementation problems.
So data structures and algorithms matter a great deal. Even though we rarely use them directly, the class libraries and middleware we work with are full of them. And from a purely practical standpoint, algorithm questions have long been a must in interviews at big tech companies.
Without further ado, let's get started.
Note: throughout this topic I will try to illustrate with practical applications, and the implementations will be in Java. I believe front-end readers can follow along too; if you can read JS, what is hard about Java?
1. Big O notation
Any development engineer, even one without a formal computer science education, has probably heard of big O notation at some point.
It is a notation used to measure time complexity and space complexity, that is, how long an algorithm takes and how much memory it occupies. In this article it is used mainly for time complexity, not space complexity.
For example, to measure how fast an algorithm is, your first instinct might be to time it, but the same algorithm runs very differently on an old Pentium than on the latest i9-9700K.
So we need a more precise notation, which is the big O notation, where we measure how fast an algorithm is by the number of steps.
```java
public void test(int[] items) {
    for (int item : items) { } // visits each of the N elements once
}
```
Say this is a traversal algorithm that must visit every element in items. If items has length N, the traversal performs N steps, so its time complexity is O(N).
The time complexity of the big O notation can be roughly divided into the following levels:
- O(1): constant level. The algorithm takes a constant number of steps regardless of the input size, and the count does not grow as the input grows. Hash table lookups are at this level.
- O(N): linear level. The number of steps grows in proportion to the input. Traversal algorithms are at this level.
- O(logN): logarithmic level. Each time the input doubles, the number of steps grows by only 1. Binary search is at this level.
- O(N²): square level. The number of steps grows as the square of the input. This is usually the level you hit when your algorithm uses two nested for loops, as in bubble sort (see the sketch below).
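As a quick illustration of the square level, here is a small hypothetical method (not from the original examples) that counts equal pairs with two nested loops; for an input of length N it performs on the order of N² steps:

```java
// Hypothetical example: count how many pairs of equal elements the
// array contains. The nested loops visit every pair once, so the
// number of steps grows as the square of the input: O(N²).
public int countEqualPairs(int[] items) {
    int count = 0;
    for (int i = 0; i < items.length; i++) {
        for (int j = i + 1; j < items.length; j++) {
            if (items[i] == items[j]) {
                count++;
            }
        }
    }
    return count;
}
```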
Then there is the cubic level, which I won't dwell on; it behaves like the square level, only worse.
The four levels above should be easy to understand, except perhaps the logarithmic one, which involves the mathematical symbol log; here the base is 2 and is omitted by convention.
The logarithmic level can be understood with a small example: we have a sorted array and use binary search to find one of its numbers.
If the array holds 16 numbers, I need at most 4 lookups, because 2 to the 4th power is 16.
If the array holds 2,147,483,647 numbers (about 2.1 billion), I need at most 31 lookups, since 2 to the 31st power is 2,147,483,648. That is pretty fast.
So you can read the logarithmic level as: every time the input doubles, the step count only needs +1.
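To make that concrete, here is a minimal binary search sketch over a sorted int array; each iteration halves the remaining range, so it takes at most about log₂N steps:

```java
// Minimal binary search over a sorted array: returns the index of
// target, or -1 if absent. Each iteration halves the search range,
// giving O(logN) time.
public int binarySearch(int[] sorted, int target) {
    int low = 0;
    int high = sorted.length - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2; // midpoint, written to avoid int overflow
        if (sorted[mid] == target) {
            return mid;
        } else if (sorted[mid] < target) {
            low = mid + 1;  // discard the left half
        } else {
            high = mid - 1; // discard the right half
        }
    }
    return -1;
}
```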
The other thing I want to emphasize about big O notation is that it is not mathematically exact.
Say an algorithm takes 2N iterations to complete; its time complexity is still O(N). Constants are dropped in big O notation, because a constant is fixed and does not change with the input.
The same goes for O(1). Even if you have never studied hash tables, you know no algorithm can find a number in literally one step; it takes several. But whether it is five steps or six, it is a constant number of steps independent of the input size, so it is O(1).
Finally, the same algorithm may take completely different numbers of steps on different inputs, so big O notation usually quotes a representative value, normally the worst case. For example, some sorting algorithms show different complexity on sorted input than on reverse-sorted input; we generally quote the more expensive one. More on that later.
2. The array
An array is a data structure that supports fast access. It is one of the cornerstones of data structures, and every language has built-in support for it.
In modern programming languages, an array is often used as a container: it can conveniently hold hundreds or thousands of elements that would otherwise require hundreds or thousands of separate references.
In an array, the memory address of each element can be computed directly from the array's base address and the element's index (address = base + index * elementSize), which makes reads and writes fast; lookup by index is O(1).
Inserting or deleting at the end of an array can be done directly, but inserting or deleting anywhere else requires repositioning other elements. For example, after removing the first element, every following element must move forward one slot.
One big weakness of arrays is that once an array is full, it cannot hold any more elements.
High-level languages often solve this with dynamic arrays. A dynamic array expands automatically: when its capacity reaches a threshold, it allocates a larger array and copies the original elements over.
The ArrayList class in Java does just that.
Next, I'll implement a simple dynamic array:
```java
import java.util.NoSuchElementException;
import java.util.Objects;

public class DiyList<T> {

    private Object[] items;
    private int size = 0;

    public DiyList() {
        items = new Object[16];
    }

    @SuppressWarnings("unchecked")
    public T get(int index) {
        if (index < 0 || index >= size) {
            throw new NoSuchElementException();
        }
        return (T) items[index];
    }

    public boolean add(T item) {
        if (Objects.isNull(item)) {
            throw new NullPointerException();
        }
        if (size >= items.length / 2) {
            grow(); // expand before the array fills up
        }
        items[size++] = item;
        return true;
    }

    // Expansion: allocate a larger array and copy the old elements over.
    private void grow() {
        Object[] newItems = new Object[items.length * 2];
        System.arraycopy(items, 0, newItems, 0, items.length);
        items = newItems;
    }

    public int size() {
        return size;
    }
}
```
Because Java generics are erased at runtime, I cannot create a generic array directly; I can only use an Object array and cast each element to the generic type in get.
In this example I implemented only the basic get and add methods. The key to dynamic expansion is the grow method, which enlarges the array; expansion can be as simple as creating a larger array and copying the original elements into it.
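A quick usage example of the DiyList above:

```java
DiyList<String> list = new DiyList<>();
list.add("a");
list.add("b");
System.out.println(list.get(0)); // prints "a"
System.out.println(list.size()); // prints 2
```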
While dynamic arrays help you automatically scale, they also come with the cost of copying elements:
- When the array needs to expand, its elements must be copied.
- When an element is deleted, elements must also be moved: removing one leaves a hole in the middle of the array, and every following element has to shift forward one slot.
Therefore, dynamic arrays are best suited to lookups by index; insertion and deletion cost extra time and copying.
2.1 After-class practice
On top of the example above, add a delete method. The example has been self-tested and can be copied and run directly.
3. The linked list
The previous section's array excels at access; this section's linked list excels at insertion and deletion.
An array must be contiguous in memory, but a linked list need not be: each node refers to the next element by reference, so the structure looks more like a chain.
The code structure of a linked list generally looks like this:
```java
public class DiyLinked<T> {

    private int size = 0;
    private Node<T> first; // head of the list

    private static class Node<T> {
        private T item;
        private Node<T> next;

        public Node(T t) {
            item = t;
        }
    }
}
```
Besides its own data item, each Node holds a next reference that points to the following node.
This structure also dooms the linked list to slow random access: it can only find an element by traversing, so lookups are O(N). Insertion and deletion at the head node (the first node), however, are fast, just O(1), because all you need to change is the reference to the first element.
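Here is a minimal sketch of O(1) head insertion, assuming the DiyLinked structure above (with the head reference named first):

```java
// Insert at the head: only the head reference changes, so the cost
// is O(1) no matter how long the list is.
public void addFirst(T t) {
    Node<T> node = new Node<>(t);
    node.next = first; // the new node points at the old head
    first = node;      // the head reference now points at the new node
    size++;
}
```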
The reason I single out the head node is that inserting or deleting at the tail requires traversing the list to find the last node first, and that traversal costs O(N).
A list like the one above, with only a next reference, is called a singly linked list. To solve the tail-insertion problem, there is the doubly linked list.
A doubly linked list is a list whose nodes carry two pointers: one to the previous element and one to the next.
The code for a doubly linked list generally looks like this:
```java
public class DiyLinked<T> {

    private int size = 0;
    private Node<T> first; // head of the list
    private Node<T> last;  // tail of the list

    private static class Node<T> {
        private T item;
        private Node<T> next; // the following node
        private Node<T> prev; // the preceding node

        public Node(T t) {
            item = t;
        }
    }
}
```
Because a doubly linked list maintains both the first and the last node, inserting or deleting at the tail can be done directly.
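Here is a minimal sketch of O(1) tail insertion for the doubly linked DiyLinked above; it follows the spirit of what Java's LinkedList does internally, but it is not the actual LinkedList code:

```java
// Insert at the tail: the last reference gives direct access to the
// final node, so no traversal is needed.
public void addLast(T t) {
    Node<T> node = new Node<>(t);
    if (last == null) { // empty list: the node is both head and tail
        first = node;
    } else {
        node.prev = last;
        last.next = node;
    }
    last = node;
    size++;
}
```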
Java's LinkedList is exactly such a doubly linked list, and its implementation is quite simple. The linked list is another cornerstone of data structures; many structures are variations built on top of it.
3.1 After-class practice
Implement list reversal: given the head node of a list, flip it and return a new list whose elements are in reverse order.
4. The queue
A queue is a first-in, first-out (FIFO) data structure, and it can be implemented with either an array or a linked list.
A queue behaves just like queuing in daily life: first come, first served.
A queue has only two operations, enqueue and dequeue; elements are usually inserted at the tail and removed from the head.
Queues can be divided into two types:
- Bounded queue: can hold only a limited number of elements.
- Unbounded queue: can hold an unlimited number of elements, as long as memory allows.
Below I'll use the properties of arrays and linked lists to construct each of these two queues.
4.1 Building a queue with an array
As mentioned earlier, arrays have two disadvantages:
- An array has a fixed size; only a dynamic array can make it bigger.
- Removing an element from an array forces other elements to shift.
If we want to build a queue on an array, the first problem can only be solved with a dynamic array. In practice, though, many queues have a fixed capacity, since a computer's memory is not infinite, so it makes good sense to use an array to build a bounded queue.
As for the second problem, when building the queue we reuse array slots by attaching two pointers to the array, one for the head and one for the tail. This is called a circular (ring) array.
A ring-array queue has two operations (see the sketch after this list):
- Dequeue: set the slot at index first to null and advance first by 1.
- Enqueue: before enqueuing, check whether the slot at index end + 1 already holds an element; if it does, the queue is full.
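Below is a minimal bounded-queue sketch backed by a circular array. It is an illustration of the idea above, not a production implementation; for simplicity it tracks a size counter instead of probing the end + 1 slot:

```java
// A bounded FIFO queue over a ring array. first is the index of the
// head element; end is the index of the next free slot.
public class DiyArrayQueue<T> {

    private final Object[] items;
    private int first = 0;
    private int end = 0;
    private int size = 0;

    public DiyArrayQueue(int capacity) {
        items = new Object[capacity];
    }

    public boolean offer(T item) {
        if (size == items.length) {
            return false; // queue is full
        }
        items[end] = item;
        end = (end + 1) % items.length; // wrap around to reuse slots
        size++;
        return true;
    }

    @SuppressWarnings("unchecked")
    public T poll() {
        if (size == 0) {
            return null; // queue is empty
        }
        T item = (T) items[first];
        items[first] = null; // free the slot for reuse
        first = (first + 1) % items.length;
        size--;
        return item;
    }
}
```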
4.2 Building a queue with a linked list
Building a queue from a linked list is simply a matter of reusing the doubly linked list above, which is inherently suited to being a queue because it maintains both the head and the tail node. And since it can keep growing indefinitely, it fits an unbounded queue well.
Java's familiar LinkedList also implements the Queue interface, which means LinkedList supports queue operations out of the box.
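For example, a quick sketch using the standard library:

```java
import java.util.LinkedList;
import java.util.Queue;

Queue<String> queue = new LinkedList<>();
queue.offer("first");             // enqueue at the tail
queue.offer("second");
System.out.println(queue.poll()); // dequeue from the head: prints "first"
```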
Since linked lists have already been covered above and this is very simple, I won't implement it here ~
4.3 After-class practice
Implement a bounded queue with an array.
5. The stack
A stack is a last-in, first-out (LIFO) data structure. For a real-life example, I think the most vivid is a gun magazine: the first bullet loaded is usually the last bullet out.
A stack can also be built from either an array or a linked list, since it is very similar to the queue above.
In the computer world, the stack often serves as a program's call stack. A program sometimes throws a StackOverflow error because the call stack has been exhausted, so a stack usually has a relatively fixed depth; from this point of view, building a stack on an array is a good idea. (Stack depth is also often affected by memory settings.)
A stack inserts elements with push and removes them with pop; the last element pushed is usually the first element popped.
Java also has a data structure named Stack. Although it is called a stack, it offers more than the two stack operations: it also exposes array-like operations, because it inherits from Vector, which implements the List interface.
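Here is a quick example of the two stack operations, using ArrayDeque, which the JDK documentation recommends over Stack for new code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

Deque<Integer> stack = new ArrayDeque<>();
stack.push(1);
stack.push(2);
stack.push(3);
System.out.println(stack.pop());  // prints 3: last in, first out
System.out.println(stack.peek()); // prints 2 without removing it
```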
5.1 After-class exercises
Implement a stack with an array.
6. Conclusion
In conclusion, let’s review the four data structures:
- Array: fast lookup by index, slow insertion and deletion.
- Linked list: fast insertion and deletion, slow lookup.
- Queue: insert at the tail and remove from the head, both O(1), very fast.
- Stack: insert and remove at the same end, both O(1), very fast.
In everyday programming, only the first two are used directly most of the time, and which container you choose depends on your workload:
- When choosing an array, estimate the collection's size and initialize the dynamic array with a suitable capacity to avoid repeated expansion.
- When choosing a linked list, first ask yourself whether your scenario is query-heavy or insert-heavy.
- When using queues and stacks, also check whether the underlying implementation of the library class you use is an array or a linked list. Once what you use is no longer a black box to you, decisions come easily.
After reading this article, readers should have a grasp of the four basic data structures. But whether the container is an array or a linked list, inserting into the middle still carries a large lookup cost. The next chapter introduces a new data structure for this: the tree.
Tree search, insertion, and deletion are all O(logN), making it a structure with strengths in every respect; it solves the efficiency problem of inserting in the middle. I will explain it in depth in the next article, including some AVL tree algorithms not covered in the books below.
Finally, I'd like to ask everyone for a like. If you have any questions, leave a comment and I will reply promptly.
Bibliography:
- Algorithms, 4th Edition
- Data Structures and Algorithm Analysis
- Data Structures and Algorithms Illustrated
- Introduction to Computer Science
Recommended reading:
- Delayed execution and immutability: a systematic explanation of Java Stream data processing
- Reduction, grouping, and partitioning: an in-depth look at Java Stream terminal operations