

This series of notes is based on the Geek Time course "Data Structures and the Beauty of Algorithms".

Contents

  • Skip lists
  • Trees
  • Heaps
  • Graphs

Skip lists

What is a skip list?

To speed up search in a sorted linked list, we build multi-level indexes on top of it: for example, extract one node out of every two nodes into the level above. An extracted level is called an index, or index level, and each index node keeps a down pointer to the corresponding node in the level below. For a linked list of n nodes, roughly log2n - 1 levels of indexes can be built this way. A linked list with multi-level indexes built on top of it is called a skip list.
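To make the structure concrete, here is a minimal skip-list sketch in Java showing the node layout and the search loop; the MAX_LEVEL cap, the sentinel head node, and all names are illustrative choices of mine, not something fixed by the course.

```java
public class SkipList {
    private static final int MAX_LEVEL = 16; // assumed cap on index levels

    private static class Node {
        int value;
        Node[] forwards; // forwards[i] = the next node on level i

        Node(int value, int level) {
            this.value = value;
            this.forwards = new Node[level];
        }
    }

    private final Node head = new Node(Integer.MIN_VALUE, MAX_LEVEL); // sentinel
    private int levelCount = 1; // current number of levels in use

    /** Returns the node holding value, or null if it is absent. */
    public Node find(int value) {
        Node p = head;
        // Start at the top index level; move right while the next node is
        // still smaller than the target, then drop down one level.
        for (int i = levelCount - 1; i >= 0; i--) {
            while (p.forwards[i] != null && p.forwards[i].value < value) {
                p = p.forwards[i];
            }
        }
        Node candidate = p.forwards[0];
        return (candidate != null && candidate.value == value) ? candidate : null;
    }
}
```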

What is the time complexity of a skip list?

  1. Calculate the height of the skip list

If the linked list has n nodes and every two nodes contribute one node to the index level above, the first-level index has about n/2 nodes, the second-level index about n/4, and so on: the k-th level index has about n/(2^k) nodes. Suppose the index has h levels and the top level has 2 nodes; then n/(2^h) = 2, so h = log2n - 1. Counting the original linked list itself, the total height of the skip list is log2n.

  2. Calculate the time complexity of a skip-list query

If we traverse m nodes at each level, the time to query one value in a skip list is O(m * logn). So what is m? Suppose the value we are looking for is x. At level k, after reaching node y we find that x is greater than y and less than the next node z, so we follow y's down pointer to level k-1. Between y and z at level k-1 there are only 3 nodes (counting y and z), so we traverse at most 3 nodes at level k-1, and likewise at most 3 nodes at every level. So m = 3, and querying a value in a skip list takes O(logn).

How to optimize the space complexity of a skip list?

  1. Calculate the total number of index nodes

If the linked list has n nodes and one node out of every two is promoted to the index level above, the index levels contain n/2, n/4, n/8, …, 8, 4, 2 nodes. This geometric series sums to n - 2, so the space complexity of the skip list is O(n).

  2. How to reduce the index space

If instead one node out of every 3 (or 5, and so on) is promoted to the level above, the index levels contain (taking 3 as the example) n/3, n/9, n/27, …, 27, 9, 3, 1 nodes, which sums to roughly n/2. The space complexity is still O(n), but the index takes much less space than when extracting one node out of every two.

Efficient dynamic insertion and deletion

In essence, a skip list is a linked list, so the insertion and deletion operations themselves take O(1). In practice, however, to insert or delete a node we must first find the target position, and it is this search that dominates the cost. In a skip list the search takes O(logn), so the insert and delete operations of a skip list take O(logn) overall.

Dynamic update of the skip-list index

When inserting data into a skip list, we can also insert the new node into some of the index levels. How do we choose which levels? A random function decides: if the random function returns the value K, the node is added to levels 1 through K. This keeps the index balanced in expectation.
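A common way to implement that random function is to flip a coin for each extra level; the 1/2 probability and the MAX_LEVEL cap below are conventional choices rather than something the notes prescribe. A sketch:

```java
import java.util.concurrent.ThreadLocalRandom;

public class LevelGenerator {
    private static final int MAX_LEVEL = 16; // assumed cap, matching the sketch above

    /**
     * Returns K in [1, MAX_LEVEL]. Each additional level is granted with
     * probability 1/2, so about half the nodes reach level 2, a quarter
     * reach level 3, and so on -- which keeps the index balanced on average.
     */
    public static int randomLevel() {
        int level = 1;
        while (level < MAX_LEVEL && ThreadLocalRandom.current().nextBoolean()) {
            level++;
        }
        return level;
    }
}
```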

Binary Tree Basics (Part 1)

Trees

Common concepts of trees

Root node, leaf node, parent node, child node, sibling node; the height, depth, and level of a node; the height of a tree.

Definitions

  • Node: each element in a tree is called a node
  • Parent-child relationship: the link between two adjacent nodes is called a parent-child relationship
  • Root node: a node that has no parent
  • Leaf node: a node that has no children
  • Parent node: the node that points to a child node
  • Child node: a node pointed to by its parent
  • Sibling nodes: nodes that share the same parent
  • Node height: the number of edges on the longest path from the node down to a leaf
  • Node depth: the number of edges on the path from the root down to the node
  • Node level: the node's depth + 1 (the root is at level 1)
  • Tree height: equal to the height of the root node

Binary tree

Concepts

  • What is a binary tree? A tree in which each node has at most two children: a left child and a right child.
  • What is a full binary tree? A binary tree in which every node other than the leaves has both a left and a right child.
  • What is a complete binary tree? A binary tree in which leaves appear only in the bottom two levels, the leaves of the last level are packed to the left, and every level except the last holds the maximum number of nodes.

Storage of binary trees

Linked storage

Each node consists of three fields: one stores the data, and the other two are pointers to the left and right child nodes. Holding only the root node, we can string the whole tree together by following these pointers. This storage method is the most common, and most binary tree code is implemented this way.

Sequential storage

For a complete binary tree, if node X is stored at index i in the array, its left child is stored at index 2i, its right child at index 2i+1, and conversely its parent is at index i/2. Note that the root node is stored at index 1. An array is the most memory-efficient way to store a complete binary tree.
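A minimal sketch of those index rules in Java, with the root at index 1 as the text assumes (class and method names are mine):

```java
public class ArrayBinaryTree {
    private final int[] nodes; // nodes[0] unused; the root lives at index 1

    public ArrayBinaryTree(int capacity) {
        this.nodes = new int[capacity + 1];
    }

    int leftChild(int i)  { return 2 * i; }     // may point past the last node for leaves
    int rightChild(int i) { return 2 * i + 1; }
    int parent(int i)     { return i / 2; }     // parent(1) == 0 means "no parent"
}
```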

Traversal of binary trees

  • Pre-order traversal: for any node in the tree, print the node first, then its left subtree, and finally its right subtree.
  • In-order traversal: for any node in the tree, print its left subtree first, then the node itself, and finally its right subtree.
  • Post-order traversal: for any node in the tree, print its left subtree first, then its right subtree, and finally the node itself.
```
preOrder(r)  = print r -> preOrder(r->left) -> preOrder(r->right)
inOrder(r)   = inOrder(r->left) -> print r -> inOrder(r->right)
postOrder(r) = postOrder(r->left) -> postOrder(r->right) -> print r
```
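As a runnable Java version of the same three recursions (the Node class is the minimal linked representation from the linked-storage section; printing stands in for "visiting"):

```java
public class BinaryTreeTraversal {
    static class Node {
        int value;
        Node left, right;
        Node(int value) { this.value = value; }
    }

    static void preOrder(Node r) {
        if (r == null) return;
        System.out.println(r.value); // visit the node first
        preOrder(r.left);
        preOrder(r.right);
    }

    static void inOrder(Node r) {
        if (r == null) return;
        inOrder(r.left);
        System.out.println(r.value); // visit between the two subtrees
        inOrder(r.right);
    }

    static void postOrder(Node r) {
        if (r == null) return;
        postOrder(r.left);
        postOrder(r.right);
        System.out.println(r.value); // visit the node last
    }
}
```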

Time complexity: in all three traversal orders, each node is visited at most twice, so the time complexity is O(n).

Binary Tree Basics (Part 2)

Common operations on binary trees

Search by value; pre-order, in-order, and post-order traversal; level-order traversal; computing height and depth; finding the smallest node in a node's right subtree; finding the largest node in a node's left subtree; finding a node's predecessor and successor; left and right rotations; insertion and deletion handled according to the affected node's children; rotations driven by the parent and uncle nodes.

Handling duplicate values in a binary search tree:

  1. Store all records with the same value on one node, using a linked list or a dynamically resized array.
  2. Adopt one uniform convention for insert, search, and delete. For example, on insertion an equal value always goes into the right subtree of the existing node; search and delete then keep descending into the right subtree until a leaf is reached, so every matching node is found (as sketched below).

Advantages of binary search trees over hash tables

As discussed in the hash table section, the insert, delete, and find operations of a hash table run in O(1) time, which is very efficient. A balanced binary search tree performs these operations in O(logn), which seems to offer no advantage over a hash table. So why do we still use binary search trees? I think there are several reasons:

  1. Data in a hash table is stored unordered; to output ordered data, you must sort it first. With a binary search tree, an in-order traversal outputs the sorted sequence in O(n) time.

  2. Resizing a hash table is time-consuming, and its performance becomes unstable when hash collisions pile up. A plain binary search tree is also unstable, but in engineering practice the most commonly used balanced binary search trees perform very stably, with time complexity fixed at O(logn).

  3. Although hash table operations such as lookup are constant time in general, the constant is not necessarily smaller than logn because of hash collisions, so the actual lookup is not necessarily faster than O(logn). Adding the cost of the hash function itself, a hash table is not necessarily more efficient than a balanced binary search tree.

  4. A hash table is more complex to build than a binary search tree; there is much to consider, such as hash function design, collision resolution, expansion, and shrinking. A balanced binary search tree only has to deal with the problem of balance, and the solutions to that problem are mature and fixed.

  5. Finally, to avoid too many hash collisions, the load factor of a hash table cannot be too large, especially when the hash table uses open addressing, so some storage space is inevitably wasted.

Taken together, balanced binary search trees are superior to hash tables in some respects, so the two coexist rather than conflict. In actual development, we choose between them according to the specific requirements.

Red-Black Trees (Part 1)

The binary search tree is a common binary tree that supports fast insert, delete, and search operations. The time of each operation is proportional to the height of the tree; in the ideal case the time complexity is O(logn). Many books use the red-black tree as the example when discussing balanced binary search trees, and in engineering, wherever a balanced binary search tree is called for, a red-black tree is usually what is used.

What is a “balanced binary search tree”

  • Definition: The height difference between the left and right subtrees of any node in a binary tree cannot be greater than 1.

So: complete binary trees and full binary trees are balanced binary trees, and incomplete binary trees may also be balanced binary trees.

  • A balanced binary search tree satisfies both the balance definition above and the ordering property of a binary search tree.

  • Balanced binary search trees were invented to keep the time complexity of an ordinary binary search tree from degrading under frequent insertions, deletions, and other dynamic updates.

Therefore, the "balance" in a balanced binary search tree means keeping the whole tree reasonably "symmetric": no tall left subtree next to a short right subtree. This keeps the height of the tree relatively low, which makes the insertion, deletion, and lookup operations more efficient.

  • If we design a new balanced binary search tree that does not meet the strict definition, it can still be considered a qualified balanced binary search tree, as long as the height of the tree does not greatly exceed log2n (for example, the height remains logarithmic in n).

How to define a “red-black tree”

There are many balanced binary search trees, such as the Splay Tree and the Treap, but when balanced binary search trees come up, what we hear about is almost always the red-black tree. It appears even more often than the phrase "balanced binary search tree" itself; in some contexts, "balanced binary search tree" means red-black tree by default.

A red-black tree (R-B Tree for short) has the following properties:

  • It is a balanced binary search tree that is not strictly balanced.
  • Every node is marked either black or red.
  • The root node is black.
  • Every leaf node is a black NIL node; that is, leaf nodes store no data.
  • No two adjacent nodes may both be red; red nodes are separated by black nodes.
  • For every node, all paths from that node down to its reachable leaf nodes contain the same number of black nodes.

Many binary search tree operations are proportional to the height of the tree. An extremely balanced binary tree (a full or complete binary tree) has height about log2n, so to show that a red-black tree is approximately balanced, we only need to check whether the height of a red-black tree stays reasonably close to log2n.

Height analysis of red-black trees

  1. First, remove the red nodes from the red-black tree. The resulting tree of black nodes is no taller than a complete binary tree with the same number of nodes, so the height of this "black tree" is at most log2n.
  2. In a red-black tree, red nodes cannot be adjacent; every red node is separated from the next red node by at least one black node.
  3. A path in the red-black tree contains at most log2n black nodes, so after the red nodes are added back, the longest path is at most 2log2n; that is, the height of a red-black tree is approximately 2log2n. Compared with a strictly height-balanced AVL tree of height log2n, the height merely doubles, so the performance loss is not large.

Red-black trees are popular in engineering for the following reasons:

  1. Treap and Splay Tree operate efficiently in most cases, but they cannot avoid degraded time complexity in extreme cases. Although such cases occur infrequently, these trees are unsuitable for scenarios that are sensitive to the latency of a single operation.
  2. The AVL tree is strictly height-balanced, so its lookups are very efficient. But there are trade-offs: to maintain such strict balance, the AVL tree has to pay a higher maintenance cost, so it is more expensive for data sets with frequent insertions and deletions.
  3. Red-black trees are only approximately balanced, not strictly balanced, so the cost of maintaining balance is lower than for AVL trees.

Therefore, the insert, delete, and search operations of a red-black tree all have relatively stable performance. For engineering applications, the resulting behavior is controllable and predictable.

Red-Black Trees (Part 2)

My red-black tree learning path

Heaps

How to understand “heap”

  1. A heap is a complete binary tree;

A complete binary tree requires every level except the last to be completely filled, with the nodes of the last level packed to the left.

  2. The value of each node in the heap must be greater than or equal to (or less than or equal to) the value of every node in its subtree; equivalently, each node's value is greater than or equal to (or less than or equal to) the values of its left and right children.

  3. A heap in which every node's value is greater than or equal to every node in its subtree is called a "big top heap" (max-heap); a heap in which every node's value is less than or equal to every node in its subtree is called a "small top heap" (min-heap).

How to implement “heap”

To implement a heap, you need to know what operations the heap supports and how to store a heap.

How to store a heap:

A heap, being a complete binary tree, is best stored in an array. Array storage is very space-efficient because there is no need to store pointers to the left and right children: we can find a node's left child, right child, and parent purely from its array index.

Inserting an element into the heap

After inserting an element, the heap must continue to satisfy its two properties.

  • A newly inserted element is placed at the end of the heap; if it violates the heap property, the heap must be adjusted until the property holds again. This adjustment process is called heapification.
  • There are two kinds of heapification: bottom-up and top-down.
  • How heapification works:

Heapification is very simple: follow a path of nodes upward or downward, comparing parent and child along the way and swapping them when they are out of order.
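A minimal big top heap sketch in Java with bottom-up heapification on insert, using the 1-based array layout described above (the names and fixed capacity are illustrative):

```java
public class MaxHeap {
    private final int[] a;    // a[0] unused; the root lives at index 1
    private final int capacity;
    private int count;        // number of elements currently in the heap

    public MaxHeap(int capacity) {
        this.capacity = capacity;
        this.a = new int[capacity + 1];
    }

    /** Appends value at the end, then sifts it up until heap order holds. */
    public void insert(int value) {
        if (count >= capacity) throw new IllegalStateException("heap is full");
        a[++count] = value;
        int i = count;
        while (i / 2 >= 1 && a[i] > a[i / 2]) { // bigger than its parent: swap up
            swap(i, i / 2);
            i = i / 2;
        }
    }

    void swap(int x, int y) { int t = a[x]; a[x] = a[y]; a[y] = t; }
}
```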

Deleting the heap-top element

Move the last node onto the top of the heap, then apply the same parent-child comparisons: whenever a parent and child violate the required order, swap them, and repeat the process until the order holds everywhere. This is the top-down heapification method.
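Continuing the MaxHeap sketch above, removing the heap top with top-down heapification might look like this:

```java
    /** Removes and returns the maximum element (the heap top). */
    public int removeMax() {
        if (count == 0) throw new IllegalStateException("heap is empty");
        int max = a[1];
        a[1] = a[count--]; // move the last node onto the top
        siftDown(1);
        return max;
    }

    /** Top-down heapify: swap with the larger child until heap order holds. */
    private void siftDown(int i) {
        while (true) {
            int largest = i;
            if (2 * i <= count && a[2 * i] > a[largest]) largest = 2 * i;
            if (2 * i + 1 <= count && a[2 * i + 1] > a[largest]) largest = 2 * i + 1;
            if (largest == i) break;
            swap(i, largest);
            i = largest;
        }
    }
```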

A complete binary tree with n nodes has height no greater than log2n. Heapification swaps nodes along a path, so its time complexity is proportional to the height of the tree, that is, O(logn). The main logic of both inserting data and removing the heap-top element is heapification, so both operations take O(logn).

How to sort based on the heap

  • We have already seen bubble sort, insertion sort, and selection sort at O(n^2); merge sort and quick sort at O(nlogn); and the linear-time sorts.

  • The sorting algorithm built on the heap data structure is called heap sort. Its time complexity is a very stable O(nlogn), and it is an in-place sorting algorithm.

Heap sort process

It basically breaks down into two steps: heap building and sorting

  • Building the heap:

    1. Start by turning the array into a heap in place. "In place": operating on the original array without using a second array.
    2. There are two ways to build the heap:
      • First: use the idea of inserting elements into a heap. Although the array contains n items, assume the heap initially contains only the item at index 1. Then call insert for the items at indexes 2 through n, organizing the array of n items into a heap.
      • Second: process the array from back to front, heapifying each item from top to bottom.

    The second way works in the opposite direction from the first. The first processes the array from front to back, and each item is heapified bottom-up as it is inserted into the heap. The second processes the array from back to front and heapifies top-down; it only needs to heapify the nodes with indexes n/2 down to 1, because the nodes with indexes n/2 + 1 through n are leaves and need no heapification.

    3. Heap-building time complexity

    If heapifying one node takes O(logn), is the total heapification time for n/2 nodes simply O(nlogn)? That answer is not wrong, but it is not tight. In fact, the heap-building step of heap sort takes O(n) time.

  • Sorting: after the heap is built, the data in the array is organized according to the big top heap property. The first element of the array is the heap top, that is, the largest element. Swap it with the last element, which puts the largest element at index n. This step is similar to "removing the heap-top element": after the top is removed, put the element at index n on the top, then re-heapify the remaining n-1 elements top-down. Once heapified, take the new heap top, place it at index n-1, and repeat the process until only the element at index 1 remains in the heap; the sorting is then done. (A sketch combining both steps follows.)
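A compact heap-sort sketch in Java combining the back-to-front build with the sort step; it sorts a[1..n] in place with the 1-based layout used above (a standalone illustration, not the course's reference code):

```java
public class HeapSort {
    /** Sorts a[1..n] in ascending order in place; a[0] is unused. */
    public static void sort(int[] a, int n) {
        // Build: heapify nodes n/2 down to 1 top-down; n/2+1 .. n are leaves.
        for (int i = n / 2; i >= 1; i--) {
            siftDown(a, n, i);
        }
        // Sort: repeatedly move the current maximum to the end of the heap.
        int k = n;
        while (k > 1) {
            swap(a, 1, k--);    // heap top goes to its final position
            siftDown(a, k, 1);  // re-heapify the remaining k elements
        }
    }

    private static void siftDown(int[] a, int count, int i) {
        while (true) {
            int largest = i;
            if (2 * i <= count && a[2 * i] > a[largest]) largest = 2 * i;
            if (2 * i + 1 <= count && a[2 * i + 1] > a[largest]) largest = 2 * i + 1;
            if (largest == i) return;
            swap(a, i, largest);
            i = largest;
        }
    }

    private static void swap(int[] a, int x, int y) { int t = a[x]; a[x] = a[y]; a[y] = t; }
}
```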

Time, space complexity, and stability analysis

Heap sort is an in-place sorting algorithm, needing only a negligible amount of temporary storage. It consists of two operations, heap building and sorting: building takes O(n) and sorting takes O(nlogn), so the time complexity of heap sort is O(nlogn). Heap sort is not a stable sorting algorithm, because swapping the heap top with the last element can change the relative order of equal elements.

O(n) derivation of heap building time complexity

Because leaf nodes need no heapification, the nodes that must be heapified start at the second-to-last level. When a node is heapified, the number of comparisons and swaps is proportional to the height k of that node.

We just need to sum the heights of all non-leaf nodes to get the heap-building time. Counting level by level with tree height h = log2n, there is 1 node of height h, 2 nodes of height h-1, 4 nodes of height h-2, and so on, which gives the sum:

S1 = 1*h + 2*(h-1) + 4*(h-2) + … + 2^(h-1)*1

Summing this directly is a little tricky, but there is a trick most of us learned in high school: multiply both sides of the equation by 2 to get a second formula, S2, misalign S2 against S1, and subtract S1 from S2 to get S:

S2 = 2*h + 4*(h-1) + 8*(h-2) + … + 2^h*1

S = S2 - S1 = -h + (2 + 2^2 + … + 2^(h-1)) + 2^h = 2^(h+1) - h - 2

The middle part of S is a geometric series, so it can be computed with the geometric summation formula. Since h = log2n, substituting into S gives S = 2n - log2n - 2, that is, S = O(n), so the heap-building time complexity is O(n).

Applications of the heap

Priority queue

  • In a priority queue, data is dequeued by priority rather than first in, first out: the item with the highest priority leaves the queue first.
  • There are many ways to implement a priority queue, but a heap is the most direct and efficient, because a heap and a priority queue are very similar. A heap can be regarded as a priority queue; much of the time the difference is only conceptual. Adding an element to the priority queue is equivalent to adding an element to the heap, and removing the highest-priority element is equivalent to removing the heap top.
  • Priority queues are used in many places, such as Huffman coding, shortest paths in graphs, and minimum spanning tree algorithms.

Priority queue application 1: Merge ordered small files

Suppose there are 100 small files, each 100MB in size, each storing sorted strings. We need to merge these 100 small files into one sorted large file. Idea:

  1. The overall idea resembles the merge function of merge sort: take the first string from each of the 100 files, put the strings in an array, compare them, move the smallest string into the large file, and delete it from the array.
  2. Supposing the smallest string came from the file 13.txt, take the next string from that file, put it into the array, compare again, move the new smallest string into the merged large file, and delete it from the array. Repeat until all the data in every file has been moved into the large file.
  3. Using an array to hold the strings taken from the small files is inefficient: each time we fetch the smallest string, we have to scan the whole array.
  4. Instead, we can use a priority queue, that is, a heap. Put the strings from the small files into a small top heap; the heap top, the head of the priority queue, is then the smallest string.
  5. Move it into the large file, take the next string from the same small file, put it into the heap, and repeat the process. (See the sketch below.)

Removing the heap top and inserting data into the heap each take O(logn), where n is the number of items in the heap, here 100.
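A sketch of the merge using java.util.PriorityQueue as the small top heap; in-memory sorted lists stand in for the 100 files here, and the Entry record and all names are my own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class SortedSourceMerger {
    /** One entry per source: the current string plus where it came from. */
    private record Entry(String value, int source, int index) {}

    /** Merges k sorted lists into one sorted list (stand-in for the small files). */
    public static List<String> merge(List<List<String>> sources) {
        PriorityQueue<Entry> heap =
                new PriorityQueue<>((a, b) -> a.value().compareTo(b.value()));
        // Seed the heap with the first string of every source.
        for (int s = 0; s < sources.size(); s++) {
            if (!sources.get(s).isEmpty()) {
                heap.add(new Entry(sources.get(s).get(0), s, 0));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Entry smallest = heap.poll();      // O(log k) removal of the minimum
            merged.add(smallest.value());
            List<String> src = sources.get(smallest.source());
            int next = smallest.index() + 1;   // refill from the same source
            if (next < src.size()) {
                heap.add(new Entry(src.get(next), smallest.source(), next));
            }
        }
        return merged;
    }
}
```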

Priority queue application 2: High-performance timer

Assume there is a timer that maintains many scheduled tasks.

  1. Scanning the task list every second is inefficient. First, a task's scheduled execution time may be far in the future, so a large number of scans accomplish nothing. Second, every scan walks the whole task list, which takes time when the list is large.
  2. For such cases, a priority queue can be used. Tasks are stored in the priority queue ordered by execution time, with the first task to execute at the head of the queue (a small top heap).
  3. The timer then does not need to scan the task list every second. It takes the execution time of the head task and subtracts the current time to obtain an interval T, and can wait for T seconds.
  4. When T seconds pass, the timer takes the head task of the priority queue and executes it, then computes the new interval from the execution time of the new head task and the current time.
  5. This way, the timer neither polls every second nor traverses the entire task list, which improves performance. (A sketch follows.)
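A minimal single-threaded sketch of this timer idea with a PriorityQueue ordered by execution time (the Task shape and the blocking loop are simplifications of my own; a production timer would need concurrency handling):

```java
import java.util.PriorityQueue;

public class SimpleTimer {
    private record Task(long executeAtMillis, Runnable action) {}

    private final PriorityQueue<Task> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.executeAtMillis(), b.executeAtMillis()));

    public void schedule(long executeAtMillis, Runnable action) {
        queue.add(new Task(executeAtMillis, action));
    }

    /** Runs tasks in time order, sleeping exactly until the next one is due. */
    public void runLoop() throws InterruptedException {
        while (!queue.isEmpty()) {
            long waitMillis = queue.peek().executeAtMillis() - System.currentTimeMillis();
            if (waitMillis > 0) {
                Thread.sleep(waitMillis); // wait the interval T instead of polling
            }
            queue.poll().action().run();
        }
    }
}
```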

Using the heap to find the Top K

Top K problems can be divided into two categories:

  1. For static data, maintain a small top heap of size K, iterate through the array, and compare each item with the heap-top element. If the item is larger than the heap top, remove the heap top and insert the item into the heap; if it is smaller, skip it and continue iterating. After the whole array has been traversed, the heap holds the top K items. Traversing the data takes O(n) and one heapification takes O(logK), so the worst case, where all n elements enter the heap once, is O(nlogK).

  2. Computing the Top K of dynamic data is the real-time Top K problem. The data set supports two operations: adding an item, and asking for the current top K items. We can keep a small top heap of size K at all times; whenever an item is added to the collection, compare it with the heap-top element. If it is larger than the heap top, remove the heap top and insert it into the heap; if it is smaller, do nothing. Then whenever the current top K items are requested, they can be returned immediately. (See the sketch below.)
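A sketch of the static case using a PriorityQueue (Java's default PriorityQueue is a small top heap, which is what this problem needs; the method name is mine):

```java
import java.util.PriorityQueue;

public class TopK {
    /** Returns a heap holding the k largest values in data. */
    public static PriorityQueue<Integer> topK(int[] data, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // small top heap
        for (int value : data) {
            if (heap.size() < k) {
                heap.add(value);                 // fill the heap to size k first
            } else if (value > heap.peek()) {    // beats the smallest of the top k
                heap.poll();
                heap.add(value);
            }
        }
        return heap;
    }
}
```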

Use the heap to find the median

  1. For a static data set, the median is fixed: sort first, and the element at position n/2 is the median.
  2. For a dynamic data set, sorting on every query is too expensive, but with the help of heaps, the median can be maintained very efficiently without sorting.

Implementation idea:

  1. Maintain two heaps: a big top heap storing the first (smaller) half of the data and a small top heap storing the second (larger) half, with every value in the small top heap no smaller than every value in the big top heap.
  2. That is, if there are n items (n even) sorted ascending, the first n/2 items are stored in the big top heap and the last n/2 in the small top heap. The top element of the big top heap is then the median.
  3. If a newly added item is less than or equal to the top element of the big top heap, insert it into the big top heap; otherwise, insert it into the small top heap.
  4. When the sizes of the two heaps no longer match the convention above, keep moving top elements from one heap to the other until the convention holds again.

Therefore, two heaps implement finding the median of a dynamic data set: inserting an item takes O(logn) because it involves heapification, while finding the median takes O(1), since we only need to return the top element of the big top heap. (A sketch follows.)
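A sketch of the two-heap median in Java; the rebalancing rule (the big top heap may hold at most one extra element) and the even-count behavior of returning the average are illustrative choices:

```java
import java.util.Collections;
import java.util.PriorityQueue;

public class RunningMedian {
    // Lower half as a big top heap, upper half as a small top heap.
    private final PriorityQueue<Integer> lower =
            new PriorityQueue<>(Collections.reverseOrder());
    private final PriorityQueue<Integer> upper = new PriorityQueue<>();

    public void add(int value) {
        if (lower.isEmpty() || value <= lower.peek()) {
            lower.add(value);
        } else {
            upper.add(value);
        }
        // Keep lower.size() equal to upper.size() or upper.size() + 1.
        if (lower.size() > upper.size() + 1) {
            upper.add(lower.poll());
        } else if (upper.size() > lower.size()) {
            lower.add(upper.poll());
        }
    }

    /** O(1): for odd counts the median is the big top heap's top element. */
    public double median() {
        if (lower.isEmpty()) throw new IllegalStateException("no data");
        if (lower.size() == upper.size()) {
            return (lower.peek() + upper.peek()) / 2.0;
        }
        return lower.peek();
    }
}
```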

Graphs

How to Understand “Graph”

  • Like trees, graphs are nonlinear data structures, but graphs are a more complex nonlinear structure than trees.
  • Elements in a tree are called nodes; elements in a graph are called vertices.
  • A vertex can be connected to any other vertex; such a connection is called an edge. The number of edges attached to a vertex is called the vertex's degree.
  • Graphs divide into directed graphs and undirected graphs; the edges of a directed graph have directions.
  • In a directed graph, degree splits into in-degree and out-degree. The in-degree of a vertex is how many edges point to that vertex; the out-degree is how many edges point from this vertex to other vertices.
  • Weighted graph: in a weighted graph, each edge has a weight.

Adjacency matrix storage method

The most intuitive way to store a graph is the Adjacency Matrix, which is backed by a two-dimensional array. For an undirected graph, if there is an edge between vertex i and vertex j, we set both A[i][j] and A[j][i] to 1. For a directed graph, if there is an edge pointing from vertex i to vertex j, we set A[i][j] to 1; likewise, if there is an edge pointing from vertex j to vertex i, we set A[j][i] to 1. For a weighted graph, the array stores the corresponding weights instead.

Using adjacency matrix to represent a graph is simple and intuitive, but it wastes storage space.

  • For an undirected graph's two-dimensional array, if we split it along the diagonal, only the half above (or below) the diagonal is needed; the other half is wasted.
  • The adjacency matrix wastes even more space for a sparse graph, one with many vertices but few edges per vertex.

Advantages of using adjacency matrix storage:

  1. The storage scheme is simple and direct, and because it is based on arrays, reading off the relationship between two vertices is very efficient.
  2. It also makes computation convenient, because many graph operations can be turned into matrix operations. For example, the Floyd-Warshall algorithm, which comes up when solving shortest-path problems, obtains its result by repeatedly multiplying the matrix in a loop.

Adjacency list storage method

The Adjacency List solves the problem that adjacency-matrix storage wastes memory.

Each vertex corresponds to a linked list that stores the other vertices connected to it. For a directed graph, each vertex's linked list stores the vertices that the vertex's outgoing edges point to; for an undirected graph, each vertex's linked list stores every vertex that shares an edge with it.

  • Adjacency-matrix storage wastes space but saves time; adjacency-list storage is the opposite.

For example, to determine whether there is an edge from vertex 2 to vertex 4, we have to traverse the linked list of vertex 2 and check whether vertex 4 appears in it. Linked lists are also not cache-friendly to traverse.

  • We can replace the linked lists of the adjacency list with balanced binary search trees; in practical development, red-black trees are a natural choice. That makes it fast to check whether there is an edge between two vertices. (A sketch of the basic linked-list version follows.)
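A minimal adjacency-list sketch in Java for a directed graph, using one linked list per vertex as in the basic scheme (names are mine):

```java
import java.util.LinkedList;

public class Graph {
    private final LinkedList<Integer>[] adj; // adj[i] = vertices that i points to

    @SuppressWarnings("unchecked")
    public Graph(int vertexCount) {
        this.adj = new LinkedList[vertexCount];
        for (int i = 0; i < vertexCount; i++) {
            adj[i] = new LinkedList<>();
        }
    }

    /** Adds a directed edge s -> t. */
    public void addEdge(int s, int t) {
        adj[s].add(t);
    }

    /** O(degree(s)): walk s's list to test whether the edge s -> t exists. */
    public boolean hasEdge(int s, int t) {
        return adj[s].contains(t);
    }
}
```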

Applications of graphs

How do social networks such as Weibo and WeChat store friend relationships?

  1. Weibo and WeChat correspond to two kinds of graph: the former is a directed graph (following is one-way), the latter an undirected graph (friendship is mutual).
  2. Data structures serve algorithms, so the storage method depends on the operations to support. For Weibo's user relationships, we need to:
    • Determine whether user A follows user B;
    • Determine whether user A is a follower of user B;
    • Let user A follow user B;
    • Let user A unfollow user B;
    • Page through a user's follower list sorted by username;
    • Page through a user's following list sorted by username.
  3. Because a social network is a sparse graph, an adjacency matrix would waste storage space, so an adjacency list is used.
  4. But one adjacency list is not enough to store this directed graph: it is easy to find whom a user follows, but very hard to find who follows the user.
  5. Therefore, an inverse adjacency list is also needed. The adjacency list stores each user's "following" relationships, while the inverse adjacency list stores each user's "followed-by" relationships.
  6. Concretely, in the adjacency list, each vertex's linked list stores the vertices that this vertex points to; in the inverse adjacency list, each vertex's linked list stores the vertices that point to this vertex.