What is the data structure?

Program = data structure + algorithm

A program is composed of data structures and algorithms. Of course, data structures and algorithms are complementary and cannot be viewed independently, but this article will focus on those commonly used data structures.

What is the data structure?

The first thing you have to know is what is the data? Data is the symbolic representation of objective affairs. In computer science, it refers to all the symbols that can be input into a computer and processed by a computer program. Then why add the word “structure” **?

Data elements are the basic units of data, and in any problem, data elements do not exist independently, but there is always some relationship between them, which we call structure.

Thus, we have the following definition:

Data structures are the way computers store and organize data. A data structure is a collection of data elements that have one or more specific relationships with each other. More often than not, a carefully chosen data structure can lead to higher running or storage efficiency. Data structures are often associated with efficient retrieval algorithms and indexing techniques.

Simply put, data structures are ways of organizing, managing, and storing data. Although all data can be mixed in theory, or mix, or hungry, literally storage, but the computer is the pursuit of efficient, if we can know the structure of data and find a more suitable for the current problem scenario data structure, the relationship between the data in storage and computing time can be a more efficient use of adaptation algorithm, The efficiency of the program will certainly improve.

The four commonly used data structures are:

  • Set: Only relations belonging to the same set, no other relations
  • Linear structure: There is a one-to-one relationship between data elements in a structure
  • Tree structure: There is a one-to-many relationship between data elements in a structure
  • Graph or network: graph or network

What are logical and storage structures?

The logical relationship between data elements is called logical structure, that is, we define a mathematical description of the operation object. But we also have to know how to represent it in a computer. The representation of a data structure in a computer (also known as an image) is called the physical structure of the data, also known as the storage structure.

The previous relationship of data elements is represented in two different ways in the computer: sequential and non-sequential, and thus results in two different storage structures: Sequential storage structure and chain storage structure, such as sequential storage structure, we want to express the complex number z1 = 3.0-2.3i, can directly use the relative position of the elements in memory to express the logical relationship between the data elements:

Z1 = 3.0-2.3i. The next one is 100, which is an address. According to the address, find the real data -2.3i:

(bit)

The smallest unit of information in a computer is a binary digit, called a bit. So the bottom layer of the computer is all kinds of transistors, circuit boards, so no matter what kind of data it is, even pictures, sounds, it’s zeros and ones at the bottom layer. If you have eight circuits, then each circuit has its own closed state, there are eight twos multiplied by each other, two to the eighth to the eighth, That’s 256 different signals.

But in general we need to represent a negative number, so the highest bit is the sign bit, 0 is the positive number, 1 is the negative number, so the maximum number of 8 bits is 01111111, which is 127.

It is worth noting that in the world of computers, there are many concepts of source code, inverse code and complement:

  • Source code: the first bit represents the symbol and the remaining bits represent the value
  • Inverse: the inverse of the complement of a positive number is itself, the inverse of a negative number is the sign bit remains unchanged, and the rest of the bits are reversed.
  • Complement: The complement of a positive number is itself, and the complement of a negative number is + 1 based on its inverse

Why have the original code and complement code?

We know that addition and subtraction is a high frequency operation, people can be very intuitive to see the plus and minus sign, immediately can be calculated, but the computer if the distinction between different symbols, so the addition and subtraction will be more complex, such as positive + positive, positive – positive, positive – negative, negative + negative… And so on. Therefore, some people want to use the same arithmetic unit (plus arithmetic unit), solve all the addition and subtraction calculation, can reduce a lot of complex circuits, as well as the overhead of various symbol conversion, calculation is also more efficient.

As we can see, the result of the operation of the following negative numbers also conforms to the rule of complement:

00100011 35 + 11011101-35 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - 00000000-0Copy the code
00100011 35 + 11011011-37 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - 11111110-2Copy the code

Of course, if the result is more than the number of digits can represent, it is an overflow, meaning that more digits are needed to represent it correctly.

In general, bitwise operation can be used as far as possible, because it is more efficient, common bit operation:

  • ~: Reverse by bit
  • &: Press as and to calculate
  • |: bit or operation
  • ^: XOR by bit
  • <<: move left with sign, for example35 (00100011).And move one to the left is70 (01000110)..- 35 (11011101).We move one to the left- 70 (10111010).
  • >>: move right with sign, for example35 (00100011).To the right is17 (00010001)..- 35 (11011101).We move one to the left18 (11101110).
  • <<<: unsigned left shift, for example35 (00100011).And move one to the left is70 (01000110).
  • >>>: unsigned right shift, for example- 35 (11011101).To the right is110 (01101110).
  • x ^= y; y ^= x; x ^= y;Exchange:
  • s &= ~(1 << k): the firstkPosition 0

To tell where to use the classic bit operation, then to count the Bloom filter, you can refer to aphysia.cn/archives/ca…

What is a Bloom filter?

The Bloom Filter, proposed by Burton Howard Bloom in 1970, is actually a long binary vector and a series of random hash mapping functions (in plain English, binary arrays store data features). It can be used to determine whether an element exists in the set. Its advantages are high query efficiency and small space, but its disadvantages are certain errors, and when we want to remove elements, they may affect each other.

That is, when an element is added to the set, it is mapped to k points in the array by multiple hash functions, set to 1.

The point is that multiple hash functions can hash data into different bits, and only when all of these bits are 1 can we determine that the data already exists

So if you have three hash functions, then different elements will use three hash functions to hash into three places.

Let’s say we have another triple, and when we hash it, we hash it to the same position, where all the bits are 1, and we say that the triple is already in there.

Is there a possibility of miscarriage of justice? This is possible, for example, now there is only Zhang SAN, Li Si, Wang Wu, CAI Ba, the hash mapping value is as follows:

But unfortunately, the three hash functions of the hash function have just been changed to 1 by the hash of other elements, so it can be determined that it already exists, but in fact, the hash did not exist before.

The above situation is misjudgment, bloom filter will inevitably misjudgment. But one of the nice things about it is that the Bloom filter, which determines that elements exist, may not exist, but determines that elements that don’t exist, must not exist. Because the absence of the judgment means that at least one hash will not work.

It’s also because there are multiple elements that might hash together, but one of the data gets kicked out of the set, and we want to set its mapped bit to zero, which is essentially deleting that data. And when that happens, it’s going to affect other elements, maybe it’s going to set the mapping bits of other elements to 0. This is why bloom filters cannot be deleted.

An array of

Linear representation The most common and simplest kind of data structure, a linear representation of a finite sequence of N data elements, has the following characteristics:

  • There is a unique first data element
  • There is a unique data element called the last
  • Every element in the set except the first has a precursor
  • Every data element in the collection except the last element has a successor element

Linear tables include the following:

  • Array: query/update fast, find/delete slow
  • The list
  • The queue
  • The stack

An array is a type of linear table. The sequential representation of a linear table refers to storing the data elements of a linear table sequentially in a set of storage cells with contiguous addresses:

Represented in Java as:

int[] nums = new int[100];
int[] nums = {1.2.3.4.5};

Object[] Objects = new Object[100];
Copy the code

Expressed in C++ as:

int nums[100];
Copy the code

An array is a linear structure. It is generally a continuous space at the bottom layer, storing the same type of data. Due to the continuous compact structure and natural index support, the data query efficiency is high:

If we know that the first value of array A is address 296 and the data type in it is 2 units, then we can obtain the time complexity of O(1) if we expect the fifth value: 296+ (5-1) *2 = 304.

The essence of update is also search, first find the element, you can start to update:

But if you want to insert data, you need to move the following data, such as the following array, insert element 6, and worst of all, move all elements, O(n) time.

While deleting an element requires moving the following data to the front, the worst time complexity is also O(n):

Java code to implement array add, delete, change, check:

package datastruction;

import java.util.Arrays;

public class MyArray {
    private int[] data;

    private int elementCount;

    private int length;

    public MyArray(int max) {
        length = max;
        data = new int[max];
        elementCount = 0;
    }

    public void add(int value) {
        if (elementCount == length) {
            length = 2 * length;
            data = Arrays.copyOf(data, length);
        }
        data[elementCount] = value;
        elementCount++;
    }

    public int find(int searchKey) {
        int i;
        for (i = 0; i < elementCount; i++) {
            if (data[i] == searchKey)
                break;
        }
        if (i == elementCount) {
            return -1;
        }
        return i;
    }

    public boolean delete(int value) {
        int i = find(value);
        if (i == -1) {
            return false;
        }
        for (int j = i; j < elementCount - 1; j++) {
            data[j] = data[j + 1];
        }
        elementCount--;
        return true;
    }

    public boolean update(int oldValue, int newValue) {
        int i = find(oldValue);
        if (i == -1) {
            return false;
        }
        data[i] = newValue;
        return true; }}/ / test class
public class Test {
    public static void main(String[] args) {
        MyArray myArray = new MyArray(2);
        myArray.add(1);
        myArray.add(2);
        myArray.add(3);
        myArray.delete(2); System.out.println(myArray); }}Copy the code

The list

In the example above, we can see that the array needs contiguous space. If the size of the array is 2, the third element has to be expanded, and the elements have to be copied. Some delete and insert operations cause more data movement.

Linked lists, or chained data structures, do not require logically adjacent data elements to be physically adjacent, so they do not have the disadvantages of sequential storage structures, but also lose the advantage of looking up elements directly by index subscripts.

Important: Linked lists are not contiguous in the computer’s storage, but rather, the previous node stores the pointer (address) to the next node, and the next node is found by address.

Here is the structure of a singly linked list:

Usually we manually place a front node, also known as a header, in front of a single linked list, but this is not always the case:

The general linked list structure is divided into the following types:

  • Unidirectional linked list: For each node in a linked list, there is only one pointer to the next node, and the last node points to null.
  • Two-way linked list: Each node has two Pointers (we call them for convenienceBefore the pointer.After the pointer), pointing to the previous node and the next node respectively, with the front pointer of the first node pointing toNULL, the last node after the pointer toNULL
  • Circular linked list: Each node has a pointer to the next node, and the last node has a pointer to the first node (although it is a circular linked list, it is necessary to identify the head node or tail node to avoid endless loops)
  • Complex linked lists: Each linked list has a back pointer to the next node and a random pointer to any node.

Time complexity of linked list operations:

  • Query:O(n), you need to traverse the linked list
  • Insert:O(1), modify the pointer before and after
  • Delete:O(1), also modify the pointer before and after
  • Modified: If no query is requiredO(1)Is required for queryO(n)

What’s the structure code for a linked list?

The following only represents the single-linked list structure, C++ represents:

/ / node
typedef struct LNode{
  / / data
  ElemType data;
  // Pointer to the next node
  struct LNode *next;
}*Link,*Position;

/ / list
typedef struct{
  // end node
  Link head,tail;
  / / the length
  int len;
}LinkList;
Copy the code

Java code representation:

    public class ListNode {
        int val;
        ListNode next = null;

        ListNode(int val) {
            this.val = val; }}Copy the code

To achieve their own simple linked list, add, delete, change, check function:

class ListNode<T> {
    T val;
    ListNode next = null;

    ListNode(T val) {
        this.val = val; }}public class MyList<T> {
    private ListNode<T> head;
    private ListNode<T> tail;
    private int size;

    public MyList(a) {
        this.head = null;
        this.tail = null;
        this.size = 0;
    }

    public void add(T element) {
        add(size, element);
    }

    public void add(int index, T element) {
        if (index < 0 || index > size) {
            throw new IndexOutOfBoundsException("Out of list length range");
        }
        ListNode current = new ListNode(element);
        if (index == 0) {
            if (head == null) {
                head = current;
                tail = current;
            } else{ current.next = head; head = current; }}else if (index == size) {
            tail.next = current;
            tail = current;
        } else {
            ListNode preNode = get(index - 1);
            current.next = preNode.next;
            preNode.next = current;
        }
        size++;
    }

    public ListNode get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("Out of list length");
        }
        ListNode temp = head;
        for (int i = 0; i < index; i++) {
            temp = temp.next;
        }
        return temp;
    }

    public ListNode delete(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("Out of list node range");
        }
        ListNode node = null;
        if (index == 0) {
            node = head;
            head = head.next;
        } else if (index == size - 1) {
            ListNode preNode = get(index - 1);
            node = tail;
            preNode.next = null;
            tail = preNode;
        } else {
            ListNode pre = get(index - 1);
            pre.next = pre.next.next;
            node = pre.next;
        }
        size--;
        return node;
    }

    public void update(int index, T element) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("Out of list node range");
        }
        ListNode node = get(index);
        node.val = element;
    }

    public void display(a) {
        ListNode temp = head;
        while(temp ! =null) {
            System.out.print(temp.val + "- >");
            temp = temp.next;
        }
        System.out.println(""); }}Copy the code

The test code is as follows:

public class Test {
    public static void main(String[] args) {
        MyList myList = new MyList();
        myList.add(1);
        myList.add(2);
        / / 1 - > 2
        myList.display();

        / / 1
        System.out.println(myList.get(0).val);

        myList.update(1.3);
        / / 1 - > 3
        myList.display();

        myList.add(4);
        / / 1 - > 3 - > 4
        myList.display();

        myList.delete(1);
        / / 1 - > 4myList.display(); }}Copy the code

Output result:

1 -> 2 -> 
1
1 -> 3 -> 
1 -> 3 -> 4 -> 
1 -> 4 ->
Copy the code

Insert a new node into a list. Insert a new node into a list. Insert a new node into a list.

How do you remove an intermediate node? Here’s how it works:

You might wonder, a5 is just a pointer gone, so where did it go?

For Java programs, the garbage collector will collect nodes that are not referenced, which helps us reclaim the memory. However, in order to speed up garbage collection, we need to empty nodes that are not needed, such as node = null. In C++ programs, we need to manually reclaim nodes. Otherwise, problems such as memory leakage may occur.

The operation of complex linked list is here for the time being. Later, I will share the data structure and common algorithms of linked list separately. This article mainly talks about the overall picture of data structure.

Jump table

As we can see above, it is very troublesome to search a linked list. If this node is at the end of the list, we need to traverse all the nodes to find it. The search efficiency is really low.

There are always more ways than problems, but there is no such thing as absolute “how fast, how good and how cheap”. There is sacrifice and gain, and the world of computer is full of philosophy. Since search efficiency is problematic, we might as well sort the linked list. After sorting the list, you can only know the top and bottom nodes, know the middle range, but to find the middle node, you still have to go through the old way. What if we store the intermediate nodes? If we save it, we know whether it’s in the top half or the bottom half. If you’re looking for 7, you’re going to start at the middle node. If you look for 4, you start from the beginning, and if you look for the middle node, you stop.

However, this still does not completely solve the problem, because the linked list is very long, can only be found through the front and back two parts. Than back to the principle: the space and time, we choose the time, that is about to give up part of the space, we each node to add a pointer, there are 2 layers of pointer (note: only a node, it is the same node, just to look good, and made two is actually the same node, there are two Pointers, such as 1, point to 2, both also points to 5) :

For every two nodes, add one more layer:

This is a hop table. The hop table is defined as follows:

SkipList (full name SkipList) is a data structure used for quick search of ordered element sequence. A SkipList is a randomized data structure, which is essentially a binary search ordered linked list. Skip table in the original ordered list above the increase of multi-level index, through the index to achieve fast search. Skip tables not only improve search performance, but also insert and delete performance. It has the same performance as red black trees and AVL trees, but the principle of the hop table is very simple, and the implementation is much simpler than red black trees.

The main principle is to use space for time, can achieve almost binary search efficiency, in fact, the consumption of space, assuming that every two plus a layer, 1 + 2 + 4 +… Plus n is 2n minus 1, so I have almost twice as much space. Do you think it looks like a book catalog, level 1 catalog, level 2, level 3…

If we continue to insert data into the hop table, there may be a very large number of nodes in a certain section. At this time, we need to dynamically update the index. In addition to inserting data, we also need to insert it into the linked list of the upper layer to ensure the query efficiency.

Redis uses a jump table to achieve ZSET,redis uses a random algorithm to calculate the hierarchy, calculate how many layers of each node index, although it can not absolutely guarantee the balance, but basically ensure the efficiency, to achieve a bit simpler than those balanced tree,red black tree algorithm.

The stack

A Stack is a data structure, represented in Java as a Stack class. Its essence is advanced and then out, just like a bucket, can only be constantly put on top, when taken out, can only continue to take out the top data. If you want to retrieve the underlying data, you have to wait until all the upper data is retrieved. Of course, if there is such a need, we would use a two-way queue.

Here is a demonstration of the stack’s features:

What is implemented at the bottom of the stack? In fact, you can use linked lists, you can also use arrays, but the underlying JDK stack, is implemented with arrays, encapsulation, API operation is always the last element, stack is often used to achieve recursive functions. If you want to learn about stack or other collection implementation analysis in Java, check out this series of articles: aphysia.cn/categories/…

Adding an element is called pushing (pushing), taking an element out of the stack is called pushing, and the element at the top of the stack is the last element put in.

Using arrays to implement a simple stack (note that this is for reference only, there are actual thread safety issues) :

import java.util.Arrays;

public class MyStack<T> {
    private T[] data;
    private int length = 2;
    private int maxIndex;

    public MyStack(a) {
        data = (T[]) new Object[length];
        maxIndex = -1;
    }

    public void push(T element) {
        if (isFull()) {
            length = 2 * length;
            data = Arrays.copyOf(data, length);
        }
        data[maxIndex + 1] = element;
        maxIndex++;
    }

    public T pop(a) {
        if (isEmpty()) {
            throw new IndexOutOfBoundsException("No data in the stack.");
        } else {
            T[] newdata = (T[]) new Object[data.length - 1];
            for (int i = 0; i < data.length - 1; i++) {
                newdata[i] = data[i];
            }
            T element = data[maxIndex];
            maxIndex--;
            data = newdata;
            returnelement; }}private boolean isFull(a) {
        return data.length - 1 == maxIndex;
    }

    public boolean isEmpty(a) {
        return maxIndex == -1;
    }

    public void display(a) {
        for (int i = 0; i < data.length; i++) {
            System.out.print(data[i]+"");
        }
        System.out.println(""); }}Copy the code

Test code:

public class MyStackTest {
    public static void main(String[] args) {
        MyStack<Integer> myStack = new MyStack<>();
        myStack.push(1);
        myStack.push(2);
        myStack.push(3);
        myStack.push(4); myStack.display(); System.out.println(myStack.pop()); myStack.display(); }}Copy the code

The output is as follows:

1, 2, 3, 4, 4, 1, 2, 3Copy the code

The characteristics of stack is first-in, first-out, but if you need to randomly take out the front data, the efficiency will be relatively low, need to dump out, but if the underlying array, theoretically can be taken out by index subscript, Java is just so.

The queue

Since there are fifO data structures in front of us, we must also have fifO data structures. During the epidemic, everyone must have tested nucleic acids in the queue, so the queue is long.

A queue is a special kind of linear table, special in that it only allows deletion at the front of the table and insertion at the rear. Like a stack, a queue is a linear table with limited operation. The end that inserts is called the end of the queue, and the end that deletes is called the head of the queue.

Queues are characterized by first in, first out, and the following are examples:

Usually when you talk about FIFO, or First In First Out, you think about queues, but if you want to have queues where you can fetch elements from the head of the queue and you can fetch elements from the tail of the queue, you need to use special queues (two-way queues), and two-way queues are usually easier to implement using two-way linked lists.

Let’s implement a simple one-way queue in Java:

class Node<T> {
    public T data;
    public Node next;

    public Node(T data) {
        this.data = data; }}public class MyQueue<T> {
    private Node<T>  head;
    private Node<T>  rear;
    private int size;

    public MyQueue(a) {
        size = 0;
    }

    public void pushBack(T element) {
        Node newNode = new Node(element);
        if (isEmpty()) {
            head = newNode;
        } else {
            rear.next = newNode;
        }
        rear = newNode;
        size++;
    }

    public boolean isEmpty(a) {
        return head == null;
    }

    public T popFront(a) {
        if (isEmpty()) {
            throw new NullPointerException("Queue has no data");
        } else {
            Node<T> node = head;
            head = head.next;
            size--;
            returnnode.data; }}public void dispaly(a) {
        Node temp = head;
        while(temp ! =null) {
            System.out.print(temp.data +"- >");
            temp = temp.next;
        }
        System.out.println(""); }}Copy the code

The test code is as follows:

public class MyStackTest {
    public static void main(String[] args) {
        MyStack<Integer> myStack = new MyStack<>();
        myStack.push(1);
        myStack.push(2);
        myStack.push(3);
        myStack.push(4); myStack.display(); System.out.println(myStack.pop()); myStack.display(); }}Copy the code

Running results:

1 -> 2 -> 3 -> 
1
2 -> 3 -> 
2
3 -> 
Copy the code

The common queue types are as follows:

  • Unidirectional queue: what we call normal queue, first in, first out.

  • Bidirectional queue: You can enter and exit a queue in different directions

  • Priority queue: The internal queue is automatically sorted and the queue is out in a certain order

  • Blocking queue: Removing elements from a queue will block if there are no elements in the queue, as will adding elements to a queue if the queue is full.

  • Circular queue: Can be thought of as a circular linked list, but it is usually necessary to identify the first and last nodes to prevent endless loops. The next of the last node points to the head node.

Queues can generally be used to hold data that needs to be ordered, or to hold tasks, which can be done in hierarchical tree traversal, or breadth-first search in general.

Hash table

The previous data structure, when looking up, usually use = or! < and > might be used for half lookups and other range queries. Ideally, we want to be able to get to a location (storage location) without any comparison. In this case, elements in an array can be retrieved by index. So, if we match the data we need to store with the index of the array, and it is one-to-one, we can locate the element quickly, right?

You can find where k is just by using the function f of k, which is a hash function. It represents a mapping relationship, but different values may be mapped to the same value (the same hash address), that is, F (k1) = f(k2), which is called conflict or collision.

The hash table is defined as follows:

A Hash table is a data structure that accesses data stored in memory based on a Key. That is, it accesses records by computing a function on key values that maps the data for the desired query to a location in the table, which speeds up lookups. This mapping function is called a hash function, and the array of records is called a hash table.

Commonly used hash functions are:

  • Direct addressing: fetching a keyword or a linear function of the keyword as a hash function, for exampleH(key) = keyorH(key) = a * key + b
  • Numerical analysis: for the possible value of all understanding, take a number of digits of the keyword hash address
  • Square take middle method: take the middle several bits after the key word is squared as hash address
  • Folding method: divide the keyword into several parts with the same number of digits (the last part of the number can be different), take the superposition of these parts and (drop the carry), as the hash address.
  • Divide the remainder method: take the keyword is not longer than a hash table tablemThe number ofpThe remainder after division is the hash address. Or hash(k)=k mod p.p< =m. It can not only take the modulus of the key word directly, but also take the modulus after the operation of folding method, square method and so on. rightpThe choice of is very important, usually take prime ormIf,pBad choices are prone to conflict.
  • Random number method: take the random function value of the keyword as its hash address.

However, none of these methods can avoid hash conflicts, but can only be consciously reduced. What are the general methods for handling hash conflicts?

  • Open address method:hashAfter the calculation, if the location already has data, then the address+ 1That is, look back until you find an empty place.
  • againhashMethod: after a hash conflict, you can use anotherhashThe function is recalculated to find emptyhashAddresses, if any, can be superimposedhashFunction.
  • Chain address method: allhashWith the same value, the link becomes a linked list and hangs behind the array.
  • Create a public overflow zone: uncommon, meaning all elements, if and elements in the tablehashConflict, all get another table, also called overflow table.

In Java, the chained address method is used:

However, if hash conflicts are serious, the linked list will be long, and subsequent linked lists need to be traversed during query. Therefore, the JDK has optimized a version. When the length of the linked list exceeds the threshold, it will become a red-black tree.

But you have to think, what if the array is too small, and I put a lot of data in it? The probability of rehash collisions gets higher and higher, which triggers a scaling mechanism that doubles the array size, rehashes the old data, and hashes it into a different array.

The hash table has the advantage of fast search speed. However, if rehash is triggered repeatedly, the response speed will be slow. Also, if you want range queries, hash tables are not a good choice.

The tree

Arrays and lists are linear structures, and trees, as I’m going to describe here, are nonlinear structures. In reality, a tree is a pyramid, a tree in a data structure, called the root node at the top.

How do we define a tree structure?

Tree is a kind of data structure, which is composed of n(n≥1) finite nodes to form a set with hierarchical relationship. It’s called a tree because it looks like an upside-down tree, meaning it has roots up and leaves down. It has the following characteristics:

Each node has zero or more child nodes; A node without a parent is called the root node. Each non-root node has one and only one parent; In addition to the root node, each child node can be divided into multiple disjoint subtrees. (Baidu Encyclopedia)

Here are the basic terms for trees (from Tsinghua University Data Structures C language edition) :

  • Degree of node: The number of subtrees a node contains is called the degree of the node
  • Tree degree: in a tree, the largest node degree is called tree degree;
  • Leaf node or terminal node: node with degree zero;
  • Non-terminal node or branch node: node whose degree is not zero;
  • Parent node or parent node: If a node has children, this node is called the parent node of its children.
  • Child node or child node: The root node of the subtree that a node contains is called the child node of that node.
  • Sibling node: nodes with the same parent node are called sibling nodes.
  • Node hierarchy: defined from the root, the root is the first1Layer, the child node of the root is the first2Layers, and so on;
  • Depth: for any noden.nThe depth of is the unique path length from root to n, and the depth of root is0;
  • Height: for any noden.nThe height of is fromnThe longest path to a leaf is long, and the height of all leaves is zero0;
  • Cousin node: nodes whose parent node is in the same layer are Cousins of each other.
  • Ancestor of a node: all nodes from the root to the branch through which the node passes;
  • Descendant: Any node in the subtree rooted in a node is the descendant of the node.
  • Ordered tree: if the subtrees of the nodes of a tree species are considered to be ordered from left to right (not interchangeable), the tree should be called ordered, otherwise it is unordered
  • First child: The root of the leftmost subtree in an ordered tree is called the first child
  • Last child: The root of the rightmost subtree in the ordered tree is called the last child
  • Forest:m(m>=0A collection of trees that do not intersect each other is called a forest;

Trees, in fact, we most commonly use binary trees:

The characteristic of binary tree is that each node only has two subtrees at most, and the subtree can be divided into left and right subtrees, and the order of the left and right child nodes cannot be arbitrarily reversed.

Binary trees are represented in Java:

public class TreeLinkNode {
    int val;
    TreeLinkNode left = null;
    TreeLinkNode right = null;
    TreeLinkNode next = null;

    TreeLinkNode(int val) {
        this.val = val; }}Copy the code

Full binary tree: A binary tree with depth K and 2K-1 nodes is called full binary tree

Complete binary tree: a binary tree with n nodes of depth K is called a complete binary tree if and only if each node corresponds to nodes numbered from 1 to N in a full binary tree of depth K.

There are several general traversal of binary trees:

  • Front-order traversal: Traverses the sequential root node –> left child node –> right child node
  • Middle-order traversal: Traverses the sequential left child node –> root node –> right child node
  • Back-order traversal: Traverses the sequential left child node –> right child node –> root node
  • Breadth/level traversal: Traversal from top to bottom, layer by layer

If it is a chaotic binary tree, the search or search efficiency will be relatively low, and a chaotic linked list is no different, why more complex structure?

In fact, binary trees can be used for sorting or searching, because binary trees have strict left and right subtrees, and we can define the size of the root node, the left child, and the right child. Hence the binary search tree:

A Binary Search Tree is either an empty Tree or a Binary Tree with the following property: if the left subtree is not empty, the value of all nodes in the left subtree is less than the value of the root node; If its right subtree is not empty, the value of all nodes in the right subtree is greater than the value of its root node. Its left and right subtrees are also binary sort trees respectively. As a classical data structure, binary search tree has the advantages of quick insertion and deletion of linked list and quick search of array. Therefore, it is widely used, for example, in file system and database system, this data structure is generally used for efficient sorting and retrieval operations.

Binary search tree example:

For example, in the tree above, if we need to find 4, we start at 5, 4 is less than 5, we go to the left subtree, we find 3, 4 is more than 3, we go to the right subtree, we find 4, which is a tree of 7 nodes, we only find 3 times, which is the number of levels, let’s say n nodes, that’s log of n plus 1.

Tree maintenance, query efficiency is high, but if the tree is not well maintained, easy to degenerate into a linked list, query efficiency will decline, such as:

A query friendly binary tree should be a balanced or nearly balanced binary tree.

The height difference between the left and right subtrees of any node in a balanced binary search tree is at most 1. Balanced binary trees are also called AVL trees.

In order to make sure that the binary tree is still a balanced binary tree after inserting or deleting data and so on, you need to adjust the nodes, which is also called the balancing process, and it involves all kinds of rotation adjustments, so I’m not going to expand it here.

However, if it involves a large number of updates, delete operations, balancing the various adjustments of tree species need to sacrifice no small performance, to solve this problem, there are big names proposed red black tree.

Red Black Tree is a self-balanced binary search Tree, which is a data structure used in computer science. It is typically used to implement associative arrays. [1]

The red-black tree is in 1972 by [Rudolf Bayer] (baike.baidu.com/item/Rudolf Bayer / 3014716), was called a balanced binary tree B (symmetric binary B – trees). It was later modified in 1978 by Leo J. Guibas and Robert Sedgewick to the current “red-black tree”. [2]

Red-black tree is a kind of specialized AVL tree (balanced binary tree), which maintains the balance of binary search tree by certain operation during insertion and deletion operation, so as to obtain high search performance.

Red-black trees have the following characteristics:

  • Property 1. The node is red or black.

  • Property 2. The root is black.

  • Nature 3. All leaves are black. (Leaves are NIL nodes)

  • Property 4. Two children of every red node are black. (No two consecutive red nodes on all paths from each leaf to the root)

  • Property 5. All paths from each leaf of any node contain the same number of black nodes.

It is these properties that make red-black tree adjustments less difficult and less frequent than normal balanced binary tree adjustments. That is to add rules, make it meet certain standards, reduce the chaos and frequency of the balancing process.

The hash table mentioned above, implemented in Java, is exactly the application of red black tree, in the hash conflict, will be converted to red black tree.

These are all binomial trees, but we have to talk about multinomial trees. Why? While the various search trees in binary trees, red and black trees, are great, we have to factor in IO when interacting with disk, mostly in data storage, because disk IO is much slower than memory. If the height of the index tree is in the tens of thousands, the number of times disk reads is too many. B trees are more suitable for disk storage.

In 970, R.Bayer and E. McCreight proposed a tree suitable for external lookup, which was a balanced multi-fork tree called B tree (or B-tree, B_ tree).

A balanced tree of Order M is a balanced m-way search tree. It is either an empty tree or a tree that satisfies the following properties:

1. The root node has at least two children.

2. The number of keywords j contained in each non-root node satisfies m/ 2-1 <= j <= m-1;

3. The degree of all nodes except the root node (excluding the leaf node) is exactly the total number of keywords plus 1, so the number of internal subtrees k satisfies m/2 <= k <= M;

4. All leaf nodes are in the same layer.

When each node puts more data, the search in memory is much faster than the disk, and the B-tree can reduce the number of DISK I/OS. B tree:

However, the data of each node may be very large, which will result in a small amount of data detected on each page, and the number of IO queries will naturally increase. Therefore, it is better to only store data in leaf nodes:

B+ tree is a variant of B tree. The leaf nodes of B+ tree store keywords and corresponding recorded addresses, and the layers above the leaf nodes are used as indexes. A B+ tree of order m is defined as follows:

(1) Each node has at most M children;

(2) Every node except the root node has at least [m/2] children, and the root node has at least two children;

(3) Nodes with K children must have K keywords.

In general, the leaf nodes of b+ trees are connected by linked lists for easy traversal and range traversal.

This is the b+ tree. B + trees have the following advantages over B trees:

  1. b+The middle node of the tree does not store data, and each IO query can find more indexes. It is a chunky tree.
  2. For range lookup,b+The tree simply walks through the list of leaf nodes,bTrees need to have leaves from the root node.

In addition to the above Tree, there is also a kind of Huffman Tree: given N weights as N leaf nodes, a binary Tree is constructed. If the weight path length of the Tree reaches the minimum, such a binary Tree is called optimal binary Tree, also known as Huffman Tree. Huffman tree is a tree with the shortest weight path length, and the node with higher weight is closer to the root.

Generally used for compression, because in the data, the frequency of each character is not the same, the higher the frequency of the character, we use the shorter the code to save, can achieve the purpose of compression. So where does this code come from?

Assuming the character is hello, the encoding might be (just a rough version of the encoding, shorter for high-frequency characters), and the encoding would be the string 01 from the root node to the current character:

The Huffman tree is effectively compressed by encoding different weights.

The heap

A heap, in fact, is a type of binary tree. The heap must be a complete binary tree. A complete binary tree is one in which the number of nodes in all layers is full except for the last layer, and the nodes in the last layer are concentrated in the left continuous position.

The heap also requires that the value of each node in the heap be greater than (or less than) the value of its left and right children.

There are two main types of heap:

  • Big top heap: Each node is greater than or equal to its children (the top of the heap is the maximum)
  • Small top heap: Each node is less than or equal to its subtree node (the top of the heap is the minimum)

In general, we use arrays to represent heaps, such as the following small top heap:

The parent-child node and the left/right node relationships in the array are as follows:

  • i The parent of a nodeparent = floor((i-1)/2) (Round down)
  • i The left child of the node2 * i +1
  • i The right child of the node2 * i + 2

Since it stores data, it must involve insert and delete operations. Insert and delete in the heap will involve the adjustment of the heap to meet its definition again after adjustment. This adjustment process is called heapification.

In the case of the small top heap, the adjustment is mainly to ensure that:

  • It’s a perfect binary tree
  • Each node in the heap is less than or equal to its left and right children

For the small top heap, the adjustment is that the small elements float up and the large elements sink down, which is a constant exchange process.

The heap can generally be used to solve TOP K problems, or priority queues, etc.

figure

Finally, we come to the illustration of the graph. The graph is actually a two-dimensional plane. I wrote about mine clearance before, and the whole square area of mine clearance is actually related to the graph. A graph is a nonlinear data structure composed mainly of edges and vertices.

At the same time, the graph is divided into directed graph and undirected graph. The graph above is undirected graph, because the edge does not indicate the direction, but only represents the correlation between the two, while directed graph is like this:

If each vertex is a place and each edge is a path, then this is a map network, so the graph is often used to find the shortest distance. Let’s take a look at some concepts related to graphs:

  • Vertices: The most basic units of a graph, those nodes
  • Edge: Relation between vertices
  • Adjacent vertices: vertices directly related by edges
  • Degree: The number of adjacent vertices that a vertex is directly connected to
  • Weight: The weight of an edge

There are several ways to represent graphs in general:

  1. The adjacency matrix, represented by a two-dimensional array, is 1 for connected, and 0 for disconnected. Of course, if the path length is represented, it can be greater than0The number of represents the path length, and is used- 1Indicates disconnection.

In the picture below, 0 is connected to 1, 2, and we can see that row 0, column 1, and column 2 are connected. One more thing: the vertices themselves are marked with 0, indicating that they are not connected, but some cases can be considered connected.

  1. Adjacency list

Adjacency list, storage method is similar to the tree child chain representation, is a combination of sequential allocation and chain allocation storage structure. If the vertex corresponding to the table head node has adjacent vertices, the adjacent vertices are placed in the one-way linked list pointed to by the table head node.

For undirected graphs, data redundancy will also occur when adjacency lists are used for storage. If there is A table node pointing to C in the linked list pointed to by table head node A, there will also be A table node pointing to A in the linked list pointed to by table head node C.

The traversal in the graph is generally divided into breadth-first traversal and depth-first traversal. Breadth-first traversal refers to the traversal of vertices directly related to the current vertex, which is generally implemented by queue. Depth-first traversal is to go in one direction until you can’t go any more, which means you don’t go back until you hit the south wall. It is generally achieved by recursion.

Graph, in addition to the use of computational minimum path, there is a concept: minimum spanning tree.

A spanning tree of a connected graph with n nodes is a minimal connected subgraph of the original graph, containing all n nodes of the original graph and having the least number of edges that keep the graph connected. The minimum spanning tree can be solved by Kruskal algorithm or Prim algorithm.

There’s a way of saying that a graph is a point on a plane, and we pick up one of those points, and we take the edge that’s going to carry the other vertices, and we take the least weight, and we get rid of the extra edge, and that’s a minimum spanning tree.

Of course, the minimum spanning tree is not necessarily unique, and there can be multiple outcomes.

Qin Huai @ viewpoint

Knowing these basic data structures, and being able to choose more appropriate ones when writing code or modeling data, is of great use. Computers are for people, and so is code, and there are all kinds of data structures that we can’t learn all at once, but the basic thing is that it’s not going to change that much unless it’s revolutionized by a new generation.

Programs are made up of data structures and algorithms, and data structures are like cornerstones.

In order to write a “good” program, it is necessary to analyze the characteristics of the objects to be processed and the relationship between the objects to be processed, which is the background of the discipline and development of “data structure”.

[Author profile] : Qin Huai, public number [Qin Huai grocery store] author, personal website: aphysia.cn, the road of technology is not at that time, mountain high water long, even if slow, chi and not stop.

Offer all questions in PDF

Open Source Programming Notes