This article was first published on the WeChat public account "Programmer Interviewer".
Arrays are the most common data structure used by almost any software engineer, and for that reason, many developers don’t take them seriously enough.
Interview questions like "Is there a performance difference between accessing the first member and the last member of a million-member array? Why?" come up from time to time.
In addition, in everyday business development we tend to reach for an array for almost everything. Yet developers who are in the habit of reading source code may notice that in some low-level libraries, in places where an array would seem to work perfectly well, the authors chose other data structures instead. Why is that?
I hope you will keep these questions in mind as we work through the discussion below.
What is an array
Arrays are one of the most basic data structures in computer science, built into most programming languages, and most commonly used by developers.
An array is a data structure consisting of a collection of elements of the same type, stored in a contiguous block of allocated memory.
Of course, in some dynamic languages, such as Python (lists) or JavaScript (arrays), the memory may not be contiguous and elements of different types may be stored together.
For example, we have the following array:
arr = [1, 2, 3, 4, 5]
Its in-memory representation is a continuous linear sequence: the elements occupy adjacent cells in a single block of memory. This continuous linear format has both advantages and disadvantages, and only when we understand both can we make better use of arrays in future development.
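To make this concrete, here is a minimal Python sketch using the standard array module (which, unlike an ordinary list, really does store its elements in one contiguous block). It prints the computed address of each element; the addresses increase in fixed steps of arr.itemsize bytes.

from array import array

arr = array('i', [1, 2, 3, 4, 5])    # a contiguous block of C ints
start, length = arr.buffer_info()    # base address of the block and the element count

# consecutive elements sit exactly arr.itemsize bytes apart
for i in range(length):
    print(f"arr[{i}] is stored at address {start + arr.itemsize * i}")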
Array properties
A data structure usually has four basic operations: insert, find, delete, and read. We will analyze the performance differences caused by each of these operations.
Let's first clarify what we mean by "performance".
It is not speed in an absolute sense, because different hardware produces huge differences in raw speed. The performance discussed here is the notion of "complexity" from algorithm analysis.
If you are not familiar with complexity, you may want to review the basics of algorithm analysis first.
Insert performance
We already know that an array is a contiguous chunk of memory. What happens if we want to insert a new element at index k? To keep the memory contiguous, we need to move every element from index k onward one position back, and then place the new element at index k.
That is clearly a lot of extra work, and in the typical case the time complexity of insertion is O(n).
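As a rough illustration, here is a minimal Python sketch of the shift, using a plain list to stand in for the array (the helper name insert_at is mine, for illustration only):

def insert_at(arr, k, value):
    """Insert value at index k by shifting arr[k:] one slot to the right (O(n))."""
    arr.append(None)                  # grow by one slot for the new element
    for i in range(len(arr) - 1, k, -1):
        arr[i] = arr[i - 1]           # move each element after k one position back
    arr[k] = value

nums = [1, 2, 4, 5]
insert_at(nums, 2, 3)
print(nums)   # [1, 2, 3, 4, 5]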
Delete performance
The delete operation is in fact very similar to insert. If we want to delete the element at index k, then to maintain the continuity of memory we have to move every element after index k forward one position. The time complexity is therefore also O(n).
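Similarly, a minimal sketch of the shifting that a delete entails (again, delete_at is just an illustrative helper):

def delete_at(arr, k):
    """Delete the element at index k by shifting arr[k+1:] one slot to the left (O(n))."""
    for i in range(k, len(arr) - 1):
        arr[i] = arr[i + 1]           # move each element after k one position forward
    arr.pop()                         # drop the now-duplicated last slot

nums = [1, 2, 3, 4, 5]
delete_at(nums, 1)
print(nums)   # [1, 3, 4, 5]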
Find performance
For example, if we want to find out whether the value 2 exists in an array, what does the computer have to do?
A human, with a small amount of data, can see at a glance whether there is a 2, but a computer cannot. The computer has to compare element by element, starting from index 0, until it matches a 2.
This search process is the familiar linear search: the number of comparisons grows with the length n of the array, so the time complexity is also O(n).
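A minimal linear-search sketch, just to make the O(n) scan concrete:

def linear_search(arr, target):
    """Scan from index 0 until target is found; O(n) in the average and worst case."""
    for i, value in enumerate(arr):
        if value == target:
            return i                  # index of the first match
    return -1                         # not found

print(linear_search([5, 3, 2, 8], 2))   # 2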
Read performance
We have emphasized that an array holds elements of the same type in a contiguous piece of linear memory. Thanks to these characteristics, arrays have excellent read performance: the time complexity is O(1), a clear advantage over linked lists, binary trees, and other data structures.
So how can arrays achieve such low time complexity?
Suppose the array's base memory address is start, the size of each element is size, and the index we want is i. Then we can easily derive the addressing formula for the memory address of arr[i]:
arr[i]_address = start + size * i
For example, if we want to read the value of arr[3], we only need to substitute i = 3 into the addressing formula, and the computer can locate the element in a single step. That is why the time complexity of reading an array element is only O(1).
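As a rough demonstration of the addressing formula, the sketch below uses Python's standard array and ctypes modules to read an element directly from its computed address. This is purely illustrative; you would never do this in real code, and it assumes the platform's C int matches the array typecode 'i' (4 bytes on common platforms).

import ctypes
from array import array

arr = array('i', [10, 20, 30, 40, 50])
start, _ = arr.buffer_info()      # start: base address of the contiguous block
size = arr.itemsize               # size of one element in bytes

i = 3
address = start + size * i        # the addressing formula: arr[i]_address = start + size * i
print(ctypes.c_int.from_address(address).value)   # 40, read straight from the computed address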
Performance optimization
We already know that the time complexity of all operations except “read” is O(n), so is there an effective way to optimize performance?
Lookup performance optimization
When the elements of an array are unordered, we can only use a relatively slow linear search. When the elements are ordered (increasing or decreasing), we can use another more efficient method, binary search.
Suppose we have an array of ints, stored in increasing order:
arr = [1, 2, 3, 4, 5, 6, 7]
If we are looking for the element 6, a linear search would scan from index 0 upward until it hits 6 at index 5.
Binary search is more efficient because we already know the array elements are in increasing order:
- Take the middle element, at index 3, and call its value p.
- Compare p with the target value 6. Since p = 4 and 4 < 6, and the array is increasing, the target must be among the elements after index 3.
- Take the new middle element of the remaining range after index 3, compare it with the target, and repeat the process until the target is found.
We can see that each comparison excludes half of the remaining elements, so the overall time complexity is only O(log n), which makes this method very efficient.
The advantage is even more obvious with large amounts of data. Suppose an array of 1 billion members is sorted in increasing order: a linear search might need up to 1 billion comparisons in the worst case, while binary search needs only about 30.
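A minimal binary-search sketch over a sorted array like the one above:

def binary_search(arr, target):
    """Binary search on an array sorted in increasing order; O(log n)."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:       # target can only be to the right of mid
            lo = mid + 1
        else:                         # target can only be to the left of mid
            hi = mid - 1
    return -1

print(binary_search([1, 2, 3, 4, 5, 6, 7], 6))   # 5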
Insertion performance optimization
For example, suppose we have the following array and want to insert a new member, orange, at index 1:
arr = ['apple', 'banana', 'grape', 'watermelon']
Normally, we would have to move the last three members back one position so that orange can occupy index 1.
But what if the requirement does not demand that the index order be preserved? In other words, we can treat the array as a set. In that case we can put orange directly at index 1 and append the displaced member, banana, to a newly allocated slot at the end of the array. Even though the index order is disturbed, the whole operation takes only O(1) time.
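A sketch of this set-style insertion, in which the displaced member is simply appended to the end (the helper name insert_unordered is mine, for illustration only):

def insert_unordered(arr, k, value):
    """O(1) insert when element order does not matter:
    move the current occupant of index k to the end, then overwrite index k."""
    arr.append(arr[k])    # park the displaced element at the tail
    arr[k] = value

fruits = ['apple', 'banana', 'grape', 'watermelon']
insert_unordered(fruits, 1, 'orange')
print(fruits)   # ['apple', 'orange', 'grape', 'watermelon', 'banana']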
Delete performance optimization
A delete operation requires the elements after the deleted position to be moved forward together, which can be costly, especially when delete and insert operations are frequent.
We can record the delete operations first instead of carrying them out immediately. When a certain point is reached, we apply all of the recorded operations to the array in one pass. In this way, the frequent movement of array members becomes a single batched operation, which greatly improves performance.
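A rough sketch of this batched (lazy) deletion idea, using a small wrapper class of my own invention rather than any particular library:

class LazyArray:
    """Record deletions as marks and compact the array in one pass later."""
    def __init__(self, items):
        self.items = list(items)
        self.deleted = set()          # indices marked for deletion

    def delete(self, k):
        self.deleted.add(k)           # O(1): just record the deletion

    def compact(self):
        # one O(n) pass removes all marked elements instead of shifting on every delete
        self.items = [x for i, x in enumerate(self.items) if i not in self.deleted]
        self.deleted.clear()

la = LazyArray([1, 2, 3, 4, 5])
la.delete(1)
la.delete(3)
la.compact()
print(la.items)   # [1, 3, 5]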
This idea is very widely used:
- The virtual DOM of front-end frameworks stores a large number of DOM operations in a diff queue and then applies them all at once, avoiding repeated reflow and repaint of the DOM.
- The mark-and-sweep garbage collection algorithm in V8 and the JVM is based on the same idea. Mark-and-sweep has two phases: in the mark phase, every reachable object is marked; in the sweep phase, all unmarked objects are reclaimed in a single pass.
Summary
Coming back to the opening question, "Is there a performance gap between the first and the last member of a million-member array?", the answer is clearly no. An array is one contiguous piece of linear memory, and the addressing formula lets us reach any member in a single step, regardless of its position.
Finally, there is a class of problems we often meet in interviews or on LeetCode: subarray problems.
For example, given an array of integers, compute the maximum sum over all contiguous subarrays of length k.
What can be done to minimize time complexity? Just say what you think.
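For reference only, after you have thought it over yourself, here is a sketch of one common O(n) approach, a fixed-size sliding window (this is one possible answer, not the only one):

def max_sum_of_length_k(nums, k):
    """Maximum sum over all contiguous subarrays of length k, in O(n) time."""
    window = sum(nums[:k])            # sum of the first window
    best = window
    for i in range(k, len(nums)):
        window += nums[i] - nums[i - k]   # slide: add the new element, drop the oldest
        best = max(best, window)
    return best

print(max_sum_of_length_k([2, 1, 5, 1, 3, 2], 3))   # 9 (5 + 1 + 3)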