Hello, I’m Yue Chuang.
In this article I share how to use the NumPy library for data analysis. There is a lot of material, so I cannot walk through every piece of code and analyze every output; that would be far too much work. Instead, I have included a large number of code blocks, and I hope you will run them and analyze the results yourself. Once you can do that, you will be able to cope with NumPy updates that change the API, or anything else new, with ease.
If you cannot work out the pattern behind a result, do not panic. If you don’t understand the output of an example in this article, or have any other questions, please leave a comment below. You can also follow the public account AI Yue Chuang and add me as a friend (this is my main account, not an alternate one), and I will add you to the group. You can ask questions directly in the group or @ me there; when I have time and see the message, I will reply. Private chat is not recommended, because other people may have run into the same problem. Don’t be shy; let’s exchange ideas and learn together. This invitation is always open, so feel free to chat!
1. Basics of data analysis
Let’s start with the lowest level of data analysis, the array. It is the most critical concept, so let’s look at what an array is.
What is an array?
Simply put, it is an ordered sequence of elements. For example, the list [1, 2, 3, 4] is a simple one-dimensional array with only four elements, and it cannot be split into a combination of other arrays. To go one step further, [[1, 2, 3], [4, 5, 6]] is a two-dimensional array consisting of two one-dimensional arrays.
Still not sure what an array is? It is actually very simple: a group of numbers. Of course, a group of numbers can be arranged in different ways. In the picture below, 1D, 2D and 3D stand for one-dimensional, two-dimensional and three-dimensional arrays.
In other words, when we arrange numbers, we can arrange them in one dimension, which looks like the first picture on the left. Two dimensions look rather like a table, as in the second figure in the middle. In three dimensions you get a cube, as in the figure on the far right.
This gives you a good idea of what the terms mean:
- A one-dimensional array has only one direction;
- A two-dimensional array has two directions;
- A three-dimensional array has, of course, three directions (one direction is added to the two-dimensional case);
2. General process of data processing
Next, let’s look at the general process of data analysis. Generally speaking, there are four steps, as shown above:
- Data collection: the first step is very important; it is where we get our data from.
- Data preprocessing: before we process the data, we need to preprocess it. Some of you might ask what the difference between preprocessing and processing is. Preprocessing exists to make the third step, data processing, easier.
- Data processing: this is the step where we formally process and analyze the data.
- Data presentation: once processing and analysis are complete, this step is about making the results of our analysis more intuitive.
1. Data collection
There are several common methods for data collection:
- Web crawler: you write the crawler code yourself, which gives a high degree of freedom; the drawback is the effort of writing the crawler;
- Public data: data sets that are available for download, such as news data or microblog comment data;
- Purchased data: there are specialized crawler companies that write code to collect data for your needs;
- Data provided internally by your company: for example, if you work in operations, you can analyze sales data to guide operations;
- Data acquired through other channels: questionnaires and similar forms;
2. Data preprocessing
Here is a simple list; it does not matter if you do not understand everything yet:
- Normalization
- Binarization
- Dimension transformation
- Duplicate removal
- Invalid data filtering
1. Normalization
Normalization takes two forms: one changes a number into a decimal in (0, 1), and the other changes a dimensional expression into a dimensionless one. It was introduced mainly to make data processing more convenient: mapping the data to the range 0 to 1 makes processing easier and faster. It belongs to the field of digital signal processing.
Change the number to a decimal between 0 and 1
Example 1: {2.5 3.5 0.5 1.5} normalized to {0.3125 0.4375 0.0625 0.1875}
Solution:
2.5 + 3.5 + 0.5 + 1.5 = 8,
2.5/8 = 0.3125,
3.5/8 = 0.4375,
0.5/8 = 0.0625,
1.5/8 = 0.1875.
Normalization here means scaling the numbers in braces so that they sum to 1, writing each number as its share of the total.
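The arithmetic of Example 1 can be reproduced in a few lines of NumPy. This is only a minimal sketch; the variable names (values, total, normalized) are chosen here for illustration and do not come from the original article.

```python
import numpy as np

# The four numbers from Example 1
values = np.array([2.5, 3.5, 0.5, 1.5])

total = values.sum()         # 2.5 + 3.5 + 0.5 + 1.5 = 8.0
normalized = values / total  # divide every element by the total

print(normalized)        # [0.3125 0.4375 0.0625 0.1875]
print(normalized.sum())  # 1.0
```

Dividing the whole array by a single number is an array expression: no explicit loop over the elements is needed.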
1.1 Dimensionless expression (optional reading)
Normalization is a way of simplifying calculation: a dimensional expression is transformed into a dimensionless expression and becomes a scalar.
In statistics, the specific role of normalization is to summarize the statistical distribution of a uniform sample. Normalization to between 0 and 1 gives a statistical probability distribution; normalization to between -1 and +1 gives a statistical coordinate distribution.
Definition of normalization: normalization is the process of restricting the data you need to process (by some algorithm) to a certain range.
- First, normalization makes subsequent data processing easier; second, it helps the program converge faster when it runs.
- The specific role of normalization is to summarize the statistical distribution of uniform samples.
- Normalization to between 0 and 1 gives a statistical probability distribution; normalization to some other interval gives a statistical coordinate distribution.
- Normalization carries the meaning of sameness, unity and oneness.
If the value lies on an interval, it can be normalized by its relative position on that interval: select a reference point and use the ratio of the relative position to the whole interval (or to a given value of the whole interval) to obtain a normalized value, such as a probability 0 <= P <= 1.
If it is a plain numerical value, many common mathematical functions can be used for normalization to make values more comparable, for example logarithmic, exponential, trigonometric or inverse trigonometric normalization. The purpose may be to make incomparable data comparable while preserving the relative relationship between the values being compared (the larger stays larger and the smaller stays smaller), or to make plotting easier: data that was hard to fit on one figure can easily be placed by relative position after normalization.
From the point of view of sets, normalization can be seen as unifying dimensions, that is, abstraction: unimportant, incomparable attributes of the elements are set aside and only the attributes people care about are kept. Objects that were not comparable can then be grouped into one class and compared. People also like to compare relative quantities. For example, the heights and weights of people and cows are not directly comparable, but the ratio height/weight may be. How much people eat and how much cows eat may not be directly comparable, but the amounts relative to body weight, or relative to the food needed to supply a day’s energy, are. From a mathematical point of view, all of this can be thought of as turning dimensional quantities into dimensionless ones.
Data Normalization Method
The standardization/normalization of data is, in form, a change of expression, but in essence it exists for the sake of comparison. Data normalization scales the data so that it falls into a small, specific interval. For example, since the indicators in a credit index system have different units of measurement, each indicator has to be normalized, mapping its value into a certain interval through a function transformation, before it can take part in the evaluation calculation.
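One common way to scale data into a specific interval is min-max scaling. The sketch below is an assumption about which function transformation to use, since the text does not name one; the helper min_max_scale and the sample scores are hypothetical.

```python
import numpy as np

def min_max_scale(x, low=0.0, high=1.0):
    """Linearly map the values of x onto the interval [low, high].

    Hypothetical helper for illustration; not part of NumPy.
    """
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # All values are identical: map everything to the lower bound
        return np.full_like(x, low)
    return low + (x - x.min()) / span * (high - low)

scores = np.array([50.0, 75.0, 100.0])  # made-up indicator values
scaled = min_max_scale(scores)
print(scaled)  # smallest value maps to 0.0, largest to 1.0, 75 to 0.5
```

The order of the values is preserved: the larger stays larger and the smaller stays smaller.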
2. Binarization
The original:
Effect after binarization:
Binarization is the simplest method of image segmentation. It transforms a grayscale image into a binary image: pixels whose gray value is greater than a critical gray value are set to the maximum gray value, and pixels whose gray value is less than it are set to the minimum gray value.
Depending on how the threshold is selected, binarization algorithms can be divided into fixed-threshold and adaptive-threshold methods. Commonly used binarization methods include the bimodal method, the P-parameter method, the iterative method and the Otsu method.
Put simply, binarization takes a set of data and divides it into two categories, such as low and high.
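A fixed-threshold binarization can be sketched with np.where. The tiny "image" and the threshold of 128 are made-up values for illustration, and this sketch sends pixels exactly equal to the threshold to the minimum, a choice the text leaves open.

```python
import numpy as np

# A hypothetical 2x3 grayscale image with 8-bit pixel values
gray = np.array([[ 12, 200,  90],
                 [130,  45, 255]], dtype=np.uint8)

threshold = 128  # assumed fixed threshold

# Pixels above the threshold become 255 (maximum), all others become 0 (minimum)
binary = np.where(gray > threshold, 255, 0).astype(np.uint8)
print(binary)
```

Adaptive methods such as Otsu's differ only in how the threshold is chosen; the final comparison step looks the same.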
2.1 Auxiliary Understanding
1. Image segmentation
In computer vision, image segmentation refers to the process of subdividing a digital image into multiple image sub-regions (sets of pixels) (also known as superpixels). The purpose of image segmentation is to simplify or change the representation of the image, making the image easier to understand and analyze. [1] Image segmentation is usually used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of labeling each pixel in the image, which makes pixels with the same label have some common visual properties.
The result of image segmentation is the set of sub-regions on the image (the whole of these sub-regions covers the whole image), or the set of contour lines extracted from the image (such as edge detection). Each pixel in a subregion is similar under a measure or calculated property, such as color, brightness, texture. Adjacent regions vary greatly in the measure of certain properties. [1]
2. Grayscale
In computing, a grayscale digital image is an image with only one sample color per pixel. Such images are usually displayed as shades of gray from the darkest black to the brightest white, although in theory the sample could be different shades of any color, or even different colors at different levels of brightness. A grayscale image is different from a black-and-white image: in the field of computer imaging, a black-and-white image has only two colors, while a grayscale image has many levels of color depth between black and white. Outside the digital image field, however, “black-and-white image” also means “grayscale image”; for example, a grayscale photograph is often called a “black-and-white photograph”. In some articles on digital images, monochrome images are equated with grayscale images, and in others with black-and-white images.
Grayscale images are often obtained by measuring the brightness of each pixel within a single band of the electromagnetic spectrum, such as visible light.
Grayscale images for display are usually stored with a nonlinear scale of 8 bits per sampled pixel, which allows 256 gray levels (2^8 = 256). This precision is just enough to avoid visible banding and is very easy to program. In medical imaging and remote sensing applications, more levels are often used to make full use of sensor accuracies of 10 or 12 bits per sample and to avoid approximation errors in calculation. 16 bits, that is 65536 combinations (or 65536 levels), is popular in such applications.
3. Binary image
A binary image is a digital image with only two possible values per pixel. People often use the terms black-and-white, B&W and monochrome for binary images, but those terms can also refer to any image with only one sample value per pixel, such as a grayscale image.
Binary images often appear in digital image processing as image masks or as a result of image segmentation, binarization, and dithering. Some input and output devices, such as laser printers, fax machines, monochrome computer monitors, etc., can process binary images.
Binary images are often stored in bitmap format.
Binary images can be interpreted as subsets of the two-dimensional integer lattice Z^2, and the field of morphological image processing was largely inspired by this view.
3. Dimension transformation
You can think of it as, for example, transforming a two-dimensional array into a one-dimensional array.
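As a minimal sketch of such a transformation, the snippet below flattens a made-up 2×3 table into one dimension with reshape and then restores it. The names table, flat and restored are illustrative only.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])    # two-dimensional, shape (2, 3)

flat = table.reshape(-1)         # -1 lets NumPy work out the length: shape (6,)
print(flat)                      # [1 2 3 4 5 6]

restored = flat.reshape(2, 3)    # back to two rows and three columns
print(restored.shape)            # (2, 3)
```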
4. Duplicate removal
If there is a lot of duplicate data, we can deal with it during data preprocessing.
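For a flat array, one way to remove duplicates is np.unique, which returns the distinct values in sorted order. The readings below are made-up sample data.

```python
import numpy as np

readings = np.array([3, 1, 3, 2, 1, 3])  # hypothetical data with repeats

unique_values = np.unique(readings)      # sorted, duplicates removed
print(unique_values)                     # [1 2 3]
```

Note that np.unique sorts its result, so the original order of first appearance is not kept.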
5. Invalid data filtering
For example, some of the data may be missing.
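A minimal sketch of filtering out missing values, assuming they are encoded as NaN (other data sets may mark them differently). The samples array is made up.

```python
import numpy as np

samples = np.array([1.5, np.nan, 2.0, np.nan, 3.5])  # NaN marks a missing value

is_missing = np.isnan(samples)  # Boolean array, True where data is missing
valid = samples[~is_missing]    # keep only the entries that are present
print(valid)                    # [1.5 2.  3.5]
```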
3. Data processing
- Data sorting: for example, sorting from largest to smallest;
- Data filtering: searching according to certain conditions;
- Statistical analysis of data
There are many more; I’m just listing a few.
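The three kinds of processing listed above can each be sketched in one line of NumPy. The sales figures are invented for illustration.

```python
import numpy as np

sales = np.array([120, 80, 200, 150])  # hypothetical sales figures

# Sorting from largest to smallest: sort ascending, then reverse
sorted_desc = np.sort(sales)[::-1]     # [200 150 120  80]

# Searching by a condition: keep values greater than 100
above_100 = sales[sales > 100]         # [120 200 150]

# A simple statistic: the mean
average = sales.mean()                 # 137.5

print(sorted_desc, above_100, average)
```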
4. Data presentation
- Tables
- Charts
- Dynamic interactive graphs
So that’s the basic flow of data processing.
3. Why Numpy
- High performance
- Open source
- Array operations
- Fast reads and writes
Simply put, Python’s built-in data types are not efficient in computationally intensive scenarios such as matrix operations. Hence NumPy, which is considered the foundation package for high-performance scientific computing and data analysis.
Almost all of the advanced tools used in data analysis are built on NumPy. Because most of NumPy’s code is written in C, and its underlying algorithms are designed to perform extremely well, NumPy is far more efficient than pure Python code. As a basic tool, NumPy is also easy to use. You only need to understand three key points: creating data types, indexing and slicing, and array manipulation. We will expand on them in separate sections.
For newcomers, it is important to note that although most data analysis work does not manipulate NumPy objects directly, understanding array-oriented programming and its logic is the key to becoming a strong Python data analyst.
The biggest characteristic of array-oriented programming is that data operations are expressed as array expressions, with no need to write lots of loops. Vectorized array operations are typically one or two orders of magnitude faster than their pure Python equivalents. In later study we will have the chance to savor the differences and advantages.
1. High performance
import numpy as np
import time
list_array = list(range(int(1e6)))  # 10 to the sixth power
start_time = time.time()
python_array = [val * 5 for val in list_array]  # 1 million numbers, each multiplied by 5
end_time = time.time()
print('Python array time: {}ms'.format(round((end_time - start_time) * 1000, 2)))
np_array = np.arange(1e6)
start_time = time.time()
np_array = np_array * 5
end_time = time.time()
print('Numpy array time: {}ms'.format(round((end_time - start_time) * 1000, 2)))
print('What sup! ')
4. Install Numpy
Windows: pip install numpy
macOS: pip3 install numpy
5. Use Numpy module
5.1 Creating a Python File
- Import the NumPy module:
import numpy as np
- as np means that np stands for numpy in the rest of the program.
- In an import statement, as introduces an abbreviation (alias) for the module.
5.2 The base type of Numpy — Ndarray
One of NumPy’s most important features is that it can quickly create an N-dimensional array object (an ndarray object; this article makes no conceptual distinction between ndarray objects and arrays). You can then use the ndarray data structure to perform mathematical operations very efficiently, with a syntax that is basically the same as ordinary Python.
1. Create an array
1. One-dimensional arrays
Create an ndarray. In plain Python we create a list directly like this:
data = [2, 4, 6.5, 8]
The simplest way to create an ndarray is to use the array function, which takes a sequence object (such as a list) and converts it into an ndarray object. So, if you want to create a one-dimensional NumPy array, you need to write code like this:
data = np.array([2, 4, 6.5, 8])  # pass a list of numbers directly to np.array()
Of course, you might want to keep the list separate, so you could write it like this:
In [14]: python_list = [2, 4, 6.5, 8]
    ...: data = np.array(python_list)
In [15]: data
Out[15]: array([2. , 4. , 6.5, 8. ])
Interestingly, we passed in a mix of float and int values, but the resulting ndarray converted them all to float. This is because an ndarray is a general-purpose homogeneous multidimensional container: all elements must be of the same type, and NumPy converts them based on the actual input. “That is, if no data type is specified at creation time, NumPy chooses the smallest data type that can hold all the elements in the array.”
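The type conversion described above can be observed directly through the dtype attribute. This is a small sketch; the exact integer type in the first case depends on the platform and NumPy version, so the comment does not name one.

```python
import numpy as np

ints = np.array([2, 4, 6, 8])                      # integers only
mixed = np.array([2, 4, 6.5, 8])                   # one float forces a float array
forced = np.array([2, 4, 6, 8], dtype=np.float32)  # dtype given explicitly

print(ints.dtype)    # an integer type (platform dependent)
print(mixed.dtype)   # float64
print(forced.dtype)  # float32
```

Passing dtype explicitly is the way to override NumPy's automatic choice.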
2. Two-dimensional arrays
Above we created a one-dimensional array. Next we create a two-dimensional array with rows and columns, as follows:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
# Nested tuples work too: data = np.array([(1, 2), (3, 4)])
Of course, the code can also be laid out like this, which is a little clearer:
import numpy as np
data = np.array(
    [
        [1, 2, 3],
        [4, 5, 6]
    ]
)
P.S. Both nested lists and nested tuples work; try printing the output yourself.
Let me create another two-dimensional array and add a picture to help you understand.
# Create a two-dimensional array
arr2d = np.arange(9, dtype=np.float32).reshape(3, 3)
Clearly, a two-dimensional array has two index dimensions. Mapped onto a plane, the two axes of a two-dimensional array are axis 0 and axis 1. By default NumPy indexes axis 0 first and then axis 1. (An array can be thought of as a nested list: the outermost index is defined as axis 0, and axis numbers increase as you go inward. This rule applies to higher-dimensional arrays as well.)
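One way to get a feel for axis 0 and axis 1 is to sum along each of them. The sketch below uses sum purely as an illustration of axes; summing itself is not discussed in the text.

```python
import numpy as np

arr2d = np.arange(9, dtype=np.float32).reshape(3, 3)
# [[0. 1. 2.]
#  [3. 4. 5.]
#  [6. 7. 8.]]

# Summing along axis 0 collapses the rows: one total per column
col_sums = arr2d.sum(axis=0)  # 0+3+6, 1+4+7, 2+5+8

# Summing along axis 1 collapses the columns: one total per row
row_sums = arr2d.sum(axis=1)  # 0+1+2, 3+4+5, 6+7+8

print(col_sums)
print(row_sums)
```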
If you were paying attention, you will have noticed that I used dtype, so let me add a few words:
# Check the data type of the array
data = np.array([1, 2, 3, 4, 5])
np.issubdtype(data.dtype, np.integer)
Out: True
# Array types can also be referred to by character codes, mainly for backward compatibility with older packages such as Numeric. Some documentation may still use them, for example:
data = np.array([1, 2, 3], dtype='f')  # equivalent to array([1., 2., 3.], dtype=float32); we recommend using dtype objects instead
data
Out: array([1., 2., 3.], dtype=float32)
data.dtype
Out[13]: dtype('float32')
3. Three-dimensional arrays
A three-dimensional array has one more dimension than a two-dimensional array. Three-dimensional arrays are quite common in image processing. For an RGB image with three primary colors, an array of size M×N×3 is used to represent the image, where M is the vertical size of the image, N is the horizontal size, and 3 stands for the three primary colors. So how do we create three-dimensional arrays in Python?
import numpy as np
data = np.array(
    [
        [[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [10, 11, 12]]
    ]
)
print(data.ndim)
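To connect this with the M×N×3 image layout mentioned above, here is a minimal sketch that builds a tiny all-red "image". The 4×6 size and the uint8 pixel type are assumptions made for illustration.

```python
import numpy as np

height, width = 4, 6  # hypothetical image size: M = 4, N = 6

# One 8-bit value per primary color (R, G, B) for every pixel
image = np.zeros((height, width, 3), dtype=np.uint8)
image[:, :, 0] = 255  # set the first channel (red) to its maximum everywhere

print(image.shape)    # (4, 6, 3)
print(image.ndim)     # 3
print(image[0, 0])    # the top-left pixel: red=255, green=0, blue=0
```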
4. High-dimensional arrays
In data analysis, and even in AI and big data, the raw input layer is usually two-dimensional: each sample is one-dimensional (a sample is defined by N indicators), so the sample set is two-dimensional. In image recognition, the raw input layer is usually three-dimensional, because images are usually three-dimensional arrays. So if you are comfortable with indexing and slicing 2D and 3D arrays, you should be able to handle most real scenarios. There is no need to explore higher-dimensional arrays in depth.
- Four-dimensional array
import numpy as np
data = np.array(
    [
        [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
        [[[13, 14, 15], [16, 17, 18]], [[19, 20, 21], [22, 23, 24]]]
    ]
)
print(data.ndim)
- Five-dimensional array
import numpy as np
data = np.array(
    [
        [[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]], [[[13, 14, 15], [16, 17, 18]], [[19, 20, 21], [22, 23, 24]]]],
        [[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]], [[[13, 14, 15], [16, 17, 18]], [[19, 20, 21], [22, 23, 24]]]]
    ]
)
print(data.ndim)
- Six-dimensional array
import numpy as np
data = np.array(
    [
        [[[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]], [[[13, 14, 15], [16, 17, 18]], [[19, 20, 21], [22, 23, 24]]]], [[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]], [[[13, 14, 15], [16, 17, 18]], [[19, 20, 21], [22, 23, 24]]]]],
        [[[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]], [[[13, 14, 15], [16, 17, 18]], [[19, 20, 21], [22, 23, 24]]]], [[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]], [[[13, 14, 15], [16, 17, 18]], [[19, 20, 21], [22, 23, 24]]]]]
    ]
)
print(data.ndim)
Arrays of other dimensions work the same way; the fact that they are not written out here does not mean they do not exist. Try them yourself!
2. arange creates an arithmetic-sequence array
If you wanted to create an arithmetic list of the 100 numbers from 1 to 100, you could do it quickly with range(1, 101, 1). NumPy provides a similar function, arange(), whose usage is very similar to range().
In [1]: import numpy as np
   ...: # Specify start, stop, and step. Like range, arange is closed on the left and open on the right.
   ...: arr_uniform0 = np.arange(1, 10, 1)
   ...: # It is also possible to pass only one argument, in which case the defaults are start=0 and step=1.
   ...: arr_uniform1 = np.arange(10)
In [2]: arr_uniform0
Out[2]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [3]: arr_uniform1
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
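The 1-to-100 example mentioned above can be written with arange and checked against range. This is a small sketch; one_to_hundred and evens are illustrative names.

```python
import numpy as np

# 1 to 100 inclusive: the stop value 101 itself is excluded
one_to_hundred = np.arange(1, 101)
print(one_to_hundred[0], one_to_hundred[-1], len(one_to_hundred))  # 1 100 100

# Same values as Python's built-in range
print(one_to_hundred.tolist() == list(range(1, 101)))  # True

# A step other than 1
evens = np.arange(0, 10, 2)
print(evens)  # [0 2 4 6 8]
```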
3. ndmin specifies the minimum number of dimensions of the created array
# Note: ndmin specifies the minimum number of dimensions of the resulting array, somewhat like casting the data to that dimensionality
# Example 1:
import numpy as np
data = np.array([1, 2, 3], ndmin=2)
print(data)
print(data.ndim)
# output result:
[[1 2 3]]
2
# Example 2:
import numpy as np
data = np.array([1, 2, 3], ndmin=3)
print(data)
print(data.ndim)
# output result:
[[[1 2 3]]]
3
4. Determine the number of dimensions of an ndarray: ndim
ndim = property(lambda self: object(), lambda self, v: None, lambda self: None)  # default
"""Number of array dimensions.
Examples
--------
>>> x = np.array([1, 2, 3])
>>> x.ndim
1
>>> y = np.zeros((2, 3, 4))  # np.zeros((group, row, column)) is 3D; np.zeros((row, column)) is 2D
>>> y.ndim
3
"""
In fact, the code above has already used ndim. It lets us check the number of dimensions of an array, which is sometimes hard to tell by eye; moreover, your array may come from some other source, in which case using ndim is the smarter choice.
import numpy as np
data = np.array(
    [
        [1, 2, 3],
        [4, 5, 6]
    ]
)
print(data.ndim)
5. Understand the shape of an ndarray
Besides knowing how many dimensions an array has, we also want to know its length along each dimension. For example, for a two-dimensional array we want to know how many rows and columns it has (the length of each dimension).
import numpy as np
data = np.array(
    [
        [1, 2, 3],
        [4, 5, 6]
    ]
)
print(data.shape)
Running results:
(2, 3)  # 2 rows, 3 columns
Of course, shape can be used not only to check the dimensions of an array, but also to transform them:
import numpy as np
data = np.array(
    [
        [1, 2, 3],
        [4, 5, 6]
    ]
)
print(f"dimensions of the original array: {data.ndim}")
data.shape = (6,)
print(data)
print(f"dimensions of the transformed array: {data.ndim}")
6. Create an array of all zeros
For example, if I want to initialize an array, I can start by creating an array of all zeros.
import numpy as np
data = np.zeros(10)
print(data)
Running results:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
The 0. above is actually 0.0.
Supplementary content:
import numpy as np
data = np.zeros((2, 3, 4))
# np.zeros((group, row, column)) is 3D
# np.zeros((row, column)) is 2D
print(data)
print(data.ndim)
Running results:
[[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]]
3
7. Create a two-dimensional array full of ones
After reading the subtitle, students might ask: What about creating a one-dimensional array of all ones?
import numpy as np
data = np.ones(10)
print(data)
Running results:
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Next, let’s create a two-dimensional array of all ones:
import numpy as np
data = np.ones((3, 10))
# np.ones((group, row, column)) is 3D
# np.ones((row, column)) is 2D
print(data)
print(data.ndim)
Running results:
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
2
Create a 3D array of all ones:
import numpy as np
data = np.ones((2, 4, 6))
# np.ones((group, row, column)) is 3D
# np.ones((row, column)) is 2D
print(data)
print(data.ndim)
Note: the shape of a multidimensional array must be passed as a tuple.
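The note about tuples can be checked with a short sketch: passing the lengths as separate arguments fails, because the second positional argument of np.ones is the data type, not a second dimension. The names ok, raised and error_message are illustrative.

```python
import numpy as np

ok = np.ones((2, 3))  # shape given as a tuple: 2 rows, 3 columns
print(ok.shape)       # (2, 3)

raised = False
try:
    np.ones(2, 3)     # 3 is taken as the dtype argument, which is invalid
except TypeError as exc:
    raised = True
    error_message = str(exc)

print(raised)         # True
```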
8. Index and slice arrays
1. Get a number in an array (in the jargon: index)
import numpy as np
data = np.arange(10)
print(data)
print(data[5])
# output
[0 1 2 3 4 5 6 7 8 9]
5
2. Get a number in a two-dimensional array (in the jargon: index)
import numpy as np
arr2d = np.arange(9, dtype=np.float32).reshape(3, 3)
print(arr2d)
# output
[[0. 1. 2.]
 [3. 4. 5.]
 [6. 7. 8.]]
If only axis 0 is indexed:
# Whatever is inside the square brackets can be read as an operation on an axis. Here the brackets contain a single integer, so the operation is on the outermost axis, axis 0
print(arr2d[1])
# Please refer to the left part of the figure below
[3. 4. 5.]
Index both axis 0 and axis 1:
# Two integers in the square brackets represent operations on axis 0 and axis 1 in sequence (axis 0 comes first); this takes the element whose index is 1 on both axis 0 and axis 1
print(arr2d[1, 1])
# Please refer to the right part of the figure below
4.0
Other forms of indexing:
# rows are numbered from 0
print(arr2d[0][2])  # 2.0
print(arr2d[0, 1])  # 1.0
print(arr2d[1, 0])  # 3.0
3. Get a number in a THREE-DIMENSIONAL array (in the jargon: index)
import numpy as np
data = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
# rows are numbered from 0
# change to a 3D array
data.shape = (2, 2, 2)
print(data)
print(data.ndim)
print(data[0][0][1])  # 2
print(data[0, 0, 1])  # 2
Here the array is converted to three dimensions by assigning to shape, rather than by creating a three-dimensional array directly or by calling reshape.
The main difference between shape and reshape: assigning to shape changes the array itself, while reshape returns a new array.
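The difference between assigning to shape and calling reshape can be seen in a short sketch (an illustration written for this point, not code from the original article):

```python
import numpy as np

data = np.arange(8)

# reshape returns a new array object; data itself keeps its shape
reshaped = data.reshape(2, 2, 2)
print(data.shape)      # (8,)
print(reshaped.shape)  # (2, 2, 2)

# Assigning to shape changes the array in place
data.shape = (2, 4)
print(data.shape)      # (2, 4)
```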
4. Get some numbers from a one-dimensional array (jargon: slice)
Extend indexing slightly by selecting several consecutive indexes at once, and you have slicing.
import numpy as np
data = np.arange(10)
print(data) # [0 1 2 3 4 5 6 7 8 9]
print(data[3:6]) # [3, 4, 5]
If the starting position of the slice is 0, it can be omitted
For example: data[:6] and data[0:6] get the same result
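A quick check of the omitted start position, along with two related slice forms (an omitted stop and a step) that follow the same rules as Python list slicing:

```python
import numpy as np

data = np.arange(10)

# Omitting the start is the same as starting at 0
same = np.array_equal(data[:6], data[0:6])
print(same)       # True

tail = data[6:]   # omitting the stop runs to the end
print(tail)       # [6 7 8 9]

every_other = data[::2]  # a third number sets the step
print(every_other)       # [0 2 4 6 8]
```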
5. Get some numbers from a two-dimensional array (jargon: slice)
import numpy as np
arr2d = np.arange(9, dtype=np.float32).reshape(3, 3)
print(arr2d)
print(arr2d[0][0:3])
print(arr2d[0, 0:3])
# output
[[0. 1. 2.]
 [3. 4. 5.]
 [6. 7. 8.]]
[0. 1. 2.]
[0. 1. 2.]
What I showed you above is the simplest kind of slice; now for something a little more advanced. Savor the following operations. (IPython is used below for easy reading.)
In [16]: arr2d = np.arange(9, dtype=np.float32).reshape(3, 3)
In [17]: arr2d
Out[17]:
array([[0., 1., 2.],
       [3., 4., 5.],
       [6., 7., 8.]], dtype=float32)
In [18]: arr2d[0]
Out[18]: array([0., 1., 2.], dtype=float32)
In [19]: arr2d[0:2, 1:3]
Out[19]:
array([[1., 2.],
       [4., 5.]], dtype=float32)
In [20]: arr2d[0:2][1:3]
Out[20]: array([[3., 4., 5.]], dtype=float32)
In [21]: arr2d = np.arange(12, dtype=np.float32).reshape(4, 3)
In [22]: arr2d
Out[22]:
array([[ 0.,  1.,  2.],
       [ 3.,  4.,  5.],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.]], dtype=float32)
In [23]: arr2d[0:2, 1:3]
Out[23]:
array([[1., 2.],
       [4., 5.]], dtype=float32)
In [24]: arr2d[0:3, 1:2]
Out[24]:
array([[1.],
       [4.],
       [7.]], dtype=float32)
In [25]: arr2d[0:3][1:2]
Out[25]: array([[3., 4., 5.]], dtype=float32)
In [26]: arr2d[0:3]
Out[26]:
array([[0., 1., 2.],
       [3., 4., 5.],
       [6., 7., 8.]], dtype=float32)
In [27]: arr2d[1:2]
Out[27]: array([[3., 4., 5.]], dtype=float32)
6. Get some numbers from a THREE-DIMENSIONAL array (jargon: slice)
import numpy as np
data = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(data.ndim)  # 2
Three_dimensional_array = data.reshape(2, 2, 2)
print(Three_dimensional_array.ndim)  # 3
# indexing order: array[group][row][column]
print(Three_dimensional_array[0][0][0:3])  # [1 2]
print(Three_dimensional_array[0][0][:])  # [1 2]
print(Three_dimensional_array[0][0])  # [1 2]
Attention! A slice refers to the original data, so any change made to the slice is reflected in the original data.
import numpy as np
data = np.arange(10)
data_slice = data[3:6]
data_slice[2] = 100
print(f"data_slice:{data_slice}")
print(f"data:{data}")
# output
data_slice:[ 3 4 100]
data:[ 0 1 2 3 4 100 6 7 8 9]
Want a copy that does not affect the original data? Use data[3:6].copy().
import numpy as np
data = np.arange(10)
data_slice = data[3:6].copy()
data_slice[2] = 100
print(f"data_slice:{data_slice}")
print(f"data:{data}")
# output
data_slice:[ 3 4 100]
data:[0 1 2 3 4 5 6 7 8 9]
Boolean index
Boolean indexing is where arrays start to meet real-world scenarios. A Boolean index works much like a filter in Excel: the Boolean result of a condition determines which data we select.
Let’s start with an example:
cities = np.array(["hz", "sh", "hz", "bj", "wh", "sh", "sz"])
arr_rnd = np.random.randn(7, 4)  # create a 7x4 array drawn from the standard normal distribution
arr_rnd
Out:
array([[ 0.52214772,  0.70276312, -2.2606387 ,  0.44816176],
       [ 1.8575996 , -0.07908252, -0.60976332, -1.24109283],
       [ 0.79739726,  0.86862637,  0.91748762,  1.58236216],
       [-2.01706647,  1.02411895, -0.27238117,  0.11644394],
       [-0.5413323 ,  0.41044278, -0.54505957, -0.27226035],
       [ 0.85592045,  1.14458831,  0.36227036, -0.22211316],
       [ 2.40476032,  1.22042702, -1.07018219,  0.95419508]])
# Create a Boolean array through an array comparison
cities == "hz"
Out: array([ True, False,  True, False, False, False, False])
# Use the Boolean array as an array index and observe the pattern
# We can infer that the length of the Boolean array must equal the length of the axis being indexed
arr_rnd[cities == "hz"]
Out:
array([[ 0.52214772,  0.70276312, -2.2606387 ,  0.44816176],
       [ 0.79739726,  0.86862637,  0.91748762,  1.58236216]])
Note that Boolean indexes can be combined with slices:
# Use a Boolean array and a slice to index the two dimensions
arr_rnd[cities == "hz", :3]
Out:
array([[ 0.52214772,  0.70276312, -2.2606387 ],
       [ 0.79739726,  0.86862637,  0.91748762]])
Of course, when used properly, Boolean indexes can work wonders. For example, take the arr_rnd array generated in the previous step, which follows the standard normal distribution. Suppose I want to pick out all the negative numbers in arr_rnd and set them to 0. It is actually quite simple:
arr_rnd[arr_rnd < 0] = 0
arr_rnd
Out:
array([[0.52214772, 0.70276312, 0.        , 0.44816176],
       [1.8575996 , 0.        , 0.        , 0.        ],
       [0.79739726, 0.86862637, 0.91748762, 1.58236216],
       [0.        , 1.02411895, 0.        , 0.11644394],
       [0.        , 0.41044278, 0.        , 0.        ],
       [0.85592045, 1.14458831, 0.36227036, 0.        ],
       [2.40476032, 1.22042702, 0.        , 0.95419508]])
The Boolean operators for Boolean arrays may look a little tricky at first, but they make it easy to implement the and, or, and not operations between Boolean arrays:
cities == "hz"
Out: array([ True, False,  True, False, False, False, False])
cities == "sz"
Out: array([False, False, False, False, False, False,  True])
# NOT operation: ~
~(cities == "hz")
Out: array([False,  True, False,  True,  True,  True,  True])
# AND operation: &
(cities == "hz") & (cities == "sz")
Out: array([False, False, False, False, False, False, False])
# OR operation: |
(cities == "hz") | (cities == "sz")
Out: array([ True, False,  True, False, False, False,  True])
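Combined conditions like these are typically used to select rows. The sketch below reuses the cities array from above but swaps the random data for a small deterministic array (values, a made-up stand-in) so that the result is reproducible.

```python
import numpy as np

cities = np.array(["hz", "sh", "hz", "bj", "wh", "sh", "sz"])
values = np.arange(14).reshape(7, 2)  # hypothetical data: one row per city

# Rows whose city is "hz" OR "sz" (parentheses are required around each comparison)
mask = (cities == "hz") | (cities == "sz")
selected = values[mask]
print(selected)      # rows 0, 2 and 6

# NOT: every other row
others = values[~mask]
print(others.shape)  # (4, 2)
```

The parentheses matter because & and | bind more tightly than == in Python.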
8. Fancy indexes
In summary, we’ve covered how to create arrays, some common Numpy functions, array operations, and how to index single integers, slicing, Boolean lists, and combinations of them. In fact, they are already powerful enough to handle most scenarios. Let’s take an example. I have a 4×6 two-dimensional array and I want to do an interesting slice. I want to take the four corners of the two-dimensional array and make a 2×2 array.
It is possible to do this with what we have covered so far, but it is a little cumbersome, because all we have learned is contiguous slicing and indexing with single integers. Friends reading this far might pause and think: can you find a way to solve this problem?
Here we can offer at least two ideas:
# Create a 4×6 array arr_demo01
arr_demo01 = np.arange(24).reshape(4, 6)
arr_demo01
Out:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])

# Method 1: index the element at each corner, then assemble a new 2×2 array
arr_method1 = np.array([[arr_demo01[0, 0], arr_demo01[0, -1]],
                        [arr_demo01[-1, 0], arr_demo01[-1, -1]]])
arr_method1
Out:
array([[ 0,  5],
       [18, 23]])

# Method 2: a Boolean index can select non-contiguous rows. Index axis 0 and axis 1 separately; note that this takes two indexing steps
arr_method2 = arr_demo01[[True, False, False, True]][:, [True, False, False, False, False, True]]
arr_method2
Out:
array([[ 0,  5],
       [18, 23]])

The first method is easier to understand. The second method performs two separate indexing steps: the first step applies a Boolean index to axis 0; the second step combines a slice with a Boolean index to index axis 1 of the result produced by the first step.
Is there a simpler way to do this? This is where the fancy index comes in.
A fancy index is simply indexing with an integer array. Based on the arr_demo01 we generated above, let’s look at two simple examples.
# Pass in one integer array to index axis 0; the rows come back in the order given by the integer array:
arr_demo01[[2, 0]]
Out:
array([[12, 13, 14, 15, 16, 17],
       [ 0,  1,  2,  3,  4,  5]])

If we pass in two integer arrays at the same time, the result may be somewhat different from what we expected.
# When two integer arrays are passed, separated by a comma, they index elements in pairs rather than selecting a rectangular region!
arr_demo01[[0, -1], [0, -1]]
Out:
array([ 0, 23])

The actual index here is going to be (0, 0), (-1, -1), not a rectangle. So how to achieve the effect of the demo above? Here we introduce several methods for comparison and learning.
Method 3: pass in the coordinates of all 4 corners. The idea is very similar to method 1, but the code is more concise:
# Method 3: pass in the coordinates of the 4 corners; note the pattern in the 2 integer arrays passed in
arr_demo01[[0, 0, -1, -1], [0, -1, 0, -1]]
Out: array([ 0,  5, 18, 23])
arr_demo01[[0, 0, -1, -1], [0, -1, 0, -1]].reshape(2, 2)
Out:
array([[ 0,  5],
       [18, 23]])

Notice that indexing this way returns only a one-dimensional array of elements, so we additionally need the reshape method to change the shape of the data.
# Method 4: mix a fancy index with a slice. The idea is very similar to method 2: two consecutive indexing steps yield a rectangular region
arr_demo01[[0, -1]][:, [0, -1]]
Out:
array([[ 0,  5],
       [18, 23]])

Finally, we introduce an indexer: the np.ix_ function converts the two one-dimensional integer arrays passed in into an indexer that selects a rectangular region.
# Method 5: build a rectangular-region indexer with the np.ix_ function:
arr_demo01[np.ix_([0, -1], [0, -1])]
Out:
array([[ 0,  5],
       [18, 23]])

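To see why np.ix_ selects a rectangle while two plain integer lists select pairs, it may help to look at what np.ix_ returns. A minimal sketch, rebuilding the same 4×6 array under the local name `arr`:

```python
import numpy as np

arr = np.arange(24).reshape(4, 6)

# np.ix_ returns one array per axis, shaped so that they broadcast
# against each other into a full grid of (row, column) combinations.
rows, cols = np.ix_([0, -1], [0, -1])
print(rows.shape, cols.shape)

corners = arr[rows, cols]          # rectangular selection
paired = arr[[0, -1], [0, -1]]     # pairwise selection: (0, 0) and (-1, -1)

print(corners)
print(paired)
```

The row indexer has shape (2, 1) and the column indexer has shape (1, 2), so together they broadcast to a 2×2 result.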
Using indexing and slicing, we can change the structure of an array and extract its elements and sets of elements at will.
Overall, indexes fall into four types: single-integer indexes, Boolean indexes, slice indexes (as with Python lists), and integer-array indexes. A little more complicated are the composite indexes, such as combining one of the other index types with a slice. I believe that friends who have read the above carefully will be able to master the indexing of arrays. In daily study, I suggest treating this article as a desk reference and focusing on understanding the logic. When you run into a problem, look it up rather than memorizing by rote.
Indexing and slicing are a basic part of this course. When we get to Pandas, we will explore slicing and indexing further.
9. Transform array dimensions: reshape()
Now let’s write some code to get a feel for it:
import numpy as np
data = np.arange(10)
print(data)
print(data.reshape((2, 5)))
# output
[0 1 2 3 4 5 6 7 8 9]
[[0 1 2 3 4]
 [5 6 7 8 9]]

The code above calls reshape((2, 5)); you may also have seen other people’s code call reshape(2, 5). Both are ordinary uses of reshape, and below I show reshape more fully with code and explanation.
General usage: numpy.arange(n).reshape(a, b) generates n natural numbers in order and arranges them as an array with a rows and b columns.
In [1]: np.arange(16).reshape(2, 8)  # generate 16 natural numbers in 2 rows and 8 columns
Out[1]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15]])

Special usage: mat (or array).reshape(c, -1). The object must be a matrix or an array. reshape(c, -1) reorganizes the matrix or array into c rows and d columns, where d is computed automatically as the total number of elements divided by c.
In [1]: import numpy as np
In [2]: arr = np.arange(16).reshape(2, 8)
In [3]: arr
Out[3]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15]])
In [4]: arr.reshape(4, -1)  # c=4, d=16/4=4
Out[4]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
In [5]: arr.reshape(8, -1)  # c=8, d=16/8=2
Out[5]:
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15]])
In [6]: arr.reshape(10, -1)  # c=10, d=16/10=1.6 is not an integer
Out[6]:
ValueError: cannot reshape array of size 16 into shape (10,newaxis)

10. Transform array dimensions: resize()
reshape returns a new array object with the transformed dimensions, while resize transforms the array in place. As the name suggests (re plus size), resize can also change the length of the array.
Shrinking an array: the array is flattened (in the order in which the data is stored in memory), resized, and reshaped:
import numpy as np
a = np.array([[0, 1], [2, 3]], order='C')
# C: row-major order
a.resize((2, 1))
print(a)
a = np.array([[0, 1], [2, 3]], order='C')
a.resize(2, 2)
print(a)
# output
[[0]
 [1]]
[[0 1]
 [2 3]]

import numpy as np
a = np.array([[0, 1], [2, 3]], order='F')
# F: column-major order
a.resize((2, 1))
print(a)
a = np.array([[0, 1], [2, 3]], order='F')
a.resize((2, 2))
print(a)
# output
[[0]
 [2]]
[[0 1]
 [2 3]]

Enlarge the array: as above, but missing items are filled with zeros: “If you change the length of the array by more than the original number, zeros are automatically filled.”
import numpy as np
a = np.array([[0, 1], [2, 3]])  # default order='C'
print(a)
a.resize((2, 3))
print(a)
# output
[[0 1]
 [2 3]]
[[0 1 2]
 [3 0 0]]

import numpy as np
a = np.array([[0, 1], [2, 3]], order="F")
print(a)
a.resize((2, 3))
print(a)
# output
[[0 1]
 [2 3]]
[[0 1 0]
 [2 3 0]]

Although the above code has already been demonstrated, the following code is used as a supplementary and simple demonstration:
import numpy as np
data = np.arange(10)
print(data)
data.resize((2, 5))  # resize works in place and returns None
print(data)

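The difference between reshape and resize described above can be sketched side by side. This is only an illustration; the variable names are arbitrary, and the resize call keeps the element count unchanged so that the reference check discussed next does not come into play:

```python
import numpy as np

a = np.arange(10)
b = a.reshape((2, 5))        # returns a new array object; a keeps its shape
print(a.shape, b.shape)

c = np.arange(10)
result = c.resize((2, 5))    # changes c itself; the method returns None
print(result, c.shape)
```

So `print(data.resize(...))` would print None, while `print(data.reshape(...))` prints the reshaped array.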
What if we want to keep the array unchanged? We can do this by referring to an array: “Referencing an array prevents resizing…”
import numpy as np
a = np.array([[0, 1], [2, 3]])
c = a
a.resize((2, 3))
print(a)
# output
Traceback (most recent call last):
  File "/Users/apple/PycharmProjects/Coder/project.py", line 5, in <module>
    a.resize((2, 3))
ValueError: cannot resize an array that references or is referenced
by another array in this way.
Use the np.resize function or refcheck=False

The ValueError says that an array that references, or is referenced by, another array cannot be resized, unless refcheck is False:
So we can change the code as follows:
import numpy as np
a = np.array([[0, 1], [2, 3]])
c = a
a.resize((2, 3), refcheck=False)
print(a)
# output
[[0 1 2]
 [3 0 0]]

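The error message also mentions the np.resize function. As a hedged sketch of that alternative: np.resize returns a new array and leaves the original alone, and, unlike the in-place method, it fills extra positions by repeating the data rather than with zeros.

```python
import numpy as np

a = np.array([[0, 1], [2, 3]])
c = a                             # a second reference to the same array

bigger = np.resize(a, (2, 3))     # new array; a and c are untouched
print(bigger)                     # filled by repeating 0, 1, 2, 3, 0, 1
print(a.shape, c is a)
```

Because nothing is modified in place, no reference check is needed here.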
11. Matrix transpose
If you’ve learned linear algebra, you know what a matrix is. For example, if we have a matrix with two rows and five columns, we use .T to turn the two rows and five columns into five rows and two columns.
But let me give a little bit of the math behind the transpose:
Take the $m \times n$ matrix $A = (a_{ij})$.
The matrix obtained by interchanging the rows and columns of $A$ is called the transpose of $A$, written $A^T$, i.e., $(A^T)_{ij} = a_{ji}$.
By definition, if $A$ is an $m \times n$ matrix, then $A^T$ is an $n \times m$ matrix.
If a square matrix $A$ of order $n$ is equal to its transpose, namely $A^T = A$, the matrix $A$ is called a symmetric matrix.
If $A^T = -A$, the matrix $A$ is called an antisymmetric matrix.
The operation code is as follows:
import numpy as np
a = np.arange(10)
print(a)
print(a.reshape(2, 5))
print(a.reshape(2, 5).T)
# output
[0 1 2 3 4 5 6 7 8 9]
[[0 1 2 3 4]
 [5 6 7 8 9]]
[[0 5]
 [1 6]
 [2 7]
 [3 8]
 [4 9]]

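The symmetric and antisymmetric definitions can also be checked with .T. A small sketch; the matrices `S` and `K` are made-up examples, not taken from the article:

```python
import numpy as np

# Hypothetical example matrices.
S = np.array([[1, 2, 3],
              [2, 5, 6],
              [3, 6, 9]])
K = np.array([[ 0,  2, -1],
              [-2,  0,  4],
              [ 1, -4,  0]])

is_symmetric = np.array_equal(S.T, S)        # S equals its transpose
is_antisymmetric = np.array_equal(K.T, -K)   # K's transpose equals -K
transposed_shape = np.arange(10).reshape(2, 5).T.shape

print(is_symmetric, is_antisymmetric, transposed_shape)
```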
12. The empty function creates an array that allocates memory but does not fill it with any values
In [33]: # The empty function returns uninitialized garbage values.
    ...: np.empty((2, 3), dtype=np.int8)
Out[33]:
array([[0, 0, 0],
       [0, 0, 0]], dtype=int8)

13. The identity function creates an identity matrix of size n×n (diagonal 1, rest 0).
# The identity function prototype is as follows:
np.identity(n, dtype=<type 'float'>)

In [42]: # Create an identity matrix of size 3×3.
    ...: np.identity(3, dtype=np.int8)
Out[42]:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]], dtype=int8)

14. The eye function, an upgraded version of identity
The prototype of the eye function is as follows:
np.eye(N, M=None, k=0, dtype=<type 'float'>)

If only N is specified, an N×N square matrix is returned, which is the same as what identity does. If both N and M are specified, a rectangular matrix of size N×M is returned. k is an offset that controls how far the diagonal of ones is shifted. Here’s an example:
In [44]: # Create a 3×3 square matrix.
    ...: np.eye(N=3, dtype=np.int8)
Out[44]:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]], dtype=int8)
In [45]: # Create a 3×4 rectangular matrix.
    ...: np.eye(N=3, M=4, dtype=np.int8)
Out[45]:
array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0]], dtype=int8)
In [46]: # Create a 3×4 rectangular matrix with the diagonal of ones offset to the right by 1.
    ...: np.eye(N=3, M=4, k=1, dtype=np.int8)
Out[46]:
array([[0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]], dtype=int8)
In [47]: # Create a 3×4 rectangular matrix with the diagonal of ones offset to the right by 2.
    ...: np.eye(N=3, M=4, k=2, dtype=np.int8)
Out[47]:
array([[0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 0]], dtype=int8)

Note that k can be negative. For example, if k=-2, the diagonal of 1 is shifted to the left by 2 units.
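A quick sketch of the negative case, using k=-1 on a 3×4 matrix, together with a check that eye and identity agree for square matrices:

```python
import numpy as np

# k=-1 moves the diagonal of ones one position to the left (equivalently, down).
lower = np.eye(3, 4, k=-1, dtype=np.int8)
print(lower)

same = np.array_equal(np.eye(3), np.identity(3))
print(same)
```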
5.3 Array arithmetic, “lazy” essential
Lao Zi’s Tao Te Ching says: the Tao gives birth to one, one gives birth to two, two gives birth to three, and three gives birth to everything. This describes how the “Tao” creates all things, from few to many and from simple to complex.
Carried over to our own learning, creating arrays is like the stage of “the Tao gives birth to one, one gives birth to two”, while array operations allow endless combinations, much like “two gives birth to three, three gives birth to everything”. Indeed, array arithmetic greatly expands our ability to use arrays to solve practical problems and to master the universe of arrays.
Array operations may seem mysterious, but they are actually very simple. You do not need to learn complex mathematics, and in daily use they are not much different from ordinary arithmetic.
1. Operations between arrays and scalars
Usually we call individual numbers scalars, arrays can be evaluated directly against scalars, and the calculation logic is automatically propagated to all the elements of the array. Let’s take a few simple examples:
# Array and scalar addition
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr + 5
Out:
array([[ 6,  7,  8],
       [ 9, 10, 11]])
# Array and scalar multiplication
arr * 2
Out:
array([[ 2,  4,  6],
       [ 8, 10, 12]])
# Array and scalar division: find the reciprocal of each element in the array
1 / arr
Out:
array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])
# Array and scalar power: square each element of the array
arr ** 2
Out:
array([[ 1,  4,  9],
       [16, 25, 36]], dtype=int32)
# Array and scalar power: take the arithmetic square root of each element
arr ** 0.5
Out:
array([[1.        , 1.41421356, 1.73205081],
       [2.        , 2.23606798, 2.44948974]])

2. General function operations on arrays
ufunc stands for universal function. Does that sound powerful? True to its name, it acts on every element of an array. Many of the ufuncs in NumPy are very fast because they are implemented in C. In other words, a ufunc works at the level of individual elements: it performs element-level function operations.
- arithmetic
The simplest universal functions are the four arithmetic operations between arrays. When performing them, however, we need to make sure that the two operands have the same dimensions (or shapes that can be broadcast against each other).
# Array subtraction:
arr - arr
Out:
array([[0, 0, 0],
       [0, 0, 0]])
# Array multiplication:
arr * arr
Out:
array([[ 1,  4,  9],
       [16, 25, 36]])

It is important to note that the multiplication here is the multiplication of the elements representing the corresponding positions in the array, not the multiplication of matrices in advanced mathematics. Of course, the rules for adding and dividing arrays are similar, so I’m not going to give an example here.
In fact, Numpy also encapsulates functions for four operations. Here we use generic multiplication of arrays as an example:
# Another way of writing array multiplication; the result is the same as with the * operator:
np.multiply(arr, arr)
Out:
array([[ 1,  4,  9],
       [16, 25, 36]])

In general, Numpy wraps some functions around common array operations.
function | description |
---|---|
add | Computes the sum of two arrays |
subtract | Subtracts the second array from the first array |
multiply | Computes the element-wise product of two arrays (not matrix multiplication) |
divide | Divides the elements of the first array by the elements of the second array |
power | For element A of the first array and element B of the second array, computes A^B |
fmax | Computes the larger of the two elements in each position |
fmin | Computes the smaller of the two elements in each position |
- add: computes the sum of two arrays
data = np.add(1.0, 4.0)
data
Out: 5.0
data1 = np.array([1, 3, 5, 7, 9])
data2 = np.array([2, 4, 6, 8, 10])
np.add(data1, data2)
Out: array([ 3,  7, 11, 15, 19])
data1 = np.arange(9.0).reshape((3, 3))
data2 = np.arange(3.0)
f"data1:{data1}"
Out: 'data1:[[0. 1. 2.]\n [3. 4. 5.]\n [6. 7. 8.]]'
f"data2:{data2}"
Out: 'data2:[0. 1. 2.]'
np.add(data1, data2)
Out:
array([[ 0.,  2.,  4.],
       [ 3.,  5.,  7.],
       [ 6.,  8., 10.]])

- divide: divides the elements of the first array by the elements of the second array. np.true_divide is equivalent to np.divide.
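Since divide has no example of its own, here is a minimal sketch with made-up input values. It also shows floor division for contrast:

```python
import numpy as np

data1 = np.array([2, 6, 5])
data2 = np.array([1, 2, 4])

quotient = np.divide(data1, data2)   # same as data1 / data2; the result is float
floored = data1 // data2             # floor division keeps the integer dtype

print(quotient)   # values 2.0, 3.0, 1.25
print(floored)    # values 2, 3, 1
```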
- power: for element A of the first array and element B of the second array, computes A^B
import numpy as np
data1 = np.array([2, 6, 5])
data2 = np.array([1, 2, 3])
print(np.power(data1, data2))
print(data1 ** data2)
print(np.power(data1, 2))
# output
[  2  36 125]
[  2  36 125]
[ 4 36 25]

- fmax: computes the larger of the two elements in each position
import numpy as np
data1 = np.array([2, 6, 5])
data2 = np.array([1, 2, 3])
print(np.fmax(data1, data2))
# output
[2 6 5]
import numpy as np
data = np.array([2, 6, 5])
print(np.fmax(data, 1))
print(np.fmax(data, 9))
# output
[2 6 5]
[9 9 9]

- fmin: computes the smaller of the two elements in each position
import numpy as np
data1 = np.array([2, 6, 5])
data2 = np.array([1, 2, 3])
print(np.fmin(data1, data2))
# output
[1 2 3]
import numpy as np
data = np.array([2, 6, 5])
print(np.fmin(data, 1))
print(np.fmin(data, 9))
# output
[1 1 1]
[2 6 5]

According to the number of parameters accepted by the general function, we often divide it into unary function and binary function.
- Unary functions
Rounding an array is a typical unary function, as shown below:
# Create a normally distributed array with a mean of 5 and a standard deviation of 10
arr_rnd = np.random.normal(5, 10, (3, 4))
arr_rnd
Out:
array([[19.03116154, 13.58954268, 11.93818701,  4.85006153],
       [ 0.57122874,  4.33719914,  8.67773155, 10.15552974],
       [ 7.04757778,  6.98288594, 10.60656035, 17.95555988]])
# Round the array to the nearest integer. Note that the result keeps the dtype of the input array
np.rint(arr_rnd)
Out:
array([[19., 14., 12.,  5.],
       [ 1.,  4.,  9., 10.],
       [ 7.,  7., 11., 18.]])

Trigonometric functions, averages, powers and similar operations on an array all belong to the category of unary functions. The commonly used unary functions are listed below for reference:
function | description |
---|---|
abs, fabs | Compute absolute values; for non-complex values, the faster fabs can be used |
sqrt, square, exp | Compute the square root, the square, and the exponential e^x of each element |
log, log10, log2, log1p | Compute the natural logarithm (base e), the base-10 logarithm, the base-2 logarithm, and log(1+x) respectively |
sign | Computes the sign of each element: 1 (positive), 0 (zero), -1 (negative) |
ceil | Computes the smallest integer greater than or equal to each element |
floor | Computes the largest integer less than or equal to each element |
rint | Rounds each element to the nearest integer and keeps the dtype |
modf | Returns the fractional and integer parts of the array as two separate arrays |
isnan | Determines whether each element is NaN; returns a Boolean array |
cos, cosh, sin, sinh, tan, tanh | Ordinary and hyperbolic trigonometric functions |
- abs: computes the absolute value
data = np.array([1, -2, -4, 2])
print(data)  # [ 1 -2 -4  2]
print(np.abs(data))  # [1 2 4 2]

- sqrt: computes the square root of each element
data = np.arange(10)
print(np.sqrt(data))
# output
[0.         1.         1.41421356 1.73205081 2.         2.23606798
 2.44948974 2.64575131 2.82842712 3.        ]

- square: computes the square
data = np.array([1, -2, -4, 2])
print(data)  # [ 1 -2 -4  2]
print(np.square(data))  # [ 1  4 16  4]

- exp: computes the exponential e^x
data = np.array([1, -2, -4, 2])
print(data)  # [ 1 -2 -4  2]
print(np.exp(data))  # [2.71828183 0.13533528 0.01831564 7.3890561 ]

- sign: computes the sign of each element: 1 (positive), 0 (zero), -1 (negative)
data = np.array([1, -2, -4, 2, 0, 100, -90])
print(data)  # [  1  -2  -4   2   0 100 -90]
print(np.sign(data))  # [ 1 -1 -1  1  0  1 -1]

- ceil: computes the smallest integer greater than or equal to each element
data = np.array([-1.7, -1.5, -0.2, 0.2, 1.5, 1.7, 2.0])
print(data)  # [-1.7 -1.5 -0.2  0.2  1.5  1.7  2. ]
print(np.ceil(data))  # [-1. -1. -0.  1.  2.  2.  2.]

- floor: computes the largest integer less than or equal to each element
data = np.array([-1.7, -1.5, -0.2, 0.2, 1.5, 1.7, 2.0])
print(data)  # [-1.7 -1.5 -0.2  0.2  1.5  1.7  2. ]
print(np.floor(data))  # [-2. -2. -1.  0.  1.  1.  2.]

- isnan: checks whether each element is NaN and returns a Boolean array
data = np.array([-1.7, np.log(-1.), np.log(0), 1, 1.5, np.inf, np.nan])
data
Out: array([-1.7,  nan, -inf,  1. ,  1.5,  inf,  nan])
np.isnan(data)
Out: array([False,  True, False, False, False, False,  True])
# np.inf: positive infinity (older NumPy versions also accept the aliases Inf, infty, Infinity and PINF)
# np.log(-1.): nan
# np.log(0): -inf
# np.nan: nan

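The table above also lists modf, which the examples do not cover. A minimal sketch with made-up input values:

```python
import numpy as np

data = np.array([1.5, -2.25, 3.0])

# modf returns two arrays: the fractional parts and the integer parts.
# Both keep the sign of the input.
fractional, integral = np.modf(data)
print(fractional)
print(integral)

# rint rounds but keeps the floating-point dtype, as noted in the table.
rounded = np.rint(data)
print(rounded.dtype)
```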
- Binary functions
In addition to the four arithmetic operations, comparing the elements of two arrays is a typical binary function, as shown in the following example:
# Generate 2 arrays using a random function
x = np.random.normal(5, 10, (3,))
y = np.random.normal(5, 10, (3,))
x
Out: array([ 9.5336068 ,  8.31969942, 15.20601081])
y
Out: array([22.52827938,  3.01609475,  9.03514098])
# Compute the element-wise maximum
np.maximum(x, y)
Out: array([22.52827938,  8.31969942, 15.20601081])
# Compute the element-wise minimum
np.minimum(x, y)
Out: array([9.5336068 , 3.01609475, 9.03514098])
# Perform an element-wise comparison
np.greater(x, y)
Out: array([False,  True,  True])

We list the commonly used binary functions as follows for your reference:
function | description |
---|---|
maximum, fmax | Compute the element-wise maximum; fmax ignores NaN values |
minimum, fmin | Compute the element-wise minimum; fmin ignores NaN values |
greater, greater_equal | Perform element-wise comparisons and produce Boolean arrays; equivalent to >, ≥ |
less, less_equal | Perform element-wise comparisons and produce Boolean arrays; equivalent to <, ≤ |
equal, not_equal | Perform element-wise comparisons and produce Boolean arrays; equivalent to ==, != |
logical_and, logical_or, logical_xor | Perform element-wise logical operations; equivalent to the operators &, |, ^ |
I want to remind you that array arithmetic itself is not complicated; it is just a matter of applying formulas. However, before operating on an array, be sure to check whether it contains null values (NaN). Null values may lead to wrong results or even errors. To determine whether an array contains null values, use the isnan function.
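The difference between maximum and fmax only shows up when NaN is present. A small sketch with made-up values:

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])
b = np.array([2.0, 2.0, np.nan])

propagated = np.maximum(a, b)   # NaN wins wherever either input is NaN
ignored = np.fmax(a, b)         # NaN is ignored when the other value is a number
has_nan = np.isnan(a).any()     # quick check before doing arithmetic

print(propagated)
print(ignored)
print(has_nan)
```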
3. Linear algebra of arrays
- Matrix multiplication
Linear algebra (such as matrix multiplication, matrix factorization, determinants, and other matrix mathematics) is an important part of any data analysis library, and Numpy provides it as well. Above we learned element-wise multiplication of arrays, so how do we do matrix multiplication in the linear algebra sense? It’s actually quite convenient:
# The dimensions of the 2 input arrays must satisfy the requirements of matrix multiplication, otherwise an error is raised;
# arr.T denotes the transpose of the arr array
# np.dot performs matrix multiplication of the two input arrays
np.dot(arr, arr.T)
Out:
array([[14, 32],
       [32, 77]])

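For two-dimensional arrays, the `@` operator gives the same matrix product as np.dot. The sketch below rebuilds the 2×3 `arr` from earlier and also shows what happens when the shapes do not line up:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

product = arr @ arr.T                 # (2, 3) @ (3, 2) gives shape (2, 2)
print(product)
print(np.array_equal(product, np.dot(arr, arr.T)))

try:
    arr @ arr                         # (2, 3) @ (2, 3): inner dimensions differ
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("shape mismatch:", err)
```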
- numpy.linalg tools
numpy.linalg packages a standard set of matrix decompositions and operations such as inverses and determinants. Let’s take a quick look.
# Use the inv function to compute the inverse of a matrix (note: for a matrix to be invertible, it must first of all be square)
# Step 1: import the function
from numpy.linalg import inv
arr_lg = np.array([[0, 1, 2], [1, 0, 3], [4, -3, 8]])
arr_lg
Out:
array([[ 0,  1,  2],
       [ 1,  0,  3],
       [ 4, -3,  8]])
arr_inv = inv(arr_lg)
arr_inv
Out:
array([[-4.5,  7. , -1.5],
       [-2. ,  4. , -1. ],
       [ 1.5, -2. ,  0.5]])
# Check: a matrix multiplied by its own inverse should give the identity matrix
np.dot(arr_lg, arr_inv)
Out:
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

The solve function in numpy.linalg can solve linear systems of the form Ax = b, where A is a matrix, b is a one-dimensional array, and x is the unknown.
# Solve the following system of equations:
# x-2y+z=0
# 2y-8z=8
# -4x+5y+9z=-9
# Import the solve function
from numpy.linalg import solve
A = np.array([[1, -2, 1], [0, 2, -8], [-4, 5, 9]])
b = np.array([0, 8, -9])
X = solve(A, b)
X
Out: array([29., 16.,  3.])
# Check that AX = b
np.equal(np.dot(A, X), b)
Out: array([ True,  True,  True])

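One caveat about the check above: results of solve are floating-point numbers, so an exact comparison with np.equal can fail on other systems because of rounding. A tolerant comparison with np.allclose is a common alternative; the sketch below reuses the same system:

```python
import numpy as np

A = np.array([[1, -2, 1], [0, 2, -8], [-4, 5, 9]], dtype=float)
b = np.array([0, 8, -9], dtype=float)

x = np.linalg.solve(A, b)

# allclose compares within a small tolerance instead of bit for bit.
residual_ok = np.allclose(A @ x, b)
same_as_inverse = np.allclose(np.linalg.inv(A) @ b, x)

print(x, residual_ok, same_as_inverse)
```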
numpy.linalg also packages some other functions. They are not demonstrated one by one here; the commonly used ones are listed below.
function | description |
---|---|
diag | Returns the diagonal (or off-diagonal) elements of a square matrix as a one-dimensional array, or converts a one-dimensional array into a square matrix |
trace | Computes the sum of the diagonal elements |
det | Computes the determinant of the matrix |
eig | Computes the eigenvalues and eigenvectors of a square matrix |
inv | Computes the inverse of a square matrix |
pinv | Computes the Moore-Penrose pseudo-inverse of a matrix |
qr | Computes the QR decomposition |
svd | Computes the singular value decomposition (SVD) |
solve | Solves the linear system Ax = b |
lstsq | Computes the least-squares solution to Ax = b |
4. Aggregate function operation of array
An aggregate function is a function that operates on a set of values (such as an array) and returns a single value as the result, such as the sum of the elements of the array. Common aggregation functions are: sum, find the maximum minimum, find the average, find the standard deviation, find the median and so on.
- Common aggregate functions
List of common aggregate functions:
function | description |
---|---|
sum | Computes the sum of all elements of the array |
mean | Computes the average of all elements of the array |
std | Computes the standard deviation of all elements of the array |
min, max | Computes the minimum or maximum of all elements of the array |
argmin, argmax | Computes the index of the minimum or maximum among all elements of the array |
cumsum | Computes the cumulative sum |
median | Computes the median |
var | Computes the variance |
It is as simple as it looks. Let’s take a case that finds the maximum of arr_rnd:
arr_rnd = np.random.normal(5, 10, (3, 4))
np.max(arr_rnd)
Out: 19.324449336215558

However, in practical engineering we usually stack several samples into one array for computation. For example, arr_rnd has size 3×4, which can be regarded as three samples, each of which is a row vector of length 4. So what do I do if I want the maximum of each of the four dimensions across the samples?
Here we introduce the idea of the direction of an operation.
- The axis of operation in Numpy, in detail
To simplify things, we’ll focus on the two-dimensional array scenario, which covers most of our use cases.
[Image: the axes of a two-dimensional array]
In the previous section, to visualize the slicing process, we defined the vertical direction of the two-dimensional array as axis 0 and the horizontal direction as axis 1. When we perform Numpy operations, we specify axis=0 for calculations along the vertical direction and axis=1 for calculations along the horizontal direction.
So, obviously, to answer the question above we can specify the vertical direction for the maximum, i.e. axis=0:
arr_rnd = np.random.normal(5, 10, (3, 4))
arr_rnd
Out:
array([[  3.03935618,  10.4099711 ,  -2.52635821,  28.64354885],
       [  1.70071308,  12.09126794, -19.11971849,  16.37838998],
       [ -0.49338333,   0.63231608,  17.84866128,   0.30924362]])
np.max(arr_rnd)
Out: 28.643548853069557
np.max(arr_rnd, axis=0)
Out: array([ 3.03935618, 12.09126794, 17.84866128, 28.64354885])

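Because the random values above change on every run, a fixed array may make the two directions easier to follow. In this sketch the `samples` values are made up:

```python
import numpy as np

# Hypothetical data: 3 samples (rows), each with 4 dimensions (columns).
samples = np.array([[1, 8, 3, 4],
                    [5, 2, 7, 0],
                    [9, 6, 1, 2]])

per_dimension = np.max(samples, axis=0)   # collapse the rows: one value per column
per_sample = np.max(samples, axis=1)      # collapse the columns: one value per row

print(per_dimension)   # values 9, 8, 7, 4
print(per_sample)      # values 8, 7, 9
```

The axis you name is the one that disappears from the result: axis=0 on a 3×4 array leaves 4 values, and axis=1 leaves 3.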
- Exploring how argmin and argmax work
argmin, argmax: compute the index of the minimum or maximum among all elements of the array
Let’s start with the simplest example.
Suppose we now have an array a = [3, 1, 2, 4, 6, 1]. What is the index of the largest number in a?
For students who are just learning programming, the most direct idea is: first assume that the 0th number is the largest, then compare it with each following number, and update the index whenever a larger number is found. The code is as follows:
a = [3, 1, 2, 4, 6, 1]
maxindex = 0
i = 0
for tmp in a:
    if tmp > a[maxindex]:
        maxindex = i
    i += 1
print(maxindex)  # 4

argmin works the same way; just update the index whenever a smaller number is found:
a = [3, 1, 2, 4, 6, 10]
minindex = 0
i = 0
for tmp in a:
    if tmp < a[minindex]:
        minindex = i
    i += 1
print(minindex)  # 1

Explanation
Here I will explain only argmax; argmin works in basically the same way. Again starting with a one-dimensional array, look at the following example:
import numpy as np
a = np.array([3, 1, 2, 4, 6, 1])
print(np.argmax(a))  # 4
import numpy as np
a = np.array([3, 1, 2, 4, 6, 1, 6])
print(np.argmax(a))  # 4

argmax returns the index of the largest number. np.argmax also takes an axis parameter; by default (axis=None) the array is flattened and the index refers to the flattened array. What if we want the maximum along each dimension of a two-dimensional array? (At this point in the article, it suddenly strikes me that running and demonstrating code in IPython is really pleasant, so I’ll use IPython for the demos from here on.) Let’s first go through the official sample code:
In [2]: import numpy as np
In [3]: a = np.arange(6).reshape(2, 3) + 10
In [4]: a
Out[4]:
array([[10, 11, 12],
       [13, 14, 15]])
In [5]: np.argmax(a)
Out[5]: 5
In [6]: np.argmax(a, axis=0)
Out[6]: array([1, 1, 1])
In [7]: np.argmax(a, axis=1)
Out[7]: array([2, 2])

PS: When the maximum value occurs more than once, the index of its first occurrence is returned. “In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.”
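Without an axis, argmax returns a position in the flattened array (5 in the example above). If a (row, column) pair is wanted instead, np.unravel_index can convert it. A minimal sketch on the same array:

```python
import numpy as np

a = np.arange(6).reshape(2, 3) + 10

flat_index = np.argmax(a)                          # index into the flattened array
row, col = np.unravel_index(flat_index, a.shape)   # convert to 2-D coordinates

print(flat_index, row, col)
print(a[row, col])
```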
5. A first try: array-oriented programming
After the steps above, friends should have a first impression of how efficient array operations are. With the functions packaged in Numpy we can skip the tedious step of writing loops. This practice of using arrays to perform batch calculations without writing loops is called vectorization.
- Use functions to solve some problems
In practice, the scenarios we face usually cannot be solved by one or two simple functions. For example, in competitions, to reduce the influence of personal bias and keep the result objective and fair, we usually drop one highest score and one lowest score and take the average of the remaining scores as the contestant’s final result.
# Use the built-in random number function to generate the scores that 7 judges give to 5 contestants. The scores are presented as a 5×7 array
votes = np.random.randint(1, 10, (5, 7))
votes
Out: array([[8, 6, 4, 1, 6, 1, 5],
       [7, 6, 1, 9, 2, 1, 5],
       [9, 6, 8, 9, 3, 5, 4],
       [8, 5, 8, 4, 7, 9, 7],
       [6, 2, 1, 2, 1, 8, 3]])

# Total score - highest score - lowest score, then take the average of the remaining 5 scores to get the final result
(np.sum(votes, axis=1) - np.max(votes, axis=1) - np.min(votes, axis=1)) / 5
Out: array([4.4, 4.2, 6.4, 7. , 2.8])

What about things that can’t be solved by a simple function? Let’s briefly introduce two methods:
- Use Numpy to implement conditional logic
Conditional logic is a very common scenario in computing. For example, I want to check the arr_rnd array generated above and replace every element that is less than 5 with NaN. This is easy to do with the np.where function:
# where(condition, value where the condition is true, value where the condition is false)
# A null value is a special data format in Numpy; we use np.nan to generate it
np.where(arr_rnd < 5, np.nan, arr_rnd)
Out: array([[19.03116154, 13.58954268, 11.93818701,         nan],
       [        nan,         nan,  8.67773155, 10.15552974],
       [ 7.04757778,  6.98288594, 10.60656035, 17.95555988]])

- np.frompyfunc
If no existing function does what you need, you can write your own, which is also very simple. np.frompyfunc converts a function that works on a single element into one that operates on every element of an array.
Here is a simple example:
Suppose a Taobao shop does wholesale business, and the list price of its popular product A is 20 yuan.
- 40 percent off (you pay 60% of the price) for purchases of 100 pieces or more;
- 20 percent off for purchases of at least 50 but fewer than 100 pieces;
- 10 percent off for purchases of at least 10 but fewer than 50 pieces;
- No discount for fewer than 10 pieces.
Given the purchases of 5 customers on a given day, find that day's turnover. There are many ways to do this; here is an array-based approach:
# Define a function that takes an order of x pieces and returns the order amount
def order(x):
    if x >= 100:
        return 20 * 0.6 * x
    if x >= 50:
        return 20 * 0.8 * x
    if x >= 10:
        return 20 * 0.9 * x
    return 20 * x
# frompyfunc takes three arguments: the function to convert, its number of input arguments, and its number of return values
income = np.frompyfunc(order, 1, 1)
# order_lst holds the order quantities of the 5 customers
order_lst = [600, 300, 5, 2, 85]
# Calculate the day's turnover
np.sum(income(order_lst))
Out: 12300.0
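For comparison, the same tiered discount can also be written without a Python-level function. The sketch below uses np.select, which is a swapped-in technique and not the approach taken in the article; the names quantity, unit_price, and rate are chosen here for illustration:

```python
import numpy as np

quantity = np.array([600, 300, 5, 2, 85])
unit_price = 20

# Conditions are checked in order; the first one that matches wins
conditions = [quantity >= 100, quantity >= 50, quantity >= 10]
multipliers = [0.6, 0.8, 0.9]
rate = np.select(conditions, multipliers, default=1.0)

# Turnover = price x discount multiplier x quantity, summed over all customers
turnover = float(np.sum(unit_price * rate * quantity))
print(turnover)
```

frompyfunc stays the more general tool when the per-element logic is too irregular to express as a list of conditions.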
That completes array operations, and with them the introductory Numpy material. This part systematically introduced the common functions for array operations, including universal functions, linear algebra, and aggregation functions, along with several practical demos to learn from.
Many functions are involved, so don't try to learn them by rote. At this stage of learning data analysis, focus on understanding and practice. If there are too many functions to remember, leave that to time: you will remember them naturally as you use them. For rarely used functions, look them up when you need them.
6. Sort arrays
In [8]: data = np.array([1, 9, 3, 2, 7, 4, 5, 6, 8])
In [9]: data.sort()
In [10]: data
Out[10]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
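The data.sort() call above sorts the array in place. A short sketch of the related calls may help here: np.sort returns a sorted copy, and np.argsort returns the indices that would sort the array. The variable names are chosen for this example only:

```python
import numpy as np

data = np.array([1, 9, 3, 2, 7, 4, 5, 6, 8])

sorted_copy = np.sort(data)       # new sorted array; data itself is unchanged
order = np.argsort(data)          # indices that would sort data
descending = sorted_copy[::-1]    # reversed view gives descending order

data.sort()                       # sorts data itself, in place, and returns None
print(sorted_copy, order, descending)
```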
5.4 Reading ndarray data from a file
Create a text file and save it as data.txt:
1,2,3,4,5,6,7,8,9
The code to read it is as follows:
import numpy as np
data = np.genfromtxt('data.txt', delimiter=',')
print(data)
If we know in advance that the data are integers, we can set or convert the data type in two ways.
Method 1: specify dtype when reading
data = np.genfromtxt('data.txt', dtype='int', delimiter=',')
Method 2: convert after reading
# array.astype(target type)
data = np.genfromtxt('data.txt', delimiter=',')
print(data.astype(int))
To understand genfromtxt in more depth, reading its source code is recommended; we won't go into it here.
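To try both methods without creating a file, the sketch below feeds the same line of text to genfromtxt through io.StringIO. Using an in-memory buffer is an assumption made for this example; the article reads data.txt from disk:

```python
import io
import numpy as np

text = "1,2,3,4,5,6,7,8,9"

# Default: values are parsed as floats
as_float = np.genfromtxt(io.StringIO(text), delimiter=',')

# Method 1: ask for integers while reading
as_int = np.genfromtxt(io.StringIO(text), delimiter=',', dtype=int)

# Method 2: read as floats, then convert
converted = as_float.astype(int)

print(as_float.dtype, as_int.dtype, converted)
```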
5.5 Advanced: references, copies, and views
Apple's annual phone launch always dominates the Weibo trending list. People complain endlessly, but they still buy without hesitation. The new iPhone Xs Max finally broke through the ceiling at 12,799 yuan. Imagine that this year's year-end bonus is 100,000 yuan, and you decide to treat yourself and your partner to one phone each: same model, configuration, color, and price, even the same unlock password. A perfect couple's set. So the question is: are these two phones the same phone?
If I define my phone as variable M and my partner's phone as variable N, how can I tell whether they are the same?
M == N
With the equality comparison above, the result must be True, because == compares the contents of the variables, and the two phones have the same configuration, color, features, and so on. (Of course, no two pieces of hardware in the world are exactly identical, so let's not split hairs here.)
M is N
What if we use is? The answer is False. The is operator compares identity: is my phone your phone? Obviously not.
This is a good illustration of a copy: my phone and my partner's phone are copies of each other, but they are different objects!
This section starts from Python references and copies, then looks in depth at Numpy's copy mechanism, to give you a deeper understanding of how array variables relate to one another during data analysis and data processing.
1. Python
- Python references
A reference to an object is created by assignment. Here's an example:
a = ["a", "b", "c"]
b = a
a is b
Out:
True
In this example we assigned a to b. The assignment does not copy the list; it simply creates another reference, b, to the same underlying list. The schematic diagram of this process is as follows:
Therefore, when we compare a and b with is, both refer to the same block of memory, and the identity check is naturally True. Obviously, if we modify the list through a, b sees the change, and vice versa. That is the idea of a reference.
To determine whether two variables are the same object, you can also use the id() function, which returns the object's identity (in CPython, its memory address). For the two variables above:
# The actual address will differ from one environment to another
# But the underlying rule is the same: a and b have the same memory address
id(a)
Out: 1430507484680
id(b)
Out: 1430507484680
- Deep and shallow copies in Python
Before we start, one conclusion up front: numbers and strings are immutable, so copying one simply gives back a reference to the same object, and the distinction between deep and shallow copies is meaningless for them. In other words, deep and shallow copies only matter for mutable objects, and the most common case is the list.
- Deep copy
A deep copy builds a complete duplicate of the copied object. The two look exactly the same but are independent and do not affect each other.
Let's take a simple single-level list as an example:
# For a single-level list, after a deep copy no operation on one object can change the other
# Python's built-in copy operations live in the copy module; deepcopy() performs a deep copy
import copy
m = ["Jack", "Tom", "Brown"]
n = copy.deepcopy(m)
I also include a schematic diagram:
# Are the two equal? Obviously yes
m == n
Out: True
# Are the two the same object? No. In other words, lists m and n are stored at different addresses.
m is n
Out: False
# Change the first element of m; n does not change, so the two do not affect each other
m[0] = "Helen"
m
Out: ['Helen', 'Tom', 'Brown']
n
Out: ['Jack', 'Tom', 'Brown']
This example shows that a deep copy of the mutable object m duplicates its entire data structure and assigns the duplicate to n, which matches our intuition.
- Shallow copy
Since Python distinguishes shallow from deep copies, the two must differ somewhere. For a single-level list of immutable objects, however, a shallow copy behaves exactly like a deep copy. Here's a quick test:
# copy() in the copy module performs a shallow copy
import copy
m = ["Jack", "Tom", "Brown"]
n = copy.copy(m)
# Are the original and the shallow copy the same object? No
m is n
Out: False
# Change a value in m; n does not change. The behavior here matches deep copy
m[0] = "Black"
n
Out: ['Jack', 'Tom', 'Brown']
What about nested lists? Things are a little different there. We create a two-level list: each inner list has length 3 and holds a student's name, height, and weight; the first element of the outer list is the class name:
# students has length 3: the first element is a string and the others are lists
students = ["Class 1", ["Jack", 178, 120], ["Tom", 174, 109]]
students_c = copy.copy(students)
I made a GIF demo for you:
# Check whether the nested lists are the same object
students[1] is students_c[1]
Out: True
# Change a value inside a nested list of students and see whether students_c changes too
students[1][1] = 180
students
Out: ['Class 1', ['Jack', 180, 120], ['Tom', 174, 109]]
students_c
Out: ['Class 1', ['Jack', 180, 120], ['Tom', 174, 109]]
# students_c has changed as well: the shallow copy does not copy the mutable elements of the nested list (the deeper data structure), it only references them
# Next, change the class name in students
students[0] = "Class 2"
students
Out: ['Class 2', ['Jack', 180, 120], ['Tom', 174, 109]]
students_c
Out: ['Class 1', ['Jack', 180, 120], ['Tom', 174, 109]]
# students_c does not change: for immutable elements of a nested list, shallow copy behaves like deep copy
From the experiments above, we can draw the following conclusions:
1) For a list of immutable objects, shallow copy and deep copy have the same effect: the original and the copy are independent of each other.
2) When a list contains mutable elements, a shallow copy creates a new outer list whose entries are references (pointers) to those same elements. When such an element is modified in place, the change shows up in the copy as well.
3) A deep copy duplicates everything and so saves no memory; a shallow copy copies only the first level of elements and is therefore relatively memory-efficient.
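The three conclusions can be checked side by side in one short script. This is a minimal sketch that reuses the students list from the text; the names shallow, deep, shallow_height, and deep_height are introduced here for illustration:

```python
import copy

students = ["Class 1", ["Jack", 178, 120], ["Tom", 174, 109]]

shallow = copy.copy(students)      # new outer list, shared inner lists
deep = copy.deepcopy(students)     # new outer list and new inner lists

students[1][1] = 180               # mutate a nested (mutable) element in place
students[0] = "Class 2"            # rebind a top-level (immutable) element

shallow_height = shallow[1][1]     # follows the shared inner list
deep_height = deep[1][1]           # unaffected by the change

print(shallow_height, deep_height, shallow[0])
```

The shallow copy sees the new height because it shares the inner list, but keeps the old class name because rebinding students[0] only touches the original outer list.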
- Slices and shallow copies
Slicing is a common and convenient way to copy lists. If we modify the sliced list, does the source list change?
Let's start with the conclusion: slicing is simply a shallow copy of some elements of the source list!
# Take the students list from above and slice it
students = ["Class 1", ["Jack", 178, 120], ["Tom", 174, 109]]
students_slice = students[:2]
# Change a value inside the nested list of students_slice, then compare the source list with the slice
students_slice[-1][1] = 185
students_slice
Out: ['Class 1', ['Jack', 185, 120]]
students
Out: ['Class 1', ['Jack', 185, 120], ['Tom', 174, 109]]
# The change made through the slice also shows up in the source list: the mutable element is only referenced, not copied
# Now change the class name in students_slice and compare the source list with the slice again
students_slice[0] = "Class 3"
students_slice
Out: ['Class 3', ['Jack', 185, 120]]
students
Out: ['Class 1', ['Jack', 185, 120], ['Tom', 174, 109]]
# This change is not passed to the source list: for immutable elements, the slice and the source are independent
# Taken together, slicing behaves exactly like a shallow copy!
Shallow and deep copies are fundamental to understanding Python. Knowing the difference between them will help you understand multidimensional arrays.
2. Numpy
As we mentioned earlier, Numpy is optimized for memory use to suit large data sets. The optimization aims to save memory by keeping memory overhead as low as possible during operations such as slicing.
For Numpy, we mainly distinguish two concepts: views and copies.
(Note: since a multidimensional array can be thought of as a nested list, the idea of a shallow copy of a nested list also applies here. The effect of a view on a multidimensional array can be understood as a shallow copy of a nested list, and a copy is essentially the same as a deep copy.)
A view is a reference to the data. Through it you can access and manipulate the original data without copying it. If we modify a view, the original data changes too, because both occupy the same physical memory.
A copy is a full duplicate of the data (a deep copy in Python terms). If we modify the copy, the original data is not affected, because the two occupy different physical memory.
- View
There are two ways to create a view: Numpy slicing and calling the view() method.
Let's first look at an example that calls view() to create a view:
# view() creates a new reference; changing the view's shape does not change the original array
import numpy as np
arr_0 = np.arange(12).reshape(3, 4)
view_0 = arr_0.view()
view_0
Out: array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
# In terms of identity, the two are not the same object
arr_0 is view_0
Out: False
# If you change an element of the view, the original data changes with it
view_0[1, 1] = 100
arr_0
Out: array([[  0,   1,   2,   3],
       [  4, 100,   6,   7],
       [  8,   9,  10,  11]])
# Change the shape of the view:
# A change to the view's shape is not passed to the original array
view_0.shape = (4, 3)
print("arr_0 shape:", arr_0.shape, "view_0 shape:", view_0.shape)
Out: arr_0 shape: (3, 4) view_0 shape: (4, 3)
Creating a view by slicing is familiar to everyone. Let's test it on a one-dimensional array:
# Slice the one-dimensional array, modify the result, and see whether the original array is affected
arr_1 = np.arange(12)
slice_1 = arr_1[:6]
slice_1[3] = 99
slice_1
Out: array([ 0,  1,  2, 99,  4,  5])
# The fourth element of arr_1 has changed to 99. For a one-dimensional array this differs from a list, so be careful here.
arr_1
Out: array([ 0,  1,  2, 99,  4,  5,  6,  7,  8,  9, 10, 11])
- Copy
A copy, or deep copy, costs more memory but is easier to understand. Once the copy has been created, the two variables are completely independent.
# Numpy offers two ways to create a copy:
# Method 1: the array's own copy() method
# Method 2: deepcopy() from the copy module. Here we focus on method 1:
arr_2 = np.array([[1, 2, 3], [4, 5, 6]])
copy_2 = arr_2.copy()
copy_2[1, 1] = 500
copy_2
Out: array([[  1,   2,   3],
       [  4, 500,   6]])
arr_2
Out: array([[1, 2, 3],
       [4, 5, 6]])
# The comparison shows that the two do not affect each other once the copy exists, consistent with the conclusion above.
The material here is somewhat theoretical, but it is a step you must take on the way from beginner to advanced. If it doesn't sink in on the first reading, carry your questions forward and deepen your understanding through later practice.
The key is to distinguish direct assignment, shallow copy, and deep copy. The hard part is the concept of shallow copy.
For native Python lists, shallow copies behave differently depending on whether the list is nested. In a nested list, a change to an inner list made through either the original or the copy shows up in the other. If the list contains only immutable elements, a shallow copy is no different from a deep copy.
In Numpy the picture is simpler: any element-level modification to a view or slice result is reflected in the original array, regardless of the array's dimensions.
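When it is unclear whether an array is a view or a copy, Numpy can answer directly. The sketch below uses np.shares_memory and the .base attribute; the array and variable names are made up for this example:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

sliced = arr[:2, :2]          # basic slicing returns a view
copied = arr[:2, :2].copy()   # an explicit copy owns its own data

shares_view = np.shares_memory(arr, sliced)
shares_copy = np.shares_memory(arr, copied)

sliced[0, 0] = -1             # writes through to arr
copied[1, 1] = 500            # leaves arr untouched

print(shares_view, shares_copy)
print(arr)
```

A view has a non-None .base pointing at the array that owns the memory, while a fresh copy has .base set to None.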
5.6 Supplementary knowledge points for Numpy
1. linspace()
The linspace() method is slightly more complicated. Its call signature is:
np.linspace(start, stop[, num=50[, endpoint=True[, retstep=False[, dtype=None]]]])
# start, stop: the start and end values of the interval, as in arange()
# num: the number of elements in the array to create; the default is 50
# endpoint: whether the stop value is included. The default is True; if endpoint=False, the interval is closed on the left and open on the right
# retstep: controls the form of the return value. The default False returns only the array; True returns a tuple of the array and the step
Take a look at a few examples:
# endpoint left at its default, True
In [43]: arr_uniform3 = np.linspace(1, 99, 11)
In [44]: arr_uniform3
Out[44]: array([ 1. , 10.8, 20.6, 30.4, 40.2, 50. , 59.8, 69.6, 79.4, 89.2, 99. ])
# endpoint set to False
In [41]: arr_uniform4 = np.linspace(1, 99, 11, endpoint=False)
In [42]: arr_uniform4
Out[42]:
array([ 1.        ,  9.90909091, 18.81818182, 27.72727273, 36.63636364,
       45.54545455, 54.45454545, 63.36363636, 72.27272727, 81.18181818,
       90.09090909])
# retstep set to True: returns the array and the step size
In [45]: arr_uniform5 = np.linspace(1, 99, 11, retstep=True)
In [46]: arr_uniform5
Out[46]:
(array([ 1. , 10.8, 20.6, 30.4, 40.2, 50. , 59.8, 69.6, 79.4, 89.2, 99. ]),
 9.8)
The main feature of linspace() is that you directly define the length of the array, which makes it easy to adjust the array's size with the reshape() method mentioned earlier:
# Define an array of length 20, then use reshape() to adjust its size to 5×4
In [49]: arr_uniform6 = np.linspace(1, 100, 20)
In [50]: arr_uniform6.reshape(5, 4)
Out[50]:
array([[  1.        ,   6.21052632,  11.42105263,  16.63157895],
       [ 21.84210526,  27.05263158,  32.26315789,  37.47368421],
       [ 42.68421053,  47.89473684,  53.10526316,  58.31578947],
       [ 63.52631579,  68.73684211,  73.94736842,  79.15789474],
       [ 84.36842105,  89.57894737,  94.78947368, 100.        ]])
The reshape() method is very flexible: it adjusts the size of the array without changing its length. An array of length 100 can easily be reshaped to 1×100, 4×25, or 5×20. This is useful when converting row vectors to column vectors.
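The points about endpoint, retstep, and reshaping can be combined in one short sketch. The variable names are chosen for this example, and reshape(-1, 1) is one common way (assumed here, not shown in the article) to turn a row of values into a column vector:

```python
import numpy as np

# retstep=True returns the array together with the spacing between elements
values, step = np.linspace(1, 99, 11, retstep=True)

# endpoint=False leaves the stop value out
half_open = np.linspace(1, 99, 11, endpoint=False)

row = np.linspace(1, 100, 20)
column = row.reshape(-1, 1)   # -1 lets NumPy infer the number of rows
grid = row.reshape(5, 4)      # same 20 values, arranged as 5 rows by 4 columns

print(step, column.shape, grid.shape)
```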
2. Create a geometric sequence array
Geometric sequences are widely used in calculations, for example in interest calculations. Here we introduce two ways to create them.
- The geomspace() method creates a geometric sequence from its end values
Say I want a geometric sequence from 2 to 16. I don't know the common ratio, but I want the sequence to have length 4. Here's what I can do:
# The first term is 2, the last term is 16, and the length is 4. Note that by default the interval is closed on both ends
In [51]: arr_geo0 = np.geomspace(2, 16, 4)
In [52]: arr_geo0
Out[52]: array([ 2.,  4.,  8., 16.])
The geomspace() method is very simple. Its parameters are:
geomspace(start, stop, num=50, endpoint=True, dtype=None)
# start and stop: the start and end values of the interval; both are required
# num: the length of the geometric sequence to generate. Once it is given, the common ratio is computed automatically
# endpoint: defaults to True, giving an interval closed on both ends. If False, the interval is closed on the left and open on the right
- The logspace() method creates a geometric sequence from exponents
The logspace() method is similar to geomspace(), except that the start and end of the interval are given as exponents. logspace() is used as follows:
logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None)
# start: the sequence starts at base ** start
# stop: the sequence ends at base ** stop
# num: the length of the geometric sequence to generate; the exponents are evenly spaced between start and stop. The default is 50
# endpoint: if True (the default), the end value is included, i.e. the interval is closed on both ends. Same rule as above
What if we want to produce the same geometric sequence as above?
# The first term is 2^1, the last term is 2^4, and the length is 4. Note that base is the base, and start and stop are the exponents.
# logspace has many parameters, so we recommend passing everything except start and stop as keyword arguments to avoid mistakes
In [53]: arr_geo1 = np.logspace(1, 4, num=4, base=2)
In [54]: arr_geo1
Out[54]: array([ 2.,  4.,  8., 16.])
That's all there is to creating geometric sequences. Even this small glimpse shows how much convenience Numpy's encapsulation provides. This is the beauty of Python for data analysis.
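A quick check confirms that the two calls describe the same sequence and that the common ratio is constant. This is a small sketch; the names geo, log, ratios, and same are introduced here:

```python
import numpy as np

geo = np.geomspace(2, 16, 4)              # end values given directly
log = np.logspace(1, 4, num=4, base=2)    # end values given as exponents of 2

ratios = geo[1:] / geo[:-1]               # ratio between neighbouring terms
same = np.allclose(geo, log)

print(geo, ratios, same)
```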
3. Create a random number array
Random numbers have many uses in programming. For example, we've all played match-3 games, where a randomly colored block drops from the top of the screen whenever blocks are cleared. Shuffling the deck in a card game is also a random process.
But sometimes we have specific requirements for the random numbers. In a match-3 game, for example, the colors do not all appear with the same probability, and at hard levels the program seems to raise the difficulty deliberately. The random numbers there are in fact carefully designed, so let's look at how to generate some more "advanced" random numbers.
- Create uniformly distributed random numbers on [0, 1)
# The inputs are integers giving the shape of the output array, d0×d1×...×dn
# If no argument is given, a single random float is returned
numpy.random.rand(d0, d1, ..., dn)
# Generate an array of shape 3×2, uniformly distributed between 0 and 1
In [57]: arr_rand0 = np.random.rand(3, 2)  # three rows, two columns
In [58]: arr_rand0
Out[58]:
array([[0.59348424, 0.30368829],
       [0.73058467, 0.66220976],
       [0.6186512 , 0.32079605]])
- Create uniformly distributed random numbers on [low, high)
# uniform generates random numbers in the range [low, high). size is the shape of the array, given as an integer (one-dimensional) or a tuple of integers
# If size is not specified, a single random number from the distribution is returned
numpy.random.uniform(low=0.0, high=1.0, size=None)
# Generate an array of shape 3×2, uniformly distributed between 1 and 10
arr_rand1 = np.random.uniform(1, 10, (3, 2))
arr_rand1
Out:
array([[6.72617294, 5.32504844],
       [7.6895909 , 6.97631457],
       [1.3057397 , 3.51288886]])
- Create arrays that follow the standard normal distribution (mean 0, variance 1)
# As with rand, the inputs are integers giving the shape of the output array, d0×d1×...×dn
# If no argument is given, a single random float from the standard normal distribution is returned
numpy.random.randn(d0, d1, ..., dn)
# Generate a standard normal array of shape 3×2
arr_rand2 = np.random.randn(3, 2)
arr_rand2
Out:
array([[-0.70354968, -0.85339511],
       [ 0.22804958,  0.28517509],
       [ 0.736904  , -2.98846222]])
- Create a normally distributed array with μ=loc, σ=scale
# loc: the mean μ; scale: the standard deviation σ
# size: the shape of the array, given as an integer (one-dimensional) or a tuple of integers
numpy.random.normal(loc=0.0, scale=1.0, size=None)
# Generate a normally distributed array of shape 3×2 with mean 5 and standard deviation 10
arr_rand3 = np.random.normal(5, 10, (3, 2))
arr_rand3
Out:
array([[ -7.77480714,  -2.68529581],
       [  4.40425363,  -8.39891281],
       [-13.08126657,  -9.74238828]])
So far we have generated random floating-point numbers within an interval or following a distribution. Can we also generate random integers? Of course we can.
- Uniformly sampled integers from the half-open interval [low, high)
# dtype: the data type of the returned array; the default is integer
# size: the shape of the array, given as an integer (one-dimensional) or a tuple of integers
numpy.random.randint(low, high=None, size=None, dtype=int)
# Sample integers uniformly from [1, 5); the array has 3 rows and 2 columns
arr_rand4 = np.random.randint(1, 5, (3, 2))
arr_rand4
Out:
array([[4, 4],
       [3, 3],
       [4, 2]])
The result of np.random.randint(1, 5, (3, 2)) can be understood as follows: imagine four balls numbered 1, 2, 3, and 4. We draw one ball 6 times, putting it back after each draw, record the number each time, and arrange the 6 numbers into a 3×2 array.
numpy.random.randint is a convenient way to sample from a range of integers, but what if I want to sample from a specific set of objects? Let's keep going.
- Sample from specific items, with or without replacement
# a: an array, a list, or an integer. If it is an integer, samples are drawn from np.arange(a), i.e. the integers in [0, a)
# replace: False means sampling without replacement; True means sampling with replacement
# size: the shape of the generated sample
# p: the probability of each element of a
numpy.random.choice(a, size=None, replace=True, p=None)
Let's look at an example. Xiao Ming makes a free throw with probability 0.65 and takes 10 shots in total. Here is a computer simulation:
# Ideally each shot does not affect the next, so we model this as sampling with replacement, 10 times in total
# shoot_lst stores the results of the shots
# Sample from ["Hit", "Miss"] with probabilities 0.65 and 0.35
shoot_lst = np.random.choice(["Hit", "Miss"], size=10, replace=True, p=[0.65, 0.35])
shoot_lst
Out: ['Miss', 'Hit', 'Miss', 'Hit', 'Hit', 'Hit', 'Hit', 'Hit', 'Hit', 'Miss']
In this run he missed 3 of the 10 shots, for an actual hit rate of 70%. Of course, if the computer keeps simulating, the hit rate will eventually approach 65 percent.
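The claim that the hit rate approaches 65 percent can be checked by simulating many more shots. The sketch below uses the newer np.random.default_rng generator interface rather than the np.random.choice function from the article; the seed and the number of shots are arbitrary choices made for this example:

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # fixed seed so the run is repeatable

# 100,000 independent shots, each a hit with probability 0.65
shots = rng.choice(["Hit", "Miss"], size=100_000, replace=True, p=[0.65, 0.35])

# Fraction of shots that were hits
hit_rate = np.mean(shots == "Hit")
print(hit_rate)
```

With this many shots the observed rate should land very close to 0.65, whereas 10 shots can easily give 0.5 or 0.8.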
4. Trivia: the secret of sampling
Statistics is the science of studying random phenomena, and inference is its defining feature. The idea of "inferring the whole from a part" runs throughout statistics.
For example, every year the relevant departments publish reports on the physical fitness of college students, in which height is one of the most basic measures. Where do these heights come from? Measuring every college student would obviously be very expensive, so the simplest way to cut the cost is random sampling. Today we will look at how a random sample can reflect the population as a whole.
From published data, we know that the heights of college students follow a normal distribution.
- Let's assume that the average height of college students is 175 centimeters;
- The standard deviation of height is 10 centimeters;
We can then use Numpy to generate heights for 100,000 college students and treat them as the population for this study. By sampling repeatedly, we observe how the sample mean changes as the number of samples grows. We use the numpy.random.choice function to simulate the sampling process. (The program uses some simple Numpy calculations and matplotlib plotting functions. Feel free to read up on them first; the details are covered in later chapters of this course.)
# Display matplotlib figures inline in the Jupyter Notebook
%matplotlib inline
# Import packages
import numpy as np
import matplotlib.pyplot as plt
# Generate height data for 100,000 college students
arr_height = np.random.normal(175, 10, size=100000)
# sample_height is an ndarray that stores the sampled heights
sample_height = np.random.choice(arr_height, size=1, replace=True)
# average stores the mean height calculated after each sample
average = []
# 10,000 rounds of sampling; since only 1 sample is drawn at a time, the whole process can be regarded as sampling with replacement
n = 10000
for _ in range(n):
    sample = np.random.choice(arr_height, size=1, replace=True)
    sample_height = np.append(sample_height, sample)
    average.append(np.average(sample_height))
# The plotting calls are explained in detail in the data visualization chapter
plt.figure(figsize=(8, 6))
plt.plot(np.arange(n), average, alpha=0.6, color='blue')
plt.plot(np.arange(n), [175 for i in range(n)], alpha=0.6, color='red', linestyle='--')
plt.xlabel("Sample Rounds", fontsize=10)
plt.ylabel("Average Height", fontsize=10)
plt.show()
From the plot, the sample mean fluctuates sharply during the first 2,000 samples. As the number of samples increases, however, the sample mean gets closer and closer to 175 cm. A sound, scientific sampling method can therefore greatly reduce the cost of the routine statistical work of the relevant departments.
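The running average can also be computed without a Python loop. The sketch below is a simplified variant, not the article's procedure: it draws the samples directly from the normal distribution instead of from a fixed population of 100,000, uses an arbitrary seed, and leaves out the plot:

```python
import numpy as np

rng = np.random.default_rng(seed=0)    # fixed seed so the run is repeatable
n = 10_000

# n heights drawn from a normal distribution with mean 175 and std 10
samples = rng.normal(loc=175, scale=10, size=n)

# running_mean[k] is the mean of the first k + 1 samples
running_mean = np.cumsum(samples) / np.arange(1, n + 1)

# Distance between the final sample mean and the true mean
final_gap = abs(running_mean[-1] - 175)
print(running_mean[:5], final_gap)
```

Dividing the cumulative sum by the sample count avoids re-averaging a growing array on every round, which is what makes the loop version slow for large n.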
6. Cases
1. Data normalization
- Download the data
Use NumPy to download the iris dataset, extracting only its second column with usecols=[1]:
import numpy as np
url = 'https://images-aiyc-1301641396.cos.ap-guangzhou.myqcloud.com/Data_Analysis/iris.data'
wid = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[1])
- Display the data
array([3.5, 3. , 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3. ,
       3. , 4. , 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.6, 3.3, 3.4, 3. ,
       3.4, 3.5, 3.4, 3.2, 3.1, 3.4, 4.1, 4.2, 3.1, 3.2, 3.5, 3.1, 3. ,
       3.4, 3.5, 2.3, 3.2, 3.5, 3.8, 3. , 3.8, 3.2, 3.7, 3.3, 3.2, 3.2,
       3.1, 2.3, 2.8, 2.8, 3.3, 2.4, 2.9, 2.7, 2. , 3. , 2.2, 2.9, 2.9,
       3.1, 3. , 2.7, 2.2, 2.5, 3.2, 2.8, 2.5, 2.8, 2.9, 3. , 2.8, 3. ,
       2.9, 2.6, 2.4, 2.4, 2.7, 2.7, 3. , 3.4, 3.1, 2.3, 3. , 2.5, 2.6,
       3. , 2.6, 2.3, 2.7, 3. , 2.9, 2.9, 2.5, 2.8, 3.3, 2.7, 3. , 2.9,
       3. , 3. , 2.5, 2.9, 2.5, 3.6, 3.2, 2.7, 3. , 2.5, 2.8, 3.2, 3. ,
       3.8, 2.6, 2.2, 3.2, 2.8, 2.8, 2.7, 3.3, 3.2, 2.8, 3. , 2.8, 3. ,
       2.8, 3.8, 2.8, 2.8, 2.6, 3. , 3.4, 3.1, 3. , 3.1, 3.1, 3.1, 2.7,
       3.2, 3.3, 3. , 2.5, 3. , 3.4, 3. ])
wid holds a single variable: it is a one-dimensional NumPy array of length 150.
- Normalization
Find the maximum and minimum values:
smax = np.max(wid)
smin = np.min(wid)
In [51]: smax, smin
Out[51]: (4.4, 2.0)
Normalization formula:
s = (wid - smin) / (smax - smin)
Normalization
(1) Min-max normalization
x' = (x - x_min) / (x_max - x_min)
(2) Mean normalization
x' = (x - μ) / (x_max - x_min)
A drawback of (1) and (2) is that when new data is added, the maximum and minimum may change and must be recomputed.
(3) Nonlinear normalization
- Log conversion: y = log10(x)
- Arctangent conversion: y = atan(x) * 2 / π
- Often used when the data are widely spread, with some values very large and some very small. The original values are mapped through a mathematical function such as log, exponential, or arctangent. The curve of the nonlinear function, such as log(V, 2) or log(V, 10), should be chosen according to the distribution of the data.
Standardization
- Z-score standardization (standard-deviation standardization / zero-mean standardization)
- x' = (x - μ) / σ
Centering
- x' = x - μ
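The formulas listed above translate almost directly into NumPy. Here is a minimal sketch on a made-up array x (not the iris column); note that x.std() computes the population standard deviation by default, which is an assumption about which σ is meant:

```python
import numpy as np

x = np.array([2.0, 3.0, 3.5, 4.4])   # hypothetical sample values

min_max = (x - x.min()) / (x.max() - x.min())      # scaled to [0, 1]
mean_norm = (x - x.mean()) / (x.max() - x.min())   # centred, scaled by the range
z_score = (x - x.mean()) / x.std()                 # zero mean, unit standard deviation
centered = x - x.mean()                            # zero mean only

print(min_max, mean_norm, z_score, centered)
```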
Supplementary: Data Analysis — Normalized “Essay”
An easier way is to use ptp ("peak to peak"), which directly computes the difference between the maximum and the minimum. The ndarray.ptp() method was removed in NumPy 2.0, so the function form np.ptp() is the safer choice:
s = (wid - smin) / np.ptp(wid)
- NumPy print settings
Print only three decimal places:
np.set_printoptions(precision=3)
Normalized results:
array([0.625, 0.417, 0.5  , 0.458, 0.667, 0.792, 0.583, 0.583, 0.375,
       0.458, 0.708, 0.583, 0.417, 0.417, 0.833, 1.   , 0.792, 0.625,
       0.75 , 0.75 , 0.583, 0.708, 0.667, 0.542, 0.583, 0.417, 0.583,
       0.625, 0.583, 0.5  , 0.458, 0.583, 0.875, 0.917, 0.458, 0.5  ,
       0.625, 0.458, 0.417, 0.583, 0.625, 0.125, 0.5  , 0.625, 0.75 ,
       0.417, 0.75 , 0.5  , 0.708, 0.542, 0.5  , 0.5  , 0.458, 0.125,
       0.333, 0.333, 0.542, 0.167, 0.375, 0.292, 0.   , 0.417, 0.083,
       0.375, 0.375, 0.458, 0.417, 0.292, 0.083, 0.208, 0.5  , 0.333,
       0.208, 0.333, 0.375, 0.417, 0.333, 0.417, 0.375, 0.25 , 0.167,
       0.167, 0.292, 0.292, 0.417, 0.583, 0.458, 0.125, 0.417, 0.208,
       0.25 , 0.417, 0.25 , 0.125, 0.292, 0.417, 0.375, 0.375, 0.208,
       0.333, 0.542, 0.292, 0.417, 0.375, 0.417, 0.417, 0.208, 0.375,
       0.208, 0.667, 0.5  , 0.292, 0.417, 0.208, 0.333, 0.5  , 0.417,
       0.75 , 0.25 , 0.083, 0.5  , 0.333, 0.333, 0.292, 0.542, 0.5  ,
       0.333, 0.417, 0.333, 0.417, 0.333, 0.75 , 0.333, 0.333, 0.25 ,
       0.417, 0.583, 0.458, 0.417, 0.458, 0.458, 0.458, 0.292, 0.5  ,
       0.542, 0.417, 0.208, 0.417, 0.583, 0.417])
- Distribution visualization:
# Note: distplot is deprecated in seaborn 0.11 and later; sns.histplot and sns.displot are its replacements
import seaborn as sns
sns.distplot(s, kde=False, rug=True)
Frequency distribution histogram:
sns.distplot(s, hist=True, kde=True, rug=True)
Histogram with Gaussian kernel density estimate:
2. 11 basic NumPy questions
- Create an array of shape (3, 5) with all elements True
In [1]: np.ones((3, 5), dtype=bool)
Out[1]:
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])
- Create an array of shape (3, 5) with all elements False
In [2]: np.zeros((3, 5), dtype=bool)
Out[2]:
array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False]])
- Convert a one-dimensional array to a two-dimensional array
In [3]: a = np.linspace(1, 5, 10)
In [4]: a.reshape(5, 2)
Out[4]:
array([[1.        , 1.44444444],
       [1.88888889, 2.33333333],
       [2.77777778, 3.22222222],
       [3.66666667, 4.11111111],
       [4.55555556, 5.        ]])
- All odd numbers in the array are replaced with -1
In [5]: m = np.arange(10).reshape(2.5)
In [6]: m[m%2= =1] = -1
In [7]: m
Out[7]:
array([[ 0, -1.2, -1.4], [...1.6, -1.8, -1]])
Copy the code
- Extract all the odd numbers in the array
In [8]: m = np.arange(10).reshape(2.5)
In [9]: m[m%2= =1]
Out[9]: array([1.3.5.7.9])
Copy the code
- The intersection of two NumPy arrays
In [10]: m ,n = np.arange(10), np.arange(1.15.3)
In [11]: np.intersect1d(m,n)
Out[11]: array([1.4.7])
Copy the code
- The difference set of 2 NumPy arrays
In [12]: m ,n = np.arange(10), np.arange(1.15.3)
In [13]: np.setdiff1d(m,n)
Out[13]: array([0.2.3.5.6.8.9])
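np.intersect1d and np.setdiff1d return sorted, de-duplicated values. When a boolean mask over the original array is more convenient, np.isin gives the same membership test element by element. A small sketch with the same m and n:

```python
import numpy as np

m, n = np.arange(10), np.arange(1, 15, 3)

mask = np.isin(m, n)   # True where an element of m also occurs in n

common = m[mask]       # elements of m that are in n
only_m = m[~mask]      # elements of m that are not in n

print(common)
print(only_m)
```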
- Filter all elements within the specified range
Note that each comparison, such as (m > 2), must be wrapped in its own pair of parentheses.
In [14]: m = np.arange(10).reshape(2, 5)
In [15]: m[(m > 2) & (m < 7)]
Out[15]: array([3, 4, 5, 6])
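The parentheses matter because & binds more tightly than the comparison operators. The sketch below shows what happens without them and one alternative, np.logical_and, that sidesteps the precedence question:

```python
import numpy as np

m = np.arange(10).reshape(2, 5)

# Without parentheses this parses as m > (2 & m) < 7, a chained comparison
# that needs a single truth value for a whole array, so NumPy raises ValueError
try:
    m[m > 2 & m < 7]
    raised = False
except ValueError:
    raised = True
print(raised)

# np.logical_and takes the two conditions as separate arguments
selected = m[np.logical_and(m > 2, m < 7)]
print(selected)
```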
- Swap two columns of a two-dimensional array
In [16]: m = np.arange(10).reshape(2, 5)
In [17]: m
Out[17]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
In [18]: m[:, [1, 0, 2, 3, 4]]
Out[18]:
array([[1, 0, 2, 3, 4],
       [6, 5, 7, 8, 9]])
Multiple columns can be swapped at once:
In [19]: m[:, [1, 0, 2, 4, 3]]
Out[19]:
array([[1, 0, 2, 4, 3],
       [6, 5, 7, 9, 8]])
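Indexing with a list of column positions returns a new array and leaves m as it was. To swap the columns inside m itself, the reordered columns can be assigned back. A minimal sketch:

```python
import numpy as np

m = np.arange(10).reshape(2, 5)

# Indexing with a list builds a copy; m is not modified
swapped = m[:, [1, 0, 2, 3, 4]]

# Assigning the reordered columns back swaps columns 0 and 1 in place
m[:, [0, 1]] = m[:, [1, 0]]

print(m)
```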
- Reverse the rows of a two-dimensional array
In [20]: m = np.arange(10).reshape(2, 5)
In [21]: m
Out[21]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
In [22]: m[::-1]
Out[22]:
array([[5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4]])
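The same slicing idea works along the other axis, and np.flip expresses either direction by name. A short sketch; the variable names are illustrative:

```python
import numpy as np

m = np.arange(10).reshape(2, 5)

rows_reversed = np.flip(m, axis=0)  # same result as m[::-1]
cols_reversed = m[:, ::-1]          # reverse the order of the columns instead

print(rows_reversed)
print(cols_reversed)
```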
- Generate random floating-point numbers between 5 and 10 with shape (3, 5)
In [9]: np.random.seed(100)
In [42]: np.random.randint(5, 10, (3, 5)) + np.random.rand(3, 5)
Out[42]:
array([[9.31623868, 5.68431289, 9.5974916 , 5.85600452, 9.3478736 ],
       [5.66356114, 7.78257215, 7.81974462, 6.60320117, 7.17326763],
       [7.77318114, 6.81505713, 9.21447171, 5.08486345, 8.47547692]])
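The example above adds a random integer from 5 to 9 to a random fraction in [0, 1). A more direct route is to draw from a uniform distribution on [5, 10). A sketch using NumPy's Generator interface; the seed 100 simply mirrors the example and the drawn values will differ from those shown above:

```python
import numpy as np

rng = np.random.default_rng(100)

# Uniform floats in the half-open interval [5, 10)
samples = rng.uniform(5, 10, size=(3, 5))

print(samples.shape)
print(samples.min() >= 5, samples.max() < 10)
```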
3. Exercise: Score statistics for the whole class
Imagine a team of five students with the results shown in the table below. Use NumPy to calculate the average, minimum, maximum, variance, and standard deviation of their scores in Chinese, English, and math. Then sort the students by total score and print the ranking.
#!/usr/bin/env python3
#vim: set fileencoding:utf-8
import numpy as np

'''
Imagine a team of 5 students with the results shown in the table below.
1. Use NumPy to calculate the average, minimum, maximum, variance and standard deviation of their scores in Chinese, English and math.
2. Sort by total score and print the resulting ranking.
'''

# Structured dtype: one name field plus three integer score fields
# ('U32' instead of 'S32' so that names are str rather than bytes in Python 3)
scoretype = np.dtype({
    'names': ['name', 'chinese', 'english', 'math'],
    'formats': ['U32', 'i', 'i', 'i']})
peoples = np.array(
    [
        ("zhangfei", 66, 65, 30),
        ("guanyu", 95, 85, 98),
        ("zhaoyun", 93, 92, 96),
        ("huangzhong", 90, 88, 77),
        ("dianwei", 80, 90, 90)
    ], dtype=scoretype)
#print(peoples)

name = peoples[:]['name']
wuli = peoples[:]['chinese']   # Chinese scores
zhili = peoples[:]['english']  # English scores
tili = peoples[:]['math']      # math scores

def show(name, cj):
    # Print one row of statistics for a subject
    print(name, "|", np.mean(cj), "|", np.min(cj), "|", np.max(cj),
          "|", np.var(cj), "|", np.std(cj))

print("Subject | Average | Minimum | Maximum | Variance | Standard deviation")
show("Chinese", wuli)
show("English", zhili)
show("Mathematics", tili)

print("Ranking:")
# Python 3 has no cmp argument; sort by total score with a key function
ranking = sorted(peoples, key=lambda x: x[1] + x[2] + x[3], reverse=True)
print(ranking)
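The same exercise can be sketched with a plain two-dimensional array, so that each statistic covers all three subjects in one call via the axis argument and the ranking comes from np.argsort. This is one possible alternative layout, not the exercise's reference solution; the names scores, totals and order are illustrative:

```python
import numpy as np

# Same data as the exercise: one row per student, columns are Chinese, English, math
names = np.array(["zhangfei", "guanyu", "zhaoyun", "huangzhong", "dianwei"])
scores = np.array([
    [66, 65, 30],
    [95, 85, 98],
    [93, 92, 96],
    [90, 88, 77],
    [80, 90, 90],
])

# axis=0 aggregates down each column, giving one value per subject
means = scores.mean(axis=0)
mins = scores.min(axis=0)
maxs = scores.max(axis=0)
variances = scores.var(axis=0)
stds = scores.std(axis=0)
print(means, mins, maxs, variances, stds)

# axis=1 sums across each row, giving one total per student
totals = scores.sum(axis=1)
order = np.argsort(-totals)   # indices from highest to lowest total
ranking = names[order]
print(ranking, totals[order])
```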
7. Homework
Calculate the average rating of each book
- Read the data in the file rating.txt and analyze it
- There are 10,000 books, represented by numeric IDs
- Each user's rating ranges from 1 to 5
- Each row of data has three numbers: the user ID, the book ID, and the user’s rating of the book
Required output: the average rating of every book
- Analysis of the homework
The file is large, so instead of reading all the data on every test run, we can work on a smaller copy that contains only part of the data.
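One way to test on less data, without editing the file by hand, is the max_rows argument of np.genfromtxt, which stops reading after a given number of lines. The sketch below writes a tiny stand-in file first; its rows, its name rating_small.txt and the comma delimiter are assumptions for illustration, not the real rating.txt:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for rating.txt: user ID, book ID, rating
rows = ["1,1,5", "1,2,3", "2,1,4", "2,3,2", "3,2,4"]
path = os.path.join(tempfile.mkdtemp(), "rating_small.txt")
with open(path, "w") as f:
    f.write("\n".join(rows) + "\n")

# Read only the first 3 lines of the file
sample = np.genfromtxt(path, delimiter=",", dtype=int, max_rows=3)
print(sample)
```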
- Read the data and convert it to integers
import numpy as np
# Assumes the three numbers on each row are separated by commas
data = np.genfromtxt('rating.txt', delimiter=',')
data = data.astype(int)
print(data)
- Create two arrays to hold the rating total and the number of ratings for each book
rating_sum = np.zeros(10000)
rating_people_count = np.zeros(10000)
- Use a for loop to process each row
for rating in data:
    book_id = rating[1] - 1
    rating_sum[book_id] += rating[2]
    rating_people_count[book_id] += 1
The first column is the user ID. It does not help with this problem, so we can ignore it.
Why rating[1] - 1?
Array indices start at 0, whereas the book IDs start at 1, so subtracting 1 lets us use the book ID directly as an index.
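The per-row loop can also be expressed without explicit iteration. A minimal sketch using np.bincount, which sums weights per index; the five data rows and the book count of 3 are made-up stand-ins for the real file and its 10,000 books:

```python
import numpy as np

# Hypothetical rows: user ID, book ID, rating
data = np.array([
    [1, 1, 5],
    [1, 2, 3],
    [2, 1, 4],
    [2, 3, 2],
    [3, 2, 4],
])
n_books = 3  # would be 10000 for the real data

book_idx = data[:, 1] - 1  # shift book IDs to 0-based indices

# Sum of ratings per book, and number of ratings per book
rating_sum = np.bincount(book_idx, weights=data[:, 2], minlength=n_books)
rating_people_count = np.bincount(book_idx, minlength=n_books)

print(rating_sum / rating_people_count)
```

On large inputs this avoids a Python-level loop over every row, which is usually the slow part.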
- The complete code
import numpy as np
# Assumes the three numbers on each row are separated by commas
data = np.genfromtxt('rating.txt', delimiter=',')
data = data.astype(int)
# print(data)
rating_sum = np.zeros(10000)
rating_people_count = np.zeros(10000)
for rating in data:
    book_id = rating[1] - 1
    rating_sum[book_id] += rating[2]
    rating_people_count[book_id] += 1
# Calculation method 1:
result = rating_sum / rating_people_count
print(result)
# Calculation method 2:
print(np.true_divide(rating_sum, rating_people_count))
# Output:
# [4.27970709 4.35135011 3.21434056 ... 4.32352941 3.70769231 4.00900901]
# [4.27970709 4.35135011 3.21434056 ... 4.32352941 3.70769231 4.00900901]
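If any book has no ratings at all, both calculation methods divide 0 by 0 for that book, which yields nan together with a RuntimeWarning. A hedged sketch of one way to handle this explicitly, using the out and where arguments of np.divide; the three-book arrays are made-up stand-ins:

```python
import numpy as np

# Hypothetical totals for three books; the second one has no ratings
rating_sum = np.array([9.0, 0.0, 2.0])
rating_people_count = np.array([2.0, 0.0, 1.0])

# Start from nan everywhere, then divide only where at least one rating exists
result = np.full_like(rating_sum, np.nan)
np.divide(rating_sum, rating_people_count, out=result, where=rating_people_count > 0)

print(result)
```

Unrated books keep the nan placeholder, and no warning is raised because the division is skipped for them.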
Python scientific computing: processing data quickly with NumPy: mp.weixin.qq.com/s/eRaLUEwrx…