Python module - Numpy and Pandas - Moment For Technology

NumPy overview

NumPy is the foundational package for high-performance scientific computing and data analysis in Python. Many other tools, including pandas, are built on top of it.

NumPy features:

ndarray, a multi-dimensional array structure that is fast and memory-efficient
Mathematical functions that operate on whole arrays without explicit loops
Linear algebra, random number generation, and Fourier transform functions

Create an ndarray: np.array(array_list)

The difference between arrays and lists:

- The elements of an array must all be of the same type.
- The size of an array cannot be changed after it is created.
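A minimal sketch of these two differences, assuming NumPy is imported as np (the variable names are only illustrative):

```python
import numpy as np

# A list may mix types; converting it forces one common dtype.
mixed = [1, 2.5, 3]
arr = np.array(mixed)
print(arr.dtype)            # float64: the integers were upcast to float

# An ndarray cannot grow in place; np.append builds a new array instead.
bigger = np.append(arr, 4.0)
print(arr.shape, bigger.shape)
```

The original array keeps its three elements; only the new array returned by np.append has four.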

Commonly used attributes

T: the transpose
size: the number of elements in the array
ndim: the number of dimensions of the array
shape: the size of each dimension (as a tuple)
dtype: the data type of the array elements
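As a quick illustration of these attributes, here is a small sketch on a 2x3 array (the array itself is just an example):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

print(a.T.shape)   # (3, 2): rows and columns swapped
print(a.size)      # 6 elements in total
print(a.ndim)      # 2 dimensions
print(a.shape)     # (2, 3)
print(a.dtype)     # an integer dtype; the exact width depends on the platform
```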

Create an array

np.zeros(10)	# An array of 10 zeros
np.ones(10)		# An array of 10 ones

a = np.empty(100)		# Uninitialized: holds whatever values were already in memory

np.arange(100)	# Quickly create an array of the integers 0 to 99
np.arange(15).reshape(3, 5)	# Create a two-dimensional array
np.arange(2, 10, 0.3)	# From 2 up to (not including) 10 in steps of 0.3
np.linspace(0, 50, 100)	# 100 evenly spaced points from 0 to 50, endpoints included

np.eye(10)	# 10x10 identity matrix (used in linear algebra)

ndarray batch operations

Between arrays and scalars

a+1  a*3  1//a  a**0.5  a>5

Operations between arrays of the same size

a+b  a/b  a**b  a%b  a==b

Indexing

One-dimensional array index: a[5]
Multidimensional array index:
- List-style notation: a[2][1]
- Newer notation: a[2, 1]

Slicing

One-dimensional arrays: a[5:8] a[4:] a[2:10]
Multidimensional arrays: a[1:2, 3:4] a[:, 3:5] a[:, 1]
Slicing a NumPy array differs from slicing a list: slicing an array does not copy the data (it creates a view), so changes made through the slice affect the original array.
- The copy() method creates an independent (deep) copy of an array.
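The view behaviour is easy to check. The sketch below (with illustrative variable names) writes through a slice, through a copy, and through a list slice:

```python
import numpy as np

a = np.arange(10)
view = a[2:5]             # a view onto the same memory as a
view[0] = 100
print(a[2])               # the original array changed as well

b = np.arange(10)
independent = b[2:5].copy()
independent[0] = 100
print(b[2])               # unchanged: copy() made separate storage

lst = list(range(10))
part = lst[2:5]           # list slices are always copies
part[0] = 100
print(lst[2])             # unchanged
```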

ndarray Boolean indexing

Problem: Given an array, select all the numbers in the array greater than 5

Answer: a[a>5]

How it works

Arrays and scalars: a>5 compares each element of a with 5 and returns a Boolean array.
Boolean indexing: passing a Boolean array of the same size as the index returns an array of the elements at the positions where the Boolean array is True.
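The two steps can be seen separately on a small fixed array (the values here are chosen only for illustration):

```python
import numpy as np

a = np.array([3, 8, 5, 6, 1, 7])

mask = a > 5          # step 1: element-wise comparison gives a Boolean array
print(mask)

selected = a[mask]    # step 2: keep the elements where the mask is True
print(selected)
```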

Example:

import random
import numpy as np

# 1. Given an array, pick out all the numbers greater than 5
a = np.array([random.randint(1, 10) for _ in range(20)])
a[a>5]	# e.g. array([8, 6, 6, 7, 7, 6, 6]); the values are random

# 2. Given an array, pick out all the even numbers greater than 5
a = np.array([random.randint(1, 10) for _ in range(20)])
a[(a>5) & (a%2==0)]

# 3. Given an array, pick out all the numbers that are greater than 5 or even
a = np.array([random.randint(1, 10) for _ in range(20)])
a[(a>5) | (a%2==0)]
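Conditions on arrays are combined with the element-wise operators & (and), | (or), and ~ (not), and each condition needs its own parentheses because these operators bind more tightly than comparisons. A small sketch on a fixed array, so the results are reproducible:

```python
import numpy as np

a = np.array([3, 8, 5, 6, 1, 7, 10, 2])

both = a[(a > 5) & (a % 2 == 0)]        # greater than 5 AND even
either = a[(a > 5) | (a % 2 == 0)]      # greater than 5 OR even
neither = a[~((a > 5) | (a % 2 == 0))]  # everything else

print(both, either, neither)
```

Python's own and / or keywords do not work here, since they try to reduce a whole array to a single truth value.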

ndarray fancy indexing

Select values by a list of index positions

a = np.arange(20)
a[[1, 4, 5, 6]]	# array([1, 4, 5, 6])

a = np.arange(20).reshape(4, 5)
# array([[ 0,  1,  2,  3,  4],
#        [ 5,  6,  7,  8,  9],
#        [10, 11, 12, 13, 14],
#        [15, 16, 17, 18, 19]])

a[1, 2:5]	# array([7, 8, 9])
a[0, a[0]>2]	# array([3, 4])
a[[1, 3], [1, 3]]	# array([6, 18]): the elements at (1, 1) and (3, 3)

# To select 6, 8, 16, and 18, index the rows and the columns in two steps
a[[1, 3], :][:, [1, 3]]	# ":" selects everything along that axis
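The difference between pairing the index lists and taking every row/column combination can be checked directly. The sketch below also uses np.ix_, a NumPy helper that was not part of the original text, as one alternative way to get the full block:

```python
import numpy as np

a = np.arange(20).reshape(4, 5)

pairs = a[[1, 3], [1, 3]]            # pairs up positions: (1, 1) and (3, 3)
block = a[[1, 3], :][:, [1, 3]]      # rows 1 and 3 first, then columns 1 and 3
block_ix = a[np.ix_([1, 3], [1, 3])] # same block in a single indexing step

print(pairs)
print(block)
```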

NumPy universal functions

Universal function (ufunc): a function that operates on every element of an array at once

Background: rounding in plain Python

int		# truncates toward zero
round	# rounds to the nearest integer; ties go to the even neighbor
math.floor	# rounds down (toward negative infinity)
math.ceil	# rounds up (toward positive infinity)
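The four rounding rules only disagree on negative numbers and on exact halves, so a short table of such values makes the differences visible (the sample values are arbitrary):

```python
import math

samples = (-2.5, -1.5, 1.5, 2.5)

# Each row: value, int(), round(), math.floor(), math.ceil()
table = [(x, int(x), round(x), math.floor(x), math.ceil(x)) for x in samples]

for row in table:
    print(row)
```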

Common universal functions:

Unary functions

abs 	# absolute value
sqrt	# square root
exp log	# exponential and natural logarithm
ceil 	# round up (toward positive infinity)
floor	# round down (toward negative infinity)
rint 	# round to the nearest integer, ties to even
round	# like rint, with an optional number of decimals
trunc	# truncate toward zero
modf	# split into fractional and integer parts
isnan 	# test for NaN
isinf	# test for inf
cos sin tan

Example

a = np.arange(-5.5, 5)
np.abs(a)
x, y = np.modf(a)	# x holds the fractional parts, y the integer parts

# Filter NaN with NumPy
a = np.arange(0, 5)
b = a / a	# array([nan, 1., 1., 1., 1.]); 0/0 emits a RuntimeWarning
b[~np.isnan(b)]	# ~ inverts the Boolean array

# Filter inf with NumPy
a = np.array([3, 4, 5, 6])
b = np.array([2, 0, 3, 0])
c = a / b	# array([1.5, inf, 1.66666667, inf])
c[c != np.inf]	# filter inf by comparison
c[~np.isinf(c)]	# filter inf with isinf

Binary functions

add
subtract
multiply
divide
power
mod
maximum	# element-wise maximum of two arrays
minimum	# element-wise minimum of two arrays
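Binary universal functions take two arrays and work position by position. A brief sketch with two example arrays:

```python
import numpy as np

a = np.array([1, 5, 3])
b = np.array([4, 2, 3])

total = np.add(a, b)         # same as a + b
remainder = np.mod(a, b)     # same as a % b
larger = np.maximum(a, b)    # larger value at each position
smaller = np.minimum(a, b)   # smaller value at each position

print(total, remainder, larger, smaller)
```

Note that np.maximum compares two arrays element by element, while the max method reduces a single array to one value.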

NumPy mathematical and statistical methods

sum 	# sum
mean	# average
std		# standard deviation
var		# variance

min		# minimum
max		# maximum
argmin	# index of the minimum
argmax	# index of the maximum

# Variance
1 2 3 4 5
mean: 3
variance: ((1-3)**2+(2-3)**2+(3-3)**2+(4-3)**2+(5-3)**2)/5

The variance measures how spread out a set of data is. Standard deviation = sqrt(variance).

For roughly normally distributed data, about 68% of the values fall within this range:
a.mean()+a.std()
a.mean()-a.std()

About 95% of the values fall within this range:
a.mean()+2*a.std()
a.mean()-2*a.std()
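The hand calculation for 1 2 3 4 5 can be checked against NumPy. This sketch assumes the default behaviour of var and std, which divide by the number of elements (ddof=0):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

mean = a.mean()
variance = ((a - mean) ** 2).sum() / a.size   # the formula written out by hand
std = variance ** 0.5

print(mean, variance, std)
print(a.var(), a.std())    # NumPy's built-in results for comparison
```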

NumPy random number generation

The random number functions live in the np.random subpackage

rand	# random array of the given shape, values in [0, 1)
randint	# random integers of the given shape
choice	# random selection, of the given shape, from a sequence
shuffle	# same as random.shuffle: shuffles in place
uniform	# random array of the given shape, drawn uniformly from a range

Example

np.random.randint(0, 10, 10)	# random integer array of length 10, values 0 to 9
np.random.rand(5)
np.random.choice([1, 2, 3, 4, 5], (2, 3))
np.random.uniform(2, 5, (3, 4))

Supplementary: floating-point special values

nan (Not a Number): not equal to any floating-point number, including itself (nan != nan)
inf (infinity): larger than any finite floating-point number
Create special values in NumPy: np.nan np.inf
In data analysis, nan is often used to represent missing values
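Because nan never compares equal to anything, tests for it have to go through np.isnan. A short sketch of these special values:

```python
import numpy as np

nan_equals_itself = (np.nan == np.nan)   # comparison with nan is always False
nan_detected = np.isnan(np.nan)          # the reliable way to test for nan
inf_is_larger = np.inf > 1e308           # inf exceeds any finite float
inf_minus_inf = np.inf - np.inf          # undefined, so the result is nan

print(nan_equals_itself, nan_detected, inf_is_larger, inf_minus_inf)
```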

Pandas Data Analysis

Introduction to pandas

Pandas is a powerful Python data analysis toolkit built on top of NumPy.

Pandas provides the following:

The data structures DataFrame and Series, with their methods
Integrated time series functionality
A rich set of mathematical operations
Flexible handling of missing data

Install: pip install pandas

Import: import pandas as pd

Series: a one-dimensional array object

A Series is an object similar to a one-dimensional array, consisting of a set of data and an associated set of labels (the index).

Creation methods:

pd.Series([4, 5, -5, 3])
pd.Series([4, 5, -5, 3], index=['a', 'b', 'c', 'd'])
pd.Series({'a': 1, 'b': 2, 'c': 3})
pd.Series(0, index=['a', 'b', 'c'])

Get the array of values and the index: the values and index attributes

A Series behaves like a combination of a list (array) and a dictionary

Series features

Series supports array (positional) features:

Create a Series from an ndarray: Series(arr)
Operations with scalars: sr * 2
Operations between two Series: sr1 + sr2
Index: sr[0], sr[[1, 2, 4]]
Slice: sr[0:2]
Universal functions: np.abs(sr)
Boolean filtering: sr[sr>0]

Series supports dictionary (label) features:

Create a Series from a dictionary: Series(dic)
The in operator: 'a' in sr
Key index: sr['a'], sr[['a', 'b', 'd']]

sr.index
sr.values
sr.index[0]
sr[[1, 3]]	# positional; recent pandas versions prefer sr.iloc[[1, 3]]
sr[['a', 'd']]
sr['a':'s']	# slice by label; both endpoints are included
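A compact sketch of the list-like and dictionary-like behaviour on one Series (the values and labels are illustrative; positional access goes through iloc here to stay unambiguous):

```python
import pandas as pd

sr = pd.Series([4, 5, -5, 3], index=['a', 'b', 'c', 'd'])

value_a = sr['a']               # dictionary-style lookup by label
has_a = 'a' in sr               # membership tests the labels, not the values
positives = sr[sr > 0]          # Boolean filtering, as with an ndarray
doubled = sr * 2                # scalar operation applied to every element
by_label = sr['a':'c']          # label slice: includes both 'a' and 'c'
by_position = sr.iloc[0:2]      # position slice: excludes the end position

print(value_a, has_a, len(by_label), len(by_position))
```

The last two lines show the main trap: a label slice includes its end point, while a positional slice does not.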

Series integer index

Integer indexes on pandas Series objects tend to confuse newcomers.

For example:

sr = pd.Series(np.arange(4))
sr[-1]	# raises KeyError: -1 is looked up as a label, not a position

If the index is of an integer type, subscripting with an integer is always label-oriented.

Workaround: the loc attribute (interprets the index as labels) and the iloc attribute (interprets the index as positions)

sr2.loc[10]		# index by label
sr2.iloc[10]	# index by position

sr2.iloc[-1]
sr2.iloc[3:6]
sr2.iloc[[2, 3, 7]]
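The following sketch makes the label/position distinction concrete. The Series is hypothetical: its integer labels deliberately do not start at 0, so labels and positions disagree.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with integer labels 2..5 and values 10..13.
sr = pd.Series(np.arange(10, 14), index=[2, 3, 4, 5])

by_label = sr.loc[2]        # label 2 is the first element
by_position = sr.iloc[2]    # position 2 is the third element
last = sr.iloc[-1]          # negative positions work with iloc

try:
    sr.loc[-1]              # there is no label -1
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

print(by_label, by_position, last, label_lookup_failed)
```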

Series data alignment

When two Series are combined, they are aligned by index first and then added

sr1 = pd.Series([11, 23, 34], index=['c', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['d', 'c', 'a'])
sr1 + sr2		# added after index alignment

sr1 = pd.Series([11, 23, 34], index=['c', 'a', 'd'])
sr2 = pd.Series([11, 20, 10, 21], index=['d', 'c', 'a', 'b'])
sr1 + sr2		# 'b' has no match in sr1, so its result is NaN

sr1 = pd.Series([11, 23, 34], index=['b', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['d', 'c', 'a'])
sr1 + sr2		# b NaN, c NaN

sr1 = pd.Series([11, 23, 34], index=['b', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['d', 'c', 'a'])
# How do we get 11 at index 'b' and 20 at index 'c'?
# Use the arithmetic methods add, sub, div, mul with fill_value
sr1.add(sr2, fill_value=0)
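To see what alignment produces, the last case can be run with the results inspected. This is a sketch using the same index layout, with the first value of sr1 taken as 11:

```python
import pandas as pd

sr1 = pd.Series([11, 23, 34], index=['b', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['d', 'c', 'a'])

plain = sr1 + sr2                      # unmatched labels 'b' and 'c' become NaN
filled = sr1.add(sr2, fill_value=0)    # a missing side is treated as 0

print(plain)
print(filled)
```

With fill_value=0, label 'b' keeps the value from sr1 and label 'c' keeps the value from sr2, while the shared labels are added as usual.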

Handling missing values in a Series

sr.isnull()		# True where the value is NaN
sr.notnull()	# True where the value is not NaN

# Drop missing values
sr = sr[sr.notnull()]
sr = sr.dropna()
# Fill missing values with something else
sr = sr.fillna(0)					# fill with 0
sr = sr.fillna(sr.mean())	# fill with the mean
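A small sketch of dropping versus filling, on an example Series with one missing value. It relies on mean() skipping NaN by default:

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, np.nan, 3.0])

dropped = sr.dropna()                # removes the NaN entry
zero_filled = sr.fillna(0)           # replaces NaN with 0
mean_filled = sr.fillna(sr.mean())   # mean() ignores NaN, so the mean is 2.0

print(dropped.tolist(), zero_filled.tolist(), mean_filled.tolist())
```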

How to create a DataFrame

Within each column of a DataFrame, the values share one data type

A DataFrame is a tabular data structure containing an ordered set of columns. It can be thought of as a dictionary of Series that share one index.

Creation methods

# Method 1
pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

# Method 2
pd.DataFrame({'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd'])})
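When a DataFrame is built from Series, the Series are aligned on their labels, and labels missing from one Series become NaN. A sketch of the second creation method with the result inspected:

```python
import pandas as pd

df = pd.DataFrame({
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd']),
})

print(df)
# 'one' has no value for label 'd', so that cell is NaN
# and the whole column is stored as float.
print(df['one'].dtype)
```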

CSV file reading and writing

df.to_csv('test2.csv')
pd.read_csv('test2.csv')

DataFrame common attributes

index	# the row index
T	# the transpose
columns	# the column index
values	# the array of values
describe()	# quick summary statistics

DataFrame indexing and slicing

A DataFrame is a two-dimensional data type, so it has both a row index and a column index.

A DataFrame can also be indexed and sliced by label or by position

The loc and iloc attributes

Usage: separate the two parts with a comma, row index first and column index second
The row and column parts can each be a regular index, a slice, a Boolean index, or a fancy index

loc interprets the indexes as labels (index names) and iloc interprets them as integer positions

df['one']['a']			# 'one' is the column index, 'a' is the row index
df.loc['a', 'one']	# 'a' is the row label, 'one' is the column label
df.loc['a', :]	# select one row

df.loc[['a', 'c'], :]
df.loc[['a', 'c'], 'two']		# the parts can be combined freely
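The sketch below runs the same selections by label and by position on a small example DataFrame, including a Boolean row filter:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

cell_by_label = df.loc['a', 'one']        # row label, column label
cell_by_position = df.iloc[0, 0]          # row position, column position
fancy = df.loc[['a', 'c'], 'two']         # list of row labels, one column
filtered = df.loc[df['one'] > 1, :]       # Boolean index on the rows
label_slice = df.loc['a':'b', :]          # includes the end label 'b'
position_slice = df.iloc[0:2, :]          # excludes the end position 2

print(cell_by_label, cell_by_position)
print(fancy.tolist(), filtered.index.tolist())
```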

DataFrame data alignment and missing data

When DataFrame objects are combined in an operation, their row and column indexes are each aligned.

DataFrame methods for missing data:

dropna(axis=0, how='any', ...)	# axis=0 drops rows, axis=1 drops columns; how='any' drops when any value is missing, how='all' only when all values are missing
fillna()		# fill
isnull()
notnull()

df.dropna(how='all')	# the default for how is 'any'
df2.dropna(axis=1)		# the default for axis is 0
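The effect of how and axis is easiest to see on a small frame with a partly empty row and a fully empty row. The data below is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'one': [1.0, np.nan, np.nan],
        'two': [4.0, 5.0, np.nan],
        'three': [7.0, 8.0, np.nan],
    },
    index=['a', 'b', 'c'],
)

any_rows = df.dropna()                 # drop rows with at least one NaN
all_rows = df.dropna(how='all')        # drop only rows that are entirely NaN
clean_cols = all_rows.dropna(axis=1)   # then drop columns that still hold NaN

print(any_rows.index.tolist())
print(all_rows.index.tolist())
print(clean_cols.columns.tolist())
```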

Other common pandas methods

mean(axis=0, skipna=False)
sum(axis=1)
sort_index(axis, ..., ascending)	# sort by the row or column index
sort_values(by, axis, ascending)	# sort by the values of a column (or row)
Note that NaN values do not take part in sorting and are placed at the end
NumPy universal functions also work on pandas objects

Example

df.sort_values(by='two')	# sort by column 'two'
df.sort_values(by='two', ascending=False)	# sort by column 'two' in descending order
df.sort_values(by=1, ascending=False, axis=1)	# sort the columns by the values in the row labeled 1, descending

df.sort_index()		# sort by row index
df.sort_index(ascending=False)	# sort by row index in descending order
df.sort_index(ascending=False, axis=1)	# sort by column index in descending order
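A short sketch showing where NaN ends up when sorting, and how skipna changes mean. The frame is an invented example with one missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'one': [3, 1, 2], 'two': [np.nan, 9.0, 4.0]},
    index=['c', 'a', 'b'],
)

ascending = df.sort_values(by='two')                     # NaN row goes last
descending = df.sort_values(by='two', ascending=False)   # NaN row still last
by_index = df.sort_index()

mean_skipping = df.mean()['two']                 # NaN ignored by default
mean_strict = df.mean(skipna=False)['two']       # NaN propagates

print(ascending.index.tolist(), descending.index.tolist())
print(mean_skipping, mean_strict)
```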

Pandas time objects

Time series types:

Timestamp: a specific moment
Fixed period: e.g. July 2017
Interval: start time to end time

Handling time objects with the Python standard library: datetime

Flexible parsing of time strings: dateutil, dateutil.parser.parse()

Handling time objects in bulk: pandas, pd.to_datetime()

import datetime
datetime.datetime.strptime('2010-08-21', '%Y-%m-%d')
import dateutil.parser
dateutil.parser.parse('2001-01-01')
dateutil.parser.parse('200101/01')
dateutil.parser.parse('2001/01/01')
dateutil.parser.parse('01/01/2020')
pd.to_datetime(['2019-01-01', '2010/Feb/02'])	# pandas 2.x may need format='mixed' when the strings use different formats

Pandas time object handling

Generate an array of time objects: date_range

start	# start time
end	# end time
periods	# number of periods
freq	# time frequency, default 'D' (day); options include H(our), W(eek), B(usiness day), S(emi-)M(onth), T/min(utes), S(econd), A(nnual), ...; pandas 2.2+ prefers the lowercase aliases 'h', 'min', 's'

pd.date_range('2010-01-01', '2010-5-1')
pd.date_range('2010-01-01', periods=60)
pd.date_range?	# IPython only: show the help for date_range
pd.date_range('2010-01-01', periods=60, freq='H')	# hourly ('h' in pandas 2.2+)
pd.date_range('2010-01-01', periods=60, freq='W')
pd.date_range('2010-01-01', periods=60, freq='W-MON')
pf = pd.date_range('2010-01-01', periods=60, freq='B')
pd.date_range('2010-01-01', periods=60, freq='1h20min')
pf[0].to_pydatetime()
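To make the frequency codes concrete, this sketch generates a few short ranges and prints the resulting dates. 1 January 2010 was a Friday, which is why the business-day and Monday ranges skip the weekend:

```python
import pandas as pd

days = pd.date_range('2010-01-01', '2010-01-10')               # both ends included
business = pd.date_range('2010-01-01', periods=5, freq='B')    # weekdays only
mondays = pd.date_range('2010-01-01', periods=3, freq='W-MON') # weekly, on Mondays

print(len(days))
print(business.strftime('%Y-%m-%d').tolist())
print(mondays.strftime('%Y-%m-%d').tolist())

first = business[0].to_pydatetime()   # convert one Timestamp to datetime.datetime
print(type(first))
```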

Pandas Time series

A time series is a Series or DataFrame whose index is made of time objects

When datetime objects are used as an index, they are stored in a DatetimeIndex

Special features of time series:

Pass a year or a year-month string to select that whole period
Pass a date range as a slice
Rich method support: resample(), truncate(), ...

sr = pd.Series(np.arange(1000), index=pd.date_range('2017-1-1', periods=1000))
sr['2017-3']
sr['2017-4']
sr['2017']
sr['2017':'2018-3']
sr['2017-12-24':'2018-2-1']
sr.resample('W').sum()	# sum by week
sr.resample('M').sum()	# sum by month ('ME' in pandas 2.2+)
sr.resample('M').mean()	# mean by month
sr.truncate(before='2018-2-3')	# drop everything before the date; slicing by date does the same and is usually simpler
sr.truncate(after='2018-2-3')
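The sketch below checks a few of these selections by counting rows. It uses loc for the string lookups, which is an assumption about style rather than something the original requires, and it sticks to weekly resampling because the weekly alias is the same across pandas versions:

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(1000), index=pd.date_range('2017-1-1', periods=1000))

march = sr.loc['2017-03']                   # every day in March 2017
year = sr.loc['2017']                       # every day in 2017
span = sr.loc['2017-12-24':'2018-2-1']      # date slice, both ends included

weekly = sr.resample('W').sum()             # weekly bins ending on Sunday

print(len(march), len(year), len(span))
print(weekly.head(2))
```

Since 1 January 2017 was a Sunday, the first weekly bin holds only that one day.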

Pandas file handling

Common data file format: csv (values separated by a delimiter character)

Reading files with pandas: data can be loaded from file names, URLs, and file objects

read_csv: the default delimiter is a comma
read_table: the default delimiter is a tab character

Main parameters of read_csv and read_table

sep	# the delimiter; may be a regular expression such as '\s+'
header=None	# the file has no column-name row
index_col	# the column to use as the index
skiprows	# rows to skip
na_values	# strings to treat as missing values (parsed to NaN)
parse_dates	# whether to parse columns as dates; a Boolean or a list

Example

pd.read_csv('zn2006.csv')
pd.read_csv('zn2006.csv', index_col=0)		# use column 0 as the index
pd.read_csv('zn2006.csv', index_col='datetime')		# use the 'datetime' column as the index
pd.read_csv('zn2006.csv', index_col='datetime', parse_dates=True)
pd.read_csv('zn2006.csv', index_col='datetime', parse_dates=['datetime'])
pd.read_csv('zn2006.csv', header=None)
pd.read_csv('zn2006.csv', header=None, names=list('qwertyuiopasdfg'))
pd.read_csv('zn2006.csv', header=None, skiprows=[2, 4])
pd.read_csv('zn2006.csv', na_values=['None'])		# treat the string 'None' as NaN

Writing to a CSV file: the to_csv method

Main parameters of to_csv:

sep	# the delimiter
na_rep	# the string written for missing values; default is an empty string
header=False	# do not write the column-name row
index=False	# do not write the row index
columns	# the columns to write, passed as a list

df.to_csv('test.csv', header=False, index=False, na_rep='null')
df.to_csv('test.csv', header=False, index=False, na_rep='null', columns=['open', 'high', 'low', 'close'])
df.to_html('test.html')
pd.read_excel('test.xlsx')
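Reading and writing can be tried without any file on disk by using an in-memory buffer. In this sketch the CSV text, the column names, and the 'missing' marker are all invented for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV content; 'missing' marks an absent value.
csv_text = (
    "datetime,open,close\n"
    "2006-01-04,10.0,missing\n"
    "2006-01-05,11.0,11.5\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col='datetime',      # use this column as the row index
    parse_dates=True,          # parse the index as dates
    na_values=['missing'],     # treat this string as NaN
)

buffer = io.StringIO()
df.to_csv(buffer, na_rep='null', columns=['open', 'close'])
written = buffer.getvalue()

print(df)
print(written)
```

The same calls work with a file name in place of the buffer.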

Rolling function

Compute a statistic over a sliding window, here the maximum of the latest 10 rows

df['pre_high'] = df['high'].rolling(10).max()
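A smaller window makes the behaviour easy to follow by hand. This sketch uses a window of 3 on made-up values; the first rows have no full window yet, so they are NaN:

```python
import pandas as pd

df = pd.DataFrame({'high': [1, 3, 2, 5, 4]})

# Maximum over the current row and the two rows before it.
df['pre_high'] = df['high'].rolling(3).max()

print(df)
```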

Other file types supported by pandas

json
xml
html
databases
pickle
excel
