Introduction to NumPy

NumPy is the foundation package for high-performance scientific computing and data analysis in Python. Many other tools, such as pandas, are built on it.

NumPy features:

  • ndarray, a multi-dimensional array structure that is fast and memory-efficient
  • Mathematical functions that operate on whole arrays without explicit loops
  • Linear algebra, random number generation and Fourier transform functions

Create an ndarray: np.array(array_list)

The difference between arrays and lists:

  • The elements of an array must all be of the same type
  • The size of an array cannot be changed after it is created

Commonly used attributes

  • T: the transpose of the array
  • size: the number of elements in the array
  • ndim: the number of dimensions of the array
  • shape: the size of the array along each dimension (as a tuple)
  • dtype: the data type of the array elements
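
As a minimal sketch of these attributes, assuming a small 2x3 integer array made up for illustration:

```python
import numpy as np

# Illustrative array: [[0, 1, 2], [3, 4, 5]]
a = np.arange(6).reshape(2, 3)

print(a.T.shape)   # shape of the transpose: (3, 2)
print(a.size)      # total number of elements: 6
print(a.ndim)      # number of dimensions: 2
print(a.shape)     # size along each dimension: (2, 3)
print(a.dtype)     # element type, an integer dtype here
```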

Create an array

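import numpy as np

np.zeros(10)  # array of 10 zeros
np.ones(10)   # array of 10 ones

a = np.empty(100)  # uninitialized array; the values are whatever is already in memory

np.arange(100)               # array of the integers 0 to 99
np.arange(15).reshape(3, 5)  # two-dimensional array with 3 rows and 5 columns
np.arange(2, 10, 0.3)        # from 2 up to (but not including) 10 in steps of 0.3
np.linspace(0, 50, 100)      # 100 evenly spaced points from 0 to 50, both ends included

np.eye(10)  # 10x10 identity matrix, used in linear algebra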

Ndarray batch computing

Between arrays and scalars

a+1 a*3 1//a a**0.5 a>5

Operations between arrays of the same size

a+b a/b a**b a%b a==b

Indexing

  • One-dimensional array index: a[5]

  • Multidimensional array index

    • List-style notation: a[2][1]
    • Preferred notation: a[2, 1]

Slicing

  • One-dimensional arrays: a[5:8] a[4:] a[2:10]
  • Multidimensional arrays: a[1:2, 3:4] a[:, 3:5] a[:, 1]
  • Slicing a NumPy array differs from slicing a list: the slice is not copied. It is a view onto the original data, so changes made through the slice also change the original array
    • Use the copy() method to get an independent copy of an array
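
A short sketch of the difference, using made-up values. The names `view`, `dup`, `lst` and `part` are illustrative only:

```python
import numpy as np

a = np.arange(10)

view = a[2:5]        # a slice is a view onto a's data
view[0] = 100        # writes through to a

dup = a[2:5].copy()  # copy() gives independent data
dup[1] = -1          # a is not affected

lst = list(range(10))
part = lst[2:5]      # slicing a list creates a new list
part[0] = 100        # lst is not affected

print(a[2], a[3])    # 100 3
print(lst[2])        # 2
```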

ndarray Boolean indexing

Problem: Given an array, select all the numbers in the array greater than 5

Answer: a[a>5]

How it works

  • Array and scalar: a>5 compares each element of a with 5 and returns a Boolean array
  • Boolean indexing: indexing with a Boolean array of the same size returns an array of the elements at the positions where the Boolean array is True

Example:

import random
import numpy as np

# 1. Given an array, select all the numbers greater than 5
a = np.array([random.randint(1, 10) for _ in range(20)])
a[a>5]  # for example: array([8, 6, 6, 7, 7, 6, 6])

# 2. Given an array, select all the even numbers greater than 5
a = np.array([random.randint(1, 10) for _ in range(20)])
a[(a>5) & (a%2==0)]

# 3. Given an array, select all the numbers that are greater than 5 or even
a = np.array([random.randint(1, 10) for _ in range(20)])
a[(a>5) | (a%2==0)]

ndarray fancy indexing

Selects values by a list of index positions

a = np.arange(20)
a[[1, 4, 5, 6]]  # array([1, 4, 5, 6])

a = np.arange(20).reshape(4, 5)
# array([[ 0,  1,  2,  3,  4],
#        [ 5,  6,  7,  8,  9],
#        [10, 11, 12, 13, 14],
#        [15, 16, 17, 18, 19]])

a[0, 2:5]           # array([2, 3, 4])
a[0, a[0]>2]        # array([3, 4])
a[[1, 3], [1, 3]]   # array([6, 18]): the two lists are paired into (1, 1) and (3, 3)

# To select 6, 8, 16 and 18, index the rows first and then the columns
a[[1, 3], :][:, [1, 3]]  # ':' selects everything along that axis
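
The sketch below contrasts paired fancy indexing with the two-step row-then-column selection. It also shows `np.ix_`, which is not used in the text above but appears to be the usual single-step way to select a block:

```python
import numpy as np

a = np.arange(20).reshape(4, 5)

# Two index lists are paired element by element: (1, 1) and (3, 3).
pairs = a[[1, 3], [1, 3]]

# Rows 1 and 3 first, then columns 1 and 3 of that result.
chained = a[[1, 3], :][:, [1, 3]]

# np.ix_ builds an open mesh so the same block is selected in one step.
grid = a[np.ix_([1, 3], [1, 3])]

print(pairs)    # [ 6 18]
print(chained)  # [[ 6  8]
                #  [16 18]]
```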

NumPy universal functions

Universal function (ufunc): a function that operates element by element on all the elements of an array

Supplementary knowledge:

int         # truncates toward zero
round       # rounds to the nearest integer; a tie goes to the nearest even number
math.floor  # rounds down, toward negative infinity (floor)
math.ceil   # rounds up, toward positive infinity (ceiling)
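
A small sketch of the four rounding behaviours on sample values chosen to include ties and negative numbers:

```python
import math

values = [-2.5, -1.5, 0.5, 1.5, 2.5]

truncated = [int(v) for v in values]       # toward zero:  [-2, -1, 0, 1, 2]
rounded = [round(v) for v in values]       # ties to even: [-2, -2, 0, 2, 2]
floored = [math.floor(v) for v in values]  # downward:     [-3, -2, 0, 1, 2]
ceiled = [math.ceil(v) for v in values]    # upward:       [-2, -1, 1, 2, 3]
```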

Common universal functions:

  • Unary functions

    abs    # absolute value
    sqrt   # square root
    exp log
    ceil   # rounds up (ceiling)
    floor  # rounds down (floor)
    rint   # rounds to the nearest integer, like round
    round  # rounds to the nearest value; a tie goes to the nearest even number
    trunc  # truncates toward zero
    modf   # splits each value into its fractional and integer parts
    isnan  # tests for nan
    isinf  # tests for inf
    cos sin tan

    Example

    a = np.arange(-5.5, 5)
    np.abs(a)
    x, y = np.modf(a)  # x holds the fractional parts, y the integer parts

    # Filter out nan
    a = np.arange(0, 5)
    b = a / a          # array([nan, 1., 1., 1., 1.]); 0/0 gives nan with a RuntimeWarning
    b[~np.isnan(b)]    # ~ negates the Boolean array

    # Filter out inf
    a = np.array([3, 4, 5, 6])
    b = np.array([2, 0, 3, 0])
    c = a / b
    c[c != np.inf]     # filter inf by comparison
    c[~np.isinf(c)]    # filter inf with isinf
  • Binary functions

    add
    subtract
    multiply
    divide
    power
    mod
    maximum  # element-wise maximum of two arrays
    minimum  # element-wise minimum of two arrays
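
A brief sketch of a few of these functions on two made-up arrays:

```python
import numpy as np

a = np.array([1, 5, 3])
b = np.array([4, 2, 6])

total = np.add(a, b)        # same result as a + b
larger = np.maximum(a, b)   # element-wise maximum: [4, 5, 6]
smaller = np.minimum(a, b)  # element-wise minimum: [1, 2, 3]

# modf returns the fractional parts first, then the integer parts.
frac, whole = np.modf(np.array([-1.5, 2.25]))
```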

NumPy mathematical and statistical methods

sum     # sum
mean    # mean (average)
std     # standard deviation
var     # variance

min     # minimum
max     # maximum
argmin  # index of the minimum
argmax  # index of the maximum

# Variance of a data set
1 2 3 4 5
mean: 3
((1-3)**2+(2-3)**2+(3-3)**2+(4-3)**2+(5-3)**2)/5  # variance = 2.0

The variance measures how spread out a data set is. Standard deviation = sqrt(variance).

For roughly normally distributed data, about 68% of the values lie within this range:
a.mean()+a.std()
a.mean()-a.std()

About 95% of the values lie within this range:
a.mean()+2*a.std()
a.mean()-2*a.std()
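
The hand calculation above can be checked with a short sketch. It relies on `var` and `std` using the population formula (division by n) by default, which matches the division by 5 in the worked example:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

mean = a.mean()                      # 3.0
variance = ((a - mean) ** 2).mean()  # 2.0, the same value as a.var()
std = np.sqrt(variance)              # the same value as a.std()

# One standard deviation either side of the mean.
low, high = mean - std, mean + std
```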

NumPy random number generation

The random number functions are in the np.random subpackage

rand     # array of random floats in the range 0 to 1, with the given shape
randint  # array of random integers with the given shape
choice   # array of random selections from a given sequence, with the given shape
shuffle  # same as random.shuffle
uniform  # array of random floats from a given range, with the given shape

Example

np.random.randint(0, 10, 10)  # array of 10 random integers from 0 to 9
np.random.rand(5)
np.random.choice([1, 2, 3, 4, 5], (2, 3))
np.random.uniform(2, 5, (3, 4))

Supplementary: floating-point special values

  • nan (Not a Number): not equal to any floating-point number, including itself (nan != nan)
  • inf (infinity): larger than any finite floating-point number
  • Create the special values in NumPy: np.nan np.inf
  • In data analysis, nan is often used to represent missing data values
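
A minimal sketch of these properties, with a made-up array containing one missing value:

```python
import numpy as np

not_equal = np.nan != np.nan   # True: nan is not equal even to itself
is_nan = np.isnan(np.nan)      # True: use isnan to test for nan
bigger = np.inf > 1e308        # True: inf exceeds any finite float

data = np.array([1.0, np.nan, 3.0])
mean_with_nan = data.mean()    # nan propagates through arithmetic
clean = data[~np.isnan(data)]  # keep only the values that are not nan
```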

Pandas Data Analysis

Introduction to pandas

pandas is a powerful Python data analysis toolkit built on top of NumPy.

pandas provides the following features:

  • The DataFrame and Series data structures and their methods
  • Integrated time series functionality
  • A rich set of mathematical operations
  • Flexible handling of missing data

Install: pip install pandas

Import: import pandas as pd

Series: a one-dimensional array object

A Series is an object similar to a one-dimensional array, consisting of a set of data and a set of data labels (the index) associated with it.

Creation methods:

pd.Series([4, 5, -5, 3])
pd.Series([4, 5, -5, 3], index=['a', 'b', 'c', 'd'])
pd.Series({'a': 1, 'b': 2, 'c': 3})
pd.Series(0, index=['a', 'b', 'c'])

Get the array of values and the array of index labels through the values and index attributes

A Series behaves like a combination of a list (array) and a dictionary

Series features

Series supports array features (positions):

  • Create a Series from an ndarray: Series(arr)
  • Operations with scalars: sr * 2
  • Operations between two Series: sr1+sr2
  • Indexing: sr[0] sr[[1, 2, 4]]
  • Slicing: sr[0:2]
  • Universal functions: np.abs(sr)
  • Boolean filtering: sr[sr>0]

Series supports dictionary features (labels):

  • Create a Series from a dictionary: Series(dic)
  • The in operator: 'a' in sr
  • Key indexing: sr['a'] sr[['a', 'b', 'd']]

sr.index
sr.values
sr.index[0]
sr[[1, 3]]       # by position; recent pandas versions deprecate this form, sr.iloc[[1, 3]] is explicit
sr[['a', 'd']]
sr['a':'s']      # slice by label; both endpoints are included
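
A compact sketch of both sides of a Series, the array-like and the dictionary-like, on a small example Series. The values are the ones used in the creation examples; the variable names are illustrative:

```python
import pandas as pd

sr = pd.Series([4, 5, -5, 3], index=['a', 'b', 'c', 'd'])

# Array-like behaviour
doubled = sr * 2               # arithmetic with a scalar
positive = sr[sr > 0]          # Boolean filtering
position_slice = sr.iloc[0:2]  # position slice: the end is excluded

# Dictionary-like behaviour
has_a = 'a' in sr              # membership test on the labels
picked = sr[['a', 'd']]        # select several labels
label_slice = sr['a':'c']      # label slice: both endpoints are included
```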

Series integer indexes

Integer indexes on pandas Series objects tend to confuse newcomers.

For example:

sr = pd.Series(np.arange(4))
sr[-1]  # raises KeyError: -1 is looked up as a label, not as a position

If the index is of an integer type, subscripting with an integer is always label-oriented.

Workaround: use the loc attribute (interprets the key as a label) or the iloc attribute (interprets the key as a position)

# sr2 is assumed to be a Series with an integer index
sr2.loc[10]   # index by label
sr2.iloc[10]  # index by position

sr2.iloc[-1]
sr2.iloc[3:6]
sr2.iloc[[2, 3, 7]]
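
The sketch below makes the label/position difference concrete. The Series `sr1` and `sr2` are constructed here for illustration, so that the labels of `sr2` (10 to 19) no longer coincide with its positions (0 to 9):

```python
import numpy as np
import pandas as pd

sr1 = pd.Series(np.arange(20))
sr2 = sr1.iloc[10:].copy()  # labels 10..19, positions 0..9

by_label = sr2.loc[10]      # label 10 holds the value 10
by_position = sr2.iloc[0]   # position 0 is the same element
last = sr2.iloc[-1]         # negative positions work with iloc

# A label that does not exist raises KeyError with loc.
missing = False
try:
    sr2.loc[-1]
except KeyError:
    missing = True
```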

Series data alignment

When two Series objects are combined in an operation, they are first aligned by index and then computed

sr1 = pd.Series([11, 23, 34], index=['c', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['d', 'c', 'a'])
sr1+sr2  # values are added after aligning on the index

sr1 = pd.Series([11, 23, 34], index=['c', 'a', 'd'])
sr2 = pd.Series([11, 20, 10, 21], index=['d', 'c', 'a', 'b'])
sr1+sr2  # 'b' is missing from sr1, so the result at 'b' is NaN

sr1 = pd.Series([11, 23, 34], index=['b', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['d', 'c', 'a'])
sr1+sr2  # 'b' and 'c' each appear in only one Series, so both results are NaN

sr1 = pd.Series([11, 23, 34], index=['b', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['d', 'c', 'a'])
# How to make the result 11 at index 'b' and 20 at index 'c':
# use the add, sub, div and mul methods, which accept fill_value
sr1.add(sr2, fill_value=0)
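
A runnable sketch of the last case, comparing plain addition with `add(..., fill_value=0)`. The values are chosen for illustration:

```python
import pandas as pd

sr1 = pd.Series([11, 23, 34], index=['b', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['d', 'c', 'a'])

plain = sr1 + sr2                    # 'b' and 'c' become NaN
filled = sr1.add(sr2, fill_value=0)  # the missing side is treated as 0

print(filled)
# a    33.0
# b    11.0
# c    20.0
# d    45.0
# dtype: float64
```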

Handling missing values in a Series

sr.isnull()   # True where the value is NaN
sr.notnull()  # True where the value is not NaN

# Drop the missing values
sr = sr[sr.notnull()]
sr = sr.dropna()
# Fill the missing values with another value
sr = sr.fillna(0)          # fill with 0
sr = sr.fillna(sr.mean())  # fill with the mean

How to create a DataFrame

All values within a single DataFrame column share one data type; different columns may have different types.

A DataFrame is a tabular data structure containing an ordered set of columns. A DataFrame can be thought of as a dictionary of Series that share one index.

Creation methods

# Method 1
pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

# Method 2
pd.DataFrame({'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd'])})

CSV file reading and writing

df.to_csv('test2.csv')
pd.read_csv('test2.csv')

Common DataFrame attributes

index       # row index
T           # transpose
columns     # column index
values      # array of values
describe()  # quick summary statistics

DataFrame indexing and slicing

A DataFrame is a two-dimensional data type, so it has both a row index and a column index.

A DataFrame can also be indexed and sliced by label and by position

The loc and iloc attributes

  • Usage: row index first, then column index, separated by a comma
  • The row and column parts can each be a regular index, a slice, a Boolean index, or a fancy index

loc interprets the keys as index labels and iloc interprets them as integer positions

df['one']['a']      # 'one' is the column label and 'a' is the row label
df.loc['a', 'one']  # row 'a', column 'one'
df.loc['a', :]      # select one whole row

df.loc[['a', 'c'], :]
df.loc[['a', 'c'], 'two']  # any combination works
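
As a sketch of `loc` next to `iloc`, assuming a small DataFrame made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

by_label = df.loc['a', 'one']        # row label 'a', column label 'one'
by_position = df.iloc[0, 0]          # first row, first column: the same cell
rows = df.loc[['a', 'c'], :]         # fancy index on rows, all columns
second_col = df.iloc[:, 1]           # all rows, second column by position
masked = df.loc[df['one'] > 1, 'two']  # Boolean row filter, one column
```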

DataFrame data alignment and missing data

When DataFrame objects are combined in an operation, their row indexes and column indexes are aligned separately.

Methods for handling missing data in a DataFrame:

dropna(axis=0, how='any', ...)  # axis=0 drops rows, axis=1 drops columns; how='any' drops when any value is missing, how='all' only when all values are missing
fillna()  # fill missing values
isnull()
notnull()

df.dropna(how='all')  # how defaults to 'any'
df2.dropna(axis=1)    # axis defaults to 0
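
A sketch of how `how` and `axis` change the result of `dropna`, on a made-up DataFrame in which row 1 is entirely missing and every column has at least one gap:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': [1.0, np.nan, 3.0],
    'two': [4.0, np.nan, 6.0],
    'three': [np.nan, np.nan, 9.0],
})

any_rows = df.dropna()           # how='any': only row 2 has no missing value
all_rows = df.dropna(how='all')  # only row 1 is missing everywhere
cols = df.dropna(axis=1)         # every column has a NaN, so none remain
filled = df.fillna(0)            # replace every NaN with 0
```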

Other common pandas methods

mean(axis=0, skipna=False)
sum(axis=1)
sort_index(axis, ...)             # sort by the row or column index
sort_values(by, axis, ascending)  # sort by the values of a column (or row)

Note that NaN values do not take part in the sort and are placed at the end. The NumPy universal functions also work on pandas objects.

Example

df.sort_values(by='two')                       # sort by column 'two'
df.sort_values(by='two', ascending=False)      # sort by column 'two' in descending order
df.sort_values(by=1, ascending=False, axis=1)  # sort the columns by the values in row 1, descending

df.sort_index()                         # sort by row index
df.sort_index(ascending=False)          # sort by row index in descending order
df.sort_index(ascending=False, axis=1)  # sort by column index in descending order
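
The note about NaN ordering can be checked with a short sketch on made-up data. The missing value stays last in both sort directions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'two': [3.0, np.nan, 1.0]}, index=['a', 'b', 'c'])

asc = df.sort_values(by='two')                    # order: c, a, b
desc = df.sort_values(by='two', ascending=False)  # order: a, c, b
```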

pandas time objects

Time series types:

  • Timestamp: a specific moment in time
  • Fixed period: for example, July 2017
  • Interval: a start time and an end time

Python standard library for time objects: datetime

Flexible parsing of time strings: dateutil, dateutil.parser.parse()

Batch conversion of time objects: pandas, pd.to_datetime()

import datetime
import dateutil.parser
datetime.datetime.strptime('2010-08-21', '%Y-%m-%d')
dateutil.parser.parse('2001-01-01')
dateutil.parser.parse('200101/01')
dateutil.parser.parse('2001/01/01')
dateutil.parser.parse('01/01/2020')
pd.to_datetime(['2019-01-01', '2010/Feb/02'])  # with mixed formats, pandas 2.0 and later may need format='mixed'

Generating time objects with pandas

Generate an array of time objects: date_range

start    # start time
end      # end time
periods  # number of periods to generate
freq     # time frequency; the default is 'D' (day). Other options: H(our), W(eek), B(usiness day), S(emi-)M(onth), T or min(ute), S(econd), A(nnual), ...

pd.date_range('2010-01-01', '2010-5-1')
pd.date_range('2010-01-01', periods=60)
pd.date_range?  # IPython syntax: show the help for date_range
pd.date_range('2010-01-01', periods=60, freq='H')  # recent pandas versions prefer lowercase 'h'
pd.date_range('2010-01-01', periods=60, freq='W')
pd.date_range('2010-01-01', periods=60, freq='W-MON')
dr = pd.date_range('2010-01-01', periods=60, freq='B')
pd.date_range('2010-01-01', periods=60, freq='1h20min')
dr[0].to_pydatetime()  # convert a pandas Timestamp to a datetime.datetime
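
A small sketch of `date_range` with a few of the frequencies above. The dates are arbitrary, and 2010-01-01 fell on a Friday:

```python
import datetime
import pandas as pd

days = pd.date_range('2010-01-01', '2010-01-10')              # end date is included
mondays = pd.date_range('2010-01-01', periods=4, freq='W-MON')  # weekly, anchored on Monday
business = pd.date_range('2010-01-01', periods=5, freq='B')     # weekdays only

first = days[0].to_pydatetime()  # plain datetime.datetime object

print(len(days))   # 10
print(mondays[0])  # 2010-01-04 00:00:00
```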

pandas time series

A time series is a Series or DataFrame indexed by time objects

datetime objects used as an index are stored in a DatetimeIndex object

Special features of time series:

  • Pass a "year" or "year-month" string to select that whole period
  • Pass a date range as a slice
  • Rich function support: resample(), truncate(), ...

sr = pd.Series(np.arange(1000), index=pd.date_range('2017-1-1', periods=1000))
sr['2017-3']
sr['2017-4']
sr['2017']
sr['2017':'2018-3']
sr['2017-12-24':'2018-2-1']
sr.resample('W').sum()   # sum by week
sr.resample('M').sum()   # sum by month (recent pandas versions use 'ME' for month end)
sr.resample('M').mean()  # mean by month
sr.truncate(before='2018-2-3')  # drop everything before this date; slicing by date usually does the same job
sr.truncate(after='2018-2-3')   # drop everything after this date
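
The sketch below checks the partial-string selection and resampling on the same 1000-day Series. It uses `.loc` for the string selections, which is an assumption about style rather than the text's own form:

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(1000), index=pd.date_range('2017-01-01', periods=1000))

march = sr.loc['2017-03']                 # every day in March 2017
year = sr.loc['2017']                     # every day in 2017
span = sr.loc['2017-12-24':'2018-02-01']  # date slice, both ends included
weekly = sr.resample('W').sum()           # one total per week

print(len(march), len(year), len(span))   # 31 365 40
```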

pandas file processing

Common data file format: csv (values separated by a delimiter character)

Reading files with pandas: data can be loaded from file names, URLs, and file objects

  • read_csv: the default delimiter is a comma
  • read_table: the default delimiter is a tab character

Main parameters of the read_csv and read_table functions

sep          # the delimiter; may be a regular expression, for example '\s+'
header=None  # the file has no column-name row
index_col    # the column to use as the index
skiprows     # rows to skip
na_values    # strings to treat as missing values (parsed as NaN)
parse_dates  # whether to parse columns as dates; a Boolean or a list

Example

pd.read_csv('zn2006.csv')
pd.read_csv('zn2006.csv', index_col=0)           # use column 0 as the index
pd.read_csv('zn2006.csv', index_col='datetime')  # use the 'datetime' column as the index
pd.read_csv('zn2006.csv', index_col='datetime', parse_dates=True)
pd.read_csv('zn2006.csv', index_col='datetime', parse_dates=['datetime'])
pd.read_csv('zn2006.csv', header=None)
pd.read_csv('zn2006.csv', header=None, names=list('qwertyuiopasdfg'))
pd.read_csv('zn2006.csv', header=None, skiprows=[2, 4])
pd.read_csv('zn2006.csv', na_values=['None'])  # treat the string 'None' as NaN
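
To try these parameters without a file on disk, the sketch below reads from an in-memory string. The column names, the values and the 'missing' marker are hypothetical stand-ins, not the contents of zn2006.csv:

```python
import io
import pandas as pd

# Hypothetical file contents standing in for a real CSV file.
text = (
    "datetime,open,close\n"
    "2006-01-04,10.5,missing\n"
    "2006-01-05,11.0,11.4\n"
)

df = pd.read_csv(
    io.StringIO(text),       # read_csv also accepts file-like objects
    index_col='datetime',    # use this column as the row index
    parse_dates=True,        # parse the index as dates
    na_values=['missing'],   # treat this string as NaN
)
```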

Writing a CSV file: the to_csv function

The main parameters of the write function are:

sep           # the delimiter
na_rep        # the string written for missing values; the default is an empty string
header=False  # do not write the column-name row
index=False   # do not write the row index
columns       # the columns to write, passed as a list

df.to_csv('test.csv', header=False, index=False, na_rep='null')
df.to_csv('test.csv', header=False, index=False, na_rep='null', columns=['open', 'high', 'low', 'close'])
df.to_html('test.html')
pd.read_excel('test.xlsx')

Rolling function

rolling(10) computes over a sliding window of 10 rows: the current row and the 9 rows before it

df['pre_high'] = df['high'].rolling(10).max()
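
A sketch of a rolling maximum on a made-up Series, with a window of 3 to keep the output short. The first two results are NaN because the window is not yet full:

```python
import pandas as pd

high = pd.Series([1, 3, 2, 5, 4])

# Each result covers the current element and the two before it.
rolling_max = high.rolling(3).max()

print(rolling_max.tolist())  # [nan, nan, 3.0, 5.0, 5.0]
```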

Other file types supported by pandas

  • json
  • xml
  • html
  • database
  • pickle
  • excel