NumPy overview
NumPy is the foundation package for high-performance scientific computing and data analysis in Python. Many other tools, such as pandas, are built on top of it.
NumPy features:
- ndarray, a multi-dimensional array structure that is fast and memory-efficient
- Mathematical functions that operate quickly on whole arrays, without explicit loops
- Linear algebra, random number generation, and Fourier transform functions
Create an ndarray: np.array(array_list)
Differences between arrays and lists:
- All elements of an array must be of the same type.
- The size of an array cannot be changed after creation.
Commonly used attributes
- T: the transpose
- size: the number of elements
- ndim: the number of dimensions
- shape: the size of each dimension, as a tuple
- dtype: the data type of the elements
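The attributes listed above can be inspected on any array. This is a minimal sketch; the array `a` and its 3x5 shape are chosen only for illustration.

```python
import numpy as np

# A small 3x5 array used only for illustration.
a = np.arange(15).reshape(3, 5)

print(a.T.shape)  # (5, 3): the transpose swaps the two axes
print(a.size)     # 15 elements in total
print(a.ndim)     # 2 dimensions
print(a.shape)    # (3, 5): 3 rows, 5 columns
print(a.dtype)    # an integer type; the exact width depends on the platform
```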
Create an array
np.zeros(10) # array of 10 zeros
np.ones(10) # array of 10 ones
a = np.empty(100) # uninitialized; holds whatever values were already in memory
np.arange(100) # integers 0 to 99
np.arange(15).reshape(3, 5) # two-dimensional array with 3 rows and 5 columns
np.arange(2, 10, 0.3) # from 2 up to (not including) 10 in steps of 0.3
np.linspace(0, 50, 100) # 100 evenly spaced points from 0 to 50, endpoints included
np.eye(10) # 10x10 identity matrix, used in linear algebra
ndarray batch (vectorized) computation
Between an array and a scalar:
a+1 a*3 1//a a**0.5 a>5
Between arrays of the same size:
a+b a/b a**b a%b a==b
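As a rough illustration of these element-wise operations, the sketch below applies a few of them to two small arrays. The arrays `a` and `b` are made up for the example, and the loop at the end is only there to show what the vectorized form replaces.

```python
import numpy as np

a = np.array([1, 4, 9, 16])
b = np.array([1, 2, 3, 4])

plus_one = a + 1   # array([ 2,  5, 10, 17])
roots = a ** 0.5   # array([1., 2., 3., 4.])
mask = a > 5       # array([False, False,  True,  True])
total = a + b      # array([ 2,  6, 12, 20])
same = a == b      # array([ True, False, False, False])

# The same "a + 1" written as an explicit Python loop, for comparison.
looped = np.array([x + 1 for x in a])
```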
Indexing
- One-dimensional array index: a[5]
- Multidimensional array index:
- List-style notation: a[2][1]
- New-style notation: a[2, 1]
Slicing
- One-dimensional arrays:
a[5:8]
a[4:]
a[2:10]
- Multidimensional arrays:
a[1:2, 3:4]
a[:, 3:5]
a[:, 1]
- Slicing a NumPy array differs from slicing a list: the array is not copied when sliced (a view is created instead), so changes made through the slice affect the original array.
- The copy() method creates a deep copy of an array.
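The difference between a view and a copy is easiest to see by writing through a slice. This is a small sketch with illustrative variable names, comparing a NumPy slice, a copied slice, and a list slice.

```python
import numpy as np

a = np.arange(10)
view = a[2:5]    # a view onto a's data, not a copy
view[0] = 100    # writes through to a
print(a)         # [  0   1 100   3   4   5   6   7   8   9]

independent = a[2:5].copy()   # copy() gives the slice its own data
independent[0] = -1           # a is unaffected
print(a[2])                   # 100

lst = list(range(10))
lst_slice = lst[2:5]   # list slicing always copies
lst_slice[0] = 100
print(lst[2])          # 2
```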
ndarray Boolean indexing
Problem: given an array, select all the numbers in it that are greater than 5.
Answer: a[a > 5]
How it works:
- Array and scalar: a > 5 tests each element of a and returns a Boolean array.
- Boolean indexing: passing a Boolean array of the same size as the index returns an array of the elements at the positions where the Boolean array is True.
Example:
import random
import numpy as np
# 1. Given an array, pick out all the numbers greater than 5
a = np.array([random.randint(1, 10) for _ in range(20)])
a[a>5] # e.g. array([8, 6, 6, 7, 7, 6, 6]); the values vary between runs
# 2. Given an array, pick out all even numbers greater than 5
a = np.array([random.randint(1, 10) for _ in range(20)])
a[(a>5) & (a%2==0)]
# 3. Given an array, pick out all numbers that are greater than 5 or even
a = np.array([random.randint(1, 10) for _ in range(20)])
a[(a>5) | (a%2==0)]
ndarray fancy indexing
Select values by passing a list of index positions.
a = np.arange(20)
a[[1, 4, 5, 6]] # array([1, 4, 5, 6])
a = np.arange(20).reshape(4, 5)
# array([[ 0,  1,  2,  3,  4],
#        [ 5,  6,  7,  8,  9],
#        [10, 11, 12, 13, 14],
#        [15, 16, 17, 18, 19]])
a[1, 2:5] # array([7, 8, 9])
a[0, a[0]>2] # array([3, 4])
a[[1, 3], [1, 3]] # array([6, 18]): picks the elements at (1, 1) and (3, 3)
To select 6, 8, 16, and 18, index the rows and the columns in two steps:
a[[1, 3], :][:, [1, 3]] # ":" selects everything along that axis
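Passing two index lists at once pairs them up element by element, which is why selecting a rectangular block takes two steps. A short sketch on the same 4x5 array (the variable names are illustrative):

```python
import numpy as np

a = np.arange(20).reshape(4, 5)

row_part = a[1, 2:5]              # row 1, columns 2 to 4
pairs = a[[1, 3], [1, 3]]         # the elements at (1, 1) and (3, 3)
block = a[[1, 3], :][:, [1, 3]]   # rows 1 and 3, then columns 1 and 3

print(row_part)         # [7 8 9]
print(pairs)            # [ 6 18]
print(block.tolist())   # [[6, 8], [16, 18]]
```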
NumPy universal functions
Universal function (ufunc): a function that operates on all elements of an array at once.
Background: rounding in plain Python
int # truncates toward zero
round # rounds to the nearest integer; ties go to the nearest even integer
math.floor # rounds down, toward negative infinity (the "floor")
math.ceil # rounds up, toward positive infinity (the "ceiling")
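The four rounding rules only disagree on ties and on negative numbers, so a small table of such values shows the difference. This sketch uses only the standard library; the sample values are arbitrary.

```python
import math

# For each value: (int(x), round(x), math.floor(x), math.ceil(x))
values = (2.5, 3.5, -2.5, 2.7, -2.7)
table = {x: (int(x), round(x), math.floor(x), math.ceil(x)) for x in values}

for x, results in table.items():
    print(x, results)
# 2.5  -> (2, 2, 2, 3)      the tie rounds to the even neighbour 2
# 3.5  -> (3, 4, 3, 4)      the tie rounds to the even neighbour 4
# -2.5 -> (-2, -2, -3, -2)  int truncates toward zero, floor goes down
# 2.7  -> (2, 3, 2, 3)
# -2.7 -> (-2, -3, -3, -2)
```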
Commonly used universal functions:
- Unary functions
abs # absolute value
sqrt # square root
exp
log
ceil # rounds up (the "ceiling")
floor # rounds down (the "floor")
rint # rounds to the nearest integer, ties to even; the same rule as round
trunc # truncates toward zero
modf # splits each value into its fractional and integer parts
isnan # tests for nan
isinf # tests for inf
cos sin tan
Examples:
a = np.arange(-5.5, 5)
np.abs(a)
x, y = np.modf(a) # x holds the fractional parts, y the integer parts
# Filtering out nan
a = np.arange(0, 5)
b = a / a # array([nan, 1., 1., 1., 1.]); 0/0 gives nan, with a RuntimeWarning
b[~np.isnan(b)] # ~ negates the Boolean array
# Filtering out inf
a = np.array([3, 4, 5, 6])
b = np.array([2, 0, 3, 0])
c = a / b
c[c != np.inf] # keeps the values that are not inf
c[~np.isinf(c)] # the same filter using isinf
- Binary functions
add subtract multiply divide power mod
maximum # element-wise maximum of two arrays
minimum # element-wise minimum of two arrays
NumPy mathematical and statistical methods
sum # sum
mean # mean
std # standard deviation
var # variance
min # minimum
max # maximum
argmin # index of the minimum
argmax # index of the maximum
# Variance, worked for the data 1 2 3 4 5
mean: 3
variance: ((1-3)**2+(2-3)**2+(3-3)**2+(4-3)**2+(5-3)**2)/5 = 2
The variance measures how spread out a set of data is. Standard deviation = sqrt(variance).
For roughly normally distributed data, about 68% of the values lie within this range:
a.mean()+a.std()
a.mean()-a.std()
and about 95% lie within this range:
a.mean()+2*a.std()
a.mean()-2*a.std()
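The hand calculation for 1 2 3 4 5 can be checked against NumPy directly. A minimal sketch; the names `manual_var` and `manual_std` are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

mean = a.mean()                                 # 3.0
manual_var = ((a - mean) ** 2).sum() / a.size   # 10 / 5 = 2.0
manual_std = manual_var ** 0.5                  # square root of the variance

print(manual_var, a.var())   # both 2.0
print(manual_std, a.std())   # both about 1.4142

# The one-standard-deviation range around the mean.
low, high = mean - a.std(), mean + a.std()
```

Note that np.var and np.std divide by N by default, which matches the formula used here, while the pandas var and std methods divide by N-1 by default.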
NumPy random number generation
The random number functions live in the np.random subpackage.
rand # random floats in [0, 1) with the given shape
randint # random integers with the given shape
choice # random selections from a given sequence, with the given shape
shuffle # shuffles an array in place, like random.shuffle
uniform # random floats from a given range, with the given shape
Examples:
np.random.randint(0, 10, 10) # 10 random integers from 0 to 9
np.random.rand(5) # 5 random floats in [0, 1)
np.random.choice([1, 2, 3, 4, 5], (2, 3)) # 2x3 array drawn from the list
np.random.uniform(2, 5, (3, 4)) # 3x4 array of floats in [2, 5)
Supplement: floating-point special values
- nan (Not a Number): not equal to any floating-point number, including itself (nan != nan)
- inf (infinity): larger than any finite floating-point number
- Creating the special values in NumPy: np.nan, np.inf
- In data analysis, nan is often used to represent missing values.
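Because nan never compares equal to anything, tests for it have to go through isnan. The sketch below shows this, along with how a single nan affects a mean; the array contents are made up for the example.

```python
import numpy as np

print(np.nan == np.nan)    # False
print(np.isnan(np.nan))    # True: the reliable way to test for nan
print(np.inf > 1e308)      # True
print(-np.inf < -1e308)    # True

a = np.array([1.0, np.nan, 3.0])
print(a.mean())            # nan: one nan makes the ordinary mean nan
print(np.nanmean(a))       # 2.0: nanmean skips the nan entries
```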
Pandas Data Analysis
Introduction to pandas
pandas is a powerful Python data analysis toolkit built on top of NumPy.
pandas provides:
- The DataFrame and Series data structures and their methods
- Integrated time series functionality
- A rich set of mathematical operations
- Flexible handling of missing data
Install: pip install pandas
Import: import pandas as pd
Series: a one-dimensional array object
A Series is an object similar to a one-dimensional array, consisting of a set of data and an associated set of labels (the index).
Ways to create one:
pd.Series([4, 5, -5, 3])
pd.Series([4, 5, -5, 3], index=['a', 'b', 'c', 'd'])
pd.Series({'a': 1, 'b': 2, 'c': 3})
pd.Series(0, index=['a', 'b', 'c'])
The values and index attributes return the array of values and the array of index labels.
A Series behaves like a combination of a list (array) and a dictionary.
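The list-like and dictionary-like sides of a Series can be seen on one small object. A minimal sketch, with a Series named `sr` built only for illustration:

```python
import pandas as pd

sr = pd.Series([4, 5, -5, 3], index=['a', 'b', 'c', 'd'])

print(sr.values)        # [ 4  5 -5  3]
print(list(sr.index))   # ['a', 'b', 'c', 'd']
print(sr['c'])          # -5, looked up by label like a dictionary
print(sr.iloc[2])       # -5, looked up by position like an array
print('a' in sr)        # True: membership tests the labels
```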
Series features
A Series supports array (positional) features:
- Creating a Series from an ndarray: Series(arr)
- Operations with scalars: sr * 2
- Operations between two Series: sr1 + sr2
- Indexing: sr[0], sr[[1, 2, 4]]
- Slicing: sr[0:2]
- Universal functions: np.abs(sr)
- Boolean filtering: sr[sr > 0]
A Series supports dictionary (label) features:
- Creating a Series from a dictionary: Series(dic)
- The in operator: 'a' in sr
- Indexing by key: sr['a'], sr[['a', 'b', 'd']]
sr.index
sr.values
sr.index[0]
sr[[1, 3]]
sr[['a', 'd']]
sr['a':'s'] # slicing by label includes both endpoints
Series integer indexes
Integer indexes on pandas Series objects tend to trip up newcomers.
For example:
sr = pd.Series(np.arange(4))
sr[-1] # raises KeyError: -1 is looked up as a label, not a position
If the index is of an integer type, subscripting with an integer is always label-based.
Workaround: the loc attribute (interprets the key as a label) and the iloc attribute (interprets the key as a position)
sr2.loc[10] # by label
sr2.iloc[10] # by position
sr2.iloc[-1]
sr2.iloc[3:6]
sr2.iloc[[2, 3, 7]]
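The label/position distinction becomes visible once the integer labels no longer start at 0. In this sketch, `sr2` is an assumed example built by taking the last ten rows of a 20-element Series, so its labels are 10 to 19 while its positions are 0 to 9.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(20))
sr2 = sr.iloc[10:].copy()   # labels 10..19 at positions 0..9

print(sr2.loc[10])               # 10: the row labeled 10
print(sr2.iloc[0])               # 10: the first row
print(sr2.iloc[-1])              # 19: the last row
print(sr2.iloc[3:6].tolist())    # [13, 14, 15]
print(sr2.loc[13:15].tolist())   # [13, 14, 15]: label slices include the end
```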
Series data alignment
When two Series are combined in an operation, they are first aligned by index and then computed.
sr1 = pd.Series([11, 23, 34], index=['c', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['d', 'c', 'a'])
sr1 + sr2 # added after index alignment
sr1 = pd.Series([11, 23, 34], index=['c', 'a', 'd'])
sr2 = pd.Series([11, 20, 10, 21], index=['d', 'c', 'a', 'b'])
sr1 + sr2 # 'b' exists only in sr2, so its result is NaN
sr1 = pd.Series([11, 23, 34], index=['b', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['d', 'c', 'a'])
sr1 + sr2 # 'b' and 'c' are both NaN
# How can the result be 11 at index 'b' and 20 at index 'c'?
# Use the arithmetic methods add, sub, div, mul with fill_value:
sr1.add(sr2, fill_value=0)
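The effect of fill_value is clearest when the plain sum and the add() result are put side by side. A small sketch using the same two Series as the last case; the names `plain` and `filled` are illustrative.

```python
import pandas as pd

sr1 = pd.Series([11, 23, 34], index=['b', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['d', 'c', 'a'])

plain = sr1 + sr2                     # labels missing on either side give NaN
filled = sr1.add(sr2, fill_value=0)   # a missing side is treated as 0

print(plain.to_dict())    # a: 33.0, b: nan, c: nan, d: 45.0
print(filled.to_dict())   # a: 33.0, b: 11.0, c: 20.0, d: 45.0
```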
Handling missing values in a Series
sr.isnull() # True where the value is NaN
sr.notnull() # True where the value is not NaN
# Drop missing values
sr = sr[sr.notnull()]
sr = sr.dropna()
# Fill with other values
sr = sr.fillna(0) # fill with 0
sr = sr.fillna(sr.mean()) # fill with the mean
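A short sketch of dropping versus filling, on a made-up Series with two missing values. Note that mean() skips NaN by default, so filling with the mean uses the mean of the values that are present.

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(sr.isnull().tolist())   # [False, True, False, True, False]

dropped = sr.dropna()               # keeps the rows at positions 0, 2, 4
filled_zero = sr.fillna(0)          # NaN becomes 0
filled_mean = sr.fillna(sr.mean())  # NaN becomes 3.0, the mean of 1, 3, 5

print(dropped.tolist())       # [1.0, 3.0, 5.0]
print(filled_zero.tolist())   # [1.0, 0.0, 3.0, 0.0, 5.0]
print(filled_mean.tolist())   # [1.0, 3.0, 3.0, 3.0, 5.0]
```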
How to create a DataFrame
A DataFrame is a tabular data structure containing an ordered set of columns. It can be thought of as a dictionary of Series that share one index.
All values within a single column have the same type; different columns may have different types.
Ways to create one:
# Way 1: from a dictionary of lists
pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])
# Way 2: from a dictionary of Series
pd.DataFrame({'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd'])})
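When a DataFrame is built from Series with different indexes, the Series are aligned on the union of their labels and the gaps become NaN. A minimal sketch of the second creation style:

```python
import pandas as pd

df = pd.DataFrame({
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd']),
})

print(sorted(df.index))     # ['a', 'b', 'c', 'd']
print(df.loc['a', 'two'])   # 2: values follow their labels, not their order
print(df.loc['d', 'one'])   # nan: 'one' has no entry for 'd'
print(df['one'].dtype)      # float64: the NaN forces a float column
```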
Reading and writing CSV files
df.to_csv('test2.csv')
pd.read_csv('test2.csv')
Common DataFrame attributes and methods
index # the row index
T # the transpose
columns # the column index
values # the array of values
describe() # quick summary statistics
DataFrame indexing and slicing
A DataFrame is two-dimensional, so it has both a row index and a column index.
A DataFrame can be indexed and sliced by label or by position, using the loc and iloc attributes.
- Usage: separate the two parts with a comma, row index first, then column index.
- Each part can be a regular index, a slice, a Boolean index, or a fancy index.
loc interprets its arguments as index labels; iloc interprets them as integer positions.
df['one']['a'] # 'one' is the column label, 'a' is the row label
df.loc['a', 'one'] # 'a' is the row label, 'one' is the column label
df.loc['a', :] # one whole row
df.loc[['a', 'c'], :]
df.loc[['a', 'c'], 'two'] # the forms can be combined freely
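The following sketch runs the same selections through loc and iloc on a small, made-up DataFrame, so the label-based and position-based forms can be compared.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

print(df.loc['a', 'one'])                   # 1: row label, column label
print(df.iloc[0, 0])                        # 1: row position, column position
print(df.loc['a':'b', 'two'].tolist())      # [4, 5]: label slices include the end
print(df.iloc[0:2, 1].tolist())             # [4, 5]: position slices exclude the end
print(df.loc[['a', 'c'], 'two'].tolist())   # [4, 6]: fancy index on the rows
print(list(df.loc[df['one'] > 1, :].index)) # ['b', 'c']: Boolean index on the rows
```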
DataFrame data alignment and missing data
When DataFrame objects are combined in an operation, their row and column indexes are aligned separately.
Methods for handling missing data:
dropna(axis=0, how='any', ...) # axis=0 drops rows, axis=1 drops columns; how='any' drops if any value is missing, how='all' only if all values are missing
fillna() # fill missing values
isnull()
notnull()
df.dropna(how='all') # how defaults to 'any'
df2.dropna(axis=1) # axis defaults to 0
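How the axis and how arguments interact is easier to follow on a concrete table. In this sketch the DataFrame is made up so that row 'b' is entirely missing and column 'two' has one extra gap.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'one': [1.0, np.nan, 3.0],
     'two': [4.0, np.nan, np.nan],
     'three': [7.0, np.nan, 9.0]},
    index=['a', 'b', 'c'],
)

any_rows = df.dropna()                 # how='any': keeps only row 'a'
all_rows = df.dropna(how='all')        # drops only the all-NaN row 'b'
full_cols = all_rows.dropna(axis=1)    # then drops column 'two'
filled = df.fillna(0)                  # replaces every NaN with 0

print(list(any_rows.index))      # ['a']
print(list(all_rows.index))      # ['a', 'c']
print(list(full_cols.columns))   # ['one', 'three']
```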
Other common pandas methods
mean(axis=0, skipna=False) # mean of each column (axis=0) or each row (axis=1)
sum(axis=1) # sum of each row (axis=1) or each column (axis=0)
sort_index(axis, ..., ascending) # sort by row or column index
sort_values(by, axis, ascending) # sort by the values of a column (or row); NaN takes no part in the sort and is placed at the end
# NumPy universal functions also work on pandas objects
Examples:
df.sort_values(by='two') # sort by column 'two'
df.sort_values(by='two', ascending=False) # sort by column 'two', descending
df.sort_values(by=1, ascending=False, axis=1) # sort the columns by the values in the row labeled 1, descending
df.sort_index() # sort by row index
df.sort_index(ascending=False) # sort by row index, descending
df.sort_index(ascending=False, axis=1) # sort by column index, descending
pandas time objects
Kinds of time series data:
- Timestamp: a specific moment
- Fixed period: for example, July 2017
- Interval: a start time and an end time
Python standard library for time objects: datetime
Flexible parsing of time strings: dateutil, via dateutil.parser.parse()
Converting groups of time strings: pandas, via pd.to_datetime()
import datetime
datetime.datetime.strptime('2010-08-21', '%Y-%m-%d')
import dateutil.parser
dateutil.parser.parse('2001-01-01')
dateutil.parser.parse('200101/01')
dateutil.parser.parse('2001/01/01')
dateutil.parser.parse('01/01/2020')
pd.to_datetime(['2019-01-01', '2010/Feb/02']) # pandas 2.0 and later need format='mixed' when the strings use different formats
Generating time objects with pandas
date_range generates an array of time objects. Its parameters:
start # start time
end # end time
periods # number of periods to generate
freq # time frequency, 'D' (day) by default; other options include H(our), W(eek), B(usiness day), S(emi-)M(onth), T (minute), S(econd), A (year), ...
Recent pandas versions (2.2 and later) spell some of these aliases differently, for example 'h' for hour, 'min' for minute, and 's' for second.
pd.date_range('2010-01-01', '2010-5-1')
pd.date_range('2010-01-01', periods=60)
pd.date_range? # IPython syntax for showing the help text
pd.date_range('2010-01-01', periods=60, freq='H')
pd.date_range('2010-01-01', periods=60, freq='W')
pd.date_range('2010-01-01', periods=60, freq='W-MON')
dr = pd.date_range('2010-01-01', periods=60, freq='B')
pd.date_range('2010-01-01', periods=60, freq='1h20min')
dr[0].to_pydatetime() # convert a pandas Timestamp to a Python datetime
pandas time series
A time series is a Series or DataFrame whose index consists of time objects.
When datetime objects are used as an index, they are stored in a DatetimeIndex object.
Special features of time series:
- Pass a year or a year-month string to select that whole period
- Pass a date range as a slice
- Rich method support: resample(), truncate(), ...
sr = pd.Series(np.arange(1000), index=pd.date_range('2017-1-1', periods=1000))
sr['2017-3']
sr['2017-4']
sr['2017']
sr['2017':'2018-3']
sr['2017-12-24':'2018-2-1']
sr.resample('W').sum() # sum by week
sr.resample('M').sum() # sum by month (spelled 'ME' in pandas 2.2 and later)
sr.resample('M').mean() # mean by month
sr.truncate(before='2018-2-3') # drop everything before the date; slicing does the same job and is usually preferred
sr.truncate(after='2018-2-3') # drop everything after the date
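A shorter series makes the results of resampling, date slicing, and truncate easy to check by hand. This sketch assumes two weeks of daily data starting on 2017-01-01, which is a Sunday; weekly bins from resample('W') end on Sunday by default.

```python
import numpy as np
import pandas as pd

# Two weeks of daily data; 2017-01-01 is a Sunday.
sr = pd.Series(np.arange(14), index=pd.date_range('2017-01-01', periods=14))

weekly = sr.resample('W').sum()              # weekly bins ending on Sunday
window = sr.loc['2017-01-02':'2017-01-04']   # both endpoints are included
tail = sr.truncate(before='2017-01-10')      # keeps 2017-01-10 onward

print(weekly.tolist())   # [0, 28, 63]
print(window.tolist())   # [1, 2, 3]
print(tail.tolist())     # [9, 10, 11, 12, 13]
```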
pandas file handling
A common data file format: CSV (values separated by a delimiter character)
pandas can load data from file names, URLs, and file objects.
read_csv # the default delimiter is a comma
read_table # the default delimiter is a tab
Main parameters of read_csv and read_table:
sep # the delimiter; may be a regular expression such as '\s+'
header=None # the file has no column-name row
index_col # use the given column as the index
skiprows # rows to skip
na_values # strings to treat as missing values (parsed to NaN)
parse_dates # whether to parse certain columns as dates; a Boolean or a list
Examples:
pd.read_csv('zn2006.csv')
pd.read_csv('zn2006.csv', index_col=0) # use column 0 as the index
pd.read_csv('zn2006.csv', index_col='datetime') # use the 'datetime' column as the index
pd.read_csv('zn2006.csv', index_col='datetime', parse_dates=True)
pd.read_csv('zn2006.csv', index_col='datetime', parse_dates=['datetime'])
pd.read_csv('zn2006.csv', header=None)
pd.read_csv('zn2006.csv', header=None, names=list('qwertyuiopasdfg'))
pd.read_csv('zn2006.csv', header=None, skiprows=[2, 4])
pd.read_csv('zn2006.csv', na_values=['None']) # parse the string 'None' as NaN
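Since read_csv also accepts file objects, the parameters can be tried without a file on disk. In this sketch the CSV text, its column names, and the '--' missing-value marker are all hypothetical stand-ins.

```python
import io
import pandas as pd

# Hypothetical file contents; '--' marks a missing value here.
csv_text = (
    "datetime,open,close\n"
    "2006-01-04,10.5,--\n"
    "2006-01-05,10.8,11.0\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),   # a file-like object in place of a path
    index_col='datetime',    # use the 'datetime' column as the index
    parse_dates=True,        # parse the index as dates
    na_values=['--'],        # treat '--' as NaN
)

print(type(df.index).__name__)         # DatetimeIndex
print(df['close'].isnull().tolist())   # [True, False]
```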
Writing CSV files: the to_csv method
Main parameters of to_csv:
sep # the delimiter
na_rep # the string written for missing values; an empty string by default
header=False # do not write the column-name row
index=False # do not write the row index
columns # the columns to write, passed as a list
df.to_csv('test.csv', header=False, index=False, na_rep='null')
df.to_csv('test.csv', header=False, index=False, na_rep='null', columns=['open', 'high', 'low', 'close'])
df.to_html('test.html')
pd.read_excel('test.xlsx')
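The rolling method
rolling(10) computes over a sliding window made of the current row and the 9 rows before it, for example the highest 'high' of the last 10 rows:
df['pre_high'] = df['high'].rolling(10).max()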
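A window of 3 on five made-up values is enough to see how the rolling maximum behaves, including the NaN results while the window is still filling up.

```python
import pandas as pd

df = pd.DataFrame({'high': [1, 3, 2, 5, 4]})

# Each result is the maximum of the current row and the two rows before it.
df['pre_high'] = df['high'].rolling(3).max()

print(df['pre_high'].tolist())   # [nan, nan, 3.0, 5.0, 5.0]
```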
Other file types supported by pandas
- JSON
- XML
- HTML
- Databases
- pickle
- Excel