1/pypi
Python's built-in data analysis capabilities are limited, so third-party libraries are needed to extend them and enrich its arsenal. Popular third-party libraries include numpy, pandas, matplotlib, scipy, keras, sklearn, gensim, and so on. A plain Python installation ships only the built-in functions, and most third-party libraries are missing. If you install Anaconda, many third-party libraries such as numpy, pandas, and scipy come pre-installed in addition to the built-ins. Any other library can be installed at any time with pip install xxxx, so we can do everything we want. PyPI (the Python Package Index) is the repository that hosts these third-party packages.
2/numpy
Python itself does not provide a real array type. Lists can stand in for basic arrays, but they are not true arrays, and using lists for large amounts of data is unacceptably slow. The numpy library provides real arrays, along with functions for processing them quickly; numpy's built-in functions process data at C-level speed.
import numpy as np
arr1 = np.array([11, 22, 33])                  # create a one-dimensional array
arr2 = np.array([[11, 22, 33], [44, 55, 66]])  # create a 2d array with 2 rows and 3 columns
print(arr2 * arr2)  # array multiplication: the elements in matching positions are multiplied
numpy provides multidimensional arrays, but they are generic arrays, not matrices. For example, when you multiply two arrays with *, you just multiply the corresponding elements, not the matrices; use np.dot() or the @ operator for a true matrix product.
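For instance, a minimal sketch of the difference (the array values are chosen for illustration):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b   # multiplies matching positions only
matrix = a @ b        # true matrix product, equivalent to np.dot(a, b)
```

elementwise gives [[5, 12], [21, 32]], while the matrix product gives [[19, 22], [43, 50]].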
vals, cnts = np.unique(lst, return_counts=True)  # the unique values in lst, plus how often each occurs
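A small runnable example of return_counts (the sample data is illustrative):

```python
import numpy as np

data = ["a", "b", "a", "c", "a", "b"]
values, counts = np.unique(data, return_counts=True)  # unique values and their frequencies
```

Here values is ['a' 'b' 'c'] and counts is [3 2 1].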
3/scipy
scipy is an extension library built on numpy for scientific tasks: optimization, linear algebra, integration, interpolation, fitting, fast Fourier transforms, advanced mathematics, signal processing, statistics, and many more.
4/matplotlib
matplotlib is a third-party library used for visualization. It is mainly used for two-dimensional plotting, but can also do simple three-dimensional plotting. When drawing with matplotlib, you need to master three common concepts:
figure: generally abbreviated fig, the window that pops up when the program runs. The figure acts as the overall canvas and is created by plt.figure().
axes: each figure may contain several subplots (or just one); each subplot is an axes.
axis: within each axes there are the coordinate axes themselves (x and y).
Add a horizontal/vertical line to a figure
plt.axhline(y=4, ls=":", c="yellow")  # add a horizontal line at y=4
plt.axvline(x=4, ls="-", c="green")   # add a vertical line at x=4
Fonts
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = ["Arial Unicode MS"]        # specify the font
plt.rcParams["font.sans-serif"] = ["SimHei"]              # show Chinese with SimHei
plt.rcParams["font.sans-serif"] = ["Microsoft JhengHei"]  # or Microsoft JhengHei
plt.rcParams["axes.unicode_minus"] = False                # display minus signs correctly
Common commands
import matplotlib.pyplot as plt
fig = plt.figure(num='name of the fig', figsize=(a, b))
# create a canvas fig; num names the window title of the whole image (num=..., not title=...)
# figsize=(width, height) specifies the size of the canvas

# draw the first subplot in fig: a line chart
axes1 = fig.add_subplot(311)  # add a subplot to fig; 311 means the 1st plot in a grid of 3 rows and 1 column
# even if the canvas holds only one subplot, still write fig.add_subplot(111)
axes1.set_title("xxxxx", fontsize=20)  # subplot title and its font size
axes1.set_xlim([])       # set the range of the x-axis
axes1.set_ylim([])       # set the range of the y-axis
axes1.set_xticks()       # set the position of each tick on the x-axis
axes1.set_yticks()       # set the position of each tick on the y-axis
axes1.set_xticklabels()  # set the tick labels on the x-axis
axes1.set_yticklabels()  # set the tick labels on the y-axis
axes1.set_xlabel()       # set the x-axis label
axes1.set_ylabel()       # set the y-axis label
axes1.plot()             # draw the line chart

# draw the second subplot in fig: a pie chart
axes2 = fig.add_subplot(312)  # define the second subplot (the 2nd in 3 rows and 1 column)
axes2.pie(sizes, explode=explode, labels=labels, colors=colors,
          autopct='%1.1f%%', shadow=True, startangle=45)
# a pie chart needs the size of each part / the label of each part / the color of each part;
# autopct automatically shows each share as a percentage with one decimal place
axes2.axis("equal")  # make sure the pie chart is not deformed and stays a standard circle

# draw the third subplot in fig: a stacked bar chart
axes3 = fig.add_subplot(313)
axes3.bar(x, y1, color='r', label='Chinese')
axes3.bar(x, y2, bottom=y1, color='g', label='mathematics')
axes3.bar(x, y3, bottom=y1 + y2, color='c', label='English')
axes3.set_xlim(1, 20)   # display range of the x-axis
axes3.set_ylim(1, 100)  # display range of the y-axis
Add a legend
fig = plt.figure(figsize=(12, 6))
x = temp_df["data_month"].tolist()
y1 = temp_df["cohesion"].tolist()
y2 = temp_df["apl"].tolist()
# draw the curves; if you need a legend, every plot call must have a label parameter
plt.plot(x, y1, label="cohesion")
plt.plot(x, y2, label="apl")
plt.legend(loc='upper left',
           prop={'family': 'Times New Roman', 'weight': 'normal', 'size': 23})
plt.title("Chart of %s by month" % i, fontsize=20)
plt.xlabel("data_month", fontsize=20)
plt.ylabel("cohesion_and_apl", fontsize=20)
plt.show()

plt.legend(loc='upper right')  # draw the legend; loc is the position, and loc='best' lets matplotlib allocate the best position automatically
# legend style parameters such as font family, weight, and size go in the prop parameter:
font1 = {'family': 'Times New Roman', 'weight': 'normal', 'size': 23}
plt.legend(loc='upper left', prop=font1)
axes3.grid(axis='y', color='gray', linestyle='-', linewidth=1)
# draw grid lines only on the y-axis, with the given color, line style, and line width
plt.show()
# no matter how many subplots are in the fig, call plt.show() to display them all
Save the figure
plt.savefig(path, dpi=1000, format="pdf")  # call savefig before plt.show() (or comment plt.show() out), otherwise the saved image may be blank
5/pandas
The basic data structures in pandas are the Series and the DataFrame. A Series is a one-dimensional array; a DataFrame is a two-dimensional table. Within one DataFrame column the data type must be the same, but the types can differ between columns.
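A minimal sketch of both structures (the column names here are made up for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="scores")      # one-dimensional, with an index

df = pd.DataFrame({"name": ["aaa", "bbb"],   # each column has a single dtype,
                   "age": [20, 21]})         # but dtypes may differ between columns
```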
Pandas display settings for rows and columns
pd.set_option("display.max_rows", None)     # show all rows
pd.set_option("display.max_columns", None)  # show all columns
pd.set_option("display.max_colwidth", 500)  # maximum width of a single column, default is 50
pd.set_option("display.width", 10000)       # total line width before wrapping
Replacing values in a dataframe
dataframe = dataframe.replace({"xxx": ["None", "none"]}, "2099-12-31")  # in column xxx, replace "None"/"none" with "2099-12-31"
dataframe = dataframe.replace(a, b)                      # replace a with b in the whole dataframe
dataframe = dataframe.replace([a, aa, aaa], bbb)         # replace several values with bbb at once
dataframe["column"] = dataframe["column"].replace(a, b)  # replace only within one column
dataframe["column"] = dataframe["column"].replace([a, aa, aaa], bbb)
Determine if dataframe is empty
<2> DataFrame has an attribute "empty": df.empty returns True if df has no rows and False otherwise. empty is an attribute, not a method, so do not add () after it.
<3> df.shape[0] is the number of rows and df.shape[1] the number of columns, so df.shape[0] == 0 also means the dataframe is empty.
Create an empty dataframe
empty_df = pd.DataFrame(data=None, columns=range(1, 5), index=[0, 1])  # no values, but with columns and an index
empty_df = pd.DataFrame(columns=["a", "b", "c"])                       # only column names, no index
Loc and iloc
dataframe.loc[] selects by row label and column label (x_label, y_label); dataframe.iloc[] selects by row position and column position (integer index). When the row and column labels of the data are too long or hard to remember, .iloc[] is convenient: you only need to remember the positions.
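A small sketch showing that .loc[] and .iloc[] can address the same cell (the labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 21, 23]}, index=["ccc", "aaa", "bbb"])

by_label = df.loc["aaa", "age"]   # row label + column label
by_position = df.iloc[1, 0]       # row position + column position (same cell)
```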
Create a dataframe
<1> From a dictionary, where each key/value pair is one column of data:
dict1 = {'name': ['ccc', 'aaa', 'bbb'], 'age': [20, 21, 23]}
data_df = pd.DataFrame(dict1, index=xxx)  # the dict keys become the column names
<2> From a list of row lists:
list1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
data_df = pd.DataFrame(list1, index=xxx, columns=xxx)
Dataframe sorting
new_df = old_df.sort_values(by="column name", ascending=True)
# by specifies the column to sort on; ascending defaults to True (ascending order), set ascending=False for descending order

When reading a df data file, you can read fields selectively:
data_df = pd.read_csv(filepath, encoding="gb18030", usecols=["xxx", "xxx"])
# usecols keeps only the listed columns; since these are keyword arguments, their order does not matter
Add row data to dataframe && Add column data
df = df.append(dic, ignore_index=True)  # append a row from a dict (pandas 2.x removed append(); use pd.concat there)
df["new column name"] = [...]           # add a column at the end by assigning its values
df.insert(loc=0, column="column name", value="xxx")  # insert a column at a given position
Save the dataframe
data_df.to_csv(path, index=False, encoding="gb18030")  # index=False leaves the row index out of the file
In the dataframe, convert the string to a time format
Data_df [" date of birth "] = pd.to_datetime(data_df[" date of birth "]) NaT data_df[" resignation date "] = pd.to_datetime(data_df[" resignation date "].fillna("2099-12-31"))Copy the code
Dataframe deduplication
df.drop_duplicates(subset=['a', 'b'], keep='first', inplace=False)
# subset: the columns used to judge duplicates, default all columns
# keep: {'first', 'last', False}, default 'first' (which of the duplicates to keep)
# inplace=True removes duplicates from the original DataFrame; the default False returns a new DataFrame object
df.drop_duplicates(subset='id', keep='first', inplace=True)  # modify df in place
df.drop_duplicates(subset='id', keep='first')                # inplace can be omitted, because the default is False
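A minimal sketch of keep='first' on a toy table:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "v": ["x", "y", "z"]})
deduped = df.drop_duplicates(subset="id", keep="first")  # keeps the first row of each id
```

Only the rows with v equal to "x" and "z" survive; the second id=1 row is dropped.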
The dataframe.insert() function
df.insert(loc, column, value) inserts a column named column with the given value at position loc.
pandas.get_dummies()
Suppose I have a variable x, and x is a categorical variable, like blood type with the values A, B, AB, O. After get_dummies(), you get four variables x_A, x_B, x_AB, x_O whose values are denoted by 0 and 1. This is one-hot encoding.
pd.get_dummies(data_column, prefix="x")  # the optional prefix parameter names the new columns; the other parameters are rarely needed
For example, the Embarked column is transformed into three dummy variables based on its three unique values, with the prefix added to each name.
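The blood-type example above can be sketched like this (prefix "x" as in the text):

```python
import pandas as pd

blood = pd.Series(["A", "B", "AB", "O"])
dummies = pd.get_dummies(blood, prefix="x")  # one 0/1 column per category
```

The resulting columns are x_A, x_AB, x_B, x_O (sorted by category name).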
dataframe.sample()
Data sampling: sometimes we only need part of the data set, not all of it, so we take a random sample; pandas has sampling built in. Application scenario: a table has 100,000 rows, each with 11 columns, and we want 20,000 random rows. The implementation is simple:
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
n: the number of rows to draw (n=20000 draws 20,000 rows)
frac: the fraction to draw when you care about a percentage rather than a count, e.g. frac=0.8
replace: whether to sample with replacement
weights: the weight of each sample; see the official documentation for details
random_state: the random seed, covered in a previous post
axis: whether to draw rows or columns; axis=0 draws rows (the default), axis=1 draws columns
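A scaled-down sketch of the scenario above (100 rows instead of 100,000):

```python
import pandas as pd

df = pd.DataFrame({"a": range(100), "b": range(100)})

part = df.sample(n=20, random_state=1)  # 20 random rows; random_state makes it reproducible
frac_part = df.sample(frac=0.8)         # 80% of the rows
```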
Shuffling a dataframe (random row order)
from sklearn.utils import shuffle
new_df = shuffle(df)
Dataframe Changes the data type
df['column name'] = df['column name'].astype(np.float64)  # convert the dtype of a column
base_df["xxx"] = base_df["xxx"].apply(int)                # element-wise conversion with apply also works
Keep 3 decimal places for a column of data in a dataframe
fmt = lambda x: "%.3f" % x
df["column name"] = df["column name"].map(fmt)  # keep 3 decimal places (the values become strings)
The application of fillna ()
fillna_values = {'column 1': 1, 'column 2': 2, 'column 3': 3}
df = df.fillna(value=fillna_values)           # fill the nulls of each column with its own value
df = df.fillna(value=fillna_values, limit=1)  # fill at most one null per column
fillna_value = {'price': df["price"].mean()}
df = df.fillna(value=fillna_value)            # fill the nulls of the price column with the column mean
Dataframe copy
df2 = df1.copy(deep=True)
# deep copy: two completely unrelated objects are stored in memory, so if the df1 object changes, df2 does not change.
# without a deep copy there are merely two variable names for one object, and whenever the object changes, both names see the change.
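A quick check of the deep-copy behavior described above:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = df1.copy(deep=True)  # independent object in memory
df1.loc[0, "a"] = 99       # changing df1 leaves df2 untouched
```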
Dataframe data filtering
Note: besides isin() (negated with ~), there are also isnull() and notnull().
temp_df = temp_df[(temp_df["departure date"] >= date)
                  & (temp_df["rank"].isin(['K3A', 'K3B', 'K3C', 'K4A', 'K4B', 'K4C']))
                  & (temp_df["evaluation"].notnull())
                  & (temp_df["performance"].notnull())].reset_index(drop=True)
temp_df = temp_df[(temp_df["departure date"] >= date)
                  & (temp_df["rank"].isin(['K3A', 'K3B', 'K3C', 'K4A', 'K4B', 'K4C']))
                  & (temp_df["evaluation"].isnull())
                  & (temp_df["performance"].isnull())].reset_index(drop=True)  # both evaluation and performance are null
Delete rows and columns from dataframe
data_df = data_df.drop(columns=["xxx"])  # delete columns by name
data_df = data_df.drop(index=[0, 1])     # delete rows by index label
data_df = data_df.dropna()               # delete rows containing null values
How many ways can dataframe be traversed by row?
<1> df.iterrows(): iterates each row of the DataFrame as an (index, Series) pair; elements are accessed with row[name].
for index, row in df.iterrows():
    print(index)
    print(row['c1'], row['c2'])
<2> df.itertuples(): iterates each row of the DataFrame as a namedtuple, accessed with getattr(row, name); more efficient than iterrows().
for row in df.itertuples():
    print(getattr(row, 'c1'), getattr(row, 'c2'))
<3> df.iteritems(): iterates each column of the DataFrame as a (column name, Series) pair.
for name, col in df.iteritems():
    print(name, col)
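A minimal runnable version of the first two iteration styles (column names c1/c2 as in the text):

```python
import pandas as pd

df = pd.DataFrame({"c1": [10, 20], "c2": ["x", "y"]})

pairs = []
for index, row in df.iterrows():     # (index, Series) pairs
    pairs.append((index, row["c1"]))

values = []
for row in df.itertuples():          # namedtuples, faster than iterrows()
    values.append(getattr(row, "c2"))
```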
Dataframe Changes the column name
Two ways:
final_data_df.rename(columns={'A': 'a', 'B': 'b'}, inplace=True)  # rename any subset of columns
final_data_df.columns = ["", "", "", ""]  # this way, even to change one column name, you must write all the columns
Filter the data whose value is a specific length
new_data_df = data_df[data_df['xxx'].str.len() == 6]
new_data_df = data_df[data_df['content'].str[:10] == '2019-11-12']
None and Np. nan in dataframe
<1> None is native to Python and is of type object, so None cannot participate in any calculation.
<2> np.nan is a floating-point value, so it can participate in calculations, but the result of the calculation is always NaN.
<3> NULL in a database corresponds to None in Python.
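The difference can be seen directly:

```python
import numpy as np

x = np.nan + 1     # np.nan is a float, so arithmetic runs, but the result stays NaN
try:
    None + 1       # None is a plain Python object and supports no arithmetic
except TypeError:
    none_fails = True
```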
How do I assign null and missing values to individual elements when I customize a Dataframe
If a column holds strings, use None for the null value; for numeric columns use np.nan; for time columns use pd.NaT:
testframe = pd.DataFrame({'c1': [None, 'b'],
                          'c2': [1, np.nan],
                          'c3': [pd.Timestamp('2018-09-23'), pd.NaT]})
#      c1   c2         c3
# 0  None  1.0 2018-09-23
# 1     b  NaN        NaT
Usage of the between() function in a dataframe
dataframe["9-10"] = dataframe["a"].between(-52, 11)  # True where -52 <= a <= 11
Dataframe filters out data containing a character?
df = df[df["xxx"].str.contains("word")]         # rows whose column xxx contains "word"
df = df[df["xxx"].str.contains("hello|world")]  # contains "hello" or "world"
df = df[~df["xxx"].str.contains("word")]        # does not contain "word"
df = df[(~df["xxx"].str.contains("Hello")) & (df["xxx"].str.contains("World"))]  # contains "World" but not "Hello"
Keep n decimal places for the whole dataframe
dataframe = dataframe.round(2)
The apply() and applymap() functions on a dataframe
<2> DataFrame provides two straightforward and simple functions, apply() and applymap().
<3> apply() operates on certain rows or columns, while applymap() operates on all elements.
<4> apply() examples:
df['wide petal'] = df['petal width'].apply(lambda v: 1 if v >= x else 0)  # 1 when the width reaches some threshold x
df['petal area'] = df.apply(lambda r: r['petal length'] * r['petal width'], axis=1)
<5> applymap() example:
df.applymap(lambda v: np.log(v) if isinstance(v, float) else v)
Dataframe filling fillna ()
<1> df = df.fillna(method="ffill")                          # forward fill: each null takes the previous valid value
<2> df["col_name"] = df["col_name"].fillna(method="ffill")  # forward fill a single column
<3> df["col_name"] = df["col_name"].fillna(method="bfill")  # backward fill: each null takes the next valid value
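A small sketch of forward and backward filling; note that newer pandas versions prefer the .ffill()/.bfill() methods, since fillna(method=...) is deprecated there:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
forward = s.ffill()    # each NaN takes the previous valid value
backward = s.bfill()   # each NaN takes the next valid value
```

forward becomes [1.0, 1.0, 1.0, 4.0] and backward becomes [1.0, 4.0, 4.0, 4.0].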
Sums a column of data from a dataframe data structure
Data_df [" XXX "].sum() # data_df[" XXX "].sum() #Copy the code
dataframe.dropna()
df.dropna(axis=0, how="any", inplace=True)
# axis: 0 operates on rows (default), 1 operates on columns
# how: "any" drops a row with any null value (default), "all" drops it only when every value is null
# inplace: False returns a new data set (default), True operates on the original data set, directly replacing the original df
Calculating quantile values
import numpy as np
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(np.median(a))          # the median, i.e. the 50% quantile
print(np.percentile(a, 25))  # the 25% quantile
print(np.percentile(a, 75))  # the 75% quantile
Convert dataframe to dict
output = input_df.to_dict(orient="xxx")  # orient can be 'dict', 'list', 'series', 'split', 'records', 'index'
# to_json(orient="records") works the same way when you want a JSON string instead
<1> output = input_df.to_dict(orient="records")  # one dict per row
[{'employee id': '7067769', 'gender': 'male', 'marital status': 'married', 'employee type': 'front line', 'age': 28.4},
 {'employee id': '2834031', 'gender': 'female', 'marital status': 'married', 'employee type': 'front line', 'age': 32.5}]
<2> output = input_df.to_dict(orient="list")  # one list per column
{'employee id': ['7067769', '2834031'], 'gender': ['male', 'female'], 'marital status': ['married', 'married'], 'age': [28.4, 32.5]}
<3> output = input_df.to_dict(orient="index")  # the outer keys are the row index
{0: {'employee id': '7067769', 'gender': 'male', 'marital status': 'married', 'age': 28.4},
 1: {'employee id': '2834031', 'gender': 'female', 'marital status': 'married', 'age': 32.5}}
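A runnable sketch of the two most common orients on a tiny table (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

records = df.to_dict(orient="records")  # one dict per row
lists = df.to_dict(orient="list")       # one list per column
```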
6/sklearn
A machine learning library that provides a powerful toolbox, including data preprocessing, classification, regression, clustering, prediction, model analysis, and so on. sklearn is a powerful machine learning library, but it does not include one powerful model: neural networks. Keras makes up for that.
7/keras
Keras is the most widely used deep learning framework besides TensorFlow and is used for deep learning. TensorFlow is difficult to learn, but Keras is highly encapsulated and suitable for beginners.
8/gensim
gensim is for language tasks, such as word2vec for text similarity.