1/pypi

Python's built-in data analysis capabilities are limited, so third-party libraries are installed to extend them and enrich its arsenal: numpy, pandas, matplotlib, scipy, keras, sklearn, gensim, and so on. A plain Python install ships only the standard library, so many third-party libraries are missing. Anaconda, by contrast, bundles many third-party libraries (NumPy, pandas, SciPy, ...) alongside the built-ins. Whenever you need another library, install it at any time with pip install xxxx. PyPI (the Python Package Index) is the repository that hosts all of these third-party packages.

2/numpy

Python itself doesn't provide arrays. Lists cover basic array functionality, but they aren't true arrays, and for large amounts of data they are unacceptably slow. The NumPy library provides real arrays, along with functions that process them quickly; NumPy's built-in routines run at C speed.
    import numpy as np
    arr1 = np.array([11, 22, 33])                  # Create a one-dimensional array
    arr2 = np.array([[11, 22, 33], [44, 55, 66]])  # Create a 2-row, 3-column two-dimensional array
    print(arr2 * arr2)  # This is array multiplication, the elements of the corresponding position are multiplied
NumPy provides multidimensional arrays, but they are generic arrays, not matrices. For example, multiplying two arrays with * multiplies the corresponding elements; it is not matrix multiplication.
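A minimal sketch of that difference (the sample values are made up): * is element-wise, while @ (or np.dot) is the true matrix product.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b   # multiplies elements at the same positions
matrix = a @ b        # true matrix product (equivalent to np.dot(a, b))
```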

np.unique(list,return_counts=True)

    vals, cnts = np.unique(lst, return_counts=True)   # unique values and how often each occurs

3/scipy

SciPy is an extension built on NumPy for scientific computing: optimization, linear algebra, integration, interpolation, fitting, fast Fourier transforms, signal processing, statistics, and many other tasks.
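As one hedged example of the kind of task SciPy handles, numerical integration with scipy.integrate.quad (the integrand here is illustrative):

```python
from scipy import integrate

# integrate x**2 over [0, 1]; the exact value is 1/3
value, abs_error = integrate.quad(lambda x: x ** 2, 0, 1)
```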

4/matplotlib

Matplotlib is a third-party visualization library. It is mainly used for two-dimensional plotting, with some simple three-dimensional support. Three concepts come up constantly:
figure: usually abbreviated fig, the window that pops up when the program runs. It acts as the overall canvas and is created with plt.figure().
axes: each figure can hold one subplot or several; each subplot is an axes.
axis: within each axes sit the coordinate axes themselves (x and y).

Add a horizontal/vertical line to a figure

      plt.axhline(y=4,ls=":",c="yellow")  # Add a horizontal line at y=4
      plt.axvline(x=4,ls="-",c="green")   # Add a vertical line at x=4

Fonts

      import matplotlib.pyplot as plt
      plt.rcParams["font.family"] = ["Arial Unicode MS"]  # Specify the font family
      plt.rcParams["font.sans-serif"] = ["SimHei"]        # SimHei, for Chinese characters
      plt.rcParams['font.sans-serif'] = ["Microsoft JhengHei"] # or Microsoft JhengHei
      plt.rcParams['axes.unicode_minus'] = False     # Display the minus sign correctly

Common commands

      import matplotlib.pyplot as plt
      fig = plt.figure(num='figure name',figsize=(a,b)) 
      # Create the canvas fig; num names the whole figure (note: num=, not title=)
      # figsize=(width, height) sets the canvas size

      # Draw the first subgraph in the FIG: the graph
      axes1 = fig.add_subplot(311)  # Add a subplot to fig; 311 means 3 rows, 1 column, first subplot. Even a lone subplot uses fig.add_subplot(111)
      axes1.set_title("xxxxx",fontsize=20) # Define subgraph title, caption size
      axes1.set_xlim([])       # Set the range of the X-axis
      axes1.set_ylim([])       # Set the range of the Y-axis
      axes1.set_xticklabels()  # Set the X-axis tick labels
      axes1.set_yticklabels()  # Set the Y-axis tick labels
      axes1.set_xticks()    # Set where each tick sits on the X-axis
      axes1.set_yticks()    # Set where each tick sits on the Y-axis
      axes1.set_xlabel()    # Set the X-axis label
      axes1.set_ylabel()    # Set the Y-axis label
      axes1.plot()          # Draw diagram


      # Draw the second subplot in fig: a pie chart
      axes2 = fig.add_subplot(312)  # Define the second subplot (second of 3 rows, 1 column)
      axes2.pie(sizes, explode=explode, labels=labels, colors=colors,
                autopct='%1.1f%%', shadow=True, startangle=45)
      # A pie chart takes the size, label and color of each slice;
      # autopct shows each share as a percentage with one decimal place
      axes2.axis("equal")   # keep the pie an undistorted, standard circle


      # Draw the third subplot in fig: a stacked bar chart
      axes3 = fig.add_subplot(313)
      axes3.bar(x, y1, color='r', label='Chinese')
      axes3.bar(x, y2, bottom=y1, color='g', label='mathematics')
      axes3.bar(x, y3, bottom=y1+y2, color='c', label='English')
      axes3.set_xlim(1, 20)   # X-axis display range
      axes3.set_ylim(1, 100)  # Y-axis display range
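Putting those pieces together, a minimal runnable sketch (the data and subplot titles are made up; the Agg backend is used so it renders off-screen):

```python
import matplotlib
matplotlib.use("Agg")   # off-screen backend, no display window needed
import matplotlib.pyplot as plt

fig = plt.figure(num="demo", figsize=(6, 8))

axes1 = fig.add_subplot(311)                 # 3 rows, 1 column, first subplot
axes1.plot([1, 2, 3], [2, 4, 6])
axes1.set_title("line", fontsize=12)

axes2 = fig.add_subplot(312)                 # second subplot: a pie chart
axes2.pie([30, 70], labels=["a", "b"], autopct="%1.1f%%")
axes2.axis("equal")

axes3 = fig.add_subplot(313)                 # third subplot: a bar chart
axes3.bar([1, 2, 3], [5, 3, 4])
```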

Add a legend

   fig = plt.figure( figsize=(12, 6) )

   x = temp_df["data_month"].tolist()
   y1 = temp_df["cohesion"].tolist()
   y2 = temp_df["apl"].tolist()
     
   # Draw a graph. If you need a legend, you must have the label parameter
   plt.plot(x,y1,label="cohesion")
   plt.plot(x,y2,label="apl")
   plt.legend(loc='upper left',
              prop = {'family':'Times New Roman', 'weight':'normal', 'size':23})
              
   plt.title("Chart of %s by month" % i,fontsize=20)
   plt.xlabel("data_month",fontsize=20)
   plt.ylabel("cohesion_and_apl",fontsize=20)
 
   plt.show()

   plt.legend(loc='upper right') # Draw the legend; loc sets its position, and loc='best' picks the best spot automatically
   # Legend styling (font family, weight, size) goes through prop:
   font1 = {'family':'Times New Roman', 'weight':'normal', 'size':23}
   plt.legend(loc='upper left',prop = font1)
      
   axes3.grid(axis='y', color='gray', linestyle='-', linewidth=1)  
   # Draw grid lines, only in the y axis, color, line type, line width
      
   plt.show() 
   # However many subplots the figure contains, one plt.show() call displays them all

Save the figure

      plt.savefig(path, dpi=1000, format="pdf")
      # Call savefig before plt.show(); after show() the figure is cleared, so the saved file may come out blank

5/pandas

The basic data structures in pandas are the Series and the DataFrame. A Series is a one-dimensional array; a DataFrame is a two-dimensional table. Within one DataFrame column every value has the same data type, but types can differ between columns.

Pandas display options

      pd.set_option("display.max_rows", None)       # show all rows
      pd.set_option("display.max_columns", None)    # show all columns
      pd.set_option("display.max_colwidth", 500)    # max display width of a single column (default 50)
      pd.set_option("display.width", 10000)         # total display width of a line

Replacing values in a dataframe

      df = df.replace({"xxx": ["None", None]}, "2099-12-31")  # in column "xxx", replace "None" and None with "2099-12-31"
      df[col] = df[col].replace(a, b)          # replace a with b in one column
      df = df.replace(a, b)                    # replace a with b across the whole dataframe
      df = df.replace([a, aa, aaa], bbb)       # replace several values with bbb
      df[col] = df[col].replace([a, aa, aaa], bbb)   # same, restricted to one column
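A small self-contained sketch of two of the patterns above (column names and values are invented):

```python
import pandas as pd

df = pd.DataFrame({"xxx": ["None", "keep", "None"], "yyy": [1, 2, 1]})

df["xxx"] = df["xxx"].replace("None", "2099-12-31")  # replace in one column only
df = df.replace(1, 99)                               # replace across the whole dataframe
```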

Determine if dataframe is empty

<2> A DataFrame has an "empty" attribute: df.empty returns True when df has no rows and False otherwise. It is an attribute, not a method, so don't add () after empty. <3> df.shape[0] is the row count and df.shape[1] the column count, so df.shape[0] == 0 also signals an empty dataframe.

Create an empty dataframe

      empty_df = pd.DataFrame()                            # completely empty: no columns, no index
      empty_df = pd.DataFrame(columns=["a","b","c"])       # columns defined, no rows
      empty_df = pd.DataFrame(data=None, columns=range(1,5), index=[0,1])   # given shape, filled with NaN

Loc and iloc

df.loc[] selects by row label and column label (x_label, y_label), while df.iloc[] selects by integer position (row index, column index). When the row or column labels of the data are long or hard to remember, .iloc[] is convenient: you only need the positions.
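A tiny sketch of the two selectors picking the same cell (labels and values are invented):

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 30], "score": [88, 92]}, index=["ann", "bob"])

by_label = df.loc["ann", "score"]   # label-based lookup
by_position = df.iloc[0, 1]         # position-based lookup of the same cell
```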

Create a dataframe

      # <1> From a dictionary: each key/value pair is one column's name and data
      dict1 = {'name': ['ccc','aaa','bbb'], 'age': [20,21,23]}
      data_df = pd.DataFrame(dict1, index=xxx)   # index is optional

      # <2> From a nested list: each inner list is one row
      list1 = [[1,2,3], [4,5,6], [7,8,9]]
      data_df = pd.DataFrame(list1, index=xxx, columns=xxx)

Dataframe sorting

      new_df = old_df.sort_values(by="col_name", ascending=True)
      # by names the column to sort on; ascending=True (the default) sorts ascending, ascending=False descending

      # When reading a data file you can load only selected columns with usecols:
      data_df = pd.read_csv(filepath, encoding="gb18030", usecols=["xxx","xxx"])

Add row data && add column data to a dataframe

      df = df.append(dic, ignore_index=True)   # append a row from a dict (removed in pandas 2.0; use pd.concat instead)
      df["new_col"] = values_list              # add a column; the list length must match the row count
      df.insert(loc=0, column="col_name", value="xxx")   # insert a column at position 0
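Since df.append was removed in pandas 2.0, a minimal sketch of the same operations with pd.concat (all names and values invented):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# append a row: wrap the new row in a one-row DataFrame and concatenate
new_row = pd.DataFrame([{"a": 3, "b": 4}])
df = pd.concat([df, new_row], ignore_index=True)

df["c"] = [10, 20]                               # new column; length must match the row count
df.insert(loc=0, column="id", value=["x", "y"])  # insert a column at position 0
```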

Save the dataframe

      data_df.to_csv(path, index=False, encoding="gb18030")   # index=False leaves the row index out of the file

In the dataframe, convert the string to a time format

      data_df["birth_date"] = pd.to_datetime(data_df["birth_date"])   # null values become NaT
      # fill nulls first if you need a concrete date instead of NaT:
      data_df["resign_date"] = pd.to_datetime(data_df["resign_date"].fillna("2099-12-31"))

Dataframe deduplication

      df.drop_duplicates(subset=['a','b'], keep='first', inplace=False)
      # subset: the columns to compare, default all columns
      # keep: {'first', 'last', False}, default 'first'
      # inplace=True removes duplicates from the original DataFrame; the default False returns a new DataFrame
      df.drop_duplicates(subset='id', keep='first', inplace=True)
      df.drop_duplicates(subset='id', keep='first')   # inplace can be omitted, since it defaults to False
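A runnable sketch of subset/keep on invented data:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "val": ["x", "y", "z"]})

# duplicates judged on "id" only; the first occurrence of each id survives
deduped = df.drop_duplicates(subset="id", keep="first")
```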

The dataframe.insert() function

pandas.get_dummies()

Suppose x is a categorical variable, such as blood type with values A, B, AB, O. After get_dummies(), it becomes four indicator columns x_A, x_B, x_AB, x_O, whose values are 0 and 1. This is one-hot encoding.

      pd.get_dummies(df[col], prefix="x")   # prefix prepends a name to each generated column; the other parameters are rarely needed

For example, pd.get_dummies(df['Embarked'], prefix='Embarked') turns the Embarked column's three unique values into three dummy variables, each named with the prefix.
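A runnable sketch using the blood-type example from above (the prefix "x" is illustrative):

```python
import pandas as pd

blood = pd.Series(["A", "B", "AB", "O"])
dummies = pd.get_dummies(blood, prefix="x")   # one indicator column per unique value
```

Each row has exactly one "hot" column, which is what makes it one-hot encoding.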

dataframe.sample()

Sometimes we only need part of a data set rather than all of it, so we sample it randomly; pandas ships this as sample(). Typical scenario: the data has 100,000 rows of 11 columns each, and we want a random 20,000 of them. The signature is:

      DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

n is the number of rows to draw (n=20000 draws 20,000 rows). frac is the fraction to draw, for when you care about a proportion rather than a count (e.g. frac=0.8); use either n or frac, not both. replace=True samples with replacement. weights assigns per-row sampling weights (see the official documentation for details). random_state seeds the generator so the sample is reproducible. axis picks the direction: axis=0 (the default) draws n rows, axis=1 draws n columns.
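A small sketch of n versus frac (data and seeds are invented; random_state makes the draws reproducible):

```python
import pandas as pd

df = pd.DataFrame({"a": range(100)})

part = df.sample(n=20, random_state=42)     # 20 random rows
half = df.sample(frac=0.5, random_state=0)  # half of the rows
```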

Shuffling a dataframe

 from sklearn.utils import shuffle
 new_df = shuffle(df)

Changing a dataframe's data type

      df['col'] = df['col'].astype(np.float64)      # cast a column to another dtype
      base_df["xxx"] = base_df["xxx"].apply(int)    # element-wise conversion via apply

Keep 3 decimal places for a column of a dataframe

      fmt = lambda x: "%.3f" % x
      df[col] = df[col].map(fmt)   # keep 3 decimal places (note: the result is a string column)

Applications of fillna()

      fillna_values = {'col1': 1, 'col2': 2, 'col3': 3}
      df = df.fillna(value=fillna_values)            # per-column fill values
      df = df.fillna(value=fillna_values, limit=1)   # fill at most one null per column
      fillna_value = {'price': df["price"].mean()}
      df = df.fillna(value=fillna_value)             # fill nulls in the price column with its mean

Dataframe copy

df2 = df1.copy(deep=True) is a deep copy: two completely independent objects in memory, so when df1 changes, df2 does not. Without a deep copy you effectively have two names for one object, and a change through either name is visible through both.

Dataframe data filtering

      # Note: isin() has a negation ~isin(); likewise isnull() has notnull()
      temp_df = temp_df[(temp_df["resign_date"] >= date)
                        & (temp_df['rank'].isin(['K3A','K3B','K3C','K4A','K4B','K4C']))
                        & (temp_df['appraisal'].notnull())
                        & (temp_df['performance'].notnull())].reset_index(drop=True)
      # same filter, but keeping rows where appraisal and performance are BOTH null:
      temp_df = temp_df[(temp_df["resign_date"] >= date)
                        & (temp_df['rank'].isin(['K3A','K3B','K3C','K4A','K4B','K4C']))
                        & (temp_df['appraisal'].isnull())
                        & (temp_df['performance'].isnull())].reset_index(drop=True)
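A self-contained sketch of combining isin() and notnull() with & (column names and values invented):

```python
import pandas as pd

df = pd.DataFrame({"rank": ["K3A", "K5A", "K4B"],
                   "score": [80.0, None, 90.0]})

# keep rows whose rank is in the list AND whose score is not null
kept = df[(df["rank"].isin(["K3A", "K4B"]))
          & (df["score"].notnull())].reset_index(drop=True)
```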

Delete rows and columns from dataframe

      data_df = data_df.drop(index=row_labels)       # drop rows by index label
      data_df = data_df.drop(columns=["col_name"])   # drop columns by name
      data_df = data_df.dropna()                     # drop rows that contain null values

How many ways can dataframe be traversed by row?

      # <1> df.iterrows(): yields each row as an (index, Series) pair; access fields with row[name]
      for index, row in df.iterrows():
          print(index)
          print(row['c1'], row['c2'])

      # <2> df.itertuples(): yields each row as a namedtuple; more efficient than iterrows()
      for row in df.itertuples():
          print(getattr(row, 'c1'), getattr(row, 'c2'))

      # <3> df.iteritems() (renamed df.items() in newer pandas) iterates COLUMNS,
      # not rows, as (column name, Series) pairs
      for name, col in df.iteritems():
          print(name, col)
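A runnable sketch of itertuples(), the faster of the row iterators (data invented):

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 2], "c2": [3, 4]})

total = 0
for row in df.itertuples():   # each row arrives as a namedtuple
    total += getattr(row, "c1") + getattr(row, "c2")
```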

Changing dataframe column names

      # Two ways:
      final_data_df.rename(columns={'A':'a', 'B':'b'}, inplace=True)   # rename any subset of columns
      final_data_df.columns = ["", "", "", ""]   # this way you must list every column, even to change one name

Filter the data whose value is a specific length

      new_data_df = data_df[ data_df['xxx'].str.len() == 6 ]                 # values exactly 6 characters long
      new_data_df = data_df[ data_df['content'].str[:10] == '2019-11-12' ]   # values whose first 10 characters match

None and Np. nan in dataframe

<1> None is Python's native null and has type object, so None cannot take part in numeric calculations. <2> np.nan is a float, so it can take part in calculations, but any calculation involving NaN yields NaN. <3> NULL in a database corresponds to None in Python.

How do I assign null and missing values to individual elements when I customize a Dataframe

For a string column assign None as the null value; for a numeric column use np.nan; for a datetime column use pd.NaT.

      testframe = pd.DataFrame({'c1': [None, 'b'],
                                'c2': [1, np.nan],
                                'c3': [pd.Timestamp('2018-09-23'), pd.NaT]})
      #      c1   c2         c3
      # 0  None  1.0 2018-09-23
      # 1     b  NaN        NaT

Usage of the function between() in dataframe

      dataframe["9-10"] = dataframe["a"].between(-52, 11)   # True where -52 <= a <= 11

Filtering dataframe rows that contain a given substring

      df = df[df["xxx"].str.contains("word")]            # keep rows whose xxx column contains "word"
      df = df[df["xxx"].str.contains("Hello|World")]     # contains "Hello" or "World" (regex alternation)
      df = df[~df["xxx"].str.contains("word")]           # does NOT contain "word"
      df = df[(df["xxx"].str.contains("Hello")) | (df["xxx"].str.contains("World"))]   # contains "Hello" or "World"

Round the whole dataframe to N decimal places

dataframe = dataframe.round(2)

The apply() and applymap() functions on a dataframe

<2> Pandas provides two straightforward, simple functions: apply() and applymap(). <3> apply() operates on a row or column (a Series), while applymap() operates on every element of the DataFrame (renamed DataFrame.map in pandas 2.1).

      # apply() on one column (the 1.3 threshold is illustrative)
      df['wide petal'] = df['petal width'].apply(lambda v: 1 if v >= 1.3 else 0)
      # apply() across columns, row by row
      df['petal area'] = df.apply(lambda r: r['petal length'] * r['petal width'], axis=1)
      # applymap() on every element
      df = df.applymap(lambda v: np.log(v) if isinstance(v, float) else v)

Dataframe filling fillna ()

      df = df.fillna(method="ffill")                          # <1> forward-fill the whole dataframe
      df["col_name"] = df["col_name"].fillna(method="ffill")  # forward-fill one column with the previous value
      df["col_name"] = df["col_name"].fillna(method="bfill")  # backward-fill one column with the next value
      # newer pandas also offers df.ffill() / df.bfill() directly

Summing a column of a dataframe

      data_df["xxx"].sum()   # sum of one column

dataframe.dropna()

      df.dropna(axis=0, how="any", inplace=True)
      # axis: 0 operates on rows (default); 1 on columns
      # how: "any" drops when any value is null (default); "all" drops only when every value is null
      # inplace: False returns a new dataframe (default); True operates on the original df directly

Calculating quantile values

      import numpy as np
      a = [1,2,3,4,5,6,7,8,9,10]
      print(np.median(a))           # median (the 50% quantile)
      print(np.percentile(a, 25))   # 25% quantile
      print(np.percentile(a, 75))   # 75% quantile

Convert dataframe to dict

      output_dict = input_df.to_dict(orient="records")   # list of per-row dicts
      output_json = input_df.to_json(orient="records")   # same shape, serialized as a JSON string

      output_dict = input_df.to_dict(orient="xxx")
      # orient: one of 'dict', 'list', 'series', 'split', 'records', 'index'

      # <1> orient="records": a list of dicts, one per row, e.g.
      # [{'emp_id': '7067769', 'gender': 'male', 'marital_status': 'married', ...},
      #  {'emp_id': '2834031', 'gender': 'female', 'marital_status': 'married', ...}]

      # <2> orient="list": one dict mapping each column name to its list of values, e.g.
      # {'emp_id': ['7067769', '2834031'], 'gender': ['male', 'female'], ...}

      # <3> orient="index": one dict keyed by row index, each value a row dict, e.g.
      # {0: {'emp_id': '7067769', 'gender': 'male', ...},
      #  1: {'emp_id': '2834031', 'gender': 'female', ...}}
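A runnable sketch of the three orientations on a tiny invented dataframe:

```python
import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob"], "age": [20, 30]})

records = df.to_dict(orient="records")   # list of row dicts
by_list = df.to_dict(orient="list")      # column name -> list of values
by_index = df.to_dict(orient="index")    # row index -> row dict
```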

6/sklearn

A machine learning library offering a powerful toolbox: data preprocessing, classification, regression, clustering, prediction, model analysis, and so on. One powerful model sklearn does not include is the neural network; Keras makes up for that.
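A minimal hedged sketch of the preprocessing-then-classification workflow on invented, trivially separable data:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

X_scaled = StandardScaler().fit_transform(X)  # preprocessing: zero mean, unit variance
clf = LogisticRegression().fit(X_scaled, y)   # classification
pred = clf.predict(X_scaled)
```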

7/keras

Besides TensorFlow itself, Keras is the most widely used deep learning framework. TensorFlow is difficult to learn, while Keras is highly encapsulated and well suited to beginners.

8/gensim

Gensim is for language tasks, such as training word2vec vectors to compute text similarity.