NumPy is an extension library for Python that supports large multi-dimensional arrays and matrix operations; Pandas is a Python package for data manipulation and analysis, and a powerful data analysis library in its own right. Both play an important role in everyday data analysis, which would be far more difficult without them. But sometimes we need to speed up our analysis. Is there anything that can help?
In this article, Kunal Dhariwal, a data and analytics engineer, introduces 12 NumPy and Pandas functions that make data analysis easier and more convenient. The Jupyter Notebook with the code used in this article can be found in the accompanying GitHub project.
np.argpartition() finds the indices of the N largest values without fully sorting the array:

```python
import numpy as np

x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
index_val = np.argpartition(x, -4)[-4:]
index_val
# array([1, 8, 2, 0], dtype=int64)
np.sort(x[index_val])
# array([10, 12, 12, 16])
```
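The same trick works for the smallest values by partitioning from the left; a minimal sketch (variable names are my own):

```python
import numpy as np

x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
# Indices of the 4 smallest values (their order within the group is not guaranteed)
small_idx = np.argpartition(x, 4)[:4]
smallest_four = np.sort(x[small_idx])
print(smallest_four.tolist())  # [0, 0, 1, 4]
```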
np.allclose() checks whether two arrays are element-wise equal within a tolerance, returning a Boolean:

```python
array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])
# With a tolerance of 0.1, it should return False:
np.allclose(array1, array2, 0.1)
# False
# With a tolerance of 0.2, it should return True:
np.allclose(array1, array2, 0.2)
# True
```
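The third positional argument above is actually rtol (relative tolerance); spelling out rtol and atol keeps the intent unambiguous. A small sketch, with tolerances chosen for illustration:

```python
import numpy as np

array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])
# Element-wise gaps are 0.01, 0.02, 0.02, 0.02
loose = np.allclose(array1, array2, rtol=0.0, atol=0.05)  # every gap is below 0.05
tight = np.allclose(array1, array2, rtol=0.0, atol=0.01)  # the 0.02 gaps fail
print(loose, tight)  # True False
```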
np.clip() limits the values in an array to a given interval:

```python
x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
np.clip(x, 2, 5)
# array([3, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 2])
```
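Passing None for one of the bounds clips from a single side only; a quick sketch:

```python
import numpy as np

x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
# None as the upper bound clips from below only
floored = np.clip(x, 2, None)
print(floored.tolist())  # [3, 17, 14, 23, 2, 2, 6, 8, 2, 2, 16, 2]
```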
np.extract() pulls out the elements of an array that satisfy a condition:

```python
# Random integers
array = np.random.randint(20, size=12)
array
# array([ 0,  1,  8, 19, 16, 18, 10, 11,  2, 13, 14,  3])
# Divide by 2 and check if remainder is 1
cond = np.mod(array, 2) == 1
cond
# array([False,  True, False,  True, False, False, False,  True, False,
#         True, False,  True])
# Use extract to get the values
np.extract(cond, array)
# array([ 1, 19, 11, 13,  3])
# Apply condition on extract directly
np.extract(((array < 3) | (array > 15)), array)
# array([ 0,  1, 19, 16, 18,  2])
```
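For a boolean mask, np.extract() is equivalent to plain boolean indexing; a sketch using the same values, hard-coded here instead of randomly generated for reproducibility:

```python
import numpy as np

# Same values as above, hard-coded for reproducibility
array = np.array([0, 1, 8, 19, 16, 18, 10, 11, 2, 13, 14, 3])
cond = np.mod(array, 2) == 1
# np.extract on a boolean mask gives the same result as boolean indexing
same = np.array_equal(np.extract(cond, array), array[cond])
print(same, array[cond].tolist())  # True [1, 19, 11, 13, 3]
```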
np.where() returns the indices where a condition holds, or chooses between two replacement values element-wise:

```python
y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# Where y is greater than 5, returns index position
np.where(y > 5)
# (array([2, 3, 5, 7, 8], dtype=int64),)
# First will replace the values that match the condition,
# second will replace the values that do not
np.where(y > 5, "Hit", "Miss")
# array(['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Hit', 'Miss', 'Hit', 'Hit'],
#       dtype='<U4')
```
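The replacement arguments of np.where() can also be arrays, not just scalars; a brief sketch:

```python
import numpy as np

y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# Keep values above 5, negate the rest
result = np.where(y > 5, y, -y)
print(result.tolist())  # [-1, -5, 6, 8, -1, 7, -3, 6, 9]
```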
np.percentile() computes the nth percentile of the data along the specified axis:

```python
a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
print("50th Percentile of a, axis = 0 : ", np.percentile(a, 50, axis=0))
# 50th Percentile of a, axis = 0 :  6.0

b = np.array([[10, 7, 4], [3, 2, 1]])
print("30th Percentile of b, axis = 0 : ", np.percentile(b, 30, axis=0))
# 30th Percentile of b, axis = 0 :  [5.1 3.5 1.9]
```
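np.percentile() also accepts a list of percentiles and returns one value per entry; a quick sketch:

```python
import numpy as np

a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# One call, three percentiles
quartiles = np.percentile(a, [25, 50, 75])
print(quartiles.tolist())  # [3.0, 6.0, 7.0]
```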
Pandas is well suited to many different kinds of data:

- Tabular data with heterogeneously-typed columns, such as SQL tables or Excel spreadsheets;
- Ordered and unordered (not necessarily fixed-frequency) time series data;
- Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels;
- Any other form of statistical data set. In fact, the data does not need to be labeled at all to be placed into a Pandas structure.
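A minimal illustration of heterogeneously-typed columns living in a single DataFrame (the sample data is made up):

```python
import pandas as pd

# One table, three column types; None becomes NaN in a float column
df_mixed = pd.DataFrame({
    'name': ['a', 'b'],        # strings (object dtype)
    'count': [1, 2],           # integers
    'score': [0.5, None],      # floats with a missing value
})
print(df_mixed.dtypes)
```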
Here are just a few of the things Pandas does well:

- Easy handling of missing data (represented as NaN) in both floating-point and non-floating-point data;
- Size mutability: columns can be inserted into and deleted from DataFrame and higher-dimensional objects;
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data;
- Flexible group-by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data;
- Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects;
- Intelligent label-based slicing, indexing, and subsetting of large data sets;
- Intuitive merging and joining of data sets;
- Flexible reshaping and pivoting of data sets;
- Hierarchical labeling of axes (multiple labels per tick are possible);
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving/loading data in HDF5 format;
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging, etc.
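Two of these features, automatic label alignment and NaN for missing data, can be sketched in a few lines (the sample data is my own):

```python
import pandas as pd

# Mismatched indices align automatically; missing positions become NaN
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
total = s1 + s2
print(total)
```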
pd.read_csv() loads CSV data into a DataFrame; nrows limits how many rows are read:

```python
import io
import requests
import pandas as pd

# I am using this online data set just to make things easier for you guys
url = "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv"
s = requests.get(url).content
# read only first 10 rows
df = pd.read_csv(io.StringIO(s.decode('utf-8')), nrows=10, index_col=0)
```
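If you want to try the same pattern offline, read_csv() accepts any file-like object, so an io.StringIO stands in for the downloaded bytes (the sample rows below are made up, merely mimicking the AirPassengers layout):

```python
import io
import pandas as pd

# Made-up rows mimicking the AirPassengers layout
csv_text = "time,value\n1949.0,112\n1949.08,118\n1949.17,132\n"
df_small = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(df_small.shape)  # (2, 2)
```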
Series.map() applies a function (or a mapping) element-wise:

```python
# create a dataframe
dframe = pd.DataFrame(np.random.randn(4, 3),
                      columns=list('bde'),
                      index=['India', 'USA', 'China', 'Russia'])
# compute a formatted string from each floating-point value in the frame
changefn = lambda x: '%.2f' % x
# Make changes element-wise
dframe['d'].map(changefn)
```
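map() also accepts a dict for value-to-value substitution; a small sketch with made-up data:

```python
import pandas as pd

# A dict maps old values to new ones; anything unmapped becomes NaN
s = pd.Series(['cat', 'dog', 'cat'])
mapped = s.map({'cat': 'feline', 'dog': 'canine'})
print(mapped.tolist())  # ['feline', 'canine', 'feline']
```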
DataFrame.apply() applies a function along an axis of the DataFrame:

```python
# max minus min lambda fn
fn = lambda x: x.max() - x.min()
# Apply this on dframe that we've just created above
dframe.apply(fn)
```
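With axis=1 the same function runs row-wise instead of column-wise; a sketch using a deterministic frame (values chosen so the result is predictable):

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame used above
dframe = pd.DataFrame(np.arange(12).reshape(4, 3),
                      columns=list('bde'),
                      index=['India', 'USA', 'China', 'Russia'])
fn = lambda x: x.max() - x.min()
row_range = dframe.apply(fn, axis=1)  # row-wise instead of column-wise
print(row_range.tolist())  # [2, 2, 2, 2]
```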
isin() filters rows whose values appear in a given list:

```python
# Using the dataframe we created for read_csv
filter1 = df["value"].isin([112])
filter2 = df["time"].isin([1949.000000])
df[filter1 & filter2]
```
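Negating an isin() mask with ~ keeps the rows not in the list; a sketch with made-up data:

```python
import pandas as pd

# Made-up frame; ~ inverts the isin mask
df_c = pd.DataFrame({'country': ['India', 'USA', 'China'],
                     'value': [1, 2, 3]})
others = df_c[~df_c['country'].isin(['USA'])]
print(others['country'].tolist())  # ['India', 'China']
```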
Plain assignment copies only a reference; copy() creates an independent object:

```python
# creating sample series
data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])

# Assigning issue that we face
data1 = data
# Change a value
data1[0] = 'USA'
# Also changes the value in the old series
data

# To prevent that, create a copy of the series
new = data.copy()
# assigning new values
new[1] = 'Changed value'
# printing data
print(new)
print(data)
```
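The difference can also be checked directly with object identity; a brief sketch:

```python
import pandas as pd

data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])
alias = data          # same object: mutations are shared
clone = data.copy()   # independent object
print(alias is data, clone is data)  # True False
clone[0] = 'USA'
print(data[0], clone[0])  # India USA
```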
select_dtypes() returns a subset of the DataFrame's columns based on their dtypes:

```python
# We'll use the same dataframe that we used for read_csv
framex = df.select_dtypes(include="float64")
# Returns only the time column
```
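The exclude parameter works as the mirror image of include; a sketch with a small made-up frame:

```python
import pandas as pd

# Small made-up frame: 'time' is float64, 'value' is an integer column
df_t = pd.DataFrame({'time': [1949.0, 1949.08], 'value': [112, 118]})
non_float = df_t.select_dtypes(exclude='float64')
print(non_float.columns.tolist())  # ['value']
```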
pivot_table() builds a spreadsheet-style pivot table:

```python
# Create a sample dataframe
school = pd.DataFrame({'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'],
                       'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                       'C': [26, 22, 20, 23, 24]})
# Let's create a pivot table to segregate students based on age and course
table = pd.pivot_table(school, values='A', index=['B', 'C'],
                       columns=['B'], aggfunc=np.sum,
                       fill_value="Not Available")
table
```
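Changing aggfunc changes what the pivot computes; for example, a count of students per course using the same sample data (a sketch):

```python
import pandas as pd

school = pd.DataFrame({'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'],
                       'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                       'C': [26, 22, 20, 23, 24]})
# Count students per course instead of aggregating names
counts = pd.pivot_table(school, values='A', index='B', aggfunc='count')
print(counts)
```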