NumPy is an extension library for Python that supports large multi-dimensional arrays and matrix operations; Pandas is a Python package for data manipulation and analysis, and a powerful data analysis library in its own right. Both play an important role in everyday data analysis, which would be far more difficult without them. But sometimes we need to speed up our analysis. Is there anything that can help?
In this article, Kunal Dhariwal, a data and analytics engineer, introduces 12 NumPy and Pandas functions that make data analysis easier and more convenient. The Jupyter Notebook with the code used in this article can be found in the accompanying GitHub project.
np.argpartition() finds the indices of the N largest values without fully sorting the array:

```python
import numpy as np

x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
index_val = np.argpartition(x, -4)[-4:]
index_val
# array([1, 8, 2, 0], dtype=int64)
np.sort(x[index_val])
# array([10, 12, 12, 16])
```
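The same trick works for the smallest values by partitioning from the left; a minimal sketch (variable names are my own):

```python
import numpy as np

x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
# Indices of the 4 smallest values (their order within the group is not guaranteed)
small_idx = np.argpartition(x, 4)[:4]
smallest_four = np.sort(x[small_idx])
print(smallest_four.tolist())  # [0, 0, 1, 4]
```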
np.allclose() checks whether two arrays are element-wise equal within a tolerance, returning a Boolean:

```python
array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])
# With a tolerance of 0.1, it should return False:
np.allclose(array1, array2, 0.1)
# False
# With a tolerance of 0.2, it should return True:
np.allclose(array1, array2, 0.2)
# True
```
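The third positional argument above is actually rtol (relative tolerance); spelling out rtol and atol keeps the intent unambiguous. A small sketch, with tolerances chosen for illustration:

```python
import numpy as np

array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])
# Element-wise gaps are 0.01, 0.02, 0.02, 0.02
loose = np.allclose(array1, array2, rtol=0.0, atol=0.05)  # every gap is below 0.05
tight = np.allclose(array1, array2, rtol=0.0, atol=0.01)  # the 0.02 gaps fail
print(loose, tight)  # True False
```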
np.clip() limits the values in an array to a given interval:

```python
x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
np.clip(x, 2, 5)
# array([3, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 2])
```
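Passing None for one of the bounds clips from a single side only; a quick sketch:

```python
import numpy as np

x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
# None as the upper bound clips from below only
floored = np.clip(x, 2, None)
print(floored.tolist())  # [3, 17, 14, 23, 2, 2, 6, 8, 2, 2, 16, 2]
```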
np.extract() pulls out the elements of an array that satisfy a condition:

```python
# Random integers
array = np.random.randint(20, size=12)
array
# array([ 0,  1,  8, 19, 16, 18, 10, 11,  2, 13, 14,  3])
# Divide by 2 and check if remainder is 1
cond = np.mod(array, 2) == 1
cond
# array([False,  True, False,  True, False, False, False,  True, False,
#         True, False,  True])
# Use extract to get the values
np.extract(cond, array)
# array([ 1, 19, 11, 13,  3])
# Apply condition on extract directly
np.extract(((array < 3) | (array > 15)), array)
# array([ 0,  1, 19, 16, 18,  2])
```
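For a boolean mask, np.extract() is equivalent to plain boolean indexing; a sketch using the same values, hard-coded here instead of randomly generated for reproducibility:

```python
import numpy as np

# Same values as above, hard-coded for reproducibility
array = np.array([0, 1, 8, 19, 16, 18, 10, 11, 2, 13, 14, 3])
cond = np.mod(array, 2) == 1
# np.extract on a boolean mask gives the same result as boolean indexing
same = np.array_equal(np.extract(cond, array), array[cond])
print(same, array[cond].tolist())  # True [1, 19, 11, 13, 3]
```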
np.where() returns the indices where a condition holds, or chooses between two replacement values element-wise:

```python
y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# Where y is greater than 5, returns index position
np.where(y > 5)
# (array([2, 3, 5, 7, 8], dtype=int64),)
# First will replace the values that match the condition,
# second will replace the values that do not
np.where(y > 5, "Hit", "Miss")
# array(['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Hit', 'Miss', 'Hit', 'Hit'],
#       dtype='<U4')
```
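The replacement arguments of np.where() can also be arrays, not just scalars; a brief sketch:

```python
import numpy as np

y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# Keep values above 5, negate the rest
result = np.where(y > 5, y, -y)
print(result.tolist())  # [-1, -5, 6, 8, -1, 7, -3, 6, 9]
```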
np.percentile() computes the nth percentile of the data along the specified axis:

```python
a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
print("50th Percentile of a, axis = 0 : ", np.percentile(a, 50, axis=0))
# 50th Percentile of a, axis = 0 :  6.0

b = np.array([[10, 7, 4], [3, 2, 1]])
print("30th Percentile of b, axis = 0 : ", np.percentile(b, 30, axis=0))
# 30th Percentile of b, axis = 0 :  [5.1 3.5 1.9]
```
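np.percentile() also accepts a list of percentiles and returns one value per entry; a quick sketch:

```python
import numpy as np

a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# One call, three percentiles
quartiles = np.percentile(a, [25, 50, 75])
print(quartiles.tolist())  # [3.0, 6.0, 7.0]
```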
Pandas is well suited to many different kinds of data:

- Tabular data with heterogeneously-typed columns, such as SQL tables or Excel spreadsheets;
- Ordered and unordered (not necessarily fixed-frequency) time series data;
- Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels;
- Any other form of statistical data set. In fact, the data does not need to be labeled at all to be placed into a Pandas structure.
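A minimal illustration of heterogeneously-typed columns living in a single DataFrame (the sample data is made up):

```python
import pandas as pd

# One table, three column types; None becomes NaN in a float column
df_mixed = pd.DataFrame({
    'name': ['a', 'b'],        # strings (object dtype)
    'count': [1, 2],           # integers
    'score': [0.5, None],      # floats with a missing value
})
print(df_mixed.dtypes)
```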
Here are just a few of the things Pandas does well:

- Easy handling of missing data (represented as NaN) in both floating-point and non-floating-point data;
- Size mutability: columns can be inserted into and deleted from DataFrame and higher-dimensional objects;
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data;
- Flexible group-by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data;
- Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects;
- Intelligent label-based slicing, indexing, and subsetting of large data sets;
- Intuitive merging and joining of data sets;
- Flexible reshaping and pivoting of data sets;
- Hierarchical labeling of axes (multiple labels per tick are possible);
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving/loading data in HDF5 format;
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging, etc.
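Two of these features, automatic label alignment and NaN for missing data, can be sketched in a few lines (the sample data is my own):

```python
import pandas as pd

# Mismatched indices align automatically; missing positions become NaN
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
total = s1 + s2
print(total)
```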
pd.read_csv() loads CSV data into a DataFrame; nrows limits how many rows are read:

```python
import io
import requests
import pandas as pd

# I am using this online data set just to make things easier for you guys
url = "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv"
s = requests.get(url).content
# read only first 10 rows
df = pd.read_csv(io.StringIO(s.decode('utf-8')), nrows=10, index_col=0)
```
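If you want to try the same pattern offline, read_csv() accepts any file-like object, so an io.StringIO stands in for the downloaded bytes (the sample rows below are made up, merely mimicking the AirPassengers layout):

```python
import io
import pandas as pd

# Made-up rows mimicking the AirPassengers layout
csv_text = "time,value\n1949.0,112\n1949.08,118\n1949.17,132\n"
df_small = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(df_small.shape)  # (2, 2)
```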
Series.map() applies a function (or a mapping) element-wise:

```python
# create a dataframe
dframe = pd.DataFrame(np.random.randn(4, 3),
                      columns=list('bde'),
                      index=['India', 'USA', 'China', 'Russia'])
# compute a formatted string from each floating-point value in the frame
changefn = lambda x: '%.2f' % x
# Make changes element-wise
dframe['d'].map(changefn)
```
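map() also accepts a dict for value-to-value substitution; a small sketch with made-up data:

```python
import pandas as pd

# A dict maps old values to new ones; anything unmapped becomes NaN
s = pd.Series(['cat', 'dog', 'cat'])
mapped = s.map({'cat': 'feline', 'dog': 'canine'})
print(mapped.tolist())  # ['feline', 'canine', 'feline']
```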
DataFrame.apply() applies a function along an axis of the DataFrame:

```python
# max minus min lambda fn
fn = lambda x: x.max() - x.min()
# Apply this on dframe that we've just created above
dframe.apply(fn)
```
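With axis=1 the same function runs row-wise instead of column-wise; a sketch using a deterministic frame (values chosen so the result is predictable):

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame used above
dframe = pd.DataFrame(np.arange(12).reshape(4, 3),
                      columns=list('bde'),
                      index=['India', 'USA', 'China', 'Russia'])
fn = lambda x: x.max() - x.min()
row_range = dframe.apply(fn, axis=1)  # row-wise instead of column-wise
print(row_range.tolist())  # [2, 2, 2, 2]
```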
isin() filters rows whose values appear in a given list:

```python
# Using the dataframe we created for read_csv
filter1 = df["value"].isin([112])
filter2 = df["time"].isin([1949.000000])
df[filter1 & filter2]
```
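Negating an isin() mask with ~ keeps the rows not in the list; a sketch with made-up data:

```python
import pandas as pd

# Made-up frame; ~ inverts the isin mask
df_c = pd.DataFrame({'country': ['India', 'USA', 'China'],
                     'value': [1, 2, 3]})
others = df_c[~df_c['country'].isin(['USA'])]
print(others['country'].tolist())  # ['India', 'China']
```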
Plain assignment copies only a reference; copy() creates an independent object:

```python
# creating sample series
data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])

# Assigning issue that we face
data1 = data
# Change a value
data1[0] = 'USA'
# Also changes the value in the old series
data

# To prevent that, create a copy of the series
new = data.copy()
# assigning new values
new[1] = 'Changed value'
# printing data
print(new)
print(data)
```
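The difference can also be checked directly with object identity; a brief sketch:

```python
import pandas as pd

data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])
alias = data          # same object: mutations are shared
clone = data.copy()   # independent object
print(alias is data, clone is data)  # True False
clone[0] = 'USA'
print(data[0], clone[0])  # India USA
```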
select_dtypes() returns a subset of the DataFrame's columns based on their dtypes:

```python
# We'll use the same dataframe that we used for read_csv
framex = df.select_dtypes(include="float64")
# Returns only the time column
```
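The exclude parameter works as the mirror image of include; a sketch with a small made-up frame:

```python
import pandas as pd

# Small made-up frame: 'time' is float64, 'value' is an integer column
df_t = pd.DataFrame({'time': [1949.0, 1949.08], 'value': [112, 118]})
non_float = df_t.select_dtypes(exclude='float64')
print(non_float.columns.tolist())  # ['value']
```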
pivot_table() builds a spreadsheet-style pivot table:

```python
# Create a sample dataframe
school = pd.DataFrame({'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'],
                       'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                       'C': [26, 22, 20, 23, 24]})
# Let's create a pivot table to segregate students based on age and course
table = pd.pivot_table(school, values='A', index=['B', 'C'],
                       columns=['B'], aggfunc=np.sum,
                       fill_value="Not Available")
table
```
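Changing aggfunc changes what the pivot computes; for example, a count of students per course using the same sample data (a sketch):

```python
import pandas as pd

school = pd.DataFrame({'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'],
                       'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                       'C': [26, 22, 20, 23, 24]})
# Count students per course instead of aggregating names
counts = pd.pivot_table(school, values='A', index='B', aggfunc='count')
print(counts)
```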