Notes on what I learned from docs.microsoft.com.

  • Explore and analyze data using Python

After decades of open source development, Python offers rich functionality through powerful statistical and numerical libraries:

  • NumPy and Pandas simplify data analysis and manipulation
  • Matplotlib provides compelling data visualization
  • Scikit-learn provides simple and effective predictive data analysis
  • TensorFlow and PyTorch provide machine learning and deep learning capabilities

Browse data using NumPy and Pandas

Data scientists can use a variety of tools and techniques to browse, visually present, and manipulate data. One of the most common ways data scientists work with data is by using the Python language and some specific data processing packages.

What is NumPy?

NumPy is a Python library that provides functionality comparable to mathematical tools such as MATLAB and R. While NumPy greatly simplifies the user experience, it also offers comprehensive mathematical functions.

What is Pandas?

Pandas is an extremely popular Python library for data analysis and manipulation. Think of Pandas as Excel for Python: it provides easy-to-use functionality for working with tables of data.

Explore the data in Jupyter notebooks

Jupyter Notebook is a common way to run basic scripts in a web browser. Typically, these notebooks are single web pages broken down into text cells and code cells that are executed on a server rather than on your local computer. This means you can get started quickly without having to install Python or other tools.

Test hypotheses

Data exploration and analysis is typically an iterative process in which the data scientist takes a sample of data and performs the following kinds of tasks to analyze it and test hypotheses (a minimal sketch of one such iteration follows the list):

  • Clean up data to deal with errors, missing values, and other problems.
  • Apply statistical techniques to better understand the data and get a better understanding of how the sample is expected to represent the real world population (allowing for random variation).
  • Visually present data to determine relationships between variables and, in machine learning projects, identify features that may predict labels.
  • Revise your assumptions and repeat the process.
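
To make the loop concrete, here's a minimal, hypothetical sketch of a single iteration using a tiny made-up DataFrame (the df variable and its values are invented for illustration, not data from this module):

import pandas as pd

# A tiny, made-up sample with one missing value
df = pd.DataFrame({'Grade': [50, 50, 47, None, 97, 49]})

# Clean: remove rows with missing values
df = df.dropna()

# Apply statistics: count, mean, std, quartiles, etc.
print(df['Grade'].describe())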

Use NumPy to explore data arrays

Let’s start with some simple data.

Suppose a university collects a sample of student scores for a data science course.

data = [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]
print(data)

The data has been loaded into a Python list structure, which is a good data type for general data manipulation but is not optimized for numerical analysis. For that, we'll use the NumPy package, which includes specific data types and functions for working with numbers in Python.

import numpy as np
grades = np.array(data)
print(grades)

To see the difference between a list and a NumPy array, let's compare how these data types behave when we multiply them by 2 in an expression.
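
The comparison might look something like this (a quick sketch reusing the data list and grades array defined above):

print(type(data), 'x 2:', data * 2)
print(type(grades), 'x 2:', grades * 2)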

Note that multiplying a list by 2 creates a new list twice as long, containing the original sequence of elements repeated. Multiplying a NumPy array, on the other hand, performs an element-wise calculation in which the array behaves like a vector, so you end up with an array of the same size in which each element has been multiplied by 2.

The point is that NumPy arrays are specifically designed to support mathematical manipulation of numeric data — which makes them more useful in data analysis than general-purpose lists.

grades.shape

Verify that the array has only one dimension containing 22 elements (the 22 scores in the original list). You can access each element of the array by its zero-based ordinal position; for example, grades[0] returns the first element.

Now that you're familiar with NumPy arrays, it's time to perform some analysis of the grade data. You can apply aggregations across the elements of the array, so let's find the simple average grade:

grades.mean()

49.18181818181818

So the mean grade is about 49.

Let's add a second set of data for the same students, this time recording the typical number of hours per week they devote to studying.

# Define an array of study hours
study_hours = [10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0, 8.5, 14.5, 15.5, 13.75, 9.0, 8.0, 15.5, 8.0, 9.0, 6.0, 10.0, 12.0, 12.5, 12.0]

# Create a 2D array (an array of arrays)
student_data = np.array([study_hours, grades])

# Display the array
print(student_data)

Show the shape of the two-dimensional array

print(student_data.shape)

student_data is now a two-dimensional array, with study hours in the first element and grades in the second, ready for us to analyze.
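
To navigate this structure, you specify the position of each element. For example, this sketch (using the sample data above) retrieves the first student's study hours:

# First element of the first sub-array: study hours for the first student
print(student_data[0][0])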

Get the average of each sub-array

avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()

print('Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(avg_study, avg_grade))

Explore table data using Pandas

While NumPy provides much of the functionality you need to work with numbers, and specifically with arrays of numeric values, when you start to deal with two-dimensional tables of data the Pandas package offers a more convenient structure: the DataFrame.

Run the following cell to import the Pandas library and create a DataFrame with three columns. The first column is a list of student names, and the second and third columns are the NumPy arrays containing the study hours and grade data.

import pandas as pd

df_students = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic', 'Jimmie', 'Rhonda', 'Giovanni', 'Francesca', 'Rajab', 'Naiyana', 'Kian', 'Jenny', 'Jakeem', 'Helena', 'Ismat', 'Anila', 'Skye', 'Daniel', 'Aisha'],
                            'StudyHours': student_data[0],
                            'Grade': student_data[1]})

Find and filter data in the DataFrame

You can use the DataFrame's loc method to retrieve data for a specific index value, as shown below.

Get the data with index 5:

df_students.loc[5]

You can also use a slice of index values, like this:

df_students.loc[0:5]

In addition to being able to use the loc method to find rows based on the index, you can use the iloc method to find rows based on their ordinal position in the DataFrame (regardless of the index):

df_students.iloc[0:5]

Take a close look at the iloc[0:5] results and compare them to the loc[0:5] results you obtained previously. Can you spot the difference?

The loc method returned rows with index label values in the list 0 to 5, which includes 0, 1, 2, 3, 4, and 5 (six rows). However, the iloc method returns the rows in the positions included in the range 0 to 5; since integer ranges don't include the upper-bound value, this includes positions 0, 1, 2, 3, and 4 (five rows).
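
You can confirm this by counting the rows each call returns (a quick sketch, assuming df_students is loaded as above):

print(len(df_students.loc[0:5]))   # 6 rows: label-based, upper bound included
print(len(df_students.iloc[0:5]))  # 5 rows: position-based, upper bound excluded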

iloc identifies data values in a DataFrame by position, which extends beyond rows to columns. So, for example, you can use it to find the values for the columns in positions 1 and 2 in row 0, like this:

df_students.iloc[0, [1, 2]]

Let's return to the loc method and see how it works with columns. Remember that loc is used to locate data items based on index values rather than positions. In the absence of an explicit index column, the rows in our DataFrame are indexed as integer values, but the columns are identified by name:

df_students.loc[0, 'Grade']

Here's another useful trick. You can use the loc method to find indexed rows based on a filtering expression that references named columns other than the index, like this:

df_students.loc[df_students['Name'] == 'Aisha']

And for good measure, you can achieve the same results by using the DataFrame's query method, like this:

df_students.query('Name=="Aisha"')

The previous three examples highlight an occasionally confusing truth about working with Pandas: often, there are multiple ways to achieve the same result. Another example is the way you refer to a DataFrame column name. You can specify the column name as a named index value (as in the df_students['Name'] examples we've seen so far), or you can use the column as a property of the DataFrame, like this:

df_students[df_students.Name == 'Aisha']
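
Putting it all together, the filters above are interchangeable; a quick sketch (each line should return the same single-row result for this sample data):

print(df_students.loc[df_students['Name'] == 'Aisha'])
print(df_students[df_students['Name'] == 'Aisha'])
print(df_students.query('Name=="Aisha"'))
print(df_students[df_students.Name == 'Aisha'])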

Load the DataFrame from a file

We built the DataFrame from some existing arrays. However, in many real-world scenarios, data is loaded from sources such as files. Let's replace the student grades DataFrame with the contents of a text file.

df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')
df_students.head()

The DataFrame's read_csv method is used to load data from text files. As you can see in the example code, you can specify options such as the column delimiter and which row (if any) contains column headers. (In this case, the delimiter is a comma and the first row contains the column names; these are the default settings, so the arguments could be omitted.)
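
In other words, for a comma-delimited file with a header row, a shorter call with the defaults left implicit (a sketch) behaves the same way:

df_students = pd.read_csv('grades.csv')
df_students.head()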

Handling missing values

One of the most common issues data scientists need to deal with is incomplete or missing data. So how would we know that the DataFrame contains missing values? You can use the isnull method to identify which individual values are null, like this:

df_students.isnull()

Of course, with a larger DataFrame, it would be inefficient to review all of the rows and columns individually, so we can get the sum of missing values for each column like this:

df_students.isnull().sum()

So now we know that there's one missing StudyHours value and two missing Grade values.

To see them in context, we can filter the DataFrame to include only rows where any of the columns (axis 1 of the DataFrame) are null:

df_students[df_students.isnull().any(axis=1)]

When the DataFrame is retrieved, the missing values show up as NaN (not a number).

So now that we’ve found the null values, what can we do with them?

One common approach is to impute replacement values. For example, if the number of study hours is missing, we could just assume that the student studied for an average amount of time and replace the missing value with the mean study hours. To do this, we can use the fillna method, like this:

df_students.StudyHours = df_students.StudyHours.fillna(df_students.StudyHours.mean())

Alternatively, it might be important to ensure that you only use data you know to be absolutely correct, so you can drop rows or columns that contain null values by using the dropna method. In this case, we'll remove rows (axis 0 of the DataFrame) where any of the columns contain null values:

df_students = df_students.dropna(axis=0, how='any')

Explore the data in the DataFrame

Now that we have cleaned up the missing values, we are ready to explore the data in the DataFrame. Let’s start by comparing average study time and grades.

# Get the mean study hours using the column name as an index
mean_study = df_students['StudyHours'].mean()

# Get the mean grade using the column name as a property (just to make the point!)
mean_grade = df_students.Grade.mean()

# Print the mean study hours and mean grade
print('Average weekly study hours: {:.2f}\nAverage grade: {:.2f}'.format(mean_study, mean_grade))

Ok, let’s filter the DataFrame to find only students who study more than the average time.

df_students[df_students.StudyHours > mean_study]

Note that the filtered result is itself a DataFrame, so you can work with its columns just like any other DataFrame.

For example, let's find the average grade for students who studied for longer than the average amount of time:

df_students[df_students.StudyHours > mean_study].Grade.mean()

Let’s assume that the passing grade of the course is 60.

We can use this information to add a new column to the DataFrame indicating whether each student has passed.

First, we'll create a Pandas Series containing the pass/fail indicator (True or False), and then we'll concatenate that series as a new column (axis 1) in the DataFrame:

passes = pd.Series(df_students['Grade'] >= 60)
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)

DataFrames are designed for tabular data, and you can use them to perform many of the same kinds of data analysis operations you can do in a relational database, such as grouping and aggregating tables of data.

For example, you can use the groupby method to group the student data into groups based on the Pass column you added previously, and count the number of names in each group; in other words, you can determine how many students passed and failed:

print(df_students.groupby(df_students.Pass).Name.count())

You can aggregate multiple fields in a group by using any available aggregation function. For example, you can find the mean study time and grade for the groups of students who passed and failed the course:

print(df_students.groupby(df_students.Pass)[['StudyHours', 'Grade']].mean())

DataFrames are amazingly versatile and make it easy to manipulate data. Many DataFrame operations return a new copy of the DataFrame, so if you want to modify a DataFrame but keep the existing variable, you need to assign the result of the operation to the existing variable. For example, the following code sorts the student data into descending order of Grade and assigns the resulting sorted DataFrame to the original df_students variable:

df_students = df_students.sort_values('Grade', ascending=False)
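
As an aside, many Pandas methods, including sort_values, also accept an inplace argument; assuming you prefer to avoid the reassignment, this sketch is equivalent:

df_students.sort_values('Grade', ascending=False, inplace=True)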

NumPy and Pandas DataFrames are the workhorses of data science in Python. They provide ways to load, explore, and analyze tabular data. As we'll see in subsequent modules, even advanced analysis methods typically rely on NumPy and Pandas for these important roles.