Now let’s learn how to print data using PySpark. Data is one of the most fundamental resources today. It can be available in encrypted or unencrypted form, and we generate a great deal of it every day, whether by tapping a button on our smartphones or browsing the web on our computers. But why do we talk about it so much?

The main question that researchers faced in previous years was how to manage such a large amount of information. Technology is the answer, and with the advent of Apache Spark, PySpark was built to solve this problem.

If you’re new to PySpark, here’s a PySpark tutorial to get you started.

Introduction to Spark using PySpark

Apache Spark is a data processing engine that helps us build analytics solutions for large-scale software projects.

It is also a tool of choice for big data engineers and data scientists. Knowledge of Spark is one of the skills that tech companies are looking for.

It offers many extensions and language bindings. One of them is PySpark, the Spark API for Python developers. It is a support library that can be installed on any computer, so managing an installation is straightforward. As we all know, installing libraries in Python is easy.
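One common way to get it, assuming a pip-based setup (Anaconda users can install the same package through conda instead), is shown below.

pip install pyspark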

Before we print data using PySpark

Before we start learning about the different ways to print data using PySpark, there are a few prerequisites we need to cover:

  1. A core understanding of Python
  2. A core understanding of PySpark and its supporting packages
  3. Python 3.6 or later
  4. Java 1.8 or later (required)
  5. An IDE such as Jupyter Notebook or VS Code

To check the Python and Java versions, open a command prompt and type the following commands.

python --version 

java -version


Version checking

You can print data using PySpark in the following ways.

  • Print raw data
  • Format the printed data
  • Display the top 20-30 rows
  • Display the bottom 20 rows
  • Sort the data before displaying it

Resources and tools used in the rest of this tutorial:

  • Dataset: titanic.csv
  • Environment: Anaconda
  • IDE: Jupyter Notebook

Creating a session

In the Spark environment, the session is the entry point that keeps track of all the activity in our application. To create it, we use the SQL module in the Spark library.

The SparkSession class has a builder property that provides an appName() function. This function takes the name of the application as a string parameter.

We then create the session using the getOrCreate() method, chained with the dot (‘.’) operator. With this code, we create an application called “App”.

We are free to give the application any name we like. Don’t forget to create the session, because we can’t continue without it.

The code:

import pyspark 
from pyspark.sql import SparkSession 

session = SparkSession.builder.appName('App').getOrCreate() # creating an app


Creating a session

Different ways to print data using PySpark

Now that you’re all set, let’s get to the real deal: the different ways to print data using PySpark.

1. Print raw data

In this example, we will use a raw data set. In the field of AI (artificial intelligence), we call a collection of data a data set.

It comes in various forms, such as Excel files, comma-separated value (CSV) files, text files, or server document formats. So, make a note of the file format we use when printing the raw data.

In this case, we use a dataset with a .csv extension. The read property of the session has various functions for reading files.

These functions are usually named after the file types they read. Therefore, we use the csv() function for our dataset and store the result in a data variable.

The code:

data = session.read.csv('Datasets/titanic.csv') # read the CSV file into a data frame
data # calling the variable


By default, PySpark reads all data as strings. So, when we call our data variable, it shows every column typed as a string.
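If we want to confirm this, the dtypes attribute of the data frame lists every column together with its type (a small optional check, not part of the steps that follow; the _c0, _c1 names are Spark's defaults when no header has been read).

data.dtypes # e.g. [('_c0', 'string'), ('_c1', 'string'), ...]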

To print the raw data, use the dot (‘.’) operator to call the show() function on the data variable.

data.show()


Read data set

2. Format data

Formatting data in PySpark means displaying the appropriate data type for each column in the dataset. To display the headers, we use the option() function, which takes two string arguments:

  1. The key
  2. The value

For the key we pass ‘header’, and for the value we pass ‘true’. This tells Spark to read the first row as column headers and display them instead of the default column numbers.

The most important thing is to detect the data type of each column. To do this, we need to enable the inferSchema parameter in the csv() function we used earlier to read the dataset. This is a Boolean parameter, so we set it to True to activate it. We chain each function with the dot operator.

The code:

data = session.read.option('header', 'true').csv('Datasets/titanic.csv', inferSchema=True) # read headers and infer column types
data

data.show()


Display data in the correct format

The output:

We can see that the headers are now visible, along with the appropriate data type for each column.
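If we want to double-check the inferred types, printing the schema is a quick way to do so (an optional extra step; the exact types shown depend on the dataset).

data.printSchema() # e.g. PassengerId: integer, Name: string, Fare: double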

3. Display the top 20-30 rows

To display the top 20-30 rows, we need just one line of code. The show() function does this for us. If the dataset is large, it shows the first 20 rows by default. However, we can make it display as many rows as we like by passing that number as an argument to show().

data.show() # to display top 20 rows


Display the first 20 lines

data.show(30) # to display top 30 rows


Display the first 30 lines

We can do something similar with the head() function. This function provides access to the top rows of the dataset. It takes the number of rows as an argument and returns them row by row. For example, to display the first 10 rows:

data.head(10)


However, the result comes back as a list of rows rather than a formatted table, which makes head() a poor fit for large datasets with thousands of rows. Here is a demonstration.

Using the head() method to print the first 10 rows
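Even so, because head() returns a list of Row objects, we can still pull individual values out of the result if we need them (a small sketch; access by column name works here because we read the file with headers).

rows = data.head(10) # a plain Python list of Row objects
print(rows[0]['Name']) # read the 'Name' field of the first row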

4. Display the bottom 20 rows

This is also an easy task. The tail() function helps us here. We call it on the data frame variable and pass the number of rows we want to display as an argument. For example, to display the last 20 rows, we write the code as:

data.tail(20)


Display the bottom 20 rows

Again, we don’t get a properly formatted view, because the rows come back as a plain list and our dataset is too large to read comfortably this way.

5. Sort the data before displaying it

Sorting is the process of putting things in a proper order. This can be ascending (small to large) or descending (large to small). It plays an important role in viewing data points in sequence. Columns in a data frame can be of various types, but the two main ones are integers and strings.

  1. Integers are sorted numerically.
  2. Strings are sorted alphabetically.

PySpark’s sort() function is used for exactly this purpose. It can take a single column or multiple columns as arguments. Let’s try it on our dataset by sorting the PassengerId column. To do that, we have two functions:

  1. sort()
  2. orderBy()

Sort in ascending order

data = data.sort('PassengerId') # ascending order by default
data.show(5)


Sort a single column

The PassengerId column has been sorted, with all the elements placed in ascending order. So far we are sorting on a single column. To sort on multiple columns, we pass them to sort() one by one, separated by commas.

data = data.sort('Name', 'Fare')
data.show(5)


Sort multiple columns

Sort in descending order

This is specific to the orderBy() function, which provides an option to sort our data in descending order.

In this case, the code is almost the same, except that we call the desc() function on each column passed to orderBy(), chaining it with the dot operator.

desc() arranges all the elements of that particular column in descending order.

First, let’s look at all the columns in the dataset.

The code:

data.columns


List of columns

In the code below, we sort by the Name and Fare columns. Name is a string column, so it will be sorted alphabetically in reverse (Z to A), and Fare is numeric, so it will be sorted from largest to smallest.

The code:

data = data.orderBy(data.Name.desc(), data.Fare.desc())
data.show(5)


Sort in descending order

Conclusion

So, that’s all about how to print data using PySpark. Each piece of code is short and easy to understand, and it is enough to give us a working knowledge of Spark’s basic functions. This environment is very powerful for big data and other industrial and technological applications.