Wechat official account: Youerhuita author: Peter Editor: Peter

Creating DataFrame Data

In my previous article, I covered two important types of data structures in Pandas: the Series and DataFrame types, and detailed how to create Series data.

This article describes how to create DataFrame data, which is the most common data type in Pandas. Almost all subsequent articles are based on DataFrame data.

Further reading

1. Using the explode function in Pandas

2. Pandas series, part 1: creating Series data

Import libraries

We recommend installing Pandas and NumPy through Anaconda. PyMySQL is a third-party library that Python uses to connect to a MySQL database and work with its tables; it has to be installed as well.

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import pymysql   # install first: pip install pymysql

10 ways to create DataFrame data

The following sections show different ways to create DataFrame data. Most of them end up calling the constructor pd.DataFrame(); the file readers such as pd.read_csv() return a DataFrame directly.

Create an empty DataFrame

1. Create a completely empty DataFrame

An empty DataFrame displays no data when printed; however, a check with the type() function shows that the object is of type DataFrame.
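A minimal sketch of this step. The original code is not shown, so the variable name df_empty is an assumption made here for illustration:

```python
import pandas as pd

# Calling the constructor with no arguments gives a frame with no rows and no columns
df_empty = pd.DataFrame()

print(isinstance(df_empty, pd.DataFrame))  # True
print(df_empty.shape)                      # (0, 0)
print(df_empty.empty)                      # True
```

The empty attribute is a convenient way to test for this case before further processing.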

2. Create data filled with NaN values

df0 = pd.DataFrame(
  columns=['A', 'B', 'C'],  # specify the column names
  index=[0, 1, 2]  # specify the row index
)

df0

Change the row index of the data:

df0 = pd.DataFrame(
  columns=['A', 'B', 'C'],
  index=[1, 2, 3]  # change the row index: start at 1
)

df0

Create a DataFrame manually

List the values for each column:

df1 = pd.DataFrame({
    "name": ["Xiao Ming", "Little red", "Little hou", "Chou", "Note"],
    "sex": ["Male", "Female", "Female", "Male", "Male"],
    "age": [20, 19, 28, 27, 24],
    "class": [1, 2, 2, 1, 2]
})

df1

Create by reading local files

Pandas can create a DataFrame by reading local CSV, Excel, JSON, and TXT files.

1. Read the CSV file

For example, some data on Chengdu food that I once scraped is stored in CSV format:

df2 = pd.read_csv("Chengdu Cuisine.csv")   # pass the file path inside the parentheses; the file for this article is in the course directory
df2

2. Read the Excel file

If it is an Excel file, it can also be read:

df3 = pd.read_excel("Chengdu Cuisine.xlsx")
df3.head()  # shows the first 5 rows by default

3. Read the JSON file

For example, the current directory contains a data file in JSON format.

Read it in with Pandas:

df4 = pd.read_json("information.json")
df4
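The contents of the JSON file are not reproduced here, so the following is only a sketch with made-up data. It assumes the JSON is a list of records (one object per row) and passes the text through io.StringIO instead of a file on disk; json_text and df_json are illustrative names.

```python
import io
import pandas as pd

# Made-up JSON text: a list of records, one object per row
json_text = '[{"name": "Xiao Ming", "age": 20}, {"name": "Xiao Hong", "age": 19}]'

# read_json accepts a file path or a file-like object
df_json = pd.read_json(io.StringIO(json_text))

print(df_json)
```

Each key of the JSON objects becomes a column, and each object becomes one row.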

4. Read the TXT file

There is a TXT file in the current directory, as shown in the following figure:

df5 = pd.read_table("text.txt")
df5

If no extra arguments are specified, as in the call above, Pandas treats the first row of data as the column names (which is not what we want). Passing the column names and the delimiter explicitly fixes this:

df7 = pd.read_table(
  "text.txt",  # file path
  names=["Name", "Age", "Gender", "Province"],  # specify the column names
  sep=" "  # specify the delimiter: a space
)

df7

Another solution is to modify the TXT file directly and add the desired column names at the top of the file, so that the top row is treated as the header:

Name Age Gender Birthplace
XiaoMing 20 Male Shenzhen
XiaoHong 19 Female Guangzhou
Sun 28 Female Beijing
XiaoZhou 25 Male Shanghai
XiaoZhang 22 Female Hangzhou
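Both fixes can be compared without a file on disk. The sketch below is an illustration rather than the article's actual file: the raw string and its two rows are made up, and io.StringIO stands in for text.txt.

```python
import io
import pandas as pd

# Hypothetical file contents: space-delimited, no header row
raw = "XiaoMing 20 Male Shenzhen\nXiaoHong 19 Female Guangzhou\n"

# Fix 1: supply the column names through the names argument
df_names = pd.read_table(
    io.StringIO(raw),
    sep=" ",
    names=["Name", "Age", "Gender", "Province"],
)

# Fix 2: put the header line at the top of the text itself
df_header = pd.read_table(
    io.StringIO("Name Age Gender Province\n" + raw),
    sep=" ",
)

print(df_names.equals(df_header))  # True: both routes give the same frame
```

Passing names leaves the file untouched, which is usually preferable when the file is shared or regenerated.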

Create by reading from a database

1. Install PyMySQL first

This section uses the PyMySQL library to query a database and then loads the result into Pandas. First install PyMySQL (I will assume you already know how):

pip install pymysql

First look at the data in a table in the local database: read all the data in the Student table

The actual data is as follows:

2. Establish a connection

connection = pymysql.connect(
    host="IP address",        # replace the placeholder strings with your own settings
    port=3306,                # port number (3306 is the MySQL default)
    user="User name",
    password="Password",
    charset="Character set",
    database="Database name"
)

cur = connection.cursor()   # create a cursor

# SQL statement to be executed
sql = """ select * from Student """

# execute the SQL
cur.execute(sql)

3. Return the result of execution

data = []

for i in cur.fetchall():
    data.append(i)   # append each result row to the list

data

4. Create a DataFrame

df8 = pd.DataFrame(data, columns=["Student id", "Name", "Date of Birth", "Gender"])   # specify the name of each column
df8
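The cursor, fetch, and DataFrame steps can also be collapsed into a single call to pd.read_sql_query. The sketch below is not the article's setup: it swaps in Python's built-in sqlite3 with an in-memory database so that it runs without a MySQL server, and the table layout and rows are made up. For MySQL, the pandas documentation describes SQLAlchemy connectables as the supported connection type.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database as a stand-in for the MySQL server (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (id TEXT, name TEXT, birth TEXT, sex TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?, ?)",
    [("01", "Zhao", "1990-01-01", "Male"),      # made-up rows
     ("02", "Qian", "1990-12-21", "Female")],
)
conn.commit()

# read_sql_query runs the query and builds the DataFrame, column names included
df_sql = pd.read_sql_query("select * from Student", conn)
conn.close()

print(df_sql)
```

With this route the column names come from the table itself, so there is no need to pass columns by hand.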

Create from a Python dictionary

1. Create a dictionary containing lists

# 1. A dictionary containing lists

dic1 = {"name": ["Xiao Ming", "Little red", "Note"],
        "age": [20, 18, 27],
        "sex": ["Male", "Female", "Male"]
       }
dic1

df9 = pd.DataFrame(dic1, index=[0, 1, 2])
df9

2. Create a nested dictionary within a dictionary

# A dictionary nested inside a dictionary

dic2 = {'number': {'apple': 3, 'pear': 2, 'strawberry': 5},
        'price': {'apple': 10, 'pear': 9, 'strawberry': 8},
        'origin': {'apple': 'Shaanxi', 'pear': 'Shandong', 'strawberry': 'Guangdong'}
       }

dic2

# the result
{'number': {'apple': 3, 'pear': 2, 'strawberry': 5},
 'price': {'apple': 10, 'pear': 9, 'strawberry': 8},
 'origin': {'apple': 'Shaanxi', 'pear': 'Shandong', 'strawberry': 'Guangdong'}}

The result is:
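The call that turns dic2 into a DataFrame is not shown above. A minimal sketch, assuming it is a plain pd.DataFrame(dic2) call (the name df_nested is chosen here for illustration):

```python
import pandas as pd

dic2 = {
    "number": {"apple": 3, "pear": 2, "strawberry": 5},
    "price": {"apple": 10, "pear": 9, "strawberry": 8},
    "origin": {"apple": "Shaanxi", "pear": "Shandong", "strawberry": "Guangdong"},
}

# Outer keys become the column names, inner keys become the row index
df_nested = pd.DataFrame(dic2)

print(df_nested.columns.tolist())      # ['number', 'price', 'origin']
print(df_nested.index.tolist())        # ['apple', 'pear', 'strawberry']
print(df_nested.loc["pear", "price"])  # 9
```

This differs from the dictionary of lists, where the row index has to be supplied separately or defaults to 0, 1, 2.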

Create from a Python list

1. Use the default row index

lst = ["Xiao Ming", "Little red", "Chou", "Note"]
df10 = pd.DataFrame(lst, columns=["Name"])
df10

You can modify the index:

lst = ["Xiao Ming", "Little red", "Chou", "Note"]

df10 = pd.DataFrame(
  lst,
  columns=["Name"],
  index=["a", "b", "c", "d"]   # modify the index
)

df10

3. Nested lists within lists

# Nested list form

lst = [["Xiao Ming", "20", "Male"],
       ["Little red", "23", "Female"],
       ["Chou", "19", "Male"],
       ["Note", "28", "Male"]
      ]

df11 = pd.DataFrame(lst, columns=["Name", "Age", "Gender"])
df11

Create from a Python tuple

Tuples are used in much the same way as lists: they can be single-level or nested.

1. Create from a single-level tuple

# Single-level tuple

tup = ("Xiao Ming", "Little red", "Chou", "Note")
df12 = pd.DataFrame(tup, columns=["Name"])

df12

2. Tuple nesting

# Nested tuples

tup = (("Xiao Ming", "20", "Male"),
       ("Little red", "23", "Female"),
       ("Chou", "19", "Male"),
       ("Note", "28", "Male")
      )

df13 = pd.DataFrame(tup, columns=["Name", "Age", "Gender"])
df13

Create from Series data

A DataFrame is a two-dimensional data structure that combines several Series as columns. Each column is a Series, so we can create a DataFrame directly from Series data.

series = {'fruit': Series(['apple', 'pear', 'strawberry']),
          'number': Series([60, 50, 100]),
          'price': Series([7, 5, 18])
         }

df15 = pd.DataFrame(series)
df15
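One detail worth knowing when building from Series: the columns are matched by index label, not by position. The sketch below uses made-up values and illustrative names (s_number, s_price, df_aligned) to show what happens when the labels only partly overlap.

```python
import pandas as pd

# Two Series whose labels only partly overlap (values are made up)
s_number = pd.Series([60, 50], index=["apple", "pear"])
s_price = pd.Series([7, 18], index=["apple", "strawberry"])

# The row index is the union of the labels; missing entries become NaN
df_aligned = pd.DataFrame({"number": s_number, "price": s_price})

print(df_aligned)
print(df_aligned.isna().sum().sum())  # 2 missing cells
```

In the article's example every Series uses the default index 0, 1, 2, so the labels line up and no NaN values appear.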

Create from NumPy arrays

1. Create with NumPy functions

# 1. Arrays generated with NumPy

data1 = {
    "one": np.arange(4, 10),  # generates 6 values
    "two": range(100, 106),
    "three": range(20, 26)
}

df16 = pd.DataFrame(
  data1,
  index=['A', 'B', 'C', 'D', 'E', 'F']   # the index length must match the data length
)

df16

2. Create directly from a NumPy array

Create the NumPy array:

# reshape() changes the shape of the array
data2 = np.array(["Xiao Ming", "Guangzhou", 175,
                  "Little red", "Shenzhen", 165,
                  "Chou", "Beijing", 170,
                  "Note", "Shanghai", 180]).reshape(4, 3)

data2

df17 = pd.DataFrame(
  data2,   # the data passed in
  columns=["Name", "Birthplace", "Height"],  # column names
  index=[0, 1, 2, 3]  # row index
)

df17
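A caveat with this route: a NumPy array holds a single data type, so mixing names and numbers turns every element into a string, and the Height column arrives as text. The following sketch uses a shortened, made-up version of the array (arr and df_arr are illustrative names) and shows one way to convert the column back to integers.

```python
import numpy as np
import pandas as pd

# Mixing strings and integers: NumPy converts every element to a string
arr = np.array(["Xiao Ming", "Guangzhou", 175,
                "Chou", "Beijing", 170]).reshape(2, 3)
print(arr.dtype.kind)  # U, i.e. a unicode string array

df_arr = pd.DataFrame(arr, columns=["Name", "Birthplace", "Height"])

# Height is text at this point, so convert it before doing arithmetic
df_arr["Height"] = df_arr["Height"].astype(int)
print(df_arr["Height"].mean())  # 172.5
```

When the columns have different types, a dictionary of lists or a nested list usually preserves them without this extra conversion.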

3. Use the random functions in NumPy

# 3. Generate data with NumPy's random functions

# Create 4 lists for name, subject, semester and class
name_list = ["Xiao Ming", "Little red", "Note", "Chou", "Zhang"]
subject_list = ["Chinese", "Mathematics", "English", "Creatures", "Physical", "Geography", "Chemistry", "Sports"]
semester_list = ["On", "Under"]
class_list = [1, 2, 3]

# Generate 40 scores: integers from 50 up to (but not including) 100
score_list = np.random.randint(50, 100, 40).tolist()   # draw 40 integers in the range 50 to 99

40 randomly generated scores:

Use the choice method of NumPy's random module to generate the other columns randomly:

df18 = pd.DataFrame({
    "name": np.random.choice(name_list,40,replace=True),   # replace=True (the default) samples with replacement, so values can repeat
    "subject": np.random.choice(subject_list,40),
    "semester": np.random.choice(semester_list,40),
    "class":np.random.choice(class_list,40),
    "score": score_list
})

df18
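Random data like this changes on every run. If a reproducible table is needed, one option is NumPy's Generator interface with a fixed seed. This is a sketch of an alternative, not the approach used above; the seed value, the shortened name list, and the name df_random are all arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

name_list = ["Xiao Ming", "Little red", "Note"]

df_random = pd.DataFrame({
    "name": rng.choice(name_list, size=10),    # sampled with replacement by default
    "score": rng.integers(50, 100, size=10),   # integers from 50 to 99
})

print(df_random.shape)  # (10, 2)
```

Re-running the script with the same seed reproduces the same rows, which makes examples and tests easier to compare.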

Use the constructor from_dict

Pandas has a dictionary-related constructor: DataFrame.from_dict.

It accepts a dictionary of dictionaries or a dictionary of array-like sequences and generates a DataFrame. It works like the DataFrame constructor, with an additional orient parameter that defaults to 'columns'. Set orient to 'index' to use the dictionary keys as row labels.

df19 = pd.DataFrame.from_dict(dict([('name', ['Ming', 'little red', 'little weeks']),
                                    ('height', [178, 165, 196]),
                                    ('gender', ['male', 'woman', 'male']),
                                    ('Birthplace', ['shenzhen', 'Shanghai', 'Beijing'])
                                   ])
                             )

df19

You can also control the row index and the column names with arguments:

df20 = pd.DataFrame.from_dict(dict([('name', ['Ming', 'little red', 'little weeks']),
                                    ('height', [178, 165, 196]),
                                    ('gender', ['male', 'woman', 'male']),
                                    ('Birthplace', ['shenzhen', 'Shanghai', 'Beijing'])
                                   ]),
                              orient='index',  # use the dictionary keys as the row index
                              columns=['one', 'two', 'three']  # specify the column names
                             )

df20

Use the constructor from_records

Pandas has another constructor that accepts a list of tuples, a list of dictionaries, or an array with a structured data type (dtype): from_records.

data3 = [{'height': 173, 'name': 'Joe', 'gender': 'male'},
         {'height': 182, 'name': 'bill', 'gender': 'male'},
         {'height': 165, 'name': 'Cathy', 'gender': 'woman'},
         {'height': 170, 'name': 'Ming', 'gender': 'woman'}]

df21 = pd.DataFrame.from_records(data3)

df21

You can also pass in structured data nested in tuples in a list:

data4 = [(173, 'Ming', 'male'),
         (182, 'little red', 'woman'),
         (161, 'little weeks', 'woman'),
         (170, 'jack', 'male')
        ]

df22 = pd.DataFrame.from_records(data4,
                                 columns=['height', 'name', 'gender']
                                )

df22
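The structured dtype case mentioned earlier is not demonstrated above. A minimal sketch with made-up values (the names records and df_struct, and the field types, are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd

# Structured array: every record has named, typed fields (values are made up)
records = np.array(
    [(173, "Ming", "male"), (165, "Cathy", "female")],
    dtype=[("height", "i4"), ("name", "U10"), ("gender", "U10")],
)

# The field names of the dtype become the column names
df_struct = pd.DataFrame.from_records(records)

print(df_struct.columns.tolist())  # ['height', 'name', 'gender']
print(df_struct)
```

Because the names are carried by the dtype, no columns argument is needed here, unlike the list-of-tuples example.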

Conclusion

A DataFrame is a two-dimensional data structure in Pandas, where data is arranged in a table of rows and columns, similar to an Excel sheet, a SQL table, or a dictionary of Series objects. It is the structure used most often in Pandas and is itself a combination of Series data.

This article describes 10 different ways to create a DataFrame, most commonly by reading a file and then processing and analyzing the data frames. I hope this article will help readers to master the creation of DataFrame.

A preview of the next article: How do I find the data we need in a DataFrame?