This is the 16th day of my participation in the November Gwen Challenge. Check out the event details: The last Gwen Challenge 2021
One, preface
Reference book for this series of study notes: Data Analysis in Action, by Tomaz Joubas. I will share my notes from studying this book with you as a series called Data Analysis in Action from Scratch.
Two, knowledge point summary
1. Create a Python virtual environment for this series;
2. Install pandas, the common data analysis module;
3. Read and write CSV files with pandas.
Three, start using your head
1. Create a virtual environment
I usually like PyCharm, so I plan to use PyCharm throughout this series. PyCharm can be downloaded directly from the official website; the Community edition is sufficient.
I used to like PyCharm, but now I prefer VS Code. Jupyter Notebook works as well.
# Windows/Mac installation, using the Python environment + Jupyter Notebook
(1) To start, open PyCharm and click File -> New Project. See the following figure for the basic configuration.
Special note: for Python projects, avoid Chinese characters in both the project path and the project name. The name should summarize the project content as far as possible.
(2) After the project is created, we will find the new project files and virtual environment files in the corresponding directory.
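For readers who are not using PyCharm, a comparable environment can be created with Python's standard-library venv module. This is only a minimal sketch, not the PyCharm workflow described above; the directory name Data_analysis_demo and the temporary location are assumptions chosen for illustration.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical location: a throwaway directory, used only for this demonstration.
base_dir = Path(tempfile.mkdtemp())
env_dir = base_dir / "Data_analysis_demo"

# with_pip=False keeps creation fast and offline; for real work use
# with_pip=True so that pip is available inside the environment.
venv.create(env_dir, with_pip=False)

# Every environment created by venv contains a pyvenv.cfg marker file.
cfg_file = env_dir / "pyvenv.cfg"
print("pyvenv.cfg present:", cfg_file.exists())

# The executables live in Scripts on Windows and in bin elsewhere.
scripts_dir = env_dir / ("Scripts" if (env_dir / "Scripts").exists() else "bin")
print("script directory:", scripts_dir.name)
```

The Scripts directory printed on Windows is the same kind of directory that holds the activate script used in the next step.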
2. Install the common data analysis module pandas
(1) This is a tutorial for complete beginners, so first let me show how to enter the virtual environment: go to the directory I:\pyCoding\Frame\Data_analysis\Scripts (my virtual environment directory), hold down Shift and right-click, open PowerShell or CMD (in PowerShell, type cmd first), and type activate. Once you are in the virtual environment, its name appears in parentheses in front of the path. See below:
PS I:\pyCoding\Frame\Data_analysis\Scripts> cmd
Microsoft Windows [Version 10.0.17134.112]
(c) 2018 Microsoft Corporation. All rights reserved.
I:\pyCoding\Frame\Data_analysis\Scripts>activate
(Data_analysis) I:\pyCoding\Frame\Data_analysis\Scripts>
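Besides looking at the prompt, the activation can also be confirmed from inside Python. The sketch below assumes Python 3.3 or later and an environment created by venv or a recent virtualenv; the helper name in_virtual_env is made up for this example.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtual_env())
print("interpreter prefix:", sys.prefix)
```

Run it once before and once after activate: the prefix should change to the environment directory.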
I don't know whether you find this tedious, but I certainly do: every time you want to enter the virtual environment you have to go to a specific path and then type a command. That is not the programmer's way!
[Note] I used to manage virtual environments with virtualenvwrapper; now I prefer Pipenv.
(2) Install the pandas module
Enter the virtual environment with the shortcut command, then install directly with pip.
# Run directly in CMD
C:\Users\82055>workon
Pass a name to activate one of the following virtualenvs:
==============================================================================
Data_analysis
spiderenv
C:\Users\82055>workon Data_analysis
(Data_analysis) C:\Users\82055>pip install pandas
Installation result:
The installation takes about a minute. When it finishes, the following is displayed:
Installing collected packages: pytz, numpy, six, python-dateutil, pandas
Successfully installed numpy-1.15.4 pandas-0.23.4 python-dateutil-2.7.5 pytz-2018.7 six-1.11.0
Clearly, this process installs not only the pandas package but also the additional packages it depends on (numpy, pytz, six and python-dateutil), which we will use later as well.
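To check what actually ended up in the environment, the installed versions can be queried from Python. This is a small sketch assuming Python 3.8 or later (for importlib.metadata); the helper installed_version is a hypothetical name, and the set of dependencies pulled in varies between pandas releases, so some entries may report "not installed".

```python
from importlib import metadata

# The packages named in the pip output above.
packages = ["pandas", "numpy", "pytz", "six", "python-dateutil"]


def installed_version(name: str) -> str:
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


versions = {name: installed_version(name) for name in packages}
for name, version in versions.items():
    print(f"{name}: {version}")
```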
3. Read and write CSV files using the Pandas module
(1) Data file download
Click here to download. All the data the book uses in this series is there, and the source code for Data Analysis in Action is in the same repository. Later I will set up my own repository to record my learning process; for now, download the data files from here.
(2) Pandas
pandas is a high-performance, easy-to-use data structure and data analysis library for the Python programming language, built on NumPy. It provides high-level data structures (e.g. DataFrame) and the tools needed to manipulate large data sets efficiently, along with a wealth of functions and methods that let us process data quickly and conveniently.
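To make the DataFrame idea concrete, here is a minimal sketch. The values are borrowed from the write example later in this article; the variable names are arbitrary.

```python
import pandas as pd

# A DataFrame is a table: each dictionary key becomes a column.
df = pd.DataFrame({
    "Site name": ["Beijing North", "Beijing East", "Beijing"],
    "Code": ["VAP", "BOP", "BJP"],
})

print(df.shape)              # (number of rows, number of columns)
print(list(df.columns))      # column labels

codes = df["Code"]           # selecting one column yields a Series
print(type(codes).__name__)

# Boolean filtering keeps only the rows where the condition holds.
bjp_rows = df[df["Code"] == "BJP"]
print(bjp_rows["Site name"].tolist())
```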
(3) Use pandas to read the CSV file
Read code:
# Import the data processing modules
import os
import pandas as pd
# Get the current working directory (run the script from the folder that contains data01)
father_path = os.getcwd()
# Path to the original data file
rpath_csv = os.path.join(father_path, 'data01', 'city_station.csv')
# Read the data
csv_read = pd.read_csv(rpath_csv)
# Display the first 10 rows of data
print(csv_read.head(10))
Running results:
Function analysis:
read_csv(filepath_or_buffer, sep, header, names, skiprows, na_values, encoding, nrows) reads a CSV file in the specified format.
Common parameter analysis:
- filepath_or_buffer: the file path (or a file-like object).
- sep: string specifying the separator; the default is ','.
- header: integer, the row number to use as the column names (comment lines are ignored). If names is not passed, the behavior is header=0; if names is passed, the behavior is header=None.
- names: list of column names to use. If the file contains no header row, pass header=None explicitly.
- skiprows: list of row numbers to skip (counting from 0), or an integer number of rows to skip from the start of the file; skipped rows are not read.
- na_values: list of additional values to treat as NaN. pandas represents missing values as NaN, so this parameter is useful for handling placeholder and invalid values.
- encoding: string, the text encoding of the file, for example "utf-8" or "gbk".
- nrows: integer, the number of rows to read.
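The parameters above can be tried without any file on disk. The sketch below feeds read_csv an in-memory buffer; the sample text (an export note on the first line, ';' as separator, '-' as a placeholder for a missing code) is invented purely for illustration. The encoding parameter is left out because it only matters when reading bytes from a real file.

```python
import io

import pandas as pd

# Hypothetical raw data: a note on line 0, no header row,
# ';' between fields and '-' standing in for a missing value.
raw = io.StringIO(
    "exported by some tool\n"
    "Beijing North;VAP\n"
    "Beijing East;-\n"
    "Beijing;BJP\n"
    "Beijing South;VNP\n"
)

df = pd.read_csv(
    raw,
    sep=";",                       # field separator
    header=None,                   # the data has no header row
    names=["Site name", "Code"],   # so the column names are supplied here
    skiprows=[0],                  # skip line 0 (the export note)
    na_values=["-"],               # treat '-' as NaN
    nrows=3,                       # read only the first 3 data rows
)
print(df)
```

Only three rows are read, and the '-' in the second row arrives as NaN rather than as text.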
(4) Write to the CSV file using pandas
Write to a CSV file:
import os
import pandas as pd
# Get the current working directory (run the script from the folder that contains data01)
father_path = os.getcwd()
# Path where the data file will be saved
path_csv = os.path.join(father_path, 'data01', 'temp_city.csv')
# Data to write (column name + column values)
data = {"Site name": ["Beijing North", "Beijing East", "Beijing", "Beijing South", "Beijing West"], "Code": ["VAP", "BOP", "BJP", "VNP", "BXP"]}
# Initialize the data as a DataFrame object
df = pd.DataFrame(data)
# Write the data
df.to_csv(path_csv)
Running results:
Function analysis:
to_csv(path_or_buf,sep,na_rep,columns,header,index)
- path_or_buf: string such as a file name or an absolute or relative path, or a file object (stream).
- sep: string, the field separator.
- na_rep: string used to represent NaN values in the output.
- columns: list, the subset of columns to write.
- header: whether to write the column names; header=False omits them. The default is True.
- index: whether to write the row index; index=False omits it. The default is True.
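The effect of these parameters is easiest to see by writing to an in-memory buffer instead of a file. The following is a sketch under that assumption; the missing code in the second row and the "UNKNOWN" replacement are made up for the demonstration.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Site name": ["Beijing North", "Beijing East"],
    "Code": ["VAP", None],          # None is treated as a missing value
})

# With no path given, to_csv returns the CSV text; the default index=True
# adds an unnamed leading column holding the row index.
default_text = df.to_csv()
print(default_text)

buffer = io.StringIO()
df.to_csv(
    buffer,
    sep=";",                         # field separator
    na_rep="UNKNOWN",                # text written in place of missing values
    columns=["Code", "Site name"],   # which columns to write, and in what order
    header=True,                     # keep the column names
    index=False,                     # do not write the row index
)
text = buffer.getvalue()
print(text)
```

The leading index column is why a file written with the defaults gains an extra unnamed column when it is read back; index=False avoids that.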
Four, conclusion
Persistence and hard work pay off in the end.
The idea is very complicated,
The implementation is interesting,
As long as you don’t give up,
Fame will come.
— Old Watch doggerel
See you next time. I’m a cat lover and a tech lover. If you find this article helpful, please like, comment and follow me!