Pandas file reading: start from the official documentation
Everything required for reading files with pandas can be found in the official documentation:
Pandas.pydata.org/pandas-docs…
The reading and writing functions are documented under the “Input/Output” (IO) section of the pandas docs.
Pandas reads the TXT file
Before reading a TXT file, check whether it follows a basic format, that is, which delimiter it uses: \t, commas, spaces, or some other special separator.
A typical TXT file looks like the following example, where the columns are separated by spaces:
```
...
3 2019-03-22 00:06:32.6835965 ashshsh
4 2017-03-22 00:06:32.8041945 eggg
```
The file can be read with either read_csv or read_table:
```python
import pandas as pd
df = pd.read_table("./test.txt")
print(df)
```

```python
import pandas as pd
df = pd.read_csv("./test.txt")
print(df)
```
Note, however, that this is a DataFrame of three rows and one column, rather than the three rows and four columns required
```python
import pandas as pd
df = pd.read_csv("./test.txt")
print(type(df))
print(df.shape)
```

```
<class 'pandas.core.frame.DataFrame'>
(3, 1)
```
Read_csv function
read_csv loads delimited data from a file, URL, or file-like object; the default delimiter is a comma.
The TXT file above is not comma-separated, so the sep parameter needs to be passed when reading it:
df = pd.read_csv("./test.txt",sep=' ')
Copy the code
Source: github.com/pandas-dev/… The table below describes each parameter, with worked cases for the most important ones.
Parameter | Description |
---|---|
filepath_or_buffer | Path to the file, or a URL. Supported URL schemes include http, ftp, s3 and file. Local file example: file://localhost/path/to/table.csv |
sep | str, default ','. The delimiter. If not specified, comma separation is attempted. Delimiters longer than one character (other than '\s+') are treated as regular expressions, force the Python parser, and ignore quoted commas in the data. Regex example: '\r\t' |
delimiter | Alternative to sep (if specified, sep is ignored); rarely used |
delim_whitespace | True or False, default False. Use whitespace as the delimiter, equivalent to sep='\s+'. If this argument is set, the delimiter parameter no longer takes effect |
header | Row number to use as the column names (rows before it are skipped). Defaults to header=0 when no names are passed; set header=None if the file has no header row |
names | List of column names to use. If the file contains no header row, also pass header=None explicitly. (header may itself be a list of ints, e.g. [0,1,3], for multi-level columns; rows in between are skipped, e.g. row 2 in that example) |
index_col (Case 1) | Default None. Column(s) to use as the row labels of the DataFrame; a MultiIndex is used if a list is given. If the file has a delimiter at the end of every line, pass index_col=False so pandas does not use the first column as the row index |
usecols | Default None. Return a subset of the columns, given as positions or names, e.g. [0, 1, 2] or ['foo', 'bar', 'baz']; speeds up loading and reduces memory use |
squeeze | Default False. If True and the parsed data contains only one column, return a Series |
prefix | Prefix for auto-generated column numbers, e.g. 'X' gives X0, X1, … Only valid when header=None or no header is set |
mangle_dupe_cols | Default True: duplicate column names are renamed 'X.0'…'X.N' instead of all being 'X'. If False, columns with duplicate names overwrite each other's data |
dtype | Data type per column, e.g. {'a': np.float64, 'b': np.int32}, where a and b are column names |
engine | Parser engine, 'c' or 'python'. The C engine is faster, the Python engine is more feature-complete |
converters (Case 2) | Dict of functions to apply to specified columns, keyed by position or column name |
true_values / false_values | Values to treat as True/False. (No practical use case found yet; to be filled in later) |
skipinitialspace | Skip spaces after the delimiter. Default False |
skiprows | Default None. Number of lines to skip at the start of the file, or a list of line numbers (0-based) to skip |
skipfooter | Number of lines to skip at the end of the file (not supported by the C engine) |
nrows | Number of rows to read, counted from the start of the file |
na_values | Extra strings to recognize as NaN. By default '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan' and 'null' are all interpreted as NaN |
keep_default_na | If na_values is specified and keep_default_na=False, the default NaN strings are replaced; otherwise the specified values are added to them |
na_filter | Whether to detect missing values (empty strings or na_values). For large files with no missing values, na_filter=False speeds up reading |
verbose | Print extra parser output, e.g. the number of missing values in non-numeric columns |
skip_blank_lines | If True, blank lines are skipped; otherwise they become rows of NaN |
parse_dates | List of ints or names, e.g. [1,2,3] -> parse columns 1, 2 and 3 each as a separate date column; dict, e.g. {'foo': [1,3]} -> combine columns 1 and 3 and name the merged column 'foo' |
infer_datetime_format | If True and parse_dates is enabled, pandas tries to infer the datetime format and, if it finds one, switches to a faster parsing method, in some cases 5 to 10 times faster |
keep_date_col | If date parsing combines multiple columns, keep the original columns as well. Default False |
date_parser | Function used to parse dates, dateutil.parser.parser by default. Pandas tries three different calling conventions and falls back to the next one if a call raises: 1. with one or more arrays (specified by parse_dates) as arguments; 2. with the string values of the parse_dates columns concatenated into a single array; 3. once per row, with one or more strings (specified by parse_dates) as arguments |
dayfirst | Parse dates as DD/MM (international format) |
iterator | Return a TextFileReader object to process the file chunk by chunk |
chunksize | Number of rows per chunk when iterating |
compression | Read compressed files directly from disk. With 'infer' (the default), files ending in '.gz', '.bz2', '.zip' or '.xz' are decompressed with gzip, bz2, zip or xz; a zip archive must contain exactly one file. Set to None for no decompression. (zip and xz support was added in version 0.18.1) |
thousands | Thousands separator, e.g. ',' |
decimal | Character recognized as the decimal point, default '.' |
lineterminator | Line terminator character; only valid with the C parser |
quotechar | The quote character marking the start and end of a quoted item; delimiters inside quotes are ignored |
quoting | Controls field quoting: QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3) |
doublequote | When quotechar is specified and quoting is not QUOTE_NONE, controls whether two consecutive quotechar characters inside a field are interpreted as a single quotechar |
escapechar | Character used to escape the delimiter when quoting is QUOTE_NONE |
comment | Character indicating that the rest of the line should not be parsed. If found at the beginning of a line, the whole line is ignored. Must be a single character. Like blank lines (when skip_blank_lines=True), fully commented lines are ignored by header and skiprows. For example, with comment='#' and header=0, parsing '#empty\na,b,c\n1,2,3' returns 'a,b,c' as the header |
encoding | Character encoding, e.g. 'utf-8' |
dialect | A csv dialect; if set, related parsing parameters are taken from it. See the csv.Dialect documentation for details |
error_bad_lines | By default, rows with too many fields raise an error and no DataFrame is returned; if set to False, these "bad lines" are dropped instead (C parser only) |
warn_bad_lines | If error_bad_lines=False and warn_bad_lines=True, a warning is printed for each "bad line" (C parser only) |
low_memory | Process the file internally in chunks to lower memory use while parsing, which can result in mixed-type columns. Set it to False, or specify the dtype parameter, to avoid this; alternatively read the file in chunks via iterator or chunksize and concatenate afterwards (C parser only) |
delim_whitespace | New in version 0.18.1: also supported by the Python parser |
memory_map | If a filepath is given for filepath_or_buffer, map the file object directly into memory and read the data from there, avoiding repeated I/O overhead |
float_precision | Which converter the C engine should use for floating-point values |
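To make the table concrete, here is a small sketch combining several of the parameters above; the file name test.csv and the column names a and b are assumptions for illustration only:

```python
import pandas as pd

# "test.csv" and its column names are hypothetical; adjust to your own file.
df = pd.read_csv(
    "./test.csv",
    sep=",",                     # delimiter
    header=0,                    # the first row holds the column names
    usecols=["a", "b"],          # load only these columns to save memory
    dtype={"a": "float64"},      # force a column type
    nrows=100,                   # read at most 100 rows
    na_values=["-", "missing"],  # extra strings to treat as NaN
    encoding="utf-8",
)
print(df.head())
```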
Part of the above is adapted from the blog www.cnblogs.com/datablog/p/… — thanks for the translation, O(∩_∩)O haha ~
Case 1
Using index_col
Start with a TXT file whose biggest problem is that every line ends with a ','. A trailing delimiter on each line causes trouble when reading; testing shows you also need to pass the names parameter.
```
github,1,2,3,ddd,
u,1,3,4,asd,
as,df,12,33,
```
Write the following code
df = pd.read_csv("./demo.txt",header=None,names=['a','b','c','d','e'])
print(df)
df = pd.read_csv("./demo.txt",header=None,index_col=False,names=['a','b','c','d','e'])
print(df)
Copy the code
In practice this particular behaviour is not very significant, and the documentation is not entirely clear about its exact purpose. A more common use of index_col is the following.
When reading a file, if index_col is not set, an integer index starting from 0 is used by default. If you then operate on the rows or columns of the table and save it back to a file, you will find an extra unnamed column of 0-based integers when you read it again. Setting index_col when reading avoids this problem.
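A minimal sketch of the round trip described above (the file name out.csv is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_csv("out.csv")                         # the integer index is written as the first column
print(pd.read_csv("out.csv"))                # comes back with an extra "Unnamed: 0" column
print(pd.read_csv("out.csv", index_col=0))   # the first column is used as the row index again
```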
Case 2
converters sets a handler function for specified columns; columns can be referred to either by position or by name.
```python
import pandas as pd

def fun(x):
    return str(x) + "-haha"

df = pd.read_csv("./test.txt", sep=' ', header=None, index_col=0, converters={3: fun})
print(type(df))
print(df.shape)
print(df)
```
Common problems with the read_csv function
- In some IDEs, read_csv raises an error when the file path or file name contains Chinese characters (a combined sketch of these fixes follows this list).
The solution
```python
import pandas as pd
# df = pd.read_csv('F:/测试/data.txt')   # may fail when the path contains Chinese characters
f = open('F:/测试/data.txt')
df = pd.read_csv(f)
```
- Use the skiprows parameter to exclude rows. Note that skiprows=3 skips the first 3 lines, while skiprows=[3] skips only the line with index 3.
- For irregular delimiters, use a regular expression. If the file is separated by single spaces, sep=" " works; but when fields may be separated by one or more spaces, sep=" " produces strange results because the extra spaces are treated as data. Use data = pd.read_csv("data.txt", sep="\s+") instead.
- If a Chinese-encoding error occurs when reading the file, set the encoding parameter.
- To add row and column indexes, use names for the column names and index_col for the row index.
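A combined sketch of the fixes above; the paths, file names and column names are assumptions for illustration:

```python
import pandas as pd

# Chinese characters in the path: open the file first, then pass the handle to read_csv
f = open('F:/测试/data.txt', encoding='utf-8')   # hypothetical path
df = pd.read_csv(f)

# skiprows: skip the first 3 lines vs. skip only the line with index 3
df1 = pd.read_csv("data.txt", skiprows=3)
df2 = pd.read_csv("data.txt", skiprows=[3])

# Irregular whitespace delimiters: use a regular expression
df3 = pd.read_csv("data.txt", sep=r"\s+")

# Chinese content: set the encoding explicitly
df4 = pd.read_csv("data.txt", encoding="utf-8")

# Add column names with names and a row index with index_col
df5 = pd.read_csv("data.txt", names=["a", "b", "c"], index_col=0)
```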
read_csv has a large number of parameters, but most of them are rarely needed, since most files you download come in standard formats.
Read_table function
The basic usage is the same as read_csv; only the default separator differs.
read_csv expects comma-separated values, so by default it only parses data split by ','; read_table uses '\t' (tab) as its default separator.
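A minimal sketch of the difference (test.txt is assumed here to be tab-separated):

```python
import pandas as pd

# read_csv defaults to sep=',', read_table defaults to sep='\t';
# with an explicit sep the two calls are interchangeable.
df1 = pd.read_table("./test.txt")           # tab-separated by default
df2 = pd.read_csv("./test.txt", sep="\t")   # equivalent to the line above
```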
Read_fwf function
Reads a file with fixed-width columns, for example a file like this:
```
id8141  360.242940  149.910199  11950.7
id1594  444.953632  166.985655  11788.4
id1849  364.136849  183.628767  11806.2
id1230  413.836124  184.375703  11916.8
id1948  502.953953  173.237159  12468.3
```
The read_fwf command has two additional parameters that can be set
colspecs:
A list of tuples describing the half-open intervals [from, to) of each field; by default the positions are inferred from the first 100 rows.
Example:
```python
import pandas as pd

colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
df = pd.read_fwf('demo.txt', colspecs=colspecs, header=None, index_col=0)
```
widths:
Instead of colspecs, you can pass a list of field widths directly:
```python
widths = [6, 14, 13, 10]
df = pd.read_fwf('demo.txt', widths=widths, header=None)
```
read_fwf is not used very often; more details can be found at pandas.pydata.org/pandas-docs…
Read_msgpack function
msgpack is a serialization format newly supported by pandas: a lightweight, portable binary format similar to binary JSON that is space-efficient and offers excellent performance both when writing (serializing) and reading (deserializing).
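A minimal sketch, assuming an older pandas release: to_msgpack/read_msgpack were deprecated in pandas 0.25 and removed in later versions, so this only runs on pandas versions that still ship them.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_msgpack("frame.msg")               # serialize (older pandas only; API later removed)
restored = pd.read_msgpack("frame.msg")  # deserialize
print(restored)
```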
Read_clipboard function
Reads data from the clipboard; it can be thought of as the clipboard version of read_table. It is handy for turning a web-page table into data.
This function has the following bug:
module 'pandas' has no attribute 'compat'
Updating pandas made it work normally again.
Another pitfall: when reading the clipboard, copied Chinese text very often fails to be read.
The solution
- Open the site-packages\pandas\io\clipboard.py file in your installation and locate the clipboard text handling
- Add/adjust the line text = text.decode('utf-8')
- Save the file and it can be used normally
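A minimal usage sketch: copy a table (for example a block of cells from Excel, or an HTML table) before running it.

```python
import pandas as pd

# Reads whatever tabular text is currently on the clipboard;
# keyword arguments are passed through to read_csv.
df = pd.read_clipboard(sep=r"\s+")
print(df)
```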
Read_excel function
As always, the official documentation comes first: pandas.pydata.org/pandas-docs…
Parameter | Description |
---|---|
io | File path, file-like object, pandas ExcelFile, or xlrd workbook. The string may be a URL; supported schemes include http, ftp, s3 and file, e.g. a local file can be written as file://localhost/path/to/workbook.xlsx |
sheet_name | Default 0, i.e. the first sheet. An int or string returns a DataFrame; None or a list returns a dict of DataFrames. Sheet names are strings, index positions are integers |
header | Row to use for the column names, default 0 (the first row); the data starts below it. If the data contains no column names, set header=None |
names | List of column names to use |
index_col | Column(s) to use as the row index. If a list is passed, the columns are combined into a MultiIndex |
squeeze | If the parsed data contains only one column, return a Series |
dtype | Data type for the data or for individual columns; see read_csv |
engine | Must be set if io is not a buffer or path, to identify how to read it. Acceptable values are None or 'xlrd' |
converters | See read_csv |
Other parameters | Essentially the same as read_csv |
If an error occurs when pandas reads an Excel file, handle it as follows:
Error: ImportError: No module named 'xlrd'
pandas depends on xlrd to read Excel files, so run pip install xlrd.
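A short sketch of typical calls (the file name data.xlsx and the sheet name Sheet2 are assumptions):

```python
import pandas as pd  # the xlrd engine must be installed, as noted above

df = pd.read_excel("./data.xlsx")                                      # first sheet
df2 = pd.read_excel("./data.xlsx", sheet_name="Sheet2", index_col=0)  # a specific sheet
all_sheets = pd.read_excel("./data.xlsx", sheet_name=None)            # dict of {sheet name: DataFrame}
```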
Read_json function
Parameter | Description |
---|---|
path_or_buf | A valid JSON string or file path; default None. The string may be a URL, e.g. file://localhost/path/to/table.json |
orient (Case 1) | 1. 'split': dict like {index -> [index], columns -> [columns], data -> [values]}; 2. 'records': list like [{column -> value}, … , {column -> value}]; 3. 'index': dict like {index -> {column -> value}}; 4. 'columns': dict like {column -> {index -> value}}; 5. 'values': just the values array |
typ | Type of object to return ('series' or 'frame'); default 'frame' |
dtype | Data type for the data or for individual columns; see read_csv |
convert_axes | Boolean, try to convert the axes to the proper dtypes; default True |
convert_dates | List of columns to parse as dates; if True, try to parse date-like columns: columns ending in '_at' or '_time', beginning with 'timestamp', or named 'modified' or 'date' |
keep_default_dates | Boolean, default True. If parsing dates, also parse the default date-like columns |
numpy | Decode directly to NumPy arrays; default False. Only numeric data is supported, although labels may be non-numeric. Note that if numpy=True, the JSON ordering must be the same for every record |
precise_float | Boolean, default False. Enable the higher-precision (strtod) function when decoding strings to double values. The default (False) uses the fast but less precise built-in conversion |
date_unit | String, the timestamp unit used to detect and convert dates; default None, in which case the precision is detected. Pass 's', 'ms', 'us' or 'ns' to force seconds, milliseconds, microseconds or nanoseconds respectively |
encoding | The JSON encoding |
lines | Read the file as one JSON object per line |
If the JSON cannot be parsed, the parser raises one of ValueError, TypeError, or AssertionError.
Case 1
- orient='split'
```python
import pandas as pd

s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,5],[6,9]]}'
df = pd.read_json(s, orient='split')
```
- orient='records': the members form a list of dictionaries
```python
import pandas as pd

s = '[{"a":1,"b":2},{"a":3,"b":4}]'
df = pd.read_json(s, orient='records')
```
- orient='index': the index values are the outer keys, and each value is a dict of {column: value}. For example: s = '{"0":{"a":1,"b":2},"1":{"a":2,"b":4}}'
- orient='columns' and orient='values' work analogously: the columns are the outer keys, or the data is just the raw values array (see the sketch below)
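A minimal sketch of these two orientations, using made-up JSON strings:

```python
import pandas as pd

# orient='index': outer keys become the row index, inner dicts map column -> value
s = '{"0":{"a":1,"b":2},"1":{"a":2,"b":4}}'
df_index = pd.read_json(s, orient='index')

# orient='columns': outer keys become the column labels
s2 = '{"a":{"0":1,"1":2},"b":{"0":2,"1":4}}'
df_columns = pd.read_json(s2, orient='columns')
```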
Reference on GitHub: github.com/apachecn/pa…
A common read_json() bug
ValueError: Trailing data is raised when reading the JSON file,
because its original format is
{"a":1,"b":1},{"a":2,"b":2}
Copy the code
Adjust it to
[{"a":1,"b":1},{"a":2,"b":2}]
Copy the code
Or use the lines parameter and reshape the JSON to one object per line:
{"a":1,"b":1}
{"a":2,"b":2}
Copy the code
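A minimal sketch of reading the line-delimited form above (the file name demo.json is an assumption):

```python
import pandas as pd

# lines=True: each line of the file is parsed as a separate JSON object
df = pd.read_json("./demo.json", lines=True, encoding="utf-8")
print(df)
```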
If the JSON file contains Chinese characters, it is recommended to pass the encoding parameter with the value 'utf-8'; otherwise an error is raised.
Read_html function
Parameter | Description |
---|---|
io | A URL, file, or string. HTTPS URLs may fail; try removing the 's' and fetching over HTTP |
match | Regular expression; only tables whose text matches it are returned |
flavor | The parser to use, default 'lxml' |
header | Row(s) to use as the column header; a list produces a MultiIndex |
index_col | Column(s) to use as the row index; a list produces a MultiIndex |
skiprows | Number of rows to skip (int) or list of row indices to skip |
attrs | HTML attributes used to identify the table, e.g. attrs = {'id': 'table'} |
parse_dates | Parse dates |
How to tell whether it can be used: right-click the web page and check the source; if the data you want sits inside a <table> element, read_html can read it.
For example: data.stcn.com/2019/0304/1…
<table class="..." id="..." > <thead> <tr> <th>... </th> </tr> </thead> <tbody> <tr> <td>... </td> </tr> <tr>... </tr> </tbody> </table> < TABLE > : Define the table <thead> : Define the header of the table < TBody > : Define the body of the table <tr> : define the row of the table <th> : Define the header of the table < TD > : define the table cellsCopy the code
A common bug
Error: ImportError: html5lib not found, please install it
Just install html5lib, or pass the flavor parameter:
```python
import pandas as pd

df = pd.read_html("http://data.stcn.com/2019/0304/14899644.shtml", flavor='lxml')
```
For more information, see pandas.pydata.org/pandas-docs…
A short reference table of the pandas read and write functions
Read function | Write function | Notes |
---|---|---|
read_clipboard | to_clipboard | Read text from the clipboard and pass it to read_table |
read_csv | to_csv | Read a CSV (comma-separated) file into the DataFrame |
read_excel | to_excel | Excel spreadsheet |
read_sql | to_sql | |
read_pickle | to_pickle | |
read_json | to_json | |
read_msgpack | to_msgpack | |
read_stata | to_stata | |
read_gbq | to_gbq | Load data from Google BigQuery |
read_hdf | to_hdf | |
read_html | to_html | |
read_parquet | to_parquet | |
read_feather | to_feather |
```python
import pandas as pd

csvframe = pd.read_csv('pandas_data_test\myCSV_01.csv')
print(csvframe, "\n-----*-----")
csvframe1 = pd.read_table('pandas_data_test\myCSV_01.csv', sep=',')
print(csvframe1, "\n-----*-----")
csvframe2 = pd.read_csv('pandas_data_test\myCSV_02.csv', header=None)   # the file has no header row
print(csvframe2, "\n-----*-----")
csvframe20 = pd.read_csv('pandas_data_test\myCSV_02.csv', names=['white','red','blue','green','animal'])  # specify column names
print(csvframe20, "\n-----*-----")

csvframe30 = pd.read_csv('pandas_data_test\myCSV_03.csv')
print(csvframe30, "\n-----*-----")
csvframe31 = pd.read_csv('pandas_data_test\myCSV_03.csv', index_col=['color','status'])  # hierarchical index
print(csvframe31, "\n-----*-----")

txtframe4 = pd.read_table('pandas_data_test\ch05_04.txt', sep='\s+')  # parse according to a regular expression
print(txtframe4, "\n-----*-----")
txtframe5 = pd.read_table('pandas_data_test\ch05_05.txt', sep=r'\D+', header=None, engine='python')
print(txtframe5, "\n-----*-----")
# Use the skiprows option to exclude unnecessary lines; assign the row numbers to exclude as a list.
txtframe6 = pd.read_table('pandas_data_test\ch05_06.txt', sep=',', skiprows=[0,1,3,6])
print(txtframe6)
```

```
Out[1]:
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
5      4    4     2      1    mou
-----*-----
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
5      4    4     2      1    mou
-----*-----
   0  1  2  3      4
0  1  5  2  3    cat
1  2  7  8  5    dog
2  3  3  6  7  horse
3  2  2  8  3   duck
4  4  4  2  1  mouse
-----*-----
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
-----*-----
   color status  item1  item2  item3
0  black     up      3      4      6
1  black   down      2      6      7
2  white     up      5      5      5
3  white   down      3      3      2
4  white   left      1      2      1
5    red     up      2      2      2
6    red   down      1      1      4
-----*-----
              item1  item2  item3
color status
black up          3      4      6
      down        2      6      7
white up          5      5      5
      down        3      3      2
      left        1      2      1
red   up          2      2      2
      down        1      1      4
-----*-----
   white  red  blue  green
0      1    5     2      3
1      2    7     8      5
2      3    3     6      7
-----*-----
   0    1    2
0  0  123  122
1  1  124  321
2  2  125  333
-----*-----
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
```
Reading part of the data from a TXT file
```python
print(csvframe2, "\n-----*-----")
# nrows=2 reads the specified number of rows; skiprows=[2] skips the line with index 2
csvfram20 = pd.read_csv('pandas_data_test\myCSV_02.csv', skiprows=[2], nrows=2, header=None)
print(csvfram20)
```

```
Out[2]:
   0  1  2  3      4
0  1  5  2  3    cat
1  2  7  8  5    dog
2  3  3  6  7  horse
3  2  2  8  3   duck
4  4  4  2  1  mouse
-----*-----
   0  1  2  3    4
0  1  5  2  3  cat
1  2  7  8  5  dog
```
Another interesting and common operation is to slice the text to be parsed into chunks and then iterate over them, performing a particular operation on each chunk.
For example, sum one of the numeric columns chunk by chunk and insert the partial sums into a Series object. This small example is easy to follow and has no practical use by itself, but once you understand how it works you can apply the pattern to more complex situations.
```python
csvframe1 = pd.read_table('pandas_data_test\myCSV_01.csv', sep=',')
print(csvframe1, "\n-----*-----")
out = pd.Series()
pieces = pd.read_csv('pandas_data_test\myCSV_01.csv', chunksize=4)  # chunksize sets the number of rows per chunk
i = 0
for piece in pieces:
    print(piece['white'])
    out.at[i] = piece['white'].sum()
    i += 1
print(out, "\n-----*-----")
```

```
Out[3]:
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
5      4    4     2      1    mou
-----*-----
0    1
1    2
2    3
3    2
Name: white, dtype: int64
4    4
5    4
Name: white, dtype: int64
0    8
1    8
dtype: int64
```
Write data to a CSV file
```python
print(csvframe1)
print(csvframe1.to_csv('pandas_data_test\ch05_07.csv'))
# index=False and header=False suppress the default writing of the index and header
print(csvframe1.to_csv('pandas_data_test\ch05_07b.csv', index=False, header=False))
print(csvframe30.to_csv('pandas_data_test\ch05_08.csv'))
# Replace empty fields with a desired value using the na_rep option of to_csv()
print(csvframe30.to_csv('pandas_data_test\ch05_09.csv', na_rep="empty"))
```

```
Out[4]:
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
5      4    4     2      1    mou
None
None
None
None
```
Open the folder and you can see the corresponding files that were generated:
Reading and writing HTML files
```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(4).reshape(2, 2))
print(frame.to_html())
frame2 = pd.DataFrame(np.random.random((4, 4)), index=['white','black','red','blue'],
                      columns=['up','down','right','left'])
s = ['<HTML>']
s.append('<HEAD><TITLE>My DataFrame</TITLE></HEAD>')
s.append('<BODY>')
s.append(frame.to_html())
s.append('</BODY></HTML>')
html = ''.join(s)
html_file = open('pandas_data_test\myFrame.html', 'w')
html_file.write(html)
html_file.close()
web_frames = pd.read_html('pandas_data_test\myFrame.html')
print(web_frames[0])
ranking = pd.read_html('http://www.meccanismocomplesso.org/en/meccanismo-complesso-sito-2/classifica-punteggio/')
print(ranking[0][1:10])
```

```
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>3</td>
    </tr>
  </tbody>
</table>
   Unnamed: 0  0  1
0           0  0  1
1           1  2  3
   #  Nome             Exp   Livelli
1  2  admin            9029      NaN
2  3  BrunoOrsini      2124      NaN
3  4  Berserker         700      NaN
4  5  Dnocioni          543      NaN
5  6  albertosallusti   409      NaN
6  7  Jon               233      NaN
7  8  Mr.Y              180      NaN
8  9  michele sisinni   157      NaN
9  10 Selina            136      NaN
```
Read data from XML
Pandas has no I/O API function specifically designed for the XML format. This may seem surprising, because a great deal of structured data is stored as XML. It is not a problem, though, since Python offers many libraries (besides pandas) that read and write XML data. One of them, lxml, handles large files particularly well and stands out from the many similar libraries. This section describes how to use it to process XML files, how to combine it with pandas, and finally how to extract the required data from an XML file and convert it into a DataFrame object.
The XML source file is shown below
```python
from lxml import objectify

xml = objectify.parse('pandas_data_test\books.xml')
root = xml.getroot()                     # get the root node
print(root.Book.Author)
mes1 = root.Book.getchildren()
print("root.Book.getchildren() gets:\n", mes1)
mes2 = root.Book[1].getchildren()
print([child.tag for child in mes2])     # child tag names
print([child.text for child in mes2])    # child node contents
```

```
272103_1_EnRoss, Mark
root.Book.getchildren() gets:
 ...
['Author', 'Title', 'Genre', 'Price', 'PublishDate']
['272103_1_EnBracket, Barbara', 'XML for Dummies', 'Computer', '35.95', '2014-12-16']
```
Read and write Microsoft Excel files
read_excel() and to_excel() can read and write both .xls and .xlsx files.
Reading and writing JSON data: read_json() and to_json().

HDF5 format

So far we have looked at reading and writing data in text form. When analyzing large amounts of data, a binary format is usually preferable. Python has many tools for handling binary data, and the HDF5 library has been particularly successful in this area. HDF stands for Hierarchical Data Format. The HDF5 library is dedicated to reading and writing HDF5 files, a data structure made up of nodes that can store very large datasets. The library is developed in C and provides interfaces for Python, MATLAB and Java. Its rapid spread is due both to wide adoption by developers and to its efficiency, especially when storing large amounts of data in this format. Compared with other binary formats, HDF5 is simpler and supports real-time compression, exploiting repeated patterns in the data structure to compress files. Currently Python offers two ways to work with HDF5 data: PyTables and h5py. The two differ in several respects, and which one to choose depends largely on your needs: h5py provides a direct interface to the high-level HDF5 API, while PyTables wraps many HDF5 details and provides more flexible data containers, indexed tables, query capabilities and other computation-related features. Pandas also has a dict-like class called HDFStore that stores pandas objects using PyTables. Before using the HDF5 format, import the HDFStore class.
```python
import numpy as np
import pandas as pd
from pandas.io.pytables import HDFStore
# The tables (PyTables) package is also required; install it if it is missing.

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['white','black1','red','blue'],
                     columns=['up','down','right','left'])
store = HDFStore('pandas_data_test\mydata.h5')
store['obj1'] = frame
frame1 = pd.DataFrame(np.random.rand(16).reshape(4, 4),
                      index=['white','black1','red','blue'],
                      columns=['up','down','right','left'])
store['obj2'] = frame1
print(store['obj1'])
print(store['obj2'])
```

```
Out[7]:
        up  down  right  left
white    0     1      2     3
black1   4     5      6     7
red      8     9     10    11
blue    12    13     14    15
              up      down     right      left
white   0.251269  0.422823  0.619294  0.273534
black1  0.593960  0.353969  0.966026  0.104581
red     0.964577  0.625644  0.342923  0.638627
blue    0.246541  0.997952  0.414599  0.908750
Closing remaining open files: pandas_data_test\mydata.h5... done
```
Implement object serialization
The pickle module implements a powerful algorithm for serializing (pickling) and deserializing data structures implemented in Python.
Serialization converts the hierarchy of an object into a byte stream, which makes it easy to transfer, store and rebuild the object: the receiver can reconstruct it from the byte stream alone while keeping all of its original characteristics.
Pandas makes object serialization (and deserialization) very convenient: all the tools are ready to use and there is no need to import the cPickle module into the Python session, since every operation happens implicitly. Also note that pandas' serialization format is not pure ASCII.
```python
import pickle
import numpy as np
import pandas as pd

data = {'color': ['white', 'red'], 'value': [5, 7]}
pickled_data = pickle.dumps(data)
print(pickled_data)
nframe = pickle.loads(pickled_data)
print(nframe)

frame = pd.DataFrame(np.arange(16).reshape(4, 4), index=['up','down','left','right'])
frame.to_pickle('pandas_data_test\frame.pkl')          # pickle the DataFrame to a file
print(pd.read_pickle('pandas_data_test\frame.pkl'))    # read it back
```

```
b'\x80\x03}q\x00(X\x05\x00\x00\x00colorq\x01]q\x02(X\x05\x00\x00\x00whiteq\x03X\x03\x00\x00\x00redq\x04eX\x05\x00\x00\x00valueq\x05]q\x06(K\x05K\x07eu.'
{'color': ['white', 'red'], 'value': [5, 7]}
        0   1   2   3
up      0   1   2   3
down    4   5   6   7
left    8   9  10  11
right  12  13  14  15
```
Interconnecting with databases

In many applications, very little of the data actually comes from text files, because text files are not the most efficient way to store data. Data is often kept in SQL-style relational databases, and NoSQL databases have recently become popular as a complement.
Loading data from an SQL database and converting it to a DataFrame object is simple, and pandas provides several functions to streamline the process.
The pandas.io.sql module provides a database-independent unified interface, SQLAlchemy. This interface simplifies the connection mode: there is only one set of commands regardless of the type of database. To connect to a database, use the create_engine() function, with which you can configure the driver, user name, password, port, database instance and all the other required properties. A database URL typically takes the form:
dialect+driver://username:password@host:port/database
dialect is the identifying name of the database backend, such as sqlite, mysql, postgresql, oracle or mssql. driver is the name of the DBAPI used to connect to the database, written in all lowercase. If it is omitted, a "default" DBAPI is imported if available; this default is usually the broadest driver available for that backend.
```python
from sqlalchemy import create_engine

# PostgreSQL
# default
engine = create_engine('postgresql://scott:tiger@localhost/mydatabase')
# pg8000 driver
engine = create_engine('postgresql+pg8000://scott:tiger@localhost/mydatabase')
# psycopg2 driver
engine = create_engine('postgresql+psycopg2://scott:tiger@localhost/mydatabase')

# MySQL
# default
engine = create_engine('mysql://scott:tiger@localhost/foo')
# mysql-python
engine = create_engine('mysql+mysqldb://scott:tiger@localhost/foo')
# MySQL-connector-python
engine = create_engine('mysql+mysqlconnector://scott:tiger@localhost/foo')
# OurSQL
engine = create_engine('mysql+oursql://scott:tiger@localhost/foo')

# Oracle
engine = create_engine('oracle://scott:tiger@127.0.0.1:1521/sidname')
# cx_Oracle
engine = create_engine('oracle+cx_oracle://scott:tiger@tnsname')

# Microsoft SQL
# pyodbc
engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')
# pymssql
engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/dbname')
```
SQLite:
Because SQLite connects to local files, the URL format is slightly different: the file part of the URL is the file name of the database.
For a relative file path, three slashes are needed: engine = create_engine('sqlite:///foo.db')
For an absolute file path, the three slashes are followed by the absolute path:
Unix/Mac – 4 initial slashes in total
engine = create_engine('sqlite:////absolute/path/to/foo.db')
Windows
engine = create_engine('sqlite:///C:\\path\\to\\foo.db')
Windows alternative using a raw string
engine = create_engine(r'sqlite:///C:\path\to\foo.db')
SQLite3 data read and write

Let's learn to use Python's built-in SQLite database, sqlite3. SQLite3 implements a simple, lightweight SQL DBMS, so it can be embedded into any application written in Python. It is very practical: you can create an embedded database in a single file. If you want the full features of a database without installing a real one, this tool is the best choice. It is also a good choice if you want to practice database operations before working with a real database, or if you need to store data in a single program without worrying about interfaces.
```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

frame = pd.DataFrame(np.arange(20).reshape(4, 5),
                     columns=['white','red','blue','black','green'])
# Connect to the SQLite3 database
engine = create_engine('sqlite:///pandas_data_test/foo.db')
# Convert the DataFrame into a database table.
# to_sql(self, name, con, schema=None, if_exists='fail', index=True,
#        index_label=None, chunksize=None, dtype=None)
frame.to_sql('colors', engine)
# Read it back from the database
print(pd.read_sql('colors', engine))
```
Running results: