Pandas file reading: start from the official documentation
Everything required for reading files with pandas can be found in the official documentation:
Pandas.pydata.org/pandas-docs…
The reading and writing functions are documented under the “Input/Output” (IO) section of the pandas docs.
Pandas reads the TXT file
Before reading a TXT file, check whether it follows a basic format, that is, which delimiter it uses: \t, commas, spaces, or some other special separator.
A typical TXT file looks like the following example, where the columns are separated by spaces:
```
...
3 2019-03-22 00:06:32.6835965 ashshsh
4 2017-03-22 00:06:32.8041945 eggg
```
The file can be read with either read_csv or read_table:
```python
import pandas as pd
df = pd.read_table("./test.txt")
print(df)
```

```python
import pandas as pd
df = pd.read_csv("./test.txt")
print(df)
```
Note, however, that this is a DataFrame of three rows and one column, rather than the three rows and four columns required
```python
import pandas as pd
df = pd.read_csv("./test.txt")
print(type(df))
print(df.shape)
```

```
<class 'pandas.core.frame.DataFrame'>
(3, 1)
```
Read_csv function
read_csv loads delimited data from a file, URL, or file-like object; the default delimiter is a comma.
The TXT file above is not comma-separated, so the sep parameter needs to be passed when reading it:
df = pd.read_csv("./test.txt",sep=' ')
Copy the code
Source: github.com/pandas-dev/… The table below describes each parameter, with worked cases for the most important ones.
Parameter | Description |
---|---|
filepath_or_buffer | Path to the file, or a URL. Supported URL schemes include http, ftp, s3 and file. Local file example: file://localhost/path/to/table.csv |
sep | str, default ','. The delimiter. If not specified, comma separation is attempted. Delimiters longer than one character (other than '\s+') are treated as regular expressions, force the Python parser, and ignore quoted commas in the data. Regex example: '\r\t' |
delimiter | Alternative to sep (if specified, sep is ignored); rarely used |
delim_whitespace | True or False, default False. Use whitespace as the delimiter, equivalent to sep='\s+'. If this argument is set, the delimiter parameter no longer takes effect |
header | Row number to use as the column names (rows before it are skipped). Defaults to header=0 when no names are passed; set header=None if the file has no header row |
names | List of column names to use. If the file contains no header row, also pass header=None explicitly. (header may itself be a list of ints, e.g. [0,1,3], for multi-level columns; rows in between are skipped, e.g. row 2 in that example) |
index_col (Case 1) | Default None. Column(s) to use as the row labels of the DataFrame; a MultiIndex is used if a list is given. If the file has a delimiter at the end of every line, pass index_col=False so pandas does not use the first column as the row index |
usecols | Default None. Return a subset of the columns, given as positions or names, e.g. [0, 1, 2] or ['foo', 'bar', 'baz']; speeds up loading and reduces memory use |
squeeze | Default False. If True and the parsed data contains only one column, return a Series |
prefix | Prefix for auto-generated column numbers, e.g. 'X' gives X0, X1, … Only valid when header=None or no header is set |
mangle_dupe_cols | Default True: duplicate column names are renamed 'X.0'…'X.N' instead of all being 'X'. If False, columns with duplicate names overwrite each other's data |
dtype | Data type per column, e.g. {'a': np.float64, 'b': np.int32}, where a and b are column names |
engine | Parser engine, 'c' or 'python'. The C engine is faster, the Python engine is more feature-complete |
converters (Case 2) | Dict of functions to apply to specified columns, keyed by position or column name |
true_values / false_values | Values to treat as True/False. (No practical use case found yet; to be filled in later) |
skipinitialspace | Skip spaces after the delimiter. Default False |
skiprows | Default None. Number of lines to skip at the start of the file, or a list of line numbers (0-based) to skip |
skipfooter | Number of lines to skip at the end of the file (not supported by the C engine) |
nrows | Number of rows to read, counted from the start of the file |
na_values | Extra strings to recognize as NaN. By default '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan' and 'null' are all interpreted as NaN |
keep_default_na | If na_values is specified and keep_default_na=False, the default NaN strings are replaced; otherwise the specified values are added to them |
na_filter | Whether to detect missing values (empty strings or na_values). For large files with no missing values, na_filter=False speeds up reading |
verbose | Print extra parser output, e.g. the number of missing values in non-numeric columns |
skip_blank_lines | If True, blank lines are skipped; otherwise they become rows of NaN |
parse_dates | List of ints or names, e.g. [1,2,3] -> parse columns 1, 2 and 3 each as a separate date column; dict, e.g. {'foo': [1,3]} -> combine columns 1 and 3 and name the merged column 'foo' |
infer_datetime_format | If True and parse_dates is enabled, pandas tries to infer the datetime format and, if it finds one, switches to a faster parsing method, in some cases 5 to 10 times faster |
keep_date_col | If date parsing combines multiple columns, keep the original columns as well. Default False |
date_parser | Function used to parse dates, dateutil.parser.parser by default. Pandas tries three different calling conventions and falls back to the next one if a call raises: 1. with one or more arrays (specified by parse_dates) as arguments; 2. with the string values of the parse_dates columns concatenated into a single array; 3. once per row, with one or more strings (specified by parse_dates) as arguments |
dayfirst | Parse dates as DD/MM (international format) |
iterator | Return a TextFileReader object to process the file chunk by chunk |
chunksize | Number of rows per chunk when iterating |
compression | Read compressed files directly from disk. With 'infer' (the default), files ending in '.gz', '.bz2', '.zip' or '.xz' are decompressed with gzip, bz2, zip or xz; a zip archive must contain exactly one file. Set to None for no decompression. (zip and xz support was added in version 0.18.1) |
thousands | Thousands separator, e.g. ',' |
decimal | Character recognized as the decimal point, default '.' |
lineterminator | Line terminator character; only valid with the C parser |
quotechar | The quote character marking the start and end of a quoted item; delimiters inside quotes are ignored |
quoting | Controls field quoting: QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3) |
doublequote | When quotechar is specified and quoting is not QUOTE_NONE, controls whether two consecutive quotechar characters inside a field are interpreted as a single quotechar |
escapechar | Character used to escape the delimiter when quoting is QUOTE_NONE |
comment | Character indicating that the rest of the line should not be parsed. If found at the beginning of a line, the whole line is ignored. Must be a single character. Like blank lines (when skip_blank_lines=True), fully commented lines are ignored by header and skiprows. For example, with comment='#' and header=0, parsing '#empty\na,b,c\n1,2,3' returns 'a,b,c' as the header |
encoding | Character encoding, e.g. 'utf-8' |
dialect | A csv dialect; if set, related parsing parameters are taken from it. See the csv.Dialect documentation for details |
error_bad_lines | By default, rows with too many fields raise an error and no DataFrame is returned; if set to False, these "bad lines" are dropped instead (C parser only) |
warn_bad_lines | If error_bad_lines=False and warn_bad_lines=True, a warning is printed for each "bad line" (C parser only) |
low_memory | Process the file internally in chunks to lower memory use while parsing, which can result in mixed-type columns. Set it to False, or specify the dtype parameter, to avoid this; alternatively read the file in chunks via iterator or chunksize and concatenate afterwards (C parser only) |
delim_whitespace | New in version 0.18.1: also supported by the Python parser |
memory_map | If a filepath is given for filepath_or_buffer, map the file object directly into memory and read the data from there, avoiding repeated I/O overhead |
float_precision | Which converter the C engine should use for floating-point values |
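To make the table concrete, here is a small sketch combining several of the parameters above; the file name test.csv and the column names a and b are assumptions for illustration only:

```python
import pandas as pd

# "test.csv" and its column names are hypothetical; adjust to your own file.
df = pd.read_csv(
    "./test.csv",
    sep=",",                     # delimiter
    header=0,                    # the first row holds the column names
    usecols=["a", "b"],          # load only these columns to save memory
    dtype={"a": "float64"},      # force a column type
    nrows=100,                   # read at most 100 rows
    na_values=["-", "missing"],  # extra strings to treat as NaN
    encoding="utf-8",
)
print(df.head())
```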
Part of the above is adapted from the blog www.cnblogs.com/datablog/p/… — thanks for the translation, O(∩_∩)O haha ~
Case 1
Using index_col
Start with a TXT file whose biggest problem is that every line ends with a ','. A trailing delimiter on each line causes trouble when reading; testing shows you also need to pass the names parameter.
```
github,1,2,3,ddd,
u,1,3,4,asd,
as,df,12,33,
```
Write the following code
df = pd.read_csv("./demo.txt",header=None,names=['a','b','c','d','e'])
print(df)
df = pd.read_csv("./demo.txt",header=None,index_col=False,names=['a','b','c','d','e'])
print(df)
Copy the code
In practice this particular behaviour is not very significant, and the documentation is not entirely clear about its exact purpose. A more common use of index_col is the following.
When reading a file, if index_col is not set, an integer index starting from 0 is used by default. If you then operate on the rows or columns of the table and save it back to a file, you will find an extra unnamed column of 0-based integers when you read it again. Setting index_col when reading avoids this problem.
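A minimal sketch of the round trip described above (the file name out.csv is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_csv("out.csv")                         # the integer index is written as the first column
print(pd.read_csv("out.csv"))                # comes back with an extra "Unnamed: 0" column
print(pd.read_csv("out.csv", index_col=0))   # the first column is used as the row index again
```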
Case 2
converters sets a handler function for specified columns; columns can be referred to either by position or by name.
```python
import pandas as pd

def fun(x):
    return str(x) + "-haha"

df = pd.read_csv("./test.txt", sep=' ', header=None, index_col=0, converters={3: fun})
print(type(df))
print(df.shape)
print(df)
```
Common problems with the read_csv function
- In some IDEs, read_csv raises an error when the file path or file name contains Chinese characters (a combined sketch of these fixes follows this list).
The solution
```python
import pandas as pd
# df = pd.read_csv('F:/测试/data.txt')   # may fail when the path contains Chinese characters
f = open('F:/测试/data.txt')
df = pd.read_csv(f)
```
- Use the skiprows parameter to exclude rows. Note that skiprows=3 skips the first 3 lines, while skiprows=[3] skips only the line with index 3.
- For irregular delimiters, use a regular expression. If the file is separated by single spaces, sep=" " works; but when fields may be separated by one or more spaces, sep=" " produces strange results because the extra spaces are treated as data. Use data = pd.read_csv("data.txt", sep="\s+") instead.
- If a Chinese-encoding error occurs when reading the file, set the encoding parameter.
- To add row and column indexes, use names for the column names and index_col for the row index.
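A combined sketch of the fixes above; the paths, file names and column names are assumptions for illustration:

```python
import pandas as pd

# Chinese characters in the path: open the file first, then pass the handle to read_csv
f = open('F:/测试/data.txt', encoding='utf-8')   # hypothetical path
df = pd.read_csv(f)

# skiprows: skip the first 3 lines vs. skip only the line with index 3
df1 = pd.read_csv("data.txt", skiprows=3)
df2 = pd.read_csv("data.txt", skiprows=[3])

# Irregular whitespace delimiters: use a regular expression
df3 = pd.read_csv("data.txt", sep=r"\s+")

# Chinese content: set the encoding explicitly
df4 = pd.read_csv("data.txt", encoding="utf-8")

# Add column names with names and a row index with index_col
df5 = pd.read_csv("data.txt", names=["a", "b", "c"], index_col=0)
```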
read_csv has a large number of parameters, but most of them are rarely needed, since most files you download come in standard formats.
Read_table function
The basic usage is the same as read_csv; only the default separator differs.
read_csv expects comma-separated values, so by default it only parses data split by ','; read_table uses '\t' (tab) as its default separator.
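A minimal sketch of the difference (test.txt is assumed here to be tab-separated):

```python
import pandas as pd

# read_csv defaults to sep=',', read_table defaults to sep='\t';
# with an explicit sep the two calls are interchangeable.
df1 = pd.read_table("./test.txt")           # tab-separated by default
df2 = pd.read_csv("./test.txt", sep="\t")   # equivalent to the line above
```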
Read_fwf function
Reads a file with fixed-width columns, for example a file like this:
```
id8141  360.242940  149.910199  11950.7
id1594  444.953632  166.985655  11788.4
id1849  364.136849  183.628767  11806.2
id1230  413.836124  184.375703  11916.8
id1948  502.953953  173.237159  12468.3
```
The read_fwf command has two additional parameters that can be set
colspecs:
A list of tuples describing the half-open intervals [from, to) of each field; by default the positions are inferred from the first 100 rows.
Example:
```python
import pandas as pd

colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
df = pd.read_fwf('demo.txt', colspecs=colspecs, header=None, index_col=0)
```
widths:
Instead of colspecs, you can pass a list of field widths directly:
```python
widths = [6, 14, 13, 10]
df = pd.read_fwf('demo.txt', widths=widths, header=None)
```
read_fwf is not used very often; more details can be found at pandas.pydata.org/pandas-docs…
Read_msgpack function
msgpack is a serialization format newly supported by pandas: a lightweight, portable binary format similar to binary JSON that is space-efficient and offers excellent performance both when writing (serializing) and reading (deserializing).
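A minimal sketch, assuming an older pandas release: to_msgpack/read_msgpack were deprecated in pandas 0.25 and removed in later versions, so this only runs on pandas versions that still ship them.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_msgpack("frame.msg")               # serialize (older pandas only; API later removed)
restored = pd.read_msgpack("frame.msg")  # deserialize
print(restored)
```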
Read_clipboard function
Reads data from the clipboard; it can be thought of as the clipboard version of read_table. It is handy for turning a web-page table into data.
This function has the following bug:
module 'pandas' has no attribute 'compat'
Updating pandas made it work normally again.
Another pitfall: when reading the clipboard, copied Chinese text very often fails to be read.
The solution
- Open the site-packages\pandas\io\clipboard.py file in your installation and locate the clipboard text handling
- Add/adjust the line text = text.decode('utf-8')
- Save the file and it can be used normally
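A minimal usage sketch: copy a table (for example a block of cells from Excel, or an HTML table) before running it.

```python
import pandas as pd

# Reads whatever tabular text is currently on the clipboard;
# keyword arguments are passed through to read_csv.
df = pd.read_clipboard(sep=r"\s+")
print(df)
```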
Read_excel function
As always, the official documentation comes first: pandas.pydata.org/pandas-docs…
Parameter | Description |
---|---|
io | File path, file-like object, pandas ExcelFile, or xlrd workbook. The string may be a URL; supported schemes include http, ftp, s3 and file, e.g. a local file can be written as file://localhost/path/to/workbook.xlsx |
sheet_name | Default 0, i.e. the first sheet. An int or string returns a DataFrame; None or a list returns a dict of DataFrames. Sheet names are strings, index positions are integers |
header | Row to use for the column names, default 0 (the first row); the data starts below it. If the data contains no column names, set header=None |
names | List of column names to use |
index_col | Column(s) to use as the row index. If a list is passed, the columns are combined into a MultiIndex |
squeeze | If the parsed data contains only one column, return a Series |
dtype | Data type for the data or for individual columns; see read_csv |
engine | Must be set if io is not a buffer or path, to identify how to read it. Acceptable values are None or 'xlrd' |
converters | See read_csv |
Other parameters | Essentially the same as read_csv |
If an error occurs when pandas reads an Excel file, handle it as follows:
Error: ImportError: No module named 'xlrd'
pandas depends on xlrd to read Excel files, so run pip install xlrd.
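A short sketch of typical calls (the file name data.xlsx and the sheet name Sheet2 are assumptions):

```python
import pandas as pd  # the xlrd engine must be installed, as noted above

df = pd.read_excel("./data.xlsx")                                      # first sheet
df2 = pd.read_excel("./data.xlsx", sheet_name="Sheet2", index_col=0)  # a specific sheet
all_sheets = pd.read_excel("./data.xlsx", sheet_name=None)            # dict of {sheet name: DataFrame}
```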
Read_json function
Parameter | Description |
---|---|
path_or_buf | A valid JSON string or file path; default None. The string may be a URL, e.g. file://localhost/path/to/table.json |
orient (Case 1) | 1. 'split': dict like {index -> [index], columns -> [columns], data -> [values]}; 2. 'records': list like [{column -> value}, … , {column -> value}]; 3. 'index': dict like {index -> {column -> value}}; 4. 'columns': dict like {column -> {index -> value}}; 5. 'values': just the values array |
typ | Type of object to return ('series' or 'frame'); default 'frame' |
dtype | Data type for the data or for individual columns; see read_csv |
convert_axes | Boolean, try to convert the axes to the proper dtypes; default True |
convert_dates | List of columns to parse as dates; if True, try to parse date-like columns: columns ending in '_at' or '_time', beginning with 'timestamp', or named 'modified' or 'date' |
keep_default_dates | Boolean, default True. If parsing dates, also parse the default date-like columns |
numpy | Decode directly to NumPy arrays; default False. Only numeric data is supported, although labels may be non-numeric. Note that if numpy=True, the JSON ordering must be the same for every record |
precise_float | Boolean, default False. Enable the higher-precision (strtod) function when decoding strings to double values. The default (False) uses the fast but less precise built-in conversion |
date_unit | String, the timestamp unit used to detect and convert dates; default None, in which case the precision is detected. Pass 's', 'ms', 'us' or 'ns' to force seconds, milliseconds, microseconds or nanoseconds respectively |
encoding | The JSON encoding |
lines | Read the file as one JSON object per line |
If the JSON cannot be parsed, the parser raises one of ValueError, TypeError, or AssertionError.
Case 1
- orient='split'
```python
import pandas as pd

s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,5],[6,9]]}'
df = pd.read_json(s, orient='split')
```
- orient='records': the members form a list of dictionaries
```python
import pandas as pd

s = '[{"a":1,"b":2},{"a":3,"b":4}]'
df = pd.read_json(s, orient='records')
```
- orient='index': the index values are the outer keys, and each value is a dict of {column: value}. For example: s = '{"0":{"a":1,"b":2},"1":{"a":2,"b":4}}'
- orient='columns' and orient='values' work analogously: the columns are the outer keys, or the data is just the raw values array (see the sketch below)
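A minimal sketch of these two orientations, using made-up JSON strings:

```python
import pandas as pd

# orient='index': outer keys become the row index, inner dicts map column -> value
s = '{"0":{"a":1,"b":2},"1":{"a":2,"b":4}}'
df_index = pd.read_json(s, orient='index')

# orient='columns': outer keys become the column labels
s2 = '{"a":{"0":1,"1":2},"b":{"0":2,"1":4}}'
df_columns = pd.read_json(s2, orient='columns')
```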
Reference on GitHub: github.com/apachecn/pa…
A common read_json() bug
ValueError: Trailing data is raised when reading the JSON file,
because its original format is
{"a":1,"b":1},{"a":2,"b":2}
Copy the code
Adjust it to
[{"a":1,"b":1},{"a":2,"b":2}]
Copy the code
Or use the lines parameter and reshape the JSON to one object per line:
{"a":1,"b":1}
{"a":2,"b":2}
Copy the code
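A minimal sketch of reading the line-delimited form above (the file name demo.json is an assumption):

```python
import pandas as pd

# lines=True: each line of the file is parsed as a separate JSON object
df = pd.read_json("./demo.json", lines=True, encoding="utf-8")
print(df)
```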
If the JSON file contains Chinese characters, it is recommended to pass the encoding parameter with the value 'utf-8'; otherwise an error is raised.
Read_html function
Parameter | Description |
---|---|
io | A URL, file, or string. HTTPS URLs may fail; try removing the 's' and fetching over HTTP |
match | Regular expression; only tables whose text matches it are returned |
flavor | The parser to use, default 'lxml' |
header | Row(s) to use as the column header; a list produces a MultiIndex |
index_col | Column(s) to use as the row index; a list produces a MultiIndex |
skiprows | Number of rows to skip (int) or list of row indices to skip |
attrs | HTML attributes used to identify the table, e.g. attrs = {'id': 'table'} |
parse_dates | Parse dates |
How to tell whether it can be used: right-click the web page and check the source; if the data you want sits inside a <table> element, read_html can read it.
For example: data.stcn.com/2019/0304/1…
<table class="..." id="..." > <thead> <tr> <th>... </th> </tr> </thead> <tbody> <tr> <td>... </td> </tr> <tr>... </tr> </tbody> </table> < TABLE > : Define the table <thead> : Define the header of the table < TBody > : Define the body of the table <tr> : define the row of the table <th> : Define the header of the table < TD > : define the table cellsCopy the code
A common bug
Error: ImportError: html5lib not found, please install it
Just install html5lib, or pass the flavor parameter:
```python
import pandas as pd

df = pd.read_html("http://data.stcn.com/2019/0304/14899644.shtml", flavor='lxml')
```
For more information, see pandas.pydata.org/pandas-docs…
A short reference table of the pandas read and write functions
Read function | Write function | Notes |
---|---|---|
read_clipboard | to_clipboard | Read text from the clipboard and pass it to read_table |
read_csv | to_csv | Read a CSV (comma-separated) file into the DataFrame |
read_excel | to_excel | Excel spreadsheet |
read_sql | to_sql | |
read_pickle | to_pickle | |
read_json | to_json | |
read_msgpack | to_msgpack | |
read_stata | to_stata | |
read_gbq | to_gbq | Load data from Google BigQuery |
read_hdf | to_hdf | |
read_html | to_html | |
read_parquet | to_parquet | |
read_feather | to_feather |
```python
import pandas as pd

csvframe = pd.read_csv('pandas_data_test\myCSV_01.csv')
print(csvframe, "\n-----*-----")
csvframe1 = pd.read_table('pandas_data_test\myCSV_01.csv', sep=',')
print(csvframe1, "\n-----*-----")
csvframe2 = pd.read_csv('pandas_data_test\myCSV_02.csv', header=None)   # the file has no header row
print(csvframe2, "\n-----*-----")
csvframe20 = pd.read_csv('pandas_data_test\myCSV_02.csv', names=['white','red','blue','green','animal'])  # specify column names
print(csvframe20, "\n-----*-----")

csvframe30 = pd.read_csv('pandas_data_test\myCSV_03.csv')
print(csvframe30, "\n-----*-----")
csvframe31 = pd.read_csv('pandas_data_test\myCSV_03.csv', index_col=['color','status'])  # hierarchical index
print(csvframe31, "\n-----*-----")

txtframe4 = pd.read_table('pandas_data_test\ch05_04.txt', sep='\s+')  # parse according to a regular expression
print(txtframe4, "\n-----*-----")
txtframe5 = pd.read_table('pandas_data_test\ch05_05.txt', sep=r'\D+', header=None, engine='python')
print(txtframe5, "\n-----*-----")
# Use the skiprows option to exclude unnecessary lines; assign the row numbers to exclude as a list.
txtframe6 = pd.read_table('pandas_data_test\ch05_06.txt', sep=',', skiprows=[0,1,3,6])
print(txtframe6)
```

```
Out[1]:
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
5      4    4     2      1    mou
-----*-----
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
5      4    4     2      1    mou
-----*-----
   0  1  2  3      4
0  1  5  2  3    cat
1  2  7  8  5    dog
2  3  3  6  7  horse
3  2  2  8  3   duck
4  4  4  2  1  mouse
-----*-----
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
-----*-----
   color status  item1  item2  item3
0  black     up      3      4      6
1  black   down      2      6      7
2  white     up      5      5      5
3  white   down      3      3      2
4  white   left      1      2      1
5    red     up      2      2      2
6    red   down      1      1      4
-----*-----
              item1  item2  item3
color status
black up          3      4      6
      down        2      6      7
white up          5      5      5
      down        3      3      2
      left        1      2      1
red   up          2      2      2
      down        1      1      4
-----*-----
   white  red  blue  green
0      1    5     2      3
1      2    7     8      5
2      3    3     6      7
-----*-----
   0    1    2
0  0  123  122
1  1  124  321
2  2  125  333
-----*-----
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
```
Reading part of the data from a TXT file
```python
print(csvframe2, "\n-----*-----")
# nrows=2 reads the specified number of rows; skiprows=[2] skips the line with index 2
csvfram20 = pd.read_csv('pandas_data_test\myCSV_02.csv', skiprows=[2], nrows=2, header=None)
print(csvfram20)
```

```
Out[2]:
   0  1  2  3      4
0  1  5  2  3    cat
1  2  7  8  5    dog
2  3  3  6  7  horse
3  2  2  8  3   duck
4  4  4  2  1  mouse
-----*-----
   0  1  2  3    4
0  1  5  2  3  cat
1  2  7  8  5  dog
```
Another interesting and common operation is to slice the text to be parsed into chunks and then iterate over them, performing a particular operation on each chunk.
For example, sum one of the numeric columns chunk by chunk and insert the partial sums into a Series object. This small example is easy to follow and has no practical use by itself, but once you understand how it works you can apply the pattern to more complex situations.
```python
csvframe1 = pd.read_table('pandas_data_test\myCSV_01.csv', sep=',')
print(csvframe1, "\n-----*-----")
out = pd.Series()
pieces = pd.read_csv('pandas_data_test\myCSV_01.csv', chunksize=4)  # chunksize sets the number of rows per chunk
i = 0
for piece in pieces:
    print(piece['white'])
    out.at[i] = piece['white'].sum()
    i += 1
print(out, "\n-----*-----")
```

```
Out[3]:
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
5      4    4     2      1    mou
-----*-----
0    1
1    2
2    3
3    2
Name: white, dtype: int64
4    4
5    4
Name: white, dtype: int64
0    8
1    8
dtype: int64
```
Write data to a CSV file
```python
print(csvframe1)
print(csvframe1.to_csv('pandas_data_test\ch05_07.csv'))
# index=False and header=False suppress the default writing of the index and header
print(csvframe1.to_csv('pandas_data_test\ch05_07b.csv', index=False, header=False))
print(csvframe30.to_csv('pandas_data_test\ch05_08.csv'))
# Replace empty fields with a desired value using the na_rep option of to_csv()
print(csvframe30.to_csv('pandas_data_test\ch05_09.csv', na_rep="empty"))
```

```
Out[4]:
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
5      4    4     2      1    mou
None
None
None
None
```
Open the folder and you can see the corresponding files that were generated:
Reading and writing HTML files
```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(4).reshape(2, 2))
print(frame.to_html())
frame2 = pd.DataFrame(np.random.random((4, 4)), index=['white','black','red','blue'],
                      columns=['up','down','right','left'])
s = ['<HTML>']
s.append('<HEAD><TITLE>My DataFrame</TITLE></HEAD>')
s.append('<BODY>')
s.append(frame.to_html())
s.append('</BODY></HTML>')
html = ''.join(s)
html_file = open('pandas_data_test\myFrame.html', 'w')
html_file.write(html)
html_file.close()
web_frames = pd.read_html('pandas_data_test\myFrame.html')
print(web_frames[0])
ranking = pd.read_html('http://www.meccanismocomplesso.org/en/meccanismo-complesso-sito-2/classifica-punteggio/')
print(ranking[0][1:10])
```

```
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>3</td>
    </tr>
  </tbody>
</table>
   Unnamed: 0  0  1
0           0  0  1
1           1  2  3
   #  Nome             Exp   Livelli
1  2  admin            9029      NaN
2  3  BrunoOrsini      2124      NaN
3  4  Berserker         700      NaN
4  5  Dnocioni          543      NaN
5  6  albertosallusti   409      NaN
6  7  Jon               233      NaN
7  8  Mr.Y              180      NaN
8  9  michele sisinni   157      NaN
9  10 Selina            136      NaN
```
Read data from XML
Pandas has no I/O API function specifically designed for the XML format. This may seem surprising, because a great deal of structured data is stored as XML. It is not a problem, though, since Python offers many libraries (besides pandas) that read and write XML data. One of them, lxml, handles large files particularly well and stands out from the many similar libraries. This section describes how to use it to process XML files, how to combine it with pandas, and finally how to extract the required data from an XML file and convert it into a DataFrame object.
The XML source file is shown below
```python
from lxml import objectify

xml = objectify.parse('pandas_data_test\books.xml')
root = xml.getroot()                     # get the root node
print(root.Book.Author)
mes1 = root.Book.getchildren()
print("root.Book.getchildren() gets:\n", mes1)
mes2 = root.Book[1].getchildren()
print([child.tag for child in mes2])     # child tag names
print([child.text for child in mes2])    # child node contents
```

```
272103_1_EnRoss, Mark
root.Book.getchildren() gets:
 ...
['Author', 'Title', 'Genre', 'Price', 'PublishDate']
['272103_1_EnBracket, Barbara', 'XML for Dummies', 'Computer', '35.95', '2014-12-16']
```
Read and write Microsoft Excel files
read_excel() and to_excel() can read and write both .xls and .xlsx files.
Reading and writing JSON data: read_json() and to_json().

HDF5 format

So far we have looked at reading and writing data in text form. When analyzing large amounts of data, a binary format is usually preferable. Python has many tools for handling binary data, and the HDF5 library has been particularly successful in this area. HDF stands for Hierarchical Data Format. The HDF5 library is dedicated to reading and writing HDF5 files, a data structure made up of nodes that can store very large datasets. The library is developed in C and provides interfaces for Python, MATLAB and Java. Its rapid spread is due both to wide adoption by developers and to its efficiency, especially when storing large amounts of data in this format. Compared with other binary formats, HDF5 is simpler and supports real-time compression, exploiting repeated patterns in the data structure to compress files. Currently Python offers two ways to work with HDF5 data: PyTables and h5py. The two differ in several respects, and which one to choose depends largely on your needs: h5py provides a direct interface to the high-level HDF5 API, while PyTables wraps many HDF5 details and provides more flexible data containers, indexed tables, query capabilities and other computation-related features. Pandas also has a dict-like class called HDFStore that stores pandas objects using PyTables. Before using the HDF5 format, import the HDFStore class.
```python
import numpy as np
import pandas as pd
from pandas.io.pytables import HDFStore
# The tables (PyTables) package is also required; install it if it is missing.

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['white','black1','red','blue'],
                     columns=['up','down','right','left'])
store = HDFStore('pandas_data_test\mydata.h5')
store['obj1'] = frame
frame1 = pd.DataFrame(np.random.rand(16).reshape(4, 4),
                      index=['white','black1','red','blue'],
                      columns=['up','down','right','left'])
store['obj2'] = frame1
print(store['obj1'])
print(store['obj2'])
```

```
Out[7]:
        up  down  right  left
white    0     1      2     3
black1   4     5      6     7
red      8     9     10    11
blue    12    13     14    15
              up      down     right      left
white   0.251269  0.422823  0.619294  0.273534
black1  0.593960  0.353969  0.966026  0.104581
red     0.964577  0.625644  0.342923  0.638627
blue    0.246541  0.997952  0.414599  0.908750
Closing remaining open files: pandas_data_test\mydata.h5... done
```
Implement object serialization
The pickle module implements a powerful algorithm for serializing (pickling) and deserializing data structures implemented in Python.
Serialization converts the hierarchy of an object into a byte stream, which makes it easy to transfer, store and rebuild the object: the receiver can reconstruct it from the byte stream alone while keeping all of its original characteristics.
Pandas makes object serialization (and deserialization) very convenient: all the tools are ready to use and there is no need to import the cPickle module into the Python session, since every operation happens implicitly. Also note that pandas' serialization format is not pure ASCII.
```python
import pickle
import numpy as np
import pandas as pd

data = {'color': ['white', 'red'], 'value': [5, 7]}
pickled_data = pickle.dumps(data)
print(pickled_data)
nframe = pickle.loads(pickled_data)
print(nframe)

frame = pd.DataFrame(np.arange(16).reshape(4, 4), index=['up','down','left','right'])
frame.to_pickle('pandas_data_test\frame.pkl')          # pickle the DataFrame to a file
print(pd.read_pickle('pandas_data_test\frame.pkl'))    # read it back
```

```
b'\x80\x03}q\x00(X\x05\x00\x00\x00colorq\x01]q\x02(X\x05\x00\x00\x00whiteq\x03X\x03\x00\x00\x00redq\x04eX\x05\x00\x00\x00valueq\x05]q\x06(K\x05K\x07eu.'
{'color': ['white', 'red'], 'value': [5, 7]}
        0   1   2   3
up      0   1   2   3
down    4   5   6   7
left    8   9  10  11
right  12  13  14  15
```
Interconnecting with databases

In many applications, very little of the data actually comes from text files, because text files are not the most efficient way to store data. Data is often kept in SQL-style relational databases, and NoSQL databases have recently become popular as a complement.
Loading data from an SQL database and converting it to a DataFrame object is simple, and pandas provides several functions to streamline the process.
The pandas.io.sql module provides a database-independent unified interface, SQLAlchemy. This interface simplifies the connection mode: there is only one set of commands regardless of the type of database. To connect to a database, use the create_engine() function, with which you can configure the driver, user name, password, port, database instance and all the other required properties. A database URL typically takes the form:
dialect+driver://username:password@host:port/database
dialect is the identifying name of the database backend, such as sqlite, mysql, postgresql, oracle or mssql. driver is the name of the DBAPI used to connect to the database, written in all lowercase. If it is omitted, a "default" DBAPI is imported if available; this default is usually the broadest driver available for that backend.
```python
from sqlalchemy import create_engine

# PostgreSQL
# default
engine = create_engine('postgresql://scott:tiger@localhost/mydatabase')
# pg8000 driver
engine = create_engine('postgresql+pg8000://scott:tiger@localhost/mydatabase')
# psycopg2 driver
engine = create_engine('postgresql+psycopg2://scott:tiger@localhost/mydatabase')

# MySQL
# default
engine = create_engine('mysql://scott:tiger@localhost/foo')
# mysql-python
engine = create_engine('mysql+mysqldb://scott:tiger@localhost/foo')
# MySQL-connector-python
engine = create_engine('mysql+mysqlconnector://scott:tiger@localhost/foo')
# OurSQL
engine = create_engine('mysql+oursql://scott:tiger@localhost/foo')

# Oracle
engine = create_engine('oracle://scott:tiger@127.0.0.1:1521/sidname')
# cx_Oracle
engine = create_engine('oracle+cx_oracle://scott:tiger@tnsname')

# Microsoft SQL
# pyodbc
engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')
# pymssql
engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/dbname')
```
SQLite:
Because SQLite connects to local files, the URL format is slightly different: the file part of the URL is the file name of the database.
For a relative file path, three slashes are needed: engine = create_engine('sqlite:///foo.db')
For an absolute file path, the three slashes are followed by the absolute path:
Unix/Mac – 4 initial slashes in total
engine = create_engine('sqlite:////absolute/path/to/foo.db')
Windows
engine = create_engine('sqlite:///C:\\path\\to\\foo.db')
Windows alternative using a raw string
engine = create_engine(r'sqlite:///C:\path\to\foo.db')
SQLite3 data read and write

Let's learn to use Python's built-in SQLite database, sqlite3. SQLite3 implements a simple, lightweight SQL DBMS, so it can be embedded into any application written in Python. It is very practical: you can create an embedded database in a single file. If you want the full features of a database without installing a real one, this tool is the best choice. It is also a good choice if you want to practice database operations before working with a real database, or if you need to store data in a single program without worrying about interfaces.
```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

frame = pd.DataFrame(np.arange(20).reshape(4, 5),
                     columns=['white','red','blue','black','green'])
# Connect to the SQLite3 database
engine = create_engine('sqlite:///pandas_data_test/foo.db')
# Convert the DataFrame into a database table.
# to_sql(self, name, con, schema=None, if_exists='fail', index=True,
#        index_label=None, chunksize=None, dtype=None)
frame.to_sql('colors', engine)
# Read it back from the database
print(pd.read_sql('colors', engine))
```
Running results: