Pandas has the following basic structure:
To view the head and tail samples of a Series or DataFrame object, use the head() and tail() methods. The default display is five elements, but you can pass custom numbers.
In [3]: seriesd = pd.Series(np.random.randn(100))
In [4]: seriesd.head()
0    1.894425
1    0.804395
2   -1.511387
3    0.195662
4   -0.053392
dtype: float64
In [5]: seriesd.head(3)
0    1.894425
1    0.804395
2   -1.511387
dtype: float64
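tail() works the same way from the other end of the object. A minimal sketch, assuming a Series filled with known values (the names `s`, `first_five` and `last_three` are illustrative only) so the selected elements are easy to verify:

```python
import numpy as np
import pandas as pd

# Illustrative Series holding 0..99, so the selected elements are predictable.
s = pd.Series(np.arange(100))

first_five = s.head()    # default n=5: the first five elements
last_three = s.tail(3)   # custom n: the last three elements

print(first_five.tolist())
print(last_three.tolist())
```

Both methods return a new object of the same type, so they can be used on a DataFrame in the same way.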
pandas objects have a number of attributes that give access to their metadata:
Series: index
DataFrame: index, columns, values
Panel: items, major_axis, minor_axis

The axis labels (index, columns, items, major_axis, minor_axis) can all be safely reassigned.

In [6]: df = pd.DataFrame(np.random.randn(4, 3), index=pd.date_range('1/1/2000', periods=4), columns=['A', 'B', 'C'])
In [8]: df
                   A         B         C
2000-01-01  1.748549 -0.766845  0.592976
2000-01-02 -0.821119 -0.122719  0.055415
2000-01-03  1.146029  1.615741 -1.115166
2000-01-04  0.947920  0.181372  0.190599
In [9]: df.columns
Out[9]: Index([u'A', u'B', u'C'], dtype='object')
In [10]: df.index
Out[10]: DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04'], dtype='datetime64[ns]', freq='D')
In [11]: df.values
array([[ 1.74854937, -0.76684475,  0.59297559],
       [-0.82111911, -0.12271889,  0.05541523],
       [ 1.14602866,  1.61574067, -1.1151657 ],
       [ 0.9479203 ,  0.18137236, ...]])

DataFrame provides the binary operation methods add(), sub(), mul(), div() and their reflected counterparts radd(), rsub(), … With these methods you control broadcasting through the axis parameter: set it to 'index' or 'columns' to choose which axis the labels of the other operand are matched against.
In [13]: df.sub(df.iloc[1], axis='columns')
                   A         B         C
2000-01-01  2.569668 -0.644126  0.537560
2000-01-02  0.000000  0.000000  0.000000
2000-01-03  1.967148  1.738460 -1.170581
2000-01-04  1.769039  0.304091  0.135184

In [14]: df.sub(df.iloc[1], axis=1)
                   A         B         C
2000-01-01  2.569668 -0.644126  0.537560
2000-01-02  0.000000  0.000000  0.000000
2000-01-03  1.967148  1.738460 -1.170581
2000-01-04  1.769039  0.304091  0.135184
In [15]: df.sub(df.iloc[1], axis='index')
                      A   B   C
2000-01-01 00:00:00 NaN NaN NaN
2000-01-02 00:00:00 NaN NaN NaN
2000-01-03 00:00:00 NaN NaN NaN
2000-01-04 00:00:00 NaN NaN NaN
A                   NaN NaN NaN
B                   NaN NaN NaN
C                   NaN NaN NaN
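The all-NaN result arises because axis='index' matches the labels of the Series (A, B, C) against the row labels, and none of them line up. A small sketch with made-up integer data and string row labels (all names here are illustrative) shows a row subtracted along the columns, a column subtracted along the index, and the mismatched case:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: the values 0..11 in a 4x3 grid.
df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=["w", "x", "y", "z"],
    columns=["A", "B", "C"],
)

row = df.iloc[1]   # Series labelled A, B, C
col = df["A"]      # Series labelled w, x, y, z

# Match the row's labels against the column labels: every row minus row "x".
by_columns = df.sub(row, axis="columns")

# Match the column's labels against the row index: every column minus column "A".
by_index = df.sub(col, axis="index")

# Labels A, B, C matched against w, x, y, z: nothing aligns, so the result
# is the union of both label sets filled with NaN.
mismatched = df.sub(row, axis="index")

print(by_columns)
print(by_index)
print(mismatched)
```

In short, pass a row-shaped Series with axis='columns' and a column-shaped Series with axis='index'.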
In [16]: df.sub(df.iloc[1], axis=0)
                      A   B   C
2000-01-01 00:00:00 NaN NaN NaN
2000-01-02 00:00:00 NaN NaN NaN
2000-01-03 00:00:00 NaN NaN NaN
2000-01-04 00:00:00 NaN NaN NaN
A                   NaN NaN NaN
B                   NaN NaN NaN
C                   NaN NaN NaN

# With a MultiIndex, specify level to control the broadcasting behavior
In [17]: dfmi = df.copy()
In [18]: dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a')], names=['first', 'second'])
In [19]: dfmi
Out[19]:
                     A         B         C
first second
1     a       1.748549 -0.766845  0.592976
      b      -0.821119 -0.122719  0.055415
      c       1.146029  1.615741 -1.115166
2     a       0.947920  0.181372  0.190599
In [20]: dfmi.sub(df['A'], axis=0, level='second')
Out[20]:
               A   B   C
first second
1     a      NaN NaN NaN
      b      NaN NaN NaN
      c      NaN NaN NaN
2     a      NaN NaN NaN

# Panel arithmetic controls broadcasting in the same way as DataFrame, with axis set to 'major', 'minor' or 'items'
panel.sub(major_mean, axis='major')

2.3.2 Filling missing values
In Series and DataFrame (although not yet in Panel), the arithmetic functions accept an optional fill_value, the value to substitute when the value at a position is missing in one of the operands. For example, when adding two DataFrame objects you might want to treat NaN as 0, unless both DataFrames are missing the value, in which case the result stays NaN.
In [54]: df1
                   A         B         C
2000-01-01  1.748549       NaN  0.592976
2000-01-02 -0.821119 -0.122719  0.055415
2000-01-03  1.146029  1.615741 -1.115166
2000-01-04  3.214000  0.181372  0.190599

In [55]: df2
                   A         B         C
2000-01-01  1.748549  3.241000  0.592976
2000-01-02 -0.821119 -0.122719  0.055415
2000-01-03  1.146029  1.615741 -1.115166
2000-01-04       NaN  0.181372  0.190599

In [56]: df1 + df2
                   A         B         C
2000-01-01  3.497099       NaN  1.185951
2000-01-02 -1.642238 -0.245438  0.110830
2000-01-03  2.292057  3.231481 -2.230331
2000-01-04       NaN  0.362745  0.381198
In [57]: df1.add(df2, fill_value=0)
                   A         B         C
2000-01-01  3.497099  3.241000  1.185951
2000-01-02 -1.642238 -0.245438  0.110830
2000-01-03  2.292057  3.231481 -2.230331
2000-01-04  3.214000  0.362745  0.381198

To fill the missing values in one DataFrame with the values at matching labels from another DataFrame, use combine_first():
In [63]: df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan], 'B': [np.nan, 2., 3., np.nan, 6.]})
In [64]: df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.], 'B': [np.nan, np.nan, 3., 4., 6., 8.]})
In [65]: df1
     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0
In [66]: df2
     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0
In [67]: df1.combine_first(df2)
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0

The more general combine() method takes another DataFrame and a combiner function, aligns the two DataFrames, and then passes pairs of aligned Series (one pair per column) to the combiner function:
In [75]: combiner = lambda x, y: np.where(pd.isnull(x), y, x)
In [76]: df1.combine(df2, combiner)
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0

2.4 Descriptive statistics
There are many descriptive statistics methods for Series, DataFrame, and Panel. Most are aggregations such as sum(), mean(), std(), and quantile(). Some return an object of the same size, such as cumsum() and cumprod(). In general, these methods take an axis argument, which can be passed as a name or an integer:
Series: no axis argument needed
DataFrame: "index" (axis=0, default), "columns" (axis=1)
Panel: "items" (axis=0), "major" (axis=1, default), "minor" (axis=2)

All of these methods accept a skipna argument that determines whether missing values are skipped (True by default).
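As a rough sketch of how the axis and skipna arguments interact, assuming a small made-up DataFrame with one missing value (all variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up data with a single missing value in column "a".
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

col_sums = df.sum(axis="index")        # same as axis=0: one value per column, NaN skipped
row_sums = df.sum(axis="columns")      # same as axis=1: one value per row, NaN skipped
strict = df.sum(axis=0, skipna=False)  # NaN propagates into the sum of column "a"
running = df.cumsum()                  # non-aggregating: same shape as df

print(col_sums)
print(row_sums)
print(strict)
print(running)
```

With skipna=False a single missing value makes the whole aggregate missing, which is useful when silent gaps would be misleading.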
In [80]: df.sum(0, skipna=False)

2.4.1 Describe
describe() computes a set of summary statistics for a Series or for the columns of a DataFrame:

In [96]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
In [98]: frame.describe()

Pass the percentiles parameter to choose which percentiles appear in the output:

In [99]: series.describe(percentiles=[.05, .25, .75, .95])
count    500.000000
mean      -0.039663
std        1.069371
min       -3.463789
5%        -1.741334
25%       -0.731101
50%       -0.058918
75%        0.672758
...

idxmin() and idxmax() return the index labels of the minimum and maximum values. When several values tie for the minimum or maximum, the label of the first match is returned.
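A minimal sketch of the tie-breaking rule, assuming a small made-up Series in which both the minimum and the maximum occur twice (the labels and names are illustrative):

```python
import pandas as pd

# The minimum 1 appears at "b" and "d"; the maximum 5 appears at "c" and "e".
s = pd.Series([3, 1, 5, 1, 5], index=["a", "b", "c", "d", "e"])

first_min_label = s.idxmin()   # label of the first occurrence of the minimum
first_max_label = s.idxmax()   # label of the first occurrence of the maximum

# On a DataFrame the result holds one label per column (axis=0 by default).
df = pd.DataFrame({"x": [2, 0, 9], "y": [7, 8, 1]}, index=["r0", "r1", "r2"])
per_column_min = df.idxmin()

print(first_min_label, first_max_label)
print(per_column_min)
```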
value_counts() counts how many times each distinct value occurs.
In [3]: data = np.random.randint(0, 7, size=50)
In [4]: pd.Series(data).value_counts()
Out[4]:
6    11
5     8
3     8
0     8
4     7
2     5
1     3
dtype: int64
In [5]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50), "B": np.random.randint(-10, 15, size=50)})
In [6]: df5.mode()
Out[6]:
     A   B
0  0.0   7
1  NaN  10

2.4.4 Discretization and quantiling
Continuous values can be discretized with cut(), which bins by value boundaries, and qcut(), which bins by sample quantiles:
In [11]: arr = np.random.randn(20)
In [12]: factor = pd.cut(arr,4)
In [13]: factor
Out[13]:
[(-0.34, 0.49], (0.49, 1.32], (-0.34, 0.49], (-1.169, -0.34], (-2.00224, -1.169], ..., (-1.169, -0.34], (-2.00224, -1.169], (0.49, 1.32], (-0.34, 0.49], (-1.169, -0.34]]
Length: 20
Categories (4, object): [(-2.00224, -1.169] < (-1.169, -0.34] < (-0.34, 0.49] < (0.49, 1.32]]

qcut() computes sample quantiles. For example, we can split normally distributed data into equal-size quartiles:
In [8]: arr = np.random.randn(30)
In [9]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])
In [10]: factor
Out[10]:
[(-0.471, 0.0742], (-0.471, 0.0742], (0.0742, 0.797], (0.797, 2.597], [-2.735, -0.471], ..., [-2.735, -0.471], [-2.735, -0.471], (0.797, 2.597], [-2.735, -0.471], (0.0742, 0.797]]
Length: 30
Categories (4, object): [[-2.735, -0.471] < (-0.471, 0.0742] < (0.0742, 0.797] < (0.797, 2.597]]

Personally, I feel this is of rather limited use.
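The difference between the two functions is easiest to see on skewed data. The sketch below assumes a small hand-picked sample with one outlier (the sample and the variable names are illustrative): cut() makes bins of equal width, so most values land in one bin, while qcut() makes bins holding roughly equal numbers of values.

```python
import numpy as np
import pandas as pd

# Deliberately skewed sample: seven small values and one outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 100], dtype=float)

by_width = pd.cut(values, 4)                          # 4 equal-width bins over the value range
by_quantile = pd.qcut(values, [0, .25, .5, .75, 1])   # 4 bins bounded by the sample quartiles

# .codes gives the bin number of each value; count the members of each bin.
width_counts = np.bincount(by_width.codes, minlength=4)
quantile_counts = np.bincount(by_quantile.codes, minlength=4)

print(by_width.categories)
print(width_counts)
print(by_quantile.categories)
print(quantile_counts)
```

Printing the two count arrays side by side makes the trade-off visible: equal-width bins preserve the scale of the data, equal-count bins preserve balance between groups.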
To apply your own or another library's functions to pandas objects, there are three methods. Which one is appropriate depends on whether the function expects to operate on an entire Series or DataFrame, row-wise or column-wise, or element-wise:
pipe(): if you want to apply functions as part of a method chain, use pipe().
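A minimal sketch of what chaining with pipe() buys, assuming two hypothetical helper functions (`add_constant` and `scale` are invented for this illustration and are not part of pandas):

```python
import pandas as pd

# Hypothetical helpers, named only for this illustration.
def add_constant(frame, amount=0):
    """Return a new DataFrame with `amount` added to every value."""
    return frame + amount

def scale(frame, factor=1):
    """Return a new DataFrame with every value multiplied by `factor`."""
    return frame * factor

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Nested calls have to be read from the inside out ...
nested = scale(add_constant(df, amount=1), factor=2)

# ... whereas pipe() lists the steps from left to right, in execution order.
chained = df.pipe(add_constant, amount=1).pipe(scale, factor=2)

print(chained)
```

Extra positional and keyword arguments given to pipe() are forwarded to the function after the DataFrame itself.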
df.pipe(f, arg2=2, arg3=3)

apply(): arbitrary functions, such as descriptive statistics, can be applied along the axes of a DataFrame or Panel with the apply() method, which takes an optional axis argument:
df.apply(np.mean, axis=1)

2.6 Reindexing and altering labels
reindex() is the fundamental data alignment method in pandas. It is used to implement nearly every other feature that relies on label alignment. To reindex means to conform the data to a given set of labels along a particular axis. This accomplishes several things:
Reorders the existing data to match a new set of labels.
Inserts missing-value (NA) markers at label positions where no data for that label existed.
If specified, fills data for missing labels using logic (highly relevant when working with time series data).

In [3]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [4]: s
a    0.763208
b   -0.328499
c   -0.233119
d    1.874552
e   -0.937831
dtype: float64
In [5]: s.reindex(['e', 'b', 'f', 'd'])
e   -0.937831
b   -0.328499
f         NaN
d    1.874552
dtype: float64

2.6.1 Reindexing to align with another object
You may want to take an object and reindex its axes so that they carry the same labels as another object. Use reindex_like() for this:
In [188]: df.reindex_like(df2)

2.6.2 Aligning objects with each other with align
The align() method is the fastest way to align two objects simultaneously. It supports a join argument:
join='outer': take the union of the indexes (the default)
join='left': use the index of the calling object
join='right': use the index of the passed object
join='inner': take the intersection of the indexes

For DataFrames, the join method is applied to both the index and the columns by default. Pass axis=0 or axis=1 to align on one axis only.

In [196]: df.align(df2, join='inner', axis=0)
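A small sketch of how the join argument changes the result, assuming two made-up DataFrames with partly overlapping row labels (the names `left` and `right` are illustrative; join='outer', the pandas default, is included for comparison):

```python
import pandas as pd

# Illustrative frames: labels "b" and "c" are shared, "a" and "d" are not.
left = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"v": [20, 30, 40]}, index=["b", "c", "d"])

# align() returns a tuple holding both objects, reindexed to a common index.
inner_l, inner_r = left.align(right, join="inner", axis=0)   # shared labels only
left_l, left_r = left.align(right, join="left", axis=0)      # labels of `left`
outer_l, outer_r = left.align(right, join="outer", axis=0)   # union of labels

print(list(inner_l.index))
print(left_r)
print(list(outer_l.index))
```

Labels that one object lacks are filled with NaN in its aligned copy, as `left_r` shows for label "a".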
In [212]: df.drop(['one'], axis=1)

2.6.4 Renaming / mapping labels
rename() lets you relabel an axis based on a mapping (a dict or Series) or an arbitrary function:
In [215]: s.rename(str.upper)
In [216]: df.rename(columns={'one': 'foo', 'two': 'bar'}, index={'a': 'apple', 'b': 'banana', 'd': 'durian'})

2.7 Iteration
The behavior of basic iteration over pandas objects depends on the type. Basic iteration (for i in object) produces:
Series: values
DataFrame: column labels
Panel: item labels

pandas objects also have the dict-like items() method (iteritems() in older versions) for iterating over (key, value) pairs. To iterate over the rows of a DataFrame, use one of the following methods:

iterrows(): iterates over the rows as (index, Series) pairs. Each row is converted to a Series, which can change the dtypes and has some performance implications.
itertuples(): iterates over the rows as namedtuples of the values.

In [220]: df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
In [221]: for index, row in df.iterrows():
   .....:     row['a'] = 10

Assigning to row in this loop does not change df, because the row is a copy.
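The two row iterators can be compared in a short sketch, reusing the same small DataFrame (the names `pairs` and `changed` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# iterrows(): each row arrives as a Series built from the row's values.
for index, row in df.iterrows():
    row["a"] = 10   # changes only the temporary Series, not df

# itertuples(): each row arrives as a namedtuple; the first field is the index.
pairs = [(t.Index, t.a, t.b) for t in df.itertuples()]

# To really change the values, assign through the DataFrame itself.
changed = df.copy()
changed["a"] = 10

print(df)
print(pairs)
print(changed)
```

itertuples() is usually the faster of the two and keeps each value in its column's type, since no per-row Series has to be built.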
The main methods for sorting by axis labels (the index) are Series.sort_index() and DataFrame.sort_index(); for sorting by values they are Series.sort_values() and DataFrame.sort_values(). DataFrame.sort_values() accepts the arguments by and axis=0, and uses any vector or column name of the DataFrame to determine the sort order:

In [3]: df1 = pd.DataFrame({'one': [2, 1, 1, 1], 'two': [1, 3, 2, 4], 'three': [5, 4, 3, 2]})
In [4]: df1.sort_values(by='two')
   one  three  two
0    2      5    1
2    1      3    2
1    1      4    3
3    1      2    4

2.9 Dtypes
The main types stored in pandas objects include float, int, bool, datetime64[ns] and datetime64[ns, tz], timedelta64[ns], category, and object. In addition, these dtypes have item sizes, for example int64 and int32.
The dtypes attribute of a DataFrame returns a Series containing the data type of each column.
In [17]: df.dtypes
Out[17]:
a      int64
b     object
c    float64
dtype: object
In [18]: df['a'].dtype
Out[18]: dtype('int64')
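To make the point about item sizes concrete, here is a small sketch assuming a made-up three-column DataFrame; astype() is used only to produce a second integer width for comparison, and all names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.array([1, 2, 3], dtype="int64"),
    "b": ["x", "y", "z"],
    "c": [1.5, 2.5, 3.5],
})

column_types = df.dtypes              # Series mapping column name to dtype
a_bytes = df["a"].dtype.itemsize      # bytes per element for int64

small = df["a"].astype("int32")       # same values, narrower integer type
small_bytes = small.dtype.itemsize    # bytes per element for int32

print(column_types)
print(a_bytes, small_bytes)
```

The values are unchanged by the conversion as long as they fit in the narrower type; only the storage per element differs.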