pandas-tutorial
August 1, 2021
[2]: import pandas as pd
1 PANDAS
• Pandas is a Python module used for data manipulation and wrangling.
• This tutorial explores 13 built-in pandas functionalities.
• Pandas has many more functionalities than are covered here. This notebook is only
an introduction.
.read_xxx()
• Data comes in various formats such as .txt, .csv, .xlsx etc. The pd.read_xxx() family of
functions reads the original file into a pandas dataframe.
• Different file formats are read with different functions, e.g. pd.read_csv(filename) for csv
files and pd.read_json(filename) for json files.
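As a minimal sketch of the readers mentioned above, the cell below parses a small CSV string and a small JSON string. The data and the names csv_text, json_text, csv_df and json_df are made up for illustration, and io.StringIO stands in for a real file on disk so the example runs anywhere.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a .csv file on disk
csv_text = "name,score\nann,90\nbob,75\n"
csv_df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical JSON records standing in for a .json file on disk
json_text = '[{"name": "ann", "score": 90}, {"name": "bob", "score": 75}]'
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df)
print(json_df)
```

With a real file, the file path would be passed in place of the io.StringIO object.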
pd.DataFrame()
• Sometimes you may have to create your own dataframe.
• Pandas provides you with a way of doing this.
• The pd.DataFrame constructor allows you to create your own dataframe as shown below:
[10]: #instantiating a pandas dataframe object
df = pd.DataFrame()
#creating a column of numeric attributes
df['num1'] = [10, 23, 45, 90, 46, 34, 10]
#creating a column with fractional variables
df['frac'] = [0.234, 0.123, 0.4353, 21.34, 45.00, 30.20, 90.045]
#creating categorical columns
df['cat1'] = ['a', 'b', 'c', 'd', 'a', 'c', 'b']
df['cat2'] = ['x', 'y', 'x', 'x', 'y', 'y', 'x']
#creating a dummy datetime column
df['date_col'] = ['1990-08-3', '2000-07-21', '1998-06-17', '2021-06-30',
                  '1776-07-4', '2001-09-11', '2010-02-17']
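The same dataframe can also be built in one step by passing a dict that maps column names to lists. This is only an alternative sketch, not the approach the notebook itself takes. The names data and df_from_dict are introduced here for illustration.

```python
import pandas as pd

# Same values as the cell above, given as one dict of column name -> list
data = {
    "num1": [10, 23, 45, 90, 46, 34, 10],
    "frac": [0.234, 0.123, 0.4353, 21.34, 45.00, 30.20, 90.045],
    "cat1": ["a", "b", "c", "d", "a", "c", "b"],
    "cat2": ["x", "y", "x", "x", "y", "y", "x"],
    "date_col": ["1990-08-3", "2000-07-21", "1998-06-17", "2021-06-30",
                 "1776-07-4", "2001-09-11", "2010-02-17"],
}
df_from_dict = pd.DataFrame(data)

# Number of (rows, columns) in the resulting frame
print(df_from_dict.shape)
```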
Seeing the results of your dataframe
• You can then print out basic aspects of your dataframe using different pandas functions.
• These include .columns to list the column names, .head(n) to show the first n rows, .tail(n) to
show the last n rows, and many more.
• A few are explored here.
[11]: #displaying the first 3 rows
df.head(3)
[11]:    num1    frac cat1 cat2    date_col
      0    10  0.2340    a    x   1990-08-3
      1    23  0.1230    b    y  2000-07-21
      2    45  0.4353    c    x  1998-06-17
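The other inspection tools mentioned earlier, .columns and .tail(n), work the same way as .head(n). The sketch below uses a small stand-in frame named demo (an illustrative name) holding two of the notebook's columns.

```python
import pandas as pd

# Stand-in frame with two of the columns used in this notebook
demo = pd.DataFrame({
    "num1": [10, 23, 45, 90, 46, 34, 10],
    "cat1": ["a", "b", "c", "d", "a", "c", "b"],
})

# Column names as a plain Python list
print(list(demo.columns))

# First two rows and last two rows
print(demo.head(2))
print(demo.tail(2))
```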
[12]: #A statistical summary of your numeric columns can be obtained as:
df.describe()
[12]:             num1       frac
      count   7.000000   7.000000
      mean   36.857143  26.768186
      std    27.739434  32.876148
      min    10.000000   0.123000
      25%    16.500000   0.334650
      50%    34.000000  21.340000
      75%    45.500000  37.600000
      max    90.000000  90.045000
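By default .describe() only summarises the numeric columns, as the output above shows. As a hedged sketch, passing include="all" also reports count, unique, top and freq for the non-numeric columns. The names demo and summary are illustrative.

```python
import pandas as pd

# Stand-in frame with one numeric and two categorical columns from this notebook
demo = pd.DataFrame({
    "num1": [10, 23, 45, 90, 46, 34, 10],
    "cat1": ["a", "b", "c", "d", "a", "c", "b"],
    "cat2": ["x", "y", "x", "x", "y", "y", "x"],
})

# include="all" summarises every column; statistics that do not
# apply to a column are filled with NaN
summary = demo.describe(include="all")
print(summary)
```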
[14]: #to determine the memory consumption of each column in bytes; this is important
#in evaluating the suitability of your computing resources
df.memory_usage(deep=True)
[14]: Index 128
num1 56
frac 56
cat1 406
cat2 406
date_col 467
dtype: int64
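The string columns dominate the memory figures above. One common way to reduce this, sketched here under the assumption that a column holds only a few distinct labels, is to store it as a category. The longer series labels is hypothetical and only serves to make the difference visible.

```python
import pandas as pd

# Hypothetical larger column: the four cat1 labels repeated 2500 times
labels = pd.Series(["a", "b", "c", "d"] * 2500)

# Categories store each distinct label once plus a small integer code per row
as_category = labels.astype("category")

plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(plain_bytes, category_bytes)
```

The exact byte counts depend on the pandas version and platform, so none are quoted here.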
[15]: #To know the datatypes of our columns:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 num1 7 non-null int64
1 frac 7 non-null float64
2 cat1 7 non-null object
3 cat2 7 non-null object
4 date_col 7 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 408.0+ bytes
Pro tip: Converting a column from one data type to another can be achieved with the .astype() method.
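A minimal sketch of .astype on a small stand-in frame follows. The names demo and converted are illustrative, and the target types float64 and category are chosen only as examples.

```python
import pandas as pd

# Stand-in frame with an integer column and a string column
demo = pd.DataFrame({"num1": [10, 23, 45], "cat1": ["a", "b", "c"]})

# astype accepts a dict of column name -> target dtype and returns a new frame
converted = demo.astype({"num1": "float64", "cat1": "category"})

print(demo.dtypes)
print(converted.dtypes)
```

The original frame is left unchanged, so the result has to be assigned to a name to be kept.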