Exploratory Data Analysis: The Key to Extracting Meaningful Insights from Data
Exploratory Data Analysis (EDA):
Let us start by understanding the importance of EDA in any data science project. There are a few
crucial points that everyone should take care of while performing EDA tasks on any dataset.
Data Set Description:
You can find sample data sets on online data platforms.
Before starting this topic, you should be familiar with basic statistics concepts.
Steps
Import necessary libraries – pandas, numpy, seaborn, and matplotlib
Read in the data set
Get acquainted with the data using functions like shape, columns, and dtypes
Check the statistical description of the data using the describe function
Code
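import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Read the data set and take a first look at it
data = pd.read_csv('path/to/diabetes.csv')
print(data.head())      # first five rows
print(data.shape)       # (number of rows, number of columns)
print(data.columns)     # column names
print(data.dtypes)      # data type of each column
print(data.describe())  # summary statistics for numeric columns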
One way to check for missing values in data is by using the isnull() function, which returns a boolean
mask with True for every missing entry and False otherwise. To find the number of missing values in
each column, call sum() on the result of isnull(). Another important aspect to consider is the presence
of zeros, which may not make sense for certain features in the data and are not reported by isnull().
In such cases, you can use imputation methods such as the mean or the median to replace these zeros.
The choice between mean and median depends on the presence of outliers in the data, because the mean
is pulled towards extreme values while the median is not.
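The checks described above can be sketched as follows. This is a minimal illustration on a small made-up table; the column names Glucose and Insulin and all values are assumptions for demonstration only, not taken from the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real data set (assumed columns).
data = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "Insulin": [0, 0, 94, np.nan, 168],
})

# isnull() gives a True/False mask; sum() counts the True values per column.
missing_per_column = data.isnull().sum()
print(missing_per_column)

# Zeros are not flagged by isnull(), so they are counted separately.
zeros_per_column = (data == 0).sum()
print(zeros_per_column)
```

Counting zeros separately matters because a zero is a valid number to pandas, even when it is not a plausible measurement.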
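To visualize the presence of outliers in the data, you can use a box plot. The lower edge of the box
represents q1 (the 25th percentile), the middle line represents q2 (the median value), and the upper edge
of the box represents q3 (the 75th percentile). By default, the whiskers extend to the most extreme
points that lie within 1.5 times the IQR of the box, and points beyond the whiskers are drawn
individually as outliers. The presence of outliers can impact the choice of imputation method.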
To create a box plot in Python, you can use the sns.boxplot() function from the seaborn library. You can
also save the plot as an image using the plt.savefig() function.
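A minimal sketch of drawing and saving a box plot is shown here. The Insulin column, its values, and the output file name are made up for illustration, and the Agg backend is selected only so the script runs without a display.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical values for a single numeric column.
data = pd.DataFrame({"Insulin": [0, 0, 94, 168, 88, 543, 846, 175, 230, 83]})

# Draw a horizontal box plot of the column.
ax = sns.boxplot(x=data["Insulin"])
ax.set_title("Insulin")

# Save the figure to an image file (hypothetical file name), then close it.
output_path = Path("insulin_boxplot.png")
plt.savefig(output_path)
plt.close()
```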
Box Plot Analysis
To calculate the upper and lower whisker limits, we can use the interquartile range (IQR), which is
q3 - q1. The lower limit is q1 - 1.5 × IQR and the upper limit is q3 + 1.5 × IQR; values outside these
limits are treated as outliers. Outliers in the data can be removed using quantiles.
Once the outliers are removed, the distribution of each column can be analyzed to determine whether
it is symmetric or nonsymmetric. This helps in deciding whether to impute the missing values
using the mean or the median.
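The IQR calculation can be sketched like this, assuming the conventional multiplier of 1.5. The values below are invented for illustration.

```python
import pandas as pd

# Hypothetical values for a single numeric column.
insulin = pd.Series([0, 0, 94, 168, 88, 543, 846, 175, 230, 83], name="Insulin")

# First and third quartiles, and the interquartile range between them.
q1 = insulin.quantile(0.25)
q3 = insulin.quantile(0.75)
iqr = q3 - q1

# Conventional whisker limits: 1.5 * IQR beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits are flagged as outliers.
outliers = insulin[(insulin < lower_limit) | (insulin > upper_limit)]
print(lower_limit, upper_limit)
print(outliers)
```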
Removing Outliers
The quantile function can be used to remove the records that fall beyond a chosen percentile threshold.
For example, the records above the 99th percentile can be removed by keeping only the rows whose value
is at or below the result of quantile(0.99). This gives a new dataset with fewer outliers.
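As a rough sketch of this filtering step, the example below keeps only rows at or below the 99th percentile of one column. The column name and values are hypothetical, and the 0.99 threshold is just the one used in the example above.

```python
import pandas as pd

# Hypothetical column: 99 ordinary values plus one extreme value.
data = pd.DataFrame({"Insulin": list(range(1, 100)) + [5000]})

# Value below which 99% of the observations fall.
threshold = data["Insulin"].quantile(0.99)

# Keep the rows at or below the threshold; the extreme row is dropped.
filtered = data[data["Insulin"] <= threshold]
print(len(data), len(filtered))
```

Note that a percentile cut always discards the top share of rows, whether or not they are genuine outliers, so the threshold deserves a look at the data first.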
Feature Selection
Feature selection is a useful technique to select only the most crucial features for the analysis. A
heatmap of the correlation coefficients can be used to see how strongly the features are correlated
with each other. When two features are highly correlated, they carry largely the same information, so
one of them can be dropped and the other kept for analysis.
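One possible way to do this is sketched below. The column names, the values, the output file name, and the 0.9 cutoff are all assumptions for illustration; a suitable cutoff depends on the data and the model.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical features; GlucoseCopy is deliberately almost identical to Glucose.
data = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "GlucoseCopy": [150, 86, 180, 91, 139, 115],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
})

# Pairwise Pearson correlation coefficients, drawn as an annotated heatmap.
corr = data.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
heatmap_path = Path("correlation_heatmap.png")  # hypothetical file name
plt.savefig(heatmap_path)
plt.close()

# For each highly correlated pair, keep the first feature and drop the second.
cutoff = 0.9  # assumed cutoff, not a fixed rule
to_drop = set()
columns = list(corr.columns)
for i, first in enumerate(columns):
    if first in to_drop:
        continue
    for second in columns[i + 1:]:
        if abs(corr.loc[first, second]) > cutoff:
            to_drop.add(second)

reduced = data.drop(columns=sorted(to_drop))
print(list(reduced.columns))
```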
Distribution Analysis
The distribution of the columns can be analyzed using the sns.distplot function (deprecated since
seaborn 0.11 in favor of sns.histplot and sns.displot). If the distribution is roughly symmetric,
imputation can be done using the mean. If it is nonsymmetric (skewed), imputation should be done using
the median or another imputation algorithm.
Classification Task: Checking Data Balance
An important step in a classification task is to check whether the data is balanced. If the data is
biased towards one class, a model trained on it tends to favor that class, and the results will be
affected. To check if the data is balanced, use the value_counts function on the outcome column. If
the data is imbalanced, apply techniques to resolve the problem, such as undersampling or oversampling.
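The balance check, followed by a simple random undersampling of the larger class, might look like this. The column name Outcome and the labels are made up, and random undersampling is only one of several possible remedies.

```python
import pandas as pd

# Hypothetical binary labels: seven rows of class 0 and three of class 1.
data = pd.DataFrame({"Outcome": [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]})

# Absolute and relative class frequencies.
counts = data["Outcome"].value_counts()
proportions = data["Outcome"].value_counts(normalize=True)
print(counts)
print(proportions)

# Random undersampling: draw as many rows from each class as the smallest class has.
minority_size = int(counts.min())
balanced = data.groupby("Outcome").sample(n=minority_size, random_state=0)
print(balanced["Outcome"].value_counts())
```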
Imputation and Replacement
If a column has missing or implausible values, impute or replace them with appropriate values. For
example, if insulin has a nonsymmetric distribution, replace the 0 values with the median using the
replace function. Check whether the replacement is successful by using the value_counts function on
the column and confirming that 0 no longer appears.
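A small sketch of the replacement step follows. The column name and values are hypothetical. It also makes one assumption beyond the text above: the median is computed from the non-zero values only, so that the placeholder zeros do not pull the median down.

```python
import pandas as pd

# Hypothetical column in which 0 stands for "not recorded".
data = pd.DataFrame({"Insulin": [0, 94, 0, 168, 88, 543, 0, 175]})

# Median of the non-zero values only (an assumption, see the note above).
nonzero_median = data.loc[data["Insulin"] != 0, "Insulin"].median()

# Replace every 0 with that median.
data["Insulin"] = data["Insulin"].replace(0, nonzero_median)

# Verify: 0 should no longer appear among the counted values.
print(data["Insulin"].value_counts())
zeros_left = int((data["Insulin"] == 0).sum())
```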
Summary of EDA Tasks
1. Read the data