Class notes

Exploratory Data Analytics

Uploaded on

09-06-2023

Written in

2022/2023

This document covers how to extract meaningful insights from data in data science and what a reader should know about the topic.

Exploratory Data Analysis: The Key to Extracting Meaningful Insights from Data

Exploratory Data Analysis (EDA):

Let us begin with the importance of EDA in any data science project. There are a few
crucial points that everyone should take care of when performing EDA on any data set.

Data Set Description:

You can find sample data sets to practise on through online platforms.

Before starting this topic, you should be familiar with basic statistics concepts.

Steps

Import necessary libraries – pandas, numpy, seaborn, and matplotlib

Read in the data set

Get acquainted with the data using functions like shape, columns, and dtypes

Check the statistical description of the data using the describe function

Code

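import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Read the data set (replace the path with the location of your file)
data = pd.read_csv('path/to/diabetes.csv')

print(data.head())      # first five rows
print(data.shape)       # (number of rows, number of columns)
print(data.columns)     # column names
print(data.dtypes)      # data type of each column
print(data.describe())  # statistical description of the numeric columns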

One way to check for missing values is the isnull() function, which returns a table of boolean values:
True where a value is missing and False otherwise. To find the number of missing values in each column,
call sum() after isnull(). Another important aspect to consider is the presence of zeros, which may
not make sense for certain features in the data. In such cases, you can treat the zeros as missing and
replace them using an imputation method such as the mean or the median. The choice between mean and
median depends on the presence of outliers: the mean is pulled towards extreme values, while the median
is not.
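The checks described above can be sketched as follows. The small table is made-up illustration data, and the column names Glucose and Insulin are assumptions rather than a statement of what the real data set contains.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real data set
data = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "Insulin": [0, 0, 94, np.nan, 168],
})

# isnull() gives a True/False table; sum() counts the True values per column
missing_per_column = data.isnull().sum()
print(missing_per_column)

# Comparing with 0 gives another True/False table; sum() counts the zeros
zeros_per_column = (data == 0).sum()
print(zeros_per_column)
```

Columns where the zero count is high but a zero is not a plausible measurement are the candidates for imputation.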

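To visualize the presence of outliers in the data, you can use a box plot. The lower edge of the box
represents q1, the line inside the box represents q2 (the median value), and the upper edge of the box
represents q3. The whiskers extend from the box to the most extreme values that are not treated as
outliers, and points beyond the whiskers are drawn individually as outliers. The presence of outliers
can impact the choice of imputation method.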

To create a box plot in Python, you can use the sns.boxplot() function from the seaborn library. You can
also save the plot as an image using the plt.savefig() function.
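A minimal sketch of drawing and saving a box plot is shown here. The data values, the column name Insulin, and the output file name are assumptions chosen for illustration; the Agg backend is selected only so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical values with one obvious extreme point (600)
data = pd.DataFrame({"Insulin": [15, 40, 55, 60, 72, 80, 95, 110, 130, 600]})

fig, ax = plt.subplots(figsize=(6, 3))
sns.boxplot(x=data["Insulin"], ax=ax)  # horizontal box plot of one column
ax.set_title("Insulin")

# Save the current figure as an image (hypothetical file name)
plt.savefig("insulin_boxplot.png", dpi=100, bbox_inches="tight")
plt.close(fig)
```

In a notebook you would normally leave out the backend line and call plt.show() instead of closing the figure.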

Box Plot Analysis

To calculate the upper and lower whisker limits, we can use the interquartile range (IQR = q3 - q1).
By the usual convention, the lower limit is q1 - 1.5 x IQR and the upper limit is q3 + 1.5 x IQR, and
values outside these limits are treated as outliers. Outliers in the data can be removed using the
concept of quantile. Once the outliers are removed, the distribution of each column can be analyzed to
determine whether it is symmetric or nonsymmetric. This will help in deciding whether to impute the
missing values using mean or median.
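As a rough sketch, the whisker limits can be computed from the IQR like this. The values are made up, and the factor 1.5 is the common convention (it is also the default in seaborn's box plot), not something fixed by the notes.

```python
import pandas as pd

# Hypothetical column values with one extreme point
values = pd.Series([15, 40, 55, 60, 72, 80, 95, 110, 130, 600])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional whisker limits: 1.5 x IQR beyond the box edges
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits are flagged as outliers
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(lower_limit, upper_limit)
print(outliers.tolist())
```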

Removing Outliers

The quantile function can be used to remove the records that fall beyond a certain percentile threshold.
For example, you can compute the 99th percentile of a column with the quantile function and keep only the
records at or below it. This removes the records above the threshold and gives a new data set with
fewer outliers.
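One way this could look in pandas is sketched below. The data is synthetic, and the column name Insulin and the 0.99 cut-off are assumptions used only to illustrate the idea.

```python
import pandas as pd

# Hypothetical column: 99 ordinary values plus one extreme value
data = pd.DataFrame({"Insulin": list(range(1, 100)) + [900]})

# Value below which 99% of the records fall
threshold = data["Insulin"].quantile(0.99)

# Keep only the records at or below the threshold
trimmed = data[data["Insulin"] <= threshold]

print(len(data), len(trimmed))
```

The same filter can be repeated per column, but each pass discards rows, so it is worth checking how many records remain afterwards.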

Feature Selection

Feature selection is a useful technique for keeping only the most important features for the analysis. A
heatmap of the correlation coefficients can be used to see how strongly the features are correlated with
each other. When two features are highly correlated, one of them can be removed and only the other kept
for analysis.
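The idea can be sketched with the pandas correlation matrix. The columns, the values, and the 0.9 threshold are all assumptions for illustration; AgeMonths is deliberately constructed as a copy of Age in different units so that the pair is perfectly correlated.

```python
import numpy as np
import pandas as pd

# Hypothetical data: AgeMonths is exactly 12 x Age, so it adds no information
data = pd.DataFrame({
    "Age": [21, 35, 42, 50, 63],
    "AgeMonths": [252, 420, 504, 600, 756],
    "Glucose": [110, 85, 140, 95, 120],
})

# Absolute pairwise correlation between all numeric columns
corr = data.corr().abs()

# Keep only the upper triangle so each pair is looked at once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced = data.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```

To view the matrix as a heatmap, corr can be passed to sns.heatmap, for example with annot=True to print the coefficients in the cells.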

Distribution Analysis

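The distribution of the columns can be analyzed using the sns.distplot function. Note that distplot has
been deprecated since seaborn 0.11; sns.histplot (optionally with kde=True) or sns.displot are the
replacements in newer versions. If the distribution is symmetric, imputation can be done using mean. If
it is nonsymmetric, imputation should be done using median or any other algorithm.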
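Besides looking at the plot, the symmetry of a column can be checked numerically with its skewness. The helper below is a hypothetical illustration, and the 0.5 limit is only a rule-of-thumb assumption, not a value taken from the notes.

```python
import pandas as pd

def choose_imputation_stat(column, skew_limit=0.5):
    """Suggest 'mean' for a roughly symmetric column, 'median' otherwise."""
    skewness = column.skew()  # 0 for a perfectly symmetric sample
    return "mean" if abs(skewness) <= skew_limit else "median"

# Hypothetical columns: one symmetric, one with a long right tail
symmetric = pd.Series([10, 20, 30, 40, 50])
skewed = pd.Series([10, 12, 13, 15, 16, 18, 300])

print(choose_imputation_stat(symmetric))
print(choose_imputation_stat(skewed))
```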

Classification Task: Checking Data Balance

One important step in a classification task is to check whether the data is balanced or not. If the data
is biased towards one class, the results will be affected, because a model can score well simply by
predicting the majority class. To check if the data is balanced, use the value_counts function on the
outcome column. If the data is imbalanced, apply techniques such as undersampling or oversampling to
resolve the problem.
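A small sketch of the balance check, followed by one simple form of undersampling, is given here. The data is made up, the column name Outcome is an assumption, and random undersampling with groupby(...).sample is just one of several possible techniques.

```python
import pandas as pd

# Hypothetical data: 7 records of class 0 and 3 records of class 1
data = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "Outcome": [1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
})

# Number of records per class, and the same as proportions
counts = data["Outcome"].value_counts()
print(counts)
print(data["Outcome"].value_counts(normalize=True))

# Random undersampling: draw as many rows from each class as the smallest class has
minority_size = int(counts.min())
balanced = data.groupby("Outcome").sample(n=minority_size, random_state=0)
print(balanced["Outcome"].value_counts())
```

Undersampling throws away data, so on small data sets oversampling the minority class may be the better choice.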

Imputation and Replacement

If a column has missing values, or zeros that stand in for missing values, impute or replace them with
appropriate values. For example, if insulin has a non-symmetric distribution, replace the 0 values with
the median using the replace function. Check whether the replacement was successful by using the
value_counts function on the column: the value 0 should no longer appear.
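The replacement step might look like the sketch below. The values and the column name Insulin are illustrative, and computing the median from the non-zero values only is an assumption on my part (otherwise the zeros themselves would pull the median down); the notes do not say which median is meant.

```python
import pandas as pd

# Hypothetical column where 0 stands in for a missing measurement
data = pd.DataFrame({"Insulin": [0, 94, 0, 168, 88, 0, 543]})

# Assumption: take the median of the non-zero values only
nonzero_median = data.loc[data["Insulin"] != 0, "Insulin"].median()

# Replace every 0 with that median
data["Insulin"] = data["Insulin"].replace(0, nonzero_median)

# Verify: 0 should no longer appear among the counted values
print(data["Insulin"].value_counts())
```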

Summary of EDA Tasks

1. Read the data

Written for

Course: Data Analytics

Document information

Uploaded on: June 9, 2023
Number of pages: 6
Written in: 2022/2023
Type: Class notes
Professor(s): Priya & vidhya
Contains: Extraction of meaningful insights

Subjects

data analytics
data science
extraction of meaningful insights

Get to know the seller

jowinjoe PSG College of technology
