Lecture notes

DATA QUALITY AND TRANSFORMATION

Pages: 61
Uploaded on: 08-11-2025
Written in: 2025/2026

Short notes on ensuring data accuracy and consistency, and transforming data into suitable formats for analysis and modeling.


SCSB1231 DATA AND INFORMATION SCIENCE




UNIT 3 DATA QUALITY AND TRANSFORMATION



Data Imputation; Data Transformation (min-max, log transform, z-score transform, etc.); Binning, Classing and Standardization; Outlier/Noise and Anomalies.

Data Imputation:

Data imputation is the process of replacing missing or incomplete data points in a
dataset with estimated or substituted values. These estimated values are typically derived
from the available data, statistical methods, or machine learning algorithms.

Imputation fills missing values so that the dataset stays complete and usable. It prevents
the loss of records that would otherwise be dropped, which maintains sample size and keeps
analyses, models, and visualizations workable. When the method suits the reason the data is
missing, imputation can also reduce bias and preserve relationships between variables, so a
wider range of statistical techniques can be applied to incomplete data.




Importance of Data Imputation in Analysis


Data imputation matters in data analysis because many statistical methods and machine
learning algorithms require complete inputs. Without imputation, records with missing fields
are typically discarded, which shrinks the sample and can make results biased or less
reliable. Filling the gaps with well-chosen estimates maintains sample size and supports the
quality and reliability of data-driven insights.
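
To make the point about lost records concrete, here is a minimal sketch that compares dropping incomplete rows with filling them in. The records, the column meanings (age, income), and the helper name `impute_column_means` are hypothetical, chosen only for illustration.

```python
from statistics import mean

# Hypothetical records: (age, income); None marks a missing value.
rows = [
    (25, 3200),
    (31, None),
    (None, 4100),
    (47, 5200),
    (52, None),
    (38, 3900),
]

# Listwise deletion: drop every row that has at least one missing field.
complete_rows = [r for r in rows if None not in r]


def impute_column_means(rows):
    """Fill each column's gaps with the mean of that column's observed values."""
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [
        tuple(m if v is None else v for v, m in zip(row, means))
        for row in rows
    ]


imputed_rows = impute_column_means(rows)

# Deletion keeps only the fully observed rows; imputation keeps all of them.
print(len(rows), len(complete_rows), len(imputed_rows))
```

In this toy data, deletion discards half of the rows, while imputation keeps every row at the cost of introducing estimated values.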

Types of Missing Data
Missing data is commonly classified into three types:

1. Missing Completely at Random (MCAR)
In this type, the probability that a value is missing is unrelated to both observed and
unobserved data. In other words, the missingness is purely random and occurs by chance; it
is not systematically related to any variable in the dataset. For example, a sensor failure
that results in sporadic missing temperature readings can be considered MCAR.

2. Missing at Random (MAR)

Missing data is considered MAR when the probability that a value is missing is related to
observed data but not to the unobserved values themselves. In other words, the missingness
depends on some observed variables. For instance, in a medical study, men might be less
likely than women to report certain health conditions, so the missingness is related to the
observed gender variable. MAR is a more general and more common assumption than MCAR.

3. Missing Not at Random (MNAR)
MNAR occurs when the probability of data being missing is related to unobserved data or the
missing values themselves. This type of missing data can introduce bias into analyses because
the missingness is related to the missing values. An example of MNAR could be patients with
severe symptoms avoiding follow-up appointments, resulting in missing data related to the
severity of their condition.
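
The three mechanisms can be sketched as small simulations. This is an illustrative toy, not part of the original notes: the patient data, the missingness probabilities, the severity threshold, and all function names are assumptions.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical patients: (sex, severity score from 0 to 10).
patients = [(random.choice("MF"), random.randint(0, 10)) for _ in range(1000)]


def mcar_missing(sex, severity):
    # MCAR: every value has the same 20% chance of being missing.
    return random.random() < 0.20


def mar_missing(sex, severity):
    # MAR: missingness depends only on the observed variable `sex`.
    return random.random() < (0.40 if sex == "M" else 0.10)


def mnar_missing(sex, severity):
    # MNAR: missingness depends on the severity value itself,
    # which is exactly the value that goes unrecorded.
    return severity >= 8


def observed_mean(mechanism):
    """Mean severity among the values that remain observed."""
    kept = [sev for sex, sev in patients if not mechanism(sex, sev)]
    return sum(kept) / len(kept)


true_mean = sum(sev for _, sev in patients) / len(patients)

print("true mean:", true_mean)
print("MCAR observed mean:", observed_mean(mcar_missing))
print("MAR observed mean:", observed_mean(mar_missing))
print("MNAR observed mean:", observed_mean(mnar_missing))
```

Under MNAR the observed mean falls below the true mean, because the highest severity scores are the ones that go missing. In this toy, severity is generated independently of sex, so the MAR mechanism changes who is missing but does not systematically shift the mean.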



Data Imputation Techniques:


There are several techniques for data imputation, each with its own strengths; which one is
suitable depends on the nature of the data and the goals of the analysis. Some commonly
used data imputation techniques are:

1. Mean/Median/Mode Imputation

• Mean Imputation: Replace missing values in numerical variables with the average of
the observed values for that variable.
• Median Imputation: Replace missing values in numerical variables with the middle
value of the observed values for that variable.
• Mode Imputation: Replace missing values in categorical variables with the most
frequent category among the observed values for that variable.
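
The three rules above can be sketched with Python's standard `statistics` module. The helper `impute` and the sample lists are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Return a copy of `values` with each None replaced by one fill value.

    `strategy` is a function such as mean, median, or mode that is applied
    to the observed (non-missing) values only.
    """
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


ages = [22, None, 35, 41, None, 30]                    # numerical variable
colors = ["red", "blue", None, "red", "green", None]   # categorical variable

ages_mean = impute(ages, mean)        # gaps filled with the average
ages_median = impute(ages, median)    # gaps filled with the middle value
colors_mode = impute(colors, mode)    # gaps filled with the most frequent category

print(ages_mean)
print(ages_median)
print(colors_mode)
```

Passing the statistic in as a function keeps a single code path for all three variants; only the choice of `strategy` changes.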




Advantages

• Simplicity
• Preserves Data Structure
• Applicability

Disadvantages and Considerations

• Ignores Data Relationships
• May Distort Data
• Inappropriate for some missing-data patterns (for example, MAR or MNAR)
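
One way in which mean imputation may distort data is by shrinking the variance: every filled value sits exactly at the mean, so the spread looks smaller than it really is. The following sketch uses made-up numbers to show the effect.

```python
from statistics import mean, variance

# Hypothetical measurements with three missing entries.
with_gaps = [10, None, 14, None, 18, 20, None, 24]

observed = [v for v in with_gaps if v is not None]
fill = mean(observed)
imputed = [fill if v is None else v for v in with_gaps]

# The mean is unchanged, but the sample variance drops, because the
# filled values add no spread while the sample size grows.
print("mean:", mean(observed), "->", mean(imputed))
print("variance:", variance(observed), "->", variance(imputed))
```

The sum of squared deviations is the same before and after, but it is divided by a larger count, which is why the variance decreases.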



When to Use:

• Use mean imputation for numerical variables when the data is missing completely at
random (MCAR) and the variable is approximately normally distributed.
• Use median imputation when the data is skewed or contains outliers, as it is less
sensitive to extreme values.
• Use mode imputation for categorical variables when you have missing values that can
be reasonably replaced with the most frequent category.
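
The guidance on skewed data can be checked with a small example. The income figures below are invented, with one deliberately extreme value, to show how differently the mean and the median respond to an outlier.

```python
from statistics import mean, median

# Hypothetical incomes in thousands; 400 is an outlier, None is missing.
incomes = [30, 32, None, 35, 31, 400, None, 33]

observed = [v for v in incomes if v is not None]

mean_fill = mean(observed)      # pulled far upward by the outlier
median_fill = median(observed)  # stays close to the typical values

print("mean fill:", mean_fill)
print("median fill:", median_fill)
```

Here the mean is far above every typical income in the list, so filling gaps with it would insert implausible values; the median stays within the range of the typical observations.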

2. Forward Fill and Backward Fill

• Forward Fill: In forward fill imputation, missing values are replaced with the most
recent observed value in the sequence. It propagates the last known value forward
until a new observation is encountered.
• Backward Fill: In backward fill imputation, missing values are replaced with the
next observed value in the sequence. It propagates the next known value backward
until an earlier observation is encountered.




For forward fill, replace each missing value with the most recent observed value that
precedes it in time. For backward fill, replace each missing value with the next
observed value that follows it in time.
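
Both fills can be sketched in a few lines of plain Python. The function names and the sample readings are hypothetical, and `None` is assumed to mark a missing value in a time-ordered list.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value before it."""
    filled = []
    last = None  # no observation seen yet
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def backward_fill(values):
    """Replace each None with the next observed value after it."""
    # Backward fill is forward fill applied to the reversed sequence.
    return forward_fill(values[::-1])[::-1]


readings = [None, 21.5, None, None, 22.1, None]

print(forward_fill(readings))   # the leading gap has no earlier value to copy
print(backward_fill(readings))  # the trailing gap has no later value to copy
```

Note that forward fill cannot fill gaps at the start of the sequence and backward fill cannot fill gaps at the end; those positions stay `None` in this sketch.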

Type: Lecture notes
Lecturer(s): Abirami
Contains: All lectures

Seller: lsharan, Sathyabama Institute of Science and Technology