Class notes

DATA PREPROCESSING AND PREPARATION

Pages
39
Uploaded on
08-11-2025
Written in
2025/2026

Short notes on preparing raw data by cleaning, normalizing, and transforming it to ensure accurate and efficient analysis.

UNIT II
DATA PREPROCESSING AND PREPARATION
Data Munging, Wrangling - Data Visualisation Basics - Plyr packages - Cast/Melt. Tableau:
Creating Visualisations in Tableau - Data hierarchies, filters, groups, sets, calculated fields -
Map based visualisations - build interactive dashboards - Data Stories.


2.1 DATA MUNGING
What is data munging?
Data munging, also known as data wrangling, is the initial process of refining raw data into content or
formats better suited for consumption by downstream systems and users. The term ‘Mung’
was coined in the late 60s as a somewhat derogatory term for actions and transformations
which progressively degrade a dataset, and quickly became tied to the backronym “Mash
Until No Good” (or, recursively, “Mung Until No Good”).
But as the diversity, expertise, and specialization of data practitioners grew in the internet
age, ‘munging’ and ‘wrangling’ became more useful generic terms, used analogously to
‘coding’ for software engineers.
With the rise of cloud computing and storage, and more sophisticated analytics, these terms
evolved further, and today refer specifically to the initial collection, preparation, and
refinement of raw data.



The data munging process:


With the wide variety of verticals, use-cases, types of users, and systems utilizing
enterprise data today, the specifics of munging can take on myriad forms.
Data exploration: Munging usually begins with data exploration. Whether an analyst is
merely peeking at completely new data in initial data analysis (IDA), or a data scientist
begins the search for novel associations in existing records in exploratory data analysis
(EDA), munging always begins with some degree of data discovery.
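
As an illustration of this first look at the data, the sketch below profiles a handful of records: for each column it counts missing entries, distinct values, and the Python types found. The `profile` function and the sample `records` are hypothetical and exist only for this example; real exploration would usually run on a much larger data set.

```python
from collections import Counter

def profile(rows):
    """Summarise each column: present/missing counts, distinct values, value types."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            stats = columns.setdefault(
                name,
                {"present": 0, "missing": 0, "values": Counter(), "types": Counter()},
            )
            if value is None or value == "":
                stats["missing"] += 1
            else:
                stats["present"] += 1
                stats["values"][value] += 1
                stats["types"][type(value).__name__] += 1
    return columns

# Hypothetical records with typical problems: inconsistent case, a gap, mixed types.
records = [
    {"city": "Chennai", "temp_c": 31.5},
    {"city": "chennai", "temp_c": None},
    {"city": "Delhi", "temp_c": "29"},
]

summary = profile(records)
for name, stats in summary.items():
    print(name, "missing:", stats["missing"],
          "distinct:", len(stats["values"]), "types:", dict(stats["types"]))
```

Even this small profile surfaces issues that later steps must handle: "Chennai" and "chennai" count as different values, and the temperature column mixes numbers with text.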


Data transformation: Once a sense of the raw data’s contents and structure has been
established, it must be transformed to new formats appropriate for downstream processing.
This step is the core hands-on work of the data scientist, for example un-nesting hierarchical JSON data,
denormalizing disparate tables so relevant information can be accessed from one place, or
reshaping and aggregating time series data to the dimensions and spans of interest.
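
One of the transformations mentioned above, un-nesting hierarchical JSON, can be sketched as follows. This is a minimal example using only the standard library; the `flatten` helper, the dotted key naming, and the sample record are assumptions made for illustration. Lists inside the JSON are left untouched.

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Descend into the nested object, extending the key prefix.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

# Hypothetical nested record
raw = '{"id": 7, "customer": {"name": "Asha", "address": {"city": "Chennai"}}, "total": 120.5}'
record = flatten(json.loads(raw))
print(record)
```

The flattened record has one level of keys (`id`, `customer.name`, `customer.address.city`, `total`), so it can be written directly as a table row.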


Data enrichment: Optionally, once data is ready for consumption, data mungers might
choose to perform additional enrichment steps. This involves finding external sources of
information to expand the scope or content of existing records. For example, using an open-
source weather data set to add daily temperature to an ice-cream shop’s sales figures.
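
The ice-cream example can be sketched as a lookup that adds a temperature column to each sales record. All figures, dates, and names below (`sales`, `weather`, `enrich`, `temp_c`) are made up for illustration; a real enrichment step would load the weather data from an external source.

```python
# Hypothetical daily sales figures for the shop.
sales = [
    {"date": "2025-06-01", "units_sold": 140},
    {"date": "2025-06-02", "units_sold": 95},
    {"date": "2025-06-03", "units_sold": 180},
]

# Hypothetical external weather data: date -> daily temperature in Celsius.
weather = {"2025-06-01": 33.0, "2025-06-03": 36.5}

def enrich(sales_rows, temp_by_date):
    """Return copies of the sales rows with a temp_c column added."""
    enriched = []
    for row in sales_rows:
        new_row = dict(row)  # copy, so the original records stay unchanged
        new_row["temp_c"] = temp_by_date.get(row["date"])  # None if no reading exists
        enriched.append(new_row)
    return enriched

enriched_sales = enrich(sales, weather)
for row in enriched_sales:
    print(row)
```

Days with no weather reading keep their sales record and get `None` for the temperature, which makes the gap visible instead of silently dropping the row.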

Data validation: The final, perhaps most important, munging step is validation. At this
point, the data is ready to be used, but certain common-sense or sanity checks are critical if
one wishes to trust the processed data. This step allows users to discover typos, incorrect
mappings, problems with transformation steps, even the rare corruption caused by
computational failure or error.
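
A few sanity checks of the kind described above might look like the sketch below. The rules chosen here (ISO dates, no duplicate dates, non-negative integer counts) and the field names are assumptions for illustration; the right checks depend on the data set.

```python
from datetime import date

def validate(rows):
    """Return a list of (row_index, problem) tuples; an empty list means all checks passed."""
    problems = []
    seen_dates = set()
    for i, row in enumerate(rows):
        # Check 1: the date must exist and parse as YYYY-MM-DD.
        try:
            date.fromisoformat(row["date"])
        except (KeyError, TypeError, ValueError):
            problems.append((i, "date is missing or not ISO formatted"))
        # Check 2: each date should appear only once.
        if row.get("date") in seen_dates:
            problems.append((i, "duplicate date"))
        seen_dates.add(row.get("date"))
        # Check 3: unit counts must be non-negative integers.
        units = row.get("units_sold")
        if not isinstance(units, int) or units < 0:
            problems.append((i, "units_sold must be a non-negative integer"))
    return problems

# Hypothetical processed rows containing a duplicate, a bad date, and a negative count.
rows = [
    {"date": "2025-06-01", "units_sold": 140},
    {"date": "2025-06-01", "units_sold": 95},
    {"date": "2025-13-02", "units_sold": -5},
]
for index, problem in validate(rows):
    print(f"row {index}: {problem}")
```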


The most basic munging operations can be performed in generic tools like Excel or Tableau
—from searching for typos to using pivot tables, or the occasional informational
visualization and simple macro. But for regular mungers and wranglers, a more flexible,
powerful programming language is far more effective.
Python is often lauded as the most flexible popular programming language, and this is no
exception when it comes to data munging. With one of the largest collections of third-party
libraries, especially rich data processing and analysis tools like Pandas, NumPy, and SciPy,
Python simplifies many complex data munging tasks. Pandas in particular is one of the fastest-
growing and best-supported data munging libraries, while still only a tiny part of the massive
Python ecosystem.
Python is also easier to learn than many other languages thanks to simpler, more intuitive
formatting, as well as a focus on legible, English-like syntax. Given Python’s wide
applicability, rich libraries, and online support, new practitioners will also find the
language useful far beyond data processing use cases, everywhere from web development
to workflow automation.
Data science is the study of data, just as the biological sciences are the study of biology
and the physical sciences are the study of physical reactions. Data is real, data
has real properties, and we need to study them if we’re going to work on them. As the
name suggests, data science involves both data and science.



2.2 DATA WRANGLING


Data wrangling is the process of cleaning and unifying messy and complex data sets for easy
access and analysis.
As the amount of data and the number of data sources grow rapidly, it is
increasingly essential that large amounts of available data be organized for analysis. This
process typically includes manually converting and mapping data from one raw form into
another format to allow for more convenient consumption and organization of the data.
The Goals of Data Wrangling
• Reveal a "deeper intelligence" by gathering data from multiple sources
• Put accurate, actionable data in the hands of business analysts in a timely manner
• Reduce the time spent collecting and organizing unruly data before it can be utilized
• Enable data scientists and analysts to focus on the analysis of data, rather than the
wrangling
• Drive better decision-making by senior leaders in an organization



Key Steps to Data Wrangling
Data Acquisition: Identify and obtain access to the data within your sources.


Joining Data: Combine the edited data for further use and analysis.
Data Cleansing: Redesign the data into a usable and functional format and correct/remove
any bad data.
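
The three steps above can be sketched as a small pipeline. This is a minimal illustration using the standard `csv` module; the two inline CSV sources, the column names, and the cleansing rules (drop rows with unparseable amounts or unknown customers) are all hypothetical choices, not a prescribed method.

```python
import csv
import io

# Hypothetical raw sources; in practice these would be files, APIs, or databases.
ORDERS_CSV = """order_id,customer_id,amount
1,c1, 250.00
2,c2,not available
3,c1,99.5
4,c9,40
"""
CUSTOMERS_CSV = """customer_id,name
c1,  asha
c2,Ravi
"""

def acquire(text):
    """Data acquisition: read raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def join(orders, customers):
    """Joining data: attach the customer name by customer_id (None if unmatched)."""
    names = {c["customer_id"]: c["name"] for c in customers}
    return [{**o, "name": names.get(o["customer_id"])} for o in orders]

def cleanse(rows):
    """Data cleansing: normalise text, convert types, drop rows that cannot be repaired."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # amount is not a number
        if row["name"] is None:
            continue  # order refers to an unknown customer
        clean.append({
            "order_id": int(row["order_id"]),
            "name": row["name"].strip().title(),
            "amount": amount,
        })
    return clean

result = cleanse(join(acquire(ORDERS_CSV), acquire(CUSTOMERS_CSV)))
for row in result:
    print(row)
```

Dropping bad rows is only one option; depending on the analysis, it may be better to keep them and flag them for manual review.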


Package Managers are tools that help you manage the dependencies for your project. A
dependency is code that is required for your program to function properly. These often come
in the form of packages.
Packages can also have their own dependencies. Managing all these dependencies can be
hard because packages may require specific versions of their dependencies. It’s easy to break
something by modifying dependencies manually.
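
To see why dependencies of dependencies matter, the sketch below works out an order in which a set of packages could be installed so that every package comes after the packages it needs. The package names are hypothetical, and the example ignores version constraints, which real package managers must also resolve.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical packages: each key depends on the packages in its set.
dependencies = {
    "my_app": {"plotting_lib", "table_lib"},
    "plotting_lib": {"array_lib"},
    "table_lib": {"array_lib"},
    "array_lib": set(),
}

# static_order() yields every package after all of its dependencies.
install_order = list(TopologicalSorter(dependencies).static_order())
print(install_order)
```

Here `array_lib` must come first and `my_app` last; the relative order of the two middle packages does not matter.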
Data science is a blend of various tools, algorithms, and machine learning
principles. Most simply, it involves obtaining meaningful information or insights from
structured or unstructured data through a combination of analysis, programming, and business
skills. It is a field containing many elements, such as mathematics, statistics, and computer science.
Those who are good at these fields and have enough knowledge of the domain in
which they want to work can call themselves data scientists. It is not easy,
but it is not impossible either. The work starts from the data and proceeds through its visualization, programming,
formulation, development, and deployment of a model. Data scientist jobs are expected to be
in great demand in the future; with that in mind, prepare yourself to fit into this
world.


2.3 Data Visualization Basics
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way
to see and understand trends, outliers, and patterns in data.


In the world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.


The advantages and benefits of good data visualization
Our eyes are drawn to colors and patterns. We can quickly identify red from blue, square
from circle. Our culture is visual, including everything from art and advertisements to TV
and movies. Data visualization is another form of visual art that grabs our interest and
keeps our eyes on the message. When we see a chart, we quickly see trends and outliers. If
we can see something, we internalize it quickly. It’s storytelling with a purpose. If you’ve
ever stared at a massive spreadsheet of data and couldn’t see a trend, you know how much
more effective a visualization can be.
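
Even a very plain chart makes this point. The sketch below draws a text bar chart from a few numbers; the `bar_chart` function and the monthly sales figures are hypothetical, and a real project would use a charting tool instead. The function assumes at least one positive value.

```python
def bar_chart(values, width=40, mark="#"):
    """Return one text bar per label, scaled so the largest value fills `width` characters."""
    largest = max(values.values())
    lines = []
    for label, value in values.items():
        bar = mark * round(width * value / largest)
        lines.append(f"{label:<4}{bar} {value}")
    return lines

# Hypothetical monthly sales; one month is an outlier.
monthly_sales = {"Jan": 120, "Feb": 135, "Mar": 128, "Apr": 410, "May": 131}

for line in bar_chart(monthly_sales):
    print(line)
```

In the list of numbers the April figure is easy to miss; in the chart its bar is roughly three times as long as the others and stands out at once.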


Big Data is here and we need to know what it says
As the “age of Big Data” kicks into high-gear, visualization is an increasingly key tool to
make sense of the trillions of rows of data generated every day. Data visualization helps to
tell stories by curating data into a form easier to understand, highlighting the trends and
outliers. A good visualization tells a story, removing the noise from data and highlighting
the useful information. However, it is not as simple as dressing up a graph to make
it look better or slapping on the “info” part of an infographic. Effective data visualization is
a delicate balancing act between form and function. The plainest graph could be too boring
to catch any notice, or it may make a powerful point; the most stunning visualization could
utterly fail at conveying the right message, or it could speak volumes. The data and the
visuals need to work together, and there’s an art to combining great analysis with great
storytelling.


2.4 plyr Package in R Programming
The dplyr package in the R programming language (the successor to plyr, focused on data
frames) is a grammar of data manipulation that
provides a uniform set of verbs, helping to resolve the most frequent data manipulation
hurdles.
The dplyr package makes data manipulation quicker and easier in the following ways:
• By limiting the choices, it lets you focus on the data manipulation problem itself.
• It provides simple “verbs”, functions for tackling every common data
manipulation task, so thoughts can be translated into code faster.
• It uses efficient backends, so less time is spent waiting for the computer.
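
dplyr itself is an R package, so the sketch below is not dplyr code. It is a plain Python analogy, under the assumption that it helps to see what each of the main verbs (filter, mutate, select, arrange, and group_by with summarise) does to a small table. The rows and column names are hypothetical.

```python
from collections import defaultdict

# Hypothetical table: one row per shop and flavour.
rows = [
    {"shop": "A", "flavour": "mango", "units": 30, "price": 2.0},
    {"shop": "A", "flavour": "vanilla", "units": 10, "price": 1.5},
    {"shop": "B", "flavour": "mango", "units": 25, "price": 2.0},
    {"shop": "B", "flavour": "vanilla", "units": 40, "price": 1.5},
]

# filter: keep only the rows that match a condition.
busy = [r for r in rows if r["units"] >= 25]

# mutate: add a column derived from existing columns.
with_revenue = [{**r, "revenue": r["units"] * r["price"]} for r in busy]

# select: keep only the columns of interest.
selected = [{k: r[k] for k in ("shop", "revenue")} for r in with_revenue]

# arrange: sort the rows.
arranged = sorted(selected, key=lambda r: r["revenue"], reverse=True)

# group_by + summarise: reduce each group to one value.
totals = defaultdict(float)
for r in arranged:
    totals[r["shop"]] += r["revenue"]

print(dict(totals))
```

Each step takes a table and returns a table, which is what lets the verbs be chained one after another.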


Melting and Casting In R

Melting and casting are among the more interesting aspects of R programming: they change the shape
of the data to obtain the desired layout. R has several
ways to reshape data, including the reshape package, whose melt() and cast() functions
reshape data efficiently. Many packages in R require data in a particular shape. When each
observation is spread over multiple rows of a data frame, with a different detail in each row, the
data is said to be in long format.
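
The idea behind melting and casting can be illustrated outside R as well. The sketch below is a plain Python analogy, not the reshape package: the `melt` and `cast` functions here are hypothetical helpers written for this example, and the student marks are made up. Melting turns a wide table into long format; casting turns it back.

```python
def melt(rows, id_vars, var_name="variable", value_name="value"):
    """Wide -> long: one output row per (identifier, measured column) pair."""
    long_rows = []
    for row in rows:
        ids = {k: row[k] for k in id_vars}
        for column, value in row.items():
            if column not in id_vars:
                long_rows.append({**ids, var_name: column, value_name: value})
    return long_rows

def cast(long_rows, id_vars, var_name="variable", value_name="value"):
    """Long -> wide: gather rows that share the same identifiers back into one row."""
    wide = {}
    for row in long_rows:
        key = tuple(row[k] for k in id_vars)
        target = wide.setdefault(key, {k: row[k] for k in id_vars})
        target[row[var_name]] = row[value_name]
    return list(wide.values())

# Hypothetical wide table: one row per student, one column per subject.
wide_data = [
    {"student": "Asha", "maths": 82, "physics": 75},
    {"student": "Ravi", "maths": 68, "physics": 90},
]

long_data = melt(wide_data, id_vars=["student"])
for row in long_data:
    print(row)

round_trip = cast(long_data, id_vars=["student"])
print(round_trip == wide_data)
```

The long table has four rows, one per student and subject, and casting it restores the original two-row wide table.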

Document information

Uploaded on
November 8, 2025
Number of pages
39
Written in
2025/2026
Type
Class notes
Professor(s)
Abirami
Contains
All classes

lsharan Sathyabama institute of science and technology