Unit 1
Data
• Raw information, facts, or numbers collected to be examined or analysed to make
decisions.
• Should be represented in a formalized manner suitable for communication,
interpretation, and processing.
Information
• Result of analysing data
Data versus Information
Data are the building blocks of information. Likewise, pieces of information are the
building blocks of records.
Information: Data that has been given value through analysis, interpretation, or
compilation in a meaningful form.
Types of Data:
1. Structured: Well-organized data with a defined schema (e.g., tables).
2. Unstructured: Unorganized data with no fixed format (e.g., emails).
3. Natural Language: A type of unstructured data that involves human language,
requiring linguistic techniques.
4. Machine-generated: Data created automatically by systems without human input.
5. Graph-based: Data focusing on object relationships and connections.
6. Audio, Video, Images: Multimedia data captured through sound and visuals.
7. Streaming: Real-time data generated continuously by events.
Data Acquisition
Introduction:
Data Acquisition is the process of collecting, filtering, and converting real-world data
from various relevant sources into a format that can be processed by computers for
further analysis. In today's data-driven world, it plays a crucial role in business
intelligence, machine learning, and decision-making systems.
Importance of Data Acquisition:
1. Strategic Planning: Helps businesses analyze customer behavior and market trends
to frame effective strategies.
2. Error Detection: Makes it easier to identify inconsistencies or gaps in data early on.
3. Minimizes Human Error: Automates data collection to reduce manual mistakes.
4. Improved Data Security: Secure handling and storage of data.
5. Cost Efficiency: Automates repetitive tasks, saving time and operational costs.
6. Enables Recommendation Systems: Collected data is crucial for building systems
like product or content recommendations.
Data Acquisition in Machine Learning:
In machine learning, data acquisition is the first and most important step. The model's
accuracy and performance depend heavily on the quality of data collected.
Key Steps:
1. Collection & Integration: Data is gathered from multiple sources and combined as
required.
2. Formatting: Raw data is organized and cleaned to fit the model's requirements.
3. Labeling: Data is tagged with proper labels or classifications for supervised
learning.
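As an illustration, the three key steps above can be sketched in plain Python. This is a minimal sketch, not a prescribed method: the two source lists, the field names, and the labeling rule (a customer with three or more tickets is "at risk") are all assumptions made up for the example.

```python
# Hypothetical records from two sources; names and values are illustrative only.
crm_rows = [
    {"customer_id": 1, "name": " Alice ", "spend": "120.50"},
    {"customer_id": 2, "name": "Bob", "spend": "15"},
]
support_rows = [
    {"customer_id": 1, "tickets": 0},
    {"customer_id": 2, "tickets": 4},
]


def integrate(left, right, key):
    """Step 1: combine rows from two sources that share a key."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup.get(row[key], {})} for row in left]


def format_row(row):
    """Step 2: clean text fields and convert strings to numeric types."""
    return {
        "customer_id": row["customer_id"],
        "name": row["name"].strip(),
        "spend": float(row["spend"]),
        "tickets": int(row.get("tickets", 0)),
    }


def label_row(row, ticket_threshold=3):
    """Step 3: attach a label for supervised learning (assumed rule)."""
    return {**row, "at_risk": row["tickets"] >= ticket_threshold}


merged = integrate(crm_rows, support_rows, "customer_id")
dataset = [label_row(format_row(row)) for row in merged]
```

Each step is a separate function, so any one of them can be replaced (for example, with manual labeling) without touching the others.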
Data Acquisition Process:
1. Data Discovery: Searching and identifying new, useful datasets from internal or
external sources.
2. Data Augmentation: Enhancing existing datasets by adding external data to
improve richness and context.
3. Data Generation: Creating datasets either manually (e.g., surveys) or automatically
(e.g., sensors, web scraping).
Techniques and Tools for Data Acquisition:
1. Data Warehouses & ETL (Extract, Transform, Load):
• Data Warehouse: A centralized database where structured data from various
sources is stored.
• ETL Process:
o Extract: Retrieve data from multiple sources
o Transform: Convert it into a suitable format
o Load: Store it in a data warehouse
• ETL Tools:
o Code-based: SQL, PL/SQL, Base SAS
o GUI-based: Informatica, DataStage, SSIS
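The ETL process described above can be sketched with Python's standard library. This is a minimal sketch under stated assumptions: the CSV content, the orders table, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from files or databases.
SOURCE_CSV = """order_id,amount,currency
1,10.50,usd
2,7.25,usd
3,,usd
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount, convert types, normalize currency codes."""
    return [
        (int(r["order_id"]), float(r["amount"]), r["currency"].upper())
        for r in rows
        if r["amount"]
    ]


def load(conn, records):
    """Load: store the cleaned records in the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(warehouse, transform(extract(SOURCE_CSV)))
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Because the transform runs before the load, only cleaned rows ever reach the warehouse; the row with the missing amount is never stored.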
2. Data Lakes & ELT (Extract, Load, Transform):
• Store structured, semi-structured, and unstructured data (e.g., images, videos,
PDFs)
• Follows ELT process where raw data is stored first and transformed only when
needed
• More flexible and suitable for big data environments
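For contrast with ETL, the ELT idea can be sketched as follows. This is a rough sketch only: the event payloads are hypothetical, and an in-memory SQLite table of raw text stands in for a data lake.

```python
import json
import sqlite3

# Hypothetical raw events with inconsistent shapes, as a data lake might receive them.
raw_events = [
    '{"user": "a", "action": "click", "ms": 120}',
    '{"user": "b", "action": "view"}',
    '{"user": "a", "action": "click", "ms": 80}',
]

lake = sqlite3.connect(":memory:")  # stand-in for a data lake
lake.execute("CREATE TABLE raw_events (payload TEXT)")

# Extract + Load: store every payload as-is, with no cleaning and no fixed schema.
lake.executemany("INSERT INTO raw_events VALUES (?)", [(e,) for e in raw_events])
lake.commit()


def clicks_per_user(conn):
    """Transform on demand: parse raw payloads only when an analysis needs them."""
    counts = {}
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        if event.get("action") == "click":
            counts[event["user"]] = counts.get(event["user"], 0) + 1
    return counts


result = clicks_per_user(lake)
```

All three raw payloads stay available in the lake, so a later analysis can apply a different transformation to the same data.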
3. Cloud Data Warehouses:
• Examples: Amazon Redshift, Google BigQuery, Snowflake
• Offer on-demand storage with no physical hardware for the organization to maintain
• Cost-effective and scalable solutions for modern enterprises
Data Collection Sources:
1. Primary Data Collection:
Original, firsthand data collected directly from the source.
• Surveys/Questionnaires: Structured forms to collect data online or offline
• Interviews: One-on-one interaction; can be structured or unstructured
• Observations: Monitoring subjects in a natural environment
• Experiments: Controlled environment to study cause-effect relationships
• Focus Groups: Group discussions for feedback or opinion gathering
2. Secondary Data Collection:
Data collected by others, reused for new analysis.
• Published Sources: Books, research papers, newspapers
• Online Databases: Statistical and academic data
• Government Records: Census data, economic reports
• Public Data: Social media posts, forums
• Previous Research Studies: Existing datasets used for comparative or extended
research
Internal and External Systems:
Internal Systems:
• Data generated within an organization
• Comes from internal operations like sales, production, customer support
• Stored in CRM, ERP systems, spreadsheets, etc.
• Highly structured and controlled
External Systems:
• Data gathered from outside the organization
• Sources: market trends, competitors, government databases, social media
• Includes data from APIs (Application Programming Interfaces)
Web APIs:
• Enable communication between different software systems
• REST API: Most common API type; exposes data as resources addressed by URLs, usually returned as JSON
o Lightweight, easy to use, suitable for web applications
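A minimal sketch of acquiring data from a REST API with Python's standard library is shown here. The base URL, the "products" resource, and the sample response body are hypothetical and do not refer to any real service; the fetch function needs network access, so the demonstration at the end works offline on the sample body instead.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://api.example.com"  # hypothetical endpoint, not a real service


def build_url(resource, **params):
    """Build the URL for a resource, with optional query parameters."""
    url = f"{BASE_URL}/{resource}"
    return f"{url}?{urlencode(params)}" if params else url


def parse_body(body):
    """Decode a JSON response body into Python objects."""
    return json.loads(body)


def fetch(resource, **params):
    """Send a GET request and return the decoded JSON (requires network access)."""
    request = Request(build_url(resource, **params), headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return parse_body(response.read().decode("utf-8"))


# Offline demonstration with a sample body of the kind such an API might return.
sample = '[{"id": 1, "name": "Widget"}, {"id": 2, "name": "Gadget"}]'
products = parse_body(sample)
url = build_url("products", page=1, limit=2)
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without contacting a server.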
Conclusion:
Data Acquisition is the foundation of any data-centric technology or system. Without
accurate and well-collected data, the success of machine learning models, business
intelligence tools, and automated systems is compromised. Understanding the sources,
tools, and processes involved in data acquisition enables better decision-making and
technological advancement.
Data Preprocessing
Data preprocessing is an important step in data mining. In this process, raw data is
converted into an understandable format and made ready for further analysis.
Purpose of data preprocessing:
❖ Get data overview
❖ Identify missing data
❖ Identify outliers or anomalous data
❖ Remove Inconsistencies
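The first three purposes above can be illustrated with a short sketch. The readings are hypothetical, and the interquartile range (IQR) rule used to flag outliers is one common convention chosen for this example, not the only option.

```python
import statistics

# Hypothetical sensor readings; None marks a missing value.
readings = [21.0, 22.5, None, 21.8, 95.0, 22.1, None, 21.4]

present = [x for x in readings if x is not None]

# Data overview and missing data: basic counts and summary statistics.
overview = {
    "count": len(readings),
    "missing": readings.count(None),
    "mean": statistics.mean(present),
    "median": statistics.median(present),
}

# Outliers: flag values more than 1.5 * IQR outside the first and third quartiles.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in present if x < low or x > high]
```

Here the reading of 95.0 is flagged as an outlier. Whether to drop, cap, or keep such a value depends on the data and the analysis.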
Data preprocessing in Machine Learning: A practical approach
Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is a crucial early step in creating a machine learning
model. Real-world data generally contains noise and missing values, and may be in an
unusable format that cannot be fed directly to a machine learning model.
Data preprocessing covers the tasks required to clean the data and make it suitable for
a model, which also increases the accuracy and efficiency of the machine learning
model.
It involves the following steps:
1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding Missing Data
5. Encoding Categorical Data
6. Splitting dataset into training and test set
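The steps above can be sketched end to end in Python. This sketch uses only the standard library so that it runs anywhere; in practice, libraries such as pandas and scikit-learn are commonly used for the same steps. The dataset, the column names, the mean-imputation rule, and the 80/20 split are assumptions chosen for illustration.

```python
import csv
import io
import random

# Steps 1-3: a hypothetical dataset, inlined here instead of read from a file.
RAW = """country,age,purchased
France,44,no
Spain,27,yes
Germany,,no
Spain,38,no
Germany,40,yes
France,35,yes
"""
rows = list(csv.DictReader(io.StringIO(RAW)))

# Step 4: find missing ages and fill them with the mean of the known ages.
known_ages = [float(r["age"]) for r in rows if r["age"]]
mean_age = sum(known_ages) / len(known_ages)
for r in rows:
    r["age"] = float(r["age"]) if r["age"] else mean_age

# Step 5: one-hot encode the categorical column and map the label to 0/1.
countries = sorted({r["country"] for r in rows})
features = [
    [1.0 if r["country"] == c else 0.0 for c in countries] + [r["age"]]
    for r in rows
]
labels = [1 if r["purchased"] == "yes" else 0 for r in rows]


# Step 6: shuffle with a fixed seed, then hold out a share of rows for testing.
def train_test_split(x, y, test_ratio=0.2, seed=0):
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(x) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [x[i] for i in train_idx],
        [x[i] for i in test_idx],
        [y[i] for i in train_idx],
        [y[i] for i in test_idx],
    )


x_train, x_test, y_train, y_test = train_test_split(features, labels)
```

Fixing the seed makes the split reproducible, which helps when comparing models trained on the same data.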