UNIT 1 DATA ACQUISITION
Data Acquisition – Sources of acquiring the data – Internal Systems and External Systems,
Web APIs, Data Preprocessing – Exploratory Data Analysis (EDA) – Basic tools (plots,
graphs and summary statistics) of EDA, Open Data Sources, Data APIs, Web Scraping –
Relational Database access (queries) to process/access data
Introduction
Data
• Raw information, facts or numbers collected to be examined or analysed to make
decisions.
• Should be represented in a formalized manner suitable for communication, interpretation and
processing.
Information
• Result of analysing data
Data versus Information
Types of Data
• Structured – Data that is organized and formatted according to a well-defined
schema, such as rows and columns in a table.
• Unstructured – Data that has no predefined organization and whose form is
context-specific or varying, e.g., e-mail.
• Natural language – A special type of unstructured data; it is challenging to
process because it requires knowledge of specific data science techniques and
linguistics.
• Machine-generated – Data that is automatically created by a computer, process,
application, or other machine without human intervention.
• Graph-based – Data that focuses on the relationship or adjacency of objects.
• Audio, video, and images – Data captured as sound, pictures and
videos.
• Streaming – Data that flows into the system when an event happens instead of being
loaded into a data store in a batch.
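The difference between structured, unstructured and streaming data can be illustrated with a minimal sketch. The field names, the e-mail text and the event_stream helper are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterator, List

# Structured: every record follows the same schema (hypothetical fields).
structured_rows: List[Dict] = [
    {"customer_id": 1, "city": "Chennai", "amount": 250.0},
    {"customer_id": 2, "city": "Madurai", "amount": 120.5},
]

# Unstructured: free text such as an e-mail body; there is no fixed schema.
email_body = "Hi team, the order from Chennai arrived late. Please check."


def event_stream() -> Iterator[Dict]:
    """Simulate streaming by yielding one event at a time."""
    for row in structured_rows:
        yield row


# Batch: all rows are already stored and are processed together.
batch_total = sum(row["amount"] for row in structured_rows)

# Streaming: a running total is updated as each event arrives.
running_total = 0.0
for event in event_stream():
    running_total += event["amount"]

# A structured field is addressed by name; free text has to be searched or parsed.
mentions_chennai = "Chennai" in email_body
```

Both totals end up equal; the point is only that the streaming version never needs the full dataset in memory at once.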
Data file formats
• Tabular (e.g., .csv, .tsv, .xlsx)
• Non-tabular (e.g., .txt, .rtf, .xml)
• Image (e.g., .png, .jpg, .tif)
• Agnostic (e.g., .dat)
➢ Some file formats are proprietary and can only be opened by software developed by
a particular company.
➢ Other file formats also store metadata; for example, SPSS and Stata files
contain information on data labels.
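As a rough illustration of how the format decides the way a file is read, the sketch below parses a tabular (.csv) sample and a non-tabular (.xml) sample using only the Python standard library. The sample contents are made up, and in-memory strings stand in for real files.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tabular data: the first line is the header, each later line is one row.
csv_text = "name,age\nAsha,31\nRavi,27\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Non-tabular data: XML is a tree of nested elements with attributes.
xml_text = (
    "<people>"
    "<person name='Asha' age='31'/>"
    "<person name='Ravi' age='27'/>"
    "</people>"
)
root = ET.fromstring(xml_text)
names_from_xml = [person.get("name") for person in root.findall("person")]

# Note: csv.DictReader returns every value as a string, so numeric
# columns such as age must be converted explicitly.
ages = [int(row["age"]) for row in rows]
```

With real files, open("people.csv", newline="") and ET.parse("people.xml") would replace the in-memory strings; those file names are hypothetical.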
1. Data Acquisition
Data Acquisition :
• The process of gathering data from different relevant sources is referred to
as data acquisition.
• It involves collecting data and ingesting it into a system for further use.
Importance of Data Acquisition :
• It makes it easier for businesses to analyze their data and formulate strategies around it.
• Having data in one place makes it easier to detect any discrepancy and resolve it faster.
• It also decreases human error and improves data security.
• In the long run, it proves to be cost-efficient.
• It helps in building recommendation systems.
Things to consider when acquiring data are:
• What data is needed to achieve the goal?
• How much data is needed?
• Where and how can this data be found?
• What legal and privacy concerns should be considered?
Data acquisition comprises two steps – Data Harvest and Data Ingestion
Data Harvest :
It is the process by which a source generates data; it is concerned with what data is
acquired.
Data Ingestion :
• Focuses on bringing the produced data into a given system.
• Data ingestion consists of three stages – discover, connect and synchronize.
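The notes name the three ingestion stages without defining them, so the sketch below rests on an assumed reading: discover lists the available sources, connect opens the target system, and synchronize copies records that the target does not yet hold. The SOURCES dictionary and the sales table are hypothetical, and an in-memory SQLite database stands in for the target system.

```python
import sqlite3

# Hypothetical harvested data; in practice these would be files, APIs or databases.
SOURCES = {
    "sales_app": [(1, "pen", 10.0), (2, "book", 45.5)],
    "web_shop": [(3, "bag", 300.0)],
}


def discover():
    """Discover: list the sources that are available for ingestion."""
    return sorted(SOURCES)


def connect():
    """Connect: open the target system and make sure the table exists."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, item TEXT, amount REAL)"
    )
    return conn


def synchronize(conn, source_name):
    """Synchronize: copy rows the target lacks and return how many were added."""
    count_sql = "SELECT COUNT(*) FROM sales"
    before = conn.execute(count_sql).fetchone()[0]
    # INSERT OR IGNORE skips rows whose primary key is already present.
    conn.executemany(
        "INSERT OR IGNORE INTO sales VALUES (?, ?, ?)", SOURCES[source_name]
    )
    conn.commit()
    after = conn.execute(count_sql).fetchone()[0]
    return after - before


conn = connect()
loaded = {name: synchronize(conn, name) for name in discover()}
resynced = synchronize(conn, "web_shop")  # second run adds nothing new
total_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Making synchronize safe to repeat is a design choice of this sketch: running it twice on the same source does not duplicate rows.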
Data Acquisition Methods
Data can be obtained from many different sources, such as websites, apps, IoT protocols or
even physical notes, and new data sources appear almost every day.
Methods of acquiring data :
• Collecting new data.
• Converting and/or transforming legacy data.
• Sharing or exchanging data.
• Purchasing data.
Challenges and Characteristics to be considered for Data Acquisition
Before using these methods for data acquisition, the U.S. Geological Survey (USGS) suggests
considering certain business goals and data characteristics.
First, think about the business goal (why is this data required and what will it bring?).
Next, consider the costs, time restrictions and format.
For a specific domain, a heavily regulated industry like banking, or a government-controlled
entity, additional restrictions may also apply, for instance data standard thresholds or
business rule limitations.
Every data acquisition method comes with additional challenges and characteristics to be
considered. For example, when transforming legacy data, first assess the quality of the legacy
data. And when purchasing data, all the licensing issues need to be analysed.
Data Acquisition in Machine Learning
“Data acquisition is the process of sampling signals that measure real-world physical
conditions and converting the resulting samples into digital numeric values that a computer
can manipulate.”
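The definition above can be sketched in two steps: sampling a continuous signal at fixed intervals, then quantizing each sample to an integer code. The 1 Hz sine wave, the 8 Hz sample rate and the 8-bit resolution are assumptions made for illustration; real acquisition hardware performs these steps in an analog-to-digital converter.

```python
import math


def sample_signal(signal, sample_rate_hz, duration_s):
    """Read the signal at evenly spaced time points."""
    n_samples = int(sample_rate_hz * duration_s)
    return [signal(i / sample_rate_hz) for i in range(n_samples)]


def quantize(value, v_min, v_max, bits):
    """Map a value in [v_min, v_max] to an integer code in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    clipped = min(max(value, v_min), v_max)  # out-of-range inputs saturate
    return round((clipped - v_min) / (v_max - v_min) * levels)


def sine_1hz(t):
    """Hypothetical physical signal: a 1 Hz sine wave between -1 and 1."""
    return math.sin(2 * math.pi * 1.0 * t)


samples = sample_signal(sine_1hz, sample_rate_hz=8, duration_s=1.0)
codes = [quantize(s, v_min=-1.0, v_max=1.0, bits=8) for s in samples]
```

Here samples holds eight real-valued readings and codes holds the corresponding integers between 0 and 255 that a computer can store and manipulate.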
• Collection and Integration of the data: The data is extracted from various sources,
and multiple datasets need to be combined based upon the requirement.
• Formatting: Prepare or organize the datasets as per the analysis requirements.
• Labeling: After gathering the data, it must be labeled (named).
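The three steps listed above can be sketched on a toy example. The two sources, their field names and the labeling threshold are hypothetical; in practice labels often come from human annotators rather than a fixed rule.

```python
# Hypothetical sources, both keyed by customer id.
crm_records = {1: {"name": "  asha "}, 2: {"name": "RAVI"}}
billing_records = {1: {"amount": "1200.50"}, 2: {"amount": "80"}}


def integrate(crm, billing):
    """Collection and integration: combine records that share a customer id."""
    merged = []
    for customer_id in sorted(crm.keys() & billing.keys()):
        merged.append(
            {"customer_id": customer_id, **crm[customer_id], **billing[customer_id]}
        )
    return merged


def format_record(record):
    """Formatting: clean up names and convert amounts from text to numbers."""
    return {
        **record,
        "name": record["name"].strip().title(),
        "amount": float(record["amount"]),
    }


def label_record(record, threshold=500.0):
    """Labeling: attach a target label (rule-based here, an assumption)."""
    label = "high_value" if record["amount"] >= threshold else "regular"
    return {**record, "label": label}


dataset = [
    label_record(format_record(record))
    for record in integrate(crm_records, billing_records)
]
```

Each entry in dataset is now a single cleaned, labeled record that could be passed to a training routine.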
Data Acquisition Process
The process of data acquisition involves searching for datasets that can be used to train
machine learning models.
The main segments are :
1. Data Discovery
2. Data Augmentation
3. Data Generation
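The notes list the three segments without defining them, so the sketch below assumes one common reading: discovery searches for existing datasets, augmentation enlarges the data already held (here by adding small random noise), and generation creates new data when none exists (here by drawing synthetic values). The catalog entries and all parameter values are hypothetical.

```python
import random

# Hypothetical catalog of available datasets: name -> description.
CATALOG = {
    "city_temperatures": "daily temperature readings per city",
    "store_sales": "monthly sales totals per store",
}


def find_datasets(keyword):
    """Data discovery: search the catalog by keyword."""
    keyword = keyword.lower()
    return sorted(
        name
        for name, description in CATALOG.items()
        if keyword in name.lower() or keyword in description.lower()
    )


def augment(samples, noise=0.5, copies=2, seed=0):
    """Data augmentation: add noisy copies of the existing samples."""
    rng = random.Random(seed)
    augmented = list(samples)
    for _ in range(copies):
        augmented.extend(x + rng.uniform(-noise, noise) for x in samples)
    return augmented


def generate(n, mean, stdev, seed=0):
    """Data generation: draw n synthetic values from a normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]


found = find_datasets("temperature")
bigger = augment([30.1, 31.4, 29.8])
synthetic = generate(5, mean=30.0, stdev=1.5)
```

Fixing the seed keeps the sketch reproducible; whether noise-based augmentation is appropriate depends on the data and the model being trained.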