CS3352 – DOWNLOADED FROM STUCOR APP | FDS III SEM CSE
UNIT I - INTRODUCTION
FOUNDATION OF DATA SCIENCE
INTRODUCTION
Data Science: Benefits and uses – facets of data – Data Science Process: Overview – Defining research goals –
Retrieving data – Data preparation – Exploratory Data analysis – Build the model – presenting findings and
building applications – Data Mining – Data Warehousing – Basic Statistical descriptions of Data
Data
In computing, data is information that has been translated into a form that is efficient for movement or
processing.
Data Science
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data
produced today. It adds methods from computer science to the repertoire of statistics.
Benefits and uses of data science
Data science and big data are used almost everywhere in both commercial and noncommercial settings.
Commercial companies in almost every industry use data science and big data to gain insights into
their customers, processes, staff, competition, and products.
Many companies use data science to offer customers a better user experience, as well as to cross-sell,
up-sell, and personalize their offerings.
Governmental organizations are also aware of data’s value. Many governmental organizations not only
rely on internal data scientists to discover valuable information, but also share their data with the
public.
Nongovernmental organizations (NGOs) use data to raise money and defend their causes.
Universities use data science in their research but also to enhance the study experience of their
students. The rise of massive open online courses (MOOCs) produces a lot of data, which allows
universities to study how this type of learning can complement traditional classes.
Facets of data
In data science and big data you’ll come across many different types of data, and each of them tends to require
different tools and techniques. The main categories of data are these:
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming
Let’s explore all these interesting data types.
Structured data
Structured data is data that depends on a data model and resides in a fixed field within a record. As
such, it’s often easy to store structured data in tables within databases or Excel files.
SQL, or Structured Query Language, is the preferred way to manage and query data that resides in
databases.
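As a minimal sketch of what "fixed fields within a record" and querying with SQL look like in practice, the snippet below uses Python's standard-library sqlite3 module with an in-memory database. The customers table, its columns, and its rows are made up for illustration and are not part of the original notes.

```python
import sqlite3

# In-memory database; the "customers" table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (name, city, age) VALUES (?, ?, ?)",
    [("Asha", "Chennai", 34), ("Ravi", "Madurai", 41), ("Meena", "Chennai", 29)],
)

# Every record has the same fixed fields, so a declarative query can filter and sort it.
rows = conn.execute(
    "SELECT name, age FROM customers WHERE city = ? ORDER BY age", ("Chennai",)
).fetchall()
print(rows)
conn.close()
```

Because every row follows the same data model, the query can refer to fields by name without inspecting individual records.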
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or
varying. One example of unstructured data is your regular email.
Natural language
Natural language is a special type of unstructured data; it’s challenging to process because it requires
knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis, but models trained in one domain don’t
generalize well to other domains.
Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text.
Machine-generated data
Machine-generated data is information that’s automatically created by a computer, process,
application, or other machine without human intervention.
Machine-generated data is becoming a major data resource and will continue to be one.
The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
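To make the web server log example concrete, here is a small sketch that parses a few log lines and counts HTTP status codes. The log lines are invented, and the regular expression assumes the widely used Common Log Format; real logs may use a different layout.

```python
import re
from collections import Counter

# Hypothetical log lines, assumed to follow the Common Log Format.
log_lines = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 512',
    '192.0.2.1 - - [10/Oct/2023:13:56:02 +0000] "POST /login HTTP/1.1" 200 128',
]

# Named groups pull the fixed parts out of each machine-written line.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

status_counts = Counter()
for line in log_lines:
    match = pattern.match(line)
    if match:  # skip lines that do not fit the assumed format
        status_counts[match.group("status")] += 1

print(dict(status_counts))
```

At production volume the same idea is usually run on scalable tooling rather than in a single loop, which is the point the notes make about volume and speed.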
Graph-based or network data
“Graph data” can be a confusing term because any data can be shown in a graph. Here, “graph” refers to the mathematical structure from graph theory, not to a chart.
Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
Graph structures use nodes, edges, and properties to represent and store graph data.
Graph-based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between two people.
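The two metrics mentioned above can be sketched on a tiny, made-up friendship network. The sketch stores the graph as an adjacency list, finds a shortest path with breadth-first search, and uses the number of direct connections (degree) as a crude stand-in for influence; that proxy is an assumption for illustration, since the notes do not define how influence is measured.

```python
from collections import deque

# Hypothetical undirected friendship network stored as an adjacency list.
friends = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dev"],
    "Cara": ["Ann", "Dev"],
    "Dev": ["Bob", "Cara", "Eli"],
    "Eli": ["Dev"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Degree (number of direct connections) as a simple proxy for influence.
degree = {person: len(neighbors) for person, neighbors in friends.items()}
most_connected = max(degree, key=degree.get)

path = shortest_path(friends, "Ann", "Eli")
print(path, most_connected)
```

Breadth-first search returns a path with the fewest edges because it explores all nodes at distance one before any node at distance two, and so on.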
Audio, image, and video
Audio, image, and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for
computers.
MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would increase video
capture to approximately 7 TB per game for the purpose of live, in-game analytics.
Recently a company called DeepMind succeeded at creating an algorithm that’s capable of learning
how to play video games.
This algorithm takes the video screen as input and learns to interpret everything via a complex process
of deep learning.
Streaming data
Streaming data flows into the system when an event happens instead of being loaded into a data store in a
batch.
Examples are the “What’s trending” list on Twitter, live sporting or music events, and the stock market.
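The difference between event-by-event arrival and batch loading can be sketched with Python generators. The price_stream function below is a hypothetical stand-in for a live feed with invented values; the running mean is updated as each event arrives, without first collecting the whole batch.

```python
def price_stream():
    """Hypothetical event source: yields one stock tick at a time."""
    for tick in [101.0, 102.5, 99.5, 103.0]:
        yield tick

def running_mean(stream):
    """Update the mean incrementally as each event arrives."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental update, no stored history
        yield mean

means = list(running_mean(price_stream()))
print(means[-1])
```

Only the current count and mean are kept in memory, which is what makes this style of computation suitable for data that never stops arriving.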
Data Science Process
Overview of the data science process
The typical data science process consists of six steps through which you’ll iterate, as shown in figure
1. The first step of this process is setting a research goal. The main purpose here is making sure all the
stakeholders understand the what, how, and why of the project. In every serious project this will result
in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step includes
finding suitable data and getting access to the data from the data owner. The result is data in its raw
form, which probably needs polishing and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw
form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different
kinds of errors in the data, combine data from different data sources, and transform it. If you have
successfully completed this step, you can progress to data visualization and modeling.
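The data preparation step described above (detect and correct errors, combine sources, transform) can be sketched in plain Python. All records, field names, and thresholds below are invented for illustration; real projects often use a library such as pandas for the same operations.

```python
# Hypothetical raw records from two sources; names and values are made up.
customers = [
    {"id": 1, "name": " Asha ", "age": "34"},
    {"id": 2, "name": "Ravi", "age": "-5"},   # impossible age: a data entry error
    {"id": 3, "name": "Meena", "age": "29"},
]
orders = {1: 250.0, 3: 120.0}  # customer id -> total spend

prepared = []
for row in customers:
    age = int(row["age"])
    prepared.append({
        "id": row["id"],
        "name": row["name"].strip(),              # correct: remove stray whitespace
        "age": age if 0 <= age <= 120 else None,  # detect: mark out-of-range values as missing
        "spend": orders.get(row["id"], 0.0),      # combine: join the two sources on id
    })

# Transform: derive a model-ready feature (the 200.0 threshold is an assumption).
for row in prepared:
    row["high_spender"] = row["spend"] >= 200.0

print(prepared)
```

Marking the bad age as missing rather than guessing a value keeps the correction visible to later steps, which can then decide how to handle it.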