SAVITRIBAI PHULE PUNE UNIVERSITY
B.E. (Computer Engineering / IT / Data Science)
DATA SCIENCE & BIG DATA ANALYTICS
UNIT I -- Complete Exam Notes
Structured for 4-, 5-, and 6-Mark Questions | Target: 30/30
Subject: Data Science & Big Data Analytics
Unit: Unit I -- Introduction & Data Preprocessing
University: Savitribai Phule Pune University (SPPU)
Exam Pattern: 4-Mark | 5-Mark | 6-Mark Questions
Contents: 15 Topics + Model Answers + Viva Q&A
Target Score: Full 30/30 Marks
SPPU | Data Science & Big Data Analytics | Unit I Exam Notes -- Score 30/30
PART 1: COMPLETE UNIT NOTES
1. Basics and Need of Data Science & Big Data
1.1 What is Data Science?
DEFINITION
Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. It combines statistics,
computer science, domain expertise, and machine learning for decision-making.
1.2 Why is Data Science Needed?
• Exponential data growth from IoT, social media, and digital transactions
• Traditional tools cannot process or analyze large, complex datasets
• Enables predictive analytics, fraud detection, and personalization
• Provides competitive business advantage through data-driven decisions
• Uncovers hidden patterns and correlations invisible to human analysis
1.3 What is Big Data?
DEFINITION
Big Data refers to extremely large datasets that cannot be stored, processed, or analyzed using
traditional database tools. It is characterized by the 5 V's: Volume, Velocity, Variety, Veracity, and Value.
1.4 Need for Big Data Technologies
• Large organizations can generate terabytes of data every single day
• Traditional single-server RDBMS struggle to handle petabyte-scale datasets
• Real-time processing needed for stock markets, social media, IoT sensors
• Enables cost reduction through optimized business operations
• Supports pattern recognition and accurate future prediction
2. Applications of Data Science
Domain / Industry | Data Science Applications
Healthcare        | Disease prediction, drug discovery, medical imaging, patient monitoring
Finance & Banking | Fraud detection, credit scoring, risk analysis, algorithmic trading
E-Commerce        | Product recommendations, customer segmentation, demand forecasting
Social Media      | Sentiment analysis, content recommendation, targeted advertising
Transportation    | Route optimization, autonomous vehicles, traffic prediction
Education         | Personalized learning, dropout prediction, student performance analysis
Manufacturing     | Predictive maintenance, quality control, supply chain optimization
Government        | Crime prediction, policy planning, census analysis
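REAL EXAMPLE - Healthcare
Patient records (age, symptoms, lab results) are fed to a Random Forest model that predicts the
probability of heart disease. The accuracy quoted for this example is 92%; the actual figure depends on
the dataset and how the model is validated. Such predictions enable proactive medical intervention and
can help reduce mortality rates.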
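The idea behind the healthcare example can be sketched in a few lines of Python. This is a toy illustration only: the patient values are hypothetical, the "forest" is a simplified stand-in built from one-split trees (decision stumps) trained on bootstrap samples with a randomly chosen feature, and it does not reproduce the 92% figure. The function names (fit_stump, fit_forest, predict_proba) are assumptions made for this sketch. In practice a library implementation such as scikit-learn's RandomForestClassifier would normally be used.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(rows, labels, feature):
    """Find the best single split on one feature.
    Returns (feature, threshold, p_left, p_right)."""
    values = sorted({r[feature] for r in rows})
    base = sum(labels) / len(labels)
    best = (feature, values[0], base, base)   # fallback: constant prediction
    best_score = gini(labels)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2                      # midpoint between neighbours
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_score = score
            best = (feature, t, sum(left) / len(left), sum(right) / len(right))
    return best

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        feature = rng.randrange(n_features)          # random feature choice
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def predict_proba(forest, row):
    """Average the leaf probabilities of all stumps."""
    votes = [pl if row[f] <= t else pr for f, t, pl, pr in forest]
    return sum(votes) / len(votes)

# Hypothetical records: (age, cholesterol mg/dL, resting BP mmHg); 1 = heart disease
rows = [(35, 180, 115), (42, 190, 120), (38, 175, 110),
        (45, 200, 118), (40, 185, 122), (36, 170, 112),
        (62, 260, 150), (58, 250, 145), (66, 270, 155),
        (60, 240, 148), (70, 280, 160), (64, 255, 152)]
labels = [0] * 6 + [1] * 6

forest = fit_forest(rows, labels)
p_high = predict_proba(forest, (68, 265, 158))   # older patient, high readings
p_low = predict_proba(forest, (37, 178, 114))    # younger patient, normal readings
print(f"High-risk patient: {p_high:.2f} | Low-risk patient: {p_low:.2f}")
```

Exam point: a Random Forest combines many trees trained on random samples and random features, and outputs the averaged vote as a probability.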
3. Data Explosion
Data Explosion refers to the unprecedented and exponential growth of data being generated, captured, and
stored globally due to digital transformation.
Causes of Data Explosion
• Social Media -- Billions of posts, images, videos daily (Facebook, Instagram, Twitter)
• IoT Devices -- Sensors, smart appliances generating continuous real-time data streams
• Mobile Devices -- GPS tracking, app usage, purchase records, call logs
• E-Commerce -- Clickstream, purchase history, product browsing, reviews
• Scientific Research -- Genome sequencing, satellite imagery, astronomical data
• Cloud Computing -- Cheaper storage encourages organizations to store everything
DATA EXPLOSION - GLOBAL DATA VOLUME GROWTH
2010 ||-- ~1 Zettabyte
2015 ||-------- ~8 Zettabytes
2020 ||---------------- ~40 Zettabytes
2025 ||------------------------ ~120+ Zettabytes
----------------------------------
1 Zettabyte = 1 billion Terabytes
Data doubles approximately every 2 years
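The "doubles approximately every 2 years" statement can be checked against the chart figures with a short calculation. The sketch below assumes smooth exponential growth between the first and last data points, which is a simplification; the variable and function names are illustrative only.

```python
import math

# Approximate global data volume in zettabytes, taken from the chart above
volumes_zb = {2010: 1, 2015: 8, 2020: 40, 2025: 120}

def doubling_time(year0, vol0, year1, vol1):
    """Years needed to double, assuming steady exponential growth."""
    return (year1 - year0) * math.log(2) / math.log(vol1 / vol0)

def zb_to_tb(zettabytes):
    """1 ZB = 10**21 bytes and 1 TB = 10**12 bytes."""
    return zettabytes * 10**21 // 10**12

overall = doubling_time(2010, volumes_zb[2010], 2025, volumes_zb[2025])
print(f"Doubling time 2010-2025: {overall:.2f} years")
print(f"1 ZB = {zb_to_tb(1):,} TB")
```

The overall doubling time comes out slightly above 2 years, which agrees with the rule of thumb, and the conversion confirms 1 Zettabyte = 1 billion Terabytes.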
Challenges Created by Data Explosion
• Storage management, scalability, and infrastructure cost
• Data security, privacy, and regulatory compliance
• Need for specialized tools: Hadoop, Apache Spark, NoSQL databases
• Ensuring data quality and consistency at massive scale
4. Five V's of Big Data
The 5 V's are the key characteristics that define and distinguish Big Data:
V1 - VOLUME
Massive scale of data generated every second.
Example: Facebook: 4 PB/day | Twitter: 500M tweets/day | YouTube: 500 hrs video/min
V2 - VELOCITY
Speed at which data is generated, collected, and processed (real-time or near-real-time).
Example: Stock ticks processed in microseconds | IoT sensors stream data continuously
V3 - VARIETY
Diverse types: Structured (tables), Semi-structured (JSON, XML), Unstructured (text, images, video).
Example: Hospital stores text reports + X-ray images + lab values + ECG signals
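To make Variety concrete, the sketch below handles one small sample of each data type using only the Python standard library. All values (patient ID, lab value, ECG numbers, report text) are hypothetical and chosen for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, like a lab-values table (hypothetical values)
structured = "patient_id,hemoglobin_g_dl\nP001,13.5\n"
# Semi-structured: nested JSON without a fixed schema (hypothetical ECG record)
semi_structured = '{"patient_id": "P001", "ecg": {"rate_bpm": 72, "samples": [0.1, 0.4, 0.2]}}'
# Unstructured: free text, like a doctor's report
unstructured = "Patient P001 reports mild chest pain after exercise."

rows = list(csv.DictReader(io.StringIO(structured)))   # list of dicts keyed by column
record = json.loads(semi_structured)                   # nested dict
word_count = len(unstructured.split())                 # only basic text statistics

print(rows[0]["hemoglobin_g_dl"], record["ecg"]["rate_bpm"], word_count)
```

Structured data can be queried directly by column name, semi-structured data must be navigated by key, and unstructured data needs further processing (for example NLP) before analysis.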