CMPE172 Final Exam questions and answers with
complete solutions verified graded a++ latest update
Terms in this set (137)
Big Data
Major problems faced by _____ fall mainly under the three V's:
Volume: Facebook generates 500 TB of data every day. Twitter generates 8 TB of data daily.
Velocity: A framework capable of high-speed data computations is needed.
Variety: Data from various sources have varied formats.
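To put the Volume figures in perspective, the daily totals can be converted into an average ingest rate. This is only a rough sketch: it assumes decimal units (1 TB = 10^12 bytes) and a perfectly even load over 24 hours, and the helper name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def sustained_rate_bytes_per_sec(terabytes_per_day: float) -> float:
    """Average bytes per second, assuming 1 TB = 10**12 bytes (an assumption)."""
    return terabytes_per_day * 10**12 / SECONDS_PER_DAY


facebook_rate = sustained_rate_bytes_per_sec(500)  # daily figure quoted in the card
twitter_rate = sustained_rate_bytes_per_sec(8)     # daily figure quoted in the card

print(f"500 TB/day is about {facebook_rate / 10**9:.2f} GB/s")
print(f"8 TB/day is about {twitter_rate / 10**6:.1f} MB/s")
```

Under these assumptions the two figures work out to roughly 5.79 GB/s and 92.6 MB/s, which is why a single machine is not a realistic target for this kind of workload.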
architecture
A Big Data _____ requires three components:
-A scalable and available storage mechanism, such as a distributed filesystem or database
-A distributed compute engine, for processing and querying the data at scale
-Tools to manage the resources and services used to implement these systems
data
Big _____ systems come in two general forms:
-NoSQL databases that integrate these components into a database system
-Environments like Hadoop
https://quizlet.com/504905432/cmpe172-final-flash-cards/ (7/7/25, 5:44 PM)
persistence
In Big Data, data is ingested to the _____ tier: HDFS (Hadoop Distributed File System), AWS S3, SQL and NoSQL databases.
-Flume is used for log aggregation, Sqoop for interoperating with databases.
Hadoop
_____ is a technology for storing massive datasets on a cluster of cheap machines in a distributed manner. It is a registered trademark of the Apache Software Foundation. It splits files into large blocks, distributes them across the cluster, and transfers code to the nodes so that the data is processed in parallel. Datasets are processed faster and more efficiently.
Hadoop
_____ consists of three core components:
◦ Hadoop Distributed File System (HDFS) - It is the storage layer of Hadoop.
◦ Map-Reduce - It is the data processing layer of Hadoop.
◦ YARN - It is the resource management layer of Hadoop.
Hadoop
_____ utilizes a simple programming model to perform the required operation among clusters.
All modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be dealt with by the framework.
It runs applications using the MapReduce algorithm.
_____ algorithm:
◦ Data is processed in parallel on different CPU nodes.
◦ Capable of running on clusters of computers.
◦ Can perform a complete statistical analysis of a huge amount of data.
Map Reduce
-is the data processing layer of Hadoop:
-Data is processed in two phases:
Map Phase - This phase applies business logic to the data. The input data gets converted into key-value pairs.
Reduce Phase - The Reduce phase takes the output of the Map phase as input. It applies aggregation based on the key of the key-value pairs.
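The two phases can be illustrated with a word count run in a single Python process. This is a minimal sketch of the idea, not Hadoop's API: the function names and sample lines are hypothetical, and the explicit shuffle step stands in for the grouping by key that the framework performs between the Map and Reduce phases.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: convert each input line into (word, 1) key-value pairs."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the values for each key (here, by summing them)."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical sample input, standing in for lines read from input files.
lines = ["big data needs hadoop", "hadoop stores big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
```

On a real cluster the map calls would run in parallel on the nodes that hold the input blocks, and each reducer would receive only the keys assigned to it.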
HDFS
Hadoop Distributed File System (_____) - It is the storage layer of Hadoop.
-The master is a high-end machine where metadata is stored.
-Slaves are inexpensive computers.
-Big Data files get divided into a number of blocks. Hadoop stores these blocks in a distributed fashion on the cluster of slave nodes.
-HDFS has two daemons running:
NameNode: Responsible for maintaining, monitoring and managing DataNodes. Records the metadata of the files, such as the location of blocks, file size, permissions, hierarchy, etc.
DataNode: Runs on the slave machine. Stores the actual business data. Serves read-write requests from the user.
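The division of labour between the two daemons can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the class names, the 4-byte block size (real HDFS blocks are far larger), and the round-robin placement are not HDFS's actual API or policy, and replication is left out entirely.

```python
from itertools import cycle

BLOCK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut a file's contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyDataNode:
    """Stores the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # (filename, block_index) -> bytes


class ToyNameNode:
    """Stores metadata only: which DataNode holds which block of a file."""
    def __init__(self):
        self.block_locations = {}  # filename -> [(block_index, datanode_name)]


def write_file(namenode, datanodes, filename, data):
    targets = cycle(datanodes)  # round-robin placement (an assumption)
    locations = []
    for index, block in enumerate(split_into_blocks(data)):
        node = next(targets)
        node.blocks[(filename, index)] = block
        locations.append((index, node.name))
    namenode.block_locations[filename] = locations


def read_file(namenode, datanodes, filename):
    by_name = {node.name: node for node in datanodes}
    return b"".join(
        by_name[name].blocks[(filename, index)]
        for index, name in namenode.block_locations[filename]
    )


namenode = ToyNameNode()
datanodes = [ToyDataNode("dn1"), ToyDataNode("dn2")]
write_file(namenode, datanodes, "notes.txt", b"hello hadoop")
print(namenode.block_locations["notes.txt"])
print(read_file(namenode, datanodes, "notes.txt"))
```

The point of the sketch is that a read first consults the NameNode's metadata and then fetches the block contents from the DataNodes; the NameNode itself never holds file data.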
____ - Yet Another Resource Negotiator: