Exam (elaborations)

Big Data Engineer

Pages
9
Grade
A
Uploaded on
07-02-2024
Written in
2023/2024

This document is intended for anyone seeking work prospects in Big Data. It contains the most frequently asked interview questions that I encountered between November 2023 and January 2024, covering Hadoop, Spark, and Hive.


Content preview

An easy way to crack the Big Data interview:
Topics covered:
1. Hadoop
2. Spark
3. Hive


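1. HADOOP
Q1. What is Hadoop, and why is it used?
Hadoop is an open-source framework for storing and processing large amounts of data across a cluster of machines.
It is used because it scales out on commodity hardware and keeps working when individual nodes fail, since data is
replicated across nodes.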

Q2. What are the main components of Hadoop?
Storage – HDFS
Batch processing – MapReduce
Resource Management – YARN
Q3. What is HDFS? What are the functions of the NameNode and the DataNode?
HDFS (Hadoop Distributed File System) is Hadoop's storage layer. Instead of keeping all data on a single node (machine),
HDFS splits files into blocks and distributes them across multiple nodes, with a default replication factor of 3.
It follows a master/slave topology.
The NameNode works as the master in a Hadoop cluster. Main functions performed by the NameNode:
1. Stores the metadata of the actual data (file names, permissions, and block locations).
2. Manages the file system namespace and executes operations such as opening, closing, and renaming files and directories.
3. Regulates client access to the actual file data.
4. Instructs the slaves (DataNodes) to create, replicate, or delete blocks.
The DataNode works as a slave in a Hadoop cluster. Main functions performed by the DataNode:
1. Stores the actual business data.
2. Acts as the worker node where client read and write requests are served. Processing tasks run on the same machines so that computation stays close to the data.
3. Upon instruction from the master, creates, replicates, and deletes data blocks.
4. Because all business data is stored on DataNodes, they need a large amount of storage.
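The storage cost of replication can be worked out with simple arithmetic. The sketch below assumes a block size of 128 MB (the usual default in Hadoop 2.x and later, but configurable per cluster) and the replication factor of 3 mentioned above. The function name `hdfs_footprint` is hypothetical and is not part of any Hadoop API.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (block count, total block replicas, raw storage in MB) for one file."""
    # HDFS splits a file into fixed-size blocks; the last block may be smaller.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Each block is stored `replication` times, on different DataNodes.
    replicas = blocks * replication
    # A partial last block only occupies its real size, so raw usage is
    # the file size multiplied by the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)  # 8 24 3000
```

Under these assumptions a 1000 MB file becomes 8 blocks, the DataNodes hold 24 block replicas in total, and the cluster spends about 3000 MB of raw disk on it.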
Q4. What happens to a NameNode that has no data?
A NameNode without data does not exist. Any working NameNode holds at least the file system metadata.


Q5. What happens if the NameNode fails?
• Since Hadoop 2.x, an HDFS cluster configured for high availability has two NameNodes: active and passive. The
active NameNode is the one that serves client requests in the cluster.
• The passive NameNode is also known as the Standby NameNode. It takes over only when the active NameNode
fails.
• Whenever the active NameNode fails, the Standby NameNode takes over its responsibilities and keeps HDFS up
and running. The Standby NameNode also takes the edit logs (the record of metadata changes) from the active
NameNode and merges them with the FsImage (file system image) to produce an updated FsImage and to
prevent the edit logs from becoming too large.
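The merge of edit logs into the FsImage can be pictured with a toy model. The sketch below is only an illustration: it assumes the namespace is a plain set of paths and the edit log is a list of tuples, whereas the real FsImage and edit logs are on-disk files managed by HDFS. The names `apply_edit` and `checkpoint` are hypothetical.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (a set of paths)."""
    op, *args = edit
    if op == "create":
        namespace.add(args[0])
    elif op == "delete":
        namespace.discard(args[0])
    elif op == "rename":
        src, dst = args
        if src in namespace:
            namespace.remove(src)
            namespace.add(dst)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Replay the edit log on a copy of the FsImage and return the new image."""
    new_image = set(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image


fsimage = {"/data/a.csv", "/data/b.csv"}
edit_log = [
    ("create", "/data/c.csv"),
    ("rename", "/data/a.csv", "/archive/a.csv"),
    ("delete", "/data/b.csv"),
]

new_fsimage = checkpoint(fsimage, edit_log)
edit_log = []  # merged edits no longer need replaying, so the log stays small
print(sorted(new_fsimage))  # ['/archive/a.csv', '/data/c.csv']
```

Replaying the log on a copy leaves the old image intact until the new one is complete, which is why the toy `checkpoint` returns a new set instead of mutating its input.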

Q6. What are the phases of MapReduce?
Map Phase:
• The input data is divided into smaller chunks called "splits."
• A "Mapper" function is applied to each split independently. The Mapper takes the input data and produces a
set of key-value pairs.
Shuffle and Sort Phase:
• The output key-value pairs from all Mappers are shuffled and sorted by key so that all values with the
same key are grouped together. This is essential for the subsequent Reduce phase.
Reduce Phase:
• The sorted key-value pairs are passed to a set of "Reducer" functions. Each Reducer receives a group of
key-value pairs with the same key.
• The Reducer processes this data and produces an output, typically aggregating or summarizing the values
associated with each key.
• The output of the Reduce phase is typically written to a storage system such as HDFS (Hadoop
Distributed File System).
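The three phases above can be imitated in a single Python process. This is a minimal sketch, not the Hadoop MapReduce API: word count is chosen here only as an illustration, and `mapper`, `reducer`, and `run_job` are hypothetical names. A real job runs Mappers and Reducers in parallel on different machines.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Reduce: aggregate all values that share the same key."""
    return key, sum(values)


def run_job(splits):
    # Map phase: apply the mapper to every line of every split.
    mapped = [pair for split in splits for line in split for pair in mapper(line)]
    # Shuffle and sort phase: order pairs by key so equal keys are adjacent.
    mapped.sort(key=itemgetter(0))
    grouped = groupby(mapped, key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reducer(key, (value for _, value in group)) for key, group in grouped]


splits = [["big data is big"], ["data is everywhere"]]
result = run_job(splits)
print(result)  # [('big', 2), ('data', 2), ('everywhere', 1), ('is', 2)]
```

The sort step matters: `groupby` only groups adjacent items, which mirrors why the framework must sort by key before handing data to the Reducers.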


2. SPARK
Q7. What are the features of Apache Spark?
• High Processing Speed
• In-Memory Computation
• Reusability
• Fault Tolerance
• Stream Processing
• Lazy Evaluation
• Support for Multiple Languages
• Hadoop Integration


Q8. What does DAG refer to in Apache Spark?
DAG stands for Directed Acyclic Graph: a graph with a finite number of vertices and edges and no directed cycles. Each
edge is directed from one vertex to a later one in the sequence. The vertices represent the RDDs of Spark, and the
edges represent the operations to be performed on those RDDs.
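To make the idea concrete, a lineage of datasets can be modelled as a small dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later), not Spark itself. The dataset names and the `has_cycle` helper are hypothetical and only stand in for RDDs and the operations between them.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a dataset; its value is the set of datasets it is derived from.
# An edge "lines -> words" stands for the operation that turns lines into words.
lineage = {
    "lines": set(),
    "words": {"lines"},
    "pairs": {"words"},
    "counts": {"pairs"},
}

# Because the graph has no cycles, a valid execution order always exists.
order = list(TopologicalSorter(lineage).static_order())
print(order)  # ['lines', 'words', 'pairs', 'counts']


def has_cycle(graph):
    """Return True if the dependency graph contains a directed cycle."""
    try:
        list(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True


print(has_cycle({"a": {"b"}, "b": {"a"}}))  # True
```

A graph with a cycle has no valid ordering at all, which is why the "acyclic" property is what lets a scheduler decide what to compute first.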
Q9. How is Apache Spark different from MapReduce?

MapReduce spark

Seller: rbabyshri
