NOTES
CHAPTER-1
INTRODUCTION:
Dawn of the Big Data Era:
The term "big data" was coined amid the explosive increase of global data and was mainly
used to describe these enormous datasets. Compared with traditional datasets, big data
generally includes masses of unstructured data that require more real-time analysis.
At present, big data has attracted considerable interest from industry, academia, and
government agencies. For example, issues on big data are often covered in public media,
including The Economist, The New York Times, and National Public Radio. The era of big data is
coming beyond all doubt. The rapid growth of cloud computing and the Internet of Things (IoT)
further promotes the sharp growth of data.
Definition and Features of Big Data:
Big data is an abstract concept. Apart from masses of data, it also has some other features,
which distinguish it from "massive data" or "very big data."
In 2010, Apache Hadoop defined big data as "datasets which could not be captured, managed,
and processed by general computers within an acceptable scope." On the basis of this definition,
in May 2011, McKinsey & Company, a global consulting agency, announced big data as "the Next
Frontier for Innovation, Competition, and Productivity": big data means datasets that cannot
be acquired, stored, and managed by classic database software. This definition carries two
connotations: first, the dataset volumes that qualify as big data are changing and may grow over
time or with technological advances; second, the dataset volumes that qualify as big data differ
from one application to another.
Features:
Big data is a collection of data from many different sources and is often described by five
characteristics: volume, value, variety, velocity, and veracity.
● Volume: the size and amount of big data that companies manage and analyze.
● Value: the most important "V" from the perspective of the business; the value of big data
usually comes from insight discovery and pattern recognition, which lead to more effective
operations, stronger customer relationships, and other clear and quantifiable business benefits.
● Variety: the diversity and range of different data types, including unstructured data,
semi-structured data, and raw data.
● Velocity: the speed at which companies receive, store, and manage data, e.g., the specific
number of social media posts or search queries received within a day, hour, or other unit of time.
● Veracity: the “truth” or accuracy of data and information assets, which often determines
executive-level confidence.
The additional characteristic of variability can also be considered:
● Variability: the changing nature of the data companies seek to capture, manage and
analyze – e.g., in sentiment or text analytics, changes in the meaning of key words or phrases.
The Development of Big Data:
Let's look at the development of big data. A project called Hadoop was born in 2005.
Hadoop is a very important technology in the field of big data: it provides a software framework
for distributed storage and processing of big data using the MapReduce programming model
(a minimal sketch of the model follows below). Many countries and research institutes around
the world have conducted pilot projects on Hadoop and achieved a series of results.
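To make the MapReduce model concrete, here is a minimal, single-process Python sketch of the classic word-count example. It only simulates the map, shuffle, and reduce phases in one process; it is not Hadoop's actual Java API, under which these phases would run distributed across a cluster.

# A minimal, single-process sketch of the MapReduce model (word count).
# In real Hadoop, these phases are distributed across many machines.
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word, as a mapper would.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one word, as a reducer would.
    return (key, sum(values))

documents = ["big data needs big systems", "data about data"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'needs': 1, 'systems': 1, 'about': 1}

The appeal of the model is that map and reduce are independent per key, so the framework can parallelize them freely over very large datasets.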
In 2011, EMC held a global summit on Cloud Meets Big Data, and in May of the same year,
McKinsey published a related research report. The so-called digital universe was expected to
contain 35 zettabytes of information within the next decade. EMC introduced what it called
"the EMC Big Data Stack," defining its view of how to store, manage, and act on the big data
coming downstream.
In December of the same year, China's Ministry of Industry and Information Technology
issued the 12th Five-Year Development Plan for the Internet of Things, under which China would
increase financial support for smart industry, smart agriculture, smart logistics, smart
transportation, smart grid, smart environmental protection, smart security, smart medical care,
and smart home. This plan represented an initial application of big data.
Between 2012 and 2015, many governments and organizations around the world, including
the United Nations, published a series of related ideas or outlines of action to promote the
development of big data. After that, big data entered a high-speed development phase, and the
13th Five-Year Development Plan for the big data industry was issued in China in 2017, by which
time big data had begun to be widely used and to develop at high speed worldwide.
Challenges of Big Data:
1) Data Representation: many datasets have certain levels of heterogeneity in type, structure,
semantics, organization, granularity, and accessibility. Data representation aims to make data
more meaningful for computer analysis and user interpretation. Nevertheless, an improper data
representation will reduce the value of the original data and may even obstruct effective data
analysis. Efficient data representation shall reflect data structure, class, and type, as well as
integrated technologies, so as to enable efficient operations on different datasets (a toy
sketch follows).
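As one toy illustration of the representation problem, the sketch below normalizes records that arrive with different structures and field names into a single common form. The sources, field names, and values here are invented for the example, not drawn from any real system.

# Sketch: unifying heterogeneous records into one common representation.
# The record shapes and field names are hypothetical examples.
def normalize(record):
    if "user_name" in record:                 # source A: flat fields
        return {"name": record["user_name"],
                "age": int(record["age"])}
    if "profile" in record:                   # source B: nested structure
        return {"name": record["profile"]["name"],
                "age": int(record["profile"]["age"])}
    raise ValueError("unknown record shape")

raw = [{"user_name": "Ada", "age": "36"},
       {"profile": {"name": "Lin", "age": 29}}]
print([normalize(r) for r in raw])
# Both records now share one structure and one set of types,
# so a single analysis pass can operate on them.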
2) Redundancy Reduction and Data Compression: generally, there is a high level of redundancy
in datasets. Redundancy reduction and data compression are effective in reducing the indirect
cost of the entire system, on the premise that the potential value of the data is not affected.
For example, most data generated by sensor networks is highly redundant and may be filtered
and compressed by orders of magnitude (a filtering sketch follows).
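The sensor example can be made concrete with a dead-band filter, one common redundancy-reduction technique: a reading is kept only when it differs from the last kept reading by more than a threshold. The readings and the 0.5-unit threshold below are invented for illustration.

# Sketch: dead-band filtering of redundant sensor readings.
# A reading is stored only if it moved more than `threshold`
# away from the last stored value; the data are invented.
def dead_band(readings, threshold=0.5):
    kept = [readings[0]]                # always keep the first reading
    for value in readings[1:]:
        if abs(value - kept[-1]) > threshold:
            kept.append(value)
    return kept

temps = [20.0, 20.1, 20.0, 20.2, 21.0, 21.1, 25.0, 25.1]
print(dead_band(temps))                 # [20.0, 21.0, 25.0]
# 8 raw readings shrink to 3 stored ones with bounded error:
# the redundancy is filtered out before storage.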
3) Data Life Cycle Management: compared with the relatively slow advances in storage systems,
pervasive sensors and computing are generating data at unprecedented rates and scales. We are
confronted with many pressing challenges, one of which is that current storage systems cannot
support such massive data (a retention-policy sketch follows).
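One standard response to this storage pressure is a life-cycle (retention) policy that moves data to cheaper tiers, or discards it, as it ages. The tier names and age thresholds in this sketch are assumptions for illustration, not a standard.

# Sketch: a simple age-based data life-cycle policy.
# Tier names and thresholds are illustrative assumptions.
def tier_for(age_days):
    if age_days <= 7:
        return "hot"        # fast, expensive storage
    if age_days <= 90:
        return "warm"       # slower, cheaper storage
    if age_days <= 365:
        return "cold"       # archival storage
    return "delete"         # value no longer justifies the cost

for age in (1, 30, 200, 400):
    print(age, "->", tier_for(age))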
4) Analytical Mechanism: the analytical system of big data shall process masses of
heterogeneous data within a limited time. However, traditional RDBMSs have a strict design that
lacks scalability and expandability, and they cannot meet the performance requirements.
Non-relational databases have shown unique advantages in the processing of unstructured data
and have started to become mainstream in big data analysis (a schema-flexibility sketch follows).
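The schema-flexibility point can be seen in miniature below: a document-style store accepts records whose fields differ, while a fixed relational table would force one schema up front. A plain Python dict stands in for a real non-relational database here; this is an analogy, not any particular product's API.

# Sketch: schema flexibility of a document-style (non-relational) store.
# A plain dict stands in for a real document database.
store = {}

def put(doc_id, document):
    store[doc_id] = document   # no fixed schema is enforced

# Two documents with different fields coexist without schema changes;
# a relational table would need ALTER TABLE or NULL-padded columns.
put("u1", {"name": "Ada", "tags": ["hadoop", "iot"]})
put("u2", {"name": "Lin", "location": "Beijing"})

# Queries tolerate missing fields instead of failing on schema.
print([d["name"] for d in store.values() if "tags" in d])  # ['Ada']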
5) Data Confidentiality: most big data service providers or owners at present cannot
effectively maintain and analyze such huge datasets because of their limited capacity. They must
rely on professionals or tools to analyze the data, which increases the potential security risks.
For example, a transactional dataset generally includes a set of complete operating data that
drives key business processes. Such data contains details of the lowest granularity and sensitive
information such as credit card numbers. Therefore, analysis of big data may be delivered to a
third party for processing only when proper preventive measures are taken to protect the
sensitive data and ensure its safety (a masking sketch follows).
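Before handing a transactional dataset to a third party, sensitive fields such as card numbers are typically masked or tokenized. The sketch below shows simple masking; the record layout and field names are invented for the example.

# Sketch: masking sensitive fields before third-party analysis.
# Record layout and field names are invented for the example.
def mask_card(number):
    # Keep only the last four digits, as on a printed receipt.
    return "*" * (len(number) - 4) + number[-4:]

def sanitize(transaction):
    safe = dict(transaction)
    safe["card_number"] = mask_card(safe["card_number"])
    return safe

tx = {"card_number": "4111111111111111", "amount": 25.0, "merchant": "M42"}
print(sanitize(tx))   # {'card_number': '************1111', ...}
# Analysis of amounts and merchants remains possible, but the
# lowest-granularity sensitive detail never leaves the owner.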
6) Energy Management: the energy consumption of mainframe computing systems has drawn
much attention from both economic and environmental perspectives. With the increase in data
volume and analytical demands, the processing, storage, and transmission of big data will
inevitably consume more and more electric energy. Therefore, system-level power consumption
control and management mechanisms shall be established for big data while both expandability
and accessibility are ensured.
7) Expandability and Scalability: the analytical system of big data must support present and
future datasets. The analytical algorithm must be able to process increasingly expanding and
ever more complex datasets (a streaming sketch follows).
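An algorithm that supports present and future datasets typically works incrementally, touching each record once and using constant memory. The streaming mean below is a minimal example of that design; the input range is arbitrary.

# Sketch: a streaming (one-pass, constant-memory) mean.
# The per-record cost does not grow with dataset size, so the
# same code keeps working as the data expands.
def streaming_mean(stream):
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count   # incremental update, no storage
    return mean

# Works the same whether the generator yields thousands
# or billions of values.
print(streaming_mean(float(i) for i in range(1, 1_000_001)))  # ~500000.5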