All about [What is Big Data]?
Data Engineering
Understanding Big Data Volume and Use Cases
During an interview with a banking company, the interviewer was not satisfied with the volume
of data the candidate was processing, which was 25 GB per day for one country. The candidate
explained that the problem was not the volume but the processing speed of the existing
technology. This scenario highlights the misconception that big data is only about volume.
The use case for big data depends on the previous technology, and the answer to why big data
is necessary lies in the limitations of that technology. For example, if an Oracle system cannot
process more than 10,000 TB of data, moving to big data becomes necessary once the volume
exceeds that limit.
Confidence in big data comes from understanding that the use case is not just about volume,
but also about velocity and processing speed. Big data technology encompasses a range of
solutions, including Hadoop, Spark, Kafka, Storm, Flume, and many more, developed to solve
different data problems across various layers.
Big data is not just a marketing name but the name of a problem: the technology was labelled
after the problem it solves, simply because no better name existed. It is similar to how
programming languages like C, C++, and Java carry their own names.
Introduction to Big Data
Big data is a complex field that involves various layers of technology, including storage,
processing, testing, visualization, analytics, machine learning, and artificial intelligence. These
layers are supported by different technologies such as databases, file systems, and processing
frameworks.
Data Layers
● Storage Layer: This layer involves technology for storing data, including databases
and file systems.
● Processing Layer: This layer involves technology for processing data, including
processing frameworks such as Informatica ETL.
● Testing Layer: This layer involves technology for testing data.
● Visualization Layer: This layer involves technology for visualizing data.
● Analytics: This layer involves technology for data science, machine learning, and
artificial intelligence.
● Automation: This layer involves technology for scheduling and automation.
Big data also involves various sub-projects that are supported by different groups of people and
companies. While some of these sub-projects are included in the initial releases of big data
technologies like Hadoop, others are added later.
History of Hadoop
Hadoop was created by Doug Cutting in the mid-2000s. It is an open-source technology that
originally included two projects, HDFS and MapReduce. The inspiration for Hadoop came from
two foundational papers published by Google in the early 2000s: the Google File System (GFS)
paper and the MapReduce paper. Hadoop was developed to store data across a cluster of
machines and process it in a parallel, distributed manner.
After creating Hadoop, Doug Cutting released it as open-source technology. Open-source
technologies are those whose source code is freely available to use and modify. Companies
can use open-source software at no cost, and they may fund its developers to provide support
and build new projects.
The Apache Software Foundation is a community that hosts open-source projects and
distributes them under the Apache License. Many large IT companies trust the Apache License
and monitor the Apache website for new source code. If they find a project they like, they may
fund its developers to create new projects and provide support.
Big Data System Configuration
Data Engineering
System Requirements for Learning Big Data
If you want to start learning about big data on your personal laptop, it is important to choose the
right system requirements. Here are some recommendations:
● Avoid using enterprise distributions like Cloudera or Hortonworks, as they require a
minimum of 10+ GB of RAM and may not run well on a laptop.
● Instead, opt for Apache's vanilla flavor of Hadoop and Spark, which you can
download and install directly from the internet.
● You will need a Linux operating system on top of Windows. You can install Linux in a
virtual machine using software like VMware and then install Hadoop and Spark inside it.
● For the Apache flavor, 4 to 6 GB of RAM is sufficient, and around 13 GB to 100 GB of
free hard disk space is enough.
● There is no need to purchase a new laptop or extra RAM unless you are using
Cloudera or Hortonworks.
Note that for real-world projects or interviews, it is not recommended to say that you use plain
Apache installations; preconfigured platforms like Cloudera or Hortonworks are more commonly
used there. For self-learning or course-based learning, however, Apache is recommended.
Although the installation process may be a bit cumbersome, it is a one-time process and you
will learn valuable skills from it. Plus, once installed, you can freely explore and learn without any
performance issues.
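As a quick sanity check after installation, you can run a small Spark job in local mode. The
following is a minimal sketch, assuming PySpark is installed (for example via pip install pyspark)
and that a text file named sample.txt exists in the current directory; these names are
illustrative and not part of the original notes.

# Minimal word-count job to confirm a local Spark installation works.
# Assumes PySpark is installed and sample.txt exists (illustrative names).
from pyspark.sql import SparkSession

# Start Spark in local mode, using all available cores on the laptop.
spark = (SparkSession.builder
         .appName("install-check")
         .master("local[*]")
         .getOrCreate())

# Classic word count: read the file, split lines into words, count occurrences.
counts = (spark.sparkContext.textFile("sample.txt")
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

# Print the first few (word, count) pairs.
for word, count in counts.take(10):
    print(word, count)

spark.stop()

If this prints word counts without errors, the Hadoop and Spark setup inside the virtual machine
is working.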
Unboxing [Hadoop Framework]
Data Engineering
What is Hadoop?
Hadoop is one of the solutions in the big data ecosystem, and it is made up of multiple components.
What is HDFS?
HDFS is the component in Hadoop for storing and distributing data, and its file-system commands
are similar to Linux commands.
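To illustrate the point that HDFS commands mirror familiar Linux commands (ls, mkdir, cat), here
is a minimal sketch, assuming a working Hadoop installation with the hdfs command on the PATH
and a local file named sample.txt; the paths and file names are hypothetical examples.

# Run a few HDFS shell commands from Python; the same commands can be typed
# directly in a terminal. Assumes `hdfs` is on the PATH (illustrative setup).
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` command and print its output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True)
    print(result.stdout or result.stderr)

hdfs("-mkdir", "-p", "/user/demo")         # like `mkdir -p` in Linux
hdfs("-put", "sample.txt", "/user/demo")   # copy a local file into HDFS
hdfs("-ls", "/user/demo")                  # like `ls`
hdfs("-cat", "/user/demo/sample.txt")      # like `cat`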