All about [What is Big Data]?
Data Engineering
Understanding Big Data Volume and Use Cases
During an interview with a banking company, the interviewer was not satisfied with the volume
of data the candidate was processing, which was 25 GB per day for one country. The candidate
explained that the problem was not the volume but the processing speed of the existing
technology. This scenario highlights the misconception that big data is only about volume.
The use case for big data depends on the previous technology, and the answer to why big data
is necessary lies in the limitations of that technology. For example, if an Oracle system cannot
process more than 10,000 TB of data, moving to big data becomes necessary once the volume
exceeds that limit.
Confidence in big data comes from understanding that the use case is not just about volume,
but also about velocity and processing speed. Big data technology encompasses a range of
solutions, including Hadoop, Spark, Kafka, Storm, Flume, and many more, developed to solve
different data problems across various layers.
Big data is not just a marketing name but the name of a problem: the technology was labelled
after the problem it solves, simply because no better name existed. It is similar to how
programming languages like C, C++, and Java carry their own names.
Introduction to Big Data
Big data is a complex field that involves various layers of technology, including storage,
processing, testing, visualization, analytics, machine learning, and artificial intelligence. These
layers are supported by different technologies such as databases, file systems, and processing
frameworks.
Data Layers
● Storage Layer: This layer involves technology for storing data, including databases
and file systems.
● Processing Layer: This layer involves technology for processing data, including
processing frameworks such as Informatica ETL.
● Testing Layer: This layer involves technology for testing data.
● Visualization Layer: This layer involves technology for visualizing data.
● Analytics: This layer involves technology for data science, machine learning, and
artificial intelligence.
● Automation: This layer involves technology for scheduling and automation.
Big data also involves various sub-projects that are supported by different groups of people and
companies. While some of these sub-projects are included in the initial releases of big data
technologies like Hadoop, others are added later.
History of Hadoop
Hadoop was created by Doug Cutting in the mid-2000s. It is an open-source technology that
originally included two projects, HDFS and MapReduce. The inspiration for Hadoop came from
two foundational papers published by Google in the early 2000s: the Google File System (GFS)
paper and the MapReduce paper. Hadoop was developed to store data across a cluster of
machines and process it in a parallel, distributed manner.
After creating Hadoop, Doug Cutting released it as open-source technology. Open-source
technologies are those whose source code is freely available to use and modify. Companies
can use open-source software at no cost, and they may fund its developers to provide support
and build new projects.
The Apache Software Foundation is a community that hosts open-source projects and
distributes them under the Apache License. Many large IT companies trust the Apache License
and monitor the Apache website for new source code. If they find a project they like, they may
fund its developers to create new projects and provide support.
Big Data System Configuration
Data Engineering
System Requirements for Learning Big Data
If you want to start learning about big data on your personal laptop, it is important to choose the
right system requirements. Here are some recommendations:
● Avoid using enterprise distributions like Cloudera or Hortonworks, as they require a
minimum of 10+ GB of RAM and may not run well on a laptop.
● Instead, opt for Apache's vanilla flavor of Hadoop and Spark, which you can
download and install directly from the internet.
● You will need a Linux operating system on top of Windows. You can install Linux in a
virtual machine using software like VMware and then install Hadoop and Spark inside it.
● For the Apache flavor, 4 to 6 GB of RAM is sufficient, and around 13 GB to 100 GB of
free hard disk space is enough.
● There is no need to purchase a new laptop or extra RAM unless you are using
Cloudera or Hortonworks.
Note that for real-world projects or interviews, it is not recommended to say that you use plain
Apache installations; preconfigured platforms like Cloudera or Hortonworks are more commonly
used there. For self-learning or course-based learning, however, Apache is recommended.
Although the installation process may be a bit cumbersome, it is a one-time process and you
will learn valuable skills from it. Plus, once installed, you can freely explore and learn without any
performance issues.
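As a quick sanity check after installation, you can run a small Spark job in local mode. The
following is a minimal sketch, assuming PySpark is installed (for example via pip install pyspark)
and that a text file named sample.txt exists in the current directory; these names are
illustrative and not part of the original notes.

# Minimal word-count job to confirm a local Spark installation works.
# Assumes PySpark is installed and sample.txt exists (illustrative names).
from pyspark.sql import SparkSession

# Start Spark in local mode, using all available cores on the laptop.
spark = (SparkSession.builder
         .appName("install-check")
         .master("local[*]")
         .getOrCreate())

# Classic word count: read the file, split lines into words, count occurrences.
counts = (spark.sparkContext.textFile("sample.txt")
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

# Print the first few (word, count) pairs.
for word, count in counts.take(10):
    print(word, count)

spark.stop()

If this prints word counts without errors, the Hadoop and Spark setup inside the virtual machine
is working.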
Unboxing [Hadoop Framework]
Data Engineering
What is Hadoop?
Hadoop is one of the solutions in the big data ecosystem, and it is made up of multiple components.
What is HDFS?
HDFS is the component in Hadoop for storing and distributing data, and its file-system commands
are similar to Linux commands.
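To illustrate the point that HDFS commands mirror familiar Linux commands (ls, mkdir, cat), here
is a minimal sketch, assuming a working Hadoop installation with the hdfs command on the PATH
and a local file named sample.txt; the paths and file names are hypothetical examples.

# Run a few HDFS shell commands from Python; the same commands can be typed
# directly in a terminal. Assumes `hdfs` is on the PATH (illustrative setup).
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` command and print its output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True)
    print(result.stdout or result.stderr)

hdfs("-mkdir", "-p", "/user/demo")         # like `mkdir -p` in Linux
hdfs("-put", "sample.txt", "/user/demo")   # copy a local file into HDFS
hdfs("-ls", "/user/demo")                  # like `ls`
hdfs("-cat", "/user/demo/sample.txt")      # like `cat`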