NOTES
CHAPTER-1
INTRODUCTION:
Dawn of the Big Data Era:
The term "big data" was coined amid the explosive increase of global data and was mainly
used to describe these enormous datasets. Compared with traditional datasets, big data
generally includes masses of unstructured data that require more real-time analysis.
At present, big data has attracted considerable interest from industry, academia, and
government agencies. For example, issues on big data are often covered in public media,
including The Economist, The New York Times, and National Public Radio. The era of big data is
coming beyond all doubt. The rapid growth of cloud computing and the Internet of Things (IoT)
further promotes the sharp growth of data.
Definition and Features of Big Data:
Big data is an abstract concept. Apart from masses of data, it also has some other features,
which distinguish it from "massive data" or "very big data."
In 2010, Apache Hadoop defined big data as "datasets which could not be captured, managed,
and processed by general computers within an acceptable scope." On the basis of this definition,
in May 2011, McKinsey & Company, a global consulting agency, announced big data as "the Next
Frontier for Innovation, Competition, and Productivity": big data means datasets that cannot
be acquired, stored, and managed by classic database software. This definition carries two
connotations: first, the dataset volumes that qualify as big data are changing and may grow over
time or with technological advances; second, the dataset volumes that qualify as big data differ
from one application to another.
Features:
Big data is a collection of data from many different sources and is often described by five
characteristics: volume, value, variety, velocity, and veracity.
● Volume: the size and amount of big data that companies manage and analyze.
● Value: the most important "V" from the perspective of the business; the value of big data
usually comes from insight discovery and pattern recognition, which lead to more effective
operations, stronger customer relationships, and other clear and quantifiable business benefits.
● Variety: the diversity and range of different data types, including unstructured data,
semi-structured data, and raw data.
● Velocity: the speed at which companies receive, store, and manage data, e.g., the specific
number of social media posts or search queries received within a day, hour, or other unit of time.
● Veracity: the “truth” or accuracy of data and information assets, which often determines
executive-level confidence.
The additional characteristic of variability can also be considered:
● Variability: the changing nature of the data companies seek to capture, manage and
analyze – e.g., in sentiment or text analytics, changes in the meaning of key words or phrases.
The Development of Big Data:
Let's look at the development of big data. A project called Hadoop was born in 2005.
Hadoop is a very important technology in the field of big data: it provides a software framework
for distributed storage and processing of big data using the MapReduce programming model
(a minimal sketch of the model follows below). Many countries and research institutes around
the world have conducted pilot projects on Hadoop and achieved a series of results.
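To make the MapReduce model concrete, here is a minimal, single-process Python sketch of the classic word-count example. It only simulates the map, shuffle, and reduce phases in one process; it is not Hadoop's actual Java API, under which these phases would run distributed across a cluster.

# A minimal, single-process sketch of the MapReduce model (word count).
# In real Hadoop, these phases are distributed across many machines.
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word, as a mapper would.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one word, as a reducer would.
    return (key, sum(values))

documents = ["big data needs big systems", "data about data"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'needs': 1, 'systems': 1, 'about': 1}

The appeal of the model is that map and reduce are independent per key, so the framework can parallelize them freely over very large datasets.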
In 2011, EMC held a global summit on Cloud Meets Big Data, and in May of the same year,
McKinsey published a related research report. The so-called digital universe was expected to
contain 35 zettabytes of information within the next decade. EMC introduced what it called
"the EMC Big Data Stack," defining its view of how to store, manage, and act on the big data
coming downstream.
In December of the same year, China's Ministry of Industry and Information Technology
issued the 12th Five-Year Development Plan for the Internet of Things, under which China would
increase financial support for smart industry, smart agriculture, smart logistics, smart
transportation, smart grid, smart environmental protection, smart security, smart medical care,
and smart home. This plan represented an initial application of big data.
Between 2012 and 2015, many governments and organizations around the world, including
the United Nations, published a series of related ideas or outlines of action to promote the
development of big data. After that, big data entered a high-speed development phase, and the
13th Five-Year Development Plan for the big data industry was issued in China in 2017, by which
time big data had begun to be widely used and to develop at high speed worldwide.
Challenges of Big Data:
1) Data Representation: many datasets have certain levels of heterogeneity in type, structure,
semantics, organization, granularity, and accessibility. Data representation aims to make data
more meaningful for computer analysis and user interpretation. Nevertheless, an improper data
representation will reduce the value of the original data and may even obstruct effective data
analysis. Efficient data representation shall reflect data structure, class, and type, as well as
integrated technologies, so as to enable efficient operations on different datasets (a toy
sketch follows).
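As one toy illustration of the representation problem, the sketch below normalizes records that arrive with different structures and field names into a single common form. The sources, field names, and values here are invented for the example, not drawn from any real system.

# Sketch: unifying heterogeneous records into one common representation.
# The record shapes and field names are hypothetical examples.
def normalize(record):
    if "user_name" in record:                 # source A: flat fields
        return {"name": record["user_name"],
                "age": int(record["age"])}
    if "profile" in record:                   # source B: nested structure
        return {"name": record["profile"]["name"],
                "age": int(record["profile"]["age"])}
    raise ValueError("unknown record shape")

raw = [{"user_name": "Ada", "age": "36"},
       {"profile": {"name": "Lin", "age": 29}}]
print([normalize(r) for r in raw])
# Both records now share one structure and one set of types,
# so a single analysis pass can operate on them.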
2) Redundancy Reduction and Data Compression: generally, there is a high level of redundancy
in datasets. Redundancy reduction and data compression are effective in reducing the indirect
cost of the entire system, on the premise that the potential value of the data is not affected.
For example, most data generated by sensor networks is highly redundant and may be filtered
and compressed by orders of magnitude (a filtering sketch follows).
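The sensor example can be made concrete with a dead-band filter, one common redundancy-reduction technique: a reading is kept only when it differs from the last kept reading by more than a threshold. The readings and the 0.5-unit threshold below are invented for illustration.

# Sketch: dead-band filtering of redundant sensor readings.
# A reading is stored only if it moved more than `threshold`
# away from the last stored value; the data are invented.
def dead_band(readings, threshold=0.5):
    kept = [readings[0]]                # always keep the first reading
    for value in readings[1:]:
        if abs(value - kept[-1]) > threshold:
            kept.append(value)
    return kept

temps = [20.0, 20.1, 20.0, 20.2, 21.0, 21.1, 25.0, 25.1]
print(dead_band(temps))                 # [20.0, 21.0, 25.0]
# 8 raw readings shrink to 3 stored ones with bounded error:
# the redundancy is filtered out before storage.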
3) Data Life Cycle Management: compared with the relatively slow advances in storage systems,
pervasive sensors and computing are generating data at unprecedented rates and scales. We are
confronted with many pressing challenges, one of which is that current storage systems cannot
support such massive data (a retention-policy sketch follows).
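One standard response to this storage pressure is a life-cycle (retention) policy that moves data to cheaper tiers, or discards it, as it ages. The tier names and age thresholds in this sketch are assumptions for illustration, not a standard.

# Sketch: a simple age-based data life-cycle policy.
# Tier names and thresholds are illustrative assumptions.
def tier_for(age_days):
    if age_days <= 7:
        return "hot"        # fast, expensive storage
    if age_days <= 90:
        return "warm"       # slower, cheaper storage
    if age_days <= 365:
        return "cold"       # archival storage
    return "delete"         # value no longer justifies the cost

for age in (1, 30, 200, 400):
    print(age, "->", tier_for(age))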
4) Analytical Mechanism: the analytical system of big data shall process masses of
heterogeneous data within a limited time. However, traditional RDBMSs have a strict design that
lacks scalability and expandability, and they cannot meet the performance requirements.
Non-relational databases have shown unique advantages in the processing of unstructured data
and have started to become mainstream in big data analysis (a schema-flexibility sketch follows).
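The schema-flexibility point can be seen in miniature below: a document-style store accepts records whose fields differ, while a fixed relational table would force one schema up front. A plain Python dict stands in for a real non-relational database here; this is an analogy, not any particular product's API.

# Sketch: schema flexibility of a document-style (non-relational) store.
# A plain dict stands in for a real document database.
store = {}

def put(doc_id, document):
    store[doc_id] = document   # no fixed schema is enforced

# Two documents with different fields coexist without schema changes;
# a relational table would need ALTER TABLE or NULL-padded columns.
put("u1", {"name": "Ada", "tags": ["hadoop", "iot"]})
put("u2", {"name": "Lin", "location": "Beijing"})

# Queries tolerate missing fields instead of failing on schema.
print([d["name"] for d in store.values() if "tags" in d])  # ['Ada']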
5) Data Confidentiality: most big data service providers or owners at present cannot
effectively maintain and analyze such huge datasets because of their limited capacity. They must
rely on professionals or tools to analyze the data, which increases the potential security risks.
For example, a transactional dataset generally includes a set of complete operating data that
drives key business processes. Such data contains details of the lowest granularity and sensitive
information such as credit card numbers. Therefore, analysis of big data may be delivered to a
third party for processing only when proper preventive measures are taken to protect the
sensitive data and ensure its safety (a masking sketch follows).
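Before handing a transactional dataset to a third party, sensitive fields such as card numbers are typically masked or tokenized. The sketch below shows simple masking; the record layout and field names are invented for the example.

# Sketch: masking sensitive fields before third-party analysis.
# Record layout and field names are invented for the example.
def mask_card(number):
    # Keep only the last four digits, as on a printed receipt.
    return "*" * (len(number) - 4) + number[-4:]

def sanitize(transaction):
    safe = dict(transaction)
    safe["card_number"] = mask_card(safe["card_number"])
    return safe

tx = {"card_number": "4111111111111111", "amount": 25.0, "merchant": "M42"}
print(sanitize(tx))   # {'card_number': '************1111', ...}
# Analysis of amounts and merchants remains possible, but the
# lowest-granularity sensitive detail never leaves the owner.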
6) Energy Management: the energy consumption of mainframe computing systems has drawn
much attention from both economic and environmental perspectives. With the increase in data
volume and analytical demands, the processing, storage, and transmission of big data will
inevitably consume more and more electric energy. Therefore, system-level power consumption
control and management mechanisms shall be established for big data while both expandability
and accessibility are ensured.
7) Expandability and Scalability: the analytical system of big data must support present and
future datasets. The analytical algorithm must be able to process increasingly expanding and
ever more complex datasets (a streaming sketch follows).
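An algorithm that supports present and future datasets typically works incrementally, touching each record once and using constant memory. The streaming mean below is a minimal example of that design; the input range is arbitrary.

# Sketch: a streaming (one-pass, constant-memory) mean.
# The per-record cost does not grow with dataset size, so the
# same code keeps working as the data expands.
def streaming_mean(stream):
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count   # incremental update, no storage
    return mean

# Works the same whether the generator yields thousands
# or billions of values.
print(streaming_mean(float(i) for i in range(1, 1_000_001)))  # ~500000.5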