Virtualization and Data Lakes
As long as humankind has existed, knowledge has been spread out across the world. In ancient
times, one had to go to Delphi to see the oracle, to Egypt for the construction knowledge necessary
to build the pyramids, and to Babylon for the irrigation science to build the Hanging Gardens. If
someone had a simple question, such as, “Were the engineering practices used in building the
pyramids and Hanging Gardens similar?” no answer could be obtained in less than a decade. The
person would have to spend years learning the languages used in Babylon and Egypt in order to
converse with the locals who knew the history of these structures; learn enough about the domains
to express their questions in such a way that the answers would reveal whether the engineering
practices were similar; and then travel to these locations and figure out who were the correct locals
to ask.
In modern times, data is no less spread out than it used to be. Important knowledge—capable of
answering many of the world’s most important questions—remains dispersed across the entire
world. Even within a single organization, data is typically spread across many different locations.
Data generally starts off being located where it was generated, and a large amount of energy is
needed to overcome the inertia to move it to a new location. The more locations data is generated
in, the more locations data is found in.
Although it no longer takes a decade to answer basic questions, it still takes a surprisingly long
time. This is true even though we are now able to transfer data across the world at the speed of
light: an enormous amount of information can be accessed from halfway across the earth in just a
few seconds. Furthermore, computers can now consume and process billions of bytes of data in
less than a second. Answering almost any question should be nearly instantaneous, yet in practice
it is not. Why does it take so long to answer basic questions—even today?
The answer to this question is that many of the obstacles to answering a question about the
pyramids and Hanging Gardens in ancient times still exist today. Language was a barrier then. It
is still a barrier today. Learning enough about the semantics of the domain was a barrier then. It is
still a barrier today. Figuring out who to ask (i.e., where to find the needed data) was a barrier then.
It is still a barrier today. So while travel time is billions of times faster now than it was then, and
processing time is billions of times faster now than it was then, this only benefits us to the extent
that these parts of analyzing data are no longer the bottleneck. Rather, it is these other barriers that
prevent us from efficiently getting answers to questions we have about datasets.
Language is a barrier beyond the fact that one dataset may be in English, another in Chinese, and
another in Greek. Even if they are all in English, the computing system that stores the data may
require questions to be posed in different languages in order to extract or answer questions about
these datasets. One system may have an SQL interface, another GraphQL, and a third system may
support only text search. The client who wishes to pose a question to these differing systems needs
to learn the language that the system supports as its interface. Further, the client needs to
understand enough about the semantics of how data is stored in a source system to pose a question
coherently. Is data stored in tables, a graph, or flat files? If tables, what do the rows and columns
correspond to? What are the types of each column (integers, strings, text)? Does a column for a
particular table refer to the same real-world entity as a column from a different table? Furthermore,
where can I even find a dataset related to my question? I know there is a useful dataset in Spain.
Is there also one in Belgium? Egypt? Japan? How do I discover what is out there and what I have
access to? And how do I request permission to access something I do not currently have access
to?
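To make the interface barrier concrete, consider how one logical question must be rephrased for each source. The sketch below is purely illustrative: the dataset name, field names, and query shapes are invented, not taken from any real system.

```python
# The same logical question -- "which documents mention irrigation?" --
# must be phrased differently for each source's interface.
# All table names and fields below are invented for illustration.

def queries_for(keyword):
    """Return the per-source phrasings of one keyword question."""
    return {
        # Source A exposes a SQL interface:
        "sql": f"SELECT title FROM documents WHERE body LIKE '%{keyword}%';",
        # Source B exposes a GraphQL interface:
        "graphql": ('query { documents(filter: {body_contains: "%s"}) '
                    '{ title } }' % keyword),
        # Source C supports only free-text search -- a bare keyword:
        "text": keyword,
    }
```

A client without a DV System must learn all three phrasings; a DV System's job is to accept one and translate it into the others behind the scenes.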
The goal of data virtualization (DV) is to eliminate or alleviate these remaining barriers. A DV System
provides a central interface through which data can be accessed no matter where it is located, no matter
how it is stored, and no matter how it is organized. The system does not physically move the data
to a central location. Rather, the data exists there virtually. A user of the system is given the
impression that all data is in one place, even though in reality it may be spread across the world.
Furthermore, the user is presented with information about what datasets exist, how they are
organized, and enough of the semantic details of each dataset to be able to formulate queries over
them. The user can then issue commands that access any dataset virtualized by the system without
needing to know any of the physical details regarding where data is located, which systems are
being used to store it, and how the data is compressed or organized in storage.
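The idea of presenting dispersed sources behind one interface can be sketched in a few lines. Everything below is a toy illustration under invented names: real DV Systems route declarative queries to remote systems, whereas this sketch simply maps virtual dataset names to fetch functions.

```python
# A toy data virtualization facade. The client asks for rows from a
# virtual dataset name; the facade routes the request to whichever
# source actually holds the data. The class name, source names, and
# handler interface are all invented for illustration.

class DVFacade:
    def __init__(self):
        # virtual name -> callable that fetches rows from the real source
        self._sources = {}

    def register(self, virtual_name, fetch_fn):
        """Register an underlying source under a virtual dataset name."""
        self._sources[virtual_name] = fetch_fn

    def datasets(self):
        """What the client sees: one flat namespace of dataset names."""
        return sorted(self._sources)

    def scan(self, virtual_name):
        """Fetch rows without the client knowing where the data lives."""
        return self._sources[virtual_name]()

# Two "remote" sources, simulated here as in-memory data:
def fetch_sales():       # e.g. a SQL database in one region
    return [{"year": 2023, "total": 10}]

def fetch_clicks():      # e.g. files in a data lake elsewhere
    return [{"page": "home", "clicks": 42}]

dv = DVFacade()
dv.register("sales", fetch_sales)
dv.register("clicks", fetch_clicks)
```

The client calls `dv.scan("sales")` the same way regardless of whether the data sits in a nearby database or a data lake on another continent; only the registered fetch function knows the physical details.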
The convenience of a properly functioning DV System is obvious. We will see in Chapter 2 that
there are many challenges in building such systems, and indeed many DV Systems fall
significantly short of the promise we have described. Nonetheless, when they work as intended (as
we illustrate in Chapter 6), they are extremely powerful and can dramatically broaden the scope of
a data analysis task while also accelerating the analysis process in a variety of ways, including by
reducing the human effort involved in bringing together datasets for analysis.
A Quick Overview of Data Virtualization
System Architecture
Figure 1-1 shows a high-level architecture of a DV System, the core software that implements data
virtualization, which we’ll discuss in detail in Chapter 3. The system itself is fairly lightweight—
it does not store any data locally except for a cache of recently accessed data and query results
(see Chapter 4). Instead, it is configured to access a set of underlying data sources that are
virtualized by the system such that a client is given the impression that these data sources are local
to the DV System, even though in reality they are separate systems that are potentially
geographically dispersed and distant from the DV System. In Figure 1-1, three underlying data
sources are shown, two of which are separate data management systems that may contain their
own interfaces and query access capabilities for datasets stored within. The third is a data lake,
which may store datasets as files in a distributed filesystem.
Figure 1-1. A high-level DV System architecture
The existence of these underlying data source systems is registered in a catalog that is viewable by
clients of the DV System. Furthermore, the catalog usually contains information regarding the
semantics of the datasets stored in the underlying data systems—what exactly is the data stored in
these datasets, how this data was collected, what real-world entities are referred to in the dataset,
and how the datasets are organized—in tables, graphs, or hierarchical structures. In addition, the
schema of these structures is defined. All of this information allows clients to be aware of what
datasets exist, what important metadata these datasets have (including statistical information about
the data contained within the datasets), and how to express coherent requests to access them.
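One way to picture a catalog entry is as a structured record combining location, organization, schema, and statistics. The fields below are illustrative assumptions, not the schema of any particular DV System's catalog.

```python
# A hypothetical catalog entry describing one virtualized dataset.
# Every field name here is invented for illustration; real catalogs
# differ, but typically record where the data lives, how it is
# organized, its schema, and summary statistics.
catalog_entry = {
    "name": "sales",
    "source": {"system": "postgres", "location": "eu-west"},
    "organization": "table",            # table, graph, or hierarchy
    "schema": [
        {"column": "year", "type": "integer"},
        {"column": "total", "type": "decimal"},
    ],
    "stats": {"row_count": 1_000_000},  # statistical metadata
    "description": "Quarterly sales totals collected from regional ERPs.",
}

def columns(entry):
    """List the column names a client can use when formulating a query."""
    return [c["column"] for c in entry["schema"]]
```

With entries like this, a client can discover that the dataset exists, see how to reference its columns coherently, and gauge its size before ever touching the underlying source.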
The most complex part of a DV System is the data virtualization engine (DV Engine), which
receives requests from clients (generated using the client interface) and performs whatever
processing is required for these requests. This typically involves communication with the specific
underlying data sources that contain data relevant to those requests. The DV Engine thus needs to
know how to communicate with a variety of different types of systems that may store data that is