Virtualization and Data Lakes
As long as humankind has existed, knowledge has been spread out across the world. In ancient
times, one had to go to Delphi to see the oracle, to Egypt for the construction knowledge necessary
to build the pyramids, and to Babylon for the irrigation science to build the Hanging Gardens. If
someone had a simple question, such as, “Were the engineering practices used in building the
pyramids and Hanging Gardens similar?” no answer could be obtained in less than a decade. The
person would have to spend years learning the languages used in Babylon and Egypt in order to
converse with the locals who knew the history of these structures; learn enough about the domains
to express their questions in such a way that the answers would reveal whether the engineering
practices were similar; and then travel to these locations and figure out who were the correct locals
to ask.
In modern times, data is no less spread out than it used to be. Important knowledge—capable of
answering many of the world’s most important questions—remains dispersed across the entire
world. Even within a single organization, data is typically spread across many different locations.
Data generally starts off being located where it was generated, and a large amount of energy is
needed to overcome the inertia to move it to a new location. The more locations data is generated
in, the more locations data is found in.
Although it no longer takes a decade to answer basic questions, it still takes a surprisingly long
time. This is true even though we are now able to transfer data across the world at the speed of
light: an enormous amount of information can be accessed from halfway across the earth in just a
few seconds. Furthermore, computers can now consume and process billions of bytes of data in
less than a second. Answering almost any question should be nearly instantaneous, yet in practice
it is not. Why does it take so long to answer basic questions—even today?
The answer to this question is that many of the obstacles to answering a question about the
pyramids and Hanging Gardens in ancient times still exist today. Language was a barrier then. It
is still a barrier today. Learning enough about the semantics of the domain was a barrier then. It is
still a barrier today. Figuring out who to ask (i.e., where to find the needed data) was a barrier then.
It is still a barrier today. So while travel time is billions of times faster now than it was then, and
processing time is billions of times faster now than it was then, this only benefits us to the extent
that these parts of analyzing data are no longer the bottleneck. Rather, it is these other barriers that
prevent us from efficiently getting answers to questions we have about datasets.
Language is a barrier beyond the fact that one dataset may be in English, another in Chinese, and
another in Greek. Even if they are all in English, the computing system that stores the data may
require questions to be posed in different languages in order to extract or answer questions about
these datasets. One system may have an SQL interface, another GraphQL, and a third system may
support only text search. The client who wishes to pose a question to these differing systems needs
to learn the language that the system supports as its interface. Further, the client needs to
understand enough about the semantics of how data is stored in a source system to pose a question
coherently. Is data stored in tables, a graph, or flat files? If tables, what do the rows and columns
correspond to? What are the types of each column (integers, strings, text)? Does a column for a
particular table refer to the same real-world entity as a column from a different table? Furthermore,
where can I even find a dataset related to my question? I know there is a useful dataset in Spain.
Is there also one in Belgium? Egypt? Japan? How do I discover what is out there and what I have
access to? And how do I request permission to access something I do not currently have access
to?
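To make the interface barrier concrete, consider how one logical question must be rephrased for each source. The sketch below is purely illustrative: the dataset name, field names, and query shapes are invented, not taken from any real system.

```python
# The same logical question -- "which documents mention irrigation?" --
# must be phrased differently for each source's interface.
# All table names and fields below are invented for illustration.

def queries_for(keyword):
    """Return the per-source phrasings of one keyword question."""
    return {
        # Source A exposes a SQL interface:
        "sql": f"SELECT title FROM documents WHERE body LIKE '%{keyword}%';",
        # Source B exposes a GraphQL interface:
        "graphql": ('query { documents(filter: {body_contains: "%s"}) '
                    '{ title } }' % keyword),
        # Source C supports only free-text search -- a bare keyword:
        "text": keyword,
    }
```

A client without a DV System must learn all three phrasings; a DV System's job is to accept one and translate it into the others behind the scenes.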
The goal of data virtualization (DV) is to eliminate or alleviate these remaining barriers. A DV System
provides a central interface through which data can be accessed no matter where it is located, no matter
how it is stored, and no matter how it is organized. The system does not physically move the data
to a central location. Rather, the data exists there virtually. A user of the system is given the
impression that all data is in one place, even though in reality it may be spread across the world.
Furthermore, the user is presented with information about what datasets exist, how they are
organized, and enough of the semantic details of each dataset to be able to formulate queries over
them. The user can then issue commands that access any dataset virtualized by the system without
needing to know any of the physical details regarding where data is located, which systems are
being used to store it, and how the data is compressed or organized in storage.
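The idea of presenting dispersed sources behind one interface can be sketched in a few lines. Everything below is a toy illustration under invented names: real DV Systems route declarative queries to remote systems, whereas this sketch simply maps virtual dataset names to fetch functions.

```python
# A toy data virtualization facade. The client asks for rows from a
# virtual dataset name; the facade routes the request to whichever
# source actually holds the data. The class name, source names, and
# handler interface are all invented for illustration.

class DVFacade:
    def __init__(self):
        # virtual name -> callable that fetches rows from the real source
        self._sources = {}

    def register(self, virtual_name, fetch_fn):
        """Register an underlying source under a virtual dataset name."""
        self._sources[virtual_name] = fetch_fn

    def datasets(self):
        """What the client sees: one flat namespace of dataset names."""
        return sorted(self._sources)

    def scan(self, virtual_name):
        """Fetch rows without the client knowing where the data lives."""
        return self._sources[virtual_name]()

# Two "remote" sources, simulated here as in-memory data:
def fetch_sales():       # e.g. a SQL database in one region
    return [{"year": 2023, "total": 10}]

def fetch_clicks():      # e.g. files in a data lake elsewhere
    return [{"page": "home", "clicks": 42}]

dv = DVFacade()
dv.register("sales", fetch_sales)
dv.register("clicks", fetch_clicks)
```

The client calls `dv.scan("sales")` the same way regardless of whether the data sits in a nearby database or a data lake on another continent; only the registered fetch function knows the physical details.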
The convenience of a properly functioning DV System is obvious. We will see in Chapter 2 that
there are many challenges in building such systems, and indeed many DV Systems fall
significantly short of the promise we have described. Nonetheless, when they work as intended (as
we illustrate in Chapter 6), they are extremely powerful and can dramatically broaden the scope of
a data analysis task while also accelerating the analysis process in a variety of ways, including by
reducing the human effort involved in bringing together datasets for analysis.
A Quick Overview of Data Virtualization
System Architecture
Figure 1-1 shows a high-level architecture of a DV System, the core software that implements data
virtualization, which we’ll discuss in detail in Chapter 3. The system itself is fairly lightweight—
it does not store any data locally except for a cache of recently accessed data and query results
(see Chapter 4). Instead, it is configured to access a set of underlying data sources that are
virtualized by the system such that a client is given the impression that these data sources are local
to the DV System, even though in reality they are separate systems that are potentially
geographically dispersed and distant from the DV System. In Figure 1-1, three underlying data
sources are shown, two of which are separate data management systems that may contain their
own interfaces and query access capabilities for datasets stored within. The third is a data lake,
which may store datasets as files in a distributed filesystem.
Figure 1-1. A high-level DV System architecture
The existence of these underlying data source systems is registered in a catalog that is viewable by
clients of the DV System. Furthermore, the catalog usually contains information regarding the
semantics of the datasets stored in the underlying data systems—what exactly is the data stored in
these datasets, how this data was collected, what real-world entities are referred to in the dataset,
and how the datasets are organized—in tables, graphs, or hierarchical structures. In addition, the
schema of these structures is defined. All of this information allows clients to be aware of what
datasets exist, what important metadata these datasets have (including statistical information about
the data contained within the datasets), and how to express coherent requests to access them.
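One way to picture a catalog entry is as a structured record combining location, organization, schema, and statistics. The fields below are illustrative assumptions, not the schema of any particular DV System's catalog.

```python
# A hypothetical catalog entry describing one virtualized dataset.
# Every field name here is invented for illustration; real catalogs
# differ, but typically record where the data lives, how it is
# organized, its schema, and summary statistics.
catalog_entry = {
    "name": "sales",
    "source": {"system": "postgres", "location": "eu-west"},
    "organization": "table",            # table, graph, or hierarchy
    "schema": [
        {"column": "year", "type": "integer"},
        {"column": "total", "type": "decimal"},
    ],
    "stats": {"row_count": 1_000_000},  # statistical metadata
    "description": "Quarterly sales totals collected from regional ERPs.",
}

def columns(entry):
    """List the column names a client can use when formulating a query."""
    return [c["column"] for c in entry["schema"]]
```

With entries like this, a client can discover that the dataset exists, see how to reference its columns coherently, and gauge its size before ever touching the underlying source.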
The most complex part of a DV System is the data virtualization engine (DV Engine), which
receives requests from clients (generated using the client interface) and performs whatever
processing is required for these requests. This typically involves communication with the specific
underlying data sources that contain data relevant to those requests. The DV Engine thus needs to
know how to communicate with a variety of different types of systems that may store data that is