Platform: An Introductory Overview
Data is a valuable asset that can help your company make better decisions, identify new
opportunities, and improve operations. In 2013, Google undertook a strategic project to increase
employee retention by improving manager quality. Even something as loosey-goosey as manager
skill could be studied in a data-driven manner. Google was able to improve management
favorability from 83% to 88% by analyzing 10K performance reviews, identifying common
behaviors of high-performing managers, and creating training programs. Another example of a
strategic data project was carried out at Amazon. The ecommerce giant implemented
a recommendation system based on customer behaviors that drove 35% of purchases in 2017. The
Warriors, a San Francisco basketball team, are yet another example; they enacted an analytics
program that helped catapult them to the top of their league. All of these (employee retention,
product recommendations, improving win rates) are examples of business goals that were
achieved by modern data analytics.
To become a data-driven company, you need to build an ecosystem for data analytics, processing,
and insights. This is because there are many different types of applications (websites, dashboards,
mobile apps, ML models, distributed devices, etc.) that create and consume data. There are also
many different departments within your company (finance, sales, marketing, operations, logistics,
etc.) that need data-driven insights. Because the entire company is your customer base, building a
data platform is more than just an IT project.
This chapter introduces data platforms, their requirements, and why traditional data architectures
prove insufficient. It also discusses technology trends in data analytics and AI, and how to build
data platforms for the future using the public cloud. This chapter is a general overview of the core
topics covered in more detail in the rest of the book.
The Data Lifecycle
The purpose of a data platform is to support the steps that organizations need to carry out to move
from raw data to insightful information. It is helpful to understand the steps of the data lifecycle
(collect, store, process, visualize, activate) because they can be mapped almost as-is to a data
architecture to create a unified analytics platform.
The Journey to Wisdom
Data helps companies to develop smarter products, reach more customers, and increase their return
on investment (ROI). Data can also be leveraged to measure customer satisfaction, profitability,
and cost. But the data by itself is not enough. Data is raw material that needs to pass through a
series of stages before it can be used to generate insights and knowledge. This sequence of stages
is what we call a data lifecycle. There are many definitions available in the literature, but from a
general point of view, we can identify five main stages in modern data platform architecture:
1. Collect
Data has to be acquired and injected into the target systems (e.g., manual data entry, batch
loading, streaming ingestion, etc.).
2. Store
Data needs to be persisted in a durable fashion with the ability to easily access it in the
future (e.g., file storage system, database).
3. Process/transform
Data has to be manipulated to make it useful for subsequent steps (e.g., cleansing,
wrangling, transforming).
4. Analyze/visualize
Data needs to be studied to derive business insights via manual exploration (e.g., queries,
slice and dice) or automatic processing (e.g., enrichment using ML application
programming interfaces, or APIs).
5. Activate
Data insights need to be surfaced in a form and place where decisions can be made (e.g.,
notifications that act as a trigger for specific manual actions, automatic job executions
when specific conditions are met, ML models that send feedback to devices).
Each of these stages feeds into the next, similar to the flow of water through a set of pipes.
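To make the chaining of stages concrete, the five steps listed above can be sketched as a handful of small Python functions, each consuming the output of the previous one. This is only an illustrative toy, not a reference architecture: the sample records, the in-memory dictionary standing in for durable storage, the function names, and the 50.0 spending threshold are all assumptions made up for this example.

```python
# Hypothetical raw input: "customer_id,order_total" strings, one malformed.
RAW_EVENTS = ["c1,20.0", "c2,35.5", "c1,bad", "c2,14.5", "c3,50.0"]


def collect(source):
    """Collect: acquire raw records from a source (here, an in-memory list)."""
    return list(source)


def store(records, storage):
    """Store: persist records (a dict stands in for a file system or database)."""
    storage["raw"] = list(records)
    return storage


def process(storage):
    """Process/transform: parse raw records and drop the malformed ones."""
    cleaned = []
    for line in storage["raw"]:
        customer_id, _, amount = line.partition(",")
        try:
            cleaned.append((customer_id, float(amount)))
        except ValueError:
            continue  # cleansing step: skip records whose amount is not numeric
    storage["clean"] = cleaned
    return cleaned


def analyze(cleaned):
    """Analyze: derive an insight, here the total spend per customer."""
    totals = {}
    for customer_id, amount in cleaned:
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    return totals


def activate(totals, threshold):
    """Activate: turn insights into notifications someone can act on."""
    return [
        f"Offer loyalty discount to {customer_id}"
        for customer_id, total in sorted(totals.items())
        if total >= threshold
    ]


def run_lifecycle(source, threshold=50.0):
    """Chain the five stages so that each one feeds into the next."""
    storage = store(collect(source), {})
    return activate(analyze(process(storage)), threshold)


if __name__ == "__main__":
    for notification in run_lifecycle(RAW_EVENTS):
        print(notification)
```

In a real platform each function would be replaced by a dedicated system (an ingestion service, a storage layer, a processing engine, and so on), but the shape of the flow stays the same.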
Water Pipes Analogy
To understand the data lifecycle better, think of it as a simplified water pipe system. The water
starts at an aqueduct and is then transferred and transformed through a series of pipes until it
reaches a group of houses. The data lifecycle is similar, with data being collected, stored,
processed/transformed, and analyzed before it is used to make decisions (see Figure 1-1).
Figure 1-1. Water lifecycle, providing an analogy for the five steps in the data lifecycle
You can see some similarities between the plumbing world and the data world. Plumbing engineers
are like data engineers, who design and build the systems that make data usable. People who
analyze water samples are like data analysts and data scientists, who analyze data to find insights.
Of course, this is just a simplification. There are many other roles in a company that use data, like
executives, developers, business users, and security administrators. But this analogy can help you
remember the main concepts.
In the canonical data lifecycle, shown in Figure 1-2, data engineers collect and store data in an
analytics store. The stored data is then processed using a variety of tools. If the tools involve
programming, the processing is typically done by data engineers. If the tools are declarative, the
processing is typically done by data analysts. The processed data is then analyzed by business
users and data scientists. Business users use the insights to make decisions, such as launching
marketing campaigns or issuing refunds. Data scientists use the data to train ML models, which
can be used to automate tasks or make predictions.