Google Data Engineering Cheatsheet
Compiled by Maverick Lin (http://mavericklin.com)
Last Updated August 4, 2018

What is Data Engineering?
Data engineering enables data-driven decision making by collecting, transforming, and visualizing data. A data engineer designs, builds, maintains, and troubleshoots data processing systems with a particular emphasis on the security, reliability, fault-tolerance, scalability, fidelity, and efficiency of such systems.

A data engineer also analyzes data to gain insight into business outcomes, builds statistical models to support decision-making, and creates machine learning models to automate and simplify key business processes.

Key Points
• Build/maintain data structures and databases
• Design data processing systems
• Analyze data and enable machine learning
• Design for reliability
• Visualize data and advocate policy
• Model business processes for analysis
• Design for security and compliance

Google Cloud Platform (GCP)
GCP is a collection of Google computing resources, which are offered via services. Data engineering services include Compute, Storage, Big Data, and Machine Learning.

The 4 ways to interact with GCP are the console, the command-line interface (CLI), the API, and the mobile app.

The GCP resource hierarchy is organized as follows. All resources (VMs, storage buckets, etc.) are organized into projects. These projects may be organized into folders, which can contain other folders. All folders and projects can be brought together under an organization node. Projects, folders, and organization nodes are where policies can be defined. Policies are inherited downstream and dictate who can access which resources. Every resource must belong to a project, and every project must have a billing account associated with it.

Advantages: Performance (fast solutions), Pricing (sub-hour billing, sustained use discounts, custom machine types), PaaS Solutions, Robust Infrastructure

Hadoop Overview
Hadoop
Data can no longer fit in memory on one machine (monolithic), so a new way of computing was devised that uses many computers to process the data (distributed). Such a group of machines is called a cluster, which makes up a server farm. All of these servers have to be coordinated in the following ways: partition data, coordinate computing tasks, handle fault tolerance/recovery, and allocate capacity to process.

Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is comprised of 3 main components:
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data by partitioning data across many machines
• YARN: framework for job scheduling and cluster resource management (task coordination)
• MapReduce: YARN-based system for parallel processing of large data sets on multiple machines

HDFS
A cluster is comprised of 1 master node; the rest are data nodes. The master node manages the overall file system by storing the directory structure and metadata of the files. The data nodes physically store the data. Large files are broken up and distributed across multiple machines, and each piece is replicated across 3 machines to provide fault tolerance.

MapReduce
Parallel programming paradigm that allows for processing of huge amounts of data by running processes on multiple machines. Defining a MapReduce job requires two stages: map and reduce.
• Map: operation to be performed in parallel on small portions of the dataset. The output is a key-value pair < K, V >
• Reduce: operation to combine the results of Map

YARN: Yet Another Resource Negotiator
Coordinates tasks running on the cluster and assigns new nodes in case of failure. Comprised of 2 subcomponents: the resource manager and the node manager. The resource manager runs on a single master node and schedules tasks across nodes. The node manager runs on all other nodes and manages tasks on each individual node.

Hadoop Ecosystem
An entire ecosystem of tools has emerged around Hadoop, all based on interacting with HDFS.
Hive: data warehouse software built on top of Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL-like queries (HiveQL). Hive abstracts away the underlying MapReduce jobs and presents query results as tables rather than raw HDFS files.
Pig: high-level scripting language (Pig Latin) that enables writing complex data transformations. It pulls unstructured/incomplete data from sources, cleans it, and places it in a database/data warehouse. Pig performs ETL into the data warehouse, while Hive queries from the data warehouse to perform analysis (GCP: Dataflow).
Spark: framework for writing fast, distributed programs for data processing and analysis. Spark solves similar problems as Hadoop MapReduce but with a fast in-memory approach. It is a unified engine that supports SQL queries, streaming data, machine learning, and graph processing. It can operate separately from Hadoop but integrates well with it. Data is processed using Resilient Distributed Datasets (RDDs), which are immutable, lazily evaluated, and track their lineage.
HBase: non-relational, NoSQL, column-oriented database management system that runs on top of HDFS. Well suited for sparse data sets (GCP: BigTable).
Flink/Kafka: stream processing frameworks. Batch processing is for bounded, finite datasets, with periodic updates and delayed processing. Stream processing is for unbounded datasets, with continuous updates and immediate processing. Stream data and stream processing must be decoupled via a message queue. Streaming data can be grouped into windows using tumbling (non-overlapping time), sliding (overlapping time), or session (session gap) windows.
Beam: programming model to define and execute data processing pipelines, including ETL, batch, and stream (continuous) processing. After building the pipeline, it is executed by one of Beam's distributed processing back-ends (Apache Apex, Apache Flink, Apache Spark, or Google Cloud Dataflow). Pipelines are modeled as a Directed Acyclic Graph (DAG).
Oozie: workflow scheduler system to manage Hadoop jobs.
Sqoop: transfer framework for moving large amounts of data into HDFS from relational databases (e.g. MySQL).
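The map and reduce stages described above can be sketched in plain Python. This is a toy word count, not the Hadoop API; the function names and the in-process "shuffle" are illustrative only, standing in for work the framework distributes across machines.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Map: run in parallel on small portions of the dataset,
    # emitting a <K, V> pair (word, 1) for every word seen.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort: group identical keys together, as the framework would
    # before handing each key's values to a reducer.
    pairs = sorted(pairs, key=itemgetter(0))
    # Reduce: combine all values for one key into a single result.
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(pairs)
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In a real cluster, each mapper would process one HDFS block locally and the shuffle would move data over the network, but the data flow is the same.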

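The tumbling and sliding windows mentioned under Flink/Kafka can be illustrated with a toy stream of (timestamp, value) events. Real engines manage watermarks and state; this sketch (with made-up events and window sizes) only shows which events land in which window.

```python
# Toy event stream: (timestamp_in_seconds, value)
events = [(1, 'a'), (3, 'b'), (6, 'c'), (9, 'd'), (11, 'e')]

def tumbling(events, size):
    # Tumbling: fixed-size, non-overlapping windows [0, size), [size, 2*size), ...
    windows = {}
    for ts, v in events:
        windows.setdefault((ts // size) * size, []).append(v)
    return windows

def sliding(events, size, step):
    # Sliding: fixed-size windows that overlap, one starting every `step` seconds,
    # so an event can belong to several windows.
    windows = {}
    start = 0
    last = max(ts for ts, _ in events)
    while start <= last:
        hits = [v for ts, v in events if start <= ts < start + size]
        if hits:
            windows[start] = hits
        start += step
    return windows

print(tumbling(events, 5))   # {0: ['a', 'b'], 5: ['c', 'd'], 10: ['e']}
print(sliding(events, 10, 5))
```

A session window (the third type above) would instead close a window whenever the gap between consecutive events exceeds a threshold.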
IAM
Identity Access Management (IAM)
Access management service for managing the different members of the platform: who has what access for which resource.

Each member has roles and permissions that allow them to perform their duties on the platform. 3 member types: Google account (single person, gmail account), service account (non-person, application), and Google Group (multiple people). Roles are a set of specific permissions for members. You cannot assign permissions to a user directly; you must grant roles.

If you grant a member access at a higher hierarchy level, that member will have access at all levels below as well. Access cannot be restricted at a lower level. The effective policy is the union of assigned and inherited policies.

Primitive Roles: Owner (full access to resources, manage roles), Editor (edit access to resources, change or add), Viewer (read access to resources)
Predefined Roles: finer-grained access control than primitive roles, predefined by Google Cloud
Custom Roles

Best Practice: use predefined roles when they exist (over primitive). Follow the principle of least privilege.

Stackdriver
GCP's monitoring, logging, and diagnostics solution. Provides insights into the health, performance, and availability of applications.
Main Functions
• Debugger: inspect the state of an app in real time without stopping or slowing it down (e.g. code behavior)
• Error Reporting: counts, analyzes, and aggregates crashes in cloud services
• Monitoring: overview of performance, uptime, and health of cloud services (metrics, events, metadata)
• Alerting: create policies to notify you when health and uptime check results exceed a certain limit
• Tracing: tracks how requests propagate through applications; receive near real-time performance results and latency reports of VMs
• Logging: store, search, monitor, and analyze log data and events from GCP

Key Concepts
OLAP vs. OLTP
Online Analytical Processing (OLAP): primary objective is data analysis. It is an online analysis and data-retrieval process, characterized by a large volume of data and complex queries; uses data warehouses.
Online Transaction Processing (OLTP): primary objective is data processing; manages database modification. Characterized by large numbers of short online transactions, simple queries, and traditional DBMSs.

Row vs. Columnar Database
Row Format: stores data by row
Column Format: stores data tables by column rather than by row, which is suitable for analytical query processing and data warehouses

IaaS, PaaS, SaaS
IaaS: gives you the infrastructure pieces (VMs), but you have to maintain and join together the different infrastructure pieces for your application to work. Most flexible option.
PaaS: gives you all the infrastructure pieces already joined, so you just have to deploy source code on the platform for your application to work. PaaS solutions are managed services/no-ops (highly available/reliable) and serverless/autoscaling (elastic). Less flexible than IaaS.

Fully Managed, Hotspotting

Compute Choices
Google App Engine
Flexible, serverless platform for building highly available applications. Ideal when you want to focus on writing and developing code and do not want to manage servers, clusters, or infrastructure.
Use Cases: web sites, mobile app and gaming backends, RESTful APIs, IoT apps.

Google Kubernetes (Container) Engine
Logical infrastructure powered by Kubernetes, an open-source container orchestration system. Ideal when you want to manage containers in production, increase velocity and operability, and don't have OS dependencies.
Use Cases: containerized workloads, cloud-native distributed systems, hybrid applications.

Google Compute Engine (IaaS)
Virtual Machines (VMs) running in Google's global data centers. Ideal when you need complete control over your infrastructure and direct access to high-performance hardware, or need OS-level changes.
Use Cases: any workload requiring a specific OS or OS configuration; currently deployed, on-premises software that you want to run in the cloud.

Summary: App Engine is the PaaS option: serverless and ops-free. Compute Engine is the IaaS option: fully controllable down to the OS level. Kubernetes Engine is in the middle: clusters of machines running Kubernetes and hosting containers.

Additional Notes
You can also mix and match multiple compute options.
Preemptible Instances: instances that run at a much lower price but may be terminated at any time, and self-terminate after 24 hours. Ideal for interruptible workloads.
Snapshots: used for backups of disks
Images: VM OS (Ubuntu, CentOS)
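The IAM inheritance rule above (policies granted higher in the hierarchy flow down, and the effective policy is a union) can be sketched as a walk up the resource tree. The hierarchy, member names, and role strings below are made-up examples, not real GCP data.

```python
# child -> parent links in a toy resource hierarchy
hierarchy = {
    "vm-1": "project-a",
    "project-a": "folder-eng",
    "folder-eng": "org-node",
    "org-node": None,
}

# roles granted directly at each level: level -> {member: set of roles}
grants = {
    "org-node": {"alice": {"roles/viewer"}},
    "project-a": {"alice": {"roles/editor"}, "bob": {"roles/viewer"}},
}

def effective_roles(member, resource):
    # The effective policy at a resource is the union of roles assigned
    # at the resource itself and at every ancestor, so access granted
    # higher up can never be taken away lower down.
    roles = set()
    node = resource
    while node is not None:
        roles |= grants.get(node, {}).get(member, set())
        node = hierarchy[node]
    return roles

print(effective_roles("alice", "vm-1"))  # viewer (from org) + editor (from project)
```

Note how bob has no roles at the organization level, so a query at "org-node" returns nothing for him even though he can view "project-a" and everything under it.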

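The row vs. columnar distinction above becomes concrete with a toy table stored both ways. In the columnar layout, an analytical aggregate reads a single contiguous array instead of touching every field of every record; the table and values here are invented for illustration.

```python
# Row format: one record per row, all fields stored together.
rows = [
    {"id": 1, "region": "us", "sales": 100},
    {"id": 2, "region": "eu", "sales": 250},
    {"id": 3, "region": "us", "sales": 175},
]

# Column format: the same table, one array per column.
columns = {
    "id": [1, 2, 3],
    "region": ["us", "eu", "us"],
    "sales": [100, 250, 175],
}

# Analytical query: total sales. The row layout must scan whole records;
# the columnar layout reads only the "sales" array.
total_row = sum(r["sales"] for r in rows)
total_col = sum(columns["sales"])
print(total_row, total_col)  # 525 525
```

This locality is why column-oriented systems (BigQuery, HBase/BigTable) favor analytical scans, while row stores favor transactional reads and writes of whole records.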