AWS Pipeline Documentation – Weekly Lab Activity
1. Introduction
This week's practical activity required me to deploy the cloud data pipeline shown in the
weekly tutorial within my own AWS Academy Learner Lab environment.
The goal was to construct a simplified ETL process using S3, AWS Glue, DataBrew, Glue
crawlers, the Glue Data Catalog and Amazon Athena.
Besides cleaning the Time Slot column of the dataset, I also needed to discuss how it would be
cleaned and to provide an architecture diagram depicting my implementation.
2. Step-by-Step Implementation in AWS Academy Learner Lab
Note: In the final section, I will include a screenshot of the AWS Console showing my Learner
Lab account name.
2.1. Step 1 — Creating the S3 Raw Zone Bucket
What I Did
• Accessed the AWS Console through Learner Lab.
• Navigated to S3
• Created a new bucket to hold the raw data, e.g. my-raw-zone-bucket
What This Resource Does
The Raw Zone S3 bucket stores the uploaded data in its original, unmodified form.
It is the first landing point of the pipeline.
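For illustration only, a minimal boto3 sketch of this step (the example bucket name from above is reused; the region is an assumption, since I actually created the bucket through the S3 console):

import boto3

# Assumed region; the bucket itself was created via the S3 console in the lab.
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="my-raw-zone-bucket")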
2.2 Step 2 — Uploading the Raw Dataset to S3
What I Did
Uploaded the dataset file to the Raw Zone bucket.
Purpose
This provides the source data that the subsequent transformation stages operate on.
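As an illustrative sketch only (the local file name raw_dataset.csv and the object key are assumptions; I performed the upload through the console):

import boto3

# Assumed local file name and object key; the upload was actually done in the console.
s3 = boto3.client("s3")
s3.upload_file("raw_dataset.csv", "my-raw-zone-bucket", "raw/raw_dataset.csv")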
2.3 Step 3 — Cleaning Data Using AWS Glue DataBrew
What I Did
• Opened Glue DataBrew
• Created a project and selected the dataset stored in the Raw Zone S3 bucket.
• Applied transformations including:
– Removing null values
– Formatting columns
– Basic data type corrections
Purpose
DataBrew provides a no-code platform to profile and prepare data and to run the resulting
cleaning recipe as an automated job.
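DataBrew itself is no-code, so the recipe was built interactively in the console. Purely as an illustration, a rough pandas equivalent of the cleaning steps listed above could look like this (the file name and the column names Time Slot and Duration are assumptions):

import pandas as pd

# Load the raw dataset (assumed file name).
df = pd.read_csv("raw_dataset.csv")

# Remove rows that contain null values.
df = df.dropna()

# Format a text column, e.g. trim whitespace in the assumed Time Slot column.
df["Time Slot"] = df["Time Slot"].str.strip()

# Basic data type correction, e.g. cast an assumed Duration column to a nullable integer.
df["Duration"] = pd.to_numeric(df["Duration"], errors="coerce").astype("Int64")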
2.4 Step 4 — Writing Cleaned Data to S3 Clean Zone
What I Did
Exported the cleaned DataBrew output to a separate S3 bucket (Clean Zone), e.g.
my-clean-zone-bucket
Purpose
This provides standardized intermediate data for the subsequent transformation steps.
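I configured the job output in the DataBrew console; a hedged boto3 sketch of the same idea follows (the job, dataset, recipe and role names are all assumptions):

import boto3

databrew = boto3.client("databrew")

# Assumed names for the DataBrew artefacts and the Learner Lab IAM role.
databrew.create_recipe_job(
    Name="clean-zone-job",
    DatasetName="raw-dataset",
    RecipeReference={"Name": "cleaning-recipe", "RecipeVersion": "LATEST_PUBLISHED"},
    RoleArn="arn:aws:iam::123456789012:role/LabRole",
    Outputs=[{
        "Format": "CSV",
        "Location": {"Bucket": "my-clean-zone-bucket", "Key": "clean/"},
    }],
)

# Start the job run so the cleaned output lands in the Clean Zone bucket.
databrew.start_job_run(Name="clean-zone-job")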
2.5 Step 5 — Creating a Glue Crawler (Clean Zone)
What I Did
• Created a Glue Crawler
• Pointed it at the S3 Clean Zone bucket.
• Assigned the required IAM role and permissions.
• Ran the crawler
Purpose
The crawler automatically scans the cleaned dataset, infers its schema, and registers the
metadata in the Glue Data Catalog.
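A minimal boto3 sketch of the same configuration (the crawler name, database name and role name are assumptions; I created the crawler in the Glue console):

import boto3

glue = boto3.client("glue")

# Assumed names; LabRole is the pre-provisioned Learner Lab IAM role.
glue.create_crawler(
    Name="clean-zone-crawler",
    Role="LabRole",
    DatabaseName="clean_zone_db",
    Targets={"S3Targets": [{"Path": "s3://my-clean-zone-bucket/clean/"}]},
)

# Run the crawler so it infers the schema and populates the Data Catalog.
glue.start_crawler(Name="clean-zone-crawler")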
2.6 Step 6 — Reviewing the AWS Glue Data Catalog
What I Did
• Checked the Tables section of the Glue Data Catalog.
• Confirmed that the crawler had created a table for the cleaned data.
Purpose
This catalog metadata is needed later by Glue jobs and Athena queries.
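I did this check in the console; as a sketch, the same verification could be scripted with boto3 (the database name is an assumption):

import boto3

glue = boto3.client("glue")

# List the tables the crawler registered in the assumed database.
response = glue.get_tables(DatabaseName="clean_zone_db")
for table in response["TableList"]:
    print(table["Name"], [col["Name"] for col in table["StorageDescriptor"]["Columns"]])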
2.7 Step 7 — Creating a Glue ETL Job (Transforming Clean → Curated)
What I Did
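Purely as an assumed sketch of this step, a Glue PySpark job that reads the clean table from the Data Catalog and writes curated Parquet output to a hypothetical my-curated-zone-bucket could look like this (all database, table and bucket names are assumptions):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Standard Glue job setup: wrap the Spark context with a GlueContext.
sc = SparkContext()
glue_context = GlueContext(sc)

# Read the cleaned data via the Data Catalog (database and table names are assumptions).
clean_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="clean_zone_db",
    table_name="clean",
)

# Write the curated output as Parquet to an assumed Curated Zone bucket.
glue_context.write_dynamic_frame.from_options(
    frame=clean_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-curated-zone-bucket/curated/"},
    format="parquet",
)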