All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damage caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Grosvenor House, 11 St Paul’s Square, Birmingham, B3 1RB, UK.
ISBN 978-1-80461-442-6
www.packt.com
Contributors
About the author
Gareth Eagar has over 25 years of experience in the IT industry. He started his career in South Africa, worked in the United Kingdom for a while, and is now based in the USA.
Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services and deep expertise in building data platforms on AWS. Gareth currently works as a Solutions Architect, and he has also worked in AWS Professional Services, helping to architect and implement data platforms for global customers.
Gareth also frequently speaks on data-related topics.
To my amazing wife and children, thank you for your patience and understanding as I spent countless hours writing the revised edition of this book. Your support for me taking on this project, and making the space and time for me to write, means so much to me.
A special thanks to Disha Umarwani, Praful Kava, and Natalie Rabinovich, who each contributed content for the first edition of this book. And many thanks to Amit Kalawat, Leonardo Gomez, and many others for helping to review content for this revised edition.
About the reviewers
Vaibhav Tyagi is a skilled and experienced cloud data engineer and architect with 10 years of experience. He has a deep understanding of AWS cloud services and is proficient in a variety of data engineering tools, including Spark, Hive, and Hadoop.
Throughout his career, he has worked for Teradata, Citigroup, NatWest, and Amazon, and has worked on, among other things, designing and implementing cloud-based pipelines, building complex cloud environments, and creating data warehouses.
I would like to thank my wife and children who have been my biggest cheerleaders and put up with my long working hours. I am truly grateful for their love and support. And thank you to my friends who have also been a great source of support.
Gaurav Verma has 9 years of experience in the field, having worked at AWS, Skyscanner, Discovery Communications, and Tata Consultancy Services.
He excels in designing and delivering big data and analytics solutions on AWS. His expertise spans AWS services, Python, Scala, Spark, and more. He currently leads a team at Amazon, overseeing global analytics and the ML data platform. His career highlights include optimizing data pipelines, managing analytics projects, and extensive training in big data and data engineering technologies.
Learn more on Discord
To join the Discord community for this book – where you can share feedback, ask the author questions, and learn about new releases – scan the QR code or use the link below:
https://discord.gg/9s5mHNyECd
Databases and data warehouses • 22
Dealing with big, unstructured data • 23
Cloud-based solutions for big data analytics • 24
A deeper dive into data warehouse concepts and architecture
Dimensional modeling in data warehouses • 28
Understanding the role of data marts • 32
Distributed storage and massively parallel processing • 33
Columnar data storage and efficient data compression • 35
Feeding data into the warehouse – ETL and ELT pipelines • 37
Data lake logical architecture • 41
The storage layer and storage zones • 42
Catalog and search layers • 43
Ingestion layer • 43
The processing layer • 44
The consumption layer • 44
Data lake architecture summary • 44
Federated queries across database engines • 47
Accessing the AWS CLI
Using AWS CloudShell to access the CLI • 49
Creating new Amazon S3 buckets • 51
AWS Database Migration Service (DMS) • 54
Amazon Kinesis for streaming data ingestion • 56
Amazon Kinesis Agent • 57
Amazon Kinesis Data Firehose • 58
Amazon Kinesis Data Streams • 59
Amazon Kinesis Data Analytics • 60
Amazon Kinesis Video Streams • 60
Amazon MSK for streaming data ingestion • 61
Amazon AppFlow for ingesting data from SaaS services • 62
AWS Transfer Family for ingestion using FTP/SFTP protocols • 63
AWS DataSync for ingesting from on-premises and multi-cloud storage services • 64
The AWS Snow family of devices for large data transfers • 64
AWS Glue for data ingestion • 66
An overview of AWS services for transforming data
AWS Lambda for light transformations • 67
AWS Glue for serverless data processing • 68
Serverless ETL processing • 68
AWS Glue DataBrew • 70
AWS Glue Data Catalog • 70
AWS Glue crawlers • 72
Amazon EMR for Hadoop ecosystem processing • 73
An overview of AWS services for orchestrating big data pipelines
AWS Glue workflows for orchestrating Glue components • 75
AWS Step Functions for complex workflows • 77
Amazon Managed Workflows for Apache Airflow (MWAA) • 79
An overview of AWS services for consuming data
Amazon Athena for SQL queries in the data lake • 81
Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures • 82
Overview of Amazon QuickSight for visualizing data • 85
Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket • 87
Creating a Lambda layer containing the AWS SDK for pandas library • 87
Creating an IAM policy and role for your Lambda function • 89
Creating a Lambda function • 91
Configuring our Lambda function to be triggered by an S3 upload • 96
Common data regulatory requirements • 104
Core data protection concepts • 105
Personally identifiable information (PII) • 105
Personal data • 105
Encryption • 106
Anonymized data • 106
Pseudonymized data/tokenization • 107
Authentication • 108
Authorization • 109
Putting these concepts together • 109
Data quality • 110
Data profiling • 111
Data lineage • 113
Business and technical data catalogs
Implementing a data catalog to avoid creating a data swamp
Business data catalogs • 115
Technical data catalogs • 117
AWS services that help with data governance
The AWS Glue/Lake Formation technical data catalog • 118
AWS Glue DataBrew for profiling datasets • 120
AWS Glue Data Quality • 121
AWS Key Management Service (KMS) for data encryption • 122
Amazon Macie for detecting PII data in Amazon S3 objects • 123
The AWS Glue Studio Detect PII transform for detecting PII data in datasets • 124
Amazon GuardDuty for detecting threats in an AWS account • 124
AWS Identity and Access Management (IAM) service • 124
Using AWS Lake Formation to manage data lake access • 128
Permissions management before Lake Formation • 128
Permissions management using AWS Lake Formation • 129
Hands-on – configuring Lake Formation permissions
Creating a new user with IAM permissions • 130
Transitioning to managing fine-grained permissions with AWS Lake Formation • 135
Activating Lake Formation permissions for a database and table • 136
Granting Lake Formation permissions • 138
Section 2: Architecting and Implementing Data
Whiteboarding as an information-gathering tool
Conducting a whiteboarding session
Data standardization • 154
Data quality checks • 155
Data partitioning • 155
Data denormalization • 155
Data cataloging • 155
Whiteboarding data transformation • 155
Loading data into data
Hands-on – architecting a sample pipeline
Detailed notes from the project “Bright Light” whiteboarding meeting of GP Widgets, Inc • 161