A comprehensive and accessible roadmap to performing data analytics in the AWS cloud
In Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS, accomplished software engineer and data architect Joe Minichino delivers an expert blueprint for storing, processing, and analyzing data on the Amazon Web Services cloud platform. In the book, you'll explore every relevant aspect of data analytics, from data engineering to analysis, business intelligence, DevOps, and MLOps, as you discover how to integrate machine learning predictions with analytics engines and visualization tools.
You'll also find:
A can't-miss resource for data architects, analysts, engineers, and technical professionals, Data Analytics in the AWS Cloud will also earn a place on the bookshelves of business leaders seeking a better understanding of data analytics on the AWS cloud platform.
GIONATA "JOE" MINICHINO is Principal Software Engineer and Data Architect on the Data & Analytics Team at Teamwork. He specializes in cloud computing, machine/deep learning, and artificial intelligence and designs end-to-end Amazon Web Services pipelines that move large quantities of diverse data for analysis and visualization.
Introduction xxiii
Chapter 1 AWS Data Lakes and Analytics Technology Overview 1
Why AWS? 1
What Does a Data Lake Look Like in AWS? 2
Analytics on AWS 3
Skills Required to Build and Maintain an AWS Analytics Pipeline 3
Chapter 2 The Path to Analytics: Setting Up a Data and Analytics Team 5
The Data Vision 6
Support 6
DA Team Roles 7
Early Stage Roles 7
Team Lead 8
Data Architect 8
Data Engineer 8
Data Analyst 9
Maturity Stage Roles 9
Data Scientist 9
Cloud Engineer 10
Business Intelligence (BI) Developer 10
Machine Learning Engineer 10
Business Analyst 11
Niche Roles 11
Analytics Flow at a Process Level 12
Workflow Methodology 12
The DA Team Mantra: "Automate Everything" 14
Analytics Models in the Wild: Centralized, Distributed, Center of Excellence 15
Centralized 15
Distributed 16
Center of Excellence 16
Summary 17
Chapter 3 Working on AWS 19
Accessing AWS 20
Everything Is a Resource 21
S3: An Important Exception 21
IAM: Policies, Roles, and Users 22
Policies 22
Identity-Based Policies 24
Resource-Based Policies 25
Roles 25
Users and User Groups 25
Summarizing IAM 26
Working with the Web Console 26
The AWS Command-Line Interface 29
Installing AWS CLI 29
Linux Installation 30
macOS Installation 30
Windows 31
Configuring AWS CLI 31
A Note on Region 33
Setting Individual Parameters 33
Using Profiles and Configuration Files 33
Final Notes on Configuration 36
Using the AWS CLI 36
Using Skeletons and File Inputs 39
Cleaning Up! 43
Infrastructure-as-Code: CloudFormation and Terraform 44
CloudFormation 44
CloudFormation Stacks 46
CloudFormation Template Anatomy 47
CloudFormation Changesets 52
Getting Stack Information 55
Cleaning Up Again 57
CloudFormation Conclusions 58
Terraform 58
Coding Style 58
Modularity 59
Limitations 59
Terraform vs. CloudFormation 60
Infrastructure-as-Code: CDK, Pulumi, Cloudcraft, and Other Solutions 60
AWS CDK 60
Pulumi 62
Cloudcraft 62
Infrastructure Management Conclusions 63
Chapter 4 Serverless Computing and Data Engineering 65
Serverless vs. Fully Managed 65
AWS Serverless Technologies 66
AWS Lambda 67
Pricing Model 67
Laser Focus on Code 68
The Lambda Paradigm Shift 69
Virtually Infinite Scalability 70
Geographical Distribution 70
A Lambda Hello World 71
Lambda Configuration 74
Runtime 74
Container-Based Lambdas 75
Architectures 75
Memory 75
Networking 76
Execution Role 76
Environment Variables 76
AWS EventBridge 77
AWS Fargate 77
AWS DynamoDB 77
AWS SNS 77
Amazon SQS 78
AWS CloudWatch 78
Amazon QuickSight 78
AWS Step Functions 78
Amazon API Gateway 79
Amazon Cognito 79
AWS Serverless Application Model (SAM) 79
Ephemeral Infrastructure 80
AWS SAM Installation 80
Configuration 80
Creating Your First AWS SAM Project 81
Application Structure 83
SAM Resource Types 85
SAM Lambda Template 86
!! Recursive Lambda Invocation !! 88
Function Metadata 88
Outputs 89
Implicitly Generated Resources 89
Other Template Sections 90
Lambda Code 90
Building Your First SAM Application 93
Testing the AWS SAM Application Locally 96
Deployment 99
Cleaning Up 104
Summary 104
Chapter 5 Data Ingestion 105
AWS Data Lake Architecture 106
Serverless Data Lake Architecture Structure 106
Ingestion 106
Storage and Processing 108
Cataloging, Governance, and Search 108
Security and Monitoring 109
Consumption 109
Sample Processing Architecture: Cataloging Images into DynamoDB 109
Use Case Description 109
SAM Application Creation 110
S3-Triggered Lambda 111
Adding DynamoDB 119
Lambda Execution Context 121
Inserting into DynamoDB 121
Cleaning Up 123
Serverless Ingestion 124
AWS Fargate 124
AWS Lambda 124
Example Architecture: Fargate-Based Periodic Batch Import 125
The Basic Importer 125
ECS CLI 128
AWS Copilot CLI 128
Clean Up 136
AWS Kinesis Ingestion 136
Example Architecture: Two-Pronged Delivery 137
Fully Managed Ingestion with AppFlow 146
Operational Data Ingestion with Database Migration Service 151
DMS Concepts 151
DMS Instance 151
DMS Endpoints 152
DMS Tasks 152
Summary of the Workflow 152
Common Use of DMS 153
Example Architecture: DMS to S3 154
DMS Instance 154
DMS Endpoints 156
DMS Task 162
Summary 167
Chapter 6 Processing Data 169
Phases of Data Preparation 170
What Is ETL? Why Should I Care? 170
ETL Job vs. Streaming Job 171
Overview of ETL in AWS 172
ETL with AWS Glue 172
ETL with Lambda Functions 172
ETL with Hadoop/EMR 173
Other Ways to Perform ETL 173
ETL Job Design Concepts 173
Source Identification 174
Destination Identification 174
Mappings 174
Validation 174
Filter 175
Join, Denormalization, Relationalization 175
AWS Glue for ETL 176
Really, It's Just Spark 176
Visual 176
Spark Script Editor 177
Python Shell Script Editor 177
Jupyter Notebook 177
Connectors 177
Creating Connections 178
Creating Connections with the Web Console 178
Creating Connections with the AWS CLI 179
Creating ETL Jobs with AWS Glue Visual Editor 184
ETL Example: Format Switch from Raw (JSON) to Cleaned (Parquet) 184
Job Bookmarks 187
Transformations 188
Apply Mapping 189
Filter 189
Other Available Transforms 190
Run the Edited Job 191
Visual Editor with Source and Target Conclusions 192
Creating ETL Jobs with AWS Glue Visual Editor (without Source and Target) 192
Creating ETL Jobs with the Spark Script Editor 192
Developing ETL Jobs with AWS Glue Notebooks 193
What Is a Notebook? 194
Notebook Structure 194
Step 1: Load Code into a DynamicFrame 196
Step 2: Apply Field Mapping 197
Step 3: Apply the Filter 197
Step 4: Write to S3 in Parquet Format 198
Example: Joining and Denormalizing Data from Two S3 Locations 199
Conclusions for Manually Authored Jobs with Notebooks 203
Creating ETL Jobs with AWS Glue Interactive Sessions 204
It's Magic 205
Development Workflow 206
Streaming Jobs 207
Differences with a Standard ETL Job 208
Streaming Sources 208
Example: Process Kinesis Streams with a Streaming Job 208
Streaming ETL Jobs Conclusions 217
Summary 217
Chapter 7 Cataloging, Governance, and Search 219
Cataloging with AWS Glue 219
AWS Glue and the AWS Glue Data Catalog 219
Glue Databases and Tables 220
Databases 220
The Idea of Schema-on-Read 221
Tables 222
Create Table Manually 223
Creating a Table from an Existing Schema 225
Creating a Table with a Crawler 225
Summary on Databases and Tables 226
Crawlers 226
Updating or Not Updating? 230
Running the Crawler 231
Creating a Crawler from the AWS CLI 231
Retrieving Table Information from the CLI 233
Classifiers 235
Classifier Example 236
Crawlers and Classifiers Summary 237
Search with Amazon Athena: The Heart of Analytics in AWS 238
A Bit of History 238
Interface Overview 238
Creating Tables Manually 239
Athena Data Types 240
Complex Types 241
Running a Query 242
Connecting with JDBC and ODBC 243
Query Stats 243
Recent Queries and Saved Queries 243
The Power of Partitions 244
Athena Pricing Model 244
Automatic Naming 245
Athena Query Output 246
Athena Peculiarities (SQL and Not) 246
Computed Fields Gotcha and WITH Statement Workaround 246
Lowercase! 247
Query Explain 248
Deduplicating Records 249
Working with JSON, Flattening, and Unnesting 250
Athena Views 251
Create Table as Select (CTAS) 252
Saving Queries and Reusing Saved Queries 253
Running Parameterized Queries 254
Athena Federated Queries 254
Athena Lambda Connectors 255
Note on Connection Errors 256
Performing Federated Queries 257
Creating a View from a Federated Query 258
Governing: Athena Workgroups, Lake Formation, and More 258
Athena Workgroups 259
Fine-Grained Athena Access with IAM 262
Recap of Athena-Based Governance 264
AWS Lake Formation 265
Registering a Location in Lake Formation 266
Creating a Database in Lake Formation 268
Assigning Permissions in Lake Formation 269
LF-Tags and Permissions in Lake Formation 271
Data Filters 277
Governance Conclusions 279
Summary 280
Chapter 8 Data Consumption: BI, Visualization, and Reporting 283
QuickSight 283
Signing Up for QuickSight 284
Standard Plan 284
Enterprise Plan 284
Users and User Groups 285
Managing Users and Groups 285
Managing QuickSight 286
Users and Groups 287
Your Subscriptions 287
SPICE Capacity 287
Account Settings 287
Security and Permissions 287
VPC Connections 288
Mobile Settings 289
Domains and Embedding 289
Single Sign-On 289
Data Sources and Datasets 289
Creating an Athena Data Source 291
Creating Other Data Sources 292
Creating a Data Source from the AWS CLI 292
Creating a Dataset from a Table 294
Creating a Dataset from a SQL Query 295
Duplicating Datasets 296
Note on Creating Datasets 297
QuickSight Favorites, Recent, and Folders 297
SPICE 298
Manage SPICE Capacity 298
Refresh Schedule 299
QuickSight Data Editor 299
QuickSight Data Types 302
Change Data Types 302
Calculated Fields 303
Joining Data 305
Excluding Fields 309
Filtering Data 309
Removing Data 310
Geospatial Hierarchies and Adding Fields to Hierarchies 310
Unsupported Format Dates 311
Visualizing Data: QuickSight Analysis 312
Adding a Title and a Description to Your Analysis 313
Renaming the Sheet 314
Your First Visual with AutoGraph 314
Field Wells 314
Visual Types 315
Saving and Autosaving 316
A First Example: Pie Chart 316
Renaming a Visual 317
Filtering Data 318
Adding Drill-Downs 320
Parameters 321
Actions 324
Insights 328
ML-Powered Insights 330
Sharing an Analysis 335
Dashboards 335
Dashboard Layouts and Themes 335
Publishing a Dashboard 336
Embedding Visuals and Dashboards 337
Data Consumption: Not Only Dashboards 337
Summary 338
Chapter 9 Machine Learning at Scale 339
Machine Learning and Artificial Intelligence 339
What Are ML/AI Use Cases? 340
Types of ML Models 340
Overview of ML/AI AWS Solutions 341
Amazon SageMaker 341
SageMaker Domains 342
Adding a User to the Domain 344
SageMaker Studio 344
SageMaker Example Notebook 346
Step 1: Prerequisites and Preprocessing 346
Step 2: Data Ingestion 347
Step 3: Data Inspection 348
Step 4: Data Conversion 349
Step 5: Upload Training Data 349
Step 6: Train the Model 349
Step 7: Set Up Hosting and Deploy the Model 351
Step 8: Validate the Model 352
Step 9: Use the Model 353
Inference 353
Real Time 354
Asynchronous 354
Serverless 354
Batch Transform 354
Data Wrangler 356
SageMaker Canvas 357
Summary 358
Appendix Example Data Architectures in AWS 359
Modern Data Lake Architecture 360
ETL in a Lake House 361
Consuming Data in the Lake House 361
The Modern Data Lake Architecture 362
Batch Processing 362
Stream Processing 363
Architecture Design Recommendations 364
Automate Everything 365
Build on Events 365
Performance = Cost Savings 365
AWS Glue Catalog and Athena-Centric Workflow 365
Design Flexible 365
Pick Your Battles 365
Parquet 366
Summary 366
Index 367
Creating analytics, especially in a large organization, can be a monumental effort, and a business needs to be prepared to invest time and resources, an investment that will repay the company manifold by enabling data-driven decisions. The people who will make this shift toward data-driven decision making are your Data and Analytics team, sometimes referred to as the Data Analytics team or simply the Data team (although this last name tends to confuse people, as it may seem related to database administration). This book will refer to the Data and Analytics team as the DA team.
Although the focus of this book is architectural patterns and designs that will help you turn your organization into a data-driven one, a high-level overview of the skills and people you will need to make this happen is necessary.
The first step in delivering analytics is to create a data vision, a statement for your business as a whole. This can be a simple quote that works as a compass for all the projects your DA team will work on.
A vision does not have to be immutable. However, you should change it only if it applies solely to certain conditions or periods of time, and those conditions have been satisfied or that time has passed.
A vision is the North Star of your data journey. It should always be a factor when you're making decisions about what kind of work to carry out or how to prioritize a current backlog. An example of a data vision is "to create a unified analytics facility that enables business management to slice and dice data at will."
It's important to create the vision, and it's also vital for the vision to have the support of all the stakeholders involved. Management will be responsible for allocating resources to the DA team, so these managers need to be behind the vision and the team's ability to carry it out. You should have a vision statement ready to submit to management, or have management create it in the first place.
I won't linger on this topic any further because this book is of a more technical nature than a business one, but be sure not to skip this vital step.
Before diving into the steps for creating analytics, allow me to give you some friendly advice on how you should not go about it. I will do so by recounting a fictional yet all-too-common story of failure.
Data Undriven Inc. is a successful company with hundreds of employees, but it is in dire need of analytics to reverse some worrying revenue trends. The leadership team recognizes the need for far more accurate analytics than what is currently available, since the company appears unable to pinpoint exactly which side of the business is hemorrhaging money. Gemma, a member of the leadership team, decides to start a project to create analytics for the company, which will find its ultimate manifestation in a dashboard illustrating all sorts of useful metrics. Gemma considers Bob a great Python/SQL data analyst and tasks him with creating the reports. The ideas are good, but the data for these reports resides in various data sources and is unsuitable for analysis: it is sparse and inaccurate, some integrity is broken, and there are holes due to temporary system failures. On top of that, the DBA team is being hit with large, unsustainable queries run against their live transactional databases, which are meant to serve data to customers, not to be reported on.
Bob collects the data from all the sources and, after weeks of wrangling, cleaning, filtering, and general massaging of the data, delivers analytics to Gemma in the form of a spreadsheet with graphs in it.
Gemma is happy with the result, although she notices some discrepancies from the expected figures. She asks Bob to automate this analysis into a dashboard that managers can consult and that will contain up-to-date information.
Bob is in a state of panic, looking up how to automate his analytics scripts while also trying to understand why his numbers do not match Gemma's expectations. On top of that, his Python program takes between three and four hours to run every time, so the development cycle is horrendously slow.
The following weeks are a harrowing story of misunderstandings, failed attempts at automation, frustration, and degraded database performance, with the ultimate result that Gemma has no analytics and Bob has quit his job to join a DA team elsewhere.
What is the moral of the story? Do not put any analyst to work before you have a data engineer in place. This cannot be stated strongly enough. Resist the temptation to want analytics now. Go about it the right way: set up a DA team, even if it's small and you suffer from resource constraints in the beginning, and let analysts come into the picture when the data is ready for analytics, not before. Let's see what kinds of skills and roles you should rely on to create a successful DA team and achieve analytics even at scale.
There are two groups of roles for a DA team: early-stage roles and maturity-stage roles. The definitions are not strict and vary from business to business, but make sure the core roles are covered before advancing to more niche and specialized ones.
By "early-stage roles" we refer to the set of roles that will constitute the nucleus of your nascent DA team and help it grow. At the very beginning, the people involved should expect to exercise some flexibility and open-mindedness about the scope and authority of their roles, because the priority is to build the foundation of a data platform. The team lead will most likely be hands-on and actively contribute to engineering, and the same can be said of the data architect, whereas data engineers will have to do a lot of work in the realm of data platform engineering to enable the construction and monitoring of pipelines.
Your DA team should have, at least at the beginning, strong leadership in the form of a team lead. This is a person who is clearly technically proficient in the realm of analytics and is able to create tasks and delegate them to the right people, oversee the technical work that's being carried out, and act as a liaison between management and the DA team.
Analytics is a vast domain with more business implications than other, strictly technical areas (such as feature development), and yet its technical aspects can be incredibly challenging, normally requiring engineers with years of experience to carry out the work. For this reason, it is good to have a person spearheading the work in terms of workflow and methodology, to avoid early-stage fragmentation, discrepancies, and general disruption due to lack of cohesion within the team. The team can potentially evolve into more of a flat-hierarchy unit later on, when every member is working with similar methods and practices that can, at that later point, be questioned and changed.
A data architect is a fundamental figure for a DA team and one the team cannot do without. Even if you don't officially designate someone as the team's architect, it is advisable to appoint the most experienced and architecturally minded engineer as supervisor of all the architectures designed and implemented by the DA team. Ideally the architect is a full-time role, not only designing pipeline architectures but also driving the work on the technology adoption front, a task that is both hefty and delicate.
Deciding whether you should adopt a serverless architecture over an Airflow- or Hadoop-based one is something that requires careful attention. Elements such as in-house skills and maintenance costs are also involved in the decision-making process.
The business can, especially under resource constraints, decide to combine the architect and team lead roles. I suggest making the data architect/team lead a full-time role before the volume of analytics demand in the company becomes too large to be handled by a single team lead or data architect.
Every DA team should have a data engineering (DE) subteam, which is the beating heart of data analytics. Data engineers are responsible for implementing systems that move, transform, and catalog data in order to render the data suitable for analytics.
In the context of analytics powered by AWS, data engineers today are necessarily multifaceted, with skills spanning various areas of technology. They are cloud computing engineers, DevOps engineers, and database/data lake/data warehouse experts, and they are knowledgeable in continuous integration/continuous deployment (CI/CD).
You will find that most DEs have particular strengths and interests, so it would be wise to create a team of DEs with some diversity of skills....