A comprehensive and accessible roadmap to performing data analytics in the AWS cloud
In Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS, accomplished software engineer and data architect Joe Minichino delivers an expert blueprint for storing, processing, and analyzing data on the Amazon Web Services cloud platform. In the book, you'll explore every relevant aspect of data analytics, from data engineering to analysis, business intelligence, DevOps, and MLOps, as you discover how to integrate machine learning predictions with analytics engines and visualization tools.
You'll also find:
A can't-miss resource for data architects, analysts, engineers, and technical professionals, Data Analytics in the AWS Cloud will also earn a place on the bookshelves of business leaders seeking a better understanding of data analytics on the AWS cloud platform.
GIONATA "JOE" MINICHINO is Principal Software Engineer and Data Architect on the Data & Analytics Team at Teamwork. He specializes in cloud computing, machine/deep learning, and artificial intelligence and designs end-to-end Amazon Web Services pipelines that move large quantities of diverse data for analysis and visualization.
Introduction xxiii
Chapter 1 AWS Data Lakes and Analytics Technology Overview 1
Why AWS? 1
What Does a Data Lake Look Like in AWS? 2
Analytics on AWS 3
Skills Required to Build and Maintain an AWS Analytics Pipeline 3
Chapter 2 The Path to Analytics: Setting Up a Data and Analytics Team 5
The Data Vision 6
Support 6
DA Team Roles 7
Early Stage Roles 7
Team Lead 8
Data Architect 8
Data Engineer 8
Data Analyst 9
Maturity Stage Roles 9
Data Scientist 9
Cloud Engineer 10
Business Intelligence (BI) Developer 10
Machine Learning Engineer 10
Business Analyst 11
Niche Roles 11
Analytics Flow at a Process Level 12
Workflow Methodology 12
The DA Team Mantra: "Automate Everything" 14
Analytics Models in the Wild: Centralized, Distributed, Center of Excellence 15
Centralized 15
Distributed 16
Center of Excellence 16
Summary 17
Chapter 3 Working on AWS 19
Accessing AWS 20
Everything Is a Resource 21
S3: An Important Exception 21
IAM: Policies, Roles, and Users 22
Policies 22
Identity-Based Policies 24
Resource-Based Policies 25
Roles 25
Users and User Groups 25
Summarizing IAM 26
Working with the Web Console 26
The AWS Command-Line Interface 29
Installing AWS CLI 29
Linux Installation 30
macOS Installation 30
Windows 31
Configuring AWS CLI 31
A Note on Region 33
Setting Individual Parameters 33
Using Profiles and Configuration Files 33
Final Notes on Configuration 36
Using the AWS CLI 36
Using Skeletons and File Inputs 39
Cleaning Up! 43
Infrastructure-as-Code: CloudFormation and Terraform 44
CloudFormation 44
CloudFormation Stacks 46
CloudFormation Template Anatomy 47
CloudFormation Changesets 52
Getting Stack Information 55
Cleaning Up Again 57
CloudFormation Conclusions 58
Terraform 58
Coding Style 58
Modularity 59
Limitations 59
Terraform vs. CloudFormation 60
Infrastructure-as-Code: CDK, Pulumi, Cloudcraft, and Other Solutions 60
AWS CDK 60
Pulumi 62
Cloudcraft 62
Infrastructure Management Conclusions 63
Chapter 4 Serverless Computing and Data Engineering 65
Serverless vs. Fully Managed 65
AWS Serverless Technologies 66
AWS Lambda 67
Pricing Model 67
Laser Focus on Code 68
The Lambda Paradigm Shift 69
Virtually Infinite Scalability 70
Geographical Distribution 70
A Lambda Hello World 71
Lambda Configuration 74
Runtime 74
Container-Based Lambdas 75
Architectures 75
Memory 75
Networking 76
Execution Role 76
Environment Variables 76
AWS EventBridge 77
AWS Fargate 77
AWS DynamoDB 77
AWS SNS 77
Amazon SQS 78
AWS CloudWatch 78
Amazon QuickSight 78
AWS Step Functions 78
Amazon API Gateway 79
Amazon Cognito 79
AWS Serverless Application Model (SAM) 79
Ephemeral Infrastructure 80
AWS SAM Installation 80
Configuration 80
Creating Your First AWS SAM Project 81
Application Structure 83
SAM Resource Types 85
SAM Lambda Template 86
!! Recursive Lambda Invocation !! 88
Function Metadata 88
Outputs 89
Implicitly Generated Resources 89
Other Template Sections 90
Lambda Code 90
Building Your First SAM Application 93
Testing the AWS SAM Application Locally 96
Deployment 99
Cleaning Up 104
Summary 104
Chapter 5 Data Ingestion 105
AWS Data Lake Architecture 106
Serverless Data Lake Architecture Structure 106
Ingestion 106
Storage and Processing 108
Cataloging, Governance, and Search 108
Security and Monitoring 109
Consumption 109
Sample Processing Architecture: Cataloging Images into DynamoDB 109
Use Case Description 109
SAM Application Creation 110
S3-Triggered Lambda 111
Adding DynamoDB 119
Lambda Execution Context 121
Inserting into DynamoDB 121
Cleaning Up 123
Serverless Ingestion 124
AWS Fargate 124
AWS Lambda 124
Example Architecture: Fargate-Based Periodic Batch Import 125
The Basic Importer 125
ECS CLI 128
AWS Copilot CLI 128
Clean Up 136
AWS Kinesis Ingestion 136
Example Architecture: Two-Pronged Delivery 137
Fully Managed Ingestion with AppFlow 146
Operational Data Ingestion with Database Migration Service 151
DMS Concepts 151
DMS Instance 151
DMS Endpoints 152
DMS Tasks 152
Summary of the Workflow 152
Common Use of DMS 153
Example Architecture: DMS to S3 154
DMS Instance 154
DMS Endpoints 156
DMS Task 162
Summary 167
Chapter 6 Processing Data 169
Phases of Data Preparation 170
What Is ETL? Why Should I Care? 170
ETL Job vs. Streaming Job 171
Overview of ETL in AWS 172
ETL with AWS Glue 172
ETL with Lambda Functions 172
ETL with Hadoop/EMR 173
Other Ways to Perform ETL 173
ETL Job Design Concepts 173
Source Identification 174
Destination Identification 174
Mappings 174
Validation 174
Filter 175
Join, Denormalization, Relationalization 175
AWS Glue for ETL 176
Really, It's Just Spark 176
Visual 176
Spark Script Editor 177
Python Shell Script Editor 177
Jupyter Notebook 177
Connectors 177
Creating Connections 178
Creating Connections with the Web Console 178
Creating Connections with the AWS CLI 179
Creating ETL Jobs with AWS Glue Visual Editor 184
ETL Example: Format Switch from Raw (JSON) to Cleaned (Parquet) 184
Job Bookmarks 187
Transformations 188
Apply Mapping 189
Filter 189
Other Available Transforms 190
Run the Edited Job 191
Visual Editor with Source and Target Conclusions 192
Creating ETL Jobs with AWS Glue Visual Editor (without Source and Target) 192
Creating ETL Jobs with the Spark Script Editor 192
Developing ETL Jobs with AWS Glue Notebooks 193
What Is a Notebook? 194
Notebook Structure 194
Step 1: Load Code into a DynamicFrame 196
Step 2: Apply Field Mapping 197
Step 3: Apply the Filter 197
Step 4: Write to S3 in Parquet Format 198
Example: Joining and Denormalizing Data from Two S3 Locations 199
Conclusions for Manually Authored Jobs with Notebooks 203
Creating ETL Jobs with AWS Glue Interactive Sessions 204
It's Magic 205
Development Workflow 206
Streaming Jobs 207
Differences with a Standard ETL Job 208
Streaming Sources 208
Example: Process Kinesis Streams with a Streaming Job 208
Streaming ETL Jobs Conclusions 217
Summary 217
Chapter 7 Cataloging, Governance, and Search 219
Cataloging with AWS Glue 219
AWS Glue and the AWS Glue Data Catalog 219
Glue Databases and Tables 220
Databases 220
The Idea of Schema-on-Read 221
Tables 222
Create Table Manually 223
Creating a Table from an Existing Schema 225
Creating a Table with a Crawler 225
Summary on Databases and Tables 226
Crawlers 226
Updating or Not Updating? 230
Running the Crawler 231
Creating a Crawler from the AWS CLI 231
Retrieving Table Information from the CLI 233
Classifiers 235
Classifier Example 236
Crawlers and Classifiers Summary 237
Search with Amazon Athena: The Heart of Analytics in AWS 238
A Bit of History 238
Interface Overview 238
Creating Tables Manually 239
Athena Data Types 240
Complex Types 241
Running a Query 242
Connecting with JDBC and ODBC 243
Query Stats 243
Recent Queries and Saved Queries 243
The Power of Partitions 244
Athena Pricing Model 244
Automatic Naming 245
Athena Query Output 246
Athena Peculiarities (SQL and Not) 246
Computed Fields Gotcha and WITH Statement Workaround 246
Lowercase! 247
Query Explain 248
Deduplicating Records 249
Working with JSON, Flattening, and Unnesting 250
Athena Views 251
Create Table as Select (CTAS) 252
Saving Queries and Reusing Saved Queries 253
Running Parameterized Queries 254
Athena Federated Queries 254
Athena Lambda Connectors 255
Note on Connection Errors 256
Performing Federated Queries 257
Creating a View from a Federated Query 258
Governing: Athena Workgroups, Lake Formation, and More 258
Athena Workgroups 259
Fine-Grained Athena Access with IAM 262
Recap of Athena-Based Governance 264
AWS Lake Formation 265
Registering a Location in Lake Formation 266
Creating a Database in Lake Formation 268
Assigning Permissions in Lake Formation 269
LF-Tags and Permissions in Lake Formation 271
Data Filters 277
Governance Conclusions 279
Summary 280
Chapter 8 Data Consumption: BI, Visualization, and Reporting 283
QuickSight 283
Signing Up for QuickSight 284
Standard Plan 284
Enterprise Plan 284
Users and User Groups 285
Managing Users and Groups 285
Managing QuickSight 286
Users and Groups 287
Your Subscriptions 287
SPICE Capacity 287
Account Settings 287
Security and Permissions 287
VPC Connections 288
Mobile Settings 289
Domains and Embedding 289
Single Sign-On 289
Data Sources and Datasets 289
Creating an Athena Data Source 291
Creating Other Data Sources 292
Creating a Data Source from the AWS CLI 292
Creating a Dataset from a Table 294
Creating a Dataset from a SQL Query 295
Duplicating Datasets 296
Note on Creating Datasets 297
QuickSight Favorites, Recent, and Folders 297
SPICE 298
Manage SPICE Capacity 298
Refresh Schedule 299
QuickSight Data Editor 299
QuickSight Data Types 302
Change Data Types 302
Calculated Fields 303
Joining Data 305
Excluding Fields 309
Filtering Data 309
Removing Data 310
Geospatial Hierarchies and Adding Fields to Hierarchies 310
Unsupported Format Dates 311
Visualizing Data: QuickSight Analysis 312
Adding a Title and a Description to Your Analysis 313
Renaming the Sheet 314
Your First Visual with AutoGraph 314
Field Wells 314
Visual Types 315
Saving and Autosaving 316
A First Example: Pie Chart 316
Renaming a Visual 317
Filtering Data 318
Adding Drill-Downs 320
Parameters 321
Actions 324
Insights 328
ML-Powered Insights 330
Sharing an Analysis 335
Dashboards 335
Dashboard Layouts and Themes 335
Publishing a Dashboard 336
Embedding Visuals and Dashboards 337
Data Consumption: Not Only Dashboards 337
Summary 338
Chapter 9 Machine Learning at Scale 339
Machine Learning and Artificial Intelligence 339
What Are ML/AI Use Cases? 340
Types of ML Models 340
Overview of ML/AI AWS Solutions 341
Amazon SageMaker 341
SageMaker Domains 342
Adding a User to the Domain 344
SageMaker Studio 344
SageMaker Example Notebook 346
Step 1: Prerequisites and Preprocessing 346
Step 2: Data Ingestion 347
Step 3: Data Inspection 348
Step 4: Data Conversion 349
Step 5: Upload Training Data 349
Step 6: Train the Model 349
Step 7: Set Up Hosting and Deploy the Model 351
Step 8: Validate the Model 352
Step 9: Use the Model 353
Inference 353
Real Time 354
Asynchronous 354
Serverless 354
Batch Transform 354
Data Wrangler 356
SageMaker Canvas 357
Summary 358
Appendix Example Data Architectures in AWS 359
Modern Data Lake Architecture 360
ETL in a Lake House 361
Consuming Data in the Lake House 361
The Modern Data Lake Architecture 362
Batch Processing 362
Stream Processing 363
Architecture Design Recommendations 364
Automate Everything 365
Build on Events 365
Performance = Cost Savings 365
AWS Glue Catalog and Athena-Centric Workflow 365
Design Flexible 365
Pick Your Battles 365
Parquet 366
Summary 366
Index 367
Creating analytics, especially in a large organization, can be a monumental effort, and a business needs to be prepared to invest time and resources, an investment that will repay the company manifold by enabling data-driven decisions. The people who will make this shift toward data-driven decision making are your Data and Analytics team, sometimes referred to as the Data Analytics team or simply the Data team (although this last name tends to confuse people, as it may seem related to database administration). This book will refer to the Data and Analytics team as the DA team.
Although the focus of this book is architectural patterns and designs that will help you turn your organization into a data-driven one, a high-level overview of the skills and people you will need to make this happen is necessary.
The first step in delivering analytics is to create a data vision, a statement for your business as a whole. This can be a simple quote that works as a compass for all the projects your DA team will work on.
A vision does not have to be immutable. However, you should change it only if it applies solely to certain conditions or periods of time, and those conditions have been satisfied or that time has passed.
A vision is the North Star of your data journey. It should always be a factor when you're making decisions about what kind of work to carry out or how to prioritize a current backlog. An example of a data vision is "to create a unified analytics facility that enables business management to slice and dice data at will."
It's important to create the vision, and it's also vital for the vision to have the support of all the stakeholders involved. Management will be responsible for allocating resources to the DA team, so these managers need to be behind the vision and the team's ability to carry it out. You should have a vision statement ready to submit to management, or have management create it in the first place.
I won't linger on this topic any further because this book is of a more technical nature than a business one, but be sure not to skip this vital step.
Before diving into the steps for creating analytics, allow me to give you some friendly advice on how you should not go about it. I will do so by recounting a fictional yet all-too-common story of failure.
Data Undriven Inc. is a successful company with hundreds of employees, but it is in dire need of analytics to reverse some worrying revenue trends. The leadership team recognizes the need for far more accurate analytics than what is currently available, since the company appears unable to pinpoint exactly which side of the business is hemorrhaging money. Gemma, a member of the leadership team, decides to start a project to create analytics for the company, which will find its ultimate manifestation in a dashboard illustrating all sorts of useful metrics. Gemma considers Bob a great Python/SQL data analyst and tasks him with creating the reports. The ideas are good, but the data for these reports resides in various data sources and is unsuitable for analysis: it is sparse and inaccurate, some integrity is broken, and there are holes due to temporary system failures. On top of that, the DBA team is being hit with large, unsustainable queries run against their live transactional databases, which are meant to serve data to customers, not to be reported on.
Bob collects the data from all the sources and, after weeks of wrangling, cleaning, filtering, and general massaging of the data, delivers analytics to Gemma in the form of a spreadsheet with graphs in it.
Gemma is happy with the result, although she notices some discrepancies from the expected figures. She asks Bob to automate this analysis into a dashboard that managers can consult and that will contain up-to-date information.
Bob is in a state of panic, looking up how to automate his analytics scripts while also trying to understand why his numbers do not match Gemma's expectations. On top of that, his Python program takes between three and four hours to run every time, so the development cycle is horrendously slow.
The following weeks are a harrowing story of misunderstandings, failed attempts at automation, frustration, and degraded database performance, with the ultimate result that Gemma has no analytics and Bob has quit his job to join a DA team elsewhere.
What is the moral of the story? Do not put any analyst to work before you have a data engineer in place. This cannot be stated strongly enough. Resist the temptation to want analytics now. Go about it the right way: set up a DA team, even if it's small and you suffer from resource constraints in the beginning, and let analysts come into the picture when the data is ready for analytics, not before. Let's see what kinds of skills and roles you should rely on to create a successful DA team and achieve analytics even at scale.
There are two groups of roles for a DA team: early-stage roles and maturity-stage roles. The definitions are not strict and vary from business to business, but make sure the core roles are covered before advancing to more niche and specialized ones.
By "early-stage roles" we refer to the set of roles that will constitute the nucleus of your nascent DA team and help it grow. At the very beginning, the people involved should expect to exercise some flexibility and open-mindedness about the scope and authority of their roles, because the priority is to build the foundation of a data platform. The team lead will most likely be hands-on and actively contribute to engineering, and the same can be said of the data architect, whereas data engineers will have to do a lot of work in the realm of data platform engineering to enable the construction and monitoring of pipelines.
Your DA team should have, at least at the beginning, strong leadership in the form of a team lead. This is a person who is clearly technically proficient in the realm of analytics and is able to create tasks and delegate them to the right people, oversee the technical work that's being carried out, and act as a liaison between management and the DA team.
Analytics is a vast domain with more business implications than other, strictly technical areas (such as feature development), and yet its technical aspects can be incredibly challenging, normally requiring engineers with years of experience to carry out the work. For this reason, it is good to have a person spearheading the work in terms of workflow and methodology, to avoid early-stage fragmentation, discrepancies, and general disruption due to lack of cohesion within the team. The team can potentially evolve into more of a flat-hierarchy unit later on, when every member is working with similar methods and practices that can, at that later point, be questioned and changed.
A data architect is a fundamental figure for a DA team and one the team cannot do without. Even if you don't officially designate someone as the team's architect, it is advisable to appoint the most experienced and architecturally minded engineer as supervisor of all the architectures designed and implemented by the DA team. Ideally the architect is a full-time role, not only designing pipeline architectures but also driving the work on the technology adoption front, a task that is both hefty and delicate.
Deciding whether you should adopt a serverless architecture over an Airflow- or Hadoop-based one is something that requires careful attention. Elements such as in-house skills and maintenance costs are also involved in the decision-making process.
The business can, especially under resource constraints, decide to combine the architect and team lead roles. I suggest making the data architect/team lead a full-time role before the volume of analytics demand in the company becomes too large to be handled by a single team lead or data architect.
Every DA team should have a data engineering (DE) subteam, which is the beating heart of data analytics. Data engineers are responsible for implementing systems that move, transform, and catalog data in order to render the data suitable for analytics.
In the context of analytics powered by AWS, data engineers today are necessarily multifaceted, with skills spanning various areas of technology. They are cloud computing engineers, DevOps engineers, and database/data lake/data warehouse experts, and they are knowledgeable in continuous integration/continuous deployment (CI/CD).
You will find that most DEs have particular strengths and interests, so it would be wise to create a team of DEs with some diversity of skills....