The proven Study Guide that prepares you for this new Google Cloud exam
The Google Cloud Certified Professional Data Engineer Study Guide provides everything you need to prepare for this important exam and master the skills necessary to land that coveted Google Cloud Professional Data Engineer certification. A pre-book assessment quiz evaluates what you know before you begin, each chapter features exam objectives and review questions, and the online learning environment includes additional complete practice tests.
Written by Dan Sullivan, a popular and experienced online course author for machine learning, big data, and cloud topics, Google Cloud Certified Professional Data Engineer Study Guide is your ace in the hole for deploying and managing analytics and machine learning applications.
This exam guide is designed to help you develop an in-depth understanding of data engineering and machine learning on Google Cloud Platform.
DAN SULLIVAN is a software architect specializing in data architecture, machine learning, and cloud computing. Dan is a Google Cloud Certified Professional Data Engineer, Professional Cloud Architect, and Associate Cloud Engineer. He is the author of six books and numerous articles, and he is an instructor with LinkedIn Learning and Udemy for Business.
Introduction xxiii
Assessment Test xxix
Chapter 1 Selecting Appropriate Storage Technologies 1
From Business Requirements to Storage Systems 2
Ingest 3
Store 5
Process and Analyze 6
Explore and Visualize 8
Technical Aspects of Data: Volume, Velocity, Variation, Access, and Security 8
Volume 8
Velocity 9
Variation in Structure 10
Data Access Patterns 11
Security Requirements 12
Types of Structure: Structured, Semi-Structured, and Unstructured 12
Structured: Transactional vs. Analytical 13
Semi-Structured: Fully Indexed vs. Row Key Access 13
Unstructured Data 15
Google's Storage Decision Tree 16
Schema Design Considerations 16
Relational Database Design 17
NoSQL Database Design 20
Exam Essentials 23
Review Questions 24
Chapter 2 Building and Operationalizing Storage Systems 29
Cloud SQL 30
Configuring Cloud SQL 31
Improving Read Performance with Read Replicas 33
Importing and Exporting Data 33
Cloud Spanner 34
Configuring Cloud Spanner 34
Replication in Cloud Spanner 35
Database Design Considerations 36
Importing and Exporting Data 36
Cloud Bigtable 37
Configuring Bigtable 37
Database Design Considerations 38
Importing and Exporting 39
Cloud Firestore 39
Cloud Firestore Data Model 40
Indexing and Querying 41
Importing and Exporting 42
BigQuery 42
BigQuery Datasets 43
Loading and Exporting Data 44
Clustering, Partitioning, and Sharding Tables 45
Streaming Inserts 46
Monitoring and Logging in BigQuery 46
BigQuery Cost Considerations 47
Tips for Optimizing BigQuery 47
Cloud Memorystore 48
Cloud Storage 50
Organizing Objects in a Namespace 50
Storage Tiers 51
Cloud Storage Use Cases 52
Data Retention and Lifecycle Management 52
Unmanaged Databases 53
Exam Essentials 54
Review Questions 56
Chapter 3 Designing Data Pipelines 61
Overview of Data Pipelines 62
Data Pipeline Stages 63
Types of Data Pipelines 66
GCP Pipeline Components 73
Cloud Pub/Sub 74
Cloud Dataflow 76
Cloud Dataproc 79
Cloud Composer 82
Migrating Hadoop and Spark to GCP 82
Exam Essentials 83
Review Questions 86
Chapter 4 Designing a Data Processing Solution 89
Designing Infrastructure 90
Choosing Infrastructure 90
Availability, Reliability, and Scalability of Infrastructure 93
Hybrid Cloud and Edge Computing 96
Designing for Distributed Processing 98
Distributed Processing: Messaging 98
Distributed Processing: Services 101
Migrating a Data Warehouse 102
Assessing the Current State of a Data Warehouse 102
Designing the Future State of a Data Warehouse 103
Migrating Data, Jobs, and Access Controls 104
Validating the Data Warehouse 105
Exam Essentials 105
Review Questions 107
Chapter 5 Building and Operationalizing Processing Infrastructure 111
Provisioning and Adjusting Processing Resources 112
Provisioning and Adjusting Compute Engine 113
Provisioning and Adjusting Kubernetes Engine 118
Provisioning and Adjusting Cloud Bigtable 124
Provisioning and Adjusting Cloud Dataproc 127
Configuring Managed Serverless Processing Services 129
Monitoring Processing Resources 130
Stackdriver Monitoring 130
Stackdriver Logging 130
Stackdriver Trace 131
Exam Essentials 132
Review Questions 134
Chapter 6 Designing for Security and Compliance 139
Identity and Access Management with Cloud IAM 140
Predefined Roles 141
Custom Roles 143
Using Roles with Service Accounts 145
Access Control with Policies 146
Using IAM with Storage and Processing Services 148
Cloud Storage and IAM 148
Cloud Bigtable and IAM 149
BigQuery and IAM 149
Cloud Dataflow and IAM 150
Data Security 151
Encryption 151
Key Management 153
Ensuring Privacy with the Data Loss Prevention API 154
Detecting Sensitive Data 154
Running Data Loss Prevention Jobs 155
Inspection Best Practices 156
Legal Compliance 156
Health Insurance Portability and Accountability Act (HIPAA) 156
Children's Online Privacy Protection Act 157
FedRAMP 158
General Data Protection Regulation 158
Exam Essentials 158
Review Questions 161
Chapter 7 Designing Databases for Reliability, Scalability, and Availability 165
Designing Cloud Bigtable Databases for Scalability and Reliability 166
Data Modeling with Cloud Bigtable 166
Designing Row-keys 168
Designing for Time Series 170
Use Replication for Availability and Scalability 171
Designing Cloud Spanner Databases for Scalability and Reliability 172
Relational Database Features 173
Interleaved Tables 174
Primary Keys and Hotspots 174
Database Splits 175
Secondary Indexes 176
Query Best Practices 177
Designing BigQuery Databases for Data Warehousing 179
Schema Design for Data Warehousing 179
Clustered and Partitioned Tables 181
Querying Data in BigQuery 182
External Data Access 183
BigQuery ML 185
Exam Essentials 185
Review Questions 188
Chapter 8 Understanding Data Operations for Flexibility and Portability 191
Cataloging and Discovery with Data Catalog 192
Searching in Data Catalog 193
Tagging in Data Catalog 194
Data Preprocessing with Dataprep 195
Cleansing Data 196
Discovering Data 196
Enriching Data 197
Importing and Exporting Data 197
Structuring and Validating Data 198
Visualizing with Data Studio 198
Connecting to Data Sources 198
Visualizing Data 200
Sharing Data 200
Exploring Data with Cloud Datalab 200
Jupyter Notebooks 201
Managing Cloud Datalab Instances 201
Adding Libraries to Cloud Datalab Instances 202
Orchestrating Workflows with Cloud Composer 202
Airflow Environments 203
Creating DAGs 203
Airflow Logs 204
Exam Essentials 204
Review Questions 206
Chapter 9 Deploying Machine Learning Pipelines 209
Structure of ML Pipelines 210
Data Ingestion 211
Data Preparation 212
Data Segregation 215
Model Training 217
Model Evaluation 218
Model Deployment 220
Model Monitoring 221
GCP Options for Deploying Machine Learning Pipelines 221
Cloud AutoML 221
BigQuery ML 223
Kubeflow 223
Spark Machine Learning 224
Exam Essentials 225
Review Questions 227
Chapter 10 Choosing Training and Serving Infrastructure 231
Hardware Accelerators 232
Graphics Processing Units 232
Tensor Processing Units 233
Choosing Between CPUs, GPUs, and TPUs 233
Distributed and Single Machine Infrastructure 234
Single Machine Model Training 234
Distributed Model Training 235
Serving Models 236
Edge Computing with GCP 237
Edge Computing Overview 237
Edge Computing Components and Processes 239
Edge TPU 240
Cloud IoT 240
Exam Essentials 241
Review Questions 244
Chapter 11 Measuring, Monitoring, and Troubleshooting Machine Learning Models 247
Three Types of Machine Learning Algorithms 248
Supervised Learning 248
Unsupervised Learning 253
Anomaly Detection 254
Reinforcement Learning 254
Deep Learning 255
Engineering Machine Learning Models 257
Model Training and Evaluation 257
Operationalizing ML Models 262
Common Sources of Error in Machine Learning Models 263
Data Quality 264
Unbalanced Training Sets 264
Types of Bias 264
Exam Essentials 265
Review Questions 267
Chapter 12 Leveraging Prebuilt Models as a Service 269
Sight 270
Vision AI 270
Video AI 272
Conversation 274
Dialogflow 274
Cloud Text-to-Speech API 275
Cloud Speech-to-Text API 275
Language 276
Translation 276
Natural Language 277
Structured Data 278
Recommendations AI API 278
Cloud Inference API 280
Exam Essentials 280
Review Questions 282
Appendix Answers to Review Questions 285
Chapter 1: Selecting Appropriate Storage Technologies 286
Chapter 2: Building and Operationalizing Storage Systems 288
Chapter 3: Designing Data Pipelines 290
Chapter 4: Designing a Data Processing Solution 291
Chapter 5: Building and Operationalizing Processing Infrastructure 293
Chapter 6: Designing for Security and Compliance 295
Chapter 7: Designing Databases for Reliability, Scalability, and Availability 296
Chapter 8: Understanding Data Operations for Flexibility and Portability 298
Chapter 9: Deploying Machine Learning Pipelines 299
Chapter 10: Choosing Training and Serving Infrastructure 301
Chapter 11: Measuring, Monitoring, and Troubleshooting Machine Learning Models 303
Chapter 12: Leveraging Prebuilt Models as a Service 304
Index 307
The Google Cloud Certified Professional Data Engineer exam tests your ability to design, deploy, monitor, and adapt services and infrastructure for data-driven decision-making. The four primary areas of focus in this exam are as follows:
Designing data processing systems involves selecting storage technologies, including relational, analytical, document, and wide-column databases, such as Cloud SQL, BigQuery, Cloud Firestore, and Cloud Bigtable, respectively. You will also be tested on designing pipelines using services such as Cloud Dataflow, Cloud Dataproc, Cloud Pub/Sub, and Cloud Composer. The exam will test your ability to design distributed systems that may include hybrid clouds, message brokers, middleware, and serverless functions. Expect to see questions on migrating data warehouses from on-premises infrastructure to the cloud.
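As a simple illustration of one such pipeline component, the following sketch publishes a message to a Cloud Pub/Sub topic using the Python client library; the project and topic IDs are placeholders, not values from the exam or this book.

```python
# A minimal sketch of publishing an event to Cloud Pub/Sub;
# "my-project" and "ingest-events" are placeholder identifiers.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "ingest-events")

# Message payloads are raw bytes; attributes carry lightweight metadata.
future = publisher.publish(
    topic_path,
    b'{"sensor_id": 42, "temp_c": 21.5}',
    origin="edge-device",
)
print(f"Published message ID: {future.result()}")
```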
The building and operationalizing data processing systems portion of the exam will test your ability to support storage systems, pipelines, and infrastructure in a production environment. This includes using managed services for storage as well as for batch and stream processing. It also covers common operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. As a data engineer, you are expected to understand how to provision resources, monitor pipelines, and test distributed systems.
Machine learning is an increasingly important topic. This exam will test your knowledge of prebuilt machine learning models available in GCP as well as the ability to deploy machine learning pipelines with custom-built models. You can expect to see questions about machine learning service APIs and data ingestion, as well as training and evaluating models. The exam uses machine learning terminology, so it is important to understand the nomenclature, especially terms such as model, supervised and unsupervised learning, regression, classification, and evaluation metrics.
The fourth domain of knowledge covered in the exam is ensuring solution quality, which includes security, scalability, efficiency, and reliability. Expect questions on ensuring privacy with data loss prevention techniques, encryption, and identity and access management, as well as questions about compliance with major regulations. The exam also tests a data engineer's ability to monitor pipelines with Stackdriver, improve data models, and scale resources as needed. You may also encounter questions that assess your ability to design portable solutions and plan for future business requirements.
In your day-to-day experience with GCP, you may spend more time working on some data engineering tasks than others. This is expected. It does, however, mean that you should be aware of the exam topics with which you may be less familiar. Machine learning questions can be especially challenging to data engineers who work primarily on ingestion and storage systems. Similarly, those who spend a majority of their time developing machine learning models may need to invest more time studying schema modeling for NoSQL databases and designing fault-tolerant distributed systems.
This book covers the topics outlined in the Google Cloud Professional Data Engineer exam guide available here:
cloud.google.com/certification/guides/data-engineer
Chapter 1: Selecting Appropriate Storage Technologies This chapter covers selecting appropriate storage technologies, including mapping business requirements to storage systems; understanding the distinction between structured, semi-structured, and unstructured data models; and designing schemas for relational and NoSQL databases. By the end of the chapter, you should understand the various criteria that data engineers consider when choosing a storage technology.
Chapter 2: Building and Operationalizing Storage Systems This chapter discusses how to deploy storage systems and perform data management operations, such as importing and exporting data, configuring access controls, and doing performance tuning. The services included in this chapter are as follows: Cloud SQL, Cloud Spanner, Cloud Bigtable, Cloud Firestore, BigQuery, Cloud Memorystore, and Cloud Storage. The chapter also includes a discussion of working with unmanaged databases, understanding storage costs and performance, and performing data lifecycle management.
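As an illustration of one common data management operation covered in that chapter, here is a minimal sketch of loading a CSV file from Cloud Storage into BigQuery with the Python client library; the bucket, dataset, and table names are placeholders.

```python
# A minimal sketch of a BigQuery load job; "gs://example-bucket/sales.csv"
# and "my_dataset.sales" are placeholder names.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the table schema from the file
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/sales.csv",
    "my_dataset.sales",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
print(f"Loaded {client.get_table('my_dataset.sales').num_rows} rows.")
```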
Chapter 3: Designing Data Pipelines This chapter describes high-level design patterns, along with some variations on those patterns, for data pipelines. It also reviews how GCP services like Cloud Dataflow, Cloud Dataproc, Cloud Pub/Sub, and Cloud Composer are used to implement data pipelines. It also covers migrating data pipelines from an on-premises Hadoop cluster to GCP.
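For readers unfamiliar with the programming model behind Cloud Dataflow, the following is a minimal Apache Beam sketch that runs locally with the default DirectRunner; the input values and output path are placeholders.

```python
# A minimal Apache Beam pipeline (the SDK used by Cloud Dataflow), run locally.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create records" >> beam.Create(["alpha,1", "beta,2", "alpha,3"])
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Key by name" >> beam.Map(lambda fields: (fields[0], int(fields[1])))
        | "Sum per key" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("/tmp/totals")
    )
```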
Chapter 4: Designing a Data Processing Solution In this chapter, you learn about designing infrastructure for data engineering and machine learning, including how to do several tasks, such as choosing an appropriate compute service for your use case; designing for scalability, reliability, availability, and maintainability; using hybrid and edge computing architecture patterns and processing models; and migrating a data warehouse from on-premises data centers to GCP.
Chapter 5: Building and Operationalizing Processing Infrastructure This chapter discusses managed processing resources, including those offered by App Engine, Cloud Functions, and Cloud Dataflow. The chapter also includes a discussion of how to use Stackdriver Metrics, Stackdriver Logging, and Stackdriver Trace to monitor processing infrastructure.
Chapter 6: Designing for Security and Compliance This chapter introduces several key topics of security and compliance, including identity and access management, data security, encryption and key management, data loss prevention, and compliance.
Chapter 7: Designing Databases for Reliability, Scalability, and Availability This chapter provides information on designing for reliability, scalability, and availability of three GCP databases: Cloud Bigtable, Cloud Spanner, and BigQuery. It also covers how to apply best practices for designing schemas, querying data, and taking advantage of the physical design properties of each database.
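One of those physical design properties for Cloud Bigtable is row-key design. The sketch below writes a time-series row whose key leads with a device identifier to spread writes and avoid hotspotting; the project, instance, table, and column family names are placeholders, and the table and column family are assumed to already exist.

```python
# A minimal Cloud Bigtable write with a hotspot-avoiding row key
# (device ID first, then timestamp); all identifiers are placeholders.
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("sensor-readings")

# Field promotion: leading with the device ID spreads sequential writes
# across tablets instead of concentrating them on one node.
ts = datetime.datetime.utcnow()
row_key = f"device#1234#{ts:%Y%m%d%H%M%S}".encode()

row = table.direct_row(row_key)
row.set_cell("metrics", "temp_c", b"21.5", timestamp=ts)
row.commit()
```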
Chapter 8: Understanding Data Operations for Flexibility and Portability This chapter describes how to use the Data Catalog, a metadata management service supporting the discovery and management of data in Google Cloud. It also introduces Cloud Dataprep, a preprocessing tool for transforming and enriching data, as well as Data Studio for visualizing data and Cloud Datalab for interactive exploration and scripting.
Chapter 9: Deploying Machine Learning Pipelines Machine learning pipelines include several stages that begin with data ingestion and preparation and then perform data segregation followed by model training and evaluation. GCP provides multiple ways to implement machine learning pipelines. This chapter describes how to deploy ML pipelines using general-purpose computing resources, such as Compute Engine and Kubernetes Engine. Managed services, such as Cloud Dataflow and Cloud Dataproc, are also available, as well as specialized machine learning services, such as AI Platform, formerly known as Cloud ML.
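The stages themselves are not GCP-specific. The following minimal scikit-learn sketch walks through them on a bundled toy dataset, purely as an illustration of the stage names used in this chapter.

```python
# A minimal sketch of the ML pipeline stages: ingestion, segregation,
# preparation, training, and evaluation, using a toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # data ingestion
X_train, X_test, y_train, y_test = train_test_split(  # data segregation
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)                # data preparation
model = LogisticRegression(max_iter=200).fit(         # model training
    scaler.transform(X_train), y_train)
acc = accuracy_score(                                  # model evaluation
    y_test, model.predict(scaler.transform(X_test)))
print(f"accuracy: {acc:.3f}")
```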
Chapter 10: Choosing Training and Serving Infrastructure This chapter focuses on choosing the appropriate training and serving infrastructure for your needs when serverless or specialized AI services are not a good fit for your requirements. It discusses distributed and single-machine infrastructure, the use of edge computing for serving machine learning models, and the use of hardware accelerators.
Chapter 11: Measuring, Monitoring, and Troubleshooting Machine Learning Models This chapter focuses on key concepts in machine learning, including machine learning terminology and core concepts and common sources of error in machine learning. Machine learning is a broad discipline with many areas of specialization. This chapter provides you with a high-level overview to help you pass the Professional Data Engineer exam, but it is not a substitute for learning machine learning from resources designed for that purpose.
Chapter 12: Leveraging Prebuilt ML Models as a Service This chapter describes Google Cloud Platform options for using pretrained machine learning models to help developers build and deploy intelligent services quickly. The services are broadly grouped into sight, conversation, language, and structured data. These services are available through APIs or through Cloud AutoML services.
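As a brief illustration of calling one of these pretrained services, the sketch below requests label detection from the Vision API using the Python client library; the image URI is a placeholder, and the example assumes the Cloud Vision API is enabled in the project.

```python
# A minimal sketch of label detection with the pretrained Vision AI service;
# the image URI is a placeholder.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image(
    source=vision.ImageSource(image_uri="gs://example-bucket/photo.jpg"))

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))
```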
Learning the material in the Official Google Cloud Certified Professional Data Engineer Study Guide is an important part of preparing for the Professional Data Engineer certification exam, but we also provide additional tools to help you prepare. The online TestBank will help you understand the types of questions that will appear on the certification exam.
The sample tests in the TestBank include all the questions in each chapter as well as the questions from the assessment test. In addition, there are two practice exams with 50 questions...