The proven Study Guide that prepares you for this new Google Cloud exam
The Google Cloud Certified Professional Data Engineer Study Guide provides everything you need to prepare for this important exam and master the skills necessary to land that coveted Google Cloud Professional Data Engineer certification. A pre-book assessment quiz evaluates what you know before you begin, each chapter features exam objectives and review questions, and the online learning environment includes additional complete practice tests.
Written by Dan Sullivan, a popular and experienced online course author for machine learning, big data, and cloud topics, Google Cloud Certified Professional Data Engineer Study Guide is your ace in the hole for deploying and managing analytics and machine learning applications.
This exam guide is designed to help you develop an in-depth understanding of data engineering and machine learning on Google Cloud Platform.
DAN SULLIVAN is a software architect specializing in data architecture, machine learning, and cloud computing. Dan is a Google Cloud Certified Professional Data Engineer, Professional Cloud Architect, and Associate Cloud Engineer. He is the author of six books and numerous articles, and he is an instructor with LinkedIn Learning and Udemy for Business.
Introduction xxiii
Assessment Test xxix
Chapter 1 Selecting Appropriate Storage Technologies 1
From Business Requirements to Storage Systems 2
Ingest 3
Store 5
Process and Analyze 6
Explore and Visualize 8
Technical Aspects of Data: Volume, Velocity, Variation, Access, and Security 8
Volume 8
Velocity 9
Variation in Structure 10
Data Access Patterns 11
Security Requirements 12
Types of Structure: Structured, Semi-Structured, and Unstructured 12
Structured: Transactional vs. Analytical 13
Semi-Structured: Fully Indexed vs. Row Key Access 13
Unstructured Data 15
Google's Storage Decision Tree 16
Schema Design Considerations 16
Relational Database Design 17
NoSQL Database Design 20
Exam Essentials 23
Review Questions 24
Chapter 2 Building and Operationalizing Storage Systems 29
Cloud SQL 30
Configuring Cloud SQL 31
Improving Read Performance with Read Replicas 33
Importing and Exporting Data 33
Cloud Spanner 34
Configuring Cloud Spanner 34
Replication in Cloud Spanner 35
Database Design Considerations 36
Importing and Exporting Data 36
Cloud Bigtable 37
Configuring Bigtable 37
Database Design Considerations 38
Importing and Exporting 39
Cloud Firestore 39
Cloud Firestore Data Model 40
Indexing and Querying 41
Importing and Exporting 42
BigQuery 42
BigQuery Datasets 43
Loading and Exporting Data 44
Clustering, Partitioning, and Sharding Tables 45
Streaming Inserts 46
Monitoring and Logging in BigQuery 46
BigQuery Cost Considerations 47
Tips for Optimizing BigQuery 47
Cloud Memorystore 48
Cloud Storage 50
Organizing Objects in a Namespace 50
Storage Tiers 51
Cloud Storage Use Cases 52
Data Retention and Lifecycle Management 52
Unmanaged Databases 53
Exam Essentials 54
Review Questions 56
Chapter 3 Designing Data Pipelines 61
Overview of Data Pipelines 62
Data Pipeline Stages 63
Types of Data Pipelines 66
GCP Pipeline Components 73
Cloud Pub/Sub 74
Cloud Dataflow 76
Cloud Dataproc 79
Cloud Composer 82
Migrating Hadoop and Spark to GCP 82
Exam Essentials 83
Review Questions 86
Chapter 4 Designing a Data Processing Solution 89
Designing Infrastructure 90
Choosing Infrastructure 90
Availability, Reliability, and Scalability of Infrastructure 93
Hybrid Cloud and Edge Computing 96
Designing for Distributed Processing 98
Distributed Processing: Messaging 98
Distributed Processing: Services 101
Migrating a Data Warehouse 102
Assessing the Current State of a Data Warehouse 102
Designing the Future State of a Data Warehouse 103
Migrating Data, Jobs, and Access Controls 104
Validating the Data Warehouse 105
Exam Essentials 105
Review Questions 107
Chapter 5 Building and Operationalizing Processing Infrastructure 111
Provisioning and Adjusting Processing Resources 112
Provisioning and Adjusting Compute Engine 113
Provisioning and Adjusting Kubernetes Engine 118
Provisioning and Adjusting Cloud Bigtable 124
Provisioning and Adjusting Cloud Dataproc 127
Configuring Managed Serverless Processing Services 129
Monitoring Processing Resources 130
Stackdriver Monitoring 130
Stackdriver Logging 130
Stackdriver Trace 131
Exam Essentials 132
Review Questions 134
Chapter 6 Designing for Security and Compliance 139
Identity and Access Management with Cloud IAM 140
Predefined Roles 141
Custom Roles 143
Using Roles with Service Accounts 145
Access Control with Policies 146
Using IAM with Storage and Processing Services 148
Cloud Storage and IAM 148
Cloud Bigtable and IAM 149
BigQuery and IAM 149
Cloud Dataflow and IAM 150
Data Security 151
Encryption 151
Key Management 153
Ensuring Privacy with the Data Loss Prevention API 154
Detecting Sensitive Data 154
Running Data Loss Prevention Jobs 155
Inspection Best Practices 156
Legal Compliance 156
Health Insurance Portability and Accountability Act (HIPAA) 156
Children's Online Privacy Protection Act 157
FedRAMP 158
General Data Protection Regulation 158
Exam Essentials 158
Review Questions 161
Chapter 7 Designing Databases for Reliability, Scalability, and Availability 165
Designing Cloud Bigtable Databases for Scalability and Reliability 166
Data Modeling with Cloud Bigtable 166
Designing Row-keys 168
Designing for Time Series 170
Use Replication for Availability and Scalability 171
Designing Cloud Spanner Databases for Scalability and Reliability 172
Relational Database Features 173
Interleaved Tables 174
Primary Keys and Hotspots 174
Database Splits 175
Secondary Indexes 176
Query Best Practices 177
Designing BigQuery Databases for Data Warehousing 179
Schema Design for Data Warehousing 179
Clustered and Partitioned Tables 181
Querying Data in BigQuery 182
External Data Access 183
BigQuery ML 185
Exam Essentials 185
Review Questions 188
Chapter 8 Understanding Data Operations for Flexibility and Portability 191
Cataloging and Discovery with Data Catalog 192
Searching in Data Catalog 193
Tagging in Data Catalog 194
Data Preprocessing with Dataprep 195
Cleansing Data 196
Discovering Data 196
Enriching Data 197
Importing and Exporting Data 197
Structuring and Validating Data 198
Visualizing with Data Studio 198
Connecting to Data Sources 198
Visualizing Data 200
Sharing Data 200
Exploring Data with Cloud Datalab 200
Jupyter Notebooks 201
Managing Cloud Datalab Instances 201
Adding Libraries to Cloud Datalab Instances 202
Orchestrating Workflows with Cloud Composer 202
Airflow Environments 203
Creating DAGs 203
Airflow Logs 204
Exam Essentials 204
Review Questions 206
Chapter 9 Deploying Machine Learning Pipelines 209
Structure of ML Pipelines 210
Data Ingestion 211
Data Preparation 212
Data Segregation 215
Model Training 217
Model Evaluation 218
Model Deployment 220
Model Monitoring 221
GCP Options for Deploying Machine Learning Pipelines 221
Cloud AutoML 221
BigQuery ML 223
Kubeflow 223
Spark Machine Learning 224
Exam Essentials 225
Review Questions 227
Chapter 10 Choosing Training and Serving Infrastructure 231
Hardware Accelerators 232
Graphics Processing Units 232
Tensor Processing Units 233
Choosing Between CPUs, GPUs, and TPUs 233
Distributed and Single Machine Infrastructure 234
Single Machine Model Training 234
Distributed Model Training 235
Serving Models 236
Edge Computing with GCP 237
Edge Computing Overview 237
Edge Computing Components and Processes 239
Edge TPU 240
Cloud IoT 240
Exam Essentials 241
Review Questions 244
Chapter 11 Measuring, Monitoring, and Troubleshooting Machine Learning Models 247
Three Types of Machine Learning Algorithms 248
Supervised Learning 248
Unsupervised Learning 253
Anomaly Detection 254
Reinforcement Learning 254
Deep Learning 255
Engineering Machine Learning Models 257
Model Training and Evaluation 257
Operationalizing ML Models 262
Common Sources of Error in Machine Learning Models 263
Data Quality 264
Unbalanced Training Sets 264
Types of Bias 264
Exam Essentials 265
Review Questions 267
Chapter 12 Leveraging Prebuilt Models as a Service 269
Sight 270
Vision AI 270
Video AI 272
Conversation 274
Dialogflow 274
Cloud Text-to-Speech API 275
Cloud Speech-to-Text API 275
Language 276
Translation 276
Natural Language 277
Structured Data 278
Recommendations AI API 278
Cloud Inference API 280
Exam Essentials 280
Review Questions 282
Appendix Answers to Review Questions 285
Chapter 1: Selecting Appropriate Storage Technologies 286
Chapter 2: Building and Operationalizing Storage Systems 288
Chapter 3: Designing Data Pipelines 290
Chapter 4: Designing a Data Processing Solution 291
Chapter 5: Building and Operationalizing Processing Infrastructure 293
Chapter 6: Designing for Security and Compliance 295
Chapter 7: Designing Databases for Reliability, Scalability, and Availability 296
Chapter 8: Understanding Data Operations for Flexibility and Portability 298
Chapter 9: Deploying Machine Learning Pipelines 299
Chapter 10: Choosing Training and Serving Infrastructure 301
Chapter 11: Measuring, Monitoring, and Troubleshooting Machine Learning Models 303
Chapter 12: Leveraging Prebuilt Models as a Service 304
Index 307
The Google Cloud Certified Professional Data Engineer exam tests your ability to design, deploy, monitor, and adapt services and infrastructure for data-driven decision-making. The four primary areas of focus in this exam are as follows:
Designing data processing systems involves selecting storage technologies, including relational, analytical, document, and wide-column databases, such as Cloud SQL, BigQuery, Cloud Firestore, and Cloud Bigtable, respectively. You will also be tested on designing pipelines using services such as Cloud Dataflow, Cloud Dataproc, Cloud Pub/Sub, and Cloud Composer. The exam will test your ability to design distributed systems that may include hybrid clouds, message brokers, middleware, and serverless functions. Expect to see questions on migrating data warehouses from on-premises infrastructure to the cloud.
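As a simple illustration of one such pipeline component, the following sketch publishes a message to a Cloud Pub/Sub topic using the Python client library; the project and topic IDs are placeholders, not values from the exam or this book.

```python
# A minimal sketch of publishing an event to Cloud Pub/Sub;
# "my-project" and "ingest-events" are placeholder identifiers.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "ingest-events")

# Message payloads are raw bytes; attributes carry lightweight metadata.
future = publisher.publish(
    topic_path,
    b'{"sensor_id": 42, "temp_c": 21.5}',
    origin="edge-device",
)
print(f"Published message ID: {future.result()}")
```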
The building and operationalizing data processing systems portion of the exam will test your ability to support storage systems, pipelines, and infrastructure in a production environment. This includes using managed services for storage as well as for batch and stream processing. It also covers common operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. As a data engineer, you are expected to understand how to provision resources, monitor pipelines, and test distributed systems.
Machine learning is an increasingly important topic. This exam will test your knowledge of prebuilt machine learning models available in GCP as well as the ability to deploy machine learning pipelines with custom-built models. You can expect to see questions about machine learning service APIs and data ingestion, as well as training and evaluating models. The exam uses machine learning terminology, so it is important to understand the nomenclature, especially terms such as model, supervised and unsupervised learning, regression, classification, and evaluation metrics.
The fourth domain of knowledge covered in the exam is ensuring solution quality, which includes security, scalability, efficiency, and reliability. Expect questions on ensuring privacy with data loss prevention techniques, encryption, and identity and access management, as well as questions about compliance with major regulations. The exam also tests a data engineer's ability to monitor pipelines with Stackdriver, improve data models, and scale resources as needed. You may also encounter questions that assess your ability to design portable solutions and plan for future business requirements.
In your day-to-day experience with GCP, you may spend more time working on some data engineering tasks than others. This is expected. It does, however, mean that you should be aware of the exam topics with which you may be less familiar. Machine learning questions can be especially challenging to data engineers who work primarily on ingestion and storage systems. Similarly, those who spend a majority of their time developing machine learning models may need to invest more time studying schema modeling for NoSQL databases and designing fault-tolerant distributed systems.
This book covers the topics outlined in the Google Cloud Professional Data Engineer exam guide available here:
cloud.google.com/certification/guides/data-engineer
Chapter 1: Selecting Appropriate Storage Technologies This chapter covers selecting appropriate storage technologies, including mapping business requirements to storage systems; understanding the distinction between structured, semi-structured, and unstructured data models; and designing schemas for relational and NoSQL databases. By the end of the chapter, you should understand the various criteria that data engineers consider when choosing a storage technology.
Chapter 2: Building and Operationalizing Storage Systems This chapter discusses how to deploy storage systems and perform data management operations, such as importing and exporting data, configuring access controls, and doing performance tuning. The services included in this chapter are as follows: Cloud SQL, Cloud Spanner, Cloud Bigtable, Cloud Firestore, BigQuery, Cloud Memorystore, and Cloud Storage. The chapter also includes a discussion of working with unmanaged databases, understanding storage costs and performance, and performing data lifecycle management.
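As an illustration of one common data management operation covered in that chapter, here is a minimal sketch of loading a CSV file from Cloud Storage into BigQuery with the Python client library; the bucket, dataset, and table names are placeholders.

```python
# A minimal sketch of a BigQuery load job; "gs://example-bucket/sales.csv"
# and "my_dataset.sales" are placeholder names.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the table schema from the file
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/sales.csv",
    "my_dataset.sales",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
print(f"Loaded {client.get_table('my_dataset.sales').num_rows} rows.")
```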
Chapter 3: Designing Data Pipelines This chapter describes high-level design patterns, along with some variations on those patterns, for data pipelines. It also reviews how GCP services like Cloud Dataflow, Cloud Dataproc, Cloud Pub/Sub, and Cloud Composer are used to implement data pipelines. It also covers migrating data pipelines from an on-premises Hadoop cluster to GCP.
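For readers unfamiliar with the programming model behind Cloud Dataflow, the following is a minimal Apache Beam sketch that runs locally with the default DirectRunner; the input values and output path are placeholders.

```python
# A minimal Apache Beam pipeline (the SDK used by Cloud Dataflow), run locally.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create records" >> beam.Create(["alpha,1", "beta,2", "alpha,3"])
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Key by name" >> beam.Map(lambda fields: (fields[0], int(fields[1])))
        | "Sum per key" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("/tmp/totals")
    )
```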
Chapter 4: Designing a Data Processing Solution In this chapter, you learn about designing infrastructure for data engineering and machine learning, including how to do several tasks, such as choosing an appropriate compute service for your use case; designing for scalability, reliability, availability, and maintainability; using hybrid and edge computing architecture patterns and processing models; and migrating a data warehouse from on-premises data centers to GCP.
Chapter 5: Building and Operationalizing Processing Infrastructure This chapter discusses managed processing resources, including those offered by App Engine, Cloud Functions, and Cloud Dataflow. The chapter also includes a discussion of how to use Stackdriver Metrics, Stackdriver Logging, and Stackdriver Trace to monitor processing infrastructure.
Chapter 6: Designing for Security and Compliance This chapter introduces several key topics of security and compliance, including identity and access management, data security, encryption and key management, data loss prevention, and compliance.
Chapter 7: Designing Databases for Reliability, Scalability, and Availability This chapter provides information on designing for reliability, scalability, and availability of three GCP databases: Cloud Bigtable, Cloud Spanner, and BigQuery. It also covers how to apply best practices for designing schemas, querying data, and taking advantage of the physical design properties of each database.
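One of those physical design properties for Cloud Bigtable is row-key design. The sketch below writes a time-series row whose key leads with a device identifier to spread writes and avoid hotspotting; the project, instance, table, and column family names are placeholders, and the table and column family are assumed to already exist.

```python
# A minimal Cloud Bigtable write with a hotspot-avoiding row key
# (device ID first, then timestamp); all identifiers are placeholders.
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("sensor-readings")

# Field promotion: leading with the device ID spreads sequential writes
# across tablets instead of concentrating them on one node.
ts = datetime.datetime.utcnow()
row_key = f"device#1234#{ts:%Y%m%d%H%M%S}".encode()

row = table.direct_row(row_key)
row.set_cell("metrics", "temp_c", b"21.5", timestamp=ts)
row.commit()
```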
Chapter 8: Understanding Data Operations for Flexibility and Portability This chapter describes how to use the Data Catalog, a metadata management service supporting the discovery and management of data in Google Cloud. It also introduces Cloud Dataprep, a preprocessing tool for transforming and enriching data, as well as Data Studio for visualizing data and Cloud Datalab for interactive exploration and scripting.
Chapter 9: Deploying Machine Learning Pipelines Machine learning pipelines include several stages that begin with data ingestion and preparation and then perform data segregation followed by model training and evaluation. GCP provides multiple ways to implement machine learning pipelines. This chapter describes how to deploy ML pipelines using general-purpose computing resources, such as Compute Engine and Kubernetes Engine. Managed services, such as Cloud Dataflow and Cloud Dataproc, are also available, as well as specialized machine learning services, such as AI Platform, formerly known as Cloud ML.
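The stages themselves are not GCP-specific. The following minimal scikit-learn sketch walks through them on a bundled toy dataset, purely as an illustration of the stage names used in this chapter.

```python
# A minimal sketch of the ML pipeline stages: ingestion, segregation,
# preparation, training, and evaluation, using a toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # data ingestion
X_train, X_test, y_train, y_test = train_test_split(  # data segregation
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)                # data preparation
model = LogisticRegression(max_iter=200).fit(         # model training
    scaler.transform(X_train), y_train)
acc = accuracy_score(                                  # model evaluation
    y_test, model.predict(scaler.transform(X_test)))
print(f"accuracy: {acc:.3f}")
```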
Chapter 10: Choosing Training and Serving Infrastructure This chapter focuses on choosing the appropriate training and serving infrastructure for your needs when serverless or specialized AI services are not a good fit for your requirements. It discusses distributed and single-machine infrastructure, the use of edge computing for serving machine learning models, and the use of hardware accelerators.
Chapter 11: Measuring, Monitoring, and Troubleshooting Machine Learning Models This chapter focuses on key concepts in machine learning, including machine learning terminology and core concepts and common sources of error in machine learning. Machine learning is a broad discipline with many areas of specialization. This chapter provides you with a high-level overview to help you pass the Professional Data Engineer exam, but it is not a substitute for learning machine learning from resources designed for that purpose.
Chapter 12: Leveraging Prebuilt ML Models as a Service This chapter describes Google Cloud Platform options for using pretrained machine learning models to help developers build and deploy intelligent services quickly. The services are broadly grouped into sight, conversation, language, and structured data. These services are available through APIs or through Cloud AutoML services.
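As a brief illustration of calling one of these pretrained services, the sketch below requests label detection from the Vision API using the Python client library; the image URI is a placeholder, and the example assumes the Cloud Vision API is enabled in the project.

```python
# A minimal sketch of label detection with the pretrained Vision AI service;
# the image URI is a placeholder.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image(
    source=vision.ImageSource(image_uri="gs://example-bucket/photo.jpg"))

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))
```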
Learning the material in the Official Google Cloud Certified Professional Data Engineer Study Guide is an important part of preparing for the Professional Data Engineer certification exam, but we also provide additional tools to help you prepare. The online TestBank will help you understand the types of questions that will appear on the certification exam.
The sample tests in the TestBank include all the questions in each chapter as well as the questions from the assessment test. In addition, there are two practice exams with 50 questions...