
Google BigQuery Analytics
Content
Part I BigQuery Fundamentals
Chapter 1 The Story of Big Data at Google
Big Data Stack 1.0
Big Data Stack 2.0 (and Beyond)
Open Source Stack
Google Cloud Platform
Cloud Processing
Cloud Storage
Cloud Analytics
Problem Statement
What Is Big Data?
Why Big Data?
Why Do You Need New Ways to Process Big Data?
How Can You Read a Terabyte in a Second?
What about MapReduce?
How Can You Ask Questions of Your Big Data and Quickly Get Answers?
Summary
Chapter 2 BigQuery Fundamentals
What Is BigQuery?
SQL Queries over Big Data
Cloud Storage System
Distributed Cloud Computing
Analytics as a Service (AaaS?)
What BigQuery Isn't
BigQuery Technology Stack
Google Cloud Platform
BigQuery Service History
BigQuery Sensors Application
Sensor Client Android App
BigQuery Sensors AppEngine App
Running Ad-Hoc Queries
Summary
Chapter 3 Getting Started with BigQuery
Creating a Project
Google APIs Console
Free Tier Limitations and Billing
Running Your First Query
Loading Data
Using the Command-Line Client
Install and Setup
Using the Client
Service Account Access
Setting Up Google Cloud Storage
Development Environment
Python Libraries
Java Libraries
Additional Tools
Summary
Chapter 4 Understanding the BigQuery Object Model
Projects
Project Names
Project Billing
Project Access Control
Projects and AppEngine
BigQuery Data
Naming in BigQuery
Schemas
Tables
Datasets
Jobs
Job Components
BigQuery Billing and Quotas
Storage Costs
Processing Costs
Query RPCs
TableData.insertAll() RPCs
Data Model for End-to-End Application
Project
Datasets
Tables
Summary
Part II Basic BigQuery
Chapter 5 Talking to the BigQuery API
Introduction to Google APIs
Authenticating API Access
RESTful Web Services for the SOAP-Less Masses
Discovering Google APIs
Common Operations
BigQuery REST Collections
Projects
Datasets
Tables
TableData
Jobs
BigQuery API Tour
Error Handling in BigQuery
Summary
Chapter 6 Loading Data
Bulk Loads
Moving Bytes
Destination Table
Data Formats
Errors
Limits and Quotas
Streaming Inserts
Summary
Chapter 7 Running Queries
BigQuery Query API
Query API Methods
Query API Features
Query Billing and Quotas
BigQuery Query Language
BigQuery SQL in Five Queries
Differences from Standard SQL
Summary
Chapter 8 Putting It Together
A Quick Tour
Mobile Client
Monitoring Service
Log Collection Service
Log Trampoline
Dashboard
Data Caching
Data Transformation
Web Client
Summary
Part III Advanced BigQuery
Chapter 9 Understanding Query Execution
Background
Storage Architecture
Colossus File System (CFS)
ColumnIO
Durability and Availability
Query Processing
Dremel Serving Trees
Architecture Comparisons
Relational Databases
MapReduce
Summary
Chapter 10 Advanced Queries
Advanced SQL
Subqueries
Combining Tables: Implicit UNION and JOIN
Analytic and Windowing Functions
BigQuery SQL Extensions
The EACH Keyword
Data Sampling
Repeated Fields
Query Errors
Result Too Large
Resources Exceeded
Recipes
Pivot
Cohort Analysis
Parallel Lists
Exact Count Distinct
Trailing Averages
Finding Concurrency
Summary
Chapter 11 Managing Data Stored in BigQuery
Query Caching
Result Caching
Table Snapshots
AppEngine Datastore Integration
Simple Kind
Mixing Types
Final Thoughts
Metatables and Table Sharding
Time Travel
Selecting Tables
Summary
Part IV BigQuery Applications
Chapter 12 External Data Processing
Getting Data Out of BigQuery
Extract Jobs
TableData.list()
AppEngine MapReduce
Sequential Solution
Basic AppEngine MapReduce
BigQuery Integration
Using BigQuery with Hadoop
Querying BigQuery from a Spreadsheet
BigQuery Queries in Google Spreadsheets (Apps Script)
BigQuery Queries in Microsoft Excel
Summary
Chapter 13 Using BigQuery from Third-Party Tools
BigQuery Adapters
Simba ODBC Connector
JDBC Connection Options
Client-Side Encryption with Encrypted BigQuery
Scientific Data Processing Tools in BigQuery
BigQuery from R
Python Pandas and BigQuery
Visualizing Data in BigQuery
Visualizing Your BigQuery Data with Tableau
Visualizing Your BigQuery Data with BIME
Other Data Visualization Options
Summary
Chapter 14 Querying Google Data Sources
Google Analytics
Setting Up BigQuery Access
Table Schema
Querying the Tables
Google AdSense
Table Structure
Leveraging BigQuery
Google Cloud Storage
Summary
Index
Chapter 1
The Story of Big Data at Google
Since its founding in 1998, Google has grown by multiple orders of magnitude in several different dimensions—how many queries it handles, the size of the search index, the amount of user data it stores, the number of services it provides, and the number of users who rely on those services. From a hardware perspective, the Google Search engine has gone from a server sitting under a desk in a lab at Stanford to hundreds of thousands of servers located in dozens of datacenters around the world.
The traditional approach to scaling (outside of Google) has been to scale the hardware up as the demands on it grow. Instead of running your database on a small blade server, run it on a Big Iron machine with 64 processors and a terabyte of RAM. Instead of relying on inexpensive disks, the traditional scaling path moves critical data to costly network-attached storage (NAS).
There are some problems with the scale-up approach, however:
- Scaled-up machines are expensive. If you need one that has twice the processing power, it might cost you five times as much.
- Scaled-up machines are single points of failure. You might need to get more than one expensive server in case of a catastrophic problem, and each one usually ends up being built with so many backup and redundant pieces that you're paying for a lot more hardware than you actually need.
- Scale up has limits. At some point, you lose the ability to add more processors or RAM; you've bought the most expensive and fastest machine that is made (or that you can afford), and it still might not be fast enough.
- Scale up doesn't protect you against software failures. If you have a Big Iron server that has a kernel bug, that machine will crash just as easily (and as hard) as your Windows laptop.
Google, from an early point in time, rejected scale-up architectures. It didn't, however, do this because it saw the limitations more clearly or because it was smarter than everyone else. It rejected scale-up because it was trying to save money. If the hardware vendor quotes you $1 million for the server you need, you could buy 200 $5,000 machines instead. Google engineers thought, “Surely there is a way we could put those 200 servers to work so that the next time we need to increase the size, we just need to buy a few more cheap machines, rather than upgrade to the $5 million server.” Their solution was to scale out, rather than scale up.
Big Data Stack 1.0
Between 2000 and 2004, armed with a few principles, Google laid the foundation for its Big Data strategy:
- Anything can fail, at any time, so write your software expecting unreliable hardware. At most companies, when a database server crashes, it is a serious event. If a network switch dies, it will probably cause downtime. By running in an environment in which individual components fail often, you paradoxically end up with a much more stable system because your software is designed to handle those failures. You can quantify your risk beyond blindly quoting statistics, such as mean time between failures (MTBFs) or service-level agreements (SLAs).
- Use only commodity, off-the-shelf components. This has a number of advantages: You don't get locked into a particular vendor's feature set; you can always find replacements; and you don't experience big price discontinuities when you upgrade to the “bigger” version.
- The cost for twice the amount of capacity should not be considerably more than the cost for twice the amount of hardware. This means the software must be built to scale out, rather than up. However, this also imposes limits on the types of operations that you can do. For instance, if you scale out your database, it may be difficult to do a JOIN operation, since you'd need to join data together that lives on different machines.
- “A foolish consistency is the hobgoblin of little minds.” If you abandon the “C” (consistency) in ACID database operations, it becomes much easier to parallelize operations. This has a cost, however; loss of consistency means that programmers have to handle cases in which reading data they just wrote might return a stale (inconsistent) copy. This means you need smart programmers.
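The point about JOIN operations in a scaled-out database can be made concrete with a small sketch. The Python below is an illustration only, not how any Google system works: the table layouts, the `NUM_MACHINES` value, and the modulo placement in `shard_for` are all assumptions chosen for the example. It shows that rows from both tables must first be moved to the machine that owns their join key before any machine can join locally.

```python
from collections import defaultdict

NUM_MACHINES = 3  # hypothetical cluster size


def shard_for(key, num_machines=NUM_MACHINES):
    """Pick the machine that owns a key (simple modulo placement for integer keys)."""
    return key % num_machines


def repartition(rows, key_index, num_machines=NUM_MACHINES):
    """Send every row to the machine that owns its join key (the 'shuffle')."""
    machines = defaultdict(list)
    for row in rows:
        machines[shard_for(row[key_index], num_machines)].append(row)
    return machines


def distributed_join(users, orders):
    """Join users(user_id, name) with orders(order_id, user_id, amount) on user_id."""
    users_by_machine = repartition(users, key_index=0)
    orders_by_machine = repartition(orders, key_index=1)

    joined = []
    for machine in range(NUM_MACHINES):
        # After repartitioning, matching keys from both tables sit on the
        # same machine, so each machine can join its rows locally.
        names = {user_id: name for user_id, name in users_by_machine[machine]}
        for order_id, user_id, amount in orders_by_machine[machine]:
            if user_id in names:
                joined.append((names[user_id], order_id, amount))
    return sorted(joined)


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(100, 1, 9.5), (101, 3, 20.0), (102, 1, 4.0), (103, 7, 1.0)]
result = distributed_join(users, orders)
```

On a real cluster the repartition step is network traffic between machines, which is where the difficulty comes from; here it is simulated with in-memory lists.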
These principles, along with a cost-saving necessity, inspired new computation architectures. Over a short period of time, Google produced three technologies that inspired the Big Data revolution:
- Google File System (GFS): A distributed, cluster-based filesystem. GFS assumes that any disk can fail, so data is stored in multiple locations, which means that data is still available even when a disk that it was stored on crashes.
- MapReduce: A computing paradigm that divides problems into easily parallelizable pieces and orchestrates running them across a cluster of machines.
- Bigtable: A forerunner of the NoSQL database, Bigtable enables structured storage to scale out to multiple servers. Bigtable is also replicated, so failure of any particular tablet server doesn't cause data loss.
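To make the MapReduce paradigm described above more tangible, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to word counting. It is a teaching toy under stated assumptions, not Google's or Hadoop's implementation: the function names are hypothetical, and the work that a real framework spreads across a cluster runs here in one Python process.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_pairs):
    """Shuffle: group all values by key so each key reaches a single reducer."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    mapped = []
    for doc in documents:
        # On a real cluster, each split would be mapped on a different machine.
        mapped.extend(map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = map_reduce(["big data at Google", "Big Data stack", "data"])
```

Because each map call looks at only one document and each reduce call at only one key, both stages can be run in parallel without coordination; the shuffle is the only step that needs data to move between workers.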
What's more, Google published papers on these technologies, which enabled others to emulate them outside of Google. Doug Cutting and other open source contributors integrated the concepts into a tool called Hadoop. Although Hadoop is considered to be primarily a MapReduce implementation, it also incorporates GFS and Bigtable clones, which are called HDFS and HBase, respectively.
Armed with these three technologies, Google replaced nearly all the off-the-shelf software usually used to run a business. It didn't need (with a couple of exceptions) a traditional SQL database; it didn't need an e-mail server because its Gmail service was built on top of these technologies.
Big Data Stack 2.0 (and Beyond)
The three technologies—GFS, MapReduce, and Bigtable—made it possible for Google to scale out its infrastructure. However, they didn't make it easy. Over the next few years, a number of problems emerged:
- MapReduce is hard. It can be difficult to set up and difficult to decompose your problem into Map and Reduce phases. If you need multiple MapReduce rounds (which is common for many real-world problems), you face the issue of how to deal with state in between phases and how to deal with partial failures without having to restart the whole thing.
- MapReduce can be slow. If you want to ask questions of your data, you have to wait minutes or hours to get the answers. Moreover, you have to write custom C++ or Java code each time you want to change the question that you're asking.
- GFS, while improving durability of the data (since it is replicated multiple times), can suffer from reduced availability, since the metadata server is a single point of failure.
- Bigtable has problems in a multidatacenter environment. Most services run in multiple locations; Bigtable replication between datacenters is only eventually consistent (meaning that data that gets written out will show up everywhere, but not immediately). Individual services spend a lot of redundant effort babysitting the replication process.
- Programmers (even Google programmers) have a really difficult time dealing with eventual consistency. This same problem occurred when Intel engineers tried improving CPU performance by relaxing the memory model to be eventually consistent; it caused lots of subtle bugs because the hardware stopped working the way people's mental model of it operated.
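The eventual-consistency problem in the list above is easier to see in code. The sketch below is a simplified model, assuming one primary and one remote replica with a replication queue that is drained by hand; the class and method names are hypothetical and do not correspond to any Bigtable API. It shows a reader in the remote datacenter getting an old value after a newer one has already been written.

```python
from collections import deque


class Replica:
    """One datacenter's copy of the data."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class EventuallyConsistentStore:
    """Writes land on the primary at once and reach the remote replica later."""

    def __init__(self):
        self.primary = Replica()
        self.remote = Replica()
        self.pending = deque()  # writes not yet applied to the remote replica

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Apply all queued writes to the remote replica, in order."""
        while self.pending:
            key, value = self.pending.popleft()
            self.remote.data[key] = value


store = EventuallyConsistentStore()
store.write("profile", "v1")
store.replicate()
store.write("profile", "v2")

local_read = store.primary.read("profile")   # the primary has the new value
stale_read = store.remote.read("profile")    # the remote replica still has the old one
store.replicate()
converged_read = store.remote.read("profile")
```

Between the second write and the second `replicate()` call, the two replicas disagree. Application code that reads from the remote replica in that window has to tolerate the stale value, which is the burden the text describes.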
Over the next several years, Google built a number of additional infrastructure components that refined the ideas from the 1.0 stack:
- Colossus: A distributed filesystem that works around many of the limitations in GFS. Unlike many of the other technologies used at Google, Colossus' architecture hasn't been publicly disclosed in research papers.
- Megastore: A geographically replicated, consistent NoSQL-type datastore. Megastore uses the Paxos algorithm to ensure consistent reads and writes. This means that if you write data in one datacenter, it is immediately available in all other datacenters.
- Spanner: A globally replicated datastore that can handle data locality constraints, like “This data is allowed to reside only in European datacenters.” Spanner managed to solve the problem of global time ordering in a geographically distributed system by using atomic clocks to guarantee synchronization to within a known bound.
- FlumeJava: A system that allows you to write idiomatic Java code that runs over collections of Big Data. Flume operations get compiled and optimized to run as a series of MapReduce operations. This solves the ease of setup, ease of writing, and ease of handling multiple MapReduce problems previously mentioned.
- Dremel: A distributed SQL query engine that can perform complex queries over data stored on Colossus, GFS, or elsewhere.
The version 2.0 stack, built piecemeal on top of the version 1.0 stack (Megastore is built on top of Bigtable, for instance), addresses many of the drawbacks of the previous version. For instance, Megastore allows services to write from any datacenter and know that other readers will read the most up-to-date version. Spanner is, in many ways, a successor to Megastore that adds automatic planet-scale replication and data provenance protection.
On...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)