Google BigQuery Analytics

Name: Google BigQuery Analytics
Brand: Wiley
Price: 34.99 EUR
Availability: OnlineOnly

Jordan Tigani, Siddartha Naidu (Authors)

Wiley (Publisher)

Published on 21 May 2014

528 pages

E-Book

ePUB with Adobe-DRM

978-1-118-82479-5 (ISBN)

€34.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Introduction

Part I BigQuery Fundamentals

Chapter 1 The Story of Big Data at Google

Big Data Stack 1.0

Big Data Stack 2.0 (and Beyond)

Open Source Stack

Google Cloud Platform

Cloud Processing

Cloud Storage

Cloud Analytics

Problem Statement

What Is Big Data?

Why Big Data?

Why Do You Need New Ways to Process Big Data?

How Can You Read a Terabyte in a Second?

What about MapReduce?

How Can You Ask Questions of Your Big Data and Quickly Get Answers?

Summary

Chapter 2 BigQuery Fundamentals

What Is BigQuery?

SQL Queries over Big Data

Cloud Storage System

Distributed Cloud Computing

Analytics as a Service (AaaS?)

What BigQuery Isn't

BigQuery Technology Stack

Google Cloud Platform

BigQuery Service History

BigQuery Sensors Application

Sensor Client Android App

BigQuery Sensors AppEngine App

Running Ad-Hoc Queries

Summary

Chapter 3 Getting Started with BigQuery

Creating a Project

Google APIs Console

Free Tier Limitations and Billing

Running Your First Query

Loading Data

Using the Command-Line Client

Install and Setup

Using the Client

Service Account Access

Setting Up Google Cloud Storage

Development Environment

Python Libraries

Java Libraries

Additional Tools

Summary

Chapter 4 Understanding the BigQuery Object Model

Projects

Project Names

Project Billing

Project Access Control

Projects and AppEngine

BigQuery Data

Naming in BigQuery

Schemas

Tables

Datasets

Jobs

Job Components

BigQuery Billing and Quotas

Storage Costs

Processing Costs

Query RPCs

TableData.insertAll() RPCs

Data Model for End-to-End Application

Project

Datasets

Tables

Summary

Part II Basic BigQuery

Chapter 5 Talking to the BigQuery API

Introduction to Google APIs

Authenticating API Access

RESTful Web Services for the SOAP-Less Masses

Discovering Google APIs

Common Operations

BigQuery REST Collections

Projects

Datasets

Tables

TableData

Jobs

BigQuery API Tour

Error Handling in BigQuery

Summary

Chapter 6 Loading Data

Bulk Loads

Moving Bytes

Destination Table

Data Formats

Errors

Limits and Quotas

Streaming Inserts

Summary

Chapter 7 Running Queries

BigQuery Query API

Query API Methods

Query API Features

Query Billing and Quotas

BigQuery Query Language

BigQuery SQL in Five Queries

Differences from Standard SQL

Summary

Chapter 8 Putting It Together

A Quick Tour

Mobile Client

Monitoring Service

Log Collection Service

Log Trampoline

Dashboard

Data Caching

Data Transformation

Web Client

Summary

Part III Advanced BigQuery

Chapter 9 Understanding Query Execution

Background

Storage Architecture

Colossus File System (CFS)

ColumnIO

Durability and Availability

Query Processing

Dremel Serving Trees

Architecture Comparisons

Relational Databases

MapReduce

Summary

Chapter 10 Advanced Queries

Advanced SQL

Subqueries

Combining Tables: Implicit UNION and JOIN

Analytic and Windowing Functions

BigQuery SQL Extensions

The EACH Keyword

Data Sampling

Repeated Fields

Query Errors

Result Too Large

Resources Exceeded

Recipes

Pivot

Cohort Analysis

Parallel Lists

Exact Count Distinct

Trailing Averages

Finding Concurrency

Summary

Chapter 11 Managing Data Stored in BigQuery

Query Caching

Result Caching

Table Snapshots

AppEngine Datastore Integration

Simple Kind

Mixing Types

Final Thoughts

Metatables and Table Sharding

Time Travel

Selecting Tables

Summary

Part IV BigQuery Applications

Chapter 12 External Data Processing

Getting Data Out of BigQuery

Extract Jobs

TableData.list()

AppEngine MapReduce

Sequential Solution

Basic AppEngine MapReduce

BigQuery Integration

Using BigQuery with Hadoop

Querying BigQuery from a Spreadsheet

BigQuery Queries in Google Spreadsheets (Apps Script)

BigQuery Queries in Microsoft Excel

Summary

Chapter 13 Using BigQuery from Third-Party Tools

BigQuery Adapters

Simba ODBC Connector

JDBC Connection Options

Client-Side Encryption with Encrypted BigQuery

Scientific Data Processing Tools in BigQuery

BigQuery from R

Python Pandas and BigQuery

Visualizing Data in BigQuery

Visualizing Your BigQuery Data with Tableau

Visualizing Your BigQuery Data with BIME

Other Data Visualization Options

Summary

Chapter 14 Querying Google Data Sources

Google Analytics

Setting Up BigQuery Access

Table Schema

Querying the Tables

Google AdSense

Table Structure

Leveraging BigQuery

Google Cloud Storage

Summary

Index

Chapter 1
The Story of Big Data at Google

Since its founding in 1998, Google has grown by multiple orders of magnitude in several different dimensions—how many queries it handles, the size of the search index, the amount of user data it stores, the number of services it provides, and the number of users who rely on those services. From a hardware perspective, the Google Search engine has gone from a server sitting under a desk in a lab at Stanford to hundreds of thousands of servers located in dozens of datacenters around the world.

The traditional approach to scaling (outside of Google) has been to scale the hardware up as the demands on it grow. Instead of running your database on a small blade server, run it on a Big Iron machine with 64 processors and a terabyte of RAM. Instead of relying on inexpensive disks, the traditional scaling path moves critical data to costly network-attached storage (NAS).

There are some problems with the scale-up approach, however:

Scaled-up machines are expensive. If you need one that has twice the processing power, it might cost you five times as much.
Scaled-up machines are single points of failure. You might need to get more than one expensive server in case of a catastrophic problem, and each one usually ends up being built with so many backup and redundant pieces that you're paying for a lot more hardware than you actually need.
Scale up has limits. At some point, you lose the ability to add more processors or RAM; you've bought the most expensive and fastest machine that is made (or that you can afford), and it still might not be fast enough.
Scale up doesn't protect you against software failures. If you have a Big Iron server that has a kernel bug, that machine will crash just as easily (and as hard) as your Windows laptop.

Google, from an early point in time, rejected scale-up architectures. It didn't, however, do this because it saw the limitations more clearly or because it was smarter than everyone else. It rejected scale-up because it was trying to save money. If the hardware vendor quotes you $1 million for the server you need, you could buy 200 $5,000 machines instead. Google engineers thought, “Surely there is a way we could put those 200 servers to work so that the next time we need to increase the size, we just need to buy a few more cheap machines, rather than upgrade to the $5 million server.” Their solution was to scale out, rather than scale up.
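
The arithmetic behind that trade-off can be sketched in a few lines of Python. This is only an illustration built from the chapter's hypothetical figures (a $1 million quote, $5,000 commodity machines, and a price that multiplies by five when capacity doubles); the function names and the linear scale-out assumption are ours, not real vendor pricing.

```python
# Illustrative cost model only: the dollar figures are the chapter's
# hypothetical numbers, and the function names are made up for this sketch.
BIG_SERVER_QUOTE = 1_000_000  # quoted price of the single large server, in dollars
COMMODITY_PRICE = 5_000       # price of one inexpensive machine, in dollars


def machines_for_budget(budget: int, unit_price: int) -> int:
    """Return how many whole machines a given budget buys."""
    return budget // unit_price


def scale_up_cost(base_price: int, doublings: int, factor: int = 5) -> int:
    """Cost after doubling capacity `doublings` times, assuming each
    doubling multiplies the price by `factor` (an assumed ratio)."""
    return base_price * factor ** doublings


def scale_out_cost(base_price: int, doublings: int) -> int:
    """Cost of the same capacity if price grows linearly with machine count."""
    return base_price * 2 ** doublings


fleet_size = machines_for_budget(BIG_SERVER_QUOTE, COMMODITY_PRICE)
print(fleet_size)                          # number of commodity machines
print(scale_up_cost(BIG_SERVER_QUOTE, 1))  # next-size-up server
print(scale_out_cost(BIG_SERVER_QUOTE, 1)) # twice as many cheap machines
```

Under these assumptions, one doubling of capacity costs $5 million when scaling up but $2 million when scaling out, which is the gap the Google engineers were trying to exploit.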

Big Data Stack 1.0

Between 2000 and 2004, armed with a few principles, Google laid the foundation for its Big Data strategy:

Anything can fail, at any time, so write your software expecting unreliable hardware. At most companies, when a database server crashes, it is a serious event. If a network switch dies, it will probably cause downtime. By running in an environment in which individual components fail often, you paradoxically end up with a much more stable system because your software is designed to handle those failures. You can quantify your risk beyond blindly quoting statistics, such as mean time between failures (MTBFs) or service-level agreements (SLAs).
Use only commodity, off-the-shelf components. This has a number of advantages: You don't get locked into a particular vendor's feature set; you can always find replacements; and you don't experience big price discontinuities when you upgrade to the “bigger” version.
The cost for twice the amount of capacity should not be considerably more than the cost for twice the amount of hardware. This means the software must be built to scale out, rather than up. However, this also imposes limits on the types of operations that you can do. For instance, if you scale out your database, it may be difficult to do a JOIN operation, since you'd need to join data together that lives on different machines.
“A foolish consistency is the hobgoblin of little minds.” If you abandon the “C” (consistency) in ACID database operations, it becomes much easier to parallelize operations. This has a cost, however; loss of consistency means that programmers have to handle cases in which reading data they just wrote might return a stale (inconsistent) copy. This means you need smart programmers.
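
To see why a JOIN becomes harder once data is scaled out, consider the minimal sketch below. It assumes two small tables of (key, value) rows and a hypothetical three-machine cluster; rows must first be moved ("shuffled") so that matching keys land on the same machine before any local join can happen. The modulo on an integer key stands in for a real hash function, and none of the names come from the book.

```python
NUM_MACHINES = 3  # hypothetical cluster size


def shuffle_by_key(rows, num_machines):
    """Route each (key, value) row to the machine that owns its key.

    The modulo on an integer key is a stand-in for a real hash function.
    """
    partitions = [[] for _ in range(num_machines)]
    for key, value in rows:
        partitions[key % num_machines].append((key, value))
    return partitions


def local_join(left_rows, right_rows):
    """Join two lists of (key, value) rows that sit on the same machine."""
    right_by_key = {}
    for key, value in right_rows:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, left_value, right_value)
        for key, left_value in left_rows
        for right_value in right_by_key.get(key, [])
    ]


def distributed_join(left_rows, right_rows, num_machines=NUM_MACHINES):
    """Shuffle both tables by key, then join each machine's share locally."""
    left_parts = shuffle_by_key(left_rows, num_machines)
    right_parts = shuffle_by_key(right_rows, num_machines)
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        joined.extend(local_join(left_part, right_part))
    return sorted(joined)


users = [(1, "ada"), (2, "grace"), (3, "linus")]
orders = [(1, "book"), (3, "laptop"), (3, "mouse"), (4, "desk")]
result = distributed_join(users, orders)
print(result)
```

The shuffle step is the expensive part in a real cluster, because every row may have to cross the network before the join can start.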

These principles, along with a cost-saving necessity, inspired new computation architectures. Over a short period of time, Google produced three technologies that inspired the Big Data revolution:

Google File System (GFS): A distributed, cluster-based filesystem. GFS assumes that any disk can fail, so data is stored in multiple locations, which means that data is still available even when a disk that it was stored on crashes.
MapReduce: A computing paradigm that divides problems into easily parallelizable pieces and orchestrates running them across a cluster of machines.
Bigtable: A forerunner of the NoSQL database, Bigtable enables structured storage to scale out to multiple servers. Bigtable is also replicated, so failure of any particular tablet server doesn't cause data loss.
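
As a rough illustration of the MapReduce paradigm described above, here is a single-process word count in Python. It is a sketch of the map, shuffle, and reduce phases only; the function names are ours, and a real implementation would run each phase across many machines and handle failures.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence inside one process."""
    mapped = []
    for document in documents:  # in a cluster, each document could go to a different worker
        mapped.extend(map_phase(document))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data at Google", "Big Data stack"])
print(counts)
```

Because each call to `map_phase` and each call to `reduce_phase` depends only on its own input, the work parallelizes naturally; the orchestration between phases is what the framework provides.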

What's more, Google published papers on these technologies, which enabled others to emulate them outside of Google. Doug Cutting and other open source contributors integrated the concepts into a tool called Hadoop. Although Hadoop is considered to be primarily a MapReduce implementation, it also incorporates GFS and Bigtable clones, which are called HDFS and HBase, respectively.

Armed with these three technologies, Google replaced nearly all the off-the-shelf software usually used to run a business. It didn't need (with a couple of exceptions) a traditional SQL database; it didn't need an e-mail server because its Gmail service was built on top of these technologies.

Big Data Stack 2.0 (and Beyond)

The three technologies—GFS, MapReduce, and Bigtable—made it possible for Google to scale out its infrastructure. However, they didn't make it easy. Over the next few years, a number of problems emerged:

MapReduce is hard. It can be difficult to set up and difficult to decompose your problem into Map and Reduce phases. If you need multiple MapReduce rounds (which is common for many real-world problems), you face the issue of how to deal with state in between phases and how to deal with partial failures without having to restart the whole thing.
MapReduce can be slow. If you want to ask questions of your data, you have to wait minutes or hours to get the answers. Moreover, you have to write custom C++ or Java code each time you want to change the question that you're asking.
GFS has an availability problem. Although it improves the durability of data (since the data is replicated multiple times), it can suffer from reduced availability, since the metadata server is a single point of failure.
Bigtable has problems in a multidatacenter environment. Most services run in multiple locations; Bigtable replication between datacenters is only eventually consistent (meaning that data that gets written out will show up everywhere, but not immediately). Individual services spend a lot of redundant effort babysitting the replication process.
Programmers (even Google programmers) have a really difficult time dealing with eventual consistency. This same problem occurred when Intel engineers tried improving CPU performance by relaxing the memory model to be eventually consistent; it caused lots of subtle bugs because the hardware stopped working the way people's mental model of it operated.
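
The eventual-consistency problem in the last two items can be made concrete with a toy simulation. The sketch below assumes a store with two replicas and an explicit `replicate()` step that stands in for asynchronous cross-datacenter replication; the class and method names are hypothetical and do not correspond to any Bigtable API.

```python
class EventuallyConsistentStore:
    """Toy key-value store whose replicas converge only after replicate()."""

    def __init__(self, num_replicas=2):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, replica_index, key, value):
        """Apply a write locally and queue it for later replication."""
        self.replicas[replica_index][key] = value
        self.pending.append((replica_index, key, value))

    def read(self, replica_index, key):
        """Read from one replica; the answer may be stale."""
        return self.replicas[replica_index].get(key)

    def replicate(self):
        """Copy every queued write to the replicas that have not seen it."""
        for source, key, value in self.pending:
            for index, replica in enumerate(self.replicas):
                if index != source:
                    replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write(0, "balance", 100)
stale = store.read(1, "balance")  # the other replica has not seen the write yet
store.replicate()
fresh = store.read(1, "balance")  # now both replicas agree
print(stale, fresh)
```

A program that writes to one replica and then reads from another has to be prepared for the stale answer, which is the kind of case that makes eventual consistency hard to reason about.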

Over the next several years, Google built a number of additional infrastructure components that refined the ideas from the 1.0 stack:

Colossus: A distributed filesystem that works around many of the limitations in GFS. Unlike many of the other technologies used at Google, Colossus' architecture hasn't been publicly disclosed in research papers.
Megastore: A geographically replicated, consistent NoSQL-type datastore. Megastore uses the Paxos algorithm to ensure consistent reads and writes. This means that if you write data in one datacenter, it is immediately available in all other datacenters.
Spanner: A globally replicated datastore that can handle data locality constraints, like “This data is allowed to reside only in European datacenters.” Spanner managed to solve the problem of global time ordering in a geographically distributed system by using atomic clocks to guarantee synchronization to within a known bound.
FlumeJava: A system that allows you to write idiomatic Java code that runs over collections of Big Data. Flume operations get compiled and optimized to run as a series of MapReduce operations. This solves the ease of setup, ease of writing, and ease of handling multiple MapReduce problems previously mentioned.
Dremel: A distributed SQL query engine that can perform complex queries over data stored on Colossus, GFS, or elsewhere.
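
The idea of ordering events using clocks that are synchronized "to within a known bound" can be sketched as follows. This is a simplified illustration, not Spanner's actual interface: each timestamp is treated as an interval of width twice the assumed maximum clock error, and two events can be ordered with certainty only when their intervals do not overlap. All names and the 4-millisecond bound are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """Range of times (in milliseconds) in which an event may really have occurred."""
    earliest: int
    latest: int


def now_with_uncertainty(clock_reading_ms, max_error_ms):
    """Turn a local clock reading into an interval using a known error bound."""
    return TimeInterval(clock_reading_ms - max_error_ms, clock_reading_ms + max_error_ms)


def definitely_before(first, second):
    """True only if `first` ended before `second` could possibly have begun."""
    return first.latest < second.earliest


MAX_ERROR_MS = 4  # assumed bound on clock disagreement between datacenters

write_in_europe = now_with_uncertainty(10_000, MAX_ERROR_MS)
write_in_asia = now_with_uncertainty(10_010, MAX_ERROR_MS)
write_in_america = now_with_uncertainty(10_005, MAX_ERROR_MS)

print(definitely_before(write_in_europe, write_in_asia))     # intervals do not overlap
print(definitely_before(write_in_europe, write_in_america))  # intervals overlap
```

The tighter the bound on clock error, the fewer pairs of events fall into the ambiguous overlapping case, which is why highly accurate clocks help with global ordering.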

The version 2.0 stack, built piecemeal on top of the version 1.0 stack (Megastore is built on top of Bigtable, for instance), addresses many of the drawbacks of the previous version. For instance, Megastore allows services to write from any datacenter and know that other readers will read the most up-to-date version. Spanner is in many ways a successor to Megastore, adding automatic planet-scale replication and data provenance protection.

On...

Schweitzer Fachinformationen
