Data Analysis Using SQL and Excel


Gordon S. Linoff (Author)

Wiley (Publisher)

2nd edition

Published on December 3, 2015

792 pages

E-Book

ePUB with Adobe DRM

978-1-119-02144-5 (ISBN)

39.99 EUR incl. 7% VAT

E-book single license

Available as a download

Contents

Foreword
Introduction
Chapter 1 A Data Miner Looks at SQL
Chapter 2 What's in a Table? Getting Started with Data Exploration
Chapter 3 How Different Is Different?
Chapter 4 Where Is It All Happening? Location, Location, Location
Chapter 5 It's a Matter of Time
Chapter 6 How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value
Chapter 7 Factors Affecting Survival: The What and Why of Customer Tenure
Chapter 8 Customer Purchases and Other Repeated Events
Chapter 9 What's in a Shopping Cart? Market Basket Analysis
Chapter 10 Association Rules and Beyond
Chapter 11 Data Mining Models in SQL
Chapter 12 The Best-Fit Line: Linear Regression Models
Chapter 13 Building Customer Signatures for Further Analysis
Chapter 14 Performance Is the Issue: Using SQL Effectively
Appendix Equivalent Constructs Among Databases
Index

Introduction

The first edition of this book set out to explain data analysis from an eminently practical perspective, using the familiar tools of SQL and Excel. The guiding principle of the book was to start with questions and guide the reader through the solutions, both from a business perspective and a technical perspective. This approach proved to be quite successful.

Much has changed in the ten years since I started writing the first edition. The tools themselves have changed. In those days, Excel did not have a Ribbon, for instance, and window functions were rare in databases. The world that analysts inhabit has also changed, with tools such as Python and R and NoSQL databases becoming more common. However, relational databases are still in widespread use, and SQL is, if anything, even more relevant today as technology spreads through businesses big and small. Excel still seems to be the reporting and presentation tool of choice for many business users. Big data is no longer a future frontier; it is a problem, a challenge, and an opportunity that we face on a daily basis.

The second edition has been revised and updated to reflect the changes in the underlying software, with more examples and more techniques, and an additional chapter on database performance. In doing so, I have strived to keep the strengths from the first edition. The book is still organized around the principles of data, analysis, and presentation, three capabilities that are rarely treated together. Examples are organized around questions, with a discussion of both the business relevance and the technical approaches to the problems. The examples carry through to actual code. The data, the code, and the Excel examples are all available on the companion website.

The motivation for this approach originally came from a colleague, Nick Drake, who is a statistician by training. Once upon a time, he was looking for a book that would explain how to use SQL for the complex queries needed for data analysis. Books on SQL tend to cover either basic query constructs or the details of how databases work. None come strictly from a perspective of analyzing data, and none are structured around answering questions about data. Of the many books on statistics, none address the simple fact that most of the data being used resides in relational databases. This book fills that gap.

My other books on data mining, written with Michael Berry, focus on advanced algorithms and case studies. By contrast, this book focuses on the "how-to." It starts by describing data stored in databases and continues through preparing and producing results. Interspersed are stories based on my experience in the field, explaining how results might be applied and why some things work and other things do not. The examples are so practical that the data used for them is available on the book's companion website (www.wiley.com/go/dataanalysisusingsqlandexcel2e).

One of the truisms about data warehouses and analysis databases in general is that they don't actually do anything. Yes, they store data. Yes, they bring together data from different sources, cleansing and clarifying along the way. Yes, they define business dimensions, store transactions about customers, and, perhaps, summarize important data. (And, yes, all these are very important!) However, data in a database resides on many spinning disks and in complex data structures in a computer's memory. So much data, so little information.

How can we exploit this data, particularly data that describes customers? The many fancy algorithms for statistical modeling and data mining all have a simple rule: "garbage-in, garbage-out." The results of even the most sophisticated techniques are only as good as the data being used (and the assumptions being fed into the model). Data is central to the task of understanding customers, understanding products, and understanding markets.

The chapters in this book cover different aspects of data and several important analytic techniques that are readily supported by SQL and Excel. The analytic techniques range from exploratory data analysis to survival analysis, from market basket analysis to naïve Bayesian models, and from simple animations to regression. Of course, the potential range of possible techniques is much larger than can be presented in one book. These methods have proven useful over time and are applicable in many different areas.

And finally, data and analysis are not enough. Data must be analyzed, and the results must be presented to the right audience. To fully exploit its value, we must transform data into stories and scenarios, charts and metrics and insights.

Overview of the Book and Technology

This book focuses on three key technological areas used for transforming data into actionable information:

Relational databases store data. The basic language for retrieving data is SQL. (Note that variants of SQL are also used for NoSQL databases.)
Excel spreadsheets are the most popular tool for presenting data. Perhaps the most powerful feature of Excel is its charting capability, which turns columns of numbers into pictures.
Statistics is the foundation of data analysis.

These three technologies are presented together because they are all interrelated. SQL answers the question "How do we access data?" Statistics answers the question "How is it relevant?" And Excel makes it possible to convince other people of the veracity of our findings and to provide them results that they can play with.

The description of data processing is organized around the SQL language. Databases such as Oracle, Postgres, MySQL, IBM DB2, and Microsoft SQL Server are common in the business world, storing the vast majority of business data transactions. The good news is that all relational databases support SQL as a query language. However, just as England and the United States have been described as "two countries separated by a common language," each database supports a slightly different dialect of SQL. The Appendix contains a list of how commonly used functionality is represented in various different dialects.
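As a rough illustration of what "a slightly different dialect" can mean in practice, the sketch below shows how the common task of fetching the first few rows is spelled in several of the databases named above. The helper `first_rows_query`, the template table, and the table and column names are hypothetical and not taken from the book; the Oracle form assumes version 12c or later.

```python
# Hypothetical lookup table: one query template per dialect for
# "return the first n rows in a given order". Not taken from the book.
ROW_LIMIT_TEMPLATES = {
    "postgres": "SELECT {cols} FROM {table} ORDER BY {order} LIMIT {n}",
    "mysql": "SELECT {cols} FROM {table} ORDER BY {order} LIMIT {n}",
    "sqlserver": "SELECT TOP ({n}) {cols} FROM {table} ORDER BY {order}",
    # Assumes Oracle 12c or later, where FETCH FIRST is supported.
    "oracle": "SELECT {cols} FROM {table} ORDER BY {order} FETCH FIRST {n} ROWS ONLY",
    "db2": "SELECT {cols} FROM {table} ORDER BY {order} FETCH FIRST {n} ROWS ONLY",
}


def first_rows_query(dialect, table, cols, order, n):
    """Build the dialect-specific query text (illustrative helper only)."""
    template = ROW_LIMIT_TEMPLATES[dialect]
    return template.format(cols=cols, table=table, order=order, n=int(n))


if __name__ == "__main__":
    for name in ROW_LIMIT_TEMPLATES:
        print(name, "->", first_rows_query(name, "orders", "orderid", "orderdate", 10))
```

The logical request is the same in every case; only the surface syntax changes, which is the kind of difference an equivalence list across databases has to capture.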

Similarly, beautiful presentation tools and professional graphics packages are available. However, rare and exceptional is the workplace computer that does not have Excel or an equivalent spreadsheet.

Statistics and data mining techniques do not always require advanced tools. Some very important techniques are readily available using the combination of SQL and Excel, including survival analysis, look-alike models, naïve Bayesian models, and association rules. In fact, the methods in this book are often more powerful than the methods available in such tools, precisely because they are close to the data and readily customizable. The explanation of the techniques covers both the basic ideas and the extensions that may not be available in other tools.
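To give a sense of how a technique such as survival analysis can sit close to the data, here is a minimal sketch that uses SQL for the counting and a few lines of Python for the running product. It is not the book's own code: the `customers` table, its `tenure` and `stopped` columns, and the function `survival_curve` are assumptions made for illustration, and SQLite stands in for whatever database holds the data. The hazard at each tenure is the number of customers who stopped at that tenure divided by the number still at risk, and survival is the cumulative product of one minus the hazard.

```python
import sqlite3


def survival_curve(rows):
    """Return [(tenure, survival)] from (tenure, stopped) pairs.

    stopped is 1 if the customer left at that tenure and 0 if the
    customer is still active (censored). Hypothetical schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (tenure INTEGER, stopped INTEGER)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

    # For each tenure: how many stopped, and how many were still at risk
    # (everyone whose tenure is at least that long).
    query = """
        SELECT c.tenure,
               SUM(c.stopped) AS stops,
               (SELECT COUNT(*) FROM customers r
                WHERE r.tenure >= c.tenure) AS at_risk
        FROM customers c
        GROUP BY c.tenure
        ORDER BY c.tenure
    """
    survival = 1.0
    curve = []
    for tenure, stops, at_risk in conn.execute(query):
        hazard = stops / at_risk
        survival *= 1.0 - hazard  # probability of surviving past this tenure
        curve.append((tenure, survival))
    conn.close()
    return curve


if __name__ == "__main__":
    sample = [(1, 1), (2, 1), (2, 0), (3, 0)]
    for tenure, survival in survival_curve(sample):
        print(tenure, round(survival, 3))
```

Because the counting is an ordinary query, it could be restricted or grouped by any customer attribute, which is one way to read the claim that such methods are readily customizable.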

The chapters describing the various techniques provide a solid introduction to modeling and data exploration, in the context of familiar tools and data. They also highlight when more advanced tools are useful because the problem exceeds the capabilities of the simpler tools.

How This Book Is Organized

The 14 chapters in this book fall into four parts. The first three introduce key concepts of SQL, Excel, and statistics. The seven middle chapters discuss various methods of exploring data and analytic techniques specifically suited to SQL and Excel. More formal ideas about modeling, in the sense of statistics and data mining, are in the next three chapters. And, finally, a new chapter discusses performance issues when writing SQL queries.

Each chapter explains some aspect of data analysis using SQL and Excel from several different perspectives, including:

Business examples for using the analysis
Questions the analysis answers
Explanations about how the analytic techniques work
SQL syntax for implementing the techniques
Results as tables or charts and how to create them in Excel

Examples in the chapters are generally available in Excel at www.wiley.com/go/dataanalysisusingsqlandexcel2e.

SQL is a concise language that is sometimes difficult to follow. Dataflows, graphical representations of data processing, are used to illustrate how SQL works. These dataflow diagrams are a reasonable approximation of how SQL engines actually process the data, although the details necessarily differ based on the underlying engine.

Results are presented in charts and tables, sprinkled throughout the book. In addition, important features of Excel are highlighted, and interesting uses of Excel graphics are explained. Each chapter has technical asides, typically explaining some aspect of a technique or an interesting bit of history associated with the methods described in the chapter.

Introductory Chapters

Chapter 1, "A Data Miner Looks at SQL," introduces SQL from the perspective of data analysis. This is the querying part of the SQL language, used to extract data from databases using SELECT queries.
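For readers who have not seen one, a SELECT query of the kind described here might look like the following sketch. It runs against an in-memory SQLite database through Python's standard sqlite3 module; the `orders` table, its columns, and the function `large_orders` are hypothetical and are not the book's data set.

```python
import sqlite3


def large_orders(orders, min_amount):
    """Return orders of at least min_amount, largest first.

    orders is a list of (order_id, state, amount) tuples loaded into a
    throwaway in-memory table (hypothetical schema).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, state TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)

    # SELECT picks the columns, WHERE filters the rows, ORDER BY sorts them.
    rows = conn.execute(
        "SELECT order_id, state, amount "
        "FROM orders "
        "WHERE amount >= ? "
        "ORDER BY amount DESC",
        (min_amount,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    sample = [(1, "NY", 20.0), (2, "CA", 75.5), (3, "NY", 50.0)]
    print(large_orders(sample, 50))
```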

This chapter introduces entity-relationship diagrams to describe the structure of the data: the tables and columns and how they relate to each other. It also introduces dataflow diagrams to describe the processing of queries; dataflow diagrams give a visual explanation of how data is processed. This chapter introduces the important functionality used throughout the book, such as joins, aggregations, and window...
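The excerpt breaks off here. As a small illustration of two of the constructs it names, the sketch below joins two related tables and aggregates the result. The `customers` and `orders` tables, their columns, and the function `revenue_by_state` are assumptions made for this example rather than the book's schema, and window functions are left out so the sketch also runs on older SQLite builds.

```python
import sqlite3


def revenue_by_state(customers, orders):
    """Return (state, number of orders, revenue) per state, highest revenue first.

    customers: list of (customer_id, state)
    orders:    list of (order_id, customer_id, amount)
    Both schemas are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, state TEXT)")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)

    # The join attaches each order to its customer's state;
    # GROUP BY then collapses the joined rows to one row per state.
    rows = conn.execute(
        """
        SELECT c.state, COUNT(*) AS num_orders, SUM(o.amount) AS revenue
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id
        GROUP BY c.state
        ORDER BY revenue DESC
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    custs = [(1, "NY"), (2, "CA")]
    ords = [(10, 1, 20.0), (11, 1, 30.0), (12, 2, 40.0)]
    print(revenue_by_state(custs, ords))
```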
