Introductory Statistics and Analytics

Name: Introductory Statistics and Analytics | A Resampling Perspective
Brand: Wiley
Price: 72.99 EUR
Availability: OnlineOnly

A Resampling Perspective

Peter C. Bruce (Author)

Wiley (Publisher)

Published on 8 January 2015

312 pages

E-Book

ePUB with Adobe-DRM

978-1-118-88133-0 (ISBN)

€72.99 incl. 7% VAT

E-Book Single Licence

Available for download

Description

All about e-books: answers to questions about e-books, copy protection, and file formats can be found in our info and help section.

Concise, thoroughly class-tested primer that features basic statistical concepts in the context of analytics, resampling, and the bootstrap

A uniquely developed presentation of key statistical topics, Introductory Statistics and Analytics: A Resampling Perspective provides an accessible approach to statistical analytics, resampling, and the bootstrap for readers with various levels of exposure to basic probability and statistics. Originally class-tested at one of the first online learning companies in the discipline, www.statistics.com, the book primarily focuses on applications of statistical concepts developed via resampling, with a background discussion of mathematical theory. This feature stresses statistical literacy and understanding, which demonstrates the fundamental basis for statistical inference and demystifies traditional formulas.

The book begins with illustrations that have the essential statistical topics interwoven throughout before moving on to demonstrate the proper design of studies. Meeting all of the Guidelines for Assessment and Instruction in Statistics Education (GAISE) requirements for an introductory statistics course, Introductory Statistics and Analytics: A Resampling Perspective also includes:

* Over 300 "Try It Yourself" exercises and intermittent practice questions, which challenge readers at multiple levels to investigate and explore key statistical concepts
* Numerous interactive links designed to provide solutions to exercises and further information on crucial concepts
* Linkages that connect statistics to the rapidly growing field of data science
* Multiple discussions of various software systems, such as Microsoft Office Excel®, StatCrunch, and R, to develop and analyze data
* Areas of concern and/or contrasting points of view indicated through the use of "Caution" icons

Introductory Statistics and Analytics: A Resampling Perspective is an excellent primary textbook for courses in preliminary statistics as well as a supplement for courses in upper-level statistics and related fields, such as biostatistics and econometrics. The book is also a general reference for readers interested in revisiting the value of statistics.

Reviews / Votes

"The book is an excellent primary textbook for courses in preliminary statistics as well as a supplement for courses in upper-level statistics and related fields, such as biostatistics and econometrics. The book is also a general reference for readers interested in revisiting the value of statistics." (Zentralblatt MATH, 1 April 2015)

Content

Preface

Acknowledgments

Introduction

1 Designing and Carrying Out a Statistical Study

1.1 A Small Example

1.2 Is Chance Responsible? The Foundation of Hypothesis Testing

1.3 A Major Example

1.4 Designing an Experiment

1.5 What to Measure: Central Location

1.6 What to Measure: Variability

1.7 What to Measure: Distance (Nearness)

1.8 Test Statistic

1.9 The Data

1.10 Variables and Their Flavors

1.11 Examining and Displaying the Data

1.12 Are we Sure we Made a Difference?

Appendix: Historical Note

1.13 Exercises

2 Statistical Inference

2.1 Repeating the Experiment

2.2 How Many Reshuffles?

2.3 How Odd is Odd?

2.4 Statistical and Practical Significance

2.5 When to use Hypothesis Tests

2.6 Exercises

3 Displaying and Exploring Data

3.1 Bar Charts

3.2 Pie Charts

3.3 Misuse of Graphs

3.4 Indexing

3.5 Exercises

4 Probability

4.1 Mendel's Peas

4.2 Simple Probability

4.3 Random Variables and their Probability Distributions

4.4 The Normal Distribution

4.5 Exercises

5 Relationship between Two Categorical Variables

5.1 Two-Way Tables

5.2 Comparing Proportions

5.3 More Probability

5.4 From Conditional Probabilities to Bayesian Estimates

5.5 Independence

5.6 Exploratory Data Analysis (EDA)

5.7 Exercises

6 Surveys and Sampling

6.1 Simple Random Samples

6.2 Margin of Error: Sampling Distribution for a Proportion

6.3 Sampling Distribution for a Mean

6.4 A Shortcut: the Bootstrap

6.5 Beyond Simple Random Sampling

6.6 Absolute Versus Relative Sample Size

6.7 Exercises

7 Confidence Intervals

7.1 Point Estimates

7.2 Interval Estimates (Confidence Intervals)

7.3 Confidence Interval for a Mean

7.4 Formula-Based Counterparts to the Bootstrap

7.5 Standard Error

7.6 Confidence Intervals for a Single Proportion

7.7 Confidence Interval for a Difference in Means

7.8 Confidence Interval for a Difference in Proportions

7.9 Recapping

Appendix A: More on the Bootstrap

Resampling Procedure: Parametric Bootstrap

Formulas and the Parametric Bootstrap

Appendix B: Alternative Populations

Appendix C: Binomial Formula Procedure

7.10 Exercises

8 Hypothesis Tests

8.1 Review of Terminology

8.2 A-B Tests: The Two Sample Comparison

8.3 Comparing Two Means

8.4 Comparing Two Proportions

8.5 Formula-Based Alternative: t-Test for Means

8.6 The Null and Alternative Hypotheses

8.7 Paired Comparisons

Appendix A: Confidence Intervals Versus Hypothesis Tests

Confidence Interval

Relationship Between the Hypothesis Test and the Confidence Interval

Comment

Appendix B: Formula-Based Variations of Two-Sample Tests

Z-Test With Known Population Variance

Pooled Versus Separate Variances

Formula-Based Alternative: Z-Test for Proportions

8.8 Exercises

9 Hypothesis Testing 2

9.1 A Single Proportion

9.2 A Single Mean

9.3 More Than Two Categories or Samples

9.4 Continuous Data

9.5 Goodness-of-Fit

Appendix: Normal Approximation; Hypothesis Test of a Single Proportion

Confidence Interval for a Mean

9.6 Exercises

10 Correlation

10.1 Example: Delta Wire

10.2 Example: Cotton Dust and Lung Disease

10.3 The Vector Product and Sum Test

10.4 Correlation Coefficient

10.5 Other Forms of Association

10.6 Correlation is not Causation

10.7 Exercises

11 Regression

11.1 Finding the Regression Line by Eye

11.2 Finding the Regression Line by Minimizing Residuals

11.3 Linear Relationships

11.4 Inference for Regression

11.5 Exercises

12 Analysis of Variance: ANOVA

12.1 Comparing More Than Two Groups: ANOVA

12.2 The Problem of Multiple Inference

12.3 A Single Test

12.4 Components of Variance

12.5 Two-Way ANOVA

12.6 Factorial Design

12.7 Exercises

13 Multiple Regression

13.1 Regression as Explanation

13.2 Simple Linear Regression: Explore the Data First

13.3 More Independent Variables

13.4 Model Assessment and Inference

13.5 Assumptions

13.6 Interaction Again

13.7 Regression for Prediction

13.8 Exercises

Index

Introduction

As of the writing of this book, the fields of statistics and data science are evolving rapidly to meet the changing needs of business, government, and research organizations. It is an oversimplification, but still useful, to think of two distinct communities as you proceed through the book:

The traditional academic and medical research communities that typically conduct extended research projects adhering to rigorous regulatory or publication standards, and
Business and large organizations that use statistical methods to extract value from their data, often on the fly. Reliability and value are more important than academic rigor to this data science community.

If You Can't Measure it, You Can't Manage It

You may be familiar with this phrase or its cousin: if you can't measure it, you can't fix it. The two come up frequently in the context of Total Quality Management or Continuous Improvement programs in organizations. The flip side of these expressions is the fact that if you do measure something and make the measurements available to decision-makers, the something that you measure is likely to change.

Toyota found that placing a real-time gas-mileage gauge on the dashboard got people thinking about their driving habits and how they relate to gas consumption. As a result, their gas mileage (the miles they drove per gallon of gas) improved.

In 2003, the Food and Drug Administration began requiring that food manufacturers include trans fat quantities on their food labels. In 2008, a study found that blood levels of trans fats in the population had dropped 58% since 2000 (reported in the Washington Post, February 9, 2012, A3).

Thus, the very act of measurement is, in itself, a change agent. Moreover, measurements of all sorts abound, so much so that the term Big Data came into vogue in 2011 to describe the huge quantities of data that organizations are now generating.

Big Data: If You Can Quantify and Harness It, You Can Use It

In 2010, a statistician from Target described how the company used customer transaction data to make educated guesses about whether customers were pregnant or not. On the strength of these guesses, Target sent out advertising flyers to likely prospects, centered around the needs of pregnant women.

How did Target use data to make those guesses? The key was data used to "train" a statistical model: data in which the outcome of interest (pregnant or not pregnant) was known in advance. Where did Target get such data? The "not pregnant" data was easy: the vast majority of customers were not pregnant, so the data on their purchases was easy to come by. The "pregnant" data came from a baby shower registry. Both datasets were quite large, containing lists of items purchased by thousands of customers.

Some clues are obvious: the purchase of a crib and baby clothes is a dead giveaway. But, from Target's perspective, by the time a customer purchased these obvious big-ticket items, it was too late: they had already chosen their shopping venue. Target wanted to reach customers earlier, before they decided where to do their shopping for the big day. For that, Target used statistical modeling to make use of nonobvious patterns in the data that distinguish pregnant from nonpregnant customers. One such clue was a shift in the pattern of supplement purchases, for example, a customer who was not buying supplements 60 days ago but is buying them now. Crafting a marketing campaign on the basis of educated guesses about whether a customer is pregnant aroused controversy for Target, needless to say.
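The idea of "training" a model on data with known outcomes can be sketched in a few lines. The excerpt does not say which model Target used, so the naive Bayes classifier below is only an illustrative stand-in, and the item names and baskets are invented toy data, not real purchase records.

```python
import math
from collections import Counter

def train(baskets, labels):
    """Fit a naive Bayes model on baskets (sets of items) with known labels."""
    classes = sorted(set(labels))
    vocab = sorted(set().union(*baskets))
    class_counts = Counter(labels)
    item_counts = {c: Counter() for c in classes}
    for basket, label in zip(baskets, labels):
        item_counts[label].update(set(basket))
    model = {"classes": classes, "vocab": vocab, "prior": {}, "p_item": {}}
    for c in classes:
        model["prior"][c] = class_counts[c] / len(labels)
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        model["p_item"][c] = {
            item: (item_counts[c][item] + 1) / (class_counts[c] + 2)
            for item in vocab
        }
    return model

def predict_proba(model, basket):
    """Return the estimated probability of each class for a new basket."""
    basket = set(basket)
    log_scores = {}
    for c in model["classes"]:
        score = math.log(model["prior"][c])
        for item in model["vocab"]:
            p = model["p_item"][c][item]
            score += math.log(p if item in basket else 1 - p)
        log_scores[c] = score
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Hypothetical training data: the labels are known in advance, as with a registry.
baskets = [
    {"supplements", "lotion"}, {"supplements"}, {"lotion", "supplements"},
    {"soda"}, {"chips", "soda"}, {"chips"}, {"soda", "supplements"},
    {"chips", "lotion"},
]
labels = ["pregnant"] * 3 + ["not pregnant"] * 5

model = train(baskets, labels)
likely = predict_proba(model, {"supplements", "lotion"})
unlikely = predict_proba(model, {"soda", "chips"})
```

With this toy data the first basket scores as far more likely to belong to the "pregnant" class than the second, which is the kind of educated guess the passage describes.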

Much of the book that follows deals with important issues that can determine whether data yields meaningful information or not:

The role that random chance plays in creating apparently interesting results or patterns in data.
How to design experiments and surveys to get useful and reliable information.
How to formulate simple statistical models to describe relationships between one variable and another.

Phantom Protection from Vitamin E

In 1993, researchers examining a database on nurses' health found that nurses who took vitamin E supplements had 30-40% fewer heart attacks than those who did not. These data fit with theories that antioxidants such as vitamins E and C could slow damaging processes within the body. Linus Pauling, winner of the Nobel Prize in Chemistry in 1954, was a major proponent of these theories. The Linus Pauling Institute at Oregon State University is still actively promoting the role of vitamin E and other nutritional supplements in inhibiting disease. These results provided a major boost to the dietary supplements industry. The only problem? The heart health benefits of vitamin E turned out to be illusory. A study completed in 2007 divided 14,641 male physicians randomly into four groups:

Take 400 IU of vitamin E every other day
Take 500 mg of vitamin C every day
Take both vitamin E and C
Take placebo.

Those who took vitamin E fared no better than those who did not take vitamin E. As the only difference between the two groups was whether or not they took vitamin E, if there were a vitamin E effect, it would have shown up. Several meta-analyses, which are consolidated reviews of the results of multiple published studies, have reached the same conclusion. One found that vitamin E at the above dosage might even increase mortality.
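Random assignment is what makes the groups comparable. A minimal sketch of one way to do it follows; the excerpt does not describe the study's actual procedure, so the shuffle-and-deal scheme, the function name, and the seed are assumptions for illustration only.

```python
import random

def randomize(participant_ids, arms, seed=None):
    """Shuffle participants, then deal them into the arms in turn."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    groups = {arm: [] for arm in arms}
    for position, pid in enumerate(ids):
        groups[arms[position % len(arms)]].append(pid)
    return groups

arms = ["vitamin E", "vitamin C", "vitamin E and C", "placebo"]
# Participant IDs 0..14640 stand in for the 14,641 physicians.
groups = randomize(range(14641), arms, seed=1)
sizes = {arm: len(members) for arm, members in groups.items()}
```

Because chance alone decides who lands in which arm, any other characteristic of the participants should be spread roughly evenly across the four groups.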

What made the researchers in 1993 think that they had found a link between vitamin E and disease inhibition? After reviewing a vast quantity of data, researchers thought that they saw an interesting association. In retrospect, with the benefit of a well-designed experiment, it appears that this association was merely a chance coincidence. Unfortunately, coincidences happen all the time in life. In fact, they happen to a greater extent than we think possible.
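A small simulation shows how easily chance produces striking patterns when many candidate factors are examined. Everything here is synthetic and assumed for illustration: the number of people, the number of factors, and the event rate are invented, and by construction none of the factors has any effect on the outcome.

```python
import random

def apparent_reduction(outcomes, takers):
    """Relative reduction in event rate among takers versus non-takers."""
    with_factor = [o for o, t in zip(outcomes, takers) if t]
    without_factor = [o for o, t in zip(outcomes, takers) if not t]
    rate_with = sum(with_factor) / len(with_factor)
    rate_without = sum(without_factor) / len(without_factor)
    return (rate_without - rate_with) / rate_without

def scan_null_factors(n_people=2000, n_factors=200, event_rate=0.1, seed=0):
    """Compare event rates for many factors that are unrelated to the outcome."""
    rng = random.Random(seed)
    outcomes = [rng.random() < event_rate for _ in range(n_people)]
    reductions = []
    for _ in range(n_factors):
        # Each person "takes" the factor with probability 0.5, independent of outcome.
        takers = [rng.random() < 0.5 for _ in range(n_people)]
        reductions.append(apparent_reduction(outcomes, takers))
    return sorted(reductions, reverse=True)

reductions = scan_null_factors()
best, worst = reductions[0], reductions[-1]
```

Sorting the results puts the most "protective" factor first. Reporting only that one, without mentioning the other 199 comparisons, would make a chance fluctuation look like a finding.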

Statistician, Heal Thyself

In 1993, Mathsoft Corp., the developer of Mathcad mathematical software, acquired StatSci, the developer of S-PLUS statistical software, predecessor to the open-source R software. Mathcad was an affordable tool popular with engineers: prices were in the hundreds of dollars, and the number of users was in the hundreds of thousands. S-PLUS was a high-end graphical and statistical tool used primarily by statisticians: prices were in the thousands of dollars, and the number of users was in the thousands.

In an attempt to boost revenues, Mathsoft turned to an established marketing principle: cross-selling, that is, trying to convince the people who bought product A to buy product B. With the acquisition of a highly regarded niche product, S-PLUS, and an existing large customer base for Mathcad, Mathsoft decided that the logical thing to do would be to ramp up S-PLUS sales via direct mail to its installed Mathcad user base. It also decided to purchase lists of similar prospective customers for both Mathcad and S-PLUS.

This major mailing program boosted revenues, but it boosted expenses even more. The company lost over $13 million in 1993 and 1994 combined, significant numbers for a company that had only $11 million in revenue in 1992.

What Happened?

In retrospect, it was clear that the mailings were not well targeted. The costs of the unopened mail exceeded the revenue from the few recipients who did respond. In particular, Mathcad users turned out to be unlikely users of S-PLUS. The huge losses could have been avoided through the use of two common statistical techniques:

Doing a test mailing to the various lists being considered to (a) determine whether the list is productive and (b) test different headlines, copy, pricing, and so on, to see what works best.
Using predictive modeling techniques to identify which names on a list are most likely to turn into customers.
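The first technique, a test mailing, can be evaluated with the bootstrap. The sketch below is illustrative only: the mailing size, response count, cost per piece, and profit per sale are hypothetical figures, not Mathsoft's, and the percentile interval is just one simple way to express the uncertainty.

```python
import random

def bootstrap_rate_interval(responses, n_resamples=1000, level=0.90, seed=0):
    """Percentile bootstrap interval for a response rate (responses are 0/1)."""
    rng = random.Random(seed)
    n = len(responses)
    # Resample the mailing with replacement and record each resample's rate.
    rates = sorted(
        sum(rng.choices(responses, k=n)) / n for _ in range(n_resamples)
    )
    alpha = 1 - level
    lo_idx = int(round(alpha / 2 * n_resamples))
    hi_idx = int(round((1 - alpha / 2) * n_resamples)) - 1
    return rates[lo_idx], rates[hi_idx]

# Hypothetical test mailing: 1,000 pieces sent, 15 responses.
responses = [1] * 15 + [0] * 985
low, high = bootstrap_rate_interval(responses)

# Hypothetical economics: the list pays off only above the break-even rate.
cost_per_piece = 1.00
profit_per_sale = 50.00
break_even_rate = cost_per_piece / profit_per_sale
worth_full_mailing = low > break_even_rate
```

Requiring the lower end of the interval to clear the break-even rate is a deliberately cautious rule; a list whose interval merely straddles break-even might deserve a larger test before a full mailing.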

Identifying Terrorists in Airports

Since the September 11, 2001 Al Qaeda attacks in the United States and subsequent attacks elsewhere, security screening programs at airports have become a major undertaking, costing billions of dollars per year in the United States alone. Most of these resources are consumed by an exhaustive screening process. All passengers and their tickets are reviewed, their baggage is screened, and individuals pass through detectors of varying sophistication. An individual and his or her bag can only receive a limited amount of attention in an exhaustive screening process. The process is largely the same for each individual. Potential terrorists can see the process and its workings in detail and identify its weaknesses.

To improve the effectiveness of the system, security officials have studied ways of focusing more concentrated attention on a small number of travelers. In the years after the attacks, one technique enhanced the screening for a limited number of randomly selected travelers. Although it adds some uncertainty to the process, which acts as a deterrent to attackers, random selection does nothing to focus attention on high-risk individuals.

Determining who is of high risk is, of course, the problem. How do you know who the high-risk passengers are?

One method is passenger profiling: specifying some guidelines about what passenger characteristics merit special attention. These characteristics were determined by a reasoned, logical approach. For example, purchasing a...
