Statistics for Big Data For Dummies

Alan Anderson, David Semmelroth (Authors)

Wiley (Publisher)

Published on 11 August 2015

384 pages

E-Book

ePUB with Adobe-DRM

978-1-118-94002-0 (ISBN)

€15.99 incl. 7% VAT

E-Book Single Licence

Available for download

Description

The fast and easy way to make sense of statistics for big data

Does the subject of data analysis make you dizzy? You've come to the right place! Statistics For Big Data For Dummies breaks this often-overwhelming subject down into easily digestible parts, offering new and aspiring data analysts the foundation they need to be successful in the field. Inside, you'll find an easy-to-follow introduction to exploratory data analysis, the lowdown on collecting, cleaning, and organizing data, everything you need to know about interpreting data using common software and programming languages, plain-English explanations of how to make sense of data in the real world, and much more.

Data has never been easier to come by, and the tools students and professionals need to enter the world of big data are based on applied statistics. While the word "statistics" alone can evoke feelings of anxiety in even the most confident student or professional, it doesn't have to. Written in the familiar and friendly tone that has defined the For Dummies brand for more than twenty years, Statistics For Big Data For Dummies takes the intimidation out of the subject, offering clear explanations and tons of step-by-step instruction to help you make sense of data mining, without losing your cool.

* Helps you to identify valid, useful, and understandable patterns in data
* Provides guidance on extracting previously unknown information from large databases
* Shows you how to discover patterns available in big data
* Gives you access to the latest tools and techniques for working in big data

If you're a student enrolled in a related Applied Statistics course or a professional looking to expand your skillset, Statistics For Big Data For Dummies gives you access to everything you need to succeed.

Content

Introduction

Part I: Introducing Big Data Statistics
Chapter 1: What Is Big Data and What Do You Do With It?
Chapter 2: Characteristics of Big Data: The Three Vs
Chapter 3: Using Big Data: The Hot Applications
Chapter 4: Understanding Probabilities
Chapter 5: Basic Statistical Ideas

Part II: Preparing and Cleaning Data
Chapter 6: Dirty Work: Preparing Your Data for Analysis
Chapter 7: Figuring the Format: Important Computer File Formats
Chapter 8: Checking Assumptions: Testing for Normality
Chapter 9: Dealing with Missing or Incomplete Data
Chapter 10: Sending Out a Posse: Searching for Outliers

Part III: Exploratory Data Analysis (EDA)
Chapter 11: An Overview of Exploratory Data Analysis (EDA)
Chapter 12: A Plot to Get Graphical: Graphical Techniques
Chapter 13: You're the Only Variable for Me: Univariate Statistical Techniques
Chapter 14: To All the Variables We've Encountered: Multivariate Statistical Techniques
Chapter 15: Regression Analysis
Chapter 16: When You've Got the Time: Time Series Analysis

Part IV: Big Data Applications
Chapter 17: Using Your Crystal Ball: Forecasting with Big Data
Chapter 18: Crunching Numbers: Performing Statistical Analysis on Your Computer
Chapter 19: Seeking Free Sources of Financial Data

Part V: The Part of Tens
Chapter 20: Ten (or So) Best Practices in Data Preparation
Chapter 21: Ten (or So) Questions Answered by Exploratory Data Analysis (EDA)

Index

Chapter 1

What Is Big Data and What Do You Do with It?

In This Chapter

Understanding what big data is all about

Seeing how data may be analyzed using Exploratory Data Analysis (EDA)

Gaining insight into some of the key statistical techniques used to analyze big data

Big data refers to sets of data that are far too massive to be handled with traditional hardware. Big data is also problematic for software such as database systems, statistical packages, and so forth. In recent years, data-gathering capabilities have experienced explosive growth, so that storing and analyzing the resulting data has become progressively more challenging.

Many fields have been affected by the increasing availability of data, including finance, marketing, and e-commerce. Big data has also revolutionized more traditional fields such as law and medicine. Of course, big data is gathered on a massive scale by search engines such as Google and social media sites such as Facebook. These developments have led to the evolution of an entirely new profession: the data scientist, someone who can combine the fields of statistics, math, computer science, and engineering with knowledge of a specific application.

This chapter introduces several key concepts that are discussed throughout the book. These include the characteristics of big data, applications of big data, key statistical tools for analyzing big data, and forecasting techniques.

Characteristics of Big Data

The three factors that distinguish big data from other types of data are volume, velocity, and variety.

Clearly, with big data, the volume is massive. In fact, new terminology must be used to describe the size of these datasets. For example, one petabyte of data consists of 10^15 bytes of data. That's 1,000 trillion bytes!

A byte is a single unit of storage in a computer's memory. A byte is used to represent a single number, character, or symbol. A byte consists of eight bits, each of which is either a 0 or a 1.
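The size arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the constant and function names are assumptions made for this example, and the petabyte is taken in its decimal sense (1,000 trillion bytes), as in the text.

```python
# Illustrative arithmetic only; all names here are assumptions for this sketch.
BITS_PER_BYTE = 8
BYTES_PER_PETABYTE = 10 ** 15  # decimal petabyte: 1,000 trillion bytes


def petabytes_to_bytes(petabytes):
    """Convert a size in petabytes to a size in bytes."""
    return petabytes * BYTES_PER_PETABYTE


def distinct_values_per_byte():
    """Number of distinct bit patterns that eight 0/1 bits can form."""
    return 2 ** BITS_PER_BYTE


if __name__ == "__main__":
    print(petabytes_to_bytes(1))       # 1000000000000000
    print(distinct_values_per_byte())  # 256
```

Because each of the eight bits is either a 0 or a 1, a single byte can take 2^8 = 256 different values, which is why it can stand for one small number, character, or symbol.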

Velocity refers to the speed at which data is gathered. Big datasets consist of data that's continuously gathered at very high speeds. For example, it has been estimated that Twitter users generate more than a quarter of a million tweets every minute. This requires a massive amount of storage space as well as real-time processing of the data.

Variety refers to the fact that the contents of a big dataset may consist of a number of different formats, including spreadsheets, videos, music clips, email messages, and so on. Storing a huge quantity of these incompatible types is one of the major challenges of big data.

Chapter 2 covers these characteristics in more detail.

Exploratory Data Analysis (EDA)

Before you apply statistical techniques to a dataset, it's important to examine the data to understand its basic properties. You can use a series of techniques that are collectively known as Exploratory Data Analysis (EDA) to analyze a dataset. EDA helps ensure that you choose the correct statistical techniques to analyze and forecast the data. The two basic types of EDA techniques are graphical techniques and quantitative techniques.

Graphical EDA techniques

Graphical EDA techniques show the key properties of a dataset in a convenient format. It's often easier to understand the properties of a variable and the relationships between variables by looking at graphs rather than looking at the raw data. You can use several graphical techniques, depending on the type of data being analyzed. Chapters 11 and 12 explain how to create and use the following:

Box plots
Histograms
Normal probability plots
Scatter plots
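
As a rough illustration of what two of these graphs summarize, the sketch below computes the five numbers that a box plot displays and prints a simple text histogram. It uses only the Python standard library rather than a plotting package; the function names, the sample data, and the "inclusive" quartile convention are assumptions made for this example, not the book's own method.

```python
import statistics
from collections import Counter


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot is drawn from."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def histogram_counts(values, bin_width):
    """Count the values falling in each bin [start, start + bin_width)."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def print_text_histogram(values, bin_width):
    """Print one row of '#' marks per bin, a crude stand-in for a histogram."""
    for start, count in histogram_counts(values, bin_width).items():
        print(f"{start:>4}-{start + bin_width:<4} {'#' * count}")


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data for illustration
    print(five_number_summary(sample))
    print_text_histogram(sample, bin_width=5)
```

Other quartile conventions exist, so a statistical package may report slightly different values for Q1 and Q3 on the same data.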

Quantitative EDA techniques

Quantitative EDA techniques provide a more rigorous method of determining the key properties of a dataset. Two of the most important of these techniques are

Interval estimation (discussed in Chapter 11).
Hypothesis testing (introduced in Chapter 5).

Interval estimates are used to create a range of values within which a variable is likely to fall. Hypothesis testing is used to test various propositions about a dataset, such as

The mean value of the dataset.
The standard deviation of the dataset.
The probability distribution the dataset follows.

Hypothesis testing is a core technique in statistics and is used throughout the chapters in Part III of this book.
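
To make the two quantitative techniques concrete, here is a minimal sketch of an interval estimate for a mean and a two-sided test of a hypothesized mean. It assumes a large sample and uses the normal approximation throughout; for small samples a Student's t-based version would be more appropriate. The function names and the 95% default are assumptions for this example, not taken from the book.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(values, confidence=0.95):
    """Large-sample interval estimate for the population mean."""
    n = len(values)
    mean = statistics.fmean(values)
    std_error = statistics.stdev(values) / math.sqrt(n)
    # Critical value of the standard normal for the chosen confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_error, mean + z * std_error


def z_test_mean(values, hypothesized_mean):
    """Two-sided z-test of H0: population mean equals hypothesized_mean.

    Returns (z statistic, p-value).
    """
    n = len(values)
    std_error = statistics.stdev(values) / math.sqrt(n)
    z = (statistics.fmean(values) - hypothesized_mean) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    data = list(range(1, 101))  # made-up data for illustration
    print(mean_confidence_interval(data))
    print(z_test_mean(data, hypothesized_mean=45))
```

A small p-value suggests the data are inconsistent with the hypothesized mean; a p-value near 1 means the sample mean is very close to it.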

Statistical Analysis of Big Data

Gathering and storing massive quantities of data is a major challenge, but ultimately the biggest and most important challenge of big data is putting it to good use.

For example, a massive quantity of data can be helpful to a company's marketing research department only if it can identify the key drivers of the demand for the company's products. Political polling firms have access to massive amounts of demographic data about voters; this information must be analyzed intensively to find the key factors that can lead to a successful political campaign. A hedge fund can develop trading strategies from massive quantities of financial data by finding obscure patterns in the data that can be turned into profitable strategies.

Many statistical techniques can be used to analyze data to find useful patterns:

Probability distributions are introduced in Chapter 4 and explored at greater length in Chapter 13.
Regression analysis is the main topic of Chapter 15.
Time series analysis is the primary focus of Chapter 16.
Forecasting techniques are discussed in Chapter 17.

Probability distributions

You use a probability distribution to compute the probabilities associated with the elements of a dataset. The following distributions are described and applied in this book:

Binomial distribution: You would use the binomial distribution to analyze variables that can assume only one of two values. For example, you could determine the probability that a given percentage of members at a sports club are left-handed. See Chapter 4 for details.
Poisson distribution: You would use the Poisson distribution to describe the likelihood of a given number of events occurring over an interval of time. For example, it could be used to describe the probability of a specified number of hits on a website over the coming hour. See Chapter 13 for details.
Normal distribution: The normal distribution is the most widely used probability distribution in most disciplines, including economics, finance, marketing, biology, psychology, and many others. One of the characteristic features of the normal distribution is symmetry: the probability of a variable being a given distance below the mean of the distribution equals the probability of it being the same distance above the mean. For example, if the mean height of all men in the United States is 70 inches, and heights are normally distributed, a randomly chosen man is as likely to be between 68 and 70 inches tall as he is to be between 70 and 72 inches tall. See Chapter 4 and the chapters in Parts III and IV for details.

The normal distribution works well with many applications. For example, it's often used in the field of finance to describe the returns to financial assets. Due to its ease of interpretation and implementation, the normal distribution is sometimes used even when the assumption of normality is only approximately correct.
The Student's t-distribution: The Student's t-distribution is similar to the normal distribution, but with the Student's t-distribution, extremely small or extremely large values are much more likely to occur. This distribution is often used in situations where a variable exhibits too much variation to be consistent with the normal distribution. This is true when the properties of small samples are being analyzed. With small samples, the variation among samples is likely to be quite considerable, so the normal distribution shouldn't be used to describe their properties. See Chapter 13 for details.

Note: The Student's t-distribution was developed by W.S. Gosset while employed at the Guinness brewing company. He was attempting to describe the properties of small sample means.
The chi-square distribution: The chi-square distribution is appropriate for several types of applications. For example, you can use it to determine whether a population follows a particular probability distribution. You can also use it to test whether the variance of a population equals a specified value, and to test for the independence of two datasets. See Chapter 13 for details.
The F-distribution: The F-distribution is derived from the chi-square distribution. You use it to test whether the variances of two populations equal each other. The F-distribution is also useful in applications such as regression analysis (covered next). See Chapter 14 for details.
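
The first three distributions in the list above can be sketched with the Python standard library. The numbers used here (20 club members, a 10% chance of being left-handed, an average of 3 website hits per hour, and a standard deviation of 3 inches for height) are hypothetical values chosen only for illustration; only the 70-inch mean comes from the text. The function names are likewise assumptions for this example.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent two-outcome trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """Probability of exactly k events in an interval with mean count `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def normal_interval_probability(low, high, mean, std_dev):
    """Probability that a normal variable falls between low and high."""
    dist = NormalDist(mean, std_dev)
    return dist.cdf(high) - dist.cdf(low)


if __name__ == "__main__":
    # Hypothetical: 20 members, each left-handed with probability 0.10.
    print(binomial_pmf(k=2, n=20, p=0.10))
    # Hypothetical: a website averaging 3 hits per hour; chance of exactly 5.
    print(poisson_pmf(k=5, rate=3))
    # Symmetry: mean 70 inches (from the text), std dev 3 inches (assumed).
    below = normal_interval_probability(68, 70, mean=70, std_dev=3)
    above = normal_interval_probability(70, 72, mean=70, std_dev=3)
    print(below, above)  # the two probabilities should match
```

The last two lines illustrate the symmetry property: the probability of being up to 2 inches below the mean equals the probability of being up to 2 inches above it, whatever standard deviation is assumed.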

Regression analysis

Regression analysis is used to...
