Statistical Hypothesis Testing with SAS and R

Brand: Wiley
Price: 71.99 EUR
Availability: OnlineOnly

Dirk Taeger, Sonja Kuhnt (Authors)

Wiley (Publisher)

Published on 7 January 2014

312 pages

E-Book

PDF with Adobe-DRM

978-1-118-76260-8 (ISBN)

€71.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Preface

Part I INTRODUCTION
  1 Statistical hypothesis testing
    1.1 Theory of statistical hypothesis testing
    1.2 Testing statistical hypothesis with SAS and R
    1.3 Presentation of the statistical tests
    References

Part II NORMAL DISTRIBUTION
  2 Tests on the mean
    2.1 One-sample tests
    2.2 Two-sample tests
    References
  3 Tests on the variance
    3.1 One-sample tests
    3.2 Two-sample tests
    References

Part III BINOMIAL DISTRIBUTION
  4 Tests on proportions
    4.1 One-sample tests
    4.2 Two-sample tests
    4.3 K-sample tests
    References

Part IV OTHER DISTRIBUTIONS
  5 Poisson distribution
    5.1 Tests on the Poisson parameter
    References
  6 Exponential distribution
    6.1 Test on the parameter of an exponential distribution
    Reference

Part V CORRELATION
  7 Tests on association
    7.1 One-sample tests
    7.2 Two-sample tests
    References

Part VI NONPARAMETRIC TESTS
  8 Tests on location
    8.1 One-sample tests
    8.2 Two-sample tests
    8.3 K-sample tests
    References
  9 Tests on scale difference
    9.1 Two-sample tests
    References
  10 Other tests
    10.1 Two-sample tests
    References

Part VII GOODNESS-OF-FIT TESTS
  11 Tests on normality
    11.1 Tests based on the EDF
    11.2 Tests not based on the EDF
    References
  12 Tests on other distributions
    12.1 Tests based on the EDF
    12.2 Tests not based on the EDF
    References

Part VIII TESTS ON RANDOMNESS
  13 Tests on randomness
    13.1 Run tests
    13.2 Successive difference tests
    References

Part IX TESTS ON CONTINGENCY TABLES
  14 Tests on contingency tables
    14.1 Tests on independence and homogeneity
    14.2 Tests on agreement and symmetry
    14.3 Test on risk measures
    References

Part X TESTS ON OUTLIERS
  15 Tests on outliers
    15.1 Outlier tests for Gaussian null distribution
    15.2 Outlier tests for other null distributions
    References

Part XI TESTS IN REGRESSION ANALYSIS
  16 Tests in regression analysis
    16.1 Simple linear regression
    16.2 Multiple linear regression
    References
  17 Tests in variance analysis
    17.1 Analysis of variance
    17.2 Tests for homogeneity of variances
    References

Appendix A Datasets
Appendix B Tables
Glossary
Index

Chapter 1 Statistical hypothesis testing

1.1 Theory of statistical hypothesis testing

Hypothesis testing is a key tool in statistical inference next to point estimation and confidence sets. All three concepts make an inference about a population based on a sample taken from it. Hypothesis testing aims at a decision on whether or not a hypothesis on the nature of the population is supported by the sample.

In the following we briefly run through the steps of a statistical test procedure and introduce the notation used throughout this book. For a detailed mathematical explanation, please refer to the book by Lehmann (1997).

We denote a sample of size $n$ by $x_1, \ldots, x_n$, where the $x_i$ are observations of independent and identically distributed random variables $X_i$, $i = 1, \ldots, n$. Usually some further assumptions are needed concerning the nature of the mechanism generating the sample. These can be rather general assumptions, such as a symmetric continuous distribution. Often a parametric distribution is assumed with only the parameter values unknown, for example, the Gaussian distribution with the mean, the variance, or both unknown. In this case hypothesis tests deal with statements on the unknown population parameters. We use this situation to illustrate our general discussion.

Each of the statistical tests presented in the following chapters is introduced by a verbal description of the type of conjecture to be decided upon, together with the assumptions made. Next the test problem is formalized by the null hypothesis $H_0$ and the alternative hypothesis $H_1$. If a statement on the population parameters $\theta$ is of interest, the parameter space $\Theta$ is often partitioned into disjoint sets $\Theta_0$ and $\Theta_1$ with $\Theta_0 \cup \Theta_1 = \Theta$, corresponding to $H_0$ and $H_1$, respectively.
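
As an illustration of such a partition (an assumed example, not one taken from the text above), consider a one-sided test on the mean $\mu$ of a Gaussian distribution with known variance, where $\mu_0$ denotes a hypothetical reference value:

```latex
\begin{align*}
H_0 &: \mu \le \mu_0 \quad\text{versus}\quad H_1 : \mu > \mu_0, \\
\Theta &= \mathbb{R}, \qquad
\Theta_0 = (-\infty, \mu_0], \qquad
\Theta_1 = (\mu_0, \infty), \\
\Theta_0 \cup \Theta_1 &= \Theta, \qquad \Theta_0 \cap \Theta_1 = \emptyset .
\end{align*}
```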

The next building block of a statistical test is the test statistic, a function of the random sample. This function fulfills two criteria. First, its value must provide insight into whether or not the null hypothesis might be true. Second, the distribution of the test statistic must be known, given that the null hypothesis is true. Table 1.1 shows the four possible outcomes of a statistical test. In two of the cases the result of the test is a correct decision: a true null hypothesis is not rejected, or a false null hypothesis is rejected. If the null hypothesis is true but is rejected as a result of the test, a type I error occurs. In the opposite situation, where $H_1$ is true in nature but the test does not reject the null hypothesis, a type II error occurs.

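Table 1.1 Possible results in statistical testing.

                          $H_0$ is true       $H_1$ is true
$H_0$ is not rejected     Correct decision    Type II error
$H_0$ is rejected         Type I error        Correct decision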

Generally, unless the sample size or the hypotheses are changed, a decrease in the probability of a type I error causes an increase in the probability of a type II error and vice versa. The significance level $\alpha$ fixes the maximal probability of a type I error, and the critical region of the test is chosen according to this condition. If the observed value of the test statistic lies in the critical region, the null hypothesis is rejected. Hence, the error probability is under control when a decision is made against $H_0$ but not when the decision is for $H_0$, which needs to be kept in mind when drawing conclusions from test results. Where possible, the researcher's conjecture is therefore formulated as the alternative hypothesis, because it is primarily the type I error that is controlled. However, in goodness-of-fit tests one is forced to formulate the researcher's hypothesis, that is, the specific distribution of interest, as the null hypothesis, as it is otherwise usually infeasible to derive the distribution of the test statistic.
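
The decision rule based on a critical region can be sketched in R as follows. This is a minimal illustration and not an example from the book: it assumes a one-sided Gauss (z) test with known standard deviation, and the sample `x` as well as the values of `mu0`, `sigma`, and `alpha` are made up for the purpose.

```r
# Hypothetical example: one-sided Gauss (z) test of H0: mu <= mu0 vs H1: mu > mu0
# with known standard deviation. All numbers below are illustrative assumptions.
x     <- c(5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.7)  # made-up sample
mu0   <- 5      # value of the mean under the null hypothesis
sigma <- 0.4    # standard deviation, assumed to be known
alpha <- 0.05   # significance level

n <- length(x)
z <- (mean(x) - mu0) / (sigma / sqrt(n))  # observed value of the test statistic
z_crit <- qnorm(1 - alpha)                # critical region: all values z > z_crit
reject_h0 <- z > z_crit                   # TRUE if z lies in the critical region
```

Only the choice of `alpha` determines `z_crit`; the data enter the decision solely through the observed test statistic `z`.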

The power function measures the quality of a test. It yields the probability of rejecting the null hypothesis for a given true parameter value $\theta$. The test with the greatest power among all tests with a given significance level is called the most powerful test.
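
For a concrete picture, the power function of a one-sided Gauss (z) test can be written down in closed form. The sketch below assumes that setting; the function name `power_z` and its default values are hypothetical and chosen only for illustration.

```r
# Power function of the one-sided Gauss (z) test of H0: mu <= mu0 vs H1: mu > mu0.
# mu0, sigma, n and alpha are illustrative assumptions, not values from the text.
power_z <- function(mu, mu0 = 5, sigma = 0.4, n = 8, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha)                 # boundary of the critical region
  shift  <- (mu - mu0) / (sigma / sqrt(n))   # standardized distance of mu from mu0
  pnorm(z_crit - shift, lower.tail = FALSE)  # P(reject H0 | true mean is mu)
}

mu_grid     <- c(5, 5.2, 5.4)
power_at_05 <- power_z(mu_grid)                # power grows as mu moves away from mu0
beta_at_05  <- 1 - power_z(5.2, alpha = 0.05)  # type II error probability at mu = 5.2
beta_at_01  <- 1 - power_z(5.2, alpha = 0.01)  # smaller alpha, larger type II error
```

At the boundary `mu = mu0` the power equals the significance level, and comparing `beta_at_01` with `beta_at_05` shows the trade-off between the two error probabilities for a fixed sample size.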

Traditionally a pre-specified significance level of $\alpha = 0.05$ or $\alpha = 0.01$ is selected. However, there is no reason why a different value should not be chosen.

Up to here we are in the context of the Neyman–Pearson test theory. Most statistical computer programs do not report whether the calculated test statistic lies within the critical region or not. Instead the p-value (probability value) is given. This is the probability, calculated under the assumption that $H_0$ is true, of obtaining the observed value of the test statistic or a value that is more extreme in the direction of the alternative hypothesis. If the p-value is smaller than $\alpha$, $H_0$ is rejected; otherwise $H_0$ is not rejected.
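
The following sketch shows how the p-value rule and the critical-region rule lead to the same decision. It uses the base R function `t.test()` for a one-sided, one-sample t-test; the data vector and the hypothesized mean of 5 are made-up values for illustration.

```r
# Hypothetical data; tests H0: mu <= 5 against H1: mu > 5 with a one-sample t-test.
x     <- c(5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.7)
alpha <- 0.05

res <- t.test(x, mu = 5, alternative = "greater")
p_value     <- res$p.value        # P(T >= observed t) when H0 is true
reject_by_p <- p_value < alpha    # decision based on the p-value

# Same decision via the critical region of the t-distribution with n - 1 df
t_obs       <- unname(res$statistic)
t_crit      <- qt(1 - alpha, df = length(x) - 1)
reject_by_t <- t_obs > t_crit
```

Both `reject_by_p` and `reject_by_t` encode the same test decision; the p-value additionally shows how far the observed statistic lies inside or outside the critical region.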

As already mentioned in the introduction, this is the common approach. For further reading on the differences between the two approaches, please refer to Goodman (1994), Hubbard and Bayarri (2003), Johnstone (1987), and Lehmann (1993).

1.2 Testing statistical hypothesis with SAS and R

Testing statistical hypotheses with SAS and R is very convenient, because many tests are already integrated in these software packages. In SAS, tests are invoked via procedures, while R uses functions. Although many test problems are handled in this way, situations may occur where a SAS procedure or an R function is not available. The reasons are manifold. The SAS Institute decides which statistical tests to include in SAS. Even if a newly developed test is accepted for inclusion in SAS, it takes some time to develop a new procedure or to incorporate the test in an existing SAS procedure. If a test is not implemented in a SAS procedure or in the R standard packages, it is likely to be found as a SAS macro or in R user packages, which are available through the World Wide Web.

However, in this book we have refrained from presenting tests from SAS macros or R user packages for several reasons. We do not know how long macros, program code, or user packages will be supported by their programmers and will therefore remain available for newer versions of SAS or R. In addition, it is not possible to verify whether the code is correct. If a statistical test is not implemented as a procedure in the SAS software or in the R standard packages, we provide an algorithm with short SAS and R code to circumvent these problems. All presented statistical tests are accompanied by an example of their use on a given dataset, so it is easy to retrace the example and to translate the code to your own datasets. Sometimes more than one SAS procedure or R function is available to perform a statistical test; we present only one of them.

1.2.1 Programming philosophy of SAS and R

Testing statistical hypotheses in SAS and in R is not the same. While R is a matrix-oriented language, SAS follows a different philosophy (except for SAS/IML). With a matrix-oriented language some calculations are easier. For instance, the average of a few observations, for example, the ages 1, 4, 2, and 5 of four children in a family, can be calculated with one line of code in R by applying the function mean() to the vector containing the values, c(1,4,2,5).

mean(c(1,4,2,5))

Here the numeric vector of data values to be analyzed is inserted directly in the R function. However, it is also possible to call data from a previously defined object, for example, a data frame:

children <- data.frame(age = c(1, 4, 2, 5))
mean(children$age)

In SAS a little more effort is necessary due to the required division into data and proc steps.

data children;
   input age;
   datalines;
1
4
2
5
;
run;

proc means;
   var age;
run;

The dataset children holds the variable age with the observed values 1, 4, 2, and 5. The SAS procedure proc means calculates the mean value. This type of programming philosophy need not be a disadvantage. It can save a lot of time, because the SAS procedures are very powerful and incorporate many statistical calculations in one go.

We assume that the reader is familiar with the basic programming features of SAS or R, such as data input and output, and only remark on some important points related to conducting statistical tests. Concerning the data format, one row per observation and one column per variable are usually suitable. However, in some cases it may be necessary to reorganize the dataset for a test procedure. We accompany our examples with small datasets (see Appendix A), so that it is easy to see how the data need to be arranged for the specific test.
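
As a small illustration of such a reorganization in R, the sketch below converts a dataset with one column per group into one row per observation with a grouping variable. The data frame `wide` and its column names are hypothetical and not one of the datasets from Appendix A; `stack()` and the formula interface of `t.test()` are part of base R.

```r
# Made-up data: one column per group ("wide" layout)
wide <- data.frame(group_a = c(1, 4, 2, 5),
                   group_b = c(3, 6, 2, 7))

# Reorganize to one row per observation: a value column and a grouping column
long <- stack(wide)                 # returns the columns 'values' and 'ind'
names(long) <- c("value", "group")

# Many test functions expect this layout, for example a two-sample t-test
res <- t.test(value ~ group, data = long)
```

After the call to `stack()`, each of the eight observations occupies its own row, and the factor `group` records which original column it came from.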

In SAS most statistical tests are performed with procedures, which usually follow the schema:

proc proc-name data=dataset-name options;
   var variable-names options;
   options;
run;

The data= option identifies the dataset to be analyzed. If it is missing, the most recently created dataset is used. In some procedures it is necessary to set options that define the statistical test, for example, the value to test against or whether the test is one-sided or two-sided. The var statement is followed by the variables on which the test is to be performed. Sometimes further options can be stated in separate statements, for instance to request an exact test. Note that some procedures differ from this general set-up. The procedure proc freq, for example, has no var statement but a table statement. Occasionally the statement class class-variable is needed to indicate a grouping variable that assigns each observation to a specific group. As the options of a procedure can be numerous and not all of them may be needed for the test in question, we restrict our exposition to the indispensable options. The same applies to the output we present for the examples.

Conducting a statistical test in the program R usually only requires one line of code. The common layout of R functions...
