Mark van der Loo and Edwin de Jonge, Department of Statistical Methods, Statistics Netherlands, The Netherlands
Foreword
About the Companion Website
1 Data Cleaning
1.1 The Statistical Value Chain
1.1.1 Raw Data
1.1.2 Input Data
1.1.3 Valid Data
1.1.4 Statistics
1.1.5 Output
1.2 Notation and Conventions Used in this Book
2 A Brief Introduction to R
2.1 R on the Command Line
2.1.1 Getting Help and Learning R
2.2 Vectors
2.2.1 Computing with Vectors
2.2.2 Arrays and Matrices
2.3 Data Frames
2.3.1 The Formula-Data Interface
2.3.2 Selecting Rows and Columns; Boolean Operators
2.3.3 Selection with Indices
2.3.4 Data Frame Manipulation: The dplyr Package
2.4 Special Values
2.4.1 Missing Values
2.5 Getting Data into and out of R
2.5.1 File Paths in R
2.5.2 Formats Provided by Packages
2.5.3 Reading Data from a Database
2.5.4 Working with Data External to R
2.6 Functions
2.6.1 Using Functions
2.6.2 Writing Functions
2.7 Packages Used in this Book
3 Technical Representation of Data
3.1 Numeric Data
3.1.1 Integers
3.1.2 Integers in R
3.1.3 Real Numbers
3.1.4 Double Precision Numbers
3.1.5 The Concept of Machine Precision
3.1.6 Consequences of Working with Floating Point Numbers
3.1.7 Dealing with the Consequences
3.1.8 Numeric Data in R
3.2 Text Data
3.2.1 Terminology and Encodings
3.2.2 Unicode
3.2.3 Some Popular Encodings
3.2.4 Textual Data in R: Objects of Class Character
3.2.5 Encoding in R
3.2.6 Reading and Writing of Data with Non-Local Encoding
3.2.7 Detecting Encoding
3.2.8 Collation and Sorting
3.3 Times and Dates
3.3.1 AIT, UTC, and POSIX Seconds Since the Epoch
3.3.2 Time and Date Notation
3.3.3 Time and Date Storage in R
3.3.4 Time and Date Conversion in R
3.3.5 Leap Days, Time Zones, and Daylight Saving Times
3.4 Notes on Locale Settings
4 Data Structure
4.1 Introduction
4.2 Tabular Data
4.2.1 data.frame
4.2.2 Databases
4.2.3 dplyr
4.3 Matrix Data
4.4 Time Series
4.5 Graph Data
4.6 Web Data
4.6.1 Web Scraping
4.6.2 Web API
4.7 Other Data
4.8 Tidying Tabular Data
4.8.1 Variable Per Column
4.8.2 Single Observation Stored in Multiple Tables
5 Cleaning Text Data
5.1 Character Normalization
5.1.1 Encoding Conversion and Unicode Normalization
5.1.2 Character Conversion and Transliteration
5.2 Pattern Matching with Regular Expressions
5.2.1 Basic Regular Expressions
5.2.2 Practical Regular Expressions
5.2.3 Generating Regular Expressions in R
5.3 Common String Processing Tasks in R
5.4 Approximate Text Matching
5.4.1 String Metrics
5.4.2 String Metrics and Approximate Text Matching in R
6 Data Validation
6.1 Introduction
6.2 A First Look at the validate Package
6.2.1 Quick Checks with check_that
6.2.2 The Basic Workflow: validator and confront
6.2.3 A Little Background on validate and DSLs
6.3 Defining Data Validation
6.3.1 Formal Definition of Data Validation
6.3.2 Operations on Validation Functions
6.3.3 Validation and Missing Values
6.3.4 Structure of Validation Functions
6.3.5 Demarcating Validation Rules in validate
6.4 A Formal Typology of Data Validation Functions
6.4.1 A Closer Look at Measurement
6.4.2 Classification of Validation Rules
6.5 Validating Data with the validate Package
6.5.1 Validation Rules in the Console and the validator Object
6.5.2 Validating in the Pipeline
6.5.3 Raising Errors or Warnings
6.5.4 Tolerance for Testing Linear Equalities
6.5.5 Setting and Resetting Options
6.5.6 Importing and Exporting Validation Rules from and to File
6.5.7 Checking Variable Types and Metadata
6.5.8 Checking Value Ranges and Code Lists
6.5.9 Checking In-Record Consistency Rules
6.5.10 Checking Cross-Record Validation Rules
6.5.11 Checking Functional Dependencies
6.5.12 Cross-Dataset Validation
6.5.13 Macros, Variable Groups, Keys
6.5.14 Analyzing Output: validation Objects
6.5.15 Output Dimensionality and Output Selection
7 Localizing Errors in Data Records
7.1 Error Localization
7.2 Error Localization with R
7.2.1 The Errorlocate Package
7.3 Error Localization as MIP-Problem
7.3.1 Error Localization and Mixed-Integer Programming
7.3.2 Linear Restrictions
7.3.3 Categorical Restrictions
7.3.4 Mixed-Type Restrictions
7.4 Numerical Stability Issues
7.4.1 A Short Overview of MIP Solving
7.4.2 Scaling Numerical Records
7.4.3 Setting Numerical Threshold Values
7.5 Practical Issues
7.5.1 Setting Reliability Weights
7.5.2 Simplifying Conditional Validation Rules
7.6 Conclusion
8 Rule Set Maintenance and Simplification
8.1 Quality of Validation Rules
8.1.1 Completeness
8.1.2 Superfluous Rules and Infeasibility
8.2 Rules in the Language of Logic
8.2.1 Using Logic to Rewrite Rules
8.3 Rule Set Issues
8.3.1 Infeasible Rule Set
8.3.2 Fixed Value
8.3.3 Redundant Rule
8.3.4 Nonrelaxing Clause
8.3.5 Nonconstraining Clause
8.4 Detection and Simplification Procedure
8.4.1 Mixed-Integer Programming
8.4.2 Detecting Feasibility
8.4.3 Finding Rules Causing Infeasibility
8.4.4 Detecting Conflicting Rules
8.4.5 Detect Partial Infeasibility
8.4.6 Detect Fixed Values
8.4.7 Detect Nonrelaxing Clauses
8.4.8 Detecting Nonconstraining Clauses
8.4.9 Detecting Redundant Rules
8.5 Conclusion
9 Methods Based on Models for Domain Knowledge
9.1 Correction with Data Modifying Rules
9.1.1 Modifying Functions
9.1.2 A Class of Modifying Functions on Numerical Data
9.2 Rule-Based Correction with dcmodify
9.2.1 Reading Rules from File
9.2.2 Modifying Rule Syntax
9.2.3 Missing Values
9.2.4 Sequential and Sequence-Independent Execution
9.2.5 Options Settings Management
9.3 Deductive Correction
9.3.1 Correcting Typing Errors in Numeric Data
9.3.2 Deductive Imputation Using Linear Restrictions
10 Imputation and Adjustment
10.1 Missing Data
10.1.1 Missing Data Mechanisms
10.1.2 Visualizing and Testing for Patterns in Missing Data Using R
10.2 Model-Based Imputation
10.3 Model-Based Imputation in R
10.3.1 Specifying Imputation Methods with simputation
10.3.2 Linear Regression-Based Imputation
10.3.3 M-Estimation
10.3.4 Lasso, Ridge, and Elasticnet Regression
10.3.5 Classification and Regression Trees
10.3.6 Random Forest
10.4 Donor Imputation with R
10.4.1 Random and Sequential Hot Deck Imputation
10.4.2 k Nearest Neighbors and Predictive Mean Matching
10.5 Other Methods in the simputation Package
10.6 Imputation Based on the EM Algorithm
10.6.1 The EM Algorithm
10.6.2 EM Imputation Assuming the Multivariate Normal Distribution
10.7 Sampling Variance under Imputation
10.8 Multiple Imputations
10.8.1 Multiple Imputation Based on the EM Algorithm
10.8.2 The Amelia Package
10.8.3 Multivariate Imputation with Chained Equations (Mice)
10.8.4 Imputation with the mice Package
10.9 Analytic Approaches to Estimate Variance of Imputation
10.9.1 Imputation as Part of the Estimator
10.10 Choosing an Imputation Method
10.11 Constraint Value Adjustment
10.11.1 Formal Description
10.11.2 Application to Imputed Data
10.11.3 Adjusting Imputed Values with the rspa Package
11 Example: A Small Data-Cleaning System
11.1 Setup
11.1.1 Deterministic Methods
11.1.2 Error Localization
11.1.3 Imputation
11.1.4 Adjusting Imputed Data
11.2 Monitoring Changes in Data
11.2.1 Data Diff (Daff)
11.2.2 Summarizing Cell Changes
11.2.3 Summarizing Changes in Conformance to Validation Rules
11.2.4 Track Changes in Data Automatically with lumberjack
11.3 Integration and Automation
11.3.1 Using RScript
11.3.2 The docopt Package
11.3.3 Automated Data Cleaning
References
Index
The following sections provide an overview of some of R's core features. Besides an installation of R, we recommend installing one of the available integrated development environments (IDEs) for R. A good IDE not only offers a convenient interface to R and its help system but also helps you organize projects, code, and data.
To get the most out of this tutorial, try out the code examples for yourself, play around with them, and try to explain the results.
After starting R, or an IDE that connects to R, you have access to an interactive console, or command-line interface. The first use of it is to replace a pocket calculator. You can type in a calculation, and R will return the answer (preceded by a [1]).
1 + 1
## [1] 2
To get started, experiment with the following statements. Make sure to play around a little. All common mathematical functions are implemented in R.
1 + 1
3^2
sin(pi/2)
(1 + 4) * 3
exp(1)
sqrt(16)
To reuse results or values, you can store them with the <- operator.
x <- 10
y <- 20
R has now remembered the values 10 and 20 and named them x and y. In fact, x and y are now officially R objects. R is very flexible, and there are several other ways to define an R object. We may replace <- with =, we may replace the statement x <- 10 with 10 -> x, or we can be extra verbose and use assign("x",10). Of these alternatives, the = operator is the only one that is encountered with some frequency in practice. Since = is also used for named argument passing in function calls (see Section 2.6.1), we recommend using <- for assignment.
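As a small sketch of the assignment forms just described (the variable names are arbitrary and chosen only for this illustration), the following checks that all four forms produce the same object, and shows how = and <- behave differently inside a function call.

```r
# Four ways to bind the value 10 to a name; all create the same object
a <- 10
b = 10
10 -> d
assign("e", 10)
same_value <- identical(a, b) && identical(b, d) && identical(d, e)
print(same_value)

# Inside a function call, = passes a named argument and leaves x untouched
x <- 10
m1 <- mean(x = c(1, 2, 3))
x_after_eq <- x
print(x_after_eq)

# Inside a function call, <- assigns in the calling environment first
m2 <- mean(x <- c(4, 5, 6))
x_after_arrow <- x
print(x_after_arrow)
```

The second half illustrates why keeping <- for assignment and = for argument passing avoids ambiguity.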
The content of an R object can be printed simply by typing its name in the console.
x
## [1] 10
R objects can be stored for further computation, the results of which may again be stored.
x + y
## [1] 30
z <- x * y
q <- x^2*z
q
## [1] 20000
Finally, we note that values and variables can be compared using standard comparison operators.
x <= y
## [1] TRUE
x == y
## [1] FALSE
x > y
## [1] FALSE
Observe that the operator testing for equality is written as the double equals symbol '=='. Make sure not to confuse this with the single equals symbol, which functions as an assignment operator.
R has a built-in help system where every possible function is described. If you know the name of the function, its help file can be requested with the ? operator. For example, to show the help of the function mean, type the following:
?mean
If you are not sure of the function's name, the help files may be searched using the double question mark operator.
??average
IDEs for R have built-in search for the help files that may be more convenient.
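As a rough complement to the ? and ?? operators, base R also offers function-call forms and a few lookup helpers. The sketch below uses only functions from base R and the utils package; the search pattern is an arbitrary choice for this example.

```r
# Function-call equivalents of the help operators (left as comments here
# because they open the help viewer in an interactive session)
# help("mean")            # same as ?mean
# help.search("average")  # same as ??average

# apropos() lists objects on the search path whose names match a pattern
mean_like <- apropos("^mean")
print(mean_like)

# args() shows the argument list of a function without opening its help page
print(args(mean))
```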
There are a number of good online resources to get help from fellow users. Most notably, the Q&A site stackoverflow.com hosts many R-related questions that have already been answered by users (and questions about many other topics as well). In fact, if you type an R-related question in a search engine, chances are that the first hit is a stackoverflow page. You may also want to subscribe to the R-help mailing list (see https://www.r-project.org/mail.html). Here, questions are often answered by the developers of GNU R itself. Do observe the 'netiquette' and follow the posting guide before posting a question to the list. In particular, you should search the mailing list prior to posting a question to avoid double posts.
Besides resources where answers to questions can be found, there are many blogs discussing R and applications of R. A good way to become familiar with all the possibilities of R is to frequently visit r-bloggers.com, where many R-related blogs are collected and presented in a newspaper-like format. Browsing through the blogs allows you to stumble upon functions and ideas that you cannot get from just following a tutorial.
Learning R is not something you should do alone. Besides the online community from which you can benefit, many cities have R user groups that organize frequent meetings that you can join. If your organization is using R, it is a good idea to organize a local user group within the organization. All you need is a room, a projector, and a laptop to start organizing meetings. In our experience, user meetings are a very efficient (and fun!) way to share knowledge and experiences among colleagues, friends, or classmates. The point is that even in base R, there are thousands of functions and many ways to solve the same problem. Informal user meetings are a good way of bumping into solutions you otherwise might not have thought of.
The most basic type of object in R is called a vector, a sequence of values of the same type. Vectors are so basic that you have already worked with them. When in the previous examples we computed x + y, R was in fact adding two numeric vectors of length 1 containing the numbers 10 and 20.
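The claim that single numbers are vectors of length 1 can be checked directly. A minimal sketch, reusing the values of x and y from the earlier examples:

```r
x <- 10
y <- 20

# A single number is already a numeric vector of length 1
print(is.vector(x))
print(length(x))

# Building the same object explicitly with c() makes no difference
print(identical(x, c(10)))

# The sum is again a numeric vector of length 1
s <- x + y
print(class(s))
print(length(s))
```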
There are several ways to create a vector. One simple way is to use the function c() (for concatenate, or combine).
# a vector with numbers 1, 3, and 5
c(1,3,5)
# a vector with two text elements
c("hello world","hello universe")
Ordered number sequences can be generated with the colon operator (:) or with the seq function.
# a vector with numbers 1, 2, ..., 10
1:10
# a sequence of 100 equally spaced numbers from 1 to 6
seq(1,6,length.out=100)
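To make the difference between these sequence generators concrete, the following sketch inspects their results. The by argument of seq is an additional option not used in the text above, and the variable names are illustrative.

```r
# The colon operator yields integers
a <- 1:10
print(class(a))

# length.out fixes the number of elements; the step size follows from it
s <- seq(1, 6, length.out = 100)
step <- (6 - 1) / (100 - 1)
print(length(s))
print(isTRUE(all.equal(diff(s), rep(step, 99))))

# by fixes the step size; the number of elements follows from it
b <- seq(1, 6, by = 0.5)
print(length(b))
```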
Sequences of random numbers from various distributions can be generated as well.
# 100 numbers drawn from the standard normal distribution
rnorm(100)
# 50 numbers drawn from the uniform distribution on [2,7]
runif(50,min=2,max=7)
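Random draws differ from run to run unless the random number generator is seeded. As a sketch (the seed value 42 is an arbitrary choice for this example), set.seed makes the draws reproducible:

```r
# Seeding the generator makes the same draws come out again
set.seed(42)
first <- rnorm(5)
set.seed(42)
second <- rnorm(5)
print(identical(first, second))

# Uniform draws stay within the requested interval
u <- runif(50, min = 2, max = 7)
print(range(u))
```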
You may try to combine values of a different type in a vector, but R will then convert the type when necessary.
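A brief sketch of how this conversion plays out for a few combinations of types, and of converting text back to numbers explicitly with as.numeric (the variable names are illustrative):

```r
# Combining types converts everything to the most general type involved
types <- c(
  class(c(TRUE, 2L)),       # logical and integer
  class(c(2L, 3.14)),       # integer and numeric
  class(c(3.14, "hello"))   # numeric and text
)
print(types)

# Text that represents a number can be converted back explicitly
as_num <- as.numeric("3.14")
print(as_num)

# Text that does not represent a number becomes NA (with a warning)
not_num <- suppressWarnings(as.numeric("hello"))
print(not_num)
```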
c(1,"hello", 3.14)
## [1] "1" "hello" "3.14"
When this vector is printed, there are quotes around the 'numbers' "1" and "3.14". That is because R decided to convert these numbers to text since one of the elements in the vector is text (you can always convert a number to text but not the other way around). In R, such a conversion of type is usually referred to as coercion.
This automatic conversion has consequences for everyday use. For example, the function read.csv reads csv files into R's working memory. It automatically detects the value types of the columns, assuming by default that the first row contains the column names. Now if you feed it a csv file in which one of the columns contains numeric data in all but one field, say somewhere at the bottom, that whole column will be interpreted as text (or, in R versions before 4.0.0, as a categorical variable) by default. Of course this behavior can be controlled, but it is typical of R to perform coercion rather than throw an error.
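The effect can be reproduced without a file by passing csv content through the text argument of read.csv. The data below are made up for this illustration, and declaring the offending entry as a missing-value marker via na.strings is just one of several ways to control the behavior.

```r
# A small csv with one non-numeric entry at the bottom of 'amount'
csv_text <- "id,amount
1,10.5
2,3.2
3,unknown"

# By default the whole column is coerced away from numeric
raw_read <- read.csv(text = csv_text)
print(class(raw_read$amount))

# Declaring the offending string as a missing-value marker keeps it numeric
clean_read <- read.csv(text = csv_text, na.strings = "unknown")
print(class(clean_read$amount))
print(clean_read$amount)
```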
There are a few basic vector types with which R can work, including logical (with values TRUE and FALSE), integer, numeric, complex, character, and raw.
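As a quick sketch, one value of each basic type can be constructed and inspected with class. The example values are arbitrary.

```r
# One example value for each basic vector type
values <- list(TRUE, 1L, 1.5, 1+2i, "a", as.raw(255))

# class() reports the type of each value
classes <- vapply(values, class, character(1))
print(classes)
```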
There are also types for storing categorical data (factor) and ordered categorical data (ordered).
These types are really integer vectors combined with a table that describes which category (level) is stored as what integer.
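The integer codes and the accompanying table of levels can be inspected directly. A minimal sketch with made-up category labels:

```r
# A factor stores integer codes plus a table of levels
f <- factor(c("b", "a", "b"))
print(levels(f))      # the level table, sorted alphabetically by default
print(as.integer(f))  # the integer code stored for each element
print(typeof(f))      # the underlying storage type

# An ordered factor additionally defines an order on its levels
o <- factor(c("low", "high", "low"), levels = c("low", "high"), ordered = TRUE)
print(class(o))
print(o[1] < o[2])
```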
You can ask any object what type it is, using the class function.
x <- 1:3
y <- c("foo", "bar")
class(x)
## [1] "integer"
class(y)
## [1] "character"
There are two more types of metadata stored with a vector. The first is its number of elements, which can be retrieved with the length function.
length(y)
## [1] 2
Secondly, the elements of a vector can be given names. For example:
shoesize <- c(jan=43, pier=39, joris=45, korneel=42)
The names are printed when a vector is printed to screen, but they do not affect any computations based on the vector.
mean(shoesize)
## [1] 42.25
The names of a vector can be retrieved with the names function.
names(shoesize)
## [1] "jan" "pier" "joris" "korneel"
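Names are also useful for selecting elements and can be changed after the fact. The following sketch reuses the shoesize vector; the replacement name "piet" is made up for this illustration.

```r
shoesize <- c(jan = 43, pier = 39, joris = 45, korneel = 42)

# Select a single element by its name
pier_size <- shoesize[["pier"]]
print(pier_size)

# Dropping the names does not change the result of a computation
plain_mean <- mean(unname(shoesize))
print(plain_mean == mean(shoesize))

# Names can be replaced by assigning to names()
renamed <- shoesize
names(renamed)[2] <- "piet"
print(names(renamed))
```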
All arithmetic and comparison operators and mathematical functions...