The only how-to guide offering a unified, systematic approach to acquiring, cleaning, and managing data in R
Every experienced practitioner knows that preparing data for modeling is a painstaking, time-consuming process. Adding to the difficulty is that most modelers learn the steps involved in cleaning and managing data piecemeal, often on the fly, or they develop their own ad hoc methods. This book helps simplify their task by providing a unified, systematic approach to acquiring, modeling, manipulating, cleaning, and maintaining data in R.
Starting with the very basics, data scientists Samuel E. Buttrey and Lyn R. Whitaker walk readers through the entire process. From what data looks like and what it should look like, they progress through all the steps involved in getting data ready for modeling. They describe best practices for acquiring data from numerous sources; explore key issues in data handling, including text/regular expressions, big data, parallel processing, merging, matching, and checking for duplicates; and outline highly efficient and reliable techniques for documenting data and recordkeeping, including audit trails, getting data back out of R, and more.
* The only single-source guide to R data and its preparation, it describes best practices for acquiring, manipulating, cleaning, and maintaining data
* Begins with the basics and walks readers through all the steps necessary to get data ready for the modeling process
* Provides expert guidance on how to document the processes described so that they are reproducible
* Written by seasoned professionals, it provides both introductory and advanced techniques
* Features case studies with supporting data and R code, hosted on a companion website
A Data Scientist's Guide to Acquiring, Cleaning and Managing Data in R is a valuable working resource/bench manual for practitioners who collect and analyze data, lab scientists and research associates of all levels of experience, and graduate-level data mining students.
SAMUEL E. BUTTREY, PhD is an Associate Professor of Operations Research at the Naval Postgraduate School, Monterey, California, USA.
LYN R. WHITAKER, PhD is an Associate Professor of Operations Research at the Naval Postgraduate School, Monterey, California, USA.
R Data, Part 1: Vectors
The basic unit of computation in R is the vector. A vector is a set of one or more basic objects of the same kind. (Actually, it is even possible to have a vector with no objects in it, as we will see, and this happens sometimes.) Each of the entries in a vector is called an element. In this chapter, we talk about the different sorts of vectors that you can have in R. Then, we describe the very important topic of subsetting, which is our word for extracting pieces of vectors - all of the elements that are greater than 10, for example. That topic goes together with assigning, or replacing, certain elements of a vector. We describe the way missing values are handled in R; this topic arises in almost every data cleaning problem. The rest of the chapter gives some tools that are useful when handling vectors.
By a "basic" object, we mean an object of one of R's so-called "atomic" classes. These classes, which you can find in R's help pages, are logical (values TRUE or FALSE, although T and F are provided as synonyms); integer; numeric (also called double); character, which refers to text; raw, which can hold binary data; and complex. Some of these, such as raw and complex, probably won't arise in data cleaning.
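As a minimal sketch (not part of the original text), the class() function reports which atomic class a vector belongs to. The object names below are made up for illustration.

```r
# One example value of each atomic class (object names are illustrative only)
lgl <- TRUE
int <- 5L          # the L suffix asks for an integer
num <- 5           # a plain number is numeric (stored as a double)
chr <- "five"
rw  <- as.raw(5)
cpx <- 5 + 0i

# Collect the class of each object into a character vector
classes <- sapply(list(lgl, int, num, chr, rw, cpx), class)
print(classes)
# "logical" "integer" "numeric" "character" "raw" "complex"
```

Here typeof(num) returns "double", which is why numeric and double are treated as two names for the same kind of vector.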
2.1.1 Creating Vectors
We are mostly concerned with vectors that have been given to us as data. However, there are a number of situations when you will need to construct your own vectors. Of course, since a scalar is a vector of length 1, you can construct one directly, by typing its value:
> 5
[1] 5
R displays the [1] before the answer to show you that the 5 is the first element of the resulting vector. Here, of course, the resulting vector only had one entry, but R displays the [1] nonetheless. There is no such thing as a "scalar" in R; even π, represented in R by the built-in value pi, is a vector of length 1. To combine several items into a vector, use the c() function, which combines as many items as you need.
> c(1, 17)
[1]  1 17
> c(-1, pi, 17)
[1] -1.000000  3.141593 17.000000
> c(-1, pi, 1700000)
[1] -1.000000e+00  3.141593e+00  1.700000e+06
R has formatted the numbers in the vectors in a consistent way. In the second example, the number of digits of pi is what determines the formatting; see Section 1.3.3. In example three, the same number of digits is used, but the large number has caused R to use scientific notation. We discuss that in Section 4.2.2. Analogous formatting rules are applied to non-numeric vectors as well; this makes output much more readable. The c() function can also be used to combine vectors, as long as all the vectors are of the same sort.
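A short sketch of these two points, using made-up object names: c() joins whole vectors end to end, and a single number such as pi already has length 1.

```r
# Two small numeric vectors (names are illustrative only)
a <- c(1, 17)
b <- c(-1, pi)

# c() flattens its arguments into one longer vector
combined <- c(a, b, 100)
print(combined)

length(pi)        # 1: a single number is a vector of length 1
length(combined)  # 5: two elements + two elements + one element
```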
Another vector-creation function is rep(), which repeats a value as many times as you need. For example, rep(3, 4) produces a vector of four 3s. In this example, we show some more of the abilities of rep().
> rep (c(2, 4), 3)                  # repeat a vector
[1] 2 4 2 4 2 4
> rep (c("Yes", "No"), c(3, 1))     # repeat elements of vector
[1] "Yes" "Yes" "Yes" "No"
> rep (c("Yes", "No"), each = 8)
 [1] "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "No"
[10] "No"  "No"  "No"  "No"  "No"  "No"  "No"
The last two examples show rep() operating on a character vector. The final one shows how R displays longer vectors - by giving the number of the first element on each line. Here, for example, the [10] indicates that the first "No" on the second line is the 10th element of the vector.
We also very often create vectors of sets of consecutive integers. For example, we might want the first 10 integers, so that we can get hold of the first 10 rows in a table. For that task we can use the colon operator, written as a single colon (:). Actually, the colon operator doesn't have to be confined to integers; you can also use it to produce a sequence of non-integers that are one unit apart, as in the following example, but we haven't found that to be very useful.
> 1:5
[1] 1 2 3 4 5
> 6:-2
[1]  6  5  4  3  2  1  0 -1 -2    # Can go in reverse, by 1
> 2.3:5.9
[1] 2.3 3.3 4.3 5.3               # Permitted (but unusual)
> 3 + 2:7                         # Watch out here! This is 3 +
[1]  5  6  7  8  9 10             # (vector produced by 2:7)
> (3 + 2):7
[1] 5 6 7                         # This is 5:7
In that last pair of examples, we see that R evaluates the 2:7 operation before adding the 3. This is because the colon operator has a higher precedence in the order of operations than addition. The list of operators and their precedences can be found at ?Syntax, and precedence can always be overridden with parentheses, as in the example - but this is the only example of operator precedence that is likely to trip you up. Also notice that adding 3 to a vector adds 3 to each element of that vector; we talk more about vector operations in Section 2.1.4.
Finally, we sometimes need to create vectors whose entries differ by a number other than one. For that, we use
seq(), a function that allows much finer control of starting points, ending points, lengths, and step sizes.
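The following sketch shows a few common ways of calling seq(). The object names are invented for illustration; by and length.out are standard arguments of seq().

```r
# Step size other than one: from 2 to 10 in steps of 2
evens <- seq(2, 10, by = 2)
print(evens)                  # 2 4 6 8 10

# Fix the number of points instead of the step size
grid <- seq(0, 1, length.out = 5)
print(grid)                   # 0.00 0.25 0.50 0.75 1.00

# A negative step counts downward
down <- seq(10, 1, by = -3)
print(down)                   # 10 7 4 1
```

Specifying length.out is often the safer choice when the endpoints matter more than the exact step, since seq() then computes the step size for you.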
2.1.3 Logical Vectors
We can create logical vectors using the c() function, but most often they are constructed by R in response to an operation on other vectors. We saw examples of operators back in Section 1.3.2; the R operators that perform comparisons are <, >, <=, >=, == (for "is equal to") and != (for "not equal to"). In this example, we do some simple comparisons on a short vector.
> 101:105 >= 102          # Which elements are >= 102?
[1] FALSE  TRUE  TRUE  TRUE  TRUE
> 101:105 == 104          # Which equal (==) 104?
[1] FALSE FALSE FALSE  TRUE FALSE
Of course, when you compare two floating-point numbers for equality, you can get unexpected results. In this example, we compute 1 - 1/46 * 46, which is zero; 1 - 1/47 * 47, and so on up through 50. We have seen this example before!
> 1 - 1/46:50 * 46:50 == 0
[1]  TRUE  TRUE  TRUE FALSE  TRUE
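One common way around this, sketched here as an illustration rather than as the authors' recommendation, is to compare against a small tolerance instead of testing for exact equality. The tolerance value and object names are assumptions chosen for this example; all.equal() is a base R function that compares numbers up to a small tolerance.

```r
# Same computation as in the example above
x <- 1 - 1/46:50 * 46:50

# Treat anything smaller than a tiny tolerance as zero.
# The value 1e-12 is an arbitrary choice for illustration.
tol <- 1e-12
near_zero <- abs(x) < tol
print(near_zero)              # every element is within the tolerance

# all.equal() does a tolerant comparison; wrap it in isTRUE()
# because it returns a description rather than FALSE on a mismatch
close_enough <- isTRUE(all.equal(1/49 * 49, 1))
print(close_enough)
```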
We noted earlier that R provides T and F as synonyms for TRUE and FALSE. We sometimes use these synonyms in the book. However, it is best to beware of using these shortened forms in code. It is possible to create objects named T or F, which might interfere with their usage as logical values. In contrast, the full names TRUE and FALSE are reserved words in R. This means that you cannot directly assign one of these names to an object and, therefore, that they are never ambiguous in code.
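A small sketch of the difference, using made-up object names. The assignment to TRUE is wrapped in tryCatch() and eval(parse()) only so that the expected failure does not stop the script.

```r
# T is an ordinary object name, so it can be overwritten
T <- 0
masked <- isTRUE(T)           # FALSE, because T now holds 0
rm(T)                         # remove our T; the built-in T is visible again
restored <- isTRUE(T)         # TRUE

# TRUE is a reserved word, so assigning to it fails
result <- tryCatch(
  eval(parse(text = "TRUE <- 0")),
  error = function(e) "assignment to TRUE failed"
)
print(result)
```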
The Number and Proportion of Elements That Meet a Criterion
One task that comes up a lot in data cleaning is to count the number (or proportion) of elements that meet some criterion. We might want to know how many missing values there are in a vector, for example, or the proportion of elements that are less than 0.5. For these tasks, computing the sum() or mean() of a logical vector is an excellent approach. In our earlier example, we might have been interested in the number of elements that are >= 102, or the proportion that are exactly 104.
> 101:105 >= 102
[1] FALSE  TRUE  TRUE  TRUE  TRUE
> sum (101:105 >= 102)
[1] 4                       # Four elements are >= 102
> 101:105 == 104
[1] FALSE FALSE FALSE  TRUE FALSE
> mean (101:105 == 104)
[1] 0.2                     # 20% are == 104
It may be worth pondering this last example for a moment. We start with the logical vector that is the result of the comparison operator. In order to apply a mathematical function to that vector, R needs to convert the logical elements to numeric ones: FALSE values get turned into zeros and TRUE values into ones (we discuss conversion further in Section 2.2.3). Then, sum() adds up those 0s and 1s, producing the total number of 1s in the converted vector - that is, the number of TRUE values in the logical vector or the number of elements of the original vector that meet the criterion by being >= 102. The mean() function computes the sum of the 0s and 1s and then divides that sum by the total number of elements, and that operation produces the proportion of TRUE values in the logical vector, that is, the proportion of elements in the original vector that meet the criterion.
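The two tasks mentioned at the start of this section can be sketched the same way. The data vector below is made up for illustration; is.na() and the na.rm argument are standard base R, used here so that the missing values do not turn the proportion into NA.

```r
# Made-up data with two missing values
x <- c(0.2, NA, 0.7, 0.4, NA, 0.9)

# is.na() gives a logical vector; sum() counts its TRUE values
n_missing <- sum(is.na(x))
print(n_missing)              # 2

# Proportion of the non-missing elements that are less than 0.5
prop_small <- mean(x < 0.5, na.rm = TRUE)
print(prop_small)             # 0.5 (two of the four non-missing values)
```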
2.1.4 Vector Operations
Understanding how vectors work is crucial to using R properly and efficiently....