Some short code to remove certain numbers from a data frame.

When I first write code, either for analysis or just cleaning up some data, it can be incredibly messy and inefficient. My first priority is just getting something to work, but as I have been using R for longer, I have tried to focus on making the code cleaner and thinking more like a programmer.

Recently I have been working on a project where I was asked to fix a problem with some prior analyses. To cut a long story short, there was some missing data and a previous collaborator had imputed the data using an undocumented method (another good reason to use R as I have since been able to make all the analyses reproducible). We decided to take action and remove all of their imputed values so that we could start from scratch and document the whole process. The benefit we had is we knew what the real data should look like. We had Likert responses which could only be whole numbers between 1 and 6. Therefore, any number with a decimal place was due to their previously undocumented missing data imputation process. If you were to do this manually in a spread sheet, this could take hours of going through the data set and deleting all the numbers. With the use of R, this kind of process can be automated in a couple of lines of code.

Firstly, lets take a sub-set of the data and see what we’re dealing with.

dat <- dat %>% 
  select(X, A1:A10)

head(dat, 15)
##     X       A1 A2       A3       A4 A5 A6 A7 A8 A9 A10
## 1   1 5.000000  1 5.000000 1.000000  1  3  5  1  5   1
## 2   2 6.000000  1 4.000000 2.000000  2  3  4  2  4   3
## 3   3 3.000000  3 1.000000 1.000000  1  1  5  4  3   1
## 4   4 5.000000  3 4.000000 3.000000  1  1  4  4  5   5
## 5   5 5.000000  2 1.000000 3.000000  1  4  6  1  6   1
## 6   6 4.000000  2 3.000000 1.000000  1  2  6  1  4   4
## 7   7 2.000000  1 1.000000 1.000000  4  6  3  6  2   3
## 8   8 6.000000  1 2.000000 1.000000  2  6  5  3  6   5
## 9   9 5.000000  2 4.000000 2.000000  2  5  6  4  5   3
## 10 10 4.000000  1 3.000000 2.000000  1  4  5  6  4   6
## 11 11 2.000000  1 3.000000 1.000000  1  1  2  1  2   1
## 12 12 6.000000  2 1.000000 3.000000  3  3  6  2  5   6
## 13 13 3.855063  1 3.000000 1.000000  1  1  2  1  3   3
## 14 14 6.000000  1 2.095608 2.648368  1  2  3  2  6   3
## 15 15 6.000000  1 1.000000 2.000000  4  4  6  1  6   1

This shows us that although some of the data is fine, some of the variables contain decimal places that we know should be impossible given the Likert scale responses. My first approach (not thinking like a programmer) was to filter the data by selecting each real number manually.

integer.dat <- dat %>% 
  gather(Scale, Rating, -X) %>% # X is a simple participant number to match up the data
  filter(Rating == 1 | Rating == 2 | Rating == 3 | Rating == 4 | Rating == 5 | Rating == 6)

nrow(integer.dat)
## [1] 3788

This converts the data to long format through gather and then only selects the rows that fit your criteria using filter. This does the job, but firstly there is a lot of repetition, it would be very easy to make a typo and it just takes too long to type. Secondly, this approach would not scale very well. This is relatively painless with 6 options, but what if you had 20 options? A more efficient method is clearly in order.

Having a dig around, I found a handy little function called as.integer(). This tries to coerce each value to an integer (whole number) and evaluates to TRUE if it succeeds. However, if the value cannot be coerced, it throws out an NA. Using this in filter provides the same results as above as only the values that evaluate to TRUE will be retained.

integer.dat2 <- dat %>% 
  gather(Scale, Rating, -X) %>%
  filter(Rating == as.integer(Rating))

nrow(integer.dat2)
## [1] 3788

If you then convert this back to wide format, having the participant ID keeps all of the data in the correct structure, leaving NA values where the numbers could not be coerced to an integer. This then leaves you with the task of imputing the missing data with your desired method.

dat <- integer.dat2 %>% 
  spread(key = Scale, value = Rating)

head(dat, 15)
##     X A1 A10 A2 A3 A4 A5 A6 A7 A8 A9
## 1   1  5   1  1  5  1  1  3  5  1  5
## 2   2  6   3  1  4  2  2  3  4  2  4
## 3   3  3   1  3  1  1  1  1  5  4  3
## 4   4  5   5  3  4  3  1  1  4  4  5
## 5   5  5   1  2  1  3  1  4  6  1  6
## 6   6  4   4  2  3  1  1  2  6  1  4
## 7   7  2   3  1  1  1  4  6  3  6  2
## 8   8  6   5  1  2  1  2  6  5  3  6
## 9   9  5   3  2  4  2  2  5  6  4  5
## 10 10  4   6  1  3  2  1  4  5  6  4
## 11 11  2   1  1  3  1  1  1  2  1  2
## 12 12  6   6  2  1  3  3  3  6  2  5
## 13 13 NA   3  1  3  1  1  1  2  1  3
## 14 14  6   3  1 NA NA  1  2  3  2  6
## 15 15  6   1  1  1  2  4  4  6  1  6

You may be thinking that this is obvious, but hopefully at least one person originally thought of the clunky method like I did. Hopefully seeing this will save them some code and provide them with a more efficient way of doing this task. There may be an even more straight forward way of dealing with a problem like this. If there is, leave a comment and I can add that in too.