6  Data Wrangling 3: Pivots and Pipes

In the last chapter, we added two more functions to your data wrangling toolkit. You learnt how to filter data to retain or remove observations and summarise data to calculate different summary statistics.

In this chapter, we start with another recap of all your data wrangling skills so far on a new data set. There is no substitute for practice when it comes to data skills. By applying your skills to new data, you can transfer your knowledge to novel scenarios and develop independence. We then add two more sets of data wrangling functions. First, we demonstrate pivots to restructure your data from wide format into long format, and vice versa. Second, we show how you can string together multiple functions from the tidyverse using pipes. These help streamline your code to avoid creating lots of intermediary objects.

After this chapter, you will be ready for the first data analysis journey chapter: Analysis Journey 1: Data Wrangling. This is where you can test the skills you have developed so far on a more independent task as a bridge between the core chapters and your assessments.

Chapter Intended Learning Outcomes (ILOs)

By the end of this chapter, you will be able to:

  • Apply the data wrangling functions from Chapters 4 and 5, such as select(), mutate(), filter(), and summarise(), to a new data set.

  • Restructure data from wide format to long format using pivot_longer(), and from long format to wide format using pivot_wider().

  • Combine several functions into one pipeline using pipes (%>%).

6.1 Chapter preparation

6.1.1 Introduction to the data set

For this chapter, we are using open data from Alter et al. (2024). The abstract of their article is:

The biggest difference in statistical training from previous decades is the increased use of software. However, little research examines how software impacts learning statistics. Assessing the value of software to statistical learning demands appropriate, valid, and reliable measures. The present study expands the arsenal of tools by reporting on the psychometric properties of the Value of Software to Statistical Learning (VSSL) scale in an undergraduate student sample. We propose a brief measure with strong psychometric support to assess students’ perceived value of software in an educational setting. We provide data from a course using SPSS, given its wide use and popularity in the social sciences. However, the VSSL is adaptable to any statistical software, and we provide instructions for customizing it to suit alternative packages. Recommendations for administering, scoring, and interpreting the VSSL are provided to aid statistics instructors and education researchers understand how software influences students’ statistical learning.

To summarise, they developed a new scale to measure students’ perceived value of software to learning statistics - Value of Software to Statistical Learning (VSSL). The authors wanted to develop this scale in a way that could be adapted to different software, from SPSS in their article (which some of you may have used in the past), to perhaps R in future. Alongside data from their new scale, they collected data from other scales measuring a similar kind of construct (e.g., Students’ Attitudes toward Statistics and Technology) and related constructs (e.g., Quantitative Attitudes).

In this chapter, we will wrangle their data to reinforce skills from Chapters 4 and 5. Scale data is extremely common in psychology and there is a high likelihood you will use one or more scales in your dissertation or future career. After recapping skills from the past two chapters on this new data set, we will add more data wrangling functions to your toolkit.

6.1.2 Organising your files and project for the chapter

Before we can get started, you need to organise your files and project for the chapter, so your working directory is in order.

  1. In your folder for research methods and the book ResearchMethods1_2/Quant_Fundamentals, you should have a folder from chapter 4 called Chapter_04_06_datawrangling where you created an R Project.

  2. Create a new R Markdown document and give it a sensible title describing the chapter, such as 06 Data Wrangling 3. Delete everything below line 10 so you have a blank file to work with and save the file in your Chapter_04_06_datawrangling folder.

  3. We are working with a new data set separated into two files. The links are data file one (Alter_2024_demographics.csv) and data file two (Alter_2024_scales.csv). Right click the links and select “save link as”, or click the links to save the files to your Downloads folder. Make sure that both files are saved as “.csv”. Save or copy both files to your data/ folder within Chapter_04_06_datawrangling.

You are now ready to start working on the chapter!

6.2 Recapping all the previous dplyr functions

In this first section, we will prepare the data for some analysis later by practicing the data wrangling skills you learnt in Chapters 4 and 5 on this new data set.

6.2.1 Activity 1 - Load tidyverse and read the data files

As the first activity, load tidyverse and read the two data files. As a prompt, assign the data files to the object names below so you are consistent with the later activities; you can check your answer below if you are stuck.

# Load the tidyverse package below
?

# Load the data files
# This should be the Alter_2024_demographics.csv file 
demog <- ?

# This should be the Alter_2024_scales.csv file 
scales <- ?

You should have the following in a code chunk:

# Load the tidyverse package below
library(tidyverse)

# Load the data files
# This should be the Alter_2024_demographics.csv file 
demog <- read_csv("data/Alter_2024_demographics.csv")

# This should be the Alter_2024_scales.csv file 
scales <- read_csv("data/Alter_2024_scales.csv")

6.2.2 Activity 2 - Explore demog and scales

The data from Alter et al. (2024) is split into two data files. In demog, we have the participant ID (StudentIDE) and several demographic variables. The columns (variables) we have in the data set are:

| Variable | Type | Description |
|---|---|---|
| StudentIDE | double | Participant number |
| GenderE | double | Gender: 1 = Female, 2 = Male, 3 = Non-Binary |
| RaceEthE | double | Race: 1 = Black/African American, 2 = Hispanic/Other Latinx, 3 = White, 4 = Multiracial, 5 = Asian/Pacific Islander, 6 = Native American/Alaska Native, 7 = South/Central American |
| GradeE | character | Expected grade: 1 = A, 2 = B, 3 = C, 4 = D, 5 = F |
| StuStaE | character | Student status: 1 = Freshman, 2 = Sophomore, 3 = Junior, 4 = Senior or Higher |
| GPAE | character | Expected Grade Point Average (GPA) |
| MajorE | character | Degree major |
| AgeE | double | Age in years |

In scales, we then have the participant ID (StudentIDE) and all the individual scale items. The columns (variables) we have in the data set are:

| Variable | Type | Description |
|---|---|---|
| StudentIDE | double | Participant number |
| MA1E to MA8E | double | Enjoyment of Mathematics and statistics, not analysed in this study |
| QANX1E to QANX4E | double | Quantitative anxiety: four items scored on a 5-point Likert scale ranging from 1 (Not at all Anxious) to 5 (Extremely Anxious) |
| QINFL1E to QINFL7E | double | Quantitative attitudes: seven items scored on a 5-point Likert scale ranging from 1 (Strongly Disagree) to 5 (Strongly Agree) |
| QSF1E to QSF4E | double | Study motivation, not analysed in this study |
| QHIND1E to QHIND5E | double | Quantitative hindrances: five items scored on a 5-point Likert scale ranging from 1 (Strongly Disagree) to 5 (Strongly Agree) |
| QSC1E to QSC4E | double | Mathematical self-efficacy, not analysed in this study |
| QSE1E to QSE6E | double | Mathematical ability, not analysed in this study |
| SPSS1E to SPSS10E | double | VSSL scale on SPSS: 10 items scored on a 5-point Likert scale ranging from 1 (Never True) to 5 (Always True) |

Try this

Now we have introduced the two data sets, explore them using the different methods we have covered. For example, open the data objects in a tab to scroll around, explore them with glimpse(), or even try plotting some of the variables to see what they look like using your visualisation skills from Chapter 3.
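For instance, here is a quick sketch of some options (the histogram is just one idea using the ggplot2 skills from Chapter 3, and is our suggestion rather than a required step):

# explore the structure of each object
glimpse(demog)
glimpse(scales)

# as one example, plot the age distribution
ggplot(demog, aes(x = AgeE)) + 
  geom_histogram()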

6.2.3 Activity 3 - Joining the two data sets using inner_join()

At the moment, we have two separate data sets, but it will make things easier to join them together so we have both demographic information and the participants’ responses to the scales.

We did not recap joining data sets in the last chapter, so you might need to revisit Chapter 4 - Joining two data frames - for a recap.

Create a new data object called full_data and see if you can spot a common variable between both data sets that you can use as an identifier.

# join demog and scales by a common identifier 
full_data <- ?

You should have the following in a code chunk:

# join demog and scales by a common identifier 
full_data <- inner_join(x = demog,
                        y = scales,
                        by = "StudentIDE")

6.2.4 Activity 4 - Selecting a range of columns using select()

There are some scales in the data that Alter et al. (2024) did not analyse, so we can get rid of them to declutter. Furthermore, the purpose of their study was to validate the new VSSL scale and they found some items did not make the cut. Create a new object called full_data_select and retain the following variables from your new full_data object:

  • StudentIDE

  • GenderE

  • RaceEthE

  • AgeE

  • QANX1E to QINFL7E

  • QHIND1E to QHIND5E

  • SPSS1E, SPSS4E, SPSS5E, SPSS6E, SPSS7E, SPSS8E, SPSS9E.

Remember: you can select variables either by retaining the variables you want to keep, or removing the variables you want to remove. You should have 27 columns remaining.

# select the key variables listed above
full_data_select <- ?

You should have the following in a code chunk if you chose to retain:

# select the key variables listed above
full_data_select <- select(full_data,
                           StudentIDE, 
                           GenderE,
                           RaceEthE, 
                           AgeE, 
                           QANX1E:QINFL7E, 
                           QHIND1E:QHIND5E, 
                           SPSS1E, SPSS4E, SPSS5E, SPSS6E, SPSS7E, SPSS8E, SPSS9E)

or the following if you chose to remove:

# select the key variables listed above
full_data_select <- select(full_data,
                           -GradeE, 
                           -StuStaE, 
                           -GPAE, 
                           -MajorE, 
                           -MA1E:-MA8E,
                           -QSF1E:-QSF4E,
                           -QSC1E:-QSE6E,
                           -SPSS2E, -SPSS3E, -SPSS10E)

There is a similar number of variables to retain or to remove, so there is no real time saving either way.

6.2.5 Activity 5 - Reorder observations using arrange()

For a quick check of the data, order the values of AgeE using the object full_data_select and answer the following questions:

  1. The youngest participant is years old.

  2. The oldest participant is years old.

# youngest participants
?

# oldest participants
?

You should have the following in a code chunk:

# youngest participants
arrange(full_data_select, 
        AgeE)

# oldest participants
arrange(full_data_select, 
        desc(AgeE))

6.2.6 Activity 6 - Modifying or creating variables using mutate()

At the moment, we have categorical variables such as gender (GenderE) and race (RaceEthE) which use numerical codes. When it comes to summarising or plotting later, these would not be the easiest to understand.

Using the full_data_select object, use mutate() to recode these two existing variables, replacing the numbers with labels, and create a new object called full_data_mutate. As a reminder of what each number refers to:

GenderE

  • 1 = Female

  • 2 = Male

  • 3 = Non-binary

RaceEthE

  • 1 = Black/African American

  • 2 = Hispanic/Other Latinx

  • 3 = White

  • 4 = Multiracial

  • 5 = Asian/Pacific Islander

  • 6 = Native American/Alaska Native

  • 7 = South/Central American

# recode gender and race to labels
full_data_mutate <- ?

You should have the following in a code chunk (some lines wrap due to being quite long, but it will look right if you copy and paste it into RStudio):

# recode gender and race to labels
full_data_mutate <- mutate(full_data_select,
                           GenderE = case_match(GenderE,
                                                1 ~ "Female",
                                                2 ~ "Male",
                                                3 ~ "Non-binary"),
                           RaceEthE = case_match(RaceEthE,
                                                 1 ~ "Black/African American",
                                                 2 ~ "Hispanic/Other Latinx",
                                                 3 ~ "White",
                                                 4 ~ "Multiracial",
                                                 5 ~ "Asian/Pacific Islander",
                                                 6 ~ "Native American/Alaska Native",
                                                 7 ~ "South/Central American"))

6.2.6.1 Bonus activity - reverse coding scales

For a bonus activity, we want to demonstrate a super common task when working with scale data. Often, scales will reverse code some items to express the same idea in opposite ways: one positive and one negative. If the scale is measuring a consistent construct, responses should be more positive on one item and more negative on the other. If you analysed these immediately, you would get two opposing answers, so a key data wrangling step is reverse coding some items so that all the numbers mean a similar thing.

In Alter et al. (2024), the three VSSL items we removed were the ones which needed to be reverse coded, but it is a good excuse to practice. Using the scales object, what function could you use to recode existing responses? Hint: we want to recode 1 to 5, 2 to 4, etc.

# recode items 2, 3, and 10
scales_reverse <- mutate(scales,
                         SPSS2_R = ?,
                         SPSS3_R = ?,
                         SPSS10_R = ?)

Based on what we covered before, we expect you will have completed a perfectly accurate but long process of recoding each item one by one:

# recode items 2, 3, and 10
scales_reverse <- mutate(scales,
                         SPSS2_R = case_match(SPSS2E,
                                              1 ~ 5,
                                              2 ~ 4,
                                              3 ~ 3,
                                              4 ~ 2,
                                              5 ~ 1),
                         SPSS3_R = case_match(SPSS3E,
                                              1 ~ 5,
                                              2 ~ 4,
                                              3 ~ 3,
                                              4 ~ 2,
                                              5 ~ 1),
                         SPSS10_R = case_match(SPSS10E,
                                               1 ~ 5,
                                               2 ~ 4,
                                               3 ~ 3,
                                               4 ~ 2,
                                               5 ~ 1))

However, there is a neat shortcut: you can subtract the response from the biggest scale point plus 1. For example, if you have a 5-point scale, you would subtract the response from 6; if you have a 7-point scale, from 8; and so on.

# Reverse code by subtracting responses from 6
scales_reverse <- mutate(scales,
                         SPSS2_R = 6 - SPSS2E,
                         SPSS3_R = 6 - SPSS3E,
                         SPSS10_R = 6 - SPSS10E)

Explore your new data object to see what the new reverse coded variables look like.

6.2.7 Activity 7 - Removing or retaining observations using filter()

To practice filtering data to retain specific participants, imagine we wanted to focus on two specific groups of people.

First, we just want to explore the data of “Non-binary” participants. Second, we want to explore the data of “Female”, “Asian/Pacific Islander” participants. Use filter() on the full_data_mutate object to create two objects: NB_participants and F_asian_participants.

# non-binary participants
NB_participants <- ?

# female, Asian/Pacific Islander participants
F_asian_participants <- ?

After creating the objects, answer the following questions:

  1. We have non-binary participant(s) in the data set.

  2. We have female, Asian/Pacific Islander participant(s) in the data set.

You should have the following in a code chunk:

# non-binary participants
NB_participants <- filter(full_data_mutate,
                          GenderE == "Non-binary")

# female, Asian/Pacific Islander participants
F_asian_participants <- filter(full_data_mutate,
                               GenderE == "Female",
                               RaceEthE == "Asian/Pacific Islander")

6.2.7.1 Bonus activity - Removing NAs with drop_na()

One concept we will spend more time on in Chapter 11 - Screening Data - is removing participants who do not provide an answer. We delve more into the decision making in the course materials, but there is a handy function in tidyr called drop_na(). You could do this using filter(), but the standalone function streamlines things. If you run the function on your whole data set, it will remove any observation with an NA in one or more of its variables:

# remove observations with any NAs
no_NAs <- drop_na(full_data_mutate)

However, you often do not want to remove every observation with an NA, as there might be valuable information elsewhere. You can add one or more variables to ask drop_na() to only remove observations with NAs in those specific variables:

# remove observations with NAs in AgeE
age_NAs <- drop_na(full_data_mutate,
                   AgeE)

This affects the number of participants we keep: we had 171 when we removed observations with any NA, but 179 when we only removed NAs in age.
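If you want to double check those numbers yourself, here is a quick sketch using nrow() from base R to count the observations in each object:

# number of observations left in each object
nrow(no_NAs)   # 171
nrow(age_NAs)  # 179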

6.2.8 Activity 8 - Summarising data using count() and summarise()

6.2.8.1 Counting observations

As the final recap activity, it is time to calculate some summary statistics to understand our data set. First, use count() on the full_data_mutate object to answer the following questions:

  1. How many observations do we have of each gender? males, females, and non-binary.

  2. How many observations do we have of each race? White, Black/African American, and NA with missing data.

# count each group in GenderE
?

# count each group in RaceEthE
?

You should have the following in a code chunk:

# count each group in GenderE
count(full_data_mutate,
      GenderE)

# count each group in RaceEthE
count(full_data_mutate,
      RaceEthE)

6.2.8.2 Summarising observations

One useful demographic summary is the mean and standard deviation (SD) of participant ages. We have covered the function for the mean (mean()) several times, but a key part of coding is knowing what you want to do without yet knowing the function that does it. So, for the next answer, try to find the function for the standard deviation on your own. If you are really stuck, you can see the hint below.

# Function for the standard deviation
sd()

# Mean and SD age
mean_age <- summarise(full_data_mutate,
                      mean_age = ?,
                      SD_age = ?)

You should have the following in a code chunk:

# Mean and SD age
mean_age <- summarise(full_data_mutate,
                      mean_age = mean(AgeE, na.rm = TRUE),
                      SD_age = sd(AgeE, na.rm = TRUE))

Remember, if there are NAs present in the data like this, you need to add na.rm = TRUE or handle NAs prior to applying the function.
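For example, here is a sketch of the second option using drop_na() from the bonus activity above (the object name age_complete is just our suggestion):

# handle the NAs first by dropping them from AgeE
age_complete <- drop_na(full_data_mutate,
                        AgeE)

# then summarise without needing na.rm = TRUE
mean_age <- summarise(age_complete,
                      mean_age = mean(AgeE),
                      SD_age = sd(AgeE))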

Error mode

As a transition point to restructuring data, imagine we wanted to sum the items to produce a single scale score per participant. Based on how we have used mutate() or summarise() before, you might try:

sum_VSSL <- mutate(full_data_mutate,
                   VSSL = sum(c(SPSS1E, SPSS4E, SPSS5E, SPSS6E, SPSS7E, SPSS8E, SPSS9E), 
                              na.rm = TRUE))

However, if you look at the object, the VSSL column is the same for every participant (4413), which does not look right. This is due to how functions work within mutate(). It applies sum() to every value across all the selected columns and all participants at once, rather than summing across the columns within each participant.
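You can confirm this for yourself with a quick sketch using select() and the base R sum() function: the repeated value is just the grand total of every response across the seven columns.

# grand total of all 7 VSSL columns across all participants
sum(select(full_data_mutate, SPSS1E, SPSS4E:SPSS9E), 
    na.rm = TRUE) # returns 4413, the same value as the VSSL column above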

We can fix this problem by restructuring the data.

6.3 Restructuring data using pivot_longer() and pivot_wider()

Apart from joining two data sets, we have pretty much just worked with the data files as they come to us, where each row represents one observation/participant and each column represents one variable. That is great, but there are scenarios where you get data sets in messier formats that do not follow this pattern. Furthermore, you might need to restructure your data to perform certain functions, like taking the mean/sum of many columns per participant or visualising multiple elements. Before we work on the data wrangling side, we need a brief explanation of data formats.

6.3.1 Tidy data

For most of this book, we use a type of data organisation known as tidy data. Any data in this format is easily processed through the tidyverse family of packages. However, the data you work with will not always be formatted in the most efficient way possible. If that happens, then our first step is to put it into a tidy data format. There are two fundamental principles defining tidy data:

  1. Each variable must have its own column.

  2. Each observation must have its own row.

Wickham (2014) adds a third principle:

  3. Each type of observational unit forms a table.

Grolemund and Wickham (2023) restate this third principle as: “Each value must have its own cell (i.e. no grouping two variables together, e.g. time/date in one cell)”, where a cell is where any specific row and column meet; a single data point in a data frame/tibble, for example, is a cell. The Grolemund and Wickham (2023) book is a very useful (and free) source for further reading, and browsing its chapter on tidy data will help you visualise how you want to arrange data.

Note

If you have worked with any kind of data before, particularly if you have used Excel, it is likely that you will have used wide format or long format data. In wide format, each participant’s data is all in one row with multiple columns for different data points. This means that the data set tends to be very wide and you will have as many rows as you have participants.

Long format is where each row is a single observation, typically a single trial in an experiment or a response to a single item on a questionnaire. When you have multiple trials per participant, you will have multiple rows for the same participant. To identify participants, you would need a variable with some kind of participant id, which can be as simple as a distinct integer value for each participant. In addition to the participant identifier, you would have any measurements taken during each observation (e.g., response time) and what experimental condition the observation was taken under.

In wide format data, each row corresponds to a single participant, with multiple observations for that participant spread across columns. So for instance, with survey data, you would have a separate column for each survey question.

Tidy data is a mix of both of these approaches, and most functions in the tidyverse assume the tidy format. Typically, then, the first thing you need to do when you get data is think about what format you need it in to perform the functions and analyses you want. For some functions, you need your data in wide format; for others, you need it in long format. This means being able to quickly restructure your data is a key skill.
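To make the difference concrete, here is a minimal sketch with hypothetical data (the IDs and test scores are made up for illustration):

# hypothetical wide format: one row per participant, one column per test
wide_scores <- tribble(~id, ~test1, ~test2,
                       1,       10,     12,
                       2,        9,     11)

# the same hypothetical data in long format: one row per observation
long_scores <- tribble(~id, ~test,   ~score,
                       1,   "test1", 10,
                       1,   "test2", 12,
                       2,   "test1",  9,
                       2,   "test2", 11)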

6.3.2 Activity 9 - Gathering with pivot_longer()

In its current format, we have wide data where each row is a separate participant and each column is a separate variable. To gather the VSSL items into long format, we can use the function pivot_longer() from the tidyr package within tidyverse.

The pivot functions can be easier to show than explain first, so type and run the following code using the full_data_mutate object:

full_data_long <- pivot_longer(data = full_data_mutate,
                               cols = SPSS1E:SPSS9E,
                               names_to = "Question", 
                               values_to = "Response")

To break down the code:

  • We create a new data object called full_data_long by applying the pivot_longer() function to full_data_mutate.

  • In the cols argument, we specify the columns we want to gather. We use the colon method here like select() to choose the 7 columns for the VSSL items. If the columns are not in order, you could use the c() method instead (e.g., cols = c(SPSS1E, SPSS9E)).

  • The names_to argument is what your first new column will be called. All the column names you selected in cols will be pivoted into this new column, so call it something sensible you will remember later. Here, we call the new column “Question”.

  • The values_to argument is what your second new column will be called. For all the columns you gather, the response of each participant will be in one column stacked on top of each other next to its label in “Question”. You also need to call this something memorable, like “Response” here.

Now, explore the new full_data_long object you just created and compare it to full_data_mutate. Instead of 181 rows, we now have 1267 rows. Instead of 27 variables, we now have 22 variables. We have 181 participants who each responded to 7 VSSL items, so pivoting the data into long format gives 181 * 7 = 1267 rows.
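If you want to check those numbers yourself, here is a quick sketch using dim() from base R, which returns the number of rows and columns of an object:

# dimensions (rows, columns) before and after pivoting
dim(full_data_mutate) # 181  27
dim(full_data_long)   # 1267 22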

Visually, you can see the difference with a preview of just the participant ID and VSSL items here:

Wide format (full_data_mutate):

| StudentIDE | SPSS1E | SPSS4E | SPSS5E | SPSS6E | SPSS7E | SPSS8E | SPSS9E |
|---|---|---|---|---|---|---|---|
| 1 | 4 | 3 | 3 | 3 | 4 | 4 | 4 |
| 2 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 3 | 3 | 2 | 3 | 2 | 2 | 3 | 3 |
| 4 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 5 | 4 | 3 | 4 | 4 | 3 | 4 | 3 |
| 6 | 2 | 3 | 2 | 1 | 3 | 3 | 3 |
| 7 | 4 | 2 | 4 | 4 | 2 | 3 | 2 |
| 8 | 2 | 3 | 3 | 3 | 3 | 3 | 2 |
| 9 | 3 | 3 | 3 | 1 | 4 | 3 | 2 |
| 10 | 4 | 4 | 3 | 5 | 4 | 2 | 4 |

Long format (full_data_long, first 10 rows):

| StudentIDE | Question | Response |
|---|---|---|
| 1 | SPSS1E | 4 |
| 1 | SPSS4E | 3 |
| 1 | SPSS5E | 3 |
| 1 | SPSS6E | 3 |
| 1 | SPSS7E | 4 |
| 1 | SPSS8E | 4 |
| 1 | SPSS9E | 4 |
| 2 | SPSS1E | 4 |
| 2 | SPSS4E | 4 |
| 2 | SPSS5E | 4 |

Now we have our data in long form, we can calculate summary statistics for participants using group_by() and summarise(). First, we group the data by the participant ID, as we want one value per participant:

# group full_data_long by StudentIDE
longdata_grouped <- group_by(full_data_long, 
                             StudentIDE)

Second, we create a new variable using summarise() to take the sum of all the items. This will create the VSSL scale score consistent with Alter et al. (2024):

# Calculate the sum of VSSL items by taking the sum of Response
VSSL_sum <- summarise(longdata_grouped,
                      VSSL_sum = sum(Response))

Our new object goes from 1267 rows back to 181, as we grouped by the participant ID and took the sum of Response. This means summarise() applies the function we give it within each group, in this case summing across all 7 VSSL items per participant. As a sanity check, participant 1’s responses in the wide preview above (4 + 3 + 3 + 3 + 4 + 4 + 4) sum to 25, matching the first row below. Your new object has just two columns, StudentIDE and VSSL_sum, and should look like the following extract:

| StudentIDE | VSSL_sum |
|---|---|
| 1 | 25 |
| 2 | 28 |
| 3 | 18 |
| 4 | 14 |
| 5 | 25 |
| 6 | 17 |

At this point, you could join the object to full_data_mutate to add the scale score to all the other variables.
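For example, here is a sketch of that join using inner_join() from Activity 3 (the object name full_data_scored is just our suggestion):

# add the VSSL scale score back to the wide data
full_data_scored <- inner_join(x = full_data_mutate,
                               y = VSSL_sum,
                               by = "StudentIDE")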

Try this

We calculated the VSSL scale score by pivoting longer, grouping the data, and taking the sum of the 7 items. To test your understanding, complete the same steps to calculate the scale score of Quantitative anxiety using the four columns QANX1E to QANX4E. The scale score here also involves taking the sum of the columns. Use the full_data_mutate object as your starting point for the data.

Check your attempt with the solution below when you have tried on your own.

# gather the four quant anxiety items to long form
quant_anxiety_long <- ?

# group the long data by participant ID
quant_anxiety_group <- ?

# calculate the sum quant anxiety per participant
sum_quant_anxiety <- ?

To check your answers:

  1. Participant 1 has a sum quantitative anxiety score of

  2. Participant 5 has a sum quantitative anxiety score of

We complete the task in three steps. First, we pivot longer using the four columns QANX1E to QANX4E to create the new quant_anxiety_long object. Second, we group that new long data object by the participant ID. Third, we calculate the sum quantitative anxiety score by taking the sum of the responses per participant ID.

# gather the four quant anxiety items to long form
quant_anxiety_long <- pivot_longer(data = full_data_mutate,
                                   cols = QANX1E:QANX4E,
                                   names_to = "Question",
                                   values_to = "Response")

# group the long data by participant ID
quant_anxiety_group <- group_by(quant_anxiety_long, 
                                StudentIDE)

# calculate the sum quant anxiety per participant
sum_quant_anxiety <- summarise(quant_anxiety_group,
                               sum_quant_anxiety = sum(Response))

6.3.3 Spreading with pivot_wider()

You might also find yourself in situations where you must restructure data in the opposite direction: from long to wide. There is a complementary function called pivot_wider() which spreads values from one column across multiple columns. You need two columns in your long form data set: one for the variable names, which will become your new column names, and one for the responses, which will become the values in each cell.

To demonstrate this function, we will transform full_data_long back to wide format so we have 7 columns of VSSL items:

full_data_wide <- pivot_wider(data = full_data_long,
                              names_from = "Question",
                              values_from = "Response")

To break down the code:

  • We create a new object full_data_wide by applying the function pivot_wider() to full_data_long.

  • In the names_from argument, we add the column name “Question” which contains the names of the variables you want as your new column names.

  • In the values_from argument, we add the column name “Response” which contains the values of the variables which will be the cells of your data frame.

The new object full_data_wide should now look exactly the same as the full_data_mutate object we started with.

Do I need to add quotes to the column names?

You might have noticed we added quotes around the column names to specify the names_from and values_from arguments. When we specify columns in tidyverse functions, we do not need to add the quotes; we can just type the name and it will work (names_from = "Question" and names_from = Question would both work here). However, in other functions outside the tidyverse, you normally need to add the quotes around column names. When to add quotes can take a while to get used to, so this is just a note that if one method gives you an error, try the other.
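For example, both of these calls should behave identically here, using the full_data_long object from this section:

# quoted column names
pivot_wider(full_data_long,
            names_from = "Question",
            values_from = "Response")

# unquoted column names
pivot_wider(full_data_long,
            names_from = Question,
            values_from = Response)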

Try this

In the pivot_longer() section, you should have created a new object quant_anxiety_long if you completed the “Try this” activity. To test your understanding of pivot_wider(), spread the four items and responses of Quantitative anxiety back to wide format. Use the quant_anxiety_long object as your starting point and create a new object called quant_anxiety_wide.

Check your attempt with the solution below when you have tried on your own.

# spread the quant anxiety items back to wide form
quant_anxiety_wide <- ?

This task follows the exact same format as full_data_wide if you named your variables the same as ours. It is just important that the names_from and values_from columns match those you used in quant_anxiety_long.

# spread the quant anxiety items back to wide form
quant_anxiety_wide <- pivot_wider(quant_anxiety_long, 
                                  names_from = "Question",
                                  values_from = "Response")

6.4 Combining several functions with pipes

In this final section on data wrangling, we are not covering new functions, but a new way of working. So far, we have created lots of new objects by applying individual tidyverse functions, but there is a way to string together several functions and streamline your code. We wanted to introduce you to the individual functions first to develop your fundamental skills and understanding of what the functions do, but now we can be a little more efficient.

Instead of creating several objects, you can use pipes. We write pipes as %>% and you can read them as “and then”. Pipes allow you to string together ‘sentences’ of code into ‘paragraphs’ so that you do not need to create intermediary objects.

This is another one of those concepts that is initially easier to show than tell:

# Create an object starting with demog 
full_data_pipe <-  demog %>%
  # Join with scales
  inner_join(y = scales,
             by = "StudentIDE") %>%
  # Select key columns
  select(StudentIDE, 
         GenderE,
         RaceEthE, 
         AgeE, 
         QANX1E:QINFL7E, 
         QHIND1E:QHIND5E, 
         SPSS1E, SPSS4E, SPSS5E, SPSS6E, SPSS7E, SPSS8E, SPSS9E) %>% 
  # Recode variables with labels
  mutate(GenderE = case_match(GenderE,
                              1 ~ "Female",
                              2 ~ "Male",
                              3 ~ "Non-binary"),
         RaceEthE = case_match(RaceEthE,
                               1 ~ "Black/African American",
                               2 ~ "Hispanic/Other Latinx",
                               3 ~ "White",
                               4 ~ "Multiracial",
                               5 ~ "Asian/Pacific Islander",
                               6 ~ "Native American/Alaska Native",
                               7 ~ "South/Central American"))

Instead of creating all the intermediary objects, we go straight from joining the two data sets to recoding the variables in mutate, all in one object. Side by side, you can see the difference in the process we had to go through:

# join demog and scales by a common identifier 
full_data <- inner_join(x = demog,
                        y = scales,
                        by = "StudentIDE")

# select the key variables listed above
full_data_select <- select(full_data,
                           StudentIDE, 
                           GenderE,
                           RaceEthE, 
                           AgeE, 
                           QANX1E:QINFL7E, 
                           QHIND1E:QHIND5E, 
                           SPSS1E, SPSS4E:SPSS9E)

# recode gender and race to labels
full_data_mutate <- mutate(full_data_select,
                           GenderE = case_match(GenderE,
                                                1 ~ "Female",
                                                2 ~ "Male",
                                                3 ~ "Non-binary"),
                           RaceEthE = case_match(RaceEthE,
                                                 1 ~ "Black...",
                                                 2 ~ "Hispanic...",
                                                 3 ~ "White",
                                                 4 ~ "Multiracial",
                                                 5 ~ "Asian...",
                                                 6 ~ "Native American...",
                                                 7 ~ "South..."))
# Create an object starting with demog 
full_data_pipe <-  demog %>%
  # Join with scales
  inner_join(y = scales,
             by = "StudentIDE") %>%
  # Select key columns
  select(StudentIDE, 
         GenderE,
         RaceEthE, 
         AgeE, 
         QANX1E:QINFL7E, 
         QHIND1E:QHIND5E, 
         SPSS1E, SPSS4E:SPSS9E) %>% 
  # Recode variables with labels
  mutate(GenderE = case_match(GenderE,
                              1 ~ "Female",
                              2 ~ "Male",
                              3 ~ "Non-binary"),
         RaceEthE = case_match(RaceEthE,
                               1 ~ "Black...",
                               2 ~ "Hispanic...",
                               3 ~ "White",
                               4 ~ "Multiracial",
                               5 ~ "Asian...",
                               6 ~ "Native American...",
                               7 ~ "South..."))

As you get used to using pipes, remember you can interpret them as “and then”. So, we could explain the function of the code to ourselves as:

  • Create full_data_pipe by starting with demog data, and then

  • Join with the scales data using StudentIDE as an identifier, and then,

  • Select our key columns, and then

  • Mutate to recode gender and race.

It can be tricky at first to understand what pipes are doing from a conceptual point of view, but it is well worth learning to use them. When your code starts getting longer, they are much more efficient: you write less code, which makes it easier to debug and find errors. You also have fewer objects in your environment, as we created one object instead of three, tidying your workspace.

Error mode

One key difference that can trip people up is that we no longer specify the data object as the first argument in each function. The reason this function - the %>% - is called a pipe is because it ‘pipes’ the data through to the next function. When you wrote the code previously, the first argument of each function was the data set you wanted to work on. When you use pipes, each function automatically takes the data from the previous line of code, so you do not need to specify it again.

For example, if we tried to specify demog again in select(), we would just receive an error.

# Create an object starting with demog 
full_data_pipe <-  demog %>%
  # Join with scales
  inner_join(y = scales,
             by = "StudentIDE") %>%
  # Select key columns
  select(.data = demog,
         StudentIDE, 
         GenderE,
         RaceEthE, 
         AgeE, 
         QANX1E:QINFL7E, 
         QHIND1E:QHIND5E, 
         SPSS1E, SPSS4E:SPSS9E)

Try this

Pipes also work with other functions like filter(), group_by() and summarise(). If you start with the object full_data_mutate, try and express the following instructions in code:

  1. Create a new object age_groups using full_data_mutate as your starting point, and then

  2. Filter to only include “White” and “Black/African American” participants using RaceEthE, and then,

  3. Group the observations by RaceEthE, and then,

  4. Summarise the data to calculate the mean and standard deviation AgeE.

Check your attempt with the solution below when you have tried on your own.

# create age_groups by filtering, grouping, and summarising
age_groups <- full_data_mutate %>% 
  ?

We complete this task in three steps. First, we filter full_data_mutate to just focus on White and Black/African American participants. Second, we group the data by RaceEthE so our summary statistics are split into two groups. Third, we calculate the mean and SD age.

# create age_groups by filtering, grouping, and summarising
age_groups <- full_data_mutate %>% 
  filter(RaceEthE %in% c("White", "Black/African American")) %>% 
  group_by(RaceEthE) %>% 
  summarise(mean_age = mean(AgeE, na.rm = TRUE),
            SD_age = sd(AgeE, na.rm = TRUE))

6.5 Test yourself

To end the chapter, we have some knowledge check questions to test your understanding of the concepts we covered. We then have some error mode tasks to see if you can find the solution to some common errors in those concepts.

6.5.1 Knowledge check

Which function(s) would you use to approach each of the following problems?

Question 1. We have a data set of 400 adults but we want to remove anyone with an age of 50 years or more. To do this, we could use:

Question 2. We are interested in overall summary statistics for our data, such as the mean and total number of observations for a variable. To do this, we could use:

Question 3. Our data set has a column with the number of cats a person has and a column with the number of dogs. We want to calculate a new column which contains the total number of pets each participant has. To do this, we could use:

Question 4. We want to calculate the mean value of a column for several groups in our data set. To do this, we could use:

Question 5. If we wanted to apply the following wrangling steps with pipes, which series of functions would work? With the object wide_data, select several columns and then, pivot three columns longer and then, group by a participant ID and then, calculate the sum of responses.

6.5.2 Error mode

The following questions are designed to introduce you to making and fixing errors. For this topic, we focus on data wrangling using past functions, pivot_longer(), and pipes (%>%). Remember to keep a note of what kind of error messages you receive and how you fixed them, so you have a bank of solutions when you tackle errors independently.

Create and save a new R Markdown file for these activities. Delete the example code, so your file is blank from line 10. Create a new code chunk to load tidyverse and the data files:

# Load the tidyverse package below
library(tidyverse)

# Load the data files
# This should be the Alter_2024_demographics.csv file 
demog <- read_csv("data/Alter_2024_demographics.csv")

# This should be the Alter_2024_scales.csv file 
scales <- read_csv("data/Alter_2024_scales.csv")

Below, we have several variations of a code chunk error or misspecification. Copy and paste them into your R Markdown file below the code chunk to load tidyverse and the data files. Once you have copied the activities, click knit and look at the error message you receive. See if you can fix the error and get it working before checking the answer.

Question 6. Copy the following code chunk into your R Markdown file and press knit. In this code chunk, we want to calculate the mean and SD age of all participants using demog. There are two errors/omissions here to try and fix:

  1. One causes the document not to knit. You should receive an error like Caused by error in "SD()": ! could not find function "SD".

  2. The other knits, but it looks like we just get NA values.

```{r}
# calculate the mean and SD age
demog %>% 
  summarise(mean_age = mean(AgeE),
            SD_age = SD(AgeE))
```

The first error is using the wrong function name for SD. Because we always abbreviate standard deviation to SD, it is tempting to try and use that as the function name. However, the function is lowercase: sd().

The second error is not including the na.rm = TRUE argument. There are NAs in the data, so you either need to address them before running the function or ignore the NAs with na.rm = TRUE.

# calculate the mean and SD age
demog %>% 
  summarise(mean_age = mean(AgeE, na.rm = TRUE),
            SD_age = sd(AgeE, na.rm = TRUE))

Question 7. Copy the following code chunk into your R Markdown file and press knit. We want to calculate the sum of the five quantitative hindrances items per participant. This code… works, but does it look like it fits in the possible 5-25 range?

```{r}
# sum quant hindrances items per participant
sum_quant_hindrance <- scales %>% 
  mutate(sum_quant_hindrance = sum(c(QHIND1E, QHIND2E, QHIND3E, QHIND4E, QHIND5E), na.rm = TRUE))
```

This is the kind of warning we flagged in the Error mode section before introducing pivot_longer(). Intuitively, it is the right idea to try and calculate the sum in a new column. However, in mutate(), it sums all the columns at once, not the observations within each participant.

Instead, we can pivot longer focusing on the quantitative hindrance items, group by participant ID, and summarise.

# sum quant hindrances items per participant
sum_quant_hindrance <- scales %>% 
  # pivot longer on 5 quant hindrances items
  pivot_longer(cols = QHIND1E:QHIND5E,
               names_to = "Question",
               values_to = "Response") %>% 
  # group by student ID
  group_by(StudentIDE) %>% 
  # summarise for sum of new long column 
  summarise(sum_quant_hindrance = sum(Response, na.rm = TRUE))

Question 8. Copy the following code chunk into your R Markdown file and press knit. We want to filter demog to focus on female participants and calculate their mean age. You should receive an error containing Caused by error:! "..1$StudentIDE" must be a logical vector, not a double vector, which is not the most helpful error for diagnosing the problem.

```{r}
# filter for females then calculate mean age
demog %>% 
  filter(.data = demog, 
         GenderE == 1) %>% 
  summarise(.data = demog,
            mean_age = mean(AgeE, na.rm = TRUE))
```

We are using pipes, but we tried adding the .data argument in each line. Remember %>% “pipes” the previous line into the next line, so you do not need to specify an object for it to work with. If you also specify the .data argument, R thinks you are trying to set the next argument in the list.

# filter for females then calculate mean age
demog %>% 
  filter(GenderE == 1) %>% 
  summarise(mean_age = mean(AgeE, na.rm = TRUE))

6.6 Words from this Chapter

Below you will find a list of words that were used in this chapter that might be new to you in case it helps to have somewhere to refer back to what they mean. The links in this table take you to the entry for the words in the PsyTeachR Glossary. Note that the Glossary is written by numerous members of the team and as such may use slightly different terminology from that shown in the chapter.

| term | definition |
|---|---|
| pipe | A way to order your code in a more readable format using the symbol %>% |
| pivot_longer() | Gather data by increasing the number of rows and decreasing the number of columns. |
| pivot_wider() | Spread data by decreasing the number of rows and increasing the number of columns. |
| reverse-code | Having two similar questions, one expressed in a positive way, and another expressed in a negative way. |
| tidy-data | A format for data that maps the meaning onto the structure. |

6.7 End of chapter

Brilliant work again! You applied the data wrangling functions from the past two chapters to a new data set, and added two more concepts to your arsenal: pivots and pipes. There really is no substitute for practicing on new data to transfer your knowledge and understanding. As you work with more and more data, you will see how far these data wrangling functions take you. They will be your foundational skills for any new data set and give you the confidence to search for new functions when there is a new problem to solve.

This is a key milestone, so remember to go over anything you are unsure of. If you have any questions about data wrangling, please post them on Teams, visit the GTA support sessions, or pop into office hours.

At this point, we direct you to the first data analysis journey chapter: Analysis Journey 1: Data Wrangling. This is a bridge between the structured learning in these chapters and your assessments. We present you with a new data set, show you what the end product should look like, and see if you can apply your data wrangling skills to get there. If you get stuck, we have a range of hints and steps you can unhide, then the solution to check your attempts against.

In the next core chapter though, we turn to more advanced data visualisation to demonstrate how to create scatterplots, boxplots, and violin-boxplots.