4 Data Wrangling
In this part, we will explore more packages and functions within the tidyr
, dplyr
, and stringr
which cover the most common data wrangling tasks you will need in almost every project. Alongside data visualisation, the ability to wrangle data reproducibly is one of the best demonstrations of the value of R. Many of the tasks are either not possible in software like SPSS or very difficult to perform reproducibly. As you develop your R skills, the potential to complete the whole wrangling, visualisation, and analysis pipeline will really help your workflow.
4.1 Intended Learning Outcomes
By the end of this part, you will be able to:
Join two data sets by matching one or more identifying columns in common.
Select a range and reorder variables in your data set.
Modify or create variables, such as recoding values or creating new groups.
Filter observations to retain a subset of your data.
Summarise your data to calculate summary statistics.
Restructure data into different formats, such as long and wide form.
String together multiple functions using pipes.
4.2 Key chapters
4.3 Target Dataset
For this chapter, we are using open data from Alter et al. (2024). The abstract of their article is:
The biggest difference in statistical training from previous decades is the increased use of software. However, little research examines how software impacts learning statistics. Assessing the value of software to statistical learning demands appropriate, valid, and reliable measures. The present study expands the arsenal of tools by reporting on the psychometric properties of the Value of Software to Statistical Learning (VSSL) scale in an undergraduate student sample. We propose a brief measure with strong psychometric support to assess students’ perceived value of software in an educational setting. We provide data from a course using SPSS, given its wide use and popularity in the social sciences. However, the VSSL is adaptable to any statistical software, and we provide instructions for customizing it to suit alternative packages. Recommendations for administering, scoring, and interpreting the VSSL are provided to aid statistics instructors and education researchers understand how software influences students’ statistical learning.
To summarise, they developed a new scale to measure students’ perceived value of software to learning statistics - Value of Software to Statistical Learning (VSSL). The authors wanted to develop this scale in a way that could be adapted to different software, from SPSS in their article (which some of you may have used in the past), to perhaps R in future. Alongside data from their new scale, they collected data from other scales measuring a similar kind of construct (e.g., Students’ Attitudes toward Statistics and Technology) and related constructs (e.g., Quantitative Attitudes).
The data from Alter et al. is split into two data files. In Alter_2024_demographics.csv
, we have the participant ID (StudentIDE
) and several demographic variables. The columns (variables) we have in the data set are:
Variable | Description |
---|---|
StudentIDE | Participant number |
GenderE | Gender: 1 = Female, 2 = Male, 3 = Non-Binary |
RaceEthE | Race: 1 = Black/African American, 2 = Hispanic/Other Latinx, 3 = White, 4 = Multiracial, 5 = Asian/Pacific Islander, 6 = Native American/Alaska Native, 7 = South/Central American |
GradeE | Expected grade: 1 = A, 2 = B, 3 = C, 4 = D, 5 = F |
StuStaE | Student status: 1 = Freshman, 2 = Sophomore, 3 = Junior, 4 = Senior or Higher |
GPAE | Expected Grade Point Average (GPA) |
MajorE | Degree major |
AgeE | Age in years |
In Alter_2024_scales.csv
, we then have the participant ID (StudentIDE
) and all the individual scale items. The columns (variables) we have in the data set are:
Variable | Description |
---|---|
StudentIDE | Participant number |
MA1E to MA8E | Enjoyment of Mathematics and statistics, not analysed in this study. |
QANX1E to QANX4E | Quantitative anxiety: four items scored on a 5-point Likert scale ranging from 1 (Not at all Anxious) to 5 (Extremely Anxious) |
QINFL1E to QINFL7E | Quantitative attitudes: seven items scored on a 5-point Likert scale ranging from 1 (Strongly Disagree) to 5 (Strongly Agree) |
QSF1E to QSF4E | Study motivation, not analysed in this study. |
QHIND1E to QHIND5E | Quantitative hindrances: five items scored on a 5-point Likert scale ranging from 1 (Strongly Disagree) to 5 (Strongly Agree) |
QSC1E to QSC4E | Mathematical self-efficacy, not analysed in this study. |
QSE1E to QSE6E | Mathematical ability, not analysed in this study. |
SPSS1E to SPSS10E | VSSL scale on SPSS: 10 items scored on a 5-point Likert scale ranging from 1 (Never True) to 5 (Always True) |
We will use this data to demonstrate a range of key functions for wrangling data into different formats ready for analysis.