Data Management
Topics covered in this section are
➤ Recoding variables
➤ Storing recoded variable in the data frame
➤ Recoding continuous variable into a categorical variable
➤ Sorting the data frame
➤ Sorting in descending order
➤ Missing values
➤ Handling missing values in data frames
➤ Recoding variable with missing data
➤ Recoding variable by replacing missing values
➤ Recoding variable by deleting missing values
➤ Infinity and NaN
Data Management
↪ Recoding variables
Recoding is a process of creating a new variable from the existing values of the same or other variables. For example, recoding process is used for creating a categorical variable from a continuous variable, creating a boolean variable based on certain criteria, and handling missing values.
Following example shows how variables from 'fruitsframe' are used for creating new set of variables.
# Creating 'friutsframe' data frame fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry") jan <- c(10, 3, 12, 5, 8) # fruits prices in january feb <- c(12.5, 5, 14, 3, 10) # fruits prices in february mar <- c(10.5, 6, 10, 4, 13) # fruits prices in march fruitsframe <- data.frame(fruits, jan, feb, mar, stringsAsFactors = FALSE) fruitsframe
---Output--- - fruits jan feb mar 1 apple 10 12.5 10.5 2 banana 3 5.0 6.0 3 cherry 12 14.0 10.0 4 dragon fruit 5 3.0 4.0 5 elderberry 8 10.0 13.0
From the data frame above, the mean price can be calculated by adding data from each column.
mean_price <- (fruitsframe$jan + fruitsframe$feb + fruitsframe$mar )/3 # OR mean_price <- rowMeans(fruitsframe[,2:4], na.rm=TRUE) mean_price
---Output--- 1] 11.000000 4.666667 12.000000 4.000000 10.333333
Data Management
↪ Storing recoded variable in the data frame
The recoded variable 'mean_price' can be stored in the data frame itself.
fruitsframe$mean_price <- (fruitsframe$jan + fruitsframe$feb + fruitsframe$mar)/3 fruitsframe$mean_price fruitsframe
---Output--- - fruits jan feb mar mean_price 1 apple 10 12.5 10.5 11.000000 2 banana 3 5.0 6.0 4.666667 3 cherry 12 14.0 10.0 12.000000 4 dragon fruit 5 3.0 4.0 4.000000 5 elderberry 8 10.0 13.0 10.333333
The transform() function achieves the same results.
fruitsframe <- data.frame(fruits,jan,feb,mar,stringsAsFactors = FALSE) fruitsframe <- transform(fruitsframe, mean_price = (jan + feb + mar)/3) fruitsframe
---Output--- - fruits jan feb mar mean_price 1 apple 10 12.5 10.5 11.000000 2 banana 3 5.0 6.0 4.666667 3 cherry 12 14.0 10.0 12.000000 4 dragon fruit 5 3.0 4.0 4.000000 5 elderberry 8 10.0 13.0 10.333333
Data Management
↪ Recoding continuous variable into a categorical variable
The continuous variable mean_price can be recoded into a categorical variable price_cat with values High, Medium, and Low.
fruitsframe$price_cat[fruitsframe$mean_price >= 10] <- "High" fruitsframe$price_cat[fruitsframe$mean_price > 5 & fruitsframe$mean_price < 10] <- "Medium" fruitsframe$price_cat[fruitsframe$mean_price <= 5] <- "Low" fruitsframe
---Output--- - fruits jan feb mar mean_price price_cat 1 apple 10 12.5 10.5 11.000000 High 2 banana 3 5.0 6.0 4.666667 Low 3 cherry 12 14.0 10.0 12.000000 High 4 dragon fruit 5 3.0 4.0 4.000000 Low 5 elderberry 8 10.0 13.0 10.333333 High
This code can be written using within() function as below.
fruitsframe <- within(fruitsframe,{ pricecat2 <- NA pricecat2[mean_price >= 10] <- "High" pricecat2[mean_price > 5 & mean_price < 10] <- "Medium" pricecat2[mean_price <= 5] <- "Low" }) fruitsframe
---Output--- - fruits jan feb mar mean_price price_cat pricecat2 1 apple 10 12.5 10.5 11.000000 High High 2 banana 3 5.0 6.0 4.666667 Low Low 3 cherry 12 14.0 10.0 12.000000 High High 4 dragon fruit 5 3.0 4.0 4.000000 Low Low 5 elderberry 8 10.0 13.0 10.333333 High High
Data Management
↪ Sorting the data frame
The order() or sort.list() functions can be used for sorting the frame. The following listing sorts fruitsframe by 'price_cat' ascending order.
fruitsframe[order(fruitsframe$price_cat),] fruitsframe[sort.list(fruitsframe$price_cat),]
---Output--- fruitsframe[order(fruitsframe$price_cat),] - fruits jan feb mar mean_price price_cat pricecat2 1 apple 10 12.5 10.5 11.000000 High High 3 cherry 12 14.0 10.0 12.000000 High High 5 elderberry 8 10.0 13.0 10.333333 High High 2 banana 3 5.0 6.0 4.666667 Low Low 4 dragon fruit 5 3.0 4.0 4.000000 Low Low fruitsframe[sort.list(fruitsframe$price_cat),] - fruits jan feb mar mean_price price_cat pricecat2 1 apple 10 12.5 10.5 11.000000 High High 3 cherry 12 14.0 10.0 12.000000 High High 5 elderberry 8 10.0 13.0 10.333333 High High 2 banana 3 5.0 6.0 4.666667 Low Low 4 dragon fruit 5 3.0 4.0 4.000000 Low Low
Data Management
↪ Sorting in descending order
An argument 'decreasing=TRUE' makes a sort in descending order. The following listing sorts fruitsframe by 'price_cat' in descending order.
fruitsframe[order(fruitsframe$price_cat, decreasing=TRUE),] fruitsframe[sort.list(fruitsframe$price_cat, decreasing=TRUE),]
---Output--- - fruits jan feb mar mean_price price_cat pricecat2 2 banana 3 5.0 6.0 4.666667 Low Low 4 dragon fruit 5 3.0 4.0 4.000000 Low Low 1 apple 10 12.5 10.5 11.000000 High High 3 cherry 12 14.0 10.0 12.000000 High High 5 elderberry 8 10.0 13.0 10.333333 High High
Data Management
↪ Missing values
Missing values in R are represented by the symbol NA (not available) for both character and numeric data. The function is.na() is used for testing the presence of missing values in a data set. For example,
x <- c(10, 20, NA, 30) is.na(x) # returns c(FALSE, FALSE, TRUE, FALSE)
The is.na() function works on an object and it returns an object of the same size with the entries replaced by TRUE for the missing values or FALSE otherwise.
Missing values are considered non-comparable, even to themselves. This means comparison operators cannot be used to test the missing values. For example, the test x == NA always returns FALSE.
Missing values reduce the statistical power of the analysis and an analysis based on missing values will result in a biased statistical model. Missing values are errors resulting from the data collection process; for example, respondents not answering some questions.
Reason for the missing values is important to consider because this helps to determine the type of missing values and what to do about it. Ensure that any missing values are properly coded as missing before analyzing the data, otherwise the results could be meaningless. Arithmetic expressions and functions that contain missing values yield missing values.
Take an example,
x <- c(10, NA, 30, 40) sum1 <- x[1] + x[2] + x[3] + x[4] sum1 # Output NA sum2 <- sum(x) sum2 # Output NABoth sum1 and sum2 will be NA because the vector x has a missing value.
In R, most numeric functions take an option 'na.rm=TRUE' that removes missing values prior to calculations and applies the function to the remaining values:
x <- c(10, NA, 30, 40) sum3 <- sum(x, na.rm=TRUE) # Output 80
Data Management
↪ Handling missing values in data frames
Example below shows how missing values can be handled in a Data Frame.
fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", "figs", "grapes") jan <- c(10, 3, 12, 5, 8, NA, 20) # price of fruits in january feb <- c(12.5, 5, 14,NA, 10, 18, NA) # price of fruits in february mar <- c(NA, 6, 10, 4, 13, 9, 11) # price of fruits in march fruitsframe <- data.frame(fruits, jan, feb, mar, stringsAsFactors = FALSE) fruitsframe
---Output--- - fruits jan feb mar 1 apple 10 12.5 NA 2 banana 3 5.0 6 3 cherry 12 14.0 10 4 dragon fruit 5 NA 4 5 elderberry 8 10.0 13 6 figs NA 18.0 9 7 grapes 20 NA 11
The is.na() function on 'fruitsframe' displays the missing values with the value TRUE as shown below.
is.na(fruitsframe)
---Output--- - fruits jan feb mar [1,] FALSE FALSE FALSE TRUE [2,] FALSE FALSE FALSE FALSE [3,] FALSE FALSE FALSE FALSE [4,] FALSE FALSE TRUE FALSE [5,] FALSE FALSE FALSE FALSE [6,] FALSE TRUE FALSE FALSE [7,] FALSE FALSE TRUE FALSE
Data Management
↪ Recoding variable with missing data
From the 'fruitsname' data frame, create a new variable 'mean_price' as shown below.
fruitsframe$mean_price <- (as.integer(fruitsframe$jan) + as.integer(fruitsframe$feb) + as.integer(fruitsframe$mar))/3 fruitsframe
---Output--- - fruits jan feb mar mean_price 1 apple 10 12.5 NA NA 2 banana 3 5.0 6 4.666667 3 cherry 12 14.0 10 12.000000 4 dragon fruit 5 NA 4 NA 5 elderberry 8 10.0 13 10.333333 6 figs NA 18.0 9 NA 7 grapes 20 NA 11 NA
The 'mean_price' is NA when the calculation involves a missing value. These missing values can be removed by passing na.rm=TRUE argument to rowMeans() function.
fruitsframe$mean_price_na_rm <- rowMeans(fruitsframe[,2:4], na.rm=TRUE) fruitsframe
---Output--- - fruits jan feb mar mean_price mean_price_na_rm 1 apple 10 12.5 NA NA 11.250000 2 banana 3 5.0 6 4.666667 4.666667 3 cherry 12 14.0 10 12.000000 12.000000 4 dragon fruit 5 NA 4 NA 4.500000 5 elderberry 8 10.0 13 10.333333 10.333333 6 figs NA 18.0 9 NA 13.500000 7 grapes 20 NA 11 NA 15.500000
The price for Apple is missing in March. The mean_price calculation for Apple was based on 2 months of data while the same calculation for Banana was based on 3 months of data. Is this what you want?
Data Management
↪ Recoding variable by replacing missing values
The variable[condition] <- value statement will make the assignment when the condition is TRUE. For example, replace missing values with '0' and calculate the mean_price again.
fruitsframe$jan[is.na(fruitsframe$jan)] <- 0 fruitsframe$feb[is.na(fruitsframe$feb)] <- 0 fruitsframe$mar[is.na(fruitsframe$mar)] <- 0 fruitsframe$mean_price_na_rm_0 <- rowMeans(fruitsframe[,2:4], na.rm=TRUE) fruitsframeNow, observe that the mean_price calculation for Apple is based on 3 months of data, but with the price set to '0'. Is it realistic?
---Output--- - fruits jan feb mar mean_price mean_price_na_rm mean_price_na_rm_0 1 apple 10 12.5 0 NA 11.250000 7.500000 2 banana 3 5.0 6 4.666667 4.666667 4.666667 3 cherry 12 14.0 10 12.000000 12.000000 12.000000 4 dragon fruit 5 0.0 4 NA 4.500000 3.000000 5 elderberry 8 10.0 13 10.333333 10.333333 10.333333 6 figs 0 18.0 9 NA 13.500000 9.000000 7 grapes 20 0.0 11 NA 15.500000 10.333333
The above examples show how missing values weaken the power of statistical analysis.
Data Management
↪ Recoding variable by deleting missing values
The na.omit() function deletes any rows with missing values. When missing values are omitted from the analysis, only those observations with complete records are included in the analysis.
fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", "figs", "grapes") jan <- c(10, 3, 12, 5, 8, NA, 20) # price of fruits in january feb <- c(12.5, 5, 14, NA, 10, 18, NA) # price of fruits in february mar <- c(NA, 6, 10, 4, 13, 9, 11) # price of fruits in march fruitsframe <- data.frame(fruits, jan, feb, mar, stringsAsFactors = FALSE) fruitsframe
---Output--- - fruits jan feb mar 1 apple 10 12.5 NA 2 banana 3 5.0 6 3 cherry 12 14.0 10 4 dragon fruit 5 NA 4 5 elderberry 8 10.0 13 6 figs NA 18.0 9 7 grapes 20 NA 11
newfruitsname <- na.omit(fruitsframe) newfruitsname
---Output--- - fruits jan feb mar 2 banana 3 5 6 3 cherry 12 14 10 5 elderberry 8 10 13
The above approach may result in biased results when the data set is small or when there is a significant number of missing values.
Data Management
↪ Infinity and NaN
Infinite or impossible values are not the same as missing values. the symbols 'Inf' and '–Inf' represent positive and negative infinity respectively. The function is.infinite() can be used for testing infinity.
The symbol 'NaN' represents an impossible number, sin(Inf) for example. The is.nan() function is used for testing NaN.
Data Management
↪ Summary
- Recoding is a process of creating a new variable from the existing values of the same or other variables.
- Continuous variables, as well as categorical variables, can be created from existing values.
- Missing values reduce the statistical power of the analysis and an analysis based on missing values will result in a biased statistical model.
- Reason for the missing values is important to consider and the approach taken to deal with missing data directly influences the statistical model.
- The na.rm=TRUE parameter, is.na(), and na.omit() functions are used for recoding the missing data.
- Infinity and NaN are not the same as missing values.