Data Management

Topics covered in this section are

➤ Recoding variables

➤ Storing recoded variable in the data frame

➤ Recoding continuous variable into a categorical variable

➤ Sorting the data frame

➤ Sorting in descending order

➤ Missing values

➤ Handling missing values in data frames

➤ Recoding variable with missing data

➤ Recoding variable by replacing missing values

➤ Recoding variable by deleting missing values

➤ Infinity and NaN

Data Management

↪ Recoding variables

Recoding is a process of creating a new variable from the existing values of the same or other variables. For example, recoding process is used for creating a categorical variable from a continuous variable, creating a boolean variable based on certain criteria, and handling missing values.

Following example shows how variables from 'fruitsframe' are used for creating new set of variables.

      # Creating 'friutsframe' data frame
      fruits <- c("apple", "banana", "cherry",
                  "dragon fruit", "elderberry")
      jan <- c(10, 3, 12, 5, 8)                  # fruits prices in january
      feb <- c(12.5, 5, 14, 3, 10)               # fruits prices in february
      mar <- c(10.5, 6, 10, 4, 13)               # fruits prices in march
      fruitsframe <- data.frame(fruits, jan, feb, mar, stringsAsFactors = FALSE)
      fruitsframe

      ---Output---
      -        fruits jan  feb  mar
      1         apple  10 12.5 10.5
      2        banana   3  5.0  6.0
      3        cherry  12 14.0 10.0
      4  dragon fruit   5  3.0  4.0
      5    elderberry   8 10.0 13.0

From the data frame above, the mean price can be calculated by adding data from each column.

      mean_price <- (fruitsframe$jan + fruitsframe$feb + fruitsframe$mar )/3
      # OR
      mean_price <- rowMeans(fruitsframe[,2:4], na.rm=TRUE)
      mean_price

      ---Output---
      1] 11.000000  4.666667 12.000000  4.000000 10.333333

Data Management

↪ Storing recoded variable in the data frame

The recoded variable 'mean_price' can be stored in the data frame itself.

      fruitsframe$mean_price <- (fruitsframe$jan + fruitsframe$feb + fruitsframe$mar)/3 
      fruitsframe$mean_price
      fruitsframe

      ---Output---
      -        fruits jan  feb  mar mean_price
      1         apple  10 12.5 10.5  11.000000
      2        banana   3  5.0  6.0   4.666667
      3        cherry  12 14.0 10.0  12.000000
      4  dragon fruit   5  3.0  4.0   4.000000
      5    elderberry   8 10.0 13.0  10.333333

The transform() function achieves the same results.

      fruitsframe <- data.frame(fruits,jan,feb,mar,stringsAsFactors = FALSE)
      fruitsframe <- transform(fruitsframe,
                               mean_price = (jan + feb + mar)/3)
      fruitsframe

      ---Output---
      -        fruits jan  feb  mar mean_price
      1         apple  10 12.5 10.5  11.000000
      2        banana   3  5.0  6.0   4.666667
      3        cherry  12 14.0 10.0  12.000000
      4  dragon fruit   5  3.0  4.0   4.000000
      5    elderberry   8 10.0 13.0  10.333333

Data Management

↪ Recoding continuous variable into a categorical variable

The continuous variable mean_price can be recoded into a categorical variable price_cat with values High, Medium, and Low.

      fruitsframe$price_cat[fruitsframe$mean_price >= 10] <- "High"
      fruitsframe$price_cat[fruitsframe$mean_price > 5 & fruitsframe$mean_price < 10] <- "Medium"
      fruitsframe$price_cat[fruitsframe$mean_price <= 5] <- "Low"
      fruitsframe

      ---Output---
      -        fruits jan  feb  mar mean_price price_cat
      1         apple  10 12.5 10.5  11.000000      High
      2        banana   3  5.0  6.0   4.666667       Low
      3        cherry  12 14.0 10.0  12.000000      High
      4  dragon fruit   5  3.0  4.0   4.000000       Low
      5    elderberry   8 10.0 13.0  10.333333      High

This code can be written using within() function as below.

      fruitsframe <- within(fruitsframe,{
        pricecat2 <- NA
        pricecat2[mean_price >= 10] <- "High"
        pricecat2[mean_price > 5 & mean_price < 10] <- "Medium"
        pricecat2[mean_price <= 5] <- "Low"
      })
      fruitsframe

      ---Output---
      -        fruits jan  feb  mar mean_price price_cat pricecat2
      1         apple  10 12.5 10.5  11.000000      High      High
      2        banana   3  5.0  6.0   4.666667       Low       Low
      3        cherry  12 14.0 10.0  12.000000      High      High
      4  dragon fruit   5  3.0  4.0   4.000000       Low       Low
      5    elderberry   8 10.0 13.0  10.333333      High      High

Data Management

↪ Sorting the data frame

The order() or sort.list() functions can be used for sorting the frame. The following listing sorts fruitsframe by 'price_cat' ascending order.

      fruitsframe[order(fruitsframe$price_cat),]
      fruitsframe[sort.list(fruitsframe$price_cat),]

      ---Output---
       fruitsframe[order(fruitsframe$price_cat),]
      -       fruits jan  feb  mar mean_price price_cat pricecat2
      1        apple  10 12.5 10.5  11.000000      High      High
      3       cherry  12 14.0 10.0  12.000000      High      High
      5   elderberry   8 10.0 13.0  10.333333      High      High
      2       banana   3  5.0  6.0   4.666667       Low       Low
      4 dragon fruit   5  3.0  4.0   4.000000       Low       Low
 
       fruitsframe[sort.list(fruitsframe$price_cat),]
      -       fruits jan  feb  mar mean_price price_cat pricecat2
      1        apple  10 12.5 10.5  11.000000      High      High
      3       cherry  12 14.0 10.0  12.000000      High      High
      5   elderberry   8 10.0 13.0  10.333333      High      High
      2       banana   3  5.0  6.0   4.666667       Low       Low
      4 dragon fruit   5  3.0  4.0   4.000000       Low       Low

Data Management

↪ Sorting in descending order

An argument 'decreasing=TRUE' makes a sort in descending order. The following listing sorts fruitsframe by 'price_cat' in descending order.

      fruitsframe[order(fruitsframe$price_cat, decreasing=TRUE),]
      fruitsframe[sort.list(fruitsframe$price_cat, decreasing=TRUE),]

      ---Output---
      -       fruits jan  feb  mar mean_price price_cat pricecat2
      2       banana   3  5.0  6.0   4.666667       Low       Low
      4 dragon fruit   5  3.0  4.0   4.000000       Low       Low
      1        apple  10 12.5 10.5  11.000000      High      High
      3       cherry  12 14.0 10.0  12.000000      High      High
      5   elderberry   8 10.0 13.0  10.333333      High      High

Data Management

↪ Missing values

Missing values in R are represented by the symbol NA (not available) for both character and numeric data. The function is.na() is used for testing the presence of missing values in a data set. For example,

      x <- c(10, 20, NA, 30)
      is.na(x)  # returns c(FALSE, FALSE, TRUE, FALSE)

The is.na() function works on an object and it returns an object of the same size with the entries replaced by TRUE for the missing values or FALSE otherwise.

Missing values are considered non-comparable, even to themselves. This means comparison operators cannot be used to test the missing values. For example, the test x == NA always returns FALSE.

Missing values reduce the statistical power of the analysis and an analysis based on missing values will result in a biased statistical model. Missing values are errors resulting from the data collection process; for example, respondents not answering some questions.

Reason for the missing values is important to consider because this helps to determine the type of missing values and what to do about it. Ensure that any missing values are properly coded as missing before analyzing the data, otherwise the results could be meaningless. Arithmetic expressions and functions that contain missing values yield missing values.

Take an example,

      x <- c(10, NA, 30, 40)
      sum1 <- x[1] + x[2] + x[3] + x[4]
      sum1                               # Output NA
      sum2 <- sum(x) 
      sum2                               # Output NA

Both sum1 and sum2 will be NA because the vector x has a missing value.

In R, most numeric functions take an option 'na.rm=TRUE' that removes missing values prior to calculations and applies the function to the remaining values:

      x <- c(10, NA, 30, 40)
      sum3 <- sum(x, na.rm=TRUE)        # Output 80

Data Management

↪ Handling missing values in data frames

Example below shows how missing values can be handled in a Data Frame.

      fruits <- c("apple", "banana", "cherry", "dragon fruit", 
                  "elderberry", "figs", "grapes")
      jan <- c(10, 3, 12, 5, 8, NA, 20)                # price of fruits in january
      feb <- c(12.5, 5, 14,NA, 10, 18, NA)             # price of fruits in february
      mar <- c(NA, 6, 10, 4, 13, 9, 11)                # price of fruits in march
      fruitsframe <- data.frame(fruits, jan, feb, mar,
                                stringsAsFactors = FALSE)
      fruitsframe

      ---Output---
      -       fruits jan  feb mar
      1        apple  10 12.5  NA
      2       banana   3  5.0   6
      3       cherry  12 14.0  10
      4 dragon fruit   5   NA   4
      5   elderberry   8 10.0  13
      6         figs  NA 18.0   9
      7       grapes  20   NA  11

The is.na() function on 'fruitsframe' displays the missing values with the value TRUE as shown below.

      is.na(fruitsframe)

      ---Output---
        -  fruits   jan   feb   mar
      [1,]  FALSE FALSE FALSE  TRUE
      [2,]  FALSE FALSE FALSE FALSE
      [3,]  FALSE FALSE FALSE FALSE
      [4,]  FALSE FALSE  TRUE FALSE
      [5,]  FALSE FALSE FALSE FALSE
      [6,]  FALSE  TRUE FALSE FALSE
      [7,]  FALSE FALSE  TRUE FALSE

Data Management

↪ Recoding variable with missing data

From the 'fruitsname' data frame, create a new variable 'mean_price' as shown below.

      fruitsframe$mean_price <- (as.integer(fruitsframe$jan) +
                       as.integer(fruitsframe$feb) +
                       as.integer(fruitsframe$mar))/3
      fruitsframe

      ---Output---
      -       fruits jan  feb mar mean_price
      1        apple  10 12.5  NA         NA
      2       banana   3  5.0   6   4.666667
      3       cherry  12 14.0  10  12.000000
      4 dragon fruit   5   NA   4         NA
      5   elderberry   8 10.0  13  10.333333
      6         figs  NA 18.0   9         NA
      7       grapes  20   NA  11         NA

The 'mean_price' is NA when the calculation involves a missing value. These missing values can be removed by passing na.rm=TRUE argument to rowMeans() function.

      fruitsframe$mean_price_na_rm <- rowMeans(fruitsframe[,2:4], na.rm=TRUE)
      fruitsframe

      ---Output---
      -       fruits jan  feb mar mean_price mean_price_na_rm
      1        apple  10 12.5  NA         NA        11.250000
      2       banana   3  5.0   6   4.666667         4.666667
      3       cherry  12 14.0  10  12.000000        12.000000
      4 dragon fruit   5   NA   4         NA         4.500000
      5   elderberry   8 10.0  13  10.333333        10.333333
      6         figs  NA 18.0   9         NA        13.500000
      7       grapes  20   NA  11         NA        15.500000

The price for Apple is missing in March. The mean_price calculation for Apple was based on 2 months of data while the same calculation for Banana was based on 3 months of data. Is this what you want?

Data Management

↪ Recoding variable by replacing missing values

The variable[condition] <- value statement will make the assignment when the condition is TRUE. For example, replace missing values with '0' and calculate the mean_price again.

      fruitsframe$jan[is.na(fruitsframe$jan)] <- 0
      fruitsframe$feb[is.na(fruitsframe$feb)] <- 0
      fruitsframe$mar[is.na(fruitsframe$mar)] <- 0
      fruitsframe$mean_price_na_rm_0 <- rowMeans(fruitsframe[,2:4], na.rm=TRUE)
      fruitsframe

      ---Output---
      -       fruits jan  feb mar mean_price mean_price_na_rm mean_price_na_rm_0
      1        apple  10 12.5   0         NA        11.250000           7.500000
      2       banana   3  5.0   6   4.666667         4.666667           4.666667
      3       cherry  12 14.0  10  12.000000        12.000000          12.000000
      4 dragon fruit   5  0.0   4         NA         4.500000           3.000000
      5   elderberry   8 10.0  13  10.333333        10.333333          10.333333
      6         figs   0 18.0   9         NA        13.500000           9.000000
      7       grapes  20  0.0  11         NA        15.500000          10.333333

Now, observe that the mean_price calculation for Apple is based on 3 months of data, but with the price set to '0'. Is it realistic?

The above examples show how missing values weaken the power of statistical analysis.

Data Management

↪ Recoding variable by deleting missing values

The na.omit() function deletes any rows with missing values. When missing values are omitted from the analysis, only those observations with complete records are included in the analysis.

      fruits <- c("apple", "banana", "cherry", "dragon fruit", 
                  "elderberry", "figs", "grapes")
      jan <- c(10, 3, 12, 5, 8, NA, 20)                # price of fruits in january
      feb <- c(12.5, 5, 14, NA, 10, 18, NA)             # price of fruits in february
      mar <- c(NA, 6, 10, 4, 13, 9, 11)                # price of fruits in march
      fruitsframe <- data.frame(fruits, jan, feb, mar,
                                stringsAsFactors = FALSE)
      fruitsframe

      ---Output---
      -       fruits jan  feb mar
      1        apple  10 12.5  NA
      2       banana   3  5.0   6
      3       cherry  12 14.0  10
      4 dragon fruit   5   NA   4
      5   elderberry   8 10.0  13
      6         figs  NA 18.0   9
      7       grapes  20   NA  11

      newfruitsname <- na.omit(fruitsframe)
      newfruitsname

      ---Output---
      -     fruits jan feb mar
      2     banana   3   5   6
      3     cherry  12  14  10
      5 elderberry   8  10  13

The above approach may result in biased results when the data set is small or when there is a significant number of missing values.

Data Management

↪ Infinity and NaN

Infinite or impossible values are not the same as missing values. the symbols 'Inf' and '–Inf' represent positive and negative infinity respectively. The function is.infinite() can be used for testing infinity.

The symbol 'NaN' represents an impossible number, sin(Inf) for example. The is.nan() function is used for testing NaN.

Data Management

↪ Summary

Recoding is a process of creating a new variable from the existing values of the same or other variables.
Continuous variables, as well as categorical variables, can be created from existing values.
Missing values reduce the statistical power of the analysis and an analysis based on missing values will result in a biased statistical model.
Reason for the missing values is important to consider and the approach taken to deal with missing data directly influences the statistical model.
The na.rm=TRUE parameter, is.na(), and na.omit() functions are used for recoding the missing data.
Infinity and NaN are not the same as missing values.