Data Management

Topics covered in this section are

  ➤ Recoding variables

  ➤ Storing recoded variable in the data frame

  ➤ Recoding continuous variable into a categorical variable

  ➤ Sorting the data frame

  ➤ Sorting in descending order

  ➤ Missing values

  ➤ Handling missing values in data frames

  ➤ Recoding variable with missing data

  ➤ Recoding variable by replacing missing values

  ➤ Recoding variable by deleting missing values

  ➤ Infinity and NaN

Data Management

↪ Recoding variables

Recoding is a process of creating a new variable from the existing values of the same or other variables. For example, recoding process is used for creating a categorical variable from a continuous variable, creating a boolean variable based on certain criteria, and handling missing values.

Following example shows how variables from 'fruitsframe' are used for creating new set of variables.

      # Creating 'friutsframe' data frame
      fruits <- c("apple", "banana", "cherry",
                  "dragon fruit", "elderberry")
      jan <- c(10, 3, 12, 5, 8)                  # fruits prices in january
      feb <- c(12.5, 5, 14, 3, 10)               # fruits prices in february
      mar <- c(10.5, 6, 10, 4, 13)               # fruits prices in march
      fruitsframe <- data.frame(fruits, jan, feb, mar, stringsAsFactors = FALSE)
      fruitsframe

      ---Output---       - fruits jan feb mar       1 apple 10 12.5 10.5       2 banana 3 5.0 6.0       3 cherry 12 14.0 10.0       4 dragon fruit 5 3.0 4.0       5 elderberry 8 10.0 13.0

From the data frame above, the mean price can be calculated by adding data from each column.

      mean_price <- (fruitsframe$jan + fruitsframe$feb + fruitsframe$mar )/3
      # OR
      mean_price <- rowMeans(fruitsframe[,2:4], na.rm=TRUE)
      mean_price

      ---Output---       1] 11.000000 4.666667 12.000000 4.000000 10.333333

Data Management

↪ Storing recoded variable in the data frame

The recoded variable 'mean_price' can be stored in the data frame itself.

      fruitsframe$mean_price <- (fruitsframe$jan + fruitsframe$feb + fruitsframe$mar)/3 
      fruitsframe$mean_price
      fruitsframe

      ---Output---       - fruits jan feb mar mean_price       1 apple 10 12.5 10.5 11.000000       2 banana 3 5.0 6.0 4.666667       3 cherry 12 14.0 10.0 12.000000       4 dragon fruit 5 3.0 4.0 4.000000       5 elderberry 8 10.0 13.0 10.333333

The transform() function achieves the same results.

      fruitsframe <- data.frame(fruits,jan,feb,mar,stringsAsFactors = FALSE)
      fruitsframe <- transform(fruitsframe,
                               mean_price = (jan + feb + mar)/3)
      fruitsframe

      ---Output---       - fruits jan feb mar mean_price       1 apple 10 12.5 10.5 11.000000       2 banana 3 5.0 6.0 4.666667       3 cherry 12 14.0 10.0 12.000000       4 dragon fruit 5 3.0 4.0 4.000000       5 elderberry 8 10.0 13.0 10.333333

Data Management

↪ Recoding continuous variable into a categorical variable

The continuous variable mean_price can be recoded into a categorical variable price_cat with values High, Medium, and Low.

      fruitsframe$price_cat[fruitsframe$mean_price >= 10] <- "High"
      fruitsframe$price_cat[fruitsframe$mean_price > 5 & fruitsframe$mean_price < 10] <- "Medium"
      fruitsframe$price_cat[fruitsframe$mean_price <= 5] <- "Low"
      fruitsframe

      ---Output---       - fruits jan feb mar mean_price price_cat       1 apple 10 12.5 10.5 11.000000 High       2 banana 3 5.0 6.0 4.666667 Low       3 cherry 12 14.0 10.0 12.000000 High       4 dragon fruit 5 3.0 4.0 4.000000 Low       5 elderberry 8 10.0 13.0 10.333333 High

This code can be written using within() function as below.

      fruitsframe <- within(fruitsframe,{
        pricecat2 <- NA
        pricecat2[mean_price >= 10] <- "High"
        pricecat2[mean_price > 5 & mean_price < 10] <- "Medium"
        pricecat2[mean_price <= 5] <- "Low"
      })
      fruitsframe

      ---Output---       - fruits jan feb mar mean_price price_cat pricecat2       1 apple 10 12.5 10.5 11.000000 High High       2 banana 3 5.0 6.0 4.666667 Low Low       3 cherry 12 14.0 10.0 12.000000 High High       4 dragon fruit 5 3.0 4.0 4.000000 Low Low       5 elderberry 8 10.0 13.0 10.333333 High High

Data Management

↪ Sorting the data frame

The order() or sort.list() functions can be used for sorting the frame. The following listing sorts fruitsframe by 'price_cat' ascending order.

      fruitsframe[order(fruitsframe$price_cat),]
      fruitsframe[sort.list(fruitsframe$price_cat),]

      ---Output---        fruitsframe[order(fruitsframe$price_cat),]       - fruits jan feb mar mean_price price_cat pricecat2       1 apple 10 12.5 10.5 11.000000 High High       3 cherry 12 14.0 10.0 12.000000 High High       5 elderberry 8 10.0 13.0 10.333333 High High       2 banana 3 5.0 6.0 4.666667 Low Low       4 dragon fruit 5 3.0 4.0 4.000000 Low Low          fruitsframe[sort.list(fruitsframe$price_cat),]       - fruits jan feb mar mean_price price_cat pricecat2       1 apple 10 12.5 10.5 11.000000 High High       3 cherry 12 14.0 10.0 12.000000 High High       5 elderberry 8 10.0 13.0 10.333333 High High       2 banana 3 5.0 6.0 4.666667 Low Low       4 dragon fruit 5 3.0 4.0 4.000000 Low Low

Data Management

↪ Sorting in descending order

An argument 'decreasing=TRUE' makes a sort in descending order. The following listing sorts fruitsframe by 'price_cat' in descending order.

      fruitsframe[order(fruitsframe$price_cat, decreasing=TRUE),]
      fruitsframe[sort.list(fruitsframe$price_cat, decreasing=TRUE),]

      ---Output---       - fruits jan feb mar mean_price price_cat pricecat2       2 banana 3 5.0 6.0 4.666667 Low Low       4 dragon fruit 5 3.0 4.0 4.000000 Low Low       1 apple 10 12.5 10.5 11.000000 High High       3 cherry 12 14.0 10.0 12.000000 High High       5 elderberry 8 10.0 13.0 10.333333 High High

Data Management

↪ Missing values

Missing values in R are represented by the symbol NA (not available) for both character and numeric data. The function is.na() is used for testing the presence of missing values in a data set. For example,

      x <- c(10, 20, NA, 30)
      is.na(x)  # returns c(FALSE, FALSE, TRUE, FALSE)

The is.na() function works on an object and it returns an object of the same size with the entries replaced by TRUE for the missing values or FALSE otherwise.

Missing values are considered non-comparable, even to themselves. This means comparison operators cannot be used to test the missing values. For example, the test x == NA always returns FALSE.

Missing values reduce the statistical power of the analysis and an analysis based on missing values will result in a biased statistical model. Missing values are errors resulting from the data collection process; for example, respondents not answering some questions.

Reason for the missing values is important to consider because this helps to determine the type of missing values and what to do about it. Ensure that any missing values are properly coded as missing before analyzing the data, otherwise the results could be meaningless. Arithmetic expressions and functions that contain missing values yield missing values.

Take an example,

      x <- c(10, NA, 30, 40)
      sum1 <- x[1] + x[2] + x[3] + x[4]
      sum1                               # Output NA
      sum2 <- sum(x) 
      sum2                               # Output NA

Both sum1 and sum2 will be NA because the vector x has a missing value.

In R, most numeric functions take an option 'na.rm=TRUE' that removes missing values prior to calculations and applies the function to the remaining values:

      x <- c(10, NA, 30, 40)
      sum3 <- sum(x, na.rm=TRUE)        # Output 80

Data Management

↪ Handling missing values in data frames

Example below shows how missing values can be handled in a Data Frame.

      fruits <- c("apple", "banana", "cherry", "dragon fruit", 
                  "elderberry", "figs", "grapes")
      jan <- c(10, 3, 12, 5, 8, NA, 20)                # price of fruits in january
      feb <- c(12.5, 5, 14,NA, 10, 18, NA)             # price of fruits in february
      mar <- c(NA, 6, 10, 4, 13, 9, 11)                # price of fruits in march
      fruitsframe <- data.frame(fruits, jan, feb, mar,
                                stringsAsFactors = FALSE)
      fruitsframe

      ---Output---       - fruits jan feb mar       1 apple 10 12.5 NA       2 banana 3 5.0 6       3 cherry 12 14.0 10       4 dragon fruit 5 NA 4       5 elderberry 8 10.0 13       6 figs NA 18.0 9       7 grapes 20 NA 11

The is.na() function on 'fruitsframe' displays the missing values with the value TRUE as shown below.

      is.na(fruitsframe)

      ---Output---        - fruits jan feb mar       [1,] FALSE FALSE FALSE TRUE       [2,] FALSE FALSE FALSE FALSE       [3,] FALSE FALSE FALSE FALSE       [4,] FALSE FALSE TRUE FALSE       [5,] FALSE FALSE FALSE FALSE       [6,] FALSE TRUE FALSE FALSE       [7,] FALSE FALSE TRUE FALSE

Data Management

↪ Recoding variable with missing data

From the 'fruitsname' data frame, create a new variable 'mean_price' as shown below.

      fruitsframe$mean_price <- (as.integer(fruitsframe$jan) +
                       as.integer(fruitsframe$feb) +
                       as.integer(fruitsframe$mar))/3
      fruitsframe

      ---Output---       - fruits jan feb mar mean_price       1 apple 10 12.5 NA NA       2 banana 3 5.0 6 4.666667       3 cherry 12 14.0 10 12.000000       4 dragon fruit 5 NA 4 NA       5 elderberry 8 10.0 13 10.333333       6 figs NA 18.0 9 NA       7 grapes 20 NA 11 NA

The 'mean_price' is NA when the calculation involves a missing value. These missing values can be removed by passing na.rm=TRUE argument to rowMeans() function.

      fruitsframe$mean_price_na_rm <- rowMeans(fruitsframe[,2:4], na.rm=TRUE)
      fruitsframe

      ---Output---       - fruits jan feb mar mean_price mean_price_na_rm       1 apple 10 12.5 NA NA 11.250000       2 banana 3 5.0 6 4.666667 4.666667       3 cherry 12 14.0 10 12.000000 12.000000       4 dragon fruit 5 NA 4 NA 4.500000       5 elderberry 8 10.0 13 10.333333 10.333333       6 figs NA 18.0 9 NA 13.500000       7 grapes 20 NA 11 NA 15.500000

The price for Apple is missing in March. The mean_price calculation for Apple was based on 2 months of data while the same calculation for Banana was based on 3 months of data. Is this what you want?

Data Management

↪ Recoding variable by replacing missing values

The variable[condition] <- value statement will make the assignment when the condition is TRUE. For example, replace missing values with '0' and calculate the mean_price again.

      fruitsframe$jan[is.na(fruitsframe$jan)] <- 0
      fruitsframe$feb[is.na(fruitsframe$feb)] <- 0
      fruitsframe$mar[is.na(fruitsframe$mar)] <- 0
      fruitsframe$mean_price_na_rm_0 <- rowMeans(fruitsframe[,2:4], na.rm=TRUE)
      fruitsframe

      ---Output---       - fruits jan feb mar mean_price mean_price_na_rm mean_price_na_rm_0       1 apple 10 12.5 0 NA 11.250000 7.500000       2 banana 3 5.0 6 4.666667 4.666667 4.666667       3 cherry 12 14.0 10 12.000000 12.000000 12.000000       4 dragon fruit 5 0.0 4 NA 4.500000 3.000000       5 elderberry 8 10.0 13 10.333333 10.333333 10.333333       6 figs 0 18.0 9 NA 13.500000 9.000000       7 grapes 20 0.0 11 NA 15.500000 10.333333
Now, observe that the mean_price calculation for Apple is based on 3 months of data, but with the price set to '0'. Is it realistic?

The above examples show how missing values weaken the power of statistical analysis.

Data Management

↪ Recoding variable by deleting missing values

The na.omit() function deletes any rows with missing values. When missing values are omitted from the analysis, only those observations with complete records are included in the analysis.

      fruits <- c("apple", "banana", "cherry", "dragon fruit", 
                  "elderberry", "figs", "grapes")
      jan <- c(10, 3, 12, 5, 8, NA, 20)                # price of fruits in january
      feb <- c(12.5, 5, 14, NA, 10, 18, NA)             # price of fruits in february
      mar <- c(NA, 6, 10, 4, 13, 9, 11)                # price of fruits in march
      fruitsframe <- data.frame(fruits, jan, feb, mar,
                                stringsAsFactors = FALSE)
      fruitsframe

      ---Output---       - fruits jan feb mar       1 apple 10 12.5 NA       2 banana 3 5.0 6       3 cherry 12 14.0 10       4 dragon fruit 5 NA 4       5 elderberry 8 10.0 13       6 figs NA 18.0 9       7 grapes 20 NA 11
      newfruitsname <- na.omit(fruitsframe)
      newfruitsname

      ---Output---       - fruits jan feb mar       2 banana 3 5 6       3 cherry 12 14 10       5 elderberry 8 10 13

The above approach may result in biased results when the data set is small or when there is a significant number of missing values.

Data Management

↪ Infinity and NaN

Infinite or impossible values are not the same as missing values. the symbols 'Inf' and '–Inf' represent positive and negative infinity respectively. The function is.infinite() can be used for testing infinity.

The symbol 'NaN' represents an impossible number, sin(Inf) for example. The is.nan() function is used for testing NaN.

Data Management

↪ Summary

  1. Recoding is a process of creating a new variable from the existing values of the same or other variables.
  2. Continuous variables, as well as categorical variables, can be created from existing values.
  3. Missing values reduce the statistical power of the analysis and an analysis based on missing values will result in a biased statistical model.
  4. Reason for the missing values is important to consider and the approach taken to deal with missing data directly influences the statistical model.
  5. The na.rm=TRUE parameter, is.na(), and na.omit() functions are used for recoding the missing data.
  6. Infinity and NaN are not the same as missing values.