Pattern Matching

This section covers the following topics:

  ➤ Patterns

  ➤ Pattern Matching – grep() and grepl() functions

  ➤ Pattern Matching – regexpr() function

  ➤ Pattern Matching – gregexpr() function

  ➤ Pattern Matching – regexec() and gregexec() functions

  ➤ Replacement – sub() and gsub() functions

↪ Patterns

A pattern, or regular expression, describes a set of strings. The grep(), grepl(), regexpr(), gregexpr(), regexec(), and gregexec() functions take a pattern argument and search each element of a character vector for matches to it.

The sub() and gsub() functions perform replacement of the first and all matches respectively.

Two types of regular expressions are used in R: extended regular expressions (the default) and Perl-like regular expressions (selected with perl = TRUE).

Any single character (or series of characters) in a pattern matches itself, unless it is a metacharacter with a special meaning. The metacharacters in extended regular expressions (the default) are:

  • \ Quote the next metacharacter so that it matches literally; also introduces escape sequences
  • ^ Match the beginning of the string
  • . Match any single character
  • $ Match the end of the string
  • | OR – Match any one of a set of patterns, usually grouped in parentheses.
  • () Grouping of subpatterns, numbered automatically left to right by the sequence of their opening parenthesis.
  • [] Character class, matching any one of the characters in the enclosed list.

A pattern may be followed by one of several repetition quantifiers:

  • * Match 0 or more times (equivalent to {0,})
  • + Match 1 or more times (equivalent to {1,})
  • ? Match 0 or 1 times (equivalent to {0,1})
  • {n} Match exactly n times
  • {n,} Match at least n times
  • {n,m} Match at least n but not more than m times
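The metacharacters and quantifiers listed above can be tried out on a small vector. The following is a minimal sketch; the vector words and the result names are hypothetical and chosen only for illustration. Note that a backslash must be doubled inside an R string literal.

```r
# Hypothetical sample strings, chosen only to exercise the metacharacters.
words <- c("ct", "cat", "caat", "caaat", "cut", "a.b", "axb")

star      <- grep("ca*t",     words, value = TRUE)  # 0 or more "a": "ct" "cat" "caat" "caaat"
plus      <- grep("ca+t",     words, value = TRUE)  # 1 or more "a": "cat" "caat" "caaat"
optional  <- grep("ca?t",     words, value = TRUE)  # 0 or 1 "a":    "ct" "cat"
exactly2  <- grep("ca{2}t",   words, value = TRUE)  # exactly 2 "a": "caat"
at_least2 <- grep("ca{2,}t",  words, value = TRUE)  # 2 or more "a": "caat" "caaat"
one_to_2  <- grep("ca{1,2}t", words, value = TRUE)  # 1 to 2 "a":    "cat" "caat"

any_char  <- grep("a.b",   words, value = TRUE)        # "." matches any character: "a.b" "axb"
literal   <- grep("a\\.b", words, value = TRUE)        # escaped "." is literal:    "a.b"
either    <- grep("^(cat|cut)$", words, value = TRUE)  # anchored alternation:      "cat" "cut"
```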

    ↪ Pattern Matching – grep() and grepl() functions

    The grep() function returns a vector of the indices of the elements of a character vector that yielded a match.

          fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", 
                      "Amla", "Avocado", "Apricots", "Apple", "Apple Apple")
          grep("[a]", fruits)
    
          ---Output---       [1] 1 2 4 6 7

    With value = TRUE, grep() returns a character vector containing the matching elements themselves instead of their indices.

          grep("[a]", fruits, value=TRUE)
    
          ---Output---       [1] "apple" "banana" "dragon fruit" "Amla" "Avocado"

    A few examples:

          grep("apple", fruits, value=TRUE)            # output [1] "apple"
          grep("[aA]pple", fruits, value=TRUE)         # output [1] "apple"  "Apple" "Apple Apple"
          grep('[aA]p{2}', fruits, value=TRUE)         # output [1] "apple"  "Apple" "Apple Apple"
          grep("^b", fruits, value=TRUE)               # output [1] "banana"
          grep("rry$", fruits, value=TRUE)             # output [1] "cherry"     "elderberry"
          fruits[grep("[d]", fruits)]                  # output [1] "dragon fruit" "elderberry"   "Avocado" 
          fruits[grep("^e", fruits)]                   # output [1] "elderberry"
          fruits[grep("^E", fruits, ignore.case=TRUE)] # output [1] "elderberry"
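A common slip is to write the OR operator inside a character class. Inside [], the | character is literal, so [a|A] matches "a", "A", or a vertical bar. The sketch below shows the difference; the vector x, including the contrived string "|pple", is hypothetical.

```r
# "|pple" is a contrived string used only to expose the difference.
x <- c("apple", "Apple", "|pple")

with_bar    <- grep("[a|A]pple", x, value = TRUE)  # "apple" "Apple" "|pple"
char_class  <- grep("[aA]pple",  x, value = TRUE)  # "apple" "Apple"
alternation <- grep("(a|A)pple", x, value = TRUE)  # "apple" "Apple"
```

For single characters the character class [aA] is the shorter form; alternation with parentheses is needed when the alternatives are longer than one character.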
    
    

    The grepl() function works much like grep() but differs in its return value. It returns a logical vector with one entry per element of the character vector: TRUE where the pattern matches and FALSE where it does not.

          grepl("[a]", fruits)
    
          ---Output---       [1] TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
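Because grepl() returns one logical value per element, its result can be used directly for counting and for subsetting. A minimal sketch, reusing the fruits vector from above (the names has_a, n_with_a, and without_a are hypothetical):

```r
fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry",
            "Amla", "Avocado", "Apricots", "Apple", "Apple Apple")

has_a <- grepl("a", fruits)      # logical vector, one entry per fruit

idx       <- which(has_a)        # 1 2 4 6 7, the same indices grep() returns
n_with_a  <- sum(has_a)          # 5 elements contain a lowercase "a"
without_a <- fruits[!has_a]      # elements that do NOT match
```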

    Both grep() and grepl() tell which strings in a character vector match a given pattern, but they do not tell where in each string the match occurs.

    ↪ Pattern Matching – regexpr() function

    The regexpr() function returns an integer vector of the same length as the input vector, giving the starting position of the first match in each element, or -1 if there is none. The result carries the attribute match.length, an integer vector giving the length of the matched text (or -1 for no match), along with the attributes index.type and useBytes. Match positions and lengths are in characters unless useBytes = TRUE is used, in which case they are in bytes.

          regexpr("App", "Apple is red, Apple is green")

          ---Output---
          [1] 1
          attr(,"match.length")
          [1] 3
          attr(,"index.type")
          [1] "chars"
          attr(,"useBytes")
          [1] TRUE

          fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", 
                      "Amla", "Avocado", "Apricots", "Apple", "Apple apple")
          regexpr("ap(.*)", fruits)

          ---Output---
          [1]  1 -1 -1 -1 -1 -1 -1 -1 -1  7
          attr(,"match.length")
          [1]  5 -1 -1 -1 -1 -1 -1 -1 -1  5
          attr(,"index.type")
          [1] "chars"
          attr(,"useBytes")
          [1] TRUE

          regexpr("Ap", fruits)

          ---Output---
          [1] -1 -1 -1 -1 -1 -1 -1  1  1  1
          attr(,"match.length")
          [1] -1 -1 -1 -1 -1 -1 -1  2  2  2
          attr(,"index.type")
          [1] "chars"
          attr(,"useBytes")
          [1] TRUE
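The start position and the match.length attribute together are enough to cut the matched text out of the string. The following sketch does this by hand with substring(); the pattern "r[a-z]+" and the variable names are chosen only for illustration.

```r
x <- "Apple is red, Apple is green"

m     <- regexpr("r[a-z]+", x)       # first run of lowercase letters starting with "r"
start <- as.integer(m)               # 10; as.integer() drops the attributes
len   <- attr(m, "match.length")     # 3

first_match <- substring(x, start, start + len - 1)   # "red"
```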

    ↪ Pattern Matching – gregexpr() function

    The gregexpr() function is similar to regexpr(), except that it returns a list of the same length as the input vector. Each list element gives the starting positions of all matches in the corresponding string, or -1 if there is none.

          gregexpr("Ap", "Apple is red, Apple is green")

          ---Output---
          [[1]]
          [1]  1 15
          attr(,"match.length")
          [1] 2 2
          attr(,"index.type")
          [1] "chars"
          attr(,"useBytes")
          [1] TRUE

          fruits <- c("apple","Apricots","Apple")
          gregexpr("Ap", fruits)

          ---Output---
          [[1]]
          [1] -1
          attr(,"match.length")
          [1] -1
          attr(,"index.type")
          [1] "chars"
          attr(,"useBytes")
          [1] TRUE

          [[2]]
          [1] 1
          attr(,"match.length")
          [1] 2
          attr(,"index.type")
          [1] "chars"
          attr(,"useBytes")
          [1] TRUE

          [[3]]
          [1] 1
          attr(,"match.length")
          [1] 2
          attr(,"index.type")
          [1] "chars"
          attr(,"useBytes")
          [1] TRUE
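Since gregexpr() reports every match, one practical use is counting how often a pattern occurs in each string. A minimal sketch, with a hypothetical input vector x; a list element equal to -1 means "no match" and must not be counted.

```r
x <- c("Apple is red, Apple is green", "banana", "Apple")

g <- gregexpr("Ap", x)

# Count only real positions (> 0); the no-match marker -1 contributes 0.
n_matches <- vapply(g, function(pos) sum(pos > 0), integer(1))   # 2 0 1

# Plain integer positions without the attributes.
positions <- lapply(g, as.integer)   # list(c(1, 15), -1, 1)
```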

    ↪ Pattern Matching – regexec() and gregexec() functions

    The regexec() function returns a list of the same length as the input vector. Each element gives the starting position of the first match, followed by the starting positions of any parenthesized subexpressions, or -1 if there is no match. The interpretation of positions and lengths, and the attributes, follow regexpr().

          regexec("Ap", "Apple is red, Apple is green")
          regexec("Ap", fruits)
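What regexec() adds over regexpr() becomes visible once the pattern contains parenthesized groups. The sketch below uses a hypothetical pattern "(A)(p+)" with two groups and reads the result back with regmatches().

```r
x <- "Apple is red, Apple is green"

m <- regexec("(A)(p+)", x)

starts  <- as.integer(m[[1]])                        # 1 1 2: whole match, group 1, group 2
lengths <- as.integer(attr(m[[1]], "match.length"))  # 3 1 2

# Matched text for the whole match and for each group.
parts <- regmatches(x, m)[[1]]                       # "App" "A" "pp"
```

Only the first match in the string is reported; the second "Apple" is ignored.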
    
    

    The gregexec() function returns the same as regexec(), except that it accommodates multiple matches per element of the text. The integer sequences for each match are made into the columns of a matrix, with one matrix per element of text that has matches.

          gregexec("Ap", "Apple is red, Apple is green")

          ---Output---
          [[1]]
               [,1] [,2]
          [1,]    1   15
          attr(,"match.length")
               [,1] [,2]
          [1,]    2    2
          attr(,"useBytes")
          [1] TRUE
          attr(,"index.type")
          [1] "chars"

          gregexec("Ap", fruits)
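With groups in the pattern, each column of the gregexec() matrix describes one match and each row one piece of it (whole match first, then the groups). A minimal sketch with the hypothetical pattern "(A)(p+)"; it assumes R 4.1.0 or later, where gregexec() is available.

```r
x <- "Apple is red, Apple is green"

m <- gregexec("(A)(p+)", x)          # assumes R >= 4.1.0

start_matrix <- m[[1]]               # 3 x 2 matrix of start positions
whole_starts <- as.vector(start_matrix[1, ])   # 1 15: starts of the two whole matches

# regmatches() turns the positions into a character matrix of the same shape.
pieces <- regmatches(x, m)[[1]]
whole  <- as.vector(pieces[1, ])     # "App" "App"
group2 <- as.vector(pieces[3, ])     # "pp"  "pp"
```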
    
    

    ↪ Replacement – sub() and gsub() functions

    The sub() and the gsub() functions return a character vector of the same length and with the same attributes as the input vector (after possible coercion to character). Elements of input vectors that are not substituted will be returned unchanged.

          fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", 
                      "Amla", "Avocado", "Apricots", "Apple", "Apple Apple")
          sub("Apple", "apple", fruits)
    
          ---Output---
          [1] "apple"        "banana"       "cherry"       "dragon fruit" "elderberry"   "Amla"
          [7] "Avocado"      "Apricots"     "apple"        "apple Apple"

    Observe that sub() replaced only the first occurrence in the last element. The gsub() function replaces all occurrences.

          fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", 
                      "Amla", "Avocado", "Apricots", "Apple", "Apple Apple")
          gsub("Apple", "apple", fruits)
    
          ---Output---
          [1] "apple"        "banana"       "cherry"       "dragon fruit" "elderberry"   "Amla"
          [7] "Avocado"      "Apricots"     "apple"        "apple apple"
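The replacement string can refer back to the numbered groups of the pattern with \\1, \\2, and so on. When the text to replace contains metacharacters that should be taken literally, fixed = TRUE turns off regular expression matching. The sketch below illustrates both; the input strings are hypothetical.

```r
# Backreferences: swap the two words captured by the groups.
swapped <- sub("([a-z]+) ([a-z]+)", "\\2 \\1", "dragon fruit")   # "fruit dragon"

# "." is a metacharacter, so as a regular expression it matches every character.
as_regex   <- gsub(".", "-", "a.b.c")                 # "-----"
as_literal <- gsub(".", "-", "a.b.c", fixed = TRUE)   # "a-b-c"
```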

    The regmatches() function returns matched substrings from the data obtained by regexpr(), gregexpr(), regexec() or gregexec() functions.

          fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", 
                      "Amla", "Avocado", "Apricots", "Apple", "Apple Apple")
          p <- regexpr("[Aa]p(.*)", fruits)
          regmatches(fruits, p)

          ---Output---       [1] "apple" "Apricots" "Apple" "Apple Apple"

          m <- regmatches(fruits, p)
          gsub(" Apple","",m)           # Removed second Apple from the last element

          ---Output---       [1] "apple" "Apricots" "Apple" "Apple"
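With regexpr() match data, regmatches() drops the elements that did not match, so its result can be shorter than the input. If the matches need to stay aligned with the original vector, one option is to fill a vector of NA values at the matching positions. This is a sketch of that approach, not the only way to do it; the names hit and matched are hypothetical.

```r
fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry",
            "Amla", "Avocado", "Apricots", "Apple", "Apple Apple")

p   <- regexpr("[Aa]p.*", fruits)
hit <- as.integer(p) > 0                   # TRUE where a match was found

matched <- rep(NA_character_, length(fruits))
matched[hit] <- regmatches(fruits, p)      # same length as fruits, NA for no match
```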

    ↪ Summary

  • Two types of regular expressions are used in R, extended regular expressions (the default) and Perl-like regular expressions. This section covered only extended regular expressions.
  • The grep(), grepl(), regexpr(), gregexpr(), regexec(), and gregexec() functions take a pattern argument and search each element of a character vector for matches to it.
  • The sub() and gsub() functions perform replacement of the first and all matches respectively.