Pattern Matching
Topics covered in this section are
➤ Patterns
➤ Pattern Matching – grep() and grepl() functions
➤ Pattern Matching – regexpr() function
➤ Pattern Matching – gregexpr() function
➤ Pattern Matching – regexec() and gregexec() functions
➤ Replacement – sub() and gsub() functions
↪ Patterns
A pattern, or regular expression, describes a set of strings. The grep(), grepl(), regexpr(), gregexpr(), regexec(), and gregexec() functions each take a pattern argument and search every element of a character vector for matches to it.
The sub() and gsub() functions perform replacement of the first and all matches respectively.
Two types of regular expressions are used in R – extended regular expressions (the default) and Perl-like regular expressions.
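As a rough illustration of the difference between the two flavours, the sketch below uses a lookahead, a construct that is only available in Perl-like regular expressions. The vector x and the variable names are made up for this example.

```r
# Hypothetical sample data for illustration
x <- c("apple", "banana", "grape")

# Default (extended) regular expression: "a" immediately followed by "p"
has_ap_default <- grepl("ap", x)

# Perl-like regular expression (perl = TRUE): the lookahead (?=p) requires
# a "p" after the "a" but does not include it in the match
has_ap_perl <- grepl("a(?=p)", x, perl = TRUE)

# Both select the same elements, but the matched text differs in length
m_default <- regexpr("ap", "apple")
m_perl    <- regexpr("a(?=p)", "apple", perl = TRUE)

len_default <- as.integer(attr(m_default, "match.length"))  # 2
len_perl    <- as.integer(attr(m_perl, "match.length"))     # 1
```

The lookahead pattern is only meaningful with perl = TRUE, so it should not be passed to the default engine.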
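The patterns used in pattern matching are regular expressions. Any single character (or series of characters) matches directly, unless it is a metacharacter with a special meaning. The metacharacters in extended regular expressions (the default) are
. \ | ( ) [ { ^ $ * + ?
Whether a metacharacter has its special meaning depends on the context. To match one literally, escape it with a backslash, which is written as a double backslash inside an R string (for example "\\.").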
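A pattern may be followed by one of several repetition quantifiers:
➤ ? – the preceding item is optional and is matched at most once
➤ * – the preceding item is matched zero or more times
➤ + – the preceding item is matched one or more times
➤ {n} – the preceding item is matched exactly n times
➤ {n,} – the preceding item is matched n or more times
➤ {n,m} – the preceding item is matched at least n times, but not more than m times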
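The short sketch below shows how a metacharacter and a few repetition quantifiers behave in practice. The test vector x is invented for this example; the expected results are noted in the comments.

```r
# Hypothetical test strings for illustration
x <- c("abc", "a.c", "ac", "abbc")

# "." is a metacharacter: it matches any single character
any_char <- grepl("a.c", x)        # TRUE  TRUE  FALSE FALSE

# Escaped with a double backslash, it matches only a literal dot
literal_dot <- grepl("a\\.c", x)   # FALSE TRUE  FALSE FALSE

# Repetition quantifiers applied to "b"
zero_or_one  <- grepl("ab?c", x)    # TRUE  FALSE TRUE  FALSE
zero_or_more <- grepl("ab*c", x)    # TRUE  FALSE TRUE  TRUE
one_or_more  <- grepl("ab+c", x)    # TRUE  FALSE FALSE TRUE
exactly_two  <- grepl("ab{2}c", x)  # FALSE FALSE FALSE TRUE
```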
↪ Pattern Matching – grep() and grepl() functions
The grep() function returns a vector of the indices of the elements of a character vector that yielded a match.
fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", "Amla", "Avocado", "Apricots", "Apple", "Apple Apple")
grep("[a]", fruits)
---Output--- [1] 1 2 4 6 7
With value = TRUE, grep() returns a character vector containing the matching elements themselves instead of their indices.
grep("[a]", fruits, value=TRUE)
---Output--- [1] "apple" "banana" "dragon fruit" "Amla" "Avocado"
A few examples (inside a character class, | is a literal character rather than alternation, so [aA] is used to match either case):
grep("apple", fruits, value=TRUE)
# output [1] "apple"
grep("[aA]pple", fruits, value=TRUE)
# output [1] "apple" "Apple" "Apple Apple"
grep('[aA]p{2}', fruits, value=TRUE)
# output [1] "apple" "Apple" "Apple Apple"
grep("^b", fruits, value=TRUE)
# output [1] "banana"
grep("rry$", fruits, value=TRUE)
# output [1] "cherry" "elderberry"
fruits[grep("[d]", fruits)]
# output [1] "dragon fruit" "elderberry" "Avocado"
fruits[grep("^e", fruits)]
# output [1] "elderberry"
fruits[grep("^E", fruits, ignore.case=TRUE)]
# output [1] "elderberry"
The grepl() function works much like grep() but differs in its return value. It returns a logical vector with one entry per element of the character vector: TRUE where the element matches and FALSE where it does not.
grepl("[a]", fruits)
---Output--- [1] TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
Both grep() and grepl() tell us which strings in a character vector match a given pattern, but they do not tell us where in each string the match occurs.
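The relationship between the two return values can be sketched as follows, reusing the fruits vector from the examples above. The variable names hits, idx, and n_hits are chosen only for this illustration.

```r
fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry",
            "Amla", "Avocado", "Apricots", "Apple", "Apple Apple")

# Logical vector: one TRUE/FALSE per element
hits <- grepl("^A", fruits)

# Indices of the matching elements, as grep() would return them
idx <- which(hits)                              # 6 7 8 9 10
same_as_grep <- identical(idx, grep("^A", fruits))

# A logical vector is convenient for counting and subsetting
n_hits   <- sum(hits)                           # 5
selected <- fruits[hits]
```

A logical vector is often the more convenient form inside conditions or when combining several tests with & and |, while the index form is handy for direct subsetting.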
↪ Pattern Matching – regexpr() function
The regexpr() function returns an integer vector of the same length as the input vector, giving the starting position of the first match in each element, or -1 if there is none. The result carries the attribute match.length, an integer vector giving the length of the matched text (or -1 for no match), along with the attributes index.type and useBytes. Match positions and lengths are given in characters unless useBytes = TRUE is used.
regexpr("App", "Apple is red, Apple is green")
---Output---
[1] 1
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", "Amla", "Avocado", "Apricots", "Apple", "Apple apple")
regexpr("ap(.*)", fruits)
---Output---
[1] 1 -1 -1 -1 -1 -1 -1 -1 -1 7
attr(,"match.length")
[1] 5 -1 -1 -1 -1 -1 -1 -1 -1 5
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
regexpr("Ap", fruits)
---Output---
[1] -1 -1 -1 -1 -1 -1 -1 1 1 1
attr(,"match.length")
[1] -1 -1 -1 -1 -1 -1 -1 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
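The starting position and the match.length attribute together are enough to cut the matched text out of the string. The following is a minimal sketch using substring(); the pattern and variable names are picked only for this example.

```r
x <- "Apple is red, Apple is green"

# First occurrence of "is " followed by lowercase letters
m <- regexpr("is [a-z]+", x)

start <- as.integer(m)                            # 7
len   <- as.integer(attr(m, "match.length"))      # 6

# Extract the matched text from its position and length
first_match <- substring(x, start, start + len - 1)   # "is red"
```

When there is no match, start and len are both -1, so a real script would check for that before extracting.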
↪ Pattern Matching – gregexpr() function
The gregexpr() function is similar to regexpr(), except that it returns a list of the same length as the input vector. Each list element gives the starting positions of all matches in the corresponding string, or -1 if there is none.
gregexpr("Ap", "Apple is red, Apple is green")
---Output---
[[1]]
[1] 1 15
attr(,"match.length")
[1] 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
fruits <- c("apple","Apricots","Apple")
gregexpr("Ap", fruits)
---Output---
[[1]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 1
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 1
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
↪ Pattern Matching – regexec() and gregexec() functions
The regexec() function returns a list of the same length as the input vector. Each element gives the starting position of the first match, followed by the starting positions of any parenthesized subexpressions in the pattern, or -1 if there is no match. The interpretation of positions, lengths, and attributes follows the regexpr() function.
regexec("Ap", "Apple is red, Apple is green")
regexec("Ap", fruits)
The gregexec() function (available from R 4.1.0) returns the same as the regexec() function, except that it accommodates multiple matches per element of the text. The integer sequences for each match are made into the columns of a matrix, with one matrix per element of the text that has matches.
gregexec("Ap", "Apple is red, Apple is green")
---Output---
[[1]]
     [,1] [,2]
[1,]    1   15
attr(,"match.length")
     [,1] [,2]
[1,]    2    2
attr(,"useBytes")
[1] TRUE
attr(,"index.type")
[1] "chars"
gregexec("Ap", fruits)
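The examples above use a pattern without parentheses, so regexec() reports only the overall match. With parenthesized subexpressions in the pattern, each list element also carries one position and one length per subexpression. The sketch below uses an invented key=value vector to show this.

```r
# Hypothetical "name=number" strings for illustration
x <- c("apple=3", "banana=12", "cherry")

# Two parenthesized subexpressions: the name and the number
m <- regexec("([a-z]+)=([0-9]+)", x)

# For "apple=3": whole match, then group 1, then group 2
pos_first <- as.integer(m[[1]])                          # 1 1 7
len_first <- as.integer(attr(m[[1]], "match.length"))    # 7 5 1

# "cherry" has no "=", so the result is -1
pos_third <- as.integer(m[[3]])                          # -1

# The number in "apple=3" can be cut out from group 2's position and length
num_first <- substring(x[1], pos_first[3], pos_first[3] + len_first[3] - 1)
```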
↪ Replacement – sub() and gsub() functions
The sub() and the gsub() functions return a character vector of the same length and with the same attributes as the input vector (after possible coercion to character). Elements of input vectors that are not substituted will be returned unchanged.
fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", "Amla", "Avocado", "Apricots", "Apple", "Apple Apple")
sub("Apple", "apple", fruits)
---Output---
[1] "apple" "banana" "cherry" "dragon fruit" "elderberry" "Amla"
[7] "Avocado" "Apricots" "apple" "apple Apple"
Observe that sub() replaced only the first occurrence of "Apple" in the last element. The gsub() function replaces all occurrences.
fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", "Amla", "Avocado", "Apricots", "Apple", "Apple Apple")
gsub("Apple", "apple", fruits)
---Output---
[1] "apple" "banana" "cherry" "dragon fruit" "elderberry" "Amla"
[7] "Avocado" "Apricots" "apple" "apple apple"
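A few further variations on the same idea are sketched below, using a small made-up vector. The ignore.case argument works as it does for grep(), and the last call additionally assumes a pattern with parenthesized groups, which the replacement string can refer back to as "\\1" and "\\2".

```r
# Hypothetical sample data for illustration
x <- c("Apple Apple", "banana")

# sub() changes only the first match in each element; gsub() changes all
first_only <- sub("a", "_", x)                       # "Apple Apple" "b_nana"
all_hits   <- gsub("a", "_", x)                      # "Apple Apple" "b_n_n_"

# Case-insensitive replacement also catches the capital "A"
any_case <- gsub("a", "_", x, ignore.case = TRUE)    # "_pple _pple" "b_n_n_"

# Backreferences reuse the text matched by parenthesized groups
swapped <- sub("([a-z]+) ([a-z]+)", "\\2 \\1", "dragon fruit")  # "fruit dragon"
```

The first element is returned unchanged by the case-sensitive calls because it contains no lowercase "a".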
The regmatches() function returns matched substrings from the data obtained by regexpr(), gregexpr(), regexec() or gregexec() functions.
fruits <- c("apple", "banana", "cherry", "dragon fruit", "elderberry", "Amla", "Avocado", "Apricots", "Apple", "Apple Apple")
p <- regexpr("[Aa]p(.*)", fruits)   # [Aa] matches either case; | is literal inside [ ]
regmatches(fruits, p)
---Output--- [1] "apple" "Apricots" "Apple" "Apple Apple"
m <- regmatches(fruits, p)
gsub(" Apple","",m) # Removed second Apple from the last element
---Output--- [1] "apple" "Apricots" "Apple" "Apple"
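With match data from regexpr(), regmatches() drops the elements that have no match, as the output above shows. With match data from gregexpr(), it instead returns a list with one entry per input element. The sketch below illustrates this with an invented two-element vector.

```r
# Hypothetical sample data for illustration
x <- c("Apple is red, Apple is green", "banana")

# All occurrences of "is " followed by lowercase letters
g <- gregexpr("is [a-z]+", x)

# One list element per input string; no match gives character(0)
found <- regmatches(x, g)

first_elem  <- found[[1]]       # "is red" "is green"
n_per_elem  <- lengths(found)   # 2 0
```

Keeping one list entry per input makes it easy to line the results up with the original vector, even when some elements have no match.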
↪ Summary