grep {base}

R Documentation

Pattern Matching and Replacement

Description

grep searches for matches to pattern (its first argument) within the character vector x (second argument). grepl performs the same search but returns a logical vector rather than indices. regexpr and gregexpr also search for matches, but return more detail (the position and length of each match) in a different format.

sub and gsub perform replacement of matches determined by regular expression matching.

Usage

grep(pattern, x, ignore.case = FALSE, extended = TRUE,
     perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE,
     invert = FALSE)

grepl(pattern, x, ignore.case = FALSE, extended = TRUE,
     perl = FALSE, fixed = FALSE, useBytes = FALSE)

sub(pattern, replacement, x,
    ignore.case = FALSE, extended = TRUE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)

gsub(pattern, replacement, x,
     ignore.case = FALSE, extended = TRUE, perl = FALSE,
     fixed = FALSE, useBytes = FALSE)

regexpr(pattern, text, ignore.case = FALSE, extended = TRUE,
        perl = FALSE, fixed = FALSE, useBytes = FALSE)

gregexpr(pattern, text, ignore.case = FALSE, extended = TRUE,
         perl = FALSE, fixed = FALSE, useBytes = FALSE)

Arguments

pattern

character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible.

x, text

a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.

ignore.case

if FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching.

extended

if TRUE, extended regular expression matching is used, and if FALSE basic regular expressions are used.

perl

logical. Should Perl-compatible regular expressions be used? Takes priority over extended.

value

if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.

fixed

logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.

useBytes

logical. If TRUE the matching is done byte-by-byte rather than character-by-character. See ‘Details’.

invert

logical. If TRUE, return indices or values for elements that do not match.

replacement

a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case.
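The following is a minimal sketch of how several of these arguments change a call. The vector x and the result names are illustrative only, and extended and perl are left at their defaults.

```r
x <- c("apple.pie", "Banana", "cherry")

# invert = TRUE returns the indices of elements that do NOT match
not_an <- grep("an", x, invert = TRUE)

# ignore.case = TRUE lets the lower-case pattern "b" match "B"
has_b <- grepl("b", x, ignore.case = TRUE)

# fixed = TRUE matches "." as a literal dot; as a regular
# expression "." matches any single character
dot_literal <- grepl(".", x, fixed = TRUE)
dot_regex   <- grepl(".", x)

# backreferences in replacement: swap the parts around the dot
# (the backslash is doubled inside an R string)
swapped <- sub("([[:alpha:]]+)\\.([[:alpha:]]+)", "\\2.\\1", x)

print(not_an)
print(has_b)
print(dot_literal)
print(dot_regex)
print(swapped)
```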

Details

Arguments which should be character strings or character vectors are coerced to character if possible.

The two *sub functions differ only in that sub replaces only the first occurrence of a pattern whereas gsub replaces all occurrences.
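As a small illustration of that difference (the input string s is made up for this sketch):

```r
s <- "a-b-c-d"

# sub replaces only the first "-"
first_only <- sub("-", "+", s)

# gsub replaces every "-"
all_matches <- gsub("-", "+", s)

print(first_only)   # "a+b-c-d"
print(all_matches)  # "a+b+c+d"
```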

For regexpr it is an error for pattern to be NA; otherwise NA is permitted and gives an NA match.

The regular expressions used are those specified by POSIX 1003.2, either extended or basic, depending on the value of the extended argument, unless perl = TRUE when they are those of PCRE, http://www.pcre.org/. (The exact set of patterns supported may depend on the version of PCRE installed on the system in use if R was configured to use the system PCRE.)

useBytes is only used if fixed = TRUE or perl = TRUE. Its main effect is to avoid errors/warnings about invalid inputs and spurious matches, but for regexpr it changes the interpretation of the output.

PCRE only supports caseless matching for a non-ASCII pattern in a UTF-8 locale (and not for useBytes = TRUE in any locale).

Value

For grep a vector giving either the indices of the elements of x that yielded a match or, if value is TRUE, the matched elements of x (after coercion, preserving names but no other attributes).

grepl differs only in that it returns a logical vector (match or no for each element of x).
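The three return forms can be compared side by side. This is a sketch on an illustrative vector (fruits is not part of the original examples):

```r
fruits <- c("fig", "grape", "mango", "kiwi")

idx  <- grep("g", fruits)                # integer indices of matching elements
vals <- grep("g", fruits, value = TRUE)  # the matching elements themselves
lgl  <- grepl("g", fruits)               # one TRUE/FALSE per element of fruits

print(idx)
print(vals)
print(lgl)

# the logical form is convenient for subsetting, and which()
# recovers the indices that grep returns
print(fruits[lgl])
print(which(lgl))
```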

For sub and gsub a character vector of the same length and with the same attributes as x (after possible coercion). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding). If useBytes = FALSE, either perl = TRUE or fixed = TRUE and any element of pattern, replacement and x is declared to be in UTF-8, the result will be in UTF-8. Otherwise changed elements of the result will have the encoding declared as that of the current locale (see Encoding) if the corresponding input had a declared encoding and the current locale is either Latin-1 or UTF-8.

For regexpr an integer vector of the same length as text giving the starting position of the first match, or -1 if there is none, with attribute "match.length" giving the length of the matched text (or -1 for no match). In a multi-byte locale these quantities are in characters rather than bytes unless useBytes = TRUE is used with fixed = TRUE or perl = TRUE.
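A sketch of reading the regexpr result and using it to extract the matched text; the vector words and the helper names are illustrative, and the input is plain ASCII so characters and bytes coincide.

```r
words <- c("banana", "cherry", "mango")

m <- regexpr("an", words)

starts  <- as.integer(m)            # start of the first match, -1 if none
lengths <- attr(m, "match.length")  # length of that match, -1 if none

print(starts)
print(lengths)

# extract the matched text for the elements that did match
hit <- starts > 0
matched <- substring(words[hit], starts[hit], starts[hit] + lengths[hit] - 1)
print(matched)
```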

For gregexpr a list of the same length as text, each element of which is an integer vector as in regexpr, except that the starting positions of every (disjoint) match are given.
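A sketch of the list structure returned by gregexpr, including the effect of matches being disjoint; the inputs and the names g, pos and counts are illustrative.

```r
words <- c("banana", "kiwi")

g <- gregexpr("a", words)

print(length(g))           # one list element per element of words
print(as.integer(g[[1]]))  # every start position in "banana"
print(as.integer(g[[2]]))  # -1, since "kiwi" has no match

# count the matches per element, ignoring the -1 "no match" marker
counts <- sapply(g, function(m) sum(m > 0))
print(counts)

# matches are disjoint: "aa" is found twice in "aaaa", not three times
pos <- as.integer(gregexpr("aa", "aaaa")[[1]])
print(pos)
```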

If in a multi-byte locale the pattern or replacement is not a valid sequence of bytes, an error is thrown. An invalid string in x or text is a non-match with a warning for grep or regexpr, but an error for sub or gsub.

Warning

The standard regular-expression code has been reported to be very slow when applied to extremely long character strings (tens of thousands of characters or more): the code used when perl = TRUE seems much faster and more reliable for such usages.

The standard version of gsub does not correctly substitute repeated word boundaries (e.g. pattern = "\b"). Use perl = TRUE for such matches.
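A sketch of word-boundary matching with perl = TRUE, as recommended above. The input strings are made up; the first call shows a pattern that combines boundaries with literal text, the second a pattern consisting of the boundary alone, whose output is printed rather than asserted here.

```r
s <- "foo food foo"

# replace "foo" only where it stands as a whole word
whole_words <- gsub("\\bfoo\\b", "X", s, perl = TRUE)
print(whole_words)

# a pattern that is nothing but a word boundary
marked <- gsub("\\b", "|", "the quick", perl = TRUE)
print(marked)
```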

The perl = TRUE option is only implemented for single-byte and UTF-8 encodings, and will warn if used in a non-UTF-8 multi-byte locale (unless useBytes = TRUE).

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole (grep)

Examples

grep("[a-z]", letters)

txt <- c("arm","foot","lefroo", "bafoobar")
if(length(i <- grep("foo",txt)))
   cat("'foo' appears at least once in\n\t",txt,"\n")
i # 2 and 4
txt[i]

## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
gsub("([ab])", "\\1_\\1_", "abc and ABC")

txt <- c("The", "licenses", "for", "most", "software", "are",
  "designed", "to", "take", "away", "your", "freedom",
  "to", "share", "and", "change", "it.",
   "", "By", "contrast,", "the", "GNU", "General", "Public", "License",
   "is", "intended", "to", "guarantee", "your", "freedom", "to",
   "share", "and", "change", "free", "software", "--",
   "to", "make", "sure", "the", "software", "is",
   "free", "for", "all", "its", "users")
( i <- grep("[gu]", txt) ) # indices
stopifnot( txt[i] == grep("[gu]", txt, value = TRUE) )

## Note that in locales such as en_US this includes B as the
## collation order is aAbBcCdDeE ...
(ot <- sub("[b-e]",".", txt))
txt[ot != gsub("[b-e]",".", txt)]#- gsub does "global" substitution

txt[gsub("g","#", txt) !=
    gsub("g","#", txt, ignore.case = TRUE)] # the "G" words

regexpr("en", txt)

gregexpr("e", txt)

## trim trailing white space
str <- 'Now is the time      '
sub(' +$', '', str)  ## spaces only
sub('[[:space:]]+$', '', str) ## white space, POSIX-style
sub('\\s+$', '', str, perl = TRUE) ## Perl-style white space

## capitalizing
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "a test of capitalizing", perl=TRUE)
gsub("\\b(\\w)", "\\U\\1", "a test of capitalizing", perl=TRUE)

[Package base version 2.9.0 ]