grep {base} | R Documentation |
grep searches for matches to pattern (its first argument) within the character vector x (second argument). grepl is an alternative way to return the results. regexpr and gregexpr do too, but return more detail in a different format.
sub and gsub perform replacement of matches determined by regular expression matching.
grep(pattern, x, ignore.case = FALSE, extended = TRUE,
perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE,
invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, extended = TRUE,
perl = FALSE, fixed = FALSE, useBytes = FALSE)
sub(pattern, replacement, x,
ignore.case = FALSE, extended = TRUE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x,
ignore.case = FALSE, extended = TRUE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
regexpr(pattern, text, ignore.case = FALSE, extended = TRUE,
perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, extended = TRUE,
perl = FALSE, fixed = FALSE, useBytes = FALSE)
pattern |
character string containing a regular expression
(or character string for |
x , text |
a character vector where matches are sought, or an
object which can be coerced by |
ignore.case |
if |
extended |
if |
perl |
logical. Should perl-compatible regexps be used?
Has priority over |
value |
if |
fixed |
logical. If |
useBytes |
logical. If |
invert |
logical. If |
replacement |
a replacement for matched pattern in |
Arguments which should be character strings or character vectors are coerced to character if possible.
The two *sub functions differ only in that sub replaces only the first occurrence of a pattern whereas gsub replaces all occurrences.
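A minimal sketch of that difference, using a made-up input vector x (the values are illustrative only):

```r
x <- c("banana", "cabana", "kiwi")

one   <- sub("an", "AN", x)    # replaces only the first match in each element
every <- gsub("an", "AN", x)   # replaces every match in each element

one    # "bANana" "cabANa" "kiwi"
every  # "bANANa" "cabANa" "kiwi"
```

Elements with at most one match, such as "cabana" and "kiwi" here, come out the same from both functions.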
For regexpr it is an error for pattern to be NA; otherwise NA is permitted and gives an NA match.
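As a rough illustration of the NA rules, the sketch below passes an NA element in x and then an NA pattern to regexpr. The vector x and the names pos and res are made up for the example, and the text of the error message is not relied on.

```r
x <- c("abc", NA)

sub("b", "X", x)                 # the NA element stays NA
pos <- as.integer(regexpr("b", x))
pos                              # 2 NA

# An NA pattern is rejected by regexpr; catch the error to show it.
res <- tryCatch(regexpr(NA, "abc"), error = function(e) "error")
res
```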
The regular expressions used are those specified by POSIX 1003.2, either extended or basic, depending on the value of the extended argument, unless perl = TRUE when they are those of PCRE, http://www.pcre.org/. (The exact set of patterns supported may depend on the version of PCRE installed on the system in use if R was configured to use the system PCRE.)
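The choice of engine matters mostly for constructs that only PCRE provides. The sketch below uses a lookahead, which is a PCRE feature, alongside a POSIX bracket class that should behave the same under either engine. The input strings are invented for the example.

```r
x <- c("foobar", "foobaz")

# Lookahead is PCRE syntax, so perl = TRUE is needed here.
foo_before_bar <- grepl("foo(?=bar)", x, perl = TRUE)
foo_before_bar  # TRUE FALSE

# A POSIX character class is understood by both engines.
posix <- gsub("[[:digit:]]+", "#", "a1b22c333")
pcre  <- gsub("[[:digit:]]+", "#", "a1b22c333", perl = TRUE)
posix  # "a#b#c#"
pcre   # "a#b#c#"
```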
useBytes is only used if fixed = TRUE or perl = TRUE. Its main effect is to avoid errors/warnings about invalid inputs and spurious matches, but for regexpr it changes the interpretation of the output.
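The change in interpretation for regexpr can be seen with a string containing a multi-byte character. This sketch assumes the string is stored as UTF-8, which is what the \u escape should produce; the input is made up for the example.

```r
x <- "caf\u00e9 au lait"   # the e-acute occupies two bytes in UTF-8

chars <- regexpr("au", x, fixed = TRUE)                   # position in characters
bytes <- regexpr("au", x, fixed = TRUE, useBytes = TRUE)  # position in bytes

as.integer(chars)  # 6
as.integer(bytes)  # 7
```

The two positions differ by one because the single character before the match takes two bytes.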
PCRE only supports caseless matching for a non-ASCII pattern in a UTF-8 locale (and not for useBytes = TRUE in any locale).
For grep a vector giving either the indices of the elements of x that yielded a match or, if value is TRUE, the matched elements of x (after coercion, preserving names but no other attributes).
grepl differs only in that it returns a logical vector (match or no for each element of x).
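The return forms described above can be compared side by side. This is a small sketch on a made-up named vector; the names first, second and third are there only to show that value = TRUE keeps them.

```r
x <- c(first = "apple", second = "banana", third = "cherry")

idx  <- grep("an", x)                 # indices of matching elements
vals <- grep("an", x, value = TRUE)   # the matching elements themselves, names kept
hit  <- grepl("an", x)                # one logical per element of x
rest <- grep("an", x, invert = TRUE)  # indices of the elements that did not match

names(vals)  # "second"
```

Indexing with x[idx] or x[hit] selects the same elements as value = TRUE.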
For sub and gsub a character vector of the same length and with the same attributes as x (after possible coercion).
Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding). If useBytes = FALSE, either perl = TRUE or fixed = TRUE is set, and any element of pattern, replacement or x is declared to be in UTF-8, the result will be in UTF-8. Otherwise changed elements of the result will have the encoding declared as that of the current locale (see Encoding) if the corresponding input had a declared encoding and the current locale is either Latin-1 or UTF-8.
For regexpr an integer vector of the same length as text giving the starting position of the first match, or -1 if there is none, with attribute "match.length" giving the length of the matched text (or -1 for no match). In a multi-byte locale these quantities are in characters rather than bytes unless useBytes = TRUE is used with fixed = TRUE or perl = TRUE.
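A common use of this return value is to pull out the matched text with substring. The sketch below does that on a made-up vector s; the names start, len and found are illustrative.

```r
s <- c("alpha beta", "gamma", "beta")

m     <- regexpr("beta", s)
start <- as.integer(m)             # 7 -1 1
len   <- attr(m, "match.length")   # 4 -1 4

# Extract the matched text, giving NA where there was no match.
found <- ifelse(start > 0, substring(s, start, start + len - 1), NA)
found  # "beta" NA "beta"
```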
For gregexpr a list of the same length as text, each element of which is an integer vector as in regexpr, except that the starting positions of every (disjoint) match are given.
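A short sketch of the list structure, again on invented input. The second call shows what "disjoint" means in practice: matches do not overlap, so "aa" is found twice in "aaaa", not three times.

```r
s <- c("banana", "kiwi")
g <- gregexpr("an", s)

as.integer(g[[1]])  # 2 4
as.integer(g[[2]])  # -1

# Number of matches per element (a start of -1 means no match).
n_matches <- sapply(g, function(m) sum(m > 0))
n_matches  # 2 0

overlap <- as.integer(gregexpr("aa", "aaaa")[[1]])
overlap  # 1 3
```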
If in a multi-byte locale the pattern or replacement is not a valid sequence of bytes, an error is thrown. An invalid string in x or text is a non-match with a warning for grep or regexpr, but an error for sub or gsub.
The standard regular-expression code has been reported to be very slow when applied to extremely long character strings (tens of thousands of characters or more): the code used when perl = TRUE seems much faster and more reliable for such usages.
The standard version of gsub does not substitute correctly repeated word-boundaries (e.g. pattern = "\b"). Use perl = TRUE for such matches.
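As an illustration of the recommended form, this sketch marks every word boundary in a made-up sentence using perl = TRUE. Note that the backslash must be doubled inside an R string.

```r
marked <- gsub("\\b", "|", "the quick fox", perl = TRUE)
marked  # "|the| |quick| |fox|"
```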
The perl = TRUE option is only implemented for single-byte and UTF-8 encodings, and will warn if used in a non-UTF-8 multi-byte locale (unless useBytes = TRUE).
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole (grep)
regular expression (aka regexp) for the details of the pattern specification.
glob2rx to turn wildcard matches into regular expressions.
agrep for approximate matching.
tolower, toupper and chartr for character translations.
charmatch, pmatch, match.
apropos uses regexps and has nice examples.
grep("[a-z]", letters)
txt <- c("arm","foot","lefroo", "bafoobar")
if(length(i <- grep("foo",txt)))
   cat("'foo' appears at least once in\n\t",txt,"\n")
i # 2 and 4
txt[i]
## Double all 'a' or 'b's; "\" must be escaped, i.e., 'doubled'
gsub("([ab])", "\\1_\\1_", "abc and ABC")
txt <- c("The", "licenses", "for", "most", "software", "are",
"designed", "to", "take", "away", "your", "freedom",
"to", "share", "and", "change", "it.",
"", "By", "contrast,", "the", "GNU", "General", "Public", "License",
"is", "intended", "to", "guarantee", "your", "freedom", "to",
"share", "and", "change", "free", "software", "--",
"to", "make", "sure", "the", "software", "is",
"free", "for", "all", "its", "users")
( i <- grep("[gu]", txt) ) # indices
stopifnot( txt[i] == grep("[gu]", txt, value = TRUE) )
## Note that in locales such as en_US this includes B as the
## collation order is aAbBcCdDeE ...
(ot <- sub("[b-e]",".", txt))
txt[ot != gsub("[b-e]",".", txt)]#- gsub does "global" substitution
txt[gsub("g","#", txt) !=
    gsub("g","#", txt, ignore.case = TRUE)] # the "G" words
regexpr("en", txt)
gregexpr("e", txt)
## trim trailing white space
str <- 'Now is the time '
sub(' +$', '', str) ## spaces only
sub('[[:space:]]+$', '', str) ## white space, POSIX-style
sub('\\s+$', '', str, perl = TRUE) ## Perl-style white space
## capitalizing
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "a test of capitalizing", perl=TRUE)
gsub("\\b(\\w)", "\\U\\1", "a test of capitalizing", perl=TRUE)