strsplit {base} | R Documentation |
Split the elements of a character vector x into substrings according to the presence of substring split within them.
strsplit(x, split, extended = TRUE, fixed = FALSE, perl = FALSE)
x |
character vector, each element of which is to be split. Other inputs, including a factor, will give an error. |
split |
character vector (or object which can be coerced to such)
containing regular expression(s) (unless |
extended |
logical. If |
fixed |
logical. If |
perl |
logical. Should perl-compatible regexps be used?
Has priority over |
Argument split will be coerced to character, so you will see uses with split = NULL to mean split = character(0), including in the examples below.
Note that splitting into single characters can be done via split=character(0) or split=""; the two are equivalent. The definition of ‘character’ here depends on the locale (and perhaps OS): in a single-byte locale it is a byte, and in a multi-byte locale it is the unit represented by a ‘wide character’ (almost always a Unicode code point).
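As a small illustration of the two notes above (the input string is made up for this sketch), all three spellings of an empty split give one piece per character:

```r
s <- "R docs"                               # illustrative input, not from this help page
by_null  <- strsplit(s, NULL)[[1]]          # NULL is coerced to character(0)
by_empty <- strsplit(s, character(0))[[1]]
by_blank <- strsplit(s, "")[[1]]
by_null
identical(by_null, by_empty)
identical(by_null, by_blank)
length(by_null) == nchar(s)                 # one piece per character
```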
A missing value of split does not split the corresponding element(s) of x at all.
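A brief sketch of this behaviour, using made-up inputs and a split vector whose second element is missing:

```r
x <- c("a b", "c d")                        # illustrative inputs
res <- strsplit(x, c(" ", NA_character_))
res[[1]]   # split on the space
res[[2]]   # paired with the missing split value, so left whole
```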
The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is "", but if there is a match at the end of the string, the output is the same as with the match removed.
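The loop described above can be written out in R as a rough sketch. The helper name split_sketch is hypothetical, and the sketch assumes that every match has non-zero length; it is not the code strsplit itself uses, only a way to see where the leading "" comes from and why a trailing match leaves nothing behind.

```r
# Hypothetical helper illustrating the loop above for a single string.
# Assumption: `pattern` never matches the empty string.
split_sketch <- function(s, pattern, fixed = FALSE) {
  out <- character(0)
  repeat {
    if (!nzchar(s)) break                       # the string is empty
    m <- regexpr(pattern, s, fixed = fixed)     # first match position, or -1
    len <- attr(m, "match.length")
    if (m[1] > 0) {
      if (len == 0) stop("empty matches are outside the scope of this sketch")
      out <- c(out, substr(s, 1, m[1] - 1))     # text to the left of the match
      s <- substr(s, m[1] + len, nchar(s))      # drop the match and all to its left
    } else {
      out <- c(out, s)                          # no match: the rest is the last piece
      break
    }
  }
  out
}

split_sketch("#a#", "#", fixed = TRUE)   # leading "" kept, nothing after the final "#"
split_sketch("a.b.c", "\\.")
split_sketch("", " ")
```

On these inputs the sketch agrees with strsplit(...)[[1]]; behaviour with empty matches is deliberately left out.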
A list of length length(x), the i-th element of which contains the vector of splits of x[i].
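To make the shape of the result concrete, here is a minimal sketch with made-up inputs; the result has one list element per input string, and each element can have a different length:

```r
x <- c("a b", "c d e", "f")     # illustrative inputs
res <- strsplit(x, " ")
length(res) == length(x)        # one list element per element of x
lengths(res)                    # number of pieces found in each string
res[[2]]                        # the splits of x[2]
```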
If fixed = TRUE or perl = TRUE and if any element of x or split is declared to be in UTF-8 (see Encoding), non-ASCII character strings in the result will be in UTF-8 and have the encoding declared as UTF-8. Otherwise they will be in the current locale's encoding, and will be declared to have that encoding if it is either Latin-1 or UTF-8 and the corresponding input had a declared encoding.
The standard regular expression code has been reported to be very slow when applied to extremely long character strings (tens of thousands of characters or more): the code used when perl = TRUE seems much faster and more reliable for such usages.
The perl = TRUE option is only implemented for single-byte and UTF-8 encodings, and will warn if used in a non-UTF-8 multibyte locale.
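A hedged sketch of how one might try perl = TRUE on a long string. The test string is synthetic, and no timing is claimed here, since speed depends on the R version and platform:

```r
long <- paste(rep("ab", 50000), collapse = ",")   # synthetic string of 149999 characters
nchar(long)
p_default <- strsplit(long, ",")[[1]]
p_perl    <- strsplit(long, ",", perl = TRUE)[[1]]
identical(p_default, p_perl)                      # same pieces either way
length(p_perl)
# system.time(strsplit(long, ",", perl = TRUE))   # uncomment to time it on your system
```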
paste for the reverse, grep and sub for string search and manipulation; further nchar, substr.
‘regular expression’ for the details of the pattern specification.
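As a minimal sketch of paste acting as the reverse operation (the input string is made up):

```r
s <- "alpha,beta,gamma"                     # illustrative input
pieces <- strsplit(s, ",", fixed = TRUE)[[1]]
rejoined <- paste(pieces, collapse = ",")
rejoined
identical(rejoined, s)
```

The round trip is only exact when the string does not end in the separator, because a match at the end of the string produces no final empty piece.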
noquote(strsplit("A text I want to display with spaces", NULL)[[1]])
x <- c(as = "asfef", qu = "qwerty", "yuiop[", "b", "stuff.blah.yech")
# split x on the letter e
strsplit(x,"e")
unlist(strsplit("a.b.c", "."))
## [1] "" "" "" "" ""
## Note that 'split' is a regexp!
## If you really want to split on '.', use
unlist(strsplit("a.b.c", "\\."))
## [1] "a" "b" "c"
## or
unlist(strsplit("a.b.c", ".", fixed = TRUE))
## a useful function: rev() for strings
strReverse <- function(x)
    sapply(lapply(strsplit(x, NULL), rev), paste, collapse="")
strReverse(c("abc", "Statistics"))
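## get the first names of the members of R-core
a <- readLines(file.path(R.home("doc"),"AUTHORS"))[-(1:8)]  # drop the first 8 lines
a <- a[(0:2)-length(a)]   # negative indices: drop the last three lines
(a <- sub(" .*","", a))   # keep only the text before the first space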
# and reverse them
strReverse(a)
## Note that final empty strings are not produced:
strsplit(paste(c("", "a", ""), collapse="#"), split="#")[[1]]
# [1] "" "a"
## and also an empty string is only produced before a definite match:
strsplit("", " ")[[1]] # character(0)
strsplit(" ", " ")[[1]] # [1] ""