Encoding {base}    R Documentation
Read or set the declared encodings for a character vector.
Encoding(x)
Encoding(x) <- value
x       A character vector.
value   A character vector of positive length.
Character strings in R can be declared to be in "latin1" or "UTF-8".
These declarations can be read by Encoding, which returns a character
vector of values "latin1", "UTF-8" or "unknown". They can also be set
with Encoding(x) <- value, in which case value is recycled as needed
and other values are silently treated as "unknown". ASCII strings will
never be marked with a declared encoding, since their representation
is the same in all encodings.
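
The marking rules above can be sketched as follows. The vector contents are illustrative only, and the byte escapes assume the strings are meant as Latin-1: a single value is recycled across x, ASCII elements stay unmarked, and an unrecognized value removes the mark.

```r
# Illustrative vector: one ASCII string and two strings with Latin-1 bytes.
x <- c("abc", "fa\xE7ile", "na\xEFve")

# A length-one value is recycled over all three elements.
Encoding(x) <- "latin1"
enc_marked <- Encoding(x)   # the ASCII element "abc" remains "unknown"

# "ISO-8859-1" is not one of the recognized values, so it acts as "unknown".
Encoding(x) <- "ISO-8859-1"
enc_reset <- Encoding(x)
```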
There are other ways for character strings to acquire a declared
encoding apart from explicitly setting it (and these have changed as
R has evolved). Functions scan, read.table, readLines, and parse have
an encoding argument that is used to declare encodings, iconv declares
encodings from its to argument, and console input in suitable locales
is also declared. intToUtf8 declares its output as "UTF-8", and output
text connections are marked if running in a suitable locale. Under
some circumstances (see its help page) source(encoding=) will mark
encodings of character strings it outputs.
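
A few of these routes can be tried directly. This is a minimal sketch, assuming that the iconv build in use supports conversion between "UTF-8" and "latin1"; the temporary file and all variable names are made up for the illustration.

```r
# intToUtf8() marks its result as UTF-8 (code points for "facile" with c-cedilla).
u <- intToUtf8(c(102, 97, 231, 105, 108, 101))
enc_u <- Encoding(u)

# Converting with iconv() yields a string that carries a declared encoding.
l <- iconv(u, from = "UTF-8", to = "latin1")
enc_l <- Encoding(l)

# readLines(encoding =) declares the encoding of what it reads; it does not
# re-encode. The temporary file holds the Latin-1 bytes of the same word.
tmp <- tempfile()
writeBin(as.raw(c(0x66, 0x61, 0xE7, 0x69, 0x6C, 0x65, 0x0A)), tmp)
r <- readLines(tmp, encoding = "latin1")
enc_r <- Encoding(r)
unlink(tmp)
```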
Most character manipulation functions will set the encoding on output
strings if it was declared on the corresponding input. These include
chartr, strsplit, strtrim, tolower and toupper as well as
sub(useBytes = FALSE) and gsub(useBytes = FALSE). Note that such
functions do not preserve the encoding, but if they know the input
encoding and that the string has been successfully re-encoded to the
current encoding, they mark the output with the latter (if it is
"latin1" or "UTF-8").
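
Because the output mark follows the current encoding, the result of the sketch below depends on the locale of the session, so no expected output is given. The input string is illustrative, and l10n_info() is used only to show which kind of locale is active.

```r
# A Latin-1 string with its encoding declared explicitly.
x <- "fa\xE7ile"
Encoding(x) <- "latin1"

# toupper() works in the current encoding; the mark on its result reflects
# that encoding, not necessarily the mark on the input.
up <- toupper(x)
marks <- c(input = Encoding(x), output = Encoding(up))
marks

# Report whether the session runs in a UTF-8 or Latin-1 locale.
l10n_info()[c("UTF-8", "Latin-1")]
```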
substr does preserve the encoding, and chartr, tolower and toupper
preserve UTF-8 encoding on systems with Unicode wide characters. With
their fixed and perl options, strsplit, sub and gsub will give a
marked UTF-8 result if any of the inputs are UTF-8.
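
A short sketch of these cases, assuming a system with Unicode wide characters (true of the common platforms); the input string and variable names are illustrative. Results that are pure ASCII are never marked, which is why some pieces report "unknown".

```r
# A UTF-8 marked string: "facile" with c-cedilla as its third character.
u <- intToUtf8(c(102, 97, 231, 105, 108, 101))

# substr() keeps the mark while the piece still contains a non-ASCII character.
enc_head  <- Encoding(substr(u, 1, 3))
enc_ascii <- Encoding(substr(u, 1, 2))   # "fa" is pure ASCII

# toupper() keeps the UTF-8 mark of its input.
enc_upper <- Encoding(toupper(u))

# strsplit() with fixed = TRUE marks non-ASCII pieces of a UTF-8 input.
parts <- strsplit(u, "a", fixed = TRUE)[[1]]
enc_parts <- Encoding(parts)
```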
paste and sprintf return a UTF-8 marked element if any of the inputs
to that element are UTF-8.
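
The element-wise rule can be illustrated as below. This is a sketch with made-up inputs, assuming iconv can convert between "UTF-8" and "latin1": one element mixes a Latin-1 and a UTF-8 input, the other is built from ASCII only.

```r
# The same word in two declared encodings.
u <- intToUtf8(c(102, 97, 231, 105, 108, 101))   # marked UTF-8
l <- iconv(u, from = "UTF-8", to = "latin1")     # marked latin1

# Element 1 has a UTF-8 input, element 2 is built from ASCII strings only.
p <- paste(c(l, "plain"), c(u, "ascii"))
enc_p <- Encoding(p)

# sprintf() applies the same rule to each formatted element.
s <- sprintf("%s/%s", l, u)
enc_s <- Encoding(s)
```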
A character vector.
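## x is intended to be in latin1
x <- "fa\xE7ile"
Encoding(x)                        # read the current declaration
Encoding(x) <- "latin1"            # declare the bytes to be Latin-1
x
xx <- iconv(x, "latin1", "UTF-8")  # re-encode the string to UTF-8
Encoding(c(x, xx))                 # one declaration per element
c(x, xx)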