R: Read or Set the Declared Encodings for a Character Vector

Encoding {base}

R Documentation

Description

Read or set the declared encodings for a character vector.

Usage

Encoding(x)

Encoding(x) <- value

Arguments

x

A character vector.

value

A character vector of positive length.

Details

Character strings in R can be declared to be in "latin1" or "UTF-8". Encoding reads these declarations and returns a character vector of values "latin1", "UTF-8" or "unknown". The replacement form sets them: value is recycled as needed, and any other value is silently treated as "unknown". ASCII strings are never marked with a declared encoding, since their representation is the same in all encodings.
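As a minimal sketch of the behaviour described above (the variable names are illustrative only, and the non-ASCII byte is written as an escape so that the code does not depend on the encoding of the file it is stored in):

```r
# Two elements: one pure ASCII, one containing the latin1 byte 0xE7.
x <- c("abc", "fa\xE7ile")

# Replacement form: the single value is recycled across both elements.
Encoding(x) <- "latin1"
enc <- Encoding(x)   # the ASCII element is not marked

# A value other than "latin1" or "UTF-8" is treated as "unknown".
y <- "fa\xE7ile"
Encoding(y) <- "latin1"
Encoding(y) <- "some-other-name"
enc_y <- Encoding(y)
```

Setting a declaration only changes the mark, never the bytes of the string, so it is the caller's responsibility to declare the encoding the bytes are actually in.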

There are other ways for character strings to acquire a declared encoding apart from explicitly setting it (and these have changed as R has evolved). Functions scan, read.table, readLines, and parse have an encoding argument that is used to declare encodings, iconv declares encodings from its to argument, and console input in suitable locales is also declared. intToUtf8 declares its output as "UTF-8", and output text connections are marked if running in a suitable locale. Under some circumstances (see its help page) source(encoding=) will mark encodings of character strings it outputs.
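Two of these routes can be sketched without any file or console input. The snippet below is illustrative only: it builds the same text once from Unicode code points with intToUtf8 and once by converting explicitly declared latin1 bytes with iconv, then inspects the declarations.

```r
# Code points for "fa", c-cedilla (U+00E7) and "ile".
u <- intToUtf8(c(102L, 97L, 231L, 105L, 108L, 101L))
enc_u <- Encoding(u)

# The same text as latin1 bytes, declared explicitly, then converted.
l <- "fa\xE7ile"
Encoding(l) <- "latin1"
v <- iconv(l, "latin1", "UTF-8")
enc_v <- Encoding(v)

# Both routes should yield the same UTF-8 string.
same_text <- identical(u, v)
```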

Most character manipulation functions will set the encoding on output strings if it was declared on the corresponding input. These include chartr, strsplit, strtrim, tolower and toupper as well as sub(useBytes = FALSE) and gsub(useBytes = FALSE). Note that such functions do not preserve the encoding, but if they know the input encoding and that the string has been successfully re-encoded to the current encoding, they mark the output with the latter (if it is "latin1" or "UTF-8").

substr does preserve the encoding, and chartr, tolower and toupper preserve UTF-8 encoding on systems with Unicode wide characters. With their fixed and perl options, strsplit, sub and gsub will give a marked UTF-8 result if any of the inputs are UTF-8.
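A small sketch of how the declaration carries through substr and toupper, assuming a system with Unicode wide characters for the toupper case. Only the declared encodings are inspected here, since the result of case conversion for non-ASCII characters can depend on the platform and locale.

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"

# The substring still contains the non-ASCII byte, so the mark is kept.
enc_sub <- Encoding(substr(x, 1, 3))

# "fa" is pure ASCII, and ASCII strings are never marked.
enc_ascii <- Encoding(substr(x, 1, 2))

# Case conversion of a UTF-8 marked string.
xx <- iconv(x, "latin1", "UTF-8")
enc_up <- Encoding(toupper(xx))
```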

paste and sprintf return a UTF-8 marked element if any of the inputs to that element are UTF-8.
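The element-wise nature of this rule can be sketched as follows (names are illustrative only): only the result elements that received a UTF-8 input are marked.

```r
# "fa" followed by U+00E7, marked as UTF-8 by intToUtf8.
u <- intToUtf8(c(102L, 97L, 231L))

# First element has a UTF-8 input; second is built from ASCII only.
enc_paste <- Encoding(paste(c(u, "abc"), "x"))

# sprintf with a UTF-8 marked argument.
enc_sprintf <- Encoding(sprintf("%s!", u))
```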

Value

A character vector.

Examples

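## x is intended to be in latin1
x <- "fa\xE7ile"
Encoding(x)                        # declaration before it is set explicitly
Encoding(x) <- "latin1"            # declare the bytes to be latin1
x
xx <- iconv(x, "latin1", "UTF-8")  # re-encode from latin1 to UTF-8
Encoding(c(x, xx))                 # one declaration per element
c(x, xx)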

[Package base version 2.9.0 ]