R subsetting is complex but very powerful. Or, it’s powerful but very complex – so complex that some of what I am about to tell you, I learned only this year. (I’ve been using R since 2003.) This page has everything I know about it, from the basics up. If you’re an expert, there still might be something for you here: you might want to skip to the “Advanced” section.
Let’s give ourselves some variables to play with – a simple character vector with some names.
food <- c(
ham = "Ham",
eggs = "Eggs",
tomatoes = "Tomatoes"
)
R knows three basic ways to subset.
The first is the easiest: subsetting with a number n gives you the nth element.
food[1]
ham
"Ham"
If you have a vector of numbers, you get a vector of elements.
food[2:3]
eggs tomatoes
"Eggs" "Tomatoes"
food[c(1, 3)]
ham tomatoes
"Ham" "Tomatoes"
The second is also pretty easy: if you subset with a character vector, you get the element(s) with the corresponding name(s).
food["ham"]
ham
"Ham"
food[c("ham", "eggs")]
ham eggs
"Ham" "Eggs"
You can repeat yourself:
cat(food[c("ham", "ham", "ham")], "! I love ham!")
Ham Ham Ham ! I love ham!
Lastly, you can subset with a logical vector. This is a bit different: it returns every element where the logical vector is TRUE.
allergies <- c(FALSE, TRUE, FALSE)
vegetarian <- c(FALSE, TRUE, TRUE)
vegan <- c(FALSE, FALSE, TRUE)
# Vegetarian food:
food[vegetarian]
eggs tomatoes
"Eggs" "Tomatoes"
# Vegan food:
food[vegan]
tomatoes
"Tomatoes"
Logical subsetting is useful when you want to find things with a particular characteristic:
# Long food names:
nchar(food) > 5
ham eggs tomatoes
FALSE FALSE TRUE
food[nchar(food) > 5]
tomatoes
"Tomatoes"
Those are the three basic kinds of subsetting, depending on the type of object we use as an index.
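To put the three kinds side by side, here is a minimal sketch that picks out the same element by position, by name and by logical vector. The variable names by_position, by_name and by_logical are made up for illustration.

```r
food <- c(ham = "Ham", eggs = "Eggs", tomatoes = "Tomatoes")

by_position <- food[3]                # numeric index
by_name     <- food["tomatoes"]       # character index
by_logical  <- food[nchar(food) > 5]  # logical index

# All three give the same one-element named vector
identical(by_position, by_name)
identical(by_name, by_logical)

# which() converts a logical vector into the positions that are TRUE
which(nchar(food) > 5)
```

If you need positions rather than elements, which() is the usual bridge between logical and numeric indexing.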
But there are also three basic kinds of subsetting on a different dimension – the syntax of the command. So far we’ve just seen one syntax, using [.
[[
If you want to select a single element, you can subset with double brackets: [[]].
food[[1]]
[1] "Ham"
food[["ham"]]
[1] "Ham"
There’s a subtle difference between [ and [[. Can you spot it?
food[1]
ham
"Ham"
food[[1]]
[1] "Ham"
Subsetting with [ keeps the names, subsetting with [[ doesn’t. Actually, it’s a bit more than that. Subsetting with [ really is taking a subset. A subset of a named vector is a smaller named vector. Subsetting with [[ is taking a single element. Mathematicians can think of it like this: [ is to ⊂ as [[ is to ∈.
This gets important when you deal with lists.
details <- list(
name = "Pete",
birthday = as.Date("1993-01-01"),
spouse = "Miranda"
)
Here, details[["spouse"]] will give you the name "Miranda".
wife <- details[["spouse"]]
cat("Pete's wife is", wife, ".\n")
Pete's wife is Miranda .
But details["spouse"] will give you a list containing one element, named spouse. That may not be what you want, because not all R commands accept lists:
list_of_wife <- details["spouse"]
cat("This won't work:", list_of_wife)
This won't work:
Error in cat("This won't work:", list_of_wife): argument 2 (type 'list') cannot be handled by 'cat'
Instead, you can select list_of_wife[[1]] to get Pete’s first (and only) wife.
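If you’re not sure which one you have, you can just ask R. Here is a small sketch using a cut-down copy of the details list; the names sub and elem are made up for illustration.

```r
details <- list(name = "Pete", spouse = "Miranda")

sub  <- details["spouse"]     # single bracket: a list of length 1
elem <- details[["spouse"]]   # double bracket: the element itself

class(sub)    # "list"
class(elem)   # "character"

# Unwrapping the one-element list gives the same thing as [[ directly
identical(sub[[1]], elem)
```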
$
The third kind of subsetting is a close relative of [[. It uses $.
For example, details$spouse is the same as details[["spouse"]].
details$spouse
[1] "Miranda"
The two big differences with [[ are:
$ only takes literal names. So, this works:
date <- "birthday"
details[[date]]
[1] "1993-01-01"
But this doesn’t, because there’s no element literally called date:
details$date
NULL
$ only works with lists. It doesn’t work with atomic vectors (vectors of numbers, characters, etc.). So, this won’t work:
food$ham
Error in food$ham: $ operator is invalid for atomic vectors
$ is basically a useful shorthand for [[, when you can select an element by name.
So far, so easy, but there are already questions. What happens if a logical vector is shorter or longer than what it is indexing?
The answer is that R recycles the logical vector (i.e. repeats it) along the length of the vector being indexed. So, for example, food[TRUE] becomes food[c(TRUE, TRUE, TRUE)], which gives you the whole vector back.
food[TRUE]
ham eggs tomatoes
"Ham" "Eggs" "Tomatoes"
If the recycling doesn’t quite fit, you get a warning:
food[c(TRUE, TRUE)] # 2 into 3 doesn't go
ham eggs tomatoes
"Ham" "Eggs" "Tomatoes"
Oh! No you don’t. I thought you did.
OK, but if your vector is too long, you get a warning:
food[c(TRUE, FALSE, TRUE, TRUE)]
ham tomatoes <NA>
"Ham" "Tomatoes" NA
Damn.
(Goes to ?Extract.)
(Nothing there.)
(Goes to R language definition.)
Ah, OK. If the index is longer than the vector being indexed, then the vector is “conceptually extended with NAs”. So, above, c(TRUE, FALSE, TRUE, TRUE) gave us the first element, not the second element, the third element… and an imaginary “fourth element” which is assumed to be NA.
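Here is a minimal sketch of both behaviours, silent recycling of a short logical index and NA padding for a long one. The names every_other and too_long are made up for illustration.

```r
food <- c(ham = "Ham", eggs = "Eggs", tomatoes = "Tomatoes")

# A short logical index is recycled: c(TRUE, FALSE) acts like
# c(TRUE, FALSE, TRUE), so we get elements 1 and 3, with no warning
every_other <- food[c(TRUE, FALSE)]
print(every_other)

# A long logical index reaches past the end of the vector:
# the imaginary 4th element comes back as NA
too_long <- food[c(FALSE, FALSE, FALSE, TRUE)]
print(too_long)
```

Because neither case warns, it can be worth checking that a logical index has the length you expect before you use it.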
While we’re on the topic, what happens if you have an NA in your index?
food[c(TRUE, NA, TRUE)]
ham <NA> tomatoes
"Ham" NA "Tomatoes"
food[c(1, NA, 3)]
ham <NA> tomatoes
"Ham" NA "Tomatoes"
food[c("ham", NA, "eggs")]
ham <NA> eggs
"Ham" NA "Eggs"
You get an NA in the answer.
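This matters most when the logical index comes from a comparison on data with missing values. Here is a sketch with a made-up price vector where one price is unknown; which() is one common way to keep only the elements that are definitely TRUE.

```r
# Hypothetical prices, with the price of eggs missing
price <- c(ham = 2.00, eggs = NA, tomatoes = 0.80)

cheap <- price < 1          # FALSE, NA, TRUE

with_na <- price[cheap]     # the NA in the index produces an NA element
print(with_na)

# which() keeps only the positions that are TRUE, dropping the NA
only_true <- price[which(cheap)]
print(only_true)

# The same thing, spelled out with is.na()
also_true <- price[!is.na(cheap) & cheap]
```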
And if you have a zero-length index, or NULL, you get zero elements in the response:
food[numeric(0)]
named character(0)
food[character(0)]
named character(0)
food[logical(0)]
named character(0)
food[NULL]
named character(0)
But, what if you have literally nothing in the index?
food[]
ham eggs tomatoes
"Ham" "Eggs" "Tomatoes"
You get everything back. Wat.
We’re almost over the basics, and we’ve just got one more thing to know:
Some objects in R have two dimensions, like matrices and data frames. So you need two indices to take subsets of them.
Here’s the food on sale at the local shop:
(food_data <- data.frame(
name = food,
price = c(2.00, 1.50, 0.80),
vegan = c(FALSE, FALSE, TRUE),
vegetarian = c(FALSE, TRUE, TRUE),
stringsAsFactors = FALSE,
row.names = NULL
))
name price vegan vegetarian
1 Ham 2.0 FALSE FALSE
2 Eggs 1.5 FALSE TRUE
3 Tomatoes 0.8 TRUE TRUE
To get the first two rows and the first four columns of this data frame, we can do:
food_data[1:2, 1:4]
name price vegan vegetarian
1 Ham 2.0 FALSE FALSE
2 Eggs 1.5 FALSE TRUE
The first index gives the rows. The second index gives the columns.
As we’ve included all the columns, there’s a shorthand we could use, which is to leave out one index altogether.
food_data[1:2, ]
name price vegan vegetarian
1 Ham 2.0 FALSE FALSE
2 Eggs 1.5 FALSE TRUE
Aha! Now you see why it made sense to have “nothing” select everything.
All the kinds of subset we used before still work here. For example, we can get rows 1 and 3 like this:
food_data[c(TRUE, FALSE, TRUE), ]
name price vegan vegetarian
1 Ham 2.0 FALSE FALSE
3 Tomatoes 0.8 TRUE TRUE
Or we could get columns like this, by name:
food_data[, c("name", "vegan")]
name vegan
1 Ham FALSE
2 Eggs FALSE
3 Tomatoes TRUE
A typical pattern is to get columns by name and rows using a logical vector. For example, here’s how to get the names and prices of vegetarian food:
food_data[food_data$vegetarian, c("name", "price")]
name price
2 Eggs 1.5
3 Tomatoes 0.8
That seems clear enough… only, what is food_data$vegetarian? It must be giving the logical index that selects only vegetarian food – but didn’t I say that $ only worked for lists?
I did and it does. The reason this works is that a data frame is secretly a list. It’s just a list of columns. So, food_data$vegetarian selects one element – i.e. one column. Here are some more examples:
food_data$name # The 'name' column is a character vector
[1] "Ham" "Eggs" "Tomatoes"
food_data$price # The 'price' column is a numeric vector
[1] 2.0 1.5 0.8
And here’s how to get all foods with a price of over 1.40:
food_data[food_data$price > 1.40, ]
name price vegan vegetarian
1 Ham 2.0 FALSE FALSE
2 Eggs 1.5 FALSE TRUE
As you might expect, the [[ subsetting works just the same way as $. You can use it to select a single element of the data frame – i.e. a single column. Unlike $, you can use it with computed names as well as literal names:
my_diet_preference <- "vegan"
ok_to_eat <- food_data[[my_diet_preference]]
food_data[ok_to_eat, ]
name price vegan vegetarian
3 Tomatoes 0.8 TRUE TRUE
We could have written those last two lines as one line:
food_data[ food_data[[my_diet_preference]], ]
but all those [
characters together can make your eyes start to cross.
You can also select a single column using [. This is useful when you also want to select a subset of rows:
# Names of vegetarian food:
food_data[food_data$vegetarian, "name"]
[1] "Eggs" "Tomatoes"
Now, for some stuff you might not know.
If a data frame is a list, and [ selects subsets of lists, then you might have expected food_data[, "name"] to select a subset of that data frame – i.e. a smaller data frame with one column. That would have been consistent, and it might have saved a lot of pain for R developers over the years… but as you can see above, it didn’t happen. When you selected food_data[, "name"], you just got a vector, not a data frame with one column.
To repeat: if you select a single column of a data frame using [, you get just a vector of data. If you select two or more columns, you get a data frame.
This is a powerful foot shotgun.
Maybe we write a complex computation to select certain columns:
frobnitz_the_zacular <- function () 2:3
calculated_columns <- frobnitz_the_zacular()
food_to_buy <- food_data[, calculated_columns]
nrow(food_to_buy)
[1] 3
Later on, we look through our complex code and realise we can reverse the Hyperlight Drive:
frobnitz_the_zacular <- function () 2
Oh dear.
calculated_columns <- frobnitz_the_zacular()
food_to_buy <- food_data[, calculated_columns]
nrow(food_to_buy)
NULL
The way to avoid this is the special argument drop = FALSE. You can put this after your rows and columns in the subset:
calculated_columns <- frobnitz_the_zacular()
food_to_buy <- food_data[, calculated_columns, drop = FALSE]
nrow(food_to_buy)
[1] 3
You will forget this. And then, you will regret it.
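The same shotgun is pointed at your feet when you work with matrices. Here is a minimal sketch, using a made-up matrix called grid, of what happens when a subset ends up with a single row.

```r
# A hypothetical 2 x 3 matrix with row and column names
grid <- matrix(1:6, nrow = 2,
               dimnames = list(c("a", "b"), c("x", "y", "z")))

row_vec <- grid[1, ]                # dimensions dropped: a plain named vector
row_mat <- grid[1, , drop = FALSE]  # still a matrix, with 1 row and 3 columns

dim(row_vec)   # NULL
dim(row_mat)   # 1 3
```

So any code that calls nrow(), ncol() or dim() on the result of a computed subset is a candidate for drop = FALSE.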
There’s another gotcha with $ subsetting. I’ll just feed my pet owl:
cat("My pet owl says to-whit", pets$owl)
My pet owl says to-whit TO-WRAAAARGH!
Oh… that wasn’t an owl! Unfortunately, my owl escaped from the list, and I tried to feed an owlbear:
str(pets)
List of 1
$ owlbear: chr "TO-WRAAAARGH!"
In other words, $ falls back to partial matching if it can’t find an exact match: owl is the start of owlbear, and no other name in the list starts with owl, so it matches. So you may not always get what you expect.
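Here is a sketch that rebuilds the pets list from the str() output above and compares $ with [[. By default [[ insists on an exact name; its exact argument lets you opt in to partial matching.

```r
# Reconstructed from the str() output: one element, named owlbear
pets <- list(owlbear = "TO-WRAAAARGH!")

dollar <- pets$owl                       # partial match: finds owlbear
strict <- pets[["owl"]]                  # exact match only: NULL
loose  <- pets[["owl", exact = FALSE]]   # partial matching, on request

print(dollar)
print(strict)
print(loose)
```

If this worries you, using [[ with the full name is the defensive choice. R also has a warnPartialMatchDollar option that asks for a warning when $ matches partially on a list.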
Just as you use positive integers to select elements, you can use negative numbers in indices to remove elements.
food[-2]
ham tomatoes
"Ham" "Tomatoes"
food[c(-1, -3)]
eggs
"Eggs"
Of course, this works for data frames too. (I won’t keep saying that from now on.)
# Everything but the first column:
food_data[, -1]
price vegan vegetarian
1 2.0 FALSE FALSE
2 1.5 FALSE TRUE
3 0.8 TRUE TRUE
If you mix positive and negative numbers, you get an error. No, really.
food[c(-2, 2)]
Error in food[c(-2, 2)]: only 0's may be mixed with negative subscripts
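Negative indices only work with numbers, so you can’t write food[-"eggs"]. Here is a sketch of two common ways to drop elements by name instead, plus a pitfall with -which(). The name "spam" is made up, to stand for something that isn’t in the vector.

```r
food <- c(ham = "Ham", eggs = "Eggs", tomatoes = "Tomatoes")

# Keep every name except the ones we want to drop
no_eggs  <- food[setdiff(names(food), "eggs")]
no_eggs2 <- food[!(names(food) %in% "eggs")]
print(no_eggs)

# Pitfall: if nothing matches, which() returns integer(0),
# and -integer(0) is still integer(0), which selects nothing at all
none <- food[-which(names(food) == "spam")]
length(none)   # 0, not 3
```

The logical version with ! doesn’t have this problem, which is one reason to prefer it in code that has to cope with “no matches”.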
What about 0? It turns out that 0 just gets ignored.
food[c(0, 1)]
ham
"Ham"
So far, we have only picked out a “square” of data from a data frame. That is, we can subset rows and columns. But what if we want to have, say, the element [1, 2], the element [3, 3] and the element [3, 4]?
You can do this by indexing with a 2-column matrix. The first column gives the row numbers to pick from the original data, the second column gives the column numbers. Results will be coerced into a vector:
elements <- matrix(
c(1, 2,
3, 3,
3, 4),
nrow = 3,
byrow = TRUE
)
food_data[elements]
[1] "2.0" " TRUE" " TRUE"
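The same trick works on a real matrix, where everything has one type and so nothing needs to be coerced to character. Here is a minimal sketch with a made-up integer matrix called numbers, using the same three row and column pairs.

```r
# A hypothetical 3 x 4 matrix, filled column by column with 1 to 12
numbers <- matrix(1:12, nrow = 3)

# Each row of the index matrix is one (row, column) pair
idx <- cbind(c(1, 3, 3),   # row numbers
             c(2, 3, 4))   # column numbers

picked <- numbers[idx]
print(picked)   # the elements at [1, 2], [3, 3] and [3, 4]
```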
I’ve assumed that one-dimensional objects get one-dimensional indexing, and two-dimensional objects get two-dimensional indexing. But really, all objects in R are one-dimensional. They just pretend to have more dimensions if you ask them nicely.
If you want to, you can just use one-dimensional indexing. We saw this already with food_data$vegan and food_data[["vegan"]]. This treated our data frame as a one-dimensional list of columns. You can do this with [ as well:
food_data[1:2]
name price
1 Ham 2.0
2 Eggs 1.5
3 Tomatoes 0.8
This does just the same thing as food_data[, 1:2]. On the downside, it’s less clear that you are selecting columns not rows.
On the very very up side, the following are not equivalent.
food_data[, "name"] # a vector
[1] "Ham" "Eggs" "Tomatoes"
food_data["name"] # a one-column data frame!
name
1 Ham
2 Eggs
3 Tomatoes
In other words, food_data["name"] doesn’t drop to a vector. As it is also easier to read than food_data[, "name", drop = FALSE], this is a useful trick for defensive code.
But there’s more.
Not everything with 2 dimensions is a data frame. There are also matrices. We can use 1 dimensional indexing on these too.
Here’s the euro.cross data of conversion rates between various European currencies when the € was introduced:
euro.cross <- round(euro.cross, 2)
euro.cross[1:3, 1:3]
ATS BEF DEM
ATS 1.00 2.93 0.14
BEF 0.34 1.00 0.05
DEM 7.04 20.63 1.00
rownames(euro.cross)
[1] "ATS" "BEF" "DEM" "ESP" "FIM" "FRF" "IEP" "ITL" "LUF" "NLG" "PTE"
colnames(euro.cross)
[1] "ATS" "BEF" "DEM" "ESP" "FIM" "FRF" "IEP" "ITL" "LUF" "NLG" "PTE"
We know how to pick out Germany’s currency rates (DEM is the Deutsche Mark) – just do euro.cross["DEM", ]. What if we want to pick out the currencies that are within 80% of each other’s value?
euro.cross[euro.cross < 10/8 & euro.cross > 8/10]
[1] 1.00 1.00 1.00 1.00 0.89 1.00 0.83 1.00 0.91 1.10 1.00 1.00 1.00 1.00
[15] 1.00 1.13 1.00 1.20 1.00
Here is how this works:
euro.cross < 10/8 & euro.cross > 8/10 creates a logical matrix with the same dimensions as euro.cross.
That logical matrix is used as a one-dimensional index into euro.cross.
The dimensions of euro.cross are ignored, and the result is just a one-dimensional vector.
This may not seem so useful – how do we know which elements we’ve got? But it is good if we want to change some elements. For example, we might want to run a counterfactual and look at Eurozone currency rates if these currencies had entered the Euro at parity with each other. Then our data would be:
euro.cross[euro.cross < 10/8 & euro.cross > 8/10] <- 1
euro.cross
ATS BEF DEM ESP FIM FRF IEP ITL LUF NLG PTE
ATS 1.00 2.93 0.14 12.09 0.43 0.48 0.06 140.71 2.93 0.16 14.57
BEF 0.34 1.00 0.05 4.12 0.15 0.16 0.02 48.00 1.00 0.05 4.97
DEM 7.04 20.63 1.00 85.07 3.04 3.35 0.40 990.00 20.63 1.00 102.50
ESP 0.08 0.24 0.01 1.00 0.04 0.04 0.00 11.64 0.24 0.01 1.00
FIM 2.31 6.78 0.33 27.98 1.00 1.00 0.13 325.66 6.78 0.37 33.72
FRF 2.10 6.15 0.30 25.37 1.00 1.00 0.12 295.18 6.15 0.34 30.56
IEP 17.47 51.22 2.48 211.27 7.55 8.33 1.00 2458.56 51.22 2.80 254.56
ITL 0.01 0.02 0.00 0.09 0.00 0.00 0.00 1.00 0.02 0.00 0.10
LUF 0.34 1.00 0.05 4.12 0.15 0.16 0.02 48.00 1.00 0.05 4.97
NLG 6.24 18.31 1.00 75.50 2.70 2.98 0.36 878.64 18.31 1.00 90.97
PTE 0.07 0.20 0.01 1.00 0.03 0.03 0.00 9.66 0.20 0.01 1.00
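If you do want to know which cells a logical matrix picks out, which() with arr.ind = TRUE reports them as row and column numbers. Here is a sketch on a made-up 2 x 2 matrix called rates, which copies the ATS and BEF corner of the rounded table.

```r
# Hypothetical cut-down rate table (values from the ATS/BEF corner)
rates <- matrix(c(1.00, 0.34, 2.93, 1.00), nrow = 2,
                dimnames = list(c("ATS", "BEF"), c("ATS", "BEF")))

near_parity <- rates < 10/8 & rates > 8/10   # a logical matrix

# One row per TRUE cell, with columns named "row" and "col"
cells <- which(near_parity, arr.ind = TRUE)
print(cells)
```

Here only the diagonal qualifies, since each currency is at parity with itself.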
What does it mean to treat a 2D object as a 1D object? That depends.
A data frame is just a list of columns, so indexing in one dimension just gives you columns. For example, food_data[1] is a data frame containing just the first column of food_data.
A matrix isn’t a list of columns; it’s a vector with a dim() attribute. The vector indexing goes down the columns. Here’s an illustration:
rc <- paste(rep(1:3, 3), rep(1:3, each = 3), sep = ",")
m <- matrix(rc, 3, 3)
ht
1,1 | 1,2 | 1,3 |
2,1 | 2,2 | 2,3 |
3,1 | 3,2 | ... |
And here’s how it gets unwound by matrix indexing:
ht2
1 (1,1) |
2 (2,1) |
3 (3,1) |
4 (1,2) |
5 (2,2) |
6 (3,2) |
7 (1,3) |
8 (2,3) |
... |
So, for example, element 5 is the yellow one – row 2 column 2.
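You can check the unwinding yourself. Here is a minimal sketch using the same m; the variables row and col are made up for illustration, and the formula is just the column-by-column counting written out.

```r
rc <- paste(rep(1:3, 3), rep(1:3, each = 3), sep = ",")
m <- matrix(rc, 3, 3)

m[5]                       # element 5 in vector order
identical(m[5], m[2, 2])   # the same cell as row 2, column 2

# Going down the columns, the position of [row, col] is
# row + (col - 1) * number of rows
row <- 3
col <- 2
m[row + (col - 1) * nrow(m)]   # the cell labelled "3,2"
```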
The ability to ignore dimensions is great when you are using functions that don’t preserve dimensions themselves. For example, suppose I want to change the text of this matrix by adding some labels:
paste("Row, column:", m)
[1] "Row, column: 1,1" "Row, column: 2,1" "Row, column: 3,1"
[4] "Row, column: 1,2" "Row, column: 2,2" "Row, column: 3,2"
[7] "Row, column: 1,3" "Row, column: 2,3" "Row, column: 3,3"
Unfortunately, paste doesn’t preserve dimensions, so if I did
m <- paste("Row, column:", m)
I’d have a vector, not the matrix I wanted.
But the result is what I need. It just needs its dimensions back. I could either do dim(m) <- c(3, 3), or, to avoid any errors from getting the dimensions wrong, I can put my data back into the matrix shape with the “empty subset” trick:
m[] <- paste("Row, column:", m)
m[]
[,1] [,2] [,3]
[1,] "Row, column: 1,1" "Row, column: 1,2" "Row, column: 1,3"
[2,] "Row, column: 2,1" "Row, column: 2,2" "Row, column: 2,3"
[3,] "Row, column: 3,1" "Row, column: 3,2" "Row, column: 3,3"
Using m[] <- assigns into every element of m. That is different from m <-, which overwrites the old object with a whole new one.
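The same trick is handy with data frames, because functions like lapply() return a plain list. Here is a sketch with a made-up data frame called scores: assigning into scores[] keeps it a data frame.

```r
# Hypothetical data, just for illustration
scores <- data.frame(a = c(1.234, 5.678), b = c(9.1011, 12.1314))

# lapply() returns a plain list, not a data frame
rounded_list <- lapply(scores, round, 1)
is.data.frame(rounded_list)   # FALSE

# Assigning into every column keeps the data frame structure
scores[] <- lapply(scores, round, 1)
is.data.frame(scores)         # TRUE
print(scores)
```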
In fact, you can mix 2-dimensional subsetting with one dimensional subsetting, at your convenience. For example, suppose I only want the “row, column” labels at the top of every column:
m <- matrix(rc, 3, 3)
# In 1D space:
with_labels <- paste("Row, column:", m)
# In 2D space:
dim(with_labels) <- dim(m)
m[1, ] <- with_labels[1, ]
m
[,1] [,2] [,3]
[1,] "Row, column: 1,1" "Row, column: 1,2" "Row, column: 1,3"
[2,] "2,1" "2,2" "2,3"
[3,] "3,1" "3,2" "3,3"
I used this trick a lot recently, when I rewrote the huxtable package, which produces LaTeX and HTML tables. (Huxtable produced the pictures above.) Obviously table data comes in 2D form, but sometimes you want to work on a subset of it, treating the subset as just a vector, then just assign back into the matrix. The pattern is:
vec_subset <- matrix_data[rows, cols]
# if you need to explicitly remove the dimensions, you can do:
vec_subset <- c(vec_subset)
# Now do a bunch of vector operations:
vec_subset <- manipulate(vec_subset)
# Assign back into the matrix
matrix_data[rows, cols] <- vec_subset
Vector operations are fast, so this can be a nice way to speed up code.
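Here is a runnable sketch of that pattern. The matrix of letters and the use of toupper() as the “manipulation” are stand-ins I made up; any function that returns a vector of the same length would do.

```r
# Hypothetical 3 x 3 matrix of lower-case letters
matrix_data <- matrix(letters[1:9], 3, 3)
rows <- 1:2
cols <- 2:3

# Pull out a 2 x 2 block and flatten it to a plain vector
vec_subset <- c(matrix_data[rows, cols])

# Do the vector operation
vec_subset <- toupper(vec_subset)

# Assign back: the vector fills the block column by column,
# the same order it was taken out in
matrix_data[rows, cols] <- vec_subset
print(matrix_data)
```

Only the block in rows 1 to 2 and columns 2 to 3 changes; the rest of the matrix is untouched.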
You normally only see [[ used with one index, to select from a list. But you can use it with multiple dimensions just like [. It always returns exactly one element:
food_data[[1, "name"]]
[1] "Ham"
So, if you know your code should only want one element, this is a safe way of asking for it.
I never even knew about this until this year, and have never used it.
Lists are good for recursive data structures:
details <- list(
name = "Henry",
birthday = as.Date("1491-07-28"),
wives = list(
Catherine = list(languages = c("Spanish", "English"), children = "Mary"),
Anne = list(motto = "Grogne qui grogne...", children = "Elizabeth"),
Jane = list(children = "Edward"),
Ann = list(from = "Flanders"),
Katherine = list(bad_habits = "flirtatious"),
Catherine = list(religion = "Protestant")
)
)
What if we want to dive into those structures? Say, finding the name of Henry’s second wife’s child?
You can deep dive into a recursive (list-like) object with [[ and a vector index:
details[[c("wives", "Anne", "children")]]
[1] "Elizabeth"
Notice that this only works with recursive objects. If you subset an atomic vector using [[ and a vector index, you’ll get an error.
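Here is a minimal sketch on a small made-up list called nested. It shows that a vector index is shorthand for a chain of [[ calls, that numeric positions work as well as names, and that an atomic vector refuses.

```r
# Hypothetical nested list, three levels deep
nested <- list(a = list(b = list(c = "deep")))

by_names  <- nested[[c("a", "b", "c")]]    # deep indexing by name
chained   <- nested[["a"]][["b"]][["c"]]   # the long way round
by_number <- nested[[c(1, 1, 1)]]          # deep indexing by position

identical(by_names, chained)

# On an atomic vector, a vector index inside [[ is an error
atomic_try <- tryCatch(
  c(x = 1, y = 2)[[c("x", "y")]],
  error = function(e) "error"
)
print(atomic_try)
```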
Here’s a summary of how indexing works:
Vector | Index | [ | [[ |
Atomic | numeric | Subset selected by position | Single element selected by position |
Atomic | character | Subset selected by names | Single element selected by name |
Atomic | logical | Subset where index is TRUE | -- |
Recursive | numeric | Subset selected by position | Single element selected by position |
Recursive | character | Subset selected by names | Single element by name (use vector for deep indexing) |
Recursive | logical | Subset where index is TRUE | -- |
That’s all! Or at least, it’s all I know. Maybe in 15 years, I’ll discover some more tricks. Meanwhile, good luck, and don’t behead yourself.