Introduction
Santoku is a package for cutting data into intervals. It provides
chop()
, a replacement for base R’s cut()
function, as well as several convenience functions to cut different
kinds of intervals.
To install santoku, run:
install.packages("santoku")
Basic usage
Use chop()
like cut()
, to cut numeric data
into intervals between a set of breaks
.
library(santoku)
x <- runif(10, 0, 10)
(chopped <- chop(x, breaks = 0:10))
#> [1] [4, 5) [8, 9) [3, 4) [4, 5) [7, 8) [9, 10] [6, 7) [8, 9) [1, 2)
#> [10] [4, 5)
#> Levels: [1, 2) [3, 4) [4, 5) [6, 7) [7, 8) [8, 9) [9, 10]
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [4, 5)
#> 2 8.970 [8, 9)
#> 3 3.392 [3, 4)
#> 4 4.677 [4, 5)
#> 5 7.057 [7, 8)
#> 6 9.708 [9, 10]
#> 7 6.714 [6, 7)
#> 8 8.377 [8, 9)
#> 9 1.086 [1, 2)
#> 10 4.495 [4, 5)
chop()
returns a factor.
If data is beyond the limits of breaks
, they will be
extended automatically:
chopped <- chop(x, breaks = 3:7)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [4, 5)
#> 2 8.970 [7, 9.708]
#> 3 3.392 [3, 4)
#> 4 4.677 [4, 5)
#> 5 7.057 [7, 9.708]
#> 6 9.708 [7, 9.708]
#> 7 6.714 [6, 7)
#> 8 8.377 [7, 9.708]
#> 9 1.086 [1.086, 3)
#> 10 4.495 [4, 5)
To chop a single number into a separate category, put the number
twice in breaks
:
x_fives <- x
x_fives[1:5] <- 5
chopped <- chop(x_fives, c(2, 5, 5, 8))
data.frame(x_fives, chopped)
#> x_fives chopped
#> 1 5.000 {5}
#> 2 5.000 {5}
#> 3 5.000 {5}
#> 4 5.000 {5}
#> 5 5.000 {5}
#> 6 9.708 [8, 9.708]
#> 7 6.714 (5, 8)
#> 8 8.377 [8, 9.708]
#> 9 1.086 [1.086, 2)
#> 10 4.495 [2, 5)
To quickly produce a table of chopped data, use
tab()
:
More ways to chop
To chop into fixed-width intervals, starting at the minimum value,
use chop_width()
:
chopped <- chop_width(x, 2)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [3.086, 5.086)
#> 2 8.970 [7.086, 9.086)
#> 3 3.392 [3.086, 5.086)
#> 4 4.677 [3.086, 5.086)
#> 5 7.057 [5.086, 7.086)
#> 6 9.708 [9.086, 11.09]
#> 7 6.714 [5.086, 7.086)
#> 8 8.377 [7.086, 9.086)
#> 9 1.086 [1.086, 3.086)
#> 10 4.495 [3.086, 5.086)
To chop into a fixed number of intervals, each with the same width,
use chop_evenly()
:
chopped <- chop_evenly(x, intervals = 3)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [3.96, 6.834)
#> 2 8.970 [6.834, 9.708]
#> 3 3.392 [1.086, 3.96)
#> 4 4.677 [3.96, 6.834)
#> 5 7.057 [6.834, 9.708]
#> 6 9.708 [6.834, 9.708]
#> 7 6.714 [3.96, 6.834)
#> 8 8.377 [6.834, 9.708]
#> 9 1.086 [1.086, 3.96)
#> 10 4.495 [3.96, 6.834)
To chop into groups with a fixed number of members, use
chop_n()
:
chopped <- chop_n(x, 4)
table(chopped)
#> chopped
#> [1.086, 4.978) [4.978, 8.97) [8.97, 9.708]
#> 4 4 2
To chop into a fixed number of groups, each with the same number of
elements, use chop_equally()
:
chopped <- chop_equally(x, groups = 5)
table(chopped)
#> chopped
#> [1.086, 4.275) [4.275, 4.858) [4.858, 6.851) [6.851, 8.495) [8.495, 9.708]
#> 2 2 2 2 2
To chop data up by quantiles, use chop_quantiles()
:
chopped <- chop_quantiles(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [25%, 50%)
#> 2 8.970 [75%, 100%]
#> 3 3.392 [0%, 25%)
#> 4 4.677 [25%, 50%)
#> 5 7.057 [50%, 75%)
#> 6 9.708 [75%, 100%]
#> 7 6.714 [50%, 75%)
#> 8 8.377 [75%, 100%]
#> 9 1.086 [0%, 25%)
#> 10 4.495 [0%, 25%)
To chop data up by proportions of the data range, use
chop_proportions()
:
chopped <- chop_proportions(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [3.242, 5.397)
#> 2 8.970 [7.552, 9.708]
#> 3 3.392 [3.242, 5.397)
#> 4 4.677 [3.242, 5.397)
#> 5 7.057 [5.397, 7.552)
#> 6 9.708 [7.552, 9.708]
#> 7 6.714 [5.397, 7.552)
#> 8 8.377 [7.552, 9.708]
#> 9 1.086 [1.086, 3.242)
#> 10 4.495 [3.242, 5.397)
You can think of these six functions as logically arranged in a table.
To chop into… | Sizing intervals by… | |
---|---|---|
number of elements: | interval width: | |
a specific number of equal intervals… | chop_equally() |
chop_evenly() |
intervals of one specific size… | chop_n() |
chop_width() |
intervals of different specific sizes… | chop_quantiles() |
chop_proportions() |
To chop data by standard deviations around the mean, use
chop_mean_sd()
:
chopped <- chop_mean_sd(x)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [-1 sd, 0 sd)
#> 2 8.970 [1 sd, 2 sd)
#> 3 3.392 [-1 sd, 0 sd)
#> 4 4.677 [-1 sd, 0 sd)
#> 5 7.057 [0 sd, 1 sd)
#> 6 9.708 [1 sd, 2 sd)
#> 7 6.714 [0 sd, 1 sd)
#> 8 8.377 [0 sd, 1 sd)
#> 9 1.086 [-2 sd, -1 sd)
#> 10 4.495 [-1 sd, 0 sd)
To chop data into attractive intervals, use
chop_pretty()
. This selects intervals which are a multiple
of 2, 5 or 10. It’s useful for producing bar plots.
chopped <- chop_pretty(x)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [4, 6)
#> 2 8.970 [8, 10]
#> 3 3.392 [2, 4)
#> 4 4.677 [4, 6)
#> 5 7.057 [6, 8)
#> 6 9.708 [8, 10]
#> 7 6.714 [6, 8)
#> 8 8.377 [8, 10]
#> 9 1.086 [0, 2)
#> 10 4.495 [4, 6)
tab_n()
, tab_width()
, and friends act
similarly to tab()
, calling the related chop_*
function and then table()
on the result.
tab_n(x, 4)
#> [1.086, 4.978) [4.978, 8.97) [8.97, 9.708]
#> 4 4 2
tab_width(x, 2)
#> [1.086, 3.086) [3.086, 5.086) [5.086, 7.086) [7.086, 9.086) [9.086, 11.09]
#> 1 4 2 2 1
tab_evenly(x, 5)
#> [1.086, 2.81) [2.81, 4.535) [4.535, 6.259) [6.259, 7.983) [7.983, 9.708]
#> 1 2 2 2 3
tab_mean_sd(x)
#> [-2 sd, -1 sd) [-1 sd, 0 sd) [0 sd, 1 sd) [1 sd, 2 sd)
#> 1 4 3 2
Specifying labels
By default, santoku labels intervals using mathematical notation:
-
[0, 1]
means all numbers between 0 and 1 inclusive. -
(0, 1)
means all numbers strictly between 0 and 1, not including the endpoints. -
[0, 1)
means all numbers between 0 and 1, including 0 but not 1. -
(0, 1]
means all numbers between 0 and 1, including 1 but not 0. -
{0}
means just the number 0.
To override these labels, provide names to the breaks
argument:
chopped <- chop(x, c(Lowest = 1, Low = 2, Higher = 5, Highest = 8))
data.frame(x, chopped)
#> x chopped
#> 1 4.978 Low
#> 2 8.970 Highest
#> 3 3.392 Low
#> 4 4.677 Low
#> 5 7.057 Higher
#> 6 9.708 Highest
#> 7 6.714 Higher
#> 8 8.377 Highest
#> 9 1.086 Lowest
#> 10 4.495 Low
Or, you can specify factor labels with the labels
argument:
chopped <- chop(x, c(2, 5, 8), labels = c("Lowest", "Low", "Higher", "Highest"))
data.frame(x, chopped)
#> x chopped
#> 1 4.978 Low
#> 2 8.970 Highest
#> 3 3.392 Low
#> 4 4.677 Low
#> 5 7.057 Higher
#> 6 9.708 Highest
#> 7 6.714 Higher
#> 8 8.377 Highest
#> 9 1.086 Lowest
#> 10 4.495 Low
You need as many labels as there are intervals - one fewer than
length(breaks)
if your data doesn’t extend beyond
breaks
, one more than length(breaks)
if it
does.
To label intervals with a dash, use lbl_dash()
:
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash())
data.frame(x, chopped)
#> x chopped
#> 1 4.978 2—5
#> 2 8.970 8—9.708
#> 3 3.392 2—5
#> 4 4.677 2—5
#> 5 7.057 5—8
#> 6 9.708 8—9.708
#> 7 6.714 5—8
#> 8 8.377 8—9.708
#> 9 1.086 1.086—2
#> 10 4.495 2—5
To label integer data, use lbl_discrete()
. It uses more
informative right endpoints:
chopped <- chop(1:10, c(2, 5, 8), labels = lbl_discrete())
chopped2 <- chop(1:10, c(2, 5, 8), labels = lbl_dash())
data.frame(x = 1:10, lbl_discrete = chopped, lbl_dash = chopped2)
#> x lbl_discrete lbl_dash
#> 1 1 1 1—2
#> 2 2 2—4 2—5
#> 3 3 2—4 2—5
#> 4 4 2—4 2—5
#> 5 5 5—7 5—8
#> 6 6 5—7 5—8
#> 7 7 5—7 5—8
#> 8 8 8—10 8—10
#> 9 9 8—10 8—10
#> 10 10 8—10 8—10
You can customize the first or last labels:
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash(first = "< 2", last = "8+"))
data.frame(x, chopped)
#> x chopped
#> 1 4.978 2—5
#> 2 8.970 8+
#> 3 3.392 2—5
#> 4 4.677 2—5
#> 5 7.057 5—8
#> 6 9.708 8+
#> 7 6.714 5—8
#> 8 8.377 8+
#> 9 1.086 < 2
#> 10 4.495 2—5
To label intervals in order use lbl_seq()
:
chopped <- chop(x, c(2, 5, 8), labels = lbl_seq())
data.frame(x, chopped)
#> x chopped
#> 1 4.978 b
#> 2 8.970 d
#> 3 3.392 b
#> 4 4.677 b
#> 5 7.057 c
#> 6 9.708 d
#> 7 6.714 c
#> 8 8.377 d
#> 9 1.086 a
#> 10 4.495 b
You can use numerals or even roman numerals:
chop(x, c(2, 5, 8), labels = lbl_seq("(1)"))
#> [1] (2) (4) (2) (2) (3) (4) (3) (4) (1) (2)
#> Levels: (1) (2) (3) (4)
chop(x, c(2, 5, 8), labels = lbl_seq("i."))
#> [1] ii. iv. ii. ii. iii. iv. iii. iv. i. ii.
#> Levels: i. ii. iii. iv.
Other labelling functions include:
-
lbl_endpoints()
- use left endpoints as labels -
lbl_midpoints()
- use interval midpoints as labels -
lbl_glue()
- specify labels flexibly with the glue package
Specifying breaks
By default, chop()
extends breaks
if
necessary. If you don’t want that, set extend = FALSE
:
chopped <- chop(x, c(3, 5, 7), extend = FALSE)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [3, 5)
#> 2 8.970 <NA>
#> 3 3.392 [3, 5)
#> 4 4.677 [3, 5)
#> 5 7.057 <NA>
#> 6 9.708 <NA>
#> 7 6.714 [5, 7]
#> 8 8.377 <NA>
#> 9 1.086 <NA>
#> 10 4.495 [3, 5)
Data outside the range of breaks
will become
NA
.
By default, intervals are closed on the left, i.e. they include their
left endpoints. If you want right-closed intervals, set
left = FALSE
:
y <- 1:5
data.frame(
y = y,
left_closed = chop(y, 1:5),
right_closed = chop(y, 1:5, left = FALSE)
)
#> y left_closed right_closed
#> 1 1 [1, 2) [1, 2]
#> 2 2 [2, 3) [1, 2]
#> 3 3 [3, 4) (2, 3]
#> 4 4 [4, 5] (3, 4]
#> 5 5 [4, 5] (4, 5]
By default, the last interval is closed on both ends. If you want to
keep the last interval open at the end, set
close_end = FALSE
:
data.frame(
y = y,
end_closed = chop(y, 1:5),
end_open = chop(y, 1:5, close_end = FALSE)
)
#> y end_closed end_open
#> 1 1 [1, 2) [1, 2)
#> 2 2 [2, 3) [2, 3)
#> 3 3 [3, 4) [3, 4)
#> 4 4 [4, 5] [4, 5)
#> 5 5 [4, 5] {5}
Chopping dates, times and other vectors
You can chop many kinds of vectors with santoku, including Date objects…
y2k <- as.Date("2000-01-01") + 0:10 * 7
data.frame(
y2k = y2k,
chopped = chop(y2k, as.Date(c("2000-02-01", "2000-03-01")))
)
#> y2k chopped
#> 1 2000-01-01 [2000-01-01, 2000-02-01)
#> 2 2000-01-08 [2000-01-01, 2000-02-01)
#> 3 2000-01-15 [2000-01-01, 2000-02-01)
#> 4 2000-01-22 [2000-01-01, 2000-02-01)
#> 5 2000-01-29 [2000-01-01, 2000-02-01)
#> 6 2000-02-05 [2000-02-01, 2000-03-01)
#> 7 2000-02-12 [2000-02-01, 2000-03-01)
#> 8 2000-02-19 [2000-02-01, 2000-03-01)
#> 9 2000-02-26 [2000-02-01, 2000-03-01)
#> 10 2000-03-04 [2000-03-01, 2000-03-11]
#> 11 2000-03-11 [2000-03-01, 2000-03-11]
… and POSIXct (date-time) objects:
# hours of the 2020 Crew Dragon flight:
crew_dragon <- seq(as.POSIXct("2020-05-30 18:00", tz = "GMT"),
length.out = 24, by = "hours")
liftoff <- as.POSIXct("2020-05-30 15:22", tz = "America/New_York")
dock <- as.POSIXct("2020-05-31 10:16", tz = "America/New_York")
data.frame(
crew_dragon = crew_dragon,
chopped = chop(crew_dragon, c(liftoff, dock),
labels = c("pre-flight", "flight", "docked"))
)
#> Warning in .check_tzones(e1, e2): 'tzone' attributes are inconsistent
#> Warning in .check_tzones(e1, e2): 'tzone' attributes are inconsistent
#> crew_dragon chopped
#> 1 2020-05-30 18:00:00 pre-flight
#> 2 2020-05-30 19:00:00 pre-flight
#> 3 2020-05-30 20:00:00 flight
#> 4 2020-05-30 21:00:00 flight
#> 5 2020-05-30 22:00:00 flight
#> 6 2020-05-30 23:00:00 flight
#> 7 2020-05-31 00:00:00 flight
#> 8 2020-05-31 01:00:00 flight
#> 9 2020-05-31 02:00:00 flight
#> 10 2020-05-31 03:00:00 flight
#> 11 2020-05-31 04:00:00 flight
#> 12 2020-05-31 05:00:00 flight
#> 13 2020-05-31 06:00:00 flight
#> 14 2020-05-31 07:00:00 flight
#> 15 2020-05-31 08:00:00 flight
#> 16 2020-05-31 09:00:00 flight
#> 17 2020-05-31 10:00:00 flight
#> 18 2020-05-31 11:00:00 flight
#> 19 2020-05-31 12:00:00 flight
#> 20 2020-05-31 13:00:00 flight
#> 21 2020-05-31 14:00:00 flight
#> 22 2020-05-31 15:00:00 docked
#> 23 2020-05-31 16:00:00 docked
#> 24 2020-05-31 17:00:00 docked
Note how santoku correctly handles the different timezones.
You can use chop_width()
with objects from the
lubridate
package, to chop by irregular periods such as
months:
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
data.frame(
y2k = y2k,
chopped = chop_width(y2k, months(1))
)
#> y2k chopped
#> 1 2000-01-01 [2000-01-01, 2000-02-01)
#> 2 2000-01-08 [2000-01-01, 2000-02-01)
#> 3 2000-01-15 [2000-01-01, 2000-02-01)
#> 4 2000-01-22 [2000-01-01, 2000-02-01)
#> 5 2000-01-29 [2000-01-01, 2000-02-01)
#> 6 2000-02-05 [2000-02-01, 2000-03-01)
#> 7 2000-02-12 [2000-02-01, 2000-03-01)
#> 8 2000-02-19 [2000-02-01, 2000-03-01)
#> 9 2000-02-26 [2000-02-01, 2000-03-01)
#> 10 2000-03-04 [2000-03-01, 2000-04-01)
#> 11 2000-03-11 [2000-03-01, 2000-04-01)
You can format labels using format strings from
strptime()
. lbl_discrete()
is useful here:
data.frame(
y2k = y2k,
chopped = chop_width(y2k, months(1), labels = lbl_discrete(fmt = "%e %b"))
)
#> y2k chopped
#> 1 2000-01-01 1 Jan—31 Jan
#> 2 2000-01-08 1 Jan—31 Jan
#> 3 2000-01-15 1 Jan—31 Jan
#> 4 2000-01-22 1 Jan—31 Jan
#> 5 2000-01-29 1 Jan—31 Jan
#> 6 2000-02-05 1 Feb—29 Feb
#> 7 2000-02-12 1 Feb—29 Feb
#> 8 2000-02-19 1 Feb—29 Feb
#> 9 2000-02-26 1 Feb—29 Feb
#> 10 2000-03-04 1 Mar—31 Mar
#> 11 2000-03-11 1 Mar—31 Mar
You can also chop vectors with units, using the units
package:
library(units)
#> udunits database from /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/units/share/udunits/udunits2.xml
x <- set_units(1:10 * 10, cm)
br <- set_units(1:3, ft)
data.frame(
x = x,
chopped = chop(x, br)
)
#> x chopped
#> 1 10 [cm] [ 10.00 [cm], 30.48 [cm])
#> 2 20 [cm] [ 10.00 [cm], 30.48 [cm])
#> 3 30 [cm] [ 10.00 [cm], 30.48 [cm])
#> 4 40 [cm] [ 30.48 [cm], 60.96 [cm])
#> 5 50 [cm] [ 30.48 [cm], 60.96 [cm])
#> 6 60 [cm] [ 30.48 [cm], 60.96 [cm])
#> 7 70 [cm] [ 60.96 [cm], 91.44 [cm])
#> 8 80 [cm] [ 60.96 [cm], 91.44 [cm])
#> 9 90 [cm] [ 60.96 [cm], 91.44 [cm])
#> 10 100 [cm] [ 91.44 [cm], 100.00 [cm]]
You should be able to chop anything that has a comparison operator. You can even chop character data using lexical ordering. By default santoku emits a warning in this case, to avoid accidentally misinterpreting results:
chop(letters[1:10], c("d", "f"))
#> Warning in categorize_non_numeric(x, breaks, left): `x` or `breaks` is of type
#> character, using lexical sorting. To turn off this warning, run:
#> options(santoku.warn_character = FALSE)
#> [1] [a, d) [a, d) [a, d) [d, f) [d, f) [f, j] [f, j] [f, j] [f, j] [f, j]
#> Levels: [a, d) [d, f) [f, j]
If you find a type of data that you can’t chop, please file an issue.