Introduction to santoku • santoku

Introduction

Santoku is a package for cutting data into intervals. It provides chop(), a replacement for base R’s cut() function, as well as several convenience functions to cut different kinds of intervals.

To install santoku, run:

install.packages("santoku")

Basic usage

Use chop() like cut(), to cut numeric data into intervals between a set of breaks.

library(santoku)

x <- runif(10, 0, 10)
(chopped <- chop(x, breaks = 0:10))
#>  [1] [4, 5)  [8, 9)  [3, 4)  [4, 5)  [7, 8)  [9, 10] [6, 7)  [8, 9)  [1, 2) 
#> [10] [4, 5) 
#> Levels: [1, 2) [3, 4) [4, 5) [6, 7) [7, 8) [8, 9) [9, 10]
data.frame(x, chopped)
#>        x chopped
#> 1  4.978  [4, 5)
#> 2  8.970  [8, 9)
#> 3  3.392  [3, 4)
#> 4  4.677  [4, 5)
#> 5  7.057  [7, 8)
#> 6  9.708 [9, 10]
#> 7  6.714  [6, 7)
#> 8  8.377  [8, 9)
#> 9  1.086  [1, 2)
#> 10 4.495  [4, 5)

chop() returns a factor.

If data is beyond the limits of breaks, they will be extended automatically:

chopped <- chop(x, breaks = 3:7)
data.frame(x, chopped)
#>        x    chopped
#> 1  4.978     [4, 5)
#> 2  8.970 [7, 9.708]
#> 3  3.392     [3, 4)
#> 4  4.677     [4, 5)
#> 5  7.057 [7, 9.708]
#> 6  9.708 [7, 9.708]
#> 7  6.714     [6, 7)
#> 8  8.377 [7, 9.708]
#> 9  1.086 [1.086, 3)
#> 10 4.495     [4, 5)

To chop a single number into a separate category, put the number twice in breaks:

x_fives <- x
x_fives[1:5] <- 5
chopped <- chop(x_fives, c(2, 5, 5, 8))
data.frame(x_fives, chopped)
#>    x_fives    chopped
#> 1    5.000        {5}
#> 2    5.000        {5}
#> 3    5.000        {5}
#> 4    5.000        {5}
#> 5    5.000        {5}
#> 6    9.708 [8, 9.708]
#> 7    6.714     (5, 8)
#> 8    8.377 [8, 9.708]
#> 9    1.086 [1.086, 2)
#> 10   4.495     [2, 5)

To quickly produce a table of chopped data, use tab():

tab(1:10, c(2, 5, 8))
#>  [1, 2)  [2, 5)  [5, 8) [8, 10] 
#>       1       3       3       3

Chopping by width and number of elements

To chop into fixed-width intervals, starting at the minimum value, use chop_width():

chopped <- chop_width(x, 2)
data.frame(x, chopped)
#>        x        chopped
#> 1  4.978 [3.086, 5.086)
#> 2  8.970 [7.086, 9.086)
#> 3  3.392 [3.086, 5.086)
#> 4  4.677 [3.086, 5.086)
#> 5  7.057 [5.086, 7.086)
#> 6  9.708 [9.086, 11.09]
#> 7  6.714 [5.086, 7.086)
#> 8  8.377 [7.086, 9.086)
#> 9  1.086 [1.086, 3.086)
#> 10 4.495 [3.086, 5.086)

To chop into a fixed number of intervals, each with the same width, use chop_evenly():

chopped <- chop_evenly(x, intervals = 3)
data.frame(x, chopped)
#>        x        chopped
#> 1  4.978  [3.96, 6.834)
#> 2  8.970 [6.834, 9.708]
#> 3  3.392  [1.086, 3.96)
#> 4  4.677  [3.96, 6.834)
#> 5  7.057 [6.834, 9.708]
#> 6  9.708 [6.834, 9.708]
#> 7  6.714  [3.96, 6.834)
#> 8  8.377 [6.834, 9.708]
#> 9  1.086  [1.086, 3.96)
#> 10 4.495  [3.96, 6.834)

To chop into groups with a fixed number of elements, use chop_n():

chopped <- chop_n(x, 4)
table(chopped)
#> chopped
#> [1.086, 4.978)  [4.978, 8.97)  [8.97, 9.708] 
#>              4              4              2

To chop into a fixed number of groups, each with the same number of elements, use chop_equally():

chopped <- chop_equally(x, groups = 5)
table(chopped)
#> chopped
#> [1.086, 4.275) [4.275, 4.858) [4.858, 6.851) [6.851, 8.495) [8.495, 9.708] 
#>              2              2              2              2              2

To chop data up by quantiles, use chop_quantiles():

chopped <- chop_quantiles(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
#>        x     chopped
#> 1  4.978  [25%, 50%)
#> 2  8.970 [75%, 100%]
#> 3  3.392   [0%, 25%)
#> 4  4.677  [25%, 50%)
#> 5  7.057  [50%, 75%)
#> 6  9.708 [75%, 100%]
#> 7  6.714  [50%, 75%)
#> 8  8.377 [75%, 100%]
#> 9  1.086   [0%, 25%)
#> 10 4.495   [0%, 25%)

To chop data up by proportions of the data range, use chop_proportions():

chopped <- chop_proportions(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
#>        x        chopped
#> 1  4.978 [3.242, 5.397)
#> 2  8.970 [7.552, 9.708]
#> 3  3.392 [3.242, 5.397)
#> 4  4.677 [3.242, 5.397)
#> 5  7.057 [5.397, 7.552)
#> 6  9.708 [7.552, 9.708]
#> 7  6.714 [5.397, 7.552)
#> 8  8.377 [7.552, 9.708]
#> 9  1.086 [1.086, 3.242)
#> 10 4.495 [3.242, 5.397)

You can think of these six functions as logically arranged in a table.

Different ways to chop by size
To chop into…	Sizing intervals by…
	number of elements:	interval width:
a specific number of equal intervals…	`chop_equally()`	`chop_evenly()`
intervals of one specific size…	`chop_n()`	`chop_width()`
intervals of different specific sizes…	`chop_quantiles()`	`chop_proportions()`

Even more ways to chop

To chop data by standard deviations around the mean, use chop_mean_sd():

chopped <- chop_mean_sd(x)
data.frame(x, chopped)
#>        x        chopped
#> 1  4.978  [-1 sd, 0 sd)
#> 2  8.970   [1 sd, 2 sd)
#> 3  3.392  [-1 sd, 0 sd)
#> 4  4.677  [-1 sd, 0 sd)
#> 5  7.057   [0 sd, 1 sd)
#> 6  9.708   [1 sd, 2 sd)
#> 7  6.714   [0 sd, 1 sd)
#> 8  8.377   [0 sd, 1 sd)
#> 9  1.086 [-2 sd, -1 sd)
#> 10 4.495  [-1 sd, 0 sd)

To chop data into attractive intervals, use chop_pretty(). This selects intervals which are a multiple of 2, 5 or 10. It’s useful for producing bar plots.

chopped <- chop_pretty(x)
data.frame(x, chopped)
#>        x chopped
#> 1  4.978  [4, 6)
#> 2  8.970 [8, 10]
#> 3  3.392  [2, 4)
#> 4  4.677  [4, 6)
#> 5  7.057  [6, 8)
#> 6  9.708 [8, 10]
#> 7  6.714  [6, 8)
#> 8  8.377 [8, 10]
#> 9  1.086  [0, 2)
#> 10 4.495  [4, 6)

Isolating common values

In exploratory work, it’s sometimes useful to find common values and treat them differently. You can use dissect() to do this:

x_spike <- rnorm(100)
x_spike[1:50] <- x_spike[1]

chopped <- dissect(x_spike, -3:3, prop = 0.1)
table(chopped)
#> chopped
#> [-3, -2) [-2, -1)  [-1, 0)   [0, 1) {0.6996}   [1, 2) 
#>        2        5       15       18       50       10

prop = 0.2 will put any unique value of x into its own separate category if it makes up at least 20% of the data.

Note that unlike all the other chop_* functions, dissect() doesn’t always categorize x into ordered, connected intervals. To remind you of this, it is named differently. If you want to create separate intervals on the left and right of common elements, use chop_spikes():

chopped <- chop_spikes(x_spike, -3:3, prop = 0.1)
table(chopped)
#> chopped
#>    [-3, -2)    [-2, -1)     [-1, 0) [0, 0.6996)    {0.6996} (0.6996, 1) 
#>           2           5          15          13          50           5 
#>      [1, 2) 
#>          10

Compare this to the table before. There are two intervals on either side of the common value, instead of one interval surrounding it.

Quick tables

tab_n(), tab_width(), and friends act similarly to tab(), calling the related chop_* function and then table() on the result.

tab_n(x, 4)
#> [1.086, 4.978)  [4.978, 8.97)  [8.97, 9.708] 
#>              4              4              2
tab_width(x, 2)
#> [1.086, 3.086) [3.086, 5.086) [5.086, 7.086) [7.086, 9.086) [9.086, 11.09] 
#>              1              4              2              2              1
tab_evenly(x, 5)
#>  [1.086, 2.81)  [2.81, 4.535) [4.535, 6.259) [6.259, 7.983) [7.983, 9.708] 
#>              1              2              2              2              3
tab_mean_sd(x)
#> [-2 sd, -1 sd)  [-1 sd, 0 sd)   [0 sd, 1 sd)   [1 sd, 2 sd) 
#>              1              4              3              2

Specifying labels

By default, santoku labels intervals using mathematical notation:

[0, 1] means all numbers between 0 and 1 inclusive.
(0, 1) means all numbers strictly between 0 and 1, not including the endpoints.
[0, 1) means all numbers between 0 and 1, including 0 but not 1.
(0, 1] means all numbers between 0 and 1, including 1 but not 0.
{0} means just the number 0.

To override these labels, provide names to the breaks argument:

chopped <- chop(x, c(Lowest = 1, Low = 2, Higher = 5, Highest = 8))
data.frame(x, chopped)
#>        x chopped
#> 1  4.978     Low
#> 2  8.970 Highest
#> 3  3.392     Low
#> 4  4.677     Low
#> 5  7.057  Higher
#> 6  9.708 Highest
#> 7  6.714  Higher
#> 8  8.377 Highest
#> 9  1.086  Lowest
#> 10 4.495     Low

Or, you can specify factor labels with the labels argument:

chopped <- chop(x, c(2, 5, 8), labels = c("Lowest", "Low", "Higher", "Highest"))
data.frame(x, chopped)
#>        x chopped
#> 1  4.978     Low
#> 2  8.970 Highest
#> 3  3.392     Low
#> 4  4.677     Low
#> 5  7.057  Higher
#> 6  9.708 Highest
#> 7  6.714  Higher
#> 8  8.377 Highest
#> 9  1.086  Lowest
#> 10 4.495     Low

You need as many labels as there are intervals - one fewer than length(breaks) if your data doesn’t extend beyond breaks, one more than length(breaks) if it does.

To label intervals with a dash, use lbl_dash():

chopped <- chop(x, c(2, 5, 8), labels = lbl_dash())
data.frame(x, chopped)
#>        x chopped
#> 1  4.978     2—5
#> 2  8.970 8—9.708
#> 3  3.392     2—5
#> 4  4.677     2—5
#> 5  7.057     5—8
#> 6  9.708 8—9.708
#> 7  6.714     5—8
#> 8  8.377 8—9.708
#> 9  1.086 1.086—2
#> 10 4.495     2—5

To label integer data, use lbl_discrete(). It uses more informative right endpoints:

chopped  <- chop(1:10, c(2, 5, 8), labels = lbl_discrete())
chopped2 <- chop(1:10, c(2, 5, 8), labels = lbl_dash())
data.frame(x = 1:10, lbl_discrete = chopped, lbl_dash = chopped2)
#>     x lbl_discrete lbl_dash
#> 1   1            1      1—2
#> 2   2          2—4      2—5
#> 3   3          2—4      2—5
#> 4   4          2—4      2—5
#> 5   5          5—7      5—8
#> 6   6          5—7      5—8
#> 7   7          5—7      5—8
#> 8   8         8—10     8—10
#> 9   9         8—10     8—10
#> 10 10         8—10     8—10

You can customize the first or last labels:

chopped <- chop(x, c(2, 5, 8), labels = lbl_dash(first = "< 2", last = "8+"))
data.frame(x, chopped)
#>        x chopped
#> 1  4.978     2—5
#> 2  8.970      8+
#> 3  3.392     2—5
#> 4  4.677     2—5
#> 5  7.057     5—8
#> 6  9.708      8+
#> 7  6.714     5—8
#> 8  8.377      8+
#> 9  1.086     < 2
#> 10 4.495     2—5

To label intervals in order use lbl_seq():

chopped <- chop(x, c(2, 5, 8), labels = lbl_seq())
data.frame(x, chopped)
#>        x chopped
#> 1  4.978       b
#> 2  8.970       d
#> 3  3.392       b
#> 4  4.677       b
#> 5  7.057       c
#> 6  9.708       d
#> 7  6.714       c
#> 8  8.377       d
#> 9  1.086       a
#> 10 4.495       b

You can use numerals or even roman numerals:

chop(x, c(2, 5, 8), labels = lbl_seq("(1)"))
#>  [1] (2) (4) (2) (2) (3) (4) (3) (4) (1) (2)
#> Levels: (1) (2) (3) (4)
chop(x, c(2, 5, 8), labels = lbl_seq("i."))
#>  [1] ii.  iv.  ii.  ii.  iii. iv.  iii. iv.  i.   ii. 
#> Levels: i. ii. iii. iv.

Other labelling functions include:

lbl_endpoints() - use left endpoints as labels
lbl_midpoints() - use interval midpoints as labels
lbl_glue() - specify labels flexibly with the glue package

Specifying breaks

By default, chop() extends breaks if necessary. If you don’t want that, set extend = FALSE:

chopped <- chop(x, c(3, 5, 7), extend = FALSE)
data.frame(x, chopped)
#>        x chopped
#> 1  4.978  [3, 5)
#> 2  8.970    <NA>
#> 3  3.392  [3, 5)
#> 4  4.677  [3, 5)
#> 5  7.057    <NA>
#> 6  9.708    <NA>
#> 7  6.714  [5, 7]
#> 8  8.377    <NA>
#> 9  1.086    <NA>
#> 10 4.495  [3, 5)

Data outside the range of breaks will become NA.

By default, intervals are closed on the left, i.e. they include their left endpoints. If you want right-closed intervals, set left = FALSE:

y <- 1:5
data.frame(
        y = y, 
        left_closed = chop(y, 1:5), 
        right_closed = chop(y, 1:5, left = FALSE)
      )
#>   y left_closed right_closed
#> 1 1      [1, 2)       [1, 2]
#> 2 2      [2, 3)       [1, 2]
#> 3 3      [3, 4)       (2, 3]
#> 4 4      [4, 5]       (3, 4]
#> 5 5      [4, 5]       (4, 5]

By default, the last interval is closed on both ends. If you want to keep the last interval open at the end, set close_end = FALSE:

data.frame(
  y = y,
  end_closed = chop(y, 1:5),
  end_open   = chop(y, 1:5, close_end = FALSE)
)
#>   y end_closed end_open
#> 1 1     [1, 2)   [1, 2)
#> 2 2     [2, 3)   [2, 3)
#> 3 3     [3, 4)   [3, 4)
#> 4 4     [4, 5]   [4, 5)
#> 5 5     [4, 5]      {5}

Chopping dates, times and other vectors

You can chop many kinds of vectors with santoku, including Date objects…

y2k <- as.Date("2000-01-01") + 0:10 * 7
data.frame(
  y2k = y2k,
  chopped = chop(y2k, as.Date(c("2000-02-01", "2000-03-01")))
)
#>           y2k                  chopped
#> 1  2000-01-01 [2000-01-01, 2000-02-01)
#> 2  2000-01-08 [2000-01-01, 2000-02-01)
#> 3  2000-01-15 [2000-01-01, 2000-02-01)
#> 4  2000-01-22 [2000-01-01, 2000-02-01)
#> 5  2000-01-29 [2000-01-01, 2000-02-01)
#> 6  2000-02-05 [2000-02-01, 2000-03-01)
#> 7  2000-02-12 [2000-02-01, 2000-03-01)
#> 8  2000-02-19 [2000-02-01, 2000-03-01)
#> 9  2000-02-26 [2000-02-01, 2000-03-01)
#> 10 2000-03-04 [2000-03-01, 2000-03-11]
#> 11 2000-03-11 [2000-03-01, 2000-03-11]

… and POSIXct (date-time) objects:

# hours of the 2020 Crew Dragon flight:
crew_dragon <- seq(as.POSIXct("2020-05-30 18:00", tz = "GMT"), 
                     length.out = 24, by = "hours")
liftoff <- as.POSIXct("2020-05-30 15:22", tz = "America/New_York")
dock    <- as.POSIXct("2020-05-31 10:16", tz = "America/New_York")

data.frame(
  crew_dragon = crew_dragon,
  chopped = chop(crew_dragon, c(liftoff, dock), 
                   labels = c("pre-flight", "flight", "docked"))
)
#> Warning in .check_tzones(e1, e2): 'tzone' attributes are inconsistent
#> Warning in .check_tzones(e1, e2): 'tzone' attributes are inconsistent
#>            crew_dragon    chopped
#> 1  2020-05-30 18:00:00 pre-flight
#> 2  2020-05-30 19:00:00 pre-flight
#> 3  2020-05-30 20:00:00     flight
#> 4  2020-05-30 21:00:00     flight
#> 5  2020-05-30 22:00:00     flight
#> 6  2020-05-30 23:00:00     flight
#> 7  2020-05-31 00:00:00     flight
#> 8  2020-05-31 01:00:00     flight
#> 9  2020-05-31 02:00:00     flight
#> 10 2020-05-31 03:00:00     flight
#> 11 2020-05-31 04:00:00     flight
#> 12 2020-05-31 05:00:00     flight
#> 13 2020-05-31 06:00:00     flight
#> 14 2020-05-31 07:00:00     flight
#> 15 2020-05-31 08:00:00     flight
#> 16 2020-05-31 09:00:00     flight
#> 17 2020-05-31 10:00:00     flight
#> 18 2020-05-31 11:00:00     flight
#> 19 2020-05-31 12:00:00     flight
#> 20 2020-05-31 13:00:00     flight
#> 21 2020-05-31 14:00:00     flight
#> 22 2020-05-31 15:00:00     docked
#> 23 2020-05-31 16:00:00     docked
#> 24 2020-05-31 17:00:00     docked

Note how santoku correctly handles the different timezones.

You can use chop_width() with objects from the lubridate package, to chop by irregular periods such as months:

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
data.frame(
  y2k = y2k,
  chopped = chop_width(y2k, months(1))
)
#>           y2k                  chopped
#> 1  2000-01-01 [2000-01-01, 2000-02-01)
#> 2  2000-01-08 [2000-01-01, 2000-02-01)
#> 3  2000-01-15 [2000-01-01, 2000-02-01)
#> 4  2000-01-22 [2000-01-01, 2000-02-01)
#> 5  2000-01-29 [2000-01-01, 2000-02-01)
#> 6  2000-02-05 [2000-02-01, 2000-03-01)
#> 7  2000-02-12 [2000-02-01, 2000-03-01)
#> 8  2000-02-19 [2000-02-01, 2000-03-01)
#> 9  2000-02-26 [2000-02-01, 2000-03-01)
#> 10 2000-03-04 [2000-03-01, 2000-04-01)
#> 11 2000-03-11 [2000-03-01, 2000-04-01)

lbl_date() produces nicely formatted dates:

data.frame(
  y2k = y2k,
  chopped = chop_width(y2k, days(28), labels = lbl_date())
)
#>           y2k              chopped
#> 1  2000-01-01        1-28 Jan 2000
#> 2  2000-01-08        1-28 Jan 2000
#> 3  2000-01-15        1-28 Jan 2000
#> 4  2000-01-22        1-28 Jan 2000
#> 5  2000-01-29 29 Jan - 25 Feb 2000
#> 6  2000-02-05 29 Jan - 25 Feb 2000
#> 7  2000-02-12 29 Jan - 25 Feb 2000
#> 8  2000-02-19 29 Jan - 25 Feb 2000
#> 9  2000-02-26 26 Feb - 24 Mar 2000
#> 10 2000-03-04 26 Feb - 24 Mar 2000
#> 11 2000-03-11 26 Feb - 24 Mar 2000

You can also chop vectors with units, using the units package:

library(units)
#> udunits database from /Users/davidhugh-jones/Library/R/arm64/4.6/library/units/share/udunits/udunits2.xml

x <- set_units(1:10 * 10, cm)
br <- set_units(1:3, ft)
data.frame(
  x = x,
  chopped = chop(x, br)
)
#>           x                    chopped
#> 1   10 [cm] [ 10.00 [cm],  30.48 [cm])
#> 2   20 [cm] [ 10.00 [cm],  30.48 [cm])
#> 3   30 [cm] [ 10.00 [cm],  30.48 [cm])
#> 4   40 [cm] [ 30.48 [cm],  60.96 [cm])
#> 5   50 [cm] [ 30.48 [cm],  60.96 [cm])
#> 6   60 [cm] [ 30.48 [cm],  60.96 [cm])
#> 7   70 [cm] [ 60.96 [cm],  91.44 [cm])
#> 8   80 [cm] [ 60.96 [cm],  91.44 [cm])
#> 9   90 [cm] [ 60.96 [cm],  91.44 [cm])
#> 10 100 [cm] [ 91.44 [cm], 100.00 [cm]]

You should be able to chop anything that has a comparison operator. You can even chop character data using lexical ordering. By default santoku emits a warning in this case, to avoid accidentally misinterpreting results:

chop(letters[1:10], c("d", "f"))
#> Warning in categorize_non_numeric(x, breaks, left): `x` or `breaks` is of type
#> character, using lexical sorting. To turn off this warning, run:
#> options(santoku.warn_character = FALSE)
#>  [1] [a, d) [a, d) [a, d) [d, f) [d, f) [f, j] [f, j] [f, j] [f, j] [f, j]
#> Levels: [a, d) [d, f) [f, j]

If you find a type of data that you can’t chop, please file an issue.