Skip to contents

santoku is a versatile cutting tool for R. It provides chop(), a replacement for base::cut().

Advantages

Here are some advantages of santoku:

  • By default, chop() always covers the whole range of the data, so you won’t get unexpected NA values.

  • chop() can handle single values as well as intervals. For example, chop(x, breaks = c(1, 2, 2, 3)) will create a separate factor level for values exactly equal to 2.

  • Flexible labelling, including easy ways to label intervals by numerals or letters.

  • Convenience functions for creating quantile intervals, evenly-spaced intervals or equal-sized groups.

  • Convenience functions for quickly tabulating chopped data.

These advantages make santoku especially useful for exploratory analysis, where you may not know the range of your data in advance.

Examples

chop returns a factor:

chop(1:8, c(3, 5, 7))
#> [1] [1, 3) [1, 3) [3, 5) [3, 5) [5, 7) [5, 7) [7, 8] [7, 8]
#> Levels: [1, 3) [3, 5) [5, 7) [7, 8]

Include a number twice to match it exactly:

chop(1:8, c(3, 5, 5, 7))
#> [1] [1, 3) [1, 3) [3, 5) [3, 5) {5}    (5, 7) [7, 8] [7, 8]
#> Levels: [1, 3) [3, 5) {5} (5, 7) [7, 8]

Customize output with lbl_* functions:

chop(1:8, c(3, 5, 7), labels = lbl_dash())
#> [1] 1—3 1—3 3—5 3—5 5—7 5—7 7—8 7—8
#> Levels: 1—3 3—5 5—7 7—8

Chop into fixed-width intervals:

chop_width(runif(10), 0.1)
#>  [1] [0.8278, 0.9278)  [0.8278, 0.9278)  [0.8278, 0.9278)  [0.3278, 0.4278) 
#>  [5] [0.7278, 0.8278)  [0.2278, 0.3278)  [0.9278, 1.028)   [0.02781, 0.1278)
#>  [9] [0.9278, 1.028)   [0.02781, 0.1278)
#> 6 Levels: [0.02781, 0.1278) [0.2278, 0.3278) ... [0.9278, 1.028)

Or into fixed-size groups:

chop_n(1:10, 5)
#>  [1] [1, 6)  [1, 6)  [1, 6)  [1, 6)  [1, 6)  [6, 10] [6, 10] [6, 10] [6, 10]
#> [10] [6, 10]
#> Levels: [1, 6) [6, 10]

Chop dates by calendar month, then tabulate:

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

tab_width(as.Date("2021-12-31") + 1:90, months(1), 
            labels = lbl_discrete(fmt = "%d %b")
          )
#> 01 Jan—31 Jan 01 Feb—28 Feb 01 Mar—31 Mar 
#>            31            28            31

For more information, see the vignette.