santoku is a versatile cutting tool for R. It provides chop(), a replacement for base::cut().

Installation

Install from r-universe:

install.packages("santoku", repos = c("https://hughjonesd.r-universe.dev", 
                                      "https://cloud.r-project.org"))

Or from CRAN:

install.packages("santoku")

Or get the development version from GitHub:

# install.packages("remotes")
remotes::install_github("hughjonesd/santoku")

Advantages

Here are some advantages of santoku:

  • By default, chop() always covers the whole range of the data, so you won’t get unexpected NA values.

  • chop() can handle single values as well as intervals. For example, chop(x, breaks = c(1, 2, 2, 3)) will create a separate factor level for values exactly equal to 2.

  • chop() can handle many kinds of data, including numbers, dates and times, and units.

  • chop_* functions create intervals in many ways, using quantiles of the data, standard deviations, fixed-width intervals, equal-sized groups, or pretty intervals for use in graphs.

  • It’s easy to label intervals: use names for your breaks vector, or use a lbl_* function to create interval notation like [1, 2), dash notation like 1-2, or arbitrary styles using glue::glue().

  • tab_* functions quickly chop data, then tabulate it.

These advantages make santoku especially useful for exploratory analysis, where you may not know the range of your data in advance.
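As a minimal sketch of the first advantage, the comparison below runs base::cut() and chop() on the same data and breaks. The variable names are illustrative only; it assumes santoku is installed and relies on the default behaviour of both functions.

```r
library(santoku)

x <- 1:5
breaks <- c(2, 4)

# base::cut() only covers (2, 4], so 1, 2 and 5 fall outside and become NA
base_result <- cut(x, breaks)
anyNA(base_result)
#> [1] TRUE

# chop() extends the breaks to cover the whole range of x by default
chop_result <- chop(x, breaks)
anyNA(chop_result)
#> [1] FALSE
```

With cut(), avoiding the NA values means knowing the range of the data up front and writing breaks such as c(-Inf, 2, 4, Inf) by hand.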

Examples

chop() returns a factor:

chop(1:5, c(2, 4))
#> [1] [1, 2) [2, 4) [2, 4) [4, 5] [4, 5]
#> Levels: [1, 2) [2, 4) [4, 5]

Include a number twice to match it exactly:

chop(1:5, c(2, 2, 4))
#> [1] [1, 2) {2}    (2, 4) [4, 5] [4, 5]
#> Levels: [1, 2) {2} (2, 4) [4, 5]

Use names in breaks for labels:

chop(1:5, c(Low = 1, Mid = 2, High = 4))
#> [1] Low  Mid  Mid  High High
#> Levels: Low Mid High

Or use lbl_* functions:

chop(1:5, c(2, 4), labels = lbl_dash())
#> [1] 1—2 2—4 2—4 4—5 4—5
#> Levels: 1—2 2—4 4—5
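For arbitrary label styles, lbl_glue() takes a glue template. The sketch below assumes the placeholders {l} and {r} stand for the left and right endpoints of each interval; the template text " to " is just an illustration.

```r
library(santoku)

# Build labels from a glue template; {l} and {r} are the interval endpoints
glued <- chop(1:5, c(2, 4), labels = lbl_glue("{l} to {r}"))

# Inspect the resulting labels
levels(glued)
```

The same data and breaks as above are used, so only the labels differ from the lbl_dash() example.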

Chop into fixed-width intervals:

chop_width(runif(10), 0.1)
#>  [1] [0.1399, 0.2399) [0.5399, 0.6399) [0.5399, 0.6399) [0.5399, 0.6399)
#>  [5] [0.6399, 0.7399) [0.3399, 0.4399) [0.8399, 0.9399] [0.8399, 0.9399]
#>  [9] [0.5399, 0.6399) [0.1399, 0.2399)
#> 5 Levels: [0.1399, 0.2399) [0.3399, 0.4399) ... [0.8399, 0.9399]

Or into fixed-size groups:

chop_n(1:10, 5)
#>  [1] [1, 6)  [1, 6)  [1, 6)  [1, 6)  [1, 6)  [6, 10] [6, 10] [6, 10] [6, 10]
#> [10] [6, 10]
#> Levels: [1, 6) [6, 10]
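Breaks can also be set at quantiles of the data. As a hedged sketch, the call below uses chop_quantiles() with the 25% and 75% quantiles of 1:10; printed labels are omitted here because their exact format depends on the labelling defaults.

```r
library(santoku)

# Split at the 25th and 75th percentiles of the data
by_quantile <- chop_quantiles(1:10, c(0.25, 0.75))

# Three groups: below the lower quartile, the middle half, above the upper quartile
nlevels(by_quantile)
#> [1] 3

# Count the observations in each group
table(by_quantile)
```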

Chop dates by calendar month, then tabulate:

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

dates <- as.Date("2021-12-31") + 1:90

tab_width(dates, months(1), labels = lbl_discrete(fmt = "%d %b"))
#> 01 Jan—31 Jan 01 Feb—28 Feb 01 Mar—31 Mar 
#>            31            28            31

For more information, see the vignette.