The codesamples package on GitHub contains code snippets from three sources: R package examples, GitHub projects, and Stackoverflow questions. It aims to be a reasonably representative sample of real R code “in the wild”.

You can use this data to learn interesting things about R, like:

  • what functions are used most often?

  • what functions are often called together in the data?

Here, I’m going to look at R packages. Which packages often get used together? This is interesting because it could help people find packages relevant to their use case. I’ll use the Stackoverflow data.

Counting libraries

As a first step, I’m going to count each library() call mentioned in our code. I could also look for loadNamespace() calls, or fully-qualified function names like dplyr::filter(), but library() is common enough that it will probably do.
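For comparison, here’s a rough sketch, not used in the analysis, of a base-R pattern that would catch fully-qualified pkg::fun() references instead:

# Sketch only: extract pkg:: prefixes with a lookahead
snippet <- 'dplyr::filter(df, x > 1)'
regmatches(snippet, gregexpr("[A-Za-z][A-Za-z0-9.]*(?=::)", snippet, perl = TRUE))
# [[1]]
# [1] "dplyr"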

First let’s load some packages of our own:

knitr::opts_chunk$set(echo = TRUE)

library(conflicted)
conflict_prefer_all("dplyr", quiet = TRUE)

# pak::pkg_install("hughjonesd/codesamples")
library(codesamples)

library(dplyr)
library(ggraph)
library(gt)
library(igraph)
library(purrr)
library(stringr)
library(tidygraph)
library(tidyr)

Now we extract the library() calls, creating a list of package names for each post.

extract_packages <- function (snippets) {
  pattern <- paste0("library\\(", # library(
                    "('|\")?",    # an optional quote mark
                    # the package name: starts with a letter, ends with a
                    # letter or number; "*" (not "+") so that two-letter
                    # names like "sf" also match
                    "([A-Za-z][A-Za-z0-9.]*[A-Za-z0-9])",
                    "('|\")?",    # optional quote mark
                    "\\)"         # end bracket
                    )
  matches <- str_match_all(snippets, pattern)
  # column 3 holds the second capture group: the package name
  matches <- map(matches, \(x) x[, 3])
  
  return(matches)
}
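
A quick sanity check on a couple of made-up snippets (output shown as comments):

extract_packages(c('library(dplyr); library("tidyr")', 'x <- 1'))
# [[1]]
# [1] "dplyr" "tidyr"
#
# [[2]]
# character(0)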


co_packages <- so_questions |> 
  mutate(
    packages = extract_packages(snippet)
  ) |> 
  select(post_id, packages) |> 
  # creates one row for each member of the packages list-column
  tidyr::unchop(packages)

The table below lists the top 20 packages in the data:

co_packages |> 
  count(packages) |> 
  arrange(desc(n)) |> 
  mutate(
    Rank = 1:n()
  ) |> 
  rename(
    Package = packages
  ) |> 
  slice_head(n = 20) |> 
  gt::gt() |> 
  gt::tab_caption(md(
    "**Top 20 packages in a sample of Stackoverflow R snippets**"
  ))
Top 20 packages in a sample of Stackoverflow R snippets

Package             n   Rank
ggplot2          1439      1
dplyr             793      2
shiny             788      3
data.table        444      4
tidyverse         424      5
plotly            241      6
reshape2          181      7
shinydashboard    163      8
raster            162      9
plyr              156     10
grid              154     11
gridExtra         145     12
leaflet           136     13
tidyr             135     14
scales            134     15
caret             118     16
rgdal             117     17
stringr           104     18
lubridate         102     19
igraph             93     20

The top 5 are not surprising. But {plotly}, {raster} and {plyr} make it into the top 10, which I did not necessarily expect. One possibility is that some packages are harder to use and so generate a lot of SO questions. Or, since the data stretches back to 2013, some old packages get mentioned more than you might expect nowadays. Remember {reshape2}?

Making a co-occurrence matrix

Now, we do some data cleaning to convert our data into a co-occurrence matrix, where each row and column represents a package, and each cell counts the number of posts in which the two packages appeared together. First, we create a column of dummy variables for each package.

# Create columns of dummy variables
package_names <- unique(co_packages$packages)
co_package_mx <- co_packages
co_package_mx[, package_names] <- FALSE

walk(package_names, \(nm) {
  co_package_mx[nm] <<- co_package_mx$packages == nm
})

# Add up the co-occurrences
co_package_mx <- summarize(
  co_package_mx,
  .by = post_id,
  # return TRUE if any row has TRUE for the package
  across(any_of(package_names), any)
)

# Make it a matrix
co_package_mx <- co_package_mx |> 
  select(-post_id) |> 
  as.matrix()

To keep our data manageable, we’ll ignore any packages that are mentioned in only one post. We aren’t going to learn much from those anyway.

mentioned_once <- colSums(co_package_mx) <= 1
co_package_mx <- co_package_mx[, ! mentioned_once]

Now we can create the co-occurrence matrix.

package_names <- colnames(co_package_mx)

n_packages <- length(package_names)
co_package_graph <- matrix(NA_integer_, nrow = n_packages, 
                           ncol = n_packages)
rownames(co_package_graph) <- colnames(co_package_graph) <- package_names

# For each package, count how often each other package is with it
for (pkg in package_names) {
  in_post <- co_package_mx[, pkg]
  posts_with_pkg <- co_package_mx[in_post, , drop = FALSE] 
  pkgs_with_pkg <- colSums(posts_with_pkg) 
  # Populate the graph matrix
  co_package_graph[pkg, ] <- pkgs_with_pkg
}

# Diagonal elements are the *total* number of times a package is 
# used (the number of times it appears with itself)
total_mentions <- diag(co_package_graph)

We’ll also remove any packages that only appear on their own.

loners <- colSums(co_package_graph > 0) == 1
co_package_graph <- co_package_graph[! loners, ! loners]
package_names <- rownames(co_package_graph)

We still have a great many associations, and most of them are likely meaningless. To trim the network a bit, let’s keep only the significant associations. We’ll run a chi-squared test for each pair of packages.

n_posts <- length(unique(so_questions$post_id))

package_names <- rownames(co_package_graph)

package_pvals <- matrix(NA_real_, 
                        nrow(co_package_graph), 
                        ncol(co_package_graph))
rownames(package_pvals) <- colnames(package_pvals) <- package_names

suppressWarnings({
  for (pkg1 in package_names) for (pkg2 in package_names) {
    if (pkg1 >= pkg2) next
    n_together <- co_package_graph[pkg1, pkg2]
    n_pkg1 <- co_package_graph[pkg1, pkg1]
    n_pkg2 <- co_package_graph[pkg2, pkg2]
    # matrix of counts for pkg1 present/absent x pkg2 present/absent;
    # add n_together back so posts with both aren't subtracted twice
    chisq_tbl <- matrix(
      c(n_together, n_pkg1 - n_together, 
        n_pkg2 - n_together,
        n_posts - n_pkg1 - n_pkg2 + n_together), 2, 2)
    test <- chisq.test(chisq_tbl)
    package_pvals[pkg1, pkg2] <- test$p.value
  }
})

That’s a lot of tests, so we’ll use a Bonferroni correction to avoid accepting too many false positives. By the way, the code below uses a little-known trick of R subsetting: if you subset a matrix with a two-column numeric matrix, you select individual cells. The first column gives the row indices, and the second column gives the column indices.
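Here’s a tiny demonstration of that trick on a throwaway matrix:

m <- matrix(1:9, nrow = 3)
idx <- cbind(c(1, 3),  # row indices
             c(2, 2))  # column indices
m[idx]  # selects m[1, 2] and m[3, 2]
# [1] 4 6

Back to the real data: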

# Bonferroni threshold; length() counts every cell (including NAs),
# so this is conservative
sig <- package_pvals < 0.05 / length(package_pvals)
sig <- which(sig, arr.ind = TRUE)  # which() drops the NA entries
sig_assocs <- as_tibble(sig)
sig_assocs$pkg1 <- package_names[sig_assocs$row]
sig_assocs$pkg2 <- package_names[sig_assocs$col]
# the magic of matrix-based indexing
sig_assocs$p.value <- package_pvals[sig]

sig_assocs <- sig_assocs |> 
  select(pkg1, pkg2, p.value) |> 
  filter(pkg1 < pkg2) # get rid of duplicates using alphabetic order

This is a slightly risky strategy: the chi-squared test can also come out significant when packages are negatively associated, appearing with each other less often than you’d expect.
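
As a quick check, we can use the matrix-indexing trick again to compare each observed co-occurrence with its expected count under independence (a sketch, assuming the objects built above):

# Positive association means observed co-occurrence exceeds
# the independence benchmark n_pkg1 * n_pkg2 / n_posts
observed <- co_package_graph[cbind(sig_assocs$pkg1, sig_assocs$pkg2)]
expected <- total_mentions[sig_assocs$pkg1] *
  total_mentions[sig_assocs$pkg2] / n_posts
all(observed > expected)  # TRUE if nothing negative slipped through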

In fact, that doesn’t seem to happen. Here are the top 20 pairs of most-significantly associated packages. They look pretty sensible: these packages do belong together.

sig_assocs |> 
  arrange(p.value) |> 
  select(Package = pkg1, Friend = pkg2) |> 
  slice_head(n = 20) |> 
  gt::gt() |> 
  gt::tab_caption(md("**20 significantly associated pairs of packages**"))
20 significantly associated pairs of packages

Package        Friend
doParallel     foreach
kableExtra     knitr
shiny          shinydashboard
selectr        xml2
webshot        wordcloud2
network        sna
mapdata        maps
htmlwidgets    webshot
survival       survminer
tibbletime     tidyquant
tidyquant      timetk
tibbletime     timetk
rsample        timetk
recipes        timetk
rsample        tibbletime
recipes        tibbletime
fGarch         rugarch
recipes        rsample
ggrough        hrbrthemes
flextable      officer

Finding clusters of packages

By keeping only clear associations, we’ve made the sig_assocs data frame a manageable size. This means we can start to play with network analysis, using Thomas Lin Pedersen’s excellent {tidygraph} and {ggraph} packages.

Above, we see several groups of packages that are all associated with each other. That suggests using a clustering algorithm to put packages into different “communities”.

I picked a clustering algorithm by the advanced method of trying them until one gave sensible results. cluster_walktrap() seems to be OK.

ig <- igraph::graph_from_data_frame(sig_assocs, directed = FALSE)
ig <- tidygraph::as_tbl_graph(ig)

clust <- igraph::cluster_walktrap(ig)

# Add community data to the graph
ig <- ig |> 
  activate(nodes) |> 
  mutate(
    cluster = factor(membership(clust)),
    mentions = total_mentions[name]
  ) |> 
  group_by(cluster) |> 
  mutate(
    n_members = n()
  ) |> 
  ungroup()


# We'll use this to label just a few packages in each cluster
top_3_per_cluster <- function (x)  {
  x |> group_by(cluster) |> 
    arrange(desc(mentions)) |> 
    slice_head(n = 3)
}


ig |> 
  filter(n_members >= 10) |> 
  ggraph(layout = "fr") +  
    geom_edge_link(edge_colour = "grey85", edge_width = 0.5) +
    geom_node_point(aes(colour = cluster)) +
    geom_node_label(aes(label = name, fill = cluster), size = 2.5,
                    data = top_3_per_cluster, repel = TRUE) + 
    theme(legend.position = "none") +
    labs(
      title = "Clusters of packages in Stackoverflow snippets"
    )

Again, these results look intuitively sensible. We can see:

  • a large tidyverse cluster;

  • a time-series cluster with {quantmod} and {zoo};

  • a spatial data cluster with {rgdal}, {raster} and {ggmap};

  • a “make it go faster” cluster, maybe, with {Rcpp}, {data.table} and {microbenchmark};

  • clusters for parallel processing, Shiny, web scraping and database access.

Finding alternatives

Can we do more?

Suppose that some packages are alternatives to each other: they are often used for similar tasks, but rarely used together. Canonical examples might be {dplyr} and {data.table}, or {sp} and {sf}.

If so, we might expect those packages to be used with similar sets of other packages, but not to have a direct link between them. So you might see

library(sf)
library(maptools)

or

library(sp)
library(maptools)

but rarely

library(sf)
library(sp)
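
With the co-occurrence matrix we built above, you can eyeball a specific pair directly, assuming both packages survived the earlier filtering steps:

# Number of posts mentioning both packages
co_package_graph["sp", "maptools"]
co_package_graph["sf", "sp"]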

Let’s look for such pairs of packages. We’ll join sig_assocs to itself, to get indirect associations between pairs of packages.

sig_assocs_reversed <- sig_assocs |> 
  select(pkg2, pkg1, p.value) |> 
  rename(pkg1 = pkg2, pkg2 = pkg1)

sig_assocs_twice <- rbind(sig_assocs, sig_assocs_reversed) |> 
  select(pkg1, pkg2) |> 
  arrange(pkg1, pkg2) 

# To capture all links, we have to go both ways through the graph
sig_indirect <- inner_join(sig_assocs_twice, sig_assocs_twice, 
                          by = c("pkg2" = "pkg1"),
                          relationship = "many-to-many") |> 
  select(-pkg2) |> 
  rename(pkg2 = pkg2.y) |> 
  # remove paths that loop back on themselves
  filter(pkg1 != pkg2)

Then we’ll remove any packages that have a direct link between them.

alts <- sig_indirect |> 
  anti_join(sig_assocs_twice, by = join_by(pkg1, pkg2)) |> 
  add_count(pkg1, pkg2) |> 
  distinct(pkg1, pkg2, .keep_all = TRUE)

Lastly, we’ll filter for packages with a minimum number of indirect paths between them, to avoid too many false positives.

alts |> 
  filter(n >= 3) |> 
  select(Package = pkg1, Alternative = pkg2) |> 
  DT::datatable(filter = "top", caption = "Alternative packages")

Exploring the above, it seems to work sometimes. {dplyr} and {data.table} don’t show up – perhaps those packages just live in too-different worlds. But pairs of packages like {PerformanceAnalytics} and {ROI}, or {RCurl} and {rvest}, might be reasonable alternatives. Then again, is {RColorBrewer} really an alternative to {rtweet}? Hmm… We might do better with a subtler network analysis, but I’ll leave it here for now.

Data

Here’s the list of communities.

Here’s the full list of associated packages:

Or you can get the codesamples data from GitHub and run this script yourself.