Documenting datasets • roxygen2

Datasets are stored in data/, not as regular R objects in the package. This means you need to document them in a slightly different way: instead of documenting the data directly, you quote the dataset’s name. For example, this is the roxygen2 block used for ggplot2::diamonds:

#' Prices of over 50,000 round cut diamonds
#'
#' A dataset containing the prices and other attributes of almost 54,000
#'  diamonds. The variables are as follows:
#'
#' @format A data frame with 53940 rows and 10 variables:
#' \describe{
#'   \item{price}{price in US dollars ($326--$18,823)}
#'   \item{carat}{weight of the diamond (0.2--5.01)}
#'   \item{cut}{quality of the cut (Fair, Good, Very Good, Premium, Ideal)}
#'   \item{color}{diamond colour, from D (best) to J (worst)}
#'   \item{clarity}{a measurement of how clear the diamond is (I1 (worst), SI2,
#'     SI1, VS2, VS1, VVS2, VVS1, IF (best))}
#'   \item{x}{length in mm (0--10.74)}
#'   \item{y}{width in mm (0--58.9)}
#'   \item{z}{depth in mm (0--31.8)}
#'   \item{depth}{total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)}
#'   \item{table}{width of top of diamond relative to widest point (43--95)}
#' }
#'
#' @source {ggplot2} tidyverse R package.
"diamonds"

Datasets should never be exported with @export because they do not live in the package namespace, so there is nothing to list in NAMESPACE. Instead, datasets are either available automatically if you set LazyData: true in your DESCRIPTION or, if you do not, available after calling data(). This field also affects the default usage. If you have LazyData: true, the usage will be just the dataset name (e.g. diamonds). Otherwise, the usage will be wrapped in data() (e.g. data(diamonds)).
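The two access patterns can be sketched with the datasets package that ships with R, which uses lazy loading. This is only an illustration using mtcars as a stand-in, not part of the diamonds example; the environment name env is an arbitrary choice made here to keep the explicit load out of the global workspace.

```r
# Lazy-loaded data: no data() call is needed, the object is found by name.
lazy_rows <- nrow(datasets::mtcars)

# Explicit loading: the form users need when a package does not set
# LazyData: true. Here the dataset is loaded into a scratch environment.
env <- new.env()
data("mtcars", package = "datasets", envir = env)

# Both routes give access to the same data frame.
explicit_rows <- nrow(env$mtcars)
```

For a package without LazyData: true, only the second form would work, which is why the generated usage is wrapped in data() in that case.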

Note the use of two additional tags that are particularly useful for documenting data:

@format, which gives an overview of the structure of the dataset. This should include a definition list that describes each variable. There’s currently no way to generate this with Markdown, so this is one of the few places you’ll need to use Rd markup directly.
@source, which says where you got the data from, often a URL.
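Because the definition list has to be written by hand in Rd markup, it can help to generate a skeleton and then fill in the descriptions. The following is a minimal sketch of one way to do that; format_skeleton() is a hypothetical helper written for this example and is not a roxygen2 function.

```r
# Hypothetical helper (not part of roxygen2): builds the lines of a @format
# block for a data frame, with a placeholder description for each variable.
format_skeleton <- function(df) {
  stopifnot(is.data.frame(df))
  # One \item{name}{description} entry per column, in column order.
  items <- sprintf("#'   \\item{%s}{TODO: describe}", names(df))
  c(
    sprintf(
      "#' @format A data frame with %d rows and %d variables:",
      nrow(df), ncol(df)
    ),
    "#' \\describe{",
    items,
    "#' }"
  )
}

# Print a skeleton for the first three columns of the built-in mtcars data.
writeLines(format_skeleton(mtcars[1:3]))
```

The printed lines can be pasted above the quoted dataset name and the TODO placeholders replaced with real descriptions, including units and ranges where they are known.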