library("steward")

The goal of this package is to make it a little easier to build and publish data dictionaries.

In this article, we will look at a few tasks that we think are common: creating an stw_dataset, writing it to a package, and creating a gt table.

Create

For this set of examples, we use the ggplot2 diamonds dataset. We will create an stw_dataset, which is a data frame with an extra class attached to help manage metadata. An stw_dataset is a data frame, in the same way that a tibble is a data frame.
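To illustrate what "a data frame with an extra class" means, here is a minimal base-R sketch. The class name demo_dataset is hypothetical and is not how steward names or builds its class; the point is only that prepending a class leaves the object usable as an ordinary data frame.

```r
# A tiny data frame standing in for a real dataset.
df <- data.frame(carat = c(0.23, 0.21), price = c(326L, 326L))

# Prepend a hypothetical class; the "data.frame" class is kept.
demo <- structure(df, class = c("demo_dataset", class(df)))

class(demo)                     # "demo_dataset" "data.frame"
is.data.frame(demo)             # TRUE
inherits(demo, "demo_dataset")  # TRUE

# Ordinary data-frame operations still work.
nrow(demo)
mean(demo$carat)
```

Because the extra class comes first, S3 methods written for it can take priority, while everything else falls through to the data-frame methods.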

In the code that follows we have three steps:

  1. Create the stw_dataset() using a data frame.
  2. Add the top-level metadata using stw_mutate_meta().
  3. Add the column-level metadata using stw_mutate_dict().

The functions stw_mutate_meta() and stw_mutate_dict() work a little bit like dplyr::mutate(). However, instead of mutating the columns of a data frame, stw_mutate_meta() mutates the top-level metadata, and stw_mutate_dict() mutates the descriptions of columns in a data frame:

diamonds_new <-
  stw_dataset(ggplot2::diamonds) %>%
  stw_mutate_meta(
    name = "diamonds",
    title = "Prices of 50,000 round cut diamonds",
    description = "A dataset containing the prices and other attributes of almost 54,000 diamonds.",
    sources = list(
      list(
        title = "DiamondSearchEngine",
        path = "http://www.diamondse.info/"
      )
    )
  ) %>%
  stw_mutate_dict(
    price = "price in US dollars ($326--$18,823)",
    carat = "weight of diamond (0.2--5.01)",
    cut = "quality of the cut (Fair, Good, Very Good, Premium, Ideal)",
    color = "diamond color, from D (best) to J (worst)",
    clarity = "a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))",
    x = "length in mm (0--10.74)",
    y = "width in mm (0--58.9)",
    z = "depth in mm (0--31.8)",
    depth = "total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)",
    table = "width of top of diamond relative to widest point (43--95)"
  )
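To make the analogy with dplyr::mutate() concrete, the following base-R sketch stores metadata in attributes and updates it field by field. The helpers demo_mutate_meta() and demo_mutate_dict() and the attribute names "meta" and "dict" are assumptions made for illustration; they are not steward functions and may not reflect how steward stores metadata internally.

```r
# Hypothetical helper: add or overwrite top-level metadata fields.
demo_mutate_meta <- function(x, ...) {
  meta <- attr(x, "meta")
  if (is.null(meta)) meta <- list()
  attr(x, "meta") <- utils::modifyList(meta, list(...))
  x
}

# Hypothetical helper: add or overwrite column descriptions.
demo_mutate_dict <- function(x, ...) {
  descriptions <- c(...)
  unknown <- setdiff(names(descriptions), names(x))
  if (length(unknown) > 0) {
    stop("Unknown columns: ", paste(unknown, collapse = ", "))
  }
  dict <- attr(x, "dict")
  if (is.null(dict)) dict <- character()
  dict[names(descriptions)] <- descriptions
  attr(x, "dict") <- dict
  x
}

df <- data.frame(carat = c(0.23, 0.21), price = c(326L, 326L))
df <- demo_mutate_meta(df, name = "diamonds_small", title = "Two diamonds")
df <- demo_mutate_dict(
  df,
  carat = "weight of diamond",
  price = "price in US dollars"
)

# As with mutate(), naming an existing field replaces only that field.
df <- demo_mutate_meta(df, title = "Two diamonds, retitled")

attr(df, "meta")
attr(df, "dict")
```

In this sketch the data itself is never touched; only the attached metadata changes, which is the distinction the mutate-style functions above draw.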

We also offer a function to make some basic checks on the completeness of the metadata:

stw_check(diamonds_new, verbosity = "all")
✔ Dictionary names are unique.
✔ Dictionary names are all non-trivial.
✔ Dictionary descriptions are all non-trivial.
✔ Dictionary types are all recognized.
✔ Metadata has all required fields.
✔ Metadata sources valid.
✔ Metadata has all optional fields.
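The dictionary checks reported above can be approximated in base R. This is a rough sketch under assumptions: demo_check_dict() is a hypothetical helper, and "non-trivial" is interpreted here as "not missing and not blank", which may differ from the rule stw_check() applies.

```r
# Hypothetical helper: basic completeness checks on a named character
# vector of column descriptions.
demo_check_dict <- function(dict) {
  nms <- names(dict)
  if (is.null(nms)) nms <- rep("", length(dict))
  c(
    names_unique = anyDuplicated(nms) == 0L,
    names_nontrivial = all(!is.na(nms) & nzchar(trimws(nms))),
    descriptions_nontrivial = all(!is.na(dict) & nzchar(trimws(dict)))
  )
}

good <- c(carat = "weight of diamond", price = "price in US dollars")
bad <- c(carat = "weight of diamond", carat = "", price = NA)

demo_check_dict(good)
demo_check_dict(bad)
```

The first call passes every check; the second fails on the duplicated name and on the blank and missing descriptions.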

Write to package

If you are including a dataset in a package, it is good practice to document it; in fact, CRAN insists!

You can use an stw_dataset to keep your documentation associated with your dataset as you build it for your package. Keeping all the “stuff” together helps ensure that the dataset documentation is complete and current.

The function stw_use_data() wraps usethis::use_data() and steward::stw_write_roxygen(). For example, by running:

stw_use_data(diamonds_new)

You will:

  • validate that the metadata is complete.
  • write the diamonds_new dataset to your package.
  • write the roxygen documentation to R/data-diamonds_new.R.
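For readers unfamiliar with roxygen documentation for datasets, the sketch below generates text in the usual roxygen shape from a title, a description, and a named vector of column descriptions. demo_roxygen() is a hypothetical helper written for this illustration; the exact text that stw_write_roxygen() produces may differ.

```r
# Hypothetical helper: build roxygen lines documenting a dataset.
demo_roxygen <- function(name, title, description, dict, source_url) {
  items <- sprintf("#'   \\item{%s}{%s}", names(dict), unname(dict))
  c(
    paste("#'", title),
    "#'",
    paste("#'", description),
    "#'",
    "#' @format A data frame with the following columns:",
    "#' \\describe{",
    items,
    "#' }",
    sprintf("#' @source <%s>", source_url),
    sprintf("\"%s\"", name)
  )
}

roxygen_lines <- demo_roxygen(
  name = "diamonds_new",
  title = "Prices of 50,000 round cut diamonds",
  description = "A dataset containing the prices and other attributes of almost 54,000 diamonds.",
  dict = c(
    price = "price in US dollars",
    carat = "weight of diamond"
  ),
  source_url = "http://www.diamondse.info/"
)

# Print the generated documentation block.
writeLines(roxygen_lines)
```

Generating this block from the metadata, rather than writing it by hand, is what keeps the documentation in step with the dataset.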

Create gt table

If you are creating an R Markdown document, you can use the stw_to_table() function to create a gt table:

stw_to_table(diamonds_new)
DIAMONDS
Prices of 50,000 round cut diamonds

| Name | Type | Description | Levels |
|---|---|---|---|
| carat | number | weight of diamond (0.2--5.01) | |
| cut | string | quality of the cut (Fair, Good, Very Good, Premium, Ideal) | Fair, Good, Very Good, Premium, Ideal |
| color | string | diamond color, from D (best) to J (worst) | D, E, F, G, H, I, J |
| clarity | string | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) | I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF |
| depth | number | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79) | |
| table | number | width of top of diamond relative to widest point (43--95) | |
| price | integer | price in US dollars ($326--$18,823) | |
| x | number | length in mm (0--10.74) | |
| y | number | width in mm (0--58.9) | |
| z | number | depth in mm (0--31.8) | |

Sources: DiamondSearchEngine