How to Use Pins for R Programming
Pins in R programming allow data to be saved and shared conveniently. Here is an overview to help with the data tasks in your projects.
With R programming comes data, and while some tools like the Environment in RStudio can help identify the objects being used, analysts sometimes need a temporary storage space for associated projects and files. This is where pins are helpful as a temporary storage solution.
Pins, provided by the R package of the same name, are a convenient way to cache data objects for later use or for sharing across projects when working with teams. Pins are designed to host small datasets or reference tables that don’t quite merit being in a database, yet need to persist in a storage medium more expansive than a spreadsheet. Version 1.0 was introduced in 2021, with 1.1 published by Julia Silge in January 2023. The current version is 1.3.
If you already have a shared drive with your teammates, you may be wondering “What is different about the value pins can provide?”
The value lies in update management. Think of these questions: How can you keep a team up to speed when a dataset needs to be replaced with a new one? Moreover, how do you let everyone know about the new version to avoid confusion? Is the latest data contained in the dataset_v2.csv file or in the dataset_v3.csv file?
Pins address this by design: reading a pin retrieves its newest version by default. Thus pin users never have to worry about receiving an outdated version of the pin or the data associated with it.
How do pins work?
Pins work by publishing R objects onto a virtual “corkboard” for shared access among people involved in a project using the pinned dataset.
There are three groups of functions designed to help you create and manage pins: the pin functions, the board functions (prefixed with board_), and the write function, pin_write().
Pins
The pin functions create and manage the pins. The hosting platform, such as Google Cloud or Amazon S3, is chosen through the board that the pin is written to. For example, to create a pin hosted on Amazon Web Services, I would connect to the bucket with board_s3() and then store the object with pin_write().
Other hosting choices include an RStudio Connect server (now Posit Connect), Google Cloud Storage, or Azure Blob Storage. A shared drive such as Dropbox or SharePoint is also possible.
Boards
The board functions create the board in which the data is stored. There are variations of the board functions to accommodate where the board resides.
Here is an example below to get you started with a board. You can create a temporary board using the board_temp() function. In this example, a board is created to hold the specifications of the mtcars dataset.
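As a minimal sketch of that first step, assuming only that the pins package is installed, a temporary board can be created like this. The variable name board is my own choice for illustration.

```r
library(pins)

# board_temp() creates a board in a temporary directory. It is discarded
# when the R session ends, which makes it handy for experiments.
board <- board_temp()

# Printing the board shows what kind of board it is and where it lives.
print(board)

# A brand-new board has no pins yet.
pin_list(board)
```

Because nothing on a temporary board survives the session, it is a safe place to try the functions in the rest of this post.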
If you are sourcing data from a cloud service, there are dedicated board functions for each hosting service option, such as board_s3() for the AWS S3 buckets or board_gcs() for Google Cloud. Developers have found them particularly useful for small data sets or reference tables that are not suited for a database, but still require a shared location for teams to import into their scripts.
The board_folder() function creates a board in a folder that can be shared with others via a cloud service like Dropbox or a shared network drive. A specialty function, board_rsconnect(), is designed to share data with RStudio Connect; recent pins releases deprecate it in favor of board_connect(). There are other board functions for Azure, Microsoft 365, and Amazon S3, reflecting the hosting choices I described earlier in this post.
The legacy functions are more of a generic category than a functional one. These functions manage the legacy pins API and help with the transition when an API or a board changes. According to Posit, the company that develops pins, legacy functions “will continue to exist for some time so your existing code will continue to work, but we recommend you move to the modern API and modern boards where possible.”
One side note: the Kaggle version of the board function and its legacy functions were removed in the pins 1.2 release.
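Here is a rough sketch of how a folder board might be shared. The path is hypothetical: I use a temporary directory as a stand-in for a synced Dropbox folder or network drive so the code runs anywhere, and the second board object simply imitates a teammate opening the same folder.

```r
library(pins)

# Hypothetical shared location. In practice this would be a synced folder or
# a network drive; a temporary directory keeps the sketch self-contained.
shared_path <- file.path(tempdir(), "team-board")
dir.create(shared_path, showWarnings = FALSE)

# One person creates the board and writes a pin to it.
board <- board_folder(shared_path)
pin_write(board, mtcars, name = "mtcars_pin")

# A teammate pointing at the same folder sees the same pin.
teammate_board <- board_folder(shared_path)
shared_copy <- pin_read(teammate_board, "mtcars_pin")
head(shared_copy)
```

The cloud boards such as board_s3() follow the same pattern, but they need credentials for the hosting service, so I have left them out of this sketch.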
Write
Once a board is created, you can write data to that board using pin_write(). In the mtcars example, the function's name argument names the pin. In this case I am naming the pinned object mtcars_pin.
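A minimal sketch of that write step, assuming a temporary board like the one created earlier, could look like this:

```r
library(pins)

board <- board_temp()

# Arguments: the board, the object to pin, and the name of the pin.
pin_write(board, mtcars, name = "mtcars_pin")

# The new pin now shows up on the board.
pin_list(board)
```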
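When the pin is written, pin_write() prints a few messages in the console, including the file type it chose and the version it created.
Note the type designation. In the pin_write() function, the user can indicate a pin file type with the type argument. If no type is supplied, pins uses rds, R's native serialized data format, for everything except bare lists, which are stored as JSON.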
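To illustrate the type argument, here is a small sketch that writes the same data twice, once with the guessed type and once as CSV, and then checks the stored type with pin_meta(). The pin names mtcars_rds and mtcars_csv are made up for this example.

```r
library(pins)

board <- board_temp()

# No type supplied: pins picks one for a data frame.
pin_write(board, mtcars, name = "mtcars_rds")

# Explicit type: store the same data as a CSV file instead.
pin_write(board, mtcars, name = "mtcars_csv", type = "csv")

# pin_meta() returns the pin's metadata, including the file type used.
pin_meta(board, "mtcars_rds")$type
pin_meta(board, "mtcars_csv")$type
```

CSV is a reasonable choice when the pin also needs to be read outside of R, while rds keeps R-specific attributes such as row names and column classes.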
To read the contents of a pin, use the pin_read() function. In the example, reading mtcars_pin returns data that looks exactly like mtcars.
Is there a dataset size limit to a pin? Pins work best with datasets of a few hundred megabytes or less. Users can pin a variety of data object types, including data from a CSV file, model outputs, JSON files, and other formats.
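A short round-trip sketch, again on a temporary board, shows the read step. The variable name mtcars_copy is only for illustration.

```r
library(pins)

board <- board_temp()
pin_write(board, mtcars, name = "mtcars_pin")

# pin_read() returns the pinned object. With no version given, it reads
# the newest version of the pin.
mtcars_copy <- pin_read(board, "mtcars_pin")
head(mtcars_copy)

# The data read back matches the original.
all.equal(mtcars_copy, mtcars)
```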
Pin Maintenance — Versioning
Pins and boards can be updated to reflect the changed data within them. Pins can be set up to keep an older pin available, or to rewrite the pin to reflect updated information. There are a few options to remember to keep pins updated.
You can put more than one pin on a board, but only one per call: you cannot chain the writes through a pipe, for example. Here’s an example using the bikeshare data from the mlr3data library and gtcars from the gt library. There are two pin_write() calls, one for each object being added to the board.
If you need to maintain an old version of a pin, you can version pins using a temporary board such that writing to an existing pin adds a new copy rather than replacing the existing data. You can also create a local board with versioning.
Here’s how versioning works. In the board_local() function, you set the versioned argument to TRUE (versioned = TRUE).
This allows the user to write new versions of a named pin (via pin_write()) without automatically deleting the previous versions. You can use the pin_list() function to display the pins.
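The two-call pattern can be sketched as follows. To keep the code runnable without installing extra packages, I use two small made-up data frames as stand-ins for the bikeshare and gtcars data, so the values below are not the real datasets.

```r
library(pins)

board <- board_temp()

# Stand-in data (an assumption of this sketch, not the real datasets).
bikeshare <- data.frame(day = c("Mon", "Tue", "Wed"), rentals = c(120, 135, 98))
gtcars_sample <- data.frame(model = c("A", "B"), hp = c(600, 550))

# One pin_write() call per object being added to the board.
pin_write(board, bikeshare, name = "bikeshare")
pin_write(board, gtcars_sample, name = "gtcars")

# Both pins are now listed on the board.
pin_list(board)
```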
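Here is a hedged sketch of versioning in action. I use board_temp(versioned = TRUE) rather than board_local() so nothing is written to your user data directory, and I use pin_versions() to list the stored versions. The one-second pause is my own precaution so the two versions get distinct timestamps.

```r
library(pins)

# A versioned temporary board: writing to an existing pin adds a version.
board <- board_temp(versioned = TRUE)

pin_write(board, head(mtcars, 5), name = "mtcars_pin")
Sys.sleep(1)  # make sure the second version gets a later timestamp
pin_write(board, head(mtcars, 10), name = "mtcars_pin")

# One row per stored version, with its creation time and hash.
versions <- pin_versions(board, "mtcars_pin")
versions

# Reading without a version returns the newest data.
latest <- pin_read(board, "mtcars_pin")
nrow(latest)

# An older version can still be read by passing its version id.
oldest_id <- versions$version[which.min(versions$created)]
oldest <- pin_read(board, "mtcars_pin", version = oldest_id)
nrow(oldest)
```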
You can also use pin_delete() to remove a pin, followed by the pin_list() function to display the remaining pins. Here is the output of a delete of the “bikeshare pin” followed by the output of the pin_list() function.
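A small sketch of the delete step, using made-up stand-in data frames under the same pin names:

```r
library(pins)

board <- board_temp()

# Stand-in data (an assumption of this sketch, not the real datasets).
pin_write(board, data.frame(rentals = c(120, 135, 98)), name = "bikeshare")
pin_write(board, data.frame(hp = c(600, 550)), name = "gtcars")

# Remove the bikeshare pin, then list what is left on the board.
pin_delete(board, "bikeshare")
remaining <- pin_list(board)
remaining
```

Deleting a pin removes all of its versions, so it is worth double-checking the name before running pin_delete() on a shared board.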
If you must update your pin regularly, you can use a scheduled Quarto document on RStudio Connect to handle this task.
Pins offer an opportunity to cache remote resources and intermediate results. A given pin will check for HTTP caching headers associated with the given URLs. This allows a pin() function to work when accessed offline or when the remote resource becomes unavailable. A warning may appear in the terminal or IDE prompt but the code relying on the pin will continue to work.
The cached information makes pins ideal for workflows such as sharing reproducible research that would otherwise require manual instructions to download resources before running your R script.
The latest version of pins introduces a logistics function called write_board_manifest(). The write_board_manifest() function creates a manifest file in a board’s root directory. The manifest file records all the pins, along with their versions, stored on a board.
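As a rough sketch, assuming a pins release recent enough to include write_board_manifest(), the manifest can be written and located like this. I rely on the board's path field to find its root directory, which applies to folder-based boards such as the temporary one used here.

```r
library(pins)

board <- board_temp()
pin_write(board, mtcars, name = "mtcars_pin")

# Write the manifest into the board's root directory.
write_board_manifest(board)

# The manifest is a YAML file named _pins.yaml next to the pin folders.
list.files(board$path, all.files = TRUE)
manifest_path <- file.path(board$path, "_pins.yaml")
file.exists(manifest_path)
```

The manifest is a snapshot, so it needs to be written again after pins are added or removed.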
If you are starting with R (or Python, since there is a Python version of pins), learn how to add pins to your workflow. You will find it a great aid when scheduling reports that need to be updated with the newest data each week or for sharing data across multiple content projects.
Overall, pins offer a convenient way to manage data and associated information without extensive storage concerns.
You can learn more about pins and their functions in the RStudio vignette for the package.