Organizing scripts
Just as your data "flows" through your project, data should flow naturally through a script. Very generally, you want to:
- describe the work completed in the script in a comment header
- source your configuration file (`0-config.R`)
- load all your data
- do all your analysis/computation
- save your data.
Each of these sections should be "chunked together" using comments. See this file for a good example of how to cleanly organize a file in a way that follows this "flow" and functionally separate pieces of code that are doing different things.
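A minimal sketch of this flow, using a built-in dataset and a temporary output path so that it runs on its own. The section names, object names, and the commented-out `source()` call are illustrative assumptions rather than a prescribed layout:

```r
# Description: summarize mean ozone by month (illustrative example)

# Source configuration ----
# In a real project this would load packages and paths, for example:
# source("0-config.R")

# Load data ----
air_quality <- datasets::airquality

# Analysis ----
# Rows with missing Ozone values are dropped by aggregate()'s default na.action
monthly_mean_ozone <- aggregate(Ozone ~ Month, data = air_quality, FUN = mean)

# Save results ----
output_path <- file.path(tempdir(), "monthly_mean_ozone.RDS")
saveRDS(monthly_mean_ozone, output_path)
```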
Documenting your code
File headers
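One possible shape for a file header is sketched below. The field names and the angle-bracket placeholders are assumptions for illustration, not a required template; adapt them to whatever your project needs to record.

```r
################################################################################
# Project:     <project name>
# Script:      <file name>
# Author:      <author>
# Date:        <date>
#
# Description: <what the script does>
# Inputs:      <files the script reads>
# Outputs:     <files the script writes>
################################################################################
```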

Sections and subsections
RStudio (v1.4 or more recent) supports the use of Sections and Subsections. You can easily navigate through longer scripts using the navigation pane in RStudio.
Code folding
Consider using RStudio's code folding feature to collapse and expand different sections of your code. Any comment line with at least four trailing dashes (-), equal signs (=), or pound signs (#) automatically creates a code section. For example:
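A short sketch of section comments using each of the three trailing markers. The section titles and variables are illustrative, and the `##` prefix for a subsection reflects how recent RStudio versions nest sections in the outline (treat that detail as an assumption if you are on an older version):

```r
# Load data ----
cars_data <- datasets::mtcars

## Inspect columns ----
column_names <- names(cars_data)

# Summarize data ====
mean_mpg <- mean(cars_data$mpg)

# Report results ####
message("Mean miles per gallon: ", round(mean_mpg, 2))
```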
Comments in the body of your script
Commenting your code is an important part of reproducibility and helps document your code for the future. When things change or break, you'll be thankful for comments. There's no need to comment excessively or unnecessarily, but a comment describing what a large or complex chunk of code does is always helpful. See this file for an example of how to comment your code and notice that comments are always in the form of:
# This is a comment -- first letter is capitalized and spaced away from the pound sign
Function documentation
Every function you write must include a header to document its purpose, inputs, and outputs. Headers are essential for any reproducible workflow because R is dynamically typed. This means you can pass a string into an argument that is meant to be a `data.table`, or a list into an argument meant for a tibble. It is the responsibility of a function's author to document what each argument is meant to do and its basic type. This is an example for documenting a function (inspired by JavaDocs and R's Plumber API docs):
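A sketch of what such a header might look like. The function `calculate_rate_per_1000()`, its arguments, and the header field names are hypothetical and chosen only to illustrate the layout:

```r
# Documentation: calculate_rate_per_1000 (hypothetical example function)
# Usage: calculate_rate_per_1000(statistical_input, value_column,
#                                population_column, digits = 1)
# Description: Adds a rate per 1,000 population to a data frame
# Args/Options:
#   statistical_input: (data.frame or tibble) one row per observation
#   value_column:      (character) name of the column holding event counts
#   population_column: (character) name of the column holding population sizes
#   digits:            (numeric, default 1) decimal places to round to
# Returns: statistical_input with an added numeric column `rate_per_1000`
# Output: none (nothing is printed or written to disk)
calculate_rate_per_1000 <- function(statistical_input, value_column,
                                    population_column, digits = 1) {
  statistical_input$rate_per_1000 <- round(
    1000 * statistical_input[[value_column]] /
      statistical_input[[population_column]],
    digits
  )
  statistical_input
}

example_input <- data.frame(cases = c(5, 12), population = c(1000, 4000))
example_rates <- calculate_rate_per_1000(example_input, "cases", "population")
```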
The header tells you what the function does, its various inputs, and how you might go about using the function to do what you want. Also notice that all optional arguments (i.e. ones with pre-specified defaults) follow arguments that require user input.
- Note: As someone trying to call a function, you can access a function's documentation (and internal code) by CMD-Left-Clicking the function's name in RStudio.
- Note: Depending on how important your function is, the complexity of your function code, and the complexity of different types of data in your project, you can also add "type-checking" to your function with the `assertthat::assert_that()` function. You can, for example, `assert_that(is.data.frame(statistical_input))`, which will ensure that collaborators or reviewers of your project attempting to use your function are using it in the way that it is intended by calling it with (at the minimum) the correct type of arguments. You can extend this to ensure that certain assumptions regarding the inputs are fulfilled as well (e.g. that `time_column`, `location_column`, `value_column`, and `population_column` all exist within the `statistical_input` tibble).
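The checks described above could be sketched as follows, assuming the `assertthat` package is installed. The helper name `check_statistical_input()` is hypothetical, and treating `time_column` and the other names as literal column names is an assumption made for illustration:

```r
library(assertthat)

# Hypothetical helper: validates the input before any computation is done
check_statistical_input <- function(
    statistical_input,
    required_columns = c("time_column", "location_column",
                         "value_column", "population_column")) {
  # Type check: fail early if the input is not a data frame (or tibble)
  assert_that(is.data.frame(statistical_input))

  # Assumption check: every required column must be present
  missing_columns <- setdiff(required_columns, names(statistical_input))
  assert_that(
    length(missing_columns) == 0,
    msg = paste("Missing columns:", paste(missing_columns, collapse = ", "))
  )

  invisible(statistical_input)
}

valid_input <- data.frame(
  time_column = 2017,
  location_column = "A",
  value_column = 10,
  population_column = 1000
)
check_statistical_input(valid_input)
```

Calling the helper with a string, or with a data frame that lacks one of the required columns, stops with an error instead of failing later in a harder-to-diagnose way.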
Object naming
Generally we recommend using nouns for objects and verbs for functions. This is because functions are performing actions, while objects are not.
Try to make your variable names both more expressive and more explicit. Being a bit more verbose is useful and easy in the age of autocompletion! For example, instead of naming a variable `vaxcov_1718`, try naming it `vaccination_coverage_2017_18`. Similarly, `flu_res` could be named `absentee_flu_residuals`, making your code more readable and explicit.
- For more help, check out Be Expressive: How to Give Your Variables Better Names
We recommend you use `snake_case`.
- Base R allows `.` in variable names and functions (such as `read.csv()`), but this goes against best practices for variable naming in many other coding languages. For consistency's sake, `snake_case` has been adopted across languages, and modern packages and functions typically use it (e.g. `readr::read_csv()`). As a very general rule of thumb, if a package you're using doesn't use `snake_case`, there may be an updated version or more modern package that does, bringing with it the variety of performance improvements and bug fixes inherent in more mature and modern software.
- Note: you may also see `camelCase` throughout the R code you come across. This is okay but not ideal -- try to stay consistent across all your code with `snake_case`.
- Note: again, there's nothing inherently wrong with using `.` in variable names, just that it goes against style best practices that are cropping up in data science, so it's worth getting rid of these bad habits now.
Function calls


The here package
The `here` package is one great R package that helps multiple collaborators deal with the mess that is working directories within an R project structure. Let's say we have an R project at the path `/home/oski/Some-R-Project`. My collaborator might clone the repository and work with it at some other path, such as `/home/bear/R-Code/Some-R-Project`. Dealing with working directories and paths explicitly can be a very large pain, and as you might imagine, setting up a Config with paths requires those paths to flexibly work for all contributors to a project. This is where the `here` package comes in, and this is a great vignette describing it.
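A minimal sketch of how `here::here()` builds paths relative to the project root, assuming the `here` package is installed. The `data/raw/survey.csv` location is hypothetical:

```r
library(here)

# Build a path from the project root instead of hard-coding a user's directory.
# The same line resolves correctly for /home/oski/Some-R-Project and
# /home/bear/R-Code/Some-R-Project alike.
raw_data_path <- here("data", "raw", "survey.csv")

# The path can then be passed to any reader, e.g. read.csv(raw_data_path)
```

Defining paths like this once in the configuration file means no contributor needs to call `setwd()` or edit paths after cloning.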
Reading/Saving Data
`.RDS` vs `.RData` Files
One of the most common ways to load and save data in Base R is with the `load()` and `save()` functions, which serialize multiple objects in a single `.RData` file. The biggest problems with this practice include an inability to control the names of things getting loaded in, the inherent confusion this creates in understanding older code, and the inability to load individual elements of a saved file. For these reasons, we recommend using the RDS format to save R objects.
- Note: if you have many related R objects you would have otherwise saved all together using the `save` function, the functional equivalent with `RDS` would be to create a (named) list containing each of these objects, and saving it.
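The named-list approach from the note above can be sketched with base R alone. The object names and the temporary file path are illustrative:

```r
# Bundle related objects into one named list
model_inputs <- list(
  cars = datasets::mtcars,
  air_quality = datasets::airquality
)

# Save the list as a single RDS file
rds_path <- file.path(tempdir(), "model_inputs.RDS")
saveRDS(model_inputs, rds_path)

# On reading, the caller chooses the object name, unlike with load()
restored_inputs <- readRDS(rds_path)
cars <- restored_inputs$cars
```

Because `readRDS()` returns a value rather than silently creating objects in the environment, it is always clear in the script where each object came from.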
CSVs
Once again, the `readr` package, part of the Tidyverse, is great, with a much faster `read_csv()` than Base R's `read.csv()`. For massive CSVs (> 5 GB), `data.table::fread()` is typically faster still and is among the fastest CSV readers available. For writing CSVs, `readr::write_csv()` and `data.table::fwrite()` outperform Base R's `write.csv()` by a significant margin as well.
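A brief sketch of these readers and writers, assuming the `readr` and `data.table` packages are installed. The file names are illustrative, and a small built-in dataset stands in for a large CSV:

```r
csv_path_readr <- file.path(tempdir(), "cars_readr.csv")
csv_path_dt <- file.path(tempdir(), "cars_dt.csv")

# Write the same data with each package
readr::write_csv(datasets::mtcars, csv_path_readr)
data.table::fwrite(datasets::mtcars, csv_path_dt)

# Read it back: read_csv() returns a tibble, fread() returns a data.table
cars_from_readr <- readr::read_csv(csv_path_readr)
cars_from_fread <- data.table::fread(csv_path_dt)
```

Note that neither writer stores row names, so keep identifiers in an explicit column before writing.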
Tidyverse
[Table comparing Base R functions with Tidyverse equivalents that offer better style, performance, and utility]
For a more extensive set of syntactical translations to Tidyverse, you can check out this document.
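As one illustration of this kind of translation (an example chosen here, not taken from the comparison above), the same row filter and column selection can be written in Base R and with `dplyr`, assuming `dplyr` is installed:

```r
library(dplyr)

# Base R: logical indexing for rows, character vector for columns
heavy_cars_base <- mtcars[mtcars$wt > 3.5, c("mpg", "wt")]

# dplyr: each step is a named verb, chained with the pipe
heavy_cars_tidy <- mtcars %>%
  filter(wt > 3.5) %>%
  select(mpg, wt)
```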
Working with Tidyverse within functions can be somewhat of a pain due to non-standard evaluation (NSE) semantics. If you're an avid function writer, we'd recommend checking out the following resources:
- Tidy Eval in 5 Minutes (video)
- Tidy Evaluation (e-book)
- Data Frame Columns as Arguments to Dplyr Functions (blog)
- Standard Evaluation for *_join (stackoverflow)
- Programming with dplyr (package vignette)
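A small sketch of the most common pattern those resources cover: passing column names into a function that uses `dplyr` verbs, via the embrace operator `{{ }}`. This assumes `dplyr` 1.0 or later, and the function name `summarize_mean()` is hypothetical:

```r
library(dplyr)

# Hypothetical helper: mean of one column within groups of another.
# {{ }} forwards the unquoted column names supplied by the caller.
summarize_mean <- function(data, group_column, value_column) {
  data %>%
    group_by({{ group_column }}) %>%
    summarize(
      mean_value = mean({{ value_column }}, na.rm = TRUE),
      .groups = "drop"
    )
}

mean_mpg_by_cyl <- summarize_mean(mtcars, cyl, mpg)
```

Without the braces, `group_by(group_column)` would look for a column literally named `group_column` and fail.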
Integrating Box and Dropbox
Box and Dropbox are cloud-based file sharing systems that are useful when dealing with large files. When our scripts generate large output files, the files can slow down the workflow if they are pushed to GitHub. This makes collaboration difficult when not everyone has a copy of the file, unless we decide to duplicate files and share them manually. The files might also take up a lot of local storage. Box and Dropbox help us avoid these issues by automatically storing the files, reading data, and writing data back to the cloud.
Box and Dropbox are separate platforms, but we can use either one to store and share files. To use them, we can install the packages that have been created to integrate Box and Dropbox into R. The set-up instructions are detailed below.
Make sure to authenticate before reading from or writing to either Box or Dropbox. The authentication commands should go in the configuration file; authentication only needs to be done once. It will prompt you for your Box or Dropbox login credentials and allow your application to access your shared folders.
Box
Follow the instructions in this section to use the `boxr` package. Note that a few setup steps must be completed on the Box website before you can use the `boxr` package, explained here in the section "Creating an Interactive App." This step generates the authentication keys (a client ID and a client secret). Once that is done, add the authentication keys to your code in the configuration file, with `box_auth(client_id = "<your_client_id>", client_secret = "<your_client_secret_id>")`. It is also important to set the default working directory so that the code can reference the correct folder in Box: `box_setwd(<folder_id>)`. The folder ID is the sequence of digits at the end of the folder's URL.
Further details can be found here.
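A sketch of how these two calls might sit in the configuration file, assuming the `boxr` package is installed. Reading the keys from environment variables (rather than typing them into the script) is a suggestion made here, and the variable name `BOX_FOLDER_ID` is hypothetical:

```r
# Read credentials from the environment so they are never committed to GitHub
box_client_id <- Sys.getenv("BOX_CLIENT_ID")
box_client_secret <- Sys.getenv("BOX_CLIENT_SECRET")
box_folder_id <- Sys.getenv("BOX_FOLDER_ID")  # hypothetical variable name

# Authenticate only when credentials are available on this machine
if (nzchar(box_client_id) && nzchar(box_client_secret)) {
  boxr::box_auth(client_id = box_client_id, client_secret = box_client_secret)

  # Point boxr at the shared project folder
  boxr::box_setwd(box_folder_id)
}
```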
Dropbox
Follow the instructions at this link to use the `rdrop2` package. Similar to the `boxr` package, you must authenticate before reading and writing from Dropbox, which can be done by adding `drop_auth()` to the configuration file.
Saving the authentication token is not required, although it may be useful if you plan on using Dropbox frequently. To do so, save the token with the following commands. Tokens are valid until they are manually revoked.
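One way this could look, assuming the `rdrop2` package is installed. The token file name is hypothetical, and the file should be kept out of version control (for example by listing it in `.gitignore`), since anyone holding it can access your Dropbox:

```r
token_path <- "dropbox_token.rds"  # hypothetical file name; do not commit it

if (file.exists(token_path)) {
  # Later sessions: reuse the saved token without a browser prompt
  rdrop2::drop_auth(rdstoken = token_path)
} else if (interactive()) {
  # First session: authenticate in the browser, then save the token
  token <- rdrop2::drop_auth()
  saveRDS(token, file = token_path)
}
```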
Coding with R and Python
If you're using both R and Python, you may wish to check out the Feather package for exchanging data between the two languages extremely quickly.
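A minimal sketch of writing and reading a Feather file from R. It uses the `arrow` package, which provides `write_feather()` and `read_feather()`; using `arrow` rather than the standalone `feather` package is a choice made for this example, and the file name is illustrative:

```r
feather_path <- file.path(tempdir(), "cars.feather")

# Write a data frame to Feather from R
arrow::write_feather(datasets::mtcars, feather_path)

# Read it back in R; on the Python side, pandas.read_feather() reads the same file
cars_from_feather <- arrow::read_feather(feather_path)
```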
Reviewing Code
Before publishing new changes, it is important to ensure that the code has been tested and well-documented. GitHub makes it possible to document all of these changes in a pull request. Pull requests can be used to describe changes in a branch that are ready to be merged with the base branch (more information in the GitHub section). GitHub allows users to create a pull request template in a repository to standardize and customize the information in a pull request. When you add a pull request template to your repository, everyone will automatically see the template's contents in the pull request body.
Creating a Pull Request Template
Follow the instructions below to add a pull request template to a repository. More details can be found at this GitHub link.
- On GitHub, navigate to the main page of the repository.
- Above the file list, click `Create new file`.
- Name the file `pull_request_template.md`. GitHub will not recognize this as the template if it is named anything else. The file must be on the repository's default branch (for example, `master`).
  - To store the file in a hidden directory instead of the main directory, name the file `.github/pull_request_template.md`.
- In the body of the new file, add your pull request template. This could include:
  - A summary of the changes proposed in the pull request
  - How the change has been tested
  - @mentions of the person or team responsible for reviewing proposed changes
Here is an example pull request template.