Coding Practices


Organizing scripts

Just as your data "flows" through your project, data should flow naturally through a script. Very generally, you want to:

1. describe the work completed in the script in a comment header
2. source your configuration file (0-config.R)
3. load all your data
4. do all your analysis/computation
5. save your data.

Each of these sections should be "chunked together" using comments. See this file for a good example of how to cleanly organize a file in a way that follows this "flow" and functionally separate pieces of code that are doing different things.
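As a rough illustration of this flow, here is a minimal sketch of a script skeleton. The project name, file names, and toy data are hypothetical, and the paths that would normally come from 0-config.R are set inline (using a temporary directory) so that the sketch runs on its own.

```r
################################################################################
# Project: Hypothetical Absenteeism Study (assumed project name)
# Script:  1-summarize-absences.R (assumed file name)
# Purpose: Compute the mean absence rate for each school
################################################################################

# Configuration ----------------------------------------------------------------
# In a real project this section would be: source("0-config.R")
# The paths below stand in for what the config file would define.
data_dir <- tempdir()
input_path <- file.path(data_dir, "absences.rds")
output_path <- file.path(data_dir, "absence_summary.rds")

# Toy input, created only so the sketch is runnable (not part of a real script)
saveRDS(
  data.frame(school = c("A", "A", "B"), absence_rate = c(0.10, 0.20, 0.05)),
  file = input_path
)

# Load data --------------------------------------------------------------------
absences <- readRDS(input_path)

# Analysis ---------------------------------------------------------------------
# Mean absence rate per school
absence_summary <- aggregate(absence_rate ~ school, data = absences, FUN = mean)

# Save data --------------------------------------------------------------------
saveRDS(absence_summary, file = output_path)
```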

Documenting your code

File headers

Every file in a project should have a header that allows it to be interpreted on its own. It should include the name of the project and a short description for what this file (among the many in your project) does specifically. You may optionally wish to include the inputs and outputs of the script as well, though the next section makes this significantly less necessary.

Sections and subsections

RStudio (v1.4 or more recent) supports the use of Sections and Subsections. You can easily navigate through longer scripts using the navigation pane in RStudio, as shown on the right below.

Code folding

Consider using RStudio's code folding feature to collapse and expand different sections of your code. Any comment line with at least four trailing dashes (-), equal signs (=), or pound signs (#) automatically creates a code section. For example:
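A small sketch of what such section comments can look like. The section titles and the toy data are made up for illustration; the trailing characters are what RStudio picks up.

```r
# Load data --------------------------------------------------------------------
vaccination_coverage <- data.frame(
  season = c("2016_17", "2017_18"),
  coverage = c(0.41, 0.45)
)

# Compute summaries ============================================================
mean_coverage <- mean(vaccination_coverage$coverage)

# Report results ###############################################################
print(mean_coverage)
```

Each of the three comment lines above ends in at least four dashes, equal signs, or pound signs, so each should start a foldable section that also appears in the navigation pane.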

Comments in the body of your script

Commenting your code is an important part of reproducibility and helps document your code for the future. When things change or break, you'll be thankful for comments. There's no need to comment excessively or unnecessarily, but a comment describing what a large or complex chunk of code does is always helpful. See this file for an example of how to comment your code and notice that comments are always in the form of:

# This is a comment -- first letter is capitalized and spaced away from the pound sign

Function documentation

Every function you write must include a header that documents its purpose, inputs, and outputs. These headers are essential for reproducible workflows because R is dynamically typed: you can pass a string into an argument that is meant to be a data.table, or a list into an argument meant for a tibble, and nothing checks the type for you. It is the responsibility of a function's author to document what each argument is meant to do and its basic type. This is an example for documenting a function (inspired by Javadoc and R's Plumber API docs):
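A minimal sketch of such a header. The implementation of calc_fluseas_mean shown here, including its argument names and the maari_yn outcome column, is hypothetical and only meant to illustrate the documentation style.

```r
# Documentation: calc_fluseas_mean
# Usage:         calc_fluseas_mean(data, yname, silent = FALSE)
# Description:   Compute the mean of an outcome within each flu season
# Args:
#   data:   data.frame with a flu_season column and the outcome column
#   yname:  string, name of the outcome column in data
#   silent: logical, if TRUE the result is not printed (optional, default FALSE)
# Returns: data.frame with one row per flu season and the columns
#          flu_season and mean_outcome
calc_fluseas_mean <- function(data, yname, silent = FALSE) {
  # Mean of the outcome within each flu season
  season_means <- tapply(data[[yname]], data$flu_season, mean)

  result <- data.frame(
    flu_season = names(season_means),
    mean_outcome = as.numeric(season_means),
    stringsAsFactors = FALSE
  )

  if (!silent) {
    print(result)
  }

  result
}

# Example usage with toy data (hypothetical values)
flu_data <- data.frame(
  flu_season = c("2016_17", "2016_17", "2017_18"),
  maari_yn = c(1, 0, 1)
)
flu_season_means <- calc_fluseas_mean(
  data = flu_data,
  yname = "maari_yn",
  silent = TRUE
)
```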

The header tells you what the function does, its various inputs, and how you might go about using the function to do what you want. Also notice that all optional arguments (i.e. ones with pre-specified defaults) follow arguments that require user input.

Note: When calling a function, you can view its documentation (and internal code) by CMD-Left-Clicking the function's name in RStudio.
Note: Depending on how important your function is, the complexity of your function code, and the complexity of different types of data in your project, you can also add "type-checking" to your function with the assertthat::assert_that() function. You can, for example, assert_that(is.data.frame(statistical_input)), which ensures that collaborators or reviewers attempting to use your function call it with (at a minimum) the correct type of arguments. You can extend this to ensure that certain assumptions regarding the inputs are fulfilled as well (e.g. that time_column, location_column, value_column, and population_column all exist within the statistical_input tibble).
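A short sketch of this kind of type-checking, assuming the assertthat package is installed. The function calc_rate and its body are hypothetical; only the argument names are borrowed from the note above.

```r
# Documentation: calc_rate (hypothetical function for illustration)
# Args:
#   statistical_input: data.frame (or tibble) holding the columns named below
#   value_column:      string, name of the column with event counts
#   population_column: string, name of the column with population sizes
# Returns: numeric vector of value / population for each row
calc_rate <- function(statistical_input, value_column, population_column) {
  # Check the basic types of the arguments
  assertthat::assert_that(is.data.frame(statistical_input))
  assertthat::assert_that(assertthat::is.string(value_column))
  assertthat::assert_that(assertthat::is.string(population_column))

  # Check that the named columns actually exist in the input
  assertthat::assert_that(
    all(c(value_column, population_column) %in% names(statistical_input)),
    msg = "value_column and population_column must exist in statistical_input"
  )

  statistical_input[[value_column]] / statistical_input[[population_column]]
}

# Toy data (hypothetical values)
county_counts <- data.frame(cases = c(5, 10), population = c(100, 400))
county_rates <- calc_rate(
  statistical_input = county_counts,
  value_column = "cases",
  population_column = "population"
)
```

Calling calc_rate() with something other than a data frame, or with a column name that does not exist, stops with an error message instead of failing somewhere deeper in the function.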

Object naming

Generally we recommend using nouns for objects and verbs for functions. This is because functions are performing actions, while objects are not.

Try to make your variable names both more expressive and more explicit. Being a bit more verbose is useful and easy in the age of autocompletion! For example, instead of naming a variable vaxcov_1718, try naming it vaccination_coverage_2017_18. Similarly, flu_res could be named absentee_flu_residuals, making your code more readable and explicit.

For more help, check out Be Expressive: How to Give Your Variables Better Names

We recommend you use snake_case.

Base R allows periods (.) in variable and function names (such as read.csv()), but this goes against best practices for naming in many other coding languages. For consistency's sake, snake_case has been adopted across languages, and modern packages and functions typically use it (e.g. readr::read_csv()). As a very general rule of thumb, if a package you're using doesn't use snake_case, there may be an updated version or more modern package that does, bringing with it the variety of performance improvements and bug fixes inherent in more mature and modern software.
Note: you may also see camelCase throughout the R code you come across. This is okay but not ideal; try to stay consistent across all your code with snake_case.
Note: there's nothing inherently wrong with using periods (.) in variable names; it just goes against the style best practices that are emerging in data science, so it's worth dropping the habit now.

Function calls

In a function call, use "named arguments" and put each argument on a separate line to make your code more readable.

Here's an example of what not to do when calling the function calc_fluseas_mean (defined above):

And here it is again using the best practices we've outlined:
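A sketch contrasting the two call styles. The calc_fluseas_mean definition below is a hypothetical stand-in included only so the example runs on its own; its arguments and the toy data are assumptions.

```r
# Stand-in definition so the sketch is self-contained (hypothetical)
calc_fluseas_mean <- function(data, yname, silent = FALSE) {
  season_means <- tapply(data[[yname]], data$flu_season, mean)
  if (!silent) {
    print(season_means)
  }
  season_means
}

# Toy data (hypothetical values)
flu_data <- data.frame(
  flu_season = c("2016_17", "2016_17", "2017_18"),
  maari_yn = c(1, 0, 1)
)

# Harder to read: positional arguments squeezed onto one line
mean_outcome_positional <- calc_fluseas_mean(flu_data, "maari_yn", TRUE)

# Easier to read: named arguments, one per line
mean_outcome_named <- calc_fluseas_mean(
  data = flu_data,
  yname = "maari_yn",
  silent = TRUE
)
```

Both calls return the same result, but in the second one a reader does not need to look up the function definition to know what TRUE refers to.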

The here package

The here package is a great R package that helps multiple collaborators deal with the mess that is working directories within an R project structure. Let's say we have an R project at the path /home/oski/Some-R-Project. My collaborator might clone the repository and work with it at some other path, such as /home/bear/R-Code/Some-R-Project. Dealing with working directories and paths explicitly can be a very large pain, and as you might imagine, setting up a config file with paths requires those paths to work flexibly for all contributors to a project. This is where the here package comes in, and this is a great vignette describing it.
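A minimal sketch of building a path with here::here(), assuming the here package is installed. The folder and file names are hypothetical.

```r
library(here)

# here::here() builds the path starting from the project root, so the same line
# works on every collaborator's machine regardless of where the project lives.
# The "data/raw/flu_visits.csv" location is a made-up example.
raw_data_path <- here::here("data", "raw", "flu_visits.csv")

print(raw_data_path)
```

A path defined this way in the configuration file can then be reused by every script in the project without any call to setwd().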

Reading/Saving Data

`.RDS` vs `.RData` Files

One of the most common ways to load and save data in Base R is with the load() and save() functions, which serialize multiple objects in a single .RData file. The biggest problems with this practice include an inability to control the names of the objects getting loaded in, the inherent confusion this creates in understanding older code, and the inability to load individual elements of a saved file. For these reasons, we recommend using the RDS format to save R objects.

Note: if you have many related R objects you would have otherwise saved all together using the save function, the functional equivalent with RDS would be to create a (named) list containing each of these objects, and saving it.
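A short sketch of the named-list approach using base R's saveRDS() and readRDS(). The object names and values are hypothetical, and a temporary directory is used so the example runs anywhere.

```r
# Two related objects that might otherwise be stored together with save()
vaccination_coverage_2017_18 <- data.frame(
  county = c("Alameda", "Marin"),
  coverage = c(0.45, 0.52)
)
absentee_flu_residuals <- c(0.12, -0.08, 0.03)

# Bundle them in a named list and write a single .RDS file
analysis_objects <- list(
  vaccination_coverage = vaccination_coverage_2017_18,
  flu_residuals = absentee_flu_residuals
)
rds_path <- file.path(tempdir(), "analysis_objects.RDS")
saveRDS(analysis_objects, file = rds_path)

# Unlike load(), readRDS() returns the object, so the reader picks the name
restored_objects <- readRDS(rds_path)
restored_coverage <- restored_objects$vaccination_coverage
```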

CSVs

Once again, the readr package, part of the Tidyverse, is great, with a much faster read_csv() than Base R's read.csv(). For massive CSVs (> 5 GB), you'll find data.table::fread() to be among the fastest CSV readers available in any data science language. For writing CSVs, readr::write_csv() and data.table::fwrite() outclass Base R's write.csv() by a significant margin as well.

Tidyverse

Throughout this document there have been references to the Tidyverse, but this section explicitly shows you how to translate your Base R tendencies to Tidyverse (or data.table, Tidyverse's performance-optimized competitor). For most of our work that does not involve very large datasets, we recommend that you code in Tidyverse rather than Base R. Tidyverse is quickly becoming the gold standard in R data analysis, and modern data science packages and code should use Tidyverse style and packages unless there's a significant reason not to (e.g. big data pipelines that would benefit from data.table's performance optimizations).

The package author has published a great textbook on R for Data Science, which leans heavily on many Tidyverse packages and may be worth checking out.

The following list is not exhaustive, but is a compact overview to begin to translate Base R into something better:

| Base R | Better Style, Performance, and Utility |
|---|---|
| read.csv() | readr::read_csv() or data.table::fread() |
| write.csv() | readr::write_csv() or data.table::fwrite() |
| readRDS() | readr::read_rds() |
| saveRDS() | readr::write_rds() |
| data.frame() | tibble::tibble() or data.table::data.table() |
| rbind() | dplyr::bind_rows() |
| cbind() | dplyr::bind_cols() |
| df$some_column | df %>% dplyr::pull(some_column) |
| df$some_column = ... | df %>% dplyr::mutate(some_column = ...) |
| df[get_rows_condition,] | df %>% dplyr::filter(get_rows_condition) |
| df[, c("col1", "col2")] | df %>% dplyr::select(col1, col2) |
| merge(df1, df2, by = ..., all.x = ..., all.y = ...) | df1 %>% dplyr::left_join(df2, by = ...) or dplyr::full_join or dplyr::inner_join or dplyr::right_join |
| str() | dplyr::glimpse() |
| grep(pattern, x) | stringr::str_which(string, pattern) |
| gsub(pattern, replacement, x) | stringr::str_replace_all(string, pattern, replacement) |
| ifelse(test_expression, yes, no) | dplyr::if_else(condition, true, false) |
| Nested: ifelse(test_expression1, yes1, ifelse(test_expression2, yes2, ifelse(test_expression3, yes3, no))) | dplyr::case_when(test_expression1 ~ yes1, test_expression2 ~ yes2, test_expression3 ~ yes3, TRUE ~ no) |
| proc.time() | tictoc::tic() and tictoc::toc() |
| stopifnot() | assertthat::assert_that() or assertthat::see_if() or assertthat::validate_that() |

For a more extensive set of syntactical translations to Tidyverse, you can check out this document.
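To make a few of these translations concrete, here is a small sketch that performs the same filter, select, and new-column steps in Base R and in Tidyverse, assuming the dplyr and tibble packages are installed. The flu_visits data and its column names are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values)
flu_visits <- tibble::tibble(
  county = c("Alameda", "Alameda", "Marin", "Marin"),
  season = c("2016_17", "2017_18", "2016_17", "2017_18"),
  visits = c(120, 95, 40, 55)
)

# Base R: subset rows and columns with brackets, then add a column with $
base_result <- flu_visits[flu_visits$visits > 50, c("county", "visits")]
base_result$visits_per_hundred <- base_result$visits / 100

# Tidyverse: the same steps as a pipeline of named verbs
tidy_result <- flu_visits %>%
  dplyr::filter(visits > 50) %>%
  dplyr::select(county, visits) %>%
  dplyr::mutate(visits_per_hundred = visits / 100)
```

Both versions keep the same rows and columns; the pipeline simply names each step, which tends to be easier to read and review.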

Working with Tidyverse within functions can be somewhat of a pain due to non-standard evaluation (NSE) semantics. If you're an avid function writer, we'd recommend checking out the following resources:

Tidy Eval in 5 Minutes (video)
Tidy Evaluation (e-book)
Data Frame Columns as Arguments to Dplyr Functions (blog)
Standard Evaluation for *_join (stackoverflow)
Programming with dplyr (package vignette)
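As a taste of what these resources cover, here is a minimal sketch of passing column names into a function with the embrace operator ({{ }}), assuming a recent version of dplyr (1.0 or later). The function summarize_group_mean and the toy data are hypothetical.

```r
library(dplyr)

# Documentation: summarize_group_mean (hypothetical function for illustration)
# Args:
#   data:         data.frame or tibble
#   group_column: unquoted name of the column to group by
#   value_column: unquoted name of the numeric column to average
# Returns: tibble with one row per group and a mean_value column
summarize_group_mean <- function(data, group_column, value_column) {
  data %>%
    dplyr::group_by({{ group_column }}) %>%
    dplyr::summarize(
      mean_value = mean({{ value_column }}),
      .groups = "drop"
    )
}

# Toy data (hypothetical values)
flu_visits <- tibble::tibble(
  county = c("Alameda", "Alameda", "Marin", "Marin"),
  visits = c(120, 95, 40, 55)
)

county_means <- summarize_group_mean(
  data = flu_visits,
  group_column = county,
  value_column = visits
)
```

Without the {{ }} wrapper, dplyr would look for columns literally named group_column and value_column, which is the usual source of confusion when wrapping Tidyverse code in functions.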

Integrating Box and Dropbox

Box and Dropbox are cloud-based file sharing systems that are useful when dealing with large files. When our scripts generate large output files, the files can slow down the workflow if they are pushed to GitHub. This makes collaboration difficult when not everyone has a copy of the file, unless we decide to duplicate files and share them manually. The files might also take up a lot of local storage. Box and Dropbox help us avoid these issues by automatically storing the files, reading data, and writing data back to the cloud.

Box and Dropbox are separate platforms, but we can use either one to store and share files. To use them, we can install the packages that have been created to integrate Box and Dropbox into R. The set-up instructions are detailed below.

Make sure to authenticate before reading from or writing to either Box or Dropbox. The authentication commands should go in the configuration file, so that authentication only needs to be done once. Authenticating will prompt you to give your login credentials for Box or Dropbox and will allow your application to access your shared folders.

Box

Follow the instructions in this section to use the boxr package. Note that a few setup steps must be completed on the Box website before you can use the boxr package, explained here in the section "Creating an Interactive App." These steps generate the authentication keys (a client ID and a client secret) for your Box app. Once that is done, add the authentication keys to your code in the configuration file, with box_auth(client_id = "<your_client_id>", client_secret = "<your_client_secret>"). It is also important to set the default working directory so that the code can reference the correct folder in Box: box_setwd(<folder_id>). The folder ID is the sequence of digits at the end of the folder's URL.

Further details can be found here.
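A minimal sketch of how these two calls might be wrapped for the configuration file, assuming the boxr package is installed. The wrapper name authenticate_box is hypothetical, and reading the keys from the environment variables BOX_CLIENT_ID and BOX_CLIENT_SECRET (set, for example, in .Renviron) is an assumption made here to keep the keys out of the script itself.

```r
# Hypothetical helper for the configuration file.
# Args:
#   folder_id: the sequence of digits at the end of the Box folder's URL
authenticate_box <- function(folder_id) {
  # Authenticate with the keys created on the Box website
  boxr::box_auth(
    client_id = Sys.getenv("BOX_CLIENT_ID"),
    client_secret = Sys.getenv("BOX_CLIENT_SECRET")
  )

  # Point boxr at the shared project folder
  boxr::box_setwd(folder_id)
}
```

Defining the helper does not contact Box; authentication happens only when authenticate_box() is called with a real folder ID.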

Dropbox

Follow the instructions at this link to use the rdrop2 package. Similar to the boxr package, you must authenticate before reading and writing from Dropbox, which can be done by adding drop_auth() to the configuration file.

Saving the authentication token is not required, although it may be useful if you plan on using Dropbox frequently. To do so, save the token with the following commands. Tokens are valid until they are manually revoked.
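A minimal sketch of saving and reusing a token, assuming the rdrop2 package is installed. The helper names and the default file name droptoken.rds are hypothetical; drop_auth() and its rdstoken argument come from rdrop2.

```r
# Hypothetical helper: authenticate once in the browser and store the token
save_dropbox_token <- function(token_path = "droptoken.rds") {
  token <- rdrop2::drop_auth()
  saveRDS(token, file = token_path)
  invisible(token_path)
}

# Hypothetical helper: authenticate in later sessions from the stored token
load_dropbox_token <- function(token_path = "droptoken.rds") {
  rdrop2::drop_auth(rdstoken = token_path)
}
```

Because the saved token grants access to your Dropbox account, it is worth keeping the token file out of version control (for example by listing it in .gitignore).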

Coding with R and Python

If you're using both R and Python, you may wish to check out the Feather package for exchanging data between the two languages extremely quickly.

Reviewing Code

Before publishing new changes, it is important to ensure that the code has been tested and well-documented. GitHub makes it possible to document all of these changes in a pull request. Pull requests can be used to describe changes in a branch that are ready to be merged with the base branch (more information in the GitHub section). GitHub allows users to create a pull request template in a repository to standardize and customize the information in a pull request. When you add a pull request template to your repository, everyone will automatically see the template's contents in the pull request body.

Creating a Pull Request Template

Follow the instructions below to add a pull request template to a repository. More details can be found at this GitHub link.

1. On GitHub, navigate to the main page of the repository.
2. Above the file list, click Create new file.
3. Name the file pull_request_template.md. GitHub will not recognize this as the template if it is named anything else. The file must be on the repository's default branch (e.g. master).
   - To store the file in a hidden directory instead of the main directory, name the file .github/pull_request_template.md.
4. In the body of the new file, add your pull request template. This could include:
   - A summary of the changes proposed in the pull request
   - How the change has been tested
   - @mentions of the person or team responsible for reviewing proposed changes

Here is an example pull request template.