How I Organize Research and Coding Projects for Long-Term Success

Tips I Wish I Had Known 10 Years Ago

Recently, at one of my research group’s lab meetings, we discussed ways of keeping projects organized and working efficiently. Much of the discussion centered on organizing R scripts and data files, but the principles apply to basically any research project.

These computer organizational skills may be even more important now, as many students coming up lack a mental schema for file organization, something that those of us who remember command-line interactions with our first computers take for granted.

Being organized in research is absolutely critical. Months (or even years) may pass between the times you look at a project. You may also need to communicate your work to others. It is our responsibility to ourselves, to our colleagues, and to the public who funds our work to be organized and traceable.

Below, I’ll describe the system I settled on after years of trial and error, in three parts:

Directory (folder) organization
Naming files sequentially (most important part!)
Odds and ends

Directory Organization

I organize directories in a consistent manner across projects so I always know where to find, for example, the output of analyses for any project. I follow this format:

Within each project’s home directory I always have the following directories (subdirectories are shown with ../ before them):

data/
../data_raw
../data_wrangling
../data_output
analyses/
../analyses_output
figures/
../figures_wrangling
../figures_output
../figures_edited
docs/
incubator/
src/
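
A minimal sketch of how this skeleton could be scaffolded in one go, so every new project starts out identical. It is written in Python purely for illustration (in R, dir.create() with recursive = TRUE does the same job), and the names PROJECT_LAYOUT and scaffold_project are hypothetical, not part of any existing tool:

```python
from pathlib import Path

# Layout copied from the listing above: top-level directory -> its subdirectories.
PROJECT_LAYOUT = {
    "data": ["data_raw", "data_wrangling", "data_output"],
    "analyses": ["analyses_output"],
    "figures": ["figures_wrangling", "figures_output", "figures_edited"],
    "docs": [],
    "incubator": [],
    "src": [],
}


def scaffold_project(root):
    """Create the directory skeleton under `root`; existing folders are left alone."""
    root = Path(root)
    created = []
    for parent, children in PROJECT_LAYOUT.items():
        for path in [root / parent] + [root / parent / child for child in children]:
            # exist_ok=True makes the function safe to re-run on an existing project.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path.relative_to(root).as_posix())
    return created
```

Calling scaffold_project("bee_diversity") (a made-up project name) would create the folders and return their relative paths. Because nothing is overwritten, it can also be used to fill in folders missing from an older project.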

To explain these directories a bit:

data - only has subdirectories in it, no files live here
../data_raw - contains only raw data files. Do not edit them directly!
../data_wrangling - contains scripts to manipulate raw data. I begin my sequential numbering of scripts (see below) in this folder.
../data_output - contains the output of data_wrangling scripts
analyses - contains scripts associated with analysis. Where data wrangling ends and analysis begins can get a bit blurry depending on the project, but that’s fine because the files are all named sequentially! (see below)
../analyses_output - contains any outputs from analyses scripts
figures - usually this directory contains no files, but I sometimes get sloppy here
../figures_wrangling - scripts used to make figures from data_output or analyses_output files
../figures_output - the output of figure wrangling
../figures_edited - often, I make final edits in Inkscape, so I save those files here. I’ll often save the final version of figures in this directory or in a separate directory
docs - contains all Word and Markdown (and now Quarto) docs. Basically, written stuff. Sometimes I’ll make a subdirectory for the manuscript specifically
incubator - a messy directory to house all the mess. I use it a lot during the planning phase of a project and make folders for things like meeting notes, site data, budgets, flyers to recruit technicians, etc. Really just a junk pile that I do my best to keep in some sort of order. When I push my parent directory to GitHub, I use a .gitignore file to exclude this directory since it’s not intended to be public!
src - I don’t always have this folder, but sometimes I will if there are functions or something else that I want to reference across a bunch of scripts (e.g. I made a custom color palette for a project and want it to live somewhere I can load at the top of every figure script)
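
For the incubator exclusion, the whole rule is a single line, incubator/, in a .gitignore file at the project root (the trailing slash restricts the pattern to directories). A small sketch, assuming you want to add that line without clobbering rules already in the file; ensure_ignored is a hypothetical helper name:

```python
from pathlib import Path


def ensure_ignored(project_root, pattern="incubator/"):
    """Append `pattern` to the project's .gitignore unless it is already listed."""
    gitignore = Path(project_root) / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if pattern not in lines:
        lines.append(pattern)
        gitignore.write_text("\n".join(lines) + "\n")
    return lines
```

One caveat: .gitignore only affects files git is not already tracking. If the incubator folder was committed earlier, it has to be untracked first (for example with git rm -r --cached incubator) before the rule takes effect.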

Naming Files Sequentially

After setting up my directories comes the most important (and fun) part: populating these directories with data and scripts. I think it’s reasonable to imagine alterations to the above choice of directories and what exactly goes in them, but the next part is critically important and made a huge impact on folks in my research group.

I name all of my scripts sequentially. That means the very first script in any project I do is named “01a_example_script_name”, the second is named “01b_script_name”, and so forth.

Typically, I will name all of the steps related to data organizing and preparation for downstream analysis with the 01 series, all scripts related to actual analysis (e.g. model fitting) with the 02 series, and scripts related to making figures with the 03 series.

So a typical project will have some number of scripts named something like the following. Imagine an example about measuring bee diversity:

01a_loading_bee_data.R

01b_matching_bee_plant_data.R

02a_bee_plant_linear_models.R

02b_bee_ordination.R

03a_figure1_bee_plant.R

03b_figure2_bee_NMDS.R

Critical: this naming means that the output of these scripts gets specific names as well. For example, the dataset produced by 01b_matching_bee_plant_data.R lives in the data_output directory and is named something like output_01b_matched_bee_plant_data.Rdata.

Similarly, any files produced by the other scripts get names with matching prefixes (e.g. 01a, 02c, 02f; whatever). This way, I know exactly where any file was produced and can efficiently navigate back and make changes if needed.
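
To make the prefix convention concrete, here is a small sketch of deriving an output name from the script that produces it, and of walking back from an output file to its source script. It is in Python for illustration only; the helper names and the exact prefix pattern (two digits, a letter, an optional two-digit sub-step) are assumptions read off the examples above, not a fixed rule:

```python
import re
from pathlib import Path

# Assumed prefix shape: two digits, one lowercase letter, optional two-digit
# sub-step, followed by an underscore (e.g. 01b_ or 02a01_).
PREFIX = re.compile(r"^(\d{2}[a-z](?:\d{2})?)_")


def script_prefix(script_name):
    """Return the sequence prefix of a script file name, e.g. '01b'."""
    match = PREFIX.match(Path(script_name).name)
    if match is None:
        raise ValueError(f"no sequence prefix in {script_name!r}")
    return match.group(1)


def output_name(script_name, description, extension):
    """Build an output file name that carries the producing script's prefix."""
    return f"output_{script_prefix(script_name)}_{description}{extension}"


def producing_script(output_file, script_names):
    """Return the script whose prefix is embedded in `output_file`, or None."""
    match = re.match(r"^output_(\d{2}[a-z](?:\d{2})?)_", Path(output_file).name)
    if match is None:
        return None
    for name in script_names:
        # Skip files without a prefix instead of raising on them.
        if PREFIX.match(Path(name).name) and script_prefix(name) == match.group(1):
            return name
    return None
```

With the bee example, output_name("01b_matching_bee_plant_data.R", "matched_bee_plant_data", ".Rdata") gives output_01b_matched_bee_plant_data.Rdata, and producing_script() maps that file name back to the 01b script.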

Overall, this system has two main effects. First, I can efficiently track down any script or file. Second, I can hand my project directory to someone completely unfamiliar with the project and they will know the exact order in which to run the scripts to reproduce my analysis. Technically, you could do this with one giant script, but that approach is error-prone and can run into memory issues.
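
The "exact order" claim can be sketched in a few lines: collect every script under the project, ignore which folder it lives in, and sort by file name. The Python below is an illustration under assumptions, not a prescribed workflow. The function names are hypothetical, and run_all assumes that Rscript is on the PATH and that each script resolves its file paths relative to the project root:

```python
import subprocess
from pathlib import Path


def run_order(project_root, pattern="*.R"):
    """List every script under the project, sorted by file name rather than by folder."""
    scripts = Path(project_root).rglob(pattern)
    return sorted(scripts, key=lambda path: path.name)


def run_all(project_root):
    """Run each script in order with Rscript, stopping at the first failure."""
    for script in run_order(project_root):
        # resolve() gives an absolute path, so changing the working directory is safe.
        subprocess.run(["Rscript", str(script.resolve())], cwd=project_root, check=True)
```

Printing the names from run_order() is also a quick way to show a collaborator the pipeline at a glance before anything is executed.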

Odds and Ends

Other bits and pieces in this system are as follows:

Naming stuff: Most notably, you may notice that I don’t mind having pretty long or descriptive filenames. I think it’s way better to name something output_01b_matched_bee_plant_data.Rdata than, say, bp_data.Rdata.

When in the pipeline and in what script was the latter file produced? Before or after the matching? If it’s been months, what does “bp” mean again? Etc.

The fact is, we can use tab-complete or other auto completions to point our scripts to specific files or directories, so I usually err on the side of descriptive filenames or object names within scripts.

Saving data: I tend to save data as an Rdata object if I’m only ever going to use it in R and as a csv or some other format if I might be exporting it out of R. I’m not very consistent there – but because we’ve named our files in a traceable manner, we can always go back to the source script and change our minds!

Lumping vs. splitting folders: A sequential naming format also lets you keep a more streamlined directory structure, if you prefer. Some people like to have all of their scripts together in one directory. If you do this, naming your scripts 01a, 01b, etc. and then 02a, 02b, etc. still keeps them organized even in one giant pool. I would still recommend subdirectories for outputs if you go this route.

Messy reality: Sometimes I diverge slightly and use something like 02a01 and 02a02 when I realize I need to break a script into parts but want to keep all the other scripts in order. Again, I think this is fine so long as you track things properly, because it keeps everything in order. Also, if there is a big switch in what’s happening between stages of analysis, I’ll sometimes go from calling everything in analysis 02a, 02b, etc. to using 02a, 02b, then 03a, 03b, etc. I might do this if there’s a big jump in the data format or if I need to run an analysis in separate software. Then the figures would be 04a and so forth. The main objective is preserved: everything is quickly sortable by name.
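
A short sketch of what "sortable by name" looks like once sub-steps such as 02a01 enter the picture. It parses the prefix into its parts and sorts on those; the pattern and the function name sequence_key are assumptions for illustration, and 02a02_bee_plant_mixed_models.R is a made-up file name:

```python
import re

# Assumed prefix shape: two digits, a letter, and an optional two-digit sub-step.
PREFIX = re.compile(r"^(\d{2})([a-z])(\d{2})?_")


def sequence_key(file_name):
    """Turn '02a01_name.R' into (2, 'a', 1); a missing sub-step counts as 0."""
    match = PREFIX.match(file_name)
    if match is None:
        raise ValueError(f"no sequence prefix in {file_name!r}")
    series, letter, sub_step = match.groups()
    return (int(series), letter, int(sub_step or 0))


scripts = [
    "03a_figure1_bee_plant.R",
    "02a02_bee_plant_mixed_models.R",  # hypothetical second half of a split script
    "02b_bee_ordination.R",
    "01a_loading_bee_data.R",
    "02a01_bee_plant_linear_models.R",
]
ordered = sorted(scripts, key=sequence_key)
```

A plain alphabetical sort gives the same order for this list. The explicit key only matters if a bare 02a script is kept alongside 02a01: in a byte-wise sort, digits come before the underscore, so 02a01_ would land ahead of 02a_. Renaming the original script to 02a01 when splitting it avoids the question entirely.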

Conclusion