Lab 1: Getting comfortable in R and RStudio

Objectives

Install R and RStudio
Get comfortable with basics of programming in R
Develop enough coding literacy to correct a script littered with errors

Downloading R and RStudio

Throughout this course, we will be almost exclusively using R to read and manipulate data, simulate data, fit models, etc. R is a programming language, so we will not access it directly. Instead, we will use RStudio, which is a Graphical User Interface (GUI) for working with R. R also has its own GUI, but it is not nearly as functional as RStudio, which is how the vast majority of R users interact with R. It is worth knowing that the R GUI exists, however, because sometimes people will mistakenly open it, do a lot of work, and then lose that work. That happens because the R GUI is only the console, which we will get to in a minute. First, we will install both R and RStudio.

First, install R. Follow the instructions for your operating system here: https://cran.r-project.org/
Install RStudio. Note that this needs to be done after installing R. Follow the instructions for your operating system here: https://posit.co/download/rstudio-desktop/

Once you have both R and RStudio installed, make sure you know how to open RStudio. Depending on your operating system and how you’ve set up your computer, it may add a shortcut icon to access the R GUI. Do not use it. It will pull up an older looking GUI which is only the console (which we are still getting to soon!). Always use RStudio to interact with R (…unless you’re running shell scripts or some other niche purposes, but if you’re doing that, you probably aren’t reading instructions on how to download and access R).

The logo on the *right* is RStudio, which is the one you want.

RStudio panes

When you open RStudio for the first time, the upper left is the Source pane. This is where you will open scripts, edit code, run analyses, type notes, etc. To use a cooking analogy, the Source pane is your recipe book. You can always come back to it, make little changes, or leave notes about what to fix for next time. But it is not the actual meal, just instructions for making a meal.

In the bottom left is the Console. The Console is the process of cooking. It can be done following the recipe by running chunks of code from the Source, or you can wing it. If you decide to ‘cook’ on the fly by running code directly from the Console, you will save no notes on what on earth you put into the meal, you cannot go back and fix things easily, and you cannot share your recipe with anyone else. You also might not remember the order in which you did things. So if what you did worked and your ‘meal’ comes out great, you will not know what you did to make that happen so it is not reproducible. The only real use for the console is for code that you only want to run once, e.g. foundational things like installing packages; to really overcook this analogy (ha), you only install a kitchen counter once, not at the start of every recipe. More on packages later.

In the upper right is the Environment (along with a few other tabs that you’ll use far less). This is the current version of your meal, i.e. if you’re making a stew, this is everything that is currently in the pot or mise en place and ready to go. It is objects (think of these as ingredients) that you have created with your cooking and are now available to either eat or combine with other objects (i.e., ingredients).

In the bottom right are your Files, Plots, and Help - all of which you will use frequently (and other tabs you will use far less often). Think of these as things that are available to you, but external to your current kitchen counter and stew. The Files tab will let you browse files on your computer; think of it as your kitchen where you can get more resources for your meal if you are in your current working directory (more on directories later), or your house if you browse for other files on your computer. Your Plots tab is the equivalent of taking a picture of the current version of your meal, either to be able to visualize what you’ve currently got, or to share with others. The Help tab is exactly what it sounds like - the person you call to ask how to use the stand mixer (i.e. the function) and what to put into it (i.e., the arguments passed to the function) to make the perfect waffle. It’s up to you to know that you want to use the stand mixer to make waffles in the first place.

Some terms and definitions

Since that analogy is now burnt to a crisp, let’s unpack some of the extra terms in there and what they are because you will use them frequently in this course.

Packages

Packages are collections of functions designed to work together to accomplish some specific outcome. Many packages are hosted on CRAN, but you can also find R packages on repositories like GitHub. One way to find packages is through CRAN Task Views (e.g. these are all packages associated with meta-analysis https://cran.r-project.org/web/views/MetaAnalysis.html), but more often you’ll just Google what you want to do and find a package that way.

Objects

Objects are what are in your environment. They can take lots of different forms, have different classes, etc. Most objects are created by using the assignment operator <- to pass the output of some code to a named object. We will talk about different types of objects more during the semester as we encounter them.
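
As a small sketch of what that looks like in practice (the object names and values here are made up for illustration), the assignment operator stores a value under a name, and each object then shows up in your Environment tab:

```r
# Assign the output of some code to named objects (names are illustrative)
n_samples <- 12
site_names <- c("north", "south")

# Each object has a class, which determines what you can do with it
class(n_samples)
class(site_names)

# ls() returns the names of the objects in your current environment
ls()
```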

Working directory

Your Files tab lets you see two things: files, and directories. Directories are the organizational structure for how you store data on your computer; you can think of them as folders for the most part, though folders are GUI ways to visualize directories and directories have a clear nested structure.

Your working directory is very important when coding. File paths are relative to your current working directory, so when you read files in you must know both what your current working directory is and where the file sits relative to it; most of the time if you get an error reading in a file, it is because the path to the file is incorrect. In the cooking analogy, you are cooking on the kitchen countertop, which is nested within the kitchen, which is nested within your house. You could move directories within the kitchen, such as moving to the sink, or you could move up several levels to go to the living room. If you try to call a file by name alone and it is not in your current working directory, you will get an error. For example, if you are in the living room and tell R to pick up your cutting board, it will say it does not exist.

Relative file paths are extremely useful in coding. Relative file paths begin in your current working directory. To load a file from your current directory, begin the file path with ./. To load a file from the directory one level above your working directory (i.e. if you’re working at the kitchen counter, the kitchen is the next hierarchical level above you), use ../. One way to remember the difference between one dot and two is that if the dots represent your feet, one shows where you are standing - but two means you’ve hopped somewhere new. Since directories are nested within each other, you can also combine these into longer relative paths. So, for example, if I am in the living room but I want R to go into the kitchen, then into the cabinet, and take out a cutting board, the path would be something like: ../kitchen/cabinet/cutting_board.txt.
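
To make the house analogy concrete, here is a minimal sketch that builds a throwaway ‘house’ inside R’s temporary directory and then reads a file using a relative path. The folder and file names are invented for this example, and it assumes you are allowed to write to tempdir(), which is normally the case.

```r
# Build a throwaway "house" inside R's temporary directory (illustrative names)
house <- file.path(tempdir(), "house")
dir.create(file.path(house, "kitchen", "cabinet"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(house, "living_room"), showWarnings = FALSE)
writeLines("one cutting board", file.path(house, "kitchen", "cabinet", "cutting_board.txt"))

# Stand in the living room; setwd() invisibly returns the previous directory
old_wd <- setwd(file.path(house, "living_room"))

# ./ is where you are standing; ../ is one level up (the house)
file.exists("./cutting_board.txt")                   # not in the living room
file.exists("../kitchen/cabinet/cutting_board.txt")  # reachable via the relative path
board <- readLines("../kitchen/cabinet/cutting_board.txt")

# Go back to where you started
setwd(old_wd)
```

The same relative path would fail from any other starting point, which is why knowing your working directory matters.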

Setting your working directory. There are two ways to set your working directory. One is the point-and-click way, where you can go to Session > Set Working Directory > Choose Directory, which will open up your normal file browser application so you can navigate to the directory you want to set as your current working directory. This is easier when you’re starting out and getting used to directory structure and how you have your files organized, but it is not reproducible, so it can cause headaches later on if you run a script thinking you’re in a different directory than you actually are. A better option is to use the function setwd() with an absolute path to a directory. An absolute path is one that spells out the full location from the top of your file system rather than from wherever you currently are (e.g. on Linux or Mac it starts with /, or with ~/ as a shortcut for your home directory; on Windows typically something like C:/). For example, I might run something like setwd("~/Desktop/BIOL431") at the start of my script, and then use relative paths throughout once I am in the directory where I have my files for analysis and where I want to save my output.
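
A minimal sketch of checking where you are before and after calling setwd(). Because the real path depends on your computer, this uses tempdir() as a stand-in for a course folder; course_dir is a name made up for the example.

```r
# Where am I right now?
start_dir <- getwd()

# Stand-in for a course folder; on your computer this might be "~/Desktop/BIOL431"
course_dir <- tempdir()

# Only move if the directory actually exists, to avoid a confusing error
if (dir.exists(course_dir)) {
  setwd(course_dir)
}
now_dir <- getwd()

# list.files() shows what R can see from here using relative paths
list.files()

# Return to the starting directory so nothing else is affected
setwd(start_dir)
```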

Functions

Functions take input as arguments, and return output. To see what arguments can be passed to a function, and also what its output will be, you can use the Help tab to search for a function. Or, much more quickly, use a ? followed by the function name if it is in a package that is currently loaded (e.g. ?rbinom). If it is in a package that is installed but not currently loaded, use ?? instead (e.g. ??glmer), which searches the help pages of all installed packages.
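
As a rough illustration of arguments going in and output coming out, the sketch below uses rbinom(), which ships with R in the stats package. The seed and argument values are arbitrary choices for the example.

```r
# args() prints a function's arguments and any default values
args(rbinom)

# Arguments can be matched by name, which makes the call easier to read
set.seed(1)  # arbitrary seed so the draws are repeatable
flips <- rbinom(n = 5, size = 1, prob = 0.5)

# The output is an object you can store and inspect
length(flips)
class(flips)

# help() is the long form of ?, handy inside scripts
# help("rbinom")
```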

Customizing RStudio

There are three modifications many people will want to make to RStudio:

Change the theme. Go to Tools > Global Options > Appearance. For example, you may prefer a dark theme if you’re coding frequently.
Rearrange the panes. There is a window-like icon to the left of ‘Addins’ in the tool bar; select the drop down and you can customize which panes are in which corners. For example, I prefer to move my Console to the right so I can see my code in parallel and make the Environment tab really small because I do not need to check it frequently.
Change the font and font size (also under Tools > Global Options > Appearance), which can be helpful depending on your screen size and eyesight.

Calculation

42/3

[1] 14

1.2*2

[1] 2.4

4 + 7

[1] 11

1 - 0.2

[1] 0.8

(11-2)/3

[1] 3
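
The last calculation relies on parentheses to control the order of operations. As a small sketch of that idea, plus a few other arithmetic operators you are likely to run into:

```r
# Parentheses change the order of operations
11 - 2 / 3      # division happens first
(11 - 2) / 3    # subtraction happens first

# A few other arithmetic operators
2^3       # exponent
7 %% 3    # remainder after division
7 %/% 3   # whole-number (integer) division
```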

Assigning objects

x <- 42
y <- 3
x/y

[1] 14

Types of objects

class("words")

[1] "character"

class(4.2)

[1] "numeric"

class(1:10)

[1] "integer"

class(TRUE)

[1] "logical"

class(x)

[1] "numeric"

class(factor(letters[1:5]))

[1] "factor"

a <- 1:10
b <- 11:20

dat <- data.frame(a, b)
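
Creating a data frame prints nothing, so it helps to inspect it afterwards. A minimal sketch using a few standard base R functions (the data frame is recreated so the example runs on its own):

```r
# Recreate the data frame from above so this runs on its own
a <- 1:10
b <- 11:20
dat <- data.frame(a, b)

class(dat)    # what kind of object is it?
dim(dat)      # number of rows, then number of columns
str(dat)      # compact summary of each column
head(dat, 3)  # first three rows

# Pull out one column with $, or one value with [row, column]
dat$b
dat[2, "a"]
```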

as.numeric("two")

Warning: NAs introduced by coercion

[1] NA

sqrt(-1)

Warning in sqrt(-1): NaNs produced

[1] NaN
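
NA (missing) and NaN (not a number) behave similarly but are not identical. The sketch below shows one way to test for each; suppressWarnings() is used only to keep the example quiet, and the temperature values are made up.

```r
# suppressWarnings() is used only to keep this example quiet
not_a_number <- suppressWarnings(as.numeric("two"))
undefined <- suppressWarnings(sqrt(-1))

is.na(not_a_number)   # TRUE: the value is missing
is.nan(not_a_number)  # FALSE: missing, but not the result of invalid math
is.nan(undefined)     # TRUE: the result of invalid math
is.na(undefined)      # TRUE: NaN also counts as missing

# Missing values spread through calculations unless you drop them
temps <- c(2.1, NA, 4.8)
mean(temps)
mean(temps, na.rm = TRUE)
```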

Functions

seq(1, 10, by=0.5)

 [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
[16]  8.5  9.0  9.5 10.0

seq(1, 10, length.out=30)

 [1]  1.000000  1.310345  1.620690  1.931034  2.241379  2.551724  2.862069
 [8]  3.172414  3.482759  3.793103  4.103448  4.413793  4.724138  5.034483
[15]  5.344828  5.655172  5.965517  6.275862  6.586207  6.896552  7.206897
[22]  7.517241  7.827586  8.137931  8.448276  8.758621  9.068966  9.379310
[29]  9.689655 10.000000
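
The two calls above control the sequence in different ways: one fixes the step size and the other fixes how many values you get back. R accepts abbreviated argument names such as l for length.out through partial matching, though spelling the name out is easier to read. A short sketch with every argument named:

```r
by_half <- seq(from = 1, to = 10, by = 0.5)         # fix the step size
thirty  <- seq(from = 1, to = 10, length.out = 30)  # fix the number of values

length(by_half)
length(thirty)

# With length.out, the step works out to (to - from) / (length.out - 1)
diff(thirty)[1]
(10 - 1) / (30 - 1)
```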

Installing packages

To install a package from CRAN, use the install.packages() function with the name of the package in quotes inside the parentheses. For example, install.packages("lme4") installs the lme4 package, which is useful for fitting mixed-effects models. To install packages from other sources, follow the package developer’s instructions.

# install.packages("lme4")
# install.packages("remotes")
# remotes::install_github("/MetBrewer")
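
Since installing only needs to happen once, one common pattern is to check whether a package is already installed before trying to load it. This is a sketch rather than a required workflow; it uses lme4 as the example package and never installs anything itself.

```r
pkg <- "lme4"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
if (requireNamespace(pkg, quietly = TRUE)) {
  library(pkg, character.only = TRUE)  # load it for this session
} else {
  message("'", pkg, "' is not installed yet. Run install.packages(\"", pkg,
          "\") once in the Console.")
}
```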

Finding help

?seq
??plot

Basic coding principles

Variable names

It is important to choose good variable names when coding. This is one of the many, many times where you should do as I say, not as I do, to quote my mom. I have terrible variable naming conventions. Please do better than me.

Pick something informative. If your object is a vector of temperatures for the month of January, probably don’t call it var1, for example. Call it something like Jtemp or JanT, or anything else that will remind you what it is.

When selecting a variable name, go for consistency. Choose a formatting and naming convention that makes sense to you so that you don’t have to repeatedly look at your environment to see what you named something. You’ll also be able to code more quickly if you know what your variable name most likely is rather than having to look it up.

Pick something easy to type. In general, it is good to avoid capitalization, especially when you use a combination of capitalization and not. Why? Because then we forget which letter(s) are capitalized. In the January temperature example, JanT is actually a pretty bad name because we have to remember we capitalized the J and the T. A better name might be something like jan if we’ve got a lot of months of temperature data, or if everything is from the same month, then just temp.

For similar reasons, I think it’s a good idea to mostly omit punctuation from a variable name. Some people will do stuff like jan_temp or jan.temp but then I at least can’t remember if I used an underscore or a period. If you go the punctuation route, be consistent about what you use.

You can use numbers in variable names, but only if they are not the start of the object name. 2b is not a valid name and R will yell at you for it, but not2b is valid. So if the question is if 2b or not2b is an allowed object name in R, the answer is not2b.

Lastly, do not use function names as object names. The absolute worst thing you could name an object that contains a plot, for example, is plot. Why? Because that’s the function that you use to generate a plot, i.e. plot(), and now the same name refers to both your object and that very basic function, which makes your code confusing to read and errors hard to track down.
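
If you are unsure whether a name is allowed, one quick check is base R’s make.names(), which converts a string into a valid name; a string that comes back unchanged was already valid. The candidate names below are just examples.

```r
# make.names() converts a string into a valid R name; a string that comes
# back unchanged was already valid
candidates <- c("2b", "not2b", "jan_temp", "jan.temp", "jan temp")
is_valid <- make.names(candidates) == candidates
data.frame(candidates, is_valid)
```

Names that start with a number or contain a space come back altered, so they are flagged as not valid.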

Overwriting objects

Once you create an object, it is typically best to leave it alone. You might overwrite the object for a few lines of code as you’re getting it set up, but don’t overwrite the object later in the script. You will sometimes forget you did that and then think you’re working with a different object that has a different format or structure and it can lead to a lot of confusion especially if you run lines of your script out of order.
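
A small sketch of the difference, using made-up temperatures: instead of converting an object in place, store the converted values under a new name so the original still means what its name says.

```r
# Illustrative data: daily temperatures in Fahrenheit
temp_f <- c(30.2, 28.4, 35.6)

# Risky: overwriting means temp_f would hold Celsius despite its name
# temp_f <- (temp_f - 32) * 5 / 9

# Safer: keep the original and store the converted values under a new name
temp_c <- (temp_f - 32) * 5 / 9

temp_f
temp_c
```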

Annotation

Do your future self a favor and annotate your code. You can use the pound symbol # at the start of a line to add a comment or note. You can also add it at the end of a line. Anything following # will not be evaluated (i.e., run) by R.

You should take notes on what you tried, what worked, what didn’t work, where you looked for information, why you ran things the way you did, what a particular line of code or function does, etc. That way, when you set down a script and come back to it six months from now, you don’t have to re-learn everything, because past you left helpful notes.
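
For instance, an annotated snippet might look something like this (the data and the notes are invented for illustration):

```r
# Goal: estimate mean January temperature (made-up values for illustration)
jan <- c(-3.1, 0.4, 2.2, NA, 1.5)  # NA = sensor failed that day

# Tried mean(jan) first; it returned NA because of the missing day
jan_mean <- mean(jan, na.rm = TRUE)  # na.rm = TRUE drops missing values
jan_mean
```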

Cleanliness

Try to keep your coding environment nice and neat. What I mean by this is to tidy up now and then, remove objects you aren’t currently using, don’t display plots that aren’t part of your current project, and also to have nice, neat, readable scripts.

I will often use rm(list=ls()) at the start of a coding session to remove all objects from my environment. I also like to run dev.off() every now and then to kill the plotting window.

rm(list=ls())
dev.off()

null device 
          1
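
Two gentler variations, sketched under the assumption that you only want to tidy up part of your session: rm() can remove just the objects you name, and checking dev.list() first avoids the error that dev.off() gives when no plotting window is open. The object names are throwaway examples.

```r
# Create two throwaway objects (illustrative names)
scratch_a <- 1:3
scratch_b <- "temporary"

# Remove only the objects you name, leaving everything else in place
rm(scratch_a, scratch_b)
exists("scratch_a")

# dev.off() errors if no plot window is open, so check first
if (!is.null(dev.list())) {
  dev.off()
}
```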

For coding, try to keep your lines of code fairly narrow and not let them extend so far right that you have to scroll to read them. A good rule of thumb is to keep your lines of code under 80 characters. You can even set a little margin for yourself in RStudio as a visual reminder of this. Go to Tools > Global Options > Code > Display and then tick the box to show a margin and how many characters you want it to be. This helps with reading code quickly.

If you feel like your code is getting messy, RStudio has a built-in feature to clean it up. The keyboard shortcut Ctrl + Shift + A will clean up the formatting and spacing for the selected code, though only within reason. If you only need to fix the indentation, Ctrl + I will re-indent the selected lines.

It can be really useful to structure your code as an outline, because then you can quickly navigate through it. To insert a new section, start a line with a # and then end it with four or more dashes (----) and it will be a section header. To add a subsection within that, use two ## in a row, and a subsubsection is ###, and so on. You can navigate through these sections by clicking the ‘outline’ tab in the upper right corner of your script pane.
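
As a sketch, a script organized this way might look like the following. The section titles and values are invented; the trailing dashes are what RStudio picks up as headers.

```r
# Load data ----
temps <- c(2.1, 3.5, 4.8)  # made-up values for illustration

## Check data ----
stopifnot(is.numeric(temps))

# Analysis ----

## Summary statistics ----
temp_mean <- mean(temps)

### Rounded for reporting ----
temp_mean_rounded <- round(temp_mean, 1)
```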

Keyboard shortcuts

There are a lot of keyboard shortcuts in RStudio, which you can find under Tools > Keyboard Shortcuts Help. The ones I use most frequently are Ctrl + A to select all, Ctrl + Alt + B to run all lines of code above my current position, Ctrl + Shift + N for a new script, and Ctrl + L to clear my console.

Do this not that

Instead of hardcoding, always aim for softcoding. What I mean by that is to assign values you intend to use to an object and reference that object, rather than manually plugging in numbers. It will make your life so much easier and make it way faster to reuse code in the future if you softcode things. As an example of this, let’s simulate a bit of data. We can force R to use the same starting point for a random number generator (essentially) with set.seed().

# hard coding
set.seed(42) # the answer to life, the universe, and everything
rnorm(n = 1)

[1] 1.370958

rpois(n = 1, lambda = 3)

[1] 2

1.3709584 * 2

[1] 2.741917

# soft coding
set.seed(42)
x <- rnorm(n = 1)
y <- rpois(n = 1, lambda = 3)
x*y

[1] 2.741917
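
Taking the softcoding idea one step further, here is a sketch where every setting lives in a named object at the top, so changing one value updates everything below it. The object names and values are arbitrary choices for the example.

```r
# Settings live in one place; change them here and everything below updates
seed_value <- 42
n_draws <- 5
rate <- 3

set.seed(seed_value)
first_run <- rpois(n = n_draws, lambda = rate)

# Resetting the seed replays the same "random" draws
set.seed(seed_value)
second_run <- rpois(n = n_draws, lambda = rate)

identical(first_run, second_run)
```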

For similar reasons, try to always use relative, not absolute, paths when referring to files on your computer. This helps us maintain good working directory structures which are neat and tidy, and not a chaotic mess (…again, do as I say, not as I do, and don’t judge my poor file organization on my computer).

Base R and the tidyverse

There are two main coding camps when it comes to R: base R, and the tidyverse. Typically, people end up coding in one of these two styles, and the style you choose is what you stick with. In some ways, people stick with the way they first learn, but also I’m convinced some people’s brains are just more suited to one or the other. I code almost exclusively in base R because I like to see every single step laid out and I need to be able to have the concrete changes visualized in front of me so I can see what happens when I change something. In my much belabored cooking analogy, I am tasting that soup each and every single time I add something to it because I cannot process more than one thing at a time and I can’t picture what would happen if I added salt and pepper at the same time. I want to add the salt first, taste the soup, then decide how much pepper to add. Tidyverse people, on the other hand, tend to think in pipes which means they pass the output of one operation right on to the next one sort of like an assembly line. If they are making soup, they say take this soup, add some salt, add some pepper, then give me that to taste. I’m not quite as much of a hater as the guy who wrote this blog post about why students should not start with tidyverse but I do tend to think people who code in base R have a better understanding of what they are doing, can troubleshoot things more quickly and more easily, and have more potential to become much more adept at programming and teaching themselves new skills.
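
To give a rough feel for the two styles without installing anything, the sketch below does the same calculation step by step and then as a pipeline. It uses base R’s native pipe |> (which assumes R 4.1 or later) as a stand-in for the tidyverse pipe %>%; the temperatures are made up.

```r
temps <- c(2.1, 3.5, NA, 4.8)  # made-up values for illustration

# Step by step: taste the soup after every addition
temps_complete <- temps[!is.na(temps)]
temps_mean <- mean(temps_complete)
temps_rounded <- round(temps_mean, 1)

# Piped: pass each result straight on to the next step
piped <- temps |> na.omit() |> mean() |> round(1)

temps_rounded
piped
```

Both versions arrive at the same number; the difference is whether the intermediate objects are kept around for you to look at.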

Assignment

Assignment 1: A Badly Broken Code

Optional: listen to Dessa while coding