Introduction to R

Table of Contents

1 Getting Started

The full R code used in this tutorial can be downloaded here (you might need to right-click and select "Save As…" so that it doesn't open in your browser).

The Comprehensive R Archive Network is where the R software and add-on packages reside. Follow the instructions there to download the software you need for your particular operating system.

1.1 Getting Help

1.1.1 R documentation

You can type ?functionname at the command prompt to get help on the function functionname. For example, to get help on the help() function, type ?help.

The help.start() function will open up the R documentation in a web browser.

1.1.2 On-line help

The place to begin is http://www.r-project.org. There is much that can be found online at the Comprehensive R Archive Network. A good place to start is the tutorial section. For full documentation, see the online manual.

If you are having difficulty finding what you need, R Seek is a good internet search engine devoted specifically to R. The fact that nearly every page on the internet that is written in English contains the letter "R" somewhere makes it hard to find what you need from Google!

2 The R Session Environment

2.1 Starting up and closing down

2.1.1 Batch sessions

The most common way you will interact with R is in interactive sessions. However, there is another way that is often useful that we will touch upon briefly first.

R is essentially a programming language, and you will typically be writing scripts that you will store in text files (conventionally with an extension of .R or .r) and you might want to just execute a saved script without going through an interactive session. At the command line of your operating system, type

R --save < filename.R

This will send the script in the file filename.R into the R interpreter. The --save option will save the environment (into the .RData and .RHistory files) when finished. If you don't need to save the environment, use the --no-save option.

2.1.2 Interactive sessions

Starting up is easy enough. One thing to note is that the directory in which you start R plays a special role. The R interpreter stores two "hidden" files in this directory, that form a record of the history of commands input during the session (.Rhistory) as well as an image of the data objects in the session (.RData). If you are on a "unix-alike" operating system (Mac OS, Linux), you can open up a terminal window and use the cd command to change to the directory you want to use (or create it with mkdir) before starting R.

To shut down, you type q() at the command line, for "quit". R will ask:

Save workspace image? [y/n/c]: 

Type 'y' to save the image, 'n' to exit without saving, and 'c' to undo the quit operation and return to the prompt.

2.1.3 The session "environment"

In an R session, you will create and manipulate various data objects, such as data frames, vectors, matrices, etc. You might also create your own functions. All of these items reside in memory, within the session "environment".

To see what's in your environment, use the ls() command. It returns a vector of type "character" with the names of the objects in the environment.

ls()
character(0)

Hmm, character(0) means a character vector of length 0, which means there's nothing there. Guess we might actually need to do something first.

We are going to create new variables x and y and assign to them values that are the result of the rnorm() and runif() functions. The '<-' syntax is something specific to R for assigning values to variables. Following this syntax, x <- 10 means "give the variable named 'x' the value of '10'. The reason why we don't just use the plain old equals sign to do this is that equals is ambiguous between assignment and testing for equality. We still use the equals sign when assigning values to arguments in function calls, and that seems inconsistent to me, but there's probably some deeper reason for it that I don't see. Note that functions are called by name (e.g., myfunction()), followed by parentheses that optinally contain arguments (more on functions later).

You can see both '<-' and '=' in operation below.

x <- rnorm(10, sd=2)  # create vector named 'x', drawing from
                      # a normal distribution with standard deviation of 2
y <- runif(10, min=0, max=1)  # create vector named 'y', drawing from
                # a uniform distribution from 0 to 1

print(x)
print(y)
 [1] -1.67539296 -1.02796192  0.23490288  1.13407282  1.63891794  0.68877181
 [7]  0.02430912 -0.59720188 -0.65095750  0.11747846
 [1] 0.52641168 0.55102570 0.02459610 0.17269570 0.07816773 0.41788894
 [7] 0.18890956 0.40645372 0.39614057 0.84168878

Note that we had to print the variables x and y to see what they contain. We could also just type their names at the command line, and that automatically will print them.

x
 [1] -1.67539296 -1.02796192  0.23490288  1.13407282  1.63891794  0.68877181
 [7]  0.02430912 -0.59720188 -0.65095750  0.11747846

OK, earlier I promised to explain the ls() function, so let's see what the environment contains now:

ls()
[1] "spelling.A" "spelling.B" "x"          "y"

Aargh! now my environment is all cluttered by x and y! Begone, x and y!

rm(x, y)

ls()
[1] "spelling.A" "spelling.B"

We used the rm() command to do that. Sometimes you might want to get rid of everything in the environment, but it would be a pain to type in all the names. To zap them all away in one step, we feed the results of the ls() command to the list argument of rm(), like so:

rm(list=ls())  # tidy it all up!

The only other way to do this is to exit and restart R, so this command often comes in handy.

2.2 Typing in commands

You have two choices for typing in commands: (1) type them directly into the command interpreter, or (2) type them into a text file and then get them into the interpreter in some other way. Unless I just need to compute something really fast, I usually use the second method, because that leaves me with a 'script' of commands that I can then easily use in the future, or run in batch mode. If I just type them into the interpreter, they will be saved in .Rhistory, but so will any mistakes that I type in as well.

My workflow with R involves the text editor emacs and org mode coupled with the emacs plug-in ESS (Emacs Speaks Statistics). This means that my day to day work life looks something like this:

terminaljunk.png

This allows me to type in single lines of R code, and send them to the interpreter line by line or section by section. The R snippets are actually woven into a file that can be exported to HTML or LaTeX. In fact, this tutorial was created using this setup. A similar setup is Sweave which allows R code to be integrated within a LaTeX document.

Working with R through emacs, org-mode, and ESS is in many ways ideal (not to mention, "retro cool"). But unless you are a true nerd and terminal junky like me, you will find such a setup abhorrent, and might want to find better things to do with your cortex than to cram it with various emacs keystroke combinations. For those who want the full-experience of a graphical interface but still want the option of line-by-line command processing, try R commander or R studio. For Ubuntu Linux users, I would also mention Dan Dediu's great Rgedit, a plug in for the text editor gedit.

2.2.1 Use of "whitespace" in commands

To improve readability, sometimes it is useful to break up a single long command over multiple lines. R will not try to interpret the line until the syntax is complete (or it encounters a syntax error).

When you break up a command over multiple lines (and are typing the command in the interpreter) the prompt will change from > to + to let you know that the command continues. For example:

> x <- 3 + 5 + 7/3 * 999 -
+ 88 + 22 + 5
>

Because the command on the first line (ending with the minus sign) is not a complete expression, R "knows" that I intended to continue the command on the second line. Because second line makes it a full expression, the command is evaluated and the > prompt returns.

2.2.2 Interrupting a long-running process

Sometimes you might have a mistake in your code that creates an infinite loop, or that initiates a very long computation. The program will lock until the operation completes. Ctrl-C is your friend; it will allow you to interrupt the process and return to the command prompt.

Another situation where Ctrl-C comes in handy is when you have typed in erroneous syntax, and your command prompt has turned into a '+' rather than an '>'. Rather than trying to diagnose what went wrong, just hit Ctrl-C a couple of times to get back to normal.

2.3 Calling functions

Functions perform some action on data objects. Let's illustrate with a very simple example, using the functions rnorm() and mean().

x <- rnorm(100) # create 100 random normal variates
mean(x)  # calculate the mean  
[1] -0.1969908

The object named x was set to the return value of the function rnorm(), which in this case, was 100 variates from the normal distribution.

Functions such as rnorm() usually accept arguments, that modify what they do. To see the accepted arguments for rnorm() type ?rnorm at the command prompt. The help page shows something like this:

Normal                  package:stats                  R Documentation

The Normal Distribution

Description:

     Density, distribution function, quantile function and random
     generation for the normal distribution with mean equal to ‘mean’
     and standard deviation equal to ‘sd’.

Usage:

     dnorm(x, mean = 0, sd = 1, log = FALSE)
     pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
     qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
     rnorm(n, mean = 0, sd = 1)
     
Arguments:

     x,q: vector of quantiles.

       p: vector of probabilities.

       n: number of observations. If ‘length(n) > 1’, the length is
          taken to be the number required.

    mean: vector of means.

      sd: vector of standard deviations.

log, log.p: logical; if TRUE, probabilities p are given as log(p).

lower.tail: logical; if TRUE (default), probabilities are P[X <= x]
          otherwise, P[X > x].

This documentation presents descriptions of four functions associated with the normal distribution, but the one that we are primarily interested in is the random generation function, rnorm(n, mean=0, sd=1). This tells us that rnorm takes three arguments: n, mean, and sd. The documentation page also tells us that n is the "number of observations", mean is the mean, and sd is the standard deviation.

When we typed rnorm(100), we failed to specify the mean and standard deviation, but the function worked anyway. This is because the mean and sd arguments have defaults of 0 and 1, respectively. In contrast, the n argument has no default, and thus will fail if we do not include it.

> rnorm()
Error in .Internal(rnorm(n, mean, sd)) : 'n' is missing

The help page tells us (implicitly) that n is an obligatory argument and that mean and sd are optional by specifying the default values for mean and sd, but not for n, in the function description rnorm(n, mean=0, sd=1).

Note that when we type rnorm(100), the interpreter knows that 100 is intended for the argument n, because it matches the position of n (first) in the function description. Because mean and sd are not present in the function call, the interpreter uses the default values. We could also specify the mean of the distribution like so:

x <- rnorm(100, 50)
mean(x)
[1] 50.04525

An equivalent way of doing this would have been with named arguments, e.g., rnorm(n=100, mean=50). Without the names for the arguments, the interpreter uses the position to assign the values. When the names are present, the interpreter ignores the position and uses the names. This is useful because we often want to use an argument but may not remember what position it was supposed to be in. For example, say we want to use the default of 0 for the mean, but adjust the standard deviation. All we need to do is this:

x <- rnorm(100, sd=100)
mean(x)
[1] -10.53998

Even though mean should have been the second argument, we can put sd in as a named argument and everything will be alright because the position will be ignored. This may not seem to buy us much for the case of rnorm, which only has three arguments, but it becomes critical when a function has many arguments only some of which we want to alter (see the function par() for example).

2.3.1 Creating your own functions

Sometimes you will want to create your own function, especially if it is an operation that you are repeating over and over throughout a script. One of the principles of good programming is to never repeat yourself. Each time we retype the same code presents a new chance to introduce a bug; furthermore, if we need to change the function later, we would have to do it everywhere it appears in the script. It also makes the code more verbose than necessary and less readable. A better way is to write a function once and then call it whenever it is needed.

Let's say we needed a function that would add A to B and then divide it by C. We define a new function AplusBoverC thus:

AplusBoverC <- function(A, B, C) {
  return( (A+B) / C )
}

AplusBoverC(1, 3, 2)
[1] 2

We created a new object AplusBoverC that took three arguments, A, B, C. The function body is enclosed within curly brackets {}, and ends with a return value, the value to be passed back to the calling process.

We call our function in the way we call any function (such as rnorm). See ?function and the R documentation on functions for more information.

2.4 Data input/output

2.4.1 Importing data

The data we will use in this section can be downloaded in any of the three formats below. The righthand column of the table tells you what R function to use to bring the data from the file into your environment.

FormatR function
Raw text (.txt)read.table
Comma-separated values (.csv)read.csv
R binary format (.RData)load
  • Importing Raw Text

    Use the read.table function. The output of the function will be a data.frame object. Note that the text file we will be reading looks like this:

    "SubjID" "Method" "Y"
    1 1 124
    2 1 129
    3 1 115
    4 1 112
    5 2 101
    6 2 88
    7 2 107
    8 2 92
    9 3 76
    10 3 91
    11 3 84
    12 3 81
    

    Note that the first row of the file contains the name of the corresponding column, enclosed in double quotes. The columns in each row are delineated by spaces. By default, read.table interprets any kind of white space (spaces or tabs) as a delineator. We want the first line of the file to be interpreted as column names, so we must use the argument header=TRUE.

    spelling.A <- read.table(file="spelling.txt", header=TRUE)
    print(spelling.A)
    
       SubjID Method   Y
    1       1      1 124
    2       2      1 129
    3       3      1 115
    4       4      1 112
    5       5      2 101
    6       6      2  88
    7       7      2 107
    8       8      2  92
    9       9      3  76
    10     10      3  91
    11     11      3  84
    12     12      3  81
    
  • Importing a CSV File

    If you are familiar with Microsoft Excel, it is sometimes convenient to save the data in .CSV format, which can be easily imported into R.

    Here is the file that we will be reading in:

    "SubjID","Method","Y"
    1,1,124
    2,1,129
    3,1,115
    4,1,112
    5,2,101
    6,2,88
    7,2,107
    8,2,92
    9,3,76
    10,3,91
    11,3,84
    12,3,81
    

    Note that the read.csv function has a similar syntax to the read.table function.

    spelling.A <- read.csv(file="spelling.csv", header=TRUE)
    print(spelling.A)
    
       SubjID Method   Y
    1       1      1 124
    2       2      1 129
    3       3      1 115
    4       4      1 112
    5       5      2 101
    6       6      2  88
    7       7      2 107
    8       8      2  92
    9       9      3  76
    10     10      3  91
    11     11      3  84
    12     12      3  81
    
  • Importing Native R Binary files

    This is accomplished using the load function. Unlike text or CSV files, files in the binary format (.RData) can have multiple data.frames or other objects inside of them. After using load, we will use the function ls to list the objects that exist in our environment.

    # what is in the environment before loading
    ls()
    
    load(file="spelling.RData")
    
    ls()  # list what is there after
    print(spelling.A)
    
    character(0)
    [1] "spelling.A" "spelling.B"
       SubjID Method   Y
    1       1      1 124
    2       2      1 129
    3       3      1 115
    4       4      1 112
    5       5      2 101
    6       6      2  88
    7       7      2 107
    8       8      2  92
    9       9      3  76
    10     10      3  91
    11     11      3  84
    12     12      3  81
    
  • Other options

    Note that it is also possible to download data from a MySQL database server into your environment using the RMySQL package (or other databases as well, such as sqlite using package RSQLite, or a Windows ODBC connection using RODBC). None of these packages is available by default, but must be installed from CRAN.

2.4.2 Saving objects

Often you will wish to save objects that you have created to external files, whether for further processing using different software or so that you can import them back into R again in a later session.

  • Saving a text file

    For this, we use the write.table function. Note that we are setting the row.names option to FALSE. Without this, R will create a new column with the name of the rows in your data frame (which is often unnecessary, unless the row names have been set to something meaningful). Note that the column names will appear as the first line in the file (surrounded by quotations).

    write.table(spelling.A, file="spelling.txt", row.names=FALSE)
    
  • Saving a CSV file

    This is just like the write.table command.

    write.csv(spelling.A, file="spelling.csv", row.names=FALSE)
    
  • Saving an R binary file

    The advantages of the native R binary (.RData) format is that the files are compressed, and it is furthermore possible to store multiple objects in a single file. However, the disadvantage is that the data can only be read by R, but not by other programs (e.g., Excel).

    To save in R binary format, use the save function. Enter the names of the objects to save as the first arguments to the function, and then use the file argument to specify the filename.

    save(spelling.A, spelling.B, file="spelling.RData")
    

2.5 Add-on packages

One of the great things about R is that it is open-source community-developed software. That means that anyone can create a "package" of functions and data for any purpose, and make their results freely available to other R users. It also means that the source code of the package is freely available so that you can see what's going on under the hood. After packages have been vetted to make sure they are bug-free and install properly, they are made available on CRAN. Some packages that are still under development (like my gmpm package for doing permutation tests) are available on R forge. You can install these as well using the typical way (the install.packages argument, though you'll have to enter the R-forge address for the repos argument).

CRAN Task Views shows packages sorted by theme. There are some amazing things out there, including packages for analyzing fMRI or eeg data, natural language processing, latent semantic analysis; not to mention a Sudoku puzzle solver, and even a twitter client!

To see how add-on packages work, let's install the fortunes package, a collection of wit and wisdom from the R community. It's useless, except for a laugh, but shows you how to install and load packages! You can install the package from the command line using the install.packages() command.

install.packages("fortunes")

OK, we've installed it! Note that we never have to do this step again; the package now has been downloaded onto our system. If that command failed, it probably means that you don't have the proper permissions for installing software on your system or you are behind a firewall and your proxy is not set up properly (in either case, go bother your sysadmin).

The fortunes package has one function, namely, fortune(). Let's try it and see what happens.

> fortunes()
Error: could not find function "fortune"

Huh? What happened? Oh yeah, we need to load the package first using the library command. You only need to do this once per session, whenever you want to use the package. This makes the functions and datasets in the package available to your session.

library(fortunes)

fortune()
If 'fools rush in where angels fear to tread', then Bayesians 'jump' in where
frequentists fear to 'step'.
   -- Charles C. Berry (about Bayesian model selection as an alternative to
      stepwise regression)
     R-help (November 2009)

Brilliant!! (your results may differ :) )

3 Basic Objects and Operations

3.1 Naming objects

As already mentioned, working with R involves creating and manipulating named data objects. You cannot give your objects any old names; obviously, to avoid ambiguity, names should not include elements of the syntax or whitespace. Often people use dots in the names (e.g., x.dat, x.lm, x.rand) to indicate that the objects somehow go together (are associated with the data object x), although this convention is meaningless to the interpreter itself.

In this section, we'll go through the most common R object types (there are more) and discuss how they are used and manipulated.

3.2 Vector

The most basic variables in R are vectors. As in mathematics, a vector is an ordered set of elements. Vectors in R have a length and a mode. The length is simply the number of elements, and the mode is the type of data contained in the vector.

3.2.1 Initializing

x <- 1
length(x)
mode(x)
[1] 1
[1] "numeric"

Here we see that the variable x is a numeric vector with length of 1 and mode "numeric". There are a number of useful techniques for creating sequences of numbers that save typing, including the colon : operator and the functions seq and rep.

1:3
3:1
c(1:3, 5:7, 9:11)
rep(1:3, times=3)
rep(1:3, each=3)
seq(1, 27, by=3)
[1] 1 2 3
[1] 3 2 1
[1]  1  2  3  5  6  7  9 10 11
[1] 1 2 3 1 2 3 1 2 3
[1] 1 1 1 2 2 2 3 3 3
[1]  1  4  7 10 13 16 19 22 25

Note the above syntax c() for defining a vector with multiple elements, where the elements are separated by commas.

We can also define "logical" vectors, where the elements are the values TRUE and FALSE. Note that we can also use rep with any mode of vector, including logical.

x <- c(TRUE, FALSE, TRUE)
mode(x)
rep(c(TRUE, FALSE), each=5)
[1] "logical"
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

We can name the elements of our vector by entering them explicitly when we create the vector, or by using the names function.

# put in names when creating vector
weather <- c(rain=TRUE, snow=FALSE, sleet=FALSE)
print(weather)
names(weather)
names(weather) <- c("sunny", "windy", "humid")
print(weather)
 rain  snow sleet 
 TRUE FALSE FALSE
[1] "rain"  "snow"  "sleet"
 sunny windy humid 
 TRUE FALSE FALSE

One can also have character vectors, where each element is a string.

actors <- c("Al Pacino", "Robert De Niro", "Alec Baldwin", "Jack Lemmon", "Ed Harris", 
            "Alan Arkin")
print(actors)
[1] "Al Pacino"      "Robert De Niro" "Alec Baldwin"   "Jack Lemmon"   
[5] "Ed Harris"      "Alan Arkin"

3.2.2 Accessing elements of a vector

There are three ways to access elements in a vector: using a numerical index, a name index, or a logical index. Each of these indices is, conveniently, also a vector, making it possible to select multiple elements simultaneously. The index is enclosed within brackets [].

actors[2]
actors[c(3,5)]
actors[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
weather["sunny"]
weather[c(1,3)]
[1] "Robert De Niro"
[1] "Alec Baldwin" "Ed Harris"
[1] "Al Pacino"    "Alec Baldwin" "Ed Harris"    "Alan Arkin"
sunny 
 TRUE
sunny humid 
 TRUE FALSE

3.2.3 Operations on vectors

One aspect of R that is confusing at first but is ultimately something that makes it very powerful is that it does something called implicit replication of arguments in order to apply certain operators, when the vectors the operator is supposed to apply to have different lengths. This is best explained by example.

Suppose we have a vector of integers from 1 to 6:

x <- 1:6
x
[1] 1 2 3 4 5 6

And we want to add 2 to each of the elements of the vector. In programming langauge such as C, the way this would be done would be using a for loop. In R, it's much simpler:

print(x + 2)
[1] 3 4 5 6 7 8

What happened here? The interpreter encountered an instruction to add a vector of length 1 to a vector of length 6. This is not possible, so what it did (implicitly) was to replicate the shorter vector to match the size of the larger vector, like this:

xafterRESULT
replication
1+23
2+24
3+25
4+26
5+27
6+28

So, what would happen if the value we added to x was had length 2 instead of 1?

print(x + c(1,3))
[1] 2 5 4 7 6 9

What happened here was that the vector <1,3> of length 2 was replicated 3 times to match the length of x (6 elements), like so:

after
xreplicationRESULT
112
235
314
436
517
639

Obviously, this implicit replication only makes sense if the length of the smaller vector is a factor of the length of the larger vector. When it's not, R does the replication anyway, but throws up a warning.

print(x + c(1.1,2.2,3.3,4.4))
[1] 2.1 4.2 6.3 8.4 6.1 8.2
Warning message:
In x + c(1.1, 2.2, 3.3, 4.4) :
  longer object length is not a multiple of shorter object length

The same goes for subtraction, multiplication, and division. We can also take a vector to a power using the syntax x^y, and the interpreter will do this for each element:

print(x^2)
[1]  1  4  9 16 25 36

One thing that is sometimes confusing is that certain functions, such as log, return a vector of length n when given a vector of length n, with each element transformed, whereas other functions, such as mean, perform aggregation functions over all elements, and return a scalar. Usually it will be obvious whether a certain function performs aggregation or operations on single elements.

log(x)
mean(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595
[1] 3.5

Sometimes, you might want to take the dot product of two vectors (multiply pairs of corresponding elements in the two vectors, then add the sums). This is accomplished with the %*% operator. The object that is returned is a matrix (with 1 row and 1 column in the case of vector dot products).

c(1,2,3) %*% c(5,10,30)
     [,1]
[1,]  115
  • Combining vectors

    Combining vectors is easy; we just use the concatenate (c()) function.

    x <- c(1,2,3)
    y <- c(4,5,6)
    xy <- c(x, y)
    print(xy)
    
    [1] 1 2 3 4 5 6
    
  • Sampling elements

    Sometimes we just want to randomly sample elements from the vector (e.g., for bootstrapping). For this, the sample function comes in handy.

    x <- 1:4
    sample(x)   # return a random permutation (sample without replacement)
    
    [1] 4 2 1 3
    
    sample(x, 3, replace=FALSE)  # sample 3 without replacement
    sample(x, 10, replace=TRUE)  # sample 10 with replacement
    sample(c(0,1), 30, replace=TRUE, prob=c(.3,.7)) # probs not uniform
    
    [1] 2 4 3
     [1] 3 4 4 1 2 4 3 2 1 3
     [1] 0 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 0 1 1 0 1 1
    
  • Set operations

    Sometimes we might want to know, what elements do vectors x and y have in common? What elements does x have that y does not? Here are some examples of set operations.

    x <- 1:5
    y <- 3:7
    intersect(x, y)
    
    [1] 3 4 5
    
    union(x, y)
    
    [1] 1 2 3 4 5 6 7
    
    setdiff(x, y)  # which elements does x have that y lacks?
    
    [1] 1 2
    

    The %in% operator returns TRUE or FALSE depending on whether the left hand argument is an element in the right-hand argument.

    x <- c("Jan", "Feb", "Mar")
    "Feb" %in% x
    "Apr" %in% x
    
    [1] TRUE
    [1] FALSE
    
    x <- c(1, 1, 3, 3, 5, 5, 7, 7)
    unique(x)
    
    [1] 1 3 5 7
    

3.3 Matrix

A matrix is a vector with an "extra" dimension; it can be thought of a "vector of vectors." They can be created using the matrix command. The command takes as its first argument a vector. One can then specify the number of rows (and/or columns) in the matrix, and R automatically will fill them in. The values can filled in by column (default) or by row (by setting the argument byrow to TRUE). Note that you only need to give R the number of rows OR the number of columns, but not both; the function is smart enough to figure out the size of the missing dimension.

3.3.1 Initialization

mx.bycol <- matrix(1:12, nrow=4)
print(mx.bycol)
mx.byrow <- matrix(1:12, nrow=4, byrow=TRUE)
print(mx.byrow)
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12

3.3.2 Access

You can access elements of a matrix using index vectors, similar to how one does this with simple vectors. However, in this case, one needs two indices, one for the row, and one for the column. If no index is specified for a particular dimension, then R returns all of the values for that dimension.

mx.bycol[1,3]  # row 1, col 3
mx.bycol[1,c(1,3)] # row 1, cols 1 and 3
mx.bycol[2:4,] # rows 2-4, all columns
mx.bycol[,c(1,3)] # all rows, cols 1 & 3
mx.bycol[,] # return everything
[1] 9
[1] 1 9
     [,1] [,2] [,3]
[1,]    2    6   10
[2,]    3    7   11
[3,]    4    8   12
     [,1] [,2]
[1,]    1    9
[2,]    2   10
[3,]    3   11
[4,]    4   12
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

3.3.3 Operations

Sometimes we want to perform the same operation on every element of a matrix; for example, we might want to multiply each element by two. Just as we saw for vectors, there is a shorthand way of doing this in R.

mx <- mx.bycol

# using a for loop: this is INEFFICIENT!
for (i in 1:nrow(mx)) {
  for (j in 1:ncol(mx)) {
    mx[i, j] <- mx[i, j] * 2
  }
}
print(mx)

mx <- mx.bycol

mx <- mx*2 # the R way is super easy!
print(mx)

     [,1] [,2] [,3]
[1,]    2   10   18
[2,]    4   12   20
[3,]    6   14   22
[4,]    8   16   24
     [,1] [,2] [,3]
[1,]    2   10   18
[2,]    4   12   20
[3,]    6   14   22
[4,]    8   16   24

Something else that we often would like to do is to perform the same operation over all columns (or alternatively, rows) of a matrix. For example, we might want to sum up each column of the matrix. R has the apply function just for this purpose. The syntax of apply is apply(X, MARGIN, FUN) where X is the named matrix object, FUN is the name of the function to be applied to each column (or row), and MARGIN indicates whether the operation applies to each row (1) or column (2).

# apply the function "sum" to each column
apply(mx.bycol, 2, sum) 
apply(mx.bycol, 2, mean)  # now, the mean

# apply the function "sum" to each row
apply(mx.bycol, 1, sum)
apply(mx.bycol, 1, mean)   # now, the mean
[1] 10 26 42
[1]  2.5  6.5 10.5
[1] 15 18 21 24
[1] 5 6 7 8

Note that what is returned in each case is a single vector containing the result of the operation for the corresponding vector (row or column). For instance, the vector (10, 26, 42) represents the sums of mx.bycol for the first column, second column, and third column, respectively.

  • binding together matrices

    The cbind() and rbind() functions allow you to concatenate two matrices together, with the former putting them side by side (such that the number of columns increases) and the latter putting them top to bottom (such that the number of rows increases).

    cbind(mx.bycol, mx.byrow)
    rbind(mx.bycol, mx.byrow)
    
         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]    1    5    9    1    2    3
    [2,]    2    6   10    4    5    6
    [3,]    3    7   11    7    8    9
    [4,]    4    8   12   10   11   12
         [,1] [,2] [,3]
    [1,]    1    5    9
    [2,]    2    6   10
    [3,]    3    7   11
    [4,]    4    8   12
    [5,]    1    2    3
    [6,]    4    5    6
    [7,]    7    8    9
    [8,]   10   11   12
    
  • turning a matrix into a vector

    The c() function applied to a matrix, will turn the matrix into a single dimensional vector, proceeding column by column, top to bottom.

    c(mx.byrow)
    c(mx.bycol)
    
     [1]  1  4  7 10  2  5  8 11  3  6  9 12
     [1]  1  2  3  4  5  6  7  8  9 10 11 12
    

3.4 List

A list is a collection of name value pairs. The values can be of any type; it is even possible to mix different kinds of objects (vectors, dataframes, etc) within a single list). In the below example, the name is the name of a musical artist, and the value is a (character) vector of album names.

3.4.1 Initialization

albums <- list(Michael.Jackson=c("Off the Wall","Thriller","Bad"),
               Nirvana=c("Bleach","Nevermind","In Utero"))  

3.4.2 Access

One can access elements of a list by index or by name. Unlike vectors, you use the double bracket [[]] syntax.

albums[[1]]
albums[["Nirvana"]]
[1] "Off the Wall" "Thriller"     "Bad"
[1] "Bleach"    "Nevermind" "In Utero"

The function lapply() will iterate over the elements of the list, applying a specified function to each element and returing the result as a list. For instance, to randomly permute the character vectors for each element, we would apply the sample function to each element.

lapply(albums, FUN=sample)
$Michael.Jackson
[1] "Bad"          "Off the Wall" "Thriller"    

$Nirvana
[1] "Nevermind" "In Utero"  "Bleach"

3.5 Data Frame

When you collect data, you will often have multiple measurements on single sampling units. It is inconvenient to have all of these pieces of data lying around your session environment in different variables. The data.frame structure in R allows you to bind together the different variables into a single object.

Let's have a look at the data from spelling dataset (this assumes you have already downloaded it, and it is in your working directory).

load("spelling.RData")
print(spelling.A)
   SubjID Method   Y
1       1      1 124
2       2      1 129
3       3      1 115
4       4      1 112
5       5      2 101
6       6      2  88
7       7      2 107
8       8      2  92
9       9      3  76
10     10      3  91
11     11      3  84
12     12      3  81

3.5.1 Entering data by hand

It is possible to directly enter data into a data frame object by typing in each of the vectors (or by creating them using the rep or seq functions). Here is how I created the two spelling data frames.

spelling.A <- data.frame(SubjID=1:12, Method=rep(1:3, each=4),
                         Y=c(124,129,115,112,101,88,107,92,76,91,84,81))

spelling.B <- data.frame(SubjID=1:12, Method=rep(1:3, each=4),
                         Y=c(124,88,84,112,101,129,107,81,76,91,115,92))

print(spelling.A)

print(spelling.B)
   SubjID Method   Y
1       1      1 124
2       2      1 129
3       3      1 115
4       4      1 112
5       5      2 101
6       6      2  88
7       7      2 107
8       8      2  92
9       9      3  76
10     10      3  91
11     11      3  84
12     12      3  81
   SubjID Method   Y
1       1      1 124
2       2      1  88
3       3      1  84
4       4      1 112
5       5      2 101
6       6      2 129
7       7      2 107
8       8      2  81
9       9      3  76
10     10      3  91
11     11      3 115
12     12      3  92

3.5.2 Access

Note that the data frame spelling.A has three columns, each representing a different variable: the subject identifier ("SubjID"), the spelling method that the subject learned ("Method"), and the observed spelling performance ("Y"). To access any of the vectors individually, we use the $ operator.

spelling.A$Y   # give the column named "Y" from the dataframe spelling.A
 [1] 124 129 115 112 101  88 107  92  76  91  84  81

We can access individual rows or columns using [] syntax. For the columns, we can also use character vectors to access them by name rather than by column number. (We can do this for rows too, although it's more common to do so by column; note that the default row names are just the row numbers).

spelling.A[1:3,]   # first three cases, all columns
spelling.A[c(4,5), 3] # cases 4 + 5, third column only
spelling.A[9:12, c("Method", "Y")] # cases 9-12, "Method" and "Y" only
  SubjID Method   Y
1      1      1 124
2      2      1 129
3      3      1 115
[1] 112 101
   Method  Y
9       3 76
10      3 91
11      3 84
12      3 81

3.5.3 Subsetting

We might want to look at only those subjects who had Method 3, so:

subset(spelling.A, Method==3)
   SubjID Method  Y
9       9      3 76
10     10      3 91
11     11      3 84
12     12      3 81

The first argument to the subset function is the name of the data frame, while the second is a logical expression saying which rows to keep. Note the double equals sign == for expressing equivalence. A single equals sign won't work as expected!

We can combine conditions with the logical operators "or" | and "and" &. For example, to get subjects who were in Method 3 and who had scores higher than 80:

subset(spelling.A, Method==3 & Y>80)
   SubjID Method  Y
10     10      3 91
11     11      3 84
12     12      3 81

Or, we might want to get all subjects where the method is NOT EQUAL to 3 using the != operator:

subset(spelling.A, Method!=3)
  SubjID Method   Y
1      1      1 124
2      2      1 129
3      3      1 115
4      4      1 112
5      5      2 101
6      6      2  88
7      7      2 107
8      8      2  92

3.5.4 Merging Data Frames

Let's say you have two data frames, one called subjects that has subject information, and the other called trials with trial information.

subjects <- data.frame(SubjID=1:4, Gender=c("M","M","F","F"))
trials <- data.frame(SubjID=rep(1:4, each=3),
                     ItemID=1:3, Response=rnorm(12))
print(subjects)
print(trials)
  SubjID Gender
1      1      M
2      2      M
3      3      F
4      4      F
   SubjID ItemID   Response
1       1      1 -0.1426569
2       1      2  0.6490175
3       1      3  1.7445806
4       2      1  0.1705198
5       2      2 -0.9361486
6       2      3 -0.3000100
7       3      1  0.1311008
8       3      2 -0.6824784
9       3      3 -1.2570611
10      4      1  1.2497215
11      4      2 -0.5346604
12      4      3 -0.7085383

We want to associate the gender information in subjects with the trial information in trials so, for instance, we can analyze the data by gender. What we want to do here is a database-style join. R has a function just for this, called merge().

merge(subjects, trials, by="SubjID")
   SubjID Gender ItemID   Response
1       1      M      1 -0.1426569
2       1      M      2  0.6490175
3       1      M      3  1.7445806
4       2      M      1  0.1705198
5       2      M      2 -0.9361486
6       2      M      3 -0.3000100
7       3      F      1  0.1311008
8       3      F      2 -0.6824784
9       3      F      3 -1.2570611
10      4      F      1  1.2497215
11      4      F      2 -0.5346604
12      4      F      3 -0.7085383

3.5.5 Factorially Combining Values

expand.grid() creates a data frame object that is the Cartesian product of the fields; that is, which includes all combinations of the elements of each field. For example, consider a study in which each of four subjects are given three items. There will be 4x3=12 observations in all in this study. We can create the data frame simply like this:

trials <- expand.grid(SubjID=1:4, ItemID=1:3)
print(trials)
   SubjID ItemID
1       1      1
2       2      1
3       3      1
4       4      1
5       1      2
6       2      2
7       3      2
8       4      2
9       1      3
10      2      3
11      3      3
12      4      3

3.5.6 Data Summaries

The summary function is useful. Let's try it on spelling.A.

summary(spelling.A)
     SubjID          Method        Y        
 Min.   : 1.00   Min.   :1   Min.   : 76.0  
 1st Qu.: 3.75   1st Qu.:1   1st Qu.: 87.0  
 Median : 6.50   Median :2   Median : 96.5  
 Mean   : 6.50   Mean   :2   Mean   :100.0  
 3rd Qu.: 9.25   3rd Qu.:3   3rd Qu.:112.8  
 Max.   :12.00   Max.   :3   Max.   :129.0

What if we want to summarize by certain values, for instance, calculate means for each method? There are two ways of doing this in R, using the functions aggregate() and tapply(). Here is how to do it with aggregate.

aggregate(x=list(Y=spelling.A$Y),
          by=list(Method=spelling.A$Method),
          FUN=mean)

with(spelling.A,
     aggregate(x=list(Y=Y), by=list(Method=Method), FUN=mean) )  
  Method   Y
1      1 120
2      2  97
3      3  83
  Method   Y
1      1 120
2      2  97
3      3  83

Note the use of the with() function in the second instance to simplify the syntax. The argument x is a list of values we want to aggregate (DVs), and we had to specify the name "Y" otherwise we get some default name from R that is opaque. The by argument tells aggregate how to break the data into groups; we can have multiple elements in the list, although in the current case we have only one (Method). We include the name "Method" for method, again, because otherwise R just gives default names like "Var1". The argument FUN is the function to be performed on the x argument, which in this case is mean.

Note: to create contingency tables, see ?table.

4 Basic Plotting

4.1 Scatterplot

Let's start off by making a basic scatterplot, with data series x, and including a trendline. We will use the function plot(x, y) to create the plot. The function abline() adds in a line with intercept a and slope b.

x <- 3 + 2 * (1:100) + rnorm(100, sd=20)
plot(1:100, x)
abline(a=3, b=2)

basicplot.png

Now let's include two series in the plot. Let's also include a legend to distinguish them.

x <- 3 + 2 * (1:100) + rnorm(100, sd=20)
y <- 3 + 1.5 * (1:100) + rnorm(100, sd=20)
plot(1:100, x)
abline(a=3, b=2)
points(1:100, y, pch=2)
abline(a=3, b=1.5, lty=2)
legend("topleft", c("x","y"), pch=1:2, lty=1:2)

basicplot2.png

Note that the points function adds a series to a plot without creating a new plot, whereas plot will always create a new plot (through a call to plot.new()). There are many parameters of the plot that can be tweaked. See ?plot and ?par for details.

4.2 Boxplot

If we specify a formula to plot, and the variable on the right-hand side is a factor, then we get boxplots. Factors re-code a variable in terms of an exhaustive set of levels (and thus only makes sense for categorical variables).

load(file="spelling.RData")
spelling.A$Method
spelling.A$Method <- factor(spelling.A$Method)  # make into factor
spelling.A$Method

plot(Y ~ Method, data=spelling.A)

box1.png

4.3 Basic line plot

This is accomplished in the same way as a scatterplot, except you set the argument type to 'l' (lines) or 'b' (both lines and points) instead of the default 'p' (points only).

plot(1:100, y=3 + 2*(1:100) + rnorm(100, sd=20), type='b')

lineplot1.png

4.4 Histogram

To create a histogram, use the hist() function.

x <- rnorm(1000)
myhist <- hist(x)
print(myhist)

hist1.png

You can alter how x is broken up into categories using the breaks argument.

4.5 Saving plots to a file

You will often want to output the graphics to a file for use in reports. It is easy to create PDF, PNG, and JPG graphics in R. What you need to do is to insert your plotting code between two commands, one to open the graphics output file, and the second to close (and save it). Here's how to save a file to PDF.

pdf(file="myplot.pdf")
plot(rnorm(10), rnorm(10))
dev.off()

Here's how to create a PNG file.

png(file="myplot.png")
plot(rnorm(10), rnorm(10))
dev.off()

4.6 Other graphics capabilities

To get a demo of the graphics capabilities try

demo(graphics)

You might also want to check out the fantastic graphics packages lattice and ggplot2. To get a sense for what you can do with these packages, here's a graphic I made using ggplot2:

ggplot-example.png

5 Statistical Functions

5.1 t-test and ANOVA

We can do a one-sample or two sample t-test using the function t.test().

5.1.1 One-sample t-test

x <- rnorm(100)
x.t <- t.test(x)  # test against zero
print(x.t)

x <- rnorm(100, mean=5)
x.t <- t.test(x)
print(x.t)
 
        One Sample t-test

data:  x 
t = 0.6321, df = 99, p-value = 0.5288
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval:
 -0.1181721  0.2286716 
sample estimates:
 mean of x 
0.05524973
 
        One Sample t-test

data:  x 
t = 56.9896, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval:
 4.794030 5.139901 
sample estimates:
mean of x 
 4.966965

Note that if we want to test against something other than zero, we can specify the null hypothesis using the mu argument of t.test.

5.1.2 Two-sample t-test

  • Independent samples

    For this, we use the same t.test function, but with additional arguments.

    x <- rnorm(50, mean=12, sd=10)
    y <- rnorm(50, mean=15, sd=10)
    t.test(x, y)  # default assumes unequal variances (Welch test)
    t.test(x, y, var.equal=TRUE)
    
     
            Welch Two Sample t-test
    
    data:  x and y 
    t = -1.3577, df = 98, p-value = 0.1777
    alternative hypothesis: true difference in means is not equal to 0 
    95 percent confidence interval:
     -6.732557  1.262544 
    sample estimates:
    mean of x mean of y 
     12.96880  15.70381
    
            Two Sample t-test
    
    data:  x and y 
    t = -1.3577, df = 98, p-value = 0.1777
    alternative hypothesis: true difference in means is not equal to 0 
    95 percent confidence interval:
     -6.732557  1.262544 
    sample estimates:
    mean of x mean of y 
     12.96880  15.70381
    

    The var.equal argument is used to specify whether the variances of the two groups are assumed to be unequal (default) or equal.

  • Paired-samples t-test
    x <- data.frame(SubjID=rep(1:8, each=2),
                    Condition=c("A","B"),
                    sre=rep(rnorm(8, sd=10), each=2),
                    err=rnorm(16, sd=4))
    x$Y <- 100 + x$sre + x$err
    
    grp1 <- subset(x, Condition=="A")
    grp2 <- subset(x, Condition=="B")
    
    t.test(grp1$Y, grp2$Y) # WRONG! observations treated as independent
    t.test(grp1$Y, grp2$Y, paired=TRUE) # CORRECT  
    
     
            Welch Two Sample t-test
    
    data:  grp1$Y and grp2$Y 
    t = 0.267, df = 13.982, p-value = 0.7933
    alternative hypothesis: true difference in means is not equal to 0 
    95 percent confidence interval:
     -6.916550  8.883557 
    sample estimates:
    mean of x mean of y 
     97.46473  96.48123
    
            Paired t-test
    
    data:  grp1$Y and grp2$Y 
    t = 0.8666, df = 7, p-value = 0.4149
    alternative hypothesis: true difference in means is not equal to 0 
    95 percent confidence interval:
     -1.700153  3.667159 
    sample estimates:
    mean of the differences 
                  0.9835033
    

5.1.3 One Factor ANOVA

The aov function does what we need. We need to declare the variable Method as a factor to get it to work properly (see ?factor for details).

The syntax for aov is: aov(formula, data), where formula is the formula for the GLM, and data is the named dataframe where the variables specified in the formula are found. In this case, our formula is Y ~ Method. Note that Y is the name of our DV, and Method is the name of our IV. Because the equals sign is often used for assignment in R, R requires us to use the tilde (~) symbol where the equals sign should have been.

Note that we do not need to specify the error term or the intercept in the model (they are simply assumed by the function). The aov function returns a "fitted model object"; we get a summary table for this object using the summary function.

# declare Method as a factor
spelling.A$Method <- factor(spelling.A$Method)

spelling.A.aov <- aov(Y ~ Method, data=spelling.A)
summary(spelling.A.aov)
            Df Sum Sq Mean Sq F value    Pr(>F)    
Method       2   2792 1396.00  23.886 0.0002515 ***
Residuals    9    526   58.44                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

5.1.4 Factorial ANOVA

We can also perform a factorial ANOVA using aov. For this example, you will need to download the file anova2f.RData and load it into your R session. This will bring the data frame object dat into your environment, which contains the data we will be working with.

The first argument to aov() is formula, which specifies the GLM. The baseline value \(\mu\) and the error term are implicit, so all you have to is specify the main effects and interactions. Interactions are specified using the syntax A:B. So for the current analysis, our formula to aov() is Y ~ A + B + A:B.

An easier way is to use the shortcut syntax A*B which essentially says, "give me all possible main effects and interactions for factors A and B". This becomes important when we get to higher order designs, where spelling all the terms out by hand gets tedious and prone to error.

The aov() function returns a special kind of object. To get a summary of the object, we use the summary() function.

load(file="anova2f.RData")
# first way, spelling out the terms, which is equivalent.
dat.aov <- aov(formula=Y ~ A + B + A:B, data=dat)
summary(dat.aov)

# second way, using the '*' shortcut syntax
dat.aov <- aov(formula=Y ~ A*B, data=dat)
summary(dat.aov)  
            Df Sum Sq Mean Sq F value    Pr(>F)    
A            1 3502.1  3502.1 84.8990 1.559e-05 ***
B            1   18.8    18.8  0.4545    0.5192    
A:B          1   10.1    10.1  0.2444    0.6343    
Residuals    8  330.0    41.2                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
            Df Sum Sq Mean Sq F value    Pr(>F)    
A            1 3502.1  3502.1 84.8990 1.559e-05 ***
B            1   18.8    18.8  0.4545    0.5192    
A:B          1   10.1    10.1  0.2444    0.6343    
Residuals    8  330.0    41.2                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

5.2 TODO Correlation, Regression and GLM

5.3 TODO Multilevel Modeling

Date: 12-Oct-2011

Author: Dale Barr

Org version 7.7 with Emacs version 23

Validate XHTML 1.0