# Introduction to R

## Table of Contents

- 1 Getting Started
- 2 The R Session Environment
- 3 Basic Objects and Operations
- 4 Basic Plotting
- 5 Statistical Functions

## 1 Getting Started

The full `R`

code used in this tutorial can be downloaded here (you
might need to right-click and select "Save As…" so that it doesn't
open in your browser).

The Comprehensive R Archive Network is where the `R`

software and
add-on packages reside. Follow the instructions there to download the
software you need for your particular operating system.

### 1.1 Getting Help

#### 1.1.1 R documentation

You can type `?functionname`

at the command prompt to get help on the
function *functionname*. For example, to get help on the `help()`

function, type `?help`

.

The `help.start()`

function will open up the `R`

documentation in a
web browser.

#### 1.1.2 On-line help

The place to begin is http://www.r-project.org. There is much that can be found online at the Comprehensive R Archive Network. A good place to start is the tutorial section. For full documentation, see the online manual.

If you are having difficulty finding what you need, R Seek is a good
internet search engine devoted specifically to `R`

. The fact that
nearly every page on the internet that is written in English contains
the letter "R" somewhere makes it hard to find what you need from
Google!

## 2 The R Session Environment

### 2.1 Starting up and closing down

#### 2.1.1 Batch sessions

The most common way you will interact with `R`

is in interactive
sessions. However, there is another way that is often useful that we
will touch upon briefly first.

R is essentially a programming language, and you will typically be
writing scripts that you will store in text files (conventionally with
an extension of `.R`

or `.r`

) and you might want to just execute a
saved script without going through an interactive session. At the
command line of your operating system, type

R --save < filename.R

This will send the script in the file `filename.R`

into the R
interpreter. The `--save`

option will save the environment (into the
.RData and .RHistory files) when finished. If you don't need to save
the environment, use the `--no-save`

option.

#### 2.1.2 Interactive sessions

Starting up is easy enough. One thing to note is that the directory
in which you start R plays a special role. The `R`

interpreter stores
two "hidden" files in this directory, that form a record of the
history of commands input during the session (.Rhistory) as well as an
image of the data objects in the session (.RData). If you are on a
"unix-alike" operating system (Mac OS, Linux), you can open up a
terminal window and use the `cd`

command to change to the directory you
want to use (or create it with `mkdir`

) before starting R.

To shut down, you type `q()`

at the command line, for "quit". R will ask:

Save workspace image? [y/n/c]:

Type 'y' to save the image, 'n' to exit without saving, and 'c' to undo the quit operation and return to the prompt.

#### 2.1.3 The session "environment"

In an `R`

session, you will create and manipulate various data
objects, such as data frames, vectors, matrices, etc. You might also
create your own functions. All of these items reside in memory,
within the session "environment".

To see what's in your environment, use the `ls()`

command. It returns
a vector of type "character" with the names of the objects in the
environment.

ls()

character(0)

Hmm, `character(0)`

means a character vector of length 0, which means
there's nothing there. Guess we might actually need to do something first.

We are going to create new variables `x`

and `y`

and assign to them
values that are the result of the `rnorm()`

and `runif()`

functions.
The '`<-`

' syntax is something specific to `R`

for assigning values to
variables. Following this syntax, `x <- 10`

means "give the variable
named 'x' the value of '10'. The reason why we don't just use the
plain old equals sign to do this is that equals is ambiguous between
*assignment* and *testing for equality*. We still use the equals sign
when assigning values to arguments in function calls, and that seems
inconsistent to me, but there's probably some deeper reason for it
that I don't see. Note that functions are called by name (e.g.,
`myfunction()`

), followed by parentheses that optinally contain
arguments (more on functions later).

You can see both '`<-`

' and '=' in operation below.

x <- rnorm(10, sd=2) # create vector named 'x', drawing from # a normal distribution with standard deviation of 2 y <- runif(10, min=0, max=1) # create vector named 'y', drawing from # a uniform distribution from 0 to 1 print(x) print(y)

[1] -1.67539296 -1.02796192 0.23490288 1.13407282 1.63891794 0.68877181 [7] 0.02430912 -0.59720188 -0.65095750 0.11747846 [1] 0.52641168 0.55102570 0.02459610 0.17269570 0.07816773 0.41788894 [7] 0.18890956 0.40645372 0.39614057 0.84168878

Note that we had to `print`

the variables `x`

and `y`

to see what they
contain. We could also just type their names at the command line, and
that automatically will `print`

them.

x

[1] -1.67539296 -1.02796192 0.23490288 1.13407282 1.63891794 0.68877181 [7] 0.02430912 -0.59720188 -0.65095750 0.11747846

OK, earlier I promised to explain the `ls()`

function, so let's see
what the environment contains now:

ls()

[1] "spelling.A" "spelling.B" "x" "y"

Aargh! now my environment is all cluttered by `x`

and `y`

! Begone, `x`

and `y`

!

rm(x, y) ls()

[1] "spelling.A" "spelling.B"

We used the `rm()`

command to do that. Sometimes you might want to
get rid of everything in the environment, but it would be a pain to
type in all the names. To zap them all away in one step, we feed the
results of the `ls()`

command to the `list`

argument of `rm()`

, like
so:

rm(list=ls()) # tidy it all up!

The only other way to do this is to exit and restart `R`

, so this
command often comes in handy.

### 2.2 Typing in commands

You have two choices for typing in commands: (1) type them directly into the command interpreter, or (2) type them into a text file and then get them into the interpreter in some other way. Unless I just need to compute something really fast, I usually use the second method, because that leaves me with a 'script' of commands that I can then easily use in the future, or run in batch mode. If I just type them into the interpreter, they will be saved in .Rhistory, but so will any mistakes that I type in as well.

My workflow with `R`

involves the text editor emacs and org mode
coupled with the emacs plug-in ESS (Emacs Speaks Statistics). This
means that my day to day work life looks something like this:

This allows me to type in single lines of `R`

code, and send them to
the interpreter line by line or section by section. The `R`

snippets
are actually woven into a file that can be exported to HTML or LaTeX.
In fact, this tutorial was created using this setup. A similar setup
is Sweave which allows R code to be integrated within a LaTeX
document.

Working with `R`

through emacs, org-mode, and ESS is in many ways
ideal (not to mention, "retro cool"). But unless you are a true nerd
and terminal junky like me, you will find such a setup abhorrent, and
might want to find better things to do with your cortex than to cram
it with various emacs keystroke combinations. For those who want the
full-experience of a graphical interface but still want the option of
line-by-line command processing, try R commander or R studio. For
Ubuntu Linux users, I would also mention Dan Dediu's great Rgedit, a
plug in for the text editor gedit.

#### 2.2.1 Use of "whitespace" in commands

To improve readability, sometimes it is useful to break up a single long command over multiple lines. R will not try to interpret the line until the syntax is complete (or it encounters a syntax error).

When you break up a command over multiple lines (and are typing the
command in the interpreter) the prompt will change from `>`

to `+`

to
let you know that the command continues. For example:

> x <- 3 + 5 + 7/3 * 999 - + 88 + 22 + 5 >

Because the command on the first line (ending with the minus sign) is
not a complete expression, `R`

"knows" that I intended to continue the
command on the second line. Because second line makes it a full
expression, the command is evaluated and the `>`

prompt returns.

#### 2.2.2 Interrupting a long-running process

Sometimes you might have a mistake in your code that creates an
infinite loop, or that initiates a very long computation. The program
will lock until the operation completes. **Ctrl-C** is your friend; it
will allow you to interrupt the process and return to the command
prompt.

Another situation where **Ctrl-C** comes in handy is when you have typed
in erroneous syntax, and your command prompt has turned into a '`+`

'
rather than an '`>`

'. Rather than trying to diagnose what went wrong,
just hit **Ctrl-C** a couple of times to get back to normal.

### 2.3 Calling functions

Functions perform some action on data objects. Let's illustrate with
a very simple example, using the functions `rnorm()`

and `mean()`

.

x <- rnorm(100) # create 100 random normal variates mean(x) # calculate the mean

[1] -0.1969908

The object named `x`

was set to the *return value* of the function
`rnorm()`

, which in this case, was 100 variates from the normal
distribution.

Functions such as `rnorm()`

usually accept *arguments*, that modify
what they do. To see the accepted arguments for `rnorm()`

type
`?rnorm`

at the command prompt. The help page shows something like
this:

Normal package:stats R Documentation The Normal Distribution Description: Density, distribution function, quantile function and random generation for the normal distribution with mean equal to ‘mean’ and standard deviation equal to ‘sd’. Usage: dnorm(x, mean = 0, sd = 1, log = FALSE) pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) rnorm(n, mean = 0, sd = 1) Arguments: x,q: vector of quantiles. p: vector of probabilities. n: number of observations. If ‘length(n) > 1’, the length is taken to be the number required. mean: vector of means. sd: vector of standard deviations. log, log.p: logical; if TRUE, probabilities p are given as log(p). lower.tail: logical; if TRUE (default), probabilities are P[X <= x] otherwise, P[X > x].

This documentation presents descriptions of four functions associated
with the normal distribution, but the one that we are primarily
interested in is the random generation function, `rnorm(n, mean=0, sd=1)`

. This tells us that `rnorm`

takes three arguments: `n`

,
`mean`

, and `sd`

. The documentation page also tells us that `n`

is
the "number of observations", `mean`

is the mean, and `sd`

is the
standard deviation.

When we typed `rnorm(100)`

, we failed to specify the mean and standard
deviation, but the function worked anyway. This is because the `mean`

and `sd`

arguments have *defaults* of 0 and 1, respectively. In
contrast, the `n`

argument has no default, and thus will fail if we do
not include it.

> rnorm() Error in .Internal(rnorm(n, mean, sd)) : 'n' is missing

The help page tells us (implicitly) that `n`

is an *obligatory*
argument and that `mean`

and `sd`

are *optional* by specifying the
default values for `mean`

and `sd`

, but not for `n`

, in the function
description `rnorm(n, mean=0, sd=1)`

.

Note that when we type `rnorm(100)`

, the interpreter knows that 100 is
intended for the argument `n`

, because it matches the position of `n`

(first) in the function description. Because `mean`

and `sd`

are not
present in the function call, the interpreter uses the default values.
We could also specify the mean of the distribution like so:

```
x <- rnorm(100, 50)
mean(x)
```

[1] 50.04525

An equivalent way of doing this would have been with *named*
arguments, e.g., `rnorm(n=100, mean=50)`

. Without the names for the
arguments, the interpreter uses the position to assign the values.
When the names are present, the interpreter ignores the position and
uses the names. This is useful because we often want to use an
argument but may not remember what position it was supposed to be in.
For example, say we want to use the default of 0 for the mean, but
adjust the standard deviation. All we need to do is this:

```
x <- rnorm(100, sd=100)
mean(x)
```

[1] -10.53998

Even though `mean`

should have been the second argument, we can put
`sd`

in as a named argument and everything will be alright because the
position will be ignored. This may not seem to buy us much for the
case of `rnorm`

, which only has three arguments, but it becomes
critical when a function has many arguments only some of which we want
to alter (see the function `par()`

for example).

#### 2.3.1 Creating your own functions

Sometimes you will want to create your own function, especially if it is an operation that you are repeating over and over throughout a script. One of the principles of good programming is to never repeat yourself. Each time we retype the same code presents a new chance to introduce a bug; furthermore, if we need to change the function later, we would have to do it everywhere it appears in the script. It also makes the code more verbose than necessary and less readable. A better way is to write a function once and then call it whenever it is needed.

Let's say we needed a function that would add A to B and then divide
it by C. We define a new function `AplusBoverC`

thus:

AplusBoverC <- function(A, B, C) { return( (A+B) / C ) } AplusBoverC(1, 3, 2)

[1] 2

We created a new object `AplusBoverC`

that took three arguments, `A`

,
`B`

, `C`

. The *function body* is enclosed within curly brackets `{}`

,
and ends with a *return value*, the value to be passed back to the
calling process.

We call our function in the way we call any function (such as
`rnorm`

). See `?function`

and the R documentation on functions for
more information.

### 2.4 Data input/output

#### 2.4.1 Importing data

The data we will use in this section can be downloaded in any of the
three formats below. The righthand column of the table tells you what
`R`

function to use to bring the data from the file into your
environment.

Format | R function |
---|---|

Raw text (.txt) | read.table |

Comma-separated values (.csv) | read.csv |

R binary format (.RData) | load |

- Importing Raw Text

Use the

`read.table`

function. The output of the function will be a`data.frame`

object. Note that the text file we will be reading looks like this:"SubjID" "Method" "Y" 1 1 124 2 1 129 3 1 115 4 1 112 5 2 101 6 2 88 7 2 107 8 2 92 9 3 76 10 3 91 11 3 84 12 3 81

Note that the first row of the file contains the name of the corresponding column, enclosed in double quotes. The columns in each row are delineated by spaces. By default,

`read.table`

interprets any kind of white space (spaces or tabs) as a delineator. We want the first line of the file to be interpreted as column names, so we must use the argument`header=TRUE`

.spelling.A <- read.table(file="spelling.txt", header=TRUE) print(spelling.A)

SubjID Method Y 1 1 1 124 2 2 1 129 3 3 1 115 4 4 1 112 5 5 2 101 6 6 2 88 7 7 2 107 8 8 2 92 9 9 3 76 10 10 3 91 11 11 3 84 12 12 3 81

- Importing a CSV File

If you are familiar with Microsoft Excel, it is sometimes convenient to save the data in .CSV format, which can be easily imported into R.

Here is the file that we will be reading in:

"SubjID","Method","Y" 1,1,124 2,1,129 3,1,115 4,1,112 5,2,101 6,2,88 7,2,107 8,2,92 9,3,76 10,3,91 11,3,84 12,3,81

Note that the

`read.csv`

function has a similar syntax to the`read.table`

function.spelling.A <- read.csv(file="spelling.csv", header=TRUE) print(spelling.A)

SubjID Method Y 1 1 1 124 2 2 1 129 3 3 1 115 4 4 1 112 5 5 2 101 6 6 2 88 7 7 2 107 8 8 2 92 9 9 3 76 10 10 3 91 11 11 3 84 12 12 3 81

- Importing Native R Binary files

This is accomplished using the

`load`

function. Unlike text or CSV files, files in the binary format (.RData) can have multiple data.frames or other objects inside of them. After using`load`

, we will use the function`ls`

to list the objects that exist in our environment.# what is in the environment before loading ls() load(file="spelling.RData") ls() # list what is there after print(spelling.A)

character(0) [1] "spelling.A" "spelling.B" SubjID Method Y 1 1 1 124 2 2 1 129 3 3 1 115 4 4 1 112 5 5 2 101 6 6 2 88 7 7 2 107 8 8 2 92 9 9 3 76 10 10 3 91 11 11 3 84 12 12 3 81

- Other options

Note that it is also possible to download data from a MySQL database server into your environment using the

`RMySQL`

package (or other databases as well, such as sqlite using package`RSQLite`

, or a Windows ODBC connection using`RODBC`

). None of these packages is available by default, but must be installed from CRAN.

#### 2.4.2 Saving objects

Often you will wish to save objects that you have created to external
files, whether for further processing using different software or so
that you can import them back into `R`

again in a later session.

- Saving a text file

For this, we use the

`write.table`

function. Note that we are setting the`row.names`

option to`FALSE`

. Without this,`R`

will create a new column with the name of the rows in your data frame (which is often unnecessary, unless the row names have been set to something meaningful). Note that the column names will appear as the first line in the file (surrounded by quotations).write.table(spelling.A, file="spelling.txt", row.names=FALSE)

- Saving a CSV file

This is just like the

`write.table`

command.write.csv(spelling.A, file="spelling.csv", row.names=FALSE)

- Saving an R binary file

The advantages of the native

`R`

binary (.RData) format is that the files are compressed, and it is furthermore possible to store multiple objects in a single file. However, the disadvantage is that the data can only be read by`R`

, but not by other programs (e.g., Excel).To save in

`R`

binary format, use the`save`

function. Enter the names of the objects to save as the first arguments to the function, and then use the`file`

argument to specify the filename.`save(spelling.A, spelling.B, file="spelling.RData")`

### 2.5 Add-on packages

One of the great things about `R`

is that it is open-source
community-developed software. That means that anyone can create a
"package" of functions and data for any purpose, and make their
results freely available to other `R`

users. It also means that the
source code of the package is freely available so that you can see
what's going on under the hood. After packages have been vetted to
make sure they are bug-free and install properly, they are made
available on CRAN. Some packages that are still under development
(like my gmpm package for doing permutation tests) are available on R forge. You can install these as well using the typical way (the
`install.packages`

argument, though you'll have to enter the R-forge
address for the `repos`

argument).

CRAN Task Views shows packages sorted by theme. There are some amazing things out there, including packages for analyzing fMRI or eeg data, natural language processing, latent semantic analysis; not to mention a Sudoku puzzle solver, and even a twitter client!

To see how add-on packages work, let's install the `fortunes`

package,
a collection of wit and wisdom from the `R`

community. It's useless,
except for a laugh, but shows you how to install and load packages!
You can install the package from the command line using the
`install.packages()`

command.

install.packages("fortunes")

OK, we've installed it! Note that we never have to do this step again; the package now has been downloaded onto our system. If that command failed, it probably means that you don't have the proper permissions for installing software on your system or you are behind a firewall and your proxy is not set up properly (in either case, go bother your sysadmin).

The `fortunes`

package has one function, namely, `fortune()`

. Let's
try it and see what happens.

> fortunes() Error: could not find function "fortune"

Huh? What happened? Oh yeah, we need to *load* the package first
using the `library`

command. You only need to do this once per
session, whenever you want to use the package. This makes the
functions and datasets in the package available to your session.

library(fortunes) fortune()

If 'fools rush in where angels fear to tread', then Bayesians 'jump' in where frequentists fear to 'step'. -- Charles C. Berry (about Bayesian model selection as an alternative to stepwise regression) R-help (November 2009)

Brilliant!! (your results may differ :) )

## 3 Basic Objects and Operations

### 3.1 Naming objects

As already mentioned, working with `R`

involves creating and
manipulating named data objects. You cannot give your objects any old
names; obviously, to avoid ambiguity, names should not include
elements of the syntax or whitespace. Often people use dots in the
names (e.g., `x.dat`

, `x.lm`

, `x.rand`

) to indicate that the objects
somehow go together (are associated with the data object `x`

),
although this convention is meaningless to the interpreter itself.

In this section, we'll go through the most common `R`

object types
(there are more) and discuss how they are used and manipulated.

### 3.2 Vector

The most basic variables in `R`

are *vectors*. As in mathematics, a
vector is an ordered set of elements. Vectors in `R`

have a *length*
and a *mode*. The length is simply the number of elements, and the
mode is the type of data contained in the vector.

#### 3.2.1 Initializing

```
x <- 1
length(x)
mode(x)
```

[1] 1 [1] "numeric"

Here we see that the variable `x`

is a numeric vector with length of 1
and mode "numeric". There are a number of useful techniques for
creating sequences of numbers that save typing, including the colon
`:`

operator and the functions `seq`

and `rep`

.

1:3 3:1 c(1:3, 5:7, 9:11) rep(1:3, times=3) rep(1:3, each=3) seq(1, 27, by=3)

[1] 1 2 3 [1] 3 2 1 [1] 1 2 3 5 6 7 9 10 11 [1] 1 2 3 1 2 3 1 2 3 [1] 1 1 1 2 2 2 3 3 3 [1] 1 4 7 10 13 16 19 22 25

Note the above syntax `c()`

for defining a vector with multiple
elements, where the elements are separated by commas.

We can also define "logical" vectors, where the elements are the
values `TRUE`

and `FALSE`

. Note that we can also use `rep`

with any
mode of vector, including logical.

x <- c(TRUE, FALSE, TRUE) mode(x) rep(c(TRUE, FALSE), each=5)

[1] "logical" [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE

We can name the elements of our vector by entering them explicitly
when we create the vector, or by using the `names`

function.

# put in names when creating vector weather <- c(rain=TRUE, snow=FALSE, sleet=FALSE) print(weather) names(weather) names(weather) <- c("sunny", "windy", "humid") print(weather)

rain snow sleet TRUE FALSE FALSE [1] "rain" "snow" "sleet" sunny windy humid TRUE FALSE FALSE

One can also have *character* vectors, where each element is a string.

actors <- c("Al Pacino", "Robert De Niro", "Alec Baldwin", "Jack Lemmon", "Ed Harris", "Alan Arkin") print(actors)

[1] "Al Pacino" "Robert De Niro" "Alec Baldwin" "Jack Lemmon" [5] "Ed Harris" "Alan Arkin"

#### 3.2.2 Accessing elements of a vector

There are three ways to access elements in a vector: using a
*numerical index*, a *name index*, or a *logical index*. Each of
these indices is, conveniently, also a vector, making it possible to
select multiple elements simultaneously. The index is enclosed within
brackets `[]`

.

actors[2] actors[c(3,5)] actors[c(TRUE, FALSE, TRUE, FALSE, TRUE)] weather["sunny"] weather[c(1,3)]

[1] "Robert De Niro" [1] "Alec Baldwin" "Ed Harris" [1] "Al Pacino" "Alec Baldwin" "Ed Harris" "Alan Arkin" sunny TRUE sunny humid TRUE FALSE

#### 3.2.3 Operations on vectors

One aspect of `R`

that is confusing at first but is ultimately
something that makes it very powerful is that it does something called
*implicit replication* of arguments in order to apply certain
operators, when the vectors the operator is supposed to apply to have
different lengths. This is best explained by example.

Suppose we have a vector of integers from 1 to 6:

```
x <- 1:6
x
```

[1] 1 2 3 4 5 6

And we want to add 2 to each of the elements of the vector. In
programming langauge such as `C`

, the way this would be done would be
using a for loop. In `R`

, it's much simpler:

print(x + 2)

[1] 3 4 5 6 7 8

What happened here? The interpreter encountered an instruction to add a vector of length 1 to a vector of length 6. This is not possible, so what it did (implicitly) was to replicate the shorter vector to match the size of the larger vector, like this:

x | after | RESULT | |
---|---|---|---|

replication | |||

1 | + | 2 | 3 |

2 | + | 2 | 4 |

3 | + | 2 | 5 |

4 | + | 2 | 6 |

5 | + | 2 | 7 |

6 | + | 2 | 8 |

So, what would happen if the value we added to `x`

was had length 2
instead of 1?

print(x + c(1,3))

[1] 2 5 4 7 6 9

What happened here was that the vector `<1,3>`

of length 2 was
replicated 3 times to match the length of `x`

(6 elements), like so:

after | ||
---|---|---|

x | replication | RESULT |

1 | 1 | 2 |

2 | 3 | 5 |

3 | 1 | 4 |

4 | 3 | 6 |

5 | 1 | 7 |

6 | 3 | 9 |

Obviously, this implicit replication only makes sense if the length of the smaller vector is a factor of the length of the larger vector. When it's not, R does the replication anyway, but throws up a warning.

print(x + c(1.1,2.2,3.3,4.4))

[1] 2.1 4.2 6.3 8.4 6.1 8.2 Warning message: In x + c(1.1, 2.2, 3.3, 4.4) : longer object length is not a multiple of shorter object length

The same goes for subtraction, multiplication, and division. We can
also take a vector to a power using the syntax `x^y`

, and the
interpreter will do this for each element:

print(x^2)

[1] 1 4 9 16 25 36

One thing that is sometimes confusing is that certain functions, such
as `log`

, return a vector of length `n`

when given a vector of length
`n`

, with each element transformed, whereas other functions, such as
`mean`

, perform aggregation functions over all elements, and return a
scalar. Usually it will be obvious whether a certain function
performs aggregation or operations on single elements.

log(x) mean(x)

[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 [1] 3.5

Sometimes, you might want to take the dot product of two vectors
(multiply pairs of corresponding elements in the two vectors, then add
the sums). This is accomplished with the `%*%`

operator. The object
that is returned is a matrix (with 1 row and 1 column in the case of
vector dot products).

c(1,2,3) %*% c(5,10,30)

[,1] [1,] 115

- Combining vectors

Combining vectors is easy; we just use the

*concatenate*(`c()`

) function.x <- c(1,2,3) y <- c(4,5,6) xy <- c(x, y) print(xy)

[1] 1 2 3 4 5 6

- Sampling elements

Sometimes we just want to randomly sample elements from the vector (e.g., for bootstrapping). For this, the

`sample`

function comes in handy.x <- 1:4 sample(x) # return a random permutation (sample without replacement)

[1] 4 2 1 3

sample(x, 3, replace=FALSE) # sample 3 without replacement sample(x, 10, replace=TRUE) # sample 10 with replacement sample(c(0,1), 30, replace=TRUE, prob=c(.3,.7)) # probs not uniform

[1] 2 4 3 [1] 3 4 4 1 2 4 3 2 1 3 [1] 0 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 0 1 1 0 1 1

- Set operations

Sometimes we might want to know, what elements do vectors

`x`

and`y`

have in common? What elements does`x`

have that`y`

does not? Here are some examples of set operations.x <- 1:5 y <- 3:7 intersect(x, y)

[1] 3 4 5

union(x, y)

[1] 1 2 3 4 5 6 7

setdiff(x, y) # which elements does x have that y lacks?

[1] 1 2

The

`%in%`

operator returns`TRUE`

or`FALSE`

depending on whether the left hand argument is an element in the right-hand argument.x <- c("Jan", "Feb", "Mar") "Feb" %in% x "Apr" %in% x

[1] TRUE [1] FALSE

`x <- c(1, 1, 3, 3, 5, 5, 7, 7) unique(x)`

[1] 1 3 5 7

### 3.3 Matrix

A matrix is a vector with an "extra" dimension; it can be thought of a
"vector of vectors." They can be created using the `matrix`

command.
The command takes as its first argument a vector. One can then
specify the number of rows (and/or columns) in the matrix, and `R`

automatically will fill them in. The values can filled in by column
(default) or by row (by setting the argument `byrow`

to `TRUE`

). Note
that you only need to give `R`

the number of rows OR the number of
columns, but not both; the function is smart enough to figure out the
size of the missing dimension.

#### 3.3.1 Initialization

mx.bycol <- matrix(1:12, nrow=4) print(mx.bycol) mx.byrow <- matrix(1:12, nrow=4, byrow=TRUE) print(mx.byrow)

[,1] [,2] [,3] [1,] 1 5 9 [2,] 2 6 10 [3,] 3 7 11 [4,] 4 8 12 [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 [3,] 7 8 9 [4,] 10 11 12

#### 3.3.2 Access

You can access elements of a matrix using index vectors, similar to
how one does this with simple vectors. However, in this case, one
needs two indices, one for the row, and one for the column. If no
index is specified for a particular dimension, then `R`

returns all of
the values for that dimension.

mx.bycol[1,3] # row 1, col 3 mx.bycol[1,c(1,3)] # row 1, cols 1 and 3 mx.bycol[2:4,] # rows 2-4, all columns mx.bycol[,c(1,3)] # all rows, cols 1 & 3 mx.bycol[,] # return everything

[1] 9 [1] 1 9 [,1] [,2] [,3] [1,] 2 6 10 [2,] 3 7 11 [3,] 4 8 12 [,1] [,2] [1,] 1 9 [2,] 2 10 [3,] 3 11 [4,] 4 12 [,1] [,2] [,3] [1,] 1 5 9 [2,] 2 6 10 [3,] 3 7 11 [4,] 4 8 12

#### 3.3.3 Operations

Sometimes we want to perform the same operation on every element of a matrix; for example, we might want to multiply each element by two. Just as we saw for vectors, there is a shorthand way of doing this in R.

mx <- mx.bycol # using a for loop: this is INEFFICIENT! for (i in 1:nrow(mx)) { for (j in 1:ncol(mx)) { mx[i, j] <- mx[i, j] * 2 } } print(mx) mx <- mx.bycol mx <- mx*2 # the R way is super easy! print(mx)

[,1] [,2] [,3] [1,] 2 10 18 [2,] 4 12 20 [3,] 6 14 22 [4,] 8 16 24 [,1] [,2] [,3] [1,] 2 10 18 [2,] 4 12 20 [3,] 6 14 22 [4,] 8 16 24

Something else that we often would like to do is to perform the same
operation over all columns (or alternatively, rows) of a matrix. For
example, we might want to sum up each column of the matrix. `R`

has
the `apply`

function just for this purpose. The syntax of `apply`

is
`apply(X, MARGIN, FUN)`

where `X`

is the named matrix object, `FUN`

is
the name of the function to be applied to each column (or row), and
`MARGIN`

indicates whether the operation applies to each row (1) or
column (2).

# apply the function "sum" to each column apply(mx.bycol, 2, sum) apply(mx.bycol, 2, mean) # now, the mean # apply the function "sum" to each row apply(mx.bycol, 1, sum) apply(mx.bycol, 1, mean) # now, the mean

[1] 10 26 42 [1] 2.5 6.5 10.5 [1] 15 18 21 24 [1] 5 6 7 8

Note that what is returned in each case is a single vector containing
the result of the operation for the corresponding vector (row or
column). For instance, the vector (10, 26, 42) represents the sums of
`mx.bycol`

for the first column, second column, and third column,
respectively.

- binding together matrices

The

`cbind()`

and`rbind()`

functions allow you to concatenate two matrices together, with the former putting them side by side (such that the number of columns increases) and the latter putting them top to bottom (such that the number of rows increases).cbind(mx.bycol, mx.byrow) rbind(mx.bycol, mx.byrow)

[,1] [,2] [,3] [,4] [,5] [,6] [1,] 1 5 9 1 2 3 [2,] 2 6 10 4 5 6 [3,] 3 7 11 7 8 9 [4,] 4 8 12 10 11 12 [,1] [,2] [,3] [1,] 1 5 9 [2,] 2 6 10 [3,] 3 7 11 [4,] 4 8 12 [5,] 1 2 3 [6,] 4 5 6 [7,] 7 8 9 [8,] 10 11 12

- turning a matrix into a vector

The

`c()`

function applied to a matrix, will turn the matrix into a single dimensional vector, proceeding column by column, top to bottom.c(mx.byrow) c(mx.bycol)

[1] 1 4 7 10 2 5 8 11 3 6 9 12 [1] 1 2 3 4 5 6 7 8 9 10 11 12

### 3.4 List

A list is a collection of name value pairs. The values can be of any type; it is even possible to mix different kinds of objects (vectors, dataframes, etc) within a single list). In the below example, the name is the name of a musical artist, and the value is a (character) vector of album names.

#### 3.4.1 Initialization

albums <- list(Michael.Jackson=c("Off the Wall","Thriller","Bad"), Nirvana=c("Bleach","Nevermind","In Utero"))

#### 3.4.2 Access

One can access elements of a list by index or by name. Unlike
vectors, you use the double bracket `[[]]`

syntax.

```
albums[[1]]
albums[["Nirvana"]]
```

[1] "Off the Wall" "Thriller" "Bad" [1] "Bleach" "Nevermind" "In Utero"

The function `lapply()`

will iterate over the elements of the list,
applying a specified function to each element and returing the result
as a list. For instance, to randomly permute the character vectors
for each element, we would apply the `sample`

function to each
element.

lapply(albums, FUN=sample)

$Michael.Jackson [1] "Bad" "Off the Wall" "Thriller" $Nirvana [1] "Nevermind" "In Utero" "Bleach"

### 3.5 Data Frame

When you collect data, you will often have multiple measurements on
single sampling units. It is inconvenient to have all of these pieces
of data lying around your session environment in different variables.
The `data.frame`

structure in `R`

allows you to bind together the
different variables into a single object.

Let's have a look at the data from `spelling`

dataset (this assumes
you have already downloaded it, and it is in your working directory).

```
load("spelling.RData")
print(spelling.A)
```

SubjID Method Y 1 1 1 124 2 2 1 129 3 3 1 115 4 4 1 112 5 5 2 101 6 6 2 88 7 7 2 107 8 8 2 92 9 9 3 76 10 10 3 91 11 11 3 84 12 12 3 81

#### 3.5.1 Entering data by hand

It is possible to directly enter data into a data frame object by
typing in each of the vectors (or by creating them using the `rep`

or
`seq`

functions). Here is how I created the two `spelling`

data
frames.

spelling.A <- data.frame(SubjID=1:12, Method=rep(1:3, each=4), Y=c(124,129,115,112,101,88,107,92,76,91,84,81)) spelling.B <- data.frame(SubjID=1:12, Method=rep(1:3, each=4), Y=c(124,88,84,112,101,129,107,81,76,91,115,92)) print(spelling.A) print(spelling.B)

SubjID Method Y 1 1 1 124 2 2 1 129 3 3 1 115 4 4 1 112 5 5 2 101 6 6 2 88 7 7 2 107 8 8 2 92 9 9 3 76 10 10 3 91 11 11 3 84 12 12 3 81 SubjID Method Y 1 1 1 124 2 2 1 88 3 3 1 84 4 4 1 112 5 5 2 101 6 6 2 129 7 7 2 107 8 8 2 81 9 9 3 76 10 10 3 91 11 11 3 115 12 12 3 92

#### 3.5.2 Access

Note that the data frame `spelling.A`

has three columns, each
representing a different variable: the subject identifier ("SubjID"),
the spelling method that the subject learned ("Method"), and the
observed spelling performance ("Y"). To access any of the vectors
individually, we use the `$`

operator.

spelling.A$Y # give the column named "Y" from the dataframe spelling.A

[1] 124 129 115 112 101 88 107 92 76 91 84 81

We can access individual rows or columns using `[]`

syntax. For the
columns, we can also use character vectors to access them by name
rather than by column number. (We can do this for rows too, although
it's more common to do so by column; note that the default row names
are just the row numbers).

spelling.A[1:3,] # first three cases, all columns spelling.A[c(4,5), 3] # cases 4 + 5, third column only spelling.A[9:12, c("Method", "Y")] # cases 9-12, "Method" and "Y" only

SubjID Method Y 1 1 1 124 2 2 1 129 3 3 1 115 [1] 112 101 Method Y 9 3 76 10 3 91 11 3 84 12 3 81

#### 3.5.3 Subsetting

We might want to look at only those subjects who had Method 3, so:

subset(spelling.A, Method==3)

SubjID Method Y 9 9 3 76 10 10 3 91 11 11 3 84 12 12 3 81

The first argument to the `subset`

function is the name of the data
frame, while the second is a logical expression saying which rows to
keep. Note the double equals sign == for expressing equivalence. A
single equals sign won't work as expected!

We can combine conditions with the logical operators "or" `|`

and "and" `&`

. For example, to get subjects who were in Method 3 and
who had scores higher than 80:

subset(spelling.A, Method==3 & Y>80)

SubjID Method Y 10 10 3 91 11 11 3 84 12 12 3 81

Or, we might want to get all subjects where the method is NOT EQUAL to
3 using the `!=`

operator:

subset(spelling.A, Method!=3)

SubjID Method Y 1 1 1 124 2 2 1 129 3 3 1 115 4 4 1 112 5 5 2 101 6 6 2 88 7 7 2 107 8 8 2 92

#### 3.5.4 Merging Data Frames

Let's say you have two data frames, one called `subjects`

that has
subject information, and the other called `trials`

with trial
information.

subjects <- data.frame(SubjID=1:4, Gender=c("M","M","F","F")) trials <- data.frame(SubjID=rep(1:4, each=3), ItemID=1:3, Response=rnorm(12)) print(subjects) print(trials)

SubjID Gender 1 1 M 2 2 M 3 3 F 4 4 F SubjID ItemID Response 1 1 1 -0.1426569 2 1 2 0.6490175 3 1 3 1.7445806 4 2 1 0.1705198 5 2 2 -0.9361486 6 2 3 -0.3000100 7 3 1 0.1311008 8 3 2 -0.6824784 9 3 3 -1.2570611 10 4 1 1.2497215 11 4 2 -0.5346604 12 4 3 -0.7085383

We want to associate the gender information in `subjects`

with the
trial information in `trials`

so, for instance, we can analyze the
data by gender. What we want to do here is a database-style join.
`R`

has a function just for this, called `merge()`

.

```
merge(subjects, trials, by="SubjID")
```

SubjID Gender ItemID Response 1 1 M 1 -0.1426569 2 1 M 2 0.6490175 3 1 M 3 1.7445806 4 2 M 1 0.1705198 5 2 M 2 -0.9361486 6 2 M 3 -0.3000100 7 3 F 1 0.1311008 8 3 F 2 -0.6824784 9 3 F 3 -1.2570611 10 4 F 1 1.2497215 11 4 F 2 -0.5346604 12 4 F 3 -0.7085383

#### 3.5.5 Factorially Combining Values

`expand.grid()`

creates a data frame object that is the *Cartesian product* of the fields; that is, which includes all combinations of
the elements of each field. For example, consider a study in which
each of four subjects are given three items. There will be 4x3=12
observations in all in this study. We can create the data frame
simply like this:

```
trials <- expand.grid(SubjID=1:4, ItemID=1:3)
print(trials)
```

SubjID ItemID 1 1 1 2 2 1 3 3 1 4 4 1 5 1 2 6 2 2 7 3 2 8 4 2 9 1 3 10 2 3 11 3 3 12 4 3

#### 3.5.6 Data Summaries

The summary function is useful. Let's try it on `spelling.A`

.

summary(spelling.A)

SubjID Method Y Min. : 1.00 Min. :1 Min. : 76.0 1st Qu.: 3.75 1st Qu.:1 1st Qu.: 87.0 Median : 6.50 Median :2 Median : 96.5 Mean : 6.50 Mean :2 Mean :100.0 3rd Qu.: 9.25 3rd Qu.:3 3rd Qu.:112.8 Max. :12.00 Max. :3 Max. :129.0

What if we want to summarize by certain values, for instance,
calculate means for each method? There are two ways of doing this in
`R`

, using the functions `aggregate()`

and `tapply()`

. Here is how to
do it with aggregate.

aggregate(x=list(Y=spelling.A$Y), by=list(Method=spelling.A$Method), FUN=mean) with(spelling.A, aggregate(x=list(Y=Y), by=list(Method=Method), FUN=mean) )

Method Y 1 1 120 2 2 97 3 3 83 Method Y 1 1 120 2 2 97 3 3 83

Note the use of the `with()`

function in the second instance to
simplify the syntax. The argument `x`

is a list of values we want to
aggregate (DVs), and we had to specify the name "Y" otherwise we get
some default name from R that is opaque. The `by`

argument tells
`aggregate`

how to break the data into groups; we can have multiple
elements in the list, although in the current case we have only one
(`Method`

). We include the name "Method" for method, again, because
otherwise R just gives default names like "Var1". The argument `FUN`

is the function to be performed on the `x`

argument, which in this
case is `mean`

.

Note: to create contingency tables, see `?table`

.

## 4 Basic Plotting

### 4.1 Scatterplot

Let's start off by making a basic scatterplot, with data series `x`

,
and including a trendline. We will use the function `plot(x, y)`

to
create the plot. The function `abline()`

adds in a line with
intercept `a`

and slope `b`

.

```
x <- 3 + 2 * (1:100) + rnorm(100, sd=20)
plot(1:100, x)
abline(a=3, b=2)
```

Now let's include two series in the plot. Let's also include a legend to distinguish them.

x <- 3 + 2 * (1:100) + rnorm(100, sd=20) y <- 3 + 1.5 * (1:100) + rnorm(100, sd=20) plot(1:100, x) abline(a=3, b=2) points(1:100, y, pch=2) abline(a=3, b=1.5, lty=2) legend("topleft", c("x","y"), pch=1:2, lty=1:2)

Note that the `points`

function adds a series to a plot without
creating a new plot, whereas `plot`

will always create a new plot
(through a call to `plot.new()`

). There are many parameters of the
plot that can be tweaked. See `?plot`

and `?par`

for details.

### 4.2 Boxplot

If we specify a formula to plot, and the variable on the right-hand
side is a `factor`

, then we get boxplots. Factors re-code a variable
in terms of an exhaustive set of levels (and thus only makes sense for
categorical variables).

load(file="spelling.RData") spelling.A$Method spelling.A$Method <- factor(spelling.A$Method) # make into factor spelling.A$Method plot(Y ~ Method, data=spelling.A)

### 4.3 Basic line plot

This is accomplished in the same way as a scatterplot, except you set
the argument `type`

to 'l' (lines) or 'b' (both lines and points)
instead of the default 'p' (points only).

```
plot(1:100, y=3 + 2*(1:100) + rnorm(100, sd=20), type='b')
```

### 4.4 Histogram

To create a histogram, use the `hist()`

function.

x <- rnorm(1000) myhist <- hist(x) print(myhist)

You can alter how `x`

is broken up into categories using the `breaks`

argument.

### 4.5 Saving plots to a file

You will often want to output the graphics to a file for use in reports. It is easy to create PDF, PNG, and JPG graphics in R. What you need to do is to insert your plotting code between two commands, one to open the graphics output file, and the second to close (and save it). Here's how to save a file to PDF.

pdf(file="myplot.pdf") plot(rnorm(10), rnorm(10)) dev.off()

Here's how to create a PNG file.

png(file="myplot.png") plot(rnorm(10), rnorm(10)) dev.off()

### 4.6 Other graphics capabilities

To get a demo of the graphics capabilities try

demo(graphics)

You might also want to check out the fantastic graphics packages
`lattice`

and `ggplot2`

. To get a sense for what you can do with
these packages, here's a graphic I made using `ggplot2`

:

## 5 Statistical Functions

### 5.1 t-test and ANOVA

We can do a one-sample or two sample t-test using the function
`t.test()`

.

#### 5.1.1 One-sample t-test

x <- rnorm(100) x.t <- t.test(x) # test against zero print(x.t) x <- rnorm(100, mean=5) x.t <- t.test(x) print(x.t)

One Sample t-test data: x t = 0.6321, df = 99, p-value = 0.5288 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: -0.1181721 0.2286716 sample estimates: mean of x 0.05524973 One Sample t-test data: x t = 56.9896, df = 99, p-value < 2.2e-16 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: 4.794030 5.139901 sample estimates: mean of x 4.966965

Note that if we want to test against something other than zero, we can
specify the null hypothesis using the `mu`

argument of `t.test`

.

#### 5.1.2 Two-sample t-test

- Independent samples

For this, we use the same

`t.test`

function, but with additional arguments.x <- rnorm(50, mean=12, sd=10) y <- rnorm(50, mean=15, sd=10) t.test(x, y) # default assumes unequal variances (Welch test) t.test(x, y, var.equal=TRUE)

Welch Two Sample t-test data: x and y t = -1.3577, df = 98, p-value = 0.1777 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -6.732557 1.262544 sample estimates: mean of x mean of y 12.96880 15.70381 Two Sample t-test data: x and y t = -1.3577, df = 98, p-value = 0.1777 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -6.732557 1.262544 sample estimates: mean of x mean of y 12.96880 15.70381

The

`var.equal`

argument is used to specify whether the variances of the two groups are assumed to be unequal (default) or equal.

- Paired-samples t-test

x <- data.frame(SubjID=rep(1:8, each=2), Condition=c("A","B"), sre=rep(rnorm(8, sd=10), each=2), err=rnorm(16, sd=4)) x$Y <- 100 + x$sre + x$err grp1 <- subset(x, Condition=="A") grp2 <- subset(x, Condition=="B") t.test(grp1$Y, grp2$Y) # WRONG! observations treated as independent t.test(grp1$Y, grp2$Y, paired=TRUE) # CORRECT

Welch Two Sample t-test data: grp1$Y and grp2$Y t = 0.267, df = 13.982, p-value = 0.7933 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -6.916550 8.883557 sample estimates: mean of x mean of y 97.46473 96.48123 Paired t-test data: grp1$Y and grp2$Y t = 0.8666, df = 7, p-value = 0.4149 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.700153 3.667159 sample estimates: mean of the differences 0.9835033

#### 5.1.3 One Factor ANOVA

The `aov`

function does what we need. We need to declare the variable
`Method`

as a factor to get it to work properly (see `?factor`

for
details).

The syntax for `aov`

is: `aov(formula, data)`

, where `formula`

is the
formula for the GLM, and data is the named dataframe where the
variables specified in the formula are found. In this case, our
formula is `Y ~ Method`

. Note that `Y`

is the name of our DV, and
`Method`

is the name of our IV. Because the equals sign is often used
for assignment in `R`

, `R`

requires us to use the tilde (~) symbol
where the equals sign should have been.

Note that we do not need to specify the error term or the intercept in
the model (they are simply assumed by the function). The `aov`

function returns a "fitted model object"; we get a summary table for
this object using the `summary`

function.

# declare Method as a factor spelling.A$Method <- factor(spelling.A$Method) spelling.A.aov <- aov(Y ~ Method, data=spelling.A) summary(spelling.A.aov)

Df Sum Sq Mean Sq F value Pr(>F) Method 2 2792 1396.00 23.886 0.0002515 *** Residuals 9 526 58.44 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#### 5.1.4 Factorial ANOVA

We can also perform a factorial ANOVA using `aov`

. For this example,
you will need to download the file anova2f.RData and load it into your
`R`

session. This will bring the data frame object `dat`

into your
environment, which contains the data we will be working with.

The first argument
to `aov()`

is `formula`

, which specifies the GLM. The baseline value
\(\mu\) and the error term are implicit, so all you have to is specify
the main effects and interactions. Interactions are specified using
the syntax `A:B`

. So for the current analysis, our formula to `aov()`

is `Y ~ A + B + A:B`

.

An easier way is to use the shortcut syntax `A*B`

which essentially
says, "give me all possible main effects and interactions for factors
A and B". This becomes important when we get to higher order designs,
where spelling all the terms out by hand gets tedious and prone to
error.

The `aov()`

function returns a special kind of object. To get a
summary of the object, we use the `summary()`

function.

load(file="anova2f.RData") # first way, spelling out the terms, which is equivalent. dat.aov <- aov(formula=Y ~ A + B + A:B, data=dat) summary(dat.aov) # second way, using the '*' shortcut syntax dat.aov <- aov(formula=Y ~ A*B, data=dat) summary(dat.aov)

Df Sum Sq Mean Sq F value Pr(>F) A 1 3502.1 3502.1 84.8990 1.559e-05 *** B 1 18.8 18.8 0.4545 0.5192 A:B 1 10.1 10.1 0.2444 0.6343 Residuals 8 330.0 41.2 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Df Sum Sq Mean Sq F value Pr(>F) A 1 3502.1 3502.1 84.8990 1.559e-05 *** B 1 18.8 18.8 0.4545 0.5192 A:B 1 10.1 10.1 0.2444 0.6343 Residuals 8 330.0 41.2 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1