Sea turtle stock analysis: getting started with the turtle package

Ben Bolker

Feb 21, 2002

1  Introduction

The turtle package is a set of routines written in the R language for doing stock analysis of sea turtle populations using data on mitochondrial haplotypes gathered from rookeries and from a mixed population. The package is intended to be self-contained, but some familiarity with R or S-PLUS may be helpful. (Some familiarity with your computer's operating system, which is probably Microsoft Windows, is also assumed.) The statistical methods implemented in the package are described in [1]and [4].

This package is in the public domain (GNU General Public License), is 2002 Ben Bolker and Toshinori Okuyama, and comes with NO WARRANTY. Please suggest improvements to me (Ben Bolker) at bolker@zoology.ufl.edu.

If you are feeling impatient and confident, turn to ``Quick Start'' (section 5).

2  Installation

To get started, you will have to download and install the R package, a general-purpose statistics and graphics package, from http://lib.stat.cmu.edu/R/CRAN/bin/windows/base/ (or see http://www.r-project.org/mirrors.html for a list of alternative ``mirror sites'' closer to you). You will download a file called SetupR.exe which will install R for you, when executed.

The following installation instructions assume you are using a ``modern'' Microsoft Windows system (tested on 2000 and XP); it is possible to use R, and the turtle package, on other operating systems - please contact the authors for more information. The setup file is about 17M, and R takes up about 40M of disk space. If you are running an antivirus package that is configured to check the signatures of executable files before they run, make sure you turn it off or register the new files installed by R before proceeding. The latest version of R as of this writing is 1.4.1, but the package should work with any version after 1.3.1.

Once you have downloaded and installed R, start the R program. The setup program should have asked whether you want to add a shortcut to the desktop or the Start menu: if you didn't, you will have to search for a file called Rgui.exe, which probably lives somewhere like Program Files\R\rw1041\bin depending on what version of R you are using and where you decided to install it. R will open up a window for you with a command prompt (>), at which you can type R commands. (Don't panic.)

You can exit R by selecting File/Exit from the menus, or by typing q() at the command prompt. In general, if you want help on a particular command (e.g. turtleest) you can type a question mark followed by the command name (e.g. ?turtleest).

You will next need to install the turtle library and two other auxiliary libraries, over the WWW, from within R (you will need to maintain a connection to the internet for this piece, although it is also possible to do this step ``off-line''). Within R, at the command prompt, type the following commands:

bbcontrib <- "http://www.zoology.ufl.edu/bolker/R/windows"
install.packages("turtle",contriburl=bbcontrib)
install.packages("bbmisc",contriburl=bbcontrib)
install.packages("boa",contriburl=bbcontrib)
install.packages("coda")
In each case, answer y to whether you want to delete the source files; you won't need them again. The first command specifies the location of the turtle, bbmisc, and boa libraries (the coda library comes from the default source for R packages). The install.packages commands download and install packages.

3  Loading the turtle package and reading in data

Start every session with the turtle library by typing library(turtle) at the command prompt; this loads the turtle and auxiliary libraries.

The package can read plain text data files that are separated by white space (spaces and/or tabs) or commas. If your data are in Microsoft Excel, you should export them as a comma-separated (CSV) file. If they are in Word, save them as plain text. The expected data format is that each row of data represents a haplotype, each column except the last represents samples from a particular rookery, and the last column is the samples from the mixed population. Each row and column should be named; your life will be simpler if the names do not have spaces or punctuation other than periods in them (a common convention in R is to replace spaces with periods, e.g. North.FL for "North FL"). Do not label the haplotype column; R detects the presence of column names by checking whether the first row has one fewer item than the rest of the rows in the file.

For example, a plain text file (with haplotype labels H1 and H2 and rookery labels R1-R3) could look like this:

  R1 R2 R3 mix
H1 1 2 3 4
H2 3 4 5 6
Or a comma-separated file could look like this:
R1,R2,R3,mix
H1,1,2,3,4
H2,3,4,5,6

To read in your data, you first need to make sure that R knows how to find them. The best thing to do is to use the File/Change working directory option under the file menu to move to a directory you will use for analysis, which should contain the data files you want to use and will contain R's working files. Once you have changed to the appropriate directory, you can read in your data files and assign the data to a variable (for example) mydata:

mydata <- read.table("myturtles.dat")
if you are using space-separated data, or
mydata <- read.csv("myturtles.csv")
if you have comma-separated values. To make sure that everything came out OK, type the name of the variable alone at the command prompt: e.g.
mydata
to print out the data.

Next, convert your data to a form that the turtle package can use with the as.turtle.data command:

mydata <- as.turtle.data(mydata)
You can also do the two previous steps as a single command:
mydata <- as.turtle.data(read.csv("mydata.csv"))

Once your data are converted to turtle.data form, you can produce a summary plot of the data:

plot(mydata)
The default plot is a barplot, with the proportions of each haplotype sampled in each rookery represented by a separate bar; the mixed population data are shown as the rightmost bar.

Before proceeding, you will need to ``condense'' your data set by (1) excluding any haplotype samples that are found only in the mixed population (which will break some estimation methods, and provide no useful information on turtle origins) and (2) lumping together all haplotypes that are found only in a single rookery and the mixed population (distinguishing among such haplotypes provides no extra information in our analyses, and slows down estimation). You can do this by typing

mydata <- hapfreq.condense(mydata)
(To examine the condensed form of the data, you can print them by typing mydata at the command prompt, or plot(mydata) to see the graphical summary.)

Some data are already entered in the package:

data(lahan98)
makes the haplotype frequency data from Lahanas et al. 1998 [3]available as variable lahan98.
l98c <- hapfreq.condense(as.turtle.data(lahan98))
converts it into a turtle.data object and condenses the haplotypes to make it ready for analysis.
data(bolten98)
gives you the data from Bolten et al. 1998 [2]available as bolten98; convert and condense as above.

4  Stock analysis

Various methods of stock analysis are available.

4.1  CML and UML

You can do standard conditional maximum likelihood (CML) analysis using cml(mydata). If you want to save the results, you can save them as a variable that you can then print, plot, etc.:
mydata.cml <- cml(mydata)
mydata.cml
plot(mydata.cml)
When you print CML results, R will tell you there is no estimate for the rookery frequencies, because CML assumes that the true rookery frequencies are equal to the sample rookery frequencies, rather than estimating the rookery frequencies independently.

The default plot for estimation results shows the estimated proportions of the mixed population contributed by each rookery as a stacked bar chart. Rookeries contributing less than a specified minimum contribution (default equal to 0.01) are not included in the bar.

Standard unconditional maximum likelihood analysis (UML) takes a little longer, but is equally straightforward:

mydata.uml <- uml(mydata)
mydata.uml
plot(mydata.uml)
UML estimates also include estimates of the true haplotype frequencies in each rookery, which are printed with the contribution estimates and can be included in the graphical summary as follows:
par(ask=TRUE)
plot(mydata.uml,plot.freqs=TRUE)
par(ask=FALSE)
(The par commands tell R to wait for user input, or not, between successive plots.)

4.2  Confidence intervals: CML and UML bootstrapping

mydata.umlboot <- genboot(mydata,"uml")
will generate standard (nonparametric) bootstrap confidence intervals for a UML fit to mydata, by resampling the data with replacement 1000 times (by default). This is fairly slow with a realistic size data set. (You can ignore warnings about singular matrix, returning equal contribs.) You can find out the results by typing
intervals(mydata.umlboot)

4.3  Markov Chain Monte Carlo estimation

mydata.mcmc <- tmcmc(mydata)
runs a standard (as defined in [4,1]) Markov Chain Monte Carlo analysis of your data, with default settings.
mydata.mcmc
intervals(mydata.mcmc)
plot(mydata.mcmc)
do the standard things: print the results, show confidence intervals, plot the results. (By default the information on haplotype frequencies in rookeries is not saved - it tends to be voluminous - and so this does not show up in the MCMC results.)

When you are running MCMC analyses, you have to check that the Markov chains have converged (i.e. that you've run everything long enough for a reliable estimate).

calc.randl.0(mydata)
runs Raftery and Lewis diagnostics on your data set: these criteria attempt to determine how long a single chain has to be in order for it to give ``sufficiently good'' estimates. This function actually runs an iterative procedure, repeating the chain until the R&L criterion is satisfied.

calc.gandr(mydata)
tests the Gelman-Rubin criterion, which starts multiple chains from widely spaced starting points and tests to ensure that the chains ``overlap'' - i.e., that between-chain variance is small relative to within-chain variance.

5  Quick start

6  To do

References

[1]
Benjamin Bolker, Toshinori Okuyama, Karen Bjorndal, and Alan Bolten. Stock estimation for sea turtle populations using genetic markers: accounting for sampling error of rare genotypes. Submitted to Ecological Applications: preprint available at http://www.zoology.ufl.edu/turtle/turtle001.pdf, 2001.

[2]
Alan B. Bolten, Karen A. Bjorndal, Helen R. Martins, Thomas Dellinger, Manuel J. Biscotio, Sandra E. Encalada, and Brian W. Bowen. Transatlantic developmental migrations of loggerhead sea turtles demonstrated by mtDNA sequence analysis. Ecological Applications, 8(1):1-7, 1998.

[3]
P. N. Lahanas, K. A. Bjorndal, A. B. Bolten, S. E. Encalada, M. M. Miyamoto, R. A. Valverde, and B. W. Bowen. Genetic composition of a green turtle (Chelonia mydas) feeding ground population: evidence for multiple origins. Marine Biology, 130:345-352, 1998.

[4]
J. Pella and M. Masuda. Bayesian methods for analysis of stock mixtures from genetic characters. Fisheries Bulletin, 99:151-167, 2001.


File translated from TEX by TTH, version 2.71.
On 21 Feb 2002, 10:23.