<?xml version="1.0"?>
<!DOCTYPE html    PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN"
           "http://www.w3.org/Math/DTD/mathml2/xhtml-math11-f.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="GENERATOR" content="TtM 3.70" />
 <style type="text/css">
 div.p { margin-top: 7pt; }
 span.roman {font-family: serif; font-style: normal; font-weight: normal;} 
</style>
 


<title> Exploratory data analysis and graphics: lab 2 </title>
</head>
<body>
 
<h1 align="center">Exploratory data analysis and graphics: lab 2 </h1>

<h3 align="center">&#169; 2005 Ben Bolker </h3>

<div class="p"><!----></div>
 This lab will cover many if not all of the
details you actually need to know about  R&nbsp;to
read in data and produce the figures shown in
Chapter 2, and more.
The exercises, which will be considerably more
difficult than those in Lab&nbsp;1, will typically
involve variations on the figures shown in the
text.  You will work through reading in
the different data sets and constructing the
figures shown, or variants of them.  It would
be even better to work through reading in
and making exploratory plots of your own data.

<div class="p"><!----></div>
 <h2><a name="tth_sEc1">
1</a>&nbsp;&nbsp;Reading data</h2>

<div class="p"><!----></div>
Find the file called <tt>seedpred.dat</tt>:
it's in the right format (plain text, long format),
so you can just read it in with

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;data&nbsp;=&nbsp;read.table("seedpred.dat",&nbsp;header&nbsp;=&nbsp;TRUE)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
(remember not to copy the <tt>&gt;</tt> if you
are cutting and pasting from this document).

<div class="p"><!----></div>
Add the variable <tt>available</tt> 
to the data frame by
combining <tt>taken</tt> and <tt>remaining</tt>
(using the <tt>$</tt> symbol):

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;data$available&nbsp;=&nbsp;data$taken&nbsp;+&nbsp;data$remaining
&nbsp;
</pre> </font>

<div class="p"><!----></div>

<b>Pitfall #1: finding your file&nbsp;&nbsp;</b>

<div class="p"><!----></div>
If  R&nbsp;responds to your <tt>read.table()</tt> or
<tt>read.csv()</tt> command
with an error like

<pre>
Error&nbsp;in&nbsp;file(file,&nbsp;"r")&nbsp;:&nbsp;unable&nbsp;to&nbsp;open&nbsp;connection
In&nbsp;addition:&nbsp;Warning&nbsp;message:&nbsp;cannot&nbsp;open&nbsp;file&nbsp;'myfile.csv'

</pre>
it means it can't find your file, probably because it isn't looking in the right place.
By default,  R's <em>working directory</em> is the directory in which
the  R&nbsp;program starts up, which is (again by default) something like
<tt>C:/Program Files/R/rw2010/bin</tt>. ( R&nbsp;uses <tt>/</tt>
as the [operating-system-independent] 
separator between directories in a file path.)
The simplest way to change this for the duration of your  R&nbsp;session
is to go to <tt>File/Change dir ...</tt>, click on the <tt>Browse</tt>
button, and move to your Desktop (or wherever your file is located).
You can also use the <tt>setwd()</tt> command to <b>set</b> the 
<b>w</b>orking <b>d</b>irectory (<tt>getwd()</tt> tells you what
the current working directory is).
While you could just throw everything on your desktop,
it's good to get in the habit of setting up a 
separate working directory for different projects, so that
your data files, metadata files,  R&nbsp;script files, and so forth, are all in 
the same place.
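
<div class="p"><!----></div>
As a minimal sketch of the <tt>getwd()</tt>/<tt>setwd()</tt> workflow: the example below uses R's temporary directory, <tt>tempdir()</tt>, purely as a stand-in for a project directory of your own, so substitute your own path.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; old_dir = getwd()      # remember where we started
&#62; setwd(tempdir())       # stand-in for your own project directory
&#62; getwd()                # confirm the change
&#62; list.files()           # see which files R can find here
&#62; setwd(old_dir)         # go back to the original directory
</pre> </font>

<div class="p"><!----></div>
<tt>setwd()</tt> invisibly returns the previous working directory, so
<tt>old_dir = setwd(tempdir())</tt> would do the first two steps at once.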

<div class="p"><!----></div>
Depending on how you have gotten your data files onto your system
(e.g. by downloading them from the web), Windows
will sometimes hide or otherwise screw up the extension of your
file (e.g. adding <tt>.txt</tt> to a file called <tt>mydata.dat</tt>).
 R&nbsp;needs to know the full name of the file, including the extension.
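
<div class="p"><!----></div>
One way to check the name that  R&nbsp;actually sees is sketched below; <tt>mydata</tt> is just the example name used above, and the results depend on what is in your working directory.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; list.files(pattern = "mydata")   # shows full names, e.g. a hidden .txt extension
&#62; file.exists("mydata.dat")        # TRUE only if the name matches exactly
</pre> </font>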

<div class="p"><!----></div>

<b>Pitfall #2: checking number of fields&nbsp;&nbsp;</b>
The next potential problem is that  R&nbsp;needs every line of your data file to have the
same number of fields (variables).  You may get an error like:

<pre>
Error&nbsp;in&nbsp;read.table(file&nbsp;=&nbsp;file,&nbsp;header&nbsp;=&nbsp;header,&nbsp;sep&nbsp;=&nbsp;sep,&nbsp;quote&nbsp;=&nbsp;quote,&nbsp;&nbsp;:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;more&nbsp;columns&nbsp;than&nbsp;column&nbsp;names

</pre>
or

<pre>
Error&nbsp;in&nbsp;scan(file&nbsp;=&nbsp;file,&nbsp;what&nbsp;=&nbsp;what,&nbsp;sep&nbsp;=&nbsp;sep,&nbsp;quote&nbsp;=&nbsp;quote,&nbsp;dec&nbsp;=&nbsp;dec,&nbsp;&nbsp;:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;line&nbsp;1&nbsp;did&nbsp;not&nbsp;have&nbsp;5&nbsp;elements

</pre>

<div class="p"><!----></div>
If you need to check on the number of fields that
 R&nbsp;thinks you have on each line, use

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;count.fields("myfile.dat",&nbsp;sep&nbsp;=&nbsp;",")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
(you can omit the <tt>sep=","</tt> argument if
you have whitespace- rather than comma-delimited
data).
If you are checking a long data file you can try

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;cf&nbsp;=&nbsp;count.fields("myfile.dat",&nbsp;sep&nbsp;=&nbsp;",")
&#62;&nbsp;which(cf&nbsp;!=&nbsp;cf[1])
&nbsp;
</pre> </font>

<div class="p"><!----></div>
to get the line numbers with numbers of fields
different from the first line.

<div class="p"><!----></div>
By default <tt>read.csv()</tt> (although not <tt>read.table()</tt>)
will try to fill in what it sees
as missing fields with <tt>NA</tt> ("not available")
values; this can be useful but can also hide
errors.  You can try

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;mydata&nbsp;&lt;-&nbsp;read.csv("myfile.dat",&nbsp;fill&nbsp;=&nbsp;FALSE)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
to turn off this behavior; if you don't
have any missing fields at the end of lines
in your data this should work.
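
<div class="p"><!----></div>
The following sketch writes a small comma-delimited file, with a deliberately short last line, to a temporary location and walks through the checks described above. The file contents are invented purely for illustration.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; tmp = tempfile(fileext = ".csv")
&#62; writeLines(c("a,b,c", "1,2,3", "4,5"), tmp)   # last line is one field short
&#62; cf = count.fields(tmp, sep = ",")
&#62; cf                        # number of fields on each line
&#62; which(cf != cf[1])        # lines that differ from the first
&#62; filled = read.csv(tmp)    # the short line is padded with NA
&#62; filled
&#62; strict = try(read.csv(tmp, fill = FALSE), silent = TRUE)
&#62; class(strict)             # "try-error" if the short line was rejected
</pre> </font>

<div class="p"><!----></div>
Wrapping the second <tt>read.csv()</tt> call in <tt>try()</tt> just keeps the
error from stopping the rest of the commands.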

<div class="p"><!----></div>
     <h3><a name="tth_sEc1.1">
1.1</a>&nbsp;&nbsp;Checking data</h3>
Here's the quickest way to check that all your variables have
been classified correctly:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;sapply(data,&nbsp;class)
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
&nbsp;&nbsp;Species&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tcum&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tint&nbsp;remaining&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;taken&nbsp;available&nbsp;
&nbsp;"factor"&nbsp;"integer"&nbsp;"integer"&nbsp;"integer"&nbsp;"integer"&nbsp;"integer"&nbsp;

&nbsp;
</pre> </font>

<div class="p"><!----></div>
(this applies the <tt>class()</tt> command, which identifies
the type of a variable, to each column in your data).

<div class="p"><!----></div>
Non-numeric missing-value strings
(such as a star, <tt>*</tt>) will make R
misclassify an otherwise numeric column as a factor.
To avoid this, specify them with <tt>na.strings</tt>
in your <tt>read.table()</tt> command:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;mydata&nbsp;&lt;-&nbsp;read.table("mydata.dat",&nbsp;na.strings&nbsp;=&nbsp;"*")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
(you can specify more than one value with (e.g.)
<tt>na.strings=c("*","***","bad","-9999")</tt>).
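
<div class="p"><!----></div>
A small sketch of the effect, using made-up data supplied inline through the <tt>text</tt> argument of <tt>read.table()</tt> (available in recent versions of  R) rather than a file. Note that from  R&nbsp;4.0.0 on, a misread column comes out as <tt>character</tt> rather than <tt>factor</tt>, because strings are no longer converted to factors by default.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; txt = "x y\n1 4\n2 *\n3 6"
&#62; bad = read.table(text = txt, header = TRUE)
&#62; sapply(bad, class)      # y is not numeric because of the "*"
&#62; good = read.table(text = txt, header = TRUE, na.strings = "*")
&#62; sapply(good, class)     # y is now integer
&#62; good$y                  # with an NA in the second row
</pre> </font>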

<div class="p"><!----></div>
<b>Exercise 1</b>: Try out 
<tt>head()</tt>, <tt>summary()</tt> and <tt>str()</tt>
on <tt>data</tt>; make sure you understand the results.

<div class="p"><!----></div>
     <h3><a name="tth_sEc1.2">
1.2</a>&nbsp;&nbsp;Reshaping data</h3>
It's hard to give an example of reshaping the
seed predation data set because we have
different numbers of observations for each
species - thus, the data won't fit nicely
into a rectangular format with (say) all
observations from each species on the same
line.  
However, as in the chapter text I can
just make up a data frame and reshape it.

<div class="p"><!----></div>
Here are the commands to generate the
data frame I used as
an example in the text (I use <tt>LETTERS</tt>,
a built-in vector of the capitalized letters
of the alphabet, and <tt>runif()</tt>, which
picks a specified number of random numbers
from a uniform distribution between 0 and 1.
The command <tt>round(x,3)</tt>
rounds <tt>x</tt> to 3 digits after the decimal point):

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;loc&nbsp;=&nbsp;factor(rep(LETTERS[1:3],&nbsp;2))
&#62;&nbsp;day&nbsp;=&nbsp;factor(rep(1:2,&nbsp;each&nbsp;=&nbsp;3))
&#62;&nbsp;val&nbsp;=&nbsp;round(runif(6),&nbsp;3)
&#62;&nbsp;d&nbsp;=&nbsp;data.frame(loc,&nbsp;day,&nbsp;val)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
This data set is in long format.
To go to wide format:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;d2&nbsp;=&nbsp;reshape(d,&nbsp;direction&nbsp;=&nbsp;"wide",&nbsp;idvar&nbsp;=&nbsp;"loc",&nbsp;timevar&nbsp;=&nbsp;"day")
&#62;&nbsp;d2
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
&nbsp;&nbsp;loc&nbsp;val.1&nbsp;val.2
1&nbsp;&nbsp;&nbsp;A&nbsp;0.362&nbsp;0.598
2&nbsp;&nbsp;&nbsp;B&nbsp;0.522&nbsp;0.692
3&nbsp;&nbsp;&nbsp;C&nbsp;0.722&nbsp;0.697

&nbsp;
</pre> </font>

<div class="p"><!----></div>
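<tt>idvar="loc"</tt> specifies 
that <tt>loc</tt> is the
identifier that should be used to assign multiple
values to the same row,
and <tt>timevar="day"</tt> specifies the variable
whose values distinguish the measurements that are lumped
together on the same row (here they become the suffixes
of the column names <tt>val.1</tt> and <tt>val.2</tt>).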
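
<div class="p"><!----></div>
To see the roles of <tt>idvar</tt> and <tt>timevar</tt>, here is a sketch that swaps them. It builds a data frame like <tt>d</tt> but with fixed, made-up values (instead of <tt>runif()</tt>) so that the result is reproducible; the names <tt>d_ex</tt> and <tt>d3</tt> are arbitrary.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; d_ex = data.frame(loc = factor(rep(LETTERS[1:3], 2)),
+     day = factor(rep(1:2, each = 3)),
+     val = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))
&#62; d3 = reshape(d_ex, direction = "wide", idvar = "day", timevar = "loc")
&#62; d3     # one row per day; columns day, val.A, val.B, val.C
</pre> </font>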

<div class="p"><!----></div>
To go back to long format:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;reshape(d2,&nbsp;direction&nbsp;=&nbsp;"long",&nbsp;varying&nbsp;=&nbsp;c("val.1",&nbsp;"val.2"),&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;timevar&nbsp;=&nbsp;"day",&nbsp;idvar&nbsp;=&nbsp;"loc")
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
&nbsp;&nbsp;&nbsp;&nbsp;loc&nbsp;day&nbsp;&nbsp;&nbsp;val
A.1&nbsp;&nbsp;&nbsp;A&nbsp;&nbsp;&nbsp;1&nbsp;0.362
B.1&nbsp;&nbsp;&nbsp;B&nbsp;&nbsp;&nbsp;1&nbsp;0.522
C.1&nbsp;&nbsp;&nbsp;C&nbsp;&nbsp;&nbsp;1&nbsp;0.722
A.2&nbsp;&nbsp;&nbsp;A&nbsp;&nbsp;&nbsp;2&nbsp;0.598
B.2&nbsp;&nbsp;&nbsp;B&nbsp;&nbsp;&nbsp;2&nbsp;0.692
C.2&nbsp;&nbsp;&nbsp;C&nbsp;&nbsp;&nbsp;2&nbsp;0.697

&nbsp;
</pre> </font>

<div class="p"><!----></div>
<tt>varying</tt> specifies which variables are changing
and need to be reshaped, and <tt>timevar</tt>
specifies the name of the variable to be (re)created
to distinguish different samples in the same location.

<div class="p"><!----></div>
<b>Exercise 2</b>: <tt>unstack()</tt> works with
a formula.  Try <tt>unstack(d,val~day)</tt> and
<tt>unstack(d,val~loc)</tt> and figure out what's going on.

<div class="p"><!----></div>
     <h3><a name="tth_sEc1.3">
1.3</a>&nbsp;&nbsp;Advanced data types</h3>

<div class="p"><!----></div>
While you can usually get by coding data in
not quite the right way - for example, coding dates
as numeric values or categorical variables as
strings -  R&nbsp;tries to "do the right
thing" with your data, and it is more likely to
do the right thing the more it knows about how your
data are structured.

<div class="p"><!----></div>

<b>Strings instead of factors&nbsp;&nbsp;</b>
Sometimes  R's default of assigning factors is not what you want: if
your strings are unique identifiers (e.g. if you have a code for
observations that combines the date and location of sampling, and each
location is only sampled once on a given date) then  R's
strategy of coding unique levels as integers and then associating a
label with integers will waste space and add confusion.
If all of your non-numeric variables should be treated
as character strings rather than factors, you can just
specify <tt>as.is=TRUE</tt>; if you want specific columns
to be left "as is" you can specify them by number or column
name. For example, these two commands have the same result:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;data2&nbsp;=&nbsp;read.table("seedpred.dat",&nbsp;header&nbsp;=&nbsp;TRUE,&nbsp;as.is&nbsp;=&nbsp;"Species")
&#62;&nbsp;data2&nbsp;=&nbsp;read.table("seedpred.dat",&nbsp;header&nbsp;=&nbsp;TRUE,&nbsp;as.is&nbsp;=&nbsp;1)
&#62;&nbsp;sapply(data2,&nbsp;class)
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
&nbsp;&nbsp;&nbsp;&nbsp;Species&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tcum&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tint&nbsp;&nbsp;&nbsp;remaining&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;taken&nbsp;
"character"&nbsp;&nbsp;&nbsp;"integer"&nbsp;&nbsp;&nbsp;"integer"&nbsp;&nbsp;&nbsp;"integer"&nbsp;&nbsp;&nbsp;"integer"&nbsp;

&nbsp;
</pre> </font>

<div class="p"><!----></div>
(use <tt>c()</tt> - e.g. <tt>c("name1","name2")</tt> or <tt>c(1,3)</tt> -
to specify more than one column).
You can also use the <tt>colClasses</tt> argument to
<tt>read.table()</tt> to specify the type of each column explicitly,
for example to read a particular column as type <tt>character</tt>;
the command

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;data2&nbsp;=&nbsp;read.table("seedpred.dat",&nbsp;header&nbsp;=&nbsp;TRUE,&nbsp;colClasses&nbsp;=&nbsp;c("character",&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;rep("numeric",&nbsp;4)))
&nbsp;
</pre> </font>

<div class="p"><!----></div>
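again reads <tt>Species</tt> as a character column, as in the commands above,
although the other columns are now stored as <tt>numeric</tt> rather
than <tt>integer</tt>.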

<div class="p"><!----></div>
To convert factors back to strings <em>after</em> you have read them
into  R, use <tt>as.character()</tt>.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;data2&nbsp;=&nbsp;read.table("seedpred.dat",&nbsp;header&nbsp;=&nbsp;TRUE)
&#62;&nbsp;sapply(data2,&nbsp;class)
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
&nbsp;&nbsp;Species&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tcum&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tint&nbsp;remaining&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;taken&nbsp;
&nbsp;"factor"&nbsp;"integer"&nbsp;"integer"&nbsp;"integer"&nbsp;"integer"&nbsp;

&nbsp;
</pre> </font>
  <font color="#FF0000">
<pre>
&#62;&nbsp;data2$Species&nbsp;=&nbsp;as.character(data2$Species)
&#62;&nbsp;sapply(data2,&nbsp;class)
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
&nbsp;&nbsp;&nbsp;&nbsp;Species&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tcum&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tint&nbsp;&nbsp;&nbsp;remaining&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;taken&nbsp;
"character"&nbsp;&nbsp;&nbsp;"integer"&nbsp;&nbsp;&nbsp;"integer"&nbsp;&nbsp;&nbsp;"integer"&nbsp;&nbsp;&nbsp;"integer"&nbsp;

&nbsp;
</pre> </font>

<div class="p"><!----></div>

<b>Factors instead of numeric values&nbsp;&nbsp;</b>
In contrast, sometimes you have numeric labels for data
that are really categorical values - for example if your
sites or species have integer codes (often data sets
will have redundant information in them, e.g. both
a species name and a species code number).
It's best to specify appropriate data types, so use
<tt>colClasses</tt> to force  R&nbsp;to treat the data as
a factor.  For example, if we wanted to make <tt>tcum</tt>
a factor instead of a numeric variable:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;data2&nbsp;=&nbsp;read.table("seedpred.dat",&nbsp;header&nbsp;=&nbsp;TRUE,&nbsp;colClasses&nbsp;=&nbsp;c(rep("factor",&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2),&nbsp;rep("numeric",&nbsp;3)))
&#62;&nbsp;sapply(data2,&nbsp;class)
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
&nbsp;&nbsp;Species&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tcum&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tint&nbsp;remaining&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;taken&nbsp;
&nbsp;"factor"&nbsp;&nbsp;"factor"&nbsp;"numeric"&nbsp;"numeric"&nbsp;"numeric"&nbsp;

&nbsp;
</pre> </font>

<div class="p"><!----></div>
<b>n.b.</b>: by default,  R&nbsp;sets the order of the 
factor levels alphabetically.  
You can find out the levels and their order
in a factor <tt>f</tt> with <tt>levels(f)</tt>.
If you want
your levels ordered in some other way (e.g. site
names in order along some transect), you need to
specify this explicitly.  Most confusingly,
 R&nbsp;will sort strings in alphabetic order too,
even if they represent numbers.
This is OK:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;f&nbsp;=&nbsp;factor(1:10)
&#62;&nbsp;levels(f)
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
&nbsp;[1]&nbsp;"1"&nbsp;&nbsp;"2"&nbsp;&nbsp;"3"&nbsp;&nbsp;"4"&nbsp;&nbsp;"5"&nbsp;&nbsp;"6"&nbsp;&nbsp;"7"&nbsp;&nbsp;"8"&nbsp;&nbsp;"9"&nbsp;&nbsp;"10"

&nbsp;
</pre> </font>

<div class="p"><!----></div>
but this is not, since we explicitly tell  R&nbsp;to treat the numbers as characters (this can
happen by accident in some contexts):

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;f&nbsp;=&nbsp;factor(as.character(1:10))
&#62;&nbsp;levels(f)
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
&nbsp;[1]&nbsp;"1"&nbsp;&nbsp;"10"&nbsp;"2"&nbsp;&nbsp;"3"&nbsp;&nbsp;"4"&nbsp;&nbsp;"5"&nbsp;&nbsp;"6"&nbsp;&nbsp;"7"&nbsp;&nbsp;"8"&nbsp;&nbsp;"9"&nbsp;

&nbsp;
</pre> </font>

<div class="p"><!----></div>
Sorted alphabetically as character strings, "10"
comes after "1" but before "2" ...

<div class="p"><!----></div>
You can fix the levels by using the 
<tt>levels</tt> argument in <tt>factor()</tt> 
to tell  R&nbsp;explicitly
what you want it to do, e.g.:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;f&nbsp;=&nbsp;factor(as.character(1:10),&nbsp;levels&nbsp;=&nbsp;1:10)
&#62;&nbsp;x&nbsp;=&nbsp;c("north",&nbsp;"middle",&nbsp;"south")
&#62;&nbsp;f&nbsp;=&nbsp;factor(x,&nbsp;levels&nbsp;=&nbsp;c("far_north",&nbsp;"north",&nbsp;"middle",&nbsp;"south"))
&nbsp;
</pre> </font>

<div class="p"><!----></div>
so that the levels come out ordered geographically
rather than alphabetically.
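
<div class="p"><!----></div>
When the strings stand for numbers, one possible approach (a sketch, not the only way) is to build the levels from the sorted numeric values. Here <tt>x_chr</tt> is a made-up example vector.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; x_chr = c("10", "2", "1", "2")
&#62; levels(factor(x_chr))                        # alphabetical: "1" "10" "2"
&#62; num_levels = sort(unique(as.numeric(x_chr)))
&#62; f_num = factor(x_chr, levels = num_levels)
&#62; levels(f_num)                                # numeric order: "1" "2" "10"
</pre> </font>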

<div class="p"><!----></div>
Sometimes your data contain a subset of integer
values in a range, but you want to make sure the
levels of the factor you construct include all
of the values in the range, not just the ones
in your data. Use <tt>levels</tt> again:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;f&nbsp;=&nbsp;factor(c(3,&nbsp;3,&nbsp;5,&nbsp;6,&nbsp;7,&nbsp;8,&nbsp;10),&nbsp;levels&nbsp;=&nbsp;3:10)
&nbsp;
</pre> </font>
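
<div class="p"><!----></div>
A quick way to confirm that the unused values really are included is to tabulate the factor: levels with no observations show up with a count of zero. (The names <tt>f_gap</tt> and <tt>f_all</tt> are arbitrary.)

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; f_gap = factor(c(3, 3, 5, 6, 7, 8, 10))
&#62; f_all = factor(c(3, 3, 5, 6, 7, 8, 10), levels = 3:10)
&#62; table(f_gap)     # only the six values present in the data
&#62; table(f_all)     # 4 and 9 appear too, with a count of 0
</pre> </font>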

<div class="p"><!----></div>
Finally, you may want to get rid of levels that
were included in a previous factor but are no
longer relevant:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;f&nbsp;=&nbsp;factor(c("a",&nbsp;"b",&nbsp;"c",&nbsp;"d"))
&#62;&nbsp;f2&nbsp;=&nbsp;f[1:2]
&#62;&nbsp;levels(f2)
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
[1]&nbsp;"a"&nbsp;"b"&nbsp;"c"&nbsp;"d"

&nbsp;
</pre> </font>
  <font color="#FF0000">
<pre>
&#62;&nbsp;f2&nbsp;=&nbsp;factor(as.character(f2))
&#62;&nbsp;levels(f2)
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
[1]&nbsp;"a"&nbsp;"b"

&nbsp;
</pre> </font>
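
<div class="p"><!----></div>
Two shorter alternatives should give the same result: the <tt>drop=TRUE</tt> argument when subsetting a factor and, in more recent versions of  R, the <tt>droplevels()</tt> function.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; f = factor(c("a", "b", "c", "d"))
&#62; levels(f[1:2, drop = TRUE])     # "a" "b"
&#62; levels(droplevels(f[1:2]))      # "a" "b"
</pre> </font>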

<div class="p"><!----></div>
For more complicated operations with
<tt>factor()</tt>, use the <tt>recode()</tt>
function in the <tt>car</tt> package.

<div class="p"><!----></div>
<b>Exercise 3</b>: 
Illustrate the effects of the <tt>levels</tt>
argument by plotting the factor <tt>f=factor(c(3,3,5,6,7,8,10))</tt>
as created with and without intermediate levels.
For an extra challenge, draw them as two side-by-side
subplots.  (Use <tt>par(mfrow=c(1,1))</tt> to restore
a full plot window.)

<div class="p"><!----></div>

<b>Dates&nbsp;&nbsp;</b>
Dates and times can be tricky in  R, but you can (and should)
handle your dates as type <tt>Date</tt>
within  R&nbsp;rather than messing around
with Julian days (i.e., days since the
beginning of the year) or maintaining
separate variables for day/month/year.

<div class="p"><!----></div>
You can use <tt>colClasses="Date"</tt>
within <tt>read.table()</tt> to read in
dates directly from a file, but only if
your dates are in four-digit-year/month/day 
(e.g. 2005/08/16 or 2005-08-16) format;
otherwise  R&nbsp;will either butcher your
dates or complain

<pre>
Error&nbsp;in&nbsp;fromchar(x)&nbsp;:&nbsp;character&nbsp;string&nbsp;is&nbsp;not&nbsp;in&nbsp;a&nbsp;standard&nbsp;unambiguous&nbsp;format

</pre>

<div class="p"><!----></div>
If your dates are in another format in 
a single column, read them in as character
strings (<tt>colClasses="character"</tt> or
using <tt>as.is</tt>) and then use <tt>as.Date()</tt>,
which uses a very flexible <tt>format</tt> argument
to convert character formats to dates:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;as.Date(c("1jan1960",&nbsp;"2jan1960",&nbsp;"31mar1960",&nbsp;"30jul1960"),&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;format&nbsp;=&nbsp;"%d%b%Y")
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
[1]&nbsp;"1960-01-01"&nbsp;"1960-01-02"&nbsp;"1960-03-31"&nbsp;"1960-07-30"

&nbsp;
</pre> </font>
  <font color="#FF0000">
<pre>
&#62;&nbsp;as.Date(c("02/27/92",&nbsp;"02/27/92",&nbsp;"01/14/92",&nbsp;"02/28/92",&nbsp;"02/01/92"),&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;format&nbsp;=&nbsp;"%m/%d/%y")
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
[1]&nbsp;"1992-02-27"&nbsp;"1992-02-27"&nbsp;"1992-01-14"&nbsp;"1992-02-28"&nbsp;"1992-02-01"

&nbsp;
</pre> </font>

<div class="p"><!----></div>
The most useful format codes are <tt>%m</tt> for month number,
<tt>%b</tt> for abbreviated month name,
<tt>%d</tt> for day of month, <tt>%j</tt> for Julian date (day of year),
<tt>%y</tt> for two-digit year (dangerous for dates before 1969,
which  R&nbsp;will assign to the 2000s!)
and <tt>%Y</tt> for four-digit year; see <tt>?strftime</tt> for
many more details.
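
<div class="p"><!----></div>
A short sketch of these codes in use, with made-up dates. The last command shows why two-digit years are risky: on input  R&nbsp;reads the years 00 to 68 as 2000 to 2068.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; d1 = as.Date("2005-08-16")
&#62; format(d1, "%j")                            # day of year: "228"
&#62; format(d1, "%d/%m/%y")                      # "16/08/05"
&#62; as.Date("01/01/60", format = "%m/%d/%y")    # "2060-01-01", not 1960
</pre> </font>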

<div class="p"><!----></div>
If you have your dates as separate (numeric) day, month, and year
columns, you actually have to squash them together into a 
character format (with <tt>paste()</tt>, using
<tt>sep="/"</tt> to specify that the values should be separated
by a slash) and then convert them to dates:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;year&nbsp;=&nbsp;c(2004,&nbsp;2004,&nbsp;2004,&nbsp;2005)
&#62;&nbsp;month&nbsp;=&nbsp;c(10,&nbsp;11,&nbsp;12,&nbsp;1)
&#62;&nbsp;day&nbsp;=&nbsp;c(20,&nbsp;18,&nbsp;28,&nbsp;17)
&#62;&nbsp;datestr&nbsp;=&nbsp;paste(year,&nbsp;month,&nbsp;day,&nbsp;sep&nbsp;=&nbsp;"/")
&#62;&nbsp;date&nbsp;=&nbsp;as.Date(datestr)
&#62;&nbsp;date
&nbsp;
</pre> </font>
  <font color="#0000FF">
<pre>
[1]&nbsp;"2004-10-20"&nbsp;"2004-11-18"&nbsp;"2004-12-28"&nbsp;"2005-01-17"

&nbsp;
</pre> </font>

<div class="p"><!----></div>
Although  R&nbsp;prints the
dates out so they look like a vector of character strings,
they are really dates: <tt>class(date)</tt> will
give you the answer <tt>"Date"</tt>.
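
<div class="p"><!----></div>
Because they are true dates, you can do arithmetic on them. A sketch using the same four dates as above, rebuilt here under the arbitrary name <tt>date_ex</tt> so the example stands alone:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; date_ex = as.Date(c("2004-10-20", "2004-11-18", "2004-12-28", "2005-01-17"))
&#62; class(date_ex)             # "Date"
&#62; diff(date_ex)              # time differences of 29, 40 and 20 days
&#62; date_ex[4] - date_ex[1]    # 89 days between first and last date
&#62; format(date_ex, "%j")      # day of year, as character strings
</pre> </font>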

<div class="p"><!----></div>
Other traps:

<ul>
<li>quotation marks in character variables: if you have
character strings in your data set with apostrophes or quotation
marks embedded in them, you have to get  R&nbsp;to ignore them.
I used a data set recently that contained lines like this:

<pre>
Western&nbsp;Canyon|valley|Santa&nbsp;Cruz|313120N|1103145WO'Donnell&nbsp;Canyon

</pre>
I used

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;data&nbsp;=&nbsp;read.table("datafile",&nbsp;sep&nbsp;=&nbsp;"|",&nbsp;quote&nbsp;=&nbsp;"")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
to tell  R&nbsp;that <tt>|</tt> was the separator between fields and
that it should ignore all apostrophes/single quotations/double
quotations in the data set and just read them as part of
a string.

<div class="p"><!----></div>
</li>
</ul>
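
<div class="p"><!----></div>
A self-contained sketch of the quotation-mark trap, using a single made-up line of <tt>|</tt>-separated text passed through the <tt>text</tt> argument of <tt>read.table()</tt> instead of a file (<tt>txt</tt> and <tt>x_q</tt> are arbitrary names):

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; txt = "Western Canyon|valley|O'Donnell Canyon"
&#62; x_q = read.table(text = txt, sep = "|", quote = "")
&#62; dim(x_q)      # 1 row, 3 columns
&#62; x_q[1, 3]     # the apostrophe is kept as part of the string
</pre> </font>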

<div class="p"><!----></div>
     <h3><a name="tth_sEc1.4">
1.4</a>&nbsp;&nbsp;Accessing data and extra packages</h3>

<div class="p"><!----></div>

<b>Data&nbsp;&nbsp;</b>
To access individual variables within your data set use
<tt>mydata$varname</tt> or <tt>mydata[,n]</tt> or
<tt>mydata[,"varname"]</tt> where <tt>n</tt> is the column number and 
<tt>varname</tt> is the variable name you want.  You can also use <tt>
attach(mydata)</tt> to set things up so that you can refer to the
variable names alone (e.g. <tt>varname</tt> rather than
<tt>mydata$varname</tt>).  However, <b>beware</b>: if you then modify
a variable, you can end up with two copies of it: one (modified) is a
local variable called <tt>varname</tt>, the other (original) is a column
in the data frame called <tt>varname</tt>: it's probably better not to
<tt>attach</tt> a data set until after you've finished cleaning and
modifying it.  Furthermore, if you have already created a variable
called <tt>varname</tt>,  R&nbsp;will find it before it finds the version of
<tt>varname</tt> that is part of your data set.  Attaching multiple
copies of a data set is a good way to get confused: try to remember
to <tt>detach(mydata)</tt> when you're done.
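
<div class="p"><!----></div>
The following sketch, with a made-up data frame <tt>df_ex</tt>, shows how the two copies arise:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; df_ex = data.frame(v = 1:3)
&#62; attach(df_ex)
&#62; v = v * 10       # creates a new variable v in the workspace ...
&#62; v                # 10 20 30
&#62; df_ex$v          # ... but the column in the data frame is still 1 2 3
&#62; detach(df_ex)
&#62; rm(v)            # remove the workspace copy
</pre> </font>

<div class="p"><!----></div>
If you only need the variables for a single command,
<tt>with(df_ex, mean(v))</tt> avoids attaching at all.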

<div class="p"><!----></div>
I'll start by <tt>attach</tt>ing the data set (so
we can refer to <tt>Species</tt> instead of 
<tt>data$Species</tt> and so on).

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;attach(data)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
To access data that are built in to  R&nbsp;or included
in an  R&nbsp;package (which you probably won't need to
do often), say

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;data(dataset)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
(<tt>data()</tt> by itself will list all available data sets.)

<div class="p"><!----></div>

<b>Packages&nbsp;&nbsp;</b>
The <tt>sizeplot()</tt> function
I used for Figure&nbsp;2 in the chapter requires an add-on <em>package</em>
(unfortunately the command for loading a package
is <tt>library()</tt>!).  To use
an additional package it must
be (i) <em>installed</em> on your machine 
(with <tt>install.packages()</tt> or
through the menu system) and (ii) <em>loaded</em> in your
current R session (with <tt>library()</tt>).

<div class="p"><!----></div>
   <font color="#FF0000">
<pre>
&#62;&nbsp;install.packages("plotrix")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
   <font color="#FF0000">
<pre>
&#62;&nbsp;library(plotrix)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
You must both install and
load a package before you can use 
or get help on its functions, although <tt>help.search()</tt>
will list functions in packages that are installed but not yet
loaded.

<div class="p"><!----></div>
 <h2><a name="tth_sEc2">
2</a>&nbsp;&nbsp;Exploratory graphics</h2>

<div class="p"><!----></div>
     <h3><a name="tth_sEc2.1">
2.1</a>&nbsp;&nbsp;Bubble plot</h3>

<div class="p"><!----></div>
   <font color="#FF0000">
<pre>
&#62;&nbsp;sizeplot(available,&nbsp;taken,&nbsp;xlab&nbsp;=&nbsp;"Available",&nbsp;ylab&nbsp;=&nbsp;"Taken")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
will give you approximately the same basic graph shown in the chapter,
although I also played around with the x- and y-limits 
(using <tt>xlim</tt> and <tt>ylim</tt>) and the axes.  (The basic
procedure for showing custom axes in  R&nbsp;is to turn off the
default axes by specifying <tt>axes=FALSE</tt> and then to 
specify the axes one at a time with the <tt>axis()</tt> command.)
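
<div class="p"><!----></div>
A minimal sketch of that procedure, with made-up data and tick positions (not the settings used for the figure in the chapter):

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; x_ex = 1:5
&#62; y_ex = c(2, 4, 3, 5, 1)
&#62; plot(x_ex, y_ex, axes = FALSE, xlab = "Available", ylab = "Taken")
&#62; axis(side = 1, at = 1:5)             # x axis (bottom)
&#62; axis(side = 2, at = 0:5, las = 1)    # y axis (left), horizontal labels
&#62; box()                                # draw the frame around the plot
</pre> </font>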

<div class="p"><!----></div>
I used 

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;t1&nbsp;=&nbsp;table(available,&nbsp;taken)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
to cross-tabulate the data,
and then used the <tt>text()</tt>
command to add the numbers to the plot.
There's a little bit
more trickery involved in
putting the numbers in the right place on the plot. <tt>row(x)</tt> gives a matrix with
the row numbers corresponding to the elements of <tt>x</tt>; <tt>col(x)</tt>
does the same for column numbers. Subtracting 1 (<tt>col(x)-1</tt>)
accounts for the fact that columns 1 through 6 of our table refer to 0 through 5
seeds actually taken.  When  R&nbsp;plots, it simply matches up each
of the x values, each of the y values, and each of the text
values (which in this case are the numbers in the table) and plots
them, even though the numbers are arranged in matrices rather
than vectors.
I also limit the plotting to positive values (using <tt>[t1&#62;0]</tt>),
although this is just cosmetic.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;r&nbsp;=&nbsp;row(t1)
&#62;&nbsp;c&nbsp;=&nbsp;col(t1)&nbsp;-&nbsp;1
&#62;&nbsp;text(r[t1&nbsp;&#62;&nbsp;0],&nbsp;c[t1&nbsp;&#62;&nbsp;0],&nbsp;t1[t1&nbsp;&#62;&nbsp;0])
&nbsp;
</pre> </font>

<div class="p"><!----></div>
is the final version of the commands.
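
<div class="p"><!----></div>
To see what <tt>row()</tt> and <tt>col()</tt> return, here is a sketch with a small made-up matrix <tt>m</tt> standing in for <tt>t1</tt>:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; m = matrix(c(0, 2, 5, 1), nrow = 2)
&#62; row(m)                  # row number of every element
&#62; col(m) - 1              # column number of every element, minus 1
&#62; row(m)[m &#62; 0]           # 2 1 2
&#62; (col(m) - 1)[m &#62; 0]     # 0 1 1
&#62; m[m &#62; 0]                # 2 5 1: the values that would be drawn as text
</pre> </font>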

<div class="p"><!----></div>
     <h3><a name="tth_sEc2.2">
2.2</a>&nbsp;&nbsp;Barplot</h3>
The command to produce the barplot (Figure 3) was:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;barplot(t(log10(t1&nbsp;+&nbsp;1)),&nbsp;beside&nbsp;=&nbsp;TRUE,&nbsp;legend&nbsp;=&nbsp;TRUE,&nbsp;xlab&nbsp;=&nbsp;"Available",&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ylab&nbsp;=&nbsp;"log10(1+#&nbsp;observations)")
&#62;&nbsp;op&nbsp;=&nbsp;par(xpd&nbsp;=&nbsp;TRUE)
&#62;&nbsp;text(34.5,&nbsp;3.05,&nbsp;"Number&nbsp;taken")
&#62;&nbsp;par(op)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
As mentioned in the text, <tt>log10(t1+1)</tt> finds

<math xmlns="http://www.w3.org/1998/Math/MathML">
<mrow><mi>log</mi><mo stretchy="false">(</mo><mi>x</mi><mo>+</mo><mn>1</mn><mo stretchy="false">)</mo></mrow></math>, a reasonable transformation to compress
the range of discrete data; <tt>t()</tt> transposes
the table so we can plot groups by number available.
The <tt>beside=TRUE</tt> argument plots grouped rather
than stacked bars; <tt>legend=TRUE</tt> plots a legend;
and <tt>xlab</tt> and <tt>ylab</tt> set labels.
The statement <tt>par(xpd=TRUE)</tt> allows text and
lines to be plotted outside the edge of the plot;
the <tt>op=par(...)</tt> and <tt>par(op)</tt> are a way
to set parameters and then restore the original settings
(I could have called <tt>op</tt> anything I wanted, but
in this case it stands for <b>o</b>ld <b>p</b>arameters).
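
<div class="p"><!----></div>
A sketch of the save-and-restore idiom on its own; the plotted data and the label position are arbitrary.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; op = par(xpd = TRUE)     # change the setting; op holds the old value
&#62; op$xpd                   # FALSE, unless you had already changed it
&#62; plot(1:10)
&#62; text(5, 11.5, "in the margin")   # drawn outside the plot region
&#62; par(op)                  # restore the original setting
&#62; par("xpd")               # back to the old value
</pre> </font>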

<div class="p"><!----></div>
<b>Exercise 4*</b>:
In general, you can specify plotting characters
and colors in parallel with your data, so that
different points get plotted with different
plotting characters and colors.
For example:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;x&nbsp;=&nbsp;1:10
&#62;&nbsp;col_vec&nbsp;=&nbsp;rep(1:2,&nbsp;length&nbsp;=&nbsp;10)
&#62;&nbsp;pch_vec&nbsp;=&nbsp;rep(1:2,&nbsp;each&nbsp;=&nbsp;5)
&#62;&nbsp;plot(x,&nbsp;col&nbsp;=&nbsp;col_vec,&nbsp;pch&nbsp;=&nbsp;pch_vec)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
<img src="lab2-031.png" alt="lab2-031.png" />

<div class="p"><!----></div>
Take the old tabular data (<tt>t1</tt>), 
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mrow><mi>log</mi><mo stretchy="false">(</mo><mn>1</mn><mo>+</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></math>-transform them,
and use <tt>as.numeric()</tt> to drop
all the information in tabular form
and convert them to a numeric
vector.
Plot them (plotting a numeric vector
will generate a scatterplot of values on the

<math xmlns="http://www.w3.org/1998/Math/MathML">
<mrow><mi>y</mi></mrow></math>-axis vs. observation number on the 
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mrow><mi>x</mi></mrow></math>-axis),
color-coded according to the
number available (rows) and point-type-coded
according to the number taken (columns: note, there
is no color 0, so don't subtract 1).

<div class="p"><!----></div>
<tt>order(x)</tt> is a function that gives a vector
of integers that will put <tt>x</tt> in increasing
order.  For example, if I set <tt>x=c(3,1,2)</tt>
then <tt>order(x)</tt> is <tt>2 3 1</tt>: putting
the second element first, the third element
second, and the first element last will
put the vector in increasing order.
In contrast, <tt>rank(x)</tt> just gives
the ranks of the elements (here <tt>3 1 2</tt>).

<div class="p"><!----></div>
 <tt>y[order(x)]</tt> sorts <tt>y</tt> by
the elements of <tt>x</tt>.  
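
<div class="p"><!----></div>
A sketch with the same <tt>x</tt> and a made-up companion vector <tt>y</tt>:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62; x = c(3, 1, 2)
&#62; y = c("a", "b", "c")
&#62; order(x)       # 2 3 1
&#62; rank(x)        # 3 1 2
&#62; x[order(x)]    # 1 2 3, the same as sort(x)
&#62; y[order(x)]    # "b" "c" "a"
</pre> </font>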

<div class="p"><!----></div>
Redo the plot with
the data sorted in increasing order; make sure
the colors and point types match the data properly.

<div class="p"><!----></div>
Does this way of plotting the data show anything
the bubbleplot didn't?  Can you think of other ways
of plotting these data?

<div class="p"><!----></div>
<hr />
You can use <tt>barchart()</tt> in the lattice package
to produce these graphics,
although by default the bars are drawn horizontally
(adding <tt>horizontal=FALSE</tt> should make them vertical).
Try the following (<tt>stack=FALSE</tt>
is equivalent to <tt>beside=TRUE</tt> for <tt>barplot()</tt>):

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;library(lattice)
&#62;&nbsp;barchart(log10(1&nbsp;+&nbsp;table(available,&nbsp;taken)),&nbsp;stack&nbsp;=&nbsp;FALSE,&nbsp;auto.key&nbsp;=&nbsp;TRUE)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
More impressively, the lattice package
can automatically plot a barplot of a three-way
cross-tabulation, in small multiples (I had to experiment
a bit to get the factors in the right order in the
<tt>table()</tt> command): try

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;barchart(log10(1&nbsp;+&nbsp;table(available,&nbsp;Species,&nbsp;taken)),&nbsp;stack&nbsp;=&nbsp;FALSE,&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;auto.key&nbsp;=&nbsp;TRUE)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
<b>Exercise 5*</b>:
Restricting your analysis to only the observations with
5 seeds available, create a barplot showing
the distribution of number of seeds taken broken down by species.
<em>Hints:</em> you can create a new data set that includes only
the appropriate rows by using row indexing, then <tt>attach()</tt>
it.

<div class="p"><!----></div>
     <h3><a name="tth_sEc2.3">
2.3</a>&nbsp;&nbsp;Barplot with error bars</h3>

<div class="p"><!----></div>
Computing the fraction taken:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;frac_taken&nbsp;=&nbsp;taken/available
&nbsp;
</pre> </font>

<div class="p"><!----></div>
Computing the mean fraction taken
for each number of seeds available,
using the <tt>tapply()</tt>
function:
<tt>tapply()</tt> ("<b>t</b>able <b>apply</b>",
pronounced "t apply"), is an extension of the <tt>table()</tt>
function; it splits a specified vector into groups according to
the factors provided, then <em>applies</em> a function (e.g.
<tt>mean()</tt> or <tt>sd()</tt>) to each group.
This idea of applying a function to a set of objects is a very general,
very powerful idea in data manipulation with  R; in due
course we'll learn about <tt>apply()</tt> (apply a function
to rows and columns of matrices), <tt>lapply()</tt> (apply
a function to lists), <tt>sapply()</tt> (apply a function
to lists and simplify), and <tt>mapply()</tt> (apply a function
to multiple lists).
For the present, though,

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;mean_frac_by_avail&nbsp;=&nbsp;tapply(frac_taken,&nbsp;available,&nbsp;mean)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
computes the mean of <tt>frac_taken</tt> for each group defined
by a different value of <tt>available</tt> ( R&nbsp;automatically
converts <tt>available</tt> into a <tt>factor</tt> temporarily for this
purpose).
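
<div class="p"><!----></div>
To see what <tt>tapply()</tt> is doing, here is a small sketch on made-up numbers (these six observations are an assumption for illustration, not values from <tt>seedpred.dat</tt>).

```r
# Hypothetical data: 6 observations, with 2 or 4 seeds available
available = c(2, 2, 2, 4, 4, 4)
taken     = c(0, 1, 2, 0, 1, 1)
frac_taken = taken/available

# Mean fraction taken for each distinct value of 'available'
mean_frac_by_avail = tapply(frac_taken, available, mean)
print(mean_frac_by_avail)

# The same calculation done by hand for the group with 2 seeds available
mean_2 = mean(frac_taken[available == 2])
print(mean_2)
```

The result is named by the group values, so an element can be picked out as <tt>mean_frac_by_avail["2"]</tt>.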

<div class="p"><!----></div>
If you want to compute the mean by group for more than
one variable in a data set, use <tt>aggregate()</tt>.
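
<div class="p"><!----></div>
As a rough sketch of the <tt>aggregate()</tt> approach, assuming a small made-up data frame <tt>d</tt> (not the real seed data), the formula interface can average several columns at once:

```r
# Hypothetical data frame with two response variables
d = data.frame(available = c(2, 2, 4, 4),
               taken     = c(1, 2, 0, 2),
               remaining = c(1, 0, 4, 2))

# Mean of both 'taken' and 'remaining' within each level of 'available'
means_by_avail = aggregate(cbind(taken, remaining) ~ available,
                           data = d, FUN = mean)
print(means_by_avail)
```

Unlike <tt>tapply()</tt>, the result is a data frame with one row per group.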

<div class="p"><!----></div>
We can also calculate the standard errors,

<math xmlns="http://www.w3.org/1998/Math/MathML">
<mrow><mi>&sigma;</mi><mo stretchy="false">/</mo><msqrt><mrow><mi>n</mi></mrow></msqrt></mrow></math>: 

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;n_by_avail&nbsp;=&nbsp;table(available)
&#62;&nbsp;se_by_avail&nbsp;=&nbsp;tapply(frac_taken,&nbsp;available,&nbsp;sd)/sqrt(n_by_avail)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
I'll actually use a variant of <tt>barplot()</tt>,
<tt>barplot2()</tt> (from the <tt>gplots</tt> package,
which you may need to install,
along with the <tt>gtools</tt> and <tt>gdata</tt>
packages), to plot these values
with standard errors. (I am mildly embarrassed that
 R&nbsp;does not supply error-bar plotting as a built-in function,
but you can use <tt>barplot2()</tt>
or the <tt>plotCI()</tt> function;
the <tt>gplots</tt> and <tt>plotrix</tt> packages have slightly
different versions of <tt>plotCI()</tt>.)

<div class="p"><!----></div>
   <font color="#FF0000">
<pre>
&#62;&nbsp;library(gplots)
&#62;&nbsp;lower_lim&nbsp;=&nbsp;mean_frac_by_avail&nbsp;-&nbsp;se_by_avail
&#62;&nbsp;upper_lim&nbsp;=&nbsp;mean_frac_by_avail&nbsp;+&nbsp;se_by_avail
&#62;&nbsp;b&nbsp;=&nbsp;barplot2(mean_frac_by_avail,&nbsp;plot.ci&nbsp;=&nbsp;TRUE,&nbsp;ci.l&nbsp;=&nbsp;lower_lim,&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ci.u&nbsp;=&nbsp;upper_lim,&nbsp;xlab&nbsp;=&nbsp;"Number&nbsp;available",&nbsp;ylab&nbsp;=&nbsp;"Mean&nbsp;fraction&nbsp;taken")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
I specified that I wanted error bars plotted
(<tt>plot.ci=TRUE</tt>) and the lower (<tt>ci.l</tt>) and
upper (<tt>ci.u</tt>) limits.
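
<div class="p"><!----></div>
If you would rather not install extra packages, one common workaround (a sketch of an alternative, not the method used for the figure) is to draw the error bars yourself with <tt>arrows()</tt> on top of an ordinary <tt>barplot()</tt>. The means and standard errors below are invented for illustration.

```r
# Hypothetical group means and standard errors
means = c("1" = 0.10, "2" = 0.25, "3" = 0.18)
ses   = c(0.03, 0.05, 0.04)

# barplot() returns the x coordinates of the bar midpoints
mids = barplot(means, ylim = c(0, 0.4),
               xlab = "Number available", ylab = "Mean fraction taken")

# Draw each error bar as a double-headed arrow with flat (90 degree) heads
arrows(mids, means - ses, mids, means + ses,
       angle = 90, code = 3, length = 0.05)
```

Setting <tt>ylim</tt> by hand leaves room for the upper ends of the bars.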

<div class="p"><!----></div>
     <h3><a name="tth_sEc2.4">
2.4</a>&nbsp;&nbsp;Histograms by species</h3>

<div class="p"><!----></div>
All I had to do to get the lattice package to 
plot the histogram by species was:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;histogram(~frac_taken&nbsp;|&nbsp;Species,&nbsp;xlab&nbsp;=&nbsp;"Fraction&nbsp;taken")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
It's possible to do this with base graphics, too,
but you have to rearrange your data yourself:
essentially, you have to split the data up by
species, tell  R&nbsp;to break the plotting area
up into subplots, and then tell  R&nbsp;to draw a histogram
in each subplot.

<div class="p"><!----></div>

<ul>
<li>To reorganize the data appropriately and
draw the plot, I first use <tt>split()</tt>, which cuts a vector into a list
according to the levels of a factor - in this
case giving us a list of the fraction-taken data
separated by species:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;splitdat&nbsp;=&nbsp;split(frac_taken,&nbsp;Species)
&nbsp;
</pre> </font>

<div class="p"><!----></div>

<div class="p"><!----></div>
</li>

<li>Next I use the <tt>par()</tt>
command

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;op&nbsp;=&nbsp;par(mfrow&nbsp;=&nbsp;c(3,&nbsp;3),&nbsp;mar&nbsp;=&nbsp;c(2,&nbsp;2,&nbsp;1,&nbsp;1))
&nbsp;
</pre> </font>

<div class="p"><!----></div>
to specify a 3 
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mrow><mo>&times;</mo></mrow></math> 3 array 
of mini-plots (<tt>mfrow=c(3,3)</tt>) and to reduce
the margin spacing to 2 lines on the bottom and left
sides and 1 line on the top and right
(<tt>mar=c(2,2,1,1)</tt>).

<div class="p"><!----></div>
</li>

<li>Finally, I combine
<tt>lapply()</tt>, which
applies a command to each of the
elements in a list,
with the <tt>hist()</tt>
(histogram) command.
You can specify extra arguments in <tt>lapply()</tt>
that will be passed along to the <tt>hist()</tt> function - in this
case they're designed to strip out unnecessary detail
and make the subplots bigger.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;h&nbsp;=&nbsp;lapply(splitdat,&nbsp;hist,&nbsp;xlab&nbsp;=&nbsp;"",&nbsp;ylab&nbsp;=&nbsp;"",&nbsp;main&nbsp;=&nbsp;"",&nbsp;col&nbsp;=&nbsp;"gray")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
Assigning the answer to a variable stops  R&nbsp;from printing
the results, which I don't really want to see in this case.

<div class="p"><!----></div>
</li>

<li>
<tt>par(op)</tt> will restore the previous 
graphics parameters.

<div class="p"><!----></div>
</li>
</ul>

<div class="p"><!----></div>
It's a bit harder to get the species names plotted on the
graphs: it is technically possible to use <tt>mapply()</tt>
to do this, but then we've reinvented most of the wheels
used in the lattice version ...
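
<div class="p"><!----></div>
For the curious, here is roughly what the <tt>mapply()</tt> version could look like. This is only a sketch: the data and the species labels <tt>sp_a</tt>, <tt>sp_b</tt>, <tt>sp_c</tt> are made up, and the layout is 1 by 3 rather than 3 by 3.

```r
# Hypothetical data: fraction taken for three made-up species
set.seed(1)
frac_taken = runif(60)
Species = factor(rep(c("sp_a", "sp_b", "sp_c"), each = 20))

splitdat = split(frac_taken, Species)

op = par(mfrow = c(1, 3), mar = c(2, 2, 2, 1))
# mapply() steps through the data list and the vector of names in parallel,
# passing one element of each to the function on every call
h = mapply(function(x, nm) hist(x, main = nm, xlab = "", ylab = "", col = "gray"),
           splitdat, names(splitdat), SIMPLIFY = FALSE)
par(op)
```

The top margin is one line larger than in the text so that the titles fit.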

<div class="p"><!----></div>
<b>Plots in this section:</b>
scatterplot (<tt>plot()</tt> or <tt>xyplot()</tt>),
bubble plot (<tt>sizeplot()</tt>),
barplot (<tt>barplot()</tt> or <tt>barchart()</tt> or <tt>barplot2()</tt>),
histogram (<tt>hist()</tt> or <tt>histogram()</tt>).

<div class="p"><!----></div>
<b>Data manipulation:</b>
<tt>reshape()</tt>, <tt>stack()</tt>/<tt>unstack()</tt>, <tt>table()</tt>, 
<tt>split()</tt>, <tt>lapply()</tt>, <tt>sapply()</tt>

<div class="p"><!----></div>
 <h2><a name="tth_sEc3">
3</a>&nbsp;&nbsp;Measles data</h2>

<div class="p"><!----></div>
I'm going to clear the workspace (<tt>rm(list=ls())</tt>
lists all the objects in the workspace with <tt>ls()</tt> and
then uses <tt>rm()</tt> to remove them: you can also
<tt>Clear workspace</tt> from the menu) and read in the
measles data, which are space-separated and have
a header:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;detach(data)
&#62;&nbsp;rm(list&nbsp;=&nbsp;ls())
&#62;&nbsp;data&nbsp;=&nbsp;read.table("ewcitmeas.dat",&nbsp;header&nbsp;=&nbsp;TRUE,&nbsp;na.strings&nbsp;=&nbsp;"*")
&#62;&nbsp;attach(data)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
<tt>year</tt>, <tt>mon</tt>, and <tt>day</tt> were read in as integers:
I'll create a <tt>date</tt> variable as described above.
For convenience, I'm also defining a variable with
the city names.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;date&nbsp;=&nbsp;as.Date(paste(year&nbsp;+&nbsp;1900,&nbsp;mon,&nbsp;day,&nbsp;sep&nbsp;=&nbsp;"/"))
&#62;&nbsp;city_names&nbsp;=&nbsp;colnames(data)[4:10]
&nbsp;
</pre> </font>
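
<div class="p"><!----></div>
A small sketch of the date construction on its own, with invented values standing in for the <tt>year</tt>, <tt>mon</tt>, and <tt>day</tt> columns (two-digit years, as in the command above):

```r
# Hypothetical values: two-digit years with month and day as integers
year = c(48, 48, 49)
mon  = c(1, 12, 6)
day  = c(10, 25, 3)

date_strings = paste(year + 1900, mon, day, sep = "/")
date = as.Date(date_strings)   # "1948/1/10" is parsed as year/month/day
print(date)

# Dates are stored as day counts, so differences and comparisons work
span_days = as.numeric(diff(range(date)))
print(span_days)
```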

<div class="p"><!----></div>
Later on it will be useful to have the data in long
format.  It's easiest to use <tt>stack()</tt> for
this purpose (<tt>data_long = stack(data[,4:10])</tt>),
but that wouldn't preserve the date information.
As mentioned in the chapter, <tt>reshape()</tt>
is trickier but more flexible:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;data&nbsp;=&nbsp;cbind(data,&nbsp;date)
&#62;&nbsp;data_long&nbsp;=&nbsp;reshape(data,&nbsp;direction&nbsp;=&nbsp;"long",&nbsp;varying&nbsp;=&nbsp;list(city_names),&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;v.names&nbsp;=&nbsp;"incidence",&nbsp;drop&nbsp;=&nbsp;c("day",&nbsp;"mon",&nbsp;"year"),&nbsp;times&nbsp;=&nbsp;factor(city_names),&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;timevar&nbsp;=&nbsp;"city")
&nbsp;
</pre> </font>
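
<div class="p"><!----></div>
Because <tt>reshape()</tt> has so many arguments, it may help to try it on a tiny example first. The two-city data frame below is invented for illustration (it is not an excerpt of <tt>ewcitmeas.dat</tt>); the comments say what each argument is doing.

```r
# Hypothetical wide-format data: one row per date, one column per city
wide = data.frame(date = as.Date(c("1948-01-10", "1948-01-17")),
                  London = c(240, 284),
                  Bristol = c(4, 3))
city_names = c("London", "Bristol")

long = reshape(wide, direction = "long",
               varying = list(city_names),   # columns to stack
               v.names = "incidence",        # name of the stacked column
               times = factor(city_names),   # label attached to each stacked block
               timevar = "city")             # name of the label column
print(long)
```

Each original row appears once per city, and the <tt>date</tt> column is carried along.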

<div class="p"><!----></div>
     <h3><a name="tth_sEc3.1">
3.1</a>&nbsp;&nbsp;Multiple-line plots</h3>
<tt>matplot()</tt> (<b>mat</b>rix <b>plot</b>),
which plots several different numeric variables on a common
vertical axis, is most useful when we have a 
wide format (otherwise it wouldn't make sense
to plot multiple columns on the same scale).
I'll plot columns 4 through 10, which
represent the incidence data, against the date, and I'll
tell  R&nbsp;to use lines (<tt>type="l"</tt>), and to plot all lines
in different colors with different line types
(the colors aren't very useful since I have set
them to different gray scales for printing purposes,
but the example
should at least give you the concept).  I also have to
tell it <em>not</em> to put on any axes, because  R&nbsp;won't
automatically plot a date axis.  Instead, I'll
use the <tt>axis()</tt> and <tt>axis.Date()</tt> commands to
add appropriate axes to the left (<tt>side=2</tt>) and
bottom (<tt>side=1</tt>) sides of the plot, and then
use <tt>box()</tt> to draw a frame around the plot.
I've used <tt>abline()</tt> to add vertical lines (<tt>v=</tt>) to
the plot every two years, and also
on 1 January 1968 (approximately when mass vaccination
against measles began in the UK).  You can also use
<tt>abline()</tt> to add
horizontal lines (<tt>h=</tt>); or lines with
intercepts (<tt>a=</tt>) and slopes (<tt>b=</tt>).
(I use <tt>seq.Date()</tt>, a special command to
create a sequence of dates, to define the beginning
of biennial periods.)
<tt>legend()</tt> puts a legend on the plot; I set
the line width (<tt>lwd</tt>) to 2 so you can actually
see the different colors. 
Here are the commands:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;matplot(date,&nbsp;data[,&nbsp;4:10],&nbsp;type&nbsp;=&nbsp;"l",&nbsp;col&nbsp;=&nbsp;1:7,&nbsp;lty&nbsp;=&nbsp;1:7,&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;axes&nbsp;=&nbsp;FALSE,&nbsp;ylab&nbsp;=&nbsp;"Weekly&nbsp;incidence",&nbsp;xlab&nbsp;=&nbsp;"Date")
&#62;&nbsp;axis(side&nbsp;=&nbsp;2)
&#62;&nbsp;axis.Date(side&nbsp;=&nbsp;1,&nbsp;x&nbsp;=&nbsp;date)
&#62;&nbsp;vacc.date&nbsp;=&nbsp;as.Date("1968/1/1")
&#62;&nbsp;biennial&nbsp;=&nbsp;seq.Date(as.Date("1948/9/1"),&nbsp;as.Date("1986/9/1"),&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;by&nbsp;=&nbsp;"2&nbsp;years")
&#62;&nbsp;abline(v&nbsp;=&nbsp;biennial,&nbsp;col&nbsp;=&nbsp;"gray",&nbsp;lty&nbsp;=&nbsp;2)
&#62;&nbsp;abline(v&nbsp;=&nbsp;vacc.date,&nbsp;lty&nbsp;=&nbsp;2,&nbsp;lwd&nbsp;=&nbsp;2)
&#62;&nbsp;legend(x&nbsp;=&nbsp;1970,&nbsp;y&nbsp;=&nbsp;5000,&nbsp;city_names,&nbsp;col&nbsp;=&nbsp;1:7,&nbsp;lty&nbsp;=&nbsp;1:7,&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;lwd&nbsp;=&nbsp;2,&nbsp;bg&nbsp;=&nbsp;"white")
&#62;&nbsp;box()
&nbsp;
</pre> </font>

<div class="p"><!----></div>
I could use the long-format data set
and the lattice package to do 
this more easily, although without the refinements of a date
axis, using

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;xyplot(incidence&nbsp;~&nbsp;date,&nbsp;groups&nbsp;=&nbsp;city,&nbsp;data&nbsp;=&nbsp;data_long,&nbsp;type&nbsp;=&nbsp;"l",&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;auto.key&nbsp;=&nbsp;TRUE)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
To plot each city in its own subplot,
use the formula <tt>incidence~date|city</tt> and omit the <tt>groups</tt> argument.

<div class="p"><!----></div>
You can also draw any of these plots with different plot types, set by
the <tt>type</tt> argument (<tt>"l"</tt> for lines, <tt>"p"</tt> for points (the default): see
<tt>?plot</tt> for other options).

<div class="p"><!----></div>
     <h3><a name="tth_sEc3.2">
3.2</a>&nbsp;&nbsp;Histogram and density plots</h3>

<div class="p"><!----></div>
I'll start by just collapsing all the incidence data into a
single, logged, non-<tt>NA</tt> vector (in this case
I have to use <tt>c(as.matrix(x))</tt> to collapse the data and remove
all of the data frame information):

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;allvals&nbsp;=&nbsp;na.omit(c(as.matrix(data[,&nbsp;4:10])))
&#62;&nbsp;logvals&nbsp;=&nbsp;log10(1&nbsp;+&nbsp;allvals)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
The histogram (<tt>hist()</tt>) command is fairly easy: the only tricks are to 
leave room for the other lines that will go on the plot by
setting the y limits with <tt>ylim</tt>, and to specify that
we want the data plotted as densities rather than as
counts (<tt>freq=FALSE</tt> or <tt>prob=TRUE</tt>).  This option
tells  R&nbsp;to divide by the total number of counts and then by the bin width,
so that the area covered by all the bars adds up to 1; this scaling
makes the vertical scale of the histogram comparable with that of a density plot, or with
other histograms that have different numbers of counts or bin widths.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;hist(logvals,&nbsp;col&nbsp;=&nbsp;"gray",&nbsp;main&nbsp;=&nbsp;"",&nbsp;xlab&nbsp;=&nbsp;"Log&nbsp;weekly&nbsp;incidence",&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ylab&nbsp;=&nbsp;"Density",&nbsp;freq&nbsp;=&nbsp;FALSE,&nbsp;ylim&nbsp;=&nbsp;c(0,&nbsp;0.6))
&nbsp;
</pre> </font>
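
<div class="p"><!----></div>
To check the density scaling for yourself, here is a minimal sketch on simulated values (the vector <tt>vals</tt> is an assumption, standing in for <tt>logvals</tt>). It computes the bins without drawing them and confirms that the bar areas add up to 1.

```r
set.seed(2)
vals = rnorm(200)              # hypothetical data

h = hist(vals, plot = FALSE)   # compute the bins without drawing
bin_width = diff(h$breaks)

# density = count / (total count * bin width), so the bar areas sum to 1
dens_by_hand = h$counts / (sum(h$counts) * bin_width)
total_area = sum(h$density * bin_width)
print(total_area)
```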

<div class="p"><!----></div>
Adding lines for the density is straightforward, since  R&nbsp;knows
what to do with a <tt>density</tt> object - in general, the <tt>lines</tt>
command just adds lines to a plot.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;lines(density(logvals),&nbsp;lwd&nbsp;=&nbsp;2)
&#62;&nbsp;lines(density(logvals,&nbsp;adjust&nbsp;=&nbsp;0.5),&nbsp;lwd&nbsp;=&nbsp;2,&nbsp;lty&nbsp;=&nbsp;2)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
Adding the estimated normal distribution
requires a couple of new functions: 

<ul>
<li> <tt>dnorm()</tt>
computes the probability density function of the
normal distribution with a specified mean and
standard deviation (much more on this in 
Chapter 4).
<div class="p"><!----></div>
</li>

<li><tt>curve()</tt> is a magic function for 
drawing a theoretical curve
(or, if <tt>add=TRUE</tt>, adding one to an existing
plot).  The magic part is
that your curve <em>must</em> be expressed in terms
of <tt>x</tt>; <tt>curve(x^2+2*x)</tt> will work,
but <tt>curve(y^2+2*y)</tt> won't.  You can specify
other graphics parameters (line type (<tt>lty</tt>)
and width (<tt>lwd</tt>) in this case).
<div class="p"><!----></div>
</li>
</ul>

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;curve(dnorm(x,&nbsp;mean&nbsp;=&nbsp;mean(logvals),&nbsp;sd&nbsp;=&nbsp;sd(logvals)),&nbsp;lty&nbsp;=&nbsp;3,&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;lwd&nbsp;=&nbsp;2,&nbsp;add&nbsp;=&nbsp;TRUE)
&nbsp;
</pre> </font>
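
<div class="p"><!----></div>
A quick sketch of the <tt>x</tt> requirement in <tt>curve()</tt>. The <tt>tryCatch()</tt> wrapper is only there so the failing call does not stop the script, and the <tt>xname</tt> argument (available in current versions of R) is shown as one way around the restriction.

```r
# Works: the expression is written in terms of x
curve(x^2 + 2*x, from = 0, to = 1)

# Fails: the expression does not contain x; tryCatch() captures the error
res = tryCatch(curve(y^2 + 2*y, from = 0, to = 1),
               error = function(e) "error")
print(res)

# The xname argument lets you choose a different variable name
curve(y^2 + 2*y, from = 0, to = 1, xname = "y", add = TRUE, lty = 2)
```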

<div class="p"><!----></div>
By now the <tt>legend()</tt> command should be reasonably
self-explanatory:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;legend(x&nbsp;=&nbsp;2.1,&nbsp;y&nbsp;=&nbsp;0.62,&nbsp;legend&nbsp;=&nbsp;c("density,&nbsp;default",&nbsp;"density,&nbsp;adjust=0.5",&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"normal"),&nbsp;lwd&nbsp;=&nbsp;2,&nbsp;lty&nbsp;=&nbsp;c(1,&nbsp;2,&nbsp;3))
&nbsp;
</pre> </font>

<div class="p"><!----></div>
     <h3><a name="tth_sEc3.3">
3.3</a>&nbsp;&nbsp;Scaling data</h3>

<div class="p"><!----></div>
Scaling the incidence in each city by the population size,
or by the mean or maximum incidence in that city, begins
to get us into some non-trivial data manipulation.
This process may actually be easier in the wide format.
Several useful commands:

<ul>
<li><tt>rowMeans()</tt>, <tt>rowSums()</tt>, <tt>colMeans()</tt>,
and <tt>colSums()</tt> will compute the means or sums of rows or columns
efficiently. In this case we would do something like
<tt>colMeans(data[,4:10])</tt> to get the mean incidence for
each city.
<div class="p"><!----></div>
</li>

<li><tt>apply()</tt> is the more general command for running
some command on each of a set of rows or columns. When you
look at the help for <tt>apply()</tt> you'll see an argument
called <tt>MARGIN</tt>, which specifies whether you want
to operate on rows (1) or columns (2).  For example,
<tt>apply(data[,4:10],1,mean)</tt> is the equivalent
of <tt>rowMeans(data[,4:10])</tt>, but we can also easily
say (e.g.) <tt>apply(data[,4:10],1,max)</tt> to get the
maxima instead.  Later, when you've gotten practice
defining your own functions, you can apply any function - not
just  R's built-in functions.
<div class="p"><!----></div>
</li>

<li><tt>scale()</tt> is a function for subtracting and dividing
specified amounts out of the columns of a matrix.  It is fairly
flexible: <tt>scale(x,center=TRUE,scale=TRUE)</tt> will center by
subtracting the means and then 
scale by dividing by the standard deviations of the columns.
Fairly obviously, setting either to <tt>FALSE</tt> will turn off
that part of the operation.  You can also specify a vector for
either <tt>center</tt> or <tt>scale</tt>, in which case <tt>scale()</tt>
will subtract or divide the columns by those vectors instead.
<b>Exercise 6*</b>: figure out how to use 
<tt>apply()</tt> and <tt>scale()</tt> to scale all columns so
they have a minimum of 0 and a maximum of 1 (<em>hint:</em>
subtract the minimum and divide by (max-min)).
<div class="p"><!----></div>
</li>

<li><tt>sweep()</tt> is more general than scale; it will
operate on either rows or columns (depending on the
<tt>MARGIN</tt> argument), and it will use any operator
(typically <tt>"-"</tt>, <tt>"/"</tt>, etc. - arithmetic
symbols must be in quotes) rather than just subtracting
or dividing.
For example, <tt>sweep(x,1,rowSums(x),"/")</tt> will divide
the rows (1) of <tt>x</tt> by their sums. <br />
<b>Exercise 7</b>: figure out how to use 
a call to <tt>sweep()</tt> to do the same thing
as <tt>scale(x,center=TRUE,scale=FALSE)</tt>.
<div class="p"><!----></div>
</li>
</ul>
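
<div class="p"><!----></div>
The commands in the list above are easier to follow on a matrix small enough to check by eye. The sketch below uses a made-up 3 by 2 matrix in place of the incidence columns and deliberately leaves the exercises unsolved.

```r
# Hypothetical 3 x 2 matrix standing in for the incidence columns
x = matrix(c(1, 2, 3,
             10, 20, 60), nrow = 3)

row_means = apply(x, 1, mean)    # MARGIN = 1: one value per row
col_max   = apply(x, 2, max)     # MARGIN = 2: one value per column

# scale(): divide each column by its mean, without centering
x_scaled = scale(x, center = FALSE, scale = colMeans(x))

# sweep(): divide each row by its sum, so every row adds up to 1
x_prop = sweep(x, 1, rowSums(x), "/")

print(row_means)
print(col_max)
print(x_scaled)
print(x_prop)
```

<div class="p"><!----></div>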
So, if I want to divide each city's incidence by its mean
(allowing for adding 1) and take logs:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;logscaledat&nbsp;=&nbsp;as.data.frame(log10(scale(1&nbsp;+&nbsp;data[,&nbsp;4:10],&nbsp;center&nbsp;=&nbsp;FALSE,&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;scale&nbsp;=&nbsp;colMeans(1&nbsp;+&nbsp;data[,&nbsp;4:10],&nbsp;na.rm&nbsp;=&nbsp;TRUE))))
&nbsp;
</pre> </font>

<div class="p"><!----></div>
You can also scale the data while they are in long format,
but you have to think about it differently.
Use <tt>tapply()</tt> to compute the mean incidence in each city, ignoring <tt>NA</tt> values,
and adding 1 to all values:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;city_means&nbsp;&lt;-&nbsp;tapply(1&nbsp;+&nbsp;data_long$incidence,&nbsp;data_long$city,&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;mean,&nbsp;na.rm&nbsp;=&nbsp;TRUE)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
Now you can use vector indexing to scale each incidence value by
the appropriate mean value - <tt>city_means[data_long$city]</tt>
does the trick.  (Why?)

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;scdat&nbsp;&lt;-&nbsp;(1&nbsp;+&nbsp;data_long$incidence)/city_means[data_long$city]
&nbsp;
</pre> </font>

<div class="p"><!----></div>
<b>Exercise 8*</b>: figure out how to 
scale the long-format data to minima of zero and maxima of 1.

<div class="p"><!----></div>

<b>Plotting&nbsp;&nbsp;</b>

<div class="p"><!----></div>
Here are (approximately)
the commands I used to plot the scaled data.

<div class="p"><!----></div>
First, I ask  R&nbsp;to set up the graph by plotting
the first column, but without actually putting
any lines on the plot (<tt>type="n"</tt>):

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;plot(density(na.omit(logscaledat[,&nbsp;1])),&nbsp;type&nbsp;=&nbsp;"n",&nbsp;main&nbsp;=&nbsp;"",&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;xlab&nbsp;=&nbsp;"Log&nbsp;scaled&nbsp;incidence")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
Now I do something tricky.  I define a temporary
function with two arguments - the data and 
a number specifying the color and line types.
This function doesn't do anything at all until
I call it with a specific data vector and number:
<tt>x</tt> and <tt>i</tt> are just place-holders.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;tmpfun&nbsp;=&nbsp;function(x,&nbsp;i)&nbsp;{
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;lines(density(na.omit(x)),&nbsp;lwd&nbsp;=&nbsp;2,&nbsp;col&nbsp;=&nbsp;i,&nbsp;lty&nbsp;=&nbsp;i)
+&nbsp;}
&nbsp;
</pre> </font>

<div class="p"><!----></div>
Now I use the <tt>mapply()</tt> command 
(<b>m</b>ultiple <b>apply</b>) to 
run the <tt>tmpfun()</tt> function for all
the columns in <tt>logscaledat</tt>, with
different colors and line types.
This takes advantage of the fact that
I have used <tt>as.data.frame()</tt> above
to make <tt>logscaledat</tt> back into
a data frame, so that its columns can
be treated as elements of a list:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;m&nbsp;=&nbsp;mapply(tmpfun,&nbsp;logscaledat,&nbsp;1:7)
&nbsp;
</pre> </font>
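
<div class="p"><!----></div>
If the way <tt>mapply()</tt> pairs up its arguments is unclear, this small sketch may help. The data frame <tt>dat</tt> and the function <tt>tmpfun2</tt> are hypothetical stand-ins that return a number instead of drawing a line, so the pairing is visible in the output.

```r
# Hypothetical two-column data frame; columns act as list elements
dat = data.frame(a = c(1, 2, NA, 4), b = c(10, NA, 30, 40))

# A function of two arguments: a data vector and a multiplier
tmpfun2 = function(x, i) {
    i * mean(x, na.rm = TRUE)
}

# mapply() calls tmpfun2(dat$a, 1) and then tmpfun2(dat$b, 2)
m = mapply(tmpfun2, dat, 1:2)
print(m)
```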

<div class="p"><!----></div>
Finally, I'll add a legend.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;legend(-2.6,&nbsp;0.65,&nbsp;city_names,&nbsp;lwd&nbsp;=&nbsp;2,&nbsp;col&nbsp;=&nbsp;1:7,&nbsp;lty&nbsp;=&nbsp;1:7)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
Once again, for this relatively simple case,
I can get the lattice package to do all of this
for me magically:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;densityplot(~log10(scdat),&nbsp;groups&nbsp;=&nbsp;data_long$city,&nbsp;plot.points&nbsp;=&nbsp;FALSE,&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;auto.key&nbsp;=&nbsp;TRUE,&nbsp;lty&nbsp;=&nbsp;1:7)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
(if <tt>plot.points=TRUE</tt>, as is the default, the plot will include all of the
actual data values plotted as points along the zero line - this is often useful
but in this case just turns into a blob).

<div class="p"><!----></div>
However, it's really useful to know some of the ins and outs for times when
lattice won't do what you want - in those cases it's often easier to do
it yourself with the base package than to figure out how to get lattice
to do it.

<div class="p"><!----></div>
     <h3><a name="tth_sEc3.4">
3.4</a>&nbsp;&nbsp;Box-and-whisker and violin plots</h3>

<div class="p"><!----></div>
By this time, box-and-whisker and violin plots will (I hope) seem easy.

<div class="p"><!----></div>
Since the labels get a little crowded ( R&nbsp;is not really sophisticated
about dealing with axis labels - crowded labels just disappear - although
you can try the <tt>stagger.labs()</tt> command from the <tt>plotrix</tt>
package), I'll use the <tt>substr()</tt> (<b>substr</b>ing) command to abbreviate
each city's name to its first three letters.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;city_abbr&nbsp;=&nbsp;substr(city_names,&nbsp;1,&nbsp;3)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
The <tt>boxplot()</tt> command uses a formula - the variable before the <tt>~</tt>
is the data and the variable after it is the factor to use to split the data up.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;boxplot(log10(1&nbsp;+&nbsp;incidence)&nbsp;~&nbsp;city,&nbsp;data&nbsp;=&nbsp;data_long,&nbsp;ylab&nbsp;=&nbsp;"Log(incidence+1)",&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;names&nbsp;=&nbsp;city_abbr)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
Of course, I can do this with the lattice package as well.  If I want
violin plots instead of boxplots, I specify <tt>panel=panel.violin</tt>.
The <tt>scales=list(abbreviate=TRUE)</tt> tells the lattice package to
make up its own abbreviations (the <tt>scales</tt> argument is a general-purpose
list of axis options for the subplots).

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;bwplot(log10(1&nbsp;+&nbsp;incidence)&nbsp;~&nbsp;city,&nbsp;data&nbsp;=&nbsp;data_long,&nbsp;panel&nbsp;=&nbsp;panel.violin,&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;horizontal&nbsp;=&nbsp;FALSE,&nbsp;scales&nbsp;=&nbsp;list(abbreviate&nbsp;=&nbsp;TRUE))
&nbsp;
</pre> </font>

<div class="p"><!----></div>
<b>Plots in this section:</b>
multiple-groups plot (<tt>matplot()</tt> or <tt>xyplot(...,groups)</tt>),
box-and-whisker plot (<tt>boxplot()</tt> or <tt>bwplot()</tt>),
density plot (<tt>plot(density())</tt> or <tt>lines(density())</tt> or <tt>densityplot()</tt>),
violin plot (<tt>panel.violin()</tt>)

<div class="p"><!----></div>
<b>Data manipulation:</b>
<tt>row/colMeans()</tt>,
<tt>row/colSums()</tt>, <tt>sweep()</tt>, <tt>scale()</tt>, <tt>apply()</tt>, <tt>mapply()</tt>

<div class="p"><!----></div>
 <h2><a name="tth_sEc4">
4</a>&nbsp;&nbsp;Continuous data</h2>

<div class="p"><!----></div>
First let's make sure the earthquake data are accessible:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;data(quakes)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
Luckily, most of the plots I drew in this section are
fairly automatic.
To draw a scatterplot matrix, just use <tt>pairs()</tt>
(base) or <tt>splom()</tt> (lattice):

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;pairs(quakes,&nbsp;pch&nbsp;=&nbsp;".")
&#62;&nbsp;splom(quakes,&nbsp;pch&nbsp;=&nbsp;".")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
(<tt>pch="."</tt> marks the data with a single-pixel
point, which is handy if you are fortunate enough
to have a really big data set).

<div class="p"><!----></div>
Similarly, the conditioning plot takes a single command,
<tt>coplot()</tt>, which is part of base graphics:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;coplot(lat&nbsp;~&nbsp;long&nbsp;|&nbsp;depth,&nbsp;data&nbsp;=&nbsp;quakes)
&nbsp;
</pre> </font>

<div class="p"><!----></div>
although <tt>coplot()</tt> has <em>many</em> other
options for changing the number of overlapping
categories (or <em>shingles</em>) the data are
divided into; conditioning on more than one
variable (use <tt>var1*var2</tt> after the vertical
bar); colors, line types, etc.

<div class="p"><!----></div>
To draw the last figure (various lines plotted
against data), I first took a subset of the
data with longitude greater than 175:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;tmpdat&nbsp;=&nbsp;quakes[quakes$long&nbsp;&#62;&nbsp;175,&nbsp;]
&nbsp;
</pre> </font>

<div class="p"><!----></div>
Then I generated a basic plot of depth vs. longitude:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;plot(tmpdat$long,&nbsp;tmpdat$depth,&nbsp;xlab&nbsp;=&nbsp;"Longitude",&nbsp;ylab&nbsp;=&nbsp;"Depth",&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;col&nbsp;=&nbsp;"darkgray",&nbsp;pch&nbsp;=&nbsp;".")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
 R&nbsp;knows what to do (<tt>plot()</tt> or <tt>lines()</tt>) with
a <tt>lowess()</tt> fit. In this case I used the default
smoothing parameter (<tt>f=2/3</tt>), but I could have used
a smaller value to get a wigglier line.

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;lines(lowess(tmpdat$long,&nbsp;tmpdat$depth),&nbsp;lwd&nbsp;=&nbsp;2)
&nbsp;
</pre> </font>
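
<div class="p"><!----></div>
To see the effect of the smoothing parameter, here is a sketch that overlays the default fit and a fit with a much smaller span. The value <tt>f=0.1</tt> is an arbitrary choice for illustration, not one used in the chapter.

```r
tmpdat = quakes[quakes$long > 175, ]

lo_default = lowess(tmpdat$long, tmpdat$depth)            # f = 2/3
lo_wiggly  = lowess(tmpdat$long, tmpdat$depth, f = 0.1)   # smaller span

plot(tmpdat$long, tmpdat$depth, col = "darkgray", pch = ".",
     xlab = "Longitude", ylab = "Depth")
lines(lo_default, lwd = 2)
lines(lo_wiggly, lwd = 2, lty = 2)
```

<tt>lowess()</tt> returns a list with sorted <tt>x</tt> values and the smoothed <tt>y</tt> values, which is why <tt>lines()</tt> can draw it directly.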

<div class="p"><!----></div>
 R&nbsp;also knows what to do with <tt>smooth.spline</tt>
objects: in this case I plot two lines, one with the default
amount of smoothing and one with the smoothing fixed at
4 effective degrees of freedom (<tt>df=4</tt>):

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;lines(smooth.spline(tmpdat$long,&nbsp;tmpdat$depth),&nbsp;lwd&nbsp;=&nbsp;2,&nbsp;lty&nbsp;=&nbsp;2)
&#62;&nbsp;lines(smooth.spline(tmpdat$long,&nbsp;tmpdat$depth,&nbsp;df&nbsp;=&nbsp;4),&nbsp;lwd&nbsp;=&nbsp;2,&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;lty&nbsp;=&nbsp;3)
&nbsp;
</pre> </font>
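
<div class="p"><!----></div>
If you want to know how much smoothing each spline fit actually used, the fitted object stores its effective degrees of freedom in the <tt>df</tt> component. A minimal sketch (the object names are my own):

```r
tmpdat = quakes[quakes$long > 175, ]

sp_default = smooth.spline(tmpdat$long, tmpdat$depth)          # df chosen automatically
sp_df4     = smooth.spline(tmpdat$long, tmpdat$depth, df = 4)  # df fixed near 4

print(sp_default$df)
print(sp_df4$df)
```

A larger value of <tt>df</tt> corresponds to a wigglier curve, a smaller value to a smoother one.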

<div class="p"><!----></div>
Adding a line based on a linear regression fit is 
easy - we did that in Lab&nbsp;1:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;abline(lm(depth&nbsp;~&nbsp;long,&nbsp;data&nbsp;=&nbsp;tmpdat),&nbsp;lwd&nbsp;=&nbsp;2,&nbsp;col&nbsp;=&nbsp;"gray")
&nbsp;
</pre> </font>

<div class="p"><!----></div>
Finally, I do something slightly more complicated - plot
the results of a quadratic regression. The regression itself
is easy, except that I have to specify longitude-squared
as  <tt>I(long^2)</tt> so that  R&nbsp;knows I mean to raise
longitude to the second power rather than exploring
a statistical interaction:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;quad.lm&nbsp;=&nbsp;lm(depth&nbsp;~&nbsp;long&nbsp;+&nbsp;I(long^2),&nbsp;data&nbsp;=&nbsp;tmpdat)
&nbsp;
</pre> </font>
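
<div class="p"><!----></div>
To see why <tt>I()</tt> matters, compare the coefficients of the two fits in this sketch (the name <tt>lin.lm</tt> is mine; it is not used elsewhere in the lab):

```r
tmpdat = quakes[quakes$long > 175, ]

# Without I(), ^ is a formula operator and long^2 collapses to just long
lin.lm  = lm(depth ~ long + long^2, data = tmpdat)
# With I(), the square is computed arithmetically and gets its own coefficient
quad.lm = lm(depth ~ long + I(long^2), data = tmpdat)

print(names(coef(lin.lm)))
print(names(coef(quad.lm)))
```

The first model silently fits a straight line, which is an easy mistake to miss.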

<div class="p"><!----></div>
To calculate predicted depth values across the range of 
longitudes, I have to set up a longitude vector and
then use <tt>predict()</tt> to generate predictions
at these values:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;lvec&nbsp;=&nbsp;seq(176,&nbsp;188,&nbsp;length&nbsp;=&nbsp;100)
&#62;&nbsp;quadvals&nbsp;=&nbsp;predict(quad.lm,&nbsp;newdata&nbsp;=&nbsp;data.frame(long&nbsp;=&nbsp;lvec))
&nbsp;
</pre> </font>
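
<div class="p"><!----></div>
As a check on what <tt>predict()</tt> is doing, this sketch evaluates the fitted quadratic by hand on a short grid (5 points rather than 100, purely to keep the output small) and compares the two answers.

```r
tmpdat = quakes[quakes$long > 175, ]
quad.lm = lm(depth ~ long + I(long^2), data = tmpdat)

lvec = seq(176, 188, length.out = 5)    # a short grid, for illustration
quadvals = predict(quad.lm, newdata = data.frame(long = lvec))

# predict() is evaluating the fitted polynomial b0 + b1*long + b2*long^2
b = coef(quad.lm)
by_hand = b[1] + b[2]*lvec + b[3]*lvec^2

print(cbind(lvec, quadvals, by_hand))
```

The column name in <tt>newdata</tt> must match the predictor name in the model formula (<tt>long</tt>).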

<div class="p"><!----></div>
Now I can just use <tt>lines()</tt> to add these
values to the graph, and add a legend:

<div class="p"><!----></div>
  <font color="#FF0000">
<pre>
&#62;&nbsp;lines(lvec,&nbsp;quadvals,&nbsp;lwd&nbsp;=&nbsp;2,&nbsp;lty&nbsp;=&nbsp;2,&nbsp;col&nbsp;=&nbsp;"gray")
&#62;&nbsp;legend(183.2,&nbsp;690,&nbsp;c("lowess",&nbsp;"spline&nbsp;(default)",&nbsp;"spline&nbsp;(df=4)",&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"regression",&nbsp;"quad.&nbsp;regression"),&nbsp;lwd&nbsp;=&nbsp;2,&nbsp;lty&nbsp;=&nbsp;c(1,&nbsp;2,&nbsp;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3,&nbsp;1,&nbsp;2),&nbsp;col&nbsp;=&nbsp;c(rep("black",&nbsp;3),&nbsp;rep("gray",&nbsp;2)))
&nbsp;
</pre> </font>

<div class="p"><!----></div>
<b>Plots in this section:</b>
scatterplot matrix (<tt>pairs()</tt>, <tt>splom()</tt>),
conditioning plot (<tt>coplot()</tt>),
spline (<tt>smooth.spline()</tt>) and
locally weighted (<tt>lowess()</tt>) smoothing

<div class="p"><!----></div>
<b>Exercise 9*</b>: generate
three new plots based on one of the data sets in this lab,
or on your own data.

<div class="p"><!----></div>

<br /><br /><hr /><small>File translated from
T<sub><font size="-1">E</font></sub>X
by <a href="http://hutchinson.belmont.ma.us/tth/">
T<sub><font size="-1">T</font></sub>M</a>,
version 3.70.<br />On 12 Sep 2005, 12:35.</small>
</body></html>
