``Just get a bigger hammer'': alternate optimization methods
Comments on last problem set
Averaging/Jensen's inequality problem set
I hope this will be the last set of comments I make on Jensen's
inequality: as I have said to a few people, although J.E. is an
important idea, the concern and confusion I accidentally engendered
over it has blown it all somewhat out of proportion.
Two references that talk about Jensen's inequality are:
- J. J. Ruel and Matthew P. Ayres,
Jensen's
inequality predicts effects of environmental variation [N.B.: link
only available from UF sites-sorry], TREE
14 (9):361-366 (Sept. 1999): a really good general
introduction to J.E. and the effects of ignored variation across a
variety of different ecological scales (plant ecology, general animal
physiology, plant-herbivore interactions). [Thanks to James Vonesh
for pointing this out.]
- Schmitt, R.J., S.J. Holbrook, and C.W. Osenberg. 1999. Quantifying
the effects of multiple processes on local abundance: a cohort
approach for open populations
Ecology Letters 2:294-303. Not specifically about
J.E., but the last section of the paper extrapolates some of their
results to a larger scale by averaging across a distribution of
variability in fish settlement rates.
- P. D. Smallwood, An introduction to risk sensitivity: The use of
Jensen's inequality to clarify evolutionary arguments of adaptation and constraint,
American Zoologist, 36:392-401 (1996). (Jensen's
inequality in a more classical behavioral context, where ``risk
sensitivity''-the willingness of an organism to trade off a higher
mean reward for a lower variance in reward-can be explained in terms
of a saturating fitness curve.)
The only other general point I wanted to make was about the homework,
where in going from s = 1 to s = 3 many people didn't
remember to extend the range (for which I initially suggested m±3 s, which worked out to 17-23 for m = 20, s = 1). This
gives the wrong answer, because you're no longer integrating over
enough of the distribution.
When you're doing this kind of numeric integration, it's always worth
checking that you're getting enough of the distribution (i.e., your
range is large enough and your Dz is small enough) by doing
some calculation where you know what the right answer is. For
example,
dz <- 0.01
N <- seq(17,23,by=dz)
pvec <- dnorm(N,mean=20,sd=3)
sum(pvec*dz)
gives an answer of 0.68.
What we were hoping for is that
åz1z2 P(z) Dz » òz1z2 P(z) dz » ò-¥¥ P(z) dz, which should be equal to
1 in this case because P is a probability distribution function.
The fact that 0.68 ¹ 1 indicates that either Dz is too
big or (more likely in this case) that {z1,z2} doesn't span a
large enough range.
This is just another example of sanity checking-always try to
think of ways to cross-check your answers and look at them in
different ways to make sure they make sense.
Dynamic models
A large category of models, but one that we haven't dealt with much in
this class yet, is dynamic models, models where the expected
values of the data are not some simple function of a static
independent variable (light, nitrogen, predator density, etc.), but
can only be derived from a dynamic model such as a differential
equation or stochastic simulation.
Take the example (which we will cover in more detail a bit later in
the course) of wildebeest dynamics in the Serengeti. Hilborn &
Mangel present a model in which birth (calving) and death rates depend
on population density and rainfall. In the best of all possible
worlds, we would just go out and measure calving and death rates for
in years with different population densities and rainfall. In the
very best of all worlds we would find some way to vary these
independent variables experimentally; in practice, there's no way we
can do this-these observational and dynamic situations are the ones
where we most often need models to help overcome our experimental and
observational limitations (they cannot substitute for good
observational and experimental design).
Since we cannot do the experiments, we use dynamic models instead: we
define Nt+1 in terms of birth rates and death rates given Nt
and the rainfall in year t. For each set of parameters we want to
evaluate, we run the whole model forward in time and see how the
predictions compare with the observed data. This comparison (e.g. the
sum-of-squares of the difference between the model prediction for a
particular set of parameters) becomes the likelihood or
goodness-of-fit that we try to minimize/maximize with the tools we are
collecting.
There are two possible ways that we can sometimes avoid using a
complete dynamic model (which is computationally harder, and allows
many steps for error/usually produces larger errors in parameters,
than more straightforward models).
- First, for really simple models we
can solve our dynamic equations (Nt+1 = f(Nt) or dN/dt = f(N)) to
get a complete time-dependent solution. For example if we are using
the logistic equation dN/dt = rN(1-N/K) to describe population growth
rates, we can solve the differential equation directly:
|
N(t) = |
K
|
1- |
æ ç
è
|
1 - |
K N(0)
|
ö ÷
ø
|
e-rt |
|
|
| (1) |
and plug this into our likelihood function L(r,K), rather
than running a loop each time to calculate N(t+1) from N(t).
However, it's fairly unusual that we can solve the equations
analytically for ecologically interesting models: linear growth
(Nt+1 = Nt + C), exponential growth (Nt+1 = C Nt), and
logistic growth are the exceptions rather than the general case.
- Another suggestion (as pointed out by Lou Santiago in class) is
that sometimes we can fit the function for Nt+1 = f(Nt) directly.
This is essentially what we saw in the ``Monte Carlo'' exercise from
chapter 3: the plot of Nt+1 vs. Nt is a straight line, and we
could fit a straight line through the points by linear regression
(although it is not clear that the assumptions of linear regression
hold in this case, particularly if there is process error as well as
measurement error). Where this tends to break down is (as in the
wildebeest model) where we don't have all the consecutive data: if,
for example, we have data for year 5 and year 7 but not year 6, the
relationship between N5 and N7 is sufficiently complicated that
we can't fit it directly and we have to go back to the dynamic model.
We'll see more dynamic models as we go along.
Overcoming fitting problems
Changing topics back to the technical details of fitting functions
(whether they are dynamic models or simpler static functions of data):
the quasi-Newton methods described in the last lecture are powerful
and efficient when they work, but they don't always work.
Discontinuities etc.: Nelder-Mead (``simplex'',
``amoeba'')
The first problem that derivative-based methods (such as
quasi-Newton) have are that they can be messed up when the
goodness-of-fit function has sharp changes in it.
(For a digression/example, see this example of
how a threshold function causes sharp changes in the sum-of-squares
function: that, or take this on faith and wait until we get to the
example from chapter 6, which will have exactly this problem.)
When the quasi-Newton method gets to a place on the graph with flat
places, it gets faked into thinking it's found the answer (since the
derivative is zero); when there's a vertical discontinuity in the
graph, it gets faked into extrapolating wrong.
There are fancy methods that avoid the use of derivatives (you can
look up Powell's method in Numerical Recipes), but a
simple brute-force option is the simplex method invented by
Nelder and Mead (called the ``amoeba'' in NR).
Rather than starting with a single point guess at what the parameters
are, this method picks a number of points which form a
simplex-the simplest shape possible in an n-dimensional
space.
In two dimensions, this is three points (each of which represents a
pair of parameter values) forming a triangle; in three dimensions,
it's 4 points (each of which is a triplet of parameter values) forming
a pyramid or tetrahedron; in higher dimensions, it's n+1 points (we call this
shape a ``hyper-tetrahedron'' or just call it a simplex).
We then evaluate the likelihood/sum-of-squares at each point, which is
the ``height'' of the surface at that point, and move the worst point
according to a simple set of rules:
- start by going in what seems to the best direction by
reflecting the high (worst) point in the simplex
through the face opposite it;
- if the goodness-of-fit at the new point is better than the best
(lowest) other point in the simplex, double the length of the jump in
that direction;
- if this jump was bad-the height at the new point is worse than
the second-worst point in the simplex-then try a point that's only
half as far out as the initial try;
- if this second try, closer to the original, is also bad, then we
contract the simplex around the current best (lowest) point.
Here's a graphical illustration (taken from NR) of those rules applied
to a trapezoid:

The amoeba works pretty well in a wide variety of situations, although
it's not foolproof (nothing is) and it's not very efficient.
Here's what the simplex does with the surface we calculated for the
exponential model in the last section.

We give the amoeba (r = 4, d = 4) as starting coordinates and it
displaces these coordinates slightly to get its starting simplex
(black triangle). Thereafter, it takes steps alternating between
simple reflection (blue triangles) and doubled reflection (green
triangles), until it finds that it has gone too far and ``turns the
corner'' to start decreasing r. Occasionally, a simple reflection
is too far-usually when the amoeba is trying to turn a corner-and
the triangle is light blue. In any case, the amoeba does eventually
get to the right place, although it takes 48 cycles to get there.
Multiple minima: multiple-start Newton, Metropolis
However clever the Nelder-Mead simplex is, it can't deal with
multiple minima: this is a particularly tough case.
If there are only a few minima and their ``basins of
attraction''-the set of starting points from which you can roll
downhill into them-are all fairly large, the best thing to do is
probably just to use a quasi-Newton algorithm, but to start from many
different (random or stratified) starting points.
If this doesn't work-if there are lots of minima of different
heights, so many that you'll never manage to fall into the right
one-then another possibility is to use one of the many
stochastic optimization algorithms, where you actually pick
random numbers to add some ``noise'' and avoid local minima.
The classic stochastic optimization algorithm is the Metropolis
algorithm (or simulated annealing, which says (essentially):
- pick a starting point (set of parameters) and calculate the
value of the function for those parameters;
- until your answer is good enough or you're sick of calculating:
- A. pick a new point (set of parameters) at random (near your old point);
- calculate the function value there
- if it's better than the old function value, accept it and go
back to A
- if it's worse than the old value, calculate the difference
DL; pick a random number between 0 and 1 and accept the new
value if the random number is less than e-DL/k, where k
is a constant. Otherwise, go back to the old value. The smaller k is and the worse DL
is, the less likely you are to accept the new value
- As you go along, gradually lower the value of k (which is
sometimes called the temperature) to make it harder and harder to
accept bad moves.
A different variant of the Metropolis algorithm (Szymura-Barton, MSB) lets
the size of the change in parameters vary rather than the probability
of accepting a bad step, and changes the jump size automatically
rather than according to a fixed schedule: every
successful jump increases the jump size, while every unsuccessful jump
decreases the jump size. This makes the algorithm good at exploring
lots of local minima (every time it gets into a valley, it starts
trying to get out) but really bad at refining estimates (it has a hard
time actually getting all the way to the bottom of a valley).
Here's a snapshot of where the MSB algorithm tries to go on our
familiar landscape:

Black dots are successful jumps, green are unsuccessful jumps.
Let's look at some statistics on how the Metropolis algorithm
performed over time:

It's actually pretty sloppy. The first two panels show the values of
r and d over time, which although they're centered on reasonable
values, fluctuate a whole lot. The third and fourth panels who the
best-so-far values of r and d-most of the action takes place in
the first 500 steps or so, after that there's only tiny incremental
changes. The jump sizes for r and d (panels 5 and 6) go up to a
rough asymptote in the first 500 steps, then stay there. The sum of
squares (panel 7) varies wildly (note that this is on a log scale),
and as suggested by panels 3 and 4 the best-so-far S.S. (panel 8)
drops rapidly in the first 500 steps and then drops very slowly.
Finally, the fraction of jumps accepted (panel 9) starts high while lots of
improvement is taking place, and then drops down to about 0.5.
The Metropolis algorithm is not particularly good in this context, but
can be a lifesaver when you have more complicated goodness-of-fit
surfaces and you're willing to use brute force.
Constraints on parameters
The last technical detail that we'll cover (for now) is the problem of
having to constrain parameter values within a particular range. There
are many reasons this comes up, but the most common is when you only
have parameters that make sense when they are positive or (say)
between 0 and 1 (such as probabilities).
There are three basic approaches to this problem:
- Find a clever (e.g. quasi-Newton) minimizer that incorporates the bounds into its
solution of the problem. These do exist, but (alas) not in the
current version of R [although apparently the latest version does
have this, so if you need it for your project you can see about
downloading the latest version]. The odds of finding a simplex or a
Metropolis algorithm with bounds built in is fairly slim, and if your
constraints are more complicated than simple ``parameter x must be
between a and b''-e.g., if there's a relationship between
different parameters that has to stay constant-then you may be out
of luck.
- Add a penalty to the likelihood or sum of squares that increases
the farther outside of the allowed region the parameters go. This
will tend to push minimizers go back into the allowed region.
However, there are a variety of potential technical problems with
this-the largest is that, if your likelihood calculation doesn't
make any sense outside the allowed region (e.g. for negative
probabilities) you have to make some decisions.
- Transform your parameters so that there are no longer any
constraints. For example, if parameter x has to be positive, then
rewrite your function and minimize with respect to y = logx instead. When y goes to
-¥, then x goes to zero; when y goes to ¥, x goes
to ¥. Similarly, if you have a parameter x that has to be
between 0 and 1, the arctangent of x will go from -¥ to
¥.
A cautionary note
Just for the heck of it, here's a goodness-of-fit surface that
recently came up in a problem I'm working on. Unfortunately, the
slice shown is a section of the variation of the G-O-F with variation
in a single parameter: the entire model
has 242 (!) parameters, so
brute force is really out of the question.

You can see that there are sharp edges, multiple local minima - all
the nasty stuff - plus the parameters are all constrained
to be between 0 and 1.
Here's a contour of a two-dimensional slice:

The point shows where my algorithm got stuck; it's optimal in both
directions, but not along the diagonal. I still haven't got an algorithm that can
reliably find the
best solution.
File translated from
TEX
by
TTH,
version 2.60.
On 7 Mar 2000, 10:42.