``Just get a bigger hammer'': alternate optimization methods

Comments on last problem set

Averaging/Jensen's inequality problem set

I hope this will be the last set of comments I make on Jensen's inequality: as I have said to a few people, although J.E. is an important idea, the concern and confusion I accidentally engendered over it has blown it all somewhat out of proportion. Two references that talk about Jensen's inequality are:

The only other general point I wanted to make was about the homework, where in going from s = 1 to s = 3 many people didn't remember to extend the range (for which I initially suggested m±3 s, which worked out to 17-23 for m = 20, s = 1). This gives the wrong answer, because you're no longer integrating over enough of the distribution. When you're doing this kind of numeric integration, it's always worth checking that you're getting enough of the distribution (i.e., your range is large enough and your Dz is small enough) by doing some calculation where you know what the right answer is. For example,

dz <- 0.01
N <- seq(17,23,by=dz)
pvec <- dnorm(N,mean=20,sd=3)
sum(pvec*dz)
gives an answer of 0.68. What we were hoping for is that åz1z2 P(z) Dz » òz1z2 P(z)  dz » ò-¥¥ P(z)  dz, which should be equal to 1 in this case because P is a probability distribution function. The fact that 0.68 ¹ 1 indicates that either Dz is too big or (more likely in this case) that {z1,z2} doesn't span a large enough range. This is just another example of sanity checking-always try to think of ways to cross-check your answers and look at them in different ways to make sure they make sense.

Dynamic models

A large category of models, but one that we haven't dealt with much in this class yet, is dynamic models, models where the expected values of the data are not some simple function of a static independent variable (light, nitrogen, predator density, etc.), but can only be derived from a dynamic model such as a differential equation or stochastic simulation.

Take the example (which we will cover in more detail a bit later in the course) of wildebeest dynamics in the Serengeti. Hilborn & Mangel present a model in which birth (calving) and death rates depend on population density and rainfall. In the best of all possible worlds, we would just go out and measure calving and death rates for in years with different population densities and rainfall. In the very best of all worlds we would find some way to vary these independent variables experimentally; in practice, there's no way we can do this-these observational and dynamic situations are the ones where we most often need models to help overcome our experimental and observational limitations (they cannot substitute for good observational and experimental design).

Since we cannot do the experiments, we use dynamic models instead: we define Nt+1 in terms of birth rates and death rates given Nt and the rainfall in year t. For each set of parameters we want to evaluate, we run the whole model forward in time and see how the predictions compare with the observed data. This comparison (e.g. the sum-of-squares of the difference between the model prediction for a particular set of parameters) becomes the likelihood or goodness-of-fit that we try to minimize/maximize with the tools we are collecting.

There are two possible ways that we can sometimes avoid using a complete dynamic model (which is computationally harder, and allows many steps for error/usually produces larger errors in parameters, than more straightforward models).

We'll see more dynamic models as we go along.

Overcoming fitting problems

Changing topics back to the technical details of fitting functions (whether they are dynamic models or simpler static functions of data): the quasi-Newton methods described in the last lecture are powerful and efficient when they work, but they don't always work.

Discontinuities etc.: Nelder-Mead (``simplex'', ``amoeba'')

The first problem that derivative-based methods (such as quasi-Newton) have are that they can be messed up when the goodness-of-fit function has sharp changes in it. (For a digression/example, see this example of how a threshold function causes sharp changes in the sum-of-squares function: that, or take this on faith and wait until we get to the example from chapter 6, which will have exactly this problem.) When the quasi-Newton method gets to a place on the graph with flat places, it gets faked into thinking it's found the answer (since the derivative is zero); when there's a vertical discontinuity in the graph, it gets faked into extrapolating wrong.

There are fancy methods that avoid the use of derivatives (you can look up Powell's method in Numerical Recipes), but a simple brute-force option is the simplex method invented by Nelder and Mead (called the ``amoeba'' in NR). Rather than starting with a single point guess at what the parameters are, this method picks a number of points which form a simplex-the simplest shape possible in an n-dimensional space. In two dimensions, this is three points (each of which represents a pair of parameter values) forming a triangle; in three dimensions, it's 4 points (each of which is a triplet of parameter values) forming a pyramid or tetrahedron; in higher dimensions, it's n+1 points (we call this shape a ``hyper-tetrahedron'' or just call it a simplex). We then evaluate the likelihood/sum-of-squares at each point, which is the ``height'' of the surface at that point, and move the worst point according to a simple set of rules:

Here's a graphical illustration (taken from NR) of those rules applied to a trapezoid:

N-M.gif

The amoeba works pretty well in a wide variety of situations, although it's not foolproof (nothing is) and it's not very efficient.

Here's what the simplex does with the surface we calculated for the exponential model in the last section.

simpsurf.gif

We give the amoeba (r = 4, d = 4) as starting coordinates and it displaces these coordinates slightly to get its starting simplex (black triangle). Thereafter, it takes steps alternating between simple reflection (blue triangles) and doubled reflection (green triangles), until it finds that it has gone too far and ``turns the corner'' to start decreasing r. Occasionally, a simple reflection is too far-usually when the amoeba is trying to turn a corner-and the triangle is light blue. In any case, the amoeba does eventually get to the right place, although it takes 48 cycles to get there.

Multiple minima: multiple-start Newton, Metropolis

However clever the Nelder-Mead simplex is, it can't deal with multiple minima: this is a particularly tough case. If there are only a few minima and their ``basins of attraction''-the set of starting points from which you can roll downhill into them-are all fairly large, the best thing to do is probably just to use a quasi-Newton algorithm, but to start from many different (random or stratified) starting points.

If this doesn't work-if there are lots of minima of different heights, so many that you'll never manage to fall into the right one-then another possibility is to use one of the many stochastic optimization algorithms, where you actually pick random numbers to add some ``noise'' and avoid local minima. The classic stochastic optimization algorithm is the Metropolis algorithm (or simulated annealing, which says (essentially):

A different variant of the Metropolis algorithm (Szymura-Barton, MSB) lets the size of the change in parameters vary rather than the probability of accepting a bad step, and changes the jump size automatically rather than according to a fixed schedule: every successful jump increases the jump size, while every unsuccessful jump decreases the jump size. This makes the algorithm good at exploring lots of local minima (every time it gets into a valley, it starts trying to get out) but really bad at refining estimates (it has a hard time actually getting all the way to the bottom of a valley).

Here's a snapshot of where the MSB algorithm tries to go on our familiar landscape:

metsurf.gif

Black dots are successful jumps, green are unsuccessful jumps.

Let's look at some statistics on how the Metropolis algorithm performed over time:

metstats.gif

It's actually pretty sloppy. The first two panels show the values of r and d over time, which although they're centered on reasonable values, fluctuate a whole lot. The third and fourth panels who the best-so-far values of r and d-most of the action takes place in the first 500 steps or so, after that there's only tiny incremental changes. The jump sizes for r and d (panels 5 and 6) go up to a rough asymptote in the first 500 steps, then stay there. The sum of squares (panel 7) varies wildly (note that this is on a log scale), and as suggested by panels 3 and 4 the best-so-far S.S. (panel 8) drops rapidly in the first 500 steps and then drops very slowly. Finally, the fraction of jumps accepted (panel 9) starts high while lots of improvement is taking place, and then drops down to about 0.5.

The Metropolis algorithm is not particularly good in this context, but can be a lifesaver when you have more complicated goodness-of-fit surfaces and you're willing to use brute force.

Constraints on parameters

The last technical detail that we'll cover (for now) is the problem of having to constrain parameter values within a particular range. There are many reasons this comes up, but the most common is when you only have parameters that make sense when they are positive or (say) between 0 and 1 (such as probabilities).

There are three basic approaches to this problem:

A cautionary note

Just for the heck of it, here's a goodness-of-fit surface that recently came up in a problem I'm working on. Unfortunately, the slice shown is a section of the variation of the G-O-F with variation in a single parameter: the entire model has 242 (!) parameters, so brute force is really out of the question.

ugly2.gif

You can see that there are sharp edges, multiple local minima - all the nasty stuff - plus the parameters are all constrained to be between 0 and 1.

Here's a contour of a two-dimensional slice:

prof2d-1.gif

The point shows where my algorithm got stuck; it's optimal in both directions, but not along the diagonal. I still haven't got an algorithm that can reliably find the best solution.


File translated from TEX by TTH, version 2.60.
On 7 Mar 2000, 10:42.