Curve fitting How-to

by W. Garrett Mitchener

This worksheet goes over traditional linear and non-linear least squares curve fitting and different ways to do it in Mathematica. It also goes over maximum likelihood curve fitting. Along the way, it shows different functions for finding maxima and minima of expressions.

Least squares and linear regression

Let's say you have some data and you want to fit a curve f to it, so that you can say the data is approximately described by y = f(x).

To illustrate, let's create some noisy data:

In[1]:=

Out[1]=

In[2]:=

Out[2]=

In Mathematica, the Fit function takes a list of points, a list of expressions, and a list of independent variables, and determines which linear combination of the given expressions produces the best fit to the data. If you want to fit a line to the data, your list of expressions should consist of 1 and x, since a line is a linear combination of a constant and a multiple of x:

In[3]:=

Out[3]=

In[4]:=

Out[4]=

In[5]:=

Out[5]=

And you can see that we've more or less recovered the line that we used to create the data in the first place.
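The notebook's actual data and Fit calls aren't reproduced here, but the same idea can be sketched outside Mathematica. Here is a rough Python/NumPy analogue; the intercept, slope, and noise level are made-up stand-ins, not the values the worksheet used:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so reruns are reproducible

# Made-up "true" line; the original worksheet's coefficients are not shown
true_intercept, true_slope = 2.0, 3.0
xs = np.linspace(0.0, 1.0, 50)
ys = true_intercept + true_slope * xs + rng.normal(0.0, 0.1, xs.shape)

# Least-squares fit of a linear combination of {1, x}, like Fit[data, {1, x}, x]
A = np.column_stack([np.ones_like(xs), xs])
(intercept, slope), *_ = np.linalg.lstsq(A, ys, rcond=None)
```

With only mild noise, the fitted intercept and slope land close to the values used to generate the data, just as in the worksheet.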

You can also do more interesting things:

In[6]:=

Out[6]=

In[7]:=

Out[7]=

For instance, this fits a second degree polynomial to the data:

In[8]:=

Out[8]=

In[9]:=

Out[9]=

Not a very good fit, is it? It doesn't have nearly enough bumps. Let's try something higher degree.

In[10]:=

Out[10]=

In[11]:=

Out[11]=

That looks a lot better. I knew to guess 5 for the degree because the data goes through four extrema.

In[12]:=

Out[12]=

But don't get carried away. If you give it too many degrees of freedom, it will start to fit the noise, as in this example:

In[13]:=

Out[13]=

In[14]:=

Out[14]=

In[15]:=

Out[15]=
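The overfitting point can be checked numerically in any language. A small Python sketch (with made-up bumpy data, not the notebook's): since a higher-degree polynomial basis contains the lower-degree one, the training residual can only go down as the degree goes up, even when the extra flexibility is just fitting noise.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.linspace(-1.0, 1.0, 40)
ys = np.sin(6.0 * xs) + rng.normal(0.0, 0.2, xs.shape)  # bumpy data plus noise

def sse(deg):
    """Sum of squared residuals of the degree-`deg` least-squares polynomial fit."""
    coeffs = np.polyfit(xs, ys, deg)
    return float(np.sum((np.polyval(coeffs, xs) - ys) ** 2))

sse_low, sse_high = sse(5), sse(12)  # higher degree always fits the sample at least as well
```

A smaller residual on the data you fitted is not evidence of a better curve; plotting the degree-12 fit between the data points would show the spurious wiggles.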

You can also do more exotic things:

In[16]:=

Out[16]=

In[17]:=

Out[17]=

but that doesn't look right somehow. In fact, I get vastly different plots every time I run this worksheet, which indicates that the random noise added to the data is having a huge impact on the cosine fit, which isn't the case for the fifth degree polynomial. That indicates that these cosines are not a good way to fit this data. (Think about it: Why?) The first step to getting a good fit is to know what functions to include.

Details on how Fit works.

The way Fit works is called least squares, because it minimizes the sum of the squared errors: the sum, over all the data points (x_i, y_i), of (f(x_i) - y_i)^2.

In Mathematica notation:

In[18]:=

Let's return to our linear data.

In[19]:=

Out[19]=

Here's how to define a general line function:

In[20]:=

Out[20]=

To fit this line to the data, we need to determine the intercept and slope such that the error is minimized.

Mathematica has several functions for finding the minimum of an expression. They all work a little differently. The Minimize function works algebraically:

In[21]:=

Out[21]=

The result is a list of two parts, which we'll call min and args: min is the minimum value it found, and args is a rule table. Here's how to use a rule table. First, just so we're clear on what's going on, we'll unpack the list returned by Minimize.

In[22]:=

Out[22]=

That defined min to be the minimum value, and args to be the rule table giving the values of the intercept and slope:

In[23]:=

Out[23]=

This was the function we were trying to fit:

In[24]:=

Out[24]=

And we can connect the general function to the specific values of the intercept and slope by using the /. operator. (You can also use the ReplaceAll function; /. is short-hand for ReplaceAll.)

In[25]:=

Out[25]=

In[26]:=

Out[26]=

If you want to define a function representing the fit:

In[27]:=

Out[27]=

In[28]:=

Out[28]=
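For comparison, here is the same minimize-the-squared-error idea sketched in Python with SciPy. The data and coefficients are stand-ins, not the notebook's; scipy.optimize.minimize plays roughly the role of NMinimize/FindMinimum here.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
xs = np.linspace(0.0, 1.0, 30)
ys = 1.0 + 4.0 * xs + rng.normal(0.0, 0.05, xs.shape)  # made-up linear data

def err(params):
    """Sum of squared errors of the line a + b*x, like the expression handed to Minimize."""
    a, b = params
    return float(np.sum((a + b * xs - ys) ** 2))

res = minimize(err, x0=[0.0, 0.0])  # numerical minimization from a starting point
a_fit, b_fit = res.x
```

Like FindMinimum, this works from a starting point; for a nice convex objective such as this one, any reasonable start converges to the same answer.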

The Minimize function works algebraically, which means it sometimes doesn't do quite what you'd like. It generally works well on polynomial problems, but if you give it a nasty enough transcendental problem, you're out of luck.

In[29]:=

Out[29]=

So, sometimes you get better results working just numerically. For that, try NMinimize.

In[30]:=

Out[30]=

Here's our linear least squares fit again:

In[31]:=

Out[31]=

Another numerical function is FindMinimum, which uses a different numerical algorithm. You have to specify a starting point for each unknown variable.

In[32]:=

Out[32]=

The Fit function does exactly this same minimization conceptually, but it only works if the fit function is a linear combination of known functions,

f(x) = a_1 f_1(x) + a_2 f_2(x) + ... + a_n f_n(x),

and the coefficients a_i are the only unknowns. This is the traditional method of curve fitting (predating modern computers that can do more powerful techniques almost as fast) because if f has this form, you can take a short cut from linear algebra and do the computation very quickly. Otherwise, the minimization can be computationally intensive and may get stuck at a local minimum instead of finding the global minimum.
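That linear-algebra short cut can be sketched directly: stack the basis functions into a design matrix F (one column per basis function) and solve the normal equations. A minimal Python sketch with a made-up quadratic and exact data:

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 20)
ys = 2.0 - 1.0 * xs + 0.5 * xs**2  # exact data from a made-up quadratic

# Design matrix: one column per basis function, like Fit[data, {1, x, x^2}, x]
F = np.column_stack([np.ones_like(xs), xs, xs**2])

# Normal equations (F^T F) a = F^T y; in practice np.linalg.lstsq is numerically safer
coeffs = np.linalg.solve(F.T @ F, F.T @ ys)
```

Solving one small linear system is all it takes, which is why this style of fitting was practical long before fast computers.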

Non-linear least squares

The linearization method

To do non-linear curve fitting with least squares, there are a couple of alternatives. One is to linearize the data first, then proceed using Fit.

As before, let's make up some noisy data to play with. It's basically a power function, a constant times a power of x, with noise added to the coefficient and the exponent.

In[33]:=

Out[33]=

In[34]:=

Out[34]=

We'd like to fit this to a power function and find the coefficient and the exponent.

In[35]:=

Out[35]=

But we can't use Fit, because the unknowns aren't in the right place. So, we linearize the data first. Assuming the power function is correct for our data, we can take the logarithm of both sides and get something in the right form for Fit: the log of y is a linear combination of 1 and the log of x, with the log of the coefficient and the exponent as the unknown coefficients.

So even though our data is not in the right form for Fit, it turns out that the logarithm of the data is. Here's an incantation to linearize the data. The trick is that /. can apply rules that involve patterns (see the Mathematica book 2.5), so this next command looks at the data and replaces each pair of the form {x_, y_} with {Log[x], Log[y]}. As in function definitions, the _ on the x and y indicates that these are pattern variables. Without the _, Mathematica will think you mean to replace only the literal symbols x and y. And you don't use the _ on the right hand side of the rule or in a function definition, only on the left.

In[36]:=

Out[36]=

In[37]:=

Out[37]=

Now we can use Fit:

In[38]:=

Out[38]=

In[39]:=

Out[39]=

In[40]:=

Out[40]=

And you can see that this is pretty close to the function we started with:

In[41]:=

Out[41]=

In[42]:=

Out[42]=

In[43]:=

Out[43]=
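The linearization trick can be sketched in Python too. With made-up noiseless data y = c x^p, fitting a line to (log x, log y) recovers log c and p essentially exactly:

```python
import numpy as np

c_true, p_true = 2.5, 1.7          # made-up power-function parameters
xs = np.linspace(1.0, 5.0, 25)
ys = c_true * xs ** p_true         # noiseless, for the sake of the check

# Linearize: log y = log c + p log x, then do an ordinary linear least-squares fit
p, log_c = np.polyfit(np.log(xs), np.log(ys), 1)  # highest-degree coefficient first
c = np.exp(log_c)
```

With noisy data the recovered parameters are only approximate, and (as the next section shows) not the same as what direct minimization gives.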

The direct least squares method

Since Mathematica can directly minimize various expressions, we can also skip the linearization step and minimize the error directly:

In[44]:=

Out[44]=

In[45]:=

Out[45]=

In[46]:=

Out[46]=

In[47]:=

Out[47]=

And you should be able to see that the function found by the linearization method isn't quite the same as the one found by the direct method:

In[48]:=

Out[48]=

Both functions are actually the optimum fit to the data, but under different notions of distance. And notice that neither one is exact compared to what we started with. For example, here's what I got on one run of this worksheet:

Out[55]=

Since the data set is constructed with random noise, the results will be a little different each time you run it. But neither of these gives back the original function exactly. (Think about it: Should they?) And which curve is "better"?

This disagreement between the results illustrates the two fundamental conceptual problems in curve fitting: (1) guessing the appropriate form of the function, and (2) determining an appropriate notion of "distance" between the curve and the data to minimize.
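The direct method has a standard Python counterpart in scipy.optimize.curve_fit, which minimizes the sum of squared errors over the parameters of an arbitrary model function. A sketch with made-up power-law data (not the notebook's):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(3)
xs = np.linspace(1.0, 5.0, 40)
ys = 2.5 * xs ** 1.7 + rng.normal(0.0, 0.1, xs.shape)  # noisy power law

def power(x, c, p):
    """Model whose parameters c and p are the unknowns."""
    return c * x ** p

# Nonlinear least squares, starting from an initial guess p0
(c_fit, p_fit), _ = curve_fit(power, xs, ys, p0=[1.0, 1.0])
```

Because this minimizes squared error in y rather than in log y, it generally lands on slightly different parameters than the linearization method, which is exactly the disagreement discussed above.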

Maximum likelihood

An alternative notion of distance that is appropriate for modeling changing probabilities is likelihood. This notion assumes that each observation is the outcome of a random process that depends on an independent variable, such as distance or time, and on the parameters that we want to find. Then, the likelihood of the data is the probability of getting exactly the observations we got. The curve fit procedure is to determine values of the parameters such that the likelihood of the data is maximal.

As an example, let's suppose we have a biological experiment that succeeds or fails, for example, getting bacteria to accept a fragment of DNA. Let's suppose that the probability of success depends on the temperature T. Furthermore, let's suppose that we have some knowledge of the biochemistry involved that tells us that the probability of success is actually an exponential function of T, with an unknown coefficient and an unknown rate. We are given results from experiments run at different temperatures, reported simply as success or failure, and our job is to find those two parameters.

First, let's invent some data, using specific values for the two parameters.

In[49]:=

Out[49]=

Random[] returns a random number between 0 and 1, so the test Random[] < p returns True with probability p and False with probability 1 - p. That means we can use it to simulate our experiment. We'll also assume that this experiment is fairly expensive and time consuming, so we can't run zillions of experiments.
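The same simulate-by-thresholding trick looks like this in Python. The success-probability curve here is a made-up decreasing exponential, standing in for whatever values the notebook used:

```python
import math
import random

random.seed(4)  # fixed seed so reruns match

def p_success(t):
    """Hypothetical success probability, a decreasing exponential in temperature t."""
    return 0.9 * math.exp(-0.1 * t)

# random.random() < p is True with probability p and False with probability 1 - p
temps = list(range(30))
outcomes = [random.random() < p_success(t) for t in temps]  # one experiment per temperature
```

Each entry of outcomes is a single success/failure observation, which is all the information the experiment gives us at that temperature.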

In[50]:=

In[51]:=

Out[51]=

Here's a way to plot that data. Do you see how this command works?

In[52]:=

Out[52]=

The data we have isn't a list of points along a curve, which is what we'd need to do traditional curve fitting. In other words, we want to fit a curve but we don't have points from the curve plus noise like we did in earlier examples; we have something else entirely. Intuitively, it seems impossible for least squares to give anything useful for this type of data, so instead, we do maximum likelihood. The experiments are independent, so the probability of getting all this data is the product of the probabilities of getting each point. But the probability of getting each point depends on the two unknown parameters.

In[53]:=

So for example:

In[54]:=

Out[54]=

In[55]:=

Out[55]=

These are tiny numbers, and that causes trouble with the numerical maximization process, so instead of maximizing the likelihood, we maximize the log likelihood. (Think about it: Why can we do this?)

We could do this, but then Mathematica will compute that tiny likelihood, then take the log.

In[56]:=

A better way to compute the same thing is to expand the log of the product into a sum of logs. I'll do this computation using different notation:

In[57]:=

Just to check that we did this right, these should be the same number:

In[58]:=

Out[58]=

In[59]:=

Out[59]=
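The product-versus-sum point is easy to check numerically in any language. A sketch with made-up per-observation probabilities:

```python
import math

probs = [0.02, 0.15, 0.007, 0.6, 0.03]  # made-up probabilities of the individual observations

likelihood = math.prod(probs)               # tiny number: product of the probabilities
log_lik_via_product = math.log(likelihood)  # works here, but underflows for long data sets
log_lik_via_sum = sum(math.log(p) for p in probs)  # numerically safer equivalent form
```

With only five observations both routes agree to machine precision; with hundreds, the product underflows to 0 while the sum of logs stays perfectly well behaved.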

Here's the first try at running the fit:

In[60]:=

Out[60]=

That means it couldn't solve the problem algebraically. So, let's try the numerical methods:

In[61]:=

Out[61]=

You probably got an error message, either that an overflow occurred, or that it got a complex number somewhere along the way. That means that NMaximize is trying values of the parameters that yield logs of negative numbers, or perhaps the log of 0. Let's use the constraint feature of NMaximize to give it some additional hints: we ask it to maximize the log likelihood, but subject to some reasonable constraints. We know the exponential should decrease as the temperature increases, so we add in a constraint to that effect.

In[62]:=

Out[62]=

Sometimes that works, sometimes it doesn't, depending on what random numbers appeared in our simulated experiment. Let's try FindMaximum since it takes a starting point.

In[63]:=

Out[63]=

That probably didn't work because we hit something like the log of a nonpositive number somewhere along the way if we start at a bad initial point. Let's try a different starting point:

In[64]:=

Out[64]=

This is pretty good, at least the time I ran it. Sometimes you have to poke around with different starting points to get reasonable results. There are also zillions of options, and you can spend lots of time playing with them, tweaking the numerical method, but unless you're desperate or know what all the tweaks mean, doing that is often a waste of time.

The result isn't perfect, but it has a definite interpretation: these values of the parameters are the ones such that the probability of getting our observations is maximal. And they're reasonably close to the exact answer, which is great given how little information our data actually contains.
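The whole maximum likelihood procedure can be sketched in Python as well. Everything here is made up (model form, "true" parameters, data); SciPy's optimizers minimize, so we minimize the negative log likelihood, and we clip the probabilities away from 0 and 1 for the same reason the worksheet needed constraints: to keep the logs finite.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up experiment: success probability a*exp(-b*t), with "true" a = 0.9, b = 0.1
rng = np.random.default_rng(5)
temps = np.arange(0, 40, dtype=float)
p_true = 0.9 * np.exp(-0.1 * temps)
outcomes = rng.random(temps.shape) < p_true  # True = success, one trial per temperature

def neg_log_lik(params):
    """Negative log likelihood of the observed successes/failures."""
    a, b = params
    p = a * np.exp(-b * temps)
    p = np.clip(p, 1e-12, 1 - 1e-12)  # keep the logs finite
    return -float(np.sum(np.where(outcomes, np.log(p), np.log(1 - p))))

x0 = [0.5, 0.05]           # starting point, as FindMaximum requires
res = minimize(neg_log_lik, x0)
a_fit, b_fit = res.x
```

As in the worksheet, a bad starting point can send the search through undefined territory, and the recovered parameters are only as good as the small amount of information in 40 yes/no observations.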

Created by Mathematica (January 20, 2005)