Updating Least Squares

In previous installments, we covered the Linear Least Squares method for determining parameters of a linear model, starting with a simple input/output model, and working towards a dynamic state transition model. For the dynamic model, the model predicts output values based on a given set of current input values and the current internal state variables. The predictions are calculated as a sum of the input values times constant coefficient values.

Those coefficients are the model parameters. Getting good estimates of multiple parameter values from noisy data can require collecting thousands (tens of thousands, hundreds of thousands) of input/output data combinations. That means an independent data matrix M with thousands of rows. Sometimes it is no big deal: bulky or not, you just grind through the data once and you are done. But if you have to do this crunching a lot, it means reserving quite a lot of system resources. First, you need plenty of memory just to retain all the data. Then, at the finish, there is a long delay for all of the CPU cycles needed to perform the calculations. Is there some way to utilize the data "on the fly" as collected, rather than accumulating everything and then concentrating all the processing at the end?

Introducing rank-one updates

Suppose that n items of data were accumulated to build the input data matrix M_n of independent ("input") values. The linear model uses the "multipliers" in the parameter vector p_n to "predict" the corresponding output values Y_n. If the model were perfect, and if the data were perfect, the following conditions would all be satisfied exactly.

M_{n} p_{n} = Y_{n}

However, in the real world these conditions can't all be satisfied perfectly. As we saw in preceding installments, we can solve the Normal Equations to obtain an approximate solution that is "best" according to the least sum of squared errors criterion.

{M_{n}}^{T} M_{n} \cdot p_{n} = {M_{n}}^{T} \cdot Y_{n}

In most non-pathological cases, this "necessary condition" yields a unique solution that is the best-fit solution we seek.

Now suppose that this is only a partial solution; the solution we actually want includes all of the previous constraint data, plus one more input/output relationship. We can modify the original problem by including the data terms for the new constraint as the new top row of the data matrix. All of the rows previously present are still there, covered by the M_n matrix terms. Similar for the corresponding output values, a new output term y_n+1 is added to the top of the output data vector, with the previous output terms Y_n still present below.

M_{N + 1} \equiv [\begin{matrix} m_{n + 1} \\ M_{n} \end{matrix}] Y_{n + 1} \equiv [\begin{matrix} y_{n + 1} \\ Y_{n} \end{matrix}] p_{n + 1} \equiv p_{n} + Δ p

The new Normal Equations formulation to solve for this adjusted best fit problem is:

{M_{n + 1}}^{T} M_{n + 1} \cdot p_{n + 1} = {M_{n + 1}}^{T} \cdot Y_{n + 1}

In this solution, we can expect the new information to result in a new solution p_n+1 differing from the previous best estimate by the small amount Δp.

If we substitute the notations from above, we can see the roles that previous data and new data play in producing the updated solution.

({m_{n + 1}}^{T} m_{n + 1} + {M_{n}}^{T} M_{n}) \cdot (p_{n} + Δ p) = ({m_{n + 1}}^{T} \cdot y_{n + 1} + {M_{n}}^{T} \cdot Y_{n})

This provides a scheme to construct the Normal Equations incrementally.

Let k be the number of parameters to be evaluated. Start by constructing the k x k matrix M₀^TM₀ and a k-length vector M₀^TY₀ with all matrix and vector terms set to 0.
Begin adding constraint rows. For the first input/output constraint condition to be added to the system, construct from the row of independent data values the peculiar matrix product m₁^Tm₁ that forms a square matrix of size k x k but rank 1. This matrix is summed into the previously constructed matrix storage, M₀, to form the updated matrix M₁. On the other side of the equation system under construction, the new data vector is transposed into a column vector form m₁^T, scaled by the new output value y₁, and this is summed with the previous right-side accumulation Y₀ to form the updated vector Y₁.
With the next row of independent values and corresponding output value, the M₁ matrix is updated to form M₂, and the Y₁ vector is updated to form Y₂.
Repeat the same way until the entire input output data set is processed.
The value of the modified p parameter vector? No need to worry about this until the time comes to compute it -- which can be after any update.

Once all of the updates are done, the Normal Equations are all set up and ready to solve.

Some observations

The point of this scheme is that it is no longer necessary to accumulate a huge collection of input-output data. All you need to carry forward from one step to the next is one k x k matrix and one vector of size k.
This processing is neither more nor less efficient than the calculations you would perform to set up the Normal Equations solution in the usual manner. All this does is spread the work more evenly over time.
The rank-one updates (and similar lower-rank updates) are not iterative approximations. You will get exactly the same result from building up normal equations in this sequential manner or from one large monolithic matrix construction.
You probably should not discard all of the original input/output data after updating the Normal Equations. You will need some of that data to verify your fitting results.

Some extensions

Instead of a single new constraint with single output value added at each update, the updating process is equally valid when groups of updates are accumulated and incorporated together as one batch mode update.
The method is easily adapted for the case of more than one output variables. Go ahead and try this!

For next time

When constructing the Normal Equations one constraint row at a time, the single most time-consuming operation is the matrix inverse you must calculate to evaluate parameters. This is not a big deal if done rarely. But if you need updated parameter values frequently (in the worst case, as soon as possible after each new update), some modifications to be discussed later can streamline the calculations.

Updating and Sequential Least Squares

Introducing rank-one updates

Some observations

Some extensions

For next time