I am a mathematician-programmer. The biggest leap in my career came when I learned to say: "I don't understand anything!" Now I am not ashamed to tell a luminary of science who is giving me a lecture that I do not understand what he is talking about. And it is very hard. Yes, it is difficult and embarrassing to admit your ignorance. Who likes to admit that they don't know the basics of some field? Because of my profession I have to attend a great many presentations and lectures where, I confess, in the overwhelming majority of cases I want to sleep, because I don't understand anything. And I don't understand because a huge problem of the current situation in science lies in mathematics: it assumes that all listeners are familiar with absolutely all areas of mathematics (which is absurd). It is a shame to admit that you don't know what a derivative is (more about that a little later).

But I have learned to say that I don't know what multiplication is. Yes, I don't know what a subalgebra over a Lie algebra is. Yes, I don't know why quadratic equations are needed in life. By the way, if you are sure that you do know, then we have something to talk about! Mathematics is a series of tricks. Mathematicians try to confuse and intimidate the public; where there is no confusion there is no reputation and no authority. Yes, it is prestigious to speak in as abstract a language as possible, which is complete nonsense in itself.

Do you know what a derivative is? Most likely you will tell me about the limit of the difference quotient. In the first year of Mathematics and Mechanics at St. Petersburg State University, Viktor Petrovich Khavin defined the derivative as the coefficient of the first term of the Taylor series of a function at a point (it was a separate gymnastics to define the Taylor series without derivatives). I laughed at this definition for a long time until I finally understood what it was about. The derivative is nothing more than a measure of how much the function we are differentiating resembles the functions y = x, y = x², y = x³.
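To make this concrete, here is the same idea written out in standard Taylor-series notation (my notation, not the lecture's): if near a point a the function can be expanded as

f(x) = c_0 + c_1 (x-a) + c_2 (x-a)^2 + c_3 (x-a)^3 + \dots

then the derivative of f at a is just the coefficient c_1 of the linear term. It measures how much f locally looks like the function y = x (shifted and scaled), exactly as c_2 measures how much it looks like y = x².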

I now have the honor of lecturing to students who are afraid of mathematics. If you are afraid of mathematics, we are fellow travelers. As soon as you try to read some text and it seems to you that it is overly complicated, know that it is badly written. I maintain that there is not a single area of mathematics that cannot be talked about "on the fingers" without losing accuracy.

A task for the near future: I have instructed my students to understand what a linear-quadratic regulator is. Don't be shy, spend three minutes of your life and follow the link. If you don't understand anything, then we are in the same boat. I (a professional mathematician-programmer) didn't understand anything either. And I assure you that it can be figured out "on the fingers". At the moment I don't know what it is, but I assure you that we will be able to figure it out.

So, the first lecture that I am going to give my students, after they come running to me in horror saying that a linear-quadratic regulator is a terrible beast they will never master in their life, is the method of least squares. Can you solve linear equations? If you are reading this text, then most likely not.

So, given two points (x0, y0), (x1, y1), for example, (1,1) and (3,2), the problem is to find the equation of a straight line passing through these two points:

illustration

This line must have an equation like the following:

Here alpha and beta are unknown to us, but we know two points of this straight line:

You can write this equation in matrix form:

A lyrical digression is in order here: what is a matrix? A matrix is nothing more than a two-dimensional array. It is a way of storing data; you should not attach any more importance to it than that. It is up to us how exactly to interpret a given matrix. Periodically I will interpret it as a linear map, periodically as a quadratic form, and sometimes simply as a set of vectors. All this will be clarified in context.

Let's replace specific matrices with their symbolic representations:

Then (alpha, beta) can be found easily:

More specifically for our previous data:

Which leads to the following equation of the straight line passing through the points (1,1) and (3,2):
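Assuming the line model alpha·x + beta = y (which matches the matrix columns used below), the two-point computation written out in LaTeX notation is:

\begin{cases} \alpha x_0 + \beta = y_0 \\ \alpha x_1 + \beta = y_1 \end{cases}
\qquad\Longleftrightarrow\qquad
\begin{pmatrix} x_0 & 1 \\ x_1 & 1 \end{pmatrix}
\begin{pmatrix} \alpha \\ \beta \end{pmatrix} =
\begin{pmatrix} y_0 \\ y_1 \end{pmatrix},
\qquad
\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = A^{-1} b.

For the points (1,1) and (3,2) this gives alpha = 1/2 and beta = 1/2, i.e. the line y = x/2 + 1/2.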

Okay, everything is clear here. Let's find the equation of the straight line passing through three points: (x0, y0), (x1, y1) and (x2, y2):

Oh-oh-oh, but we have three equations for two unknowns! A standard mathematician will say that there is no solution. What will the programmer say? For a start, he will rewrite the previous system of equations in the following form:

In our case, the vectors i, j, b are three-dimensional, therefore (in the general case) there is no solution to this system. Any vector (alpha·i + beta·j) lies in the plane spanned by the vectors (i, j). If b does not belong to this plane, then there is no solution (equality in the equation cannot be achieved). What to do? Let's look for a compromise. Let's denote by e(alpha, beta) exactly how far we have fallen short of equality:

And we will try to minimize this error:

Why square?

We are looking not just for the minimum of the norm, but for the minimum of the square of the norm. Why? The minimum point itself is the same, and the square gives a smooth function (a quadratic function of the arguments (alpha, beta)), while the length alone gives a cone-like function that is not differentiable at the minimum point. Brr. The square is more convenient.
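Written out (with i, j, b as above), the error vector and the objective are:

e(\alpha, \beta) = \alpha\,\mathbf{i} + \beta\,\mathbf{j} - \mathbf{b},
\qquad
\|e(\alpha, \beta)\|^2 = \sum_{k=0}^{2} (\alpha x_k + \beta - y_k)^2 \;\to\; \min_{\alpha,\beta}.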

Obviously, the error is minimized when the vector e is orthogonal to the plane spanned by the vectors i and j.

Illustration

In other words: we are looking for a line such that the sum of the squared lengths of the distances from all points to this line is minimal:

UPDATE: here I made a blunder; the distance to the line should be measured vertically (along the y axis), not as an orthogonal projection. This commenter is right.

Illustration

Put quite differently (careful, this is poorly formalized, but it should be clear "on the fingers"): we take all possible straight lines between all pairs of points and look for the average line among them:

Illustration

Another explanation "on the fingers": we attach a spring between every data point (here we have three of them) and the straight line we are looking for; the straight line in the equilibrium state is exactly what we are looking for.

Minimum of a quadratic form

So, given the vector b and the plane spanned by the column vectors of the matrix A (in this case (x0, x1, x2) and (1,1,1)), we are looking for a vector e of minimum squared length. Obviously, the minimum is attainable only for a vector e orthogonal to the plane spanned by the column vectors of the matrix A:

In other words, we are looking for a vector x = (alpha, beta) such that:

Let me remind you that this vector x = (alpha, beta) is the minimum point of the quadratic function ||e(alpha, beta)||²:
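The orthogonality condition, written out (A^T is the transpose of A):

A^T\,(A x - b) = 0
\qquad\Longrightarrow\qquad
A^T A\, x = A^T b
\qquad\Longrightarrow\qquad
x = (A^T A)^{-1} A^T b.

These are the classical normal equations; here A^T A is just a 2x2 matrix, so solving the system is cheap.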

Here it will be useful to remember that a matrix can be interpreted as a quadratic form; for example, the identity matrix ((1,0), (0,1)) can be interpreted as the function x² + y²:

quadratic form
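In symbols (a standard identity, added here for completeness):

\begin{pmatrix} x & y \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix} = x^2 + y^2,
\qquad\text{and in general}\qquad
\|Ax - b\|^2 = x^T (A^T A)\, x - 2\, b^T A\, x + b^T b,

so the quadratic form we are minimizing is given precisely by the matrix A^T A.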

All this gymnastics is known as linear regression.
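To tie the pieces together, here is a minimal self-contained sketch (mine, not taken from the original code base) that fits a line y = alpha*x + beta to a handful of points by solving the 2x2 normal equations directly:

#include <cstdio>
#include <vector>

// Least-squares fit of y = alpha*x + beta: solve (A^T A)(alpha, beta)^T = A^T b,
// where the rows of A are (x_i, 1) and b holds the y_i.
int main() {
    std::vector<double> x = {1, 3, 4};   // sample data, three points
    std::vector<double> y = {1, 2, 4};
    double sxx = 0, sx = 0, sxy = 0, sy = 0, n = x.size();
    for (size_t i = 0; i < x.size(); i++) {
        sxx += x[i]*x[i];
        sx  += x[i];
        sxy += x[i]*y[i];
        sy  += y[i];
    }
    // Normal equations:  sxx*alpha + sx*beta = sxy
    //                    sx *alpha + n *beta = sy
    double det   = sxx*n - sx*sx;              // nonzero unless all x_i coincide
    double alpha = (sxy*n - sx*sy) / det;
    double beta  = (sxx*sy - sx*sxy) / det;
    std::printf("y = %g*x + %g\n", alpha, beta);
    return 0;
}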

Laplace's equation with the Dirichlet boundary condition

Now the simplest real task: there is a certain triangulated surface, you need to smooth it. For example, let's load my face model:

The initial commit is available. To minimize external dependencies, I took the code of my software renderer, which has already been described on Habr. To solve the linear system I use OpenNL, an excellent solver that is, however, very difficult to install: you need to copy two files (.h + .c) into the folder with your project. All the smoothing is done by the following code:

for (int d=0; d<3; d++) {
    nlNewContext();
    nlSolverParameteri(NL_NB_VARIABLES, verts.size());
    nlSolverParameteri(NL_LEAST_SQUARES, NL_TRUE);
    nlBegin(NL_SYSTEM);
    nlBegin(NL_MATRIX);
    for (int i=0; i<(int)verts.size(); i++) {   // data term: stay close to the original position
        nlBegin(NL_ROW);
        nlCoefficient(i, 1);
        nlRightHandSide(verts[i][d]);
        nlEnd(NL_ROW);
    }
    for (unsigned int i=0; i<faces.size(); i++) {   // smoothing term: one row per triangle edge
        auto &face = faces[i];
        for (int j=0; j<3; j++) {
            nlBegin(NL_ROW);
            nlCoefficient(face[j], 1);
            nlCoefficient(face[(j+1)%3], -1);
            nlEnd(NL_ROW);
        }
    }
    nlEnd(NL_MATRIX);
    nlEnd(NL_SYSTEM);
    nlSolve();
    for (int i=0; i<(int)verts.size(); i++) {
        verts[i][d] = nlGetVariable(i);
    }
}

The X, Y, and Z coordinates are separable, so I smooth them separately; that is, I solve three systems of linear equations, each with a number of variables equal to the number of vertices in my model. The first n rows of the matrix A have only a single 1 per row, and the first n entries of the vector b are the original model coordinates. That is, I attach a spring between the new vertex position and the old vertex position: the new ones should not stray too far from the old ones.

All subsequent rows of the matrix A (faces.size()*3 = the number of edges of all triangles in the mesh) have one entry equal to 1 and one entry equal to -1, with a zero in the corresponding entry of the vector b. This means I hang a spring on every edge of our triangular mesh: all edges try to make their start and end vertices coincide.

Once again: all vertices are variables, and they cannot move far from their original position, but at the same time they try to become similar to each other.
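In other words, per coordinate the code above minimizes an energy of the following form (notation added here: v_i are the new positions, v_i^0 the original ones, E the set of mesh edges):

\sum_i (v_i - v_i^0)^2 \;+\; \sum_{(i,j)\in E} (v_i - v_j)^2 \;\to\; \min.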

Here's the result:

Everything would be fine: the model is indeed smoothed, but it has drifted away from its original boundary. Let's change the code a bit:

for (int i=0; i<(int)verts.size(); i++) {
    float scale = border[i] ? 1000 : 1;   // boundary vertices get a much stiffer spring
    nlBegin(NL_ROW);
    nlCoefficient(i, scale);
    nlRightHandSide(scale*verts[i][d]);
    nlEnd(NL_ROW);
}

In our matrix A, for the vertices that lie on the boundary, I add not the row v_i = verts[i][d] but 1000*v_i = 1000*verts[i][d]. What does this change? It changes our quadratic error measure. Now a unit deviation at a boundary vertex will cost not one unit, as before, but 1000*1000 units. That is, we hang a stronger spring on the boundary vertices; the solution will prefer to stretch the other ones instead. Here's the result:

Let's double the springs between the vertices:
nlCoefficient(face[j], 2); nlCoefficient(face[(j+1)%3], -2);

It is logical that the surface has become smoother:

And now it is even a hundred times stronger:

What is this? Imagine dipping a wire ring into soapy water. The soap film that forms will try to have the smallest possible curvature while touching the border, our wire ring. This is exactly what we got by fixing the boundary and asking for a smooth surface inside. Congratulations, we have just solved the Laplace equation with Dirichlet boundary conditions. Sounds cool? In reality, though, it is just one system of linear equations to solve.
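For reference, the continuous statement and its discrete counterpart (the notation is mine; f is the unknown surface, g the fixed boundary values):

\Delta f = 0 \ \text{inside}\ \Omega, \qquad f|_{\partial\Omega} = g.

On the mesh, with the boundary fixed and only the edge springs acting, each interior vertex settles at the average of its neighbors N(i),

v_i = \frac{1}{|N(i)|}\sum_{j\in N(i)} v_j,

which is the discrete version of "no curvature inside, prescribed values on the border".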

Poisson's equation

Let's remember another cool name.

Suppose I have a picture like this:

Everything is fine, except that I don't like the chair.

I will cut the picture in half:



And I will select the chair by hand:

Then I pull everything that is white in the mask toward the left half of the picture, and at the same time, over the whole picture, I require that the difference between two neighboring pixels be equal to the difference between the corresponding neighboring pixels of the right picture. The constraint rows are assembled in a loop over all pixels, following the same OpenNL pattern as before; a sketch is given below.
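A minimal sketch of that setup, per color channel. The image containers (left, right, mask, result), the flat pixel indexing and the weight 100 are illustrative assumptions; only the OpenNL calls are the ones already used in the listings above.

for (int d = 0; d < 3; d++) {
    nlNewContext();
    nlSolverParameteri(NL_NB_VARIABLES, w*h);
    nlSolverParameteri(NL_LEAST_SQUARES, NL_TRUE);
    nlBegin(NL_SYSTEM);
    nlBegin(NL_MATRIX);
    for (int p = 0; p < w*h; p++) {
        if (mask[p]) {                       // strong spring: pull masked pixels to the left image
            nlBegin(NL_ROW);
            nlCoefficient(p, 100);
            nlRightHandSide(100*left[p][d]);
            nlEnd(NL_ROW);
        }
        int x = p % w, y = p / w;
        if (x + 1 < w) {                     // horizontal neighbors: match the right image's gradient
            nlBegin(NL_ROW);
            nlCoefficient(p + 1,  1);
            nlCoefficient(p,     -1);
            nlRightHandSide(right[p + 1][d] - right[p][d]);
            nlEnd(NL_ROW);
        }
        if (y + 1 < h) {                     // vertical neighbors: same idea
            nlBegin(NL_ROW);
            nlCoefficient(p + w,  1);
            nlCoefficient(p,     -1);
            nlRightHandSide(right[p + w][d] - right[p][d]);
            nlEnd(NL_ROW);
        }
    }
    nlEnd(NL_MATRIX);
    nlEnd(NL_SYSTEM);
    nlSolve();
    for (int p = 0; p < w*h; p++) result[p][d] = nlGetVariable(p);
}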

Here's the result:

Real life example

I deliberately did not polish the results. I just wanted to show exactly how you can apply the least squares method; this is tutorial code. Now let me give an example from real life:

I have a number of photos of fabric samples like this:

My task is to make seamless textures from photos of this quality. First, I (automatically) look for a repeating pattern:

If I cut out this quadrilateral directly, then due to distortion, the edges will not converge, here is an example of a pattern repeated four times:

Hidden text

Here is a snippet where the seam is clearly visible:

Therefore, I will not cut along a straight line, here is the cut line:

Hidden text

And here is a pattern repeated four times:

Hidden text

And a fragment of it, to make it clearer:

Even better: the cut no longer goes along a straight line but winds around all sorts of curls, yet the seam is still visible because of the uneven lighting in the original photograph. This is where the least squares method for Poisson's equation comes to the rescue. Here is the final result after evening out the lighting:

The texture came out perfectly seamless, and it was all automatic from a very mediocre photo. Do not be afraid of math, look for simple explanations, and you will have engineering happiness.

The least squares method finds the widest application in various fields of science and practice: physics, chemistry, biology, economics, sociology, psychology, and so on. By the will of fate I often have to deal with economics, and therefore today I will issue you a ticket to an amazing country called Econometrics =) ... What do you mean, you don't want to go?! It's very nice there, you just have to make up your mind! ... But what you probably definitely do want is to learn how to solve problems by the least squares method. And especially diligent readers will learn to solve them not only faultlessly but also VERY FAST ;-) But first, the general statement of the problem and a related example:

Suppose that in some subject area we study indicators that have a quantitative expression. At the same time, there is every reason to believe that indicator y depends on indicator x. This assumption can be a scientific hypothesis or can be based on elementary common sense. Let's leave science aside, however, and explore more appetizing areas, namely grocery stores. Let us denote by:

x - the retail space of a grocery store, in sq. m.,
y - the annual turnover of the grocery store, in million rubles.

It is absolutely clear that the larger the store's area, the greater its turnover will be in most cases.

Suppose that after observing / experimenting / calculating / dancing with a tambourine, we have numerical data at our disposal:

With grocery stores, I think, everything is clear: x1 is the area of the 1st store, y1 is its annual turnover, x2 is the area of the 2nd store, y2 is its annual turnover, and so on. By the way, it is not at all necessary to have access to classified materials: a fairly accurate estimate of the turnover can be obtained by means of mathematical statistics. However, let's not get distracted; the course in commercial espionage is paid separately =)

The tabular data can also be written as points and depicted in the familiar Cartesian coordinate system.

Let's answer an important question: how many points do you need for a qualitative study?

The more, the better. The minimum admissible set consists of 5-6 points. In addition, when the amount of data is small, "anomalous" results must not get into the sample. For example, a small elite store may earn orders of magnitude more than "its colleagues", thereby distorting the general pattern that we need to find!

To put it quite simply, we need to choose a function whose graph passes as close as possible to the points. Such a function is called an approximating (from "approximation") or theoretical function. Generally speaking, an obvious "contender" immediately appears: a polynomial of high degree whose graph passes through ALL the points. But this option is complicated and often simply incorrect (the graph will "wiggle" all the time and will reflect the main trend poorly).

Thus, the sought function should be reasonably simple and at the same time reflect the dependence adequately. As you might guess, one of the methods for finding such functions is called the least squares method. First, let us look at its essence in general terms. Let some function approximate the experimental data:


How do we evaluate the accuracy of this approximation? Let us calculate the differences (deviations) between the experimental and the functional values (study the drawing). The first thought that comes to mind is to estimate how large the sum of these deviations is, but the problem is that the differences can be negative, and in such a summation the deviations will cancel each other out. Therefore, as an estimate of the accuracy of the approximation, it is tempting to take the sum of the absolute values (moduli) of the deviations:

or, written compactly: (suddenly, for those who do not know: Σ is the summation sign, and i is an auxiliary "counter" variable that runs from 1 to n).

Approximating the experimental points with different functions, we will obtain different values of this sum, and obviously the function for which this sum is smaller is the more accurate one.

Such a method exists, and it is called the method of least moduli. However, in practice the least squares method has become far more widespread; here the possible negative values are eliminated not by the modulus but by squaring the deviations:

, after which efforts are directed toward selecting a function such that the sum of the squared deviations is as small as possible. This, in fact, is where the name of the method comes from.
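The two criteria just described, written out (notation added: (x_i, y_i) are the data points, f the candidate function, n the number of points):

\sum_{i=1}^{n} |y_i - f(x_i)| \to \min \quad\text{(least moduli)},
\qquad
\sum_{i=1}^{n} (y_i - f(x_i))^2 \to \min \quad\text{(least squares)}.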

And now we return to another important point: as noted above, the selected function should be quite simple, but there are also many such functions: linear, hyperbolic, exponential, logarithmic, quadratic, etc. And, of course, one would immediately like to "narrow the field of activity". Which class of functions should we choose for the study? A primitive but effective technique:

- The easiest way is to plot the points and analyze their location. If they tend to lie along a straight line, then we should look for the equation of a straight line y = ax + b with optimal values of a and b. In other words, the task is to find SUCH coefficients that the sum of the squared deviations is the smallest.

If the points are located, for example, along a hyperbola, then it is a priori clear that a linear function will give a poor approximation. In this case we look for the most "favorable" coefficients of the hyperbola equation, those that give the minimum sum of squares.

Now, note that in both cases we are talking about functions of two variables whose arguments are parameters of wanted dependencies:

And in essence, we need to solve a standard problem - to find minimum function of two variables.

Let's recall our example: suppose that the "store" points tend to lie along a straight line and there is every reason to believe a linear dependence of the turnover on the retail space. Let's find SUCH coefficients "a" and "b" that the sum of the squared deviations is the smallest. Everything is as usual: first the 1st-order partial derivatives. According to the linearity rule, you can differentiate directly under the summation sign:

If you want to use this information for an essay or course book, I will be very grateful for the link in the list of sources, you will find such detailed calculations in few places:

Let's compose a standard system:

We cancel the factor of two in each equation and, in addition, "break up" the sums:

Note: analyze on your own why "a" and "b" can be taken outside the summation sign. By the way, formally this can be done with the sum as well.

Let's rewrite the system in an "applied" form:
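For reference, the system in that applied form is (sums over i = 1..n; a is the slope, b the intercept):

\begin{cases}
a\sum x_i^2 + b\sum x_i = \sum x_i y_i\\
a\sum x_i + b\,n = \sum y_i
\end{cases}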

after which the algorithm for solving our problem begins to take shape:

Do we know the coordinates of the points? We do. Can we find the sums? Easily. We compose the simplest system of two linear equations in two unknowns ("a" and "b"). We solve the system, for example by Cramer's method, and obtain a stationary point. Checking the sufficient condition for an extremum, we can verify that at this point the function attains exactly a minimum. The check involves additional calculations, so we will leave it behind the scenes (if necessary, the missing frame can be viewed). We draw the final conclusion:

The function found approximates the experimental points in the best way (at least compared to any other linear function). Roughly speaking, its graph passes as close as possible to these points. In the tradition of econometrics, the resulting approximating function is also called the paired linear regression equation.

The problem under consideration is of great practical importance. In the situation of our example, the equation allows you to predict what turnover ("y") a store will have at one or another value of the retail space (one or another value of "x"). Yes, the resulting forecast will be only a forecast, but in many cases it will turn out to be quite accurate.

I will analyze just one problem with "real" numbers, since there are no difficulties in it; all calculations are at the level of the 7th-8th grade school curriculum. In 95 percent of cases you will be asked to find just a linear function, but at the very end of the article I will show that it is no more difficult to find the equations of the optimal hyperbola, exponential, and some other functions.

In fact, it remains to hand out the promised goodies, so that you learn to solve such examples not only accurately but also quickly. We carefully study the standard example:

Task

As a result of studying the relationship between the two indicators, the following pairs of numbers were obtained:

Using the least squares method, find the linear function that best approximates the empirical (experimental) data. Make a drawing on which the experimental points and the graph of the approximating function are plotted in a Cartesian rectangular coordinate system. Find the sum of the squared deviations between the empirical and the theoretical values. Find out whether the other proposed function would approximate the experimental points better (from the point of view of the least squares method).

Note that the "x" values are natural numbers, and this has a characteristic meaningful interpretation, which I will talk about a little later; but they, of course, can also be fractional. In addition, depending on the content of a particular problem, both the "x" and the "y" values can be fully or partially negative. Well, we have been given a "faceless" problem, and we begin its solution:

We find the coefficients of the optimal function as a solution to the system:

For the sake of a more compact notation, the "counter" variable can be omitted, since it is already clear that the summation is carried out from 1 to n.

It is more convenient to calculate the required sums in tabular form:


Calculations can be carried out on a microcalculator, but it is much better to use Excel - both faster and without errors; watch a short video:

Thus, we obtain the following system:

Here you could multiply the second equation by 3 and subtract the 2nd from the 1st equation term by term. But this is luck: in practice the systems are often no gift, and in such cases Cramer's method saves the day:
The determinant is nonzero, which means that the system has a unique solution.

Let's check. I understand that you don't want to, but why let errors slip through where they can be completely avoided? We substitute the found solution into the left-hand side of each equation of the system:

The right-hand sides of the corresponding equations are obtained, which means that the system is solved correctly.

Thus, the required approximating function is found: of all linear functions, it is the one that approximates the experimental data in the best way.

Unlike the direct dependence of a store's turnover on its area, the dependence found here is inverse (the principle "the more, the less"), and this fact is immediately revealed by the negative slope. The function tells us that when a certain indicator increases by 1 unit, the value of the dependent indicator decreases on average by 0.65 units. As the saying goes, the higher the price of buckwheat, the less of it is sold.

To plot the graph of the approximating function, we find two of its values:

and execute the drawing:


The constructed line is called a trend line (namely, a linear trend line, meaning that in the general case a trend is not necessarily a straight line). Everyone is familiar with the expression "to be in trend", and I think this term needs no additional comment.

Let's calculate the sum of the squares of the deviations between empirical and theoretical values. Geometrically, it is the sum of the squares of the lengths of the "crimson" segments (two of which are so small that you can't even see them).

Let's summarize the calculations in a table:


They can again be done manually, just in case I will give an example for the 1st point:

but it is much more efficient to act in a well-known way:

Let's repeat: what is the meaning of the obtained result? Of all linear functions, our function has the smallest value of this indicator, that is, within its family it is the best approximation. And here, by the way, the final question of the problem is not accidental: what if the proposed exponential function approximates the experimental points better?

Let's find the corresponding sum of squares of deviations - in order to distinguish, I will designate them with the letter "epsilon". The technique is exactly the same:


And again, just in case, the calculations for the 1st point:

In Excel, we use the standard function EXP (see the Excel Help for the syntax).

Conclusion: the sum of squared deviations came out larger, which means that the exponential function approximates the experimental points worse than the straight line.

But here it should be noted that "worse" does not yet mean "bad". I have now plotted this exponential function, and it also passes close to the points; so close that without an analytical investigation it is hard to say which function is more accurate.

This completes the solution, and I return to the question of the natural values ​​of the argument. In various studies, as a rule, economic or sociological, natural "xes" number months, years or other equal time intervals. Consider, for example, a problem like this.

The least squares method is used to estimate the parameters of the regression equation.

One of the methods for studying stochastic relationships between features is regression analysis.
Regression analysis consists in deriving the regression equation, with the help of which the average value of a random variable (the result feature) is found when the value of another variable (or other variables, the factor features) is known. It includes the following steps:

  1. choice of the form of the relationship (the type of the analytical regression equation);
  2. estimation of the parameters of the equation;
  3. assessment of the quality of the analytical regression equation.
Most often, a linear form is used to describe the statistical relationship between features. Attention to the linear relationship is explained by the clear economic interpretation of its parameters, the limited variation of the variables, and the fact that in most cases nonlinear forms of relationship are transformed (by taking logarithms or by a change of variables) into a linear form in order to perform the calculations.
In the case of a linear pairwise relationship, the regression equation takes the form: y_i = a + b·x_i + u_i. The parameters a and b of this equation are estimated from the statistical observations of x and y. The result of such estimation is the equation ŷ_i = â + b̂·x_i, where â, b̂ are the estimates of the parameters a and b, and ŷ_i is the value of the result variable obtained from the regression equation (the calculated value).

The least squares method (OLS) is most often used to estimate the parameters.
The least squares method gives the best (consistent, efficient, and unbiased) estimates of the parameters of the regression equation, but only if certain prerequisites concerning the random term (u) and the independent variable (x) are met (see the OLS assumptions).

The problem of estimating the parameters of a linear paired equation by the least squares method is as follows: obtain parameter estimates for which the sum of the squared deviations of the actual values of the result feature y_i from the calculated values ŷ_i is minimal.
Formally, the OLS criterion can be written like this: S = Σ(y_i - ŷ_i)² → min.

Least squares classification

  1. Ordinary least squares (OLS).
  2. Maximum likelihood method (for the normal classical linear regression model, normality of the regression residuals is postulated).
  3. Generalized least squares (GLS), used in the case of autocorrelation of the errors and in the case of heteroscedasticity.
  4. Weighted least squares (a special case of GLS with heteroscedastic residuals).

Let us illustrate the essence of the classical least squares method graphically. To do this, we will build a scatter plot from the observation data (x_i, y_i, i = 1..n) in a rectangular coordinate system (such a scatter plot is called the correlation field). Let's try to find the straight line that is closest to the points of the correlation field. According to the least squares method, this line is chosen so that the sum of the squares of the vertical distances between the points of the correlation field and the line is minimal.

The mathematical statement of this problem: minimize S(a, b) = Σ(y_i - a - b·x_i)².
We know the values y_i and x_i, i = 1..n; these are observational data. In the function S they are constants. The variables in this function are the sought parameter estimates a and b. To find the minimum of a function of 2 variables, it is necessary to compute the partial derivatives of this function with respect to each of the parameters and set them equal to zero, i.e. ∂S/∂a = 0 and ∂S/∂b = 0.
As a result, we get a system of 2 normal linear equations:
Solving this system, we find the required parameter estimates:
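Written out (x̄ and ȳ denote the sample means), the normal system and the resulting estimates are:

\begin{cases}
a\,n + b\sum x_i = \sum y_i\\
a\sum x_i + b\sum x_i^2 = \sum x_i y_i
\end{cases}
\qquad\Longrightarrow\qquad
b = \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sum x_i^2 - n\,\bar{x}^2},
\qquad
a = \bar{y} - b\,\bar{x}.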

The correctness of the calculation of the parameters of the regression equation can be checked by comparing the sums Σy_i and Σŷ_i (there may be some discrepancy due to rounding in the calculations).
To calculate the parameter estimates, you can build Table 1.
The sign of the regression coefficient b indicates the direction of the relationship (if b > 0, the relationship is direct; if b < 0, it is inverse). The value of b shows by how many units, on average, the result feature y changes when the factor feature x changes by 1 unit of its own measurement.
Formally, the value of the parameter a is the average value of y at x equal to zero. If the factor feature does not have and cannot have a zero value, then this interpretation of the parameter a does not make sense.

The closeness of the relationship between the features is assessed using the linear pair correlation coefficient r_x,y. It can be calculated using the formula shown below. In addition, the linear pair correlation coefficient can be determined through the regression coefficient b:
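Both formulas are standard (s_x, s_y are the standard deviations of x and y, x̄, ȳ their means):

r_{xy} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{s_x\, s_y},
\qquad
r_{xy} = b\,\frac{s_x}{s_y}.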
The range of admissible values of the linear pair correlation coefficient is from -1 to +1. The sign of the correlation coefficient indicates the direction of the relationship: if r_x,y > 0, the relationship is direct; if r_x,y < 0, it is inverse.
If this coefficient is close to one in absolute value, the relationship between the features can be interpreted as a rather close linear one. If its modulus equals one, |r_x,y| = 1, the relationship between the features is functional and linear. If the features x and y are linearly independent, then r_x,y is close to 0.
To calculate r_x,y, you can also use Table 1.

To assess the quality of the obtained regression equation, the theoretical coefficient of determination R²yx is calculated:

R²yx = d² / s²y = 1 - e² / s²y,

where d² is the variance of y explained by the regression equation;
e² is the residual variance of y (not explained by the regression equation);
s²y is the total variance of y.
The coefficient of determination characterizes the share of the variation (variance) of the result feature y that is explained by the regression (and, consequently, by the factor x) in the total variation (variance) of y. The coefficient of determination R²yx takes values from 0 to 1. Accordingly, the value 1 - R²yx characterizes the share of the variance of y caused by the influence of other factors not included in the model and by specification errors.
With paired linear regression, R²yx = r²yx.

Example.

Experimental data on the values of the variables x and y are given in the table.

As a result of aligning them, the following function was obtained:

Using the least squares method, approximate these data with a linear dependence y = ax + b (find the parameters a and b). Find out which of the two lines better (in the sense of the least squares method) aligns the experimental data. Make a drawing.

The essence of the least squares method (LSM).

The task is to find the coefficients of the linear dependence at which the function of the two variables a and b takes the smallest value. That is, for these a and b the sum of the squared deviations of the experimental data from the found straight line will be the smallest. This is the whole point of the least squares method.

Thus, the solution of the example is reduced to finding the extremum of a function of two variables.

Derivation of formulas for finding coefficients.

A system of two equations in two unknowns is composed and solved. We find the partial derivatives of the function with respect to the variables a and b and set these derivatives equal to zero.

We solve the resulting system of equations by any method (for example, by substitution or by Cramer's method) and obtain the formulas for finding the coefficients by the least squares method (LSM).

For these a and b, the function takes the smallest value. The proof of this fact is given below, at the end of the page.

That's the whole least squares method. The formula for finding the parameter a contains the sums Σx_i, Σy_i, Σx_i·y_i, Σx_i² and the quantity n, the amount of experimental data. We recommend calculating the values of these sums separately. The coefficient b is found after a.
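The formulas themselves, written out:

a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b = \frac{\sum y_i - a\sum x_i}{n}.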

It's time to remember the original example.

Solution.

In our example, n = 5. We fill in the table for the convenience of calculating the sums that appear in the formulas for the desired coefficients.

The values ​​in the fourth row of the table are obtained by multiplying the values ​​of the 2nd row by the values ​​of the 3rd row for each number i.

The values ​​in the fifth row of the table are obtained by squaring the values ​​of the 2nd row for each number i.

The values ​​in the last column of the table are the row sums of the values.

We use the formulas of the least squares method to find the coefficients a and b... We substitute in them the corresponding values ​​from the last column of the table:

Hence, y = 0.165x + 2.184 is the required approximating straight line.

It remains to find out which of the lines, y = 0.165x + 2.184 or the other one, better approximates the original data, that is, to make an estimate using the least squares method.

Estimation of the error of the least squares method.

To do this, you need to calculate the sums of the squared deviations of the initial data from each of these lines; the smaller value corresponds to the line that better approximates the original data in the sense of the least squares method.

Since the first sum turned out to be smaller, the straight line y = 0.165x + 2.184 approximates the original data better.

Graphical illustration of the least squares method (LSM).

Everything is perfectly visible on the graphs. The red line is the found straight line y = 0.165x + 2.184, the blue line is the other candidate, and the pink dots are the raw data.

In practice, when modeling various processes - in particular, economic, physical, technical, social - one or another method of calculating the approximate values ​​of functions from their known values ​​at some fixed points is widely used.

Such problems of approximation of functions often arise:

    when constructing approximate formulas for calculating the values ​​of the characteristic values ​​of the process under study according to tabular data obtained as a result of the experiment;

    for numerical integration, differentiation, solving differential equations, etc .;

    when it is necessary to calculate the values ​​of functions at intermediate points of the considered interval;

    when determining the values ​​of the characteristic quantities of the process outside the considered interval, in particular when predicting.

If, to model a certain process given by the table, construct a function that approximately describes this process based on the least squares method, it will be called an approximating function (regression), and the problem of constructing approximating functions itself is an approximation problem.

This article discusses the capabilities of the MS Excel package for solving such problems, in addition, methods and techniques for constructing (creating) regressions for table-defined functions (which is the basis of regression analysis) are given.

Excel has two options for plotting regressions.

    Adding the selected regressions (trend lines) to the diagram based on the data table for the studied process characteristic (available only if there is a constructed diagram);

    Use the built-in statistical functions of the Excel worksheet to obtain regressions (trend lines) directly from the raw data table.

Adding trend lines to a chart

For a table of data describing a certain process and represented by a diagram, Excel has an effective regression analysis tool that allows you to:

    build on the basis of the least squares method and add five types of regressions to the diagram, which model the process under study with varying degrees of accuracy;

    add the equation of the constructed regression to the diagram;

    determine the degree to which the selected regression matches the data displayed on the chart.

Based on the data of the Excel chart, it allows you to obtain linear, polynomial, logarithmic, power, exponential types of regressions, which are given by the equation:

y = y (x)

where x is an independent variable, which often takes on the values ​​of a sequence of natural numbers (1; 2; 3; ...) and produces, for example, counting the time of the process under study (characteristics).

1. Linear regression is good for modeling characteristics that increase or decrease at a constant rate. It is the simplest model of the process under study to construct. It is built according to the equation:

y = mx + b

where m is the tangent of the angle of inclination of the linear regression to the abscissa axis; b - coordinate of the point of intersection of linear regression with the ordinate axis.

2. A polynomial trend line is useful for describing characteristics that have several distinct extrema (maxima and minima). The choice of the degree of the polynomial is determined by the number of extrema of the characteristic under study. Thus, a polynomial of the second degree can describe well a process that has only one maximum or minimum; a polynomial of the third degree, no more than two extrema; a polynomial of the fourth degree, no more than three extrema, and so on.

In this case, the trend line is plotted in accordance with the equation:

y = c0 + c1x + c2x² + c3x³ + c4x⁴ + c5x⁵ + c6x⁶

where the coefficients c0, c1, c2, ... c6 are constants, the values ​​of which are determined during construction.

3. A logarithmic trend line is successfully used to model characteristics whose values change rapidly at first and then gradually stabilize. It is built in accordance with the equation:

y = c·ln(x) + b

4. A power trend line gives good results if the values of the dependence under study are characterized by a constant change in the growth rate. An example of such a dependence is the graph of the uniformly accelerated motion of a car. If the data contain zero or negative values, a power trend line cannot be used.

It is built in accordance with the equation:

y = c·x^b

where the coefficients b, c are constants.

5. An exponential trend line should be used when the rate of change of the data increases continuously. For data containing zero or negative values, this kind of approximation is also not applicable.

It is built in accordance with the equation:

y = c·e^(bx)

where the coefficients b, c are constants.

When selecting a trend line, Excel automatically calculates the value of R2, which characterizes the accuracy of the approximation: the closer the value of R2 is to one, the more reliably the trend line approximates the process under study. If necessary, the R2 value can always be displayed on the chart.

It is determined by the formula:
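A standard way to write this quantity (assuming the usual definition; ŷ_i are the trend-line values, ȳ the mean of the observed data):

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}.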

To add a trend line to a data series:

    activate a chart based on a series of data, that is, click within the chart area. The Chart item will appear in the main menu;

    after clicking on this item, a menu will appear on the screen, in which you should select the Add trend line command.

The same actions are easily accomplished by hovering the mouse pointer over the graph corresponding to one of the data series and clicking the right mouse button; in the context menu that appears, select the Add trend line command. The Trendline dialog box with the Type tab expanded (Fig. 1) will appear on the screen.

After that it is necessary:

Select the required trendline type on the Type tab (by default, the Linear type is selected). For the Polynomial type, in the Degree field, specify the degree of the selected polynomial.

The Plotted on Series box lists all the data series of the chart in question. To add a trend line to a specific data series, select its name in the Plotted on Series field.

If necessary, by going to the Parameters tab (Fig. 2), you can set the following parameters for the trend line:

    change the name of the trend line in the Name of the approximating (smoothed) curve field.

    set the number of periods (forward or backward) for the forecast in the Forecast field;

    display the equation of the trend line in the diagram area, for which you should enable the Show equation on the diagram check box;

    display the value of the approximation reliability R2 in the diagram area, for which you should enable the checkbox to place the approximation reliability value (R ^ 2) on the diagram;

    set the point of intersection of the trend line with the Y axis, for which you should enable the intersection of the curve with the Y axis at a point checkbox;

    click on the OK button to close the dialog box.

In order to start editing an already built trend line, there are three ways:

    use the Selected trend line command from the Format menu after selecting the trend line;

    select the Format trendline command from the context menu, which is invoked by right-clicking on the trend line;

    by double clicking on the trend line.

The Trendline Format dialog box (Fig. 3) will appear on the screen, containing three tabs: View, Type, Parameters, and the contents of the latter two completely coincide with similar tabs in the Trendline dialog box (Fig. 1-2). On the View tab, you can set the line type, its color and thickness.

To delete an already built trend line, select the deleted trend line and press the Delete key.

The advantages of the considered regression analysis tool are:

    the relative ease of plotting a trend line on charts without creating a data table for it;

    a fairly wide list of types of proposed trend lines, and this list includes the most commonly used types of regression;

    the ability to predict the behavior of the process under study for an arbitrary (within common sense) number of steps forward, as well as backward;

    the ability to obtain the equation of the trend line in an analytical form;

    the possibility, if necessary, of obtaining an estimate of the reliability of the approximation carried out.

The disadvantages include the following points:

    the construction of a trend line is carried out only if there is a diagram built on a number of data;

    the process of forming data series for the studied characteristic based on the trend line equations obtained for it is somewhat cluttered: the sought regression equations are updated with each change in the values ​​of the original data series, but only within the diagram area, while the data series formed on the basis of the old line equation trend remains unchanged;

    In PivotChart reports, when you change the view of a chart or a linked PivotTable report, existing trendlines are not retained, which means that you must ensure that the layout of the report meets your requirements before you draw trendlines or otherwise format the PivotChart report.

Trend lines can be used to supplement data series presented on charts such as graph, bar, flat unnormalized area charts, bar, scatter, bubble, and stock charts.

You cannot add trendlines to data series in 3-D, Normalized, Radar, Pie, and Donut charts.

Using built-in Excel functions

Excel also provides a regression analysis tool for plotting trend lines outside the chart area. A number of worksheet statistical functions can be used for this purpose, but all of them allow only linear or exponential regressions to be built.

Excel provides several functions for constructing linear regression, in particular:

    TREND;

    LINEST;

    SLOPE and INTERCEPT.

And also several functions for building an exponential trendline, in particular:

    GROWTH;

    LGRFPRIBL (called LOGEST in English versions of Excel).

It should be noted that the methods of constructing regressions using the TREND and GROWTH functions practically coincide. The same can be said of the pair LINEST and LGRFPRIBL. For these four functions, creating a table of values relies on Excel's array formulas, which makes the regression procedure somewhat cluttered. Note also that linear regression is, in our opinion, easiest to construct using the SLOPE and INTERCEPT functions, where the first determines the slope of the linear regression and the second the segment it cuts off on the ordinate axis (the intercept).

The benefits of the built-in regression analysis tool include:

    a fairly simple process of the same type of formation of data series of the studied characteristic for all built-in statistical functions that set trend lines;

    standard technique for constructing trend lines based on generated data series;

    the ability to predict the behavior of the process under study for the required number of steps forward or backward.

The disadvantage is that Excel does not have built-in functions for creating other (besides linear and exponential) trendline types. This circumstance often does not allow choosing a sufficiently accurate model of the process under study, as well as obtaining forecasts that are close to reality. Also, when using the TREND and GROWTH functions, the trendline equations are not known.

It should be noted that the authors did not set the goal of the article to present the course of regression analysis with varying degrees of completeness. Its main task is to show the capabilities of the Excel package in solving approximation problems using specific examples; demonstrate what effective tools Excel has for building regressions and forecasting; illustrate how relatively easily such problems can be solved even by a user who does not have deep knowledge of regression analysis.

Examples of solving specific problems

Let's consider the solution of specific tasks using the listed tools of the Excel package.

Problem 1

With a table of data on the profit of a trucking company for 1995-2002. you need to do the following.

    Build a diagram.

    Add linear and polynomial (quadratic and cubic) trend lines to the chart.

    Using the trend line equations, obtain tabular data on enterprise profits for each trend line for 1995-2004.

    Make a forecast for the profit of the enterprise for 2003 and 2004.

The solution of the problem

    In the range of cells A4: C11 of the Excel worksheet, enter the worksheet shown in Fig. 4.

    Having selected the range of cells B4: C11, we build a diagram.

    We activate the constructed chart and, according to the method described above, after selecting the type of trend line in the Trendline dialog box (see Fig. 1), we alternately add linear, quadratic and cubic trend lines to the chart. In the same dialog box, open the Parameters tab (see Fig. 2), in the Name of the approximating (smoothed) curve field, enter the name of the added trend, and in the Forecast for: periods field, set the value 2, since it is planned to make a profit forecast for two years ahead. To display the regression equation and the approximation reliability value R2 in the diagram area, turn on the checkboxes to show the equation on the screen and place the approximation reliability value (R ^ 2) on the diagram. For a better visual perception, change the type, color and thickness of the constructed trend lines, for which we use the View tab of the Trendline Format dialog box (see Fig. 3). The resulting diagram with added trend lines is shown in Fig. 5.

    To obtain tabular data on the profit of the enterprise for each trend line for 1995-2004. Let us use the trend line equations shown in Fig. 5. To do this, in the cells of the range D3: F3, enter text information about the type of the selected trend line: Linear trend, Quadratic trend, Cubic trend. Next, enter the linear regression formula in cell D4 and, using the fill marker, copy this formula with relative references to the range of cells D5: D13. It should be noted that each cell with a linear regression formula from the range of cells D4: D13 takes the corresponding cell from the range A4: A13 as an argument. Similarly, for quadratic regression, the cell range E4: E13 is filled, and for cubic regression, the cell range F4: F13. Thus, the forecast for the profit of the enterprise for 2003 and 2004 was made. using three trends. The resulting table of values ​​is shown in Fig. 6.

Task 2

    Build a diagram.

    Add logarithmic, power, and exponential trend lines to the chart.

    Derive the equations of the obtained trend lines, as well as the values ​​of the approximation reliability R2 for each of them.

    Using the trend line equations, obtain tabular data on enterprise profits for each trend line for 1995-2002.

    Make a forecast of the company's profit for 2003 and 2004 using these trend lines.

The solution of the problem

Following the methodology given in solving problem 1, we obtain a diagram with added logarithmic, power and exponential trend lines (Fig. 7). Further, using the obtained equations of the trend lines, we fill in the table of values ​​for the profit of the enterprise, including the predicted values ​​for 2003 and 2004. (fig. 8).

From Fig. 5 and Fig. 7 it can be seen that the smallest value of the approximation reliability corresponds to the model with the logarithmic trend:

R2 = 0.8659

The largest values of R2 correspond to the models with a polynomial trend: quadratic (R2 = 0.9263) and cubic (R2 = 0.933).

Problem 3

With the table of data on the profit of a trucking company for 1995-2002, given in task 1, you must perform the following steps.

    Get data series for linear and exponential trendlines using TREND and GROWTH functions.

    Using the TREND and GROWTH functions, make a forecast of the company's profit for 2003 and 2004.

    Build a diagram for the initial data and the resulting data series.

The solution of the problem

Let's use the worksheet of task 1 (see Fig. 4). Let's start with the TREND function:

    select the range of cells D4: D11, which should be filled with the values ​​of the TREND function, corresponding to the known data on the profit of the enterprise;

    call the Function command from the Insert menu. In the Function Wizard dialog box that appears, select the TREND function from the Statistical category, and then click on the OK button. The same operation can be performed by pressing the (Insert function) button on the standard toolbar.

    In the Function Arguments dialog box that appears, enter in the Known_values_y field the range of cells C4: C11; in the Known_x's field - the range of cells B4: B11;

    to make the entered formula an array formula, use the Ctrl + Shift + Enter key combination.

The formula we entered will appear in the formula bar as: {=TREND(C4:C11;B4:B11)}.

As a result, the range of cells D4: D11 is filled with the corresponding values ​​of the TREND function (Fig. 9).

To make a forecast of the enterprise's profit for 2003 and 2004, it is necessary to:

    select the range of cells D12: D13, where the values ​​predicted by the TREND function will be entered.

    call the TREND function and in the Function Arguments dialog box that appears, enter in the Known_values_y field - the range of cells C4: C11; in the Known_x's field - the range of cells B4: B11; and the New_x_values ​​field contains the range of cells B12: B13.

    turn this formula into an array formula using the keyboard shortcut Ctrl + Shift + Enter.

    The entered formula will look like: {=TREND(C4:C11;B4:B11;B12:B13)}, and the range of cells D12:D13 will be filled with the predicted values of the TREND function (see Fig. 9).

Similarly, a data series is filled using the GROWTH function, which is used in the analysis of nonlinear dependencies and works in exactly the same way as its linear analogue TREND.

Figure 10 shows a table in the formulas display mode.

For the initial data and the obtained data series, the diagram shown in Fig. 11 is constructed.

Problem 4

With the table of data on the receipt of applications for services by the dispatch service of a motor transport company for the period from the 1st to the 11th day of the current month, the following actions must be performed.

    Get data series for linear regression: using the SLOPE and INTERCEPT functions; using the LINEST function.

    Get a data series for exponential regression using the LGRFPRIBL function.

    Using the above functions, make a forecast about the receipt of applications in the dispatch service for the period from the 12th to the 14th day of the current month.

    Build a diagram for the original and received data series.

The solution of the problem

Note that, unlike the TREND and GROWTH functions, none of the functions listed above (SLOPE, INTERCEPT, LINEST, LGRFPRIBL) is a regression itself. These functions play only an auxiliary role, determining the necessary regression parameters.

For linear and exponential regressions, built using the SLOPE, INTERCEPT, LINEST, LGRFPRIB functions, the appearance of their equations is always known, in contrast to the linear and exponential regressions corresponding to the TREND and GROWTH functions.

1. Let's construct the linear regression given by the equation:

y = mx + b

with the SLOPE and INTERCEPT functions, where the slope m is determined by the SLOPE function and the intercept b by the INTERCEPT function.

To do this, we carry out the following actions:

    we enter the original table into the range of cells A4: B14;

    the value of the parameter m will be determined in cell C19. Select the SLOPE function from the Statistical category; enter the range of cells B4:B14 in the known_y's field and the range of cells A4:A14 in the known_x's field. The formula entered in cell C19 will be: =SLOPE(B4:B14;A4:A14);

    using a similar methodology, the value of parameter b in cell D19 is determined. And its content will look like: = INTERCEPT (B4: B14; A4: A14). Thus, the values ​​of the parameters m and b, necessary for constructing the linear regression, will be stored in cells C19, D19, respectively;

    then we enter the linear regression formula into cell C4 in the form: =$C$19*A4+$D$19. In this formula, cells C19 and D19 are written with absolute references (the cell address must not change during possible copying). The absolute reference sign $ can be typed either from the keyboard or with the F4 key after placing the cursor on the cell address. Using the fill handle, we copy this formula to the range of cells C4:C17 and obtain the required data series (Fig. 12). Because the number of requests is an integer, we set the number format with 0 decimal places on the Number tab of the Format Cells window.

2. Now let's construct the same linear regression given by the equation:

y = mx + b

using the LINEST function.

For this:

    enter the LINEST function into the range of cells C20:D20 as an array formula: {=LINEST(B4:B14;A4:A14)}. As a result, we get the value of the parameter m in cell C20 and the value of the parameter b in cell D20;

    enter the formula into cell D4: =$C$20*A4+$D$20;

    copy this formula using the fill handle to the range of cells D4: D17 and get the required data series.

3. We construct an exponential regression that has the equation:

y = b·m^x

using the LGRFPRIBL function; this is done in a similar way:

    into the range of cells C21:D21 we enter the LGRFPRIBL function as an array formula: {=LGRFPRIBL(B4:B14;A4:A14)}. In this case, the value of the parameter m will be determined in cell C21, and the value of the parameter b in cell D21;

    the formula =$D$21*$C$21^A4 is entered into cell E4;

    using the fill marker, this formula is copied to the range of cells E4: E17, where the data series for the exponential regression will be located (see Fig. 12).

Fig. 13 shows a table with the functions we used, the required cell ranges, and the formulas.
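The exponential model fitted by LGRFPRIBL (LOGEST in English-language Excel) can also be reproduced outside the spreadsheet by taking logarithms, which turns y = b·m^x into the linear relation ln y = ln b + x·ln m; a sketch on the same made-up data:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
    y = np.array([3, 4, 4, 5, 6, 6, 7, 7, 8, 9, 10], dtype=float)

    # Linear least squares on (x, ln y) gives ln m (slope) and ln b (intercept).
    slope, intercept = np.polyfit(x, np.log(y), 1)
    m, b = np.exp(slope), np.exp(intercept)

    # The exponential regression series, analogous to =$D$21*$C$21^A4 copied down to E4:E17.
    days = np.arange(1, 15)
    y_fit = b * m ** days
    print(m, b)
    print(np.round(y_fit))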

The quantity R² is called the coefficient of determination.

The task of constructing the regression dependence is to find the vector of coefficients m of model (1) for which the coefficient R takes its maximum value.

To assess the significance of R, Fisher's F-test is used; it is calculated from R², the sample size n (the number of experiments), and the number of model coefficients k.

If F exceeds some critical value for the given n and k and the accepted confidence level, then the value of R is considered significant. Tables of critical values of F are given in handbooks on mathematical statistics.

Thus, the significance of R is determined not only by its value but also by the ratio between the number of experiments and the number of coefficients (parameters) of the model. Indeed, for a simple linear model with n = 2 the correlation ratio equals 1 (a single straight line can always be drawn through any two points in the plane). However, if the experimental data are random values, such a value of R should be trusted with great care. Usually, to obtain a significant R and a reliable regression, one tries to make the number of experiments substantially exceed the number of model coefficients (n > k).
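Conventions for the F-statistic differ between handbooks, so the rough Python sketch below only computes R² and illustrates the point about n and k: with n = 2 points a straight line fits exactly and R² = 1, which says nothing about the quality of the model. The data are made up:

    import numpy as np

    def r_squared(y, y_fit):
        """Coefficient of determination: 1 - SS_res / SS_tot."""
        ss_res = np.sum((y - y_fit) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - ss_res / ss_tot

    # n = 2 points, k = 2 coefficients: the line passes exactly through both points.
    x2 = np.array([1.0, 2.0])
    y2 = np.array([1.0, 3.0])
    m, b = np.polyfit(x2, y2, 1)
    print(r_squared(y2, m * x2 + b))      # 1.0, but not informative

    # n = 11 points, k = 2 coefficients: R^2 now reflects how well the line fits.
    x = np.arange(1, 12, dtype=float)
    y = 0.7 * x + 2.0 + np.random.default_rng(0).normal(0.0, 0.5, x.size)
    m, b = np.polyfit(x, y, 1)
    print(r_squared(y, m * x + b))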

To build a linear regression model, you must:

1) prepare a list of n rows and m columns containing the experimental data (the column containing the output value Y must be either the first or the last in the list); for example, take the data of the previous task, add a column named "Period No." and number the periods from 1 to 12 (these will be the X values);

2) go to the menu Data / Data Analysis / Regression

If the "Data Analysis" item in the "Tools" menu is absent, then you should go to the "Add-Ins" item of the same menu and select the "Analysis Package" checkbox.

3) in the "Regression" dialog box set:

· Input interval Y;

· Input interval X;

· Output interval - the upper left cell of the interval in which the results of calculations will be placed (it is recommended to place them on a new worksheet);

4) click "Ok" and analyze the results.
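Outside Excel, a similar summary report (coefficients, R², the F-statistic and so on) can be produced, for example, with the Python statsmodels package, which is not part of the original task and is used here only as an illustration; the data are again made up:

    import numpy as np
    import statsmodels.api as sm

    # Made-up "Period No." / "Orders" data.
    x = np.arange(1, 13, dtype=float)    # periods 1..12
    y = np.array([3, 4, 4, 5, 6, 6, 7, 7, 8, 9, 10, 10], dtype=float)

    X = sm.add_constant(x)               # adds the intercept column
    model = sm.OLS(y, X).fit()
    print(model.summary())               # coefficients, R^2, F-statistic, etc.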

Approximation of experimental data is a method based on replacing the experimentally obtained data with an analytical function that passes as close as possible to, or coincides at the nodal points with, the initial values (the data obtained during a trial or experiment). There are currently two ways to define such an analytical function:

By constructing an interpolation polynomial of degree n that passes directly through all points of the given data array. In this case, the approximating function is represented as an interpolation polynomial in Lagrange form or in Newton form.

By constructing an approximating polynomial of degree m that passes in close proximity to the points of the given data array. In this way, the approximating function smooths out the random noise (or errors) that may arise during the experiment: the measured values depend on random factors that fluctuate according to their own random laws (measurement or instrument errors, inaccuracies or experimental errors). In this case, the approximating function is determined by the least squares method.

The least squares method (in the English-language literature, Ordinary Least Squares, OLS) is a mathematical method based on determining an approximating function that is built in the closest proximity to the points of a given array of experimental data. The closeness of the initial data and the approximating function F(x) is measured numerically: the sum of the squared deviations of the experimental data from the approximating curve F(x) should be the smallest.

Figure: a least squares fit curve.

The least squares method is used:

To solve overdetermined systems of equations when the number of equations exceeds the number of unknowns;

To search for a solution in the case of ordinary (not overdetermined) nonlinear systems of equations;

To approximate point values ​​by some approximating function.

The approximating function in the least squares method is determined from the condition of the minimum of the sum of squared deviations of the calculated approximating function from the given array of experimental data. This criterion of the least squares method is written as the following expression:

S = Σ (F(x_i) - y_i)^2 → min, i = 1 … n,

where F(x_i) are the values of the calculated approximating function at the nodal points, and y_i are the given experimental data at the nodal points.

The quadratic criterion has a number of "good" properties, such as differentiability and the guarantee of a unique solution to the approximation problem with polynomial approximating functions.

Depending on the conditions of the problem, the approximating function is a polynomial of degree m:

F(x) = a_0 + a_1·x + a_2·x^2 + … + a_m·x^m

The degree of the approximating polynomial does not depend on the number of nodal points, but it should always be less than the dimension (the number of points) of the given array of experimental data.

∙ If the degree of the approximating function is m = 1, then we approximate the tabular function with a straight line (linear regression).

∙ If the degree of the approximating function is m = 2, then we approximate the tabular function with a quadratic parabola (quadratic approximation).

∙ If the degree of the approximating function is m = 3, then we approximate the tabular function with a cubic parabola (cubic approximation).

In the general case, when it is required to construct an approximating polynomial of degree m for given tabular values, the condition for the minimum of the sum of squared deviations over all nodal points is rewritten in the following form:

S(a_0, a_1, …, a_m) = Σ (a_0 + a_1·x_i + … + a_m·x_i^m - y_i)^2 → min, i = 1 … n,

where a_0, a_1, …, a_m are the unknown coefficients of the approximating polynomial of degree m, and n is the number of specified table values.

A necessary condition for the existence of a minimum of a function is that its partial derivatives with respect to the unknown variables a_0, a_1, …, a_m are equal to zero. As a result, we get the following system of equations:

∂S/∂a_k = 2·Σ (a_0 + a_1·x_i + … + a_m·x_i^m - y_i)·x_i^k = 0, k = 0, 1, …, m

We transform the resulting system: open the brackets and move the free terms to the right-hand side of the expression. As a result, the system of linear algebraic equations takes the following form:

a_0·Σ x_i^k + a_1·Σ x_i^(k+1) + … + a_m·Σ x_i^(k+m) = Σ y_i·x_i^k, k = 0, 1, …, m

This system of linear algebraic equations can be rewritten in matrix form as C·A = B, where C is the (m + 1) × (m + 1) matrix of the sums of powers of x_i, A = (a_0, a_1, …, a_m) is the vector of unknown coefficients, and B is the vector of free terms.

As a result, we obtain a system of m + 1 linear equations in m + 1 unknowns. This system can be solved by any method for solving systems of linear algebraic equations (for example, Gaussian elimination). As a result of the solution, the unknown parameters of the approximating function are found that provide the minimum of the sum of squared deviations of the approximating function from the initial data, i.e. the best possible quadratic approximation. It should be remembered that if even one value of the initial data changes, all the coefficients change their values, since they are completely determined by the initial data.
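To make the construction concrete, here is a small Python sketch (not part of the original text, data made up for illustration): it assembles the matrix of power sums and the right-hand side vector described above and solves the system with a library solver instead of hand-coded Gaussian elimination.

    import numpy as np

    def polynomial_least_squares(x, y, m):
        """Fit a polynomial of degree m by solving the normal equations."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        # C[k, j] = sum_i x_i^(k + j),  B[k] = sum_i y_i * x_i^k
        C = np.array([[np.sum(x ** (k + j)) for j in range(m + 1)]
                      for k in range(m + 1)])
        B = np.array([np.sum(y * x ** k) for k in range(m + 1)])
        return np.linalg.solve(C, B)      # coefficients a_0, ..., a_m

    # Example: quadratic approximation (m = 2) of made-up noisy data.
    x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
    y = np.array([1.1, 2.9, 7.2, 12.8, 21.1, 30.9])
    print(polynomial_least_squares(x, y, 2))

Note that for a large degree m the matrix of power sums becomes badly conditioned, so in practice library routines (for example, numpy.polyfit) are usually preferred to solving the normal equations directly.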

Linear approximation of initial data

(linear regression)

As an example, consider the method for determining the approximating function when it is specified as the linear relationship F(x) = a_0 + a_1·x. In accordance with the least squares method, the condition for the minimum of the sum of squared deviations is written in the following form:

S(a_0, a_1) = Σ (a_0 + a_1·x_i - y_i)^2 → min, i = 1 … n,

where x_i, y_i are the coordinates of the nodal points of the table, and a_0, a_1 are the unknown coefficients of the approximating function, which is given as a linear relationship.

A necessary condition for the existence of a minimum of a function is that its partial derivatives with respect to the unknown variables a_0 and a_1 are equal to zero. As a result, we obtain the following system of equations:

∂S/∂a_0 = 2·Σ (a_0 + a_1·x_i - y_i) = 0

∂S/∂a_1 = 2·Σ (a_0 + a_1·x_i - y_i)·x_i = 0

We transform the resulting system: open the brackets and move the free terms to the right-hand side, which gives

n·a_0 + a_1·Σ x_i = Σ y_i

a_0·Σ x_i + a_1·Σ x_i^2 = Σ x_i·y_i

We solve the resulting system of linear equations. The coefficients of the approximating function in analytical form are determined as follows (Cramer's rule):

a_1 = (n·Σ x_i·y_i - Σ x_i · Σ y_i) / (n·Σ x_i^2 - (Σ x_i)^2)

a_0 = (Σ x_i^2 · Σ y_i - Σ x_i · Σ x_i·y_i) / (n·Σ x_i^2 - (Σ x_i)^2)

These coefficients provide the construction of the linear approximating function in accordance with the criterion of minimizing the sum of squared deviations of the approximating function from the given table values (experimental data).
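As a cross-check, the same two formulas can be evaluated directly; a small Python sketch on made-up data (here a_1 is the slope and a_0 the intercept, matching m and b in the earlier Excel notation):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
    y = np.array([3, 4, 4, 5, 6, 6, 7, 7, 8, 9, 10], dtype=float)
    n = x.size

    # Cramer's rule applied to the 2x2 system of normal equations:
    #   n*a0      + a1*sum(x)   = sum(y)
    #   a0*sum(x) + a1*sum(x^2) = sum(x*y)
    d = n * np.sum(x ** 2) - np.sum(x) ** 2
    a1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / d                   # slope
    a0 = (np.sum(x ** 2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / d      # intercept
    print(a0, a1)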

Algorithm for the implementation of the least squares method

1. Initial data:

An array of experimental data (x_i, y_i), i = 1 … N, is given, where N is the number of measurements.

The degree m of the approximating polynomial is given.

2. Calculation algorithm:

2.1. The coefficients for constructing the system of equations of dimension (m + 1) × (m + 1) are determined:

C_kj = Σ x_i^(k + j), summed over i = 1 … N, are the coefficients of the system of equations (the left-hand side); j is the index of the column of the square matrix of the system;

B_k = Σ y_i·x_i^k, summed over i = 1 … N, are the free terms of the system of linear equations (the right-hand side); k is the index of the row of the square matrix of the system.

2.2. Formation of the system of linear equations of dimension (m + 1) × (m + 1).

2.3. Solving the system of linear equations to determine the unknown coefficients of the approximating polynomial of degree m.

2.4. Determination of the sum of squared deviations of the approximating polynomial from the original values over all nodal points:

S = Σ (F(x_i) - y_i)^2, i = 1 … N

The found value of the sum of squares of deviations is the minimum possible.

Approximation using other functions

It should be noted that when approximating the initial data by the least squares method, a logarithmic function, an exponential function, or a power function is sometimes used as the approximating function.

Logarithmic approximation

Consider the case when the approximating function is given by a logarithmic function of the form: