The relationship between random variables of different nature, for example between a value X and a value Y, is not necessarily the result of a direct dependence of one on the other (a so-called functional relationship). In some cases both quantities depend on a whole set of factors common to both, as a result of which they become linked with each other through shared patterns. When statistics reveals a relationship between random variables, we cannot claim to have discovered the cause of the change in the parameters; rather, we have merely observed two interconnected consequences.

For example, children who watch more American action movies on TV read less. Children who read more learn better. It is not easy to decide which of these are causes and which are effects, but that is not the task of statistics. Statistics can only put forward a hypothesis about the presence of a connection and back it up with numbers. If a connection does exist, the two random variables are said to be correlated. If an increase in one random variable is associated with an increase in the second, the correlation is called direct: for example, the number of pages read per year and the average grade (academic performance). If, on the contrary, an increase in one variable is associated with a decrease in the other, the correlation is called inverse: for example, the number of action movies watched and the number of pages read.

The mutual relationship of two random variables is called correlation. Correlation analysis makes it possible to establish whether such a relationship exists and to assess how close and significant it is. All of this is quantified.

How can we determine whether the variables are correlated? In most cases this can be seen on an ordinary chart. For example, for each child in our sample we can determine the values Xi (number of pages) and Yi (annual GPA) and record these data in a table. Draw the X and Y axes, then plot the whole series of points on the graph so that each has a specific pair of coordinates (Xi, Yi) from our table. Since in this case we find it difficult to decide what should be considered the cause and what the effect, it does not matter which axis is vertical and which is horizontal.
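A minimal sketch of such a chart in Python (the numbers are hypothetical, purely for illustration):

```python
# Plot (Xi, Yi) pairs to inspect the relationship visually
# before computing any coefficient. Data are hypothetical.
import matplotlib.pyplot as plt

pages_read = [200, 350, 500, 650, 800, 950, 1100]   # Xi, hypothetical
gpa        = [3.1, 3.3, 3.4, 3.7, 3.8, 4.2, 4.5]    # Yi, hypothetical

plt.scatter(pages_read, gpa)
plt.xlabel("Pages read per year (X)")
plt.ylabel("Annual GPA (Y)")
plt.title("An upward drift of the point cloud suggests a direct correlation")
plt.show()
```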


If the graph looks like a), this indicates a direct correlation; if it looks like b), the correlation is inverse; if the points form no pattern, as in c), there is no correlation.
The correlation coefficient makes it possible to calculate how close the relationship between the variables is.

Suppose there is a correlation between the price of a product and the demand for it. The number of units purchased at different prices from different sellers is shown in the table:

It can be seen that we are dealing with an inverse correlation. To quantify the closeness of the connection, the correlation coefficient is used:

r = Σ(x − Mx)(y − My) / (n·Sx·Sy),

where Mx, My are the means and Sx, Sy the standard deviations of X and Y.

We calculate the coefficient r in Excel using the fx button, then Statistical functions, then the CORREL function. At the program's prompt, we select the two arrays (X and Y) with the mouse in the two corresponding fields. In our case the correlation coefficient came out to r = -0.988. Note that the closer the correlation coefficient is to 0, the weaker the relationship between the variables. The closest relationship under direct correlation corresponds to a coefficient r close to +1. In our case the correlation is inverse but also very close, with a coefficient close to -1.
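For readers outside Excel, here is a minimal Python sketch of the same calculation CORREL performs, on hypothetical price/demand numbers (the table's actual values are not reproduced here):

```python
# Pearson correlation coefficient computed from scratch.
# price/demand values are hypothetical, chosen to show a close
# inverse correlation like the one described in the text.
import math

price  = [10, 12, 14, 16, 18, 20]    # X, hypothetical
demand = [95, 88, 80, 74, 65, 58]    # Y, hypothetical

n = len(price)
mx = sum(price) / n
my = sum(demand) / n
cov = sum((x - mx) * (y - my) for x, y in zip(price, demand))
sx = math.sqrt(sum((x - mx) ** 2 for x in price))
sy = math.sqrt(sum((y - my) ** 2 for y in demand))

r = cov / (sx * sy)
print(f"r = {r:.3f}")   # close to -1: a strong inverse correlation
```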

What can be said about random variables whose coefficient has an intermediate value, say r = 0.65? In this case statistics allows us to say that the two random variables are partially related. Strictly speaking, the share of variation explained is measured by the squared coefficient: r² ≈ 0.42, so roughly 42% of the variation in the number of purchases is associated with price, and the rest with other circumstances.

One more important circumstance should be mentioned. Since we are dealing with random variables, there is always a possibility that the connection we noticed is a coincidence. The probability of finding a connection where there is none is especially high when there are few points in the sample and when, instead of building a graph, you simply computed the correlation coefficient on a computer. Indeed, if we leave only two distinct points in any arbitrary sample, the correlation coefficient will equal either +1 or -1: as we know from school geometry, a straight line can always be drawn through two points. To assess the statistical reliability of the connection you have discovered, it is useful to apply the so-called correlation correction.
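The two-point effect is easy to verify (a minimal sketch; statistics.correlation requires Python 3.10+):

```python
# With only two distinct points the correlation coefficient is
# always +1 or -1, whatever the data mean.
from statistics import correlation  # Python 3.10+

print(correlation([1, 2], [5, 3]))    # -1.0
print(correlation([1, 2], [3, 17]))   #  1.0
```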

While the goal of correlation analysis is to establish whether the given random variables are related, the goal of regression analysis is to describe this relationship with an analytical dependence, i.e., with an equation. We consider the simplest case, when the connection between the points on the graph can be represented by a straight line. The equation of this line is Y = bX + a, where a = Ȳ − b·X̄ and b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)².

Knowing a and b, we can find the value of the function from the value of the argument at points where X is known but Y is not. Such estimates are very useful, but they must be used with caution, especially if the relationship between the quantities is not very close.

We also note, comparing the formulas for b and r, that the correlation coefficient does not give the slope of the line; it only shows the very fact that a connection exists.
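A minimal sketch of this straight-line fit, using the least-squares formula for b given above (data hypothetical):

```python
# Fit Y = bX + a: b from the least-squares formula,
# a = mean(Y) - b * mean(X).
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

a, b = fit_line([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])  # hypothetical
print(f"Y = {b:.2f}*X + {a:.2f}")  # predict Y where only X is known
```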

The company employs 10 people. Table 2 shows data on their work experience and monthly salary.

From these data, calculate:

  • the value of the sample covariance estimate;
  • the value of the sample Pearson correlation coefficient;
  • the direction and strength of the connection according to the values obtained;
  • the extent to which one can assert that this company uses the Japanese management model, which assumes that the more time an employee spends in the company, the higher his salary should be.

Based on the correlation field, a hypothesis can be put forward (for the general population) that the relationship between all possible values of X and Y is linear.

To calculate the regression parameters, we will build a calculation table.

Sample means.

Sample variances:

The estimated regression equation will look like

y = bx + a + e,

where ei are the observed values (estimates) of the errors εi, and a and b are, respectively, the estimates of the parameters α and β of the regression model, which are to be found.

To estimate the parameters α and β, we use the method of least squares (OLS).

The system of normal equations:

a·n + b·Σx = Σy
a·Σx + b·Σx² = Σxy

For our data, the system of equations has the form

(1) 10a + 307b = 33300
(2) 307a + 10857b = 1127700

We multiply equation (1) of the system by (-30.7) and obtain a system that we solve by algebraic addition.

-307a − 9424.9b = -1022310
307a + 10857b = 1127700

We get:

1432.1b = 105390

whence b = 73.5912

Now we find the coefficient "a" from equation (1):

10a + 307b = 33300
10a + 307 · 73.5912 = 33300
10a = 10707.49

We get empirical regression coefficients: b = 73.5912, a = 1070.7492

Regression equation (empirical regression equation):

y = 73.5912 x + 1070.7492
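As a quick check, the same system of normal equations can be solved numerically; a minimal numpy sketch using the sums quoted above (n = 10, Σx = 307, Σx² = 10857, Σy = 33300, Σxy = 1127700):

```python
# Solve the 2x2 system of normal equations
#   10a  + 307b   = 33300
#   307a + 10857b = 1127700
import numpy as np

A = np.array([[10.0, 307.0],
              [307.0, 10857.0]])
rhs = np.array([33300.0, 1127700.0])

a, b = np.linalg.solve(A, rhs)
print(f"a = {a:.4f}, b = {b:.4f}")  # a ≈ 1070.7492, b ≈ 73.5912
```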

Sample covariance:

In our example, the relationship between feature Y and factor X is high and direct.

Therefore, we can say with confidence that the longer an employee works at this company, the higher his salary.

4. Testing statistical hypotheses. The first step in solving such a problem is to formulate the hypothesis to be tested and the alternative hypothesis.

Testing the equality of population proportions.

A study of student performance was conducted at two faculties. The results for the variants are shown in Table 3. Can it be claimed that both faculties have the same percentage of excellent students?

The simple arithmetic mean:

We test the hypothesis of equality of the population proportions:

Let's find the observed value of Student's t-statistic:

The number of degrees of freedom:

f = nx + ny − 2 = 2 + 2 − 2 = 2

From the table of critical points of Student's distribution, for significance level α = 0.05 (two-sided, α/2 = 0.025) and the given number of degrees of freedom, we find tcr = t(2; 0.025) = 4.303.

Since tobs > tcr, the null hypothesis is rejected: the population proportions of the two samples are not equal.
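A hedged sketch of this comparison of two proportions (Table 3 is not reproduced above, so the counts below are hypothetical placeholders; the pooled-proportion statistic is the standard one):

```python
# Compare the proportion of excellent students at two faculties.
import math

n1, k1 = 200, 46   # faculty 1: total students, excellent (hypothetical)
n2, k2 = 180, 27   # faculty 2 (hypothetical)

p1, p2 = k1 / n1, k2 / n2
p = (k1 + k2) / (n1 + n2)                       # pooled proportion under H0
se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
t_obs = (p1 - p2) / se

print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, t_obs = {t_obs:.3f}")
# Compare |t_obs| with the tabulated critical value;
# if |t_obs| > t_cr, reject H0: the proportions differ.
```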

Testing the uniformity of the population distribution.

The university administration wants to find out how the popularity of the Faculty of Humanities has changed over time. The number of applicants who applied to this faculty was analyzed relative to the total number of applicants in the corresponding year (the data are given in Table 4). If we treat the number of applicants as a representative sample of all school graduates of a given year, can it be claimed that schoolchildren's interest in the specialties of this faculty does not change over time?

Option 4

Solution: we build a table for calculating the indicators, with columns: interval midpoint xi; cumulative frequency S; relative frequency fi/n.

To evaluate the distribution series, we find the following indicators:

The weighted average:

The range of variation is the difference between the maximum and minimum values of the attribute in the primary series:

R = 2008 − 1988 = 20

The variance characterizes the spread of the values around the mean (a measure of dispersion, i.e., of deviation from the mean).

Standard deviation (mean sampling error).

Each value of the series differs from the mean, 2002.66, by 6.32 on average.
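A sketch of these grouped-series indicators (the actual frequencies of Table 4 are not reproduced, so the values below are placeholders):

```python
# Weighted mean, variance and standard deviation for a grouped series.
import math

midpoints = [1990, 1994, 1998, 2002, 2006]   # interval midpoints xi, hypothetical
freqs     = [2, 3, 5, 4, 6]                  # frequencies fi, hypothetical

n = sum(freqs)
mean = sum(x * f for x, f in zip(midpoints, freqs)) / n           # weighted average
var  = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n
std  = math.sqrt(var)                                             # standard deviation

print(f"mean = {mean:.2f}, variance = {var:.2f}, std = {std:.2f}")
```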

Testing the hypothesis that the population is uniformly distributed.

To test the hypothesis that X is uniformly distributed, i.e., follows the law f(x) = 1/(b − a) on the interval (a, b), it is necessary to:

Estimate the parameters a and b, the ends of the interval in which the possible values of X were observed, by the formulas (the asterisk denotes a parameter estimate):

a* = x̄ − √3·σ,  b* = x̄ + √3·σ

Find the probability density of the estimated distribution f(x) = 1/(b* - a*)

Find theoretical frequencies:

n1 = n·P1 = n·(1/(b* − a*))·(x1 − a*)

n2 = n3 = ... = ns-1 = n·(1/(b* − a*))·(xi − xi-1)

ns = n·(1/(b* − a*))·(b* − xs-1)

Compare the empirical and theoretical frequencies using the Pearson (chi-square) test with k = s − 3 degrees of freedom, where s is the number of initial sampling intervals; if small frequencies (and hence the corresponding intervals) were combined, then s is the number of intervals remaining after combining. We find the estimates a* and b* of the uniform distribution parameters by the formulas above:

a* = x̄ − √3·σ ≈ 1991.71,  b* = x̄ + √3·σ ≈ 2013.62

Let's find the density of the supposed uniform distribution:

f(x) = 1/(b* - a*) = 1/(2013.62 - 1991.71) = 0.0456

Let's find the theoretical frequencies:

n1 = n·f(x)·(x1 − a*) = 0.77 · 0.0456 · (1992 − 1991.71) = 0.0102

n2 = n3 = n4 = n·f(x)·(xi − xi-1)

n5 = n·f(x)·(b* − x4) = 0.77 · 0.0456 · (2013.62 − 2008) = 0.2
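A sketch of this frequency calculation, assuming interval boundaries consistent with the figures quoted above (the interior boundaries are inferred and should be treated as assumptions):

```python
# Theoretical frequencies for the fitted uniform distribution.
# n, mean and sigma are the totals quoted in the text; the interior
# interval boundaries are inferred/hypothetical.
import math

n, mean, sigma = 0.77, 2002.66, 6.32

a_star = mean - math.sqrt(3) * sigma      # ≈ 1991.71
b_star = mean + math.sqrt(3) * sigma      # ≈ 2013.62
f_x = 1 / (b_star - a_star)               # ≈ 0.0456

# Outer ends of the interval grid replaced by the estimates a*, b*.
bounds = [a_star, 1992.0, 1997.33, 2002.67, 2008.0, b_star]
theor = [n * f_x * (bounds[i] - bounds[i - 1]) for i in range(1, len(bounds))]
print([round(t, 4) for t in theor])       # n1 ≈ 0.0102, ..., n5 ≈ 0.2
```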

Since the Pearson statistic measures the discrepancy between the empirical and theoretical distributions, the larger its observed value χ²obs, the stronger the argument against the null hypothesis.

Therefore, the critical region for this statistic is always right-sided.

Influence of measurement errors on the value of the correlation coefficient. Suppose we want to estimate the closeness of the correlation between the components of a two-dimensional normal random variable (ξ, η), but we can observe them only with random measurement errors εξ and εη (see the dependence scheme D2 in the introduction). The experimental data (xi, yi), i = 1, 2, ..., n, are then, in effect, sample values of a distorted two-dimensional random variable, whose characteristics may differ substantially from those of the original (undistorted) pair (ξ, η). Such distortion of the normal scheme always reduces the absolute value of the regression coefficient in relation (B.15) and also weakens the closeness of the relationship between ξ and η (i.e., reduces the absolute value of the correlation coefficient r).

The method of regression analysis consists in deriving a regression equation (including estimating its parameters), with the help of which the average value of a random variable is found when the value of another variable (or of several others, in the case of multiple or multivariate regression) is known. (In contrast, correlation analysis is used to find and express the strength of the relationship between random variables.)

When studying the correlation of attributes that are not connected by a consistent change in time, each attribute changes under the influence of many causes treated as random. In time series, a change over time is added to these for each series. This change leads to so-called autocorrelation: the influence of changes in the levels of earlier terms of a series on later ones. Therefore, the correlation between the levels of time series correctly shows the closeness of the relationship between the phenomena they reflect only if there is no autocorrelation in either of them. In addition, autocorrelation distorts the standard errors of the regression coefficients, which makes it difficult to build confidence intervals for them and to test their significance.

The theoretical and sample correlation coefficients, defined by relations (1.8) and (1.8) respectively, can formally be calculated for any two-dimensional system of observations; they are measures of the degree of closeness of the linear statistical relationship between the analyzed attributes. However, only in the case of a joint normal distribution of the random variables ξ and η under study does the correlation coefficient r have a clear meaning as a characteristic of the degree of closeness of the connection between them. In particular, in this case |r| = 1 corresponds to a purely functional linear relationship between the quantities, while r = 0 indicates their complete mutual independence. In addition, the correlation coefficient, together with the means and variances of the random variables ξ and η, constitutes the five parameters that provide exhaustive information about their joint distribution.

Regression Analysis

Processing the results of an experiment by the method of regression analysis

When studying the functioning of complex systems, one has to deal with a number of simultaneously acting random variables. To understand the mechanism of the phenomena and the cause-and-effect relationships between the elements of the system, we try to establish the relationships among these quantities from the observations obtained.

In mathematical analysis, the dependence between two quantities is expressed by the concept of a function

y = f(x),

where each value of one variable corresponds to exactly one value of the other. Such a dependence is called functional.

The situation with the concept of dependence between random variables is much more complicated. As a rule, between the random variables (random factors) that determine the functioning of complex systems there is a relationship in which a change in one variable changes the distribution of the other. Such a connection is called stochastic, or probabilistic. The change in the random factor Y corresponding to a change in the value X can be broken down into two components: the first reflects the dependence of Y on X, and the second the influence of the "own" random components of Y and X. If the first component is missing, then Y and X are independent. If the second component is missing, then Y and X are functionally dependent. When both components are present, the ratio between them determines the strength, or closeness, of the relationship between Y and X.

There are various indicators that characterize particular aspects of a stochastic relationship. Thus, the linear dependence between the random variables X and Y is characterized by the correlation coefficient

rxy = M[(X − mx)(Y − my)] / (σx·σy),

where mx and my are the mathematical expectations of the random variables X and Y, and σx and σy are their standard deviations.


The linear probabilistic dependence of random variables consists in the fact that as one random variable increases, the other tends to increase (or decrease) according to a linear law. If the random variables X and Y are connected by a strict linear functional dependence, for example

y = b0 + b1·x,

then the correlation coefficient equals ±1, where the sign corresponds to the sign of the coefficient b1. If X and Y are connected by an arbitrary stochastic dependence, the correlation coefficient varies within −1 < r < 1.

It should be emphasized that for independent random variables the correlation coefficient is zero. However, as an indicator of dependence between random variables, the correlation coefficient has serious drawbacks. First, the equality r = 0 does not imply independence of the random variables X and Y (except for random variables obeying the normal distribution law, for which r = 0 does mean the absence of any dependence). Second, the extreme values are not very useful either, since they correspond not to just any functional dependence but only to a strictly linear one.
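The first drawback is easy to demonstrate numerically: take Y = X² with X symmetric about zero; Y depends on X completely, yet r ≈ 0:

```python
# Y = X**2 is a strict functional dependence on X, but the (linear)
# correlation coefficient is essentially zero for symmetric X.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)   # symmetric around zero
y = x ** 2                        # fully determined by x

print(np.corrcoef(x, y)[0, 1])    # ≈ 0: uncorrelated, but not independent
```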



A full description of the dependence of Y on X, expressed, moreover, in exact functional relationships, can be obtained by knowing the conditional distribution function F(y | x).

It should be noted that here one of the observed variables is considered non-random. Fixing the values of the two random variables X and Y simultaneously and comparing their values, we can attribute all the errors to the value Y. The observation error will then be the sum of the own random error of Y and the matching error, which arises because the value of Y is matched with a value of X that is not quite the one that actually occurred.

However, finding the conditional distribution function usually turns out to be a very challenging task. The relationship between X and Y is easiest to investigate when Y is normally distributed, since a normal distribution is completely determined by its mathematical expectation and variance. In this case, to describe the dependence of Y on X there is no need to build the conditional distribution function; it is enough to indicate how the mathematical expectation and the variance of Y change as the parameter X changes.

Thus, we come to the need to find only two functions:

M(Y | x) = f(x),  D(Y | x) = φ(x)

The dependence of the conditional variance D(Y | x) on the parameter x is called the scedastic dependence. It characterizes the change in the precision of the observation technique as the parameter changes and is used rather rarely.

The dependence of the conditional mathematical expectation M(Y | x) on x is called regression; it gives the true dependence between the quantities X and Y, free of all random layers. Therefore, the ideal goal of any study of dependent variables is to find the regression equation, while the variance is used only to assess the accuracy of the result.
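A sketch of regression in exactly this sense, the conditional mean M(Y | x), approximated by averaging y within bins of x (synthetic data; the true relation used here is M(Y | x) = 2x + 1):

```python
# Approximate the conditional mean M(Y | x) by binning x and
# averaging y within each bin. Data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 5000)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, x.size)   # true M(Y|x) = 2x + 1

bins = np.linspace(0, 10, 11)
idx = np.digitize(x, bins)
for i in range(1, len(bins)):
    sel = idx == i
    if sel.any():
        print(f"x in [{bins[i-1]:4.1f}, {bins[i]:4.1f}): "
              f"mean(Y|x) ≈ {y[sel].mean():.2f}")
```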

The direct interpretation of the term correlation is a stochastic, probable, possible connection between two (paired) or several (multiple) random variables.

It was said above that if for two RVs (X and Y) we have the equality P(XY) = P(X)·P(Y), then X and Y are considered independent. But what if this is not so?

After all, the question is always important: how strongly does one RV depend on the other? And the point is not some desire peculiar to people to analyze everything in numerical terms. It is already clear that systems analysis means continuous calculations, and that using a computer forces us to work with numbers, not concepts.

To numerically evaluate a possible relationship between two random variables, Y (with mean My and standard deviation Sy) and X (with mean Mx and standard deviation Sx), it is customary to use the so-called correlation coefficient

Rxy = Σ(X − Mx)·(Y − My) / (n·Sx·Sy). {2 - 11}

This coefficient can take values from −1 to +1, depending on the closeness of the relationship between these random variables.

If the correlation coefficient is zero, X and Y are called uncorrelated. There is usually no reason to consider them independent: it turns out that there exist, as a rule, non-linear relationships between quantities under which Rxy = 0 even though the quantities depend on each other. The converse is always true: if the values are independent, then Rxy = 0. But if |Rxy| = 1, there is every reason to assume a linear connection between Y and X. That is why one often speaks of linear correlation when using this method of estimating the connection between RVs.

We note another way to assess the correlation between two random variables: summing the products of the deviations of each from its mean value gives the quantity

Cxy = Σ(X − Mx)·(Y − My),

the (unnormalized) covariance of X and Y. It differs from the correlation coefficient in two respects: first, averaging (division by the number of observations, i.e., of pairs X, Y), and second, normalization by dividing by the corresponding standard deviations.

Such an assessment of the relationships between the random variables in a complex system is one of the initial stages of systems analysis; therefore, already here the question of confidence in a conclusion about the presence or absence of a connection between two RVs arises in all its sharpness.

In modern methods of systems analysis one usually proceeds as follows. From the found value R, the auxiliary value is calculated:

W = 0.5·Ln[(1 + R)/(1 − R)], {2 - 12}

and the question of confidence in the correlation coefficient reduces to confidence intervals for the random variable W, which are determined from standard tables or formulas.
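A sketch of this procedure, using the standard error 1/√(n − 3) for W (a standard result for the transformation {2-12}; the sample values are hypothetical):

```python
# Fisher's W transformation {2-12} and a confidence interval for r.
import math

r, n = 0.65, 30                         # hypothetical sample values
W = 0.5 * math.log((1 + r) / (1 - r))   # auxiliary value W
se = 1 / math.sqrt(n - 3)               # standard error of W
z = 1.96                                # 95% two-sided normal quantile

lo, hi = W - z * se, W + z * se
r_lo, r_hi = math.tanh(lo), math.tanh(hi)   # map back from W to r
print(f"95% CI for the correlation: ({r_lo:.3f}, {r_hi:.3f})")
```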

In some cases of systems analysis it is necessary to resolve the question of the relationships among several (more than two) random variables, the question of multiple correlation.

Let X, Y, and Z be random variables for which observations have given us the means Mx, My, Mz and the standard deviations Sx, Sy, Sz.

Then one can find the pairwise correlation coefficients Rxy, Rxz, Ryz by the formula above. But this is clearly not enough: at each of the three steps we simply forgot about the presence of the third random variable! Therefore, in multiple correlation analysis one sometimes has to look for so-called partial correlation coefficients; for example, the influence of Z on the connection between X and Y is assessed using the coefficient

Rxy.z = (Rxy − Rxz·Ryz) / √[(1 − Rxz²)·(1 − Ryz²)]. {2 - 13}

Finally, we can ask: what is the connection between one RV and the set of all the others? The answer to such questions is given by the coefficients of multiple correlation Rx.yz, Ry.zx, Rz.xy, whose calculation formulas are built on the same principles: taking into account the connection of one of the quantities with all the others in the aggregate.
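A sketch computing the partial coefficient {2-13} and one multiple coefficient from pairwise correlations (the pairwise values are hypothetical; the multiple-correlation formula is the standard one built on the same principles):

```python
# Partial and multiple correlation from pairwise coefficients.
import math

r_xy, r_xz, r_yz = 0.70, 0.50, 0.40     # pairwise coefficients, hypothetical

# Partial correlation of X and Y with Z "excluded" (formula {2-13}):
r_xy_z = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Multiple correlation of X with the pair (Y, Z):
r2 = (r_xy**2 + r_xz**2 - 2 * r_xy * r_xz * r_yz) / (1 - r_yz**2)
r_x_yz = math.sqrt(r2)

print(f"R_xy.z = {r_xy_z:.3f}, R_x.yz = {r_x_yz:.3f}")
```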

The computational complexity of all the described correlation indicators need not receive special attention: the programs that calculate them are quite simple and available ready-made in many software packages for modern computers.

It is enough to understand the main thing: if, in the formal description of an element of a complex system, of a set of such elements forming a subsystem, or, finally, of the system as a whole, we consider the connections between its individual parts, then the degree of closeness of those connections, in the form of the influence of one RV on another, can and should be assessed at the level of correlation.

In conclusion, we note one more point: in all cases of systems analysis at the level of correlation, the random variables, both in a pairwise correlation and all of them in a multiple one, are treated as "equals", i.e., we are speaking of the mutual influence of RVs on each other.

This is not always the case: very often the question of the connection between Y and X is posed in a different plane, with one of the quantities being dependent (a function) on the other (the argument).