Spearman's rank correlation method. Spearman's rank correlation coefficient rs


Publication date: 09/03/2017 13:01

The term "correlation" is actively used in the humanities and in medicine, and it frequently features in the media. Correlations play a key role in psychology; in particular, calculating correlations is a milestone in carrying out the empirical part of a final qualification work (thesis) in psychology.

Material about correlation on the web tends to be too scientific, and it is difficult for a non-specialist to understand the formulas. At the same time, understanding the meaning of correlations is necessary for marketers, sociologists, physicians, and psychologists: for everyone who conducts research on people.

In this article we explain in plain language the essence of correlation, the types of correlations, the methods of calculating them, and the features of using correlation in psychological research, as well as in writing theses in psychology.


What is correlation

Correlation is a relationship. But not just any relationship. What makes it special? Let's look at an example.

Imagine that you are driving a car. You press the gas pedal, and the car goes faster; you ease off the gas, and the car slows down. Even a person unfamiliar with how a car works will say: "There is a direct relationship between the gas pedal and the speed of the car: the harder the pedal is pressed, the higher the speed."

This dependence is functional: speed is a direct function of the gas pedal. A specialist will explain that the pedal controls the supply of fuel to the cylinders, where the mixture is combusted, which increases the power delivered to the shaft, and so on. This connection is rigid, deterministic, and allows no exceptions (provided the car is in working order).

Now imagine that you are the director of a company whose employees sell goods. You decide to increase sales by raising employees' salaries. You raise salaries by 10%, and the company's average sales go up. After a while, you raise them by another 10%, and again there is growth. Then another 5%, and again there is an effect. The conclusion suggests itself: there is a direct relationship between the company's sales and the employees' salaries; the higher the salaries, the higher the sales. Is this the same kind of connection as between the gas pedal and the speed of the car? What is the key difference?

That's right: the relationship between salary and sales is not rigid. This means that for some employees sales could even decline despite the raise, while for others they could stay the same. But on average, sales in the company have grown, and we say that there is a relationship between sales and employee salaries, and that it is a correlation.

The functional connection (gas pedal - speed) is based on physical law. The basis of the correlation (sales - salary) is a simple consistency of changes in two indicators. There is no law (in the physical sense of the word) behind correlation. There is only a probabilistic (stochastic) regularity.

Numerical expression of correlation dependence

So, correlation reflects a dependence between phenomena. If these phenomena can be measured, the dependence receives a numerical expression.

For example, suppose the role of reading in people's lives is being studied. Researchers took a group of 40 people and measured two indicators for each subject: 1) how much time the person reads per week; 2) to what extent the person considers himself successful (on a scale from 1 to 10). The researchers entered the data into two columns and used a statistical program to calculate the correlation between reading and well-being. Suppose they got the result -0.76. What does this number mean, and how should it be interpreted? Let's figure it out.

The resulting number is called the correlation coefficient. For its correct interpretation, it is important to consider the following:

  1. The sign "+" or "-" reflects the direction of dependence.
  2. The value of the coefficient reflects the strength of the dependence.

Direct and reverse

The plus sign in front of the coefficient indicates that the relationship between phenomena or indicators is direct. That is, the greater one indicator, the greater the other. Higher salary means higher sales. Such a correlation is called direct, or positive.

If the coefficient has a minus sign, the correlation is inverse, or negative. In this case, the higher one indicator, the lower the other. In the reading and well-being example we got -0.76, which means that the more people read, the lower their level of well-being.

Strong and weak

In numerical terms, a correlation is a number in the range from -1 to +1, denoted by the letter "r". The higher the number (ignoring the sign), the stronger the correlation.

The lower the numerical value of the coefficient, the less the relationship between phenomena and indicators.

The maximum possible dependency strength is 1 or -1. How to understand and present it?

Consider an example. Researchers took 10 students, measured each one's level of intelligence (IQ) and academic performance for the semester, and arranged the data in two columns.

[Table: test subject, IQ, academic performance (points)]

Look carefully at the data in the table. From subject 1 to subject 10, the IQ level increases, and the level of academic performance rises with it. Of any two students, the one with the higher IQ will have better performance. And there are no exceptions to this rule.

Before us is an example of a complete, 100% consistent change of two indicators in a group, and this is an example of the maximum possible positive relationship. That is, the correlation between intelligence and performance is 1.
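This perfectly consistent case is easy to check numerically. A minimal sketch in Python (the IQ and performance numbers are hypothetical, since the original table did not survive; scipy is assumed to be available):

```python
# Hypothetical data for 10 students: IQ strictly increases from subject 1 to 10,
# and academic performance increases with it -- perfectly consistent ranks.
from scipy.stats import spearmanr

iq = [90, 95, 100, 105, 110, 115, 120, 125, 130, 135]
performance = [3.0, 3.2, 3.5, 3.7, 4.0, 4.2, 4.4, 4.6, 4.8, 5.0]

r, p = spearmanr(iq, performance)
print(r)  # 1.0 -- the maximum possible positive correlation
```

Any strictly increasing pair of series gives exactly 1; the absolute values do not matter, only the perfectly matching order.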

Let's consider another example. The same 10 students were assessed with the help of a survey to what extent they feel successful in communicating with the opposite sex (on a scale from 1 to 10).

[Table: test subject, IQ, success in communicating with the opposite sex (points)]

We look closely at the data in the table. From subject 1 to subject 10, the IQ level increases, while the level of success in communicating with the opposite sex consistently decreases. Of any two students, the one with the lower IQ will be more successful in communicating with the opposite sex. And there are no exceptions to this rule.

This is an example of complete consistency in the change of two indicators in the group - the maximum possible negative relationship. The correlation between IQ and the success of communication with the opposite sex is -1.

And how to understand the meaning of a correlation equal to zero (0)? This means that there is no relationship between the indicators. Once again, let's return to our students and consider another indicator measured by them - the length of the jump from a place.

[Table: test subject, IQ, standing jump length (m)]

There is no consistency between person-to-person variation in IQ and long jump. This indicates a lack of correlation. The correlation coefficient of IQ and jump length for students is 0.

We've looked at extreme cases. In real measurements, the coefficients are rarely equal to exactly 1 or 0. In this case, the following scale is adopted:

  • if the coefficient is greater than 0.70, the relationship between the indicators is strong;
  • from 0.30 to 0.70, the relationship is moderate;
  • less than 0.30, the relationship is weak.
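The scale above can be expressed as a tiny helper function (a sketch; the cut-offs and labels are exactly the ones from the list above):

```python
def correlation_strength(r):
    """Classify |r| using the rough scale from the text."""
    a = abs(r)
    if a > 0.70:
        return "strong"
    elif a >= 0.30:
        return "moderate"
    else:
        return "weak"

print(correlation_strength(-0.76))  # strong
print(correlation_strength(0.25))   # weak
```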

If we evaluate on this scale the correlation obtained above between reading and well-being (-0.76), it turns out that this dependence is strong and negative. That is, there is a strong negative relationship between reading and well-being, which once again confirms the biblical wisdom about the relationship between wisdom and sorrow.

The given gradation gives very rough estimates and is rarely used in research in this form.

Gradations of coefficients by significance level are used more often. In this case, the coefficient actually obtained may be significant or not significant. This can be determined by comparing its value with the critical value of the correlation coefficient taken from a special table. These critical values depend on the sample size: the larger the sample, the lower the critical value.

Correlation analysis in psychology

The correlation method is one of the main ones in psychological research. And this is not accidental, because psychology strives to be an exact science. Does it work?

What is special about laws in the exact sciences? For example, the law of gravitation in physics operates without exception: the greater the mass of a body, the stronger its attraction to other bodies. This physical law reflects the relationship between body mass and gravitational force.

In psychology, the situation is different. For example, psychologists publish data on the relationship between warm relationships with parents in childhood and the level of creativity in adulthood. Does this mean that any subject who had a very warm relationship with their parents in childhood will have very high creative abilities? The answer is unequivocally no. There is no law here like a physical one, no rigid mechanism by which childhood experience produces adult creativity; such mechanisms are our fantasies. There is consistency in the data (relationships - creativity), but there is no law behind it; there is only a correlation. Psychologists often call the identified relationships psychological patterns, emphasizing their probabilistic nature rather than rigidity.

The student study example from the previous section illustrates well the use of correlations in psychology:

  1. Analysis of the relationship between psychological indicators. In our example, IQ and the success of communication with the opposite sex are psychological parameters. Identification of the correlation between them expands ideas about the mental organization of a person, about the relationship between various aspects of his personality - in this case between the intellect and the sphere of communication.
  2. Analysis of the relationship of IQ with academic performance and jumping is an example of the relationship of a psychological parameter with non-psychological ones. The results obtained reveal the features of the influence of intelligence on educational and sports activities.

Here's what a summary of the results of a fictional study on students could look like:

  1. A significant positive relationship between the intelligence of students and their academic performance was revealed.
  2. There is a negative significant relationship between IQ and successful communication with the opposite sex.
  3. There was no connection between the IQ of students and the ability to jump from a place.

Thus, the level of intelligence of students acts as a positive factor in their academic performance, while at the same time negatively affecting relationships with the opposite sex and not having a significant impact on sports success, in particular, the ability to jump from a place.

As you can see, intelligence helps students learn but hinders them in building relationships with the opposite sex, while having no effect on their athletic performance.

The ambiguous influence of intelligence on the personality and activity of students reflects the complexity of this phenomenon in the structure of personality traits and the importance of continuing research in this direction. In particular, it seems important to analyze the relationship between intelligence and psychological features and activities of students, taking into account their gender.

Pearson and Spearman coefficients

Let's consider two calculation methods.

The Pearson coefficient is a method for calculating the relationship between two numerically measured indicators in one group. Very much simplified, it boils down to the following:

  1. The values of two parameters in the group of subjects are taken (for example, aggression and perfectionism).
  2. The mean value of each parameter in the group is found.
  3. The difference between each subject's value and the mean is found for each parameter.
  4. These differences are substituted into a special formula for calculating the Pearson coefficient.
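The four steps can be sketched in Python without any libraries (the aggression/perfectionism numbers are made up for illustration; the "special formula" in step 4 is the usual product-moment formula):

```python
import math

def pearson(x, y):
    n = len(x)
    mx = sum(x) / n            # step 2: group means
    my = sum(y) / n
    dx = [v - mx for v in x]   # step 3: deviations from the mean
    dy = [v - my for v in y]
    # step 4: the deviations enter the product-moment formula
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den

# hypothetical aggression / perfectionism scores for six subjects
aggression    = [12, 15, 11, 18, 14, 16]
perfectionism = [30, 38, 28, 45, 36, 40]
print(round(pearson(aggression, perfectionism), 3))
```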

Spearman's rank correlation coefficient is calculated in a similar way:

  1. The values of two indicators in the group of subjects are taken.
  2. The rank of each value within its indicator is found, that is, its place in the list ordered ascending.
  3. The differences between the ranks are found, squared, and summed.
  4. These rank differences are then substituted into a special formula for calculating the Spearman coefficient.
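A minimal sketch of these steps, again without libraries (the `ranks` helper assigns average ranks to tied values; the last line applies the classic formula rs = 1 - 6*sum(d^2) / (n(n^2 - 1)), without a tie correction term):

```python
def ranks(values):
    # rank 1 = smallest value; ties receive the average of their positions
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # squared rank differences
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```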

In Pearson's case, the calculation is based on mean values. Therefore, random outliers in the data (values differing greatly from the mean), caused for example by processing errors or unreliable answers, can significantly distort the result.

In Spearman's case, the absolute values of the data do not matter, since only their mutual arrangement relative to one another (their ranks) is taken into account. Thus, data outliers or other inaccuracies will not seriously affect the final result.

If the test results are correct, the differences between the Pearson and Spearman coefficients are insignificant, although the Pearson coefficient estimates the strength of the relationship more exactly.

How to Calculate the Correlation Coefficient

The Pearson and Spearman coefficients can be calculated manually. This may be necessary for an in-depth study of statistical methods.

However, in most cases, when solving applied problems, including in psychology, it is possible to carry out calculations using special programs.

Calculation using Microsoft Excel spreadsheets

Let's go back to the students example and look at the data on their level of intelligence and the length of the jump from a place. Let's enter this data (two columns) into an Excel spreadsheet.

After moving the cursor to an empty cell, press the "Insert Function" option and select "CORREL" from the "Statistical" section.

The format of this function requires specifying two data arrays: CORREL(array1; array2). We select the column with IQ values and the column with jump lengths, respectively.

Excel implements the formula for calculating the Pearson coefficient only.

Calculation with the program STATISTICA

We enter data on intelligence and the length of the jump in the field of initial data. Next, select the option "Nonparametric criteria", "Spearman". Select the parameters for the calculation and get the following result.


As you can see, the calculation gave a result of 0.024, which differs from the Pearson result of 0.038 obtained above in Excel. However, the differences are minor.

Using correlation analysis in psychology theses (example)

Most topics of final qualification works in psychology (diploma and term papers, master's theses) involve a correlation study (the rest involve identifying differences in psychological indicators between different groups).

The term "correlation" itself rarely appears in the titles of such topics; it is hidden behind wordings like the following:

  • "The relationship between subjective feelings of loneliness and self-actualization in women of mature age";
  • “Peculiarities of the influence of the resilience of managers on the success of their interaction with clients in conflict situations”;
  • "Personal factors of stress resistance of employees of the Ministry of Emergency Situations."

Thus, the words "relationship", "influence", and "factors" are sure signs that the method of data analysis in the empirical study should be correlation analysis.

Let us briefly consider the stages of carrying it out when writing a thesis in psychology on the topic "The relationship between personal anxiety and aggressiveness in adolescents".

1. The calculation requires raw data, which are usually the subjects' test results. They are entered into a summary table placed in the appendix. This table is structured as follows:

  • each line contains data for one subject;
  • each column contains scores on one scale for all subjects.

[Table of raw data: subject number, personal anxiety, aggressiveness]

2. It is necessary to decide which of the two coefficients, Pearson's or Spearman's, will be used. Recall that Pearson's gives a more accurate result but is sensitive to outliers in the data. Spearman's coefficient can be used with any data (except nominal-scale data), which is why it is most often used in psychology theses.

3. We enter the table of raw data into the statistical program.

4. Calculate the value.



5. The next step is to determine if the relationship is significant. The statistical program highlighted the results in red, which means that the correlations are statistically significant at a significance level of 0.05 (indicated above).

However, it is useful to know how to determine the significance manually. To do this, you need Spearman's critical values ​​table.

Table of critical values of the Spearman coefficient

(Each row corresponds to one sample size N; the "number of test subjects" column did not survive extraction, and cells lost in extraction are marked "…". Per the calculation below, the row 0.63 / 0.77 / 0.87 corresponds to N = 10.)

Level of statistical significance:

p=0.05   p=0.01   p=0.001
0.88     0.96     0.99
0.81     0.92     0.97
0.75     0.88     0.95
0.71     0.83     0.93
0.67     …        …
0.63     0.77     0.87
…        0.74     0.85
0.58     0.71     0.82
0.55     0.68     …
0.53     0.66     0.78
0.51     0.64     0.76

We are interested in the significance level of 0.05 and our sample size of 10 people. At the intersection of these values we find the critical Spearman value: Rcr = 0.63.

The rule is this: if the obtained empirical Spearman value is greater than or equal to the critical value, it is statistically significant. In our case, Remp (0.66) > Rcr (0.63); therefore, the relationship between aggressiveness and anxiety in the adolescent group is statistically significant.
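This decision rule is easy to mechanize (a sketch; the critical value must still be looked up in the table for the given N and p-level):

```python
def is_significant(r_emp, r_crit):
    """Manual significance check: |Remp| >= Rcr means the correlation is significant."""
    return abs(r_emp) >= r_crit

# values from the example above: N = 10, p = 0.05
print(is_significant(0.66, 0.63))  # True -- the correlation is significant
```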

6. In the text of the thesis, you need to insert the data as a Word table, not as a table copied from the statistical program. Below the table, we describe the result obtained and interpret it.

Table 1

Spearman coefficient for aggressiveness and anxiety in a group of adolescents

                     Aggressiveness
Personal anxiety     0.665*

* statistically significant (p ≤ 0.05)

Analysis of the data presented in Table 1 shows that there is a statistically significant positive relationship between adolescents' aggressiveness and anxiety. This means that the higher adolescents' personal anxiety, the higher their level of aggressiveness. This result suggests that for adolescents aggression is one way of relieving anxiety: experiencing self-doubt and anxiety over threats to self-esteem, to which adolescence is especially sensitive, a teenager often resorts to aggressive behavior, reducing anxiety in this unproductive way.

7. Is it possible to talk about influence when interpreting relationships? Can we say that anxiety affects aggressiveness? Strictly speaking, no. We showed above that the correlation between phenomena is probabilistic in nature and reflects only the consistency of changes in characteristics in a group. We cannot say that this consistency is caused by one of the phenomena being the cause of the other, affecting it. That is, the presence of a correlation between psychological parameters gives no grounds to speak of a causal relationship between them. Nevertheless, practice shows that the term "influence" is often used when analyzing the results of correlation analysis.

37. Spearman's rank correlation coefficient.


http://psystat.at.ua/publ/1-1-0-33

Spearman's rank correlation coefficient is used when:
- the variables are measured on a ranking (ordinal) scale;
- the distribution of the data differs markedly from normal or is not known at all;
- the samples are small (N < 30).

The interpretation of Spearman's rank correlation coefficient does not differ from that of the Pearson coefficient, but its meaning is somewhat different. To understand the difference between these methods and logically justify their areas of application, let's compare their formulas.

Pearson correlation coefficient:

r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²)

Spearman's correlation coefficient:

rs = 1 − 6Σd² / (N(N² − 1)), where d is the difference between the ranks of an observation on the two variables.

As you can see, the formulas differ significantly. Let's compare them.

The Pearson correlation formula uses the arithmetic mean and the standard deviation of the correlated series, while the Spearman formula does not. Thus, to obtain an adequate result with the Pearson formula, the correlated series must be close to the normal distribution (the mean and standard deviation are the parameters of the normal distribution). This is not relevant for the Spearman formula.

An element of Pearson's formula is the standardization of each series to z-scores.

As you can see, the conversion of variables to the z-scale is present in the formula for the Pearson correlation coefficient. Accordingly, for the Pearson coefficient, the scale of the data is completely irrelevant: for example, we can correlate two variables, one of which has min = 0 and max = 1, and the other min = 100 and max = 1000. However different the ranges of values may be, they will all be converted to standard z-values of the same scale.

There is no such normalization in the Spearman coefficient, so

A MANDATORY CONDITION FOR USING THE SPEARMAN COEFFICIENT IS THE EQUALITY OF THE RANGES OF THE TWO VARIABLES.

Before using the Spearman coefficient for data series with different ranges, it is necessary to rank them. Ranking makes the values of these series acquire the same minimum of 1 (the minimum rank) and a maximum equal to the number of values (the maximum, last rank = N, i.e., the number of cases in the sample).
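This effect of ranking is easy to see with scipy's `rankdata` (a sketch with made-up numbers; whatever the original ranges, the ranks of both series run from 1 to N):

```python
from scipy.stats import rankdata

# Two hypothetical series with very different ranges
x = [0.1, 0.4, 0.2, 0.9, 0.7]
y = [120, 900, 340, 150, 600]

rx, ry = rankdata(x), rankdata(y)
print(rx)  # [1. 3. 2. 5. 4.] -- both series now run from 1 to N
print(ry)  # [1. 5. 3. 2. 4.]
```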

In what cases it is possible to do without ranking

These are cases when the data are originally on a ranking scale: for example, the Rokeach Value Survey.

Also, these are cases when the number of value options is small and the sample has a fixed minimum and maximum. For example, in the semantic differential, the minimum = 1 and the maximum = 7.

An example of calculating the Spearman rank correlation coefficient

The Rokeach Value Survey was administered to two samples, X and Y. The task was to find out how close the value hierarchies of these samples are (literally, how similar they are).

The resulting value r = 0.747 is checked against the table of critical values. According to the table, at N = 18 the obtained value is reliable at the level p ≤ 0.005.

Rank correlation coefficients according to Spearman and Kendall

For variables belonging to the ordinal scale, or for variables that do not follow the normal distribution (as well as for variables belonging to the interval scale), Spearman's rank correlation is calculated instead of the Pearson coefficient. To do this, the individual values of the variables are assigned ranks, which are then processed with the appropriate formulas. To obtain the rank correlation, uncheck the Pearson correlation box (checked by default) in the Bivariate Correlations... dialog box and activate the Spearman correlation calculation instead. The calculation gives the following result: the rank correlation coefficients are very close to the corresponding Pearson values (the original variables have a normal distribution).

titkova-matmetody.pdf p. 45

Spearman's rank correlation method allows you to determine the tightness (strength) and direction of the correlation between two features, or between two profiles (hierarchies) of features.

To calculate the rank correlation, you need two series of values that can be ranked. These series of values can be:

1) two features measured in the same group of test subjects;

2) two individual hierarchies of features identified in two subjects for the same set of features;

3) two group hierarchies of features;

4) an individual and a group hierarchy of features.

First, the indicators are ranked separately for each feature. As a rule, a lower feature value is assigned a lower rank.

In the first case (two features), the individual values obtained by the different subjects on the first feature are ranked, then the individual values on the second feature. If two features are positively related, subjects with low ranks on one of them will have low ranks on the other, and subjects with high ranks on one feature will also have high ranks on the other. To calculate rs, it is necessary to determine the differences (d) between the ranks obtained by each subject on the two features. These differences d are then transformed in a certain way and subtracted from 1. The smaller the differences between the ranks, the larger rs will be: the closer to +1.

If there is no correlation, all the ranks will be mixed up and there will be no correspondence between them. The formula is designed so that in this case rs will be close to 0.

In the case of a negative correlation, subjects' low ranks on one feature will correspond to high ranks on the other feature, and vice versa. The greater the mismatch between subjects' ranks on the two variables, the closer rs is to -1.

In the second case (two individual profiles), the individual values obtained by each of the two subjects on a certain set of features (the same for both of them) are ranked. The feature with the lowest value receives the first rank; the feature with the next higher value, the second rank; and so on. Obviously, all features must be measured in the same units, otherwise ranking is impossible. For example, it is impossible to rank the scores of the Cattell Personality Questionnaire (16PF) if they are expressed in "raw" scores, since the ranges of values differ across factors: from 0 to 13, from 0 to 20, and from 0 to 26. We cannot say which factor will take first place in severity until all the values are brought to a single scale (most often, the sten scale).

If the individual hierarchies of two subjects are positively related, then features with low ranks in one of them will have low ranks in the other, and vice versa. For example, if for one subject factor E (dominance) has the lowest rank, then for the other subject it should also have a low rank; if for one subject factor C (emotional stability) has the highest rank, then the other subject should also assign this factor a high rank; and so on.

In the third case (two group profiles), the group mean values obtained in two groups of subjects for a certain set of features, identical for the two groups, are ranked. The line of reasoning is then the same as in the previous two cases.

In the fourth case (an individual and a group profile), the individual values of the subject and the group mean values are ranked separately for the same set of features. The group means are obtained, as a rule, with this individual subject excluded: he does not participate in the average group profile with which his individual profile will be compared. Rank correlation makes it possible to check how consistent the individual and group profiles are.

In all four cases, the significance of the obtained correlation coefficient is determined by the number of ranked values N. In the first case, this number coincides with the sample size n. In the second case, the number of observations is the number of features constituting the hierarchy. In the third and fourth cases, N is likewise the number of compared features, not the number of subjects in the groups. Detailed explanations are given in the examples. If the absolute value of rs reaches the critical value or exceeds it, the correlation is reliable.

Hypotheses.

Two variants of hypotheses are possible. The first refers to case 1, the second to the other three cases.

First variant of the hypotheses:

H0: The correlation between variables A and B does not differ from zero.

H1: The correlation between variables A and B differs significantly from zero.

Second variant of the hypotheses:

H0: The correlation between hierarchies A and B does not differ from zero.

H1: The correlation between hierarchies A and B differs significantly from zero.

Limitations of the rank correlation coefficient

1. At least 5 observations must be available for each variable. The upper limit of the sample is determined by the available tables of critical values.

2. With a large number of identical (tied) ranks on one or both of the compared variables, Spearman's rank correlation coefficient rs gives coarsened values. Ideally, both correlated series should be sequences of non-coinciding values. If this condition is not met, a correction for tied ranks must be made.

Spearman's rank correlation coefficient is calculated by the formula:

rs = 1 − 6Σd² / (N(N² − 1)),

where d is the difference between the ranks of each observation on the two variables and N is the number of ranked values.

If both compared rank series contain groups of identical ranks, then before calculating the rank correlation coefficient it is necessary to introduce corrections for tied ranks, Ta and Tb:

Ta = Σ(a³ − a) / 12,

Tb = Σ(b³ − b) / 12,

where a is the size of each group of identical ranks in rank series A, and b is the size of each group of identical ranks in rank series B.

To calculate the empirical value of rs with tied ranks, the formula becomes:

rs = 1 − 6(Σd² + Ta + Tb) / (N(N² − 1)).
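A sketch of the tie-corrected calculation in Python (the data are made up; note that scipy's `spearmanr`, which computes Pearson on midranks, gives a slightly different value than this textbook additive correction):

```python
from collections import Counter
from scipy.stats import rankdata, spearmanr

def spearman_tie_corrected(a, b):
    """Textbook formula with the additive corrections Ta, Tb for tied ranks."""
    n = len(a)
    ra, rb = rankdata(a), rankdata(b)          # midranks for tied values
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))

    def tie_term(values):
        # sum over groups of identical values: (size^3 - size) / 12
        return sum(c ** 3 - c for c in Counter(values).values()) / 12

    ta, tb = tie_term(a), tie_term(b)
    return 1 - 6 * (d2 + ta + tb) / (n * (n * n - 1))

a = [1, 2, 2, 4]   # one group of two tied values
b = [1, 2, 3, 4]
rs_textbook = spearman_tie_corrected(a, b)
r_scipy, _ = spearmanr(a, b)
print(rs_textbook, r_scipy)  # the two values are close but not identical
```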

38. Point biserial correlation coefficient.

For correlation in general, see question no. 36.

harchenko-korranaliz.pdf

Let variable X be measured on a strong (interval or ratio) scale, and variable Y on a dichotomous one. The point biserial correlation coefficient rpb is calculated by the formula:

rpb = ((x̄1 − x̄0) / sx) · √(n1·n0 / (n(n − 1)))

Here x̄1 is the mean value of X over objects with the value "one" on Y; x̄0 is the mean value of X over objects with the value "zero" on Y; sx is the standard deviation of all values of X; n1 is the number of objects with "one" on Y, and n0 is the number of objects with "zero" on Y; n = n1 + n0 is the sample size.

The point biserial correlation coefficient can also be calculated using other, equivalent expressions, in which x̄ is the overall mean value of the variable X.

The point biserial correlation coefficient rpb varies from −1 to +1. Its value is zero when objects with "one" on Y have the same mean X as objects with "zero" on Y.

Testing the significance of the point biserial correlation coefficient amounts to testing the null hypothesis H0 that the general correlation coefficient is zero (ρ = 0), which is carried out using Student's t-test. The empirical value

t = rpb · √((n − 2) / (1 − rpb²))

is compared with the critical value tα(df) for the number of degrees of freedom df = n − 2.

If |t| ≤ tα(df), the null hypothesis ρ = 0 is not rejected. The point biserial correlation coefficient differs significantly from zero if the empirical value |t| falls into the critical region, that is, if |t| > tα(n − 2). The reliability of the relationship calculated with the point biserial correlation coefficient rpb can also be determined using the χ² criterion for df = 2 degrees of freedom.
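In practice the coefficient and its significance can be obtained with scipy's `pointbiserialr` (a sketch with made-up data; the returned p-value corresponds to the t-based test described above):

```python
from scipy.stats import pointbiserialr

# Hypothetical example: Y is dichotomous (answer correct = 1 / incorrect = 0),
# X is a continuous test score
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]
x = [10, 12, 9, 18, 11, 20, 17, 19, 16, 13]

r, p = pointbiserialr(y, x)   # the dichotomous variable goes first
print(round(r, 3), round(p, 4))
```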

Point-biserial correlation

A subsequent modification of the product-moment correlation coefficient is the point-biserial r. This statistic shows the relationship between two variables, one of which is supposedly continuous and normally distributed while the other is discrete in the exact sense of the word. The point-biserial correlation coefficient is denoted rpbis. Since in rpbis the dichotomy reflects the true nature of the discrete variable, rather than being artificial as in the case of rbis, its sign is determined arbitrarily. Therefore, for all practical purposes, rpbis is considered in the range from 0.00 to +1.00.

There is also the case when two variables are considered continuous and normally distributed but both are artificially dichotomized, as in biserial correlation. To assess the relationship between such variables, the tetrachoric correlation coefficient rtet is used, which was also derived by Pearson. The basic (exact) formulas and procedures for calculating rtet are quite complex; therefore, in practice, approximations of rtet obtained with shortened procedures and tables are used.


THE POINT BISERIAL CORRELATION COEFFICIENT is the correlation coefficient between two variables, one of which is measured on a dichotomous scale and the other on an interval scale. It is used in classical and modern testology as an indicator of the quality of a test item: its reliability-consistency with the overall test score.

To correlate variables measured on dichotomous and interval scales, the point-biserial correlation coefficient is used.
The point-biserial correlation coefficient is a method of correlation analysis for the relation of two variables, one of which is measured on a nominal scale and takes only two values (for example, men/women, correct/incorrect answer, feature present/absent), while the other is measured on a ratio or interval scale. The coefficient is calculated by the formula:

rpb = ((m1 - m0) / σx) · √(n1·n0 / n²)

(here σx is computed with n in the denominator; an equivalent form uses the standard deviation with n - 1 together with √(n1·n0 / (n(n - 1)))).

Where:
m1 and m0 are the mean values of X for objects with the value 1 or 0 in Y;
σx is the standard deviation of all values of X;
n1 and n0 are the numbers of X values with 1 or 0 in Y;
n is the total number of pairs of values.

Most often this type of correlation coefficient is used to calculate the relationship of test items with the total test scale. This is one type of validity check.
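A plain-Python sketch of the point-biserial computation (the data are illustrative; this variant uses the standard deviation of all X values with n in the denominator, which makes rpb coincide with Pearson's r between X and the 0/1 variable):

```python
import math

def point_biserial(x, y):
    # x: interval-scale values; y: dichotomous codes 0/1.
    # r_pb = (m1 - m0) / sigma_x * sqrt(n1 * n0 / n^2),
    # with sigma_x the standard deviation of all x (n in the denominator).
    n = len(x)
    x1 = [a for a, b in zip(x, y) if b == 1]
    x0 = [a for a, b in zip(x, y) if b == 0]
    m1, m0 = sum(x1) / len(x1), sum(x0) / len(x0)
    mean_x = sum(x) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    return (m1 - m0) / sigma_x * math.sqrt(len(x1) * len(x0) / n ** 2)

# Illustrative data: six test scores and a 0/1 answer on one item.
scores = [5, 6, 7, 1, 2, 3]
item = [1, 1, 1, 0, 0, 0]
print(round(point_biserial(scores, item), 4))  # 0.9258
```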

39. Rank-biserial correlation coefficient.

For correlation in general, see question No. 36, p. 56 (64).

harchenko-korranaliz.pdf p. 28

The rank-biserial correlation coefficient is used when one of the variables (X) is presented on an ordinal scale and the other (Y) on a dichotomous one; it is calculated by the formula

rrb = (2 / n) · (x̄1 - x̄0).

Here x̄1 is the average rank of objects having one in Y; x̄0 is the average rank of objects having zero in Y; n is the sample size.

Testing the significance hypothesis for the rank-biserial correlation coefficient is carried out in the same way as for the point-biserial coefficient, using Student's t-test with rrb substituted for rpb in the formulas.

When one variable is measured on a dichotomous scale (variable X) and the other on a rank scale (variable Y), the rank-biserial correlation coefficient is used. Recall that the variable X, measured on a dichotomous scale, takes only two values (codes), 0 and 1. Let us emphasize in particular that although this coefficient varies in the range from -1 to +1, its sign does not matter for interpreting the results. This is another exception to the general rule.

This coefficient is calculated by the formula:

rrb = (2 / n) · (x̄1 - x̄0)

where x̄1 is the average rank of those elements of the variable Y that correspond to code (feature) 1 in the variable X;

x̄0 is the average rank of those elements of the variable Y that correspond to code (feature) 0 in the variable X;

n is the total number of elements in the variable X.

To apply the rank-biserial correlation coefficient, the following conditions must be met:

1. The variables being compared must be measured on different scales: one (X) on a dichotomous scale, the other (Y) on a rank scale.

2. The number of varying features in the compared variables X and Y must be the same.

3. To assess the reliability of the rank-biserial correlation coefficient, formula (11.9) and the table of critical values for Student's test with k = n - 2 should be used.

http://psystat.at.ua/publ/drugie_vidy_koehfficienta_korreljacii/1-1-0-38

Cases where one of the variables is presented on a dichotomous scale and the other on a rank (ordinal) scale require the rank-biserial correlation coefficient:

rrb = (2 / n) · (m1 - m0)

where:
n is the number of measured objects;
m1 and m0 are the average ranks of objects with 1 or 0 in the second variable.
This coefficient is also used when checking the validity of tests.
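A minimal plain-Python sketch of this formula (the ranks and 0/1 coding are illustrative):

```python
def rank_biserial(ranks, dich):
    # r_rb = (2 / n) * (m1 - m0), where m1 and m0 are the average ranks
    # of the objects coded 1 and 0 in the dichotomous variable.
    n = len(ranks)
    r1 = [r for r, d in zip(ranks, dich) if d == 1]
    r0 = [r for r, d in zip(ranks, dich) if d == 0]
    return 2.0 / n * (sum(r1) / len(r1) - sum(r0) / len(r0))

# Objects coded 1 occupy the top ranks 4..6, objects coded 0 the ranks 1..3,
# so the relationship is as direct as it can be.
print(rank_biserial([4, 5, 6, 1, 2, 3], [1, 1, 1, 0, 0, 0]))  # ≈ 1.0
```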

40. Linear correlation coefficient.

About correlation in general (and about linear correlation in particular), see question No. 36, p. 56 (64).

PEARSON'S CORRELATION COEFFICIENT

r-Pearson (Pearson r) is used to study the relationship between two metric variables measured on the same sample. There are many situations in which its use is appropriate. Does intelligence affect academic performance in the senior years of university? Is the size of an employee's salary related to his goodwill towards colleagues? Does a student's mood affect the success of solving a complex arithmetic problem? To answer such questions, the researcher must measure the two indicators of interest for each member of the sample. The data for studying the relationship are then tabulated, as in the example below.

EXAMPLE 6.1

The table shows an example of the initial measurement data for two indicators of intelligence (verbal and non-verbal) in 20 students of the 8th grade.

The relationship between these variables can be depicted using a scatter diagram (see Figure 6.3). The diagram shows that there is some relationship between the measured indicators: the greater the value of verbal intelligence, the (mainly) the greater the value of non-verbal intelligence.

Before giving the formula for the correlation coefficient, let us trace the logic of its derivation using the data of Example 6.1. The position of each i-th point (the subject with number i) on the scatter diagram relative to the other points (Fig. 6.3) can be given by the magnitudes and signs of the deviations of the corresponding variable values from their means: (xi - Mx) and (yi - My). If the signs of these deviations coincide, this speaks in favor of a positive relationship (larger values of x correspond to larger values of y, or smaller values of x correspond to smaller values of y).

For subject No. 1, the deviations from the mean in x and in y are both positive, and for subject No. 3 both deviations are negative. Consequently, the data of both indicate a positive relationship between the studied traits. Conversely, if the signs of the deviations from the mean in x and in y differ, this indicates a negative relationship between the traits. Thus, for subject No. 4 the deviation from the mean in x is negative and in y positive, and for subject No. 9 vice versa.

Thus, if the product of deviations (xi - Mx) · (yi - My) is positive, the data of the i-th subject indicate a direct (positive) relationship, and if negative, an inverse (negative) relationship. Accordingly, if x and y are mostly directly proportional, most of the products of deviations will be positive, and if they are inversely related, most of the products will be negative. Therefore, the sum of all products of deviations for a given sample can serve as a general indicator of the strength and direction of the relationship:

Σ (xi - Mx)(yi - My)

With a directly proportional relationship between the variables, this value is large and positive: for most of the subjects the deviations coincide in sign (large values of one variable correspond to large values of the other, and vice versa). If x and y are inversely related, then for most subjects large values of one variable correspond to smaller values of the other; the signs of the products will be negative, and the sum of the products as a whole will also be large in absolute value, but negative in sign. If there is no systematic relationship between the variables, the positive terms (products of deviations) will be balanced by negative terms, and the sum of all products of deviations will be close to zero.

So that the sum of the products does not depend on the sample size, it is enough to average it. But we are interested in the measure of the relationship not as a general parameter, but as a computable estimate of it, a statistic. Therefore, as in the dispersion formula, we divide the sum of the products of deviations not by N, but by N - 1. The result is a measure of relationship widely used in physics and the technical sciences, called covariance:

covxy = Σ (xi - Mx)(yi - My) / (N - 1)
In psychology, unlike physics, most variables are measured on arbitrary scales, since psychologists are interested not in the absolute value of an attribute but in the relative position of the subjects in the group. Moreover, covariance is very sensitive to the scale (dispersion) on which the features are measured. To make the measure of relationship independent of the units of measurement of either attribute, it is enough to divide the covariance by the corresponding standard deviations. This is how K. Pearson's correlation coefficient formula was obtained:

rxy = covxy / (σx · σy)

or, after substituting the expressions for σx and σy:

rxy = Σ (xi - Mx)(yi - My) / ((N - 1) · σx · σy)

If the values of both variables are converted to z-values using the formula

z = (x - Mx) / σx

then the r-Pearson correlation coefficient formula looks simpler:

rxy = Σ (zxi · zyi) / (N - 1)


LINEAR CORRELATION is a statistical, non-causal linear relationship between two quantitative variables x and y. It is measured with Pearson's linear correlation coefficient, which is the result of dividing the covariance by the standard deviations of both variables:

rxy = sxy / (sx · sy),

where sxy is the covariance between the variables x and y;

sx, sy are the standard deviations of the variables x and y;

xi, yi are the values of the variables x and y for object number i;

x̄, ȳ are the arithmetic means of the variables x and y.

Pearson's coefficient r can take values from the interval [-1; +1]. The value r = 0 means there is no linear relationship between x and y (but does not rule out a nonlinear statistical relationship). Positive values (r > 0) indicate a direct linear relationship; the closer the value to +1, the stronger the direct statistical relationship. Negative values (r < 0) indicate an inverse linear relationship; the closer the value to -1, the stronger the inverse relationship. The values r = ±1 mean the presence of a complete linear relationship, direct or inverse. In the case of a complete relationship, all points with coordinates (xi, yi) lie on the straight line y = a + bx.
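A minimal sketch of this computation in plain Python (the data and function name are illustrative; the covariance and standard deviations are computed with N - 1 in the denominator, as in the formulas above):

```python
import math

def pearson(x, y):
    # Covariance divided by the product of standard deviations,
    # all computed with N - 1 in the denominator.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0, complete direct relationship
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ -1.0, complete inverse relationship
```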

Pearson's linear correlation coefficient is also used to measure the tightness of the relationship in the linear pair regression model.

41. Correlation matrix and correlation graph.

For correlation in general, see question No. 36, p. 56 (64).

correlation matrix. Often, correlation analysis includes the study of the relationship not of two, but of many variables measured on a quantitative scale on a single sample. In this case, correlations are calculated for each pair of this set of variables. Calculations are usually carried out on a computer, and the result is a correlation matrix.

A correlation matrix is the result of calculating correlations of the same type for each pair of the set of P variables measured on a quantitative scale on one sample.

EXAMPLE

Assume that we are studying the relationships between 5 variables (v1, v2, ..., v5; P = 5), measured on a sample of N = 30 people. Below are the table of initial data and the correlation matrix.

Initial data:

Correlation matrix:

It is easy to see that the correlation matrix is square, symmetric with respect to the main diagonal (since rij = rji), with ones on the main diagonal (since rii = 1).

The correlation matrix is square: the number of rows and columns equals the number of variables. It is symmetric with respect to the main diagonal, since the correlation of x with y equals the correlation of y with x. Ones are located on its main diagonal, since the correlation of a feature with itself equals one. Consequently, not all elements of the correlation matrix are subject to analysis, but only those above or below the main diagonal.

The number of correlation coefficients among P features to be analyzed when studying relationships is determined by the formula P(P - 1)/2. In the example above, the number of such correlation coefficients is 5(5 - 1)/2 = 10.
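A small sketch in plain Python (with three illustrative variables instead of five) demonstrates these properties of the matrix directly:

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Three illustrative variables measured on one sample of five objects.
data = {
    "v1": [1, 2, 3, 4, 5],
    "v2": [2, 1, 4, 3, 5],
    "v3": [5, 4, 2, 3, 1],
}
names = list(data)
P = len(names)
matrix = [[pearson(data[a], data[b]) for b in names] for a in names]

# Ones on the main diagonal and symmetry r_ij = r_ji:
assert all(abs(matrix[i][i] - 1.0) < 1e-9 for i in range(P))
assert all(abs(matrix[i][j] - matrix[j][i]) < 1e-9
           for i in range(P) for j in range(P))
print(P * (P - 1) // 2)  # 3 coefficients actually subject to analysis
```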

The main task of analyzing the correlation matrix is revealing the structure of interrelations of a set of features. This allows visual analysis of correlation pleiades, graphic images of the structure of statistically significant relationships, if there are not very many such relationships (up to 10-15). Another way is to use multivariate methods: multiple regression, factor, or cluster analysis (see the section "Multivariate methods..."). Using factor or cluster analysis, one can identify groupings of variables that are more closely related to each other than to other variables. A combination of these methods is also very effective, for example, if there are many features and they are not homogeneous.

Comparison of correlations is an additional task of analyzing the correlation matrix, and it has two variants. If it is necessary to compare correlations in one of the rows of the correlation matrix (for one of the variables), the comparison method for dependent samples is applied (pp. 148-149). When comparing correlations of the same name calculated for different samples, the comparison method for independent samples is used (pp. 147-148).

Methods for comparing correlations on the diagonals of a correlation matrix (for assessing the stationarity of a random process) and for comparing several correlation matrices obtained for different samples (for their homogeneity) are time-consuming and beyond the scope of this book. You can get acquainted with these methods from the book by G. V. Sukhodolsky.

The problem of the statistical significance of correlations. The problem is that the statistical hypothesis testing procedure assumes a single test carried out on one sample. If the same method is applied many times, even to different variables, the probability of obtaining a result purely by chance increases. In general, if we repeat the same hypothesis testing method k times with respect to different variables or samples, then at the established level α we are guaranteed to receive confirmation of the hypothesis in about α·k of the cases.

Let's assume that a correlation matrix for 15 variables is analyzed, that is, 15(15 - 1)/2 = 105 correlation coefficients are calculated. To test the hypotheses, the level α = 0.05 is set. By testing the hypothesis 105 times, we will get its confirmation about five times (!) regardless of whether the relationship actually exists. Knowing this and having obtained, say, 15 "statistically significant" correlation coefficients, can we tell which of them were obtained by chance and which reflect a real relationship?
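The arithmetic here is easy to check with a plain-Python sketch of the counts used above:

```python
def n_pairs(p):
    # Number of distinct correlation coefficients in a P x P correlation matrix.
    return p * (p - 1) // 2

alpha = 0.05
print(n_pairs(15))                    # 105 coefficients for 15 variables
print(round(n_pairs(15) * alpha, 2))  # 5.25 spuriously "significant" results expected
```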

Strictly speaking, to make a statistical decision it is necessary to reduce the level α by as many times as the number of hypotheses being tested. But this is hardly advisable, since the probability of ignoring a really existing relationship (making a type II error) increases in an unpredictable way.

The correlation matrix alone is not a sufficient basis for statistical conclusions regarding the individual correlation coefficients included in it!

There is only one really convincing way to solve this problem: divide the sample randomly into two parts and take into account only those correlations that are statistically significant in both parts of the sample. An alternative may be the use of multivariate methods (factorial, cluster or multiple regression analysis) - for the selection and subsequent interpretation of groups of statistically significantly related variables.

The problem of missing values. If there are missing values in the data, two options for calculating the correlation matrix are possible: a) listwise deletion of cases (exclude cases listwise); b) pairwise deletion of cases (exclude cases pairwise). With listwise deletion of observations with gaps, the entire row is deleted for an object (subject) that has at least one missing value for one of the variables. This method leads to a "correct" correlation matrix in the sense that all coefficients are calculated from the same set of objects. However, if the missing values are randomly distributed across the variables, this method can lead to there being not a single object left in the data set under consideration (each row will contain at least one missing value). To avoid this situation, another method, called pairwise deletion, is used. This method takes into account only the gaps in each selected pair of variable columns and ignores gaps in the other variables. The correlation for a pair of variables is calculated over those objects where there are no gaps. In many situations, especially when the number of gaps is relatively small, say 10%, and the gaps are fairly randomly distributed, this method does not lead to serious errors. However, sometimes this is not the case. For example, a systematic location of the gaps may "hide" a systematic bias (shift) of the estimate, which is the reason for the difference in the correlation coefficients built on different subsets (for example, for different subgroups of objects). Another problem associated with a correlation matrix calculated with pairwise deletion of gaps arises when this matrix is used in other types of analysis (for example, in multiple regression or factor analysis). These assume that a "correct" correlation matrix is used, with a certain level of consistency and "correspondence" of the various coefficients.
The use of a matrix with "bad" (biased) estimates leads to the program either being unable to analyze such a matrix, or to erroneous results. Therefore, if the pairwise method of eliminating missing data is used, it is necessary to check whether there are systematic patterns in the distribution of gaps.

If the pairwise elimination of missing data does not lead to any systematic shift in the means and variances (standard deviations), then these statistics will be similar to those calculated with the listwise method of removing gaps. If there is a significant difference, there is reason to assume a shift in the estimates. For example, if the mean (or standard deviation) of the values of variable A that were used in calculating its correlation with variable B is much less than the mean (or standard deviation) of the values of variable A that were used in calculating its correlation with variable C, then there is every reason to expect that these two correlations (A-B and A-C) are based on different subsets of data. There will be a shift in the correlations caused by the non-random location of the gaps in the values of the variables.
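The difference between the two strategies can be illustrated with a plain-Python sketch (the data and the use of `None` to mark a gap are illustrative):

```python
def pairwise_n(x, y):
    # Number of complete pairs used for one correlation (pairwise deletion).
    return sum(1 for a, b in zip(x, y) if a is not None and b is not None)

def listwise_n(columns):
    # Number of rows with no gap in any variable (listwise deletion).
    return sum(1 for row in zip(*columns) if all(v is not None for v in row))

A = [1.0, 2.0, None, 4.0, 5.0]
B = [2.0, None, 3.0, 4.0, 6.0]
C = [1.0, 1.0, 2.0, None, 3.0]

print(listwise_n([A, B, C]))  # 2: only two rows survive listwise deletion
print(pairwise_n(A, B))       # 3: the A-B correlation uses rows 1, 4, 5
print(pairwise_n(A, C))       # 3: the A-C correlation uses rows 1, 2, 5
```

Note that the A-B and A-C correlations are computed on different subsets of rows, which is exactly the source of the bias discussed above.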

Analysis of correlation pleiades. After solving the problem of the statistical significance of the elements of the correlation matrix, statistically significant correlations can be represented graphically in the form of a correlation pleiad or pleiades. A correlation pleiad is a figure consisting of vertices and lines connecting them. The vertices correspond to the features and are usually denoted by numbers, the numbers of the variables. The lines correspond to statistically significant relationships and graphically express the sign, and sometimes the p-level of significance, of the relationship.

A correlation pleiad can reflect all the statistically significant relationships of the correlation matrix (in which case it is sometimes called a correlation graph) or only a meaningfully selected part of them (for example, those corresponding to one factor according to the results of factor analysis).

EXAMPLE OF CONSTRUCTING A CORRELATION PLEIAD



  • Spearman's rank correlation coefficient is a non-parametric method used for the statistical study of relationships between phenomena. In this case, the actual degree of parallelism between two quantitative series of the studied features is determined, and the tightness of the established relationship is estimated using a quantitatively expressed coefficient.

    1. History of the development of the rank correlation coefficient

    This criterion was developed and proposed for correlation analysis in 1904 by Charles Edward Spearman, an English psychologist and professor at London and Chesterfield universities.

    2. What is the Spearman ratio used for?

    Spearman's rank correlation coefficient is used to identify and evaluate the tightness of the relationship between two series of compared quantitative indicators. If the ranks of the indicators, ordered by degree of increase or decrease, coincide in most cases (a greater value of one indicator corresponds to a greater value of the other, for example, when comparing a patient's height and body weight), it is concluded that there is a direct correlation. If the ranks of the indicators have the opposite direction (a greater value of one indicator corresponds to a smaller value of the other, for example, when comparing age and heart rate), then one speaks of an inverse relationship between the indicators.

      The Spearman correlation coefficient has the following properties:
    1. The correlation coefficient can take values from minus one to one; at rs = 1 there is a strictly direct relationship, and at rs = -1 a strictly inverse one.
    2. If the correlation coefficient is negative, there is an inverse relationship; if it is positive, a direct relationship.
    3. If the correlation coefficient is zero, the relationship between the quantities is practically absent.
    4. The closer the modulus of the correlation coefficient is to one, the stronger the relationship between the measured quantities.

    3. In what cases can the Spearman coefficient be used?

    Because the coefficient is a method of nonparametric analysis, no check for normal distribution is required.

    The compared indicators can be measured both on a continuous scale (for example, the number of erythrocytes in 1 µl of blood) and on an ordinal one (for example, expert review scores from 1 to 5).

    The effectiveness and quality of the Spearman estimate are reduced if the difference between different values of any of the measured quantities is large enough. It is not recommended to use the Spearman coefficient if the values of a measured quantity are unevenly distributed.

    4. How to calculate Spearman's ratio?

    The calculation of the Spearman rank correlation coefficient includes the following steps: rank the values of each of the two series; find the difference d between the ranks of each pair; square these differences and sum them; substitute the result into the coefficient formula.

    5. How to interpret the value of the Spearman coefficient?

    When using the rank correlation coefficient, the tightness of the relationship between features is conditionally estimated: coefficient values of 0.3 or less indicate a weak relationship; values greater than 0.4 but less than 0.7, a moderate relationship; and values of 0.7 and more, a strong relationship.
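These conventional thresholds can be wrapped in a small helper (a sketch; the text leaves the interval from 0.3 to 0.4 unassigned, and this sketch treats it as "weak"):

```python
def tightness(r):
    # Verbal interpretation of |r| using the conventional thresholds above.
    a = abs(r)
    if a >= 0.7:
        return "high"
    if a > 0.4:
        return "moderate"
    return "weak"

print(tightness(0.25))   # weak
print(tightness(-0.55))  # moderate
print(tightness(0.81))   # high
```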

    The statistical significance of the obtained coefficient is assessed using Student's t-test. If the calculated value of the t-criterion is less than the table value for the given number of degrees of freedom, the observed relationship is not statistically significant. If it is greater, the correlation is considered statistically significant.


    The method for calculating the Spearman rank correlation coefficient is actually very simple to describe. It is the same Pearson correlation coefficient, only calculated not from the measurement results of the random variables themselves, but from their rank values.

    That is,

    rs = rPearson(rank(X), rank(Y)).

    It remains only to figure out what rank values are and why all this is needed.

    If the elements of a variational series are arranged in ascending or descending order, the rank of an element is its number in this ordered series.

    For example, suppose we have the variation series (17, 26, 5, 14, 21). Sort its elements in descending order: (26, 21, 17, 14, 5). 26 has rank 1, 21 has rank 2, and so on. The variation series of rank values will look like this: (3, 1, 5, 4, 2).

    That is, when calculating the Spearman coefficient, the initial variation series are converted into variation series of rank values, after which the Pearson formula is applied to them.

    There is one subtlety: the rank of repeated values is taken as the average of their ranks. That is, for the series (17, 15, 14, 15) the series of rank values will look like (1, 2.5, 4, 2.5), since the first element equal to 15 has rank 2 and the second rank 3, and (2 + 3)/2 = 2.5.
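A small, illustrative ranking helper reproduces both examples above, using descending order as in the text (a quadratic scan, fine for short series):

```python
def ranks(values, descending=True):
    # Rank of an element = its position in the ordered series;
    # tied values receive the average of the positions they occupy.
    ordered = sorted(values, reverse=descending)
    result = []
    for v in values:
        positions = [i + 1 for i, w in enumerate(ordered) if w == v]
        result.append(sum(positions) / len(positions))
    return result

print(ranks([17, 26, 5, 14, 21]))  # [3.0, 1.0, 5.0, 4.0, 2.0]
print(ranks([17, 15, 14, 15]))     # [1.0, 2.5, 4.0, 2.5]
```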

    If there are no repeated values, that is, all values of the rank series are the numbers from 1 to n, Pearson's formula can be simplified to

    rs = 1 - 6·Σd² / (n·(n² - 1)),

    where d is the difference between the ranks of each pair of observations.

    Incidentally, it is this formula that is most often given as the formula for calculating the Spearman coefficient.

    What is the essence of the transition from the values ​​themselves to their rank values?
    And the point is that by examining the correlation of rank values, one can establish how well the dependence of two variables is described by a monotonic function.

    The sign of the coefficient indicates the direction of the relationship between the variables. If the sign is positive, then the Y values ​​tend to increase as the X values ​​increase; if the sign is negative, then the Y values ​​tend to decrease as the X values ​​increase. If the coefficient is 0, then there is no trend. If the coefficient is equal to 1 or -1, then the relationship between X and Y has the form of a monotonic function - that is, with an increase in X, Y also increases, or vice versa, with an increase in X, Y decreases.

    That is, unlike the Pearson correlation coefficient, which can reveal only a linear dependence of one variable on another, the Spearman correlation coefficient can reveal a monotonic dependence where a direct linear relationship is not found.

    Let me explain with an example. Let's assume that we examine the function y=10/x.
    We have the following X and Y measurement results
    {{1,10}, {5,2}, {10,1}, {20,0.5}, {100,0.1}}
    For these data, the Pearson correlation coefficient is -0.4686; that is, the relationship is weak or absent. But the Spearman correlation coefficient is exactly -1, which hints to the researcher that Y has a strictly negative monotonic dependence on X.
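This example is easy to reproduce with a compact plain-Python sketch (the rank helper assumes no tied values, which holds for these data):

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def to_ranks(values):
    # Ascending ranks; valid here because the example has no tied values.
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

x = [1, 5, 10, 20, 100]
y = [10, 2, 1, 0.5, 0.1]  # y = 10 / x

print(round(pearson(x, y), 4))                      # -0.4686
print(round(pearson(to_ranks(x), to_ranks(y)), 4))  # -1.0
```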

    The discipline of higher mathematics is met with rejection by some, since truly not everyone is given to understand it. But those who are lucky enough to study this subject and solve problems using various equations and coefficients can boast of almost complete knowledge of it. Psychological science has not only a humanitarian orientation but also certain formulas and methods for the mathematical verification of a hypothesis put forward in the course of research. Various coefficients are applied for this.

    Spearman's correlation coefficient

    This is a common measure for determining the tightness of the relationship between any two features. The coefficient is also called a non-parametric method. It shows the statistics of a relationship. That is, we know, for example, that in a child aggression and irritability are interconnected, and Spearman's rank correlation coefficient shows the statistical, mathematical relationship of these two features.

    How is the ranking coefficient calculated?

    Naturally, all mathematical definitions and quantities have formulas by which they are calculated. The Spearman correlation coefficient has one too. Its formula is the following:

    rs = 1 - 6·Σd² / (n·(n² - 1))

    At first glance the formula is not entirely clear, but on closer inspection everything is very easy to calculate:

    • n is the number of features or indicators being ranked.
    • d is the difference between the two ranks corresponding to the two specific variables of each subject.
    • Σd² is the sum of the squared differences of the feature ranks, the squares being calculated separately for each rank.

    Scope of the mathematical measure of connection

    To apply the rank coefficient, the quantitative data of the feature must be ranked, that is, assigned a number depending on the place the feature occupies and on its value. It has been shown that two series of features expressed in numerical form run somewhat parallel to each other. Spearman's rank correlation coefficient determines the degree of this parallelism, the tightness of the relationship between the features.

    To calculate and determine the relationship of features using this coefficient, you need to perform the following actions:

    1. Each value of any subject or phenomenon is assigned a number in order, a rank. It can correspond to the values of the phenomenon in ascending or descending order.
    2. Next, the ranks of the values of the features of the two quantitative series are compared in order to determine the differences between them.
    3. In a separate column of the table, the square of each difference obtained is written, and the results are summed below.
    4. After these steps, the formula by which the Spearman correlation coefficient is calculated is applied.
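The four steps above can be sketched in plain Python for two small illustrative series (the values are invented for the example; there are no tied ranks, so the simplified formula applies):

```python
# Step 1: assign ranks to each series (ascending; no tied values here).
def to_ranks(values):
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

x = [17, 26, 5, 14, 21]
y = [7, 9, 2, 10, 8]
rx, ry = to_ranks(x), to_ranks(y)

# Step 2: differences between the paired ranks.
d = [a - b for a, b in zip(rx, ry)]

# Step 3: square the differences and sum them.
d2_sum = sum(di ** 2 for di in d)

# Step 4: apply the formula rs = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
n = len(x)
rs = 1 - 6 * d2_sum / (n * (n ** 2 - 1))
print(rs)  # 0.4
```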

    Properties of the correlation coefficient

    The main properties of the Spearman coefficient include the following:

    • It takes values between -1 and 1.
    • Its sign indicates the direction of the relationship (direct or inverse), not its strength.
    • The tightness of the relationship is determined by the principle: the higher the absolute value, the closer the relationship.

    How to check the received value?

    To check the relationship between features, you must perform the following actions:

    1. The null hypothesis (H0), which is the main one, is put forward, and then an alternative to it (H1) is formulated. The first hypothesis states that the Spearman correlation coefficient equals 0, which means there is no relationship. The second, on the contrary, states that the coefficient is not equal to 0, so there is a relationship.
    2. The next step is to find the observed value of the criterion. It is found by the basic formula of the Spearman coefficient.
    3. Next, the critical values of the given criterion are found. This can be done only with the help of a special table, which displays the values for the given indicators: the significance level (α) and the sample size (n).
    4. Now the two obtained values must be compared: the observed value and the critical one. To do this, a critical region is built: a straight line is drawn, and the points of the critical value of the coefficient are marked on it with the "-" and "+" signs. To the left and to the right of the critical values lie the critical regions; between them lies the region of acceptance of the hypothesis.
    5. After that, a conclusion is made about the tightness of the relationship between the two features.
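Steps 2-4 can be sketched in plain Python using the t-statistic form of the test (the values rs = 0.5, n = 20 and the table value 2.101 for α = 0.05, df = 18 are illustrative):

```python
import math

def spearman_t(rs, n):
    # Observed value of Student's criterion: t = rs * sqrt((n - 2) / (1 - rs^2)).
    return rs * math.sqrt((n - 2) / (1 - rs ** 2))

rs, n = 0.5, 20        # illustrative coefficient and sample size
t = spearman_t(rs, n)
t_critical = 2.101     # table value for alpha = 0.05, df = n - 2 = 18, two-sided
print(round(t, 3), t > t_critical)  # 2.449 True: the relationship is significant
```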

    Where is the best place to use this value?

    The very first science where this coefficient was actively used was psychology. After all, this is a science not based on numbers; however, to prove any important hypotheses regarding the development of relationships, people's character traits, or students' knowledge, statistical confirmation of the conclusions is required. The coefficient is also used in economics, in particular in foreign exchange transactions. Spearman's rank correlation coefficient is very convenient in this area of application because the assessment is made independently of the distribution of the variables, since they are replaced by rank numbers. The Spearman coefficient is actively used in banking. Sociology, political science, demography, and other sciences also use it in their research. Results are obtained quickly and as accurately as possible.

    Spearman's correlation coefficient is convenient and quick to use in Excel, which has special functions that help you obtain the necessary values rapidly.

    What other correlation coefficients exist?

    In addition to the Spearman correlation coefficient, there are various other correlation coefficients that allow you to measure and evaluate qualitative features, the relationship between quantitative features, and the tightness of the relationship between features presented on a rank scale. These are coefficients such as the biserial, rank-biserial, contingency, and association coefficients. The Spearman coefficient shows the tightness of a relationship very accurately, unlike other methods of its mathematical determination.
