Reliability and statistical significance. Level of statistical significance

25.09.2019

Hypothesis testing is carried out using statistical analysis. Statistical significance is assessed with the P-value, which is the probability of obtaining the observed result under the assumption that some statement (the null hypothesis) is true. If the P-value is less than the chosen level of statistical significance (usually 0.05), the experimenter may reject the null hypothesis and consider the alternative hypothesis. Student's t-test lets you calculate the P-value and determine significance for two data sets.

Steps

Part 1

Setting up an experiment

    Define your hypothesis. The first step in evaluating statistical significance is to choose the question you want answered and formulate a hypothesis. A hypothesis is a statement about experimental data, their distribution and properties. For any experiment, there is both a null and an alternative hypothesis. Generally speaking, you will have to compare two sets of data to determine if they are similar or different.

    • The null hypothesis (H0) usually states that there is no difference between the two data sets. For example: students who read the material before class do not get higher marks.
    • The alternative hypothesis (Ha) is the opposite of the null hypothesis and is the statement to be confirmed by the experimental data. For example: students who read the material before class do get higher marks.
  1. Set the significance level to determine how much the data distribution must differ from the expected one for the result to be considered significant. The significance level (also called the α-level) is the threshold you define for statistical significance. If the P-value is less than or equal to the significance level, the data are considered statistically significant.

    • As a rule, the significance level (the value of α) is taken to be 0.05, in which case the probability of finding a merely random difference between the data sets is only 5%.
    • The stricter the significance level (that is, the smaller α), the more reliable the results.
    • If you want more reliable results, lower the significance level to 0.01. Lower cutoffs are typically used in manufacturing when it is necessary to detect defects in products, since high confidence is required to ensure that every part works as expected.
    • For most hypothesis-testing experiments, a significance level of 0.05 is sufficient.
  2. Decide whether you will use a one-tailed or a two-tailed test. One of the assumptions of Student's t-test is that the data are normally distributed. The normal distribution is a bell-shaped curve with most of the results concentrated in the middle. Student's t-test is a mathematical procedure for checking whether the data fall outside the normal distribution (above it, below it, or in the "tails" of the curve).

    • If you're not sure if the data is above or below the control group, use a two-tailed test. This will allow you to determine the significance in both directions.
    • If you know in which direction the data might fall outside of the normal distribution, use a one-tailed test. In the example above, we expect students' grades to go up, so a one-tailed test can be used.
  3. Determine the sample size using statistical power. The statistical power of a study is the probability that, with a given sample size, the expected result will be detected. A common power threshold is 80% (power = 1 − β, so β = 0.2). Power analysis without any prior data can be tricky, because it requires some information about the expected means and standard deviations of each data set. Use an online statistical power calculator to determine the optimal sample size for your data.

    • Typically, researchers conduct a small pilot study that provides data for power analysis and determines the sample size needed for a larger and more complete study.
    • If you cannot conduct a pilot study, estimate the possible means from the literature and from other researchers' results. This may help you determine the optimal sample size.
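As a rough sketch of such a power analysis, the sample size per group for a two-sample t-test can be approximated with the normal-approximation formula n = 2·((z₁₋α/₂ + z₁₋β)·σ/Δ)², using only Python's standard library. The mean difference Δ = 3 and standard deviation σ = 4.5 below are hypothetical pilot estimates, not values from the text:

```python
import math
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sample, two-tailed t-test,
    using the normal approximation n = 2 * ((z_a + z_b) * sigma / delta)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# Hypothetical pilot estimates: expected mean difference of 3 points,
# standard deviation of about 4.5 points.
print(sample_size_per_group(delta=3, sigma=4.5))  # 36 students per group
```

Note how the required n grows quickly as the expected difference shrinks, which is why a pilot estimate of the means matters.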

    Part 2

    Compute Standard Deviation
    1. Write down the formula for the standard deviation. The standard deviation shows how spread out the data are: it lets you judge how closely the values in a particular sample cluster together. At first glance the formula seems complicated, but the explanations below will help you understand it. The formula is: s = √(∑(xᵢ − µ)² / (N − 1)).

      • s is the standard deviation;
      • the ∑ sign means that all the terms computed for the sample should be added up;
      • xᵢ is the i-th value, that is, an individual result;
      • µ is the mean value for the group;
      • N is the total number of data points in the sample.
    2. Find the mean of each group. To calculate the standard deviation, you must first find the mean for each study group. The mean is denoted by the Greek letter µ (mu). To find it, simply add up all the values and divide by the number of data points (the sample size).

      • For example, to find the average grade in a group of students who study material before class, consider a small data set. For simplicity, we use a set of five points: 90, 91, 85, 83 and 94.
      • Let's add all the values ​​together: 90 + 91 + 85 + 83 + 94 = 443.
      • Divide the sum by the number of values, N = 5: 443/5 = 88.6.
      • Thus, the average value for this group is 88.6.
    3. Subtract the mean from each value. The next step is to compute the differences (xᵢ − µ). To do this, subtract the mean from each value obtained. In our example, we need to find five differences:

      • (90 - 88.6), (91 - 88.6), (85 - 88.6), (83 - 88.6) and (94 - 88.6).
      • As a result, we get the following values: 1.4, 2.4, -3.6, -5.6 and 5.4.
    4. Square each value obtained and add them together. Each of the quantities just found should be squared. This step will remove all negative values. If after this step you have negative numbers, then you forgot to square them.

      • For our example, we get 1.96, 5.76, 12.96, 31.36 and 29.16.
      • We add the obtained values: 1.96 + 5.76 + 12.96 + 31.36 + 29.16 = 81.2.
    5. Divide by the sample size minus 1. In the formula, the sum is divided by N − 1 because we are working with a sample drawn from the student population rather than with the entire population.

      • Subtract: N - 1 = 5 - 1 = 4
      • Divide: 81.2/4 = 20.3
    6. Take the square root. After dividing the sum by the sample size minus one, take the square root of the result. This is the last step in calculating the standard deviation. There are statistical programs that perform all the necessary calculations once the raw data are entered.

      • In our example, the standard deviation of the marks of those students who read the material before class is s = √20.3 = 4.51.
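The whole hand calculation above can be checked with Python's standard library, whose statistics.stdev function uses the same N − 1 formula:

```python
from statistics import mean, stdev

grades = [90, 91, 85, 83, 94]  # grades of students who read before class

mu = mean(grades)   # 88.6
s = stdev(grades)   # sample standard deviation, divides by N - 1

print(mu)           # 88.6
print(round(s, 2))  # 4.51, matching the hand calculation above
```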

      Part 3

      Determine Significance
      1. Calculate the standard error of the difference between the two groups. Up to this step we have considered only one group of data. To compare two groups, you obviously need data for both. Calculate the standard deviation of the second group, and then find the standard error of the difference between the two experimental groups. It is calculated using the following formula: s_d = √((s₁²/N₁) + (s₂²/N₂)).
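A minimal sketch of the two-group comparison, assuming the SciPy library is available; the first sample is the grade set from Part 2, while the control-group numbers are made up for illustration:

```python
# Requires scipy (pip install scipy); the control-group data are invented.
from scipy import stats

read_before = [90, 91, 85, 83, 94]  # group from the example above
control = [78, 80, 82, 76, 84]      # hypothetical control group

# Welch's t-test (equal_var=False) does not assume equal variances.
t_stat, p_value = stats.ttest_ind(read_before, control, equal_var=False)

print(round(t_stat, 2))  # 3.49
if p_value <= 0.05:
    print("Reject the null hypothesis: the difference is significant")
```

If the direction of the difference is known in advance (a one-tailed test), recent SciPy versions accept an `alternative='greater'` argument to `ttest_ind`.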

In the tables of statistical results in term papers, graduate theses, and master's theses in psychology, there is always an indicator "p".

For example, suppose that, in accordance with the research objectives, differences in the level of meaningfulness of life between adolescent boys and girls were calculated.

Indicator | Mean: boys (20 people) | Mean: girls (5 people) | Mann-Whitney U test | Level of statistical significance (p)
Goals | 28.9 | 35.2 | 17.5 | 0.027*
Process | 30.1 | 32.0 | 38.5 | 0.435
Result | 25.2 | 29.0 | 29.5 | 0.164
Locus of control - "I" | 20.3 | 23.6 | n/a | 0.067
Locus of control - "Life" | 30.4 | 33.8 | 27.5 | 0.126
Meaningfulness of life | 98.9 | 111.2 | n/a | 0.103

* - differences are statistically significant (p ≤ 0.05)

The right column shows the value of "p", and it is from this value that one can determine whether the differences in meaningfulness of life between boys and girls are significant or not. The rule is simple:

  • If the level of statistical significance "p" is less than or equal to 0.05, then we conclude that the differences are significant. In the above table, the differences between boys and girls are significant in relation to the indicator "Goals" - meaningfulness of life in the future. In girls, this indicator is statistically significantly higher than in boys.
  • If the level of statistical significance "p" is greater than 0.05, then it is concluded that the differences are not significant. In the above table, the differences between boys and girls are not significant for all other indicators, except for the first one.
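The decision rule can be expressed in a few lines of Python; the p-values are taken from the table above, and the 0.05 cutoff is the conventional one discussed in the text:

```python
# p-values from the table above (Mann-Whitney U test, boys vs girls)
p_values = {
    "Goals": 0.027,
    "Process": 0.435,
    "Result": 0.164,
    'Locus of control - "I"': 0.067,
    'Locus of control - "Life"': 0.126,
    "Meaningfulness of life": 0.103,
}

ALPHA = 0.05  # conventional significance cutoff

for indicator, p in p_values.items():
    verdict = "significant" if p <= ALPHA else "not significant"
    print(f"{indicator}: p = {p} -> {verdict}")
```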

Where does the level of statistical significance "p" come from

The level of statistical significance is calculated by the statistical program together with the statistical criterion itself. In these programs, you can also set a critical cutoff for the significance level, and the corresponding indicators will be highlighted by the program.

For example, in the STATISTICA program, when calculating correlations, you can set the p limit, for example, 0.05, and all statistically significant relationships will be highlighted in red.

If the calculation of the statistical criterion is carried out manually, then the significance level "p" is determined by comparing the value of the obtained criterion with the critical value.

What does the level of statistical significance "p" show

All statistical calculations are approximate, and the degree of this approximation is described by "p". The significance level is written as a decimal, for example 0.023 or 0.965. Multiplying this number by 100 gives p as a percentage: 2.3% and 96.5%. These percentages reflect the probability that our assumption of a relationship, for example between aggressiveness and anxiety, is wrong.

That is, a correlation coefficient of 0.58 between aggressiveness and anxiety obtained at a statistical significance level of 0.05 corresponds to a 5% probability of error. What exactly does this mean?

The correlation we found means that the following pattern is observed in our sample: the higher the aggressiveness, the higher the anxiety. That is, if we take two teenagers, and one has higher anxiety than the other, then, knowing about the positive correlation, we can say that this teenager will also have higher aggressiveness. But since everything in statistics is approximate, in stating this we accept that we may be mistaken, and the probability of error is 5%. That is, having made 20 such predictions of aggressiveness from anxiety in this group of adolescents, we can expect to be wrong about once.

Which level of statistical significance is better: 0.01 or 0.05

The level of statistical significance reflects the probability of error. Therefore, the result at p=0.01 is more accurate than at p=0.05.

In psychological research, two acceptable levels of statistical significance of the results are conventionally used:

p = 0.01 - high reliability of the result of a comparative analysis or analysis of relationships;

p = 0.05 - sufficient accuracy.

I hope this article will help you write a psychology paper on your own.

The level of significance in statistics is an important indicator that reflects the degree of confidence in the accuracy and truth of the obtained (predicted) data. The concept is widely used in various fields, from conducting sociological research to the statistical testing of scientific hypotheses.

Definition

The level of statistical significance (or a statistically significant result) shows the probability that the studied indicators arose by chance. The overall statistical significance of a phenomenon is expressed by the p-value (p-level). In any experiment or observation, there is a possibility that the data obtained arose through sampling errors. This is especially true for sociology.

That is, a statistically significant value is one whose probability of random occurrence is extremely small or tends to the extreme. The extreme in this context is the degree of deviation of the statistic from the null hypothesis (the hypothesis that is tested for consistency with the sample data). In scientific practice, the significance level is chosen before data collection and, as a rule, is set at 0.05 (5%). For systems where exact values are critical, it can be 0.01 (1%) or less.

Background

The concept of significance level was introduced by the British statistician and geneticist Ronald Fisher in 1925 when he was developing a technique for testing statistical hypotheses. When analyzing any process, there is a certain probability of certain phenomena. Difficulties arise when working with small (or not obvious) percentages of probabilities that fall under the concept of "measurement error".

When working with statistics that were not specific enough to test, scientists faced the problem of the null hypothesis, which "prevents" operating with small values. Fisher proposed setting the probability cutoff for such systems at 5% (0.05) as a convenient threshold that allows the null hypothesis to be rejected in calculations.

Introduction of a fixed coefficient

In 1933, the scientists Jerzy Neyman and Egon Pearson recommended in their papers that a certain significance level be set in advance (before data collection). Examples of the use of these rules are clearly visible during elections. Suppose there are two candidates, one of whom is very popular and the other little known. It is obvious that the first candidate will win the election, while the chances of the second tend to zero. Tend to zero, but do not equal it: there is always the possibility of force majeure, sensational information, or unexpected decisions that could change the predicted election results.

Neyman and Pearson agreed that Fisher's proposed significance level of 0.05 (denoted by the symbol α) is the most convenient. However, Fisher himself opposed fixing this value in 1956. He believed that the level of α should be set according to the specific circumstances. For example, in particle physics it is 0.01.

p-value

The term p-value was first used by Brownlee in 1960. The p-level (p-value) is an indicator inversely related to the reliability of the results: the higher the p-value, the lower the confidence in the relationship between variables found in the sample.

This value reflects the probability of error in interpreting the results. Assume p-value = 0.05 (1/20). This shows a five percent chance that the relationship between variables found in the sample is merely a random feature of the sample. That is, if the relationship is absent in the population, then with repeated similar experiments, on average one study in twenty will show the same or a greater relationship between the variables. The p-level is often regarded as a "margin" for the error rate.

By the way, the p-value may not reflect the real relationship between the variables; it only shows a certain average value within the assumptions made. In particular, the final analysis of the data also depends on the chosen cutoff: with p-level = 0.05 you will get one set of results, and with a cutoff of 0.01, another.
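The meaning of a 5% error rate can be illustrated with a small simulation, assuming SciPy is available: when two samples are repeatedly drawn from the same population (so the null hypothesis is true), roughly 5% of t-tests come out "significant" at p ≤ 0.05 purely by chance. The population parameters and trial count here are arbitrary choices:

```python
# Simulation: with the null hypothesis true, about 5% of tests yield
# p <= 0.05 by chance alone. Requires scipy.
import random
from scipy import stats

random.seed(42)
trials = 2000
false_alarms = 0

for _ in range(trials):
    a = [random.gauss(100, 15) for _ in range(20)]
    b = [random.gauss(100, 15) for _ in range(20)]  # same population as a
    _, p = stats.ttest_ind(a, b)
    if p <= 0.05:
        false_alarms += 1

print(false_alarms / trials)  # close to 0.05
```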

Testing statistical hypotheses

The level of statistical significance is especially important when testing hypotheses. For example, when calculating a two-sided test, the rejection area is divided equally at both ends of the sampling distribution (relative to the zero coordinate) and the truth of the obtained data is calculated.

Suppose, when monitoring a certain process (phenomenon), it turned out that new statistical information indicates small changes relative to previous values. At the same time, the discrepancies in the results are small, not obvious, but important for the study. The specialist faces a dilemma: do the changes really occur or are they sampling errors (measurement inaccuracy)?

In this case, the null hypothesis is either retained or rejected (either everything is attributed to error, or the change in the system is recognized as a fait accompli). The decision is based on comparing the overall statistical significance (p-value) with the significance level (α). If p-level < α, the null hypothesis is rejected; the smaller the p-value, the more significant the test statistic.

Used values

The level of significance depends on the analyzed material. In practice, the following fixed values ​​are used:

  • α = 0.1 (or 10%);
  • α = 0.05 (or 5%);
  • α = 0.01 (or 1%);
  • α = 0.001 (or 0.1%).

The more accurate the calculations are required, the smaller the coefficient α is used. Naturally, statistical forecasts in physics, chemistry, pharmaceuticals, and genetics require greater accuracy than in political science and sociology.

Significance thresholds in specific areas

In high-precision fields such as particle physics and manufacturing, statistical significance is often expressed as a number of standard deviations (denoted by the letter sigma, σ) relative to a normal probability distribution (the Gaussian distribution). σ is a statistic that measures the dispersion of a quantity's values around its expected value, and it is used in plotting the probabilities of events.

Depending on the field of knowledge, the required σ varies greatly. For example, in establishing the existence of the Higgs boson, the threshold was set at five sigma (σ = 5), which corresponds to a p-value of about 1 in 3.5 million.

Efficiency

It must be kept in mind that the coefficients α and p-value are not exact characteristics. Whatever the significance level of the phenomenon under study, it is not an unconditional basis for accepting the hypothesis. For example, the smaller the value of α, the greater the confidence that the established hypothesis is meaningful; however, there remains a risk of error, which reduces the statistical power (significance) of the study.

Researchers who focus exclusively on statistically significant results may draw erroneous conclusions, and their work is difficult to double-check because it rests on assumptions (which is, in effect, what the chosen α and p-value cutoffs are). Therefore, it is always recommended to determine, alongside statistical significance, another indicator: the effect size. Effect size is a quantitative measure of the strength of an effect.
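One common effect-size measure is Cohen's d, the difference between two means divided by their pooled standard deviation. The sketch below uses the illustrative grade samples from earlier; the control-group numbers are hypothetical:

```python
# Cohen's d: a common effect-size measure for the difference of two means.
from statistics import mean, stdev

def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    # pooled variance of the two groups (weights by degrees of freedom)
    pooled_var = ((n1 - 1) * stdev(group1) ** 2 +
                  (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / pooled_var ** 0.5

read_before = [90, 91, 85, 83, 94]  # grade sample from the example above
control = [78, 80, 82, 76, 84]      # hypothetical control group

print(round(cohens_d(read_before, control), 2))  # a large effect (d > 0.8)
```

Unlike the p-value, d does not shrink toward "significance" just because the sample is large, which is why the two indicators complement each other.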

In any scientific or practical experimental situation (a survey, for example), researchers cannot study all people (the general population) but only a certain sample. Even if we study a relatively small group, such as people suffering from a certain disease, it is highly unlikely that we have the resources or the need to test every patient. Instead, a sample of the population is usually tested because it is more convenient and takes less time. In that case, how do we know that the results obtained from the sample represent the whole group? Or, to use professional terminology, can we be sure that our study correctly describes the entire population from which the sample was drawn?

To answer this question, it is necessary to determine the statistical significance of the test results. Statistical significance (significance level, abbreviated Sig.), or p-significance level (p-level), is the probability that a given result correctly represents the population from which the sample was drawn. Note that this is only a probability: it is impossible to say with absolute certainty that a given study correctly describes the entire population. At best, the significance level allows one to conclude that this is highly probable. Thus, the following question inevitably arises: how high must the significance level be for a result to be considered a correct characterization of the population?

For example, at what level of probability are you willing to say that the odds are good enough to take a risk? 10 out of 100? 50 out of 100? And what if the probability is higher: 90 out of 100, 95 out of 100, or 98 out of 100? For situations involving risk, this choice is quite problematic, because it depends on a person's individual characteristics.

In psychology, it is traditionally believed that a 95 or more chance out of 100 means that the probability of the correctness of the results is high enough to be generalized to the entire population. This figure was established in the process of scientific and practical activity - there is no law according to which it should be chosen as a guideline (and indeed, in other sciences, sometimes other values ​​​​of the significance level are chosen).

In psychology, this probability is handled in a somewhat unusual way. Instead of the probability that the sample represents the population, one takes the probability that the sample does not represent the population. In other words, it is the probability that the discovered relationship or difference is random and not a property of the population. Thus, instead of saying that a study's results are correct with 95 chances out of 100, psychologists say there are 5 chances out of 100 that the results are wrong (likewise, 40 chances out of 100 in favor of the results being correct means 60 chances out of 100 in favor of their being wrong). The probability is sometimes expressed as a percentage, but more often written as a decimal fraction. For example, 10 chances out of 100 is written as 0.1; 5 out of 100 as 0.05; 1 out of 100 as 0.01. With this notation, the cutoff value is 0.05. For a result to be considered correct, its significance level must be below this number (remember, this is the probability that the result incorrectly describes the population). To finish with terminology, we add that the "probability of a wrong result" (more correctly called the significance level) is usually denoted by the Latin letter p. A description of experimental results usually includes a summary conclusion such as "the results were significant at the p < 0.05 level (i.e., less than 5%)".

Thus, the significance level (p) indicates the probability that the results do not represent the population. By tradition in psychology, results are considered to reliably reflect the overall picture if p is less than 0.05 (i.e., 5%). However, this is only a probabilistic statement, not an unconditional guarantee, and in some cases the conclusion may be incorrect. In fact, we can calculate how often this happens just by looking at the significance level: at a level of 0.05, the results will be wrong in 5 out of 100 cases. At first glance this does not seem too often, but 5 chances out of 100 is the same as 1 out of 20: in one out of every twenty cases the result will turn out to be wrong. Such odds are not especially favorable, and researchers should beware of committing Type I errors. This is the name of the error that occurs when researchers think they have found a real effect when in fact there is none. The opposite error, when researchers conclude that they have not found an effect when in fact there is one, is called a Type II error.

These errors arise because the possibility of incorrect statistical analysis cannot be ruled out. The probability of error depends on the level of statistical significance of the results. We have already noted that in order for the result to be considered correct, the significance level must be below 0.05. Of course, some results are lower, and it's not uncommon to find results as low as 0.001 (a value of 0.001 indicates a 1 in 1000 chance of being wrong). The smaller the p value, the stronger our confidence in the correctness of the results.

Table 7.2 shows the traditional interpretation of significance levels with respect to the possibility of statistical inference and the justification of a decision about the presence of a relationship (differences).

Table 7.2

Traditional Interpretation of Significance Levels Used in Psychology

Based on the experience of practical research, it is recommended that, in order to avoid Type I and Type II errors, important conclusions about the presence of differences (relationships) be drawn with reference to the significance level p.

A statistical test (Statistical Test) is a tool for determining the level of statistical significance: a decision rule that ensures that a true hypothesis is accepted and a false one rejected with high probability.

A statistical criterion designates both the method of calculating a certain number and that number itself. All criteria are used with one main goal: to determine the significance level of the data they analyze (i.e., the likelihood that the data reflect a true effect that correctly represents the population from which the sample was drawn).

Some criteria can only be used for normally distributed data (and if the feature is measured on an interval scale) - these criteria are usually called parametric. With the help of other criteria, you can analyze data with almost any distribution law - they are called nonparametric.

Parametric criteria - criteria that include distribution parameters in the calculation formula, i.e. means and variances (Student's t-test, Fisher's F-test, etc.).

Non-parametric criteria are criteria that do not include distribution parameters in their calculation formula and are based on operating with frequencies or ranks (Rosenbaum's Q test, the Mann-Whitney U test, etc.).

For example, when we say that the significance of differences was determined by Student's t-test, we mean that the Student's t-test method was used to calculate the empirical value, which is then compared with the tabular (critical) value.

According to the ratio of the empirical (we calculated) and critical values ​​of the criterion (table), we can judge whether our hypothesis is confirmed or refuted. In most cases, in order for us to recognize the differences as significant, it is necessary that the empirical value of the criterion exceed the critical one, although there are criteria (for example, the Mann-Whitney test or the sign test) in which we must adhere to the opposite rule.

In some cases, the calculation formula of the criterion includes the number of observations in the study sample, denoted as n. Using a special table, we determine what level of statistical significance corresponds to a given empirical value. In most cases, the same empirical value of a criterion may turn out to be significant or insignificant depending on the number of observations in the sample (n) or on the so-called number of degrees of freedom, denoted as v or df (sometimes d).

Knowing n or the number of degrees of freedom, we can use special tables (the main ones are given in Appendix 5) to determine the critical values of the criterion and compare the obtained empirical value with them. It is usually written like this: "at n = 22, the critical value of the Student's criterion is tSt = 2.07" or "at v (df) = 2, the critical value of the Student's criterion is tSt = 4.30".
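Instead of looking critical values up in a table, they can be computed, assuming SciPy is available. For a two-tailed test at α = 0.05, the critical value is the 97.5th percentile of the t distribution for the given degrees of freedom:

```python
# Computing critical values of Student's t instead of using a table.
# Requires scipy. Two-tailed alpha = 0.05 -> 97.5th percentile.
from scipy.stats import t

print(round(t.ppf(0.975, df=22), 2))  # 2.07, as quoted for 22 degrees of freedom
print(round(t.ppf(0.975, df=2), 2))   # 4.3, as quoted for v = 2
```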

Usually, however, preference is given to parametric criteria, and we adhere to this position: they are considered more reliable and can provide more information and deeper analysis. As for the complexity of the mathematical calculations, it disappears when computer programs are used (though some other, quite surmountable, complexities appear).

In this textbook we do not deal in detail with the problem of statistical hypotheses (the null hypothesis H0 and the alternative H1) or statistical decisions, since psychology students study this separately in the discipline "Mathematical Methods in Psychology". In addition, it should be noted that when preparing a research report (a term paper, thesis, or publication), statistical hypotheses and statistical decisions are, as a rule, not stated. Usually, when describing the results, the criterion is indicated, the necessary descriptive statistics are given (means, sigmas, correlation coefficients, etc.), together with the empirical values of the criteria, the degrees of freedom, and, necessarily, the p-significance level. Then a meaningful conclusion is formulated with respect to the hypothesis being tested, indicating (usually in the form of an inequality) the significance level achieved or not achieved.

Statistical significance, or the p-significance level, is the main result of a statistical hypothesis test. In technical language, it is the probability of obtaining the given result of a sample study, provided that the null hypothesis is in fact true for the general population - that is, that there is no relationship. In other words, it is the probability that the detected relationship is random and not a property of the population. It is statistical significance, the p-level, that gives a quantitative assessment of the reliability of a relationship: the lower this probability, the more reliable the relationship.

Suppose that, when comparing two sample means, a statistical significance level of p = 0.05 was obtained. This means that the test of the statistical hypothesis of equal means in the general population showed that, if that hypothesis is true, the probability of the detected differences arising at random is no more than 5%. In other words, if two samples were repeatedly drawn from the same general population, then in 1 out of 20 cases the same or a greater difference between the means of these samples would be found. That is, there is a 5% chance that the detected differences are random and are not a property of the population.

With respect to the scientific hypothesis, the level of statistical significance is a quantitative indicator of the degree of distrust in the conclusion that a relationship exists, calculated from the results of a sample-based, empirical test of this hypothesis. The smaller the p-value, the higher the statistical significance of the result confirming the scientific hypothesis.

It is useful to know what influences the significance level. Other things being equal, the significance level is higher (the p-value is lower) if:

  • the magnitude of the relationship (difference) is greater;
  • the variability of the trait(s) is smaller;
  • the sample size(s) is larger.

One-tailed and two-tailed significance tests

If the purpose of the study is to reveal a difference between the parameters of two general populations that correspond to its different natural conditions (living conditions, age of the subjects, etc.), it is often unknown which of these parameters will be greater and which smaller.

For example, if you are interested in the variability of results in the control and experimental groups, there is usually no confidence about the sign of the difference between the variances or standard deviations by which variability is estimated. In this case, the null hypothesis is that the variances are equal, and the goal of the study is to prove the opposite, i.e., that there is a difference between the variances, and the difference is allowed to be of either sign. Such hypotheses are called two-sided.

But sometimes the task is to prove an increase or a decrease of a parameter; for example, that the average result in the experimental group is higher than in the control group. In this case, the difference is no longer allowed to be of either sign. Such hypotheses are called one-sided.

Significance tests used to test two-sided hypotheses are called two-tailed, and those used for one-sided hypotheses, one-tailed.

The question arises of which criterion to choose in a particular case. The answer lies outside the scope of formal statistical methods and depends entirely on the purpose of the study. In no case should a criterion be chosen after the experiment, based on an analysis of the experimental data, since this can lead to wrong conclusions. If, prior to the experiment, it is assumed that the difference between the compared parameters can be either positive or negative, a two-tailed test should be chosen.
