In the scientific method, we collect data to support or refute hypotheses, not to prove or disprove them. We frame scientific research in this way because there might be factors that we are unaware of that affect the results. In human-subjects research, one reason we cannot prove or disprove hypotheses is that we use samples to represent populations rather than whole populations.

A **population** is the group of people that you are interested in studying. For example, a population might be the students in a particular major, those at a particular university, in a particular country, or all university students. In most cases, it is neither feasible nor necessary to include everyone in a population in a research study. Instead, we rely on **samples**, or representative subsets, of those populations. For example, if you were interested in a project about improving physics education (i.e., the population of physics students), you might use students in an intro physics class as your sample. Sampling introduces potential error into the research because samples can differ (e.g., students in one physics class aren’t the same as those in another), whereas a population is all-inclusive.
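Sampling variability is easy to see in a simulation. The sketch below uses made-up exam scores for a hypothetical population of 1,000 physics students and draws two class-sized samples from it; the two sample means differ from each other and from the population mean, which is exactly the sampling error described above.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: exam scores for 1,000 physics students,
# roughly normally distributed around 75.
population = [random.gauss(75, 10) for _ in range(1000)]

# Two different class-sized samples drawn from the same population.
sample_a = random.sample(population, 30)
sample_b = random.sample(population, 30)

print(round(statistics.mean(population), 1))
print(round(statistics.mean(sample_a), 1))
print(round(statistics.mean(sample_b), 1))
```

Neither sample mean exactly matches the population mean, and the two samples disagree with each other, even though every score came from the same population.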

We also do not attempt to prove hypotheses because human-subjects research includes error. Error is anything that might influence the results of a study in a way that is inconsistent or that isn’t being measured. For example, personal issues may or may not affect a student’s performance in a course. Unless the study is about coping with personal issues, this isn’t likely to be measured but could impact the results of the research and, thus, be a source of error. People are incredibly complex, and human-subjects research inherently has a lot of error because people’s performance in a study can be influenced by almost anything, including time of day, day of week, personality, events that happened yesterday, events that happened a month ago, what we think about the researcher, and what we think the research is about.

All measurements include some degree of error as well. For example, if you ask participants for their age, you can have two participants who are “20” with almost a year difference in age unless you measure with more specificity, which probably would not be worth your or their time. Other measurements used in educational research are no different.
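The age example can be made concrete. Below is a small sketch with two hypothetical birthdays: both participants would report an age of “20” on a survey, even though they were born almost a year apart, so the integer measurement hides real variation.

```python
from datetime import date

def reported_age(birth, today):
    # Integer age in completed years, as a survey question captures it.
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)

# Hypothetical survey date and birthdays (assumed for illustration).
survey_date = date(2024, 6, 1)
birth_a = date(2004, 5, 30)  # just turned 20
birth_b = date(2003, 6, 15)  # nearly 21

print(reported_age(birth_a, survey_date))
print(reported_age(birth_b, survey_date))
print((birth_a - birth_b).days)  # days between the two birthdays
```

Both calls report 20, despite a gap of roughly 350 days between the birthdays.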

To manage this error, human-subjects research employs null hypothesis significance testing to determine whether a phenomenon is likely due to error. Basically, we use statistical analyses to compare the size of the error to the size of the effect. For example, when comparing an intervention group to a control group, there will be a normal amount of variation within each group based on differences among people, measurement error, and other kinds of error. This within-group variability is compared to between-group variability, or the difference between the groups. When between-group variability is sufficiently larger than within-group variability, we say there is an effect of the intervention. The same comparison can be made within people from a pre-test to a post-test, in which case within-subject variability is compared to between-subject variability.
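The between-group versus within-group comparison is exactly what a two-sample *t* statistic computes. Here is a minimal sketch using invented scores for hypothetical intervention and control groups (the numbers are not from any real study), written with only the standard library:

```python
import statistics

# Hypothetical scores (toy data): intervention group vs. control group.
intervention = [82, 85, 88, 79, 91, 84, 87, 90]
control      = [74, 78, 72, 80, 76, 75, 79, 77]

# Between-group variability: the difference in group means.
between = statistics.mean(intervention) - statistics.mean(control)

# Within-group variability: pooled standard error of the difference,
# built from the variance inside each group.
n1, n2 = len(intervention), len(control)
pooled_var = ((n1 - 1) * statistics.variance(intervention)
              + (n2 - 1) * statistics.variance(control)) / (n1 + n2 - 2)
se = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5

# The t statistic is the ratio of the two: large values mean the groups
# differ by more than within-group error alone would predict.
t = between / se
print(round(t, 2))
```

When the ratio is large, the between-group difference stands out against the within-group noise; when it is near zero, the groups differ no more than people within a single group do.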

In null hypothesis significance testing, we use this comparison to determine whether a result had less than a 5% probability of occurring due to error alone, reported as *p* < .05. Said another way, *p* < .05 means there is less than a 5% chance that a difference this large would appear by chance if the independent variable had no effect. Thus, the more error in the study, the larger the effect must be to outweigh this error. Results that meet the *p* < .05 standard are evidence, but not proof, that an effect exists.
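One way to see where a *p*-value comes from, without any distribution tables, is a permutation test: if the null hypothesis is true and group labels don’t matter, shuffling the labels should produce differences as large as the observed one fairly often. This sketch (toy data, pure Python) estimates *p* as the fraction of shuffles that do:

```python
import random
import statistics

random.seed(42)  # fixed seed so the estimate is reproducible

# Hypothetical scores (toy data, not from the post).
intervention = [82, 85, 88, 79, 91, 84, 87, 90]
control      = [74, 78, 72, 80, 76, 75, 79, 77]

observed = statistics.mean(intervention) - statistics.mean(control)

# Under the null hypothesis, group labels are arbitrary. Shuffle the
# labels many times and count how often chance alone produces a
# difference at least as large as the observed one.
pooled = intervention + control
n = len(intervention)
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
    if abs(diff) >= abs(observed):
        count += 1

p = count / trials
print(p < .05)
```

With these toy numbers the observed difference almost never arises from shuffled labels, so the estimated *p* falls well below .05.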

Differences that meet the *p* < .05 standard are considered **statistically significant**. When describing results, it is correct to use the full phrase “statistically significant” instead of only “significant” because “significant” is used commonly enough in the English language that people might infer the colloquial meaning. The opposite of statistically significant is “nonsignificant,” not “insignificant.”

It is important not to confuse statistical significance with practical significance. Results can be statistically significant but not meaningful for several reasons. For example, if you’re analyzing a MOOC with a very large sample size, then a 2% difference between groups could be statistically significant, but a 2% difference likely isn’t very meaningful in the big picture. Similarly, results can be statistically nonsignificant but meaningful. For example, if you expected online and in-class groups of students to perform differently, and they did not, then the nonsignificant results are meaningful.

Effect sizes, another kind of statistic, are becoming standard in social science research. They can tell you how strong or weak a phenomenon is, but they fall outside the null hypothesis significance testing paradigm, which says only whether a result is likely or unlikely to be due to error. Later posts will discuss using and reporting statistical analyses in both paradigms.
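One common effect size, Cohen’s *d*, expresses the group difference in standard-deviation units. Here is a minimal sketch using the same kind of invented two-group data as above (the cutoffs in the comment are the rough conventions often attributed to Cohen, not hard rules):

```python
import statistics

# Hypothetical scores (toy data): intervention group vs. control group.
intervention = [82, 85, 88, 79, 91, 84, 87, 90]
control      = [74, 78, 72, 80, 76, 75, 79, 77]

mean_diff = statistics.mean(intervention) - statistics.mean(control)

# Pooled standard deviation across both groups.
n1, n2 = len(intervention), len(control)
pooled_var = ((n1 - 1) * statistics.variance(intervention)
              + (n2 - 1) * statistics.variance(control)) / (n1 + n2 - 2)
pooled_sd = pooled_var ** 0.5

# Cohen's d: group difference in standard-deviation units.
# Rough conventions: 0.2 small, 0.5 medium, 0.8 large.
d = mean_diff / pooled_sd
print(round(d, 2))
```

Unlike a *p*-value, *d* doesn’t shrink just because the sample is small or grow because it is large; it answers “how big is the difference?” rather than “is the difference likely due to error?”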

| Term | Definition | Relationship to Error |
| --- | --- | --- |
| Sample | A subset of people from the population that you are trying to reach | Samples include error because they do not necessarily represent all people in the population of interest equally. |
| Normal Distribution | The expected distribution of scores based on inherent differences among people | Because people vary on innumerable characteristics, some differences between participants are expected. Further, measurements include error. |
| Statistical Significance | The conclusion drawn from null hypothesis significance testing that the difference between groups is greater than what could be expected due to chance | The difference between groups must be larger than the error within groups to have statistical significance. |
| Effect Size | The strength of an intervention or variable | A ratio of the variability caused by the effect to the variability caused by error. |

To view more posts about research design, see a list of topics on the Research Design: Series Introduction.
