Research Design: Inferential Statistics for Causal Questions

The last post discussed inferential statistics for relational questions. This post discusses inferential statistics for causal questions. Causal research questions most commonly call for an experimental design. In that case, the purpose of inferential statistics is to determine whether the difference between groups is greater than what would be expected due to chance. For example, if you tested a new teaching method and found that students who received it scored 5 points higher on a test than students who didn’t, it would matter whether the standard deviation were 2 points or 20 points. If the standard deviation were 2 points, then the mean of the intervention group is 2.5 standard deviations higher than that of the other group, and, assuming normally distributed scores, the average student in the intervention group scored better than about 99% of the other group. However, if the standard deviation were 20 points, then the difference is only 0.25 standard deviations and, unless the groups are large, likely isn’t greater than what would be expected due to chance.
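To check the arithmetic in the teaching-method example, here is a minimal Python sketch. It assumes that scores are normally distributed with the same spread in both groups, which the example does not state, and the function names are made up for illustration.

```python
from statistics import NormalDist


def standardized_difference(mean_diff, sd):
    """Express a raw difference in means in standard-deviation units."""
    return mean_diff / sd


def share_of_comparison_group_below(mean_diff, sd):
    """Share of the comparison group scoring below the intervention mean.

    Assumption: scores are normal with equal spread in both groups.
    """
    return NormalDist().cdf(standardized_difference(mean_diff, sd))


for sd in (2, 20):
    z = standardized_difference(5, sd)
    share = share_of_comparison_group_below(5, sd)
    print(f"SD = {sd}: difference = {z:.2f} SDs, above {share:.1%} of the other group")
```

The same 5-point difference looks very different depending on the spread, which is why the raw difference alone can't tell you whether an effect is meaningful.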

T-Test

If you are comparing two groups (e.g., a group that gets conventional instruction and a group that gets a new type of instruction), then you’ll want to use a t-test to analyze the differences between groups.

There are a few different types of t-tests based on how you collected data.

For comparing groups that are between-subjects (participants in groups are mutually exclusive), use an independent-samples t-test. For example, if you taught one group with lectures and another group with active learning, then you could compare their test performance using an independent samples t-test.
For comparing groups that are within-subjects (participants are the same in both groups), use a paired-samples t-test. For example, if you taught one topic with lecture and another topic with active learning, and the same students learned both topics, you could compare their performance on questions about each topic using a paired-samples t-test.
For comparing a group to the mean from literature or another study, use a one-sample t-test. For example, if you were comparing scores on a concept inventory from your class to the mean score from the literature, then you could use a one-sample t-test.
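The three t-tests above differ mainly in what gets compared. As a rough sketch (not a replacement for a statistics package), the t values can be computed with Python’s standard library. The scores and the reference mean of 70 are hypothetical, and the independent-samples version assumes equal variances (the pooled, or Student’s, form).

```python
from math import sqrt
from statistics import mean, stdev, variance


def independent_t(a, b):
    """t for two separate groups, using the pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def paired_t(first, second):
    """t for the same participants measured twice, based on per-person differences."""
    diffs = [x - y for x, y in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def one_sample_t(scores, reference_mean):
    """t for one group compared with a fixed reference mean (e.g., from the literature)."""
    return (mean(scores) - reference_mean) / (stdev(scores) / sqrt(len(scores)))


# Hypothetical test scores.
lecture = [70, 72, 68, 75, 71]
active = [78, 74, 80, 77, 76]

print(f"independent: t = {independent_t(active, lecture):.2f}")
print(f"paired:      t = {paired_t(active, lecture):.2f}")  # only valid if the same students produced both lists
print(f"one-sample:  t = {one_sample_t(active, 70):.2f}")
```

Turning a t value into a p value requires the t distribution, which a statistics package will handle for you.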

T-tests will give you a t value, which is basically the ratio of the difference between groups to the error within groups. The larger the absolute value of t is, the bigger the difference between groups is compared to the error within each group. The t value can be positive or negative depending on which group has the higher mean (i.e., the order in which the means are subtracted). You can determine if that value is statistically significant based on the p value (e.g., p < .05 is conventionally considered statistically significant), or you can determine how large the difference between groups is using the effect size statistic Cohen’s d.

Below is a table for the heuristic cut-offs (Cohen, 1988) to describe the size of effects based on Cohen’s d, but the meaningfulness of the size of the difference depends largely on the variables that you are comparing.

Value of d	Size of effect
d > .8	Large effect
d = .8 to .5	Medium effect
d = .5 to .2	Small effect
d < .2	No effect

Cohen’s d effect size values for t tests.
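For illustration, Cohen’s d for two independent groups can be sketched as the difference in means divided by the pooled standard deviation. The scores below are hypothetical, and how to label values that fall exactly on a cut-off (.2, .5, .8) is an assumption on my part because the table leaves it open.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled_sd


def describe_d(d):
    """Label |d| with the heuristic cut-offs from the table (boundary handling assumed)."""
    size = abs(d)
    if size > 0.8:
        return "Large effect"
    if size >= 0.5:
        return "Medium effect"
    if size >= 0.2:
        return "Small effect"
    return "No effect"


# Hypothetical test scores.
control = [70, 72, 68, 75, 71]
treated = [78, 74, 80, 77, 76]

d = cohens_d(treated, control)
print(f"d = {d:.2f} ({describe_d(d)})")
```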

ANOVA

If you are comparing more than two groups, then you’ll use analysis of variance, ANOVA. Comparing more than two groups includes comparing three or more groups of one independent variable and/or groups from multiple independent variables. ANOVA will tell you whether there is a

main effect of an independent variable, a difference between groups within an independent variable
interaction between independent variables, a difference in an independent variable’s effect based on the value of another independent variable

For example, if you had two interventions that you were testing (i.e., 2 x 2 between-subjects design), it could be the case that getting one or the other intervention would not improve test scores (dependent variable). In this scenario, there would be no main effect of either independent variable, meaning that by themselves neither variable improved scores. If students who received both interventions performed better on the test, then that would be an example of an interaction. The effect of each intervention relies on the other intervention being given.
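One way to see an interaction like this is as a difference of differences between cell means. The sketch below uses made-up intervention names (A and B) and made-up mean scores. It only describes the pattern; an ANOVA is still what tests whether the pattern is larger than chance.

```python
# Hypothetical cell means for a 2 x 2 design (test scores out of 100).
cell_means = {
    ("no A", "no B"): 70.0,
    ("A", "no B"): 70.0,
    ("no A", "B"): 70.0,
    ("A", "B"): 80.0,
}


def effect_of_a(means, b_level):
    """Effect of intervention A at one fixed level of intervention B."""
    return means[("A", b_level)] - means[("no A", b_level)]


effect_without_b = effect_of_a(cell_means, "no B")
effect_with_b = effect_of_a(cell_means, "B")

# The interaction contrast: how much A's effect changes when B is also given.
interaction_contrast = effect_with_b - effect_without_b

print(f"Effect of A without B: {effect_without_b}")
print(f"Effect of A with B:    {effect_with_b}")
print(f"Interaction contrast:  {interaction_contrast}")
```

A contrast of zero would mean that A helps (or doesn’t help) by the same amount regardless of B.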

The type of ANOVA that you use will depend on whether you have a between- or within-subjects design and the relationships among dependent variables. A standard ANOVA assumes a between-subjects design and allows only one dependent variable to be analyzed at a time. If you have a within-subjects or mixed design or you expect dependent variables to be related, then you’ll need to use one of the other types of ANOVA mentioned below.

ANOVA will give you an F value, which is basically the ratio of the differences among groups to the error within groups. The larger F is, the bigger the difference among groups is compared to the error within groups. The F value can only be positive. You can determine if that value is statistically significant based on the p value (i.e., p < .05 is statistically significant), or you can determine how large the effect is using the effect size statistic f.
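To make the ratio concrete, here is a minimal sketch of the F value for a one-way, between-subjects ANOVA: the mean square between groups divided by the mean square within groups. The three groups and their scores are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F = (variation among group means) / (variation within groups)."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical test scores for three instruction types.
lecture = [70, 72, 68, 75, 71]
active = [78, 74, 80, 77, 76]
hybrid = [73, 75, 72, 76, 74]

f_value = one_way_anova_f([lecture, active, hybrid])
print(f"F = {f_value:.2f}")
```

Because both the numerator and denominator are built from squared differences, the result can’t be negative.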

Below is a table for the heuristic cut-offs (Cohen, 1988) to describe the size of effects based on Cohen’s f, but the meaningfulness of the size of the differences depends largely on the variables that you are comparing.

Value of f	Size of effect
f > .4	Large effect
f = .4 to .25	Medium effect
f = .25 to .1	Small effect
f < .1	No effect

Cohen’s f effect size for F tests.
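As a sketch, Cohen’s f for a one-way design can be estimated from eta squared (the share of total variation that lies between groups) as the square root of eta squared divided by one minus eta squared. The scores are hypothetical, and the handling of values exactly on a cut-off is my assumption.

```python
from math import sqrt
from statistics import mean


def cohens_f(groups):
    """Sample estimate of Cohen's f from eta squared for a one-way design."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    eta_squared = ss_between / ss_total
    return sqrt(eta_squared / (1 - eta_squared))


def describe_f(f):
    """Label f with the heuristic cut-offs from the table (boundary handling assumed)."""
    if f > 0.4:
        return "Large effect"
    if f >= 0.25:
        return "Medium effect"
    if f >= 0.1:
        return "Small effect"
    return "No effect"


# Hypothetical test scores for three instruction types.
scores = [
    [70, 72, 68, 75, 71],
    [78, 74, 80, 77, 76],
    [73, 75, 72, 76, 74],
]

f = cohens_f(scores)
print(f"f = {f:.2f} ({describe_f(f)})")
```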

Post-hoc Analyses

An ANOVA will tell you whether there is a difference among your groups, but because you are comparing more than two groups and it is only one number, it will not tell you between which groups the difference occurs. For example, if you were comparing 3 groups and found a statistically significant F value, it could be the case that groups 1 & 2 are equal but group 3 is different or it could be the case that groups 2 & 3 are equal and 1 is different. To determine the specific pattern of results within an ANOVA, conduct post-hoc tests.

For distinguishing between more than two levels within one independent variable, the most popular post-hoc test is Tukey’s Honestly Significant Difference (HSD) test because it is conservative and, therefore, doesn’t invite skepticism. This test conducts pairwise comparisons (compares the means of two levels of the independent variable) to determine if one mean is statistically larger or smaller than the other. It will give you q values, which are like t values and represent the difference between the two groups being compared.
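For a sense of where q values come from, here is a rough sketch for groups of equal size: each q is the difference between two group means divided by the square root of the within-groups mean square over the group size. The group names and scores are hypothetical, and the sketch stops short of significance, which requires the studentized range distribution from a statistics package.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def tukey_q_values(groups):
    """Pairwise q statistics. `groups` maps a group name to equal-sized score lists."""
    sizes = {len(scores) for scores in groups.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes equal group sizes")
    n = sizes.pop()
    df_within = n * len(groups) - len(groups)
    ms_within = sum(
        (x - mean(scores)) ** 2 for scores in groups.values() for x in scores
    ) / df_within
    standard_error = sqrt(ms_within / n)
    return {
        (a, b): abs(mean(groups[a]) - mean(groups[b])) / standard_error
        for a, b in combinations(groups, 2)
    }


# Hypothetical test scores for three instruction types.
data = {
    "lecture": [70, 72, 68, 75, 71],
    "active": [78, 74, 80, 77, 76],
    "hybrid": [73, 75, 72, 76, 74],
}

q_values = tukey_q_values(data)
for pair, q in q_values.items():
    print(f"{pair[0]} vs. {pair[1]}: q = {q:.2f}")
```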

For distinguishing between groups across different independent variables, the most popular post-hoc test is a simple main effects analysis. For example, if you had two independent variables with two levels each (2×2 design), and you wanted to compare groups that got the same level of one independent variable and different levels of the other independent variable, then you’d use simple main effects.

Managing Error from Multiple Tests

It’s generally a good idea in the null hypothesis significance testing framework to use post-hoc tests for only the groups that you expect to be different (either due to your hypotheses or the means of your data). The more post-hoc analyses that you use, the more error you introduce. Because every test run at the p < .05 threshold has up to a 5% chance of a false positive, each additional test raises the chance that at least one of your significant results is due to error. To keep error in check, researchers use the Bonferroni correction (sometimes called the Bonferroni adjustment). This correction divides the significance threshold (alpha, typically .05) by the number of tests that you conduct to ensure that the total error in your analyses is no more than 5%. For example, if you conducted 4 t-tests as post-hoc tests for an ANOVA, then you’d divide .05 by 4, and your results would only be considered statistically significant if p < .0125.
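The arithmetic behind the correction can be sketched in a few lines. The p values below are hypothetical, and the familywise error calculation assumes that the tests are independent, which post-hoc tests on the same data usually are not, so treat it as an approximation.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after the Bonferroni correction."""
    return alpha / n_tests


# Hypothetical p values from 4 post-hoc t-tests.
p_values = [0.004, 0.020, 0.011, 0.300]

uncorrected_risk = familywise_error_rate(0.05, len(p_values))
threshold = bonferroni_threshold(0.05, len(p_values))
significant = [p < threshold for p in p_values]

print(f"Risk of at least one false positive without correction: {uncorrected_risk:.1%}")
print(f"Corrected threshold: p < {threshold}")
print(f"Significant after correction: {significant}")
```

Note that the second p value would count as significant at .05 but not at the corrected threshold.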

ANCOVA (rather than using Gain/Difference Scores)

ANCOVA stands for analysis of covariance. If you expect that performance on one dependent measure or a demographic measure will be predictive of performance on the dependent measure being analyzed, then you have a covariate for your dependent variable. Covariates act like an independent variable because they have predictive value. In this case, you’d use ANCOVA to ensure that the independent variable predicts performance on the dependent variable above and beyond any covariates. For example, if you gave participants a pre-test and expected performance on that pre-test to be predictive of performance on the post-test, then you’d want to use ANCOVA to ensure that your independent variable can predict the post-test separately from the pre-test.

Especially for comparing pre-test to post-test scores, many researchers will use gain scores, but it is better to use ANCOVA. Gain scores have fallen out of favor recently because they are less reliable. Gain scores are the educational equivalent of difference scores. Difference scores take two points of data (e.g., a pre- and post-test) that each have an error component and condense them into one data point. Because the separate error components are no longer part of the analysis, difference scores ignore the error in the original scores, making them less reliable.
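To show how the two approaches can diverge, the sketch below compares mean gain scores with ANCOVA-style adjusted post-test means, which shift each group’s post-test mean to where it would be if every group had started at the overall pre-test mean. The data and group names are hypothetical, and this covers only the adjustment step, not the full ANCOVA significance test.

```python
from statistics import mean


def pooled_within_slope(groups):
    """Slope of post-test on pre-test, pooled across groups."""
    cross_products = pre_squares = 0.0
    for pairs in groups.values():
        pre_mean = mean(pre for pre, _ in pairs)
        post_mean = mean(post for _, post in pairs)
        cross_products += sum((pre - pre_mean) * (post - post_mean) for pre, post in pairs)
        pre_squares += sum((pre - pre_mean) ** 2 for pre, _ in pairs)
    return cross_products / pre_squares


def adjusted_post_means(groups):
    """Post-test means adjusted to the overall pre-test mean."""
    slope = pooled_within_slope(groups)
    grand_pre = mean(pre for pairs in groups.values() for pre, _ in pairs)
    return {
        name: mean(post for _, post in pairs)
        - slope * (mean(pre for pre, _ in pairs) - grand_pre)
        for name, pairs in groups.items()
    }


def mean_gain(pairs):
    """Average gain score (post-test minus pre-test)."""
    return mean(post - pre for pre, post in pairs)


# Hypothetical (pre-test, post-test) pairs; the groups start at different levels.
data = {
    "control": [(50, 60), (60, 66), (70, 72)],
    "treatment": [(40, 62), (50, 68), (60, 74)],
}

adjusted = adjusted_post_means(data)
gain_difference = mean_gain(data["treatment"]) - mean_gain(data["control"])
adjusted_difference = adjusted["treatment"] - adjusted["control"]

print(f"Difference in mean gain scores:   {gain_difference:.1f}")
print(f"Difference in adjusted post-test: {adjusted_difference:.1f}")
```

In this made-up data the two summaries disagree about the size of the treatment effect because gain scores implicitly assume that each pre-test point carries over fully to the post-test, while the adjustment estimates that relationship from the data.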

MANOVA (rather than one Summed/Average Score)

MANOVA stands for multivariate analysis of variance. If you expect two or more dependent measures to measure the same underlying construct, then you’d use MANOVA. For example, if you gave students 3 exams throughout the semester and wanted to use them together as a measure of student learning, then you would use MANOVA to analyze your results. You could also add or average all the test scores to make one final score, but similar to the difference scores from the previous paragraph, that would ignore the error components of the original scores and make your analyses less reliable. If you want to test the similarity of performance on two dependent variables, you can use a correlation. If you want to test the similarity of performance on more than two dependent variables, you can use principal components analysis or factor analysis.
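One common MANOVA test statistic is Wilks’ lambda, the ratio of within-group variation to total variation across all dependent measures at once. Values near 1 suggest that the groups barely differ, and values near 0 suggest strong separation. The sketch below is limited to two dependent measures (rather than the three exams in the example) to keep the matrix arithmetic readable, and the scores are hypothetical.

```python
from statistics import mean


def sscp(rows, centers):
    """Sums of squares and cross-products of two measures around the given centers.

    Returns the three distinct entries of the symmetric 2 x 2 matrix.
    """
    s11 = s12 = s22 = 0.0
    for (x, y), (cx, cy) in zip(rows, centers):
        s11 += (x - cx) ** 2
        s12 += (x - cx) * (y - cy)
        s22 += (y - cy) ** 2
    return s11, s12, s22


def determinant(matrix):
    """Determinant of a symmetric 2 x 2 matrix given as (s11, s12, s22)."""
    s11, s12, s22 = matrix
    return s11 * s22 - s12 ** 2


def wilks_lambda(groups):
    """det(within-group SSCP) / det(total SSCP) for two dependent measures."""
    rows = [row for group in groups for row in group]
    grand = (mean(x for x, _ in rows), mean(y for _, y in rows))
    group_centers = []
    for group in groups:
        center = (mean(x for x, _ in group), mean(y for _, y in group))
        group_centers += [center] * len(group)
    within = sscp(rows, group_centers)
    total = sscp(rows, [grand] * len(rows))
    return determinant(within) / determinant(total)


# Hypothetical (exam 1, exam 2) scores for two groups.
group_a = [(1, 2), (2, 1), (3, 3)]
group_b = [(6, 7), (7, 6), (8, 8)]

print(f"Wilks' lambda = {wilks_lambda([group_a, group_b]):.3f}")
```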

Repeated Measures

Repeated measures ANOVA is for within-subjects research designs that use the same measurements multiple times, usually at different time points. For example, if you gave learners the same test at the middle of the semester, the end of the semester, and 3 months after the semester, you’d use repeated measures to analyze the results. One of the benefits of a within-subjects design like this is that it reduces error by reducing the number of participants and all of the random factors that they bring with them. Repeated measures ANOVA allows the analysis to capitalize on this reduced error, but it assumes that participants completed all of the measures. If you have more than 10% of participants who did not complete every measure, then you should not use repeated measures ANOVA unless you’re willing to exclude those participants from the analysis.
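Here is a minimal sketch of both points: dropping participants with missing measurements, then computing the repeated measures F value. The key step is that variation between participants is subtracted from the error term, which is how the analysis capitalizes on the within-subjects design. The scores are hypothetical, and `None` marks a missed measurement.

```python
from statistics import mean


def complete_cases(records):
    """Keep only participants who have a score at every time point."""
    return [r for r in records if None not in r]


def repeated_measures_f(records):
    """F for the effect of time in a one-way repeated measures design."""
    n, k = len(records), len(records[0])
    grand = mean(x for r in records for x in r)
    ss_total = sum((x - grand) ** 2 for r in records for x in r)
    ss_subjects = k * sum((mean(r) - grand) ** 2 for r in records)
    time_means = [mean(r[t] for r in records) for t in range(k)]
    ss_time = n * sum((m - grand) ** 2 for m in time_means)
    # Consistent differences between participants are removed from the error term.
    ss_error = ss_total - ss_subjects - ss_time
    return (ss_time / (k - 1)) / (ss_error / ((n - 1) * (k - 1)))


# Hypothetical scores at mid-semester, end of semester, and 3 months later.
records = [
    (60, 70, 68),
    (55, 68, 63),
    (70, 78, 77),
    (65, None, 70),  # missed the second test
    (62, 71, 70),
]

usable = complete_cases(records)
f_time = repeated_measures_f(usable)
print(f"{len(usable)} of {len(records)} participants usable, F = {f_time:.2f}")
```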

Both MANOVA and repeated measures ANOVA analyze multiple measurements from the same participants. The major difference between the two is that in MANOVA, the measurements are different but all measure the same construct, and in repeated measures ANOVA, the measurements are the same and administered at different times.

Linear Regression

Regression is an analysis technique for determining how much of the performance on a dependent variable can be predicted by the independent variables, demographic characteristics, and/or performance on other dependent variables. If you’re thinking that it sounds a lot like ANOVA, then you are right. ANOVA, however, requires that the levels of your independent variable are discrete, and it is typically used for analyses with few predictors. Regressions can also use continuous independent variables and typically attempt to account for as much of the variance in the performance of the dependent variable as possible. Thus, it is typically used for analyses with several predictors.

Regression also focuses on how well each predictor explains the dependent variable rather than if the differences are statistically significant, so it’s used outside of null hypothesis significance testing much more frequently. A regression coefficient, or β, refers to how much the dependent variable is expected to change for each one-unit change in a predictor (independent, demographic, or covariate), holding the other predictors constant, while R² refers to how much of the variance in the dependent variable the predictors explain together. In addition, there are many other types of regression, such as logistic regression for categorical outcomes, for cases in which a straight-line model does not fit.
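To illustrate with the simplest case, here is a sketch of a regression with a single continuous predictor. It reports the slope (the expected change in the outcome per one-unit change in the predictor) and R² (the share of variance in the outcome that the line accounts for). The predictor, study hours, and the scores are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares slope, intercept, and R-squared for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - y_bar) ** 2 for b in y)
    return slope, intercept, 1 - ss_residual / ss_total


# Hypothetical data: weekly study hours and test scores.
hours = [1, 2, 3, 4, 5]
scores = [62, 66, 71, 73, 78]

slope, intercept, r_squared = fit_line(hours, scores)
print(f"score = {intercept:.1f} + {slope:.1f} * hours, R^2 = {r_squared:.3f}")
```

With several predictors the idea is the same, but the coefficients have to be estimated together, which is a job for a statistics package.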

This table summarizes the descriptive and inferential statistics used to answer different types of research questions. Next, we’ll discuss analyses to better understand your data, such as effect sizes, demographic analysis, and interrater reliability.

Statistic	Type of Question	When to Use
Mean	Descriptive	Find the average score of a group
Standard Deviation	Descriptive	Find the typical distance of scores from the group mean
Correlation Coefficient	Relational	Determine the strength of the relationship between 2 variables
t-Test	Causal	Determine whether difference between groups on dependent variable is caused by independent variable with 2 levels
ANOVA	Causal	Determine whether difference among groups on dependent variable(s) is caused by independent variable(s) with 2 or more levels
Regression	Causal	Determine how much of the variance in a dependent variable is attributable to other variables (e.g., demographics, independent variables)

Different statistics and their uses

To view more posts about research design, see a list of topics on the Research Design: Series Introduction.

7 thoughts on “Research Design: Inferential Statistics for Causal Questions”

Carl says:

November 21, 2022 at 6:55 PM

Also, check out Coe, R. (2002, September). It’s the effect size, stupid. In British Educational Research Association Annual Conference (Vol. 12, p. 14). https://cebma.org/wp-content/uploads/Coe-2002.pdf

Carl says:

November 21, 2022 at 7:01 PM

Also, check out Rogaten, J., & Rienties, B. (2021). A Critical Review of Learning Gains Methods and Approaches. Learning Gain in Higher Education.

“Obviously, there are several limitations associated with both pre-post tests and self reported measures. In relation to the pre-post testing, the first limitation that should be taken into account is whether the tests used are the same tests at the pre- and the post-test stages. If the same test or similar questions was administered twice, by default students will always perform better at the post-test than at the pre-test just as a result of the mere exposure to the testing environment. This is not just the case for the knowledge tests, as these findings are also quite common with the other skills assessments that require practice and attention. As such, when interpreting the findings of the studies that used identical assessments for the pretest and the post-test we should consider how much of the improvement can be attributed to the actual learning and how much of the improvement is just due to the exposure. To avoid the negative effects of completing the same test twice, one can choose two different tests, but the issue of comparability of the test difficulty should be addressed as well as attempts of removing the order effect. Furthermore, the direct comparison of the results of pre-test and post-test may produce less reliable learning gains, but the computational aspect of the prepost test research will be further discussed later on in the chapter.” (p. 6 – 7)
