This post discusses additional analyses that can help you establish conclusion validity. Conclusion validity means that your conclusions are reasonable given the evidence that you have, in particular that you do not overstate what your sample says about the population (i.e., the generalizability of your findings). Two common tools for establishing conclusion validity are interrater reliability, which ensures your data are scored as objectively as possible, and demographic analyses, which examine differences within your data based on learner characteristics.
Interrater Reliability
If you have qualitative data that need to be scored and the scoring scheme is at all subjective, you'll want multiple raters to score at least some of the data to guard against bias. For some people, qualitative data bring to mind long, verbal responses from participants, but qualitative data can also be a program written by a student or anything else that is open-ended. If you use multiple graders for a test, you are already engaging in a form of interrater reliability. When you have more than one rater, you need to determine how similarly the raters scored the data, and that is exactly what interrater reliability measures.
There are many types of interrater reliability, depending on the type of data that you have. I’ll discuss the most common types, but if they don’t match your data, there is probably a better test for you.
For binary data, e.g., data that are marked as correct/incorrect, use Cohen's Kappa. Cohen's Kappa compares the observed agreement with the agreement expected by chance, calculated from the base rate of each binary option for each rater. For example, if rater one marked 80% of items as correct while rater two marked 20% of items as incorrect (i.e., 80% correct), then the chance agreement would be high (.8 × .8 + .2 × .2 = .68) because both raters mark most items as correct and so would often agree by chance alone; Kappa discounts that portion of their raw agreement. A higher Kappa indicates higher agreement, with 0.41 to 0.60 considered moderate agreement, 0.61 to 0.80 substantial agreement, and 0.81 to 1.00 almost perfect agreement.
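To make this concrete, here is a minimal sketch of computing Cohen's Kappa in Python with scikit-learn; the two raters' scores below are hypothetical placeholders for your own data.

```python
# A minimal sketch of Cohen's Kappa for two raters' binary scores.
# The ratings below are hypothetical; substitute your raters' actual data.
from sklearn.metrics import cohen_kappa_score

# 1 = correct, 0 = incorrect, one entry per scored item
rater_one = [1, 1, 1, 1, 0, 1, 0, 1, 1, 1]
rater_two = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(rater_one, rater_two)
print(f"Cohen's Kappa: {kappa:.2f}")  # interpret against the bands above
```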
For data that are rated or ranked (i.e., more sensitive than correct/incorrect, typically interval or ratio data), use intraclass correlation coefficients. There are two types. The first, the intraclass correlation coefficient of consistency, ICC(C), is used when you are determining whether raters put items in the same order. For example, if you were trying to rank your students from the highest to lowest performers, then you could determine how similar raters' rankings were using ICC(C). The other, the intraclass correlation coefficient of absolute agreement, ICC(A), is used when you are determining whether raters gave each item the same score. For example, if you were giving each student a numerical grade, then you could determine how similar raters' scores were using ICC(A). As you might suspect, a high ICC(A) is more rigorous than a high ICC(C), but an acceptable value for initial ratings on either is .80 or higher.
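As an illustration, the pingouin library computes both forms of ICC from long-format data. The sketch below uses hypothetical scores from two raters; in pingouin's output, the ICC3 row corresponds to the single-rater consistency form, ICC(C,1), and the ICC2 row to the single-rater absolute-agreement form, ICC(A,1).

```python
# A minimal sketch of intraclass correlation coefficients with pingouin.
# Students, raters, and scores are hypothetical placeholders.
import pandas as pd
import pingouin as pg

# Long-format data: one row per (student, rater) pair
data = pd.DataFrame({
    "student": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "rater":   ["A", "B"] * 5,
    "score":   [88, 85, 72, 70, 95, 96, 60, 65, 80, 78],
})

icc = pg.intraclass_corr(data=data, targets="student",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC"]])  # ICC3 ~ consistency, ICC2 ~ absolute agreement
```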
Once you have the initial scores from each rater, disagreements among raters are typically resolved through discussion until 100% agreement is reached. If the ICC for the initial scores is lower than .80, then you'll need to retrain the raters and re-score the data. If you have a large dataset, you can instead take a sample of 20% of the participants and ask multiple raters to score only those participants' data. If the reliability for this sample is .80 or higher, then it is acceptable to have raters score the rest of the data independently.
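For instance, a minimal sketch of drawing that 20% reliability sample with pandas might look like the following; the participant IDs and fixed random seed are hypothetical placeholders.

```python
# A minimal sketch of drawing a 20% reliability sample from a large dataset.
# Participant IDs and the random seed are hypothetical placeholders.
import pandas as pd

participants = pd.DataFrame({"participant_id": range(1, 101)})

# Every rater scores this 20% subsample; if reliability is .80 or higher,
# the remaining data can be scored by single raters independently.
reliability_sample = participants.sample(frac=0.20, random_state=42)
remaining = participants.drop(reliability_sample.index)

print(len(reliability_sample), "participants double-scored")
print(len(remaining), "participants scored independently")
```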
Demographic Analyses
One of the reasons to collect data about demographic or learner characteristics is to determine whether they affect the relationship between your independent and dependent variables. In this way, demographic variables act like independent variables or covariates. It is good practice to run correlations between demographic data and dependent variable data to see whether there is a relationship between the two. If you find that one of your demographic variables is correlated with a dependent variable, you'll want to examine whether your groups differ meaningfully on that characteristic. If they do, it'll be difficult to argue that differences in the dependent variable are due to the independent variable rather than the demographic difference. You can determine whether groups differ by using descriptive statistics or, depending on what is appropriate for your data, by treating the demographic variable as a dependent variable and comparing groups with inferential statistics. Though this isn't how inferential statistics are meant to be used, the analysis will tell you whether the groups differ on that characteristic.
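As a sketch of that two-step workflow in Python with scipy, using entirely hypothetical variables (prior programming experience as the demographic, a post-test score as the dependent variable, and a two-group condition as the independent variable):

```python
# A minimal sketch of demographic checks: correlate a demographic with the
# dependent variable, then test whether groups differ on the demographic.
# All variable names and generated data are hypothetical placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
prior_experience = rng.integers(0, 10, size=60)    # demographic characteristic
post_test = 50 + 3 * prior_experience + rng.normal(0, 5, size=60)  # DV
condition = np.repeat([0, 1], 30)                  # IV: experimental group

# Step 1: is the demographic correlated with the dependent variable?
r, p = stats.pearsonr(prior_experience, post_test)
print(f"correlation with DV: r = {r:.2f}, p = {p:.3f}")

# Step 2: if so, treat the demographic as the outcome and test whether the
# groups differ on it; a non-significant result supports the argument that
# the groups were comparable on this characteristic.
t, p = stats.ttest_ind(prior_experience[condition == 0],
                       prior_experience[condition == 1])
print(f"group difference on demographic: t = {t:.2f}, p = {p:.3f}")
```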
To view more posts about research design, see a list of topics on the Research Design: Series Introduction.