Rubrics, the detailed categorization of student work, have become a common teaching and grading tool. Here is one designed to assess college-level writing. The idea is to provide specificity and a sense of progress. The student and the teacher have detailed descriptions of the various levels of accomplishment so that expectations and grading assessments are better understood and ways to improve easily seen. Rubrics are a highly effective tool for teacher-student communication.

The question to be addressed here is, “What happens when we amalgamate rubric data?” The idea is that we want to use rubrics to get a sense of the overall student performance of a class or a cohort. In this post, questions of accuracy and sampling will be ignored. Student work will be assumed correctly categorized. Issues of inter-rater reliability and the like will be assumed solved and simple random samples will be assumed to have been taken.

Consider this data we gathered for an institutional assessment of student writing. We took a random sample of thirty-nine instances of senior writing. A particular attribute resulted in this frequency table and this bar graph.

What do we make of it? We have four categories that measure increasing competency. We ignore the possibility that the very names of the categories influenced the assessments. The scale – beginning, developing, accomplished, and exemplary – is an ordinal scale: a person rated accomplished is presumed to have higher skills than one rated developing or beginning, for example. The statistical term for this type of information is ordinal categorical data. I note at this point that my reading of Analysis of Ordinal Categorical Data, 2nd edition, by Alan Agresti informs these views, and it is quite possible that I misinterpreted what I read.

At this stage a natural step would be to replace the descriptive categories – beginning, developing, accomplished, and exemplary – with numbers, say 1, 2, 3 and 4. Oops, we have just taken our first step onto a slick, downward-sloping glacier. Lose our balance and we will end up on the rocks far below.

The numbers 1, 2, 3, 4 do preserve the ordering: 3 is “better” than 2, which is “better” than 1, and so on. But they deceive. Is a person whose level is 4 twice as skilled as a person rated a 2? In terms of the original categories, is a person rated “exemplary” twice as skilled as a person rated “developing”? Or is the difference in competency between the 2 and 3 levels the same as the difference in competency between the 3 and 4 levels? In other words, is the difference in ability between a person rated “developing” and a person rated “accomplished” the same as the difference in ability between a person rated “accomplished” and a person rated “exemplary”? By translating statements from numerical ratings back to the descriptive ratings, we get nonsense. What happened? We started to treat the numerical categories as if the numbers had their normal meaning, for example 4 - 3 = 3 - 2 = 1. Same difference numerically, near nonsense in the original language of the categories. In statistics talk, we have ordinal data, not interval data. The categories are not equally spaced and a natural zero does not exist.

Let’s take another step onto the glacier. The average score is 2.73. This treats the data like regular numbers with consistent intervals and ratios, which they do not have. For instance, maybe “exemplary” skills are five times “better” than “beginning” skills, whatever that means. Thinking like this is particularly hard to resist here in academia. We do it all the time. We take ordinal data (A, A-, B+, etc.), assign numbers to the grades (4, 3.7, 3.3, etc.), weight by the number of units taken, and calculate GPA to two decimal places. We use this faux number to make financial aid, scholarship, athletic eligibility, and hiring decisions.

Can anything be done? Dr. Agresti gives a few ways to assign numbers to the ordered categories that use the underlying proportions or assumptions about the data. This approach allows one to more easily speak of odds ratios and yields possibilities for sophisticated analysis. Instead, let’s go back to the data. Would the calculation of a median help us characterize the data? With so few categories, the median is not much help. The median in the above data is “accomplished.” So at least half of the students are “accomplished” or lower and at least half are “accomplished” or higher. Not a lot of new information there. The mode is easy to see, and it might be helpful to say that the largest share of the students are rated “accomplished.”
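The median and mode need only the order of the categories, never a numeric score. Here is a minimal sketch of that calculation. The counts are invented for illustration and are not the frequency table from our assessment; the function names are likewise just for this example.

```python
from itertools import accumulate

# Hypothetical counts for illustration only; NOT the post's actual data.
LEVELS = ["beginning", "developing", "accomplished", "exemplary"]
counts = [3, 11, 18, 7]  # 39 ratings in total


def ordinal_median(levels, counts):
    """Return the first level whose cumulative count reaches half the sample."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one rating")
    for level, running in zip(levels, accumulate(counts)):
        # Compare 2 * running with total to avoid fractions.
        if 2 * running >= total:
            return level


def ordinal_mode(levels, counts):
    """Return the level with the largest count (first one if there is a tie)."""
    return levels[counts.index(max(counts))]


print(ordinal_median(LEVELS, counts))
print(ordinal_mode(LEVELS, counts))
```

Both functions would give the same answer if the levels were relabeled 1, 2, 3, 4 or 1, 2, 3, 5, which is exactly why they are safe for ordinal data.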

Going back to the numerical assignment of categories, maybe we just didn’t choose the ‘right’ scale. For instance, our sense of the difference in skills when we actually measured them might have made us think that exemplary students are really, really good and the rating scale should be 1, 2, 3, 5. The result would be this bar graph with a big hole in it.
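To see how much the choice of scale matters, here is a small sketch that computes the mean of the same ratings under two order-preserving scorings. The counts are invented for illustration (they are not our actual frequency table), and the names `equal_steps` and `stretched_top` are just labels for this example.

```python
# Hypothetical counts for illustration only; NOT the post's actual data.
counts = {"beginning": 3, "developing": 11, "accomplished": 18, "exemplary": 7}

# Two scorings that both respect the order of the categories.
equal_steps = {"beginning": 1, "developing": 2, "accomplished": 3, "exemplary": 4}
stretched_top = {"beginning": 1, "developing": 2, "accomplished": 3, "exemplary": 5}


def mean_score(counts, scores):
    """Weighted mean of the numeric scores assigned to each category."""
    total = sum(counts.values())
    return sum(scores[level] * n for level, n in counts.items()) / total


print(round(mean_score(counts, equal_steps), 2))
print(round(mean_score(counts, stretched_top), 2))
```

With these made-up counts the mean moves from about 2.74 to about 2.92, even though not a single rating changed. The "average" reflects our choice of numbers as much as it reflects the students.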

This might make sense if we have some reason to think we have a really good sub-population, but our experience tells us that a population of students usually has a continuum of skills. So the hole is an artifact. Maybe half of the 5-rated students are really 4’s and half of the 3-rated students should have been rated 4’s. If so, we get this bar graph.

We are on a slippery slope, manipulating the data post hoc. We will need a lengthy justification for any particular choice among the large variety of possibilities.

Yet there must be something there. Another way to think about the rubric data is as a rough approximation of an underlying latent (to use Dr. Agresti’s term) variable. If we could measure a trait with microscopic precision, we might get an x-axis with meaningful units. As with most measurements of complex human traits, we would assume the distribution to be normal. Our bar graphs would turn into histograms and would estimate the situation according to this rough sketch.
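One way to make the latent-variable picture concrete is to ask where the boundaries between categories would fall on a standard normal curve if each category's share of the sample equaled the area under the curve between its boundaries. The sketch below does that with Python's standard library. It assumes a standard normal latent variable, uses invented counts rather than our actual data, and `latent_cutpoints` is a name made up for this example; it is not necessarily the procedure Dr. Agresti describes.

```python
from itertools import accumulate
from statistics import NormalDist

# Hypothetical counts for illustration only; NOT the post's actual data.
# Order: beginning, developing, accomplished, exemplary.
counts = [3, 11, 18, 7]


def latent_cutpoints(counts):
    """Boundaries on a standard normal latent scale that reproduce the
    observed cumulative proportions.

    Assumes every cumulative proportion is strictly between 0 and 1,
    so the first and last categories must not be empty.
    """
    total = sum(counts)
    cumulative = list(accumulate(counts))[:-1]  # drop the final 100%
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [standard_normal.inv_cdf(c / total) for c in cumulative]


for cut in latent_cutpoints(counts):
    print(round(cut, 2))
```

Four categories yield three boundaries. Unequal gaps between them would suggest that, under the normality assumption, the rubric levels do not cover equal stretches of the underlying skill.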

A common objection to the use of an x-axis that extends to infinity is that the skill measured is limited and bounded below. Yet it is not too hard to think of a very, very, very low-skilled student or, for that matter, a student with astonishingly high skills – not infinite but far enough out there to establish a good approximation for a normal curve.

This idea can work quite well for symmetric mound-shaped histograms as this example shows.

Here the category percentages match the latent variable distribution.

What happens with less symmetrical data? We get a less accurate match as in this example.

The diagram challenges us to explain the lack of symmetry. Was it the rubric? Was it our interpretation of the rubric? Maybe the population really displays that asymmetry?

Enough of the slippery slope. Forcing a numerical scale on our ordinal categorical data and using these numbers to generate statistics like the mean and standard deviation should be questioned. It could lead us to an erroneous conclusion and at minimum should make us humbly uncertain. If the data is bell-shaped, less so. Part II will explore how to draw statistically valid conclusions from aggregate sample rubric data.