Most companies that monitor calls and do Quality Assessment (QA) have some form of calibration. This usually follows some pattern of listening to a call with a group of people and analyzing it together to find out where you agree and disagree on an evaluation. It can be a painful process.
You can often learn just as much, if not more, by simply comparing a large sample of evaluations scored by different analysts.
For example, I was recently poring over a comparison of quarterly data for one of our clients. I have a member of my team run the raw data on a regular basis and provide me with a comparison report. In this particular project, we have four analysts scoring a couple hundred calls per month. There are roughly 65 different behavioral elements we analyze that are rolled up into 14 corresponding attributes. There is a score for each attribute and a corresponding Overall Service score on a scale from 0 to 100.
First, I compared the average Overall Service score for each of the four analysts. The four analysts were within a half-point of the overall average for the group. This told me that I didn't have anyone who was particularly lenient or harsh in their analysis. Our overall service numbers were very similar. If an analyst had an overall service score that was much higher or much lower, it would have motivated me to dig into the underlying data to find out why. So far, so good.
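If you want to try this first comparison on your own data, here is a minimal Python sketch. It is an illustration only, not the actual report described above: the record layout, the function names, the half-point default tolerance, and every score in the sample are assumptions made up for the example. The group average is taken here as the mean of all evaluations pooled together.

```python
from statistics import mean

# Hypothetical records: (analyst, overall_service_score on a 0-100 scale).
# All values are invented for illustration.
evaluations = [
    ("A", 89.0), ("A", 91.0),
    ("B", 90.0), ("B", 90.5),
    ("C", 89.5), ("C", 90.0),
    ("D", 90.0), ("D", 90.0),
]

def analyst_deviations(records):
    """Return {analyst: analyst mean minus the pooled group mean}."""
    group_mean = mean(score for _, score in records)
    by_analyst = {}
    for analyst, score in records:
        by_analyst.setdefault(analyst, []).append(score)
    return {a: mean(scores) - group_mean for a, scores in by_analyst.items()}

def flag_outliers(deviations, tolerance=0.5):
    """List analysts whose mean is more than `tolerance` points from the group."""
    return sorted(a for a, d in deviations.items() if abs(d) > tolerance)

deviations = analyst_deviations(evaluations)
outliers = flag_outliers(deviations)  # empty list: everyone is within half a point
```

An empty list means nobody looks unusually lenient or harsh at the overall level. A name in the list is only a prompt to dig into the underlying data, not a verdict.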
Next, I compared the average scores for each of the 14 Attributes. Because some Attributes rarely apply, there are much higher deviations. Keeping this in mind, I focused on the attributes that apply most often and have the greatest impact on the Overall Service score. Once again, our scores were very similar.
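The attribute-level comparison can be sketched the same way. The Python below is a rough illustration under stated assumptions: the row layout, the attribute names, the use of `None` for an attribute that did not apply, and the 50% applicability cutoff are all hypothetical. It reports the gap between the highest and lowest analyst averages, and only for attributes that apply often enough to be worth comparing.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (analyst, attribute, score). None means the attribute
# did not apply on that call. Names and scores are invented for illustration.
rows = [
    ("A", "Greeting", 95.0), ("A", "Greeting", 90.0),
    ("B", "Greeting", 92.0), ("B", "Greeting", 94.0),
    ("A", "Transfer", None), ("A", "Transfer", 60.0),
    ("B", "Transfer", None), ("B", "Transfer", None),
]

def attribute_spread(rows, min_applicable_rate=0.5):
    """Return {attribute: highest analyst mean minus lowest analyst mean},
    skipping attributes that applied on too small a share of evaluations."""
    totals = defaultdict(int)                       # evaluations per attribute
    scores = defaultdict(lambda: defaultdict(list)) # attribute -> analyst -> scores
    for analyst, attribute, score in rows:
        totals[attribute] += 1
        if score is not None:
            scores[attribute][analyst].append(score)

    spread = {}
    for attribute, per_analyst in scores.items():
        applicable = sum(len(v) for v in per_analyst.values())
        if applicable / totals[attribute] < min_applicable_rate:
            continue  # rarely applies, so large deviations are expected
        means = [mean(v) for v in per_analyst.values()]
        spread[attribute] = max(means) - min(means)
    return spread

spread = attribute_spread(rows)  # only "Greeting" passes the applicability cutoff
```

Filtering on applicability first keeps the rarely used attributes, where a handful of calls can swing the average, from drowning out the ones that drive the Overall Service score.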
Finally, I looked at the average number of times each analyst marked a behavior "yes," "no," or "not applicable" for each of the behavioral elements. In this instance, I found that one of my new analysts had marked a particular behavior as applicable on 100% of the calls analyzed, while the other analysts had it applicable on fewer than half. Because it's an element within an attribute on which the client normally scores very well, it didn't show up in the corresponding score. Nevertheless, it could eventually make a difference, and we were clearly not calibrated in scoring this particular behavior. By looking at the data, I was able to address the issue with the analyst and coach them on how to more accurately measure that particular behavior. From this point forward, we should be more closely calibrated.
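This behavior-level check is the easiest of the three to automate. The Python sketch below is a hedged illustration, not the comparison report itself: the behavior name, the mark labels, the sample marks, and the 0.3 gap threshold are assumptions invented for the example. It computes how often each analyst treated a behavior as applicable (marked "yes" or "no" rather than "na") and flags anyone whose rate sits far from the median of the other analysts.

```python
from collections import defaultdict
from statistics import median

# Hypothetical marks for one behavior across four calls per analyst.
# "na" means not applicable. All data is invented for illustration.
marks_by_analyst = {
    "A": ["yes", "na", "na", "na"],
    "B": ["no", "yes", "na", "na"],
    "C": ["na", "na", "yes", "na"],
    "D": ["yes", "yes", "no", "yes"],  # marks the behavior applicable every time
}
marks = [
    (analyst, "Offered callback", mark)
    for analyst, analyst_marks in marks_by_analyst.items()
    for mark in analyst_marks
]

def applicable_rates(marks):
    """Return {(behavior, analyst): share of calls marked "yes" or "no"}."""
    counts = defaultdict(lambda: [0, 0])  # [applicable, total]
    for analyst, behavior, mark in marks:
        entry = counts[(behavior, analyst)]
        entry[1] += 1
        if mark != "na":
            entry[0] += 1
    return {key: applicable / total for key, (applicable, total) in counts.items()}

def flag_divergent(rates, max_gap=0.3):
    """Flag (behavior, analyst) pairs whose applicable rate differs from the
    median rate of the other analysts by more than `max_gap`."""
    by_behavior = defaultdict(dict)
    for (behavior, analyst), rate in rates.items():
        by_behavior[behavior][analyst] = rate

    flagged = []
    for behavior, per_analyst in by_behavior.items():
        for analyst, rate in per_analyst.items():
            others = [r for a, r in per_analyst.items() if a != analyst]
            if others and abs(rate - median(others)) > max_gap:
                flagged.append((behavior, analyst))
    return sorted(flagged)

rates = applicable_rates(marks)
flagged = flag_divergent(rates)
```

Comparing against the median of the other analysts, rather than the group mean, keeps one divergent analyst from pulling the baseline toward themselves. A flag like this will not appear in the attribute score when the client does well on that attribute, which is why it pays to look at the marks directly.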
Sometimes, you've got to let the data show you the way.