Table 1

| Ordinal Variable | Benchmark Criteria | # Items | SHAC Variable Not Coded | Form Dom | Form 2ndary | Formless | Total Agreement | Kendall Tau-b | Kappa | Gwet’s AC2 |
|---|---|---|---|---|---|---|---|---|---|---|
| Aggregate FDSHAC | -- | 155 | 99.5% | 78% | 59% | 79.5% | 79% | 0.82+++ | 0.71++ | 0.86## |
| | | | (1.00) | (4.79) | (18.27) | (11.81) | (4.43) | (0.04) | (0.05) | (0.04) |
| | | | | | | | | [0.73-0.91] | [0.66-0.76] | [0.82-0.90] |
| | SHAC>0 | 115 | | | | | 72% | 0.65++ | 0.56+ | 0.85## |
| | | | | | | | (6.55) | (0.08) | (0.09) | (0.05) |
| | | | | | | | | [0.57-0.73] | [0.47-0.65] | [0.80-0.90] |
| FDY | -- | 155 | 94% | 71% | 44% | 75% | 89% | 0.73++ | 0.64++ | 0.92### |
| | | | (2.58) | (4.50) | (17.91) | (9.23) | (3.28) | (0.10) | (0.11) | (0.03) |
| | | | | | | | | [0.63-0.83] | [0.53-0.75] | [0.89-0.95] |
| | SHAC>0 | 115 | 91% | | | | 82% | 0.69++ | 0.62++ | 0.87## |
| | | | (3.81) | | | | (5.40) | (0.10) | (0.11) | (0.06) |
| | | | | | | | | [0.59-0.79] | [0.51-0.73] | [0.81-0.93] |
| | Any Y>0 | 33 | -- | | | | 59% | 0.39 | 0.42+ | 0.64# |
| | | | | | | | (11.09) | (0.16) | (0.12) | (0.15) |
| | | | | | | | | [0.23-0.55] | [0.30-0.54] | [0.49-0.79] |
| FDT | -- | 155 | 98% | 85% | 54% | 84% | 94% | 0.88+++ | 0.81+++ | 0.97### |
| | | | (2.28) | (9.39) | (10.40) | (19.05) | (3.59) | (0.09) | (0.09) | (0.03) |
| | | | | | | | | [0.79-0.87] | [0.72-0.90] | [0.94-0.99] |
| | SHAC>0 | 115 | 97% | | | | 92% | 0.85+++ | 0.79+++ | 0.95### |
| | | | (3.49) | | | | (4.39) | (0.08) | (0.10) | (0.04) |
| | | | | | | | | [0.77-0.93] | [0.69-0.89] | [0.91-0.99] |
| | Any T>0 | 27 | -- | | | | 73% | 0.56+ | 0.62++ | 0.80## |
| | | | | | | | (6.06) | (0.21) | (0.11) | (0.10) |
| | | | | | | | | [0.35-0.77] | [0.51-0.73] | [0.70-0.90] |
| FDV | -- | 155 | 99% | 46% | 39% | 47% | 89% | 0.75+++ | 0.60++ | 0.95### |
| | | | (0.96) | (19.19) | (34.68) | (13.40) | (3.77) | (0.11) | (0.14) | (0.04) |
| | | | | | | | | [0.64-0.86] | [0.46-0.74] | [0.91-0.99] |
| | SHAC>0 | 115 | 99% | | | | 85% | 0.73+++ | 0.60++ | 0.92### |
| | | | (1.22) | | | | (4.53) | (0.12) | (0.14) | (0.03) |
| | | | | | | | | [0.61-0.85] | [0.46-0.74] | [0.89-0.95] |
| | Any V>0 | 29 | -- | | | | 47% | 0.32 | 0.28 | 0.43 |
| | | | | | | | (15.20) | (0.17) | (0.19) | (0.23) |
| | | | | | | | | [0.15-0.49] | [0.09-0.47] | [0.21-0.66] |
| FDC’ | -- | 155 | 97% | 81% | 65% | 82% | 92% | 0.88+++ | 0.79+++ | 0.97### |
| | | | (2.16) | (5.50) | (22.97) | (21.75) | (1.83) | (0.06) | (0.10) | (0.02) |
| | | | | | | | | [0.82-0.94] | [0.69-0.79] | [0.95-0.99] |
| | SHAC>0 | 115 | 96% | | | | 90% | 0.86+++ | 0.78+++ | 0.95### |
| | | | (2.69) | | | | (4.15) | (0.06) | (0.09) | (0.03) |
| | | | | | | | | [0.80-0.92] | [0.69-0.87] | [0.92-0.98] |
| | Any C’>0 | 29 | -- | | | | 74% | 0.70++ | 0.63++ | 0.83## |
| | | | | | | | (7.69) | (0.12) | (0.11) | (0.07) |
| | | | | | | | | [0.58-0.82] | [0.52-0.74] | [0.76-0.90] |
Note. FDSHAC=Form Dominance in Shading and Achromatic Color. SHAC=Shading and Achromatic Color. FDY=Form Dominance in Diffuse Shading. FDT=Form Dominance in Texture. FDV=Form Dominance in Vista. FDC’=Form Dominance in Achromatic Color. Interpretive ranges for Kappa and Kendall Tau-b: + fair (0.40-0.59); ++ good (0.60-0.74); +++ excellent (0.75 and above) (Cicchetti, 1994; Shrout & Fleiss, 1979). Interpretive ranges for Gwet’s AC2: # moderate (.50-.74); ## good (.75-.89); ### excellent (.90 and above) (Koo & Li, 2016).
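The interpretive cutoffs in the note above can be applied mechanically. The following sketch is our own illustration, not part of the cited guidelines; in particular, the function name and the "poor" label for coefficients below the lowest published band are assumptions.

```python
def interpret(coef, stat):
    """Map a coefficient to the interpretive bands given in the Table 1 note.

    Kappa / Kendall Tau-b bands follow Cicchetti (1994) and Shrout & Fleiss
    (1979); Gwet's AC2 bands follow Koo & Li (2016). Values below the lowest
    published band are labeled "poor" here (the note leaves them unlabeled).
    """
    if stat in ("kappa", "tau_b"):
        bands = [(0.75, "excellent"), (0.60, "good"), (0.40, "fair")]
    elif stat == "ac2":
        bands = [(0.90, "excellent"), (0.75, "good"), (0.50, "moderate")]
    else:
        raise ValueError(f"unknown statistic: {stat}")
    for cutoff, label in bands:
        if coef >= cutoff:
            return label
    return "poor"

print(interpret(0.82, "tau_b"))  # excellent (matches the +++ flag in Table 1)
print(interpret(0.71, "kappa"))  # good (matches the ++ flag)
print(interpret(0.86, "ac2"))    # good (matches the ## flag)
```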
The first row of Table 1 presents average scoring accuracy coefficients for the four raters on Aggregate FDSHAC. On separate lines, this row also provides coefficients for all 155 items and for the 115 items for which the benchmark scored a SHAC determinant. The means and standard deviations reveal trends similar to the interrater reliability analyses (Bram et al., 2023): In analyses of all 155 items across all four points of the scale, reliability looks more than solid, including overall percent agreement of 79%, a good Kappa above 0.70, and Kendall Tau-b and Gwet’s AC2 above 0.80 (excellent and good, respectively). When examining only the 115 items for which the benchmark coded a SHAC determinant, we notice the expected drop-off in the overall mean coefficients (72% agreement, good Kendall Tau-b, and fair Kappa), with the exception of Gwet’s AC2, which hardly changes at 0.85.
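For readers less familiar with these coefficients, the following minimal Python sketch computes percent agreement, unweighted Cohen’s Kappa, and Kendall Tau-b for one rater against a benchmark. The ten ratings are hypothetical illustrations on the 4-point scale, not the study’s data.

```python
from collections import Counter

def percent_agreement(a, b):
    """Proportion of items on which the two raters give the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b, categories):
    """Unweighted Cohen's kappa: exact agreement corrected for chance."""
    n = len(a)
    po = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in categories)
    return (po - pe) / (1 - pe)

def kendall_tau_b(a, b):
    """Kendall tau-b: rank correlation with a tie correction, suited to ordinal codes."""
    n = len(a)
    concordant = discordant = ties_a = ties_b = 0
    for i in range(n):
        for j in range(i + 1, n):
            da, db = a[i] - a[j], b[i] - b[j]
            if da == 0 and db == 0:
                continue            # tied on both raters: excluded from all counts
            elif da == 0:
                ties_a += 1
            elif db == 0:
                ties_b += 1
            elif da * db > 0:
                concordant += 1
            else:
                discordant += 1
    denom = ((concordant + discordant + ties_a) *
             (concordant + discordant + ties_b)) ** 0.5
    return (concordant - discordant) / denom

# Hypothetical codes (0 = SHAC not coded, 1 = Form Dominant,
# 2 = Form Secondary, 3 = Formless); not the study's actual ratings.
benchmark = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
rater     = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
print(percent_agreement(benchmark, rater))
print(cohens_kappa(benchmark, rater, [0, 1, 2, 3]))
print(kendall_tau_b(benchmark, rater))
```

On this toy set, exact agreement is 70% and Kappa is 0.60, but Tau-b is about 0.86, illustrating how Tau-b rewards near-misses on an ordinal scale even when exact agreement is imperfect.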
Providing breakdowns in average percent agreement with the expert/benchmark at each level of the ordinal scale, Table 1 also illuminates the relative weakness of reliability in coding Form Secondary to SHAC. Though agreement approaches 80% for the Form Dominance and Formless (Formless SHAC) categories, it is only 59% for Form Secondary. This is analogous to the finding for interrater reliability (Bram et al., 2023). We also noticed that Form Secondary to SHAC is the level with the greatest variability among the four raters (SD=18.27), which in part captures the range between Rater 3’s 40% agreement and Rater 2’s 80% agreement for this most challenging category.6
The second row of Table 1 displays average scoring accuracy coefficients across our raters in coding FDY, and again we see a pattern similar to the interrater reliabilities reported in Bram et al. (2023). In the 155-item analyses, Gwet’s AC2 and Kendall Tau-b were excellent, and Kappa was good. In the 115-item analyses, Gwet’s AC2 and Kendall Tau-b were good, and Kappa was fair. For the restricted 33-item set, overall Gwet’s AC2 was moderate, Kappa fair, and Kendall Tau-b poor.
Interestingly, homing in on percent agreements within each of the levels of FDY, we observe that our raters were most accurate at the Formless Y level. Akin to the Bram et al. (2023) findings with interrater reliabilities, YF was the most difficult category for our group of experienced coders and the one with the greatest variability.
Shown in the third row of Table 1, our four raters were impressively consistent with the expert/benchmark in ordinal coding of FDT. Gwet’s AC2’s, Kappas, and Kendall Tau-b’s in the 155- and 115-item sets were all excellent. In the 27-item set, Gwet’s AC2 and Kappas were good, and Kendall Tau-b was fair. The excellent Kendall Tau-b’s in the first two sets of 0.88 and 0.85 help illuminate that when agreement with the expert/benchmark was not exact, it was quite close. Not depicted in the table is how remarkably close both Raters 1 and 2 were to the expert/benchmark in coding FDT (Gwet’s AC2=0.99 and 0.98; Kappas=0.88-0.89 and 0.87-0.88, respectively) within the 155- and 115-item sets.
Examining percent agreement within each of the four categories of the scale, again it was the Form Secondary category (in this case TF) that was the relatively weak link: Mean percent agreement with the expert/benchmark for TF was only 54%, whereas agreements were 85% and 84%, respectively, for FT and Formless T.
6These breakdowns of each rater’s consistency with the benchmark are not included in Table 1 but are available from the first author, not only for FDSHAC but also for FDY, FDT, FDV, and FDC’.
The fourth row of Table 1 reveals a complicated picture of the four raters’ consistency with the expert/benchmark’s ordinal coding of FDV. Mean Gwet’s AC2’s for the 155- and 115-item analyses were excellent, and Kappas were in the good range. However, two other aspects of Table 1 reveal that these overall accuracy statistics are misleading: (a) mean percent agreement at each level of the 4-point ordinal scale and (b) mean AC2 and Kappa for the 29-item set show that, on the whole, raters actually had difficulty distinguishing among FV, VF, and Formless V. The average percent agreement with the benchmark in each of the three categories was FV=46%, VF=39%, and Formless V=47%, suggesting that the AC2’s and Kappas for the 155- and 115-item sets were inflated by the considerable ease of scoring the absence of Vista (99%) on the ordinal scale. These percent agreements were weak across the board in comparison to average scoring accuracy for the other specific FDSHAC determinants (FY=71%, YF=44%, Formless Y=75%; FT=85%, TF=54%, Formless T=84%; FC’=81%, C’F=65%, Formless C’=82%). The weak average Gwet’s AC2=0.43 and Kendall Tau-b=0.32 for the 29-item analysis (where the expert/benchmark coded Vista, so that a coding of V=0 [Vista absent] was unlikely) also capture this phenomenon: raters had great difficulty not only with exact agreement with the benchmark but also with coming close in their ratings on the ordinal scale.
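The inflation-by-prevalence phenomenon described above can be illustrated directly. The sketch below uses hypothetical ratings (not the study’s data) and computes Gwet’s AC2 alongside, for contrast, a linearly weighted Kappa on a skewed distribution in which most items have no Vista coded; AC2 stays close to the observed agreement while the Kappa-family statistic is pulled down by the dominant "not coded" category.

```python
# Gwet's AC2 vs. linearly weighted kappa on skewed ordinal data, following
# the two-rater formulas in Gwet (2014). Weights: w = 1 - |k - l| / (q - 1).
Q = 4
W = [[1 - abs(k - l) / (Q - 1) for l in range(Q)] for k in range(Q)]

def weighted_agreement(a, b):
    return sum(W[x][y] for x, y in zip(a, b)) / len(a)

def gwet_ac2(a, b):
    n = len(a)
    pa = weighted_agreement(a, b)
    # pi_k: average of the two raters' marginal proportions for category k
    pi = [(a.count(k) + b.count(k)) / (2 * n) for k in range(Q)]
    t_w = sum(sum(row) for row in W)
    pe = t_w / (Q * (Q - 1)) * sum(p * (1 - p) for p in pi)
    return (pa - pe) / (1 - pe)

def weighted_kappa(a, b):
    n = len(a)
    pa = weighted_agreement(a, b)
    pr = [a.count(k) / n for k in range(Q)]   # benchmark marginals
    pc = [b.count(l) / n for l in range(Q)]   # rater marginals
    pe = sum(W[k][l] * pr[k] * pc[l] for k in range(Q) for l in range(Q))
    return (pa - pe) / (1 - pe)

# Skewed prevalence: most items have no Vista coded (category 0).
benchmark = [0] * 17 + [1, 2, 3]
rater     = [0] * 16 + [1, 1, 2, 2]
print(round(gwet_ac2(benchmark, rater), 2),
      round(weighted_kappa(benchmark, rater), 2))
```

On this toy set, AC2 comes out near 0.96 while the weighted Kappa is near 0.81, a gap driven by the skewed marginal prevalence rather than by the pattern of disagreements.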
Another finding that stood out in the fourth row of Table 1 involved the considerable variability among raters in matching the expert/benchmark coding of VF (SD=34.68, the largest among all of the Form Dominance levels across the Aggregate FDSHAC, FDY, FDT, FDV, and FDC’ ordinal variables). Rater 2 was able to apply Viglione’s (2010) method to attain high levels of accuracy, notably even in the most challenging VF category (89% agreement). In the 29-item analyses, which eliminate the common and easy "Vista Not Coded" level (coded 0), Gwet’s AC2 for Rater 2’s scoring accuracy on FDV was good at 0.76. On the other hand, the 29-item analyses for Raters 3 and 4 each yielded a poor Gwet’s AC2 of 0.22. The 29-item analyses also revealed Rater 4’s difficulty coming close to the benchmark’s distinctions among FV, VF, and V (Kendall Tau-b=0.14); Raters 1 and 3 struggled in the same way, albeit to a lesser extent (Kendall Tau-b’s of 0.33 and 0.27, respectively).
The fifth row of Table 1 presents the average accuracy among the four raters with reference to the expert/benchmark coding on the ordinal FDC’ scale. Not only were the mean Gwet’s AC2’s in the 155- and 115-item sets excellent, but so too were the respective Kappas and Kendall Tau-b’s. And in the 29-item analyses, Gwet’s AC2, Kappa, and Kendall Tau-b were all in the good range. The average percent agreement between raters and the expert/benchmark within each of the coding categories was consistently strong (97% for No C’, 81% for FC’, 65% for C’F, and 82% for Formless C’). Even though C’F expectedly had the relatively weakest accuracy among the levels of FDC’, it still represents the strongest Form Secondary accuracy (65%) among all of the Specific FDSHAC variables (44% for FDY, 54% for FDT, 39% for FDV).
Using Viglione’s (2010) guidelines, our four coders’ pattern of scoring accuracy relative to expert/benchmark ratings on ordinal FDSHAC variables was largely consistent with their pattern of interrater reliability as reported in Bram et al. (2023). As with interrater reliability, among the Specific FDSHAC variables our coders exhibited the greatest accuracy in rating the various levels of FDT and FDC’. Accuracy in rating the ordinal FDY and Aggregate FDSHAC variables was comparatively modest, but still more than acceptable. Similar to the findings for interrater reliability, the most challenging ordinal FDSHAC variable for our raters to code accurately was FDV. Also consistent with our interrater reliability findings, among the levels of the Aggregate and Specific FDSHAC variables, Form Secondary (compared to Form Dominant and Formless) was the most difficult to code accurately and the level with the greatest variability among coders.
Comparing Table 1, which displays coding accuracy, with the analogous Table 1 in Bram et al. (2023), which shows interrater reliability, we notice that percent agreement, Kendall Tau-b, Kappa, and Gwet’s AC2 tend to run a few points higher for accuracy than for interrater reliability. This trend is expected theoretically because coefficients calculated between non-expert judges are lowered by error in each judge’s rating, whereas with one judge being an expert in each comparison (e.g., Rater 1 with expert/benchmark, Rater 2 with expert/benchmark), the coefficient of agreement is affected by half as much error (Greg Meyer, personal communication, July 2023).
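This error-halving argument can be checked with a quick simulation. The sketch below is ours and rests on simplifying assumptions not stated in the text: an error-free expert whose codes equal the true scores, and two raters who each miscode 20% of items by one scale point.

```python
import random

random.seed(7)  # fixed seed for reproducibility

def noisy(true_scores, p_err):
    """A simulated rater: with probability p_err, miscodes by one scale point."""
    out = []
    for t in true_scores:
        if random.random() < p_err:
            if t == 0:
                out.append(1)
            elif t == 3:
                out.append(2)
            else:
                out.append(t + random.choice([-1, 1]))
        else:
            out.append(t)
    return out

def agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

# True scores on the 4-point ordinal scale; the expert is assumed error-free.
true_scores = [random.randrange(4) for _ in range(5000)]
r1 = noisy(true_scores, 0.2)
r2 = noisy(true_scores, 0.2)

print(round(agreement(r1, true_scores), 2))  # rater vs. expert/benchmark
print(round(agreement(r1, r2), 2))           # rater vs. rater
```

With these settings, rater-versus-expert agreement hovers near 0.80, while rater-versus-rater agreement falls to roughly two-thirds, because the latter comparison compounds two raters’ errors.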
Even with somewhat higher coefficients for accuracy versus interrater reliability, the conclusions, limitations, and implications drawn in Bram et al. (2023) about the status of using Viglione (2010) to code FDSHAC remain apropos. Given the potential clinical yield of coding FDSHAC to assess level of ego involvement in regulating emotions (Bram & Peebles, 2014; Peebles-Kleiger, 2002; Schafer, 1954), there remains additional work to be done to improve the training of raters to apply Viglione’s FDSHAC coding guidelines more effectively, particularly in accurately distinguishing Form Secondary from both (a) Form Dominant and (b) Formless. As our group of experienced raters had difficulty with this, it is likely that relatively novice Rorschachers would struggle at least as much, if not more. As described in Bram et al. (2023), there are key terms and concepts in the coding decision-tree that were likely unclear in the training materials that we provided to our raters and thus require improved explication. As refinements are made in the training materials and process, future research can overcome a primary limitation of this investigation by studying the accuracy of less experienced coders using Viglione’s (2010) guidelines. Along these lines, it is worth highlighting that for a less experienced group, focusing on coding accuracy relative to expert/benchmark ratings would be more meaningful than focusing on interrater reliability as it would reveal to what extent the coders are accurately internalizing and applying the coding distinctions. As Guarnaccia et al. (2001) remind us, it is conceivable that such raters could have strong agreement with each other, but mostly still miss established benchmarks.
Although fortunately our findings did not realize Guarnaccia et al.’s (2001) cautionary tale that interrater reliability might belie scoring accuracy, their point is still well taken even if largely ignored in Rorschach research and the general psychological assessment literature. Broadly, a case can be made that coding accuracy is even more important than interrater reliability, "as all raters should aspire to the same standard" (Greg Meyer, personal communication, July 2023). At the same time, it bears acknowledgment that in many studies, it is not always possible to identify or access an expert to provide ratings. This does not mean, though, that this procedure should not be considered in study design. In some instances, it might be feasible for investigators to (a) enlist an expert to code a subset of responses/items for comparison with the study’s raters and (b) report accuracy coefficients alongside those for interrater reliability.
We also raise for consideration the possibility that authors making the psychometric case for their instrument or coding approach publish findings specific to scoring accuracy. For example, R-PAS has coding (as well as administration) exams created to establish proficiency for Rorschachers contributing to normative data collection. It is conceivable that findings related to coding accuracy of the many variables could be disseminated through publication, calling attention to coding distinctions that are more versus less challenging and offering ideas to enhance accuracy in coding them.
Finally, based on the convergence between Guarnaccia et al.’s (2001) findings and the first author’s experience as an assessment supervisor and consultant, in clinical practice Rorschach coding accuracy, especially at the response level, may not be as strong as we in the field would like to believe. To mitigate this possibility, one idea is that authors of the major Rorschach systems (Exner, Andronikof, & Fontan, 2022; Meyer et al., 2011) consider designing automated online checks of scoring accuracy. Potentially free or subscription-based (which might be necessary to cover the costs of creating the service), such a service could involve periodically posting new practice protocols for users to code, providing statistical and qualitative feedback on accuracy relative to expert ratings, and directing users to sections in the manual and other training materials that redress domains of coding challenge. Recent, rapid advances in artificial intelligence (AI) hold promise for the development of such a coding service.
We thank the American Psychoanalytic Association’s Fund for Psychoanalytic Research for funding this project. We are also grateful to Dr. Greg Meyer for sharing his ideas about scoring accuracy. Disclosure: Donald J. Viglione is a member of a company that sells the R-PAS manual and associated products. He is also the author and self-publisher of Rorschach Coding Solutions. Correspondence concerning this article should be addressed to Dr. Anthony D. Bram, 329 Massachusetts Ave., #2, Lexington, MA 02420. Email: Anthony_Bram@hms.harvard.edu
Bram, A. D., & Peebles, M. J. (2014). Psychological testing that matters: Creating a road map for effective treatment. Washington, DC: APA Books.
Bram, A. D., & Yalof, J. (2018). Two contemporary Rorschach systems: Views of two experienced Rorschachers on the CS and R-PAS. Journal of Projective Psychology and Mental Health, 25, 35-43.
Bram, A. D., Viglione, D. J., Lee-Parritz, O., Gottschalk, K. A., Yalof, J. A., Kleiger, J. H., Dyette, K., & Khadivi, A. (2023). Rorschach assessment of ego involvement in affect regulation: Interrater reliability of form dominance in shading and achromatic color responses. Psychoanalytic Psychology. Advance online publication.
Burke, L.J. (2011). Interrater reliability and scoring accuracy of the Rorschach Comprehensive System: Comparing students and experts at the variable, response, protocol, and interpretation level. Unpublished dissertation. Department of Professional Psychology, Chestnut Hill College, Philadelphia, PA.
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284-290.
Exner, J. E. (1986). The Rorschach: A Comprehensive System Vol. 1: Basic foundations and principles of interpretation (2nd ed.). New York: Wiley.
Exner, J. E. (2003). The Rorschach: A Comprehensive System Vol. 1: Basic foundations and principles of interpretation (4th ed.). New York: Wiley.
Exner, J.E., Andronikof, A., & Fontan, P. (2022). The Rorschach: A Comprehensive System--Revised administration and coding manual. Fort Mills, SC: Rorschach Workshops.
Exner, J. E., Colligan, S. C., Hillman, L. R., Metts, A. S., Ritzler, B. A., Rogers, K. T., Sciara, A. D., & Viglione, D. J. (2001). A Rorschach workbook for the Comprehensive System (5th ed.). Asheville, NC: Rorschach Workshops.
Guarnaccia, V., Dill, C. A., Sabatino, S., & Southwick, S. (2001). Scoring accuracy using the Comprehensive System for the Rorschach. Journal of Personality Assessment, 77(3), 464–474.
Gwet, K. (2014). Handbook of inter-rater reliability (4th ed.). Gaithersburg, MD: Advanced Analytics Press.
Kleiger, J. H. (1997). Rorschach shading responses: From a printer's error to an integrated psychoanalytic paradigm. Journal of Personality Assessment, 69, 342-364.
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163.
Lewey, J. H., Kivisalu, T. M., & Giromini, L. (2019). Coding with R-PAS: Does prior training with the Exner Comprehensive System impact interrater reliability compared to those examiners with only R-PAS-based training? Journal of Personality Assessment, 101(4), 393–401.
Lundh, A., Kowalski, J., Sundberg, C. J., Gumpert, C., & Landén, M. (2010). Children’s Global Assessment Scale (CGAS) in a naturalistic clinical setting: Inter-rater reliability and comparison with expert ratings. Psychiatry Research, 177(1), 206–210.
Meyer, G. J., Hilsenroth, M. J., Baxter, D., Exner, J. E., Jr., Fowler, J. C., Piers, C. C., & Resnick, J. (2002). An examination of interrater reliability for scoring the Rorschach comprehensive system in eight data sets. Journal of Personality Assessment, 78(2), 219-274.
Meyer, G. J., Viglione, D. J., Mihura, J. L., Erard, R. F., & Erdberg, P. (2011). Rorschach Performance Assessment System: Administration, coding, interpretation, and technical manual. Toledo, OH: Rorschach Performance Assessment System, L.L.C.
Peebles-Kleiger, M. J. (2002). Elaboration of some sequence analysis strategies: Examples and guidelines for level of confidence. Journal of Personality Assessment, 79(1), 19-38.
Piotrowski, C. (2017). Rorschach research through the lens of bibliometric analysis: Mapping investigatory domain. Journal of Projective Psychology and Mental Health, 24, 34-35.
Schafer, R. (1954). Psychoanalytic interpretation of Rorschach testing. New York: Grune & Stratton.
Shaffer, D., Gould, M. S., Brasic, J., Ambrosini, P., Fisher, P., Bird, H., & Aluwahlia, S. (1983). A children's global assessment scale (CGAS). Archives of General Psychiatry, 40(11), 1228–1231.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
Viglione, D. J. (2010). Rorschach coding solutions: A reference guide for the Comprehensive System (2nd ed.). San Diego, CA: Author.
Viglione, D.J., Meyer, G.J., Resende, A.C., & Pignolo, C. (2017). A survey of challenges experienced by new learners coding the Rorschach. Journal of Personality Assessment, 99(3), 315-323.
Weiner, I. B. (1998). Principles of Rorschach interpretation. Mahwah, NJ: Lawrence Erlbaum.