Table 1

| Ordinal Variable | Benchmark Criteria | # Items | SHAC Variable Not Coded | Form Dom | Form 2ndary | Formless | Total Agreement | Kendall Tau-b | Kappa | Gwet’s AC2 |
|---|---|---|---|---|---|---|---|---|---|---|
| Aggregate FDSHAC | -- | 155 | 99.5% | 78% | 59% | 79.5% | 79% | 0.82+++ | 0.71++ | 0.86## |
| | | | (1.00) | (4.79) | (18.27) | (11.81) | (4.43) | (0.04) | (0.05) | (0.04) |
| | | | | | | | | [0.73-0.91] | [0.66-0.76] | [0.82-0.90] |
| | SHAC>0 | 115 | | | | | 72% | 0.65++ | 0.56+ | 0.85## |
| | | | | | | | (6.55) | (0.08) | (0.09) | (0.05) |
| | | | | | | | | [0.57-0.73] | [0.47-0.65] | [0.80-0.90] |
| FDY | -- | 155 | 94% | 71% | 44% | 75% | 89% | 0.73++ | 0.64++ | 0.92### |
| | | | (2.58) | (4.50) | (17.91) | (9.23) | (3.28) | (0.10) | (0.11) | (0.03) |
| | | | | | | | | [0.63-0.83] | [0.53-0.75] | [0.89-0.95] |
| | SHAC>0 | 115 | 91% | | | | 82% | 0.69++ | 0.62++ | 0.87## |
| | | | (3.81) | | | | (5.40) | (0.10) | (0.11) | (0.06) |
| | | | | | | | | [0.59-0.79] | [0.51-0.73] | [0.81-0.93] |
| | Any Y>0 | 33 | -- | | | | 59% | 0.39 | 0.42+ | 0.64# |
| | | | | | | | (11.09) | (0.16) | (0.12) | (0.15) |
| | | | | | | | | [0.23-0.55] | [0.30-0.54] | [0.49-0.79] |
| FDT | -- | 155 | 98% | 85% | 54% | 84% | 94% | 0.88+++ | 0.81+++ | 0.97### |
| | | | (2.28) | (9.39) | (10.40) | (19.05) | (3.59) | (0.09) | (0.09) | (0.03) |
| | | | | | | | | [0.79-0.87] | [0.72-0.90] | [0.94-0.99] |
| | SHAC>0 | 115 | 97% | | | | 92% | 0.85+++ | 0.79+++ | 0.95### |
| | | | (3.49) | | | | (4.39) | (0.08) | (0.10) | (0.04) |
| | | | | | | | | [0.77-0.93] | [0.69-0.89] | [0.91-0.99] |
| | Any T>0 | 27 | -- | | | | 73% | 0.56+ | 0.62++ | 0.80## |
| | | | | | | | (6.06) | (0.21) | (0.11) | (0.10) |
| | | | | | | | | [0.35-0.77] | [0.51-0.73] | [0.70-0.90] |
| FDV | -- | 155 | 99% | 46% | 39% | 47% | 89% | 0.75+++ | 0.60++ | 0.95### |
| | | | (0.96) | (19.19) | (34.68) | (13.40) | (3.77) | (0.11) | (0.14) | (0.04) |
| | | | | | | | | [0.64-0.86] | [0.46-0.74] | [0.91-0.99] |
| | SHAC>0 | 115 | 99% | | | | 85% | 0.73+++ | 0.60++ | 0.92### |
| | | | (1.22) | | | | (4.53) | (0.12) | (0.14) | (0.03) |
| | | | | | | | | [0.61-0.85] | [0.46-0.74] | [0.89-0.95] |
| | Any V>0 | 29 | -- | | | | 47% | 0.32 | 0.28 | 0.43 |
| | | | | | | | (15.20) | (0.17) | (0.19) | (0.23) |
| | | | | | | | | [0.15-0.49] | [0.09-0.47] | [0.21-0.66] |
| FDC’ | -- | 155 | 97% | 81% | 65% | 82% | 92% | 0.88+++ | 0.79+++ | 0.97### |
| | | | (2.16) | (5.50) | (22.97) | (21.75) | (1.83) | (0.06) | (0.10) | (0.02) |
| | | | | | | | | [0.82-0.94] | [0.69-0.79] | [0.95-0.99] |
| | SHAC>0 | 115 | 96% | | | | 90% | 0.86+++ | 0.78+++ | 0.95### |
| | | | (2.69) | | | | (4.15) | (0.06) | (0.09) | (0.03) |
| | | | | | | | | [0.80-0.92] | [0.69-0.87] | [0.92-0.98] |
| | Any C’>0 | 29 | -- | | | | 74% | 0.70++ | 0.63++ | 0.83## |
| | | | | | | | (7.69) | (0.12) | (0.11) | (0.07) |
| | | | | | | | | [0.58-0.82] | [0.52-0.74] | [0.76-0.90] |
Note. FDSHAC=Form Dominance in Shading and Achromatic Color. SHAC=Shading and Achromatic Color. FDY=Form Dominance in Diffuse Shading. FDT=Form Dominance in Texture. FDV=Form Dominance in Vista. FDC’=Form Dominance in Achromatic Color. Interpretive ranges for Kappa and Kendall Tau-b: + fair (0.40-0.59); ++ good (0.60-0.74); +++ excellent (0.75 and above) (Cicchetti, 1994; Shrout & Fleiss, 1979). Interpretive ranges for Gwet’s AC2: # moderate (.50-.74); ## good (.75-.89); ### excellent (.90 and above) (Koo & Li, 2016).
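The interpretive cutoffs in the note above can be applied mechanically. The following sketch is our own illustration, not part of the cited guidelines; in particular, the function name and the "poor" label for coefficients below the lowest published band are assumptions.

```python
def interpret(coef, stat):
    """Map a coefficient to the interpretive bands given in the Table 1 note.

    Kappa / Kendall Tau-b bands follow Cicchetti (1994) and Shrout & Fleiss
    (1979); Gwet's AC2 bands follow Koo & Li (2016). Values below the lowest
    published band are labeled "poor" here (the note leaves them unlabeled).
    """
    if stat in ("kappa", "tau_b"):
        bands = [(0.75, "excellent"), (0.60, "good"), (0.40, "fair")]
    elif stat == "ac2":
        bands = [(0.90, "excellent"), (0.75, "good"), (0.50, "moderate")]
    else:
        raise ValueError(f"unknown statistic: {stat}")
    for cutoff, label in bands:
        if coef >= cutoff:
            return label
    return "poor"

print(interpret(0.82, "tau_b"))  # excellent (matches the +++ flag in Table 1)
print(interpret(0.71, "kappa"))  # good (matches the ++ flag)
print(interpret(0.86, "ac2"))    # good (matches the ## flag)
```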
The first row of Table 1 presents average scoring accuracy coefficients for the four raters on Aggregate FDSHAC. On separate lines, this row also provides coefficients for all 155 items and for the 115 items for which the benchmark scored a SHAC determinant. The means and standard deviations reveal trends similar to the interrater reliability analyses (Bram et al., 2023): In analyses of all 155 items across all four points of the scale, reliability looks more than solid, including overall percent agreement of 79%, a good Kappa above 0.70, and Kendall Tau-b and Gwet’s AC2 above 0.80 (excellent and good, respectively). When examining only the 115 items for which the benchmark coded a SHAC determinant, we notice the expected drop-off in the overall mean coefficients (72% agreement, good Kendall Tau-b, and fair Kappa), with the exception of Gwet’s AC2, which hardly changes at 0.85.
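For readers less familiar with these coefficients, the following minimal Python sketch computes percent agreement, unweighted Cohen’s Kappa, and Kendall Tau-b for one rater against a benchmark. The ten ratings are hypothetical illustrations on the 4-point scale, not the study’s data.

```python
from collections import Counter

def percent_agreement(a, b):
    """Proportion of items on which the two raters give the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b, categories):
    """Unweighted Cohen's kappa: exact agreement corrected for chance."""
    n = len(a)
    po = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in categories)
    return (po - pe) / (1 - pe)

def kendall_tau_b(a, b):
    """Kendall tau-b: rank correlation with a tie correction, suited to ordinal codes."""
    n = len(a)
    concordant = discordant = ties_a = ties_b = 0
    for i in range(n):
        for j in range(i + 1, n):
            da, db = a[i] - a[j], b[i] - b[j]
            if da == 0 and db == 0:
                continue            # tied on both raters: excluded from all counts
            elif da == 0:
                ties_a += 1
            elif db == 0:
                ties_b += 1
            elif da * db > 0:
                concordant += 1
            else:
                discordant += 1
    denom = ((concordant + discordant + ties_a) *
             (concordant + discordant + ties_b)) ** 0.5
    return (concordant - discordant) / denom

# Hypothetical codes (0 = SHAC not coded, 1 = Form Dominant,
# 2 = Form Secondary, 3 = Formless); not the study's actual ratings.
benchmark = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
rater     = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
print(percent_agreement(benchmark, rater))
print(cohens_kappa(benchmark, rater, [0, 1, 2, 3]))
print(kendall_tau_b(benchmark, rater))
```

On this toy set, exact agreement is 70% and Kappa is 0.60, but Tau-b is about 0.86, illustrating how Tau-b rewards near-misses on an ordinal scale even when exact agreement is imperfect.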
Providing breakdowns in average percent agreement with the expert/benchmark at each level of the ordinal scale, Table 1 also illuminates the relative weakness of reliability in coding Form Secondary to SHAC. Though agreement approaches 80% for the Form Dominance and Formless (Formless SHAC) categories, it is only 59% for Form Secondary. This is analogous to the finding for interrater reliability (Bram et al., 2023). We also noticed that Form Secondary to SHAC is the level with the greatest variability among the four raters (SD=18.27), which in part captures the range between Rater 3’s 40% agreement and Rater 2’s 80% agreement for this most challenging category.6
The second row of Table 1 displays average scoring accuracy coefficients across our raters in coding FDY, and again we see a pattern similar to the interrater reliabilities reported in Bram et al. (2023). In the 155-item analyses, Gwet’s AC2 and Kendall Tau-b were excellent, and Kappa was good. In the 115-item analyses, Gwet’s AC2 and Kendall Tau-b were good, and Kappa was fair. For the restricted 33-item set, overall Gwet’s AC2 was moderate, Kappa fair, and Kendall Tau-b poor.
Interestingly, homing in on percent agreements within each of the levels of FDY, we observe that our raters were most accurate at the Formless Y level. Akin to the Bram et al. (2023) findings with interrater reliabilities, YF was the most difficult category for our group of experienced coders and the one with the greatest variability.
Shown in the third row of Table 1, our four raters were impressively consistent with the expert/benchmark in ordinal coding of FDT. Gwet’s AC2’s, Kappas, and Kendall Tau-b’s in the 155- and 115-item sets were all excellent. In the 27-item set, Gwet’s AC2 and Kappas were good, and Kendall Tau-b was fair. The excellent Kendall Tau-b’s in the first two sets of 0.88 and 0.85 help illuminate that when agreement with the expert/benchmark was not exact, it was quite close. Not depicted in the table is how remarkably close both Raters 1 and 2 were to the expert/benchmark in coding FDT (Gwet’s AC2=0.99 and 0.98; Kappas=0.88-0.89 and 0.87-0.88, respectively) within the 155- and 115-item sets.
Examining percent agreement within each of the four categories of the scale, again it was the Form Secondary category (in this case TF) that was the relatively weak link: Mean percent agreement with the expert/benchmark for TF was only 54%, whereas agreements were 85% and 84%, respectively, for FT and Formless T.
6These breakdowns of each rater’s consistency with the benchmark are not included in Table 1 but are available from the first author, not only for FDSHAC but also for FDY, FDT, FDV, and FDC’.
The fourth row of Table 1 reveals a complicated picture of the four raters’ consistency with the expert/benchmark’s ordinal coding of FDV. Mean Gwet’s AC2’s for the 155- and 115-item analyses were excellent, and Kappas were in the good range. However, two other aspects of Table 1 reveal that these overall accuracy statistics are misleading: (a) mean percent agreement at each level of the 4-point ordinal scale and (b) mean AC2 and Kappa for the 29-item set show that, on the whole, raters actually had difficulty distinguishing among FV, VF, and Formless V. The average percent agreement with the benchmark in each of the three categories was FV=46%, VF=39%, and Formless V=47%, suggesting that the AC2’s and Kappas for the 155- and 115-item sets were inflated by the considerable ease of scoring the absence of Vista (99%) on the ordinal scale. These percent agreements were weak across the board in comparison to average scoring accuracy for the other specific FDSHAC determinants (FY=71%, YF=44%, Formless Y=75%; FT=85%, TF=54%, Formless T=84%; FC’=81%, C’F=65%, Formless C’=82%). The weak average Gwet’s AC2=0.43 and Kendall Tau-b=0.32 for the 29-item analysis (where the expert/benchmark coded Vista, so that a coding of V=0 [Vista absent] was unlikely) also capture this phenomenon: raters had great difficulty not only with exact agreement with the benchmark but also with coming close in their ratings on the ordinal scale.
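The inflation-by-prevalence phenomenon described above can be illustrated directly. The sketch below uses hypothetical ratings (not the study’s data) and computes Gwet’s AC2 alongside, for contrast, a linearly weighted Kappa on a skewed distribution in which most items have no Vista coded; AC2 stays close to the observed agreement while the Kappa-family statistic is pulled down by the dominant "not coded" category.

```python
# Gwet's AC2 vs. linearly weighted kappa on skewed ordinal data, following
# the two-rater formulas in Gwet (2014). Weights: w = 1 - |k - l| / (q - 1).
Q = 4
W = [[1 - abs(k - l) / (Q - 1) for l in range(Q)] for k in range(Q)]

def weighted_agreement(a, b):
    return sum(W[x][y] for x, y in zip(a, b)) / len(a)

def gwet_ac2(a, b):
    n = len(a)
    pa = weighted_agreement(a, b)
    # pi_k: average of the two raters' marginal proportions for category k
    pi = [(a.count(k) + b.count(k)) / (2 * n) for k in range(Q)]
    t_w = sum(sum(row) for row in W)
    pe = t_w / (Q * (Q - 1)) * sum(p * (1 - p) for p in pi)
    return (pa - pe) / (1 - pe)

def weighted_kappa(a, b):
    n = len(a)
    pa = weighted_agreement(a, b)
    pr = [a.count(k) / n for k in range(Q)]   # benchmark marginals
    pc = [b.count(l) / n for l in range(Q)]   # rater marginals
    pe = sum(W[k][l] * pr[k] * pc[l] for k in range(Q) for l in range(Q))
    return (pa - pe) / (1 - pe)

# Skewed prevalence: most items have no Vista coded (category 0).
benchmark = [0] * 17 + [1, 2, 3]
rater     = [0] * 16 + [1, 1, 2, 2]
print(round(gwet_ac2(benchmark, rater), 2),
      round(weighted_kappa(benchmark, rater), 2))
```

On this toy set, AC2 comes out near 0.96 while the weighted Kappa is near 0.81, a gap driven by the skewed marginal prevalence rather than by the pattern of disagreements.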
Another finding that stood out in the fourth row of Table 1 involved the considerable variability among raters in matching the expert/benchmark coding of VF (SD=34.68, the largest among all of the Form Dominance levels across the Aggregate FDSHAC, FDY, FDT, FDV, and FDC’ ordinal variables). Rater 2 was able to apply Viglione’s (2010) method to attain high levels of accuracy, notably even in the most challenging VF category (89% agreement). In the 29-item analyses, which eliminate the common and easy "Vista Not Coded" level (coded 0), Gwet’s AC2 for Rater 2’s scoring accuracy on FDV was good at 0.76. On the other hand, the 29-item analyses for Raters 3 and 4 each yielded a poor Gwet’s AC2 of 0.22. The 29-item analyses also revealed Rater 4’s difficulty coming close to the benchmark’s distinctions among FV, VF, and V (Kendall Tau-b=0.14); Raters 1 and 3 struggled in the same way, albeit to a lesser extent (Kendall Tau-b’s of 0.33 and 0.27, respectively).
The fifth row of Table 1 presents the average accuracy among the four raters with reference to the expert/benchmark coding on the ordinal FDC’ scale. Not only were the mean Gwet’s AC2’s in the 155- and 115-item sets excellent, but so too were the respective Kappas and Kendall Tau-b’s. And in the 29-item analyses, Gwet’s AC2, Kappa, and Kendall Tau-b were all in the good range. The average percent agreement between raters and the expert/benchmark within each of the coding categories was consistently strong (97% for No C’, 81% for FC’, 65% for C’F, and 82% for Formless C’). Even though C’F expectedly had the relatively weakest accuracy among the levels of FDC’, it still represents the strongest Form Secondary accuracy (65%) among all of the Specific FDSHAC variables (44% for FDY, 54% for FDT, 39% for FDV).
Using Viglione’s (2010) guidelines, our four coders’ pattern of scoring accuracy relative to expert/benchmark ratings on ordinal FDSHAC variables was largely consistent with their pattern of interrater reliability as reported in Bram et al. (2023). As with interrater reliability, among the Specific FDSHAC variables our coders exhibited the greatest accuracy in rating the various levels of FDT and FDC’. Accuracy in rating the ordinal FDY and Aggregate FDSHAC variables was comparatively modest, but still more than acceptable. Similar to the findings for interrater reliability, the most challenging ordinal FDSHAC variable for our raters to code accurately was FDV. Also consistent with our interrater reliability findings, among the levels of the Aggregate and Specific FDSHAC variables, Form Secondary (compared to Form Dominant and Formless) was the most difficult to code accurately and the level with the greatest variability among coders.
Comparing Table 1, which displays coding accuracy, with the analogous Table 1 in Bram et al. (2023), which shows interrater reliability, we notice that percent agreement, Kendall Tau-b, Kappa, and Gwet’s AC2 tend to run a few points higher for accuracy than for interrater reliability. This trend is expected theoretically because coefficients calculated between non-expert judges are lowered by error in each judge’s rating, whereas with one judge being an expert in each comparison (e.g., Rater 1 with expert/benchmark, Rater 2 with expert/benchmark), the coefficient of agreement is affected by half as much error (Greg Meyer, personal communication, July 2023).
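This error-halving argument can be checked with a quick simulation. The sketch below is ours and rests on simplifying assumptions not stated in the text: an error-free expert whose codes equal the true scores, and two raters who each miscode 20% of items by one scale point.

```python
import random

random.seed(7)  # fixed seed for reproducibility

def noisy(true_scores, p_err):
    """A simulated rater: with probability p_err, miscodes by one scale point."""
    out = []
    for t in true_scores:
        if random.random() < p_err:
            if t == 0:
                out.append(1)
            elif t == 3:
                out.append(2)
            else:
                out.append(t + random.choice([-1, 1]))
        else:
            out.append(t)
    return out

def agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

# True scores on the 4-point ordinal scale; the expert is assumed error-free.
true_scores = [random.randrange(4) for _ in range(5000)]
r1 = noisy(true_scores, 0.2)
r2 = noisy(true_scores, 0.2)

print(round(agreement(r1, true_scores), 2))  # rater vs. expert/benchmark
print(round(agreement(r1, r2), 2))           # rater vs. rater
```

With these settings, rater-versus-expert agreement hovers near 0.80, while rater-versus-rater agreement falls to roughly two-thirds, because the latter comparison compounds two raters’ errors.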
Even with somewhat higher coefficients for accuracy versus interrater reliability, the conclusions, limitations, and implications drawn in Bram et al. (2023) about the status of using Viglione (2010) to code FDSHAC remain apropos. Given the potential clinical yield of coding FDSHAC to assess level of ego involvement in regulating emotions (Bram & Peebles, 2014; Peebles-Kleiger, 2002; Schafer, 1954), there remains additional work to be done to improve the training of raters to apply Viglione’s FDSHAC coding guidelines more effectively, particularly in accurately distinguishing Form Secondary from both (a) Form Dominant and (b) Formless. As our group of experienced raters had difficulty with this, it is likely that relatively novice Rorschachers would struggle at least as much, if not more. As described in Bram et al. (2023), there are key terms and concepts in the coding decision-tree that were likely unclear in the training materials that we provided to our raters and thus require improved explication. As refinements are made in the training materials and process, future research can overcome a primary limitation of this investigation by studying the accuracy of less experienced coders using Viglione’s (2010) guidelines. Along these lines, it is worth highlighting that for a less experienced group, focusing on coding accuracy relative to expert/benchmark ratings would be more meaningful than focusing on interrater reliability as it would reveal to what extent the coders are accurately internalizing and applying the coding distinctions. As Guarnaccia et al. (2001) remind us, it is conceivable that such raters could have strong agreement with each other, but mostly still miss established benchmarks.
Although fortunately our findings did not realize Guarnaccia et al.’s (2001) cautionary tale that interrater reliability might belie scoring accuracy, their point is still well taken even if largely ignored in Rorschach research and the general psychological assessment literature. Broadly, a case can be made that coding accuracy is even more important than interrater reliability, "as all raters should aspire to the same standard" (Greg Meyer, personal communication, July 2023). At the same time, it bears acknowledgment that in many studies, it is not always possible to identify or access an expert to provide ratings. This does not mean, though, that this procedure should not be considered in study design. In some instances, it might be feasible for investigators to (a) enlist an expert to code a subset of responses/items for comparison with the study’s raters and (b) report accuracy coefficients alongside those for interrater reliability.
We also raise for consideration the possibility that authors making the psychometric case for their instrument or coding approach publish findings specific to scoring accuracy. For example, R-PAS has coding (as well as administration) exams created to establish proficiency for Rorschachers contributing to normative data collection. It is conceivable that findings related to coding accuracy of the many variables could be disseminated through publication, calling attention to coding distinctions that are more versus less challenging and offering ideas to enhance accuracy in coding them.
Finally, based on the convergence between Guarnaccia et al.’s (2001) findings and the first author’s experience as an assessment supervisor and consultant, in clinical practice Rorschach coding accuracy, especially at the response level, may not be as strong as we in the field would like to believe. To mitigate this possibility, one idea is that authors of the major Rorschach systems (Exner, Andronikof, & Fontan, 2022; Meyer et al., 2011) consider designing automated online checks of scoring accuracy. Potentially free or subscription-based (which might be necessary to cover the costs of creating the service), such a service could involve periodically posting new practice protocols for users to code, providing statistical and qualitative feedback on accuracy relative to expert ratings, and directing users to sections in the manual and other training materials that redress domains of coding challenge. Recent, rapid advances in artificial intelligence (AI) hold promise for the development of such a coding service.
We thank the American Psychoanalytic Association’s Fund for Psychoanalytic Research for funding this project. We are also grateful to Dr. Greg Meyer for sharing his ideas about scoring accuracy. Disclosure: Donald J. Viglione is a member of a company that sells the R-PAS manual and associated products. He is also the author and self-publisher of Rorschach Coding Solutions. Correspondence concerning this article should be addressed to Dr. Anthony D. Bram, 329 Massachusetts Ave., #2, Lexington, MA 02420. Email: Anthony_Bram@hms.harvard.edu
Bram, A. D., & Peebles, M. J. (2014). Psychological testing that matters: Creating a road map for effective treatment. Washington, DC: APA Books.
Bram, A. D., & Yalof, J. (2018). Two contemporary Rorschach systems: Views of two experienced Rorschachers on the CS and R-PAS. Journal of Projective Psychology and Mental Health, 25, 35-43.
Bram, A. D., Viglione, D. J., Lee-Parritz, O., Gottschalk, K. A., Yalof, J. A., Kleiger, J. H., Dyette, K., & Khadivi, A. (2023). Rorschach assessment of ego involvement in affect regulation: Interrater reliability of form dominance in shading and achromatic color responses. Psychoanalytic Psychology. Advance online publication.
Burke, L.J. (2011). Interrater reliability and scoring accuracy of the Rorschach Comprehensive System: Comparing students and experts at the variable, response, protocol, and interpretation level. Unpublished dissertation. Department of Professional Psychology, Chestnut Hill College, Philadelphia, PA.
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284-290.
Exner, J. E. (1986). The Rorschach: A Comprehensive System Vol. 1: Basic foundations and principles of interpretation (2nd ed.). New York: Wiley.
Exner, J. E. (2003). The Rorschach: A Comprehensive System Vol. 1: Basic foundations and principles of interpretation (4th ed.). New York: Wiley.
Exner, J.E., Andronikof, A., & Fontan, P. (2022). The Rorschach: A Comprehensive System--Revised administration and coding manual. Fort Mills, SC: Rorschach Workshops.
Exner, J. E., Colligan, S. C., Hillman, L. R., Metts, A. S., Ritzler, B. A., Rogers, K. T., Sciara, A. D., & Viglione, D. J. (2001). A Rorschach workbook for the Comprehensive System (5th ed.). Asheville, NC: Rorschach Workshops.
Guarnaccia, V., Dill, C. A., Sabatino, S., & Southwick, S. (2001). Scoring accuracy using the Comprehensive System for the Rorschach. Journal of Personality Assessment, 77(3), 464–474.
Gwet, K. (2014). Handbook of inter-rater reliability (4th ed.). Gaithersburg, MD: Advanced Analytics Press.
Kleiger, J. H. (1997). Rorschach shading responses: From a printer's error to an integrated psychoanalytic paradigm. Journal of Personality Assessment, 69, 342-364.
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163.
Lewey, J. H., Kivisalu, T. M., & Giromini, L. (2019). Coding with R-PAS: Does prior training with the Exner Comprehensive System impact interrater reliability compared to those examiners with only R-PAS-based training? Journal of Personality Assessment, 101(4), 393–401.
Lundh, A., Kowalski, J., Sundberg, C. J., Gumpert, C., & Landén, M. (2010). Children’s Global Assessment Scale (CGAS) in a naturalistic clinical setting: Inter-rater reliability and comparison with expert ratings. Psychiatry Research, 177(1), 206–210.
Meyer, G. J., Hilsenroth, M. J., Baxter, D., Exner, J. E., Jr., Fowler, J. C., Piers, C. C., & Resnick, J. (2002). An examination of interrater reliability for scoring the Rorschach comprehensive system in eight data sets. Journal of Personality Assessment, 78(2), 219-274.
Meyer, G. J., Viglione, D. J., Mihura, J. L., Erard, R. F., & Erdberg, P. (2011). Rorschach Performance Assessment System: Administration, coding, interpretation, and technical manual. Toledo, OH: Rorschach Performance Assessment System, L.L.C.
Peebles-Kleiger, M. J. (2002). Elaboration of some sequence analysis strategies: Examples and guidelines for level of confidence. Journal of Personality Assessment, 79(1), 19-38.
Piotrowski, C. (2017). Rorschach research through the lens of bibliometric analysis: Mapping investigatory domain. Journal of Projective Psychology and Mental Health, 24, 34-35.
Schafer, R. (1954). Psychoanalytic interpretation of Rorschach testing. New York: Grune & Stratton.
Shaffer, D., Gould, M. S., Brasic, J., Ambrosini, P., Fisher, P., Bird, H., & Aluwahlia, S. (1983). A children's global assessment scale (CGAS). Archives of General Psychiatry, 40(11), 1228–1231.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
Viglione, D. J. (2010). Rorschach coding solutions: A reference guide for the Comprehensive System (2nd ed.). San Diego, CA: Author.
Viglione, D.J., Meyer, G.J., Resende, A.C., & Pignolo, C. (2017). A survey of challenges experienced by new learners coding the Rorschach. Journal of Personality Assessment, 99(3), 315-323.
Weiner, I. B. (1998). Principles of Rorschach interpretation. Mahwah, NJ: Lawrence Erlbaum.