Gabriele Cimolino

Research

AI systems fail at the interface between capability and use. Whether a model produces correct outputs and whether a person using it can accomplish their goal are different questions — and the failures that distinguish them are not visible to capability evaluation. I study those failures: what they are made of, what produces them, and what interventions address them. The work has produced taxonomies of error, design frameworks, and experimental findings. It has also produced new questions, which is how the work proceeds.

Capability evaluation measures the wrong thing

The question that determines whether a deployed AI system works is not whether the model produces correct outputs. It is whether a human using the system can accomplish their goal. The ground truth is human, and capability evaluation cannot reach it.

This is structurally analogous to integration testing in software engineering. Components can pass unit tests individually while the integrated system fails in ways that only appear when the components interact. A coding assistant that generates correct code is not the same as a developer using that assistant producing secure, maintainable software. A diagnostic AI that matches specialist accuracy on a benchmark dataset is not the same as a clinician using that AI improving patient outcomes. The gap between these is measurable, but measuring it requires a human in the loop.

Games are a productive environment for this kind of evaluation. The task structure is known, the AI's behaviour is instrumented, and joint performance can be compared against what a well-calibrated team would produce. This makes failure diagnosable rather than merely observable, which is what distinguishes useful evaluation from mere confirmation that the system is running.

Users understand AI in terms of what they need it to do

Games elicit a wide variety of human data — metacognitive reports about what players believe the AI is doing, but also behaviours that only make sense within a fantasy. Because players are not at risk of real consequences, they act on their beliefs rather than suppressing them. This makes the beliefs visible.

In Playing with Dezgo, players were asked to draw a well-known character without naming it, using a text-to-image AI. The constraint forced them to find other ways to express their intention — revealing what they believed the AI would respond to. When the AI produced output, players attributed successes to their prompting strategy and dismissed failures as errors in execution rather than evidence against their model. Because the intended character was known to analysts, players' behaviour revealed not just what they did but what they believed.

The Automation Confusion grounded theory study produced a taxonomy of twelve categories of misunderstanding, each with a specific structure. Errors of causation involve incorrect attributions of control: users believing they cause actions performed by the AI, or that the AI performs actions they themselves control. Errors of explanation involve users constructing accounts of the AI's behaviour that are too simple, incorrect, or absent. The theory is explanatory, not predictive — it accounts for why different beliefs produce the same observable behaviour, which is what makes it useful for diagnosis rather than just description.

Both studies produced detailed qualitative data — think-aloud protocols, interviews, observed behaviour — that revealed not just whether misunderstanding was present but what form it took, in which situations, and why. A confirmatory study with hundreds of participants would have told us whether confusion was present. It would not have told us what confusion is made of, or how to fix it.

Automation Confusion: A Grounded Theory of Non-Gamers' Confusion in Partially Automated Action Games — Cimolino, Chen, Gutwin, Graham. CHI 2023.

Playing with Dezgo: Adapting Human-AI Interaction to the Context of Play — Villareale, Cimolino, Gomme. Foundations of Digital Games 2023.

Understanding failure produces design insight

In studies of AI-assisted gaming with spinal cord injury rehabilitation patients, players exhibited a specific pattern of behaviour: when the AI did something they didn't want, they changed how they provided input — pressing buttons harder, changing timing, or performing actions that provided no input at all — believing these would change what the AI did. The behaviour was persistent and varied. It was also a causation error: an incorrect belief about influence over the AI's output.

Awareness cues were designed in response. If players knew what the AI intended to do next, they could decide whether to let it proceed or override it — rather than attempting to influence it through their manner of input. The cues addressed the specific uncertainty that was producing the behaviour: not what the AI had done, but what it was about to do.

The cues helped but did not eliminate the behaviour. Players still pressed buttons that did nothing. The Automation Confusion taxonomy explains why: awareness cues addressed the informational uncertainty — players now knew what the AI intended — but the belief that manner of input matters is a causation error with its own structure, and knowing the intention does not dissolve it. This is what a theory of failure is for: not just identifying that something went wrong, but identifying what an intervention can and cannot reach.

Two Heads Are Better Than One applies the same move at the level of system design. Based on a survey of 55 human-AI systems across six domains, it identifies four dimensions along which cooperative systems vary: AI Role captures how tasks are partitioned and allocated between participants; Supervision describes whether actors can intervene to correct each other and in which direction; Influence captures whether actors attend to and respond to each other's actions; Mediation describes how conflicting commands are unified — by combination, where both contribute to the shared control signal, or by selection, where one takes precedence. The dimension space enables designers to classify systems, identify design patterns, and transfer solutions across domains that otherwise share no common vocabulary. It is cited by survey papers and design frameworks.
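The four dimensions can be read as a classification scheme for comparing systems across domains. A minimal sketch of that reading, in which the example system and its classification are illustrative assumptions rather than entries from the paper's survey:

```python
from dataclasses import dataclass
from enum import Enum


class Mediation(Enum):
    """How conflicting commands are unified into one control signal."""
    COMBINATION = "combination"  # both actors contribute to the signal
    SELECTION = "selection"      # one actor's command takes precedence


@dataclass(frozen=True)
class SharedControlSystem:
    """One human-AI system described along the four dimensions."""
    name: str
    ai_role: str      # how tasks are partitioned and allocated
    supervision: str  # whether actors can intervene, and in which direction
    influence: str    # whether actors attend and respond to each other
    mediation: Mediation


# Hypothetical classification of a driving assistant, for illustration only:
lane_keeper = SharedControlSystem(
    name="lane-keeping assistant",
    ai_role="AI handles steering corrections; human handles everything else",
    supervision="human can override the AI at any time",
    influence="AI responds to the human's steering input",
    mediation=Mediation.COMBINATION,
)

print(lane_keeper.mediation.value)  # combination
```

Two systems from unrelated domains that land on the same values along these dimensions are candidates for transferring design solutions between them.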

Two Heads Are Better Than One: A Dimension Space for Unifying Human and Artificial Intelligence in Shared Control — Cimolino, Graham. CHI 2022. 33 citations.

When AI capabilities meet human needs

Ninja Showdown presents Rock, Paper, Scissors as a fighting game. Players share control with an AI partner that controls two of the three available attacks. Because the game's dynamics are deterministic, the correct trust decision at each step — defer to the AI or override it — is knowable from the AI's stated intention. This allowed each player to be treated as a binary trust classifier and trust appropriateness to be operationalized as Matthews Correlation Coefficient: a measure that captures not just how often players relied on the AI but whether they relied on it correctly.
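The operationalization can be sketched directly. Treating each trust decision as a binary classification, MCC is computed from the resulting confusion matrix; the counts below are illustrative, not study data:

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient from a 2x2 confusion matrix.

    Treating the player as a binary trust classifier:
      tp = deferred when the AI's stated intention was correct
      tn = overrode when the AI's stated intention was wrong
      fp = deferred when the AI's stated intention was wrong
      fn = overrode when the AI's stated intention was correct
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative counts: a player who mostly defers correctly and mostly
# overrides correctly scores well above chance.
print(round(mcc(tp=8, fp=3, fn=2, tn=7), 3))  # 0.503

# Relying on the AI at the same overall rate, but at random, yields
# MCC = 0 — which is why the measure captures appropriateness of trust
# rather than frequency of trust.
print(round(mcc(tp=5, fp=5, fn=5, tn=5), 3))  # 0.0
```

A perfectly calibrated player scores 1.0; a player who always defers or always overrides scores 0, since one row of the confusion matrix is empty.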

Showing players what the AI intended to do next improved trust appropriateness without changing trust frequency. Players deferred when the intention was correct and overrode it when it wasn't. The uncertainty that intention cues reduced was in what the AI would do — information that, without cues, arrived too late to act on and made the trust decision effectively a coin toss.

Impact of Awareness Cues on Trust in Human-AI Shared Control — Cimolino, Gutwin, Graham. TRAIT Workshop at CHI 2022.

Partial automation was studied with six participants with spinal cord injuries, implemented in two games. All six could play both games. The ground truth for whether this worked is not a performance score — it is whether players could do something they could not do before, and what that meant to them. Participants described the experience as autonomous play rather than passive participation: competition, accomplishment, and access to feelings of competence that injury had taken away. One said that games like these could show patients that "there's still lots to do in life."

In both cases the ground truth is human. In both cases the integration test has a result.

The Role of Partial Automation in Increasing the Accessibility of Digital Games — Cimolino, Askari, Graham. PACMHCI (CHI PLAY), 2021. 23 citations.

Beyond Fun: Players' Experiences of Accessible Rehabilitation Gaming for Spinal Cord Injury — Cimolino, Askari, Graham. ASSETS 2021.