When LLMs Fall Short in Deductive Coding: Model Comparisons and Human–AI Collaboration Workflow Design
/ Authors
/ Abstract
As generative artificial intelligence drives the growth of dialogic data in education, automated coding has become a promising way for learning analytics to improve efficiency. This surge highlights the need to understand the nuances of student–AI interactions, especially those that are rare yet crucial. However, automated coding may struggle to capture these rare codes because of imbalanced data, while human coding remains time-consuming and labour-intensive. The current study examined the potential of large language models (LLMs) to approximate or replace humans in deductive, theory-driven coding, and explored how human–AI collaboration might support such coding tasks at scale. We compared the coding performance of small transformer classifiers (e.g., BERT) and LLMs on two datasets, with particular attention to imbalanced head–tail distributions in dialogue codes. Our results showed that LLMs did not outperform BERT-based models and exhibited systematic errors and biases in deductive coding tasks. We then designed and evaluated a human–AI collaborative workflow that improved coding efficiency while maintaining coding reliability. Our findings reveal both the limitations of LLMs, especially their difficulties with semantic similarity and theoretical interpretation, and the indispensable role of human judgment, while demonstrating the practical promise of human–AI collaborative workflows for coding.
Journal: Proceedings of the LAK26: 16th International Learning Analytics and Knowledge Conference