Multi-Hop Question Answering: When Can Humans Help and Where Do They Struggle?
Authors
Abstract
In many settings, language models struggle with multi-hop question answering. Ideally, humans could help, but how? On which reasoning subtasks would humans do well? To answer this, we recruited 40 untrained crowdworkers to perform the subtasks that make up a multi-hop question answering pipeline: recognizing a question as complex; breaking down a complex question; retrieving answers to simple questions; and assembling simple facts to answer complex questions. Our tasks were based on the challenging 2WikiMultiHopQA benchmark, on which human accuracy was 80.2% on direct complex question answering and 84.1% on simple question answering. We found that participants struggled most with recognizing that a question might be complex (67% accuracy) but performed better at question decomposition (78.2%) and answer integration (97.3%). This suggests that if a system knew it was struggling with a multi-hop question, it could ask a human to break it down into simpler questions and to integrate the resulting answers.
Published in: Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems
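The pipeline breakdown in the abstract maps naturally onto code. Below is a minimal Python sketch of such a human-in-the-loop design, assuming hypothetical callables for each subtask (none of these names come from the paper): a model handles simple retrieval, while decomposition and integration, the steps where the study found humans most reliable, could be delegated to a person when the model detects a complex question.

```python
"""Illustrative sketch (not from the paper): a human-in-the-loop multi-hop
QA pipeline mirroring the four studied subtasks. All names are hypothetical."""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MultiHopPipeline:
    # Subtask 1: recognize whether a question is complex (hardest for humans, 67%).
    is_complex: Callable[[str], bool]
    # Subtask 2: break a complex question into simpler ones (humans: 78.2%).
    decompose: Callable[[str], List[str]]
    # Subtask 3: retrieve the answer to a simple question (humans: 84.1%).
    answer_simple: Callable[[str], str]
    # Subtask 4: assemble simple facts into a final answer (humans: 97.3%).
    integrate: Callable[[str, List[str]], str]

    def answer(self, question: str) -> str:
        if not self.is_complex(question):
            return self.answer_simple(question)
        sub_questions = self.decompose(question)
        facts = [self.answer_simple(q) for q in sub_questions]
        return self.integrate(question, facts)


# Toy stubs standing in for a model and for human assistance; a real system
# would route decompose/integrate to a person when the model is struggling.
pipeline = MultiHopPipeline(
    is_complex=lambda q: " the director of " in q,
    decompose=lambda q: ["Who directed the film?", "Where was that person born?"],
    answer_simple=lambda q: {"Who directed the film?": "Jane Doe",
                             "Where was that person born?": "Oslo"}.get(q, "unknown"),
    integrate=lambda q, facts: facts[-1],
)
print(pipeline.answer("Where was the director of the film born?"))  # -> Oslo
```

The design choice reflected here is that complexity detection stays with the system (humans were weakest at it), while the decomposition and integration callables are the natural hand-off points for human assistance.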