Date / Name
May 21, 2024 / A Workbench for Autograding Retrieve/Generate Systems
Dec 22, 2024 / LLM-based relevance assessment still can't replace human relevance assessment
Apr 27, 2025 / LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Jan 19, 2026 / Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?
Oct 18, 2023 / Retrieve-Cluster-Summarize: An Alternative to End-to-End Training for Query-specific Article Generation
Apr 13, 2023 / Perspectives on Large Language Models for Relevance Judgment
May 1, 2018 / On the Equivalence of Generative and Discriminative Formulations of the Sequential Dependence Model
Jan 21, 2026 / Supporting Humans in Evaluating AI Summaries of Legal Depositions
Feb 1, 2024 / An Exam-based Evaluation Approach Beyond Traditional Relevance Judgments
Jan 19, 2026 / Incorporating Q&A Nuggets into Retrieval-Augmented Generation
Apr 18, 2019 / Knowledge-rich Image Gist Understanding Beyond Literal Meaning
Oct 17, 2024 / Best in Tau@LLMJudge: Criteria-Based Relevance Evaluation with Llama3
Sep 8, 2025 / UNH at CheckThat! 2025: Fine-tuning Vs Prompting in Claim Extraction
Sep 30, 2025 / Auto-ARGUE: LLM-Based Report Generation Evaluation
Dec 20, 2019 / Report on the First HIPstIR Workshop on the Future of Information Retrieval
Dec 21, 2023 / Fine-grained Forecasting Models Via Gaussian Process Blurring Effect
May 13, 2017 / Benchmark for Complex Answer Retrieval
Jul 13, 2025 / Criteria-Based LLM Relevance Judgments
Sep 23, 2018 / Understanding the Gist of Images - Ranking of Concepts for Multimedia Indexing
Jul 13, 2025 / Does UMBRELA Work on Other LLMs?