Date          Name
Sep 30, 2020  CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
Apr 18, 2017  A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
May 24, 2019  Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark
Jul 25, 2017  The RepEval 2017 Shared Task: Multi-Genre Natural Language Inference with Sentence Representations
Apr 17, 2018  ListOps: A Diagnostic Dataset for Latent Tree Learning
Jun 1, 2021   What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?
Dec 16, 2021  QuALITY: Question Answering with Long Input Texts, Yes!
Apr 11, 2022  Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions
Sep 18, 2025  PILOT: Steering Synthetic Data Generation with Psychological & Linguistic Output Targeting
Mar 12, 2022  What Makes Reading Comprehension Questions Difficult?
Jan 18, 2023  Discrete Latent Structure in Neural Networks
May 2, 2019   SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Oct 15, 2021  BBQ: A Hand-Built Bias Benchmark for Question Answering
Jun 9, 2022   Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Aug 26, 2022  What Do NLP Researchers Believe? Results of the NLP Community Metasurvey
Oct 19, 2022  Two-Turn Debate Doesn't Help Humans Answer Hard Reading Comprehension Questions
Jul 1, 2019   Natural Language Understanding with the Quora Question Pairs Dataset
Apr 15, 2021  Does Putting a Linguist in the Loop Improve NLU Data Collection?