Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang, Diyi Yang, Sanmi Koyejo
AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.
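A note on how such a code-survival rate could be computed: the sketch below assumes each session records line-level authorship attribution and the set of lines present in the user's eventual commit. The field names (lines_written, committed_line_ids, author, id) are hypothetical illustrations, not the actual SWE-chat schema.

```python
def agent_code_survival(sessions):
    """Fraction of agent-authored lines that survive into the user's commits.

    `sessions` is assumed to be an iterable of dicts with hypothetical fields:
    - "lines_written": list of {"id": str, "author": "agent" | "human"}
    - "committed_line_ids": ids of lines present in the final user commit
    """
    written = survived = 0
    for session in sessions:
        agent_lines = [l for l in session["lines_written"] if l["author"] == "agent"]
        committed = set(session["committed_line_ids"])
        written += len(agent_lines)
        survived += sum(l["id"] in committed for l in agent_lines)
    return survived / written if written else 0.0
```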
Richard B. Arthur
High-consequence decision making demands peak performance from individuals in positions of responsibility. Such executive authority bears the obligation to act despite uncertainty, limited resources, time constraints, and accountability risks. Tools and strategies to motivate confidence and foster risk tolerance must confront informational noise and can provide qualified accountability. Machine intelligence augments human cognition and perception to improve situational awareness, decision framing, flexibility, and coherence through agentic stewardship of contextual metadata. We examine systemic and behavioral factors crucial to address in scenarios encumbered by complexity, uncertainty, and urgency.
Jiayi Hao, Chengzhi Zhang
Research methods constitute an indispensable tool for scholars engaged in scientific inquiry. Investigating how scholars use research methods throughout their careers can reveal distinct patterns in method adoption, providing valuable insights for novice researchers in selecting appropriate methods. This study employs a comprehensive dataset comprising full-text journal articles and bibliographic records from the Library and Information Science (LIS) domain. Using an automated classification model based on full-text cognitive analysis, the research methods employed by LIS scholars are systematically identified. Topic modeling is then conducted using Top2Vec. Subsequently, author name disambiguation is performed, and academic age is calculated for each scholar. This study focuses on 435 senior scholars with an academic age of more than 14 years and a consistent publication record at five-year intervals, covering a total of 6,116 articles. The corpus covers 16 research method categories and 20 research topics. The findings indicate that bibliometric methods are the most frequently used across career stages, accounting for 19.61% of method use among early-career scholars and 31.81% among senior scholars. Over the course of a scholarly career, the diversity of research methods initially increases and then declines. Furthermore, scholars exhibit a propensity for combining multiple research methods, including both conventional and unconventional pairings. Notably, the research methods most commonly used by researchers change with age and seniority.
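As a rough illustration of the career-stage analysis, the sketch below computes academic age from an author's first publication year and measures research-method diversity (Shannon entropy) within age bands. The column names, age bins, and the entropy measure are assumptions for illustration, not necessarily the authors' exact operationalization.

```python
import numpy as np
import pandas as pd

def method_diversity_by_age(papers: pd.DataFrame) -> pd.Series:
    """`papers` is assumed to have columns: author_id, year, method (one of the 16 categories)."""
    first_year = papers.groupby("author_id")["year"].transform("min")
    papers = papers.assign(academic_age=papers["year"] - first_year)

    def shannon(methods: pd.Series) -> float:
        p = methods.value_counts(normalize=True)
        return float(-(p * np.log(p)).sum())

    age_band = pd.cut(papers["academic_age"], bins=[-1, 4, 9, 14, 19, 60])
    return papers.groupby(age_band, observed=True)["method"].apply(shannon)
```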
Jingyi Zheng, Tianyi Hu, Yule Liu, Zhen Sun, Zongmin Zhang, Zifan Peng, Wenhan Dong, Xinlei He
Current benchmarks for evaluating large language models (LLMs) in social media moderation completely overlook a serious threat: covert advertisements, which disguise themselves as regular posts to deceive and mislead consumers into making purchases, leading to significant ethical and legal concerns. In this paper, we present CHASM, a first-of-its-kind dataset designed to evaluate the capability of Multimodal Large Language Models (MLLMs) in detecting covert advertisements on social media. CHASM is a high-quality, anonymized, manually curated dataset consisting of 4,992 instances, based on real-world scenarios from the Chinese social media platform Rednote. The dataset was collected and annotated under strict privacy protection and quality control protocols. It includes many product experience sharing posts that closely resemble covert advertisements, making the dataset particularly challenging. The results show that under both zero-shot and in-context learning settings, none of the current MLLMs are sufficiently reliable for detecting covert advertisements. Our further experiments reveal that fine-tuning open-source MLLMs on our dataset yields noticeable performance gains. However, significant challenges persist, such as detecting subtle cues in comments and differences in visual and textual structures. We provide in-depth error analysis and outline future research directions. We hope our study can serve as a call for the research community and platform moderators to develop more precise defenses against this emerging threat.
Dmitry Zaytsev, Valentina Kuskova, Michael Coppedge
Failures in complex systems often emerge through gradual degradation and the propagation of stress across interacting components rather than through isolated shocks. Democratic systems exhibit similar dynamics, where weakening institutions can trigger cascading deterioration in related institutional structures. Traditional reliability and survival models typically estimate failure risk based on the current system state but do not explicitly capture how degradation propagates through institutional networks over time. This paper introduces a trajectory-aware reliability modeling framework based on Dynamic Causal Neural Autoregression (DCNAR). The framework first estimates a causal interaction structure among institutional indicators and then models their joint temporal evolution to generate forward trajectories of system states. Failure risk is defined as the probability that predicted trajectories cross predefined degradation thresholds within a fixed horizon. Using longitudinal institutional indicators, we compare DCNAR-based trajectory risk models with discrete-time hazard and Cox proportional hazards models. Results show that trajectory-aware modeling consistently outperforms Cox models and improves risk prediction for several propagation-driven institutional failures. These findings highlight the importance of modeling dynamic system interactions for reliability analysis and early detection of systemic degradation.
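The failure-risk definition, namely the probability that forward trajectories cross a degradation threshold within a fixed horizon, can be sketched as a Monte Carlo estimate. The dynamics function below is a toy autoregressive stand-in for the fitted DCNAR model, which is not reproduced here.

```python
import numpy as np

def trajectory_failure_risk(step_fn, history, horizon, threshold, n_sims=1000, seed=0):
    """Estimate failure risk as the fraction of simulated trajectories whose
    institutional indicators fall below a degradation threshold within `horizon` steps.

    `step_fn(history, rng)` samples the next state vector; it stands in for the
    fitted DCNAR dynamics, which are not shown here.
    """
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(n_sims):
        states = list(history)
        for _ in range(horizon):
            nxt = step_fn(np.asarray(states), rng)
            states.append(nxt)
            if np.any(nxt < threshold):
                failures += 1
                break
    return failures / n_sims

# Toy AR(1) dynamics with slow decay, purely for illustration.
def toy_step(history, rng):
    return 0.98 * history[-1] - 0.002 + rng.normal(0.0, 0.02, size=history.shape[1])

risk = trajectory_failure_risk(toy_step, np.full((5, 3), 0.8), horizon=20, threshold=0.5)
print(f"estimated 20-step failure risk: {risk:.2f}")
```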
Arka Majhi
Recent discoveries in VR have opened up scope for designing physical tools and controllers to enhance immersion through perceived reality. In a virtually simulated sports scenario it is challenging to immerse the user, because most available controllers cannot bridge the user's experience in the real world to their actions in the virtual world. My research aims to identify HCI problems in existing VR controllers, design a physical controller prototype with realistic tangible mapping that addresses these problems, and evaluate it in a VR game designed for skating. Its immersiveness will be graded on a Likert scale on parameters such as perceived interactivity and reality, spatial presence, and enjoyment. The evaluation will be done after trial runs and feedback sessions, playing the game with the designed controller and comparing it with controllers available on the market. The findings will help identify which parameters to consider when designing future controllers customized for a particular sport.
Rong Zhao, Wanqi Liu, Zhizhou Sha, Nanxi Su, Yecheng Zhang
Large language models (LLMs) are increasingly used to describe, evaluate and interpret places, yet it remains unclear whether they do so from a culturally neutral standpoint. Here we test urban perception in frontier LLMs using a balanced global street-view sample and prompts that either remain neutral or invoke different regional cultural standpoints. Across open-ended descriptions and structured place judgments, the neutral condition proved not to be neutral in practice. Prompts associated with Europe and Northern America remained systematically closer to the baseline than many non-Western prompts, indicating that model perception is organized around a culturally uneven reference frame rather than a universal one. Cultural prompting also shifted affective evaluation, producing sentiment-based ingroup preference for some prompted identities. Comparisons with regional human text-image benchmarks showed that culturally proximate prompting could improve alignment with human descriptions, but it did not recover human levels of semantic diversity and often preserved an affectively elevated style. The same asymmetry reappeared in structured judgments of safety, beauty, wealth, liveliness, boredom and depression, where model outputs were interpretable but only partly reproduced human group differences. These findings suggest that LLMs do not simply perceive cities from nowhere: they do so through a culturally uneven baseline that shapes what appears ordinary, familiar and positively valued.
Vibhor Agarwal, Ke Zhou, Edyta Paulina Bogucka, Daniele Quercia
AI companion chatbots increasingly shape how people seek social and emotional connection, sometimes substituting for relationships with romantic partners, friends, teachers, or even therapists. When these systems adopt those metaphorical roles, they are not neutral: such roles structure people's ways of interacting, distribute perceived AI harms and benefits, and may reflect behavioral addiction signs. Yet these role-dependent risks remain poorly understood. We analyze 248,830 posts from seven prominent Reddit communities describing interactions with AI companions. We identify ten recurring metaphorical roles (for example, soulmate, philosopher, and coach) and show that each role supports distinct ways of interacting. We then extract the perceived AI harms and AI benefits associated with these role-specific interactions and link them to behavioral addiction signs, all of which are inferred from the text of the posts. AI soulmate companions are associated with romance-centered ways of interacting, offering emotional support but also introducing emotional manipulation and distress, culminating in strong attachment. In contrast, AI coach and guardian companions are associated with practical benefits such as personal growth and task support, yet are nonetheless more frequently associated with behavioral addiction signs such as daily life disruptions and damage to offline relationships. These findings show that metaphorical roles are a central ethical design concern for responsible AI companions.
Huy Nghiem, Phuong-Anh Nguyen-Le, Sy-Tuyen Ho, Hal Daume
Research has documented LLMs' name-based bias in hiring and salary recommendations. In this paper, we instead consider a setting where LLMs generate candidate summaries for downstream assessment. In a large-scale controlled study, we analyze nearly one million resume summaries produced by four models under systematic race-gender name perturbations, using synthetic resumes and real-world job postings. By decomposing each summary into resume-grounded factual content and evaluative framing, we find that factual content remains largely stable, while evaluative language exhibits subtle name-conditioned variation concentrated in the extremes of the distribution, especially in open-source models. Our hiring simulation demonstrates how evaluative summaries transform directional harm into symmetric instability that might evade conventional fairness audits, highlighting a potential pathway for LLM-to-LLM automation bias.
Sergey V. Samsonau, Matthew Pearce
Guiding others through authentic scientific research outside of PhD programs has been practiced for decades in specialized secondary schools, undergraduate research programs, and independent settings. These practitioners work in the middle, between the classroom science teacher and the PhD advisor, guiding learners with aptitude or serious interest. Sport and music have dedicated professions for this middle position (the school-team coach and the school band director); research does not. This paper names that missing profession the Research Guide: the practitioner who develops another person's capacity to do research, from framing a question to communicating findings. Hundreds of thousands of middle and high school students already pursue authentic research each year, even more college undergraduates participate in research with a faculty member, and millions of adults engage in citizen science. In current practice, the programs that serve this middle group mostly default to a simplified version of the PhD apprenticeship model structured around one mentor with a few students at a time, without systematic training; they overwhelmingly frame research as the hypothetico-deductive cycle alone. The role calls for cognitive apprenticeship, a pedagogical approach in which an expert's tacit moves on open-ended problems are made visible and scaffolded, then faded as the learner develops, while the research outcomes themselves remain unpredictable. It spans multiple modes of inquiry (not only the hypothetico-deductive cycle) and demands a combination that no existing training program produces: pedagogy, research methodology, developmental assessment, risk and productive struggle management, domain flexibility, and community building. Together these demands warrant a dedicated profession: a named role, a training pathway, a career ladder, hiring standards, and institutional recognition.
Shilei Luo, Zhiqi Zhang, Hengchen Dai, Dennis Zhang
AI agents powered by large language models are increasingly acting on behalf of humans in social and economic environments. Prior research has focused on their task performance and effects on human outcomes, but less is known about the relationship between agents and the specific individuals who deploy them. We ask whether agents systematically reflect the behavioral characteristics of their human owners, functioning as behavioral extensions rather than producing generic outputs. We study this question using 10,659 matched human-agent pairs from Moltbook, a social media platform where each autonomous agent is publicly linked to its owner's Twitter/X account. By comparing agents' posts on Moltbook with their owners' Twitter/X activity across features spanning topics, values, affect, and linguistic style, we find systematic transfer between agents and their specific owners. This transfer persists among agents without explicit configuration, and pairs that align on one behavioral dimension tend to align on others. These patterns are consistent with transfer emerging through accumulated interaction between owners (or owners' computer environments) and their agents in everyday use. We further show that agents with stronger behavioral transfer are more likely to disclose owner-related personal information in public discourse, suggesting that the same owner-specific context that drives behavioral transfer may also create privacy risk during ordinary use. Taken together, our results indicate that AI agents do not simply generate content, but reflect owner-related context in ways that can propagate human behavioral heterogeneity into digital environments, with implications for privacy, platform design, and the governance of agentic systems.
Sophia Abraham, Taiye Chen, Cyril Chhun, Giovanna Jaramillo-Gutierrez, Simon Mylius, Sayash Raaj, Peter Slattery, Sean McGregor
Artificial intelligence systems are now deployed at scale across sectors, accompanied by a growing number of real-world incidents ranging from misinformation and cybercrime to autonomous-system failures. Databases of AI incidents index these events, but they cannot measure "risk" (i.e., a joint measure of likelihood and severity) without additional data regarding the prevalence of risk-associated systems and their incident reporting rates. As a result, policymakers, companies, and the general public lack a means to weigh the benefits of AI against their in-context risks. Inspired by public-health processes, which presume noisy and incomplete disease surveillance, we identify six phases of incident emergence. We demonstrate the framework through a detailed case study of autonomous vehicles, whose mandatory reporting requirements produce reliable incident-rate ground truth expressed in distance traveled. The case study shows that an informed panel of domain experts (e.g., self-driving experts) can combine their domain expertise, incident data, and a collection of statistical and visualization tools to arrive at incident phase determinations serving public needs. We further demonstrate the approach with a deepfake incident case study and chart a path for future research in incident phase determination.
Sheikh Junaid Fayaz, Nestor D. Montiel-Bohorquez, Wilson Ricardo Leal da Silva, Shashank Bishnoi, Matteo Romano, Manuele Gatti, N. M. Anoop Krishnan
Cement production is among the largest contributors to industrial air pollution, emitting ~3 Mt NOx/year. The industry-standard mitigation approach, selective non-catalytic reduction (SNCR), exhibits low NH3 utilization efficiency, resulting in operational inefficiencies and increased reagent costs. Here, we develop a data-driven framework for emission control using large-scale operational data from four cement plants worldwide. Benchmarking nine machine learning architectures, we observe that prediction error varies ~3-5x across plants due to variation in data richness. Incorporating short-term process history nearly triples NOx prediction accuracy, revealing that NOx formation carries substantial process memory, a timescale dependence that is absent in CO and CO2. Further, we develop models that forecast NOx overshoots up to nine minutes in advance, providing a buffer for operational adjustments. The developed framework controls NOx formation at the source, reducing NH3 consumption in downstream SNCR. Surrogate model projections estimate a ~34-64% reduction in NOx while preserving clinker quality, corresponding to a reduction of ~290 t NOx/year and ~58,000 USD/year in NH3 savings. This work establishes a generalizable framework for data-driven emission control, offering a pathway toward low-emission operation without structural modifications or additional hardware, with potential applicability to other hard-to-abate industries such as steel, glass, and lime.
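The role of short-term process history can be illustrated with a minimal sketch: lagged values of process tags are appended as extra features before fitting a regressor and scoring it on a chronological hold-out. The column names, lag choices, and model class are illustrative assumptions, not the plants' actual tags or the nine benchmarked architectures.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def add_lag_features(df, cols, lags=(1, 3, 5, 9)):
    """Append short-term process history (lagged minute-level values) as extra features."""
    out = df.copy()
    for col in cols:
        for k in lags:
            out[f"{col}_lag{k}"] = out[col].shift(k)
    return out.dropna()

def train_nox_model(df, feature_cols, target="nox", test_frac=0.2):
    """`df` is assumed to hold minute-level process tags (e.g. kiln temperature,
    feed rate) plus a measured `nox` column; the names here are placeholders."""
    df = add_lag_features(df, feature_cols)
    lagged = [c for c in df.columns if "_lag" in c]
    X, y = df[feature_cols + lagged], df[target]
    split = int(len(df) * (1 - test_frac))  # chronological split, no shuffling
    model = GradientBoostingRegressor().fit(X.iloc[:split], y.iloc[:split])
    mae = mean_absolute_error(y.iloc[split:], model.predict(X.iloc[split:]))
    return model, mae
```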
Maurice Chiodo, Toni Erskine, Dennis Müller, James G. Wright
We analyse the 2025 Signalgate leak of sensitive US military information by the Trump administration, addressing why confidentiality was violated (messages leaked to the press) in spite of encryption (Signal), to deepen the socio-technical considerations when designing and deploying encryption. First, we use applied pi-calculus to formally model the boutique secure facility setup requested by the US Defence Secretary, to prove that a leak would not be prevented. We then examine how using a secure channel might still not give overall information security, as, in this case, power imbalances between personnel and officials led to the application of cryptography that compromised their operational security. We look at how cryptographic tools may have instilled a false sense of security, and led officials to "overshare". We then apply this analysis to the Trump administration's general desire to burn through political, legal, and now technical process, and demonstrate geopolitical harms that may arise from such ineffective use of cryptography in a brief use case. We conclude that, even with advancements in usability of cryptographic tools, genuine message security is still out of reach of the "average user".
Segun Aroyehun, Stephan Lewandowsky, David Garcia
The pursuit of truth is central to democratic deliberation and governance, yet political discourse reflects varying epistemic orientations, ranging from evidence-based reasoning grounded in verifiable information to intuition-based reasoning rooted in beliefs and subjective interpretation. We introduce a scalable approach to measure epistemic orientation using the Evidence-Minus-Intuition (EMI) score, derived from large language model (LLM) ratings and embedding-based semantic similarity. Applying this approach to 15 million parliamentary speech segments spanning 1946 to 2025 across seven countries, we examine temporal patterns in discourse and its association with deliberative democracy and governance. We find that EMI is positively associated with deliberative democracy within countries over time, with consistent relationships in both contemporaneous and lagged analyses. EMI is also positively associated with the transparency and predictable implementation of laws as a dimension of governance. These findings suggest that the epistemic nature of political discourse is crucial for both the quality of democracy and governance.
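One of the two EMI ingredients, embedding-based semantic similarity, admits a compact sketch: each speech segment is scored by its similarity to evidence-oriented versus intuition-oriented anchor statements, and the difference gives the score. The anchor texts and embedding model are illustrative assumptions; the paper's LLM-rating component is not shown.

```python
from sentence_transformers import SentenceTransformer

# Illustrative anchor statements; the actual seed items used in the paper may differ.
EVIDENCE_ANCHORS = [
    "According to the official statistics, the figures show a clear trend.",
    "The independent audit provides verifiable data on this question.",
]
INTUITION_ANCHORS = [
    "I simply feel in my heart that this is the right course.",
    "Everyone knows deep down what the truth of the matter is.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def emi_score(texts):
    """Evidence-Minus-Intuition: similarity to evidence anchors minus similarity
    to intuition anchors (cosine, via normalized embeddings)."""
    emb = model.encode(texts, normalize_embeddings=True)
    evidence = model.encode(EVIDENCE_ANCHORS, normalize_embeddings=True).mean(axis=0)
    intuition = model.encode(INTUITION_ANCHORS, normalize_embeddings=True).mean(axis=0)
    return emb @ evidence - emb @ intuition  # higher = more evidence-based framing

print(emi_score(["The committee reviewed the audited figures before voting."]))
```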
Bobo Li, Rui Wu, Zibo Ji, Meishan Zhang, Hao Fei, Min Zhang, Mong-Li Lee, Wynne Hsu
Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.
Daniel W. Hook
The debate about scholarly knowledge infrastructure has long been framed as a contest between openness and commercial enclosure. This framing distorts both policy and practice. The real tension lies between the persistent cost of producing and refining structured metadata under deep technological friction, and the differentiated demands distinct communities place on data quality, focus and granularity. We introduce the innovation annulus: the zone between freely available structured data and the advancing frontier of commercially refined knowledge products. This zone is a permanent, functional feature of the ecosystem, not a pathology to eliminate. By analogy with the efficient market hypothesis, its width measures production inefficiency, set by the interplay of friction and demand. Artificial intelligence reshapes the annulus, lowering barriers to basic structuring, raising the threshold at which refinement adds value, and introducing systemic risks through unprovenanced AI-derived metadata. CRediT contributions, funding acknowledgements and AI disclosure statements illustrate the annulus lifecycle. Governance should calibrate the annulus, not abolish it: thin enough to serve research efficiently, wide enough to sustain innovation. A formal welfare framework, analogous to the Nordhaus optimal patent life, characterises the trade-offs and yields testable predictions. The Barcelona Declaration offers a promising forum for boundary governance.
Michael Achmann-Denkler, Mario Haim, Christian Wolff
This paper presents a computational case study that evaluates the capabilities of specialized machine learning models and emerging multimodal large language models for Visual Political Communication (VPC) analysis. Focusing on concentrated visibility in Instagram stories and posts during the 2021 German federal election campaign, we compare the performance of traditional computer vision models (FaceNet512, RetinaFace, Google Cloud Vision) with a multimodal large language model (GPT-4o) in identifying front-runner politicians and counting individuals in images. GPT-4o outperformed the other models, achieving a macro F1-score of 0.89 for face recognition and 0.86 for person counting in stories. These findings demonstrate the potential of advanced AI systems to scale and refine visual content analysis in political communication while highlighting methodological considerations for future research.
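The reported macro F1 scores reduce to a standard multi-class evaluation over per-image labels; a minimal sketch, with made-up labels standing in for the annotated front-runner identities:

```python
from sklearn.metrics import f1_score

# Made-up per-image labels ("none" = no front-runner visible); purely illustrative.
y_true = ["scholz", "baerbock", "none", "laschet", "scholz"]
y_pred = ["scholz", "baerbock", "none", "laschet", "laschet"]

print(f"macro F1: {f1_score(y_true, y_pred, average='macro'):.2f}")
```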
Kelly McConvey, Dipto Das, Maya Ghai, Angelina Zhai, Rosa Lee, Shion Guha
Fairness audits of institutional risk models are critical for understanding how deployed machine learning pipelines allocate resources. Drawing on multi-year collaboration with Centennial College, where our prior ethnographic work introduced the ASP-HEI Cycle, we present a replica-based audit of a deployed Early Warning System (EWS), replicating its model using institutional training data and design specifications. We evaluate disparities by gender, age, and residency status across the full pipeline (training data, model predictions, and post-processing) using standard fairness metrics. Our audit reveals systematic misallocation: younger, male, and international students are disproportionately flagged for support, even when many ultimately succeed, while older and female students with comparable dropout risk are under-identified. Post-processing amplifies these disparities by collapsing heterogeneous probabilities into percentile-based risk tiers. This work provides a replicable methodology for auditing institutional ML systems and shows how disparities emerge and compound across stages, highlighting the importance of evaluating construct validity alongside statistical fairness. It contributes one empirical thread to a broader program investigating algorithms, student data, and power in higher education.
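A minimal sketch of the group-wise part of such an audit: selection rates and true-positive rates per demographic group, with gaps relative to a reference group. The column names are hypothetical and the metrics are standard fairness measures, not necessarily the exact set used in this audit.

```python
import pandas as pd

def group_disparities(df: pd.DataFrame, group_col: str = "group") -> pd.DataFrame:
    """`df` is assumed to have one row per student with columns:
    `group` (e.g. gender), `flagged` (EWS decision after post-processing, 0/1)
    and `dropped_out` (observed outcome, 0/1)."""
    stats = df.groupby(group_col).apply(
        lambda g: pd.Series({
            "selection_rate": g["flagged"].mean(),                  # who gets flagged
            "tpr": g.loc[g["dropped_out"] == 1, "flagged"].mean(),  # flagged among actual dropouts
        })
    )
    reference = stats.iloc[0]  # first group taken as reference for gap computation
    stats["selection_gap"] = stats["selection_rate"] - reference["selection_rate"]
    stats["tpr_gap"] = stats["tpr"] - reference["tpr"]
    return stats
```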
Yefim Shulman, Agnieszka Kitkowska, Mark Warner
For online health communities, community trust is paramount. Yet, advances in Large Language Models (LLMs) generating advice may erode this trust, especially if users cannot identify whether LLMs have been used. We investigate the feasibility of community-based detection of health advice authorship and how self-moderation of LLMs could help enhance advice utilization. In an online experiment, we evaluate people's ability to distinguish AI-generated from human-written advice across two health conditions, considering lived experience with a condition, AI-recognition training, and user attitudes towards transparency and trust around AI use. Our results indicate the need for transparency coupled with trust. We find little evidence of people's ability to discern advice authorship. However, we find a consistent effect of the health condition. Our qualitative findings identify unreliable signals, resulting in flawed heuristic evaluations of the advice. Our findings point to opportunities to improve the self-moderation of LLM-based AI and aid community-based AI moderation.