"au:"Anran Wang"" — arXiv2 Search

Showing 1–20 of 22 results

Hybrid Neural Networks for On-device Directional Hearing

Anran Wang, Maruchi Kim, Hao Zhang, Shyamnath Gollakota

Dec 11, 2021·cs.SD·PDF

On-device directional hearing requires audio source separation from a given direction while achieving stringent human-imperceptible latency requirements. While neural nets can achieve significantly better performance than traditional beamformers, all existing models fall short of supporting low-latency causal inference on computationally-constrained wearables. We present DeepBeam, a hybrid model that combines traditional beamformers with a custom lightweight neural net. The former reduces the computational burden of the latter and also improves its generalizability, while the latter is designed to further reduce the memory and computational overhead to enable real-time and low-latency operations. Our evaluation shows comparable performance to state-of-the-art causal inference models on synthetic data while achieving a 5x reduction of model size, 4x reduction of computation per second, 5x reduction in processing time and generalizing better to real hardware data. Further, our real-time hybrid model runs in 8 ms on mobile CPUs designed for low-power wearable devices and achieves an end-to-end latency of 17.5 ms.

Partial Incorrectness Logic

Lena Verscht, Ānrán Wáng, Benjamin Lucien Kaminski

Feb 20, 2025·cs.LO·PDF

Reasoning about program correctness has been a central topic in static analysis for many years, with Hoare logic (HL) playing an important role. The key notions in HL are partial and total correctness. Both require that program executions starting in a specified set of initial states (the precondition) reach a designated set of final states (the postcondition). Partial correctness is more lenient in that it does not require termination, effectively deeming divergence acceptable. We explore partial incorrectness logic, which stands in relation to O'Hearn's "total" incorrectness logic as partial correctness does to total correctness: Partial correctness allows divergence, partial incorrectness allows unreachability. While the duality between divergence and unreachability may not be immediately apparent, we explore this relationship further. Our chosen formalism is predicate transformers à la Dijkstra. We focus here on deterministic and reversible programs, though the discussion extends to nondeterministic and irreversible computations, both of which introduce additional nondeterminism that must be addressed.

Holistic Multi-modal Memory Network for Movie Question Answering

Anran Wang, Anh Tuan Luu, Chuan-Sheng Foo, Hongyuan Zhu, Yi Tay, Vijay Chandrasekhar

Nov 12, 2018·cs.CV·PDF

Answering questions according to multi-modal context is a challenging problem as it requires a deep integration of different data sources. Existing approaches only employ partial interactions among data sources in one attention hop. In this paper, we present the Holistic Multi-modal Memory Network (HMMN) framework which fully considers the interactions between different input sources (multi-modal context, question) in each hop. In addition, it takes answer choices into consideration during the context retrieval stage. Therefore, the proposed framework effectively integrates multi-modal context, question, and answer information, which leads to more informative context retrieved for question answering. Our HMMN framework achieves state-of-the-art accuracy on MovieQA dataset. Extensive ablation studies show the importance of holistic reasoning and contributions of different attention strategies.

MilliSonic: Pushing the Limits of Acoustic Motion Tracking

Anran Wang, Shyamnath Gollakota

Jan 20, 2019·cs.HC·PDF

Recent years have seen interest in device tracking and localization using acoustic signals. State-of-the-art acoustic motion tracking systems however do not achieve millimeter accuracy and require large separation between microphones and speakers, and as a result, do not meet the requirements for many VR/AR applications. Further, tracking multiple concurrent acoustic transmissions from VR devices today requires sacrificing accuracy or frame rate. We present MilliSonic, a novel system that pushes the limits of acoustic based motion tracking. Our core contribution is a novel localization algorithm that can provably achieve sub-millimeter 1D tracking accuracy in the presence of multipath, while using only a single beacon with a small 4-microphone array.Further, MilliSonic enables concurrent tracking of up to four smartphones without reducing frame rate or accuracy. Our evaluation shows that MilliSonic achieves 0.7mm median 1D accuracy and a 2.6mm median 3D accuracy for smartphones, which is 5x more accurate than state-of-the-art systems. MilliSonic enables two previously infeasible interaction applications: a) 3D tracking of VR headsets using the smartphone as a beacon and b) fine-grained 3D tracking for the Google Cardboard VR system using a small microphone array.

FM Backscatter: Enabling Connected Cities and Smart Fabrics

Anran Wang, Vikram Iyer, Vamsi Talla, Joshua R. Smith, Shyamnath Gollakota

Feb 22, 2017·cs.NI·PDF

This paper enables connectivity on everyday objects by transforming them into FM radio stations. To do this, we show for the first time that ambient FM radio signals can be used as a signal source for backscatter communication. Our design creates backscatter transmissions that can be decoded on any FM receiver including those in cars and smartphones. This enables us to achieve a previously infeasible capability: backscattering information to cars and smartphones in outdoor environments. Our key innovation is a modulation technique that transforms backscatter, which is a multiplication operation on RF signals, into an addition operation on the audio signals output by FM receivers. This enables us to embed both digital data as well as arbitrary audio into ambient analog FM radio signals. We build prototype hardware of our design and successfully embed audio transmissions over ambient FM signals. Further, we achieve data rates of up to 3.2 kbps and ranges of 5-60 feet, while consuming as little as 11.07μW of power. To demonstrate the potential of our design, we also fabricate our prototype on a cotton t-shirt by machine sewing patterns of a conductive thread to create a smart fabric that can transmit data to a smartphone. We also embed FM antennas into posters and billboards and show that they can communicate with FM receivers in cars and smartphones.

Controlling relaxation dynamics of excitonic states in monolayer transition metal dichalcogenides WS2 through interface engineering

Anran Wang, Yuhan Wang, Jianfei Li, Ning Xu, Songlin Li, Xinran Wang, Yi Shi, Fengqiu Wang

Feb 27, 2021·cond-mat.mes-hall·PDF

Transition metal dichalcogenides (TMDs) are known to support complex excitonic states. Revealing the differences in relaxation dynamics among different excitonic species and elucidating the transition dynamics between them may provide important guidelines for designing TMD-based excitonic devices. Combining photoluminescence (PL) and reflectance contrast measurements with ultrafast pump-probe spectroscopy under cryogenic temperatures, we herein study the relaxation dynamics of neutral and charged excitons in a back-gate-controlled monolayer device. Pump-probe results reveal quite different relaxation dynamics of excitonic states under different interfacial conditions: while neutral excitons experience much longer lifetime than trions in monolayer WS2, the opposite is true in the WS2/h-BN heterostructure. It is found that the insertion of h-BN layer between the TMD monolayer and the substrate has a great influence on the lifetimes of different excitonic states. The h-BN flakes can not only screen the effects of impurities and defects at the interface, but also help establish a non-radiative transition from neutral excitons to trions to be the dominant relaxation pathway, under cryogenic temperature. Our findings highlight the important role interface may play in governing the transient properties of carriers in 2D semiconductors, and may also have implications for designing light-emitting and photo-detecting devices based on TMDs.

PairUni: Pairwise Training for Unified Multimodal Language Models

Jiani Zheng, Zhiyang Teng, Kunpeng Qiu, Xiangtai Li, Anran Wang, Yu Tian, Ye Tian, Haochen Wang, Zhuochen Wang

Oct 29, 2025·cs.CL·PDF

Unified Vision-Language Models (UVLMs) perform both understanding and generation within a single architecture. Since these models rely on heterogeneous data and supervision, balancing both generation and understanding in reinforcement learning (RL) is challenging. To address this challenge, we propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. Specifically, we construct a unified paired dataset by synthesizing aligned instances via cross-modal semantic completion and retrieving semantically related samples. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present PairGRPO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. Extensive experiments across diverse UVLM architectures (Autoregressive and Discrete Diffusion) and scales (1B to 14B) demonstrate that PairUni yields consistent improvements over strong baselines. Notably, our method also demonstrates strong generalization by improving performance on image editing tasks without using any editing-specific data. Codes are available at https://github.com/Haochen-Wang409/PairUni.

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang

Jul 10, 2025·cs.CV·PDF

Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs, even the most advanced models struggle with this benchmark, where none of them reach 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.

SDXL-Lightning: Progressive Adversarial Diffusion Distillation

Shanchuan Lin, Anran Wang, Xiao Yang

Feb 21, 2024·cs.CV·PDF

We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights.

All Tokens Matter: Token Labeling for Training Better Vision Transformers

Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, Jiashi Feng

Apr 22, 2021·cs.CV·PDF

In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model size up to 150M, delivering the minimal-sized model among previous models (250M+) reaching 86%. We also show that token labeling can clearly improve the generalization capability of the pre-trained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.

Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang

Oct 23, 2025·cs.CV·PDF

Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging due to the need for joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3-Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning by highlighting key timestamps, objects, and bounding boxes, making the reasoning process traceable and verifiable. To enable this capability, we first construct high-quality datasets STGR that provide unified spatio-temporal supervision, which is absent in existing resources. We further adopt a cold-start reinforcement learning strategy with specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3-Video achieves state-of-the-art performance, improving mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline, and shows consistent gains across a range of video understanding benchmarks. Beyond accuracy, the grounded reasoning traces produced by Open-o3-Video support confidence-aware test-time scaling, improving answer reliability.

skscope: Fast Sparsity-Constrained Optimization in Python

Zezhi Wang, Jin Zhu, Peng Chen, Huiyang Peng, Xiaoke Zhang, Anran Wang, Junxian Zhu, Xueqin Wang

Mar 27, 2024·stat.ML·PDF

Applying iterative solvers on sparsity-constrained optimization (SCO) requires tedious mathematical deduction and careful programming/debugging that hinders these solvers' broad impact. In the paper, the library skscope is introduced to overcome such an obstacle. With skscope, users can solve the SCO by just programming the objective function. The convenience of skscope is demonstrated through two examples in the paper, where sparse linear regression and trend filtering are addressed with just four lines of code. More importantly, skscope's efficient implementation allows state-of-the-art solvers to quickly attain the sparse solution regardless of the high dimensionality of parameter space. Numerical experiments reveal the available solvers in skscope can achieve up to 80x speedup on the competing relaxation solutions obtained via the benchmarked convex solver. skscope is published on the Python Package Index (PyPI) and Conda, and its source code is available at: https://github.com/abess-team/skscope.

P/D-Device: Disaggregated Large Language Model between Cloud and Devices

Yibo Jin, Yixu Xu, Yue Chen, Chengbin Wang, Tao Wang, Jiaqi Huang, Rongfei Zhang, Yiming Dong, Yuting Yan, Ke Cheng, Yingjie Zhu, Shulan Wang, Qianqian Tang, Shuaishuai Meng, Guanxin Cheng, Ze Wang, Shuyan Miao, Ketao Wang, Wen Liu, Yifan Yang, Tong Zhang, Anran Wang, Chengzhou Lu, Tiantian Dong, Yongsheng Zhang, Zhe Wang, Hefei Guo, Hongjie Liu, Wei Lu, Zhengyong Zhang

Aug 12, 2025·cs.DC·PDF

Serving disaggregated large language models has been widely adopted in industrial practice for enhanced performance. However, too many tokens generated in decoding phase, i.e., occupying the resources for a long time, essentially hamper the cloud from achieving a higher throughput. Meanwhile, due to limited on-device resources, the time to first token (TTFT), i.e., the latency of prefill phase, increases dramatically with the growth on prompt length. In order to concur with such a bottleneck on resources, i.e., long occupation in cloud and limited on-device computing capacity, we propose to separate large language model between cloud and devices. That is, the cloud helps a portion of the content for each device, only in its prefill phase. Specifically, after receiving the first token from the cloud, decoupling with its own prefill, the device responds to the user immediately for a lower TTFT. Then, the following tokens from cloud are presented via a speed controller for smoothed TPOT (the time per output token), until the device catches up with the progress. On-device prefill is then amortized using received tokens while the resource usage in cloud is controlled. Moreover, during cloud prefill, the prompt can be refined, using those intermediate data already generated, to further speed up on-device inference. We implement such a scheme P/D-Device, and confirm its superiority over other alternatives. We further propose an algorithm to decide the best settings. Real-trace experiments show that TTFT decreases at least 60%, maximum TPOT is about tens of milliseconds, and cloud throughput increases by up to 15x.

SAMTok: Representing Any Mask with Two Words

Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li

Jan 22, 2026·cs.CV·PDF

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang

Oct 21, 2025·cs.CV·PDF

While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehen- sive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.

ManiCLIP: Multi-Attribute Face Manipulation from Text

Hao Wang, Guosheng Lin, Ana García del Molino, Anran Wang, Jiashi Feng, Zhiqi Shen

Oct 2, 2022·cs.CV·PDF

In this paper we present a novel multi-attribute face manipulation method based on textual descriptions. Previous text-based image editing methods either require test-time optimization for each individual image or are restricted to single attribute editing. Extending these methods to multi-attribute face image editing scenarios will introduce undesired excessive attribute change, e.g., text-relevant attributes are overly manipulated and text-irrelevant attributes are also changed. In order to address these challenges and achieve natural editing over multiple face attributes, we propose a new decoupling training scheme where we use group sampling to get text segments from same attribute categories, instead of whole complex sentences. Further, to preserve other existing face attributes, we encourage the model to edit the latent code of each attribute separately via an entropy constraint. During the inference phase, our model is able to edit new face images without any test-time optimization, even from complex textual prompts. We show extensive experiments and analysis to demonstrate the efficacy of our method, which generates natural manipulated faces with minimal text-irrelevant attribute editing. Code and pre-trained model are available at https://github.com/hwang1996/ManiCLIP.

Dynamic Spatio-Temporal Specialization Learning for Fine-Grained Action Recognition

Tianjiao Li, Lin Geng Foo, Qiuhong Ke, Hossein Rahmani, Anran Wang, Jinghua Wang, Jun Liu

Sep 3, 2022·cs.CV·PDF

The goal of fine-grained action recognition is to successfully discriminate between action categories with subtle differences. To tackle this, we derive inspiration from the human visual system which contains specialized regions in the brain that are dedicated towards handling specific tasks. We design a novel Dynamic Spatio-Temporal Specialization (DSTS) module, which consists of specialized neurons that are only activated for a subset of samples that are highly similar. During training, the loss forces the specialized neurons to learn discriminative fine-grained differences to distinguish between these similar samples, improving fine-grained recognition. Moreover, a spatio-temporal specialization method further optimizes the architectures of the specialized neurons to capture either more spatial or temporal fine-grained information, to better tackle the large range of spatio-temporal variations in the videos. Lastly, we design an Upstream-Downstream Learning algorithm to optimize our model's dynamic decisions during training, improving the performance of our DSTS module. We obtain state-of-the-art performance on two widely-used fine-grained action recognition datasets.

DeepSense: Enabling Carrier Sense in Low-Power Wide Area Networks Using Deep Learning

Justin Chan, Anran Wang, Arvind Krishnamurthy, Shyamnath Gollakota

Apr 24, 2019·cs.NI·PDF

The last few years have seen the proliferation of low-power wide area networks like LoRa, Sigfox and 802.11ah, each of which use a different and sometimes proprietary coding and modulation scheme, work below the noise floor and operate on the same frequency band. We introduce DeepSense, which is the first carrier sense mechanism that enables random access and coexistence for low-power wide area networks even when signals are below the noise floor. Our key insight is that any communication protocol that operates below the noise floor has to use coding at the physical layer. We show that neural networks can be used as a general algorithmic framework that can learn the coding mechanisms being employed by such protocols to identify signals that are hidden within noise. Our evaluation shows that DeepSense performs carrier sense across 26 different LPWAN protocols and configurations below the noise floor and can operate in the presence of frequency shifts as well as concurrent transmissions. Beyond carrier sense, we also show that DeepSense can support multi bit-rate LoRa networks by classifying between 21 different LoRa configurations and flexibly adapting bitrates based on signal strength. In a deployment of a multi-rate LoRa network, DeepSense improves bit rate by 4x for nearby devices and provides a 1.7x increase in the number of locations that can connect to the campus-wide network.

Living IoT: A Flying Wireless Platform on Live Insects

Vikram Iyer, Rajalakshmi Nandakumar, Anran Wang, Sawyer Fuller, Shyamnath Gollakota

Dec 22, 2018·cs.NI·PDF

Sensor networks with devices capable of moving could enable applications ranging from precision irrigation to environmental sensing. Using mechanical drones to move sensors, however, severely limits operation time since flight time is limited by the energy density of current battery technology. We explore an alternative, biology-based solution: integrate sensing, computing and communication functionalities onto live flying insects to create a mobile IoT platform. Such an approach takes advantage of these tiny, highly efficient biological insects which are ubiquitous in many outdoor ecosystems, to essentially provide mobility for free. Doing so however requires addressing key technical challenges of power, size, weight and self-localization in order for the insects to perform location-dependent sensing operations as they carry our IoT payload through the environment. We develop and deploy our platform on bumblebees which includes backscatter communication, low-power self-localization hardware, sensors, and a power source. We show that our platform is capable of sensing, backscattering data at 1 kbps when the insects are back at the hive, and localizing itself up to distances of 80 m from the access points, all within a total weight budget of 102 mg.

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li

Nov 12, 2025·cs.CV·PDF

While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9\% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel

P/D-Device: Disaggregated Large Language Model between Cloud and Devices

Aug 12, 2025·cs.DC·PDF