Xin Chen, Saili Uday Gadgil, Jiarong Qiu
Retrieval augmented generation mitigates limitations of large language models in factual consistency and knowledge updating by introducing external knowledge. However, practical applications still suffer from semantic misalignment between retrieved results and generation objectives, as well as insufficient evidence utilization. To address these challenges, this paper proposes a retrieval augmented generation method that integrates semantic alignment with evidence constraints through coordinated modeling of retrieval and generation stages. The method first represents the relevance between queries and candidate evidence within a unified semantic space. This ensures that retrieved results remain semantically consistent with generation goals and reduces interference from noisy evidence and semantic drift. On this basis, an explicit evidence constraint mechanism is introduced. Retrieved evidence is transformed from an implicit context into a core control factor in generation. This restricts the expression scope of generated content and strengthens dependence on evidence. By jointly modeling semantic consistency and evidence constraints within a unified framework, the proposed approach improves factual reliability and verifiability while preserving natural language fluency. Comparative results show stable improvements across multiple generation quality metrics. This confirms the effectiveness and necessity of coordinated semantic alignment and evidence constraint modeling in retrieval augmented generation tasks.
Xin Chen, Jia He, Maozheng Li, Dongliang Xu, Tianyu Wang, Yixiao Chen, Zhixin Lin, Yue Yao
Vision-Language Models (VLMs) have recently shown remarkable progress in multimodal reasoning, yet their applications in autonomous driving remain limited. In particular, the ability to understand road topology, a key requirement for safe navigation, has received relatively little attention. While some recent works have begun to explore VLMs in driving contexts, their performance on topology reasoning is far from satisfactory. In this work, we systematically evaluate VLMs' capabilities in road topology understanding. Specifically, multi-view images are projected into a unified ground-plane coordinate system and fused into bird's-eye-view (BEV) lanes. Based on these BEV lanes, we formulate four topology-related diagnostic VQA tasks, which together capture essential components of spatial topology reasoning. Through extensive evaluation, we find that while frontier closed-source models (e.g., GPT-4o) achieve relatively high accuracy on some tasks, they still fail on spatial questions that humans can answer (e.g., GPT-4o achieves only 67.8% on the vector task, a two-class classification problem). Furthermore, we find that open-source VLMs, even at the 30B scale, struggle significantly. These results indicate that spatial reasoning remains a fundamental bottleneck for current VLMs. We also find that model capability is positively correlated with model size, the number of reasoning tokens, and the number of shots provided as examples, suggesting directions for future research.
Yi-Zhen Li, Jun-Jie Huo, Xin Chen, Heng-Dong Xi
We report an experimental investigation of turbulent Rayleigh-Bénard convection in a rectangular cell of large aspect ratio ($Γ = 10$) over the Rayleigh number range $5.4\times10^7 \le Ra \le 7.2\times10^9$ and Prandtl number range $4.3 \le Pr \le 67.3$. Planar particle image velocimetry measurements show that the flow self-organises into several horizontally stacked convection rolls, and repeated experiments under identical parameters (both $Ra$ and $Pr$) reveal that the number of rolls varies between 3 and 7, with 6 being the most probable, demonstrating the presence of multiple flow states. When $Pr$ is increased to 67.3, the number of roll-like structures increases significantly, indicating a structural transition from a roll-dominated to a plume-dominated flow. This transition is reflected in the global momentum transport: for $Pr \leq 18.3$ the Reynolds number scales as $Re \sim Ra^{0.58}Pr^{-0.97}$, whereas the scaling changes to $Re \sim Ra^{0.72}$ when $Pr$ reaches 67.3. Within individual rolls, we further examine the Reynolds numbers based on the horizontal and vertical velocity components, $Re_{u,\text{roll}}$ and $Re_{w,\text{roll}}$, and find that the former increases while the latter decreases with roll size (quantified as the aspect ratio of the roll, $Γ_\text{roll}$) due to continuity constraints, with their ratio following $Re_{w,\text{roll}}/Re_{u,\text{roll}} \sim Γ_\text{roll}^{-0.61}$. We impose different initial flow conditions (roll structures) with controlled perturbations and demonstrate that the initial condition can influence the final turbulent state. We show that the number of horizontally stacked rolls regulates the global transport: a larger number of rolls induces greater vertical momentum and heat transfer.
Xin Chen, Zhaolin Ren
Zeroth-order optimization (ZO) is widely used for solving black-box optimization and control problems. In particular, single-point ZO (SZO) is well-suited to online or dynamic problem settings due to its requirement of only a single function evaluation per iteration. However, SZO suffers from high gradient estimation variance and slow convergence, which severely limit its practical applicability. To overcome these limitations, we propose a novel yet simple SZO framework termed regression-based SZO (ReSZO), which substantially enhances the convergence rate. Specifically, ReSZO constructs a surrogate function via regression using historical function evaluations and employs the gradient of this surrogate function for iterative updates. Two instantiations of ReSZO, which fit linear and quadratic surrogate functions respectively, are introduced. Moreover, we provide a non-asymptotic convergence analysis for the linear instantiation of ReSZO, showing that its convergence rates are comparable to those of two-point ZO methods. Extensive numerical experiments demonstrate that ReSZO empirically converges two to three times faster than two-point ZO in terms of function query complexity.
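The linear instantiation can be sketched as follows: keep a sliding window of single-point evaluations, fit a linear surrogate by least squares, and descend along its gradient. This is a minimal illustration under assumed hyperparameters (window size, step rule, perturbation scale); it is not the paper's exact algorithm.

```python
import numpy as np

def reszo_linear(f, x0, n_iters=200, step=0.05, window=8, noise=0.02, seed=0):
    """Sketch of regression-based single-point ZO with a linear surrogate.

    Each iteration evaluates f once at a perturbed point, keeps a sliding
    window of (point, value) pairs, fits f(x) ~ a + g.x by least squares,
    and steps along -g. All names and parameters here are illustrative.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    pts, vals = [], []
    for _ in range(n_iters):
        xq = x + noise * rng.standard_normal(d)   # single query point
        pts.append(xq); vals.append(f(xq))        # one evaluation per iteration
        pts, vals = pts[-window:], vals[-window:]
        if len(pts) > d:                          # need at least d+1 points to fit
            A = np.hstack([np.ones((len(pts), 1)), np.array(pts)])
            coef, *_ = np.linalg.lstsq(A, np.array(vals), rcond=None)
            g = coef[1:]                          # surrogate gradient estimate
            x = x - step * g
    return x

# Usage: minimize a simple quadratic with minimum at the all-ones vector
x_star = reszo_linear(lambda x: np.sum((x - 1.0) ** 2), np.zeros(3))
```

On a smooth objective, the fitted slope approximates the gradient at the centroid of the window, which is what makes a single evaluation per iteration sufficient.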
Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang
Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a \emph{blind self-thinking} paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70\% higher accuracy, 22.90\% higher pass rate, and a 41.36 BLEU improvement, while nearly halving reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: \url{https://github.com/SUAT-AIRI/Proactive-Interactive-R1}
Yadong Zhang, Xin Chen
Inspired by the well-known permutation entropy (PE), an effective image encoding scheme for chaotic time series, Triad State Space Construction (TSSC), is proposed. The TSSC image can recognize higher-order temporal patterns and identify new forbidden regions in time series motifs beyond the Bandt-Pompe probabilities. Convolutional Neural Networks (ConvNets) are widely used in image classification. The ConvNet classifier based on TSSC images (TSSC-ConvNet) is highly accurate and very robust in chaotic signal classification.
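For background, the Bandt-Pompe construction that underlies permutation entropy maps each short window of a series to its rank-order pattern and measures the entropy of the pattern distribution. The sketch below shows that baseline idea only; the TSSC encoding itself is not reproduced here, and the function names are illustrative.

```python
from math import log, factorial

def ordinal_pattern(window):
    """Bandt-Pompe rank-order pattern, e.g. (0.1, 0.9, 0.4) -> (0, 2, 1)."""
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))

def permutation_entropy(series, order=3):
    """Permutation entropy over ordinal patterns of length `order`,
    normalized by log(order!) so the result lies in [0, 1]."""
    counts = {}
    for i in range(len(series) - order + 1):
        p = ordinal_pattern(series[i:i + order])
        counts[p] = counts.get(p, 0) + 1
    n = sum(counts.values())
    h = -sum((c / n) * log(c / n) for c in counts.values())
    return h / log(factorial(order))

# A monotone series visits a single pattern, giving zero entropy;
# an irregular series spreads mass over several patterns.
pe_flat = permutation_entropy([1, 2, 3, 4, 5, 6])
pe_mixed = permutation_entropy([2, 1, 3, 1, 2, 1, 3, 2])
```

Patterns that never occur in the data correspond to the "forbidden regions" the abstract refers to; TSSC extends this idea to higher-order (triad) structures rendered as images.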
Xin Chen, Adrien Bouhon, Linyang Li, François M. Peeters, Biplab Sanyal
Using an evolutionary algorithm for crystal structure prediction, we present a new stable two-dimensional (2D) carbon allotrope composed of polymerized as-indacenes (PAI) in a zigzag pattern, namely PAI-graphene, whose energy is lower than that of most reported 2D allotropes of graphene. Crucially, the crystal structure realizes a nonsymmorphic layer group that enforces a nontrivial global topology of the band structure, with two Dirac cones lying perfectly at the Fermi level. The absence of electron/hole pockets makes PAI-graphene a pristine crystalline topological semimetal with anisotropic Fermi velocities as high as $7.0 \times 10^{5}$ m/s. We show that while the semimetallic property of the allotrope is robust against the application of strain, the positions of the Dirac cones and the Fermi velocities can be modified significantly by strain. Moreover, by combining strain along both the x- and y-directions, two band inversions take place at $Γ$, leading to the annihilation of the Dirac nodes and demonstrating the possibility of strain-controlled conversion of a topological semimetal into a semiconductor. Finally, we formulate the bulk-boundary correspondence of the topological nodal phase in the form of a generalized Zak-phase argument, finding perfect agreement with the topological edge states computed for different edge terminations.
Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, Huchuan Lu
Correlation plays a critical role in the tracking field, especially in recent popular Siamese-based trackers. The correlation operation is a simple fusion method for considering the similarity between the template and the search region. However, the correlation operation itself is a local linear matching process, which loses semantic information and easily falls into local optima; this may be the bottleneck in designing high-accuracy tracking algorithms. Is there a better feature fusion method than correlation? To address this issue, inspired by the Transformer, this work presents a novel attention-based feature fusion network, which effectively combines the template and search-region features using attention alone. Specifically, the proposed method includes an ego-context augment module based on self-attention and a cross-feature augment module based on cross-attention. Finally, we present a Transformer tracking method (named TransT) based on a Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and a classification and regression head. Experiments show that our TransT achieves very promising results on six challenging datasets, especially on the large-scale LaSOT, TrackingNet, and GOT-10k benchmarks. Our tracker runs at approximately 50 fps on a GPU. Code and models are available at https://github.com/chenxin-dlut/TransT.
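The core contrast with correlation can be illustrated with plain scaled dot-product cross-attention: every search-region token softly aggregates all template tokens instead of performing a local linear match. This is a bare-bones sketch; TransT's actual ECA/CFA modules add learned projections, multi-head splits, positional encodings, and residual connections, none of which are shown here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feat, kv_feat):
    """Scaled dot-product cross-attention: q_feat attends to kv_feat.

    Returns a fused feature for each query token as a convex combination
    of the key/value tokens (illustrative, no learned projections).
    """
    d = q_feat.shape[-1]
    attn = softmax(q_feat @ kv_feat.T / np.sqrt(d))   # (Nq, Nkv) weights
    return attn @ kv_feat                             # fused features

template = np.random.rand(64, 32)     # flattened template feature tokens
search = np.random.rand(256, 32)      # flattened search-region tokens
fused = cross_attention(search, template)   # search features augmented by template
```

Because the attention weights are global over the template, semantic context survives the fusion step, which is precisely what the local correlation window discards.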
Xin Chen, Sam Toyer, Cody Wild, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H Wang, Ping Luo, Stuart Russell, Pieter Abbeel, Rohin Shah
Imitation learning often needs a large demonstration set in order to handle the full range of situations that an agent might find itself in during deployment. However, collecting expert demonstrations can be expensive. Recent work in vision, reinforcement learning, and NLP has shown that auxiliary representation learning objectives can reduce the need for large amounts of expensive, task-specific data. Our Empirical Investigation of Representation Learning for Imitation (EIRLI) investigates whether similar benefits apply to imitation learning. We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation across several environment suites. In the settings we evaluate, we find that existing algorithms for image-based representation learning provide limited value relative to a well-tuned baseline with image augmentations. To explain this result, we investigate differences between imitation learning and other settings where representation learning has provided significant benefit, such as image classification. Finally, we release a well-documented codebase which both replicates our findings and provides a modular framework for creating new representation learning algorithms out of reusable components.
Minbin Huang, Zhijian Huang, Changlin Li, Xin Chen, Hang Xu, Zhenguo Li, Xiaodan Liang
Neural Architecture Search (NAS) aims to find efficient models for multiple tasks. Beyond seeking solutions for a single task, there is surging interest in transferring network design knowledge across multiple tasks. In this line of research, effectively modeling task correlations is vital yet largely neglected. Therefore, we propose \textbf{Arch-Graph}, a transferable NAS method that predicts task-specific optimal architectures with respect to given task embeddings. It leverages correlations across multiple tasks by using their embeddings as part of the predictor's input for fast adaptation. We also formulate NAS as an architecture relation graph prediction problem, with the relational graph constructed by treating candidate architectures as nodes and their pairwise relations as edges. To enforce basic properties such as acyclicity in the relational graph, we add constraints to the optimization process, converting NAS into the problem of finding a Maximal Weighted Acyclic Subgraph (MWAS). Our algorithm then strives to eliminate cycles and establishes edges in the graph only when the ranking results can be trusted. Through MWAS, Arch-Graph can effectively rank candidate models for each task with only a small budget to fine-tune the predictor. Through extensive experiments on TransNAS-Bench-101, we show Arch-Graph's transferability and high sample efficiency across numerous tasks, beating many NAS methods designed for both single-task and multi-task search. It finds architectures in the top 0.16\% and 0.29\% on average on two search spaces under a budget of only 50 models.
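The MWAS idea can be sketched with a simple greedy heuristic: consider candidate "A beats B" edges in decreasing confidence order and keep an edge only if it does not close a cycle. This is a toy illustration of the combinatorial problem, not Arch-Graph's actual solver, and the names and weights below are invented for the example.

```python
def has_cycle(nodes, edges):
    """DFS three-color check for a directed cycle."""
    adj = {u: [] for u in nodes}
    for u, v in edges:
        adj[u].append(v)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {u: WHITE for u in nodes}
    def visit(u):
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY or (color[v] == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and visit(u) for u in nodes)

def greedy_mwas(nodes, weighted_edges):
    """Greedy maximal weighted acyclic subgraph: add edges in decreasing
    weight order, skipping any edge that would close a cycle."""
    kept = []
    for w, u, v in sorted(weighted_edges, reverse=True):
        if not has_cycle(nodes, kept + [(u, v)]):
            kept.append((u, v))
    return kept

# Three architectures; the weak edge c -> a would close a cycle and is dropped.
edges = [(0.9, "a", "b"), (0.8, "b", "c"), (0.1, "c", "a")]
kept = greedy_mwas(["a", "b", "c"], edges)
```

The surviving acyclic subgraph induces a partial order, which is what makes the pairwise predictions usable as a consistent ranking of candidate architectures.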
Xin Chen, Emiliano Dall'Anese, Changhong Zhao, Na Li
With a large-scale integration of distributed energy resources (DERs), distribution systems are expected to be capable of providing capacity support for the transmission grid. To effectively harness the collective flexibility from massive DER devices, this paper studies distribution-level power aggregation strategies for transmission-distribution interaction. In particular, this paper proposes a method to model and quantify the aggregate power flexibility, i.e., the net power injection achievable at the substation, in unbalanced distribution systems over time. Incorporating the network constraints and multi-phase unbalanced modeling, the proposed method obtains an effective approximate feasible region of the net power injection. For any aggregate power trajectory within this region, it is proved that there exists a feasible disaggregation solution. In addition, a distributed model predictive control (MPC) framework is developed for the practical implementation of the transmission-distribution interaction. Finally, we demonstrate the performance of the proposed method via numerical tests on a real-world distribution feeder with 126 multi-phase nodes.
Xin Chen, Ce Wang, Jinlong Yu
We discuss the topological invariant in the (2+1)-dimensional quench dynamics of a two-dimensional two-band Chern insulator starting from a topological initial state (i.e., with a nonzero Chern number $c_i$), evolved by a post-quench Hamiltonian (with Chern number $c_f$). In contrast to the process with $c_i=0$ studied in previous works, this process cannot be characterized by the Hopf invariant that is described by the sphere homotopy group $π_3(S^2)=\mathbb{Z}$. It is possible, however, to calculate a variant of the Chern-Simons integral with a complementary part that cancels the Chern number of the initial spin configuration while leaving the (2+1)-dimensional topology unaffected. We show that the modified Chern-Simons integral gives rise to a topological invariant of this quench process, namely the linking invariant in the $\mathbb{Z}_{2c_i}$ class: $ν = (c_f - c_i) \bmod (2c_i)$. We give concrete examples to illustrate this result and present a detailed derivation of the linking invariant.
Long Huo, Xin Chen
Time series motifs are used for discovering higher-order structures of time series data. Based on time series motifs, the motif embedding correlation field (MECF) is proposed to characterize higher-order temporal structures of dynamical system time series. A MECF-based unsupervised learning approach is applied to locating the source of forced oscillations (FOs), periodic disturbances that detrimentally impact power grids. Locating the FO source is imperative for system stability. Compared with Fourier analysis, MECF-based unsupervised learning is applicable under various FO situations, including a single FO, FO with resonance, and multiple-source FOs. MECF-based unsupervised learning is a data-driven approach that requires no prior knowledge of system models or topologies. Tests on the UK high-voltage transmission grid illustrate the effectiveness of MECF-based unsupervised learning. In addition, the impacts of coupling strength and measurement noise on locating the FO source with MECF-based unsupervised learning are investigated.
Xin Chen, Kai Wang
It is widely known that spoofing is a major threat that adversely impacts the reliability and accuracy of GNSS applications. In this study, a crowdsourcing double differential pseudorange spatial (D2SP) random set is constructed and the distribution of the set is derived. Based on the variance of the D2SP set, a tri-level hypothesis detection algorithm is designed to classify spoofing-free, fully-spoofed, and partially-spoofed cases in the region of interest (ROI). It does not require prior knowledge of the true positions or relative distances of the receivers. Simulation test results show that the proposed D2SP spoofing detection method has lower computational complexity and higher tolerance for multipath errors compared with the generalized likelihood ratio test (GLRT) method, the current mainstream spoofing detection algorithm based on multiple receivers' differential pseudoranges. Moreover, it also shows better flexibility across different ROI sizes and numbers of crowdsourcing receivers.
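The physical intuition behind variance-based detection can be illustrated with a toy simulation: when all "satellite" signals originate from a single spoofing transmitter, every receiver's pseudoranges are identical across satellites, so the double differences collapse toward zero, whereas genuine geometry produces a spread. This sketch shows only that intuition; the paper's tri-level test, thresholds, and distribution derivation are not reproduced, and all positions and names are invented.

```python
import numpy as np
from itertools import combinations

def d2sp_variance(pseudoranges):
    """Variance of double-differenced pseudoranges.

    pseudoranges: (n_receivers, n_sats) array. Differences are taken first
    across receiver pairs, then across satellite pairs. Illustrative only.
    """
    n_rx, n_sat = pseudoranges.shape
    dd = []
    for i, j in combinations(range(n_rx), 2):
        sd = pseudoranges[i] - pseudoranges[j]        # single differences
        for k, l in combinations(range(n_sat), 2):
            dd.append(sd[k] - sd[l])                  # double differences
    return np.var(dd)

rng = np.random.default_rng(1)
rx = rng.uniform(0, 1000, (4, 3))          # toy receiver positions (m)
sats = rng.uniform(0, 1e7, (6, 3))         # toy genuine satellite positions
spoofer = rng.uniform(0, 1000, 3)          # a single spoofing transmitter

clean = np.linalg.norm(rx[:, None] - sats[None], axis=2)
spoofed = np.linalg.norm(rx - spoofer, axis=1)[:, None].repeat(6, axis=1)
```

In the spoofed case each receiver sees one common transmit point, so `d2sp_variance(spoofed)` is essentially zero while the genuine geometry yields a large variance, which is the kind of statistic a threshold test can discriminate on.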
Xin Chen, Ben Kang, Jiawen Zhu, Dong Wang, Houwen Peng, Huchuan Lu
In this paper, we introduce a new sequence-to-sequence learning framework for RGB-based and multi-modal object tracking. First, we present SeqTrack for RGB-based tracking. It casts visual tracking as a sequence generation task, forecasting object bounding boxes in an autoregressive manner. This differs from previous trackers, which depend on the design of intricate head networks, such as classification and regression heads. SeqTrack employs a basic encoder-decoder transformer architecture. The encoder utilizes a bidirectional transformer for feature extraction, while the decoder generates bounding box sequences autoregressively using a causal transformer. The loss function is a plain cross-entropy. Second, we introduce SeqTrackv2, a unified sequence-to-sequence framework for multi-modal tracking tasks. Expanding upon SeqTrack, SeqTrackv2 integrates a unified interface for auxiliary modalities and a set of task-prompt tokens to specify the task. This enables it to manage multi-modal tracking tasks using a unified model and parameter set. This sequence learning paradigm not only simplifies the tracking framework, but also showcases superior performance across 14 challenging benchmarks spanning five single- and multi-modal tracking tasks. The code and models are available at https://github.com/chenxin-dlut/SeqTrackv2.
Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, Jingyi Yu, Gang Yu
We study a challenging task, conditional human motion generation, which produces plausible human motion sequences according to various conditional inputs, such as action classes or textual descriptors. Since human motions are highly diverse and distributed quite differently from conditional modalities, such as textual descriptors in natural language, it is hard to learn a probabilistic mapping from the desired conditional modality to the human motion sequences. Besides, the raw motion data from the motion capture system may be redundant in sequences and contain noise; directly modeling the joint distribution over the raw motion sequences and conditional modalities would incur a heavy computational overhead and might introduce artifacts from the captured noise. To learn a better representation of the various human motion sequences, we first design a powerful Variational AutoEncoder (VAE) and arrive at a representative and low-dimensional latent code for a human motion sequence. Then, instead of using a diffusion model to establish the connections between the raw motion sequences and the conditional inputs, we perform the diffusion process on the motion latent space. Our proposed Motion Latent-based Diffusion model (MLD) can produce vivid motion sequences conforming to the given conditional inputs and substantially reduces the computational overhead in both the training and inference stages. Extensive experiments on various human motion generation tasks demonstrate that our MLD achieves significant improvements over state-of-the-art methods across these tasks, while being two orders of magnitude faster than previous diffusion models operating on raw motion sequences.
Chen Xin, Thomas Motz, Wolfgang Fuhl, Andreas Hartel, Enkelejda Kasneci
Position detection of hydraulic cylinder pistons is crucial for numerous industrial automation applications. A typical traditional method is to excite electromagnetic waves in the cylinder structure and analytically solve for the piston position based on the scattering parameters measured by a sensor. The core of this approach is a physical model that describes the relationship between the measured scattering parameters and the targeted piston position. However, this physical model has shortcomings in accuracy and adaptability, especially under extreme conditions. To address these limitations, we propose machine learning and deep learning-based methods to learn the relationship directly in a data-driven manner. As a result, all deep learning models in this paper consistently outperform the physical one by a large margin. We further deliberate on the choice of models based on domain knowledge and provide in-depth analyses combining model performance with real-world physical characteristics. Specifically, we use a Convolutional Neural Network (CNN) to discover local interactions among adjacent frequencies in the input, apply a Complex-Valued Neural Network (CVNN) to exploit the complex-valued nature of electromagnetic scattering parameters, and introduce a novel technique named Frequency Encoding to add weighted frequency information to the model input. The combination of these techniques results in our best-performing model, a complex-valued CNN with Frequency Encoding, which exhibits a substantial improvement in accuracy with an error reduction of 1/12 compared to the traditional physical model.
Yubao Zhang, Xin Chen, Yi Gu, Zhicheng Li, Wu Kai
With the growing prevalence of electric vehicles (EVs) and advancements in EV electronics, vehicle-to-grid (V2G) techniques and large-scale scheduling strategies have emerged to promote renewable energy utilization and power grid stability. This study proposes a multi-stakeholder hierarchical V2G coordination strategy based on deep reinforcement learning (DRL) and the Proof of Stake algorithm. The stakeholders include the power grid, EV aggregators (EVAs), and users, and the proposed strategy can achieve benefits for all of them. On the grid side, load fluctuations and renewable energy consumption are considered, while on the EVA side, energy constraints and charging costs are considered. On the user side, three critical battery conditioning parameters, collectively termed battery SOX, are considered: state of charge, state of power, and state of health. Compared with four typical baselines, the multi-stakeholder hierarchical coordination strategy can enhance renewable energy consumption, mitigate load fluctuations, meet the energy demands of EVAs, and reduce charging costs and battery degradation under realistic operating conditions.
Xin Chen, Xue-Mei Li
Elliptic stochastic differential equations (SDEs) make sense when the coefficients are only continuous. We study the corresponding linearized SDE, whose coefficients are not assumed to be locally bounded. This leads to the existence of $W^{1,p}_{\mathrm{loc}}$ solution flows for elliptic SDEs with Hölder continuous and $\cap_{p} W^{1,p}_{\mathrm{loc}}$ coefficients. Furthermore, an approximation scheme is studied, from which we obtain a representation for the derivative of the Markov semigroup and an integration by parts formula.
Xin Chen, Todd Karin, Anubhav Jain
Solar modules in utility-scale systems are expected to maintain decades of lifetime to rival conventional energy sources. However, cyclic thermomechanical loading often degrades their long-term performance, highlighting the importance of effective design to mitigate thermal expansion mismatches between module materials. Given the complex composition of solar modules, isolating the impact of individual components on overall durability remains a challenging task. In this work, we analyze a comprehensive data set that comprises bill-of-materials (BOM) data and thermal cycling power loss from 251 distinct module designs to identify the predominant design factors and their impacts on the thermomechanical durability of modules. Our analysis combines machine learning modeling (random forest) and Shapley additive explanations (SHAP) to correlate design factors with power loss and interpret the model's decision-making. The interpretation reveals that silicon type (monocrystalline or polycrystalline), encapsulant thickness, busbar number, and wafer thickness predominantly influence the degradation. With a power loss around 0.6\% lower on average in the SHAP analysis, monocrystalline cells present better durability than polycrystalline cells. This finding is further substantiated by statistical testing on our raw data set. The SHAP analysis also demonstrates that while thicker encapsulants lead to reduced power loss, increasing their thickness beyond around 0.6 to 0.7 mm does not yield additional benefits, particularly for the front-side encapsulant. In addition, other important BOM features, such as the number of busbars, are analyzed. This study provides a blueprint for utilizing explainable machine learning techniques in a complex material system and can potentially guide future research on optimizing the design of solar modules.