Peng Sun, Wansen Feng, Ruobing Han, Shengen Yan, Yonggang Wen
Scaling out deep neural network (DNN) training is important for reducing model training time. High communication overhead is one of the major performance bottlenecks for distributed DNN training across multiple GPUs. Our investigations have shown that popular open-source DNN systems achieve a speedup ratio of only 2.5 on 64 GPUs connected by a 56 Gbps network. To address this problem, we propose a communication backend named GradientFlow for distributed DNN training, and employ a set of network optimization techniques. First, we integrate ring-based allreduce, mixed-precision training, and computation/communication overlap into GradientFlow. Second, we propose lazy allreduce to improve network throughput by fusing multiple communication operations into a single one, and design coarse-grained sparse communication to reduce network traffic by transmitting only important gradient chunks. When training ImageNet/AlexNet on 512 GPUs, our approach achieves a speedup ratio of 410.2 and completes 95-epoch training in 1.5 minutes, which outperforms existing approaches.
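Ring-based allreduce and the fusing idea behind lazy allreduce can be illustrated with a small in-process simulation. This is a minimal sketch under stated assumptions, not GradientFlow's implementation: the names `ring_allreduce` and `lazy_allreduce` are hypothetical, workers are plain Python lists, and the fusion simply concatenates all gradient tensors into one buffer instead of applying a size threshold.

```python
def ring_allreduce(worker_data):
    """Sum equal-length vectors across simulated workers with the ring algorithm.

    Returns one vector per worker; every vector holds the element-wise sum.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    bounds = [(i * length) // n for i in range(n + 1)]  # n contiguous chunks
    data = [list(v) for v in worker_data]

    def exchange(chunk_of, accumulate):
        # Every worker r sends one chunk to its right neighbour (r + 1) % n.
        # Payloads are snapshotted first so all sends in a step are simultaneous.
        sends = []
        for r in range(n):
            c = chunk_of(r)
            sends.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for k, val in enumerate(payload):
                if accumulate:
                    data[dst][bounds[c] + k] += val
                else:
                    data[dst][bounds[c] + k] = val

    # Reduce-scatter: after n - 1 steps worker r holds the full sum of chunk (r + 1) % n.
    for s in range(n - 1):
        exchange(lambda r: (r - s) % n, accumulate=True)
    # Allgather: circulate the finished chunks until every worker has all of them.
    for s in range(n - 1):
        exchange(lambda r: (r + 1 - s) % n, accumulate=False)
    return data


def lazy_allreduce(per_worker_grads):
    """Fuse each worker's small gradient tensors into one buffer, run a single
    allreduce on the fused buffer, then split the result back into tensors."""
    sizes = [len(t) for t in per_worker_grads[0]]
    fused = [[x for t in grads for x in t] for grads in per_worker_grads]
    reduced = ring_allreduce(fused)
    out = []
    for buf in reduced:
        tensors, pos = [], 0
        for size in sizes:
            tensors.append(buf[pos:pos + size])
            pos += size
        out.append(tensors)
    return out


# Three simulated workers, each holding two small gradient tensors.
grads = [[[1, 2], [3]], [[10, 20], [30]], [[100, 200], [300]]]
summed = lazy_allreduce(grads)
```

In this toy setting the two tensors per worker are exchanged with one allreduce call instead of two, which is the effect the fusing step is meant to achieve.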
Peng Sun, Peiwen Lin, Guangliang Cheng, Jianping Shi, Jiawan Zhang, Xi Li
Video object segmentation aims at accurately segmenting the target object regions across consecutive frames. The task is technically challenging because it must cope with complicating factors (e.g., shape deformation, occlusion, and objects moving out of view). Recent approaches have largely addressed these factors by using back-and-forth re-identification and bi-directional mask propagation. However, these methods are extremely slow and only support offline inference, so in principle they cannot be applied in real time. Motivated by this observation, we propose an efficient detection-based paradigm for video object segmentation. We propose a unified One-Pass Video Segmentation framework (OVS-Net) for modeling spatial-temporal representation in a single pipeline, which seamlessly integrates object detection, object segmentation, and object re-identification. The proposed framework lends itself to one-pass inference that effectively and efficiently performs video object segmentation. Moreover, we propose a mask-guided attention module for modeling the multi-scale object boundary and multi-level feature fusion. Experiments on the challenging DAVIS 2017 benchmark demonstrate the effectiveness of the proposed framework: it achieves performance comparable to the state of the art at about 11.5 FPS, more than 5 times faster than other state-of-the-art methods, and to our knowledge it is a pioneering step towards real-time operation.
Peng Sun
We explore an approach to the conjecture of Katok on intermediate entropies that is based on uniqueness of equilibrium states, provided the entropy function is upper semi-continuous. As an application, we prove Katok's conjecture for Mañé diffeomorphisms.
Peng Sun, Guang Chen, Guerdan Luke, Yi Shang
Object detection in remote sensing, especially in aerial images, remains a challenging problem due to low image resolution, complex backgrounds, and variation in the scale and angle of objects in images. Multi-scale and angle-based networks have been proposed and show promising results for aerial image detection. In this paper, we propose a novel loss function, called Salience Biased Loss (SBL), for deep neural networks, which uses salience information of the input image to achieve improved performance for object detection. Our novel loss function treats training examples differently based on input complexity in order to avoid the over-contribution of easy cases in the training process. In our experiments, RetinaNet was trained with SBL to generate a one-stage detector, SBL-RetinaNet. SBL-RetinaNet is applied to the largest existing public aerial image dataset, DOTA. Experimental results show that our proposed loss function with the RetinaNet architecture outperforms other state-of-the-art object detection models by at least 4.31 mAP, and RetinaNet by 2.26 mAP at the same inference speed as RetinaNet.
Jiechao Xiong, Qing Wang, Zhuoran Yang, Peng Sun, Lei Han, Yang Zheng, Haobo Fu, Tong Zhang, Ji Liu, Han Liu
Most existing deep reinforcement learning (DRL) frameworks consider either a discrete action space or a continuous action space, but not both. Motivated by applications in computer games, we consider the scenario with a discrete-continuous hybrid action space. To handle a hybrid action space, previous works either approximate the hybrid space by discretization, or relax it into a continuous set. In this paper, we propose a parametrized deep Q-network (P-DQN) framework for the hybrid action space without approximation or relaxation. Our algorithm combines the spirits of both DQN (dealing with discrete action space) and DDPG (dealing with continuous action space) by seamlessly integrating them. Empirical results on a simulation example, scoring a goal in simulated RoboCup soccer, and the solo mode in the game King of Glory (KOG) validate the efficiency and effectiveness of our method.
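The action-selection step for a hybrid action space can be sketched as follows, assuming the usual parametrized formulation in which each discrete action k carries its own continuous parameter x_k: a deterministic function proposes x_k for every k (the DDPG-like part), and the discrete action with the highest Q-value is then chosen (the DQN-like part). The callables `param_fns` and `q_fn` are hypothetical toy stand-ins for learned networks, not the authors' code.

```python
def select_hybrid_action(state, param_fns, q_fn):
    """Return (k, x_k) maximising q_fn(state, k, x_k) over the discrete actions k.

    param_fns[k](state) proposes the continuous parameter for discrete action k.
    """
    best_k, best_x, best_q = None, None, float("-inf")
    for k, param_fn in enumerate(param_fns):
        x_k = param_fn(state)        # continuous parameter for action k
        q = q_fn(state, k, x_k)      # value of the hybrid action (k, x_k)
        if q > best_q:
            best_k, best_x, best_q = k, x_k, q
    return best_k, best_x


# Toy stand-ins (assumptions for illustration only).
param_fns = [lambda s: 0.5 * s, lambda s: -s]
q_fn = lambda s, k, x: [1.0, 2.0][k] - (x - s) ** 2

action = select_hybrid_action(2.0, param_fns, q_fn)
```

With these toy functions, action 0 with parameter 1.0 has value 0.0 and action 1 with parameter -2.0 has value -14.0, so the selection returns `(0, 1.0)`.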
Lei Han, Jiechao Xiong, Peng Sun, Xinghai Sun, Meng Fang, Qingwei Guo, Qiaobo Chen, Tengfei Shi, Hongsheng Yu, Xipeng Wu, Zhengyou Zhang
StarCraft, one of the most difficult esport games with a long-standing history of professional tournaments, has attracted generations of players and fans, as well as intense attention in artificial intelligence research. Recently, Google's DeepMind announced AlphaStar, a grandmaster-level AI in StarCraft II that can play with humans using a comparable action space and operations. In this paper, we introduce a new AI agent, named TStarBot-X, that is trained with orders of magnitude less computation and can play competitively with expert human players. TStarBot-X takes advantage of important techniques introduced in AlphaStar, and also benefits from substantial innovations including new league training methods, novel multi-agent roles, rule-guided policy search, stabilized policy improvement, a lightweight neural network architecture, and importance sampling in imitation learning. We show that at a computation scale that is orders of magnitude smaller, a faithful reimplementation of AlphaStar's methods cannot succeed and the proposed techniques are necessary to ensure TStarBot-X's competitive performance. We reveal all technical details that are complementary to those mentioned in AlphaStar, showing the most sensitive parts in league training, reinforcement learning and imitation learning that affect the performance of the agents. Most importantly, this is an open-sourced study in which all code and resources (including the trained model parameters) are publicly accessible via https://github.com/tencent-ailab/tleague_projpage. We expect this study to be beneficial for both academic and industrial future research in solving complex problems like StarCraft, and it might also provide a sparring partner for all StarCraft II players and other AI agents.
Peng Sun
For a dynamical system satisfying the approximate product property and asymptotically entropy expansiveness, we characterize a delicate structure of the space of invariant measures: The ergodic measures of intermediate entropies and intermediate pressures are generic in certain subspaces. This proves a conjecture of Katok for a broad class of systems and extends a sequence of known results.
Peng Sun, C. -P. Yuan, Feng Yuan
We derive all-order soft gluon resummation in dijet azimuthal angular correlation in hadronic collisions at the next-to-leading logarithmic level. The relevant coefficients for the Sudakov resummation factor, the soft and hard factors, are calculated. The theory predictions agree well with the experimental data from the D0 Collaboration at the Tevatron. This provides a benchmark calculation for the transverse momentum dependent QCD resummation for jet production in hadron collisions and can be readily applied at the CERN LHC.
Peng Sun, Feng Yuan
We investigate the energy evolution of the azimuthal spin asymmetries in semi-inclusive hadron production in deep inelastic scattering (SIDIS) and Drell-Yan lepton pair production in $pp$ collisions. The scale dependence is evaluated by applying an approximate solution to the Collins-Soper-Sterman (CSS) evolution equation at one-loop order which is adequate for moderate $Q^2$ variations. This describes well the unpolarized cross sections for SIDIS and Drell-Yan process in the $Q^2$ range of 2.4-100 GeV$^2$. A combined analysis of the Sivers asymmetries in SIDIS from HERMES and COMPASS experiments, and the predictions for the Drell-Yan process at RHIC at $\sqrt{S}=200$ GeV are presented. We further extend to the Collins asymmetries and find, for the first time, a consistent description for HERMES/COMPASS and BELLE experiments with the evolution effects. We emphasize an important test of the evolution effects by studying di-hadron azimuthal asymmetry in $e^+e^-$ annihilation at moderate energy range, such as at BEPC at $\sqrt{S}=4.6$ GeV.
Peng Sun
We prove a generalized Gauss-Kuzmin-Lévy theorem for the $p$-numerated generalized Gauss transformation $$T_p(x)=\{\frac{p}{x}\}.$$ In addition, we give an estimate for the constant that appears in the theorem.
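The map $T_p(x)=\{p/x\}$ can be iterated exactly on rational inputs, which makes the associated digit sequence easy to inspect. The sketch below is illustrative only: the function names are hypothetical, and reading the digits as $a_n = \lfloor p / x_{n-1} \rfloor$ is an assumption based on the classical case $p = 1$, where the digits are the ordinary continued fraction partial quotients.

```python
from fractions import Fraction
from math import floor


def gauss_p(x, p):
    """Return T_p(x) = {p / x}, the fractional part of p / x, for x != 0."""
    y = Fraction(p) / Fraction(x)
    return y - floor(y)


def digits(x, p, max_steps=20):
    """Collect a_n = floor(p / x_{n-1}) while iterating x_n = T_p(x_{n-1})."""
    x = Fraction(x)
    out = []
    while x != 0 and len(out) < max_steps:
        y = Fraction(p) / x
        a = floor(y)
        out.append(a)
        x = y - a
    return out


example = digits(Fraction(3, 7), 2)
```

For $x = 3/7$ and $p = 2$ this gives the digits 4 and 3, consistent with $3/7 = 2/(4 + 2/3)$.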
Ya-Peng Hu, Peng Sun, Jian-Hui Zhang
Using the AdS/CFT correspondence, we study the hydrodynamics with conserved current from the dual Maxwell-Gauss-Bonnet gravity. After constructing the perturbative solution to the first order based on the boosted black brane solution in the bulk Maxwell-Gauss-Bonnet gravity, we extract the stress tensor and conserved current of the dual conformal fluid on its boundary, and also find the effect of the Gauss-Bonnet term on the dual conformal fluid. Our results show that the Gauss-Bonnet term can affect parameters such as the shear viscosity $η$, entropy density $s$, thermal conductivity $κ$ and electrical conductivity $σ$. However, it does not affect the so-called Wiedemann-Franz law which relates $κ$ to $σ$, while it affects the ratio $η/s$. In addition, another interesting result is that $η/s$ can also be affected by the bulk Maxwell field in our case, which is consistent with some previous results predicted through the Kubo formula. Moreover, the anomalous magnetic and vortical effects arising from adding the Chern-Simons term are also considered in our case in the Maxwell-Gauss-Bonnet gravity.
Peng Sun
We study the exponential rate of decay of Lebesgue numbers of open covers in topological dynamical systems. We show that topological entropy is bounded by this rate multiplied by dimension. Some corollaries and examples are discussed.
Yang Liu, Peng Sun, Hang Li
By formally defining the training processes of large language models (LLMs), which usually encompass pre-training, supervised fine-tuning, and reinforcement learning with human feedback, within a single and unified machine learning paradigm, we can glean pivotal insights for advancing LLM technologies. This position paper delineates the parallels between the training methods of LLMs and the strategies employed for the development of agents in two-player games, as studied in game theory, reinforcement learning, and multi-agent systems. We propose a re-conceptualization of LLM learning processes in terms of agent learning in language-based games. This framework unveils innovative perspectives on the successes and challenges in LLM development, offering a fresh understanding of how to address alignment issues among other strategic considerations. Furthermore, our two-player game approach sheds light on novel data preparation and machine learning techniques for training LLMs.
Peng Sun, Haoyin Zhou, Devon Lundine, James K. Min, Guanglei Xiong
Recently, machine learning has been successfully applied to model-based left ventricle (LV) segmentation. The general framework involves two stages: LV localization followed by boundary delineation. Both are driven by supervised learning techniques. When compared to previous non-learning-based methods, several advantages have been shown, including full automation and improved accuracy. However, the speed is still slow, on the order of several seconds, for applications involving a large number of cases or case loads requiring real-time performance. In this paper, we propose a fast LV segmentation algorithm that performs joint localization and boundary delineation by training an explicit shape regressor with random pixel difference features. Tested on 3D cardiac computed tomography (CT) image volumes, the average running time of the proposed algorithm is 1.2 milliseconds per case. On a dataset consisting of 139 CT volumes, a 5-fold cross validation shows the segmentation error is $1.21 \pm 0.11$ millimeters for the LV endocardium and $1.23 \pm 0.11$ millimeters for the epicardium. Compared with previous work, the proposed method is more stable (lower standard deviation) without significant compromise to the accuracy.
Peng Sun, Xiangyu Zhang, Duan Wu
Accurate evaluation of user satisfaction is critical for iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics are ambiguous. To bridge this gap, we introduce BoRP (Bootstrapped Regression Probing), a scalable framework for high-fidelity satisfaction evaluation. Unlike generative approaches, BoRP leverages the geometric properties of LLM latent space. It employs a polarization-index-based bootstrapping mechanism to automate rubric generation and utilizes Partial Least Squares (PLS) to map hidden states to continuous scores. Experiments on industrial datasets show that BoRP (Qwen3-8B/14B) significantly outperforms generative baselines (even Qwen3-Max) in alignment with human judgments. Furthermore, BoRP reduces inference costs by orders of magnitude, enabling full-scale monitoring and highly sensitive A/B testing via CUPED.
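CUPED is a standard variance-reduction technique for A/B testing that adjusts a metric using a pre-experiment covariate. A minimal sketch of the usual adjustment, $\hat{Y}_i = Y_i - \theta (X_i - \bar{X})$ with $\theta = \mathrm{Cov}(X, Y) / \mathrm{Var}(X)$, is given below. The function name and the toy numbers are illustrative assumptions, and how BoRP scores enter the adjustment is not specified in the abstract.

```python
from statistics import mean, pvariance


def cuped_adjust(y, x):
    """Return the CUPED-adjusted metric y - theta * (x - mean(x)).

    y: in-experiment metric values, x: pre-experiment covariate values.
    """
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
    var = sum((xi - mx) ** 2 for xi in x) / len(x)
    theta = cov / var
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]


# Toy data (assumed): the covariate x is strongly correlated with the metric y.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.5, 3.5, 6.5, 7.5]
adjusted = cuped_adjust(y, x)
```

The adjustment leaves the mean of the metric unchanged while lowering its variance, which is what makes the downstream A/B comparison more sensitive.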
Peng Sun, Xinyi Shang, Tao Lin, Zhiqiang Shen
Consistency-based generative models like Shortcut and MeanFlow achieve impressive results via a target-aware design for solving the Probability Flow ODE (PF-ODE). Typically, such methods introduce a target time $r$ alongside the current time $t$ to modulate outputs between a local multi-step derivative ($r = t$) and a global few-step integral ($r = 0$). However, the conventional "one input, one output" paradigm enforces a partition of the training budget, often allocating a significant portion (e.g., 75% in MeanFlow) solely to the multi-step objective for stability. This separation forces a trade-off: allocating sufficient samples to the multi-step objective leaves the few-step generation undertrained, which harms convergence and limits scalability. To this end, we propose Duality Models (DuMo) via a "one input, dual output" paradigm. Using a shared backbone with dual heads, DuMo simultaneously predicts velocity $v_t$ and flow-map $u_t$ from a single input $x_t$. This applies geometric constraints from the multi-step objective to every sample, bounding the few-step estimation without separating training objectives, thereby significantly improving stability and efficiency. On ImageNet 256 $\times$ 256, a 679M Diffusion Transformer with SD-VAE achieves a state-of-the-art (SOTA) FID of 1.79 in just 2 steps. Code is available at: https://github.com/LINs-lab/DuMo
Peng Sun, Gabriel Draughon, Jerome Lynch
Coronavirus has been spreading around the world since the end of 2019. The virus can cause acute respiratory syndrome, which can be lethal, and is easily transmitted between hosts. Most states have issued stay-at-home executive orders; however, parks and other public open spaces have largely remained open and are seeing sharp increases in public use. Therefore, in order to ensure public safety, it is imperative for patrons of public open spaces to practice safe hygiene and take preventative measures. This work provides a scalable sensing approach to detect physical activities within public open spaces and monitor adherence to social distancing guidelines suggested by the US Centers for Disease Control and Prevention (CDC). A deep learning-based computer vision sensing framework is designed to investigate the careful and proper utilization of parks and park facilities with hard surfaces (e.g. benches, fence poles, and trash cans) using video feeds from a pre-installed surveillance camera network. The sensing framework consists of a CNN-based object detector, a multi-target tracker, a mapping module, and a group reasoning module. The experiments were carried out during the COVID-19 pandemic between March 2020 and May 2020 across several key locations at the Detroit Riverfront Parks in Detroit, Michigan. The sensing framework is validated by comparing automatic sensing results with manually labeled ground-truth results. The proposed approach significantly improves the efficiency of providing spatial and temporal statistics of users in public open spaces by creating straightforward data visualizations for federal and state agencies. The results can also provide timely triggering information for an alarm or actuator system, which can later be added to intervene in inappropriate behavior during this pandemic.
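Once detections have been mapped to ground-plane coordinates, a distancing check reduces to pairwise distance tests. The sketch below is a simplified illustration, not the paper's group reasoning module: the function name is hypothetical, positions are assumed to be in meters, and the default threshold of 1.83 m (about 6 feet) is an assumption reflecting the commonly cited CDC guidance.

```python
from itertools import combinations
from math import dist


def close_pairs(positions, min_dist=1.83):
    """Return index pairs (i, j), i < j, whose ground-plane distance is below min_dist.

    positions: list of (x, y) coordinates in meters, one per tracked person.
    """
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(positions), 2)
        if dist(p, q) < min_dist
    ]


# Three tracked people (toy coordinates): only the first two stand close together.
violations = close_pairs([(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)])
```

A practical system would first need to decide which people belong to the same household group before flagging a pair, which is the role the abstract assigns to group reasoning.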
Peng Sun
We show that for any topological dynamical system with approximate product property, the set of points whose forward orbits do not accumulate to any point in a large set carries full topological pressure.
Peng Sun, Jiechao Xiong, Lei Han, Xinghai Sun, Shuxing Li, Jiawei Xu, Meng Fang, Zhengyou Zhang
Competitive Self-Play (CSP) based Multi-Agent Reinforcement Learning (MARL) has shown phenomenal breakthroughs recently. Strong AIs have been achieved for several benchmarks, including Dota 2, Glory of Kings, Quake III, and StarCraft II, to name a few. Despite the success, MARL training is extremely data-hungry, typically requiring billions (if not trillions) of frames from the environment in order to learn a high-performance agent. This poses non-trivial difficulties for researchers and engineers and prevents the application of MARL to a broader range of real-world problems. To address this issue, in this manuscript we describe a framework, referred to as TLeague, that aims at large-scale training and implements several mainstream CSP-MARL algorithms. The training can be deployed on either a single machine or a cluster of hybrid machines (CPUs and GPUs), where standard Kubernetes is supported in a cloud-native manner. TLeague achieves high throughput and a reasonable scale-up when performing distributed training. Thanks to the modular design, it is also easy to extend for solving other multi-agent problems or for implementing and verifying MARL algorithms. We present experiments over StarCraft II, ViZDoom and Pommerman to show the efficiency and effectiveness of TLeague. The code is open-sourced and available at https://github.com/tencent-ailab/tleague_projpage
Peng Sun
We show that for every topological dynamical system with the approximate product property, zero topological entropy is equivalent to unique ergodicity. Equivalence of minimality is also proved under a slightly stronger condition. Moreover, we show that unique ergodicity implies the approximate product property if the system has periodic points.