Zhuang Wang, Xinyu Wu, T. S. Eugene Ng
Large-scale distributed training is increasingly becoming communication bound. Many gradient compression algorithms have been proposed to reduce the communication overhead and improve scalability. However, it has been observed that in some cases gradient compression may even harm the performance of distributed training. In this paper, we propose MergeComp, a compression scheduler to optimize the scalability of communication-efficient distributed training. It automatically schedules the compression operations to optimize the performance of compression algorithms without knowledge of model architectures or system parameters. We have applied MergeComp to nine popular compression algorithms. Our evaluations show that MergeComp can improve the performance of compression algorithms by up to 3.83x without losing accuracy. It can even achieve a scaling factor of up to 99% for distributed training over high-speed networks.
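The scheduling idea behind MergeComp, compressing merged groups of gradient tensors rather than each tensor individually so that each compression call amortizes its fixed overhead, can be illustrated with a minimal sketch. The buffer budget, the greedy packing rule, and the top-k compressor below are illustrative assumptions, not the paper's actual implementation:

```python
# Minimal sketch: merge small gradient tensors into buffers before
# compressing, so one compression call covers many tensors.
# The 3-element budget and top-k compressor are illustrative choices.

def merge_tensors(tensors, budget):
    """Greedily pack tensors into merged buffers of at most `budget` elements."""
    buffers, current, size = [], [], 0
    for t in tensors:
        if size + len(t) > budget and current:
            buffers.append(current)
            current, size = [], 0
        current.append(t)
        size += len(t)
    if current:
        buffers.append(current)
    return buffers

def topk_compress(values, k):
    """Keep the k largest-magnitude entries (a common compression scheme)."""
    ranked = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return [(i, values[i]) for i in sorted(ranked[:k])]

grads = [[0.1, -0.5], [2.0], [0.0, 0.3, -1.2]]
buffers = merge_tensors(grads, budget=3)   # two merged buffers instead of three tensors
flat = [v for buf in buffers for t in buf for v in t]
compressed = topk_compress(flat, k=2)      # one compression call over the merged data
```

With merging, three per-tensor compression calls collapse into two (or fewer), which is where the scheduling gains come from.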
Manzi Huang, Xiantao Wang, Zhuang Wang, Zhihao Xu
In this paper, we introduce a class of homeomorphisms between metric spaces, namely locally biHölder continuous mappings. We then establish an embedding result between Besov spaces induced by locally biHölder continuous mappings between Ahlfors regular spaces, which extends the corresponding result of Björn-Björn-Gill-Shanmugalingam (J. Reine Angew. Math. 725: 63-114, 2017). Furthermore, we construct an example to show that our embedding result is strictly more general. We also introduce a geometric condition, termed uniform boundedness, to characterize when a quasisymmetric mapping between uniformly perfect spaces is locally biHölder continuous.
Manzi Huang, Xiantao Wang, Zhuang Wang, Zhihao Xu
In this paper, we establish a quantitative correspondence between power quasi-symmetric mappings on complete metric spaces and rough quasi-isometric mappings on their hyperbolic fillings. In particular, we prove that the exponents in the power quasi-symmetric mappings coincide with the coefficients in the rough quasi-isometric mappings. This shows that the obtained correspondence is both sharp and consistent. In this way, we generalize the corresponding result by Björn, Björn, Gill, and Shanmugalingam (J. Reine Angew. Math., 2017) from the setting of rooted trees to that of hyperbolic fillings.
Pekka Koskela, Tomás Soto, Zhuang Wang
The trace spaces of Sobolev spaces and related fractional smoothness spaces have been an active area of research since the work of Nikolskii, Aronszajn, Slobodetskii, Babich and Gagliardo, among others, in the 1950s. In this paper we review the literature concerning such results for a variety of weighted smoothness spaces. For this purpose, we present a characterization of the trace spaces (of fractional order of smoothness), based on integral averages on dyadic cubes, which is well adapted to extending functions using the Whitney extension operator.
Manzi Huang, Panu Lahti, Jiang Li, Zhuang Wang
For $0<δ,τ<1$ and $1\le s\le \frac{n}{n-δ}$, we prove that for a given $s$-John domain $Ω\subset \mathbb{R}^n$, the following Boxing inequality holds for every Lebesgue measurable set $U\subsetΩ$ with $|U|/|Ω|\leγ<1$: \[ \mathcal{H}^{s(n-δ)}_{\infty}(U\setminus\mathcal{N}_U)\le C(1-δ)\int_Ω\int_{|x-y|<τ\operatorname{dist}(y,\partialΩ)}\frac{|χ_U(x)-χ_U(y)|}{|x-y|^{n+δ}}\,dx\,dy, \] where $\mathcal{H}^{s(n-δ)}_{\infty}(U)$ denotes the $s(n-δ)$-dimensional Hausdorff content of $U$, $\mathcal{N}_U$ is a set of Lebesgue measure zero and the constant $C$ depends only on $n,τ,s,γ$, the John constant and the diameter of $Ω$. Moreover, we establish the functional formulation of the above Boxing inequality and discuss the equivalence between these two formulations. Based on the Boxing inequality, we prove the fractional Poincaré--Wirtinger trace inequality on $s$-John domains, of which the fractional Sobolev--Poincaré inequality and fractional Hardy-type inequality are special cases. Notably, we prove all of the aforementioned inequalities with the Bourgain--Brezis--Mironescu (BBM) factor $1-δ$. Furthermore, with the aid of the Bourgain--Brezis--Mironescu formula, we recover the Poincaré--Wirtinger trace inequality. Finally, by showing that, under the separation property, any domain supporting the Boxing inequality is necessarily a John domain, we conclude that the John domain condition is essentially sharp for the above inequalities. All the above inequalities with the BBM factor are new even for Lipschitz domains.
Zhuang Wang, Haibin Lin, Yibo Zhu, T. S. Eugene Ng
Gradient compression (GC) is a promising approach to addressing the communication bottleneck in distributed deep learning (DDL). However, it is challenging to find the optimal compression strategy for applying GC to DDL because of the intricate interactions among tensors. To fully unleash the benefits of GC, two questions must be addressed: 1) How to express all compression strategies and the corresponding interactions among tensors of any DDL training job? 2) How to quickly select a near-optimal compression strategy? In this paper, we propose ByteComp to answer these questions. It first designs a decision tree abstraction to express all the compression strategies and develops empirical models of the timelines of tensor computation, communication, and compression, enabling ByteComp to derive the intricate interactions among tensors. It then designs a compression decision algorithm that analyzes tensor interactions to eliminate and prioritize strategies and optimally offloads compression to CPUs. Experimental evaluations show that ByteComp can improve the training throughput over the state-of-the-art compression-enabled system by up to 77% for representative DDL training jobs. Moreover, the computational time needed to select the compression strategy is measured in milliseconds, and the selected strategy is within a few percent of optimal.
Zhuang Wang, Zhaozhuo Xu, Jingyi Xi, Yuke Wang, Anshumali Shrivastava, T. S. Eugene Ng
Distributed training is the de facto standard to scale up the training of deep learning models with multiple GPUs. Its performance bottleneck lies in communications for gradient synchronization. Although high tensor sparsity is widely observed, the optimal communication scheme to fully leverage sparsity is still missing. This paper aims to bridge this gap. We first analyze the characteristics of sparse tensors in popular models to understand the fundamentals of sparsity. We then systematically explore the design space of communication schemes for sparse tensors and find the optimal ones. These findings give a new understanding and inspire us to develop a holistic gradient synchronization system called Zen for sparse tensors. We demonstrate that Zen can achieve up to 5.09x speedup in communication time and up to 2.48x speedup in training throughput compared to the state-of-the-art methods.
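The basic trade-off that any sparse-tensor communication scheme must navigate, whether a gradient is cheaper to ship in dense or in index-value form, can be sketched as follows. The 4-byte element sizes and the break-even rule are simplifying assumptions for illustration, not Zen's actual design:

```python
def sparsify(grad):
    """Encode a gradient as (index, value) pairs for its nonzero entries."""
    return [(i, v) for i, v in enumerate(grad) if v != 0.0]

def cheaper_scheme(grad, bytes_per_value=4, bytes_per_index=4):
    """Pick the representation with the smaller payload.

    Dense cost: one value per element.
    Sparse cost: an index plus a value per nonzero.
    """
    dense_bytes = len(grad) * bytes_per_value
    nnz = sum(1 for v in grad if v != 0.0)
    sparse_bytes = nnz * (bytes_per_index + bytes_per_value)
    return "sparse" if sparse_bytes < dense_bytes else "dense"

grad = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, -0.2, 0.0]
# 2 nonzeros out of 8 elements: 16 bytes sparse vs. 32 bytes dense
scheme = cheaper_scheme(grad)
```

At high sparsity the index-value form wins, but the crossover point (here, 50% density) depends on index width, and a full synchronization system must also account for how sparse payloads compose across collective operations.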
Pekka Koskela, Zhuang Wang, Haiqing Xu
Let $Ω$ be an internal chord-arc Jordan domain and $\varphi:\mathbb S\rightarrow\partialΩ$ be a homeomorphism. We show that $\varphi$ has finite dyadic energy if and only if $\varphi$ has a diffeomorphic extension $h: \mathbb D\rightarrow Ω$ which has finite energy.
Pekka Koskela, Zhuang Wang
In this paper, we study function spaces defined via dyadic energies on the boundaries of regular trees. We show that correct choices of dyadic energies result in Besov-type spaces that are trace spaces of (weighted) first order Sobolev spaces.
Manzi Huang, Xiantao Wang, Zhuang Wang, Zhihao Xu
In this paper, we study the traces and the extensions for weighted Sobolev spaces on upper half spaces when the weights reach the borderline cases. We first give a full characterization of the existence of trace spaces for these weighted Sobolev spaces, and then study the trace and extension parts between the weighted Sobolev spaces and a new class of Besov-type spaces (on hyperplanes) defined via integral averages over selected layers of dyadic cubes.
Panu Lahti, Xining Li, Zhuang Wang
We study the boundary traces of Newton-Sobolev, Hajłasz-Sobolev, and BV (bounded variation) functions. Assuming less regularity of the domain than is usual in the literature, we show that all of these function classes achieve the same "boundary values", which in particular implies that the trace spaces coincide provided that they exist. Many of our results seem to be new even in Euclidean spaces, but we work in a more general complete metric space equipped with a doubling measure and supporting a Poincaré inequality.
Zhuang Wang
In this paper, we study the traces of Orlicz-Sobolev spaces on a regular rooted tree. After giving a dyadic decomposition of the boundary of the regular tree, we present a characterization on the trace spaces of those first order Orlicz-Sobolev spaces whose Young function is of the form $t^p\log^λ(e+t)$, based on integral averages on dyadic elements of the dyadic decomposition.
Khanh Ngoc Nguyen, Zhuang Wang
We show that the combination of doubling and a $(1,p)$-Poincaré inequality is equivalent to a version of the $A_p$-condition on rooted $K$-ary trees.
Pekka Koskela, Khanh Ngoc Nguyen, Zhuang Wang
We give characterizations for the existence of traces for first order Sobolev spaces defined on regular trees.
Li Chou, Zichang Liu, Zhuang Wang, Anshumali Shrivastava
With the rapid growth in mobile computing, massive amounts of data and computing resources are now located at the edge. To this end, Federated learning (FL) is becoming a widely adopted distributed machine learning (ML) paradigm, which aims to locally harness this expanding, skewed data in order to develop rich and informative models. In centralized FL, a collection of devices collaboratively solve an ML task under the coordination of a central server. However, existing FL frameworks make an over-simplistic assumption about network connectivity and ignore the communication bandwidth of the different links in the network. In this paper, we present and study a novel FL algorithm, in which devices mostly collaborate with other devices in a pairwise manner. Our nonparametric approach is able to exploit network topology to reduce communication bottlenecks. We evaluate our approach on various FL benchmarks and demonstrate that our method achieves 10x better communication efficiency and around an 8% increase in accuracy compared to the centralized approach.
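The pairwise collaboration pattern described above can be illustrated with a minimal gossip-averaging sketch, where pairs of devices exchange and average their parameters instead of synchronizing through a central server. The scalar models and the fixed pairing schedule are illustrative assumptions, not the paper's algorithm:

```python
def pairwise_average(model_a, model_b):
    """One gossip step: two devices exchange and average their parameters."""
    return [(a + b) / 2 for a, b in zip(model_a, model_b)]

def gossip_round(models, pairs):
    """Apply pairwise averaging over a set of disjoint device pairs."""
    models = list(models)
    for i, j in pairs:
        avg = pairwise_average(models[i], models[j])
        models[i] = avg
        models[j] = list(avg)
    return models

models = [[0.0], [4.0], [8.0]]           # three devices, one parameter each
step1 = gossip_round(models, [(0, 1)])   # devices 0 and 1 meet and average
step2 = gossip_round(step1, [(1, 2)])    # then devices 1 and 2 meet
```

Because each step only needs one link between two devices, the pairing schedule can be chosen to favor high-bandwidth links, which is precisely the degree of freedom a centralized server topology lacks.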
Weitao Wang, Sushovan Das, Xinyu Crystal Wu, Zhuang Wang, Ang Chen, T. S. Eugene Ng
Distributed applications, such as database queries and distributed training, consist of both compute and network tasks. DAG-based abstraction primarily targets compute tasks and has no explicit network-level scheduling. In contrast, Coflow abstraction collectively schedules network flows among compute tasks but lacks the end-to-end view of the application DAG. Because of the dependencies and interactions between these two types of tasks, it is sub-optimal to consider only one of them. We argue that co-scheduling both compute and network tasks can move applications toward globally optimal end-to-end performance. However, none of the existing abstractions provides fine-grained information for co-scheduling. We propose MXDAG, an abstraction that treats both compute and network tasks explicitly. It can capture the dependencies and interactions of both compute and network tasks, leading to improved application performance.
Zaifeng Pan, Yipeng Shen, Zhengding Hu, Zhuang Wang, Aninda Manocha, Zheng Wang, Zhongkai Yu, Yue Guan, Yufei Ding
LLM-based multi-agent simulations are increasingly adopted across application domains, but remain difficult to scale due to GPU memory pressure. Each agent maintains private GPU-resident states, including models, prefix caches, and adapters, which quickly exhaust device memory as the agent count grows. We identify two key properties of these workloads: sparse agent activation and an estimable agent invocation order. Based on an analysis of representative workload classes, we introduce invocation distance, a unified abstraction that estimates the relative order in which agents will issue future LLM requests. Leveraging this abstraction, we present ScaleSim, a memory-efficient LLM serving system for large-scale multi-agent simulations. ScaleSim enables proactive prefetching and priority-based eviction, supports diverse agent-specific memory through a modular interface, and achieves up to 1.74x speedup over SGLang on simulation benchmarks.
Jingwei Zuo, Xinze Feng, Zien Liu, Kaijian Wang, Fanjiang Ye, Ye Cao, Zhuang Wang, Yuke Wang
Low-Rank Adaptation (LoRA) is now the dominant method for parameter-efficient fine-tuning of large language models, but achieving a high-quality adapter often requires systematic hyperparameter tuning because LoRA performance is highly sensitive to configuration choices. In practice, this leads to many concurrent LoRA jobs, often spanning heterogeneous tasks in multi-tenant environments. Existing systems largely handle these jobs independently, which both wastes computation on weak candidates and leaves GPUs underutilized. We present ALTO (Adaptive LoRA Tuning and Orchestration), a co-designed training system that accelerates LoRA hyperparameter tuning while enabling efficient cluster sharing across heterogeneous tasks. The central insight behind ALTO is that when multiple tuning jobs run concurrently over a shared frozen backbone, they expose optimization opportunities that single-job designs cannot exploit. Building on this, ALTO monitors loss trajectories to terminate unpromising configurations early, uses fused grouped GEMM together with a new rank-local adapter parallelism to co-locate surviving adapters and reclaim freed GPU capacity, and combines intra-task and inter-task scheduling to improve multi-task placement by leveraging the predictable duration of LoRA jobs. Extensive evaluation shows that ALTO achieves up to $13.8\times$ speedup over state-of-the-art without sacrificing adapter quality.
Minghao Yan, Zhuang Wang, Zhen Jia, Shivaram Venkataraman, Yida Wang
Low-rank Adaptation (LoRA) has gained popularity as a fine-tuning approach for Large Language Models (LLMs) due to its low resource requirements and good performance. While a plethora of work has investigated improving LoRA serving efficiency by serving multiple LoRAs concurrently, existing methods assume that a wide range of LoRA adapters are available for serving. In our work, we conduct extensive empirical studies to identify that current training paradigms do not utilize hardware resources efficiently and require high overhead to obtain a performant LoRA. Leveraging these insights, we propose PLoRA, which automatically orchestrates concurrent LoRA fine-tuning jobs under given hardware and model constraints and develops performant kernels to improve training efficiency. Our experimental studies show that PLoRA reduces the makespan of LoRA fine-tuning over a given hyperparameter search space by up to 7.52x and improves training throughput by up to 12.8x across a range of state-of-the-art LLMs.
Weicong Su, Zhuang Wang, Yi Ru-Ya Zhang
We investigate the geometric behavior of $τ(E)$ for bounded finite-perimeter sets $E \subset \mathbb R^n$, where $τ(E)$ is the trace constant introduced by Figalli--Maggi--Pratelli [Invent. Math. 2010]. This quantity is a key ingredient in proving a quantitative isoperimetric inequality with the optimal exponent. We first show that for every $ε>0$ one can find a bounded open set $Ω\subset \mathbb R^n$ that is very close to the unit ball $\mathbb B^n$ in the sense that $$ τ(\mathbb B^n)>τ(Ω)>τ(\mathbb B^n)-ε\quad \text{and} \quad P(ΩΔ\mathbb B^n)\le C(n)ε, $$ while at the same time the complement of $Ω$ has infinitely many connected components. Thus, $τ(Ω)$ can be made arbitrarily close to $τ(\mathbb B^n)$ even when $Ω$ has highly intricate geometry. We then establish, under a mild additional hypothesis, the equivalence between a condition formulated in terms of $τ$ and two classical criteria from the literature for open sets that admit trace inequalities. As a consequence, we obtain the John-type characterization of domains that support a trace inequality, assuming the ball separation property.