Jingjun Han, Jihao Liu
For $ε$-lc Fano type varieties $X$ of dimension $d$ and a given finite set $Γ$, we show that there exists a positive integer $m_0$ which depends only on $ε$, $d$, and $Γ$, such that both $|-mK_X-\sum_i\lceil mb_i\rceil B_i|$ and $|-mK_X-\sum_i\lfloor mb_i\rfloor B_i|$ define birational maps for any $m\ge m_0$, provided that the $B_i$ are pseudo-effective Weil divisors, $b_i\inΓ$, and $-(K_X+\sum_ib_iB_i)$ is big. When $Γ\subset[0,1]$ satisfies the DCC but is not finite, we construct an example showing that effective birationality may fail even if $X$ is fixed, the $B_i$ are fixed prime divisors, and $(X,B)$ is $ε'$-lc for some $ε'>0$.
Jingjun Han, Jihao Liu, V. V. Shokurov
We prove the existence of $n$-complements for pairs with DCC coefficients and the ACC for minimal log discrepancies of exceptional singularities. In order to prove these results, we develop the theory of complements for real coefficients. We introduce $(n,Γ_0)$-decomposable $\mathbb{R}$-complements and show their existence for pairs with DCC coefficients.
Jihao Liu, Lingyao Xie
Let $(X\ni x,B)$ be an lc surface germ. If $X\ni x$ is klt, we show that there exists a divisor computing the minimal log discrepancy of $(X\ni x,B)$ that is a Kollár component of $X\ni x$. If $B\not=0$ or $X\ni x$ is not Du Val, we show that any divisor computing the minimal log discrepancy of $(X\ni x,B)$ is a potential lc place of $X\ni x$.
Junpeng Jiao, Jihao Liu, Lingyao Xie
We study the behavior of generalized lc pairs with $\mathbf{b}$-log abundant nef part, a meticulously designed structure on algebraic varieties. We show that this structure is preserved under the canonical bundle formula and sub-adjunction formulas, and is also compatible with the non-vanishing conjecture and the abundance conjecture in the classical minimal model program.
Jingjun Han, Jihao Liu, Yujie Luo
We prove that the ACC conjecture for minimal log discrepancies holds for threefolds in $[1-δ,+\infty)$, where $δ>0$ depends only on the coefficient set. We also study Reid's general elephant for pairs, and show Shokurov's conjecture on the existence of $(ε,n)$-complements for threefolds for any $ε\geq 1$. As a key step, we prove the uniform boundedness of divisors computing minimal log discrepancies for terminal threefolds. We also show the ACC for threefold canonical thresholds, and that the set of accumulation points of threefold canonical thresholds is exactly $\{0\}\cup\{\frac{1}{n}\}_{n\in\mathbb Z_{\ge 2}}$.
Jihao Liu, Fanjun Meng, Lingyao Xie
We show that log canonical thresholds of fixed dimension are standardized. More precisely, we show that any sequence of log canonical thresholds in fixed dimension $d$ accumulates either i) in a way similar to how standard and hyperstandard sets accumulate, or ii) to log canonical thresholds in dimension $\leq d-2$. This provides an accurate description of the infinitesimal structure of the set of log canonical thresholds. We also discuss similar behaviors of minimal log discrepancies, canonical thresholds, and K-semistable thresholds.
Louis Esser, Jihao Liu, Chengxi Wang
We construct exceptional Fano varieties with the smallest known minimal log discrepancies in all dimensions. These varieties are well-formed hypersurfaces in weighted projective space. Their minimal log discrepancies decay doubly exponentially with dimension, and achieve the optimal value in dimension 2.
Jihao Liu, Xin Huang, Jinliang Zheng, Boxiao Liu, Jia Wang, Osamu Yoshie, Yu Liu, Hongsheng Li
This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmentation and summarization. It then matches these instructions with images and uses an open-sourced large language model (LLM) to generate coherent answers for the instruction-image pairs. The LLM is grounded in the detailed text descriptions of images throughout the answer generation process to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to the original LLaVA-1.5 model. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.
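The construction pipeline above (instruction augmentation, instruction-image matching, caption-grounded answer generation) can be summarized in code. The following is a minimal sketch under our own assumptions: `chat`, `embed`, and `caption_dataset` are hypothetical stand-ins for the ChatGPT API, a sentence embedder returning unit-norm vectors, and an image-captioning dataset; none of these names come from the paper's released code.

```python
# Minimal sketch of an MM-Instruct-style data-construction pipeline.
import numpy as np

def augment_instructions(chat, seed_instructions, n_rounds=3):
    """Ask the LLM to paraphrase/extend a small seed set into a diverse pool."""
    pool = list(seed_instructions)
    for _ in range(n_rounds):
        prompt = ("Write new, diverse visual instructions in the style of:\n"
                  + "\n".join(pool[-10:]))
        pool.extend(line for line in chat(prompt).splitlines() if line.strip())
    return pool

def match_instruction(embed, instruction_pool, caption):
    """Pick the instruction closest to the image caption in embedding space.

    Assumes `embed` returns unit-norm vectors, so the dot product is
    cosine similarity."""
    cap_vec = embed(caption)
    sims = [np.dot(embed(ins), cap_vec) for ins in instruction_pool]
    return instruction_pool[int(np.argmax(sims))]

def build_dataset(chat, embed, instruction_pool, caption_dataset):
    data = []
    for image, caption in caption_dataset:
        instruction = match_instruction(embed, instruction_pool, caption)
        # Ground the answer in the detailed caption so it stays aligned
        # with the actual image content.
        answer = chat(f"Image description: {caption}\n"
                      f"Instruction: {instruction}\nAnswer:")
        data.append({"image": image, "instruction": instruction,
                     "answer": answer})
    return data
```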
Jihao Liu, Fanjun Meng, Lingyao Xie
For lc algebraically integrable foliations on klt varieties, we prove the base-point-freeness theorem, the contraction theorem, and the existence of flips. The first result resolves a conjecture of Cascini and Spicer, while the latter two results strengthen a result of Cascini and Spicer by removing their assumption on the termination of flips. Moreover, we prove the existence of the minimal model program for lc algebraically integrable foliations on klt varieties and the existence of good minimal models or Mori fiber spaces for lc algebraically integrable foliations polarized by ample divisors on klt varieties. As a consequence, we show that $\mathbb{Q}$-factorial klt varieties with lc algebraically integrable Fano foliation structures are Mori dream spaces. We also show the existence of a Shokurov-type polytope for lc algebraically integrable foliations.
Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvarez
This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on the visual information available at the moment a question is posed, resulting in significant delays as the model remains unaware of subsequent changes in the streaming video. StreamChat addresses this limitation by innovatively updating the visual context at each decoding step, ensuring that the model utilizes up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results demonstrate that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.
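To make the core idea concrete, here is a minimal sketch of a greedy decoding loop that refreshes the visual context at every step, so the generated answer can track frames arriving after the question. `model`, `encode_frames`, and `stream` are hypothetical placeholders following a Hugging Face-style interface; the actual StreamChat implementation uses an efficient cross-attention architecture rather than the full re-encoding shown here.

```python
# Minimal sketch of streaming-aware decoding, not the StreamChat API.
import torch

@torch.no_grad()
def stream_decode(model, tokenizer, encode_frames, stream, question,
                  max_new_tokens=64):
    """Greedy decoding that re-reads the video stream at every step."""
    tokens = tokenizer(question, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Refresh the visual context so tokens generated later can depend
        # on frames that arrived after the question was posed.
        visual_ctx = encode_frames(stream.latest_frames())
        logits = model(input_ids=tokens, visual_context=visual_ctx).logits
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        if next_tok.item() == tokenizer.eos_token_id:
            break
        tokens = torch.cat([tokens, next_tok], dim=-1)
    return tokenizer.decode(tokens[0], skip_special_tokens=True)
```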
Jihao Liu, Zheng Xu
Assuming the abundance conjecture in dimension $d$, we establish a non-algebraicity criterion for foliations: any log canonical foliation of rank $\le d$ with $ν\neqκ$ is not algebraically integrable, answering a question of Ambro--Cascini--Shokurov--Spicer. Under the same hypothesis, we prove abundance for klt algebraically integrable adjoint foliated structures of dimension $\le d$ and show the existence of good minimal models or Mori fiber spaces. In particular, when $d=3$, all these results hold unconditionally. Using similar arguments, we solve a problem proposed by Lu and Wu on the abundance of surface adjoint foliated structures that are not necessarily algebraically integrable.
Zhengyu Hu, Jihao Liu
We introduce linearly decomposable (LD) generalized pairs, which serve as a workable substitute for rational decompositions in the non-NQC setting. Using LD generalized pairs, together with a refinement of special termination and Kollár-type gluing theory, we prove the existence of flips for log canonical generalized pairs without assuming the klt condition, the NQC condition, or $\mathbb Q$-factoriality. Together with the cone and contraction theorems, this yields the existence of the minimal model program for arbitrary log canonical generalized pairs.
Jihao Liu, Xin Huang, Guanglu Song, Hongsheng Li, Yu Liu
Recently, transformer and multi-layer perceptron (MLP) architectures have achieved impressive results on various vision tasks. However, how to effectively combine these operators to form high-performance hybrid visual architectures remains a challenge. In this work, we study the learnable combination of convolution, transformer, and MLP by proposing a novel unified architecture search approach. Our approach contains two key designs to achieve the search for high-performance networks. First, we model the very different searchable operators in a unified form, which enables the operators to be characterized with the same set of configuration parameters. In this way, the overall search space size is significantly reduced, and the total search cost becomes affordable. Second, we propose context-aware downsampling modules (DSMs) to mitigate the gap between the different types of operators. Our proposed DSMs are able to better adapt features from different types of operators, which is important for identifying high-performance hybrid architectures. Finally, we integrate the configurable operators and DSMs into a unified search space and search with a reinforcement learning-based algorithm to fully explore the optimal combination of the operators. Using this approach, we search for a baseline network and scale it up to obtain a family of models, named UniNets, which achieve much better accuracy and efficiency than previous ConvNets and Transformers. In particular, our UniNet-B5 achieves 84.9% top-1 accuracy on ImageNet, outperforming EfficientNet-B7 and BoTNet-T7 with 44% and 55% fewer FLOPs, respectively. By pretraining on ImageNet-21K, our UniNet-B6 achieves 87.4%, outperforming Swin-L with 51% fewer FLOPs and 41% fewer parameters. Code is available at https://github.com/Sense-X/UniNet.
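As an illustration of "modeling very different searchable operators in a unified form", the sketch below instantiates a convolution, transformer, or MLP block from one shared configuration tuple. The names and exact parameterization are our own guess, not the released UniNet code, and the paper's context-aware DSMs (which adapt feature layouts between operator types) are omitted.

```python
# Illustrative sketch: one config tuple selects and parameterizes a block.
import torch.nn as nn

def make_block(op_type, dim, expansion=4, kernel=3, heads=4):
    hidden = dim * expansion
    if op_type == "conv":
        # Expects (B, C, H, W) feature maps.
        return nn.Sequential(
            nn.Conv2d(dim, hidden, kernel, padding=kernel // 2),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )
    if op_type == "transformer":
        # Expects (B, N, C) token sequences; bridging the layout gap to the
        # conv path is what the paper's DSMs handle.
        return nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=hidden,
            batch_first=True)
    if op_type == "mlp":
        return nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
    raise ValueError(f"unknown operator type: {op_type}")
```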
Bingyi Chen, Jihao Liu, Lingyao Xie
We establish the Kodaira vanishing theorem and the Kawamata-Viehweg vanishing theorem for lc generalized pairs. As a consequence, we provide a new proof of the base-point-freeness theorem for lc generalized pairs. This new approach allows us to prove the contraction theorem for lc generalized pairs without using Kollár's gluing theory.
Jihao Liu, Lingyao Xie
We show that $|mK_X|$ defines a birational map and has no fixed part for some bounded positive integer $m$ for any $\frac{1}{2}$-lc surface $X$ such that $K_X$ is big and nef. For every positive integer $n\geq 3$, we construct a sequence of projective surfaces $X_{n,i}$, such that $K_{X_{n,i}}$ is ample, ${\rm{mld}}(X_{n,i})>\frac{1}{n}$ for every $i$, $\lim_{i\rightarrow+\infty}{\rm{mld}}(X_{n,i})=\frac{1}{n}$, and for any positive integer $m$, there exists $i$ such that $|mK_{X_{n,i}}|$ has non-zero fixed part. These results answer the surface case of a question of Xu.
Jihao Liu, Boxiao Liu, Hongsheng Li, Yu Liu
Recent studies have pointed out that knowledge distillation (KD) suffers from two degradation problems, the teacher-student gap and the incompatibility with strong data augmentations, making it inapplicable to training state-of-the-art models, which are trained with advanced augmentations. However, we observe that a key factor, i.e., the temperatures in the softmax functions for generating probabilities of both the teacher and student models, was mostly overlooked in previous methods. With properly tuned temperatures, such degradation problems of KD can be much mitigated. Instead of relying on a naive grid search, which shows poor transferability, we propose Meta Knowledge Distillation (MKD) to meta-learn the distillation with learnable meta temperature parameters. The meta parameters are adaptively adjusted during training according to the gradients of the learning objective. We validate that MKD is robust to different dataset scales, different teacher/student architectures, and different types of data augmentation. With MKD, we achieve the best performance with popular ViT architectures, ranging from tiny to large models, among compared methods that use only ImageNet-1K as training data. With ViT-L, we achieve 86.5% top-1 accuracy with 600 epochs of training, 0.6% better than MAE, which trains for 1,650 epochs.
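The central ingredient, distillation temperatures treated as learnable parameters rather than grid-searched constants, can be sketched as follows. This is a minimal PyTorch illustration under our own assumptions; the actual MKD meta-gradient update of the temperatures with respect to the learning objective is omitted.

```python
# Minimal sketch of KD with learnable temperatures, not the MKD codebase.
import torch
import torch.nn.functional as F

class KDLoss(torch.nn.Module):
    def __init__(self, init_t_teacher=4.0, init_t_student=4.0):
        super().__init__()
        # Temperatures are parameters, so they receive gradients and can be
        # adapted during training instead of being fixed by grid search.
        self.t_teacher = torch.nn.Parameter(torch.tensor(init_t_teacher))
        self.t_student = torch.nn.Parameter(torch.tensor(init_t_student))

    def forward(self, student_logits, teacher_logits):
        p_teacher = F.softmax(teacher_logits / self.t_teacher, dim=-1)
        log_p_student = F.log_softmax(student_logits / self.t_student, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```

In MKD the temperatures would be driven by a meta-learning loop on the learning objective rather than by the same optimizer step as the student weights; the sketch only shows how they enter the loss.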
Jihao Liu, Xin Huang, Jinliang Zheng, Yu Liu, Hongsheng Li
In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method that is applicable to various hierarchical Vision Transformers. Existing masked image modeling (MIM) methods for hierarchical Vision Transformers replace a random subset of input tokens with a special [MASK] symbol and aim at reconstructing the original image tokens from the corrupted image. However, we find that using the [MASK] symbol greatly slows down the training and causes pretraining-finetuning inconsistency, due to the large masking ratio (e.g., 60% in SimMIM). On the other hand, MAE does not introduce [MASK] tokens at its encoder at all but is not applicable to hierarchical Vision Transformers. To solve the issue and accelerate the pretraining of hierarchical models, we replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image. We then conduct dual reconstruction to reconstruct the two original images from the mixed input, which significantly improves efficiency. While MixMAE can be applied to various hierarchical Transformers, this paper explores using Swin Transformer with a large window size and scales up to a huge model size (reaching 600M parameters). Empirical results demonstrate that MixMAE can learn high-quality visual representations efficiently. Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600 epochs. Moreover, its transfer performance on 6 other datasets shows that MixMAE has a better FLOPs/performance tradeoff than previous popular MIM methods. Code is available at https://github.com/Sense-X/MixMIM.
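The mixing and dual-reconstruction steps can be summarized in a short sketch. The tensor layout (batch, tokens, channels) and the helper names below are our own assumptions, not the released MixMIM code.

```python
# Minimal sketch of MixMAE-style mixing and dual reconstruction.
import torch

def mix_tokens(tokens_a, tokens_b, mask):
    """Fill the masked positions of image A with visible tokens of image B.

    tokens_a, tokens_b: (B, N, C); mask: (B, N) bool, True where A is masked.
    The encoder therefore never sees a [MASK] symbol."""
    m = mask.unsqueeze(-1)                      # (B, N, 1), broadcasts over C
    return torch.where(m, tokens_b, tokens_a)

def dual_reconstruction_loss(decoder, mixed_latent, target_a, target_b, mask):
    pred = decoder(mixed_latent)                # per-position predictions
    # Image A is supervised on its masked positions; image B on the rest,
    # since B's visible tokens occupied exactly the positions masked in A.
    loss_a = ((pred - target_a) ** 2)[mask].mean()
    loss_b = ((pred - target_b) ** 2)[~mask].mean()
    return loss_a + loss_b
```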
Jingjun Han, Jihao Liu, Joaquín Moraga
In this paper we study $(ε,δ)$-lc singularities, i.e., $ε$-lc singularities admitting a $δ$-plt blow-up. We prove that $n$-dimensional $(ε,δ)$-lc singularities are bounded up to a deformation, and that $2$-dimensional $(ε,δ)$-lc singularities form a bounded family. Furthermore, we give an example which shows that $(ε,δ)$-lc singularities are not bounded in higher dimensions, even in the analytic sense.
Jihao Liu, Ming Zhang, Yangting Sun, Boxiao Liu, Guanglu Song, Yu Liu, Hongsheng Li
Reinforcement learning (RL)-based neural architecture search (NAS) generally guarantees better convergence yet suffers from the requirement of huge computational resources compared with gradient-based approaches, due to the rollout bottleneck: exhaustive training of each sampled generation on proxy tasks. In this paper, we propose a general pipeline to accelerate the convergence of both the rollout process and the RL process in NAS. It is motivated by the interesting observation that both architecture knowledge and parameter knowledge can be transferred between different experiments and even different tasks. We first introduce an uncertainty-aware critic (value function) in Proximal Policy Optimization (PPO) to utilize the architecture knowledge from previous experiments, which stabilizes the training process and reduces the searching time by a factor of 4. Further, an architecture knowledge pool together with a block similarity function is proposed to utilize parameter knowledge, reducing the searching time by a further factor of 2. This is the first work to introduce block-level weight sharing in RL-based NAS, and the block similarity function guarantees a 100% hitting ratio with strict fairness. Besides, we show that a simply designed off-policy correction factor used in the replay buffer of RL optimization can further halve the searching time. Experiments on the Mobile Neural Architecture Search (MNAS) search space show that the proposed Fast Neural Architecture Search (FNAS) accelerates the standard RL-based NAS process by ~10x (e.g. ~256 2x2 TPUv2 x days / 20,000 GPU x hour -> 2,000 GPU x hour for MNAS), and guarantees better performance on various vision tasks.
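To illustrate block-level weight sharing gated by a similarity check, here is a minimal sketch. The exact-match rule below is our own simplified reading of how a block similarity function can guarantee that reused weights come only from identically configured blocks, and the dataclass fields are a guess at an MNAS-style block encoding, not the FNAS implementation.

```python
# Illustrative sketch of an architecture knowledge pool with a strict
# (exact-match) block similarity function.
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockCfg:
    op: str          # e.g. "mbconv"
    kernel: int
    expansion: int
    stride: int
    channels: int

class KnowledgePool:
    def __init__(self):
        self._pool = {}  # BlockCfg -> trained weights

    def lookup(self, cfg: BlockCfg):
        # Exact-match similarity: only identically configured blocks share
        # parameters, so reuse never mixes weights of different blocks.
        return self._pool.get(cfg)

    def store(self, cfg: BlockCfg, weights):
        self._pool[cfg] = weights
```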
Jihao Liu
In this paper we show that any two birational Mori fiber spaces of $\mathbb{Q}$-factorial gklt g-pairs are connected by a finite sequence of Sarkisov links.