Li Wang, Lei Zhu, En Yu, Jiande Sun, Huaxiang Zhang
Deep hashing has recently received attention in cross-modal retrieval for its impressive advantages. However, existing hashing methods for cross-modal retrieval cannot fully capture the heterogeneous multi-modal correlation and exploit the semantic information. In this paper, we propose a novel \emph{Fusion-supervised Deep Cross-modal Hashing} (FDCH) approach. Firstly, FDCH learns unified binary codes through a fusion hash network with paired samples as input, which effectively enhances the modeling of the correlation of heterogeneous multi-modal data. Then, these high-quality unified hash codes further supervise the training of the modality-specific hash networks for encoding out-of-sample queries. Meanwhile, both pair-wise similarity information and classification information are embedded in the hash networks under one stream framework, which simultaneously preserves cross-modal similarity and keeps semantic consistency. Experimental results on two benchmark datasets demonstrate the state-of-the-art performance of FDCH.
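The retrieval step that such hash codes enable can be sketched as follows. This is a minimal illustration, not the FDCH implementation: the real-valued vectors stand in for the outputs of a hash network, and the helper names `binarize`, `hamming`, and `rank_by_hamming` are hypothetical.

```python
def binarize(features):
    """Map a real-valued vector to a {-1, +1} hash code with the sign function."""
    return [1 if x >= 0 else -1 for x in features]


def hamming(a, b):
    """Count the positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def rank_by_hamming(query_code, database_codes):
    """Return database indices ordered from nearest to farthest in Hamming distance."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )


# Made-up network outputs: three database items and one query from another modality.
database = [
    binarize([0.9, -0.2, 0.4, -0.7]),
    binarize([-0.5, 0.3, -0.1, 0.8]),
    binarize([0.2, -0.6, -0.3, -0.1]),
]
query = binarize([0.7, -0.1, 0.5, -0.9])
ranking = rank_by_hamming(query, database)
```

Because the codes are binary, ranking needs only bit comparisons, which is where the storage and speed advantages of hashing come from.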
Lei Zhu, Fernando Soldevila, Claudio Moretti, Alexandra d'Arco, Antoine Boniface, Xiaopeng Shao, Hilton B. de Aguiar, Sylvain Gigan
Non-invasive optical imaging techniques are essential diagnostic tools in many fields. Although various recent methods have been proposed to utilize and control light in multiple scattering media, non-invasive optical imaging through and inside scattering layers across a large field of view remains elusive due to the physical limits set by the optical memory effect, especially without wavefront shaping techniques. Here, we demonstrate an approach that enables non-invasive fluorescence imaging behind scattering layers with fields of view extending well beyond the optical memory effect. The method consists of demixing the speckle patterns emitted by a fluorescent object under variable unknown random illumination, using matrix factorization and a novel fingerprint-based reconstruction. Experimental validation shows the efficiency and robustness of the method with various fluorescent samples, covering a field of view up to three times the optical memory effect range. Our non-invasive imaging technique is simple, requires neither a spatial light modulator nor a guide star, and can be generalized to a wide range of incoherent contrast mechanisms and illumination schemes.
Lei Zhu, Zhaojing Luo, Wei Wang, Meihui Zhang, Gang Chen, Kaiping Zheng
Deep learning models usually require a large amount of labeled data to achieve satisfactory performance. In multimedia analysis, domain adaptation studies the problem of cross-domain knowledge transfer from a label-rich source domain to a label-scarce target domain, thus potentially alleviating the annotation requirement for deep learning models. However, we find that contemporary domain adaptation methods for cross-domain image understanding perform poorly when the source domain is noisy. Weakly Supervised Domain Adaptation (WSDA) studies the domain adaptation problem under the scenario where source data can be noisy. Prior methods on WSDA remove noisy source data and align the marginal distribution across domains without considering the fine-grained semantic structure in the embedding space, which leads to class misalignment, e.g., features of cats in the target domain might be mapped near features of dogs in the source domain. In this paper, we propose a novel method, termed Noise Tolerant Domain Adaptation, for WSDA. Specifically, we adopt the cluster assumption and learn clusters discriminatively with class prototypes in the embedding space. We propose to leverage the location information of the data points in the embedding space and model the location information with a Gaussian mixture model to identify noisy source data. We then design a network which incorporates the Gaussian mixture noise model as a sub-module for unsupervised noise removal and propose a novel cluster-level adversarial adaptation method which aligns unlabeled target data with the less noisy class prototypes for mapping the semantic structure across domains. We conduct extensive experiments to evaluate the effectiveness of our method on both general images and medical images from COVID-19 and e-commerce datasets. The results show that our method significantly outperforms state-of-the-art WSDA methods.
Yudong Han, Lei Zhu, Zhiyong Cheng, Jingjing Li, Xiaobai Liu
Graph-based clustering is one of the major clustering methods. Most such methods work in three separate steps: similarity graph construction, cluster label relaxation, and label discretization with k-means. This common practice has three disadvantages: 1) the predefined similarity graph is often fixed and may not be optimal for the subsequent clustering; 2) the relaxation of cluster labels may cause significant information loss; 3) label discretization may deviate from the real clustering result, since k-means is sensitive to the initialization of cluster centroids. To tackle these problems, in this paper, we propose an effective discrete optimal graph clustering (DOGC) framework. A structured similarity graph that is theoretically optimal for clustering performance is adaptively learned under the guidance of a reasonable rank constraint. Besides, to avoid the information loss, we explicitly enforce a discrete transformation on the intermediate continuous labels, which yields a tractable optimization problem with a discrete solution. Further, to compensate for the unreliability of the learned labels and enhance the clustering accuracy, we design an adaptive robust module that learns a prediction function for unseen data based on the learned discrete cluster labels. Finally, an iterative optimization strategy with guaranteed convergence is developed to directly obtain the clustering results. Extensive experiments conducted on both real and synthetic datasets demonstrate the superiority of our proposed methods compared with several state-of-the-art clustering approaches.
Lei Zhu, Lijian Lin, Ye Zhu, Jiahao Wu, Xuehan Hou, Yu Li, Yunfei Liu, Jie Chen
Current audio-driven 3D head generation methods mainly focus on single-speaker scenarios, lacking natural, bidirectional listen-and-speak interaction. Achieving seamless conversational behavior, where speaking and listening states transition fluidly, remains a key challenge. Existing 3D conversational avatar approaches rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics. To address these limitations, we introduce a novel two-stage framework, MANGO, which leverages pure image-level supervision through alternate training to mitigate the noise introduced by pseudo-3D labels, thereby achieving better alignment with real-world conversational behaviors. Specifically, in the first stage, a diffusion-based transformer with a dual-audio interaction module models natural 3D motion from multi-speaker audio. In the second stage, we use a fast 3D Gaussian Renderer to generate high-fidelity images and provide 2D-level photometric supervision for the 3D motions through alternate training. Additionally, we introduce MANGO-Dialog, a high-quality dataset with over 50 hours of aligned 2D-3D conversational data across 500+ identities. Extensive experiments demonstrate that our method achieves exceptional accuracy and realism in modeling two-person 3D dialogue motion, significantly advancing the fidelity and controllability of audio-driven talking heads.
Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau
Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per-pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel \textbf{at different granularity levels} instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the grid for local features, and MHSAs to the slots for global features. A pair of fully differentiable soft clustering and dispatching modules is introduced to bridge the grid and set representations, thus enabling local-global fusion. Through extensive experiments on various vision tasks, we empirically verify the potential of the proposed integration scheme, named \textit{GLMix}: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few (e.g., 64) semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient. Our visualization results also demonstrate that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may induce better interpretability and inspire new weakly-supervised semantic segmentation approaches. Code will be available at \url{https://github.com/rayleizhu/GLMix}.
Lei Zhu, Hui Cui, Zhiyong Cheng, Jingjing Li, Zheng Zhang
Social networks store and disseminate a tremendous amount of user-shared images. Deep hashing is an efficient indexing technique to support large-scale social image retrieval, due to its deep representation capability, fast retrieval speed and low storage cost. In particular, unsupervised deep hashing scales well, as it does not require any manually labelled data for training. However, owing to the lack of label guidance, existing methods suffer from a severe semantic shortage when optimizing a large number of deep neural network parameters. In contrast, in this paper, we propose a Dual-level Semantic Transfer Deep Hashing (DSTDH) method to alleviate this problem with a unified deep hash learning framework. Our model aims to learn semantically enhanced deep hash codes by specifically exploiting the user-generated tags associated with the social images. Specifically, we design a complementary dual-level semantic transfer mechanism to efficiently discover the potential semantics of tags and seamlessly transfer them into binary hash codes. On the one hand, instance-level semantics are directly preserved in the hash codes from the associated tags, with adverse noise removed. On the other hand, an image-concept hypergraph is constructed for indirectly transferring the latent high-order semantic correlations of images and tags into hash codes. Moreover, the hash codes are obtained simultaneously with the deep representation learning by the discrete hash optimization strategy. Extensive experiments on two public social image retrieval datasets validate the superior performance of our method compared with state-of-the-art hashing methods. The source code of our method can be obtained at https://github.com/research2020-1/DSTDH
Fengling Li, Tong Wang, Lei Zhu, Zheng Zhang, Xinhua Wang
Supervised cross-modal hashing aims to embed the semantic correlations of heterogeneous modality data into binary hash codes with discriminative semantic labels. Because of its advantages in retrieval and storage efficiency, it is widely used for efficient cross-modal retrieval. However, existing methods handle the different tasks of cross-modal retrieval equally, and simply learn the same pair of hash functions for them in a symmetric way. Under such circumstances, the uniqueness of the different cross-modal retrieval tasks is ignored, which may lead to sub-optimal performance. Motivated by this, we present a Task-adaptive Asymmetric Deep Cross-modal Hashing (TA-ADCMH) method in this paper. It can learn task-adaptive hash functions for two sub-retrieval tasks via simultaneous modality representation and asymmetric hash learning. Unlike previous cross-modal hashing approaches, our learning framework jointly optimizes semantic preserving, which transforms deep features of multimedia data into binary hash codes, and semantic regression, which directly regresses the query modality representation to explicit labels. With our model, the binary codes can effectively preserve semantic correlations across different modalities while adaptively capturing the query semantics. The superiority of TA-ADCMH is demonstrated on two standard datasets from many aspects.
Lei Zhu, Yuxiang Wu, Jietao Liu, Tengfei Wu, Lixian Liu, Xiaopeng Shao
Light passing through scattering media is strongly scattered and diffused into a complex speckle pattern, which nevertheless contains almost all the spatial and color information of the objects. Although various technologies have been proposed to realize color imaging through scattering media, current technologies are still complex, requiring a long sequence of measurements for each imaging pixel or the spectral point spread functions of the optical system. Here we theoretically prove that the spatial averaging of the triple correlation technique can be used to retrieve the Fourier phase of the object, and experimentally demonstrate that it can be applied to color imaging through scattering media. Compared to other phase retrieval techniques, phase retrieval with the triple correlation technique retains the orientation information of objects and can compose a color image without a rotation operation. Furthermore, our approach has the potential to realize spectral imaging through scattering media.
Lei Zhu, Qi She, Lidan Zhang, Ping Guo
Nonlocal-based blocks are designed for capturing long-range spatial-temporal dependencies in computer vision tasks. Although they have shown excellent performance, they lack a mechanism to encode the rich, structured information among elements in an image. In this paper, to theoretically analyze the properties of these nonlocal-based blocks, we provide a unified approach to interpreting them, where we view them as a graph filter generated on a fully-connected graph. When the graph filter is approximated by Chebyshev polynomials, a generalized formulation can be derived for explaining the existing nonlocal-based blocks (e.g., nonlocal block, nonlocal stage, double attention block). Furthermore, we propose an efficient and robust spectral nonlocal block, which can be flexibly inserted into deep neural networks to capture the long-range dependencies between spatial pixels or temporal frames. Experimental results demonstrate the clear-cut improvements and practical applicability of the spectral nonlocal block on image classification (CIFAR-10/100, ImageNet), fine-grained image classification (CUB-200), action recognition (UCF-101), and person re-identification (iLIDS-VID, MARS, PRID-2011) tasks.
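For orientation, the standard Chebyshev approximation of a graph filter can be written as below. This is the textbook form rather than the paper's exact formulation; the symbols $\tilde{L}$ (a graph Laplacian rescaled so its spectrum lies in $[-1, 1]$), $\theta_k$ (learnable coefficients), $X$ (input features), and $K$ (polynomial order) are assumed notation.

```latex
% Assumed notation: \tilde{L} rescaled Laplacian, \theta_k coefficients, X features, K order.
\[
\begin{aligned}
g_\theta(\tilde{L})\,X &\approx \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, X, \\
T_0(\tilde{L}) &= I, \qquad T_1(\tilde{L}) = \tilde{L}, \qquad
T_k(\tilde{L}) = 2\tilde{L}\,T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L}).
\end{aligned}
\]
```

Truncating the sum at a small $K$ keeps the filter cheap to evaluate, since each term only needs a matrix product with the previous two.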
Lei Zhu, Hangzhou He, Xinliang Zhang, Qian Chen, Shuang Zeng, Qiushi Ren, Yanye Lu
End-to-end weakly supervised semantic segmentation aims at optimizing a segmentation model in a single-stage training process based on only image annotations. Existing methods adopt an online-trained classification branch to provide pseudo annotations for supervising the segmentation branch. However, this strategy makes the classification branch dominate the whole concurrent training process, hindering these two branches from assisting each other. In our work, we treat these two branches equally by viewing them as diverse ways to generate the segmentation map, and add interactions on both their supervision and operation to achieve mutual promotion. For this purpose, a bidirectional supervision mechanism is designed to enforce consistency between the outputs of these two branches. Thus, the segmentation branch can also give feedback to the classification branch to enhance the quality of localization seeds. Moreover, our method includes interaction operations between these two branches so that they can exchange knowledge and assist each other. Experiments indicate that our work outperforms existing end-to-end weakly supervised segmentation methods.
Lei Zhu, M. C. H. Wright, Jun-Hui Zhao, Yuefang Wu
Dec 14, 2009·astro-ph.SR We report new results from CARMA observations of both continuum and HCO+(1-0) emission at 3.4 mm from W3-SE, a molecular core of intermediate mass, together with continuum observations at 1.1 and 0.85/0.45 mm with the SMA and JCMT. A continuum emission core elongated from SE to NW (~10") has been observed and further resolved into a double source with the SMA at 1.1 mm, with a separation of ~4". Together with the measurements from Spitzer and MSX in the mid-IR, we determined the SED of W3-SE and fit it with a thermal dust emission model, suggesting the presence of two dust components with different temperatures. The emission at mm/submm wavelengths is dominated by a major cold component (~41 K) with a mass of ~65 Msun. In addition, there is a weaker hot component (~400 K) which accounts for emission in the mid-IR, suggesting that a small fraction of dust has been heated by newly formed stars. We also imaged the molecular core in the HCO+(1-0) line using CARMA at an angular resolution of ~6". With the CARMA observations, we have verified the presence of a blue-dominated double-peak profile toward this core. The line profile cannot be explained by infall alone. The broad velocity wings of the line profile suggest that other kinematics, such as outflows within the central 6" of the core, likely dominate the resulting spectrum. The kinematics of the sub-structures of this core suggest that the molecular gas outside the main component is dominated by the bipolar outflow originating from the dust core, with a dynamical age of >30000 yr. Our analysis, based on observations at wavelengths from mm/submm to mid-IR, suggests that the molecular core W3-SE hosts a group of newly formed young stars and protostars.
Lei Zhu
Let $V$ be a smooth projective 3-fold of general type. Denote by $K^{3}$, a rational number, the self-intersection of the canonical sheaf of any minimal model of $V$. One defines $K^{3}$ as the canonical volume of $V$. The paper is devoted to proving the sharp lower bound $K^{3}\ge {1/420}$, which is attained by the example $X_{46}\subseteq \mathbb{P}(4,5,6,7,23)$.
Lei Zhu, Zhihao Yan, Hongbo Duan, Yongyang Cai, Xiaobing Zhang
Global cooperation is posited as a pivotal solution to address climate change, yet significant barriers, such as free-riding, hinder its realization. This paper develops a dynamic game-theoretic model to analyze the stability of coalitions under multiple stochastic climate tipping events, and a technology-sharing mechanism is designed in the model to combat free-riding. Our results reveal that coalitions tend to shrink over time as temperatures rise, owing to potential free-riding, even when the initial coalition is large. The threat of climate tipping reduces the size of stable coalitions compared to the case where tipping is ignored. However, in the post-tipping period, coalitions temporarily expand as regions respond to the shock, though this cooperation is short-lived and followed by further shrinkage. Notably, technology-sharing generates greater collective benefits than sanctions, suggesting that the proposed dynamic technology-sharing pathway bolsters coalition resilience against free-riding while limiting global warming. This framework highlights the critical role of technology-sharing in fostering long-term climate cooperation under climate tipping uncertainties.
Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau
A practical large language model (LLM) service may involve a long system prompt, which specifies the instructions, examples, and knowledge documents of the task and is reused across requests. However, the long system prompt causes throughput/latency bottlenecks as the cost of generating the next token grows w.r.t. the sequence length. This paper aims to improve the efficiency of LLM services that involve long system prompts. Our key observation is that handling these system prompts requires heavily redundant memory accesses in existing causal attention computation algorithms. Specifically, for batched requests, the cached hidden states (i.e., key-value pairs) of system prompts are transferred from off-chip DRAM to on-chip SRAM multiple times, each corresponding to an individual request. To eliminate such a redundancy, we propose RelayAttention, an attention algorithm that allows reading these hidden states from DRAM exactly once for a batch of input tokens. RelayAttention is a free lunch: it maintains the generation quality while requiring no model retraining, as it is based on a mathematical reformulation of causal attention. We have observed significant performance improvements to a production-level system, vLLM, through integration with RelayAttention. The improvements are even more profound with longer system prompts.
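One way such a reformulation can work is sketched below, under stated assumptions: softmax attention over a concatenated key-value cache equals a convex combination of attention over the shared system-prompt segment and attention over the request-specific segment, weighted by each segment's softmax mass. This is a generic identity illustrated in plain Python, not the RelayAttention kernel itself; the function names and the toy numbers are hypothetical.

```python
import math


def attention(q, keys, values):
    """Softmax attention for one query; also returns the log-sum-exp of the scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merged_attention(q, sys_kv, req_kv):
    """Combine attention over the system-prompt KV and the per-request KV."""
    out_sys, lse_sys = attention(q, *sys_kv)
    out_req, lse_req = attention(q, *req_kv)
    # Fraction of the total softmax mass that falls on the system-prompt segment.
    alpha = 1.0 / (1.0 + math.exp(lse_req - lse_sys))
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(out_sys, out_req)]


# Toy data: three cached system-prompt positions, two request-specific positions.
q = [0.3, -1.2]
sys_k = [[1.0, 0.0], [0.5, 0.5], [-0.2, 0.9]]
sys_v = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
req_k = [[0.1, 0.4], [-1.0, 1.0]]
req_v = [[2.0, 2.0], [-1.0, 0.0]]

full, _ = attention(q, sys_k + req_k, sys_v + req_v)
merged = merged_attention(q, (sys_k, sys_v), (req_k, req_v))
```

Because the system-prompt term no longer depends on which request a query belongs to, a batched implementation could in principle compute it for all queries in a single pass over the shared keys and values.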
Lei Zhu, Jietao Liu, Lei Feng, Chengfei Guo, Tengfei Wu, Xiaopeng Shao
Light passing through scattering media is strongly scattered and diffused into a complex speckle pattern, which contains almost all the spatial and spectral information of the objects. Although various methods have been proposed to recover the spatial information of hidden objects, it is still a challenge to simultaneously obtain their spectral information. Here, we present an effective approach to realize spectral imaging through scattering media by combining spectral retrieval and speckle correlation. Compared to the traditional imaging spectrometer, our approach is more flexible in the choice of core element. In this paper, we demonstrate the use of frosted glass as the core element to achieve spectral imaging. The recovery of both spectral and spatial information is demonstrated via numerical simulations. Experimental results further demonstrate the performance of our scheme in spectral imaging through scattering media. Spectral imaging based on scattering media is well suited to new types of spectral imaging applications.
Lei Zhu, Qi She, Qian Chen, Yunfei You, Boyu Wang, Yanye Lu
Weakly supervised object localization (WSOL) focuses on localizing objects only with the supervision of image-level classification masks. Most previous WSOL methods follow the classification activation map (CAM) that localizes objects based on the classification structure with the multi-instance learning (MIL) mechanism. However, the MIL mechanism makes CAM only activate discriminative object parts rather than the whole object, weakening its performance for localizing objects. To avoid this problem, this work provides a novel perspective that models WSOL as a domain adaption (DA) task, where the score estimator trained on the source/image domain is tested on the target/pixel domain to locate objects. Under this perspective, a DA-WSOL pipeline is designed to better engage DA approaches into WSOL to enhance localization performance. It utilizes a proposed target sampling strategy to select different types of target samples. Based on these types of target samples, a domain adaption localization (DAL) loss is devised. It aligns the feature distribution between the two domains by DA and makes the estimator perceive target domain cues by Universum regularization. Experiments show that our pipeline outperforms SOTA methods on multiple benchmarks. Code is released at \url{https://github.com/zh460045050/DA-WSOL_CVPR2022}.
Lei Zhu, Zi Huang, Zhihui Li, Liang Xie, Heng Tao Shen
Unsupervised hashing can desirably support scalable content-based image retrieval (SCBIR) owing to its appealing advantages of semantic label independence and memory and search efficiency. However, the learned hash codes are embedded with limited discriminative semantics due to the intrinsic limitation of image representation. To address the problem, in this paper, we propose a novel hashing approach, dubbed \emph{Discrete Semantic Transfer Hashing} (DSTH). The key idea is to \emph{directly} augment the semantics of discrete image hash codes by exploring auxiliary contextual modalities. To this end, a unified hashing framework is formulated to simultaneously preserve visual similarities of images and perform semantic transfer from contextual modalities. Further, to guarantee direct semantic transfer and avoid information loss, we explicitly impose the discrete constraint, bit-uncorrelation constraint and bit-balance constraint on hash codes. A novel and effective discrete optimization method based on the augmented Lagrangian multiplier is developed to iteratively solve the optimization problem. The whole learning process has linear computational complexity and desirable scalability. Experiments on three benchmark datasets demonstrate the superiority of DSTH compared with several state-of-the-art approaches.
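The three constraints named above are commonly written as follows in the hashing literature. The notation is an assumption for illustration and may differ from the paper's: $B$ is the code matrix with $r$ bits for each of $n$ images, $\mathbf{1}_n$ is the all-ones vector, and $I_r$ is the $r \times r$ identity.

```latex
% Assumed notation: B code matrix (r bits, n images), 1_n all-ones vector, I_r identity.
\[
\underbrace{B \in \{-1, +1\}^{r \times n}}_{\text{discrete}}, \qquad
\underbrace{B\,\mathbf{1}_n = \mathbf{0}}_{\text{bit balance}}, \qquad
\underbrace{B B^{\top} = n\, I_r}_{\text{bit uncorrelation}} .
\]
```

Bit balance asks each bit to split the data evenly between $-1$ and $+1$, while bit uncorrelation asks different bits to carry non-redundant information.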
Xiao Dong, Lei Zhu, Xuemeng Song, Jingjing Li, Zhiyong Cheng
In this paper, we investigate the research problem of unsupervised multi-view feature selection. Conventional solutions first simply combine multiple pre-constructed view-specific similarity structures into a collaborative similarity structure, and then perform the subsequent feature selection. These two processes are separate and independent. The collaborative similarity structure remains fixed during feature selection. Further, the simple undirected view combination may adversely reduce the reliability of the ultimate similarity structure for feature selection, as the view-specific similarity structures generally involve noise and outlying entries. To alleviate these problems, we propose an adaptive collaborative similarity learning (ACSL) approach for multi-view feature selection. We propose to dynamically learn the collaborative similarity structure, and further integrate it with the ultimate feature selection into a unified framework. Moreover, a reasonable rank constraint is devised to adaptively learn an ideal collaborative similarity structure with proper similarity combination weights and desirable neighbor assignment, both of which could positively facilitate the feature selection. An effective solution with proven convergence is derived to iteratively tackle the formulated optimization problem. Experiments demonstrate the superiority of the proposed approach.
Lei Zhu, Yuanqi Chen, Xiaohang Liu, Thomas H. Li, Ge Li
Face animation is a challenging task. Existing model-based methods (utilizing 3DMMs or landmarks) often result in a model-like reconstruction effect, which doesn't effectively preserve identity. Conversely, model-free approaches face challenges in attaining a decoupled and semantically rich feature space, thereby making accurate motion transfer difficult to achieve. We introduce the semantic facial descriptors in learnable disentangled vector space to address the dilemma. The approach involves decoupling the facial space into identity and motion subspaces while endowing each of them with semantics by learning complete orthogonal basis vectors. We obtain basis vector coefficients by employing an encoder on the source and driving faces, leading to effective facial descriptors in the identity and motion subspaces. Ultimately, these descriptors can be recombined as latent codes to animate faces. Our approach successfully addresses the issue of model-based methods' limitations in high-fidelity identity and the challenges faced by model-free methods in accurate motion transfer. Extensive experiments are conducted on three challenging benchmarks (i.e. VoxCeleb, HDTF, CelebV). Comprehensive quantitative and qualitative results demonstrate that our model outperforms SOTA methods with superior identity preservation and motion transfer.