Weifeng Ge, Weilin Huang, Dengke Dong, Matthew R. Scott
We present a novel hierarchical triplet loss (HTL) capable of automatically collecting informative training samples (triplets) via a defined hierarchical tree that encodes global context information. This allows us to cope with the main limitation of random sampling in training a conventional triplet loss, which is a central issue for deep metric learning. Our main contributions are two-fold. (i) we construct a hierarchical class-level tree where neighboring classes are merged recursively. The hierarchical structure naturally captures the intrinsic data distribution over the whole database. (ii) we formulate the problem of triplet collection by introducing a new violate margin, which is computed dynamically based on the designed hierarchical tree. This allows it to automatically select meaningful hard samples with the guide of global context. It encourages the model to learn more discriminative features from visual similar classes, leading to faster convergence and better performance. Our method is evaluated on the tasks of image retrieval and face recognition, where it outperforms the standard triplet loss substantially by 1%-18%. It achieves new state-of-the-art performance on a number of benchmarks, with much fewer learning iterations.
Weilin Huang, Christopher P. Bridge, J. Alison Noble, Andrew Zisserman
We present an automatic method to describe clinically useful information about scanning, and to guide image interpretation in ultrasound (US) videos of the fetal heart. Our method is able to jointly predict the visibility, viewing plane, location and orientation of the fetal heart at the frame level. The contributions of the paper are three-fold: (i) a convolutional neural network architecture is developed for a multi-task prediction, which is computed by sliding a 3x3 window spatially through convolutional maps. (ii) an anchor mechanism and Intersection over Union (IoU) loss are applied for improving localization accuracy. (iii) a recurrent architecture is designed to recursively compute regional convolutional features temporally over sequential frames, allowing each prediction to be conditioned on the whole video. This results in a spatial-temporal model that precisely describes detailed heart parameters in challenging US videos. We report results on a real-world clinical dataset, where our method achieves performance on par with expert annotations.
Chenfan Zhuang, Xintong Han, Weilin Huang, Matthew R. Scott
Training an object detector on a data-rich domain and applying it to a data-poor one with limited performance drop is highly attractive in industry, because it saves huge annotation cost. Recent research on unsupervised domain adaptive object detection has verified that aligning data distributions between source and target images through adversarial learning is very useful. The key is when, where and how to use it to achieve best practice. We propose Image-Instance Full Alignment Networks (iFAN) to tackle this problem by precisely aligning feature distributions on both image and instance levels: 1) Image-level alignment: multi-scale features are roughly aligned by training adversarial domain classifiers in a hierarchically-nested fashion. 2) Full instance-level alignment: deep semantic information and elaborate instance representations are fully exploited to establish a strong relationship among categories and domains. Establishing these correlations is formulated as a metric learning problem by carefully constructing instance pairs. Above-mentioned adaptations can be integrated into an object detector (e.g. Faster RCNN), resulting in an end-to-end trainable framework where multiple alignments can work collaboratively in a coarse-tofine manner. In two domain adaptation tasks: synthetic-to-real (SIM10K->Cityscapes) and normal-to-foggy weather (Cityscapes->Foggy Cityscapes), iFAN outperforms the state-of-the-art methods with a boost of 10%+ AP over the source-only baseline.
Yongqiang Gao, Weilin Huang, Yu Qiao
Local binary descriptors are attracting increasingly attention due to their great advantages in computational speed, which are able to achieve real-time performance in numerous image/vision applications. Various methods have been proposed to learn data-dependent binary descriptors. However, most existing binary descriptors aim overly at computational simplicity at the expense of significant information loss which causes ambiguity in similarity measure using Hamming distance. In this paper, by considering multiple features might share complementary information, we present a novel local binary descriptor, referred as Ring-based Multi-Grouped Descriptor (RMGD), to successfully bridge the performance gap between current binary and floated-point descriptors. Our contributions are two-fold. Firstly, we introduce a new pooling configuration based on spatial ring-region sampling, allowing for involving binary tests on the full set of pairwise regions with different shapes, scales and distances. This leads to a more meaningful description than existing methods which normally apply a limited set of pooling configurations. Then, an extended Adaboost is proposed for efficient bit selection by emphasizing high variance and low correlation, achieving a highly compact representation. Secondly, the RMGD is computed from multiple image properties where binary strings are extracted. We cast multi-grouped features integration as rankSVM or sparse SVM learning problem, so that different features can compensate strongly for each other, which is the key to discriminativeness and robustness. The performance of RMGD was evaluated on a number of publicly available benchmarks, where the RMGD outperforms the state-of-the-art binary descriptors significantly.
Linjie Xing, Zhi Tian, Weilin Huang, Matthew R. Scott
Recent progress has been made on developing a unified framework for joint text detection and recognition in natural images, but existing joint models were mostly built on two-stage framework by involving ROI pooling, which can degrade the performance on recognition task. In this work, we propose convolutional character networks, referred as CharNet, which is an one-stage model that can process two tasks simultaneously in one pass. CharNet directly outputs bounding boxes of words and characters, with corresponding character labels. We utilize character as basic element, allowing us to overcome the main difficulty of existing approaches that attempted to optimize text detection jointly with a RNN-based recognition branch. In addition, we develop an iterative character detection approach able to transform the ability of character detection learned from synthetic data to real-world images. These technical improvements result in a simple, compact, yet powerful one-stage model that works reliably on multi-orientation and curved text. We evaluate CharNet on three standard benchmarks, where it consistently outperforms the state-of-the-art approaches [25, 24] by a large margin, e.g., with improvements of 65.33%->71.08% (with generic lexicon) on ICDAR 2015, and 54.0%->69.23% on Total-Text, on end-to-end text recognition. Code is available at: https://github.com/MalongTech/research-charnet.
Sheng Guo, Weilin Huang, Yu Qiao
Image representation and classification are two fundamental tasks towards multimedia content retrieval and understanding. The idea that shape and texture information (e.g. edge or orientation) are the key features for visual representation is ingrained and dominated in current multimedia and computer vision communities. A number of low-level features have been proposed by computing local gradients (e.g. SIFT, LBP and HOG), and have achieved great successes on numerous multimedia applications. In this paper, we present a simple yet efficient local descriptor for image classification, referred as Local Color Contrastive Descriptor (LCCD), by leveraging the neural mechanisms of color contrast. The idea originates from the observation in neural science that color and shape information are linked inextricably in visual cortical processing. The color contrast yields key information for visual color perception and provides strong linkage between color and shape. We propose a novel contrastive mechanism to compute the color contrast in both spatial location and multiple channels. The color contrast is computed by measuring \emph{f}-divergence between the color distributions of two regions. Our descriptor enriches local image representation with both color and contrast information. We verified experimentally that it can compensate strongly for the shape based descriptor (e.g. SIFT), while keeping computationally simple. Extensive experimental results on image classification show that our descriptor improves the performance of SIFT substantially by combinations, and achieves the state-of-the-art performance on three challenging benchmark datasets. It improves recent Deep Learning model (DeCAF) [1] largely from the accuracy of 40.94% to 49.68% in the large scale SUN397 database. Codes for the LCCD will be available.
Sheng Guo, Weilin Huang, Limin Wang, Yu Qiao
Convolutional neural networks (CNN) have recently achieved remarkable successes in various image classification and understanding tasks. The deep features obtained at the top fully-connected layer of the CNN (FC-features) exhibit rich global semantic information and are extremely effective in image classification. On the other hand, the convolutional features in the middle layers of the CNN also contain meaningful local information, but are not fully explored for image representation. In this paper, we propose a novel Locally-Supervised Deep Hybrid Model (LS-DHM) that effectively enhances and explores the convolutional features for scene recognition. Firstly, we notice that the convolutional features capture local objects and fine structures of scene images, which yield important cues for discriminating ambiguous scenes, whereas these features are significantly eliminated in the highly-compressed FC representation. Secondly, we propose a new Local Convolutional Supervision (LCS) layer to enhance the local structure of the image by directly propagating the label information to the convolutional layers. Thirdly, we propose an efficient Fisher Convolutional Vector (FCV) that successfully rescues the orderless mid-level semantic information (e.g. objects and textures) of scene image. The FCV encodes the large-sized convolutional maps into a fixed-length mid-level representation, and is demonstrated to be strongly complementary to the high-level FC-features. Finally, both the FCV and FC-features are collaboratively employed in the LSDHM representation, which achieves outstanding performance in our experiments. It obtains 83.75% and 67.56% accuracies respectively on the heavily benchmarked MIT Indoor67 and SUN397 datasets, advancing the stat-of-the-art substantially.
Tong He, Weilin Huang, Yu Qiao, Jian Yao
Recent deep learning models have demonstrated strong capabilities for classifying text and non-text components in natural images. They extract a high-level feature computed globally from a whole image component (patch), where the cluttered background information may dominate true text features in the deep representation. This leads to less discriminative power and poorer robustness. In this work, we present a new system for scene text detection by proposing a novel Text-Attentional Convolutional Neural Network (Text-CNN) that particularly focuses on extracting text-related regions and features from the image components. We develop a new learning mechanism to train the Text-CNN with multi-level and rich supervised information, including text region mask, character label, and binary text/nontext information. The rich supervision information enables the Text-CNN with a strong capability for discriminating ambiguous texts, and also increases its robustness against complicated background components. The training process is formulated as a multi-task learning problem, where low-level supervised information greatly facilitates main task of text/non-text classification. In addition, a powerful low-level detector called Contrast- Enhancement Maximally Stable Extremal Regions (CE-MSERs) is developed, which extends the widely-used MSERs by enhancing intensity contrast between text patterns and background. This allows it to detect highly challenging text patterns, resulting in a higher recall. Our approach achieved promising results on the ICDAR 2013 dataset, with a F-measure of 0.82, improving the state-of-the-art results substantially.
Yu Gao, Xintong Han, Xun Wang, Weilin Huang, Matthew R. Scott
Fine-grained image categorization is challenging due to the subtle inter-class differences.We posit that exploiting the rich relationships between channels can help capture such differences since different channels correspond to different semantics. In this paper, we propose a channel interaction network (CIN), which models the channel-wise interplay both within an image and across images. For a single image, a self-channel interaction (SCI) module is proposed to explore channel-wise correlation within the image. This allows the model to learn the complementary features from the correlated channels, yielding stronger fine-grained features. Furthermore, given an image pair, we introduce a contrastive channel interaction (CCI) module to model the cross-sample channel interaction with a metric learning framework, allowing the CIN to distinguish the subtle visual differences between images. Our model can be trained efficiently in an end-to-end fashion without the need of multi-stage training and testing. Finally, comprehensive experiments are conducted on three publicly available benchmarks, where the proposed method consistently outperforms the state-of-theart approaches, such as DFL-CNN (Wang, Morariu, and Davis 2018) and NTS (Yang et al. 2018).
Zhi Tian, Weilin Huang, Tong He, Pan He, Yu Qiao
We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural image. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information of image, making it powerful to detect extremely ambiguous text. The CTPN works reliably on multi-scale and multi- language text without further post-processing, departing from previous bottom-up methods requiring multi-step post-processing. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpass- ing recent results [8, 35] by a large margin. The CTPN is computationally efficient with 0:14s/image, by using the very deep VGG16 model [27]. Online demo is available at: http://textdet.com/.
Tong He, Weilin Huang, Yu Qiao, Jian Yao
We introduce a new top-down pipeline for scene text detection. We propose a novel Cascaded Convolutional Text Network (CCTN) that joints two customized convolutional networks for coarse-to-fine text localization. The CCTN fast detects text regions roughly from a low-resolution image, and then accurately localizes text lines from each enlarged region. We cast previous character based detection into direct text region estimation, avoiding multiple bottom- up post-processing steps. It exhibits surprising robustness and discriminative power by considering whole text region as detection object which provides strong semantic information. We customize convolutional network by develop- ing rectangle convolutions and multiple in-network fusions. This enables it to handle multi-shape and multi-scale text efficiently. Furthermore, the CCTN is computationally efficient by sharing convolutional computations, and high-level property allows it to be invariant to various languages and multiple orientations. It achieves 0.84 and 0.86 F-measures on the ICDAR 2011 and ICDAR 2013, delivering substantial improvements over state-of-the-art results [23, 1].
Yuechen Yu, Yilei Xiong, Weilin Huang, Matthew R. Scott
Siamese-based trackers have achieved excellent performance on visual object tracking. However, the target template is not updated online, and the features of the target template and search image are computed independently in a Siamese architecture. In this paper, we propose Deformable Siamese Attention Networks, referred to as SiamAttn, by introducing a new Siamese attention mechanism that computes deformable self-attention and cross-attention. The self attention learns strong context information via spatial attention, and selectively emphasizes interdependent channel-wise features with channel attention. The cross-attention is capable of aggregating rich contextual inter-dependencies between the target template and the search image, providing an implicit manner to adaptively update the target template. In addition, we design a region refinement module that computes depth-wise cross correlations between the attentional features for more accurate tracking. We conduct experiments on six benchmarks, where our method achieves new state of-the-art results, outperforming the strong baseline, SiamRPN++ [24], by 0.464->0.537 and 0.415->0.470 EAO on VOT 2016 and 2018. Our code is available at: https://github.com/msight-tech/research-siamattn.
Miao Kang, Xiaojun Hu, Weilin Huang, Matthew R. Scott, Mauricio Reyes
We propose a Dual-Stream Pyramid Registration Network (referred as Dual-PRNet) for unsupervised 3D medical image registration. Unlike recent CNN-based registration approaches, such as VoxelMorph, which explores a single-stream encoder-decoder network to compute a registration fields from a pair of 3D volumes, we design a two-stream architecture able to compute multi-scale registration fields from convolutional feature pyramids. Our contributions are two-fold: (i) we design a two-stream 3D encoder-decoder network which computes two convolutional feature pyramids separately for a pair of input volumes, resulting in strong deep representations that are meaningful for deformation estimation; (ii) we propose a pyramid registration module able to predict multi-scale registration fields directly from the decoding feature pyramids. This allows it to refine the registration fields gradually in a coarse-to-fine manner via sequential warping, and enable the model with the capability for handling significant deformations between two volumes, such as large displacements in spatial domain or slice space. The proposed Dual-PRNet is evaluated on two standard benchmarks for brain MRI registration, where it outperforms the state-of-the-art approaches by a large margin, e.g., having improvements over recent VoxelMorph [2] with 0.683->0.778 on the LPBA40, and 0.511->0.631 on the Mindboggle101, in term of average Dice score. Code is available at: https://github.com/kangmiao15/Dual-Stream-PRNet-Plus.
Pan He, Weilin Huang, Yu Qiao, Chen Change Loy, Xiaoou Tang
We develop a Deep-Text Recurrent Network (DTRN) that regards scene text reading as a sequence labelling problem. We leverage recent advances of deep convolutional neural networks to generate an ordered high-level sequence from a whole word image, avoiding the difficult character segmentation problem. Then a deep recurrent model, building on long short-term memory (LSTM), is developed to robustly recognize the generated CNN sequences, departing from most existing approaches recognising each character independently. Our model has a number of appealing properties in comparison to existing scene text recognition methods: (i) It can recognise highly ambiguous words by leveraging meaningful context information, allowing it to work reliably without either pre- or post-processing; (ii) the deep CNN feature is robust to various image distortions; (iii) it retains the explicit order information in word image, which is essential to discriminate word strings; (iv) the model does not depend on pre-defined dictionary, and it can process unknown words and arbitrary strings. Codes for the DTRN will be available.
Weilin Huang, Hujun Yin
This paper presents a computationally efficient yet powerful binary framework for robust facial representation based on image gradients. It is termed as structural binary gradient patterns (SBGP). To discover underlying local structures in the gradient domain, we compute image gradients from multiple directions and simplify them into a set of binary strings. The SBGP is derived from certain types of these binary strings that have meaningful local structures and are capable of resembling fundamental textural information. They detect micro orientational edges and possess strong orientation and locality capabilities, thus enabling great discrimination. The SBGP also benefits from the advantages of the gradient domain and exhibits profound robustness against illumination variations. The binary strategy realized by pixel correlations in a small neighborhood substantially simplifies the computational complexity and achieves extremely efficient processing with only 0.0032s in Matlab for a typical face image. Furthermore, the discrimination power of the SBGP can be enhanced on a set of defined orientational image gradient magnitudes, further enforcing locality and orientation. Results of extensive experiments on various benchmark databases illustrate significant improvements of the SBGP based representations over the existing state-of-the-art local descriptors in the terms of discrimination, robustness and complexity. Codes for the SBGP methods will be available at http://www.eee.manchester.ac.uk/research/groups/sisp/software/.
Sheng Guo, Zihua Xiong, Yujie Zhong, Limin Wang, Xiaobo Guo, Bing Han, Weilin Huang
In this paper, we present a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning. CACL consists of a 3D CNN and a video transformer which are used in parallel to generate diverse positive pairs for contrastive learning. This allows the model to learn strong representations from such diverse yet meaningful pairs. Furthermore, we introduce a temporal self-supervised learning module able to predict an Edit distance explicitly between two video sequences in the temporal order. This enables the model to learn a rich temporal representation that compensates strongly to the video-level representation learned by the CACL. We evaluate our method on the tasks of video retrieval and action recognition on UCF101 and HMDB51 datasets, where our method achieves excellent performance, surpassing the state-of-the-art methods such as VideoMoCo and MoCo+BE by a large margin. The code is made available at https://github.com/guoshengcv/CACL.
Xintong Han, Zuxuan Wu, Weilin Huang, Matthew R. Scott, Larry S. Davis
Visual compatibility is critical for fashion analysis, yet is missing in existing fashion image synthesis systems. In this paper, we propose to explicitly model visual compatibility through fashion image inpainting. To this end, we present Fashion Inpainting Networks (FiNet), a two-stage image-to-image generation framework that is able to perform compatible and diverse inpainting. Disentangling the generation of shape and appearance to ensure photorealistic results, our framework consists of a shape generation network and an appearance generation network. More importantly, for each generation network, we introduce two encoders interacting with one another to learn latent code in a shared compatibility space. The latent representations are jointly optimized with the corresponding generation network to condition the synthesis process, encouraging a diverse set of generated results that are visually compatible with existing fashion garments. In addition, our framework is readily extended to clothing reconstruction and fashion transfer, with impressive results. Extensive experiments with comparisons with state-of-the-art approaches on fashion synthesis task quantitatively and qualitatively demonstrate the effectiveness of our method.
Xu Chen, Zida Cheng, Jiangchao Yao, Chen Ju, Weilin Huang, Jinsong Lan, Xiaoyi Zeng, Shuai Xiao
Cross-domain CTR (CDCTR) prediction is an important research topic that studies how to leverage meaningful data from a related domain to help CTR prediction in target domain. Most existing CDCTR works design implicit ways to transfer knowledge across domains such as parameter-sharing that regularizes the model training in target domain. More effectively, recent researchers propose explicit techniques to extract user interest knowledge and transfer this knowledge to target domain. However, the proposed method mainly faces two issues: 1) it usually requires a super domain, i.e. an extremely large source domain, to cover most users or items of target domain, and 2) the extracted user interest knowledge is static no matter what the context is in target domain. These limitations motivate us to develop a more flexible and efficient technique to explicitly transfer knowledge. In this work, we propose a cross-domain augmentation network (CDAnet) being able to perform explicit knowledge transfer between two domains. Specifically, CDAnet contains a designed translation network and an augmentation network which are trained sequentially. The translation network computes latent features from two domains and learns meaningful cross-domain knowledge of each input in target domain by using a designed cross-supervised feature translator. Later the augmentation network employs the explicit cross-domain knowledge as augmented information to boost the target domain CTR prediction. Through extensive experiments on two public benchmarks and one industrial production dataset, we show CDAnet can learn meaningful translated features and largely improve the performance of CTR prediction. CDAnet has been conducted online A/B test in image2product retrieval at Taobao app, bringing an absolute 0.11 point CTR improvement, a relative 0.64% deal growth and a relative 1.26% GMV increase.
Limin Wang, Sheng Guo, Weilin Huang, Yuanjun Xiong, Yu Qiao
Convolutional Neural Networks (CNNs) have made remarkable progress on scene recognition, partially due to these recent large-scale scene datasets, such as the Places and Places2. Scene categories are often defined by multi-level information, including local objects, global layout, and background environment, thus leading to large intra-class variations. In addition, with the increasing number of scene categories, label ambiguity has become another crucial issue in large-scale classification. This paper focuses on large-scale scene recognition and makes two major contributions to tackle these issues. First, we propose a multi-resolution CNN architecture that captures visual content and structure at multiple levels. The multi-resolution CNNs are composed of coarse resolution CNNs and fine resolution CNNs, which are complementary to each other. Second, we design two knowledge guided disambiguation techniques to deal with the problem of label ambiguity. (i) We exploit the knowledge from the confusion matrix computed on validation data to merge ambiguous classes into a super category. (ii) We utilize the knowledge of extra networks to produce a soft label for each image. Then the super categories or soft labels are employed to guide CNN training on the Places2. We conduct extensive experiments on three large-scale image datasets (ImageNet, Places, and Places2), demonstrating the effectiveness of our approach. Furthermore, our method takes part in two major scene recognition challenges, and achieves the second place at the Places2 challenge in ILSVRC 2015, and the first place at the LSUN challenge in CVPR 2016. Finally, we directly test the learned representations on other scene benchmarks, and obtain the new state-of-the-art results on the MIT Indoor67 (86.7\%) and SUN397 (72.0\%). We release the code and models at~\url{https://github.com/wanglimin/MRCNN-Scene-Recognition}.
Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, Jianchao Yang
We introduce SeedEdit 3.0, in companion with our T2I model Seedream 3.0, which significantly improves over our previous SeedEdit versions in both aspects of edit instruction following and image content (e.g., ID/IP) preservation on real image inputs. Additional to model upgrading with T2I, in this report, we present several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and meta information is helpfult to connect VLM with diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and reward losses. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks, for real/synthetic image editing, where it achieves a best trade-off between multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT4o (37.1%) and Gemini 2.0 (30.3%).