Linsui Deng, Yilin Zhang
Feature screening for ultrahigh-dimensional data generally proceeds in two essential steps. The first step is measuring and ranking the marginal dependence between the response and the covariates, and the second is determining the threshold. We develop a new screening procedure, called the SIT-BY procedure, that possesses appealing statistical properties in both steps. By employing sliced independence estimates in the measuring and ranking stage, our proposed procedure requires no model assumptions, remains invariant to monotone transformations, and achieves almost linear computational complexity. Inspired by false discovery rate (FDR) control procedures, we offer a data-adaptive threshold that benefits from the asymptotic normality of the test statistics. Under moderate conditions, we demonstrate that our procedure can asymptotically control the FDR while maintaining the sure screening property. We investigate the finite-sample performance of our proposed procedure via extensive simulations and an application to a genome-wide dataset.
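The thresholding step can be illustrated with a generic FDR step-up rule. The sketch below assumes that "BY" refers to the Benjamini-Yekutieli procedure and that each covariate already has a p-value (for example, from an asymptotically normal test statistic); the function name `benjamini_yekutieli` is hypothetical, and this is not the authors' SIT-BY implementation.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected by the BY step-up rule."""
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction c(m) = 1 + 1/2 + ... + 1/m used by the BY rule.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up: keep the largest rank whose p-value is below its threshold.
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    return sorted(order[:k_max])


# Toy p-values for five covariates (illustrative numbers only).
selected = benjamini_yekutieli([0.001, 0.008, 0.039, 0.041, 0.6], alpha=0.05)
print(selected)
```

The threshold is data-adaptive in the sense that the number of retained covariates depends on the whole sorted list of p-values rather than on a fixed cutoff.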
Mohammad Amini, Babak Ahmadi, Xiaomeng Xiong, Yilin Zhang, Christopher Qiao
This study explores automatic item generation (AIG) using language models to create multiple-choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned mid-sized model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, with prompt design playing a key role in mid-sized model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance mid-sized models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.
Yilin Zhang, Leo D. Westbury, Elaine M. Dennison, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar
Poor bone health is a significant public health concern, and low bone mineral density (BMD) leads to an increased fracture risk, a key feature of osteoporosis. We present XAttn-BMD (Cross-Attention BMD), a multimodal deep learning framework that predicts femoral neck BMD from hip X-ray images and structured clinical metadata. It utilizes a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement. A Weighted Smooth L1 loss is tailored to address BMD imbalance and prioritize clinically significant cases. Extensive experiments on data from the Hertfordshire Cohort Study show that our model outperforms the baseline models in regression generalization and robustness. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function. Experimental results show that the integration of multimodal data via cross-attention outperforms naive feature concatenation without cross-attention, reducing MSE by 16.7%, MAE by 6.03%, and increasing the $R^2$ score by 16.4%, highlighting the effectiveness of the approach for femoral neck BMD estimation. Furthermore, screening performance was evaluated using binary classification at clinically relevant femoral neck BMD thresholds, demonstrating the model's potential in real-world scenarios.
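A weighted Smooth L1 loss can be sketched in a few lines. The abstract does not say how the per-sample weights are chosen, so the sketch below simply takes them as an input; the function names and the `beta` transition point are assumptions for illustration, not the XAttn-BMD implementation.

```python
def smooth_l1(diff, beta=1.0):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    a = abs(diff)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta


def weighted_smooth_l1(preds, targets, weights, beta=1.0):
    """Weighted average of per-sample Smooth L1 losses.

    `weights` is a hypothetical per-sample importance (for example, larger
    for rare or clinically important BMD ranges).
    """
    total = sum(w * smooth_l1(p - t, beta)
                for p, t, w in zip(preds, targets, weights))
    return total / sum(weights)


# Two samples: a small error with weight 1 and a large error with weight 2.
loss = weighted_smooth_l1([1.0, 3.0], [0.5, 1.0], [1.0, 2.0])
print(loss)
```

Up-weighting under-represented target ranges makes errors on those samples contribute more to the average, which is one common way to counter imbalance in a regression target.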
Yilin Zhang, Cai Xu, Haishun Chen, Ziyu Guan, Wei Zhao
Trusted multi-view classification typically relies on a view-wise evidential fusion process: each view independently produces class evidence and uncertainty, and the final prediction is obtained by aggregating these independent opinions. While this design is modular and uncertainty-aware, it implicitly assumes that evidence from different views is numerically comparable. In practice, however, this assumption is fragile. Different views often differ in feature space, noise level, and semantic granularity, while independently trained branches are optimized only for prediction correctness, without any constraint enforcing cross-view consistency in evidence strength. As a result, the uncertainty used for fusion can be dominated by branch-specific scale bias rather than true sample-level reliability. To address this issue, we propose Trusted Multi-view learning with Unified Routing (TMUR), which decouples view-specific evidence extraction from fusion arbitration. TMUR uses view-private experts and one collaborative expert, and employs a unified router that observes the global multi-view context to generate sample-level expert weights. Soft load-balancing and diversity regularization further encourage balanced expert utilization and more discriminative expert specialization. We also provide theoretical analysis showing why independent evidential supervision does not identify a common cross-view evidence scale, and why unified global routing is preferable to branch-local arbitration when reliability is sample-dependent.
Chengyuan Zhang, Yilin Zhang, Lei Zhu, Deyin Liu, Lin Wu, Bo Li, Shichao Zhang, Mohammed Bennamoun, Farid Boussaid
This paper introduces a novel framework for unified incremental few-shot object detection (iFSOD) and instance segmentation (iFSIS) using the Transformer architecture. Our goal is to create an optimal solution for situations where only a few examples of novel object classes are available, with no access to training data for base or old classes, while maintaining high performance across both base and novel classes. To achieve this, we extend Mask-DINO into a two-stage incremental learning framework. Stage 1 focuses on optimizing the model using the base dataset, while Stage 2 involves fine-tuning the model on novel classes. In addition, we incorporate a classifier selection strategy that assigns appropriate classifiers to the encoder and decoder according to their distinct functions. Empirical evidence indicates that this approach effectively mitigates overfitting when learning novel classes. Furthermore, we implement knowledge distillation to prevent catastrophic forgetting of base classes. Comprehensive evaluations on the COCO and LVIS datasets for both iFSIS and iFSOD tasks demonstrate that our method significantly outperforms state-of-the-art approaches.
Yilin Zhang, Marie Poux-Berthe, Chris Wells, Karolina Koc-Michalska, Karl Rohe
We propose a graph contextualization method, pairGraphText, to study political engagement on Facebook during the 2012 French presidential election. It is a spectral algorithm that contextualizes graph data with text data for online discussion threads. In particular, we examine the Facebook posts of the eight leading candidates and the comments beneath these posts. We find evidence of both (i) candidate-centered structure, where citizens primarily comment on the wall of one candidate, and (ii) issue-centered structure (i.e. on political topics), where citizens' attention and expression are primarily directed towards a specific set of issues (e.g. economics, immigration). To identify issue-centered structure, we develop pairGraphText to analyze a network with high-dimensional features on the interactions (i.e. text). This technique scales to hundreds of thousands of nodes and thousands of unique words. In the Facebook data, spectral clustering without the contextualizing text information finds a mixture of (i) candidate and (ii) issue clusters. Contextualizing the graph with text data helps to separate these two structures. We conclude by showing that the novel methodology is consistent under a statistical model.
Yilin Zhang, Songshan Yang
Measuring and testing dependence between complex objects is of great importance in modern statistics. Most existing work relies on the distance between random variables, which inevitably requires moment conditions to guarantee that the distance is well-defined. Based on the geometric element ``angle'', we develop a novel class of nonlinear dependence measures for data in metric space that can avoid such conditions. Specifically, by making use of the reproducing kernel Hilbert space equipped with Gaussian measure, we introduce kernel angle covariances that can be applied to complex objects such as random vectors or matrices. We estimate kernel angle covariances based on $U$-statistics and establish the corresponding independence tests via gamma approximation. Our kernel angle independence tests, imposing no moment conditions on kernels, are robust to heavy-tailed random variables. We conduct comprehensive simulation studies and apply our proposed methods to a facial recognition task. Our kernel angle covariance-based tests show remarkable performance in dealing with image data.
Yilin Zhang, Songshan Yang, Yunan Wu, Lan Wang
This paper introduces the partial Gini covariance, a novel dependence measure that addresses the challenges of high-dimensional inference with heavy-tailed errors, often encountered in fields like finance, insurance, climate, and biology. Conventional high-dimensional regression inference methods suffer from inaccurate type I errors and reduced power in heavy-tailed contexts, limiting their effectiveness. Our proposed approach leverages the partial Gini covariance to construct a robust statistical inference framework that requires minimal tuning and does not impose restrictive moment conditions on error distributions. Unlike traditional methods, it circumvents the need for estimating the density of random errors and enhances the computational feasibility and robustness. Extensive simulations demonstrate the proposed method's superior power and robustness over standard high-dimensional inference approaches, such as those based on the debiased Lasso. The asymptotic relative efficiency analysis provides additional theoretical insight on the improved efficiency of the new approach in the heavy-tailed setting. Additionally, the partial Gini covariance extends to the multivariate setting, enabling chi-square testing for a group of coefficients. We illustrate the method's practical application with a real-world data example.
Yilin Zhang, Xinran Zhao, Zora Zhiruo Wang, Chenyang Yang, Jiayi Wei, Tongshuang Wu
Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve factuality. However, a critical yet underexplored aspect of RAG pipelines is chunking -- the process of dividing documents into retrievable units. Existing line-based chunking heuristics often break semantic structures, splitting functions or merging unrelated code, which can degrade generation quality. We propose chunking via Abstract Syntax Trees (cAST), a structure-aware method that recursively breaks large AST nodes into smaller chunks and merges sibling nodes while respecting size limits. This approach generates self-contained, semantically coherent units across programming languages and tasks, improving performance on diverse code generation tasks, e.g., boosting Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation. Our work highlights the importance of structure-aware chunking for scaling retrieval-enhanced code intelligence.
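The split-then-merge idea can be sketched with Python's standard `ast` module. This is a minimal illustration, not the authors' implementation: it measures chunk size in lines (an assumption), only recurses into class and function definitions, and drops blank or comment-only lines that fall between chunks. The names `chunk_source` and `_chunk_nodes` are hypothetical.

```python
import ast

# Node types whose `body` covers everything after the header line(s).
_SPLITTABLE = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def _span(node):
    """First and last source line of a node, including any decorators."""
    decorators = getattr(node, "decorator_list", [])
    start = min([node.lineno] + [d.lineno for d in decorators])
    return start, node.end_lineno


def _chunk_nodes(nodes, max_lines):
    """Greedily merge sibling nodes; recurse into nodes that are too large."""
    spans, current = [], None
    for node in nodes:
        start, end = _span(node)
        if end - start + 1 > max_lines:
            if current:
                spans.append(current)
                current = None
            if isinstance(node, _SPLITTABLE) and node.body:
                first_child = _span(node.body[0])[0]
                if first_child > start:
                    # Keep the header (decorators, signature) as its own chunk.
                    spans.append((start, first_child - 1))
                spans.extend(_chunk_nodes(node.body, max_lines))
            else:
                # Cannot split further in this sketch; keep it whole.
                spans.append((start, end))
        elif current and end - current[0] + 1 <= max_lines:
            current = (current[0], end)  # merge with preceding siblings
        else:
            if current:
                spans.append(current)
            current = (start, end)
    if current:
        spans.append(current)
    return spans


def chunk_source(source, max_lines=8):
    """Split Python source into syntax-aligned chunks of at most max_lines."""
    lines = source.splitlines()
    tree = ast.parse(source)
    return ["\n".join(lines[s - 1:e]) for s, e in _chunk_nodes(tree.body, max_lines)]
```

Calling `chunk_source(code, max_lines=7)` on a module with two small functions and one large class would keep the small functions together in one chunk and split the class at method boundaries, so no function body is cut in half.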
Yilin Zhang, Cai Xu, Han Jiang, Ziyu Guan, Wei Zhao, Xiaofei He, Murat Sensoy
Multi-view learning methods often focus on improving decision accuracy while neglecting the decision uncertainty, which significantly restricts their applications in safety-critical scenarios. To address this, trusted multi-view learning methods estimate prediction uncertainties by learning class distributions from each instance. However, these methods heavily rely on high-quality ground-truth labels. This motivates us to delve into a new problem: how to develop a reliable multi-view learning model under the guidance of noisy labels? We propose the Trusted Multi-view Noise Refining (TMNR) method to address this challenge by modeling label noise arising from low-quality data features and easily-confused classes. TMNR employs evidential deep neural networks to construct view-specific opinions that capture both beliefs and uncertainty. These opinions are then transformed through noise correlation matrices to align with the noisy supervision, where matrix elements are constrained by sample uncertainty to reflect label reliability. Furthermore, considering the challenge of jointly optimizing the evidence network and noise correlation matrices under noisy supervision, we propose Trusted Multi-view Noise Re-Refining (TMNR^2), which disentangles this complex co-training problem by establishing different training objectives for distinct modules. TMNR^2 identifies potentially mislabeled samples through evidence-label consistency and generates pseudo-labels from neighboring information. By assigning clean samples to optimize evidential networks and noisy samples to guide noise correlation matrices, respectively, TMNR^2 reduces mapping interference and stabilizes training. Experimental results demonstrate that TMNR^2 significantly outperforms baseline methods, with average accuracy improvements of 7% on datasets with 50% label noise.
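An "opinion" that carries both beliefs and uncertainty can be illustrated with the standard subjective-logic mapping used in evidential deep learning: non-negative class evidence is turned into Dirichlet parameters, from which belief masses and an uncertainty mass are read off. The sketch below assumes this standard mapping; the function name is hypothetical and the noise correlation matrices of TMNR are not modeled.

```python
def evidence_to_opinion(evidence):
    """Map non-negative class evidence to (belief masses, uncertainty).

    Uses Dirichlet parameters alpha_k = e_k + 1 with strength S = sum(alpha),
    belief b_k = e_k / S, and uncertainty u = K / S, so sum(b) + u = 1.
    """
    k = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    beliefs = [e / strength for e in evidence]
    uncertainty = k / strength
    return beliefs, uncertainty


# Strong evidence for class 0, weak for class 1, none for class 2.
beliefs, uncertainty = evidence_to_opinion([4.0, 1.0, 0.0])
print(beliefs, uncertainty)
```

With no evidence at all, every belief is zero and the uncertainty equals one, which is what lets such models express "I do not know" for a view.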
Yilin Zhang, Karl Rohe
This paper uses the relationship between graph conductance and spectral clustering to study (i) the failures of spectral clustering and (ii) the benefits of regularization. The explanation is simple. Sparse and stochastic graphs create a lot of small trees that are connected to the core of the graph by only one edge. Graph conductance is sensitive to these noisy `dangling sets'. Spectral clustering inherits this sensitivity. The second part of the paper starts from a previously proposed form of regularized spectral clustering and shows that it is related to the graph conductance on a `regularized graph'. We call the conductance on the regularized graph CoreCut. Based upon previous arguments that relate graph conductance to spectral clustering (e.g. Cheeger inequality), minimizing CoreCut relaxes to regularized spectral clustering. Simple inspection of CoreCut reveals why it is less sensitive to small cuts in the graph. Together, these results show that unbalanced partitions from spectral clustering can be understood as overfitting to noise in the periphery of a sparse and stochastic graph. Regularization fixes this overfitting. In addition to this statistical benefit, these results also demonstrate how regularization can improve the computational speed of spectral clustering. We provide simulations and data examples to illustrate these results.
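The sensitivity of conductance to dangling sets can be checked on a toy graph. The sketch below assumes the regularizer adds a weight of tau/N between every pair of nodes, which is one commonly used form of regularized spectral clustering; the abstract itself only says "a previously proposed form", so treat this choice, the function name, and the toy graph as assumptions.

```python
from itertools import combinations


def conductance(edges, n, subset, tau=0.0):
    """Conductance of `subset` on the graph with tau/n added to every pair.

    With tau = 0 this is the ordinary conductance cut(S) / min(vol(S), vol(S^c)).
    """
    s = set(subset)
    degree = [0.0] * n
    cut = 0.0
    for u, v in edges:
        degree[u] += 1.0
        degree[v] += 1.0
        if (u in s) != (v in s):
            cut += 1.0
    size_s, size_c = len(s), n - len(s)
    # Adding tau/n to all pairs adds tau to each degree and
    # tau * |S| * |S^c| / n to the cut weight.
    cut += tau * size_s * size_c / n
    vol_s = sum(degree[i] for i in s) + tau * size_s
    vol_c = sum(degree[i] for i in range(n) if i not in s) + tau * size_c
    return cut / min(vol_s, vol_c)


# Two 4-cliques joined by four edges (the core), plus a 3-node path
# hanging off node 7 by a single edge (a dangling set).
edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2))
edges += [(0, 4), (1, 5), (2, 6), (3, 7)]
edges += [(7, 8), (8, 9), (9, 10)]
n = 11
core, dangling = [0, 1, 2, 3], [8, 9, 10]

for tau in (0.0, 1.0):
    print(tau, conductance(edges, n, core, tau), conductance(edges, n, dangling, tau))
```

Without regularization the dangling path has the smaller conductance, so a conductance-minimizing partition would peel it off; with tau = 1 the split between the two cliques becomes the smaller one.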
Yilin Zhang, Karl Rohe, Sebastien Roch
Respondent-driven sampling (RDS) is a popular approach to study marginalized or hard-to-reach populations. It collects samples from a networked population by incentivizing participants to refer their friends into the study. One major challenge in analyzing RDS samples is seed bias. Seed bias refers to the fact that when the social network is divided into multiple communities (or blocks), the RDS sample might not provide a balanced representation of the different communities in the population, and such imbalance is correlated with the initial participant (or the seed). In this case, the distributions of estimators are typically non-trivial mixtures, which are determined (1) by the seed and (2) by how the referrals transition from one block to another. This paper shows that (1) block-transition probabilities are easy to estimate with high accuracy, and (2) we can use these estimated block-transition probabilities to estimate the stationary distribution over blocks and thus obtain an estimate of the block proportions. This stationary distribution on blocks has previously been used in the RDS literature to evaluate whether the sampling process has appeared to `mix'. We use these estimated block proportions in a simple post-stratified (PS) estimator that greatly diminishes seed bias. By aggregating over the blocks/strata in this way, we prove that the PS estimator is $\sqrt{n}$-consistent under a Markov model, even when other estimators are not. Simulations show that the PS estimator has smaller Root Mean Square Error (RMSE) compared to the state-of-the-art estimators.
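The pipeline of estimating block transitions, taking their stationary distribution, and post-stratifying can be sketched as follows. As a simplification, the sketch uses the stationary distribution directly as the stratum weights and assumes every block makes at least one referral and appears in the sample; the function names are hypothetical and this is not the authors' estimator.

```python
def estimate_transitions(referrals, n_blocks):
    """Row-normalized counts of (referrer block, referee block) pairs."""
    counts = [[0.0] * n_blocks for _ in range(n_blocks)]
    for src, dst in referrals:
        counts[src][dst] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def stationary_distribution(P, n_iter=1000):
    """Power iteration pi <- pi P, started from the uniform distribution."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(n_iter):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi


def post_stratified_mean(blocks, values, weights):
    """Weighted sum of within-block sample means."""
    k = len(weights)
    sums, counts = [0.0] * k, [0] * k
    for b, y in zip(blocks, values):
        sums[b] += y
        counts[b] += 1
    return sum(weights[b] * sums[b] / counts[b] for b in range(k))


# Toy referral data for two blocks (illustrative counts only).
referrals = [(0, 0)] * 9 + [(0, 1)] + [(1, 0)] * 2 + [(1, 1)] * 8
P = estimate_transitions(referrals, n_blocks=2)
pi = stationary_distribution(P)
estimate = post_stratified_mean([0, 0, 1], [1.0, 1.0, 4.0], pi)
print(P, pi, estimate)
```

Because the block weights come from the estimated transition matrix rather than from the raw sample composition, a sample that happens to over-represent the seed's block does not pull the estimate toward that block's mean.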
Jianzhao Wang, Yanyan Wei, Dehua Hu, Yilin Zhang, Shengeng Tang, Kun Li, Zhao Zhang
This technical report presents our team's solution for the WeatherProof Dataset Challenge: Semantic Segmentation in Adverse Weather at CVPR'24 UG2+. We propose a two-stage deep learning framework for this task. In the first stage, we preprocess the provided dataset by concatenating images into video sequences. Subsequently, we leverage a low-rank video deraining method to generate high-fidelity pseudo ground truths. These pseudo ground truths offer superior alignment compared to the original ground truths, facilitating model convergence during training. In the second stage, we employ the InternImage network to train for the semantic segmentation task using the generated pseudo ground truths. Notably, our meticulously designed framework demonstrates robustness to degraded data captured under adverse weather conditions. In the challenge, our solution achieved a competitive score of 0.43 on the Mean Intersection over Union (mIoU) metric, securing a respectable rank of 4th.
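For reference, the Mean Intersection over Union metric can be computed as below. This is a minimal sketch over flat label lists; skipping classes that are absent from both prediction and ground truth is an assumption here, and the challenge's official scoring script may handle that case differently.

```python
def mean_iou(pred, target, n_classes):
    """Average per-class intersection-over-union for flat label sequences."""
    ious = []
    for c in range(n_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both pred and target
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Four pixels, two classes: IoU is 1/2 for class 0 and 2/3 for class 1.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
print(score)
```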
Yilin Zhang, Cai Xu, You Wu, Ziyu Guan, Wei Zhao
Deep neural networks often produce overconfident predictions, undermining their reliability in safety-critical applications. This miscalibration is further exacerbated under distribution shift, where test data deviates from the training distribution due to environmental or acquisition changes. While existing approaches improve calibration through training-time regularization or post-hoc adjustment, their reliance on access to or simulation of target domains limits their practicality in real-world scenarios. In this paper, we propose a novel calibration framework that operates without access to target domain information. From a frequency-domain perspective, we identify that distribution shifts often distort high-frequency visual cues exploited by deep models, and introduce a low-frequency filtering strategy to encourage reliance on domain-invariant features. However, such information loss may degrade In-Distribution (ID) calibration performance. Therefore, we further propose a gradient-based rectification mechanism that enforces ID calibration as a hard constraint during optimization. Experiments on synthetic and real-world shifted datasets, including CIFAR-10/100-C and WILDS, demonstrate that our method significantly improves calibration under distribution shift while maintaining strong in-distribution performance.
Haishun Chen, Cai Xu, Jinlong Yu, Yilin Zhang, Ziyu Guan, Wei Zhao, Fangyuan Zhao, Xin Yang
Multi-view evidential learning aims to integrate information from multiple views to improve prediction performance and provide trustworthy uncertainty estimation. Most previous methods assume that view-specific evidence learning is naturally reliable. However, in practice, the evidence learning process tends to be biased. Through empirical analysis on real-world data, we reveal that samples tend to be assigned more evidence to support data-rich classes, thereby leading to unreliable uncertainty estimation in predictions. This motivates us to delve into a new Biased Evidential Multi-view Learning (BEML) problem. To this end, we propose Fairness-Aware Multi-view Evidential Learning (FAML). FAML first introduces an adaptive prior based on training trajectory, which acts as a regularization strategy to flexibly calibrate the biased evidence learning process. Furthermore, we explicitly incorporate a fairness constraint based on class-wise evidence variance to promote balanced evidence allocation. In the multi-view fusion stage, we propose an opinion alignment mechanism to mitigate view-specific bias across views, thereby encouraging the integration of consistent and mutually supportive evidence. Theoretical analysis shows that FAML enhances fairness in the evidence learning process. Extensive experiments on five real-world multi-view datasets demonstrate that FAML achieves more balanced evidence allocation and improves both prediction performance and the reliability of uncertainty estimation compared to state-of-the-art methods.
Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag
Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking. However, the prevalence and impact of length bias in QE metrics have been underexplored. Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a systematic preference for shorter translations when multiple candidates of comparable quality are available for the same source text. These biases risk unfairly penalizing longer, correct translations and can propagate into downstream pipelines that rely on QE signals for data selection or system optimization. We trace the root cause of learned QE metrics to skewed supervision distributions, where longer error-free examples are underrepresented in training data. As a diagnostic intervention, we apply length normalization during training and show that this simple modification effectively decouples error prediction from sequence length, yielding more reliable QE signals across translations of varying length.
Shuyi Gu, Zhenghua Luo, Lin Hu, Yilin Zhang, Junxiong Guo
Chirp signals have found diverse applications owing to their ability to produce frequencies that vary linearly with time. Most feature extraction transformation methods for chirp signals focus on enhancing the performance of the transform itself but neglect the information derived from the transformation process. Consequently, they may fail to fully exploit the information from observations, resulting in decreased performance under conditions of low signal-to-noise ratio (SNR) and limited observations. In this work, we develop a novel post-processing method, called the mapping information model, to address this challenge. The model establishes a link between the observation space and the feature space in the feature extraction transform, enabling interference suppression and more accurate information by iteratively resampling and assigning weights in both spaces. Analysis of the iteration process reveals a continual increase in the weight of signal samples and a gradual stabilization of the weight of noise samples. The demonstrated noise suppression and feature enhancement during the iteration process support the effectiveness of the mapping information model. Furthermore, numerical simulations affirm the high efficiency of the proposed model by showcasing enhanced signal detection and estimation performance without requiring additional observations. The model improves the performance of feature extraction transforms for chirp signal processing under low-SNR and limited-observation conditions, opening up new opportunities in areas such as communication, biomedicine, and remote sensing.
Ye Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang, Dongya Jia, Feihu La, Duc Le, Bochen Li, Chumin Li, Hui Li, Xingxing Li, Shouda Liu, Wei-Tsung Lu, Yiqing Lu, Andrew Shaw, Janne Spijkervet, Yakun Sun, Bo Wang, Ju-Chiang Wang, Yuping Wang, Yuxuan Wang, Ling Xu, Yifeng Yang, Chao Yao, Shuo Zhang, Yang Zhang, Yilin Zhang, Hang Zhao, Ziyi Zhao, Dejian Zhong, Shicen Zhou, Pei Zou
We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio. We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music.
Zichao Dong, Yilin Zhang, Xufeng Huang, Hang Ji, Zhan Shi, Xin Zhan, Junbo Chen
We introduce MV-DETR, a novel transformer-based detection pipeline that is both effective and efficient. Given RGB-D input data, we observe that very strong pretrained weights exist for RGB data, whereas pretraining for depth-related data is less effective. First, we argue that geometry and texture cues are both of vital importance and can be encoded separately. Second, we find that visual texture features are relatively hard to extract compared with geometry features in 3D space. Unfortunately, a single RGB-D dataset with only thousands of samples is not enough to train a discriminative filter for visual texture feature extraction. Finally, we design a lightweight VG module that consists of a visual texture encoder, a geometry encoder, and a VG connector. Compared with previous state-of-the-art work such as V-DETR, gains from the pretrained visual encoder can be seen. Extensive experiments on the ScanNetV2 dataset show the effectiveness of our method. Notably, our method achieves 78% AP, setting a new state of the art on the ScanNetV2 benchmark.
Liwei Wu, Yilin Zhang, Justin Leung, Jingyi Gao, April Li, Jian Zhao
The proliferation of virtual reality (VR) has led to its increasing adoption as an immersive medium for delivering presentations, distinct from other VR experiences like games and 360-degree videos in that it shares information in richly interactive environments. However, creating engaging VR presentations remains a challenging and time-consuming task for users, hindering the full realization of VR presentation's capabilities. This research aims to explore the potential of VR presentation, analyze users' opinions, and investigate both by providing a user-friendly no-coding authoring tool. Through an examination of popular presentation software and interviews with seven professionals, we identified five design aspects and four design challenges for VR presentations. Based on the findings, we developed VRStory, a prototype for presentation authoring without coding, to explore the design aspects and strategies for addressing the challenges. VRStory offers a variety of predefined and customizable VR elements, as well as modules for layout design, navigation control, and asset generation. A user study was then conducted with 12 participants to investigate their opinions and authoring experience with VRStory. Our results demonstrated that, while acknowledging the advantages of immersive and spatial features in VR, users often have a consistent mental model for traditional 2D presentations and may still prefer planar and static formats in VR for better accessibility and efficient communication. Finally, we share the design considerations we learned for future development of VR presentation tools, emphasizing the importance of balancing the promotion of immersive features with ensuring accessibility.