Pengcheng Wang, Sheng Li, Takahiro Shinozaki
In this paper, we propose RAG-Boost (the ST-ShinozakiLab Task I system), which enhances the baseline LLM-based ASR system of the MLC-SLM Challenge (Task I) with an on-the-fly retrieval-augmented generation (RAG) module. Each partial ASR hypothesis queries a vector store of audio-text pairs and domain terms, and the retrieved results are fused with the live ASR hypotheses to correct recognition errors. The fused hypotheses are then passed to the LLM, yielding improved responses.
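A minimal sketch of the query-and-fuse step described above, assuming a brute-force cosine-similarity store and a toy character-trigram embedding; the names (embed, VectorStore, fuse) and the confidence threshold are illustrative, not the system's actual interfaces:

```python
import numpy as np

def embed(text, dim=256):
    # Toy character-trigram hashing embedding; a real system would use
    # a learned text/audio embedding model instead.
    text = text.lower()
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

class VectorStore:
    """Brute-force stand-in for an ANN index over domain terms."""
    def __init__(self, entries):
        self.entries = entries
        self.matrix = np.stack([embed(t) for t in entries])

    def query(self, text, k=3):
        sims = self.matrix @ embed(text)          # cosine similarity
        top = np.argsort(-sims)[:k]
        return [(self.entries[i], float(sims[i])) for i in top]

def fuse(partial_hypothesis, store, threshold=0.8):
    # Replace the partial ASR hypothesis with the best-matching domain
    # term only when retrieval is confident; otherwise keep ASR output.
    term, sim = store.query(partial_hypothesis, k=1)[0]
    return term if sim >= threshold else partial_hypothesis

store = VectorStore(["Shinozaki Lab", "MLC-SLM Challenge", "time-delay"])
print(fuse("shinozaki labs", store))              # -> "Shinozaki Lab"
```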
Xuri Ge, Joemon M. Jose, Pengcheng Wang, Arunachalam Iyer, Xiao Liu, Hu Han
Facial palsy is unilateral facial nerve weakness or paralysis of rapid onset with unknown causes. Automatically estimating the severity of facial palsy can aid the diagnosis and treatment of people suffering from it across the world. In this work, we develop a novel model for estimating facial palsy severity that incorporates an effective facial Action Unit (AU) detection technique, where AUs refer to a unique set of facial muscle movements that can describe almost every anatomically possible facial expression. Specifically, we propose a novel Adaptive Local-Global Relational Network (ALGRNet) for facial AU detection and use it to classify facial paralysis severity. ALGRNet consists of three novel structures: (i) an adaptive region learning module that learns adaptive muscle regions based on detected landmarks; (ii) a skip-BiLSTM that models the latent relationships among local AUs; and (iii) a feature fusion-and-refining module that exploits the complementarity between the local AUs and the global face. Quantitative results on two AU benchmarks, i.e., BP4D and DISFA, demonstrate that ALGRNet achieves promising AU detection accuracy. We further demonstrate its effectiveness for facial paralysis estimation by migrating ALGRNet to a facial paralysis dataset collected and annotated by medical professionals.
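A toy skeleton of the local-global idea, assuming per-AU region features have already been extracted by the adaptive region learning module; the real skip-BiLSTM's skip connections and the fusion-and-refining module are simplified away, and all dimensions are invented:

```python
import torch
import torch.nn as nn

class AURelationSketch(nn.Module):
    """Per-AU region features are treated as a sequence and passed
    through a BiLSTM to model relations among local AUs, then fused
    with a global face feature for per-AU prediction."""
    def __init__(self, n_aus=12, d=128):
        super().__init__()
        self.bilstm = nn.LSTM(d, d // 2, bidirectional=True,
                              batch_first=True)
        self.classifier = nn.Linear(2 * d, n_aus)

    def forward(self, local_feats, global_feat):
        # local_feats: (B, n_aus, d) adaptive muscle-region features
        # global_feat: (B, d) whole-face feature
        related, _ = self.bilstm(local_feats)          # (B, n_aus, d)
        pooled = related.mean(dim=1)                   # aggregate local cues
        fused = torch.cat([pooled, global_feat], -1)   # local-global fusion
        return torch.sigmoid(self.classifier(fused))   # per-AU probability

net = AURelationSketch()
out = net(torch.randn(2, 12, 128), torch.randn(2, 128))  # (2, 12)
```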
Meiyu Si, Yongsheng Huang, Shanhong Chen, Pengcheng Wang, Zhe Duan, Xiaofei Lan, Yuan Chen, Xinchou Lou, Manqi Ruan, Yiwei Wang, Guangyi Tang, Ouzheng Xiao, Jianyong Zhang
The uncertainty of the energy measurement of the electron beam at the Circular Electron Positron Collider (CEPC) must be smaller than $10\,\mathrm{MeV}$ to ensure an accurate measurement of the Higgs boson mass. To simplify the energy measurement system, we propose a new method that fits the Compton edge of the energy distribution of gamma rays produced by microwave-electron Compton scattering. With our method, the uncertainty of the energy measurement is $6\,\mathrm{MeV}$ for an electron energy of $120\,\mathrm{GeV}$ in the Higgs mode. In this system, the energy resolution of the gamma detection needs to reach $10^{-4}$; only a high-purity germanium (HPGe) detector can meet this critical requirement. In a head-on collision mode, the initial photons should be microwave photons with a wavelength of 3.04 centimeters. A cylindrical resonant cavity operating in the selected ${TM_{010}}$ mode is used to transmit the microwaves. After microwave-electron Compton backscattering, the scattered photons and the synchrotron-radiation background pass through a shielding structure and are then detected by an HPGe detector at the end of the beam line of the synchrotron-radiation applications. The hole radius in the side wall of the cavity is about $1.5\,\mathrm{mm}$ to allow the electron beam to pass through. Simulations with the CST (Computer Simulation Technology) software show that the influence of the hole radius on the cavity field is negligible, and the change of the resonance frequency can easily be corrected by fine-tuning the cavity size.
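As a sanity check on the quoted numbers, the Compton edge for a head-on collision follows from standard inverse-Compton kinematics, $E_{\max} = 4\gamma^2 E_0 / (1 + 4\gamma E_0/m_e c^2)$. The short script below (constants only, no spectrum fitting, which is where the $6\,\mathrm{MeV}$ uncertainty comes from) locates the edge near $9\,\mathrm{MeV}$ for a 3.04 cm photon and a $120\,\mathrm{GeV}$ beam:

```python
HC = 1.23984193e-6      # h*c in eV*m
ME = 0.51099895e6       # electron rest energy m_e c^2 in eV

def compton_edge(e_electron_ev, wavelength_m):
    """Maximum scattered photon energy for a head-on collision."""
    e0 = HC / wavelength_m              # initial photon energy (eV)
    gamma = e_electron_ev / ME
    x = 4 * gamma * e0 / ME             # recoil parameter, << 1 here
    return 4 * gamma**2 * e0 / (1 + x)

print(compton_edge(120e9, 3.04e-2) / 1e6)   # ~9.0 MeV Compton edge
```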
Cheng Pang, Yuzhong Wang, Pengcheng Wang, Axiang Yu, Yiding Liu, Ziang Yue, Mingshuang Hu, Jianqi Hu, Yongkang Dong, Jiaran Qi
Flexible dispersion manipulation is critical for holography to achieve broadband imaging or frequency-division multiplexing. In this context, metasurface-based holography offers advanced dispersion control, yet dynamic reconfigurability remains largely unexplored. This work develops a dispersion-engineered inverse design framework that enables 3D frequency-reconfigurable holography through a twisted metasurface system. The physical implementation is based on a compact layered configuration that cascades a broadband radiation-type metasurface (RA-M) and a phase-only metasurface (P-M). The RA-M provides a phase-adjustable input to excite the P-M, while rotating the P-M reconfigures the holographic response. Using the proposed scheme, dynamically switchable frequency-space-multiplexed and achromatic holograms are designed and experimentally demonstrated in the microwave region. This method advances flexible dispersion engineering for metasurface-based holography, and the compact system holds significant potential for applications in ultra-broadband imaging, high-capacity optical display, and switchable meta-devices.
Weihua Gao, Wenlong Niu, Wenlong Lu, Pengcheng Wang, Zhaoyuan Qi, Xiaodong Peng, Zhen Yang
The detection and tracking of small targets in passive optical remote sensing (PORS) have broad applications. However, most previously proposed methods seldom exploit the abundant temporal features formed by target motion, resulting in poor detection and tracking performance for low signal-to-clutter ratio (SCR) targets. In this article, we analyze why effective detection is difficult with spatial features alone and why it becomes feasible with temporal features. Based on this analysis, we use multiple frames as a detection unit and propose a detection method based on temporal energy selective scaling (TESS). Specifically, we investigate the composition of the intensity temporal profiles (ITPs) formed by pixels in a multi-frame detection unit. For a target-present pixel, the target passing through the pixel introduces a weak transient disturbance in the ITP and changes its statistical properties. We use a well-designed function to amplify this transient disturbance, suppress the background and noise components, and output the trajectory of the target on the multi-frame detection unit. Subsequently, to resolve the trade-off between detection rate and false-alarm rate inherent in traditional threshold segmentation, we associate the temporal and spatial features of the output trajectory and propose a trajectory extraction method based on the 3D Hough transform. Finally, we model the trajectory of the target and propose a trajectory-based multi-target tracking method. Experiments in multiple scenarios demonstrate the superiority of our methods over various state-of-the-art detection and tracking methods.
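The amplify-and-suppress step on intensity temporal profiles can be sketched as below; the even-power scaling of standardized profiles is a generic stand-in for the paper's well-designed function, not its actual form:

```python
import numpy as np

def energy_selective_scaling(cube, p=4):
    """cube: (T, H, W) multi-frame detection unit. Standardize each
    pixel's intensity temporal profile and raise it to an even power,
    so weak transient disturbances (target crossings) are amplified
    while background and noise near the temporal mean are suppressed.
    Returns a per-pixel energy map tracing the target trajectory."""
    mu = cube.mean(axis=0, keepdims=True)
    sigma = cube.std(axis=0, keepdims=True) + 1e-9
    z = (cube - mu) / sigma
    return (np.abs(z) ** p).max(axis=0)

cube = np.random.default_rng(0).normal(size=(50, 64, 64))
cube[20:23, 30, 30] += 2.0          # weak transient "target" crossing
print(energy_selective_scaling(cube)[30, 30])   # stands out from clutter
```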
Liutong Han, Zhiyuan Tan, Hongbin Zhang, Pengcheng Wang, Chu Kang, Mingjie Xing, Yanjun Wu
The use of intrinsic functions to leverage hardware-specific capabilities is a crucial approach for optimizing library performance. Many mainstream libraries implement a large number of vectorized algorithms with Arm or x86 SIMD (Single-Instruction, Multiple-Data) intrinsics. Translating existing vectorized intrinsic code into the intrinsics of an emerging architecture is therefore a practical and effective approach. However, current cross-architecture translation relies largely on manual rewriting or rule-based mapping, which is both time-consuming and error-prone. We present \texttt{IntrinTrans}, an LLM-based agent that uses compile-and-test feedback to translate intrinsic code across architectures automatically, and further optimizes the generated intrinsics using register-usage information derived from liveness analysis. To evaluate the effectiveness of our method, we used \texttt{IntrinTrans} to translate open-source benchmarks from Arm Neon intrinsics to the emerging RISC-V Vector (RVV) intrinsics and compared the performance of the translated code with that of native RVV implementations. Our experiments show that advanced LLMs can generate semantically correct RVV intrinsic functions within a small number of iterations. Depending on the base LLM, the pass rate ranges from 47% to 100%, with performance similar to the native implementation (0.85x to 1.28x).
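A skeleton of a compile-and-test feedback loop of this kind; the toolchain commands (riscv64-linux-gnu-gcc, qemu-riscv64) and the llm callable are assumptions for illustration, and the register-usage optimization step is omitted:

```python
import pathlib
import subprocess
import tempfile

def compile_and_test(c_source: str) -> tuple[bool, str]:
    """Cross-compile a candidate translation for RISC-V with the vector
    extension and run its bundled unit tests under qemu."""
    with tempfile.TemporaryDirectory() as d:
        src = pathlib.Path(d, "candidate.c")
        src.write_text(c_source)
        binary = pathlib.Path(d, "candidate")
        cc = subprocess.run(
            ["riscv64-linux-gnu-gcc", "-march=rv64gcv", "-O2",
             str(src), "-o", str(binary)],
            capture_output=True, text=True)
        if cc.returncode != 0:
            return False, cc.stderr            # compiler diagnostics
        run = subprocess.run(["qemu-riscv64", str(binary)],
                             capture_output=True, text=True)
        return run.returncode == 0, run.stdout + run.stderr

def translate(neon_source: str, llm, max_iters: int = 8):
    """Iterate: ask the LLM for a translation, feed failures back."""
    feedback = ""
    for _ in range(max_iters):
        candidate = llm(neon_source, feedback)   # llm(...) is a stub
        ok, log = compile_and_test(candidate)
        if ok:
            return candidate
        feedback = log          # compiler/test errors guide the retry
    return None
```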
Jayoung Lee, PengCheng Wang, Ran Xu, Venkat Dasari, Noah Weston, Yin Li, Saurabh Bagchi, Somali Chaterji
Efficient and adaptive computer vision systems have been proposed to optimize computer vision tasks, such as image classification and object detection, for embedded or mobile devices. These solutions, quite recent in origin, focus on optimizing the model (a deep neural network, DNN) or the system, by designing an adaptive system with approximation knobs. Despite several recent efforts, we show that existing solutions suffer from two major drawbacks. First, the system does not consider the energy consumption of the models when deciding which model to run. Second, the evaluation does not consider the practical scenario of contention on the device due to other co-resident workloads. In this work, we propose Virtuoso, an efficient and adaptive video object detection system that is jointly optimized for accuracy, energy efficiency, and latency. Underlying Virtuoso is a multi-branch execution kernel capable of running at different operating points along the accuracy-energy-latency axes, and a lightweight runtime scheduler that selects the best-fit execution branch to satisfy the user requirement. To compare fairly with Virtuoso, we benchmark 15 state-of-the-art or widely used protocols, including Faster R-CNN (FRCNN), YOLO v3, SSD, EfficientDet, SELSA, MEGA, REPP, FastAdapt, and our in-house adaptive variants FRCNN+, YOLO+, SSD+, and EfficientDet+ (with enhanced efficiency for mobiles). With this comprehensive benchmark, Virtuoso outperforms all the above protocols, leading the accuracy frontier at every efficiency level on NVIDIA Jetson mobile GPUs. Specifically, Virtuoso achieves an accuracy of 63.9%, more than 10% higher than popular object detection models such as FRCNN (51.1%) and YOLO (49.5%).
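The scheduler's branch selection can be sketched as constrained maximization over offline-profiled operating points; the branch names, numbers, and dictionary schema below are invented for illustration, not Virtuoso's actual profiles:

```python
def pick_branch(branches, latency_budget_ms, energy_budget_mj):
    """Return the most accurate profiled execution branch that fits
    both the latency and energy budgets; degrade gracefully if none."""
    feasible = [b for b in branches
                if b["latency_ms"] <= latency_budget_ms
                and b["energy_mj"] <= energy_budget_mj]
    if not feasible:
        return min(branches,
                   key=lambda b: (b["latency_ms"], b["energy_mj"]))
    return max(feasible, key=lambda b: b["accuracy"])

branches = [  # toy offline profiles (per contention level in practice)
    {"name": "frcnn_full", "latency_ms": 220, "energy_mj": 900, "accuracy": 0.66},
    {"name": "yolo_lite",  "latency_ms": 45,  "energy_mj": 180, "accuracy": 0.52},
    {"name": "ssd_mid",    "latency_ms": 90,  "energy_mj": 350, "accuracy": 0.58},
]
print(pick_branch(branches, 100, 400)["name"])   # -> "ssd_mid"
```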
Ran Xu, Rakesh Kumar, Pengcheng Wang, Peter Bai, Ganga Meghanath, Somali Chaterji, Subrata Mitra, Saurabh Bagchi
Videos take a long time to transport over the network, so running analytics on live video directly on embedded or mobile devices has become an important system driver. Since such devices, e.g., surveillance cameras or AR/VR gadgets, are resource constrained, creating lightweight deep neural networks (DNNs) for them is crucial. None of the current approximation techniques for object-classification DNNs can adapt to changing runtime conditions, e.g., changes in resource availability on the device, content characteristics, or user requirements. In this paper, we introduce ApproxNet, a video object classification system for embedded or mobile clients. It enables novel dynamic approximation techniques to achieve the desired inference latency and accuracy trade-off under changing runtime conditions. It does so by exposing two approximation knobs within a single DNN model, rather than creating and maintaining an ensemble of models (e.g., MCDNN [MobiSys-16]). We show that ApproxNet adapts seamlessly at runtime to these changes, provides low and stable latency for image and video frame classification, and improves accuracy and latency over ResNet [CVPR-16], MCDNN [MobiSys-16], MobileNets [Google-17], NestDNN [MobiCom-18], and MSDNet [ICLR-18].
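A toy single-model illustration of two approximation knobs, input resolution and exit depth via early-exit heads; the backbone is invented for the sketch and is not ApproxNet's architecture:

```python
import torch
import torch.nn as nn

class TwoKnobNet(nn.Module):
    """One model, two knobs: resize the input (knob 1) and exit after
    a chosen number of blocks via early-exit heads (knob 2)."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(3 if i == 0 else 32, 32, 3, 2, 1),
                          nn.ReLU())
            for i in range(4))
        self.heads = nn.ModuleList(
            nn.Linear(32, n_classes) for _ in range(4))

    def forward(self, x, resolution=224, exit_depth=4):
        x = nn.functional.interpolate(x, size=(resolution, resolution))
        for i in range(exit_depth):
            x = self.blocks[i](x)
        return self.heads[exit_depth - 1](x.mean(dim=(2, 3)))

net = TwoKnobNet()
frame = torch.randn(1, 3, 256, 256)
fast = net(frame, resolution=112, exit_depth=2)   # low-latency setting
slow = net(frame, resolution=224, exit_depth=4)   # high-accuracy setting
```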
Shizhe Diao, Pengcheng Wang, Yong Lin, Rui Pan, Xiang Liu, Tong Zhang
The increasing scale of large language models (LLMs) brings emergent abilities to various complex tasks requiring reasoning, such as arithmetic and commonsense reasoning. It is known that the effective design of task-specific prompts is critical for LLMs' ability to produce high-quality answers. In particular, an effective approach for complex question-and-answer tasks is example-based prompting with chain-of-thought (CoT) reasoning, which significantly improves the performance of LLMs. However, current CoT methods rely on a fixed set of human-annotated exemplars, which are not necessarily the most effective examples for different tasks. This paper proposes a new method, Active-Prompt, to adapt LLMs to different tasks with task-specific example prompts (annotated with human-designed CoT reasoning). To this end, we address the key problem of determining which questions from a pool of task-specific queries are the most important and helpful to annotate. Borrowing ideas from the related problem of uncertainty-based active learning, we introduce several metrics that characterize uncertainty so as to select the most uncertain questions for annotation. Experimental results demonstrate the superiority of our proposed method, achieving state-of-the-art results on eight complex reasoning tasks. Further analyses of different uncertainty metrics, pool sizes, zero-shot learning, and the accuracy-uncertainty relationship demonstrate the effectiveness of our method. Our code will be available at https://github.com/shizhediao/active-prompt.
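A sketch of the selection step using a disagreement-style uncertainty score (one of several metrics the paper explores: here, the fraction of distinct answers among k sampled completions); sample_answers is a stub for sampling the LLM, and budget/k are illustrative parameters:

```python
def disagreement(answers):
    """Uncertainty as (number of unique answers) / (number of samples):
    questions the model answers inconsistently score high."""
    return len(set(answers)) / len(answers)

def select_for_annotation(questions, sample_answers, k=10, budget=8):
    """Score each pool question by disagreement over k sampled answers
    and return the most uncertain ones for human CoT annotation."""
    scored = []
    for q in questions:
        answers = [sample_answers(q) for _ in range(k)]  # LLM stub
        scored.append((disagreement(answers), q))
    scored.sort(reverse=True)            # most uncertain first
    return [q for _, q in scored[:budget]]
```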
Qiong Deng, Leiqiao Ye, Ke An, Yidi Fan, Ruihong Gao, Ziren Luo, Minghui Du, Pengcheng Wang, Peng Xu
The Taiji mission for space-based gravitational wave (GW) detection employs laser interferometry to measure picometer-scale distance variations induced by GWs. The tilt-to-length (TTL) coupling noise in the inter-spacecraft interferometers, which originates from the angular jitters of the spacecraft and the movable optical subassemblies, is predicted to be one of the main noise sources that might reduce Taiji's sensitivity to GWs. Since these angular jitters can be read out through the differential wavefront sensors, TTL noise can be suppressed during the data processing stage by fitting and subtracting it. This paper proposes an improved algorithm for TTL noise suppression that addresses the unknown noise floor required for optimal estimation in a practical detection scenario, and whose design accounts for the presence of GW signals. The algorithm is validated via numerical simulation built on a spacecraft dynamics simulation incorporating Taiji's drag-free and attitude control system. We also demonstrate the robustness of the algorithm by varying the TTL coefficients over different levels, indicating that it is applicable to a range of payload statuses and ultimately providing a critical advancement toward realizing Taiji's full sensitivity.
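The core fit-and-subtract idea can be illustrated with ordinary least squares; this toy uses made-up coefficients and channel counts and omits the paper's actual contributions (handling the unknown noise floor and protecting GW signals):

```python
import numpy as np

# Model the measured phase as s = sum_i c_i * theta_i + residual, where
# theta_i are angular-jitter readouts from the differential wavefront
# sensors and c_i are the TTL coupling coefficients to be estimated.
rng = np.random.default_rng(0)
n = 100_000
theta = rng.normal(size=(n, 4))                       # 4 jitter channels (toy)
true_c = np.array([2.3e-3, -1.1e-3, 0.7e-3, 1.9e-3])  # m/rad (toy values)
residual = rng.normal(scale=1e-12, size=n)            # stand-in for GW + noise
measured = theta @ true_c + residual

# Fit the coefficients and subtract the reconstructed TTL noise.
c_hat, *_ = np.linalg.lstsq(theta, measured, rcond=None)
cleaned = measured - theta @ c_hat
print(np.allclose(c_hat, true_c, atol=1e-8))          # True
```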
Guojian Zhan, Letian Tao, Pengcheng Wang, Yixiao Wang, Yiheng Li, Yuxin Chen, Hongyang Li, Masayoshi Tomizuka, Shengbo Eben Li
Learning expressive and efficient policy functions is a promising direction in reinforcement learning (RL). While flow-based policies have recently proven effective at modeling complex action distributions with a fast deterministic sampling process, they still face a trade-off between expressiveness and computational burden, typically controlled by the number of flow steps. In this work, we propose the mean velocity policy (MVP), a new generative policy function that models the mean velocity field to achieve one-step action generation. To ensure high expressiveness, an instantaneous velocity constraint (IVC) is imposed on the mean velocity field during training. We theoretically prove that this design explicitly serves as a crucial boundary condition, thereby improving learning accuracy and enhancing policy expressiveness. Empirically, MVP achieves state-of-the-art success rates across several challenging robotic manipulation tasks from Robomimic and OGBench. It also delivers substantial improvements in training and inference speed over existing flow-based policy baselines.
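A minimal sketch of one-step sampling from a mean velocity field, under the convention that the mean velocity u over an interval [t, r] transports x as x_r = x_t + (r - t) u(x_t, t, r); the network interface (conditioning on the observation) is an assumption:

```python
import torch

def sample_action(mean_velocity_net, obs, action_dim):
    """Generate an action in a single step: draw noise x0 at t=0 and
    transport it to t=1 using the learned mean velocity over [0, 1].
    mean_velocity_net(x, obs, t, r) is an assumed interface."""
    batch = obs.shape[0]
    x0 = torch.randn(batch, action_dim)      # noise sample at t = 0
    t = torch.zeros(batch, 1)
    r = torch.ones(batch, 1)
    u = mean_velocity_net(x0, obs, t, r)     # mean velocity over [0, 1]
    return x0 + (r - t) * u                  # one-step transport to t = 1
```

Because the network predicts the mean (interval-averaged) velocity rather than the instantaneous one, no multi-step ODE integration is needed at inference time.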
Chiheng Lou, Sheng Qi, Rui Kang, Yong Zhang, Chen Sun, Pengcheng Wang, Bingyang Liu, Xuanzhe Liu, Xin Jin
Deploying multiple models within shared GPU clusters is promising for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance, especially time-to-first-token (TTFT). We identify the root cause of this compromise as their unawareness of future workload characteristics. In contrast, recent analysis of real-world traces has shown the high periodicity and long-term predictability of LLM serving workloads. We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. Based on universal GPU workers, we design and build WarmServe, a multi-LLM serving system that (1) mitigates cluster-wide prewarming interference by adopting an evict-aware model placement strategy, (2) prepares universal GPU workers in advance by proactive prewarming, and (3) manages GPU memory with a zero-overhead memory switching mechanism. Evaluation on real-world datasets shows that WarmServe improves TTFT by up to 50.8$\times$ compared to the state-of-the-art autoscaling-based system, while serving up to 2.5$\times$ more requests than the GPU-sharing system.
Minghui Du, Pengcheng Wang, Ziren Luo, Wen-Biao Han, Xin Zhang, Xian Chen, Zhoujian Cao, Yonghe Zhang, He Wang, Xiaodong Peng, Li-E Qiang, Ke An, Yidi Fan, Jiafeng Zhang, Liang-Gui Zhu, Ping Shen, Qianyun Yun, Xiao-Bo Zou, Ye Jiang, Tianyu Zhao, Yong Yuan, Xiaotong Wei, Yuxiang Xu, Bo Liang, Peng Xu, Yueliang Wu
Taiji, a Chinese space-based gravitational wave (GW) detection project, aims to explore the millihertz GW universe with unprecedented sensitivity. By observing astrophysical and cosmological sources including Galactic binaries, massive black hole binaries, extreme mass-ratio inspirals, and stochastic gravitational wave backgrounds, Taiji is expected to deliver transformative insights into astrophysics, cosmology, and fundamental physics. However, Taiji's data analysis faces unique challenges compared to ground-based detectors like LIGO-Virgo-KAGRA, such as the overlap of numerous signals, extended data durations, more rigorous accuracy requirements for the waveform templates, incompletely characterized noise spectra, non-stationary noises, and various data anomalies. Taking Taiji as a representative example, this paper reviews the data characteristics and data analysis challenges of space-based GW detection, and introduces the second round of the Taiji Data Challenge, a collection of simulation datasets designed as a shared platform for resolving these critical issues. The platform distinguishes itself from previous works through its systematic integration of orbital dynamics based on a full drag-free and attitude control simulation, extended noise sources, more complicated and overlapping GW signals, second-generation time-delay interferometry, and the coupling effect of time-varying arm-lengths. Concurrently released is the open-source toolkit Triangle, which offers capabilities for customized simulation of signals, noises, and other instrumental effects. By taking a step further towards realistic detection, Taiji Data Challenge II and Triangle together serve as a new testbed, supporting the development of Taiji's global analysis and end-to-end pipelines, and ultimately bridging the gap between observation and scientific objectives.
Ming Wu, Pengcheng Wang, Kangqi Yin, Haoyu Cheng, Yun Xu, Chanchal K. Roy
Detecting large-variance code clones (i.e., clones with relatively many differences) in large-scale code repositories is difficult because most current tools can only detect almost identical or very similar clones. Detecting such clones would benefit software applications such as bug detection, code completion, and software analysis. Recently, CCAligner made an attempt to detect clones with relatively concentrated modifications, called large-gap clones. Our contribution is a novel and effective approach that detects large-variance clones in more general cases, covering not only concentrated but also scattered code modifications. We propose a detector named LVMapper, which borrows and adapts the sequence alignment approach from bioinformatics that can match two similar sequences despite many differences. We tested LVMapper on both self-synthesized datasets and real cases, and the results show substantial improvement in detecting large-variance clones compared with other state-of-the-art tools, including CCAligner. Furthermore, our new tool also shows good recall and precision for general Type-1, Type-2 and Type-3 clones on the widely used benchmarking dataset, BigCloneBench.
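As a simplified stand-in for the alignment idea (not LVMapper's actual algorithm), ordered token overlap via longest common subsequence tolerates scattered modifications, since matched tokens need not be contiguous; the 0.5 threshold is illustrative:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a):
        for j, tb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ta == tb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def is_large_variance_clone(tokens_a, tokens_b, threshold=0.5):
    """Flag a clone when enough tokens align in order, even if the
    differences are scattered throughout the fragment."""
    ratio = lcs_len(tokens_a, tokens_b) / max(len(tokens_a), len(tokens_b))
    return ratio >= threshold

a = "int sum = 0 ; for ( i = 0 ; i < n ; i ++ ) sum += a [ i ] ;".split()
b = "int total = 0 ; for ( j = 0 ; j < n ; j ++ ) total += x [ j ] ;".split()
print(is_large_variance_clone(a, b))   # True: structure aligns
```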
Karthick Shankar, Pengcheng Wang, Ran Xu, Ashraf Mahgoub, Somali Chaterji
With diverse IoT workloads, placing compute and analytics close to where data is collected is becoming increasingly important. We seek to understand the performance and cost implications of running analytics on IoT data on the various available platforms. These workloads can be compute-light, such as outlier detection on sensor data, or compute-intensive, such as object detection on video feeds from drones. In our paper, JANUS, we profile the performance/$ and the compute-versus-communication cost for a compute-light IoT workload and a compute-intensive IoT workload. In addition, we examine the pros and cons of proprietary deep-learning object detection packages, such as Amazon Rekognition, Google Vision, and Azure Cognitive Services, in contrast with open-source and tunable solutions such as Faster R-CNN (FRCNN). We find that AWS IoT Greengrass delivers at least 2X lower latency and 1.25X lower cost than all other cloud platforms for the compute-light outlier detection workload. For the compute-intensive streaming video analytics task, an open-source object detection solution running on cloud VMs saves on dollar costs compared to the proprietary solutions from Amazon, Microsoft, and Google, but loses out on latency (up to 6X); when run on a low-powered edge device, the latency is up to 49X lower.
Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu, Linjun Shou, Ming Gong, Pengcheng Wang, Jiusheng Chen, Daxin Jiang, Jiancheng Lv, Ruofei Zhang, Winnie Wu, Ming Zhou, Nan Duan
Multi-task benchmarks such as GLUE and SuperGLUE have driven great progress in pretraining and transfer learning in Natural Language Processing (NLP). These benchmarks mostly focus on a range of Natural Language Understanding (NLU) tasks and do not consider Natural Language Generation (NLG) models. In this paper, we present the General Language Generation Evaluation (GLGE), a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks. For each task, we further design three subtasks by task difficulty (GLGE-Easy, GLGE-Medium, and GLGE-Hard), yielding 24 subtasks for comprehensive comparison of model performance. To encourage research on pretraining and transfer learning for NLG models, we make GLGE publicly available and build a leaderboard with strong baselines including MASS, BART, and ProphetNet (the source code and dataset are publicly available at https://github.com/microsoft/glge).
Pengcheng Wang, Edgardo Barsallo Yi, Tomas Ratkus, Somali Chaterji
IoT networks are being used to collect, analyze, and utilize sensor data. Several key requirements remain for leveraging IoT networks in digital agriculture, e.g., the design and deployment of energy-saving, ruggedized sensor nodes (SNs), reliable long-range wireless connectivity, and end-to-end data collection pipelines for batch and streaming data. We therefore introduce our living lab ORPHEUS and trace its design and implementation to showcase our orchestrated testbed of IoT sensors, data connectivity, database orchestration, and a visualization dashboard. We deploy lightweight, energy-saving SNs in the field to collect data, use LoRa (Long Range wireless) to transmit data from the SNs to the gateway node, upload all the data to the database server, and finally visualize the data. For future exploration, we also built a testbed of embedded devices using four variants of NVIDIA Jetson development modules (Nano, TX2, Xavier NX, AGX Xavier) to benchmark potential upgrade choices for the SNs in ORPHEUS. Based on our deployments on multiple farms in a three-county region around Purdue University, and on the Purdue University campus, we present analyses from our living lab deployment and additional components of the next-generation IoT farm.
Jiaxi Yin, Pengcheng Wang, Han Ding, Fei Wang
Accurate food intake detection is vital for dietary monitoring and chronic disease prevention. Traditional self-report methods are prone to recall bias, while camera-based approaches raise concerns about privacy. Furthermore, existing wearable-based methods primarily focus on a limited number of food types, such as hamburgers and pizza, failing to address the vast diversity of Chinese cuisine. To bridge this gap, we propose CuisineSense, a system that classifies Chinese food types by integrating hand motion cues from a smartwatch with head dynamics from smart glasses. To filter out irrelevant daily activities, we design a two-stage detection pipeline. The first stage identifies eating states by distinguishing characteristic temporal patterns from non-eating behaviors. The second stage then conducts fine-grained food type recognition based on the motions captured during food intake. To evaluate CuisineSense, we construct a dataset comprising 27.5 hours of IMU recordings across 11 food categories and 10 participants. Experiments demonstrate that CuisineSense achieves high accuracy in both eating state detection and food classification, offering a practical solution for unobtrusive, wearable-based dietary monitoring. The system code is publicly available at https://github.com/joeeeeyin/CuisineSense.git.
Jianglan Wei, Zhenyu Zhang, Pengcheng Wang, Mingjie Zeng, Zhigang Zeng
Energy-efficient medical data classification is essential for modern disease screening, particularly in home and field healthcare where embedded devices are prevalent. While deep learning models achieve state-of-the-art accuracy, their substantial energy consumption and reliance on GPUs limit deployment on such platforms. We present HDC-X, a lightweight classification framework designed for low-power devices. HDC-X encodes data into high-dimensional hypervectors, aggregates them into multiple cluster-specific prototypes, and performs classification through similarity search in hyperspace. We evaluate HDC-X across three medical classification tasks; on heart sound classification, HDC-X is $350\times$ more energy-efficient than Bayesian ResNet with less than 1% difference in accuracy. Moreover, HDC-X demonstrates exceptional robustness to noise, limited training data, and hardware errors, supported by both theoretical analysis and empirical results, highlighting its potential for reliable deployment in real-world settings. Code is available at https://github.com/jianglanwei/HDC-X.
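A minimal hyperdimensional classification sketch with random-projection encoding and a single prototype per class; HDC-X keeps multiple cluster-specific prototypes and its encoder may differ, so all names and dimensions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                         # hypervector dimensionality

def encode(x, projection):
    """Random-projection encoding to a bipolar hypervector (one common
    HDC encoder; cheap to compute and robust to hardware errors)."""
    return np.sign(projection @ x)

def train(xs, ys, projection, n_classes):
    """Bundle encoded samples into one prototype per class."""
    protos = np.zeros((n_classes, D))
    for x, y in zip(xs, ys):
        protos[y] += encode(x, projection)
    return np.sign(protos)

def classify(x, protos, projection):
    """Similarity search in hyperspace: nearest prototype wins."""
    hv = encode(x, projection)
    return int(np.argmax(protos @ hv))   # cosine similarity up to scale

# toy usage
f = 64                                   # input feature dimension
projection = rng.choice([-1.0, 1.0], size=(D, f))
xs = rng.normal(size=(200, f))
ys = rng.integers(0, 3, size=200)
protos = train(xs, ys, projection, n_classes=3)
print(classify(xs[0], protos, projection))
```

Training and inference reduce to matrix-vector products and sign operations, which is where the energy advantage over GPU-bound deep models comes from.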
Jiawei Yi, Ping Gong, Youhui Bai, Zewen Jin, Shengnan Wang, Jiaqi Ruan, Jia He, Jiaan Zhu, Pengcheng Wang, Haibo Wang, Weiguang Wang, Xia Zhu, Cheng Li
During LLM inference, KVCache memory usage grows linearly with sequence length and batch size and often exceeds GPU capacity. Recent proposals offload KV states to host memory and reduce transfers using top-k attention, but their CPU-centric management of the on-GPU cache and of CPU-GPU data movement incurs high overhead and fragments the bulk GPU execution that CUDA Graphs rely on. To close this gap, we observe that adjacent queries within the same attention head exhibit strong directional similarity and retrieve highly overlapping top-k KV states. This insight enables a simple head-granularity cache algorithm, QSAC, in which each head reuses its previously cached KV states whenever the current query is sufficiently similar to the prior one. QSAC further simplifies cache management primitives and all but eliminates CPU involvement. We develop LiteCache, a KVCache subsystem that incorporates QSAC. LiteCache introduces a GPU-centric synchronization controller and speculative sparse prefetching, enabling fully overlapped data movement and computation. These mechanisms produce a stable and predictable execution pattern that remains compatible with the bulk execution mode required by CUDA Graphs. Evaluation on two widely used LLMs shows that LiteCache achieves accuracy comparable to the baselines while sharply reducing CPU overhead and fully utilizing PCIe bandwidth, improving decoding throughput by 10.7-224.2% on both H100 and A40 GPUs and easily supporting sequence lengths beyond 1M. We open-source LiteCache at https://anonymous.4open.science/r/LiteCache-888D.
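The head-granularity reuse rule reads as a few lines; the similarity threshold tau and the fetch_topk interface are assumptions for illustration, and the real system additionally overlaps host-to-GPU fetches with computation:

```python
import torch

class HeadCache:
    """Per-attention-head sketch of query-similarity-aware caching:
    reuse the previously fetched top-k KV states when the current
    query points in nearly the same direction as the one that
    produced them; otherwise fetch fresh KV states from host memory."""
    def __init__(self, tau=0.9):
        self.tau = tau                 # cosine-similarity threshold
        self.prev_query = None
        self.cached_kv = None

    def lookup(self, query, fetch_topk):
        # query: 1-D tensor for this head at the current decode step
        if self.prev_query is not None:
            sim = torch.nn.functional.cosine_similarity(
                query, self.prev_query, dim=-1)
            if sim.item() >= self.tau:
                return self.cached_kv       # hit: skip CPU-GPU transfer
        self.cached_kv = fetch_topk(query)  # miss: fetch from host memory
        self.prev_query = query
        return self.cached_kv
```

Because hits involve no CPU round trip and misses follow a fixed fetch path, the per-step execution pattern stays regular enough to capture in a CUDA Graph.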