Guoyu Li, Yang Cao, Lucas H L Ng, Alexander Charlton, Qianzhou Wang, Will Punter, Philippos Papaphilippou, Ce Guo, Hongxiang Fan, Wayne Luk, Saman Amarasinghe, Ajay Brahmakshatriya
With network requirements diverging across emerging applications, latency-critical services demand minimal logic delay, while hyperscale training and collectives require sustained line-rate throughput for synchronized bulk transfers. This divergence creates an urgent need for custom network switches tailored to specialized protocols and application-specific traffic patterns. This paper presents SPAC (Switch and Protocol Adaptive Customization), a novel approach that automates the generation of FPGA-based network switches co-optimized for custom protocols and application-specific traffic patterns. SPAC introduces a unified workflow with a domain-specific language (DSL) for protocol-architecture co-design, a library of modular HLS-based adaptive switch components, and a trace-aware Design Space Exploration (DSE) engine. By providing a multi-fidelity simulation stack, SPAC enables rapid identification of Pareto-optimal designs prior to deployment. We demonstrate the efficacy of SPAC's domain-specific adaptation across a spectrum of real-world scenarios, from latency-sensitive sensor and HFT networks to hyperscale datacenter fabrics. Experimental results show that by tailoring the micro-architecture and protocol to the specific workload, SPAC-generated designs reduce LUT and BRAM usage by 55% and 53%, respectively. Compared to fixed-architecture counterparts, SPAC delivers latency reductions ranging from 7.8% to 38.4% across various tasks while keeping resource consumption and packet drop rates within acceptable bounds.
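The DSE step lends itself to a compact illustration. Below is a minimal sketch of trace-aware Pareto selection, assuming each candidate switch configuration is scored by a fast simulator on a recorded traffic trace; the cost model, configuration knobs, and function names are illustrative stand-ins, not SPAC's actual DSL or API.

```python
from itertools import product

def simulate(cfg, trace):
    # Stand-in for SPAC's multi-fidelity simulation stack: a toy cost model
    # where wider crossbars cut latency and bigger buffers cut drops, at a
    # LUT cost. cfg = (port_count, buffer_kb); trace is a recorded workload.
    ports, buf_kb = cfg
    latency = 100.0 / ports + 0.01 * len(trace) / buf_kb
    luts = 500 * ports + 20 * buf_kb
    drops = max(0.0, 0.05 - 0.001 * buf_kb)
    return (latency, luts, drops)

def pareto_front(scored):
    # Keep configurations not dominated on (latency, LUTs, drop rate).
    return [(c, m) for c, m in scored
            if not any(all(o <= v for o, v in zip(m2, m)) and m2 != m
                       for _, m2 in scored)]

trace = list(range(1000))                       # stand-in packet trace
scored = [(cfg, simulate(cfg, trace))
          for cfg in product([4, 8, 16], [16, 64, 256])]
for (ports, buf), (lat, luts, drops) in pareto_front(scored):
    print(f"ports={ports:2d} buf={buf:3d}KB  latency={lat:6.2f}  "
          f"{luts} LUTs  drop={drops:.3f}")
```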
Rishona Daniels, Duna Wattad, Ronny Ronen, David Saad, Shahar Kvatinsky
Reservoir computing (RC) is an emerging recurrent neural network architecture that has attracted growing attention for its low training cost and modest hardware requirements. Memristor-based circuits are particularly promising for RC, as their intrinsic dynamics can reduce network size and parameter overhead in tasks such as time-series prediction and image recognition. Although RC has been demonstrated with several memristive devices, a comprehensive evaluation of device-level requirements remains limited. In this paper, we analyze and explain the operation of a parallel delayed feedback network (PDFN) RC architecture with volatile memristors, focusing on how device characteristics -- such as decay rate, quantization, and variability -- affect reservoir performance. We further discuss strategies to improve data representation in the reservoir using preprocessing methods and suggest potential improvements. The proposed approach achieves 95.89% classification accuracy on MNIST, comparable with the best reported memristor-based RC implementations. Furthermore, the method maintains high robustness under 20% device variability, achieving an accuracy of up to 94.2%. These results demonstrate that volatile memristors can support reliable spatio-temporal information processing and reinforce their potential as key building blocks for compact, high-speed, and energy-efficient neuromorphic computing systems.
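The role of device decay and quantization is easy to see in a toy reservoir node. The sketch below drives an array of leaky, quantized "memristor" states with a masked input, in the spirit of a parallel delayed feedback reservoir; the decay rate, number of conductance levels, and variability term are placeholders, not the paper's fitted device model.

```python
import numpy as np

rng = np.random.default_rng(0)

def reservoir_states(u, n_nodes=50, decay=0.8, levels=16, sigma=0.0):
    # Illustrative volatile-memristor nodes: conductance rises with the
    # masked input and decays between samples (the volatility RC exploits).
    mask = rng.uniform(0.1, 1.0, n_nodes)                    # input masks
    d = decay * (1 + sigma * rng.standard_normal(n_nodes))   # variability
    g = np.zeros(n_nodes)
    states = []
    for x in u:
        g = np.clip(d * g + np.tanh(mask * x), 0.0, 1.0)     # drive + decay
        q = np.round(g * (levels - 1)) / (levels - 1)        # quantized read
        states.append(q)
    return np.array(states)          # feature rows for a linear readout

states = reservoir_states(np.sin(np.linspace(0, 8 * np.pi, 200)))
print(states.shape)                  # (200, 50): one feature row per sample
```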
Xiangyu Ren, Yuexun Huang, Zhemin Zhang, Yuchen Zhu, Tsung-Yi Ho, Antonio Barbalace, Zhiding Liang
Photonic quantum computing (PQC) provides a promising route toward quantum computation by naturally supporting the measurement-based quantum computation (MBQC) model. In MBQC, programs are executed through measurements on a pre-generated graph state, whose construction largely depends on probabilistic fusion operations. However, fusion operations in PQC are vulnerable to two major error sources: fusion failure and fusion erasure. As a result, MBQC compilation must account for both error mechanisms to generate reliable and efficient photonic executions. Prior state-of-the-art MBQC compilation, represented by OneAdapt, is designed for all-photonic architectures and mainly focuses on handling fusion failures. Nevertheless, it does not explicitly model fusion erasures induced by photon loss, which can be substantially more damaging than fusion failures. To mitigate fusion erasure errors, we introduce a new MBQC compilation scheme built upon spin-qubit quantum memory. We propose tree-encoded fusion, an encoding strategy that suppresses erasure errors during graph-state generation. We further incorporate this scheme into a compiler framework with algorithms that reduce the execution overhead of quantum programs. We evaluate the proposed framework using a realistic PQC simulator on six representative quantum algorithm benchmarks across multiple program scales. The results show that tree-encoded fusion achieves better robustness than alternative fusion-encoding strategies, and that our compiler provides exponential improvement over OneAdapt. In addition, we validate the feasibility of our approach through a proof-of-concept demonstration on real PQC hardware.
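Why encoding suppresses erasure can be seen with a deliberately simplified redundancy model, sketched below: an unencoded fusion is erased whenever either participating photon is lost, whereas an encoded fusion with b redundant carriers per arm survives unless all b are lost. This toy model only illustrates the scaling; it is not the paper's tree code.

```python
# Toy loss model: loss is the per-photon loss probability, b the number of
# redundant carriers per fusion arm (b=1 means unencoded).
def erasure_prob(loss, b=1):
    p_arm_ok = 1 - loss ** b          # at least one of b carriers survives
    return 1 - p_arm_ok ** 2          # both fusion arms must survive

for loss in (0.01, 0.05, 0.10):
    print(f"loss={loss:.2f}  unencoded={erasure_prob(loss):.4f}  "
          f"b=3 encoded={erasure_prob(loss, b=3):.2e}")
```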
Vincent Sprave, Martin Wilhelm, Daniele Passaretti, Alberto Garcia-Ortiz, Thilo Pionteck
Adaptive Systems-on-Chips (SoCs) are increasingly being used in mixed criticality systems (MCSs), such as in autonomous driving, aviation, and medical systems. In this context, AMD has proposed the Versal SoC, which has a heterogeneous architecture including, among other components, an Artificial Intelligence Engine (AIE), which is a 2D array of processors and memory tiles designed for AI and signal processing workloads. While this AIE offers significant potential for accelerating real-time data processing tasks, it has not yet been explored in the context of MCSs, since individual tasks with different criticality levels cannot be dynamically assigned to tiles due to the static mapping of dataflow graphs and tasks. In this work, we propose a dynamic task dispatching infrastructure that enables task switching on the AIE at runtime. Based on this infrastructure, we present an MCS design that dynamically assigns tasks of different criticality to a pool of AIE tiles, depending on the criticality mode of the system. Our approach overcomes the limitations of static dataflow graph mappings and, for the first time, exploits the parallel processing capabilities of the AIE for MCSs. We also present a comprehensive timing analysis of the overhead introduced by the task dispatcher infrastructure, focusing on control logic, context switching and data copy operations. This shows that these operations have low variance and are negligible compared to the overall execution time, demonstrating that our infrastructure is suitable for MCSs. Finally, we evaluate the proposed infrastructure using an autonomous driving workload with tasks that have variable execution times and different criticality levels. In this case study, we maximized AIE utilization, reducing idle time by 65.5%, while measuring an execution time overhead of less than 0.002%, and doubling the throughput of low-criticality tasks.
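The mode-dependent dispatching policy can be captured in a few lines. The sketch below assigns queued tasks to a pool of free tiles and sheds low-criticality work when the system enters its high-criticality mode; the class, field names, and two-level mode model are illustrative, whereas the paper's dispatcher is an on-chip infrastructure.

```python
from collections import deque

class Dispatcher:
    def __init__(self, n_tiles, mode="LO"):
        self.free = deque(range(n_tiles))
        self.mode = mode               # "LO": run all tasks; "HI": critical only

    def dispatch(self, queue):
        schedule = []
        for task in queue:
            if self.mode == "HI" and task["crit"] == "LO":
                continue               # shed low-criticality work in HI mode
            if not self.free:
                break                  # tile pool exhausted
            schedule.append((task["name"], self.free.popleft()))
        return schedule

d = Dispatcher(n_tiles=4, mode="HI")
tasks = [{"name": "lidar", "crit": "HI"}, {"name": "log", "crit": "LO"},
         {"name": "brake", "crit": "HI"}]
print(d.dispatch(tasks))   # low-criticality 'log' is skipped in HI mode
```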
Max Tzschoppe, Martin Wilhelm, Sven Groppe, Thilo Pionteck
This paper introduces a search algorithm for index structures based on a B+ tree, specifically optimized for execution on a field-programmable gate array (FPGA). Our implementation efficiently traverses and reuses tree nodes by processing a batch of search keys level by level. This approach reduces costly global memory accesses, improves reuse of loaded B+ tree nodes, and enables parallel search key comparisons directly on the FPGA. Using a high-level synthesis (HLS) approach, we developed a highly flexible and configurable search kernel design supporting variable batch sizes, customizable node sizes, and arbitrary tree depths. The final design was implemented on an AMD Alveo U250 Data Center Accelerator Card, and was evaluated against the B+ tree search algorithm from the TLX library running on an AMD EPYC 7542 processor (2.9 GHz). With a batch size of 1000 search keys, a B+ tree containing one million entries, and a tree order of 16, we measured a 4.9$\times$ speedup for the single-kernel FPGA design compared to a single-threaded CPU implementation. Running four kernel instances in parallel on the FPGA resulted in a 2.1$\times$ performance improvement over a CPU implementation using 16 threads.
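A host-side sketch of the level-by-level idea follows: all keys in a batch visit level k before any key proceeds to level k+1, so each fetched node is reused by every key routed through it, and the per-key comparisons within a node can run in parallel on the FPGA. The node layout and helper names are illustrative, not the HLS kernel's interface.

```python
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys, self.children, self.values = keys, children, values

def batched_search(root, batch, depth):
    # groups maps a node to the subset of batch keys routed through it,
    # so each node is loaded once per level regardless of batch size.
    groups = {id(root): (root, sorted(batch))}
    for _ in range(depth):                        # one pass per tree level
        nxt = {}
        for node, keys in groups.values():
            for k in keys:                        # parallel compares on FPGA
                child = node.children[bisect.bisect_right(node.keys, k)]
                nxt.setdefault(id(child), (child, []))[1].append(k)
        groups = nxt
    out = {}
    for leaf, keys in groups.values():            # final leaf-level lookup
        for k in keys:
            i = bisect.bisect_left(leaf.keys, k)
            hit = i < len(leaf.keys) and leaf.keys[i] == k
            out[k] = leaf.values[i] if hit else None
    return out

leaf1 = Node([1, 5], values=["a", "b"]); leaf2 = Node([9, 12], values=["c", "d"])
root = Node([9], children=[leaf1, leaf2])
print(batched_search(root, [5, 12, 7], depth=1))  # {5:'b', 12:'d', 7:None}
```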
Chetan Choppali Sudarshan, Aman Arora, Vidya A Chhabria
Climate change concerns emphasize the need for sustainable computing. Modeling the carbon footprint (CFP), including the operational CFP from semiconductor use and the embodied CFP from manufacture and design, is essential. Field programmable gate arrays (FPGAs) stand out as promising platforms due to their reconfigurability across various applications, enabling the amortization of embodied CFP across multiple applications. This paper introduces GreenFPGA, a tool estimating the total CFP of FPGAs over their lifespan, considering uncertainties in CFP modeling. It accounts for CFP during design, manufacturing, reconfigurability (reuse), operation, disposal, testing, and recycling. GreenFPGA identifies deployment regimes in which FPGAs can be more sustainable than ASICs, GPUs, and CPUs under the modeled iso-performance assumptions. Experimental results highlight the importance of analyzing applications across different computing platforms to assess their CFP, showing how parameters such as application type, lifetime, usage time, and production volume affect the total CFP. Across the evaluated pairwise iso-performance case studies with ASICs, GPUs, and CPUs, FPGAs can be more sustainable under specific deployment regimes involving frequently changing, diverse workloads and low-volume applications.
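The amortization argument reduces to simple arithmetic, sketched below under loudly stated assumptions: one reconfigurable FPGA carries a single embodied CFP across several workloads, while each low-volume ASIC carries its own design-and-mask-heavy embodied CFP. All numbers are placeholders, not GreenFPGA's calibrated models.

```python
def lifetime_cfp(embodied_kg, power_w, hours, grid_kg_per_kwh=0.4):
    # total CFP = embodied (design + manufacture) + operational (energy use)
    operational = power_w / 1000 * hours * grid_kg_per_kwh
    return embodied_kg + operational

apps = 5                                   # workloads sharing one FPGA
fpga = lifetime_cfp(embodied_kg=30, power_w=60, hours=20_000)
# each low-volume ASIC amortizes its design/mask CFP over few units
asics = apps * lifetime_cfp(embodied_kg=80, power_w=40, hours=20_000 / apps)
print(f"one FPGA: {fpga:.0f} kgCO2e   {apps} low-volume ASICs: {asics:.0f} kgCO2e")
# Reconfigurability amortizes the single embodied cost across all five apps.
```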
Jumin Kim, Seungmin Baek, Hwayong Nam, Minbok Wi, Nam Sung Kim, Jung Ho Ahn
As DRAM scaling exacerbates RowHammer, DDR5 introduces per-row activation counting (PRAC) to track aggressor activity. However, PRAC indiscriminately increments counters on every activation -- including benign refreshes -- while relying solely on explicit RFM operations for resets. Consequently, counters saturate even in an idle bank, triggering cascading mitigations and degrading performance. This vulnerability arises from a fundamental mismatch: PRAC tracks the aggressor but aims to protect the victim. We present Per-Victim-row hAmmered Counting (PVAC), a victim-based counting mechanism that aligns the counter semantics with the physical disturbance mechanism of RowHammer. PVAC increments the counters of victim rows, resets the activated row, and naturally bounds counter values under normal refresh. To enable efficient victim-based updates, PVAC employs a dedicated counter subarray (CSA) that performs all counter resets and increments concurrently with normal accesses, without timing overhead. We further devise an energy-efficient CSA layout that minimizes refresh-induced counter accesses. Through victim-based counting, PVAC supports higher hammering tolerance than PRAC while maintaining the same worst-case safety guarantee. Across benign workloads and adversarial attack patterns, PVAC avoids spurious Alerts, eliminates PRAC timing penalties, and achieves higher performance and lower energy consumption than prior PRAC-based defenses.
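The counter semantics are concrete enough to sketch functionally: activating row r increments the counters of its physical neighbors (the victims) and resets r's own counter, so each counter approximates the disturbance a row has absorbed since it was last activated or refreshed. The threshold, blast radius, and array size below are illustrative, not PVAC's provisioned values.

```python
NROWS, THRESH = 16, 4
ctr = [0] * NROWS                             # one counter per victim row

def activate(r, blast=1):
    alerts = []
    for v in range(max(0, r - blast), min(NROWS, r + blast + 1)):
        if v == r:
            ctr[v] = 0                        # activation refreshes the row itself
        else:
            ctr[v] += 1                       # neighbors accumulate disturbance
            if ctr[v] >= THRESH:
                alerts.append(v)              # victim needs preventive refresh
                ctr[v] = 0
    return alerts

for _ in range(4):                            # double-sided hammer on rows 5 and 7
    activate(5); hits = activate(7)
print("victim refreshed:", hits)              # row 6 trips the threshold
```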
Naser Khatti Dizabadi, Ceyda Elcin Kaya
This paper presents a low-power cache architecture based on the series interconnection of conventional 6-transistor static random-access memory (6T SRAM) cells. The proposed approach aims to reduce leakage power in SRAM-based cache memories without increasing the transistor count of the memory cell itself. In the proposed architecture, adjacent cells within a column are reconfigured in a serial topology, thereby exploiting the stacking effect to suppress leakage current, particularly during hold operation. This architectural modification requires corresponding changes to the addressing and sensing structure of the cache, including adjustments to the column organization and readout path. To evaluate the proposed method, transient simulations were carried out using Keysight ADS. The simulation results show that the proposed architecture reduces leakage power compared with the conventional SRAM interconnection scheme while preserving the use of standard 6T SRAM cells.
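The stacking effect itself admits a back-of-the-envelope estimate: each additional off transistor in a series leakage path raises the intermediate node voltage and suppresses subthreshold current roughly geometrically. The per-stack suppression factor below is a textbook-style ballpark, not a value extracted from the paper's Keysight ADS simulations.

```python
def column_hold_leakage(i_off_na, n_cells, stack_height=1, per_stack=0.1):
    # per_stack ~ 0.1: illustrative suppression per extra off device in series
    per_cell = i_off_na * per_stack ** (stack_height - 1)
    return n_cells * per_cell

base   = column_hold_leakage(1.0, 1024)                  # conventional column
series = column_hold_leakage(1.0, 1024, stack_height=2)  # two cells in series
print(f"conventional: {base:.0f} nA   series-connected: {series:.0f} nA")
```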
Xian Rong Qin, Yong Zhang, Ying Hu, Tao Su, Bo-Wen Jia, Ning Xu
Design automation has the potential to substantially improve the efficiency of analog integrated circuit (IC) design. However, existing algorithms and tools typically focus on individual stages, such as device sizing, placement, or routing, and still require significant manual intervention to complete the full design flow. While large language models (LLMs) have recently demonstrated remarkable success in automating digital IC design workflows, these advances cannot be directly transferred to analog IC design. Key challenges include strongly coupled performance metrics, the predominance of unstructured circuit schematic images, and the fact that most prior approaches address only isolated stages of the analog design process, limiting their ability to capture end-to-end performance impact. To address these challenges, we propose AnalogMaster, an extensible, LLM-based framework that enables end-to-end automation of analog IC design through a unified pipeline spanning circuit image-to-netlist generation, parameter optimization, placement, and routing. AnalogMaster integrates a joint reasoning mechanism that leverages in-context learning and intent reasoning to achieve accurate and robust image-to-netlist conversion. A parameter search agent integrating self-enhanced prompt engineering and context truncation is developed for effective device sizing and downstream physical design. Experimental evaluations on 15 representative circuits with varying levels of complexity demonstrate strong and consistent performance across multiple models. In particular, GPT-5 achieves success rates of 92.9% and 99.9% on Pass@1 and Pass@5, respectively. These results validate the effectiveness and robustness of the proposed framework and establish a practical paradigm for applying LLMs to full-stack analog IC design automation.
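For readers parsing the Pass@1 and Pass@5 figures quoted above, such metrics are conventionally computed with the unbiased estimator of Chen et al. (2021): given n sampled designs per circuit of which c pass, the probability that at least one of k draws passes is estimated as below. The sample counts in the demo are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    # unbiased estimator: 1 - C(n-c, k) / C(n, k)
    if n - c < k:
        return 1.0                    # failures cannot fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

print(f"pass@1 = {pass_at_k(n=10, c=9, k=1):.3f}")   # 0.900
print(f"pass@5 = {pass_at_k(n=10, c=9, k=5):.3f}")   # 1.000
```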
Kyungmi Lee, Zhiye Song, Eun Kyung Lee, Xin Zhang, Tamar Eilam, Anantha P. Chandrakasan
As AI workloads drive increases in datacenter power consumption, accurate GPU power estimation is critical for proactive power management. However, existing power models face a scalability bottleneck not in the modeling techniques themselves, but in obtaining the hardware utilization inputs they require. Conventional approaches rely on either costly simulation or hardware profiling, which makes them impractical when rapid predictions are required. This work presents EnergAIzer, which addresses this scalability bottleneck by developing a lightweight solution to predict utilization inputs, reducing the estimation walltime from hours to seconds. Our key insight is that kernels in AI workloads commonly employ optimizations that create structured patterns, which analytically determine memory traffic and execution timeline. We construct a performance model using these patterns as an analytical scaffold for empirical data fitting, which also naturally exposes module-level utilization. This predicted utilization is then fed into our power model to estimate dynamic power consumption. EnergAIzer achieves power estimation errors of 8% on NVIDIA Ampere GPUs, competitive with traditional power models that rely on elaborate cycle-level simulation or hardware profiling. We demonstrate EnergAIzer's exploration capabilities for frequency scaling and architectural configurations, including forecasting the power of the NVIDIA H100 with just 7% error. In summary, EnergAIzer provides fast and accurate power prediction for AI workloads, paving the way for power-aware design explorations.
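The two-stage idea condenses to a few lines: derive module-level utilization analytically from a kernel's tiling pattern via a roofline-style timeline, then feed that utilization into a linear power model. In the sketch below the traffic formula, peak throughput, bandwidth, and power coefficients are illustrative stand-ins loosely inspired by an Ampere-class GPU, not EnergAIzer's fitted models.

```python
def gemm_utilization(m, n, k, tile=128, peak_flops=19.5e12, mem_bw=1.55e12):
    flops = 2 * m * n * k
    # the tiling pattern fixes DRAM traffic analytically: A is re-read once
    # per B column tile, B once per A row tile (2-byte operands assumed)
    bytes_moved = 2 * (m * k * (n // tile) + k * n * (m // tile))
    t_compute, t_mem = flops / peak_flops, bytes_moved / mem_bw
    t = max(t_compute, t_mem)                     # roofline-style timeline
    return {"sm": t_compute / t, "dram": t_mem / t}, t

def dynamic_power(util, c_sm=180.0, c_dram=70.0):
    # toy linear power model: watts per fully utilized module
    return c_sm * util["sm"] + c_dram * util["dram"]

util, t = gemm_utilization(4096, 4096, 4096)
print(f"util={util}  runtime={t*1e3:.2f} ms  power~{dynamic_power(util):.0f} W")
```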
Zehuan Zhang, Mark Chen, He Li, Wayne Luk
Complex-Valued Neural Networks (CVNNs) have significant advantages in handling tasks that involve complex numbers. However, existing CVNNs are unable to quantify predictive uncertainty. We propose, for the first time, dropout-based Bayesian Complex-Valued Neural Networks (BayesCVNNs) to enable uncertainty quantification for complex-valued applications; their modular structure makes them broadly applicable and efficient for hardware implementation. Furthermore, as the dual-part nature of complex values significantly broadens the design space and enables novel configurations based on layer-mixing and part-mixing, we introduce an automated search approach to effectively identify optimal configurations for both real and imaginary components. To facilitate deployment, we present a framework that generates customized FPGA-based accelerators for BayesCVNNs, leveraging a set of optimized building blocks. Experiments demonstrate that the best configuration can be effectively found via the automated search, attaining higher performance with lower hardware costs compared with manually crafted models. The optimized accelerators achieve approximately 4.5x and 13x speedups on different models with less than 10% of the power consumption of GPU implementations, and outperform existing work in both algorithm and hardware aspects. Our code is publicly available at: https://github.com/zehuanzhang/BayesCVNN.git.
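The dropout-based mechanism is simple to state: dropout stays active at inference, and the spread over T stochastic forward passes serves as the predictive uncertainty. The sketch below uses independent masks for the real and imaginary parts to illustrate the part-mixing design axis; the layer, rates, and readout are placeholders, not the searched BayesCVNN configurations.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 4)) + 1j * rng.standard_normal((8, 4))

def forward(x, p_re=0.2, p_im=0.5):
    # part-mixing: independent dropout masks and rates per complex part
    m_re = rng.random(x.shape) > p_re
    m_im = rng.random(x.shape) > p_im
    xd = (x.real * m_re / (1 - p_re)) + 1j * (x.imag * m_im / (1 - p_im))
    return np.abs(xd @ W)                    # magnitude readout

x = rng.standard_normal(8) + 1j * rng.standard_normal(8)
samples = np.stack([forward(x) for _ in range(100)])   # T=100 MC passes
print("mean:", samples.mean(0).round(2))
print("uncertainty (std):", samples.std(0).round(2))
```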
Upasna, Venkata Kalyan Tavva
Heterogeneous Memory Architecture (HMA) aims to optimize memory usage by leveraging a combination of memory types, such as high-bandwidth memory (HBM), commodity DRAM, and non-volatile memory (NVM), when utilized as main memory. To achieve maximum performance benefits, frequently accessed data pages are prioritized for storage in the faster HBM, while less frequently accessed pages are stored in slower memory types like DRAM or NVM. This enables a more efficient allocation of memory resources and improves overall system performance. In a Flat Address Space memory organization, all memory types, both fast and slow, are treated as a unified memory pool. This approach increases the overall memory capacity accessible to the system. In a Flat Address Space organization, frequently accessed data pages may need to be remapped from slower memory to faster memory to improve memory access times. Such relocation requires changes to the data/states in the TLB (TLB shootdown) and the processor cache (cache line invalidations), leading to performance degradation. To address these inefficiencies, we propose a novel solution called Duon. The goal of Duon is to eliminate the overheads associated with page migration in systems using an Extended TLB and Page Table. Specifically, our approach ensures that the updated mapping information for remapped pages is stored directly in the TLB and page table itself. By doing so, the need for TLB shootdown and cache line invalidation after page migration is eliminated. Consequently, our proposal results in an overall improvement in IPC by 3.87% over existing state-of-the-art techniques, enhancing the efficiency and performance of heterogeneous memory systems. Further, our approach can work with any of the existing page migration policies and improve their performance.
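A toy model of the extended-TLB idea makes the mechanism concrete: the entry for a migrated page is updated in place with its new physical location, so subsequent lookups are served without an invalidation broadcast. Field and method names below are illustrative, not Duon's actual structures.

```python
class ExtTLB:
    def __init__(self):
        self.entries = {}                 # vpn -> {"ppn": ..., "remap": ...}

    def insert(self, vpn, ppn):
        self.entries[vpn] = {"ppn": ppn, "remap": None}

    def migrate(self, vpn, new_ppn):
        # update in place: no invalidation broadcast, no TLB shootdown
        e = self.entries[vpn]
        e["remap"], e["ppn"] = e["ppn"], new_ppn

    def translate(self, vpn):
        return self.entries[vpn]["ppn"]

tlb = ExtTLB()
tlb.insert(vpn=0x42, ppn=0x9000)          # page initially in slow memory
tlb.migrate(vpn=0x42, new_ppn=0x0100)     # promoted to fast HBM
print(hex(tlb.translate(0x42)))           # 0x100, served without shootdown
```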
Rohan S. Kumar, Takahiro Tsunoda, Sophia H. Xue, Dantong Li, Robert J. Schoelkopf, Yongshan Ding
Near-term quantum workloads demand error management, yet the two lightest-weight techniques, Quantum Error Detection (QED) and Probabilistic Error Cancellation (PEC), have complementary cost profiles whose joint architectural design space remains unexplored. QED encodes logical qubits and discards error-flagged runs, filtering noise with low qubit overhead but leaving residual errors; PEC can correct these in software, but at exponential cost in noise strength. If QED efficiently reduces per-gate noise, PEC's cost savings can outweigh QED's discard overhead; realizing this, however, requires solving two system-level design challenges. First, the \textit{QED interval} -- how often detection cycles are inserted -- is a tunable architectural parameter governing the cost-accuracy tradeoff. We derive an efficiency condition and show that the canonical one-cycle-per-gate frequency does not achieve break-even in any code we evaluate, while optimized intervals on high-rate Iceberg codes do. Second, we discover that naive PEC+QED integration \textit{degrades} accuracy below the QED-only baseline. The root cause is a transient error profile in the first detection cycle that corrupts PEC's noise model. We develop \textit{steady-state extraction}, a co-designed characterization protocol that isolates steady-state error behavior, reducing estimation bias by up to $10.2\times$. On a $[[6,4,2]]$ Iceberg code running QAOA ($p{=}4$--$8$) with a fixed shot budget, PEC+QED achieves $2$--$11\times$ lower absolute error and up to $31\times$ lower MSE versus PEC on physical qubits, with per-interval savings compounding over interval depth.
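The interval tradeoff can be made tangible with a toy cost model: frequent detection (small k) pays a per-cycle latency and discard tax, while sparse detection (large k) lets more paired errors slip past the flag and inflates PEC's exponential sampling overhead. Every constant below is illustrative rather than fitted to hardware, but the qualitative minimum at an intermediate k mirrors the finding that one cycle per gate is not optimal.

```python
import math

def relative_cost(n_gates, eps, k, c=4.0):
    cycles = n_gates / k
    p_accept = math.exp(-eps * n_gates)         # error-flagged runs discarded
    undetected = cycles * (eps * k) ** 2        # paired errors evade the flag
    pec_overhead = math.exp(c * undetected)     # PEC cancels the residue
    cycle_tax = 1 + 0.1 * cycles                # toy cost of each QED cycle
    return pec_overhead * cycle_tax / p_accept

n, eps = 200, 1e-2
for k in (1, 5, 10, 20, 50):
    print(f"interval k={k:2d}: relative shot cost = {relative_cost(n, eps, k):7.1f}")
# The minimum sits at an intermediate k, not at one cycle per gate.
```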
Cagri Eryilmaz
Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA's CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic-neural reasoning with adaptive multi-agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168-dim state (an alternative world-model MPC planner is also evaluated); (2) a hybrid symbolic-neural architecture that solves K-map and truth-table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge-augmented generation from a 321-pattern base plus 971 open-source reference implementations with focus-aware retrieval; and (4) hierarchical specification decomposition into dependency-ordered sub-modules with interface synchronization. On VerilogEval-Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15-98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self-reported) and ahead of MAGE (95.9%). On a 302-problem non-agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36-60 percentage-point lift per category over the published single-shot baseline; we additionally lead three of four categories shared with NVIDIA's ACE-RTL despite using roughly 30x fewer per-problem attempts. A RISC-V SoC case study demonstrates hierarchical decomposition generating 8/8 lint-passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.
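The symbolic half of the hybrid is the easiest to illustrate: a truth-table problem can be solved algorithmically, with no LLM in the loop, by emitting a canonical sum-of-products assign. The sketch below emits raw minterms; a real flow would presumably minimize with Quine-McCluskey or similar, and the function names are made up.

```python
from itertools import product

def sop_verilog(inputs, table):
    # table maps input bit-tuples to the output bit; emit one AND term per
    # minterm and OR them together as a Verilog continuous assignment.
    terms = []
    for bits, out in table.items():
        if out == 1:
            lits = [(v if b else f"~{v}") for v, b in zip(inputs, bits)]
            terms.append("(" + " & ".join(lits) + ")")
    return "assign f = " + (" | ".join(terms) if terms else "1'b0") + ";"

inputs = ["a", "b", "c"]
majority = {bits: int(sum(bits) >= 2) for bits in product((0, 1), repeat=3)}
print(sop_verilog(inputs, majority))
# assign f = (~a & b & c) | (a & ~b & c) | (a & b & ~c) | (a & b & c);
```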
Archisman Ghosh, Avimita Chatterjee, Swaroop Ghosh
Practical quantum advantage is expected to depend on fault-tolerant quantum computing, although the architectural overhead needed to support fault tolerance is still extremely high. Prior FTQC designs generally emphasize either fast logical-qubit accessibility at the cost of significant qubit overhead, or high logical-qubit density at the cost of added workload latency. We propose an architecture that balances these competing objectives by placing surface-code patches around an ancilla-centric region, which yields nearly uniform ancilla access for all data qubits. Building on this design, we introduce a new workload-driven placement method that uses the $T$-gate profile of an application to determine an effective floorplan. We further provide a reconfigurable optimization for reducing the latency of $Y$-gate measurements on a per-workload basis. To improve flexibility, we also study concurrent execution of multiple programs on the same architecture. Numerical evaluation indicates that our approach keeps cycles per instruction near the optimal regime while reducing the number of required data tiles by up to $\sim21\%$, and achieves up to $\sim90\%$ efficiency when running 10 programs concurrently.
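The workload-driven placement can be sketched greedily: logical qubits with the heaviest $T$-gate profiles receive the data tiles closest to the ancilla-centric region, since they contact ancilla and magic-state resources most often. The profiles, distances, and greedy matching below are illustrative; the paper's algorithm may differ.

```python
def place(t_counts, tile_distances):
    # rank qubits by T-gate traffic, tiles by proximity to the ancilla
    # region, and pair them off greedily
    by_traffic = sorted(t_counts, key=t_counts.get, reverse=True)
    by_proximity = sorted(tile_distances, key=tile_distances.get)
    return dict(zip(by_traffic, by_proximity))

t_counts = {"q0": 120, "q1": 15, "q2": 64, "q3": 3}   # T gates per qubit
tiles = {"t0": 1, "t1": 1, "t2": 2, "t3": 3}          # hops to ancilla region
print(place(t_counts, tiles))   # q0 -> t0, q2 -> t1, q1 -> t2, q3 -> t3
```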
Chao Qian, Tianheng Ling, Gregor Schiele
Long Short-term Memory Networks (LSTMs) are a vital Deep Learning technique suitable for performing on-device time series analysis on local sensor data streams of embedded devices. In this paper, we propose a new hardware accelerator design for LSTMs specially optimised for resource-scarce embedded Field Programmable Gate Arrays (FPGAs). Our design improves the execution speed and reduces energy consumption compared to related work. Moreover, it can be adapted to different situations using a number of optimisation parameters, such as the usage of DSPs or the implementation of activation functions. We present our key design decisions and evaluate the performance. Our accelerator achieves an energy efficiency of 11.89 GOP/s/W during a real-time inference with 32873 samples/s.
Zhenghua Ma, G Abarajithan, Dimitrios Danopoulos, Olivia Weng, Francesco Restuccia, Ryan Kastner
Extreme-edge scientific applications use machine learning models to analyze sensor data and make real-time decisions. Their stringent latency and throughput requirements demand small batch sizes and require that model weights remain fully on-chip. Spatial dataflow implementations are common for extreme-edge applications. Spatial dataflow works well for small networks, but it fails to scale to larger models due to inherent resource scaling limitations. AI Engines on modern FPGA SoCs offer a promising alternative with high compute density and additional on-chip memory. However, the architecture, programming model, and performance-scaling behavior of AI Engines differ fundamentally from those of the programmable logic, making direct comparison non-trivial and the benefits of using AI Engines unclear. This work addresses how and when extreme-edge scientific neural networks should be implemented on AI Engines versus programmable logic. We provide systematic architectural characterization and micro-benchmarking and introduce a latency-adjusted resource equivalence (LARE) metric that identifies when AI Engine implementations outperform programmable logic designs. We further propose spatial and API-level dataflow optimizations tailored to low-latency scientific inference. Finally, we demonstrate the successful deployment of end-to-end neural networks on AI Engines that cannot fit on programmable logic when using the hls4ml toolchain.
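Without reproducing the paper's exact definition, a schematic comparison in the spirit of a latency-adjusted resource equivalence looks like this: collapse each implementation's resource vector into the binding fraction of the device it consumes, then scale by achieved latency so a slower design must be proportionally cheaper to break even. Budgets, usage vectors, and latencies below are invented for illustration.

```python
def lare_score(used, device, latency_us, baseline_latency_us):
    frac = max(u / d for u, d in zip(used, device))   # binding resource
    return frac * (latency_us / baseline_latency_us)  # latency adjustment

device_pl = (1304, 600, 967)            # (kLUT, BRAM, DSP) budget, illustrative
pl  = lare_score(used=(410, 220, 380), device=device_pl,
                 latency_us=4.0, baseline_latency_us=4.0)
# AIE mapping frees most PL (keeping only data movers) at higher latency
aie = lare_score(used=(32, 40, 0), device=device_pl,
                 latency_us=6.5, baseline_latency_us=4.0)
print(f"PL score={pl:.2f}  AIE score={aie:.2f}  (lower is better)")
```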
Kangbo Bai, Zhantong Zhu, Yifan Ding, Tianyu Jia
In large-scale distributed LLM training, communication between devices becomes the key performance bottleneck. Chiplet technology can integrate multiple dies into a package to scale up node performance with higher bandwidth. Meanwhile, optical interconnect (OI) technology offers long-reach, high-bandwidth links, making it well suited for scale-out networks. The combination of these two technologies has the potential to overcome communication bottlenecks within and across packages. In this work, we present ChipLight, a cross-layer multi-objective design and optimization method for training clusters leveraging chiplets and OI. We first abstract an architecture model for such complex clusters, co-optimizing the chiplet architecture, training parallelization strategy, and OI network topology. Based on these models, we tailor the design space exploration flow by combining black-box and white-box methodologies. Our experimental results show that ChipLight achieves significantly improved training efficiency and provides valuable design insights for the development of future training clusters.
Chao Li
Memristive crossbars store numerical weights needing aggregation and decoding; a single junction means nothing alone. This paper presents a fundamentally different use: each junction stores a complete, domain-scoped logical assertion (holds/negated/undefined). Ternary resistance states encode these values directly. We establish a structure-preserving mapping from a domain algebra to crossbar topology: domains become isolated arrays, specialization becomes directed wiring, relation typing controls inheritance gates, and cross-domain links become explicit registers. The physical layout thus embodies the algebra; changing wiring changes reasoning semantics. We detail an ICD-11 respiratory disease classification chip (1,247 entities, $\sim$136k 1T1R junctions) enabling domain scoping, three-valued logic, transitive cascade, typed inheritance, and cross-axis queries. Behavioral simulation ($\sigma_{\mathrm{log}}=0.15$, SNR = 20 dB) shows error-free operation across 100,000 trials per task with wide tolerance margins. Where prior work unified representation and computation in software, this work unifies them in hardware: reading one junction answers one question, without symbolic interpretation.
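The three-valued read itself is compact. With holds/undefined/negated encoded as 1/0/-1, Kleene conjunction is just a minimum, and a typed inheritance gate either passes a parent's assertion down a specialization edge or yields undefined. The encodings and the two-entry knowledge base below are illustrative, not the ICD-11 chip's layout.

```python
HOLDS, UNDEF, NEG = 1, 0, -1        # resistance-coded three-valued logic

def t_and(a, b):
    # with HOLDS=1 > UNDEF=0 > NEG=-1, Kleene conjunction is a minimum
    return min(a, b)

def inherit(parent_val, gate_open):
    # typed-relation gating: a closed gate blocks inheritance entirely
    return parent_val if gate_open else UNDEF

kb = {("pneumonia", "affects_lung"): HOLDS,
      ("pneumonia", "is_viral"): UNDEF}
child = inherit(kb[("pneumonia", "affects_lung")], gate_open=True)
print(t_and(child, kb[("pneumonia", "is_viral")]))   # 0: undefined dominates
```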
Mustafa Mert Özyılmaz
ARM-based and x86-64 laptop processors differ not only in instruction-set design, but also in memory hierarchy, core organization, system integration, and power-management mechanisms. This study presents a combined architectural and experimental comparison of an Apple M3 system and an AMD Ryzen 7 3750H system. The architectural analysis contrasts AArch64's fixed-width load-store design with the variable-length, memory-operand-rich x86-64 instruction model, and discusses how register organization, calling conventions, heterogeneous core organization, memory behavior, and low-power mechanisms shape observed performance and energy characteristics. The experimental part uses two native assembly benchmarks: a recursive Fibonacci workload and an integer matrix-multiplication workload. The analysis combines repeated timing measurements, processor-energy measurements, and cross-platform microarchitectural counter measurements from matched portable-C profiling runs. The Ryzen platform is decisively faster on the branch-heavy Fibonacci benchmark, while matrix multiplication shows no meaningful timing advantage for either platform in the present measurements. In contrast, the Apple platform is markedly more energy-efficient, reducing energy-to-solution by approximately 5.82$\times$ on Fibonacci and 6.38$\times$ on matrix multiplication. These results are interpreted as platform-level findings rather than as pure ISA-only effects, reflecting differences in implementation, system integration, and measurement methodology in addition to instruction-set structure.