Zhiyue Zhao, Zhiwei Jiang, Yizhe Huang, Mebrouka Boubeche, Valentina G. Matveeva, Hector F. Garces, Huixia Luo, Kai Yan
Rational design and green synthesis of low-cost, robust catalysts for the selective oxidation of various alcohols remain challenging. Herein, we report a fast, solvent-free arc-melting (AM) method to controllably synthesize a semimetal CoSi alloy (abbreviated as AM-CoSi) that is efficient for the base- and solvent-free oxidation of six types of aromatic alcohols. X-ray absorption fine structure (XAFS), electron paramagnetic resonance (EPR), and aberration-corrected high-angle annular dark-field scanning transmission electron microscopy (AC HAADF-STEM) confirm the successful synthesis of AM-CoSi with abundant Si vacancies (Siv). The as-prepared CoSi alloy catalyst exhibits an order-of-magnitude activity enhancement in the oxidation of the model reactant benzyl alcohol (BAL) to benzyl benzoate (BBE) compared with its monometallic counterparts, delivering a 70% yield of BBE, the highest reported to date. Experimental results and DFT calculations verify that the CoSi alloy structure improves BAL conversion and that Si vacancies mainly contribute to the generation of BBE. The CoSi alloy also maintains high stability, and a plausible reaction pathway is proposed. In addition, the CoSi alloy works efficiently for the selective oxidation of various alcohols bearing different functional groups. This work demonstrates for the first time that a semimetal CoSi alloy is robust for the green oxidation of various alcohols and opens broad opportunities for the rational design and application of other semimetal alloy catalysts.
Liyang Li, Wen Wang, Canyu Zhao, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen
Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically limited to video-only control, which constrains comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism that incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
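The abstract above does not spell out the exact form of modality-specific guidance scaling; a common way to expose independent per-condition strengths at inference is a multi-condition variant of classifier-free guidance, where each condition contributes its own scaled guidance term. The sketch below is a minimal illustration under that assumption; the function name, modality names, and scale values are hypothetical stand-ins, not MMControl's actual API.

```python
import numpy as np

def modality_guided_prediction(eps_uncond, eps_cond_by_modality, scales):
    """Combine an unconditional prediction with per-modality conditional
    predictions, each weighted by its own guidance scale (illustrative only).

    eps_uncond: denoiser output with all conditions dropped.
    eps_cond_by_modality: dict mapping modality name -> output with only
        that modality's condition enabled.
    scales: dict mapping modality name -> guidance scale chosen at inference.
    """
    guided = eps_uncond.copy()
    for name, eps_cond in eps_cond_by_modality.items():
        guided += scales[name] * (eps_cond - eps_uncond)
    return guided

# Toy usage with random arrays standing in for denoiser outputs.
rng = np.random.default_rng(0)
shape = (4, 64)  # e.g. a flattened latent
eps_uncond = rng.normal(size=shape)
eps_cond = {m: rng.normal(size=shape) for m in ("reference_image", "reference_audio", "pose")}
scales = {"reference_image": 2.0, "reference_audio": 1.5, "pose": 1.0}
print(modality_guided_prediction(eps_uncond, eps_cond, scales).shape)
```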
Yongtao Ge, Guangkai Xu, Zhiyue Zhao, Libo Sun, Zheng Huang, Yanlong Sun, Hao Chen, Chunhua Shen
Recent advances in discriminative and generative pretraining have yielded geometry estimation models with strong generalization capabilities. While discriminative monocular geometry estimation methods rely on large-scale fine-tuning data to achieve zero-shot generalization, several generative paradigms show the potential for impressive generalization on unseen scenes by leveraging pre-trained diffusion models and fine-tuning on only a small amount of synthetic training data. Frustratingly, these models are trained with different recipes on different datasets, making it difficult to identify the critical factors that determine evaluation performance. In addition, current geometry evaluation benchmarks have two main drawbacks that may hinder the development of the field, namely limited scene diversity and poor label quality. To resolve these issues, (1) we build fair and strong baselines in a unified codebase for evaluating and analyzing geometry estimation models; and (2) we evaluate monocular geometry estimators on more challenging benchmarks with diverse scenes and high-quality annotations. Our results reveal that discriminative models pre-trained on large data, such as DINOv2, can outperform their generative counterparts with a small amount of high-quality synthetic data under the same training configuration, which suggests that fine-tuning data quality is more important than data scale and model architecture. Our observations also raise a question: if simply fine-tuning a general vision model such as DINOv2 on a small amount of synthetic depth data produces SOTA results, do we really need complex generative models for depth estimation? We believe this work can propel advances in geometry estimation tasks as well as a wide range of downstream applications.
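For context, zero-shot monocular geometry benchmarks of the kind discussed above typically report scale-invariant depth metrics such as absolute relative error (AbsRel) and the δ < 1.25 accuracy after aligning predictions to the ground truth. The sketch below shows these standard metrics with a simple median-scaling alignment; it is illustrative only and is not the paper's evaluation code.

```python
import numpy as np

def depth_metrics(pred, gt, valid_mask):
    """Standard scale-aligned monocular depth metrics (AbsRel, delta1)."""
    pred, gt = pred[valid_mask], gt[valid_mask]
    # Median scaling: a common alignment step for scale-ambiguous predictions.
    pred = pred * (np.median(gt) / np.median(pred))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)
    return {"AbsRel": abs_rel, "delta1": delta1}

# Toy example with synthetic depth maps.
gt = np.random.uniform(1.0, 10.0, size=(240, 320))
pred = gt * 0.5 + np.random.normal(0, 0.1, size=gt.shape)  # scale-ambiguous prediction
print(depth_metrics(pred, gt, gt > 0))
```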
Yuchen Wang, Yaoyu Liu, Man Zhang, Biying Liu, Zhiyue Zhao, Kai Yan
Nickel-based layered double hydroxides (LDHs) are promising electrode materials in the fields of energy storage (supercapacitors) and conversion (urea oxidation). The rational construction of atomic and electronic structure is crucial for nickel-based LDHs to realize satisfactory electrochemical performance. Herein, we report a facile, eco-friendly, one-step synthesis process to construct petal-like oxygen-deficient NiAl-LDH nanosheets for hybrid supercapacitors (HSCs) and the urea oxidation reaction (UOR). The as-prepared NiAl-LDH nanosheets with rich oxygen vacancies possess a large specific surface area of 216.6 m² g⁻¹ and a desirable electronic conductivity of 3.45 × 10⁻⁴ S cm⁻¹, delivering an ultra-high specific capacitance of 2801 F g⁻¹ (700 C g⁻¹) at 1 A g⁻¹. Furthermore, a high specific energy of 50.0 W h kg⁻¹ at 400 W kg⁻¹ and excellent cycling stability with 91% capacitance retention after 10,000 cycles are achieved by the NiAl-LDHs/CFP (carbon fiber paper) (+)//YP-80F (a commercial activated carbon) (−) HSC. In addition, the NiAl-LDH nanosheets also serve as an efficient electrocatalyst for the UOR, requiring only 1.42 V vs. the reversible hydrogen electrode to drive 10 mA cm⁻² in 1 mol L⁻¹ KOH with 0.33 mol L⁻¹ urea. This remarkable performance surpasses most previously reported candidates, owing to the thin NiAl-LDH nanosheet structure that exposes more active sites and to the abundant oxygen vacancies. Various reaction parameters are also investigated to optimize the electrochemical performance. Overall, this work paves a new way for the design of multifunctional nanostructured energy materials.
Yuchen Wang, Yaoyu Liu, Zhiyue Zhao, Zhikeng Zheng, Alina M. Balu, Rafael Luque, Kai Yan
The electrocatalytic hydrogenation of nitrobenzene (Ph-NO2) reaction (EHNR) has been considered a potential alternative to the traditional thermocatalytic process for producing high-value aniline (Ph-NH2). However, owing to the lack of robust catalysts and low surface H* coverage, EHNR suffers from unsatisfactory performance and an undetermined mechanism. Herein, we construct noble-metal-free topological FeSi (M-FeSi) materials through a solvent-free microwave strategy for efficient EHNR in neutral medium. Impressively, benefiting from abundant active H* intermediates on its surface, the topological M-FeSi catalyst achieves 99.7% conversion of Ph-NO2 and 93.8% yield of Ph-NH2 after 200 C in neutral medium, outperforming previously reported candidates as well as a FeSi catalyst synthesized via the traditional arc-melting method under the same conditions. Moreover, theoretical calculations validate that the high surface H* coverage over the M-FeSi catalyst switches the rate-determining step from Ph-NO2* → Ph-NO* to Ph-NO* → Ph-NHOH*, thereby decreasing the total energy barrier of electrocatalytic Ph-NH2 production.
Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, Chunhua Shen
Extensive pre-training with large data is indispensable for downstream geometry and semantic visual perception tasks. Thanks to large-scale text-to-image (T2I) pretraining, recent works show promising results by simply fine-tuning T2I diffusion models for dense perception tasks. However, several crucial design decisions in this process still lack comprehensive justification, encompassing the necessity of the multi-step stochastic diffusion mechanism, the training strategy, the inference ensemble strategy, and fine-tuning data quality. In this work, we conduct a thorough investigation into the critical factors that affect transfer efficiency and performance when using diffusion priors. Our key findings are: 1) High-quality fine-tuning data is paramount for both semantic and geometry perception tasks. 2) The stochastic nature of diffusion models has a slightly negative impact on deterministic visual perception tasks. 3) Apart from fine-tuning the diffusion model with only latent-space supervision, task-specific image-level supervision is beneficial for enhancing fine-grained details. These observations culminate in GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks. Unlike previous multi-step methods, our paradigm offers much faster inference and can be seamlessly integrated with customized perception decoders and loss functions for image-level supervision, which is critical for improving the fine-grained details of predictions. Comprehensive experiments on diverse dense visual perception tasks, including monocular depth estimation, surface normal estimation, image segmentation, and matting, demonstrate the remarkable adaptability and effectiveness of our method.
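To make the one-step deterministic idea concrete, the following sketch shows how a pre-trained denoiser might be run once at a fixed timestep, with a task-specific decoder and an image-level loss on top. All modules here are toy placeholders (a small conv encoder, an identity denoiser, a conv decoder) standing in for the actual Stable Diffusion components; it illustrates the paradigm described above, not GenPercept's implementation.

```python
import torch
import torch.nn as nn

class OneStepPerception(nn.Module):
    """Deterministic one-step sketch: encode, denoise once, decode a task map."""
    def __init__(self, encoder, denoiser, task_decoder, fixed_t=999):
        super().__init__()
        self.encoder, self.denoiser, self.task_decoder = encoder, denoiser, task_decoder
        self.fixed_t = fixed_t  # a single fixed timestep instead of a stochastic schedule

    def forward(self, image):
        z = self.encoder(image)                       # image -> latent
        t = torch.full((z.shape[0],), self.fixed_t)   # same timestep for every sample
        feat = self.denoiser(z, t)                    # one deterministic pass, no noise added
        return self.task_decoder(feat)                # task-specific image-level prediction

# Placeholder components so the sketch runs end to end.
enc = nn.Conv2d(3, 4, 8, stride=8)
den = lambda z, t: z  # stands in for a pre-trained UNet
dec = nn.ConvTranspose2d(4, 1, 8, stride=8)
model = OneStepPerception(enc, den, dec)
pred = model(torch.randn(2, 3, 64, 64))
loss = nn.functional.l1_loss(pred, torch.rand(2, 1, 64, 64))  # image-level supervision
print(pred.shape, loss.item())
```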
Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen
We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene fine-tuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view-consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop a framework that generates multi-view-consistent edited views without per-scene training and consists of two novel components: (1) a referring multi-view editor that enables precise, reference-driven edits that remain coherent across all viewpoints; and (2) an any-view-to-video synthesizer that leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker
Zhiwei Jiang, Zhiyue Zhao, Xin Li, Huaiguang Li, Hector F. Garces, Mahmoud Amer, Kai Yan
Developing a green and cost-effective catalytic system for the selective oxidation of biomass-derived alcohols is vital for the sustainable synthesis of fine chemicals. Herein, highly dispersed subnanometric amorphous CoOx clusters in anatase TiO2 nanosheets (Co-TiO2), fabricated by a green CO2-solvent-assisted approach, directly activate peroxymonosulfate (PMS) for the highly selective oxidation of various biomass-derived alcohols. Advanced characterizations (e.g., EXAFS, EPR, AC HAADF-STEM) reveal a strong interaction between the CoOx clusters and the anatase TiO2 support in Co-TiO2 and show that Co in Co-TiO2 consists mainly of Co²⁺ and Co³⁺. The Co-TiO2 catalyst offers superior catalytic performance in the conversion of six types of alcohols (e.g., benzyl alcohol (BAL) and 5-hydroxymethylfurfural (HMF)) with high selectivity toward the corresponding aldehydes. The highly dispersed CoOx clusters and the interaction between the CoOx clusters and the TiO2 support contribute to this superior performance. Mechanistic studies show that SO4•− radicals play the dominant role in the selective oxidation of the model reactant BAL, while ¹O2 participates in a non-radical pathway. DFT calculations agree well with the experiments and reveal that the strong interaction between the CoOx clusters and the TiO2 support promotes the formation of SO4•− and SO5•−.
Jiacheng Fan, Zhiyue Zhao, Yiqian Zhang, Chao Chen, Peide Wang, Hengdi Zhang, Zhengxue Cheng
Acquiring large-scale, high-fidelity robot demonstration data remains a critical bottleneck for scaling Vision-Language-Action (VLA) models in dexterous manipulation. We propose a Real-Sim-Real data collection and data editing pipeline that transforms human demonstrations into robot-executable, environment-specific training data without direct robot teleoperation. Standardized data collection rooms are built to capture multimodal human demonstrations (3 synchronized RGB-D videos, 11 RGB videos, 29-DoF glove joint angles, and 14-channel tactile signals). Based on these human demonstrations, we introduce a tactile-aware retargeting method that maps human hand states to robot dex-hand states via geometry- and force-guided optimization. The retargeted robot trajectories are then rendered in a photorealistic Isaac Sim environment to build robot training data. Real-world experiments demonstrate that: (1) the retargeted dex-hand trajectories achieve an 84% success rate across 10 diverse object manipulation tasks; and (2) VLA policies (Pi0.5) trained exclusively on our generated data achieve an 80% average success rate on three representative tasks, i.e., pick-and-place, pushing, and pouring. In conclusion, robot training data can be efficiently "painted" from human demonstrations using our Real-Sim-Real data pipeline, offering a scalable, cost-effective alternative to teleoperation with minimal performance loss for complex dexterous manipulation.
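The abstract describes the retargeting step only at a high level; one plausible reading of "geometry and force-guided optimization" is a least-squares fit over robot joint angles in which per-finger position errors are additionally weighted by measured tactile contact forces. The sketch below illustrates that reading with a placeholder forward-kinematics function and made-up weights; it is not the paper's actual formulation.

```python
import numpy as np
from scipy.optimize import minimize

def robot_fingertips(q):
    """Placeholder forward kinematics: maps joint angles to 5 fingertip positions."""
    return np.tanh(q.reshape(5, 3))  # stands in for the real FK of the dex-hand

def retarget(human_tips, contact_weights, q_init, w_geom=1.0, w_force=0.5):
    """Fit robot joint angles to human fingertip targets, emphasizing contact fingers."""
    def objective(q):
        tips = robot_fingertips(q)
        geom = np.sum((tips - human_tips) ** 2, axis=1)  # per-finger position error
        # Fingers with higher measured tactile force are weighted more strongly,
        # so contact-critical fingers are matched more precisely.
        return w_geom * np.sum(geom) + w_force * np.sum(contact_weights * geom)
    return minimize(objective, q_init, method="L-BFGS-B").x

human_tips = np.random.uniform(-0.5, 0.5, size=(5, 3))  # from glove/RGB-D capture
contact_weights = np.array([1.0, 1.0, 0.2, 0.1, 0.0])   # normalized tactile readings
q = retarget(human_tips, contact_weights, q_init=np.zeros(15))
print(robot_fingertips(q))
```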
Canyu Zhao, Yanlong Sun, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and introduce DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). We design comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, requiring fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models. Code and Model: https://github.com/aim-uofa/Diception
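The abstract does not state how the roughly 1% of trainable parameters is selected; a low-rank adapter on linear layers (LoRA-style) is one common way to reach such a budget and is shown below purely as an illustration, not as DICEPTION's actual adaptation code.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Freeze a pre-trained linear layer and train only a small low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the pre-trained weight
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)              # adapter starts as an identity update

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))  # frozen path + low-rank correction

base = nn.Linear(1024, 1024)
adapted = LowRankAdapter(base, rank=8)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
total = sum(p.numel() for p in adapted.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # ~1.5% for this layer size
```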