Phan Nguyen, Rosemary Braun
Jun 30, 2018 · q-bio.QM
Accurate gene regulatory networks can be used to explain the emergence of different phenotypes, disease mechanisms, and other biological functions. Many methods have been proposed to infer networks from gene expression data but have been hampered by problems such as low sample size, inaccurate constraints, and incomplete characterizations of regulatory dynamics. Since expression regulation is dynamic, time-course data can be used to infer causality, but these datasets tend to be short or sparsely sampled. In addition, temporal methods typically assume that the expression of a gene at a time point depends on the expression of other genes at only the immediately preceding time point, while other methods include additional time points without any constraints to account for their temporal distance. These limitations can contribute to inaccurate networks with many missing and anomalous links. We adapted the time-lagged Ordered Lasso, a regularized regression method with temporal monotonicity constraints, for \textit{de novo} reconstruction. We also developed a semi-supervised method that embeds prior network information into the Ordered Lasso to discover novel regulatory dependencies in existing pathways. We evaluated these approaches on simulated data for a repressilator, time-course data from past DREAM challenges, and a HeLa cell cycle dataset to show that they can produce accurate networks subject to the dynamics and assumptions of the time-lagged Ordered Lasso regression.
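The core idea of time-lagged regression can be sketched with a toy example. The snippet below only builds the lagged design matrix; the Ordered Lasso's monotonicity-constrained fit is not reproduced here, and the gene names and lag settings are illustrative assumptions.

```python
# Sketch: building a time-lagged design matrix for gene network inference.
# The paper's actual method additionally imposes the Ordered Lasso's
# monotonicity constraint on the lag coefficients, which is omitted here.

def lagged_design(expression, target_gene, max_lag):
    """Build (X, y): each row of X holds the expression of every gene
    at lags 1..max_lag; y holds the target gene at the current time."""
    genes = sorted(expression)
    T = len(expression[target_gene])
    X, y = [], []
    for t in range(max_lag, T):
        row = [expression[g][t - lag] for g in genes for lag in range(1, max_lag + 1)]
        X.append(row)
        y.append(expression[target_gene][t])
    return X, y

# Toy two-gene time course: gene "B" tracks gene "A" with a lag of 1.
expr = {"A": [0, 1, 2, 3, 4, 5], "B": [9, 0, 1, 2, 3, 4]}
X, y = lagged_design(expr, "B", max_lag=2)
print(len(X), len(X[0]))  # 4 rows, 2 genes x 2 lags = 4 predictors
```

A regression of `y` on `X` would then score each lagged predictor as a candidate regulator of the target gene.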
Nguyen Phan, Guoning Chen
Curve-based representations, particularly integral curves, are often used to represent large-scale computational fluid dynamic simulations. Processing and analyzing curve-based vector field data sets often involves searching for neighboring segments given a query point or curve segment. However, because the original flow behavior may not be fully represented by the set of integral curves and the input integral curves may not be evenly distributed in space, popular neighbor search strategies often return skewed and redundant neighboring segments. Yet, there is a lack of systematic and comprehensive research on how different configurations of neighboring segments returned by specific neighbor search strategies affect subsequent tasks. To fill this gap, this study evaluates the performance of two popular neighbor search strategies combined with different distance metrics on a point-based vector field reconstruction task and a segment saliency estimation task, both using input integral curves. A large number of reconstruction tests and saliency calculations are conducted for the study. To characterize the configurations of neighboring segments for an effective comparison of different search strategies, a number of measures, such as average neighbor distance and uniformity, are proposed. Our study leads to a few observations that partially confirm our expectations about the ideal configurations of a neighborhood while revealing additional findings that were previously overlooked by the community.
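One plausible instance of such a neighbor search is a k-nearest query under a mean-of-closest-point distance between sampled segments. This sketch is illustrative only; it is not claimed to be one of the exact strategies or metrics evaluated in the study.

```python
import math

def mean_closest_point(seg_a, seg_b):
    """Mean-of-closest-point distance from seg_a to seg_b, where each
    segment is a list of sample points. (Note: this form is asymmetric.)"""
    return sum(min(math.dist(p, q) for q in seg_b) for p in seg_a) / len(seg_a)

def k_nearest(query, segments, k, metric=mean_closest_point):
    """Return the k segments closest to `query` under `metric`."""
    return sorted(segments, key=lambda s: metric(query, s))[:k]

# Three short 2D segments; the query coincides with the first one.
segs = [[(0, 0), (1, 0)], [(0, 2), (1, 2)], [(5, 5), (6, 5)]]
query = [(0, 0), (1, 0)]
print(k_nearest(query, segs, k=2))  # nearest two: segs[0], then segs[1]
```

In practice such queries are accelerated with spatial indices rather than the brute-force sort shown here.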
Nguyen Phan, Brian Kim, Adeel Zafar, Guoning Chen
Streamlines have been widely used to represent and analyze various steady vector fields. To sufficiently represent important features in complex vector fields (like flow), a large number of streamlines are required. Due to the lack of a rigorous definition of features or patterns in streamlines, user interaction and exploration are required to achieve effective interpretation. Existing approaches based on clustering or pattern search, while valuable for specific analysis tasks, often face challenges in supporting interactive and level-of-detail exploration of large-scale curve-based data, particularly when real-time parameter adjustment and iterative refinement are needed. To address this, we design and implement an interactive web-based system. Our system utilizes a Curve Segment Neighborhood Graph (CSNG) to encode the neighboring relationships between curve segments. CSNG enables us to adapt a fast community detection algorithm to identify coherent flow structures and spatial groupings in the streamlines interactively. CSNG also supports multi-level exploration through an enhanced force-directed layout. Furthermore, our system integrates an adjacency matrix representation to reveal detailed inter-relations among segments. To achieve real-time performance within a web browser, our system employs matrix compression for memory-efficient CSNG storage and parallel processing. We have applied our system to analyze and interpret complex patterns in several streamline datasets. Our experiments show that we achieve real-time performance on datasets with hundreds of thousands of segments.
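As a rough, self-contained stand-in for community detection over a segment-neighborhood graph (the paper's CSNG and its particular fast algorithm are not reproduced here), the sketch below runs a deterministic label-propagation variant on a toy graph of two cliques joined by one edge.

```python
from collections import Counter

def label_propagation(adj, max_iter=20):
    """Deterministic label propagation: each node adopts the most frequent
    label among its neighbors; ties keep the current label when possible,
    otherwise take the largest candidate label."""
    labels = {v: v for v in adj}
    for _ in range(max_iter):
        changed = False
        for v in sorted(adj):
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            candidates = {l for l, c in counts.items() if c == best}
            new = labels[v] if labels[v] in candidates else max(candidates)
            if new != labels[v]:
                labels[v], changed = new, True
        if not changed:
            break
    return labels

# Two 4-cliques joined by a single edge (3-4): a tiny stand-in for a CSNG.
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
       4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6]}
labels = label_propagation(adj)
print(labels)  # nodes 0-3 share one label, nodes 4-7 another
```

On this toy graph the two cliques settle into two distinct communities, mirroring how coherent streamline-segment groups would emerge from a denser neighborhood graph.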
Brian R. Bartoldson, Yeping Hu, Amar Saini, Jose Cadena, Yucheng Fu, Jie Bao, Zhijie Xu, Brenda Ng, Phan Nguyen
Data-driven modeling approaches can produce fast surrogates to study large-scale physics problems. Among them, graph neural networks (GNNs) that operate on mesh-based data are desirable because they possess inductive biases that promote physical faithfulness, but hardware limitations have precluded their application to large computational domains. We show that it is \textit{possible} to train a class of GNN surrogates on 3D meshes. We scale MeshGraphNets (MGN), a subclass of GNNs for mesh-based physics modeling, via our domain decomposition approach to facilitate training that is mathematically equivalent to training on the whole domain under certain conditions. With this, we were able to train MGN on meshes with \textit{millions} of nodes to generate computational fluid dynamics (CFD) simulations. Furthermore, we show how to enhance MGN via higher-order numerical integration, which can reduce MGN's error and training time. We validated our methods on an accompanying dataset of 3D $\text{CO}_2$-capture CFD simulations on a 3.1M-node mesh. This work presents a practical path to scaling MGN for real-world applications.
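The error reduction from higher-order time integration mentioned above can be illustrated in isolation on a scalar ODE; this is a generic numerical-analysis sketch, not part of the MGN architecture itself.

```python
import math

def integrate(f, y0, t_end, steps, method):
    """Advance dy/dt = f(y) from y0 to t_end with a fixed-step method."""
    h = t_end / steps
    y = y0
    for _ in range(steps):
        if method == "euler":        # first-order explicit Euler
            y = y + h * f(y)
        elif method == "midpoint":   # second-order (RK2 midpoint rule)
            y = y + h * f(y + 0.5 * h * f(y))
    return y

f = lambda y: -y                     # exact solution: y(t) = exp(-t)
exact = math.exp(-1.0)
err_euler = abs(integrate(f, 1.0, 1.0, 50, "euler") - exact)
err_rk2 = abs(integrate(f, 1.0, 1.0, 50, "midpoint") - exact)
print(err_euler, err_rk2)  # the second-order step is far more accurate
```

The same principle, replacing a first-order update of the simulated state with a higher-order one, is what reduces rollout error in learned surrogates.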
Phan Nguyen, Rosemary Braun
Motivation: Inferring the structure of gene regulatory networks from high-throughput datasets remains an important and unsolved problem. Current methods are hampered by problems such as noise, low sample size, and incomplete characterizations of regulatory dynamics, leading to networks with missing and anomalous links. Integration of prior network information (e.g., from pathway databases) has the potential to improve reconstructions. Results: We developed a semi-supervised network reconstruction algorithm that enables the synthesis of information from partially known networks with time-course gene expression data. We adapted PLS-VIP for time-course data and used reference networks to simulate expression data from which null distributions of VIP scores are generated and used to estimate edge probabilities for input expression data. By using simulated dynamics to generate reference distributions, this approach incorporates previously known regulatory relationships and links the network to the dynamics to form a semi-supervised approach that discovers novel and anomalous connections. We applied this approach to data from a sleep deprivation study with KEGG pathways treated as prior networks, as well as to synthetic data from several DREAM challenges, and find that it is able to recover many of the true edges and identify errors in these networks, suggesting its ability to derive posterior networks that accurately reflect gene expression dynamics.
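The step of turning a null distribution of scores into an edge assessment can be sketched generically as an empirical tail probability; this is a simplified stand-in, not the paper's exact estimator.

```python
def empirical_p_value(score, null_scores):
    """Fraction of the null distribution at least as large as the observed
    score; small values suggest the score is unlikely under the null
    generated from reference-network simulations."""
    return sum(1 for s in null_scores if s >= score) / len(null_scores)

# Toy null distribution of VIP-style scores from simulated reference data.
null = [0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 1.1, 1.3, 1.5, 2.0]
print(empirical_p_value(1.4, null))  # 0.2: only 1.5 and 2.0 are >= 1.4
```

An edge whose observed score falls deep in the null's upper tail would then be treated as well supported by the expression dynamics.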
Nguyen K. Phan, Ricardo Morales, Sebastian D. Espriella, Guoning Chen
We present a novel diffusion-based framework for synthesizing 2D vector fields from sparse, coherent inputs (i.e., streamlines) while maintaining physical plausibility. Our method employs a conditional denoising diffusion probabilistic model with classifier-free guidance, enabling progressive reconstruction that preserves both geometric and physical constraints. Experimental results demonstrate our method's ability to synthesize plausible vector fields that adhere to physical laws while maintaining fidelity to sparse input observations, outperforming traditional optimization-based approaches in terms of flexibility and physical consistency.
Hai-Yen Phan Nguyen, Phi-Lan Ly, Duc-Manh Le, Trong-Hop Do
In the context of modern life, particularly in Industry 4.0 within the online space, emotions and moods are frequently conveyed through social media posts. The trend of sharing stories, thoughts, and feelings on these platforms generates a vast and promising data source for Big Data. This creates both a challenge and an opportunity for research in applying technology to develop more automated and accurate methods for detecting stress in social media users. In this study, we developed a real-time system for stress detection in online posts, using the "Dreaddit: A Reddit Dataset for Stress Analysis in Social Media," which comprises 187,444 posts across five different Reddit domains. Each domain contains texts with both stressful and non-stressful content, showcasing various expressions of stress. A labeled dataset of 3,553 lines was created for training. Apache Kafka, PySpark, and AirFlow were utilized to build and deploy the model. Logistic Regression yielded the best results on new streaming data, achieving an accuracy of 69.39% and an F1-score of 68.97.
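A minimal, library-free sketch of the classifier stage is given below. The actual system trains Logistic Regression inside a Kafka/PySpark pipeline; the features, data, and training loop here are invented for illustration only.

```python
import math

def train_logreg(X, y, lr=0.1, epochs=1000):
    """Logistic regression fit by stochastic gradient descent (stdlib only);
    a toy stand-in for the Spark-based training described above."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - yi                       # gradient of the log loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if z > 0 else 0

# Hypothetical "stress" features: (exclamation_count, negative_word_count).
X = [(0, 0), (1, 0), (0, 1), (3, 4), (4, 3), (5, 5)]
y = [0, 0, 0, 1, 1, 1]   # 1 = stressed
w, b = train_logreg(X, y)
print([predict(w, b, xi) for xi in X])
```

In the deployed system the same model would be applied to each incoming post on the stream rather than to a fixed batch.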
Per Kristian Lehre, Phan Trung Hai Nguyen
We perform a rigorous runtime analysis for the Univariate Marginal Distribution Algorithm on the LeadingOnes function, a well-known benchmark function in the theory community of evolutionary computation with a high correlation between decision variables. For a problem instance of size $n$, the currently best known upper bound on the expected runtime is $\mathcal{O}(nλ\log λ+n^2)$ (Dang and Lehre, GECCO 2015), while a lower bound necessary to understand how the algorithm copes with variable dependencies is still missing. Motivated by this, we show that the algorithm requires an $e^{Ω(μ)}$ runtime with high probability and in expectation if the selective pressure is low; otherwise, we obtain a lower bound of $Ω(\frac{nλ}{\log(λ-μ)})$ on the expected runtime. Furthermore, we consider, for the first time, the algorithm on the function under a prior noise model and obtain an $\mathcal{O}(n^2)$ expected runtime for the optimal parameter settings. In the end, our theoretical results are accompanied by empirical findings, not only matching the rigorous analyses but also providing new insights into the behaviour of the algorithm.
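For reference, LeadingOnes and one common prior-noise wrapper (flip a uniformly chosen bit before evaluation with some probability; the paper's exact noise model may differ) can be written as:

```python
import random

def leading_ones(x):
    """LeadingOnes: length of the prefix of ones in bitstring x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count

def with_prior_noise(f, p, rng):
    """Prior noise: with probability p, flip one uniformly chosen bit of
    the input *before* evaluation (the stored solution is unchanged)."""
    def noisy(x):
        if rng.random() < p:
            x = list(x)
            i = rng.randrange(len(x))
            x[i] = 1 - x[i]
        return f(x)
    return noisy

noisy_lo = with_prior_noise(leading_ones, p=0.1, rng=random.Random(0))
print(leading_ones([1, 1, 0, 1]))  # 2
```

Because LeadingOnes rewards only the unbroken prefix, each bit's contribution depends on all bits to its left, which is exactly the variable dependency the analysis targets.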
Michael Freedman, T. Tam Nguyen Phan
We construct Bing houses in all dimensions $n \geq 3$, obtaining non-separating PL immersions of $\mathbb{S}^n \rightarrow\mathbb{R}^{n+1}$.
Duc-Cuong Dang, Per Kristian Lehre, Phan Trung Hai Nguyen
Estimation of Distribution Algorithms (EDAs) are stochastic heuristics that search for optimal solutions by learning and sampling from probabilistic models. Despite their popularity in real-world applications, there is little rigorous understanding of their performance. Even for the Univariate Marginal Distribution Algorithm (UMDA) -- a simple population-based EDA assuming independence between decision variables -- the optimisation time on the linear problem OneMax was until recently undetermined. The incomplete theoretical understanding of EDAs is mainly due to a lack of appropriate analytical tools. We show that the recently developed level-based theorem for non-elitist populations combined with anti-concentration results yield upper bounds on the expected optimisation time of the UMDA. This approach results in the bound $\mathcal{O}(nλ\log λ+n^2)$ on two problems, LeadingOnes and BinVal, for population sizes $λ>μ=Ω(\log n)$, where $μ$ and $λ$ are parameters of the algorithm. We also prove that the UMDA with population sizes $μ\in \mathcal{O}(\sqrt{n}) \cap Ω(\log n)$ optimises OneMax in expected time $\mathcal{O}(λn)$, and for larger population sizes $μ=Ω(\sqrt{n}\log n)$, in expected time $\mathcal{O}(λ\sqrt{n})$. The ease and generality of our arguments suggest that this is a promising approach to derive bounds on the expected optimisation time of EDAs.
Per Kristian Lehre, Phan Trung Hai Nguyen
Unlike traditional evolutionary algorithms which produce offspring via genetic operators, Estimation of Distribution Algorithms (EDAs) sample solutions from probabilistic models which are learned from selected individuals. It is hoped that EDAs may improve optimisation performance on epistatic fitness landscapes by learning variable interactions. However, hardly any rigorous results are available to support claims about the performance of EDAs, even for fitness functions without epistasis. The expected runtime of the Univariate Marginal Distribution Algorithm (UMDA) on OneMax was recently shown to be in $\mathcal{O}\left(nλ\log λ\right)$ by Dang and Lehre (GECCO 2015). Later, Krejca and Witt (FOGA 2017) proved the lower bound $Ω\left(λ\sqrt{n}+n\log n\right)$ via an involved drift analysis. We prove an $\mathcal{O}\left(nλ\right)$ bound, given some restrictions on the population size. This implies the tight bound $Θ\left(n\log n\right)$ when $λ=\mathcal{O}\left(\log n\right)$, matching the runtime of classical EAs. Our analysis uses the level-based theorem and anti-concentration properties of the Poisson-Binomial distribution. We expect that these generic methods will facilitate further analysis of EDAs.
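A minimal UMDA sketch on OneMax is shown below, with the margins standard in runtime analyses; the parameter values are illustrative, not taken from the paper.

```python
import random

def umda_onemax(n, lam, mu, max_gens, seed=0):
    """Minimal UMDA on OneMax with margins [1/n, 1 - 1/n].
    Illustrative only; parameters are not the paper's settings."""
    rng = random.Random(seed)
    p = [0.5] * n                                # univariate marginals
    best = 0
    for _ in range(max_gens):
        pop = [[1 if rng.random() < pi else 0 for pi in p] for _ in range(lam)]
        pop.sort(key=sum, reverse=True)          # OneMax fitness = number of ones
        best = max(best, sum(pop[0]))
        if best == n:
            break
        sel = pop[:mu]                           # truncation selection
        p = [sum(ind[i] for ind in sel) / mu for i in range(n)]
        p = [min(max(pi, 1 / n), 1 - 1 / n) for pi in p]   # clamp to margins
    return best

print(umda_onemax(n=10, lam=40, mu=10, max_gens=300))
```

The margins prevent any marginal probability from fixing at 0 or 1, which is essential for the drift and anti-concentration arguments in the analysis.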
T. Tam Nguyen Phan
We show that if $g$ is a Riemannian metric on a closed piecewise locally symmetric manifold $M$, then the lift of $g$ to the universal cover $\widetilde{M}$ has a discrete isometry group. We also show that the index $[\mathrm{Isom}(\widetilde{M}): π_1(M)]$ is bounded by a constant independent of $g$.
Grigori Avramidi, T. Tam Nguyen Phan
In this paper we show that flat $(m-1)$-dimensional tori give nontrivial rational homology cycles in congruence covers of the locally symmetric space $\mathrm{SL}(m,\mathbb{Z})\backslash \mathrm{SL}(m,\mathbb{R})/\mathrm{SO}(m)$. We also show that the dimension of the subspace of $H_{m-1}(Γ\backslash \mathrm{SL}(m,\mathbb{R})/\mathrm{SO}(m);\mathbb{Q})$ spanned by flat $(m-1)$-tori grows as one goes up in congruence covers.
T. Tam Nguyen Phan
We construct complete, finite volume, 4-dimensional manifolds with sectional curvature $-1<K<0$ whose cusp cross sections are compact solvmanifolds.
Cong-Tinh Dao, Nguyen Minh Thao Phan, Jun-En Ding, Chenwei Wu, David Restrepo, Dongsheng Luo, Fanyi Zhao, Chun-Chieh Liao, Wen-Chih Peng, Chi-Te Wang, Pei-Fu Chen, Ling Chen, Xinglong Ju, Feng Liu, Fang-Ming Hung
Electronic health records (EHRs) are designed to synthesize diverse data types, including unstructured clinical notes, structured lab tests, and time-series visit data. Physicians draw on these multimodal and temporal sources of EHR data to form a comprehensive view of a patient's health, which is crucial for informed therapeutic decision-making. Yet, most predictive models fail to fully capture the interactions, redundancies, and temporal patterns across multiple data modalities, often focusing on a single data type or overlooking these complexities. In this paper, we present CURENet, a multimodal model (Combining Unified Representations for Efficient chronic disease prediction) that integrates unstructured clinical notes, lab tests, and patients' time-series data by utilizing large language models (LLMs) for clinical text processing and textual lab tests, as well as transformer encoders for longitudinal sequential visits. CURENet is capable of capturing the intricate interaction between different forms of clinical data and creating a more reliable predictive model for chronic illnesses. We evaluated CURENet using the public MIMIC-III and private FEMH datasets, where it achieved over 94\% accuracy in predicting the top 10 chronic conditions in a multi-label framework. Our findings highlight the potential of multimodal EHR integration to enhance clinical decision-making and improve patient outcomes.
Per Kristian Lehre, Phan Trung Hai Nguyen
The Population-Based Incremental Learning (PBIL) algorithm uses a convex combination of the current model and the empirical model to construct the next model, which is then sampled to generate offspring. The Univariate Marginal Distribution Algorithm (UMDA) is a special case of the PBIL, where the current model is ignored. Dang and Lehre (GECCO 2015) showed that the UMDA can optimise LeadingOnes efficiently. It remained open whether the PBIL performs equally well. Here, by applying the level-based theorem in addition to the Dvoretzky--Kiefer--Wolfowitz inequality, we show that the PBIL optimises LeadingOnes in expected time $\mathcal{O}(nλ\log λ+ n^2)$ for a population size $λ= Ω(\log n)$, which matches the bound of the UMDA. Finally, we show that the result carries over to BinVal, giving the first runtime result for the PBIL on the BinVal problem.
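The PBIL model update described above is essentially a one-line convex combination; a sketch follows, with margin clamping as commonly used in runtime analyses (the clamping choice is an assumption, not stated in the abstract).

```python
def pbil_update(p, selected, rho, n):
    """One PBIL model update: a convex combination of the current model p
    and the empirical frequencies of the selected individuals, clamped to
    the margins [1/n, 1 - 1/n]. With rho = 1 the current model is ignored,
    recovering the UMDA update."""
    mu = len(selected)
    freq = [sum(ind[i] for ind in selected) / mu for i in range(len(p))]
    new_p = [(1 - rho) * pi + rho * fi for pi, fi in zip(p, freq)]
    return [min(max(pi, 1 / n), 1 - 1 / n) for pi in new_p]

p = [0.5, 0.5, 0.5]
selected = [[1, 1, 0], [1, 0, 0]]
print(pbil_update(p, selected, rho=0.5, n=3))
```

With `rho = 0.5` each new marginal sits halfway between the old model and the selected individuals' frequencies before clamping, which is the smoothing that distinguishes the PBIL from the UMDA.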
T. Tam Nguyen Phan
We use the reflection group trick to glue manifolds with corners that are Borel-Serre compactifications of locally symmetric spaces of noncompact type and obtain aspherical manifolds. We call these \emph{piecewise locally symmetric} manifolds. This class of spaces provides new examples of aspherical manifolds whose fundamental groups have the structure of a complex of groups. These manifolds typically do not admit a locally $\mathrm{CAT}(0)$ metric. We prove that any self homotopy equivalence of such manifolds is homotopic to a homeomorphism. We compute the group of self homotopy equivalences of such a manifold and show that it can contain a normal free abelian subgroup, and thus can be infinite.
Michael Freedman, T. Tam Nguyen Phan
We show that all PL manifolds of dimension $\geq 3$ have spines similar to Bing's house with two rooms. Beyond this we explore approximation rigidity and an $h$-principle.
Per Kristian Lehre, Phan Trung Hai Nguyen
We introduce a new benchmark problem called Deceptive Leading Blocks (DLB) to rigorously study the runtime of the Univariate Marginal Distribution Algorithm (UMDA) in the presence of epistasis and deception. We show that simple Evolutionary Algorithms (EAs) outperform the UMDA unless the selective pressure $μ/λ$ is extremely high, where $μ$ and $λ$ are the parent and offspring population sizes, respectively. More precisely, we show that the UMDA with a parent population size of $μ=Ω(\log n)$ has an expected runtime of $e^{Ω(μ)}$ on the DLB problem assuming any selective pressure $μ/λ \geq 14/1000$, as opposed to the expected runtime of $\mathcal{O}(nλ\log λ+n^3)$ for the non-elitist $(μ,λ)~\text{EA}$ with $μ/λ\leq 1/e$. These results illustrate inherent limitations of univariate EDAs against deception and epistasis, which are common characteristics of real-world problems. In contrast, empirical evidence reveals the efficiency of the bi-variate MIMIC algorithm on the DLB problem. Our results suggest that one should consider EDAs with more complex probabilistic models when optimising problems with some degree of epistasis and deception.
T. Tam Nguyen Phan
We study noncompact, complete, finite volume, negatively curved manifolds $M$. We construct $M$ with infinitely generated fundamental groups in all dimensions $n \geq 2$. We construct $M$ whose cusp cross sections are compact hyperbolic manifolds in all dimensions $n\geq 3$. In contrast, we show that if sectional curvature $-1<K(M)<0$, then cusp cross sections have zero simplicial volume. We construct negatively curved lattices that do not contain any parabolic isometries. We show that there are $M$ such that $\widetilde{M}$ does not satisfy the visibility axiom. We give a condition on the curvature growth versus the volume decay that guarantees topological finiteness. We raise a few questions on finite volume, negatively curved manifolds.