Scholar iON
Academic Synthesis
The selected body of research explores advanced statistical methodologies and models to address complex biological and ecological questions. Bickel's work focuses on enhancing the reliability of gene network reconstructions by proposing the decisive false discovery rate (dFDR) as a tool for estimating the probabilities of spurious gene associations, which is crucial for understanding gene expression dynamics. Ma extends traditional species-area relationships to diversity-area relationships using Hill numbers, thereby broadening the scope of biodiversity scaling models and verifying the approach with extensive microbiome data. Pascual's dynamic model of HIV infection provides insights into the interactions between the virus and immune system cells, offering a framework for simulating antiviral therapies. Glinsky's study on retroviral LTR elements in human embryos reveals their integration into genomic regulatory networks, highlighting their evolutionary and developmental significance. Collectively, these studies emphasize the importance of sophisticated statistical and computational models in elucidating biological systems and their evolutionary implications.
Motivation: The reconstruction of gene networks from gene expression microarrays is gaining popularity as methods improve and as more data become available. The reliability of such networks could be judged by the probability that a connection between genes is spurious, resulting from chance fluctuations rather than from a true biological relationship.
Results: Unlike the false discovery rate and positive false discovery rate, the decisive false discovery rate (dFDR) is exactly equal to a conditional probability without assuming independence or the randomness of hypothesis truth values. This property is useful not only in the common application to the detection of differential gene expression, but also in determining the probability of a spurious connection in a reconstructed gene network. Estimators of the dFDR can estimate each of three probabilities:
1. The probability that two genes that appear to be associated with each other lack such association.
2. The probability that a time ordering observed for two associated genes is misleading.
3. The probability that a time ordering observed for two genes is misleading, either because they are not associated or because they are associated without a lag in time.
The first probability applies to both static and dynamic gene networks, and the other two only apply to dynamic gene networks.
Availability: Cross-platform software for network reconstruction, probability estimation, and plotting is free from http://www.davidbickel.com as R functions and a Java application.
I extend the traditional species-area relationship (SAR), which has achieved the status of an ecological law and plays a critical role in global biodiversity assessment, to the general (alpha- or beta-diversity in Hill numbers) diversity-area relationship (DAR). The extension was motivated by a limitation of the traditional SAR, which addresses only one aspect of biodiversity scaling, i.e., species richness scaling over space. The extension was made possible by the fact that all Hill numbers are in units of species (referred to as the effective number of species or as species equivalents), and I postulated that Hill numbers should follow the same pattern as the SAR, or a similar one. I selected three DAR models: the traditional power law (PL), PLEC (PL with exponential cutoff) and PLIEC (PL with inverse exponential cutoff). I defined three new concepts and derived their quantifications: (i) DAR profile: z-q series, where z is the PL scaling parameter at different diversity orders (q); (ii) PDO (pair-wise diversity overlap) profile: g-q series, where g is the PDO corresponding to q; (iii) MAD (maximal accrual diversity) profile: Dmax-q series, where Dmax is the MAD corresponding to q. Furthermore, the PDO-g is quantified based on the self-similarity property of the PL model, and Dmax can be estimated from the PLEC parameters. The three profiles constitute a novel DAR approach to biodiversity scaling. I verified the postulation with the American gut microbiome project (AGP) dataset of 1473 healthy North American individuals (the largest human dataset from a single project to date). The PL model was preferred due to its simplicity and established ecological properties such as self-similarity (necessary for establishing the PDO profile), and PLEC has an advantage in establishing the MAD profile. All three profiles for the AGP dataset were successfully quantified and compared with existing SAR parameters in the literature whenever possible.
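The DAR profile described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code. It assumes that "area" is the number of pooled samples, that the PL is fitted by ordinary least squares on log-log axes, and that the PLEC has the form D(A) = c * A^z * exp(d*A), from which the maximum follows by setting the derivative to zero. The function names and the pooling scheme are hypothetical.

```python
import math


def hill_number(abundances, q):
    """Hill number of order q (effective number of species)."""
    total = sum(abundances)
    p = [a / total for a in abundances if a > 0]
    if abs(q - 1.0) < 1e-12:
        # q = 1 is defined as a limit: the exponential of Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


def fit_power_law(areas, diversities):
    """Least-squares fit of ln D = ln c + z ln A; returns (c, z).

    Needs at least two distinct areas.
    """
    xs = [math.log(a) for a in areas]
    ys = [math.log(d) for d in diversities]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    z = sxy / sxx
    return math.exp(my - z * mx), z


def dar_profile(samples, orders):
    """z-q series: fit the PL to Hill numbers of pooled samples 1..k.

    Assumption: `samples` is a list of abundance vectors over the same
    species list, pooled in the order given.
    """
    profile = {}
    for q in orders:
        pooled = [0] * len(samples[0])
        areas, divs = [], []
        for k, sample in enumerate(samples, start=1):
            pooled = [x + y for x, y in zip(pooled, sample)]
            areas.append(k)
            divs.append(hill_number(pooled, q))
        profile[q] = fit_power_law(areas, divs)[1]
    return profile


def plec_max(c, z, d):
    """Maximum of the assumed PLEC form D(A) = c * A**z * exp(d * A).

    With z > 0 and d < 0, dD/dA = 0 gives A_max = -z / d.
    Returns (A_max, D_max).
    """
    if not (z > 0 and d < 0):
        raise ValueError("a finite maximum needs z > 0 and d < 0")
    a_max = -z / d
    return a_max, c * a_max ** z * math.exp(-z)
```

Calling `dar_profile(samples, [0, 1, 2])` returns one scaling exponent z per diversity order q, which is the z-q series; real analyses would typically also randomize the pooling order, which this sketch omits.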
A dynamic model of nonlinear time-dependent ordinary differential equations (ODE) has been applied to the interactions of an HIV infection with the immune system cells. This model has been simplified into two compartments: lymph node and peripheral blood. The model includes CD4 T-lymphocytes in several states (quiescent Q, naive N and activated T), cytotoxic CD8 T-cells, B-cells and dendritic cells. Cytokines and immunoglobulins specific for each antigen (i.e. gp41 or p24) have also been included in the model, modelling the attraction of CD4 T-cells to the infected area and the reduction of virus concentration by immunoglobulins. HIV infection of CD4 T-lymphocytes is modelled in several stages: before fusion as HIV-attached (H), and after fusion as non-permissive / abortively infected (M), permissive / latently infected (L) and permissive / actively infected (I). These equations have been implemented in a C++/Python interface application, called Immune System app, which runs OpenModelica software to solve the ODE system through a 4th order Runge-Kutta numerical approximation. Results of the simulation show that although HIV concentration in both compartments is lower than $10^{-10}$ virus/$\mu L$ after t=2 years, quiescent lymphocytes reach an equilibrium with a concentration lower than the initial conditions, due to the latency state, which serves as a reservoir for virus production over time. In conclusion, this model can provide reliable results under other conditions, such as antiviral therapies.
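For readers unfamiliar with the numerical method, the following is a minimal sketch of a classical 4th order Runge-Kutta integrator in plain Python. It is not the Immune System app and does not reproduce the two-compartment model above. The right-hand side `toy_viral_dynamics` is the standard three-variable target-cell model (uninfected cells, infected cells, free virus), swapped in purely for illustration, and all parameter values are hypothetical.

```python
import math


def rk4_step(f, t, y, h):
    """One classical RK4 step for the system y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [
        yi + h / 6 * (a + 2 * b + 2 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def integrate(f, y0, t0, t1, h):
    """Integrate from t0 to t1 with fixed step h; returns the final state."""
    steps = round((t1 - t0) / h)
    t, y = t0, list(y0)
    for _ in range(steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y


def toy_viral_dynamics(t, y, lam=10.0, d=0.01, beta=2e-5,
                       delta=0.7, p=100.0, c=13.0):
    """Illustrative stand-in model, NOT the model of the abstract.

    y = [T, I, V]: uninfected target cells, infected cells, free virus.
    All rate constants are hypothetical placeholders.
    """
    T, I, V = y
    infection = beta * T * V
    return [lam - d * T - infection, infection - delta * I, p * I - c * V]


if __name__ == "__main__":
    final_state = integrate(toy_viral_dynamics, [1000.0, 0.0, 1e-3], 0.0, 10.0, 0.01)
    print(final_state)
```

A fixed step size keeps the sketch short; a stiff system with many cell states would usually call for an adaptive or implicit solver instead.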
Two distinct families of pan-primate endogenous retroviruses, namely HERVL and HERVH, infected the primate germline, colonized host genomes, and evolved into the global retroviral genomic regulatory dominion (GRD) operating during human embryogenesis (HE). The HE retroviral GRD comprises 8839 highly conserved fixed LTR elements linked to 5444 down-stream target genes forged by evolution into a functionally-consonant constellation of 26 genome-wide multimodular genomic regulatory networks (GRNs), each of which is defined by significant enrichment of numerous single gene ontology (GO)-specific traits. Locations of GRNs appear scattered across chromosomes and occupy from 5.5% to 15.09% of the human genome. Each GRN harbors from 529 to 1486 retroviral LTRs derived from LTR7, MLT2A1, and MLT2A2 sequences that are quantitatively balanced according to their genome-wide abundance. GRNs integrate activities from 199 to 805 down-stream target genes, including transcription factors, chromatin-state remodelers, signal-sensing and signal-transduction mediators, enzymatic and receptor binding effectors, intracellular complexes and extracellular matrix elements, and cell-cell adhesion molecules. GRNs are composed of several hundred to thousands of smaller GO enrichment-defined genomic regulatory modules (GRMs), each combining from a dozen to hundreds of LTRs and down-stream target genes, which appear to operate on the timescale of an individual's life span along specific phenotypic avenues to exert profound effects on patterns of transcription, protein-protein interactions, developmental phenotypes, physiological traits, and pathological conditions of Modern Humans. Overall, this study identifies 69,573 statistically significant retroviral LTR-linked GRMs (Binomial FDR q-value threshold of 0.001), including 27,601 GRMs validated by the single GO-specific directed acyclic graph (DAG) analyses across six GO annotations.
Sequence organization is viewed from two points: one is informational redundancy or informational correlation (IC) and the other is k-mer frequency statistics. Two problems are investigated. The first is how the ICs exceed the fluctuation bound and order emerges from fluctuation in a genome when the sequence length attains some critical value. We demonstrated that the transition from fluctuation to order takes place at a sequence length of about 200-300 thousand bases for the human and E. coli genomes. This means that life emerges from a region between the macroscopic and the microscopic. The second is the statistical law of k-mer organization in a genome under evolutionary pressure and functional selection. We deduced a sum rule Q(k,N) on the k-mer frequency deviations from randomness in an N-long genome sequence and deduced the relations of Q(k,N) with k and N. We found that Q(k,N) increases with length N at a constant rate for most genome sequences and demonstrated that ordering takes place when the functional selection of k-mers accumulates to some critical value. An important finding is that the sum rule correlates with the evolutionary complexity of the genome.
E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for "stratified" multiple hypothesis testing problems, controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which FDRs are greatly underestimated due to weaknesses in random sequence models. lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, motif scanning, and multi-microarray analyses.
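The idea of stratifying by domain family can be illustrated with a short sketch that computes q-values separately within each stratum. The abstract does not describe the paper's own FDR-estimating algorithms, so this example swaps in the standard Benjamini-Hochberg q-value as a stand-in; the record layout `(stratum, p-value)` and both function names are assumptions.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values, returned in the input order.

    q_(i) = min over j >= i of p_(j) * m / j, with p sorted ascending.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the minimum is monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


def stratified_qvalues(records):
    """Compute BH q-values independently inside each stratum.

    `records` is a list of (stratum, pvalue) pairs; the result is a list
    of q-values aligned with `records`.
    """
    by_stratum = {}
    for idx, (stratum, _) in enumerate(records):
        by_stratum.setdefault(stratum, []).append(idx)
    out = [None] * len(records)
    for idxs in by_stratum.values():
        qs = bh_qvalues([records[i][1] for i in idxs])
        for i, qv in zip(idxs, qs):
            out[i] = qv
    return out
```

Applying one threshold to the output of `stratified_qvalues` means the multiple-testing correction inside a family depends only on that family's own tests, which is the stratification the abstract argues for.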
DNA read mapping is a ubiquitous task in bioinformatics, and many tools have been developed to solve the read mapping problem. However, two trends are changing the landscape of read mapping. First, new sequencing technologies provide very long reads with high error rates (up to 15%). Second, many genetic variants in the population are known, so the reference genome is no longer considered a single string over ACGT, but a complex object containing these variants. Most existing read mappers do not handle these new circumstances appropriately.
We introduce a new read mapper prototype called VATRAM that considers variants. It is based on Min-Hashing of q-gram sets of reference genome windows. Min-Hashing is one form of locality-sensitive hashing. The variants are directly inserted into VATRAM's index, which leads to a fast mapping process. Our results show that VATRAM achieves better precision and recall than state-of-the-art read mappers like BWA under certain circumstances. VATRAM is open source and can be accessed at https://bitbucket.org/Quedenfeld/vatram-src/.
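The core idea of Min-Hashing q-gram sets can be sketched as follows. This is a generic illustration of the technique and not VATRAM's implementation: it ignores variants and the windowed index entirely, and the seeded BLAKE2b hash family and all function names are assumptions made for the example.

```python
import hashlib


def qgrams(seq, q):
    """Set of all length-q substrings of seq."""
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}


def _seeded_hash(gram, seed):
    """Deterministic 64-bit hash of a q-gram under a given seed."""
    data = f"{seed}:{gram}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(seq, q, num_hashes):
    """Signature: for each seeded hash function, the minimum over all q-grams."""
    grams = qgrams(seq, q)
    if not grams:
        raise ValueError("sequence is shorter than q")
    return [min(_seeded_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two sequences whose q-gram sets overlap heavily agree in many signature positions, so a read can be matched to candidate reference windows by comparing short signatures instead of full q-gram sets.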
One of the major successes in computational biology has been the unification, using the graphical model formalism, of a multitude of algorithms for annotating and comparing biological sequences. Graphical models that have been applied towards these problems include hidden Markov models for annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single algorithm, the sum-product algorithm, solves many of the inference problems associated with different statistical models. This paper introduces the polytope propagation algorithm for computing the Newton polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference calculations for graphical models.
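To make the "single algorithm, many problems" point concrete, the sketch below writes the forward recursion for a hidden Markov model over an abstract pair of operations. With ordinary addition and multiplication it is the sum-product algorithm; replacing addition with `max` gives the probability of the best state path, the quantity behind maximum a posteriori inference. Polytope propagation itself (which would use convex hull of unions and Minkowski sums as the two operations) is not implemented here, and the function name and argument layout are assumptions.

```python
import operator
from functools import reduce


def forward(obs, init, trans, emit, add, mul):
    """Forward recursion for an HMM over a generic semiring (add, mul).

    init[s]      : initial weight of state s
    trans[r][s]  : weight of moving from state r to state s
    emit[s][o]   : weight of emitting symbol o in state s
    """
    states = range(len(init))
    # values[s] aggregates all state paths that end in state s.
    values = [mul(init[s], emit[s][obs[0]]) for s in states]
    for o in obs[1:]:
        values = [
            mul(reduce(add, (mul(values[r], trans[r][s]) for r in states)),
                emit[s][o])
            for s in states
        ]
    return reduce(add, values)


def likelihood(obs, init, trans, emit):
    """Sum-product: total probability of the observation."""
    return forward(obs, init, trans, emit, operator.add, operator.mul)


def best_path_probability(obs, init, trans, emit):
    """Max-product: probability of the single most likely state path."""
    return forward(obs, init, trans, emit, max, operator.mul)
```

The recursion never inspects what `add` and `mul` do, which is why the same code skeleton can serve different inference problems by changing only the two operations.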
We determine the Renyi entropies K_q of symbol sequences generated by human chromosomes. These exhibit nontrivial behaviour as a function of the scanning parameter q. In the thermodynamic formalism, there are phase transition-like phenomena close to the q=1 region. We develop a theoretical model for this based on the superposition of two multifractal sets, which can be associated with the different statistical properties of coding and non-coding DNA sequences. This model is in good agreement with the human chromosome data.
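As a small illustration of the quantity involved, the sketch below computes the Rényi entropy of order q for the empirical distribution of length-n words in a symbol sequence. Dividing the result by n gives only a finite-length estimate of an entropy rate; how the abstract's K_q is actually extracted from chromosome data is not stated there, so the word-counting scheme and the function name are assumptions.

```python
import math
from collections import Counter


def renyi_block_entropy(seq, q, n=1):
    """Rényi entropy (in nats) of order q for overlapping length-n words.

    H_q = ln(sum_w p_w**q) / (1 - q), with the Shannon entropy as the
    q -> 1 limit.
    """
    words = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    counts = Counter(words)
    total = len(words)
    probs = [c / total for c in counts.values()]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** q for p in probs)) / (1.0 - q)
```

Scanning q from negative to positive values shifts the weight from rare words to frequent ones, which is why q acts as a scanning parameter.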
Optimizing the properties of molecules (materials or drugs) for greater toughness, lower toxicity, or better bioavailability has been a long-standing challenge. In this context, we propose a molecular optimization framework called Q-Drug (Quantum-inspired optimization algorithm for Drugs) that leverages quantum-inspired algorithms to optimize molecules over discrete binary variables. The framework begins by encoding the molecules into binary embeddings using a discrete VAE. The binary embeddings are then used to construct an Ising energy-like objective function, over which a state-of-the-art quantum-inspired optimization algorithm is applied to find the optima. The binary embeddings corresponding to the optima are decoded to obtain the optimized molecules. We have tested the framework for optimizing drug molecule properties and have found that it outperforms other molecular optimization methods, finding molecules with better properties in 1/20th to 1/10th of the time previously required. The framework can also be deployed directly on various quantum computing equipment, such as laser-pulse CIMs, FPGA Ising machines, and quantum computers based on quantum annealing, among others. Our work demonstrates a new paradigm that leverages the advantages of quantum computing and AI to solve practically useful problems.
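The optimization step can be pictured with a small sketch that minimizes a quadratic energy over binary vectors. This is not Q-Drug: the abstract does not name its quantum-inspired optimizer, so plain simulated annealing is used here as a stand-in, the energy is assumed to be a QUBO form x^T Q x, and the coefficient matrix `Q` is a hypothetical placeholder for whatever objective would be built from the binary embeddings. Encoding and decoding with a discrete VAE are out of scope.

```python
import math
import random


def energy(x, Q):
    """Quadratic (Ising-like / QUBO) energy x^T Q x for a 0/1 vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def anneal(Q, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Simulated annealing with single-bit flips; returns (best_x, best_energy).

    Stand-in optimizer for illustration only; it gives no optimality guarantee.
    """
    rng = random.Random(seed)
    n = len(Q)
    x = [rng.randint(0, 1) for _ in range(n)]
    e = energy(x, Q)
    best_x, best_e = x[:], e
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.randrange(n)
        x[i] ^= 1  # propose flipping one bit
        new_e = energy(x, Q)
        if new_e <= e or rng.random() < math.exp((e - new_e) / temp):
            e = new_e
            if e < best_e:
                best_x, best_e = x[:], e
        else:
            x[i] ^= 1  # reject: undo the flip
    return best_x, best_e
```

In a pipeline of the kind the abstract describes, the returned bit vector would then be passed to the decoder to obtain a candidate molecule.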