Scholar iON
Academic Synthesis
The selected works span methodological advances in machine learning and neighbouring areas, with an emphasis on data scarcity, covariate shift, and representation learning. Klarner et al. (2023) introduce Q-SAVI, a probabilistic model that encodes domain-informed priors over functions to improve drug discovery under covariate shift, and report gains in predictive accuracy and calibration over self-supervised pre-training and domain adaptation baselines. Burel (2008) studies proof lengths in deduction modulo, showing that proofs of higher-order arithmetic can be linearly translated into first-order arithmetic modulo a rewrite system whose congruence is decidable in polynomial time. Cilibrasi and Vitanyi (2006) survey compression-based, parameter-free similarity measures and give universal distances for data mining and pattern recognition. Lastly, Liu et al. (2025) address limitations of generative recommenders with the DECOR framework, which refines token embeddings using contextual and collaborative signals and improves recommendation performance on three real-world datasets. Collectively, this research highlights the value of integrating domain-specific knowledge, efficient computation, and stronger representation learning.
Accelerating the discovery of novel and more effective therapeutics is an important pharmaceutical problem in which deep learning is playing an increasingly significant role. However, real-world drug discovery tasks are often characterized by a scarcity of labeled data and significant covariate shift, a setting that poses a challenge to standard deep learning methods. In this paper, we present Q-SAVI, a probabilistic model able to address these challenges by encoding explicit prior knowledge of the data-generating process into a prior distribution over functions, presenting researchers with a transparent and probabilistically principled way to encode data-driven modeling preferences. Building on a novel, gold-standard bioactivity dataset that facilitates a meaningful comparison of models in an extrapolative regime, we explore different approaches to induce data shift and construct a challenging evaluation setup. We then demonstrate that using Q-SAVI to integrate contextualized prior knowledge of drug-like chemical space into the modeling process affords substantial gains in predictive accuracy and calibration, outperforming a broad range of state-of-the-art self-supervised pre-training and domain adaptation techniques.
In deduction modulo, a theory is not represented by a set of axioms but by a congruence on propositions modulo which the inference rules of standard deductive systems, such as natural deduction, are applied. Therefore, the reasoning that is intrinsic to the theory does not appear in the length of proofs. In general, the congruence is defined through a rewrite system over terms and propositions. We define a rigorous framework to study proof lengths in deduction modulo, where the congruence must be computed in polynomial time. We show that even very simple rewrite systems lead to arbitrary proof-length speed-ups in deduction modulo, compared to using axioms. As higher-order logic can be encoded as a first-order theory in deduction modulo, we also study how to reinterpret, thanks to deduction modulo, the speed-ups between higher-order and first-order arithmetics that were stated by Gödel. We define a first-order rewrite system with a congruence decidable in polynomial time such that proofs of higher-order arithmetic can be linearly translated into first-order arithmetic modulo that system. We also present the whole higher-order arithmetic as a first-order system without resorting to any axiom, where proofs have the same length as in the axiomatic presentation.
We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like "red" or "christianity." For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
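As a rough illustration of the two kinds of distance described above, the sketch below computes a compression-based distance between literal objects and a page-count-based distance between names. It assumes the commonly cited normalized compression distance and normalized Google distance formulas; the abstract itself does not spell them out. The choice of zlib as the compressor, the function names, and the sample page counts are all hypothetical and used only for illustration.

```python
import math
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ngd(f_x: int, f_y: int, f_xy: int, n: int) -> float:
    """Normalized distance from page counts.

    f_x, f_y: number of pages containing each term,
    f_xy: number of pages containing both terms,
    n: total number of indexed pages (all supplied by the caller).
    """
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n) - min(log_x, log_y))


# Hypothetical sample data: a repetitive English text and an unrelated byte pattern.
text = b"the quick brown fox jumps over the lazy dog " * 50
other = bytes(range(256)) * 10

print("NCD(text, text) :", ncd(text, text))
print("NCD(text, other):", ncd(text, other))
# Made-up page counts, chosen only to exercise the formula.
print("NGD example     :", ngd(1000, 2000, 100, 10**6))
```

With a real compressor the distance of an object to itself is small but generally not exactly zero, which is one reason the abstract speaks of universality only "up to a certain precision".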
Recent advances in generative recommenders adopt a two-stage paradigm: items are first tokenized into semantic IDs using a pretrained tokenizer, and then large language models (LLMs) are trained to generate the next item via sequence-to-sequence modeling. However, these two stages are optimized for different objectives: semantic reconstruction during tokenizer pretraining versus user interaction modeling during recommender training. This objective misalignment leads to two key limitations: (i) suboptimal static tokenization, where fixed token assignments fail to reflect diverse usage contexts; and (ii) discarded pretrained semantics, where pretrained knowledge - typically from language model embeddings - is overwritten during recommender training on user interactions. To address these limitations, we propose to learn $\underline{DE}$composed $\underline{CO}$ntextual Token $\underline{R}$epresentations (DECOR), a unified framework that preserves pretrained semantics while enhancing the adaptability of token embeddings. DECOR introduces contextualized token composition to refine token embeddings based on user interaction context, and decomposed embedding fusion that integrates pretrained codebook embeddings with newly learned collaborative embeddings. Experiments on three real-world datasets demonstrate that DECOR consistently outperforms state-of-the-art baselines in recommendation performance.
This work first presents our attempts to establish an automated model using state-of-the-art approaches for analysing bias in search results of Bing and Google. Experimental results indicate that the current class-wise F1-scores of our best model are not sufficient to establish an automated model for bias analysis. Thus, we decided not to continue with this approach.
We present a minimal model to study liquid phase separation in a fixed pH ensemble. The model describes a mixture composed of macromolecules that exist in three different charge states and have a tendency to phase separate. We introduce the pH dependence of phase separation by means of a set of reactions describing the protonation and deprotonation of macromolecules, as well as the self-ionisation of water. We use conservation laws to identify the conjugate thermodynamic variables at chemical equilibrium. Using these conjugate thermodynamic variables we perform a Legendre transform which defines the corresponding free energy at fixed pH. We first study the possible phase diagram topologies at the isoelectric point of the macromolecules. We then show how the phase behavior depends on pH by moving away from the isoelectric point. We find that phase diagrams as a function of pH strongly depend on whether oppositely charged macromolecules or neutral macromolecules have a stronger tendency to phase separate. We predict the existence of reentrant behavior as a function of pH. In addition, our model also predicts that the region of phase separation is typically broader at the isoelectric point. This model could account both for the protein separation observed in yeast cells for pH values close to the isoelectric point of many cytosolic proteins and for the in vitro experiments of single proteins exhibiting phase separation as a function of pH.
This study reveals the essence of ligand recognition mechanisms by which calmodulin (CaM) controls a variety of Ca2+ signaling processes. We study eight forms of calcium-loaded CaM, each with distinct conformational states. Reducing the structure to two degrees of freedom conveniently describes the main features of conformational changes of CaM via simultaneous twist-bend motions of the two lobes. We utilize the perturbation-response scanning (PRS) technique, coupled with molecular dynamics simulations, to analyze conformational preferences of calcium-loaded CaM, initially in extended form. PRS consists of the sequential application of directed forces on residues followed by recording the resulting coordinates. We show that manipulation of a single residue, E31, located in one of the EF hand motifs, reproduces structural changes to compact forms, and that the flexible linker acts as a transducer of binding information to distant parts of the protein. Independently, using four different pKa calculation strategies, we find E31 to be the charged residue (out of 52) whose ionization state is most sensitive to subtle pH variations in the physiological range. It is proposed that at relatively low pH, the CaM structure is less flexible. By gaining charged states at specific sites at a pH value around 7, local conformational changes in the protein will lead to shifts in the energy landscape, paving the way to other conformational states. These findings are in accordance with FRET-measured shifts in conformational distributions towards more compact forms with decreased pH. They also corroborate mutational studies and proteolysis results which point to the significant role of E31 in CaM dynamics.
Many networked systems require a central authority to enforce a global configuration against local peer influence. We study influence dynamics on finite weighted directed graphs with a distinguished hub node and binary vertex states ('Glory' or 'Gnash'). We give a sharp, local, and efficiently checkable criterion that guarantees global convergence to Glory in a single synchronous update from any initial state: at each non-hub vertex, the incoming weight from the hub must at least match the total incoming weight from all other nodes. Specialising to uniform hub broadcasts, the exact threshold equals the maximum non-hub incoming weight over all vertices, and we prove this threshold is tight. We extend the result to a tau-biased update rule and to asynchronous (Gauss-Seidel) schedules, where a single pass still suffices under the same domination hypothesis. Machine-checked proofs in Coq accompany all theorems.
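The domination criterion above is local, so it can be checked vertex by vertex. The following is a minimal sketch under stated assumptions: the abstract does not give the update rule, so the code assumes a weighted-majority rule with ties resolved toward Glory and a hub that always broadcasts Glory. The graph encoding (`weights[v][u]` as the weight of edge u -> v) and all function names are hypothetical.

```python
from itertools import product

GLORY, GNASH = 1, 0


def hub_dominates(weights: dict, hub: str) -> bool:
    """Check that, at every non-hub vertex, hub weight >= total weight from all other nodes."""
    for v, incoming in weights.items():
        if v == hub:
            continue
        from_hub = incoming.get(hub, 0.0)
        from_others = sum(w for u, w in incoming.items() if u != hub)
        if from_hub < from_others:
            return False
    return True


def synchronous_step(weights: dict, hub: str, state: dict) -> dict:
    """One synchronous update under the ASSUMED rule: weighted majority, ties go to Glory."""
    new_state = {hub: GLORY}  # assumption: the hub always holds Glory
    for v, incoming in weights.items():
        if v == hub:
            continue
        glory = sum(w for u, w in incoming.items()
                    if u == hub or state[u] == GLORY)
        gnash = sum(incoming.values()) - glory
        new_state[v] = GLORY if glory >= gnash else GNASH
    return new_state


def converges_in_one_step(weights: dict, hub: str) -> bool:
    """Brute-force check over every initial state of the non-hub vertices."""
    others = [v for v in weights if v != hub]
    for bits in product([GLORY, GNASH], repeat=len(others)):
        state = dict(zip(others, bits))
        state[hub] = GLORY
        after = synchronous_step(weights, hub, state)
        if any(after[v] != GLORY for v in weights):
            return False
    return True


# Hypothetical three-node example: hub "h" and two peers "a" and "b".
dominated = {"h": {}, "a": {"h": 2.0, "b": 1.5}, "b": {"h": 1.0, "a": 1.0}}
undominated = {"h": {}, "a": {"h": 1.0, "b": 1.5}, "b": {"h": 1.0, "a": 1.0}}

print(hub_dominates(dominated, "h"), converges_in_one_step(dominated, "h"))
print(hub_dominates(undominated, "h"), converges_in_one_step(undominated, "h"))
```

The brute-force check is exponential in the number of vertices and is only there to compare against the criterion on tiny graphs; the criterion itself needs a single pass over the incoming edges.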
An important tool for proving safety of dynamical systems is the notion of a barrier certificate. In this paper we prove that every robustly safe ordinary differential equation has a barrier certificate. Moreover, we show a construction of such a barrier certificate based on a set of states that is reachable in finite time.
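For orientation, one common textbook formulation of a barrier certificate is sketched below; it is not necessarily the definition used in the paper. The symbols are assumptions introduced here: the system is $\dot{x} = f(x)$, $X_0$ is the set of initial states, $X_u$ is the set of unsafe states, and $B$ is a differentiable real-valued function.

```latex
\[
  B(x) \le 0 \quad \text{for all } x \in X_0, \qquad
  B(x) > 0 \quad \text{for all } x \in X_u, \qquad
  \nabla B(x) \cdot f(x) < 0 \quad \text{whenever } B(x) = 0 .
\]
```

Under these conditions no trajectory starting in $X_0$ can cross the level set $B(x) = 0$, so it never reaches $X_u$.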
In the quiet backwaters of cs.CV, cs.LG and stat.ML, a cornucopia of new learning systems is emerging from a primordial soup of mathematics: learning systems with no need for external supervision. To date, little thought has been given to how these self-supervised learners have sprung into being or the principles that govern their continuing diversification. After a period of deliberate study and dispassionate judgement during which each author set their Zoom virtual background to a separate Galapagos island, we now entertain no doubt that each of these learning machines is a lineal descendant of some older and generally extinct species. We make five contributions: (1) We gather and catalogue row-major arrays of machine learning specimens, each exhibiting heritable discriminative features; (2) We document a mutation mechanism by which almost imperceptible changes are introduced to the genotype of new systems, but their phenotype (birdsong in the form of tweets and vestigial plumage such as press releases) communicates dramatic changes; (3) We propose a unifying theory of self-supervised machine evolution and compare to other unifying theories on standard unifying theory benchmarks, where we establish a new (and unifying) state of the art; (4) We discuss the importance of digital biodiversity, in light of the endearingly optimistic Paris Agreement.