Explainable Disease Stage Prediction from NMR Spectra of Blood Serum

for Degree: 
Contact Person: 
Status: 
Available

Abstract

This master's thesis will develop a feature extraction technique together with an explainable predictor model that can accurately predict the stage of a disease from Nuclear Magnetic Resonance (NMR) spectra of blood serum. The motivation is that the concentration of certain proteins is indicative of the disease stage and is reflected in the NMR spectra. The final predictor should yield a simple and explainable prediction process that relies on as few features as possible. This should allow medical experts to understand which indicators are used, and potentially to point out novel indicators that should be investigated more closely.

The thesis will be supervised in collaboration with the Metabolics lab (contact: Alvaro Mallagaray, Lorena Rudolph), which contributes the chemical and medical background as well as the datasets.

Problem Statement

Recent research in metabolics has shown that the concentrations of certain chemical components (glycoproteins and their associated glycosylation profiles) in a blood sample are indicative of the stages of different diseases. These concentrations, as well as other predictive features, are in principle available from inexpensive NMR spectra of a blood sample. However, so far, feature extraction from NMR spectra is done either as a black box or by simply reducing the spectra to a collection of peak signals (lineshapes), which

  • is hardly explainable, i.e., gives doctors and experts little clue about which chemical compound(s) actually drove a prediction, and
  • suffers from high feature dimensionality, making it hardly suitable for medical applications where training data is scarce.

Goals

  • Develop and implement feature extraction techniques for NMR spectra of blood
  • Develop a data-efficient explainable predictor model that can accurately predict the stage of a disease from NMR spectra of blood serum
  • Evaluate the full pipeline with respect to accuracy and explainability:
    • compare the developed feature extraction techniques against baseline techniques
    • determine the explainability-accuracy trade-off

Suggested Approach

The thesis will work with data available from the Metabolics lab, including reference NMR spectra of molecules, as well as datasets of NMR spectra of blood serum labeled with disease stage.

Some direct starting points to improve the existing lineshape-based approaches are:

  1. Different dimensionality reduction techniques: These may not immediately create explainability, but could tackle the feature dimensionality issue, such that a baseline predictor can be created.
  2. Simple modeling of available chemical knowledge: The NMR responses of glycoproteins (glycans attached to proteins) can be estimated in a fairly simple way from the known spectra of the glycans and the proteins. This should make it possible to directly extract information about the concentration of specific molecule parts from the spectra.
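
As an illustration of the first starting point, the following is a minimal sketch of principal component analysis (PCA) applied to a matrix of spectra. It is only one of many possible dimensionality reduction techniques. The function name `pca_reduce` and the synthetic data are assumptions made for this example; they do not come from the lab's code or datasets.

```python
import numpy as np


def pca_reduce(spectra, n_components):
    """Project spectra (n_samples x n_points) onto their top principal components."""
    mean = spectra.mean(axis=0)
    centered = spectra - mean
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, components, explained[:n_components]


# Synthetic stand-in data (assumption): 20 "spectra" with 500 points each,
# driven by 2 latent factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(20, 2))
patterns = rng.normal(size=(2, 500))
spectra = latent @ patterns + 0.01 * rng.normal(size=(20, 500))

scores, components, explained = pca_reduce(spectra, n_components=2)
print(scores.shape)      # one low-dimensional feature vector per spectrum
print(explained.sum())   # fraction of variance kept by the 2 components
```

The rows of `components` are full-length spectral patterns, so they can be plotted against the chemical shift axis, but they are generally not tied to individual compounds, which is why such features do not immediately provide explainability.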

The quality of the extracted features shall be evaluated by how well they allow the stage of a disease to be predicted using explainable models. For the predictor, a small DNN-based approach or the result of an AutoML model search can serve as a non-explainable baseline. Explainable approaches for the predictor can be, e.g.,

  • linear models, or
  • rule-based models (using, e.g., data-efficient inductive logic programming, or using decision trees).
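
To make the linear-model option more concrete, the following is a minimal sketch of a ridge-regularized linear predictor whose weights can be read directly as feature importances. Treating the disease stage as a numeric score, the function name `fit_linear`, the regularization strength, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def fit_linear(features, stages, ridge=1e-3):
    """Fit stages ~ features @ weights + intercept by ridge-regularized least squares."""
    n_samples, n_features = features.shape
    design = np.hstack([features, np.ones((n_samples, 1))])
    penalty = ridge * np.eye(n_features + 1)
    penalty[-1, -1] = 0.0  # do not penalize the intercept
    params = np.linalg.solve(design.T @ design + penalty, design.T @ stages)
    return params[:-1], params[-1]


# Synthetic stand-in data (assumption): 40 samples with 3 extracted features,
# where the stage score depends only on feature 0.
rng = np.random.default_rng(1)
features = rng.normal(size=(40, 3))
stages = 2.0 * features[:, 0] + 1.0 + 0.05 * rng.normal(size=40)

weights, intercept = fit_linear(features, stages)
ranking = np.argsort(-np.abs(weights))  # features ordered by absolute weight
for index in ranking:
    print(f"feature_{index}: weight = {weights[index]:+.3f}")
```

If the features are standardized and correspond to concentrations of specific molecule parts, the ranking printed at the end indicates which indicators the model relies on most.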

Requirements

  • Solid programming skills in Python or MATLAB; the ability to read MATLAB code will be helpful
  • Familiarity with and interest in basic statistics (like correlation analysis), machine learning techniques/models (like gradient descent / linear regression and deep neural networks), and dimensionality reduction techniques (like principal component analysis)
  • Basic understanding of ante-hoc explainability in machine learning
  • Some prior knowledge and experience in computational spectra processing, as well as basic understanding of NMR would be helpful but is not mandatory
  • Interest in contributing to bleeding-edge research on medical diagnostics

Literature

  • Mallagaray, Alvaro, Lorena Rudolph, Melissa Lindloge, Jarne Mölbitz, Henrik Thomsen, Franziska Schmelter, Mohamad Ward Alhabash, et al. 2023. “Towards a Precise NMR Quantification of Acute Phase Inflammation Proteins from Human Serum.” Angewandte Chemie International Edition 62 (35): e202306154. https://doi.org/10.1002/anie.202306154.
  • Rudolph, Lorena, Renia Krellmann, Darko Castven, Lina Jegodzinski, Helena Deriš, Jerko Štambuk, Jarne Mölbitz, et al. 2025. “Fast NMR-Based Assessment of Cancer-Associated Protein Glycosylations from Serum Samples.” Analytical Chemistry 97 (17): 9367–77. https://doi.org/10.1021/acs.analchem.5c00285.
  • Jegodzinski, Lina, Lorena Rudolph, Darko Castven, Friedhelm Sayk, Ashok Kumar Rout, Bandik Föh, Laura Hölzen, et al. 2025. “PNPLA3 I148M Variant Links to Adverse Metabolic Traits in MASLD during Fasting and Feeding.” JHEP Reports, May, 101450. https://doi.org/10.1016/j.jhepr.2025.101450.

 

---

 

Some more technical background

For a deeper understanding of the topic, some more details about the glycosylation profiles are provided below. Knowing them is not a prerequisite for the thesis.

Background

  • Glycans (sugar molecules) can be attached to proteins, and the resulting glycoproteins occur naturally in human blood. In NMR spectra of blood serum, this will result in responses at several different spectral regions, e.g., one for glycans, one for the proteins; and these are separate from, e.g., the one for lipids.
  • The (co-)occurrences / amounts of glycans are assumed to be indicative of specific disease stages (they are modulated in various pathological processes, e.g., produced as part of the reaction to inflammation). The same holds for lipids, but their exact individual spectra are unknown.
  • The NMR spectrum builds up as follows:
    • Spectra of different, non-interacting molecules simply add up as a linear sum, with the weight of each molecule being its proportion in the serum.
    • Proteins, glycans, and lipids each have their own specific region of a spectrum.
    • The spectrum of a glycoprotein is also simply the sum of (1) the spectrum of its protein (approximately unchanged compared to the isolated protein) and (2) the modified glycan spectra.
    • The spectrum response of a glycan when bound to a protein changes due to 2 factors compared to the spectrum response of an isolated glycan without protein:
      • (Known) The protein: The protein dampens the NMR response of the attached glycan. In the time domain, this attenuates the response exponentially according to a factor exp(-Rt), with a known protein- and frequency-specific damping factor R. After Fourier transform, i.e., in the typical NMR spectrum representation, this broadens the glycan peaks, such that individual peaks overlap more.
        Here, too, the positions of the peaks are assumed to remain (approximately) the same.
        -> Code is available to conduct this transformation.
      • (Unknown) The sugars: Since a different part of the molecule is excited during NMR measurement of the attached glycan compared to the isolated glycan, the response at some positions in the spectrum (corresponding to different atoms in the glycan molecule) is reduced, while at others it is emphasized.
        The positions of the peaks, however, are assumed to stay approximately the same; only the intensity is altered. Thus, one should be able to model this simply by multiplication with frequency-specific weights.
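
The spectrum model described above can be sketched in a few lines. The following is a minimal, simplified illustration and not the code mentioned above. It assumes that complex spectra are available as the Fourier transform of the time-domain signal, that a single scalar damping factor can stand in for the frequency-specific R, and that the sampling interval, peak positions, weights, and all function names are hypothetical.

```python
import numpy as np

N_POINTS = 1024
DWELL = 1e-3  # assumed sampling interval of the time-domain signal, in seconds


def synthetic_spectrum(peak_bins, decay_rate):
    """Complex spectrum of decaying oscillations with peaks at the given bins."""
    t = np.arange(N_POINTS) * DWELL
    fid = np.zeros(N_POINTS, dtype=complex)
    for peak_bin in peak_bins:
        freq = peak_bin / (N_POINTS * DWELL)
        fid += np.exp(2j * np.pi * freq * t - decay_rate * t)
    return np.fft.fft(fid)


def dampen(spectrum, rate):
    """Apply exp(-rate * t) in the time domain, which broadens the peaks."""
    t = np.arange(spectrum.size) * DWELL
    fid = np.fft.ifft(spectrum)
    return np.fft.fft(fid * np.exp(-rate * t))


def bound_glycan_spectrum(free_glycan, rate, weights):
    """Protein-induced damping followed by frequency-specific reweighting."""
    return weights * dampen(free_glycan, rate)


def estimate_concentrations(mixture, references):
    """Least-squares fit of mixture = sum_i c_i * references[i] on the real parts."""
    design = np.stack([ref.real for ref in references], axis=1)
    coeffs, *_ = np.linalg.lstsq(design, mixture.real, rcond=None)
    return coeffs


# Hypothetical reference spectra with made-up peak positions.
free_glycan = synthetic_spectrum([100, 140], decay_rate=5.0)
protein = synthetic_spectrum([400], decay_rate=30.0)
other_molecule = synthetic_spectrum([700], decay_rate=10.0)

# Hypothetical weights: the second glycan peak is attenuated when bound.
weights = np.ones(N_POINTS)
weights[130:150] = 0.5

# Glycoprotein = protein spectrum + modified glycan spectrum.
glycoprotein = protein + bound_glycan_spectrum(free_glycan, 20.0, weights)

# Serum = linear sum of non-interacting molecules, weighted by their amounts.
true_amounts = np.array([0.3, 0.7])
serum = true_amounts[0] * glycoprotein + true_amounts[1] * other_molecule

estimated = estimate_concentrations(serum, [glycoprotein, other_molecule])
print(np.round(estimated, 3))
```

In this noise-free toy setting the least-squares fit recovers the mixing weights. With real serum spectra, the unknown glycan weights and the unknown lipid spectra would have to be estimated as well, which is part of the feature extraction problem.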