Machine Learning and Big Data-enabled Biotechnology

Brand: Wiley-VCH
Price: 142.99 EUR
Availability: OnlineOnly

Hal S. Alper (Editor)

Wiley-VCH (Publisher)

1st Edition

Published on 15 January 2026

432 pages

E-Book

ePUB with Adobe-DRM

978-3-527-85051-8 (ISBN)

€142.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Part I - From DNA…
1 Deep learning approaches for synthetic biology part design
2 Automated approaches for GSM development from DNA sequence
3 Predictive models from genome sequences
Part II - …to Proteins…
4 De novo protein structure and design tools
5 Machine learning approaches for protein engineering
6 Pathway discovery / Retrobiosynthesis
7 Enzyme functional classifications
8 Proteomics machine learning approaches and de novo identification
Part III - …to whole cells and beyond
9 Machine learning approaches for gene expression
10 Metabolomics big data approaches
11 Use of Generative AI and natural language processing for cell models
12 Metabolic production, strain engineering, and flux design
13 Automated function and learning in biofoundries/strain designs
14 Machine learning predictions of phenotype and bioreactor performance

1
From Genome to Actionable Insights in Biotechnology

James Morrissey, Benjamin Strain, and Cleo Kontoravdi

Imperial College London, Department of Chemical Engineering, London, SW7 2AZ, UK

1.1 Introduction

Systems biology facilitates the understanding of biological processes [1, 2]. It provides a framework for contextualizing and understanding high-throughput data, allowing structured and meaningful insights to be gained. With the increasing generation of high-throughput omics data and usage of machine learning (ML) data analysis tools, it becomes ever more important to develop robust systems biology tools.

One of the key tools in a systems biologist's repertoire is the network. Biological networks represent the flows and interactions between components such as genes, proteins, reactions, and metabolites: nodes correspond to these biological entities, and edges define their interactions, such as biochemical conversion, regulation, and physical binding [3]. Networks improve the interpretation and understanding of complex biological phenomena but, most importantly in the context of big data, they provide structure. This structure can help turn highly complex data from a "black box" into multiscale models with predictive and interpretable capabilities, so that the underlying biological mechanisms remain understandable in big data approaches. This structural framework is particularly valuable in biotechnology, where datasets such as genomics, transcriptomics, proteomics, and metabolomics must be carefully integrated and analyzed. Even if these networks are "smaller" in scale, both in network size and connectivity, than those in fields like image recognition or natural language processing, their biological complexity requires equal, if not greater, interpretative attention [4, 5].
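The node-and-edge view described above can be illustrated with a minimal sketch. The entity names (geneA, enzymeA, R1, tfX) and the interaction labels are hypothetical and chosen only for illustration; a list of typed edges is one of many possible representations.

```python
from collections import defaultdict

# Hypothetical toy network: (source node, target node, interaction type).
EDGES = [
    ("geneA", "enzymeA", "encodes"),
    ("enzymeA", "R1", "catalyzes"),
    ("glucose", "R1", "substrate_of"),
    ("R1", "g6p", "produces"),
    ("tfX", "geneA", "regulates"),
]

def build_adjacency(edge_list):
    """Map each source node to a list of (target, interaction) pairs."""
    adjacency = defaultdict(list)
    for source, target, interaction in edge_list:
        adjacency[source].append((target, interaction))
    return dict(adjacency)

def downstream(adjacency, start):
    """Return every node reachable from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for target, _interaction in adjacency.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

network = build_adjacency(EDGES)
# Entities that a change in the regulator tfX could propagate to.
print(sorted(downstream(network, "tfX")))
```

Keeping the interaction type on each edge lets regulatory, catalytic, and biochemical relationships share a single structure, which is what allows one traversal to cross several network layers.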

By leveraging the structure provided by biological networks, researchers can generate predictions that help elucidate how cellular systems function under various conditions. These predictions, such as flux distributions, regulatory responses, or growth outcomes, offer mechanistic insight into the underlying biology [6]. With this understanding, targeted interventions can be designed, such as modifying gene expression, adjusting nutrient feeds, or engineering metabolic pathways. These interventions can, in turn, improve productivity, robustness, or efficiency in biotechnological applications. Thus, networks serve not only as interpretative tools but also as platforms for driving practical, data-informed decisions.

The most widely utilized type of biological network is the metabolic network [7]. These networks describe the flow (or flux) of metabolites through biochemical reactions within a cellular system. Depending on the application, metabolic networks can range from small, pathway-specific subsystems to genome-scale reconstructions that aim to represent the entirety of an organism's metabolic capabilities [8]. These large-scale reconstructions are referred to as genome-scale metabolic models (GEMs). GEMs provide a computational framework for probing metabolism by predicting intracellular fluxes under defined conditions. By integrating experimental data into a metabolic network, GEMs enable simulation of cellular behavior, guiding hypothesis generation, strain engineering, and process optimization in biotechnology [9, 10]. GEMs also serve as a central scaffold for incorporating other biological networks, enabling the integration of multiomics data and supporting comprehensive, data-driven approaches [11, 12].
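Flux prediction with metabolic networks is commonly framed as a constraint-based problem in which the production and consumption of each internal metabolite must balance. This steady-state assumption is not stated in the text above and is introduced here only for illustration. The sketch below uses a hypothetical three-reaction network and checks whether a candidate flux distribution satisfies that balance; it does not perform any optimization.

```python
# Hypothetical toy stoichiometry: one row per internal metabolite,
# one coefficient per reaction (negative = consumed, positive = produced).
STOICHIOMETRY = {
    "A": {"uptake": 1, "conversion": -1},
    "B": {"conversion": 1, "export": -1},
}

def net_production(stoichiometry, fluxes):
    """Return the net production rate of each metabolite (S times v)."""
    return {
        metabolite: sum(coeff * fluxes.get(rxn, 0.0) for rxn, coeff in row.items())
        for metabolite, row in stoichiometry.items()
    }

def is_steady_state(stoichiometry, fluxes, tol=1e-9):
    """True when no internal metabolite accumulates or is depleted."""
    rates = net_production(stoichiometry, fluxes)
    return all(abs(rate) <= tol for rate in rates.values())

balanced = {"uptake": 10.0, "conversion": 10.0, "export": 10.0}
unbalanced = {"uptake": 10.0, "conversion": 8.0, "export": 8.0}

print(is_steady_state(STOICHIOMETRY, balanced))
print(net_production(STOICHIOMETRY, unbalanced))  # metabolite A accumulates
```

In a genome-scale model the same balance is written over thousands of reactions, and experimental data enter as bounds on individual fluxes.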

In this chapter, we explore how biological networks, particularly metabolic networks, can be constructed from genomic data, validated and refined, transformed into predictive models, and ultimately used to generate actionable insights in biotechnology. We highlight how high-throughput omics and ML tools can be used to enhance interpretability, constrain solution spaces, and improve predictive power to enable biotechnology applications.

1.2 From Genome to Network

Genomic data provides the foundational blueprint for all cellular processes. However, to extract actionable insight, this static information must be translated into dynamic representations of biological function. Networks offer a powerful way to achieve this, linking genes to their roles in metabolism, regulation, signaling, and molecular interactions [7].

From a genome sequence, various biological networks can be reconstructed, such as metabolic networks describing biochemical reactions, gene regulatory networks (GRNs) that capture transcriptional control, signaling networks that map cellular communication, and protein-protein interaction (PPI) networks that reveal the physical interactions within the proteome [3]. Each of these networks offers a different layer of understanding, and together they form a comprehensive systems-level view of the cell.

This section outlines how these networks are constructed from genomic data, with a focus on metabolic networks, either using bottom-up approaches that start from gene annotations and known biological functions or using top-down approaches that integrate omics data to refine or contextualize existing network structures. These reconstructions are the foundation for building models that can interpret data, generate predictions, and guide interventions in biotechnology. Chapter 2 provides detailed information on how network models can be constructed from genomic information using computational algorithms and manual approaches.

1.2.1 Metabolic Networks

Among the various biological networks that can be reconstructed from genomic data, metabolic networks are the most widely applied. Metabolic networks represent the biochemical reactions of the cell, linking metabolites, enzymes, and genes [13]. A network can either cover a subset of biochemical reactions (e.g. only core functions) or be genome-scale. In recent years, GEMs have been created for a multitude of organisms [14], owing to the availability of high-throughput omics data and the growing number of applications of these models in systems biology.

1.2.1.1 Bottom-Up Approaches for Network Reconstruction

The bottom-up approach to GEM creation is lengthy and manual, but it is the recommended way to create high-quality GEMs from scratch. The reader is referred to [8] for an in-depth protocol for GEM reconstruction. In this section, we summarize the key steps as illustrated in Figure 1.1. A GEM reconstruction begins with a genome annotation for the organism of interest. To create a draft metabolic network, metabolic reactions can be extracted from the genome annotation using gene ontology (GO) [16], enzyme commission (EC) numbers [17], and biochemical reaction databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [18] and BRaunschweig ENzyme DAtabase (BRENDA) [19]. Chapters 4 and 5 discuss how protein function can be extracted using these methods as well as other ML frameworks.
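The extraction of a draft network from EC annotations can be sketched as follows. The lookup table is a hand-written stand-in for a reaction database and does not reflect any KEGG or BRENDA interface; the gene identifiers and the placeholder EC number 9.9.9.9 are hypothetical.

```python
# Hand-written stand-in for a reaction database lookup (not a real API).
EC_TO_REACTION = {
    "2.7.1.1": "ATP + D-glucose -> ADP + D-glucose 6-phosphate",
    "5.3.1.9": "D-glucose 6-phosphate <=> D-fructose 6-phosphate",
}

# Hypothetical genome annotation: gene identifier -> annotated EC numbers.
ANNOTATION = {
    "gene_0001": ["2.7.1.1"],
    "gene_0002": ["5.3.1.9"],
    "gene_0003": ["2.7.1.1"],  # second gene carrying the same EC number
    "gene_0004": [],           # no enzymatic annotation
    "gene_0005": ["9.9.9.9"],  # placeholder EC number absent from the lookup
}

def draft_network(annotation, ec_to_reaction):
    """Collect reactions supported by the annotation, with their genes.

    Returns (reactions, unmapped), where `unmapped` lists the
    (gene, EC number) pairs that need manual follow-up.
    """
    reactions, unmapped = {}, []
    for gene, ec_numbers in sorted(annotation.items()):
        for ec in ec_numbers:
            if ec in ec_to_reaction:
                entry = reactions.setdefault(
                    ec, {"equation": ec_to_reaction[ec], "genes": []}
                )
                entry["genes"].append(gene)
            else:
                unmapped.append((gene, ec))
    return reactions, unmapped

reactions, unmapped = draft_network(ANNOTATION, EC_TO_REACTION)
for ec, entry in reactions.items():
    print(ec, entry["equation"], entry["genes"])
print("needs curation:", unmapped)
```

Recording the unmapped pairs explicitly gives the subsequent manual curation a concrete starting list rather than leaving such entries silently dropped.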

This draft network is then subject to manual curation, in which every gene and reaction entry is scrutinized. The entries must be relevant to the organism of interest; this includes, for example, ensuring correct cofactor and substrate specificity for each enzyme, as well as correct gene localization. It is preferable to have literature or experimental data to support the presence and function of genes and reactions. When data is lacking, information from phylogenetically close organisms can be used. During manual curation, a confidence score is useful for recording the amount of information available for each entry. Gene-protein-reaction (GPR) associations indicate which genes are required for a reaction to occur and thereby capture the presence of isozymes and enzyme complexes. The GPRs must be manually refined using databases and literature searches.
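GPR associations are often written as Boolean rules, with AND for subunits of an enzyme complex and OR for isozymes. The sketch below evaluates such a rule against a set of available genes; the nested-tuple encoding and the gene names g1 to g3 are assumptions made for illustration, not a format prescribed by the text.

```python
def gpr_active(rule, present_genes):
    """Evaluate a GPR rule against the set of genes that are present.

    A rule is either a gene name (str) or a tuple whose first element is
    "and" (enzyme complex: all parts required) or "or" (isozymes: any one
    suffices), followed by one or more sub-rules.
    """
    if isinstance(rule, str):
        return rule in present_genes
    operator, *operands = rule
    results = [gpr_active(operand, present_genes) for operand in operands]
    if operator == "and":
        return all(results)
    if operator == "or":
        return any(results)
    raise ValueError(f"unknown GPR operator: {operator!r}")

# Hypothetical rule: a two-subunit complex (g1 AND g2) or an isozyme (g3).
RULE = ("or", ("and", "g1", "g2"), "g3")

print(gpr_active(RULE, {"g1", "g2", "g3"}))  # wild type
print(gpr_active(RULE, {"g1", "g2"}))        # isozyme g3 deleted
print(gpr_active(RULE, {"g1"}))              # complex incomplete, no isozyme
```

Evaluating the rule under different gene sets shows how a curated GPR translates a gene deletion into the presence or absence of a reaction.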

The next step is to ensure correct reaction stoichiometry. Metabolites in databases (and hence the draft reconstruction) are represented with uncharged formulas, but their protonation state varies depending on the pH of the cellular environment. The charged formula is determined from the pKa values of functional groups, which can be predicted using computational tools or taken from the literature, at the pH of the relevant subcellular location. Once the charged formulas are assigned, the correct reaction stoichiometry is established by ensuring mass and charge balance across reactions, incorporating protons and water where necessary. Correct balancing is crucial to avoid artificial energy generation. Correct reaction directionality is also essential: allowing irreversible reactions to run backward would lead to incorrect predictions and thermodynamically infeasible loops (futile cycles), which are discussed in Section 1.3.2. Directionality is determined using existing biochemical data; when this is unavailable, the Gibbs free energy change can be obtained from databases or calculated using group contribution methods [20]. To complete and validate the network, further reactions must be added, as discussed in Section 1.3.1.
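The mass and charge balance check can be sketched with a small example. The hexokinase reaction and the charged formulas below are those commonly used in metabolic reconstructions near neutral pH and are an assumption of this sketch rather than values taken from the text; the short metabolite identifiers are likewise illustrative.

```python
from collections import Counter

# Charged formulas (element counts) and net charges, assumed for ~pH 7.
METABOLITES = {
    "glc": ({"C": 6, "H": 12, "O": 6}, 0),
    "atp": ({"C": 10, "H": 12, "N": 5, "O": 13, "P": 3}, -4),
    "adp": ({"C": 10, "H": 12, "N": 5, "O": 10, "P": 2}, -3),
    "g6p": ({"C": 6, "H": 11, "O": 9, "P": 1}, -2),
    "h": ({"H": 1}, 1),
}

def imbalance(reaction, metabolites):
    """Return (element imbalance, charge imbalance) for a reaction.

    `reaction` maps metabolite id -> stoichiometric coefficient
    (negative for substrates, positive for products). A balanced
    reaction gives an empty dict and a charge imbalance of zero.
    """
    elements = Counter()
    charge = 0
    for metabolite, coeff in reaction.items():
        formula, metabolite_charge = metabolites[metabolite]
        for element, count in formula.items():
            elements[element] += coeff * count
        charge += coeff * metabolite_charge
    return {e: n for e, n in elements.items() if n != 0}, charge

# Draft reaction lacking a proton, and the corrected version.
draft = {"glc": -1, "atp": -1, "adp": 1, "g6p": 1}
corrected = {**draft, "h": 1}

print(imbalance(draft, METABOLITES))      # one H and one charge unit missing
print(imbalance(corrected, METABOLITES))  # balanced
```

The draft reaction is short by exactly one hydrogen and one positive charge on the product side, which is the signature of a missing proton.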

Figure 1.1 Key omics data and steps in the bottom-up construction of a metabolic network from an organism's genome.

Source: Strain et al. [15] / Elsevier / CC BY 4.0.

1.2.1.2 Top-Down Approaches for Network Reconstruction

In contrast to building draft networks...
