List of Contributors xv
Preface xvii
About the Companion Website xix
Part 1 Chemical Databases 1
1 Data Curation 3 Gilles Marcou and Alexandre Varnek
Theoretical Background 3
Software 5
Step-by-Step Instructions 7
Conclusion 34
References 36
2 Relational Chemical Databases: Creation, Management, and Usage 37 Gilles Marcou and Alexandre Varnek
Theoretical Background 37
Step-by-Step Instructions 41
Conclusion 65
References 65
3 Handling of Markush Structures 67 Timur Madzhidov, Ramil Nugmanov, and Alexandre Varnek
Theoretical Background 67
Step-by-Step Instructions 68
Conclusion 73
References 73
4 Processing of SMILES, InChI, and Hashed Fingerprints 75 João Montargil Aires de Sousa
Theoretical Background 75
Algorithms 76
Step-by-Step Instructions 78
Conclusion 80
References 81
Part 2 Library Design 83
5 Design of Diverse and Focused Compound Libraries 85 Antonio de la Vega de Leon, Eugen Lounkine, Martin Vogt, and Jürgen Bajorath
Introduction 85
Data Acquisition 86
Implementation 86
Compound Library Creation 87
Compound Library Analysis 90
Normalization of Descriptor Values 91
Visualizing Descriptor Distributions 92
Decorrelation and Dimension Reduction 94
Partitioning and Diverse Subset Calculation 95
Partitioning 95
Diverse Subset Selection 97
Combinatorial Libraries 98
Combinatorial Enumeration of Compounds 98
Retrosynthetic Approaches to Library Design 99
References 101
Part 3 Data Analysis and Visualization 103
6 Hierarchical Clustering in R 105 Martin Vogt and Jürgen Bajorath
Theoretical Background 105
Algorithms 106
Instructions 107
Hierarchical Clustering Using Fingerprints 108
Hierarchical Clustering Using Descriptors 111
Visualization of the Data Sets 113
Alternative Clustering Methods 116
Conclusion 117
References 118
7 Data Visualization and Analysis Using Kohonen Self-Organizing Maps 119 João Montargil Aires de Sousa
Theoretical Background 119
Algorithms 120
Instructions 121
Conclusion 126
References 126
Part 4 Obtaining and Validation of QSAR/QSPR Models 127
8 Descriptors Generation Using the CDK Toolkit and Web Services 129 João Montargil Aires de Sousa
Theoretical Background 129
Algorithms 130
Step-by-Step Instructions 131
Conclusion 133
References 134
9 QSPR Models on Fragment Descriptors 135 Vitaly Solov'ev and Alexandre Varnek
Abbreviations 135
Data 136
ISIDA_QSPR Input 137
Data Split Into Training and Test Sets 139
Substructure Molecular Fragment (SMF) Descriptors 139
Regression Equations 142
Forward and Backward Stepwise Variable Selection 142
Parameters of Internal Model Validation 143
Applicability Domain (AD) of the Model 143
Storage and Retrieval of Modeling Results 144
Analysis of Modeling Results 144
Root-Mean Squared Error (RMSE) Estimation 148
Setting the Parameters 151
Analysis of n-Fold Cross-Validation Results 151
Loading Structure-Data File 153
Descriptors and Fitting Equation 154
Variable Selection 155
Consensus Model 155
Model Applicability Domain 155
n-Fold External Cross-Validation 155
Saving and Loading of the Consensus Modeling Results 155
Statistical Parameters of the Consensus Model 156
Consensus Model Performance as a Function of Individual Models Acceptance Threshold 157
Building Consensus Model on the Entire Data Set 158
Loading Input Data 159
Loading Selected Models and Choosing their Applicability Domain 160
Reporting Predicted Values 160
Analysis of the Fragments Contributions 161
References 161
10 Cross-Validation and the Variable Selection Bias 163 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 163
Step-by-Step Instructions 165
Conclusion 172
References 173
11 Classification Models 175 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 176
Algorithms 178
Step-by-Step Instructions 180
Conclusion 191
References 192
12 Regression Models 193 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 194
Step-by-Step Instructions 197
Conclusion 207
References 208
13 Benchmarking Machine-Learning Methods 209 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 209
Step-by-Step Instructions 210
Conclusion 222
References 222
14 Compound Classification Using the scikit-learn Library 223 Jenny Balfer, Jürgen Bajorath, and Martin Vogt
Theoretical Background 224
Algorithms 225
Step-by-Step Instructions 230
Naïve Bayes 230
Decision Tree 231
Support Vector Machine 234
Notes on Provided Code 237
Conclusion 238
References 239
Part 5 Ensemble Modeling 241
15 Bagging and Boosting of Classification Models 243 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 243
Algorithm 244
Step-by-Step Instructions 245
Conclusion 247
References 247
16 Bagging and Boosting of Regression Models 249 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 249
Algorithm 249
Step-by-Step Instructions 250
Conclusion 255
References 255
17 Instability of Interpretable Rules 257 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 257
Algorithm 258
Step-by-Step Instructions 258
Conclusion 261
References 261
18 Random Subspaces and Random Forest 263 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 264
Algorithm 264
Step-by-Step Instructions 265
Conclusion 269
References 269
19 Stacking 271 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 271
Algorithm 272
Step-by-Step Instructions 273
Conclusion 277
References 278
Part 6 3D Pharmacophore Modeling 279
20 3D Pharmacophore Modeling Techniques in Computer-Aided Molecular Design Using LigandScout 281 Thomas Seidel, Sharon D. Bryant, Gökhan Ibis, Giulio Poli, and Thierry Langer
Introduction 281
Theory: 3D Pharmacophores 283
Representation of Pharmacophore Models 283
Hydrogen-Bonding Interactions 285
Hydrophobic Interactions 285
Aromatic and Cation-π Interactions 286
Ionic Interactions 286
Metal Complexation 286
Ligand Shape Constraints 287
Pharmacophore Modeling 288
Manual Pharmacophore Construction 288
Structure-Based Pharmacophore Models 289
Ligand-Based Pharmacophore Models 289
3D Pharmacophore-Based Virtual Screening 291
3D Pharmacophore Creation 291
Annotated Database Creation 291
Virtual Screening-Database Searching 292
Hit-List Analysis 292
Tutorial: Creating 3D-Pharmacophore Models Using LigandScout 294
Creating Structure-Based Pharmacophores From a Ligand-Protein Complex 294
Description: Create a Structure-Based Pharmacophore Model 296
Create a Shared Feature Pharmacophore Model From Multiple Ligand-Protein Complexes 296
Description: Create a Shared Feature Pharmacophore and Align it to Ligands 297
Create Ligand-Based Pharmacophore Models 298
Description: Ligand-Based Pharmacophore Model Creation 300
Tutorial: Pharmacophore-Based Virtual Screening Using LigandScout 301
Virtual Screening, Model Editing, and Viewing Hits in the Target Active Site 301
Description: Virtual Screening and Pharmacophore Model Editing 302
Analyzing Screening Results with Respect to the Binding Site 303
Description: Analyzing Hits in the Active Site Using LigandScout 305
Parallel Virtual Screening of Multiple Databases Using LigandScout 305
Virtual Screening in the Screening Perspective of LigandScout 306
Description: Virtual Screening Using LigandScout 306
Conclusions 307
Acknowledgments 307
References 307
Part 7 The Protein 3D-Structures in Virtual Screening 311
21 The Protein 3D-Structures in Virtual Screening 313 Inna Slynko and Esther Kellenberger
Introduction 313
Description of the Example Case 314
Thrombin and Blood Coagulation 314
Active Thrombin and Inactive Prothrombin 314
Thrombin as a Drug Target 314
Thrombin Three-Dimensional Structure: The 1OYT PDB File 315
Modeling Suite 315
Overall Description of the Input Data Available on the Editor Website 315
Exercise 1: Protein Analysis and Preparation 316
Step 1: Identification of Molecules Described in the 1OYT PDB File 316
Step 2: Protein Quality Analysis of the Thrombin/Inhibitor PDB Complex Using MOE Geometry Utility 320
Step 3: Preparation of the Protein for Drug Design Applications 321
Step 4: Description of the Protein-Ligand Binding Mode 325
Step 5: Detection of Protein Cavities 328
Exercise 2: Retrospective Virtual Screening Using the Pharmacophore Approach 330
Step 1: Description of the Test Library 332
Step 2.1: Pharmacophore Design, Overview 333
Step 2.2: Pharmacophore Design, Flexible Alignment of Three Thrombin Inhibitors 334
Step 2.3: Pharmacophore Design, Query Generation 335
Step 3: Pharmacophore Search 337
Exercise 3: Retrospective Virtual Screening Using the Docking Approach 341
Step 1: Description of the Test Library 341
Step 2: Preparation of the Input 341
Step 3: Re-Docking of the Crystallographic Ligand 341
Step 4: Virtual Screening of a Database 345
General Conclusion 350
References 351
Part 8 Protein-Ligand Docking 353
22 Protein-Ligand Docking 355 Inna Slynko, Didier Rognan, and Esther Kellenberger
Introduction 355
Description of the Example Case 356
Methods 356
Ligand Preparation 359
Protein Preparation 359
Docking Parameters 360
Description of Input Data Available on the Editor Website 360
Exercises 362
A Quick Start with LeadIT 362
Re-Docking of Tacrine into AChE 362
Preparation of AChE From 1ACJ PDB File 362
Docking of Neutral Tacrine, then of Positively Charged Tacrine 363
Docking of Positively Charged Tacrine in AChE in Presence of Water 365
Cross-Docking of Tacrine-Pyridone and Donepezil Into AChE 366
Preparation of AChE From 1ACJ PDB File 366
Cross-Docking of Tacrine-Pyridone Inhibitor and Donepezil in AChE in Presence of Water 367
Re-Docking of Donepezil in AChE in Presence of Water 370
General Conclusions 372
Annex: Screen Captures of LeadIT Graphical Interface 372
References 375
Part 9 Pharmacophorical Profiling Using Shape Analysis 377
23 Pharmacophorical Profiling Using Shape Analysis 379 Jérémy Desaphy, Guillaume Bret, Inna Slynko, Didier Rognan, and Esther Kellenberger
Introduction 379
Description of the Example Case 380
Aim and Context 380
Description of the Searched Data Set 381
Description of the Query 381
Methods 381
ROCS 381
VolSite and Shaper 384
Other Programs for Shape Comparison 384
Description of Input Data Available on the Editor Website 385
Exercises 387
Preamble: Practical Considerations 387
Ligand Shape Analysis 387
What are ROCS Output Files? 387
Binding Site Comparison 388
Conclusions 390
References 391
Part 10 Algorithmic Chemoinformatics 393
24 Algorithmic Chemoinformatics 395 Martin Vogt, Antonio de la Vega de Leon, and Jürgen Bajorath
Introduction 395
Similarity Searching Using Data Fusion Techniques 396
Introduction to Virtual Screening 396
The Three Pillars of Virtual Screening 397
Molecular Representation 397
Similarity Function 397
Search Strategy (Data Fusion) 397
Fingerprints 397
Count Fingerprints 397
Fingerprint Representations 399
Bit Strings 399
Feature Lists 399
Generation of Fingerprints 399
Similarity Metrics 402
Search Strategy 404
Completed Virtual Screening Program 405
Benchmarking VS Performance 406
Scoring the Scorers 407
How to Score 407
Multiple Runs and Reproducibility 408
Adjusting the VS Program for Benchmarking 408
Analyzing Benchmark Results 410
Conclusion 414
Introduction to Chemoinformatics Toolkits 415
Theoretical Background 415
A Note on Graph Theory 416
Basic Usage: Creating and Manipulating Molecules in RDKit 417
Creation of Molecule Objects 417
Molecule Methods 418
Atom Methods 418
Bond Methods 419
An Example: Hill Notation for Molecules 419
Canonical SMILES: The Canon Algorithm 420
Theoretical Background 420
Recap of SMILES Notation 420
Canonical SMILES 421
Building a SMILES String 422
Canonicalization of SMILES 425
The Initial Invariant 427
The Iteration Step 428
Summary 431
Substructure Searching: The Ullmann Algorithm 432
Theoretical Background 432
Backtracking 433
A Note on Atom Order 436
The Ullmann Algorithm 436
Sample Runs 440
Summary 441
Atom Environment Fingerprints 441
Theoretical Background 441
Implementation 443
The Hashing Function 443
The Initial Atom Invariant 444
The Algorithm 444
Summary 447
References 447
Index 449
1 Data Curation
Gilles Marcou and Alexandre Varnek
Goal: Identify and curate problematic chemical information from a data collection. The raw dataset is processed so that it will be ready to feed a relational database dedicated to the organoleptic properties of small organic molecules. Information is interpreted and re-encoded as categories or bit vectors when relevant.
Software: KNIME 3.0, ChemAxon
Data: The following files are provided in the tutorial:
thegoodscent_dup.csv
thegoodscent_dup.raw
MissingOdorTypes.csv
StructureCuration.csv
TutoDataCuration.zip
Slurp.pl
The Good Scents Company is an online shop for cosmetic, flavor, and fragrance ingredients. It has provided information to the flavor, food, and fragrance industry since 1994 and has sold ingredients since 1980.
Chemical datasets can be collected from the literature, compendia, websites, lab books, databases, and so on. Aggregation and automatic processing of data introduce additional sources of errors. Therefore, verifying the quality and accuracy of chemical information is a crucial step of data valorization.[1]
The problem of the quality of publicly available chemical data can be illustrated by searching the Web for the chemical structure of the antibacterial compound Vancomycin, for which stereochemistry information is essential. One can suggest two possible queries using InChIKey notations:[2,3]
Query 1 corresponds to the first block of the InChIKey of Vancomycin; it encodes only the elemental composition and atom connectivity, whereas Query 2 also includes detailed stereochemistry information.
A Google search (January 29, 2016) retrieves 82 and 71 entries for Queries 1 and 2, respectively. The entries found with Query 2 correspond to the correct chemical structure of Vancomycin, whereas the 11 additional entries retrieved with Query 1 refer to incorrect stereoisomers such as its enantiomer; see the example in Scheme 1.1.
Scheme 1.1 Chemical structures of Vancomycin from PubChem. (a) PubChem CID 441141, InChIKey: MYPYJXKWCTUITO-UTHKAUQRSA-N. (b) PubChem CID 14969, InChIKey: MYPYJXKWCTUITO-LYRMYLQWSA-N. Notice that Vancomycin corresponds to structure (b), whereas structure (a) is, in fact, its enantiomer.
From this example, one can estimate that about 13% of the retrieved data (11 of the 82 entries) associate Vancomycin with a wrong chemical structure. An analysis of some 6800 publications in drug discovery[4] showed that the average error rate of reported chemical structures is about 8%, and it seems that little has changed since. Numerous examples and alerts about data curation problems, especially in public databases, can be found in the literature.[4-8]
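The different behavior of the two queries follows from the structure of the InChIKey itself: its first 14-character block hashes only the connectivity layers, while the second block also covers stereochemistry. The short sketch below is not part of the tutorial (which relies on KNIME and ChemAxon tools); it assumes RDKit is available and uses the two enantiomers of alanine as a small stand-in for Vancomycin.

# Illustrative sketch only: alanine enantiomers stand in for Vancomycin,
# whose SMILES is too long to reproduce here.
from rdkit import Chem

l_ala = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")  # L-alanine
d_ala = Chem.MolFromSmiles("C[C@H](N)C(=O)O")   # D-alanine, its enantiomer

key_l = Chem.MolToInchiKey(l_ala)
key_d = Chem.MolToInchiKey(d_ala)
print(key_l, key_d)

# The 14-character skeleton block (constitution and connectivity only) is shared...
assert key_l.split("-")[0] == key_d.split("-")[0]
# ...but the full keys differ, because the second block also hashes the stereo layer.
assert key_l != key_d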
In this tutorial, a dataset of organoleptic properties of cosmetics-related chemicals was collected from the website http://thegoodscentscompany.com/ (January 2016). The dataset contains eight fields for each substance: the name of the chemical substance, the CAS number, an odor category, an odor description, the source of the odor description, a taste description, the literature reference for the taste description, and the SMILES encoding the chemical structure of the substance. The data were retrieved automatically using a script provided with the tutorial (the script may need to be adapted if the structure of the website has changed in the meantime).
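As a quick first check, the raw file can be inspected with any table-handling tool before it enters the KNIME workflow. The sketch below uses pandas purely for illustration; the separator, header row, and column names of thegoodscent_dup.csv depend on how the retrieval script wrote the file and are therefore assumptions.

import pandas as pd

# Load the raw export written by the retrieval script (assumed comma-separated
# with a header row).
raw = pd.read_csv("thegoodscent_dup.csv")

print(raw.shape)           # number of substances x eight fields
print(list(raw.columns))   # field names as actually stored in the file
print(raw.head())          # a quick look at the first records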
Each substance should be associated with exactly one organoleptic category, its odor type. In addition, further descriptions of the odor and taste may be present. These textual descriptions are interpreted in terms of a dictionary of concepts used to describe odors and tastes: the organoleptic semantics. With the help of this semantics, each substance can be represented as a bit vector in which each bit corresponds to an organoleptic descriptor: a bit is "on" if the corresponding description applies to the substance and "off" otherwise. Similarly, the chemical structures are interpreted in terms of MACCS fingerprints; in such a vector, a bit is "on" if the chemical structure of the substance possesses a particular feature (for instance, contains a given element or chemical function). Binary descriptions are suitable for further analysis, for example to compute distances or association rules.
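The two encodings can be sketched as follows. This snippet is illustrative only: it uses RDKit to generate the 167-bit MACCS fingerprint instead of the ChemAxon/KNIME nodes used in the tutorial, and the small odor vocabulary is a made-up example, not the tutorial's actual organoleptic dictionary.

from rdkit import Chem
from rdkit.Chem import MACCSkeys

# Structural bit vector: the MACCS fingerprint of ethyl acetate (a fruity ester)
mol = Chem.MolFromSmiles("CCOC(C)=O")
maccs = MACCSkeys.GenMACCSKeys(mol)
print(maccs.GetNumBits(), "bits,", maccs.GetNumOnBits(), "set")

# Organoleptic bit vector: one bit per descriptor of a (hypothetical) vocabulary
vocabulary = ["fruity", "sweet", "green", "floral", "woody"]
description = "sweet fruity ethereal"
bits = [1 if term in description.split() else 0 for term in vocabulary]
print(dict(zip(vocabulary, bits)))   # e.g. {'fruity': 1, 'sweet': 1, 'green': 0, ...}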
The chemical structures, the organoleptic descriptions, the organoleptic categories, and the bibliographic references are split into separate files that can be loaded into separate tables and then merged into a relational database.
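As a preview of what Chapter 2 covers in detail, the split files can be combined in any relational engine. The sketch below uses SQLite and pandas as a stand-in for the database system used later in the book; the file names, table names, and column names are placeholders, not the tutorial's exact ones.

import sqlite3
import pandas as pd

conn = sqlite3.connect("organoleptic.db")

# One table per curated file (placeholder file names)
pd.read_csv("structures.csv").to_sql("structures", conn, if_exists="replace", index=False)
pd.read_csv("odor_types.csv").to_sql("odor_types", conn, if_exists="replace", index=False)

# Merge the tables on a shared key, here assumed to be a CAS number column "cas"
query = """
    SELECT s.cas, s.smiles, o.odor_type
    FROM structures AS s
    JOIN odor_types AS o ON o.cas = s.cas
"""
print(pd.read_sql_query(query, conn).head())
conn.close()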
KNIME is an Integrated Development Environment (IDE) and a workflow-programming language. Processing units, called nodes, are connected to each other, and data flow from one node to another along these connections. By default, KNIME is divided into eight zones (Figure 1.1). The first (1) is the toolbar, with shortcut buttons for creating a new project, saving the current projects, zooming, automatically cleaning up the workbench, and running and managing the workflow. The second (2) area is the workbench, the place to drag, drop, and connect nodes in order to design a workflow. A miniature of the workbench is provided inside the sixth area (6), the Outline, to help navigating the workflow. The third area (3) is the KNIME Explorer, a storage area for workflows; it is divided by default into LOCAL and EXAMPLES. The EXAMPLES section requires an Internet connection to a public KNIME server (log in as guest, no password), where useful KNIME examples implementing solutions for many basic and advanced operations can be found. The fourth (4) area is the Node Repository, where all nodes, representing data processing operations, are stored; the nodes are organized in a tree, and a navigation bar provides a node search tool. The most frequently used nodes and the annotated ones are available inside the seventh area (7), the Favorite Nodes. The fifth area (5) is the Node Description; when a node is selected, it displays the help text describing the purpose of the node, its parameters, and the format of its input and output. The eighth area (8) is the Console, where error and warning messages are displayed.
Figure 1.1 KNIME Overview. The interface is organized as follows: (1) the toolbar, (2) the workbench, (3) the KNIME Explorer, (4) the Node Repository, (5) the Node Description, (6) the Outline, (7) the Favorite Nodes, (8) the Console.
When KNIME is started for the first time, it asks for a directory to use as its workspace. This workspace is used to store temporary files and the workflows. The location and name of the workspace are up to the user and can be changed later in the Preferences menu of KNIME.
Using KNIME consists of manipulating the following basic concepts:
Figure 1.2 Schematic view of a KNIME node. The main title describes the data processing. To the left and right, the handles represent...