List of Contributors xv
Preface xvii
About the Companion Website xix
Part 1 Chemical Databases 1
1 Data Curation 3 Gilles Marcou and Alexandre Varnek
Theoretical Background 3
Software 5
Step-by-Step Instructions 7
Conclusion 34
References 36
2 Relational Chemical Databases: Creation, Management, and Usage 37 Gilles Marcou and Alexandre Varnek
Theoretical Background 37
Step-by-Step Instructions 41
Conclusion 65
References 65
3 Handling of Markush Structures 67 Timur Madzhidov, Ramil Nugmanov, and Alexandre Varnek
Theoretical Background 67
Step-by-Step Instructions 68
Conclusion 73
References 73
4 Processing of SMILES, InChI, and Hashed Fingerprints 75 João Montargil Aires de Sousa
Theoretical Background 75
Algorithms 76
Step-by-Step Instructions 78
Conclusion 80
References 81
Part 2 Library Design 83
5 Design of Diverse and Focused Compound Libraries 85 Antonio de la Vega de Leon, Eugen Lounkine, Martin Vogt, and Jürgen Bajorath
Introduction 85
Data Acquisition 86
Implementation 86
Compound Library Creation 87
Compound Library Analysis 90
Normalization of Descriptor Values 91
Visualizing Descriptor Distributions 92
Decorrelation and Dimension Reduction 94
Partitioning and Diverse Subset Calculation 95
Partitioning 95
Diverse Subset Selection 97
Combinatorial Libraries 98
Combinatorial Enumeration of Compounds 98
Retrosynthetic Approaches to Library Design 99
References 101
Part 3 Data Analysis and Visualization 103
6 Hierarchical Clustering in R 105 Martin Vogt and Jürgen Bajorath
Theoretical Background 105
Algorithms 106
Instructions 107
Hierarchical Clustering Using Fingerprints 108
Hierarchical Clustering Using Descriptors 111
Visualization of the Data Sets 113
Alternative Clustering Methods 116
Conclusion 117
References 118
7 Data Visualization and Analysis Using Kohonen Self-Organizing Maps 119 João Montargil Aires de Sousa
Theoretical Background 119
Algorithms 120
Instructions 121
Conclusion 126
References 126
Part 4 Obtaining and Validation of QSAR/QSPR Models 127
8 Descriptors Generation Using the CDK Toolkit and Web Services 129 João Montargil Aires de Sousa
Theoretical Background 129
Algorithms 130
Step-by-Step Instructions 131
Conclusion 133
References 134
9 QSPR Models on Fragment Descriptors 135 Vitaly Solov'ev and Alexandre Varnek
Abbreviations 135
Data 136
ISIDA_QSPR Input 137
Data Split Into Training and Test Sets 139
Substructure Molecular Fragment (SMF) Descriptors 139
Regression Equations 142
Forward and Backward Stepwise Variable Selection 142
Parameters of Internal Model Validation 143
Applicability Domain (AD) of the Model 143
Storage and Retrieval of Modeling Results 144
Analysis of Modeling Results 144
Root-Mean Squared Error (RMSE) Estimation 148
Setting the Parameters 151
Analysis of n-Fold Cross-Validation Results 151
Loading Structure-Data File 153
Descriptors and Fitting Equation 154
Variable Selection 155
Consensus Model 155
Model Applicability Domain 155
n-Fold External Cross-Validation 155
Saving and Loading of the Consensus Modeling Results 155
Statistical Parameters of the Consensus Model 156
Consensus Model Performance as a Function of Individual Models Acceptance Threshold 157
Building Consensus Model on the Entire Data Set 158
Loading Input Data 159
Loading Selected Models and Choosing their Applicability Domain 160
Reporting Predicted Values 160
Analysis of the Fragments Contributions 161
References 161
10 Cross-Validation and the Variable Selection Bias 163 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 163
Step-by-Step Instructions 165
Conclusion 172
References 173
11 Classification Models 175 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 176
Algorithms 178
Step-by-Step Instructions 180
Conclusion 191
References 192
12 Regression Models 193 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 194
Step-by-Step Instructions 197
Conclusion 207
References 208
13 Benchmarking Machine-Learning Methods 209 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 209
Step-by-Step Instructions 210
Conclusion 222
References 222
14 Compound Classification Using the scikit-learn Library 223 Jenny Balfer, Jürgen Bajorath, and Martin Vogt
Theoretical Background 224
Algorithms 225
Step-by-Step Instructions 230
Naïve Bayes 230
Decision Tree 231
Support Vector Machine 234
Notes on Provided Code 237
Conclusion 238
References 239
Part 5 Ensemble Modeling 241
15 Bagging and Boosting of Classification Models 243 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 243
Algorithm 244
Step-by-Step Instructions 245
Conclusion 247
References 247
16 Bagging and Boosting of Regression Models 249 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 249
Algorithm 249
Step-by-Step Instructions 250
Conclusion 255
References 255
17 Instability of Interpretable Rules 257 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 257
Algorithm 258
Step-by-Step Instructions 258
Conclusion 261
References 261
18 Random Subspaces and Random Forest 263 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 264
Algorithm 264
Step-by-Step Instructions 265
Conclusion 269
References 269
19 Stacking 271 Igor I. Baskin, Gilles Marcou, Dragos Horvath, and Alexandre Varnek
Theoretical Background 271
Algorithm 272
Step-by-Step Instructions 273
Conclusion 277
References 278
Part 6 3D Pharmacophore Modeling 279
20 3D Pharmacophore Modeling Techniques in Computer-Aided Molecular Design Using LigandScout 281 Thomas Seidel, Sharon D. Bryant, Gökhan Ibis, Giulio Poli, and Thierry Langer
Introduction 281
Theory: 3D Pharmacophores 283
Representation of Pharmacophore Models 283
Hydrogen-Bonding Interactions 285
Hydrophobic Interactions 285
Aromatic and Cation-π Interactions 286
Ionic Interactions 286
Metal Complexation 286
Ligand Shape Constraints 287
Pharmacophore Modeling 288
Manual Pharmacophore Construction 288
Structure-Based Pharmacophore Models 289
Ligand-Based Pharmacophore Models 289
3D Pharmacophore-Based Virtual Screening 291
3D Pharmacophore Creation 291
Annotated Database Creation 291
Virtual Screening-Database Searching 292
Hit-List Analysis 292
Tutorial: Creating 3D-Pharmacophore Models Using LigandScout 294
Creating Structure-Based Pharmacophores From a Ligand-Protein Complex 294
Description: Create a Structure-Based Pharmacophore Model 296
Create a Shared Feature Pharmacophore Model From Multiple Ligand-Protein Complexes 296
Description: Create a Shared Feature Pharmacophore and Align it to Ligands 297
Create Ligand-Based Pharmacophore Models 298
Description: Ligand-Based Pharmacophore Model Creation 300
Tutorial: Pharmacophore-Based Virtual Screening Using LigandScout 301
Virtual Screening, Model Editing, and Viewing Hits in the Target Active Site 301
Description: Virtual Screening and Pharmacophore Model Editing 302
Analyzing Screening Results with Respect to the Binding Site 303
Description: Analyzing Hits in the Active Site Using LigandScout 305
Parallel Virtual Screening of Multiple Databases Using LigandScout 305
Virtual Screening in the Screening Perspective of LigandScout 306
Description: Virtual Screening Using LigandScout 306
Conclusions 307
Acknowledgments 307
References 307
Part 7 The Protein 3D-Structures in Virtual Screening 311
21 The Protein 3D-Structures in Virtual Screening 313 Inna Slynko and Esther Kellenberger
Introduction 313
Description of the Example Case 314
Thrombin and Blood Coagulation 314
Active Thrombin and Inactive Prothrombin 314
Thrombin as a Drug Target 314
Thrombin Three-Dimensional Structure: The 1OYT PDB File 315
Modeling Suite 315
Overall Description of the Input Data Available on the Editor Website 315
Exercise 1: Protein Analysis and Preparation 316
Step 1: Identification of Molecules Described in the 1OYT PDB File 316
Step 2: Protein Quality Analysis of the Thrombin/Inhibitor PDB Complex Using MOE Geometry Utility 320
Step 3: Preparation of the Protein for Drug Design Applications 321
Step 4: Description of the Protein-Ligand Binding Mode 325
Step 5: Detection of Protein Cavities 328
Exercise 2: Retrospective Virtual Screening Using the Pharmacophore Approach 330
Step 1: Description of the Test Library 332
Step 2.1: Pharmacophore Design, Overview 333
Step 2.2: Pharmacophore Design, Flexible Alignment of Three Thrombin Inhibitors 334
Step 2.3: Pharmacophore Design, Query Generation 335
Step 3: Pharmacophore Search 337
Exercise 3: Retrospective Virtual Screening Using the Docking Approach 341
Step 1: Description of the Test Library 341
Step 2: Preparation of the Input 341
Step 3: Re-Docking of the Crystallographic Ligand 341
Step 4: Virtual Screening of a Database 345
General Conclusion 350
References 351
Part 8 Protein-Ligand Docking 353
22 Protein-Ligand Docking 355 Inna Slynko, Didier Rognan, and Esther Kellenberger
Introduction 355
Description of the Example Case 356
Methods 356
Ligand Preparation 359
Protein Preparation 359
Docking Parameters 360
Description of Input Data Available on the Editor Website 360
Exercises 362
A Quick Start with LeadIT 362
Re-Docking of Tacrine into AChE 362
Preparation of AChE From 1ACJ PDB File 362
Docking of Neutral Tacrine, then of Positively Charged Tacrine 363
Docking of Positively Charged Tacrine in AChE in Presence of Water 365
Cross-Docking of Tacrine-Pyridone and Donepezil Into AChE 366
Preparation of AChE From 1ACJ PDB File 366
Cross-Docking of Tacrine-Pyridone Inhibitor and Donepezil in AChE in Presence of Water 367
Re-Docking of Donepezil in AChE in Presence of Water 370
General Conclusions 372
Annex: Screen Captures of LeadIT Graphical Interface 372
References 375
Part 9 Pharmacophorical Profiling Using Shape Analysis 377
23 Pharmacophorical Profiling Using Shape Analysis 379 Jérémy Desaphy, Guillaume Bret, Inna Slynko, Didier Rognan, and Esther Kellenberger
Introduction 379
Description of the Example Case 380
Aim and Context 380
Description of the Searched Data Set 381
Description of the Query 381
Methods 381
ROCS 381
VolSite and Shaper 384
Other Programs for Shape Comparison 384
Description of Input Data Available on the Editor Website 385
Exercises 387
Preamble: Practical Considerations 387
Ligand Shape Analysis 387
What are ROCS Output Files? 387
Binding Site Comparison 388
Conclusions 390
References 391
Part 10 Algorithmic Chemoinformatics 393
24 Algorithmic Chemoinformatics 395 Martin Vogt, Antonio de la Vega de Leon, and Jürgen Bajorath
Introduction 395
Similarity Searching Using Data Fusion Techniques 396
Introduction to Virtual Screening 396
The Three Pillars of Virtual Screening 397
Molecular Representation 397
Similarity Function 397
Search Strategy (Data Fusion) 397
Fingerprints 397
Count Fingerprints 397
Fingerprint Representations 399
Bit Strings 399
Feature Lists 399
Generation of Fingerprints 399
Similarity Metrics 402
Search Strategy 404
Completed Virtual Screening Program 405
Benchmarking VS Performance 406
Scoring the Scorers 407
How to Score 407
Multiple Runs and Reproducibility 408
Adjusting the VS Program for Benchmarking 408
Analyzing Benchmark Results 410
Conclusion 414
Introduction to Chemoinformatics Toolkits 415
Theoretical Background 415
A Note on Graph Theory 416
Basic Usage: Creating and Manipulating Molecules in RDKit 417
Creation of Molecule Objects 417
Molecule Methods 418
Atom Methods 418
Bond Methods 419
An Example: Hill Notation for Molecules 419
Canonical SMILES: The Canon Algorithm 420
Theoretical Background 420
Recap of SMILES Notation 420
Canonical SMILES 421
Building a SMILES String 422
Canonicalization of SMILES 425
The Initial Invariant 427
The Iteration Step 428
Summary 431
Substructure Searching: The Ullmann Algorithm 432
Theoretical Background 432
Backtracking 433
A Note on Atom Order 436
The Ullmann Algorithm 436
Sample Runs 440
Summary 441
Atom Environment Fingerprints 441
Theoretical Background 441
Implementation 443
The Hashing Function 443
The Initial Atom Invariant 444
The Algorithm 444
Summary 447
References 447
Index 449
1 Data Curation
Gilles Marcou and Alexandre Varnek
Goal: Identify and curate problematic chemical information from a data collection. The raw dataset is processed so that it will be ready to feed a relational database dedicated to the organoleptic properties of small organic molecules. Information is interpreted and re-encoded as categories or bit vectors when relevant.
Software: KNIME 3.0, ChemAxon
Data: The following files are provided in the tutorial:
thegoodscent_dup.csv
thegoodscent_dup.raw
MissingOdorTypes.csv
StructureCuration.csv
TutoDataCuration.zip
Slurp.pl
The Good Scents Company is an online shop for cosmetic, flavor, and fragrance ingredients. It has provided information to the flavor, food, and fragrance industry since 1994 and has sold ingredients since 1980.
Chemical datasets can be collected from the literature, compendia, websites, lab books, databases, and so on. Aggregation and automatic processing of data introduce additional sources of errors. Therefore, verifying the quality and accuracy of chemical information is a crucial step of data valorization.[1]
The problem of the quality of publicly available chemical data can be illustrated by searching the Web for the chemical structure of the antibacterial compound Vancomycin, for which stereochemistry information is essential. One can suggest two possible queries using InChIKey notations:[2,3]
Query 1 corresponds to the first block of the InChIKey of Vancomycin; it encodes only the elemental composition and atom connectivity, whereas Query 2 also includes detailed stereochemistry information.
A Google search (January 29, 2016) retrieves 82 and 71 entries for Queries 1 and 2, respectively. The entries found with Query 2 correspond to the correct chemical structure of Vancomycin, whereas the 11 additional entries retrieved with Query 1 refer to incorrect stereoisomers such as its enantiomer; see the example in Scheme 1.1.
Scheme 1.1 Chemical structures of Vancomycin from PubChem. (a) PubChem CID 441141, InChIKey: MYPYJXKWCTUITO-UTHKAUQRSA-N. (b) PubChem CID 14969, InChIKey: MYPYJXKWCTUITO-LYRMYLQWSA-N. Notice that Vancomycin corresponds to structure (b), whereas structure (a) is, in fact, its enantiomer.
From this example, one can estimate that about 13% of the retrieved data (11 of the 82 entries) associate Vancomycin with a wrong chemical structure. An analysis of some 6800 publications in drug discovery[4] showed that the average error rate of reported chemical structures is about 8%, and it seems that little has changed since. Numerous examples and alerts about data curation problems, especially in public databases, can be found in the literature.[4-8]
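The different behavior of the two queries follows from the structure of the InChIKey itself: its first 14-character block hashes only the connectivity layers, while the second block also covers stereochemistry. The short sketch below is not part of the tutorial (which relies on KNIME and ChemAxon tools); it assumes RDKit is available and uses the two enantiomers of alanine as a small stand-in for Vancomycin.

# Illustrative sketch only: alanine enantiomers stand in for Vancomycin,
# whose SMILES is too long to reproduce here.
from rdkit import Chem

l_ala = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")  # L-alanine
d_ala = Chem.MolFromSmiles("C[C@H](N)C(=O)O")   # D-alanine, its enantiomer

key_l = Chem.MolToInchiKey(l_ala)
key_d = Chem.MolToInchiKey(d_ala)
print(key_l, key_d)

# The 14-character skeleton block (constitution and connectivity only) is shared...
assert key_l.split("-")[0] == key_d.split("-")[0]
# ...but the full keys differ, because the second block also hashes the stereo layer.
assert key_l != key_d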
In this tutorial, a dataset of organoleptic properties of cosmetics-related chemicals was collected from the website http://thegoodscentscompany.com/ (January 2016). The dataset contains eight fields for each substance: the name of the chemical substance, the CAS number, an odor category, an odor description, the source of the odor description, a taste description, the literature reference for the taste description, and the SMILES encoding the chemical structure of the substance. The data were retrieved automatically using a script provided with the tutorial (the script may need to be adapted if the structure of the website has changed in the meantime).
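As a quick first check, the raw file can be inspected with any table-handling tool before it enters the KNIME workflow. The sketch below uses pandas purely for illustration; the separator, header row, and column names of thegoodscent_dup.csv depend on how the retrieval script wrote the file and are therefore assumptions.

import pandas as pd

# Load the raw export written by the retrieval script (assumed comma-separated
# with a header row).
raw = pd.read_csv("thegoodscent_dup.csv")

print(raw.shape)           # number of substances x eight fields
print(list(raw.columns))   # field names as actually stored in the file
print(raw.head())          # a quick look at the first records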
Each substance should be associated with exactly one organoleptic category, its odor type. In addition, further descriptions of the odor and taste may be present. These textual descriptions are interpreted in terms of a dictionary of concepts used to describe odors and tastes: the organoleptic semantics. With the help of this semantics, each substance can be represented as a bit vector in which each bit corresponds to an organoleptic descriptor: a bit is "on" if the corresponding description applies to the substance and "off" otherwise. Similarly, the chemical structures are interpreted in terms of MACCS fingerprints; in such a vector, a bit is "on" if the chemical structure of the substance possesses a particular feature (for instance, contains a given element or chemical function). Binary descriptions are suitable for further analysis, for example to compute distances or association rules.
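The two encodings can be sketched as follows. This snippet is illustrative only: it uses RDKit to generate the 167-bit MACCS fingerprint instead of the ChemAxon/KNIME nodes used in the tutorial, and the small odor vocabulary is a made-up example, not the tutorial's actual organoleptic dictionary.

from rdkit import Chem
from rdkit.Chem import MACCSkeys

# Structural bit vector: the MACCS fingerprint of ethyl acetate (a fruity ester)
mol = Chem.MolFromSmiles("CCOC(C)=O")
maccs = MACCSkeys.GenMACCSKeys(mol)
print(maccs.GetNumBits(), "bits,", maccs.GetNumOnBits(), "set")

# Organoleptic bit vector: one bit per descriptor of a (hypothetical) vocabulary
vocabulary = ["fruity", "sweet", "green", "floral", "woody"]
description = "sweet fruity ethereal"
bits = [1 if term in description.split() else 0 for term in vocabulary]
print(dict(zip(vocabulary, bits)))   # e.g. {'fruity': 1, 'sweet': 1, 'green': 0, ...}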
The chemical structures, the organoleptic descriptions, the organoleptic categories, and the bibliographic references are split into separate files that can be loaded into separate tables and then merged into a relational database.
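As a preview of what Chapter 2 covers in detail, the split files can be combined in any relational engine. The sketch below uses SQLite and pandas as a stand-in for the database system used later in the book; the file names, table names, and column names are placeholders, not the tutorial's exact ones.

import sqlite3
import pandas as pd

conn = sqlite3.connect("organoleptic.db")

# One table per curated file (placeholder file names)
pd.read_csv("structures.csv").to_sql("structures", conn, if_exists="replace", index=False)
pd.read_csv("odor_types.csv").to_sql("odor_types", conn, if_exists="replace", index=False)

# Merge the tables on a shared key, here assumed to be a CAS number column "cas"
query = """
    SELECT s.cas, s.smiles, o.odor_type
    FROM structures AS s
    JOIN odor_types AS o ON o.cas = s.cas
"""
print(pd.read_sql_query(query, conn).head())
conn.close()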
KNIME is an Integrated Development Environment (IDE) and a workflow-programming language. Processing units, called nodes, are connected to each other, and data flow from one node to another along these connections. By default, KNIME is divided into eight zones (Figure 1.1). The first (1) is the toolbar, with shortcut buttons for creating a new project, saving the current projects, zooming, automatically cleaning up the workbench, and running and managing the workflow. The second (2) area is the workbench, the place to drag, drop, and connect nodes in order to design a workflow. A miniature of the workbench is provided inside the sixth area (6), the Outline, to help navigating the workflow. The third area (3) is the KNIME Explorer, a storage area for workflows; it is divided by default into LOCAL and EXAMPLES. The EXAMPLES section requires an Internet connection to a public KNIME server (log in as guest, no password), where useful KNIME examples implementing solutions for many basic and advanced operations can be found. The fourth (4) area is the Node Repository, where all nodes, representing data processing operations, are stored; the nodes are organized in a tree, and a navigation bar provides a node search tool. The most frequently used nodes and the annotated ones are available inside the seventh area (7), the Favorite Nodes. The fifth area (5) is the Node Description; when a node is selected, it displays the help text describing the purpose of the node, its parameters, and the format of its input and output. The eighth area (8) is the Console, where error and warning messages are displayed.
Figure 1.1 KNIME Overview. The interface is organized as follows: (1) the toolbar, (2) the workbench, (3) the KNIME Explorer, (4) the Node Repository, (5) the Node Description, (6) the Outline, (7) the Favorite Nodes, (8) the Console.
When KNIME is started for the first time, it asks for a directory to use as its workspace. This workspace is used to store temporary files and the workflows. The location and name of the workspace are up to the user and can be changed later in the Preferences menu of KNIME.
Using KNIME consists of manipulating the following basic concepts:
Figure 1.2 Schematic view of a KNIME node. The main title describes the data processing. To the left and right, the handles represent...