
Statistical Issues in Machine Learning Towards Reliable Split Selection and Variable Importance Measures
Cuvillier Verlag eBooks
Published on 30 July 2008
204 pages
978-3-7369-2661-5 (ISBN)
System requirements
for PDF without DRM
E-Book Single Licence
You are acquiring a single-user licence for this eBook, which you may not transfer.
Available for download
Description
Recursive partitioning methods from machine learning are widely applied in many scientific fields, such as genetics and bioinformatics. The present work is concerned, from a statistical point of view, with the two main problems that arise in recursive partitioning: instability and biased variable selection. With respect to the first issue, instability, this work covers the entire scope of methods, from standard classification trees through robustified classification trees to ensemble methods such as TWIX, bagging and random forests.
While ensemble methods prove to be much more stable than single trees, they also lose most of their interpretability. Therefore, an adaptive cutpoint selection scheme is suggested with which a TWIX ensemble reduces to a single tree if the partition is sufficiently stable. With respect to the second issue, variable selection bias, the work investigates the statistical sources of this artifact in single trees and a new form of bias inherent in ensemble methods based on bootstrap samples. For single trees, one unbiased split selection criterion is evaluated and another is newly introduced. Based on the results for single trees and further findings on the effects of bootstrap sampling on association measures, it is shown that, in addition to using an unbiased split selection criterion, ensemble methods should employ subsampling instead of bootstrap sampling so that the variable importance scores of predictor variables of different types can be compared reliably. The statistical properties and the null hypothesis of a test for the random forest variable importance are critically investigated. Finally, a new conditional importance measure is suggested that allows for a fair comparison in the case of correlated predictor variables and better reflects the null hypothesis of interest.
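The difference between the two resampling schemes named in the description can be sketched in a few lines of Python. This is a minimal illustration and not code from the book: the function names `bootstrap_sample` and `subsample`, the sample size, and the subsample fraction of 0.632 are all assumptions chosen for the example.

```python
import random

def bootstrap_sample(n, rng):
    """Draw n indices from range(n) WITH replacement (bootstrap sampling)."""
    return [rng.randrange(n) for _ in range(n)]

def subsample(n, rng, fraction=0.632):
    """Draw round(fraction * n) distinct indices WITHOUT replacement.

    The default fraction is an illustrative assumption, not a value
    taken from the book.
    """
    k = round(fraction * n)
    return rng.sample(range(n), k)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000
boot = bootstrap_sample(n, rng)
sub = subsample(n, rng)

# A bootstrap sample repeats some observations and omits others;
# a subsample contains each drawn observation exactly once.
print("bootstrap: size", len(boot), "distinct", len(set(boot)))
print("subsample: size", len(sub), "distinct", len(set(sub)))
```

The duplicated observations are the structural difference between the two schemes. How that difference affects variable importance scores is the subject of the book itself and is not reproduced here.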
More details
Language
English
Place of publication
Göttingen
Germany
File size
0.99 MB
ISBN-13
978-3-7369-2661-5 (9783736926615)
Other editions
Carolin Strobl
Statistical Issues in Machine Learning Towards Reliable Split Selection and Variable Importance Measures
Book
2008
1st Edition
Cuvillier Verlag
€28.00
No shipping information available
System requirements
File format: PDF
Copy protection: without DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Use the free software Adobe Reader, Adobe Digital Editions, or any other PDF viewer of your choice (see eBook Help).
- Tablet/Smartphone (Android; iOS): Install the free app Adobe Digital Editions or another reading app for eBooks, e.g., PocketBook (see eBook Help).
- E-reader: Bookeen, Kobo, PocketBook, Sony, Tolino and many more (Kindle: limited support only).
The PDF file format always displays a book page identically on any hardware. This makes PDF suitable for complex layouts such as those used in textbooks and reference books (images, tables, columns, footnotes). On the small screens of e-readers or smartphones, however, PDFs can be cumbersome to read because they require a lot of scrolling.
This eBook does not use copy protection or Digital Rights Management.
For more information, see our eBook Help page.