Advances in Knowledge Discovery and Data Mining

Name: Advances in Knowledge Discovery and Data Mining | 15th Pacific-Asia Conference, PAKDD 2011, Shenzhen, China, May 24-27, 2011, Proceedings, Part I
Brand: Springer
Price: 53.49 EUR
Availability: OnlineOnly

Joshua Zhexue Huang, Longbing Cao, Jaideep Srivastava (Editors)

Springer (Publisher)

Published on 27 May 2011

XXIV, 564 pages

E-Book

PDF with digital watermarking

978-3-642-20841-6 (ISBN)

€53.49 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Title
Preface
Organization
Table of Contents
Feature Extraction
An Instance Selection Algorithm Based on Reverse Nearest Neighbor
Introduction
Related Works
Incremental Algorithms
Decremental Algorithms
The RNNR Algorithm
The Choosing Strategy of RNNR-AL0 (Absorption, Larger Than ZERO)
The Choosing Strategy of RNNR-AL1 (Absorption, Larger Than ONE)
The Choosing Strategy of RNNR-L1 (Selecting All, Larger Than ONE)
Experiments
Conclusions and Future Work
References
A Game Theoretic Approach for Feature Clustering and Its Application to Feature Selection
Introduction
Related Work
Coalitional Games Preliminaries
NSP Computation via Integer Linear Program (ILP)
Feature Clustering via Nash Stable Partition
Feature Selection Approaches
Handling of Large Feature Set Size
Equivalence between a k-NSP and Minimum k-Cut
Hierarchical Feature Clustering
Experiments
References
Feature Selection Strategy in Text Classification
Introduction
Feature Selection
An Overview
Analysis
Modeling
Experiment
Algorithms for Comparison
Result and Discussion
Related Work
Summary and Conclusion
References
Unsupervised Feature Weighting Based on Local Feature Relatedness
Introduction
Preliminary
Document Representation
Relatedness Measure
Feature Weighting
Feature Weighting Based on Syntactic Information
Feature Weighting Based on Local Feature Relatedness (LFR)
Combination of Syntactic and Semantic Factors
Related Work
Feature Weighting Based on Global Feature Relatedness (GFR)
Document Similarity Based on Inter-document Feature Relatedness (IFR)
Experiments
Datasets
Methodology
Evaluation Metrics
Experiment Results
Computational Complexity
Conclusions
References
An Effective Feature Selection Method for Text Categorization
Introduction
Reviews of Feature Selection Methods in Text Categorization
Definitions
Three Baseline FS Methods
Analysis of the Traditional Feature Selection Methods in Text Categorization
Optimal Feature Selection for KNN
K-Nearest Neighbor Classification (KNN) for Text Categorization
Effective Feature Selection Criterion for KNN
An Illustrating Example
Experiments
Datasets
Classifiers
Performance Measurement
Experimental Results
Discussion of Results
Conclusion
References
Machine Learning
A Subpath Kernel for Rooted Unordered Trees
Introduction
A Linear-Time Kernel for Rooted Unordered Trees
A New Tree Kernel Based on Tree Subpaths
Subpath Set
A Subpath Tree Kernel
An Efficient Algorithm for the Subpath Tree Kernel
Experiments
An XML Classification Dataset
Glycan Classification Datasets
Comparison of Execution Times of the Proposed Kernel and the Linear-Time Tree Kernel
Related Work
Conclusion
References
Classification Probabilistic PCA with Application in Domain Adaptation
Introduction
Classification Probabilistic PCA (CPPCA)
Probabilistic PCA (PPCA) Revisited
Classification Probabilistic PCA (CPPCA)
EM Learning for CPPCA
Experimental Results
Product Review Adaptation Experiment
Conclusions
References
Probabilistic Matrix Factorization Leveraging Contexts for Unsupervised Relation Extraction
Introduction
Related Work
Unsupervised Relation Extraction
Relation Discovery
Feature Extraction
CL-PMF for Dimension Reduction
Experiments
Annotated Corpus
Evaluation
Methods
Results and Discussion
Parameters for CL-PMF
Conclusion
References
The Unsymmetrical-Style Co-training
Introduction
Preliminaries
Co-training
Co-EM
Multiple-Learner
Unsymmetrical Co-training
Experiments
Conclusion
References
Balance Support Vector Machines Locally Using the Structural Similarity Kernel
Introduction
Bibliographic Coupling Based Structural Similarity
From Global to Local Balance
From Hub Scores to Signed Authority Scores
Related Research
Experiments
Setup
Results
Discussions of Limitations
Concluding Remarks
References
Using Classifier-Based Nominal Imputation to Improve Machine Learning
Introduction
Framework
Imputation for Nominal Data
Classifier-Based Nominal Imputation
Using CNI to Improve Classification Performance of Machine Learned Classifiers
Experimental Design and Results
Evaluation of CNI Imputation Algorithms
The Impact of Nominal Imputers on the Classification Performance for Instance-Based Learning Algorithms
The Impact of Nominal Imputers on Other Machine Learned Classifiers
Conclusions
References
A Bayesian Framework for Learning Shared and Individual Subspaces from Multiple Data Sources
Introduction
Bayesian Shared Subspace Learning (BSSL)
Bayesian Representation
Gibbs Inference
Subspace Dimensionality and Complexity Analysis
Social Media Applications
BSSL Based Social Media Retrieval
BSSL Based Cross-Social Media Retrieval
Experiments
Dataset
Subspace Learning and Parameter Setting
Experiment 1: Social Media Retrieval Using Auxiliary Sources
Experiment 2: Cross Media Retrieval
Conclusion
References
Are Tensor Decomposition Solutions Unique? On the Global Convergence HOSVD and ParaFac Algorithms
Introduction
Tensor Decomposition
High Order SVD (HOSVD)
ParaFac Decomposition
Unique Solution
A Natural Starting Point for W: The T1 Decomposition and the PCA Solution
Initialization
Run Statistics and Validation
Eigenvalue Distributions
Datasets
Image Randomization
Main Results
Eigenvalue-Base Uniqueness Prediction
Theoretical Analysis
Summary
References
Improved Spectral Hashing
Introduction
Spectral Hashing
Formulation and Algorithm
Discussion
Proposed Methods
SH with Probability Transform
Generalized SH
Toward a More Efficient Code
Experiments
Datasets and Evaluation Measures
Experimental Results
Computational Cost
Conclusion and Future Works
References
Clustering
High-Order Co-clustering Text Data on Semantics-Based Representation Model
Introduction
High-Order Representation Structure
High-Order Co-clustering Method
Experimental Results and Discussion
Dataset
Clustering Results and Discussion
Conclusions and Future Work
References
The Role of Hubness in Clustering High-Dimensional Data
Introduction
Related Work
The Hubness Phenomenon
The Emergence of Hubs
Relation of Hubs to Data Clusters
Hub-Based Clustering
Deterministic Approach
Probabilistic Approach
Experiments and Evaluation
Synthetic Data: Gaussian Mixtures
Clustering in the Presence of High Noise Levels
Experiments on Real-World Data
Conclusions and Future Work
References
Spatial Entropy-Based Clustering for Mining Data with Spatial Correlation
Introduction
Related Work
Spatial Entropy-Based Clustering
Spatial Entropy
Using Spatial Entropy in Spatial Clustering
A Spatial Entropy-Based Spatial Clustering Method
Experiments
Conclusions
References
Self-adjust Local Connectivity Analysis for Spectral Clustering
Introduction
Methodology
Local Connectivity-Based Scaling
Eigenvector Selection
Experimental Evaluation
Conclusions
References
An Effective Density-Based Hierarchical Clustering Technique to Identify Coherent Patterns from Gene Expression Data
Introduction
GeneClusTree
Regulation Information Extraction
Performance Evaluation
Conclusions and Future Work
References
Nonlinear Discriminative Embedding for Clustering via Spectral Regularization
Introduction
Spectral Regularization on Manifold
The Proposed Method
Problem Formulation and Its Solution
Convergence Analysis
Experimental Results
Data Sets
Clustering Evaluation and Parameter Selection
Clustering Performance
Clustering Performance vs. Parameters
Comparison on the Embeddings
Conclusions
References
An Adaptive Fuzzy k-Nearest Neighbor Method Based on Parallel Particle Swarm Optimization for Bankruptcy Prediction
Introduction
Background Materials
Fuzzy k-Nearest Neighbor Algorithm (FKNN)
Time Variant Particle Swarm Optimization (TVPSO)
Proposed PTVPSO-FKNN Prediction Model
TVPSO-FKNN Model Based on the Serial PSO Algorithm
Parallel Implementation of the TVPSO-FKNN Model on the Multi-core Platform (PTVPSO-FKNN)
Experimental Design
Data Description
Experimental Setup
Measure for Performance Evaluation
Experimental Results and Discussion
Experiment I: Classification in the Whole Original Feature Space
Experiment II: Classification Using the PTVPSO-FKNN Model with Feature Selection
Experiment III: Comparison between the Parallel TVPSO-FKNN Model and the Serial One
Conclusions
References
Semi-supervised Parameter-Free Divisive Hierarchical Clustering of Categorical Data
Introduction
Instance-Level Constraints
The Algorithm
Initialization
Refinement
Alleviation of Cannot-Link Violation
Experimental Results
Data Sets
Results and Discussion
Conclusions
References
Classification
Identifying Hidden Contexts in Classification
Introduction
Problem Set-Up
Positioning within Related Work
Three Techniques for Identifying Hidden Contexts
Experimental Evaluation
Evaluation Criteria
Datasets and Experimental Protocol
Results
Case Study
Conclusion
References
Cross-Lingual Sentiment Classification via Bi-view Non-negative Matrix Tri-Factorization
Introduction
Related Work
Problem Setting
Bi-view Non-negative Matrix Tri-Factorization
Basic Idea
Mathematical Formulation and Brief Analysis
Experiments
Datasets
Baselines
Overall Comparison Results
Influence of Parameters
Conclusion and Future Work
References
A Sequential Dynamic Multi-class Model and Recursive Filtering by Variational Bayesian Methods
Introduction
Sequential Dynamic Multi-class Model
Recursive Filtering by Variational Bayes
Variational Bayes Approximation
Summary
Variational Predictive Distributions of New Inputs
Experiment Results
Synthetic Problem
Four-Class Motor Imagery EEG Data for the BCI-Competition 2005
Waveform Data Set from the UCI Machine Learning Repository
Conclusions
References
Random Ensemble Decision Trees for Learning Concept-Drifting Data Streams
Introduction
Related Work
The EDTC Algorithm
Experiments
Conclusion
References
Collaborative Data Cleaning for Sentiment Classification with Noisy Training Corpus
Introduction
Related Work
Sentiment Classification
Data Cleaning
Problem Statement
The Data Cleaning Algorithms
Overview
Self-cleaning
Co-cleaning
Tri-cleaning
Empirical Evaluation
Evaluation Setup
Evaluation Results
Conclusion and Future Work
References
Pattern Mining
Using Constraints to Generate and Explore Higher Order Discriminative Patterns
Introduction
Discriminative Patterns
Definitions
Previous Work in Discriminative Pattern Mining
Defining Higher Order Patterns with Constraints
A More General Approach
Experimental Results
Formal Analysis
Conclusion and Future Work
References
Mining Maximal Co-located Event Sets
Introduction
Problem Statement and Related Work
Basic Concepts of Spatial Co-location Mining
Problem Statement
Related Work
Algorithm
Preprocess
Candidate Generation
Candidate Pruning
Candidate Instance Filtering
Algorithm and Analysis
Experimental Results
Conclusion
References
Pattern Mining for a Two-Stage Information Filtering System
Introduction
Related Work
Rough Set-Based Topic Filtering
Discovery of R-Patterns
Rough Threshold Model
Pattern Taxonomy Mining
Experiments
Discussion
Conclusions
References
Efficiently Retrieving Longest Common Route Patterns of Moving Objects By Summarizing Turning Regions
Introduction
Problem Definition
Mining Algorithm for Longest Common Route Patterns Based on Turning Regions
Discovering Turning Regions
Retrieving Longest Common Route Patterns
Performance Evaluations
Optimal eps and DL angle
Efficiency and Accuracy
Conclusions
References
Automatic Assignment of Item Weights for Pattern Mining on Data Streams
Introduction
Background and Related Work
Valency Model
Our Weight Adaptation Methodology
Data Structure: Inverted Index Matrix
Distance Function
Evaluation
Precision
Execution Time
Evaluating Drift
Real World Dataset: Accident
Conclusion
References
Prediction
Predicting Private Company Exits Using Qualitative Data
Introduction
Data Extraction and Representation
Social Network Ranking
Mapping Companies to N-tuples
Missing Entries
Model Development and Results
Resampling and Cross-Validation
Results
Discussion and Conclusion
References
A Rule-Based Method for Customer Churn Prediction in Telecommunication Services
Introduction
Related Work
Algorithm of CRL
Basic Concepts
Rule Learning
Pruning Rules
Classification
Experiments and Discussion
Evaluation
Discussion
Conclusion and Future Works
References
Text Mining
Adaptive and Effective Keyword Search for XML
Introduction and Motivation
Result Model
Algorithms
Matrix Algorithm
Content-Information-First (CIF) Algorithm
Structure-Information-First (SIF) Algorithm
Experiments
Conclusion
References
Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing in Bayesian Topic Models
Introduction
Previous Works
Method
Model Construction
Posterior Inference
Experiments
Datasets
Settings
Evaluation Measure
Preliminary Experiments
Main Experiment
Conclusion
References
Constrained LDA for Grouping Product Features in Opinion Mining
Introduction
Related Work
The Proposed Algorithm
Introduction to LDA
Constrained-LDA
Constraint Extraction
Experimental Evaluation
Data Sets
Gold Standard
Evaluation Measure
Compared with LDA
Comparing with mLSA
Influence of Parameters
Conclusions
References
Semantic Dependent Word Pairs Generative Model for Fine-Grained Product Feature Mining
Introduction
Data Representation and Problem Definition
Problem Definition
Semantic Dependent Word Pair Generative Model
Inference and Parameter Estimation
Latent Variable Inference
Parameter Estimation
Hyper-parameter Estimation
Evaluation
Perplexity
Average Cluster Entropy
Normalized Mutual Information Index
Experiments
Conclusion
References
Grammatical Dependency-Based Relations for Term Weighting in Text Classification
Introduction
The Proposed Term Weighting Framework
Relation Extraction
Graph Construction: Constructing, Weighting and Ranking Graph
Constructing Graph
Weighting Graph
Ranking Graph
Applying Graph-Based Document Representation to Text Classification
Proposed Term Class Dependence (TCD)
Proposed Hybrid Term Weighting Methods Based on TCD
Experiments
Classifier and Data Sets
Performance and Discussion
Conclusion and Future Work
References
XML Documents Clustering Using a Tensor Space Model
Introduction
Related Work
The Proposed XCT Method
Problem Definition and Preliminaries
Generation of Structure Features for TSM
Generation of Content Features for TSM
The TSM Representation, Decomposition and Clustering
Experiments and Discussion
Datasets
Experimental Design
Evaluation Measures
Empirical Analysis
Conclusion
References
An Efficient Pre-processing Method to Identify Logical Components from PDF Documents
Introduction
Related Works
The Sparse-Line Property
Machine Learning Methods
Component Boundary Detection
Experiments and Results
Data Set
Performance of Sparse Line Detection
Performance of Noise Line Removal
Table/Equation Boundary Detection
Conclusions
References
Combining Proper Name-Coreference with Conditional Random Fields for Semi-supervised Named Entity Recognition in Vietnamese Text
Introduction
Related Works
Conditional Random Field
Named Entity Recognition in Vietnamese Text
Characteristics of Vietnamese Proper Names
Semi-supervised Learning Algorithm
Experiments and Discussion
Conclusions
References
Topic Analysis of Web User Behavior Using LDA Model on Proxy Logs
Introduction
LDA Formulation
Cross-Hierarchical Directory Matching
LDA-Based Topic Modeling
Symbolizing URLs from Proxy Log
Description of Proxy Log
Basic Idea of Labeling Words to User Session
Cross-Hierarchical Directory Matching
Experiments and Results
Data Sets and Evaluation Settings
Evaluation Metrics
Optimality Analysis of LDA Model
Visualizing 24 Topics and Student Characterization
Conclusion
References
SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content
Introduction
Contribution
Related Work
Relation between Noise-Content Ratio and Similarity
Concepts and Notation
Theoretical Analysis
AF_SpotSigs and SizeSpotSigs Algorithm
Experiment
Data Set
Choice of Stopwords
AF_SpotSigs vs. SpotSigs
SizeSpotSigs over SpotSigs and AF_SpotSigs
Conclusions and Future Works
References
Knowledge Transfer across Multilingual Corpora via Latent Topics
Introduction
Problem Definition
Our Approach
Latent Dirichlet Allocation
Bilingual Latent Dirichlet Allocation
Cross-Lingual Document Classification
Experimental Results
Experimental Setup
Perplexity
Classification Accuracy
Topic Smoothing
Related Work
Conclusion
References
Author Index

Schweitzer Fachinformationen
