
Big Data Analytics and Knowledge Discovery
Description
The 24 revised full papers and 11 short papers presented were carefully reviewed and selected from 97 submissions. The papers are organized in the following topical sections: new generation data warehouses design; cloud and NoSQL databases; advanced programming paradigms; non-functional requirements satisfaction; machine learning; social media and Twitter analysis; sentiment analysis and user influence; knowledge discovery; and data flow management and optimization.

Content
- Intro
- Preface
- Organization
- Contents
- New Generation Data Warehouses Design
- Evaluation of Data Warehouse Design Methodologies in the Context of Big Data
- Abstract
- 1 Introduction
- 2 Methodology Classification
- 3 Metrics for Design Evaluation of Methodologies
- 3.1 Metrics for Methodology Evaluation
- 3.2 Metrics for Schema Quality Evaluation
- 4 Experimental Results
- 4.1 Methodology Evaluation
- 4.2 Schema Evaluation
- 5 Conclusion
- References
- Optimal Task Ordering in Chain Data Flows: Exploring the Practicality of Non-scalable Solutions
- 1 Introduction
- 2 Preliminaries
- 2.1 Problem Complexity
- 2.2 Chains in TPC-DI
- 3 Accurate Algorithms for Linear Execution Plans
- 3.1 Backtracking
- 3.2 Dynamic Programming
- 3.3 Topological Sorting
- 4 Evaluation of the Time Overhead
- 5 Related Work
- 6 Conclusions
- References
- Exploiting Mathematical Structures of Statistical Measures for Comparison of RDF Data Cubes
- 1 Introduction
- 2 Model and Data Representation
- 3 Structural Comparison of RDF Data Cubes
- 3.1 Computability and Comparability
- 3.2 Comparison Functionalities
- 3.3 Experimentation
- 4 Conclusion
- References
- S2D: Shared Distributed Datasets, Storing Shared Data for Multiple and Massive Queries Optimization in a Distributed Data Warehouse
- 1 Introduction
- 2 Related Work
- 3 Overview of Shared Distributed Datasets
- 3.1 Phase 1: The Logical Representation
- 3.2 Phase 2: The Physical Representation
- 4 Experimental Evaluation
- 4.1 Experimental Setup
- 4.2 Experimental Results and Discussion
- 5 Conclusion and Future Work
- References
- Cloud and NoSQL Databases
- Enforcing Privacy in Cloud Databases
- 1 Introduction
- 2 Non-cryptographic Methods
- 2.1 Differential Privacy
- 2.2 Data Anonymization
- 2.3 Data Fragmentation
- 3 Secret Sharing-Based Methods
- 3.1 Verifiable Secret Sharing
- 3.2 Order-Preserving Secret Sharing
- 3.3 Discussion
- 4 Index-Based Methods
- 4.1 Bucketization-Based Indexing
- 4.2 Order-Preserving Indexing
- 4.3 Searchable Encryption
- 4.4 Discussion
- 5 Secure Databases
- 5.1 CryptDB
- 5.2 MONOMI
- 5.3 Multi-valued Order Preserving Encryption (MV-OPE)
- 5.4 Secure Trusted Hardware
- 5.5 Discussion
- 6 Conclusion
- 6.1 Security
- 6.2 Query Post-processing
- 6.3 Storage Overhead
- 6.4 Computational Overhead
- 6.5 Wrap-up
- References
- TARDIS: Optimal Execution of Scientific Workflows in Apache Spark
- 1 Introduction
- 2 Problem Definition
- 3 Background
- 3.1 Spark
- 4 TARDIS Engine
- 4.1 Architecture
- 4.2 TARDIS Language
- 4.3 Data Placement
- 4.4 Scheduling
- 4.5 Collecting Output Files
- 5 Experiments
- 6 Conclusion
- References
- MDA-Based Approach for NoSQL Databases Modelling
- Abstract
- 1 Introduction
- 2 Research Problem and Related Work
- 3 UMLtoNoSQL Approach
- 3.1 UMLtoGenericModel Transformation
- 3.2 GenericModeltoPhysicalModel Transformation
- 4 Experiments
- 4.1 Implementation
- 4.2 Evaluation
- 5 Conclusion and Future Work
- References
- Advanced Programming Paradigms
- MiSeRe-Hadoop: A Large-Scale Robust Sequential Classification Rules Mining Framework
- 1 Introduction
- 2 Preliminaries
- 3 MiSeRe Algorithm
- 4 MiSeRe Hadoop Algorithm
- 4.1 Step I:
- 4.2 Step II:
- 5 Experiments
- 6 Conclusion and Future Work
- References
- An Efficient Map-Reduce Framework to Mine Periodic Frequent Patterns
- 1 Introduction
- 2 Background
- 2.1 Mining Periodic-Frequent Patterns on a Single Machine
- 2.2 Mining PFPs with Period Summary
- 2.3 Map-Reduce Framework
- 2.4 Parallel FP-growth
- 3 Proposed Approaches
- 3.1 Parallel Periodic Frequent Pattern Growth (PPF-growth)
- 3.2 PPF-growth Using Partition Summary
- 4 Performance Evaluation
- 5 Conclusion
- References
- MapReduce-Based Complex Big Data Analytics over Uncertain and Imprecise Social Networks
- 1 Introduction and Related Work
- 2 Background: Data Science
- 3 Mining Complex Big Data in Uncertain and Imprecise Social Networks
- 3.1 Interdependencies Between Followers and Followees in Complex Big Social Networks
- 3.2 Discovery of Popular Followees
- 3.3 The First Set of MapReduce Functions in BigUISN
- 3.4 The Second Set of MapReduce Functions in BigUISN
- 3.5 Beyond the Second Set of MapReduce Functions in BigUISN
- 4 Evaluation, Observations, and Discussion
- 5 Conclusions and Future Work
- References
- Non-functional Requirements Satisfaction
- A Case for Abstract Cost Models for Distributed Execution of Analytics Operators
- 1 Introduction
- 2 Piecewise Linear Model Structure and Training
- 3 Makespan Model for Sorting
- 3.1 Round-Time Estimation for Map and Reduce Phase
- 3.2 Exploiting Model Structure for Optimization
- 4 Dense Matrix Product
- 4.1 Makespan Model for Block-Wise Matrix Multiplication
- 4.2 Optimal Partitioning
- 5 Experiments
- 5.1 Basic Setup
- 5.2 Sorting
- 5.3 Matrix Multiplication
- 6 Related Work
- 7 Conclusions
- References
- Pre-processing and Indexing Techniques for Constellation Queries in Big Data
- 1 Introduction
- 2 Related Works
- 3 Problem Formulation
- 4 CQ Processing
- 4.1 Query Pre-processing
- 4.2 Query Transformation
- 4.3 Dataset Pre-processing
- 5 Experiments
- 5.1 Query Pre-processing
- 5.2 PH-tree Versus Quad-Tree
- 6 Conclusion
- References
- A Lightweight Elastic Queue Middleware for Distributed Streaming Pipeline
- 1 Introduction
- 2 Elastic Queue Middleware
- 2.1 The Role of EQM in Elastic Streaming Processing Engines
- 2.2 Implementing EQM Based on HBase
- 3 Experiments
- 4 Related Work
- 5 Conclusion
- References
- Modeling Data Flow Execution in a Parallel Environment
- 1 Introduction
- 1.1 Parallelizing Data Flows
- 1.2 Assumptions Regarding a Single Multi-core Machine Execution Environment
- 1.3 Motivation for Devising a New Cost Model
- 2 Other Related Work
- 3 Preliminaries
- 4 Our Cost Model
- 4.1 A Generalized Cost Model for Response Time
- 4.2 Models Without Considering the Communication Cost
- 4.3 Considering Communication Costs
- 4.4 Considering Partitioned Parallelism
- 5 Conclusions and Future Work
- References
- Machine Learning
- Accelerating K-Means by Grouping Points Automatically
- Abstract
- 1 Introduction
- 2 Related Work
- 3 Proposed Method
- 3.1 The Framework of Our Algorithm
- 3.2 Filtering for Clusters of Points
- 3.3 Fission Step: Grouping Points Automatically
- 3.4 Filtering for Groups of Points
- 3.5 Fusion Step: Limiting the Increasing Number of Groups
- 3.6 Algorithm
- 4 Experiment and Analysis
- 4.1 Experiment Design
- 4.2 Cost Comparison and Relative Speedup
- 4.3 Separability
- 4.4 Avoided Distance Calculations
- 5 Conclusion and Future Work
- References
- A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage
- 1 Introduction
- 2 Related Work
- 3 Assessing the Accuracy of Record Linkage
- 4 Machine Learning Algorithms
- 4.1 Decision Trees
- 4.2 Gradient Boosted Trees
- 4.3 Random Forests
- 4.4 Naïve Bayes
- 4.5 Linear Support Vector Machine
- 4.6 Logistic Regression
- 4.7 Comparative Analysis
- 5 Proposed Trainable Model
- 5.1 Pre-processing
- 5.2 Transformation
- 5.3 Model Selection
- 5.4 Model Execution
- 6 Experimental Results
- 7 Conclusions and Future Work
- References
- An Efficient Approach for Instance Selection
- 1 Introduction
- 2 Related Works
- 3 Notations
- 4 The XLDIS Algorithm
- 5 Experiments
- 6 Conclusion
- References
- Search Result Personalization in Twitter Using Neural Word Embeddings
- 1 Introduction
- 2 Related Work
- 2.1 Twitter Search
- 2.2 Personalized Twitter Search
- 3 Our Approach
- 3.1 User Modeling
- 3.2 Results Re-ranking
- 4 Evaluation
- 4.1 Twitter Lists Based Evaluation
- 4.2 Hashtags Based Evaluation
- 5 Conclusions
- References
- Diverse Selection of Feature Subsets for Ensemble Regression
- 1 Introduction
- 2 Related Work
- 3 Diverse Subset Selection Strategy (DS3)
- 3.1 Problem Overview
- 3.2 Solution Overview
- 3.3 Relevance Based Generation of Initial Candidates
- 3.4 Multiple Feature Sets Based on Difference and Quality
- 3.5 Unifying Multiple Subsets by Ensemble Regression
- 3.6 Time Complexity
- 4 Experiments
- 4.1 Synthetic Data Sets
- 4.2 Real-World Data Sets
- 4.3 Parameter Analysis
- 4.4 Iterations
- 5 Conclusions
- References
- K-Means Clustering Using Homomorphic Encryption and an Updatable Distance Matrix: Secure Third Party Data Clustering with Limited Data Owner Interaction
- 1 Introduction
- 2 Related Work
- 3 Preliminaries
- 3.1 K-Means Clustering
- 3.2 Liu's Homomorphic Encryption Scheme
- 4 The Updatable Distance Matrix Concept
- 5 Secure K-Means Clustering Using the UDM Concept
- 5.1 Data Owner Process
- 5.2 Third Party Process
- 6 Evaluation
- 7 Conclusion
- References
- Reweighting Forest for Extreme Multi-label Classification
- Abstract
- 1 Introduction
- 2 Related Work
- 3 Proposed Method
- 3.1 Problem Definition and Proposed Framework
- 3.2 The Reweighting Phase
- 3.3 The Pretesting Phase
- 4 Experiments
- 4.1 Experimental Setup
- 4.2 Experimental Results
- 5 Conclusion
- References
- Social Media and Twitter Analysis
- A Relativistic Opinion Mining Approach to Detect Factual or Opinionated News Sources
- 1 Introduction
- 2 Related Work
- 3 Experimental Setup
- 3.1 Dataset
- 3.2 Knowledge-Base and Preprocessing
- 3.3 Sentiment Analysis
- 4 Experimental Results
- 5 Conclusion
- 6 Future Work
- References
- A Reliability-Based Approach for Influence Maximization Using the Evidence Theory
- 1 Introduction
- 2 Related Works
- 2.1 Influence Maximization Models
- 2.2 Influence and Theory of Belief Functions
- 3 Theory of Belief Functions
- 4 Reliability-Based Influence Maximization
- 4.1 Influence Characterization
- 4.2 Estimating Reliability
- 4.3 Influence Estimation
- 5 Results and Discussion
- 6 Conclusion
- References
- Sentiment Analysis on Twitter to Improve Time Series Contextual Anomaly Detection for Detecting Stock Market Manipulation
- 1 Introduction
- 2 Methods
- 2.1 Sentiment Analysis on Twitter
- 2.2 Data
- 2.3 Data Preprocessing
- 2.4 Modelling
- 2.5 Feature Selection
- 2.6 Classifiers
- 2.7 Classifier Evaluation
- 2.8 Calculating Polarity for Each Stock
- 3 Results and Discussion
- References
- Automatic Segmentation of Big Data of Patent Texts
- 1 Introduction
- 2 Related Work
- 3 Segmentation Guidelines
- 4 Methods and Evaluations
- 4.1 Workflow
- 4.2 Headings Identification
- 4.3 Meaning of Headings (Semantic of Headings)
- 4.4 Heuristic Methods
- 4.5 Big Data Approach
- 4.6 Implementation
- 4.7 Evaluation
- 5 Conclusion
- References
- Sentiment Analysis and User Influence
- Tag Me a Label with Multi-arm: Active Learning for Telugu Sentiment Analysis
- 1 Introduction
- 2 Related Work
- 3 Dataset Generation
- 3.1 Word Embeddings Generation
- 3.2 Feature Engineering
- 4 The Proposed Approach
- 4.1 Active Learning
- 4.2 Input to the System
- 4.3 Query Selection Strategies
- 4.4 Classification Model for Telugu Sentiment Analysis
- 5 Experiments and Results
- 6 Conclusion
- 6.1 Future Work
- References
- Belief Temporal Analysis of Expert Users: Case Study Stack Overflow
- 1 Introduction
- 2 Related Work
- 3 Theory of Belief Functions: An Overview
- 3.1 Particular Belief Functions
- 3.2 Discounting
- 3.3 Decision Making
- 4 Belief Model of Users in Stack Overflow
- 4.1 Hypothesis
- 4.2 Definition of Mass Functions
- 4.3 Data Aggregation and Decision Making
- 5 Experimental Evaluation and Analysis
- 5.1 Time Analysis of the Data Set
- 5.2 Analysis of Users' Behavior over Time
- 6 Conclusion
- References
- Leveraging Hierarchy and Community Structure for Determining Influencers in Networks
- 1 Introduction
- 1.1 Contributions and Organization
- 2 Related Work
- 3 Preliminaries
- 4 Influence Scoring Using Position, Reachability and Interaction
- 4.1 Trussness Based Hierarchical Decomposition
- 4.2 Positional Index
- 4.3 Reachability Index
- 4.4 Interaction Index
- 4.5 Influence Score
- 5 Experimental Analysis
- 5.1 Investigation Using SIR Model
- 5.2 Monotonicity
- 6 Conclusion
- References
- Using Social Media for Word-of-Mouth Marketing
- 1 Introduction
- 2 Related Work
- 3 Problem Definition
- 4 Analysis of Online Social Groups
- 5 Social Interaction Graph
- 5.1 Measuring Topical Relevance
- 6 Finding Influential Users in OSG
- 7 Reinforced Marketing
- 8 Evaluations
- 8.1 Experimental Setup
- 8.2 Evaluation Metrics
- 8.3 Effectiveness of Algorithms
- 8.4 Precision Analysis
- 8.5 Marketing Across Topics
- 8.6 Empirical Evaluation
- 8.7 Temporal Dynamics
- 9 Conclusion
- References
- Knowledge Discovery
- Knowledge Discovery of Complex Data Using Gaussian Mixture Models
- 1 Introduction
- 2 Related Work
- 2.1 Data Representations
- 2.2 Similarity Measures
- 2.3 Indexes
- 3 Methods
- 3.1 Gaussian Mixture Models
- 3.2 Infinite Euclidean Distance for Distributions
- 4 Experimental Evaluations
- 4.1 Data Sets
- 4.2 Query Performance
- 4.3 Classification on NBA Data
- 4.4 Clustering on Weather Data
- 5 Conclusions and Future Work
- References
- Optimized Mining of Potential Positive and Negative Association Rules
- 1 Introduction and Motivations
- 2 Preliminary Concepts
- 3 OM2PNR Algorithm
- 3.1 Optimization of the Research the Frequent Patterns
- 3.2 Optimization of the Course of Research the Potential Rules
- 4 Experimental Resultants
- 5 Conclusion
- References
- Extracting Non-redundant Correlated Purchase Behaviors by Utility Measure
- 1 Introduction
- 2 Preliminaries and Problem Statement
- 3 Proposed CoHUIM Algorithm for Mining CoHUIs
- 3.1 Properties of the CoHUI
- 3.2 Reducing Database Size Using Projection Mechanism
- 3.3 Proposed Sorted Downward Closure Property
- 3.4 Procedure of the Projection-Based CoHUIM Algorithm
- 4 Experimental Results
- 4.1 Dataset and Experimental Setup
- 4.2 Pattern Analysis
- 4.3 Runtime Analysis
- 5 Conclusions
- References
- Data Flow Management and Optimization
- Detecting Feature Interactions in Agricultural Trade Data Using a Deep Neural Network
- 1 Introduction and Motivation
- 2 Related Research
- 3 Deep Belief Network Components
- 3.1 Parameter Initialisation and Optimisation
- 3.2 Architectural Configuration
- 4 An Approach to Interpreting Deep Representations
- 5 Experiments
- 5.1 Setup
- 5.2 Results and Analysis
- 6 Conclusions and Future Work
- References
- Air Quality Monitoring System and Benchmarking
- 1 Introduction
- 2 System Design and Implementation
- 3 Benchmarking
- 3.1 Experimental Settings
- 3.2 Benchmarking Methods
- 3.3 Benchmarking Results
- 4 Related Work
- 5 Conclusions and Future Work
- References
- Electric Vehicle Charging Station Deployment for Minimizing Construction Cost
- 1 Introduction
- 2 Information Extraction
- 2.1 Idle Trip
- 2.2 Charging Demand Model
- 2.3 Impact of Traffic Condition
- 2.4 Estate Price Model
- 3 Construction Cost Optimization
- 4 Evaluation
- 4.1 Data Set
- 4.2 Baselines
- 4.3 Evaluation Metrics
- 4.4 Experiment Settings and Evaluation Results
- 5 Related Work
- 6 Conclusion
- References
- Author Index
System requirements
File format: PDF
Copy protection: Watermark-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Use the free software Adobe Reader, Adobe Digital Editions, or any other PDF viewer of your choice (see eBook Help).
- Tablet/Smartphone (Android; iOS): Install the free app Adobe Digital Editions or another reading app for eBooks, e.g., PocketBook (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (Kindle: limited support only).
The PDF file format always displays a book page identically on any hardware. This makes PDF suitable for complex layouts such as those used in textbooks and reference books (images, tables, columns, footnotes). On the small screens of e-readers or smartphones, however, PDFs are awkward to read because they require a lot of scrolling.
This eBook uses Watermark-DRM, a "soft" form of copy protection. This means that there are no technical restrictions to prevent illegal distribution. However, a personalised watermark is embedded in the eBook; it can be used to identify the purchaser in the event of misuse and to provide evidence for legal purposes.
For more information, see our eBook Help page.