
Information Retrieval: Advanced Topics and Techniques
Description
In the last decade, deep learning and word embeddings have had a significant impact on information retrieval (IR) by adding techniques based on neural networks and language models. At the same time, certain search modalities such as neural IR and conversational search have become more popular. This book, written by international academic and industry experts, brings the field up to date with detailed discussions of these new approaches and techniques. The book is organized in three sections: Foundations, Adaptations and Concerns, and Verticals.
Under Foundations, we address topics that form the basic structure of any modern IR system, including recommender systems, and the new techniques developed to augment indexing, retrieval, and ranking. Neural IR, recommender systems, evaluation, query-driven functionality, and knowledge graphs are covered in this section.
IR systems need to adapt to specific user characteristics and preferences, and techniques that were considered too niche a few years ago are now a standard consideration in system design. The Adaptations and Concerns section covers the following topics: conversational search, cross-language retrieval, temporal extraction and retrieval, bias in retrieval systems, and privacy in search.
While web search engines are the most popular information access point, there are cases where specific verticals provide a better experience in terms of content and relevance. The Verticals section describes eCommerce, professional search, personal collections, music retrieval, and biomedicine as examples.

Content
- Intro
- Information Retrieval
- Contents
- Preface
- 1 Introduction
- 1.1 Motivation
- 1.2 A Bit of History
- 1.3 Responsible Retrieval Systems
- 1.4 Content Organization
- 1.4.1 Foundations
- 1.4.2 Adaptations and Concerns
- 1.4.3 Verticals
- 1.5 Intended Audience
- I FOUNDATIONS
- 2 Neural Information Retrieval
- 2.1 Introduction
- 2.2 Text Representations for Ranking
- 2.2.2 LTR Features
- 2.2.3 Word Embeddings
- 2.3 Interaction-focused Systems
- 2.3.1 Convolutional Neural Networks
- 2.3.2 Pre-trained Language Models
- 2.3.4 Ranking with Encoder-Decoder Models
- 2.3.5 Fine-tuning Interaction-focused Systems
- 2.3.6 Dealing with Long Texts
- 2.4 Representation-focused Systems
- 2.4.2 Multiple Representations
- 2.4.3 Fine-tuning Representation-focused Systems
- 2.5 Retrieval Architectures and Vector Search
- 2.5.3 Locality Sensitive Hashing Approaches
- 2.5.4 Vector Quantization Approaches
- 2.5.5 Graph Approaches
- 2.5.6 Optimizations
- 2.6 Learned Sparse Retrieval
- 2.6.1 Document Expansion Learning
- 2.6.2 Impact Score Learning
- 2.6.3 Sparse Representation Learning
- 2.7 Retrieval-augmented Generation
- 2.8 Conclusions
- 3 Recommender Systems
- 3.1 Introduction
- 3.2 The Recommendation Task
- 3.3 Recommendation Algorithms
- 3.3.1 Recommendation as a Machine Learning Problem
- 3.3.2 Characterizing Approaches Based on Their Input Data
- 3.3.3 Collaborative Filtering
- 3.3.3.1 Nearest Neighbors
- 3.3.3.2 Matrix Factorization
- 3.3.4 Learning to Rank
- 3.3.5 Neural Recommendation
- 3.3.6 Content-based and Hybrid Recommender Systems
- 3.3.6.1 Pure Content-based Systems
- 3.3.6.2 Hybrid Recommender Systems
- 3.3.6.3 Collaborative Filtering with Side Information
- 3.3.7 Discussion
- 3.4 Evaluation of Recommender Systems
- 3.4.1 Online Evaluation
- 3.4.2 Offline Evaluation
- 3.4.3 Offline Data
- 3.4.3.1 Data Splitting
- 3.4.3.2 Candidate Item Sampling
- 3.4.4 Recommendation Task and Metrics
- 3.4.4.1 Rating Prediction
- 3.4.4.2 Ranking Quality: Recommendation as an IR Task
- 3.4.4.3 Collection
- 3.4.4.4 Query and Information Need
- 3.4.4.5 Relevance
- 3.4.4.6 Metrics
- 3.4.5 Beyond Accuracy
- 3.4.5.1 Novelty
- 3.4.5.2 Long-tail Novelty
- 3.4.5.3 Unexpectedness
- 3.4.5.4 Serendipity
- 3.4.5.5 Further Notions
- 3.4.5.6 Diversity
- 3.4.5.7 Intra-list Dissimilarity
- 3.4.5.8 Aspect-based Diversity
- 3.4.5.9 Coverage
- 3.4.5.10 Enhancing Novelty and Diversity
- 3.4.5.11 User Perceptions of Diversity and Novelty
- 3.5 Sequential and Session-based Recommendation
- 3.5.1 Problem Definition and Terminology
- 3.5.2 Algorithms for Sequential and Session-based Recommendation
- 3.5.2.2 Sequence-aware Matrix Factorization
- 3.5.2.3 Hybrid Approaches
- 3.5.2.4 Nearest-neighbors and Other Methods
- 3.5.3 Evaluation of Sequential and Session-based Recommender Systems
- 3.5.3.1 Offline Evaluation
- 3.5.3.2 Data Splitting
- 3.5.3.3 Making the Measurement
- 3.5.3.4 Cross-Validation
- 3.5.3.5 User-centric Evaluation
- 3.5.3.6 Real-world Evaluation
- 3.5.4 Discussion and Outlook
- 3.6 Popularity, Bias, and Recurrence in Recommendation
- 3.6.1 Countering Bias
- 3.6.2 Understanding Bias
- 3.6.3 The Feedback Loop
- 3.7 Impact and Value of Recommender Systems
- 3.7.1 Understanding the Impact of Recommendations with the Human in the Loop
- 3.7.2 Consumer and Business Value of Recommender Systems
- 3.7.2.1 Recommendation as a Multistakeholder Optimization Problem
- 3.7.2.2 Impact and Value-Oriented Recommender Systems Evaluation
- 3.8 Conclusions and Challenges
- 3.8.1 Summary
- 3.8.2 Further Readings and Future Directions
- 3.8.2.1 Conversational Recommender Systems
- 3.8.2.2 Fairness in Recommender Systems
- 3.8.2.3 Offline/Online Misalignment in Evaluation
- 4 Evaluation of IR Systems
- 4.1 Introduction
- 4.2 Offline Evaluation
- 4.2.2 Evaluation Campaigns
- 4.2.3 Document Corpora and Topics
- 4.2.4 Pooling
- 4.2.5 Crowdsourcing
- 4.2.6 Multi-armed Bandits
- 4.3 Evaluation Measures
- 4.3.3 Average Precision
- 4.3.4 Discounted Cumulated Gain
- 4.4 Statistical Significance Testing
- 4.4.1 Basic Intuition about Statistical Significance Testing
- 4.4.2 ANalysis Of VAriance
- 4.4.2.2 Assessment of the Model
- 4.4.2.3 Effect Size
- 4.4.2.4 Multiple Comparisons
- 4.4.2.5 Assumptions
- 4.5 Offline Evaluation with Online Data
- 4.5.1 Measures Calibrated with Online Data
- 4.6 Online Evaluation
- 4.6.1 Description of Online Evaluation
- 4.6.2 Online Controlled Experiments
- 4.6.4 Interleaving
- 4.6.5 Online Measures
- Absolute document level metrics
- Absolute ranking level metrics
- 4.7 Measurements
- 4.7.1 Overview
- 4.7.2 The Representational Theory of Measurement
- 4.7.3 Classification of the Scales of Measurement
- 4.7.3.1 Nominal Scale
- 4.7.3.2 Ordinal Scale
- 4.7.3.3 Interval Scale
- 4.7.3.4 Ratio Scale
- 4.7.4 Admissible Statistical Operations
- 4.7.5 Statistical Significance Testing
- 4.7.6 Why the Measurement Theory Matters to IR Evaluation
- 4.7.6.1 Averaging System Performance
- 4.7.6.2 Statistical Significance Testing
- 4.7.6.3 Score Standardization
- 4.7.6.4 Topic Difficulty
- 4.7.7 A Formal Theory of IR Evaluation Measures
- 4.7.7.1 Early Attempts
- 4.7.7.2 Current Studies
- 4.7.8 Implications on Statistical Significance Testing
- 4.7.8.1 Other Studies
- 4.8 Conclusions and Challenges
- 4.8.1 Evaluation of Complex Tasks
- 4.8.2 Reproducibility
- 4.8.3 Meaningfulness
- 4.8.4 Large Language Models and Generative AI
- 5 Query-driven Search Functionality
- 5.1 Introduction
- 5.2 Problem Definitions
- 5.2.1 Query Auto-completion
- 5.2.2 Query Suggestion
- 5.3 A Framework for Search Assist Systems
- 5.4 Key Factors in Search Assist Functions
- 5.4.1 Temporal Factors
- 5.4.2 Contextual Factors
- 5.4.3 Location Factors
- 5.4.4 Demographic Factors
- 5.4.5 Behavioral Factors
- 5.5 Algorithms
- 5.6 Datasets
- 5.7 Evaluation Metrics
- 5.7.1 Ranking Metrics
- 5.7.2 User Assist Metrics
- 5.7.3 Post Usage Metrics
- 5.8 Historical Notes
- 5.9 Conclusions and Challenges
- 6 Knowledge Graphs and Search
- 6.1 Introduction
- What is a knowledge graph
- What knowledge graphs are out there
- How to search a knowledge graph
- Engines and indexing
- Combination with text search and federated search
- What else is there to know about knowledge graphs
- The future of knowledge graphs
- 6.2 What Is a Knowledge Graph
- 6.2.1 Our Toy Knowledge Graph
- 6.2.2 RDF
- 6.2.3 Our Revised Toy Knowledge Graph
- 6.2.4 Reification
- 6.2.5 Other Kinds of Information
- 6.3 What Knowledge Graphs Are Out There
- 6.3.1 Wikidata
- 6.3.2 Freebase
- 6.3.3 DBpedia
- 6.3.4 YAGO
- 6.3.5 UniProt
- 6.3.6 PubChem
- 6.3.7 DBLP
- 6.3.8 OpenStreetMap
- 6.4 How to Search a Knowledge Graph: Structured Query Languages
- 6.4.1 SPARQL
- 6.4.2 Cypher (Neo4j)
- 6.5 Engines and Indexing
- 6.5.1 Object Identifiers
- 6.5.2 Triple Permutations
- 6.5.3 Query Planning
- Distinctness
- Multiplicity
- Other columns
- Case 1:
- Case 2:
- 6.5.4 Further Improvements
- 6.5.5 Virtuoso
- 6.5.6 Blazegraph
- 6.5.7 Neo4j
- 6.6 How to Search a Knowledge Graph: Assisting the User
- 6.6.2 Question Answering
- Step 1: Find entities from the knowledge graph mentioned in the question
- Step 2: Generating candidates
- Step 3: Computing feature vectors
- Coverage (Cov):
- Step 4: Ranking
- 6.7 Combination with Text Search and Federated Search
- 6.7.1 Keyword Search in Literals
- 6.7.2 Search in an External Text Corpus Linked to a Knowledge Graph
- 6.7.3 Federated Search
- 6.7.4 Use Cases of Federated Search
- 6.8 Conclusions and Challenges
- II ADAPTATIONS AND CONCERNS
- 7 Conversational Search
- 7.1 Introduction
- 7.1.1 Defining Conversational Search
- 7.1.2 Combined Representation of Conversational Search Definitions
- 7.1.3 Overview
- 7.2 Searching Through Conversations
- 7.2.1 Speech User Interfaces
- 7.2.2 Spoken Dialogue Systems
- 7.3 Information Seeking Models, Theories, and Properties
- 7.3.1 Interactive Information Retrieval
- 7.3.2 Question Answering
- 7.3.3 Modeling Information Seeking Through Dialogue
- 7.3.4 Theoretical Frameworks and Properties for Conversational Search
- 7.4 Fundamental Search Actions in Conversational Search
- 7.4.1 Query Formulation
- 7.4.2 Results Presentation and Answer Organization
- 7.4.3 Query Reformulation and Refinements
- 7.5 Fundamental Non-search Actions in Conversational Search
- 7.5.1 Discourse Management
- 7.5.2 Navigation
- 7.5.3 Grounding
- 7.5.4 Visibility of Information-seeking Partner Status
- 7.6 Implementing and Evaluating Conversational Systems
- 7.6.1 Evaluation of Conversational Systems
- 7.6.1.1 Offline Evaluation
- 7.6.1.2 Online Evaluation
- 7.6.2 Data for Conversational Systems
- 7.7 Conclusions and Challenges
- 8 Cross-language Retrieval
- 8.1 Introduction
- 8.1.1 Some Use Cases
- 8.1.2 The Three Waves of Information Retrieval
- 8.2 The Core Technology of CLIR
- 8.2.1 What to Translate?
- 8.2.1.1 The Query or the Documents?
- 8.2.1.2 Which Terms?
- 8.2.2 Which Translations Are Possible?
- 8.2.3 How to Use those Translations?
- 8.3 Updating Nie: CLIR since 2010
- 8.3.1 What to Translate: Parts of Words
- 8.3.2 Which Translations: Bilingual and Multilingual Embeddings
- 8.3.3 Which Translations: MT
- 8.3.4 Using Translations: Neural IR
- 8.3.5 Fusion
- 8.3.6 Cross-language Speech Retrieval
- 8.3.7 Building Real Systems
- 8.4 Evaluation
- 8.4.1 Shared-task Evaluation
- 8.4.2 Test Collections
- 8.5 Conclusions and Challenges
- Acknowledgments
- 9 Temporal Extraction and Retrieval
- 9.1 Introduction
- 9.2 Temporal Expressions
- 9.3 Knowledge Extraction
- 9.3.1 Granularity of Knowledge Evolution
- 9.3.1.1 Fine-grained Knowledge Extraction
- 9.3.1.2 Knowledge Evolution at Large
- 9.3.2 Knowledge Extraction about Entities
- 9.3.2.1 Targeted versus Open Fact Extraction
- 9.3.2.2 Temporality in Fact Extraction
- 9.3.3 Temporal Fact Harvesting from Web Content
- 9.3.3.1 Facts and Observations
- 9.3.3.2 Preprocessing and Candidate Gathering
- 9.3.3.3 Pattern Analysis
- 9.3.3.4 T-fact Extraction
- 9.3.3.5 Denoising of Fact Candidates
- 9.3.3.6 Analysis
- 9.4 Storage
- 9.4.1 Scalable Storage
- 9.4.2 Indexing
- 9.4.3 Querying
- 9.5 Ranking Models
- 9.6 Applications
- 9.6.1 Timelines
- 9.6.2 Social Media
- 9.6.3 Question-Answering
- 9.6.4 World Events
- 9.6.4.1 Conceptual Approach
- 9.6.4.2 Event Types
- 9.6.4.3 Entity-level (Semantic) Prediction Model
- 9.6.4.4 Country Extraction
- 9.6.4.5 Analysis
- 9.7 Conclusions and Challenges
- Acknowledgments
- 10 Bias in Retrieval Systems
- 10.1 Introduction
- 10.2 Taxonomy of Biases
- 10.2.1 Preexisting Bias
- 10.2.2 Stakeholder Bias
- 10.2.3 Data Bias
- 10.2.4 Algorithmic Bias
- 10.3 Preexisting Bias
- 10.4 Stakeholder Bias
- 10.4.1 Stakeholder Interests and Incentives
- 10.4.1.1 Search and Social Media
- 10.4.1.2 Recommender Systems in Commerce
- 10.4.2 User Bias
- 10.4.2.1 Clicking Patterns
- 10.4.2.2 Click-through Models
- 10.4.2.3 Estimating Relevance
- 10.4.3 Developer Bias
- 10.4.3.1 Cultural Bias
- 10.4.3.2 Bias in Research
- 10.5 Data Bias
- 10.5.1 Representation Bias
- 10.5.1.1 Linguistic Bias
- 10.5.1.2 Economic Bias
- 10.5.2 Bias in Producer Behavior
- 10.5.2.2 Content Bias
- 10.5.3 Bias in Consumer Behavior
- 10.5.3.1 Popularity Bias
- 10.5.3.2 Sparsity Bias
- 10.6 Algorithmic Bias
- 10.6.1 Exposure Bias
- 10.6.2 Retrievability Bias
- 10.6.3 Ranking Bias
- 10.6.4 Evaluation Bias
- 10.6.4.1 Sampling Bias in Editorial Evaluation
- 10.6.4.2 Observability Bias
- 10.6.4.3 Popularity versus Personalization
- 10.6.4.5 Pool Bias in System Evaluation
- 10.6.4.6 Precision and Anti-precision
- 10.6.4.7 Pool-risk
- 10.7 Mitigation and Fairness Techniques
- 10.7.1 Mitigating Sampling Bias
- 10.7.1.1 How Many Samples?
- 10.7.1.2 Modeling Search Engine Query Distributions
- 10.7.1.3 Getting the Right Distribution
- 10.7.3 Unbiased Evaluation with Inverse Propensity Scoring
- 10.7.3.1 Mitigating Popularity Bias
- 10.7.3.2 Mitigating Ranking Bias
- 10.7.4 Mitigating Sparsity Bias in Performance Metrics
- 10.7.5 Mitigating Pool Bias in Evaluation Metrics
- 10.7.5.1 Comparing Rankings
- 10.7.5.2 Ranking Perturbation to a Pooled Run Induced by a New Run
- 10.7.5.3 Pool Bias Indicator and Correction
- 10.7.6 Mitigating Feedback Loops
- 10.7.7 Fair Rankings
- 10.7.7.1 Score-based Ranking
- 10.7.7.2 Fair Learning-to-Rank
- 10.8 Conclusions and Challenges
- 10.A Appendix: Data Metrics
- 10.A.1 Item Popularity Metrics
- 10.A.2 Novelty Metrics
- 10.A.2.1 Popularity-based Novelty Metrics
- 10.A.2.2 Distance-based Novelty Metrics
- 10.B Standard System Evaluation Metrics
- 10.B.1 Average-over-all Evaluation Metrics
- 10.B.2 Top-n Performance Evaluation
- 11 Privacy in Information Retrieval
- 11.1 Introduction
- 11.2 Risks and Rewards of Privacy Leakage
- 11.3 Privacy for Searchers
- 11.3.1 Privacy Policies
- 11.4 Privacy for Search Engines
- 11.4.1 Approaches to Data Retention Policies
- 11.4.2 Anonymization and Differential Privacy
- 11.4.3 Query Logs
- 11.4.4 Personalization Leaks
- 11.5 Privacy for Document Owners
- 11.6 Conclusions and Challenges
- III VERTICALS
- 12 eCommerce Search and Discovery
- 12.1 Introduction
- 12.1.1 User Intent and Discovery
- 12.1.2 Relevance in eCommerce
- 12.2 Data and Architecture in eCommerce Search
- 12.2.1 Data for eCommerce
- 12.2.2 eCommerce Search Architecture
- 12.3 Query Understanding: The eCommerce Perspective
- 12.3.1 Mapping to Structured Data
- 12.3.2 Category Affinity
- 12.3.3 User Intent Classification
- 12.4 Matching
- 12.4.1 Facets
- 12.4.2 Matching with Text-based Queries
- 12.4.3 Product Availability
- 12.4.4 Catalog Size Effects on Matching
- 12.4.5 Taxonomies and Knowledge Graphs
- 12.5 Ranking
- 12.5.1 Ranking Algorithms
- 12.5.1.1 Deterministic Sorts
- 12.5.1.2 Relevance Ranking
- 12.5.2 Ranking Features for eCommerce
- 12.5.2.1 Behavioral Signals
- 12.5.2.2 Cold Start
- 12.5.3 Evaluation
- 12.5.4 Fairness and Two-sided Marketplaces
- 12.6 Embeddings for eCommerce Matching and Ranking
- 12.7 eCommerce Engineering Challenges
- 12.7.1 Scaling
- 12.7.2 Update Pipeline
- 12.7.3 Caching and Tiering
- 12.8 Conclusions and Challenges
- 12.8.1 Autocomplete
- 12.8.2 Multi- and Cross-lingual
- 12.8.3 Recommendations
- 12.8.4 Sponsored Products
- 12.8.5 Browse Pages
- 12.8.6 UX Factors
- 12.8.7 Whole Page Optimization
- Acknowledgments
- 13 Professional Search
- 13.1 Introduction
- 13.2 Professional Search Tasks and Domains
- 13.2.1 Academic
- 13.2.2 Legal
- 13.2.3 Medical
- 13.2.4 Humanities
- 13.3 Evaluation of Professional Search Systems
- 13.3.1 Domain-specific Test Collections
- 13.3.2 User Observation Studies
- 13.3.3 Simulation Studies
- 13.4 Design and Development of Professional Search Systems
- 13.4.1 Query Interfaces
- 13.4.2 Indexing and Ranking
- 13.4.3 Additional Functionalities
- 13.5 Neural (Re-)ranking for Domain-specific Retrieval
- 13.5.1 Domain-specific Encoders
- 13.5.2 Query-by-document Retrieval
- 13.6 Conclusions and Challenges
- 14 Searching Personal Collections
- 14.1 Introduction
- 14.2 Organizing Digital Assets
- 14.3 Labeling Digital Assets
- 14.4 Automatic Classification and Clustering
- 14.5 Email Spam Filtering, Phishing, and Abuse Detection
- 14.6 Email Threading
- 14.7 Known-item Retrieval
- 14.8 Refinding Using Search
- 14.9 Leveraging Episodic Memory
- 14.10 Test Collections
- 14.11 Crafted Rankers
- 14.12 Associative Browsing and Searching
- 14.13 Learning to Rank for Personal Collections
- 14.14 Desktop Search
- 14.15 File Recommendation
- 14.16 Infrastructure for Cloud-based Personal Collections
- 14.17 Beyond Files and Email: Human Digital Memory
- 14.18 Conclusions and Challenges
- 15 Audio-based Music Retrieval
- 15.1 Introduction
- 15.2 Origins and Evolution
- 15.3 Audio-based MIR Architectures
- 15.4 Pitch-content Description
- 15.4.1 Pitch and Fundamental Frequency
- 15.4.2 Pitch-content Description Tasks
- 15.4.5 Datasets and Evaluation
- 15.5 Rhythmic Description
- 15.5.1 Rhythm Fundamentals
- 15.5.2 Knowledge-driven Rhythm Description
- 15.5.3 Data-driven Rhythm Description
- 15.5.4 Datasets and Evaluation
- 15.6 Music Emotion Recognition
- 15.6.1 Emotion Taxonomies, Annotations, and Descriptors
- 15.6.2 Data-driven Music Emotion Recognition
- 15.6.3 Datasets and Evaluation
- 15.6.4 Personalization in Music Emotion Recognition
- 15.7 Version Identification
- 15.7.1 Input Representations for Version Identification
- 15.7.2 Knowledge-driven Version Identification
- 15.7.3 Data-driven Version Identification
- 15.7.4 Datasets and Evaluation
- 15.8 Conclusions and Challenges
- 15.8.1 Data and Reproducibility
- 15.8.2 Subjectivity and Agreement in Human Annotations
- 15.8.3 Evaluation Metrics and Neglected Dimensions
- 15.8.4 What For? Social, Ethical, and Cultural Perspectives
- Acknowledgments
- Appendix: Abbreviations
- 16 IR in Biomedicine
- 16.1 Introduction
- 16.1.1 Content
- 16.1.1.1 Bibliographic Content
- 16.1.1.2 Full-text Content
- 16.1.1.3 Annotated Content
- 16.1.1.4 Aggregated Content
- 16.1.2 Indexing
- 16.1.3 Retrieval
- 16.2 Research Directions
- 16.2.1 System-oriented Evaluation
- 16.2.2 User-oriented Evaluation
- 16.3 Impact of Generative AI
- 16.4 Conclusions
- Bibliography
- Authors' Biographies
- Editors
- Authors
- Index
System requirements
File format: ePUB
Copy protection: none (no DRM, Digital Rights Management)
System requirements:
- Computer (Windows, macOS, Linux): Use a reader that supports the ePUB format, such as Adobe Digital Editions or FBReader, both of which are free (see eBook Help).
- Tablet/Smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The ePUB format works well for novels and non-fiction books, i.e., 'flowing' text without complex layout. On an e-reader or smartphone, line and page breaks adjust automatically to fit the small display.
For more information, see our eBook Help page.