
Data Discovery in Data Lakes
Description
As data lakes have become a prominent foundation for enterprise and scientific data management, organizations increasingly face the challenge of locating relevant datasets and building ad-hoc integration pipelines across heterogeneous, poorly documented, and rapidly evolving data collections. In this setting, data discovery becomes a critical capability for turning raw, distributed data assets into usable knowledge.
This book examines data discovery and its evolution across industry and academia. It covers the principles, systems, and techniques that enable users to find, understand, and use relevant data across increasingly complex data ecosystems. The book discusses modern approaches to efficient and effective data discovery, including novel system architectures, search and matching methods, metadata use, dataset profiling, and human-in-the-loop techniques.
Beyond core technical concepts, the book offers insight into how data discovery systems are evaluated and benchmarked. It highlights practical challenges faced in real-world deployments, compares emerging academic and industrial approaches, and identifies open research questions that continue to shape the field. The book is intended for researchers, practitioners, and students interested in data management, data integration, data lakes, and the future of intelligent data access.
Persons
Ziawasch Abedjan is Chair of the Data Integration and Data Preparation group at TU Berlin and research group lead at the Berlin Institute for the Foundations of Learning and Data (BIFOLD). His research focuses on developing generalizable and scalable techniques for various data integration problems, such as data cleaning, data science pipeline generation and understanding, and data discovery. Previously, he chaired the Database and Information Systems group at Leibniz University Hannover. He has held positions as Assistant Professor at TU Berlin, Postdoctoral Associate at MIT, Research Associate at QCRI, Senior Researcher at the German Research Center for Artificial Intelligence (DFKI), and Visiting Academic at Amazon Search. He received his PhD from the Hasso Plattner Institute in Potsdam and was awarded the University of Potsdam's Best Dissertation Prize in 2014. He has co-authored more than 80 peer-reviewed papers in leading database and software engineering venues and has received several academic awards.
His research is supported by the German Research Foundation (DFG) and the German Federal Ministry of Science and Education.
Mahdi Esmailoghli is a postdoctoral research associate at the David R. Cheriton School of Computer Science at the University of Waterloo. His research is centered on the challenges of data discovery within large-scale data lakes. He investigates how data systems can assist researchers and users in navigating vast table repositories to identify relevant datasets and suggest optimized pipelines, including index structures and algorithms.
His work has been published in leading data management venues, including SIGMOD, VLDB, ICDE, EDBT, and CIDR. Mahdi earned his PhD in Computer Science with highest honors from the Technical University of Berlin.
Sainyam Galhotra is an assistant professor in the Department of Computer Science at Cornell University. His research focuses on trustworthy artificial intelligence, data-centric AI, causal reasoning, and reliable agentic systems. He studies how AI systems can reason over complex multimodal data while remaining transparent, robust, and adaptive in dynamic environments. His work has been published in leading venues across machine learning, data management, and AI systems, including ICLR, NeurIPS, SIGMOD, VLDB, and ICSE.
He is the recipient of the Best Paper Award at FSE 2017, the Most Reproducible Paper Award at SIGMOD 2017 and 2018, and the Best Artifact Paper Honorable Mention Award at SIGMOD 2023. He has also been recognized as a Data Science Rising Star, a DAAD AInet Fellow, and the first recipient of the Krithi Ramamritham Award at the University of Massachusetts Amherst for contributions to database research.
Content
Chapter 1. Data Discovery: Motivation and Challenges
Chapter 2. Discovery Settings
Chapter 3. Discovery Operations
Chapter 4. Indexing Strategies for Data Discovery
Chapter 5. Data Discovery Systems
Chapter 6. Benchmarking Data Discovery Systems
Chapter 7. Conclusions