
LLM Engineer's Handbook
Description
- Learn essential skills for deploying and monitoring LLMs, ensuring optimal performance in production
- Utilize preference alignment, evaluation, and inference optimization to enhance performance and adaptability of your LLM applications
Book Description
Artificial intelligence has undergone rapid advancements, and Large Language Models (LLMs) are at the forefront of this revolution. This LLM book offers insights into designing, training, and deploying LLMs in real-world scenarios by leveraging MLOps best practices. The guide walks you through building an LLM-powered twin that's cost-effective, scalable, and modular. It moves beyond isolated Jupyter notebooks, focusing on how to build production-grade end-to-end LLM systems. Throughout this book, you will learn data engineering, supervised fine-tuning, and deployment. The hands-on approach to building the LLM Twin use case will help you implement MLOps components in your own projects. You will also explore cutting-edge advancements in the field, including inference optimization, preference alignment, and real-time data processing, making this a vital resource for those looking to apply LLMs in their projects. By the end of this book, you will be proficient in deploying LLMs that solve practical problems while maintaining low-latency and high-availability inference capabilities. Whether you are new to artificial intelligence or an experienced practitioner, this book delivers guidance and practical techniques that will deepen your understanding of LLMs and sharpen your ability to implement them effectively.
What you will learn
- Implement robust data pipelines and manage LLM training cycles
- Create your own LLM and refine it with the help of hands-on examples
- Get started with LLMOps by diving into core MLOps principles such as orchestrators and prompt monitoring
- Perform supervised fine-tuning and LLM evaluation
- Deploy end-to-end LLM solutions using AWS and other tools
- Design scalable and modular LLM systems
- Learn about RAG applications by building a feature and inference pipeline
Who this book is for
This book is for AI engineers, NLP professionals, and LLM engineers looking to deepen their understanding of LLMs. Basic knowledge of LLMs and the Gen AI landscape, Python, and AWS is recommended. Whether you are new to AI or looking to enhance your skills, this book provides comprehensive guidance on implementing LLMs in real-world scenarios.

Persons
Paul Iusztin is a senior ML and MLOps engineer at Metaphysic, a leading GenAI platform, where he serves as one of the core engineers taking their deep learning products to production. With over seven years of experience, he has also built GenAI, Computer Vision, and MLOps solutions for CoreAI, Everseen, and Continental. Paul's passion and mission are to build data-intensive AI/ML products that serve the world and to educate others about the process. As the Founder of Decoding ML, a channel for battle-tested content on learning how to design, code, and deploy production-grade ML, Paul has significantly enriched the engineering and MLOps community. His weekly content on ML engineering and his open-source courses focusing on end-to-end ML life cycles, such as Hands-on LLMs and LLM Twin, testify to his valuable contributions.
Labonne Maxime:
Maxime Labonne is a Senior Staff Machine Learning Scientist at Liquid AI, serving as the head of post-training. He holds a Ph.D. in Machine Learning from the Polytechnic Institute of Paris and is recognized as a Google Developer Expert in AI/ML. An active blogger, he has made significant contributions to the open-source community, including the LLM Course on GitHub, tools such as LLM AutoEval, and several state-of-the-art models like NeuralBeagle and Phixtral. He is the author of the best-selling book "Hands-On Graph Neural Networks Using Python," published by Packt.
Content
- Cover
- Copyright
- Forewords
- Contributors
- Table of Contents
- Preface
- Making the Most Out of This Book - Get to Know Your Free Benefits
- Chapter 1: Understanding the LLM Twin Concept and Architecture
- Understanding the LLM Twin concept
- What is an LLM Twin?
- Why building an LLM Twin matters
- Why not use ChatGPT (or another similar chatbot)?
- Planning the MVP of the LLM Twin product
- What is an MVP?
- Defining the LLM Twin MVP
- Building ML systems with feature/training/inference pipelines
- The problem with building ML systems
- The issue with previous solutions
- The solution - ML pipelines for ML systems
- The feature pipeline
- The training pipeline
- The inference pipeline
- Benefits of the FTI architecture
- Designing the system architecture of the LLM Twin
- Listing the technical details of the LLM Twin architecture
- How to design the LLM Twin architecture using the FTI pipeline design
- Data collection pipeline
- Feature pipeline
- Training pipeline
- Inference pipeline
- Final thoughts on the FTI design and the LLM Twin architecture
- Summary
- References
- Chapter 2: Tooling and Installation
- Python ecosystem and project installation
- Poetry: dependency and virtual environment management
- Poe the Poet: task execution tool
- MLOps and LLMOps tooling
- Hugging Face: model registry
- ZenML: orchestrator, artifacts, and metadata
- Orchestrator
- Artifacts and metadata
- How to run and configure a ZenML pipeline
- Comet ML: experiment tracker
- Opik: prompt monitoring
- Databases for storing unstructured and vector data
- MongoDB: NoSQL database
- Qdrant: vector database
- Preparing for AWS
- Setting up an AWS account, an access key, and the CLI
- SageMaker: training and inference compute
- Why AWS SageMaker?
- Summary
- References
- Chapter 3: Data Engineering
- Designing the LLM Twin's data collection pipeline
- Implementing the LLM Twin's data collection pipeline
- ZenML pipeline and steps
- The dispatcher: How do you instantiate the right crawler?
- The crawlers
- Base classes
- GitHubCrawler class
- CustomArticleCrawler class
- MediumCrawler class
- The NoSQL data warehouse documents
- The ORM and ODM software patterns
- Implementing the ODM class
- Data categories and user document classes
- Gathering raw data into the data warehouse
- Troubleshooting
- Selenium issues
- Import our backed-up data
- Summary
- References
- Chapter 4: RAG Feature Pipeline
- Understanding RAG
- Why use RAG?
- Hallucinations
- Old information
- The vanilla RAG framework
- Ingestion pipeline
- Retrieval pipeline
- Generation pipeline
- What are embeddings?
- Why embeddings are so powerful
- How are embeddings created?
- Applications of embeddings
- More on vector DBs
- How does a vector DB work?
- Algorithms for creating the vector index
- DB operations
- An overview of advanced RAG
- Pre-retrieval
- Retrieval
- Post-retrieval
- Exploring the LLM Twin's RAG feature pipeline architecture
- The problem we are solving
- The feature store
- Where does the raw data come from?
- Designing the architecture of the RAG feature pipeline
- Batch pipelines
- Batch versus streaming pipelines
- Core steps
- Change data capture: syncing the data warehouse and feature store
- Why is the data stored in two snapshots?
- Orchestration
- Implementing the LLM Twin's RAG feature pipeline
- Settings
- ZenML pipeline and steps
- Querying the data warehouse
- Cleaning the documents
- Chunk and embed the cleaned documents
- Loading the documents to the vector DB
- Pydantic domain entities
- OVM
- The dispatcher layer
- The handlers
- The cleaning handlers
- The chunking handlers
- The embedding handlers
- Summary
- References
- Chapter 5: Supervised Fine-Tuning
- Creating an instruction dataset
- General framework
- Data quantity
- Data curation
- Rule-based filtering
- Data deduplication
- Data decontamination
- Data quality evaluation
- Data exploration
- Data generation
- Data augmentation
- Creating our own instruction dataset
- Exploring SFT and its techniques
- When to fine-tune
- Instruction dataset formats
- Chat templates
- Parameter-efficient fine-tuning techniques
- Full fine-tuning
- LoRA
- QLoRA
- Training parameters
- Learning rate and scheduler
- Batch size
- Maximum length and packing
- Number of epochs
- Optimizers
- Weight decay
- Gradient checkpointing
- Fine-tuning in practice
- Summary
- References
- Chapter 6: Fine-Tuning with Preference Alignment
- Understanding preference datasets
- Preference data
- Data quantity
- Data generation and evaluation
- Generating preferences
- Tips for data generation
- Evaluating preferences
- Creating our own preference dataset
- Preference alignment
- Reinforcement Learning from Human Feedback
- Direct Preference Optimization
- Implementing DPO
- Summary
- References
- Chapter 7: Evaluating LLMs
- Model evaluation
- Comparing ML and LLM evaluation
- General-purpose LLM evaluations
- Domain-specific LLM evaluations
- Task-specific LLM evaluations
- RAG evaluation
- Ragas
- ARES
- Evaluating TwinLlama-3.1-8B
- Generating answers
- Evaluating answers
- Analyzing results
- Summary
- References
- Chapter 8: Inference Optimization
- Model optimization strategies
- KV cache
- Continuous batching
- Speculative decoding
- Optimized attention mechanisms
- Model parallelism
- Data parallelism
- Pipeline parallelism
- Tensor parallelism
- Combining approaches
- Model quantization
- Introduction to quantization
- Quantization with GGUF and llama.cpp
- Quantization with GPTQ and EXL2
- Other quantization techniques
- Summary
- References
- Chapter 9: RAG Inference Pipeline
- Understanding the LLM Twin's RAG inference pipeline
- Exploring the LLM Twin's advanced RAG techniques
- Advanced RAG pre-retrieval optimizations: query expansion and self-querying
- Query expansion
- Self-querying
- Advanced RAG retrieval optimization: filtered vector search
- Advanced RAG post-retrieval optimization: reranking
- Implementing the LLM Twin's RAG inference pipeline
- Implementing the retrieval module
- Bringing everything together into the RAG inference pipeline
- Summary
- References
- Chapter 10: Inference Pipeline Deployment
- Criteria for choosing deployment types
- Throughput and latency
- Data
- Understanding inference deployment types
- Online real-time inference
- Asynchronous inference
- Offline batch transform
- Monolithic versus microservices architecture in model serving
- Monolithic architecture
- Microservices architecture
- Choosing between monolithic and microservices architectures
- Exploring the LLM Twin's inference pipeline deployment strategy
- The training versus the inference pipeline
- Deploying the LLM Twin service
- Implementing the LLM microservice using AWS SageMaker
- What are Hugging Face's DLCs?
- Configuring SageMaker roles
- Deploying the LLM Twin model to AWS SageMaker
- Calling the AWS SageMaker Inference endpoint
- Building the business microservice using FastAPI
- Autoscaling capabilities to handle spikes in usage
- Registering a scalable target
- Creating a scalable policy
- Minimum and maximum scaling limits
- Cooldown period
- Summary
- References
- Chapter 11: MLOps and LLMOps
- The path to LLMOps: Understanding its roots in DevOps and MLOps
- DevOps
- The DevOps lifecycle
- The core DevOps concepts
- MLOps
- MLOps core components
- MLOps principles
- ML vs. MLOps engineering
- LLMOps
- Human feedback
- Guardrails
- Prompt monitoring
- Deploying the LLM Twin's pipelines to the cloud
- Understanding the infrastructure
- Setting up MongoDB
- Setting up Qdrant
- Setting up the ZenML cloud
- Containerize the code using Docker
- Run the pipelines on AWS
- Troubleshooting the ResourceLimitExceeded error after running a ZenML pipeline on SageMaker
- Adding LLMOps to the LLM Twin
- LLM Twin's CI/CD pipeline flow
- More on formatting errors
- More on linting errors
- Quick overview of GitHub Actions
- The CI pipeline
- GitHub Actions CI YAML file
- The CD pipeline
- Test out the CI/CD pipeline
- The CT pipeline
- Initial triggers
- Trigger downstream pipelines
- Prompt monitoring
- Alerting
- Summary
- References
- Appendix: MLOps Principles
- 1. Automation or operationalization
- 2. Versioning
- 3. Experiment tracking
- 4. Testing
- Test types
- What do we test?
- Test examples
- 5. Monitoring
- Logs
- Metrics
- System metrics
- Model metrics
- Drifts
- Monitoring vs. observability
- Alerts
- 6. Reproducibility
- Other Books You May Enjoy
- Index
LLM Engineer's Handbook
Master the art of engineering large language models from concept to production
Paul Iusztin
Maxime Labonne
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Senior Publishing Product Manager: Gebin George
Acquisition Editor - Peer Reviews: Swaroop Singh
Project Editor: Amisha Vathare
Content Development Editor: Tanya D'cruz
Copy Editor: Safis Editing
Technical Editor: Karan Sonawane
Proofreader: Safis Editing
Indexer: Manju Arasan
Presentation Designer: Rajesh Shirsath
Developer Relations Marketing Executive: Anamika Singh
First published: October 2024
Production reference: 4070725
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul's Square
Birmingham
B3 1RB, UK.
ISBN 978-1-83620-007-9
Forewords
As my co-founder at Hugging Face, Clement Delangue, and I often say, AI is becoming the default way of building technology.
Over the past 3 years, LLMs have already had a profound impact on technology, and they are bound to have an even greater impact in the coming 5 years. They will be embedded in more and more products and, I believe, at the center of any human activity based on knowledge or creativity.
For instance, coders are already leveraging LLMs and changing the way they work, focusing on higher-order thinking and tasks while collaborating with machines. Studio musicians rely on AI-powered tools to explore the musical creativity space faster. Lawyers are increasing their impact through retrieval-augmented generation (RAG) and large databases of case law.
At Hugging Face, we've always advocated for a future where not just one company or a small number of scientists control the AI models used by the rest of the population, but instead for a future where as many people as possible, from as many different backgrounds as possible, are capable of diving into how cutting-edge machine learning models actually work.
Maxime Labonne and Paul Iusztin have been instrumental in this movement to democratize LLMs by writing this book and making sure that as many people as possible can not only use them but also adapt them, fine-tune them, quantize them, and make them efficient enough to actually deploy in the real world.
Their work is essential, and I'm glad they are making this resource available to the community. This expands the convex hull of human knowledge.
Julien Chaumond
Co-founder and CTO, Hugging Face
As someone deeply immersed in the world of machine learning operations, I'm thrilled to endorse The LLM Engineer's Handbook. This comprehensive guide arrives at a crucial time when the demand for LLM expertise is skyrocketing across industries.
What sets this book apart is its practical, end-to-end approach. By walking readers through the creation of an LLM Twin, it bridges the often daunting gap between theory and real-world application. From data engineering and model fine-tuning to advanced topics like RAG pipelines and inference optimization, the authors leave no stone unturned.
I'm particularly impressed by the emphasis on MLOps and LLMOps principles. As organizations increasingly rely on LLMs, understanding how to build scalable, reproducible, and robust systems is paramount. The inclusion of orchestration strategies and cloud integration showcases the authors' commitment to equipping readers with truly production-ready skills.
Whether you're a seasoned ML practitioner looking to specialize in LLMs or a software engineer aiming to break into this exciting field, this handbook provides the perfect blend of foundational knowledge and cutting-edge techniques. The clear explanations, practical examples, and focus on best practices make it an invaluable resource for anyone serious about mastering LLM engineering.
In an era where AI is reshaping industries at breakneck speed, The LLM Engineer's Handbook stands out as an essential guide for navigating the complexities of large language models. It's not just a book; it's a roadmap to becoming a proficient LLM engineer in today's AI-driven landscape.
Hamza Tahir
Co-founder and CTO, ZenML
The LLM Engineer's Handbook serves as an invaluable resource for anyone seeking a hands-on understanding of LLMs. Through practical examples and a comprehensive exploration of the LLM Twin project, the author effectively demystifies the complexities of building and deploying production-level LLM applications.
One of the book's standout features is its use of the LLM Twin project as a running example. This AI character, designed to emulate the writing style of a specific individual, provides a tangible illustration of how LLMs can be applied in real-world scenarios.
The author skillfully guides readers through the essential tools and technologies required for LLM development, including Hugging Face, ZenML, Comet, Opik, MongoDB, and Qdrant. Each tool is explained in detail, making it easy for readers to understand their functions and how they can be integrated into an LLM pipeline.
LLM Engineer's Handbook also covers a wide range of topics related to LLM development, such as data collection, fine-tuning, evaluation, inference optimization, and MLOps. Notably, the chapters on supervised fine-tuning, preference alignment, and Retrieval Augmented Generation (RAG) provide in-depth insights into these critical aspects of LLM development.
A particular strength of this book lies in its focus on practical implementation. The author excels at providing concrete examples and guidance on how to optimize inference pipelines and deploy LLMs effectively. This makes the book a valuable resource for both researchers and practitioners.
This book is highly recommended for anyone interested in learning about LLMs and their practical applications. By providing a comprehensive overview of the tools, techniques, and best practices involved in LLM development, the authors have created a valuable resource that will undoubtedly be a reference for many LLM engineers.
Antonio Gulli
Senior Director, Google
Contributors
About the authors
Paul Iusztin is a senior ML and MLOps engineer with over seven years of experience building GenAI, Computer Vision and MLOps solutions. His latest contribution was at Metaphysic, where he served as one of their core engineers in taking large neural networks to production. He previously worked at CoreAI, Everseen, and Continental. He is the Founder of Decoding ML, an educational channel on production-grade ML that provides posts, articles, and open-source courses to help others build real-world ML systems.
Maxime Labonne is the Head of Post-Training at Liquid AI. He holds a Ph.D. in ML from the Polytechnic Institute of Paris and is recognized as a Google Developer Expert in AI/ML. As an active blogger, he has made significant contributions to the open-source community, including the LLM Course on GitHub, tools such as LLM AutoEval, and several state-of-the-art models like NeuralDaredevil. He is the author of the best-selling book Hands-On Graph Neural Networks Using Python, published by Packt.
I want to thank my family and partner. Your unwavering support and patience made this book possible.
About the reviewer
Rany ElHousieny is an AI solutions architect and AI engineering manager with over two decades of experience in AI, NLP, and ML. Throughout his career, he has focused on the development and deployment of AI models, authoring multiple articles on AI systems architecture and ethical AI deployment. He has led groundbreaking projects at companies like Microsoft, where he spearheaded advancements in NLP and the Language Understanding Intelligent Service (LUIS). Currently, he plays a pivotal role at Clearwater Analytics, driving innovation in GenAI and AI-driven financial and investment management solutions.
I would like to thank Clearwater Analytics for providing a supportive and learning environment that fosters growth and innovation. The vision of our leaders, always staying ahead with the latest technologies, has...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePUB works well for novels and non-fiction books – i.e., 'flowing' text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a 'hard' copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our eBook Help page.
File format: ePUB
Copy protection: without DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Use a reader that can handle the file format ePUB, such as Adobe Digital Editions or FBReader – both free (see eBook Help).
- Tablet/Smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePUB works well for novels and non-fiction books – i.e., 'flowing' text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook does not use copy protection or Digital Rights Management.
For more information, see our eBook Help page.