LLM Engineer's Handbook

Master the art of engineering large language models from concept to production

Paul Iusztin, Niladri Sen, Julien Chaumond, Hamza Tahir, Antonio Gulli (Author)

Packt Publishing Limited

1st Edition

Published on 13 January 2025

522 pages

E-Book: ePUB with Adobe DRM

E-Book: ePUB without DRM

ISBN 978-1-83620-006-2

from €29.99

Available for download

Description

Step into the world of LLMs with this practical guide that takes you from the fundamentals to deploying advanced applications using LLMOps best practices.

Key Features
- Build and refine LLMs step by step, covering data preparation, RAG, and fine-tuning
- Learn essential skills for deploying and monitoring LLMs, ensuring optimal performance in production
- Utilize preference alignment, evaluation, and inference optimization to enhance performance and adaptability of your LLM applications

Book Description
Artificial intelligence has undergone rapid advancements, and Large Language Models (LLMs) are at the forefront of this revolution. This LLM book offers insights into designing, training, and deploying LLMs in real-world scenarios by leveraging MLOps best practices. The guide walks you through building an LLM-powered twin that's cost-effective, scalable, and modular. It moves beyond isolated Jupyter notebooks, focusing on how to build production-grade end-to-end LLM systems. Throughout this book, you will learn data engineering, supervised fine-tuning, and deployment. The hands-on approach to building the LLM Twin use case will help you implement MLOps components in your own projects. You will also explore cutting-edge advancements in the field, including inference optimization, preference alignment, and real-time data processing, making this a vital resource for those looking to apply LLMs in their projects. By the end of this book, you will be proficient in deploying LLMs that solve practical problems while maintaining low-latency and high-availability inference capabilities. Whether you are new to artificial intelligence or an experienced practitioner, this book delivers guidance and practical techniques that will deepen your understanding of LLMs and sharpen your ability to implement them effectively.

What you will learn
- Implement robust data pipelines and manage LLM training cycles
- Create your own LLM and refine it with the help of hands-on examples
- Get started with LLMOps by diving into core MLOps principles such as orchestrators and prompt monitoring
- Perform supervised fine-tuning and LLM evaluation
- Deploy end-to-end LLM solutions using AWS and other tools
- Design scalable and modular LLM systems
- Learn about RAG applications by building a feature and inference pipeline

Who this book is for
This book is for AI engineers, NLP professionals, and LLM engineers looking to deepen their understanding of LLMs. Basic knowledge of LLMs and the Gen AI landscape, Python and AWS is recommended. Whether you are new to AI or looking to enhance your skills, this book provides comprehensive guidance on implementing LLMs in real-world scenarios.

Content

Cover
Copyright
Forewords
Contributors
Table of Contents
Preface
Making the Most Out of This Book - Get to Know Your Free Benefits
Chapter 1: Understanding the LLM Twin Concept and Architecture
Understanding the LLM Twin concept
What is an LLM Twin?
Why building an LLM Twin matters
Why not use ChatGPT (or another similar chatbot)?
Planning the MVP of the LLM Twin product
What is an MVP?
Defining the LLM Twin MVP
Building ML systems with feature/training/inference pipelines
The problem with building ML systems
The issue with previous solutions
The solution - ML pipelines for ML systems
The feature pipeline
The training pipeline
The inference pipeline
Benefits of the FTI architecture
Designing the system architecture of the LLM Twin
Listing the technical details of the LLM Twin architecture
How to design the LLM Twin architecture using the FTI pipeline design
Data collection pipeline
Feature pipeline
Training pipeline
Inference pipeline
Final thoughts on the FTI design and the LLM Twin architecture
Summary
References
Chapter 2: Tooling and Installation
Python ecosystem and project installation
Poetry: dependency and virtual environment management
Poe the Poet: task execution tool
MLOps and LLMOps tooling
Hugging Face: model registry
ZenML: orchestrator, artifacts, and metadata
Orchestrator
Artifacts and metadata
How to run and configure a ZenML pipeline
Comet ML: experiment tracker
Opik: prompt monitoring
Databases for storing unstructured and vector data
MongoDB: NoSQL database
Qdrant: vector database
Preparing for AWS
Setting up an AWS account, an access key, and the CLI
SageMaker: training and inference compute
Why AWS SageMaker?
Summary
References
Chapter 3: Data Engineering
Designing the LLM Twin's data collection pipeline
Implementing the LLM Twin's data collection pipeline
ZenML pipeline and steps
The dispatcher: How do you instantiate the right crawler?
The crawlers
Base classes
GitHubCrawler class
CustomArticleCrawler class
MediumCrawler class
The NoSQL data warehouse documents
The ORM and ODM software patterns
Implementing the ODM class
Data categories and user document classes
Gathering raw data into the data warehouse
Troubleshooting
Selenium issues
Import our backed-up data
Summary
References
Chapter 4: RAG Feature Pipeline
Understanding RAG
Why use RAG?
Hallucinations
Old information
The vanilla RAG framework
Ingestion pipeline
Retrieval pipeline
Generation pipeline
What are embeddings?
Why embeddings are so powerful
How are embeddings created?
Applications of embeddings
More on vector DBs
How does a vector DB work?
Algorithms for creating the vector index
DB operations
An overview of advanced RAG
Pre-retrieval
Retrieval
Post-retrieval
Exploring the LLM Twin's RAG feature pipeline architecture
The problem we are solving
The feature store
Where does the raw data come from?
Designing the architecture of the RAG feature pipeline
Batch pipelines
Batch versus streaming pipelines
Core steps
Change data capture: syncing the data warehouse and feature store
Why is the data stored in two snapshots?
Orchestration
Implementing the LLM Twin's RAG feature pipeline
Settings
ZenML pipeline and steps
Querying the data warehouse
Cleaning the documents
Chunk and embed the cleaned documents
Loading the documents to the vector DB
Pydantic domain entities
OVM
The dispatcher layer
The handlers
The cleaning handlers
The chunking handlers
The embedding handlers
Summary
References
Chapter 5: Supervised Fine-Tuning
Creating an instruction dataset
General framework
Data quantity
Data curation
Rule-based filtering
Data deduplication
Data decontamination
Data quality evaluation
Data exploration
Data generation
Data augmentation
Creating our own instruction dataset
Exploring SFT and its techniques
When to fine-tune
Instruction dataset formats
Chat templates
Parameter-efficient fine-tuning techniques
Full fine-tuning
LoRA
QLoRA
Training parameters
Learning rate and scheduler
Batch size
Maximum length and packing
Number of epochs
Optimizers
Weight decay
Gradient checkpointing
Fine-tuning in practice
Summary
References
Chapter 6: Fine-Tuning with Preference Alignment
Understanding preference datasets
Preference data
Data quantity
Data generation and evaluation
Generating preferences
Tips for data generation
Evaluating preferences
Creating our own preference dataset
Preference alignment
Reinforcement Learning from Human Feedback
Direct Preference Optimization
Implementing DPO
Summary
References
Chapter 7: Evaluating LLMs
Model evaluation
Comparing ML and LLM evaluation
General-purpose LLM evaluations
Domain-specific LLM evaluations
Task-specific LLM evaluations
RAG evaluation
Ragas
ARES
Evaluating TwinLlama-3.1-8B
Generating answers
Evaluating answers
Analyzing results
Summary
References
Chapter 8: Inference Optimization
Model optimization strategies
KV cache
Continuous batching
Speculative decoding
Optimized attention mechanisms
Model parallelism
Data parallelism
Pipeline parallelism
Tensor parallelism
Combining approaches
Model quantization
Introduction to quantization
Quantization with GGUF and llama.cpp
Quantization with GPTQ and EXL2
Other quantization techniques
Summary
References
Chapter 9: RAG Inference Pipeline
Understanding the LLM Twin's RAG inference pipeline
Exploring the LLM Twin's advanced RAG techniques
Advanced RAG pre-retrieval optimizations: query expansion and self-querying
Query expansion
Self-querying
Advanced RAG retrieval optimization: filtered vector search
Advanced RAG post-retrieval optimization: reranking
Implementing the LLM Twin's RAG inference pipeline
Implementing the retrieval module
Bringing everything together into the RAG inference pipeline
Summary
References
Chapter 10: Inference Pipeline Deployment
Criteria for choosing deployment types
Throughput and latency
Data
Understanding inference deployment types
Online real-time inference
Asynchronous inference
Offline batch transform
Monolithic versus microservices architecture in model serving
Monolithic architecture
Microservices architecture
Choosing between monolithic and microservices architectures
Exploring the LLM Twin's inference pipeline deployment strategy
The training versus the inference pipeline
Deploying the LLM Twin service
Implementing the LLM microservice using AWS SageMaker
What are Hugging Face's DLCs?
Configuring SageMaker roles
Deploying the LLM Twin model to AWS SageMaker
Calling the AWS SageMaker Inference endpoint
Building the business microservice using FastAPI
Autoscaling capabilities to handle spikes in usage
Registering a scalable target
Creating a scalable policy
Minimum and maximum scaling limits
Cooldown period
Summary
References
Chapter 11: MLOps and LLMOps
The path to LLMOps: Understanding its roots in DevOps and MLOps
DevOps
The DevOps lifecycle
The core DevOps concepts
MLOps
MLOps core components
MLOps principles
ML vs. MLOps engineering
LLMOps
Human feedback
Guardrails
Prompt monitoring
Deploying the LLM Twin's pipelines to the cloud
Understanding the infrastructure
Setting up MongoDB
Setting up Qdrant
Setting up the ZenML cloud
Containerize the code using Docker
Run the pipelines on AWS
Troubleshooting the ResourceLimitExceeded error after running a ZenML pipeline on SageMaker
Adding LLMOps to the LLM Twin
LLM Twin's CI/CD pipeline flow
More on formatting errors
More on linting errors
Quick overview of GitHub Actions
The CI pipeline
GitHub Actions CI YAML file
The CD pipeline
Test out the CI/CD pipeline
The CT pipeline
Initial triggers
Trigger downstream pipelines
Prompt monitoring
Alerting
Summary
References
Appendix: MLOps Principles
1. Automation or operationalization
2. Versioning
3. Experiment tracking
4. Testing
Test types
What do we test?
Test examples
5. Monitoring
Logs
Metrics
System metrics
Model metrics
Drifts
Monitoring vs. observability
Alerts
6. Reproducibility
Other Books You May Enjoy
Index

LLM Engineer's Handbook

Master the art of engineering large language models from concept to production

Paul Iusztin

Maxime Labonne

LLM Engineer's Handbook

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Senior Publishing Product Manager: Gebin George

Acquisition Editor - Peer Reviews: Swaroop Singh

Project Editor: Amisha Vathare

Content Development Editor: Tanya D'cruz

Copy Editor: Safis Editing

Technical Editor: Karan Sonawane

Proofreader: Safis Editing

Indexer: Manju Arasan

Presentation Designer: Rajesh Shirsath

Developer Relations Marketing Executive: Anamika Singh

First published: October 2024

Production reference: 4070725

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul's Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83620-007-9

www.packt.com

Forewords

As my co-founder at Hugging Face, Clement Delangue, and I often say, AI is becoming the default way of building technology.

Over the past 3 years, LLMs have already had a profound impact on technology, and they are bound to have an even greater impact in the coming 5 years. They will be embedded in more and more products and, I believe, at the center of any human activity based on knowledge or creativity.

For instance, coders are already leveraging LLMs and changing the way they work, focusing on higher-order thinking and tasks while collaborating with machines. Studio musicians rely on AI-powered tools to explore the musical creativity space faster. Lawyers are increasing their impact through retrieval-augmented generation (RAG) and large databases of case law.

At Hugging Face, we've always advocated for a future where not just one company or a small number of scientists control the AI models used by the rest of the population, but instead for a future where as many people as possible, from as many different backgrounds as possible, are capable of diving into how cutting-edge machine learning models actually work.

Maxime Labonne and Paul Iusztin have been instrumental in this movement to democratize LLMs by writing this book and making sure that as many people as possible can not only use them but also adapt them, fine-tune them, quantize them, and make them efficient enough to actually deploy in the real world.

Their work is essential, and I'm glad they are making this resource available to the community. This expands the convex hull of human knowledge.

Julien Chaumond

Co-founder and CTO, Hugging Face

As someone deeply immersed in the world of machine learning operations, I'm thrilled to endorse The LLM Engineer's Handbook. This comprehensive guide arrives at a crucial time when the demand for LLM expertise is skyrocketing across industries.

What sets this book apart is its practical, end-to-end approach. By walking readers through the creation of an LLM Twin, it bridges the often daunting gap between theory and real-world application. From data engineering and model fine-tuning to advanced topics like RAG pipelines and inference optimization, the authors leave no stone unturned.

I'm particularly impressed by the emphasis on MLOps and LLMOps principles. As organizations increasingly rely on LLMs, understanding how to build scalable, reproducible, and robust systems is paramount. The inclusion of orchestration strategies and cloud integration showcases the authors' commitment to equipping readers with truly production-ready skills.

Whether you're a seasoned ML practitioner looking to specialize in LLMs or a software engineer aiming to break into this exciting field, this handbook provides the perfect blend of foundational knowledge and cutting-edge techniques. The clear explanations, practical examples, and focus on best practices make it an invaluable resource for anyone serious about mastering LLM engineering.

In an era where AI is reshaping industries at breakneck speed, The LLM Engineer's Handbook stands out as an essential guide for navigating the complexities of large language models. It's not just a book; it's a roadmap to becoming a proficient LLM engineer in today's AI-driven landscape.

Hamza Tahir

Co-founder and CTO, ZenML

The LLM Engineer's Handbook serves as an invaluable resource for anyone seeking a hands-on understanding of LLMs. Through practical examples and a comprehensive exploration of the LLM Twin project, the author effectively demystifies the complexities of building and deploying production-level LLM applications.

One of the book's standout features is its use of the LLM Twin project as a running example. This AI character, designed to emulate the writing style of a specific individual, provides a tangible illustration of how LLMs can be applied in real-world scenarios.

The author skillfully guides readers through the essential tools and technologies required for LLM development, including Hugging Face, ZenML, Comet, Opik, MongoDB, and Qdrant. Each tool is explained in detail, making it easy for readers to understand their functions and how they can be integrated into an LLM pipeline.

LLM Engineer's Handbook also covers a wide range of topics related to LLM development, such as data collection, fine-tuning, evaluation, inference optimization, and MLOps. Notably, the chapters on supervised fine-tuning, preference alignment, and Retrieval Augmented Generation (RAG) provide in-depth insights into these critical aspects of LLM development.

A particular strength of this book lies in its focus on practical implementation. The author excels at providing concrete examples and guidance on how to optimize inference pipelines and deploy LLMs effectively. This makes the book a valuable resource for both researchers and practitioners.

This book is highly recommended for anyone interested in learning about LLMs and their practical applications. By providing a comprehensive overview of the tools, techniques, and best practices involved in LLM development, the authors have created a valuable resource that will undoubtedly be a reference for many LLM Engineers.

Antonio Gulli

Senior Director, Google

Contributors

About the authors

Paul Iusztin is a senior ML and MLOps engineer with over seven years of experience building GenAI, Computer Vision and MLOps solutions. His latest contribution was at Metaphysic, where he served as one of their core engineers in taking large neural networks to production. He previously worked at CoreAI, Everseen, and Continental. He is the Founder of Decoding ML, an educational channel on production-grade ML that provides posts, articles, and open-source courses to help others build real-world ML systems.

Maxime Labonne is the Head of Post-Training at Liquid AI. He holds a PhD in ML from the Polytechnic Institute of Paris and is recognized as a Google Developer Expert in AI/ML. As an active blogger, he has made significant contributions to the open-source community, including the LLM Course on GitHub, tools such as LLM AutoEval, and several state-of-the-art models like NeuralDaredevil. He is the author of the best-selling book Hands-On Graph Neural Networks Using Python, published by Packt.

I want to thank my family and partner. Your unwavering support and patience made this book possible.

About the reviewer

Rany ElHousieny is an AI solutions architect and AI engineering manager with over two decades of experience in AI, NLP, and ML. Throughout his career, he has focused on the development and deployment of AI models, authoring multiple articles on AI systems architecture and ethical AI deployment. He has led groundbreaking projects at companies like Microsoft, where he spearheaded advancements in NLP and the Language Understanding Intelligent Service (LUIS). Currently, he plays a pivotal role at Clearwater Analytics, driving innovation in GenAI and AI-driven financial and investment management solutions.

I would like to thank Clearwater Analytics for providing a supportive and learning environment that fosters growth and innovation. The vision of our leaders, always staying ahead with the latest technologies, has...
