
Data Science First
Description
Proven, practical techniques for integrating language models into your data science workflows
Data Science First: Using Language Models in AI-Enabled Applications, by Intersect AI's Chief AI Officer John Hawkins, explains how practicing data scientists can integrate language models in data science workflows without abandoning essential principles of reliability, accuracy, and efficacy. Hawkins offers crystal-clear guidance on when, where, and how data scientists can integrate language models into their existing workflows without exposing themselves or their companies to unnecessary risks.
This guide walks you through strategic design patterns for incorporating language models into real-world data science projects. It avoids strategies and techniques that rely heavily on proprietary tools that are likely to evolve very quickly (or could disappear entirely) in the near future. Instead, the author presents foundational methodologies that will remain valuable regardless of how individual platforms or services change. The book combines sound theory with practical case studies that cover common data science projects in the education, insurance, telecommunications, media and banking industries. These case studies include customer churn analysis, customer complaint routing and document processing, and they demonstrate how language models can enhance rather than replace traditional data science methods.
You'll find:
- Three chapters providing a solid grounding in the ideas, principles and technologies that are used for data science with language models
- Nine chapters that discuss specific patterns for integrating language models into data science workflows, including semantic vector analysis, few-shot prompting, retrieval-based applications, synthetic data generation and AI agent development
- Real-world case studies discussing applications like fraud detection, customer churn, translation, document classification and sentiment analysis, with concrete business applications
- Comprehensive evaluation methods and testing frameworks for language model applications in enterprise environments
- Practical code examples and implementation guidance using popular tools like HuggingFace, OpenAI and Google Gemini, as well as development frameworks like LangChain and PydanticAI
- Strategic insights for balancing model accuracy, interpretability, and business requirements while avoiding common pitfalls in AI deployment
An authoritative resource for data scientists and software engineers interested in using modern AI tools to build data-driven applications, Data Science First is a strategy guide for professionals navigating the discipline of data science as it is disrupted by generative AI. Whether you're looking to improve existing workflows or develop entirely new AI-powered solutions, you'll discover how to use language models in ways that consistently add value.

Content
Acknowledgments
About the Author
Introduction
Chapter 1: Language Models
Chapter 2: Tools and Terminology
Chapter 3: Data Science Essentials
Chapter 4: Semantic Vectors
Chapter 5: Insights and Interpretability
Chapter 6: Zero-Shot to Few-Shot Prompting
Chapter 7: Labeling and Feature Engineering
Chapter 8: Synthetic Data Generation
Chapter 9: Retrieval Applications
Chapter 10: Code as Language
Chapter 11: Automated Analytics
Chapter 12: Agentic AI
Index
Chapter 1
Language Models
Technological progress tends not to be smooth. Sometimes, a single development unlocks a large amount of potential, resulting in a burst of activity in which many applications are found. Enthusiastic technologists rapidly explore a space of new inventions and ways of working, opened by a single advance in a fundamental field. In the world of data science, we are living through just such a period, set in motion by the invention of the transformer neural network. Coinciding with sufficient computational power, this invention has allowed us to build machines that seem to have unlimited capabilities to learn deep semantic relationships from text data alone. This fundamental transformation has unlocked near-perfect language translation, the ability to generate web pages from text descriptions, and the new commercial world of general-purpose Artificial Intelligence (AI) systems that can respond to a very wide variety of requests. This is all possible only because the foundation models they are built upon can consume the library of data on the Internet and build rich semantic representations of various human natural languages, programming languages, and other data structures.
Many tasks that can be accomplished with these language models were previously done by dedicated data science teams, collecting data and crafting algorithms for that specific purpose. In many instances, they still perform that work whenever the task requires higher levels of accuracy or security than the general-purpose AI systems permit. Nevertheless, the abilities of language models allow us to improve many aspects of the data science workflow, from data collection to prototyping and even scripting solutions. In this book, we explore those opportunities, beginning with the most fundamental utility of the representation of text as semantic vectors, through to using models to write code or experiment with agent-based AI systems. However, we must start with a discussion of what a language model is and the gradual development of the technology that has unlocked this possibility.
What Is a Language Model?
The term language model emerged from linguistics to describe the attempt to model the statistical structure of language production. In its simplest form, this model could be a finite-state automaton that captures the probability of the next text character, given the current text character. I show this idea schematically in Figure 1.1, using the limited English alphabet. The state consists of the last character that was printed and determines the probability distribution over characters, with a new character sampled from this distribution. This new character is output and then used to update the state of the system.
Figure 1.1: A simple statistical language model.
As you might imagine, such a machine is incapable of learning the subtleties of human language. However, as shown in the following example output, you can see that certain common vowel-to-consonant pairings are routinely generated. It isn't English, but it has language-like properties.
> The [I, tasinyey, lithecisoblits. u "Moupe "Ive Hon
This output was generated by a simple script that opens a dataset consisting of three short stories by the science fiction writer Philip K. Dick. It cleans the text to remove new lines, underscores, and a variety of characters used to separate chapters and sections. Then it builds a statistical model that simply learns the probability of the next character from the current character. In building this simple model, we now have multiple key ideas of language models that will recur throughout the book.
The first idea is that text data must be broken into pieces known as tokens in a process called tokenization. These tokens could be the individual letters in the text, but in many instances, they are multi-letter compounds that depend on the language and text domain. The complete list of possible tokens that a model can recognize is called the vocabulary. In many instances, the model will have a special token for coping with any unrecognized token. Unrecognized tokens may occur when the model is used on text that extends beyond the domain of the training data.
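The script itself is not reproduced in this excerpt, so the following is only a minimal sketch of how such a first-order character model might be written in Python. The function names (`clean_text`, `build_model`, `generate`) and the two-sentence placeholder corpus are assumptions made for illustration; the original script read three short stories and its cleaning rules may well differ.

```python
import random
from collections import Counter, defaultdict


def clean_text(raw: str) -> str:
    """Replace underscores with spaces and collapse new lines and repeated whitespace."""
    return " ".join(raw.replace("_", " ").split())


def build_model(text: str) -> dict[str, Counter]:
    """Count how often each character follows each other character."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def generate(model: dict[str, Counter], seed: str, length: int, rng: random.Random) -> str:
    """Sample characters one at a time; the state is only the last character."""
    output = seed
    state = seed[-1]
    for _ in range(length):
        followers = model.get(state)
        if not followers:  # this character was never followed by anything
            break
        chars = list(followers)
        weights = [followers[c] for c in chars]
        # Sampling in proportion to the counts is equivalent to sampling
        # from the estimated next-character probability distribution.
        state = rng.choices(chars, weights=weights, k=1)[0]
        output += state
    return output


# Placeholder corpus (an assumption); substitute any plain-text dataset.
corpus = clean_text("The cat sat on the mat.\nThe dog sat on the log.\n")
model = build_model(corpus)
sample = generate(model, seed="T", length=40, rng=random.Random(0))
print(sample)
```

Every adjacent pair of characters in the generated string occurs somewhere in the training text, which is why even this tiny model produces language-like fragments.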
When a block of text is tokenized into a list of tokens, the order of the tokens from the original string is maintained. The simple idea of tokenization is to determine the most granular representation of text, and the individual token identities in the list determine how the text is encoded for the model.
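To make these terms concrete, here is a small sketch of a character-level tokenizer in Python. It is not taken from the book: the `<unk>` token name and the helper functions `build_vocabulary`, `tokenize`, and `encode` are hypothetical, and real tokenizers usually work with multi-letter tokens rather than single characters.

```python
UNK = "<unk>"  # hypothetical name for the special "unrecognized" token


def build_vocabulary(training_text: str) -> dict[str, int]:
    """Map each distinct character in the training text to an integer id."""
    vocab = {UNK: 0}
    for ch in sorted(set(training_text)):
        vocab[ch] = len(vocab)
    return vocab


def tokenize(text: str, vocab: dict[str, int]) -> list[str]:
    """Split text into tokens in their original order, replacing unknown ones."""
    return [ch if ch in vocab else UNK for ch in text]


def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Convert tokens to the integer ids a model would consume."""
    return [vocab[t] for t in tokens]


vocab = build_vocabulary("the cat sat")
tokens = tokenize("the zebra", vocab)  # 'z', 'b', 'r' were never seen in training
ids = encode(tokens, vocab)
print(tokens)
print(ids)
```

The letters that never appeared in the training text all collapse to the same unknown token, so the model cannot tell them apart. This is the practical cost of using a model on text outside its training domain.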
Text can be of arbitrary length, so it follows that tokenized text becomes a list of tokens of arbitrary length. However, in practice, modern language models operate with a finite length input, called the context window. In this very simple model, the context window is one, because it looks at only one token to determine the probability of the next token. However, we could extend this very easily. I have modified the script so that it builds a new set of probabilities based on all possible consecutive token pairs, to model the probability of the next token that follows the current pair. An example output of this new model is shown here:
> They inkly bunkepler come sollitsiang therne st be c
I seeded this model with the input pair "Th," as those were the first two characters of the previous model's output, which it extended into the word "They." Already you can see that the output looks slightly more like English. There is no such thing as an "inkly bunkepler," but it sounds like the kind of thing we might find in a science fiction novel. We can extend the context window again to three characters and generate output one more time, which results in the following:
> The was of france." Lt. As around stanticks refright
This was seeded with the three characters "The" to follow on from the previous model. Now you can see that the text is getting very close to resembling English language; in fact, many of the words in this salad of characters are genuine English words.
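The modified script is not shown in this excerpt. As a rough illustration of extending the context window, the sketch below builds a character model for any context size and generates from a seed. The names `build_ngram_model` and `generate_ngram` and the placeholder corpus are assumptions, not the author's code.

```python
import random
from collections import Counter, defaultdict


def build_ngram_model(text: str, context_size: int) -> dict[str, Counter]:
    """Count which character follows each run of `context_size` characters."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for i in range(len(text) - context_size):
        context = text[i:i + context_size]
        counts[context][text[i + context_size]] += 1
    return counts


def generate_ngram(model: dict[str, Counter], seed: str, context_size: int,
                   length: int, rng: random.Random) -> str:
    """Extend the seed by sampling from the distribution for the latest context."""
    if len(seed) < context_size:
        raise ValueError("seed must contain at least context_size characters")
    output = seed
    for _ in range(length):
        followers = model.get(output[-context_size:])
        if not followers:  # this context never appeared with a following character
            break
        chars = list(followers)
        weights = [followers[c] for c in chars]
        output += rng.choices(chars, weights=weights, k=1)[0]
    return output


# Placeholder corpus (an assumption); the text above used three short stories.
corpus = "the cat sat on the mat. the dog sat on the log."
pairs_model = build_ngram_model(corpus, context_size=2)
sample = generate_ngram(pairs_model, seed="th", context_size=2,
                        length=30, rng=random.Random(1))
print(sample)
```

With a context size of two, every run of three characters in the output appears somewhere in the training text. Raising `context_size` to three gives the same guarantee for runs of four, which is why the output looks progressively more like real words.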
We could continue this process of gradually growing the context window and would generally see that the quality improves. This is because we are starting to capture the statistical patterns of words or other language structures. But at some point, the data will become so sparse that it will be difficult to learn with this simple probability model. Simple statistical language models were stuck at this point of development for a very long time. Soon I explain how modern language models overcome these difficulties, but at this stage it's important to recognize that models generally have a context window that you need to work within, and the atomic units they operate on are called tokens.
If you try to input text longer than this limit, the text will be truncated to the maximum length defined by the context window. Considerable research goes into finding methods that extend the size of context windows. However, at the time of writing, they remain a fundamental limitation, because running efficient and economical language models requires limiting the amount of text processed.
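Truncation can be pictured with a few lines of Python. This is a sketch under stated assumptions: the function name `truncate_to_context` and its `keep` parameter are hypothetical, and whether a given tool drops the start or the end of an over-long input is not specified in the text, so both options are shown.

```python
def truncate_to_context(tokens: list[str], context_window: int,
                        keep: str = "start") -> list[str]:
    """Return at most `context_window` tokens, keeping either end of the list."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    if len(tokens) <= context_window:
        return list(tokens)  # already fits, nothing is lost
    if keep == "start":
        return tokens[:context_window]
    if keep == "end":
        return tokens[-context_window:]
    raise ValueError("keep must be 'start' or 'end'")


tokens = list("language model")  # character-level tokens for illustration
print(truncate_to_context(tokens, 8))              # keeps the first 8 tokens
print(truncate_to_context(tokens, 5, keep="end"))  # keeps the last 5 tokens
```

Whichever end is kept, the discarded tokens are invisible to the model, so it is worth checking how a particular library truncates before relying on long inputs.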
Note: Tokens and Tokenizers
Text is converted to a sequence of tokens by a tokenizer. The tokenizer determines the complete vocabulary of the model. The model and tokenizer are intrinsically linked: If a token is not recognized by the tokenizer, then it is not recognized by the model. When working with any language model, you must understand the nature and limitations of the tokenizer.
Representing Meaning
The previous section described how you could gradually build up a simple language model, by learning increasingly complicated probability distributions based on a larger context window of tokens. Two key ideas should have emerged: First, the more awareness the model has of previous characters, the better you can model a language. Second, as the context window grows, so too does the number of parameters. So, the counteracting force is that you need exponentially increasing data to learn the model. It is, of course, arguable that this approach would never succeed in perfectly modeling language, because it doesn't form any kind of meaning abstraction from language; it merely models empirical probabilities of characters. However, regardless of that theoretical distinction, this approach will always fail practically, because we cannot possibly generate datasets large enough to learn the character probabilities of all possible sequences. What is required to solve this problem is some way of learning an internal representation of the language's structure, such that we can generalize to sentences not seen in the training data.
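The growth described above is easy to quantify with a short Python sketch. It assumes a vocabulary of 27 symbols (26 letters plus a space), which is an illustrative choice and not necessarily the alphabet used in the text; `table_size` and `observed_contexts` are hypothetical helper names.

```python
def table_size(vocab_size: int, context_size: int) -> int:
    """Number of probabilities in a full lookup table.

    There are vocab_size ** context_size possible contexts, and each one
    needs a probability for every possible next token.
    """
    return vocab_size ** context_size * vocab_size


def observed_contexts(text: str, context_size: int) -> int:
    """Count the distinct contexts that have a following character in the text."""
    return len({text[i:i + context_size] for i in range(len(text) - context_size)})


VOCAB_SIZE = 27  # assumption: 26 letters plus a space

for n in (1, 2, 3):
    print(n, table_size(VOCAB_SIZE, n))

print(observed_contexts("abcabc", 2))
```

Each extra character of context multiplies the table by the vocabulary size (729, then 19,683, then 531,441 entries here), while any fixed dataset contains only a limited number of distinct contexts. Most table entries are therefore never observed, which is the sparsity problem that makes this approach fail in practice.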
Learning internal representations is the purpose of the hidden layers of neural networks. Typically, they contain fewer nodes and parameters than the entire input space, thus forcing the model to learn some form of abstractions over the training data. You can improve the simple language models introduced in the previous section by including a layer of hidden nodes and imposing a learning process where the parameters are slowly adjusted using the training data. Nevertheless, these models tend to fall short of generating language that resembles human ability. Why do they fall short?
One main reason appears to be that words are, in fact, very flexible tools. We can use numerous words in multiple ways, and...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books – i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.