Enables readers to develop foundational and advanced vectorization skills for scalable data science and machine learning and to address real-world problems
Offering insights across various domains such as computer vision and natural language processing, Vectorization covers the fundamental topics of vectorization including array and tensor operations, data wrangling, and batch processing. This book illustrates how the principles discussed lead to successful outcomes in machine learning projects, serving as concrete examples for the theories explained, with each chapter including practical case studies and code implementations using NumPy, TensorFlow, and PyTorch.
Each chapter contains one or both of two types of content: an introduction to and comparison of specific operations across the numerical libraries (illustrated as tables), and case-study examples that apply the concepts introduced to solve a practical problem (presented as code blocks and figures). Readers can approach the material by reading the text descriptions, running the code blocks, or examining the figures.
Written by the developer of the first recommendation system on the Peacock streaming platform, Vectorization explores topics ranging from basic tensor operations, indexing, and linear algebra to sparse matrices, jagged tensors, and building a language model from scratch.
From the essentials of vectorization to the subtleties of advanced data structures, Vectorization is an ideal one-stop resource for both beginners and experienced practitioners, including researchers, data scientists, statisticians, and other professionals in industry, who seek academic success and career advancement.
Edward DongBo Cui is a Data Science and Machine Learning Engineering Leader who holds a PhD in Neuroscience from Case Western Reserve University, USA. Edward served as Director of Data Science at NBC Universal, building the first recommendation system on the new Peacock streaming platform. Previously, he was Lead Data Scientist at Nielsen Global Media. He is an expert in ML engineering, research, and MLOps, with a focus on driving data-centric decision-making and enhancing product innovation.
About the Author xiii
Preface xv
Acknowledgment xix
1 Introduction to Vectorization 1
1.1 What Is Vectorization 1
1.1.1 A Simple Example of Vectorization in Action 2
1.1.2 Python Can Still Be Faster! 3
1.1.3 Memory Allocation of Vectorized Operations 4
1.2 Case Study: Dense Layer of a Neural Network 6
1.3 Vectorization vs. Other Parallel Computing Paradigms 9
1.3.1 Multithreading 9
1.3.2 Multiprocessing 9
1.3.3 Multiworker Distributed Computing 13
Bibliography 16
2 Basic Tensor Operations 19
2.1 Tensor Initializers 19
2.2 Data Type and Casting 24
2.2.1 Tips on Specifying the dtypes During Tensor Initialization 27
2.2.2 Tips on Casting 27
2.3 Mathematical Operations 27
2.4 Reduction Operations 31
2.5 Value Comparison Operations 31
2.6 Logical Operations 32
2.7 Ordered Array-Adjacent Element Operations 33
2.8 Array Reversing 33
2.9 Concatenation, Stacking, and Splitting 35
2.10 Reshaping 35
2.11 Broadcasting 38
2.12 Case Studies 44
2.12.1 Image Normalization 45
2.12.2 Pearson's Correlation 46
2.12.3 Pair-wise Difference 47
2.12.4 Construction of Magic Squares 48
Bibliography 57
3 Tensor Indexing 61
3.1 Get Values at Index 61
3.1.1 Integer Indexing 61
3.1.2 Flat Index vs. Multi-index 63
3.1.3 Boolean Indexing 69
3.2 Slicing 70
3.2.1 Reusing Slice Configuration 75
3.3 Case Study: Get Consecutive Index 78
3.4 Take and Gather 80
3.4.1 Take 80
3.4.2 Take Along Axis 83
3.4.3 Gather 87
3.4.4 N-Dimensional Gather 91
3.5 Assign Values at Index 95
3.6 Put and Scatter 98
3.6.1 Put 98
3.6.2 Put Along Axis 100
3.6.3 Multi-index Scatter Replacement 101
3.6.4 Additional Scatter Operations from PyTorch 108
3.7 Case Study: Batchwise Scatter Values 113
Bibliography 115
4 Linear Algebra 119
4.1 Tensor Multiplications 119
4.2 The matmul Operation 119
4.2.1 The @ Operator 122
4.3 The tensordot Operation 123
4.3.1 Heuristics of tensordot Operations 125
4.4 Einsum 129
4.5 Case Study: Pair-wise Pearson's Cross-Correlation 134
4.6 Case Study: Hausdorff Distance 135
4.7 Common Linear Algebraic Routines 139
4.8 Case Study: Fitting Single Exponential Curves 139
Bibliography 144
5 Masking and Padding 145
5.1 Masking 145
5.1.1 Triangular and Diagonal Masks 146
5.1.2 Changing Elements Using the where Operation 146
5.1.3 Use Multiplication to Apply Masks 146
5.1.4 Use Arithmetic Operations as Boolean Operations to Apply and Combine Masks 151
5.1.5 Select Elements Based on Masking 152
5.1.6 Case Study: Top-k Masking 153
5.2 Padding 155
5.2.1 Case Study: Padding in Convolutional Neural Networks 161
5.2.2 Case Study: Truncate or Pad Sequence to Desired Length 163
5.3 Advanced Case Studies 164
5.3.1 Scaled-Dot Product Attention 164
5.3.2 Variable-Length Range via Masking 168
5.3.3 Length Regulator Module of FastSpeech 2 171
Bibliography 181
6 String Processing 183
6.1 String Data Types 183
6.1.1 NumPy String, Bytes, and Object 183
6.1.2 Pandas String 184
6.1.3 Tensorflow Bytes 186
6.1.4 PyTorch 187
6.2 String Operations 187
6.3 Case Study: Parsing DateTime from String Representations 189
6.4 Mapping Strings to Indices 194
6.4.1 NumPy np.unique 194
6.4.2 Pandas pd.Categorical 195
6.4.3 Scikit-learn sklearn.preprocessing.LabelEncoder 198
6.4.4 Tensorflow tf.lookup 198
6.4.5 TorchText torchtext.vocab 200
6.5 Case Study: Factorization Machine 201
6.5.1 Factorization Machine Model 202
6.5.2 More Efficient Optimization Criterion 202
6.5.3 Implementation of Deep Factorization Machine in Tensorflow 203
6.5.4 Training DeepFM on MovieLens 1M Dataset 209
6.6 Regular Expressions (Regex) 215
6.7 Data Serialization and Deserialization 217
Bibliography 221
7 Sparse Matrix 223
7.1 Scipy's Sparse Matrix Classes 224
7.1.1 Coordinate Sparse Matrix (coo_matrix) 224
7.1.2 Compressed Sparse Column Matrix (csc_matrix) 225
7.1.3 Compressed Sparse Row Matrix (csr_matrix) 227
7.1.4 Block Sparse Row Matrix (bsr_matrix) 228
7.1.5 Dictionary of Keys Sparse Matrix (dok_matrix) 229
7.1.6 Row-Based List of List Sparse Matrix (lil_matrix) 230
7.1.7 Diagonal Storage Sparse Matrix (dia_matrix) 232
7.1.8 Comparisons Between Different Sparse Matrix Formats 233
7.2 Sparse Matrix Broadcasting 235
7.2.1 Scalar Broadcasting 235
7.2.2 Row-wise Broadcasting 236
7.2.3 Column-wise Broadcasting 237
7.2.4 Multiplication on Sparse Indices 237
7.3 Tensorflow's Sparse Tensors 238
7.3.1 SparseTensor Class 239
7.3.2 Sparse CSR Matrix 240
7.4 PyTorch's Sparse Matrix 242
7.5 Sparse Matrix in Other Python Libraries 245
7.6 When (Not) to Use Sparse Matrix 245
7.7 Case Study: Sparse Matrix Factorization with ALS 245
7.7.1 Matrix Factorization 246
7.7.2 Parameter Updates with ALS 246
7.7.3 Adding Bias Terms to Matrix Factorization 247
7.7.4 Adding Regularization Term 249
7.7.5 Implementing ALS 250
7.7.6 Training a Model with MovieLens-100k 255
Bibliography 257
8 Jagged Tensors 261
8.1 Left Align a Sparse Tensor to Represent Ragged Tensor 263
8.2 Index to Binary Indicator 269
8.3 Case Study: Jaccard Similarities Using Sparse Matrix 271
8.4 Case Study: Batchwise Set Operations 275
8.5 Case Study: Autoencoders with Sparse Inputs 283
8.5.1 Embedding Lookup on Sparse Inputs 284
8.5.2 Inputs with Weights 287
Bibliography 293
9 Groupby, Apply, and Aggregate 295
9.1 Pandas Groupwise Operations 296
9.2 Reshaping and Windowing of Dense Tensors 298
9.3 Case Study: Vision Transformer (ViT) 305
9.4 Bucketizing Values 315
9.5 Segment-wise Aggregation 319
9.6 Case Study: EmbeddingBag 325
9.7 Case Study: Vocal Duration Constrained by Note Duration 330
9.8 Case Study: Filling of Missing Values in a Sequence 336
Bibliography 341
10 Sorting and Reordering 343
10.1 Sorting Operations 343
10.2 Case Study: Top-k Using argsort and argpartition 346
10.3 Case Study: Sort the Rows of a Matrix 349
10.4 Case Study: Reverse Padded Sequence 353
10.5 Case Study: Gumbel-Max Sampling with Weights 358
10.6 Case Study: Sorting Articles Around Anchored Advertisements 367
Bibliography 373
11 Building a Language Model from Scratch 375
11.1 Language Modeling with Transformer 375
11.1.1 Encoder and Decoder of the Transformer Architecture 376
11.1.2 Training of Transformer Models 377
11.2 Pre-LN vs. Post-LN Transformer 378
11.3 Layer Normalization 383
11.4 Positional Encoding and Embedding 385
11.4.1 Sinusoidal Positional Encoding 385
11.4.2 Position as Categorical Embeddings 387
11.4.3 Relative Positional Encoding (RPE) 388
11.4.4 Rotary Positional Encoding (RoPE) 389
11.5 Activation Functions in Feedforward Layer 395
11.6 Case Study: Training a Tiny LLaMA Model for Next Token Prediction 398
11.7 A Word on AI Safety and Alignment 410
11.8 Concluding Remarks 412
Bibliography 412
Index 419
Vectorization is a type of parallel computing paradigm that performs arithmetic operations on an array of numbers within a single data processing unit (either a CPU or a GPU) [Intel, 2022]. Modern CPUs can perform between 4 and 16 single-precision (float32) computations in parallel, depending on the type of instruction set [Intel, 2022] (see Figure 1.1).
Here, SSE, or Streaming SIMD (single-instruction-multiple-data) Extensions, uses 128-bit registers; each register holds the block of data that a single instruction operates on. This is equivalent to performing four single-precision (float32) operations or two double-precision (float64) operations at a time. Intel's AVX-512, or Advanced Vector Extensions 512, on the other hand, uses 512-bit registers and can therefore process 512 bits of data simultaneously, which is equivalent to performing 16 single-precision or 8 double-precision operations.
The term "vectorization" also refers to the process of transforming an algorithm that computes one data point at a time into an operation that calculates a collection of data simultaneously [Khartchenko, 2018]. A classic example is the dot product operation,

$c = \sum_{i=1}^{n} a_i b_i = \mathbf{a} \cdot \mathbf{b}$,

where $\mathbf{a} = [a_1, a_2, \ldots, a_n]$, $\mathbf{b} = [b_1, b_2, \ldots, b_n]$, and $c$ is a scalar.

The dot product operation on the right-hand side of the equation is the vectorized form of the summation process on the left. Instead of operating on one element of the arrays at a time, it operates on the collections, or arrays, $\mathbf{a}$ and $\mathbf{b}$.
Figure 1.1 CPU vector computations.
Another definition of the term "vectorization" is related to more recent advances in large language models: using a vector, or an array of values, to represent a word. This is more commonly known as an embedding in the literature, a concept popularized by the word2vec model in natural language processing. In Chapters 6 and 7, we will introduce the idea of representing string or categorical features as numerical values, i.e., integer indices, sparse one-hot encoding vectors, and dense embedding vectors.
Suppose we want to take the dot product of two arrays of 1 million random numbers each, drawn from the uniform distribution. Using Python's for loop, we would typically implement this algorithm as follows:
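A minimal sketch of such a loop-based implementation (the function name and listing here are illustrative, not the book's original code):

```python
import random

# Two lists of 1 million random numbers drawn from the uniform distribution.
N = 1_000_000
a = [random.random() for _ in range(N)]
b = [random.random() for _ in range(N)]

def dot_product_loop(x, y):
    """Accumulate the dot product one element pair at a time."""
    total = 0.0
    for xi, yi in zip(x, y):
        total += xi * yi
    return total

result = dot_product_loop(a, b)
```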
On my M1 MacBook Pro under Python 3.11.4, the runtime of this function serves as our baseline.
If we instead implement the algorithm using numpy with vectorization:
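A corresponding vectorized sketch with NumPy (again illustrative rather than the original listing):

```python
import numpy as np

# The same experiment with NumPy arrays.
rng = np.random.default_rng(0)
a = rng.uniform(size=1_000_000)
b = rng.uniform(size=1_000_000)

def dot_product_vectorized(x, y):
    """Compute the dot product in a single vectorized call."""
    return np.dot(x, y)

result = dot_product_vectorized(a, b)
```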
Under the same setup, the vectorized implementation completes in a small fraction of that time.
This is a ~37× speed-up by vectorization!
However, the above example does not mean that NumPy always wins: certain native Python operations are still faster than NumPy's implementations, even though NumPy is often used to replace Python for-loops with vectorized operations.
For example, suppose we would like to split a list of strings delimited by periods ("."). Using Python's list comprehension, we have:
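A sketch of the native-Python version, assuming a hypothetical list of period-delimited strings named strings:

```python
# Illustrative data: a large list of period-delimited strings.
strings = ["a.b.c", "d.e", "f.g.h.i"] * 100_000

# Native Python: split each string with a list comprehension.
split_py = [s.split(".") for s in strings]
```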
The list-comprehension version gives us our native-Python baseline runtime.
However, in comparison, if we use NumPy's "vectorized" split operation:
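A sketch using NumPy's np.char.split on an array holding the same strings:

```python
import numpy as np

# Illustrative data: the same period-delimited strings as above.
strings = ["a.b.c", "d.e", "f.g.h.i"] * 100_000
strings_np = np.array(strings)

# NumPy's "vectorized" split; it returns an object array of Python lists.
split_np = np.char.split(strings_np, sep=".")
```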
The NumPy version turns out to be about 4× slower than native Python.
When using vectorized operations, we need to consider the trade-off between memory and speed. Vectorized operations sometimes need to allocate extra memory for the results of intermediate steps, which has implications for the overall performance of the program in terms of space vs. time complexity. Consider a function in which we would like to compute the sum of the squares of the elements in an array. This can be implemented as a for-loop, where we iterate through each element, take its square, and accumulate the squared values into a total. Alternatively, the function can be implemented as a vectorized operation, where we take the square of each element first, store the results in an array, and then sum the squared array.
Let us use the memory_profiler package to examine the memory usage of each line of the two functions. We can first install the package by pip install memory_profiler==0.61.0 and load it as a plugin in a Jupyter Notebook.
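A minimal sketch of such a script, assuming hypothetical function names sum_squares_vectorized and sum_squares_loop:

```python
# memory_allocation_example.py
import numpy as np

def sum_squares_vectorized(x):
    """Square every element first (allocating an intermediate array), then sum."""
    squared = x ** 2          # intermediate array the same size as x
    return squared.sum()

def sum_squares_loop(x):
    """Iterate through each element, square it, and accumulate the total."""
    total = 0.0
    for value in x:
        total += value * value
    return total
```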
We also save the above script into a file (which we call memory_allocation_example.py), then we measure the memory trace of each function in the notebook as follows:
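In a notebook, the measurement might look like the following sketch, where %mprun is the line-by-line memory magic provided by memory_profiler and the array size is illustrative:

```python
%load_ext memory_profiler

import numpy as np
from memory_allocation_example import sum_squares_vectorized, sum_squares_loop

# Illustrative input: roughly 1 MB of float64 values.
X = np.random.uniform(size=125_000)

# Line-by-line memory trace of the vectorized implementation.
%mprun -f sum_squares_vectorized sum_squares_vectorized(X)
```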
This gives us a line-by-line memory trace of the function call.
We can also use the %timeit magic function to measure the speed of the function:
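Continuing the same sketch:

```python
%timeit sum_squares_vectorized(X)
```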
which reports the mean and standard deviation of the run time across several runs.
Similarly, let us also look at the memory trace of the for-loop implementation and test its run time in the same way:
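Continuing the sketch, we profile and time the loop-based function just as before:

```python
# Line-by-line memory trace of the for-loop implementation.
%mprun -f sum_squares_loop sum_squares_loop(X)

# Run time of the for-loop implementation.
%timeit sum_squares_loop(X)
```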
We can see that although the for-loop function is slower than the vectorized function, the for-loop shows almost no memory increase. In contrast, the vectorized solution shows a memory increase of roughly 1 MB while reducing the run time by about 30%. If X were several orders of magnitude larger than in our current example, even more memory would be consumed in exchange for speed. Hence, we need to weigh the trade-off between speed and memory (time vs. space) carefully when using vectorized operations.
Vectorization is a key skill for implementing various machine learning models, especially for deep learning algorithms. A simple neural network consists of several feed-forward layers, as illustrated in Figure 1.2.
A feed-forward layer (also called a dense layer, linear layer, perceptron layer, or hidden layer, as illustrated in Figure 1.3) has the following expression:

$y = f(xW + b)$

Notice that $xW + b$ is a linear model. Here, $x$ is the input tensor of shape (batch_size, num_inputs), $W$ is the weight matrix of shape (num_inputs, hidden_size), and $b$ is the bias vector of shape (1, hidden_size).
Figure 1.2 Multilayer perceptron.
Figure 1.3 Feed-forward layer.
Here, $f$ is a nonlinear activation function such as relu, sigmoid, or tanh, and the output $y$ has shape (batch_size, hidden_size).
Stacking multiple feed-forward layers creates the multilayer perceptron model, as illustrated by our example neural network above.
If we were to implement a feed-forward layer naively using only Python, we would need to run thousands to millions of iterations of a for-loop, depending on batch_size, num_inputs, and hidden_size. When it comes to large language models like the generative pretrained transformer (GPT), which have hundreds of billions of parameters to optimize, using for loops for such computations becomes so slow that training or even running inference with the model is infeasible. But if we take advantage of vectorization on either CPU or GPU, we see significant gains in the performance of our model. Furthermore, certain operations like matrix multiplication are inefficient if implemented naively by simply parallelizing a for-loop, so practical implementations rely on special algorithms such as Strassen's algorithm [Strassen, 1969] (or, more recently, algorithms discovered by AlphaTensor [Fawzi et al., 2022]). Implementing these algorithms as efficient instructions on either CPU or GPU is a non-trivial task in itself. Practitioners of machine learning and data science should therefore take advantage of pre-existing libraries that implement these algorithms. But to use these libraries effectively, one needs to be adept at thinking in terms of vectorization.
In the following, let's take a look at an example of how to implement the above perceptron layer using numpy.
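A minimal NumPy sketch of such a layer (names and sizes are illustrative; the book's original listings, including the tensorflow and torch variants, are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_inputs, hidden_size = 32, 128, 64

# Input batch, weight matrix, and bias with the shapes discussed above.
x = rng.normal(size=(batch_size, num_inputs))
W = rng.normal(size=(num_inputs, hidden_size))
b = np.zeros((1, hidden_size))

def relu(z):
    """Elementwise rectified linear unit."""
    return np.maximum(z, 0.0)

def feed_forward(x, W, b):
    """One dense layer: matrix multiplication, bias add, then activation."""
    return relu(x @ W + b)

y = feed_forward(x, W, b)
print(y.shape)  # (32, 64), i.e. (batch_size, hidden_size)
```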