Chapter 1
Deep Learning Fundamentals for NLP
Why have deep neural models revolutionized natural language processing, and what foundational concepts underlie their remarkable capability to transform raw text into meaning? In this chapter, we uncover the core building blocks of deep learning through the lens of NLP: the principal architectures, the representations they operate on, and the task-driven strategies that set the stage for high-performing language models. Whether your interests are theoretical or practical, here you'll find the critical insights and frameworks essential for mastering NLP in the deep learning age.
1.1 Neural Network Foundations in NLP
Neural network architectures form the backbone of modern natural language processing (NLP), each instantiating distinct inductive biases that align with varying linguistic phenomena. This section presents a comprehensive analysis of the primary architectures: feedforward neural networks, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformer-based models, focusing on their theoretical underpinnings, computational characteristics, and practical applications in language tasks.
Feedforward neural networks, or multilayer perceptrons, represent the earliest form of deep learning applied to NLP. These models consist of fully connected layers transforming input vectors through nonlinear activations, devoid of temporal or spatial memory. Their core limitation in language tasks is the absence of sequence modeling capabilities, rendering them unsuitable for capturing contextual dependencies inherent in linguistic data. Despite this, they serve as valuable components in embedding projection layers and feature transformation modules due to their straightforward differentiability and universal function approximation properties.
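To make this role concrete, the following sketch (using PyTorch, with hypothetical layer sizes) shows a small feedforward module of the kind used for embedding projection or feature transformation; note that it processes each input vector independently and has no notion of token order.

```python
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Two-layer feedforward block of the kind used to project or transform
    embedding vectors; it treats every input vector independently and has
    no notion of token order."""
    def __init__(self, in_dim=300, hidden_dim=512, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),                       # nonlinear activation
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        # x: (batch, in_dim)
        return self.net(x)

# Hypothetical usage: project a batch of 300-d word vectors down to 256 dimensions.
mlp = ProjectionMLP()
word_vectors = torch.randn(8, 300)
projected = mlp(word_vectors)                # shape: (8, 256)
```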
Convolutional neural networks introduce localized connectivity and parameter sharing via convolutional filters, enabling the extraction of hierarchical features from input sequences. In NLP, one-dimensional convolutions operate over word embeddings or character-level representations, capturing local n-gram patterns effectively. Their inductive bias towards locality and compositionality aligns well with phenomena such as morphology and phrase structure. Computationally, CNNs are highly parallelizable and less susceptible to vanishing or exploding gradients compared to recurrent architectures. However, their limited receptive field requires deeper stacking or dilated convolutions to encompass long-range dependencies, which are crucial in many language understanding tasks.
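The sketch below, again in PyTorch with illustrative filter widths and dimensions, shows a minimal one-dimensional convolutional encoder over word embeddings: each filter width acts as a soft n-gram detector, and max-pooling over time yields a fixed-size sentence feature.

```python
import torch
import torch.nn as nn

class ConvNgramEncoder(nn.Module):
    """1-D convolutions over a sequence of word embeddings; each filter width
    acts as a soft n-gram detector, and max-pooling over time produces a
    fixed-size feature vector per sentence."""
    def __init__(self, emb_dim=128, num_filters=64, widths=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, kernel_size=w) for w in widths]
        )

    def forward(self, emb):
        # emb: (batch, seq_len, emb_dim); Conv1d expects (batch, channels, seq_len)
        x = emb.transpose(1, 2)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)      # (batch, num_filters * len(widths))

encoder = ConvNgramEncoder()
embeddings = torch.randn(4, 20, 128)         # 4 sentences, 20 tokens each
sentence_features = encoder(embeddings)      # shape: (4, 192)
```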
Recurrent neural networks address sequence modeling by maintaining hidden states that evolve as tokens are processed sequentially. Theoretical frameworks model them as discrete-time dynamical systems, where each hidden state approximates a function of the previous state and current input. This structure naturally captures temporal dependencies and is well-suited for modeling syntax and sequential semantics. Classical RNNs encountered fundamental training obstacles due to gradient instability over long sequences, leading to variants such as long short-term memory (LSTM) and gated recurrent units (GRU). These architectures incorporate gating mechanisms that regulate information flow, enhancing the modeling of long-distance dependencies and mitigating vanishing gradients. Nonetheless, RNNs impose inherent sequential computation constraints, limiting parallelizability and scalability on large datasets.
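As a minimal illustration, the following PyTorch sketch (toy vocabulary and dimensions) wraps an LSTM as a sequence encoder; the gated hidden and cell states are updated token by token, which is precisely the sequential constraint discussed above.

```python
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    """Recurrent sequence encoder: tokens are embedded and processed in order,
    with gated hidden and cell states carrying context across time steps."""
    def __init__(self, vocab_size=10_000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer indices
        emb = self.embed(token_ids)
        outputs, (h_n, _) = self.lstm(emb)
        # outputs: per-token hidden states (batch, seq_len, hidden_dim)
        # h_n[-1]: final hidden state, a fixed-size summary of the sequence
        return outputs, h_n[-1]

encoder = LSTMEncoder()
tokens = torch.randint(0, 10_000, (4, 15))   # 4 sequences of 15 token ids
per_token, summary = encoder(tokens)         # (4, 15, 256), (4, 256)
```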
Transformer architectures, introduced as attention-centric models, revolutionized NLP by dispensing with recurrence and convolution in favor of self-attention mechanisms. The theoretical innovation lies in modeling pairwise interactions between all input tokens via scaled dot-product attention, yielding context-aware representations that dynamically integrate information across an entire sequence. This global receptive field and inherently parallelizable computation enable transformers to capture long-range dependencies more efficiently than RNNs or CNNs. Positional encodings compensate for the absence of recurrence by injecting sequential order information, ensuring the model remains aware of token positions.
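These two components can be sketched compactly. The PyTorch code below implements scaled dot-product attention and fixed sinusoidal positional encodings; the single-head, unbatched usage and the chosen dimensions are illustrative simplifications of a full transformer layer.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # pairwise token interactions
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encodings that inject token-order information."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Illustrative usage: self-attention over a short sequence of 64-d token vectors.
x = torch.randn(1, 10, 64) + sinusoidal_positional_encoding(10, 64)
contextualized = scaled_dot_product_attention(x, x, x)    # (1, 10, 64)
```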
The transition from classical RNNs to attention-focused frameworks reflects a paradigm shift in inductive biases. While RNNs embed strong sequential inductive assumptions through their temporal state transitions, transformers adopt a more flexible approach that supports direct modeling of arbitrary token interactions. This flexibility facilitates adaptation to diverse linguistic phenomena, including dependency structures and discourse-level relations, which may not exhibit strictly left-to-right sequential dynamics.
Adaptations of these neural architectures to language-specific tasks involve fine-tuning inductive biases and architectural parameters. For example, CNNs have been tailored for morphological inflection and character-level processing, capturing local patterns in morphosyntactic units. RNNs remain prevalent in tasks requiring explicit sequential generation, such as language modeling and machine translation, although transformers have increasingly supplanted them due to superior efficiency and performance on large-scale corpora. Transformer variants are further specialized through pretraining objectives and architectural modifications to address multilinguality, low-resource adaptation, and model compression.
Despite their successes, these neural architectures encounter limitations in real-world, large-scale NLP deployments. Feedforward networks lack sufficient representational power for contextualized understanding, while CNNs struggle to model long-range dependencies without considerable depth. RNNs, constrained by sequential processing, are difficult to accelerate on parallel hardware and degrade on very long sequences. Transformers, albeit versatile and performant, exhibit quadratic computational complexity with respect to input length, hindering scalability to very long documents without approximation techniques such as sparse attention or recurrence-augmented models.
The interplay between architectural choices and linguistic phenomena continues to motivate research in advancing neural foundations for NLP. A nuanced understanding of inductive biases and computational trade-offs informs the design of hybrid models and continual refinements. Integrating convolutions for local feature extraction within transformer layers, employing recurrence for moderate-length dependencies, or incorporating syntactic priors are active directions to balance generality, efficiency, and linguistic fidelity.
Feedforward, convolutional, recurrent, and transformer-based models constitute a layered hierarchy of neural architectures, each with distinct theoretical frameworks and suitability for capturing specific aspects of language. Their study elucidates the mapping between neural computation and linguistic structure, guiding the development of increasingly robust and scalable NLP systems in diverse application domains.
1.2 Principles of Word and Sentence Representation
Natural Language Processing (NLP) relies fundamentally on how textual data is represented for computational models. The transition from symbolic to continuous vector representations marks a watershed in this domain, enabling machines to capture semantic and syntactic regularities in language more effectively. Early approaches treated words as discrete atomic units represented by one-hot vectors, which are sparse, high-dimensional vectors with a single nonzero entry corresponding to the vocabulary index of a word. Formally, for a vocabulary of size $V$, the one-hot encoding of word $w_i$ is a vector $\mathbf{e}_i \in \mathbb{R}^V$ where
$$(\mathbf{e}_i)_j = \begin{cases} 1 & \text{if } j = i, \\ 0 & \text{otherwise.} \end{cases}$$
Though straightforward, such encodings lack the ability to express semantic similarity, as the vectors are orthogonal and equidistant in both Euclidean and cosine metrics. Consequently, models using one-hot vectors cannot exploit distributional properties of language.
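A minimal NumPy sketch with a toy four-word vocabulary makes this explicit: every pair of distinct one-hot vectors has zero dot product, so no similarity structure is available to the model.

```python
import numpy as np

vocab = ["the", "cat", "sat", "dog"]          # toy vocabulary, V = 4
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse V-dimensional encoding e_i: 1 at the word's index, 0 elsewhere."""
    e = np.zeros(len(vocab))
    e[index[word]] = 1.0
    return e

# Distinct one-hot vectors are orthogonal, so their dot product (and hence
# cosine similarity) is always zero -- "cat" is no closer to "dog" than to "the".
print(one_hot("cat") @ one_hot("dog"))        # 0.0
```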
Dense, Distributed Word Embeddings
The advent of dense word embeddings remedied these limitations by embedding words into a low-dimensional continuous vector space $\mathbb{R}^d$ with $d \ll V$. Methods such as Word2Vec [?] and GloVe [?] operationalize the distributional hypothesis by learning word vectors that capture co-occurrence statistics from large corpora, enabling semantically and syntactically related words to lie close to each other in the embedding space.
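The following sketch uses small hand-crafted vectors (purely illustrative stand-ins, not learned embeddings) to show how cosine similarity in a dense space can reflect relatedness in a way one-hot vectors cannot.

```python
import numpy as np

# Toy hand-crafted 4-d vectors standing in for learned Word2Vec/GloVe embeddings.
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.1, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9, 0.5]),
    "pear":  np.array([0.1, 0.0, 0.8, 0.6]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Related words have high cosine similarity; unrelated words do not.
print(cosine(embeddings["king"], embeddings["queen"]))   # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))   # much lower
```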
Word2Vec introduces two architectures: Continuous Bag-of-Words (CBOW) and Skip-Gram. The Skip-Gram model optimizes the objective of predicting context words given a target word, maximizing
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t),$$
where $T$ is the corpus length and $c$ the context window size. The conditional probability $p(w_O \mid w_I)$ is modeled with a softmax over the dot product of embedding vectors. This training yields representations wherein linear operations can capture semantic relations such as analogies:
$$\mathbf{v}(\text{king}) - \mathbf{v}(\text{man}) + \mathbf{v}(\text{woman}) \approx \mathbf{v}(\text{queen}).$$
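The sketch below expresses the Skip-Gram log-probability with a full softmax in PyTorch over a toy vocabulary. Practical implementations replace the full softmax with negative sampling or a hierarchical softmax for efficiency, so this should be read as an illustration of the objective rather than a production trainer; all sizes and names are hypothetical.

```python
import torch
import torch.nn.functional as F

V, d = 1000, 50                        # toy vocabulary size and embedding dimension
in_emb = torch.nn.Embedding(V, d)      # target-word ("input") vectors w_I
out_emb = torch.nn.Embedding(V, d)     # context-word ("output") vectors w_O

def skipgram_log_prob(target_ids, context_ids):
    """log p(w_O | w_I) via a full softmax over dot products between the
    target embedding and every output embedding."""
    w_i = in_emb(target_ids)                        # (batch, d)
    scores = w_i @ out_emb.weight.t()               # (batch, V)
    return F.log_softmax(scores, dim=-1).gather(
        1, context_ids.unsqueeze(1)).squeeze(1)     # (batch,)

# One (target, context) pair per training example, drawn from a window of size c;
# training maximizes the average of these log-probabilities over the corpus.
targets = torch.randint(0, V, (8,))
contexts = torch.randint(0, V, (8,))
loss = -skipgram_log_prob(targets, contexts).mean()
loss.backward()
```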
GloVe constructs...