Markov Decision Processes

Name: Markov Decision Processes | Discrete Stochastic Dynamic Programming
Brand: Wiley
Price: 120.99 EUR
Availability: OnlineOnly

Discrete Stochastic Dynamic Programming

Martin L. Puterman(Author)

Wiley (Publisher)

Published on 28. August 2014

672 pages

E-Book

ePUB with Adobe-DRM

System requirements

978-1-118-62587-3 (ISBN)

€120.99incl. 7% vat

System requirements

for ePUB with Adobe-DRM

E-Book Single Licence

Available for download

Description

More details

Other editions

Person

Content

Preface. 1. Introduction. 2. Model Formulation. 3. Examples. 4. Finite-Horizon Markov Decision Processes. 5. Infinite-Horizon Models: Foundations. 6. Discounted Markov Decision Problems. 7. The Expected Total-Reward. Criterion. 8. Average Reward and Related Criteria. 9. The Average Reward Criterion-Multichain and CommunicatingModels. 10. Sensitive Discount Optimality. 11. Continuous-Time Models. Afterword. Notation. Appendix A. Markov Chains. Appendix B. Semicontinuous Functions. Appendix C. Normed Linear Spaces. Appendix D. Linear Programming. Bibliography. Index.

CHAPTER 2 Model Formulation

This chapter introduces the basic components of a Markov decision process and discusses some mathematical and notational subtleties. Chapters 1 and 3 contain many examples of Markov decision processes. We encourage you to refer to those examples often to gain a clear understanding of the Markov decision process model. Section 2.2 illustrates these concepts and their interrelationship in the context of a one-period model.

A Markov decision process model consists of five elements: decision epochs, states, actions, transition probabilities, and rewards. We describe these in detail below.

2.1 PROBLEM DEFINITION AND NOTATION

A decision maker, agent, or controller (who we refer to as he with no sexist overtones intended) is faced with the problem, or some might say, the opportunity, of influencing the behavior of a probabilistic system as it evolves through time. He does this by making decisions or choosing actions. His goal is to choose a sequence of actions which causes the system to perform optimally with respect to some predetermined performance criterion. Since the system we model is ongoing, the state of the system prior to tomorrow’s decision depends on today’s decision. Consequently, decisions must not be made myopically, but must anticipate the opportunities and costs (or rewards) associated with future system states.

2.1.1 Decision Epochs and Periods

Decisions are made at points of time referred to as decision epochs. Let T denote the set of decision epochs. This subset of the non-negative real line may be classified in two ways: as either a discrete set or a continuum, and as either a finite or an infinite set. When discrete, decisions are made at all decision epochs. When a continuum, decisions may be made at When decisions are made continuously, the sequential decision problems are best analyzed using control theory methods based on dynamic system equations.

1. all decision epochs (continuously), 2. random points of time when certain events occur, such as arrivals to a queueing system, or 3. opportune times chosen by the decision maker.

Figure 2.1.1 Decision Epochs and Periods.

In discrete time problems, time is divided into periods or stages. We formulate models so that a decision epoch corresponds to the beginning of a period (see Fig. 2.1.1). The set of decision epochs is either finite, in which case T ≡ {1, 2,…, N} for some integer N < ∞, or infinite, in which case T ≡ {1, 2,…}. We write T = {1, 2,…, N}, N ≤ ∞ to include both cases. When T is an interval, we denote it by either T = [0, N] or T = [0, ∞). Elements of T (decision epochs) will be denoted by t and usually referred to as “time t.” When N is finite, the decision problem will be called a finite horizon problem; otherwise it will be called an infinite horizon problem. Most of this book will focus on infinite horizon models. We adopt the convention that, in finite horizon problems, decisions are not made at decision epoch N: we include it for evaluation of the final system state. Consequently, the last decision is made at decision epoch N-1. Frequently we refer to this as an N-1 period problem.

The primary focus of this book will be models with discrete T. A particular continuous time model (a semi-Markov decision process) will be discussed (Chapter 11).

2.1.2 State and Action Sets

At each decision epoch, the system occupies a state. We denote the set of possible system states by S. If, at some decision epoch, the decision maker observes the system in state s ∈ S, he may choose action a from the set of allowable actions in state s, As. Let A = ∪ s ∈ S As (Fig. 2.1.2.) Note we assume that S and As do not vary with t. We expand on this point below.

The sets S and As may each be either

1. arbitrary finite sets, 2. arbitrary countably infinite sets, 3. compact subsets of finite dimensional Euclidean space, or 4. non-empty Borel subsets of complete, separable metric spaces.

In nondiscrete settings, many subtle mathematical issues arise which, while interesting, detract from the main ideas of Markov decision process theory. We expand on such issues in Section 2.3 and other sections of this book. These more technical sections will be indicated by asterisks. Otherwise, we assume that S and As are discrete (finite or countably infinite) unless explicitly noted.

Actions may be chosen either randomly or deterministically. Denote by the collection of probability distributions on (Borel) subsets of As and by the set of probability distributions on (Borel) subsets of A. (We may regard as an element of with support As.) Choosing actions randomly means selecting a probability distribution , in which case action a is selected with probability q(a). Degenerate probability distributions correspond to deterministic action choice.

Figure 2.1.2 Relationship between action sets when S is a continuum.

This model may be generalized by allowing both or either of S and As to depend explicitly on t. Such generality is unnecessary for most applications and has little effect on theory. It can be included in the formulation above by setting S = ∪ t ∈ T St, where St denotes the set of possible states at time t and As= ∪ t ∈ T As, t, where As, t denotes the set of allowable actions in state s at time t. Transition probabilities and rewards (defined below) must be modified accordingly.

We might restrict the model by requiring that As = A for all s ∈S. This simplifies notation and doesn’t impact theory but limits the applicability of the model since applications as simple as those in Sections 3.1 and 3.3 are not easily included. (Problem 2.7)

Sometimes a distinction is made between a model in which decisions are made at every decision epoch, and one in which the system is uncontrolled and interventions or decisions are made only when necessary. We take the view that this distinction is unnecessary; the latter model is a special case of that model proposed here, in which, for each s ∈ S, As includes an action “do not intervene” (Problem 2.3).

2.1.3 Rewards and Transition Probabilities

As a result of choosing action a ∈ As in state s at decision epoch t,

1. the decision maker receives a reward, rt(s, a) and 2. the system state at the next decision epoch is determined by the probability distribution pt(·|s, a).

Let the real-valued function rt(s, a) defined for s ∈ S and a ∈ As denote the value at time t of the reward received in period t. When positive, rt(s, a) may be regarded as income, and when negative as cost. From the perspective of the models studied herein, it is immaterial how the reward is accrued during the period. We only require that its value or expected value be known before choosing an action, and that it not be effected by future actions. The reward might be

1. a lump sum received at a fixed or random time prior to the next decision epoch, 2. accrued continuously throughout the current period, 3. a random quantity that depends on the system state at the subsequent decision epoch, or 4. a combination of the above.

When the reward depends on the state of the system at the next decision epoch, we let rt(s, a, j) denote the value at time t of the reward received when the state of the system at decision epoch t is s, action a ∈ As is selected, and the system occupies state j at decision epoch t + 1. Its expected value at decision epoch t may be evaluated by computing

(2.1.1)

In (2.1.1), the non-negative function pt(j|s, a) denotes the probability that the system is in state j ∈ S at time t + 1, when the decision maker chooses action a ∈As in state s at time t. The function pt(j|s, a) is called a transition probability function. Note that many system transitions might occur in the time period between decision epoch t and decision epoch t + 1. We formulate the model so that transitions which occur between decision epochs do not influence the decision maker. Under most notions of optimality, all of the information necessary to make a decision at time t is summarized in rt(s, a) and pt(j|s, a);...

Content (EPUB)

System requirements

Save as PDF Copy link into clipboard

Schweitzer Fachinformationen

Markov Decision Processes

Description

More details

Other editions

Additional editions

Person

Content

CHAPTER 2

Model Formulation

2.1 PROBLEM DEFINITION AND NOTATION

2.1.1 Decision Epochs and Periods

2.1.2 State and Action Sets

2.1.3 Rewards and Transition Probabilities

System requirements