Reinforcement Learning and Approximate Dynamic Programming for Feedback Control

Frank L. Lewis, Derong Liu (Editors)

Wiley-IEEE Press

Published on 28 January 2013

648 pages

E-Book

ePUB with Adobe-DRM

978-1-118-45397-1 (ISBN)

€145.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

PREFACE xix

CONTRIBUTORS xxiii

PART I FEEDBACK CONTROL USING RL AND ADP

1. Reinforcement Learning and Approximate Dynamic Programming (RLADP)-Foundations, Common Misconceptions, and the Challenges Ahead 3
Paul J. Werbos

1.1 Introduction 3

1.2 What is RLADP? 4

1.3 Some Basic Challenges in Implementing ADP 14

2. Stable Adaptive Neural Control of Partially Observable Dynamic Systems 31
J. Nate Knight and Charles W. Anderson

2.1 Introduction 31

2.2 Background 32

2.3 Stability Bias 35

2.4 Example Application 38

3. Optimal Control of Unknown Nonlinear Discrete-Time Systems Using the Iterative Globalized Dual Heuristic Programming Algorithm 52
Derong Liu and Ding Wang

3.1 Background Material 53

3.2 Neuro-Optimal Control Scheme Based on the Iterative ADP Algorithm 55

3.3 Generalization 67

3.4 Simulation Studies 68

3.5 Summary 74

4. Learning and Optimization in Hierarchical Adaptive Critic Design 78
Haibo He, Zhen Ni, and Dongbin Zhao

4.1 Introduction 78

4.2 Hierarchical ADP Architecture with Multiple-Goal Representation 80

4.3 Case Study: The Ball-and-Beam System 87

4.4 Conclusions and Future Work 94

5. Single Network Adaptive Critics Networks-Development, Analysis, and Applications 98
Jie Ding, Ali Heydari, and S.N. Balakrishnan

5.1 Introduction 98

5.2 Approximate Dynamic Programming 100

5.3 SNAC 102

5.4 J-SNAC 104

5.5 Finite-SNAC 108

5.6 Conclusions 116

6. Linearly Solvable Optimal Control 119
K. Dvijotham and E. Todorov

6.1 Introduction 119

6.2 Linearly Solvable Optimal Control Problems 123

6.3 Extension to Risk-Sensitive Control and Game Theory 130

6.4 Properties and Algorithms 134

6.5 Conclusions and Future Work 139

7. Approximating Optimal Control with Value Gradient Learning 142
Michael Fairbank, Danil Prokhorov, and Eduardo Alonso

7.1 Introduction 142

7.2 Value Gradient Learning and BPTT Algorithms 144

7.3 A Convergence Proof for VGL(1) for Control with Function Approximation 148

7.4 Vertical Lander Experiment 154

7.5 Conclusions 159

8. A Constrained Backpropagation Approach to Function Approximation and Approximate Dynamic Programming 162
Silvia Ferrari, Keith Rudd, and Gianluca Di Muro

8.1 Background 163

8.2 Constrained Backpropagation (CPROP) Approach 163

8.3 Solution of Partial Differential Equations in Nonstationary Environments 170

8.4 Preserving Prior Knowledge in Exploratory Adaptive Critic Designs 174

8.5 Summary 179

9. Toward Design of Nonlinear ADP Learning Controllers with Performance Assurance 182
Jennie Si, Lei Yang, Chao Lu, Kostas S. Tsakalis, and Armando A. Rodriguez

9.1 Introduction 183

9.2 Direct Heuristic Dynamic Programming 184

9.3 A Control Theoretic View on the Direct HDP 186

9.4 Direct HDP Design with Improved Performance Case 1-Design Guided by a Priori LQR Information 193

9.5 Direct HDP Design with Improved Performance Case 2-Direct HDP for Coordinated Damping Control of Low-Frequency Oscillation 198

9.6 Summary 201

10. Reinforcement Learning Control with Time-Dependent Agent Dynamics 203
Kenton Kirkpatrick and John Valasek

10.1 Introduction 203

10.2 Q-Learning 205

10.3 Sampled Data Q-Learning 209

10.4 System Dynamics Approximation 213

10.5 Closing Remarks 218

11. Online Optimal Control of Nonaffine Nonlinear Discrete-Time Systems without Using Value and Policy Iterations 221
Hassan Zargarzadeh, Qinmin Yang, and S. Jagannathan

11.1 Introduction 221

11.2 Background 224

11.3 Reinforcement Learning Based Control 225

11.4 Time-Based Adaptive Dynamic Programming-Based Optimal Control 234

11.5 Simulation Result 247

12. An Actor-Critic-Identifier Architecture for Adaptive Approximate Optimal Control 258
S. Bhasin, R. Kamalapurkar, M. Johnson, K.G. Vamvoudakis, F.L. Lewis, and W.E. Dixon

12.1 Introduction 259

12.2 Actor-Critic-Identifier Architecture for HJB Approximation 260

12.3 Actor-Critic Design 263

12.4 Identifier Design 264

12.5 Convergence and Stability Analysis 270

12.6 Simulation 274

12.7 Conclusion 275

13. Robust Adaptive Dynamic Programming 281
Yu Jiang and Zhong-Ping Jiang

13.1 Introduction 281

13.2 Optimality Versus Robustness 283

13.3 Robust-ADP Design for Disturbance Attenuation 288

13.4 Robust-ADP for Partial-State Feedback Control 292

13.5 Applications 296

13.6 Summary 300

PART II LEARNING AND CONTROL IN MULTIAGENT GAMES

14. Hybrid Learning in Stochastic Games and Its Application in Network Security 305
Quanyan Zhu, Hamidou Tembine, and Tamer Basar

14.1 Introduction 305

14.2 Two-Person Game 308

14.3 Learning in NZSGs 310

14.4 Main Results 314

14.5 Security Application 322

14.6 Conclusions and Future Works 326

15. Integral Reinforcement Learning for Online Computation of Nash Strategies of Nonzero-Sum Differential Games 330
Draguna Vrabie and F.L. Lewis

15.1 Introduction 331

15.2 Two-Player Games and Integral Reinforcement Learning 333

15.3 Continuous-Time Value Iteration to Solve the Riccati Equation 337

15.4 Online Algorithm to Solve Nonzero-Sum Games 339

15.5 Analysis of the Online Learning Algorithm for NZS Games 342

15.6 Simulation Result for the Online Game Algorithm 345

15.7 Conclusion 347

16. Online Learning Algorithms for Optimal Control and Dynamic Games 350
Kyriakos G. Vamvoudakis and Frank L. Lewis

16.1 Introduction 350

16.2 Optimal Control and the Continuous Time Hamilton-Jacobi-Bellman Equation 352

16.3 Online Solution of Nonlinear Two-Player Zero-Sum Games and Hamilton-Jacobi-Isaacs Equation 360

16.4 Online Solution of Nonlinear Nonzero-Sum Games and Coupled Hamilton-Jacobi Equations 366

PART III FOUNDATIONS IN MDP AND RL

17. Lambda-Policy Iteration: A Review and a New Implementation 381
Dimitri P. Bertsekas

17.1 Introduction 381

17.2 Lambda-Policy Iteration without Cost Function Approximation 386

17.3 Approximate Policy Evaluation Using Projected Equations 388

17.4 Lambda-Policy Iteration with Cost Function Approximation 395

17.5 Conclusions 406

18. Optimal Learning and Approximate Dynamic Programming 410
Warren B. Powell and Ilya O. Ryzhov

18.1 Introduction 410

18.2 Modeling 411

18.3 The Four Classes of Policies 412

18.4 Basic Learning Policies for Policy Search 416

18.5 Optimal Learning Policies for Policy Search 421

18.6 Learning with a Physical State 427

19. An Introduction to Event-Based Optimization: Theory and Applications 432
Xi-Ren Cao, Yanjia Zhao, Qing-Shan Jia, and Qianchuan Zhao

19.1 Introduction 432

19.2 Literature Review 433

19.3 Problem Formulation 434

19.4 Policy Iteration for EBO 435

19.5 Example: Material Handling Problem 441

19.6 Conclusions 448

20. Bounds for Markov Decision Processes 452
Vijay V. Desai, Vivek F. Farias, and Ciamac C. Moallemi

20.1 Introduction 452

20.2 Problem Formulation 455

20.3 The Linear Programming Approach 456

20.4 The Martingale Duality Approach 458

20.5 The Pathwise Optimization Method 461

20.6 Applications 463

20.7 Conclusion 470

21. Approximate Dynamic Programming and Backpropagation on Timescales 474
John Seiffertt and Donald Wunsch

21.1 Introduction: Timescales Fundamentals 474

21.2 Dynamic Programming 479

21.3 Backpropagation 485

21.4 Conclusions 492

22. A Survey of Optimistic Planning in Markov Decision Processes 494
Lucian Busoniu, Remi Munos, and Robert Babuška

22.1 Introduction 494

22.2 Optimistic Online Optimization 497

22.3 Optimistic Planning Algorithms 500

22.4 Related Planning Algorithms 509

22.5 Numerical Example 510

23. Adaptive Feature Pursuit: Online Adaptation of Features in Reinforcement Learning 517
Shalabh Bhatnagar, Vivek S. Borkar, and L.A. Prashanth

23.1 Introduction 517

23.2 The Framework 520

23.3 The Feature Adaptation Scheme 522

23.4 Convergence Analysis 525

23.5 Application to Traffic Signal Control 527

23.6 Conclusions 532

24. Feature Selection for Neuro-Dynamic Programming 535
Dayu Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana

24.1 Introduction 535

24.2 Optimality Equations 536

24.3 Neuro-Dynamic Algorithms 542

24.4 Fluid Models 551

24.5 Diffusion Models 554

24.6 Mean Field Games 556

24.7 Conclusions 557

25. Approximate Dynamic Programming for Optimizing Oil Production 560
Zheng Wen, Louis J. Durlofsky, Benjamin Van Roy, and Khalid Aziz

25.1 Introduction 560

25.2 Petroleum Reservoir Production Optimization Problem 562

25.3 Review of Dynamic Programming and Approximate Dynamic Programming 564

25.4 Approximate Dynamic Programming Algorithm for Reservoir Production Optimization 566

25.5 Simulation Results 573

25.6 Concluding Remarks 578

26. A Learning Strategy for Source Tracking in Unstructured Environments 582
Titus Appel, Rafael Fierro, Brandon Rohrer, Ron Lumia, and John Wood

26.1 Introduction 582

26.2 Reinforcement Learning 583

26.3 Light-Following Robot 589

26.4 Simulation Results 592

26.5 Experimental Results 595

26.6 Conclusions and Future Work 599

References 599

INDEX 601

Chapter 1

Reinforcement Learning and Approximate Dynamic Programming (RLADP)—Foundations, Common Misconceptions, and the Challenges Ahead

Paul J. Werbos

National Science Foundation (NSF), Arlington, VA, USA

Abstract

Many new formulations of reinforcement learning and approximate dynamic programming (RLADP) have appeared in recent years, as it has grown in control applications, control theory, operations research, computer science, robotics, and efforts to understand brain intelligence. The chapter reviews the foundations and challenges common to all these areas, in a unified way but with reference to their variations. It highlights cases where experience in one area sheds light on obstacles or common misconceptions in another. Many common beliefs about the limits of RLADP are based on such obstacles and misconceptions, for which solutions already exist. Above all, this chapter pinpoints key opportunities for future research important to the field as a whole and to the larger benefits it offers.

1.1 Introduction

The field of reinforcement learning and approximate dynamic programming (RLADP) has undergone enormous expansion since about 1988 [1], the year of the first NSF workshop on Neural Networks for Control, which evaluated RLADP as one of several important new tools for intelligent control, with or without neural networks. Since then, RLADP has grown enormously in many disciplines of engineering, computer science, and cognitive science, especially in neural networks, control engineering, operations research, robotics, machine learning, and efforts to reverse engineer the higher intelligence of the brain. In 1988, when I began funding this area, many people viewed the area as a small and curious niche within a small niche, but by the year 2006, when the Directorate of Engineering at NSF was reorganized, many program directors said “we all do ADP now.”

Many new tools, serious applications, and stability theorems have appeared, and are still appearing, in ever greater numbers. But at the same time, a wide variety of misconceptions about RLADP have appeared, even within the field itself. The sheer variety of methods and approaches has made it ever more difficult for people to appreciate the underlying unity of the field and of the mathematics, and to take advantage of the best tools and concepts from all parts of the field. At NSF, I have often seen cases where the most advanced and accomplished researchers in the field have become stuck because of fundamental questions or assumptions that were taken care of 30 years before, in a different part of the field. The goal of this chapter is to provide a kind of unified view of the past, present, and future of this field, to address those challenges. I will review many points that, though basic, continue to be obstacles to progress. I will also focus on the larger, long-term research goal of building real-time learning systems which can cope effectively with the degree of system complexity, nonlinearity, random disturbance, computer hardware complexity, and partial observability which even a mouse brain somehow seems to be able to handle [2]. I will also try to clarify issues of notation that have become more and more of a problem as the field grows more diverse. I will try to make this chapter accessible to people across multiple disciplines, but will often make side comments for specialists in different disciplines, as in the next paragraph.

Optimal control, robust control, and adaptive control are often seen as the three main pillars of modern control theory. ADP may be seen as part of optimal control, the part that seeks computationally feasible general methods for the nonlinear stochastic case. It may be seen as a computational tool to find the most accurate possible solutions, subject to computational constraints, to the HJB equation, as required by general nonlinear robust control. It may be formulated as an extension of adaptive control which, because of the implicit “look ahead,” achieves stability under much weaker conditions than the well-known forms of direct and indirect adaptive control. The most impressive practical applications so far have involved highly nonlinear challenges, such as missile interception [3] and continuous production of carbon–carbon thermoplastic parts [4].

1.2 What is RLADP?

1.2.1 Definition of RLADP and the Task it Addresses

The term “RLADP” is a broad and inclusive term, attempting to unite several overlapping strands of research and technology, such as adaptive critics, adaptive dynamic programming (ADP), approximate dynamic programming (ADP), and reinforcement learning (RL).

Because the history through 2005 was very complex [3, 4], it is easier to focus first on one of the core tasks that ADP attempts to solve. Suppose that we are given a stochastic system defined by:

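$$X(t+1) = F\big(X(t),\, u(t),\, e_1(t)\big) \tag{1.1}$$

$$Y(t) = H\big(X(t),\, e_2(t)\big) \tag{1.2}$$

and our goal at every time t is to pick u(t) so as to maximize:

$$\left\langle \sum_{\tau=t}^{T} \frac{U\big(Y(\tau),\, u(\tau)\big)}{(1+r)^{\tau-t}} \right\rangle \tag{1.3}$$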

where r is a discount rate or interest rate, which may be zero or greater than zero; T is a terminal time, which may be finite or infinite; X(t) represents the actual state of the system (“the objective real world”) at time t; Y(t) represents what we directly observe about the system at time t; u(t) represents the actions or control we get to decide on at each time t; U represents our utility function, following the definitions of von Neumann and Morgenstern [5]; e1(t) and e2(t) are vectors or collections of random numbers; and <> is notation from physics for expectation value.

This task is called a Partially Observed Markov Decision Problem (POMDP), because any system of X(t) governed by Equation (1.1) is a Markov process.
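The task can be made concrete with a small simulation. The sketch below is only an illustration: it assumes the state update has the form X(t+1) = F(X(t), u(t), e1(t)) and the observation has the form Y(t) = H(X(t), e2(t)), and it estimates the expected discounted utility of one fixed feedback rule by Monte Carlo averaging. The particular functions `F`, `H`, `U`, and `policy`, and the Gaussian noise, are hypothetical toy choices and do not come from the chapter.

```python
import random

def F(x, u, e1):
    # Hypothetical toy dynamics: scalar linear state with additive noise e1.
    return 0.9 * x + u + e1

def H(x, e2):
    # Hypothetical observation: the state corrupted by additive noise e2.
    return x + e2

def U(y, u):
    # Hypothetical utility: penalize observed deviation from zero and control effort.
    return -(y * y) - 0.1 * (u * u)

def policy(y):
    # Hypothetical fixed feedback rule that uses only the observation Y(t).
    return -0.5 * y

def discounted_utility(x0, r, T, rng, noise=0.1):
    """One sampled value of sum over tau = 0..T of U(Y(tau), u(tau)) / (1 + r)**tau."""
    x, total = x0, 0.0
    for tau in range(T + 1):
        y = H(x, rng.gauss(0.0, noise))      # observe
        u = policy(y)                        # act on the observation only
        total += U(y, u) / (1.0 + r) ** tau  # discount by the interest rate r
        x = F(x, u, rng.gauss(0.0, noise))   # the hidden state moves on
    return total

def expected_utility(x0, r, T, n_runs=2000, seed=0, noise=0.1):
    """Monte Carlo estimate of the expectation <...> over the random inputs."""
    rng = random.Random(seed)
    samples = (discounted_utility(x0, r, T, rng, noise) for _ in range(n_runs))
    return sum(samples) / n_runs

if __name__ == "__main__":
    print(expected_utility(x0=1.0, r=0.05, T=20))
```

Evaluating one fixed rule this way is the easy part. The problem posed in the text is to find the rule that maximizes this quantity when F and H are not known in advance.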

We are asked to develop methods which are general in that they work for any reasonable nonlinear or linear functions F and H, which may also be functions of unknown weights or parameters W. For a true intelligent system, we want to be able to maximize performance for the case where all our knowledge of F and H comes from experience, from the database {Y(τ), u(τ), τ = 1 to t}, and from an “uninformative” prior probability distribution Pr(F, H) for what they might be [8].

Modern ADP includes any efforts to use, analyze, or develop general-purpose methods to find good approximate answers to this optimization problem, using learning or approximation methods to cope with complexity. Of course, it also includes efforts aimed at the continuous time version of the problem, and hybrid versions with multiple time scales. It also includes efforts to develop general-purpose methods aimed at major special cases of this problem (such as the deterministic case, where there are no vectors e1 or e2, or the fully observed case, where Y = X), so long as they are useful steps toward the general case, developing the kinds of methods needed for the general case as well, as discussed in Section 1.2.2.
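The two special cases named above can be read as restrictions of one general simulation loop. The following sketch assumes the same general form as before (a state update F driven by noise e1, an observation function H driven by noise e2). The toy model, the `rollout` helper, and the noise samplers are all hypothetical and serve only to show how each case arises.

```python
import random

def rollout(F, H, policy, x0, steps, e1, e2):
    """Simulate the system; return the lists of states X(t) and observations Y(t)."""
    xs, ys = [], []
    x = x0
    for _ in range(steps):
        y = H(x, e2())             # what the controller sees
        xs.append(x)
        ys.append(y)
        x = F(x, policy(y), e1())  # what the world actually does
    return xs, ys

# Hypothetical toy model, not taken from the chapter.
def F(x, u, e1):
    return 0.8 * x + u + e1

def H_noisy(x, e2):
    return x + e2

def H_full(x, e2):
    return x                       # fully observed case: Y = X

def policy(y):
    return -0.3 * y

rng = random.Random(1)

def gaussian():
    return rng.gauss(0.0, 0.2)

def zero():
    return 0.0                     # deterministic case: no random inputs

# Deterministic case: the vectors e1 and e2 are effectively absent.
xs_det, ys_det = rollout(F, H_noisy, policy, 1.0, 5, zero, zero)

# Fully observed case: the dynamics are still noisy, but Y(t) equals X(t).
xs_full, ys_full = rollout(F, H_full, policy, 1.0, 5, gaussian, gaussian)
```

In the general case both restrictions are lifted at once: the controller sees only `ys`, never `xs`, and neither trajectory repeats from run to run.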

Reinforcement learning (RL) is much older than ADP. As a result, the term RL means different things to different people. RL includes early work by the psychologist Skinner and his followers, such as Harry Klopf, developing models of how animals learn to change their behavior in response to reward (r) and punishment. Some of the recent work in RL still follows that tradition, using “r” instead of “U,” even when the system is intended to solve an optimization problem. Many computer scientists use the term RL to include systems that try to maximize a function U(u) without considering the impact of present actions on future times. A more modern formulation of RL [1] is essentially the same as ADP, except that we are trying to design a system which observes U(t) at each time t, without knowing the function U(Y, u) which underlies it. This is logically just a special case of ADP, since we can add U(t) itself into the list of observed variables included in Y.
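The last point, that observing U(t) directly is a special case of ADP, amounts to a simple change of variables. The sketch below illustrates it with two hypothetical helpers, `augment` and `known_utility`, which do not appear in the chapter: the observed reward is appended to the observation vector, after which the utility of the augmented problem is a known function that reads off that component.

```python
from typing import Sequence, Tuple

Observation = Tuple[float, ...]

def augment(y: Sequence[float], reward: float) -> Observation:
    """Fold the observed reward U(t) into the observation vector Y(t)."""
    return tuple(y) + (reward,)

def known_utility(y_aug: Observation) -> float:
    """In the augmented problem the utility is known: it is the last component."""
    return y_aug[-1]

if __name__ == "__main__":
    # The learner sees two sensor readings plus a scalar reward signal.
    y_aug = augment([0.2, -1.0], reward=3.5)
    print(y_aug)                 # the augmented observation
    print(known_utility(y_aug))  # the utility read back from it
```

Nothing about the unknown mapping from Y and u to the reward has to be modeled separately. It becomes part of the unknown observation function H of the augmented system.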

Before 1968, research in RL and research related to dynamic programming were two entirely separate areas. Modern ADP dates back, at the earliest, to the 1968 paper [9] in which I first proposed that we can build reinforcement learning systems through adaptive approximation to the Bellman equation, as will be discussed in Section 1.2.2.

In a recent conference on modernizing the electric power grid [10], I heard a key researcher say “We really need new general methods to solve these complex multistage stochastic optimization problems, but ADP does not work so well. We need to develop better methods for this purpose.” Logically this does not make sense, because we have defined this field to include any such “better methods.” The researcher was actually thinking of one particular set of ADP tools, which do not represent the full capabilities of the field as it exists now, let alone in the future.

Equations (1.1)–(1.3) do not yet give a complete problem...
