Peter Dayan, Gatsby Computational Neuroscience Unit, UCL, London, UK
Two central sources of anomalies underlie the complexities in the relationship between value and choice. First, the different systems can disagree about their values. Actions involve such things as picking a stimulus or pressing a button. The environment specifies a set of rules governing the transitions between states depending on the action chosen. The trouble is that a tree typically grows exponentially with the number of layers considered, making this extremely difficult. Model-based and model-free controls are ways of doing this, which differ in the information about the environment they use and the computations they perform. The model-free system would learn the utility of pressing the lever but would not have the informational wherewithal to realize that this utility had changed when the cheese had been poisoned. Pavlovian control is also based on predictions of affectively important outcomes such as rewards and punishments. However, rather than determining the choices that would lead to the acquisition or avoidance of these outcomes, it expresses a set of hard-wired preparatory and consummatory choices.
Reinforcement learning; motivation; model-based control; model-free control; Pavlovian control; utility
Outline
Introduction
Reinforcement Learning
Controllers
Model-based Control
Model-free Control
Pavlovian Control
Combination
Net Choice
Interaction
Discussion
One lesson from modern neuroeconomics is that value precedes preference. A second lesson is that this happens in multiple competing and cooperating systems. In this chapter we consider some aspects and consequences of this multiplicity. Even if all systems share a common, Platonic, notion of utility, they acquire and use information about the environment and about the utility in different ways, spanning a spectrum of computationally rationalizable possibilities. We discuss how and why values may not be consistent across systems, and how choices emerging from some systems may not be consistent with their own underlying values. Such complexities may motivate some of the significant, albeit contained, anomalies of choice.
The relationship between value and choice seems very simple. We should choose the things we value and value the things we choose. Indeed, one of the most beautiful results in economics goes exactly along these lines – if only our choices between possible actions were to satisfy some simple, intuitive, prerequisites, such as being transitive, then these actions could be arranged along a single axis of preference, and could be endowed with values that could be treated as governing choice. Although psychologists or behaviorally-inclined economists might go no further than asking people to report their subjective values in one of a variety of ways, neuroscientists could search for postulated forms of such a value function in the activity of neurons, or their blood flow surrogates.
Unfortunately, life is not so simple. A subject’s choices fail to satisfy any set of intuitive axioms, instead exhibiting the wide range of anomalies explored in this book and elsewhere. When subjects can be persuaded to express values, not only do these also show anomalies and inconsistencies, but also their anomalies are not quite the same as those of the choices with which they would be associated. Thus, subjects will subscribe to values that are not consistent with their expressible preferences.
There have been many interesting approaches to save part of the bacon of value and/or the relationship between value and choice by appealing to external or internal factors. The former (such things as the radical asymmetry in the information state between the experimenter and the subjects) are covered in other chapters in this book. In this chapter, we will show that even if there were a Platonic utility function governing idealized values of outcomes, rationalizable complexities of the internal architecture of control imply a range of apparent inconsistencies.
We start by noting three separate systems that have been mooted as being involved. Two, called model-based and model-free (Daw, Niv & Dayan, 2005; Dickinson & Balleine, 2002), are instrumental; a third is Pavlovian. A fourth, episodic, system has also been suggested (Lengyel & Dayan, 2007), but enjoys less empirical support and we will not discuss it here.
Crudely speaking (and we will see later why this is too crude in the current context), for instrumental systems, the relationship in the environment between choices and affectively important outcomes such as rewards and punishments plays a key role in determining choice. Subjects repeat actions that lead to high value outcomes (rewards), and avoid ones that lead to low value outcomes (punishments).
Conversely, for the Pavlovian system, choices are determined by predictions of these outcomes, irrespective of the actual environmental relationship between choices and outcomes (Dickinson, 1980; Mackintosh, 1983). Thus, for instance, animals cannot help but approach (rather than run away from) a source of food, even if the experimenter has cruelly arranged things in a looking-glass world so that the approach appears to make the food recede, whereas retreating would make the food more accessible (Hershberger, 1986).
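The contrast can be illustrated with a toy sketch. It is not a model from the chapter: the function names, the integer "distance to food", and the one-step lookahead used for the instrumental controller are all assumptions chosen for illustration. The environment reverses the usual contingency, as in the looking-glass arrangement: approaching makes the food recede, and retreating brings it closer.

```python
def looking_glass_step(distance, action):
    """Hypothetical reversed contingency: approaching makes the food recede."""
    if action == "approach":
        return distance + 1
    if action == "retreat":
        return max(distance - 1, 0)
    raise ValueError(f"unknown action: {action}")


def pavlovian_policy(distance):
    # Hard-wired response to the prediction of food: always approach,
    # regardless of what approaching actually does in this environment.
    return "approach"


def instrumental_policy(distance):
    # Illustrative stand-in for instrumental control: pick the action whose
    # consequence in the environment is best (smallest distance to the food).
    # Assumes the action-outcome contingency is already known.
    return min(["approach", "retreat"],
               key=lambda a: looking_glass_step(distance, a))


def run(policy, steps=5, start=3):
    """Apply a policy for a fixed number of steps; return the final distance."""
    distance = start
    for _ in range(steps):
        distance = looking_glass_step(distance, policy(distance))
    return distance


print(run(pavlovian_policy))     # 8: the food ends up further away
print(run(instrumental_policy))  # 0: the food is reached
```

Both controllers "value" the food equally here; only the way that value is turned into action differs, which is the point of the example.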
Rather than starting from choices, models of all three systems start from values, from which choices are derived. Consistent with this, value signals have duly been found in a swathe of neural systems including, amongst others, the orbitofrontal cortex, the amygdala, the striatum, and the dopaminergic neuromodulatory system that innervates various of these loci (see Morrison & Salzman, 2010; Niv, 2009; O’Doherty, 2004, 2007; Samejima, Ueda, Doya & Kimura, 2005; Schultz, 2002; Wallis & Kennerley, 2010, and references therein). By contrast, the mechanisms that turn these values into choices are less clear.
Two central sources of anomalies underlie the complexities in the relationship between value and choice. First, the different systems can disagree about their values (and thus the choices they would make) (Dickinson & Balleine, 2002). However, choice is, almost by definition, unitary. Thus, if the values produced by the different systems differ, then the ultimate behavior cannot follow all of them. Nevertheless, we will argue that it is adaptive to have multiple systems (even at the expense of value-choice inconsistencies), since they offer different “sweet-spots” in the trade-off between adaptivity and adaptability.
The second source of anomalies is the Pavlovian system itself. As described above, it has the property that choices are determined directly by predictions, without any regard for their appropriateness. Thus, for example, given the chance, the subjects in the looking-glass world would clearly exhibit a preference for food, but nevertheless emit actions that are equally clearly inconsistent with this preference.
Having discussed the individual systems and their properties, we then consider issues that arise from their interaction. For instance, we have interpreted various findings as suggesting that Pavlovian mechanisms may interfere with model-based instrumental evaluation (Dayan & Huys, 2008). Equally, if one system controls behavior, then it can prevent other systems from gaining sufficient experience to acquire a full set of values and associated preferences.
We conclude the chapter with some general remarks about the naivety of the original expectation for a simple mapping between value and choice for subjects suffering from limited computational power in an unknown and changing world.
Reinforcement learning (Sutton & Barto, 1998) formalizes the interaction between a subject and its environment. In the simplest cases, the environment comprises a set of possible states χ = {x}, actions A = {a}, and outcomes O = {o}. We will also consider that the subject is in an internal motivational state m ∈ M, such as hunger or thirst, although sometimes, as we discuss later, we will concatenate external and internal states to make a single state of the world (also called x).
In human experiments, states are often like the stages of a task, and are typically signaled by cues such as lights and tones. A state could also be a location in a maze. Actions involve such things as picking a stimulus or pressing a button. The environment specifies a set of rules governing the transitions between states depending on the action chosen. This is often treated as a Markov chain, characterized by a transition matrix T_xy(a) that specifies the probability of moving from state x to state y given action a (Puterman, 2005; Sutton & Barto, 1998). We will treat the outcome o(x, a) as being a deterministic function of the state x and action a; it is a simple generalization to make the outcomes also be probabilistic.
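As a minimal sketch of this formalism, the environment can be written down as a table of transition probabilities and a deterministic outcome function. The two states, two actions, outcome labels, and all numbers below are hypothetical and chosen only for illustration; they do not come from the chapter.

```python
import random

# T[a][x][y]: probability of moving from state x to state y given action a.
# All states, actions, and probabilities here are illustrative assumptions.
T = {
    "left":  {"x0": {"x0": 0.9, "x1": 0.1},
              "x1": {"x0": 0.5, "x1": 0.5}},
    "right": {"x0": {"x0": 0.2, "x1": 0.8},
              "x1": {"x0": 0.0, "x1": 1.0}},
}

# o(x, a): the outcome is a deterministic function of state and action.
OUTCOME = {
    ("x0", "left"): "nothing",
    ("x0", "right"): "nothing",
    ("x1", "left"): "punishment",
    ("x1", "right"): "reward",
}


def rows_are_distributions(T, tol=1e-9):
    """Check that every row T[a][x] is a probability distribution over y."""
    return all(
        abs(sum(row.values()) - 1.0) <= tol and min(row.values()) >= 0.0
        for per_state in T.values()
        for row in per_state.values()
    )


def step(x, a, rng):
    """Return (next state y, outcome o) after taking action a in state x."""
    outcome = OUTCOME[(x, a)]
    next_states = list(T[a][x])
    weights = [T[a][x][y] for y in next_states]
    y = rng.choices(next_states, weights=weights, k=1)[0]
    return y, outcome


rng = random.Random(0)  # seeded so that runs are repeatable
assert rows_are_distributions(T)
y, o = step("x1", "right", rng)  # this transition has probability 1 of staying in x1
```

Making the outcomes probabilistic, as the text notes, would only require replacing the `OUTCOME` lookup with a second sampling step.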
In this context, the subject’s choices comprise a policy π. We consider so-called...