Introduction
The vast majority of Reinforcement Learning (RL) and Neuro-Dynamic Programming (NDP) methods fall into one of the following two categories:
(a) Actor-only methods work with a parameterized family of policies. The gradient of the performance, with respect to the actor parameters, is directly estimated by simulation, and the parameters are updated in a direction of improvement. A possible drawback of such methods is that the gradient estimators may have a large variance. Furthermore, as the policy changes, a new gradient is estimated independently of past estimates. Hence, there is no “learning”, in the sense of accumulation and consolidation of older information.
(b) Critic-only methods rely exclusively on value function approximation and aim at learning an approximate solution to the Bellman equation, which will then hopefully prescribe a near-optimal policy. Such methods are indirect in the sense that they do not try to optimize directly over a policy space.
Markov decision processes and parameterized family of RSP’s
Consider a Markov decision process with finite state space $S$, and finite action space $A$. Let $g \colon S \times A \rightarrow \mathbb{R}$ be a given cost function. A randomized stationary policy (RSP) is a mapping $\mu$ that assigns to each state $x$ a probability distribution over the action space $A$. We consider a set of randomized stationary policies $\mathbb{P} = \{\mu_\theta \,;\, \theta \in \mathbb{R}^n\}$, parameterized in terms of a vector $\theta$. For each pair $(x,u)\in S \times A$, $\mu_\theta(x,u)$ denotes the probability of taking action $u$ when the state $x$ is encountered, under the policy corresponding to $\theta$. Let $p_{xy}(u)$ denote the probability that the next state is $y$, given that the current state is $x$ and the current action is $u$. Note that under any RSP, the sequence of states $\{X_n\}$ and of state-action pairs $\{(X_n,U_n)\}$ of the Markov decision process form Markov chains with state spaces $S$ and $S\times A$, respectively. We make the following assumptions about the family of policies $\mathbb{P}$.
(A1) For all $x\in S$ and $u \in A$, the map $\theta \mapsto \mu_\theta(x,u)$ is twice differentiable with bounded first and second derivatives. Furthermore, there exists an $\mathbb{R}^n$-valued function $\psi_\theta(x,u)$ such that $\nabla\mu_\theta(x,u) = \mu_\theta(x,u)\psi_\theta(x,u)$, where the mapping $\theta \mapsto \psi_\theta(x,u)$ is bounded and has bounded first derivatives for any fixed $x$ and $u$.
(A2) For each $\theta \in \mathbb{R}^n$, the Markov chains $\{X_n\}$ and $\{(X_n,U_n)\}$ are irreducible and aperiodic, with stationary probabilities $\pi_\theta(x)$ and $\eta_\theta(x,u) = \pi_\theta(x)\mu_\theta(x,u)$, respectively, under the RSP $\mu_\theta$.
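As a concrete illustration (not part of the original development), a family satisfying (A1) is the Boltzmann (softmax) parameterization built on bounded feature vectors $\phi(x,u)\in\mathbb{R}^n$, which are an assumption of this example:
\[\mu_\theta(x,u) = \frac{\exp\bigl(\theta^{\top}\phi(x,u)\bigr)}{\sum_{u'\in A}\exp\bigl(\theta^{\top}\phi(x,u')\bigr)}, \qquad \psi_\theta(x,u) = \phi(x,u) - \sum_{u'\in A}\mu_\theta(x,u')\,\phi(x,u').\]
Here $\nabla\mu_\theta(x,u) = \mu_\theta(x,u)\psi_\theta(x,u)$ with $\psi_\theta$ bounded whenever the features are bounded, so (A1) holds; moreover $\mu_\theta(x,u) > 0$ for every $\theta$, so (A2) reduces to an irreducibility and aperiodicity condition on the transition probabilities $p_{xy}(u)$ of the underlying Markov decision process.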
In reference to Assumption (A1), note that whenever $\mu_\theta(x,u)$ is nonzero we have
\[\psi_\theta(x,u) = \frac{\nabla\mu_\theta(x,u)}{\mu_\theta(x,u)} = \nabla \ln \mu_\theta(x,u).\]
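This identity is easy to check numerically for the softmax family sketched after Assumptions (A1)-(A2). The sketch below is purely illustrative; the problem sizes, the feature array phi, and the function names are assumptions of this example rather than anything from the text.

import numpy as np

# Illustrative sketch only: a feature-based softmax policy and a finite-difference
# check that psi_theta(x, u) equals the gradient of ln mu_theta(x, u).
rng = np.random.default_rng(0)
n_states, n_actions, n_params = 4, 3, 5
phi = rng.normal(size=(n_states, n_actions, n_params))    # assumed features phi(x, u)

def mu(theta, x):
    """Action probabilities mu_theta(x, .) of the softmax policy."""
    scores = phi[x] @ theta                    # theta^T phi(x, u) for each u
    scores -= scores.max()                     # numerical stability
    w = np.exp(scores)
    return w / w.sum()

def psi(theta, x, u):
    """Closed form of psi_theta(x, u) = grad_theta ln mu_theta(x, u)."""
    return phi[x, u] - mu(theta, x) @ phi[x]   # phi(x,u) - sum_u' mu(x,u') phi(x,u')

theta = rng.normal(size=n_params)
x, u, eps = 1, 2, 1e-6

# Central finite differences of ln mu_theta(x, u) along each coordinate of theta.
numeric = np.array([
    (np.log(mu(theta + eps * e, x)[u]) - np.log(mu(theta - eps * e, x)[u])) / (2 * eps)
    for e in np.eye(n_params)
])
print(np.allclose(numeric, psi(theta, x, u), atol=1e-5))   # True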
Consider the average cost function $\lambda \colon \mathbb{R}^n \rightarrow \mathbb{R}$, given by
\[\lambda(\theta) = \sum_{x\in S,\,u\in A} g(x,u)\,\eta_\theta(x,u).\]
We are interested in minimizing $\lambda(\theta)$ over all $\theta$. For each $\theta \in \mathbb{R}^n$, let $V_\theta \colon S \rightarrow \mathbb{R}$ be the “differential” cost function, defined as a solution of the Poisson equation:
\[\lambda(\theta) + V_\theta(x) = \sum_{u\in A} \mu_\theta(x,u)\left[g(x,u) + \sum_y p_{xy}(u)V_\theta(y)\right].\]
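For a small finite problem, $\pi_\theta$, $\eta_\theta$, $\lambda(\theta)$, and $V_\theta$ can all be computed directly. The sketch below is again only an illustration under assumed synthetic data (random transition probabilities, costs, and features); it uses the standard fact that the solution of the Poisson equation is unique up to an additive constant, and pins down the solution satisfying $\pi_\theta^{\top} V_\theta = 0$.

import numpy as np

# Illustrative sketch only: a synthetic finite MDP and a softmax RSP, used to
# compute pi_theta, eta_theta, lambda(theta), and a solution V_theta of the
# Poisson equation. All data and names below are assumptions of this example.
rng = np.random.default_rng(1)
nS, nA, nP = 4, 3, 5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))       # P[x, u, y] = p_xy(u)
g = rng.uniform(size=(nS, nA))                      # one-stage cost g(x, u)
phi = rng.normal(size=(nS, nA, nP))                 # assumed features of the softmax policy
theta = rng.normal(size=nP)

def mu(theta):
    """mu_theta(x, u): softmax in theta^T phi(x, u); each row sums to one."""
    scores = phi @ theta
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

pol = mu(theta)                                     # |S| x |A| action probabilities
P_theta = np.einsum('xu,xuy->xy', pol, P)           # state chain under mu_theta
g_theta = (pol * g).sum(axis=1)                     # expected one-stage cost per state

# Stationary distribution pi_theta: left eigenvector of P_theta for eigenvalue 1.
evals, evecs = np.linalg.eig(P_theta.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi = pi / pi.sum()

eta = pi[:, None] * pol                             # eta_theta(x,u) = pi_theta(x) mu_theta(x,u)
lam = (eta * g).sum()                               # average cost lambda(theta)

# Poisson equation: lambda + V(x) = sum_u mu(x,u) [ g(x,u) + sum_y p_xy(u) V(y) ],
# i.e. (I - P_theta) V = g_theta - lambda * 1.  V is unique up to an additive
# constant; solving with the matrix I - P_theta + 1 pi^T selects pi^T V = 0.
V = np.linalg.solve(np.eye(nS) - P_theta + np.outer(np.ones(nS), pi), g_theta - lam)

print(np.allclose(lam + V, g_theta + P_theta @ V))  # the Poisson equation holds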