Introduction
The vast majority of Reinforcement Learning (RL) and Neuro-Dynamic Programming (NDP) methods fall into one of the following two categories:
(a) Actor-only methods work with a parameterized family of policies. The gradient of the performance, with respect to the actor parameters, is directly estimated by simulation, and the parameters are updated in a direction of improvement. A possible drawback of such methods is that the gradient estimators may have a large variance. Furthermore, as the policy changes, a new gradient is estimated independently of past estimates. Hence, there is no “learning”, in the sense of accumulation and consolidation of older information.
(b) Critic-only methods rely exclusively on value function approximation and aim at learning an approximate solution to the Bellman equation, which will then hopefully prescribe a near-optimal policy. Such methods are indirect in the sense that they do not try to optimize directly over a policy space.
Markov decision processes and parameterized family of RSP’s
Consider a Markov decision process with finite state space $S$, and finite action space $A$. Let $g \colon S \times A \rightarrow \mathbb{R}$ be a given cost function. A randomized stationary policy (RSP) is a mapping $\mu$ that assigns to each state $x$ a probability distribution over the action space $A$. We consider a set of randomized stationary policies $\mathbb{P} = \{\mu_\theta \,;\, \theta \in \mathbb{R}^n\}$, parameterized in terms of a vector $\theta$. For each pair $(x,u)\in S \times A$, $\mu_\theta(x,u)$ denotes the probability of taking action $u$ when the state $x$ is encountered, under the policy corresponding to $\theta$. Let $p_{xy}(u)$ denote the probability that the next state is $y$, given that the current state is $x$ and the current action is $u$. Note that under any RSP, the sequence of states $\{X_n\}$ and of state-action pairs $\{(X_n,U_n)\}$ of the Markov decision process form Markov chains with state spaces $S$ and $S\times A$, respectively. We make the following assumptions about the family of policies $\mathbb{P}$.
(A1) For all $x\in S$ and $u \in A$, the map $\theta \mapsto \mu_\theta(x,u)$ is twice differentiable with bounded first and second derivatives. Furthermore, there exists an $\mathbb{R}^n$-valued function $\psi_\theta(x,u)$ such that $\nabla\mu_\theta(x,u) = \mu_\theta(x,u)\psi_\theta(x,u)$, where the mapping $\theta \mapsto \psi_\theta(x,u)$ is bounded and has bounded first derivatives for any fixed $x$ and $u$.
(A2) For each $\theta \in \mathbb{R}^n$, the Markov chains $\{X_n\}$ and $\{(X_n,U_n)\}$ are irreducible and aperiodic, with stationary probabilities $\pi_\theta(x)$ and $\eta_\theta(x,u) = \pi_\theta(x)\mu_\theta(x,u)$, respectively, under the RSP $\mu_\theta$.
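As a concrete illustration (not part of the original development), a family satisfying (A1) is the Boltzmann (softmax) parameterization built on bounded feature vectors $\phi(x,u)\in\mathbb{R}^n$, which are an assumption of this example:
\[\mu_\theta(x,u) = \frac{\exp\bigl(\theta^{\top}\phi(x,u)\bigr)}{\sum_{u'\in A}\exp\bigl(\theta^{\top}\phi(x,u')\bigr)}, \qquad \psi_\theta(x,u) = \phi(x,u) - \sum_{u'\in A}\mu_\theta(x,u')\,\phi(x,u').\]
Here $\nabla\mu_\theta(x,u) = \mu_\theta(x,u)\psi_\theta(x,u)$ with $\psi_\theta$ bounded whenever the features are bounded, so (A1) holds; moreover $\mu_\theta(x,u) > 0$ for every $\theta$, so (A2) reduces to an irreducibility and aperiodicity condition on the transition probabilities $p_{xy}(u)$ of the underlying Markov decision process.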
In reference to Assumption (A1), note that whenever $\mu_\theta(x,u)$ is nonzero we have
\[\psi_\theta(x,u) = \frac{\nabla\mu_\theta(x,u)}{\mu_\theta(x,u)} = \nabla \ln \mu_\theta(x,u).\]
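This identity is easy to check numerically for the softmax family sketched after Assumptions (A1)-(A2). The sketch below is purely illustrative; the problem sizes, the feature array phi, and the function names are assumptions of this example rather than anything from the text.

import numpy as np

# Illustrative sketch only: a feature-based softmax policy and a finite-difference
# check that psi_theta(x, u) equals the gradient of ln mu_theta(x, u).
rng = np.random.default_rng(0)
n_states, n_actions, n_params = 4, 3, 5
phi = rng.normal(size=(n_states, n_actions, n_params))    # assumed features phi(x, u)

def mu(theta, x):
    """Action probabilities mu_theta(x, .) of the softmax policy."""
    scores = phi[x] @ theta                    # theta^T phi(x, u) for each u
    scores -= scores.max()                     # numerical stability
    w = np.exp(scores)
    return w / w.sum()

def psi(theta, x, u):
    """Closed form of psi_theta(x, u) = grad_theta ln mu_theta(x, u)."""
    return phi[x, u] - mu(theta, x) @ phi[x]   # phi(x,u) - sum_u' mu(x,u') phi(x,u')

theta = rng.normal(size=n_params)
x, u, eps = 1, 2, 1e-6

# Central finite differences of ln mu_theta(x, u) along each coordinate of theta.
numeric = np.array([
    (np.log(mu(theta + eps * e, x)[u]) - np.log(mu(theta - eps * e, x)[u])) / (2 * eps)
    for e in np.eye(n_params)
])
print(np.allclose(numeric, psi(theta, x, u), atol=1e-5))   # True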
Consider the average cost function $\lambda \colon \mathbb{R}^n \rightarrow \mathbb{R}$, given by
\[\lambda(\theta) = \sum_{x\in S,\,u\in A} g(x,u)\,\eta_\theta(x,u).\]
We are interested in minimizing $\lambda(\theta)$ over all $\theta$. For each $\theta \in \mathbb{R}^n$, let $V_\theta \colon S \rightarrow \mathbb{R}$ be the “differential” cost function, defined as a solution of the Poisson equation:
\[\lambda(\theta) + V_\theta(x) = \sum_{u\in A} \mu_\theta(x,u)\left[g(x,u) + \sum_y p_{xy}(u)V_\theta(y)\right].\]
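For a small finite problem, $\pi_\theta$, $\eta_\theta$, $\lambda(\theta)$, and $V_\theta$ can all be computed directly. The sketch below is again only an illustration under assumed synthetic data (random transition probabilities, costs, and features); it uses the standard fact that the solution of the Poisson equation is unique up to an additive constant, and pins down the solution satisfying $\pi_\theta^{\top} V_\theta = 0$.

import numpy as np

# Illustrative sketch only: a synthetic finite MDP and a softmax RSP, used to
# compute pi_theta, eta_theta, lambda(theta), and a solution V_theta of the
# Poisson equation. All data and names below are assumptions of this example.
rng = np.random.default_rng(1)
nS, nA, nP = 4, 3, 5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))       # P[x, u, y] = p_xy(u)
g = rng.uniform(size=(nS, nA))                      # one-stage cost g(x, u)
phi = rng.normal(size=(nS, nA, nP))                 # assumed features of the softmax policy
theta = rng.normal(size=nP)

def mu(theta):
    """mu_theta(x, u): softmax in theta^T phi(x, u); each row sums to one."""
    scores = phi @ theta
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

pol = mu(theta)                                     # |S| x |A| action probabilities
P_theta = np.einsum('xu,xuy->xy', pol, P)           # state chain under mu_theta
g_theta = (pol * g).sum(axis=1)                     # expected one-stage cost per state

# Stationary distribution pi_theta: left eigenvector of P_theta for eigenvalue 1.
evals, evecs = np.linalg.eig(P_theta.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi = pi / pi.sum()

eta = pi[:, None] * pol                             # eta_theta(x,u) = pi_theta(x) mu_theta(x,u)
lam = (eta * g).sum()                               # average cost lambda(theta)

# Poisson equation: lambda + V(x) = sum_u mu(x,u) [ g(x,u) + sum_y p_xy(u) V(y) ],
# i.e. (I - P_theta) V = g_theta - lambda * 1.  V is unique up to an additive
# constant; solving with the matrix I - P_theta + 1 pi^T selects pi^T V = 0.
V = np.linalg.solve(np.eye(nS) - P_theta + np.outer(np.ones(nS), pi), g_theta - lam)

print(np.allclose(lam + V, g_theta + P_theta @ V))  # the Poisson equation holds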