[Reading Note]: Asynchronous Stochastic Approximation and Q-Learning
- What's this paper?- Proves the convergence of the Q-learning algorithm
 - Comparison with previous work- Based on the expected reduction of a smooth Lyapunov function (Poljak & Tsypkin, 1973)- Convergence for undiscounted problems is not proven
 
 - "averaging" techniques that lead to an ordinary differential equation (Kushner & Clark, 1978)- Unnatural assumptions for Q-learning
 
 - This paper: builds on the asynchronous convergence theory of Bertsekas (1982) and Bertsekas et al. (1989)- Naturally covers the undiscounted setting
 
 
 
- Stochastic approximation- What's that? → A family of iterative methods typically used for root-finding problems
 - Basic definitions- Update rule: $x_i(t+1) = x_i(t) + \alpha_i(t)\big(F_i(x(t)) - x_i(t) + w_i(t)\big)$
 - $\alpha_i(t)$: step size
 - $F_i(x(t))$: $i$-th component of a function $F$ applied to the current iterate $x(t)$
 - $w_i(t)$: noise
 
 - $x(t)$ converges to the fixed point $x^*$ of $F$, i.e., $F(x^*) = x^*$ (Theorem 2)
 - Asynchronous update- $\tau^i_j(t)$: suppose $x_i$ is updated at time $t$; the value of $x_j$ it uses was generated at an earlier time $\tau^i_j(t) \le t$, so the update reads the (possibly outdated) vector $x^i(t) = \big(x_1(\tau^i_1(t)), \ldots, x_n(\tau^i_n(t))\big)$ (a small sketch in code follows this block)
 💡They claim that the proof of convergence under this asynchronous setting is innovative, but I could not figure out why it is innovative. I feel we can mostly ignore this setting by assuming Assumption 1: $\lim_{t \to \infty} \tau^i_j(t) = \infty$ with probability 1, i.e., arbitrarily old information is eventually discarded.
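
A minimal sketch of the asynchronous iteration as I read it (the toy operator $F(x) = Ax + b$, the bounded-delay model, and all variable names are my own illustration, not from the paper): each component keeps its own step size and may read outdated values of the other components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy operator F (my own choice, not from the paper): an affine contraction
# F(x) = A x + b, whose fixed point x* solves x* = A x* + b.
A = np.array([[0.5, 0.2],
              [0.1, 0.4]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(np.eye(2) - A, b)

n = 2                        # number of components
x = np.zeros(n)              # current iterate x(t)
history = [x.copy()]         # past iterates, so updates can read outdated values
counts = np.zeros(n)         # per-component update counts (for the step sizes)

for t in range(1, 20001):
    i = rng.integers(n)      # only component i is updated at time t (asynchronous)
    # Outdated information: the value of x_j used here was generated at tau_j <= t.
    tau = rng.integers(max(0, t - 10), t, size=n)   # bounded random delays (illustration)
    x_outdated = np.array([history[tau[j]][j] for j in range(n)])

    counts[i] += 1
    alpha = 1.0 / counts[i]                         # step sizes satisfying Assumption 3
    w = rng.normal(scale=0.1)                       # zero-mean noise (Assumption 2)
    x[i] += alpha * (A[i] @ x_outdated + b[i] - x[i] + w)   # the basic update rule
    history.append(x.copy())

print("x(t) after 20000 steps:", x)
print("fixed point x*:        ", x_star)
```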
 
- Statement of Theorem 2- Assumptions- Assumption 1: (About outdated case)- (For all $i$, $j$: $\lim_{t \to \infty} \tau^i_j(t) = \infty$, with probability 1)
 
 - Assumption 2: (About noise)- ( $x(0)$ is $\mathcal{F}(0)$-measurable )
 - ( For all $i$ and $t$, $w_i(t)$ is $\mathcal{F}(t+1)$-measurable )
 - For all $i$ and $t$, $\alpha_i(t)$ is $\mathcal{F}(t)$-measurable💡The fact that $\alpha_i(t)$ is a random variable is this paper's contribution
 - ( For all $i$, $j$, and $t$, $\tau^i_j(t)$ is $\mathcal{F}(t)$-measurable)
 - For all $i$ and $t$, $E[w_i(t) \mid \mathcal{F}(t)] = 0$
 - There exist constants $A$, $B$ such that $E[w_i^2(t) \mid \mathcal{F}(t)] \le A + B \max_j \max_{\tau \le t} |x_j(\tau)|^2$ for all $i$ and $t$
 
 - Assumption 3: (About stepsize)- $\sum_{t=0}^{\infty} \alpha_i(t) = \infty$ for all $i$, with probability 1
 - $\sum_{t=0}^{\infty} \alpha_i^2(t) < \infty$ for all $i$, with probability 1 (a worked example follows the theorem statement)
 
 
 - Statement: $x(t)$ converges to $x^*$ with probability 1 (given Assumptions 1–3 and the conditions on $F$ discussed below)
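
A quick sanity check of Assumption 3 (my own worked example, not from the paper): suppose component $i$ uses step size $1/k$ on its $k$-th update and is updated infinitely often. Then

$$\sum_t \alpha_i(t) = \sum_{k=1}^{\infty} \frac{1}{k} = \infty, \qquad \sum_t \alpha_i^2(t) = \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty,$$

so both conditions hold.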
- Proof of Theorem 2- Define the accumulated-noise process $W_i(t; t_0)$ and the comparison process $Y_i(t; t_0)$- $W_i(t_0; t_0) = 0$, $\quad W_i(t+1; t_0) = (1 - \alpha_i(t)) W_i(t; t_0) + \alpha_i(t) w_i(t)$
 - $Y_i(t_0; t_0) = D_k$, $\quad Y_i(t+1; t_0) = (1 - \alpha_i(t)) Y_i(t; t_0) + \alpha_i(t) \beta D_k$, where $\beta < 1$ is the contraction factor of $F$
 
 - Because $x(t)$ is bounded, there always exists a constant $D_0$ such that $\|x(t) - x^*\| \le D_0$ for all $t$
 - Define shrinking thresholds $D_{k+1} = \gamma D_k$ with $\beta \le \gamma < 1$, so that $D_k \to 0$
 
 - Lemma 4: $W_i(t; t_0) \to 0$ and $Y_i(t; t_0) \to \beta D_k$ as $t \to \infty$, with probability 1
 - Lemma 5: if $\|x(t) - x^*\| \le D_k$ for all $t \ge t_k$, then $|x_i(t) - x^*_i| \le Y_i(t; t_k) + |W_i(t; t_k)|$ for $t \ge t_k$; hence there exists $t_{k+1}$ with $\|x(t) - x^*\| \le D_{k+1}$ for all $t \ge t_{k+1}$, and by induction $x(t) \to x^*$
 
- Q-learning setup- Definitions- $S$: finite state space
 - $U(i)$: set of actions available at state $i$
 - $p_{ij}(u)$- At state $i$, taking action $u$, you transition to state $j$ with probability $p_{ij}(u)$.
 
 - $c_{iu}$ (random variable)- If you take action $u$ at state $i$, you suffer cost $c_{iu}$
 - $c_{iu}$ is not assumed to be bounded ← important
 - But its variance is bounded
 
 - Stationary policy $\pi$:- A function defined on $S$ such that $\pi(i) \in U(i)$ for all $i \in S$.
 - Takes a state, returns an action
 
 - Absorbing state:- Ex) Suppose state 1 is absorbing: $p_{11}(u) = 1$ and $c_{1u} = 0$ for all $u \in U(1)$
 💡Once you enter state 1, you will stay there forever (at no further cost)
 - Proper stationary policy:- Stationary policy under which the state reaches the absorbing state with probability 1 as $t \to \infty$
 
 - $V^*(i)$:💡Optimal (minimal) expected cost-to-go when you are at state $i$- $V^*(i) = \min_{\pi} E\big[\sum_{t=0}^{\infty} \beta^t c_{i_t \pi(i_t)} \,\big|\, i_0 = i\big]$
 
 - $\beta$: discount factor- Undiscounted case: $\beta = 1$
 - Discounted case ($\beta < 1$)
 
 
 - Assumption 7- There exists at least one proper stationary policy.
 - Every improper stationary policy yields infinite expected cost for at least one initial state.
 💡Necessary for convergence in the undiscounted ($\beta = 1$) case- The fact that this natural assumption suffices is part of their contribution.
 - Work before this required all policies to be proper (a toy example of proper vs. improper policies is sketched below)
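
To make the absorbing-state / proper-policy definitions concrete, here is a toy stochastic shortest path instance of my own (not from the paper): state 1 is absorbing and cost-free, while at state 2 the action `'go'` moves toward state 1 and `'stay'` loops forever.

```python
import numpy as np

# Toy SSP (my own example): states {1, 2}; state 1 is absorbing and cost-free.
P = {   # P[(state, action)] = {successor: probability}
    (1, 'any'):  {1: 1.0},
    (2, 'go'):   {1: 0.9, 2: 0.1},
    (2, 'stay'): {2: 1.0},
}
cost = {(1, 'any'): 0.0, (2, 'go'): 1.0, (2, 'stay'): 1.0}

def absorbed_fraction(policy, start=2, horizon=1000, n_runs=200, seed=0):
    """Empirically estimate how often the policy reaches the absorbing state 1."""
    rng = np.random.default_rng(seed)
    absorbed = 0
    for _ in range(n_runs):
        s = start
        for _ in range(horizon):
            if s == 1:
                absorbed += 1
                break
            succ = P[(s, policy[s])]
            s = rng.choice(list(succ), p=list(succ.values()))
    return absorbed / n_runs

print("pi(2) = 'go'  :", absorbed_fraction({1: 'any', 2: 'go'}))    # ~1.0 -> proper
print("pi(2) = 'stay':", absorbed_fraction({1: 'any', 2: 'stay'}))  # 0.0 -> improper
```

The `'stay'` policy also pays cost 1 forever, i.e., infinite expected cost, which matches the second condition of Assumption 7.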
 
 
- Q-learning algorithm- Given the dynamic programming operator $T$: $(TV)(i) = \min_{u \in U(i)} \big[ E[c_{iu}] + \beta \sum_{j \in S} p_{ij}(u) V(j) \big]$
 - Q-learning is a method to compute the optimal Q-factors $Q^*$, based on the fact that $Q^*$ satisfies the following equation (Bellman principle): $Q^*(i, u) = E[c_{iu}] + \beta \sum_{j \in S} p_{ij}(u) \min_{v \in U(j)} Q^*(j, v)$, with $V^*(i) = \min_{u \in U(i)} Q^*(i, u)$
 - The update rule is as follows: $Q(i, u) \leftarrow Q(i, u) + \alpha(i, u) \big( c_{iu} + \beta \min_{v \in U(\bar{j})} Q(\bar{j}, v) - Q(i, u) \big)$, where $\bar{j}$ is a successor state sampled according to $p_{i \cdot}(u)$ and $c_{iu}$ is the sampled cost (a one-step code sketch follows)
 
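A minimal sketch of a single tabular update under this rule (the two-state action sets, step-size schedule, and all names are my own illustration, not from the paper):

```python
beta = 0.9                                   # discount factor (my choice)

Q = {}                                       # Q[(state, action)] -> current estimate
counts = {}                                  # per-pair update counters

def actions_at(state):
    """Hypothetical action sets for a 2-state example."""
    return ['a', 'b'] if state == 2 else ['any']

def q_update(i, u, sampled_cost, next_state):
    """One tabular Q-learning step:
    Q(i,u) <- Q(i,u) + alpha * (c + beta * min_v Q(next_state, v) - Q(i,u))."""
    counts[(i, u)] = counts.get((i, u), 0) + 1
    alpha = 1.0 / counts[(i, u)]             # step size 1/k on the k-th visit
    target = sampled_cost + beta * min(Q.get((next_state, v), 0.0)
                                       for v in actions_at(next_state))
    old = Q.get((i, u), 0.0)
    Q[(i, u)] = old + alpha * (target - old)

# Example: at state 2 we took action 'a', paid cost 1.0, and landed in state 1.
q_update(2, 'a', 1.0, 1)
print(Q)   # {(2, 'a'): 1.0} -- first visit, alpha = 1, min_v Q(1, v) = 0
```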
- Does Q-learning converge?- Correspondence to stochastic approximation- Q-learning's update rule can be formulated as the stochastic approximation $x_i(t+1) = x_i(t) + \alpha_i(t) \big( F_i(x(t)) - x_i(t) + w_i(t) \big)$, with one component per state–action pair $(i, u)$
 - with the following definitions: $x_{(i,u)} = Q(i, u)$, $\quad F_{(i,u)}(Q) = E[c_{iu}] + \beta \sum_{j} p_{ij}(u) \min_{v \in U(j)} Q(j, v)$, $\quad w_{(i,u)}(t) = c_{iu} + \beta \min_{v \in U(\bar{j})} Q(\bar{j}, v) - F_{(i,u)}(Q)$ (zero-mean sampling noise)
 
 - Does $F$ satisfy Assumption 4?- It is known that if the dynamic programming operator $T$ satisfies Assumption 4, then $F$ satisfies Assumption 4
 - So, does $T$ satisfy Assumption 4?
 - When $\beta < 1$- $T$ is a contraction (well known) → satisfies Assumption 4💡$T$ is a contraction mapping if $\|T(V) - T(V')\|_{\infty} \le \beta \|V - V'\|_{\infty}$ for all $V, V'$ (a quick numerical check is sketched after this list)
 
 - When $\beta = 1$- If all policies are proper- There exists a positive vector $\xi$ such that $T$ is a contraction with respect to the weighted maximum norm $\|V\|_{\xi} = \max_i |V(i)| / \xi_i$ (Bertsekas et al., 1989, 1991)
 - Under Assumption 7- $V^*$ is the unique fixed point of $T$ (Bertsekas et al., 1989, 1991)
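
A quick numerical illustration of the contraction claim for $\beta < 1$ (the random toy MDP and all names are mine, not from the paper): apply $T$ to two arbitrary value functions and compare sup-norm distances before and after.

```python
import numpy as np

beta = 0.9                                   # discounted case
n_states, n_actions = 4, 3
rng = np.random.default_rng(1)

# Random toy MDP: P[u, i, j] = p_ij(u), c[i, u] = E[c_iu].
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)            # normalize to transition probabilities
c = rng.random((n_states, n_actions))

def T(V):
    """(TV)(i) = min_u [ E[c_iu] + beta * sum_j p_ij(u) V(j) ]."""
    return np.min(c + beta * np.einsum('uij,j->iu', P, V), axis=1)

V1 = rng.normal(size=n_states)
V2 = rng.normal(size=n_states)
print("||V1 - V2||_inf       =", np.max(np.abs(V1 - V2)))
print("||T(V1) - T(V2)||_inf =", np.max(np.abs(T(V1) - T(V2))))   # <= beta * the above
```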
 
 
 
- Conclusion (Theorem 4)- Given the stepsize conditions (Assumption 3)- In the discounted case ($\beta < 1$): $Q(t)$ converges to $Q^*$ with probability 1
 - In the undiscounted case ($\beta = 1$): under Assumption 7, and provided the iterates $Q(t)$ remain bounded, $Q(t)$ converges to $Q^*$ with probability 1
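
As a sanity check of the discounted part of the theorem, a self-contained toy run (the random MDP, exploration scheme, and step sizes are my own choices, not from the paper): tabular Q-learning with step size $1/(\text{visit count})$, compared against $Q^*$ obtained by iterating the Bellman equation for the Q-factors.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.8                                    # discounted case
nS, nA = 3, 2

# Random toy discounted MDP: P[i, u, j] = p_ij(u), c[i, u] = E[c_iu].
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((nS, nA))

# Q* by iterating Q(i,u) <- c(i,u) + beta * sum_j p_ij(u) * min_v Q(j,v).
Q_star = np.zeros((nS, nA))
for _ in range(2000):
    Q_star = c + beta * P @ Q_star.min(axis=1)

# Asynchronous tabular Q-learning along a single simulated trajectory.
Q = np.zeros((nS, nA))
counts = np.zeros((nS, nA))
state = 0
for _ in range(500_000):
    action = rng.integers(nA)                                 # explore uniformly
    next_state = rng.choice(nS, p=P[state, action])
    sampled_cost = c[state, action] + rng.normal(scale=0.1)   # noisy cost sample
    counts[state, action] += 1
    alpha = 1.0 / counts[state, action]
    target = sampled_cost + beta * Q[next_state].min()
    Q[state, action] += alpha * (target - Q[state, action])
    state = next_state

print("max |Q - Q*| =", np.abs(Q - Q_star).max())   # should be small
```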





