[Reading Note]: Asynchronous Stochastic Approximation and Q-Learning
What this paper does
- Proves the convergence of the Q-learning algorithm
Comparison with previous work
Expected reduction of a smooth Lyapunov function (Poljak & Tsypkin, 1973)
- Convergence for undiscounted problems is not proven
"averaging" techniques that lead to an ordinary differential equation (Kushner & Clark, 1978)
- Unnatural assumptions for Q-learning
This paper: Asynchronous convergence theory of Bertsekas (1982) and Bertsekas et al. (1989)
- Naturally covers the undiscounted setting
Stochastic approximation
- What's that? → A family of iterative methods typically used for root-finding problems
- Basic definitions
- Update rule: $x_i(t+1) = x_i(t) + \alpha_i(t)\left(F_i(x(t)) - x_i(t) + w_i(t)\right)$
- $\alpha_i(t)$: step size
- $F_i(x(t))$: the $i$-th component of the function $F$ applied to the current iterate
- $w_i(t)$: noise
- $x(t)$ converges to the stationary point of $F$, i.e., $x^* = F(x^*)$ (Theorem 2)
- Asynchronous update
- $\tau^i_j(t)$: suppose $x_i$ is updated at time $t$; the last time the copy of $x_j$ used in that update was itself updated is $\tau^i_j(t) \le t$ (so the update may read outdated values)
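As a toy illustration (my own, not from the paper), the special case with up-to-date information ($\tau^i_j(t) = t$) can be run with an assumed contraction $F(x) = 0.5x$, whose unique fixed point is $x^* = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)

def F(x):
    # Illustrative choice of F: a contraction with unique fixed point x* = 0
    return 0.5 * x

n, T = 3, 5000
x = np.ones(n)
for t in range(1, T + 1):
    i = rng.integers(n)                  # asynchronous: one component updated per step
    alpha = 1.0 / t                      # stepsizes: sum(alpha) = inf, sum(alpha^2) < inf
    w = 0.1 * rng.standard_normal()      # zero-mean noise with bounded variance
    x[i] = x[i] + alpha * (F(x)[i] - x[i] + w)
```

Each step updates only one coordinate using the current values of the others; the paper's setting additionally allows those values to be outdated.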
Statement of Theorem 2
Assumptions
Assumption 1: (About outdated information)
- For all $i$ and $j$, $\lim_{t \to \infty} \tau^i_j(t) = \infty$, with probability 1
Assumption 2: (About noise; $\mathcal{F}(t)$ denotes the history of the algorithm up to time $t$)
- $x(0)$ is $\mathcal{F}(0)$-measurable
- For all $i$ and $t$, $w_i(t)$ is $\mathcal{F}(t+1)$-measurable
- For all $i$ and $t$, $\alpha_i(t)$ is $\mathcal{F}(t)$-measurable
- For all $i$, $j$, and $t$, $\tau^i_j(t)$ is $\mathcal{F}(t)$-measurable
- For all $i$ and $t$, $E[w_i(t) \mid \mathcal{F}(t)] = 0$
- There exist constants $A$, $B$ such that $E[w_i^2(t) \mid \mathcal{F}(t)] \le A + B \max_j \max_{\tau \le t} |x_j(\tau)|^2$
Assumption 3: (About stepsize)
- $\sum_{t=0}^{\infty} \alpha_i(t) = \infty$ for all $i$
- $\sum_{t=0}^{\infty} \alpha_i^2(t) < \infty$ for all $i$, with probability 1
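For example, the common choice $\alpha_i(t) = 1/t$ satisfies both conditions:

$$\sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty.$$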
Assumption 4: (About $F$)
- $F$ is monotone: if $x \le y$, then $F(x) \le F(y)$.
- $x \le y$ means $x_i \le y_i$ for all $i$
- $F$ is continuous
- $F$ has a unique fixed point $x^*$: $F(x^*) = x^*$
- If $e$ is the vector with all components equal to 1, and $r$ is a positive scalar, then $F(x) - re \le F(x - re) \le F(x + re) \le F(x) + re$ (basically Lipschitz in the sup norm)
Statement: if, in addition, $x(t)$ is bounded with probability 1, then $x(t)$ converges to $x^*$ with probability 1
Proof of Theorem 2
Define and
- Because $x(t)$ is bounded, there always exists such a constant
- where
-
-
-
-
- Because $x(t)$ is bounded, there always exists such a constant
- Lemma 4: and
- Lemma 5:
Show
Introduce
- : of inductive hypothesis
-
-
-
- By Lemma 1,
- for an arbitrary
-
-
- By Lemma 2,
-
-
-
Lemma 6:
- Reference
Additional definition:
- := smallest positive element of
- Define
- First condition
Why does such a quantity exist?
- Second condition
- First condition
Lemma 7:
Q-learning setup
Definitions
- $S$: Finite state space
- $U(i)$: set of actions available at state $i$
- $p_{ij}(u)$:
- At state $i$, taking action $u$, transition to state $j$ with probability $p_{ij}(u)$.
- $c_{iu}$ (random variable)
- If you take action $u$ at state $i$, you suffer cost $c_{iu}$
- $c_{iu}$ is not assumed bounded ← important
- But its variance is bounded
- Stationary policy $\pi$:
- A function defined on $S$ such that $\pi(i) \in U(i)$ for all $i \in S$.
- Takes a state, returns an action
- Absorbing state:
- Ex) Suppose state 1 is absorbing: $p_{11}(u) = 1$ and $c_{1u} = 0$ for all $u \in U(1)$
- Proper stationary policy:
- A stationary policy under which the state reaches the absorbing state with probability 1 as $t \to \infty$
- $J^{\pi}(i)$: expected cost-to-go of policy $\pi$ starting from state $i$
- $J^{\pi}(i) = E\left[\sum_{t=0}^{\infty} \beta^t c_{s_t \pi(s_t)} \mid s_0 = i\right]$
- $\beta \in (0, 1]$: discount factor
- Undiscounted case: $\beta = 1$
- Discounted case: $\beta < 1$
Assumption 7
- There exists at least one proper stationary policy.
- Every improper stationary policy yields infinite expected cost for at least one initial state.
Q-learning algorithm
- Given the dynamic programming operator $F$ defined componentwise by $(FQ)(i, u) = E[c_{iu}] + \beta \sum_j p_{ij}(u) \min_{v \in U(j)} Q(j, v)$:
- Q-learning is a method to compute the optimal $Q^*$ based on the fact that $Q^*$ satisfies the following equation (Bellman principle): $Q^* = F(Q^*)$
- The update rule is as follows: $Q(i, u) := Q(i, u) + \alpha \left( c_{iu} + \beta \min_{v \in U(\bar{s})} Q(\bar{s}, v) - Q(i, u) \right)$, where $\bar{s}$ is the successor state sampled according to $p_{i \cdot}(u)$
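A minimal sketch of the update rule on a toy two-state chain (the MDP, stepsizes, and parameters are my own illustrative assumptions, not from the paper); state 1 is absorbing and cost-free, state 0 incurs unit cost and reaches state 1 with probability 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

beta = 0.9                  # discounted case, beta < 1
n_states, n_actions = 2, 1  # one action per state; state 1 absorbing, cost-free

def step(s, a):
    """Sample successor state and cost for the toy chain."""
    if s == 1:
        return 1, 0.0                       # absorbing: stays put, zero cost
    s_next = 1 if rng.random() < 0.5 else 0
    return s_next, 1.0                      # unit cost in state 0

Q = np.zeros((n_states, n_actions))
counts = np.zeros((n_states, n_actions))
s = 0
for _ in range(20000):
    a = 0
    s_next, c = step(s, a)
    counts[s, a] += 1
    alpha = counts[s, a] ** -0.6            # per-entry stepsize satisfying Assumption 3
    Q[s, a] += alpha * (c + beta * Q[s_next].min() - Q[s, a])
    s = 0 if s_next == 1 else s_next        # restart from state 0 upon absorption

# Bellman fixed point here: Q*(0) = 1 + beta * 0.5 * Q*(0), i.e. Q*(0) = 1/0.55
```

The exponent 0.6 keeps $\sum_t \alpha(t) = \infty$ and $\sum_t \alpha^2(t) < \infty$ while converging faster in practice than $\alpha(t) = 1/t$.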
Does Q-learning converge?
Correspondence to Stochastic approximation
- Q-learning's update rule can be formulated as the stochastic approximation $x_i(t+1) = x_i(t) + \alpha_i(t)\left(F_i(x(t)) - x_i(t) + w_i(t)\right)$
- with the following definitions: the components of $x$ are the entries $Q(i, u)$; $F$ is the operator above; and $w_{iu}(t) = c_{iu} + \beta \min_{v \in U(\bar{s})} Q(\bar{s}, v) - (FQ)(i, u)$ is the sampling noise, which has zero conditional mean
Does $F$ satisfy Assumption 4?
- It is known that the classical dynamic programming operator (on cost-to-go functions) satisfies Assumption 4
- So does $F$ satisfy Assumption 4?
- When $\beta < 1$
- $F$ is a contraction (well known) → satisfies Assumption 4
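The contraction can be checked directly from the definition of $F$, using $|\min_v a_v - \min_v b_v| \le \max_v |a_v - b_v|$:

$$|(FQ)(i,u) - (FQ')(i,u)| = \beta \left| \sum_j p_{ij}(u) \left( \min_{v \in U(j)} Q(j,v) - \min_{v \in U(j)} Q'(j,v) \right) \right| \le \beta \, \|Q - Q'\|_\infty,$$

so $\|FQ - FQ'\|_\infty \le \beta \|Q - Q'\|_\infty$, a contraction whenever $\beta < 1$.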
- When $\beta = 1$
If all policies are proper:
There exists a vector $\xi$ such that $F$ is a contraction with respect to the weighted norm $\|\cdot\|_{\xi}$ (Bertsekas et al., 1989, 1991)
Under Assumption 7:
$Q^*$ is the unique fixed point of $F$ (Bertsekas et al., 1989, 1991)
Conclusion (Theorem 4)
Given the assumptions above (the stepsize and noise conditions, and either $\beta < 1$, or $\beta = 1$ with Assumption 7 and bounded iterates), $Q(t)$ converges to the optimal $Q^*$ with probability 1