[Reading Note]: Asynchronous Stochastic Approximation and Q-Learning
What's this paper
- Proves the convergence of the Q-learning algorithm
Comparison with previous works
Expected reduction of a smooth Lyapunov function (Poljak & Tsypkin, 1973)
- Does not cover undiscounted problems
"averaging" techniques that lead to an ordinary differential equation (Kushner & Clark, 1978)
- Requires assumptions that are unnatural for Q-learning
This paper: Asynchronous convergence theory of Bertsekas (1982) and Bertsekas & Tsitsiklis (1989)
- Naturally covers the undiscounted setting
Stochastic approximation
- What's that? → A family of iterative methods typically used for root-finding problems
- Basic definitions
- The iteration: $x_i(t+1) = x_i(t) + \alpha_i(t)\,(F_i(x(t)) - x_i(t) + w_i(t))$
- $\alpha_i(t)$: step size
- $F(x)$: a function of $x$ (we seek its fixed point); $F_i$ is its $i$-th component
- $w_i(t)$: noise
- $x(t)$ converges to the fixed point $x^*$ of $F$, i.e., $F(x^*) = x^*$ (Theorem 2)
- Asynchronous update (a minimal simulation sketch follows this section)
- $\tau^i_j(t)$: suppose $x_i$ is updated at time $t$; the value of $x_j$ it uses was computed at the earlier time $\tau^i_j(t) \le t$ (possibly outdated)
- So the update becomes $x_i(t+1) = x_i(t) + \alpha_i(t)\,(F_i(x^i(t)) - x_i(t) + w_i(t))$ with $x^i(t) = (x_1(\tau^i_1(t)), \dots, x_n(\tau^i_n(t)))$
💡They claim that the proof of convergence under this asynchronous setting is innovative, but I could not figure out why it is innovative. I feel we can ignore this setting by assuming Assumption 1: $\lim_{t\to\infty}\tau^i_j(t) = \infty$.
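A minimal simulation sketch of the asynchronous iteration (my own toy, not from the paper; the linear operator, the delay model, and all constants are made up for illustration):

```python
import numpy as np

# Toy asynchronous stochastic approximation:
#   x_i(t+1) = x_i(t) + alpha_i(t) * (F_i(x^i(t)) - x_i(t) + w_i(t))
# F(x) = A x + b is a max-norm contraction, and the "outdated" values x^i(t)
# are emulated by reading from a randomly delayed copy of the iterate.

rng = np.random.default_rng(0)
n = 5
A = 0.5 * rng.random((n, n)) / n       # row sums < 0.5, so F is a contraction
b = rng.random(n)
F = lambda x: A @ x + b

x = np.zeros(n)
history = [x.copy()]                   # past iterates, used for outdated reads
num_updates = np.zeros(n)              # per-component update counts

for t in range(1, 5000):
    i = rng.integers(n)                # only one component is updated at time t
    delay = rng.integers(0, min(t, 10))
    x_outdated = history[t - 1 - delay]    # x^i(t): stale view of the iterate
    num_updates[i] += 1
    alpha = 1.0 / num_updates[i]       # satisfies sum(alpha)=inf, sum(alpha^2)<inf
    noise = 0.1 * rng.standard_normal()
    x[i] += alpha * (F(x_outdated)[i] - x[i] + noise)
    history.append(x.copy())

x_star = np.linalg.solve(np.eye(n) - A, b)   # fixed point of F
print(np.max(np.abs(x - x_star)))            # should be small
```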
Statement of Theorem 2
Assumptions
Assumption 1: (About outdated information)
- For all $i$ and $j$, $\lim_{t\to\infty} \tau^i_j(t) = \infty$ with probability 1
Assumption 2: (About noise)
- $x(0)$ is $\mathcal{F}(0)$-measurable (where $\mathcal{F}(t)$ is the history of the algorithm up to time $t$)
- For all $i$ and $t$, $w_i(t)$ is $\mathcal{F}(t+1)$-measurable
- For all $i$ and $t$, $\alpha_i(t)$ is $\mathcal{F}(t)$-measurable 💡The fact that the step size $\alpha_i(t)$ is a random variable (it may depend on the history) is this paper's contribution
- For all $i$, $j$, and $t$, $\tau^i_j(t)$ is $\mathcal{F}(t)$-measurable
- For all $i$ and $t$, $E[w_i(t) \mid \mathcal{F}(t)] = 0$
- There exist constants $A$ and $B$ such that $E[w_i^2(t) \mid \mathcal{F}(t)] \le A + B \max_j \max_{\tau \le t} |x_j(\tau)|^2$ for all $i$ and $t$
Assumption 3: (About step size)
- $\sum_{t=0}^{\infty} \alpha_i(t) = \infty$ for all $i$, with probability 1
- There exists a (deterministic) constant $C$ such that $\sum_{t=0}^{\infty} \alpha_i^2(t) \le C$ for all $i$, with probability 1 (a concrete example appears after the statement below)
Statement: under Assumptions 1–3, together with a contraction-type condition on $F$ (the Assumption 4 referred to later, in the Q-learning section), $x(t)$ converges to $x^*$ with probability 1
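A quick sanity check of Assumption 3 (my own example, not from the paper): take $\alpha_i(t) = 1/n_i(t)$, where $n_i(t)$ is the number of times component $i$ has been updated up to time $t$ (and $\alpha_i(t) = 0$ at times when $i$ is not updated). Summing over the update times of component $i$,

$$\sum_{t} \alpha_i(t) = \sum_{k \ge 1} \frac{1}{k} = \infty, \qquad \sum_{t} \alpha_i^2(t) = \sum_{k \ge 1} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty,$$

so both conditions hold. This is the step size used in the Q-learning sketch later in this note.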
Proof of Theorem 2
Define $D_0$ and $D_{k+1} = \beta' D_k$, where $\beta' \in (\beta, 1)$ and $\beta$ is the contraction modulus of $F$
- Because $x(t)$ is bounded (with probability 1), there always exists a constant $D_0$ such that $\|x(t) - x^*\|_v \le D_0$ for all $t$
- Since $\beta' < 1$, we have $D_k \to 0$
- Induction claim: for every $k$ there is a (random) time $t_k$ such that $\|x(t) - x^*\|_v \le D_k$ for all $t \ge t_k$
- For $t \ge t_k$, $x_i(t) - x^*_i$ is sandwiched using two auxiliary sequences (see the toy simulation after this proof sketch):
- $W_i(t+1) = (1 - \alpha_i(t)) W_i(t) + \alpha_i(t) w_i(t)$, with $W_i(t_k) = 0$ (the accumulated noise)
- $Y_i(t+1) = (1 - \alpha_i(t)) Y_i(t) + \alpha_i(t) \beta D_k v_i$, with $Y_i(t_k) = D_k v_i$ (a deterministic envelope)
- $|x_i(t) - x^*_i| \le Y_i(t) + |W_i(t)|$ for all $t \ge t_k$
- Lemma 4: $W_i(t) \to 0$ and $Y_i(t) \to \beta D_k v_i$ as $t \to \infty$, with probability 1
- Lemma 5: therefore $\limsup_{t\to\infty} \|x(t) - x^*\|_v \le \beta D_k < D_{k+1}$, which completes the induction step; since $D_k \to 0$, $x(t)$ converges to $x^*$
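A toy simulation of the noise-averaging recursion used above (my own illustration, not from the paper): with $\alpha(t) = 1/(t+1)$, $W(t)$ is simply the running average of the zero-mean noise, so it tends to 0, matching the $W_i(t) \to 0$ claim.

```python
import numpy as np

# Noise-averaging recursion from the proof:  W(t+1) = (1 - alpha(t)) W(t) + alpha(t) w(t)
# With alpha(t) = 1/(t+1) and zero-mean noise w(t), W(t) is the running average
# of the noise, so it converges to 0.

rng = np.random.default_rng(1)
W = 0.0
for t in range(100_000):
    alpha = 1.0 / (t + 1)
    w = rng.standard_normal()          # zero-mean, bounded-variance noise
    W = (1 - alpha) * W + alpha * w

print(abs(W))   # close to 0 for large t
```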
Q-learning setup
Definitions
- $S$: finite state space
- $U(i)$: set of actions available at state $i$
- $p_{ij}(u)$: transition probabilities
- At state $i$, if you take action $u$, you transit to $j$ with probability $p_{ij}(u)$
- $c_{iu}$ (random variable): one-stage cost
- If you take action $u$ at state $i$, you suffer the cost $c_{iu}$
- $c_{iu}$ is not assumed to be bounded ← important
- But its variance is bounded ($E[c_{iu}^2] < \infty$)
- Stationary policy $\pi$:
- A function $\pi$ defined on $S$ such that $\pi(i) \in U(i)$ for all $i \in S$
- Takes a state $i$ and returns an action $\pi(i)$
- Absorbing state:
- Ex) Suppose state 1 is absorbing: $p_{11}(u) = 1$ and $c_{1u} = 0$ for all $u \in U(1)$
💡Once you enter state 1, you will stay there forever (and incur no further cost)
- Proper stationary policy:
- A stationary policy under which the state reaches the absorbing state with probability 1 as $t \to \infty$, from every initial state (a tiny worked example appears after Assumption 7 below)
- $V^*(i)$: 💡the optimal (minimum) expected total cost in the future when you are at state $i$ (the cost-to-go)
- $V^*(i) = \min_{\pi} E\!\left[\sum_{t=0}^{\infty} \beta^t c_{i_t \pi(i_t)} \,\middle|\, i_0 = i\right]$
- $\beta \in (0, 1]$: discount factor
- Undiscounted case: $\beta = 1$
- Discounted case: $\beta < 1$
Assumption 7
- There exists at least one proper stationary policy.
- Every improper stationary policy yields infinite expected cost for at least one initial state.
💡Needed for convergence in the undiscounted ($\beta = 1$) case. The fact that this assumption is natural is their contribution.
- Work before this paper required all policies to be proper
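A tiny example to make Assumption 7 concrete (my own illustration, not from the paper): let $S = \{1, 2\}$ with state 1 absorbing. At state 2 there are two actions: "move", which goes to state 1 with cost 1, and "stay", which stays at state 2 with cost 1. The policy $\pi(2) = \text{move}$ is proper, so the first condition holds; the only improper policy, $\pi(2) = \text{stay}$, accumulates cost $1 + 1 + \cdots = \infty$ starting from state 2, so the second condition holds too. With $\beta = 1$ this gives $V^*(2) = 1$.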
Q-learning algorithm
- Given the dynamic programming operator $F$:
- $(FQ)(i, u) = E[c_{iu}] + \beta \sum_{j \in S} p_{ij}(u) \min_{v \in U(j)} Q(j, v)$
- Q-learning is a method to compute the optimal $Q^*$ based on the fact that $Q^*$ satisfies the following equation (Bellman principle)
- $Q^* = F Q^*$, and then $V^*(i) = \min_{u \in U(i)} Q^*(i, u)$
- The update rule is as follows
- $Q(i, u) := Q(i, u) + \alpha_{iu} \big( c_{iu} + \beta \min_{v \in U(\bar{s})} Q(\bar{s}, v) - Q(i, u) \big)$, where $\bar{s}$ is a successor state sampled according to $p_{i\cdot}(u)$
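A minimal tabular Q-learning sketch (my own toy code, not from the paper): it runs the update rule above on the two-state example from the Assumption 7 illustration, with the $1/\text{visit-count}$ step size, in the discounted case.

```python
import numpy as np

# Toy MDP from the example above: state 0 is absorbing (cost 0); at state 1,
# action 0 ("move") goes to state 0 and action 1 ("stay") stays at state 1,
# both with one-stage cost 1.  Costs are minimized, as in the paper.
rng = np.random.default_rng(2)
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))
P[0, :, 0] = 1.0                      # absorbing state
P[1, 0, 0] = 1.0                      # "move": 1 -> 0
P[1, 1, 1] = 1.0                      # "stay": 1 -> 1
cost = np.array([[0.0, 0.0],
                 [1.0, 1.0]])
beta = 0.9                            # discounted case

Q = np.zeros((n_states, n_actions))
visits = np.zeros_like(Q)

for t in range(200_000):
    i = rng.integers(n_states)        # every (i, u) pair is updated infinitely often
    u = rng.integers(n_actions)
    j = rng.choice(n_states, p=P[i, u])           # sampled successor state
    c = cost[i, u] + 0.1 * rng.standard_normal()  # noisy (possibly unbounded) cost sample
    visits[i, u] += 1
    alpha = 1.0 / visits[i, u]        # step size satisfying Assumption 3
    Q[i, u] += alpha * (c + beta * Q[j].min() - Q[i, u])

print(Q)   # expect roughly [[0, 0], [1, 1 + beta]] = [[0, 0], [1, 1.9]]
```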
Does Q-learning converge?
Correspondence to Stochastic approximation
- Q-learning's update rule can be formulated as the following stochastic approximation
- $Q_{t+1}(i, u) = Q_t(i, u) + \alpha_{iu}(t) \big( (F Q_t)(i, u) - Q_t(i, u) + w_{iu}(t) \big)$
- with the following definitions
- the iterate $x$ is the table of Q-values, with one component per state–action pair $(i, u)$
- $F$ is the dynamic programming operator defined above
- $w_{iu}(t) = c_{iu} + \beta \min_{v \in U(\bar{s})} Q_t(\bar{s}, v) - (F Q_t)(i, u)$ is the noise term
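A quick check (mine, not spelled out above) that this noise term fits Assumption 2: conditioning on the history $\mathcal{F}(t)$,

$$E[w_{iu}(t) \mid \mathcal{F}(t)] = E\!\left[c_{iu} + \beta \min_{v \in U(\bar{s})} Q_t(\bar{s}, v) \,\middle|\, \mathcal{F}(t)\right] - (F Q_t)(i, u) = 0,$$

because $(FQ_t)(i,u)$ is exactly the conditional expectation of the sampled target; the conditional variance bound in Assumption 2 then follows from the bounded variance of $c_{iu}$ together with the $\max_j \max_{\tau \le t} |x_j(\tau)|^2$ term.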
Does $F$ satisfy Assumption 4?
- It is known that a dynamic programming operator that is a contraction (with respect to a maximum norm, possibly weighted) satisfies Assumption 4
- So, does $F$ satisfy Assumption 4?
- When $\beta < 1$ (discounted case)
- $F$ is a contraction with respect to the maximum norm (well known) → satisfies Assumption 4 (see the short derivation at the end of this section) 💡$F$ is a contraction mapping with respect to a norm $\|\cdot\|$ if $\|F(x) - F(y)\| \le \beta \|x - y\|$ for some $\beta < 1$ and all $x, y$
- When $\beta = 1$ (undiscounted case)
- If all policies are proper:
- There exists a positive vector $v$ such that $F$ is a contraction with respect to the weighted maximum norm $\|\cdot\|_v$ (Bertsekas & Tsitsiklis, 1989, 1991)
- Under Assumption 7 (only):
- $F$ is not necessarily a contraction, but $Q^*$ is the unique fixed point of $F$ (Bertsekas & Tsitsiklis, 1989, 1991)
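For reference, the standard one-line check (mine, filling in the "well known" step above) that $F$ is a max-norm contraction with modulus $\beta$ when $\beta < 1$:

$$\big|(FQ)(i,u) - (FQ')(i,u)\big| = \beta \left| \sum_j p_{ij}(u) \big( \min_v Q(j,v) - \min_v Q'(j,v) \big) \right| \le \beta \max_j \big| \min_v Q(j,v) - \min_v Q'(j,v) \big| \le \beta \, \|Q - Q'\|_\infty,$$

using $|\min_v a_v - \min_v b_v| \le \max_v |a_v - b_v|$; taking the maximum over $(i,u)$ gives $\|FQ - FQ'\|_\infty \le \beta \|Q - Q'\|_\infty$.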
Conclusion (Theorem 4)
Given the following, $Q_t(i, u)$ converges to $Q^*(i, u)$ with probability 1, for every $(i, u)$:
- the step sizes satisfy Assumption 3 (so in particular every state–action pair is updated infinitely often)
- either $\beta < 1$, or $\beta = 1$ and all policies are proper, or $\beta = 1$ under Assumption 7 together with the additional assumption that $Q_t$ remains bounded with probability 1