[Reading Note]: Asynchronous Stochastic Approximation and Q-Learning
- What's this paper?- Proves the convergence of the Q-learning algorithm
 - Comparison with previous work- Based on the expected reduction of a smooth Lyapunov function (Poljak & Tsypkin, 1973)- Convergence for undiscounted problems is not proven
 
 - "averaging" techniques that lead to an ordinary differential equation (Kushner & Clark, 1978)- Unnatural assumptions for Q-learning
 
 - This paper: builds on the asynchronous convergence theory of Bertsekas (1982) and Bertsekas et al. (1989)- Naturally covers the undiscounted setting
 
 
 
- Stochastic approximation- What's that? → A family of iterative methods typically used for root-finding problems
 - Basic definitions- Update rule: $x_i(t+1) = x_i(t) + \alpha_i(t)\big(F_i(x(t)) - x_i(t) + w_i(t)\big)$
 - $\alpha_i(t)$: step size
 - $F_i(x(t))$: $i$-th component of a function $F$ applied to the current iterate $x(t)$
 - $w_i(t)$: noise
 
 - $x(t)$ converges to the fixed point $x^*$ of $F$, i.e., $F(x^*) = x^*$ (Theorem 2)
 - Asynchronous update- $\tau^i_j(t)$: suppose $x_i$ is updated at time $t$; the value of $x_j$ it uses was generated at an earlier time $\tau^i_j(t) \le t$, so the update reads the (possibly outdated) vector $x^i(t) = \big(x_1(\tau^i_1(t)), \ldots, x_n(\tau^i_n(t))\big)$ (a small sketch in code follows this block)
 💡They claim that the proof of convergence under this asynchronous setting is innovative, but I could not figure out why it is innovative. I feel we can mostly ignore this setting by assuming Assumption 1: $\lim_{t \to \infty} \tau^i_j(t) = \infty$ with probability 1, i.e., arbitrarily old information is eventually discarded.
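
A minimal sketch of the asynchronous iteration as I read it (the toy operator $F(x) = Ax + b$, the bounded-delay model, and all variable names are my own illustration, not from the paper): each component keeps its own step size and may read outdated values of the other components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy operator F (my own choice, not from the paper): an affine contraction
# F(x) = A x + b, whose fixed point x* solves x* = A x* + b.
A = np.array([[0.5, 0.2],
              [0.1, 0.4]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(np.eye(2) - A, b)

n = 2                        # number of components
x = np.zeros(n)              # current iterate x(t)
history = [x.copy()]         # past iterates, so updates can read outdated values
counts = np.zeros(n)         # per-component update counts (for the step sizes)

for t in range(1, 20001):
    i = rng.integers(n)      # only component i is updated at time t (asynchronous)
    # Outdated information: the value of x_j used here was generated at tau_j <= t.
    tau = rng.integers(max(0, t - 10), t, size=n)   # bounded random delays (illustration)
    x_outdated = np.array([history[tau[j]][j] for j in range(n)])

    counts[i] += 1
    alpha = 1.0 / counts[i]                         # step sizes satisfying Assumption 3
    w = rng.normal(scale=0.1)                       # zero-mean noise (Assumption 2)
    x[i] += alpha * (A[i] @ x_outdated + b[i] - x[i] + w)   # the basic update rule
    history.append(x.copy())

print("x(t) after 20000 steps:", x)
print("fixed point x*:        ", x_star)
```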
 
- Statement of Theorem 2- Assumptions- Assumption 1: (About outdated case)- (For all $i$, $j$: $\lim_{t \to \infty} \tau^i_j(t) = \infty$, with probability 1)
 
 - Assumption 2: (About noise)- ( $x(0)$ is $\mathcal{F}(0)$-measurable )
 - ( For all $i$ and $t$, $w_i(t)$ is $\mathcal{F}(t+1)$-measurable )
 - For all $i$ and $t$, $\alpha_i(t)$ is $\mathcal{F}(t)$-measurable💡The fact that $\alpha_i(t)$ is a random variable is this paper's contribution
 - ( For all $i$, $j$, and $t$, $\tau^i_j(t)$ is $\mathcal{F}(t)$-measurable)
 - For all $i$ and $t$, $E[w_i(t) \mid \mathcal{F}(t)] = 0$
 - There exist constants $A$, $B$ such that $E[w_i^2(t) \mid \mathcal{F}(t)] \le A + B \max_j \max_{\tau \le t} |x_j(\tau)|^2$ for all $i$ and $t$
 
 - Assumption 3: (About stepsize)- $\sum_{t=0}^{\infty} \alpha_i(t) = \infty$ for all $i$, with probability 1
 - $\sum_{t=0}^{\infty} \alpha_i^2(t) < \infty$ for all $i$, with probability 1 (a worked example follows the theorem statement)
 
 
 - Statement: $x(t)$ converges to $x^*$ with probability 1 (given Assumptions 1–3 and the conditions on $F$ discussed below)
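
A quick sanity check of Assumption 3 (my own worked example, not from the paper): suppose component $i$ uses step size $1/k$ on its $k$-th update and is updated infinitely often. Then

$$\sum_t \alpha_i(t) = \sum_{k=1}^{\infty} \frac{1}{k} = \infty, \qquad \sum_t \alpha_i^2(t) = \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty,$$

so both conditions hold.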
- Proof of Theorem 2- Define the accumulated-noise process $W_i(t; t_0)$ and the comparison process $Y_i(t; t_0)$- $W_i(t_0; t_0) = 0$, $\quad W_i(t+1; t_0) = (1 - \alpha_i(t)) W_i(t; t_0) + \alpha_i(t) w_i(t)$
 - $Y_i(t_0; t_0) = D_k$, $\quad Y_i(t+1; t_0) = (1 - \alpha_i(t)) Y_i(t; t_0) + \alpha_i(t) \beta D_k$, where $\beta < 1$ is the contraction factor of $F$
 
 - Because $x(t)$ is bounded, there always exists a constant $D_0$ such that $\|x(t) - x^*\| \le D_0$ for all $t$
 - Define shrinking thresholds $D_{k+1} = \gamma D_k$ with $\beta \le \gamma < 1$, so that $D_k \to 0$
 
 - Lemma 4: $W_i(t; t_0) \to 0$ and $Y_i(t; t_0) \to \beta D_k$ as $t \to \infty$, with probability 1
 - Lemma 5: if $\|x(t) - x^*\| \le D_k$ for all $t \ge t_k$, then $|x_i(t) - x^*_i| \le Y_i(t; t_k) + |W_i(t; t_k)|$ for $t \ge t_k$; hence there exists $t_{k+1}$ with $\|x(t) - x^*\| \le D_{k+1}$ for all $t \ge t_{k+1}$, and by induction $x(t) \to x^*$
 
- Q-learning setup- Definitions- $S$: finite state space
 - $U(i)$: set of actions available at state $i$
 - $p_{ij}(u)$- At state $i$, taking action $u$, you transition to state $j$ with probability $p_{ij}(u)$.
 
 - $c_{iu}$ (random variable)- If you take action $u$ at state $i$, you suffer cost $c_{iu}$
 - $c_{iu}$ is not assumed to be bounded ← important
 - But its variance is bounded
 
 - Stationary policy $\pi$:- A function defined on $S$ such that $\pi(i) \in U(i)$ for all $i \in S$.
 - Takes a state, returns an action
 
 - Absorbing state:- Ex) Suppose state 1 is absorbing: $p_{11}(u) = 1$ and $c_{1u} = 0$ for all $u \in U(1)$
 💡Once you enter state 1, you will stay there forever (at no further cost)
 - Proper stationary policy:- Stationary policy under which the state reaches the absorbing state with probability 1 as $t \to \infty$
 
 - $V^*(i)$:💡Optimal (minimal) expected cost-to-go when you are at state $i$- $V^*(i) = \min_{\pi} E\big[\sum_{t=0}^{\infty} \beta^t c_{i_t \pi(i_t)} \,\big|\, i_0 = i\big]$
 
 - $\beta$: discount factor- Undiscounted case: $\beta = 1$
 - Discounted case ($\beta < 1$)
 
 
 - Assumption 7- There exists at least one proper stationary policy.
 - Every improper stationary policy yields infinite expected cost for at least one initial state.
 💡Necessary for convergence in the undiscounted ($\beta = 1$) case- The fact that this natural assumption suffices is part of their contribution.
 - Work before this required all policies to be proper (a toy example of proper vs. improper policies is sketched below)
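
To make the absorbing-state / proper-policy definitions concrete, here is a toy stochastic shortest path instance of my own (not from the paper): state 1 is absorbing and cost-free, while at state 2 the action `'go'` moves toward state 1 and `'stay'` loops forever.

```python
import numpy as np

# Toy SSP (my own example): states {1, 2}; state 1 is absorbing and cost-free.
P = {   # P[(state, action)] = {successor: probability}
    (1, 'any'):  {1: 1.0},
    (2, 'go'):   {1: 0.9, 2: 0.1},
    (2, 'stay'): {2: 1.0},
}
cost = {(1, 'any'): 0.0, (2, 'go'): 1.0, (2, 'stay'): 1.0}

def absorbed_fraction(policy, start=2, horizon=1000, n_runs=200, seed=0):
    """Empirically estimate how often the policy reaches the absorbing state 1."""
    rng = np.random.default_rng(seed)
    absorbed = 0
    for _ in range(n_runs):
        s = start
        for _ in range(horizon):
            if s == 1:
                absorbed += 1
                break
            succ = P[(s, policy[s])]
            s = rng.choice(list(succ), p=list(succ.values()))
    return absorbed / n_runs

print("pi(2) = 'go'  :", absorbed_fraction({1: 'any', 2: 'go'}))    # ~1.0 -> proper
print("pi(2) = 'stay':", absorbed_fraction({1: 'any', 2: 'stay'}))  # 0.0 -> improper
```

The `'stay'` policy also pays cost 1 forever, i.e., infinite expected cost, which matches the second condition of Assumption 7.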
 
 
- Q-learning algorithm- Given the dynamic programming operator $T$: $(TV)(i) = \min_{u \in U(i)} \big[ E[c_{iu}] + \beta \sum_{j \in S} p_{ij}(u) V(j) \big]$
 - Q-learning is a method to compute the optimal Q-factors $Q^*$, based on the fact that $Q^*$ satisfies the following equation (Bellman principle): $Q^*(i, u) = E[c_{iu}] + \beta \sum_{j \in S} p_{ij}(u) \min_{v \in U(j)} Q^*(j, v)$, with $V^*(i) = \min_{u \in U(i)} Q^*(i, u)$
 - The update rule is as follows: $Q(i, u) \leftarrow Q(i, u) + \alpha(i, u) \big( c_{iu} + \beta \min_{v \in U(\bar{j})} Q(\bar{j}, v) - Q(i, u) \big)$, where $\bar{j}$ is a successor state sampled according to $p_{i \cdot}(u)$ and $c_{iu}$ is the sampled cost (a one-step code sketch follows)
 
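A minimal sketch of a single tabular update under this rule (the two-state action sets, step-size schedule, and all names are my own illustration, not from the paper):

```python
beta = 0.9                                   # discount factor (my choice)

Q = {}                                       # Q[(state, action)] -> current estimate
counts = {}                                  # per-pair update counters

def actions_at(state):
    """Hypothetical action sets for a 2-state example."""
    return ['a', 'b'] if state == 2 else ['any']

def q_update(i, u, sampled_cost, next_state):
    """One tabular Q-learning step:
    Q(i,u) <- Q(i,u) + alpha * (c + beta * min_v Q(next_state, v) - Q(i,u))."""
    counts[(i, u)] = counts.get((i, u), 0) + 1
    alpha = 1.0 / counts[(i, u)]             # step size 1/k on the k-th visit
    target = sampled_cost + beta * min(Q.get((next_state, v), 0.0)
                                       for v in actions_at(next_state))
    old = Q.get((i, u), 0.0)
    Q[(i, u)] = old + alpha * (target - old)

# Example: at state 2 we took action 'a', paid cost 1.0, and landed in state 1.
q_update(2, 'a', 1.0, 1)
print(Q)   # {(2, 'a'): 1.0} -- first visit, alpha = 1, min_v Q(1, v) = 0
```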
- Does Q-learning converge?- Correspondence to stochastic approximation- Q-learning's update rule can be formulated as the stochastic approximation $x_i(t+1) = x_i(t) + \alpha_i(t) \big( F_i(x(t)) - x_i(t) + w_i(t) \big)$, with one component per state–action pair $(i, u)$
 - with the following definitions: $x_{(i,u)} = Q(i, u)$, $\quad F_{(i,u)}(Q) = E[c_{iu}] + \beta \sum_{j} p_{ij}(u) \min_{v \in U(j)} Q(j, v)$, $\quad w_{(i,u)}(t) = c_{iu} + \beta \min_{v \in U(\bar{j})} Q(\bar{j}, v) - F_{(i,u)}(Q)$ (zero-mean sampling noise)
 
 - Does $F$ satisfy Assumption 4?- It is known that if the dynamic programming operator $T$ satisfies Assumption 4, then $F$ satisfies Assumption 4
 - So, does $T$ satisfy Assumption 4?
 - When $\beta < 1$- $T$ is a contraction (well known) → satisfies Assumption 4💡$T$ is a contraction mapping if $\|T(V) - T(V')\|_{\infty} \le \beta \|V - V'\|_{\infty}$ for all $V, V'$ (a quick numerical check is sketched after this list)
 
 - When $\beta = 1$- If all policies are proper- There exists a positive vector $\xi$ such that $T$ is a contraction with respect to the weighted maximum norm $\|V\|_{\xi} = \max_i |V(i)| / \xi_i$ (Bertsekas et al., 1989, 1991)
 - Under Assumption 7- $V^*$ is the unique fixed point of $T$ (Bertsekas et al., 1989, 1991)
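
A quick numerical illustration of the contraction claim for $\beta < 1$ (the random toy MDP and all names are mine, not from the paper): apply $T$ to two arbitrary value functions and compare sup-norm distances before and after.

```python
import numpy as np

beta = 0.9                                   # discounted case
n_states, n_actions = 4, 3
rng = np.random.default_rng(1)

# Random toy MDP: P[u, i, j] = p_ij(u), c[i, u] = E[c_iu].
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)            # normalize to transition probabilities
c = rng.random((n_states, n_actions))

def T(V):
    """(TV)(i) = min_u [ E[c_iu] + beta * sum_j p_ij(u) V(j) ]."""
    return np.min(c + beta * np.einsum('uij,j->iu', P, V), axis=1)

V1 = rng.normal(size=n_states)
V2 = rng.normal(size=n_states)
print("||V1 - V2||_inf       =", np.max(np.abs(V1 - V2)))
print("||T(V1) - T(V2)||_inf =", np.max(np.abs(T(V1) - T(V2))))   # <= beta * the above
```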
 
 
 
- Conclusion (Theorem 4)- Given the stepsize conditions (Assumption 3)- In the discounted case ($\beta < 1$): $Q(t)$ converges to $Q^*$ with probability 1
 - In the undiscounted case ($\beta = 1$): under Assumption 7, and provided the iterates $Q(t)$ remain bounded, $Q(t)$ converges to $Q^*$ with probability 1
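
As a sanity check of the discounted part of the theorem, a self-contained toy run (the random MDP, exploration scheme, and step sizes are my own choices, not from the paper): tabular Q-learning with step size $1/(\text{visit count})$, compared against $Q^*$ obtained by iterating the Bellman equation for the Q-factors.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.8                                    # discounted case
nS, nA = 3, 2

# Random toy discounted MDP: P[i, u, j] = p_ij(u), c[i, u] = E[c_iu].
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((nS, nA))

# Q* by iterating Q(i,u) <- c(i,u) + beta * sum_j p_ij(u) * min_v Q(j,v).
Q_star = np.zeros((nS, nA))
for _ in range(2000):
    Q_star = c + beta * P @ Q_star.min(axis=1)

# Asynchronous tabular Q-learning along a single simulated trajectory.
Q = np.zeros((nS, nA))
counts = np.zeros((nS, nA))
state = 0
for _ in range(500_000):
    action = rng.integers(nA)                                 # explore uniformly
    next_state = rng.choice(nS, p=P[state, action])
    sampled_cost = c[state, action] + rng.normal(scale=0.1)   # noisy cost sample
    counts[state, action] += 1
    alpha = 1.0 / counts[state, action]
    target = sampled_cost + beta * Q[next_state].min()
    Q[state, action] += alpha * (target - Q[state, action])
    state = next_state

print("max |Q - Q*| =", np.abs(Q - Q_star).max())   # should be small
```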





