Algorithm parameters: step size $\alpha \in (0, 1]$, $\epsilon > 0$
Initialize $Q(s,a)$ for all $s \in \mathcal{S}^+$, $a \in \mathcal{A}(s)$, arbitrarily, except that $Q(\text{terminal}, \cdot) = 0$

Loop for each episode:
    Initialize $S$
    Loop for each step of episode:
        Choose $A$ from $S$ using some policy derived from $Q$ (e.g., $\epsilon$-greedy)
        Take action $A$, observe $R$, $S'$
        $Q(S,A) \leftarrow Q(S,A) + \alpha \big[ R + \gamma \max_a Q(S',a) - Q(S,A) \big]$
        $S \leftarrow S'$
    until $S$ is terminal
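
The procedure above can be sketched in Python as a minimal tabular implementation. This is an illustration under stated assumptions, not a reference implementation: the `ChainEnv` corridor environment, the `epsilon_greedy` helper, and the hyperparameter values (including the discount factor `gamma`, which the update uses but the parameter list does not name) are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


class ChainEnv:
    """Hypothetical corridor: states 0..n-1, start at 0, terminal at n-1."""

    def __init__(self, n=5):
        self.n = n
        self.actions = (0, 1)  # 0 = move left, 1 = move right

    def reset(self):
        return 0

    def step(self, state, action):
        # Moving left from state 0 leaves the agent in place.
        nxt = max(0, state - 1) if action == 0 else state + 1
        done = nxt == self.n - 1
        reward = 1.0 if done else 0.0  # reward only on reaching the goal
        return reward, nxt, done


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(state, a)] for a in actions)
    # Break ties at random so the all-zero initial table still explores.
    return rng.choice([a for a in actions if Q[(state, a)] == best])


def q_learning(env, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Q(s, a) starts at 0 everywhere, so Q(terminal, .) = 0 as required.
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()                      # Initialize S
        done = False
        while not done:                      # until S is terminal
            a = epsilon_greedy(Q, s, env.actions, epsilon, rng)
            r, s_next, done = env.step(s, a)  # observe R, S'
            # Off-policy target: bootstrap from the best action in S'.
            target = r + gamma * max(Q[(s_next, b)] for b in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next                       # S <- S'
    return Q


if __name__ == "__main__":
    Q = q_learning(ChainEnv(5))
    for s in range(4):
        print(s, round(Q[(s, 0)], 3), round(Q[(s, 1)], 3))
```

Terminal entries are never updated because the inner loop exits as soon as `S'` is terminal, so they keep their initial value of 0 and the bootstrap term vanishes on the final step of each episode.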