Overview of Reinforcement Learning

What is Reinforcement Learning?

Reinforcement learning (RL) is a branch of machine learning concerned with improving the performance of a software agent on a particular task through experience. It is generally considered distinct from both supervised and unsupervised learning.
Markov Decision Processes
A Markov Decision Process (MDP) is a model of a reinforcement learning problem consisting of four parts:
  • State (S) - the information the agent gets from observing the environment
  • Action (A) - a decision made in a particular state, chosen from a discrete or continuous set of actions
  • Transition probability (P) - the probability that action A will lead from one state (S) to another (S')
  • Reward (R) - the reward associated with the transition from S to S'
Markov Decision Processes are useful in reinforcement learning because they model the cycle at its core: the agent observes the environment to get a state and takes an action, which changes the state.
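
To make this concrete, here is a minimal sketch in Python of how the four parts might be written down as plain data structures. The two-state "robot battery" environment and all of its numbers are made up purely for illustration.

    # A toy MDP with two states and two actions, written as plain dictionaries.
    states = ["low_battery", "charged"]
    actions = ["recharge", "work"]

    # P[(s, a)] maps each possible next state s' to the probability of reaching it.
    P = {
        ("low_battery", "recharge"): {"charged": 1.0},
        ("low_battery", "work"):     {"low_battery": 0.7, "charged": 0.3},
        ("charged", "recharge"):     {"charged": 1.0},
        ("charged", "work"):         {"charged": 0.8, "low_battery": 0.2},
    }

    # R[(s, a, s')] is the reward for making that transition.
    R = {
        ("low_battery", "work", "charged"): 2.0,
        ("low_battery", "work", "low_battery"): -1.0,
        ("charged", "work", "charged"): 2.0,
        ("charged", "work", "low_battery"): 1.0,
        ("low_battery", "recharge", "charged"): 0.0,
        ("charged", "recharge", "charged"): 0.0,
    }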

Exploration vs Exploitation

The issue of exploration vs exploitation refers to the tension in RL between optimizing short-term and long-term value. In a video game, for example, an agent may get stuck chasing a loop of small rewards, even though a much greater reward could be obtained by taking a non-greedy action.

Essentially, exploitation has to do with immediate gratification (i.e. taking the greedy action in every state), while exploration involves taking a random action in pursuit of an even larger reward. Exploring during training helps the agent maximize the total reward that can be gained in a given environment, rather than getting caught in a local maximum.

Q Learning

Q Learning is a reinforcement learning algorithm that estimates the value of a given state-action pair; in other words, it estimates the value of taking a particular action in a given state. This value is expressed as a function of both state and action: Q(S, A). For a discrete set of states and actions, Q values are often stored in a table, where a value is looked up by finding the given state on one axis and the particular action on the other.
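
For example, with a handful of discrete states and actions, the Q table can be stored as a simple 2D array. This is just a sketch; the state and action counts below are arbitrary.

    import numpy as np

    n_states, n_actions = 5, 3               # arbitrary sizes for illustration
    Q = np.zeros((n_states, n_actions))      # one row per state, one column per action

    state = 2
    value_of_action_1 = Q[state, 1]          # look up Q(S, A): row = state, column = action
    best_action = int(np.argmax(Q[state]))   # greedy action for this state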

Bellman Equation

The Bellman Equation is used in training to correctly estimate the value of Q(S, A). Before training, the Q function is unlikely to be a good predictor of value given state and action. However, with experience, it becomes progressively better. Here is the Bellman Equation:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left( R_{t+1} + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right)

Once again, this equation is used to adjust the Q function so that it accurately predicts the value of taking an action in a given state. It does so by moving the current Q value toward the reward received at the next step, scaled by alpha, the learning rate. A discount factor (gamma) is also used to take the value of future actions into account. (One practical example of a discount factor: suppose someone pays you $1000 for your credit card info. While there is a short-term reward of $1000, the cost of losing all your life's savings, i.e. a negative future reward, would strongly discourage you from taking that action.)
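
Translated into code, one such update might look like the following sketch: tabular Q-Learning over an array Q like the one above, with placeholder values for alpha and gamma.

    import numpy as np

    alpha = 0.1    # learning rate
    gamma = 0.99   # discount factor

    def q_update(Q, s, a, reward, s_next):
        """Apply one Bellman update to the Q table in place."""
        td_target = reward + gamma * np.max(Q[s_next])   # reward plus best estimated future value
        td_error = td_target - Q[s, a]                   # how far off the current estimate is
        Q[s, a] += alpha * td_error                      # nudge the estimate toward the target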

Monte Carlo

Monte Carlo methods estimate quantities by random sampling. In reinforcement learning this is useful because, by taking random actions and observing the outcomes, the agent slowly but surely learns the best action to take in a given situation. This exploration is typically implemented with an epsilon-greedy policy: the agent takes a random action with probability epsilon, and epsilon gradually decreases over the course of training.
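
In code, a typical epsilon-greedy action selection might look like this sketch; the starting value of epsilon and the decay schedule are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng()
    epsilon = 1.0        # start fully exploratory
    epsilon_min = 0.05
    decay = 0.995

    def choose_action(Q, state):
        """Take a random action with probability epsilon, otherwise the greedy one."""
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))   # explore
        return int(np.argmax(Q[state]))            # exploit

    # After each episode, shrink epsilon so the agent exploits more as it learns:
    # epsilon = max(epsilon_min, epsilon * decay)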

DQN

Deep Q-Networks (DQN) combine deep neural networks with Q-Learning. Instead of a table, a neural network takes the state as input and learns to predict the Q-value of each action. Training works much like plain Q-Learning, except that an optimizer minimizes a loss, usually the mean squared error between the predicted Q-value and a target built from the observed reward plus the discounted estimate of future value.
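
As a rough sketch of the idea (using PyTorch; the network size, hyperparameters, and batch tensors are placeholders), the loss for a batch of transitions could be computed like this. A full DQN also uses an experience replay buffer and a separate target network, which are omitted here for brevity.

    import torch
    import torch.nn as nn

    # A small network that maps a 4-dimensional state to one Q-value per action.
    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    gamma = 0.99

    def dqn_loss(states, actions, rewards, next_states, dones):
        """Mean squared error between predicted Q-values and Bellman targets."""
        q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            q_next = q_net(next_states).max(dim=1).values
            targets = rewards + gamma * q_next * (1 - dones)
        return nn.functional.mse_loss(q_pred, targets)

    # Training step: loss = dqn_loss(...); optimizer.zero_grad(); loss.backward(); optimizer.step()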

Examples of RL Applications

Reinforcement learning is practical in several real-life applications, such as robotics and game playing. RL is a great fit for these problems because they can be naturally modeled as an environment in which an agent has to make decisions.

Notably, algorithms like AlphaZero, developed by Google DeepMind, use reinforcement learning to master board games such as chess, Go, and shogi simply by playing against themselves over and over. DeepMind's program even defeated Stockfish, the reigning chess engine at the time. Beyond board games, RL can also be used to play complicated video games, thanks to the power of convolutional neural networks (CNNs) at processing images. If you're interested in creating basic reinforcement learning programs, OpenAI's Gym environments are a great place to start.
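
A first script with Gym often looks roughly like the one below: a random agent on CartPole with no learning at all. The exact reset/step signatures differ a bit between Gym versions, so treat this as a sketch.

    import gym

    env = gym.make("CartPole-v1")
    obs = env.reset()

    for _ in range(200):
        action = env.action_space.sample()           # sample a random action
        obs, reward, done, info = env.step(action)   # newer Gym versions return five values here
        if done:
            obs = env.reset()

    env.close()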

People are also developing creative applications of RL, such as performing household tasks. Other robotics applications, such as teaching a robot to walk or to perform complex movements, are also popular. What's interesting is that in such robotics applications it is not always necessary to train on a physical robot: the agent can learn in a simulated environment, so that by the time it is connected to an actual robot it has already learned the necessary behavior!
