Let's Play a Game: How Reinforcement Learning Changed the World
- Thomas Yin
- Feb 23
- 8 min read

Do you ever wonder how circus managers get bears to balance a ball, or a tiger to jump through flaming hoops? The answer: reinforcement. Tigers don’t usually jump through flaming hoops, but they will if you give them a tasty piece of meat every time they do. Eventually, a tiger learns that in order to get the food, it must perform the daring leap, and so it does so ably and consistently. Humans learn in similar ways: we eat healthy food, exercise, and study hard in order to earn something positive, whether it be a burst of dopamine, money, or success. This phenomenon, whereby a human (or virtually any other animal) does more of a specific behavior after being rewarded for it, is an integral part of how we learn. For a long time, scientists wondered whether we could teach a computer in the same way. Papers from the late 1980s and 1990s detailed how “Q Learning” could help make algorithms adaptable to a complex environment, but it wasn’t until DeepMind’s landmark 2013 research paper that the world saw how it could be done in practice. It is not an exaggeration to say that, over the course of a decade, Reinforcement Learning, or RL for short, has changed the world, and it will continue to do so for quite a while. This article discusses the technical workings of this fascinating approach while commenting on its lasting impact on LLM technology.
Wisdom Comes from Within
In 1938, behavioral psychologist B.F. Skinner coined the term “operant conditioning” to describe how organisms increase their propensity to perform certain voluntary actions through a process called reinforcement. He found that if an action (in his experiments, a rat pressing down a lever) was reinforced by something positive (he used food and water), it was likely to be repeated. On the other hand, if an action (such as touching a hot stove) was punished by something harmful (the pain of a burn), it was less likely to be repeated. As simple as this process is, humanity as we know it would not be the same without it. Imagine a world in which young children struggled to learn that they shouldn’t do the things that reliably hurt them!
It was this simple truth that led Christopher Watkins to develop his 1989 Ph.D. thesis on exactly this topic. If humans and animals could learn by reinforcement, why couldn’t machines? He proposed Q Learning, a process by which an agent learns through interactions with a limited environment. In any environment, a Q Learning agent’s goal is to develop a policy by learning which actions pay off in which states. To understand this, take the example of a videogame in which the player must pass through several gates, each locked by a password consisting of a single random digit. Here the action, or the behavior the agent performs, is the digit the agent guesses, and the state, or the environment surrounding the agent, is the gate the agent is currently standing at. The idea is simple: a reward is defined by the humans training the agent. A reward can be big (perhaps for finishing a level) or small (for passing a single gate). Watkins proposed that a model would try all possible state-action pairs (specific situations in which a given action is taken from a given state), then apply a simple maximization rule to prefer the memorized state-action pairs that led to a high reward. For example, suppose the gates in our hypothetical videogame are set up so that the six passwords are, in order, 5, 4, 9, 8, 7, and 2.

Then, a possible state-action pair would be to choose “4” at the second gate. Another would be to choose “6” at the second gate, although that choice would lead to a much lower reward, since it is the wrong digit for that gate. Let us say that the reward for passing the sixth gate is 1, and that each earlier gate passed gives a reward of 0.2. An early Q Learning agent would try every digit at the first gate, then the second gate, and so on, until it had tried all the possible state-action pairs and collected the rewards for each combination. It would then settle on the series of digits leading to the highest reward, which, as we can tell, is the correct combination 5-4-9-8-7-2, for a total reward of 2 (five gates at 0.2 each, plus 1 for the final gate). Q Learning is named after Q Values, a proposed quantity denoting the expected reward that results from taking a specific action in a specific state. By learning to maximize the Q Value, Watkins hypothesized, a model could make optimal decisions in any environment with well-defined rewards.
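To make this concrete, here is a minimal sketch of tabular Q Learning applied to the gate game described above. The environment, reward values, and hyperparameters are my own illustrative choices (not from Watkins’s thesis), and the agent uses a small amount of random exploration, an idea the next section returns to.

```python
import random

# Hypothetical gate game from the article: six gates, each opened by one digit (0-9).
PASSWORDS = [5, 4, 9, 8, 7, 2]
GATE_REWARD = 0.2    # reward for passing each of the first five gates
FINAL_REWARD = 1.0   # reward for passing the sixth gate

def step(gate, guess):
    """Guess a digit at the given gate; return (next_gate, reward, done)."""
    if guess == PASSWORDS[gate]:
        last = gate == len(PASSWORDS) - 1
        return gate + 1, FINAL_REWARD if last else GATE_REWARD, last
    return gate, 0.0, True  # wrong digit: the episode ends with no reward

# The Q-table: one learned value per (gate, digit) state-action pair.
Q = [[0.0] * 10 for _ in range(len(PASSWORDS))]
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(20_000):
    gate, done = 0, False
    while not done:
        # Mostly exploit the best known digit, but explore occasionally so
        # later gates are eventually discovered.
        if random.random() < epsilon:
            guess = random.randrange(10)
        else:
            guess = max(range(10), key=lambda d: Q[gate][d])
        next_gate, reward, done = step(gate, guess)
        # Q Learning update: nudge the estimate toward the observed reward
        # plus the discounted value of the best action at the next gate.
        future = 0.0 if done else gamma * max(Q[next_gate])
        Q[gate][guess] += alpha * (reward + future - Q[gate][guess])
        gate = next_gate

# The learned policy: the best digit at each gate (should recover 5-4-9-8-7-2).
print([max(range(10), key=lambda d: Q[g][d]) for g in range(len(PASSWORDS))])
```

After enough episodes, reading off the highest Q Value at each gate recovers the correct combination, which is exactly the maximization Watkins described.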
DeepMind Steps In
In 2013, researchers at the AI research lab DeepMind published what would become a landmark paper in AI research. It introduced one of the most important classes of Reinforcement Learning (RL) models: the Deep Q-Network (DQN). The researchers acknowledged the promise of using Q Learning to train agents, but noted that Watkins’s scheme had a few issues. First, running through all possible state-action pairs does not scale to complex games with millions of combinations of decisions. Second, a simple maximization rule is not enough, because more complex games tend to have local maxima in their rewards. Say a maze game’s reward is defined by the total distance a player travels toward the goal. If the first path Watkins’s agent discovers is a dead end, it will keep revisiting that dead end because it does not know any better.
The DeepMind team addressed both problems in a clever way. To avoid exhaustively running through every option, they adopted the epsilon-greedy policy, a long-standing recipe for random exploration. This method, named after the Greek letter epsilon (ε), balances Watkins’s greedy policy of always chasing the highest known reward with an exploratory one. At each state, the agent has an ε chance of exploring (choosing an action at random) and a 1 - ε chance of following the maximum Q Value dictated by the greedy policy. If you’re not into formal explanations, this simply means the model will occasionally try new actions: it saves time by mostly exploiting what it already knows (so less valuable state-action pairs can be skipped) while staying flexible in its decision making (so the agent doesn’t get stuck on local maxima).
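In code, the rule is just a coin flip. The sketch below is a generic formulation; the annealing schedule at the end is a common convention in DQN-style training rather than a detail fixed by the 2013 paper.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action); otherwise exploit
    the action with the highest estimated Q Value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Epsilon is commonly annealed over training: start almost fully random, then
# settle into mostly greedy behavior as the value estimates improve.
def annealed_epsilon(step, start=1.0, end=0.1, decay_steps=100_000):
    progress = min(step / decay_steps, 1.0)
    return start + progress * (end - start)
```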
Then there was the problem of evaluation. If the agent is still midway through a game, how does it know which actions will actually lead to a better outcome? Just because you clapped your hands before sinking a three-pointer doesn’t mean the shot went in because of your clapping. The agent has to predict. DeepMind introduced a new component, the Q Network, together with a training trick they describe as “breaking the correlation” between consecutive experiences. The Q Network is essentially a compact neural network inside the complete DQN. Its only job is to learn from the agent’s experiences and, given a state, predict the Q Value of each possible action. Going back to our example with gates and passwords, a well-trained Q Network will output a higher predicted Q Value for guessing the correct digit at each gate than for guessing an incorrect one. The Q Network itself evolves throughout training: through experience replay, it is trained on batches of past transitions that the agent has collected from the environment, adjusting its weights to better predict Q Values and thus give better “advice” to the agent. The two are truly a match made in heaven.
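Here is a heavily simplified sketch of that training step, written with PyTorch. The network shape, buffer size, and hyperparameters are illustrative stand-ins (the real DQN used a convolutional network over stacked Atari frames), not the paper’s actual implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# A toy Q Network: it maps a state (here, 4 numbers) to one predicted Q Value
# per action (here, 2 actions). Sizes are illustrative only.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# Experience replay: a memory of past (state, action, reward, next_state, done)
# transitions the agent has collected while playing.
replay_buffer = deque(maxlen=10_000)

def train_step(batch_size=32):
    if len(replay_buffer) < batch_size:
        return
    # Sampling a random batch from memory is what "breaks the correlation"
    # between consecutive experiences.
    batch = random.sample(replay_buffer, batch_size)
    states      = torch.tensor([t[0] for t in batch], dtype=torch.float32)
    actions     = torch.tensor([t[1] for t in batch], dtype=torch.int64)
    rewards     = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    next_states = torch.tensor([t[3] for t in batch], dtype=torch.float32)
    dones       = torch.tensor([t[4] for t in batch], dtype=torch.float32)

    # Q Values the network currently predicts for the actions actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: observed reward plus discounted best future value.
    # (For brevity this reuses q_net itself; DeepMind's 2015 follow-up added a
    # separate, slowly updated "target network" for extra stability.)
    with torch.no_grad():
        best_next = q_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1 - dones) * best_next

    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Each call to train_step nudges the network’s predictions toward the rewards the agent actually observed, which is exactly the “advice” loop described above.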
All the World’s a Game…
Reinforcement Learning in its purest form has seen many advances. DeepMind, after its acquisition by Google in 2014, went on to develop AlphaZero, one of the most celebrated RL models of all time, which couples a deep network’s move probabilities with the Monte Carlo Tree Search (MCTS) algorithm (see the sketch after this list). The same family of ideas underpins several other landmark systems, including:
AlphaGo, AlphaZero’s predecessor, which shocked the world in 2016 by decisively defeating Go world champion Lee Sedol at what is considered one of the most complex board games ever devised.
AlphaProof, a variant dedicated to solving Olympiad math problems by operating on proofs formalized in the Lean language, which achieved silver-medal-level performance on problems from the International Mathematical Olympiad (IMO).
AlphaFold, whose creators were awarded the 2024 Nobel Prize in Chemistry, which achieved breakthroughs in protein structure prediction, one of the most complicated problems in molecular biology.
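At the heart of the MCTS-guided approach is a selection rule that balances what the search has already found rewarding against moves the network’s priors favor but the search has barely explored. The PUCT-style scoring below is a rough, generic sketch of that idea; the Node class and the c_puct constant are illustrative, not taken from DeepMind’s code.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float              # the network's prior probability for this move
    visit_count: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)

    def value(self):
        # Average value observed for this move so far (0 if never visited).
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.5):
    """Pick the child that best balances 'has scored well so far' against
    'the network likes it but the search has barely explored it yet'."""
    total_visits = sum(child.visit_count for child in node.children.values())
    def score(child):
        exploration = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visit_count)
        return child.value() + exploration
    return max(node.children.items(), key=lambda item: score(item[1]))
```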
The concept of Reinforcement Learning has a lot to teach us about life: figure out which things have the highest value, and seek to attain that value through your actions. If something doesn’t go your way, try something else until it works. Humans often overlook the subtleties of the very systems we design, and this is why I love Reinforcement Learning so much. For something so simple and brilliant, its potential is confined by (ironically) the nature of humanity. One of the most important parts of the RL process, the reward function, is set by humans. Looking back at the AlphaZero team’s accomplishments, it’s obvious that we are the limiting factor in what can be done with RL. It seemed that, since AlphaZero could master almost any game, the only thing left to do was to turn every one of the world’s problems into a game and have AlphaZero play it.
And that’s what the world’s top researchers are doing. Well, sort of. When I first learned about RL in the summer of 2024, the technology hadn’t had a major public breakthrough since the triumphs of the AlphaZero team in 2017. Everyone was talking about ChatGPT, it seemed, and the Transformer-based models that had been dominating technology discussion. I thought wistfully about how cool RL was, and then I forgot about it. That is, until OpenAI combined the Transformer architecture with Reinforcement Learning, creating an unholy hybrid I like to call RL-LLMs, or Reinforcement Learning-Large Language Models, for simplicity. It seemed like a no-brainer: strengthened by a paradigm called Reinforcement Learning from Human Feedback (RLHF), RL-LLMs could break down problems with the power of the Transformer and derive general solutions using step-based RL. While this combination is a logical next step for the industry, the deployment of these models exacerbated an already disastrous worker-exploitation problem in the AI industry, something we have covered before and advocate against.
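The “human feedback” in RLHF usually enters through a reward model trained on human preference comparisons, which the language model is then optimized against. The snippet below is a minimal sketch of that preference loss; reward_model here is a stand-in for any network that outputs a single scalar score, not a reference to a particular implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen, rejected):
    """Bradley-Terry-style preference loss: train the reward model to score
    the human-preferred response higher than the rejected one."""
    r_chosen = reward_model(chosen)      # scalar reward for the preferred response
    r_rejected = reward_model(rejected)  # scalar reward for the rejected response
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```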
This being said, RL has a long way to go before it reaches its full potential. Modern RL-LLMs use Chain of Thought (CoT) reasoning steps directly as RL actions, improving inference capabilities but greatly inflating development costs. Many traditional RL models must train for millions of iterations before converging, and costs balloon quickly when the environment is large or overly complex, as is the case when generalizing to problems posed in natural language. To this end, future RL models may rely on LLMs (or even RL-LLMs) to “gamify” problems, much as the AlphaProof team converted the problems it encountered into the formal proof language Lean so that they could be solved. With this, we may be able to leverage field-specific RL far more cheaply and efficiently, even if it is a step back from the widely recognized commercial vision of Artificial General Intelligence (AGI). Again, RL might solve the world’s most complex problems… if we can figure out how to quickly turn them into games.