Reinforcement Learning notes
Video 1:
Alan Turing, in his paper, first asked whether machines can think. The author then shows a paragraph from the paper which describes that to simulate an adult human's brain we need three things:
- The initial state of the mind - a child's brain
- The education the above brain is subjected to
- Other experiences that it has been subjected to
In the paper, Turing states that it is easier to take a child's brain, which he compares to an empty notebook, and then subject it to a course of education to obtain an adult human brain. He argues this because he thinks a child's brain has very little mechanism, which means it could be easily programmed.
So the question that arises from this paragraph is: is it easier to create a program that learns over time, or to directly create a program that already has what such learning would achieve over time?
Intelligence was then described as the ability to learn to make decisions to achieve goals.
This brings us to RL:
- Learning to make decisions by interacting with an environment.
- Interactions are often sequential.
- It is goal-directed.
- Can learn without examples of optimal behavior.
The interaction loop:
We mainly have an agent and an environment. The agent takes some action, which may or may not affect the environment; the environment then produces an observation which the agent takes in. The goal of this whole interaction loop is to maximize the sum of rewards through repeated interaction.
If we don't have a goal specified for the interaction loop, we don't know what the agent will learn through repeated interaction.
RL is based on the reward hypothesis: any goal can be expressed as the outcome of maximizing cumulative reward.
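A minimal sketch of this interaction loop in Python, assuming a made-up toy environment and a random agent (the `Environment` class, its `step` method, and `random_agent` are illustrative assumptions, not anything from the lecture):

```python
import random

class Environment:
    """Toy environment (assumed for illustration): guess a hidden digit."""
    def __init__(self):
        self.target = random.randint(0, 9)

    def step(self, action):
        # The environment responds with an observation and a scalar reward.
        reward = 1.0 if action == self.target else 0.0
        observation = "correct" if reward > 0 else "wrong"
        return observation, reward

def random_agent(observation):
    # Placeholder policy: ignores the observation and acts randomly.
    return random.randint(0, 9)

env = Environment()
observation = None
total_reward = 0.0
for t in range(100):                        # repeated interaction
    action = random_agent(observation)      # agent executes an action
    observation, reward = env.step(action)  # environment returns observation and reward
    total_reward += reward                  # goal: maximize cumulative reward
print("cumulative reward over 100 steps:", total_reward)
```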
Formalizing RL:
At each time step t:
The agent receives an observation Ot and possibly a reward Rt (according to the author, the reward is sometimes internal to the agent and depends on the observation)
Executes action At
The reward Rt is a scalar feedback signal, which means it is always a single number, generally positive and in some cases negative (a penalty), according to the author.
As stated earlier, the goal of the agent is to maximize the cumulative reward, which is also called the return: Gt = Rt+1 + Rt+2 + Rt+3 + ...
Notice that the return doesn't include Rt. This is because at time step t we take an action At, and the reward for that action is received at time step t+1, so the return starts from the reward at t+1.
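As a small sketch, the undiscounted return defined above can be computed from a list of rewards like this (the reward values are made up for illustration, and `rewards[k]` is assumed to hold Rk):

```python
def compute_return(rewards, t):
    """Compute Gt = Rt+1 + Rt+2 + Rt+3 + ... for one episode.

    Assumes rewards[k] holds Rk, so the sum starts at index t + 1,
    reflecting that the action At is rewarded at time step t + 1.
    """
    return sum(rewards[t + 1:])

# Made-up rewards R0 .. R4 for illustration
rewards = [0.0, 1.0, 0.0, 2.0, 1.0]
print(compute_return(rewards, t=0))  # 1.0 + 0.0 + 2.0 + 1.0 = 4.0
print(compute_return(rewards, t=2))  # 2.0 + 1.0 = 3.0
```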