Week 2: Discovering Hunger
- Abhijit Baruah
- Jun 8, 2022
- 2 min read
Updated: Jun 16, 2022
I began the week by training the RL model using the observations and actions mentioned in the Week 1: Research blog post.
I observed that the agents would quickly learn not to walk out of bounds, but would then prefer not to move at all and stay in one place.
This was primarily a consequence of how the PPO algorithm works.
Proximal Policy Optimization picks a "policy" (a mapping from states of the world to actions) at each training step while ensuring that the newly picked policy does not differ too much from the previously chosen one; this constraint is what reduces variance during training.
The negative reward for staying in one place and dying was far smaller than the one for falling out of bounds, so the model chose the actions that maximized the reward function, which in this case meant staying in one place (sigh).
For more information about PPO in RL: https://www.geeksforgeeks.org/a-brief-introduction-to-proximal-policy-optimization/
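In short, PPO limits each update by clipping the probability ratio between the new and old policies. A minimal NumPy sketch of that clipped surrogate objective (function and variable names are my own for illustration, not the project's training code):

```python
import numpy as np

def ppo_clipped_objective(new_logp, old_logp, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate objective.

    ratio = pi_new(a|s) / pi_old(a|s); clipping it to [1-eps, 1+eps]
    keeps the new policy from drifting too far from the old one in a
    single update.
    """
    ratio = np.exp(new_logp - old_logp)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # PPO takes the minimum of the unclipped and clipped terms,
    # so over-large policy changes get no extra credit.
    return np.minimum(ratio * advantage, clipped * advantage)

# A policy that doubles an action's probability (ratio = 2.0) with a
# positive advantage still only gets credit up to the 1 + eps cap:
obj = ppo_clipped_objective(new_logp=np.log(2.0), old_logp=0.0, advantage=1.0)
```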

The red numbers indicate the agents' health, and to speed up training I decided to train 12 agents simultaneously. The above video shows the start of training, with agents continuously moving out of bounds.

The agents then figured out that they could stay still and earn a comparatively smaller negative reward by letting their health bar run out.
This behavior was not optimal, so to further improve training I made the following changes:
1) Decreased the health every frame at the same rate, instead of different rates depending on whether the bot was stationary or moving. This was done to motivate the agent to move around the world.
2) Positive reward for every frame the agent maintained a health above 50.
3) Negative reward for every frame the agent had health below 50.
4) Added an observation that indicated the distance between the agent and the food.
5) Added an observation that indicated the days survived.
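The changes above can be sketched as a per-frame reward and observation function. The 50-health threshold is from the post; the decay rate and reward magnitudes are made-up illustrative values, not the project's actual numbers:

```python
import math

# Same decay every frame, whether the agent moves or not (change 1).
HEALTH_DECAY_PER_FRAME = 0.1  # illustrative value

def step_reward(health):
    """Positive reward for each frame health is above 50,
    negative for each frame it is below (changes 2 and 3)."""
    return 0.01 if health > 50 else -0.01

def extra_observations(agent_pos, food_pos, days_survived):
    """New observations: distance to the food and days survived
    (changes 4 and 5)."""
    return [math.dist(agent_pos, food_pos), float(days_survived)]
```

With this shape, an agent that camps in place bleeds health, crosses the 50 threshold, and starts accumulating negative reward every frame, so moving toward the food becomes the reward-maximizing behavior.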
The results after training the agent were really promising!

The initial training iterations already show a massive improvement: the bot no longer moves out of bounds and stops closer to the food. The purple integer indicates the days survived.

The agent has already learned how to get food and tries to do so when its health has dropped below 50. I let the training run for close to 3M iterations before stopping it and saving the model.
To test the trained model, I attached it to the agent prefab:

The graph shows the cumulative reward versus the number of training cycles; the agents reached their maximum cumulative reward at around 2.5M training cycles, indicating a sufficiently trained model.
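One simple way to decide when a reward curve like this has flattened out is to compare a smoothed recent average against an earlier one. A sketch (the window and tolerance are arbitrary assumptions, not values from the project):

```python
import numpy as np

def has_plateaued(rewards, window=100, tol=1e-3):
    """True once the mean cumulative reward over the last `window`
    episodes stops improving on the window before it."""
    if len(rewards) < 2 * window:
        return False  # not enough history to compare
    recent = np.mean(rewards[-window:])
    earlier = np.mean(rewards[-2 * window:-window])
    return bool(recent - earlier < tol)
```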

The above image is a snapshot taken during training that shows almost every agent surviving multiple days, which demonstrates the progress made.
For next week I am going to try and find a way to make the food spawn dynamically!
