Spatial Furniture Arrangement using Reinforcement Learning
The past few years have seen the rise of large coworking spaces like WeWork and INNOV8. Our interdisciplinary team of an architect, engineers and an economist saw an interesting and challenging opportunity in developing a space design solution for such coworking spaces. Coworking spaces serve very diverse clients who demand environments suited to their goals. We therefore set out to build a solution that generates designs for coworking spaces optimised for different parameters such as cost, creativity, privacy and energy efficiency.
Before we dived in deep
We used the Plaksha network to reach out to the founder of INNOV8, one of India's top coworking space companies. We spent a week interviewing staff and clients at different INNOV8 centres to better understand what goes into building and using a space. We learnt that they currently design new spaces with the sales, supply, operations and design teams working together, based on heuristics and past experience. One of their chief goals was to arrange objects in a space to ensure maximum utilization of assets. We felt that a data-driven solution could complement their work. We also realised that even a large coworking firm like INNOV8 was not using its HVAC and client usage data as inputs to the design of its next coworking space, and this is where we thought our project could help coworking spaces like INNOV8.
Project scope and definition
The project's scope has evolved since then. We initially believed that optimizing an existing office space would require sensor data, and that Machine Learning models trained on that data could both suggest changes to existing office spaces and help generate new office space designs.
After our first presentation to our project advisors, Alexander Fred Ojala and Ikhlaq Sidhu, the task was to reduce the complexity of our problem statement and work on a subset of the problem, keeping a limited set of measurement parameters for developing the model.
We were aware that both RL and GAN models could produce garbage results, given the subjectivity of architectural design as a field and a few other factors. In the end, the team decided to explore RL and GANs as standalone approaches to the problem of optimizing the floor plan of an office space by moving the objects in it, thereby effectively optimizing the occupancy of that room.
As the lead of the project, I handled the Reinforcement Learning part. The reason I chose reinforcement learning for spatial arrangement in particular was to create an agent that could learn through experience in very different 3D plans and then be deployed in a new 3D plan to arrange the spatial layout optimally.
The challenge
Various platforms and solutions
Given the six-week time crunch, I only considered two well-known platforms. The first was OpenAI Gym, which a teammate was trying out and which came with a lot of problems, the major one being the environment: a 3D environment was a must for our project, and creating a new 3D environment for OpenAI Gym wasn't feasible. I then came across Unity ML-Agents and soon realized that creating a custom environment in Unity is much easier. The only challenge with Unity was that it uses C#, but in the end I went with Unity ML-Agents because of its ease of use.
Reward Setup: PPO vs SAC
As discussed earlier, for reinforcement learning we need an environment that provides information when an agent performs actions in that particular space. The information is provided in the form of vectors. Furthermore, the agent receives rewards when an action brings it closer to the desired goal and penalties when it moves away from it. This process helps in learning an optimal strategy for achieving the goal. In our case, we reward the agent whenever it pushes the table-chair to the window and penalize it fractionally for every step it takes. Our goal at this point is limited to training the agent on one particular task: a simple arrangement of the furniture close to the window. We define our rewards and penalties formally here:
Agent Reward Function:
-0.0025 for every step.
+1.0 every time the agent moves the table-chair to touch the window
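A minimal sketch of how these values could be wired into a Unity ML-Agents (0.x) agent, assuming the API used in the full agent script at the end of this post; the method TouchedWindow here is hypothetical, and the full script expresses the same idea with a per-step penalty scaled by maxStep and a larger goal reward:
using MLAgents;
public class RewardSketchAgent : Agent
{
    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // ... apply the chosen movement action here ...
        AddReward(-0.0025f); // small penalty on every step, encouraging shorter solutions
    }
    // Hypothetical callback, invoked when the table-chair touches the window.
    public void TouchedWindow()
    {
        AddReward(1.0f); // goal reward
        Done();          // end the episode so the training area resets
    }
}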
To make training faster and more efficient, I duplicated the 3D environment 10 times so that the agent could learn in parallel. All 10 3D environments are different and keep changing every time the agent moves the table and chair to the window. This also helps the agent avoid repeating the same mistakes and instead favour actions that maximize the reward.
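In Unity this duplication is usually done by copying the training-area prefab in the editor. A hypothetical runtime equivalent, assuming the whole area (ground, walls, agent and furniture) lives in a single prefab, might look like this:
using UnityEngine;
public class AreaSpawner : MonoBehaviour
{
    public GameObject areaPrefab; // prefab containing ground, walls, agent and furniture
    public int copies = 10;       // total number of training areas
    public float spacing = 30f;   // distance between area centres

    void Awake()
    {
        // The original area already sits in the scene, so spawn copies - 1 more,
        // laid out in a row so they never overlap.
        for (int i = 1; i < copies; i++)
        {
            Instantiate(areaPrefab,
                transform.position + new Vector3(i * spacing, 0f, 0f),
                Quaternion.identity);
        }
    }
}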
Unity ML-Agents offers two important state-of-the-art algorithms: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). PPO uses a neural network to approximate the ideal function that maps an agent's observations to the best action it can take in a given state, while SAC learns from experiences collected at any point in the past: they are placed in an experience replay buffer and drawn at random during training. SAC tends to be the better choice when each environment step is slow or computationally heavy.
Model 1, max_steps = 50,000, PPO and SAC
The very first model that I trained with PPO took about half an hour to train, but one could clearly see the agent struggling to move the chair and the table to the window, so some changes were needed in the hyperparameters. Here the blue box is the agent and the brown object is the table and chair.
One can clearly see here that SAC performs much worse than PPO. The probable reason is that, since SAC is off-policy and draws experiences at random from its replay buffer, it leans more towards exploring the environment.
Comparison between 50,000 steps with PPO and SAC
This was an experimental phase in which I trained the agent with two separate algorithms, Proximal Policy Optimization and Soft Actor-Critic, on two different occasions. After several attempts, 50,000 steps seemed to give the agent enough experience to place the table and chair close to the window in a decent amount of time. The following are the hyperparameters that were used for the two runs:
PPO Hyperparameters:
trainer: Proximal Policy Optimization (PPO)
batch_size: 128
beta: 0.01
buffer_size: 2048
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 5.0e4
memory_size: 256
normalize: False
num_epoch: 3
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 2000
use_recurrent: False
vis_encode_type: simple
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
SAC Hyperparameters:
trainer: Soft Actor Critic
batch_size: 128
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 256
init_entcoef: 0.05
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 5.0e4
memory_size: 256
normalize: False
num_update: 1
train_interval: 1
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 2000
tau: 0.005
use_recurrent: False
vis_encode_type: simple
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
The first few training steps when max_steps = 300,000
As I started training with a new set of hyperparameters (see below for details), the agent took a lot of time to figure out what was happening in its surroundings. As time passed, the agent started working through the difficulties and understanding the 3D environment and the goal.
The last few training steps when max_steps = 300,000
After over an hour, I could clearly see that the agent had improved significantly and was able to find its way out of the difficulties caused by the changing 3D layout.
Comparison between 300,000 steps with PPO and SAC
Next up, I tuned some hyperparameters to see how the agent performed, and I substantially increased the number of steps. Again, one can clearly see that the SAC algorithm wasn't performing well. There are a few differences between Model 1 with 50,000 steps and this model: the number of layers is now 3, and the buffer size has doubled. The following are the hyperparameters I used for the run in the video above:
algorithm used: Proximal Policy Optimization
batch_size: 128
beta: 0.01
buffer_size: 4096
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 3.0e-4
learning_rate_schedule: linear
max_steps: 3.0e5
memory_size: 256
normalize: false
num_epoch: 3
num_layers: 3
time_horizon: 64
sequence_length: 64
summary_freq: 2000
use_recurrent: false
vis_encode_type: simple
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
Single Agent -- Multiple Furniture Reinforcement Learning
Having successfully realized single-agent, single-furniture spatial arrangement and having some time left, I moved on to arranging multiple pieces of furniture in a particular space. This is particularly challenging as well as more practical, because a real space will contain multiple pieces of furniture that have to be arranged according to certain goals. Moving forward, I would like to focus on this aspect and also on using multi-agent reinforcement learning.
Code
As mentioned above, one of the challenges in this project was my very first encounter with C#, but since it is based on Object-Oriented Programming, understanding and writing the code didn't take much time. There are three scripts one has to write to train the agent in Unity ML-Agents: one dedicated to the Academy in which everything (the environment) happens, one dedicated to the agent, and one dedicated to detecting windows. The agent script is listed below; sketches of the other two follow it.
using System.Collections;
using UnityEngine;
using MLAgents;
public class OfficeBasic : Agent
{
public GameObject ground;
public GameObject area;
[HideInInspector]
public Bounds areaBounds;
OfficeAcademy m_Academy;
public GameObject goal;
public GameObject block;
public GameObject bloc;
[HideInInspector]
public WindowDetect winDetect;
[HideInInspector]
public WindowDetect winDetec;
public bool useVectorObs;
Rigidbody m_BlockRb;
Rigidbody m_BlocRb;
Rigidbody m_AgentRb;
Material m_GroundMaterial; //cached on Awake()
RayPerception m_RayPer;
float[] m_RayAngles = { 0f, 45f, 90f, 135f, 180f, 110f, 70f };
string[] m_DetectableObjects = { "block", "goal", "wall" };
Renderer m_GroundRenderer;
void Awake()
{
m_Academy = FindObjectOfType<OfficeAcademy>();
}
public override void InitializeAgent()
{
base.InitializeAgent();
// Cache the window detectors, rigidbodies, ground bounds and ground renderer.
winDetect = block.GetComponent<WindowDetect>();
winDetect.agent = this;
winDetec = bloc.GetComponent<WindowDetect>();
winDetec.agent = this;
m_RayPer = GetComponent<RayPerception>();
m_AgentRb = GetComponent<Rigidbody>();
m_BlockRb = block.GetComponent<Rigidbody>();
m_BlocRb = bloc.GetComponent<Rigidbody>();
areaBounds = ground.GetComponent<Collider>().bounds;
m_GroundRenderer = ground.GetComponent<Renderer>();
m_GroundMaterial = m_GroundRenderer.material;
SetResetParameters();
}
public override void CollectObservations()
{
if (useVectorObs)
{
var rayDistance = 20f;
AddVectorObs(m_RayPer.Perceive(rayDistance, m_RayAngles, m_DetectableObjects, 0f, 0f));
AddVectorObs(m_RayPer.Perceive(rayDistance, m_RayAngles, m_DetectableObjects, 1.5f, 0f));
}
}
public Vector3 GetRandomSpawnPos()
{
var foundNewSpawnLocation = false;
var randomSpawnPos = Vector3.zero;
while (foundNewSpawnLocation == false)
{
var randomPosX = Random.Range(-areaBounds.extents.x * m_Academy.spawnAreaMarginMultiplier,
areaBounds.extents.x * m_Academy.spawnAreaMarginMultiplier);
var randomPosZ = Random.Range(-areaBounds.extents.z * m_Academy.spawnAreaMarginMultiplier,
areaBounds.extents.z * m_Academy.spawnAreaMarginMultiplier);
randomSpawnPos = ground.transform.position + new Vector3(randomPosX, 1f, randomPosZ);
if (Physics.CheckBox(randomSpawnPos, new Vector3(2.5f, 0.01f, 2.5f)) == false)
{
foundNewSpawnLocation = true;
}
}
return randomSpawnPos;
}
public void ScoredAGoal()
{
AddReward(5f);
Done();
StartCoroutine(GoalScoredSwapGroundMaterial(m_Academy.goalScoredMaterial, 0.5f));
}
IEnumerator GoalScoredSwapGroundMaterial(Material mat, float time)
{
m_GroundRenderer.material = mat;
yield return new WaitForSeconds(time);
m_GroundRenderer.material = m_GroundMaterial;
}
public void MoveAgent(float[] act)
{
var dirToGo = Vector3.zero;
var rotateDir = Vector3.zero;
var action = Mathf.FloorToInt(act[0]);
switch (action)
{
case 1:
dirToGo = transform.forward * 1f;
break;
case 2:
dirToGo = transform.forward * -1f;
break;
case 3:
rotateDir = transform.up * 1f;
break;
case 4:
rotateDir = transform.up * -1f;
break;
case 5:
dirToGo = transform.right * -0.75f;
break;
case 6:
dirToGo = transform.right * 0.75f;
break;
}
transform.Rotate(rotateDir, Time.fixedDeltaTime * 200f);
m_AgentRb.AddForce(dirToGo * m_Academy.agentRunSpeed,
ForceMode.VelocityChange);
}
public override void AgentAction(float[] vectorAction, string textAction)
{
MoveAgent(vectorAction);
AddReward(-1f / agentParameters.maxStep);
}
public override float[] Heuristic()
{
if (Input.GetKey(KeyCode.D))
{
return new float[] { 3 };
}
if (Input.GetKey(KeyCode.W))
{
return new float[] { 1 };
}
if (Input.GetKey(KeyCode.A))
{
return new float[] { 4 };
}
if (Input.GetKey(KeyCode.S))
{
return new float[] { 2 };
}
return new float[] { 0 };
}
void ResetBlock()
{
block.transform.position = GetRandomSpawnPos();
bloc.transform.position = GetRandomSpawnPos();
m_BlockRb.velocity = Vector3.zero;
m_BlocRb.velocity = Vector3.zero;
m_BlockRb.angularVelocity = Vector3.zero;
m_BlocRb.angularVelocity = Vector3.zero;
}
public override void AgentReset()
{
var rotation = Random.Range(0, 4);
var rotationAngle = rotation * 90f;
area.transform.Rotate(new Vector3(0f, rotationAngle, 0f));
ResetBlock();
transform.position = GetRandomSpawnPos();
m_AgentRb.velocity = Vector3.zero;
m_AgentRb.angularVelocity = Vector3.zero;
SetResetParameters();
}
public void SetGroundMaterialFriction()
{
var resetParams = m_Academy.resetParameters;
var groundCollider = ground.GetComponent<Collider>();
groundCollider.material.dynamicFriction = resetParams["dynamic_friction"];
groundCollider.material.staticFriction = resetParams["static_friction"];
}
public void SetBlockProperties()
{
var resetParams = m_Academy.resetParameters;
m_BlockRb.transform.localScale = new Vector3(resetParams["block_scale"], 0.75f, resetParams["block_scale"]);
m_BlocRb.transform.localScale = new Vector3(resetParams["block_scale"], 0.75f, resetParams["block_scale"]);
m_BlockRb.drag = resetParams["block_drag"];
m_BlocRb.drag = resetParams["block_drag"];
}
public void SetResetParameters()
{
SetGroundMaterialFriction();
SetBlockProperties();
}
}
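The agent script relies on two companion scripts that are not reproduced above. Below are minimal sketches of what they might look like, reconstructed from how the agent uses them; the field names agentRunSpeed, spawnAreaMarginMultiplier and goalScoredMaterial come from the agent code, while everything else (including the default values) is an assumption.
// Sketch of the window-detection script attached to the table-chair objects.
// It notifies the agent when the furniture touches the window, which is tagged "goal".
using UnityEngine;
public class WindowDetect : MonoBehaviour
{
    [HideInInspector]
    public OfficeBasic agent; // assigned in OfficeBasic.InitializeAgent()

    void OnCollisionEnter(Collision col)
    {
        if (col.gameObject.CompareTag("goal"))
        {
            agent.ScoredAGoal();
        }
    }
}
The Academy mostly exposes scene-level settings that the agent reads; a sketch under the same assumptions:
// Sketch of the Academy holding environment-wide parameters.
// resetParameters ("dynamic_friction", "block_scale", "block_drag", ...) are inherited
// from the MLAgents Academy base class and set in the Inspector.
using UnityEngine;
using MLAgents;
public class OfficeAcademy : Academy
{
    public float agentRunSpeed = 2f;               // force multiplier used in MoveAgent()
    public float spawnAreaMarginMultiplier = 0.5f; // keeps random spawns away from the walls
    public Material goalScoredMaterial;            // ground flashes this material on success
}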
If you like my work and this website, or would like to discuss any project further, feel free to connect with me. Thank you so much for your time.