Effective decision-making in complex environments with multi-discrete action spaces poses significant challenges for agent architectures, particularly in image-based settings. While Decision Transformers have shown promise in various domains, their performance often degrades in environments where agents must handle multi-discrete actions. Existing enhancements to Decision Transformer architectures have yet to address this issue, limiting their ability to support agents in learning robust policies in these environments. To address this gap, we propose Multi-State Action Tokenisation (M-SAT), a novel approach designed to improve agent decision-making by tokenising actions at the individual action level and incorporating auxiliary state information. This disentanglement of actions improves both the performance of agents and the interpretability of individual actions within attention layers, giving better visibility into agent decision processes. M-SAT thereby facilitates the development of more interpretable and transparent agents capable of making complex decisions in dynamic environments involving multi-discrete action spaces. We evaluate M-SAT on the challenging ViZDoom environments, focusing on scenarios with multi-discrete action spaces and image-based observations, such as Deadly Corridor, My Way Home and Death Match. Our approach outperforms baseline Decision Transformers without additional data or significant computational overhead. Furthermore, we observe that M-SAT does not require positional encoding to achieve high performance, and that removing it occasionally leads to further improvements. These findings suggest that M-SAT enables more efficient and interpretable agent-based decision-making in multi-discrete action spaces.
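To make the idea of tokenising at the individual action level concrete, the following is a minimal sketch of how the token layout of a trajectory might differ between a baseline Decision Transformer and a per-action scheme. It is an illustration under stated assumptions, not the M-SAT implementation: the function names, the `(kind, timestep, value)` token format, the binary (button-style) action components, and the `repeat_state` option are all hypothetical. In particular, repeating the state token before each action component is only one possible reading of "incorporating auxiliary state information".

```python
from typing import List, Sequence, Tuple

# (kind, timestep, value); value is -1 for tokens that carry no action index.
Token = Tuple[str, int, int]


def joint_action_tokens(actions: Sequence[Sequence[int]]) -> List[Token]:
    """Baseline layout: one action token per timestep.

    The multi-discrete vector is collapsed into a single index.
    Assumption: every component is binary (0 or 1), so a bit encoding works.
    """
    tokens: List[Token] = []
    for t, action in enumerate(actions):
        joint = sum(bit << i for i, bit in enumerate(action))
        tokens += [("rtg", t, -1), ("state", t, -1), ("action", t, joint)]
    return tokens


def per_action_tokens(
    actions: Sequence[Sequence[int]], repeat_state: bool = False
) -> List[Token]:
    """Per-action layout: one token for each component of the action vector.

    If repeat_state is True, an extra state token is placed before every
    component after the first (hypothetical reading of "auxiliary state").
    """
    tokens: List[Token] = []
    for t, action in enumerate(actions):
        tokens.append(("rtg", t, -1))
        tokens.append(("state", t, -1))
        for i, value in enumerate(action):
            if repeat_state and i > 0:
                tokens.append(("state", t, -1))
            tokens.append((f"action_{i}", t, value))
    return tokens


if __name__ == "__main__":
    # Two timesteps, three binary action components each.
    trajectory = [(1, 0, 1), (0, 1, 0)]
    for token in per_action_tokens(trajectory, repeat_state=True):
        print(token)
```

In the per-action layout each component occupies its own sequence position, so an attention map over the sequence can be read per component rather than per joint action. The cost is a longer sequence per timestep.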