Spatial and Temporal Hierarchy for Autonomous Navigation: Comparison
Please note this is a comparison between Version 1 by Daria de Tinguy and Version 2 by Rita Xu.

Robust evidence suggests that humans explore their environment using a combination of topological landmarks and coarse-grained path integration. This approach relies on identifiable environmental features (topological landmarks) in tandem with estimations of distance and direction (coarse-grained path integration) to construct cognitive maps of the surroundings. This cognitive map is believed to exhibit a hierarchical structure, allowing efficient planning when solving complex navigation tasks.

Inspired by human behaviour, this paper presents a scalable hierarchical active inference model for autonomous navigation, exploration, and goal-oriented behaviour. The model uses visual observation and motion perception to combine curiosity-driven exploration with goal-oriented behaviour. Motion is planned using different levels of reasoning, i.e., from context to place to motion. This allows for efficient navigation in new spaces and rapid progress toward a target. By incorporating these human navigational strategies and their hierarchical representation of the environment, this model proposes a new solution for autonomous navigation and exploration. The approach is validated through simulations in a mini-grid environment.

  • active inference
  • autonomous navigation
  • spatial hierarchy
  • temporal hierarchy

1. Introduction

The development of autonomous systems that can navigate in their environment is a crucial step towards building intelligent agents that can interact with the real world. Just as animals possess the ability to navigate their surroundings, developing navigation skills in artificial agents has been a topic of great interest in the field of robotics and artificial intelligence [1][2][3]. This has led to the exploration of various approaches, including taking inspiration from animal navigation strategies (e.g., building cognitive maps [4]), as well as state-of-the-art techniques using neural networks [5]. However, despite significant advancements, there are still limitations in both non-neural-network- and neural-network-based navigation approaches [2][3].
In the animal kingdom, cognitive mapping plays a crucial role in navigation. Cognitive maps allow animals to understand the spatial layout of their surroundings [6][7][8], remember key locations, solve ambiguities from context [9], and plan efficient routes [9][10]. By leveraging cognitive mapping strategies, animals can successfully navigate complex environments, adapt to changes, and return to previously visited places.
In the field of robotics, traditional approaches have been explored to develop navigation systems. These approaches often rely on explicit mapping and planning techniques, such as grid-based [11][12] and/or topological maps [13][14], to guide agent movement. While these methods have shown some success, they suffer from limitations in handling complex spatial relationships and dynamic environments, as well as scalability issues as the environment grows larger [2][3][15].
To overcome the limitations of these non-neural-network approaches, recent advancements have focused on utilising neural networks for navigation [5][16][17][18]. Neural-network-based models, trained on large datasets, have shown promise in learning navigational policies directly from raw sensory input. These models can capture complex spatial relationships and make decisions based on learned representations. However, current neural-network-based navigation approaches also face challenges, including the need for extensive training data, limited generalisation to unseen environments, difficulty distinguishing aliased areas, and difficulty handling dynamic and changing environments [2].
Active inference is a framework allowing agents to actively gather information through perception, select and execute actions in their environment, and learn from accumulated experiences [19][20]. World models, within this framework, form internal representations of the world. Agents endowed with a world model and engaged in active exploration continually update their internal understanding of the environment, empowering them to make well-informed decisions and predictions [21][22]. This principled approach enables continuous belief updates and active information gathering, facilitating effective navigation [20].
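The "continuous belief updates" mentioned above can be illustrated with a minimal sketch of a discrete Bayesian update over candidate places. This covers only the perception step; it is not the model described in this entry, and it omits action selection entirely. The three-place world, the two observation types, and all probabilities are hypothetical values chosen for illustration.

```python
import numpy as np


def update_belief(prior, likelihood, observation):
    """Bayes' rule over discrete places: posterior(s) is proportional to P(o | s) * prior(s)."""
    unnormalised = likelihood[observation] * prior
    return unnormalised / unnormalised.sum()


def entropy(p):
    """Shannon entropy in nats; lower entropy means a more certain belief."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


# Hypothetical toy world: 3 places, 2 observation types (0 = "door", 1 = "wall").
# likelihood[o, s] = P(observation o | place s); the numbers are illustrative only.
likelihood = np.array([
    [0.9, 0.5, 0.1],
    [0.1, 0.5, 0.9],
])

prior = np.full(3, 1.0 / 3.0)  # the agent starts with no idea where it is
posterior = update_belief(prior, likelihood, observation=0)  # it sees a "door"

print("posterior:", posterior)
print("entropy before/after:", entropy(prior), entropy(posterior))
```

Seeing a "door" shifts probability mass towards the place where doors are most likely and lowers the entropy of the belief. In an active inference agent, actions would additionally be chosen so that future observations are expected to be informative or to match preferred outcomes.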
Noting that biological agents build hierarchically structured models, researchers construct multi-level world models in the form of hierarchical active inference. Hierarchical active inference allows agents to use several layers of world models, providing a higher level of spatial abstraction and temporal coarse-graining. It enables learning complex relationships in the environment and allows more efficient decision-making and more robust navigation [23]. By incorporating hierarchical structures into active-inference-based navigation systems, agents can effectively handle complex environments and perform tasks with greater adaptability [24].

2. Spatial and Temporal Hierarchy for Autonomous Navigation

Navigating complex environments is a fundamental challenge for both humans and artificial agents. To solve navigation, traditional approaches often address simultaneous localisation and mapping (SLAM) by building a metric (grid) map [11][12] and/or a topological map of the environment [13][14]. Although there is progress in this area, Placed et al. [3] state that active SLAM may still fail to be fully autonomous in complex environments. Current approaches also still lack distinct capabilities important for navigation, such as predicting the uncertainty over robot location, abstracting over features of the environment (e.g., having a semantic map instead of a precise 3D map), and reasoning in dynamic, changing spaces. Recent studies have explored the adoption of machine learning techniques to add autonomy and adaptive skills in order to learn how to handle new scenarios in real-world situations. Reinforcement learning (RL) typically relies on rewards to stimulate agents to navigate and explore. In contrast, the proposed model breaks away from this convention, as it does not require the explicit definition of a reward during agent training. Moreover, despite the success of recent machine learning, these techniques typically require a considerable amount of training data to build accurate environment models. This training data can be obtained from simulation [25][26]; provided by humans (either by labelling, as in [27][28], or by demonstration, as in [29]); or gathered in an experimental setting [16][30][31]. These methods all aim to predict the consequences of actions in the environment but typically generalise poorly across environments. As such, they require considerable human intervention when deployed in new settings [2].
The aim is to reduce both the human intervention and the quantity of data required for training by simultaneously familiarising the agent with the structure and dynamics found in its environment. When designing an autonomous adaptable system, nature is a source of inspiration. Tolman’s cognitive map theory [32] proposes that brains build a unified representation of the spatial environment to support memory and guide future actions. More recent studies postulate that humans create mental representations of spatial layouts to navigate [6], integrating routes and landmarks into cognitive maps [7]. Additionally, research into neural mechanisms suggests that spatial memory is constructed in map-like representations fragmented into sub-maps with local reference frames [33]; meanwhile, hierarchical planning is processed in the human brain during navigation tasks [9]. The studies of Balaguer et al. [9] and Tomov et al. [10] show that hierarchical representations are essential for efficient planning when solving navigation tasks. Hierarchies provide a structured approach for agents to learn complex environments, breaking down planning into manageable levels of abstraction and enhancing navigation capabilities, both spatially (sub-maps) and temporally (time-scales). Thus, the proposed model incorporates these elements as the foundation of its operation. The concept of hierarchical models has gained interest in navigation research [13][24]. Hierarchical structures enable agents to learn complex relationships within the environment, leading to more efficient decision-making and enhancing adaptability in dynamic scenarios. There are two main types of hierarchy, both considered in this work: temporal (planning over a sequence of timesteps [34][35][36][37]) and spatial (planning over structures [13][23][38][39]).
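As a rough illustration of how a spatial hierarchy breaks planning into levels, the following sketch plans first over a coarse graph of rooms and then over the cells inside each room. It uses plain breadth-first search, not active inference, and it is not the model described in this entry. The three-room layout, the cell numbering, the `doors` table, and the function names are all hypothetical.

```python
from collections import deque


def bfs_path(neighbours, start, goal):
    """Shortest path in an unweighted graph given as {node: [neighbour, ...]}."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None  # goal is unreachable


# Hypothetical layout. High level: rooms A - B - C connected in a chain.
room_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
# Low level: each room is a corridor of three cells, 0 - 1 - 2.
room_cells = {room: {0: [1], 1: [0, 2], 2: [1]} for room in room_graph}
# doors[(r1, r2)] = (exit cell in r1, entry cell in r2)
doors = {
    ("A", "B"): (2, 0), ("B", "A"): (0, 2),
    ("B", "C"): (2, 0), ("C", "B"): (0, 2),
}


def hierarchical_plan(start, goal):
    """Plan over rooms first, then fill in the cell-level steps room by room."""
    (start_room, cell), (goal_room, goal_cell) = start, goal
    room_path = bfs_path(room_graph, start_room, goal_room)
    if room_path is None:
        return None
    plan = [(start_room, cell)]
    for room, next_room in zip(room_path, room_path[1:]):
        exit_cell, entry_cell = doors[(room, next_room)]
        # Walk to the door inside the current room, then step through it.
        plan += [(room, c) for c in bfs_path(room_cells[room], cell, exit_cell)[1:]]
        plan.append((next_room, entry_cell))
        cell = entry_cell
    # Final leg inside the goal room.
    plan += [(goal_room, c) for c in bfs_path(room_cells[goal_room], cell, goal_cell)[1:]]
    return plan


print(hierarchical_plan(("A", 1), ("C", 1)))
```

The high-level search only ever considers three rooms, and each low-level search only considers the three cells of one room, so neither level needs to reason over the full set of cells at once. A temporal hierarchy follows the same pattern: one high-level step (a room transition) spans several low-level steps (cell moves).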
In order to navigate without teaching the agent how to do so, researchers use the principled approach of active inference (AIF), a framework combining perception, action, and learning. It is a promising avenue for autonomous navigation [22]. By actively exploring the environment and formulating beliefs, agents can make informed decisions. Within this framework, world models play a pivotal role in creating internal representations of the environment and facilitating decision-making processes. A few models have proposed combining AIF and hierarchical models for navigation. Safron et al. [40] propose a hierarchical model composed of two layers of complexity to learn the structure of the environment. The lowest level infers the state at each step, while the higher level represents locations, created in a coarser manner. Large, complex, aliased, and/or dynamic environments are challenges to this model. Nozari et al. [41] construct a hierarchical system by using a dynamic Bayesian network (DBN) over a naive and an expert agent, in which the naive agent learns temporal relationships, with the highest level capturing semantic information about the environment and low-level distributions capturing rough sensory information with their respective evolution through time. This system, however, requires expert data to be trained by imitation learning, which limits the performance of the model to that of the expert.