Bringing Geometric Foundation Models to SLAM: VGGT-SLAM and SL(4) Factor Graph Optimization


Author: Dominic Maggio

In the past couple of years, a new type of foundation model called the Geometric Foundation Model (GFM) has generated a lot of excitement in 3D scene reconstruction, starting with the initial works DUSt3R and MASt3R. GFMs take in uncalibrated monocular RGB images and output a dense 3D point cloud and camera poses. One of the most popular models, VGGT, won best paper at CVPR 2025, and the follow-up work VGGT-Omega was a best paper finalist at CVPR 2026. Their simplicity and their ability to create dense reconstructions without depending on known camera calibration or stereo rigs raise the question of how best to leverage them in a robotic SLAM system. In this post, we’ll discuss how VGGT-SLAM (and its extension VGGT-SLAM 2.0) does just that.

Using VGGT for a SLAM System

The first challenge in bringing GFMs to SLAM is that robots may need to process many thousands of images, while GPU memory bounds how many frames VGGT can process at once (around 60 on a 3090 GPU with 24 GB of memory). The second challenge is that most practical use cases require incremental mapping as the robot explores a scene, not just one large batch processing of images for the entire scene.

Both challenges can be solved by a simple idea: create smaller submaps with VGGT as the robot explores a scene and chain these submaps together to create a global map. Sounds easy enough; now we just need to pick which transformation to use to align the submaps. Classical SLAM logic tells us $\text{Sim(3)}$ should do it. Each VGGT submap is defined in its own local frame (so an $\text{SE(3)}$ alignment is needed), and since VGGT doesn’t estimate metric scale, we need the extra scale DoF of $\text{Sim(3)}$ to align the submaps. Unfortunately, this $\text{Sim(3)}$ alignment isn’t always enough. As an example, chaining two submaps together in this apartment scene with a $\text{Sim(3)}$ transformation shows poor alignment.

Using a $\text{Sim(3)}$ transformation to align VGGT submaps is not always sufficient. Here, the alignment of two submaps shows substantial discrepancy. Figure adapted from the Maggio et al. VGGT-SLAM paper.
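To make the $\text{Sim(3)}$ baseline concrete, here is a minimal sketch of a closed-form similarity alignment between two sets of corresponding 3D points. It uses the classic Umeyama least-squares solution as a stand-in; the post does not say which $\text{Sim(3)}$ estimator was used for the figure, and the function name `umeyama_sim3` and the assumption that point correspondences between submaps are already available are illustrative only.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points (hypothetical inputs,
    e.g. points seen by both submaps).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0

    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

This recovers only 7 DoF (3 rotation, 3 translation, 1 scale), so any extra distortion between submaps is left in the residual, which is the discrepancy the figure shows.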


The need for a higher DoF transformation

The missing piece is that VGGT does not know the camera calibration. While it tries to estimate the calibration, the error in that estimate leaves the submaps with a higher DoF ambiguity. To understand what transformation we need for this state-of-the-art foundation model, we find the answer buried deep in what’s considered the bible of classical computer vision: Multiple View Geometry in Computer Vision by Hartley and Zisserman. This is kind of amusing, since during their CVPR talk the VGGT authors had a slide saying “You don’t have to be Zisserman” to do 3D reconstruction anymore. Anyway, chapter 10 (second edition) states “The Projective Reconstruction Theorem”, which in summary says that if you have a set of images and solve for a 3D reconstruction given known pixel correspondences, the reconstruction differs from the true scene by a 15 DoF projective ambiguity. Given extra information, such as vanishing points that locate the plane at infinity, this can be reduced to a 12 DoF affine ambiguity.

This 15 DoF transformation is a $4 \times 4$ homography matrix (the lesser known 3D version of the common $3 \times 3$ homography used in 2D vision tasks like image stitching). The matrix has 16 entries but is only defined up to scale, which leaves 15 DoF. Now that we know to use a 15 DoF projective transformation, we can solve for the homography between submaps. Each 3D point correspondence gives 3 constraints, so 5 correspondences are the minimum, and the homography can be estimated with a 5-point RANSAC solver.
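As a rough illustration of the minimal solve inside such a solver, the sketch below estimates a $4 \times 4$ homography from 3D point correspondences with a direct linear transform (DLT). This is a generic textbook construction, not necessarily the solver used in VGGT-SLAM; the names `homography_dlt` and `apply_homography` are hypothetical, and the RANSAC loop (sample 5 correspondences, solve, count inliers, repeat) is left out.

```python
import numpy as np


def apply_homography(H, pts):
    """Apply a 4x4 homography to (N, 3) points and dehomogenize."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :3] / out[:, 3:4]


def homography_dlt(src, dst):
    """Estimate H (up to scale) such that dst ~ H @ src in homogeneous coordinates.

    src, dst: (N, 3) arrays with N >= 5 corresponding points.
    Each correspondence contributes 3 linear equations in the 16 entries of H.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_h = np.hstack([src, np.ones((len(src), 1))])

    rows = []
    for X, y in zip(src_h, dst):
        for k in range(3):
            # Row k of H dotted with X, minus y[k] times row 3 of H dotted with X, is 0.
            row = np.zeros(16)
            row[4 * k:4 * k + 4] = X
            row[12:16] = -y[k] * X
            rows.append(row)

    A = np.array(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(4, 4)
```

With exactly 5 points the system has 15 equations and 16 unknowns, so the null space is one-dimensional only when the points are in a non-degenerate configuration; with more points the same code returns a least-squares fit.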

Looking at the apartment example from before, we get a much cleaner submap alignment using the homography matrix.

A 15 DoF projective transformation provides correct alignment between VGGT submaps. Figure adapted from the Maggio et al. VGGT-SLAM paper.


Integration with GTSAM

Now for the really cool part. We can normalize the $4 \times 4$ homography matrix to have determinant 1, which maps it to a unique matrix on the manifold of the Special Linear Group, $\text{SL(4)}$, the group of $4 \times 4$ matrices with determinant 1. This lets us chain submaps together (along with loop closure constraints) and use GTSAM to create a factor graph optimized on the $\text{SL(4)}$ manifold. Support for $\text{SL(4)}$ factors was added to GTSAM in PR #2207 and has an interface similar to that of $\text{SE(3)}$ factors.

Left: reconstruction of a loop around an office corridor showing each submap as a unique color. There is a loop closure at the end of the trajectory. Right: reconstruction of a large 4200 sq ft barn. Figure adapted from the Maggio and Carlone VGGT-SLAM 2.0 paper.
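The normalization step described above can be sketched in a few lines of NumPy. Dividing $H$ by $\det(H)^{1/4}$ gives determinant 1, and because determinants multiply, chaining two normalized homographies stays in $\text{SL(4)}$. The helper name `to_sl4` is hypothetical, and this sketch deliberately avoids the GTSAM API, whose exact class and factor names are not given in this post. It also assumes $\det(H) > 0$: a real scale factor cannot change the sign of a $4 \times 4$ determinant, so a negative determinant is reported as an error here.

```python
import numpy as np


def to_sl4(H):
    """Rescale a 4x4 homography so that its determinant is 1."""
    H = np.asarray(H, dtype=float)
    det = np.linalg.det(H)
    if det <= 0:
        # Scaling by c multiplies det by c**4 > 0, so the sign cannot be fixed.
        raise ValueError("homography must have positive determinant")
    return H / det ** 0.25


def chain(H_ab, H_bc):
    """Compose two SL(4) elements; the product also has determinant 1."""
    return H_ab @ H_bc
```

For example, `to_sl4(3.0 * H)` and `to_sl4(H)` return the same matrix, which is the point of the normalization: the arbitrary positive scale of the homography no longer matters.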


VGGT-SLAM 2.0

One downside to solving 15 DoF transformations for submap alignment is that high-dimensional drift can quickly build up without loop closures. Additionally, a 5-point solver for the homography matrix requires that the 5 points not all be coplanar, which can cause degeneracy when submaps view flat floors or a single wall. To get around this, VGGT-SLAM 2.0 maintains an $\text{SL(4)}$ factor graph but recognizes that the VGGT submap alignment problem can be constrained to a subset of variables. For example, consecutive submaps are created so that they share a common keyframe: the first keyframe of submap $n$ is the same keyframe as the last one of submap $n-1$. This means the shared keyframe must have the same translation and rotation in the world frame in both submaps. Likewise, while the true calibration is unknown, we know that the shared frames must come from a camera with the same calibration, allowing us to further reduce the variables when estimating projective alignments.
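The shared-keyframe layout is easy to picture with a small sketch. The function below splits an ordered list of keyframes into submaps of bounded size where each submap starts with the last keyframe of the previous one. It is a batch illustration only: the function name `split_into_submaps` and the `max_size` parameter are hypothetical, and the real system builds submaps incrementally as the robot moves.

```python
def split_into_submaps(keyframes, max_size):
    """Split keyframes into submaps that overlap by exactly one keyframe.

    keyframes: ordered list of keyframe identifiers.
    max_size: maximum number of keyframes per submap (at least 2).
    """
    if max_size < 2:
        raise ValueError("max_size must be at least 2 to allow a shared keyframe")
    keyframes = list(keyframes)
    if len(keyframes) <= 1:
        return [keyframes] if keyframes else []

    submaps = []
    start = 0
    while start < len(keyframes) - 1:
        end = min(start + max_size, len(keyframes))
        submaps.append(keyframes[start:end])
        # The next submap begins at the last keyframe of this one.
        start = end - 1
    return submaps
```

With seven keyframes and `max_size=3`, this yields `[[0, 1, 2], [2, 3, 4], [4, 5, 6]]`, where keyframes 2 and 4 are each seen by two submaps and can anchor the alignment between them.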

The VGGT-SLAM 2.0 paper also includes multiple additional improvements, such as more reliable loop closures, along with a demonstration that the entire SLAM system can run in real time onboard a robot with a Jetson Thor.

Further browsing