Bringing Geometric Foundation Models to SLAM: VGGT-SLAM and SL(4) Factor Graph Optimization
GTSAM Posts
Author: Dominic Maggio
In the past couple of years, a new type of foundation model called the Geometric Foundation Model (GFM) has generated a lot of excitement for 3D scene reconstruction, starting with the initial works DUSt3R and MASt3R. GFMs take in uncalibrated monocular RGB images and output a dense 3D point cloud and camera poses. One of the most popular models, VGGT, won best paper at CVPR 2025, and the follow-up work VGGT-Omega was a best paper finalist at CVPR 2026. Their simplicity and ability to create dense reconstructions without depending on known camera calibration or stereo rigs raise the question of how best to leverage them for a robotic SLAM system. In this post, we’ll discuss how VGGT-SLAM (and its extension VGGT-SLAM 2.0) does just that.
Using VGGT for a SLAM System
The first challenge in bringing GFMs to SLAM is that robots may need to process many thousands of images, but GPU memory bounds how many frames VGGT can process at once (around 60 on a 3090 GPU with 24 GB of memory). The second challenge is that most practical use cases require incremental mapping as the robot explores a scene – not just one large batch processing of images for the entire scene.
Both challenges can be solved by a simple idea: create smaller submaps with VGGT as the robot explores a scene and chain these submaps together to create a global map. Sounds easy enough; now we just need to pick which transformation to use to align the submaps. Classical SLAM logic tells us $\text{Sim(3)}$ should do it: each VGGT submap is defined in its own local frame, so an $\text{SE(3)}$ alignment is needed, and since VGGT doesn’t estimate metric scale, we need the extra scale DoF of $\text{Sim(3)}$ to align the submaps. Unfortunately, this $\text{Sim(3)}$ alignment isn’t always enough. As an example, chaining two submaps together in this apartment scene with a $\text{Sim(3)}$ transformation shows poor alignment.
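For reference, here is a minimal sketch of what a $\text{Sim(3)}$ alignment between two submaps could look like, assuming we already have corresponding 3D points in both submaps (for example, points seen in an overlapping frame). It uses the standard closed-form Umeyama solution in NumPy; the function name `umeyama_sim3` is ours, and this is not necessarily the solver used in VGGT-SLAM.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Least-squares Sim(3) fit so that dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Returns scale s (float), rotation R (3x3), translation t (3,).
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt

    # Scale from the ratio of covariance "energy" to source variance.
    var_src = (src_c ** 2).sum() / len(src)
    s = (D * np.diag(S)).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

This recovers the 7 DoF (3 rotation, 3 translation, 1 scale) exactly when the two point sets really differ by a similarity. If they differ by something more general, a residual remains no matter how good the fit is.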
The need for a higher DoF transformation
The missing piece is that VGGT does not know the camera calibration. While it tries to estimate the calibration, the uncertainty in that estimate leaves the submaps with a higher DoF ambiguity. To understand what transformation we need for this state-of-the-art foundation model, we find the answer buried deep in what’s considered the bible of classical computer vision – Multiple View Geometry in Computer Vision by Hartley and Zisserman. This is kind of amusing, since during their CVPR talk the VGGT authors had a slide saying “You don’t have to be Zisserman” to do 3D reconstruction anymore. Anyway, chapter 10 (second edition) presents “The Projective Reconstruction Theorem”, which in summary states that if you have a set of uncalibrated images and solve for a 3D reconstruction given known pixel correspondences, the reconstruction has a 15 DoF projective ambiguity relative to the true scene. Given extra information, such as vanishing points that identify the plane at infinity, this can be reduced to a 12 DoF affine ambiguity.
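To see where these DoF counts come from, it helps to write the ambiguity out. The notation here is ours, not the post's: $P_i$ is the $3 \times 4$ projection matrix of camera $i$, $X_j$ is a homogeneous 3D point, $x_{ij}$ is its pixel, and $H$ is any invertible $4 \times 4$ matrix.

$$
x_{ij} \simeq P_i X_j = \left(P_i H^{-1}\right)\left(H X_j\right)
$$

The images alone cannot distinguish the reconstruction $(\{P_i\}, \{X_j\})$ from $(\{P_i H^{-1}\}, \{H X_j\})$. A general $H$ has 16 entries but is only defined up to scale, which gives 15 DoF. Constraining its last row shrinks the family:

$$
H_{\text{proj}} = \begin{bmatrix} A & t \\ v^\top & w \end{bmatrix} \;(15\ \text{DoF}), \qquad
H_{\text{aff}} = \begin{bmatrix} A & t \\ 0^\top & 1 \end{bmatrix} \;(12\ \text{DoF}), \qquad
H_{\text{sim}} = \begin{bmatrix} sR & t \\ 0^\top & 1 \end{bmatrix} \;(7\ \text{DoF})
$$

Here $A$ is a general $3 \times 3$ matrix, $R$ is a rotation, and $s$ is a scale. $\text{Sim(3)}$ is the bottom of this hierarchy, so it cannot absorb the extra 8 DoF a projective ambiguity may introduce.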
This 15 DoF transformation is a $4 \times 4$ homography matrix (the lesser-known 3D version of the common $3 \times 3$ homography used in 2D vision tasks like image stitching). Now that we know to use a 15 DoF projective transformation, we can solve for the homography between submaps. Each 3D point correspondence provides three constraints, so five correspondences are the minimum needed, and the homography can be estimated with a 5-point RANSAC solver.
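As a rough illustration of the minimal solver inside such a RANSAC loop, here is a direct linear transform (DLT) sketch in NumPy. It assumes corresponding 3D points between two submaps are already available; the function names are ours, and the actual VGGT-SLAM solver may differ in its details (normalization, inlier scoring, refinement).

```python
import numpy as np


def estimate_homography_3d(src, dst):
    """DLT estimate of a 4x4 H with dst ~ H @ src in homogeneous coordinates.

    src, dst: (N, 3) arrays of corresponding 3D points, N >= 5.
    Returns H up to an arbitrary (possibly negative) scale.
    """
    n = len(src)
    src_h = np.hstack([src, np.ones((n, 1))])
    A = np.zeros((3 * n, 16))
    for i in range(n):
        X = src_h[i]
        for k in range(3):
            # Row k of H dotted with X, minus dst[i, k] times row 4 dotted with X, is 0.
            A[3 * i + k, 4 * k:4 * k + 4] = X
            A[3 * i + k, 12:16] = -dst[i, k] * X
    # The right singular vector with the smallest singular value spans the solution.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(4, 4)


def apply_homography(H, pts):
    """Map (N, 3) points through a 4x4 homography and dehomogenize."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :3] / out[:, 3:4]
```

With exactly five points the system has 15 equations for 16 unknowns, so the solution is a one-dimensional null space, which matches the "defined up to scale" property of a homography. A RANSAC wrapper would call `estimate_homography_3d` on random 5-point subsets and keep the hypothesis with the most inliers under `apply_homography`.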
Looking at the apartment example from before, we get a much cleaner submap alignment using the homography matrix.
Integration with GTSAM
Now for the really cool part. We can normalize the $4 \times 4$ homography matrix to have determinant 1, which maps it to a unique matrix on the Special Linear Group, $\text{SL(4)}$, manifold. $\text{SL(4)}$ is the group of $4 \times 4$ matrices with determinant 1. This lets us chain submaps together (along with loop closure constraints) and use GTSAM to create a factor graph optimized on the $\text{SL(4)}$ manifold. Support for $\text{SL(4)}$ factors was added to GTSAM in PR #2207 and has an interface similar to that of $\text{SE(3)}$ factors.
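The normalization step itself is short. The sketch below uses plain NumPy rather than GTSAM's $\text{SL(4)}$ types (whose API is not shown in this post), and the helper names are ours. It assumes the estimated homography has a positive determinant: since $\det(cH) = c^4 \det(H)$, a real rescaling cannot change the sign of the determinant, so we take the positive fourth root.

```python
import numpy as np


def to_sl4(H):
    """Rescale a 4x4 homography so that its determinant is 1."""
    H = np.asarray(H, dtype=float)
    det = np.linalg.det(H)
    if det <= 0.0:
        # Assumption of this sketch: only positive-determinant inputs are handled.
        raise ValueError("expected a homography with positive determinant")
    return H / det ** 0.25


def apply_homography(H, pts):
    """Map (N, 3) points through a 4x4 homography and dehomogenize."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :3] / out[:, 3:4]


# Chaining two relative submap transforms is a matrix product,
# and the product of two determinant-1 matrices again has determinant 1.
H_01 = to_sl4(np.array([[2.0, 0.0, 0.0, 1.0],
                        [0.0, 2.0, 0.0, 0.0],
                        [0.0, 0.0, 2.0, 0.0],
                        [0.1, 0.0, 0.0, 2.0]]))
H_12 = to_sl4(np.diag([1.0, 2.0, 3.0, 4.0]))
H_02 = H_01 @ H_12
```

Rescaling does not change how the homography acts on points, because the scale cancels in the dehomogenization. That is what makes the determinant-1 representative a convenient choice for composing and optimizing submap transforms.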
VGGT-SLAM 2.0
One downside to solving 15 DoF transformations for submap alignment is that high-dimensional drift can quickly build up without loop closures. Additionally, a 5-point solver for the homography matrix requires that the 5 points not all be co-planar, which can cause degeneracy when submaps view flat floors or a single wall. To get around this, VGGT-SLAM 2.0 maintains an $\text{SL(4)}$ factor graph but recognizes that the VGGT submap alignment problem can be constrained to a subset of variables. For example, consecutive submaps are created so that they share a common keyframe: the first keyframe of submap $n$ is the same keyframe as the last one of submap $n-1$. This means that the translation and rotation of that shared keyframe must be identical in the world frame, whichever submap it is expressed in. Likewise, while the true calibration is unknown, we know that the shared frames must come from a camera with the same calibration, allowing us to further reduce the variables when estimating projective alignments.
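Two of the ideas above are easy to sketch in code: building submaps that share a keyframe, and detecting the co-planar case that trips up a 5-point solver. Both helpers below are hypothetical illustrations of ours (names, the `rel_tol` threshold, and the fixed submap size are assumptions), not code from VGGT-SLAM 2.0, which sidesteps the degeneracy by constraining the alignment rather than by checking for it.

```python
import numpy as np


def split_into_submaps(keyframes, submap_size):
    """Split a keyframe sequence into submaps that overlap by one keyframe.

    The first keyframe of each submap is the last keyframe of the previous one.
    Expects at least two keyframes and submap_size >= 2.
    """
    if submap_size < 2:
        raise ValueError("submap_size must be at least 2")
    submaps = []
    start = 0
    while start < len(keyframes) - 1:
        submaps.append(keyframes[start:start + submap_size])
        start += submap_size - 1  # step back by one so the boundary frame is shared
    return submaps


def is_nearly_coplanar(points, rel_tol=1e-3):
    """Return True if the (N, 3) points lie close to a single plane.

    Compares the smallest singular value of the centered points to the largest;
    rel_tol is an illustrative threshold, not a tuned value.
    """
    centered = points - points.mean(axis=0)
    sv = np.linalg.svd(centered, compute_uv=False)
    return bool(sv[-1] <= rel_tol * sv[0])
```

For instance, `split_into_submaps(list(range(7)), 3)` yields three submaps whose boundary keyframes (2 and 4) each appear twice, and points sampled from a flat floor make `is_nearly_coplanar` return `True`.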
The VGGT-SLAM 2.0 paper also includes multiple additional improvements, such as more reliable loop closures, and a demonstration that the entire SLAM system can run in real time onboard a robot with a Jetson Thor.
Further browsing
- An example of using $\text{SL(4)}$ factors is available in a GTSAM example notebook
- ArXiv: VGGT-SLAM
- ArXiv: VGGT-SLAM 2.0
- ArXiv: FOUND-IT