The Segment Anything Model (SAM) is a state-of-the-art segmentation model released by Meta in April 2023. Given an input image, it can predict segmentation masks for any object of interest. It was trained on the largest segmentation dataset to date, containing over 1 billion masks on 11 million licensed and privacy-respecting images. The model can split an image into multiple segments, and it can also segment specific objects when guided by multiple bounding boxes and point coordinates. The paper additionally describes prompting with free-form text, but that capability has not been released, so we will not discuss it here.
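
To make this concrete, here is a minimal sketch of prompting a pre-trained SAM with a point and a box using Meta's segment-anything package. The image path, prompt coordinates, and checkpoint location are placeholders you would replace with your own.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained checkpoint (path is an assumption; use wherever you downloaded it).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Read an image and compute its embedding once.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with one foreground point and one bounding box (placeholder coordinates).
point_coords = np.array([[500, 375]])
point_labels = np.array([1])          # 1 = foreground, 0 = background
box = np.array([425, 600, 700, 875])  # XYXY format

masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    box=box,
    multimask_output=True,  # return several candidate masks with quality scores
)
```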

The model is pre-trained and ships with three sets of weights: vit-b, vit-l, and vit-h. We will use vit-b, the lightest of the three, which makes it the fastest and most stable option for our task. We will fine-tune only the mask decoder, so that part of the model is the focus of the Fine Tune of SAM section.
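
One common way to set this up, sketched below under the assumption that only the mask decoder is trained, is to load the vit-b weights and freeze the image encoder and prompt encoder parameters. The checkpoint path and learning rate are illustrative.

```python
import torch
from segment_anything import sam_model_registry

# Load the lightweight ViT-B variant (checkpoint path is an assumption).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Freeze the image encoder and prompt encoder; keep only the mask decoder trainable.
for name, param in sam.named_parameters():
    if name.startswith("image_encoder") or name.startswith("prompt_encoder"):
        param.requires_grad_(False)

trainable = [p for p in sam.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
print(f"Trainable parameters: {sum(p.numel() for p in trainable):,}")
```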

As described in the model paper, SAM consists of three parts: an image encoder, a flexible prompt encoder, and a fast mask decoder; a minimal sketch of how they connect follows the list below.

  • The image encoder is an MAE (Masked Autoencoder) pre-trained Vision Transformer, minimally adapted to process high-resolution inputs.
  • The prompt encoder handles two types of prompts: sparse prompts (points, boxes, and text) and dense prompts (masks).
  • The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. It uses a modified Transformer decoder block followed by a dynamic mask prediction head.
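
The sketch below runs the three components one after another on a dummy input so the data flow is visible. The tensor shapes assume the ViT-B checkpoint and the library's default 1024x1024 input size; the box coordinates and checkpoint path are placeholders.

```python
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.eval()

# 1. Image encoder: a 1024x1024 RGB tensor -> a (1, 256, 64, 64) image embedding.
image = torch.randn(1, 3, 1024, 1024)  # stand-in for a preprocessed image
with torch.no_grad():
    image_embedding = sam.image_encoder(image)

    # 2. Prompt encoder: a box prompt -> sparse and dense prompt embeddings.
    box = torch.tensor([[100.0, 150.0, 400.0, 450.0]])  # XYXY, placeholder values
    sparse_emb, dense_emb = sam.prompt_encoder(points=None, boxes=box, masks=None)

    # 3. Mask decoder: image embedding + prompt embeddings -> low-res mask logits and IoU scores.
    low_res_masks, iou_pred = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_emb,
        dense_prompt_embeddings=dense_emb,
        multimask_output=False,
    )

print(low_res_masks.shape, iou_pred.shape)  # (1, 1, 256, 256) and (1, 1)
```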
