SuperPoint: Self-Supervised Interest Point Detection and Description

Saraswathi Mamidala
5 min read · Dec 29, 2021


Hey readers! You might have heard about keypoints in the area of computer vision.

There are two types of keypoints in common use in computer vision.

Semantic keypoints are points of interest with semantic meaning for objects in an image, such as the left eye corner of a face, the right shoulder of a person, or the front-left tire hub of a car.

Interest points are more low-level points that may not have clear semantic meaning, such as a corner point or ending point of a line segment.

Because interest points are semantically ill-defined, a human annotator cannot reliably and repeatedly identify the same set of interest points.

It is therefore impossible to formulate the task of interest point detection as a standard supervised learning problem.

Applications of SuperPoint

  • Feature Detection
  • Gesture Recognition
  • Object Tracking
  • Outlier Detection
  • Fingerprint Recognition
  • Robotics and Augmented Reality

Let’s learn a little about interest points.

Interest points are 2D locations in an image which are stable and repeatable across different lighting conditions and viewpoints.

Instead of using human supervision to define interest points in real images, SuperPoint presents a self-supervised solution using self-training.

This is done by creating a large dataset of pseudo-ground-truth interest point locations in real images using a base detector called MagicPoint.

Training of SuperPoint:

Self-supervised learning is a form of unsupervised learning, as it does not require explicit human annotation.

SuperPoint training includes multiple steps:

  1. Interest Point Pre-Training
  2. Interest Point Self-Labeling
  3. Joint Training

Let’s look at these steps in detail.

Interest Point Pre-Training

We first generate a synthetic dataset of simple geometric shapes (cubes, lines, stars and checkerboards) using simple Python code.

During this data generation we create the geometric shape images along with their labels. Using this dataset, we train the base detector, called MagicPoint.

To generate the pseudo-ground truth interest points, we first train a fully-convolutional neural network on millions of examples from a synthetic dataset we created called Synthetic Shapes.
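The flavor of the Synthetic Shapes generator can be sketched in a few lines. The example below is a hypothetical, stripped-down version: it renders only an axis-aligned rectangle outline with NumPy and records its four corners as the interest point labels, whereas the actual generator draws cubes, lines, stars, checkerboards and more.

```python
import numpy as np

def synthetic_rectangle(h=120, w=160, rng=None):
    """Render one synthetic training example: a rectangle outline plus
    the pixel coordinates of its four corners (the labels)."""
    if rng is None:
        rng = np.random.default_rng()
    img = np.zeros((h, w), dtype=np.float32)
    # Sample a random axis-aligned rectangle well inside the image.
    x0, y0 = rng.integers(10, w // 2), rng.integers(10, h // 2)
    x1, y1 = rng.integers(x0 + 20, w - 10), rng.integers(y0 + 20, h - 10)
    # Draw the four 1-pixel-wide white edges.
    img[y0, x0:x1 + 1] = 1.0
    img[y1, x0:x1 + 1] = 1.0
    img[y0:y1 + 1, x0] = 1.0
    img[y0:y1 + 1, x1] = 1.0
    # The four corners are the ground-truth interest points, as (x, y).
    corners = np.array([[x0, y0], [x1, y0], [x0, y1], [x1, y1]])
    return img, corners

img, corners = synthetic_rectangle(rng=np.random.default_rng(0))
```

Because the labels come for free from the rendering code, this kind of data can be generated in effectively unlimited quantities.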

MagicPoint performs well on Synthetic Shapes but does not generalize well to real images: compared with classical interest point detectors on a diverse set of image textures and patterns, it misses many potential interest point locations.

To bridge this gap in performance on real images, a multi-scale, multi-transform technique called Homographic Adaptation was developed.

Homographic Adaptation is used in conjunction with the MagicPoint detector to boost the performance of the detector and generate the pseudo-ground truth interest points.

Homographic Adaptation is designed to enable self-supervised training of interest point detectors.

In this process we warp the input image multiple times using random homographies and apply the MagicPoint detector to obtain interest points on each warped image.

Once we have the interest points on a warped image, we unwarp them back to the original image frame; in this way we collect points for the original image across different viewpoints and scales.
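The warp-detect-unwarp loop can be sketched as follows. This is a minimal NumPy illustration of the bookkeeping, not the paper's implementation: `detector` and `warp` are placeholder callables standing in for MagicPoint and an image-warping routine (e.g. OpenCV's `cv2.warpPerspective`).

```python
import numpy as np

def random_homography(rng, scale=0.1):
    """Sample a homography as a small random perturbation of the identity."""
    return np.eye(3) + scale * rng.standard_normal((3, 3))

def apply_h(H, pts):
    """Apply a 3x3 homography H to an (N, 2) array of (x, y) points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]                   # back to Euclidean

def homographic_adaptation(image, detector, warp, n_warps=10, rng=None):
    """Aggregate detections across random warps of the input image.
    `detector` maps an image to (N, 2) point locations (MagicPoint's role);
    `warp` warps an image by a homography."""
    if rng is None:
        rng = np.random.default_rng()
    all_pts = [detector(image)]                         # pass on the original
    for _ in range(n_warps):
        H = random_homography(rng)
        pts = detector(warp(image, H))                  # detect on warped image
        all_pts.append(apply_h(np.linalg.inv(H), pts))  # unwarp points back
    return np.vstack(all_pts)
```

Aggregating the unwarped detections over many homographies yields a richer, more repeatable set of points than a single pass of the detector.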

Interest Point Self-Labeling

Once we have the trained MagicPoint model, we use it to generate pseudo-ground-truth interest points via the Homographic Adaptation process discussed earlier. In the paper, this self-labeling is performed on the MS-COCO 2014 dataset.

The generated labels are then used to retrain MagicPoint, and this label-and-retrain cycle can be repeated multiple times.

SuperPoint Architecture

The SuperPoint architecture uses a shared VGG-style encoder to reduce the dimensionality of the image.

The encoder consists of convolutional layers, spatial downsampling via pooling, and non-linear activation functions.

Encoder
The encoder has eight 3x3 convolution layers sized 64-64-64-64-128-128-128-128, with a 2x2 max-pool layer after each of the first three pairs of conv layers, so the output resolution is H/8 x W/8.
All convolution layers in the network are followed by ReLU non-linear activation and BatchNorm normalization.
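As a sanity check on these dimensions, the sketch below traces the spatial shape through the encoder, assuming 3x3 "same"-padded convolutions (which preserve H and W) and a 2x2 max-pool after each of the first three pairs of conv layers:

```python
def encoder_output_shape(h, w):
    """Trace (H, W, C) through the SuperPoint encoder: eight 3x3
    'same'-padded conv layers (64-64-64-64-128-128-128-128) with a
    2x2 max-pool after each of the first three pairs of convs."""
    channels = [64, 64, 64, 64, 128, 128, 128, 128]
    c = 1  # grayscale input
    for i, c_out in enumerate(channels):
        c = c_out                  # 3x3 conv, padding 1: H, W unchanged
        if i in (1, 3, 5):         # pool after conv pairs 1, 2, 3
            h, w = h // 2, w // 2  # 2x2 max-pool halves each dimension
    return h, w, c

print(encoder_output_shape(240, 320))  # (30, 40, 128)
```

For the 240x320 inputs used in the paper, the encoder therefore emits a 30x40x128 feature map (Hc = H/8, Wc = W/8) that both decoder heads consume.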

Interest Point Decoder
For interest point detection, each pixel of the output corresponds to a probability of “point-ness” for that pixel in the input. The detector head emits a tensor of size Hc x Wc x 65, where the 65 channels correspond to the 64 pixels of an 8x8 region plus an extra “no interest point” dustbin; after a channel-wise softmax, the dustbin is dropped and the result is reshaped to H x W.

This decoder has no parameters; the reshape is known as “sub-pixel convolution” or “depth to space” in TensorFlow.
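This parameter-free decoding can be written out directly in NumPy. The sketch below takes the raw Hc x Wc x 65 head output (64 cells of an 8x8 region plus a "no interest point" dustbin channel), applies a channel-wise softmax, drops the dustbin, and performs the depth-to-space reshape:

```python
import numpy as np

def detector_decode(x):
    """x: (Hc, Wc, 65) raw detector-head output.
    Returns an (8*Hc, 8*Wc) heatmap of interest point probabilities."""
    # Channel-wise softmax (channel 65 is the 'no interest point' dustbin).
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    probs = probs[..., :64]                   # drop the dustbin channel
    hc, wc, _ = probs.shape
    # "Depth to space": each 64-vector becomes the 8x8 cell it encodes.
    heatmap = probs.reshape(hc, wc, 8, 8)     # split depth into an 8x8 cell
    heatmap = heatmap.transpose(0, 2, 1, 3)   # interleave cell rows and cols
    return heatmap.reshape(hc * 8, wc * 8)

heat = detector_decode(np.random.default_rng(0).standard_normal((30, 40, 65)))
print(heat.shape)  # (240, 320)
```

In practice the full-resolution heatmap is then thresholded and non-maximum suppression is applied to obtain discrete interest point locations.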

Descriptor Decoder

The descriptor head computes a semi-dense descriptor tensor of size Hc x Wc x D (D = 256 in the paper), and the decoder outputs a dense tensor of size H x W x D.

The decoder bilinearly upsamples the semi-dense descriptors (tf.image.resize_bilinear) and then L2-normalizes the activations (tf.nn.l2_normalize) to unit length.
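The two TensorFlow ops can be mimicked in plain NumPy. The sketch below is illustrative only: it assumes an align-corners bilinear convention and uses a toy descriptor size instead of the real D = 256.

```python
import numpy as np

def bilinear_upsample(desc, scale=8):
    """Bilinearly upsample an (Hc, Wc, D) map by `scale`: a NumPy
    stand-in for tf.image.resize_bilinear (align-corners convention)."""
    hc, wc, _ = desc.shape
    ys = np.linspace(0, hc - 1, hc * scale)  # target rows in source coords
    xs = np.linspace(0, wc - 1, wc * scale)  # target cols in source coords
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, hc - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, wc - 1)
    wy = (ys - y0)[:, None, None]; wx = (xs - x0)[None, :, None]
    # Blend the four neighboring semi-dense descriptors per output pixel.
    top = desc[y0][:, x0] * (1 - wx) + desc[y0][:, x1] * wx
    bot = desc[y1][:, x0] * (1 - wx) + desc[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def l2_normalize(desc, eps=1e-12):
    """Scale each D-dim descriptor to unit length (tf.nn.l2_normalize)."""
    return desc / np.maximum(np.linalg.norm(desc, axis=-1, keepdims=True), eps)

# Tiny semi-dense map for illustration; upsample by 8, then normalize.
dense = l2_normalize(bilinear_upsample(np.random.default_rng(0).standard_normal((6, 8, 16))))
print(dense.shape)  # (48, 64, 16)
```

Keeping the head semi-dense and interpolating afterwards is what keeps the descriptor decoder cheap: the network never has to predict a full H x W x D tensor directly.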

Joint training

Joint training trains the full SuperPoint network on the MS-COCO dataset, using the pseudo-ground-truth labels generated in the previous step.

Joint training is done on pairs of images related by a randomly generated homography H. This training process allows us to optimize the interest point and descriptor losses simultaneously.

Most of the network’s parameters are shared between the two tasks. This differs from traditional systems, which first detect interest points and then compute descriptors, and therefore cannot share computation and representation across the two tasks.

Results
The green lines show correct correspondences. SuperPoint tends to produce denser and more correct matches than LIFT, SIFT and ORB. While ORB has the highest average repeatability, its detections cluster together and generally do not yield more matches or more accurate homography estimates.

Please clap if you find this article helpful, and write in the comments if you want to know about the loss functions in detail :)

Read my other interesting stories

Written by Saraswathi Mamidala

Data scientist at Innominds Software Private Limited, forever a student