AI Insights

Unsupervised Learning for Land Cover Classification in Satellite Imagery

March 26, 2022

article featured image

K-Means, Principal Component Analysis, Autoencoders, and Transfer Learning applied for land cover classification in a challenging data scenario.

Location: Kenya, Africa.

The problem background

A swarm of Desert Locusts in Samburu County, Kenya

A swarm of Desert Locusts in Samburu County, Kenya. Source: FAO

Desert locusts, or Schistocerca gregaria, are among the world’s most dangerous pests. Such locusts can easily multiply and form large swarms, which can fly rapidly and feed on any kind of green vegetation. Therefore, they can cause widespread damage to crops and pasture. Being highly migratory, locust outbreaks can threaten the food security of millions of people. In 2020, a deadly locust invasion ravaged regions across East Africa, the Arabian peninsula up to the Indian subcontinent. This attack was the worst locust invasion that Kenya had experienced in 70 years, with hundreds of millions of swarms raging into the country from Somalia and Ethiopia.

This AI challenge, hosted by Omdena in collaboration with the Kenya Red Cross Society, sought to assess the impact caused by the 2020 desert locust attack in Kenya. Multispectral satellite imagery of affected counties in Kenya was analyzed to assess the damage suffered by various land-cover types. The starting point of the challenge was, therefore, creating ML models to perform land-cover classification for Kenya. However, one of the biggest challenges with satellite imagery modeling is the lack of suitable, labeled datasets. Thus, in the initial stages of the project, we focused on developing unsupervised ML models.

In this article, we present the unsupervised approaches that were explored.

Unsupervised learning

Unsupervised learning refers to the training of machine learning algorithms on input data without labels, thereby giving the algorithm room to find hidden patterns and important features. Cluster analysis is arguably one of the most commonly used unsupervised methods and its popularity among practitioners can be attributed to its ability to group data points based on similarity or difference of features. The dataset used for this work consisted of 259 Sentinel-2 satellite images with dimensions 154×154, covering all the 16 affected counties in Kenya.

Sentinel-2 Satellite Images for affected regions in Kenya

Sentinel-2 Satellite Images for affected regions in Kenya. Source: Omdena

The quality of clusters that are obtained from using different algorithms depends on the ability of the algorithm to identify all/ most hidden patterns in the data. Using feature reduction and feature extraction techniques, the model is able to create clusters using only the most important features in the data. In this article, we will focus on the implementation of K-Means, which was the primary clustering algorithm, along with multiple variants of feature extraction techniques including Principal Component Analysis for pixel-level clustering along with Autoencoders and Transfer learning(VGG16) for image-level clustering. The article also covers the implementation of Semantic Clustering by Adopting Nearest Neighbors (SCAN) as an alternative clustering approach.

Unsupervised Learning pipeline

Unsupervised Learning pipeline


Clustering can be seen as a means of Exploratory Data Analysis (EDA), to discover hidden patterns or structures in data. The similarity of data is established with a distance measure such as Euclidean, Manhattan distance, Spearman correlation, Cosine similarity, Pearson correlation, etc. The choice of this metric is critical as it defines the similarity between data points, and therefore the shape of clusters. For most standard clustering techniques, the Euclidean distance is chosen by default. Depending on the data, other dissimilarity measures may be used, for eg. correlation-based distances are widely used in gene expression analyses.


The K-means clustering algorithm is based on Euclidean distances and is a centroid-based model, i.e the value of K (centroids) must be known beforehand. The data points are then iteratively grouped according to their proximity to the centroids. It is also a hard clustering technique, implying that a given data point can belong only to 1 cluster.

The steps involved are:

  • Specify the number of clusters as k
  • Random initialization of centroids & allocation of data points to the nearest centroid
  • Compute centroids by averaging data points within a cluster
  • Re-assign data points to their closest centroid
  • Re-compute centroids

Steps 4 and 5 are continued iteratively till the algorithm converges and no further modification is possible.

Convergence of K-means algorithm

Convergence of K-means algorithm. Source: Wikipedia

Principal Component Analysis

Principal Component Analysis is one of the most commonly used dimensionality reduction techniques. It condenses information from a large set of variables into fewer variables by applying transformations in such a way that the linearly correlated variables get transformed into uncorrelated variables. Correlation between variables tells that there is a redundancy of information and removing it will lead to compression of data.

Sentinel-2 image crop for Tharaka Nithi county

Sentinel-2 image crop for Tharaka Nithi county. Source: Omdena

Multispectral imagery uses multiple bands/wavelengths of light to capture different features, but these spectral images exhibit a high correlation among themselves. Thus, PCA seems to be a good approach here.

Steps involved in PCA:

  • Standardization of the data
  • Computing the covariance matrix
  • Calculating the eigenvectors and eigenvalues
  • Computing the Principal Components
  • Reducing the dimensions of the data set
6 principal components identified in the Sentinel-2 image

6 principal components identified in the Sentinel-2 image. Source: Omdena

After performing PCA on sentinel-2 data (with 6 bands), 6 principal components were identified, out of which 3 top principal components together retained 96% of the total information.

Information retention graph

Information retention graph. Source: Omdena

Thus, this approach seems to be successful in terms of compressibility and information retention, but it involves a lot of computations. Computing the covariance matrix, eigenvectors and eigenvalues are very expensive, especially for multispectral satellite images.

After PCA, k-means clustering along with the elbow method was applied on the resultant image to assign a particular cluster to each of the pixels in the image.

Resultant image after PCA and K-means

Resultant image after PCA and K-means. It has 4 shades- red, yellow, green, and blue with each shade denoting a particular cluster. Thus, each pixel has a particular cluster assigned to it. Source: Omdena

Limitation of this method

A total of 259 images from Sentinel-2 were acquired to cover the whole of Kenya, where each image represented an area of 2km around an attack point on the map. Applying k-means individually to each image was challenging as the clusters formed with principal components of one image, differed from the clusters in another image. Ideally, each cluster would be a representation of a land cover type, hence it was difficult to have the clusters across images be mapped correctly to the same land cover type class. One possibility we had tried was to apply PCA (with k-means) on a single large image corresponding to the whole of Kenya, using Google Colab. However, this failed due to the extensive computations involved. Therefore, we moved next to exploring image-level clustering methods.

Image clustering methods

In this section, we present the various image clustering methods that were attempted by the team. Starting with 259 Sentinel-2 images covering the whole of Kenya, we wanted to group them into k different clusters, each cluster being representative of a land-cover type.

Kenya image database (left), images after clustering labels by SCAN algorithm (right)

Kenya image database (left), images after clustering labels by SCAN algorithm (right). Source: Omdena

Today, deep convolutional neural networks form the backbone of supervised learning in major computer vision tasks. However, unsupervised learning using deep learning has also started to be promising to group real-world data due to their high representational power. By applying deep neural networks, non-linear mappings can be learned which can transform the data into more useful representations that do not require manual feature extraction or selection. In this project, we tried the following three methods of clustering with deep learning — autoencoders, SCAN and transfer learning using VGG16. As for the data, RGB channels were extracted from the tiff images and the images were saved in .jpeg format for experiments in this section.

Method 1: Auto-encoders

An autoencoder is an unsupervised learning technique that implements artificial neural networks for representational learning to automatically identify important features from raw data. An autoencoder is composed of 3 main components which include an encoder, a bottleneck, and a decoder. Much like principal analysis, this is also a dimensionality reduction technique.

An illustration of a deep autoencoder

An illustration of a deep autoencoder. x represents the input data, y represents the learned features, and z represents the reconstructed data. Source: Deep Autoencoder Conference Paper (Bhat, C et. al, 2017)

The role of the encoder is to reduce the dimensionality of the input data by learning the most effective way to compress the input data into an encoded representation without capturing the ‘noise’. This aspect of autoencoders makes them very similar to principal component analysis with the main difference being that the use of neural nets in autoencoders allows for the introduction of non-linearities, therefore, allowing the encoder to learn more complex features.

The bottleneck refers to the final layer of the encoder that contains the encoded features in the lowest possible dimensions as a set in the autoencoder. This layer serves as a constraint for the amount of information that traverses the network and thereby preventing the model from simply memorizing the data. Given the encoded features, the decoder functions to reconstruct the data to match the original input. During each forward propagation pass, the performance of the autoencoder is evaluated using a reconstruction loss which is then minimized through backpropagation.

Although there are several variations of autoencoders, our implementation focused on deep autoencoders where the encoder and decoder consisted of two symmetrical deep belief networks each with 4 hidden layers.

An illustration of the dimensions of the different layers in the autoencoder.

An illustration of the dimensions of the different layers in the autoencoder. Images are clustered using features learned from the last layer of the encoder. Source: Omdena

In the first step, we trained the autoencoder on the 259 images for 700 epochs using stochastic gradient descent with momentum as the optimization algorithm to facilitate faster convergence. In the second step, the output from the encoder is passed to a clustering layer with weights initialized using KMeans. By adding a clustering layer to the neural network, it is possible to fine-tune the feature representation of the encoder and the cluster assignments simultaneously.

The final clusters after clustering with KMeans, K=4

The final clusters after clustering with KMeans, K=4. Source: Omdena

We observe that the model attempts to group the images with the most noticeable being the cloudy images while there is an obvious pattern in land cover types in different clusters. Although the images are of high dimensionality, the autoencoder greatly reduces the feature space to 4 dimensions which makes it easy to assign clusters. Additionally, training of the autoencoder was done using Google Colab’s GPU which greatly reduces training time.

Method 2: SCAN

SCAN stands for Semantic Clustering by adopting the nearest neighbors. It belongs to the family of unsupervised algorithms and claims to achieve the state of the art performance in image classification without using labels. The algorithms are divided into three stages. Let’s discuss each in brief.

1. Pretext

The idea of the pretext task comes from Representational learning where feature representation is learned solely from the images. This is done by minimizing the distance between an image and its augmentation. Since the neural net used has a limited capacity it tries to remember only the important characteristics which help classify that image.

Source: Van Gansbeke, Wouter, et al. “Scan: Learning to classify images without labels.” European Conference on Computer Vision. Springer, Cham, 2020.

Source: Van Gansbeke, Wouter, et al. “Scan: Learning to classify images without labels.” European Conference on Computer Vision. Springer, Cham, 2020.

2. SCAN and Self-label

SCAN: This model learns on the weights of the pretext models and outputs a probability distribution for C classes where C is the number of classes. Hence the output of this stage is similar to that of a multi-class classifier. The self-labeling is a fine-tuning step performed after the SCAN stage. The model’s architecture is the same as SCAN. If the output predictions are above a certain threshold, we are confident that the image falls in that particular cluster. The idea is quite simple since it will have well-classified samples (probability > threshold) these samples then help the wrongly classified nearest neighbors gradually become more certain.

Source: Van Gansbeke, Wouter, et al. “Scan: Learning to classify images without labels.” European Conference on Computer Vision. Springer, Cham, 2020.

Source: Van Gansbeke, Wouter, et al. “Scan: Learning to classify images without labels.” European Conference on Computer Vision. Springer, Cham, 2020.


The SCAN model was trained for cluster C = 5,6,8. The aim of the clustering step was to obtain the Pseudo Labels. Using C=8, the model was able to group all cloudy images in one cluster. But several other clusters were homogenous, and thus could be further merged by lowering the value of C. The best results were finally obtained for C = 5 using stock augmentations.

Clusters formed using SCAN with C=5

Clusters formed using SCAN with C=5. Source: Omdena

One downside of the reduction in the number of clusters was that cloudy images were now distributed among other classes. However, this was not considered significant and therefore ignored. In future works, it might be useful to use the elbow method to identify suitable values of C.

Related work with SCAN modeling

3. PreText + Kmeans Stage on Kenya Data

Instead of using the entire scan model (all the 3 stages), we can use the pretext model trained in the first stage as a feature extractor. The pretext stage outputs a D dimension vector (D is defined by the user). These D dimension vectors are then fed to a clustering algorithm like kmeans. In this case, the pretext model can be used alone without using the scan and self-labeling stage. In the below image, a 2d scatter plot shows the Pretext model outputs.

SCAN pretext model output with k-means on Kenya data. Source: Omdena

SCAN pretext model output with k-means on Kenya data. Source: Omdena

  • Verify SCAN performance on labeled Eurosat data 
  • During the initial stages, the labeled EuroSAT dataset was used as a proof-of-concept for the SCAN algorithm and to get the pipeline ready for Kenya Images later. As EuroSAT has labels, we could perform the accuracy compared with the final results from SCAN, and conclude that the algorithm performs remarkably well as an unsupervised learning approach. EuroSAT is composed of Sentinel-2 satellite images of Europe, it covers 13 spectral bands and consists of 10 classes with 27000 labeled and geo-referenced samples. The RGB bands were used in this modeling.
  • Clusters after SCAN modeling
Eurosat data original (left), after clustering with SCAN (right). Source: Omdena

Eurosat data original (left), after clustering with SCAN (right). Source: Omdena

Scatter plots of the EuroSAT images (plotted using PCA) are shown above. Each color represents a particular class (see legend). On the left are the ground truth labels and SCAN predicted labels are on the right. As one can see the two images are nearly identical, and the model is found to have a high 92% accuracy. Since the EuroSat dataset has been designed for supervised learning tasks, ground truth labels were available and used in this accuracy calculation. The below table shows the accuracy of different stages.

Stage: Pretext + kmeans

  • Accuracy: 52 %

Stage: SCAN

  • Accuracy: 92 %

Stage: Self-Label

  • Accuracy: 92.5%

It is important to note that the pretext model is a feature extraction stage and needs a clustering model like k means to output cluster labels. Secondly, in the self-labeling stage, the batch-size is important as a low batch size can degrade our accuracy (the authors used a batch size of 1000 for cifar dataset).

For more details and an in-depth tutorial for SCAN, please refer to this article: SCAN, Lets Classify images without Labels! | by clive fernandes

Method 3: Image feature vectors from VGG16

This method can be seen as similar to the functioning of the pretext model stage of SCAN. Here, image features are extracted from a CNN and then clustered with the standard k-means algorithm. A VGG16 network was selected, which was already pre-trained on the large ImageNet dataset having a lot of variety of classes and therefore works as a good feature extractor.

In the Keras model of the VGG16 network, the penultimate fully connected layer ‘fc2’ is set as the layer to extract the image features. Github LinkModel

As the final prediction layer of the model is removed, the new final layer is a fully-connected layer with 4,096 output nodes. This constitutes the feature vector that we want to extract. So the predict step with an image gives the feature vector that can next be used for clustering. Github Link


Once the feature vectors for all images are obtained, they are grouped in k different clusters. Each cluster is expected to gather visually similar images. Using the elbow method, k=4 was chosen for the number of clusters.

Image clusters using VGG16 features with K-means using k=4

Image clusters using VGG16 features with K-means using k=4. Source: Omdena

Since the feature vector dimension is over 4000 using VGG16 as a feature extractor, it can be considered computationally expensive to implement this method. As this work was done on Google Colab using their GPU facilities, this was not a problem. However, if you wish to reproduce transfer learning for feature extraction in your own work, it is recommended to perform dimensionality reduction (for example with Principal Component Analysis). This can help reduce the dimension of the vector (up to the value you prefer, depending on your computation resource) while preserving the most important information from the original data.


Our challenges with land cover classification

In this work, we faced two main challenges. First, the data collection (which was linked to sites of desert locust attack) generated a highly unbalanced dataset. As a result, some land-cover classes were poorly represented, having only a few images. This made it difficult for the clustering models to learn appropriate class representations. Thus, such classes are misgrouped, as can be seen visually. Secondly, we have applied k-means, which is a hard clustering algorithm, where one point could belong to only one cluster. However, satellite images tend to have multiple classes within the same image as they represent several kilometers of land in-ground (for example, the Sentinel-2 images used here cover 2km around each attack point). Therefore, more than one land-cover type may-be found within the same image, making it a multi-class labeling problem. In future works, soft clustering methods like fuzzy c-means could be explored. In such techniques, a given data point could be part of multiple clusters (classes) with a probability or likelihood of being in a given cluster. Therefore, multiple land-cover types could be found within the same image.


This article presents the various unsupervised techniques that were explored in our challenge. It is seen that when there is a lack of labeled datasets, clustering methods can be useful to gain insights and find similarities existing in a given database. It was interesting to apply deep learning in unsupervised approaches, as it allowed us to exploit the advantages of CNNs over other manual feature extraction methods. In particular, as seen from the quality of clusters formed, the recently released SCAN algorithm works remarkably well as an unsupervised learning approach.

However, a major limitation of these methods is that there is no information about which clusters represent what class (land-cover type in our case). To overcome this, after cluster formation by each method, we adopted an annotation technique using a labeled map of Kenya from 2019. All images in the clusters were plotted on the map (as Sentinel-2 images have the location information) and using majority voting of the points, clusters were assigned an overall label. Find details about this work in the article below:

A Damage Assessment Solution of the Desert Locust Surge in Kenya Using AI and Satellite Imagery.

Applying Machine Learning to dynamically assess vegetation health changes before and after desert locust attacks

This article is written by authors: Clive Fernandes, Debaleena Misra, Marvin Oluoch, Abhilasha Sinha.

Ready to test your skills?

If you’re interested in collaborating, apply to join an Omdena project at:

media card
Smart Solutions Battling Malaria in Liberia with AI
media card
Harnessing AI to Monitor and Optimize Reforestation Efforts in Madagascar
media card
How We Leveraged Advanced Data Science and AI to Make Farms Greener
media card
A Beginner’s Guide to Exploratory Data Analysis with Python