AI Insights

Applying CNNs To Images For Computer Vision And Text For NLP

June 11, 2021

article featured image

Convolutional Neural Networks (CNNs, ConvNets) have become crucial in artificial intelligence. These networks are reputable for excellently capturing spatial and dependency information in matrices (images, sentences), while effectively reducing their sizes without discarding information. In this tutorial, we walk you through the methods of applying CNNs in computer vision and NLP with the essential code needed!


Convolution generally refers to the mathematical combination of two functions to produce a third function. It is a reliable way to produce an output with a relationship between 2 functions.

Now a grayscale image is a matrix whose values are the intensities of each pixel in a display.

   A grayscale image with 7x7 resolution

A grayscale image with 7×7 resolution

A colored image would have more channels (similarly shaped matrixes) representing each contributing color, usually Red, Blue, Green. The shape of the image matrix in this case would be 3x7x7. 

 A coloured image is a multidimensional tensor (complex matrix) - Convolutional network

A coloured image is a multidimensional tensor (complex matrix) – Convolutional network

Convolution is utilised to transform an image by applying a filter (a square matrix) to it, creating a feature map that summarizes the presence of detected features in the input. The filter could be for example a rotation or an edge detection matrix whose values/weights were mathematically constructed. 

A rotation matrix - for convolutional network

A rotation matrix – for convolutional network

This is achieved by the dot product of the filter weights/values and the input (image pixel matrix). The filter is applied systematically to each overlapping part or filter-sized patch of the input data, left to right, top to bottom. 

CNN kernels and pooling - Omdena

CNN kernels and pooling

The distance at which the slide shifts when attempting to downsample the image is called the stride. To maintain the dimension of output as in input, the image is padded with zeros.

Multiple filters are used to learn diverse features, these filters per image channel hence form a complex and large feature map. Pooling is an operation used to reduce the dimension of each matrix in the tensor.

CNN tutorial - Omdena

Max pooling

The most common technique is Max Pooling, where the maximum value in the chosen window is chosen. This is carried out systematically just like in convolution. Average pooling takes the average of the window.


While these filters can be handcrafted, convolutional neural networks learn the filter values/weights during training in the context of a specific prediction problem. These learnable weight matrices that contribute to a model’s predictive power, updated during back-propagation are called parameters.

MNIST Classification Example

The MNIST images are ready-to-use images of handwritten digits. The goal is to build a model that distinguishes between the 10 digits.

Step 1: Import files and load dataset

from tensorflow import keras
from tensorflow.keras import datasets, layers, models
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np

(train_images, train_labels), (test_images, test_labels) = datasets.mnist.load_data()

Step 2: Normalize images by dividing each pixel by 255 (maximum value of pixels) to keep the values no more than 1. Benefits include efficiency and more.

train_images, test_images = train_images / 255.0, test_images / 255.0

Step 3: Expand the dimension of the images to 

train_images = np.expand_dims(train_images, axis=3)
test_images = np.expand_dims(test_images, axis=3)

Step 4: One-hot encode the labels where each label is represented by a vector with length = num of classes. The value at the correct index = 1 while the rest is zero.

test_labels = to_categorical(test_labels)
train_labels = to_categorical(train_labels)

Step 5: Check the structure of the images

print("x_train shape:", train_images.shape)
print(train_images.shape[0], "train samples")
print(test_images.shape[0], "test samples")

 Output :

x_train shape: (60000, 28, 28, 1)

60000 train samples

10000 test samples

Step 6: Assign variables

num_classes = 10
input_shape = (28, 28, 1)
num_filters = 8
filter_size = 3
pool_size = 2

Step 7: Create the model

model = keras.Sequential([
 layers.Conv2D(num_filters, filter_size, input_shape=(28, 28, 1)),
 layers.Dense(num_classes, activation='softmax'),
CNN process - Omdena tutorial

An Illustration of the entire process (2d)

CNN application and tutorial - Omdena

CNN flattening

Because this is a classification task, the output of the pooling layer needs to be flattened to enable general operations.

CNN flattening and applying CNN - Omdena tutorial

The flattening of a multi-channel image

And passed through a fully connected layer units = num of classes, and activation = ‘Softmax’ to calculate the probabilities. 

Step 8: Compile and train

              ), train_labels, batch_size=128, epochs=2, validation_split=0.1)


Step 9: Evaluate

score = model.evaluate(test_images, test_labels, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

CNN in Natural Language Processing

Now consider a text classification task where the context of each word (surrounding word) has been proven to help distinguish the words (the purpose of n-grams). In computer vision tasks, the filters used in CNNs slide over patches of an image whereas in NLP tasks, the filters slide over the sentence matrices, a few words at a time, producing outputs that combine words and their contexts like n-grams, just more compact and efficient.

Applying CNN for computer vision and NLP - Omdena tutorial

Convolution in a sentence

After tokenizing and representing each sentence with their integer values to get a dataset like







Where each vector is a sentence and each value a word (pad shorter sentences if needed), a simple keras architecture would be something like this.

model4.add(Embedding(vocab_size, desired_embed_dim, input_length=256))
model4.add(Dense(1, activation='sigmoid'))

In the embedding layer, each word would be represented with a fixed-length vector, transforming each sentence into a matrix like:

                                        [0.345, 0.33556, 0.235]

                                        [0.345, 0.33556, 0.235]

                                        [0.345, 0.33556, 0.235]

                                        [0.345, 0.33556, 0.235]

                                        [0.345, 0.33556, 0.235]

Where the values are determined by word similarity. The convolution would be carried out on this matrix. Alternatively, the embeddings can be extracted from a pre-trained model and formed into a matrix. Think word2vec, GloVe, ElMo, USE, BERT, etc.

While using CNNs for text tasks, it is imperative to be wary of preprocessing as techniques like stop word removal and lemmatization would alter the sentence, hence the convolution could occur with the wrong word contexts.

It is generally assumed that CNNs are only suitable for text classifications unlike in images. Now imagine convolving a word matrix where each character is represented as an embedding vector, and each word could be classified (think POS tagging and entity recognition). The biggest advantage of CNNs over recurrent networks is their ability to capture contextual information over really long sequences (like articles) efficiently and their swift convergence. Only transformers outperform CNNs in this regard due to their attention mechanism. 

However, using CNNs for text generation is challenging, but CNNs and RNN variants can be creatively combined, leveraging each method’s strengths to solve problems in various domains (Computer Vision, NLP, robotics, and more).


We learned how to extract features from an image using convolution, reduce the dimension by subsampling (pooling), and flattened to facilitate classification based on the differences using a softmax-activated dense layer. We also learned to use the same principles in a document matrix for text classification.

This article is written by Henry Ndubuaku.

Ready to test your skills?

If you’re interested in collaborating, apply to join an Omdena project at:

More Articles
media card
Revolutionizing Short-term Traffic Congestion Prediction with Machine Learning
media card
Transforming Artwork Analysis with Advanced Computer Vision Techniques
media card
FloodGuard: Harnessing the Power of AI and GIS to Protect Bangladesh from the Fury of Floods
media card
How We Created an Innovative Solution for Power Accessibility without the Available Resources