Creating a Google Virtual Machine Instance to Reduce Dataset Size for Improved Visibility

June 5, 2024

Working with large datasets, such as NetCDF files, can be a daunting task, especially when your local machine lacks the necessary computational resources. This is where cloud computing comes into play. By setting up a customized Google Virtual Machine (GVM) instance, you can use the scalability and flexibility of the cloud to process and analyze large datasets efficiently. This technique not only lets you overcome the limits of your local machine’s RAM but also gives you access to the optimized compute resources of the GVM instance. As a result, you can streamline your workflow, reduce processing time, and gain insights from your data more effectively.

Introduction

This tutorial will guide you through setting up a Google Virtual Machine (GVM) instance, connecting it to your local machine, and using Jupyter notebooks to manage NetCDF files up to 5MB. We’ll show you how to leverage the VM’s optimized compute resources to load and manipulate large NetCDF files, even if your local machine lacks sufficient RAM. We’ll also explore techniques to reduce dataset size for more efficient analysis.

This comprehensive, Linux-based use case is easy to follow and will help you unlock the potential of big data using Google Cloud’s powerful computing capabilities.

Working with a GVM Instance connected to a Local Ubuntu Machine

Google Cloud setup

To streamline the setup process, we are going to use the Google Cloud Platform (GCP) console via web browser, along with some console commands on the local Ubuntu machine.

1. Start by creating a Google Cloud account. New accounts can take advantage of the free trial, which offers access to many services for 90 days.

2. Create a new project by clicking the `NEW PROJECT` button (refer to Figure 1). Give your project a descriptive name so you can easily identify it later.

3. Generate SSH Keys to connect your local machine with your GVM instance. Several methods are available to access the instance, but we are going to use the SSH protocol to locally install requirements in our GVM instance. You can find the official documentation to generate SSH keys here.

Figure 1 - Create a new project

On your local Ubuntu machine, open a terminal and go to the `.ssh` directory using the command `cd ~/.ssh/`. Optionally, you can list the files in this directory; if you have worked with GitHub, you will likely see some key files there already. Next, generate your keys with the following command, replacing `KEY_FILENAME` and `USERNAME` with your details. You will be asked for a passphrase; you can press Enter to leave it empty.

ssh-keygen -t rsa -f ~/.ssh/<KEY_FILENAME> -C <USERNAME> -b 2048

Please be mindful of potential security risks, especially if multiple individuals use the machine or if you manage confidential information. In such cases, you might want to set a passphrase. After executing the command, the private and public keys will be saved as `<KEY_FILENAME>` and `<KEY_FILENAME>.pub`, respectively, within your `.ssh` directory.

Now return to your GCP console, navigate to `Metadata` under the `Compute Engine` menu, and select the `SSH Keys` tab. Click on `ADD SSH KEY`, paste the content of your public key file (the file with the `.pub` extension), and save it.

Create a GVM instance

Now we are going to create the GVM instance.

1. First, navigate to “VM instances” under the “Compute Engine” service of the Google Cloud menu. Then click on the “ENABLE” button to have access to the API, as shown in Figure 2.

Figure 2 - Enable the Compute engine API

2. Create the GVM instance by clicking on “CREATE INSTANCE” (see Figure 3).

Figure 3 - Create a GVM instance

Proceed to configure the GVM Instance:

Instance Name: choose a descriptive name using a customized naming convention to help track your work.
Region and zone: according to your geographic location.
Machine configuration: a `General purpose`, `E2`, `Standard` machine with 32 GB of memory has proven effective for our requirement to process NetCDF files up to 5MB (refer to Figure 4).

Figure 4 - Machine configuration

Configure the boot disk; you can follow the setup shown in Figure 5. You can change these settings according to your requirements. For example, you may want to create automated backups or increase the machine capacity, depending on your workload.

Click on the “CREATE” button to finalize your instance setup.

Figure 5 - Boot Disks settings

Additionally, if you want to link more Google Cloud services such as Cloud Storage or BigQuery, you might want to create a Google service account and enable the services you would like to use.

Working Locally on the GVM instance

To begin working, install the necessary tools on the GVM instance. Use the SSH keys generated previously to access the instance from your local machine. To expedite future logins, create a `config` file as follows:

1. Create a `config` file on your local machine under the `.ssh/` directory.

2. Edit the file and add the configuration shown in Figure 6 with the following parameters:

Host: Name of your instance
HostName: The external IP of your running instance (note: this changes with each session).
User: The same user you used to generate your SSH keys.
IdentityFile: Path to your private key file.

Figure 6. Create the config file. The HostName parameter is the "External IP" on the GCP console.

3. To access your instance, update the `External IP` value (the `HostName` parameter) in the config file, then run `ssh <Host>` from your terminal, e.g. `ssh tutorial-omdena`.
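Because Figure 6 is an image, the entry it describes can also be sketched in text. The helper below is a hypothetical convenience, not part of the original workflow: it renders a `config` stanza from the four parameters listed above. The function name, the IP address, the user, and the key path are all placeholders chosen for illustration.

```python
def render_ssh_config_entry(host, external_ip, user, identity_file):
    """Return an OpenSSH client config stanza for one host alias."""
    lines = [
        f"Host {host}",
        f"    HostName {external_ip}",
        f"    User {user}",
        f"    IdentityFile {identity_file}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values: replace them with your own instance details.
entry = render_ssh_config_entry(
    host="tutorial-omdena",
    external_ip="203.0.113.10",  # documentation-only address, not a real instance
    user="your_username",
    identity_file="~/.ssh/your_key_filename",
)
print(entry)
```

Since the external IP changes between sessions, only the `HostName` line needs to be updated before each login.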

Installing Dependencies

Anaconda

Installing Anaconda will cover most of the requirements due to the simplicity of the project goal; Anaconda gives access to Jupyter notebook and many Python packages.

1. Start by copying the link of the Anaconda installer from the official repository. Then, download the installer from a terminal as in the next example:

~$wget https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh

2. Once the installer is downloaded, install Anaconda by executing the shell script as follows:

~$bash Anaconda3-2024.02-1-Linux-x86_64.sh

3. During the installation, press Enter to review the license terms and accept them. You can provide a path for the installation or simply press Enter to use the default. When prompted, type `yes` to proceed; the extraction and installation will take some time.

4. The installer modifies the hidden `.bashrc` file. Run `source ~/.bashrc` to apply the changes to your environment.

5. To finalize, type `logout` to exit the instance and log back in. You should see `(base)` at the beginning of the prompt, indicating that Anaconda is now active.

Good job! Anaconda is installed in your GVM instance environment. Now proceed to connect to the Jupyter notebook server from your browser. Start by installing the `Remote-SSH` extension for Visual Studio Code (VSC) locally. Then, click on the lower left corner of VSC to open a remote window, select `Connect to Host` in the displayed menu, and choose your instance (refer to Figure 7).

A new VSC window will open showing the name of the instance in the lower left corner.

Figure 7 - Connecting with the GVM instance

In this window, open a terminal, switch to the `PORTS` tab, click `Forward a Port`, and enter 8888, the default port for Jupyter Notebook, as shown in Figure 8.

Figure 8 - Open the port to enable communication and work with Jupyter Notebook

Start the Jupyter server on the instance if it is not already running (for example, with `jupyter notebook --no-browser --port=8888` in the remote terminal). Then open a new browser window, go to `http://localhost:8888/`, and create a new Jupyter notebook.

Packages to process NetCDF files

There are several options for working with NetCDF datasets; we used xarray and netCDF4. While xarray is already installed with Anaconda, you need to install netCDF4 by executing `pip install netCDF4`.
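Before opening the notebook, it can help to confirm that the packages it imports are actually available in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this example, and the package list simply mirrors the imports used later in this tutorial.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


required = ["xarray", "netCDF4", "pandas", "numpy", "seaborn", "matplotlib"]
missing = missing_packages(required)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required packages are available.")
```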

Processing NetCDF datasets using a Jupyter Notebook

What is NetCDF?

NetCDF is a set of software libraries and data formats designed to foster the creation, access, and sharing of array-oriented scientific data. It is widely recognized as a community standard for data sharing. The data can include several subsets in different formats, and new subsets can be added easily, which makes netCDF scalable and appendable. Additionally, this format ensures that the data is self-describing and portable, meaning the data can include its own description and be accessible across different computing architectures. These files can be processed in various programming languages including C, Java, Fortran, Python, and others.

Starting with NetCDF Dataset

Following our introductory overview of NetCDF, we will delve into the specific case of reducing the dataset size by substituting the daily data with measures of central tendency. To facilitate this, we created a Jupyter Notebook. The dataset to process contains daily fire weather index projections for Europe; the data is provided by the Copernicus system.

Proceeding with the Jupyter notebook:

1. Loading packages.

import xarray as xr
import pandas as pd
import numpy as np
from netCDF4 import Dataset
import random
import seaborn as sns
import matplotlib.pyplot as plt

2. Load, explore, and access data.

# Loading data
file_path = 'wildfire_index_europe.nc'
ds = xr.open_dataset(file_path)

The dataset contains dimensions, coordinates, variables, indexes, and attributes. The data can be explored by displaying the variable `ds` (refer to Figure 9). In this way, it is possible to understand the meaning and relationships among variables. The `attributes` component provides metadata and detailed information about the data and its source.

3. Partitioning data.

The data can be accessed without needing to know the storage details. Within our setup there are at least two ways to handle the data: using the pandas capabilities, the xarray capabilities, or both.

We create data partitions per year by using the method `Dataset.isel()` from the package `xarray`; each segment is an element of the dictionary `ds_years`.

ds_years = {}
ds_years["ds_2037"] = ds.isel(time=slice(1827, 2192))
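The positional bounds passed to `isel()` depend on where the time axis starts. As a sketch of how such bounds could be derived rather than hard-coded, the helper below computes them for any year. It assumes, purely for illustration, a gap-free daily series on the standard calendar that starts on 2032-01-01; under that assumption the result for 2037 matches the slice used above. The name `year_slice` is hypothetical.

```python
from datetime import date


def year_slice(first_day, year):
    """Return the positional slice covering `year` in a gap-free daily series."""
    start = (date(year, 1, 1) - first_day).days
    stop = (date(year + 1, 1, 1) - first_day).days
    return slice(start, stop)


# Assumption for illustration: the time axis starts on 2032-01-01.
first_day = date(2032, 1, 1)
print(year_slice(first_day, 2037))  # slice(1827, 2192, None)
```

With this helper, each partition could be created as `ds.isel(time=year_slice(first_day, year))`. Datasets that use a non-standard calendar (for example, one without leap days) would need different arithmetic.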

4. Create dataframes.

Figure 9. Dataset structure

To process data efficiently we create pandas dataframes and drop extra information for each year.

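# Create one DataFrame per year
cols_drop = ["rlon", "rlat", "rotated_pole"]
dfs = {}
for ds_name, ds_year in ds_years.items():
    # Derive the DataFrame key from the dataset key, e.g. "ds_2037" -> "df_2037".
    # A separate variable name keeps the loaded dataset `ds` from being overwritten.
    df_name = ds_name.replace("ds", "df", 1)
    # Flatten the gridded data into rows and turn the index levels into columns
    temp = ds_year.to_dataframe().reset_index()
    # Drop the coordinates that are not needed for the analysis
    dfs[df_name] = temp.drop(columns=cols_drop)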
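The grouping step later in the notebook relies on a `month` column, which the excerpts shown here do not create. One way to derive it is sketched below on a small made-up frame. This assumes the `time` column holds pandas datetimes; data stored with a non-standard calendar may need a different approach.

```python
import pandas as pd

# Tiny made-up frame standing in for one entry of `dfs`.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(["2037-01-30", "2037-01-31", "2037-02-01"]),
        "lon": [10.0, 10.0, 10.0],
        "lat": [45.0, 45.0, 45.0],
        "fwi-daily-proj": [1.2, 1.5, 0.9],
    }
)

# Extract the calendar month (1-12) from the timestamp column.
df["month"] = df["time"].dt.month
print(df[["time", "month"]])
```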

Statistical Review

ESG (Environmental, Social, and Governance) risk analysis involves forecasting environmental hazards to provide analytical insights aimed at preventing and mitigating such risks for businesses. Given the potential for information overload with daily data, it is crucial to compress such datasets effectively. We can visualize the distribution of the data to study its behavior and make an informed choice between the median and the mean as the substitute value.

Visualizing data distribution

Following our case study, for each annual dataset we display data distributions for randomly selected geocoordinates (longitude and latitude pairs) for three different years. These charts are useful for analyzing the data distribution, especially the influence of outliers on the mean and the median, and they help determine which value can replace the roughly thirty original values (the daily projections) of a complete month. One of these charts is shown in Figure 10.

Figure 10. Distribution of daily fire weather index projections for each month of a given year. The projection values correspond to a single geocoordinate pair.

Scaling down the dataset size

Using mean or median values to replace data smooths it by dampening the effect of outliers (the median more so than the mean), preserves the overall trends, offers an efficient way to handle missing data, and facilitates future analysis. However, it’s important to consider actionable insights aligned with business goals when choosing between mean and median.

For example, in Air Quality Management, median pollution levels address common issues, while mean levels offer trend snapshots. In Freshwater Resource Management, median water availability ensures consistent access. And in Wildfire Risk Assessment, median wildfire risk helps insurance companies set premiums based on typical fire seasons.

The standard deviation quantifies the spread of data around the mean, indicating the uncertainty or risk associated with predictions. A small standard deviation suggests more reliable predictions, while a large one implies less reliability. This information is crucial for companies to assess potential variability in environmental impacts and adapt accordingly.
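To make the difference between these statistics concrete, here is a small worked example with made-up daily values for one location, where the last value is an outlier. It uses the standard library only; `stdev` is the sample standard deviation, which is also what the pandas `std()` method computes by default.

```python
from statistics import mean, median, stdev

# Made-up daily index values for one location; the last one is an outlier.
daily_values = [4.0, 5.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0, 45.0]

print("mean:  ", mean(daily_values))             # 9.0
print("median:", median(daily_values))           # 5.0
print("std:   ", round(stdev(daily_values), 2))  # 12.67
```

The single outlier pulls the mean from about 5 up to 9 and inflates the standard deviation, while the median stays at the typical value.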

We calculate the monthly median, mean, and standard deviation values.

# Monthly statistics per grid cell (assumes each DataFrame has a "month" column)
group_keys = ["lon", "lat", "month"]
value_col = "fwi-daily-proj"
df_stats = {}
for df_k, df_v in dfs.items():
    stats_name = df_k.replace("df", "stats")
    grouped = df_v.groupby(group_keys)[value_col]
    # One row per (lon, lat, month) with the three summary statistics
    df_stats[stats_name] = grouped.agg(
        monthly_mean="mean", monthly_median="median", monthly_std="std"
    )

Replacing the original daily data with monthly mean or median values reduces the number of records by roughly thirty to one, a reduction of more than 95% relative to the original dataset. This scaling down not only makes it easier to add layers to a map but also yields a summarized dataset that is easy to understand and to use for informed decisions.
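The size of the reduction can be checked with a short calculation. The sketch below counts rows per grid cell for one year, assuming one summary row per month; the year is only an example.

```python
import calendar

year = 2037
days_in_year = 366 if calendar.isleap(year) else 365
monthly_records = 12  # one summary row per month

reduction = 1 - monthly_records / days_in_year
print(
    f"{days_in_year} daily rows -> {monthly_records} monthly rows "
    f"({reduction:.1%} fewer rows per grid cell)"
)
```

For 2037 this gives a reduction of 96.7% in row count. If the mean, median, and standard deviation are all stored, each grid cell keeps 36 values instead of 365, which is still a reduction of about 90%.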

Important Reminder: Always remember to `STOP` your GVM instance when you have completed your tasks. This helps you to effectively manage your free credits or avoid unexpected charges for a paid account.

Potential Challenges

When working with large datasets and cloud computing, you may encounter various challenges. Here are some potential hurdles to keep in mind:

Data Transfer and Storage: Transferring large datasets to/from the cloud can be slow and costly. Ensure a reliable internet connection and use data compression. Be mindful of storage costs.
Data Security and Privacy: Ensure proper security for sensitive data. Encrypt data, implement access controls, and follow privacy regulations. Use your cloud provider’s security features and best practices.
Resource Management: Manage resources effectively to avoid unnecessary costs. Monitor usage, adjust configurations, and use automated scaling or alerts to optimize utilization.
Debugging and Troubleshooting: Debugging can be challenging in distributed systems. Use your cloud provider’s logging and monitoring tools. Implement robust error handling and logging in your code.
Learning Curve: Working with cloud platforms may require additional learning. Familiarize yourself with documentation, tutorials, and best practices. Seek training materials and join communities for support.

By being aware of these potential challenges and proactively addressing them, you can minimize roadblocks and ensure a smoother experience when working with large datasets and cloud computing resources.

Potential Uses and Applications

This technique of using cloud computing resources, such as Google Virtual Machine instances, to process and analyze large datasets has a wide range of potential uses and applications across various domains. Here are some notable examples:

Environmental Monitoring and Climate Change Research: Analyze satellite imagery, weather patterns, and climate models to study climate change and ecosystem dynamics.
Genomics and Bioinformatics: Store and analyze massive genomic datasets to uncover genetic variations and develop personalized medicine.
Financial Analytics and Risk Management: Process real-time financial data to detect fraud, assess credit risk, and make data-driven investment decisions.
Social Media Analytics and Sentiment Analysis: Analyze user-generated content to gain insights into user behavior, preferences, and sentiments for targeted marketing and public opinion analysis.
Internet of Things (IoT) and Smart City Applications: Process real-time data from IoT devices and sensors for intelligent applications like traffic management and energy optimization.
Healthcare and Biomedical Research: Securely store and analyze patient data, including electronic health records and genomic information, for personalized treatment plans and drug discovery.

By leveraging the power of cloud computing and the techniques described in this tutorial, researchers, businesses, and organizations across various domains can unlock the potential of big data and drive innovation in their respective fields.

Conclusion

In this tutorial, we have provided step-by-step instructions for configuring a Google Virtual Machine instance and establishing an SSH connection from your local machine. The GVM instance serves as a scalable computing resource for efficient processing of NetCDF datasets that exceed the capabilities of your local machine. We developed a Jupyter notebook to facilitate the visualization of georeferenced data and to scale down the dataset size by replacing the daily values with measures of central tendency such as the median or the mean. Additionally, we computed the standard deviation, which can be treated as a measure of uncertainty accompanying the central value.

This approach not only empowers users to efficiently handle massive datasets but also significantly enhances data accessibility and usability.