Predicting Soil Organic Carbon Changes Using RothC and Machine Learning

August 23, 2022

In this article, we will present how to develop a data visualization dashboard to measure Soil Organic Carbon (SOC) changes from historic data using RothC.

Why is regenerative farming important?

Regenerative farming is a conservation and rehabilitation approach to food and farming systems that focuses on topsoil regeneration, increasing biodiversity, improving the water cycle, enhancing ecosystem services, supporting biosequestration, increasing resilience to climate change, and strengthening the health and vitality of farm soil¹.

The core of regenerative farming is good soil. When soil is in good health, it produces a large, high-quality, and reliable harvest with as few production inputs as possible. Healthy soil benefits farmers, the environment, and society as a whole in many different ways¹. Many farmers are already implementing regenerative farming measures in their fields, such as minimizing cultivation and maintaining plant cover.

Regenerative farming is important because it prioritizes climate, ecosystems, water quality, people’s health, relationships within and across ecosystems, fair pay, and racial equity for farmers². Under the Paris Agreement, countries committed to holding global warming well below two degrees Celsius above pre-industrial levels. Reducing greenhouse gas (GHG) emissions has become the primary means of limiting global climate change. In the agriculture sector, increasing Soil Organic Carbon (SOC) stock is considered a critical approach to mitigating GHG emissions, keeping the soil fertile, and reducing desertification².

Above all else, regenerative farming provides an opportunity to produce high yields profitably while securing future production inputs by locking nutrients in the soil².

(1) Regenerative agriculture – Wikipedia.
(2) Regenerative Agriculture 101 – NRDC.
(3) Regenerative Agriculture – NRDC.

How machine learning can be applied

The EU is pursuing ambitious policies to combat climate change that require carbon emissions neutrality in all sectors. For the agricultural sector, this means a transition toward a more sustainable model; for other sectors, it means restrictions on activities that emit CO2. This has caused real problems in many economic sectors.

There are three main barriers to adopting regenerative farming. First, farmers lack the knowledge and financial resources to implement it. Second, companies need to reach emissions neutrality because of legal requirements, competitiveness, and corporate social responsibility (CSR), but monitoring carbon credits is not currently cost-effective. Third, carbon standards and certification bodies require accurate monitoring of carbon sequestration. Agriculture has very high uptake potential, but the high cost of monitoring makes the process unprofitable.

Machine Learning (ML) and AI can help break down these barriers through a system that provides knowledge about how to transition to regenerative farming and helps farmers access the financial resources to make that transition.

Project pipeline

The project scope consists of the following:

  • Developing a data visualization dashboard that can demonstrate the benefits of regenerative practices
  • Describing how Soil Organic Carbon (SOC) stock changes, based on historic data
  • Developing a pathway of one or more methods to predict SOC changes


Project pipeline

Source: Omdena

Data sources

The following are the data sources used in the project:

  • Wise3: A global soil database created by the International Soil Reference and Information Centre (ISRIC). The data contains a variety of soil features.
  • Carbosal: A georeferenced soil profile analytical database for Spain created by a group of Spanish institutions. The dataset contains soil features similar to those in the Wise3 dataset.
  • Sigpac/DUN: Contains the official geographic data for the registered soil parcels in Spain.
  • ICGC data: A collection of soil profiles measured from 2008 to 2018 on various lands in Catalonia by the Institute of Geology and Cartography in Catalonia.
  • LUCAS data: Europe’s Land Use-Land Cover Area Frame Survey (LUCAS) is an in situ area frame survey carried out every three years over the entire European territory. The data is gathered through direct observations made by surveyors in the field.


Predicting soil organic carbon stock at t=0 using machine learning

The team aimed to predict the SOC value at a given location from other known features of that location. They modeled in Python because of their experience with it and its range of machine learning libraries. The main dataset was the LUCAS 2015 topsoil dataset merged with the “micro” and “ancillary” datasets of the same year. The team selected all samples from the Mediterranean bio-geo region and all available points from Spain and Italy because of similarities in climate and geography. Initially, 65 features were used, with the plan to reduce the feature space over time. As in the exploratory data analysis, the pandas library was used for data processing.
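The merge-and-select step described above can be sketched with pandas. This is a minimal illustration under stated assumptions: the three small tables, the `point_id` key, and every column name are hypothetical stand-ins, not the actual LUCAS schema.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables; the real LUCAS
# column names and key differ and are not given in the article.
topsoil = pd.DataFrame({
    "point_id": [1, 2, 3, 4, 5],
    "country": ["ES", "IT", "FR", "DE", "ES"],
    "soc_g_kg": [12.0, 18.5, 22.1, 30.4, 41.7],
})
micro = pd.DataFrame({
    "point_id": [1, 2, 3, 4, 5],
    "land_cover": ["Barley", "Vineyards", "Shrubland", "Grassland", "Grassland"],
})
ancillary = pd.DataFrame({
    "point_id": [1, 2, 3, 4, 5],
    "biogeo_region": ["Mediterranean", "Mediterranean", "Mediterranean",
                      "Continental", "Atlantic"],
})

# Join the three tables on the shared point identifier; validate guards
# against accidental row duplication from repeated keys.
merged = (
    topsoil
    .merge(micro, on="point_id", how="left", validate="one_to_one")
    .merge(ancillary, on="point_id", how="left", validate="one_to_one")
)

# Keep Mediterranean points plus every point in Spain or Italy.
keep = (merged["biogeo_region"] == "Mediterranean") | merged["country"].isin(["ES", "IT"])
selected = merged[keep].reset_index(drop=True)
```

In this toy data the Atlantic point in Spain is kept through the country rule, while the Continental point in Germany is dropped.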

Further preprocessing was required before modeling. The team noticed that the main land use description column had certain descriptors, such as “grasslands,” used in multiple categories, so they split it into additional boolean features. They also created separate features for high-value count categories like “barley” and “vineyards.” The team converted the existing categories to pandas categorical type, ordered those that could be ordered correctly, and used pandas box-plots to identify outliers.
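A minimal sketch of these preprocessing steps is shown below. The sample rows, column names, and category order are assumptions made for illustration. The outlier flag uses the 1.5 × IQR rule that a box plot draws as its whiskers, which stands in for the visual inspection the team describes.

```python
import pandas as pd

# Hypothetical sample; column names are illustrative, not the LUCAS schema.
df = pd.DataFrame({
    "land_use": [
        "Grassland with sparse tree cover",
        "Barley",
        "Vineyards",
        "Grassland without tree cover",
        "Barley",
    ],
    "texture_class": ["coarse", "fine", "medium", "medium", "coarse"],
    "soc_g_kg": [31.0, 12.5, 9.8, 27.4, 140.0],
})

# Boolean flags for descriptors that recur across categories or that
# have a high value count.
for keyword in ["grassland", "barley", "vineyards"]:
    df[f"is_{keyword}"] = df["land_use"].str.contains(keyword, case=False)

# Ordered categorical type for a feature with a natural order (assumed order).
df["texture_class"] = pd.Categorical(
    df["texture_class"], categories=["coarse", "medium", "fine"], ordered=True
)

# Flag values outside the box-plot whiskers (1.5 * IQR beyond the quartiles).
q1, q3 = df["soc_g_kg"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (df["soc_g_kg"] < q1 - 1.5 * iqr) | (df["soc_g_kg"] > q3 + 1.5 * iqr)
```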

After the preprocessing stage, the team performed machine learning-specific work. Because model scores varied noticeably between different data splits, they used five-fold cross-validation to evaluate each model. This required bundling all preprocessing into a preprocessing pipeline so that the cross-validation function could refit it on each fold. A median value imputer filled in missing values for numerical features, and standard scaling was chosen for feature scaling. For categorical features, only an ordinal encoder was required. All pipeline components came from the sklearn library.
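The pipeline described above might look like the following sketch. The synthetic data, feature names, and the choice of a random forest as the final estimator are assumptions for illustration; only the components named in the text (median imputer, standard scaler, ordinal encoder, five-fold cross-validation) come from the article.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in data; feature names are hypothetical.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "clay_pct": rng.uniform(5, 60, n),
    "ph": rng.uniform(4.5, 8.5, n),
    "land_cover": rng.choice(["cropland", "grassland", "shrubland"], n),
})
X.loc[::10, "ph"] = np.nan  # simulate missing measurements
y = 10 + 0.3 * X["clay_pct"] + rng.normal(0, 2, n)

numeric_features = ["clay_pct", "ph"]
categorical_features = ["land_cover"]

# Numeric columns: median imputation, then standard scaling.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_features),
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     categorical_features),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# Because preprocessing lives inside the pipeline, it is refit on the
# training part of every fold, so no statistics leak from the held-out part.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
mae_per_fold = -scores
```

Keeping the imputer and scaler inside the pipeline is the design choice that makes the per-fold evaluation honest: medians and scaling statistics are computed from training rows only.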

The team tested a KNN regressor, linear regression, a random forest regressor, and an xgboost regressor on the dataset with and without outliers. They evaluated the models using mean absolute error (MAE) with five-fold cross-validation. After obtaining feature importance information, the team also trained models on smaller sets of features. For model interpretation, the random forest was the main model studied, using both MDI (mean decrease in impurity) and permutation feature importance, because of its relative modeling power, robustness to non-optimal hyperparameters, and reputation for producing reliable feature importance information. The team used the sklearn library for all models and metrics, except for the xgboost regressor, which came from the xgboost library.
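The comparison and the two kinds of feature importance can be sketched as follows. The data is synthetic (one strongly informative feature, one weak, one pure noise), and the xgboost regressor is left out so the example depends only on sklearn; both are simplifications, not the team's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: column 0 is strongly informative, column 1 weakly, column 2 is noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

models = {
    "knn": KNeighborsRegressor(n_neighbors=5),
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean MAE over five folds for each model (sklearn returns negated errors).
cv_mae = {
    name: -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
    for name, model in models.items()
}

# Feature importance for the random forest, computed two ways.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

mdi_importance = forest.feature_importances_  # impurity-based, from training data
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0,
    scoring="neg_mean_absolute_error",
)
perm_importance = perm.importances_mean  # drop in score when a column is shuffled
```

MDI is computed from the training data and can overstate high-cardinality features, while permutation importance is measured on held-out data, so checking that both rank features similarly is a reasonable sanity test.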

Using RothC to predict soil organic carbon stock at t=1

Carbon in the soil is stored in two forms: Soil Organic Carbon (SOC), which is the main constituent of Soil Organic Matter (SOM), and Soil Inorganic Carbon (SIC). Organic matter mainly comprises carbon (58%), with water and other nutrients such as nitrogen and potassium making up the remaining mass.

SOC, being the primary component of soil organic matter, is crucial for all soil processes. The soil’s organic material comes from residual plant and animal material, synthesized by microbes, and decomposed under the influence of temperature, moisture, and other soil conditions.

The rate of annual loss of organic matter can differ significantly, depending on various factors, such as cultivation practices, type of plant or crop cover, soil drainage status, and weather conditions. There are two categories of factors that influence the inherent organic matter content: natural factors (climate, soil parent material, land cover, vegetation, and topography) and human-induced factors (land use, management, and degradation).

To predict carbon levels in soils, models were applied that could evaluate carbon trends. One of the most popular models is the RothC model. The RothC model simulates the turnover of organic carbon based on soil type, temperature, moisture content, and vegetation cover. SOC is divided into four active pools in the RothC model: Decomposable Plant Material (DPM), Resistant Plant Material (RPM), Microbial Biomass (BIO), and Humified Organic Matter (HUM). There is also a small pool of Inert Organic Matter (IOM) not involved in turnover processes. The RothC model requires initial Carbon stocks in DPM, RPM, BIO, HUM, and IOM, measured in tonnes per hectare (t C/ha).

The model’s inputs include precipitation and potential evapotranspiration (mm); average temperature (°C); degree of soil cover (bare or vegetated); carbon inputs from crop residues (t C ha−1) with the related DPM/RPM ratio; and carbon inputs from manure (t C ha−1). In addition, the following soil parameters are required: clay concentration (%); soil tillage depth (cm); and IOM content (t C ha−1), calculated from the initial measured SOC value. The table below summarizes the input data requirements.

The table summarises the input data requirements.

The outputs provided by RothC are the values of SOC and the four active pools that compose it, in addition to the carbon emitted as CO2. The output time step can be monthly or annual. The distribution of carbon pools and their variation over time can be evaluated within simulations varying from years to centuries. The RothC model is available on the Rothamsted website.
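To make the pools and inputs more concrete, here is a heavily simplified sketch of three building blocks. The constants are not given in this article: the annual rate constants and the temperature modifier are taken from the published RothC-26.3 model description, and the IOM estimate is the Falloon et al. (1998) equation commonly used with RothC. The function names are hypothetical, and this is not a full RothC implementation (it omits the moisture and soil cover modifiers and the partitioning of decomposed carbon into CO2, BIO, and HUM).

```python
import math

# Annual decomposition rate constants (1/year) per active pool,
# as published in the RothC-26.3 model description.
RATE_CONSTANTS = {"DPM": 10.0, "RPM": 0.3, "BIO": 0.66, "HUM": 0.02}


def inert_organic_matter(soc_t_ha):
    """Estimate IOM (t C/ha) from total SOC (t C/ha), Falloon et al. (1998)."""
    return 0.049 * soc_t_ha ** 1.139


def temperature_modifier(temp_c):
    """Rate-modifying factor for mean monthly air temperature (deg C).

    Only meaningful for temperatures above -18.27 deg C.
    """
    return 47.91 / (1.0 + math.exp(106.06 / (temp_c + 18.27)))


def remaining_after_month(pool_t_ha, pool, a=1.0, b=1.0, c=1.0):
    """Carbon left in a pool after one month of first-order decay.

    a, b, c are the temperature, moisture, and soil cover modifiers;
    they default to 1.0 here, which is a simplification.
    """
    k = RATE_CONSTANTS[pool]
    return pool_t_ha * math.exp(-a * b * c * k / 12.0)


iom = inert_organic_matter(50.0)                  # IOM for 50 t C/ha of SOC
a_mild = temperature_modifier(9.25)               # close to 1 at about 9 deg C
hum_left = remaining_after_month(10.0, "HUM")     # slow pool barely changes
dpm_left = remaining_after_month(10.0, "DPM")     # fast pool drops quickly
```

The contrast between `hum_left` and `dpm_left` illustrates why the pools are modeled separately: fresh plant material turns over within months, while humified matter persists for decades.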

In a nutshell, the RothC model projects SOC stock for 20 years, starting from a baseline, and calculates the average annual variation under three sustainable management scenarios: SSM1 (Low Carbon Inputs Sustainable Management Scenario), SSM2 (Medium Carbon Inputs Sustainable Management Scenario), and SSM3 (High Carbon Inputs Sustainable Management Scenario). SOC sequestration is the difference in SOC stock over the 20 years. The annual sequestration rate is calculated by dividing that accumulated difference by the number of years. This is finally compared with the Business As Usual (BAU) scenario, where no changes to the current situation are assumed.
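The scenario arithmetic is simple enough to show directly. In this sketch the baseline and the projected 20-year stocks are made-up numbers chosen only to illustrate the calculation, and the function name is hypothetical.

```python
def annual_sequestration_rate(soc_start, soc_end, years=20):
    """Average yearly change in SOC stock (t C/ha/yr) over the projection."""
    return (soc_end - soc_start) / years


# Hypothetical values in t C/ha: a baseline and the stock after 20 years
# under business as usual and the three management scenarios.
baseline_soc = 45.0
projected_soc = {"BAU": 44.0, "SSM1": 46.0, "SSM2": 47.5, "SSM3": 49.0}

rates = {
    scenario: annual_sequestration_rate(baseline_soc, soc)
    for scenario, soc in projected_soc.items()
}

# Gain of each management scenario relative to business as usual.
relative_to_bau = {
    scenario: rate - rates["BAU"]
    for scenario, rate in rates.items()
    if scenario != "BAU"
}
```

Comparing against the business-as-usual rate rather than against zero matters when the unchanged system is itself losing carbon: a scenario that merely holds SOC steady still counts as a gain.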


The application has been deployed using the Streamlit Framework. To bring the visualizations to life, various libraries such as osmnx, folium, geopandas, and plotly were used. The application takes into consideration various soil properties such as Organic Carbon, Nitrogen, Potassium, Phosphorus, Calcium, and pH Value for data analysis. In addition, the application provides options for selecting regenerative farming practices like Grazing, Crop Residue, Water Management, and more.

Results and insights: Regenerative farming practices and SOC stocks

Using Sigpac and Farm info data, we were able to identify the agricultural parcels that correspond to a particular location. Our goal was to determine the number of agricultural parcels that contribute to a specific plot. To achieve this, we used the overlay function of the geopandas library. By overlaying the mapped contours of the Sigpac and Farm info data, we identified the matching parcels between the two sets of data. In the image below, the green plot represents a cadastral parcel, which is composed of multiple agricultural parcels depicted in red. The image shows a section of the Alt Penedès comarca in Catalunya.

find the number of agricultural parcels

Source: Omdena
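The parcel matching can be sketched with `geopandas.overlay`. The geometries, identifiers, and column names below are invented for illustration, and the coordinate reference system (EPSG:25831, ETRS89 / UTM zone 31N) is an assumption rather than something stated in the article.

```python
import geopandas as gpd
from shapely.geometry import box

# One hypothetical cadastral parcel (a 4 x 4 square).
cadastral = gpd.GeoDataFrame(
    {"cadastral_id": ["C1"]},
    geometry=[box(0, 0, 4, 4)],
    crs="EPSG:25831",
)

# Three hypothetical agricultural parcels: two inside C1, one far away.
agricultural = gpd.GeoDataFrame(
    {"parcel_id": ["A1", "A2", "A3"]},
    geometry=[box(0, 0, 2, 4), box(2, 0, 4, 4), box(10, 10, 12, 12)],
    crs="EPSG:25831",
)

# Intersection keeps only the areas shared by both layers, with
# attributes from each side attached to every resulting piece.
matches = gpd.overlay(cadastral, agricultural, how="intersection")

# Count the distinct agricultural parcels inside each cadastral parcel.
parcels_per_cadastral = matches.groupby("cadastral_id")["parcel_id"].nunique()
```

With real data, slivers created by slightly misaligned boundaries may need to be filtered out by a minimum area before counting.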

Next, we processed the Wise3 data and found that only two observations belonged to the Catalunya region, so we would need to expand the scope of the region to get more data.

Map of Spain

Source: Omdena

The map above shows the region of Catalunya in Spain, with two red points representing soil observation locations. Unfortunately, the Wise3 dataset did not provide Soil Organic Carbon (SOC) values for these observations. As SOC was a key variable in our analysis of the impact of regenerative farming practices on carbon sequestration, we needed to find additional resources. Fortunately, we were able to use the LUCAS dataset to expand our analysis to cover all of Spain.

During our exploratory data analysis, we discovered a high degree of skewness in the measurements of soil chemical properties. This was largely due to the concentration of soil carbon in grassland, treeland, and shrublands, which caused some values to be outliers.

Source: Omdena

Source: Omdena

Only nitrogen showed a positive linear relationship with the change in soil carbon; the other properties did not provide definitive insight into the relationship. Correlation also does not always mean causation. For example, higher soil pH is considered to be associated with greater capture of CO2 from the air, but the plots do not show a strong linear relationship.

The following image shows the distribution of organic carbon values, which is highly right-skewed.

Source: Omdena

Source: Omdena
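Right skew of this kind can be quantified with pandas. The sketch below uses synthetic log-normal values as a stand-in for the organic carbon measurements, and the log transform is shown only as one common way to reduce skew, not as a step the team reports.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for SOC measurements (g/kg).
rng = np.random.default_rng(1)
soc = pd.Series(rng.lognormal(mean=2.5, sigma=0.8, size=1000), name="soc_g_kg")

# Sample skewness: values well above 0 indicate a long right tail.
raw_skew = soc.skew()

# log1p compresses the long tail, pulling skewness toward 0.
log_skew = np.log1p(soc).skew()
```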

We found that regenerative farming practices such as leaving crop residue on the fields, reduced or no tilling, managed grazing, and proper water management can significantly increase soil carbon levels by enriching the soil and its minerals.

When comparing the values of SOC between the years 2009 and 2015, we found an average carbon sequestration of 1.592 g/kg, with grasslands and shrublands accounting for the largest share. The dashboard’s visual analytics also show the relationship between carbon and other soil chemical properties. The following is a glimpse of our dashboard visualization:

Dashboard visualization

Source: Omdena

Dashboard visualization

Source: Omdena

The image above shows the dashboard page for land use and land cover data according to the regenerative farming practices applied. The pie chart displays how much grazing is practiced in different communities of Spain, with Castilla-Leon at the top for grazing on farms and other types of land. The bar graph on the left shows land cover for different communities and provinces. For Castilla-Leon, shrubland with sparse tree cover appears to have the highest amount of grazing and, correspondingly, high soil carbon values.

Later in the dashboard, we introduced a treemap that shows which soil types are used to cultivate particular crops and how soil types differ between communities. In this example, the Castilla-Leon community (shown in blue) has Calcisols as its major soil type, and barley, common wheat, and sunflower were the top three crops cultivated there in 2009.

Soil Classification

Source: Omdena


To conclude, we deployed an online data visualization application that establishes the initial state of the farm. During this process, we observed that certain crops have an impact on SOC stock, which is essential for carbon prediction.

We successfully developed models for predicting SOC at t=0. However, data constraints halted the development and validation of ML solutions for t=1. Future implementation would require RothC model experiments and validation on more data.

Based on the data visualization results, we can set up a bird’s eye view recommendation system for farms to adopt regenerative farming methods.

Future scope 

The future scope of the project involves:

  • Getting more data, specifically at least one more time point for t=1 models (LUCAS 2012 topsoil), and more data with methods used on a given farm plot
  • Developing and comparing both ML and RothC models for t=1
  • Working on a possible farming method recommender system
  • Adding climate data for 2009 to compare with 2015
  • Further expanding the use of satellite data for SOC prediction


  1. FAO. 2020. GSOCseq Global Soil Organic Carbon Sequestration Potential Map Technical Manual. G. Peralta,L. Di Paolo, C. Omuto, K. Viatkin, I. Luotto, Y. Yigini, 1st Edition, Rome
  2. Coleman, K., & Jenkinson, D. S. (1996). RothC-26.3-A Model for the turnover of carbon in soil. In Evaluation of soil organic matter models (pp. 237-246). Springer, Berlin, Heidelberg.
  3. Fantin, V., Buscaroli, A., Buttol, P., Novelli, E., Soldati, C., Zannoni, D., … & Righi, S. (2022). The RothC Model to Complement Life Cycle Analyses: A Case Study of an Italian Olive Grove. Sustainability, 14(1), 569.
  4. Rothamsted Carbon Model (RothC). Available online: (accessed on 22 July  2022).
  5. (accessed 22 July 2022)
Product Owner: Mario Rodriguez.
Authors: Faris Baker, Hardik Seju, Esther Kamau, Shubham Trivedi, Prathima Kadari.
