AI Agents Inference Benchmarking Challenge
Challenge Background
With the rapid growth of AI agents across different frameworks and platforms, inference time has become a critical factor in real-time applications. In many cases AI processing accounts for less than 10% of overall transaction time, with most of the time spent on data preparation and reference data retrieval. Even so, optimizing inference speed while maintaining accuracy has become increasingly important, especially since up to 90% of an AI model's life is spent in inference mode.
The Problem
While numerous AI agent frameworks exist, there's no standardized way to compare their inference performance across different scenarios. The research community needs a comprehensive benchmark system that can:
- Evaluate inference speed across different AI agent architectures and frameworks
- Consider both end-to-end latency and throughput metrics
- Account for various optimization techniques and their impact
- Assess real-world performance under different computational constraints
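To make the latency and throughput requirement concrete, here is a minimal Python sketch of a timing helper. It is an illustration only, not the project's actual harness: `benchmark_agent`, `run_task`, and `warmup` are hypothetical names, and `run_task` stands in for whatever callable wraps a single agent invocation in a given framework.

```python
import math
import statistics
import time
from typing import Callable, Dict, Sequence


def benchmark_agent(run_task: Callable[[str], object],
                    prompts: Sequence[str],
                    warmup: int = 1) -> Dict[str, float]:
    """Time each call to run_task; summarize latency and throughput.

    run_task is a hypothetical wrapper around one agent invocation.
    """
    if not prompts:
        raise ValueError("prompts must not be empty")

    # Warm-up calls run but are not timed (caches, lazy imports, connections).
    for prompt in prompts[:warmup]:
        run_task(prompt)

    latencies = []
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        run_task(prompt)
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start

    ordered = sorted(latencies)
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "n": len(ordered),
        "mean_s": statistics.fmean(latencies),
        "p50_s": statistics.median(latencies),
        "p95_s": ordered[p95_index],
        # Sequential throughput: completed tasks per second of wall time.
        "throughput_per_s": len(ordered) / wall if wall > 0 else float("inf"),
    }
```

This sketch runs tasks one after another, so the throughput figure reflects sequential execution only; measuring concurrent throughput would need a different driver.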
Goal of the Project
- Develop Comparison Metrics: Establish metrics for effectively comparing inference times across different AI agent implementations.
- Define Scenarios: Include two distinct scenarios—Simple AI Agent Tasks and Complex AI Agent Tasks—to evaluate performance comprehensively.
- Framework Comparison: Conduct comparative analyses between frameworks such as CrewAI, LangChain, LangGraph, Swarm, and custom AI agents within the defined scenarios.
- Parameter Tuning: Tune parameters within frameworks such as CrewAI to improve performance metrics.
- Public Leaderboard: Create and maintain a public leaderboard to facilitate transparent comparisons and track performance across different frameworks.
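As one possible shape for the leaderboard data, the sketch below groups benchmark results by scenario and ranks frameworks by mean latency. The `Result` record and `build_leaderboard` function are hypothetical names introduced for illustration, and ranking by mean latency alone is an assumption; the project may well combine several metrics.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Result(NamedTuple):
    framework: str         # e.g. "CrewAI" or "LangGraph"
    scenario: str          # e.g. "simple" or "complex"
    mean_latency_s: float  # mean end-to-end latency in seconds


def build_leaderboard(results: Iterable[Result]) -> Dict[str, List[Result]]:
    """Group results by scenario and sort each group, fastest first."""
    by_scenario = defaultdict(list)
    for result in results:
        by_scenario[result.scenario].append(result)
    # Ties on latency are broken alphabetically so the order is deterministic.
    return {
        scenario: sorted(rows, key=lambda r: (r.mean_latency_s, r.framework))
        for scenario, rows in by_scenario.items()
    }


if __name__ == "__main__":
    # Placeholder values only, not real measurements.
    demo = [
        Result("framework_a", "simple", 1.0),
        Result("framework_b", "simple", 0.5),
    ]
    for scenario, rows in build_leaderboard(demo).items():
        for rank, row in enumerate(rows, start=1):
            print(scenario, rank, row.framework, row.mean_latency_s)
```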
Project Timeline
- Design Standardized Testing Methodology: Define protocols for evaluating different AI agent frameworks under consistent conditions.
- Establish Baseline Metrics: Define baseline metrics for two distinct scenarios, Simple AI Agent Tasks and Complex AI Agent Tasks.
- Build Pipeline for Frameworks: Create scripts, tasks, agents, and tools for the different frameworks: CrewAI, AutoGen, LangChain, LangGraph, Semantic Kernel, txtai by NeuML, and Swarm.
- Run & Test Initial Tasks: Execute tests on the selected AI agent frameworks for both scenarios.
- Parameter Tuning: Tune the parameters of all AI agent frameworks for optimum inference performance.
- Create Visualization Tools (Optional): Develop dashboards and visualization interfaces to display benchmarking results clearly and intuitively.
- Validate Results: Ensure the accuracy and reliability of the benchmarking tasks through repeated tests and cross-verification.
- Deploy Public Leaderboard or Publish Research Article
- Create Comprehensive Documentation: Develop detailed guides and documentation to help users understand and utilize the benchmarking tasks effectively.
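The validation step above relies on repeated tests. One simple way to check whether repeated runs agree is the coefficient of variation (sample standard deviation divided by the mean) of the per-run mean latencies. The sketch below assumes this approach; `is_stable` and the 10% default threshold are illustrative choices, not values defined by the project.

```python
import statistics
from typing import Sequence


def is_stable(run_means: Sequence[float], max_cv: float = 0.10) -> bool:
    """Return True if repeated runs agree within max_cv.

    run_means holds the mean latency (seconds) of each repeated run.
    max_cv is a hypothetical tolerance on the coefficient of variation.
    """
    if len(run_means) < 2:
        raise ValueError("need at least two runs to assess variability")
    mean = statistics.fmean(run_means)
    if mean <= 0:
        raise ValueError("mean latency must be positive")
    cv = statistics.stdev(run_means) / mean
    return cv <= max_cv
```

A benchmark whose runs fail this check could be rerun with more repetitions or investigated for noise sources before its numbers are published.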
What you'll learn
- Understanding AI inference optimization techniques
- Mastering performance measurement and benchmarking
- Analyzing trade-offs between different AI agent architectures
- Implementing various optimization strategies
- Research Article Publication
First Omdena Local Chapter Project?
Beginner-friendly, but also welcomes experts
Education-focused
Duration: 4 to 8 weeks
Open-source
Your Benefits
Address a significant real-world problem with your skills
Build your project portfolio
Access paid projects (as an Omdena Top Talent)
Get hired at top organizations
Requirements
Good English
Suitable for AI/Data Science beginners as well as more senior collaborators
Learning mindset

