The Ethics of AI Data Collection: Ensuring Privacy and Fair Representation

November 6, 2023


In the heart of a bustling city, Maria unknowingly becomes part of a vast AI network when a camera captures her casual stroll past a store. By evening, her phone buzzes with an ad for that very shop. Miles away, in a Kenyan village, Amina’s quest for better crop yields unknowingly feeds a database that ultimately disadvantages her community. From sprawling urban landscapes to remote corners of developing countries, the omnipresent web of artificial intelligence is woven with data. These stories highlight the profound challenges at the intersection of technology and ethics. Delve with us into this intricate dance of consent, privacy, bias, and representation, as we explore the urgent need for a universally ethical approach to AI data collection.

Ethical Challenges in Data Collection

Imagine being a patient undergoing a medical procedure. Would a simple “yes” suffice, or would you want a detailed understanding of the procedure, its implications, and potential outcomes? 

Just as in medicine, ethical data collection demands more than a cursory agreement. It calls for transparent communication, so that every individual truly understands how their digital footprint will be used, and for a relationship built on trust. In this spirit, AI practitioners must not only secure informed consent but also provide a continuous channel for dialogue where concerns can be addressed and questions answered.

This entails a responsibility to make complex AI terminology accessible, demystifying the technology so that non-experts can make educated decisions about their data contributions. Moreover, ethical data collection must involve a commitment to protecting the privacy and dignity of community members, never exploiting their information for unintended purposes, and always allowing them the autonomy to opt out without penalty or prejudice.


Bias and Representation 

Picture a classroom where only the voices of a select few are heard, while others are perpetually silenced. In the AI world, this classroom is a dataset. If only certain voices (or data) dominate, biases naturally emerge, sidelining the marginalized. By diving into these biases, we unravel the deeper questions: Who’s not present in this data class? What narratives are we missing? Championing an inclusive data spectrum, we push for a thorough introspection of both overt and covert biases that shape our AI’s worldview. Much like a vibrant classroom thrives on diverse perspectives, an inclusive AI dataset mirrors this richness. Imagine the symphony of varied voices contributing to the algorithmic conversation, ensuring that the dissonance of biases is replaced by a harmonious blend of insights.

Delving into the intricacies of data representation becomes a quest for equity, where the silent narratives find their rightful place in the digital discourse. The call for an inclusive data spectrum is a plea to not only acknowledge the voices already present but to actively seek out the unheard, the underrepresented. By championing diversity in our datasets, we lay the foundation for an AI that not only reflects but respects the multiplicity of human experiences.
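One way to start asking "who's not present?" is to compare each group's share of a dataset with its share of a reference population. The sketch below is only an illustration under stated assumptions: the function name, the group labels, and the 50/50 reference split are all hypothetical, and a real audit would need carefully sourced reference figures and more than one attribute.

```python
from collections import Counter

def representation_gaps(group_labels, reference_shares):
    """Return dataset share minus reference share for each group.

    Negative values mean the group is underrepresented in the dataset.
    """
    counts = Counter(group_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("group_labels is empty")
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        gaps[group] = round(observed - expected, 3)
    return gaps

# Hypothetical example: 8 urban and 2 rural records, against a 50/50 reference.
labels = ["urban"] * 8 + ["rural"] * 2
print(representation_gaps(labels, {"urban": 0.5, "rural": 0.5}))
```

Here the rural group comes out with a negative gap, flagging it as a voice the dataset is missing relative to the assumed reference.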


Data Provenance 

Think of a cherished family heirloom passed down through generations. Its value isn’t just in its age or beauty but in its story—where it came from, who owned it, and its journey through time. Similarly, in the digital realm, the backstory of data holds paramount importance. Tracing the lineage of data is akin to ensuring the authenticity of that heirloom, safeguarding against potential falsities or distortions. Emphasizing its lineage, we advocate for a robust system that celebrates transparency, responsibility, and the genuine narrative of every data point.

Imagine a digital tapestry woven with the threads of information, each thread telling a unique story. By delving into the lineage of data, we not only preserve its authenticity but also unveil the interconnected narratives that shape our understanding. This emphasis on the life journey of data is a clarion call for accountability and accuracy, ensuring that the digital realm mirrors the principles we hold dear in safeguarding tangible treasures passed down through generations. In championing a narrative-rich approach to data, we weave a tapestry that not only stands the test of time but also reflects the ethical values ingrained in its creation and evolution.
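In practice, a data point's "story" can be kept as a small provenance record: where it came from, on what consent basis, and a fingerprint of each version it has passed through. The following is a minimal sketch, assuming hypothetical field names and sample values; it is not a description of any particular provenance standard or of Omdena's tooling.

```python
import hashlib
from dataclasses import dataclass, field

def fingerprint(data: bytes) -> str:
    """SHA-256 digest identifying the exact bytes of a dataset version."""
    return hashlib.sha256(data).hexdigest()

@dataclass
class ProvenanceRecord:
    source: str          # where the data came from (hypothetical field)
    collected_on: str    # ISO date of collection
    consent_basis: str   # how consent was obtained
    history: list = field(default_factory=list)  # ordered (step, digest) pairs

    def log(self, step: str, data: bytes) -> None:
        """Append a processing step with the digest of the resulting data."""
        self.history.append((step, fingerprint(data)))

    def matches_latest(self, data: bytes) -> bool:
        """True if data is byte-identical to the last logged version."""
        return bool(self.history) and self.history[-1][1] == fingerprint(data)

# Hypothetical usage: two versions of a tiny CSV export.
record = ProvenanceRecord("community survey", "2023-06-01", "written opt-in")
record.log("raw export", b"id,yield\n1,3.2\n")
record.log("dropped id column", b"yield\n3.2\n")
print(record.matches_latest(b"yield\n3.2\n"))
```

Because each step stores a digest, anyone holding the data can check that it matches the last recorded version, and any silent alteration shows up as a mismatch.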


Best Practices in Ethical Data Collection

Anonymization and Differential Privacy 

Extracting meaningful insights from data while safeguarding individual privacy is a delicate balance. Anonymization techniques and differential privacy help strike it: they allow data to be used responsibly without exposing personal information. Omdena is committed to safeguarding the privacy of our stakeholders and those impacted by our projects.
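To make differential privacy a little more concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. The article does not say which mechanism, if any, is used in Omdena projects, so treat this as a textbook illustration only: the function names, the sample records, and the choice of epsilon are all assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Count matching records and add Laplace noise scaled to 1/epsilon.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so noise with scale 1/epsilon gives epsilon-DP
    for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical records: ages of six survey participants.
ages = [34, 51, 67, 72, 45, 80]
noisy = private_count(ages, lambda a: a >= 65, epsilon=1.0)
print(round(noisy, 2))
```

A smaller epsilon means more noise and stronger privacy; the published number stays useful in aggregate while no single participant's presence can be inferred from it with confidence.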

Example: Anonymization of medical patient data 

An example is our collaboration with Heart Kinetics—a delicate project involving close work with medical data related to heart failure. Before using heart patient data for AI research, our team transformed the data to protect patient privacy. We removed names and personal details, mixed up some of the non-critical information, and made sure the data couldn’t be traced back to any individual. This way, we kept the valuable medical information needed for our study while ensuring everyone’s personal data stayed secure. Our careful process meant we could focus on improving heart health technology without compromising patient confidentiality. The result was a clean, safe dataset ready for AI applications.
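The steps described above (dropping names and personal details, and coarsening non-critical fields) can be sketched roughly as follows. This is not the actual Heart Kinetics pipeline, which the article does not detail; the field names, the keyed-hash pseudonym, and the ten-year age bands are assumptions chosen for illustration.

```python
import hashlib
import hmac

# Hypothetical set of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "address"}

def pseudonymize(record, secret_key):
    """Return a copy of record with identifiers removed or coarsened."""
    # Drop fields that directly identify a person.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Replace the patient id with a keyed hash so rows stay linkable
    # without exposing the original id.
    raw_id = str(cleaned.pop("patient_id"))
    digest = hmac.new(secret_key, raw_id.encode(), hashlib.sha256).hexdigest()
    cleaned["pseudo_id"] = digest[:16]
    # Generalize exact age into a ten-year band.
    age = cleaned.pop("age")
    low = (age // 10) * 10
    cleaned["age_band"] = f"{low}-{low + 9}"
    return cleaned

# Hypothetical patient record.
patient = {"patient_id": 17, "name": "A. Example", "age": 63, "ejection_fraction": 35}
print(pseudonymize(patient, secret_key=b"keep-this-key-out-of-the-dataset"))
```

Steps like these reduce risk but do not by themselves guarantee that data cannot be traced back to an individual; combinations of remaining fields can still be identifying, so a re-identification review is still needed.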


Active Inclusivity in Data Sources 

Ethical data collection demands more than passive inclusion; it requires a proactive approach to inclusivity. Actively seeking out diverse data sources is not only a moral imperative but also a strategic move to create AI models that accurately reflect the richness and complexity of the real world. Addressing systemic biases calls for a paradigm shift towards actively seeking diversity in data sources. For instance, Omdena is dedicated to incorporating diverse datasets into our projects through community involvement and human oversight. This collaborative ethos not only leads to more robust AI models but also fosters a sense of shared responsibility in the data collection process.

Example: Detecting Mis/Disinformation

Collaborative, community-driven data collection offers a comprehensive approach to detecting mis/disinformation and enabling inclusive media practices. Drawing data from diverse global communities makes the datasets rich, representative, and attentive to local nuances, which in turn makes AI models more adept at identifying a wide range of misinformation tactics. Engaging communities directly in the data collection process also promotes ethical gathering, reduces biases, and instills a sense of shared responsibility, ensuring the data is both trustworthy and broadly applicable.

Human Oversight

Drawing on crowd wisdom enriches our methodology by tapping into the collective insights of diverse individuals. This broadens our perspective and strengthens our capacity to pinpoint and rectify biases within our AI systems. The blend of human oversight, stakeholder engagement, and crowd wisdom underscores our dedication to upholding fairness and accuracy. In any given Omdena project, where more than 50 contributors collaborate, this collective effort ensures a stringent oversight process that helps identify potential biases.

Transparency is the cornerstone of ethical data practices. Clear communication about the purpose and scope of data collection, coupled with giving individuals the right to control and manage their data, forms the essence of this best practice. Transparency in its many forms is what builds and maintains trust in the data collection process.

Example: Community-Based Social Sentiment Analysis Towards Carbon Credit Projects 

Omdena is dedicated to sustainable and responsible AI solutions, focusing on environmental conservation. Our recent collaborative effort produced a data visualization dashboard for assessing regenerative farming practices and generating a Social Sentiment Score for Carbon Credit Projects in Marginalized Communities. This tool is both visually appealing and practical, offering insights into the current state and guiding predictive pathways for environmental trends. It exemplifies how ethical data practices can address complex sustainability issues, particularly in the face of climate change.

The project emphasizes empowering marginalized communities in the transition to carbon neutrality, using innovative technology to enable them to offset their carbon footprint. By collecting sentiments, the project aims to include their perspectives and address their unique vulnerabilities in carbon credit initiatives. The geographical focus is on Latin American countries, especially Spanish-speaking regions like Peru, known for having a substantial number of marginalized communities engaged in carbon neutrality projects.


Relevance of Data Life Cycle 

Ethical considerations are not static; they must span the entire data life cycle. From the initial gathering of data through its storage to its eventual disposal, each stage calls for responsible practices. Such an approach ensures that the ethical principles governing data collection are not just theoretical but are built into every step of the data’s journey.
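The disposal stage is the one most easily forgotten, so a simple retention check is a useful illustration. The sketch below assumes a hypothetical record layout and a one-year retention period; actual retention periods depend on the consent given and on applicable law.

```python
from datetime import date, timedelta

def due_for_deletion(records, today, retention_days):
    """Return the ids of records held longer than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    return [r["id"] for r in records if r["collected_on"] < cutoff]

# Hypothetical records with their collection dates.
records = [
    {"id": "a", "collected_on": date(2022, 1, 10)},
    {"id": "b", "collected_on": date(2023, 9, 1)},
]
print(due_for_deletion(records, today=date(2023, 11, 6), retention_days=365))
```

Running a check like this on a schedule turns "eventual disposal" from an intention into a routine step of the life cycle.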


Example: Creating a national ID system in the Philippines

Omdena’s local chapter project in the Philippines, aimed at establishing a national ID system, exemplifies the pervasive importance of AI ethics throughout its entire journey. Starting from the grassroots data gathering, the team focused on community engagement and transparent communication. 

They ensured that individuals comprehended the purpose, implications, and safeguards related to the collection of their personal information throughout the data creation process, upholding principles of informed consent and involving the community.

Beyond data collection, the project implemented robust security measures for storage and management, including encryption protocols, access controls, and regular audits to prevent unauthorized access and misuse. The project’s responsibility extended to environmentally conscious data deletion, ensuring ethical considerations guided the data’s life cycle even after project completion.

The team maintained transparent communication with stakeholders, updating them on stringent data security measures. This open dialogue fostered collaboration and inclusivity, strengthening the commitment to ethical practices. The project went beyond mere compliance, making ethical considerations a shared responsibility between the implementing body and the citizens.

This approach not only enhanced the credibility of the national ID system but also empowered individuals to actively protect their own data privacy.

AI Built on the Three C’s is Essential 

Omdena believes that our best bet for tackling ‘bad’ AI is we, the people. Collaboration among varied talents bridges gaps in understanding between different mindsets, shares knowledge, and unites people and values. It therefore fosters compassion and harnesses crowd wisdom, diversity, and inclusion to serve the long-term interests of the communities involved.

The other key element is consciousness. With so much division in this world, we need to understand that deep down we are all one, and so our consciousness is collective. By forming a sense of community, we collaborate with compassion and consciousness.

Thus, AI built by the three C’s (Collaboration, Compassion, and Consciousness) will help us to remove endemic sociological and historical bias and other inequalities that exist in society. Omdena’s framework of bottom-up collaboration helps to achieve AI models built with the three C’s.

Read more about Omdena’s AI ethics code here.


The responsibility for ethical data collection is not merely an organizational concern; it is a collective imperative for the entire AI community. Meeting it requires proactive steps to refine data collection methodologies so that they align with the ethical principles of privacy, fairness, and accountability.

In an era where AI’s influence permeates society, building trust through ethical practices is paramount. We advocate for a future where ethical considerations are not only a regulatory requirement but a fundamental ethos guiding the development and deployment of AI technologies. By taking a proactive stance towards refining data collection methodologies, the AI community can collectively contribute to a future where AI benefits all of humanity without compromising privacy, fairness, or ethical integrity.
