In computer science, building powerful AI models isn’t enough – we need to ensure they work reliably and fairly in the real world. Data & Evaluation is the field focused on making that happen.
It covers two main parts:
- Datasets: This involves creating high-quality collections of data used to train and test AI models. It’s not just about quantity; it’s about ensuring datasets are representative, unbiased, and relevant to the task at hand.
- Benchmarking & Evaluation: This is how we measure a model’s performance. Traditionally, that meant measuring accuracy; modern evaluation goes much deeper, assessing robustness (how well a model handles unexpected inputs), fairness (whether it avoids discriminating against particular groups), and explainability (whether we can understand why the model makes its decisions).
Essentially, Data & Evaluation ensures AI systems are not only capable but also trustworthy and beneficial. It is a rapidly evolving field, driven by new model families (such as large language models, LLMs) and growing ethical concerns around AI deployment.
Some of the most relevant research topics are:
I. Dataset Creation – Addressing Limitations & New Paradigms
- Synthetic Data Generation: Creating datasets programmatically (e.g., with generative adversarial networks, GANs) to overcome data scarcity and privacy constraints (a toy GAN sketch follows this list).
- Active Learning & Data Programming: Strategically selecting which examples to label, or using weak supervision, to reduce annotation effort (see the uncertainty-sampling sketch after this list).
- Bias Detection & Mitigation in Datasets: Identifying and reducing dataset biases that lead to unfair model predictions (a minimal per-group parity check is sketched after this list).
- Long-Tailed & Imbalanced Data: Creating datasets and techniques for scenarios where some classes are much rarer than others (a class-weighting sketch follows this list).
- Data for Multimodal Learning: Building datasets with multiple aligned data types (text, images, audio) for combined analysis (a minimal aligned-sample record is sketched below).
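To make the synthetic-data item concrete, here is a minimal, hypothetical GAN sketch that learns to mimic a toy 1-D Gaussian; it illustrates the idea of generating synthetic samples when real data is scarce or private, not a production recipe. PyTorch is assumed, and the architectures, data distribution, and hyperparameters are illustrative choices.

```python
import torch
import torch.nn as nn

# Toy GAN that learns to mimic samples from N(3, 0.5), standing in for
# "generate synthetic data when the real data is scarce or private".
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = 3 + 0.5 * torch.randn(64, 1)        # "real" data
    fake = G(torch.randn(64, 8))               # synthetic samples from noise
    # Discriminator step: push real -> 1 and fake -> 0
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()
    # Generator step: fool the discriminator (fake -> 1)
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(64, 1))
    g_loss.backward()
    opt_g.step()

# The mean of generated samples should drift toward 3 as training progresses.
print(G(torch.randn(1000, 8)).mean().item())
```

In practice the toy Gaussian would be replaced by images, tabular records, or text, and the generated data would be vetted for fidelity and privacy leakage before training downstream models.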
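For the active-learning item, a common starting point is uncertainty sampling: score the unlabeled pool with the current model and send the examples it is least sure about to annotators. A minimal NumPy sketch, where the probability matrix, budget, and toy values are assumptions for illustration:

```python
import numpy as np

def uncertainty_sampling(probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` unlabeled examples whose predicted
    class distribution has the highest entropy (model is least confident)."""
    # probs: (n_unlabeled, n_classes) softmax outputs of the current model
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-budget:]

# Toy pool of 5 unlabeled examples over 3 classes
probs = np.array([
    [0.98, 0.01, 0.01],   # confident
    [0.34, 0.33, 0.33],   # very uncertain
    [0.60, 0.20, 0.20],
    [0.50, 0.49, 0.01],
    [0.90, 0.05, 0.05],
])
print(uncertainty_sampling(probs, budget=2))  # -> [2 1]: rows sent for labeling
```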
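For dataset bias detection, one of the simplest diagnostics is comparing label rates across groups defined by a sensitive attribute (a demographic-parity-style check). The data below is a toy, hypothetical example; real audits also examine representation, label quality, and proxy features.

```python
import numpy as np

def positive_rate_by_group(labels: np.ndarray, groups: np.ndarray) -> dict:
    """Fraction of positive labels per group; a large gap between groups
    is a simple red flag for dataset bias."""
    return {str(g): float(labels[groups == g].mean()) for g in np.unique(groups)}

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])                  # binary labels
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])  # sensitive attribute
rates = positive_rate_by_group(labels, groups)
print(rates)                                                  # {'A': 0.75, 'B': 0.25}
print("parity gap:", max(rates.values()) - min(rates.values()))  # 0.5
```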
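For long-tailed data, a standard first measure is to reweight the training loss by inverse class frequency so that rare classes are not drowned out. The helper below follows the common "balanced" weighting heuristic (the same formula used by scikit-learn's class_weight="balanced"); the toy label vector is illustrative.

```python
import numpy as np

def inverse_frequency_weights(labels: np.ndarray) -> dict:
    """Per-class weights proportional to 1 / class frequency, scaled so the
    average weight over all samples equals 1; rare classes get larger weights."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy long-tailed labels: class 0 is common, class 2 is rare
labels = np.array([0] * 90 + [1] * 9 + [2] * 1)
print(inverse_frequency_weights(labels))  # ~{0: 0.37, 1: 3.70, 2: 33.33}
```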
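For multimodal datasets, the core data-engineering problem is keeping modalities aligned per example. A minimal, hypothetical record type (field names and paths are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultimodalSample:
    """One aligned training example; any modality may be missing."""
    sample_id: str
    text: Optional[str] = None
    image_path: Optional[str] = None
    audio_path: Optional[str] = None
    label: Optional[int] = None

sample = MultimodalSample("clip_0001", text="a dog barking",
                          image_path="frames/clip_0001.jpg",
                          audio_path="audio/clip_0001.wav", label=3)
```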
II. Benchmarking & Evaluation Metrics – Beyond Accuracy
- Adversarial Robustness Evaluation: Testing how susceptible a model is to small, deliberately crafted input changes that cause incorrect predictions (an FGSM-style evaluation sketch follows this list).
- Out-of-Distribution (OOD) Generalization Evaluation: Assessing performance on data drawn from a different distribution than the training set (see the ID-vs-OOD accuracy sketch below).
- Evaluation of Large Language Models (LLMs): Benchmarking LLM capabilities such as reasoning, knowledge, and safety (an exact-match scoring sketch follows this list).
- Explainable AI (XAI) Evaluation Metrics: Measuring the quality and understandability of model explanations (a deletion-based faithfulness sketch follows this list).
- Human-in-the-Loop Evaluation: Incorporating human feedback for the assessment of subjective tasks (a pairwise win-rate sketch follows this list).
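To illustrate adversarial robustness evaluation, the sketch below reports accuracy under the Fast Gradient Sign Method (FGSM), one of the simplest attacks: each input is nudged by epsilon in the direction that increases the loss. It assumes a PyTorch classifier with inputs scaled to [0, 1]; `model`, `loader`, and the epsilon value in the usage comment are hypothetical.

```python
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, loader, epsilon: float, device: str = "cpu") -> float:
    """Accuracy after perturbing each input by epsilon * sign(dLoss/dInput)."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad = torch.autograd.grad(loss, x)[0]
        # Perturb in the loss-increasing direction, keep inputs in [0, 1]
        x_adv = (x + epsilon * grad.sign()).clamp(0, 1).detach()
        pred = model(x_adv).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total

# robust_acc = fgsm_accuracy(model, test_loader, epsilon=8 / 255)
```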
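OOD generalization is often summarized as the gap between accuracy on an in-distribution test set and on a shifted one (corruptions, a different domain, a later time period). A minimal sketch reusing the PyTorch setup assumed above, with `id_loader` and `ood_loader` as hypothetical data loaders:

```python
import torch

@torch.no_grad()
def accuracy(model, loader, device: str = "cpu") -> float:
    """Plain classification accuracy over a data loader."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        pred = model(x.to(device)).argmax(dim=1)
        correct += (pred == y.to(device)).sum().item()
        total += y.numel()
    return correct / total

# id_acc, ood_acc = accuracy(model, id_loader), accuracy(model, ood_loader)
# print(f"ID {id_acc:.3f}  OOD {ood_acc:.3f}  gap {id_acc - ood_acc:.3f}")
```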
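For LLM benchmarking, a simple metric on closed-form question answering is normalized exact match. The sketch assumes `generate` is any callable that maps a prompt to an answer string (an API client, a local model, or a stub for testing) and uses a SQuAD-style normalization; the two-question benchmark is a toy stand-in.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(generate, benchmark) -> float:
    """Fraction of (question, gold_answer) pairs answered exactly right."""
    hits = sum(normalize(generate(q)) == normalize(gold) for q, gold in benchmark)
    return hits / len(benchmark)

# Toy check with a stub "model" that always answers "Rome"
benchmark = [("Capital of Italy?", "Rome"), ("Capital of France?", "Paris")]
print(exact_match(lambda q: "Rome", benchmark))  # 0.5
```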
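For XAI evaluation, faithfulness metrics test whether the features an explanation marks as important really drive the prediction. One common family is deletion curves: progressively remove the top-attributed features and watch the predicted probability fall. A minimal sketch, where `predict_prob` is a hypothetical callable returning the model's probability for the originally predicted class of a flat feature vector `x`:

```python
import numpy as np

def deletion_score(predict_prob, x: np.ndarray, attribution: np.ndarray,
                   steps: int = 10) -> float:
    """Zero out features from most to least attributed and average the
    retained probability; a lower score means a more faithful explanation."""
    order = np.argsort(attribution)[::-1]      # most important features first
    probs = [predict_prob(x)]
    x_cur = x.copy()
    for chunk in np.array_split(order, steps):
        x_cur[chunk] = 0.0                     # "delete" this batch of features
        probs.append(predict_prob(x_cur))
    return float(np.mean(probs))
```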
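Finally, human-in-the-loop evaluation often boils down to aggregating human judgments, for instance pairwise preferences between two systems on the same prompts. A minimal sketch of a win-rate computation over hypothetical ratings:

```python
from collections import Counter

def win_rate(preferences) -> float:
    """`preferences` lists human judgments 'A', 'B', or 'tie' from comparing
    system A against system B on the same prompts; ties are ignored."""
    counts = Counter(preferences)
    decided = counts["A"] + counts["B"]
    return counts["A"] / decided if decided else 0.5

print(win_rate(["A", "A", "B", "tie", "A"]))  # 0.75 -> raters prefer A
```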
Associated Researchers:
- Romeo Lanzino
- Alessio Fagioli