In computer science, building powerful AI models isn’t enough – we need to ensure they work reliably and fairly in the real world. Data & Evaluation is the field focused on making that happen.

It covers two main parts:

  • Datasets: This involves creating high-quality collections of data used to train and test AI models. It’s not just about quantity; it’s about ensuring datasets are representative, unbiased, and relevant to the task at hand.
  • Benchmarking & Evaluation: This is how we measure a model’s performance. Traditionally, this meant focusing on accuracy. However, modern evaluation goes much deeper – assessing robustness (how well it handles unexpected inputs), fairness (avoiding discrimination), and explainability (understanding why the model makes its decisions); a minimal fairness-metric sketch follows this list.
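
To make the fairness dimension concrete, the sketch below computes one simple fairness metric, the demographic parity gap: the difference in positive-prediction rates between two groups. It is a minimal illustration only; the predictions and group labels are made-up placeholder data, and a real audit would consider several complementary metrics.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rate between two groups."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_a = y_pred[group == 0].mean()  # positive-prediction rate, group 0
    rate_b = y_pred[group == 1].mean()  # positive-prediction rate, group 1
    return abs(rate_a - rate_b)

# Toy example: binary predictions for 8 individuals, 4 per group.
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
print(demographic_parity_gap(preds, groups))  # 0.75 vs 0.25 -> gap of 0.5
```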


Essentially, Data & Evaluation ensures AI systems are not only smart but also trustworthy and beneficial. It’s a rapidly evolving field driven by new model types (like LLMs) and growing ethical concerns around AI deployment.

Some of the most relevant research topics are:

I. Dataset Creation – Addressing Limitations & New Paradigms

  • Synthetic Data Generation: Creating datasets programmatically (e.g., using GANs) to overcome data scarcity/privacy issues.
  • Active Learning & Data Programming: Strategically selecting data for labeling or using weak supervision to reduce annotation effort (a minimal uncertainty-sampling sketch follows this list).
  • Bias Detection & Mitigation in Datasets: Identifying and reducing biases that lead to unfair model predictions.
  • Long-Tailed Data & Imbalanced Learning Datasets: Creating datasets and techniques for scenarios where some classes are far rarer than others.
  • Data for Multimodal Learning: Building datasets with multiple data types (text, images, audio) aligned for combined analysis.
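
As a concrete illustration of the active-learning item above, here is a minimal sketch of pool-based uncertainty sampling. The dataset, model (a scikit-learn logistic regression on synthetic data), query batch size, and number of rounds are illustrative assumptions rather than a prescribed pipeline; the ground-truth labels simply stand in for a human annotator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "unlabelled" pool plus a small labelled seed set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labelled = list(rng.choice(len(X), size=20, replace=False))
pool = [i for i in range(len(X)) if i not in set(labelled)]

for round_id in range(5):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[labelled], y[labelled])

    # Uncertainty: how close the predicted probability is to 0.5.
    proba = model.predict_proba(X[pool])[:, 1]
    uncertainty = -np.abs(proba - 0.5)

    # "Annotate" the 20 most uncertain pool points and move them into
    # the labelled set (the oracle here is just the ground-truth y).
    query_positions = np.argsort(uncertainty)[-20:]
    labelled.extend(pool[p] for p in query_positions)
    pool = [idx for p, idx in enumerate(pool) if p not in set(query_positions)]

    print(f"round {round_id}: labelled={len(labelled)} "
          f"accuracy={model.score(X, y):.3f}")
```

Scoring on the full dataset keeps the sketch short; in practice the evaluation would use a held-out test set.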


II. Benchmarking & Evaluation Metrics – Beyond Accuracy

  • Adversarial Robustness Evaluation: Testing model susceptibility to small input changes that cause incorrect predictions (see the sketch after this list).
  • Out-of-Distribution (OOD) Generalization Evaluation: Assessing performance on data different from the training set.
  • Evaluation of Large Language Models (LLMs): Benchmarking LLM capabilities (reasoning, knowledge, safety).
  • Explainable AI (XAI) Evaluation Metrics: Measuring the quality & understandability of model explanations.
  • Human-in-the-Loop Evaluation: Incorporating human feedback for subjective task assessment.
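
As a concrete example of the adversarial-robustness item above, the sketch below measures accuracy under the fast gradient sign method (FGSM), one common attack choice. It assumes a PyTorch image classifier; model and loader are placeholders for any nn.Module that returns logits and any DataLoader yielding (inputs, labels) batches with inputs scaled to [0, 1].

```python
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, loader, epsilon=0.03, device="cpu"):
    """Fraction of examples still classified correctly after one FGSM step."""
    model.eval()
    correct, total = 0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        inputs.requires_grad_(True)

        # Gradient of the loss with respect to the inputs.
        loss = F.cross_entropy(model(inputs), labels)
        grad = torch.autograd.grad(loss, inputs)[0]

        # FGSM: one signed-gradient step of size epsilon, clipped to [0, 1].
        adv = (inputs + epsilon * grad.sign()).clamp(0.0, 1.0).detach()

        with torch.no_grad():
            preds = model(adv).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total
```

Comparing this figure with clean accuracy across several values of epsilon gives a simple robustness curve.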


Associated Researchers:
– Romeo Lanzino
– Alessio Fagioli