Machine learning (ML) is a buzzword in today’s technology world. From automatic friend tagging on social media apps and booking a cab online to online maps confirming that you are travelling on the ‘fastest route,’ ML is part of our daily life. The increase in ML applications involving private and sensitive data calls for algorithms that protect that data. The EU-funded GENERALIZATION project will investigate how much data is needed for learning, including for private machine learning. Answering this question is expected to advance the field in terms of efficiency, reliability and applicability. The project’s work combines ideas from various areas in computer science and mathematics.
Recent years have witnessed tremendous progress in the field of Machine Learning (ML). Learning algorithms are applied in an ever-increasing variety of contexts, ranging from engineering challenges such as self-driving cars all the way to societal contexts involving private data. These developments pose important challenges: (i) Many of the recent breakthroughs demonstrate phenomena that lack explanations and sometimes even contradict conventional wisdom. One main reason is that classical ML theory adopts a worst-case perspective, which is too pessimistic to explain practical ML: in reality, data is rarely worst-case, and experiments indicate that often much less data is needed than traditional theory predicts. (ii) The increase in ML applications that involve private and sensitive data highlights the need for algorithms that handle the data responsibly. While this need has been addressed by the field of Differential Privacy (DP), the cost of privacy remains poorly understood: How much more data does private learning require, compared to learning without privacy constraints? Inspired by these challenges, our guiding question is: How much data is needed for learning? To answer this question, we aim to develop a theory of generalization that complements the traditional theory and is better suited to modelling real-world learning tasks. We will base it on distribution-, data-, and algorithm-dependent perspectives. These complement the distribution-free worst-case perspective of the classical theory and are suitable for exploiting specific properties of a given learning task. We will use this theory to study various settings, including supervised, semi-supervised, interactive, and private learning. We believe that this research will advance the field in terms of efficiency, reliability, and applicability. Furthermore, our work combines ideas from various areas in computer science and mathematics; we thus expect further impact outside our field.