Advances in data analysis using aggregated data

Dec 16, 2024
Boris Béranger
Abstract
The rise of big and complex data has motivated the need for faster and more efficient statistical modelling techniques. For example, the huge volume of internet data collected daily means that simple statistical models cannot be fitted on a regular computer, and sometimes the data cannot even be stored. One strategy is to reduce the amount of data by aggregating it into summaries and to perform the analysis on the summaries themselves. For a general aggregation function, we propose a likelihood-based approach to fit statistical models defined at the level of the underlying data. Theoretical guarantees for the resulting maximum likelihood estimators are established, including consistency results for generic continuous aggregation functions. We then turn to the important, yet (almost) unexplored, topic of summary design. Focusing on the family of (univariate) random bin histogram aggregation functions, we develop methodology to address the key question: how many bins do we need, and where should they be placed? Simulation experiments illustrate the insights drawn from the methodology.
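To make the idea of fitting an underlying-level model from summaries concrete, here is a minimal sketch in Python. It is not the method presented in the talk: it assumes the simpler case of a fixed-bin histogram (the talk considers random bins and general aggregation functions), a normal model for the underlying data, and a crude pattern search in place of a proper optimiser. The function names, bin edges and simulated data are all hypothetical and chosen purely for illustration.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def histogram_counts(data, edges):
    """Aggregate raw data into bin counts; the two outer bins are open-ended."""
    counts = [0] * (len(edges) + 1)
    for x in data:
        k = sum(x > e for e in edges)  # index of the bin containing x
        counts[k] += 1
    return counts


def neg_log_lik(mu, sigma, edges, counts):
    """Negative multinomial log-likelihood of the counts (constant dropped)."""
    cdf = [0.0] + [normal_cdf(e, mu, sigma) for e in edges] + [1.0]
    nll = 0.0
    for k, n in enumerate(counts):
        p = max(cdf[k + 1] - cdf[k], 1e-300)  # guard against log(0)
        nll -= n * math.log(p)
    return nll


def fit_normal(edges, counts, mu=0.0, sigma=1.0, step=1.0, tol=1e-6):
    """Crude pattern search over (mu, log sigma); illustrative only."""
    log_s = math.log(sigma)
    best = neg_log_lik(mu, math.exp(log_s), edges, counts)
    while step > tol:
        improved = False
        for d_mu, d_ls in ((step, 0), (-step, 0), (0, step), (0, -step)):
            cand = neg_log_lik(mu + d_mu, math.exp(log_s + d_ls), edges, counts)
            if cand < best:
                mu, log_s, best = mu + d_mu, log_s + d_ls, cand
                improved = True
        if not improved:
            step /= 2.0  # shrink the search step once no move helps
    return mu, math.exp(log_s)


# Simulated "underlying" data (assumed for illustration), then discarded
# in favour of the histogram summary.
random.seed(0)
data = [random.gauss(2.0, 1.5) for _ in range(20000)]
edges = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
counts = histogram_counts(data, edges)

# Only the bin edges and counts are used to estimate the parameters.
mu_hat, sigma_hat = fit_normal(edges, counts)
print(mu_hat, sigma_hat)
```

In this sketch the fit sees only eight counts rather than 20,000 observations, which is the kind of data reduction the abstract describes. How many bins to use and where to put the edges is exactly the summary-design question, and the choices above are arbitrary.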
Date
Dec 16, 2024
Event
Location

University College, London

Invited talk in the session "Recent advances in symbolic data analysis", organised by Dr Andrej Srakar and Dr Yaser Samadi. Other presenters: Lynne Billard, Yaser Samadi and Abdolnasser Sadeghkhani.