Fitting models to underlying data using aggregates


Two common issues that arise when performing a statistical analysis are the complexity and the volume of the data at hand. Because of data collection techniques or the need for anonymisation, observations can take a more complex structure than the usual pointwise observations. Furthermore, the ever-increasing speed at which technological innovations appear in our day-to-day lives has led to a data-centric era. Huge amounts of data are continuously collected, requiring appropriate storage capabilities and efficient tools for statistical analysis. Symbolic data analysis (SDA) is an emerging area of statistics which has mainly focused on the exploratory data analysis of group-based distributional summaries called symbols. The current literature commonly assumes that only symbolic data are available and aims to model these directly in order to gain knowledge about their characteristics. The available techniques often rely on simplifying (uniformity) assumptions that are known to be false, and are non-inferential. The latter limitation is critical when the aim is to gain information at the level of the underlying data. This talk introduces a novel general method for constructing likelihood functions for symbolic data based on a desired probability model for the underlying classical data, while only observing the distributional summaries. This approach opens the door to new classes of symbol design and construction, in addition to developing SDA as a viable tool to enable and improve upon classical data analyses, particularly for very large and complex datasets. The usefulness of the proposed methodology is demonstrated for several statistical analyses.
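As a rough illustration of fitting an underlying-data model from a summary alone, the sketch below takes the simplest case: a histogram symbol with fixed bins. It is not the construction presented in the talk. The normal model, the bin edges, the simulated data and the grid search are all assumptions made for the example. With fixed bins and independent observations, the bin counts are multinomial with cell probabilities given by differences of the model CDF, so the parameters of the underlying model can be estimated without access to the individual observations.

```python
import bisect
import math
import random

def normal_cdf(x, mu, sigma):
    # CDF of N(mu, sigma^2); math.erf handles +/- infinity correctly.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def histogram_loglik(edges, counts, mu, sigma):
    # Multinomial log-likelihood of the bin counts, up to an additive
    # constant that does not depend on (mu, sigma).
    loglik = 0.0
    for lo, hi, n in zip(edges[:-1], edges[1:], counts):
        p = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
        if n > 0:
            loglik += n * math.log(max(p, 1e-300))  # guard against log(0)
    return loglik

# Simulated "underlying" data (assumption: N(2, 1.5^2), 5000 points).
random.seed(0)
data = [random.gauss(2.0, 1.5) for _ in range(5000)]

# Reduce the data to a histogram symbol; the open-ended outer bins
# ensure every observation falls in some bin.
edges = [-math.inf, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, math.inf]
counts = [0] * (len(edges) - 1)
for x in data:
    counts[bisect.bisect_right(edges, x) - 1] += 1

# From here on only (edges, counts) are used, not the raw data.
# A coarse grid search stands in for a proper optimiser.
mu_grid = [i * 0.05 for i in range(81)]           # 0.00 .. 4.00
sigma_grid = [0.5 + i * 0.05 for i in range(51)]  # 0.50 .. 3.00
_, mu_hat, sigma_hat = max(
    (histogram_loglik(edges, counts, m, s), m, s)
    for m in mu_grid
    for s in sigma_grid
)
print(mu_hat, sigma_hat)
```

The estimates are computed from eight counts rather than 5000 observations, which is the kind of data reduction that makes summaries attractive for very large datasets. How much precision is lost depends on the choice of bins.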

United States of America