Estimating Equations for Data Summaries


In the current data-centric world, considerable effort is devoted to analysing large and complex datasets. Symbolic data analysis (SDA) offers opportunities for the statistical modelling of data (symbols) that possess internal variation, arising either from measurement imprecision or from the application of an aggregation function to reduce the size of the original dataset (data summaries). Traditionally, much of the work in this field has relied on the strong assumption that the underlying data are uniformly distributed, with conclusions drawn at the symbol level. Recently, new methods have been developed in which a parametric model is assumed for the underlying distribution and fitted directly to the symbolic data. However, such assumptions are not always suitable, and working within a non-parametric framework may be preferable. We therefore propose a non-parametric counterpart to these parametric methods, relying on concepts from estimating equations. This first requires extending the empirical likelihood to the symbolic context, which then allows us to derive symbolic estimating equations and to estimate summary statistics of the underlying data (e.g. mean, variance, quantiles) directly from the symbolic dataset. This is possible provided that the structure of the microdata within each symbol is well defined. Furthermore, we provide new non-parametric procedures to improve the estimation of the within-symbol structure of the microdata for interval- and histogram-valued data. Improvements are demonstrated through various simulation studies, and the utility of the proposed framework is illustrated through real data analyses.

Institut Henri Poincare, Paris, France