CRoNoS & MDA 2019: Spring Course

Final CRoNoS Spring Course

The Spring Course will consist of a series of tutorials in representative areas of CRoNoS (Computationally-Intensive methods for the robust analysis of non-standard data).

Dates: 14-16 April 2019
Venue: Poseidonia Beach Hotel, Limassol, Cyprus.
Speakers:
Stefan Van Aelst, KU Leuven, Belgium.
Tim Verdonck, KU Leuven, Belgium.
Karel Hron, Palacký University, Czech Republic.
Alastair Young, Imperial College, UK.
Peter Winker, University of Giessen, Germany.
Zlatko Drmac, University of Zagreb, Croatia.
Ivette Gomes, Universidade de Lisboa, Portugal.
Daniela Zaharie, West University of Timisoara, Romania.

Grants

PhD students and Early Career Investigators (who have obtained their PhD degree in 2010 or after) from eligible COST countries* can apply for a limited number of grants. The granted participants will be reimbursed up to 600 Euro for accommodation and travelling plus the (standard) registration fee.

To apply for a grant, candidates should submit their CV by e-mail to cronos.cost@gmail.com.
Deadline for applications: 8th January 2019.
Granted candidates will be informed by e-mail after the deadline and must register within 7 days of the notification, by e-mail to cronos.cost@gmail.com, to secure their grants. Otherwise, their grants will be revoked and assigned to other candidates.
The granted candidates must attend all sessions of the Spring Course and sign the attendance list in order to obtain their grants.

*Eligible COST countries: Austria, Belgium, Bosnia and Herzegovina, Bulgaria, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Israel, Italy, Latvia, Lithuania, Luxembourg, Malta, Montenegro, The Netherlands, Norway, Poland, Portugal, Romania, Serbia, Slovakia, Slovenia, Spain, Sweden, Switzerland, Turkey, United Kingdom and the former Yugoslav Republic of Macedonia.

Robust high-dimensional data analysis (4 hours)

Stefan Van Aelst, KU Leuven, Belgium, and Tim Verdonck, KU Leuven, Belgium.

Robust statistics develops methods and techniques to reliably analyze data in the presence of outlying measurements. Next to robust inference, outlier detection is also an important goal of robust statistics. When analyzing high-dimensional data, sparse solutions are often desired to enhance interpretability of the results. Moreover, when the data are of uneven quality, robust estimators are needed that are computationally efficient, so that solutions can be obtained in a reasonable amount of time. Furthermore, if many variables in high-dimensional data can have some anomalies in their measurements, then it is no longer reasonable to assume that a majority of the cases is completely free of contamination. In such cases the standard paradigm of robust statistics is no longer valid, and alternative methods need to be used. In this tutorial we will discuss robust procedures for high-dimensional data, such as estimation of location and scatter, linear regression, generalized linear models and principal component analysis. The good performance of these methods is illustrated on real data using R.
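To make the outlier-detection point concrete, here is a minimal univariate sketch in Python (not the R material of the tutorial, and far simpler than the high-dimensional procedures it covers). It compares a classical z-score rule with one based on the median and the median absolute deviation (MAD). The data, the function names and the cutoff of 3 are assumptions made for this illustration only.

```python
import statistics


def classical_z_scores(values):
    """Standardize with the sample mean and sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def robust_z_scores(values):
    """Standardize with the median and the scaled MAD.

    Assumes the MAD is non-zero, i.e. fewer than half of the values are tied
    at the median.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # factor that makes the MAD consistent at the normal model
    return [(v - med) / scale for v in values]


def flag_outliers(z_scores, cutoff=3.0):
    """Flag observations whose absolute standardized value exceeds the cutoff."""
    return [abs(z) > cutoff for z in z_scores]


# Hypothetical measurements: eight regular values and one gross error.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 55.0]

print(flag_outliers(classical_z_scores(data)))
print(flag_outliers(robust_z_scores(data)))
```

On this small data set the classical rule flags nothing, because the gross error inflates both the mean and the standard deviation, whereas the median/MAD rule flags only the last value.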

Applied compositional data analysis (3.5 hours)

Karel Hron, Palacký University, Czech Republic.

Compositional data are multivariate observations that carry relative information. They are measured in units like proportions, percentages, mg/l, mg/kg, ppm, and so on, i.e., as data that might obey (or not) a constant sum of components. Due to their specific features, the statistical analysis of compositional data must obey the geometry of the simplex sample space. In order to enable processing of compositional data using standard statistical methods, compositions can be conveniently expressed by means of real vectors of logratio coordinates. Their meaningful interpretability is of primary importance in practice. The aim of the course is to introduce the logratio methodology of compositional data together with a wide range of its possible applications. The first part of the course will be devoted to theoretical aspects of the methodology, including principles of compositional data analysis, geometrical representation of compositions, construction of logratio coordinates and their interpretability. In the second part, exploratory data analysis including visualization will be presented, followed by concrete popular statistical methods, e.g. correlation and regression analysis or principal component analysis, and even methods for processing high-dimensional data adapted within the logratio methodology. Robust counterparts to some of these methods will also be discussed. Numerical examples will be presented using the package robCompositions of the statistical software R.
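As a small illustration of why logratios carry only relative information, the following Python sketch computes centred logratio (clr) coefficients, one common logratio representation. The course itself uses R and robCompositions and may rely on other coordinate systems; the function names and the sample values here are hypothetical.

```python
import math


def closure(parts, total=1.0):
    """Rescale strictly positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]


def clr(parts):
    """Centred logratio coefficients: log of each part minus the mean log."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical three-part composition, first in mg/kg, then in percent.
sample_mg_per_kg = [120.0, 30.0, 50.0]
sample_percent = closure(sample_mg_per_kg, total=100.0)

print(clr(sample_mg_per_kg))
print(clr(sample_percent))
```

Both calls print the same vector up to rounding, since rescaling all parts by a common factor does not change any ratio. The clr coefficients always sum to zero, which is one reason other coordinate systems are often preferred for regression-type methods.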

Selective inference (3.5 hours)

Alastair Young, Imperial College, UK.

Selective inference is concerned with performing valid statistical inference when the questions being addressed are suggested by examination of data, rather than being specified before data collection. In this tutorial we describe key ideas in selective inference, from both frequentist and Bayesian perspectives. In frequentist analysis, the fundamental notion is that valid inference, in the sense of control of error rates, is only obtained by conditioning on the selection event, that is, by considering hypothetical repetitions which lead to the same inferential questions being asked. The Bayesian standpoint is less clear, but it may be argued that such conditioning on the selection is required if this takes place on the parameter space as well as on the sample space. We provide an overview of conceptual and computational challenges, as well as asymptotic properties of selective inference in both frameworks, under the assumption that selection is made in a well-defined way.

Text mining in econometrics (1.5 hours)

Peter Winker, University of Giessen, Germany.

There is a growing interest in the use of textual information in different fields of economics, ranging from financial markets (analysts’ statements, communication of central banks) through innovation activities (patent abstracts, websites) to the history of economic science (journal articles). In order to draw meaningful conclusions from this type of data, the analysis has to cover a substantial number of steps, including 1) the selection of appropriate sources (corpora) and establishing access, 2) the preparation of the text data for further analysis, 3) the identification of themes within documents, 4) quantifying the relevance of themes in different documents, 5) aggregating relevance information, e.g. across sectors or over time, and 6) analysis of the generated indicators. The course will provide some first insights into these steps of the analysis and indicate open issues regarding, e.g., computational complexity and robustness of the methods. It will be illustrated with empirical examples.

Numerical linear algebra for computational statistics (2.5 hours approx.)

Zlatko Drmac, University of Zagreb, Croatia.

If one googles "covariance matrix, negative eigenvalues", one finds many questions and discussions on "why do I get negative eigenvalues of a covariance matrix". Indeed, why? One can argue that in most cases the explanations offered in those discussions are plainly wrong. We will discuss this and other questions related to numerical procedures in computational statistics, in particular on eigenvalues, the singular value decomposition (SVD) and its generalization, the GSVD (including the QSVD, the PSVD and the cosine-sine decomposition, CSD, of partitioned orthonormal matrices). These are the tools of the trade in various applications, including computational statistics, least squares modeling and vibration analysis in structural engineering, to name just a few. In essence, the GSVD can be reduced to the SVD of certain products and quotients of matrices. For instance, in the canonical correlation analysis of two sets of variables $x$, $y$ with a joint distribution and covariance matrix $C = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}$, one needs the SVD of the product $C_{xx}^{-1/2}C_{xy}C_{yy}^{-1/2}$. However, using software implementations of numerical algorithms is not that simple, despite the availability of many well-known state-of-the-art software packages. We will review the recent advances in this important part of numerical linear algebra, with particular attention to (i) understanding the sensitivity and condition numbers; (ii) numerical robustness and limitations of numerical algorithms; (iii) careful selection and deployment of reliable mathematical software, so that the computed output can be interpreted and used with confidence in concrete applications. We illustrate the theoretical numerical issues on selected tasks from computational statistics.
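As a brief worked note on the canonical correlation example in this abstract (a standard textbook derivation added for illustration, not taken from the tutorial material), assume that $C_{xx}$ and $C_{yy}$ are positive definite. The symbols $U$, $\Sigma$, $V$, $\sigma_i$, $u_i$, $v_i$, $a_i$ and $b_i$ are notation introduced only for this sketch, with $u_i$ and $v_i$ denoting the columns of $U$ and $V$.

```latex
\[
C_{xx}^{-1/2}\, C_{xy}\, C_{yy}^{-1/2} = U \Sigma V^{T},
\qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots),
\qquad
1 \ge \sigma_1 \ge \sigma_2 \ge \dots \ge 0 .
\]
\[
a_i = C_{xx}^{-1/2} u_i, \qquad b_i = C_{yy}^{-1/2} v_i
\quad\Longrightarrow\quad
\operatorname{var}(a_i^{T} x) = \operatorname{var}(b_i^{T} y) = 1,
\qquad
\operatorname{corr}(a_i^{T} x,\; b_i^{T} y) = \sigma_i .
\]
```

The singular values $\sigma_i$ are thus the canonical correlations. Forming the inverse square roots explicitly can be delicate when $C_{xx}$ or $C_{yy}$ is ill-conditioned, which is where the sensitivity and software questions listed in the abstract become relevant.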

Statistics of extremes and risk assessment using R (2 hours)

Ivette Gomes, Universidade de Lisboa, Portugal.

Extreme value theory (EVT) helps us to control potentially disastrous events, of high relevance to society and with a high social impact. Floods, fires, and other extreme events have provided impetus for recent re-developments of extreme value (EV) analysis. In EVT, just as in almost all areas of statistics, the ordering of a sample is of primary relevance. After a brief reference to a few concepts related to ordering, we provide some motivation for the need for EVT in the analysis of rare events, in fields as diverse as environment, finance and insurance, among others. Next, the general EV and the generalized Pareto distributions are introduced, together with the concepts of extreme value index and tail-heaviness. Finally, we deal with several topics in the field of statistics of extremes, a highly useful area in applications whenever we want to make inference on tails and estimate parameters of rare events, either univariate or multivariate. Apart from providing a review of most of the parametric approaches in the area, we further refer to a few semi-parametric approaches, with the analysis of case studies in the aforementioned fields, performed essentially through the use of R packages for extreme values, like evd, evdbayes, evir, ismev, extRemes, fExtremes, POT, and SpatialExtremes, among others.
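To illustrate the notion of an extreme value index for a heavy tail, here is a minimal Python sketch of the Hill estimator, a classical semi-parametric estimator. The abstract does not name this estimator and the course works with R packages, so this is only an assumed example; the function name, the synthetic Pareto-type sample and the choice k = 100 are hypothetical.

```python
import math


def hill_estimator(sample, k):
    """Hill estimate of a positive extreme value index from the k largest values.

    It averages the log-excesses of the k largest observations over the
    (k+1)-th largest one, which serves as a random threshold.
    """
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    ordered = sorted(sample, reverse=True)
    threshold = ordered[k]
    if threshold <= 0:
        raise ValueError("the k+1 largest observations must be positive")
    return sum(math.log(x / threshold) for x in ordered[:k]) / k


# Synthetic heavy-tailed data: Pareto quantiles on a regular grid, so that
# P(X > x) = x ** (-1 / true_index) and the example is fully deterministic.
true_index = 0.5
n = 1000
sample = [(i / (n + 1)) ** (-true_index) for i in range(1, n + 1)]

print(hill_estimator(sample, k=100))
```

The printed value is close to the true index 0.5. In practice the estimate depends strongly on k, which is why it is usually inspected over a range of thresholds rather than at a single value.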

Machine learning methods for multivariate data analysis (1.5 hours)

Daniela Zaharie, West University of Timisoara, Romania.

Predictive tasks (e.g. classification or regression) can be solved by using various machine learning models constructed from data (e.g. k-nearest neighbours, decision trees, support vector machines, neural networks etc.). The prediction accuracy of individual models can be improved by aggregating several models using various ensemble techniques (e.g. bagging, boosting, stacking). Besides these explicit ensemble techniques, there are also strategies (e.g. dropout) which induce an implicit ensemble with shared parameters by injecting extra randomness into the machine learning model, thereby generating various model instances which are then aggregated. On the other hand, in real-world applications it would be useful to provide a measure of the prediction uncertainty. Most machine learning models, particularly the black-box ones (e.g. neural networks), do not directly provide estimates of the prediction uncertainty. However, the information provided by ensemble models can be exploited in order to estimate uncertainty measures. For instance, in the context of neural networks it has recently been shown that by using Monte Carlo dropout one can obtain uncertainty estimates similar to those obtained with Bayesian estimates of the neural network parameters. The aim of this tutorial is first to provide an overview of meta-models, with a focus on ensemble strategies applied to decision trees (e.g. random forests, boosted decision trees). Then the particularities of randomly dropping out parameters of the model and its impact on the performance are discussed. Finally, several approaches to estimating the uncertainty of the prediction are discussed in the context of solving predictive tasks in biology and for semantic segmentation of satellite images.
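The idea of reading prediction uncertainty off an ensemble can be sketched with plain bagging. The Python example below uses bootstrap resamples of a simple least-squares line instead of the decision trees or dropout networks discussed in the tutorial, purely to keep it short and dependency-free. The data, the function names and the choice of 200 models are assumptions for this illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # degenerate resample: all x values identical
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bagged_prediction(xs, ys, x_new, n_models=200, seed=0):
    """Return mean and standard deviation of bootstrap-ensemble predictions."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        fit = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        if fit is not None:
            intercept, slope = fit
            predictions.append(intercept + slope * x_new)
    return statistics.fmean(predictions), statistics.stdev(predictions)


# Hypothetical training data: a noisy line observed on x = 0, ..., 19.
noise = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [1.0 + 2.0 * x + noise.gauss(0.0, 1.0) for x in xs]

print(bagged_prediction(xs, ys, 9.5))    # inside the training range
print(bagged_prediction(xs, ys, 100.0))  # far outside the training range
```

The spread of the ensemble predictions is small inside the training range and much larger far outside it, which is the kind of signal an ensemble can offer as a rough uncertainty measure.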