ESSEC METALAB

STATISTICS IN THE TIME OF BIG DATA

[ESSEC Knowledge] by Olga Klopp - Professor at ESSEC Business School

“Big data”, “high-dimensional statistics”, “predictive analytics”… we have likely all heard at least one of these terms, yet their precise meanings remain elusive. How are they applied? How can advances in data analytics impact our daily lives? ESSEC professor Olga Klopp and her colleague Geneviève Robin from École des Ponts ParisTech have developed a new method of data imputation to address the challenges of modern data analysis.

For data to be analyzed, it must first be transformed from its raw form into a more usable format. We do this in myriad ways every day: we look at a person’s face and recognize the combination of features as a colleague or friend, we hear a sound and identify it as a car horn, we smell something in the air and realize that there is a bakery nearby. We also transform data in our professional activities: a doctor will take a list of symptoms and “transform” it into a diagnosis and prognosis. For the latter, we need to be able to predict: that is, to infer something we have not observed from the data we have observed. Think, for example, of a patient who has been in a serious car accident. This person is at risk of hemorrhaging, which would put their life in danger, so the hemorrhage needs to be detected early to treat the patient and possibly save their life. Doctors have therefore identified a number of factors related to hemorrhaging: in other words, they predict a non-observed trait (the hemorrhage) from a series of observed traits (for example, the person’s hemoglobin level).
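To make the idea of predicting an unobserved trait from observed ones concrete, here is a minimal sketch in Python. The variables, data, and choice of model below are purely hypothetical illustrations, not the article’s method and not clinical guidance: a simple classifier is fitted on past cases where the outcome was eventually known, then used to estimate the risk for a new patient whose outcome has not yet been observed.

```python
# Hypothetical sketch: predict an unobserved outcome from observed traits.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical past cases: columns = hemoglobin (g/dL), heart rate (bpm), systolic BP (mmHg)
X_past = rng.normal(loc=[12.0, 90.0, 120.0], scale=[2.0, 15.0, 20.0], size=(200, 3))
# Hypothetical outcome labels: 1 = hemorrhage occurred, 0 = it did not
# (low hemoglobin is used as a toy signal, purely for illustration)
y_past = (X_past[:, 0] < 10.5).astype(int)

# Fit a simple classifier on the past cases where the outcome was observed
model = LogisticRegression().fit(X_past, y_past)

# A new patient whose traits are observed but whose outcome is not
new_patient = np.array([[9.8, 115.0, 95.0]])
print("Estimated probability of hemorrhage:", model.predict_proba(new_patient)[0, 1])
```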

However, reality is a bit more complicated, as the accuracy of this prediction rests on three hypotheses: that the observed factors are related to the variable to be predicted, that we have prior data at our disposal, and that we have a model we can use to synthesize and interpret the data. It is not always possible to verify these hypotheses. Often we are interested in a phenomenon without yet knowing which indicators, traits or factors are related to it. For example, in genomic studies of cancer, where we want to identify genes potentially linked to its development, we generally have measurements for thousands or even millions of genes, but we do not know which ones are relevant to the type of cancer under study. This leads to the problem statisticians call “dimensionality reduction”: reducing the vast number of variables and measurements available to a small, manageable number that are closely linked to the phenomenon we are interested in.
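As an illustration of this reduction step, the sketch below applies the lasso, one standard variable-selection tool, to synthetic data with far more candidate “genes” than observations. The data, the indices of the relevant genes, and the choice of method are assumptions made for illustration only; the article does not prescribe a specific technique.

```python
# Hypothetical sketch: selecting a few relevant variables among thousands.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Synthetic high-dimensional data: many more candidate variables than observations
n_samples, n_genes = 100, 5000
X = rng.normal(size=(n_samples, n_genes))

# Only a handful of (hypothetical) genes actually drive the outcome
true_support = [10, 250, 4000]
y = X[:, true_support] @ np.array([2.0, -1.5, 1.0]) + rng.normal(scale=0.5, size=n_samples)

# The lasso shrinks most coefficients to exactly zero, keeping a small subset
lasso = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("Variables retained by the lasso:", selected)
```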

[To read the full article please follow this link.]
