The Problem of Missing Data
Many important variables in the dataset have a significant number of missing values. Data often goes missing because of problems during collection and curation. Missing values are usually coded as “Null” or “Not Available” (“NA” for short). What are the best assumptions to make about NAs? Do you assume they were generated by a completely random process? Do you drop them? Or do you impute (fill in) appropriate values? Imputed or simulated values might turn out to be incongruent with reality. It turns out that imputation is an active area of research. Here are some resources that address missing data in modeling challenges, including blogs and R and Python packages (mice, simputation and autoimpute):
For R users:
Multivariate Imputation by Chained Equations
Imputing Missing Data with R
5 Powerful R Packages used for imputing missing values
Missing Data Analysis with mice
A Brief Introduction to Mice R Package
Announcing the simputation package: make imputation simple
For Python users:
Imputation Methods in Python
Imputation of missing values
Python packages for numerical data imputation [closed]
Simple techniques for missing data imputation
Is there a good Python library to impute missing values using more advanced methods rather than mean/median/most frequent value?
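Before choosing between dropping and imputing, it can help to measure how much is actually missing in each variable. Here is a minimal sketch using only the Python standard library. The set of missing-value tokens and the toy columns (age, bmi, glucose) are my assumptions for illustration and are not taken from the datathon dataset.

```python
import csv
import io

# Assumed tokens that mark a missing value; adjust to the dataset at hand.
MISSING_TOKENS = {"", "NA", "N/A", "Null", "NULL", "NaN"}


def missing_fraction(csv_text):
    """Return {column: fraction of rows whose value is a missing token}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    return {
        col: sum(row[col].strip() in MISSING_TOKENS for row in rows) / len(rows)
        for col in rows[0]
    }


# Toy data, invented for illustration only.
SAMPLE = """age,bmi,glucose
54,NA,110
61,27.5,NA
NA,31.0,NA
47,24.2,95
"""

fractions = missing_fraction(SAMPLE)
# Print columns from most to least missing.
for col, frac in sorted(fractions.items(), key=lambda kv: -kv[1]):
    print(f"{col}: {frac:.0%} missing")
```

A summary like this does not tell you why values are missing, but it does show which variables would lose the most rows if you simply dropped NAs.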
There is quite a bit of discussion in the technical literature about whether it is enough to simply replace NA values with the mean or median of the existing values. Some researchers see a problem here, because the Null values could be replaced with values that are unreasonable or incorrect for that observation. Their preferred approach is model-based imputation, which captures and simulates the entire underlying multivariate distribution. If you decide to pursue imputation and want further reading on the different approaches, I also recommend the following resources:
- Van der Loo, Mark and de Jonge, Edwin, “Statistical Data Cleaning with Applications in R,” Wiley, 2018. Chapter 10 focuses on imputation and adjustment and includes a detailed discussion of model-based imputation.
- Van Buuren, Stef and Groothuis-Oudshoorn, Karin, “mice: Multivariate Imputation by Chained Equations in R,” Journal of Statistical Software, December 2011. Page 6 discusses the problems in imputing multivariate data.
https://www.jstatsoft.org/article/view/v045i03
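To make the contrast between mean imputation and model-based imputation concrete, here is a small sketch on invented data. It fills one missing value twice: once with the column mean, and once with a prediction from a least-squares line fit to the observed rows. The data and variable names are hypothetical, and single-variable regression is only the simplest stand-in for model-based imputation. Packages such as mice go further by modeling several variables jointly and by drawing multiple imputations to reflect uncertainty, which this sketch does not do.

```python
from statistics import mean

# Toy (x, y) pairs with y roughly 2 * x; None marks a missing y. Invented data.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (10.0, None)]

observed = [(x, y) for x, y in data if y is not None]
xs = [x for x, _ in observed]
ys = [y for _, y in observed]

# Option 1: mean imputation ignores x entirely.
mean_fill = mean(ys)

# Option 2: regression imputation, a simple form of model-based imputation.
# Fit y = intercept + slope * x by ordinary least squares on the observed rows.
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in observed) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar


def impute(rows, fill):
    """Replace each missing y with fill(x); observed values are kept as-is."""
    return [(x, y if y is not None else fill(x)) for x, y in rows]


mean_imputed = impute(data, lambda x: mean_fill)
model_imputed = impute(data, lambda x: intercept + slope * x)

print("mean imputation:      ", mean_imputed[-1])
print("regression imputation:", model_imputed[-1])
```

For the row with x = 10, the mean fill lands near the middle of the observed y values, while the regression fill follows the trend in the data. That gap is the kind of unreasonable replacement the researchers above are worried about.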
Multicollinearity
When I started working with the dataset, I saw right away that there could be a problem with several of the variables being highly correlated with each other. This situation is called multicollinearity. It can inflate the standard errors of regression coefficients, which affects their apparent significance, and it could also impact the results of a classification exercise. Here are some resources that might put things into perspective and may provide some avenues for handling multicollinearity.
The problem of Multicollinearity:
- Why is multicollinearity so bad for machine learning models and what can we do about it?
- Why is multicollinearity not checked in modern statistics/machine learning
- Differences between L1 and L2 as Loss Function and Regularization
- Collinearity: A review of methods to deal with it and a simulation study evaluating their performance
- Enough Is Enough! Handling Multicollinearity in Regression Analysis
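A quick first check for highly correlated variables is to compute pairwise Pearson correlations and flag the pairs above a cutoff. The sketch below does this with the standard library only. The 0.9 threshold and the toy columns are my assumptions, chosen for illustration. Pairwise correlation can miss collinearity that involves three or more variables at once, so treat this as a screening step rather than a full diagnosis.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def correlated_pairs(columns, threshold=0.9):
    """Return (name1, name2, r) for every pair with |r| >= threshold."""
    flagged = []
    for (n1, c1), (n2, c2) in combinations(columns.items(), 2):
        r = pearson(c1, c2)
        if abs(r) >= threshold:
            flagged.append((n1, n2, r))
    return flagged


# Toy columns, invented: weight_lb is weight_kg expressed in different units.
columns = {
    "weight_kg": [60.0, 72.0, 81.0, 95.0, 55.0],
    "weight_lb": [132.3, 158.7, 178.6, 209.4, 121.3],
    "age": [34.0, 51.0, 29.0, 62.0, 45.0],
}

flagged = correlated_pairs(columns)
for n1, n2, r in flagged:
    print(f"{n1} ~ {n2}: r = {r:.3f}")
```

Here only the two weight columns are flagged, since one is a rescaled copy of the other. In a case like that, dropping one of the pair loses almost no information.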
Linear or Nonlinear Approaches
A question worth addressing early on is what kind of patterns in the data would support a good classification. For example, can the classes be separated easily with a straight line? Or are there nonlinear patterns in the data that, if represented appropriately, would help separate the classes? If you wish to explore nonlinear separation, there are several approaches. Here are some pointers:
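One pointer, as a minimal sketch: sometimes a nonlinear pattern becomes easy to separate once you add the right derived feature. The toy data below (two concentric rings, invented for illustration) cannot be split by a straight line, since one class sits entirely inside the other. A threshold on the raw x coordinate, used here as one example of a linear rule, does poorly, while the same kind of threshold on the derived feature x^2 + y^2 separates the classes. The helper names are hypothetical.

```python
import math


def ring(radius, n):
    """Return n points evenly spaced on a circle of the given radius."""
    angles = [2 * math.pi * k / n for k in range(n)]
    return [(radius * math.cos(a), radius * math.sin(a)) for a in angles]


# Toy data, invented: class 0 is an inner ring, class 1 is an outer ring.
points = [(p, 0) for p in ring(1.0, 12)] + [(p, 1) for p in ring(3.0, 12)]


def best_threshold_accuracy(values, labels):
    """Best accuracy of any rule 'value > t' (or its negation) on one feature."""
    n = len(values)
    best = 0.0
    for t in sorted(set(values)):
        hits = sum((v > t) == bool(y) for v, y in zip(values, labels))
        best = max(best, hits / n, (n - hits) / n)
    return best


labels = [y for _, y in points]
x_only = [p[0] for p, _ in points]
radius_sq = [p[0] ** 2 + p[1] ** 2 for p, _ in points]

acc_linear = best_threshold_accuracy(x_only, labels)
acc_nonlinear = best_threshold_accuracy(radius_sq, labels)
print(f"threshold on x:         {acc_linear:.2f}")
print(f"threshold on x^2 + y^2: {acc_nonlinear:.2f}")
```

Real data is rarely this clean, but the same idea underlies polynomial features and kernel methods: represent the data so that a simple boundary becomes sufficient.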
Other general thoughts:
Searching and Searching
It is tempting to do a grid search that includes every possible combination of parameters. Tempting, and also somewhat infeasible, because the number of fits grows multiplicatively with each parameter you add. I would suggest keeping your grid searches reasonable.
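To see how quickly a grid grows, it helps to count the fits before launching anything. The sketch below uses a hypothetical parameter grid (the names and values are my assumptions, loosely modeled on a gradient-boosting setup) and an assumed 5-fold cross-validation. It then draws a fixed-size random subset of configurations, which is one way to keep the search within a budget.

```python
import itertools
import math
import random

# Hypothetical search space; the names and values are for illustration only.
param_grid = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [3, 5, 7, 9, 11],
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
    "min_samples_leaf": [1, 5, 20],
}
cv_folds = 5  # assumed number of cross-validation folds

# The full grid is the product of the number of options per parameter.
n_combinations = math.prod(len(v) for v in param_grid.values())
n_fits = n_combinations * cv_folds
print(f"full grid: {n_combinations} combinations, {n_fits} model fits")


def sample_configs(grid, n_samples, seed=0):
    """Draw n_samples distinct configurations uniformly from the grid."""
    all_configs = list(itertools.product(*grid.values()))
    rng = random.Random(seed)
    chosen = rng.sample(all_configs, n_samples)
    return [dict(zip(grid.keys(), combo)) for combo in chosen]


budget = sample_configs(param_grid, n_samples=20)
print(f"random subset: {len(budget)} combinations, {len(budget) * cv_folds} fits")
```

Five modest parameter lists already give 720 combinations and 3,600 fits with 5 folds, so a sampled subset or a coarser grid is often the more practical starting point.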
Housekeeping
Definitely fit ensembles of models. But keep track of your dataset versions and models. In one week, you could fit dozens of models (not including parameter grid/random searches with hundreds or thousands of fits). Was RF_52 the Random Forest fit on the dataset where I dropped NAs or imputed NAs? After all that fitting, it is easy to lose track. Start off by creating a system of folders and logs where you can record your assumptions right at the moment you train the models.
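Such a system does not have to be elaborate. As one possible sketch, the snippet below appends a JSON line per training run, recording the model name, a short hash of the dataset file and the assumptions made. The function names, the log format and the file names are all hypothetical. The demo writes to a temporary directory so that it runs anywhere.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def dataset_fingerprint(path):
    """Short SHA-256 hash of the file contents, to tell dataset versions apart."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]


def log_run(log_path, model_name, dataset_path, assumptions, params):
    """Append one JSON line describing a training run and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "dataset": str(dataset_path),
        "dataset_sha256": dataset_fingerprint(dataset_path),
        "assumptions": assumptions,
        "params": params,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


# Demo in a temporary directory with a stand-in dataset file.
workdir = Path(tempfile.mkdtemp())
dataset = workdir / "train_imputed_v2.csv"
dataset.write_text("age,bmi\n54,27.5\n")
log_file = workdir / "runs.jsonl"

log_run(
    log_file,
    "RF_52",
    dataset,
    assumptions={"missing_values": "imputed"},
    params={"n_estimators": 500},
)

# Reading the log back answers "which dataset was RF_52 fit on?"
runs = [json.loads(line) for line in log_file.read_text().splitlines()]
print(runs[-1]["model"], runs[-1]["assumptions"])
```

Calling the logging function in the same script that trains the model means the record is written at the moment of training, not reconstructed from memory later.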
Above all…
Enjoy the data challenge. There is a lot to observe and experiment with. Take your time — iterate through the models. Go back to early fits. What was fit poorly or well? How could you change the algorithms you fit? Are there research papers discussing your observations? Once you start on this journey into statistical questioning and research, there’s no telling where it might lead you. Think of it as an invitation into the deep and rich world of mathematical statistics — an ocean of knowledge in its own right.
Learn more about the WiDS Datathon
Sharada Kalanidhi is Director of Data Science at Stanford Genome Technology Center, Stanford University School of Medicine.