WiDS Posts | September 15, 2021

Dealing with Missing Data

This year, during the Women in Data Science (WiDS) Worldwide conference, Professor Fatima Abu Salem from the American University of Beirut (AUB) delivered a technical talk, “Doing Data Science in Data Deserts”. In the case studies that she described, her biggest hurdle was gaining access to the data sets required to do her analysis. As Fatima says, “…we have logistical and financial hurdles against the ability to collect data and when we do it, it has low temporal or spatial resolution”.

Fatima’s talk is also available in Arabic. You can get to know Fatima better in her WiDS Worldwide Meet the Speaker session, moderated by WiDS ambassador Lama Moussawi, Associate Dean for Research and Faculty Development at AUB. In addition, you can hear more about Fatima’s work and her journey from high school teacher to theoretical mathematician to data scientist and professor in this episode of the WiDS podcast.

Also during the WiDS Worldwide conference, Megan Price and Maria Gargiulo from the Human Rights Data Analysis Group (HRDAG) delivered a workshop, “Data Processing & Statistical Models to Impute Missing Perpetrator Information”. The datasets that Megan, Maria, and their colleagues at HRDAG receive from their partners are missing data on the perpetrators, as well as on other variables. In this workshop, Maria describes her process and statistical methods for imputing the other missing variables, which then helped her to impute the missing perpetrator data.

Megan Price also delivered a fascinating technical talk during the WiDS conference at Stanford in 2017, talking about how she and her colleagues used statistics and computer science methods to quantify how many people were killed in the conflict in Syria. You can learn more about how Megan uses data science to fight for human rights in this episode of the WiDS podcast.

Madeleine Udell, Assistant Professor at Cornell University and Stanford ICME alumna, delivered a talk titled “Filling in Missing Data with Low Rank Models”. Madeleine talks about how to use low rank models to analyze big, messy data sets, introducing the mathematics behind these models along the way. During the talk, she cites several examples, including a 300-million-row data set with non-numeric and missing data that she wrangled during Obama’s 2012 campaign. Madeleine also delivered an excellent workshop at the WiDS Worldwide conference in 2021 on Automating Machine Learning.
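To give a feel for the idea behind low rank models, here is a minimal sketch in Python. It is not the method from Madeleine’s talk, which also handles non-numeric data. It is a much simpler stand-in, iterative truncated-SVD imputation on a purely numeric table, and the function name `low_rank_impute`, its parameters, and the toy matrix are all assumptions made for illustration. The sketch fills the gaps with a first guess, approximates the table with a low rank matrix, copies the approximation into the gaps, and repeats.

```python
import numpy as np

def low_rank_impute(X, rank, n_iters=200, tol=1e-8):
    """Fill NaN entries of X using a rank-`rank` approximation (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # First guess: replace each missing entry with its column's observed mean.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(n_iters):
        # Best rank-`rank` approximation of the current filled-in matrix.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]

        # Keep observed entries as they are; update only the missing ones.
        new_filled = np.where(missing, approx, X)
        change = np.linalg.norm(new_filled - filled)
        filled = new_filled
        if change < tol:
            break
    return filled

# Toy example (hypothetical data): a rank-1 table with one entry removed.
truth = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
observed = truth.copy()
observed[0, 0] = np.nan
completed = low_rank_impute(observed, rank=1)
print(completed.round(3))
```

Because the toy table really is rank 1, the missing entry can be recovered from the pattern in the remaining rows and columns. Real data is rarely this clean, so the rank and the stopping rule are choices to tune rather than fixed settings.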

As the volume of big, messy datasets continues to grow, the challenge of missing data will grow, too. Whether the problem is lack of access to the data sets that you need, or large swaths of missing data, there are data science methods for solving the problem. Thanks to Fatima, Megan, Maria, and Madeleine, you now have some strategies and approaches to dealing with missing data.