WiDS Posts | January 5, 2022

A Beginner’s Tutorial for the WiDS Datathon 2022 challenge

Predicting energy consumption poses multiple, complex challenges. Energy consumption depends on many building-related variables. Climatic variables such as temperature, rainfall, and snowfall drive patterns of energy consumption that are both cross-sectional and temporal. For example, one would generally expect zip codes on the east coast to experience much higher snowfall in the winter than zip codes on the west coast, which is a cross-sectional pattern. That said, there are also annual climate patterns that are clearly temporal. The challenge we face is choosing a modeling methodology that lets us capture these patterns appropriately.

Step 1. Consider alternate modeling approaches
It is often very helpful to consider alternate approaches to a given modeling problem. Here I review two methodologies that you might consider when deciding how to develop a model.

a) XGBoost
This approach might come in handy if you decide to focus on capturing the cross-sectional patterns in the data. In previous tutorials I’ve described how you might fit an XGBoost model. I discussed how you could i) build a DMatrix and ii) work through hyper-parameter tuning. For your reference, here’s the blog:

Also, here’s a resource for more information:

Here’s an example (in the R programming language) of how you might create a DMatrix and try alternate parameters to fit an XGBoost model.

b) Deep Learning: Recurrent Neural Nets
If you decide to pursue a temporal approach to modeling, you might consider deep learning, a family of models that has been very effective in modeling complex datasets. In this vast field, RNNs (Recurrent Neural Nets) have been particularly useful in modeling temporal patterns. Here I provide you with resources to further explore this area, focusing on resources that are freely available over the internet.

The internet has many great free resources discussing deep learning in detail. One of them is the Deep Learning book (by Ian Goodfellow et al.).

As discussed in the resource by Goodfellow et al., RNNs (Recurrent Neural Nets) maintain state information from one step of the temporal sequence to the next. We are trying to compute the conditional distribution of the next sequence element given the past inputs. Decomposing the joint probability over the sequence of values into a series of one-step probabilistic predictions lets us compute the full joint distribution across the whole sequence. Essentially, we are using the RNN to parametrize long-term relationships and dependencies between variables efficiently.
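
In symbols, this decomposition is the chain rule of probability, paired with a recurrent state update. The notation is an assumption chosen for illustration and does not come from the original post: x_t is the sequence element at step t, h_t is the hidden state, and f with parameters θ is the update function.

```latex
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1}),
\qquad
h_t = f(h_{t-1}, x_t; \theta)
```

Under this reading, the network models each factor on the right using h_{t-1} as a fixed-size summary of everything seen so far, and the same parameters θ are reused at every step.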

There are problems you might run into while trying to build an RNN. As Goodfellow et al. further discuss, composing the same function over many time steps produces highly non-linear behavior, and gradients propagated over many stages might vanish or explode. Architectures such as LSTM help to overcome the vanishing gradient problem.
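
To get a feel for why gradients vanish or explode, here is a minimal sketch in Python. It assumes the simplest possible recurrence, a scalar linear one, h_t = w * h_{t-1}, which is far simpler than a real RNN. The function name and the weights tried are made up for illustration.

```python
def gradient_magnitude(recurrent_weight: float, num_steps: int) -> float:
    """Magnitude of d h_T / d h_0 for the scalar recurrence h_t = w * h_{t-1}."""
    gradient = 1.0
    for _ in range(num_steps):
        # Each step back in time multiplies the gradient by the same weight.
        gradient *= recurrent_weight
    return abs(gradient)


if __name__ == "__main__":
    for weight in (0.5, 1.0, 1.5):
        magnitude = gradient_magnitude(weight, num_steps=50)
        print(f"w={weight}: |gradient| after 50 steps = {magnitude:.3e}")
```

A weight below 1 in magnitude shrinks the gradient towards zero, and a weight above 1 blows it up. Real RNNs multiply by matrices and activation derivatives instead of a single number, but the repeated multiplication is the same.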

The Deep Learning book also has practical recommendations on fitting your model. Goodfellow et al. recommend assessing which components are overfitting or under-fitting. They also recommend using dropout, which they suggest is an excellent and easy-to-use regularizer. Batch normalization can also sometimes reduce generalization error. Understanding and tuning hyper-parameters (such as the learning rate), either manually or through grid search or random search, is highly recommended.
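
Random search can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the search ranges, the function names, and especially `toy_validation_loss`, which is a hypothetical stand-in for training your model and scoring it on held-out data.

```python
import math
import random
from typing import Callable, Dict, Optional, Tuple


def sample_config(rng: random.Random) -> Dict[str, float]:
    """Draw one candidate configuration (ranges are illustrative assumptions)."""
    return {
        # Learning rates are commonly searched on a log scale.
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "dropout_rate": rng.uniform(0.0, 0.5),
    }


def random_search(
    validation_loss: Callable[[Dict[str, float]], float],
    num_trials: int,
    seed: int = 0,
) -> Tuple[Optional[Dict[str, float]], float]:
    """Return the sampled configuration with the lowest validation loss."""
    rng = random.Random(seed)
    best_config, best_loss = None, math.inf
    for _ in range(num_trials):
        config = sample_config(rng)
        loss = validation_loss(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


def toy_validation_loss(config: Dict[str, float]) -> float:
    """Hypothetical stand-in for 'train the model, then score it on held-out data'."""
    log_lr = math.log10(config["learning_rate"])
    return (log_lr + 2.5) ** 2 + (config["dropout_rate"] - 0.2) ** 2


if __name__ == "__main__":
    best_config, best_loss = random_search(toy_validation_loss, num_trials=50)
    print(best_config, best_loss)
```

In practice you would replace `toy_validation_loss` with a function that fits your model using the sampled values and returns its error on a validation set.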

I highly recommend you also look closely at the Keras API, which has many examples specifically addressing the type of modeling problem we’re focused on. As you prepare the dataset, you might consider re-formatting it to reflect temporal dependencies. If you decide to pursue this direction, here’s a resource that focuses specifically on this subject.
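
One common way to re-format a series so that it reflects temporal dependencies is to cut it into sliding windows, where each window of past values is paired with the value that follows it. Here is a minimal sketch in plain Python. The helper name `make_windows` and the temperature values are made up for illustration and are not taken from the Datathon dataset.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], window_size: int
) -> Tuple[List[List[float]], List[float]]:
    """Split a series into (past window, next value) training pairs."""
    if window_size < 1 or window_size >= len(series):
        raise ValueError("window_size must be between 1 and len(series) - 1")
    inputs, targets = [], []
    for start in range(len(series) - window_size):
        # The window holds `window_size` consecutive values ...
        inputs.append(list(series[start:start + window_size]))
        # ... and the target is the value immediately after the window.
        targets.append(series[start + window_size])
    return inputs, targets


if __name__ == "__main__":
    monthly_temperature = [30.0, 34.0, 45.0, 55.0, 64.0, 73.0]  # made-up values
    windows, next_values = make_windows(monthly_temperature, window_size=3)
    for window, target in zip(windows, next_values):
        print(window, "->", target)
```

Recurrent layers in Keras expect input shaped as (samples, timesteps, features), so windows like these would still need to be converted to an array with a trailing feature dimension before being passed to a model.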

Step 2. Read advanced domain-specific background information
It might also be useful to get domain-specific knowledge to help you with the process of formulating a model. To help you get started, we have a tutorial from Marcus Voss and Nikola Milojevic-Dupont, domain experts at CCAI!

Step 3. Consider Research
Last year, we had a tremendous response to the Research Phase II of the Datathon. In response to the interest, we have significantly expanded Phase II this year. I recommend that you think about research questions that might emerge from your exploration of this data that you could pursue in Phase II.

Finally, have fun! This exercise is really meant as an introduction to this vast field. We hope that, now that your curiosity is piqued, you will explore this area more fully.

Ready to try it yourself? Sign up to take the challenge!

A discussion by Sharada Kalanidhi, WiDS Datathon Co-Chair and Data Scientist at the Stanford Genome Technology Center (Department of Biochemistry), Stanford University School of Medicine. Her research interests are mathematical and statistical analysis of multi-omics data.