Workshop 04/27/2022

Data science is evolving rapidly and is proving decisive in ERP (Enterprise Resource Planning). The datasets required to build analytical models are collected from many sources, such as government and academic repositories, web scraping, APIs, databases, files, and sensors. Such real-world data cannot be used for analysis directly because it is often inconsistent, incomplete, and likely to contain many errors. We often hear the phrase “garbage in, garbage out”. Dirty or messy data riddled with inaccuracies and errors results in a poorly trained model, which in turn can lead to poor business decisions and sometimes to outcomes that are hazardous to the domain. Even a powerful algorithm fails to provide a correct analysis when applied to bad data. Therefore, data must be curated, cleaned, and refined before it is used in data science and in products based on data science. This task is called “Data Preparation”, and it includes two methods: data pre-processing and data wrangling. Most data scientists spend the majority of their time on data preparation.

Data pre-processing converts raw, unstructured data into an understandable format, which most machine
learning algorithms require. It performs a pre-analysis of the data in order to transform it into a standard, normalized format.
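
As a minimal sketch of what "standard and normalized" can mean in practice, the snippet below fills a missing value with the column mean and then rescales the column to the range [0, 1]. The column `raw_ages` and both helper functions are hypothetical illustrations, not material from the workshop, and mean imputation with min-max scaling is only one of several reasonable choices.

```python
from statistics import mean


def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw column with one missing entry.
raw_ages = [20, None, 40, 30]

clean_ages = impute_missing(raw_ages)
scaled_ages = min_max_scale(clean_ages)

print(clean_ages)
print(scaled_ages)
```

Whether to impute, drop, or flag missing values depends on the data set and the downstream model, so the choice made here should be read as an assumption.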

Data wrangling, also known as data munging, is the process of discovering, cleaning, organizing, restructuring, and enriching
raw or complex data into a convenient format for further analysis and visualization. As the amount of unstructured data
grows, data wrangling becomes essential for making smarter and more accurate business decisions.

Data pre-processing is used before building an analytic model, while data wrangling is used to adjust data sets interactively while analysing data and building a model.

Thus, data preparation helps establish the quality of the data along several dimensions before it is used in data science: accuracy, completeness, consistency, timeliness, believability, and interpretability. When appropriate machine learning algorithms are applied to such quality data, they yield a more reliable analysis that can be used to make sound decisions.

The workshop will focus on basic to intermediate SQL. We will start with querying a database and using filters to clean the data, then move on to joining different tables, aggregate functions, and the use of ‘CASE WHEN’ for better query performance. We will cover subqueries and Common Table Expressions (CTEs) and compare the two, followed by window functions, including lead and lag functions and the scenarios in which they can be used. Finally, we will look at pivot tables and when not to use them!


– Query a database
– Data Cleaning made easy using filters
– Use and join different tables in a database
– Aggregate functions in SQL and using the ‘HAVING’ clause
– Advantages of using ‘CASE WHEN’ for query performance
– When to use subqueries and when to use Common Table Expressions
– Window functions in SQL
– ‘RANK’ and ‘DENSE_RANK’ functions – make everything easy
– ‘LEAD’ and ‘LAG’ functions and the scenarios where they can be useful
– Pivot tables and when not to use them
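
Several of the topics above can be sketched against a small in-memory database. The example below uses Python's built-in `sqlite3` module; the `customers` and `orders` tables and all of their rows are hypothetical and are not the workshop's data set. The window functions (`RANK`, `LAG`) assume the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    order_month TEXT,
    amount REAL
);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
INSERT INTO orders VALUES
    (1, 1, '2022-01', 100.0),
    (2, 1, '2022-02', 150.0),
    (3, 2, '2022-01', 80.0),
    (4, 2, '2022-02', NULL),
    (5, 3, '2022-01', 20.0);
""")

# Filter out incomplete rows, join two tables, aggregate, and keep only
# groups that pass the HAVING condition.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.amount IS NOT NULL
    GROUP BY c.name
    HAVING SUM(o.amount) >= 50
    ORDER BY total DESC
""").fetchall()

# A CTE feeding a CASE WHEN label and a RANK() window function.
ranked = conn.execute("""
    WITH customer_totals AS (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        WHERE amount IS NOT NULL
        GROUP BY customer_id
    )
    SELECT c.name,
           CASE WHEN t.total >= 100 THEN 'high' ELSE 'low' END AS tier,
           RANK() OVER (ORDER BY t.total DESC) AS rnk
    FROM customer_totals AS t
    JOIN customers AS c ON c.id = t.customer_id
    ORDER BY rnk
""").fetchall()

# LAG() exposes the previous month's amount on the current row.
month_over_month = conn.execute("""
    SELECT order_month,
           amount,
           LAG(amount) OVER (
               PARTITION BY customer_id ORDER BY order_month
           ) AS prev_amount
    FROM orders
    WHERE customer_id = 1
    ORDER BY order_month
""").fetchall()

print(totals)
print(ranked)
print(month_over_month)
```

The same aggregation could be written as a subquery in the `FROM` clause instead of a CTE; the CTE form is used here only because it names the intermediate result.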

A propensity model attempts to estimate the propensity (probability) of a behavior (e.g., conversion, churn, or purchase) occurring within a well-defined future time period, based on historical data. It is a technique widely used by organizations and marketing teams to deliver targeted messages, products, or services to customers. This workshop shares an open-sourced package developed by Google for building an end-to-end propensity modeling solution using datasets such as GA360, Firebase, or CRM, and for using the propensity predictions to design, activate, and measure the impact of a media campaign. The package has enabled companies in e-commerce, retail, gaming, CPG, and other industries to make accelerated data-driven marketing decisions.
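
To make the idea of a propensity score concrete, here is a minimal sketch that fits a logistic regression by gradient descent and returns a probability for a new customer. It does not use or reproduce the Google package described above; the features (a standardized session count and a cart flag), the data, and all function names are hypothetical, and logistic regression is simply one common choice of model.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def propensity(x, w, b):
    """Estimated probability that the behavior occurs for feature vector x."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical history: [standardized session count, added_to_cart flag]
# with label 1 if the customer purchased in the following period.
X = [[-1.5, 0], [-0.9, 0], [-0.3, 0], [0.3, 1], [0.9, 1], [1.5, 1]]
y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(X, y)
p_low = propensity([-1.5, 0], w, b)
p_high = propensity([1.5, 1], w, b)
print(p_low, p_high)
```

In a campaign setting, scores like `p_high` and `p_low` would typically be used to rank or segment customers; how the thresholds are chosen is outside the scope of this sketch.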

Neural networks have been widely celebrated for their power to solve difficult problems across a number of domains. We explore an approach for leveraging this technology within a statistical model of customer choice. Conjoint-based choice models are used to support many high-value decisions at GM. In particular, we test whether using a neural network to model customer utility enables us to better capture non-compensatory behavior (i.e., decision rules where customers only consider products that meet acceptable criteria) in the context of conjoint tasks. We find the neural network can improve hold-out conjoint prediction accuracy for synthetic respondents exhibiting non-compensatory behavior only when trained on very large conjoint data sets. Given the limited amount of training data (conjoint responses) available in practice, a mixed logit choice model with a traditional linear utility function outperforms the choice model with the embedded neural network.
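
For readers unfamiliar with choice models, the sketch below shows how a linear utility function turns product attributes into choice probabilities in a plain multinomial logit model. It is not the model from the case study: a mixed logit additionally lets the weights vary across respondents, and the neural-network variant replaces the linear utility with a learned function. The attributes, weights, and function names are hypothetical.

```python
import math


def linear_utility(attributes, weights):
    """Compensatory utility: a weighted sum of the product's attributes."""
    return sum(w * a for w, a in zip(weights, attributes))


def choice_probabilities(alternatives, weights):
    """Multinomial logit: softmax over the utilities of the alternatives."""
    utilities = [linear_utility(a, weights) for a in alternatives]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(utilities)
    exp_u = [math.exp(u - top) for u in utilities]
    total = sum(exp_u)
    return [e / total for e in exp_u]


# Hypothetical conjoint task with three products described by
# [price in units of $10k, driving range in units of 100 miles].
alternatives = [[3.0, 2.5], [4.0, 4.0], [2.5, 1.5]]
weights = [-1.0, 1.0]  # assumed: price lowers utility, range raises it

probs = choice_probabilities(alternatives, weights)
print(probs)
```

Because the utility is a weighted sum, a weakness in one attribute can always be offset by strength in another. That is the compensatory assumption that non-compensatory decision rules violate.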

Event Program

April 27, 2022

*All times are UTC -8

Workshop Instructors

Sreelaxmi Chakkadath

Data Science Master's Student, Indiana University Bloomington

Shalini Pochineni

Senior Data Scientist, Google

Xi Li

Data Scientist, Google

Lingling Xu

Marketing Data Scientist, Google

Bingjie Xu

Marketing Data Scientist, Google

Kathryn Schumacher

Staff Researcher, Chief Data and Analytics Office, General Motors


Demystifying Data Pre-processing & Data Wrangling for Data Science | Pariza Kamboj

TOPICS: Algorithms, Data Science as a Career, Data Wrangling, Software Design and Engineering, Values

Basic to Intermediate Level SQL | Sreelaxmi Chakkadath

TOPICS: Algorithms, Data Generation/Collection, Foundations (Mathematics/Statistics), Software Design and Engineering, Values

Open-sourced Propensity Model Package: Accelerating Data-Driven Decisions (Workshop #1) | Google

TOPICS: Algorithms, Data Generation/Collection, Values

Predicting customer choice: A case study on integrating AI within a discrete choice model | Kathryn

TOPICS: Algorithms, Data Generation/Collection, Data Wrangling, Foundations (Mathematics/Statistics), Software Design and Engineering, Values