Sprint 1 (40%) – Data Exploration and Cleaning
Use the data provided to answer the following questions:
1. Which variables are continuous/numerical? Which are ordinal? Which are nominal?
2. Calculate the following summary statistics: mean, median, max and standard deviation for each of the continuous variables, and count (frequency count) for each categorical variable. Is there any evidence of extreme values (outliers)? Briefly discuss.
3. Plot a histogram for each of the continuous variables and create summary statistics. Based on the histograms and summary statistics, answer the following and provide brief explanations:
a. Which variables have the largest variability?
b. Which variables seem skewed?
c. How should skewed data be dealt with? Compare the transformation results.
4. Are there any categorical values that need to be transformed into numerical values?
a. Point out the categorical variables that need to be transformed.
b. Suggest the best possible transformation.
c. Use this method to transform the variable(s).
5. Which, if any, of the variables have missing values?
a. What are the methods of handling missing values?
b. Apply these methods and demonstrate the output (summary statistics and transformation plot).
c. Which method of handling missing values is most suitable for this data set? Discuss briefly referring to the data set.
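The checks asked for in Q2 to Q5 can be sketched in pandas as follows. This is only an illustrative sketch, not a model answer: it runs on synthetic data, the column names (income, age, education, region) are hypothetical stand-ins for the assignment variables, and the 1.5 × IQR rule, log transform, ordinal mapping and median/mean imputation are common choices assumed here rather than prescribed methods.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the assignment data; every column name is hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=n),           # right-skewed, continuous
    "age": rng.normal(40, 10, size=n),                           # roughly symmetric, continuous
    "education": rng.choice(["low", "medium", "high"], size=n),  # ordinal
    "region": rng.choice(["north", "south", "east"], size=n),    # nominal
})
# Inject 20 missing values so the Q5 steps have something to handle.
df.loc[rng.choice(n, size=20, replace=False), "income"] = np.nan

continuous = ["income", "age"]

# Q2: summary statistics for continuous variables, frequency counts for categorical ones.
summary = df[continuous].agg(["mean", "median", "max", "std"])
education_counts = df["education"].value_counts()

# Q2: flag candidate outliers with the 1.5 * IQR rule (one common convention).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

# Q3: compare skewness before and after a log transform.
skew_raw = df["income"].skew()
skew_log = np.log1p(df["income"]).skew()

# Q4: ordinal variables get an ordered integer mapping; nominal ones get one-hot columns.
education_order = {"low": 0, "medium": 1, "high": 2}
df["education_code"] = df["education"].map(education_order)
region_dummies = pd.get_dummies(df["region"], prefix="region")

# Q5: locate missing values, then compare deletion with median and mean imputation.
missing_counts = df.isna().sum()
dropped = df.dropna(subset=["income"])
median_filled = df["income"].fillna(df["income"].median())
mean_filled = df["income"].fillna(df["income"].mean())

print(summary)
print(missing_counts)
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

For a skewed variable the median is usually the safer fill value, because the mean is pulled towards the long tail; comparing the summary statistics of `median_filled` and `mean_filled` against `dropped` makes that difference visible.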
Sprint 1 Deliverables:
1. Prepare the answers for the Sprint 1 questions (Q1-Q5) in Word/PDF format.
The scrum master of the team should submit this to the LMS ‘Assignment 01 – Sprint 1’ submission portal before the Sprint 1 closure (i.e., 7:15 PM on Wednesday or 3:15 PM on Thursday). Only the scrum master needs to submit the document.
Please note that the submission portal will close at the Sprint 1 deadline, i.e., 7:15 PM/3:15 PM. Late submissions will not be accepted.
2. Present the answers for the Sprint 1 questions to the evaluation panel (in 3 minutes). Evaluation will commence after the Sprint 1 closure. You may move on to Sprint 2 as soon as Sprint 1 ends.
Sprint 2 (60%) – Building predictive models
1. Feature Selection (10%)
a. Evaluate the correlations between the variables. Which variables should be selected for dimension reduction? Explain. Carry out dimensionality reduction.
b. Explore the distribution of selected variables against the target variable. Explain.
2. Regression Modelling (20%)
a. Build a regression model with the selected variables.
b. Evaluate the regression model and carry out feature selection to build a better regression model. You need to try out at least 3 regression models to identify the optimal model.
c. Compare these regression models based on evaluation metrics and provide the formula for each regression model.
3. Decision Tree Modelling (20%)
a. Build a decision tree with the selected variables.
b. Evaluate the decision tree model and carry out pruning to build a better decision tree model. You need to try out at least 3 decision trees to obtain the optimal tree.
c. Compare these decision tree models based on evaluation metrics and provide the tree plot for each model and explain the outputs.
4. Model Comparison (10%)
a. Compare the accuracy of the selected (optimal) regression model and the selected (optimal) decision tree, then discuss and justify which is the most suitable predictive model for the business case.
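The Sprint 2 workflow (correlation screening, a regression model, pruned decision trees, and a comparison on held-out data) might look like the sketch below. It rests on several assumptions that the brief does not state: the target is continuous, scikit-learn is the modelling library, RMSE is the evaluation metric, a 0.9 correlation threshold marks redundant variables, and cost-complexity pruning (ccp_alpha) is the pruning method. The data are synthetic and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with hypothetical names; x1_copy is a near-duplicate of x1.
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
X["x1_copy"] = X["x1"] + rng.normal(scale=0.05, size=n)
y = 3 * X["x1"] - 2 * X["x2"] + rng.normal(scale=0.5, size=n)

# Feature selection: drop one variable from each highly correlated pair.
corr = X.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col].abs() > 0.9).any()]
X_sel = X.drop(columns=to_drop)

X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.3, random_state=0
)

def rmse(model):
    """Root mean squared error of a fitted model on the held-out test set."""
    pred = model.predict(X_test)
    return float(np.sqrt(mean_squared_error(y_test, pred)))

# Regression model; the fitted formula is intercept + sum(coef_i * x_i).
linreg = LinearRegression().fit(X_train, y_train)
rmse_linreg = rmse(linreg)
formula = dict(zip(X_sel.columns, linreg.coef_))

# Decision trees: a larger ccp_alpha prunes more aggressively (fewer leaves).
tree_results = {}
for alpha in [0.0, 0.01, 0.1]:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    tree.fit(X_train, y_train)
    tree_results[alpha] = (tree.get_n_leaves(), rmse(tree))

print("dropped:", to_drop)
print("linear regression RMSE:", round(rmse_linreg, 3))
print("intercept:", round(float(linreg.intercept_), 3), "coefficients:", formula)
for alpha, (leaves, err) in tree_results.items():
    print(f"tree ccp_alpha={alpha}: leaves={leaves}, RMSE={err:.3f}")
```

If the target in the provided data is categorical instead, the same structure applies with a classifier (for example logistic regression and DecisionTreeClassifier) and accuracy as the metric.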
Sprint 2 Deliverables:
1. Present your findings and actionable insights to the evaluation panel (in 4 minutes). The presentation may contain 5 slides, including:
i. Title slide
ii. Model development – Regression modeling
iii. Model development – Decision tree modeling
iv. Model evaluation and selection
v. Team Roles (roles and responsibility of each team member)
Evaluation will commence after the Sprint 2 closure (i.e., 8:30 PM on Wednesday or 4:30 PM on Thursday). No further work after the closure is accepted.
2. The scrum master should submit the presentation slides to the LMS ‘Assignment 01 – Sprint 2’ submission portal before the Sprint 2 closure. The presentation slides will be downloaded from the LMS for the presentation.
Only the scrum master needs to submit the slides.
Please note that the submission portal will close at the Sprint 2 deadline, i.e., 8:30 PM/4:30 PM. Late submissions will not be accepted.