Question 0 - Hello World!
We've included a little "Hello World!" example. There will be an accompanying video on Moodle (the link will be posted on the forum as an announcement) which you can follow to learn how to use this Jupyter Notebook.
Even if you've used Jupyter before, it's highly recommended that you watch the video and go through this introductory exercise, because this assignment uses auto-marking and requires you to follow a certain (straightforward) convention.
Question 0.a - Saying Hello in Markdown
For this assignment, you will need to answer the questions in the "cells" below each question. Some questions will require written work (which you can either do on paper and scan, or do in the cell using Markdown), and some will require R code (which you must do using R code in the provided cells).
Watch through the video to see how this should be done.
Question 0.b - Saying Hello in R
You will need to write some R code in this assignment. In this section, save a variable called "hello.world" (don't include the quotes in the variable name) and set it to the value "Hello World!". Then run the cell below the one you wrote your code in to verify that your answers have been registered and given the correct variable names.
Like before, follow the video tutorial to see how this is done.
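As a minimal sketch of what the code cell for this question might contain (the video remains the reference for the exact convention the auto-marker expects):

```r
# Assign the string to a variable named hello.world (no quotes in the name).
hello.world <- "Hello World!"

# Printing the variable is a quick way to confirm its value
# before running the verification cell.
print(hello.world)
```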
Question 1 - Probabilities
Suppose we are playing a simple collectable card game (e.g. Hearthstone or Magic: The Gathering). In this game, each player has a card deck which contains 30 cards (with no duplicate cards). At the start of this game, both players shuffle their decks. Then the player going first draws five cards, and the player going second draws six cards. After this, the game starts, and players alternate turns.
Each player draws an additional card at the start of their turn. So, for example, after their third turn player one should have drawn eight cards in total (the 5 cards they started with, plus another three cards over three turns). Player two should have drawn nine cards in total after their third turn.
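The draw counts described above can be sketched in R as follows. The helper name `cards.drawn` is hypothetical and only illustrates the arithmetic; it is not something the assignment requires.

```r
# Hypothetical helper (not part of the assignment): total cards drawn by a
# player once they have taken the given number of turns.
cards.drawn <- function(turns, starting.hand) {
  starting.hand + turns  # one extra card at the start of each turn
}

cards.drawn(turns = 3, starting.hand = 5)  # player one: 8
cards.drawn(turns = 3, starting.hand = 6)  # player two: 9
```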
For the following questions, suppose there is a special combination of five cards, and if you have those five cards in your hand you instantly win the game.
A bit of help
As a little hint for some of the questions below, you're reminded that if you have some product $n \times (n-1) \times (n-2) \times \dots \times (n-k)$, we can express this as $n!/(n-k-1)!$. That is,
$$n(n-1)(n-2)(n-3)\dots(n-k) = \frac{n!}{(n-k-1)!}$$
This is because we can think of the product $n \times (n-1) \times (n-2) \times \dots \times (n-k)$ as $n!$, which is $n \times (n-1) \times (n-2) \times \dots \times 2 \times 1$, but with the last $(n-k-1)$ terms removed. Because this is a multiplication of terms, we can think of removing terms as the same as dividing by them, meaning that
$$n \times (n-1) \times (n-2) \times \dots \times (n-k) = \frac{n \times (n-1) \times (n-2) \times \dots \times 2 \times 1}{(n-k-1)(n-k-2)\dots(2)(1)} = \frac{n!}{(n-k-1)!}$$
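If you would like to convince yourself of the identity $n(n-1)\dots(n-k) = n!/(n-k-1)!$, a quick numerical check in R might look like the sketch below. The values of `n` and `k` are arbitrary illustrative choices.

```r
n <- 10  # illustrative values only
k <- 3

lhs <- prod(n:(n - k))                      # 10 * 9 * 8 * 7
rhs <- factorial(n) / factorial(n - k - 1)  # 10! / 6!

c(lhs = lhs, rhs = rhs)
isTRUE(all.equal(lhs, rhs))
```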
What is the probability that the first player will draw this combination on their first turn and win the game immediately? What about the second player?
What is the probability that the five cards required for victory are all at the bottom of a player's deck (i.e. they are the last five cards in their deck)?
Suppose a player has drawn 15 cards from their deck. What is the probability that all of the cards in the winning combination are still in their deck?
Suppose a player has drawn $n$ cards from their deck, where $n$ is between 0 and 30. What is the probability that all of the cards in the winning combination are still in their deck (i.e. that they have not drawn any piece of the winning combination yet) in terms of $n$?
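One way to sanity-check a written answer to the "still in the deck" questions is to compare it against a simulation. The sketch below is only one possible approach. The function names, the seed, the number of simulations, and the labelling of the winning cards as 1 to 5 are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

deck.size <- 30
winning   <- 1:5  # label the five winning cards 1 to 5 (the labelling is arbitrary)

# Estimate, by simulation, the probability that none of the winning cards
# has been drawn after n.drawn cards.
simulate.all.in.deck <- function(n.drawn, n.sims = 20000) {
  still.in.deck <- replicate(n.sims, {
    drawn <- sample(deck.size, n.drawn)  # draw without replacement from a shuffled deck
    !any(drawn %in% winning)
  })
  mean(still.in.deck)
}

# Counting argument: hands of size n.drawn that avoid the winning cards,
# divided by all possible hands of size n.drawn.
exact.all.in.deck <- function(n.drawn) {
  choose(deck.size - length(winning), n.drawn) / choose(deck.size, n.drawn)
}

simulate.all.in.deck(15)
exact.all.in.deck(15)
```

The two numbers should agree to within simulation noise. Note that you are still expected to show the written reasoning behind your answer.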
Question 2 - PDFs and Expectations
Suppose we have defined a probability density function for a random variable $X$ as follows:
$$p(x) = c x^2, \quad 0 \le x \le \alpha$$
Notice that our PDF has two constants, $\alpha$ and $c$. $\alpha$ is a parameter, and $c$ is a coefficient which we will carefully choose so the integral of $c x^2$ between $0$ and $\alpha$ (with respect to $x$) is equal to $1$.
Suppose $\alpha = \dots$. Find the value of $c$ which would cause the integral of $p(x)$ from $0$ to $\alpha$ with respect to $x$ to be equal to $1$. That is, find $c$ such that
$$\int_0^{\alpha} c x^2 \, dx = 1$$
Find the value of $c$ for a general value of $\alpha$. That is, find $c$ such that the following holds (you can do this in a way similar to how you answered question 2.a).
$$\int_0^{\alpha} c x^2 \, dx = 1$$
Suppose = 3 and = 1. Find $E(X)$, the expected value of our variable $X$.
Suppose = 3 and = 1. Find $\mathrm{Var}(X)$, the variance of our variable $X$.
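Written answers to this question can be checked numerically with R's `integrate` function. The sketch below assumes the density has the form $p(x) = c x^2$ on $[0, \alpha]$ and uses `alpha <- 2` purely as an illustrative value (it is not taken from the assignment). The names `shape`, `c.const` and `p` are likewise just for this example.

```r
alpha <- 2  # illustrative value only, not taken from the assignment

# Unnormalised shape of the assumed density on [0, alpha].
shape <- function(x) x^2

# c must make the total area equal to 1, so c = 1 / (area under the shape).
c.const <- 1 / integrate(shape, lower = 0, upper = alpha)$value

# The normalised density.
p <- function(x) c.const * shape(x)

total.area <- integrate(p, 0, alpha)$value                       # should be 1
mean.x     <- integrate(function(x) x * p(x), 0, alpha)$value    # E(X)
mean.x2    <- integrate(function(x) x^2 * p(x), 0, alpha)$value  # E(X^2)
var.x      <- mean.x2 - mean.x^2                                 # Var(X) = E(X^2) - E(X)^2

c(c = c.const, area = total.area, mean = mean.x, var = var.x)
```

Swapping in the values given in each part lets you compare the numerical result with your algebra.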
Question 3 - Distributions
Suppose we are given the following information:
You are modelling the number of people visiting a particular doctor's office within a day, with the hope of identifying a disease outbreak in the local area of the doctor
It is known that, on an average day, 30 patients will see this doctor, with a standard deviation of 3 patients per day
Describe a model you might use to model the number of patients on a given day (there might be more than one choice, so pick one and justify it). Also give the parameters of this model based on the given information.
On one particular day, 45 patients visit the doctor. Considering the model you developed in your answer to the previous question, do you think that this number of patients in a given day is cause for alarm? Use calculations to back up your answer by determining the probability of seeing 45 or more patients in a given day.
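As a rough sketch of how the tail probability could be computed in R, the code below tries two candidate models. Both are assumptions made for illustration: a normal model using the given mean and standard deviation, and a Poisson model using only the given mean (which implies a standard deviation of about 5.5 rather than 3). Your own choice of model and its justification may differ.

```r
mu    <- 30  # mean patients per day (given)
sigma <- 3   # standard deviation in patients per day (given)

# Assumption 1: normal model. P(X >= 45) is the upper tail beyond 45.
p.normal <- pnorm(45, mean = mu, sd = sigma, lower.tail = FALSE)

# Assumption 2: Poisson model with rate 30. For a count,
# P(X >= 45) is the same as P(X > 44).
p.poisson <- ppois(44, lambda = mu, lower.tail = FALSE)

c(normal = p.normal, poisson = p.poisson)
```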
Question 4 - Maximum Likelihood Estimation of Parameters
Suppose we are developing a new plant treatment which will (hopefully) improve crop yields. We have a dataset which contains weights for two candidate treatments, as well as a control group (which receives neither of the candidate treatments).
Suppose we want to create models for the weight of each group. You think a normal distribution would be suitable for this purpose, but a colleague has suggested that you should use a binomial distribution instead. Someone else has proposed using a uniform distribution.
For both the binomial and uniform distributions, explain whether they would be a good choice (justifying your answer).
Also justify why using the normal distribution is a good choice here.
Suppose, rather than modelling the weights directly, we instead want to model the probability that a plant will grow to weigh over 6 units of weight, for each of the three treatments we are testing (treatments 1 and 2, and the control). Suggest a model that would be suitable for this purpose, and justify your choice.
After considering our answers to questions 4.a and 4.b, we have decided to model the weights directly (i.e. we will use the model discussed in question 4.a, not 4.b). To do this, we will create three models: one for each of the three groups. We will use normal distributions to model each of the three groups, and then compare the estimated means of each group.
We now have to decide how we will calculate our estimates of the mean ($\mu$) and standard deviation ($\sigma$) of each of our datasets. One approach is to use the maximum likelihood method, where we wish to maximize the likelihood of the data given the parameters $\mu$ and $\sigma$ (that is, we wish to find the values of $\mu$ and $\sigma$ which cause $P(x \mid \mu, \sigma)$ to be maximized). Note that maximizing something is the same as maximizing the log of that thing, because log (for any base greater than 1) is "monotonically increasing": that is, if $a > b$, then $\log(a) > \log(b)$. We're actually going to maximize the log-likelihood below.
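To make the idea concrete, here is a minimal sketch of maximizing a normal log-likelihood numerically in R. The `weights` vector is invented for illustration and is not the assignment's dataset. The function name `neg.log.lik`, the starting values, and the choice to optimise over $\log(\sigma)$ are also assumptions of this sketch.

```r
# Hypothetical weights, invented for illustration (not the assignment's dataset).
weights <- c(4.2, 5.1, 4.8, 5.6, 4.9, 5.3)

# Negative log-likelihood of a normal model. The second parameter is log(sigma),
# so the optimiser can never propose a negative standard deviation.
neg.log.lik <- function(par, x) {
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
}

# optim() minimises, so minimising the negative log-likelihood
# maximises the log-likelihood.
fit <- optim(par = c(median(weights), 0), fn = neg.log.lik, x = weights)

mu.hat    <- fit$par[1]
sigma.hat <- exp(fit$par[2])

# Compare the numerical estimates with simple summaries of the data.
c(mu.hat = mu.hat, sample.mean = mean(weights))
c(sigma.hat = sigma.hat, rms.deviation = sqrt(mean((weights - mean(weights))^2)))
```

Comparing the numerical estimates with these summaries is a useful check on any algebra you do by hand.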
A colleague of yours seems to think that maximizing the log-likelihood is the same as minimizing the mean absolute error. Another colleague disagrees, saying that they are misremembering and that maximizing the log-likelihood is the same as minimizing the mean squared error. Yet another colleague seems to believe that we minimize the negative log-likelihood by minimizing the log-cosh loss (since they both have the word "log" in them; you are not convinced by this argument).
One of your colleagues in Question 4.c is correct; which one does it appear to be based on our calculations in the previous question? Prove this colleague correct using algebra (you only have to prove them correct; you don't have to disprove the other two).
Given your maximum likelihood estimates for the mean of each population (and keeping in mind that we have a very small number of samples for each group), which treatment appears to work best?
Question 5 - Central Limit Theorem
Suppose our company is trialling a new production method for phone cases, based on 3D printing. 3D printing can be a volatile process, and the company has decided to accept the fact that there will be a certain proportion of failures out of the total number of 3D prints.
However, before committing to the new process, management would like to estimate the probability of failure by printing a number of phone cases. They have asked you how many cases they should print to ensure they have a reasonably good idea of the probability of failure.
The engineers developing the new production method assure management that the probability of failure is somewhere between 1% and 20%, but they are unwilling to make any guarantees beyond this without testing the method first.
We will model this problem with a binomial distribution. Justify why the binomial distribution is a good choice for this problem.
Suppose that we are considering three potential failure probabilities: 0.01, 0.05 and 0.2.
We are also considering three potential sizes for our test production run (i.e. the number of phone cases we will print in our test run): 50, 200 and 800.
For each combination of failure probability and number of cases printed, calculate the limiting distribution for the sample mean. You should calculate 9 limiting distributions in total.
For this question, do this using written calculations (i.e. not using R) and with the Central Limit Theorem.
Verify the results you obtained by hand in the previous question using R code.
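One possible way to lay out this verification is sketched below. It relies on the standard Central Limit Theorem result that the sample mean of $n$ independent Bernoulli($p$) trials is approximately normal with mean $p$ and standard deviation $\sqrt{p(1-p)/n}$. The data frame name `clt` and its column names are assumptions of this sketch.

```r
probs <- c(0.01, 0.05, 0.2)  # candidate failure probabilities
sizes <- c(50, 200, 800)     # candidate test run sizes

# One row per (p, n) combination: 3 x 3 = 9 rows.
clt <- expand.grid(p = probs, n = sizes)

# Parameters of the approximating normal distribution for the sample mean.
clt$mean <- clt$p
clt$sd   <- sqrt(clt$p * (1 - clt$p) / clt$n)

clt
```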
For each of the sample sizes and potential failure probabilities listed above, we now know the theoretical distribution by the Central Limit Theorem (we calculated this in Questions 5.b and 5.c). However, management is still not convinced and has asked us to develop a simulation which will experimentally demonstrate our calculations were correct.
R has a built-in function called rbinom, which takes three arguments (the number of simulations you want to run, the number of trials per simulation, and the probability of success for each trial). Hint: you are allowed to use the rbinom function, although you don't have to.
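A small sketch of a single `rbinom` call is shown below. Here a "success" in `rbinom`'s terminology stands for a failed print. The variable names, the seed, and the particular values (5 simulations of 200 prints at a 5% failure rate) are illustrative assumptions.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

n.sims  <- 5     # number of simulated test runs
n.cases <- 200   # phone cases printed per test run
p.fail  <- 0.05  # assumed true probability that a print fails

# Each element is the number of failed prints in one simulated test run.
failures <- rbinom(n.sims, size = n.cases, prob = p.fail)

# Dividing by the run size gives the observed failure proportion per run.
estimates <- failures / n.cases

failures
estimates
```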
We're presenting our findings to management; they have asked us to provide visualisations for our results. For each failure probability discussed above (0.01, 0.05 and 0.2) and for each potential sample size discussed above (50, 200, and 800), produce a histogram plot of the maximum likelihood estimates of the failure probability (calculated 50,000 times through 50,000 simulations).
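The sketch below shows one way the nine histograms might be produced. It uses the fact that, for a binomial model, the maximum likelihood estimate of the failure probability is the observed proportion of failures. The 3 x 3 plot layout, the seed, the titles and the axis labels are presentation choices assumed here, not requirements of the question.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

probs  <- c(0.01, 0.05, 0.2)
sizes  <- c(50, 200, 800)
n.sims <- 50000

old.par <- par(mfrow = c(3, 3))  # 3 x 3 grid: one panel per combination
for (p in probs) {
  for (n in sizes) {
    # MLE of the failure probability in each simulated test run.
    p.hat <- rbinom(n.sims, size = n, prob = p) / n
    hist(p.hat,
         main = paste0("p = ", p, ", n = ", n),
         xlab = "Estimated failure probability")
  }
}
par(old.par)  # restore the previous plotting layout
```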
Management has asked us to recommend how many tests they should run. Based on all the information we have computed, do you recommend 50, 200, 800, or even more tests than that? Justify your answer using relevant calculations and/or by referring to the above plots.
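One calculation that could support a recommendation is the approximate width of the interval in which the estimate is likely to fall. The sketch below uses the normal approximation from the Central Limit Theorem and a 95% level. Both the 95% level and the idea of comparing the half-width with the true probability are assumptions made for illustration, and other criteria are equally defensible.

```r
probs <- c(0.01, 0.05, 0.2)
sizes <- c(50, 200, 800)

# Approximate 95% half-width of the estimated failure probability
# under the normal approximation, for every (p, n) combination.
half.width <- outer(probs, sizes,
                    function(p, n) qnorm(0.975) * sqrt(p * (1 - p) / n))
dimnames(half.width) <- list(paste0("p=", probs), paste0("n=", sizes))

round(half.width, 4)

# Half-width relative to the true probability (rows correspond to probs).
round(half.width / probs, 2)
```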