Question 1

A decision tree is built to determine individuals likely to default on an unsecured loan. The null hypothesis states that an individual will not default on the loan. The decision tree correctly classifies 80% of the instances in a test dataset. Fifteen percent of the mistakes made by the model are type 1 errors. What can be said about the performance of the model?

A) The accuracy of the model for correctly determining those individuals who did not default on their loan was at least 75%.
B) The accuracy of the model for correctly determining those individuals who defaulted on their loan was at least 75%.
C) The majority of errors made by the model accepted individuals who defaulted.
D) The majority of errors made by the model rejected individuals who did not default.
E) More than one of A, B, C, or D is correct.

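Accepted Answer

C. A type 1 error rejects a true null hypothesis, which here means rejecting an individual who did not default. Only 15% of the errors are of this type, so the remaining 85% are type 2 errors, in which the model accepted individuals who went on to default. The per-class accuracies in A and B cannot be determined without knowing how many test instances belong to each class.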

Question 2

The average squared difference between classifier predicted output and actual output.

A) mean squared error
B) root mean squared error
C) mean absolute error
D) mean relative error

Accepted Answer

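Mean squared error (choice A) is the appropriate choice, as it measures the average squared difference between predicted and actual values. It is a commonly used metric in regression analysis and is computed as the sum of the squared differences between predicted and actual values divided by the total number of observations: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$, where $\hat{y}_i$ is the predicted value, $y_i$ is the actual value, and $n$ is the number of observations.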
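
The definition above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `mean_squared_error` and the sample values are hypothetical and not part of the question.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between predicted and actual values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    squared_diffs = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_diffs) / len(squared_diffs)


# Hypothetical values, chosen only for illustration.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

# Squared differences are 1, 0, 4, 1, so the mean is 6 / 4 = 1.5.
mse = mean_squared_error(actual, predicted)
print(mse)
```

Taking the square root of this value would give the root mean squared error (choice B), which is why the two options are easy to confuse.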

Question 3

We have performed a supervised classification on a dataset containing 100 test set instances. Eighty of the test set instances were correctly classified. The 95% test set accuracy confidence boundaries are:

A) 76% and 84%
B) 72% and 88%
C) 78% and 82%
D) 70% and 90%

Accepted Answer

The 95% confidence interval for a proportion can be calculated with the normal approximation to the binomial proportion confidence interval, $p \pm 1.96\sqrt{\frac{p(1-p)}{n}}$, where $p$ is the sample proportion and $n$ is the sample size. Here, $p = \frac{80}{100} = 0.8$ and $n = 100$, so the standard error is $\sqrt{\frac{0.8(0.2)}{100}} = 0.04$ and the interval is $0.8 \pm 1.96(0.04) = 0.8 \pm 0.0784$, which is approximately 0.72 to 0.88, or 72% to 88% (choice B).
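
The arithmetic can be checked with a short Python sketch. It assumes the normal approximation with a critical value of 1.96; the function name `accuracy_confidence_interval` is hypothetical.

```python
import math


def accuracy_confidence_interval(correct, n, z=1.96):
    """Normal-approximation confidence interval for a test set accuracy.

    correct: number of correctly classified test instances
    n:       total number of test instances
    z:       critical value (1.96 is assumed here for a 95% interval)
    """
    p = correct / n
    standard_error = math.sqrt(p * (1 - p) / n)
    return p - z * standard_error, p + z * standard_error


# 80 of 100 test instances correct, as in the question.
lower, upper = accuracy_confidence_interval(80, 100)
print(f"{lower:.4f} to {upper:.4f}")
```

The bounds come out at roughly 0.7216 and 0.8784, which round to the 72% and 88% in choice B. The approximation is usually considered reasonable for a sample of this size, but it becomes less reliable for small test sets or accuracies close to 0 or 1.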

Question 4

The standard error is defined as the square root of this computation.

A) The sample variance divided by the total number of sample instances.
B) The population variance divided by the total number of sample instances.
C) The sample variance divided by the sample mean.
D) The population variance divided by the sample mean.

Accepted Answer

The standard error is the sample standard deviation divided by the square root of the sample size ($n$). The sample standard deviation $s$ is the square root of the sample variance $s^2$, which is the sum of the squared deviations from the mean divided by $(n-1)$. Therefore, $SE = \frac{s}{\sqrt{n}} = \sqrt{\frac{s^2}{n}}$: the standard error is the square root of the sample variance divided by the total number of sample instances (choice A).
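
As a minimal sketch, the computation can be written with Python's standard `statistics` module, whose `variance` function returns the sample variance (dividing by n - 1). The function name `standard_error` and the sample values are hypothetical.

```python
import math
import statistics


def standard_error(sample):
    """Square root of (sample variance / number of sample instances)."""
    n = len(sample)
    sample_variance = statistics.variance(sample)  # divides by n - 1
    return math.sqrt(sample_variance / n)


# Hypothetical sample, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
se = standard_error(sample)

# Equivalent form: sample standard deviation divided by sqrt(n).
se_alt = statistics.stdev(sample) / math.sqrt(len(sample))
print(se, se_alt)
```

Both forms give the same value, which shows that dividing the variance by n inside the square root is the same as dividing the standard deviation by the square root of n.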

Question 5

If a real-valued attribute is normally distributed, we know that approximately 95% of all attribute values lie within

A) one standard deviation of the mean.
B) two standard deviations of the mean.
C) three standard deviations of the mean.
D) four standard deviations of the mean.

Accepted Answer

According to the empirical rule, approximately 95% of the values of a normally distributed attribute lie within two standard deviations of the mean (choice B).
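
The empirical rule can be checked numerically. The sketch below uses the identity P(|Z| <= k) = erf(k / sqrt(2)) for a standard normal variable Z; the function name `fraction_within` is hypothetical.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Coverage for each of the four answer choices.
coverage = {k: fraction_within(k) for k in (1, 2, 3, 4)}
for k, fraction in coverage.items():
    print(f"within {k} standard deviation(s): {fraction:.4f}")
```

One standard deviation covers roughly 68% of values, two cover roughly 95%, and three cover roughly 99.7%, so only choice B matches the 95% figure in the question.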

Question 6

The correlation coefficient for two real-valued attributes is -0.85. What does this value tell you?

A) The attributes are not linearly related.
B) As the value of one attribute increases the value of the second attribute also increases.
C) As the value of one attribute decreases the value of the second attribute increases.
D) The attributes show a curvilinear relationship.

Accepted Answer

The negative sign of the correlation coefficient (-0.85) indicates that as the value of one attribute decreases, the value of the other attribute tends to increase. Its magnitude, close to 1, indicates a strong linear relationship, which rules out choice A. Therefore, the best choice is C, which states that "As the value of one attribute decreases the value of the second attribute increases."
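
To make the sign of the coefficient concrete, here is a minimal sketch of the Pearson correlation coefficient. The function name `pearson_correlation` and the data are hypothetical; the values were chosen so that one attribute mostly falls as the other rises.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical attribute values: ys mostly decreases as xs increases.
xs = [1, 2, 3, 4, 5]
ys = [10, 9, 7, 8, 3]
r = pearson_correlation(xs, ys)
print(r)
```

The result is strongly negative, reflecting that high values of one attribute pair with low values of the other. Reversing the order of `ys` would flip the sign without changing the magnitude.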

Question 7

Data used to optimize the parameter settings of a supervised learner model.

A) training
B) test
C) verification
D) validation

Accepted Answer

The process of optimizing parameter settings of a supervised learner model is accomplished through a technique called hyperparameter tuning, which involves adjusting the parameters of the machine learning model and evaluating the performance of the model on a validation dataset. The validation dataset is used to fine-tune the model and find the best hyperparameters to produce the best results on unseen data. Therefore, the most appropriate choice for data used in hyperparameter tuning is validation data (choice D).

Question 8

The hypothesis of no significant difference.

A) nil
B) invalid
C) null
D) void

Accepted Answer

The hypothesis of no significant difference is commonly referred to as the null hypothesis. It assumes that there is no significant difference between two groups or variables being compared. Therefore, the correct answer is C - null.

Question 9

The correlation between the number of years an employee has worked for a company and the salary of the employee is 0.75. What can be said about employee salary and years worked?

A) There is no relationship between salary and years worked.
B) Individuals that have worked for the company the longest have higher salaries.
C) Individuals that have worked for the company the longest have lower salaries.
D) The majority of employees have been with the company a long time.
E) The majority of employees have been with the company a short period of time.

Accepted Answer

The positive correlation coefficient (0.75) indicates that there is a positive relationship between years worked and salary. This means that as the number of years an employee works for the company increases, their salary tends to increase as well. Therefore, individuals that have worked for the company the longest tend to have higher salaries (choice B). The correlation says nothing about how long most employees have been with the company, so D and E do not follow.

Question 10

Unsupervised evaluation can be internal or external. Which of the following is an internal method for evaluating alternative clusterings produced by the K-Means algorithm?

A) Use a production rule generator to compare the rule sets generated for each clustering.
B) Compute and compare class resemblance scores for the clusters formed by each clustering.
C) Compare the sum of squared error differences between instances and their corresponding cluster centers for each alternative clustering.
D) Create and compare the decision trees determined by each alternative clustering.

Accepted Answer

Comparing the sum of squared error differences between instances and their corresponding cluster centers is an internal method for evaluating alternative clusterings produced by the K-Means algorithm. This method does not require external knowledge or datasets. Options A, B, and D are all external methods as they require comparing the clustering results to external sources.
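
The sum of squared error comparison can be sketched as follows. This is an illustration under stated assumptions: each cluster center is taken to be the mean of the cluster's instances (as K-Means does at convergence), and the function name `sum_squared_error` and both clusterings are hypothetical.

```python
def sum_squared_error(clusters):
    """Total squared distance between instances and their cluster centers.

    clusters: a list of clusters, each a non-empty list of equal-length
    numeric tuples. Each center is assumed to be the cluster mean.
    """
    total = 0.0
    for points in clusters:
        dims = len(points[0])
        center = [sum(p[d] for p in points) / len(points) for d in range(dims)]
        total += sum((p[d] - center[d]) ** 2 for p in points for d in range(dims))
    return total


# Two hypothetical clusterings of the same four instances.
clustering_a = [[(1.0, 1.0), (1.0, 2.0)], [(5.0, 5.0), (6.0, 5.0)]]
clustering_b = [[(1.0, 1.0), (5.0, 5.0)], [(1.0, 2.0), (6.0, 5.0)]]

sse_a = sum_squared_error(clustering_a)
sse_b = sum_squared_error(clustering_b)
print(sse_a, sse_b)
```

The clustering with the smaller total groups nearby instances together and would be preferred by this criterion. The measure uses only the instances and their cluster assignments, which is what makes it an internal evaluation method.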

Question 11

We have built and tested two supervised learner models, M1 and M2. We compare the test set accuracy of the models with the classical hypothesis testing paradigm at a 95% confidence level.
The computed value of P is 2.53. What can we say about this result?

A) Model M1 performs significantly better than M2.
B) Model M2 performs significantly better than M1.
C) Both models perform at the same level of accuracy.
D) The models differ significantly in their performance.
E) More than one of A, B, C, or D is correct.

Accepted Answer

The answer of We have built and tested two supervised...

Question 12

Bootstrapping allows us to

A) choose the same training instance several times.
B) choose the same test set instance several times.
C) build models with alternative subsets of the training data several times.
D) test a model with alternative subsets of the test data several times.

Accepted Answer

The answer of Bootstrapping allows us to&#10;A) choose the same...

Question 13

Selecting data so as to assure that each class is properly represented in both the training and test set.

A) cross validation
B) stratification
C) verification
D) bootstrapping

Accepted Answer

The answer of Selecting data so as to assure that...

Deck 7: Formal Evaluation Techniques