Question 1

Which is NOT a primary option for addressing missing data?

A)To discard observations with any missing values
B)To discard any variable with missing values
C)To fill in missing entries with estimated values
D)To generate random data to replace the missing values

Accepted Answer

Generating random data to replace missing values is not a primary method for handling missing data, as it can introduce bias and inaccuracies. The other options are standard techniques used to address missing data issues.
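
The three standard options (A, B, and C) can be sketched in plain Python. This is only an illustrative sketch: the records, the field names, and the helper `impute_mean` are hypothetical, and mean imputation is just one of several possible ways to estimate a missing entry.

```python
from statistics import mean

# Hypothetical records; None marks a missing entry.
rows = [
    {"age": 34, "income": 52000, "tenure": 5},
    {"age": None, "income": 61000, "tenure": 8},
    {"age": 45, "income": 58000, "tenure": 3},
    {"age": 29, "income": 47000, "tenure": 1},
]

# Option A: discard observations (rows) with any missing value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Option B: discard variables (columns) with any missing value.
full_columns = [k for k in rows[0] if all(r[k] is not None for r in rows)]
reduced_rows = [{k: r[k] for k in full_columns} for r in rows]

# Option C: fill in each missing entry with an estimate,
# here the mean of the observed values (hypothetical helper).
def impute_mean(records, key):
    observed = [r[key] for r in records if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in records]

imputed_rows = impute_mean(rows, "age")
```

Option A keeps every variable but loses a row, option B keeps every row but loses a variable, and option C keeps both at the cost of an estimated value.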

Question 2

In preparing categorical variables for analysis, it is usually best to

A)convert the categories to numeric representations.
B)convert the categories to binary, dummy variables.
C)combine as many categories as possible.
D)let them remain categorical.

Accepted Answer

Converting categorical variables to binary, dummy variables is the best choice because it allows for easy analysis and interpretation while keeping the categorical nature of the variable. It also avoids assigning any particular order or magnitude to the categories that may not exist.
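
As a rough illustration of dummy-variable encoding, the sketch below maps each category to a list of 0-1 indicators, one per category. The function name `dummy_encode` and the color categories are hypothetical and chosen only for the example.

```python
def dummy_encode(value, categories):
    """Return one 0-1 indicator per category, in the given order."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

# Hypothetical categorical variable with three unordered categories.
colors = ["red", "green", "blue"]
encoded = [dummy_encode(c, colors) for c in ["green", "blue", "red"]]
```

Because each category gets its own indicator, no category is treated as larger or smaller than another, which is the point made in the answer above.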

Question 3

Observation refers to the

A)estimated continuous outcome variable.
B)set of recorded values of variables associated with a single entity.
C)goal of predicting a categorical outcome based on a set of variables.
D)mean of all variable values associated with one particular entity.

Accepted Answer

Observation refers to the set of recorded values of variables associated with a single entity. It includes both continuous and categorical variables.

Question 4

Which of the following reasons contributes to the increase in the use of data-mining techniques in business?

A)The lack of methods to electronically track data
B)The dearth of information to analyze and interpret
C)The ability to electronically warehouse data
D)The ability to manually analyze all the data

Accepted Answer

The ability to electronically warehouse data allows for easy access and storage, which in turn makes it easier to use data-mining techniques in analyzing and interpreting the data. The other options (A, B, and D) are all factors that would hinder the use of data-mining techniques in business.

Question 5

In which of the following data-mining process steps is the data manipulated to make it suitable for formal modeling?

A)Data sampling
B)Data preparation
C)Model construction
D)Model assessment

Accepted Answer

Data preparation involves the manipulation and transformation of the data to make it suitable for modeling. This may include tasks such as cleaning, filtering, and feature engineering to create a dataset that can be used effectively in building models.

Question 6

__________________ is a measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations in the two clusters.

A)Single linkage
B)Complete linkage
C)Average linkage
D)Average group linkage

Accepted Answer

Complete linkage calculates dissimilarity between two clusters by considering only the two most dissimilar observations in the two clusters. Single linkage considers the two most similar observations, and average linkage considers the average dissimilarity between all pairs of observations, one from each cluster. Average group linkage uses the averaging concept of cluster centroids to define between-cluster similarity. Only complete linkage matches the description given in the question.

Question 7

Single linkage is a measure of calculating dissimilarity between clusters by

A)considering only the two most dissimilar observations in the two clusters.
B)computing the average dissimilarity between every pair of observations between the two clusters.
C)considering only the two most similar observations in the two clusters.
D)considering the distance between the cluster centroids.

Accepted Answer

Single linkage calculates the dissimilarity between two clusters based on the shortest distance, i.e., the distance between the closest pair of observations, one from each cluster. This method considers only the two most similar observations in the clusters to determine the clusters' dissimilarity.
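
The single, complete, and average linkage measures described in the answers above can be compared on a small example. This is a minimal sketch that assumes Euclidean distance between observations; the two clusters and all function names are hypothetical.

```python
from itertools import product
from math import dist
from statistics import mean

def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance for every pair with one observation from each cluster."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    # Closest (most similar) pair of observations.
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    # Farthest (most dissimilar) pair of observations.
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    # Average over all between-cluster pairs.
    return mean(pairwise_distances(cluster_a, cluster_b))

# Hypothetical clusters of two-dimensional observations.
cluster_a = [(0, 0), (0, 1)]
cluster_b = [(3, 0), (4, 0)]

d_single = single_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
d_average = average_linkage(cluster_a, cluster_b)
```

For any pair of clusters, the single linkage value is never larger than the average linkage value, which in turn is never larger than the complete linkage value.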

Question 8

The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is called

A)data visualization.
B)cluster analysis.
C)market analysis.
D)supervised learning.

Accepted Answer

The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is cluster analysis. This technique involves grouping similar data points together based on their characteristics or behaviors. It is a widely used technique in marketing research to identify customer segments and tailor marketing strategies accordingly. Options A, C, and D are not directly related to this specific technique.

Question 9

Which of the following is true of Euclidean distances?

A)It is used to measure dissimilarity between categorical variable observations.
B)It is not affected by the scale on which variables are measured.
C)It increases with the increase in similarity between variable values.
D)It is commonly used as a method of measuring dissimilarity between quantitative observations.

Accepted Answer

Euclidean distance is commonly used as a method of measuring dissimilarity between quantitative observations. It is not an appropriate measure for categorical variables since they do not have numerical values that can be measured in terms of distance. Euclidean distance is affected by the scale on which variables are measured, as it calculates the square root of the sum of squared differences between variables. Finally, the distance decreases as the similarity between variable values increases, so option C is incorrect.
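
The effect of measurement scale on Euclidean distance can be seen in a short sketch. The observations below, given as (age in years, income in dollars), and the helper `rescale_income` are hypothetical; the only point is that re-expressing income in thousands of dollars changes which observation is closest to `u`.

```python
from math import sqrt

def euclidean(u, v):
    """Square root of the sum of squared differences between variables."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rescale_income(obs, divisor):
    """Hypothetical helper: express income in units of `divisor` dollars."""
    age, income = obs
    return (age, income / divisor)

# Hypothetical observations: (age in years, income in dollars).
u, v, w = (30, 40000), (60, 41000), (31, 45000)

# Income in dollars: the income difference dominates the distance.
d_uv_dollars = euclidean(u, v)
d_uw_dollars = euclidean(u, w)

# Income in thousands of dollars: the age difference now matters more.
u_k, v_k, w_k = (rescale_income(o, 1000) for o in (u, v, w))
d_uv_thousands = euclidean(u_k, v_k)
d_uw_thousands = euclidean(u_k, w_k)
```

With income in dollars, `v` is closer to `u` than `w` is; with income in thousands, the ordering reverses. This sensitivity is the reason variables are often standardized before distances are computed.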

Question 10

If a model's implications depend on the inclusion or exclusion of outliers, one should spend additional time to track down

A)the cause of the outliers.
B)the missing values.
C)a better estimation of the outliers.
D)another source of data.

Accepted Answer

Tracking down the cause of outliers is the best option because it can help identify any issues with data quality or measurement error that may be affecting the model's results. This can lead to better understanding of the underlying processes that affect the data and improve the accuracy of the model. The other options do not address the cause of the outliers and may not provide a meaningful solution to the problem.

Question 11

k-means clustering is the process of

A)agglomerating observations into a series of nested groups based on a measure of similarity.
B)organizing observations into distinct groups based on a measure of similarity.
C)reducing the number of variables to consider in data-mining.
D)estimating the value of a continuous outcome variable.

Accepted Answer

The answer of k-means clustering is the process of A)agglomerating observations...

Question 12

The goal of ___________________ is to use the variable values to identify relationships between observations.

A)unsupervised learning
B)data mining
C)McQuitty's method
D)Ward's method

Accepted Answer

The answer of The goal of ___________________ is to use...

Question 13

Suppose we had a data set from a call center where customers were asked to choose between the following three options: hear account information, billing questions, and customer service. Using the given order of the three options, and using 0-1 dummy variables to encode the categorical variables, which of the following combinations would yield an entry "customer service"?

A)000
B)100
C)010
D)001

Accepted Answer

The answer of Suppose we had a data set of...

Question 14

Euclidean distance can be used to calculate the dissimilarity between two observations. Let u = (25, $350) correspond to a 25-year-old customer that spent $350 at Store A in the previous fiscal year. Let v = (53, $420) correspond to a 53-year-old customer that spent $420 at Store A in the previous fiscal year. Calculate the dissimilarity between these two observations using Euclidean distance.

A)66.21
B)72.28
C)75.39
D)88.57

Accepted Answer

The answer of Euclidean distance can be used to calculate...

Question 15

We may replace missing values with the variable's mode, mean, or median, but only if the variable values are

A)not missing at random (MNAR).
B)missing at random (MAR).
C)missing completely at random (MCAR).
D)Classification and Regression Trees (CART).

Accepted Answer

The answer of Imputing values is valid only if the...

Question 16

_______________ approaches are designed to describe patterns and relationships in large data sets with many observations of many variables.

A)Data mining
B)Unsupervised learning
C)Dimension reduction
D)Data sampling

Accepted Answer

The answer of _______________ approaches are designed to describe patterns...

Question 17

Data preparation includes all of the following except which task?

A)calculating the confidence ratio for all association rules
B)treating missing data
C)identifying erroneous data and outliers
D)defining the appropriate way to represent variables

Accepted Answer

The answer of Data preparation includes all of the following...

Question 18

If the Euclidean distance were to be represented in a right triangle, which of the following would be considered the distance between two observations of a cluster?

A)the short leg
B)the long leg
C)the hypotenuse
D)Euclidean distance is not related to right triangles.

Accepted Answer

The answer of If the Euclidean distance were to be...

Question 19

The process of eliminating variables from formal analysis without losing any crucial information is called

A)dimension reduction.
B)data sampling.
C)data reduction.
D)aggregation.

Accepted Answer

The answer of The process of eliminating variables from formal...

Question 20

Jaccard's coefficient is different from the matching coefficient in that the former

A)measures overlap while the latter measures dissimilarity.
B)does not count matching zero entries while the latter does.
C)deals with categorical variables while the latter deals with continuous variables.
D)is affected by the scale used to measure variables while the latter is not.

Accepted Answer

The answer of Jaccard's coefficient is different from the matching...

Question 21

A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a

A)dendrogram.
B)scatter chart.
C)decile-wise lift chart.
D)cumulative lift tree.

Accepted Answer

The answer of A tree diagram used to illustrate the...

Question 22

Euclidean distance can be used to measure the distance between ________________ in cluster analysis.

A)objects
B)clusters
C)observations
D)ward

Accepted Answer

The answer of Euclidean distance can be used to measure...

Question 23

Suppose the dissimilarity between clusters A and C has the value 24 and the dissimilarity between clusters B and C has the value 12. Use McQuitty's method to determine the dissimilarity between clusters AB and C.

A)12
B)18
C)24
D)36

Accepted Answer

The answer of Suppose the dissimilarity between clusters A and...

Question 24

In which of the following scenarios would it be appropriate to use hierarchical clustering?

A)When the number of observations in the dataset is relatively high.
B)When it is not necessary to know the nesting of clusters.
C)When the number of clusters is known beforehand.
D)When binary or ordinal data needs to be clustered.

Accepted Answer

The answer of In which of the following scenarios would...

Question 25

The endpoint of a k-means clustering algorithm occurs when

A)Euclidean distance between clusters is minimized.
B)Euclidean distance between observations in a cluster is maximized.
C)no further changes are observed in cluster structure and number.
D)all of the observations are encompassed within a single large cluster with mean k.

Accepted Answer

The answer of The endpoint of a k-means clustering algorithm...

Question 26

When clustering only by dummy variables that represent categorical variables, the simplest measure of similarity between two observations is called the

A)matching coefficient.
B)Jaccard's coefficient.
C)Euclidean distance.
D)antecedent.

Accepted Answer

The answer of When clustering only by dummy variables that...

Question 27

The strength of a cluster can be measured by comparing the average distance within a cluster to the distance between cluster centroids. One rule of thumb is that the ratio of between-cluster distance to within-cluster distance should exceed what value for useful clusters?

A)0.5
B)1
C)1.5
D)2

Accepted Answer

The answer of The strength of a cluster can be...

Question 28

A method for modifying variables that reduces bias prior to cluster analysis is

A)standardization.
B)weighting.
C)removing outliers.
D)randomizing.

Accepted Answer

The answer of A method for modifying variables that reduces...

Question 29

________________ is a measure that computes the dissimilarity between a cluster AB and a cluster C by averaging the distance between A and C and the distance between B and C.

A)Ward's method
B)Jaccard's coefficient
C)McQuitty's method
D)None of these.

Accepted Answer

The answer of ________________ is a measure that computes the...

Question 30

A ___________ refers to the number of times a collection of items occur together in a transaction data set.

A)consequent
B)validation count
C)support count
D)antecedent

Accepted Answer

The answer of A ___________ refers to the number of...

Question 31

Single linkage can be used to measure the distance between clusters that are the _______________ in cluster analysis.

A)most similar
B)most different
C)farthest apart
D)closest

Accepted Answer

The answer of Single linkage can be used to measure...

Question 32

Complete linkage can be used to measure the distance between clusters that are the _________________ in cluster analysis.

A)most similar
B)most different
C)farthest apart
D)closest

Accepted Answer

The answer of Complete linkage can be used to measure...

Question 33

In k-means clustering, k represents the

A)number of variables.
B)number of clusters.
C)number of observations in a cluster.
D)mean of the cluster.

Accepted Answer

The answer of In k-means clustering, k represents the A)number of...

Question 34

___________________ can be used to partition observations in a manner to obtain clusters with the least amount of information loss due to the aggregation.

A)Single linkage
B)Ward's method
C)Average group linkage
D)Dendrogram

Accepted Answer

The answer of ___________________ can be used to partition observations...

Question 35

Hierarchical clustering using ____________ results in a sequence of aggregated clusters that minimizes the loss of information between the individual observation level and the cluster level.

A)McQuitty's method
B)centroid linkage
C)median linkage
D)Ward's method

Accepted Answer

The answer of Hierarchical clustering using ____________ results in a...

Question 36

____________________ measures cluster similarity by calculating the distance between the centroids of the two clusters.

A)Single linkage
B)Complete linkage
C)Average linkage
D)Centroid linkage

Accepted Answer

The answer of ____________________ measures cluster similarity by calculating the...

Question 37

A cluster's _____________ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram.

A)dimension
B)affordability
C)durability
D)span

Accepted Answer

The answer of A cluster's _____________ can be measured by...

Question 38

Average linkage is a measure of calculating dissimilarity between two clusters by

A)finding the distance between the two most dissimilar observations in the two clusters.
B)computing the average distance between every pair of observations between two clusters.
C)finding the distance between the two closest observations in the two clusters.
D)computing the distance between the cluster centroids.

Accepted Answer

The answer of Average linkage is a measure of calculating...

Question 39

Complete linkage can be used to measure the distance between _________ in cluster analysis.

A)objects
B)clusters
C)observations
D)wards

Accepted Answer

The answer of Complete linkage can be used to measure...

Question 40

An analysis of items frequently co-occurring in transactions is known as

A)market segmentation.
B)market basket analysis.
C)regression analysis.
D)cluster analysis.

Accepted Answer

The answer of An analysis of items frequently co-occurring in...

Question 41

Using the data given, apply hierarchical clustering with 5 clusters using Wait Time (min), Purchase Amount ($),
Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the
XLMiner Hierarchical Clustering procedure. Use Ward's method as the clustering method.
a. Use a PivotTable on the data in the HC_Clusters1 worksheet to compute the cluster centers for the five clusters
in the hierarchical clustering.
b. Identify the cluster with the largest average waiting time. Using all the variables, how would you characterize
this cluster?
c. Identify the smallest cluster.
d. By examining the dendrogram on the HC_Dendrogram worksheet (as well as the sequence of clustering stages
in HC_Output1), what number of clusters seems to be the most natural fit based on the distance?

Accepted Answer

The answer of Using the data given, apply hierarchical clustering...

Question 42

Using the data given, apply k-means clustering using Price ($) as the variable with k = 3. Be sure to Normalize input data, and specify 50 iterations and 10 random starts in Step 2 of the XLMiner k-Means Clustering procedure. Then create one distinct data set for each of the three resulting clusters of price.
a. For the observations composing the cluster with low home price, apply hierarchical clustering with Ward's method to form three clusters using Acres and Age as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters1, report the characteristics of each cluster.
b. For the observations composing the cluster with medium home price, apply hierarchical clustering with Ward's method to form three clusters using Acres and Age as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters1, report the characteristics of each cluster.
c. Comment on the cluster with high home price.

Accepted Answer

The answer of Using the data given, apply k-means clustering...

Question 43

Platinum Gym has 10,000 gym members, out of which 1500 memberships included Unlimited Fitness Training and use of the tanning salon, and out of which 750 included Unlimited Hydromassage. If Fitness Training is considered A, use of the tanning salon is considered B, and Hydromassage is considered C, then the association rule for these sales becomes "If A and B are purchased, then C is also purchased." Calculate the confidence level.

Accepted Answer

The answer of Platinum Gym has 10,000 gyms members out...

Question 44

Suppose that the confidence of an association rule is 0.75 and the total number of transactions is 250. How many of those transactions support the consequent if the lift ratio is 1.875?

A)100
B)125
C)150
D)175

Accepted Answer

The answer of Suppose that the confidence of an association...

Question 45

To identify patterns across transactions, we can use

A)association rules.
B)complete linkage.
C)centroid linkage.
D)k-means.

Accepted Answer

The answer of To identify patterns across transactions, we can...

Question 46

Which statement is true of an association rule?

A)It is ultimately judged on how actionable it is and how well it explains the relationship between item sets.
B)It is a data reduction technique that reduces large information into smaller homogeneous groups.
C)It uses analytic models to describe the relationship between metrics that drive business performance.
D)It seeks to classify a categorical outcome into two or more categories.

Accepted Answer

The answer of Which statement is true of an association...

Question 47

The strength of the association rule is known as ____________ and is calculated as the ratio of the confidence of an association rule to the benchmark confidence.

A)lift
B)antecedent
C)support count
D)consequent

Accepted Answer

The answer of The strength of the association rule is...
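
The ratio described in this question can be sketched with a small, hypothetical transaction list. The item names and all function names below are assumptions made for illustration; the sketch takes benchmark confidence to be the share of all transactions that contain the consequent.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))

def confidence(transactions, antecedent, consequent):
    """Share of antecedent transactions that also contain the consequent."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    return both / support_count(transactions, antecedent)

def benchmark_confidence(transactions, consequent):
    """Share of all transactions that contain the consequent."""
    return support_count(transactions, consequent) / len(transactions)

def lift_ratio(transactions, antecedent, consequent):
    """Confidence of the rule divided by its benchmark confidence."""
    return (confidence(transactions, antecedent, consequent)
            / benchmark_confidence(transactions, consequent))

# Hypothetical transaction data set.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
    {"eggs"},
]

# Rule: if bread is purchased, then milk is also purchased.
rule_support = support_count(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
rule_lift = lift_ratio(transactions, {"bread"}, {"milk"})
```

A ratio above 1 suggests that the antecedent makes the consequent more likely than it is across all transactions, while a ratio below 1 suggests the opposite.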

Question 48

___________________ uses the averaging concept of cluster centroids to define between-cluster similarity.

Accepted Answer

The answer of ___________________uses the averaging concept of cluster centroids...

Question 49

A retailer is interested in analyzing the shopping trends of men concerning the items: shirts, pants, jeans, t-shirts, shoes, and belts. A sample of 50 male customers is selected and the data are given below.

a. Using a minimum support of 20 transactions and a minimum confidence of 50 percent, use XLMiner to generate a
list of association rules. How many rules satisfy this criterion?
b. Using the list of rules from part a, consider the rule with the largest lift ratio. Interpret what this rule is saying
about the relationship between the antecedent item set and consequent item set.
c. Interpret the support count of the item set composed of all the items involved in the rule with the largest
lift ratio.
d. Interpret the confidence of the rule with the largest lift ratio.
e. Interpret the lift ratio of the rule with the largest lift ratio.

Accepted Answer

The answer of A retailer is interested in analyzing the...

Question 50

Using the data given, apply hierarchical clustering with 5 clusters using Wait Time (min), Purchase Amount ($),
Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the
XLMiner Hierarchical Clustering procedure. Use Ward's method as the clustering method.
a. Use a PivotTable on the data in the HC_Clusters1 worksheet to compute the cluster centers for the five clusters
in the hierarchical clustering.
b. Identify the cluster with the largest average waiting time. Using all the variables, how would you characterize
this cluster?
c. Identify the smallest cluster.
d. By examining the dendrogram on the HC_Dendrogram worksheet (as well as the sequence of clustering stages
in HC_Output1), what number of clusters seems to be the most natural fit based on the distance?

Accepted Answer

The answer of Using the data given, apply hierarchical clustering...

Question 51

Using the data given, apply k-means clustering using Wait Time (min) as the variable with k = 3. Be sure to Normalize input data, and specify 50 iterations and 10 random starts in Step 2 of the XLMiner k-Means Clustering procedure. Then create one distinct data set for each of the three resulting clusters for waiting time.
a. For the observations composing the cluster which has the low waiting time, apply hierarchical clustering with Ward's method to form two clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters, report the characteristics of each cluster.
b. For the observations composing the cluster which has the medium waiting time, apply hierarchical clustering with Ward's method to form three clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters, report the characteristics of each cluster.
c. For the observations composing the cluster which has the high waiting time, apply hierarchical clustering with Ward's method to form two clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters, report the characteristics of each cluster.

Accepted Answer

The answer of Using the data given, apply k-means clustering...

Question 52

Platinum Gym has 10,000 gym members, out of which 1500 memberships included Unlimited Fitness Training and use of the tanning salon, and out of which 750 included Unlimited Hydromassage. If Fitness Training is considered A, use of the tanning salon is considered B, and Hydromassage is considered C, then the association rule for these sales becomes "If A and B are purchased, then C is also purchased." Given that the total number of transactions for C is 3000, calculate the lift for this rule.

Accepted Answer

The answer of Platinum Gym has 10,000 gyms members out...

Question 53

The lift ratio of an association rule with a confidence value of 0.45 and in which the consequent occurs in 6 out of 10 cases is

A)1.40.
B)0.54.
C)1.00.
D)0.75.

Accepted Answer

The answer of The lift ratio of an association rule...

Question 54

Platinum Gym has 10,000 gym members, out of which 1500 memberships included Unlimited Fitness Training and use of the tanning salon, and out of which 750 included Unlimited Hydromassage. If Fitness Training is considered A, use of the tanning salon is considered B, and Hydromassage is considered C, then the association rule for these sales becomes "If A and B are purchased, then C is also purchased." Given that the total number of transactions for C is 3000, calculate the benchmark confidence level.

Accepted Answer

The answer of Platinum Gym has 10,000 gyms members out...

Question 55

The __________ the lift ratio, the ____________ the association rule.

A)higher; stronger
B)higher; weaker
C)lower; stronger
D)lower; weaker

Accepted Answer

The answer of The __________ the lift ratio, the ____________...

Question 56

____________________ clustering method defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different.

Accepted Answer

The answer of ____________________ clustering method defines the similarity between...

Deck 4: Descriptive Data Mining