Deck 10: Introduction to Data Mining
1
Divisive clustering methods differ from agglomerative clustering methods in that divisive clustering methods ________.
A) can only have a pair of subjects in each cluster
B) separate objects into a particular cluster in one step
C) separate n objects successively into finer groupings
D) can only have a single subject in each cluster
Answer: C
2
________ is a collection of techniques that seek to group or segment a collection of objects or observations into subsets, such that objects within each subset are more closely related to one another than to objects assigned to different subsets.
A) Association rule mining
B) Discriminant analysis
C) Cluster analysis
D) Logistic regression
Answer: C
3
In the ________ method, the distance between groups is defined as the distance between the closest pair of objects, where only pairs consisting of one object from each group are considered.
A) Ward's linkage clustering
B) single linkage clustering
C) divisive clustering
D) average group linkage clustering
Answer: B
4
U-botit.com is an electronic commerce company that sells music online. It keeps track of which genres of music its registered customers buy. If u-botit.com were to use the data mining approach of association, which of the following actions would it take?
A) send recommendations to customers based on their buying habits
B) survey customers to gauge customer satisfaction
C) classify the customers based on the genre of music they listen to
D) provide discounts to buyers of genres with low sales
5
The effectiveness of a classification rule can be judged by finding the probability of making a misclassification error and summarizing the results in a ________.
A) classification matrix
B) classification hierarchy
C) dendrogram
D) classification algorithm
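To make the idea of tabulating misclassification errors concrete, here is a minimal sketch that counts actual versus predicted classes and reports the share of records classified wrongly. The labels are hypothetical illustrative data, and the helper names `classification_matrix` and `misclassification_rate` are assumptions introduced for this example, not taken from the deck.

```python
from collections import Counter

def classification_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def misclassification_rate(actual, predicted):
    """Fraction of records whose predicted label differs from the actual one."""
    errors = sum(a != p for a, p in zip(actual, predicted))
    return errors / len(actual)

# Hypothetical labels for six credit applications (illustrative data only).
actual    = ["approve", "approve", "reject", "reject",  "approve", "reject"]
predicted = ["approve", "reject",  "reject", "approve", "approve", "reject"]

matrix = classification_matrix(actual, predicted)
rate = misclassification_rate(actual, predicted)

for (a, p), count in sorted(matrix.items()):
    print(f"actual={a:8s} predicted={p:8s} count={count}")
print(f"misclassification rate: {rate:.3f}")
```

The diagonal pairs (actual equals predicted) are correct classifications; the off-diagonal pairs are the errors the rate summarizes.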
6
Which of the following formulas calculates the Euclidean distance between X and Y? 
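The answer choices for this card were images and did not survive extraction. As a reference point, the standard definition of Euclidean distance between X = (x1, ..., xn) and Y = (y1, ..., yn) is the square root of the sum of squared coordinate differences. The sketch below implements that definition; the function name `euclidean_distance` is an assumption made for illustration.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two-dimensional case: legs of length 3 and 4 give a hypotenuse of length 5.
d = euclidean_distance((0, 0), (3, 4))
print(d)
```

In two dimensions this is the Pythagorean theorem: the coordinate differences are the legs of a right triangle and the distance is its hypotenuse.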

7
The data mining approach called ________ involves developing analytic models to describe the relationship between metrics that drive business performance, such as profitability, customer satisfaction, or employee satisfaction.
A) association
B) reduction
C) cause-and-effect modeling
D) classification
8
In classification using Excel, which of the following types of variables in a database must necessarily be coded as numerical values?
A) categorical variables
B) non-numerical variables
C) predictor variables
D) numerical variables
9
Which of the following data sets provides the most realistic estimate of the performance of a model on completely unseen data?
A) validation data set
B) test data set
C) training data set
D) linear regression data set
10
Which of the following defines the distance between two clusters in complete linkage clustering?
A) the average of distances between all pairs of objects, where each pair is made up of one object from each group
B) the distance between the most distant pair of objects, one from each group
C) the sum of squares of the distance between clusters
D) the distance between the value of the shortest link between the clusters
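The linkage definitions that appear in this deck (closest pair, most distant pair, and average over all pairs, each pair taking one object from each group) can be compared side by side. This is a minimal sketch under the assumption of Euclidean distance; the function names and the two small clusters are hypothetical.

```python
import math
from itertools import product

def pair_distances(group_a, group_b):
    """Euclidean distances for every pair with one point from each group."""
    return [math.dist(a, b) for a, b in product(group_a, group_b)]

def single_linkage(group_a, group_b):
    """Distance between the closest cross-group pair."""
    return min(pair_distances(group_a, group_b))

def complete_linkage(group_a, group_b):
    """Distance between the most distant cross-group pair."""
    return max(pair_distances(group_a, group_b))

def average_linkage(group_a, group_b):
    """Average of the distances over all cross-group pairs."""
    distances = pair_distances(group_a, group_b)
    return sum(distances) / len(distances)

# Hypothetical one-dimensional clusters (illustrative data only).
left = [(0.0,), (1.0,)]
right = [(4.0,), (6.0,)]

print(single_linkage(left, right))    # closest pair: 1.0 and 4.0
print(complete_linkage(left, right))  # most distant pair: 0.0 and 6.0
print(average_linkage(left, right))   # mean of 4, 6, 3, 5
```

The three measures differ only in how they summarize the same list of cross-group distances.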
11
Which of the following is true of training data sets?
A) They are primarily used to fine-tune models.
B) They provide the most realistic estimate for a model's performance.
C) They are used to build models where the data is unknown.
D) They have known outcomes.
12
Which of the following is true of hierarchical clustering?
A) All clusters must have the same number of objects.
B) No single cluster can have all objects.
C) The data are not partitioned into a particular cluster in a single step.
D) All clusters must have more than one object in them.
13
Spam filtering for e-mail is an example of which of the following data mining approaches?
A) reduction
B) association
C) cause-and-effect modeling
D) classification
14
In classification, which of the following would be considered the categorical variable of interest in a credit approval decision for a requester?
A) age of the requester
B) income of the requester
C) revolving balance of the requester
D) reject or accept credit approval
15
Which of the following is true of cluster analysis?
A) It is a cause-and-effect modeling type of analytic model.
B) It does not provide a definitive answer from analyzing the data.
C) It is primarily a prescriptive analytical method.
D) It uses clustered data that are not affected by the specific method used.
16
How are clusters joined in the first stage of agglomerative hierarchical clustering?
A) by separating clusters into finer groups
B) by joining two clusters farthest away from each other
C) by joining two clusters that are not at a Euclidean distance
D) by joining two clusters that are closest to each other
17
Which of the following uses the sum of squares between the objects in the cluster when measuring their distances?
A) divisive clustering
B) average group linkage clustering
C) Ward's hierarchical clustering
D) single linkage clustering
18
If the Euclidean distance were to be represented in a right triangle, which of the following would be considered the distance between two objects of a cluster?
A) the hypotenuse
B) the shorter leg
C) the longer leg
D) the average of the two legs
19
Sendstars is a package delivery company that recently conducted a study of its customer retention and service renewal metrics. It found that most customers stopped using Sendstars' services because of dissatisfaction with rude or ill-mannered delivery personnel. To curb this issue, Sendstars gave its employees special training in customer service. Which of the following data mining approaches did Sendstars employ when it decided, based on the study, to train its employees in customer care?
A) association
B) cause-and-effect modeling
C) classification
D) reduction
20
Which of the following is included in the data mining approach of data exploration and reduction?
A) analyzing data to predict how to classify a new data element
B) identifying groups in which the elements of the groups are in some way similar
C) creating rules for target marketing based on association of variables
D) developing analytic models to describe the relationship between metrics
21
________ is a data-mining technique used for classifying a set of observations into predefined classes.
A) Logistic regression
B) k-Nearest Neighbors algorithm
C) Discriminant analysis
D) Linear regression
22
A musical instruments retailer has 10,000 point-of-sale transactions, of which 1,500 included both electric guitars and guitar cases; of those 1,500, 750 also included new strings. If electric guitars are item A, guitar cases are item B, and strings are item C, then the association rule for these sales becomes "If A and B are purchased, then C is also purchased." Calculate the confidence, expected confidence, and lift for this rule, given that 3,000 transactions in total include C.
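The arithmetic for this card can be sketched as follows, using the definitions that appear elsewhere in the deck: confidence is the ratio of transactions containing antecedent and consequent to transactions containing the antecedent, and lift is confidence divided by expected confidence. The sketch assumes the 750 string purchases are a subset of the 1,500 transactions containing both A and B; the variable names are illustrative.

```python
total_transactions = 10_000
antecedent_count = 1_500   # transactions containing both A and B
rule_count = 750           # assumed: the A-and-B transactions that also contain C
consequent_count = 3_000   # all transactions containing C

# Confidence: share of antecedent transactions that also contain the consequent.
confidence = rule_count / antecedent_count

# Expected confidence: share of all transactions containing the consequent,
# i.e. what confidence would be if antecedent and consequent were independent.
expected_confidence = consequent_count / total_transactions

# Lift: how much more likely C is, given A and B, than it is overall.
lift = confidence / expected_confidence

print(f"confidence          = {confidence:.2%}")
print(f"expected confidence = {expected_confidence:.2%}")
print(f"lift                = {lift:.3f}")
```

A lift above 1 indicates that buying A and B together raises the likelihood of buying C relative to the baseline.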
23
Which of the following would be considered a lagging measure in a restaurant using the cause-and-effect modeling method of data mining?
A) a manager who is having trouble meeting the daily demands
B) a waiter's behavior toward a customer
C) an accurately billed meal
D) a satisfied customer
24
________ is the ratio of the number of transactions that include all items in the consequent as well as the antecedent to the number of transactions that include all items in the antecedent.
A) Lift
B) Logit
C) Support for the association rule
D) Confidence of the association rule
25
Which of the following types of data-mining methods provides probabilistic if-then statements?
A) association rule
B) logistic regression
C) reduction
D) cause-and-effect modeling
26
Validation data sets differ from training data sets in that validation data sets ________.
A) have known outcomes
B) test a model with unseen data
C) are used to teach data-mining algorithms
D) provide the most realistic test for models with known data
27
Which of the following is true of random partitioning?
A) It cannot be used for creating training data sets.
B) It can contain any non-negative value from the observations.
C) It can include negative-value observations in its data.
D) It selects its data based on observations that have similar properties.
28
When using logistic regression, where p is the probability that the dependent variable Y = 1, X1, X2, ..., Xk are the independent variables, and β0, β1, β2, ..., βk are unknown regression coefficients, ________ is called the odds of belonging to category 1 (Y = 1).
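The expression this card refers to was an image and did not survive extraction. For reference, under the standard logistic regression model with the symbols defined above, the log of the odds is linear in the predictors, so the odds themselves are the ratio p/(1 - p):

```latex
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k
\qquad\Longrightarrow\qquad
\frac{p}{1-p} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k}
```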

29
If the b's are weights, the X's are input values, and c is a constant or intercept, provide the equation for the discriminant function L.
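A linear discriminant function of this kind is conventionally written as a weighted sum of the inputs plus the constant. The form below is a sketch of that convention; the index n for the number of inputs is an assumed notation, since the card does not name it:

```latex
L = b_1 X_1 + b_2 X_2 + \cdots + b_n X_n + c
```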

30
Logistic regression is different from discriminant analysis in that logistic regression ________.
A) does not predict the weights
B) sets observation into predefined classes
C) does not depend on assumptions
D) depends on assumptions such as normalization of independent variables
31
Which of the following is true of the value of k in the k-Nearest Neighbors algorithm?
A) The value of k is always taken as a constant and is equal to 1.
B) If the value of k is large, it drastically increases variability.
C) If the value of k is very large, it introduces biases into the classification decisions.
D) If the value of k is large, the classification of a record is very sensitive to the classification of the single record to which it is closest.
32
The ________ algorithm is a classification scheme that attempts to find records in a database that are similar to one we wish to classify.
A) linear regression
B) logistic regression
C) k-Nearest Neighbors
D) discriminant analysis
33
Which of the following is true of the lift value of an association rule?
A) Its strength is directly proportional to the strength of the association rule.
B) It cannot have a value that is more than 1 to be considered a good minimum.
C) It provides information on the increase of the antecedent given the consequent.
D) It does not require the expected confidence to be calculated.
34
The strength of the association rule, known as lift, is calculated as the ratio of the ________.
A) sum of the antecedents and the consequents to the antecedents
B) antecedents to the consequents
C) support to the confidence level
D) confidence to expected confidence
35
The weights for determining the discriminant functions are determined by ________.
A) assessing the number of outliers that are present in each group
B) calculating the distance between the two closest observations in each group
C) measuring the closeness between predictor values of each set
D) maximizing the between-group variance relative to the within-group variance
36
Which of the following is true of association rule mining?
A) It develops analytic models to describe the relationship between metrics that drive business performance.
B) It identifies attributes that occur frequently together in a given data set.
C) It seeks to classify a categorical outcome into one of two or more categories.
D) It is a data reduction technique that reduces large information into smaller heterogeneous groups.
37
In a k-Nearest Neighbors algorithm, similarity of records is based on the ________.
A) closeness of a record to numerical predictors in the other records
B) sum of the squares of the distance between the numerical predictors
C) set of linear functions of predictors called discriminant functions
D) nearness of a record to its own observations
38
Which of the following is true of logistic regression as a classifying method?
A) Its dependent variable is always categorical.
B) Its independent variable is always continuous or numerical.
C) It predicts the probability of output variables based on dependent variables.
D) It cannot be used when the dependent variable is binary.
39
In cause-and-effect modeling, internal metrics such as employee satisfaction, productivity, and turnover are considered to be ________ measures.
A) logit
B) life
C) leading
D) lagging
40
Which of the following typically describes the support for the association rule?
A) the number of transactions that include all items that are in the antecedent parts of the rule
B) the number of transactions that include all items in the antecedent and consequent parts of the rule
C) the number of transactions that include half from the antecedent and half from the consequent
D) the number of transactions that include all items that are in the consequent parts of the rule
41
Expected confidence assumes independence between the consequent and the antecedent.
42
In cluster analysis, the objects within clusters should exhibit a high amount of dissimilarity.
43
Lagging and leading measures in cause-and-effect modeling are uncorrelated.
44
In association analysis, the antecedent and consequent are sets of items that do not have any items in common.
45
Briefly explain classification as a data-mining tool with an example.
46
In average linkage clustering, the distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group.
47
The market share of a business would be considered a lagging measure in cause-and-effect modeling.
48
Logistic regression cannot be employed when the dependent variable is binary.
49
Explain how data mining that uses the lagging and leading measures of a cause-and-effect model can help managers make business decisions.
50
How is the strength of an association measured?
51
The accuracy of the model on the test data gives a realistic estimate of the performance of the model on completely unseen data.
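The role of the test data set is easier to see next to the other two partitions. Below is a minimal sketch of random partitioning into training, validation, and test sets; the 60/20/20 proportions, the fixed seed, and the function name `random_partition` are assumptions chosen for illustration, not values from the deck.

```python
import random

def random_partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle a copy of the records and cut it into three disjoint parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                      # used to build the model
    valid = shuffled[n_train:n_train + n_valid]     # used to compare and tune models
    test = shuffled[n_train + n_valid:]             # held back for the final estimate
    return train, valid, test

train, valid, test = random_partition(range(100))
print(len(train), len(valid), len(test))
```

Because the test records play no part in building or tuning the model, accuracy measured on them is the closest stand-in for performance on unseen data.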
52
How are objects clustered in agglomerative hierarchical clustering?
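One way to picture the process: every object starts as its own cluster, and at each stage the two closest clusters are merged. The sketch below assumes one-dimensional points and single linkage (closest pair of members) purely to keep the example short; the function name `agglomerate` and the data are hypothetical.

```python
def agglomerate(points, target_clusters=1):
    """Repeatedly merge the two closest clusters (single linkage, 1-D points)."""
    clusters = [[p] for p in points]   # start: every object is its own cluster
    merges = []                        # record of (cluster, cluster, distance)
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges

clusters, merges = agglomerate([1.0, 2.0, 6.0, 7.5, 20.0], target_clusters=2)
for left, right, distance in merges:
    print(f"merge {left} + {right} at distance {distance}")
print(clusters)
```

Swapping the `min` in the inner loop for `max` or an average would give complete or average linkage with the same merging loop.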
53
Illustrate the k-Nearest Neighbors algorithm of classification with an example.
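As one possible illustration, the sketch below classifies a new record by majority vote among its k closest training records, using Euclidean distance on numerical predictors. The credit records, the predictor choice (age and income in thousands), and the function name `knn_classify` are hypothetical and introduced only for this example.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training records closest to `query`.

    `train` is a list of (features, label) pairs with numerical features.
    """
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical records: (age, income in $1000s) -> credit decision.
train = [
    ((25, 30), "reject"),
    ((30, 35), "reject"),
    ((45, 80), "approve"),
    ((50, 90), "approve"),
    ((40, 75), "approve"),
]

label = knn_classify(train, (42, 78), k=3)
print(label)
```

With k = 1 the result depends entirely on the single closest record, while a larger k averages over more neighbors. In practice the predictors are usually put on a common scale first so that no single variable dominates the distance.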