Deck 4: Descriptive Data Mining

Question
Suppose we had a data set from a call center where customers were asked to choose among the following three options: hear account information, billing questions, and customer service. Using the given order of the three options, and using 0-1 dummy variables to encode the categorical variable, which of the following combinations would yield an entry "customer service"?

A)000
B)100
C)010
D)001
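
The encoding in this card can be sketched in a few lines. This is a minimal illustration, assuming one 0-1 dummy per option in the listed order; the names `OPTIONS` and `encode` are hypothetical and not part of the deck.

```python
# Option order as listed in the card; one 0-1 dummy variable per option.
OPTIONS = ["hear account information", "billing questions", "customer service"]

def encode(choice: str) -> str:
    """Return the 0-1 dummy string for a choice, one digit per option."""
    return "".join("1" if option == choice else "0" for option in OPTIONS)

for option in OPTIONS:
    print(option, "->", encode(option))
```

Each digit answers "was this option chosen?", so exactly one digit is 1 for any valid choice.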
Question
If the Euclidean distance were to be represented in a right triangle, which of the following would be considered the distance between two observations of a cluster?

A)The short leg
B)The long leg
C)The hypotenuse
D)Euclidean distance is not related to right triangles.
Question
__________ can be used to partition observations in a manner to obtain clusters with the least amount of information loss due to the aggregation.

A)Single linkage
B)Ward's method
C)Average group linkage
D)Dendrogram
Question
In preparing categorical variables for analysis, it is usually best to

A)convert the categories to numeric representations.
B)convert the categories to binary, dummy variables.
C)combine as many categories as possible.
D)let them remain categorical.
Question
Average linkage is a measure of calculating dissimilarity between two clusters by

A)finding the distance between the two most dissimilar observations in the two clusters.
B)computing the average distance between every pair of observations between two clusters.
C)finding the distance between the two closest observations in the two clusters.
D)computing the distance between the cluster centroids.
Question
Euclidean distance can be used to measure the distance between __________ in cluster analysis.

A)objects
B)clusters
C)observations
D)ward
Question
__________ is a measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations in the two clusters.

A)Single linkage
B)Complete linkage
C)Average linkage
D)Average group linkage
Question
__________ is a method of calculating dissimilarity between clusters by calculating the distance between the centroids of the two clusters.

A)Single linkage
B)Complete linkage
C)Average linkage
D)Centroid linkage
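
The linkage measures named in these cards (single, complete, average, centroid) can be compared side by side on a toy example. The sketch below is illustrative only: the two clusters are made-up points, and `linkages` is a hypothetical helper. It relies on `math.dist`, available in Python 3.8 and later.

```python
import math
from itertools import product

def centroid(cluster):
    """Coordinate-wise mean of the observations in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def linkages(cluster_1, cluster_2):
    """Dissimilarity between two clusters under four linkage rules."""
    # Euclidean distance for every pair with one observation from each cluster.
    pair_distances = [math.dist(p, q) for p, q in product(cluster_1, cluster_2)]
    return {
        "single": min(pair_distances),      # two most similar observations
        "complete": max(pair_distances),    # two most dissimilar observations
        "average": sum(pair_distances) / len(pair_distances),
        "centroid": math.dist(centroid(cluster_1), centroid(cluster_2)),
    }

# Made-up observations, for illustration only.
cluster_1 = [(0, 0), (0, 2)]
cluster_2 = [(3, 0), (3, 2)]
result = linkages(cluster_1, cluster_2)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Single linkage can never exceed average linkage, which in turn can never exceed complete linkage, because they are the minimum, mean, and maximum of the same set of pairwise distances.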
Question
k-means clustering is the process of

A)agglomerating observations into a series of nested groups based on a measure of similarity.
B)organizing observations into distinct groups based on a measure of similarity.
C)reducing the number of variables to consider in data-mining.
D)estimating the value of a continuous outcome variable.
Question
__________ approaches are designed to describe patterns and relationships in large data sets with many observations of many variables.

A)Data mining
B)Unsupervised learning
C)Dimension reduction
D)Data sampling
Question
Jaccard's coefficient is different from the matching coefficient in that the former

A)measures overlap while the latter measures dissimilarity.
B)does not count matching zero entries while the latter does.
C)deals with categorical variables while the latter deals with continuous variables.
D)is affected by the scale used to measure variables while the latter is not.
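
The two coefficients can be contrasted on a pair of 0-1 vectors. This is a small sketch with made-up observations; the function names are hypothetical. The matching coefficient counts every agreeing position, while Jaccard's coefficient leaves positions where both entries are 0 out of the calculation.

```python
def matching_coefficient(u, v):
    """Share of variables on which the two observations agree (1-1 or 0-0)."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)

def jaccard_coefficient(u, v):
    """Share of 1-1 matches among variables that are not 0-0 matches."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    considered = len(u) - both_zero
    return both_one / considered if considered else 0.0

# Made-up dummy-variable observations, for illustration only.
u = [1, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0]
print(matching_coefficient(u, v))  # 4 of 5 positions agree
print(jaccard_coefficient(u, v))   # 1 shared 1 out of 2 non-0-0 positions
```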
Question
Euclidean distance can be used to calculate the dissimilarity between two observations. Let u = (25, $350) correspond to a 25-year-old customer that spent $350 at Store A in the previous fiscal year. Let v = (53, $420) correspond to a 53-year-old customer that spent $420 at Store A in the previous fiscal year. Calculate the dissimilarity between these two observations using Euclidean distance.

A)66.21
B)72.28
C)75.39
D)88.57
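
The calculation for this card can be checked with a short script. It is a minimal sketch that uses the vectors u = (25, 350) and v = (53, 420) exactly as written in the card; the function name is hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Square root of the summed squared differences across variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = (25, 350)  # (age, dollars spent)
v = (53, 420)
distance = euclidean_distance(u, v)
print(round(distance, 2))  # sqrt(28**2 + 70**2) = sqrt(5684)
```

The dollar difference (70) dominates the age difference (28) only because of the units involved, which is why variables are often standardized before distances are computed.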
Question
When clustering only by dummy variables that represent categorical variables, the simplest measure of similarity between two observations is called the

A)matching coefficient.
B)Jaccard's coefficient.
C)Euclidean distance.
D)antecedent.
Question
Single linkage is a measure of calculating dissimilarity between clusters by

A)considering only the two most dissimilar observations in the two clusters.
B)computing the average dissimilarity between every pair of observations between the two clusters.
C)considering only the two most similar observations in the two clusters.
D)considering the distance between the cluster centroids.
Question
Which of the following is true of Euclidean distances?

A)It is used to measure dissimilarity between categorical variable observations.
B)It is not affected by the scale on which variables are measured.
C)It increases with the increase in similarity between variable values.
D)It is commonly used as a method of measuring dissimilarity between quantitative observations.
Question
The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is called

A)data visualization.
B)cluster analysis.
C)market analysis.
D)supervised learning.
Question
Suppose the dissimilarity between clusters A and C has the value 24 and the dissimilarity between clusters B and C has the value 12. Use McQuitty's method to determine the dissimilarity between cluster AB and cluster C.

A)12
B)18
C)24
D)36
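
McQuitty's method, as described in another card of this deck, averages the distance between A and C and the distance between B and C to get the dissimilarity between the merged cluster AB and cluster C. The sketch below applies that rule, assuming the card's two values (24 and 12) are the distances from A and from B to a third cluster C; the function name is hypothetical.

```python
def mcquitty_dissimilarity(d_ac, d_bc):
    """Dissimilarity between merged cluster AB and cluster C."""
    return (d_ac + d_bc) / 2

# Assumption: 24 is the A-to-C distance and 12 is the B-to-C distance.
d_ab_c = mcquitty_dissimilarity(24, 12)
print(d_ab_c)
```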
Question
The goal of __________ is to use the variable values to identify relationships between observations.

A)unsupervised learning
B)data mining
C)McQuitty's method
D)Ward's method
Question
Observation refers to the

A)estimated continuous outcome variable.
B)set of recorded values of variables associated with a single entity.
C)goal of predicting a categorical outcome based on a set of variables.
D)mean of all variable values associated with one particular entity.
Question
A method for modifying variables that reduces bias prior to cluster analysis is

A)standardization.
B)weighting.
C)removing outliers.
D)randomizing.
Question
An analysis of items frequently co-occurring in transactions is known as

A)market segmentation.
B)market basket analysis.
C)regression analysis.
D)cluster analysis.
Question
In k-means clustering, k represents the

A)number of variables.
B)number of clusters.
C)number of observations in a cluster.
D)mean of the cluster.
Question
The __________ the lift ratio, the __________ the association rule.

A)higher; stronger
B)higher; weaker
C)lower; stronger
D)lower; weaker
Question
A cluster's __________ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram.

A)dimension
B)affordability
C)durability
D)span
Question
In which of the following scenarios would it be appropriate to use hierarchical clustering?

A)When the number of observations in the dataset is relatively high
B)When it is not necessary to know the nesting of clusters
C)When the number of clusters is known beforehand
D)When binary or ordinal data needs to be clustered
Question
Which statement is true of an association rule?

A)It is ultimately judged on how actionable it is and how well it explains the relationship between item sets.
B)It is a data reduction technique that reduces large information into smaller homogeneous groups.
C)It uses analytic models to describe the relationship between metrics that drive business performance.
D)It seeks to classify a categorical outcome into two or more categories.
Question
Single linkage can be used to measure the distance between clusters that are the __________ in cluster analysis.

A)most similar
B)most different
C)farthest apart
D)closest
Question
Complete linkage can be used to measure the distance between clusters that are the __________ in cluster analysis.

A)most similar
B)most different
C)farthest apart
D)closest
Question
To identify patterns across transactions, we can use

A)association rules.
B)complete linkage.
C)centroid linkage.
D)k-means.
Question
The strength of the association rule is known as __________ and is calculated as the ratio of the confidence of an association rule to the benchmark confidence.

A)lift
B)antecedent
C)support count
D)consequent
Question
A __________ refers to the number of times a collection of items occurs together in a transaction data set.

A)consequent
B)validation count
C)support count
D)antecedent
Question
Hierarchical clustering using __________ results in a sequence of aggregated clusters that minimizes the loss of information between the individual observation level and the cluster level.

A)McQuitty's method
B)centroid linkage
C)median linkage
D)Ward's method
Question
The lift ratio of an association rule with a confidence value of 0.45 and in which the consequent occurs in 6 out of 10 cases is

A)1.40.
B)0.54.
C)1.00.
D)0.75.
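
The lift ratio is the confidence of the rule divided by the benchmark confidence, where the benchmark confidence is the share of transactions that contain the consequent. A minimal sketch with this card's numbers follows; the function and parameter names are hypothetical.

```python
def lift_ratio(confidence, consequent_count, total_transactions):
    """Confidence of the rule divided by the benchmark confidence."""
    benchmark_confidence = consequent_count / total_transactions
    return confidence / benchmark_confidence

lift = lift_ratio(0.45, 6, 10)
print(round(lift, 2))
```

A lift ratio below 1 means the consequent shows up less often alongside the antecedent than it does in the data set overall.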
Question
__________ is a measure that computes the dissimilarity between a cluster AB and a cluster C by averaging the distance between A and C and the distance between B and C.

A)Ward's method
B)Jaccard's coefficient
C)McQuitty's method
D)None of these are correct.
Question
Suppose that the confidence of an association rule is 0.75 and the total number of transactions is 250. How many of those transactions support the consequent if the lift ratio is 1.875?

A)100
B)125
C)150
D)175
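
This card runs the lift formula in reverse: dividing the confidence by the lift ratio gives the benchmark confidence, and multiplying that by the number of transactions gives the consequent's support count. A short sketch with the card's numbers, using a hypothetical function name:

```python
def consequent_support_count(confidence, lift, total_transactions):
    """Number of transactions containing the consequent."""
    benchmark_confidence = confidence / lift  # share containing the consequent
    return benchmark_confidence * total_transactions

count = consequent_support_count(0.75, 1.875, 250)
print(round(count))
```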
Question
Complete linkage can be used to measure the distance between _________ in cluster analysis.

A)objects
B)clusters
C)observations
D)wards
Question
The strength of a cluster can be measured by comparing the average distance in a cluster to the distance between cluster centroids. One rule of thumb is that the ratio for between-cluster distance to within-cluster distance should exceed what value for useful clusters?

A)0.5
B)1
C)1.5
D)2
Question
A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a

A)dendrogram.
B)scatter chart.
C)decile-wise lift chart.
D)cumulative lift tree.
Question
The process of extracting useful information from text data is known as __________.

A)text mining
B)tokenization
C)stemming
D)corpus
Question
The endpoint of a k-means clustering algorithm occurs when

A)Euclidean distance between clusters is minimized.
B)Euclidean distance between observations in a cluster is maximized.
C)no further changes are observed in cluster structure and number.
D)all of the observations are encompassed within a single large cluster with mean k.
Question
The process of converting a word to its stem, or root word, is referred to as __________.

A)data cleaning
B)stemming
C)tokenization
D)stacking
Question
In the text mining process, the text is first preprocessed by deriving a smaller set of _________ from the larger set of words contained in a collection of documents.

A)tokens
B)stems
C)terms
D)stack
Question
A collection of text documents to be analyzed is called a ___________.

A)book
B)corpus
C)library
D)consequent
Question
The process of dividing text into separate terms is referred to as __________.

A)data cleaning
B)stemming
C)tokenization
D)stacking
1
Suppose we had a data set from a call center where customers were asked to choose among the following three options: hear account information, billing questions, and customer service. Using the given order of the three options, and using 0-1 dummy variables to encode the categorical variable, which of the following combinations would yield an entry "customer service"?

A)000
B)100
C)010
D)001
Answer: D)001
2
If the Euclidean distance were to be represented in a right triangle, which of the following would be considered the distance between two observations of a cluster?

A)The short leg
B)The long leg
C)The hypotenuse
D)Euclidean distance is not related to right triangles.
Answer: C)The hypotenuse
3
__________ can be used to partition observations in a manner to obtain clusters with the least amount of information loss due to the aggregation.

A)Single linkage
B)Ward's method
C)Average group linkage
D)Dendrogram
Answer: B)Ward's method