Deck 9: Text Analytics
1
Which of the following is NOT a text analytic procedure?
A) Creating keyword-in-context dictionaries
B) Sentiment analysis
C) Topic modeling
D) Analyzing ngrams
E) All of the above are text analytic procedures
E
Text analytic functions include creating keyword-in-context dictionaries, creating bar charts of word frequencies, scraping web pages for text, scraping social media for text, creating multigroup word frequency charts, creating word clouds, creating word comparison clouds, creating word maps, doing sentiment analysis, conducting topic modeling, creating lexical dispersion plots, and analyzing bigrams and ngrams.
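As a rough illustration of three of the procedures listed above (word frequencies, keyword-in-context listings, and bigram counts), here is a minimal plain-Python sketch. It does not use the R packages the chapter relies on; the sample sentence and the helper names `tokenize`, `kwic`, and `bigrams` are made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, window=2):
    """Return one (left context, keyword, right context) row per match."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows


def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical sample text, invented for this sketch.
sample = "The cat sat on the mat and the dog sat on the rug"
tokens = tokenize(sample)

word_counts = Counter(tokens)             # word frequencies
kwic_rows = kwic(tokens, "sat", window=2)  # keyword-in-context rows
bigram_counts = Counter(bigrams(tokens))   # two-word phrase frequencies

for left, key, right in kwic_rows:
    print(f"{left:>10} | {key} | {right}")
```

Each printed row shows one occurrence of the keyword with a window of leading and trailing words, while `word_counts` and `bigram_counts` hold the raw frequencies that a bar chart or word cloud would be drawn from.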
2
A collection of documents to be used in text analytic research is called a __________________.
A corpus.
However, there are various formats for corpora. The "corpus" format is associated with the quanteda package. The "Corpus" format is associated with the tm package. There are other formats.
3
In one short sentence, what does the "gutenbergr" package do?
It contains functions to access many of the tens of thousands of public domain books and texts archived by Project Gutenberg, some of which, such as the Declaration of Independence, were used as examples in Chapter 9.
4
Which R package can be used to read in Word (.doc or .docx) files?
A) tm
B) quanteda
C) textreadr
D) tidytext
5
In one short sentence, what are "stopwords" in the quanteda package?
6
What is a kwic index?
A) For a given corpus, a kwic index alphabetically lists all words or words of researcher interest, one instance of a word per line, surrounded by a window of leading and trailing words.
B) For a given corpus, a kwic index is the digital analog of the index found in the back of most textbooks, except it is generated automatically.
C) For a given corpus, a kwic index is an automated index constructed in the K language, which is known for its speed, facility in handling arrays, and expressive syntax.
D) For a given corpus, a kwic index is an index to the global policy documents of the Kawartha World Issues Centre.
7
What does the term "bigram" mean?
A) A graph with two levels.
B) A phrase with two words treated as a single token.
C) Two unigrams.
D) The term "bigram" is short for "big ram", where ram is computer memory. Many text analytic processes require big ram.
8
What is the name of the visualization package used for many of the figures for text analytic results in Chapter 9? _________________________.
9
What was the name of the package used in Chapter 9 for very simple web scraping from html (web) format to text format? _____________________________.
10
Let "x" be an object of data class "Corpus" and "SimpleCorpus", associated with the "tm" package. Which command from the tm package is used to view the contents of "x" directly?
A) View
B) str
C) summary
D) inspect
11
The tm (text mining) package works with objects in both tdm and dtm formats. In one short sentence, what do tdm and dtm stand for and what is the main difference?
12
"Tokens" can NOT be which of the following?
A) A word
B) An ngram
C) A paragraph
D) Any of the above can be a token
13
What are these terms examples of: UTF-8 (Unicode), ISO-8859-1 (Latin 1), and EBCDIC (IBM)?
A) Tokenization
B) Types of character encoding
C) Types of lexical dispersion
D) Types of latent Dirichlet allocation
14
Which technique might be loosely characterized as a form of "factor analysis for words"?
A) Latent Dirichlet Allocation (LDA)
B) Multigroup Analysis of Word Frequencies
C) Comparison Word Clouds
D) Lexical Dispersion Analysis
15
When the term "excit" is made to include "excite", "excited", "excitement", and other words beginning with "excit", one has engaged in word …
A) Truncating
B) Wild cards
C) Trimming
D) Stemming