Lexical Dispersion Plot for Large Corpora of Texts
plotSpike.Rd
The function creates a lexical dispersion plot for large corpus objects based on keywords or dictionaries.
Usage
plotSpike(data=NULL,
          palette=c("#017a4a",
                    "#9F248F",
                    "#FFCE4E",
                    "#244579",
                    "#c6242d"),
          doc_id="name",
          index_var="index",
          word_var="word",
          group_var="group",
          top_n=NULL,
          sort=FALSE,
          polar=TRUE,
          quartiles=FALSE,
          text_label=NULL,
          tooltip_values=NULL,
          tooltip_doc="name",
          label.size=1,
          ring.col="red3",
          line.width=0.1,
          legend.position="top",
          legend.title="Group",
          title="Lexical Spike Plot",
          subtitle="Keyword dispersion in texts.",
          svg.height=5,
          svg.width=6,
          ncol=NULL,
          nrow=NULL,
          interactive=TRUE)
Arguments
- data
A data frame generated by the function filterWords, containing the data to be plotted: the name of the document, the keywords retrieved, their relative position in the text (index), and the group or dictionary category to which each keyword belongs.
- palette
A vector of colors to be used in the plot to represent groups or categories.
- doc_id
The name of the variable containing the document identifier. The default is "name".
- index_var
The name of the variable containing the index of the word in the text. The default is "index".
- word_var
The name of the variable containing the keyword found in the corpus. The default is "word".
- group_var
The name of the variable containing the group. The default is "group".
- top_n
The number of top documents to be plotted. The default is NULL.
- sort
A logical value indicating whether the documents should be sorted by frequency of terms found. The default is FALSE.
- polar
A logical value indicating whether the plot should be circular or rectangular. The default is TRUE.
- quartiles
A logical value indicating whether the quartiles should be plotted. The default is FALSE.
- text_label
A variable containing the text used to label the documents. The default is NULL.
- tooltip_values
A variable that displays individual values in the tooltip. The default is NULL (the values are automatically generated by the function).
- tooltip_doc
A variable indicating the text to be shown in the tooltip. The default is "name".
- label.size
The size of the labels. The default is 1.
- ring.col
The color of the ring. The default is "red3".
- line.width
The width of the lines. The default is 0.1.
- legend.position
The position of the legend. The default is "top".
- legend.title
The title of the legend. The default is "Group".
- title
The title of the plot. The default is "Lexical Spike Plot".
- subtitle
The subtitle of the plot. The default is "Keyword dispersion in texts."
- svg.height
The height of the svg for the interactive chart. The default is 5.
- svg.width
The width of the svg for the interactive chart. The default is 6.
- ncol
The number of columns in the graph. The default is NULL.
- nrow
The number of rows in the graph. The default is NULL.
- interactive
A logical value indicating whether the plot should be interactive. The default is TRUE.
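As a rough illustration of the arguments above, the sketch below builds a small hand-made data frame with the four default columns (name, index, word, group) and passes it to plotSpike with a rectangular, non-interactive layout. The toy values are invented for illustration, and the scale of index (here a fraction between 0 and 1) is an assumption; real input would normally come from filterWords. The call is guarded so that the snippet also runs when the package providing plotSpike is not loaded.

```r
# Toy data mimicking the columns plotSpike expects by default.
# All values are invented; the 0-1 scale of `index` is an assumption.
toy <- data.frame(
  name  = c("Session 001", "Session 001", "Session 002", "Session 002"),
  index = c(0.10, 0.75, 0.30, 0.90),
  word  = c("memoria", "pandemia", "dictadura", "covid"),
  group = c("Memoria", "COVID", "Memoria", "COVID"),
  stringsAsFactors = FALSE
)

# Check that the default column names are present before plotting.
required <- c("name", "index", "word", "group")
missing_cols <- setdiff(required, names(toy))
stopifnot(length(missing_cols) == 0)

# Only attempt the plot if plotSpike is available in the session.
if (exists("plotSpike", mode = "function")) {
  p <- plotSpike(data = toy,
                 doc_id = "name",
                 index_var = "index",
                 word_var = "word",
                 group_var = "group",
                 polar = FALSE,        # rectangular instead of circular
                 interactive = FALSE)  # static instead of interactive
}
```

If the data frame uses other column names, the same call should work by pointing doc_id, index_var, word_var and group_var at those names instead.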
Details
The function generates two types of lexical dispersion plot for keywords or dictionary categories. Each plot represents the position of the selected keywords, dictionary categories, or metadata values in each text of a corpus object. The default type is circular (polar=TRUE); the alternative, rectangular type represents keyword positions linearly. The purpose is to represent a large volume of texts and to allow users to interact with the results.
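To make the idea of a keyword's relative position concrete, here is a minimal base R sketch that locates keywords in a single toy text and expresses each hit as a fraction of the text length. It is only an illustration of the concept: the whitespace tokenisation, the toy sentence, and the 0 to 1 scale are assumptions, and this is not how filterWords itself is implemented.

```r
# Toy text; splitting on whitespace is a deliberate simplification.
text <- "la memoria de la dictadura y la pandemia"
tokens <- strsplit(tolower(text), "\\s+")[[1]]

keywords <- c("memoria", "dictadura", "pandemia")

# Positions of the tokens that match any keyword.
hits <- which(tokens %in% keywords)

# Relative position: token position divided by the number of tokens.
positions <- data.frame(
  word  = tokens[hits],
  index = hits / length(tokens),
  stringsAsFactors = FALSE
)
print(positions)
```

A dispersion plot then draws one mark per row at the given index, along a line (rectangular layout) or around a ring (circular layout) for each document.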
Examples
if (FALSE) {
  # Aggregate the text by session from
  # The Spanish Parliament Speeches Corpus
  ag <- aggregate(list(text=spa.sessions$speech.text),
                  by=list(session_number=spa.sessions$session.number),
                  paste,
                  collapse="\n")
  # Paste zeros to the number to allow
  # sorting the sessions
  ag$session_number[nchar(ag$session_number)==1] <-
    paste0("00", ag$session_number[nchar(ag$session_number)==1])
  ag$session_number[nchar(ag$session_number)==2] <-
    paste0("0", ag$session_number[nchar(ag$session_number)==2])
  # Convert the results into a corpus object
  library(quanteda)
  cp <- corpus(ag,
               docid_field = "session_number")
  # Create a dictionary
  dic <- dictionary(
    list(Territorio=c("federal","estatuto","nacionalismo",
                      "regionalismo","cataluña","lengua"),
         Genero=c("violencia machista","mujer","violencia sexual",
                  "aborto","reproductivo","género","trans"),
         Memoria=c("memoria","franquismo","franquista","dictadura"),
         COVID=c("covid","pandemia","muerte")))
  # Search for the keywords and their
  # position in each session
  ter <- filterWords(cp, dic)
  # Define a proper description for naming
  # the sessions.
  ter$name <- paste0("Session ", ter$name)
  # Plot the results
  plotSpike(data=ter,
            legend.title="Tema:",
            title="Congreso de los Diputados - XIV Legislatura (2019-2023)",
            subtitle="Territorio, género, memoria y COVID en los debates de los plenos.")
}
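The zero-padding step in the example can also be written in a single call with base R's sprintf. The sketch below is a possible alternative, not the approach the example uses, and it assumes that session numbers are whole numbers of at most three digits; the vector shown is hypothetical.

```r
# Hypothetical session numbers, stored as character strings.
session_number <- c("1", "12", "123")

# Pad with leading zeros to a fixed width of three digits.
padded <- sprintf("%03d", as.integer(session_number))
print(padded)
```

Padding to a fixed width matters because document identifiers are sorted as text, and unpadded numbers such as "10" would otherwise sort before "2".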