Lexical Dispersion Plot for Large Corpora of Texts

The function creates a lexical dispersion plot for large corpus objects based on keywords or dictionaries.

Usage

plotSpike(data=NULL, 
          palette=c("#017a4a",
                    "#9F248F",
                    "#FFCE4E",
                    "#244579",
                    "#c6242d"), 
          doc_id="name",
          index_var="index",
          word_var="word",
          group_var="group",
          top_n=NULL, 
          sort=FALSE,
          polar=TRUE, 
          quartiles=FALSE, 
          text_label=NULL, 
          tooltip_values=NULL, 
          tooltip_doc="name",
          label.size=1,
          ring.col="red3",
          line.width=0.1,
          legend.position="top",
          legend.title="Group",
          title="Lexical Spike Plot",
          subtitle="Keyword dispersion in texts.",
          svg.height=5,
          svg.width=6,
          ncol=NULL,
          nrow=NULL,
          interactive=TRUE)

Arguments

data: A data frame generated by the function filterWords. It contains the name of the document, the keywords retrieved, their relative position in the text (index), and the group or dictionary category each keyword belongs to.
palette: A vector of colors to be used in the plot to represent groups or categories.
doc_id: The name of the variable containing the document identifier. The default is "name".
index_var: The name of the variable containing the index of the word in the text. The default is "index".
word_var: The name of the variable containing the keyword found in the corpus. The default is "word".
group_var: The name of the variable containing the group. The default is "group".
top_n: The number of top documents to be plotted. The default is NULL.
sort: A logical value indicating whether the documents should be sorted by frequency of terms found. The default is FALSE.
polar: A logical value indicating whether the plot should be circular or rectangular. The default is TRUE.
quartiles: A logical value indicating whether the quartiles should be plotted. The default is FALSE.
text_label: A variable containing the text used to label the documents. The default is NULL.
tooltip_values: A variable that displays individual values in the tooltip. The default is NULL (the values are automatically generated by the function).
tooltip_doc: A variable indicating the text to be shown in the tooltip. The default is "name".
label.size: The size of the labels. The default is 1.
ring.col: The color of the ring. The default is "red3".
line.width: The width of the lines. The default is 0.1.
legend.position: The position of the legend. The default is "top".
legend.title: The title of the legend. The default is "Group".
title: The title of the plot. The default is "Lexical Spike Plot".
subtitle: The subtitle of the plot. The default is "Keyword dispersion in texts."
svg.height: The height of the svg for the interactive chart. The default is 5.
svg.width: The width of the svg for the interactive chart. The default is 6.
ncol: The number of columns in the graph. The default is NULL.
nrow: The number of rows in the graph. The default is NULL.
interactive: A logical value indicating whether the plot should be interactive. The default is TRUE.
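
As a rough illustration of the input shape described for data, the sketch below builds a toy data frame whose column names match the defaults of doc_id, index_var, word_var and group_var. All values are invented, and the scale used for index (a share of the text length) is an assumption rather than something this page specifies; the real output of filterWords may differ.

```r
# Hypothetical toy input: the values are invented for illustration only.
toy <- data.frame(
  name  = c("Session 001", "Session 001", "Session 002", "Session 002"),
  index = c(0.12, 0.57, 0.08, 0.91),  # assumed scale: position as a share of the text
  word  = c("estatuto", "pandemia", "memoria", "mujer"),
  group = c("Territorio", "COVID", "Memoria", "Genero"),
  stringsAsFactors = FALSE
)

# Check that the columns named by the default arguments are present.
required <- c(doc_id = "name", index_var = "index",
              word_var = "word", group_var = "group")
missing_cols <- setdiff(required, names(toy))
stopifnot(length(missing_cols) == 0)

# Count keyword hits per document, the quantity that sort is described as using.
hits <- table(toy$name)
```

If the columns carry other names, the doc_id, index_var, word_var and group_var arguments can point plotSpike to them instead of renaming the data.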

Details

The function generates two types of lexical dispersion plot for keywords or dictionary categories. Each plot shows the position of the selected keywords, dictionary categories, or metadata values within each text of a corpus object. The default type is circular; the alternative, rectangular type represents keyword positions linearly. The purpose is to represent a large volume of texts and to allow users to interact with the results.

Value

A chart representing the dispersion of terms or concepts.

Examples

if (FALSE) {

# Aggregate the text by session from 
# The Spanish Parliament Speeches Corpus
ag <- aggregate(list(text=spa.sessions$speech.text),
                by=list(session_number=spa.sessions$session.number),
                paste, 
                collapse="\n")

# Pad the session numbers with leading zeros
# so that the sessions sort correctly
ag$session_number[nchar(ag$session_number)==1] <- 
  paste0("00", ag$session_number[nchar(ag$session_number)==1])

ag$session_number[nchar(ag$session_number)==2] <- 
  paste0("0", ag$session_number[nchar(ag$session_number)==2])

# Convert the results into a corpus object
library(quanteda)
cp <- corpus(ag, 
             docid_field = "session_number")


# Create a dictionary
dic <- dictionary(
  list(Territorio=c("federal","estatuto","nacionalismo",
                    "regionalismo","cataluña","lengua"),
       Genero=c("violencia machista","mujer","violencia sexual",
                "aborto","reproductivo","género","trans"),
       Memoria=c("memoria","franquismo","franquista","dictadura"),
       COVID=c("covid","pandemia","muerte")))


# Search for the keywords and their
# position in each session
ter <- filterWords(cp, dic)

# Define a proper description for naming
# the sessions.
ter$name <- paste0("Session ", ter$name)


# Plot the results
plotSpike(data=ter, 
          legend.title="Tema:",
          title="Congreso de los Diputados - XIV Legislatura (2019-2023)",
          subtitle="Territorio, género, memoria y COVID en los debates de los plenos.")

}
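
The zero-padding step in the example could also be written in a single call. The sketch below is an alternative, not the author's code: it uses base R sprintf on a hypothetical vector of session identifiers and assumes they are whole numbers of at most three digits.

```r
# Hypothetical session identifiers, stored as character strings.
session_number <- c("1", "12", "123")

# Pad to a fixed width of three digits with leading zeros.
padded <- sprintf("%03d", as.integer(session_number))

# Fixed-width strings sort in the same order as the underlying numbers.
sorted <- sort(padded)
```

Numbers with more than three digits keep all their digits, so the width would need to be raised for longer identifiers.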