Calculates the ratio of a term's frequency in each document to its frequency across the whole corpus.

Usage

tfRatio(text,
        keyword,
        threshold=0,
        return.selected=FALSE,
        remove.accent=TRUE,
        identifier="Latin-ASCII"
        )

Arguments

text

Collection of text documents.

keyword

Keyword to be searched in the documents.

threshold

Defines the cutoff on the ratio for selecting those texts where the keyword is markedly more frequent than in most texts. The default is 0.

return.selected

Logical. Should the ratio values be returned, or only the index of the documents selected according to the established threshold value? The default is FALSE.

remove.accent

Logical. Should accents be removed from the text before the search? The default is TRUE. For a large number of texts, it is recommended to remove accents before using the function.

identifier

A single string with a transform identifier (see stri_trans_list in the stringi package) or custom transliteration rules. The default is "Latin-ASCII".
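
As an illustration of what the "Latin-ASCII" identifier does, the sketch below calls stringi directly. It assumes the stringi package is installed; the sample strings are made up, and whether tfRatio applies the transform in exactly this way is not stated here.

```r
library(stringi)

# Made-up sample texts; accented letters are written as Unicode escapes
# so the snippet does not depend on the file encoding.
texts <- c("A democracia \u00e9 fr\u00e1gil",
           "Elei\u00e7\u00e3o e participa\u00e7\u00e3o")

# Transliterate accented Latin characters to plain ASCII.
plain <- stri_trans_general(texts, "Latin-ASCII")
print(plain)
```

Other valid identifiers can be listed with stri_trans_list().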

Details

Calculates the ratio of the frequency of a given term in each document to its frequency in the whole corpus. This function is particularly useful for selecting texts according to themes or issues. It allows users to select only those documents containing the ideas of interest.

Value

The function can return one of two values. The default is the frequency ratio. The ratio of a term will be high (much higher than 1) when a few documents concentrate most of its occurrences. It will be zero for texts with no matches, and between 0 and 1 for documents where the frequency is below average.
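
To make these ranges concrete, here is a minimal base R sketch. It assumes the ratio is each document's keyword count divided by the mean count across documents, which is one reading of the description above and not necessarily the package's exact computation. The helper tf_ratio_sketch and the sample texts are hypothetical.

```r
# Hypothetical helper, not part of the package: ratio of each document's
# keyword count to the mean count across all documents (an assumption).
tf_ratio_sketch <- function(texts, keyword) {
  # gregexpr returns -1 for a text with no match, so count positions > 0.
  hits <- gregexpr(keyword, texts, fixed = TRUE)
  counts <- vapply(hits, function(m) sum(m > 0), integer(1))
  if (sum(counts) == 0) return(as.numeric(counts))  # avoid dividing 0 by 0
  counts / mean(counts)
}

# Made-up documents containing "democ" 5, 0, 1 and 2 times respectively.
docs <- c("democracy needs democrats, democratic courts, democratic norms and democratic habits",
          "taxes and trade",
          "a democratic vote",
          "democracy and democrats")

ratios <- tf_ratio_sketch(docs, "democ")
print(ratios)
```

With a mean of 2 occurrences per document, the first text scores well above 1, the second scores 0, the third falls between 0 and 1, and the fourth sits exactly at the average.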

If "return.selected=TRUE", the function returns only the ratio values above the limit set in the "threshold" argument. The default threshold is 0, which returns all values above 0, i.e., all the texts where the term was observed at least once.
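
The selection step can be pictured with plain vector subsetting. This is only a sketch of the idea on a made-up ratio vector, assuming a strict "greater than" comparison; it does not call tfRatio, and the exact shape of the returned object may differ.

```r
# Made-up ratios for four documents.
ratios <- c(2.5, 0, 0.5, 1)

# threshold = 0: every document where the term appears at least once.
above_zero <- which(ratios > 0)

# threshold = 2: only documents where the term is more than twice as
# frequent as the average.
above_two <- which(ratios > 2)

print(above_zero)
print(ratios[above_two])
```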

Examples

if (FALSE) {
# Loads the dataset on US Presidential inaugural speeches
tx <- quanteda::data_corpus_inaugural

# Calculates the ratio for the root "democ" in all documents
tfRatio(text=tx, keyword="democ")

# Selects only the documents where the term occurs more than
# twice as often as the average (ratio above 2).
tfRatio(text=tx,
        keyword="democ",
        return.selected=TRUE,
        threshold=2)
}