model

The model module provides functions for modeling text corpuses and extracting keywords.

Functions

kwx.model.get_topic_words(text_corpus, labels, num_topics=None, num_keywords=None)

Get top words within each topic for cluster models.

Parameters:
text_corpus : list, list of lists, or str

The text corpus over which analysis should be done.

labels : list

The labels assigned to topics.

num_topics : int (default=None)

The number of categories for LDA- and BERT-based approaches.

num_keywords : int (default=None)

The number of keywords that should be extracted.

Returns:
topics, non_blank_topic_idxs : list and list

Topic keywords and indexes of those that are not empty lists.
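The return shape described above can be illustrated with a minimal sketch. It is not the kwx implementation: the helper name top_words_per_cluster is hypothetical, the corpus is assumed to be already tokenized, labels are assumed to be integers in the range 0 to num_topics - 1, and "top words" is assumed to mean the most frequent words in each cluster.

```python
from collections import Counter


def top_words_per_cluster(text_corpus, labels, num_topics, num_keywords):
    """Return the most frequent words per cluster and the indexes of non-empty topics.

    Hypothetical helper for illustration only; not part of kwx.
    """
    # One word counter per topic (assumption: labels are ints in [0, num_topics)).
    counters = [Counter() for _ in range(num_topics)]
    for tokens, label in zip(text_corpus, labels):
        counters[label].update(tokens)

    # Keep the num_keywords most frequent words of each topic.
    topics = [
        [word for word, _ in counter.most_common(num_keywords)]
        for counter in counters
    ]
    # Topics that received no documents end up as empty lists.
    non_blank_topic_idxs = [i for i, words in enumerate(topics) if words]
    return topics, non_blank_topic_idxs


corpus = [
    ["solar", "panel", "energy"],
    ["solar", "energy", "grid"],
    ["goal", "match", "league"],
]
labels = [0, 0, 2]  # cluster 1 receives no documents

topics, non_blank = top_words_per_cluster(corpus, labels, num_topics=3, num_keywords=2)
print(topics)
print(non_blank)  # [0, 2]
```

Returning the non-empty indexes alongside the topics lets a caller skip clusters that attracted no documents without losing the original topic numbering.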

kwx.model.get_coherence(model, text_corpus, num_topics=10, num_keywords=10, measure='c_v')

Computes the coherence of a model using gensim.models.coherencemodel.

Parameters:
model : kwx.topic_model.TopicModel

A model trained on the given text corpus.

text_corpus : list, list of lists, or str

The text corpus over which analysis should be done.

num_topics : int (default=10)

The number of categories for LDA- and BERT-based approaches.

num_keywords : int (default=10)

The number of keywords that should be extracted.

measure : str (default='c_v')

A gensim measure of coherence.

Returns:
coherence : float

The coherence of the given model over the given texts.
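To give a rough sense of what a coherence score captures, the sketch below computes a simplified document co-occurrence score in the style of the UMass measure (available in gensim as 'u_mass'). It is not gensim's implementation and not the default 'c_v' measure: the function name umass_style_coherence is hypothetical, and the corpus is assumed to be already tokenized.

```python
import math
from itertools import combinations


def umass_style_coherence(topic_words, text_corpus):
    """Average log conditional co-occurrence of the word pairs in one topic.

    Simplified illustration only; not gensim's implementation and not 'c_v'.
    """
    docs = [set(tokens) for tokens in text_corpus]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in docs if all(w in doc for w in words))

    scores = []
    # Each pair is conditioned on the word that ranks earlier in the topic.
    for earlier, later in combinations(topic_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_freq(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else 0.0


corpus = [
    ["solar", "panel", "energy"],
    ["solar", "energy", "grid"],
    ["goal", "match", "league"],
    ["match", "league", "goal"],
]

coherent = umass_style_coherence(["solar", "energy"], corpus)
mixed = umass_style_coherence(["solar", "goal"], corpus)
print(coherent > mixed)  # True
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is the intuition behind comparing models by their coherence.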

kwx.model._order_and_subset_by_coherence(tm, num_topics=10, num_keywords=10)

Orders topics based on their average coherence across the text corpus.

Parameters:
tm : kwx.topic_model.TopicModel

A model trained on the given text corpus.

num_topics : int (default=10)

The number of categories for LDA- and BERT-based approaches.

num_keywords : int (default=10)

The number of keywords that should be extracted.

Returns:
ordered_topic_words, selection_indexes : list of lists and list of lists

Topic words ordered by average coherence and the indexes by which they should be selected.

kwx.model._select_kws(method='lda', kw_args=None, words_to_ignore=None, n=10)

Selects keywords from a group of extracted keywords.

Parameters:
method : str (default='lda')

The modelling method.

Options:

frequency: a count of the most frequent words.

TFIDF: Term Frequency Inverse Document Frequency.

  • Allows for words within one text group to be compared to those of another.

  • Gives a better idea of what users specifically want from a given publication.

LDA: Latent Dirichlet Allocation

  • Text data is classified into a given number of categories.

  • These categories are then used to classify individual entries given the percent they fall into categories.

BERT: Bidirectional Encoder Representations from Transformers

  • Words are classified via neural networks developed by Google.

  • Word classifications are then used to derive topics.

kw_args : dict (default=None)

A dictionary with keywords as keys and, as values, the metrics by which to order them.

words_to_ignore : list (default=None)

Words to exclude from the selected keywords.

n : int (default=10)

The number of keywords to select.

Returns:
keywords : list

Selected keywords from those extracted.
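The selection step can be sketched as follows. This is an illustration under stated assumptions rather than the kwx implementation: the helper name select_keywords is hypothetical, the input is assumed to map each keyword to a single numeric score, and a higher score is assumed to mean a better keyword.

```python
def select_keywords(kw_scores, words_to_ignore=None, n=10):
    """Pick the n best-scoring keywords that are not in words_to_ignore.

    Hypothetical helper for illustration only; not part of kwx.
    """
    ignore = set(words_to_ignore or [])
    candidates = [(kw, score) for kw, score in kw_scores.items() if kw not in ignore]
    # Sort by score (descending), then alphabetically so that ties are deterministic.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [kw for kw, _ in candidates[:n]]


kw_scores = {"energy": 0.9, "solar": 0.8, "the": 0.95, "grid": 0.4, "panel": 0.4}
selected = select_keywords(kw_scores, words_to_ignore=["the"], n=3)
print(selected)  # ['energy', 'solar', 'grid']
```

Filtering before truncating to n means that ignored words never take up one of the n slots.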

kwx.model.extract_kws(method='lda', bert_st_model='xlm-r-bert-base-nli-stsb-mean-tokens', text_corpus=None, input_language=None, output_language=None, num_keywords=10, num_topics=10, corpuses_to_compare=None, return_topics=False, ignore_words=None, prompt_remove_words=True, return_kw_args=False, **kwargs)

Extracts keywords given data, metadata, and model parameter inputs.

Parameters:
method : str (default='lda')

The modelling method.

Options:

frequency: a count of the most frequent words.

TFIDF: Term Frequency Inverse Document Frequency.

  • Allows for words within one text group to be compared to those of another.

  • Gives a better idea of what users specifically want from a given publication.

LDA: Latent Dirichlet Allocation

  • Text data is classified into a given number of categories.

  • These categories are then used to classify individual entries given the percent they fall into categories.

BERT: Bidirectional Encoder Representations from Transformers

  • Words are classified via neural networks developed by Google.

  • Word classifications are then used to derive topics.

bert_st_model : str (default='xlm-r-bert-base-nli-stsb-mean-tokens')

The BERT model to use.

text_corpus : list, list of lists, or str (default=None)

The text corpus over which analysis should be done.

input_language : str (default=None)

The spoken language in which the texts are written.

output_language : str (default=None: same as input_language)

The spoken language in which the results should be given.

num_keywords : int (default=10)

The number of keywords that should be extracted.

num_topics : int (default=10)

The number of categories for LDA- and BERT-based approaches.

corpuses_to_compare : list of lists (default=None)

A list of other text corpuses that the main corpus should be compared to using TFIDF.

return_topics : bool (default=False)

Whether to return the topics that are extracted by an LDA model.

ignore_words : str or list (default=None)

Words that should be excluded from the extracted keywords.

prompt_remove_words : bool (default=True)

Whether to prompt the user for keywords to remove.

**kwargs : keyword arguments

Keyword arguments corresponding to sentence_transformers.SentenceTransformer.encode, gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer.

Returns:
output_keywords : list or list of lists

A list of keywords, or a list of lists in which each sublist holds the keywords best associated with a data entry.
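The TFIDF option with corpuses_to_compare can be pictured with a small standard-library sketch. It is an assumption-laden illustration, not what kwx runs: the helper name tfidf_keywords is hypothetical, each corpus is assumed to be a list of tokenized texts and is treated as a single document, and a smoothed inverse document frequency is used. The **kwargs entry above suggests that kwx itself relies on sklearn.feature_extraction.text.TfidfVectorizer.

```python
import math
from collections import Counter


def tfidf_keywords(main_corpus, corpuses_to_compare, num_keywords=10):
    """Rank the words of main_corpus by TF-IDF, treating each corpus as one document.

    Hypothetical helper for illustration only; not part of kwx.
    """
    corpora = [main_corpus] + list(corpuses_to_compare)
    # Flatten each corpus (a list of token lists) into a single bag of words.
    bags = [Counter(tok for doc in corpus for tok in doc) for corpus in corpora]
    n_docs = len(bags)

    main_bag = bags[0]
    total = sum(main_bag.values())
    scores = {}
    for word, count in main_bag.items():
        df = sum(1 for bag in bags if word in bag)  # corpora containing the word
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        scores[word] = (count / total) * idf

    # Highest score first; alphabetical order breaks ties.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:num_keywords]


main = [["solar", "energy", "policy"], ["solar", "panel", "policy"]]
others = [[["football", "policy", "league"], ["league", "policy", "match"]]]

keywords = tfidf_keywords(main, others, num_keywords=3)
print(keywords)  # ['solar', 'policy', 'energy']
```

Here "solar" and "policy" are equally frequent in the main corpus, but "policy" also appears in the comparison corpus, so it is weighted down and ranks below "solar".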

kwx.model.gen_files(method=['lda', 'bert'], text_corpus=None, input_language=None, output_language=None, num_keywords=10, topic_nums_to_compare=None, corpuses_to_compare=None, ignore_words=None, prompt_remove_words=True, verbose=True, fig_size=(20, 10), incl_most_freq=True, org_by_pos=True, incl_visuals=True, save_dir=None, zip_results=True)

Generates a directory or zip file of all keyword analysis elements.

Parameters:
Most parameters for the following kwx functions:

visuals.graph_topic_num_evals

visuals.gen_word_cloud

visuals.pyLDAvis_topics

model.extract_kws

utils.prompt_for_word_removal

incl_most_freq : bool (default=True)

Whether to include the most frequent words in the output.

org_by_pos : bool (default=True)

Whether to organize words by their parts of speech.

incl_visuals : str or bool (default=True)

Which visual graphs to include in the output.

Str options: topic_num_evals, word_cloud, pyLDAvis, t_sne.

Bool options: True - all; False - none.

save_dir : str (default=None)

A path to a directory where the results will be saved.

zip_results : bool (default=True)

Whether to zip the results from the analysis.

Returns:
A directory or zip file, written to save_dir or, if save_dir is not given, to the current working directory.