model
The model module provides the functions needed for modeling text corpuses and extracting keywords.
Functions
- kwx.model.get_topic_words(text_corpus, labels, num_topics=None, num_keywords=None)
Get top words within each topic for cluster models.
- Parameters:
- text_corpus : list, list of lists, or str
The text corpus over which analysis should be done.
- labels : list
The labels assigned to topics.
- num_topics : int (default=None)
The number of categories for LDA and BERT based approaches.
- num_keywords : int (default=None)
The number of keywords that should be extracted.
- Returns:
- topics, non_blank_topic_idxs : list and list
Topic keywords and indexes of those that are not empty lists.
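The idea behind the two return values can be sketched in plain Python. This is only an illustration, not the kwx implementation: the helper name `top_words_per_label` is hypothetical, and it assumes the corpus is already tokenized and that labels are integers in `range(num_topics)`.

```python
from collections import Counter


def top_words_per_label(token_lists, labels, num_topics, num_keywords):
    """Hypothetical helper: count words per cluster label, keep the most common."""
    # One word counter per topic; topics that receive no texts stay empty.
    counters = [Counter() for _ in range(num_topics)]
    for tokens, label in zip(token_lists, labels):
        counters[label].update(tokens)

    # Top words for every topic, plus the indexes of topics that are not empty.
    topics = [[word for word, _ in c.most_common(num_keywords)] for c in counters]
    non_blank_topic_idxs = [i for i, words in enumerate(topics) if words]
    return topics, non_blank_topic_idxs


docs = [["solar", "panel", "energy"], ["wind", "energy"], ["solar", "energy", "grid"]]
labels = [0, 2, 0]  # no text was assigned to topic 1
topics, non_blank = top_words_per_label(docs, labels, num_topics=3, num_keywords=2)
```

Here topic 1 comes back as an empty list, so only indexes 0 and 2 are reported as non-blank.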
- kwx.model.get_coherence(model, text_corpus, num_topics=10, num_keywords=10, measure='c_v')
Gets model coherence from gensim.models.coherencemodel.
- Parameters:
- model : kwx.topic_model.TopicModel
A model trained on the given text corpus.
- text_corpus : list, list of lists, or str
The text corpus over which analysis should be done.
- num_topics : int (default=10)
The number of categories for LDA and BERT based approaches.
- num_keywords : int (default=10)
The number of keywords that should be extracted.
- measure : str (default=c_v)
A gensim measure of coherence.
- Returns:
- coherence : float
The coherence of the given model over the given texts.
- kwx.model._order_and_subset_by_coherence(tm, num_topics=10, num_keywords=10)
Orders topics based on their average coherence across the text corpus.
- Parameters:
- tm : kwx.topic_model.TopicModel
A model trained on the given text corpus.
- num_topics : int (default=10)
The number of categories for LDA and BERT based approaches.
- num_keywords : int (default=10)
The number of keywords that should be extracted.
- Returns:
- ordered_topic_words, selection_indexes : list of lists and list of lists
Topic words ordered by average coherence and indexes by which they should be selected.
- kwx.model._select_kws(method='lda', kw_args=None, words_to_ignore=None, n=10)
Selects keywords from a group of extracted keywords.
- Parameters:
- method : str (default=lda)
The modelling method.
- Options:
frequency: a count of the most frequent words.
TFIDF: Term Frequency Inverse Document Frequency.
Allows for words within one text group to be compared to those of another.
Gives a better idea of what users specifically want from a given publication.
LDA: Latent Dirichlet Allocation
Text data is classified into a given number of categories.
These categories are then used to classify individual entries given the percent they fall into categories.
BERT: Bidirectional Encoder Representations from Transformers
Words are classified via Google Neural Networks.
Word classifications are then used to derive topics.
- kw_args : dict (default=None)
A dictionary of keywords and metrics through which to order them as values.
- words_to_ignore : list (default=None)
Words to not include in the selected keywords.
- n : int (default=10)
The number of keywords to select.
- Returns:
- keywords : list
Selected keywords from those extracted.
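The selection step described above (order by metric, drop ignored words, keep n) could look roughly like the following. It is a minimal sketch and not the library's code: `select_keywords` is a hypothetical name, and it assumes a higher metric value means a better keyword.

```python
def select_keywords(kw_args, words_to_ignore=None, n=10):
    """Hypothetical sketch: pick the n best-scoring keywords that are not ignored."""
    ignore = set(words_to_ignore or [])
    # Assumption: larger metric values rank first.
    ranked = sorted(kw_args.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked if word not in ignore][:n]


scores = {"energy": 0.9, "solar": 0.7, "the": 0.95, "grid": 0.4}
selected = select_keywords(scores, words_to_ignore=["the"], n=2)
```

Filtering before truncating matters: ignored words do not use up any of the n slots.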
- kwx.model.extract_kws(method='lda', bert_st_model='xlm-r-bert-base-nli-stsb-mean-tokens', text_corpus=None, input_language=None, output_language=None, num_keywords=10, num_topics=10, corpuses_to_compare=None, return_topics=False, ignore_words=None, prompt_remove_words=True, return_kw_args=False, **kwargs)
Extracts keywords given data, metadata, and model parameter inputs.
- Parameters:
- method : str (default=lda)
The modelling method.
- Options:
frequency: a count of the most frequent words.
TFIDF: Term Frequency Inverse Document Frequency.
Allows for words within one text group to be compared to those of another.
Gives a better idea of what users specifically want from a given publication.
LDA: Latent Dirichlet Allocation
Text data is classified into a given number of categories.
These categories are then used to classify individual entries given the percent they fall into categories.
BERT: Bidirectional Encoder Representations from Transformers
Words are classified via Google Neural Networks.
Word classifications are then used to derive topics.
- bert_st_model : str (default=xlm-r-bert-base-nli-stsb-mean-tokens)
The BERT model to use.
- text_corpus : list, list of lists, or str
The text corpus over which analysis should be done.
- input_language : str (default=None)
The spoken language in which the texts are found.
- output_language : str (default=None: same as input_language)
The spoken language in which the results should be given.
- num_keywords : int (default=10)
The number of keywords that should be extracted.
- num_topics : int (default=10)
The number of categories for LDA and BERT based approaches.
- corpuses_to_compare : list of lists (default=None)
A list of other text corpuses that the main corpus should be compared to using TFIDF.
- return_topics : bool (default=False)
Whether to return the topics that are extracted by an LDA model.
- ignore_words : str or list (default=None)
Words that should be removed.
- prompt_remove_words : bool (default=True)
Whether to prompt the user for keywords to remove.
- **kwargs : keyword arguments
Keyword arguments corresponding to sentence_transformers.SentenceTransformer.encode, gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer.
- Returns:
- output_keywords : list or list of lists
A list of lists where sub_lists are the keywords best associated with the data entry.
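To illustrate how comparing a main corpus against corpuses_to_compare surfaces words specific to the main corpus, here is a simplified TF-IDF sketch in plain Python. It is not how kwx computes its scores (the kwargs above indicate scikit-learn's TfidfVectorizer is involved); the function name `tfidf_keywords`, the whitespace tokenization, and the choice to treat each corpus as one bag of words are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_keywords(main_corpus, corpuses_to_compare, num_keywords=10):
    """Hypothetical sketch: rank main-corpus words by a smoothed TF-IDF score."""
    corpora = [main_corpus] + list(corpuses_to_compare)
    # Assumption: each corpus is flattened into a single bag of words.
    bags = [Counter(tok for text in corpus for tok in text.split()) for corpus in corpora]
    n_corpora = len(bags)
    main_bag = bags[0]
    total = sum(main_bag.values())

    scores = {}
    for word, count in main_bag.items():
        # Number of corpora (including the main one) that contain the word.
        doc_freq = sum(1 for bag in bags if word in bag)
        # Smoothed inverse document frequency; words shared by all corpora get 1.0.
        idf = math.log((1 + n_corpora) / (1 + doc_freq)) + 1
        scores[word] = (count / total) * idf

    return sorted(scores, key=scores.get, reverse=True)[:num_keywords]


main = ["solar panels cut energy costs", "solar energy is growing"]
others = [["wind energy is growing"], ["energy costs are rising"]]
top = tfidf_keywords(main, others, num_keywords=1)
```

Both "solar" and "energy" appear twice in the main corpus, but "energy" also appears in every comparison corpus, so "solar" receives the higher score.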
- kwx.model.gen_files(method=['lda', 'bert'], text_corpus=None, input_language=None, output_language=None, num_keywords=10, topic_nums_to_compare=None, corpuses_to_compare=None, ignore_words=None, prompt_remove_words=True, verbose=True, fig_size=(20, 10), incl_most_freq=True, org_by_pos=True, incl_visuals=True, save_dir=None, zip_results=True)
Generates a directory or zip file of all keyword analysis elements.
- Parameters:
- Most parameters for the following kwx functions:
visuals.graph_topic_num_evals
visuals.gen_word_cloud
visuals.pyLDAvis_topics
model.extract_kws
utils.prompt_for_word_removal
- incl_most_freq : bool (default=True)
Whether to include the most frequent words in the output.
- org_by_pos : bool (default=True)
Whether to organize words by their parts of speech.
- incl_visuals : str or bool (default=True)
Which visual graphs to include in the output.
Str options: topic_num_evals, word_cloud, pyLDAvis, t_sne.
Bool options: True - all; False - none.
- save_dir : str (default=None)
A path to a directory where the results will be saved.
- zip_results : bool (default=True)
Whether to zip the results from the analysis.
- Returns:
- A directory or zip file in the current working or save_dir directory.