
The utils module provides functions for data loading, cleaning, output formatting, and user interaction.


kwx.utils.load_data(data, target_cols=None)[source]

Loads data from a DataFrame or a csv/xlsx path and formats it as a pandas DataFrame.

data : pd.DataFrame or csv/xlsx path

The data in df or path form.

target_cols : str or list (default=None)

The columns in the csv/xlsx or dataframe that contain the text data to be modeled.


The texts as a df.

kwx.utils._combine_texts_to_str(text_corpus, ignore_words=None)[source]

Combines texts into one string.

text_corpus : str or list

The texts to be combined.

ignore_words : str or list

Strings that should be removed from the text body.


A string of the full text with unwanted words removed.
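
A minimal sketch of the combining step described above, using only the standard library. It is not the kwx implementation: the function name `combine_texts_to_str`, the whole-word matching, and the whitespace cleanup are assumptions made for illustration.

```python
import re


def combine_texts_to_str(text_corpus, ignore_words=None):
    """Join texts into one string, dropping whole-word matches of ignore_words."""
    # Accept a single string or a list of strings for both arguments.
    if isinstance(text_corpus, str):
        text_corpus = [text_corpus]
    if ignore_words is None:
        ignore_words = []
    elif isinstance(ignore_words, str):
        ignore_words = [ignore_words]

    combined = " ".join(text_corpus)
    for word in ignore_words:
        # Word boundaries keep "cat" from also deleting part of "catalog".
        combined = re.sub(r"\b" + re.escape(word) + r"\b", "", combined)

    # Collapse the extra whitespace left behind by the removals.
    return " ".join(combined.split())


result = combine_texts_to_str(["the cat sat", "on the catalog"], ignore_words="cat")
print(result)
```

Matching on word boundaries is a design choice of this sketch; plain substring removal would also alter longer words that happen to contain an ignored string.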


Lowercases tokens and removes numbers and possibly names.

args : list of tuples

The following arguments zipped.


The text to clean.

words_to_ignore : str or list

Strings that should be removed from the text body.

stop_words : str or list

Stopwords for the given language.


The text without unwanted tokens.
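
The token filtering described above might look roughly like the following. This is a sketch under assumptions, not the library's code: `remove_unwanted` is a hypothetical name, it takes plain arguments instead of a zipped tuple, and names are only removed when they are passed in through `words_to_ignore`.

```python
def remove_unwanted(tokens, words_to_ignore=None, stop_words=None):
    """Lowercase tokens and drop numbers, ignored words, and stopwords."""

    def as_set(value):
        # Normalize None, a single string, or a list into a lowercase set.
        if value is None:
            return set()
        if isinstance(value, str):
            return {value.lower()}
        return {v.lower() for v in value}

    ignore = as_set(words_to_ignore)
    stops = as_set(stop_words)

    cleaned = []
    for token in tokens:
        token = token.lower()
        if token.isdigit():
            continue  # purely numeric tokens are dropped
        if token in ignore or token in stops:
            continue  # unwanted words and stopwords are dropped
        cleaned.append(token)
    return cleaned


tokens = ["The", "Model", "found", "42", "keywords", "in", "Berlin"]
result = remove_unwanted(tokens, words_to_ignore="berlin", stop_words=["the", "in"])
print(result)
```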

kwx.utils._lemmatize(tokens, nlp=None, verbose=True)[source]

Lemmatizes tokens.

tokens : list or list of lists

Tokens to be lemmatized.

nlp : spacy.load object

A spacy language model.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

base_tokens : list or list of lists

Tokens that have been lemmatized for nlp analysis.

kwx.utils.clean(texts, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)[source]

Cleans and tokenizes a text body to prepare it for analysis.

texts : str or list

The texts to be cleaned and tokenized.

input_language : str (default=None)

The English name of the language in which the texts are found.

min_token_freq : int (default=2)

The minimum allowable frequency of a word inside the text corpus.

min_token_len : int (default=3)

The smallest allowable length of a word.

min_tokens : int (default=0)

The minimum allowable length of a tokenized text.

max_token_index : int (default=-1)

The maximum allowable length of a tokenized text.

min_ngram_count : int (default=3)

The minimum occurrences for an n-gram to be included.

remove_stopwords : bool (default=True)

Whether to remove stopwords.

ignore_words : str or list

Strings that should be removed from the text body.

sample_size : float (default=1)

The amount of data to be randomly sampled.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

text_corpus : list or list of lists

The texts formatted for analysis.
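
To make the `min_token_freq` and `min_token_len` thresholds concrete, here is a small sketch of how such filters could be applied to already tokenized texts. The helper `filter_tokens` is hypothetical and covers only these two parameters; it assumes frequency is counted across the whole corpus and that both thresholds are inclusive.

```python
from collections import Counter


def filter_tokens(tokenized_texts, min_token_freq=2, min_token_len=3):
    """Keep tokens that are frequent enough in the corpus and long enough."""
    # Count every token across all texts, not per text.
    counts = Counter(tok for text in tokenized_texts for tok in text)
    return [
        [
            tok
            for tok in text
            if counts[tok] >= min_token_freq and len(tok) >= min_token_len
        ]
        for text in tokenized_texts
    ]


docs = [["data", "is", "clean"], ["clean", "data", "wins"], ["is", "it"]]
filtered = filter_tokens(docs)
print(filtered)
```

In this example "wins" is dropped for appearing only once, while "is" and "it" are dropped for being shorter than three characters.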

kwx.utils.prepare_data(data=None, target_cols=None, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)[source]

Prepares input data for analysis from a pandas.DataFrame or path.

data : pd.DataFrame or csv/xlsx path

The data in df or path form.

target_cols : str or list (default=None)

The columns in the csv/xlsx or dataframe that contain the text data to be modeled.

input_language : str (default=None)

The English name of the language in which the texts are found.

min_token_freq : int (default=2)

The minimum allowable frequency of a word inside the text corpus.

min_token_len : int (default=3)

The smallest allowable length of a word.

min_tokens : int (default=0)

The minimum allowable length of a tokenized text.

max_token_index : int (default=-1)

The maximum allowable length of a tokenized text.

min_ngram_count : int (default=3)

The minimum occurrences for an n-gram to be included.

remove_stopwords : bool (default=True)

Whether to remove stopwords.

ignore_words : str or list

Strings that should be removed from the text body.

sample_size : float (default=1)

The amount of data to be randomly sampled.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

text_corpus, clean_texts, selected_idxs : list or list of lists, list, list

The texts formatted for text analysis both as tokens and strings, as well as the indexes for selected entries.
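
The three return values can be illustrated with a simplified sketch of random sampling. This is not how kwx prepares data: `sample_texts` and its `seed` argument are hypothetical, tokenization is reduced to whitespace splitting, and `sample_size` is assumed to be a fraction between 0 and 1.

```python
import random


def sample_texts(texts, sample_size=1, seed=0):
    """Sample a fraction of texts; return tokens, strings, and chosen indexes."""
    n_keep = int(len(texts) * sample_size)
    rng = random.Random(seed)  # seeded so the example is reproducible
    # Sorting keeps the sampled entries in their original order.
    selected_idxs = sorted(rng.sample(range(len(texts)), n_keep))

    text_corpus = [texts[i].split() for i in selected_idxs]  # tokens
    clean_texts = [" ".join(tokens) for tokens in text_corpus]  # strings
    return text_corpus, clean_texts, selected_idxs


texts = [
    "first short text",
    "second short text",
    "third short text",
    "fourth short text",
]
text_corpus, clean_texts, selected_idxs = sample_texts(texts, sample_size=0.5)
print(selected_idxs, clean_texts)
```

Returning the selected indexes alongside the texts lets results be mapped back to rows of the original data.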

kwx.utils._prepare_corpus_path(text_corpus=None, clean_texts=None, target_cols=None, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)[source]

Checks a text corpus to see if it’s a path, and prepares the data if so.

text_corpus : str or list or list of lists

A path or text corpus over which analysis should be done.


clean_texts

The texts formatted for analysis as strings.

target_cols : str or list (default=None)

The columns in the csv/xlsx or dataframe that contain the text data to be modeled.

input_language : str (default=None)

The English name of the language in which the texts are found.

min_token_freq : int (default=2)

The minimum allowable frequency of a word inside the text corpus.

min_token_len : int (default=3)

The smallest allowable length of a word.

min_tokens : int (default=0)

The minimum allowable length of a tokenized text.

max_token_index : int (default=-1)

The maximum allowable length of a tokenized text.

min_ngram_count : int (default=3)

The minimum occurrences for an n-gram to be included.

remove_stopwords : bool (default=True)

Whether to remove stopwords.

ignore_words : str or list

Strings that should be removed from the text body.

sample_size : float (default=1)

The amount of data to be randomly sampled.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

text_corpus : list or list of lists

A prepared text corpus for the data in the given path.
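
The path check described above could be sketched as follows. Both helpers are hypothetical, and `prepare_from_path` is a stand-in callback for whatever preparation routine is used; the sketch assumes a path is recognized by its `.csv` or `.xlsx` extension.

```python
def looks_like_data_path(text_corpus):
    """Return True if the input is a string ending in .csv or .xlsx."""
    if not isinstance(text_corpus, str):
        return False
    return text_corpus.lower().endswith((".csv", ".xlsx"))


def prepare_corpus_or_path(text_corpus, prepare_from_path):
    """Prepare the data if given a path, otherwise pass the corpus through."""
    if looks_like_data_path(text_corpus):
        return prepare_from_path(text_corpus)
    return text_corpus


# A fake preparation step so the example runs without any files on disk.
fake_prepare = lambda path: [["loaded", "from", path]]

from_path = prepare_corpus_or_path("reviews.csv", fake_prepare)
from_list = prepare_corpus_or_path([["already", "tokenized"]], fake_prepare)
print(from_path, from_list)
```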

kwx.utils.translate_output(outputs, input_language, output_language)[source]

Translates model outputs using https://github.com/ssut/py-googletrans.


outputs

Output keywords of a model.


input_language

The English name of the language in which the texts are found.


output_language

The English name of the desired language for outputs.


A list of keywords translated to the given output_language.

kwx.utils.organize_by_pos(outputs, output_language)[source]

Orders a keyword output by the part of speech of the words.


outputs

The keywords that have been extracted.


output_language

The spoken language in which the results should be given.


The given keywords ordered by their pos.


Prompts the user for words that should be ignored in keyword extraction.

words_to_ignore : str or list

Words that should not be included in the output.

ignore_words, words_added : list, bool

A new list of words to ignore and a boolean indicating if words have been added.
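
A minimal sketch of such a prompt, assuming a comma-separated reply. The function name `prompt_for_ignore_words`, the prompt wording, and the injectable `ask` argument are all hypothetical; `ask` defaults to the built-in `input` and is replaced by a scripted reply here so the example runs without a terminal.

```python
def prompt_for_ignore_words(words_to_ignore=None, ask=input):
    """Ask for extra words to ignore; return the new list and whether it grew."""
    # Copy the input so the caller's list is not modified in place.
    if words_to_ignore is None:
        ignore_words = []
    elif isinstance(words_to_ignore, str):
        ignore_words = [words_to_ignore]
    else:
        ignore_words = list(words_to_ignore)

    reply = ask("Words to ignore (comma-separated, blank for none): ")

    words_added = False
    for word in reply.split(","):
        word = word.strip()
        # Skip empty entries and words that are already being ignored.
        if word and word not in ignore_words:
            ignore_words.append(word)
            words_added = True
    return ignore_words, words_added


scripted = lambda prompt: "berlin, monday, berlin"
ignore_words, words_added = prompt_for_ignore_words(["monday"], ask=scripted)
print(ignore_words, words_added)
```

The boolean makes it easy for a caller to decide whether keyword extraction needs to be rerun with the extended list.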