utils

The utils module provides functions for data loading, cleaning, output formatting, and user interaction.

Functions

kwx.utils.load_data(data, target_cols=None)

Loads data from a DataFrame or a csv/xlsx path and formats it into a pandas DataFrame.

Parameters:
data : pd.DataFrame or csv/xlsx path

The data in df or path form.

target_cols : str or list (default=None)

The columns in the csv/xlsx or dataframe that contain the text data to be modeled.

Returns:
df_texts : pd.DataFrame

The texts as a df.

kwx.utils._combine_texts_to_str(text_corpus, ignore_words=None)

Combines texts into one string.

Parameters:
text_corpus : str or list

The texts to be combined.

ignore_words : str or list

Strings that should be removed from the text body.

Returns:
texts_str : str

A string of the full text with unwanted words removed.
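The behaviour described above can be sketched in plain Python. This is not the kwx implementation: the function name `combine_texts_sketch` is hypothetical, and splitting on whitespace and matching ignored words exactly are assumptions made for illustration.

```python
def combine_texts_sketch(text_corpus, ignore_words=None):
    """Hypothetical stand-in: join texts into one string, dropping ignored words."""
    # Accept a single string as well as a list of texts
    if isinstance(text_corpus, str):
        text_corpus = [text_corpus]

    # Accept None, a single string, or a list for the ignored words
    if ignore_words is None:
        ignore_words = []
    elif isinstance(ignore_words, str):
        ignore_words = [ignore_words]
    ignored = set(ignore_words)

    kept = []
    for text in text_corpus:
        # A text may be a raw string or an already tokenized list (assumption)
        tokens = text.split() if isinstance(text, str) else list(text)
        kept.extend(t for t in tokens if t not in ignored)

    return " ".join(kept)


combined = combine_texts_sketch(["the cat sat", "the dog ran"], ignore_words="the")
print(combined)
```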

kwx.utils._remove_unwanted(args)

Lowercases tokens and removes numbers and possibly names.

Parameters:
args : list of tuples

The following arguments zipped.

text : list

The text to clean.

words_to_ignore : str or list

Strings that should be removed from the text body.

stop_words : str or list

Stopwords for the given language.

Returns:
text_words_removed : list

The text without unwanted tokens.
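A rough sketch of the zipped-argument pattern follows. It is an illustration, not the kwx source: `remove_unwanted_sketch` and `_as_set` are hypothetical names, numbers are detected with `str.isdigit` as an assumption, and name removal is left out because the documentation does not say how it is done.

```python
def _as_set(words):
    """Normalize None, a single string, or a list of strings to a set."""
    if words is None:
        return set()
    if isinstance(words, str):
        return {words}
    return set(words)


def remove_unwanted_sketch(args):
    """Hypothetical stand-in that cleans one tokenized text."""
    # The three arguments arrive packed in a single tuple
    text, words_to_ignore, stop_words = args
    unwanted = _as_set(words_to_ignore) | _as_set(stop_words)

    cleaned = []
    for token in text:
        token = token.lower()
        if token.isdigit():  # drop purely numeric tokens
            continue
        if token in unwanted:  # drop ignored words and stopwords
            continue
        cleaned.append(token)
    return cleaned


texts = [["The", "Cat", "sat", "42"], ["A", "Dog", "Ran", "7"]]
# Zip each text with the same ignore list and stopword list
zipped_args = zip(texts, [["cat"]] * len(texts), [["the", "a"]] * len(texts))
cleaned_texts = [remove_unwanted_sketch(a) for a in zipped_args]
print(cleaned_texts)
```

Packing the arguments into one tuple lets the function be applied with a plain `map` over the zipped inputs, which may be why the signature takes a single `args` value.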

kwx.utils._lemmatize(tokens, nlp=None, verbose=True)

Lemmatizes tokens.

Parameters:
tokens : list or list of lists

Tokens to be lemmatized.

nlp : spacy.load object

A spacy language model.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

Returns:
base_tokens : list or list of lists

Tokens that have been lemmatized for nlp analysis.

kwx.utils.clean(texts, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)

Cleans and tokenizes a text body to prepare it for analysis.

Parameters:
texts : str or list

The texts to be cleaned and tokenized.

input_language : str (default=None)

The English name of the language in which the texts are found.

min_token_freq : int (default=2)

The minimum allowable frequency of a word inside the text corpus.

min_token_len : int (default=3)

The smallest allowable length of a word.

min_tokens : int (default=0)

The minimum allowable length of a tokenized text.

max_token_index : int (default=-1)

The maximum allowable length of a tokenized text.

min_ngram_count : int (default=3)

The minimum occurrences for an n-gram to be included.

remove_stopwords : bool (default=True)

Whether to remove stopwords.

ignore_words : str or list

Strings that should be removed from the text body.

sample_size : float (default=1)

The amount of data to be randomly sampled.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

Returns:
text_corpus : list or list of lists

The texts formatted for analysis.
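To make the effect of min_token_freq and min_token_len concrete, here is a minimal sketch of that kind of filtering. It is not how kwx implements clean: `filter_tokens_sketch` is a hypothetical helper, and counting frequencies over the whole corpus is an assumption based on the parameter description.

```python
from collections import Counter


def filter_tokens_sketch(tokenized_texts, min_token_freq=2, min_token_len=3):
    """Hypothetical helper: keep tokens that are frequent and long enough."""
    # Count every token across the whole corpus (assumption)
    counts = Counter(token for text in tokenized_texts for token in text)

    return [
        [
            token
            for token in text
            if counts[token] >= min_token_freq and len(token) >= min_token_len
        ]
        for text in tokenized_texts
    ]


tokenized = [
    ["data", "is", "big", "data"],
    ["big", "model"],
    ["data", "science"],
]
filtered = filter_tokens_sketch(tokenized)
print(filtered)
```

With the defaults, "is" is dropped for being too short and too rare, while "model" and "science" are dropped because they occur only once in the corpus.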

kwx.utils.prepare_data(data=None, target_cols=None, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)

Prepares input data for analysis from a pandas.DataFrame or path.

Parameters:
data : pd.DataFrame or csv/xlsx path

The data in df or path form.

target_cols : str or list (default=None)

The columns in the csv/xlsx or dataframe that contain the text data to be modeled.

input_language : str (default=None)

The English name of the language in which the texts are found.

min_token_freq : int (default=2)

The minimum allowable frequency of a word inside the text corpus.

min_token_len : int (default=3)

The smallest allowable length of a word.

min_tokens : int (default=0)

The minimum allowable length of a tokenized text.

max_token_index : int (default=-1)

The maximum allowable length of a tokenized text.

min_ngram_count : int (default=3)

The minimum occurrences for an n-gram to be included.

remove_stopwords : bool (default=True)

Whether to remove stopwords.

ignore_words : str or list

Strings that should be removed from the text body.

sample_size : float (default=1)

The amount of data to be randomly sampled.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

Returns:
text_corpus, clean_texts, selected_idxs : list or list of lists, list, list

The texts formatted for text analysis both as tokens and strings, as well as the indexes for selected entries.
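The relationship between sample_size and the returned selected_idxs can be illustrated with a small sketch. It assumes sample_size is a fraction between 0 and 1, which the float type and default of 1 suggest but the documentation does not state; `sample_texts_sketch` and its `seed` parameter are hypothetical and not part of kwx.

```python
import random


def sample_texts_sketch(texts, sample_size=1.0, seed=0):
    """Hypothetical helper: sample a fraction of texts and report their indexes."""
    if not texts:
        return [], []

    # Interpret sample_size as a fraction of the corpus (assumption)
    n_keep = max(1, round(len(texts) * sample_size))

    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    selected_idxs = sorted(rng.sample(range(len(texts)), n_keep))

    sampled = [texts[i] for i in selected_idxs]
    return sampled, selected_idxs


all_texts = ["first text", "second text", "third text", "fourth text"]
sampled_texts, selected_idxs = sample_texts_sketch(all_texts, sample_size=0.5)
print(selected_idxs, sampled_texts)
```

Returning the indexes alongside the texts makes it possible to map each prepared text back to its row in the original data.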

kwx.utils._prepare_corpus_path(text_corpus=None, clean_texts=None, target_cols=None, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)

Checks a text corpus to see if it’s a path, and prepares the data if so.

Parameters:
text_corpus : str or list or list of lists

A path or text corpus over which analysis should be done.

clean_texts : str

The texts formatted for analysis as strings.

target_cols : str or list (default=None)

The columns in the csv/xlsx or dataframe that contain the text data to be modeled.

input_language : str (default=None)

The English name of the language in which the texts are found.

min_token_freq : int (default=2)

The minimum allowable frequency of a word inside the text corpus.

min_token_len : int (default=3)

The smallest allowable length of a word.

min_tokens : int (default=0)

The minimum allowable length of a tokenized text.

max_token_index : int (default=-1)

The maximum allowable length of a tokenized text.

min_ngram_count : int (default=3)

The minimum occurrences for an n-gram to be included.

remove_stopwords : bool (default=True)

Whether to remove stopwords.

ignore_words : str or list

Strings that should be removed from the text body.

sample_size : float (default=1)

The amount of data to be randomly sampled.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

Returns:
text_corpus : list or list of lists

A prepared text corpus for the data in the given path.
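One plausible way to tell a path from an already prepared corpus is to check the type and file extension, as sketched below. The documentation does not describe the actual check, so `looks_like_data_path` is a hypothetical helper and the extension test is an assumption based on the csv/xlsx inputs mentioned above.

```python
from pathlib import Path

# Extensions taken from the csv/xlsx inputs named in this module's docs
DATA_SUFFIXES = {".csv", ".xlsx"}


def looks_like_data_path(text_corpus):
    """Hypothetical check: is the argument a string ending in .csv or .xlsx?"""
    if not isinstance(text_corpus, str):
        # Lists of tokens or texts are treated as a ready corpus
        return False
    return Path(text_corpus).suffix.lower() in DATA_SUFFIXES


path_result = looks_like_data_path("reviews.csv")
corpus_result = looks_like_data_path([["good", "product"], ["bad", "service"]])
print(path_result, corpus_result)
```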

kwx.utils.translate_output(outputs, input_language, output_language)

Translates model outputs using https://github.com/ssut/py-googletrans.

Parameters:
outputs : list

Output keywords of a model.

input_language : str

The English name of the language in which the texts are found.

output_language : str

The English name of the desired language for outputs.

Returns:
translated_outputs : list

A list of keywords translated to the given output_language.

kwx.utils.organize_by_pos(outputs, output_language)

Orders a keyword output by the part of speech of the words.

Parameters:
outputs : list

The keywords that have been extracted.

output_language : str

The spoken language in which the results should be given.

Returns:
ordered_outputs : list

The given keywords ordered by their part of speech.

kwx.utils.prompt_for_word_removal(words_to_ignore=None)

Prompts the user for words that should be ignored in keyword extraction.

Parameters:
words_to_ignore : str or list

Words that should not be included in the output.

Returns:
ignore_words, words_added : list, bool

A new list of words to ignore and a boolean indicating if words have been added.
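The return contract (an updated list plus a flag) can be sketched as follows. This is an illustration only: `prompt_for_words_sketch`, the prompt wording, the comma-separated reply format, and the injectable `ask` parameter are all assumptions that are not taken from kwx.

```python
def prompt_for_words_sketch(words_to_ignore=None, ask=input):
    """Hypothetical stand-in: ask for extra words and report whether any were added."""
    # Normalize the existing words to a list
    if words_to_ignore is None:
        current = []
    elif isinstance(words_to_ignore, str):
        current = [words_to_ignore]
    else:
        current = list(words_to_ignore)

    # Comma-separated input is an assumption made for this sketch
    reply = ask("Words to ignore (comma separated, blank for none): ")
    new_words = [w.strip() for w in reply.split(",") if w.strip()]

    updated = list(current)
    for word in new_words:
        if word not in updated:  # skip duplicates
            updated.append(word)

    words_added = len(updated) > len(current)
    return updated, words_added


# A canned reply replaces the interactive prompt so the example runs unattended
ignore_words, words_added = prompt_for_words_sketch(
    ["cat"], ask=lambda prompt: "dog, bird"
)
print(ignore_words, words_added)
```

The boolean lets a caller decide whether extraction needs to be rerun with the extended ignore list.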