utils

The utils module provides functions for data loading, cleaning, output formatting, and user interaction.

Functions

kwx.utils.load_data()
kwx.utils._combine_texts_to_str()
kwx.utils._remove_unwanted()
kwx.utils._lemmatize()
kwx.utils.clean()
kwx.utils.prepare_data()
kwx.utils._prepare_corpus_path()
kwx.utils.translate_output()
kwx.utils.organize_by_pos()
kwx.utils.prompt_for_word_removal()

kwx.utils.load_data(data, target_cols=None)

Loads data from a DataFrame or a csv/xlsx path and formats it into a pandas DataFrame.

Parameters:

data (pd.DataFrame or csv/xlsx path): The data in df or path form.
target_cols (str or list, default=None): The columns in the csv/xlsx or dataframe that contain the text data to be modeled.

Returns:

df_texts (pd.DataFrame): The texts as a df.

kwx.utils._combine_texts_to_str(text_corpus, ignore_words=None)

Combines texts into one string.

Parameters:

text_corpus (str or list): The texts to be combined.
ignore_words (str or list): Strings that should be removed from the text body.

Returns:

texts_str (str): A string of the full text with unwanted words removed.
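As an illustration of the behavior described above, here is a minimal sketch. It is not the kwx implementation: `combine_texts` is a hypothetical stand-in, and whitespace tokenization with case-insensitive whole-word matching is an assumption.

```python
def combine_texts(text_corpus, ignore_words=None):
    """Join texts into one string, dropping whole-word matches of ignore_words."""
    # Accept a single string or a list for both arguments.
    if isinstance(text_corpus, str):
        text_corpus = [text_corpus]
    if ignore_words is None:
        ignore_words = []
    elif isinstance(ignore_words, str):
        ignore_words = [ignore_words]

    ignored = {word.lower() for word in ignore_words}
    kept = [
        token
        for text in text_corpus
        for token in text.split()  # assumption: whitespace tokenization
        if token.lower() not in ignored
    ]
    return " ".join(kept)


texts_str = combine_texts(["the cat sat", "on the mat"], ignore_words="the")
print(texts_str)  # cat sat on mat
```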

kwx.utils._remove_unwanted(args)

Lowercases tokens and removes numbers and possibly names.

Parameters:

args (list of tuples): The following arguments zipped.
text (list): The text to clean.
words_to_ignore (str or list): Strings that should be removed from the text body.
stop_words (str or list): Stopwords for the given language.

Returns:

text_words_removed (list): The text without unwanted tokens.
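The zipped-argument pattern can be sketched as follows. This is an assumption-laden illustration rather than the library's code: `remove_unwanted` is a hypothetical function, only purely numeric tokens count as numbers, and name removal is not shown.

```python
def remove_unwanted(args):
    """Lowercase tokens, then drop numbers, ignored words, and stopwords."""
    text, words_to_ignore, stop_words = args  # one packed tuple per text
    unwanted = {w.lower() for w in words_to_ignore} | {w.lower() for w in stop_words}
    lowered = (token.lower() for token in text)
    return [t for t in lowered if not t.isdigit() and t not in unwanted]


texts = [["The", "Cat", "42"], ["A", "Dog", "Barked"]]
words_to_ignore = ["dog"]
stop_words = ["the", "a"]

# Repeat the shared arguments so that each text gets its own tuple.
packed = zip(texts, [words_to_ignore] * len(texts), [stop_words] * len(texts))
cleaned = [remove_unwanted(args) for args in packed]
print(cleaned)  # [['cat'], ['barked']]
```

Packing everything into one argument keeps the function compatible with single-argument mapping helpers such as the built-in `map`.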

kwx.utils._lemmatize(tokens, nlp=None, verbose=True)

Lemmatizes tokens.

Parameters:

tokens (list or list of lists): Tokens to be lemmatized.
nlp (spacy.load object): A spaCy language model.
verbose (bool, default=True): Whether to show a tqdm progress bar for the query.

Returns:

base_tokens (list or list of lists): Tokens that have been lemmatized for nlp analysis.

kwx.utils.clean(texts, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)

Cleans and tokenizes a text body to prepare it for analysis.

Parameters:

texts (str or list): The texts to be cleaned and tokenized.
input_language (str, default=None): The English name of the language in which the texts are found.
min_token_freq (int, default=2): The minimum allowable frequency of a word inside the text corpus.
min_token_len (int, default=3): The smallest allowable length of a word.
min_tokens (int, default=0): The minimum allowable length of a tokenized text.
max_token_index (int, default=-1): The maximum allowable length of a tokenized text.
min_ngram_count (int, default=3): The minimum occurrences for an n-gram to be included.
remove_stopwords (bool, default=True): Whether to remove stopwords.
ignore_words (str or list): Strings that should be removed from the text body.
sample_size (float, default=1): The amount of data to be randomly sampled.
verbose (bool, default=True): Whether to show a tqdm progress bar for the query.

Returns:

text_corpus (list or list of lists): The texts formatted for analysis.
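To make the effect of `min_token_freq` and `min_token_len` concrete, here is a small sketch of frequency and length filtering. `filter_tokens` is a hypothetical helper, not part of kwx, and treating both thresholds as inclusive lower bounds is an assumption.

```python
from collections import Counter


def filter_tokens(tokenized_texts, min_token_freq=2, min_token_len=3):
    """Keep tokens that are long enough and frequent enough across the corpus."""
    # Count every token over all texts, not per text.
    counts = Counter(token for text in tokenized_texts for token in text)
    return [
        [
            token
            for token in text
            if len(token) >= min_token_len and counts[token] >= min_token_freq
        ]
        for text in tokenized_texts
    ]


corpus = [["solar", "panel", "is", "cheap"], ["solar", "panel", "roof"]]
print(filter_tokens(corpus))  # [['solar', 'panel'], ['solar', 'panel']]
```

With the defaults, "is" is dropped for being too short, while "cheap" and "roof" are dropped because each appears only once.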

kwx.utils.prepare_data(data=None, target_cols=None, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)

Prepares input data for analysis from a pandas.DataFrame or path.

Parameters:

data (pd.DataFrame or csv/xlsx path): The data in df or path form.
target_cols (str or list, default=None): The columns in the csv/xlsx or dataframe that contain the text data to be modeled.
input_language (str, default=None): The English name of the language in which the texts are found.
min_token_freq (int, default=2): The minimum allowable frequency of a word inside the text corpus.
min_token_len (int, default=3): The smallest allowable length of a word.
min_tokens (int, default=0): The minimum allowable length of a tokenized text.
max_token_index (int, default=-1): The maximum allowable length of a tokenized text.
min_ngram_count (int, default=3): The minimum occurrences for an n-gram to be included.
remove_stopwords (bool, default=True): Whether to remove stopwords.
ignore_words (str or list): Strings that should be removed from the text body.
sample_size (float, default=1): The amount of data to be randomly sampled.
verbose (bool, default=True): Whether to show a tqdm progress bar for the query.

Returns:

text_corpus, clean_texts, selected_idxs (list or list of lists, list, list): The texts formatted for text analysis both as tokens and strings, as well as the indexes for selected entries.
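The relationship between the three return values can be sketched as below. This is only an illustration: `prepare_sample` is hypothetical, interpreting `sample_size` as a fraction of the rows is an assumption, and lowercasing plus whitespace splitting stands in for the real cleaning step.

```python
import random


def prepare_sample(texts, sample_size=1, seed=0):
    """Sample texts and return tokens, strings, and the sampled indexes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Assumption: sample_size is the fraction of texts to keep.
    n_selected = min(len(texts), max(1, round(len(texts) * sample_size)))
    selected_idxs = sorted(rng.sample(range(len(texts)), n_selected))

    # Whitespace tokenization stands in for the real cleaning step.
    text_corpus = [texts[i].lower().split() for i in selected_idxs]
    clean_texts = [" ".join(tokens) for tokens in text_corpus]
    return text_corpus, clean_texts, selected_idxs


texts = ["Solar panels work", "Wind is cheap", "Coal is old", "Hydro power"]
text_corpus, clean_texts, selected_idxs = prepare_sample(texts, sample_size=0.5)

# selected_idxs maps each prepared text back to its row in the input.
for idx, clean_text in zip(selected_idxs, clean_texts):
    print(idx, clean_text)
```

Returning the indexes alongside the texts lets later steps join results back to the original rows after sampling.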

kwx.utils._prepare_corpus_path(text_corpus=None, clean_texts=None, target_cols=None, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)

Checks a text corpus to see if it’s a path, and prepares the data if so.

Parameters:

text_corpus (str or list or list of lists): A path or text corpus over which analysis should be done.
clean_texts (str): The texts formatted for analysis as strings.
target_cols (str or list, default=None): The columns in the csv/xlsx or dataframe that contain the text data to be modeled.
input_language (str, default=None): The English name of the language in which the texts are found.
min_token_freq (int, default=2): The minimum allowable frequency of a word inside the text corpus.
min_token_len (int, default=3): The smallest allowable length of a word.
min_tokens (int, default=0): The minimum allowable length of a tokenized text.
max_token_index (int, default=-1): The maximum allowable length of a tokenized text.
min_ngram_count (int, default=3): The minimum occurrences for an n-gram to be included.
remove_stopwords (bool, default=True): Whether to remove stopwords.
ignore_words (str or list): Strings that should be removed from the text body.
sample_size (float, default=1): The amount of data to be randomly sampled.
verbose (bool, default=True): Whether to show a tqdm progress bar for the query.

Returns:

text_corpus (list or list of lists): A prepared text corpus for the data in the given path.
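One way to tell a path from an already prepared corpus is sketched below. The helper `looks_like_data_path` is hypothetical, and checking only the file extension (without touching the file system) is an assumption made to keep the example self-contained; kwx may decide differently.

```python
import os


def looks_like_data_path(text_corpus):
    """Return True if text_corpus is a string naming a csv or xlsx file."""
    if not isinstance(text_corpus, str):
        return False  # lists are treated as an in-memory corpus
    extension = os.path.splitext(text_corpus)[1].lower()
    return extension in {".csv", ".xlsx"}


print(looks_like_data_path("data/reviews.csv"))       # True
print(looks_like_data_path([["solar", "panel"]]))     # False
print(looks_like_data_path("just a plain sentence"))  # False
```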

kwx.utils.translate_output(outputs, input_language, output_language)

Translates model outputs using https://github.com/ssut/py-googletrans.

Parameters:

outputs (list): Output keywords of a model.
input_language (str): The English name of the language in which the texts are found.
output_language (str): The English name of the desired language for outputs.

Returns:

translated_outputs (list): A list of keywords translated to the given output_language.

kwx.utils.organize_by_pos(outputs, output_language)

Orders a keyword output by the part of speech of the words.

Parameters:

outputs (list): The keywords that have been extracted.
output_language (str): The spoken language in which the results should be given.

Returns:

ordered_outputs (list): The given keywords ordered by their pos.
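Ordering by part of speech can be sketched with a stable sort. Everything here is illustrative: `order_by_pos`, the `TOY_POS` lookup, and the `POS_ORDER` ranking are hypothetical, and a real tagger for the chosen language would replace the dictionary.

```python
# Hypothetical tag lookup standing in for a real part-of-speech tagger.
TOY_POS = {"energy": "NOUN", "panel": "NOUN", "install": "VERB", "cheap": "ADJ"}
# Assumed display order; unknown words fall into "OTHER".
POS_ORDER = ["NOUN", "ADJ", "VERB", "OTHER"]


def order_by_pos(outputs, pos_lookup=TOY_POS):
    """Stable-sort keywords so that words sharing a part of speech are adjacent."""

    def rank(word):
        return POS_ORDER.index(pos_lookup.get(word, "OTHER"))

    # sorted() is stable, so the original ranking survives inside each group.
    return sorted(outputs, key=rank)


print(order_by_pos(["install", "cheap", "energy", "quickly", "panel"]))
# ['energy', 'panel', 'cheap', 'install', 'quickly']
```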

kwx.utils.prompt_for_word_removal(words_to_ignore=None)

Prompts the user for words that should be ignored in keyword extraction.

Parameters:

words_to_ignore (str or list): Words that should not be included in the output.

Returns:

ignore_words, words_added (list, bool): A new list of words to ignore and a boolean indicating whether words have been added.
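The return contract (an updated list plus a flag) can be sketched as follows. `ask_for_words_to_ignore` is a hypothetical function, and the comma-separated input format and the injectable `ask` callable are assumptions made so the example runs without a live prompt.

```python
def ask_for_words_to_ignore(words_to_ignore=None, ask=input):
    """Merge comma-separated user input into the ignore list."""
    if words_to_ignore is None:
        ignore_words = []
    elif isinstance(words_to_ignore, str):
        ignore_words = [words_to_ignore]
    else:
        ignore_words = list(words_to_ignore)  # avoid mutating the caller's list

    answer = ask("Words to ignore (comma-separated, blank for none): ")
    new_words = [w.strip() for w in answer.split(",") if w.strip()]
    new_words = [w for w in new_words if w not in ignore_words]

    ignore_words.extend(new_words)
    return ignore_words, bool(new_words)


# A lambda replaces the interactive prompt for demonstration purposes.
ignore_words, words_added = ask_for_words_to_ignore(
    ["the"], ask=lambda prompt: "cat, dog"
)
print(ignore_words, words_added)  # ['the', 'cat', 'dog'] True
```

The boolean lets a caller decide whether to rerun extraction: if no words were added, the previous keywords still stand.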