utils
The utils module provides the functions needed for data loading, cleaning, output formatting, and user interaction.
Functions
- kwx.utils.load_data(data, target_cols=None)
Loads data from a path and formats it into a pandas df.
- Parameters:
- data : pd.DataFrame or csv/xlsx path
The data in df or path form.
- target_cols : str or list (default=None)
The columns in the csv/xlsx or dataframe that contain the text data to be modeled.
- Returns:
- df_texts : pd.DataFrame
The texts as a df.
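The behavior described above can be approximated with a short sketch. This is not the kwx source; `load_data_sketch` is a hypothetical name, and only the csv branch is shown (an xlsx path would use pd.read_excel).

```python
import pandas as pd

def load_data_sketch(data, target_cols=None):
    # Illustrative approximation of load_data: accept a DataFrame
    # or a csv path, then subset to the text columns if given.
    if isinstance(data, str):
        df = pd.read_csv(data)  # an .xlsx path would use pd.read_excel
    else:
        df = data.copy()
    if isinstance(target_cols, str):
        target_cols = [target_cols]
    if target_cols is not None:
        df = df[target_cols]
    return df

df = pd.DataFrame({"text": ["a doc", "another doc"], "id": [1, 2]})
df_texts = load_data_sketch(df, target_cols="text")
```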
- kwx.utils._combine_texts_to_str(text_corpus, ignore_words=None)
Combines texts into one string.
- Parameters:
- text_corpus : str or list
The texts to be combined.
- ignore_words : str or list
Strings that should be removed from the text body.
- Returns:
- texts_str : str
A string of the full text with unwanted words removed.
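A plain-Python approximation of this helper (hypothetical name, not the kwx source): flatten the corpus, drop ignored words, and join the rest into one string.

```python
def combine_texts_to_str_sketch(text_corpus, ignore_words=None):
    # Normalize both arguments so a bare string behaves like a one-item list.
    if isinstance(text_corpus, str):
        text_corpus = [text_corpus]
    if ignore_words is None:
        ignore_words = []
    elif isinstance(ignore_words, str):
        ignore_words = [ignore_words]
    tokens = []
    for text in text_corpus:
        # Each element may be a raw string or an already-tokenized list.
        words = text.split() if isinstance(text, str) else text
        tokens += [w for w in words if w not in ignore_words]
    return " ".join(tokens)

print(combine_texts_to_str_sketch(["the cat sat", "the dog ran"], ignore_words="the"))
# → cat sat dog ran
```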
- kwx.utils._remove_unwanted(args)
Lowercases tokens and removes numbers and possibly names.
- Parameters:
- args : list of tuples
The following arguments zipped.
- text : list
The text to clean.
- words_to_ignore : str or list
Strings that should be removed from the text body.
- stop_words : str or list
Stopwords for the given language.
- Returns:
- text_words_removed : list
The text without unwanted tokens.
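The zipped-arguments shape can be illustrated with a sketch (hypothetical name, not the kwx source): each tuple carries the token list plus the two removal lists, and the function filters token by token.

```python
def remove_unwanted_sketch(args):
    # args is one zipped tuple: (text, words_to_ignore, stop_words)
    text, words_to_ignore, stop_words = args
    cleaned = []
    for token in text:
        token = token.lower()        # lowercase every token
        if token.isnumeric():        # drop pure numbers
            continue
        if token in words_to_ignore or token in stop_words:
            continue                 # drop ignored words and stopwords
        cleaned.append(token)
    return cleaned

result = remove_unwanted_sketch((["The", "Cat", "42", "sat"], ["sat"], ["the"]))
# → ["cat"]
```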
- kwx.utils._lemmatize(tokens, nlp=None, verbose=True)
Lemmatizes tokens.
- Parameters:
- tokens : list or list of lists
Tokens to be lemmatized.
- nlp : spacy.load object
A spaCy language model.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the query.
- Returns:
- base_tokens : list or list of lists
Tokens that have been lemmatized for NLP analysis.
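In kwx the spaCy model supplies the real lemmas; the sketch below (hypothetical name, toy lookup table instead of spaCy) only shows the tokens-in, base-forms-out shape, including the list-of-lists case.

```python
# Toy lemma table standing in for a spaCy language model.
TOY_LEMMAS = {"ran": "run", "cats": "cat", "better": "good"}

def lemmatize_sketch(tokens):
    # Recurse when given a list of token lists (one list per document).
    if tokens and isinstance(tokens[0], list):
        return [lemmatize_sketch(doc) for doc in tokens]
    # Unknown tokens pass through unchanged.
    return [TOY_LEMMAS.get(t, t) for t in tokens]

base_tokens = lemmatize_sketch([["the", "cats", "ran"], ["better", "dogs"]])
# → [["the", "cat", "run"], ["good", "dogs"]]
```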
- kwx.utils.clean(texts, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)
Cleans and tokenizes a text body to prepare it for analysis.
- Parameters:
- texts : str or list
The texts to be cleaned and tokenized.
- input_language : str (default=None)
The English name of the language in which the texts are found.
- min_token_freq : int (default=2)
The minimum allowable frequency of a word inside the text corpus.
- min_token_len : int (default=3)
The smallest allowable length of a word.
- min_tokens : int (default=0)
The minimum allowable length of a tokenized text.
- max_token_index : int (default=-1)
The maximum allowable length of a tokenized text.
- min_ngram_count : int (default=3)
The minimum occurrences for an n-gram to be included.
- remove_stopwords : bool (default=True)
Whether to remove stopwords.
- ignore_words : str or list
Strings that should be removed from the text body.
- sample_size : float (default=1)
The fraction of the data to be randomly sampled.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the query.
- Returns:
- text_corpus : list or list of lists
The texts formatted for analysis.
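The interplay of the length and frequency thresholds can be sketched with a stripped-down cleaner (hypothetical name; stemming, n-grams, stopwords, and sampling are omitted): tokenize, count corpus-wide frequencies, then keep only tokens that pass both filters.

```python
from collections import Counter

def clean_sketch(texts, min_token_freq=2, min_token_len=3, ignore_words=None):
    # Minimal sketch of clean's filtering: lowercase/tokenize, then drop
    # tokens that are too short, too rare, or explicitly ignored.
    ignore = set(ignore_words or [])
    tokenized = [[t.lower() for t in text.split()] for text in texts]
    freqs = Counter(t for doc in tokenized for t in doc)  # corpus-wide counts
    return [
        [t for t in doc
         if len(t) >= min_token_len and freqs[t] >= min_token_freq and t not in ignore]
        for doc in tokenized
    ]

text_corpus = clean_sketch(["big data big models", "big data wins"], min_token_freq=2)
# "models" and "wins" appear only once, so they are filtered out
```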
- kwx.utils.prepare_data(data=None, target_cols=None, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)
Prepares input data for analysis from a pandas.DataFrame or path.
- Parameters:
- data : pd.DataFrame or csv/xlsx path
The data in df or path form.
- target_cols : str or list (default=None)
The columns in the csv/xlsx or dataframe that contain the text data to be modeled.
- input_language : str (default=None)
The English name of the language in which the texts are found.
- min_token_freq : int (default=2)
The minimum allowable frequency of a word inside the text corpus.
- min_token_len : int (default=3)
The smallest allowable length of a word.
- min_tokens : int (default=0)
The minimum allowable length of a tokenized text.
- max_token_index : int (default=-1)
The maximum allowable length of a tokenized text.
- min_ngram_count : int (default=3)
The minimum occurrences for an n-gram to be included.
- remove_stopwords : bool (default=True)
Whether to remove stopwords.
- ignore_words : str or list
Strings that should be removed from the text body.
- sample_size : float (default=1)
The fraction of the data to be randomly sampled.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the query.
- Returns:
- text_corpus, clean_texts, selected_idxs : list or list of lists, list, list
The texts formatted for analysis as both tokens and strings, along with the indexes of the selected entries.
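The three-part return value can be illustrated with a toy sketch (hypothetical name; the sampling and cleaning here are stand-ins, not kwx's actual logic): sample row indexes, tokenize the selected texts, and return tokens, strings, and the surviving indexes together.

```python
import random

def prepare_data_sketch(texts, sample_size=1.0, seed=42):
    # Sample a fraction of the rows, tokenize them, and report
    # which input indexes made it into the prepared corpus.
    random.seed(seed)
    n = max(1, int(len(texts) * sample_size))
    selected_idxs = sorted(random.sample(range(len(texts)), n))
    text_corpus = [texts[i].lower().split() for i in selected_idxs]
    clean_texts = [" ".join(doc) for doc in text_corpus]
    return text_corpus, clean_texts, selected_idxs

corpus, strings, idxs = prepare_data_sketch(
    ["First doc", "Second doc", "Third"], sample_size=1.0
)
# with sample_size=1.0 every row is kept, so idxs covers all inputs
```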
- kwx.utils._prepare_corpus_path(text_corpus=None, clean_texts=None, target_cols=None, input_language=None, min_token_freq=2, min_token_len=3, min_tokens=0, max_token_index=-1, min_ngram_count=3, remove_stopwords=True, ignore_words=None, sample_size=1, verbose=True)
Checks a text corpus to see if it's a path, and prepares the data if so.
- Parameters:
- text_corpus : str or list or list of lists
A path or text corpus over which analysis should be done.
- clean_texts : str
The texts formatted for analysis as strings.
- target_cols : str or list (default=None)
The columns in the csv/xlsx or dataframe that contain the text data to be modeled.
- input_language : str (default=None)
The English name of the language in which the texts are found.
- min_token_freq : int (default=2)
The minimum allowable frequency of a word inside the text corpus.
- min_token_len : int (default=3)
The smallest allowable length of a word.
- min_tokens : int (default=0)
The minimum allowable length of a tokenized text.
- max_token_index : int (default=-1)
The maximum allowable length of a tokenized text.
- min_ngram_count : int (default=3)
The minimum occurrences for an n-gram to be included.
- remove_stopwords : bool (default=True)
Whether to remove stopwords.
- ignore_words : str or list
Strings that should be removed from the text body.
- sample_size : float (default=1)
The fraction of the data to be randomly sampled.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the query.
- Returns:
- text_corpus : list or list of lists
A prepared text corpus for the data in the given path.
- kwx.utils.translate_output(outputs, input_language, output_language)
Translates model outputs using https://github.com/ssut/py-googletrans.
- Parameters:
- outputs : list
Output keywords of a model.
- input_language : str
The English name of the language in which the texts are found.
- output_language : str
The English name of the desired language for the outputs.
- Returns:
- translated_outputs : list
A list of keywords translated into the given output_language.
- kwx.utils.organize_by_pos(outputs, output_language)
Orders a keyword output by the part of speech of the words.
- Parameters:
- outputs : list
The keywords that have been extracted.
- output_language : str
The spoken language in which the results should be given.
- Returns:
- ordered_outputs : list
The given keywords ordered by their part of speech.
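In kwx a spaCy model assigns the real part-of-speech tags; the sketch below (hypothetical name, toy tag table) only shows the grouping step.

```python
# Toy POS table standing in for spaCy's tagger.
TOY_POS = {"run": "VERB", "model": "NOUN", "data": "NOUN", "fast": "ADJ"}

def organize_by_pos_sketch(outputs):
    # Group keywords under their part-of-speech tag, preserving order.
    ordered = {}
    for word in outputs:
        ordered.setdefault(TOY_POS.get(word, "OTHER"), []).append(word)
    return ordered

grouped = organize_by_pos_sketch(["run", "data", "fast", "model"])
# → {"VERB": ["run"], "NOUN": ["data", "model"], "ADJ": ["fast"]}
```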
- kwx.utils.prompt_for_word_removal(words_to_ignore=None)
Prompts the user for words that should be ignored in keyword extraction.
- Parameters:
- words_to_ignore : str or list
Words that should not be included in the output.
- Returns:
- ignore_words, words_added : list, bool
A new list of words to ignore and a boolean indicating whether words have been added.
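The prompt-and-extend behavior can be sketched as follows (hypothetical name; the injectable `input_fn` parameter is an addition for testability, not part of kwx's signature, which reads from standard input).

```python
def prompt_for_word_removal_sketch(words_to_ignore=None, input_fn=input):
    # Ask the user for extra words to ignore and extend the existing list.
    words_to_ignore = list(words_to_ignore or [])
    response = input_fn("Words to ignore (comma separated, blank for none): ")
    added = [w.strip() for w in response.split(",") if w.strip()]
    words_added = bool(added)  # True only if the user supplied anything
    return words_to_ignore + added, words_added

# Simulated user input in place of an interactive prompt:
ignore_words, words_added = prompt_for_word_removal_sketch(
    ["data"], input_fn=lambda _: "model, corpus"
)
# → (["data", "model", "corpus"], True)
```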