visuals

The visuals module provides functions for keyword and topic visualization.

Functions

kwx.visuals.save_vis()
kwx.visuals.graph_topic_num_evals()
kwx.visuals.gen_word_cloud()
kwx.visuals.pyLDAvis_topics()
kwx.visuals.t_sne()

kwx.visuals.save_vis(vis, save_file, file_name)

Saves a visualization file in the local or given directory if directed.

Parameters:

vis : matplotlib.pyplot

The visualization to be saved.

save_file : bool or str (default=False)

Whether to save the figure as a PNG (bool), or the directory path in which to save it (str).

Note: directory paths can begin from the working directory.

file_name : str

The name for the file.

Returns:

The file saved in the local or given directory if directed.
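The bool-or-str behavior of save_file can be sketched as follows. This is a minimal illustration, not the kwx implementation: resolve_save_path is a hypothetical helper, and it assumes that True means "save to the working directory" and that a string is a directory path.

```python
import os


def resolve_save_path(save_file, file_name):
    """Return the path a PNG would be written to, or None if nothing is saved.

    Hypothetical helper for illustration only; it is not part of kwx.
    """
    if save_file is True:
        # True: save into the current working directory
        return os.path.join(os.getcwd(), f"{file_name}.png")
    if isinstance(save_file, str):
        # A string is treated as a directory path (relative or absolute)
        return os.path.join(save_file, f"{file_name}.png")
    # False or None: the figure is not saved
    return None


print(resolve_save_path(False, "word_cloud"))
print(resolve_save_path("figures", "word_cloud"))
```

A relative path such as "figures" is resolved against the working directory when the file is eventually written, which matches the note on directory paths above.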

kwx.visuals.graph_topic_num_evals(method=['lda', 'bert'], bert_st_model='xlm-r-bert-base-nli-stsb-mean-tokens', text_corpus=None, num_keywords=10, topic_nums_to_compare=None, metrics=True, fig_size=(20, 10), save_file=False, return_ideal_metrics=False, verbose=True, **kwargs)

Graphs metrics for the given models over the given number of topics.

Parameters:

method : str or list (default=["lda", "bert"])

The modelling method or methods to compare.

Options:

LDA: Latent Dirichlet Allocation

Text data is classified into a given number of categories.

These categories are then used to classify individual entries given the percent they fall into categories.

BERT: Bidirectional Encoder Representations from Transformers

Words are classified via Google's pretrained neural networks.

Word classifications are then used to derive topics.

bert_st_model : str (default=xlm-r-bert-base-nli-stsb-mean-tokens)

The BERT model to use.

text_corpus : list, list of lists, or str

The text corpus over which analysis should be done.

num_keywords : int (default=10)

The number of keywords that should be extracted.

topic_nums_to_compare : list (default=None)

The topic numbers over which to compare metrics.

Note: None selects all numbers from 1 to num_keywords.

sample_size : float (default=None: sampling for non-BERT techniques)

The size of a sample for BERT models.

metrics : str or bool (default=True: all metrics)

The metrics to include.

Options:

stability: model stability based on Jaccard similarity.
coherence: how much the words associated with model topics co-occur.

fig_size : tuple (default=(20,10))

The size of the figure.

save_file : bool or str (default=False)

Whether to save the figure as a png or a path in which to save it.

return_ideal_metrics : bool (default=False)

Whether to return the ideal number of topics for the best model based on metrics.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

**kwargs : keyword arguments

Keyword arguments corresponding to sentence_transformers.SentenceTransformer.encode or gensim.models.ldamulticore.LdaMulticore.

Returns:

ax : matplotlib axis: A graph of the given metrics for each of the given models based on each topic number.
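The stability metric is described as being based on Jaccard similarity. As a rough sketch of that idea, the function below compares the keyword sets produced by two runs of a model. The function name and the example keywords are assumptions made for illustration; how kwx aggregates such scores across runs and topic numbers is not shown here.

```python
def jaccard_similarity(keywords_a, keywords_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of keywords."""
    set_a, set_b = set(keywords_a), set(keywords_b)
    union = set_a | set_b
    if not union:
        # Two empty keyword sets are treated as identical
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical keywords from two runs of the same model
run_1 = ["energy", "solar", "wind", "grid"]
run_2 = ["energy", "solar", "coal", "grid"]

# Three shared keywords out of five distinct ones
print(jaccard_similarity(run_1, run_2))
```

A score near 1 means the two runs agree on almost all keywords, which is what a stable model would be expected to show.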

kwx.visuals.gen_word_cloud(text_corpus, ignore_words=None, height=500, save_file=False)

Generates a word cloud for a group of words.

Parameters:

text_corpus : list or list of lists: The text corpus that should be plotted.
ignore_words : str or list (default=None): Words that should be removed.
height : int (default=500): The height of the resulting figure. Note: the width will be the golden ratio times the height.
save_file : bool or str (default=False): Whether to save the figure as a png or a path in which to save it.

Returns:

plt.savefig or plt.show : pyplot methods: A word cloud based on the occurrences of words in a list without removed words.
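The inputs to a word cloud of this kind can be sketched with the standard library: word counts with the ignored words removed, and a figure width equal to the golden ratio times the height. The helper word_cloud_inputs is hypothetical and only mirrors the parameter descriptions above; it does not render anything and is not how kwx builds the figure.

```python
import math
from collections import Counter

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2


def word_cloud_inputs(text_corpus, ignore_words=None, height=500):
    """Return word counts and a (width, height) size for a word cloud.

    Hypothetical helper for illustration only; it is not part of kwx.
    """
    # Accept a single word or a list of words to ignore
    if isinstance(ignore_words, str):
        ignore_words = [ignore_words]
    ignored = set(ignore_words or [])

    # Flatten a list of lists into one list of tokens
    tokens = []
    for entry in text_corpus:
        tokens.extend(entry if isinstance(entry, list) else [entry])

    counts = Counter(token for token in tokens if token not in ignored)
    width = round(height * GOLDEN_RATIO)
    return counts, (width, height)


counts, size = word_cloud_inputs(
    [["solar", "wind"], ["solar", "the"]], ignore_words="the"
)
print(counts.most_common(1), size)
```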

kwx.visuals.pyLDAvis_topics(method='lda', text_corpus=None, num_topics=10, save_file=False, display_ipython=False, **kwargs)

Returns the outputs of an LDA model plotted using pyLDAvis.

Parameters:

method : str or list (default="lda")

The modelling method or methods to compare.

Option:

LDA: Latent Dirichlet Allocation

Text data is classified into a given number of categories.

These categories are then used to classify individual entries given the percent they fall into categories.

text_corpus : list, list of lists, or str

The text corpus over which analysis should be done.

num_topics : int (default=10)

The number of categories for LDA and BERT based approaches.

save_file : bool or str (default=False)

Whether to save the HTML file to the current working directory or a path in which to save it.

display_ipython : bool (default=False)

Whether IPython's display function should be used if in that working environment.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

**kwargs : keyword arguments

Keyword arguments corresponding to gensim.models.ldamulticore.LdaMulticore.

Returns:

pyLDAvis.save_html or pyLDAvis.show : pyLDAvis methods: A visualization of the topics and their main keywords via pyLDAvis.

kwx.visuals.t_sne(dimension='both', text_corpus=None, num_topics=10, remove_3d_outliers=False, fig_size=(20, 10), save_file=False, **kwargs)

Returns the outputs of an LDA model plotted using t-SNE (t-distributed Stochastic Neighbor Embedding).

Parameters:

dimension : str (default=both): The dimension that t-SNE should reduce the data to for visualization. Options: 2d, 3d, and both (a plot with two subplots).
text_corpus : list, list of lists: The tokenized and cleaned text corpus over which analysis should be done.
num_topics : int (default=10): The number of categories for LDA based approaches.
remove_3d_outliers : bool (default=False): Whether to remove outliers from a 3d plot.
fig_size : tuple (default=(20,10)): The size of the figure.
save_file : bool or str (default=False): Whether to save the figure as a png or a path in which to save it.
**kwargs : keyword arguments: Keyword arguments corresponding to gensim.models.ldamulticore.LdaMulticore or sklearn.manifold.TSNE.

Returns:

fig : matplotlib.pyplot.figure: A t-SNE lower dimensional representation of an LDA model's topics and their constituent members.

Notes

t-SNE reduces the dimensionality of a space such that similar points will be closer and dissimilar points farther.
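The note above can be made concrete with the Gaussian neighbor probabilities that t-SNE computes in the original space: nearby points receive a high probability of being neighbors, distant points a low one, and the embedding is then fitted to preserve those probabilities. The sketch below covers only that first step and assumes a fixed sigma for every point, whereas t-SNE tunes it per point from the perplexity setting. The function name and sample points are illustrative.

```python
import math


def conditional_similarities(points, sigma=1.0):
    """Return rows p[i][j] of Gaussian neighbor probabilities (p[i][i] = 0).

    Simplified sketch: a single fixed sigma is assumed for all points.
    """
    probs = []
    for i, p_i in enumerate(points):
        # Unnormalized Gaussian affinity from point i to every other point
        weights = [
            0.0
            if i == j
            else math.exp(-math.dist(p_i, p_j) ** 2 / (2 * sigma ** 2))
            for j, p_j in enumerate(points)
        ]
        total = sum(weights)
        # Normalize so that each row sums to 1
        probs.append([w / total for w in weights])
    return probs


# Two nearby points and one distant point
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0)]
similarities = conditional_similarities(points)

# Row 0: the nearby point (index 1) gets far more mass than the distant one
print(similarities[0])
```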