visuals

The visuals module provides functions for keyword and topic visualization.

Functions

kwx.visuals.save_vis(vis, save_file, file_name)

Saves a visualization file in the local or given directory if directed.

Parameters:
vis : matplotlib.pyplot

The visualization to be saved.

save_file : bool or str (default=False)

Whether to save the figure as a png or a path in which to save it.

Note: directory paths can begin from the working directory.

file_name : str

The name for the file.

Returns:
The file saved in the local or given directory if directed.
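
The `save_file` argument doubles as a flag and a path. The sketch below shows one way such an argument could be resolved into an output path. `resolve_save_path` is a hypothetical helper written for illustration and is not part of kwx; treating `True` as the working directory and appending `.png` are assumptions.

```python
import os


def resolve_save_path(save_file, file_name):
    """Hypothetical helper (not part of kwx): map save_file to an output path.

    Returns None when nothing should be saved.
    """
    if save_file is False or save_file is None:
        return None  # saving was not requested
    if save_file is True:
        directory = os.getcwd()  # assumption: True means the working directory
    else:
        directory = str(save_file)  # a str is treated as the target directory
    if not file_name.endswith(".png"):
        file_name += ".png"  # assumption: figures are written as png
    return os.path.join(directory, file_name)
```

With this sketch, `resolve_save_path("figures", "topics")` joins the directory and file name, while `resolve_save_path(False, "topics")` returns `None` so that the caller can show the figure instead of saving it.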

kwx.visuals.graph_topic_num_evals(method=['lda', 'bert'], bert_st_model='xlm-r-bert-base-nli-stsb-mean-tokens', text_corpus=None, num_keywords=10, topic_nums_to_compare=None, metrics=True, fig_size=(20, 10), save_file=False, return_ideal_metrics=False, verbose=True, **kwargs)

Graphs metrics for the given models over the given number of topics.

Parameters:
method : str (default=["lda", "bert"])

The modelling method.

Options:

LDA: Latent Dirichlet Allocation

  • Text data is classified into a given number of categories.

  • These categories are then used to classify individual entries given the percent they fall into categories.

BERT: Bidirectional Encoder Representations from Transformers

  • Words are classified via Google Neural Networks.

  • Word classifications are then used to derive topics.

bert_st_model : str (default=xlm-r-bert-base-nli-stsb-mean-tokens)

The BERT model to use.

text_corpus : list, list of lists, or str

The text corpus over which analysis should be done.

num_keywords : int (default=10)

The number of keywords that should be extracted.

topic_nums_to_compare : list (default=None)

The number of topics to compare metrics over.

Note: None selects all numbers from 1 to num_keywords.

sample_size : float (default=None: sampling for non-BERT techniques)

The size of a sample for BERT models.

metrics : str or bool (default=True: all metrics)

The metrics to include.

Options:
  • stability: model stability based on Jaccard similarity.

  • coherence: how much the words associated with model topics co-occur.

fig_size : tuple (default=(20,10))

The size of the figure.

save_file : bool or str (default=False)

Whether to save the figure as a png or a path in which to save it.

return_ideal_metrics : bool (default=False)

Whether to return the ideal number of topics for the best model based on metrics.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

**kwargs : keyword arguments

Keyword arguments corresponding to sentence_transformers.SentenceTransformer.encode or gensim.models.ldamulticore.LdaMulticore.

Returns:
ax : matplotlib axis

A graph of the given metrics for each of the given models based on each topic number.
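
Two details in the parameters above are easy to illustrate: the default range of topic numbers and the Jaccard similarity behind the stability metric. The following is a minimal sketch and not kwx's implementation. `default_topic_nums` and `jaccard_similarity` are hypothetical names, and how kwx aggregates similarities across topics and runs is not specified in this reference.

```python
def default_topic_nums(num_keywords, topic_nums_to_compare=None):
    """Mirror the documented default: None selects 1 to num_keywords."""
    if topic_nums_to_compare is None:
        return list(range(1, num_keywords + 1))
    return list(topic_nums_to_compare)


def jaccard_similarity(keywords_a, keywords_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(keywords_a), set(keywords_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(union)


# Keywords for the same topic from two hypothetical model runs.
run_1 = ["tax", "budget", "vote", "senate"]
run_2 = ["tax", "budget", "election", "senate"]
similarity = jaccard_similarity(run_1, run_2)  # 3 shared words out of 5 distinct
```

A value near 1 means that the two runs produced nearly the same keywords for a topic, which is the sense in which a model is called stable.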

kwx.visuals.gen_word_cloud(text_corpus, ignore_words=None, height=500, save_file=False)

Generates a word cloud for a group of words.

Parameters:
text_corpus : list or list of lists

The text_corpus that should be plotted.

ignore_words : str or list (default=None)

Words that should be removed.

height : int (default=500)

The height of the resulting figure.

Note: the width will be the golden ratio times the height.

save_file : bool or str (default=False)

Whether to save the figure as a png or a path in which to save it.

Returns:
plt.savefig or plt.show : pyplot methods

A word cloud based on the occurrences of words in a list without removed words.
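
The width rule and the effect of `ignore_words` can be sketched in plain Python. This is an illustration under stated assumptions and not the library's code: `cloud_dimensions` and `word_frequencies` are hypothetical helpers, the rounding of the width is a guess, and a flat list is assumed to hold individual words.

```python
from collections import Counter

GOLDEN_RATIO = (1 + 5 ** 0.5) / 2  # approximately 1.618


def cloud_dimensions(height=500):
    """Return (width, height) with width = golden ratio * height (rounded)."""
    return round(GOLDEN_RATIO * height), height


def word_frequencies(text_corpus, ignore_words=None):
    """Hypothetical helper: count word occurrences, skipping ignored words."""
    if ignore_words is None:
        ignore_words = []
    elif isinstance(ignore_words, str):
        ignore_words = [ignore_words]  # a single word may be passed as a str
    ignored = set(ignore_words)
    counts = Counter()
    for entry in text_corpus:
        # Accept both a flat list of words and a list of token lists.
        words = [entry] if isinstance(entry, str) else entry
        counts.update(word for word in words if word not in ignored)
    return counts
```

For the default height of 500 this gives a width of 809. The resulting counts are the kind of input a word cloud renderer scales word sizes by.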

kwx.visuals.pyLDAvis_topics(method='lda', text_corpus=None, num_topics=10, save_file=False, display_ipython=False, **kwargs)

Returns the outputs of an LDA model plotted using pyLDAvis.

Parameters:
method : str or list (default=lda)

The modelling method or methods to compare.

Option:

LDA: Latent Dirichlet Allocation

  • Text data is classified into a given number of categories.

  • These categories are then used to classify individual entries given the percent they fall into categories.

text_corpus : list, list of lists, or str

The text corpus over which analysis should be done.

num_topics : int (default=10)

The number of categories for LDA and BERT based approaches.

save_file : bool or str (default=False)

Whether to save the HTML file to the current working directory or a path in which to save it.

display_ipython : bool (default=False)

Whether IPython’s display function should be used if in that working environment.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

**kwargs : keyword arguments

Keyword arguments corresponding to gensim.models.ldamulticore.LdaMulticore.

Returns:
pyLDAvis.save_html or pyLDAvis.show : pyLDAvis methods

A visualization of the topics and their main keywords via pyLDAvis.
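
The LDA description above says that entries are classified by the percent they fall into each category. A minimal sketch of that step follows, assuming a per-document topic distribution given as `(topic_id, probability)` pairs, which is the shape gensim LDA models return from `get_document_topics`. The helper name `dominant_topic` and the sample values are hypothetical, and whether kwx assigns entries this way is an assumption.

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, share) pair with the largest share."""
    return max(topic_distribution, key=lambda pair: pair[1])


# Hypothetical distribution for one document over three topics.
doc_topics = [(0, 0.15), (1, 0.70), (2, 0.15)]
topic_id, share = dominant_topic(doc_topics)
```

Here the document would be assigned to topic 1, since 70 percent of its probability mass falls into that category.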

kwx.visuals.t_sne(dimension='both', text_corpus=None, num_topics=10, remove_3d_outliers=False, fig_size=(20, 10), save_file=False, **kwargs)

Returns the outputs of an LDA model plotted using t-SNE (t-distributed Stochastic Neighbor Embedding).

Parameters:
dimension : str (default=both)

The dimension that t-SNE should reduce the data to for visualization.

Options: 2d, 3d, and both (a plot with two subplots).

text_corpus : list, list of lists

The tokenized and cleaned text corpus over which analysis should be done.

num_topics : int (default=10)

The number of categories for LDA based approaches.

remove_3d_outliers : bool (default=False)

Whether to remove outliers from a 3d plot.

fig_size : tuple (default=(20,10))

The size of the figure.

save_file : bool or str (default=False)

Whether to save the figure as a png or a path in which to save it.

**kwargs : keyword arguments

Keyword arguments corresponding to gensim.models.ldamulticore.LdaMulticore or sklearn.manifold.TSNE.

Returns:
fig : matplotlib.pyplot.figure

A t-SNE lower dimensional representation of an LDA model’s topics and their constituent members.

Notes

t-SNE reduces the dimensionality of a space such that similar points will be closer and dissimilar points farther.
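
To make "similar points will be closer" concrete, the sketch below computes the Gaussian conditional affinities that t-SNE assigns between points in the original space. It covers only this first step and uses a fixed bandwidth `sigma`. The full algorithm, as in sklearn.manifold.TSNE, instead tunes a bandwidth per point from a perplexity setting and then optimizes the low-dimensional layout. `conditional_affinities` is a hypothetical name, and the sample points are made up.

```python
import math


def conditional_affinities(points, i, sigma=1.0):
    """Return p(j|i) for every j != i using a Gaussian kernel centered on point i."""
    weights = {}
    for j, point in enumerate(points):
        if j == i:
            continue  # a point has no affinity to itself
        sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], point))
        weights[j] = math.exp(-sq_dist / (2 * sigma ** 2))
    total = sum(weights.values())
    # Normalize so that the affinities of point i sum to 1.
    return {j: weight / total for j, weight in weights.items()}


points = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0)]
affinities = conditional_affinities(points, i=0)
```

Point 1 lies close to point 0 and receives almost all of its affinity, while the distant point 2 receives very little. t-SNE then looks for a low-dimensional layout whose pairwise affinities match these as closely as possible.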