visuals
The visuals module provides functions for keyword and topic visualization.
Functions
- kwx.visuals.save_vis(vis, save_file, file_name)
Saves a visualization file in the local or given directory if directed.
- Parameters:
- vis : matplotlib.pyplot
The visualization to be saved.
- save_file : bool or str (default=False)
Whether to save the figure as a png or a path in which to save it.
Note: directory paths can begin from the working directory.
- file_name : str
The name for the file.
- Returns:
- The file saved in the local or given directory if directed.
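The bool-or-str behaviour of save_file can be pictured with a small sketch. This is not the kwx implementation: resolve_save_path and its extension argument are hypothetical names, and the sketch assumes that True means "save in the current working directory" and that a string is a directory path.

```python
import os


def resolve_save_path(save_file, file_name, extension=".png"):
    """Hypothetical helper: return the output path, or None if nothing is saved."""
    if save_file is False or save_file is None:
        # Nothing should be written to disk.
        return None
    if save_file is True:
        # Assumption: True saves into the current working directory.
        return os.path.join(os.getcwd(), file_name + extension)
    # Assumption: a string is a directory path, which may be relative
    # to the working directory.
    return os.path.join(str(save_file), file_name + extension)


print(resolve_save_path("figures", "word_cloud"))
```

A caller would then pass the resulting path to a saving function only when it is not None.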
- kwx.visuals.graph_topic_num_evals(method=['lda', 'bert'], bert_st_model='xlm-r-bert-base-nli-stsb-mean-tokens', text_corpus=None, num_keywords=10, topic_nums_to_compare=None, metrics=True, fig_size=(20, 10), save_file=False, return_ideal_metrics=False, verbose=True, **kwargs)
Graphs metrics for the given models over the given number of topics.
- Parameters:
- method : str (default=["lda", "bert"])
The modelling method.
- Options:
LDA: Latent Dirichlet Allocation
Text data is classified into a given number of categories.
These categories are then used to classify individual entries given the percent they fall into categories.
BERT: Bidirectional Encoder Representations from Transformers
Words are classified via Google Neural Networks.
Word classifications are then used to derive topics.
- bert_st_model : str (default=xlm-r-bert-base-nli-stsb-mean-tokens)
The BERT model to use.
- text_corpus : list, list of lists, or str
The text corpus over which analysis should be done.
- num_keywords : int (default=10)
The number of keywords that should be extracted.
- topic_nums_to_compare : list (default=None)
The number of topics to compare metrics over.
Note: None selects all numbers from 1 to num_keywords.
- sample_size : float (default=None: sampling for non-BERT techniques)
The size of a sample for BERT models.
- metrics : str or bool (default=True: all metrics)
The metrics to include.
- Options:
stability: model stability based on Jaccard similarity.
coherence: how much the words associated with model topics co-occur.
- fig_size : tuple (default=(20,10))
The size of the figure.
- save_file : bool or str (default=False)
Whether to save the figure as a png or a path in which to save it.
- return_ideal_metrics : bool (default=False)
Whether to return the ideal number of topics for the best model based on metrics.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the query.
- **kwargs : keyword arguments
Keyword arguments corresponding to sentence_transformers.SentenceTransformer.encode or gensim.models.ldamulticore.LdaMulticore.
- Returns:
- ax : matplotlib axis
A graph of the given metrics for each of the given models based on each topic number.
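Two details from the parameter list can be illustrated in plain Python: the Jaccard similarity that the stability metric is based on, and the default range used when topic_nums_to_compare is None. This is a minimal sketch, not the kwx code; jaccard_similarity and default_topic_nums are hypothetical names, and how kwx aggregates similarities into a stability score is not shown here.

```python
def jaccard_similarity(keywords_a, keywords_b):
    """Size of the intersection divided by size of the union of two keyword sets."""
    set_a, set_b = set(keywords_a), set(keywords_b)
    if not set_a and not set_b:
        # Convention chosen for this sketch: two empty sets are identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def default_topic_nums(num_keywords=10, topic_nums_to_compare=None):
    """None selects all numbers from 1 to num_keywords, as documented above."""
    if topic_nums_to_compare is None:
        return list(range(1, num_keywords + 1))
    return list(topic_nums_to_compare)


run_1 = ["price", "delivery", "quality"]
run_2 = ["delivery", "quality", "support"]
print(jaccard_similarity(run_1, run_2))  # 2 shared words out of 4 distinct words
print(default_topic_nums(num_keywords=5))
```

A model whose keyword sets barely change between runs yields similarities close to 1, which is the sense in which the metric measures stability.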
- kwx.visuals.gen_word_cloud(text_corpus, ignore_words=None, height=500, save_file=False)
Generates a word cloud for a group of words.
- Parameters:
- text_corpus : list or list of lists
The text_corpus that should be plotted.
- ignore_words : str or list (default=None)
Words that should be removed.
- height : int (default=500)
The height of the resulting figure.
Note: the width will be the golden ratio times the height.
- save_file : bool or str (default=False)
Whether to save the figure as a png or a path in which to save it.
- Returns:
- plt.savefig or plt.show : pyplot methods
A word cloud based on the occurrences of words in a list without removed words.
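The inputs to a word cloud of this kind can be sketched without any plotting library: word occurrences with ignored words removed, and a width derived from the height via the golden ratio. This is an illustration under assumptions, not the kwx implementation; word_cloud_inputs is a hypothetical name, and rounding the width down to an integer is a choice made for this sketch.

```python
from collections import Counter

GOLDEN_RATIO = (1 + 5 ** 0.5) / 2  # approximately 1.618


def word_cloud_inputs(text_corpus, ignore_words=None, height=500):
    """Hypothetical helper: return (word counts, figure width in pixels)."""
    # ignore_words may be a single string or a list of strings.
    if ignore_words is None:
        ignore = set()
    elif isinstance(ignore_words, str):
        ignore = {ignore_words}
    else:
        ignore = set(ignore_words)

    # Flatten a list of lists into a single list of tokens.
    tokens = []
    for entry in text_corpus:
        if isinstance(entry, str):
            tokens.append(entry)
        else:
            tokens.extend(entry)

    counts = Counter(token for token in tokens if token not in ignore)
    width = int(height * GOLDEN_RATIO)  # assumption: truncate to whole pixels
    return counts, width


counts, width = word_cloud_inputs(
    [["fast", "delivery"], ["fast", "the"]], ignore_words="the"
)
print(counts.most_common(), width)
```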
- kwx.visuals.pyLDAvis_topics(method='lda', text_corpus=None, num_topics=10, save_file=False, display_ipython=False, **kwargs)
Returns the outputs of an LDA model plotted using pyLDAvis.
- Parameters:
- method : str or list (default=lda)
The modelling method or methods to compare.
- Option:
LDA: Latent Dirichlet Allocation
Text data is classified into a given number of categories.
These categories are then used to classify individual entries given the percent they fall into categories.
- text_corpus : list, list of lists, or str
The text corpus over which analysis should be done.
- num_topics : int (default=10)
The number of categories for LDA and BERT based approaches.
- save_file : bool or str (default=False)
Whether to save the HTML file to the current working directory or a path in which to save it.
- display_ipython : bool (default=False)
Whether IPython’s display function should be used if in that working environment.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the query.
- **kwargs : keyword arguments
Keyword arguments corresponding to gensim.models.ldamulticore.LdaMulticore.
- Returns:
- pyLDAvis.save_html or pyLDAvis.show : pyLDAvis methods
A visualization of the topics and their main keywords via pyLDAvis.
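The LDA description above says entries are classified by the percent they fall into each category. A minimal sketch of that last step, assuming each entry's topic distribution is available as a list of (topic_id, share) pairs; dominant_topic is a hypothetical name and not part of kwx.

```python
def dominant_topic(topic_percentages):
    """Return the topic id with the largest share for one entry.

    topic_percentages: list of (topic_id, share) pairs for a single entry.
    """
    # max() compares on the share, then we keep only the topic id.
    return max(topic_percentages, key=lambda pair: pair[1])[0]


entry_distribution = [(0, 0.2), (1, 0.7), (2, 0.1)]
print(dominant_topic(entry_distribution))
```

An entry that is 70% topic 1 is assigned to topic 1, while the remaining shares show how strongly it also relates to the other topics.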
- kwx.visuals.t_sne(dimension='both', text_corpus=None, num_topics=10, remove_3d_outliers=False, fig_size=(20, 10), save_file=False, **kwargs)
Returns the outputs of an LDA model plotted using t-SNE (t-distributed Stochastic Neighbor Embedding).
- Parameters:
- dimension : str (default=both)
The dimension that t-SNE should reduce the data to for visualization.
Options: 2d, 3d, and both (a plot with two subplots).
- text_corpus : list or list of lists
The tokenized and cleaned text corpus over which analysis should be done.
- num_topics : int (default=10)
The number of categories for LDA based approaches.
- remove_3d_outliers : bool (default=False)
Whether to remove outliers from a 3d plot.
- fig_size : tuple (default=(20,10))
The size of the figure.
- save_file : bool or str (default=False)
Whether to save the figure as a png or a path in which to save it.
- **kwargs : keyword arguments
Keyword arguments corresponding to gensim.models.ldamulticore.LdaMulticore or sklearn.manifold.TSNE.
- Returns:
- fig : matplotlib.pyplot.figure
A t-SNE lower dimensional representation of an LDA model’s topics and their constituent members.
Notes
t-SNE reduces the dimensionality of a space such that similar points will be closer and dissimilar points farther.
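The dimension option determines how many reductions are needed: one for 2d, one for 3d, and two for both. The sketch below only shows that mapping; target_dimensions is a hypothetical name, the case-insensitive matching is an assumption, and the actual reduction (for example with sklearn.manifold.TSNE) is left out.

```python
def target_dimensions(dimension="both"):
    """Map the documented `dimension` option to the output sizes to compute."""
    options = {"2d": [2], "3d": [3], "both": [2, 3]}
    key = str(dimension).lower()  # assumption: accept any letter case
    if key not in options:
        raise ValueError("dimension must be one of: 2d, 3d, both")
    return options[key]


# "both" needs a 2-component and a 3-component embedding, one per subplot.
print(target_dimensions("both"))
```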