

Varun Shourie and Hailey
Data Science, AI Acceleration, Arizona State University
Businesses today generate vast amounts of text from sources like CRM systems, internal knowledge bases, communication platforms (e.g., Slack, Teams), and published research, and stakeholders seek high-level insights from this unstructured text.
This whitepaper contrasts traditional and modern topic modeling methodologies, introduces a Python toolkit developed by ASU’s AI Acceleration team, and explores future opportunities in topic modeling.
1. Preprocessing
Tokenization, lowercasing, stopword removal, stemming or lemmatization, and removing non-alphanumeric characters.
2. Transformation
Converting text into numerical formats such as a Document-Term Matrix (DTM), TF-IDF, or vector embeddings.
3. Algorithm Selection
Choosing a suitable modeling algorithm (refer to Table 1).
4. Model Training
Setting input parameters to cluster data into topics.
5. Evaluation
Using metrics like coherence scores and manual evaluation with unseen test data.
6. Topic Interpretation
Assigning human-readable labels to topic clusters.
Our toolkit automates steps 1-4, partially automates steps 5-6, and streamlines the overall process, requiring only user validation for the final steps.
Table 1. Traditional topic modeling methods vs. our approach

Algorithm Selection
Traditional: LDA, GSDMM, LSI, NMF, Doc2Vec, Word2Vec
Our approach: Hybrid algorithms using pre-trained SLMs/LLMs (e.g., BERT), dimensionality reduction (UMAP, PCA), and clustering (KMeans from Lloyd and MacQueen, HDBSCAN)

Packages
Traditional: Scikit-Learn, Gensim
Our approach: BERTopic

Preprocessing
Traditional: Extensive (standardizing text, stemming/lemmatizing, phrase transformations)
Our approach: Minimal (removing only semantically insignificant content to denoise text)

Transformation
Traditional: Sparse vectors based on token counts or learned embeddings
Our approach: Fixed-dimension embeddings from Huggingface models, with optional truncation

Dimensionality Reduction (Training)
Traditional: Often not applied
Our approach: Applied by default to reduce embeddings to smaller dimensions

Clustering (Training)
Traditional: User-selected number of topics, requiring multiple iterations
Our approach: Automatic topic detection with algorithms like HDBSCAN

Hyperparameter Tuning (Training)
Traditional: Extensive tuning needed
Our approach: Optional, with default BERTopic parameters typically sufficient

Representation (Interpretation)
Traditional: TF-IDF for token weighting and labeling
Our approach: TF-IDF plus LLM-assisted labeling, categorization, and subcategorization

Evaluation
Traditional: Metrics like coherence and perplexity; generic clustering scores
Our approach: Emphasis on manual evaluation for interpretability using LLMs
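The hybrid embed-reduce-cluster pattern described in the table can be sketched as follows. Random vectors stand in for Huggingface embeddings, and PCA and KMeans (both named above) substitute for the heavier UMAP and HDBSCAN defaults; this is a structural illustration, not the toolkit's code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Random vectors stand in for transformer embeddings; in practice these
# would come from a Huggingface model (e.g. 384-dim sentence embeddings).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))

# Dimensionality reduction: shrink 384 dims to 5 before clustering.
reduced = PCA(n_components=5, random_state=0).fit_transform(embeddings)

# Clustering the reduced embeddings into candidate topic groups.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)
print(len(set(labels)))
```

With HDBSCAN in place of KMeans, the number of topics would be detected automatically rather than fixed at four.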
Value Proposition
Our toolkit automates the repetitive aspects of topic modeling and reduces the need for extensive iterations and heuristic decisions. It simplifies model selection, evaluation, and interpretation, enabling developers to focus on deriving insights.
Features
Automated Training
Quickly build and iterate on topic models using BERTopic.
Integration with CreateAI Platform
Utilize ASU’s PII-compliant foundation LLMs for in-depth topic labeling and categorization.
Outputs
Spreadsheets listing topics, metadata, labels, and categories.
Inferred training data for visualization and driver discovery.
Summary visualizations showing topic distribution and conceptual maps.
Intermediate step outputs for debugging.
Additional Utilities
Environment Setup
Bash scripts to create virtual or Miniconda environments tailored for topic modeling.
Validation
Python scripts to apply toolkit steps to unseen test data.
Model Retraining
Bash scripts to efficiently retrain models as needed.
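The validation utilities' core idea, applying trained components to unseen test data without refitting, can be sketched in a few lines. This is a hypothetical scikit-learn stand-in, not the toolkit's actual scripts.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "tuition payment due date",
    "financial aid award letter",
    "course registration hold",
    "enrollment and class schedule",
]
test_docs = ["question about tuition and aid", "schedule a class"]

# Fit the vectorizer and clusterer once, on training data only.
vectorizer = TfidfVectorizer().fit(train_docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    vectorizer.transform(train_docs)
)

# Validation on unseen data reuses transform/predict only -- no refitting.
test_topics = kmeans.predict(vectorizer.transform(test_docs))
print(list(test_topics))
```

Keeping inference strictly to transform/predict calls is what makes topic labels comparable between training and validation runs.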
Limitations
1. Data Quality
Poor or noisy text data leads to ineffective topics.
2. Computational Constraints
Large datasets may exceed memory or disk capacities.
3. Document Length
Optimized for shorter texts; longer documents may produce overly generalized topics.
4. Domain Knowledge
Lack of expertise can result in irrelevant topic splits and inaccurate token associations.
With the decreasing costs of LLMs like OpenAI's GPT-4o mini and Google's Gemini 1.5 Flash, real-time text analytics becomes more feasible. For example, batching Salesforce cases from our contact centers and summarizing them into thematic reports demonstrated the effectiveness of cost-efficient LLMs in generating insightful topic reports. As LLM pricing continues to drop, LLMs will increasingly complement topic models, especially in generating strategic and granular reports, provided they offer a sufficient token context window and data privacy compliance.
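The batching step described above can be sketched as follows. Both `count_tokens` and `batch_by_budget` are hypothetical helpers, a simple word count stands in for a real tokenizer, and the LLM summarization call itself is elided.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate; a real tokenizer (e.g. tiktoken) is better."""
    return len(text.split())


def batch_by_budget(docs: list[str], budget: int) -> list[list[str]]:
    """Greedily pack documents into batches under a token budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for doc in docs:
        cost = count_tokens(doc)
        if current and used + cost > budget:
            batches.append(current)  # budget exceeded: start a new batch
            current, used = [], 0
        current.append(doc)
        used += cost
    if current:
        batches.append(current)
    return batches


cases = ["billing issue " * 15, "advising request " * 20, "parking appeal " * 15]
batches = batch_by_budget(cases, budget=50)
# Each batch would then be sent to an LLM for thematic summarization.
print(len(batches))  # → 3
```

The budget would in practice be set well below the model's context window to leave room for the summarization prompt and the response.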
Despite advancements in LLMs, our toolkit remains valuable for specific scenarios:
Scale and Cost
Topic models efficiently handle hundreds of thousands of documents without incurring high token costs.
Stable Topic Domains
Ideal for environments where topics remain consistent over time, allowing for quick and reliable topic labeling.
Temporal Analytics
Superior in tracking how topic drivers evolve, offering consistent labeling as long as data stays within the model’s distribution.
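A toy sketch of this kind of temporal tracking, tallying topic assignments (produced by a fixed, already-trained model) per month; the months and topic labels below are made up for illustration.

```python
from collections import Counter, defaultdict

# (month, topic) pairs as assigned by a fixed trained topic model.
assignments = [
    ("2024-01", "billing"), ("2024-01", "billing"), ("2024-01", "advising"),
    ("2024-02", "advising"), ("2024-02", "advising"), ("2024-02", "billing"),
]

# Tally topic frequencies per month.
by_month: defaultdict[str, Counter] = defaultdict(Counter)
for month, topic in assignments:
    by_month[month][topic] += 1

# Top driver per month; labels stay comparable because the model is fixed.
drivers = {m: counts.most_common(1)[0][0] for m, counts in by_month.items()}
print(drivers)  # → {'2024-01': 'billing', '2024-02': 'advising'}
```

Because the model (and therefore the label set) does not change between periods, month-over-month shifts reflect the data, not relabeling.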
Our toolkit continues to serve niches where LLMs might be excessive, ensuring cost-effective and actionable text analytics.
Contact program manager Paul Alvarado (palvara2@asu.edu).
Figure 1: Universal Topic Modeling Toolkit Training Process