
Universal Topic Modeling Toolkit

Data Science, AI Acceleration, Arizona State University

Stevens-Macfarlane

Overview

Businesses today generate vast amounts of text from sources like CRM systems, internal knowledge bases, communication platforms (e.g., Slack, Teams), and published research. Stakeholders seek high-level insights from this text.

Purpose

This whitepaper contrasts traditional and modern topic modeling methodologies, introduces a Python toolkit developed by ASU’s AI Acceleration team, and explores future opportunities in topic modeling.

Typical Topic Modeling Process

1. Preprocessing

Tokenization, lowercasing, stopword removal, stemming or lemmatization, and removing non-alphanumeric characters.

2. Transformation

Converting text into numerical formats such as a Document-Term Matrix (DTM), TF-IDF, or vector embeddings.

3. Algorithm Selection

Choosing a suitable modeling algorithm (refer to Table 1).

4. Model Training

Setting input parameters to cluster data into topics.

5. Evaluation

Using metrics like coherence scores and manual evaluation with unseen test data.

6. Topic Interpretation

Assigning human-readable labels to topic clusters.
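For illustration, a minimal sketch of steps 1 through 4 with BERTopic might look like the following. The example documents and the light cleaning function are hypothetical stand-ins, not part of the toolkit itself.

```python
import re

from bertopic import BERTopic

# Hypothetical corpus; in practice this would be thousands of CRM notes,
# knowledge-base articles, chat messages, and similar short texts.
docs = [
    "Password reset request for the student portal",
    "Cannot log in to the learning management system",
    "Question about tuition payment deadlines",
    # ... many more documents are needed for a meaningful fit ...
]

def light_clean(text: str) -> str:
    """Step 1: minimal denoising (drop URLs and collapse whitespace)."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = [light_clean(d) for d in docs]

# Steps 2-4: embedding, dimensionality reduction, clustering, and topic
# extraction are handled by BERTopic's defaults in a single call.
topic_model = BERTopic()
topics, probabilities = topic_model.fit_transform(cleaned)

# Starting point for steps 5-6: per-topic sizes and keywords for review.
print(topic_model.get_topic_info().head())
```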

Our toolkit automates steps 1-4, partially automates steps 5-6, and streamlines the overall process, requiring only user validation for the final steps.
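As one way to support step 5, a coherence score can be computed over the fitted topics with Gensim. This snippet is a sketch that assumes the `topic_model` and `cleaned` variables from the example above; it is not the toolkit's built-in evaluation routine.

```python
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Keywords per topic from the fitted model, excluding the -1 outlier topic.
topic_words = [
    [word for word, _ in topic_model.get_topic(topic_id)]
    for topic_id in topic_model.get_topic_info()["Topic"]
    if topic_id != -1
]

tokenized = [doc.lower().split() for doc in cleaned]
dictionary = Dictionary(tokenized)

# c_v coherence: higher values generally indicate more interpretable topics.
coherence = CoherenceModel(
    topics=topic_words, texts=tokenized, dictionary=dictionary, coherence="c_v"
).get_coherence()
print(f"c_v coherence: {coherence:.3f}")
```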

Table 1: Comparing Topic Modeling Approaches

Algorithm Selection
Traditional Methods: LDA, GSDMM, LSI, NMF, Doc2Vec, Word2Vec
Our Approach: Hybrid algorithms using pre-trained SLMs/LLMs (e.g., BERT), dimensionality reduction (UMAP, PCA), and clustering (KMeans per Lloyd and MacQueen, HDBSCAN)

Packages
Traditional Methods: Scikit-Learn, Gensim
Our Approach: BERTopic

Preprocessing
Traditional Methods: Extensive: standardizing text, stemming/lemmatizing, phrase transformations
Our Approach: Minimal: removing only semantically insignificant content to denoise text

Transformation
Traditional Methods: Sparse vectors based on token counts or learned embeddings
Our Approach: Fixed-dimension embeddings from Huggingface models, with optional truncation

Dimensionality Reduction (Training)
Traditional Methods: Often not applied
Our Approach: Applied by default to reduce embeddings to smaller dimensions

Clustering (Training)
Traditional Methods: User-selected number of topics, requiring multiple iterations
Our Approach: Automatic topic detection with algorithms like HDBSCAN

Hyperparameter Tuning (Training)
Traditional Methods: Extensive tuning needed
Our Approach: Optional; default BERTopic parameters are typically sufficient

Representation (Interpretation)
Traditional Methods: TF-IDF for token weighting and labeling
Our Approach: TF-IDF plus LLM-assisted labeling, categorization, and subcategorization

Evaluation
Traditional Methods: Metrics like coherence and perplexity; generic clustering scores
Our Approach: Emphasis on manual evaluation for interpretability using LLMs
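To make the "Our Approach" column concrete, the following minimal sketch wires a pre-trained sentence embedding model, UMAP, and HDBSCAN into BERTopic. The specific checkpoint name and parameter values are illustrative assumptions, not the toolkit's fixed defaults.

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP

# Pre-trained embedding model from Huggingface; this checkpoint is only an
# example, and any sentence-transformers model can be substituted.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Dimensionality reduction, applied by default in this approach.
umap_model = UMAP(
    n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42
)

# Density-based clustering: the number of topics is discovered automatically.
hdbscan_model = HDBSCAN(
    min_cluster_size=15,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    calculate_probabilities=True,
)
```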


Value Proposition

Our toolkit automates the repetitive aspects of topic modeling and reduces the need for extensive iterations and heuristic decisions. It simplifies model selection, evaluation, and interpretation, enabling developers to focus on deriving insights.

Features

Automated Training

Quickly build and iterate topic models using BERTopic.

Integration with CreateAI Platform

Utilize ASU’s PII-compliant foundation LLMs for in-depth topic labeling and categorization.

Outputs

Spreadsheets listing topics, metadata, labels, and categories (a minimal sketch follows this list).

Inferred training data for visualization and driver discovery.

Summary visualizations showing topic distribution and conceptual maps.

Intermediate step outputs for debugging.
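As a rough illustration of the spreadsheet-style outputs, a fitted model's topic table and per-document assignments can be written out with pandas. The file names below are placeholders, and the toolkit's actual output schema may differ.

```python
import pandas as pd

# Assumes `topic_model`, `topics`, and `cleaned` come from a previous fit.
topic_info = topic_model.get_topic_info()  # one row per topic: ID, size, name
doc_info = pd.DataFrame({"document": cleaned, "topic": topics})

# Placeholder file names; the real toolkit organizes its outputs itself.
topic_info.to_csv("topics_summary.csv", index=False)
doc_info.to_csv("document_topic_assignments.csv", index=False)
```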

Additional Utilities

Environment Setup

Bash scripts to create virtual or Miniconda environments tailored for topic modeling.

Validation

Python scripts to apply toolkit steps to unseen test data (a minimal sketch follows below).

Model Retraining

Bash scripts to efficiently retrain models as needed.
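The validation step mentioned above can be sketched as loading a trained model and assigning topics to held-out documents without refitting. The model path and example documents here are illustrative assumptions, not the toolkit's actual script interface.

```python
from bertopic import BERTopic

# Illustrative path; the toolkit's scripts manage model artifacts themselves.
topic_model = BERTopic.load("models/universal_topic_model")

unseen_docs = [
    "How do I update my direct deposit information?",
    "The VPN keeps disconnecting during remote sessions",
]

# Assign each held-out document to an existing topic without retraining.
new_topics, new_probabilities = topic_model.transform(unseen_docs)
for doc, topic_id in zip(unseen_docs, new_topics):
    print(f"topic {topic_id}: {doc}")
```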


Toolkit Limitations

1. Data Quality

Poor or noisy text data leads to ineffective topics.

2. Computational Constraints

Large datasets may exceed memory or disk capacities.

3. Document Length

Optimized for shorter texts; longer documents may produce overly generalized topics.

4. Domain Knowledge

Lack of expertise can result in irrelevant topic splits and inaccurate token associations.

Looking Ahead: Report Bots

With the decreasing costs of LLMs like OpenAI’s GPT-4o mini and Google’s Gemini 1.5 Flash, real-time text analytics becomes more feasible. For example, batching Salesforce cases from our contact centers and summarizing them into thematic reports demonstrated the effectiveness of cost-efficient LLMs in generating insightful topic reports. As LLM pricing continues to drop, these models will increasingly complement topic models, especially in generating strategic and granular reports, provided they offer a sufficient token context window and comply with data privacy requirements.
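A report bot of this kind could be prototyped by batching case descriptions and prompting a cost-efficient LLM for a thematic summary. The sketch below uses the OpenAI Python client as an example; the prompt, batch handling, and model choice are assumptions, not a description of the production Salesforce or CreateAI integration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_batch(cases: list[str], model: str = "gpt-4o-mini") -> str:
    """Return a short thematic report over one batch of case descriptions."""
    joined = "\n".join(f"- {case}" for case in cases)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Summarize the main themes in these support cases "
                           "as a short bulleted report.",
            },
            {"role": "user", "content": joined},
        ],
    )
    return response.choices[0].message.content

# Illustrative batch; real volumes need chunking that respects the model's
# context window and any data-privacy requirements.
cases = [
    "Student cannot access the financial aid portal",
    "Duplicate charge on a tuition payment",
]
print(summarize_batch(cases))
```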


Future Vision for the Toolkit

Despite advancements in LLMs, our toolkit remains valuable for specific scenarios:

Large Datasets

Topic models efficiently handle hundreds of thousands of documents without incurring high token costs.

Stable Topics

Ideal for environments where topics remain consistent over time, allowing for quick and reliable topic labeling.

Temporal Analytics

Superior in tracking how topic drivers evolve, offering consistent labeling as long as data stays within the model’s distribution.

Our toolkit continues to serve niches where LLMs might be excessive, ensuring cost-effective and actionable text analytics.
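For the temporal analytics scenario above, BERTopic's topics_over_time utility illustrates the kind of drift tracking described. The documents and timestamps below are placeholders; a realistically sized corpus is required for a meaningful fit.

```python
from bertopic import BERTopic

# Placeholder inputs: the corpus and one timestamp per document
# (e.g., case creation dates).
docs = [
    "Cannot log in to the student portal",
    "Question about tuition payment deadlines",
    # ... many more documents ...
]
timestamps = [
    "2024-01-15",
    "2024-02-03",
    # ... one timestamp per document ...
]

topic_model = BERTopic().fit(docs)  # in practice, reuse the trained model

# Frequency and keywords of each topic per time bin, useful for tracking
# how topic drivers evolve.
over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=12)
print(over_time.head())
```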

Interested in this project?

Contact program manager Paul Alvarado (palvara2@asu.edu).

Appendix

1: Universal Topic Modeling Toolkit Training Process (diagram)
