Best Practices Series
Separate Hype from Reality: Choosing a Speech Analytics Application
by Cliff LaCoursiere Co-Founder and Advisor, CallMiner, Inc. December 1, 2008
© 2008 CallMiner, Inc. All Rights Reserved.
Executive Summary
Choosing a speech analytics application, like choosing other enterprise applications, requires a focused effort that considers the expected value to be derived from the application's functions and use, and its potential impact on the organization. With a clear understanding of the following factors, you can separate hype from reality and determine whether a specific speech analytics application is appropriate for your organization:
• how the speech analytics application works
• the value it delivers
• how the speech analytics application categorizes calls
• how speech analytics integrates with other enterprise applications
Setting expectations early in the evaluation process with all stakeholders, including business owners, end users, and IT, will bring the perspective needed to evaluate the application's suitability. Creating a set of measurements consistent with the goals of stakeholders will facilitate management of the evaluation process and set the stage for optimum use of the speech analytics application chosen. As long as a considered evaluation process is employed and the potential business impact is understood, speech analytics can deliver exceptional value to the organization. This discussion will illustrate what organizations should consider when evaluating a speech analytics application and, once the choice is made, the steps necessary to extract measurable business value from the application's use.
Cliff LaCoursiere, Co-Founder and Advisor of CallMiner, Inc.
As a CallMiner founder and advisor, Cliff LaCoursiere continues to support the company's leadership position in the speech analytics market. LaCoursiere contributes to CallMiner's marketing and strategic initiatives to build the company's market share. Prior to being named an advisor, LaCoursiere served in various business development, corporate development, sales, and marketing roles.
He helped secure the company's first reference customers and led efforts that cemented CallMiner's partnerships with best-of-breed quality management and workforce optimization vendors. Prior to CallMiner, LaCoursiere led marketing and sales efforts at ThinkEngine Networks, Inc., a speech recognition hardware manufacturer. Previously, he was responsible for exponential sales growth at Media 100, a digital media hardware and software company. LaCoursiere holds a BS degree in economics and math from the University of Massachusetts at Amherst.
Where the Hype Starts
Choosing a speech analytics application would seem to be a relatively straightforward task, since most vendors use similar technologies and espouse similar functionality. However, hype is still raging, especially in competitive reviews. Good product demonstrations will highlight differentiating features; however, when the demo artists go home, it is still sometimes difficult to separate hype from reality. The reality is that few applications automatically reduce expenses or add revenue. Some speech analytics applications are clearly better than others at providing data for informed decisions that could lead to better operational performance and higher revenue.
Arming yourself with sufficient intelligence on selecting a speech analytics vendor, and requiring vendors to follow the selection guidelines you have established, will set you on a path that aligns expectations with results, as embodied in Figure 3.0. Although there will be bumps in the road, you can avoid the precipitous downturn of a "hyped" installation.
Figure 1.0: Traditional Gartner "Hype" Curve (vertical axis: Usefulness & ROI)
Figure 1.0 illustrates a traditional "Gartner Hype" cycle of expectations (modified for purposes of this analysis). Initially, a vendor promises everything but the kitchen sink and secures a purchase order in the wonderment-and-WOW stage. During the implementation phase, expectations fall dramatically for any one of a variety of reasons – most commonly, major business issues were not addressed or hardware components were not specified correctly. Only after a deep drop in customer satisfaction does the vendor redeem itself, leading to more meaningful results. When the "hype" is isolated, as in Figure 2.0, a different picture begins to emerge.
Once you've sorted through all of the hype, a series of best practices can be employed to extract the most business value out of your choice. Visit http://callminer.com/about-us.htm for a helpful guide outlining how to:
• Start Simply
• Generate a Clearly Defined SOW
• Don't Skimp on Hardware or Software
• Focus on the Greatest Impact First
• Mine +100% Audio
• Connect the Back Office to the Front Office
• Act Upon the New Business Intelligence
• Dedicate a Business Analyst(s)
• Employ Change Agents
Don’t be fooled by vendors that show you a demo on a laptop and claim that you can run speech analytics for your enterprise on a laptop. It doesn’t pass the “makes sense” test. www.callminer.com
Figure 2.0: Isolating the "Hype" (vertical axis: Usefulness & Return on Investment)
Figure 3.0: Fulfilling the Customer's Expectations (vertical axis: Usefulness & Return on Investment)
Speech Analytics 101: The Engine
All speech analytics applications employ a speech engine to create data for mining. There are two primary speech engine variants: large vocabulary continuous speech recognition (LVCSR) and phonetics. There are other variations of speech recognition technology that employ grammars, but as a practical matter they are beyond the scope of this paper and do little to help the reader pick the optimal speech analytics application.
LVCSR: The Industry Standard
LVCSR systems are the predominant type of speech recognition system used by business, research, and government. Most speech analytics applications use LVCSR, and for good reason. The LVCSR application creates a rough transcript of the words spoken in a call. The words, together with other call and acoustic metadata, serve as the grist for analysis and for extracting business intelligence from the call data. Since LVCSR creates a rough transcript from a call, the data can be stored and analyzed long after the recordings on which they are based are gone.
One criticism leveled at LVCSR is that the engine requires a dictionary. The dictionary issue should be met with due diligence. Most speech analytics applications employing LVCSR have dictionaries with hundreds of thousands of words. However, make sure when evaluating applications that words can be added to the application's dictionary. Most call centers use a vocabulary of fewer than 30,000 words; there are, however, occasions where product names or domain-specific terms need to be added to capture and report on call categories. A speech analytics application should be able to uncover multiple words in different contexts to determine definitively what was said in a call. For example,
a financial services company's agent needs to verify that the person on the phone is who they say they are by confirming their address to mitigate fraud. Given the vagaries of casual speech, an agent could ask for this information in a number of ways, such as "where do you live," "what's your address," or "what is your mailing address." All three questions will yield the correct answer and can be identified in an instant, since an LVCSR engine has stored all the words spoken. While a phonetic search could be employed to find the occurrence of each phrase, it would miss all the similar expressions or word combinations that verify the appropriate question was asked and answered.
Phonetic Engines
The second speech technology used in speech analytics applications employs a phonetic search engine. Phonetic engines break an audio recording into subword parts called phonemes that enable finding a word quickly. While finding a word may be useful in some forms of search, there is no context to the word, so the differences between words like "buy" and "bye" are lost. Phonetic searches have a seductive allure: on the surface, they don't rely on a dictionary and can provide a fuzzy "sounds like" match that seems intuitively like a good idea until you understand the limitations.
Phonetic Limitations
While it is true the technology doesn't require a dictionary, phonetic systems have two major drawbacks. The first is scalability. Phonetic engines can be extremely fast at searching for single words; however, because an individual word does not provide as much information as a series of words or phrases, much less account for the vagaries of casual speech, phonetic engines are required to perform multiple searches. Since phonetic systems must conduct multiple searches to find some results, they are considerably slower than LVCSR systems, which can conduct multiple word searches and combinations in a single pass.
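The address-verification example above can be sketched in code. This is a hypothetical illustration under the assumption that the LVCSR output is stored as plain text, not any vendor's actual API: because the whole transcript is available, several acceptable phrasings can be checked in a single pass.

```python
# Hypothetical sketch: an LVCSR transcript is plain text, so checking
# whether any of several phrasings occurred is one pass over stored words.
ADDRESS_PROMPTS = [
    "where do you live",
    "what's your address",
    "what is your mailing address",
]

def prompt_found(transcript, phrases=ADDRESS_PROMPTS):
    """Return True if any accepted phrasing occurs in the call transcript."""
    text = transcript.lower()
    return any(phrase in text for phrase in phrases)

call = "thanks for calling first can you tell me what is your mailing address please"
prompt_found(call)  # True: the third phrasing matched
```

A phonetic engine would need one search per phrasing; here the phrase list can be extended ad hoc and re-run over transcripts already on disk.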
The second challenge for phonetic systems is that they generate a high false-positive rate. For example, when searching for the word "toe," a phonetic engine will return two results for the word "tomato," since the phonemes for "toe" occur twice in the word.
Out of vocabulary? Don't accept the premise of this assertion. If you are worried about whether a word or acronym is in an LVCSR dictionary, ask the vendor.
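The "toe"/"tomato" false positive can be illustrated with a toy sliding-window match over a phoneme stream. The phoneme spellings below are simplified for illustration only, not the output of a real engine:

```python
# Toy illustration of phonetic over-matching: exact sliding-window
# comparison of a query phoneme sequence against a phoneme stream.
def count_matches(query, stream):
    """Count occurrences of the query phoneme sequence in the stream."""
    q = len(query)
    return sum(1 for i in range(len(stream) - q + 1) if stream[i:i + q] == query)

toe = ["t", "ow"]
tomato = ["t", "ow", "m", "ey", "t", "ow"]  # simplified phonemes, for illustration
hits = count_matches(toe, tomato)  # 2 hits, both false positives for "toe"
```

With no dictionary or language model, the matcher has no way to know that neither hit is the standalone word "toe."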
This quirk either forces the system to tune down sensitivity to achieve an acceptable false-alarm rate, hiding relevant information in the process, or forces the end user to listen to results to sort out which are accurate. The problem is exacerbated for even a modest-sized database of hundreds of hours of calls per day, rendering the false-alarm rate too high to ensure reliable results.
Hybrid Engines Are Ineffective
Early experiments have indicated that a phonetic engine's weaknesses are not offset by LVCSR's strengths. Experiments using the two technologies together have confirmed this hypothesis.
Application Architecture
Once an appropriate technology is vetted, choose an application that suits the company's analytics requirements. Top among those considerations are: which call recording systems are supported, what other non-recording data are captured, data compatibility, footprint and scaling, and the real-time nature of the application.
Most speech analytics systems support more than one call recording platform. While this would seem to be a fairly low hurdle, it is surprising how many vendors claim to be recording-platform agnostic and how few have production deployments in each of those environments. Less obvious, and in some instances more important, is whether different recording platforms can be supported in the same production environment. While most vendors support more than one recording platform, fewer support a mixed recording environment. The inability to support multiple recording platforms in a single implementation can have a significant cost impact on the hardware and software required to analyze the recordings, and can be further complicated by the need to store the data in separate data stores. A key to understanding how vendors handle multiple recording platforms is to ask for references that have a mixed recording environment, or one that is similar to the environment in which the speech analytics system will be implemented.
Search is not just playback. Ask your vendor if they can provide snippets of customer conversations to speed the process, allowing analysts to "read" ahead to find the best call.
All speech analytics systems convert recorded audio into minable information; however, the amount of non-recording metadata captured and used varies widely. The amount and quality of metadata captured is important because it lends significant color to who said what when, and what they meant.
Metadata, also known as call characteristics, can be lumped into three categories: data captured by the call recording system, data generated by the speech analytics system, and data generated by other applications such as CRM and ERP systems. The first category is information captured by the call recording system. Since call recording is a mature application, most vendors capture CTI data to reveal ANI, DNIS, call duration, and the like. While this basic information is important when considering a speech analytics vendor, integration with the specific call recording platform is more important.
Some recording vendors have programmatic links to CRM systems. Speech analytics applications that are able to use this information in their analytics can provide more refined results when analyzing call recordings. Using CRM together with other call recording data can enable analysis of call data segmented by customer type. The next important category of metadata is data generated internally by the speech analytics system. Variables such as caller stress, speech tempo, and silence lend details that lead to a richer understanding of what a customer meant versus what they said.
For example, a frequent flyer who has been transferred several times, repeating the same information each time, might on the third transfer say something like "hey, thanks, this has been really great." Applications that capture stress would show that when the customer spoke those words, their stress level went up several percentage points, and upon listening, the analyst would learn that the customer was not happy at all; rather, they were frustrated with the service rendered.
Metadata is a commodity where more is definitely better, under the right circumstances. While more data points can yield more telling results, it is critical that the speech analytics application provide an intuitive interface to sort through the metadata in a meaningful way. For example, are sales calls for a particular product more successful when average handle time goes up by 50% over targeted norms, or are shorter calls with specific sell language more effective? The ability of an application to facilitate those types of queries easily is essential. When reviewing speech analytics systems, the more available data the better, as long as the application makes the data easy to manipulate in a meaningful way. When evaluating vendors, have a specific challenge in mind, and have the vendor demonstrate, from data capture to analysis and reporting, how the results are achieved, to gauge whether they are easy to attain and meaningful.
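The handle-time question above is the kind of metadata query the interface should make easy. A minimal sketch, with invented field names and sample data (real systems would query a database, not a list):

```python
# Hypothetical metadata query: close rates segmented by handle time
# and by the presence of specific sell language. All fields invented.
calls = [
    {"handle_secs": 420, "sell_language": True,  "closed": True},
    {"handle_secs": 610, "sell_language": False, "closed": False},
    {"handle_secs": 180, "sell_language": True,  "closed": True},
    {"handle_secs": 900, "sell_language": False, "closed": True},
]
TARGET_SECS = 400  # assumed handle-time norm

def close_rate(subset):
    """Fraction of calls in the subset that closed a sale."""
    return sum(c["closed"] for c in subset) / len(subset) if subset else 0.0

# Segment 1: calls 50% over the targeted norm
long_calls = [c for c in calls if c["handle_secs"] > TARGET_SECS * 1.5]
# Segment 2: shorter calls that contain the sell language
short_sell = [c for c in calls if c["handle_secs"] <= TARGET_SECS and c["sell_language"]]
```

The point is not the arithmetic but that both segments are expressible as one-line filters over the same metadata store.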
Cost of Deployment
Naturally, cost of deployment is somewhere high on the list of considerations. Vendors typically quote software licenses, maintenance, training, and support, and most will suggest hardware, operating system, and database software requirements for the system contemplated. While comparing systems based on a fixed set of requirements enables a direct comparison, the organization should also consider requirements for system expansion. Organizations never use less of the speech analytics application over time, so you should plan for the future.
The principal considerations are the storage and CPUs required to convert, store, and analyze speech and metadata. Not all systems scale linearly, and each has different CPU and storage requirements. Some systems require less compute power to operate but more storage. Some systems require that the audio be kept to enable analytics, while others can conduct analysis once the data are converted into minable information and don't require the presence of the audio for analytics.
While storage is usually dismissed as a low-cost commodity, the reality is that it is not just the storage required, but the management of that storage that makes the cost nontrivial. Consider that one managed terabyte of storage at $10/gigabyte/month equals $10,000/month or $120,000/year. Storage requirements vary widely for each speech analytics application, so it’s important to consider how much storage is required to implement a system and how that storage will grow to accommodate system expansion over time.
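The storage arithmetic above is easy to verify, taking the paper's round figures at face value (a decimal terabyte of 1,000 GB):

```python
# Managed-storage cost check, using the paper's round numbers.
GB_PER_TB = 1000          # decimal terabyte, matching the $10,000 round figure
RATE_PER_GB_MONTH = 10    # dollars per managed gigabyte per month

monthly = GB_PER_TB * RATE_PER_GB_MONTH   # dollars per month for 1 TB
yearly = monthly * 12                     # dollars per year
```

A binary terabyte (1,024 GB) would come to $10,240/month; either way, managed storage dominates quickly as retention grows.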
Size matters: ask about the database footprint. Remember, one "managed" terabyte at $10/GB/month is $10,000/month, or $120,000/year.
Call Categorization
Call categorization is the foundation of speech analytics. Categorization is a process whereby words are recognized in their appropriate context and converted from an unstructured form into one that can be understood within the structure of categories.
Pick any subject or area of inquiry that speech analytics is expected to uncover and analyze, and the object of the exercise can be boiled down to the categorization of calls. Every vendor conducts this task with different degrees of depth and competence; however, at its core, results are grouped into buckets, then counted and measured over a time dimension to indicate volume and trends.
The system should allow the user to define the relationship between and among categories. For example, a business may desire to categorize calls by product and, within product categories, parse calls into subcategories of sales, service, or customer support, ad infinitum. Further, the system should provide the flexibility to aggregate categories into super categories for the creation of indices and KPIs.
Since categorizing calls is a core value of speech analytics, it's key that those charged with evaluating an application's suitability have an understanding of the expected business value, along with a firm grasp of how the resulting information can be used to improve decision making and business processes.
The test of how well an application categorizes results can be measured by how easily new categories can be created, whether new categories can be created on an ad-hoc basis as required or as new data are introduced, the ability to use search to create new categories and subcategories, and finally, how accurate the categorization is.
The application should provide an interface that enables a simple method for the end user to create, test, and modify new categories. Ideally, the application provides features such as wizards to assist in the creation of categories or otherwise add words that the user would not ordinarily have considered. Once a category is created, the interface should enable a search against the audio and match the category against a text representation of the category. This provides the user with some assurance that the search yields what the category specified. The application should also present the number of calls that matched the category while providing a facility for listening to sample calls. Additionally, the application should allow the end user to go back to the data to create new categories on an ad-hoc basis. Once the application has converted the unstructured audio into a minable database, it should allow recategorization at will, even in cases where the audio has been deleted. This feature allows the analyst to create new categories and, as indicated earlier, results in significant savings in the system's overall cost.
Voice is data. Data can be measured. Data that can be measured can be improved. Does your vendor provide an extraction layer to interface with your organization's BI?
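Categorization against a mined transcript database might look like the following sketch. It follows the idea above that categories can be re-run at will after the audio is deleted; the data model and category names are hypothetical:

```python
# Hypothetical sketch: categories are phrase sets applied to stored
# transcripts, so new categories can be added after the audio is gone.
transcripts = {
    101: "i want to cancel my flight to boston",
    102: "what is your mailing address",
    103: "i need to cancel the order and get a refund",
}
categories = {
    "cancellation": {"cancel", "refund"},
    "verification": {"address"},
}

def categorize(transcripts, categories):
    """Return category -> list of matching call ids (ad hoc, re-runnable)."""
    result = {name: [] for name in categories}
    for call_id, text in transcripts.items():
        words = set(text.split())
        for name, terms in categories.items():
            if words & terms:
                result[name].append(call_id)
    return result

counts = {name: len(ids) for name, ids in categorize(transcripts, categories).items()}
```

Adding a new category is just another entry in the dictionary and a re-run over text already on disk, which is where the storage savings the text mentions come from.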
Search
Search is useful in speech analytics, but only as a component of analytics. The very nature of search is to find what you're looking for in a body of text or audio. Finding the occurrence of something, as in "search," is not as important as measuring the number of times it occurs in relation to other variables, as in "analytics." For example, with search you can find the number of times the name Barack Obama occurred in a call, but search can't tell you how the occurrences of Barack Obama correlated with his stance on unemployment. Not only can a robust speech analytics application find the correlation, it can uncover the relationship of the variables and illustrate how those variables change over time. The considerations that underpin the usefulness of search in analytics are:
• the ability to conduct ad-hoc analytics with the application using logic
• the availability of application assistance for constructing searches
• the ability to reuse searches and deliver results at any user-defined frequency
• ensuring that the audio need not be present for a search to execute
• the availability of text for reading and for validating results
• the ability of the application to differentiate words that sound alike
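A minimal sketch of ad-hoc search with boolean logic over stored transcripts, including the word-frequency data that helps an analyst discover terms they hadn't considered. All names and sample data are invented for illustration:

```python
from collections import Counter

# Hypothetical transcript store; real systems would query a database.
transcripts = [
    "i would like to buy a ticket",
    "goodbye and thanks for calling",
    "can i buy two tickets to denver",
]

def search(texts, all_of=(), none_of=()):
    """Ad-hoc boolean search: keep texts with every all_of term and no none_of term."""
    hits = []
    for text in texts:
        words = set(text.split())
        if all(t in words for t in all_of) and not any(t in words for t in none_of):
            hits.append(text)
    return hits

# Word-frequency data surfaces candidate terms for refining a search.
word_freq = Counter(w for t in transcripts for w in t.split())

buy_calls = search(transcripts, all_of=("buy",))                      # initial search
refined = search(transcripts, all_of=("buy",), none_of=("denver",))   # constrained
```

The building-and-branching pattern the text describes is just re-running `search` with added or removed terms, which requires only the text, never the audio.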
One of the cornerstones of a good analytics application is flexibility, and in the context of search, that quality is embodied in the ability to conduct ad-hoc searches. Like any discovery process, search is a building and branching exercise: once a search has been executed and the results reviewed, the analyst will likely add to or constrain the search to focus the results. Some speech analytics applications limit a user to deciding what they're looking for ahead of time, and then deposit the results in the database. This can be useful when the analyst unequivocally knows what to look for or the likelihood of further analysis is remote, but that approach is decidedly not analytic; at best it is lazy, and at worst it yields subpar results.
The ability to conduct ad-hoc searches is critical, but without some assistance from the speech analytics application, it could significantly increase the time needed to construct searches. Some vendors employ application wizards that assist in the creation of searches by adding synonyms and other word forms to a search. This capability significantly reduces the time needed to construct a search by suggesting key words that could affect results accuracy. Look for applications that provide word frequency data – these data help the analyst look for topics that might not ordinarily be included in a set of search terms and could point to issues that were not otherwise obvious.
While often taken for granted, some speech analytics applications require that the audio be present to conduct analytics. Applications that require audio for search have a significant impact on storage and processing costs. Applications that employ LVCSR and store the entire call as text may not require the audio to be present for analysis, resulting in significant hardware savings. An ancillary benefit of having the text of a call is the ability to read what came before and after the search hit. This lends more color to the context of the search and may inspire a deeper look at the call or take the analysis in another direction. Lastly, ask if the application can discern the difference between homonyms. A quality LVCSR engine uses a language model and can differentiate "bye" from "buy" or "bi."
Discovery
Even with competent training, employing a speech analytics application to the benefit of the organization is challenging, so an application that enables automated discovery and management of key performance indicators out of the box is crucial. Applications that conform to these evaluation criteria will deliver value to the business on day one. So when considering which speech analytics application is appropriate for the organization, criteria should include which performance indicators are available without customization, whether analysts can drill down on performance indicators, and whether new indicators can be created and managed easily.
While performance indices are varied, most organizations would agree that measurements of customer satisfaction, agent quality, sales performance, and marketing effectiveness would be at the top of the list. The expectation should be that those key performance indicators are available out of the box and without custom development work on the part of the speech analytics vendor. Further, the application should facilitate drill-down to the components that aggregate into those key indices, so business analysts can isolate variables that affect the upward or downward movement of the metric. This gives the business a focused understanding of which components might need attention. Lastly, the application should also enable the business analyst to create and modify custom components and indices to capture changing market and business dynamics. While the latter capabilities may be taken for granted, the subtlety is that value is delivered to multiple levels of the organization in a form that's actionable.
Senior management can view key performance metrics on a regular basis and, when they're trending upward, call on a business analyst to drill down on the indices to identify the components that are driving business in a positive direction.
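The drill-down pattern just described, a top-level index that decomposes into measurable components, can be sketched as follows. The component names and weights are invented for illustration:

```python
# Hypothetical KPI built from weighted components, with drill-down
# to see which component is dragging on (or driving) the index.
components = {
    "first_call_resolution": 0.82,
    "politeness_language":   0.91,
    "hold_time_score":       0.64,
}
weights = {
    "first_call_resolution": 0.5,
    "politeness_language":   0.2,
    "hold_time_score":       0.3,
}

# Top-level index seen by senior management
customer_satisfaction = sum(components[k] * weights[k] for k in components)

# Drill-down for the analyst: rank components from weakest to strongest
drag = sorted(components, key=components.get)
weakest = drag[0]  # the component most in need of attention
```

The value the text describes is that both views come from the same data: the executive sees `customer_satisfaction`; the analyst inspects `drag`.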
Speech analytics is the science of customer conversations. Ask your vendor whether they can deliver the three integral elements of categorization, search, and discovery.
For example, marketing effectiveness may be trending up, and further drill-down may reveal that while several promotions are in effect, one is performing so well that the nonperforming promotions are masked by the outperforming one. Management could then take corrective action to improve the underperforming promotions and duplicate the qualities of the performing promotion in the future. When evaluating speech analytics applications, ask the vendors what's delivered out of the box to aid in the discovery and management of key business health indices. Many product demos illustrate some type of automated discovery; on further evaluation, however, you may find that the business indices need custom development work that adds to the cost of the implementation. Make sure that if automated discovery is delivered with the application, analysts can easily drill down on the disaggregate components that make up the indices, and that the indices can be manipulated easily to suit business requirements.
Reporting
All speech analytics applications provide some modicum of reporting, whether offering canned, customized, or proprietary reporting formats. Like search capabilities, reporting should offer the user flexibility, including the ability to schedule reports on a recurring basis, enable the use of metadata as a reporting parameter, expose underlying data to third-party analytics and reporting engines, show results in both tables and graphics, and be grounded in industry-standard formats. While not glamorous, reporting is the analyst's vehicle for communicating findings. Since conducting analysis can be a time-consuming process, it's critical that once created, reports can be scheduled to run and be delivered to the managers who need them at the required frequency. Further, once those reports are delivered, the ability to filter and view reports with metadata can provide richer information for better decisions.
For example, a report that shows a spike in call activity for flights to a specific city might suggest that more customers are inquiring about flights to that city. A closer look at the report, parsing the data by an inbound-vs-outbound parameter, might show that there were more outbound calls about the city in question than inbound calls.
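The flight example above amounts to grouping a report by a metadata dimension. A toy version, with invented fields and data:

```python
from collections import Counter

# Hypothetical call records carrying metadata from the recording system.
calls = [
    {"city": "denver", "direction": "outbound"},
    {"city": "denver", "direction": "outbound"},
    {"city": "denver", "direction": "inbound"},
    {"city": "austin", "direction": "inbound"},
]

# Filter the report to the spiking city, then group by direction.
by_direction = Counter(c["direction"] for c in calls if c["city"] == "denver")
# The spike is agents calling out, not customers calling in.
```

One extra grouping parameter flips the interpretation of the spike, which is exactly why metadata as a reporting parameter matters.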
Further drill-down on the calls could reveal that the flights were cancelled and the agents were making outbound calls to inform passengers of the cancellations. The dimensions of the data can be daunting, so having a reporting application that depicts results graphically can help interpret them. As with other features of speech analytics applications, adhering to industry-standard formats empowers the end user to use other analytics and reporting tools to conduct further analysis. So when evaluating a speech analytics application, inquire whether data can be exported to the other business intelligence and reporting tools used by the business.
Extracting Value from Speech Analytics
Independent of which application is chosen, defining expectations for how the application will be used, and agreeing on what constitutes success criteria for the application's use, will create the framework for determining whether deploying the application is in the best interest of the business. Too many applications are deployed with expectations set by the vendor, and while good results can be attained, they may fail to measure up to the hype that often accompanies the adoption process.
Summary
Once you've separated the "hype" from the "reality" and employed the best practices in your selection process, you will be armed with a keen understanding of how speech analytics applications work, the output they can deliver, how the system will fit and play with other enterprise applications, and ultimately, how the application categorizes calls, which will determine whether the application is appropriate for your business. Choosing a speech analytics application, like choosing other enterprise applications, requires a focused effort that considers the expected value to be derived from the application's functions and use, and its potential impact on the business.
Setting expectations early in the evaluation process with all stakeholders, including business owners, end users, and IT, can bring the perspective needed to evaluate the application's suitability. Creating a set of measurements consistent with the goals of stakeholders will facilitate the management of the evaluation process and set the stage for optimum use of the speech analytics application chosen. As long as a considered evaluation process is employed and the potential business impact is understood, speech analytics can deliver exceptional value to your business.