Information Technology Quality of Service Metrics at ibm.com
IBM Technical Report TR-40.0031
TeleWeb Operations Program Manager
© 2003 IBM CORPORATION
Sr. Manager and Corporate Webmaster
Abstract
Information Technology Operations Management within IBM includes the managing of Solution (i.e. infrastructure, application, business process) Performance Metrics. This case study reviews how IBM’s Web organization (ibm.com) performs effective Quality of Service management for web site availability and response time. At ibm.com we find that as our Quality of Service improves, we are rewarded in the marketplace with increased overall customer satisfaction and larger revenue capture on the web. To maximize the benefit of Quality of Service metrics, ibm.com focuses its resources on Key Applications and Business Processes, and drives broad improvement through basic Quality of Service management across the entire portfolio of applications. IBM sets targets for high availability and response time based on best-of-breed benchmarking. To deliver on the response time and availability requirements, ibm.com uses an exception alerting system to draw immediate attention to issues. To deliver proactively on high Quality of Service metrics, management reviews the metrics and identifies actions within a standing calendar of Quality of Service reviews.
Introduction
IBM has undergone a major financial, competitive, and cultural transformation since 1993. The Business Transformation Management System (BTMS) is a component of that transformation, and is used by IBM worldwide to identify, develop, and deploy IBM information technology and infrastructure. BTMS provides Operational Management guidance related to Solution Performance Management. Solution Performance Management Metrics include reporting on web traffic, customer events (such as quantity of orders), customer satisfaction, and availability and response time. In the area of availability and response time metrics, ibm.com has exceeded the base guidance and has been a pioneer in extending standards.
ibm.com is the organization within IBM that develops, deploys and manages IBM’s web presence. This organization controls the ibm.com Internet domain, provides web sites for Commerce and stakeholder (i.e. customers, investors, the press and potential employees) support, and provides guidance to all external IBM web sites. This paper reflects the experiences and lessons learned by ibm.com in managing the Quality of Service metrics, availability and response time, for IBM’s web presence.
Marketplace impact of Availability and Response Time
Service availability and response time are two basic Quality of Service metrics on which an institution must meet expectations to maintain satisfied constituents. For example, if there are two gas stations close to your home, and one is open more hours and has a significantly shorter wait to be served, over time you are likely to use the more available gas station, and possibly switch forever. We at ibm.com know our customers rely on the web to learn about our goods and services, to shop and buy, and to use those goods and services effectively. The ibm.com retail segment presents the most significant customer retention challenges to IBM. Retail customer sites are sticky, which is to say someone keeps going back, or sticks, to the same web site as long as it satisfies their need. If the retail web site is unavailable, the customer will switch to a new site if their needs are immediate. Once they switch sites, they may not switch again until the competitor site fails to satisfy their needs.
[Figure: Customer Sat trend compared to site. Vertical axis: Average Daily Score. Series: Overall Sat POS, Likelihood to buy again POS, Web Site Performance Sat. Two points are annotated "Outage in prime shift".]
ibm.com Retail commerce site survey data shows a strong correlation between outages and decreased customer satisfaction, web site satisfaction, and likelihood to do business with IBM.
ibm.com has found that web site response time can influence customer satisfaction, web site satisfaction, and likelihood to do business with IBM. Improvements in site response time of 20-30% produce modest increases in overall customer satisfaction of 5-10%. Significant increases of 20-50% in response time (that is, slower responses) result in a decline in customer satisfaction of 10-15%. ibm.com’s conclusion is that faster response time alone does not drive significant increases in customer satisfaction, yet substantial increases in response time can drive customers away.
Key Applications and Business Processes
[Figure: ibm.com Key Applications as % of Portfolio]
Key Applications or Business Processes are designations at ibm.com that mandate a minimal level of system availability, system response time, and metric reporting. Key Applications or Business Processes meet one or more of the following criteria:
1. Used by external IBM Customers
2. Quantity of revenue or order capture volume
3. Web Site or Event influencing company image in a major way
   a. Investor webcasts or hosting of a major sports event website are extremely visible events that can define the IBM image to an influential segment of IBM stakeholders and customers.
   b. In general we use the quantity of expected site visits to ascertain whether the Web Site or Event is critical. ibm.com measures web site traffic using Surfaid™.
4. Quantity of visits by entitled customers
   a. ibm.com customers have executed contracts with IBM for specific customer functions (like technical support) that have implied service level objectives.
5. Alternative processing cost exposure
   a. IBM has found that certain functions related to product delivery, like order status, generate a deluge of alternative contacts into IBM when they are unavailable, and those contacts are dealt with in a less cost-effective manner.
   b. In high-volume, low-margin item handling, an order processed by any channel other than the web is prohibitive, as alternative order processing costs reduce profit margins.
Setting Availability Targets
Service level expectations, in commercial and non-profit operations, historically started out with the hours the physical facility was open for business. For commercial enterprises the hours of operation, when not regulated by law, became a differentiator among firms. A web site, or any technology that enables access to a commercial organization (e.g. automated teller machines for banks), raises customer expectations that these institutions are always available to process a request with prompt response time. IBM strives to achieve the highest possible availability, tempered by the costs of supporting the infrastructure or application architecture.
The ibm.com availability standard is 99.5% for the underlying web infrastructure, excluding the specific system maintenance time requirements. On top of that infrastructure availability standard, we have put in place specific Key Application or Business Process availability requirements. For example, for the Key Application www.ibm.com homepage we have set a 99.95% availability target (about four hours of downtime per year), which we have exceeded for the prior two years, as IBM has had no measurable outages. Other applications, such as Commerce, have 99.5% availability targets, and we still have room for improvement in attaining them. (See Figure 1, Sample Availability and Response Time Report.)
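The relationship between an availability percentage and the downtime it allows can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, a 365-day year is assumed, and the treatment of excluded maintenance time is one plausible reading of the standard described above, not ibm.com's documented formula.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored in this sketch


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget implied by an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (1.0 - availability_pct / 100.0)


def measured_availability_pct(outage_hours: float,
                              period_hours: float = HOURS_PER_YEAR,
                              excluded_maintenance_hours: float = 0.0) -> float:
    """Availability over the scheduled time, i.e. the period minus any
    maintenance time excluded from the standard (an assumption here)."""
    scheduled = period_hours - excluded_maintenance_hours
    if scheduled <= 0:
        raise ValueError("no scheduled time left after exclusions")
    return 100.0 * (scheduled - outage_hours) / scheduled


if __name__ == "__main__":
    print(round(allowed_downtime_hours(99.95), 2))  # homepage target
    print(round(allowed_downtime_hours(99.5), 2))   # infrastructure standard
```

Under these assumptions a 99.95% target leaves roughly 4.4 hours of downtime per year, consistent with the "about four hours" quoted above, while a 99.5% target leaves roughly ten times as much.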
Setting Response Time Targets
IBM response time standards evolved from IBM internally defined timings to targets based on competitive intelligence. The initial standard was based on the ibm.com assessment of what the web site could achieve. The revolution in thinking was the transition to true marketplace measures of acceptable response time, as defined during benchmarking.
[Figure: Evolution over Time. Labels: Application Specific Standards, Corporate Wide Generic Transactions, Competitive Timings Based.]
Annually, for Key Applications or Key Business Processes, we set response time targets by benchmarking our competitors’ sites [1]. The competitors chosen by the business teams are those doing well through the web channel. We do this benchmark on a geographic basis, to ensure we meet the challenge in each of the geographies we serve. With the benchmark data of similar competitor sites, we set response time standards that are at or below our competitors’ response times. The output of this competitive exercise can be somewhat sobering to the technology and business staff, as it sets targets based on what you need to achieve in the marketplace, and not just on what you can achieve.
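As a rough illustration of setting a target "at or below" competitor timings per geography, the sketch below takes the fastest benchmarked competitor in each geography as the target. This is an assumption made for the example: the text does not say whether ibm.com uses the fastest competitor, an average, or some other statistic, and the function name and sample numbers are hypothetical.

```python
from typing import Dict, List


def response_time_targets(benchmarks: Dict[str, List[float]]) -> Dict[str, float]:
    """Map each geography to a response time target in seconds.

    benchmarks maps a geography name to the measured response times of the
    competitor sites benchmarked there. Using the minimum guarantees the
    target is at or below every benchmarked competitor.
    """
    targets: Dict[str, float] = {}
    for geography, timings in benchmarks.items():
        if not timings:
            raise ValueError(f"no benchmark timings for {geography}")
        targets[geography] = min(timings)
    return targets


if __name__ == "__main__":
    # Illustrative numbers only, not real benchmark data.
    sample = {"Europe": [3.1, 2.4, 4.0], "Japan": [1.9, 2.2]}
    print(response_time_targets(sample))
```

A less aggressive policy, such as the median competitor timing, would need only a different statistic in place of `min`.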
[1] ibm.com reviews its benchmarking with our Legal staff to validate that we rely on publicly available data and avoid unethical competitive practices.
Alerts and underlying monitoring
At IBM we strive to identify performance or availability service delivery issues prior to customer impact. We have alerts generated for infrastructure component issues, application availability issues, or errors in business process monitors. We analyze all alerts via automated analysis or staff intervention. In cases where ibm.com has redundant resources to handle the customer requests, often there is no visible customer impact.
For infrastructure components the costs to set up monitoring and act upon alerts are built into the service delivery rates, and are not considered discretionary. Setup of discrete web process or web link monitoring, with its alert handling and reporting, is done for Key Applications or Business Processes. The alert handling and reporting costs can be significant if staff are required to review the information and initiate corrective actions. For infrastructure components we have either available staff or a page-out procedure to initiate complex problem analysis and correction. For Key Applications, or applications that have unique interim requirements, there is 24x7x365 staff coverage to respond to alerts. The staff response to alerts is: verification, problem definition, and then initiation of corrective action.
ibm.com monitors at two levels: the operational level and the user perspective. Operations level monitoring covers individual technology components. User perspective monitoring includes business scenario validation. Alerts issued for processing errors are returned from either the operational or end-to-end monitoring staff. Operational alerts are generated upon a change of status or a lack of response. End-to-end monitoring generates alerts when there is no response before the timeout value [2] or when the response does not match the expected content.
Operations level monitoring techniques include:
• Enabling Tivoli monitoring to alert any time equipment, previously present and operational, does not respond.
• Running System Resource Monitoring for servers to check on CPU Usage, Run Queue for AIX, Memory and Storage Capacity used, I/O Wait and Paging.
• Validating that DB2 Tables are at acceptable capacities.
• Measuring Network usage versus committed capacity.
• Using IP pings. With bi-directional probing, the failing component (e.g. firewall, network node, virtual private network link) in the network path can be identified.
• Issuing HTTP HEAD requests to specific servers to confirm that a web server is running and responding to requests. This type of request is “light” and, with minimal impact on capacity, provides a significant measurement of server health.
End-to-end (user perspective) monitoring techniques include:
• Initiating XML requests for key common services (e.g. authentication) to ensure the directory is responsive.
• Running a simple routine to ensure the initial page load responds to a browser.
• Processing a fully scripted business scenario with business response validation for Key Applications or Business Processes.
Periodically our staff performs site monitoring and validation directly.
The most likely reason for manual site verification is that a significant upcoming event requires additional focus. For example, we at IBM run marketing campaigns that are intended to generate significant interest in our products and services. We may add extra monitoring of specific sites or perform specific functional verification to validate that we will reap the maximum benefit from the marketing activity. Another example is key demonstrations for selected customers. We offer custom web sites for our largest customers. We do manual monitoring of those sites during key customer demonstrations to ensure we effectively support our marketing efforts.
Monitors are established from different points of presence depending on the information needed for operational management. We place monitors on both our internal network and external network points of presence. Monitors on our internal networks are used for technology level monitoring, for business process measurement for Key Applications heavily used by IBM staff, and for providing a quick method to validate that a problem is external to ibm.com.
For web sites where the majority of users are coming in via the Internet, we monitor from external network points of presence throughout the geographical area that serves our stakeholders. The accompanying chart illustrates Points of Presence in Europe and Africa used for monitoring an application hosted in North America.
[Figure: Probing Application from Europe]
[2] For web transactions we set the timeout value to 45 seconds, unless the nature of the request requires a shorter or longer time.
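The two end-to-end failure conditions described in this section, no response before the timeout and a response that does not contain the expected content, can be sketched as a small probe evaluator. This is a minimal sketch under stated assumptions, not the monitoring products ibm.com uses: `run_probe`, `ProbeResult`, and the caller-supplied `fetch` function (which is assumed to raise `TimeoutError` when the timeout elapses) are hypothetical names, and only the 45-second default comes from the text.

```python
import time
from dataclasses import dataclass
from typing import Callable

DEFAULT_TIMEOUT_S = 45.0  # default web transaction timeout cited in the footnote


@dataclass
class ProbeResult:
    ok: bool
    reason: str        # "ok", "timeout", or "content_mismatch"
    elapsed_s: float   # measured response time in seconds


def run_probe(fetch: Callable[[str, float], str],
              url: str,
              expected_text: str,
              timeout_s: float = DEFAULT_TIMEOUT_S) -> ProbeResult:
    """Fire one probe and classify the outcome.

    fetch(url, timeout_s) is supplied by the caller; it must return the
    response body as text or raise TimeoutError (an assumption of this sketch).
    """
    start = time.monotonic()
    try:
        body = fetch(url, timeout_s)
    except TimeoutError:
        return ProbeResult(False, "timeout", time.monotonic() - start)
    elapsed = time.monotonic() - start
    if expected_text not in body:
        # The site answered, but not with the content the scenario expects.
        return ProbeResult(False, "content_mismatch", elapsed)
    return ProbeResult(True, "ok", elapsed)
```

Passing `fetch` in as a parameter keeps the classification logic independent of any particular HTTP client and lets the same check run from internal or external points of presence.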
ibm.com uses commercially available IBM Global Services, IBM software products, and external service providers for monitoring. At ibm.com we are evaluating Client Perceived Response Time tools to collect the performance information for Key Applications or Business Processes. Client Perceived Response Time (CPRT) is a technology for accurately measuring the customer experience of a WWW service by instrumenting web pages with executables that send the response time data back to a collection point. While this technology may provide us a new basis for collecting the data, we are too early in our evaluation to comment on the implications for our management system.
At ibm.com we put great value on monitoring tools that provide real time, or close to real time, reports that aid in operational issue identification. The real time business probe has become fundamental in both immediate and historical problem analysis. The historical data from monitoring tools can pinpoint when a change in the response time or availability arose. Often the response time changes in our sites have been related to content, and by identifying the date and time of a change we can narrow the review of changes.
ibm.com End-to-End probes are excluded from business usage metrics. Monitoring traffic is eliminated so that ibm.com can ascertain true customer usage trends and directions. We periodically assess the volume of our end-to-end probes to make sure we are optimizing our activity across the IBM company. For example, we have unique portals for each of our large customers that are customized to their needs to learn about, shop for, buy and use IBM goods and services. We found that groups in multiple IBM brands (e.g. Software, Server, Learning Services, Sales and Distribution) were doing Business Scenario probing for their own Quality of Service metrics. In one case we found approximately 30% of the portal web hits were monitoring transactions. Upon understanding this statistic, we consolidated the probe scenarios, yielding both system resource efficiency in dealing with customer requests and reduced internal reporting costs once the shared reporting was put in place.
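Separating probe traffic from customer traffic, and measuring how large the probe share is, might look like the sketch below. How ibm.com actually identifies its probes is not stated; here it is assumed that each probe sends a recognizable user-agent string, and the marker `qos-probe`, the `user_agent` field name, and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

PROBE_AGENT_MARKERS: Tuple[str, ...] = ("qos-probe",)  # hypothetical probe marker

Hit = Dict[str, str]


def split_probe_traffic(hits: Iterable[Hit],
                        markers: Sequence[str] = PROBE_AGENT_MARKERS
                        ) -> Tuple[List[Hit], List[Hit]]:
    """Return (customer_hits, probe_hits) based on the user-agent field."""
    customer: List[Hit] = []
    probe: List[Hit] = []
    for hit in hits:
        agent = hit.get("user_agent", "").lower()
        if any(marker in agent for marker in markers):
            probe.append(hit)
        else:
            customer.append(hit)
    return customer, probe


def probe_share(hits: Iterable[Hit]) -> float:
    """Fraction of all hits generated by monitoring probes (0.0 if no hits)."""
    all_hits = list(hits)
    if not all_hits:
        return 0.0
    _, probe = split_probe_traffic(all_hits)
    return len(probe) / len(all_hits)
```

Business usage metrics would then be computed from the customer hits only, while a probe share approaching the 30% seen in the portal case would signal that probe scenarios should be consolidated.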
Availability and Response Time Reviews
ibm.com performance and availability reports are based on data generated by technology monitors or business process probes, tempered by qualitative analysis. We use raw technology monitor data to represent the Quality of Service metrics of availability and response time for most applications. For our Key Applications or Business Processes we refine this data with verified outage data and quality of system usability criteria.
Figure 1. Sample Availability and Response Time Report
[Figure: 13-Month Availability and Response Summary, including an Availability Summary, for the applications Portal - AP SP 4.0, Portal - NA SP 4.0, Portal - Japan SP 3.6, and Portal - EMEA SP 4.0. Legend labels: Met or Exceeded, Degraded Service, Missed Target, New Application. Legend colors: Green, Amber, Red.]
Immediate verification upon an alert allows a more accurate representation to the business of whether the web site is failing or there is a potential monitoring issue. We have found that technology monitors or business process probes themselves may fail in isolated cases. To keep these alerts from distracting us from hard outages, we have a rule that two failures must occur in a row before an outage is considered verified. Secondly, for Key Applications or Business Processes, staff with scripts verify any reported failure. This guards against reporting outages when there is no perceived customer impact. For example, a probe that fails because of content changes that are acceptable to a business user but fail automated validation is a different problem than the site failing. In the case where the probes are failing, but the business script executed by a person is working, we do not count it as a system outage. The best analogy is a doctor who may re-run a test if the first results are not consistent, and then possibly do further exploratory work, before definitively reporting to a patient that they have a critical health issue.
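The "two failures in a row" rule can be expressed as a scan over the ordered probe results. The sketch below is one possible reading of that rule; the function name and the choice to report the index where each verified outage began are assumptions for illustration. The manual script verification step is not modeled.

```python
from typing import Iterable, List


def verified_outage_starts(results: Iterable[bool],
                           required_consecutive: int = 2) -> List[int]:
    """Return the index at which each verified outage began.

    results is the ordered sequence of probe outcomes (True = success,
    False = failure). A run of failures counts as a verified outage only
    once it reaches required_consecutive failures in a row, so an isolated
    probe failure is ignored.
    """
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    starts: List[int] = []
    run_length = 0
    for index, ok in enumerate(results):
        if ok:
            run_length = 0
            continue
        run_length += 1
        if run_length == required_consecutive:
            # Report where the failing run started, not where it was confirmed.
            starts.append(index - required_consecutive + 1)
    return starts
```

For the sequence success, failure, success, failure, failure, failure, success, the single failure at position 1 is discarded and one verified outage is reported starting at position 3.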
Quality of system usability criteria include judgments as to web site usability, tolerances for intermittent failures, and thresholds for response times. We have had situations where the technology monitors and business process probes work fine, yet the site is unusable for the target audience. In one embarrassing episode, the content for the United Kingdom site was loaded with that for another country. The web site was responding, and business verification of the monitors was satisfied; however, the customers were not getting relevant information. This warranted the reporting of an outage, although it was the content, and not the technology, that failed.
At times there can be intermittent failures that customers can overcome with effort. ibm.com defines an outage as 40% or more of the probe firings within a time period failing due to timeout. ibm.com also investigates significant (>50%) fluctuations in response time to identify potential availability issues. Many of our Key Applications have dynamic content. We have had cases where system response time is dramatically higher or lower than historical levels. This may indicate that content or functions were so significantly revised that we need to either recalibrate our monitors or address a production problem.
Quality of Service reports are reviewed daily, weekly and monthly with different objectives. Attendees of the daily operational reviews include the application maintenance staff, service delivery staff and business owner. The response time and availability data is then compiled into a weekly report for review with the management of the application maintenance, service delivery, and business owner organizations. A monthly compilation is then reviewed with the Executive management of the application maintenance, service delivery, business transformation and business owner organizations. At each review the observed trends are discussed, and recommendations to resolve the problems are refined.
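The two numeric rules given in this section, an outage when 40% or more of probe firings in a period time out and an investigation when response time moves by more than 50%, can be sketched as follows. The function names are hypothetical, and the comparison of a current response time against a single historical baseline is an assumption: the text does not say how the historical level is computed.

```python
from typing import Sequence

OUTAGE_TIMEOUT_PERCENT = 40   # outage if at least this share of firings time out
FLUCTUATION_THRESHOLD = 0.50  # investigate if response time moves by more than this


def is_outage(timed_out: Sequence[bool]) -> bool:
    """timed_out holds one flag per probe firing in the period (True = timeout)."""
    total = len(timed_out)
    if total == 0:
        return False
    failures = sum(1 for flag in timed_out if flag)
    # Integer arithmetic avoids floating point surprises exactly at 40%.
    return failures * 100 >= OUTAGE_TIMEOUT_PERCENT * total


def needs_investigation(current_s: float, baseline_s: float) -> bool:
    """True if response time is more than 50% above or below the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return abs(current_s - baseline_s) / baseline_s > FLUCTUATION_THRESHOLD
```

Note that the fluctuation check is symmetric: a response time far lower than the historical level is flagged as well, since a much faster response can mean that content or functions changed significantly.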
Below is a table that indicates the content of the reports by review cycle:
Daily
   Availability: Current Day + other Weekdays
   Performance: Current Day + other Weekdays
   Root Cause Analysis (RCA): Prior Day Issue
Weekly
   Availability: Prior Week + 13 Weeks trend chart
   Performance: Prior Week + 13 Weeks trend chart
   Root Cause Analysis (RCA): Open RCAs and requests for closure; Current Month Outage Analysis by Root Cause
Monthly
   Availability: Prior Month + 13 Months trend chart
   Performance: Prior Month + 13 Months trend chart
   Root Cause Analysis (RCA): RCAs approved for closure; Current Month Outage Analysis by Root Cause, with 13 month trend
In our quality of service reporting we include 24x7 system availability and response time numbers, and Root Cause Analyses for outages. The 24x7 system availability statistics allow us to continuously highlight the need to always be available, and to eschew system maintenance windows as much as practical. For select Key Applications or Business Processes, we also produce reports for each Geography covering core business hours in the local time zone. Most IBM sites are worldwide, so we probe from worldwide points of presence and generate reports for availability and response time in the local time zone, so the business can understand the customer experience by Geography. The underlying data for the reporting is maintained in Greenwich Mean Time (GMT), so that combined reporting worldwide and across the application portfolio can easily be accomplished.
The Root Cause Analysis is provided not only for discrete outages; we also include trending analysis by category. The categories for the Root Cause Analysis trending are: Application Package or Customization, Hardware Failure, System Software (OS, Middleware), Network, Application Maintenance Process, Service Delivery Management Process, and Business Owner (e.g. Site Content).
[Figure: Web Attributable Outage Impact. Outage impact in USD thousands for App 1, App 2, and App 3. Legend: Alternate Processing Costs, Revenue.]
For Key Applications or Business Processes ibm.com has established quantitative estimates of jeopardized revenue and alternate process impact during an outage. For any given hour in the typical week we estimate alternate processing volumes as an incremental cost, and the typical order volume as the revenue at risk.
IBM Customers who cannot access us via the Web may call our staff or may elect to order from a competitor whose web site store is open. This quantitative analysis has been used to stress to everyone involved in the delivery of the web experience the immediate cost impact of an outage at any time, and to articulate the impact of outages we experience. These measurements have been powerful tools to justify the incremental technology investments to get to that next level of availability.
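The hourly estimate described above, typical order volume as revenue at risk plus alternate processing volume as incremental cost, can be sketched with a per-hour profile of the typical week. The data structure, field names, and the restriction to whole-hour outages are assumptions made for this example; the source does not describe how ibm.com stores or computes these estimates.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

HOURS_PER_WEEK = 7 * 24  # 168 hourly slots in the typical week


@dataclass
class HourlyProfile:
    typical_order_revenue: float       # USD normally captured on the web in this hour
    alternate_contacts: float          # extra contacts expected if the web is down
    cost_per_alternate_contact: float  # incremental USD cost of handling each one


def outage_impact(profiles: Sequence[HourlyProfile],
                  start_hour_of_week: int,
                  duration_hours: int) -> Tuple[float, float]:
    """Return (revenue_at_risk, alternate_processing_cost) in USD.

    profiles must hold one entry per hour of the typical week; an outage
    that runs past the end of the week wraps around to hour 0.
    """
    if len(profiles) != HOURS_PER_WEEK:
        raise ValueError("profiles must contain exactly 168 hourly entries")
    revenue_at_risk = 0.0
    alternate_cost = 0.0
    for offset in range(duration_hours):
        hour = profiles[(start_hour_of_week + offset) % HOURS_PER_WEEK]
        revenue_at_risk += hour.typical_order_revenue
        alternate_cost += hour.alternate_contacts * hour.cost_per_alternate_contact
    return revenue_at_risk, alternate_cost
```

Because the profile varies by hour of the week, the same outage duration can carry a very different impact in prime shift than overnight, which is the point such figures make when justifying availability investments.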
Conclusions
ibm.com has made Quality of Service availability and response time metrics part of the management of our business. The active management to improve these statistics has yielded metrics with a high degree of credibility. These credible metrics are then analyzed to ascertain tactical and strategic action items, with the ultimate goal of improving our web sites’ value proposition for our customers and stockholders.
Acknowledgements The authors would like to acknowledge the contributions of for review of the transcripts, and suggestions for improvement.
About the Authors
David Leip is IBM’s corporate webmaster, with direct technical responsibility for IBM’s corporate portal, which spans 83 countries on the wired and wireless web. Prior to becoming the corporate webmaster in 1999, David worked for IBM’s CIO office as program manager of web enablement. Earlier he worked in the IBM Software Development Lab in Toronto. David has an MSc in Computing & Information Science from the University of Guelph. His personal web site can be found at: http://www.Leip.ca/
Tegan Lee is a lead for Information Technology Operations Management in the ibm.com unit. From 1995 to 2001 Tegan was the Project Executive for an IBM-provided home banking web platform that at its peak had over 3,000,000 subscribers. Tegan has over 25 years of Information Technology experience in developing, deploying and operating business application computing platforms. Tegan has an MBA in Marketing and Finance from Pace University, and a BBA in Statistics from Baruch College.