ISSUE 5
ALEC SPROTEN & IRIS HOLLICK ADVANCED DEMAND FORECASTING AT BREUNINGER
THE PATH TO RESPONSIBLE AI Julia Stoyanovich
ENTERPRISE DATA & LLMs Colin Harman
HOW ML IS DRIVING PATIENT-CENTRED DRUG DISCOVERY Benjamin Glicksberg
TARUSH AGGARWAL
THE SMARTEST MINDS IN DATA SCIENCE & AI
LIOR GAVISH
RYAN KEARNS
ARNON HOURI YAFIN
LLMs, ELMs & The Semantic Layer with TARUSH AGGARWAL
“Using AI on top of the semantic layer is the most exciting application of AI in the data space today.”

How observability is advancing data reliability and data quality with LIOR GAVISH and RYAN KEARNS
“Observability is the idea that you’re able to measure the health of your data system.”

How AI is Driving the Eradication of Malaria with ARNON HOURI YAFIN
“Moving from malaria control to malaria elimination takes artificial intelligence and data.”
Expect smart thinking and insights from leaders and academics in Data Science and AI as they explore how their research can scale into broader industry applications.
Helping you to expand your knowledge and enhance your career. Hear the latest podcast over on
datascienceconversations.com
INSIDE ISSUE #5
CONTRIBUTORS
Alec Sproten Iris Hollick Jacques Conradie Tarush Aggarwal Benjamin Glicksberg Philipp M Diesinger Rex Woodbury Patrick McQuillan Francesco Gadaleta Julia Stoyanovich Colin Harman Isabel Stanley
COVER STORY: BREUNINGER Crafting an empowered demand forecast: A journey into the world of data-driven planning: navigating the path to precise forecasts Alec Sproten & Iris Hollick / Breuninger
06
SCALABLE DATA MANAGEMENT IN THE CLOUD Jacques Conradie / CGI
11
RISE OF THE DATA GENERALIST: SMALLER TEAMS, BIGGER IMPACT Tarush Aggarwal / 5x
17
DATA SCIENCE PLATFORMS FOR PATIENT-CENTERED DRUG DISCOVERY Benjamin Glicksberg / Character Biosciences
20
THE DAWN OF SYNTHETIC DATA Philipp M. Diesinger / BCG X
26
THE AI REVOLUTION VS THE MOBILE REVOLUTION & OTHER TECHNOLOGICAL REVOLUTIONS Rex Woodbury / Daybreak
30
THE BIOLOGICAL MODEL Patrick McQuillan / Jericho Consulting
35
COULD RUST BE THE FUTURE OF AI? Francesco Gadaleta / Amethix Technologies & Data Science At Home Podcast
40
THE PATH TO RESPONSIBLE AI Julia Stoyanovich / New York University
43
WITH LLMS, ENTERPRISE DATA IS DIFFERENT Colin Harman
49
EDITOR
Damien Deighan
DESIGN
Imtiaz Deighan imtiaz@datasciencetalent.co.uk
NEXT ISSUE
20TH FEBRUARY 2024
DISCLAIMER
The Data Scientist is published quarterly by Data Science Talent Ltd, Whitebridge Estate, Whitebridge Lane, Stone, Staffordshire, ST15 8LQ, UK. Access a digital copy of the magazine at datasciencetalent.co.uk/media.
The views and content expressed in The Data Scientist reflect the opinions of the author(s) and do not necessarily reflect the views of the magazine, Data Science Talent Ltd, or its staff. All material is published in good faith. All rights reserved; product, logo, brands and any other trademarks featured within The Data Scientist are the property of their respective trademark holders. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form by mechanical, electronic, photocopying, recording or other means without prior written permission. Data Science Talent Ltd cannot guarantee, and accepts no liability for, any loss or damage of any kind caused by this magazine or for the accuracy of claims made by advertisers.
EDITORIAL
HELLO,
AND WELCOME TO ISSUE 5 OF
THE DATA SCIENTIST
We are delighted to feature German retail industry giant Breuninger on the front cover of this issue. One of the things I love about our industry is how we enable 100-year-old companies to innovate and improve how they operate using Data Science & AI. In issue 5 we focus both on cutting-edge AI topics such as LLMs in the enterprise, and on the all-important fundamentals of traditional data management and strategy.
After the success of our AI Special Issue, we continue the AI focus, touching on synthetic data, LLMs, how Rust is emerging as the language of AI, and enterprise data. With all the buzz around AI in the last 12 months, you could be forgiven for thinking that traditional Data and ML are not as relevant anymore. However, it's even more important that we don't lose sight of the fundamentals, so we are also covering enterprise data strategy, scalable data management in the cloud, and how ML is driving drug discovery.
The Age of AI Compared to Previous Technology Cycles
I started recruiting in technology in 1999, in the early stages of the Internet Revolution. There are some parallels and some differences between the internet era and what is happening now with AI, even though we are just 12 months into the Age of AI - and what a year it has been. The pace of change and development is incredibly fast, but if past cycles are anything to go by, this is just the beginning of a long, multi-decade period. Just as the internet went through an extreme hype phase followed by the dotcom bust, the AI era will likely become a bubble and hype cycle that crashes. But, just as the internet touched virtually every area of our lives and changed how the world works, AI will too.
On the topic of cycles, Rex Woodbury has penned a brilliant article on AI and technology cycles (our magazine's first article from a VC investor) which is a must-read. Cycles are generally overlooked in the industry, maybe because the time period over which they play out is very long. But most trends with a long time horizon follow predictable patterns, which is why I was so keen to feature this article.
The Role of Regulation on the Path to Responsible AI
At the beginning of the year, my view was that we must regulate AI as quickly and aggressively as possible, because I was worried about the existential risks of AI. I now believe the doomsday scenarios are overstated, and my view is much closer to that of Yann LeCun and Andrew Ng. There are many more significant and immediate risks to deal with in the race to embed AI systems into the very fabric of our technological infrastructure, and Julia Stoyanovich eloquently discusses many of those.
The question is how effective regulation will actually be, and what cost it will come with. President Biden's recent executive order is wide-ranging but likely to create problems for start-ups wanting to build AI products. Regulatory capture is a huge problem in the USA, and it usually ends up with a small number of big players dominating, inadvertently handed rock-solid business moats by way of government licensing.
However, no regulation is simply not an option. AI is not the internet, mobile, or cloud. It's bigger than all of those. It cannot be left to AI companies to self-regulate. As a society and an industry, we made a mess of how we allowed social media to evolve unhindered by any meaningful oversight. If there is no AI regulation until much further down the line, then we are likely to repeat those mistakes. Many of the benefits we are lucky enough to experience as citizens living in the Western world are a direct result of regulation and government controls that hold businesses to account, so we have to impose regulation regardless of how imperfect it is.
AI World Congress in London - See you there?
The AI World Congress takes place in London on the 27th and 28th of November. You can find out more about the conference at aiconference.london. I am speaking on the Monday morning about how to build effective enterprise AI teams and avoid the hiring mistakes that were made in the Data Science hiring hype cycle of 2015-2019. I will also be contributing to a panel discussion which features partner representatives from McKinsey and BCG. In this digital era, it's hugely important that we still make the effort to show up to in-person events. Face-to-face is still the best way to make new personal connections and learn new things. There are still tickets available for what will be a great conference, so I hope to see some of you there.
Damien Deighan Editor
BREUNINGER
FROM INTUITION TO INTELLIGENCE:
CRAFTING AN EMPOWERED DEMAND FORECAST A JOURNEY INTO THE WORLD OF DATA-DRIVEN PLANNING: NAVIGATING THE PATH TO PRECISE FORECASTS
DR ALEC SPROTEN is a renowned data science expert and the Head of Data Science at E. Breuninger GmbH & Co., the leading premium and luxury department store chain in the DACH region. With a diverse background in psychology and a Ph.D. in economics, Alec brings a unique perspective to the field of data science. Previously, he served as an Assistant Professor for Economic Theory, conducting research on social norm compliance. Alec has also held the role of A.I. Evangelist for STAR cooperation and has extensive experience in strategy and management consulting, specialising in after-sales pricing. Currently, he manages a large Data Science Centre of Excellence, overseeing a team of talented Data Scientists and product managers. Alec’s expertise lies in leveraging data-driven insights to drive business strategy, innovation, and customer-centric solutions.
IRIS HOLLICK, a Data Science Product Owner, is on a mission to empower Breuninger with data-driven excellence by spearheading transformative Data Science projects. With a strong foundation in demand forecasting and strategic marketing, coupled with her adeptness in project management, she excels in orchestrating international teams, drawing on extensive global experience. Iris's expertise lies in steering data-driven initiatives across diverse technical landscapes and maximising the value of data science implementations. She is a distinguished academic, holding a Master's Degree in Technical Mathematics with a specialisation in Economics from Vienna University of Technology, complemented by an MBA in Project Management.
EMBRACING OUR ORIGINS
As the leading luxury department store chain in the DACH region and beyond, E. Breuninger GmbH & Co. faces a range of challenges demanding data-driven solutions. To meet these challenges head-on, the Data Platform Services (DPS) unit, a part of the IT department, brings together expertise in data engineering, data modelling, data science, and data product management. Established four years ago, DPS emerged to bring a structured, data-first approach to Breuninger's operations. Since its inception, the unit has continuously grown and now manages a data lake brimming with up-to-date raw data from vital source systems like ERP, logistics, and the online store. In conjunction, a business pool has been curated, housing business objects that empower self-sufficient utilisation, as described in Tarush Aggarwal's article in Issue 3 of The Data Scientist. This reservoir of insights serves various purposes, with one notable application being the creation of an integrated buying and planning solution. This solution acts as a guiding light for
the merchandise department, illuminating the path to informed decisions that align seamlessly with customer preferences. At the core of this approach lies a meticulously detailed customer demand forecast. This forecast serves as the foundation for merchandise management, encompassing stock modelling, order management, assortment curation, and financial planning for both new product introductions (NPI) and perennial stock items (NOS, never-out-of-stock). With a vast catalogue of several hundred thousand stock-keeping units (SKUs), this intricate orchestration is adjusted on a weekly or even daily basis, covering an extensive 18-month time horizon. Driving the technical part of the implementation is a supply chain management service provider. This collaborator is tasked with the technical realisation of the project: crafting a user platform, developing the forecast and its plug-ins in the software, and addressing the nuanced, Breuninger-specific requirements integral to the project's success. Guided by an agile project management approach, the process unfolds in iterative stages. The initial emphasis revolves around refining the forecast and optimising software performance, all centred around a prototype product group - a strategic launching point for an innovative journey.
ERECTING THE FOUNDATION
Now, let's delve into the fundamental requisites of the customer demand forecast. It's a dynamic landscape, encompassing a multitude of critical factors, including sales transactions, prices, promotions, opening hours, customer data, and extending all the way to macroeconomic and market data. The demand forecast we're crafting operates at a high level of granularity, working with SKUs, demand locations, and week granularity, all spanning a substantial 18-month window. With each weekly forecast run, our aim is steadfast: to furnish reliable outcomes, backed by the latest insights and employing precise algorithms and methodologies.

“[An integrated buying and planning solution] acts as a guiding light for the merchandise department, illuminating the path to informed decisions that align seamlessly with customer preferences.”

As we navigate the complexities of forecasting for a distinguished upmarket department store, two primary challenges come to the forefront: intermittent sales patterns and fleeting product lifecycles. Our inventory thrives on fashion items, which enjoy brief moments in the spotlight, heavily influenced by pricing and promotions. But the journey of product management commences even earlier, with crucial decisions around granularity and master data structure. These foundational choices ripple through every subsequent development, shaping the trajectory of the entire venture.
The first milestone on this path was carving out our demand location structure. Considering the inventory's lens, Breuninger's warehouse configuration mirrors distinct stores and an e-shop arm for sellable wares. In light of this, we opted to define demand locations pertinent to forecasting by blending warehouse and channel considerations. This initial structural choice is a reminder that the granularity of your demand location directly impacts forecast quality. Yet, this granularity must harmonise seamlessly with the customer demand structure, ensuring its efficacy across diverse use cases.
The subsequent key definition, one wielding substantial influence over forecast accuracy, centres on the product master. Intermittent sales patterns and the prevalence of NPIs with their transient lifecycles present the main challenge. To address this, we're driven by the pursuit of delivering forecasts with elevated granularity. Employing a progressive profiling technique, generating forecasts at a higher product level is poised to yield superior outcomes. To facilitate profiling and comprehensive analysis, our product master data furnishes a spectrum of levels and hierarchies, ranging from the nitty-gritty SKU specifics up to broader product groupings (such as shoes or jackets) and even further to encompass department hierarchies (like womenswear).
In the face of NPI challenges, our focus shifted to enriching the product master with data attributes tailored for forecasting. Quality data pertaining to standard department store attributes such as colour, size, and customer demographics, alongside product-specific fashion dimensions like heel height and scarf length, stand as sentinels of precision within the realm of retail business. By imbuing the forecast with these rich attributes, we ensure our accuracy is resolute, even amidst the nuanced landscape of fashion's ephemerality.

UNVEILING THE TIME SERIES: A JOURNEY INTO UNIVARIATE DEMAND FORECASTING
With demand locations and a refined product master in place, our focus turned to the bedrock of our endeavour: constructing a comprehensive time series. Envisioning the diverse applications of the customer demand forecast and embracing the existing demand location and product structure, we embarked on a journey to create a time series based on weekly sales transactions, quantified in units. The time series mirrors the desired output granularity, encapsulating SKUs, demand locations, and weeks or days. This canvas was adorned with every form of transaction that captures customer demand - a rich
tapestry that transcends traditional sales. Notably, for online sales, we considered the order creation date rather than the invoice date, while for product trials (a Breuninger customer service benefit), we tracked the moment of withdrawal. A pivotal choice was the omission of customer returns: retailers, especially in the online realm, grapple with a considerable return rate, one that should remain excluded from the “pure customer demand forecast” aimed at merchandise management or financial planning. However, for use cases involving stock modelling, returns could not be ignored. Thus, a parallel return forecast
materialised, rooted in historical returns and correlating with the customer demand forecast - more on this shortly. Having forged our time series, the next cornerstone was assortment information. This repository essentially outlines the items earmarked for forecasting, their horizons, and designated demand locations. This data isn’t solely about forecasting; it casts a long shadow on future forecast accuracy. The assortment must harmonise with the overarching business strategy, accounting for launch dates and avoiding confusion with the purchase or production planning periods, which factor in lead times. The goal, always, is to forecast customer needs, understanding that sales are the linchpin. Granularity depends on nuances like simultaneous product launches across all stores or location-dependent launches, the sequence of product group releases, and the timing of alterations. We prioritised a robust level of granularity to ensure flexibility. With our time series in hand and guided by assortment information, refining our results in the
realm of retail meant rectifying outliers and mitigating stock-outs. Understandably, when products aren't available, sales naturally taper off - this phenomenon affects both in-store and online scenarios. We took a savvy approach by employing sellable stock in each demand location as a reference for stock-out rectification. Naturally, the lowest granularity, in our case, SKU/store level, is pivotal here. Yet, luxury department store dynamics introduce challenges due to high intermittency, heightening the risk of redundant data cleansing. As a starting point, we employed a stock-out cleansing strategy that focused on filtering out very low sellers and identifying zero stocks, which we then corrected using diverse interpolation techniques. To address general time series outliers, we employed seasonal interquartile ranges and sigma methodologies.
Finally, with these foundational steps in place, we advanced to construct our maiden forecast - a univariate forecast. During exploratory data analysis (EDA) and rigorous testing, key performers emerged, including TBATS, Croston, Prophet, simple moving average and ARIMA, alongside rule-based inductions. Further enhancing the decision-making around algorithm application, we incorporated a segmentation logic based on sales volume and variability. For our initial release, we seamlessly integrated these models into our weekly batch run, employing best-fit logic by comparing WMAPE scores. Nevertheless, these model selections bear a strong dependency on article types and granularity levels. Recognising this, we initiated a product profiling method based on insights garnered from EDA. Yet, the landscape remains fluid; we're acutely aware that as time progresses, fresh evaluations will steer our course. When addressing granularity, considering the sporadic
nature of demand, we recognised that forecast level and output level might diverge. Thus, we embraced a higher forecast level based on optimal results per product family and deployed profiling techniques centred around product, location, and time granularity requirements. For instance, to perform stock modelling on a daily level, we channelled past daily data for proportions, factoring in operational hours, promotions, and other variables.
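To make the best-fit idea above concrete, the sketch below scores a few candidate univariate models on a hold-out window and keeps the one with the lowest WMAPE. It is a minimal illustration rather than Breuninger's production logic: the candidate set, the 13-week validation window and the column handling are assumptions, and models such as TBATS, Prophet or ARIMA would simply be further entries in the candidate dictionary.

```python
import numpy as np
import pandas as pd

def wmape(actual, forecast):
    """Weighted MAPE: total absolute error divided by total actual demand."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.abs(actual - forecast).sum() / max(np.abs(actual).sum(), 1e-9)

def moving_average(train, horizon, window=8):
    """Project the mean of the last `window` weeks over the whole horizon."""
    return np.repeat(np.asarray(train, float)[-window:].mean(), horizon)

def seasonal_naive(train, horizon, season=52):
    """Repeat the value observed one season earlier; fall back to the last value."""
    history = np.asarray(train, float)
    if len(history) < season:
        return np.repeat(history[-1], horizon)
    return np.array([history[-season + (h % season)] for h in range(horizon)])

def croston(train, horizon, alpha=0.1):
    """Croston's method for intermittent demand: smooth demand size and interval."""
    demand, interval, gap = None, None, 1
    for y in np.asarray(train, float):
        if y > 0:
            demand = y if demand is None else alpha * y + (1 - alpha) * demand
            interval = gap if interval is None else alpha * gap + (1 - alpha) * interval
            gap = 1
        else:
            gap += 1
    rate = 0.0 if demand is None else demand / interval
    return np.repeat(rate, horizon)

def best_fit(series: pd.Series, horizon: int = 13):
    """Hold out the last `horizon` weeks, score every candidate, keep the winner."""
    train, valid = series.values[:-horizon], series.values[-horizon:]
    candidates = {"moving_average": moving_average,
                  "seasonal_naive": seasonal_naive,
                  "croston": croston}
    scores = {name: wmape(valid, fn(train, horizon)) for name, fn in candidates.items()}
    return min(scores, key=scores.get), scores
```

In a weekly batch run this selection would execute per SKU and demand location, with the winning model refitted on the full history before generating the 18-month forecast.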
HARNESSING THE MACHINE: Crafting the machine learning-driven tomorrow
Until now, our journey was paved with the creation of a robust univariate statistical forecast - a reliable companion for NOS articles. Yet, recognising that the realm of fashion predominantly thrives on NPIs, we ventured into the heart of our endeavour: constructing potent multivariate and machine learning (ML) models, a vital step to satiate the demands of NPI forecasting. As the ensemble of these models unfurled, we first fortified them with the bedrock of driver data. Echoing back to our earlier discussions, we infused these models with the richness of product-dependent attributes and fashion dimensions. Steering forward, we ventured into the realm of prices and promotions. Undoubtedly, the dominion of prices, especially regular retail and reduced prices, holds profound sway over retail enterprises. Markdowns and sales periods wield unparalleled influence over overall demand, thus rightfully claiming their place within the forecast fold. To distil these drivers, we considered historical price and markdown data, blending it seamlessly with future projections. Notably, markdown strategies planned over extended horizons can yield significantly refined results, encompassing both the immediate and distant future. A similar narrative plays out with the inclusion of promotion
data. The efficacy of your forecast hinges on the availability and quality of your promotions data. In our scenario, we delineated between article-based, customer-based, and location-specific promotions. These categories might at times overlap (as in the case of brand events), necessitating meticulous structuring to prevent duplication. Just as with prices, a comprehensive promotions dataset includes both historical and future data. Holiday and special events data also found a place in our pantheon of drivers. While sales during these periods naturally manifest in the demand time series, delivering explicit event information remains pivotal, particularly for weekly forecasts where weekly structures might diverge. Depending on the product and customer groups, macroeconomic and sociodemographic data could also augment forecast accuracy. Even weather data, offering insights into store performance during distinct weather conditions, was factored in. And in a retail setting, the inclusion of customer data, centred around behaviour patterns, added another layer of depth. This data encompassed a spectrum of aspects, from evaluating sales across diverse target groups and customer profiles to tracking online interactions. Armed with quality driver data, product attributes, and fashion dimensions, we charted onward to construct the multivariate and machine learning forecasts. Given fashion’s inherent dynamic of NPI prevalence, ML forecasts emerged as the stalwarts of our strategy. Guided by insights drawn from EDA and the outcomes of previous data tests within the initial project scope, we forged ahead with LightGBM and Catboost - champions that exhibited the most promising results. Similar to our approach with the univariate forecast, our compass guided us towards what worked best for a given product family at a particular juncture in time. Adaptability
remained our watchword. With the univariate and ML forecasts in hand, the final lap encompassed the application of best-fit logic, diligently seeking the pinnacle of results for our weekly outputs. This orchestration, the culmination of our arduous journey, was more than a mere computational endeavour; it encapsulated our need for precision, customer insights, and the ever-evolving landscape of retail dynamics.
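As a rough sketch of how such driver data can feed a gradient-boosting model, the snippet below trains a LightGBM regressor on weekly demand with price, promotion, event and attribute features. The column names, the Tweedie objective and the hyperparameters are illustrative assumptions, not the configuration used in the project.

```python
import pandas as pd
import lightgbm as lgb

def train_demand_model(df: pd.DataFrame) -> lgb.LGBMRegressor:
    """Fit a gradient-boosting model on one row per SKU x demand location x week."""
    categorical = ["product_group", "demand_location", "colour", "promo_type"]
    numeric = ["regular_price", "reduced_price", "markdown_pct",
               "is_holiday_week", "avg_temperature", "weeks_since_launch"]
    for col in categorical:
        df[col] = df[col].astype("category")   # LightGBM handles categoricals natively

    features, target = df[categorical + numeric], df["units_sold"]
    model = lgb.LGBMRegressor(
        objective="tweedie",    # a common choice for non-negative, intermittent demand
        n_estimators=500,
        learning_rate=0.05,
    )
    model.fit(features, target)
    return model

# At forecast time, the same driver columns are assembled for future weeks from planned
# prices, the markdown calendar and the promotion plan, then passed to model.predict().
```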
BEYOND THE HORIZON: Envisioning beyond demand
The journey towards an all-encompassing planning tool transcends the creation of a sole demand forecast - it's a symphony orchestrated by numerous considerations. In the realm of online shops, a formidable challenge emerges in the form of returns. To cater to this complexity, fashioning a return forecast becomes imperative for use cases like stock modelling and financial planning. Pioneering a nuanced approach, we birthed a distinct time series rooted in returned transaction data, capturing the original sales juncture. By integrating past sales, we paved the way for return probabilities to surface across various granularities - ranging from transaction specifics to broader product family categories, grounded in statistical significance. In practice, these probabilities intertwine with actual sales and the demand forecast, forging a projection that spans the entire forecast horizon. To fortify this approach, machine learning models were harnessed to analyse attributes that exert influence on customer returns - attributes such as material type or patterns that unfurl diverse facets of return dynamics. Venturing further, the landscape widens with a kaleidoscope of potential forecasts. For instance, catering to aged stock consumption enters the fray. As time elapses, old stock necessitates calculated consumption trajectories, rooted
in comprehensive insights to avert wastage and optimise resource allocation. Additionally, projecting into the future, we contemplated estimations on forthcoming prices, spanning a lengthier horizon. This realm demands an intimate understanding of market dynamics, the intricate dance between supply, demand, and the economic currents that shape them.
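To make the return-forecast idea from earlier in this section concrete, here is a minimal sketch that estimates return probabilities from historical transactions, falling back from SKU level to product family where there is too little support, and applies them to the demand forecast. The column names and the support threshold are assumptions for illustration, not the project's actual logic.

```python
import pandas as pd

def return_rates(sales: pd.DataFrame, min_orders: int = 200) -> pd.DataFrame:
    """Estimate return probabilities per SKU, falling back to the product family
    when an SKU has too few historical orders to be statistically meaningful."""
    sku = sales.groupby(["product_family", "sku"], as_index=False).agg(
        sold=("units_sold", "sum"), returned=("units_returned", "sum"))
    family = sales.groupby("product_family", as_index=False).agg(
        sold=("units_sold", "sum"), returned=("units_returned", "sum"))
    family["family_rate"] = family["returned"] / family["sold"]

    sku = sku.merge(family[["product_family", "family_rate"]], on="product_family")
    sku["return_rate"] = (sku["returned"] / sku["sold"]).where(
        sku["sold"] >= min_orders, sku["family_rate"])
    return sku[["sku", "return_rate"]]

def return_forecast(demand_forecast: pd.DataFrame, rates: pd.DataFrame) -> pd.DataFrame:
    """Project returns over the forecast horizon as demand x return probability."""
    out = demand_forecast.merge(rates, on="sku", how="left")
    out["expected_returns"] = out["forecast_units"] * out["return_rate"].fillna(
        rates["return_rate"].mean())
    return out
```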
In the grand tapestry of crafting an all-powerful planning tool, these supplementary forecasts and enrichments emerge as keystones. Each forecast dances in harmony with others, their outcomes intertwined with the symphony of customer needs, industry trends, and the ebb and flow of retail's ever-evolving tide.

“Anchored in the ethos of data-driven precision, our journey has been an exploration of art and science, where meticulous methodology marries the artistry of fashion's ephemerality.”

ILLUMINATION THROUGH INSIGHT: Monitoring Progress, Measuring Success
The bedrock of data-driven planning lies in stability - rendering a forecast that's both unwavering and accurate. In our pursuit of this essence, vigilant monitoring and continuous measurement of progress emerged as indispensable components, woven into the very fabric of our project. In this journey, the lifeblood of stability resides in actual sales data and assortment data. Recognising their pivotal role, we embarked on a dual-tiered monitoring approach. The first layer, a technical vigil, stands as a sentinel, safeguarding the daily transmission of data. Beyond the technical dimension, our scrutiny deepened to the very content being transmitted, guided by a litmus test of time series quality and the count of active items - a preliminary checkpoint ensuring the data's integrity.
Yet, it's the performance of our forecasts that takes centre stage in this narrative. A mosaic of key metrics - WMAPE, Bias, R², and MSE - paints a comprehensive picture of forecast efficacy. In our iterative pursuit of forecast enhancement, these metrics unfurl in tandem with the development of the project. As we weave new drivers into our equation and experiment with diverse models, this ongoing evolution provides a dynamic canvas for comparison. Our goal: to etch a clear demarcation between forecast errors before and after introducing new drivers, or venturing into uncharted models. As these metrics oscillate and trends emerge, they form the basis for informed decisions, steering our journey towards refined forecast models. This perpetual dialogue between insights and outcomes, driven by vigilant monitoring and meticulous measurement, crystallises our commitment to not only delivering a stable and accurate forecast but also perfecting it over time.

A DESTINATION, A BEGINNING
In the intricate tapestry of retail, where dynamic demands and ever-evolving trends interlace, our expedition from best guesses to an all-powerful demand forecast has unveiled a roadmap of innovation. Anchored in the ethos of data-driven precision, our journey has been an exploration of art and science, where meticulous methodology marries the artistry of fashion's ephemerality. From the inception of the Data Platform Services (DPS) unit to the meticulous construction of our univariate and multivariate forecasts, our narrative has unfolded in layers. The first steps, framed by the need for refined granularity, led us to erect a robust foundation -
the demand location and product master. Through these pillars, we harnessed the potency of time series construction, nurturing it with sales data, assortment insights, and an unyielding resolve to rectify outliers and stock-outs. With our gaze fixed on the fashion horizon, we ventured into the realm of multivariate and machine learning models, where prices, promotions, and even weather danced as drivers. This strategic dance was underpinned by a profound understanding that each attribute, each thread, wove together the tapestry of forecast accuracy. But our voyage was not merely about crafting forecasts; it was about weaving a comprehensive tool for planning. Encompassing return forecasts, exploring aged stock consumption, and even peering into the crystal ball of future prices, our endeavour transcended singular predictions. It embraced the entire spectrum of retail dynamics, manifesting a comprehensive toolkit for precision decision-making. As we navigated these waters, the vigilant pulse of monitoring and measurement echoed, providing a metronome to the rhythm of our progress. Key metrics like WMAPE, Bias, R², and MSE stood sentinel, guarding the threshold between forecast iterations and allowing us to meticulously mould our predictions into refined instruments. Through these layers, we emerged with an unshakable truth - our journey is not an endpoint, but a stepping stone. The evolution of retail is ceaseless, and so too must be our commitment to innovation. As we unveil our unified demand forecast, a beacon of precision and insight, we do so with the awareness that it is a testament to our dedication, adaptability, and the indomitable spirit of data-driven transformation. Our voyage is a testament that, while retail’s horizon may shift, our resolve to chart its course stands unswerving.
JACQUES CONRADIE
SCALABLE DATA MANAGEMENT
IN THE CLOUD LEVERAGING DATA MESH PRINCIPLES TO DRIVE DATA MANAGEMENT AT SCALE
By JACQUES CONRADIE Jacques is considered an expert in the field of data management & analytics, with a demonstrated track record of shaping and leading technology change in the financial services and oil and gas industries. He is a Principal Consultant (Data & Analytics) at CGI and is currently working for a large Dutch financial services company headquartered in Utrecht, Netherlands. In his role as Product Manager, he leads two teams responsible for improving data reliability within the Global Data Platform. Founded in 1976, CGI is among the largest IT and business consulting services firms in the world. We are insights-driven and outcomes-based to help accelerate returns on your investments. Across hundreds of locations worldwide, we provide comprehensive, scalable and sustainable IT and business consulting services that are informed globally and delivered locally.
THE INCREASING IMPORTANCE OF DATA MANAGEMENT FOR FINANCIAL SERVICES
If a bank loses the trust of its clients, or takes too many risks, it can collapse. Most of us recall the financial crisis that began in 2007 and its horrendous aftermath, which included events such as the collapse of Lehman Brothers (an American investment bank) and the sub-prime mortgage crisis. As a direct result of these events, the Basel Committee on Banking Supervision (BCBS) published several principles for
effective risk data aggregation and risk reporting (RDARR). Today, these principles are shaping data management practices all over the world and they describe how financial organisations can achieve a solid data foundation with IT support. The BCBS 239 standard is the first document to precisely define data management practices around implementation, management, and reporting of data.
THE GLOBAL DATA PLATFORM: A GOVERNED DATA MARKETPLACE
Historically, every use case involving data required a tailored and singular solution. This model for data sharing wasn't ideal as it provided little to no re-use for other data use cases. From a data management perspective, business areas were expected to implement tooling across many different data environments. This was sub-optimal as it required time and effort from producers of data to connect their systems to instances of Informatica Data Quality (IDQ). All these pain points resulted in a slow time-to-market for data use cases and therefore posed a significant threat to the organisation's data-driven (future) ambitions. This triggered a series of questions including, but not limited to, the following:
● Can one deliver scalable data platforms that would
enable data sharing between data producers and consumers in a manner that is easy, safe and secure?
● Can one democratise data management and make it accessible to everyone?
The end goal was simple: a single “platform of truth” aimed at empowering the organisation to govern and use data at scale. As a means to an end, the Global Data Platform (GDP) was launched. Today, the GDP is a one-stop (governed) data marketplace for producers and consumers across the globe to exchange cloud data in a manner that is easy, safe and secure. This flexible and scalable data platform was implemented using a variety of Microsoft Azure PaaS services (including Azure Data Lake Storage and Data Factory).
TRANSACTION MONITORING: A CONSUMER USE CASE
As part of customer due diligence (CDD) within financial services, organisations are often expected to monitor customer transactions. A team of data scientists would typically build, deploy and productionise a model whilst connecting to some or other data platform as a source. As an output, the model will generate certain alerts and function as early warning signal(s) for CDD analysts. Without complete and high-quality data, the model could potentially generate faulty alerts or even worse, completely fail to detect certain transactions in the first place. From a BCBS 239 perspective (described
earlier), banks are expected to demonstrate a certain level of control over critical data and models. Failure to do this could result in hefty penalties and potential reputational damage. This is one of many use cases within the context of financial services, and truly highlights the importance of having governed data where data quality is controlled and monitored. By adopting data mesh principles, the GDP has been able to successfully deliver trusted data into the hands of many consumers.
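As a simple illustration of why governed, quality-controlled data matters here, the sketch below wraps a hypothetical transaction-monitoring model in a data quality gate that withholds alert generation when the feed looks incomplete. The required columns, thresholds and model interface are illustrative assumptions, not the GDP or CDD implementation.

```python
import pandas as pd

# Hypothetical guard: only score transactions that pass basic completeness checks,
# because missing or null records can silently suppress alerts.
REQUIRED_COLUMNS = ["transaction_id", "customer_id", "amount", "timestamp", "counterparty"]

def quality_gate(transactions: pd.DataFrame, expected_daily_volume: int) -> list[str]:
    """Return a list of data quality issues; an empty list means the feed looks healthy."""
    issues = []
    missing = [c for c in REQUIRED_COLUMNS if c not in transactions.columns]
    if missing:
        issues.append(f"missing columns: {missing}")
    present = [c for c in REQUIRED_COLUMNS if c in transactions.columns]
    null_share = transactions[present].isna().mean()
    issues += [f"{col}: {share:.1%} nulls" for col, share in null_share.items() if share > 0.01]
    if len(transactions) < 0.8 * expected_daily_volume:
        issues.append("volume drop: possible incomplete feed")
    return issues

def score_if_trusted(transactions: pd.DataFrame, model, expected_daily_volume: int):
    """Only let the monitoring model raise alerts when the quality gate passes."""
    issues = quality_gate(transactions, expected_daily_volume)
    if issues:
        raise ValueError(f"Data quality gate failed, alerts withheld: {issues}")
    return model.predict(transactions)  # `model` is any fitted alerting model
```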
DATA MESH FOR DATA MANAGEMENT
Although initially introduced in 2019 by Zhamak Dehghani, Data Mesh remains a hot topic in many architectural discussions today. The underlying architecture stretches beyond the limits of a (traditional) single platform and implementation team, and is useful for implementing enterprise-scale data platforms within large, complex organisations.
The associated principles have served as inspiration ever since the beginning of the GDP journey and provided a framework for successfully scaling many platform services, including data management. The following principles form the backbone of the data mesh framework and each of them was carefully considered when the GDP initially geared up for scaling:
● Data domains & ownership
● Data-as-a-product
● Self-service platforms
● Federated governance
For each principle, a brief overview will be provided, followed by an example of the principle in practice.
DATA DOMAINS & OWNERSHIP
Modern data landscapes are complex and extremely large. To navigate this successfully, it is recommended to segregate enterprise data into logical subject areas (or data domains if you will). It is equally important to link ownership to every one of these domains. This is important as data mesh relies on domain-oriented ownership. Within the context of the GDP, data domains were established around the various business areas and organisational value chains (example below):

Retail Business
● Customer
● Payments
● Savings
● Investments
● Lending
● Insurance

Wholesale & Rural Business
● Customer
● Financial Markets
● Payments
● Lending

Within the Retail Business, for example, there exists a Customer Tribe with several responsibilities. These responsibilities typically span from core data operations (like data ingestion) all the way to data management-related operations (like improving data quality). This model ensures that each domain takes responsibility for the data delivered.
To support data management-related operations, many domains appoint so-called data stewards, because data governance is still being treated as something separate and independent from core data operations. However, it is not
feasible to increase headcount in proportion to the vast amounts of data that organisations produce today. Instead, data stewardship should be embedded within those teams who are building and delivering the data.
DATA-AS-A-PRODUCT
Data-as-a-Product is another important data mesh principle and challenges the traditional perspective of separating product thinking from the data itself. It describes how this change in perspective can truly impact the way we collect, serve and manage data. By merging product thinking with the data itself, we start treating data consumers as customers, and we try
our utmost to provide our customers with experiences that delight. In his book Inspired, Marty Cagan emphasises three important characteristics behind successful technology products that customers love.
When building and releasing data products, the various domains are expected to adopt a similar way of thinking. From a platform perspective, it is recommended to always deliver something compelling to use. Within the GDP, this was realised by offering platform services that are easy to use and transparent, consisting of shared responsibility (Bottcher, 2018).
For example, when the GDP initially released a solution for data quality monitoring and reporting, we asked ourselves:
● Is our service intuitive to use?
● Do our users have actionable insights available as a result of the service that we deliver, which is essentially DQ monitoring?
● What is the scope of user vs. platform responsibility? Are both parties accepting responsibilities?
To successfully manage data-as-a-product independently and securely, data should be treated as a valuable product, and responsible domains should strive to deliver data products of the highest grade. If we treat data as a by-product, instead of treating data-as-a-product, we will fail to prioritise muchneeded data management at the risk of losing consumer trust. Without consumers, do we really have (data) products? And without (data) products, do we truly have a business?
SELF-SERVICE PLATFORMS
It was highlighted earlier how traditional data architecture often involved a single platform and implementation team. Data mesh draws a clear distinction between the following:
● A platform team that focuses on providing technical capabilities to domain teams and the needed infrastructure in each domain
● Data domains that focus on individual use cases by building and delivering data products with long-term value
In other words, data mesh platforms should enable the different data domains to build and deliver products of the highest grade completely autonomously. As an outcome, self-service will be enabled with limited, if any, involvement needed from the central platform team. Sadly, this model is not always reflected in practice. When considering data quality (DQ) management, for
example, it is clear that a traditional approach is still prevalent in many organisations today. This approach involves intensive code-based implementations where most DevOps activities are taken care of by central IT. The result could be a turnaround of 1 day (at best) to build, test and deploy a single DQ rule. A practical example of self-service in action is the GDP’s so-called DQ Rule Builder application. This front-end application promotes accelerated DQ monitoring and caters for a variety of DQ rules via a user-friendly interface (developed using Microsoft Power Apps). The end-to-end solution gathers user requirements and intelligently converts these into productionised DQ rule logic. This approach has automated many parts of the traditional build/deploy process and resulted in record-level turnaround times for the organisation. As an added benefit, both IT and business were empowered and platform users could essentially start serving themselves.
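The DQ Rule Builder itself is a Power Apps front end, but the core idea of compiling a declarative rule definition into productionised check logic can be sketched in a few lines of Python. The rule schema and the generated SQL below are illustrative assumptions rather than the GDP's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DQRule:
    """A declarative data quality rule, as a user might define it in a form."""
    name: str
    table: str
    column: str
    check: str                      # one of: "not_null", "unique", "in_range"
    params: Optional[dict] = None   # e.g. {"min": 0, "max": 100} for "in_range"

def to_sql(rule: DQRule) -> str:
    """Compile the declarative rule into a SQL query that counts violations."""
    if rule.check == "not_null":
        predicate = f"{rule.column} IS NULL"
    elif rule.check == "unique":
        return (f"SELECT {rule.column}, COUNT(*) AS violations FROM {rule.table} "
                f"GROUP BY {rule.column} HAVING COUNT(*) > 1")
    elif rule.check == "in_range":
        predicate = (f"{rule.column} NOT BETWEEN "
                     f"{rule.params['min']} AND {rule.params['max']}")
    else:
        raise ValueError(f"Unknown check type: {rule.check}")
    return f"SELECT COUNT(*) AS violations FROM {rule.table} WHERE {predicate}"

# Example: a business user defines a rule; the platform productionises the check.
print(to_sql(DQRule(name="iban_not_null", table="payments.transactions",
                    column="iban", check="not_null")))
```

In this sketch, a front end would simply populate the DQRule fields from a form and schedule the generated query, so users never touch the SQL themselves.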
FEDERATED GOVERNANCE
Self-service without governance can quickly turn into chaos and therefore the final pillar of data mesh includes federated data governance. Data mesh relies on a federated governance model where e.g. capabilities for data management are owned centrally (usually by the platform team) and utilised in a decentralised manner (by the cross-functional domain teams). What we have noticed in the on-premise world is that data management capabilities used to be separate and far away from the data itself. Thankfully, the public cloud has enabled us to embed these capabilities much closer to the data.
Essentially, core platform services have been extended to data governance (DG), which makes it possible for users to access services for data quality, data lineage, etc. completely out of the proverbial box. When studying data mesh articles, the word “interoperability” pops up quite frequently. This is defined as “the ability of different systems, applications or products to connect and communicate in a coordinated way, without effort from the end user”. Therefore, DG services should be designed to seamlessly integrate with existing and future data products.
CONCLUSION: TIPPING THE SCALE TOWARDS SCALABLE DATA MANAGEMENT IN THE CLOUD
How are you and your organisation viewing data management? Is it regarded as something controlling, slow and limiting? Or is it frictionless and adaptable whilst at the same time democratised? In one of his articles, Evan Bottcher shares that truly effective platforms must be “compelling to use”. In other words, it should be easier to consume the platform
capabilities than building and maintaining your own thing (Bottcher, 2018). The story of the Global Data Platform (GDP) illustrates how organisations can effectively tip the scale towards scalable data management in the cloud by creating a compelling force driven by data mesh principles.
TARUSH AGGARWAL
RISE OF THE DATA GENERALIST:
SMALLER TEAMS,
BIGGER IMPACT By TARUSH AGGARWAL AFTER GRADUATING WITH A DEGREE IN COMPUTER ENGINEERING FROM CARNEGIE MELLON IN 2011, TARUSH BECAME THE FIRST DATA ENGINEER ON THE ANALYTICS TEAM AT SALESFORCE.COM. MORE RECENTLY TARUSH LED THE DATA FUNCTION FOR WEWORK, WERE HE GREW THE TEAM FROM 2 TO OVER 100. IN 2020, TARUSH FOUNDED THE 5X COMPANY TO SUPPORT ENTREPRENEURS IN SCALING THEIR BUSINESSES THROUGH LEVERAGING THEIR DATA.
In the evolving data landscape, driven by AI and advanced data vendors, data teams are expected to become more compact, prioritising efficiency over team size. In a landmark blog post from 2017, Maxime Beauchemin talked about the “Rise of the Data Engineer.” This marked the moment when data engineering became a recognised job title, with Facebook leading the way in 2012. But let's take a step back and see how we got here.
LOOKING BACK: DATA BOOM AND SPECIALISATION
A decade ago, we saw the rise of cloud-based super apps like Facebook, Yelp, and Foursquare. These apps needed to handle vast amounts of data in the cloud, and we didn't have the right tools and infrastructure to make the most of this data. Things like data pipelines (ways to move data), data storage, modelling (making data usable), business intelligence (BI), and other tools were either not around or just getting started. This deficiency in data management tools led to the need for specialisation in the field.

The changing face of data roles
A new breed of professionals, known as "data engineers," emerged in response to this growing challenge. They were responsible for creating tools and infrastructure to handle and optimise data in the cloud, filling the crucial gap. Over the past decade, as the data landscape has evolved, we've witnessed the emergence and transformation of various roles within the field to meet the changing demands of the data ecosystem.
1. Data Analysts: They translate structured, modelled data into actionable business insights. They play a vital role in interpreting data effectively, empowering organisations to make informed decisions.
2. Analytics Engineers: Building upon the foundation of data analysis, analytics engineers, often associated with dbt, introduced software engineering practices to the data world. Their focus shifted towards building analytical models, reducing the need for ad hoc analysis, and progressively converging with data engineering responsibilities.
3. Data Scientists: The need for data scientists arose with the increased availability of structured and actionable data. These individuals possess expertise in statistics, programming, and domain knowledge, allowing them to extract valuable insights from extensive and intricate datasets. Their role involves informing decision-makers and uncovering data-driven patterns to shape business strategies.
4. Data Visualisation Engineers: Specialising in transforming raw data into easily understandable visual representations, data visualisation engineers combine data analysis, design, and technical skills. Their objective is to create visually appealing graphics that facilitate interpreting complex information.

But here's another reason we ended up with so many roles: the data world got really complicated. Each part of the data puzzle had its own set of tools. This complexity meant companies needed bigger teams to manage it all, translating to larger spends and management challenges. This trend was fuelled by buzzwords like 'data is the new oil' and cheered on by investors and pundits. As a result, companies often boasted about the size of their data teams. However, 18 months ago, things took a turn...

ACHIEVING MORE WITH LESS
It's time to refocus on the true purpose of these teams: driving business growth. It's no longer about building large teams or using trendy tools; it's about generating ROI for the business. The era of unchecked spending on data teams is behind us. Today, our goal is efficiency, which is going to be achieved by solving for both the platform and people:
1. Reducing the overhead of managing multiple vendors: As the data landscape becomes increasingly complex, "Fully Managed" is the newest category on the block. It eliminates the need for managing multiple vendors, handling vendor discovery, POCs, negotiation, procurement, integration, and maintenance. There are over 500 vendors in 30 different data categories. The analogy we use is all these vendors are selling car parts. Imagine walking into a Honda and instead of selling you a Civic, they sold you an engine and you had to build your own car. Businesses spend a lot of time figuring out which vendors to use and how to integrate them. This results in large, cumbersome data teams to manage these vendors, leading to rising costs and complexity. The fully managed category will give you all of the advantages of an end-to-end platform with the flexibility of best-of-breed vendors. Depending on your industry, use case, size and budget you will be able to build a tailored platform for your business. By 2025, I anticipate that 30% of data teams will migrate to a fully managed solution.
2. Creating a lean team that moves fast: As we start to consolidate tooling, we have an equally or arguably bigger opportunity to reduce team size; "Do more with less" is becoming a theme. People costs represent a substantial portion of a company's expenses, making this shift significant. Moreover, tooling has become more mature, and many current data roles would not justify dedicated titles moving forward. As a result, we're witnessing the rise of a new archetype - the "data generalist" - who can operate around different areas of the data stack. Here are a few examples of consolidation in titles in light of this shift:
● Data platform engineers - They typically comprise 20% of the data team. Tooling consolidation and fully managed solutions will allow them to adeptly manage complex tasks like vendor integrations, access control, governance, and security, often without dedicated resources.
● Data engineers - The rise of data generalists will enable them to utilise automated ingestion tools to construct and manage data pipelines with increased efficiency.
● Data analysts / Analytics engineers - The need for specialised analytics engineers and data analysts is on the decline because intuitive tools and concepts (like activity schema) have made data modelling simpler and more accessible. Moreover, some of these tools offer helpful insights and recommendations, making it easier to take on these tasks.
● Data scientists - We're going to see a lot of AI platforms that run complex models. These platforms require little knowledge to tweak and feed your data in. This will pave the way for generalists to operate at a higher level with less sophistication.
● BI engineers - The rise of conversational BI on top of the semantic layer has automated the tasks usually executed by BI Engineers. Additionally, LLM features like chat have made it far more intuitive to answer business questions.
Sure, we will still need specialised talent, especially at larger companies, and there will still be enough workflows for each role, but in general, smaller to medium-sized companies will need their data team members to have generalist skills as well. Over time, there is a possibility that the next generation will focus on an all-rounded approach to data instead of specialising in data science, data engineering, or other skill sets. Existing specialists will also need to retrain to gather more rounded expertise. Specialists who don't retrain run the risk of getting left behind. For example, a number of years ago, Microsoft Stack specialists worried that their skills had limited use outside of the Microsoft ecosystem as the industry evolved. Just like everything else, highly skilled specialists will continue to excel, but an increasing percentage may struggle to adapt to a job market that increasingly favours generalists and offers a shrinking number of highly specialised roles.
IN CONCLUSION
The data landscape has evolved significantly. We've moved from having big teams to focusing on efficiency. This shift has given rise to data generalists who can handle various tasks, making traditional roles less necessary. Larger teams may still specialise, but the skills gap is widening, prompting specialists to explore other paths. Adaptability is key. Expect lean, agile, and highly effective data teams building on top of fully managed data platforms.
Over time, there is a possibility that the next generation will focus on an all-rounded approach to data instead of specialising in data science, data engineering, or other skill sets.
BENJAMIN GLICKSBERG
DATA SCIENCE PLATFORMS
FOR PATIENT-CENTRED DRUG DISCOVERY BENJAMIN GLICKSBERG, PH.D. CURRENTLY SERVES ON THE LEADERSHIP TEAM OF CHARACTER BIOSCIENCES AS THE VP AND HEAD OF DATA SCIENCE AND MACHINE LEARNING. Dr Glicksberg was an Assistant Professor in AI for Human Health and Genetics and Genomic Sciences at the Icahn School of Medicine at Mount Sinai, where he led a team using machine learning on multi-modal and multi-omic patient data for personalising medicine. His research applications range from predictive modelling to drug discovery. Dr Glicksberg received his Ph.D. from the Icahn School of Medicine at Mount Sinai and completed post-doctoral work at the University of California, San Francisco.
CHALLENGES OF TRADITIONAL DRUG DEVELOPMENT PROCESS Bringing a new drug to market is a complex, lengthy endeavour and is beset with challenges. The traditional drug development process can span over a decade and requires substantial investment, on the order of 100s of millions to billions of dollars. Furthermore, there is a high failure rate for novel drugs across the stages of development. Lastly, even if a drug gets approved, there is no guarantee that it will be equally beneficial to patients with different demographic and clinical characteristics. Despite the numerous successful treatments available, it is clear that new strategies should be developed to overcome these challenges to generate a higher likelihood of success and more treatment options for patients. With recent
advancements in machine learning techniques and computing power, data science is poised to not only streamline drug discovery but also identify more personalised medicine applications.
PRECISION MEDICINE AND PERSONALISED TREATMENT APPROACHES Precision medicine aims to provide the right therapy for the right patient at the right time. It is becoming increasingly apparent that medicines do not work the same way for everyone. They can have varying levels of safety and efficacy for individuals with different characteristics. Some of this variability can be linked to genetics and is, therefore, especially relevant in diseases with high levels of genetic contribution. In certain cancers, for instance, there may be particular causal
genetic mutations that drive disease pathogenesis. As such, therapies that target those specific disruptions will be beneficial to individuals with that particular genetic background. So-called complex diseases, like type 2 diabetes, are polygenic, often having multiple genetic components of smaller effects. Unlike targeting a single, likely causal genetic signal, personalising medicine for these types of diseases requires a multifaceted approach. The challenge is further compounded as the manifestations of these diseases can often be varied between individuals. Emerging biostatistical techniques and increased availability of human genetic data across complex diseases are facilitating strategies to identify the best therapies for individuals across heterogeneous diseases.
THE UNIQUE CHALLENGES OF STUDYING AND TREATING COMPLEX PROGRESSIVE DISEASES
Progressive diseases, such as Alzheimer's disease, develop over time and are associated with the complex physiological process of aging, leading to another set of unique problems in developing novel therapies. Most importantly, it is imperative that enough data can be collected as the disease develops and progresses to not only understand the physiological changes but also what components, such as genetics, could drive these changes. Obtaining data spanning years with genomic data at discrete states is exorbitantly costly and operationally difficult to collect. While various initiatives and databases exist that try to accumulate such data, like the UK Biobank and the All of Us Program, large gaps still remain for the comprehensive study of longitudinal, complex diseases. As an example, osteoarthritis, like other progressive diseases, has varying degrees of severity, often stratified by stages, which reflect key milestones in pathophysiological, molecular, and morphological changes. These stages are distinctly marked by the presence of biological traits that develop throughout the course of the disease. For instance, moderate osteoarthritis is characterised by joint space narrowing, while severe osteoarthritis is characterised by large reductions in joint space and significant osteophyte growth. Therapeutics aimed at treating osteoarthritis can have different strategies by targeting various stages of the disease. While "reversing" the disease course is ideal, this is extraordinarily challenging, and strategies of this kind are limited across medicine as of now. Many therapeutic strategies, instead, focus on delaying the progression or growth of a key biophysical property or preventing or prolonging conversion to more advanced stages of the disease. In order to pursue therapeutic strategies at various stages of the disease course, it is imperative to effectively characterise disease progression according to pathophysiological properties and relevant biological pathways.

DEVELOPING PRECISION MEDICINES FOR DATA-DRIVEN PATIENT SUBTYPES OF COMPLEX DISEASES
In order to bridge these various gaps that currently exist in the drug development space for complex, progressive diseases, it’s essential to develop precision therapeutics based on personalised characteristics, such as genomics. The goal of precision medicine is not only to identify effective targets based on these characteristics but also to determine who should take them and when. Clinicogenomic data, which couples longitudinal patient data with genomics, is essential for this goal. Such patient data, like clinical diagnosis and imaging, is necessary in order to disentangle the complex mechanisms of the progression of diseases over time. Unfortunately, it is often the case that no comprehensive dataset encompasses such requirements, at least for diverse patient populations. As the field grows, it is imperative that represented clinical-genomic biobanks be developed from a network of consented patients across the country. Both clinics and patients must agree that the data collected during their regularly scheduled visits can be utilised for analyses. For genetically defined progressive diseases, it is imperative to study changes in progression rates that coincide with relevant endpoints for registrational trials rather than disease risk. It is also necessary to study the underlying data modalities that are assessed as part of clinical trials. In many progressive diseases, such relevant endpoints must be extracted from imaging data. Chronic obstructive pulmonary disease (COPD), for instance, relies upon high-resolution CT scans to detect structural alterations in the lungs, like fibrosis. In Osteoarthritis, structural endpoints consist of such things as joint space narrowing, which can be detected in X-rays. Clinico-genomic biobanks are poised to facilitate such analyses as these data are often collected during routine care. These biobanks often consist of DNA and longitudinal clinical data from Electronic
Health Records (EHR), text from patient notes, as well as imaging data.
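To make the shape of such a clinico-genomic record more concrete, here is a minimal sketch of how one consented patient's longitudinal data might be organised. All field names, codes, and storage paths are illustrative assumptions rather than the schema of any particular biobank.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ImagingStudy:
    study_date: date
    modality: str          # e.g. "XR" for X-ray, "CT", "MRI"
    body_site: str         # e.g. "knee", "lung"
    file_uri: str          # pointer to the image file, not the pixels themselves

@dataclass
class ClinicalEvent:
    event_date: date
    code: str              # e.g. a diagnosis code pulled from the EHR (illustrative)
    description: str

@dataclass
class ClinicoGenomicRecord:
    patient_id: str
    genotype_uri: str                                   # pointer to genotyping / sequencing data
    diagnoses: list[ClinicalEvent] = field(default_factory=list)
    imaging: list[ImagingStudy] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)      # free-text clinical notes

# One hypothetical patient with longitudinal knee X-rays collected during routine care
record = ClinicoGenomicRecord(
    patient_id="P-0001",
    genotype_uri="s3://biobank/genotypes/P-0001.vcf.gz",
    diagnoses=[ClinicalEvent(date(2019, 3, 1), "M17.11", "Primary osteoarthritis, right knee")],
    imaging=[
        ImagingStudy(date(2019, 3, 1), "XR", "knee", "s3://biobank/imaging/P-0001/2019-03-01.dcm"),
        ImagingStudy(date(2021, 3, 15), "XR", "knee", "s3://biobank/imaging/P-0001/2021-03-15.dcm"),
    ],
)
```

The key design point is that genomics, diagnoses, imaging, and notes hang off a single patient identifier over time, which is what makes progression analyses possible later.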
HOW A DATA SCIENCE PLATFORM CAN ENABLE PATIENT-DRIVEN DISCOVERIES Multi-modal data science platforms can be used for novel drug target discovery and patient stratification. The key basis for these applications is the focus on disentangling the progression of a disease rather than incidence. Specifically, all individuals in such a cohort should have the disease of interest, allowing the focus to be on why some individuals progress more quickly than others rather than why an individual develops the disease in the first place. The nuance here is critical as certain genetic variants may play a large role in influencing progression but are "hidden" by the overwhelmingly large signal from genes that confer disease risk. Characterising "cases" only over time by progression (covered more thoroughly in the next section) allows for more nuanced analyses that could reveal key genetic modifiers. The genetic markers identified from these analyses could be used as the basis for drug targets: if these factors control progression, perhaps targeting these factors with therapeutics at key time points could curtail further advancement. In addition to novel target discovery, decoding these markers can also be used for patient stratification. The concept of patient stratification aligns with the premise of personalised medicine: diseases may manifest via different mechanisms in individuals and, therefore, require tailored treatment selections. The goal of patient stratification is to identify which individuals would be most likely to respond to treatments. This goal can be achieved in concert with drug development: if therapies are developed targeting certain genetic markers, patients can be stratified by the carrier status of these markers. Risk scores can be generated per patient for their genetic status that stratify patients as biomarker-positive or biomarker-negative. It can then be hypothesised that patients who are biomarker-positive should respond better to the developed treatment. While this strategy doesn't necessarily benefit those who are biomarker-negative, it at least allows for more informed treatment decisions; there certainly is a benefit in not giving an ineffective treatment and/or one that has an increased risk of adverse events for a given individual. Of course, the success of discovering novel drug targets of progression and affiliated patient stratification strategies using multi-modal patient biobanks is dependent on the quality and composition of the underlying data. Furthermore, the jump from dataset to biomarker often requires intricate analyses based on biostatistics and machine learning that address underlying challenges in the data.
THE CAVEATS AND BEST PRACTICES OF USING REAL-WORLD DATA IN BIOMEDICAL RESEARCH The power of real-world data, or patient data collated from routine care, is readily apparent. Real-world data has unlocked a world of research beyond the confines of clinical trials and prospective studies. That being said, there are certain issues that can arise in the analysis of such data if not taken into proper context. Patient data are often not uniform in many datasets. Data collected from multiple clinics in different health systems in diverse geographic settings may have slightly different data type representations. For instance, cardiac MRIs that are collected as part of hypertrophic cardiomyopathy monitoring can be captured via different machine brands, which, in turn, have slightly different output formats. Visualisations can come in different resolutions and scales. Furthermore, the practising physician may decide to take imaging based on slightly different protocols based on their experience and at their discretion. Depending on the scale at which they need to analyse the
pathology, he or she may choose to view specific regions, fields of view, or axes (planes) of the heart. Put together, while there are many similarities in the data collected within a single disease, there is actually a large amount of underlying heterogeneity which needs to be taken into account.
There are many potential biases that can result in real-world data, especially with such heterogeneous data. One issue is information bias: does each site qualify disease presence and stage in the same way? Another issue relates to bias by indication, which can be reflected in the decision of which imaging protocol is selected and when. In more advanced stages of the disease, the treating physician may want to perform a 3D visualisation at a finer level of detail, where cheaper, more "crude" imaging would have sufficed in earlier stages. Similarly, the clinician may decide to perform imaging more frequently to detect minute changes over a short period of time that may reflect conversion to a more severe disease state. Therefore, the mere frequency of imaging available or the number of slices in a given scan can "leak" information due to its relationship with the disease state. As a result, it is challenging to compare data across sampling protocols, as it is not exactly comparing apples to apples. These are just a few of many potential biases that can exist, and careful steps need to be taken to separate true signal from noise in real-world data.
Data quality control and standardisation should be performed at all steps of the process. Provider sites should be examined before enrollment to ensure data are captured electronically and in an accessible format. Ideally, the format of the data should also be interoperable, such as the Fast Healthcare Interoperability Resources (FHIR) HL7 format for EHRs. Robust electronic phenotyping should be conducted to ensure the images taken match disease stage diagnosis, as errors in data collection can occur. This verification step should ideally be performed by independent clinical experts. Internal bias checking is also imperative. One check can be performed by strict inclusion/exclusion criteria to ensure patients have a sufficient amount of data across modalities and time. Relatedly, one should make sure that there are no differences between those with and without comprehensive data, which may reflect a bias in access to healthcare. All biomedical images that are analysed should be standardised with extraneous information removed before modelling. If different machine types are involved, some kind of calibration should be performed to align output values. In real-world datasets, the quality and robustness of data are often tied to the uniformity of screening practices. Regardless, proper quality control of any real-world dataset is imperative for robust insights.
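As a rough illustration of the inclusion/exclusion and bias checks described above, here is a minimal pandas sketch. The column names, thresholds, and required modalities are illustrative assumptions, not a prescribed standard.

```python
import pandas as pd

def apply_inclusion_criteria(visits: pd.DataFrame,
                             min_visits: int = 3,
                             required_modalities: tuple = ("imaging", "genotype")) -> pd.DataFrame:
    """Keep only patients with enough longitudinal data across the required modalities.

    `visits` is assumed to have one row per patient visit with columns:
    patient_id, visit_date, modality, site_id.
    """
    # Depth: enough distinct visit dates per patient
    counts = visits.groupby("patient_id")["visit_date"].nunique()
    has_depth = counts[counts >= min_visits].index

    # Breadth: every required modality present for the patient
    modalities = visits.groupby("patient_id")["modality"].apply(set)
    has_breadth = modalities[modalities.apply(lambda m: set(required_modalities) <= m)].index

    eligible = set(has_depth) & set(has_breadth)
    return visits[visits["patient_id"].isin(eligible)]

def visit_frequency_by_group(visits: pd.DataFrame, included: pd.DataFrame) -> pd.Series:
    """Simple bias check: compare visit counts for included vs. excluded patients,
    which can surface differences in access to healthcare between the two groups."""
    flag = visits["patient_id"].isin(included["patient_id"]).map({True: "included", False: "excluded"})
    return visits.assign(group=flag).groupby("group")["visit_date"].count()
```

In practice these checks would sit alongside electronic phenotyping and image standardisation, but even this simple filter makes the cohort's completeness criteria explicit and auditable.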
DEEP PHENOTYPING: USING COMPUTER VISION TO CREATE NOVEL QUANTIFIABLE DIGITAL BIOMARKERS As mentioned, many progressive diseases progress in stages that are characterised by specific pathophysiological changes that are often captured in biomedical images. While it is useful to analyse gross progression on this scale, such as a time-toevent analysis of conversion to later stages of the disease, studying more nuanced aspects of each stage will allow for a more refined understanding through a genetic lens. Accordingly, deep phenotyping refers to the comprehensive, detailed, and systematic analysis of phenotypic abnormalities (observable traits or characteristics) in an individual, often within the context of a particular disease. Deep phenotyping often involves assessing the intricate interplay of molecular, physiological, and environmental factors that give rise to observed clinical symptoms or traits. Not all of these aspects can be directly measured with conventional technologies, but real-world data collected can partially serve this purpose by identifying observable patterns reflective of underlying endophenotypes. Put together, one primary goal of deep phenotyping is going from the abstract, qualitative, or broad sense to a more quantitative representation or assessment. For instance, the intermediate stage of dry age-related macular degeneration is characterised by the presence of large drusen, tiny yellow or white lipid deposits under the retina, mild vision loss or changes, and/ or potential pigmentary changes. There is, of course, a lot of heterogeneity that can be contained within this classification within the pre-specified ranges for inclusion (i.e., >125 µm drusen size). Additionally, patients can have different compositions of the three, which may reflect divergent etiologies and, therefore, potential novel drug targets. This enhanced granularity is particularly valuable in the realms of personalised medicine and genomics, as it allows researchers to link specific genetic variants with detailed phenotypic outcomes instead of the overall disease stage. Deep phenotyping can represent patients along more
nuanced and multi-faceted lines. As mentioned, deep phenotyping has incredible potential in diseases with image profiling, where many pathophysiological changes are observed and tracked. CT scans, for instance, can help visualise tumours, which are used to grade colon cancer stages. Exploring tumour biomarkers as quantitative measures such as size and tissue layer location, rather than presence/absence, can facilitate more nuanced genotype/phenotype associations, which will be explored in more detail below. However, manually quantifying the hundreds of thousands to millions of images for these features by experts would be exorbitantly costly and time-intensive. Therefore, machine learning techniques like computer vision can be used to achieve this at scale. Computer vision relies on a subset of machine learning called deep learning, which allows the processing of images through multiple layers of neural network connections. The recent advancements in computer vision have made a huge impact on everyday life, from facial recognition to automated driving. For such medical purposes, computer vision can be used to analyse images automatically for such purposes as classification, i.e., determining if an image is of a certain class, or segmentation, i.e., identifying and outlining certain features of interest. Colon cancer can be used as an illustrative example of the utility of imaging biomarkers. The stages of colon cancer are in part separated by tumour size and location within tissue layers. Therefore, it would be invaluable to quantify and localise tumours within CT scans automatically and at scale. In order to build models
to achieve this task, experts have to provide training examples for the machine to learn from. These examples contain manually labelled images for the features of interest and examples for which no feature is present (negative controls). Successful application of these segmentation models allows for the quantification of the “real world” size of these features for all images for all patients across time. In this way, not only can size be quantified and tracked, but changes over time can be calculated, forming progression phenotypes, both within and across individuals. Clear patterns often emerge that differentiate patients along these lines: some individuals have tumours that grow rapidly, while others have ones that stay at the same size for prolonged periods of time. Furthermore, the localisation of tumours can also be compared and contrasted across individuals to define another phenotype. The rate at which the tumour invades the various layers of the colon can be quantified and compared via computer vision. Additionally, the rate at which cancer is spread to other organs if at all, can also be compared. Put together, these traits are just some of the unique ways by which a heterogeneous disease like colon cancer can be investigated. Computer vision and longitudinal imaging data can form progression phenotypes across various dimensions, each of which allows for a personalised understanding of nuances that are highly variable between patients. Coupling these progression phenotypes with genetics can allow for the identification of signals that can explain the underpinnings of the heterogeneity.
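A minimal numpy sketch of the last step, turning longitudinal segmentation output into a quantitative progression phenotype (here, a growth-rate slope), is shown below. It assumes binary tumour masks and a known pixel size, and is not tied to any particular segmentation model.

```python
import numpy as np

def lesion_area_mm2(mask: np.ndarray, pixel_spacing_mm: tuple[float, float]) -> float:
    """Convert a binary segmentation mask (1 = tumour pixel) into a physical area."""
    return float(mask.sum()) * pixel_spacing_mm[0] * pixel_spacing_mm[1]

def growth_slope(days_since_baseline: np.ndarray, areas_mm2: np.ndarray) -> float:
    """Fit a straight line to area over time; the slope (mm^2 per day) is the progression phenotype."""
    slope, _intercept = np.polyfit(days_since_baseline, areas_mm2, deg=1)
    return float(slope)

# Toy example: three scans of one patient, with masks produced by some upstream segmentation model
masks = [np.zeros((64, 64), dtype=np.uint8) for _ in range(3)]
masks[0][20:30, 20:30] = 1          # 100 tumour pixels at baseline
masks[1][20:32, 20:32] = 1          # 144 pixels at ~6 months
masks[2][20:34, 20:34] = 1          # 196 pixels at ~12 months

days = np.array([0, 180, 360])
areas = np.array([lesion_area_mm2(m, (0.5, 0.5)) for m in masks])
print(growth_slope(days, areas))    # positive slope indicates a growing lesion
```

Computed per patient, a slope like this (or an analogous rate for tissue-layer invasion) becomes a continuous phenotype that can be compared across individuals and fed into the genetic analyses described next.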
UNDERSTANDING PERSONALISED GENOMICS IN THE LENS OF PROGRESSION PHENOTYPES
Generating these multifaceted progression phenotypes is just a necessary first step for personalised medicine drug discovery. Most genetic associations with diseases are identified as those of susceptibility, or in other words, genetic signals that differentiate those with a disease from those without. Separate from overall disease development, there is growing evidence and examples that there are also genetic signals that control the progression of disease, which may differ from those that are associated with susceptibility. As mentioned, the genetic variants that mediate progression may be fruitful drug targets that often remain hidden due to the complexity of modelling complex progression patterns. Deep phenotyping of imaging biomarkers can enable the generation of phenotypes that track progression along various axes or elements that constitute complex progressive diseases. Modelling how these endophenotypes, or components, change over the
course of the disease allows for more refined genetic association analyses of progression. Performing genetic analyses like Genome Wide Association Analyses (GWAS) on these progression phenotypes can reveal signals that mediate severity that are not apparent when comparing cases vs. controls. Many of these signals are often novel, but some can overlap with susceptibility genes, indicating multiple functions of those variants. The hits identified from progression GWAS comparisons can then be funnelled into subsequent selection and screening steps to determine the feasibility of moving forward with drug development based on these genetic signals. Apart from uncovering new drug targets and developing drugs based on them, another prime objective of data science platforms is response-based patient stratification or discerning which patients would most likely benefit from these therapeutics. The hypothesis behind this aim is that individuals with disruptions in a specific genetic signal, targeted by a therapeutic, stand to gain the most benefit from it. These genetic signals, typically originating at the Single Nucleotide Polymorphism (SNP) level, reside in genes - collections of nucleotides often numbering in the tens of thousands or more. Genes can then be
categorised based on various pathways or cascades of interconnected biological functions. Individuals have slight variations in SNPs, some of which have been associated with causing issues, while others confer no functional or observable differences. Patients can accordingly be characterised by having genetic variations in known (i.e., from the literature) or discovered (i.e., via data science platform) genetic signals relating to the disease. This disease burden can be reflected as a Polygenic Risk Score (PRS), which is based on the cumulative effect of multiple genetic variants. PRS can be conceived for patient stratification across multiple dimensions, from disease susceptibility to progression phenotype and beyond. This stratification can also be used for predicting therapeutic response. For instance, PRS can be further characterised by relevant biological pathway burden to match the purported mechanism of developed drug targets. Patients with high genetic PRS in the targeted pathways may be prioritised to receive the therapeutic, while those with low PRS may fare better with an alternative medication. In this way, clinical-genomic data science platforms can be a two-sided coin: seamless interconnectivity between genetic discovery and application.
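A minimal numpy sketch of a polygenic risk score as a weighted sum of allele dosages, with a simple threshold used for biomarker-positive/negative stratification, is shown below. The SNP weights are placeholders, not real GWAS effect sizes.

```python
import numpy as np

def polygenic_risk_score(dosages: np.ndarray, effect_sizes: np.ndarray) -> np.ndarray:
    """PRS per individual: sum over SNPs of (effect-allele dosage 0/1/2) x (GWAS effect size)."""
    return dosages @ effect_sizes

# Toy data: 4 individuals x 3 progression-associated SNPs (dosage of the effect allele)
dosages = np.array([
    [0, 1, 2],
    [2, 2, 1],
    [0, 0, 0],
    [1, 0, 1],
], dtype=float)
effect_sizes = np.array([0.12, 0.05, 0.30])   # placeholder betas from a progression GWAS

prs = polygenic_risk_score(dosages, effect_sizes)
threshold = np.median(prs)                     # illustrative cut-off; in practice this is calibrated
biomarker_positive = prs >= threshold          # candidates expected to benefit most from the targeted therapy
print(prs, biomarker_positive)
```

Restricting the SNP set to a targeted biological pathway, as the text describes, simply means choosing which columns and effect sizes enter the sum; the mechanics of the score stay the same.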
THE FUTURE OF DATA SCIENCE IN PRECISION DRUG DISCOVERY
The future is both exciting and promising for the role of data science platforms in precision drug discovery. There are many other ways in which data science platforms can support, refine, and enhance the drug discovery and application processes beyond what has been described so far. There is continued and growing interest in companion diagnostics, or FDA-approved screening processes that officially designate who should be prescribed a therapeutic based on some personalised condition. While genetics can fit this need, it is only the beginning. Human biology is a multi-scale collection of systems at various molecular and cellular layers. Other -omics, like transcriptomics, proteomics, and beyond, can be used to personalise medicine in a fuller sense, which undoubtedly will be more successful than any single level in isolation. On a practical level, data science platforms can also help refine clinical trial inclusion and exclusion criteria, or who should enter a clinical trial or not. The retrospective real-world data can be used to perform data-driven calculations of collections of patient features and endophenotypes that are most associated with an outcome of interest. In this way, a platform can be a continuous learning machine where past data can help inform future studies. We are only at the cusp of this data-driven renaissance in drug development. The fusion and sustained assimilation of multi-omic data with refined, nuanced longitudinal phenotypes are poised to catapult personalised pharmaceutical innovation, surmounting many of the limitations that challenge the field today.
PHILIPP M DIESINGER
THE DAWN OF SYNTHETIC DATA In the rapidly evolving landscape of technology and data, a groundbreaking trend is emerging: the rise of synthetic data. As data becomes the lifeblood of modern businesses, researchers and developers are looking for innovative solutions to harness its power while addressing privacy concerns and data scarcity. Synthetic data - artificial datasets generated to mimic the properties of real-world data - is gaining momentum.
PHILIPP IS A DATA SCIENTIST, AI ENTHUSIAST AND ESTABLISHED LEADER OF LARGE-SCALE DIGITAL TRANSFORMATIONS. HE IS A PARTNER AT BCG X. PHILIPP HOLDS A PHD IN THEORETICAL PHYSICS FROM HEIDELBERG UNIVERSITY AND SPENT THREE YEARS AT MIT, DEVELOPING A STRONG BACKGROUND IN AI RESEARCH AND LIFE SCIENCES.
Synthetic data is generated by algorithms and models that replicate the statistical properties, structures, and relationships found in real data. It is often used as a substitute for actual data, especially in cases where privacy, security, or a limited dataset pose challenges. In other words, synthetic data is information that has been artificially created by computer algorithms, as opposed to traditional real data based on observations of real-world events. Synthetic data is not a new concept. Academic disciplines, including computational physics and engineering, have long employed synthetic data. These fields have successfully modelled and simulated complex systems, spanning from molecular structures and intercity traffic to optical devices and entire galaxies. These simulations are grounded in first principles, generating data that portrays the behaviour of these systems. Subsequently, this synthetic data is subjected to statistical analysis to create insights and predict system properties. Additionally, synthetic data is often generated using known statistical probability distributions of system components. This method also allows for the creation of synthetic data, even from limited datasets, by empirically measuring distributions and then sampling them to expand and augment the dataset. Well before the advent of computational power, mathematicians employed analytical techniques: they derived probability distributions from first principles and propagated them to the system level, often utilising theories like the central limit theorem. While the notion of synthetic data is not a recent development, its relevance has witnessed a significant upswing in recent years, and the number of industry applications has increased dramatically. Synthetic data finds applications spanning a multitude of industries. Notably, the realm of autonomous vehicles, aircraft, and drones relies on training these technologies with hyper-realistic 3D-rendered data. Industry giants like Amazon have made a name for themselves by employing synthetic data to instruct their warehouse robots in recognising and managing packages of diverse shapes and sizes. The healthcare sector is increasingly harnessing synthetic data to train AI systems, ensuring that patient privacy remains uncompromised. The surge in relevance of synthetic data is aided by
innovative data generation techniques fuelled by the accessibility of cost-effective computational resources and abundant data storage capabilities. This synergy has led to the emergence of a multitude of cutting-edge approaches to synthetic data generation.
New methods for the synthesis of data include Generative Adversarial Networks (GANs), which are deep learning models that consist of competing generator and discriminator neural networks. The generator learns to produce data that is indistinguishable from real data, while the discriminator learns to differentiate between real and synthetic data. GANs are widely used in generating realistic data, especially in the domains of image, audio, and text data. Variational Autoencoders (VAEs) are another type of generative model that learns to encode and decode data. VAEs can be used to generate new data points that are similar to the given training data. Methods for creating synthetic data still strongly depend on the type of data being generated, as well as their respective verticals.
Synthetic data can be as good as - or sometimes even better than - real data for training AI systems. One of the most significant advantages of synthetic data is its potential to safeguard privacy. With the increasing awareness of data protection laws like GDPR and HIPAA, organisations must ensure that sensitive information is not exposed. Synthetic data allows for the creation of datasets that maintain statistical accuracy while eliminating any personal or sensitive information. Synthetic data can often be the solution to a data bottleneck created by privacy protection.
Synthetic data can be used to augment real datasets, expanding the size and diversity of the data available for machine learning and AI models. This enables more robust model training and enhances model generalisation, ultimately leading to better performance. Developers can create synthetic datasets with varied scenarios and edge cases that might be challenging to collect in the real world. This diversity, or "data variability", is crucial for testing the resilience and adaptability of AI systems in different conditions.
Synthetic data is a cost-effective alternative to collecting, storing, and managing large volumes of real data. It saves time, money, and resources, making it an attractive option for startups and organisations with limited budgets. Synthetic data can help overcome naturally occurring limitations of real-world data due to actual physical constraints. It can also be used to overcome cold-start problems: small organisations might not have sufficient data to develop their AI models and, therefore, might choose to augment the existing data with algorithmically generated data.
Synthetic data must be created with care. Artificial data must exhibit the same properties of the underlying systems as real-world data would. Generating high-quality synthetic data requires sophisticated algorithms and substantial computational resources. While synthetic data mimics real data, it sometimes cannot capture all the nuances, anomalies, or subtleties present in genuine datasets. This limitation may affect the performance of AI models later in real-world scenarios. Unanticipated events that can occur in real life may challenge AI models. Ensuring that synthetic data accurately represents the distribution of real data is still a significant challenge. Biases and inaccuracies can lead to models that do not perform well on real-world data.
In the data-hungry field of GenAI and machine learning, the generation of high-quality synthetic data plays an increasingly important role and can render significant competitive advantages. Training neural networks requires large amounts of data, and data quality and quantity are significant drivers of the performance of AI models. Real-world datasets often suffer from limitations and biases, which can result in biased machine-learning models. Synthetic data can oftentimes provide larger data variability and thus better training data for AI models, enabling them to learn and predict system behaviour in unusual situations. Data variability is a key driver of model performance under real-world conditions. Training neural networks also often requires large amounts of well-annotated data, and tagging or annotating data can be a significant cost driver in the development of performant neural networks. Besides the cost savings, tagging synthetic data can even be more accurate than annotating real-world data, thus avoiding false labels and reducing noise in training data.
Gartner(1) predicted that by 2030, most data utilised in the field of AI will be synthetic. The ability to create high-quality synthetic data may become a necessity for the development of high-performance AI systems. The advent of synthetic data marks a promising step forward in the realm of data generation and utilisation. It offers tangible advantages, particularly in safeguarding privacy, enhancing data diversity, and optimising resources. However, the technology is not without its challenges, including authenticity, ethical considerations, and the need for sophisticated algorithms.
A fun way to explore the generation of synthetic data is SDV - the Synthetic Data Vault - a system of open-source libraries for the generation of synthetic data. SDV(2) offers a variety of synthesisers that can be used to create synthetic data and benchmark it.
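As a starting point, here is a minimal sketch of generating tabular synthetic data with SDV's single-table API. The class and method names follow the SDV 1.x documentation and may differ in other versions; the dataset itself is a toy example.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# A small real dataset we want to mimic (could equally be loaded from a CSV)
real = pd.DataFrame({
    "age": [34, 45, 23, 51, 38, 29],
    "income": [42_000, 58_000, 31_000, 75_000, 49_000, 36_000],
    "churned": [0, 1, 0, 1, 0, 0],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)            # infer column types from the real data

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real)                           # learn the marginal distributions and correlations
synthetic = synthesizer.sample(num_rows=1_000)  # draw a larger, artificial dataset
print(synthetic.head())
```

Swapping the synthesiser class is how you would experiment with the different generative approaches discussed above, while keeping the fit/sample workflow the same.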
(1) gartner.com
(2) github.com/sdv-dev/SDV
Luke AI-Walker: I'm struggling to defeat this AI-powered Death Star. My team of Data Science rebels needs to work faster and achieve more. How can we finish our Rebel AI Interceptor in time? Any advice, YodaGPT?
YodaGPT Superintelligence Model: Luke, in troubled times you find yourself. Strong, your Data Science rebels are, but in the right formation they are not. The DST Profiler you must use, a creation of Data Science Talent it is. Understand the unique strengths of each member, it will. In positions where they truly excel, it will guide them. Enhanced productivity, harmony, and innovation it will bring. More successful projects they will deliver, and the AI-enabled Death Star you will defeat.
Luke AI-Walker: Hold on, Yoda. Are you suggesting recruiters created a tool that can help me lead better?
YodaGPT: Doubtful, you sound, young AI-Walker. Much more than mere recruiters, they are. To craft this powerful tool from ancient days, a Senior Data Scientist and two engineers they engaged. A beacon of knowledge, this Data Scientist is, with a PhD in statistics and two decades of wisdom. Over 250 Data Scientists he has assessed, in roles of leadership. Underestimate them, you must not.
Luke AI-Walker: Alright, I'm intrigued. How does this help us outsmart the Death Star?
YodaGPT: Questions, you have. Answers, I will provide… Each team member's profile, a beacon of insight it is, showing their unique strengths and talents. The dashboards, like the Force, bestow you with knowledge, and its visualisations immediately reveal the path. Guide you to intelligent decisions about which missions to undertake and with whom, it will. The dashboard possesses a potent power - the team profile function. A group, when chosen, reveals its synergy. You'll discern if together they can stand against the darkness of the Death Star, or if other alignments are needed. Trust in the DST Profiler, you must. Help you optimise your current team, it will. Time is of the essence, young AI-Walker. May the data be with you...
The DST Profiler® - the world's first technical profiling system created specifically for Data Scientists
You don’t need “The Force” to manage your team more effectively and get better results. You just need the data and insights you get from the DST Profiler ®
Contact us now to find out more and get a 30-minute demo of the system:
Tel: +44 8458 623 353 Email: info@datasciencetalent.co.uk
DST Profiler® - May the Data be with you...
REX WOODBURY
THE AI REVOLUTION
vs
THE MOBILE REVOLUTION & OTHER TECHNOLOGICAL REVOLUTIONS
REX WOODBURY is the Founder and Managing Partner of Daybreak, an early-stage venture capital firm based in New York. He partners with Pre-Seed and Seed founders building products with the potential for viral adoption. Before founding Daybreak, Rex was a partner at Index Ventures. He writes Digital Native, a weekly technology publication, for an audience of 50,000+ global readers.
Technology revolutions take time. Despite the hype for AI right now, we're still early: while 58% of American adults have heard of ChatGPT, only 18% have used it. In recent months, ChatGPT monthly active users actually ticked down. I expect we'll need more vertical-specific, user-friendly LLM applications for the technology to really break through. Many of those applications are being built or dreamt up right now. This quest for understanding technological revolutions drove me to read Technological Revolutions and Financial Capital, an excellent book by the economist Carlota Perez (thank you to Rick Zullo for the recommendation). Perez wrote her book in 2002, shortly after the dotcom bubble burst. The book proved prescient in forecasting the 2000s and 2010s of technology, and I believe it offers some key insights for where we sit in 2023. This piece explores how today's AI revolution compares to past revolutions - from the Industrial Revolution of the 1770s and the Steel Revolution of the 1870s to the Internet Revolution of the late 90s and the Mobile Revolution circa 2010. The key argument I'll make is this: the AI Revolution isn't comparable to the Mobile Revolution, as the latter was more a distribution revolution. Rather, AI is
more comparable to the dawn of the internet. Or, more fundamentally, AI is an even larger-scale technology shift - it’s the dawn of a new discrete revolution that’s built not around computers acting like calculators, but computers acting like the human brain. In short, we’re coming to a close of the “Information Age” that started in 1971, and we’re beginning a new era in technology.
HISTORY'S TECHNOLOGY REVOLUTIONS In the moment, it can be difficult to quantify a new technology's impact. This leads to predictions that don't age well - for example, the economist Paul Krugman's 1998 declaration, "By 2005 or so, it will become clear that the Internet's impact on the economy has been no greater than the fax machine's." Yikes. Poor Paul has had to live down that sentence for a quarter-century. This is why it's helpful to zoom out - to study history's past cycles of innovation and to try to discern patterns. This is the focus of Carlota Perez's book. Perez focuses on five distinct technological revolutions from the past 250 years. Each revolution, she argues, was sparked by a "big bang" breakthrough:
Our most recent technology revolution, the dawn of the so-called “Information Age,” began in 1971 with Intel developing the microprocessor. Microprocessors were manufactured with silicon, giving Silicon Valley its name; the rest is history. Technology revolutions follow predictable boom-and-bust cycles. An exciting new technology leads to frenzied investment in that technology; frenzied investment leads to the formation of an asset bubble; that bubble eventually bursts, cooling an overheated market. What makes Perez’s book remarkable is that she wrote it in 2002, shortly after the dotcom bubble had burst. Many people at the time were declaring the end of the internet era. But Perez argued that the bubble bursting was only the middle of a predictable cycle; the dotcom crash was rather
a so-called "turning point" that would usher in the internet's Golden Age (what she calls "Synergy").
According to Perez, technology revolutions follow 50-year cycles. "Turning points" - which often come in the form of a market crash - typically occur about halfway into the cycle. Many crashes bear the name of the revolution's prevailing technology: canal mania (1790s); railway mania (1840s); the dotcom bubble (late 1990s). After a technology's turning point, the technology enters the deployment phase - this is 20 years of steady growth and broad wealth creation. The internet's widespread adoption in the 2000s and 2010s, buoyed by the arrivals of mobile and cloud, bears this out. Looking retrospectively from 2023, Perez's framework appears spot on.
We can observe the “shift” from one technology revolution to the next in the companies that dominate an era. In the 1930s and 1940s, for instance, oil and automobile companies replaced steel companies as the largest businesses in America.
Here are the 10 largest companies in the world in 1990:
Today, of course, tech companies dominate: Apple, Alphabet, Amazon, Microsoft, Nvidia, Meta. Tech domination is more pronounced via market cap, while the table above shows revenues, but the point remains. (Apple, for what it’s worth, brought in $394B in sales last year - with 25% profit margins to boot.) In the 1990s and 2000s, we saw technology begin to dominate; today, Big Tech represents 27% of the S&P 500. But tech domination is also a sign of something else: maturation.
THE NEXT REVOLUTION: AI Take another look at Perez's chart above; the final stage is when a technology begins to mature. And that's what we've been seeing with Big Tech. Companies like Alphabet and Meta have a bad case of arthritis. Maturation extends to private markets. I had a debate with a former colleague the other day - how many companies founded since 2016, we wondered, had hit $100M in ARR? Wiz, Ramp, Deel, Rippling. There might be a few others. But the list is short. And how many new apps reliably hover near the top of the App Store - apps not funded by Bytedance (TikTok, CapCut) or Pinduoduo (Temu)? New revolutions emerge when the potential of the previous revolution approaches exhaustion. And it feels like we're at the exhaustion point. Capital flowed into the venture world over the past decade, but that capital is increasingly chasing point-solutions. Abundant capital is thirsting for a new seismic, fundamental shift. Thankfully, we have one. Enter: AI. The arrival of AI fits Perez's framework with near-perfect timing. For AI, the "big bang" event - to use Perez's terminology - was probably the release of ChatGPT last year. You could argue that the big bang was actually the publication of the seminal paper Attention Is All You Need in 2017, which introduced the transformer model. But I think we'll look back at ChatGPT as the true catalysing moment. Another sign of market saturation and the dawn of a new era: top talent drains from mature, slow-moving incumbents to strike out on its own. A co-author of Attention Is All You Need, Aidan Gomez, left Google to build Cohere.ai. The Google Brain researchers behind Google's image model also left the Big Tech giant, founding Ideogram. When you zoom out, the past 50 years have pretty closely followed Carlota Perez's framework - in much the same pattern that we saw with the Industrial Revolution, with the steam engine, with steel, and with oil and mass production. To oversimplify:
● 1970s and 1980s: Irruption - venture capital is born as an industry, turbocharging the nascent Information Age.
● 1990s: Frenzy - things get a little ahead of their skis.
● 2000-2015ish: Synergy - the Golden Age, with mobile and cloud acting as accelerants on the fire.
● 2015-Present: Maturation.
Exogenous shocks muddy the picture. Many blamed the dotcom crash on the September 11th attacks, for instance, but that was incorrect; 9/11 worsened the market correction, yes, but the bubble had already begun to burst in spring 2001. The Great Recession and, later, COVID and related government spending also cloud the framework. Who could have predicted a mortgage crisis and a coronavirus pathogen? But when you zoom out, the pattern is there. We're now entering Phase One for AI - explosive growth and innovation. This is exciting.
It means that the comparison in the title of this piece - the Mobile Revolution vs. the AI Revolution - is something of a misnomer. AI is bigger, a more fundamental shift in technology’s evolution, than for example the mobile revolution. VR/AR, perhaps underpinned by Apple’s forthcoming Vision Pro, might be a mobile-scale revolution - a massive shift in distribution. That’s probably 5-ish years away. But AI is bigger. The way I think about it: we’re moving from the calculator era to the brain era. Back when computers were being created, there was a debate among experts - should the computer be designed to mimic a calculator, or to mimic the human brain? The calculator group won out (particularly because of technology’s limits) and the computer was born as we know it: literal, pragmatic, analytical.
Computers are very good at… well, computation. They're less good at nuance, reasoning, creativity. Now, of course, that's changing. AI is actually quite good at these things. In fact, AI is now better than humans at many uniquely-human tasks: reading comprehension, image recognition, language
understanding, and so on. One study found that not only did ChatGPT outperform doctors on medical questions, but the chatbot had better bedside manner. (It turns out, ChatGPT doesn’t get as tired, irritable, or impatient as human physicians.)
Computers used to be good at math. Now they can write and draw and paint and sing. Naturally, this new technology epoch will bring with it new opportunities for innovation.
FINAL THOUGHTS: CREATIVE DESTRUCTION The economist Joseph Schumpeter once wrote: "Capitalism is a process of industrial mutation that incessantly revolutionises the economic structure from within, incessantly destroying the old one, incessantly creating a new one." The same is true for technology, which goes hand-in-hand with capitalism; innovation is capitalism's spark, and technology its fuel. We've come a long way since the Information Age began. A terabyte hard drive in 1956 would've been the size of a 40-story building; today, it fits on your fingertip. Amara's Law says that we tend to overestimate the effect of a new technology in the short run and underestimate the effect of that technology in the long run. This means that AI might be a little frothy right now, but if Perez's framework holds, we'll be in for more than one correction in the years to come. But those corrections won't detract from the long-term potential of a new paradigm-shift in technology. There are also open questions about where value will accrue. We're entering this new revolution with a slew of trillion-dollar tech companies at the helm; what's more,
Big Tech has been unusually quick to respond to the threat posed by AI. Traditionally, incumbents tend to avoid radical change for fear of upsetting the apple cart - for fear of sacrificing juicy short-term profits in favour of massive self-disruption. But maybe this time is different. The question remains: will incumbents vacuum up the value, or will more agile, AI-native upstarts be able to win major segments of the market? Time will tell if Google and Meta sound as dated in 2043 as Yahoo! and AOL do in 2023. The internet, mobile, and cloud looked like their own distinct revolutions - but rather, they may have been sub-revolutions in the broader Information Age that's dominated the last 50 years of capitalism. We're now seeing a brand new sea change - one that only comes around every half-century. In other words, we're in for a helluva ride.
This is an excerpt from an article published on Digital Native. To access the full article, including where start-up opportunities will crystallise in the AI revolution, go here: digitalnative.tech/p/the-mobile-revolution-vs-the-ai-revolution
PATRICK MCQUILLAN
THE BIOLOGICAL MODEL By PATRICK MCQUILLAN
PATRICK MCQUILLAN has a successful history leading data-driven business transformation and strategy on a global scale and has held data executive roles in both Fortune 500 companies as well as various strategy consulting firms. He is the Founder of Jericho Consulting and a Professor at Northeastern University and University of Chicago, where he teaches graduate programmes in Analytics and Business Intelligence.
THE CONCEPT OF A BIOLOGICAL MODEL How would you describe the concept of a ‘biological model’ that you developed in the context of data strategy and organisations? It’s essentially the idea that all data, decision-making power, and resources should be consolidated into a single epicentre that’s connected throughout an entire business or organisation. It becomes a self-feeding system that can quickly react to, or predict, any anticipated challenges or bottlenecks based on what it’s historically encountered. Crucially, the model continues to learn and provide an adjacency between access to data and the immediate access of that data by key decision-makers in
the organisation, rather than disseminating it to individual managers on different teams, and having decentralised analytics hubs or centres of excellence that might exist across different verticals that don’t communicate with each other. What’s the difference between your biological model that functions like a nervous system, and centralised data structures that currently exist? For example, small or medium-sized companies tend to be centralised already, as they don’t have the capacity to decentralise everything. Excellent question. The biological model is so named because it simulates the central
nervous system. The nerves collect information on what’s felt in the fingertips and internal organs, and automated processes like blinking and breathing; they’re similar to AI. But the brain is the epicentre: the key decision-making component which processes the non-automated, conscious decisions like grabbing things, and influencing the world. So when using the biological model, it’s crucial to have a robust data foundation. With this in place, the model rectifies two common issues encountered when using traditional models: The first issue is that large companies with decentralised data centres can’t communicate with each other effectively, making it difficult to get the full picture
quickly at the decision-making level. Small and medium-sized businesses encounter a different problem. Although, as you pointed out, their data tends to be centralised, the nervous system may not actually be healthy because it isn’t collecting the right data, or the data it does collect isn’t a sufficient volume to make meaningful decisions. This can really affect the business’s success if, for example, they need to gain the edge over their competitors. Do they rely on faulty data to make that decision, or use non-data sources to help fill
those gaps? That’s usually what happens when the system is technically working, but not necessarily flowing in the way it needs to: it’s not fluid. There may not be as many neurons or not enough information collected. In terms of practical application, this could mean there aren’t enough data sources, or there’s not enough testing. If you’re trying to go to market, it might be you’ve rolled out a new marketing plan without testing in the right markets first, or without finding statistical significance in your results before branching out
and scaling that strategy. Similar to supply chain optimisation (and anything happening on the ops side), these companies may be getting ahead of themselves with a less-than-preferred level of maturity for their data infrastructure: data's going to the brain and being collected, but it's incomplete. They may have too much information on the arm, but not enough on the leg. This creates a system that overcompensates in some areas and undercompensates in others, which becomes a difficult habit to break as you scale.
CENTRALISATION VS DECENTRALISATION What would be the advantages of centralising the data model for large organisations that are currently decentralised? Typically, decentralised large organisations need to scale quickly, so they create different 'centres of excellence.' But these end up being no centres of excellence at all. Supply Chain has its own vertical; Marketing its own vertical; Customer Service its own vertical. Consequently, the organisation ends up with all these different decentralised data centres. While this does operate to a certain degree of functionality within those verticals, in my experience I've yet to see a system like that work in the long run. Individual leaders of those verticals might testify to their vertical's effectiveness, but the people who report laterally to those leaders, or the people who they report up to, will always mention the knowledge gap. That's because each leader is focused only on their lane, without understanding the wider context; how other silos may be impacted. Centralisation is particularly important at the C-suite and board level, where leaders must report performance to shareholders and
make key decisions on a quarterly basis, or even on a monthly basis if something critical is happening across the organisation. Under the decentralised model, it takes the leaders about a month (and costs lots of money) to get all the FTEs on the ground to pull those reports, which are usually substandard to what they would be if centralised. The reports tend to be contextualised within the avenue of each particular vertical. The benefit of having a fully centralised system is that, instead of having eight little brains each connected to different body parts, we have one brain collecting all our data, and making informed decisions from that data. Both the data and the decision-makers are in one place - ready to send reports to the head of the organisation, and to loop all existing lines of data collection and communication into that central base. You can still have the silos, but you need a leader in place: a CTO, a CIO, or some sort of VP of efficacy. The leader needs to take each of those verticals and create a horizontal translation layer where their input is levelled out and presented easily to senior leadership. That person can also be a partner to senior leadership and the other vertical heads. It's a simple fix that costs as little as one additional leader and a small team of two or three. The team can develop a diplomatic relationship with each of these verticals, and set up a simple framework of data collection and reporting which keeps folks accountable, increases transparency, and significantly boosts the organisation's efficiency, with minimal disruption.
In your opinion, what's a good way to convince leaders in verticals to share - to give up some of the ownership or insights that come from their data?
That's a pain point many teams encounter, particularly in large
organisations. Often, they don’t want to relinquish control - not for selfish reasons, but because they’re concerned losing full control will impair performance of their vertical. But in reality, there’s no actual relinquishing of control. It’s more a partnership with these different vertical heads, or with translators at the leadership level who say: your name is going to be on this report one way or the other, and it’s costing you guys $40,000 a month to chase down and pull these metrics together. That amounts to almost $500,000 a year per vertical. Instead, why don’t we save ourselves a few million dollars, get a small team of three or four folks to stand this up, and present it as a partnership? You’ll have more time to innovate, more time to chase down projects, and fewer bottlenecks in your work stream if you let us assist with reporting. And again, your name is going to be on this, so we’ll be able to share this up to C-suite. And C-suite won’t have to ask questions like: what is this and what’s happening? There won’t be any more unpleasant surprises. Instead, it’s going to be something wonderful happening with our name on it. Or if something suboptimal is happening, but we’ve already got ahead of it and we can speak in greater detail about the problem, maybe issue a report before this large meeting, or before these big reports go out to the C-suite. So sharing verticals increases efficiency and creates more confidence. And whether something’s going well - or not as well - they always come out looking better, because if that thing was going to happen one way or the other, this partnership grants the capacity to anticipate the issue. There’s an external partner who understands what they’re working with on the reporting side, and how it will affect different verticals in the organisation. For example, if it’s a supply chain issue and software engineering
should be included, they need someone who can help manage that relationship and both their work streams, so that Supply Chain doesn’t have to worry about Software’s work stream, and vice versa. There’s a third party to help them collaborate, and eventually roll up the same solution as normal, but better. It would be reported more quickly and with greater agility. The organisation would be saving money instead of chasing down reports and having people on the ground hopping off projects to issue ad hoc reports for senior leadership. Ultimately, this impact for the organisation can be sustained over time, and can be scaled quite easily. In the biological model, are there ways of sharing power, or keeping decision-making fluid and flexible at the same time while centralising? I’d argue there’s no sacrificing of power or metrics whatsoever. It’s purely a partnership. To give a perfect example, an organisation I worked with in the past had about eight key verticals, and issues with decentralised reporting: no transparency, little accountability, and inconsistent and inaccurate reporting. So these verticals were doing their jobs, but they weren’t doing their jobs within the context of each other. They weren’t melding as well as they could have. When I first joined and built up my team, the concern was that we would be taking the data away from them; that we would be sacrificing power or influence. But that wasn’t the case. What we actually did was equip them with a framework to make their work look better. So they would be reporting the same metrics: we would not be taking over the calculation of those metrics. We wouldn’t be chasing down those metrics each day, because there might be 1300 KPIs at the organisation, and one team can’t know the narrative for each of those KPIs every single day.
Instead, we cleaned up how the KPIs were reported (you'd be surprised how many times in an organisation something as simple as ROI or cost-per metrics are actually calculated differently across different verticals, but reported as the same thing upward). We called a meeting and basically said: what's going to be our ultimate indicator here, what's the definition we're going to agree on? And we created an alignment. From there, we created data dictionaries that tracked down the individual owners on each team of each KPI. So it's clear who to speak to if there's going to be an issue, if there's a breakage, or an interesting trend happening. Who can we bring into a meeting with the higher-ups? Who in local teams might be working cross-functionally? So we didn't actually take over reporting. We imposed a governance structure to help secure their data, make their performance reporting more confident and more consistent, and keep work across verticals more effective. So the way leaders presented their findings changed: my team would steer those meetings and we would have the KPIs categorised by division. We would report a simple trend, and incorporate the insights their team wanted us to incorporate, but we would hold standing business reviews weekly, monthly, or quarterly, depending on the audience. Usually weekly for each vertical, monthly for C-suite, and quarterly occasionally with the board or CEO. Each of these VPs or SVPs would be sitting on that call, and we would let them speak to the narrative and share the story. But the VPs became more effective decision-makers because now they had another team, my team, that would be helping contextualise those metrics with their team, so they no longer had to constantly chase things down. We tell the story that they want to tell, and we put it in the context of
other folks’ stories. So it becomes a complete view of the business. And most importantly, no data’s being sacrificed. It’s still owned by those teams. It’s just being filtered into a master document that our team is managing. And we’re not changing the values. The only values that could be readjusted are some KPIs, to ensure every team is calculating them, and reporting them, and speaking on them, in the same way with the same understanding - which ultimately makes everyone look good, and gives C-suite and board a lot more confidence to report outward to shareholders and make internal decisions.
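As a concrete illustration of the kind of data dictionary described here, the sketch below shows what a single governed KPI entry might look like. The KPI, formula, owners, and cadence are invented for illustration and are not taken from the organisation discussed.

```python
from dataclasses import dataclass

@dataclass
class KpiDefinition:
    name: str
    definition: str          # the single agreed meaning, in plain language
    formula: str             # how it is computed, so every vertical calculates it the same way
    owner_team: str          # which vertical owns the underlying data
    owner_contact: str       # who to speak to when the number breaks or trends unexpectedly
    reporting_cadence: str   # the standing review it appears in: weekly / monthly / quarterly

data_dictionary = {
    "marketing_roi": KpiDefinition(
        name="Marketing ROI",
        definition="Incremental gross profit attributable to campaigns, relative to campaign spend.",
        formula="(incremental_gross_profit - campaign_spend) / campaign_spend",
        owner_team="Marketing",
        owner_contact="marketing-analytics@example.com",
        reporting_cadence="monthly",
    ),
}
```

Even a lightweight registry like this makes the agreed definition, the owner, and the review cadence explicit, which is what keeps eight verticals reporting the same number the same way.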
That makes a lot of sense. How does the biological model work from the technical side? Do you centralise physical data in one big data centre somewhere? Or keep the data physically decentralised, and just establish this virtual layer on top to funnel into insights and inform decision-making?
Data would be collected from different points, as you mentioned, and we don't want to separate the hand from the rest of the body or mix it with another hand. So we want to make sure the collection epicentres - the data lakes that each vertical is managing - are as undisrupted as possible. It's best to avoid getting involved too deeply in having each individual team manage their backend data engineering, because usually that engineering has been inherited from many years of certain builds, and has certain rules in place. It's a lot for one small team to manage, and it wouldn't be valuable to fully transform that. Instead, we create one or two centralised lakes on top of those existing lakes. So, you have this foundational layer where each vertical's data sources are already being compiled, whether into the cloud or a combination of cloud and manual spreadsheets. But above that foundational layer, we have a curated layer where nothing is being recalculated; we're just filtering away the metrics that we're not concerned about at an executive level. This might already cut 80 or 85% if we're talking 1300 or 2000 metrics. Maybe we want to whittle it down to the metrics that only C-suite or vice presidents - the senior leadership or above - care about. With this approach, the
individual teams can focus more on the smaller metrics that matter in their day-to-day. It will filter those out, and then it’s just an extra layer of collection. And again, to use our biological analogy, that’s to get it into the base of the brain by saying: okay, we can’t over-process a lot of information, so let’s just focus on what needs to be understood. Let’s focus on breathing. Let’s focus on blinking. Let’s focus on vital organ health. So what are the metrics that help us measure that? And what are the metrics that are going to drive the health of the body at a high level, at the key decisionmaking level? Rather than focussing on the health of a joint in the finger for that vertical, we want to make sure the body can breathe, absorb oxygen into the lungs, and maintain healthy functioning at a universal scale. Usually, those will be primary metrics and then what I call secondary or contextual metrics. So maybe you have, say 10 to 30 metrics, that will tell the performance of the entire business. You might add an additional 50
or so to contextualise some of those. Maybe you have a cost per customer service engagement when someone contacts a customer service centre, for example. But then you want to break that out into contextual metrics such as cost per call, cost per email, or cost per chat. So we roll up those primary and secondary or contextual KPIs into a curated layer, which essentially sits at the base of the brain. This curated layer ideally exists on the cloud, but usually it's a combination of cloud - querying the individual cloud lakes each vertical has in its own environment - and pulling manually (or through email upload), on a synchronised frequency from those partner teams, any additional CSVs or department-wide reporting documents that are useful for sourcing additional metrics that might not be loaded into the cloud. The result is a comparatively simple foundation of clean data that's querying already clean data, rolled up into a safe place that can be managed at an executive level without disruption, plus some additional CSV pulls for other data that might not necessarily be coming in. And then those get rolled up into overall reports and infrastructure discussions. So the technical side isn't too complicated when compared with individual verticals.

In the curated layer, how would you establish having one source of truth in data that comes from different verticals? So, as a classic example, client or CRM data that lives in different parts of the organisation and is often conflicted?

One of the highest-value aspects this function delivers is a single source of truth. And that comes down to a strong data governance infrastructure. Most folks are focused on the model, or the outcome, or the narrative which is driving the decisions they're going to be making. But (and this applies to all organisations at all levels) more attention needs to be
paid to creating a sound and strong foundation of data. What that means is imposing universal standards that are simple to put into place. They don't require a lot of coding, engineering, or changing at the foundational layer level. These standards ensure consistent reporting at all levels. The best solution I've found is creating a data dictionary and an end-to-end management process for any changes to the way a KPI is calculated, or the way data's being sourced, to ensure that everything from the backend all the way to the front end can manage and adapt to those changes.

The data dictionary side, for example, is saying: over time we can focus on cleaning up these 2000 KPIs, but for the main ones that we're trying to report at an executive level - the 100 KPIs that are most important for different levels of reports - the most important thing is to have conversations with leaders and say: alright, we have eight verticals. All eight verticals rely on different variants of ROI, but they get rolled up into one large ROI number. So let's sit together and walk through the calculations, and let's bring the folks who make those calculations into the conversation. That can be a standing meeting twice a month for two months - very simple, half an hour of everyone's time. And in these meetings, we say: okay, how are we calculating this? Let's all calculate it the same way. This is obviously easier said than done; sometimes you might have to break it up into two or three KPIs, but that's better than having seven or eight different versions, and you can rename them and contextualise them. So it's getting that alignment on what you're trying to report upward. It comes back to that partnership agreement; trying to horizontalise the
conversation where you get that buy-in from leadership and say: this is going to help all our collective bosses. It's a quick adjustment that's going to help everybody, and it can be done without disrupting previous reporting. And it's better than just creating more KPIs, which, frankly, I think is more disruptive. Instead, we retire the ones we don't need, creating a consolidated list of KPIs fully aligned on their layman's definition, their calculation, and their individual owners. So, if there's a KPI being managed by, say, the North American team, and there's a team in the EU that has it from a different data lake or a different centre, they
try to approximate that calculation as best they can. That way, we can have a breakout that’s as close to one-to-one as possible, and have a leader in the EU team, and the leader in the North American team, who can speak to that. And we have that in place for each KPI. So this dictionary would consist of: the name of the KPI; the team or teams that manage it from an actual calculation perspective; the location of where it can be found from a cloud or data lake standpoint; the layman’s definition; a technical definition; the SQL code or similar code that’s used to calculate it which folks can copy
and paste; and the owner. In this way, the dictionary creates a one-stop shop. It's something that takes a little while to build - maybe three or four months - but if it's maintained, it lives in the organisation forever. And it's extremely good for performance reporting, and a great reference when folks are using self-service dashboards to understand metrics more fluidly and with less translational error.

The end-to-end process, meanwhile, is about connecting those individual KPI owners. Whenever there's an issue such as a data blackout, or a change to a KPI because some reporting rule has shifted or the data that's being collected has changed, there are recurring meetings among a small subset of those groups. In the past, I've led those meetings with just the managers or the analysts: a handful of people who actually pull those metrics. They'll have recurring meetings, and they'll mention if something's changing. And all we have to do is update the dictionary, and then we connect with whoever's being affected by the issue. So if there's going to be a 24-hour data blackout that will affect Marketing, for example, and they're not going to have access to their complete marketing data, we can inform them ahead of time. And while the engineering team is updating that, our team can partner with them on how to mitigate problems while we're waiting for the new solution. So there's no loss of efficiency, money, or time from the backend update through to the front-end team that's actually working with that information. It's basically about having a responsible hand in the management of that information from end to end, top to bottom, and horizontally.
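To make that data dictionary concrete, here is a minimal, hypothetical sketch of what a single entry might capture. Every field name and value below is invented for illustration, and the interview does not prescribe any particular format or language.

```rust
// Every field and value here is invented; a real dictionary could equally live in a
// spreadsheet, a wiki, or a metadata catalogue - the point is the shape of one entry.
#[allow(dead_code)]
struct KpiEntry {
    name: &'static str,
    owners: Vec<&'static str>,      // team(s) that calculate it and can speak to it
    location: &'static str,         // where it can be found in the cloud / data lake
    plain_definition: &'static str, // the layman's definition
    technical_definition: &'static str,
    calculation_sql: &'static str,  // copy-and-paste query used to calculate it
}

fn main() {
    let roi = KpiEntry {
        name: "Marketing ROI",
        owners: vec!["EU Marketing Analytics", "NA Marketing Analytics"],
        location: "curated_layer.marketing.roi_weekly", // hypothetical table name
        plain_definition: "Return generated for every unit of marketing spend.",
        technical_definition: "(attributed_revenue - spend) / spend, aggregated weekly.",
        calculation_sql: "SELECT (SUM(attributed_revenue) - SUM(spend)) / SUM(spend) FROM marketing_spend",
    };
    println!("{} is owned by: {}", roi.name, roi.owners.join(", "));
}
```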
FRANCESCO GADALETA
COULD RUST BE THE FUTURE OF AI?
In a recent tweet, Elon Musk claimed that Rust is going to be the language of artificial general intelligence. Some people might think this claim is an exaggeration (to be taken with a pinch of salt, like many of his tweets), but in my opinion, Musk's tweet confirms the speculation surrounding this amazing language. It's no longer a secret that Rust is a language of the future. My company, for example, has migrated most of its commercial projects from the usual Python stack (Python, C, C++) to Rust, solving many issues of code safety and efficiency. And that's exactly the point of Elon's tweet: Rust has many advantages over Python. In this article, I'll evaluate the various uses of Rust and the ways it has the potential to overtake standard programming languages.
THE LIMITATIONS OF PYTHON TODAY
When it comes to artificial intelligence and machine learning, Python has been the standard for about 20 years. In the fields of data science and machine learning, for example, many Python frameworks are available for data transformations and data manipulation pipelines, as well as for more sophisticated things like deep learning frameworks, computer vision and, more recently, large language models (LLMs). But this is about to change. The latest big trend of large language models is to impose some form of
optimisation when it comes to memory storage and compute, for example. And because Python isn’t the best language for efficiency and optimisation, developers have been working on alternatives.
MOJO AND RUST: NEW LANGUAGES TO HANDLE OPTIMISATION
A language called Mojo has recently been designed by Chris Lattner, one of the field's most talented developers; Lattner is the inventor of Clang and Swift, and also of the LLVM suite of tools for compiling from a high-level language to an intermediate representation (and on to many of the low-level architectures we deal with today). Lattner demonstrated that Python could be up to 35,000 times slower than compiled languages, and he created Mojo to rectify this. Mojo hasn't been released yet, so it's not fully accessible to all. However, it's already proving to be one of the best combinations of Python (or interpreted languages in general) and compiled languages. Mojo's been designed specifically for AI tasks and AI workflows, and it features many capabilities that traditional programming languages don't have.
Rust is another language on the scene; having been the underdog for a while, Rust has recently become wildly popular, and I believe - as perhaps Elon does - that Rust is set to be the best replacement for Python and Python workflows. Why? For one, Rust has key
aspects that set it apart from standard languages: it's compiled, rather than interpreted like Python. It also runs without a heavyweight runtime - again, unlike Python.
Rust does have its drawbacks: it's a notoriously cryptic language (Python, on the other hand, is easy to understand). That said, there are more challenging languages out there: I don't find Rust as difficult to deal with as Perl, for example. Even if you're starting from scratch with Rust, I recommend persevering with it because there are many paradigms of programming in this language that are new, and will prompt you to discover new concepts about concurrency and safety.
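As a small illustration of those paradigms - a sketch, not taken from the article - here are two of the guarantees the Rust compiler enforces for you: ownership moves and bounds-checked indexing.

```rust
fn total_len(items: Vec<String>) -> usize {
    items.iter().map(|s| s.len()).sum()
}

fn main() {
    let names = vec![String::from("alpha"), String::from("beta")];

    // Ownership of `names` moves into `total_len`; using it afterwards is a compile
    // error, which rules out a whole class of use-after-free bugs before the code runs.
    let total = total_len(names);
    println!("total characters: {total}");
    // println!("{:?}", names); // error[E0382]: borrow of moved value: `names`

    let scores = [10, 20, 30];
    // Indexing is bounds-checked: `scores[99]` would panic safely instead of reading
    // arbitrary memory, as an out-of-bounds access could in C/C++.
    if let Some(last) = scores.last() {
        println!("last score: {last}");
    }
}
```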
KEY BENEFITS OF RUST
Back when LLMs had yet to emerge and optimisation was not strictly mandatory (due to the reasonable size of machine learning models), we used to start working with Python (or another compact language), especially in commercial environments. We'd use Python to prototype a new idea and iterate on it with the business. When everything seemed to work as expected, we'd migrate that code base from Python to a better language - usually a compiled one - but only for very demanding sectors. In the last few months, however, with the advent of LLMs, models with several billion parameters are no longer an exception. This means optimisation has become a must, and we have to start with a language that's performant enough to deal with these new billion-parameter models.
PYTHON'S CURRENT ISSUES WITH PERFORMANCE
Data engineers encounter problems with Python whenever they want to increase the performance of their code and workflows. Code snippets often get thrown over the fence by Python developers. Consequently, data engineers have to resolve some very nasty optimisations to make the system work, as they usually do not deal with the machine learning algorithm, or whatever NLP model sits underneath. In other words, they choose to trust that the underlying NLP is built to perform, even though it usually isn't. For example, Docker containers tend to get used to containerise badly specified dependencies. This makes life easier for the developer, but it's a nightmare for the poor engineer who must deal with these massive containers. A pure computer scientist would try to eliminate that obstruction, and find a different solution: usually something increasingly close to the bare metal of the machine. And this involves compiling your code
for that particular architecture, using so-called native binaries that are specifically optimised for each operating system and piece of hardware. And this is where Rust excels. Because Rust was born as a compiled language, it can be optimised as needed: it compiles to low-level machine code that is optimised for a given hardware architecture, and that optimisation is left in the hands of rustc, the official Rust compiler.
RUST'S CAPACITY FOR OPTIMISATION
There are several levels of optimisation with Rust: for example, using crates (a crate is the equivalent of a Python package or a C++ library) that are specifically designed for certain operating systems and hardware. Rust also provides security, portability, and performance - three features of the best programming languages that you don't want to ignore, especially for commercial projects.
One trade-off with Rust, however, is that safe code (code that, for example, doesn't allow the developer to fall into the trap of nasty bugs like memory violations or buffer overflows) comes at a cost. It can be hard work to learn how to avoid these types of bugs, and to learn the programming paradigm specific to Rust. So, don't expect to become familiar with Rust overnight. Further, when you use compiled languages, the native binaries executed on a system could potentially crash the entire system. You no longer have a Python script that runs in a sandboxed runtime environment. When working with Rust or C/C++ and compiling code into native binaries, it's crucial to be aware that these executables can severely impact your system, potentially causing significant issues if there are bugs present.
There are also portability issues. Native binaries can be specific to a particular operating system; this can be a benefit, because of the optimisation, but also a problem, because you reduce portability by compiling, or by using crates that target that particular operating system. So portability is a double-edged sword that can play in your favour, or against it.
When we weigh up the advantages and drawbacks of Rust, performance is the biggest factor: Rust is in fact one of the most performant languages currently available. It's usually compared to C, and it has actually proven to be better than C, especially in terms of safety and concurrency. When it comes to the large language model application stack, there's another important runtime for Rust applications: WebAssembly (WASM). Several languages can compile to WASM: C++, TypeScript, and of course, Rust. This combination of Rust and WASM is what I believe Elon Musk meant when he claimed that Rust is going to be the language of AGI.
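As a tiny, hypothetical illustration of that double-edged sword, conditional compilation lets you target one operating system explicitly, gaining platform-specific behaviour at the price of portability:

```rust
// Conditional compilation: which function gets built depends on the target OS.
#[cfg(target_os = "linux")]
fn platform_note() -> &'static str {
    "compiled for Linux - free to use Linux-only optimisations here"
}

#[cfg(not(target_os = "linux"))]
fn platform_note() -> &'static str {
    "compiled for another OS - the Linux-specific path doesn't exist in this binary"
}

fn main() {
    println!("{}", platform_note());
}
```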
RUST, WASM AND AI
I contend that Rust plus WASM is one of the best combinations when it comes to the large language model application stack. When you consider the LLM application stack, you typically have agents connecting, or connected, to the internet. And of course, they receive asynchronous events. They can connect to databases, they can call web services around your cloud, or even outside your cloud. Rust and WASM provide a very interesting stack for high-performance agent apps, especially when it comes to asynchronous code and non-blocking input-output for high-density applications.
Going downwards in the LLM application stack, we move from the agent to the so-called inference layer. The inference layer tends to perform some of the most CPU-intensive tasks: pre-processing data, and post-processing numbers into sentences or structured JSON data, and so on. In terms of pre-processing words and sentences - especially when it comes to LLM applications - the inference layer is where the CPU is doing the job of predicting the next word (a text generation task), and so on. There are some great examples of Rust excelling here, the most famous being MediaPipe Rust (mediapipe-rs).
Finally, there's the tensor layer, where all the GPU-intensive tasks are passed from WASM to the native tensor libraries: for example, llama.cpp, PyTorch, TensorFlow, and the aforementioned MediaPipe Rust.
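To make the asynchronous, non-blocking point concrete, here is a minimal sketch of an agent-style task fanning out two concurrent calls. It assumes the tokio crate as the async runtime, and fetch is a made-up stand-in for a real network call.

```rust
use tokio::time::{sleep, Duration};

// Stand-in for a network-bound call an agent might make (vector DB, web API, ...).
async fn fetch(source: &str) -> String {
    sleep(Duration::from_millis(50)).await; // simulate I/O latency without blocking the thread
    format!("result from {source}")
}

#[tokio::main]
async fn main() {
    // Both requests are in flight at the same time; neither blocks the other.
    let (docs, web) = tokio::join!(fetch("document store"), fetch("web search"));
    println!("{docs} | {web}");
}
```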
MEDIAPIPE RUST: A NEW, FLEXIBLE LIBRARY
MediaPipe Rust (mediapipe-rs) is a Rust library for running MediaPipe tasks on WasmEdge and WASI-NN. It's a very interesting piece of software because it offers a low-code API, very similar to the MediaPipe Python library. It's also very flexible, because users can pass custom media bytes as input. A host of tasks is already supported: in particular, object detection, image segmentation and classification. There's also gesture recognition, hand landmark detection, image embedding, face detection, text classification, text embedding, and audio classification. Further, if you look at the README of the official repository on GitHub, you'll see some interesting and accessible code snippets. Although you must know how to read Rust, the API is very clean, and there are interesting examples of an image classifier, an object detector, and a text classifier, each written in fewer than 10 lines of Rust. The complexity that those 10 lines can hide is impressive.
FINAL THOUGHTS
To conclude, Rust and WASM could be one of the best replacements for the Python ecosystem as
we know it today. They integrate very well with CPU tensor libraries, as well as GPU tensor libraries, which are also written in C/C++ and Rust. They efficiently complete the stack from the frontend - or from a developer-facing API - down to the backend, which is already optimised. Rust and WASM are also more efficient at implementing application-specific pre- and post-processing of data; all the tasks where inference is involved are obviously faster than in Python, or a Python equivalent. Further, the container image sizes - if you ever want to use Docker containers - are much smaller than the Python equivalent: usually several megabytes for a Rust-based Docker image, whereas Python-based images span several hundred megabytes.
Rust and WASM are also objectively safer, not only because the language is safe by design, but because it makes certain nasty bugs impossible to write. The attack surface is much smaller than with a Python-based container, where the number of dependencies you build or import into your Docker image is significantly bigger. Using Rust and WASM significantly lowers the risk of bugs, which in turn lowers the chance of sabotaging the entire workflow wherever that image is used. And that point is almost superfluous now that Rust and WASM are also more efficient at implementing the networking-intensive and long-running tasks that are usually required for LLM agents.
Remember, we're moving from a standalone way of inferring and generating text (the ChatGPT we were used to in March this year) to something more complex: the concept of an agent that can go online, download live resources, embed those resources on the fly, and generate a conversation using a dynamic context. Thanks to prompt engineering techniques, the agent is becoming, from a computational perspective, increasingly complex. A more complicated infrastructure is therefore needed to serve these beasts; they are no longer standalone models.
And so again, Rust has all the properties one would expect from a highly performant language: while not easy to learn, it is a powerful tool. Even fast cars are not easy to drive; there is a learning curve. But guess what? They go faster than all the others. So, yes, Elon, I think this time, you're right. Rust is the language of the future.
JULIA STOYANOVICH
THE PATH TO RESPONSIBLE AI
JULIA STOYANOVICH is Institute Associate Professor of Computer Science and Engineering, Associate Professor of Data Science, and Director of the Center for Responsible AI at New York University. She engages in academic research, education, and technology policy, and speaks about the benefits and harms of AI to practitioners and members of the public. Julia's research interests include AI ethics and legal compliance, and data management and AI systems. She has co-authored over 100 academic publications, and has written for the New York Times, the Wall Street Journal and Le Monde. Julia holds a Ph.D. in Computer Science from Columbia University. She is a recipient of the NSF CAREER Award and a Senior Member of the ACM.
Could you give us your definition of what responsible AI is?

That's a great question to ask and a very difficult question to answer. At the NYU Tandon Center for Responsible AI, our goal is to make responsible AI synonymous with AI in the not too distant future. I use "responsible AI" to refer to the socially sustainable design, development, and use of technology. We want to build and use AI systems in ways that make things better for all - or at least for most - of us, not only for the select few. Our hope is that these systems will be able to cure diseases, distribute resources more equitably and more justly in society, and also - of course - make money for people. We also want these systems to make it so that economic opportunity is distributed in ways that are equitable.

One component of responsible AI is AI ethics. Ethics is the study of what is morally good and bad, and morally right and wrong. AI ethics is usually used to refer to the embedding of moral values and principles into AI systems. Much of this conversation usually centres around the unintended consequences of AI: the mistakes that an AI system may make, and the mistakes that a human may make when following a recommendation or a suggestion from an AI. This conversation also often concerns bias in AI. Further, we want to think about arbitrariness in decisions as a kind of mistake that an AI system may make.

For a positive example of AI use, let's take medical imaging. We are starting to use cutting-edge AI tools in clinical practice to improve diagnosis and prognosis capabilities. In a recent collaboration, researchers from NYU and Facebook AI developed a technology called fastMRI. This is a way to generate semi-synthetic magnetic resonance imaging (MRI) scans. These scans use a lot less real data as compared to a traditional MRI
and so can be done much faster. We start with a quick MRI scan of an individual, and then we fill in the gaps with the help of AI. It has been shown that these semi-synthetic MRI scans are diagnostically interchangeable with traditional scans. MRI machines are in short supply in many locations, and so this makes MRI scans more accessible, and allows more people to be diagnosed with diseases. It also can make a huge difference for somebody who is claustrophobic, and does not want to stay inside an MRI machine for longer than is absolutely necessary.
Importantly, here, what we have is an environment in which machines cooperate productively with well-trained and professionally responsible clinicians. They understand what it means for a person to have a disease. They also understand that the responsibility for the diagnosis, and for any mistakes they may make - even if the diagnosis is AI-assisted - still sits with them. This is because clinicians have been trained in medical ethics. And so, this gives us an environment in which AI is being used responsibly. It's helping us solve an actual problem - increasing access to MRI technology. We can check if the technology works - we can validate the quality of the semi-synthetic MRI scan. And we have a responsible human decision-maker in the mix.

I like to contrast this with some other examples where the use of AI is irresponsible. Here, there are lots of things that can go wrong. For example, some uses of AI create a self-fulfilling
prophecy rather than addressing an actual need. In some uses, we are asking machines to make predictions that are morally questionable, like predicting whether somebody will commit a crime in the future based on how others like them have behaved in the past. Sometimes AI is put out into an environment where it interacts with people who have not been taught how to interact with these machines, and then these people just take the suggestions on faith, and they cannot meaningfully take responsibility for any mistakes.

How can global regulations help ensure AI is responsible?

AI is being used in a variety of sectors, with a variety of impacts. From health, to economic opportunity, to people surviving or dying on a battlefield. Because of this variety, I think that it's going to be tough to come up with globally accepted ways to regulate AI use. In part, this is because we don't really agree on a universal set of ethics or moral norms or values. But this is not to say that we shouldn't try. I think that there are some high-level insights that we all share and some high-level goals. Most importantly, it's that we should keep our humanity in our interactions with AI. We should make sure that it's people who are deciding what the future will look like - and not machines.

Is this a problem that can be solved through regulation?

Regulation is a very valuable tool in our responsible AI toolkit. But it's not the only thing we will rely on. Government oversight, internal oversight within AI vendor companies and within organisations that buy and use AI, as well as awareness of the people being impacted by these systems, are all very important for controlling them.

Let's take the medical domain, where the use of AI presents challenges even though this is already a tightly regulated space in many countries. There's a negative example that was surfaced by Obermeyer and co-authors in 2019 (1). In many hospitals throughout the United States, predictive analytics are used to estimate how likely somebody is to be very ill. Researchers showed that these predictors exhibit racial bias: at a given "risk score", African-American patients are actually considerably sicker than White patients. This happens because of the way that the predictive problem has been set up: the algorithm predicts how ill someone is based on healthcare costs - on how much money has been spent on healthcare for comparable patients to date. We have a biased healthcare system in the US, where people from lower income communities have less access to medical care. These are very often people who are African-American or Hispanic. Therefore, healthcare spending is going to be lower for them than for
somebody from a more affluent social group, but who is comparably as ill. By using a biased proxy like healthcare cost, we end up propelling the past into the future, further exacerbating disparities in access to healthcare. So, in this domain - and in many others - we need to be very careful about how we use data, how we collect it, what it encodes, and what are some of the harms that the irresponsible use of data may bring to this domain.

What is causing these biases? Is it limitations of the data, and what role do data models play for AI?

Data may or may not represent the world faithfully. This certainly contributes very strongly to the bias in predictions. But it's not the only reason. I like to think about bias in the data by invoking the metaphor that data is a mirror reflection of the world. Even if we reflect the world perfectly correctly in the data, it's still a reflection of the world such as it is today, and not of a world that could or should be. The world being reflected may be gender biased, or racially biased, or have some other distortions built in. Then, the data will reflect this and it will legitimise it. Because there is a perception that data is "correct" and that it's "objective." Data is almost never correct or objective. And so, we also need to think about whether, given the current state of the world and given the current state of our data collection and data processing capabilities, we in fact can do things better than simply replaying the past into the future with the help of these data-based predictions.

Do you differentiate between different types of AI, or would it all fall under the same umbrella?

So here, as an academic, I choose to take an extreme point of view. Of course, in the real world things may be a bit more nuanced. I actually think that it doesn't matter what sort of technology lives inside that technical "black box." It could be a very complex model or it could be a very simple one. I have spent the bulk of my career studying these very simple gadgets called score-based rankers. You start with a dataset of items - let's say these are people applying for jobs - and you compute a score for each item using a formula that you know upfront: some combination of standardised test scores, for example, and some number of years of experience. Then you sort everybody on that score. Even in that case, by taking the top 10% of the people from that ranked list to then invite them for in-person job interviews, you're introducing a lot of opacity. You, as a decision-maker, are not going to immediately understand what is the impact of the scoring formula on whom you choose to invite for job interviews, and whom you forgo.
(1) “Dissecting racial bias in an algorithm used to manage the health of populations”, Obermeyer et al., Science, 2019, www.science.org/doi/10.1126/science.aax2342
As another example, let's say that we're talking about linear regression models. These are being used in hiring, college admissions. Let's say half of the score is made up of the high school grade point average of the student, and half of the score is based on their standardised test performance, like the SAT. If this is a very selective college, then applicants self-select, and only those with the very top SAT scores will apply. Although the SAT score component has an equal weight in your formula, it's going to have far less importance, because everybody's tied on that component of the score. This shows you that even seemingly simple models can have side effects - or direct effects - that are hard for people to predict.

So rather than worrying about what lives inside that black box - whether it's a generative AI model, a simple, rule-based AI, or a scoring formula - we should worry about the impacts that these devices have. To think about the impacts of AI, we have to ask: what is the domain in which we use it? Can we tell what the AI does, rather than how it works? We have the scientific methods at our disposal to help us deal with and unpack how black boxes work. We can feed it some inputs and see what happens at the output. Are there any changes in the input, for example, if I change nothing except an applicant's gender or ethnicity? If the output changes, then we can suspect that there is something going on that we should be looking into more closely.

To summarise, I wouldn't worry about whether we are dealing with a very complex machine or a seemingly simple one. I would worry more about what these machines do, whether they work, and how we measure their performance. And I would worry about the consequences of a mistake, and about whether and how we can correct these mistakes.
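As a toy illustration of that point - not something Julia references, just a sketch with made-up numbers - consider four applicants who are all tied on the SAT component of an equally weighted score:

```rust
fn main() {
    // (name, GPA and SAT both normalised to 0..1); every applicant is tied on SAT.
    let applicants = [
        ("Ana", 0.92, 1.0),
        ("Bo", 0.81, 1.0),
        ("Cy", 0.88, 1.0),
        ("Dee", 0.75, 1.0),
    ];

    let mut ranked: Vec<(&str, f64)> = applicants
        .iter()
        .map(|&(name, gpa, sat)| (name, 0.5 * gpa + 0.5 * sat))
        .collect();

    // The formula weights GPA and SAT equally, but with SAT constant across
    // applicants the ordering is decided entirely by GPA.
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

    for (name, score) in ranked {
        println!("{name}: {score:.3}");
    }
}
```

The printed ranking is driven entirely by GPA, even though the formula looks balanced on paper.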
Did you see an increased interest in regulations and responsible AI with the rise of generative AI?

Yes, absolutely. It's a blessing and a curse that there's now this hype around generative AI. The blessing is, of course, that almost everybody is paying attention. Worldwide, we have politicians speaking about the need to control the adverse impacts or the risks of harm that the use of generative AI can bring. Together with that, everybody's just paying attention to AI more generally, and to how we might oversee, regulate, and bring more responsibility into our deployment of these systems. It's a good thing in that sense. But of course, hype is also very tiring, and it's also harmful in that we are paying a lot of attention to things that may or may not matter immediately. We shouldn't forget that we already are using AI tools in very impactful domains, and have been using these for decades. These are not, for the most part, fancy tools like large language models. They are much simpler tools like rule-based systems, score-based rankers, or linear models. These are being used in hiring, employment, credit and lending, and in determining who has access to housing. We shouldn't forget that if the AI tool is simpler, there can still be, and have been, documented tremendous harms that the use of these tools can bring. We should definitely regulate in a way that looks at who is impacted, and what are the impacts, rather than by regulating a particular kind of technology that sits inside the box.

What's happening currently regarding regulations? How do you think we should go about creating regulations?

I think that we just need to try to regulate this space. We shouldn't wait until we come to a moment where we're absolutely sure that this is the perfect way to put a regulation into place. That will never happen. It's very hard to reach a consensus. So I think that we should try. We should talk less and do more. I'm really glad that the European Union has been leading the way in this, starting with the GDPR (the General Data Protection Regulation). That has been extremely impactful. In the United States there is still no analogue to this, and this is really problematic. I'm also really glad that the AI Act in the European Union is moving forward. Again, in the US we have been hearing lots of people speak about this. But we are yet to see regulation at the federal level in the United States. We are lagging behind. In the US, of course, we have a system that is decentralised, at least to some extent. And so there is also a lot of opportunity in the United States to regulate at the level of cities and states.

Is there any evidence of a link between the strictness of regulations and the suppression of technological innovation?

I've not actually done any research specifically to look at the impact of regulation on innovation. It's hard to do this research really, because we don't have examples of two places that are comparable in every way, except that one has stronger regulatory regimes than another. But personally, I don't believe that regulation stifles innovation in any way. To me, "responsible AI" is, first and foremost, AI that is socially sustainable. To reach social sustainability, we need to make it so that when we deploy a tool, it doesn't break society further. Because then you have to recover from the ill impacts of that. So to me, first deploying something and then seeing how it plays out is not at all a sustainable way to operate a society. It also only advantages a very select few. The people who are releasing the technology stand to benefit from it
financially now. But in the long run, this is going to hurt us and it is already hurting us. So I personally see no alternative here. Considering the success that this technology has had, we do need to think about regulation at the same time as we think about large-scale adoption of things like large language models.

What's your opinion on the release of generative AI, such as ChatGPT, to a mass audience? Was this too early in terms of the maturity of the technology?

I definitely think that it's too risky. I think it's extremely irresponsible to have unleashed this technology without giving us any meaningful way to control how the data travels and where the data goes. We also haven't been given any meaningful way to understand where this technology can be safely used. There are tremendous issues around labour and environmental sustainability that go along with the release of the technology. I think that the harm to individuals and to society, and the risks of further harm due to data protection violations, bias and anthropomorphisation of these tools, far outweigh the benefits. But then the question is: benefits to whom? For the companies that release this technology, financial benefits are what matter. We need regulation so that it's not just the select few who benefit. I don't currently do any research work that involves generative AI because I just don't think that we should be feeding into this hype and giving away our data. Those who produce these technologies need to spend resources - including time and money - on figuring out how to control them before they can go into even broader use.

What are the bigger risks of AI?

One of them is that decisions will be made with the help of these tools by people who do not question whether the predictions of the tools are correct in any sense. So, many of the decisions being made will be arbitrary, and this is even beyond bias. How our data is used, and whether we're comfortable with our data being used in this way, is also problematic. One of the angles on this - in addition to the conversation about benefits and harms - is that people have rights. We have rights to privacy. We have rights to agency, to being in charge, both of our own data and existence, and also of the world in which our society functions. At the high level, it's really just that we're insisting on using a technology that we don't yet really know how to control. To be more concrete, we need to think about - in each specific domain - who benefits, who is harmed, and who can mitigate the harm. It's the same story with every technology that we've been experiencing throughout human history. The Industrial Revolution also left out some and benefited some others. And we need to make sure that we are acting and using technology in ways
that are more equitable this time around.

How can practising data science leaders and data scientists make sure that they develop AI systems responsibly?

In my very simple worldview, there are essentially four conditions that you need to meet to use AI responsibly.

Firstly, are you using AI to meet some clear need for improvement? Are you just using it because your competitors are doing the same, or is there some actual problem that you can clearly articulate, and that you want AI to help you solve?

Secondly, can you actually check whether the AI is going to meet the requirements for that need for improvement? Can you validate that the predictions of the AI are good and correct? If you can't validate it, then again, it's not the right setup.

Thirdly, can the problem that we have set out to solve be solved given the current capabilities in terms of hardware and data and software? If that is not the case - for example, if data doesn't exist that would allow you to predict the kind of thing that you want to predict - then it's hopeless. AI is not magic.

Finally, AI very rarely operates autonomously. Usually it's in a collaboration with a human. So, do you actually have these decision-makers who are well-equipped to work with your AI, and who can challenge it when it needs to be challenged? Here again, take the example of a clinician working with AI to diagnose a disease. They need to understand that it's up to them to make the decision.

Outside of these four conditions, there are, of course, others, like legal compliance. Are you going to be legally compliant in your data collection and AI use? But the main four components are absolutely crucial. Is there a problem to solve? Can we solve that problem? Can we check that we solved it? And can we use this AI, this solution, safely together with humans?
How can global regulations lead us to ensure that all AI is responsible?

This is, again, a very difficult question. I don't know whether we are prepared to regulate the use of AI globally. We have been trying to do this in a number of very concrete domains. For example, take lethal autonomous weapons. These weapons decide who or what is a legitimate target, and who or what is a civilian - person or infrastructure - and so should not be targeted. Even in this domain, AI has been very difficult to regulate globally.
The United Nations has been playing a tremendous role in pushing for regulation in this domain. But it has been very difficult to come to a worldwide agreement about how we can control these technologies.

There is a balance to be struck between the rate of technological development and the rate at which we develop ethical frameworks. Is that balance being met, and do you think we will be able to keep up with technological advances in the future?

I am an engineer - I'm not a philosopher or somebody whose job it is to predict the future. Engineers predict the future by making it. I think more engineers are going to understand that it's our responsibility to make sure that we build systems that we are proud of and that we can stand behind. We should take control and participate in making decisions about what we think we should be building and using.

When we talk about responsible AI, that term itself is a bit misleading. Responsible AI doesn't mean that the AI is responsible. It's the people who are responsible for the development and use of AI. One of the things that's particularly dangerous with the current AI hype is that there are some very vocal people saying that AI is about to take over and that it has a mind of its own. They argue that whatever harms us socially is the AI's responsibility. This is a really problematic narrative, because it absolves those who stand to benefit from AI, financially and otherwise, from any responsibility for the mistakes. We cannot allow that to pass. I think that this is really a point in history where we're witnessing people fuelling this AI hype for personal benefit, so that they absolve themselves of the responsibility and yet reap all the benefits.

Generally, to me, responsible AI is about human agency. It's about people at every level taking responsibility for what we do professionally and for how we're impacted personally. We all need to step up and say that We the People are in control here. The agency is ours and the responsibility is ours. This is, again, one area in which generative AI is presenting us with challenges, because a lot of the impetus for these tools to exist is to seem indistinguishable from what a human would have produced. This anthropomorphisation of AI is very problematic, because it takes us away from the goal of staying in control and towards somehow giving up agency to machines. We should resist this as much as possible.

What do you say when people counter that AGIs can start writing their own code now, and could potentially start self-improving at some point in the future?

I don't believe that's the case. Furthermore, we should decide whether we are okay with this. If generative AI writing code is something that we think can be used to
automate more mundane tasks - for example, software testing - then certainly we can allow this particular use. But whenever we ask an AI to do something, we need to be able to measure whether whatever it has done is correct, good, and adheres to the requirements that we have set out. If we can't do that, then we cannot take an AI's word on faith that it worked.

One example is the use of AI in hiring and employment. There are several tools that have been developed that claim to construct a personality profile of a job applicant based on their resume. But is there any way to validate this? If I made such a prediction myself, could I actually check if I was correct? If the answer is no, then we should not be using machines to make these predictions. This is because AI tools are engineering artefacts. If we can't tell that they work, then they don't work.

Do you have a perspective on OpenAI's superalignment initiative?

I don't have a perspective on their superalignment initiative, and I'm not a fan of the term alignment in general. Usually the message there is that somehow we're able to just automate moral and value-based reasoning in machines, and I don't believe that is possible, nor should it be the goal. I don't think that we can automate ethics or responsibility. I don't think alignment in the way that it's being discussed right now is a productive way forward. This is because it essentially borders on this conversation about algorithmic morality, where essentially it's just the simplest, least nuanced version of utilitarianism that we end up trying to embed. For example, we only look at how many people die and how many people are safer. We add these numbers up, we subtract some, and then based on that, we decide whether or not it's safe to deploy self-driving cars, for example. I think that the use of AI is way too complex and context-dependent for us to pretend that we can automate ethics and responsibility and morality. So I think that that's a dead end.

So in your view, is making AI systems responsible more the duty of engineers in the first place?

For technologists like myself, I think the main task is to figure out where technology can be helpful and where it has its limits. Technology cannot solve all of society's problems. There's no way for you to de-bias a dataset and then proclaim that now you are hiring with no bias, or lending with no bias. This is hubris. We need people to make decisions and take responsibility for decisions throughout. There's no way that we can align technology to our values, push a button, and then say that the world is fair and just.

On November 1st 2023, Julia was invited to speak at the AI Insight Forum at the US Senate. Her full statement can be found at this link: r-ai.co/AIImpactForum
COLIN HARMAN
WITH LLMS, ENTERPRISE DATA IS DIFFERENT
By COLIN HARMAN
COLIN HARMAN is an Enterprise AI-focused engineer, leader, and writer. He has been implementing LLM-based software solutions for large enterprises for over two years and serves as the Head of Technology at Nesh. Over this time, he's come to understand the unique challenges and risks posed by the interaction of Generative AI and the enterprise environment and has come up with recipes to overcome them consistently.
Welcome to the age of Enterprise LLM Pilot Projects! A year after the launch of ChatGPT, enterprises are cautiously but enthusiastically progressing through their initial Large Language Model (LLM) projects, with the goal of demonstrating value and lighting the way for mass LLM adoption, use case proliferation and business impact. Developing these solutions either within or for mature businesses is fundamentally different from startups developing solutions for consumers. Yet the vast majority of advice on LLM projects comes from a startup-to-consumer perspective, intentionally or not. The enterprise
environment poses unique challenges and risks, and following startup-to-consumer guidance without regard for the differences could delay or even halt your project. But first, what does "enterprise" mean?
                         Provider
                         Startup          Enterprise
User    Consumer         B2C              B2C
        Enterprise       B2B              B2B or Internal

A simple provider-user matrix highlighting where enterprise challenges and risks are involved. Note that a product provided to an enterprise user could end up with a consumer end-user, e.g. a chatbot provided to a financial services company for their clients.
There's currently a strong but waning bias of LLM content toward the consumer-user and startup-provider portion of the provider-user matrix above. It's largely driven by the speed to market and click-hungriness of VC-funded startups and independent creators. Where, then, does the enterprise guidance that does exist come from, and can you trust it? This may surprise you, but very few enterprises have actually completed LLM projects. At this point, entities like traditional consulting organisations are rarely trading on actual experience; instead, they repackage web content. There's not been sufficient time for much enterprise experience to accumulate, let alone disseminate. There are plenty of exceptions, but the majority of future LLM experts from within the enterprise are still busy with their first projects, and the number of experienced startup providers for the enterprise is still limited. As a B2B software vendor, I've had the privilege of developing and implementing LLM projects (the Startup/Enterprise cell in the provider/user matrix above) for enterprises since 2021, and have overcome the set of unique challenges to complete multiple end-to-end implementations. In this article, I'll share some of the most important dimensions I've observed in which the data involved in enterprise LLM projects differs from startup/consumer projects and what it means for your project. Use this to de-bias information coming from a
startup/consumer perspective, inform yourself of risks, and build intuition around enterprise applications of LLMs.
DATA IS DIFFERENT
Specifically, this article will focus on the data used for data-backed applications, which cover a majority of enterprise LLM use cases. Let's move beyond the LLM parlour tricks of cleverly summarising an input or answering a question from its memory. The value that can be generated by allowing LLMs to operate over data is much greater than the value from interacting with standalone LLMs. To prove this, imagine an application that consists solely of LLM operations, taking user input and transforming it to produce an output. Now imagine that this same system can also access data stored outside the model: the capability is additive, and so is the value. This data could be anything: emails, client records, social graphs, code, documentation, call transcripts… Almost always, the data comes into play through information retrieval - the application will look up relevant information and then interpret it using an LLM. This pattern is commonly known as RAG (Retrieval-Augmented Generation). So, how does data differ between enterprise and startup/consumer applications, and what does that mean for your project?
Figure: Qualitative comparison of the data behind LLM applications in consumer and enterprise environments along several axes. It's a cartoon, so relative sizes are approximate, and there are many exceptions.
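Before diving into those differences, here is a bare-bones sketch of the RAG pattern described above. It is illustrative only: the scoring is a naive keyword overlap, and call_llm is a made-up placeholder rather than any real API.

```rust
// Minimal retrieve-then-generate flow: score documents against the query,
// keep the top matches, and hand them to a language model as context.
fn retrieve<'a>(query: &str, docs: &'a [&'a str], top_k: usize) -> Vec<&'a str> {
    let terms: Vec<String> = query.split_whitespace().map(|t| t.to_lowercase()).collect();
    let mut scored: Vec<(usize, &str)> = docs
        .iter()
        .map(|d| {
            let text = d.to_lowercase();
            let score = terms.iter().filter(|t| text.contains(t.as_str())).count();
            (score, *d)
        })
        .collect();
    scored.sort_by(|a, b| b.0.cmp(&a.0)); // most keyword overlap first
    scored.into_iter().take(top_k).map(|(_, d)| d).collect()
}

// Placeholder for whatever LLM endpoint the application actually calls.
fn call_llm(prompt: &str) -> String {
    format!("[LLM answer based on a prompt of {} characters]", prompt.len())
}

fn main() {
    let docs = [
        "Q3 earnings call transcript: revenue grew 12% year over year.",
        "Travel policy: employees book flights through the internal portal.",
        "Customer support record: client reported a billing discrepancy.",
    ];
    let query = "What did revenue do in Q3?";
    let context = retrieve(query, &docs, 2).join("\n");
    let prompt = format!("Answer using only this context:\n{context}\n\nQuestion: {query}");
    println!("{}", call_llm(&prompt));
}
```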
Open vs Closed Domain
Open-domain data refers to a wide array of topics that aren't confined to any specific field or industry. This is the kind of data that consumer applications or public-facing enterprise offerings deal with. On the other hand, closed-domain data is typically found in enterprise applications. This type of data is more specialised and domain-specific, and often includes terminology, acronyms, concepts, and relationships that are not present in open-domain data.
REAL-WORLD EXAMPLES:
Enterprise Data
● Pharmaceutical R&D documentation
● E-commerce product listings
● Customer support records
● Company policies
● Quarterly earnings reports
Consumer Data
● Social media posts
● Personal finance records
● Podcast transcripts
● Recipes
● Travel journals
IMPLICATIONS FOR YOUR PROJECT
Failing to account for closed-domain data in your application could render it totally ineffective. Most commercial LLMs are trained on open-domain data and, without help, will simply fail to correctly interpret terms and topics they haven't encountered in their training; likewise for the embedding models used in vector search. For example, many organisations have massive vocabularies of acronyms that are used as the primary handle for certain concepts, and without help, an LLM or retrieval system may be unable to relate the acronyms to the concepts they represent. However, don't be fooled into undertaking costly retraining projects if it's not actually needed or there are simpler solutions! Many domains that seem closed are,
in fact, open. For example, commercial LLMs have been trained on an immense amount of public financial data and intuitively understand it because, while finance is a particular domain, it is not a closed domain.
Particular Domain ≠ Closed Domain

Even fields like medicine and law are heavily represented in public data, although subdomains often exhibit closed-domain properties (as with finance). Think twice before pursuing training projects or specialised models and evaluate whether they are truly needed. Expect more posts on how to assess the openness of your domain's data. In most cases of closed-domain data, you can get quite far by looking up synonyms or definitions of closed-domain terms and injecting them into your retrieval and generation systems. However, this requires you to have structured information around synonyms, relationships, or definitions, which may not be readily available. When possible, this pattern is often the best and simplest solution, and we'll explore related techniques in future posts.
Prefer bespoke systems to bespoke models

In summary:
● Startup/consumer applications' bias toward open-domain data makes them a natural fit for off-the-shelf commercial LLMs and embedding models for vector search.
● Enterprise applications, when they involve closed-domain data, often necessitate different, more complex designs to provide acceptable performance.
● Start by assessing whether your domain is open or closed, and if it's closed, identify whether structured information about the domain exists (synonyms, definitions, acronyms, etc.).
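As a sketch of the look-up-and-inject pattern described above - the glossary, acronyms, and query are all invented for illustration:

```rust
use std::collections::HashMap;

// Expand closed-domain acronyms in a user query before retrieval/generation,
// using structured glossary information the organisation already maintains.
fn expand_query(query: &str, glossary: &HashMap<&str, &str>) -> String {
    query
        .split_whitespace()
        .map(|token| {
            // Strip simple punctuation so a trailing "?" or "," doesn't block a match.
            let key: String = token.chars().filter(|c| c.is_alphanumeric()).collect();
            match glossary.get(key.as_str()) {
                Some(definition) => format!("{token} ({definition})"),
                None => token.to_string(),
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // Hypothetical internal acronyms - in practice these come from an existing glossary or data dictionary.
    let glossary = HashMap::from([
        ("TRR", "test readiness review"),
        ("CPE", "cost per engagement"),
    ]);

    let query = "What did the last TRR say about CPE targets?";
    println!("{}", expand_query(query, &glossary));
    // -> "What did the last TRR (test readiness review) say about CPE (cost per engagement) targets?"
}
```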
Size
Businesses tend to have more data than individual consumers. The simplest proof of this is that an enterprise is a collection of employees, clients/consumers, and processes, all of which generate data, while a consumer is a population of one. Businesses also tend to undertake projects over longer spans of time than consumers, with a higher density of records within those spans. We'll split the size dimension into three:
● Population vs Unit
● Archival vs Fresh
● Many modalities vs Fewer modalities
Enterprise projects tend to occupy the "more data"
side of each scale. The reason this matters is that, as the amount of data you retrieve grows, so does the likelihood that unhelpful records will be considered relevant, simply because there are more unhelpful records, or "noise."
More data causes less accurate retrieval

What's more, certain types of data generate more noise than simple random data. These are sometimes referred to as distractors, and we'll see how they can occur in some of the subcategories.
Population vs Unit
Each part of an enterprise tends to deal with groups of things: employees, clients, products, and projects. Each unit in these groups can generate anywhere from a few to millions of records that your application may need to handle. Individual consumers may be one of those units and therefore represent a smaller slice of data. To understand your data along this dimension, ask: does it deal with a single entity or with many? Are those entities known, or even knowable?
REAL-WORLD EXAMPLES:
Enterprise Data
● E-commerce product listings
● Customer support records
● Customer support records for a particular customer
● A company's meeting transcripts
● A team's meeting transcripts
● A company's GitHub repositories
Startup/Consumer Data
● My personal call transcripts
● My GitHub repositories
IMPLICATIONS FOR YOUR PROJECT There comes a point (one that most enterprise datasets are well beyond) where it’s critical to limit the size of the data your application operates over. Population-level data can easily push you past this limit. Even if
your application contains an entire population’s data, you may be able to transform it into unit-level data at runtime by filtering on a target entity (client, product, project, etc), clearing the path for a simpler and more accurate retrieval system. However, in some projects, this can be impossible if your client wishes to provide an un-curated data dump of, for example, PC hard drives. There are several approaches to handling this: ●● When scoping the project, focus on a problem where the supporting data is well structured and able to be filtered down to a unit level. ●● Request that your client provide a smaller or segmented dataset. ●● If the above approaches are unsuccessful, identify entities by which the data can be filtered in coordination with subject matter experts. This may dictate new data extraction/enrichment and query understanding processes, but for large datasets, it may be critical. If you aren’t successful with any of these, expect to spend a lot of effort reducing noise in your retrieval system. For both startup/consumer and enterprise applications, you should always seek to relieve the burden on the retrieval system by only searching through the relevant slice of data. This isn’t optional with massive enterprise datasets.
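As an illustration of the slicing idea, here is a minimal Python sketch that filters a population-level corpus down to a single entity before any relevance scoring. The Record structure, client_id field, and keyword-overlap scoring are stand-ins for whatever metadata and vector store you actually have.

```python
# Minimal sketch: reduce population-level data to unit-level at query time by
# filtering on a target entity before any semantic scoring happens.

from dataclasses import dataclass

@dataclass
class Record:
    text: str
    client_id: str  # the entity we can filter on

CORPUS = [
    Record("Ticket: login failures after SSO migration", client_id="acme"),
    Record("Ticket: invoice totals off by rounding", client_id="globex"),
    Record("Ticket: SSO tokens expiring early", client_id="acme"),
]

def retrieve(query: str, client_id: str, corpus: list[Record], k: int = 5) -> list[Record]:
    # 1. Slice the population down to one unit *before* scoring.
    unit_slice = [r for r in corpus if r.client_id == client_id]
    # 2. Score only within the slice (keyword overlap stands in for
    #    embedding similarity here).
    terms = set(query.lower().split())
    scored = sorted(
        unit_slice,
        key=lambda r: len(terms & set(r.text.lower().split())),
        reverse=True,
    )
    return scored[:k]

print(retrieve("SSO login issues", client_id="acme", corpus=CORPUS))
```

The important part is the order of operations: the entity filter shrinks the search space first, so the ranking step never has the chance to surface another client’s records as noise.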
Archival vs Fresh
Enterprises love bookkeeping: they’re forced to keep records for compliance reasons, or they’ve simply been doing the same work for many years and have accumulated a lot of documentation. Clearly, this contributes to the data size issue in that it can introduce noise, but it can also add a more aggressive form of noise: highly semantically similar records which, if unaccounted for, will become adversarial distractors to both your retrieval and generation systems. A good heuristic for whether this will be a problem is whether the data pertains to projects spanning long periods of time. All the enterprise examples below can be thought of as part of long-running projects (product development for a product, project management for a project, customer support for a customer), while the consumer examples relate to relatively short-lived events or objects. There are plenty of exceptions, but this is going to be a challenge more frequently in enterprises.
REAL-WORLD EXAMPLES: Enterprise Data ●● Quarterly voice of customer reports ●● Project report draft versions ●● Product spec versions ●● Quarterly OKR meeting transcripts Startup/Consumer Data ●● Recipes ●● Travel journals ●● Social media posts
IMPLICATIONS FOR YOUR PROJECT If your project data has a lot of versions, entries, or copies, you should plan to address this from the beginning. Otherwise, these versions can greatly increase retrieval noise and can lead to contradictory or untrue inputs to your generation system. The best courses of action are generally: ●● Request that the data be curated, and only the fresh data be provided for your application (not possible if previous versions of the records are also relevant).
●● If the first approach is unsuccessful, consider a data-slicing approach similar to the one for addressing population-level data. This can help get more relevant retrieval results before generation. ●● Instruct your generation system how to handle different versions, dates, and contradictory records through prompting, few-shot examples, or fine-tuning. Note that this will require extracting version or date information and passing it through your system to the generator (a minimal sketch follows this list).
●● Date-based ranking: it’s best to avoid this, as your system probably already has a ranking system, and blending ranking signals is harder than it sounds. However, if you can replace or follow your original ranking with a relevant/not-relevant filter, then a date ranking may be useful. Where possible, avoid this challenge altogether by insisting on high-quality data rather than building complex passive version-control or multi-signal re-ranking systems.
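As a minimal sketch of the version-handling advice above, the following Python snippet collapses retrieved chunks so that only the freshest copy of each document reaches the generator, and passes the date through so the prompt can reason about it. The doc_id and version_date fields are assumptions about metadata you would extract at ingestion time.

```python
# Minimal sketch: keep only the latest version of each retrieved document and
# surface its date to the generation step.

from datetime import date

retrieved = [
    {"doc_id": "spec-falcon", "version_date": date(2023, 1, 10), "text": "Spec v1 ..."},
    {"doc_id": "spec-falcon", "version_date": date(2023, 6, 2),  "text": "Spec v3 ..."},
    {"doc_id": "okr-q2",      "version_date": date(2023, 4, 1),  "text": "Q2 OKRs ..."},
]

def keep_latest(chunks: list[dict]) -> list[dict]:
    """Collapse near-duplicate versions, keeping the newest per doc_id."""
    latest: dict[str, dict] = {}
    for chunk in chunks:
        current = latest.get(chunk["doc_id"])
        if current is None or chunk["version_date"] > current["version_date"]:
            latest[chunk["doc_id"]] = chunk
    return list(latest.values())

for chunk in keep_latest(retrieved):
    # Include the date in the context passed to the prompt so the generator
    # can handle it explicitly.
    print(f"[{chunk['version_date']}] {chunk['text']}")
```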
Few vs Many Modalities
How many discrete forms of data does your application need to support? Data generated directly by individuals usually falls into the natural communication modalities: text, audio, image, and video. Within those, we can look even deeper: text records may contain emails, code, documents, presentations, websites, and more, which might contain other forms of media themselves! Then there’s data about individuals and their behaviour, and about other entities (like products or events), which is often represented in structured formats like tables and graphs. The average enterprise will own a mixture of all of these, whereas a consumer often only consciously stores and processes a few. Think of the difference between a company’s data warehouse and your personal iCloud. In some enterprise areas (like federated search), you will need to deal with several modalities at once, while in others you will focus on one at a time.
REAL-WORLD EXAMPLES: Enterprise Data ●● Project kickoff reports ●● Click behaviour data ●● CRM data ●● Product spec sheets ●● Quarterly OKR meeting transcripts ●● Property brochures
Startup/Consumer Data ●● My Notion pages ●● To-do lists ●● Phone calls ●● Social media posts
IMPLICATIONS FOR YOUR PROJECT Another way of looking at this is that enterprises have myriad tiny categories of media, each containing many entries, whereas individual consumers tend to have fewer overall categories but (as we’ll see in the next section) greater variety within those categories. This property of enterprises actually turns out to be an advantage: the different types of content can help you scope your retrieval using the slicing technique described in the data size section and, even more importantly, scope your solution as narrowly as possible while still solving a user problem. As a general rule, only operate over the types of content that are necessary to address your use case. As much as possible, treat use cases that involve different modalities as different projects and solve them individually before combining them into a one-size-fits-all solution.
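A minimal sketch of that scoping rule: restrict each use case to the content types it actually needs before retrieval. The use-case names and content-type labels below are illustrative assumptions, not a prescribed taxonomy.

```python
# Minimal sketch: scope each use case to the content types it needs rather
# than searching everything the enterprise owns.

USE_CASE_SCOPES = {
    "contract_qa":   {"contract", "amendment"},
    "support_agent": {"support_ticket", "product_spec"},
}

def scoped_corpus(use_case: str, corpus: list[dict]) -> list[dict]:
    """Return only the records whose content type is relevant to the use case."""
    allowed = USE_CASE_SCOPES[use_case]
    return [record for record in corpus if record["content_type"] in allowed]

corpus = [
    {"content_type": "contract", "text": "Master services agreement ..."},
    {"content_type": "meeting_transcript", "text": "Weekly sync ..."},
    {"content_type": "support_ticket", "text": "Customer cannot export CSV ..."},
]
print(scoped_corpus("support_agent", corpus))
```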
Data Regularity
Though enterprises may have more distinct forms of content, all the examples of a given kind are likely to be fairly similar. In structured data like graphs and tables, this is self-evident (because they were generated by a consistent business process), but why might it be the case with human-generated data like reports, presentations, and phone calls? The simplest reason is that enterprises and even entire industries have fairly reliable norms and procedures around communication, while those between consumers are more diverse. For example, sales phone calls tend to follow more reliable structures than phone calls between friends.
REAL-WORLD EXAMPLES: Enterprise Data ●● Project kickoff reports ●● Click behaviour data ●● CRM data ●● Product spec sheets ●● Quarterly OKR meeting transcripts ●● Real estate brochures Startup/Consumer Data ●● Personal email ●● Phone calls ●● Social media posts
IMPLICATIONS FOR YOUR PROJECT Depending on where your project falls on the consumer-enterprise spectrum, you may be able to make more assumptions about what the data looks like. If you’re building a product for internal use or to work with regulated documents, you have the highest certainty about the data structure that your system will need to handle. On the other hand, if you’re building a solution used by disparate enterprise customers, you may need to either build your solution flexibly enough to handle variable data schemas or perform discovery and system adaptation for each customer’s data. The parts of an
LLM application in which differences may arise include pretty much everything: data ingestion, retrieval, and generation. For example: ●● Unfamiliar data not represented in few-shot prompt examples increases the defect rate. ●● Unexpected formats in documents can introduce retrieval defects due to chunking. The more you know about the data, the simpler and more reliable you can make the system that works with it. So use this knowledge when it’s available, and design flexibly when it’s not.
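To illustrate designing for variable formats, here is a minimal Python sketch that applies a format-specific chunking strategy where the structure is known and falls back to fixed-size windows where it isn’t. The format labels and chunk size are illustrative assumptions.

```python
# Minimal sketch: pick a chunking strategy per known document format, with a
# conservative fallback for anything unfamiliar.

def chunk_paragraphs(text: str) -> list[str]:
    """Split on blank lines; works well when documents follow a known layout."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def chunk_fixed(text: str, size: int = 800) -> list[str]:
    """Fixed-size windows; a safe default when the structure is unknown."""
    return [text[i:i + size] for i in range(0, len(text), size)]

CHUNKERS = {
    "regulated_report": chunk_paragraphs,  # predictable structure: exploit it
    "product_spec": chunk_paragraphs,
}

def chunk_document(text: str, doc_format: str) -> list[str]:
    # Use a format-specific chunker when we can assume structure,
    # otherwise fall back to fixed-size windows.
    chunker = CHUNKERS.get(doc_format, chunk_fixed)
    return chunker(text)

print(chunk_document("Section 1\n\nScope of work...\n\nSection 2\n\nDeliverables...", "regulated_report"))
```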
Access Control
Possibly the most mundane of the differences between the data in consumer and enterprise applications is who is allowed to access it. In consumer applications, this is typically straightforward: a user can access their data, but others can’t (except maybe an admin). However, in enterprises, this situation can become extremely complex, with numerous, frequently changing access groups for particular resources and different tiers of access. The stakes are often high, with employees needing access to certain data to perform their jobs but also needing access immediately revoked if they leave that role.
REAL-WORLD EXAMPLES: Enterprise Data ●● Product line documentation ●● HR records
Consumer Data ●● Crypto brokerage transactions ●● Frequent flyer account
IMPLICATIONS FOR YOUR PROJECT Unfortunately, there’s no way to avoid a lot of engineering to implement access control properly. The best thing you can do for your project is to recognise this early and make sure you have the engineering expertise and client access necessary to figure it out. Definitely look to integrate with the data source’s identity provider if applicable, and look out for frameworks to help with the integration. Still, there are many decisions to be made about how your system handles different user groups. For example, in a system involving retrieval, will results be filtered before or after being retrieved (sometimes called early vs late binding)?
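As a minimal sketch of that early- vs late-binding choice, the snippet below contrasts filtering the searchable set before ranking with filtering results after ranking. The group model and is_permitted check are illustrative assumptions; in practice you would integrate with the source system’s identity provider.

```python
# Minimal sketch: early-binding vs late-binding access control around retrieval.

corpus = [
    {"text": "HR salary bands 2024", "allowed_groups": {"hr"}},
    {"text": "Falcon product roadmap", "allowed_groups": {"product", "eng"}},
]

def is_permitted(record: dict, user_groups: set[str]) -> bool:
    return bool(record["allowed_groups"] & user_groups)

def rank(query: str, records: list[dict]) -> list[dict]:
    # Keyword overlap stands in for your real ranking/embedding step.
    terms = set(query.lower().split())
    return sorted(records, key=lambda r: len(terms & set(r["text"].lower().split())), reverse=True)

def retrieve_early_binding(query: str, user_groups: set[str], k: int = 3) -> list[dict]:
    # Early binding: restrict the searchable set *before* ranking.
    visible = [r for r in corpus if is_permitted(r, user_groups)]
    return rank(query, visible)[:k]

def retrieve_late_binding(query: str, user_groups: set[str], k: int = 3) -> list[dict]:
    # Late binding: rank everything, then drop what the user cannot see.
    ranked = rank(query, corpus)
    return [r for r in ranked if is_permitted(r, user_groups)][:k]

print(retrieve_early_binding("product roadmap", {"eng"}))
```

Early binding keeps restricted content out of the ranking entirely, while late binding is simpler to bolt onto an existing index but risks returning fewer than k results after filtering.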
Conclusion I hope this article raises some flags at the start of your enterprise LLM project that save you time later on, and helps you form a de-biased intuition around LLM projects.
WOULD YOU LIKE TO BE FEATURED IN THE DATA SCIENTIST MAGAZINE? We welcome contributions from both individuals and organisations, and we do not charge for features. Key Benefits of Getting Published:
EDUCATIONAL CONTRIBUTION By sharing your technical know-how and business insights, you contribute to the larger data science community, fostering an environment of learning and growth.
THOUGHT LEADERSHIP A feature in an industry magazine positions you or your company as a thought leader in the field of data science. This can attract talent, partners, and customers who are looking for forward-thinking businesses.
BRAND EXPOSURE An article can significantly enhance a company’s or individual’s visibility among a targeted, niche audience of professionals and enthusiasts.
RECRUITMENT A feature can showcase your company’s work culture, projects, and achievements, making it an attractive place for top-tier data scientists and other professionals.
FUTURE ISSUE RELEASE DATES FOR 2024 20th February 21st May 3rd September 19th November
SHOWCASING SUCCESS STORIES Highlight your personal achievements or company-level successful projects, providing proof of your expertise and building trust in your capabilities.
To set up a 30-minute initial chat with our editor to talk about contributing a magazine article please email: donna.aldridge@datasciencetalent.co.uk
Hire the top 5% of pre-assessed Data Science/Engineering contractors in 48 hours. Quickly recruit an expert who will hit the ground running to push your project forward.
[DST Profiler® skills wheel: Analyst, Statistician, Visualiser, Researcher, Architect, Machine Learner, Wrangler, Hacker]
Don’t let your project fall behind any further. You can access the top 5% of pre-assessed contractors across the entire candidate pool - thanks to our exclusive DST Profiler® skills assessment tool.
You can find your ideal candidate in 48 hours* - GUARANTEED
We recruit contractors to cover: ●● Skills or domain knowledge gaps ●● Fixed-term projects and transformation programmes ●● Maternity/paternity leave cover ●● Sickness leave cover ●● Unexpected leavers/resignations
Tell us what you need at datasciencetalent.co.uk *Our 10K contractor guarantee: if, in the first two weeks, we provide you a contractor who is not a fit, we will replace them immediately and won’t charge you anything for the work we’ve done.