MACHINE TRANSLATION with PHILIPP KOEHN
Reinforcement learning
WITH FRANCESCO GADALETA
How cross-validation can go wrong and how to fix it
BY TAM TRAN-THE
DEALING WITH GROWING DATA TEAM SIZES
ISSUE 1
FRANKFURT: DATA SCIENCE CITY
A GUIDE TO DATA OBSERVABILITY
CONTRIBUTORS
Philipp Koehn
Mikkel Dengsøe
Genevieve Hayes
Damien Deighan
Tam Tran-The
Skanda Vivek
Francesco Gadaleta
Dr Anna Litticks
George Bunn
EDITOR
Anthony Bunn
DESIGN
Imtiaz Deighan
PRINTED BY Rowtype
Stoke-on-Trent, UK +44 (0)1782 538600
sales@rowtype.co.uk
The Data Scientist is published quarterly by Data Science Talent Ltd, Whitebridge Estate, Whitebridge Lane, Stone, Staffordshire, ST15 8LQ, UK. Access a digital copy of the magazine at datasciencetalent.co.uk/media.
DISCLAIMER
The views and content expressed in The Data Scientist reflect the opinions of the author(s) and do not necessarily reflect the views of the magazine or its staff. All published material is done so in good faith.
All rights reserved. All products, logos, brands, and any other trademarks featured within The Data Scientist magazine are the property of their respective trademark holders. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form by mechanical, electronic, photocopying, recording, or other means without prior written permission. Data Science Talent Ltd cannot guarantee the accuracy of claims made by advertisers, and accepts no liability for any loss or damage of any kind caused by this magazine.
02 | THE DATA SCIENTIST
INSIDE ISSUE #1

05 DATA SCIENCE: CAREER PROGRESSION ADVICE
Genevieve Hayes tells Data Scientists: if you want to be valued then you need to ask the right questions.

08 HOW NEURAL NETWORKS HAVE TRANSFORMED MACHINE TRANSLATION
We talk at length to Philipp Koehn, an iconic leader in the field of machine translation research and the author of over 100 publications.

16 DATA TEAMS ARE GETTING LARGER, FASTER
Highly respected and experienced Data Leader Mikkel Dengsøe examines the relationship between data team size and complexity.

19 DATA SCIENCE MATTERS
Data Science can be a complex place. Here, we try to simplify it. We cast our eye over some of the key benefits of Data Observability.

21 DATA SCIENCE CITY
Our first city focus features Frankfurt: the financial hub of Germany, home to one of the world’s largest stock exchanges, and a city that is fast becoming a global Data Science and AI powerhouse.

25 CROSS-VALIDATION
Tam Tran-The examines how it can go wrong and how to fix it.

28 START-UP
With Skanda Vivek.

31 DS RECRUITMENT: STATE OF PLAY
Each issue, we’ll examine the world of Data Science recruitment through the eyes of Damien Deighan, the CEO of UK-based Data Science Talent.

35 REINFORCEMENT LEARNING
With Francesco Gadaleta, the Founder and Chief Engineer of Amethix Technologies and the Host of the Data Science At Home podcast.

38 DR ANNA LITTICKS
Our spoof Data Scientist. Or is she?

COVER STORY
Taking corporate machine translation to the next level with Professor Philipp Koehn.
HELLO, AND WELCOME TO ISSUE 1 OF THE DATA SCIENTIST
Why produce a Data Science magazine? Well, we wanted to produce not just a Data Science magazine, but one that matters. We also wanted to produce a magazine that features not only the biggest names in Data Science and AI, but also an eclectic mix of Data Scientists discussing the topics that we think matter to you or may soon matter to you. Basically, a magazine that we would want to read ourselves.
As you will see, we are not a vehicle for advertisement after advertisement. We are a serious magazine, because Data Science is a serious subject, and that’s why we will cram as much content as we can into every issue. Our aim is for you to read this magazine from front to back and be both entertained and possibly challenged by the content. We also produce a top-tier Data Science podcast, and we wanted to do exactly the same in magazine format. It really is as simple as that!
We know that we are preaching to the converted when we say how vitally important Data Science is in the modern world and how important it will be in the future. We want The Data Scientist to reflect this importance.
So, what’s inside our first issue? Well, we feature interviews with Data Science icons Philipp Koehn and Francesco Gadaleta, plus a number of key articles from those within the industry. Amongst a raft of other pieces, we also look at the German city of Frankfurt; in each issue we will focus on cities and areas that are fast becoming thriving Data Science hotspots.
The Data Scientist is available in digital and print formats and we really do welcome any feedback that you may have. At present, the magazine is quarterly, and we would love to discuss any ways in which we can feature you or your organisation in a future issue and even possibly work alongside you.
We hope you enjoy this issue.
The Data Scientist Editorial Team
DATA SCIENCE: CAREER PROGRESSION ADVICE
GENEVIEVE HAYES
If you think your employer doesn’t value you, you’re probably right, but it’s not for the reasons you’d expect. To your employer, your value is based on the value you create. It doesn’t make them evil. It’s just the way business works. But for data scientists, that presents a problem.
Data scientists aren’t trained to create business value. They’re trained to fit models to data. Employers don’t retrain them because, data science being a new profession, employers don’t always know what value data scientists could produce - if only they’d ask. It’s a classic “don’t know what you don’t know” situation, and the result is disappointment all around.
Employers come to see data science as a failed experiment. Meanwhile, data scientists quit in frustration, to seek employment where they are truly appreciated. It’s tragic when it happens, but it doesn’t have to be this way. In fact, changing the situation isn’t as hard as you might imagine. The secret is to focus on adding value. The first step to doing that is changing the questions you ask yourself and others.
Small shifts in the questions you ask can lead to a massive increase in your perceived value as a data scientist.
Here’s where to begin.
GENEVIEVE HAYES TELLS DATA SCIENTISTS: IF YOU WANT TO BE VALUED, THEN YOU NEED TO ASK THE RIGHT QUESTIONS.
INSTEAD OF ASKING HOW, ASK WHY?
If you need to ask how to do your job, the value you bring as a data scientist is minimal. The how level is the level of unskilled and entry-level workers.
Many McDonald’s workers work at the how level. McDonald’s restaurants don’t hire workers to develop new menu items. The franchise model wouldn’t allow it. McDonald’s restaurants typically employ unskilled, predominantly teenage workers to deliver a standardized menu, and provide them with clear instructions on exactly how to do just that. That allows the restaurants to provide food quickly and cheaply at a massive scale. McDonald’s restaurants hire workers because they’re willing to be told how to do their job. Yet, for a highly skilled data scientist, this isn’t where you want to be.
Asking how leaves little opportunity for you to create business value, which makes it much harder for you to be valued by your employer. Operating at the how level makes you replaceable. You become a commodity.
It may not even be feasible for some data scientists to operate at the how level. To operate at the how level, there has to be someone who can tell you how to do your job - a manager or senior co-worker - and in smaller companies, they don’t always exist.
Throughout my career as a data scientist, I’ve never had a manager with data science training. Even if I wanted to work at the how level, I couldn’t.
If you want to build a career as a data scientist, you need to rise above the how level. Rising above the how level involves internalizing the how.
Instead of asking your boss “how do I do this?”, take personal responsibility for the how and answer it yourself. The internet is loaded with articles and tutorials explaining any data science concept you could possibly ever need. If you’re reading this article, you know where to find them. Internalizing the how raises your perceived value above that of a commodity.
ASKING WHAT? LEADS TO SUB-OPTIMAL SOLUTIONS
Once you take personal responsibility for the how, you move up to the what level of data science - tell me what you want me to do and I’ll do it.
Lots of white-collar professionals operate at this level. A lot of tradespeople are here, too. Think tax accountants or plumbers. If you have a leaky tap, you call a plumber to come and fix it. You tell the plumber what needs to be done and they do it. You don’t tell them how because you don’t know how. The plumber takes responsibility for the how.
As a data scientist, you can do fine at the what level, too. In fact, you can earn a decent wage there. You just show up at work, wait for your boss to tell you what they want from you and figure out how to do it. But you might be waiting a long time.
By working at the what level, you are delegating responsibility for diagnosing data science problems to someone who might not be capable of doing so.
Tax accountants and plumbers can operate effectively at the what level because their clients are able to self-diagnose a leaky tap or the need to submit a tax return. However, a person without data science training cannot and should not diagnose the best data science solution to a business problem. If your boss is able to answer the what, you may end up with a sub-optimal solution. If they aren’t, you’ll get no solution at all.
The latter is very bad for you. It leads to employers concluding that data science adds no value to the business, and ultimately downsizing or eliminating the data science team. Either way, you’re not demonstrating the full value you can bring as a data scientist. To do that, you need to focus on the why instead.
WHY? IS THE QUESTION OF EXPERTS
If you want to maximize the value you bring as a data scientist, you need to start asking why:
■ Why do you want me to do this?
■ Why did you hire a data scientist?
■ Why do you want me to do this now?
Asking why allows you to understand your boss’s motivations and diagnose their business problem.
If you’re functioning at the why level, you’re taking personal responsibility for the what and how. The why level is where experts operate. It’s no coincidence that workers functioning at the why level are the most highly valued. They’re also the most highly paid.
Doctors function at the why level. If you wake up one morning feeling ill and go to the doctor for antibiotics, your doctor won’t just write you a prescription. Instead,
they’ll ask you a series of questions to determine why you are feeling ill - is it really something that requires antibiotics, or is it just a common cold, or perhaps the after-effects of the dodgy takeaway you ate last night? Your doctor doesn’t rely on your self-diagnosis because you don’t have the knowledge and experience to make that diagnosis. Your doctor, on the other hand, has a medical degree and experience with hundreds, if not thousands, of patients. You went to them because they know better than you.
The same applies to data scientists. Data scientists are highly trained specialists. Most have a Masters degree and many have a PhD, too. But in most cases, data scientists report to people who don’t have the same level of technical ability as they do. That’s why the data scientist was hired.
As a data scientist, if you abdicate your role in the diagnosis process to your boss, or anyone else, you’re effectively saying that person, who may be less technically capable than you, can come up with a better solution to a data science problem than you can. If that’s the message you’re sending out, of course you’re not going to be valued.
However, by accepting a role in the diagnosis process and operating at the why level, you can position yourself as an expert. Asking why shifts the focus to the outcomes your boss needs you to achieve, not just what they say they want you to achieve. Those are the outcomes that deliver real business value. And by using your knowledge and experience to select the right course of action - the what - you can play an active and vital role in delivering that outcome. You’ll rise above the level of a cog in the machine to being the machine driver.

“It takes half your life before you discover life is a do-it-yourself project.”
NAPOLEON HILL

“Yo, I’ll tell you what I want, what I really, really want.”
SPICE GIRLS

“People don’t buy what you do; people buy why you do it.”
SIMON SINEK

ASK THE RIGHT QUESTIONS AND CHANGE HOW YOU VALUE YOURSELF
From the fact that you’ve read this far, I know you want to be valued as a data scientist, but so far feel that you’re not valued.
So do something about it.
For the next 30 days, commit to shifting your questions up a level. If you’re currently operating at the how level, internalize the how and commit to moving to the what level. If you’re currently at the what level, focus on rising to why . 30 days isn’t long. You might not notice any change in the way your employer sees you in that time. You might even get push-back. You’re trying to change the people around you, so the people around you will subconsciously fight to stay the same. In biology, it’s called homeostatic pull.
What will change, though, is the way you view yourself. Psychologist Benjamin Hardy talks about the concept of becoming your future self. According to Hardy, if you want to make a change in any aspect of your being, you should act as though the change has already occurred. You will change how you see your future and then your present will shift to match that future.
If you ask the questions a data scientist operating at a higher level than you would ask, you’ll also start operating at that level. If you start behaving like an expert data scientist, you’ll become an expert too and you’ll be valued as an expert. It might not happen immediately, but keep at it. The psychology tells us it will happen.
And that’s how you’ll be valued as a data scientist.
“Change what you do and you will change who you are.”
BENJAMIN HARDY
HOW NEURAL NETWORKS HAVE TRANSFORMED MACHINE TRANSLATION
THE PHILIPP KOEHN INTERVIEW
■ An iconic leader in the field of machine translation research and the author of over 100 publications.
■ Professor of Computer Science at the prestigious Johns Hopkins University, where he continues his research into machine translation through his affiliation with the Centre for Language and Speech Processing.
■ Chief Scientist at Omniscien Technologies, a market-leading global supplier of high-performance and secure Language Processing, Machine Translation (MT) and Machine Learning (ML) technologies and services for content-intensive applications.
■ Professor and Chair of Machine Translation at the University of Edinburgh’s School of Informatics, where he contributes to its Statistical Machine Translation Group, which organises workshops, seminars, and projects related to the subject.
■ Under Philipp’s guidance and leadership, the open source Moses system has become the de-facto standard toolkit for machine translation in research and commercial deployment.
■ Led international research projects such as Euromatrix and CASMACAT, and his research has been funded by the likes of the European Union, DARPA, Google, Facebook, Amazon, Bloomberg, and several other funding agencies.
Philipp, please explain machine translation in simple terms…
PK: To understand documents in different languages you need to have a translation. That’s a process that humans have been doing forever, and it’s always been one of the Holy Grails of natural language processing. Over the last twenty to thirty years machine translation has gone from the point of being really poor, to really quite useful.
Why have you dedicated your life to studying machine translation?
PK: When I started studying computer science in the early nineties, I realised relatively quickly that it’s bad to work on machine learning for machine learning’s sake, without having a problem. At that time, text processing was a really good practical problem because you have the data and you can actually do machine learning, and it’s a somewhat feasible task where there is at least some idea about what the correct input and the correct output should be.
Approaches to machine translation tend to centre around the two concepts of adequacy and fluency. Can you talk us through that?
PK: There are always two goals for translation. Firstly, in terms of fluency, translators want to produce text that doesn’t read as noticeably translated. Adequacy, on the other hand, is about whether the translation preserves the meaning of the original text. If you think about translation of literature, fluency is more important than adequacy; the book should be enjoyable rather than preserve every fact. For example, if I wrote in a story for an American newspaper that a town ‘has the population size of Nebraska’, this is understandable in America, but not for someone in China reading a Chinese translation.
We have a model that asks: ‘is this a fluent sentence?’ and a second one to see how well things map. This is then balanced and applied to the type of text being worked on.
How do you quantify the performance of such machine translation systems? What kind of metrics are useful for that?
PK: The model I just discussed has components that make sense within the model, but ultimately the goal is to produce translations. This raises two questions: ‘how do you evaluate?’ and ‘what is a good translation?’
We have an engineering problem in that we want to build machine translation systems and then tune and change them, and be able to measure the results immediately. This means we need an automatic metric to evaluate how good machine translation is.
Are performance metrics more important internally, to train and develop the models?
PK: There’s the infamous BLEU score that is used in machine translation. Ideally you want to know how many words are wrong, but you also have to consider word order, which is not easy to do. The BLEU score looks at how many words are right, and also which pairs, triplets and four-word sequences are right. You then compare the output against a human translation.
The best human translation is disputed amongst translators; if there are flaws in a sentence, two opposing human translations will have different flaws, and why is one flaw worse than another? Still, we have had a reasonably useful set-up for the last twenty years, since the BLEU score was invented. Such metrics have definitely helped guide the development of machine translation.
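The BLEU idea Koehn describes - counting how many single words, pairs, triplets and four-word sequences of the system output also appear in a human reference - can be sketched in a few lines of Python. This is a simplified, single-reference version with crude smoothing, not the exact scoring used in research evaluations:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions (n = 1..4) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero matches
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalise translations shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat".split()
score = bleu(hyp, ref)
print(round(score, 3))
```

An identical sentence scores 1.0, while an overlapping-but-imperfect translation lands somewhere between 0 and 1, which is what makes the metric usable for day-to-day system tuning.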
One of the earlier ideas of machine translation was to split the problem into three categories; a lexical, a syntactic, and a semantic problem. Is this still a valid approach in the age of neural networks?
PK: Yes, and no. Twenty or thirty years ago, there was a grand vision of machine translation being an application that guides the development of better natural language processing. That involved understanding language by going through various processing stages, such as finding the nouns, the verbs to handle morphology, and detecting syntactic structure. Ultimately, the vision was always to have some meaning representation that is beyond all language. If you take a source language and map it to that meaning representation beyond our language and then generate from that, you can build machine translation systems for every language pair.
The goal of interlingual systems was to build an analyser and a generator for each language. The statistical revolution that happened twenty years ago disregarded this theory and said it’s just a word-mapping problem - a problem based only on finding source words and mapping them to target words. We have to have some kind of model of reordering but it’s all tied to words, so it was a very superficial model. It just looked at word sequences.
These results were generally good, except for the grammatical accuracy, because these systems only looked at very short sequences of five words at a time. For instance, you could get to the end of a sentence that ended without a verb to give it meaning.
We developed quite successful systems that actually built syntactic structures with noun phrases, clauses, and so on. Work was initially done focusing on Chinese to English, where the word order and structure was much trickier. We were also really successful in German to English, which had been a problem for natural language processing. German and English are related languages, but the syntax is vastly different. We’ve made some progress towards building linguistically better models, but they have become more complicated because you had to build these structures. Then, there was talk about semantic representations and graph structures that are even more difficult. That’s when the neural machine translation wave hit and, once again, treated it as a word-mapping problem - taking in a sequence of input words and producing a sequence of output words.
With these new machine learning methods, why do you think other fields are now statistically driven? Are neural networks matching language better? If so, is it due to hardware or increased data?
PK: There are various aspects to that question. The turn to data-driven methods in natural language processing is pretty much parallel to what I just described about machine translation. Other problems in natural language processing, for instance analysing syntactic structures, were often solved with handwritten rules: a sentence is a subject and an object; a subject is a noun-phrase; a noun-phrase is a determiner, an adjective and a noun. However, if you look at actual texts, something often violates how language supposedly should look. In the nineties the field was rebuilt around data: now you just annotate sentences with their syntactic trees and learn from those.
Why has it completely overtaken the field? Because you can get all your training data for free. All you need is translated text, which is what people use all the time. It’s extremely rare that we actually annotate training data ourselves. We find it from the internet or from public repositories.
Humans also have different ways of learning language. Do we learn language mainly from rules, or do we just absorb it?
PK: I’m not a linguist and we don’t have a linguistic theory. We listen to language first and then go to school, realising we make grammatical mistakes, and then we are taught some rules. Is language driven by rules, or just an amalgamation and repetition of what people say? It seems to be a mix of both. There’s some structure, but there’s also evidence of language simply being repetition. For example, my kid said to me: “he be vibing though”, which is not grammatically correct, but it’s what people say and it becomes part of language through repetition.
There are attempts now to process even sequences of characters. What’s driving this?
PK: The fundamental problem is that everything is incredibly ambiguous in language. The classic example of this is ‘river bank’; we have banks for our money but a ‘river bank’ means something completely different. That’s why it’s hard for a computer because it has to resolve the ambiguity. How can a computer ever tell the difference between a financial bank or river bank? They’re just banks. They’re just character sequences of four letters - b, a, n, k.
To resolve that, machines can take the context of the preceding word, ‘river’, and use it to disambiguate ‘bank’. If you translate phrases it becomes much less ambiguous. For example, ‘interest rate’ can be accurately translated as a unit, but is much more ambiguous when its words are translated individually. It’s somewhat similar when you get to sub-words and character sequences: if you treat two words as completely different, what will you do with ‘car’ and ‘cars’, for instance? You still understand what it means to add the plural. We need to get away from representing ‘car’ and ‘cars’ as completely unrelated units. Looking at the character sequences, we see that they’re very similar, and that should help.
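The ‘car’/‘cars’ point - letting related word forms share most of their representation - is what subword segmentation achieves in modern systems (byte-pair encoding and similar schemes). Here is a toy illustration with a hypothetical hand-picked vocabulary and greedy longest-match splitting; real systems learn the subword vocabulary from data rather than listing it by hand:

```python
def segment(word, vocab):
    """Greedily split a word into the longest subword units found in vocab,
    marking non-initial pieces with '##' (a WordPiece-style convention)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            piece = word[i:j]
            if piece in vocab:
                pieces.append(piece if i == 0 else "##" + piece)
                i = j
                break
        else:
            return None  # cannot segment with this vocabulary
    return pieces

# Tiny hypothetical vocabulary: frequent stems stay whole, endings split off
vocab = {"car", "s", "ing", "translat", "ion"}
print(segment("cars", vocab))         # shares the stem with "car"
print(segment("translation", vocab))  # shares the stem with "translating"
print(segment("translating", vocab))
```

Because ‘cars’ becomes the stem ‘car’ plus a plural piece, the model’s knowledge about ‘car’ transfers to ‘cars’ for free, exactly the sharing Koehn describes.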
Is this one of the reasons the field is currently relying so heavily on recurrent neural networks?
PK: You have an input sentence and you have an output sentence. The output sentence, if you predict one word at a time, has all the previous words to help disambiguate. What drives the decision to produce the next word is obviously the input sentence, but also all the previous words produced. This creates a recurring process where the big question is: is language recursive or is it just a sequence?
There are good reasons to believe that language is heavily influenced by the latter. When we understand language we always receive it linearly; we read and listen to things word by word. We don’t look at the entire sentence, find the verb and then branch out again. We just see it as a sequence, so it should be modelled as a sequence-generating task too, where you produce one word after the other. This makes it a bit more feasible. You still have a fairly large vocabulary of hundreds of thousands of words. You can break up infrequent words into sub-words to make it computational, but this recurrent process of producing one word at a time suits language well.
What type of neural network are you using? Are input sentences encoded and output sentences decoded?
PK: You have an input sentence and you try to work out the meaning of that sentence. During the rule-based days in particular, this was done explicitly: representations closely mirrored our understanding of meaning, or at least syntactic structure. In neural networks, there are claims that this kind of meaning emerges in the middle of the process - going from an input sentence to an output sentence. We look at the source sentence and encode it, then the decoder generates the output sentence.
Recurrent neural networks started in machine translation five years ago, which is now ancient. Since around 2019, we have a different model that is called a transformer (a more informative name is ‘self-attention’). The idea is that we model words in the context of the other words, and we do this very explicitly. We refine the representation of each word given the surrounding words, and we go through several layers of that - this is the self-attention transformer approach. We have the same thing on the source and the target side.
What can typically go wrong with machine translation, and which methods are used to validate the outputs?
PK: Output validation asks, ‘how well did your system do?’ You leave aside a number of sentences, translate them and check how well they match the human translation. We can measure the probability of each word against the human translation, and we can also make the system produce the human output to see how well it scored.
So, what can go wrong? An interesting thing about the neural machine translation approach is that it differs in the types of errors. Statistical methods, because they had a very narrow window in what they looked at, often produced incoherent output. In the neural model, if the input has unusual words, the output will be gibberish. With little training data, these models often produce beautiful sentences that sound like biblical prophecies and have nothing to do with the input.
Why is this? If you don’t have much data, you have the Bible and Koran in hundreds of languages in which to train your model. If you then want to translate tweets, the model is thinking, ‘I have no idea what that input is, but here’s a beautiful sentence I’ve seen in training. How about that?’ It’s effectively a type of hallucination. That’s a real problem because this output can fool you. How do you know, for instance, if a Chinese document translated into English is ‘beautiful gibberish’ or an accurate translation? That’s the fluency-related problem.
There is also a problem of adequacy - do we translate the words correctly, and how do we handle ambiguous words? Previously, you just got gibberish output and didn’t trust it. Now you get a beautiful output and there’s no reason not to trust it.
What datasets are you using to train your machine translation?
PK: Everything we can get our hands on. A long time ago I came across the European Parliament website, which had public debates translated into all the official languages - back then 13 languages, now 28. You can just download all the web pages. It’s very easy to figure out which blocks of text belong where, so it’s not that hard to then break these down into sentences.
It’s a big data resource that was used for a long time; there’s about 50 million words of translated text for all the official EU languages, and a lot of the other publicly available datasets are similar. Open subtitles are interesting too. Currently, people like to translate; I think this also comes from pirating TV shows, from English to Chinese, for instance. People create subtitles and then translate the subtitles, so there are actually hundreds of millions of words from translated subtitles (although the quality is not always great).
We have a big project right now, running for three or four years, with the University of Edinburgh and other groups in Spain, where we go out on the web and crawl any website there is. This is something Google has been doing since the very beginning of their engagement with machine translation. They had the advantage that they had already downloaded the entire internet for their search engine. They do better because they have more data, but we too have access to the internet, can download everything, and have tried finding translated texts.
We usually find goldmines of good data, where there is consistent translation that’s nicely formatted. For the biggest languages, you can get billions of words - more than you can read in your lifetime, but for the lesser known languages there is not much at all.
You mentioned using data from the European Parliament. Is that language not particularly specific, and would it not have an impact on the translation?
PK: It depends on what the application is. We’ve been using machine translation in the academic community for the last 15 years. We used news stories as a test because it’s tough and has a very broad domain. I can talk about sports, natural disasters or political events and it’s relatively complex language - the average sentence length is around 30 words. We found the European Parliament proceedings very useful because they talk about the same subjects, using particular language.
Translation of speech and spoken language is very different from written language, even parliamentary proceedings, which are normally spoken. However, there is a mismatch between spoken language data to train the machines and edited official publications. The two are very different and it’s a real problem.
What role does infrastructure and technology play for machine translation?
PK: It’s been pretty computer-heavy because of large data sizes - gigabytes of training data. Still, a student at a university could do meaningful research with publicly available data. There was a lot of open source code: you could download the software, run it, and then work on improvements. A single computer was enough for that.
This changed because of neural networks. You need GPU servers, and the more compute resources you have, the better the results. You can build more complex models - you can measure the complexity of a neural network by how many layers it has. You can build models with six or seven layers, or you can have models with twenty layers, but they’re slower to train. For the big language pairs, where you do have a billion-word corpus, training takes weeks.
The problem we have in academia is competing with industry labs that easily have 1,000 GPUs, each costing up to $2,000. You need to put them in a computer, so a machine with four GPUs costs $10,000. In academia, we have about 100 or 200 GPUs available for 50 PhD students - and that’s with us being the centre for language and speech processing.
Academia simply can’t compete with that scale of experimentation. I read about a language model that was trained on 1,000 GPUs for a week, and I thought, ‘that’s the end for us.’ There’s also a model called GPT-3, which some of your readers may be familiar with. That’s a big language model trained with around 50,000 GPU-days of compute, with more to come in future - and that’s just inconceivable, so this does limit what we can practically do with our models.
12 | THE DATA SCIENTIST PHILIPP KOEHN
How does academic research translate into industry applications?
PK: Ultimately, students and researchers work on what’s fun, but we are guided by big funding projects, too. In the US, DARPA has been funding machine translation because they’re interested in understanding foreign-language text. More recently, languages not covered by Google have been the focus - I’ve worked on Somali and Ethiopian languages, for example - and that drives some of the research.
Generally, machine translation research in academia is not that concerned with end applications. The bar isn’t as high as Google Translate’s: mistakes are okay, so long as the output is understandable. Facebook has a similar problem in that people post in different languages and the translation needs to be understandable - which is even tougher when people write slang or error-ridden sentences that can be very hard to translate.
Commercial applications of translation are actually completely different. Most companies that want to globalise their products have to translate marketing materials, for instance. Another big area being worked on - by companies such as Omniscien Technologies - is subtitle translation for movies and TV. That quality bar is much higher, because consumers expect to read subtitles without any errors; otherwise it will just annoy them. This is where human translation is still relevant.
How far do you think machine translation is away from passing a Turing test?
PK: I’m not going to make predictions for when we will have flawless translation. Machine translation has a history of overselling and under-delivering and going through various hype cycles. I think a good measure of machine translation should be, is it good enough for a particular purpose? If I go to a French newspaper website, for instance, and there’s a story about President Macron, I can run it through Google translate and I can perfectly understand the story, even if some things are missing. But that’s good enough. If I want to buy a Metro ticket in Paris and the translation of the website allows me to buy it, it’s good enough. It doesn’t have to be perfect.
Another measure is: does machine translation make professional human translators more productive? If you can make them twice as fast, that saves an enormous amount of money - and that’s generally the measuring stick. If you demanded perfect translation, you could construct any kind of intelligence test as a translation challenge - basically write a story any way you like, and check it. As an example, ‘cousins’ in English is not gendered, but if I translate it into German, I have to pick a gender, so you have to alter the meaning a bit when translating. That kind of world knowledge is deep AI that we don’t have right now. We’re not close to perfect translation, and we don’t have to be. I think it’s an impossible task anyway. There’s always going to be someone who says, “no, that’s not right.”
You mentioned the black box field of the neural networks you’re training, do you see evidence that language might be an emerging property of a complex system?
PK: I think it’s a very interesting question. What does it actually say, for instance, about image recognition or language? There is a lot of physics envy - a wish to reduce the world to a few formulae - but that doesn’t seem to work for language; you can’t just write down a few rules and be done. We can discover principles that are true 90% of the time, but there’s always an exception that proves the rule, and language is a lot like this.
I think this is definitely one of the great challenges, trying to understand what’s going on.
To learn more about machine translation, Philipp’s most recent book, Neural Machine Translation, is a great place to start.
Available at amazon.co.uk/Neural-MachineTranslation-Philipp-Koehn/dp/1108497322 or scan the QR code below to go straight to the book’s Amazon page:
If you require professional grade machine translation services check out Omniscien website at: omniscien.com or scan the QR code below:
NEED A DATA SCIENTIST?
We’ll recruit the right Data Scientist for you.
Guaranteed.*
* We are so sure that we can find you a suitable candidate, that if you recruit a Data Scientist (Permanent Hire) that is unsuitable or leaves within 12 months, then we’ll replace them for free.
At Data Science Talent, we understand Data Science recruitment better than anyone else. Here’s why:
You'll choose from the top 20% of permanent and contract candidates, thanks to our 80/20 Hiring® system and marketing-led approach.
We'll use our exclusive DST Profiler® platform to preassess your choice of candidates before you interview them.
Our in-house Data Scientist quality-tests everything we do to ensure you get the best service, people and results.
Visit us at datasciencetalent.co.uk. Welcome to the future of Data Science recruitment.
Without hiring the best people, you won’t get the best results.
DATA TEAMS ARE GETTING LARGER, FASTER
HIGHLY-RESPECTED DATA LEADER MIKKEL DENGSØE EXAMINES THE RELATIONSHIP BETWEEN DATA TEAM SIZE AND COMPLEXITY.
Data teams at high-growth companies are getting larger and some of the best tech companies are approaching a data-to-engineers ratio of 1:2.
More data people means more analysis, more insights, more machine learning models, and more data-informed decisions. But more data people also means more complexity, more data models, more dependencies, more alerts, and higher expectations.
When a data team is small you may be resource-constrained but things feel easy. Everyone knows everyone, you know the data stack inside out, and if anything fails you can fix it in no time.
But something happens when a data team grows past 10 people. You no longer know if the data you use is reliable, the lineage is too large to make sense of, and end-users start complaining about data issues every other day.
It doesn’t get easier from there. By the time the data team is 50 people, you start having new joiners you’ve never met, people who have already left the company are still tagged in critical alerts, and the daily pipeline is only done by 11am - leaving stakeholders complaining that data is never ready on time.
HOW DID THIS HAPPEN?
With scale, data becomes exponentially more difficult.
The data lineage becomes unmanageable. Visualising the data lineage is still the best way to get a representation of all dependencies and how data flows. But once you exceed hundreds of data models, the lineage loses its purpose. At this scale, you may have models with hundreds of dependencies, and it feels more like a spaghetti mess than something useful. As it gets harder to visualise dependencies, it also gets more difficult to reason about how everything fits together and to know where the bottlenecks are.
The pipeline runs a bit slower every day. You have so many dependencies that you no longer know what depends on what. Before you know it, you find yourself in a mess that’s hard to get out of. That upstream data model with hundreds of downstream dependencies is made 30 minutes slower by one quirky join that someone made without knowing the consequences. Your data pipeline gradually degrades until stakeholders start complaining that data is never ready before noon. At that point, you have to drop everything to fix it and spend months on something that could have been avoided.
Data alerts get increasingly difficult to manage. If you’re unlucky, you’re stuck with hundreds of alerts, people mute the #data-alerts channel, or analysts stop writing tests altogether (beware of broken windows). If you’re more fortunate, you get fewer alerts but still find it difficult to manage data issues. It’s unclear who’s looking at which issue. You often end up wasting time looking at data issues that have already been flagged to the upstream engineering team who will be making a root cause fix next week.
The largest data challenge is organisational. With scale, you have teams that operate centrally, embedded, and in hybrid models. You no longer know everyone in the team. At each all-hands meeting there are new joiners you’ve never heard of, and people you have never met rely on data models you created a year ago and constantly come to you with questions. As new people join, they find it increasingly difficult to understand how everything fits together. You end up relying on the same few data heroes - the only ones who understand how everything fits together - and if you lose one of them, you wouldn’t even know where to begin.
All of the above are challenges faced by a growing number of data teams. Many growth companies that are approaching IPO stage have already surpassed a hundred people in their data teams.
HOW TO DEAL WITH SCALE
How to deal with data teams at scale is something everyone is still trying to figure out. Here are a few of my own observations from having worked in a data team approaching a hundred people.
Embrace it. The first part is accepting that things get exponentially harder with scale, and it won’t be as easy as when you were five people. Your business is much more complex, there are dramatically more dependencies, and you may face scrutiny - from preparing for an IPO, or from regulators - that you didn’t have before.
If things feel difficult, that’s okay. They probably should.
Challenges at different data team sizes

Data team size 0-10
HOW IT FEELS: You are on top of it. Everyone knows everyone. It’s easy!
WHAT’S CHANGED: You just started using dbt and it’s great.

Data team size 10-20
HOW IT FEELS:
• Doing simple work takes longer
• You no longer understand all data models
WHAT’S CHANGED:
• You’ve hired more people and it’s great
• You start having 100s of data models
• More analysts work embedded

Data team size 20-100
HOW IT FEELS:
• People who’ve left the company are getting tagged in Slack alerts
• How do you even show a lineage with 1,000 data models?
• Why is your pipeline only done at 11am?
• What should I do with the 100 Slack alerts I get every day?
WHAT’S CHANGED:
• You have 1,000s of data models
• Dozens of teams have data people embedded
• New joiners that you’ve never met have joined your company
• Your company is preparing for an IPO

Data team size 100+
HOW IT FEELS: Things are slowly starting to get easier again.
WHAT’S CHANGED: You’ve invested in teams and tools to make life easier at scale.

[Chart: data team sizes at high-growth companies including Zego, Soldo, Lendable, Trouva, Meero, Snyk, Starling Bank, TravelPerk, Freetrade, Trulayer, Instabox, Messagebird, ComplyAdvantage, Dixa, Tessian, Hopin, Cleo, Gymshark, Contentful, Tide, Curve, Tink, Truecaller, Zopa, Voi, Onfido, Streetbees, Get Your Guide, Trustpilot, BlaBlaCar, Typeform, GoCardless, Lunar, Babylon, Gousto, N26, Cazoo, Etoro, Contentsquare, Wise, Monzo, checkout.com, Revolut, Getir, Deliveroo, Just Eat, Hello Fresh, Glovo and Klarna.]

Work as if you were a group of small teams. The big problem when teams scale is that the data stack is still treated as everyone’s responsibility.

As a rule of thumb, a new joiner should be able to clearly see the data models and systems that are important to them but, more importantly, know what they don’t have to pay attention to. It should be clear which data models from other teams you depend on. Some data teams have already started making progress by only exposing certain well-crafted data models to people outside their own team.

Don’t make all data everyone’s problem. Some people thrive on complex, architecture-like work, such as improving the pipeline run time. But some of the best data analysts are more like Sherlock Holmes and shine when they can dig for insights in a haystack of data.

Avoid mixing these too much. If your data analysts spend 50% of their time debugging data issues or sorting out pipeline performance, you should probably invest in distributing this responsibility to data engineers or analytics engineers who shine at (and enjoy) this type of work.

Growing data teams are here to stay, and we’ve only scratched the surface of how we should approach them.
DATA SCIENCE MATTERS #1 DATA OBSERVABILITY
Data is, quite simply, one of the most valuable assets of modern times, and proactive, strategic organisations have long since realised that the results and insights derived from data must be regular, accurate, reliable, and of top-tier quality. Only then can critical decisions and plans be formulated. Quality data ensures that sound decisions can be made, and one of the key strengths of data observability is that it provides full visibility into an organisation’s data pipelines.
Data observability allows an organisation to identify, troubleshoot, and rectify data issues quickly. The 2021 Observability Forecast found that 90% of respondents believed observability is both important and strategic to their business, but only 26% said their observability practice was mature.
In very simple terms, data observability refers to an organisation’s ability to fully understand its data and data systems. Sounds simple, doesn’t it? But as we know, this understates the relevance, usage, and importance of data in modern times.
Many ask what the main difference is between data observability and data monitoring. To answer this, we need to examine both what has happened and why it has happened.
Whilst monitoring may inform you that your data pipeline has failed, observability tells you why it has failed and gives you the information to make informed decisions. Two key differences follow: data observability is proactive, unlike its reactive data monitoring counterpart, and data monitoring tells you when something goes wrong, whereas data observability tells you why it has gone wrong.
Data (pipeline) observability is not simply the ability to know that your pipeline failed - monitoring should already tell you this. A data observability tool does the detective work to point you to the proximal cause - for example, the failure of a Spark job - as well as the root cause, such as an invalid row in the data.
So, do organisations know about the importance of data observability? If not, here’s The Data Scientist’s quick and easy guide to some of its key benefits…
DATA SCIENCE CAN BE A COMPLEX PLACE.
IN THE DATA SCIENTIST, WE TRY TO SIMPLIFY IT.
IMPROVES RELIABILITY, SERVICE, AND EXPERIENCE
In a world that has changed hugely because of the Covid-19 pandemic, organisations rely more and more on digital services, and the data from those services gives greater insight into performance. Data observability has a purpose beyond just showing how well our app components are performing over time: data can and should be used to show where business results are affected, and to improve the ability to handle risks.
Observability tools empower engineers and developers to create better customer experiences despite the increasing complexity of the digital enterprise. With observability, you can collect, explore, alert, and correlate all telemetry data types.
COST EFFECTIVE
As well as benefitting compliance and security, another great benefit of data observability is that it doesn’t need to break the bank. Data shouldn’t need to be moved from its current location, which means any solution should be fast and scalable. Data observability should connect efficiently to your existing stack without requiring changes to your codebase, pipelines, or programming language. And if it costs ten times as much to complete a unit of work when data is flawed as when data is perfect, then prevention - in other words, data observability - is better than cure.
CONTROL
Organisations can gain a far firmer grip on both active and resting data when their teams monitor and observe data pipelines. Analytical teams are empowered to develop systems, processes, and tools to quickly identify the data problems, bottlenecks, and inconsistencies that could otherwise cause data downtime.
SIMPLIFIES COMPLEX SYSTEMS
Because simple systems have fewer moving parts, they are easier to manage. Distributed systems, however, are constantly updated and have more interconnected parts, which means the types and number of failures that can occur are higher too, creating more unknowns.
Data observability is a huge help when data pipelines leak, as it answers many of the key questions about what has happened and why, which can improve efficiency and reduce costs. Observability is better suited to the unpredictability of distributed systems, mainly because it allows organisations to ask serious questions about their systems as issues arise, and to gain control over data in motion and at rest.
UNDERSTANDING
As organisations’ and companies’ data usage and systems increase and become more complex, these systems and pipelines are more likely to malfunction or break. Data observability gives clearer visibility into data pipelines and infrastructure, detects hard-to-spot problems, and hence gives organisations a greater understanding not just of what is happening, but why. It allows you to measure, and then improve, what you are monitoring.
SERVICE
It’s an age-old business axiom - and not only for data - that if you can measure it, you can improve it. Where does data observability come into this? It’s pretty simple: data observability offers better-quality data insights to assist your organisation in its planning, decision-making, and budget control. The more you know, the better you go.
REDUCES NOISE
An observability tool can help accelerate current processes and really reduce noise. One of the challenges with data governance is that it creates a lot of noise; with a proactive mindset, you can layer in an observability tool to reduce that noise, increase coverage, and gradually adopt it as you take advantage of the capabilities of a modern tool.
FRANKFURT DATA SCIENCE CITY
Rated as an ‘Alpha World City’ by the GaWC / The financial hub of Germany and one of the world’s largest stock exchanges / The biggest city in the German federal state of Hesse / A city of commerce and high-technology / Boasts one of Europe’s largest and busiest airports / Hosted international trade fairs since 1240 / A long and proud history in research and academia / An extremely livable city with a cosmopolitan feel and diverse, multicultural population
Frankfurt may sometimes go under the radar as far as tourism and media attention are concerned - perhaps because it looks quite unlike other German cities. It has earned the nickname ‘Mainhattan’, because the city that straddles the River Main boasts the vast majority of the country’s skyscrapers, with the Commerzbank Tower the tallest of them at 260 metres.
But, as stated above, when it comes to Data Science, Frankfurt is gaining a burgeoning reputation on both the European and world stages, and it is a great place to live and work. That’s why so many are taking a bite out of Germany’s ‘Big Apple’, and why we at The Data Scientist are featuring it as our very first Data Science City.
FRANKFURT FACTS
Full City Name
Frankfurt-am-Main
Population
It is the fifth-largest German city, with 763,380 inhabitants (as of 31 December 2019)
#1
OUR FIRST CITY FOCUS FEATURES A CITY THAT IS FAST BECOMING A GLOBAL DATA SCIENCE AND AI POWERHOUSE - FRANKFURT.
Frankfurt’s Data Science Pedigree
Frankfurt is Germany’s financial hub. But not only that: the city is also becoming a real hub of the country’s data centre industry, partly because of its strong financial sector and partly because Frankfurt is home to one of the largest internet exchange points in the world - more than 50 colocation data centres and over a thousand international networks converge at DE-CIX. Frankfurt is quickly joining London, Amsterdam, Paris and Dublin as one of the largest data centre locations in Europe, and Germany is one of the most developed countries with regard to business digitisation and digital public services.
Indeed, there are a number of pharma companies based in and close to Frankfurt. Whilst data science is still relatively new within the country’s academia, the huge importance of digitisation for the city’s financial sector ensures a growing need for data science expertise in the Frankfurt area.
Location
Frankfurt is located in the German state of Hesse and lies on the River Main, in the heart of the Rhine-Main metropolitan region.
Geography
Close to a number of major German cities and stunning countryside, and only two hours away from France - you can even go skiing in Switzerland at the weekend if that’s what you want to do!
The World Conference on Data Science & Statistics
(more commonly known as Data Science Week 2023) will be held in Frankfurt from 26th to 28th June next year, with the theme “Understanding Data Science: How it can help now and in the future”. To keep your finger on the pulse of data science events within the city, head to: www.meetup.com/topics/data-science/de/frankfurt
Money, money, money
With well over 400 banking addresses in the financial district alone, some say that Frankfurt runs on money. Indeed, the city is a major European financial hub, the main financial hotspot in Germany, and the gateway to Europe’s biggest economy.
Frankfurt hosts a large and unique concentration of European and national supervisory bodies, international banks, insurance companies and legal practitioners. Companies and organisations in Frankfurt include financial giants such as Deutsche Bank, Commerzbank, Dresdner Bank, BHF Berliner Handels-und Frankfurter Bank, DG Deutsche and Landesbank Hessen-Thüringen, while you can also find the likes of BfG Bank, Citibank, HSBC, Lehman Brothers, Merrill Lynch, Morgan Stanley, and Goldman Sachs within the city.
Frankfurt also has the 12th largest stock exchange in the world by market capitalisation - and it’s in Frankfurt where European monetary policy is conducted, due to the city being home to the European Central Bank (ECB).
But Frankfurt isn’t all about finance and banking - there are any number of cloud and fintech start-up companies too. Plus the automotive, chemical and pharmaceutical, technology and research services, consulting, media and creative industries are really strong within the city, ensuring Frankfurt is a major economic base, and one where data science flourishes.
Overview of the city
Frankfurt is multicultural, welcoming, and a great place to live and work. It has a thriving economy, a low unemployment rate, and excellent career opportunities. It is a city of contrasts, where towering glass structures rise above the picture-postcard-pretty Old Town (Altstadt), and where Germany’s largest city forest is walkable from the centre.
90% of Frankfurt was destroyed in World War II, but having been rebuilt it’s now possibly Germany’s most diverse and multicultural city, with nearly 30% of residents carrying a foreign passport. Frankfurt is a city that is both modern and progressive and also charming and historic, and boasts a wealth of culture, art, and places to visit and see.
Many experts say that Frankfurters enjoy some of the highest quality of life in the world. Their work-life balance clocks in at a rather attractive average working week of 25 hours and a happiness level of 7.3 out of 10. Add in excellent education, transportation, and healthcare, and you can see why many see the city on the Main as one of the best places to live in the world.
With regard to working in Frankfurt, any citizens of countries that are members of the European Union enjoy the same status as Germans in the local job market, whereas others must first obtain a work permit (Arbeitserlaubnis) which is needed for a residency visa (Aufenthaltsgenehmigung). In general, an application must be filed through the German consulate abroad well in advance of the starting date of the work contract.
BUT DON’T TAKE OUR WORD FOR IT: Frankfurt was ranked 7th in the international city comparison of Mercer’s ‘Quality of Living Index’ in 2019, and in The Economist Intelligence Unit’s 2018 ‘Global Livability Index’, Frankfurt came in 12th place worldwide, making it the most livable city in Germany and number 4 in Europe!
7 things about Frankfurt you may not know
1. Frankfurt has a huge forest (Stadtwald Frankfurt) in the heart of the city. The 5000 hectare forest is even within walking distance from the old town, and is popular with runners, cyclists, and walkers.
2. The famous German poet Goethe was born and grew up in Frankfurt.
3. The Frankfurt Book Fair is the world’s largest trade fair for books.
4. Throughout the Holy Roman Empire, new kings were elected in Frankfurt.
5. The first Wikimania - the yearly conference for all things Wikipedia - was held in Frankfurt in 2005.
6. The Senckenberg Museum of Natural History houses the largest dinosaur exhibition in Europe.
7. In 1944, when the city was bombed during the Second World War, the animals (including lions) escaped from the city’s zoo.
Transport
Frankfurt’s major international airport is the biggest in Germany, the primary hub for Lufthansa, and one of the busiest in Europe. There’s also an excellent transport system that includes U-Bahn, S-Bahn, and buses. And it’s good to note that much of Frankfurt has been an Environmental Zone since 2008, so if you’re driving make sure that your vehicle has an emissions sticker.
Education
A great place to study, especially for those who are interested in finance, business, or indeed, data science. There are plenty of international schools within the city, which is home to influential and reputable educational institutions and seven universities, including the Goethe University, the UAS, the FUMPA and graduate schools like the Frankfurt School of Finance & Management (courses include Masters in Applied Data Science, and Masters in Data Analytics & Management).
CROSS-VALIDATION: HOW IT CAN GO WRONG AND HOW TO FIX IT
FEATURE SELECTION USE CASE WITH SAMPLE CODE
By TAM TRAN-THE
Cross-validation is a resampling procedure used to estimate the performance of machine learning models on a limited data set. This procedure is commonly employed when optimizing the hyper-parameters of a model and/or when evaluating performance of the final model. However, there are multiple nuances in the procedure design that might make the obtained results less robust or even wrong.
Consider that you are working on a classification problem with tabular data containing hundreds of features. You decide to select features based on their corresponding ANOVA f-statistics with the outcome label.
How it can go wrong - Scenario 1
You first perform the feature selection strategy on the entire dataset to select the top k features (where k is an arbitrary number) with the highest f-statistics. After this, you do cross-validation, feeding the data with the selected features into the CV loop to estimate the model performance.
Here, you have committed the mistake of data leakage. Since you performed a selection strategy that involves learning about the outcome label on the entire dataset, knowledge about the validation set - especially its outcome labels - was made available to the model during training. This gives the model an unrealistic advantage in making better predictions, which wouldn’t happen in production.
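The mistake can be made concrete with a short scikit-learn sketch (the synthetic dataset, the SelectKBest selector and the logistic-regression classifier are illustrative assumptions, not from the original article):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: many features, only a handful of them informative.
X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=5, random_state=0)

# WRONG: feature selection on the FULL dataset, before cross-validation.
# The ANOVA f-statistics were computed using the labels of every row,
# including the rows that later end up in the validation folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000),
                               X_leaky, y, cv=5)
print(f"CV accuracy with leakage: {leaky_scores.mean():.2f}")
```

The accuracy printed here will typically look far better than the model would achieve on genuinely unseen data, which is exactly the point of this scenario.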
How it can go wrong - Scenario 2
Instead of choosing an arbitrary number of features, you want to choose the features whose f-statistic p-value is smaller than a certain threshold. You think of the p-value threshold as a model hyper-parameter - something you need to tune to get the best-performing set of features, and thereby the best-performing model.
As CV is well known for hyper-parameter optimization, you then evaluate a distinct set of p-value thresholds by performing the procedure on the whole dataset. The problem is that you use this CV estimate both to choose the best p-value threshold (and hence the best set of features) and to report the final performance estimate.
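This second mistake can also be sketched in scikit-learn. Here, SelectFpr keeps the features whose f-statistic p-value is below alpha; the dataset, classifier and threshold grid are assumptions for demonstration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=5, random_state=0)

# Treat the p-value threshold as a hyper-parameter and tune it by CV.
pipe = Pipeline([("select", SelectFpr(f_classif)),  # keep features with p < alpha
                 ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {"select__alpha": [0.01, 0.05, 0.1]}, cv=5)
grid.fit(X, y)

# WRONG: grid.best_score_ was used both to PICK the threshold and as the
# REPORTED performance, so it is optimistically biased.
print(f"best alpha: {grid.best_params_['select__alpha']}, "
      f"reported (biased) score: {grid.best_score_:.2f}")
```

Nothing is wrong with the tuning itself; the mistake is reporting `grid.best_score_` as if it were an unbiased estimate of generalization performance.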
[Figure #1 - How it can go wrong, Scenario 1: performing feature selection on the full dataset before CV leads to data leakage. The diagram shows the full dataset passing through a feature selection strategy that involves learning about the outcome label, and only then through cross-validation. Image by Author.]

[Figure #2 - How it can go wrong, Scenario 2: combining hyper-parameter tuning with model evaluation in the same CV loop leads to an optimistically biased evaluation of the model performance. Image by Author.]
When combining hyper-parameter tuning with model evaluation, the test data used for evaluation is no longer statistically pure, as they have been “seen” by the models in tuning the hyper-parameter. The hyper-parameter settings retain a partial “memory” of the data that now form the test partition. Each time a model with different hyper-parameters is evaluated on a sample set, it provides information about the data. This knowledge about the model on the dataset can be exploited in the model configuration procedure to find the best performing configuration for the dataset. Hyper-parameters could be tuned in ways that exploit the meaningless statistical peculiarities of the sample. In other words, over-fitting in hyper-parameter tuning is possible whenever the CV estimate of generalization performance evaluated over a finite sample of data is directly optimized. The CV procedure attempts to reduce this effect, yet it cannot be removed completely, especially when the sample of data is small and the number of hyper-parameters to be tuned is relatively large. You should therefore expect to observe an optimistic bias in the performance estimates obtained in this manner.
How to fix it
CV methods are proven to be unbiased only if all the various aspects of classifier training take place inside the CV loop. This means that every aspect of training a classifier - e.g. feature selection, classifier type selection, and classifier hyper-parameter tuning - takes place on the data not left out during each CV iteration. Violating this principle can result in very biased estimates of the true error.
In Scenario 1 (#3), feature selection should have been done inside each CV loop to avoid data leakage.
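As a concrete illustration of this fix, here is a minimal scikit-learn sketch contrasting the leaky approach with feature selection performed inside each CV fold via a Pipeline. The dataset, the selector, the value of k, and the classifier here are illustrative assumptions, not the author's own setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in dataset: many features, few of them informative
X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=0)

# WRONG: the selector sees all labels before CV, so information leaks
X_leaky = SelectKBest(f_classif, k=5).fit_transform(X, y)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000),
                               X_leaky, y, cv=5)

# RIGHT: the Pipeline re-fits the selector on each CV training fold only
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_scores = cross_val_score(pipe, X, y, cv=5)

print(f"leaky:  {np.mean(leaky_scores):.3f}")
print(f"honest: {np.mean(honest_scores):.3f}")
```

Because the Pipeline is cloned and re-fitted within every training fold, the validation fold never influences which features are selected - exactly the principle stated above.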
To avoid undesirable optimistic bias, model evaluation must be treated as an integral part of the model fitting process and performed afresh every time a model is fitted to a new sample of data.
In Scenario 2 (#4), model performance should be evaluated on a totally unseen test set that has not been touched during hyper-parameter optimization. If your dataset is so small that you can’t afford a separate hold-out set, nested CV should be used: the inner loop performs the hyper-parameter search, and the outer loop estimates the generalization error by averaging test-set scores over several dataset splits.
A code snippet for nested CV

Scikit-learn has out-of-the-box methods to support nested CV: you can use GridSearchCV (or RandomizedSearchCV) for the hyper-parameter search in the inner loop and cross_val_score to estimate the generalization error in the outer loop.

To illustrate what happens under the hood, the code snippet below doesn’t use these off-the-shelf methods. This implementation is also helpful when the scoring strategy you want isn’t supported by GridSearchCV. However, this approach only works when you have a small search space to optimize over. For a larger hyper-parameter search space, Scikit-learn’s CV tools are a neater and more efficient way to go.
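A rough sketch of such a from-scratch nested CV is shown below (synthetic data and a hypothetical one-parameter grid for logistic regression; this is an illustrative reconstruction, not the article's original snippet):

```python
import numpy as np
from itertools import product
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Hypothetical small search space (illustration only)
param_grid = {"C": [0.01, 0.1, 1.0]}

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_scores = []

for train_idx, test_idx in outer_cv.split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Inner loop: hyper-parameter search on the training portion only
    best_score, best_params = -np.inf, None
    inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        fold_scores = []
        for in_tr, in_val in inner_cv.split(X_train):
            model = LogisticRegression(**params, max_iter=1000)
            model.fit(X_train[in_tr], y_train[in_tr])
            fold_scores.append(
                accuracy_score(y_train[in_val],
                               model.predict(X_train[in_val])))
        if np.mean(fold_scores) > best_score:
            best_score, best_params = np.mean(fold_scores), params

    # Outer loop: refit with the best params, score on the untouched test fold
    model = LogisticRegression(**best_params, max_iter=1000)
    model.fit(X_train, y_train)
    outer_scores.append(accuracy_score(y_test, model.predict(X_test)))

print(f"Estimated generalization accuracy: {np.mean(outer_scores):.3f}")
```

Note that the outer test fold never participates in the inner hyper-parameter search, which is what keeps the final estimate free of the optimistic bias described earlier.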
THE DATA SCIENTIST | 33
[Figure #1: the full dataset is reduced to a dataset with selected features, then split into three CV folds (train/validation) across three splits to produce the estimated generalization performance.]
#3 MODEL PERFORMANCE SHOULD BE EVALUATED ON A TOTALLY UNSEEN TEST SET THAT HAS NOT BEEN TOUCHED DURING THE HYPER-PARAMETER OPTIMIZATION Image by Author
Noteworthy question 1: Which feature set to use in the production model if we apply feature selection strategy in the CV loop?
Due to the stochastic nature of train/test split, when we apply the feature selection strategy inside a CV loop, it’s likely that the best set of features found for each outer loop is slightly different (even though the model performance might almost be the same over runs). The question then is, what set of features should you use in the production model?
To answer this question, remember: Cross-validation tests a procedure, not a single model instance.
Essentially, we use CV to estimate how well the entire model-building procedure will perform on future unseen data, including the data preparation strategy (e.g. imputation), feature selection strategy (e.g. the p-value threshold to use for a one-way ANOVA test), choice of algorithm (e.g. logistic regression vs XGBoost), and the specific algorithm configuration (e.g. the number of trees in XGBoost). Once we have used CV to choose the winning procedure, we then apply the same best-performing procedure to the whole dataset to produce our final production model. The fitted models from CV have served their purpose of performance estimation and can now be discarded.
#4 NESTED CV SHOULD BE USED IN CASE YOU CAN’T AFFORD A SEPARATE HOLD-OUT TEST SET. THE INNER LOOP IS USED FOR HYPER-PARAMETER SEARCH AND THE OUTER LOOP IS USED TO ESTIMATE THE GENERALIZATION ERROR Image by Author
In that sense, whatever feature set results from applying the winning procedure to the whole dataset is what would be used in the final production model.
Noteworthy question 2:
If we train a model on all of the available data for the production model, how do we know how well that model will perform?
Following up on question 1: if we apply the best-performing procedure found through CV to the whole dataset, how do we know how well that production model instance will perform?
If well designed, CV gives you performance measures that describe how well the finalized model, trained on all historical data, will perform in general. You have already answered the question by using the CV procedure for model evaluation! That’s why it’s critical to make sure your CV procedure is designed appropriately.
TAM TRAN-THE is a Data Scientist with Enolink. Her work focuses on statistics, machine learning, and predictive modelling and she has studied at the University of Massachusetts and Mount Holyoke College.
[Figure #3: the full dataset is split into CV folds plus a held-out test set; the CV loop is used to find the optimal hyper-parameters or apply the feature selection strategy, and the final evaluation uses the untouched test data.]
[Figure #4: nested CV over three folds - the inner loop tunes hyper-parameters or applies the feature selection strategy, while the outer loop trains with the optimal hyper-parameters/feature set found and evaluates on the test data to generate the final evaluation.]
IN EACH ISSUE OF THE DATA SCIENTIST , WE SPEAK TO THE PEOPLE THAT MATTER IN OUR INDUSTRY AND FIND OUT JUST HOW THEY GOT STARTED IN DATA SCIENCE OR IN A PARTICULAR PART OF IT.
START-UP
Since I was around ten years old, I’ve always wanted to be a scientist. Many close family members were scientists in academia — so getting a PhD, then a Postdoc, and finally becoming Professor were naturally my goals. This dream was just pie-in-the-sky for many years. But all that changed once I started having fun solving advanced physics and calculus problems while my friends were more concerned about grades, and not the subject.
So, as an Undergrad, I made up my mind to follow my dreams and become a Physicist…
After that, it was relatively smooth sailing thanks to a combination of hard work and dedication, as well as my natural creative inclination to find and solve research problems. I did a Masters at a prestigious IIT in India, a PhD at Emory University, and a Postdoc at Georgia Tech. All indicators showed that I was on the right track to achieving my academic dream. During my PhD, I was awarded the ‘Best Graduate Student Award’ by the
Physics department at Emory. I also published a first author paper in PNAS, and had many other co-author publications. During my Postdoc, I was awarded the ‘Best Speaker in Robotics Award’ (related to research at the Georgia Tech Postdoc symposium) and my work was featured by the likes of Forbes and the BBC.
I’m giving this context to show how focused I was on the track to academia, and why it’s important to enjoy what you are doing at a particular moment.
Finally, I landed a faculty position at Georgia Gwinnett College, an undergraduate college in the Atlanta area. I enjoyed my time there, having the freedom to teach a course that I created and to choose the research projects that interested me most. Towards the end of my third year at GGC, I decided it was time for a change. One month later, I landed a new job as a Senior Data Scientist. However, that didn’t mean I’d only prepared for this role for a month…
SKANDA VIVEK: FROM ACADEMIA TO THE DATA SCIENCE INDUSTRY
IN ISSUE 1, SENIOR DATA SCIENTIST SKANDA VIVEK FROM ONSOLVE TALKS ABOUT HOW HE TRANSITIONED FROM ACADEMIA TO THE DATA SCIENCE INDUSTRY.

RESEARCH USING DATA
I’ve always been keen to understand patterns in the world. So, when I had some spare time to develop an independent research project as a Postdoc, I thought of exploring traffic patterns as a physics problem. We came up with one of the first estimates of how traffic patterns would break up in the aftermath of a cyber-attack. For this project I explored a bunch of traffic data, including data from the likes of Google, the HERE API, NYC taxi data, and OpenStreetMap. I also developed a simple algorithm to track vehicle speeds from a local camera.
While none of the data science methodologies I used in these projects were ground-breaking, they gave me hands-on experience in data extraction, cleaning, and some basic machine learning in real-world contexts.
DATA INCUBATOR FELLOWSHIP
I then did an intense, remote, eight-week data science fellowship. At the time, I was more interested in academia, but in case I didn’t find an academic position, I wanted data science to be the next option. I got a bunch of interviews through the fellowship programme, one of which almost resulted in an offer, and my Capstone project was also highlighted during pitch night.
But ultimately, as my wife had just given birth to our twins a few days earlier, I decided not to pursue an on-site visit at a Chicago-based company. I had just landed my academic faculty job at GGC. The most important thing about this fellowship was that it exposed me to peers - most of whom chose data science after the fellowship - and gave me an overview of the technical skills and requirements for an entry-level Data Scientist. The experience also provided me with another important trait: the psychological edge. Before TDI I felt like a physicist who was skilled in Python. After TDI I felt like a Data Scientist.
This was back in 2019, when my path to data science had already started. But there have been several steps along the way…
BLOGGING
My first article on Medium was back in October 2019:
‘What if the next large-scale hack involved your vehicle instead of your security camera?’. It performed dismally for a month, racking up just 20 views or so. At the time, I didn’t know about submitting to larger publications. Then an editor from The Startup (Medium’s largest publication at the time) reached out to me about publishing my article. After that, I continued to appear consistently in top Medium publications - particularly Towards Data Science. Publishing and sharing my work with the larger data science community has helped me stand out during the job interview process.
CONSULTING, TUTORING, MENTORING
I started reading stories about how people were selling data science services on platforms like Fiverr and Upwork. In parallel, I’d also dabbled with creating tutor accounts on common tutoring platforms during my PhD, but I’d never really followed up on these.
While I was experimenting with online platforms, I stumbled across a great need for data science tutoring, as so many universities and colleges had started to offer data science Masters programmes and undergrad specialisations. I got a number of students who valued my tutoring sessions and gave me good ratings, and as the hours and ratings piled up, I increased my hourly rate. Surprisingly, I found that as I increased my rate, the requests didn’t dramatically decrease. In fact, I had students from top universities like Harvard, Columbia, and Berkeley. I even tutored someone for a semester whose father is the Tech CEO of a top-20 Fortune 500 company. I also landed a longer-term client who consulted with me on developing an image object detection platform.
INTERVIEW, INTERVIEW, INTERVIEW…
My limited data science consulting and tutoring experience made me a little nervous when I was deciding to transition into the data science profession. I figured there must be lots of applicants on the job market, given every college or university seemed to have launched a Data Science programme. However, it was only during my interviews in March and April 2022 that I realised two things.
Firstly, there were a record number of job openings at that time (inflation was not yet a concern), and secondly, everyone who claims to be a Data Scientist, or has a data
science degree, is not always competent. The latter is actually obvious in hindsight - I was tutoring so many students from top universities who didn’t have a clue what they were doing.
An encounter with a recruiter taught me that interviews are like doing reps at the gym. Do more, and you get better at it. Interviews are the most important part of the job search process. The only way you get better at interviewing is by doing more interviews!
DON’T BE AFRAID TO FAIL
It’s natural to hold back on interviewing until you feel ready. But, if you’re like me, you might never feel 100% ready. And that’s the danger of settling into a comfort zone and making excuses about why there’s never a right time.
Once, I almost lost my voice due to an illness from my kid’s daycare. I’d applied to jobs with mistakes on my CV. But I still interviewed, even though at times my throat didn’t cooperate. I fixed my CV and made it more appealing when I wasn’t getting interviews for a week. After that, recruiters reached out to me.
One of the worst and most demoralising tools I came across was Jobscan. I haven’t heard anyone else say negative things about Jobscan, but it didn’t work for me. Jobscan scans your resume and the job posting and gives you a matching score. I never got above 20 or 30% from the Jobscan system, and they suggest applying only when you have a score of 70% or higher. If I‘d taken their advice, I would still be tailoring my CV to this day!
DOCUMENT EVERYTHING YOU DON’T BLOG ABOUT
If you follow Ali Abdaal’s YouTube channel, you will see that he was incredibly successful in med school, and at the same time made it big as a YouTuber. One of his productivity tools is intense note-taking using Notion. I use Notion to consolidate my DS learnings; especially in the context of job interviews. It isn’t very structured, but does the job.
CLOSING THOUGHTS
Becoming an ‘expert’ at something new is about more than having one specific goal and achieving it. For me, the journey was much more important than the final destination. Think about it, if you decide on a goal and reach it, what will you do once you’ve met it? In the absence of future plans, you might crash and burn out. Also, big goals can change from day to day. What if you were to suddenly decide on a goal for where you should be next year and change your mind tomorrow?
On the other hand, if you are on a constant journey
and enjoying every moment, any destination is a step along the way. Creating positive habits is more likely to strengthen an ever-changing mind. If you decide to blog every week and make it through nine weeks, you are less likely to quit in the tenth week. You might take a break - but it is easier to get back in the zone afterwards. After thirty blogs, recruiters might reach out to you, and you end up getting an interview where you talk about your blogs and land a dream job.
Compare this with creating a goal of transitioning to a data science job one year later, and not putting in consistent efforts. That could be an unrealistic goal, especially if you can’t show someone that you have the relevant experience. This neat psychological mindshift is also referenced in other ways by many successful people.
As Maria Sharapova said: “The mission I was on was very different. It wasn’t that I had or didn’t have to be a champion. It was that I was learning and growing to be a better tennis player.”
Naval Ravikant says: “Be impatient with actions but patient with results.”
And in the book Tiny Habits, BJ Fogg discusses how small, focused daily activities can lead to hugely positive impacts on your life.
Finally, if you have a positive attitude towards your transformation and make the necessary efforts - whether that’s blogging, publishing your research, consulting, writing a book, posting on social media, or taking the time to be creative - then you’re making the best possible decision. By investing in your health and success, it doesn’t matter whether you reach your original destination. You might find something even more awesome along the way.
You can read more articles from Skanda at skanda-vivek.medium.com
THE CURRENT STATE OF DATA SCIENCE HIRING
Most developed nations are reporting the lowest levels of unemployment in 50 years. This trend is set to continue, especially in candidate-scarce disciplines like Data Science & Engineering. The current candidate market is therefore the tightest in the history of our sector, and it’s going to get worse before it gets better. Data Scientists and Data Engineers with any interest in moving jobs have more choice than ever, and can expect higher remuneration than ever before.
Each issue, we’ll examine the world of Data Science recruitment through the eyes of DAMIEN DEIGHAN, the CEO of UK-based Data Science Talent.
In this issue, Damien gives his thoughts and opinions on the current state of Data Science hiring.
DAMIEN DEIGHAN
RECRUITMENT IS NOW PRIMARILY A MARKETING FUNCTION
There is nothing anyone can do about the size of the current candidate pool or any shortages that exist within it.
However, you and your company can significantly increase the number of Data Scientists and Engineers in the candidate pool who are open to talking to you about your vacancies.
This requires an acceptance of how important marketing now is to recruiting, and that doing it well requires considerable effort and expertise.
THE FIRST PRINCIPLE OF SUCCESSFULLY RECRUITING DATA SCIENTISTS & ENGINEERS
Successful recruitment is about two key things: attraction (the first principle) and assessment. Everything in recruitment is derived from these two foundational elements.
But attraction comes first. Despite this, most hiring guidance (and it’s been this way for 50 years) tends to focus on how to assess better and avoid hiring mistakes. Very little is said about how to attract better-quality people at the beginning of the process, aside from generic employer branding, and most information on attracting people in candidate-scarce niches is superficial and tactical.
The problem with not focusing on attraction is that unless you’re able to attract a consistently healthy stream of people into the hiring pipeline, then you don’t get to assess anyone in the first place. So, the most important place to start - if hiring is going to be a continual endeavor for you and your team - is to fix your attraction problem.
THE BEST CANDIDATES ARE ASSESSING YOU, LONG BEFORE YOU START ASSESSING THEM
To put it another way: 10 years ago, you were the buyer. Now, you are also the seller at every stage of the candidate funnel.
[Figure: the five-stage candidate funnel. The candidate is assessing you, your job, and your company at stages 1-5; you are assessing the candidate only at stages 4 and 5.]
As the image shows, the candidate is assessing you and your company from stage 1 through to stage 5. This means you need to sell your proposition at each stage of the process. You are only actually assessing the candidate in the final two stages.
This means that throughout the funnel, you have to sell the benefits of coming to work in your team more than ever. If you don’t, the best people won’t consistently show up for you to interview. Most companies spend a lot of time and money selling and marketing their product or service. Marketing departments have huge budgets to invest on product strategy and value propositions. Yet, little to no energy is spent on their hiring strategy or how to market jobs.
YOU DON’T HAVE A CANDIDATE SCARCITY PROBLEM. YOU HAVE A WEAK MESSAGING PROBLEM
Most companies who are currently experiencing
difficulties with hiring think they have a market-conditions problem. They don’t: they have a weak messaging problem. They take an ill-considered proposition, communicate it badly, and then blame the result on a lack of candidates.
If your company operates in any major Western economy, there are literally thousands of potential Data Scientists or Engineers that could be the right fit for your team. Your issue is due to the fact that you’re not reaching these individuals, and even when you do, they’re not listening because you’re failing to talk to them about what they want to hear. In fact, it’s likely that your entire approach is company-centric in terms of the process and the messaging.
This is what a typical candidate marketing funnel looks like:

TALENT ATTRACTION FUNNEL
1. AWARENESS - Job Adverts, Content, Outreach
2. CONSIDERATION - Value Proposition, Content, 1st Recruiter Call
3. APPLICATION - Online Application or CV Submission
4. ASSESSMENT - Interview + Testing
5. OFFER - Offer Presentation + Negotiation
The concept of customer centricity is well established in business. However, it is usually ignored when it comes to recruitment. We call messaging and recruitment marketing campaigns that talk to the candidate about what’s important to them candidate-centric.
CANDIDATE-CENTRIC MARKETING CAMPAIGNS WORK EVEN IN THE CURRENT MARKET CONDITIONS
Ad-hoc recruitment activity is becoming less and less effective as the candidate pool gets tighter.
The good news for you is that most companies’ recruitment marketing is very poor. Very few of the companies you are competing with for talent have put much effort into their recruitment marketing. This offers a big opportunity for you, if you take it seriously.
As an absolute minimum, any consistently high-performing, candidate-centric hiring campaign needs 3 things:
1. (JOB) VALUE PROPOSITION
A 4-6 page document that describes every possible job and company benefit to the candidate, and looks good visually.
2. COMPELLING JOB ADVERT written by a professional copywriter - this is a one-page attraction piece that generates the initial interest.
3. A WELL THOUGHT-OUT MESSAGING CAMPAIGN with multiple touch points, executed across multiple channels.
These 3 things will set you up for success. They’re not a silver bullet, because there’s no such thing in recruitment. But, if you execute these tactics well, you’ll find it much easier to consistently hire the people you had in mind all along.
For most companies, improving candidate attraction is more important than improving assessment. If your job is properly defined from the start, and communicated in a highly candidate-centric and specific way, you’ll attract the people you want and repel the people you don’t want. If attraction is done well, then it makes everything else easier, including assessment and retention.
datasciencetalent.co.uk
REINFORCEMENT LEARNING
By FRANCESCO GADALETA
Some people say that deep learning is enough to reach, or build, an artificial general intelligence. Others - and by that I mean scientists - say quite the opposite: that it’s not possible. Deep learning methods, they argue, are very good function approximators, but we cannot call them artificial intelligence, and definitely not artificial general intelligence (AGI).
There is, however, another branch of artificial intelligence known as reinforcement learning that some scientists state is enough for general artificial intelligence. A strong statement.
However, it does make sense to some extent, so first of all I’ll clarify what these scientists actually mean by that statement. I’ll then offer my own opinion, based on reading a huge amount of relevant literature and papers over the last few years, including the latest cutting-edge results and findings in artificial intelligence and machine learning.
Reinforcement learning is a paradigm of computation that essentially allows an agent to learn how to solve a particular problem, a problem that has to be defined in a particular way. The classic example here is the typical agent that goes around within an environment.
Let’s look at this in a different way. Assume we have one agent - a mouse, say, or a cat - and the house is the environment. Reinforcement learning states that the agent can perform an action that will alter the environment, and the environment will respond to the agent with a new state: the next state. The reward that accompanies the new state encodes how good or bad the action was, once performed in the environment.
So, if you think about the mouse moving to the next position: it would alter the environment in the sense that the position of the mouse would have changed, and its position with respect to the cheese would also have changed. The environment has obviously changed because the relative positions of the objects within it have changed. This would also happen if the mouse could, for example, move objects within the environment; the positions of those objects would change as well.
As you can see, the action could be, for example, moving to another position: up, down, left, right, diagonally… you can have as many actions as you want. Essentially, the mouse at state t chooses an action randomly, or in a more or less random way. This action will then bring the mouse into another state in the environment, and the environment will respond with either “that was good” or “that was bad”, depending on where the mouse is, where the cheese is, where the cat is, and so on.
If we repeat this simple concept a number of times by trial and error, we would get to an agent that understands the environment and understands the specification of the environment or the set of actions. If by trial and error we keep playing this game of cat and mouse, eventually the mouse will become intelligent enough to solve the game and win most of the time. Now, if we want to be more detailed in the explanation of what reinforcement learning is, I should definitely speak about the concept of policy.
THE POLICY
The policy is a function that maps states to actions. The policy is a plan that says: if you are in a particular state and you perform a particular action, you are going to receive a particular value that tells you whether the action in that state was appropriate or not. You can have a number of these actions aligned in a chain - a sequence of actions that will bring you from A to B by simply performing them one by one.
This policy can be learned and stored in a lookup table if the number of possible state-action combinations is manageable. Otherwise, it can be approximated with whatever machine learning model you choose, including neural networks. If a neural network approximates the policy - the mapping from states to actions - then we are not storing all combinations or possibilities, which might be a very large number if you have many states and many actions; we are approximating it with a neural network, or with any other machine learning model.
There is another detail we need to look at: how the policy is usually trained. You can have on-policy or off-policy reinforcement learning. Traditionally, the agent observes the state of the environment, takes a certain action based on a policy, collects the resulting reward, and moves to the next state. The policy is identified with the Greek letter pi: pi(a|s) is the policy’s choice of action a given state s. Learning can happen on-policy, which means that experiences are collected using the latest learned policy, and that experience is used to improve the policy as we go. Or learning can happen off-policy: the agent’s experience is appended to a buffer (also called a replay buffer), each new policy collects additional data, and the policy is recalculated on the buffer. The buffer becomes a kind of training set that a neural network, or another model, uses to learn the policy for the next time step.
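To make these ideas concrete, here is a minimal sketch of tabular, off-policy Q-learning in a hypothetical one-dimensional “mouse and cheese” corridor. The states, rewards, and hyper-parameters are invented purely for illustration:

```python
import random

# A tiny, hypothetical "mouse and cheese" corridor: states 0..4, cheese at 4
N_STATES, CHEESE = 5, 4
ACTIONS = [-1, +1]                      # step left or step right

# Tabular policy storage: one Q-value per (state, action) pair
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def step(state, action):
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == CHEESE else 0.0      # reward only at the cheese
    return nxt, reward, nxt == CHEESE

def choose(state):
    if random.random() < epsilon:               # explore occasionally
        return random.choice(ACTIONS)
    shuffled = random.sample(ACTIONS, len(ACTIONS))  # random tie-breaking
    return max(shuffled, key=lambda a: Q[(state, a)])

random.seed(0)
for _ in range(500):                            # learn by trial and error
    s, done = 0, False
    while not done:
        a = choose(s)
        nxt, r, done = step(s, a)
        # Off-policy Q-learning update: bootstrap from the greedy next action
        Q[(s, a)] += alpha * (r + gamma * max(Q[(nxt, b)] for b in ACTIONS)
                              - Q[(s, a)])
        s = nxt

# The learned greedy policy maps every state to "move towards the cheese"
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
```

Here the dictionary Q is exactly the “lookup table” mentioned above; in a larger environment it would be replaced by a function approximator such as a neural network.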
THE REWARD FUNCTION
This trial-and-error approach presupposes a reward function - for example, survival: a rather brutal reward function. So, the closer the mouse is to the cheese, the higher the reward. This is, of course, a theory; nobody can tell us it is a hundred percent correct, but it is a theory that makes sense. We have theorised this concept for humans and other living organisms, and we are trying to apply the same concept to artificial organisms that we call intelligent, or would like to be intelligent.
There are so many other behaviours that living organisms have discovered or developed that are too complicated to be explained by a simple reward-maximisation strategy. One example is a squirrel: once it has found more nuts than it can possibly fit in its mouth, it hides the rest, because it wants to make sure that no other animal will steal them. This behaviour has allowed the squirrel to survive, but it doesn’t bring an immediate reward, like ‘I’m starving, so I’ll eat and get a positive reward’.
No, this is more like ‘I’m planning for the end of the season or for the next season because I might die if
there is a scarcity of food’. And obviously, human beings have even more powerful planning capabilities. This makes it hard to believe that reward maximisation is the only reason we’ve evolved the amazing capabilities of the human brain.
Other scientists claim that reward maximisation should not, or cannot, be enough, because there are many other scenarios in which these agents - these living organisms - have been living in and experiencing the environment and the world. Consider collective behaviour: some behaviours have developed because of society and communities. For instance, animals that protect their offspring receive survival as a positive reward from the environment. This is something we also see in humans - not so much anymore, because we have a very developed society that protects offspring through laws and regulations, which are very high-level concepts of behaviour - but in animals these behaviours are still present.
However, there is one observation from another scientist which goes against what many other scientists have been saying in recent papers. I’m referring to the author of Algorithms Are Not Enough, who explains that reinforcement learning, as the cause of developing complex behaviour, is actually quite wrong, or at least quite incomplete. I would argue that reinforcement learning requires some assumptions to be made about the reward function and the value function.
So, there must be someone (usually an engineer) who defines what the reward is: how do we reward the agent in any particular state, for every particular action? In the mouse-and-cheese example, that’s pretty easy: the mouse gets a higher reward if it gets closer to the cheese, or if it keeps a minimum distance from the mouse trap, or some combination of the two. In this world everything is simple, so it’s easy to define what a reward function should look like. But I need to inject this knowledge - the reward function - into the system, and then run the algorithm that, by trial and error and approximation, will eventually converge to the optimal policy. The system could not find that reward function by itself. There is also the value function, which assigns a value to a state of the environment, or to a particular action in a particular state. These are the mathematical functions that actually drive convergence towards the optimal policy.
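To make the injection point concrete, here is a hypothetical hand-designed reward function for the mouse/cheese/trap world; the function name, the distance metric, and the threshold and penalty values are all illustrative assumptions:

```python
def reward(mouse, cheese, trap, min_trap_dist=2.0):
    """Hand-designed reward: closer to the cheese is better,
    with a penalty for straying too close to the trap."""
    def dist(a, b):
        # Euclidean distance between two (x, y) positions
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    r = -dist(mouse, cheese)                 # shaping term: negative distance
    if dist(mouse, trap) < min_trap_dist:    # safety term: trap penalty
        r -= 10.0
    return r
```

Every constant here encodes a design decision by the engineer, not something the agent discovers on its own - which is exactly the objection discussed above.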
This is where most of the difficulty lies when you design a reinforcement learning algorithm. These are probably the most critical components of the entire model, because a badly designed reward function will probably never let the agent converge to the optimal policy - and the same holds for a badly designed value function. So the claim in Algorithms Are Not Enough is that reinforcement learning is fine as far as it goes, but the question remains: how do we deal with the fact that someone has to inject the reward and value functions for the system to work? If reward maximisation were the whole story, it would mean that even in the living world we should accept that some other entity designed the reward function for us and for other living organisms.
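To make the point concrete, here is a minimal tabular Q-learning sketch of the mouse-and-cheese world described above. Everything numeric in it (the grid size, the +10/-10 rewards, the learning rate) is an illustrative assumption, not something from the conversation - and notice that the `reward` function is exactly the knowledge an engineer must inject by hand; the algorithm only learns *from* it, it cannot discover it.

```python
import random

random.seed(0)

GRID = 5                  # 5x5 grid world (illustrative size)
CHEESE = (4, 4)           # goal state
TRAP = (2, 2)             # state to avoid
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def reward(state):
    # The hand-designed reward function the engineer must inject:
    # the system could not find these numbers by itself.
    if state == CHEESE:
        return 10.0
    if state == TRAP:
        return -10.0
    return -0.1           # small step cost encourages shorter paths

def step(state, action):
    # Move within the grid, clipping at the walls.
    nr = min(max(state[0] + action[0], 0), GRID - 1)
    nc = min(max(state[1] + action[1], 0), GRID - 1)
    return (nr, nc)

def train(episodes=3000, alpha=0.1, gamma=0.9, eps=0.2):
    # Q maps (state, action index) -> estimated value. Trial and error
    # gradually moves these estimates towards the optimal policy.
    Q = {}
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(100):
            if random.random() < eps:       # explore
                a = random.randrange(len(ACTIONS))
            else:                           # exploit current estimates
                a = max(range(len(ACTIONS)), key=lambda i: Q.get((s, i), 0.0))
            s2 = step(s, ACTIONS[a])
            r = reward(s2)
            best_next = max(Q.get((s2, i), 0.0) for i in range(len(ACTIONS)))
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
            s = s2
            if s in (CHEESE, TRAP):         # episode ends at either terminal
                break
    return Q

def greedy_path(Q, start=(0, 0), max_steps=50):
    # Follow the learned policy greedily from the start state.
    s, path = start, [start]
    for _ in range(max_steps):
        if s == CHEESE:
            break
        a = max(range(len(ACTIONS)), key=lambda i: Q.get((s, i), 0.0))
        s = step(s, ACTIONS[a])
        path.append(s)
    return path
```

After training, the greedy policy reaches the cheese while steering clear of the trap - but only because the reward function already encoded what "cheese good, trap bad" means.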
This is an open topic, and is definitely subject to interpretation and competing theories. There is no way to close it with a single reward-maximisation methodology that explains a hundred percent of everything. Many other points have been ignored by the DeepMind scientists; for example, in my opinion, touch and other sensory data should be part of an intelligent organism. Intelligent organisms don’t make decisions - intelligent decisions - just by using their eyes: we use almost every part of our body when we make up our minds.
36 | THE DATA SCIENTIST
FRANCESCO GADALETA is the Founder and Chief Engineer of Amethix Technologies and the Host of the Data Science At Home podcast.
DR. ANNA LITTICKS:
THINGS I WISH I KNEW BEFORE BECOMING A DATA SCIENTIST
Hello there.
You may not know me, so allow me to introduce myself.
My name’s Dr. Anna Litticks. I’m a Chief Data Scientist, analysing my way through life, at some of the most infuriating businesses on the planet. Of course I didn’t know this going in, or I may well have chosen another path.
If you’re reading this, chances are you’re the same. You excel in a job virtually no one understands, except other Data Scientists of course, who you hardly ever meet, because well, you’re a Data Scientist. So think of this as your safe space. A place to raise a smile and be comforted by someone in the know.
In the next few issues of The Data Scientist , I’ll be giving you my top five ‘things’ I wish I’d known before starting my career. Let’s start with the most obvious...
No one will ever understand what you do. “Statistics?!” my Mother snapped.
“I don’t know love, it sounds pretty limiting. There aren’t many jobs at the National Statistics office. There’s Banking I guess? But I can’t see you in a pin-striped suit.” It should’ve been a warning.
I’m sure that I get my intelligence from my Mother, but even her brilliant mind couldn’t see where a career in Data Science would take me. Of course, it wasn’t called that then. And so explaining its importance seemed like debating religion with a frog.
My Father, who I inherited my introverted nature from, just smiled and nodded with a “Sounds interesting to me.” This only infuriated my Mother further, whose mind raced with the image of me burning my maths tutor at the stake.
In the outside world things only got worse. On the very rare occasion I describe my job to strangers (at events like weddings, where such banal queries seem mandatory), I witness people’s very souls evaporate, right after the word ‘data’ is plonked awkwardly on the conversation altar.
These days I don’t bother and merely say “I’m a Scientist”, which delivers a mildly less painful wry smile as the topic quickly shifts to what my husband does. As if to illuminate my poor life choice for the masses, his easy-to-understand “I’m a Teacher” pleases them greatly.
It is easy to forgive occasions like these, however, given the alternative: where the query holder has data sets of their own on the subject of Data Science and delivers their story as if it were complete and repeatable.
Where these ‘models’ exist, I’m at least thankful if it’s awkward wedding chatter, and not an interview, where the opinions of others will model my success, on a problem they think they understand, but don’t.
See you in the next issue!
Dr. Anna Litticks
“Let’s move from malaria control to malaria elimination… moving from control to elimination takes artificial intelligence and data.”
ARNON HOURI YAFIN
I spent my whole career in data and even today, it’s hard sometimes to keep a track of everything that’s out there.
TARUSH AGGARWAL
Being sceptical, not taking it at face value, that AI is this magic bullet that knows what you mean and is going to just fix all of the world’s problems for you if only you pay enough to the vendor.
JULIA STOYANOVICH
This is what data science sounds like.
PRESENTED BY Dr Philipp Diesinger and Damien Deighan
datascienceconversations.com
Listen to the Data Science Conversations podcast and hear from some of the industry’s leading experts making a real-world impact. Expand your knowledge. Enhance your career.
ARNON HOURI YAFIN How AI is Driving the Eradication of Malaria TARUSH AGGARWAL How to Leverage Data for Exponential Growth JULIA STOYANOVICH The Pitfalls of Using AI Systems for Hiring