Building AI models for deployment in the real world

Stephen Hogg is the Head of AI Systems at Harrison.ai. He spoke to Adel Foda, Head of AI at Silverpond, about how to approach engineering AI systems, from developing and testing models to embedding AI into operational workflows.

Adel: What are you and your team working on at Harrison.ai?

Stephen: We are currently building out a joint venture with I-MED, Australia’s largest radiology network, to bring AI to life in the field of radiology. My piece of the puzzle is basically about what happens once we transition from a laboratory environment where you build a model to the real world where actual people are using it. There are a couple of things that you have to think about there. One is that in the medical field, there are a lot of regulations, which is actually a good thing. It gives us a license to operate and it also gives us the external impetus that you ideally want in order to think about testing properly. How can modeling work in that environment? That’s one part of it. Another part is spotting situations where we want to improve our models – how can we identify those areas of improvement and how can we do something about it? That is what my team is about. Personally, I generally tend not to look for completely cookie-cutter machine learning jobs, for a couple of reasons. One is because you will not learn anything from the job that you cannot find out from reading blog posts. The second is that you are much more likely to be at the cutting edge of your industry if you pick weird, ambiguous problems that people are not thinking about.

“My piece of the puzzle is basically about what happens once we transition from a laboratory environment where you build a model to the real world where actual people are using it.”

Adel: What are some of the common mistakes made when building models for deployment in the real world?

Stephen: When people are building models there are two obvious mistakes and they are really basic ones. One is that you build a model and you think, okay, performance is X. I am going to try a few things and see if I can improve it. Now, what a lot of people will then do is make, say, three different changes, run a new model and see if the performance improves. Getting the answer to that is easy enough. In order to know whether or not it was a good idea, though, you do need to know why it improved. Not only does that tell you whether or not that change was actually worthwhile but also what to change next. This means isolating the changes you make to your model from each other.
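To make the point concrete, here is a minimal sketch of what isolating changes can look like in practice, using a toy scikit-learn setup. The data, model and hyperparameters are purely illustrative, not anything Harrison.ai uses: every candidate change is applied on its own against a fixed baseline and a fixed split, so any difference in the metric can be attributed to that single change.

```python
# Illustrative only: evaluate each candidate change in isolation against a fixed
# baseline and a fixed validation split, rather than bundling changes together.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def train_and_evaluate(config):
    """Train one configuration on the fixed split and return validation AUC."""
    model = LogisticRegression(C=config["C"], max_iter=config["max_iter"])
    model.fit(X_train, y_train)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

baseline = {"C": 1.0, "max_iter": 200}
candidate_changes = {
    "more_regularisation": {"C": 0.1},
    "less_regularisation": {"C": 10.0},
    "longer_training": {"max_iter": 1000},
}

baseline_score = train_and_evaluate(baseline)
for name, change in candidate_changes.items():
    score = train_and_evaluate({**baseline, **change})  # exactly one change applied
    print(f"{name}: {score - baseline_score:+.4f} AUC versus baseline")
```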

The other mistake people make is when they are trying to work out why the model is doing what it is doing, they will start coming up with really, really complex answers. A lot of the time models are subject to really, really simple forces of bias and overfitting that cut across basically everything, and of course also data quality. A lot of the time, whatever your explanation is for model performance, if it does not have a clear link back to those really fundamental forces that shape your model, it is probably wrong. The basics have to be right. Your model might be performing better, but that could be misleading because now there is a bias issue that you have introduced and you need to be aware of that. I see people make those mistakes a lot. A lot of models actually wind up being Rube Goldberg machines.

Adel: Overly complicated, too many components?

Stephen: Not necessarily that it is overly complicated or has too many components. It is more that the model is not doing what you think it is doing. I remember seeing a story of a company that had a lot of clients. All of their customers of one particular product type were leaving, reliably leaving. So these guys put together a churn model, of course, as you do. They looked at that model and said, “Okay. Yeah, this has really great performance. Let us put it into production.” Then a couple of years later when it was revisited, it turned out all it was doing was basically putting a really massive weight on whether or not the customer had this one product that was causing everyone to leave, and all of the other variables had no effect at all. So, it was a giant if statement that was using up a huge amount of compute to no real effect. It is easy to not realise that that is going on.
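A basic sanity check would have caught that. The sketch below is a hypothetical reconstruction on synthetic churn-like data, where one product flag drives nearly all of the outcome; permutation importance makes the “giant if statement” visible before the model is shipped.

```python
# Hypothetical reconstruction: one product flag drives almost all churn, and five
# "other features" are noise. Permutation importance exposes the imbalance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
has_bad_product = rng.integers(0, 2, n)        # the one variable that matters
other_features = rng.normal(size=(n, 5))       # noise dressed up as features
churned = (has_bad_product == 1) | (rng.random(n) < 0.05)

X = np.column_stack([has_bad_product, other_features])
X_train, X_test, y_train, y_test = train_test_split(X, churned, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
feature_names = ["has_bad_product"] + [f"other_{i}" for i in range(5)]
for name, score in sorted(zip(feature_names, result.importances_mean),
                          key=lambda pair: -pair[1]):
    print(f"{name:>16}: {score:.3f}")
# If one feature dwarfs the rest, the model may be an expensive if statement.
```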

When you agree to provide a single-click solution to a problem for a customer, you internalise all of the complexity underlying that problem into your organisation. That implies a lot of work when you are working in a complex problem domain such as medicine, because you are internalising information about the problem domain, the downstream implications of the system returning a particular answer, as well as the management of the technology solution itself.

Adel: To me that is one of the major challenges in dealing with complexity, reducing it into a yes/no function for a client. Do you also see that as one of the big challenges in terms of getting a machine learning production system correct?

Stephen: Framing is everything. The temptation is to say I am going to make this one model that I am never going to have to think about ever again. All that does is ask for a counterexample to come along, like some weird subset of data that just breaks everything. It is really hard to anticipate unless you are going to spend a lot of time and effort fuzzing your model, which is possible, but it is a lot of work and you want to be sure that it is worth it before you do it. So, I think machine learning has the capability to completely break the SaaS business model if it is not thought about carefully. That again comes back to how do you engage with complexity? Take the case I was talking about, where I think, okay, I have got this one model that I am going to base my whole company on and I never need to update it. What that has done is swept all complexity under the rug into this one zip file in an S3 bucket somewhere.

Hiding from complexity does not make it go away. As much as you think about, how do I build this model? How do I make sure the data going into it is good? You need to be doing that in a way where you can replicate that in production. You may wind up in a situation where ultimately in your first release of a product, machine learning might not be that big a part of it. It may be the core of it, but it might not be everything. Then as you slowly carry on from there and release more versions, you might build confidence in your model and extend it to more problems in different parts of the problem space. Start somewhere where you know you are going to get what you came for. If you are going to treat models like code then you need to demonstrate that they are doing what they are intended to do before you start worrying about bigger and better things. Treating them like a really small machine that you can have a lot of is quite powerful, and then you can repurpose them and make them do other things. Doing it this way also tends to keep complexity in plain sight, which is what you want because you want people thinking about that stuff explicitly while they are planning and working out what product decisions should be made.

“You may wind up in a situation where ultimately in your first release of a product, machine learning might not be that big a part of it. It may be the core of it, but it might not be everything. Then as you slowly carry on from there and release more versions, you might build confidence in your model and extend it to more problems in different parts of the problem space.”

Adel: What do you mean when you say you look for opportunities for improving your machine learning model?

Stephen: We need to find situations where we think that performance could improve. We want to make sure that we always have a solid floor under the performance of our models in as broad a range of situations as possible. Part of that can only be known through interaction with the real world.

Adel: Is it fair to say then that there are models out there working in a production or a simulated production environment and you are looking through the failure cases? The importance of a failure depends on both the cost of the error and its actual severity. You are homing in on those cases and trying to patch those up.

Stephen: Seeing a report from someone saying how something needs to change does not immediately mean that there is a model problem. Part of it is figuring out why it is happening, which is what I was talking about earlier. When you make a change to a model you need to know not only whether it worked but also why. It is the same here: if we are going to investigate something or do something about a corner case we have identified, we have got to work out why that is even happening in the first place. It might not actually be a model issue. It could be something else, like someone might be using damaged equipment to send us their radiology reports. That can happen. The idea is to more or less turn models into a continuous improvement problem so that they become more robust over time rather than more risky over time.

Adel: That goes back to your comment about framing being everything, that we do not treat the model as static and perfect. We treat the problem as a continuous improvement process that everybody is on board with.

Stephen: It takes a little while to spell out that viewpoint and the implications it has not only for the products that companies are making but also how a company should actually operate. But when you do, it is a powerful insight. People tend to see that, from a really basic commercial point of view, users using our product provide a wealth of information that can be used to improve that model. So why would we leave that money on the table? Even if you do not want to think about it from a model performance point of view, there is so much to be had there.

Adel: It is a challenge to design a workflow so that it is still delivering value to the user with a model that is imperfect and a work in progress. How does a company operate inside that kind of framework?

Stephen: Conway’s Law is interesting. Mostly what it says is that the product design, the way your product works, mirrors the flow of communication inside your organisation. This is one of the things that complicates the jump from consultancy to product. I think that is the source of complexity that needs to be addressed with open eyes. If you build a product that is a single big monolith and you have got teams working on it that are not a single big monolith then you are obviously going to have problems, and vice versa. This is interesting in our context because we want to build this closed-loop. What does that mean for how we need to structure ourselves as a company and how teams work together? There is no upstream in this world. It is not like data comes to the data engineers, then they feed it to the data scientists, then they feed it to production. It is not like a river. So you have to think about how to set up your teams to account for that.

Adel: What is the closed-loop? Is it from clinicians to clinicians?

Stephen: Effectively, yes. We build and deploy it, and then people start giving us feedback. That feedback should be reflected in a change to the way our model works.

Adel: So clinicians are feeding data into the system, they are also using the outputs of the system, and in the background all of the model updating and retraining happens invisibly to the end user?

Stephen: It is not always a direct link but it is there. I should point out we do a lot of clinical validation on our models before we can even release them as is required by law. It is not like we start out with dog food and then get clinicians out in the field to update it. We start out with something that we know will improve the performance of the human using it and then keep iterating on that.

Adel: The benefit that is being offered by your product is that it is this synergy between the machine predictions and the human, which together make a system which offers a reduction in risk?

Stephen: Our product does not usurp the role of a doctor or clinician. We are not here to put radiologists out of work. We are here to make them more efficient. That is partly because of the law: quite rightly, a human has to make the final decision in cases involving someone’s medical treatment. We agree with that and prefer it that way. What we are about is helping people get work done more efficiently and faster. That is worth a lot both to clinicians and their patients. So, we can help a clinician spot a finding that they might have otherwise missed, for instance.

Adel: When you say efficient, I definitely understand the way in which a system like yours can help reduce the false negative rate. Are there other benefits? Are you claiming a time-saving associated with your product?

Stephen: Yes.

Adel: How does that materialize?

Stephen: What does it take to automate a human workflow? Now, anyone saying you just give it all to the machine and go to the pub is A: joking, or B: has not done this before. What you want to do is use automation to take away some of the easier bits of thinking a lot of the time. One of the things we do is we say, “Hey, look at this part of the image. We think it is of concern for this reason.” We are not actually telling you yes or no, or that anything is the answer. But we are saying this bit of the image appears to look as if it would have this problem. The other thing that we do is we say, “Well, we think there is an X percent chance of this clinical finding overall,” and give you some sort of idea of how confident we are in that prediction. So, what we are doing is framing things for a clinician. We can help people spot the easy stuff quickly and we can help people spot complex stuff accurately.
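As a rough illustration of that framing, the sketch below shows one way such an output could be structured: a highlighted region with a reason, plus an overall probability and an uncertainty band. The field names and values here are entirely hypothetical and do not describe Harrison.ai’s actual product interface.

```python
# Hypothetical output schema, for illustration only: a region to look at, a reason,
# and an overall probability with a rough uncertainty band. Not Harrison.ai's API.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RegionOfInterest:
    bbox: Tuple[int, int, int, int]            # (x, y, width, height) in pixels
    reason: str                                # why this region is flagged

@dataclass
class FindingSuggestion:
    finding: str                               # e.g. "pneumothorax"
    probability: float                         # overall estimate for the study
    confidence_interval: Tuple[float, float]   # rough uncertainty band
    regions: List[RegionOfInterest] = field(default_factory=list)

suggestion = FindingSuggestion(
    finding="pneumothorax",
    probability=0.82,
    confidence_interval=(0.74, 0.88),
    regions=[RegionOfInterest(bbox=(312, 140, 96, 88),
                              reason="apical lucency without lung markings")],
)
print(suggestion)
```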

Adel: So, there are two ways in which the product provides value. On the one hand you are helping sort the bulk of the data and saving time by getting the clinician to the point of diagnosis more quickly. And on the other hand you are reducing risk in that there will be some cases where the system is going to warn the clinician about something that they otherwise might have missed.

Stephen: Neural networks are supposed to be, if you believe some people on the internet, like the human brain, right? Now, what is a model? It is nothing more than congealed data. So in effect, what our company does is it gives the average clinician just a few extra brain cells, a couple million that are very closely acquainted with a certain type of information. That’s one way of stating what we do.

Adel: An accuracy statistic compresses a lot of information. Even with 99% accuracy, there may still be subsets in which the system fails systematically.

Stephen: False positives and negatives are always with you and different clinicians have different tolerances for that. Depending on your radiology clinic, it might be there is a red-hot risk of one type of finding and you do not want to miss it. Whereas for another clinic that might not be the case. So with the model output you can play with the sensitivity and specificity of what you are looking at and there is nothing stopping your client organisation from doing that. They can say, “Look, I am really worried about this one finding so if you have any sign of it, pick it up.” You can sort of see it as a high-tech problem that can be addressed by low-tech means, which is something that I really like.
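That knob-turning can be as simple as choosing an operating point on a ROC curve. The sketch below, on fake scores, picks a decision threshold that guarantees a minimum sensitivity for a finding one clinic is especially worried about; a different clinic could pick a different point from the same model.

```python
# Illustrative only: pick an operating point that guarantees a minimum sensitivity
# for a finding one clinic is particularly worried about. Scores are synthetic.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
scores = np.clip(0.6 * y_true + rng.normal(0.3, 0.2, 1000), 0, 1)  # fake model scores

fpr, tpr, thresholds = roc_curve(y_true, scores)

def threshold_for_sensitivity(min_sensitivity):
    """Return the first (lowest false-positive-rate) threshold meeting the target."""
    idx = np.argmax(tpr >= min_sensitivity)
    return thresholds[idx], tpr[idx], fpr[idx]

for target in (0.90, 0.99):
    thr, sens, fp_rate = threshold_for_sensitivity(target)
    print(f"sensitivity >= {target:.2f}: threshold {thr:.2f}, "
          f"achieved sensitivity {sens:.2f}, false positive rate {fp_rate:.2f}")
```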

Adel: So there is some understanding that failures do happen, whether it is a purely human system or whether it is a human plus machine system and maybe the values are different in the different cases or they have a slightly different distribution. But there is some acceptance of the fact that the system is not perfect and there will be a non-zero rate of consequential errors. But nonetheless, we sort of believe that overall this is better than what we had previously.

Stephen: We have to prove that conclusively before we can deploy anything. It is actually one of the really good things about the medical industry. There is a lot of paperwork for documentation and performance validation, but you want that because it gives you the kind of structure you need to build a product that is not going to get you out of bed at 2 a.m.

Adel: That matches work we have done in the power industry, where there is the potential for high-cost failures. At the end of the day there is a question of liability that underlies everything, and it can’t rest with the system. There still has to be a person in the loop, ultimately using the tool and responsible for the outcome.

Stephen: We do not want to take over the world, if you can put it that way. We are helping hands, not there to completely reach into people’s lives and take them over.

Adel: So, the workflow that you are working on is the augmentation of the assessment and diagnosis of radiology scans and x-rays?

Stephen: Yes.

Adel: Are there any other workflows in the medical industry that your company is targeting for augmentation or is it really just fully focused on radiology?

Stephen: Watch this space.

Adel: Trade secrets?

Stephen: I’m sure all will be revealed in due course!

Adel: You do not have to go into too many specifics, but have you ever experienced the failure of a particular kind of AI project that you were attempting?

Stephen: Do you mean at this company or any company?

Adel: Any at all, really.

Stephen: I have seen machine learning projects fail all over the shop. The funniest ones are where people do not realise it has failed, which is the thing I was telling you about earlier. Someone put together a churn model and then all it did was condition on this one variable to the max and ignore everything else. To outward appearances that was a successful project, but actually it was not, and nobody had done the digging to realise it. That is what happens when you just sweep complexity under the rug.

Adel: How would you characterise the cause of that failure? Is it just not making a baseline model or is it treating the model as a black box?

Stephen: In a sense, it is a failure to treat models as code, because what they did was they built a model, looked at aggregate performance statistic X and then went, okay, let us deploy and call it done, and did not stop and ask themselves whether this model was doing what they thought it was doing. When you write code you have to write tests to establish that that is the case. But there is not really a culture out there in the wider world, in my opinion, about doing the same with models. Does it get the really obvious case? What happens if you give it something weird? From your training or test data it should be possible to sample from around the mode of the distribution for a few obvious cases and check that those are fine. It should also be possible to sample some stuff that maybe does not look like everything else and check that that is handled one way or another, or at least within some acceptable bound. If you are really getting freaky about it, you should be able to either A: fuzz it, or B: do some sort of model introspection and/or explainability thing to check what is going on.
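In code, those checks can look much like ordinary unit tests. The sketch below assumes a hypothetical load_model() helper and a toy scikit-learn classifier standing in for the real artefact; the shape of the tests, not the model, is the point.

```python
# Illustrative model-behaviour tests. `load_model` is a toy stand-in for loading
# the real production artefact; the structure of the tests is what matters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def load_model():
    X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.0,
                               random_state=0)
    return LogisticRegression(max_iter=500).fit(X, y), (X, y)

def test_obvious_cases_are_right():
    model, (X, y) = load_model()
    probs = model.predict_proba(X)[:, 1]
    # Twenty examples the model is most confident about: the "obvious" cases.
    confident = np.argsort(np.abs(probs - 0.5))[-20:]
    assert (model.predict(X[confident]) == y[confident]).mean() >= 0.95

def test_weird_inputs_stay_in_bounds():
    model, (X, _) = load_model()
    # Inputs far outside the training distribution must not produce NaNs or
    # probabilities outside [0, 1].
    weird = np.full((5, X.shape[1]), 1e6)
    probs = model.predict_proba(weird)
    assert np.all(np.isfinite(probs)) and np.all((probs >= 0) & (probs <= 1))

if __name__ == "__main__":
    test_obvious_cases_are_right()
    test_weird_inputs_stay_in_bounds()
    print("model behaviour tests passed")
```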

One thing I have noticed about data scientists at large, machine learning engineers, whatever you want to call them, is that you get someone who comes to the industry and is fascinated by models, that is good, that is enough to get you to the starting line. Then if they want to be on the cutting edge, that is also good. It is good to have ambition. But then what happens is that in order to try and make their way to the cutting edge, they will spend a lot of time reading other people’s papers. That to my mind is actually counterproductive because as long as you are reading and implementing other people’s papers, you can never be closer than one step away from the cutting edge. What you want to do is think about all this minutiae that exists that nobody is thinking about or at least very few people are thinking about and come up with an interesting solution to it. If you want to be on the cutting edge of machine learning, do not go looking for interesting solutions. Go looking for interesting problems. There are tons of them. They are everywhere.

“If you want to be on the cutting edge of machine learning, do not go looking for interesting solutions. Go looking for interesting problems. There are tons of them. They are everywhere.”

Adel: Are you referring to domain or to technology problems?

Stephen: Could be anything. Look for a problem that people in general are not really thinking that much about, or at least where there is not already a massive literature. “How do I know my model is doing what I think it is doing?” is an interesting problem. There is a lot of literature about model explainability, but in terms of dissecting my model, what can I do, I do not think there has been as much thought. For example, putting together some sort of package that, for a given model, samples from the distribution of a test set and checks whether it gets the simple cases right. What happens in the hard cases? I do not think that exists and it would probably be really useful. Someone should go do it. There is still a lot that could be done in practical terms, not just in research terms. It’s an easy way for someone to make a name for themselves and get on the cutting edge.
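A very rough sketch of what such a package could do is shown below: score each test case by how typical it is under the training distribution (here via a kernel density estimate), then report accuracy separately for the easy, near-the-mode slice and the hard, out-in-the-tail slice. Everything here is a stand-in rather than an existing tool.

```python
# Stand-in sketch of a "model check-up": score test cases by how typical they are
# under the training data, then report accuracy on the easy and hard slices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity

X, y = make_classification(n_samples=3000, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=500).fit(X_train, y_train)

# Log-density of each test point under the training distribution as a crude
# typicality score: high density = near the mode, low density = out in the tail.
log_density = KernelDensity(bandwidth=1.0).fit(X_train).score_samples(X_test)
easy = log_density >= np.quantile(log_density, 0.75)
hard = log_density <= np.quantile(log_density, 0.25)

correct = model.predict(X_test) == y_test
print(f"overall accuracy:    {correct.mean():.3f}")
print(f"easy-slice accuracy: {correct[easy].mean():.3f}")
print(f"hard-slice accuracy: {correct[hard].mean():.3f}")
```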

Adel: What kind of methodology do you use to manage the development of an AI system or an AI solution?

Stephen: I encourage the people in my team to think like software engineers where they need to think like software engineers and think like machine learning engineers where they need to think like machine learning engineers. I spend quite a lot of time spelling out the differences and helping people shift between them. So sure, you have got all your standard workflow and collaboration stuff that you get everywhere nowadays, but really what I spend a lot of time doing is making sure that people are doing a good job of surveying the landscape around them before they start writing any code or doing any documentation. A lot of my job is just helping people perceive their jobs clearly.

Since ML systems are software systems, we should treat them as software. In particular we should incorporate standard software development practices such as unit testing (checking that the ML system gets obvious cases correct), regression testing (checking that new versions of the model perform at least as well as previous ones), continuous integration (e.g. running the same set of tests on every update to the model) and continuous deployment into the ML development process.
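For example, a regression test for a model can be as simple as a gate in CI that refuses a candidate that scores worse than the currently deployed model on a frozen evaluation set. The sketch below uses toy data and hypothetical load_current_model / load_candidate_model stand-ins for whatever model registry a team actually uses.

```python
# Illustrative regression gate: the candidate must not score worse than the
# currently deployed model on a frozen evaluation set. The two loaders are toy
# stand-ins for pulling artefacts from a model registry.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=15, random_state=2)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, random_state=2)

def load_current_model():
    # Deliberately trained on less data so this toy example has a visible gap.
    return LogisticRegression(max_iter=500).fit(X_train[:100], y_train[:100])

def load_candidate_model():
    return LogisticRegression(max_iter=500).fit(X_train, y_train)

TOLERANCE = 0.005  # allow tiny metric noise, but no real regression

def test_candidate_does_not_regress():
    current = roc_auc_score(y_eval, load_current_model().predict_proba(X_eval)[:, 1])
    candidate = roc_auc_score(y_eval, load_candidate_model().predict_proba(X_eval)[:, 1])
    assert candidate >= current - TOLERANCE, (
        f"candidate AUC {candidate:.4f} regressed against current {current:.4f}")

if __name__ == "__main__":
    test_candidate_does_not_regress()
    print("regression gate passed")
```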

Adel: Are you talking to your team about ideas that are already out there that they can build on, code that you can reuse, things like that?

Stephen: More about what concerns they might have or want to take account of when performing a particular task. That is not me presenting someone with a laundry list. That is me asking someone, in general terms, what sort of parameters we should take account of when making this change to this repository, or building this repository from scratch, or anything of that sort. The most useful people are the ones who can look up from writing code for money long enough to think about something bigger than themselves. There is meme after meme of these two scientists talking and then at the end, the senior one says to the junior one, “Yes, but why did you do that?” You want to flip it and ask “Why would I do this?”, and encourage people to think of themselves in the context of something bigger rather than just how do I get this ticket done. Up to a point you want to be making problems go away rather than working on problems. That is another criticism I have of data scientists. You get a lot of people who want to work on interesting problems. Great, go to academia. If you want to make interesting problems go away, then that is a different thing and I have a job for you. Yeah, a lot of leading a machine learning team is just about inculcating effective habits of mind in people.

“Up to a point you want to be making problems go away rather than working on problems.”

Adel: Thinking about the purpose of what you are doing.

Stephen: Also in terms of reliability. What can I do to satisfy myself that this system I am working on is performing as intended and not just appearing to? With machine learning you need to start thinking like that early. It is very hard to bolt on later. In the medical world it is great because we can go out with a product and say, “Yes, we have done this clinical validation trial and we can show that it improves the performance of radiologists and we know from the ground up that this system does what we say it does.” It seems like a very simple claim but that also means it is a fundamental one. It gives both our customers and us a level of certainty.

Adel: When you say the system does what we say it does – with real-world problems you cannot fully describe in words what a system is supposed to do. Because the input space is so variable, the only claim you can make about it is an aggregated, statistical claim. We find that when we run into corner cases, what the system is supposed to do becomes increasingly specific. It might be that the system is meant to have 90% accuracy, but in this corner case it really has to be 99%. Do you have a similar experience?

Stephen: There are two sides to that coin. If someone is going to tell you in this one circumstance that the expectation is that accuracy should be much higher, then immediately you ask why. What is it about that situation that enables that to happen? If you can then encode that into your model somehow, then you are laughing. Any time someone says to you, “Well, in this situation it should be easy,” then yeah, immediately you get to say, “Well, why is that? Spell it out for me!” Then it is a matter of thinking about, well, what can I do about the way in which this model works or the way in which it is built that accommodates that? Then you are back in that situation of, okay, I have got some domain knowledge. I just need to incorporate it into my model.

Adel: When it is a business-critical case that is also extremely rare, do you have a view on low-data machine learning? Do you think that is a real thing?

Stephen: Absolutely it is. The question is are you sure a model is the right answer for that case? If it is a really clearly defined edge case and there are only two or three cases of it ever, do you actually want to do it in your model or do you just want to have it as some initial filtering step?

Adel: But what kind of system is going to carry out the initial filtering step?

Stephen: It could be any of a number of things. It depends on what you are dealing with. I did have a problem like this at a previous role where again we were automating a human process. There were a couple of really obvious edge cases in there that you could sort of spot and say, “Well, something specific needs to happen here.” So one thing you can do is just go through all your training data and update the labels accordingly. That kind of forces your model to act in that way. The other thing you can do is just set up simple traps like regex or otherwise to get this out of the way and say, okay, the model does not deal with that case. It deals with everything else. Down the track, if you want to incorporate weird edge cases into your model because enough time has passed for you to observe it happening a bit, that door will be open.
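A sketch of that “simple trap in front of the model” idea is below, for a text workflow: cheap rules catch the known, well-defined edge cases and route them away before the model is ever asked. The patterns, actions and toy model are all hypothetical.

```python
# Illustrative "trap in front of the model": cheap rules catch known edge cases
# and route them away before the model is asked. Patterns and actions are made up.
import re

EDGE_CASE_PATTERNS = [
    (re.compile(r"\btest patient\b", re.IGNORECASE), "discard_synthetic_record"),
    (re.compile(r"\bequipment fault\b", re.IGNORECASE), "route_to_manual_review"),
]

def toy_model_predict(text):
    """Stand-in for a trained model."""
    return "flag" if "urgent" in text.lower() else "routine"

def classify(text, model_predict=toy_model_predict):
    """Apply rule-based traps first; only untrapped cases reach the model."""
    for pattern, action in EDGE_CASE_PATTERNS:
        if pattern.search(text):
            return {"source": "rule", "action": action}
    return {"source": "model", "action": model_predict(text)}

print(classify("Test patient record, please ignore."))
print(classify("Urgent follow-up requested for right lower lobe."))
```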

Adel: I am more thinking in the case of image data.

Stephen: Both of those principles kind of apply. In the case of, okay, if this is a tiny corner case then usually that means it is pretty well-defined. In which case I should be able to find whatever instances I have got of that and update the labels so that if I ever see it, I can be pretty sure of what the model will do. Indeed, I can also train the model such that the importance of getting those cases right is elevated relative to everything else. That is just a matter of playing with sliding knobs, really. The other thing you can do is say, well, okay, how can I separate these cases from the rest of my data? You could use a model to do that. If you can do that with sufficient reliability, then you can sweep the edge case that you are worried about off into a separate trap.
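The “sliding knob” can be as plain as a sample weight. The sketch below, on synthetic imbalanced data, upweights a rare but well-defined class during training and compares recall on those cases with and without the extra weight; the numbers and weighting factor are illustrative only.

```python
# Illustrative "sliding knob": give a rare but well-defined class extra weight at
# training time so mistakes on it cost more than mistakes on the bulk of the data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=12, weights=[0.97, 0.03],
                           random_state=3)

sample_weight = np.where(y == 1, 10.0, 1.0)   # upweight the rare corner case

plain = LogisticRegression(max_iter=500).fit(X, y)
weighted = LogisticRegression(max_iter=500).fit(X, y, sample_weight=sample_weight)

rare = y == 1
print(f"recall on rare cases, unweighted: {plain.predict(X[rare]).mean():.2f}")
print(f"recall on rare cases, upweighted: {weighted.predict(X[rare]).mean():.2f}")
```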

Adel: How about cases where, for example, something has been observed only, say, two times in history – it is extremely rare. It is very hard to validate what the model is going to do the next time it encounters something like that, because there just are not enough examples.

Stephen: Well, that is where you start generating examples.

Adel: You mean like GANs or actually going out and finding examples?

Stephen: Whichever way you choose – again, whichever is easy to do. GANs are going to take a lot of machinery. There are other things that you can do, but typically you probably need to start thinking about some sort of data augmentation or synthetic data. That is the obvious response to “I do not have data”.
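For image data, the low-machinery version of “start generating examples” is often plain label-preserving augmentation. The sketch below turns a couple of stand-in rare-case images into many flipped, rotated and intensity-jittered variants; whether that is enough for a given clinical finding is a judgement call, not something this snippet settles.

```python
# Illustrative augmentation for a rare case: multiply the handful of real examples
# with cheap, label-preserving transforms instead of reaching for a GAN.
import numpy as np

def augment(image, rng):
    """Return a randomly flipped, rotated and intensity-jittered copy."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)
    out = np.rot90(out, k=int(rng.integers(0, 4)))
    out = out * rng.uniform(0.9, 1.1) + rng.normal(0.0, 0.02, out.shape)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
rare_cases = [rng.random((64, 64)) for _ in range(2)]   # stand-ins for two real images

synthetic = [augment(img, rng) for img in rare_cases for _ in range(50)]
print(f"{len(rare_cases)} real rare cases -> {len(synthetic)} augmented variants")
```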

Adel: But even getting a statistically significant test set in that situation is kind of impossible, right? You definitely cannot test on generated data.

Stephen: Can’t you?

Adel: I do not think so. No?

Stephen: It is not a clear-cut answer; I don’t think the answer is completely no. One thing that I spend a lot of time teaching my people is that there is a difference between what you need to satisfy yourself about a model in production, in academia and in general. So in Academia, typically you want to be exhaustive about whether or not this model works. In production, you want to be clear about most situations. In general, what you want is not absolute proof at all. What you want is evidence. So, evidence that points in the direction of whether or not this model is working as intended is usually pretty cheap to acquire. I would suggest that if you built a model with synthetic data and it failed a test, then that would already tell you something. It would not be the model you would necessarily deploy, but it would tell you whether you are on the right track with the way you are building things. So, we spend a lot of time getting evidence to tell us whether or not we are moving in the right direction. Then when we get to saying, “Okay, we are happy with this model. Let us actually go and deploy it,” that is when we start doing a fully-fledged job with a test set and clinical validation.

“…there is a difference between what you need to satisfy yourself about a model in production, in academia and in general. So in Academia, typically you want to be exhaustive about whether or not this model works. In production, you want to be clear about most situations. In general, what you want is not absolute proof at all. What you want is evidence.”

Now, you want evidence to pile up and give you a greater or lesser degree of confidence about whether or not things are working. The absolute, hardline requirement that it must work in this particular situation is not something we even had here, because we were not talking about 100% accuracy. We were talking about 99%. You should not go looking for that until you are pretty sure that is what you are going to find, if I can put it that way.

Adel: I remember a skin assessment model where it is something that looks very distinctive but you only see it once in every 5 million people. When we talk about comparing ML workflows to human workflows, people are very good at detecting anomalies. ML models tend to be quite bad at that. For these rare cases, it can be fairly expensive to collect test sets large enough to generate the kind of evidence you want.

Stephen: Yes. This is partly why I am saying, do you want to handle this case using the model? One of the tricks you learn as a mathematician is if you do not know how to solve a problem, solve something else instead and then come back to it and see if that has changed the way you view the problem. So, I am basically just applying that here. Again, this is what I was saying earlier. A lot of things we do are shaped by really fundamental forces and you want to keep a clear eye on that. If a problem looks insurmountable, then do not go over it. Find a way around it.

Adel: With the augmentations that your product makes to the clinician workflow, if you run it over a long enough period of time, there are going to be cases where, counterfactually, the doctor would have made a correct diagnosis had the system not been present, but did not when it was. Is that right?

Stephen: Yes. I mean, that is a counterfactual, is it not? Yeah, that is a counterfactual. We know what happens when people actually use it. Insofar as that is the case, we can see what the risks are of things like that.

Adel: The risk of that particular scenario is nonzero. Even if we create a system that in the aggregate works better, it is still clear that there will be individual cases at points in time where the system causes a degradation in the performance in that instance.

Stephen: That is what my job is about, right?

Adel: It is about looking for those cases and correcting them?

Stephen: It is looking for situations where the outcome is not as good as it should be and doing something about it. Rule number one of models is that you cannot have perfection. Forget it. They are all about approximation. So, really what you want is a good grip on what the risks are, and at the same time you also want to be able to say, “Well, this is designed to speed you up. It is not designed to replace you.” Ultimately, the human owns the decision, and we spend time and effort making sure that the benefits to the patient from using our product are clear. There will always be risks associated with that; much as with anything in life, the name of the game is to keep your eyes open to them and do something about them.

Stephen Hogg is the Head of AI Systems at Harrison.ai. He is a seasoned leader in ML development and delivery across a range of industries and has a background in mathematics, econometrics and languages.