From the JAMA Network, this is JAMA Pediatrics Author Interviews: conversations with authors exploring the latest clinical research, reviews, and opinions featured in JAMA Pediatrics.

Hello, podcast listeners. This is Dimitri Christakis, Editor-in-Chief of JAMA Pediatrics. Joining me today, as always, is my colleague, Associate Editor, and Professor of Pediatrics at Boston Medical Center, Dr. Alison Galbraith. Hi, Ali, how are you?

Hi, I'm good, thanks.

And as those of you who listen know, we have a format here where we use an article published recently in JAMA Pediatrics as a springboard to discuss a looming, important issue in pediatric health and health care. Today's topic is something we seem to hear about incessantly these days, namely artificial intelligence in health care. To help us talk about it and learn about it, we're delighted to have Bimal Desai, Chief Health Informatics Officer at Children's Hospital of Philadelphia and Associate Professor of Pediatrics at the Perelman School of Medicine at the University of Pennsylvania. Bimal, welcome to the JAMA Pediatrics podcast. Good to have you here.

It's a pleasure to be here. Thanks for inviting me.

So Ali, tell us what made us think about doing a deep dive into AI and pediatrics this month.

Yeah, so we have an ongoing theme of research at JAMA Pediatrics on artificial intelligence and pediatric care, and under that theme we recently published a paper called "Development and Validation of an Automated Classifier to Diagnose Acute Otitis Media in Children," by Shaikh and colleagues. In this study, they wanted to develop and then validate an AI decision-support tool that interprets videos of tympanic membranes, with the goal of improving the diagnostic accuracy of otitis media, which notoriously is not always diagnosed correctly. They developed a medical-grade smartphone app that uses the phone's camera attached to an otoscope, with voice commands to start and stop the video, and they took videos during well and sick visits in the University of Pittsburgh clinics. By the way, if you want to see how the app and the otoscope work, there's a link attached to the paper.

The videos were reviewed and annotated by validated otoscopists who assigned the diagnosis, and that served as the reference standard. They then trained two different types of models: a deep residual-recurrent neural network to predict features of the tympanic membrane and the diagnosis of otitis media or not, and, for comparison, a decision-tree approach. They also trained a noise-quality filter that could prompt users when a video might not be adequate for diagnosis, for example if there was too much wax.

They had over a thousand videos from more than 600 children, most of them under three years old. They found that the deep residual-recurrent neural network had diagnostic accuracy very similar to that of the decision tree: both had sensitivity around 94% and specificity around 93% to 94%, which is better than primary care physicians and advanced practice clinicians. Of note, about a quarter of the videos had to be excluded, mostly because the tympanic membranes were occluded by wax. Also of note, and we can talk about this in a bit, they didn't collect patient demographics.
So we don't know how diverse the population the models were trained on was, or whether the findings apply across subgroups. But they concluded that these algorithms, applied through image acquisition and quality filtering, could be a good way to support diagnostic decisions for otitis media in a primary care setting. And because the tool has high specificity, they thought it could be used at triage by trained non-physicians: you wouldn't need repeat exams, because the video could be uploaded into the chart for a provider to review. So that's the paper.

I guess I have one lead-off question, which is, Bimal, could you talk a little bit about the different types of algorithms, in layperson's terminology? In what situations would one be favored over another? Here they had similar results, but I'm wondering about those applications.

Yeah, I think it's a complicated question, especially because it's task dependent. I may not be the expert on which deep learning algorithms are better for which use cases, but I know that some are probably favored for image classification versus clustering versus prediction. This is where you would tap your local data scientist to inform which model would be the best one to use. What was nice about this manuscript is that the investigators tried two different methods, the decision tree as well as the recurrent network, and found very comparable results. That lends some validity to the notion that you can accurately classify otitis media images almost regardless of the algorithm.

So Bimal, to follow up on that question, and please, even though you may feel like you're getting over your skis, you know way more about this than I do, than Ali does, and probably than most of our listeners do. Maybe take a step back, because I think most of us see AI as a bit of a black box, and most of our familiarity with it, at least from my standpoint, is ChatGPT. That's different from this, right? That's a natural language model. Tell us the difference between me going to ChatGPT and asking it a question versus taking a picture of an eardrum and putting it into a tool like this one.

Yeah, that's a great question, Dimitri. AI is a very broad category. If you think of artificial intelligence in general as the umbrella, it includes any technique that allows computers to mimic human language and intelligence, and machine learning is one category within it. Within machine learning you have lots of other techniques, including deep learning. Deep learning is the set of techniques applied here, using neural networks that try to approximate, for example, how human neurons might work, with feed-forward and feedback connections, reinforcement, and training. So deep learning is a special class within machine learning, which in turn is a class within artificial intelligence. ChatGPT is actually a very specific type of deep learning model: a large language model, trained to emulate human language, in our case English. It's a very different type of model, used for very different purposes, than the model in this study, which was used for image classification.

Can you talk a little bit more about the use cases, then?
What they did here was use image classification for diagnostic purposes, but you could imagine using other types of AI, like ChatGPT, to help us write our notes, do prior authorizations, things like that.

Absolutely. All of us encounter artificial intelligence in myriad ways in our own lives. If you have an electric vehicle, or if your automobile has lane-assist control that keeps you from drifting out of your intended lane, that relies on image recognition and a whole bunch of sensors trying to calculate and project your position in space. If you've ever bought anything on Amazon, it will tell you that people who bought this book also bought these other books; those suggestion algorithms try to approximate your preferred purchases by comparing them with those of other buyers. If you've ever had a smart doorbell or a smart camera on your front door that detects when there's a package or a human visitor, and knows to ignore cars driving by, that requires video classification. So we encounter these very specific, purpose-built algorithms all the time in our lives. And of course the classic one is using Amazon's Alexa or Siri for speech-to-text and speech recognition; those are also very specific types of artificial intelligence algorithms.

To your point, Ali, about ChatGPT and large language models in health care, I think we're really just skimming the surface of what the applications will be in clinical practice. There are some great opportunities there, which it would be lovely to discuss, even if they're a little different from the image classification in this manuscript.

So for the clinicians who are listening, tell us what they should look forward to and what they should be afraid of, both in terms of potential mistakes and, I suppose, in terms of even being replaced. If I'm a primary care pediatrician and I read this paper on acute otitis media, I can't help but wonder: is the future that people will self-diagnose at home and get a prescription mailed to them and never even come to my office? I don't know. You tell us. What do you see as the bright stars in the future, and where do you see the challenges or the obstacles?

Well, I'll first confess that, as a general pediatrician, performing an ear exam is one of my least favorite tasks. It's unpleasant, and I'm happy to give it over to a computer; if a computer wants that task, it can have it. And all of us have the sense that the work of clinical practice these days involves a lot of non-value-added, time-consuming, stereotyped tasks. My perspective is: why not let the AI write your letter of medical necessity? Why not let it summarize large corpora of clinical text and suggest a synopsis for you? We spend a lot of time on that kind of task; I've heard it referred to as a "chartopsy," where you have to look through the chart and try to piece together the patient's story. As an inpatient provider, I take care of a lot of very complicated patients, and it can take 10 or 20 minutes just to figure out what's going on. I love the idea that at some point these language models might be quite good at doing that. We don't know yet how good they'll be, but if you think about it, ChatGPT has only been out for about a year and a half; it was released in November 2022.
Think about how far it's come in that very short time, or rather how far large language models have come. As for the opportunities I'm looking at, and I think many health systems are starting to probe this: we all get a large influx of patient requests and messages through our patient portals these days, and I know many of the large electronic health record vendors are exploring letting a large language model draft the response to that initial message. Even if it saves you only a minute here or there, that can be very useful for a provider. We need to validate that it actually does save time to have something else draft the reply for you, but that's certainly one option. I also love the idea of a clinical synopsis, if we can show that these tools can review the patient's recent chart and say: since the last time you saw Johnny, here are the ten things that have happened in the interval; he was seen in the emergency department, he was started on these two medications, he had this diagnostic study performed. That's exactly what we do, and exactly what our trainees do when we work with them. So if the computer can do it accurately in a fraction of the time, we might benefit from that.

And what would it take for us to hand that task over, for lack of a better term? What would the confidence interval have to be? Or the false positive or false negative rate? If I said, summarize the chart of a very complicated patient, say one coming to my hospital with 50 pages of medical records, on the one hand you're right, that's an incredible time saver, and it might even synthesize it as well as or better than I do. But on the other hand, what if it misses a key detail? How do I know? What's the gold standard? That's the question. In this study, they had validated otoscopists serve as the gold standard, but in that situation I don't know how you test it.

That's a good point. The gold standard for chart review, I don't think we even know what that is. Is it the expert who has an hour and a half to look through the chart and uncover every possible detail, and who is also expert at summarizing it lucidly for the person reviewing it downstream?

Or is the gold standard, and this is my opinion, current clinical practice? Like real-world comparative effectiveness: have a resident summarize it versus the tool.

That's exactly right. And we know that those resident summaries, and even the attending and medical student summaries, are not perfect. So I think the gold standard is current practice. This is an interesting conversation. We actually started a new AI governance committee within our organization, and we've begun talking about how we would put standards in place for validating AI algorithms, especially for use in clinical practice, as well as for other use cases, operational ones and so on. I think it's going to be really tricky to come up with repeatable standards for AI validation for clinical purposes, because in many cases it feels like we don't have a gold standard.
If the gold standard is my busy intern trying to get through 11 admissions on a busy winter night, I guarantee you things are going to get missed, not because they're doing anything wrong, but because of the time pressures they face in clinical practice.

Right, I think that's exactly right. That's a situation where AI would probably outperform.

That's right.

Before we go, talk a little bit about bias. We'd be remiss if we didn't discuss, as we come to rely on these tools and give ourselves over to them, how we can be certain that the black box we're using doesn't have implicit or explicit biases built in, based simply on its inputs.

We have to be really cautious about both the inputs and the outputs introducing bias. As an example, there are implicit biases in the training sets for many of these models. As you pointed out, Ali, in this manuscript we don't actually know the ethnic makeup of the children in the study, which could certainly be a source of bias. We know that visual classification algorithms can be biased by skin tone, for example. I can't say I know whether the appearance of the middle ear differs by skin tone, or whether that matters when you're trying to diagnose otitis, but it's worth asking the question.

The other type of bias I would be concerned about, especially for AI algorithms: the classic example was a health system that used artificial intelligence to try to determine which patients would benefit most from care coordination, and the inputs were the existing data on patients who were already using health services. The more you tended to use health services, the more this thing would say, yes, this patient might benefit from a care manager or a care concierge. The problem, of course, is that it disfavors patients who can't access the health system in the first place. It uses utilization as a proxy for need, which is not valid: the patients who need the services most may not actually be able to access them.

Another classic example, more on the output side, was an artificial intelligence algorithm used to, quote unquote, predict no-show rates, that is, to figure out which patients were least likely to show up for an ambulatory visit. In that case, the health system used it to intentionally double-book patients who had a very high predicted no-show rate. So the bias is not so much in the model itself as in the application of the model. If you do show up for your visit, there's a good chance you'll get bumped. You're biasing the health system, the care delivery, against people who may need that appointment slot the most and who might need the most support to attend the visit, whether that's outreach ahead of time, offering transportation, or whatever the need may be. We need to be very cautious and make sure we understand, at every step of AI model development, validation, and implementation, whether biases could creep in along the way.

How will AI algorithms be better than all the years of prediction rules people have developed, which use information you have in the medical record, like labs and so on? Maybe these AI models can do this more objectively, say with images, than you could with the subjective assessment of a tympanic membrane that goes into a prediction rule.
But we have prediction rules, and they don't get applied. This study was great in that the tool was part of the visit: it's attached to your otoscope, and it gives you an answer in real time. We could develop plenty of AI algorithms, but if we don't have a way to put them in at the point of care, they're not going to be helpful.

Yeah, I think that's a very fair point. A lot of these algorithms work great in silico, where you can show very high accuracy and performance, and then things fall apart in actual clinical application for a number of reasons. One is that the recommendation is not actionable or is too diffuse: we're not even sure what to do with it, or what the action step should be. When I teach about test characteristics, I always use the example of the Apple Watch. Many people have Apple Watches with the atrial fibrillation algorithm, which is an AI algorithm, and it's really, really accurate: almost 100% specificity and 98% sensitivity. In the FDA clearance application, Apple showed that in the study population it had a 99.6% positive predictive value. But the study population had a 50% prevalence of atrial fibrillation, which is completely unnatural. There is no naturally occurring population with a 50% prevalence of atrial fibrillation, except maybe the waiting room of your electrophysiologist's office.

So how would this thing perform in actual practice? You can model it out. First of all, it would be generally useless in children, because the prevalence of atrial fibrillation is vanishingly small in that cohort. Even in people my age, and I'm almost 50, the prevalence is only about 0.1%. If you do the math, even with that 99.6% specificity and 98% sensitivity, the algorithm has only about an 18% positive predictive value in people my age.

So what I would do, and what I try to socialize within our organization, is to treat AI algorithms no differently than any other diagnostic test. Pretend it's a lab test, and apply exactly the same rigor you would apply to a genetic assay or a biochemical assay, because all the same considerations still apply. You still have to characterize its positive predictive value and its utility in clinical practice. You still have to figure out how to use it as part of a guideline, a pathway, or a clinical protocol. My worry is that when people see the word AI, they treat it as a way to shortcut all the steps that biostatisticians and epidemiologists have worked so long to establish as the right way to ensure rigor. That's the risk, and also the opportunity: to remind the world that AI is really no different from other types of algorithms, except that it might be quicker and easier to implement at the point of care.

That's great, Bimal. I think that's a great way of thinking about it. We have a lot more to discuss with you, but we don't have a lot more time. So, Bimal, thank you so much for your time and for sharing your expertise with us.

Yeah, thanks. It's been great. You're very welcome. Thanks for the invitation.

This episode was produced by Shelley Steffens at the JAMA Network. To follow this and other JAMA Network podcasts, please visit us online at jamanetworkaudio.com. Thanks for listening.
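For listeners who want to see the Apple Watch arithmetic Bimal describes worked through, it is simply Bayes' rule applied to test characteristics. Below is a minimal sketch in Python, assuming 98% sensitivity and 99.6% specificity; the specificity is back-calculated from the 99.6% PPV at 50% prevalence quoted in the conversation, so treat it as an assumption, and note that the low-prevalence result lands near, not exactly at, the "about 18%" figure.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule: TP / (TP + FP)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Test characteristics quoted in the conversation; the specificity is
# inferred from the reported 50%-prevalence PPV, so it is an assumption.
sens, spec = 0.98, 0.996

# FDA study population: 50% prevalence of atrial fibrillation.
print(f"PPV at 50% prevalence:  {ppv(sens, spec, 0.50):.1%}")   # -> 99.6%

# People around age 50: roughly 0.1% prevalence, per the conversation.
print(f"PPV at 0.1% prevalence: {ppv(sens, spec, 0.001):.1%}")  # -> ~19.7%, near the ~18% quoted
```

The same calculation shows why the identical algorithm would be essentially uninformative in children, where the prevalence of atrial fibrillation is lower still and nearly every positive result would be a false positive.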