Transcript for:
AI in Pediatric Healthcare - JAMA Pediatrics Podcast

From the JAMA Network, this is JAMA Pediatrics Author Interviews: conversations with authors exploring the latest clinical research, reviews, and opinions featured in JAMA Pediatrics.

Hello, podcast listeners. This is Dimitri Christakis, Editor-in-Chief of JAMA Pediatrics. Joining me today, as always, is my colleague, Associate Editor and Professor of Pediatrics at Boston Medical Center, Dr. Alison Galbraith. Hi, Allie, how are you?

Hi, I'm good, thanks.

And as you know, those of you that listen, we have a format here where we use an article published recently in JAMA Pediatrics as a springboard to discuss a looming important issue in pediatric health and health care. Today's topic is something that we hear about incessantly, it seems, these days, namely artificial intelligence in health care. And to help us talk about that and learn about it, we're delighted to have Bimal Desai, who's the Chief Health Informatics Officer at Children's Hospital of Philadelphia. He's also an Associate Professor of Pediatrics at the Perelman School of Medicine at the University of Pennsylvania. Bimal, welcome to the JAMA Pediatrics podcast. Good to have you here.

It's a pleasure to be here. Thanks for inviting me.

So Allie, tell us what made us think about doing a deep dive into AI and pediatrics this month.

Yeah, so we have an ongoing theme of research at JAMA Pediatrics on artificial intelligence and pediatric care, and we recently published a paper under that theme called "Development and Validation of an Automated Classifier to Diagnose Acute Otitis Media in Children," by Shaikh and colleagues.

In this study, they wanted to develop and then validate an AI decision support tool to interpret videos of tympanic membranes to improve diagnostic accuracy for otitis media, which notoriously is not always perfectly diagnosed. They developed a medical-grade smartphone app that used the camera of the smartphone attached to an otoscope, with voice commands that can start and stop the video. They took videos from well and sick visits in the University of Pittsburgh clinics. And by the way, if you want to see how the app and the otoscope work, there's a link attached to the paper where you can see that.

The videos were reviewed and annotated by validated otoscopists who assigned the diagnosis, and that was the reference standard. They then trained two different types of models. One was a deep residual-recurrent neural network model to predict features of the tympanic membrane and the diagnosis of otitis media or not, and they compared its accuracy with a different method using a decision tree approach. They also trained a noise quality filter so the app could prompt users when the video might not be adequate for diagnosis, for example if there was too much wax, something like that.

They had over a thousand videos from more than 600 children, mostly under three years old. They found that the deep residual-recurrent neural network algorithm had very similar diagnostic accuracy compared with the decision tree approach: both had sensitivity around 94% and specificity around 93 to 94%, which is better than primary care physicians and advanced practice clinicians. Of note, about a quarter of the videos had to be excluded, mostly because the tympanic membranes were occluded by wax.
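[For illustration: the head-to-head model comparison described above can be sketched loosely in a few lines of Python. The data here are synthetic stand-ins, not the study's otoscopic videos or annotated features; the point is only the general workflow of training two model families on the same labeled examples and comparing their cross-validated accuracy.]

```python
# Toy sketch of comparing two classifier families on the same labeled data.
# Synthetic features stand in for per-video tympanic membrane features.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "small neural network": MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0),
}

for name, model in models.items():
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean cross-validated accuracy {accuracy:.2f}")
```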
And then also of note, and we can talk about this in a little bit, they didn't collect patient demographics. So we don't know how diverse the population it was trained on was, or whether the findings apply across subgroups.

But they concluded that these algorithms, the way they applied them through image acquisition and quality filtering, could be a good way in a primary care setting to help with diagnostic decisions for otitis media. And they thought, especially because it has high specificity, that it could be used at triage by trained non-physicians. So you wouldn't have to have repeat exams; you could upload the video into the chart and a provider could just review it. So that's the paper.

I guess I have one lead-off question, which is, Bimal, if you could talk a little bit about the different types of algorithms in layperson's terminology: in what situations would one be favored over another? Here they had similar results, but I'm wondering about those applications.

Yeah, I think it's a complicated question, especially because I think it's task dependent. And I may not be the expert on which deep learning algorithms are better for which use cases, but I know that there are probably some that are favored for image classification versus clustering versus prediction. I think this is where you would tap your local data scientist to really inform which model would be the best one to use.

And I think what was nice about this manuscript was that the investigators tried two different methods, right? They used this decision tree as well as this recurrent network, and they found very comparable results. And so it lends some validity, I think, to this notion that you can actually accurately classify otitis media images almost regardless of the algorithm.

So Bimal, you know, just to follow up on that question, because, please, even though you may feel like you're getting over your skis, you know way more about this certainly than I do and Allie does and probably most of our listeners do. So maybe take a step back, because I think most of us think of AI as this bit of a black box, and most of our familiarity with it is ChatGPT, at least from my standpoint, right? That's different than this, right? That's a natural language model. Tell us what the difference is between me going to ChatGPT and asking it a question versus taking a picture of an eardrum and putting it into ChatGPT.

Yeah, that's a great question, Dimitri. So AI is a very broad category. If you think about artificial intelligence in general as the umbrella, that includes any technique that allows computers to mimic human language and intelligence, and it includes machine learning, which we would think of as a category within that. Within machine learning, you have lots of other techniques, some of which include things like deep learning. And deep learning is the set of techniques that were applied here using these neural networks. These try to approximate, for example, how human neurons might work, by having sort of reinforcement and feed-forward and feedback and training. And so they are kind of a special class within machine learning and a special class within artificial intelligence.

ChatGPT is actually a very specific type of deep learning model, right? Large language models, which have been trained to emulate human, in our case English, language. And it's a very different type of model, used for very different purposes than something like this model, which was used for image classification.
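[For illustration: a skeletal sketch of the general "residual CNN plus recurrent layer over video frames" pattern that the deep learning model in the manuscript belongs to. This is not the authors' architecture; the ResNet-18 backbone, layer sizes, and frame counts are assumptions chosen only to make the idea concrete.]

```python
# Minimal residual + recurrent video classifier skeleton (illustrative only).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class OtoscopyVideoClassifier(nn.Module):
    def __init__(self, hidden_size=128):
        super().__init__()
        backbone = resnet18(weights=None)   # residual CNN that embeds each frame
        backbone.fc = nn.Identity()         # keep the 512-dim frame embedding
        self.backbone = backbone
        self.rnn = nn.GRU(512, hidden_size, batch_first=True)  # aggregate frames over time
        self.head = nn.Linear(hidden_size, 1)                  # one "AOM vs. not" logit per video

    def forward(self, frames):              # frames: (batch, time, 3, height, width)
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, last_hidden = self.rnn(feats)
        return self.head(last_hidden[-1])

# Toy usage: two clips of 8 frames each, 224x224 RGB.
logits = OtoscopyVideoClassifier()(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1])
```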
Can you talk a little bit more about the use cases, then? What they did here was using image classification for diagnostic purposes, but you could imagine you could use other types of AI, like ChatGPT, to help us write our notes or do prior auths or something like that.

Absolutely. And, you know, all of us encounter artificial intelligence in myriad ways in our own lives. If you have an electric vehicle, or if your automobile has lane-assist control that prevents you from departing out of your intended lane, that relies on image recognition and a whole bunch of sensors trying to calculate and project your position in space. If you've ever gone to Amazon and purchased anything, it will tell you that people who bought this book also bought these books; those suggestion algorithms try to approximate your preferred purchases compared with those of other buyers. If you've ever had a smart doorbell or a smart camera on your front door that detects when there's a package there or a human visitor, and knows to exclude cars that drive by, that requires video classification. So we encounter these very specific, purpose-built algorithms all the time in our lives. Of course, the classic one is if you've used Amazon's Alexa or Siri for speech-to-text and speech recognition; those are also very specific types of artificial intelligence algorithms.

To your point, Allie, about ChatGPT and large language models in healthcare, I think we're really just skimming the surface in terms of what we think the applications will be in clinical practice. And there are some great opportunities there, which would be lovely to discuss, even if they're a little bit different from the image classification in this manuscript.
And I love the idea that at some point,   these language models might actually  be quite good at doing that. We don't know yet how good they'll  be, but if you think about it,   CHAT GPT has only been out  for about a year and a half. It was released in November of 22. Think about how far it's come  in that very short time period,   or rather how far large language models have come. The opportunities I'm looking at, and I think many  health systems are starting to probe this, we all   get a large influx of patient requests or patient  messages through our patient portals these days. And I know that many of the large health system,  electronic health record vendors are exploring,   why not let the large language model  draft a response to that initial message? Even if it saves you a minute here or  there, it can be very useful for a provider. And we need to validate if that's actually  the case, that it actually does save you   time to have somebody else draft it  for you or something else in this case. But that's certainly one option. I love the idea of clinical synopsis, if  we can show that these things can actually   review the patient's recent chart to say,  okay, since the last time you saw Johnny,   here's the 10 things that have happened  to him in that interval, right? He was admitted to the emergency department,   he was started on these two medications,  he had this diagnostic study performed. That's exactly what we do. It's exactly what our trainees  do when we work with them. So again, if the computer can do that  accurately in a fraction of the time,   we might benefit from that. And what would it take for us to give  that task over for lack of a better trope? What would the confidence interval have to be? Or what would the false positive  or false negative rate have to be? Do you know what I mean? So if I said summarize this patient's  chart, a very complicated patient,   let's say, coming to my hospital  that has 50 pages of medical records,   on the one hand, you're right,  that's an incredible time saver. And it might even synthesize  it as well or better than I do. But on the other hand, what if  it misses a key detail that... I mean, how do I know... What's the gold standard? That's what's the gold standard. Like in this study, they could have, you know,   these validated otoscopists  like beat the gold standard. But in that situation, I  don't know how you test that. That's a good point. Yeah, the gold standard for chart review,  I don't think we even know what that is. Because is it the expert who has an hour and a   half to look through the chart and  to uncover every possible detail? Not only that, but who's also an  expert at summarizing it in a lucid   way for the person who's reviewing it downstream. Or is the gold standard,  which is, this is my opinion,   I think the gold standard is  current clinical practice, right? So like comparative effectiveness, like  no world comparative effectiveness. Have a resident summarize it or have this or this. That's exactly right. And we know that those resident  summaries and even the attending   summaries and the med student  summaries are not perfect, right? And so I think the gold standard  is the current practice. This is an interesting conversation. We actually started a new AI governance  committee within our organization. 
This is an interesting conversation. We actually started a new AI governance committee within our organization, and we started to talk about, okay, well, how would we start to put in standards around validation of AI algorithms, especially for use in clinical practice, as well as for other use cases, operational uses and things like that. And I think it's going to be really tricky for us to come up with repeatable standards for AI validation for clinical purposes, because it just feels like we don't have the gold standard in many cases. If the gold standard is my busy intern, who's trying to get through 11 admissions on a busy night in the winter, I guarantee you there are things that are going to get missed, and not because they're doing anything wrong, but just because of the time pressures that they face in clinical practice.

Right, no, I think that's exactly right. That's a situation probably where AI would outperform.

That's right.

Talk a little bit, before we go, about bias. I mean, we'd be remiss if we didn't talk about, as we rely on these and we give ourselves over to them, how can we be certain that the black box that we're using doesn't have implicit or explicit biases built in, just based on the inputs?

You know, we have to be really cautious about both the inputs and the outputs introducing bias. As an example, there are implicit biases in the training sets for many of these data. As you pointed out, Allie, in this manuscript we don't actually know the ethnic makeup of the children who were in the study, which could certainly be a source of bias. We know that visual classification algorithms may be biased based on skin tone, for example. So that's a very fair point. I can't say I know what the differences in skin tone are in the middle ear, especially when you're trying to diagnose otitis, or whether it's important or not, but it's worth asking the question.

The other type of bias that I would be concerned about, especially for AI algorithms: the classic example was a health system that used artificial intelligence to try to determine which patients would benefit the most from care coordination. And the inputs were the existing data set of patients who were using health services, right? So the more you tended to use health services, the more this thing would preferentially say that, yes, this patient might benefit from a care manager or a care concierge. And the problem with that, of course, is that it disfavors patients who actually can't access the health system. They're using utilization as a proxy for need, which doesn't hold; the patients who need it the most may not actually be able to access the services.

Another classic example, more on the output side, was an artificial intelligence algorithm that was used to, quote unquote, predict no-show rates, trying to figure out which patients were least likely to show up for an ambulatory visit. In that example, the health system used it to intentionally double-book patients who had a very high rate of no-shows. And so it's not so much the model itself as the application of the model; now they're actually introducing bias. If you do show up for your visit, there's a good chance you'll get bumped, right? And so you're biasing the health system, the care delivery, against people who, again, need that spot the most and might need the most support to make sure they can attend the visit, whether it's reaching out to them ahead of time, offering transportation, or whatever the need may be.

We need to be very cautious and make sure that we understand, at every step of AI model development, validation, and implementation, whether there are possible biases that could creep in along the way.
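[For illustration: a small synthetic simulation of the first bias example above, in which utilization is used as a proxy for need. Both groups below are constructed to have identical clinical need, but one faces access barriers that halve its observed utilization; selecting the "top 10% by utilization" for care coordination then almost entirely excludes the group with barriers. The numbers are invented purely to show the mechanism.]

```python
# Synthetic demonstration of proxy-label bias (utilization as a stand-in for need).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

barriers = rng.integers(0, 2, n).astype(bool)       # True = faces access barriers
need = rng.normal(5, 2, n)                           # true underlying need, same for both groups
utilization = need * np.where(barriers, 0.5, 1.0) + rng.normal(0, 0.5, n)

# "Algorithm": offer care coordination to the top 10% by the proxy (utilization)
selected_by_proxy = utilization >= np.quantile(utilization, 0.90)
selected_by_need = need >= np.quantile(need, 0.90)   # what selecting on true need would do

print(f"Share of population with access barriers:     {barriers.mean():.0%}")
print(f"Share of proxy-selected with access barriers:  {barriers[selected_by_proxy].mean():.0%}")
print(f"Share of need-selected with access barriers:   {barriers[selected_by_need].mean():.0%}")
```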
How will AI algorithms be better than all the years of prediction rules that people have developed? People have been using information that you have in the medical record, like labs or what have you. These AI models can do this more objectively, like with images, more than you could with a subjective assessment of a TM in a prediction rule. But we have prediction rules and they don't get applied. This study was great in that it was part of the visit: it's attached to your otoscope, and it's giving you this in real time. But we could develop AI algorithms, and if we don't have a way to put them in at the point of care, they're not going to be helpful.

Yeah, I think that's a very fair point. A lot of these algorithms work great sort of in silico, and you can show that they have very high accuracy and high performance, and things might fall apart in actual clinical application or clinical practice for a number of reasons. One is that the recommendation is not actionable, or it's too diffuse; we're not even sure what to do with the recommendation or what the action step should be.

I always use the example, when I teach about test characteristics and things like that, of the Apple Watch, right? So many people have Apple Watches that have this atrial fibrillation algorithm, an AI algorithm. This algorithm is actually really, really accurate. It has almost 100% specificity and 98% sensitivity. Really good algorithm. And in the FDA clearance application, Apple actually showed that, in the study population, this had a 99.6% positive predictive value. But the study population had a 50% prevalence of atrial fibrillation, which is completely unnatural. There's no naturally occurring population that has a 50% prevalence of atrial fibrillation, except maybe the waiting room of your electrophysiologist's office or something like that.

And so in actual practice, and you could actually model this out, how would this thing perform? Well, first of all, it would be generally useless in children, because the prevalence of atrial fibrillation is vanishingly small in that cohort. Even in people of my age, and I'm roughly 50 years old, almost 50, the prevalence is only 0.1%. And so if you do the math, even with a 99% specificity and 98% sensitivity, this thing only has something like an 18% positive predictive value in people of my age.
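[For illustration: the back-of-the-envelope positive predictive value calculation described above, using Bayes' theorem with the approximate figures quoted in the conversation (98% sensitivity and the near-perfect specificity from the FDA filing). The exact prevalence values are assumptions made only for the sake of the example.]

```python
# Positive predictive value as a function of prevalence (Bayes' theorem).
def ppv(prevalence, sensitivity, specificity):
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

sens, spec = 0.98, 0.996  # approximate Apple Watch AF-algorithm figures quoted above

for label, prevalence in [
    ("validation-study population (~50% prevalence)", 0.50),
    ("community ~50-year-olds (~0.1% prevalence)", 0.001),
]:
    print(f"{label}: PPV = {ppv(prevalence, sens, spec):.1%}")
```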
So what I would do, and what I try to socialize within our organization, is: treat AI algorithms no differently than any other diagnostic test, right? Just pretend it's a lab test and apply to it the exact same rigor that you would if this were a genetic assay or some biochemical assay or something else, because all the same things still apply. You still have to characterize its positive predictive value and its utility in clinical practice. You still have to figure out how you want to use it as part of a guideline or a pathway or a clinical protocol. And my worry is that when people see the word AI, they think of it as a way to shortcut all those steps that biostatisticians and epidemiologists have worked for so long to establish as the right way to establish rigor.

And I think that's the risk, and also the opportunity for us to remind the world that AI is really no different than other types of algorithms, except for the fact that it might be quicker, it might be easier to implement at the point of care, and things like that.

That's great, Bimal. I think that's a great way of thinking about it. We have a lot more to discuss with you, but we don't have a lot more time. So, Bimal, thank you so much for your time and for sharing your expertise with us.

Yeah, thanks. It's been great.

You're very welcome. Thanks for the invitation.

This episode was produced by Shelley Steffens at the JAMA Network. To follow this and other JAMA Network podcasts, please visit us online at jamanetworkaudio.com. Thanks for listening.