From the JAMA Network, this is
JAMA Pediatrics Author Interviews. Conversations with authors exploring
the latest clinical research, reviews, and opinions featured in JAMA Pediatrics. Hello, podcast listeners. This is Dimitri Christakis,
Editor-in-Chief of JAMA Pediatrics. Joining me today, as always,
my colleague, Associate Editor, and Professor of Pediatrics at Boston
Medical Center, Dr. Alison Galbraith. Hi, Ali, how are you? Hi, I'm good, thanks. And as you know, those of you that listen, we have a format here where we use an article
published recently in JAMA Pediatrics as a springboard to discuss a looming important
issue in pediatric health and health care. Today's topic is something that we
hear about incessantly, it seems, these days, namely artificial
intelligence in health care. And to help us talk about that and learn
about it, we're delighted to have Bimal Desai, who's the Chief Health Informatics Officer
at Children's Hospital of Philadelphia. And he's also an Associate Professor of Pediatrics at the Perelman School of Medicine
at the University of Pennsylvania. Bimal, welcome to the JAMA Pediatrics podcast. Good to have you here. It's a pleasure to be here. Thanks for inviting me. So Ali, tell us what made us think about doing
a deep dive into AI and pediatrics this month. Yeah, so we have an ongoing theme of research at JAMA Pediatrics on artificial
intelligence and pediatric care. So we published recently a paper under that
theme called Development and Validation of an Automated Classifier to Diagnose Acute Otitis
Media in Children by Shaikh and colleagues. So in this study, they wanted to develop
and then validate an AI decision support tool to interpret videos of tympanic membranes
to improve diagnostic accuracy of otitis media, which notoriously is not
always perfectly diagnosed. So they developed a medical grade smartphone
app that used the camera of the smartphone attached to an otoscope, and it has voice
commands that can start and stop the video. So they took videos from well and sick visits
in the University of Pittsburgh clinics. And by the way, if you wanna see
how the app and the otoscope work, there's a link attached to the
paper where you can see that. So anyway, the videos were reviewed and
annotated by some validated otoscopists who assigned the diagnosis, and
that was the reference standard. And then they trained it with
two different types of models. One was a deep residual recurrent
neural network model to predict features of the tympanic membrane and
the diagnosis of otitis media or not. And then they compared that with the accuracy of
a different method using a decision tree approach.
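For listeners who want a concrete picture of what a "deep residual recurrent" model is, here is a minimal sketch in Python with PyTorch: a residual CNN encodes each video frame, and a recurrent layer aggregates the frame sequence into a classification. This is a generic illustration, not the authors' architecture; the layer choices, dimensions, and names are all assumptions.

```python
# A minimal sketch (NOT the authors' code) of a deep residual recurrent
# video classifier: a residual CNN encodes each frame, a GRU aggregates
# the frame sequence, and a linear head outputs AOM vs. no-AOM logits.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ResidualRecurrentClassifier(nn.Module):
    def __init__(self, hidden_size: int = 128, num_classes: int = 2):
        super().__init__()
        backbone = resnet18(weights=None)   # residual frame encoder
        backbone.fc = nn.Identity()         # keep the 512-dim features
        self.encoder = backbone
        self.rnn = nn.GRU(512, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        frames = video.view(b * t, c, h, w)
        feats = self.encoder(frames).view(b, t, -1)  # per-frame features
        _, last_hidden = self.rnn(feats)             # summarize the sequence
        return self.head(last_hidden[-1])            # logits per class

# e.g., a batch of 4 clips, 16 frames each, 224x224 RGB:
logits = ResidualRecurrentClassifier()(torch.randn(4, 16, 3, 224, 224))
```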
And then they also trained a noise quality filter so it could prompt the users when the video might not be adequate for diagnosis, for example,
if it had too much wax, something like that. So they had over a thousand
videos for more than 600 children. Mostly they were under three years old. And then they found that the deep residual recurrent neural network algorithm had very similar diagnostic accuracy compared to the decision tree. They both had sensitivity around 94% and specificity around 93% to 94%, which is better than primary care physicians
and advanced practice clinicians. Of note, about a quarter of
the videos had to be excluded, mostly because the tympanic
membranes were occluded by wax. And then also of note, we can
talk about this in a little bit, they didn't collect patient demographics. So we don't know how diverse the population it was trained on was, or whether the findings apply across subgroups. But they concluded that these algorithms, applied through image acquisition and quality filtering, could be a good way in a primary care setting to help with diagnostic decisions for otitis media. And they thought that this could
be, especially because it has high specificity, that it could be used
at triage by trained non-physicians. So you wouldn't have to have repeat exams; you could upload the video into the chart and a provider could just review it. So that's the paper. I guess I have one lead-off question,
which is, Bimal, if you could talk a little bit about the different types of
algorithms in layperson's terminology, in what situations would one
be more favored than another? And here they had similar results, but
I'm wondering about those applications. Yeah, I think it's a complicated question,
especially because I think it's task dependent. And I may not be the expert on sort
of which deep learning algorithms are better for which use cases, but I
know that there's probably some that are favored for image classification
versus clustering versus prediction. I think this is where you would tap on your local data scientist to really inform which
model would be the best one to use. And I think what was nice about this manuscript
was that the investigators tried two different methods, right, they used this decision
tree as well as this recurrent network, and they found very comparable results. And so it lends some validity, I
think, to this notion that you can actually accurately classify otitis media
images almost regardless of the algorithm. So Bimal, you know, just I think
to follow up on that question, because please, even though you may feel like
you're getting over your skis, you know way more about this certainly than I do and Ali
does and probably most of our listeners do. So maybe take a step back, because I think most of us think of AI as a bit of a black box, and most of our familiarity with it is
ChatGPT, at least from my standpoint, right? That's different than this, right? That's a natural language model. Tell us what the difference is between
me going to ChatGPT and asking it a question versus taking a picture of an
eardrum and putting it in the ChatGPT. Yeah, that's a great question, Dimitri. So AI is a very broad category. And if you think about artificial intelligence
in general as like the umbrella, that includes any technique that allows computers to
mimic human language and intelligence and includes machine learning, which we
would think of as a category of that. Within machine learning, you have
lots of other techniques and some of which include things like deep learning. And deep learning is the set of techniques that
were applied here using these neural networks. So these try to approximate, for
example, how human neurons might work by having sort of reinforcement and
feed forward and feedback and training. And so they are kind of a special
class within machine learning and a special class within artificial intelligence. ChatGPT is actually a very specific type of deep learning model, right? It's a large language model, which has been trained to emulate human language, in our case English. And it's a very different type
of model used for very different purposes than something like this model,
which was used for image classification. Can you talk a little bit more
about sort of the use cases then? What they did here was use image classification for diagnostic purposes, but you could imagine you
could use other types of AI, like ChatGPT, to help us write our notes
or do prior auths or something like that. Absolutely. And, you know, all of us encounter artificial
intelligence in myriad ways in our own lives. So if you have an electric vehicle or if
your automobile has sort of lane assist control that prevents you from departing out
of your intended lane, that relies on image recognition and a whole bunch of sensors trying
to calculate and project your position in space. If you've ever gone to Amazon and you've purchased anything, it will tell you that people who bought this book also bought these books; these sorts of suggestion algorithms try to approximate your preferred purchases compared to those of other buyers. If you've ever had a smart doorbell or
a smart camera on your front door that detects when there's a package
there or a human visitor and knows to exclude cars that drive by,
that requires video classification. So we encounter these very specific purpose-built
algorithms all the time in our lives. Of course, the classic one is if you've used
Amazon Alexa or Siri for speech-to-text and speech recognition, those are also very specific
types of artificial intelligence algorithms. To your point, Ali, about ChatGPT and
large language models in healthcare, I think we're really just skimming
the surface in terms of what we think the applications will be in clinical practice. And there's some great opportunities
there, which would be lovely to discuss, even if they're a little bit different from
the image classification in this manuscript. So for the clinicians that are listening,
tell us what they should look forward to, and what they should be afraid of, both in terms of potential mistakes but, I suppose, also in terms of even being replaced. If I'm a primary care pediatrician and
I read this paper on acute otitis media, I can't help but wonder like, is the
future that people will self-diagnose at home and get a prescription mailed to
them and never even come to my office? I don't know. You tell us. What do you see as the bright
stars in the future and where do you see the challenges or the obstacles? Well, I mean, I'll first confess
that as a general pediatrician, performing an ear exam is like
one of my least favorite tasks. It's, you know, it's unpleasant. Happy to give that over to a computer. So if a computer wants to take that
task over, I'm willing to give it. And, you know, all of us have this sense that
the work of clinical practice these days, there's a lot of non-value-added,
time-consuming, stereotyped things. My perspective is why not let the AI
write your letter of medical necessity? Why not let it summarize large corpuses of clinical text and try to
suggest a synopsis for you? We spend a lot of time doing that kind of task, and I've heard it referred
to as like chartopsy, right? You have to look through the chart and actually
try to piece together this patient's story. And as an inpatient provider, we take care
of a lot of very complicated patients. It could take you 10, 20 minutes
just to figure out what's going on. And I love the idea that at some point, these language models might actually
be quite good at doing that. We don't know yet how good they'll
be, but if you think about it, ChatGPT has only been out for about a year and a half. It was released in November of '22. Think about how far it's come
in that very short time period, or rather how far large language models have come. The opportunities I'm looking at, and I think many
health systems are starting to probe this, we all get a large influx of patient requests or patient
messages through our patient portals these days. And I know that many of the large health system electronic health record vendors are exploring this: why not let the large language model draft a response to that initial message? Even if it saves you a minute here or
there, it can be very useful for a provider. And we need to validate if that's actually the case, that it actually does save you time to have somebody, or in this case something, else draft it for you. But that's certainly one option.
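As a sketch of how a draft-reply workflow might look, here is a minimal example assuming the OpenAI Python SDK; the model name and prompts are placeholders, not any vendor's actual implementation, and the draft is only a starting point for the clinician to review and edit.

```python
# Illustrative only: have an LLM draft a portal-message reply for review.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name and prompts below are placeholders.
from openai import OpenAI

client = OpenAI()

def draft_portal_reply(patient_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Draft a brief, empathetic reply to this patient "
                        "portal message for a pediatrician to review and "
                        "edit. Do not diagnose or prescribe."},
            {"role": "user", "content": patient_message},
        ],
    )
    return response.choices[0].message.content

draft = draft_portal_reply("My daughter has been tugging at her ear since last night.")
print(draft)  # the clinician reviews and edits before anything is sent
```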
I love the idea of clinical synopsis, if we can show that these things can actually review the patient's recent chart to say,
okay, since the last time you saw Johnny, here's the 10 things that have happened
to him in that interval, right? He was admitted to the emergency department, he was started on these two medications,
he had this diagnostic study performed. That's exactly what we do. It's exactly what our trainees
do when we work with them. So again, if the computer can do that
accurately in a fraction of the time, we might benefit from that. And what would it take for us to give
that task over for lack of a better trope? What would the confidence interval have to be? Or what would the false positive
or false negative rate have to be? Do you know what I mean? So if I said summarize this patient's
chart, a very complicated patient, let's say, coming to my hospital
that has 50 pages of medical records, on the one hand, you're right,
that's an incredible time saver. And it might even synthesize
it as well or better than I do. But on the other hand, what if
it misses a key detail that... I mean, how do I know... What's the gold standard? That's the question: what's the gold standard? Like in this study, they could have, you know, these validated otoscopists be the gold standard. But in that situation, I
don't know how you test that. That's a good point. Yeah, the gold standard for chart review,
I don't think we even know what that is. Because is it the expert who has an hour and a half to look through the chart and
to uncover every possible detail? Not only that, but who's also an
expert at summarizing it in a lucid way for the person who's reviewing it downstream. Or is the gold standard, and this is my opinion, I think the gold standard is current clinical practice, right? So like comparative effectiveness, like real-world comparative effectiveness. Have a resident summarize it, or have this do it. That's exactly right. And we know that those resident
summaries and even the attending summaries and the med student
summaries are not perfect, right? And so I think the gold standard
is the current practice. This is an interesting conversation. We actually started a new AI governance
committee within our organization. And we started to talk about, okay, well, how would we start to put in standards around
validation of AI algorithms, especially for use in clinical practice, as well as other use
cases, operational uses and things like that. And I think it's going to be really
tricky for us to actually decide or to come up with repeatable standards
for AI validation for clinical purposes, because it just feels like we don't
have the gold standard in many cases. If the gold standard is my busy intern, who's trying to go through 11 admissions in
a busy night in the winter, I guarantee you there's things that are going to get missed
and not because they're doing anything wrong, but just because of the time pressures that they face in clinical practice. Right, no, I think that's exactly right. That's a situation probably
where AI would outperform. That's right. Talk a little bit before we go about bias. I mean, we'd be remiss if we didn't talk
about, as we rely on these and we give ourselves over to them, how can we be
certain that the black box that we're using doesn't have implicit or explicit
biases built in just based on the inputs? You know, we have to be really cautious about
both the inputs and the outputs introducing bias. As an example, there are implicit biases in
our training sets for many of these data. As you pointed out, Ali, in this manuscript,
we don't actually know what the ethnic makeup is of the children who were in the study,
which could certainly be a source of bias. We know that visual classification algorithms
may be biased based on skin tone, for example. So that's a very fair point. I can't say I know what the differences in skin
tone are of the middle ear, especially when you're trying to diagnose otitis or if it's important
or not, but it's worth asking the question. The other types of bias that I would be
concerned about, especially for AI algorithms, the classic example was a health system
that used artificial intelligence to try to determine which patients would
benefit the most from care coordination. And the inputs were the existing data set of
patients who were using health services, right? So the more you tended to use health services,
this thing would preferentially say that, yes, this patient might benefit from a
care manager or a care concierge. And the problem with that, of course, is that it disfavors patients who actually
can't access the health system. They're using utilization as a
proxy for need, which is not true. The patients who need it the most may not
actually be able to access the services. Another classic example more on
the output side was an artificial intelligence algorithm that was used to
quote unquote, predict no-show rates. So trying to figure out which patients were
least likely to show up for an ambulatory visit. In this case, the health system used it to intentionally double-book patients
who had a very high rate of no-show. And so it's not so much on the model
itself, but in the application of the model. Now they're actually introducing bias. So if you do show up for your visit, there's
a good chance you'll get bumped, right? And so you're biasing the health system, the care delivery against people
who, again, need that spot the most, and who might need the most support to make
sure they can attend the visit, whether it's reaching out to them ahead of time, offering
transportation or whatever the need may be. We need to be very cautious and make sure that we
understand at every step of AI model development, validation and implementation, are there possible
biases that could creep into this along the way? How will AI algorithms be
better than all the years of prediction rules that people have developed? People have been using information that you have
in the medical record, like labs or what have you. These AI models can do this more objectively, like with images, more than you could with some subjective assessment of a TM in a prediction rule. But we have prediction rules
and they don't get applied. This study was great in
that it was part of a visit. It's attached to your otoscope. It's giving you this in real time,
but we could develop AI algorithms, and if we don't have a way to put them in at the point of care, they're not going to be helpful. Yeah, I think that's a very fair point. A lot of these algorithms
work great sort of in silico, and you can show that they have very high
accuracy and high performance, but things might fall apart in actual clinical application
or clinical practice for a number of reasons. One is that the recommendation is
not actionable or it's too diffuse. We're not even sure what to do with the
recommendation or what the action step should be. I always use the example when I teach about sort of test characteristics and things
like that of the Apple Watch, right? So many people have Apple Watches that have this
atrial fibrillation algorithm, an AI algorithm. This algorithm is actually
really, really accurate. It has almost 100% specificity
and 98% sensitivity. Really good algorithm. And in the FDA clearance application,
Apple actually showed that, in the study population, this had a 99.6% positive predictive value. But the study population had a 50%
prevalence of atrial fibrillation, which is completely unnatural. There's no naturally occurring population that
has a 50% prevalence of atrial fibrillation, except maybe like the waiting room of your electrophysiologist's
office or something like that. And so in actual practice, and you could actually
model this out, like how would this thing perform? Well, first of all, it would be
generally useless in children because the prevalence of atrial fibrillation
is vanishingly small in that cohort. Even in people of my age, I'm roughly 50 years
old, almost 50, the prevalence is only 0.1%. And so if you do the math, even with a 99% specificity and 98% sensitivity, this thing only has like an 18% positive predictive value in people of my age.
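To make that arithmetic concrete, here is a minimal sketch of the underlying Bayes calculation. The 98% sensitivity and the prevalences are the figures quoted above; taking the "almost 100%" specificity to be 99.6% is an assumption, but it reproduces both numbers from the conversation.

```python
# Positive predictive value from sensitivity, specificity, and prevalence.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Study population: 50% prevalence of atrial fibrillation.
print(f"{ppv(0.98, 0.996, 0.50):.1%}")   # ~99.6%, the figure in the FDA filing
# General population around age 50: roughly 0.1% prevalence.
print(f"{ppv(0.98, 0.996, 0.001):.1%}")  # ~20%, near the ~18% quoted above
```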
So what I would do, and what I try to socialize within our organization, is to treat AI algorithms no differently than any other diagnostic test, right? Just pretend it's like a lab test and
apply to it the exact same rigor that you would if this was a genetic assay or some
biochemical assay or something else, right? Because all the same things still apply. You still have to characterize
its positive predictive value, its utility in clinical practice. You still have to figure out how you want to use this as part of a guideline or a
pathway or a clinical protocol. And my worry is that when people see the word AI, they think of it as a way to shortcut all those steps that biostatisticians and epidemiologists have worked for so long to establish as the right way to ensure rigor. And I think that's the risk and also the
opportunity for us to remind the world that AI is really no different than other types of
algorithms, except for the fact that it might be quicker, it might be easier to implement
at the point of care, and things like that. That's great, Bimal. I think that's a great way of thinking about it. We have a lot more to discuss with
you, but we don't have a lot more time. So, Bimal, thank you so much for your
time and sharing your expertise with us. Yeah, thanks. It's been great. You're very welcome. Thanks for the invitation. This episode was produced by Shelley
Steffens at the JAMA Network. To follow this and other JAMA Network podcasts,
please visit us online at jamanetworkaudio.com. Thanks for listening.