Transcript for:
Honeycomb: Enhancing Observability with AWS

(bright music) - Hello, everyone. I hope you're having a wonderful AWS Summit. And tonight, when you go home, I hope you get a nice, peaceful night of sleep, having learned a whole bunch. But what if there's a problem? What if your pager goes off 20 minutes before you want to be in bed? How are you going to feel? Anxious? Nervous? Or are you going to feel confident that you've got this, that you'll solve it within half an hour and get back to sleep?

This is what we're about at Honeycomb. We're about helping software engineers, like yourselves, debug their systems with the power of big data. This is a capability we pioneered and call observability. Observability is a sociotechnical capability: it enables us to answer any question about our systems, even ones we hadn't thought to ask in advance. It requires us to instrument our systems, to collect and store that telemetry economically, and then to query it as quickly as possible when we need it the most. And what do I mean by "as quickly as possible"? I mean you should be able to answer any question in less than 10 seconds, so you don't get distracted going for a cup of coffee, and so you can keep iterating on your train of thought.

We think observability requires having as much context as possible, with all of that context connected together, threaded into user requests that are traced end to end so you can follow the breadcrumbs. And we believe you need to be able to analyze all kinds of customer experiences, good and bad, so you can figure out what differentiates them and resolve the issue as quickly as possible. Our product is differentiated by its scale and speed. Our customers get a lot of value out of us when they can get results instantly. And analysts agree that we are a leader in the field of APM and observability.

So, how does this actually work under the hood? It starts with high-quality telemetry data, which we collect in partnership with AWS. Using services like Amazon Relational Database Service, or CloudWatch, or any number of other services we integrate with in the AWS ecosystem, we're able to get data at the infrastructure level. But it's also crucially important to have telemetry, collected for instance with the AWS Distro for OpenTelemetry, that comes from your applications and your real user requests (there's a minimal instrumentation sketch at the end of this overview). Once we have that data ingested, we're able to analyze and store it: we pre-process it and upload it to Amazon Simple Storage Service (S3), and then we analyze it on the fly with AWS Lambda (also sketched below).

So, over the past few years, Honeycomb has grown really explosively. How explosively? Well, three years ago we were ingesting about 200,000 trace spans per second. Today we are peaking at 2.5 million trace spans per second. And our customers are asking 10 times as many questions about 10 times as much data. You can see where this is going, right? So, how did we scale our services economically? And, oh, did I mention? We only have 50 engineers. The answer is that we built on the right AWS technologies.

This is what our architecture looks like. We have a combination of stateful and stateless services, mostly written in Go, though you'll also find some Java and some Node.js running in our stack. And all of these services were able to be migrated onto AWS Graviton.
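To make the application-side telemetry concrete, here is a minimal sketch of instrumenting a Go service with the OpenTelemetry SDK and an OTLP exporter, the same protocol the AWS Distro for OpenTelemetry collector speaks. The service name, span name, and attribute are hypothetical, and it assumes a collector listening on the default local endpoint:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP/gRPC to a local collector (e.g. an ADOT
	// collector), which forwards them to a backend such as Honeycomb.
	exporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Wrap a unit of work in a span and attach high-cardinality context
	// so the trace can be sliced by any field later.
	tracer := otel.Tracer("checkout") // hypothetical service name
	ctx, span := tracer.Start(ctx, "handle-request")
	span.SetAttributes(attribute.String("user.id", "abc123")) // hypothetical field
	// ... do the actual request handling here, passing ctx along ...
	span.End()
}
```

The important part is that the context travels with each request, so spans emitted by different services can be threaded together into a single trace.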
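And on the storage side, here is a hypothetical sketch of the pre-process-and-upload step described above, using the AWS SDK for Go v2. The bucket name and key layout are invented for illustration:

```go
package main

import (
	"bytes"
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// A pre-processed segment of telemetry, ready to be tiered to S3.
	segment := []byte("...pre-processed segment bytes...")

	// Hypothetical layout: one object per segment file, so query
	// workers can later fetch and scan segments independently.
	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String("example-telemetry-segments"),
		Key:    aws.String("dataset-42/segment-0001"),
		Body:   bytes.NewReader(segment),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```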
So, let's start with the stateless side. We use Amazon Elastic Kubernetes Service (EKS) to manage our stateless workloads, and the nodes powering it are EC2 C6g and C7g instances, running Graviton2 and Graviton3.

Take our ingest service, for instance. When we trialed moving from fifth-generation EC2 instances to Graviton2, we saw a 10% improvement in median latency, and far fewer tail-latency spikes, because the Graviton2 processor is just much more efficient and we're able to push much more load. But it gets better. When we ran an A/B test of Graviton2 versus Graviton3, we found a further 10 to 20% improvement in tail latency, and a 30% improvement in throughput and median latency for our ingest services. And not only that: it turns out this workload bin-packs much better on EKS with Graviton3, and CPU utilization is about 30% lower, which means we can push it a lot harder.

It also turns out you can compound the savings by using EC2 Spot Instances, because the node termination handler on EKS lets workloads fail over gracefully as EC2 Spot reclaims instances (there's a sketch of that drain behavior below). So we save money by being flexible. How does that work? Well, when we went from fifth-generation to sixth-generation instances, we saved about 20%. And then we saved a further 20% when we adopted sixth-generation instances powered by Graviton2 with Spot. Because the savings compound multiplicatively, two successive 20% reductions leave you paying 0.8 × 0.8 = 64% of the original bill.

Our columnar data storage is powered by EC2 M6gd instances. When we switched from the I3 instance family to the M6gd family, tail latency went down by two-thirds. We also tier data over to S3, and I'll talk about that in a moment.

Speaking of other data storage mechanisms, we use EC2 IM4gn instances to power our Kafka streaming data ingest. These instances were newly announced at re:Invent this past year, and they're powered by Nitro SSDs. We had a problem before: no instance had the right shape to scale with our workload. We stream data and need to get it right the first time, but we don't necessarily access the older data until there's an emergency. So we were disk-bound running 30 I3en instances. We started tiering the data to S3, which helped, but then we saturated on CPU. Right-sizing everything onto IM4gn lets us hit our network, CPU, and storage thresholds appropriately.

Okay, so what about query retrieval? How do we make it fast? Well, one answer is to just use more M6gd, right? That only works up to a point, because if you have millions of files stored in S3, you're going to have a hard time querying them all from even just a hundred servers. So we use AWS Lambda to query, on demand, millions of files from S3 with tens of thousands of parallel workers (the fan-out pattern is sketched below), which allows us to give you results for any question you might ask about your systems in less than 10 seconds. And with AWS Lambda and Graviton combined, we see about a 40% improvement in price-performance, which really enables us to economically give our users the comfort of knowing they're going to get a great night's sleep.
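Here is a rough sketch of that fan-out pattern: invoking a query Lambda once per S3 segment file from Go, bounded by a semaphore. The function name, payload shape, and concurrency cap are all hypothetical, and the real system also merges the partial results the workers return:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"sync"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/lambda"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := lambda.NewFromConfig(cfg)

	// Hypothetical: one S3 key per segment file this query must scan.
	keys := []string{"dataset-42/segment-0001", "dataset-42/segment-0002"}

	var wg sync.WaitGroup
	sem := make(chan struct{}, 1000) // cap concurrent invocations (arbitrary)
	for _, key := range keys {
		wg.Add(1)
		go func(key string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()

			// Each worker scans one file from S3 and returns partial
			// results, which the caller merges into the final answer.
			payload := []byte(fmt.Sprintf(`{"s3_key":%q}`, key))
			_, err := client.Invoke(ctx, &lambda.InvokeInput{
				FunctionName: aws.String("query-worker"), // hypothetical name
				Payload:      payload,
			})
			if err != nil {
				log.Printf("invoke %s: %v", key, err)
			}
		}(key)
	}
	wg.Wait()
}
```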
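And here is the Spot drain behavior mentioned above. When EC2 Spot reclaims a node, the node termination handler cordons and drains it, and each pod receives a SIGTERM. A minimal sketch of a stateless Go service shutting down cleanly in response (the port and timeout are arbitrary):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	// SIGTERM arrives when the pod is evicted, e.g. during a Spot
	// reclaim drained by the node termination handler.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	<-ctx.Done() // wait for the termination signal

	// Stop accepting new requests and let in-flight ones finish, so
	// the workload fails over without dropping traffic.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```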
So, to sum up: AWS enables us to go fast. With AWS, we're able to build quickly and help our customers move quickly too. We've saved over 60% versus fifth-generation instances. And, as you heard earlier, we've also seen a 60% reduction in emissions from using Graviton2 and Graviton3. So we've scaled 10X over the past three years without blowing our SLOs, and without blowing our cost budgets.

If you're interested in learning more, you can check out the chapter of Observability Engineering on our data store. And you can get all of the meaty technical blogs and details at the link that's on screen right now. So now I'll go ahead and turn things back over to Martin. Thank you very much.