Kafka Metrics Monitoring Overview

Okay, so now we're getting into a very important topic: which metrics you need to monitor. Because there are so many Kafka metrics, you don't really know where to start, so here is my answer, and the whole internet agrees with this: start with a few metrics that are extremely important to have.

Number one is the number of active controllers, and that should always be one. In Kafka, one of the brokers is the active controller and all the other ones are not controllers. The controller handles a lot of cluster-wide work, including partition leader election and keeping track of which brokers are alive, so you only need one to manage everything; otherwise you have huge issues. Each broker reports 1 if it is the controller and 0 otherwise, so across the whole cluster the total should be exactly one. Zero means that your Kafka cluster is not working, and two means that there's a huge bug. So that number should always be one, and it's something super important to have in your dashboard.

Number two is the number of under-replicated partitions, or URP, and that should always be zero. An under-replicated partition is one where some follower replicas have fallen out of sync with the leader, and you basically don't want your replicas lagging behind the rest of your cluster. For example, if a producer is writing at a very, very high rate and replication is not fast enough, you're going to start seeing under-replicated partitions. That may mean your brokers are overloaded or misconfigured, or that your network is struggling; either way, it means something is wrong. So the number of under-replicated partitions, or URPs, should always be zero.

Then you have the number of offline partitions, and this is really bad. If a partition is offline, it has no active leader (for example, because all of its replicas are down), so it can't be read from or written to. That's not good, because it means your topic is partially down. So that number should always be zero as well.

Okay, something very, very important: on top of monitoring these metrics, it's really good to set alerts on them, so you find out as soon as these happen. You get an email, and you can fix the problem right away.
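To make the three rules concrete, here is a minimal Python sketch of how an alerting check might evaluate them. It assumes you have already collected, for each broker, the latest values of the three gauges from the Kafka monitoring docs (ActiveControllerCount, UnderReplicatedPartitions, OfflinePartitionsCount); the function name evaluate_alerts and the input shape are hypothetical. Because ActiveControllerCount is 1 only on the broker that is currently the controller, the check sums it across brokers instead of reading a single broker. Sending the actual email is left to whatever alerting tool you use.

```python
from typing import Dict, List

# Hypothetical input shape: broker id -> {short metric name -> latest gauge value}.
# Short names stand for these MBeans from the Kafka monitoring docs:
#   kafka.controller:type=KafkaController,name=ActiveControllerCount
#   kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
#   kafka.controller:type=KafkaController,name=OfflinePartitionsCount
BrokerMetrics = Dict[int, Dict[str, float]]


def evaluate_alerts(metrics: BrokerMetrics) -> List[str]:
    """Return one human-readable alert per violated rule (empty list means healthy)."""
    alerts: List[str] = []

    def total(name: str) -> float:
        # Missing readings count as 0 here; a stricter check could alert on them instead.
        return sum(m.get(name, 0) for m in metrics.values())

    controllers = total("ActiveControllerCount")
    if controllers != 1:
        # 0: no controller at all; 2 or more: several brokers think they are the controller.
        alerts.append(f"ActiveControllerCount is {controllers:g}, expected exactly 1")

    urp = total("UnderReplicatedPartitions")
    if urp > 0:
        alerts.append(f"{urp:g} under-replicated partition(s), expected 0")

    offline = total("OfflinePartitionsCount")
    if offline > 0:
        alerts.append(f"{offline:g} offline partition(s), expected 0")

    return alerts


if __name__ == "__main__":
    healthy = {
        1: {"ActiveControllerCount": 1, "UnderReplicatedPartitions": 0, "OfflinePartitionsCount": 0},
        2: {"ActiveControllerCount": 0, "UnderReplicatedPartitions": 0, "OfflinePartitionsCount": 0},
    }
    degraded = {
        1: {"ActiveControllerCount": 0, "UnderReplicatedPartitions": 3, "OfflinePartitionsCount": 0},
        2: {"ActiveControllerCount": 0, "UnderReplicatedPartitions": 1, "OfflinePartitionsCount": 0},
    }
    print(evaluate_alerts(healthy))   # []
    print(evaluate_alerts(degraded))  # two alerts: no controller, 4 under-replicated partitions
```

Summing per-broker readings keeps the rules cluster-wide, which is what "should always be one" and "should always be zero" refer to.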
Now, there are tons of metrics you can find online: there's the Kafka documentation, which we'll see in a second, and also the Confluent documentation on Kafka, which provides good advice. Finally, a word of advice: it's always better to have more metrics monitored than fewer, because you want to be able to troubleshoot issues easily when they happen. That's very, very important.

So let's go ahead and look at the documentation now. As we can see here in our Grafana dashboard, we don't have any of the three metrics I mentioned on the slide. We'll add them in the next lecture: during the hands-on, we'll see how to create these metrics in this dashboard, so you can get started with Grafana and have the dashboard just the way you want. But for now, let's go to the Kafka documentation. In the Kafka documentation, we're going to find section 6.6, called Monitoring at the time of recording. So we can scroll down, wait for it to load, and right here is 6.6, Monitoring.

It explains how Kafka exposes its metrics: Kafka uses Yammer Metrics for its server-side metrics and exposes them through JMX, which we know already. Here is the list of all the metrics that Kafka provides, and as you can see, there's a ton of them. You can really look at them; it's very exhaustive documentation, and these are metrics you definitely want to monitor. Just have a look: do you want to see error rates? Request rates? Bytes out rate, bytes in rate, log flush rate, under-replicated partitions (which we just talked about), partitions under min ISR? For many of them there's a recommended normal value, the thing to look out for. As we said, under-replicated partitions should be zero, partitions under min ISR should be zero, offline partitions should be zero, and only one broker should be the active controller, etc. These are really, really good values to monitor.
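Many of the metrics in that list, such as bytes in, bytes out, request rates, and error rates, are Yammer meters: over JMX they expose an ever-increasing Count attribute alongside pre-computed moving-average rates. Here is a simplified Python sketch of how a per-second rate can be derived from two samples of such a counter, which is roughly what a dashboard does for you. The Sample class, the per_second_rate function, and the sample values are hypothetical. Treating a drop in the counter as a restart (for example, after a broker restart resets it) is an assumption in the spirit of Prometheus's rate(), which additionally extrapolates across the query window.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a meter's cumulative Count attribute (hypothetical helper)."""
    timestamp_s: float  # when the value was read, in seconds
    count: float        # cumulative number of events (e.g. bytes) since the broker started


def per_second_rate(prev: Sample, curr: Sample) -> float:
    """Average events per second between two readings of a cumulative counter."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr.count - prev.count
    if delta < 0:
        # The counter went backwards, most likely because the broker restarted and
        # the count started again from zero: only count what happened since then.
        delta = curr.count
    return delta / elapsed


if __name__ == "__main__":
    # Hypothetical BytesInPerSec Count readings taken 60 seconds apart.
    before = Sample(timestamp_s=1_000.0, count=5_000_000)
    after = Sample(timestamp_s=1_060.0, count=8_000_000)
    print(per_second_rate(before, after))  # 50000.0 (bytes per second)
```

Working from the raw Count rather than a pre-averaged rate lets you choose the time window yourself, which is usually what you want on a dashboard.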
So, for example, let's take one metric: under-replicated partitions. If you look at this metric, it's called UnderReplicatedPartitions. If you go to Prometheus and type "under replicated partitions", as you can see, we now have kafka_cluster_partition_underreplicated and kafka_server_replicamanager_underreplicatedpartitions. Either of these metrics is going to help, and you can always execute a query and look at the graph to see how it works and what its value is. The first one is by partition: it tells you, for each partition, whether it is under-replicated. Then, when we execute the other one, under-replicated partitions, we get per server how many under-replicated partitions there are. So these are two different kinds of metrics, one at the server level and one at the partition level, but both are really, really helpful. So that's the idea here.

It's good to have this whole list, but you can also search for "Kafka monitoring Confluent documentation". When you do this, you get the page we saw before on the slide, on docs.confluent.io, and this page gives you more advice on how you should monitor Kafka. It's a better organized page, in my opinion: there are server metrics (broker and ZooKeeper), producer metrics, consumer metrics, and old consumer metrics. If we scroll down and go to broker metrics, it basically advertises Confluent Control Center, which you can get but which requires a paid license. Otherwise, it tells you which metrics to watch: again, under-replicated partitions, definitely something you want to have, offline partitions count, active controller count, bytes in per second, bytes out per second, number of requests per second, number of produce requests, etc. You can scroll down, and it gives you a lot of good insight into what each metric does. Then you get the ZooKeeper metrics if you want to explore those, and so on. I'll let you explore the whole documentation.

Okay, it's quite nice to have these two sources of documentation.
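Going back to the two under-replicated metrics seen in Prometheus: since each under-replicated partition is counted by the broker that currently leads it, summing the partition-level gauge across the cluster should, in principle, give the same total as summing the server-level count across brokers (modulo scrape timing). Below is a minimal Python sketch that runs both sums through the Prometheus HTTP API (/api/v1/query) using only the standard library. The metric names kafka_cluster_partition_underreplicated and kafka_server_replicamanager_underreplicatedpartitions are assumptions, since they depend on the rules in your JMX exporter configuration, and http://localhost:9090 is a hypothetical Prometheus address.

```python
import json
import urllib.parse
import urllib.request

# Assumed metric names: they depend on the rules in your JMX exporter config,
# so check what your own Prometheus actually shows before relying on them.
PARTITION_LEVEL = "sum(kafka_cluster_partition_underreplicated)"
SERVER_LEVEL = "sum(kafka_server_replicamanager_underreplicatedpartitions)"


def instant_query(base_url: str, promql: str) -> dict:
    """Run an instant query via the Prometheus HTTP API and return the decoded JSON body."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def vector_total(response: dict) -> float:
    """Add up the sample values of an instant-vector response (0 if it is empty)."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error')}")
    # Each result looks like {"metric": {labels...}, "value": [unix_time, "value as a string"]}.
    return sum(float(r["value"][1]) for r in response["data"]["result"])


if __name__ == "__main__":
    base = "http://localhost:9090"  # hypothetical Prometheus address
    try:
        by_partition = vector_total(instant_query(base, PARTITION_LEVEL))
        by_server = vector_total(instant_query(base, SERVER_LEVEL))
        print(f"URPs from partition-level gauge: {by_partition:g}")
        print(f"URPs from server-level gauge:    {by_server:g}")
    except (OSError, ValueError, RuntimeError) as exc:
        # URLError is a subclass of OSError; a non-JSON body raises a ValueError.
        print(f"could not query Prometheus at {base}: {exc}")
```

The same two expressions can also be pasted directly into the Prometheus expression browser or into a Grafana panel; if the totals disagree for more than a scrape interval or two, it is worth checking how your exporter rules map the MBeans.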
In the next lecture, we'll go ahead and play a bit with Grafana, just so I can show you how to add these metrics to Grafana, and then you're on your own to add as many metrics as you want. All right, I'll see you in the next lecture.
