Transcript for:
Introduction to Cassandra

Hi guys, welcome back to the channel. Hope you all are doing well. Today I'll go over the basics of one of the most popular NoSQL databases, called Cassandra. Cassandra is an always-available, distributed NoSQL database. At first I'll go over the features of Cassandra. From there we're going to dive deeper and see how the read and write paths work in Cassandra. Finally, we're going to go over how to design data models in Cassandra effectively, and then we will wrap up with some good use cases of Cassandra. So with that being said, let's get started. [Music]

At first, let's take a look at the features that Cassandra gives you. The first one is that it's a distributed database, so your data is replicated across different machines. [Music] Cassandra is always available by default, so whenever you're querying Cassandra it should always work, even if one or two of the machines go down. By default Cassandra is eventually consistent, but that is tunable, so if you want strong consistency you can tune your Cassandra cluster to be strongly consistent. Writes in Cassandra are very, very fast, so if you have a very write-intensive application, Cassandra tends to be one of the better choices more often than not. Cassandra does not have a leader architecture like MySQL or some other databases, so even though the database is distributed there is no one leader. When a write comes in, any of the nodes can take the write and then replicate it amongst the replica nodes. Unlike MySQL or Postgres, where there is one master node that takes in the writes, Cassandra does not operate like that. This makes Cassandra more resilient: if one machine goes down, which in the case of MySQL or Postgres can be the master node, you can still read and write to Cassandra effectively. So it's more resilient compared to a relational database on a single host machine.

All right, so let's dive deeper into some of the features that I mentioned. The first one is: what does it mean by Cassandra being a distributed machine,
sorry, a distributed database? Over here I have four nodes, so let's think of Cassandra right now as being a four-node cluster: you have four individual computers where you're hosting your Cassandra cluster. By distributed, it means the same data is going to be replicated across different machines. If you look at the two blocks over here, the yellow one is data and the green one is data2. These are two distinct sets of data, but you can see both of them are replicated across different machines: you have the yellow block in node 1, node 2, and node 4, and you have the green block in node 1, node 2, and node 3. That means the same data resides on different machines, so if one of them goes down you still have your data on another machine to read from.

So what does it mean by Cassandra being always available? It's very similar to the previous point. I have the same diagram over here, where we have a Cassandra cluster of four nodes, and you can see that the yellow data is there in nodes 1, 2, and 4. Let's imagine a world where node 1 and node 4 go down for some reason. These are just individual computers running in data centers, so it's very common for these machines to go down. Even if node 1 and node 4 were to go down and you want to read the yellow data from Cassandra, you can still read it successfully because your data is there in node 2.
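The four-node diagram described above can be modeled in a few lines of Python. This is only a toy sketch, not how Cassandra places replicas internally: the `REPLICAS` mapping copies the node numbers from the diagram, and the function names and the `replicas_required` parameter are made up for illustration (a value of 1 loosely mirrors reading at the lowest consistency setting).

```python
# Toy model of the four-node diagram: which nodes hold a copy of each data block.
# This is an illustration only; real Cassandra decides replica placement itself.
REPLICAS = {
    "data": {1, 2, 4},   # the yellow block
    "data2": {1, 2, 3},  # the green block
}


def readable_from(block, down_nodes):
    """Return the sorted list of live nodes that still hold `block`."""
    return sorted(REPLICAS[block] - set(down_nodes))


def can_read(block, down_nodes, replicas_required=1):
    """A read succeeds if enough replicas of `block` are still up.

    `replicas_required` is a hypothetical knob standing in for the tunable
    consistency mentioned in the video; 1 means any single replica is enough.
    """
    return len(readable_from(block, down_nodes)) >= replicas_required


if __name__ == "__main__":
    # Node 1 and node 4 are down: the yellow block is still on node 2.
    print(readable_from("data", down_nodes=[1, 4]))
    print(can_read("data", down_nodes=[1, 4]))
    # Asking for two live replicas would fail in the same situation.
    print(can_read("data", down_nodes=[1, 4], replicas_required=2))
```

With node 1 and node 4 down, the yellow block is still readable from node 2, which is the situation walked through above. Requiring more live replicas trades some of that availability for stronger consistency.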
So by being always available, it means that Cassandra can always respond to your query as long as at least one of the replica nodes is working (and the query's consistency level only needs one replica to answer). Whereas MySQL or some other non-distributed database tends not to be 100% available, because you have the master node, and when it goes down it becomes very difficult to write to or reach the database. Maybe reading would work, but writing would be very difficult in this case. Given Cassandra is always available, the same thing happens when you're trying to write to it: if three of the nodes go down and only one of them is working, your query can still work, because Cassandra is a peer-to-peer database, so any node can take in a write. It doesn't have to be a master node or something. [Music]

That brings us to the next thing: what does it mean by not having a leader? It means Cassandra employs a peer-to-peer architecture. To put it in comparison to a more relational, non-distributed database like MySQL or Postgres: in MySQL or Postgres you have a master-slave architecture, so a master or leader database is the one that is going to take in all the writes, and then the data is going to be replicated amongst all the different replicas. What that means is, if the master or leader node were to go down, you wouldn't be able to successfully write to the database until a new master is elected by some process in the database. There is of course going to be a delay between the old master going down and the new one coming up, so you will have a phase where your writes won't be successful. However, in a distributed database like Cassandra, which is leaderless, any node can take in a write. So if we have two nodes set up over here, node 1 and node 2, and node 1 were to go down, your writes would still work successfully because you have node 2 to receive the data and write it. So peer-to-peer is much more resilient than a leader-follower kind of architecture.

All right, so how does Cassandra store data? Let's take
a look at that now. Cassandra stores data in something called partitions. Imagine a partition like a file within a file system: related data is going to be stored in a single partition. So the most atomic unit in a Cassandra database is going to be a partition. [Music] Over here I have a four-node Cassandra setup, and you can see that each node has three partitions. This is to show you that one node, or one machine, can have multiple partitions where your data is going to live. [Music]

Related data should be in the same partition. When you design a Cassandra database, you want to make sure that related data ends up in the same partition. This is just to make your queries quicker. We're going to take a look at how you can design your schema accordingly so that all related data ends up in the same partition; that's possible because you get to define what your partition is going to be and what kind of data is going to be grouped together in the same partition.

Smaller partitions are best for query performance. Ideally you want to keep your partitions small so that when you're doing a read query in Cassandra, the operation is done very, very quickly. In an ideal world, you want one query to read from one partition only. If you design your schema in that way, you get the maximum read performance out of Cassandra. The partition is the most atomic unit in Cassandra, so when you design your data model you want to make sure that, if possible, each read query touches only one partition. When read queries get expensive and Cassandra has to scan multiple partitions, that's when the operation can get slower and slower.

Ideally you want to keep your partitions bounded. That ties in with the previous point, where we want to keep the partitions small. If at design time you create partitions that are unbounded, then to begin with your partitions might be small and your query is going to be very quick, but over time, given the partition is not bounded,
it's going to keep on growing indefinitely, and you will reach a point where the partitions are so large that the same read query that was very, very fast is going to take significantly more time.

So now that we know the importance of partitions, let's see how you define your own partition. This is an example of a CREATE TABLE query in Cassandra. If you take a look at the last line, where I have the primary key, that is how you define the primary key of your schema, or of your table. The way Cassandra works is that within your primary key the order of the fields matters: the first field that you have in the primary key is going to end up being your partition key. In this case we have a table with all the players playing for the same club. If you look at the primary key, the first field is club, which means your schema will use the field club as the partition key. That means every player from the same club will end up in the same partition. So if you have 12 clubs overall, and let's say each club has 10 players, then theoretically you're going to have 12 partitions and each partition is going to have 10 rows of data. That's how, when designing this schema, you can design it in such a way that the partition is always small and bounded.

You can also have a composite partition key, so your partition key does not have to be only one field. For instance, in this case we have the primary key statement just like we had over here, but instead of having only the outer opening and closing parentheses, you can see we have name and club within their own parentheses. That's how you define a composite partition key. If you want three fields, let's say name, club, and league, then you would start your parenthesis at name and end it at league. So basically, to define a composite partition key, you take your primary key and put that part within parentheses, starting from the first field. In this case name and club are our composite partition
key. That means multiple players with the same name and the same club will end up in the same partition. This is just an example, but you can see how you can design your composite partition keys in such a way that they make more logical sense. [Music]

This should give you an idea of how important schema design is in Cassandra. The schema for Cassandra needs to be designed well ahead of time, with the query patterns in mind. It's a very bad idea to design your Cassandra schema and then add features, keep on adding things to your app, and evolve the queries, because that way Cassandra is going to start acting much slower as you throw different queries at it. The best way to use Cassandra is to have a defined set of queries that you know you're going to be running, and then design your data model based on those queries. As you saw, which primary key and partition key you have in your schema (and there are more keys, like the clustering key) defines how fast or slow your query is going to be. When it comes to write queries it doesn't make a ton of difference, but when you're talking about read queries your schema becomes very, very important, so you want to keep that in mind.

Now let's take a look at why writes in Cassandra are so fast. Cassandra is known to have one of the fastest write performances. If the primary goal of your application is writing, so if it's a very write-intensive application, Cassandra tends to be a very, very good option. So let's see why exactly a write operation in Cassandra is so fast. Over here I have more of a back-end view of the Cassandra architecture. I tried to keep it as minimal as possible, so let me just run you through what I have. I split memory and disk with that dotted line: underneath that line, everything you see resides on disk, and above that line, everything you see resides in memory. As you know, accessing or writing anything to memory is fast; accessing or writing anything to disk
is slow, so you want to interact with memory only, as much as possible, for the best performance. [Music]

In Cassandra the first thing you have is a commit log, which is just an append-only log for every new piece of data coming in. Then you have a memtable, which is a data structure that resides in memory, and then you have SSTables, which are different data structures that reside on disk. I won't go into each of the data structures in depth because they tend to get more complicated, so I'll keep things more surface level. The fundamental reason that writes are very, very fast in Cassandra is that when a new piece of data comes in, the first step Cassandra takes is to write that data to the commit log, which is just a sequential append to the end of a log, so very quick, and then it writes to the memtable, which is in memory, which is also very quick. So writing to both the commit log and the memtable is very fast: one is an append to a log and the other is in memory. The moment Cassandra does these two, it acknowledges the write and calls it a successful write. That's why it's so fast: for a write to be successful you need almost no disk interaction beyond that append, which means each write tends to be very, very fast. It's essentially like writing to a cache, because you don't really touch the on-disk data files. In the back end, what Cassandra does (and I'll have another video that goes into more detail) is periodically flush the memtable from memory to an SSTable to keep the data over there more permanently. But given the distributed nature, Cassandra can make do with writing to a memtable when a write comes in instead of writing it to the data files on disk every single time. That's why, compared to most databases where data is stored to disk immediately, Cassandra tends to be significantly faster when you're writing to it.

All right, so we're going to wrap it up with a few good use cases of Cassandra. The first one, as we talked about, is very high write throughput applications. We just went over why writes in Cassandra are very fast, so if you have a
very write-intensive application, Cassandra tends to be a good fit.

Internet of Things applications tend to be a good use case for Cassandra. That's because different temperature sensors or pressure sensors are designed to emit data almost every couple of seconds, or maybe every second, so these kinds of applications tend to be very write-heavy just by the nature of the sensors. Cassandra ends up being a good choice because of the high write throughput.

Web activity data, for a similar reason: if you're tracking user interactions on a very busy website, the volume of data tends to be very high because multiple users are interacting multiple times within a minute when using the website. So Cassandra is used in a lot of those website activity trackers. [Music]

Predictable read pattern applications: as I mentioned, Cassandra can be insanely performant if you know your queries ahead of time. If you know you have, let's say, five read queries that you're going to be doing every time, you can design your Cassandra schema to be perfect for these five queries. That way you get the maximum benefit from Cassandra, and if you can design your schema like that, reading becomes insanely fast, just like writing. So you will end up having a database with insanely quick write performance and very good read performance too.

Lastly, health data applications tend to be a good use case, for the same reason as Internet of Things applications: health data tends to arrive very, very often. Let's say you have an Apple Watch that measures your heart rate; the watch measures your heart rate almost every minute, or a few times every second at times, and this data has to be written to a database very, very quickly. Once again Cassandra is a good fit because it's a very high write throughput application.

So yeah, that's all I have for today. Hopefully this gave you a better idea of how Cassandra works. Let me know if you have any questions. In the future I'll
have a different video diving deeper into how Cassandra works and how the different data structures work. If you want that quickly, just let me know and I'll try to get that prioritized. With that being said, hopefully you learned something, and I'll catch y'all in the next one. Bye.
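The CREATE TABLE statements discussed in the partitioning part of the video appear to have been shown on screen and are not captured in this transcript. As a rough sketch of what they might look like, the Python below holds two CQL statements as plain strings and then groups some made-up rows by partition key to reproduce the "12 clubs, 10 players each" arithmetic. The table names, the column names, the `player_id` clustering column, and the helper functions are all hypothetical, and nothing here connects to a real cluster.

```python
from collections import defaultdict

# CQL kept as plain strings for reference only. Table and column names are
# hypothetical; the on-screen schema from the video is not in the transcript.
SINGLE_KEY_TABLE = """
CREATE TABLE players_by_club (
    club text,
    name text,
    league text,
    PRIMARY KEY (club, name)  -- first field (club) is the partition key
);
"""

COMPOSITE_KEY_TABLE = """
CREATE TABLE players_by_name_and_club (
    name text,
    club text,
    player_id uuid,
    PRIMARY KEY ((name, club), player_id)  -- inner parentheses: composite partition key
);
"""


def group_by_partition(rows, partition_key):
    """Group rows the way a partition key would: same key values, same partition."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in partition_key)
        partitions[key].append(row)
    return dict(partitions)


def sample_rows(num_clubs=12, players_per_club=10):
    """Build made-up player rows: `num_clubs` clubs with `players_per_club` players each."""
    return [
        {"club": f"club-{c}", "name": f"player-{c}-{p}"}
        for c in range(num_clubs)
        for p in range(players_per_club)
    ]


if __name__ == "__main__":
    partitions = group_by_partition(sample_rows(), partition_key=("club",))
    print(len(partitions))                              # number of partitions
    print({len(rows) for rows in partitions.values()})  # distinct partition sizes
```

Partitioning by `club` gives one small, bounded partition per club. Partitioning the same rows by `("name", "club")` would instead give one partition per distinct name and club pair, which is the composite-key behavior described in the video.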