Transcript for:
Efficiency of Facebook's Tectonic File System

Tectonic File System: Efficiency from Exascale.

Disclaimer: this video is for entertainment purposes only, is not created on behalf of Facebook, and contains publicly accessible information only.

Tectonic is Facebook's exabyte-scale distributed file system. The purpose of Tectonic is to consolidate multiple large tenants that were formerly housed in different service-specific systems. The generalized file system aims to match the performance of the former specialized systems while enabling lower operational overhead and simpler services, and paving the way for better resource utilization. This talk is an overview of Facebook's fairly recently released paper, "Facebook's Tectonic Filesystem: Efficiency from Exascale." Please let me know in the comments below if you have any questions, concerns, or corrections about the contents of this video.

So what is Tectonic all about? Tectonic currently houses around ten tenants. Prior to Tectonic, Facebook's storage infrastructure included a constellation of systems: Haystack was one of them, f4 was another, and another was the data warehouse infrastructure, which used HDFS, the Hadoop Distributed File System, an open-source file system. All of these different technologies were eventually migrated into Tectonic, a generalized file system. Tectonic ultimately eases operational complexity because it is a single overarching system to operate, and it is more resource efficient because it allows for efficient resource sharing amongst cluster tenants, as we will soon see.

For example, Haystack is a blob store specializing in newer blobs. Blobs are binary large objects, for example photos. Since Haystack holds newer blobs, those are the ones more likely to get queried: a picture you upload to your news feed on Facebook is likely to appear on other people's news feeds soon after. On the other hand, f4 stores slightly older blobs, so it may not see as many requests, but there may be more blobs to store. In that case f4 is more likely to bottleneck on disk capacity rather than on IOPS, while Haystack is more likely to bottleneck on IOPS while having spare disk capacity. By having a generalized system we can allow for more efficient resource sharing, and that is just a little hint of what this talk will be about.

In aggregate, there were three challenges faced when Tectonic was being built. The first was scaling to exabyte scale: many pre-existing solutions were not able to scale to the level Facebook was operating at, so an in-house solution was needed. This also includes being able to handle so many different tenants, and the data sets belonging to each specific tenant. The second was enabling tenant-specific optimization: different types of tenants have different types of loads. In a data warehouse you may be more concerned about throughput, while for something like Haystack or f4, which are blob stores, you may be more concerned about low latency. An image may not be very large, but you don't want to wait two seconds for it to load on Facebook; meanwhile, for a data warehouse it may be fine to pay some latency up front as long as you can sustain the throughput needed for large jobs, for example MapReduce jobs, to execute.
The third challenge is providing performance isolation across tenants. This is very important because a specific tenant could hog all the resources, and another tenant residing on the same storage fabric could then run into SLA issues: it might not be able to sustain its load, or its services might crash because, for example, all the memory or all the storage capacity is used up. In that case you cannot effectively host multiple tenants on the same cluster, so Tectonic has a specific way of solving this problem as well.

Here are the high-level solutions for these three challenges; we'll go in depth into each one.

For the first, the high-level solution is disaggregating the file system metadata into independently scalable layers. What does this mean? In essence, similar to Azure Data Lake Store, it spreads the metadata describing where all the different files live across several different nodes. In HDFS, as you may remember from my Google File System talk, the name node was responsible for storing the metadata, essentially the mapping of which chunks are stored on which machine. You can horizontally scale the storage of different files across many machines, but you still need some central location holding the mapping that indicates on which server the different chunks of data reside. In the Hadoop Distributed File System that mapping lives on a single node, which is a clear issue in terms of scalability. Tectonic alleviates this by disaggregating the metadata into different elements, peeling them apart and distributing them across multiple nodes, and we'll see precisely how Tectonic achieves this.

For the second, tenant-specific optimizations, this is achieved by using a rich, client-driven microservices architecture. Essentially, Tectonic allows clients to drive the manipulation of data chunks instead of providing a fixed set of behaviors within Tectonic itself. For example, blob storage uses a replicated quorum append protocol to reduce latency on small writes and later Reed-Solomon encodes the data, while for the data warehouse all writes are immediately Reed-Solomon encoded to improve network efficiency and reduce I/O and space consumption. The advantage of a client-driven architecture is that the client can perform the behavior best suited to its specific type of task, rather than Tectonic having to step in and provide defaults that try to mimic what all the different tenants would want.

The third, performance isolation across tenants, is achieved using a construct called traffic groups. Traffic groups are a construct created by Tectonic essentially to provide performance isolation across different tenants, and the technique is used in their approximately fair weighted scheduler. Traffic groups are aggregations of applications, and each group is assigned a traffic class as a proxy for its priority. The classes are bronze, silver, and gold, which we will see later.
In essence, these groups of applications are what Tectonic uses to enforce restrictions on resource management.

Tectonic has actually been in production at Facebook for several years, replacing the former Haystack, f4, and HDFS systems. The transition facilitated large improvements, including a 10x reduction in the number of warehouse clusters; it helped the data warehouse handle traffic spikes with spare blob storage I/O capacity; and, as mentioned, it reduced operational overhead.

Now let's go a little more in depth into Facebook's previous storage infrastructure. For blob storage, blobs were typically immutable and ranged from small photos of several kilobytes to video chunks in the megabytes, and again the optimization target is low latency. Haystack was the hot storage system, while f4 was the warm one. f4 experiences a lower request rate, so it uses Reed-Solomon encoding to improve space efficiency, though it has lower throughput. The overall issue with the two separate systems is that Haystack over-provisioned storage while being able to accommodate more IOPS, while f4 lacked IOPS but had spare storage.

On the other hand we had the data warehouse. This was the storage for analytics, including snapshots of the social graph (TAO), training data for ML/AI models, and large-scale MapReduce jobs. The data is partitioned and used by different products such as ads, news feed, and search. As I mentioned earlier, this workload tends to be less latency-focused and instead focused on high throughput, with larger writes and reads than blob storage.

One of the key problems was that the data warehouse specifically could not be scaled effectively. Part of the reason lies in the architecture of the Hadoop Distributed File System: as you may remember from the Google File System, there is a name node and then a series of chunk servers that actually store the different chunks. The problem is that the name node is a single node, and a single node cannot scale effectively. It turned out that large single datasets were often big enough that they had to be split across several HDFS clusters, and the compute engine logic became complicated as data had to be queried across clusters. Ultimately this led to a two-dimensional bin-packing problem, where one dimension was each cluster's capacity constraint and the other was the available throughput per cluster, and both had to be optimized together.

So now let's take a look at the overall architecture and high-level design of Tectonic, and more formally motivate this design. As I mentioned earlier, Tectonic clusters are multi-tenant, supporting up to ten tenants on the same storage fabric and any number of namespaces.
Tenants interact with Tectonic via an append-only, hierarchical API. Although it is similar to HDFS, Tectonic's APIs are configurable at runtime, and this added flexibility is what allows tenants to match the performance of the specialized systems.

If you want an overall look at the design, you can sketch it as follows. At the very front we have the client-driven architecture: a client library is the entry point for the different users, and it is what talks to the other components. On the other end we have the metadata store. The metadata store is a little bit interesting because it isn't necessarily on a single node; as I mentioned before, there is a disaggregation. On one hand we have the Name layer, on another the File layer, and on another the Block layer, and all of these talk to a key-value store, which we'll discuss later (spoiler: it's ZippyDB). We also need to store the actual chunks, so the client library communicates with the Chunk Store, a fleet of machines all storing chunks. Last but certainly not least, there is one additional element I haven't really talked about: the background services. You can think of these as similar to anti-entropy repair, if you watched my leaderless-architecture video; they are stateless daemon processes, and they include things such as rebalancing, garbage collection, disk inventory, and storage repair and scanning.
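Before going into each component, here is a minimal, hypothetical sketch of how a read might flow through these pieces. The class and method names and the toy data are my own illustration, not Tectonic's actual API; the point is only the division of labor between client library, metadata layers, and chunk nodes.

```python
# Hypothetical sketch of a Tectonic-style read path. The client library resolves
# metadata layer by layer, then reads chunks directly from storage nodes.

class MetadataStore:
    """Stands in for the Name/File/Block layers backed by a sharded KV store."""
    def __init__(self):
        self.name = {("root", "photos"): "dir1", ("dir1", "cat.jpg"): "file1"}  # Name layer
        self.file = {"file1": ["blk1", "blk2"]}                                 # File layer
        self.block = {"blk1": ["disk3", "disk7"], "blk2": ["disk2", "disk9"]}   # Block layer

    def lookup(self, parent_id, name): return self.name[(parent_id, name)]
    def blocks_of(self, file_id): return self.file[file_id]
    def chunks_of(self, block_id): return self.block[block_id]

class ChunkStore:
    """Flat object store: knows chunks by opaque name, nothing about files or blocks."""
    def __init__(self):
        self.disks = {"disk3": b"he", "disk7": b"he", "disk2": b"llo", "disk9": b"llo"}
    def read(self, disk_id): return self.disks[disk_id]

def read_file(meta, chunks, path):
    node = "root"
    for part in path.strip("/").split("/"):
        node = meta.lookup(node, part)          # Name layer walk: directories, then the file
    data = b""
    for blk in meta.blocks_of(node):            # File layer: file -> blocks
        replicas = meta.chunks_of(blk)          # Block layer: block -> chunk locations
        data += chunks.read(replicas[0])        # client picks a replica (or RS-decodes)
    return data

print(read_file(MetadataStore(), ChunkStore(), "/photos/cat.jpg"))  # b"hello"
```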
Now we can go through each component individually and see what it looks like. The Chunk Store is an exabyte-scale, distributed, flat object store for chunks, the unit of storage in Tectonic. Tectonic files are composed of blocks, which are in turn stored as chunks. There are two ways the Chunk Store contributes to exabyte scalability. The first is that chunks are oblivious to higher-level abstractions such as files and blocks; those are handled entirely by the client library and the metadata store. Hence the system supports specialization for each tenant's performance needs without having to build custom file-system management into the storage layer itself, which is a large advantage: the client can decide, for example, when to Reed-Solomon encode something or whether to use a replication model instead. The second is that the Chunk Store is flat, so the number of chunks grows linearly with the number of storage nodes.

So how do these chunk nodes actually work? The underlying file system on each individual chunk node is XFS, a 64-bit journaling Linux file system. Tectonic stores the local file-system metadata on an SSD so that metadata updates can happen quickly. Each storage node has 36 hard drives and a one-terabyte SSD; the SSD is used both for caching and for storing that metadata.

In terms of durability there are two main approaches. The first is replication: data chunks are copied into different fault domains, such as separate server racks, and background processes repair lost and damaged chunks to maintain durability. Note that the whole system is not geo-replicated, which is an important consideration: clients need to implement geo-replication on top of Tectonic, and various systems indeed do. The second approach is Reed-Solomon encoding: for a given RS(r, k), a block is split into r equal data chunks plus k parity chunks, and any r of the r + k chunks are enough to reconstruct the block. That is the basic idea of how the Chunk Store works.

The next major component is the metadata store, which is responsible for naming exabytes of data. It maps files to blocks and blocks to chunks, and it performs the disaggregation into different layers, namely the Name, File, and Block layers. Scalability and load balancing come naturally from this design, paving the way for exabyte scalability. Earlier I mentioned the factors that allow exabyte scale in Tectonic, and here we'll see precisely how this works out. Let me sketch the different layers used in the metadata store and how they fit together to provide the scalability we're looking for.

Across the metadata we want to be able to clearly identify where all the chunks are stored and on which disk. Everything is stored as key-value pairs, the data is sharded rather than living on a single name node as in HDFS, and each layer stores a mapping. There are three layers: Name, File, and Block. The Name layer has keys of (directory ID, subdirectory name) and (directory ID, file name); the values they map to are (subdirectory info, subdirectory ID) and (file info, file ID) respectively, and both are sharded by the directory ID. Essentially this is a mapping from a directory to its list of subdirectories and files, stored in "expanded" form, which we'll see the meaning of in a second. Similarly, the File layer takes a key of (file ID, block ID) and produces block info, sharded by the file ID; this is a file-to-list-of-blocks mapping, again expanded. The Block layer maps a block ID to a list of disk IDs, sharded by block ID; this is the block-to-list-of-disks mapping, i.e., where the chunks live. It also has a key of (disk ID, block ID) mapping to chunk info, likewise sharded by block ID; this is the disk-to-list-of-blocks mapping, expanded.
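To make the layered schema concrete, here is a small sketch of how these mappings might look as flat key-value pairs in a hash-sharded store. The identifiers, example values, and shard function are illustrative assumptions, not the real ZippyDB layout.

```python
# A hypothetical, minimal rendering of the layered metadata schema as flat
# key/value pairs in a hash-sharded store.

import hashlib

NUM_SHARDS = 1024

def shard_of(shard_key: str) -> int:
    # Hash partitioning: all entries sharing a shard key live on the same shard.
    return int(hashlib.sha1(shard_key.encode()).hexdigest(), 16) % NUM_SHARDS

# (layer, key) -> (value, shard key). Lists are stored "expanded": one KV pair
# per element, so adding or removing an entry never rewrites a big list value.
schema_examples = {
    ("name",  ("dir42", "subdir:photos")): ({"subdir_id": "dir57"},        "dir42"),
    ("name",  ("dir42", "file:cat.jpg")):  ({"file_id": "file901"},        "dir42"),
    ("file",  ("file901", "blk7")):        ({"blk_info": "..."},           "file901"),
    ("block", ("blk7",)):                  (["disk3", "disk12", "disk88"], "blk7"),
    ("block", ("disk3", "blk7")):          ({"chunk_info": "..."},         "blk7"),
}

for (layer, key), (value, shard_key) in schema_examples.items():
    print(f"{layer:5s} {str(key):30s} -> shard {shard_of(shard_key):4d}  value={value}")
```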
Several remarks are worth making here. The "sharded by" column is critical. When sharding, you can use either a hash partitioner or a range partitioner; ADLS (Azure Data Lake Store) shards using range-based partitioning, which can be disadvantageous because it may lead to hotspots. If you look at the Name layer, notice that it is sharded by the directory ID. This means that if I do a lot of operations on the same directory, they will all land on the same shard. On the other hand, if I do something recursive on a whole folder, then as soon as I descend into a subdirectory I am most likely on a different shard (depending on the hash function and how many shards there are). So within a single directory, all the entries at that depth reside on the same shard, while anything below it does not, and that is critical to this design: any operation within one directory can be handled with single-shard consistency, but we avoid hotspots because work on other directories lands on other shards and is not dragged into the same query.

Another critical thing to notice is the "expanded" notation. Expanded means that instead of storing one key whose value is an entire list, we store separate key-value pairs, spread out, one per entry. The advantage is that when you want to add or remove values you don't need to rewrite a list data structure; you simply insert or delete an entry. Think of the classic data-structures problem: with one big list you either rewrite it on every change or reach for something like a linked list, which makes removal cheap but indexing awkward. The expanded representation sidesteps that problem.

This is the high-level approach of how the metadata store stores items. For the Name layer, given a directory ID and a subdirectory name, you get back the subdirectory info and subdirectory ID, and there is a corresponding lookup for file names. For files you can retrieve the block IDs, and for blocks you can retrieve which disks hold the chunks. All of this information aggregates down through the Name, File, and Block layers, and these disaggregated, separately scalable layers, sharded by carefully chosen keys, are the guiding principle that gives the Tectonic metadata store its scalability.
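Here is a toy illustration of that locality trade-off, assuming a made-up namespace and a hash-based shard function: listing a single directory touches one shard, while a recursive walk fans out across many.

```python
# Hypothetical illustration of why sharding the Name layer by directory ID avoids
# hotspots but makes recursive operations fan out across shards.

import hashlib

def shard_of(key: str, num_shards: int = 8) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % num_shards

# dir_id -> children; a purely illustrative toy namespace.
namespace = {
    "dir_root":   ["dir_photos", "dir_videos"],
    "dir_photos": ["dir_2020", "dir_2021", "file_cat.jpg"],
    "dir_2020":   ["file_a.jpg", "file_b.jpg"],
    "dir_2021":   ["file_c.jpg"],
    "dir_videos": ["file_clip.mp4"],
}

def shards_touched_by_list(dir_id):
    # Listing one directory only touches the shard that owns that directory's entries.
    return {shard_of(dir_id)}

def shards_touched_by_recursive_list(dir_id):
    # A recursive walk hits every subdirectory's shard, potentially many of them.
    touched = {shard_of(dir_id)}
    for child in namespace.get(dir_id, []):
        if child.startswith("dir_"):
            touched |= shards_touched_by_recursive_list(child)
    return touched

print(shards_touched_by_list("dir_photos"))          # a single shard
print(shards_touched_by_recursive_list("dir_root"))  # usually several shards
```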
Now we can look more specifically at the actual key-value storage: how is the data physically stored? In the earlier diagram, all storage was delegated to ZippyDB, a key-value store created by Facebook that is fault tolerant, sharded, and linearizable. Each individual shard is stored in an embedded, SSD-based, single-node key-value store called RocksDB, so we essentially have a series of RocksDB instances, one per shard, and the shards are furthermore replicated using Paxos, which provides consensus across the different nodes. While any replica can serve reads, strongly consistent reads must go through the primary node. Since we shard by the keys described above, we know which shard, and hence which primary, owns everything residing directly inside a given directory, so we can leverage these strongly consistent reads where they matter.

Multiple shards are placed on a single node, although there are no atomic cross-shard operations. The advantage of creating and populating multiple shards per node is that if that node goes down, it can be rebuilt in parallel, because the other replicas of its shards are spread across many different nodes. Compare that to having one node act as the entire replica of another: recovery would put all the burden of bringing the failed node back to life onto that single machine.

There is also an optimization called sealing. Sealing object metadata essentially makes certain things read-only so that they can be cached safely. The one thing to note is that block-to-chunk mappings are not cached, because chunks can move across disks, which would invalidate Block-layer caches.

In terms of consistency, Tectonic guarantees read-after-write consistency for data operations and for metadata operations whose source and destination live on the same shard, for example a move where the source and destination are in the same directory. So how do cross-shard operations work? They are non-atomic. As an example, consider moving a directory into a parent directory that lives on another shard. The steps are: first, create a link from the new parent directory and then delete the link from the previous parent; second, the moved directory keeps a back-pointer to its parent directory, which is used to detect pending moves and ensures that only one move is active for a given directory at a time.

Issues can arise with this setup, and one example is a race condition where a file named f1 in directory d is renamed to f2 while, at the same time, a new file with the same name f1 is being created. Since there are no cross-shard transactions, multi-shard metadata operations touching the same file can race, and they all must be implemented carefully. Here is the example. The rename consists of: R1, get the file ID fid for f1; R2, add f2 as an owner of fid; R3, in one atomic transaction, create the mapping f2 → fid and delete the mapping f1 → fid. The concurrent create consists of: C1, create a new file ID fid_new; C2, map f1 → fid_new and delete f1 → fid. If the operations interleave as R1, C1, C2, R2, R3, the system is left in an inconsistent state: R1 reads fid for f1, but then the create makes fid_new and repoints f1 at it, after which R2 adds f2 as an owner of the old fid and R3 tries to delete an f1 → fid link that no longer exists. That link is broken, so this specific interleaving leaves the metadata inconsistent.
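The following toy sketch replays that interleaving in a single process, with plain dictionaries standing in for the sharded metadata. It is only meant to show why the R1, C1, C2, R2, R3 ordering leaves a dangling mapping; it is not how Tectonic actually implements these operations.

```python
# Toy reproduction of the cross-shard race: rename(d/f1 -> d/f2) interleaved with
# create(d/f1). With no cross-shard transactions, the interleaving below leaves
# the metadata in an inconsistent state.

name_map = {"f1": "fid"}   # directory entry name -> file id
owners = {"fid": {"f1"}}   # file id -> names that point at it

# R1: rename reads the current file id for f1.
rename_fid = name_map["f1"]                    # -> "fid"

# C1 + C2: a concurrent create makes a brand-new file object for the same name.
name_map["f1"] = "fid_new"
owners["fid_new"] = {"f1"}

# R2: rename adds f2 as an owner of the *old* file id it read in R1.
owners[rename_fid].add("f2")

# R3: rename atomically maps f2 -> fid and deletes f1 -> fid, but f1 now points
# at fid_new, so the delete finds nothing to remove.
name_map["f2"] = rename_fid
if name_map.get("f1") == rename_fid:           # no longer true
    del name_map["f1"]

print(name_map)   # {'f1': 'fid_new', 'f2': 'fid'} -- both names now exist
print(owners)     # 'fid' still lists 'f1' as an owner: inconsistent state
```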
So in this specific example, renaming a file f1 in directory d to f2 while concurrently creating a new file with the same name is just one of many operation mixes that can go wrong without cross-shard transactions. That is essentially the metadata store; now we can continue on and talk about the client library.

The client library specializes in orchestrating the Chunk Store and metadata store services; in our earlier diagram, the client library is what called into the metadata store and the different chunk nodes. It operates at per-chunk granularity, which provides the flexibility to apply optimizations and run applications in the most suitable way possible; the typical approaches for durability, as mentioned earlier, are Reed-Solomon encoding or replication.

There are also single-writer semantics. Tectonic restricts itself to single-writer semantics per file, which avoids the complexity of serializing writes from several writers. The way this works is that each file carries a write token: every time a process opens a file for appending, a token is added to the file metadata, and subsequent writes must include it. If another process then opens that file, a new token is generated and recorded in the metadata, and only that process, because it holds the latest token, is able to write, which gives the serialization guarantee. It is still possible to have multiple writers if the client library chooses to implement something like that on top, but the default behavior uses this token-based system to enforce the semantics.
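Here is a minimal sketch of how such per-file write tokens could behave; the token generation and method names are assumptions made for illustration, not Tectonic's actual client-library interface.

```python
# Minimal sketch of per-file single-writer semantics via a write token.

import itertools

class FileMetadata:
    _token_counter = itertools.count(1)

    def __init__(self):
        self.current_token = None
        self.data = []

    def open_for_append(self) -> int:
        # Every open-for-append mints a fresh token and records it in the metadata,
        # implicitly fencing off any earlier writer still holding an old token.
        self.current_token = next(self._token_counter)
        return self.current_token

    def append(self, token: int, payload: bytes):
        if token != self.current_token:
            raise PermissionError("stale write token: another writer has the file open")
        self.data.append(payload)

f = FileMetadata()
t1 = f.open_for_append()
f.append(t1, b"written by writer 1")   # ok
t2 = f.open_for_append()               # a second process opens the file
f.append(t2, b"written by writer 2")   # ok: writer 2 holds the latest token
try:
    f.append(t1, b"late write")        # writer 1's token is now stale
except PermissionError as e:
    print(e)
```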
That is essentially all there is to the client library. On the other hand, we have our last component, the background services. Background services aim to maintain consistency across the metadata layers, maintain durability by repairing lost data (similar to anti-entropy repair), rebalance data across storage nodes, and publish statistics about the file system. Note that background services process one shard at a time, on a per-shard basis; there are not really operations that span multiple shards at once.

Another idea applied in this area is copysets, which are mentioned in the paper. The idea of copysets, which I found pretty interesting, comes from a paper out of Stanford, and it starts from the observation that in distributed systems nodes fail all the time; with thousands of machines, failure is inevitable. You typically choose a replication factor such as two or three, and you don't want to lose any data. If you do lose a chunk because one particular machine fails, you can re-replicate it from another machine, but what happens if all of the replicating machines fail? Say you have a thousand machines, x1, x2, and so on up to x1000, and a replication factor of three, meaning any piece of data, any chunk, is stored on three different machines. A naive approach says: I write my data on one machine and pick two other machines at random to hold the copies, for example x47, replicated onto x68 and x421; those are the nodes carrying my data at replication factor 3. The problem is that if replicas are chosen at random and three machines do happen to fail, it is almost guaranteed that I will lose some data, and that causes a lot of operational headache in finding and restoring that data from logs or other forms of cold storage. It was found that, in terms of engineering overhead, any lost chunk takes significant effort to recover, so it can be preferable for loss events to occur far less often even if more data is lost each time. There are quotes from the HBase engineering team at Facebook mentioning that they would specifically choose this trade-off: losing significantly more data, but much less frequently, instead of frequently losing small bits of data.

How can we actually strike that balance? By not using random assignment when picking nodes, and instead always picking from fixed sets of three. That is the idea of copysets: there are predefined sets of machines onto which we replicate. For example, say we always replicate in groups of three: if I write to x1 I also write to x2 and x3; if I write to x4 I also write to x5 and x6; if I write to x7 I also write to x8 and x9; and so on. Now if I happen to have failures on x47, x68, and x421, I am most likely not going to lose any data at all, because to lose data I would need to lose x47 together with the rest of its set (x46 and x48), or x68 together with x67 and x69. The total space of failure combinations that can lose data is much smaller, because the failures now have to hit one of the predefined groups rather than any arbitrary trio of machines. The disadvantage is that when we do lose data, say x1, x2, and x3 all fail, we suddenly lose a whole lot of it, because a lot of data is concentrated on those three replica nodes. So this is the trade-off taken by Tectonic: deciding between losing a lot of data at once, rarely, and losing less data, more frequently. Another factor to consider is how large the copysets are: the larger the copysets, the more storage is used, but at the same time you may get more reliability and durability. These are the trade-offs the engineering team at Facebook weighed when deciding how to incorporate this design.
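A quick back-of-the-envelope sketch of that trade-off follows, with made-up node and chunk counts; the placement scheme and numbers are mine for illustration, not Facebook's.

```python
# Hypothetical comparison of random replica placement vs copyset placement at
# replication factor 3. Node and chunk counts are made up.

from math import comb

NODES, RF, CHUNKS = 999, 3, 2_000_000_000
total_triples = comb(NODES, RF)

# Random placement: each chunk independently picks 3 nodes, so with billions of
# chunks essentially every possible triple ends up holding some chunk's replicas.
p_triple_used_random = 1 - (1 - 1 / total_triples) ** CHUNKS

# Copyset placement: nodes are statically grouped into fixed, disjoint sets of 3,
# (x1,x2,x3), (x4,x5,x6), ..., and each chunk is replicated onto one such set.
num_copysets = NODES // RF
p_triple_used_copyset = num_copysets / total_triples

print(f"possible 3-node failure combinations: {total_triples:,}")
print(f"P(a given 3-node failure loses data), random placement:  ~{p_triple_used_random:.4%}")
print(f"P(a given 3-node failure loses data), copyset placement: ~{p_triple_used_copyset:.6%}")
# The flip side: when a copyset does fail, it takes a much larger share of the
# data with it, which is exactly the frequency-vs-blast-radius trade-off above.
```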
That covers the high-level architecture and the main considerations taken when designing Tectonic. Now we can talk a little about multi-tenancy, that is, sharing resources effectively. Since Tectonic handles a diverse set of tenants, it needs an approximately fair weighted scheduler to ensure fair resource sharing and performance isolation amongst tenants. There are two types of resources in Tectonic: non-ephemeral and ephemeral. An example of a non-ephemeral resource is storage capacity: it changes slowly, and the demand, or at least its rate of growth, is usually known. An ephemeral resource, on the other hand, is something like IOPS, whose demand can change considerably over short time scales because workloads are very dynamic. There are also certain optimizations, as we discussed earlier, to fit multiple tenants while keeping the sharing fair enough that all services can maintain their SLAs.

So how are ephemeral resources distributed? Ephemeral resources are managed at the granularity of groups, more specifically traffic groups, and traffic groups reduce the cardinality of the resource-sharing problem, which also reduces management overhead. Tectonic supports up to 50 traffic groups per cluster, and each traffic group is assigned a traffic class. A traffic class indicates the latency requirements and is used to decide which traffic group gets prioritized. Each tenant gets a predetermined quota of the cluster's ephemeral resources, which is divided among its traffic groups.

Now that we have these traffic groups and traffic classes, how does it actually work? First there is global resource sharing: the client library uses a rate limiter to achieve the aforementioned elasticity, built on near-real-time distributed counters (internally, as mentioned in other literature, the operational data store, ODS). The demand for each tracked resource of each tenant is measured over a small time window, and the rate limiting itself is implemented with a leaky-bucket algorithm. Throttling a request at the client puts backpressure on clients before a wasted request is ever sent.

There is also local resource sharing: rate limiters on the metadata and storage nodes themselves, to prevent local hotspots. The nodes provide fair sharing and resource isolation using a weighted round-robin scheduler, which provisionally skips a traffic group's turn if serving it would exceed that group's resource quota. It also needs to ensure that low-latency blob store reads do not get stuck behind larger co-located reads, such as those from the data warehouse.

As mentioned previously, traffic classes break down into bronze, silver, and gold. Gold means the traffic is the most latency-sensitive, while bronze is not particularly sensitive, so it may be pushed back when a gold request comes through. Requests in a high traffic class may still miss their latency targets if they get blocked behind a lower-priority request on a storage node, so there are three optimizations in Tectonic for ensuring low latency for gold-class traffic. Let's go through them.

First, the weighted round-robin scheduler applies a greedy optimization: a request from a lower traffic class gives up its turn to a higher traffic class if there is still sufficient time to fulfill its own request after the higher-class request finishes. So if we have a gold request that takes a certain amount of time and a bronze request with a given acceptable completion window, we move the gold request ahead of the bronze one, as long as the bronze request still finishes within its window with some buffer room to spare.
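Here is an illustrative sketch of that greedy turn-yielding idea; the request sizes, deadlines, and scheduling loop are assumptions made purely for the example and are not the actual Tectonic scheduler.

```python
# Illustrative sketch: a lower-class request yields its turn to a gold request
# when it still has enough slack to finish within its own latency target afterwards.

from dataclasses import dataclass

@dataclass
class Request:
    name: str
    traffic_class: str      # "gold" | "silver" | "bronze"
    service_time: float     # how long the request occupies the resource
    deadline: float         # latest acceptable completion time

def schedule(queue, now=0.0):
    order = []
    while queue:
        nxt = queue[0]
        gold = next((r for r in queue if r.traffic_class == "gold"), None)
        # Greedy yield: run a waiting gold request first if the head-of-line
        # request can still meet its deadline after the gold one completes.
        if gold and nxt is not gold and now + gold.service_time + nxt.service_time <= nxt.deadline:
            chosen = gold
        else:
            chosen = nxt
        queue.remove(chosen)
        now += chosen.service_time
        order.append((chosen.name, now))
    return order

queue = [
    Request("bronze-scan", "bronze", service_time=8.0, deadline=30.0),
    Request("gold-blob-read", "gold", service_time=1.0, deadline=3.0),
]
print(schedule(queue))
# [('gold-blob-read', 1.0), ('bronze-scan', 9.0)] -- gold meets its target and
# bronze still finishes well before its own deadline.
```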
This is in contrast to simply running the bronze request first and the gold one after, which is bad because we want the gold request, the more latency-sensitive one, to complete before the bronze. It is a fairly straightforward optimization.

The second optimization addresses the disk itself, which may rearrange I/O requests such that it services a non-gold request before a gold one. To counter this, Tectonic stops scheduling non-gold requests to a disk if a gold request has been pending on that disk for more than a threshold amount of time. So consider a series of I/O requests that includes one gold request: once the pending time t is greater than or equal to the threshold, the disk is given no more non-gold requests for the time being, and the gold request is fast-tracked to the front.

The third optimization: if there are pending gold requests, any incoming non-gold request may be rejected immediately, which puts backpressure on the client. That is the third and final optimization used for gold-class traffic.

So that's multi-tenancy: for ephemeral resources we create a fair-sharing process in which different applications are grouped into traffic groups, which minimizes cardinality and management overhead, and each traffic group gets a traffic class; based on the traffic class, certain rules allow gold, latency-sensitive traffic groups to skip the line in the weighted round-robin scheduler and be processed ahead of other requests.

Along with multi-tenancy there is another topic worth discussing, which is access control. Tectonic uses a token-based authorization mechanism, which may remind you of the single-writer semantics from the client library. A dedicated authorization service authorizes top-level client requests, such as opening a file, by generating an authorization token for the next layer in the system, and this process continues down the different layers of the metadata store we discussed above; as a request descends, each layer receives a token granting access to the next. The token's payload describes the resource that access is being given to, which allows granular access control at each layer. The verification process itself happens in memory, which is efficient and can be done on the order of tens of milliseconds, and token passing can be used to reduce access-control overhead, for example by avoiding repeated calls to the authorization service. In our original diagram we would need to add this authorization service to get the full picture: the grand scheme uses token-based authorization with a dedicated authorization service that authorizes top-level client requests and does so on a per-layer basis across the different services, which is fairly unique.
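As a rough illustration of what a per-layer token chain could look like, here is a sketch that uses HMAC-signed payloads; the signing scheme, field names, and helper functions are all my own assumptions, not Facebook's authorization service.

```python
# Illustrative sketch of per-layer token-based authorization. The HMAC signing,
# payload fields, and layer names are assumptions made for the sketch only.

import hmac, hashlib, json

SECRET = b"shared-secret-known-to-the-layers"   # stand-in for real key management

def mint_token(resource: str, operation: str) -> dict:
    payload = {"resource": resource, "op": operation}
    sig = hmac.new(SECRET, json.dumps(payload, sort_keys=True).encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "sig": sig}

def verify(token: dict, resource: str) -> bool:
    expected = hmac.new(SECRET, json.dumps(token["payload"], sort_keys=True).encode(),
                        hashlib.sha256).hexdigest()
    # Cheap, in-memory check: signature is valid and the token covers this resource.
    return hmac.compare_digest(expected, token["sig"]) and token["payload"]["resource"] == resource

# The authorization service grants the top-level request (open a file in a directory)...
dir_token = mint_token("dir42", "open")

# ...the Name layer verifies it and mints a narrower token for the File layer,
# which in turn does the same for the Block layer, and so on down the stack.
assert verify(dir_token, "dir42")
file_token = mint_token("file901", "read")
assert verify(file_token, "file901")
block_token = mint_token("blk7", "read")
assert verify(block_token, "blk7")
print("each layer granted access only to the resource named in its token")
```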
That brings us to the remaining challenge: tenant-specific optimizations. There are two mechanisms that pave the way for tenant-specific optimizations and distinguish Tectonic from other systems. The first is that clients enforce configuration on a per-call basis; for example, durability can be chosen on each individual block write, which HDFS does not allow, and it is the much more scalable metadata store that makes operating at this per-block granularity feasible. The main advantage is that different situations can use different configurations. The second is that tenants have nearly full control over how to configure an application's interaction with Tectonic, down to chunk-level granularity, for example the optimization of Reed-Solomon encoding lazily for blob storage.

Several specific optimizations deployed by tenants are worth looking at. The first is hedged quorum writes. Hedging comes from Jeff Dean's work on taming tail latency by issuing requests to multiple replicas, and Tectonic uses a variant of quorum writing to reduce tail latency without additional I/O. The idea behind hedging is that when you perform a request against several replicas, aiming to reach some quorum, some of those replicas may take much longer to reply, or may be down, unavailable, or overloaded. Instead of sending the request to exactly the number of nodes you need, why not contact more replicas and ask about their availability ahead of time? Tectonic's variant avoids extra I/O: rather than sending the chunk write payload to the additional nodes, it sends small reservation requests ahead of the data and then writes to the first nodes that respond to the reservation. So the client library might send reservations to a handful of nodes when it really only needs three of them; if the slow ones never reply, that's fine, because three already accepted. As a concrete example, for an RS(9,6)-encoded write the client library sends reservation requests to 19 nodes, versus the 15 it actually needs.
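Here is a small simulation of that reservation-based hedging idea, with made-up latencies and helper names; it only sketches the two-phase shape (cheap reservations first, then real writes to the fastest responders), not the actual Tectonic write path.

```python
# Illustrative sketch of a reservation-based hedged quorum write: over-ask with
# cheap reservations, then send the actual chunk writes to the fastest responders.

import random
random.seed(7)

def reservation_rtt(node_id: int) -> float:
    # Most nodes answer quickly; a few are slow (overloaded, degraded, etc.).
    return random.uniform(1, 5) if random.random() < 0.9 else random.uniform(50, 200)

def hedged_quorum_write(total_chunks=15, extra_reservations=4, ack_quorum=14):
    candidates = list(range(total_chunks + extra_reservations))   # 19 nodes for RS(9,6)
    # Phase 1: reservations carry no payload, so over-asking costs almost no I/O.
    responders = sorted(candidates, key=reservation_rtt)
    chosen = responders[:total_chunks]                            # first 15 to reply
    # Phase 2: send the real chunk writes; acknowledge once 14 of 15 complete and
    # rely on background repair to fix a straggler that never lands.
    acked = chosen[:ack_quorum]
    return chosen, acked

chosen, acked = hedged_quorum_write()
print(f"reserved on 19 nodes, wrote to {len(chosen)}, acknowledged after {len(acked)} completions")
```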
You incur slightly more network overhead, but it is marginal, and there is no I/O penalty because a reservation carries no payload; that is where this differs slightly from the scheme Jeff Dean described. To the first 15 nodes that accept the reservation we send the actual write, and then we wait until at least 14 reply to call it a quorum; if the 15th write fails, we depend on the anti-entropy background process to fix it. In other words: send 19 reservations, write to 15, and don't even wait for all 15 acknowledgements; accept the write at 14 and leave the remainder to the background repair process. Using hedged quorum writes, Tectonic achieved roughly a 20% performance improvement at p99, and that matters because the p99 is exactly where you start seeing the effect of nodes being slow or unavailable, so this alleviates that problem.

The second example is the blob storage optimization. Facebook stores tens of trillions of blobs, so the sheer quantity of blobs to index is a challenge in and of itself; even for the metadata store, keeping track of trillions of little files lingering around is a big ask. But notice that these files tend to be small, so Tectonic tackles the problem by storing multiple blobs together in log-structured files, where new blobs are appended to the end of the file. To locate a blob, blobs are indexed by blob ID to the blob's location within the file. They are written as partial block appends so as to be read-after-write consistent; however, those appends are replicated, so they use more disk space. Concretely, a file may carry an index indicating the different offsets: blob xy is at offset 20, blob xz is at offset 21, and so on, and you follow the offset into the file to find the actual blob. The advantage is that we incur much less overhead locating and naming individual blobs, since many of them are packed into one file rather than each one being its own file.
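A minimal sketch of the packed-blob idea follows, assuming a simple in-memory layout; the class and format are illustrative, not the real on-disk structure.

```python
# Illustrative sketch: pack many small blobs into one log-structured file with an
# index from blob ID to (offset, length). New blobs are only ever appended.

class BlobLogFile:
    def __init__(self):
        self.buf = bytearray()
        self.index = {}                 # blob_id -> (offset, length)

    def append_blob(self, blob_id: str, data: bytes) -> None:
        # Append-only: existing bytes are never rewritten.
        self.index[blob_id] = (len(self.buf), len(data))
        self.buf.extend(data)

    def read_blob(self, blob_id: str) -> bytes:
        offset, length = self.index[blob_id]
        return bytes(self.buf[offset:offset + length])

log = BlobLogFile()
log.append_blob("xy", b"<jpeg bytes...>")
log.append_blob("xz", b"<another photo>")
print(log.index)              # {'xy': (0, 15), 'xz': (15, 15)}
print(log.read_blob("xz"))    # b'<another photo>'
```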
So those are the tenant-specific optimizations used in Tectonic. Another thing worth talking about is the production trade-offs made in the system; it is not all upside. One issue is higher metadata latency. HDFS metadata operations are served from memory on a single node, whereas in Tectonic we bought scale by sharding the metadata and storing it in disaggregated layers on a sharded key-value store. This means a Tectonic file operation may require more than one network call; for example, an open interacts with both the Name and File layers. As a result, the data warehouse had to adjust its handling of several operations to cope with the additional metadata latency introduced by this disaggregation and layering of metadata plus sharding across nodes.

The next issue comes from hash partitioning. Since Tectonic directories are hash-partitioned, directory list operations cannot rely on locality, because subdirectories do not reside on the same metadata shard; listing a directory tree recursively may involve querying many shards. Also, unlike HDFS, Tectonic has no du-style disk-utilization operation; instead it periodically aggregates per-directory usage statistics, as part of the background services, and those statistics may be stale. Recall that we chose hash partitioning specifically to avoid hotspots in the metadata; something like ADLS (Azure Data Lake Store) does not run into these listing issues because it uses range partitioning, where subdirectories can fall within the same range, but it pays for that with potential hotspots.

Lastly, it is worth mentioning that several services do not use Tectonic. Some are bootstrap services, such as Configurator, and other systems use services that are themselves built on Tectonic, for example services using the blob store, without directly residing on Tectonic.

So that is an overview of Tectonic itself. Several related works are worth mentioning. HDFS and GFS, the Hadoop Distributed File System and the Google File System, are designs that use a single-node architecture for metadata. Another approach is federated namespaces for increased capacity, for example Windows Azure Storage. Examples of disaggregated and sharded metadata include ADLS and HopsFS, which increase file-system capacity by disaggregating metadata into layers on separately sharded data stores; but again, ADLS and HopsFS use range partitioning rather than hashing. Last but certainly not least, the closest system is Colossus, by Google, which provides cluster-wide multi-exabyte storage and is also client-driven, but it uses Spanner, a globally consistent database, for its file-system metadata rather than ZippyDB; ZippyDB only provides within-shard strong consistency and no cross-shard operations, though this particular limitation has not really been an issue at Facebook. So those are the related works if you're interested in checking them out: HDFS, GFS, Windows Azure Storage, Azure Data Lake Store, HopsFS, and Colossus.

I hope you enjoyed this quick talk. If you have any questions or feedback, please leave them down below, and thank you very much for watching.