Hello. Today we will continue our discussion on cloud computing and look at some aspects of managing data in the cloud. As we have discussed in earlier lectures, data is one of the major concerns in the cloud, because at the end of the day your data, and even your processing applications, are in somebody else's domain: they are executed somewhere beyond your direct control, hosted in a virtual machine somewhere in the cloud. That makes things tricky from the security point of view, which we have already discussed. But there is another point of view as well: from the cloud provider's side, managing huge volumes of data, keeping replicas, and making the data queryable are major issues in their own right. Our conventional relational or object-oriented models may not fit directly into this setting. As long as you are running small experimental database applications, they are fine; but when the read-write traffic or the volume of data is much higher than in normal operations, we need to look at things in a different way. These questions are not unique to the cloud; they arose a little earlier too, in the context of parallel databases, where parallel read, write, and execute operations had to be coordinated. Those mechanisms, however, become far more prominent, in fact the de facto approach, in the context of the cloud. So what we will do today is give an overview of how data can be managed in the cloud, and what different strategies or schemes providers follow. It is not exactly the security point of view; it is more of a data management point of view.
We will first recap the relational database model, which is already known to you, and then look at scalable databases and data services. A couple of things are important here: the Google File System, BigTable, and the MapReduce parallel programming paradigm, which come up together again and again. Whatever we manage on a cloud platform, whether applications or data, we want it to be scalable: it should scale up and scale down as the requirement changes, with minimum human or management intervention, and that is true for data as well. These systems are primarily suitable for large-volume, massively parallel text processing, and for enterprise analytics over distributed data stores: a chain of shops or commercial outlets, a banking or other financial organization, or large volumes of other kinds of data, such as meteorological or climatological records, that are distributed and need parallel processing. That is where the actual advantage of the cloud comes into play; if you have a simple database with a single instance, you would probably not have gone to the cloud for it, and you would not be extracting the real benefit of the cloud. Similar to the BigTable model there are Google App Engine's Datastore and Amazon SimpleDB, which different providers offer in different flavours, but the basic philosophy is the same. Let us first take a quick look at the relational database.
In a relational database, users and application programs interact with the RDBMS through SQL, the Structured Query Language. A query parser transforms queries into memory- and disk-level operations and optimizes the execution time. For any query we want to minimize execution time: on a large database, whether you perform a projection or a selection before or after a join makes a lot of difference. The query output will be the same, but the execution time may vary to a great extent. Suppose I have two relations, R1 and R2, and I select attributes A1 and A2 before doing the join: if the select on R1 brings the number of tuples down from one million to a few thousand, and similarly for R2, then the join becomes much less costly. Whether you join first or select first is a classical database optimization problem, nothing specific to the cloud, but the relational engine optimizes such things for you. The disk space management layer is another component: it stores data records on pages of contiguous memory blocks so that disk movement is minimized, and pages are fetched from disk into memory as requested, using prefetching and page replacement policies. So one aspect is making query processing efficient, and the other is making storage efficient: if a query requires, say, five tables and they are stored near one another, the access is fast.
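The join-ordering point above can be made concrete with a small back-of-the-envelope calculation. This is a hedged sketch: the table sizes, the selectivity, and the naive nested-loop cost model are illustrative assumptions, not numbers from any real system.

```python
# Rough cost model: a naive nested-loop join examines every pair of
# tuples, so its cost is the product of the input sizes.

def join_cost(n1, n2):
    return n1 * n2

# Join two million-row tables first, select afterwards:
cost_join_first = join_cost(1_000_000, 1_000_000)   # 10**12 comparisons

# Select first (assume each select keeps ~5,000 rows), then join:
cost_select_first = join_cost(5_000, 5_000)         # 2.5 * 10**7 comparisons

print(cost_join_first // cost_select_first)  # 40000x fewer comparisons
```

The query result is identical either way; only the work done changes, which is exactly why the optimizer pushes selections below joins.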
Next is the database file system layer. Previously we saw the RDBMS parser and the disk space management layer; the database file system layer is independent of the OS file system. It is a separate file system, built so the database has full control over retaining or releasing pages in memory, and the files used by the database may span multiple disks to handle large storage. In other words, depending on the operating system for paging is fine while the database load is small; when it is large, the extra hops become costly, so the database interacts directly, at a much lower level, with the hardware and the available resources, which is exactly what this layer does. It uses parallel I/O, as in RAID disk arrays (RAID 1, RAID 5, RAID 6 and so on) or multiple clusters, which keep redundancy in the system so that downtime from failures is much less; it is essentially a failure-resistant implementation of database storage. Usually database storage is row-oriented: we have tuples, a set of rows of the same schema, which is optimal for write-oriented, transaction-processing applications. Relational records are stored in contiguous disk pages and accessed through indexes, typically a primary-key index on specific columns; the B+ tree is one of the favourite storage mechanisms for this. Column-oriented storage, on the other hand, is efficient for data-warehouse workloads. Those of you who have gone through data warehousing know it deals with high-dimensional data, huge volumes collected and populated from different sources; it is more of a warehouse than a simple database. Column-oriented storage suits such loads because we work with aggregates of measures rather than individual records; analysis and analytics come into play.
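As a toy illustration of the row-oriented versus column-oriented layouts just described, here is the same relation held both ways in plain Python. The sales schema and its values are invented for illustration; the point is that a warehouse-style aggregate touches only two columns in the column layout.

```python
rows = [  # row store: one record per tuple, good for transactional writes
    {"id": 1, "region": "east", "amount": 100},
    {"id": 2, "region": "west", "amount": 250},
    {"id": 3, "region": "east", "amount": 175},
]

columns = {  # column store: one array per attribute, good for aggregates
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [100, 250, 175],
}

# Warehouse-style query: total amount per region, i.e. an aggregate over
# a measure column grouped by a dimension column. Only two of the three
# columns are scanned.
totals = {}
for region, amount in zip(columns["region"], columns["amount"]):
    totals[region] = totals.get(region, 0) + amount
print(totals)  # {'east': 275, 'west': 250}
```

In the row layout the same query would read every full record, including attributes the query never uses.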
Aggregation of measure columns is performed based on the values of the dimension columns. We are not going deep into data warehousing here, but there are dimension tables and the operations are mostly aggregate operations: we want to do some sort of analysis. A projection of a table is stored sorted on the dimension values, and in a column-oriented store, multiple indexes are required if different projections are to be indexed in sorted order, because an organization may have different views over different types of data, which need to be stored accordingly. As data storage techniques we have seen B+ trees and join indexes: one representation is row-oriented, the other is column-oriented, and a join index allows the data in the two to be linked to one another. All of this is covered in any standard database book; we mention the Enterprise Cloud Computing book because we are primarily following it, but these are very standard operations. Now let us look at parallel database architectures, which are broadly divided into three types: shared memory, shared nothing, and shared disk. In the typical shared-memory structure, different processors share the memory; in shared disk, different processors share the disk; in shared nothing, each processor has its own disk. Shared memory is suitable for servers with multiple CPUs, where the memory address space is shared and managed by an SMP operating system.
The SMP operating system schedules processes in parallel across the processors, exploiting the shared address space: I have a shared memory space and execution proceeds over it in parallel. At the extreme other end is shared nothing: a cluster of independent servers, each with its own disk space, connected by a high-speed backbone network; each server works on its own disk and carries out its part of the execution. In between is the shared disk, a hybrid architecture so to speak: independent servers in a cluster access common storage over a high-speed network, which can be NAS or SAN, with the cluster connected to the storage via standard Ethernet, Fibre Channel, and so on. Based on the kind of parallelism your application needs, you can go for any of these structures: shared memory is more efficient when the working sets are compact, whereas if the processors work independently on separate data sets with little interaction, shared nothing has the advantage. Now, what features of parallel relational databases make them advantageous for this sort of operation? They allow efficient execution of SQL queries by exploiting multiple processors, and in a shared-nothing architecture, tables are partitioned and distributed across the processing nodes. If it so happens that I can partition a table so that its contents can be processed in parallel, the partitions can be distributed to different nodes and the processors can work on them; whether that is possible depends entirely on your workload.
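A minimal sketch of the shared-nothing partitioning idea: rows are hash-partitioned on a key across nodes, each node aggregates its own partition, and a coordinator merges the partials. The node count and the rows are invented, and real systems use stable hash functions rather than Python's built-in `hash`.

```python
NUM_NODES = 3

def partition(rows, key, num_nodes=NUM_NODES):
    """Distribute rows to nodes by hashing the partitioning key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts

rows = [{"id": i, "amount": i * 10} for i in range(100)]
parts = partition(rows, "id")

# Each node computes a partial aggregate in parallel; the coordinator
# merges the partials into the final answer.
partials = [sum(r["amount"] for r in p) for p in parts]
print(sum(partials))  # 49500, same answer as a single-node scan
```

The merged result matches a sequential scan; the benefit is that each partial sum can run on a different machine against its own local disk.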
Whenever we need to perform a join across partitions, we fall back on the SQL optimizer, which handles distributed joins. Distributed two-phase commit and locking provide transaction isolation between the processors. These systems are also fault tolerant: system failures are handled by transferring control to a standby system, so I can keep standby systems under some protocol or policy, and if there is a failure, that particular execution is shifted to one of them. Restoring computations over data in this way is mostly required for data-warehouse-type applications. There are well-known databases capable of handling parallel processing: for traditional transaction processing, Oracle, DB2, and SQL Server; for data-warehouse applications, Vertica, Teradata, and Netezza. With this background in our store, let us look at cloud file systems. As we understand, we cannot throw the whole thing out and start something totally new: these databases have matured, they are fault tolerant and efficient, we have RAID and similar mechanisms, so we exploit them and add some more philosophy on top for the cloud. One predominant cloud file system is the Google File System, GFS, and back to back we have an open-source counterpart called HDFS, the Hadoop Distributed File System, which is more or less a one-to-one mapping of the Google File System. GFS is designed to manage relatively large files using very large distributed clusters of commodity servers connected by a high-speed network. Whether GFS or HDFS, they are able to work on very large data files distributed over commodity servers, typically Linux machines interconnected by a very high-speed network.
Such systems can handle failure even during the read or write of an individual file: if a failure occurs in the middle of a read-write operation, it is handled. Fault tolerance is definitely a necessity here. For a simple system of n components, each failing independently with probability p, the probability of system failure, that is, of at least one component failing, is 1 − (1 − p)^n; when n is pretty large this approaches one, so component failures must be treated as routine rather than exceptional. GFS supports parallel reads, writes, and appends by multiple simultaneous client programs. HDFS, the Hadoop Distributed File System, is an open-source implementation of the GFS architecture and is available, for example, on the Amazon EC2 cloud platform. In the big picture of a typical GFS deployment there are a few components: a master node (called the name node in HDFS), the client applications, and a set of chunk servers (data nodes in HDFS). A single master controls the namespace: logically one master holds the metadata that tells us how files are stored and how data is referred to. Large files are broken into chunks (blocks in HDFS) and stored on commodity servers, typically Linux machines, called chunk servers in GFS and data nodes in HDFS; each chunk is replicated three times, on different physical racks and network segments.
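The failure probability quoted above can be checked numerically. The per-component failure probability and cluster size below are illustrative values, assuming independent failures.

```python
# P(at least one of n components fails) = 1 - (1 - p)**n,
# with per-component failure probability p and independent failures.

def p_any_failure(p, n):
    return 1 - (1 - p) ** n

print(round(p_any_failure(0.001, 1), 4))     # 0.001 on a single server
print(round(p_any_failure(0.001, 1000), 4))  # ~0.6323 across 1000 servers
```

Even with very reliable individual machines, a thousand-node cluster sees a failure more often than not, which is why replication and failure handling are designed in from the start.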
The chunk servers or data nodes are the main custodians of the data, and every data item is replicated at least three times on different physical racks and network segments. Now consider the read operation in GFS. The client program sends the full path and offset of the file it wants to read to the master (the name node, in HDFS; we will say GFS master node with the understanding that it corresponds to the HDFS name node). The master replies with the metadata for one of the replicas of the chunk where the data is found. The client caches the metadata for faster access and reads the data from the designated chunk server; for a read operation, any of the replicated chunk servers will do. The write or append operation in GFS is a little trickier. The client program sends the full path of the file to the GFS master (HDFS name node). The master replies with the metadata for all replicas of the chunks where the data is found. The client sends the data to be appended to all the chunk servers, and the chunk servers acknowledge receipt of the data. The master designates one of the chunk servers as primary, and the primary appends its copy of the data into the chunk by choosing an offset. Interestingly, the append can even be done beyond the end-of-file, to account for multiple simultaneous writers, and is consolidated at a later stage. The primary then sends the offset to the replicas; if any replica does not succeed in writing at the designated offset, the client retries. The idea is that whenever I look for a piece of data, ideally all three replicas hold it at the same offset, so reads proceed without delay: once the offset is calculated, the other chunks can be accessed directly at that offset.
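The read path just described can be sketched in a few lines. All the class names, the 64 MB chunk size, and the lookup interface here are simplifications invented for illustration; this is not the real GFS API, only the shape of the master/chunk-server/client interaction.

```python
class ChunkServer:
    """Holds chunk data; stands in for a commodity Linux server."""
    def __init__(self, chunks):
        self.chunks = chunks  # chunk index -> bytes

    def read(self, chunk_index, offset_in_chunk, length):
        data = self.chunks[chunk_index]
        return data[offset_in_chunk:offset_in_chunk + length]

class Master:
    """Single master: maps (path, chunk index) to its replica servers."""
    def __init__(self, chunk_map):
        self.chunk_map = chunk_map

    def lookup(self, path, chunk_index):
        return self.chunk_map[(path, chunk_index)]

class Client:
    CHUNK_SIZE = 64 * 2**20  # GFS uses large chunks, e.g. 64 MB

    def __init__(self, master):
        self.master = master
        self.cache = {}  # metadata cache: ask the master only on a miss

    def read(self, path, offset, length):
        chunk_index = offset // self.CHUNK_SIZE
        key = (path, chunk_index)
        if key not in self.cache:
            self.cache[key] = self.master.lookup(path, chunk_index)
        replicas = self.cache[key]
        # for reads, any replica of the chunk will do
        return replicas[0].read(chunk_index, offset % self.CHUNK_SIZE, length)

server = ChunkServer({0: b"hello gfs world!"})
client = Client(Master({("/logs/a", 0): [server]}))
print(client.read("/logs/a", 6, 10))  # b'gfs world!'
```

Note that file data never flows through the master; it only serves metadata, which is what lets a single master coordinate a very large cluster.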
For fault tolerance in the Google File System, the master maintains regular communication with the chunk servers through heartbeat messages, a sort of are-you-alive check. In case of a failure, the chunk-server metadata is updated to reflect the failure, so that the master does not allocate to that server next time. For failure of the primary chunk server itself, the master assigns a new primary to continue the work. Clients will occasionally retry the failed chunk server, update their metadata from the master, and retry, since the failed server is flagged. Another related concept is BigTable: a distributed structured storage system built on GFS. Data is accessed by a row key, a column key, and a timestamp, so multiple instances of a value can be stored. In BigTable, each column can store an arbitrary name-value pair in the form of a column family and a label; the set of possible column families for a table is fixed when the table is created.
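A toy version of the data model just described: cells are addressed by (row key, "family:label"), column families are fixed at table creation while labels appear on the fly, and each cell keeps multiple timestamped versions in decreasing timestamp order. The class and the web-table-style values are invented for illustration.

```python
from collections import defaultdict

class Table:
    def __init__(self, column_families):
        self.families = set(column_families)  # fixed at creation time
        # (row, "family:label") -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family {family!r}")
        versions = self.cells[(row, column)]  # labels created dynamically
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row, column):
        versions = self.cells[(row, column)]
        return versions[0][1] if versions else None  # latest version

t = Table(column_families=["anchor"])
t.put("com.example", "anchor:home", 100, "old title")
t.put("com.example", "anchor:home", 200, "new title")
print(t.get("com.example", "anchor:home"))  # new title
```

Keeping versions sorted by decreasing timestamp makes the common case, reading the latest value, a single lookup at the head of the list.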
Labels within a column family, on the other hand, can be created dynamically and at any time. Each BigTable cell, addressed by row and column, can store multiple versions of the data in decreasing order of timestamp; the chronology is maintained in that fashion. Structurally there are tables and tablets in a hierarchical arrangement, with a master server that is primarily a registry or metadata repository. Each table in BigTable is split into row ranges called tablets, and each tablet is managed by a tablet server; each column family for a given row range is stored in a separate distributed file called an SSTable. This kind of management keeps the access rate high at the end of the day. A single metadata table is maintained by the metadata server, but the metadata itself can be very large, in which case it too is broken down, split into metadata tablets, with a root tablet pointing to the other metadata tablets; together they realize the metadata service. BigTable supports large parallel reads and inserts, even simultaneously on the same table. Insertion in sorted fashion requires more work than a simple append; this is true for other databases as well, because an insert means pushing data aside to create an insertion point, whereas an append puts the data at the end of the storage or table. Next is Dynamo, developed by Amazon, which supports a large volume of concurrent updates.
Each update can be small in size; this is different from BigTable, which supports bulk reads and writes. The data model of Dynamo is a simple key-value pair, well suited for web-based, e-commerce-type applications, and it does not depend on an underlying distributed file system: failure handling, conflict resolution and so on are done by Dynamo itself. In the typical Dynamo architecture there are several virtual nodes mapped onto different physical nodes, with logical connectivity between them. Dynamo stores key-value pairs where the value is an arbitrary array of bytes. It uses MD5 to generate a 128-bit hash value, which determines the virtual node a key maps to; the range of the hash function is mapped, as we were discussing, onto a set of virtual nodes arranged in a ring. An object is replicated at its primary virtual node as well as at N − 1 additional virtual nodes, placed so that the replicas fall on distinct physical nodes. Each physical node manages a number of virtual nodes at distributed positions on the ring: the physical node servers are linked with the virtual node servers. This architecture gives load balancing and lets Dynamo handle transient failures and network partitions. A write request on an object is executed at one of its virtual nodes, which forwards the request to all other nodes holding replicas of the object: if an object is replicated on N − 1 other nodes, one copy is updated and the rest are notified. A quorum protocol maintains eventual consistency of the replicas when a large number of concurrent reads and writes are going on, by fixing the minimum number of replicas that must participate in each operation.
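The ring of virtual nodes can be sketched with MD5 from the standard library. The node names, the eight virtual nodes per physical server, and the replication factor of two below are invented illustration values; real Dynamo preference lists involve considerably more machinery.

```python
import hashlib
from bisect import bisect_right

def ring_pos(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)  # 128-bit hash

physical = ["node-a", "node-b", "node-c"]
VNODES = 8  # virtual nodes per physical node, spread around the ring

ring = sorted((ring_pos(f"{p}#{v}"), p) for p in physical for v in range(VNODES))

def preference_list(key, n=2):
    """Physical nodes holding the primary copy and n - 1 replicas."""
    start = bisect_right(ring, (ring_pos(key),))
    owners = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]  # walk clockwise
        if node not in owners:                   # skip same physical node
            owners.append(node)
        if len(owners) == n:
            break
    return owners

print(preference_list("cart:12345"))  # two distinct physical nodes
```

Because each physical node owns many scattered virtual positions, adding or removing a server moves only a small, evenly spread slice of the keys, which is the load-balancing property the lecture mentions.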
Next, Dynamo uses distributed object versioning: each write creates a new version of the object, stamped with the local timestamp of the node that created it, and there are algorithms for quorum consistency. If R replicas participate in a read operation and W replicas in a write operation, the system is quorum consistent when R + W is greater than N, the number of replicas. There are overheads either way: for efficient writes, a large number of replicas have to be read, and for efficient reads, a large number of replicas have to be written; those are the two trade-offs. At the node level Dynamo is implemented over different storage engines: Berkeley DB is used by Amazon, and it can also be implemented using MySQL and others. The final concept we have is the datastore: Google and Amazon offer simple, traditional key-value-style database stores, namely Google App Engine's Datastore and, in the case of Amazon, SimpleDB. All entities (objects) in the Datastore reside in one BigTable, and the Datastore exploits column-oriented storage: it stores data as column families, unlike our traditional relational model, which is row- or tuple-based. There are several features and advantages. Multiple index tables are used to support efficient queries. The BigTable is horizontally partitioned, called sharding, across disks, and stored lexicographically sorted on the key values; beyond that, the lexicographic sorting of keys enables efficient execution of prefix and range queries on key values. Entities are grouped for transactional purposes, because a transaction touches a set of entities that are accessed together frequently. And index tables support varied kinds of queries, so we can have different indexes for different types of queries.
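The quorum rule above fits in one line of code. The replica configurations below are illustrative.

```python
# With N replicas, reading R and writing W of them guarantees that every
# read set overlaps every write set (so a read sees the latest committed
# write) exactly when R + W > N.

def quorum_consistent(n, r, w):
    return r + w > n

print(quorum_consistent(3, 2, 2))  # True: any 2 readers overlap any 2 writers
print(quorum_consistent(3, 1, 1))  # False: a reader can miss the writer
```

Tuning R down for fast reads forces W up, and vice versa, which is exactly the read/write overhead trade-off described above.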
We should understand that this is not a small database but a very large one, so we cannot churn the whole database for every query; we need to slice it appropriately so that queries of different varieties can be executed efficiently. There are a few more properties: indexes are created automatically, such as single-property indexes and a kind index that supports efficient lookup queries of the select-all form; indexes are also configurable, and at query execution the index with the highest selectivity is chosen. With this we will stop our discussion here. What we tried to see are the different aspects: we have the notion of traditional databases, which are established, fault tolerant, and efficient, with different mechanisms behind them, and we have also seen parallel execution. When we deal with the large volumes of data that are likely to be in the cloud, we need to look at these different aspects: we may not be able to follow the row-oriented relational database and may need to go for a column-oriented database, and there are distributed file systems like GFS and HDFS, and over them the Datastore, Dynamo, and SimpleDB, which are implemented by various cloud service providers (CSPs) for efficient storage, access, and read-write execution of very large databases. Thank you.