In the 15 years I've been working in analytics, I've seen a growing focus on data governance. In this data governance tutorial, I'll go over what data governance is, why it's important for pretty much every business or organization, and what sets good data governance apart from poor or even non-existent data governance at other companies. Hi, I'm Jen. I help people learn about analytics skills and careers. Check this video's description for additional resources.

As the amount of available data has grown exponentially over the last decade, and more regulation has been put in place around data and information management, many organizations have started to think more about data governance. So what exactly is data governance? Data governance is the rules, processes, and accountability around data. There are multiple goals of data governance: you want the organization to use data in a routine way, for sources to be harmonized, for the people who have access to be the people who need access, and for people who shouldn't have access to not have it. It also means ownership of the data: who's responsible for it being right, and who's responsible for it being managed and updated correctly? Successful data governance considers the who, what, when, where, how, and why of the data it's governing, while controlling the security of the data and ensuring compliance, among many other things. Data governance should also be concerned with how this data can be made useful to the organization. How can we do more than just have a giant storage location for information?

You may have also heard about data management. What's the difference between data management and data governance? The main difference is that data governance outlines the overall structure that should exist. It has the rules, the processes, the accountability. It's more about what should happen and how things should happen. Data management is more about implementing all of those rules. It's the hands-on, everyday
work to ensure that the governance that's been put in place is being followed. So it's the IT teams executing on it; it's the day-to-day management of information, access requirements, and whatever else people may need when it comes to the data. If you want to know more about data management, I'll link to a video on my second channel for Avant Analytics, my consulting company.

Now let's talk about why data governance matters. I mentioned that data governance isn't just about the rules; it's also about the use of the data and making it useful. Really good data governance implementation means that quality data is accessible to the right people, and only the right people, in an efficient way throughout the organization. It means not having multiple databases that hold the same information, or system access for people who shouldn't have it. It's making sure these controls are in place so that there's a consistent understanding of who has access, why they have access, and what they're doing with that information.

Let's talk about getting started with data governance. What does that actually look like, practically, for an organization that's implementing it or focusing more on it? When it comes to data governance, one of the first things you want to think about is who's involved. Typically there are multiple roles in a very structured, larger organization that is implementing data governance or has been working on it for a while. In smaller organizations, or ones that are new to data governance, you might see these roles overlap. Let's talk about what each of those roles is first, though.

The first role is the data owner or data sponsor. These are people who have ultimate decision-making ability about the data and ultimate accountability for that data being correct and up to date. Typically these people are going to be higher up within the organization and have the ability to order or ensure that those working beneath them are complying with what
is outlined in the roles that are defined as part of data governance. There are typically many different data owners and sponsors, often overseeing specific types of data. So in an organization you might have a manufacturing data owner or sponsor who's responsible for maintaining everything related to a product that's being manufactured. You may also have a sales data owner, or even one for a segment of sales: you might have a commercial and a residential data owner, depending on the different applications your company is working with. They're responsible for owning, providing, and following whatever guidelines are outlined for their set of data.

You can still have multiple data owners even in companies or organizations that aren't dealing with physical products. For instance, in a local government organization you may have one person who's responsible for voter registration data, one who's responsible for real estate tax data, and one who's responsible for home ownership data for the locale. So regardless of the scale or the type of product or service your organization provides, you can still have multiple data owners. If you're in a super small company, say a dozen people, maybe there is just one person, but even then you sometimes have multiple data owners if people are responsible for different parts of the business.

In addition to data owners, larger organizations will also have data stewards, subject matter experts, and data champions. These are people who work more regularly with the data and really understand its content. They should ideally be consulted in any data governance project because they understand the more on-the-ground work that's being done with this data. They understand the different uses people have for it, and why certain people may need access for reasons an executive may not immediately see. Any time these people are left out of the data
governance process, there usually end up being a lot of headaches and hoops to jump through in the practical application of using that data. There's still room to question whether someone who had access in the past still needs it, but involving the people who have more knowledge can really help improve that process and make sure you're not backpedaling and having to redo a lot of work after you roll out these new rules.

Larger organizations also typically have a data governance committee. This group is ultimately responsible for all the decisions that are made. If there are conflicts between different groups, they can help resolve them. If there are decisions or implementations that need to be made to standardize how the data is used, stored, or accessed across the organization, this committee can act as a central resource to make sure that data of different types isn't implemented in a lot of different ways across each area. So instead of having one sales database with its own access point, and then a separate application that deals with client information or production information, the data governance committee can look at it and say: how can we integrate these better? How can we have one location? How can we make it similar regardless of the type of data you're looking for? That can usually lead to a more streamlined process overall and make it much simpler to implement changes. Rather than dealing with maybe a dozen different types of systems to access data, you consolidate to far fewer, which can pull all of the resources into one location and make it easier to teach people how to access the data they need.

Even if the data is contained within one system, this doesn't mean everybody automatically has access. The nice thing with the actual implementation is that you can still have different roles that allow access to different pieces of data, but having that
centralized location can make it a lot easier for individuals or groups within the organization who need to use data from multiple sources to complete their work.

In addition to establishing who's involved, one of the very first steps you should take is to think about the scope of the data you want to govern. It's really tempting to say, "We want to control it all," but the reality is that unless you're a very, very small company, it's usually not practical to try to control everything from the start. Instead, think about what your top priorities are. An easy solution is this: if you have areas of data management that tie to government or regulatory compliance, that's a great place to start with your data governance, because it's not just about your company or organization; it's about whether you're meeting the requirements of the law. So focus on that area, and then as you have pieces in place you can expand further and further. Any time you try to take on everything within the scope of work you're doing, you are much more likely to fail, or to take a lot longer to make the same progress, because you're trying to take care of everything at once instead of one piece at a time.

An easy comparison: think about if you tripped and fell down the stairs. If you cut yourself and were bleeding profusely, and you had a broken arm, and you hit your head, ideally, yeah, you would fix everything at once. But like the company going through this with its data, you fix the thing that's going to hurt you the most. So if you fell down the stairs and you're bleeding, you stop the bleeding. That is the most immediate, pressing concern. That doesn't mean you ignore the broken bone or the potential head injury, but you take them in order: what is the most serious thing that I deal with first? The same is true any time you're working with data. What's the most immediate need? What's going to have the most immediate negative
consequence if I don't do something about it? Then, once that's taken care of, you can move on to the next thing.

If you're not dealing with compliance issues or something that's otherwise urgent, you still need to set some sort of scope. In this case you can pick an area that may offer a lot of advantage, or just pick an area. Sometimes people get so hung up on picking the right area that they don't take action. So if you're not sure what to do, pick something. Say that you want to work on client data as the first step of governance, and then you can move on to the next step. Or pick manufacturing data. Whatever you do, don't let it stop you from doing something. This can also sometimes inform who is involved in the data governance process up front. If you're just getting started and you have people you know are eager and want to be involved, that can be a guideline for what you pick to work on. Or if you pick what to work on first, that can inform who should be involved in the process.

Now that you have the who and the what, it's time to move into more detail. Document what data you have available. These are your data sources. What information is in this data? Where does it come from? Do you have multiple sources providing the same information? Who owns the data? Who's an expert in it? How often is it updated? Who checks to make sure it's updated correctly? Who accesses it, and what do they use it for when they do? Answering these types of questions can really help you make a more informed decision about what rules, processes, and accountability you put in place for a specific type of data.

Before you jump right into making rules, it's important to understand how people are currently using the information and why they're using it in that way. Otherwise, again, you end up with a poor implementation. You end up making more work for people who still have to get their job done, but now they have someone who doesn't have any idea what
they're doing making decisions about what they can and can't have access to and how they're going to access it. This doesn't mean you're not going to bother people with the decisions you make for data governance. There are going to be people who are unhappy with the decisions you've made. But they tend to be a lot more receptive to change, and the organization is typically a lot more receptive overall, when you've at least taken the time to listen to them, account for their concerns, factor that into the decision you're making, and make an informed decision, even if you know it makes things more challenging for some individuals, teams, or departments.

It's easy for people to think about how they think the data should be used. It's a completely different story to know how it actually is being used. It's rare that there aren't some surprises along the way in how people are using information, sometimes because they can't access what they really need, so they're substituting and making adjustments to existing information to be able to do the work they need to get done.

You may also find, as you start exploring the data that's available, that there are multiple sources for the same data. In this case you have a decision to make. Do you still retain the information from multiple sources? Which is your primary, authoritative source if there's a conflict? A simple example of this is in the automotive industry. If someone files a warranty claim for their vehicle, there are multiple ways the company can get information about it. They can get mileage information based on what's manually submitted on the warranty claim: how much mileage the dealer or the customer reports was on the vehicle at the time the repair was scheduled. However, with newer vehicles that have a lot of remote technology, they can also read this information off the control units on the vehicle. So if there's a conflict there, if there's a difference between the
mileage that the dealer says and what the vehicle says, unless there's a known issue with the vehicle where the mileage would be reported wrong, typically you're going to want to prioritize what the vehicle says, what the control units report automatically, because there's usually less room for error there. You can run into these conflicts with all sorts of data. Even if you don't see immediate conflicts, it's still good to set a priority: what is your main source, and what is going to be the authority when it comes to the accuracy of that data?

As you look into the available data, in most areas you're going to find that the data being used doesn't come from just one location. It's usually made up of a variety of sources. For instance, let's take sales data. Sales data might sound like one complete, isolated thing. However, most sales data consists of client data, which is information about who made the purchase. It consists of actual sales data, like the sale date, the sale amount, and the exact order that was placed. And it often consists of some sort of inventory or production information. This isn't quite as typical for something off the shelf, where client information isn't reported, but if you're working for an organization that provides a service, any sort of customized product, or even products that come in multiple variants, this production information probably covers things like: if somebody orders a shirt, what color did they order, what size did they order, and was there inventory to fulfill it? So all of these pieces are themselves individual data sets that are brought together to form the one complete set of data, the one source of information we think of as the sales data.

To combine all of these, we get into data mapping. Data mapping tells us how the information in one of our sources relates, or maps, to data in another source, so they can be combined to give a more complete picture. So in the example of
that sales data, we would have the order information linked to customer information, probably based on a customer name or a customer ID. So if you look in the customer file, you have an ID or a name there that is unique to the individual or company making the purchase, and then in the order you have that same unique identifier, that same unique name or number. When you match them up, you look for the same value in both, and that's how you tell the systems how to combine, how to map, this information. It's the same thing with the order and inventory data, where you have the part number or the service that was ordered, and then in your inventory or service information you probably have more detail. So you have part number one in your order, you have part number one in your inventory database, and in your inventory you have more about the details of that part, and so you combine those.

Then we have cases where the mapping isn't direct. To map customer to inventory, there's no direct mapping, no direct single relationship. They only relate, or map together, because of that order information. That's a fairly simple step to get there. Sometimes it can be more complicated: there can be two, three, or more steps in between to connect these different pieces and make them relate to each other.

Another piece of information you'll typically have, or should have, about your data is metadata. Think of this as information about the data. What format should it take by default? What type of information is contained within it? For instance, let's talk about order information. Metadata would describe every field that exists in that set of data. So if we have our date, our metadata would tell us what format it's in, let's say a month-day-year format, and then what it contains, as a short description: "date of order from customer" or "date order received." It gives almost a
dictionary of sorts, and sometimes you'll see it called a data dictionary, which describes the information contained within that data set. Metadata and a data dictionary aren't always exactly the same, but in general they give you more information about what's contained there, so that everyone can understand what content should exist, and does exist, within those data fields.

Ideally, different tables of data or different data sets will have a clear mapping of how they relate, even if they have to go through one or two intermediate tables to connect one to the other. However, that's not always the reality, and when it's not, sometimes we need to use data scraping to capture the right information. For instance, maybe we need to know the mileage of that vehicle when there was a warranty claim, but what if the warranty database doesn't have a mileage field? In that case, maybe we ask that someone put that information in a text box when they submit the claim. Data scraping is going and finding that value and automatically pulling it out to see how it relates. So how might it relate on a warranty claim? Well, most vehicles have a fixed mileage or age limitation on the warranty. So if this process has to be done through data scraping, we'll scrape out what the mileage is and then map that to the warranty coverage to see: is that mileage within the limits? Is the age within the limits? Does it qualify to be covered, or does it not meet the criteria? Is it too old? Does it have too many miles? Those sorts of things. So that's a simple example of data scraping. Sometimes it can get more complicated. In general, think of it as a way to create structure where not much structure exists.

I mentioned data quality, which has a lot of different subtopics that could easily be hours on their own. I'm not going to get into all of those today, but there is one area I do want to talk a bit about. Data integrity is a subtopic within the data quality area, and
data integrity is not just about the overall quality of the data; it's about how stable our data is. How routine is it? Can we always trust it? How is it updated? How do we know that it's not corrupted? Think of data integrity as how well the accuracy, validity, and consistency of data are maintained across its life cycle. That is, from the moment we first collect that data, does it remain the same? Does it remain consistent? Does it remain true? Does it continue to be accurate as we move it around within our systems, as we put it into different tools, as different people start to use it? Do we maintain that integrity of information, so we don't end up with the telephone game you might have played in school, where you whisper something in one person's ear at the start, and by the time it's gone through 15 different people, the message out the other end is very different from the message at the start?

The same thing can happen with data for a variety of reasons. It could be system problems that create this challenge. It can also be multiple people being involved who don't understand the context. They don't understand where, why, or how the data was collected, and they make assumptions. A lot of times there are assumptions at every step, and by the time you get to the end, the data is not really representative of what you started with. This doesn't always have to be the case, and as long as you're aware of it, it's something you can put checks in place for. You could have someone who checks the source data and the end-result data to see: do they match? Do they properly convey the right information? Are they still accurate? Are they still valid? You're really checking to make sure you don't end up with a completely different picture from what you started with.

Maintaining data integrity and data quality throughout the life cycle isn't a one-off thing. You don't do it and then it's done. It's also about having checks in place to make sure everything continually
functions as expected. If you have data automatically pulled into a system every day, what checks do you have in place to make sure it all happened accurately? How do you catch mistakes so that somebody doesn't need to stumble into a problem and raise an alert? How do you automate some of that to help ensure that integrity and quality are maintained on an ongoing basis?

Ultimately, all of this data governance work should lead to sets of rules, processes, and policies that are applied across the business to make sure you have good data being used in a good way throughout the organization. It's the right people accessing the right data at the right time with the right amount of accountability. This work should inform business policies as well as data management.

As with data integrity and data quality, data governance in general is not a do-it-once-and-forget-about-it-forever sort of thing. It constantly needs to be rechecked. As fast as information is growing, and it's growing exponentially every year, you may be getting more information tomorrow than you were getting yesterday, and you may have different people doing different things with it than they were in the past. So it's important to keep up with that. If you set up data governance policies now, even if they're perfect, chances are very high that two years from now they're not going to be perfect. Something is going to have changed. So be aware of that. That doesn't mean making changes every day, every week, or every month, but it does mean that you have some periodic schedule on which you come back, review, and make sure your policies, rules, and guidelines are really keeping up with the information that's there, rather than being really reactionary. It's already a little reactionary to only follow up once a year, or whatever frequency it is, but at least then you only have a small thing to react to, instead of realizing two or five years from now that none of the
rules that you put in place are being followed because they're no longer relevant. They no longer apply to the information that's available and how people really need to use that data to effectively run the business and do their jobs.

I hope you enjoyed this data governance tutorial. If you did, please consider giving it a thumbs up and sharing it with someone you think may benefit from it. Thank you so much for watching.
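
As a companion to the data mapping part of this tutorial, here is a minimal sketch in Python of how orders might be matched to customers by customer ID and to inventory by part number, and how customers relate to inventory only indirectly, through the orders. The sample records, field names, and function names are all assumptions made up for illustration; they are not from the video, and a real system would more likely do this with database joins.

```python
# Hypothetical sample data; every name and value here is an illustrative assumption.
customers = {
    "C001": {"name": "Acme Co"},
    "C002": {"name": "Birch LLC"},
}
inventory = {
    "P1": {"description": "Blue shirt, size M", "in_stock": 12},
    "P2": {"description": "Red shirt, size L", "in_stock": 0},
}
orders = [
    {"order_id": 1, "customer_id": "C001", "part_number": "P1"},
    {"order_id": 2, "customer_id": "C002", "part_number": "P2"},
    {"order_id": 3, "customer_id": "C001", "part_number": "P2"},
]


def map_sales(orders, customers, inventory):
    """Direct mapping: join each order to its customer and inventory
    record by looking up the shared identifier in each source."""
    combined = []
    for order in orders:
        customer = customers.get(order["customer_id"])
        part = inventory.get(order["part_number"])
        combined.append({
            "order_id": order["order_id"],
            # None marks a missing match instead of raising an error.
            "customer_name": customer["name"] if customer else None,
            "part_description": part["description"] if part else None,
            "in_stock": part["in_stock"] if part else None,
        })
    return combined


def parts_by_customer(orders):
    """Indirect mapping: customers relate to parts only through orders."""
    result = {}
    for order in orders:
        result.setdefault(order["customer_id"], set()).add(order["part_number"])
    return result


sales = map_sales(orders, customers, inventory)
customer_parts = parts_by_customer(orders)
```

Returning `None` for an unmatched key, rather than failing, is one possible design choice: it keeps the order visible so that a data steward can see which records did not map cleanly.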