I'll go ahead and start by showing people what it looks like to get onboarded and what the new user experience is for Amazon Macie. To get started, as I mentioned, one of the things we've done is integrate directly with AWS Organizations, which allows you to delegate an administrator account to manage Macie at scale across all of your accounts. If you are in the organization's master account, this is where you can delegate the account that you want to use as your delegated administrator. Once you do that, that account will have the ability to control and manage Macie and also view the Macie settings across all of the accounts in your organization. And I'll show you just how easy it is to get going and enable this for all of your accounts.
Once you've delegated your admin, you'll now have the ability to see all of your accounts, and you can very quickly add all of those accounts to enable Macie across your entire organization. In addition, we have an auto-enable functionality: as new accounts join your organization, they're automatically added as member accounts and automatically have Macie enabled, which gives this master account visibility into the buckets that have been created in those accounts and how those buckets are managed.
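For teams that prefer to script this setup, here's a minimal sketch of the same steps with boto3. The account IDs and email are placeholders, and it assumes the first call is made from the organization's master (management) account and the rest from the delegated administrator account.

```python
import boto3

macie = boto3.client("macie2", region_name="us-east-1")

# From the organization's master account: delegate a Macie administrator.
macie.enable_organization_admin_account(adminAccountId="111122223333")

# From the delegated administrator account: auto-enable Macie for any
# accounts that join the organization later.
admin = boto3.client("macie2", region_name="us-east-1")
admin.update_organization_configuration(autoEnable=True)

# Add an existing organization account as a Macie member account.
admin.create_member(
    account={"accountId": "444455556666", "email": "owner@example.com"}
)
```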
So what this gives you to start with is a dashboard view of how some key controls and metrics look across all of the buckets in all of your accounts. There are three main things we track.
So whether or not the bucket's public, the encryption status on the buckets, and then how they're shared with other users, and whether or not those users are part of your organization or not. We also give you a high level view of kind of your total storage, total object count, and total buckets. And again, this visibility is across all of the accounts that you have in your AWS organization.
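If you want to pull those same dashboard numbers programmatically, a minimal sketch with boto3 might look like this; the response field names reflect my reading of the GetBucketStatistics response and may need adjusting.

```python
import boto3

macie = boto3.client("macie2")

# Aggregate, account-wide bucket metrics (run from the delegated admin
# account to cover the whole organization).
stats = macie.get_bucket_statistics()
print("Total buckets:", stats["bucketCount"])
print("Total objects:", stats["objectCount"])
print("Total storage (bytes):", stats["sizeInBytes"])
print("Public access breakdown:", stats["bucketCountByEffectivePermission"])
print("Encryption breakdown:", stats["bucketCountByEncryptionType"])
print("Sharing breakdown:", stats["bucketCountBySharedAccessType"])
```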
If any of those settings change, so if you were to make a bucket public, disable encryption policies on a bucket, or start sharing a bucket with an external party, that will generate the policy findings I mentioned. By shared we mean granting access to the bucket to a user from an account other than the bucket owner's; we consider that cross-account sharing. We'll look at the principal that you're giving access to and determine whether or not that role is a member of your organization. If it is, then we consider the bucket shared internally. However, if you've given permission on your bucket to someone who's not a member of your organization, we consider it shared externally. We also track whether or not bucket replication is set up to an external account.
These will be sent to Security Hub, and they're also sent out as CloudWatch Events. That allows you to determine through automation whether that was an expected event or something you want to take action on, and maybe open a ticket or even revert the change automatically.
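As a rough illustration of that automation path, the sketch below creates an EventBridge (CloudWatch Events) rule that matches Macie findings and sends them to a Lambda function. The rule name, the detail-type value, and the function ARN are assumptions and placeholders, not the exact setup from the demo.

```python
import boto3

events = boto3.client("events")

# Match findings that Macie publishes to CloudWatch Events / EventBridge.
events.put_rule(
    Name="macie-policy-findings",
    EventPattern='{"source": ["aws.macie"], "detail-type": ["Macie Finding"]}',
    State="ENABLED",
)

# Route matched findings to a (hypothetical) remediation Lambda function.
events.put_targets(
    Rule="macie-policy-findings",
    Targets=[{
        "Id": "remediation-lambda",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:macie-finding-handler",
    }],
)
```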
The next piece of visibility you get is being able to see all of those buckets, again across all of the different accounts, and understand how those buckets are configured from an encryption standpoint and what their public access is. We evaluate that by looking at bucket policies, bucket ACLs, and whether or not the Block Public Access feature of S3 is enabled at both the account and the bucket level to determine the effective permission set. We also track whether or not replication is set up on the bucket and whether it's replicated to an external account. From the object level, we give you visibility into how many of the objects are encrypted and what type of encryption is used, whether that's server-side encryption with S3-managed keys or server-side encryption with KMS.
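Here's a small sketch of querying that bucket inventory from the API, filtering for buckets Macie considers publicly accessible; the criteria key and the metadata field names are my assumptions about the DescribeBuckets schema.

```python
import boto3

macie = boto3.client("macie2")

# List inventoried buckets whose effective permission is PUBLIC.
resp = macie.describe_buckets(
    criteria={"publicAccess.effectivePermission": {"eq": ["PUBLIC"]}}
)
for bucket in resp["buckets"]:
    encryption = bucket.get("serverSideEncryption", {}).get("type", "NONE")
    print(bucket["accountId"], bucket["bucketName"], encryption)
```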
You also get an idea of the total size of the bucket. When we look at total size, we also highlight the aggregate size of objects that are compressed. The reason for that is that as we go to do the discovery on these objects, we're going to uncompress them, and the amount of billing associated with the job will be based on the uncompressed size.
So this gives you a hint as to how much larger the amount of data classified will be than the reported total object size. Looking at these buckets is also how I can go and create a sensitive data discovery job. I can create a job with anywhere from one bucket to as many buckets as I wish.
When I do that, the first step of the create job wizard lets you review your settings; if you want to make changes to those, you can do that at this time. Next, you'll define the scope and the type of scan that we're going to do. We have two different options. One is a one-time job, which is pretty self-explanatory: we will go and list all the objects in that bucket or buckets and perform sensitive data discovery on all of the objects within them. The second type, which is what we see customers typically use for a lot of their buckets, is the scheduled job.
And this allows you to kind of set it and forget it. We will go and process all the objects that are present there at this moment. And then every day we'll get an update from S3. You can also make this update frequency longer if you wanted to do it on a weekly or monthly basis as well.
But we will get the updated list of objects, and so every day we will go and process any new objects. Existing objects that have already been processed won't be re-scanned; we'll only look at objects that are new for that day. We also give you the ability to set a sampling depth.
So sampling depth allows you to provide a percentage of objects that you want us to do discovery on. This can be useful if you wanted to spot check a bucket, for example, and just check a random set of objects within there. So based on the percentage you give us, we will randomly choose that percentage of objects across the buckets that you have chosen and include those for the sensitive data discovery.
You also have additional ways of fine-tuning what is included or excluded. You can create additional filters based on tags on the object, the date it was modified, the file extension, or the object size, to either include or exclude objects based on those requirements. The date modified filter can be very useful for customers that are migrating over from Macie Classic, because it allows you to avoid reprocessing objects that you may have already had sensitive data discovery run on in the previous product.
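Putting those scope options together, a scheduled job with a daily frequency, a 10 percent sampling depth, and a date-modified exclusion might be created like this. The bucket name, account ID, cutoff date, and the scope key and comparator strings are placeholders based on my reading of the CreateClassificationJob API.

```python
import boto3
import uuid

macie = boto3.client("macie2")

macie.create_classification_job(
    clientToken=str(uuid.uuid4()),        # idempotency token
    name="daily-sensitive-data-scan",
    jobType="SCHEDULED",
    scheduleFrequency={"dailySchedule": {}},
    samplingPercentage=10,                 # spot-check 10% of objects
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": "111122223333", "buckets": ["example-data-bucket"]}
        ],
        # Skip objects last modified before the cutoff, for example ones
        # already scanned by Macie Classic.
        "scoping": {
            "excludes": {
                "and": [{
                    "simpleScopeTerm": {
                        "key": "OBJECT_LAST_MODIFIED_DATE",
                        "comparator": "LT",
                        "values": ["2020-01-01T00:00:00Z"],
                    }
                }]
            }
        },
    },
)
```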
The next step is to determine what, if any, custom data identifiers you want enabled. As was mentioned earlier, this was a key feature ask we had from customers: the ability to augment our managed sensitive data types with things that are unique to you, so customer identifiers, employee IDs, things that are not going to be applicable to most customers but are interesting for your particular business. You can create these in the console using regular expressions.
When you do that, you give the identifier a name, a description if you choose, and then you define the regular expression that will match the strings you're looking for. You can further narrow that down by including keywords. If you have keywords defined, and you can have up to 50, at least one of them must appear within the defined match distance of the string that matches the regular expression in order for it to be considered a match.
We also allow you to test your regular expression here with some sample data, to ensure that the regular expressions you've written operate as you expect before you go and run a large job. You can create up to 100 custom data identifiers, you can include up to 30 of them in an individual job, and there's no additional cost for leveraging custom data identifiers.
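A sketch of creating and testing a custom data identifier via the API; the regex, keywords, and match distance describe a made-up employee ID format, purely for illustration.

```python
import boto3

macie = boto3.client("macie2")

# Dry-run the expression against sample text before using it in a job.
test = macie.test_custom_data_identifier(
    regex=r"EMP-\d{6}",
    keywords=["employee id", "emp id"],
    maximumMatchDistance=50,
    sampleText="HR export: employee id EMP-123456, backup contact EMP-654321",
)
print("Matches in sample text:", test["matchCount"])

# Create the identifier so it can be attached to classification jobs.
macie.create_custom_data_identifier(
    name="employee-id",
    description="Internal employee identifiers",
    regex=r"EMP-\d{6}",
    keywords=["employee id", "emp id"],
    maximumMatchDistance=50,
)
```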
The next thing you'll do is give the job a name, and we ask for a name because we're going to make this job immutable as soon as it's created. The job will get an ARN, an Amazon Resource Name, which you can also apply tags to for things like cost allocation down the road. It will also be useful when we look at the output and the reports from a compliance standpoint. So you cannot change this job or add or remove buckets from it once it's been created. However, you can create new jobs from an existing job, and there's no quota on the number of jobs that can be created.
The output of these jobs will again be findings, and a finding is created every time we look at an object and that object contains sensitive information. We give you several different ways to view them. You can just view the full list of findings.
I tend to like looking at these more from a bucket level or by the finding types themselves. When you're looking at an individual bucket, you will also be able to see the number of findings by type. So for our sensitive data findings, in this case there's personal information, financial information, and then objects that contain multiple different categories. We also break this down by severity. If you click on any of these, it'll bring you to a narrower, filtered list of just the findings that you've decided are interesting based on those filters.
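Programmatically, the same narrowing can be done with finding criteria; the field names in the criterion are my assumption about the filterable attributes, and the bucket name is a placeholder.

```python
import boto3

macie = boto3.client("macie2")

# Sensitive data (classification) findings for a single bucket.
resp = macie.list_findings(
    findingCriteria={
        "criterion": {
            "category": {"eq": ["CLASSIFICATION"]},
            "resourcesAffected.s3Bucket.name": {"eq": ["example-data-bucket"]},
        }
    }
)

if resp["findingIds"]:
    details = macie.get_findings(findingIds=resp["findingIds"])
    for finding in details["findings"]:
        print(finding["type"], finding["severity"]["description"])
```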
When you go further and look at these, we will show you all of the information we have on the object: what bucket it was in, what path it was at, what type it was, and then what we found. In this case, for example, we found 29 credit cards, 20 addresses, and 32 names. We never, at any point in our output, include the text that we identified as sensitive information.
We don't want to add new sensitive data in the process of identifying sensitive data; we will only tell you the type and the number in the findings. We also give you more information about the resource that was affected: what type of encryption policy is on the bucket, whether or not it was public, and then the same thing at the object level as well. The reason for that is that this information, which is sent to CloudWatch Events, can be responded to in an automated fashion using things like Lambda or Step Functions.
For example, if I had an object that was not encrypted and contained sensitive data, or that was publicly accessible, I could have an automated task that goes and remediates that or opens a ticket to the owner in order to do some remediation. I do want to call out that there's a second set of output: our detailed results. This is a superset of what you find here in the findings. Objects that contain sensitive information generate these findings, and for those objects the detailed results include, in addition to the information that you see here, the location of where we found that data in the object. The location format will depend on the file type of the data.
For things like CSV or text files, we will give you the line number of where that data was found. For big data formats like Avro or Parquet, you'll get the record number. And for PDFs, you'll get the page number. This makes it a lot easier to go and identify what was found within those objects.
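For completeness, here's a rough sketch of reading the "what we found" counts out of a sensitive data finding; the nested field names reflect my reading of the finding schema, and the finding ID is a placeholder. Only detection types and counts come back, never the matched text itself.

```python
import boto3

macie = boto3.client("macie2")

resp = macie.get_findings(findingIds=["example-finding-id"])
for finding in resp["findings"]:
    result = finding["classificationDetails"]["result"]
    for group in result.get("sensitiveData", []):
        for detection in group.get("detections", []):
            # e.g. CREDIT_CARD_NUMBER 29, ADDRESS 20, NAME 32
            print(detection["type"], detection["count"])
```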
Again, those detailed results are written out to an S3 bucket that you own, which also provides long-term retention for these findings. For compliance use cases, we also include in the detailed results a record of every object that we attempted to classify and did not find sensitive information in. So again, as a summary: you'll get a finding only if there was sensitive data in an object, but the detailed results will include a record showing that we did process an object, and if it did not contain sensitive information, we say as much.
So it's important to make sure you set up that repository before you start running a lot of your jobs. In order to do that, you'll need to provide an S3 bucket. You can provide an additional prefix on it if you choose, and we also require a customer managed KMS key to write this data.
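Setting up that repository through the API is a single call; the bucket name, prefix, and KMS key ARN below are placeholders.

```python
import boto3

macie = boto3.client("macie2")

# Point Macie's detailed discovery results at a bucket you own, encrypted
# with a customer managed KMS key.
macie.put_classification_export_configuration(
    configuration={
        "s3Destination": {
            "bucketName": "example-macie-results",
            "keyPrefix": "macie/detailed-results/",
            "kmsKeyArn": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
        }
    }
)
```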
Once you start using the service, as the master or delegated administrator you'll also have the ability to track usage, see how the spend is distributed across all of your different accounts, and see which accounts are on the free trial. We have a default service quota of five terabytes per account. In the self-service AWS Service Quotas tool, you can raise that to 25 terabytes by yourself, and you would just open a support request if you'd like to go above that.
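A small sketch of pulling that per-account usage view from the delegated administrator account; the response field names reflect my reading of GetUsageStatistics and may need adjusting.

```python
import boto3

macie = boto3.client("macie2")

# Estimated cost per member account, broken out by usage type
# (bucket inventory evaluation vs. sensitive data discovery).
resp = macie.get_usage_statistics()
for record in resp.get("records", []):
    for usage in record.get("usage", []):
        print(record["accountId"], usage["type"],
              usage["estimatedCost"], usage["currency"])
```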
As was mentioned in the presentation, every account gets 30 days of the bucket inventory included at no cost when you turn on the service. The sensitive data discovery is not part of the free trial, but you do get your first gigabyte of sensitive data discovery per account included every month. But again, every account gets the bucket inventory and the evaluation and monitoring for the policy-level findings free for the first 30 days. All right, that wraps up our demo for today. I do want to let everyone know you can go ahead and quickly get started, as I showed, by starting your free trial of Amazon Macie.