Transcript for:
The Science of Security Authorization for Netflix

My name is Manish Mehta. I am a security engineer at a little-known company called Netflix. I've been there for about four years, and my projects involve secure bootstrapping, PKI, secrets management, authentication, and authorization. Authorization is what we're going to talk about today. I have a co-presenter, Torin. Hi everybody, my name is Torin Sandall.

I'm the tech lead of the Open Policy Agent project which we're going to talk about in this presentation. I've also contributed to Kubernetes and Istio. I love Golang and high quality software. So take it away, Manish.

All right, let's get started. So before we start talking about the main topic here, I want to get some background definitions out of the way using an example. Let's say I'm trying to send a request to my bank that says: transfer $1,000 from account X to account Y. In this particular case, the bank is going to perform two steps. One, it is going to first verify the identity of the requester, that's me.

That is what we call authN, authentication. And then verify that the requester, this identity, is authorized to perform the requested operation. That's authorization, authZ. Now, for some of you, it may be really obvious, but I cannot tell you how many times I get into conversations where people...

confuse these two things, and then the conversation goes nowhere. So hopefully these background definitions start us off on the same page. We're going to talk about bullet number two, not number one.

All right now one more thing I would like to say is these two steps do not need to be tied together. They do not need to happen within one system. They could be completely decoupled.

In fact, I will go one step further and say that if you tie them together, sooner or later you're going to lose your flexibility. If that statement interests you, meet me afterwards; I can go into a deeper conversation there.

So, some more background about Netflix's architecture. This is a very, very simplified, high-level view of Netflix's architecture. We have our customers, we have our back end, we have our cloud provider and partners. And then of course the CDN, which basically stores your movies and shows and gets the bytes to your TV as quickly as possible.

Now we are going to focus on this big empty box today, which is our back end that runs all our control plane. What we have there is a CI/CD pipeline, a container orchestration system, and a workflow management system, which are very similar to Kubernetes in many ways. And I think this morning you probably caught Diane's keynote.

She's the director who manages the team behind Spinnaker. So these are all the systems that basically drive and launch all the applications and workloads. Then we have the applications themselves, which are things like the API gateway, personalization, account management, key management, legal, encoding of movies, and all those things.

And then we also have batch jobs, periodic or on-demand tasks, which run in containers through our container management system called Titus. We also have some internally hosted services, for things like storage or real-time data streaming, and then of course we have employees and contractors who are responsible for bringing these applications together, running them, and maintaining them. Now, this all looks simple, and maybe not too different from your setup. But then things get challenging when this happens.

They want to talk to each other, right? Of course, there are other interactions where applications go and talk to the cloud provider resources like, you know, storage. In case of AWS, it would be S3 or database or queuing.

But today we are going to focus only on the interactions within the control plane. So all these applications, all these services, are hosted by us and controlled by us.

Now, when they want to talk to each other, you want to make sure that they have an opportunity to decide who gets to talk to them, and at what level. Of course, as I said, we're not talking about authentication here. A lot of people say that network reachability is all you need: if you have network reachability, that means somebody is authorized to talk to you.

Not really. First of all, that's not authentication, and it's definitely not authorization. So what you want to do is go to a much more granular level, not just network reachability.

But I'll give you an example. So what if one of these services is a REST-based service? Then you want to control exactly who gets to call what REST endpoint, right? So let's define this kind of problem.

I just gave an example of REST, but that alone doesn't mean much, because this is a very, very diverse back end: we have REST-based services, gRPC-based services, and some services with their own custom binary protocols that have nothing to do with any standard. So how do you solve the problem in a world like that, where you have such a diverse set of services, using all kinds of different protocols, hosting all kinds of different resources, and being called by both people and services?

So, when you try to solve that problem, you first have to define it, and with this kind of diversity the definition feels very general. The best thing I could do with that kind of problem at hand was come up with this definition: we need a simple way to define and enforce rules that read something like this.

Identity I can or cannot perform operation O on resource R, for all combinations of I, O, and R in your ecosystem. That sounds like boiling the ocean, right? However, this problem needs to be solved this way, because if you build one solution for each subset of I, O, and R, you're going to end up with nine solutions in your ecosystem and lose visibility and control

completely. So that was not an option. We had to have one system that would handle, if not 100%, then the majority of your combinations of I, O, and R, that is, identities, operations, and resources.

Now, just before you start building something like this, you have to have your guiding principles and requirements in place. So we wanted to make sure that we write down all these things before we actually propose something. So first thing first, I don't know if you caught Diane's keynote today, but she did talk about like how company culture impacts tech, as in the solutions you build.

And it sometimes goes the other way as well, where whatever you build also impacts the culture. In this particular case, this is an authorization system in a cloud-native environment where you want to make things self-serve, because at the core of our culture is something called freedom and responsibility: all our engineers, all our developers, all our teams are free to do whatever is best for their own service. In this environment, where they have ownership of their own service, they are also required to define who gets to talk to their service and at what level. So if a solution is not giving them that kind of freedom, it's not going to fly in a company like Netflix. First things first, we had to make sure that the solution works with the company's culture.

Second, resource types. As I mentioned, we don't have just one resource type, and we don't want a solution that only covers REST services or gRPC services. Remember, I'm talking about all kinds of things here, not just REST and gRPC API calls; I'm talking about SSH access too. For example, if you have a VM and you need SSH access into that system, SSH becomes your resource, right? So it's not just the API call, it's SSH too.

Next, identities. A lot of the authorization systems you'll see around are mostly RBAC, and they are either LDAP-based or some sort of AD-based. The problem there is that now you have to have accounts, and most of those systems are designed for users. But here the incoming identities can be users, where users can be full-time employees or contractors, and they can also be software, which can be batch jobs, containers running services, or VMs running services.

So all these callers need to be identified and supported. Underlying protocols. So as I said, it could be HTTP, gRPC, completely custom binary protocols.

Implementation languages. Freedom and responsibility again, where people are free to use whatever language they prefer. I mean, there could be a religious war about this: Java, Scala, Node, Ruby, Python, Rust. All right, latency. I think this is one of the requirements that I really had to think through, and it actually had a big impact on the architecture we ended up with.

So think about a Kafka cluster, right? It basically has a bunch of nodes, and each node handles 1,000 requests per second. Now, go back to your queuing theory for a little bit: at 1,000 requests per second, each request has a service-time budget of at most one millisecond, so if your authorization decision on every put or get to a Kafka topic takes more than one millisecond, you are thrashing.

You've gone over your service rate, right? That means your authorization decision has to be made in sub-millisecond time; otherwise, you're not even keeping up. So in this particular case, can you even think about an authorization decision that requires a network round trip?

You cannot. So some of these things have to be considered. Flexibility of rules: this is something Torin will talk more about, but you know your use cases today; that doesn't mean you can predict everything that's going to come next week.

So if your rule engine, or the way you write your policies, is hard-coded and doesn't let you express things in a way that feels more like a language, you can really restrict yourself in the future. So we wanted to make sure that flexibility of rules is there. And the last one I call capture of intent. What I mean by this is that when people are self-serve, they tend to make mistakes.

They're not malicious. They just didn't have their coffee, right? They think they did something, but that's not exactly what they ended up writing in the policy. So is there any way to give them the freedom, but not enough rope to hang themselves? So this is what we came up with.

We'll go one by one, but look at service A on the bottom left and service B on the bottom right. Service A is a VM running its application code, and you see a little box called the authZ agent. On the right, you have a pod with the application code and another container in the same pod, which is the authorization agent. So let's walk through this architecture step by step and see what happens.

So here you have the policy portal, where engineers or developer team members go and write their own policies for their own services. It's a UI-based system: they're able to create policies, delete policies, and reorder the rules inside the policies. Sometimes we also have to give override mechanisms to critical teams like SecOps and forensics. And all the policies are versioned and stored in a database.

Now, sometimes you have to write policies based on data that is not coming in with the request; it comes from external data. For example, let's say you have a REST-based service and you want to say that /admin/anything is only accessible by the owner of this app.

In that particular case, you need to find out who the owner of a given app is. That mapping between app and app owner comes from some external source; in this case, it could be an application ownership database, right? Another example: I have this application, and this application is only meant to be used by the finance team.

Okay, who's on the finance team? That information about users and the finance team needs to come from somewhere else, probably an employee management database. So now you're writing all these policies, and you need facts, a source of truth, for all this information, and it needs to come from somewhere else.
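To make that concrete, here is a rough sketch of what such data-driven rules could look like in OPA's policy language, Rego, which Torin introduces later in this talk. The data document names (data.app_owners, data.team_members) and the input fields are invented for illustration; the actual Netflix policies and data feeds are different.

```
package example.authz

# Hypothetical external data, kept fresh out of band:
#   data.app_owners   = { "payroll-api": "alice" }
#   data.team_members = { "finance": ["alice", "carol"] }

default allow = false

# Example rule 1: /admin/* is only accessible by the owner of this app.
allow {
    input.path[0] == "admin"
    data.app_owners[input.app] == input.user
}

# Example rule 2: the service is only usable by members of the finance team.
allow {
    data.team_members.finance[_] == input.user
}
```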

Depending on how many different types of policies you write in the future, you may fetch data from multiple sources. So we have a concept called the aggregator, whose job is to fetch all this data from different sources and keep it fresh. Then there's a concept called the distributor, which pulls all the policies and related data from the aggregator and keeps it hot.

Now the difference between aggregator and distributor is distributor is fairly scalable because it keeps everything in memory. You can slice and dice it and put it in different, let's say, cloud provider account for security and stuff like that. And then...

These distributors, as the name says, start distributing all these policies and relevant data to the authorization agents. What happens is that the authorization agents are then able to asynchronously download all this information and keep it hot. So the red arrows right there are what I call the hot path, where the request comes into the application.

It goes to the authorization agent and comes back with the answer. Now, I mentioned something about latency earlier. Notice here that we are not making a network round trip.

The authorization agent is sitting right there on the host. In the case of a pod, the two containers are right next to each other. So you're not spending a real network round trip.

Now, if you zoom in a little bit into the agent itself, it has two paths: a hot path and an asynchronous path. The hot path is the gray path, where the application is making a request for an authorization decision. Whatever request you received, for whatever resource, that information is passed to the policy engine.

You see here we are using the Open Policy Agent engine; Torin is going to talk more about it. Then we have a slow path, or asynchronous path, which is the blue path, downloading all this information periodically from the distributors. Now, this is all architecture and theory, so let's take one concrete example in a familiar-looking setup.

So think about a very, very simple REST-based payroll system. And it basically has only two REST endpoints that it exposes. One, get salary.

Second, update salary. Now, you want to write an authorization policy for this particular app. This is what you want to write. Employees can read their own salaries.

and then salaries of anybody who reports to them, right? So in this case, let's say Bob reports to Alice. Now when Bob reports to Alice, Bob is able to get his own salary, but Alice is able to get her own salary and then Bob's salary too. This is what you want to achieve.

Then you have a report generator batch job; some batch job that kicks off every, I don't know, week, on a weekly basis, and crunches some numbers. You want to give that report generator app permission to read anybody's salary. So you want something like this: getSalary/*.

And then you have, let's say, a performance review app that kicks off yearly, or every six months, whatever your company does, and goes and updates salaries. Of course, you don't want to give employees access to post their own salary updates. So you say, all right, only that application has access to the POST API.

At this point, I'm going to hand it over to Torin, who will explain how all this magic happens within OPA. Okay, thanks, Manish. So Manish just gave a great overview of how Netflix is solving authorization at scale across their stack. And what I think really resonates for me, and for a lot of us here today, is that so many organizations are trying to solve authorization and policy enforcement at scale across all these different kinds of resource types and execution environments and languages and cloud providers and so on.

Now what I also really like is this desire for a general-purpose solution that solves for all of these different combinations in a holistic way across the stack. And so this is what we set out to do when we created the Open Policy Agent project. So the Open Policy Agent, or OPA as we like to call it, is an open-source general-purpose policy engine.

What that means is that you can take OPA and you can apply it to any system at any layer of the stack. And what you get when you use OPA is this purpose-built engine that you can use to offload policy decisions to. So the idea or the way this would work is that, say, you're building this service that exposes an HTTP API.

Well, you would take that service, and you would integrate it with OPA to execute a query against OPA when it wants to enforce access controls over who can access or who can do what via the API. In that query, you would supply a bunch of input, like the method and the path and the headers and maybe the body and so on. And then OPA would take that input, that query, and it would combine it with the policies and the data and so on.

And it would evaluate all of that to produce an answer, like allow or deny, which it would then send back to your service so that it can be enforced. Now OPA itself is implemented in Go, and it's designed to be as lightweight as possible. So you can take it and run it as a sidecar next to your application, or you can run it as a host-level daemon, or you can embed it directly into your application as a library, just like Netflix is doing.

Now I said it's lightweight and the reason for that is because basically all of the policies and data that OPA uses for evaluation are kept in memory. So it doesn't introduce any kind of runtime dependencies at deployment time. So it doesn't depend on an external database or an external service or anything like that. Everything's cached in memory.

Now, in addition to the core evaluation engine that OPA gives you, OPA also provides a suite of tooling that you can use to develop your policies locally. So it gives you an interactive shell to experiment with and debug policies. It gives you a test framework to codify unit tests over your policies and so on.

Now the core thing that OPA gives you though is this high level declarative policy language. And we call that language Rego. And what Rego does is it gives you the ability to write or express policy as code. And so what that looks like when you use Rego is you write a bunch of rules in this declarative language and the rules exist to answer questions or make decisions like, you know, can user X perform operation Y on resource Z. So what we thought we would do is step through this example that Manish set up and show how you would use OPA to enforce it.

So the policy in English is fairly simple. It says that employees are allowed to read their own salary, and then they can also read the salary of anybody who reports to them. So let's look at how we would actually use OPA to enforce this. So when you're using OPA to enforce policy, what you're mainly thinking about doing is writing rules that make decisions over some data. And the language that OPA gives you to do that is purpose-built for writing policy and reasoning over arbitrary data.

And the reason for that... is because when you're thinking about policy, what you're thinking about is data and logic. And so what you really want is a language that lets you focus on exactly that.

And so that's what the language is purpose-built for. And so what we're going to do is create a rule called allow. And that rule is going to allow requests if the employee is trying to read their own salary.

Now, in order to make that decision of whether or not to allow the request, we're going to need some data to make the decision over. And so the service is going to provide some input, and you can see an example of that on the left. It provides the method and the path and then the authenticated user making the request. And then we're going to have the rule use that data to make a decision. So you can read this rule as basically: allow is true if input.method matches GET, input.path matches getSalary/{id}, and input.user matches id.
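Written out, that first rule might look roughly like this. This is a sketch: the concrete path segments and field names (getSalary, input.user, and so on) are illustrative rather than the exact policy on the slides.

```
package payroll.authz

# Example input supplied by the service with each query:
#   { "method": "GET", "path": ["getSalary", "bob"], "user": "bob" }

default allow = false

# Employees may read their own salary.
allow {
    input.method == "GET"
    input.path = ["getSalary", id]   # unification binds the variable id from the path
    input.user == id                 # acts as an equality check once id is bound
}
```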

Now the interesting thing about this example is that that id value is actually a variable. And so that variable is going to be bound when OPA evaluates the rule to a single value across all of those expressions. And so for example in the second expression in the rule, it's going to get bound to Bob in the path. And then in the third expression, that's going to act as like an equality check.

So it's going to see whether or not the input user matches Bob. And in this case it would, and so the request would be allowed. Okay, so now we're going to add another rule, called allow again, to handle this second case of where someone is requesting the salary of an employee who reports to them.

And so this rule is going to have exactly the same structure. We're going to match on the path and match on the method. But this time, we need to do something a little bit different. The input data to the policy engine is exactly the same.

But we're going to make use of additional data or context that's held in OPA. And so in this case, we see an example of the data on the left. And so we've got the management chain saying that Bob reports to Alice and Ken, and Alice reports to Ken.

And then what we're going to do is use that data, that context, to decide whether or not to allow the request. And that's exactly what's happening in the third and fourth expressions in this rule. So the third expression looks up the management chain for a given user, and then the fourth expression searches over that management chain to see if the input user is a manager.
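Sketched out in the same style (again with illustrative names, and the management-chain data shown as a comment), that second rule could look like this:

```
# Context data held in OPA, e.g. data.management_chain:
#   {
#     "bob":   ["alice", "ken"],
#     "alice": ["ken"]
#   }

# Employees may read the salary of anyone who reports to them.
allow {
    input.method == "GET"
    input.path = ["getSalary", id]
    chain = data.management_chain[id]   # look up the management chain of the requested employee
    chain[_] == input.user              # the requester must appear somewhere in that chain
}
```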

Okay, and so at this point we've actually codified the entire policy using OPA. But there are a couple other things that I want to point out before I hand back to Manish. So the first thing is that in this case we have this logic that determines whether or not one user is a manager of another.

And while it's relatively simple, you may want to have this logic reused throughout your policies, and so you don't want to duplicate it, you don't want to repeat yourself all the time. And so what you want to be able to do is share and reuse that. And so to do that, OPA gives you the ability to compose policy.

And what that means is that you can basically take logic and you can split it, you can factor it into separate rules or separate functions, and then you can call those rules or functions from other rules and functions. And so in this case, we're going to do just that. We're going to take the check to...

for managers, and we're going to pull that out into a separate function that will return true if A is a manager of B. And then all you have to do is just update the original rule, obviously. So what I haven't shown here, though, is that all of these policies are actually contained in packages, and so they're actually namespaced just like you'd be used to in a standard programming language like Go or Python or whatever.
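Putting that together, a sketch of the refactored policy, with the helper function pulled out and the package declaration included, might look like this (the package and data names are still illustrative):

```
package payroll.authz

default allow = false

# Employees may read their own salary.
allow {
    input.method == "GET"
    input.path = ["getSalary", id]
    input.user == id
}

# Employees may read the salary of anyone who reports to them.
allow {
    input.method == "GET"
    input.path = ["getSalary", id]
    is_manager(input.user, id)
}

# True when a appears somewhere in b's management chain.
is_manager(a, b) {
    data.management_chain[b][_] == a
}
```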

And so that ensures that these policies are namespaced correctly and that they don't run into collisions. The second thing I want to point out is that OPA is completely resource agnostic. So it's not coupled to any domain specific model and this is the main reason why we can say that it's general purpose.

Because regardless of whether or not you're writing policy over HTTP APIs or Kafka or SSH, it's all just data to OPA. OPA doesn't care, it doesn't matter, it's all just data. Now, obviously, if you're thinking about enforcing access control in HTTP APIs or message brokers, your performance is going to be absolutely key. And so this is something that we've designed for from the very beginning of the project.

And so, for example, if you take OPA and you use it to enforce a role-based access control policy, where the policy basically has to search for bindings that match the authenticated user and then find roles that match those bindings, you see latencies of around 10 to 20 microseconds in the worst case. (I'll sketch what such a policy can look like in a moment.) But the really cool thing here is that even as the data set grows, the latency remains relatively stable. For example, in the second row there, the data set that the engine actually has to search over is about six orders of magnitude larger than in the first one, so it scales very, very nicely.

Okay, and so while you can take OPA today and use it to enforce authorization policies in your services, you can also use it to enforce a variety of other kinds of policies throughout the stack. For example, we have integrations, and we've shown how you can use it to enforce admission control policies, workload placement policies, risk management policies, rights elevation, and more. To do that you don't have to start from scratch, because we've got a bunch of great tutorials on the website and a number of pre-built integrations that you can use out of the box for projects like Kubernetes and Docker and Istio, and of course we've got many more coming.

So I just want to say that we're very excited about the Open Policy Agent project, because it provides a reusable building block to the community and to the ecosystem, and it helps solve fundamental security problems like authorization across the stack. Because ultimately, at the end of the day, we all need a way to control who can do what throughout our systems.
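For reference, the role-based access control policy mentioned above, find the bindings for the authenticated user, then the grants attached to those bindings, could be sketched in Rego roughly like this. The data layout (data.user_bindings, data.role_grants) and the input fields are invented for illustration:

```
package rbac.authz

# Hypothetical data layout:
#   data.user_bindings = { "alice": ["dev-team"] }
#   data.role_grants   = { "dev-team": [{"action": "read", "resource": "server123"}] }

default allow = false

allow {
    role = data.user_bindings[input.user][_]   # roles bound to the authenticated user
    grant = data.role_grants[role][_]          # grants attached to each of those roles
    grant.action == input.action
    grant.resource == input.resource
}
```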

So before I hand back to Manish, I'd just like to point everybody at the repo, please check it out, give us your stars. And we also have a demo booth in the vendor area. So if you're interested in this kind of thing and you want to see a demo, please come on by and say hi. OK, Manish, back to you.

Thanks, Torin. So OPA is amazing. It has a lot of flexibility.

And as you saw from some of the policy snippets, it's not that hard from a syntax perspective. However, we're talking about a company like Netflix, which has hundreds of teams. Remember, go back to the original requirement of self-serve: I have to make this system self-serve. These teams are very competent, but sometimes they forget their coffee, so I really don't want them to write any complex-looking code. What we had to do was make sure their life is as easy as possible when they start writing their policies. So we ended up taking two steps. The first step was to build a UI on top of the OPA language so that the complexity of the language is hidden from them. I'll give you an example here; it's animated, I don't know if it's very visible, but this is the UI, and it does the exact same thing underneath.

It basically converts the UI actions into OPA policy. In this particular case, all I'm doing is saying that this POST endpoint should only be accessible by the performance review application, right? This is what I call capturing intent: their intent is just to allow this particular application, and this hopefully very intuitive UI lets them do that without making many mistakes.
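Underneath, the generated rule could look something like this sketch. The caller-identity field name (input.caller) and the application name are made up here; the actual policy the portal emits is Netflix-internal.

```
default allow = false

# POST /updateSalary/<id> may only be called by the performance review application.
allow {
    input.method == "POST"
    input.path[0] == "updateSalary"
    input.caller == "performance-review-app"
}
```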

And then a second example: the get salary endpoint. It's slightly more complex because it has more than one rule: you have the employee, then the manager, and then the report generator application. So in this particular case you have three rules, and as you see, the animation is doing very similar stuff to what Torin was showing, it's just in UI form. Fortunately, in this particular case these three rules don't overlap, so the order of the rules won't matter.

However, the way we write policies is basically this: if you have ever had the pleasure of configuring iptables in the past, you put all the specific rules at the top and the generic catch-all rules at the bottom. So the way we have built this is that the UI allows you to arrange your rules the way you want, and they will be evaluated in the order they are listed.

That helps you write policies the way you intend. We'll take all the questions at the end. So one more thing we had to do: yes, this is good and handy, but it still doesn't answer the question, did you capture the intent? Because the intent lives only with the person who's actually making these rules.

They know in plain English what they want to achieve, but they don't actually know whether what they wrote is going to do what they think it does. So the second step we took was to build a unit testing mechanism into this UI. Unfortunately I don't have a screenshot for that at this point, but what we ended up doing is we said, okay, you want to write this policy.

You finish writing the policy, and then you write a test for it: whatever you think you did, this test should pass. You can have positive unit tests or negative unit tests, and then, before you actually save your policy and it gets pushed into production, it will run all the unit tests, and only when they pass will your policy be updated in production. Now, what happens is a policy gets written, and six months later somebody wants to go and add one rule to it, and they've completely forgotten the intent they had six months back. These unit tests will save their day, because the unit tests are saved with that version of the policy.
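In OPA itself, policy unit tests are just more Rego: rules whose names start with test_, run by the built-in test framework. A hedged sketch of a positive and a negative test for the payroll policy sketched earlier (same illustrative package and field names) might look like this:

```
package payroll.authz_test

import data.payroll.authz

# Positive test: Bob can read his own salary.
test_employee_reads_own_salary {
    authz.allow with input as {"method": "GET", "path": ["getSalary", "bob"], "user": "bob"}
}

# Negative test: Bob cannot update his own salary.
test_employee_cannot_update_salary {
    not authz.allow with input as {"method": "POST", "path": ["updateSalary", "bob"], "user": "bob"}
}
```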

As soon as you update the policy, all the unit tests that you thought about earlier will run before the policy is pushed into production. So, yeah, we don't want to be a gatekeeper, as Diane mentioned this morning, but we do want to provide guardrails, and this built-in unit testing is basically the guardrail that we built on top of the UI. So, just to summarize everything here: we have this very diverse back end with services that use all kinds of different protocols and host all kinds of different resources, with clients that look like people, jobs, VMs, batch jobs running in containers, and whatnot. We had to first solve the authentication problem, which we did, and then we had to make sure that the authorization system is flexible and extensible. Latency was also a big deal.

I think Torin showed some numbers from OPA's perspective, and when we did our own benchmarks, basic policies could easily be evaluated in less than 0.2 milliseconds. That works for Kafka, and if it works for Kafka, it probably works for all the other services inside Netflix at this point. At Netflix scale, coordinating updates is very hard. So if you had some kind of hard-coded rule mechanism instead of a language-based evaluation engine, you would have a really hard time pushing out any sort of updates over time.

Once you have a language-based system, it is very easy to support new kinds of use cases. And then, obviously, to be culturally successful in a company like Netflix, your solution has to provide something that goes well with freedom and responsibility.

So having a self-serve system with a good UI and good guardrails is what actually makes this project interesting and successful. In closing, something I want you to take away is that authorization is a fundamental security problem. It is not new to the cloud, by the way.

Cloud just makes it more interesting because of the way it works. And if you're not solving this problem yet, you're going to be there soon. You can't just wish this one away, because, you know, in our parents' day you had network security and that was enough; it's definitely not enough in a cloud environment. What I would say one more time is that if you are going to tackle this problem, try to see how you can have one comprehensive solution rather than some hodgepodge of nine different authorization systems in your back end. Because at the end of the day, if they don't talk to each other and you don't have a common

place where you can go and get some visibility, it's going to be really messy. And then you have open source projects like OPA that you can make use of. In fact, I only came to know about OPA earlier this year; I knew my requirements, and as soon as I saw it, I thought, this fits my requirements. Even if, let's say, a language is not Turing complete, that doesn't mean it's not good enough.

It's still a language, right? So go look around for open source projects, and if one fits your requirements, you'll be able to get there faster. And the last thing I would say is that you don't have to build this alone.

Like this problem is not necessarily new. So a lot of people are thinking about it. There's a very young community called Padme.

They actually had a session earlier today. So if you are interested, maybe you should get involved in the community, so that you can solve this together with other people; you may even end up learning more about this problem, and you may find some use cases you hadn't thought about. All right, thank you so much. I think we can take a couple of questions.

So the question is, is it available for public use? Which part? Yeah, so the Open Policy Agent, OPA, that's totally open source.

It's been open source since day one. It's Apache 2 licensed; you can check it out on GitHub. The UI is purpose-built for Netflix at this point, but I would say a UI is very, very specific to your environment as well, and I don't think it's the biggest component of this whole project anyway.

So, yes. How do I compare this project with Istio initiative? I will not try to compare this because I don't know a lot about Istio's initiative about authorization. But I would say one more thing. Remember, I have to solve this problem for even SSH.

And I don't think Istio does SSH, right? We can talk later, but I mean, this project started about a year back, and I had not heard about Istio back then.

But yeah, we can talk. Yeah. So I should have mentioned that.

So the question is, does the distributor send only the relevant set of rules to an agent? From day one, we designed the system so that not only does it send only the very specific rules that are applicable to you, but the updates are also delta updates. It's not sending everything; only the things that changed are sent over the wire, otherwise this would just become a mess. You're right.

By the way, we are right here for the next 10 minutes or so, so if you have more questions, feel free to come by. Thank you so much for your time today. I hope this was helpful.