Building a Resilient Ride-sharing App

Alright guys, so I'm happy to show you this project that I've been working on in my spare time. I've basically been trying to build an app similar to like Uber and Lyft. I think it's a good example of Elixir in action and it's going to show us, I'm going to show you a couple things.

I'm going to show you guys GenServer as an actor. Just before I start, quick question, how many of you guys have used the GenServer? So you guys know what a gen server is? Okay.

So we'll talk about that. I'll explain that to you. We'll talk about supervision, so how you can monitor your system, how you can make things resilient to failure.

So when things break down, how it will like repair itself. And I'll show you guys a couple techniques that I kind of ran into as I was going through developing this. So quick caveat, I never ran this in production. I'm just researching and I didn't test it under load. So use it at your own risk.

So why did I build this? Well, I wanted to find a good learning tool, something to copy, like a hard problem that would teach me Elixir. I feel like a ride sharing app is a perfect example because there are so many asynchronous things happening.

If you think about it, you know, you're on Uber and you're requesting a ride. It's got to go out to the server, it's got to find out what drivers are around you, it has to pick a driver and assign the driver to come pick you up. So there's a lot of asynchronicity there, and then there's also a lot of failure conditions. You know, what happens if you're telling the driver to come pick up and then the passenger's phone dies, right? So you have to handle all these weird situations, and it's very communication heavy, which Erlang, obviously coming from Ericsson, is great at. So I did a couple experiments to start.

I tried to build something with Phoenix and I'm not going to go through all my experimentation. You can take a look at my gist if you want to see this stuff, but I'm going to basically focus on what I learned by doing these things and show you guys a way to do it. So I came up with this thing called the grid.

It's a reference. And basically, the grid, it's kind of like a cell phone network, right? So you have this tiling data structure where you have each tile is responsible for a region.

So you could think of each of those little towers as a gen server or an actor. And we'll get into that in more detail. And each one is responsible for people in that region. So when you're on your cell phone and you dial a number, it goes out to the nearest tower near you. And then it sends your signal to some other tower where the person you're calling is.

Now what happens if a tower fails in Timbuktu? The towers in New York still work. So there's resiliency because you have this repeated structure.

So what I did was I built a tiling structure that would map the entire Earth. So you would have quadrants, and each quadrant is responsible for a latitude and longitude area. And you don't necessarily need to use a honeycomb.

Like you can use squares, you can use triangles. For simplicity, I use squares, just because the math, I'm not that great at math. So the best place to start, what I'm going to do is I'm going to show you guys some stuff, and then I'm going to show you like a code example with it. So I think the best place to understand OTP is to think about an agent. So what is an agent?

It's basically an actor. An actor is basically, it's like if you guys are familiar with Ruby or JavaScript, it's like an object. But it's even kind of more object-oriented than it is in Ruby and JavaScript, because in those languages, if you think about one object calling another object, right, when it calls another object, it's waiting.

It's stuck. And that's not the real world. If you ask somebody for something, you don't stop breathing, right?

You continue to breathe while you wait for the response to come back. So Erlang's implementation of objects is, I think, a much realer, better implementation. So an agent is one type of actor.

And it's the most simple kind to me. It's basically, it's anytime you want to read and write some state, some data. You can use an agent to do that. It's nice in the sense because when you write the code, and we're going to see an example now, you can keep the code next to the API implementation. So you don't have to define callbacks and things, it's all located in one place.

So let me hop into some code, and I'm going to show this to you. Show it to you in GitHub. Yes. Cool.

All right. Agent. Let's make this bigger. And, alright. Okay.

Is that visible? Yeah, I guess it should be bigger or smaller. Alright, so a grid.

A grid is a module and it's using an agent, right? And what we do is when we start the grid, we basically pass it an empty map. So the grid is basically just a map data structure.

By map, I mean a dictionary, not a geographic map. And so you could join the grid, and by joining the grid, what that means is you're basically just putting inside the map the coordinates of the person that's joining. Now if you want to move somebody in the grid, you're basically just updating the state again. You're saying now I want to put new coordinates.

And then the other option is to leave the grid, you basically just delete. So essentially you have creating a record in the dictionary, updating a record in the dictionary, deleting a record in the dictionary. And this is what the client code looks like.
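A minimal sketch of the agent-based grid just described, assuming function names (`join`, `move`, `leave`, `positions`) and coordinates that are made up for illustration and are not the speaker's actual code:

```elixir
defmodule Grid do
  # Hypothetical sketch: the whole grid is one Agent whose state is a
  # map (dictionary) of id => {lat, lng}.
  use Agent

  # Start the grid with an empty map.
  def start_link(_opts \\ []), do: Agent.start_link(fn -> %{} end)

  # Create: record the coordinates of whoever is joining.
  def join(grid, id, coords), do: Agent.update(grid, &Map.put(&1, id, coords))

  # Update: overwrite the stored coordinates.
  def move(grid, id, coords), do: Agent.update(grid, &Map.put(&1, id, coords))

  # Delete: remove the record.
  def leave(grid, id), do: Agent.update(grid, &Map.delete(&1, id))

  # Read the whole map back (added here only for inspection).
  def positions(grid), do: Agent.get(grid, & &1)
end

# Client code: start the grid, then join, move and leave.
{:ok, grid} = Grid.start_link()
:ok = Grid.join(grid, :mike, {10.0, 10.0})
:ok = Grid.join(grid, :sally, {9.0, 10.0})
:ok = Grid.move(grid, :mike, {10.0, 4.0})
:ok = Grid.leave(grid, :sally)
IO.inspect(Grid.positions(grid))
```

The API functions and the state-changing code sit side by side in one module, which is the convenience of an agent mentioned above.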

So you would start the grid. You would join, like here's Mike joining at these coordinates. Here's Sally joining at these coordinates.

And then you could move. And then you could leave. So I'll just quickly run the code so you can see that it does indeed work. And so there we see, we got, this is the PID of the grid, and we see Mike joins the grid, okay, Sal joins the grid. So that's basically the simplest agent, but there's a big problem with this code, and it's that you can't have one dictionary with like a billion people in it, right?

The whole purpose of having this grid is that we need to find, a passenger is going to request a ride, and we need to find the drivers around that position, right? So we can't query one map data structure for the existence of everybody on Earth. So that's basically where we come back to the next step. So we need to split up our system into many different tiles and if we're gonna have a lot of these moving parts we're gonna need a way to track them. So Erlang has this really cool system where imagine you have two actors, two processes, whatever you want to call them, two gen servers A and B.

If they're disconnected, if one of them terminates, the other one is totally unaffected, right? So it's isolated. If you monitor, if A wants to monitor B, then if B fails, if B terminates, A still will continue executing, but it will get notified that B terminated. So you want to think about like in the case that I said before where the passenger's phone died, the passenger terminates, it notifies maybe the ride actor that it's not available.

A linked process is when both of them, when one process wants to link with another. If either of them die, the other one dies, so they're basically joined for life. And so then there's this thing called the supervisor, and the supervisor is basically using those monitors, using that monitoring system, it can monitor different actors, and when something fails, it will restart it again.
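The difference between monitoring and linking can be sketched with plain processes. The module and function names here are hypothetical; `Process.monitor/1`, `spawn_link/1` and the `:DOWN` message are standard:

```elixir
defmodule MonitorVsLink do
  # A monitors B: A keeps running when B terminates, but is notified.
  # Returns the exit reason A was told about.
  def monitor_demo do
    b = spawn(fn -> Process.sleep(:infinity) end)
    ref = Process.monitor(b)
    Process.exit(b, :kill)

    receive do
      {:DOWN, ^ref, :process, ^b, reason} -> reason
    after
      1_000 -> :timeout
    end
  end

  # C links to D: when D crashes, C is taken down with it.
  # The calling process only monitors C, so it survives to report the reason.
  def link_demo do
    observer = self()

    c =
      spawn(fn ->
        d =
          spawn_link(fn ->
            receive do
              :crash -> exit(:boom)
            end
          end)

        send(observer, {:linked, d})
        Process.sleep(:infinity)
      end)

    ref = Process.monitor(c)

    receive do
      {:linked, d} -> send(d, :crash)
    end

    receive do
      {:DOWN, ^ref, :process, ^c, reason} -> reason
    after
      1_000 -> :timeout
    end
  end
end

IO.inspect(MonitorVsLink.monitor_demo())
IO.inspect(MonitorVsLink.link_demo())
```

In the ride-sharing case, the monitor is the useful one: a ride process wants to hear that the passenger went away without going down itself.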

You know what, I like to think of it as almost like, it's kind of almost like, you know, if you have like a bad day and like you tell yourself like, you know, tomorrow I'll start again, you know, I'll try it again. That's what a supervisor is. It's that process that brings you back to your known state.

So it can monitor actors when things go bad. It'll bring you back to that state. It can control how things propagate. So you could say how fast things restart.

Imagine if it's failing and it's restarted and it fails again. Maybe at some point you just don't care anymore. I'm going to show you guys that.
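As a rough sketch of that restart behaviour, here is a supervisor watching a single stand-in tile. The `Tile` module is a hypothetical placeholder; `:max_restarts` and `:max_seconds` are the standard supervisor options that decide when it gives up:

```elixir
defmodule Tile do
  # Hypothetical stand-in child: an Agent holding an empty map.
  use Agent

  def start_link(_arg), do: Agent.start_link(fn -> %{} end, name: __MODULE__)
end

# Restart a crashed child on its own (:one_for_one). If more than 3 restarts
# happen within 5 seconds, the supervisor stops trying and terminates itself.
{:ok, _sup} =
  Supervisor.start_link([Tile],
    strategy: :one_for_one,
    max_restarts: 3,
    max_seconds: 5
  )

old_pid = Process.whereis(Tile)
Process.exit(old_pid, :kill)

# Give the supervisor a moment to bring the child back to its known state.
Process.sleep(100)
new_pid = Process.whereis(Tile)
IO.inspect({old_pid, new_pid})
```

The restarted tile has a new PID and starts again from its initial state, which here is the empty map.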

So how you would use a supervisor. Let me show it to you in... Here. Supervisor.

OK. So now I took the exact same code I showed you before. I just renamed what I called grid before.

I call it now tile, because I'm going to split it out into many. So the data structure will have now many of those. And what I call grid now is a supervisor. So the supervisor is a grid monitoring a bunch of tiles.

Each of those tiles is a dictionary. And it's a very simple API. You start a grid just like you would start any GenServer, any actor. And the difference here is that I construct a giant list of tiles. Let me see if I can show you that. So I just basically go through a giant loop and I create a bunch of tiles. And then the supervisor starts with like the whole globe tiled, essentially. And so when you ask the grid to join now, you tell it you want to join some specific coordinates, and it figures out what the name is.

So you don't have to say what name of tile you're... Oh sorry, I should mention, each tile has a unique name. So if you think about a tile on a specific coordinate, the left-hand corner is used as the name of the tile. So you don't have to tell, if you need to do something with a tile to correspond with the tile, you wouldn't have to specify the name.
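A sketch of a grid supervisor that routes to named tiles. Several things here are assumptions for illustration: 1-degree square tiles, a small 20 by 20 test region instead of the whole globe, atom names of the form `:tile_LAT_LNG`, and the `put`/`delete`/`move` signatures:

```elixir
defmodule Tile do
  # Each tile is a named Agent holding id => {lat, lng} for its own square.
  use Agent

  def start_link(name), do: Agent.start_link(fn -> %{} end, name: name)

  def put(name, id, coords), do: Agent.update(name, &Map.put(&1, id, coords))
  def delete(name, id), do: Agent.update(name, &Map.delete(&1, id))
  def positions(name), do: Agent.get(name, & &1)
end

defmodule Grid do
  use Supervisor

  # Assumption: a small test region rather than the whole Earth.
  @lats 0..19
  @lngs 0..19

  def start_link(_opts \\ []) do
    Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  @impl true
  def init(:ok) do
    # Loop over the region and build one child spec per tile.
    children =
      for lat <- @lats, lng <- @lngs do
        name = tile_name({lat, lng})
        Supervisor.child_spec({Tile, name}, id: name)
      end

    Supervisor.init(children, strategy: :one_for_one)
  end

  # The tile name is inferred from the coordinates by flooring to the corner.
  def tile_name({lat, lng}), do: :"tile_#{floor(lat)}_#{floor(lng)}"

  # Callers only pass coordinates; the grid routes to the right tile.
  def join(id, coords), do: Tile.put(tile_name(coords), id, coords)
  def leave(id, coords), do: Tile.delete(tile_name(coords), id)

  def move(id, from, to) do
    if tile_name(from) != tile_name(to), do: leave(id, from)
    join(id, to)
  end
end

{:ok, _sup} = Grid.start_link()
:ok = Grid.join(:mike, {10.2, 10.7})
:ok = Grid.join(:sally, {9.5, 10.1})
:ok = Grid.move(:mike, {10.2, 10.7}, {10.9, 4.3})
IO.inspect(Tile.positions(:tile_10_4))
```

Coordinates outside the tiled region have no tile process in this sketch, so a call for them would fail; a full version would tile every latitude and longitude.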

It automatically figures it out for you because it's inferred from the coordinates. So basically the code is very similar. The difference is that when you call the grid, it actually will go out to specific tiles. And this is actually a pretty useful tool in Elixir: you can call :sys.trace on a specific name of a process. In this case, I'm monitoring a specific tile.

It will print out all the messages it receives, so it's kind of useful for debugging. So we'll see that right now. So I'm just, I'm only monitoring two tiles here because I don't want to have huge logs.
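A small sketch of tracing one process with Erlang's `:sys.trace/2`. The tile name `:tile_10_10` and the stored coordinates are made up for the example:

```elixir
# A stand-in tile: a named Agent holding an empty map.
{:ok, _pid} = Agent.start_link(fn -> %{} end, name: :tile_10_10)

# Switch tracing on for that one process. From now on the runtime prints
# *DBG* lines describing the messages the process handles.
:ok = :sys.trace(:tile_10_10, true)

# This update shows up in the trace output.
Agent.update(:tile_10_10, &Map.put(&1, :mike, {10.0, 10.0}))

# Switch tracing off again to keep the logs quiet.
:ok = :sys.trace(:tile_10_10, false)
```

Tracing works per process, so turning it on for only a couple of tiles keeps the output readable.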

It'll be like, you'll get logs for every single tile. Let's run that. We can make the window.

Oh, you mean the window? Not you. The window. Huh? The window.

You can make the window. Wait, what? Oh, like that. Okay.

Thanks. All right. So. Let's run this guy. Let's see what happens.

Okay, so you see these dbg messages? That's the trace thing. That's this thing that I showed you.

By putting a trace on the process, we're now seeing every message that's getting sent to it. So we see that when we joined the grid, this specific tile got notified, and it now has state. This is its new state.

Mike is at 1010. Then Sally joins the grid. Sally comes in. She's in tile 910. Then Mike moves.

So now you see Mike's state updates. He's now at 10.4. So basically, it's routing. The grid is a routing component, routing you to specific tiles.

OK, well, that's cool. But the thing that I wanted to get at was to be able to do a radius search, to be able to query across multiple tiles and find out who's nearby. So sometimes this goes across more than one tile. You can imagine, I want to know who's within 10 kilometers of me. If I'm in New York, the tiles might be very small, so I might need to ask many tiles to figure that out. So another very useful Elixir module is the Task module, and it allows you to do things asynchronously. So imagine you have a bunch of things that you want to do that are independent. Instead of doing them one after another, you can use tasks to run them in parallel. So instead of querying one tile, I can query like 10 tiles or 100 tiles at once. Each one will be done in parallel, then I'll get back the results, and then I'll sort them based on the distance.

Does that make sense? Yeah? Okay.

So basically you want to use a task, like here's my example over here. This is like the main thread of execution. You see over here, I'm like starting a new async task, a new async task. They could finish whenever.

So they're running on their own thread, essentially. So I'll show you guys tasks. So.

So you're running in asynchronous mode, you have a performance bottleneck on your parallelized... Uh, what do you mean? Depending on when they complete.

Oh yeah, yeah, yeah. You're not going to scale. That's what I'm saying.

Well, it depends how many you're going to go across, but you're not looking for an immediate answer. You're asking for a ride, and you can wait 10 seconds to find the latest, right? So you're trading off performance for scalability sometimes, right?

That is. For resiliency, let's say. So you're choosing scalability. Yeah, I'm choosing scalability at this point.

Okay, so this is the exact same code again. The tile is left exactly the same, and we still have the same supervisor. I've only added one new thing, which is called nearby. So if you call tile nearby, it'll look inside the tile.

It'll enumerate the tile. It'll map each location and determine what the distance is between the origin and the coordinates of that person. And then it will keep everybody that's within that radius. So it's basically going through this dictionary and saying, for each person, what's the distance that they are from the requested coordinates? Then we check, is the distance within the radius that we're looking for? If so, they're a match, so they're returned to the caller.

And in the grid, we have the same method again, where grid also has nearby. The difference is grid nearby first needs to figure out which tiles surround this radius. So what tiles should I communicate with?

You don't want to ask everybody, because if you're in New York, you don't need to ask Timbuktu. But you want to ask the people that are surrounding you. And here's the key takeaway that you call task async.

So you map over all the surrounding tiles. You call task async. And you ask the tile for a nearby coordinates radius.

So now we get back a list of lists. Each tile is going to respond back with a list. We map that list, and then we sort it based on distance.

Comparator. Where's comparator? Here. It's just basically comparing the distances. So this way, we get back a final list of everybody ordered by the distance that they are away.
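The radius search could be sketched like this. The assumptions are loud ones: distance is plain Euclidean distance on raw coordinates (a real app would want a geographic formula), the list of surrounding tiles is passed in by hand, and the module and function names are hypothetical:

```elixir
defmodule Geo do
  # Assumption: straight-line distance on raw coordinates, for brevity.
  def distance({lat1, lng1}, {lat2, lng2}) do
    :math.sqrt(:math.pow(lat1 - lat2, 2) + :math.pow(lng1 - lng2, 2))
  end
end

defmodule Tile do
  use Agent

  def start_link(people), do: Agent.start_link(fn -> people end)

  # Map each person to their distance from the origin, then keep only
  # the ones within the radius.
  def nearby(tile, origin, radius) do
    Agent.get(tile, fn people ->
      people
      |> Enum.map(fn {id, coords} -> {id, coords, Geo.distance(origin, coords)} end)
      |> Enum.filter(fn {_id, _coords, dist} -> dist <= radius end)
    end)
  end
end

defmodule Grid do
  # `tiles` stands in for "the tiles surrounding this radius".
  def nearby(tiles, origin, radius) do
    tiles
    # Ask every tile at once, each query in its own task.
    |> Enum.map(fn tile -> Task.async(fn -> Tile.nearby(tile, origin, radius) end) end)
    # Collect the answers: a list of lists, one list per tile.
    |> Enum.map(&Task.await/1)
    |> Enum.concat()
    # Order everybody by how far away they are.
    |> Enum.sort_by(fn {_id, _coords, dist} -> dist end)
  end
end

{:ok, t1} = Tile.start_link(%{mike: {10.0, 10.0}})
{:ok, t2} = Tile.start_link(%{sally: {9.0, 10.0}, bob: {2.0, 2.0}})
IO.inspect(Grid.nearby([t1, t2], {10.0, 9.0}, 2.0))
```

One caveat about this sketch: `Task.async/1` links each task to the caller, so a query against a tile that is down would crash the caller too. Tolerating a dead tile would need extra handling, for example unlinked tasks via `Task.Supervisor.async_nolink/3`.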

So that's basically, yeah, I think it's kind of cool. It's very fast. You could query a bunch at once.

Also, the nice thing is, let's talk about failure for a second. If one tile is down, it doesn't affect anyone else. You can still query the other ones that are working. So it's kind of cool.

Ready? All right. So at some point, like I showed you guys how to use an agent, but at some point there are some things agent can't do.

So agent is great for simple state like we had. We were just using a dictionary. But there are some situations where you want to have better lifecycle control.

Like, for example, you want to get a message. You want to be notified if the code version changes, or you want to be notified when some, let's say another process that you're monitoring failed, right? So you can't use an agent to do that.

You've got to then refactor your code, and you've got to take it to, turn an agent into a gen server. And the reason that I needed to do that was because I want to, like in the version I just showed you, there is a problem. And the problem is that when you write to the dictionary, you're writing, like we were showing, we wrote, just go back to the code for a second here.

Like, task. So, yeah, so like here, right? Yeah. When we say join, what we're passing as the ID is just an atom, right?

It's a symbol. But that means, what happens if Mike's device loses power? The grid will have no idea, right?

It will still be showing in the list, right? So we need a way to monitor the device to know when it disappears. So if we want to set up a monitor, remember I showed you guys before monitors and linking? You can't do that with an agent, so you've got to refactor to a gen server in this case. And that's basically what I did.

I took the tile as it was before and I just did the exact same implementation in a gen server. It's a little bit longer. The main reason it's longer is because usually in a gen server you have, it's like a callback-based approach where you would handle different, you would pattern match. So you would say when I receive these messages I do something. And that would be one function.

Then you have a second function which is the public API. So you have two moving parts. So it's a little bit more than, let's say, using an agent where there's just one, but it gives you some added benefits. So sometimes it's worth it.

I'll quickly walk through what I did. So starting now, I've started, I've changed the state data a bit. What I'm tracking is a bunch of records.

And we'll see that I, because I need to track not just the position of the user, I need to also track what the process monitoring data is. So when we join, what we do is, so here we're doing handle call. Handle call is basically just like, it's almost like a callback, right? You're pattern matching that when this object receives join, it should do this.

So what it's doing is, it's storing the position, and then this is the magic. So here it goes and it says, hey, I want to monitor this guy. So let me know if this PID, anything happens to it, notify me because I need to update my data, right?

And then it just stores, it just puts that data in the state and that's saved. This guy is another pattern match. Here we're matching when it moves. So when it moves, we just update, we just put in the state records, the position. Simple.

When we leave, the only hiccup with leaving is we obviously have to delete the person, but we also have to tell them, okay, we don't want to monitor them anymore. They said they want to leave. That's the way or whatever.

Yeah. And then we do nearby, that's the same as before, there's nothing really to show. Okay, now here's where we see the monitoring code. So this is where, because we asked to monitor the process, it's possible that we'll receive this message. This message basically says that a process has gone down, and here is its ID.

And when we get that message, we'll delete him from our records, essentially. Here's the public API, so it's pretty much standard. This is all the same as before.

This is exactly the same. The supervisor did not change at all. And now what I did was, I created just a dummy module just called driver just for testing. But essentially what I do now is, I create a driver.

So let me show you that. So here you see I start a driver. And so now the driver Mike is no more a symbol, now it's a PID.

It's a process ID. And I can pass that process ID in. I can move it around. And then later, if I want to simulate Mike's device losing power, I can exit on Mike, and it will get removed.
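A sketch of the tile rewritten as a GenServer that monitors whoever joins. The record layout, the function names and the dummy `Driver` module are assumptions for illustration, not the speaker's actual code:

```elixir
defmodule Tile do
  use GenServer

  # Public API
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)
  def join(tile, pid, coords), do: GenServer.call(tile, {:join, pid, coords})
  def move(tile, pid, coords), do: GenServer.call(tile, {:move, pid, coords})
  def leave(tile, pid), do: GenServer.call(tile, {:leave, pid})
  def positions(tile), do: GenServer.call(tile, :positions)

  # Callbacks. State is a map of pid => %{coords: coords, ref: monitor_ref}.
  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call({:join, pid, coords}, _from, records) do
    # Ask to be notified if this process ever goes down.
    ref = Process.monitor(pid)
    {:reply, :ok, Map.put(records, pid, %{coords: coords, ref: ref})}
  end

  def handle_call({:move, pid, coords}, _from, records) do
    case records do
      %{^pid => record} ->
        {:reply, :ok, Map.put(records, pid, %{record | coords: coords})}

      _ ->
        {:reply, {:error, :not_joined}, records}
    end
  end

  def handle_call({:leave, pid}, _from, records) do
    case Map.pop(records, pid) do
      {nil, records} ->
        {:reply, :ok, records}

      {%{ref: ref}, records} ->
        # They left on purpose, so stop monitoring them.
        Process.demonitor(ref, [:flush])
        {:reply, :ok, records}
    end
  end

  def handle_call(:positions, _from, records) do
    {:reply, Map.new(records, fn {pid, %{coords: c}} -> {pid, c} end), records}
  end

  # A monitored process went down: delete it from the records.
  @impl true
  def handle_info({:DOWN, _ref, :process, pid, _reason}, records) do
    {:noreply, Map.delete(records, pid)}
  end
end

defmodule Driver do
  # Dummy stand-in for a device: a process that just stays alive.
  def start, do: spawn(fn -> Process.sleep(:infinity) end)
end

{:ok, tile} = Tile.start_link()
mike = Driver.start()
:ok = Tile.join(tile, mike, {10.0, 10.0})
Process.exit(mike, :kill)
# Give the tile a moment to handle the :DOWN message.
Process.sleep(100)
IO.inspect(Tile.positions(tile))
```

Killing the driver process simulates the device losing power; the tile cleans up its own records without the caller doing anything.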

And I'm going to show you that right now. So Elixir. Is this four? Oh, yeah, there's a lot of...

Yeah, sorry, hang on. Before I do that... So I did a bunch of stuff, I join, join, then I go see who's nearby, then I sleep for five seconds. So it's going to check who's nearby, it's going to wait five seconds, it's going to kill Mike, sorry that sounds bad, terminate Mike. And then, that's the one thing about it.

A lot of times, you talk about actors dying and stuff, and people think you're talking about Hollywood television. Yeah. Yeah, it's horrible. So yeah, so basically, you're going to see the program's going to run. It's going to pause for five seconds.

Mike's device loses power, and then you're going to see that he gets removed from the grid automatically. You don't have to do anything. So four seconds, tumbleweeds. One, two, three, four, five. Yeah.

And then, so let's go back here. Tumbleweeds, device lost power. Then we search near that radius. Okay, we see that we got the down message. That was the message I showed you before.

So the process, this one, is the process of Mike. So Mike's process terminated. And we now no more have him in our records.

So that's basically how we monitor our drivers and passengers. So another type of actor, and I'm not going to show you this today, but it's a state machine. And any time you have states, like if you see in your state, for example, status, status field, you need this.

A good example would be, imagine an engine that you can either turn on or off, but you can only accelerate the engine when it's on. So you have some rules there, where you only allow certain events to occur in certain states. In that case you want to use a state machine.
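The talk does not show a state machine, so the following is only an illustrative sketch of the engine example using Erlang's `:gen_statem` behaviour. The module name, the `speed` field and the error tuples are all hypothetical:

```elixir
defmodule Engine do
  @behaviour :gen_statem

  # Public API
  def start_link, do: :gen_statem.start_link(__MODULE__, :ok, [])
  def turn_on(engine), do: :gen_statem.call(engine, :turn_on)
  def turn_off(engine), do: :gen_statem.call(engine, :turn_off)
  def accelerate(engine), do: :gen_statem.call(engine, :accelerate)

  # Callbacks
  @impl true
  def callback_mode, do: :handle_event_function

  # The engine starts in the :off state with zero speed.
  @impl true
  def init(:ok), do: {:ok, :off, %{speed: 0}}

  @impl true
  def handle_event({:call, from}, :turn_on, :off, data) do
    {:next_state, :on, data, [{:reply, from, :ok}]}
  end

  def handle_event({:call, from}, :turn_off, :on, _data) do
    {:next_state, :off, %{speed: 0}, [{:reply, from, :ok}]}
  end

  # Accelerating is only allowed while the engine is on.
  def handle_event({:call, from}, :accelerate, :on, data) do
    speed = data.speed + 10
    {:keep_state, %{data | speed: speed}, [{:reply, from, {:ok, speed}}]}
  end

  def handle_event({:call, from}, :accelerate, :off, _data) do
    {:keep_state_and_data, [{:reply, from, {:error, :engine_off}}]}
  end

  # Any other event in any other state is rejected.
  def handle_event({:call, from}, _event, _state, _data) do
    {:keep_state_and_data, [{:reply, from, {:error, :invalid_transition}}]}
  end
end

{:ok, engine} = Engine.start_link()
IO.inspect(Engine.accelerate(engine))
IO.inspect(Engine.turn_on(engine))
IO.inspect(Engine.accelerate(engine))
```

The rule "accelerate only when on" lives in the pattern match on the state argument, so there is no status field to check by hand.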

There's gen_fsm, but it's deprecated, don't use it. There's gen_statem, which is the newer one, and there are some wrappers you can use which are pretty awesome. And yeah, I showed you that demo, and so, some tips and techniques.

Yeah like I showed you I started with some small experiments that's how I got started I think that's always the way to go. What really helped me, like when I was like being confused about what I need to do, I felt that when I made a list of all the actors in the system, what states they have, like what are their states, what are the events, who are their collaborators, like who do they talk to, like that cleared it up for me. And then I had a good idea of what I needed.

You could also draw your supervision tree, which is pretty helpful. And you might want to think a little bit about how it would work on multiple nodes and multiple data centers. Like in our case with these tile things, they can actually run in different data centers, right? Because the only time they ever need to communicate is when one person leaves a tile and goes to another one, and that's fine, that could happen across data centers. So it's almost very naturally sharded. Yeah, so I guess I'll take any questions if you guys have any. That's pretty much it.
