Create a data pipeline for near real-time ingestion of Netflix clickstream or playback data. All right, thanks so much for being here with us today, Karthik. Can you quickly introduce yourself for our viewers?
Hello, thanks for having me here, Angie. I'm currently working as a senior manager and specialist lead data engineer, and I have over 15 years of experience in the data space.
My expertise is in designing, building, and optimizing large-scale data pipelines for different industries, predominantly in the big data and cloud engineering space. My tech stack primarily revolves around big data computation platforms like Spark, real-time stream processing with Kafka, and designing microservices and different architectures.
If you're enjoying this video, you can watch dozens more videos like this at tryexponent.com. Check out our real interview questions with full written solutions. Over a million people use Exponent to ace their interviews in product management, software engineering, engineering management, machine learning, data science, and much, much more. Get started for free at tryexponent.com. Yeah, and we're excited to have you.
And I think you have the perfect background for our question today. So let's get right into it. Our question today is: create a data pipeline for near real-time ingestion of Netflix clickstream or playback data, specifically designed for ad hoc monitoring of certain metrics. You said we're designing the clickstream ingestion pipeline?
Yeah, for clickstream or playback data. So let me start with some clarification questions just to scope it out.
So you mentioned metrics. Are there any specific metrics we're looking at for this particular pipeline that we'd like to model? Basically, a variety of metrics based on the user clickstream or the playback actions that are performed.
But which specific ones are up to you. Yeah, OK, cool.
So it's basically user engagement and playback data. Of course, from the clickstream perspective there's a lot of different data we can collect, so we're primarily scoping it to just this. Then I'm going to ask: do you want to keep the solution generic, or are you specifically looking for any tools, technologies, or platforms to model it on? Yeah, that's a good question. I think for now we can keep it pretty general, but if there are certain specific examples of frameworks that you would find helpful, we can totally pull that in later.
Yeah, that sounds good. I would start generic, but I would definitely bring in a few concrete technologies just to flesh out the model a bit so it's easier to understand. Good.
That's good. Okay. So starting from user engagement, I'm just going to expand a bit here, keeping it generic.
Expanding on the metrics part: clickstream is basically collecting and capturing user engagement and navigation, the behavior of users and their interactions with the application, so it's going to generate a lot of data about the user. From the user engagement perspective, one example is customer churn: say a user had been visiting the site frequently for a while but hasn't visited in the last few weeks. Netflix has close to 200 million users, and I think it's growing every day, so at Netflix scale it's very difficult to monitor every user, but they would really want to understand whom to concentrate on, or whether there's a risk of losing a set of users in the next one or two months. That means they would want to take action, and probably understand those users' interests, so they don't lose them to competitors. Those are some of the metrics we can take from a customer churn perspective. Or it could be path analysis: what is the most popular navigation path a customer takes, and what are the most visited pages on the site? We can also do some behavior profiling of the customers, which could very well be used for machine learning downstream, clustering customers.
And then you can feed that into recommendation engines as well, for recommending movies or shows that a user might like. So you had mentioned a variety of different contexts in which you would need these metrics to answer certain product questions. I was curious what that might be for path analysis.
For path analysis: say you were running a marketing promotion and you expected a lot of people to look at it, so you set a target of the promotions page getting clicked n number of times. But you haven't really seen it happening for a while. You would wonder why, and this clickstream data can help you with that.
So what is blocking the customers from actually navigating to that page? Is the path too long, say seven or eight different clicks to reach it? Usually, for a well-designed website, anything more than three clicks to reach a target page is not great, and if it's more than seven or eight clicks, you're going to lose
at least 60 to 70% of the customers along the way; they're just not going to do it. So those are some of the analyses you can do: why customers are not reaching that particular point, or what is blocking them. Okay, perfect. Yeah, those sound like some really valuable product insights. And you mentioned playback data as well.
Playback data could be about the number of sessions, what users are streaming right now, trending series and movies, and information about when they clicked pause, for example. This is actually a really interesting thing I've thought about multiple times: when you watch a movie or a series and you're really interested in it, you don't pause it that much, right? Even if it's a series with eight episodes, I've been there, you just continuously watch all eight in one night. But if it's slightly boring and you just somehow want to finish it, you pause a lot. So that's an indicator of whether that particular user is really enjoying the movie or series, or whether it's maybe not that engaging. And this also feeds into the time it took for the user to watch it. Suppose the movie is just one hour long, but the user took, say, three hours to watch it,
and if you take into account all the pause times, that means it's not really a very engaging movie. It could be trending at the top for some reason, but most of the customers are just finishing it for the sake of finishing it; they're not really enjoying it. This could even translate into recommendations, because you don't want to recommend the same type of series to a user: while they may have finished it, they may not have enjoyed it. So that's another feature or variable to take into account in our machine learning or analytics space. So those are a few things.
I mean, of course, clickstream is an ocean and we have a lot of variables in it. But these are a few things that we can really look at from the metrics point of view. Do you have any other questions?
No. I thought that was perfect. I really liked hearing your product insights. Yeah, cool. Thank you.
To keep it in scope, we're going to focus on the data ingestion pipeline itself. I'm planning to use the customer churn metric just to help exemplify the design, so I can explain how these metrics can flow through. But the main focus of this design is definitely going to be building the pipeline and the different components involved in bringing that clickstream event data into our data lake or data warehouse.
Okay, cool. I'll quickly jump on to the next section, where I can start with a high-level design. Are you okay with that?
Yeah, you mentioned you wanted to build out a pipeline, so where do you want to begin with that pipeline design? So I'm going to categorize this pipeline into different segments for ease of understanding: data capture, then streaming and processing, then storage and analytics.
Basically, any data pipeline will have all, or at least most, of these aspects. We want to capture data; if it's a streaming pipeline, we do stream processing, and if it's a batch pipeline, it runs as a batch process.
And then we obviously want to store the data and analyze it; that's exactly why we're bringing the data in at all. So I'm going to build a high-level pipeline that takes each of those aspects into account.
And then we'll have some components that can help us do that. So there are going to be a lot of ways of doing it. So we're going to be starting with an option one and probably we can discuss that and then probably move to option two.
And there are obviously quite a few other ways of doing it as well. In the time we have, I'll just start with option one and then we'll move on to other options. Cool, so let's start with the users: the users are on some devices, and let's say they're browsing a website.
I'm just going to keep it small here, a sample website, or in this case something like netflix.com. So we have the user browsing this website. For an application of Netflix scale, which is geographically distributed, we also need to consider the user base across different geographies, so there will be CDN distributions, CloudFront distributions, across regions through which users can access those links and get to the website. Geographical distribution is something we always take into account as well. So the users are navigating the website, and when they interact with it, that's when the clickstream data gets generated. Sorry, before going further I want to discuss one more thing here from the numbers point of view. Why do we need these estimates? To understand and evaluate different options: there are always trade-offs, and the cost factor versus performance, that we need to take into account. So it's better to have some numbers here.
So for Netflix, the last subscriber count I vaguely remember is around 200 million. Does that sound like a good number to start with for you? Yeah, sure. So we have a subscriber base of 200 million users, and out of that, let's say 50% of them are active on a daily basis; that's around 100 million. Then, roughly, a day has about 100,000 seconds; I'm being very approximate here. So 100 million users divided by 100,000 seconds comes out to roughly 1,000
users per second. But that assumes a uniform distribution; we're not taking any spikes into account, and that's not going to be the case all the time. It's not really going to be only a thousand users every second. So if we apply an 80-20 rule, where 80% of the traffic comes in 20% of the time, then 20% of 100,000 seconds is 20,000 seconds, and 100 million users coming in over 20,000 seconds instead of 100,000 gives about 5,000 users per second. That's still not a lot, and it's more of a baseline; the peak could even be something like 50,000 users per second. But these are just rough numbers to work with. And every click or navigation from a user can generate clickstream events.
So let's say a user on average generates 10 events. That means we're looking at close to 50,000 events per second, right? Do those numbers look sane to you? Yeah, that sounds reasonable. Reasonable, yeah.
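As a quick sanity check, the same back-of-envelope math can be scripted; this is a minimal sketch in Python using the rough numbers assumed above (all figures are assumptions from the discussion, not measured values):

```python
# Rough back-of-envelope throughput estimate (all numbers are assumptions
# from the discussion, not measured figures).
subscribers = 200_000_000          # ~200M Netflix subscribers
daily_active = subscribers * 0.5   # assume 50% active per day -> 100M
seconds_per_day = 100_000          # ~86,400, rounded up for easy math

uniform_users_per_sec = daily_active / seconds_per_day          # ~1,000
# 80/20 rule: most of the traffic arrives in 20% of the day
peak_users_per_sec = daily_active / (seconds_per_day * 0.2)     # ~5,000

events_per_user = 10
peak_events_per_sec = peak_users_per_sec * events_per_user      # ~50,000

print(f"uniform: {uniform_users_per_sec:,.0f} users/s, "
      f"peak: {peak_users_per_sec:,.0f} users/s, "
      f"~{peak_events_per_sec:,.0f} events/s")
```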
So, I mean, this is quite large. And I think for a Netflix scale, I wouldn't be surprised even if it is more than this at some point in time. Especially when you have a very popular series that everyone is waiting for.
It just gets released and everyone comes in to watch it all at once, so at that point this number could go much higher. Right, so maybe later on we can talk a little bit about traffic spikes, but for now I guess the context is that the system you design should be very scalable. Yeah, sure. So, switching back to the design: when we look at it, there are usually a couple of models or methods through which we can collect data. So we are in this data capture space.
We are still collecting the data; this is where the clickstream events are generated. Collection usually follows either a push method or a pull method.
Push methods are where the servers running these apps, where the users are actually browsing, push the data themselves. There could be agents or daemons running on them that constantly send data to our data infrastructure; I'm just going to put a container here and call it the data infra. So the servers, the nodes running the application, will have an agent running on them that constantly pushes those logs, those events, to the data infra.
We will obviously have some endpoints or an API that the agents can call to push this information; that is the push model. From a pull perspective, we would instead have our services accessing those applications and pulling the data. There are advantages and disadvantages to both.
With the push model, you just have the agent running on the servers; you don't really have to worry about it, and it keeps pushing the data. But what is the disadvantage?
The disadvantage is that it can quickly overwhelm your infra if you haven't catered for it: agents keep pushing data, and during peak traffic periods your infrastructure needs to be elastic enough to handle that. How elastic it is and how much it can handle is something that needs to be thought through. That is the disadvantage the pull model overcomes, because with the pull model you decide when you want to poll for more data. At the same time, it depends on your polling interval. Suppose you pull the data and then take, say, half an hour to process it, and only then poll again.
Then you're only polling at half-hour intervals, and by the time you make sense of those real-time clickstream events, the moment is already gone. The insights you could generate may not be valid anymore.
Because those events are generated in real time. Suppose you want to display a marketing promotion to a user at that moment, but you respond to it after an hour; by then the user is already gone.
So you don't really get to maximize that moment from the clickstream data. That is something to be thought about.
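To make the push model concrete, here is a minimal sketch of what an agent on an application server might do: batch up clickstream events and POST them to a collection endpoint. The endpoint URL and the event fields are hypothetical placeholders, not part of any real Netflix system.

```python
import json
import time
import urllib.request

# Hypothetical collection endpoint exposed by the data infra (e.g. an API gateway).
COLLECTOR_URL = "https://collector.example.com/v1/events"

def push_events(events):
    """Send a small batch of clickstream events to the collection endpoint."""
    body = json.dumps(events).encode("utf-8")
    req = urllib.request.Request(
        COLLECTOR_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

# Example event an agent might emit on every click or navigation (fields assumed).
event = {
    "user_id": "u123",
    "event_type": "page_view",
    "page": "/browse/promotions",
    "ts": int(time.time() * 1000),
}
push_events([event])
```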
Right. So it sounds like time is really of the essence here. Yes, right. Exactly.
From the event streaming perspective, it's really about time. Of course, some of those things like trending movies and behavior profiling, and even customer churn, are more aggregations over time, but for some of those things, like the example I explained, time is very important, and there can be a lot of metrics where that's the case. So here I'm going to start with an API gateway.
I'm going to expose an API endpoint that the servers can push the data to. We have the API gateway, and this is the data collection point. We're collecting the data from those servers, and behind it I'm putting a Lambda, or any type of processing, as a pre-processor, so you can collect the data and pre-process it before sending it on to the rest of the infrastructure.
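As one possible sketch of that pre-processing step, a Lambda handler behind the API gateway might look roughly like this: it parses the incoming batch, drops malformed events, stamps an ingest time, and hands the cleaned records to whatever streaming layer sits behind it. The `forward_to_stream` helper and the field names are hypothetical placeholders.

```python
import json
import time

def forward_to_stream(records):
    """Hypothetical placeholder: hand cleaned events to the streaming layer
    (a Kafka or Kinesis producer, discussed below)."""
    pass

def handler(event, context):
    """Lambda handler invoked by the API gateway with a batch of clickstream events."""
    try:
        payload = json.loads(event.get("body") or "[]")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": "malformed JSON"}

    cleaned = []
    for record in payload:
        # Basic validation: drop events missing required fields.
        if not record.get("user_id") or not record.get("event_type"):
            continue
        record["ingest_ts"] = int(time.time() * 1000)  # enrichment example
        cleaned.append(record)

    forward_to_stream(cleaned)
    return {"statusCode": 202, "body": json.dumps({"accepted": len(cleaned)})}
```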
Right, so like I said, there are a few options you can use here. One is Kafka. Kafka is a streaming platform that's really resilient and fault tolerant, and it can scale pretty fast; for our use case of around 50,000 events per second, Kafka can definitely be used. Then there's Spark, a distributed computing platform that can be used to analyze those events; it has a Spark Streaming module, and Kafka and Spark work nicely together. So we have a huge number of events coming in, we properly pre-process them and send them to Kafka, and Kafka acts as a buffer. Once the events reach Kafka, even if the consumers are not online (this side is the producer and that side is the consumer), Kafka can still store the data, and when a consumer comes back online it can pick the data up. So we're not really losing much data here; we can keep pushing data irrespective of whether the consumers are active. Spark Streaming can plug into Kafka and start consuming the events, and any logic that needs to be written can be written as a Spark program. It's very performant, and it works on the concept of micro-batches.
Because Spark works in micro-batches, it cannot give you sub-second latency. If you have metrics that require sub-second latency, instead of Spark we can use Flink. Flink is also a real-time distributed computing platform, but it can give you millisecond latency and is very performant, so it's really useful for cases where time is critical. Then I'm going to bring in storage: with any kind of events, you get the events and process them, but you also want to store that data, and that's where the data lake comes in. The data lake is made up of multiple layers; whenever you get any of those events, you ideally want to store them in an object store. Data lakes are usually organized into different layers, so you'll have different buckets: for example a raw layer, where you store the raw events, which then move to a processed layer, and then an access layer.
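A minimal Spark Structured Streaming sketch of that consumer side, reading the pre-processed events from Kafka and landing them in the raw layer, might look like this; the broker address, topic name, bucket paths, and event schema are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

# Assumed shape of a clickstream event; the real schema would be agreed upfront.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("page", StringType()),
    StructField("ts", LongType()),
])

# Consume the pre-processed events that the producer side pushed into Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
       .option("subscribe", "clickstream-events")           # placeholder topic
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Land every micro-batch in the raw layer of the data lake for later analysis.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://clickstream-lake/raw/")          # placeholder bucket
         .option("checkpointLocation", "s3a://clickstream-lake/checkpoints/raw/")
         .trigger(processingTime="1 minute")
         .start())
```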
These are standard data lake principles: you bring in an event and always store it in the data lake as well, for further analytics purposes. Can you tell me why a data lake instead of other types of data stores? We are going to have other types of data stores as well, but the data lake is primarily important for use cases like customer churn analysis or behavior profiling, where we need that data stored over time. For example, we may want a few weeks' or a month's worth of data to understand a customer's frequency of visits.
So how do we do that? We can't really store data indefinitely in Kafka; Kafka typically retains data for about seven days, and it's primarily used for stream processing and analysis.
If you want to do any type of historical analysis, it's recommended to store that data in a data lake. Then at any point in time you can access, say, the raw layer and do some analysis on top of it. And another good thing about data lakes is that you can run some analytics straight out of the box.
In the case of AWS, there's Athena, a service that runs straight on top of S3, so you can start running queries right away. Analysts can run queries on top of that data without having to set anything else up (usually access is not provided to the raw layer; it's provided only in the access layer).
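For example, an analyst's ad hoc query could be submitted to Athena like this; the database, table, column, and output bucket names are placeholders, and the query itself is just an illustrative visit-frequency aggregation.

```python
import boto3

athena = boto3.client("athena")

# Example ad hoc query over the access layer in S3 (names are placeholders).
resp = athena.start_query_execution(
    QueryString="""
        SELECT user_id, COUNT(*) AS visits
        FROM clickstream_access.events
        WHERE event_date >= date_add('day', -30, current_date)
        GROUP BY user_id
        ORDER BY visits DESC
        LIMIT 100
    """,
    QueryExecutionContext={"Database": "clickstream_access"},
    ResultConfiguration={"OutputLocation": "s3://clickstream-lake/athena-results/"},
)
print("query id:", resp["QueryExecutionId"])
```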
They can just start doing some analysis straight away. But from the NoSQL perspective, or the real-time perspective, it's recommended to store some of those key insights, the real-time insights you want to expose to customers, in a more performant database, for example a NoSQL store. To take an example, Amazon has DynamoDB, which can be used here. So when you bring in those events, buffer them in Kafka, and have Spark or Flink process them, if you want the resulting analysis to be accessed in real time, it can be stored in a NoSQL database. NoSQL databases are very performant: you can store millions of columns, they don't really enforce a schema, which is a good thing here, and they give you fast reads and fast writes. So even if you get 50,000 events per second, NoSQL databases are the ones that will help store those results in a very performant way. If instead you put a relational database at the end of this, it's going to be a bottleneck that decides how performant your pipeline is. Even if Flink processes events in milliseconds, if you write the results to a relational database, that's going to be the bottleneck.
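To make that NoSQL sink concrete, here is one hedged sketch of how a Spark streaming aggregation could be written into DynamoDB with `foreachBatch`; the table name, key schema, and column names are assumptions for illustration.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("clickstream_realtime")   # placeholder table name

def write_insights(batch_df, batch_id):
    """Sketch of a Spark foreachBatch sink: write per-user aggregates to DynamoDB
    so they can be read back with low latency."""
    rows = batch_df.collect()   # acceptable only for small, already-aggregated batches
    with table.batch_writer() as writer:
        for row in rows:
            writer.put_item(Item={
                "user_id": row["user_id"],
                "window_start": str(row["window_start"]),
                "event_count": int(row["event_count"]),
            })

# Attached to a streaming aggregation, e.g.:
# agg.writeStream.foreachBatch(write_insights).start()
```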
The pipeline is only going to be as fast as the database that stores the results. So, moving on, we can have analytics running on top of this as well. Like I said, Athena is one option.
Athena is a service that runs directly on top of the data lake and can provide some useful insights. And the NoSQL databases, like I mentioned, are helpful for certain real-time insights, but that doesn't mean you don't use relational databases at all. Relational databases are really important as well.
You want to store facts and dimensions for historical, analytical OLAP queries. That's another use case for having stored the raw data: if you store these events in a data lake, later on you can have connectors or services on top of it.
For example, you can have another service that gets the data from the lake and loads it into your relational database, and that can serve some of the analytical purposes. The main thing to note is that this is not in the critical path; it's a parallel pipeline, so it's not going to affect your ingestion performance. When you have data in NoSQL, you can also run some further processing or analytics on it and publish the results onward. You might put it into Elasticsearch so users can search on something, or put it into another S3 bucket or an RDBMS; it depends on the type of storage. If you want analytical queries running on top of it, you'd put it in S3 or an RDBMS. If you want to expose that data to a user, for example show it in the UI like a marketing promotion, you'd loop it back to the UI and show a pop-up with that promotion. Or if you want users to be able to search on something, maybe Elasticsearch is the service to use: you push in the analytical data that can serve those searches. Were there any other components to this pipeline that you wanted to add? Yeah, just in the interest of time: like I said, there are a few options.
I discussed Kafka; Amazon has Managed Streaming for Kafka (MSK), which is what I used here, along with Spark and Flink, which are open source. But this is not the only way to do it. If I stick to AWS, there is a family of services called Kinesis.
AWS Kinesis is another event streaming platform we can use instead of Kafka, and it goes well with clickstream data and with IoT devices, which it was primarily designed for. Kinesis has a few services. There's Kinesis Data Streams, which you can think of as a kind of Kafka: it can be used in the data collection space, where it exposes endpoints that the agents can push data to.
It's very useful for clickstream events as well; a lot of companies use Kinesis Data Streams for clickstream. And there is Kinesis Data Analytics, which goes nicely along with Kinesis Streams. Suppose you want to run some streaming queries on top of the clickstream events you're receiving, on the fly, without storing them anywhere: you can do that using Kinesis Data Analytics. Those are simple queries you can write, and then you can push the results to, for example, any of the NoSQL stores, or even back to the UI if those metrics are really time-sensitive and you want to act on them. And there's another interesting one, just one more thing, which is Firehose. Firehose is kind of a buffer.
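For the Kinesis variant, a producer pushing events into a Kinesis Data Stream could look roughly like this; the stream name and event fields are placeholders.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def push_click_event(event):
    """Push one clickstream event into a Kinesis Data Stream (placeholder name).
    Partitioning by user_id keeps a user's events ordered within a shard."""
    kinesis.put_record(
        StreamName="clickstream-events",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],
    )

push_click_event({"user_id": "u123", "event_type": "play", "title_id": "t456"})
```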
It's good practice to act on some of those real-time events as quickly as possible, but you also don't want to lose that information. You always want to load it somewhere in your data lake so that at a later point in time you can come back and run historical analysis on top of it.
That's where Firehose can be really helpful. With Firehose you can set a buffering interval, for example 15 minutes, and configure it to push the data into the data lake every 15 minutes, into the raw bucket or any bucket you want. And it has quite a few destinations.
It can even push it to Elasticsearch. It can even push it to Redshift, for example. Redshift is another data warehousing solution that nicely fits into this analytical paradigm.
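As one possible configuration sketch, a Firehose delivery stream buffering for up to 15 minutes before landing data in the raw S3 bucket might be created like this; the role ARN, bucket ARN, and stream name are placeholders.

```python
import boto3

firehose = boto3.client("firehose")

# Delivery stream that buffers events for up to 15 minutes (or 128 MB, whichever
# comes first) before landing them in the raw S3 bucket. Names/ARNs are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-raw",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::clickstream-lake",
        "Prefix": "raw/",
        "BufferingHints": {"IntervalInSeconds": 900, "SizeInMBs": 128},
        "CompressionFormat": "GZIP",
    },
)
```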
Okay, amazing. Thanks so much for telling us about this huge variety of technologies. I thought that was really informative. I think this is a great place for us to pause. So I'd love to hear from you.
What did you think of this interview? What do you think went well versus what do you think you would want to change or do differently in the future? I think overall, we covered one specific, like one or two specific ways to do things.
One of the things we could also add on to this is a discussion of the cost factor based on those estimates, and there are different trade-offs or challenges in this pipeline that could be discussed: schema evolution, what type of format you would use for this clickstream data, and how the metrics could be modeled. There's some data modeling we could get into for customer churn as well. But obviously we can't really cover everything.
Like I said, the pipeline design itself is driven by the business process. Based on what's required, you make your data engineering pipeline as flexible as possible. This particular pipeline is fairly generic.
It can bring in any data, but for how it would be modeled for a specific use case, you'd think about whether you need to act on it in real time or it's historical, and whether it's more of an RDBMS or a NoSQL kind of problem, and based on that you'd pick and choose technologies. Yeah, I agree with a lot of those points.
I really liked that you tried to reference your high-level design principles whenever possible for motivating your design. So for example, what you just said about trying to design something that's as flexible as possible. I also liked that before you actually jumped into the technical details, you talked a little bit about the business context behind why you would pick certain metrics. What in the end do we care about?
and why are we tracking these metrics in the first place? That was super helpful. I also liked that you were able to provide a variety of different specific technologies that you would use at each stage of the pipeline.
I think the trade-off there is that when you mention a very specific technology, it's helpful to then back up and explain the general principle behind it. If you're using a particular database, you can explain that this is NoSQL and why we would use NoSQL here. And there are a lot of trade-offs there to discuss, right?
So I think you very briefly touched upon this with the idea of schema evolution. For NoSQL solutions, when you change your schema in the future, because maybe you want to do new types of data analysis, it's a lot easier to do than with a relational database, right? But on the other hand, like you say, the trade-off is that your data is not going to be as strictly validated as it would be in a relational database, right?
Yep. And one last thing I would have loved to hear a little bit more about, and of course it's hard to fit this in, there's already so much info here: do you see any potential bottlenecks in this system? And how fault tolerant is it overall?
And would you put any measures in place to try to like improve the fault tolerance and reduce any bottlenecks? Yeah, but other than that, I thought you did a fantastic job. So thank you so much for being here with us today, Karthik.
Cool. Thank you. Thanks for having me here. Yeah, of course.
And thanks everybody for watching. If you have any upcoming interviews, good luck.