Transcript for:
Overview of OpenCL 1.2 Concepts

Welcome to my YouTube video. I will be describing a high-level view of the OpenCL 1.2 specification. I'll provide you with a little bit of background about myself. First of all, I'm a parallel programming guru.

I have extensive experience and training with C++, OpenCL, and Linux. I'm primarily concerned with real-life software, which means I have a slightly different take than many of my counterparts who are discussing OpenCL. I'm an OpenCL middleware developer.

This talk will partially motivate the role of OpenCL middleware. Suffice to say, my primary interest is in providing you with tools so that your OpenCL development is much easier than it would be without those tools being available to you. I've been doing high-performance computing for a long time.

Right now, there's a big push into big data technologies. High-performance computing, in many ways, is still more complicated than these technologies. And I just want to point out that I've been considering the issues that prompted the development of OpenCL since long before OpenCL existed, and before the recent trend toward technologies such as Hadoop and MapReduce.

I'm available for consulting. I certainly would love to help you ensure that your applications are achieving their top performance and delivering value to your customers. You can see my blog and information about myself at my website.

www.ajguillaume.com, and you're welcome to send me an email at any time. I also support PGP if you want to talk in a confidential medium. So I'm going to provide a quick outline of my discussion today. The first thing I'm going to do is motivate OpenCL.

I assume that you already know what OpenCL is about and you already are curious about it. You just want to see how it all fits into a picture and how you are going to use it in your application. My goal here is not to convince you that you should use OpenCL. I assume you're already interested. My goal is to give you the complete background you need so that you can really start to use OpenCL.

I'm going to discuss the underlying models of OpenCL. There are several models which are behind the scenes in OpenCL, and they're very important. If you just read the specification directly, you may miss these important concepts. So I will be discussing the device memory and execution models.

This is going to provide you with very important motivation for many of the things you are going to learn in the future regarding OpenCL. Essentially, the most important part of this talk is to really provide you with everything you need to understand how OpenCL fits together, what makes it tick, and how it's been created, so that more and more advanced topics can be introduced. This is OpenCL 101. Everything that you are ever going to learn about OpenCL is going to draw upon the concepts you learn in this talk.

If you understand this material very well, future discussions will be very easy to follow. If you don't understand this material, feel free to send me an email and tell me where I went wrong in this discussion (perhaps I did not motivate something in the way that makes the most sense), and I may issue an amending video later on. We're going to discuss how the execution model maps to the device model.

You're going to see how everything fits together, how we have this abstract notion of execution, and we map it to a physical device in such a way that we can achieve our goal of heterogeneous computing. I'll also discuss the host API, and you're going to see how the host sees OpenCL. You're going to see how we get devices to actually do the work that we're asking them to do. I will note that you're not going to see much code, if any, in this talk. The reason for that is that I am trying to give you all of the high-level ideas and background that you need so that you can read the specification yourself, so that you can start understanding the big picture ideas that you might lose if you were just looking at how to actually do software design with OpenCL.

Further talks will give more information and will show you more code. This particular talk is to give you an idea of what the concepts are and how everything fits. So the first question is: what is OpenCL?

At a high level we have a host which is dispatching commands to a device. These devices are part of a heterogeneous system. This means that we are asking a device to do something on our behalf as a host and that device is programmed in some way that we may not understand or know.

For example, the device may be a GPU from NVIDIA, the device may be another GPU from AMD, the device may be a Xeon Phi. The key point here is that we have a device that does something interesting and we want to communicate with it and tell it to do work. This is a key point of OpenCL.

The first main actor in OpenCL is the host itself. The host's sole purpose is to tell devices what to do. You are going to write programs that tell the device how to operate, and for this you use the OpenCL C API. The naming can be confusing, because we also have OpenCL C, the device language; the host is what calls the OpenCL C API, which links to libOpenCL.

This is something you've installed, which allows you to talk to the device. We also have devices, which actually execute work for the host. This is where all of the important work is actually performed: the device performs work on the host's behalf and provides results back when it is asked to do so. The role of the host is to offload work

to the devices and tell them to do something interesting for the purposes of a larger application. There are several models to consider in OpenCL, and the main thrust of this particular talk is to make sure that you understand these models completely, because they influence everything that you will do. The first model to understand is the device model. What do devices look like inside?

How do we program these devices? What is the common layer, the common concept, that is provided to us?

The next thing to consider is the execution model. How do we actually run work on these devices?

And we have to discuss the memory model: how do devices and the host both see data? Finally, we have the host API, which is how the host actually controls the device. I also want to give you a few use cases.

We're going to discuss each of these models in detail; that is essentially the entire talk today. But first I want to provide you with some motivation as to why you might want to do this and the type of things that OpenCL can help you with. So consider that we're doing fast permutations.

So we have data, and it's not in an order that we want. A permutation could be a sort, a permutation could be some sort of hashing; a permutation is just some sort of shuffling of memory. Now, if the device that is attached to us can move memory faster than the host, it may make more sense to copy a large piece of data to the device, have it permute the data, and copy it back.

This is a potentially interesting use case. Another use case, slightly related to permutations, is data translation. If we have to translate data from one format to another, and if the devices have good properties, or the translation involves a lot of bookkeeping or a lot of memory movement, we may find it is easier to do the translation on an external device such as a GPU than to do it on the host itself. Finally, numerical software is a very important use case, and it is probably one of the main reasons the pioneers of GPGPU software came from fields such as numerical analysis and engineering. If devices can execute instructions fast enough, you might want to use them to actually do your modeling or your simulation.

Now, I point out the other two use cases, fast permutations and data translation, to show you that OpenCL is not strictly a numerical software standard. The point of OpenCL is that you're controlling a device, and you can use these devices in interesting ways according to what they're capable of doing. I want to discuss the OpenCL standard itself.

There's a brief history here. OpenCL 1.0 was released as a specification in December of 2008, and this year, in July 2013, the OpenCL 2.0 provisional specification was released; in between we've seen several iterations of the standard, each adding new features. The OpenCL 2.0 specification adds significant new features and does adjust the philosophy and model that we are discussing today.

However, because it is a provisional specification and because these other OpenCL versions are still going to be around, this talk is very relevant. Everything you learn here is going to help you understand how OpenCL 2.0 refines these concepts. Now there's several parts to the OpenCL specification itself.

The first part is the core specification. The core specification dictates what every conformant implementation must support; for OpenCL 1.2, say, it is the functionality you can always assume is there. So if image support, which is hardware support for doing image manipulation, is part of the core specification, it means that any device that says it can do OpenCL 1.2 must support it. This is the main concept of the core specification.

There's an embedded profile, which relaxes the core specification and provides a subset of its functionality; the intention of the embedded profile is to target handheld devices. We also have extensions, which provide additions to the specification and might become part of the core specification later.

And we also have to consider the fact that even though it is one specification, there are many facets to it. For example, if you're coming from a C or C++ background, it is a very foreign concept that vendors are permitted degrees of freedom in terms of how much of OpenCL they can really support. Perhaps they can't fully implement the core specification.

We allow them to relax what they can support and call themselves an embedded profile. This is a slightly complicated notion if you're coming from a background, as I say, in C or C++, where if you implement it, you implement pretty much everything, unless you're doing something very specialized, at which point you won't call it C or C++. There are several components to OpenCL.

The first thing is that we have the C host API. I alluded to this already. This is what you call from the host, and this directs devices. It tells them what to do.

This is the main part that an application programmer is going to use, basically asking: what devices do I want to use, what do I want them to do, what functions do I want to call, where should memory be? This is the point of view of the host.

The device uses OpenCL C. This is a language based on C99 with many built-in functions, and it is also within the standard; you will see that roughly half of the standard is dedicated to OpenCL C.

Underneath the standard, not necessarily directly addressed, there is a memory and execution model.

There is a preamble in the specification which provides some details of these models, but in my experience you have to look through the whole specification to really understand what these models are and how they relate. That's the primary value of this talk: I've already done the work of scouring the specification and completely understanding how things fit, and I'm conveying it to you in a manner that should make it easier for you to access the specification. If we look at the big picture, on the left you see that the host API is responsible for the operation of the host.

The host is going to call the host API to manage devices, which you see on the right. Devices are programmed using OpenCL C; this will be the topic of the second talk, a subsequent video to this one. And underneath all of this are these models.

These models are here to guide everything. They're not necessarily directly addressed, but they underlie everything. And if you understand these models, then understanding OpenCL C and understanding memory management become much simpler, because you understand how the models work.

So let's start discussing the models. We're going to discuss them one by one. Let's start with the device. As I pointed out to you before, the device is external to the host.

This line here does not have to represent physical pins or anything like that. It could be that we're accessing a device over the network. It could be anything. The point is just that the host has some way of seeing a device.

So let's take a look inside of it. This is the device model, and we're going to look at it piece by piece. You can see these rectangles inside of the box: you see that we have something called global memory and constant memory.

We're going to discuss every piece, but I want to keep descending into the model. So the device is broken down into further pieces. For example, this is a compute unit.

As you can see here, we have 15 compute units. Let's descend into the compute unit so that we can understand it. Inside of the compute unit, we see eight of these smaller blocks, with the label PE and with something called private memory. PE stands for processing element.

Now we're going to recurse again: let's take apart one of these blocks, the PE and its private memory, and see what that's about. As I've mentioned, PE stands for processing element, and the best way to think of a processing element is as a very simple processor. In particular, all instructions are executed on processing elements. This means that for everything you are going to do in terms of actually making devices do work, the PE is going to be responsible.

So let's zoom out again and look at this particular fictitious device. In this fictitious device, we see that we have 15 compute units, for which I use the shorthand CU. We have 8 processing elements per CU; PE is the shorthand for processing element.

And someone might claim that we have 120 processors. I first want to address why this is not a good way to think of things. It comes up over and over again. And you really don't want to think of a processing element as a normal processor.

There is quite a bit of marketing hype. You will see people who say, I have 30,000 processors on my particular device. Don't think of it that way. This is misleading and it will confuse you as you attempt to do programming and design algorithms to best fit the models you're learning today.

What is important is to realize that processing elements are contained inside of compute units. A compute unit cannot do anything. All that a compute unit is, is a container for processing elements.

That's the best way to think of this. Don't just think "I have 120 processing elements," and don't substitute "processors" for "processing elements."

You want to think of the grouping. This is very important. Think: I have 15 compute units, and each has 8 processing elements within it.

This is really the best way to think about the device that we have seen, and I want you to really start to think this way. Now, there's another part to this: so far we've only seen what performs execution on the device. We now need to look at the memory model.

I am not discussing the memory model in terms of how the host views memory; I'm still discussing the device. The big picture here is that we have several memory regions. If you have a background in C, you're used to having one physical space of memory, which may have virtual memory behind it, but that's an operating-system abstraction from the point of view of the process. Within that one region of memory you really have just two pieces, a stack and a heap. That's your view. Memory is roughly equivalent in terms of what it can do; the only things you take into account are, say, cache coherency and cache effects.

Now this is not the case with OpenCL. You have several memory regions, and these memory regions have different properties and different capabilities. This is one of the most significant differences between OpenCL and what you have probably seen before. So the first thing to look at is global memory.

Global memory is shared with all processing elements. That means that every so-called processor can access this memory. The host can also access this memory.

It can map it, and it can copy data to and from this memory. It is persistent storage; not persistent in the sense of a hard disk, but persistent in that kernel executions will affect this memory and the memory will remain across subsequent executions.

We will discuss this in more detail as we discuss the execution model, and it will become clear. The other memory regions you are going to see are scratch space. So global memory is where you load data and then you're going to run functions and it's going to manipulate the global data, but pretty much after you run something, it's going to come from global memory and it's going to stay in global memory. This is what I mean by persistence.

Let's look at constant memory here. Constant memory is rather unique in that it, again, is shared with all processing elements, but it's read-only memory. If you've written C programs and you've written a constant, say const int x = 3 or const float pi = 3.14f, you know that there's an opportunity to have variables that you really aren't going to change; it's a logical mistake to change these variables.

Inside of the device model of OpenCL, you actually can go ahead and just stuff variables into read-only memory, and it is physically guaranteed that it's not going to be affected. The reason you do this is not because you want to protect that memory, it's because the hardware has very efficient ways to share data with all device processing elements. This is a special memory in that it may be rather small in terms of what you can put inside of it, but it is very efficient for every processing element to access this memory. It is not persistent; this is something that's going to change over time, but we'll get into the details about this later.

This is what we showed as the compute unit. As you recall from the previous slide, I showed you that you had 15 compute units, each with eight processing elements. I did not mention explicitly that we also have local memory. What you see here is a memory region that lives within the compute unit.

Each compute unit has its own area of local memory. In particular, you will see in the device I showed you that you have, say, 15 compute units. This means that you have 15 local memory units, one per compute unit. Local memory is unique in that it is shared with all processing elements in a compute unit.

It's a very efficient way to share data with all of the processing elements inside of a compute unit. They can actually cooperate in some way with a local scratch space. It cannot be accessed by other compute units. So if compute unit 1 has some local memory, compute unit 2 cannot access it. It's not possible to do.

You have to use global memory for this purpose. This memory is again not persistent, and I will clarify later what that means. If we descend into a processing element, it has private memory attached.

This is memory that can only be accessed by a single processing element. No other processing element can directly access this memory. And again, it is not persistent. It's going to change in a way that I will explain. If we zoom back out, you can see that we have several memory systems.

We have global memory, constant memory, local memory, and private memory. The configuration of these memories is important: when you are designing high-performing applications, you will rely on their properties to actually achieve the performance that is promised by OpenCL.
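As a peek ahead at OpenCL C (the subject of the next talk), these four regions show up directly as address space qualifiers. The kernel below is only a hypothetical sketch, meant to show where each region appears:

```c
/* A hypothetical OpenCL C sketch (OpenCL C itself is the next talk's topic):
   each memory region corresponds to an address space qualifier. */
__constant float coeffs[4] = { 1.0f, 2.0f, 3.0f, 4.0f };  /* constant memory: read-only, shared by all PEs */

__kernel void example(__global float *data,      /* global memory: visible to the host and every PE */
                      __local  float *scratch)   /* local memory: shared within one compute unit */
{
    float tmp = data[get_global_id(0)];          /* tmp lives in private memory, one work item only */
    scratch[get_local_id(0)] = tmp * coeffs[0];
}
```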

Now I want to ask you a question, and I'll answer it in a second. What kind of device did I actually show you? What are you looking at? Are you looking at a Xeon Phi? Are you looking at a GPU?

Are you looking at a compute cluster? What is it? The answer is I didn't show you any device at all.

It's simply a model. The model is how you should reason about things, but how the model is translated to various physical hardware depends on your vendors. You should still understand the model, because this is the common terminology, the common level, that everyone is discussing.

And as programmers, we mostly are going to be programming the model anyways. The job of the implementation is to take the model and efficiently compile it to something else, but we'll get into this in other talks. Now we're going to discuss the execution model.

This is probably the most confusing aspect of OpenCL. If you have any questions, don't hesitate to post a question in the YouTube comment section or send me an email, and I will post a follow-up that tries to better explain anything that's unclear. I hope that this is at least as clear as what is in the specification; what I would suggest is that if something is unclear here, wait a day or two, read what the specification says, and then go ahead and ask your question. So let's try to get into what the execution model is all about.

First, OpenCL executes kernel functions on the device. Now, the name kernel is a little bit unfortunate: it does not indicate some sort of kernel space in the operating-system sense.

A kernel is just a special name for a function, which I believe is a remnant from graphics. Now, the function that you're executing is just an ordinary function with a special signature. My talk on OpenCL C will get into details about what that signature is and what a function looks like.

However, I want you to understand a few things about it so that we can actually understand at least the model of execution. Kernel calls have two parts to them. There's the ordinary function argument list, and there's some external execution parameters that control the parallelism. OpenCL does provide direct support for parallel computing in a very unique way. And you need to understand that you have your ordinary function that you are calling, but you also have some parameters that are controlling that execution, which are outside of the ordinary function call.

So let's look at this simple function. This is an ordinary C function: we have an integer being returned by a function called add that takes two integers as arguments.

The argument list is, in this case, int x, int y. As I said, there are two parts, so what's missing? The extra execution parameters, which I will begin to describe momentarily, are bundled together with the function call and the arguments provided, and we call this an OpenCL kernel call.
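To make that concrete, here is the ordinary C function next to what a corresponding kernel might look like. The kernel form is only a hypothetical sketch (OpenCL C is the subject of the next talk); it shows that the signature is special while the body is ordinary code:

```c
/* Ordinary C: the function runs once per call, as soon as you supply x and y. */
int add(int x, int y) { return x + y; }

/* Hypothetical OpenCL C counterpart (a sketch only): the signature is special,
   and how many times the body runs is dictated by execution parameters that
   the host supplies separately. */
__kernel void add_arrays(__global const int *x,
                         __global const int *y,
                         __global int *result)
{
    size_t i = get_global_id(0);   /* which invocation am I? */
    result[i] = x[i] + y[i];
}
```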

What this means is that execution of functions is not what you're used to from C or C++. There, you can call a function as soon as you provide the parameters. In contrast, in OpenCL, you have to provide the function, provide the arguments, provide extra parameters that control its execution, and then tell it to go. We're going to get into detail, and I hope that this is all going to make sense. So what's the role of the host in kernel execution?

Kernels are executing on the device. These are functions that the host is calling to execute on the device. And so the host is really coordinating the execution.

It's telling the device, call this function. Here are the arguments. Go ahead, call this thing.

It is also going to provide the parameters to launch the kernel. This means that the main role of the host is to tell the device what to do and to provide the arguments.

This is fundamentally what kernel execution is all about. Now, there are a few ways in OpenCL 1.2 to call kernels. I'm only going to discuss the most useful and general version. And trust me, this is really the only version you need to know, unless you're doing something specialized.

We're going to talk about the NDRange. The key concept here is that the same kernel function will be invoked many times, and the argument list is going to be identical for every invocation. By an invocation, I don't mean a call per se, I mean that the function is actually being run many times. Now, the basic strategy of the NDRange is to call the same function over and over and over again, and the number of times we make this call is going to be dictated by the execution parameters.

So the execution parameters are really telling us, for this function: how many times should I execute it, and what should I do? It's a little bit more complicated than that, but we're going to see a lot more detail very soon.

The host is going to set extra execution parameters prior to the kernel actually launching. So this is another role of the host: it says, here are the parameters, go ahead and run. There are particular parameters you have to set for an NDRange, which we're going to see. So let's show a little graphical example of this. On the host side, it's going to say: launch foo(1, 2, 3) and run it 10 times on the device.

This is all the host is going to do. It's not going to run it. It's not going to do anything itself.

It's just going to tell the device to do it. The device, for its part, is going to execute these functions 10 times as it was told to do. This is somewhat natural.

You probably will think of this as a for loop. Now, a for loop is a good way to think of things, but I'm going to show you something called an index space, which is a much better way of thinking of things. But I want you to understand the concept from several different directions so that you completely understand the NDRange. Now, one of the first problems we have here is that we don't know which call we're in.

So how does a kernel function that's executing know what to work on? If I'm being called, the argument list is identical.

So in the previous slide I showed you: launch foo(1, 2, 3). Now foo(1, 2, 3) has been called several times, but how do I know which invocation I actually am? How do I know what work I should be performing? What if I'm supposed to be moving data around?

What is my data to copy? The key insight here is that the execution parameters I keep mentioning but have not directly discussed yet provide an index space. And each function invocation knows its index in that space.

So for example, in the launch of foo(1, 2, 3), I know that I'm call 3 of 10. This is basically how you're going to write algorithms: you are going to understand what index you have in the space, and you're going to have to formulate your function in this way. And I will show you why this is fairly efficient from a hardware point of view. I'll point out that the index space is n-dimensional.

You can think of this geometrically. You can also think of it as a number of elements inside a tuple inside a set. I will get into this in detail because I want you to completely understand the NDRange. If you get nothing else from this talk, understanding the NDRange is absolutely critical to unlocking performance and to achieving all of your goals with OpenCL. So I presented the NDRange as maybe a for loop; is this a good way to think of things?

We can certainly think of it this way as a first approximation. You're going to provide the function, and the outside component of the for loop, the part saying for (size_t i = 0; i < size; ++i), is all being done for you behind the scenes by those execution parameters.

What you supply is the function, the name of the function with the parameters, and the structure of the outside for loop, if you want to think of it that way, will be provided for you. Now, if you think of it this way, we have what OpenCL calls the global work size. This is the number of iterations we are going to complete, and i, in this case, is the global ID.

So what we wind up with is that foo is able to know which i it is and the full size; it has some understanding of where it is inside the for loop. Now we can do another thing with for loops: we can have a work offset. We iterate with the offset as the base, and we keep incrementing j until j reaches the offset plus the size.

This is a fairly natural way to think of what a work offset is. OpenCL provides this. It calls it a global work offset. So now we have two concepts here.

We have the global ID and the global work size, along with this global work offset; by providing these three values, we can get different for loops out. And this is one way to think of the NDRange: the structure of the loop is provided for you.
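To make the for-loop picture concrete, here is a minimal conceptual sketch. This loop is never code you write; foo and the example values are illustrative assumptions:

```c
/* Conceptual sketch only: the driver provides this "outer loop" for you, and
   you never write it. foo and the example values here are illustrative
   assumptions. */
void conceptual_ndrange(void)
{
    size_t global_work_offset = 0;
    size_t global_work_size   = 10;

    for (size_t i = global_work_offset;
         i < global_work_offset + global_work_size;
         ++i) {
        /* Inside the kernel, get_global_id(0) would return this i and
           get_global_size(0) would return global_work_size; the invocations
           actually run in parallel, not one after another like this loop. */
        foo(1, 2, 3);   /* the argument list is fixed for every invocation */
    }
}
```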

Now, if we want to do multiple loops, iterating simultaneously over x, y, and z, then the number of nested loops is the work dimension. You'll notice that the function foo in all of these cases has not depended on x, y, or z directly; that's because we assume there are functions within it that allow it to determine which x, y, and z it actually is. So you really don't want to directly provide the iteration index to the function you're calling, because, as I said, the argument list is fixed across all invocations. Now, another way to think of an NDRange, and this is probably the best way to think of it (though I wanted to first motivate it in terms of for loops, which I assume you're familiar with), is to think of the index space as a set of indices that we're pulling from.

So really don't think of for loops per se. The reason is that for loops are really sequential: you have to execute one thing before you can do the next. Libraries such as Threading Building Blocks and others have parallel fors, and this is a little bit like what you have with OpenCL, although the way the OpenCL equivalent works is a little bit specialized.

There is no such thing in OpenCL called parallel for; that's what the NDRange is all about. So I want you to think of a set of indices.

That's how you should think of the index space, and each element of the set is an n-element tuple. This is really what you've already seen.

Now, each invocation, so each function when it runs, is going to pull a random index from the set. It is not random precisely, but it is best when you are doing design to consider it to be random. Otherwise, you will find yourself hitting problems you did not consider, because you thought through something sequentially on a piece of paper rather than randomly.

Now what's going to happen is that the set that makes up this index space is populated before kernel execution. As the kernel runs, the function invocations pull out an index and run on the hardware, and everything stops when the index set is empty. So here is a way of visually imagining this: we have a set here called the index space, on the left.

We're going to pull out tuples one by one, and we're going to pass those tuples as a sort of parameter that is accessed differently from the argument list; you need to use built-in functions to access it, and the OpenCL C talk will discuss this. The argument list has been fixed in place for these calls to foo.

So really, we're just pulling things out. As you can see here, it's very random. And as we pull things out, the index space is going to shrink until it's empty, until there's nothing left to do.

At this point, the kernel terminates. And this is, at its heart, what an NDRange is all about. I'm going to provide you with a few definitions that are within the specification.

The work item is what you should think of as an invocation of the kernel for a particular index. So in the very first example, where I said for i equals 0 to 10, each i is a work item. Strictly speaking the work item has a global ID, which is the next definition, but let me just say for a moment that your work item is i equals 3. Now what's going to happen is that you complete your work for i equals 3 and then you're finished. You don't do anything else.

You're done. The work item's dead. Now, this is a very interesting way to think of a for loop.

It should become natural to you over time. I understand if it's not natural right now. Understanding these concepts is probably one of the most intimidating parts of OpenCL.

Now we have what's also called a global ID, which is a globally unique ID for a work item. You can think of this as basically your index that you're using in your iteration space. And we have a global work size, which is the total number of work items. Now this is per dimension.

As I said, you can have multiple dimensions. And the work dimension is naturally defined as the dimension of the index space. You can either think of this as the number of for loops, so dimension 1 is one loop, dimension 2 is a loop inside of a loop, and dimension 3 is a loop inside of a loop inside of a loop, or you can think of it in terms of indices, where you have one element within the tuple, two elements, or three elements within the tuple. Either way, they're equivalent.

It's up to you how you like to think of them. Personally, I prefer to think in terms of index spaces now; when I began using OpenCL, I thought in terms of for loops. So I encourage you to think in terms of index spaces, but ultimately it doesn't matter. Unfortunately, something is missing from the execution model as I've described it, but we haven't really motivated what's missing.

Everything I've said is accurate, but it's not the complete picture, and I want you to understand why it matters that we're missing this. We're going to have to discuss how we map the execution model to the device model in order to properly motivate what's different.

The first thing to consider is: how do devices actually do their work? This is a very important question. When we look at our execution model, we have tuples being pulled from the index space and being passed to identical functions. The source for these functions is identical, and the argument list is identical and fixed; they all have the same parameters.

So this is how we've seen executions done. On the other side, we have this device model where we have all of these processing elements, and we kind of know that the work items are going to have to be mapped to the processing elements in some ways because I told you that they're the only things that actually can run instructions. So immediately remember the processing element runs the instructions, so work items should run on the processing elements. That's the natural way to think of it.

And what's going to happen is that we're going to assign multiple work items to each processing element. Now let's just pause here and think about it for a second. As I said, work items are particular invocations of the function; you can think of them as indices.

So i equals 1, i equals 2, i equals 3: those particular invocations are going to be assigned to a processing element. So in order to map the execution concept to the device, we're going to take the work items and map them to processing elements, and we're going to have to handle the case where we have more work than processing elements.

This is a really good approach, but something's missing. We're still missing something here, so I'll give you a second to think about what it is before I just tell you what it is. What we're missing is an ability to use local memory. Local memory is there because it provides great benefits in performance. But the model I've shown you cannot use local memory at all.

Now, we're going to make a slight adjustment to our execution model. We're going to add a concept called a work group. And now I promise you there is nothing else I've withheld from you.

This is the complete concept. But I wanted to motivate why it's important. So the work group partitions the global work size into smaller pieces.

And each partition is going to be called a work group. Now, work groups are what execute on compute units: work groups execute on compute units, and work items inside a work group are mapped to that compute unit's processing elements. So the work group is the natural way of scheduling work on a device in OpenCL.

The work groups are scheduled onto compute units, whose processing elements share local memory. Work items in the work group are then mapped to the processing elements within that compute unit, and all of the work items inside that work group share its local memory. This is now the full concept.

The work group size has a physical meaning. So how large should your work group be? The device I showed you before had, let's say, eight processing elements per compute unit.

Now perhaps the driver will tell you, oh, I can do... 1024 work items in a work group. This has a particular physical meaning, and it's a device-specific meaning.

We're not going to get into that here; we'll get into it in the talks about performance, because we will use this fact and take it into account. But I want you to understand that how you size your work group matters even for problems that are seemingly independent, where each work item has an independent thing to do, like adding two numbers in an array, and each global ID is a particular element within that array; the work group still has an impact there. We'll see this in another video on performance.

Work items are able to find out what work group they are in. Now, we talked about memory, and I want to show you another perspective now that we've completely motivated it. From the perspective of a work item, it has private memory attached to it, constant memory attached to it, and the ability to access global memory. Now, if we extend this to a work item in a work group, local memory enters the picture.

So you have many work items that are executing within this container of a compute unit. Each work item has some private memory that it can access, and all of the work items within the work group or compute unit are able to share the local memory of that unit. Every work item on the device, as I mentioned earlier, is able to access constant memory.

That's because the processing elements can access constant memory. And likewise, every work item can access global memory. Now I'll mention again the work group size.

The maximum work group size is a device characteristic. You don't want to calculate it yourself from, say, just the number of processing elements per compute unit.

There's a size that's going to be given to you; there's a maximum that the device can do, and there is a meaning to it, which will be covered in future discussions. The way to figure out the maximum size is to query the device and just ask it what it can do. For example, you may see a CPU tell you that it can execute 1024 work items within a work group, and a GPU say that its maximum work group size is 256 work items.
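For example, a minimal query using the real host call clGetDeviceInfo might look like this (the cl_device_id is assumed to have been obtained already, and error checks are omitted):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Query a device for its compute-unit count and its maximum work-group size. */
void print_device_limits(cl_device_id device)
{
    cl_uint compute_units = 0;
    size_t  max_wg_size   = 0;

    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg_size), &max_wg_size, NULL);

    printf("Compute units: %u, max work-group size: %zu\n",
           (unsigned)compute_units, max_wg_size);
}
```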

The maximum work group size is actually a single integer, and we have to handle the fact that, as I've mentioned, we can have n-dimensional work groups. So we can size our work group in multiple dimensions, and we need a way of bringing that back down to a single scalar integer. As I've mentioned a few times now, and I want to make sure it's completely clear, we haven't yet motivated how to determine the best work group size.

Another presentation will discuss this. For now, if you'd like, you can just choose the maximum that the device supports and work with that for exercises or to get familiar with OpenCL; we'll come back to this later to figure out what the best selection actually is. This is a duplicate slide in some sense. My apologies, it is quite similar to something you just saw, but what's new is at the bottom here: work items can find out a few things about themselves.

They can find out their work group ID and the size of the work groups, which is the same for all work groups; they all have identical size. They can find out their global ID and the global work size. Now, conceptually, I want to show you what's happening as sample source code, so that you can see what this execution kind of looks like. Again, the only thing you're providing is the parameters, which create all of this boilerplate for you; you can't write this boilerplate directly.

All you can provide is the function you're going to call and a set of execution parameters, as I mentioned before. So let's see what work group launching would conceptually look like. You see here that we have the work group size.

This is just a parameter that says how many work items are in a work group. Then there is the global work size: what's the total space we're working with, and what's the offset?

Now, to actually launch work groups (again, this is not what literally happens, it's just a conceptual picture), we're going to launch the work items a group at a time. So for every group, say group 0, we're going to allocate some local memory. This memory is physically on the hardware, so an allocation doesn't really happen as a step.

Then we're going to iterate over a local identifier: we launch the work group, and then we go in and launch all of the work items within that group. We calculate the global ID as the offset, plus the work group ID multiplied by the work group size, plus the local ID.

So really, all of this boilerplate around the function that's executing is done for you. All I want to do with this slide is make it clear that it is the work groups that are scheduled, and the work items are broken apart within the compute unit itself, sharing its local memory resources.
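Since the slide itself isn't reproduced in this transcript, here is a minimal conceptual sketch of that boilerplate. It is not code you write, and foo and allocate_local_memory_for are purely hypothetical names; the comments note the built-in functions a work item would use to read these values:

```c
/* Conceptual sketch only: this is the structure implied by the execution
   parameters, not code you write. */
void conceptual_work_group_launch(size_t offset,
                                  size_t global_work_size,
                                  size_t work_group_size)
{
    size_t num_groups = global_work_size / work_group_size;

    for (size_t group_id = 0; group_id < num_groups; ++group_id) {
        /* Each work group gets its own local memory on the device. */
        allocate_local_memory_for(group_id);

        for (size_t local_id = 0; local_id < work_group_size; ++local_id) {
            /* get_group_id(0), get_local_id(0), and get_global_id(0) are how a
               work item reads group_id, local_id, and global_id from inside
               the kernel. In reality the work items run in parallel. */
            size_t global_id = offset + group_id * work_group_size + local_id;
            foo(1, 2, 3);   /* same argument list for every work item */
        }
    }
}
```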

That's something that you may miss when you read the specification yourself. So how do we handle n-dimensional workgroups? What do we do? Well, workgroups can have multiple dimensions.

You can think of this geometrically, which you'll see soon, and you can think of it as pulling n-dimensional tuples from a set, which you've already seen. But as I've mentioned, your maximum work group size is a scalar, while your work groups can be sized in n dimensions. For example, if your maximum work group size is 32 and you're launching with work group dimensions 8, 2, and 1, how do we check that we're not going over the maximum size?

Well, from the perspective of the device, the work group size is really one-dimensional; the two- and three-dimensional stuff is just an abstraction for your convenience. What's going to happen is you take the k work group sizes you gave, w1, w2, all the way to wk, and just multiply them out: w1 × w2 × ... × wk. So long as that product is less than or equal to the maximum work group size, you can use it; in our example, 8 × 2 × 1 = 16, which fits under a maximum of 32. If it's larger, you're going to hit an error from the host API. It won't launch, and this is just something to be aware of. Now let's think about this geometrically.

I've given you an approach where you can think of this in terms of index spaces. I want to give you another approach where you can think of work items geometrically, because you may find that this helps you.

Suppose that I have a global work size of 32 here. I'll note that this slide is slightly in error in that it shows an index of 23, but let's ignore that. So we have a global work size of 32, we have a work group size of 8, and we have this space.

One way to think of this space is that OpenCL itself is going to break it into work groups for us, such that contiguous work items are within the same work group; this is all done automatically for you. The only things we've specified are the work group size and the global work size; everything you're seeing now is automatic.

It's going to break everything into work groups, and then it's going to take those work groups and give each one to a compute unit. So each compute unit is going to get its own work group. Now, what's going to happen is that the work group will be broken apart when it's actually given to a compute unit, and each work item will be given to a single processing element. Now, what happens if we only have one compute unit?

Well, very simply, what it's going to do is just go ahead and run those workgroups one by one on the single compute unit. And it's going to be the normal case, likely, that you have more workgroups than you have compute units. The hardware is optimized to handle this case.

So in terms of having a single compute unit doing everything, you now are going to see processing elements are going to be given multiple work items to do at a time. This is, again, the normal procedure. This is not abnormal. Let's consider this geometrically. Now, we have a global work size of 8 by 4, and we can think of this as a 2D array or a rectangle, however you like to think of it.

I just want to show you a geometric interpretation. And we're going to specify a work group size of 2 and 4. Again, things will be broken down. We're going to get four work groups, and this is a 2D partitioning.

This might be useful if you're doing some sort of 2D grid problem, just because it helps you think things through; it makes the indices a little bit easier than doing everything by hand in one dimension. If we have a three-dimensional problem, then again we can provide a three-dimensional global work size and a three-dimensional work group size, and everything will be broken down for us accordingly.
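As a hedged preview of the host call that carries these execution parameters (the command queue and kernel come from the host API discussed below, and error checks are omitted), the 8-by-4 example with 2-by-4 work groups would be launched roughly like this:

```c
#include <CL/cl.h>

/* Sketch: launching the 8-by-4 example with 2-by-4 work groups, i.e. 32 work
   items in total, 8 per work group, so 4 work groups. */
void launch_2d_example(cl_command_queue queue, cl_kernel kernel)
{
    size_t global_work_size[2] = { 8, 4 };
    size_t local_work_size[2]  = { 2, 4 };

    clEnqueueNDRangeKernel(queue, kernel,
                           2,                /* work_dim */
                           NULL,             /* no global work offset */
                           global_work_size, local_work_size,
                           0, NULL, NULL);   /* no event dependencies */
}
```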

I'm going to make a few last points about the kernel calls. The host provides execution dimensions to the device. It's going to create the index space.

And the parameters to the functions can either be values or, as I have not mentioned before, global memory objects; when you learn about OpenCL C, you'll see this. Now, global memory is persistent between calls, but as I mentioned earlier, constant, local, and private memory are just scratch space, and each kernel call results in those memories being completely reset. So the main idea here is that if you actually care about a particular value, you should leave it in global memory.
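As a small illustration of "values or global memory objects", here is a sketch using the real calls clCreateBuffer and clSetKernelArg; the context and kernel are assumed to exist already, and error checks are omitted:

```c
#include <CL/cl.h>

/* Sketch: pass one global memory object and one plain value to a kernel. */
void set_example_args(cl_context context, cl_kernel kernel)
{
    cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                   1024 * sizeof(cl_float), NULL, NULL);
    cl_int scale  = 3;

    clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);   /* global memory object */
    clSetKernelArg(kernel, 1, sizeof(scale),  &scale);    /* plain value */

    /* The buffer's contents persist in global memory across kernel calls;
       anything kept in local or private memory does not. */
}
```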

The OpenCL implementation has considerable flexibility here. These are models, so the implementation can decide how best to map work items to processing elements and how to actually schedule the work. You will see very big differences between, say, how a GPU schedules work groups and how a CPU does. I hope that this last section made sense. If you are saying to yourself, "this is completely trivial, and I don't know why he made such a big deal out of it," then I have done a very good job of teaching you this concept. I found these concepts quite challenging myself when I was first exposed to them, and I hope that I've broken them down in such a way that you now find them a natural way to think.

So let's talk about the host API. The host API is the other part of the standard, the part that provides the means for the host to do what it needs to do. We have here the platform, the context, and the program; these are the main things I'm going to discuss, as well as how the host makes asynchronous calls to the device.

There's more to it in the standard. Again, my purpose in this talk is to give you an opportunity to get into the standard and not be intimidated by it, and to have enough background so that you understand what you're reading when you go through and actually write your applications referencing the standard. So let's discuss the platform.

What's this all about? Well, you can really think of the platform as an implementation of OpenCL. Platforms are kind of like device drivers: they expose what devices are available to you.

So, for example, suppose you have a system with two GPUs and a Xeon Phi; let's say that one GPU is from AMD, one GPU is from NVIDIA, and the Xeon Phi is from Intel. In this particular system, OpenCL will see several platforms. It'll see a platform from AMD exposing one of the GPUs, and, because AMD supports CPUs in their implementation, that platform is also going to see the CPU and expose it to you.

If you have a platform from Intel, it's going to show you the Xeon Phi. And if you have a platform from NVIDIA, it's going to show you the other GPU. So the platform is just a way of discovering what's available to you.

The platform is what discovers devices. It makes sense if you think of it as something like a device driver. Now the context. You create a context for a particular platform, and you cannot have multiple platforms in a context.

This means that when you create a context, you can't create one for both, say, a Xeon Phi from Intel and an NVIDIA GPU. You can't do it. So what is a context? Why are we discussing it? Well, a context is essentially a container.

It contains devices, and it contains memory. You are creating this container, and it's going to help OpenCL manage itself. Most operations are related to a context, either implicitly or explicitly.
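A minimal sketch of this discovery-then-container flow, using the real host calls and picking the first platform and its first GPU purely as an example (error handling omitted):

```c
#include <CL/cl.h>

int main(void)
{
    /* Discover a platform: an implementation of OpenCL. */
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    /* Ask the platform for a device; here, the first GPU it exposes. */
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Create the context: a container for devices and memory,
       tied to exactly one platform. */
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0
    };
    cl_context context = clCreateContext(props, 1, &device, NULL, NULL, NULL);

    /* ... allocate memory, build programs, control devices ... */

    clReleaseContext(context);
    return 0;
}
```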

So creating a context is the first thing you do. When you write an OpenCL program, you discover the platform, get a context, and then you can start allocating memory and controlling devices. So when you read the specification, you will see the platform model and the context model, and now you know what these things are. I will note that the platform is what discovers devices, and the devices, through the platform model, will tell you what they're capable of doing. Now let's discuss programs.

We're fast-forwarding through the specifications a little bit here. So programs are simply collections of kernels. You've already seen this. So you've already seen kernels.

We're just now saying that we can assemble groups of them and call it a program, and we have to extract kernels from the program to actually call them. So one of the things you're going to do, after you've found a device, is create a platform and a context, and then load a program, which is a collection of kernels, these special functions.

Now, OpenCL applications have to actually load these kernels, and there are two ways to do it: either compile OpenCL C source code at runtime, in which case your application will have to stall and wait for the compilation to complete, or load some sort of binary representation. I will note that programs are device specific. You're probably going to want to compile OpenCL C source; with OpenCL 2.0 we have SPIR, which will eliminate a little bit of this issue.
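A minimal sketch of the source-compilation path might look like this; the context, device, and source string are assumed to exist, "add" is a hypothetical kernel name, and error checks are omitted:

```c
#include <CL/cl.h>

/* Sketch: build a program from OpenCL C source at runtime and extract a kernel. */
cl_kernel build_and_get_kernel(cl_context context, cl_device_id device,
                               const char *source_string)
{
    cl_program program =
        clCreateProgramWithSource(context, 1, &source_string, NULL, NULL);

    /* The host stalls here while the vendor's compiler does its work. */
    clBuildProgram(program, 1, &device, "", NULL, NULL);

    return clCreateKernel(program, "add", NULL);
}
```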

Now we want to discuss asynchronous calls. The host manages devices asynchronously. As you know, you may have multiple devices attached to your host: in our previous example, a Xeon Phi, an AMD GPU, an NVIDIA GPU, and potentially a CPU as another device. What's going to happen is that you're trying to manage all of these devices asynchronously for best performance, so OpenCL has an asynchronous interface that you should become familiar with,

and I'm going to provide you some motivation for it as well as a general idea of what's going on behind the scenes so that when you read the calls and prototypes, they make sense. On a high level, what happens in OpenCL's asynchronous device management is that we're going to issue a command to the device. We're going to tell it to do something.

Devices are going to take the commands and do whatever they say. The host is going to wait for the command to complete. Now, it doesn't have to sit there and busy-wait; it can enqueue quite a bit of work, but the host won't know that a command has finished until it actually waits for it. It's no different from a join or anything like that.

Now, commands can be dependent upon other commands. This is a little bit unique to OpenCL, and it raises some complexities in terms of asynchronous programming in OpenCL. OpenCL commands are issued by the functions in the standard whose names start with clEnqueue. So if you look at the standard, anything that says clEnqueue is issuing a command.

Now, a cl_event object is returned by any of these enqueuing calls, and you're going to use it for dependencies. So let's look at a sample call. I will note, first of all, that this is not the actual structure of OpenCL calls.

This is a simplification I'm showing you just so that I can teach you the concept; if you look at real code, it is done differently. So permit me a little bit of flexibility to teach you the idea, and then you can get into the nitty-gritty details when you actually read the specification. So this command, clEnqueueFoo, is going to enqueue the command foo to run on a particular device.

So when we give a command, we're giving it to one device; we're telling it to do something. Now, the command is going to return a handle to us that represents its completion. Once we enqueue something, it's not going to execute right away.

We're just getting a handle to tell us, okay, when it's finished, we'll let you know, and here's a handle you can use to query that. Commands also take a list of dependencies. These are previously issued commands that have to finish before this one can be done. This provides an opportunity for the driver to better schedule itself.

Let's take a look at a sample execution. So I have here two commands I'm going to run, called foo, and one command called bar. Let's trace it.

I'm going to get an event called e1 when I enqueue the first foo command, and another event when I enqueue the second foo command. They have no dependencies, because the dependency set is empty in both of these cases. So I see something visually like that on the right-hand side.

Now I'm going to enqueue the last command, and you now see a dependency: the command bar cannot be completed until the two previous calls to foo have finished. Whatever these things are doing, foo might in real life be doing memory copies and bar might be a kernel. The point is that we're providing a structure to our programs in a way that is not necessarily intuitive, and this is part of what makes the host API difficult. And things are going to be dispatched to devices.
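Here is a sketch of that foo/foo/bar pattern using real enqueue calls, where the two "foo" commands are buffer writes and "bar" is a kernel launch; the command queue (introduced just below), buffers, kernel, and host pointers are assumed to exist, and error checks are omitted:

```c
#include <CL/cl.h>

/* Sketch of the foo/foo/bar dependency pattern with real enqueue calls. */
void enqueue_with_dependencies(cl_command_queue queue,
                               cl_mem buf_a, cl_mem buf_b, cl_kernel kernel,
                               const void *host_a, const void *host_b,
                               size_t bytes)
{
    cl_event e1, e2, e3;

    clEnqueueWriteBuffer(queue, buf_a, CL_FALSE, 0, bytes, host_a, 0, NULL, &e1);
    clEnqueueWriteBuffer(queue, buf_b, CL_FALSE, 0, bytes, host_b, 0, NULL, &e2);

    cl_event deps[] = { e1, e2 };
    size_t global_work_size = 1024;   /* assumed index space */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL,
                           2, deps, &e3);   /* "bar" waits on e1 and e2 */

    clWaitForEvents(1, &e3);   /* the host blocks here until "bar" completes */
}
```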

So where are these commands actually going? I'm saying that we're enqueuing things, but I haven't really said where they go or what they do. So OpenCL has command queues, and this is where these commands are going to be placed.

A command queue is attached to a single device. You can't attach it to multiple devices, but you can create as many command queues as you want. Whether or not you want to is something you have to check in your vendor documentation; how this is best used is specific to each implementation of OpenCL.
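Creating one is a single call; assuming the context and device from earlier:

```c
#include <CL/cl.h>

/* Sketch: create one command queue, attached to exactly one device within a
   context. The 0 selects the default (in-order) queue properties. */
cl_command_queue make_queue(cl_context context, cl_device_id device)
{
    return clCreateCommandQueue(context, device, 0, NULL);
}
```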

Now, the clEnqueue commands, I haven't shown it to you, but they all have a command queue parameter. So you're going to say: I'm enqueuing this command to this command queue. That command queue, in turn, was bound to a device when it was created. So the complete picture here is that we have the host, and the host is going to call a command like clEnqueueFoo.

It's going to generate that command and place it into a command queue. So this little curved rectangle is really some sort of command that's been encapsulated and is being placed inside a queue. But this all happens in one step.

You can't separate it out. You enqueue the command at the same time it's created. Everything's done in one step. I just broke it down so that it makes more sense conceptually. So let's let time pass, and let's go ahead and see what happens now that foo's at the head of the queue.

Just enqueuing the command doesn't mean that it's completed yet. It just means that it's sitting there waiting and may have further dependencies. So when the command is run on the device, the device is going to do something in response to that.

Maybe it's allocating memory. Maybe it's doing something else. I will clarify that these foo commands, you don't get to create these commands.

You have to choose from a set of commands that are provided by the OpenCL standard. But this concept that I'm showing is applicable to all commands. Finally, what's going to happen... is that the host can find out that the command has completed. And this is the nature of asynchronous programming in OpenCL.

So I want to give you a quick summary of the host API. Now, what's happening at a high level is that the host API is controlling the device. The device can't do anything itself. You have to tell it what to do. And the way that you tell it is the host API.

There's an asynchronous execution model, which is critical for speed. And it's a little bit different from traditional asynchronous APIs that you've seen because of the dependency model and because of the command queue system and everything else involved. And I've certainly not covered everything there is to know about the host API.

However, at this point, you have everything you need to start learning a lot more about OpenCL, but you won't be able to start understanding the OpenCL C part until you watch the next video. So in conclusion, you now understand one part of OpenCL. The next part to learn is OpenCL C, and that's the next video that follows up with this one. My goal in doing this has been to teach you the key ideas.

You should really read the specification yourself. But armed with the ideas that I've given you, you should find it very easy to approach the specification. It's going to take you some time to learn OpenCL.

You're not going to learn it overnight. It's going to take you a little bit of time. And if anything is unclear, you should really comment on this video.

Let me know if I didn't explain anything well. I'm not going to re-record this video if I stumbled somewhere and had to correct myself verbally, but I will make amendments to this video and issue new videos to clarify anything that is fundamentally unclear. Hopefully everything seems really simple, and if I've done that, I've done my job really well.

I'd like you to subscribe to my blog. You'll see it on the last slide, and it's in the notes of this video. And I'd like to tell you that I'm available as a consultant. One of the things I can do for you is help you get top performance from your code. I can write algorithms that are efficient in OpenCL.

I can help you figure out how you can leverage OpenCL in your application and how you can deal with the heterogeneous computing world. The other thing I can do for you is training that's specific to your company, where I teach you what you need to know to achieve your objectives with OpenCL. So if you like my style of instruction, even though you haven't seen any exercises here, I'm certainly available to help you set up something for your firm.

This is the address of my blog, and this is my email address. Feel free to send me a message, and I hope that you have enjoyed learning about OpenCL.