OK, it's about 2:00 now so I'll get started. Welcome to ARCHER's online MPI course. We're just going to start with a quick ten-minute overview of these courses and an introduction to what we'll be doing. These talks are made available under a licence and you can reuse them if you wish, as per the licence shown there.

So, what is ARCHER? ARCHER is the UK national supercomputer service, managed by EPSRC, the Engineering and Physical Sciences Research Council, which I'm sure some of you are familiar with. It's actually housed up here in Edinburgh, on the edge of town, and supported by EPCC. The machine itself is a Cray XC30 — let me get this right; if not, I have no doubt my colleagues will correct me. Thank you, Claire, for confirming that's the right one. It's supplied by Cray, and as part of the contract to run ARCHER we provide training and computational science and engineering support — you're probably already aware of this, as you are on this course. These courses are always free to all academics.

As I said, ARCHER is actually housed out on the edge of town at the Advanced Computing Facility. This is a nice picture of the side of it — I'm told; I've not actually been to visit yet, but I'm told that the lovely, well-decorated panel is up against the wall so you can't really see it. Anyway, here's a good picture of it.

And what is EPCC? We're a UK national supercomputing centre, possibly *the* UK national supercomputing centre — there are a few other places, but they tend to be things like the Met Office, who have their supercomputers for very specific purposes and won't allow just anyone to apply for research grant time on them. We were founded in 1990 and we're a self-funding centre at the University of Edinburgh. We used to be an institute; we've moved up a layer of bureaucracy, which is nice for us, and we're now a centre within the College of Science and Engineering. We don't receive any direct funding, or very little, from the University itself, but instead fund ourselves through commercial and research contracts. We've been running national parallel systems for a very long time, since '94. We have around 90 full-time staff and we're currently growing — for example, I started about a year ago, in March. We do a range of academic research and commercial projects, and we offer a one-year postgraduate Masters in HPC, and also HPC with Data Science, and we run, or are involved in, a few other courses as well. Do get in contact if you're interested in collaborating: many of our staff are named researchers on research grants, we do put together joint research proposals, and we're involved in a number of European consortia, which I'll be talking about a little more later as well.

Key resources: again, some of this you're probably already familiar with since you're here, but you can find all the material for these courses on our website, including links to YouTube videos which cover the same lectures as the ones I'll be giving today, and I believe — I know — this is being recorded also, so you can always come back and check things if you wish and find our slides.

Who am I? Who is this voice out of nowhere speaking to you? My name is Oliver, I'm an applications developer at EPCC, currently working on two main projects. One is EPiGRAM-HS, which is a big European consortium where we're looking at programming models for
exascale heterogeneous systems — very large supercomputers with things like GPUs and FPGAs strapped into them. The other is HiLeMMS, the High-Level Mesoscale Modelling System (got there in the end), which is looking at making it easier to write lattice Boltzmann codes, a particular type of computational fluid dynamics; we're trying to make it slightly easier for developers to develop models by creating a domain-specific language. So those are the two main things. I was also previously on INTERTWinE, which is another exascale-programming-model European research project. I also do teaching — I'm here right now doing this, but I also teach on some of our MSc courses, and I'm a personal tutor on the Data Science, Technology and Innovation programme, which is a college-run online learning programme. I'm also an ARCHER helpdesk operator, so if in the future you have, or if you've ever had, a problem with ARCHER — or in fact one of our other machines — and you've emailed in, there's a chance it came to me to redirect to someone who knows how to help. And I'm also an auditor, which is very exciting, as you can imagine.

Other resources: please fill in our feedback forms. That's a great way of (a) finding out if we're doing something badly — we'd rather know that and do something to fix it — and (b) if we're doing things well, it lets us prove that to our funders, a.k.a. the research councils; they like to hear that we're not just wasting all their time and money. So please do fill in our feedback forms. As I mentioned before, I'm a helpdesk operator and you can always ask the helpdesk questions by emailing support. And as also previously mentioned, we run an MSc in HPC and in HPC with Data Science; scholarships are available. This may be of interest to you, or to your students, or to people you know. It's taught by EPCC staff, plus there are options in other departments — a fairly standard setup for an MSc in terms of how it's run. We also run some online accredited courses, including a Practical Introduction to Data Science and a Practical Introduction to HPC; both of those are also part of that Data Science, Technology and Innovation programme I mentioned.

During the course — well, hopefully you've received an email about the fact that you've got access to Cirrus, which is another one of our machines. It's a bit easier to use, especially for this course, than ARCHER itself. You'll have guest accounts for the duration of the course. Please only use these for the things we have to do — don't just decide you want to start mining Bitcoin; we will notice and we will stop it. These accounts will be closed immediately after the course, so if you want a copy of anything you've done during the course, make sure you copy it off before the end. All the materials will be available from the web page until, well, whenever. For longer-term access to Cirrus there are several routes, both for industry and academia. They require justification of resources in the form of a technical assessment which you submit to us; basically that just asks you what it is you want to do with Cirrus, and we let you know whether or not Cirrus is a good fit — we may recommend a different service if we feel it's not. You can find more information about that on Cirrus's website, and the process is similar for ARCHER; there are various different routes you can apply through. OK, and that takes me to
the end of our little welcome slides. So now we'll get started on message passing concepts, which is the first proper lecture, as it were, of our course.

We're going to start off by talking not about MPI itself but about the programming model on which it relies — which will hopefully make the explanation of the actual interface make a bit more sense later on (and hopefully that slide problem will go away... there we go). So, message passing programming: we're going to cover the message passing model, SPMD (single program, multiple data), communication modes, and collective communications.

Beginning with the programming model. A programming model is the conceptual view a developer has of the tools they're using to solve a particular problem. For example, in serial programming we have fairly high-level concepts like arrays, subroutines and variables — things you can make use of — but they're independent, because they're just ideas about how you should structure a program; they're independent of the actual languages. And the standards that define those languages are also independent of the actual implementations of those languages. For example, for C it will depend on your compiler's interpretation of the C code you've written, because the compiler translates between your human-readable C code (hopefully human-readable, if you've done a nice job of it) and something the computer itself can understand. The same goes for Fortran and any other compiled language, and equally interpreted languages have interpreters that do more or less the same job. So all these things are separate.

In message passing — the parallel programming model we're going to be discussing — the concepts are things like processes, sends and receives, collectives, single program multiple data, and groups. These are ideas about how programming can be done, independent of the libraries themselves, which are MPI libraries in this case. Or rather, the MPI standard defines the library interface, and that interface is implemented by several different vendors. For example, there's Open MPI, which is an open-source effort to implement MPI that's very commonly used; there's an Intel version of MPI; there's MPICH, which is a funded effort from a US national lab; there's a Cray version which they provide with their machines; there's an IBM version; there's an HP version. There are lots of different implementations of MPI, but they should all follow the standard set down for the library interface. So you should be able to write the same MPI code and use any of these implementations, but what actually happens after your MPI call, under the hood, is dependent on the implementation. And MPI is just one library that sets out ways of using the concepts defined by the message passing parallel programming model.

As I said, it's based on the notion of processes. A process you can think of as a running program together with its data, and processes are isolated from one another: they see their own memory area and not any other process's memory area. In this message passing model you achieve parallelism by having these processes cooperate on the same task, and they can communicate with one another to achieve that by sending and receiving messages — but not by any other method. For those of you familiar with OpenMP: in OpenMP you have a shared memory model where everybody can see the same
memory area and access it all at the same time if they need to — well, different parts of the same memory area at the same time. In MPI, or message passing generally, we don't have that: every process is completely separate and sees only its own data. It's a bit like saying all variables are private in OpenMP. Again, the MPI standard is such that, to the applications developer, sending and receiving messages just look like library calls.

OK, so here we look at a sequential paradigm — this is just reiterating the point I've already made about a process being some processor and some memory area unique to itself. In the parallel paradigm there's this message passing interface to some sort of communication network that allows the sending and receiving of messages between processes. A message is just any item of data — it could be an integer, it could be a string, it can take any form — but it's the only way to share data between processes.

This type of programming model is primarily used on distributed memory architectures. Here we have a photo of HECToR, the predecessor to ARCHER, and what looks like an older Cray machine, although I can't tell you which. On a machine like ARCHER or HECToR you have many, many nodes, distributed around the room in fact, all connected together by some sort of interconnect which is quite fast — ARCHER has a Cray Aries interconnect, for example — and these allow the rapid passing of messages between different nodes. The advantage of this approach is that it lets you split up a large problem across many different nodes with separate memory areas, which may allow you to solve a problem that simply wouldn't fit on a single node, given the available memory, for example. This sort of distributed machine is what we would normally think of as a supercomputer, as opposed to many independent workstations, which is what you'd have without the interconnect.

So how do processes communicate? Say process one defines some variable a = 23; it can choose to send its value of a to process two. So it does that, and process two receives that message — but receives it into a different buffer. The fact that it was called a on process one makes absolutely no difference to process two: process two just gets some stream of bytes which it interprets, or rather assigns, to the variable b, and puts that into its own memory area. It can then set its own variable a = b + 1, and that has no impact on what's happened on process one — these are two different variables called a.

Most message passing programs use a single program, multiple data (SPMD) model, in which every process runs its own copy of the same program — a single program — but with its own unique data. Now, you might be thinking: if they're all running the same program, how do they do anything useful or interesting? The answer is that you can branch that program based on a unique identifier for each process. So you can, for example, say: if you are process zero, do this; any other process, do something else entirely. This is how we get them to run different things while fundamentally running the same code. That might seem like an odd way to do it, but it actually simplifies a lot of things — for example, you don't have to compile 100 different binaries to run a program on 100 processes; you compile one, they all run it, and it forks internally based on the process ID.
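(To make that concrete, here's a minimal C sketch of the a = 23 example and of branching on the process ID. It uses MPI calls we'll only meet properly later in the course, so treat it as a preview of the model rather than something from the slides.)

    #include <mpi.h>
    #include <stdio.h>

    /* One binary: every process runs main(), and behaviour branches on the rank. */
    int main(int argc, char *argv[])
    {
        int rank, a, b;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* unique ID of this process */

        if (rank == 0) {
            a = 23;
            MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send one int to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            a = b + 1;   /* rank 1's own 'a', completely separate from rank 0's 'a' */
            printf("rank 1 received %d and set its own a to %d\n", b, a);
        }

        MPI_Finalize();
        return 0;
    }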
Typically we would run one process per processor/core. You don't have to — you can easily run, say, four processes on a dual-core laptop; it will in all likelihood be slower than just running two, but you can do it, and it's mainly useful for testing. If you're doing something that matters, you would probably stick to one process per core, as that's generally the most efficient approach. So here we have some C-like pseudo-code for a message passing program, and as you can see it's branched based on the process ID: there's one controller process, which we normally choose to be process zero, and everything else does a different thing entirely — that's the way we get our processes to cooperate. And the same sort of pseudo-code in a Fortran-like style. Hopefully that makes sense to you all.

So, messages. This is obviously the key aspect of a message passing program: what exactly the messaging is doing. A message transfers a number of data items of a certain type from the memory of one process to the memory of another process. It's not quite as clean as that, though: a message will typically also contain things like the ID of the sending process, the ID of the receiving process, and the type of the data items. It needs that because, as I mentioned, the receive buffer is completely independent of whatever buffer — whatever variable name — the data was sent from, so to interpret the data correctly your process needs to be told what type it is. We also need the number of data items, the data itself, usually some sort of identifier for the message, and some other bookkeeping ends up in there too. So there is always a certain amount of overhead in messages.

For practical reasons there are different modes of communication available. Your sends can be synchronous or asynchronous. A synchronous send simply is not completed until the message has been received, so it blocks the sending process until that receipt has been confirmed; with an asynchronous send, the message goes out and the sender is free to go about its day — it doesn't need to worry about whether or when that data has been received. Receives are usually synchronous. There's nothing to say you couldn't write a message passing library that didn't work that way, but it would make things more complicated; typically you just do a synchronous receive, so essentially you generally receive things immediately before they're needed. There are asynchronous receives available, but they're less commonly used — it tends to be that the process moves on and the receipt is synchronised later — and that's fairly advanced usage.

So, a synchronous send: the analogy here is with faxing a letter. I'll be honest with you, I have never seen a fax machine in the flesh, so to speak, but I gather from this slide, from context, that when you send the fax you have to wait until the whole thing has been received, and it will confirm to you that that has happened — so you know when the letter has started to be received; it gives you some signal to say so. Whereas an asynchronous send is more like posting a letter: you simply chuck it in the post box and assume that at some point a postman will arrive, pick it up, and eventually it will be delivered to the recipient — but you don't know exactly when that's happened, and you may indeed not care; your job ends as soon as it's sent and you carry on with your day.
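(As a hedged preview of how those two modes look in MPI itself — the routines are covered properly later — the fragment below assumes it sits inside an initialised MPI program with at least two processes, and that rank 1 posts matching receives for each send.)

    int data = 42;

    /* Synchronous send ("fax"): the call does not complete until the matching
       receive has started, so the sender is blocked until then. */
    MPI_Ssend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

    /* Asynchronous send ("posting a letter"): the call returns immediately and the
       sender carries on; 'data' must not be reused until MPI_Wait confirms the send
       has completed. */
    MPI_Request request;
    MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
    /* ... get on with other work ... */
    MPI_Wait(&request, MPI_STATUS_IGNORE);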
The next obvious mode of communication is simply point-to-point. So far we've considered only two processes, one sender and one receiver, and this is point-to-point communication — I hope it's fairly clear when that's the case. This is the simplest form of message passing and relies on a matching send and receive; that's quite key, actually: for every send you must have a receive elsewhere in your code. We'll come back to that in a bit, I believe. The analogy is with sending personal emails — simply from one person to another, as opposed to, for example, mailing a mailing list where many people receive it. Point-to-point is simply one send, one receive, and a single link in between.

On the other hand, collective communications are also available. While a simple message communicates between two processes, there are often times in your program when you need to communicate among a whole group of processes, and collective communications achieve this. Now, the underlying implementations can be built from point-to-point communications, and indeed when collective communications were first proposed and introduced to the standard they often were, because that was the simplest way to create a working implementation to begin with. Times have moved on since then, and now I would be very surprised if that were still the case; I would expect them to be implemented in a different, more efficient way, simply because there are routing tricks you can do that are better for these types of communication, and also certain types of bookkeeping that aren't required in this situation. So there are ways to optimise collective communications compared to doing simple point-to-point, and I would expect them to be used in any up-to-date MPI library. In terms of the programming model, this is closer to sending an email to a mailing list, where you expect everybody on that list to receive that single communication.

So, a type of collective communication — ah, well, no, I take that back. One way of synchronising your code is a barrier. A barrier is probably the most hardcore, brute-force way to do it: it's a global synchronisation that takes every single process and says, OK, you're not allowed to pass this point in the code. Remember that all these processes are running exactly the same binary, the same code, so they'll get to this barrier and they have to wait there until every other process is also in that barrier, and then they may proceed. So it is a global synchronisation point in your code. There are reasons to do this — one good one, for example, is if you're doing some sort of profiling and you need to time a particular section: you might well want to put a barrier on either side of it, in order to see how quickly every process gets through that section. But in general it's best not to fill your code with too many barriers, because chances are you don't always need them, and you will have a lot of processes just sitting and waiting for a while for the slowest one to reach that synchronisation point. OK, so going back to what I was starting to say: the barrier is a type of collective communication, in that every process is involved in it.
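(As a small illustration of that profiling use case — a sketch only, assuming <mpi.h> and <stdio.h> are included and MPI has already been initialised — you might bracket the timed section with barriers like this:)

    int rank;
    double t_start, t_end;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);       /* everyone waits here before the clock starts */
    t_start = MPI_Wtime();

    /* ... the section of code being timed ... */

    MPI_Barrier(MPI_COMM_WORLD);       /* wait for the slowest process to finish */
    t_end = MPI_Wtime();

    if (rank == 0)
        printf("timed section took %f seconds\n", t_end - t_start);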
Another type of collective communication is a broadcast, and here it's exactly what you might expect: a one-to-all communication. One process sends its data to every other process in the group, with the end result that every process has that same piece of data — although, again, they can receive it into whatever variable or buffer they like, so they may not be using the same variable name for it, but they will have a copy of the same data. And here we have a little example: a process is broadcasting the number eight to every other process.

Another type is scattering. In scattering, what you do instead is take some array, for example, and divide it up across all of the processes in a group, so everybody has a different part of that array. So in this example you can see each process has been assigned a different part of the array. What's important to note about this is, (a), the root process — the one that actually has the whole array — has to define a separate buffer to receive its own individual part of that array into, so those two are separate; and (b), the ordering is in no way guaranteed and does not necessarily relate to the ranks, the IDs, of the processes. There's no guarantee about where each part of the array will go; you just know that it will be evenly divided amongst your processes.

Gathering is the inverse operation: your entire group of processes has individual bits of data and you want to collect them all onto one process. So they all send to some process, and you can reconstruct the original array that we previously scattered. Again, the important point to note here is that you need to make sure the array your root process is going to receive into is big enough to fit everybody's data, or you will run into problems.

There are also reduction operations, where you again combine data from several processes to form a single result, but here, instead of simply gathering them into some buffer or array, we're actually going to combine those numbers in some way — for example, a simple summation. That's the case in this example, which is a strike ballot (I did not write these slides, but the example here is a strike ballot): every yes is a one, every no is a zero, and then you perform a summation to find out whether enough people have voted for it — here they have not. But it doesn't have to be a summation; it could be any kind of reduction operator — for example, you could be multiplying them all together, or simply trying to find the maximum value. We'll discuss this more once we're on to looking at actual MPI, but for now the concept is simply that you can combine data from across your process group into a single variable if you need to, and it's a common operation for many types of application.
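(For a preview of what two of these look like as actual MPI calls — covered properly later, so this is just a sketch — assume the fragment below sits inside an initialised MPI program, with <stdio.h> included and rank already set. Scatter and gather have a similar shape, via MPI_Scatter and MPI_Gather.)

    /* Broadcast: after this call every process's 'value' holds 8 (root is rank 0). */
    int value = 0;
    if (rank == 0) value = 8;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduction, e.g. the strike-ballot example: each process contributes a 0 or 1
       vote (made-up local data here) and the sum lands on the root process. */
    int vote = (rank % 2 == 0) ? 1 : 0;
    int total;
    MPI_Reduce(&vote, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("yes votes: %d\n", total);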
OK, so launching a message passing program. As we've said a few times already, it runs from a single piece of source code — a compiled binary with calls to message passing functions — and this is all compiled with a standard compiler and linked. Well, you wouldn't necessarily call your standard compiler by name to compile MPI code, because often the message passing vendors will provide you with a wrapper to a standard compiler that includes linking to the MPI libraries for you, so you don't have to do that step yourself. So it's not necessarily that you'll be calling gcc yourself, but underneath it is just using, for example, gcc or the Intel compilers or whatever. The actual compilation step is not really any different from compiling serial code, and your single piece of source code looks just like source code normally would in that language, except there are calls to the MPI library — or the message passing library, I should still be saying at this point — and you have to include that library.

You then run multiple copies of that same executable on your parallel machine, and each copy is a separate process. The way you do that is through a launcher program, which basically just says to the operating system: I've got this executable, please run it 100 times all at once. Each process has its own completely private memory area, and — this is important to understand conceptually — the processes can be at different points in the program. Unless you've explicitly included some kind of synchronisation, such as a barrier, the chances are they will be at different points, so you can never rely on them being at precisely the same point at the same time, because the timing is down to individual machines and network speeds and all these kinds of things, which are not typically identical across a large computer, or even a small one. And of course your code is most likely branching based on process ID, so the coverage will be different across different processes as well. But yes: you have some kind of launcher that sets off those processes for you.

Now, some common issues with message passing models. Your sends and receives must match. For those of you familiar with languages which require explicit memory allocation and deallocation — which I guess is all of you — this is exactly the same kind of discipline: just as you must always have a free for every malloc in C, you must always have a receive for every send. If you don't — if you have a send, certainly a synchronous send, with no matching receive — then your code will just deadlock, waiting forever for that receive to happen. If it's asynchronous it's a bit more complicated: what will most likely happen is that something will go wrong further down the line. If you have mismatched sends and receives and nothing goes wrong further down the line, that's entirely through luck and should not be relied upon. So there must always be a receive for every send, and vice versa — if you start adding extra receives you'll also run into issues, because again, they're blocking.

It is possible to write extremely complicated programs, but actually a lot of scientific codes in particular have fairly simple communication patterns, and also share a fairly small set of communication patterns between them. That's certainly nice from a research perspective, for example, because it gives you a list of common communication patterns to try to improve in order to benefit a wide range of scientific codes. But in general it's better, if you can, to have a simple communication structure, especially to begin with; it may be that for some reason it needs to become more complicated later on, but it's best to start simple.
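(Here's a hedged sketch of the matching-send-and-receive point, assuming exactly two processes in an initialised MPI program. If both ranks did a synchronous send first — MPI_Ssend followed by MPI_Recv on each — neither send could complete until the other rank posted its receive, and the code would deadlock. One safe ordering pairs things up explicitly:)

    int other = 1 - rank;      /* the other process's rank (two-process case) */
    int out = rank, in;

    if (rank == 0) {
        MPI_Ssend(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
        MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Ssend(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    }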
And yes, scientific codes in particular tend to do things like domain decomposition, where basically you have some grid that gets divided up across processes, and at the end of each time step those processes need to communicate the boundary conditions for their particular parts of the grid — it's called a halo swap — and it's extremely common across many scientific codes because they use this domain decomposition structure internally.

Always use collective communications if you can. There's no point re-implementing your own point-to-point version of what should just be a collective communication, for the simple reason that, at the very worst, the underlying library will have implemented that collective using point-to-point — but it will still have done in a single line what you as the application developer would otherwise have written in multiple lines of code — and it's very likely that the library developer has actually implemented it in a much more efficient way than simple point-to-point. So don't be afraid of using collective communications. They did have a bit of a bad reputation when they first arrived, because they were quite often simply implemented on top of point-to-point communications, but these days that's very much not the case. So if what you need to do is a collective communication, just use the collective communication function.

To summarise: messages are the only form of communication — our processes are completely isolated from one another and have no way of sharing data or doing anything else other than sending these messages. Most systems use the SPMD model, where all processes run the same code; there are a lot of reasons for that, especially from a practical perspective — it's a lot easier to implement and a lot easier to write code in that model, because you only have to have one source file — and every process has a unique identifier that allows it to fork, or take a different branch, within the same code. The basic communication form is point-to-point, but you also get collective communications that implement more complicated patterns that occur in many codes, and also things like barriers and synchronisations when those are necessary.

So message passing is a programming model, and that model is implemented by MPI, which — as I meant to say earlier — is the Message Passing Interface. MPI is a library of function calls (subroutine calls in Fortran, in fact). MPI is a standard that lays out the API, or interface: it says what each function should do, but it doesn't say how it should do it — that's up to the individual library developers, of which there are several. It's essential to understand these basic concepts: that all variables in each process are private, and that communications are always explicit. There's no messaging between processes that you as the developer don't see and don't explicitly say should happen, and there's no hidden shared state between processes; it relies entirely on you saying, OK, talk to process 3 now. And they all run the same binary. If you understand this model, then chances are it will be easy to understand MPI itself, because it simply implements these ideas.

So it's a very different model from sequential programming, and the slide gives a little example. This little code excerpt — if x is less than zero then exit — would not really work as you might expect in a message passing model, for the simple reason that x is likely to be different on every process. If x happens to be less than zero on a single process, that process will exit, but the others will keep going; they will not know, and will not expect, that another process has gone down, and this will ultimately cause you problems — your code will eventually fail and report an MPI error, which is sort of what you want, but not really the way you want it to happen. You could only make something like this work if, (a), you used some sort of communication to tell all the other processes that something bad has happened, and (b), you made sure that x was defined on every process and had a value that, hopefully, would be greater than zero on every process.
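(One way — a sketch of my own, not from the slides — to make that kind of test safe is to take the decision collectively, so every process learns whether any rank has a bad x and they all stop together:)

    /* fragment: assumes an initialised MPI program; exit() needs <stdlib.h> */
    int bad = (x < 0);                 /* 1 on ranks where x is negative, 0 elsewhere */
    int any_bad;
    MPI_Allreduce(&bad, &any_bad, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    if (any_bad) {
        MPI_Finalize();                /* every rank shuts MPI down cleanly... */
        exit(1);                       /* ...and they all stop together */
    }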
OK, so it's about getting your head around that fact; if you can do that, then you'll be very happy with the rest of this course. First of all, I'll ask: does anyone have any questions? I realise I should have said this in my welcome presentation: if you do have any questions, feel free to ask them in the Blackboard chat — ideally you have that open on a separate computer from the one you're viewing the presentation on — so if you do have any questions, just let us know in there. "The best way to ask questions is either to type into the chat — if you're not familiar with Blackboard, go to the bottom right-hand corner of your screen, click the pink arrow and then click on the speech bubble and you'll see the chat area — or, if you want to interrupt with a question, raise your hand and Oliver will see that and be able to pause and address questions as we go along. If you want to ask using your mic you can try; sometimes that works, sometimes people have problems, so whatever works for you." Thank you very much, Claire, for pointing out how people can ask us questions. It's looking like everyone might be OK though, in which case I would propose that we break now and come back — I suggest at 3:30.

And to answer your question: Roberto is asking in the Blackboard chat whether it's possible to call MPI from Python or R. For Python the answer is yes — I believe there is a package available, mpi4py, "MPI for Py", there we go, and thank you Claire. I believe it's quite mature now and works quite well; when I first started using MPI, people didn't really rate it that much, but it's doing much better now. We won't be showing that API specifically, but it's similar in many ways to the MPI standard, so the concepts will all be the same. I think there is an issue with running it on something like ARCHER — I think it's not possible currently; on Cirrus it might be different. I'd have to defer to authority; I can check in the break for you. R is a different story: I believe R has its own way of doing parallel things, although someone with more experience of it might be able to tell you the difference, but I think R does not use MPI as its underlying model — although I don't know for sure. I hope that answers your question. Great, you're welcome. I can see some more people are typing. Oh — Mar says it is possible to run mpi4py on ARCHER, but you'll need help from the ARCHER helpdesk to set it all up, and thank you Claire for posting a link to mpi4py as well. Thank you, Mar. So I'm guessing from what's being said that you need, at the very least, to install it yourself in a sort of
Miniconda-type environment. OK, that's good. It's also been pointed out that R does have a wrapper for MPI — we haven't used it personally; I have to admit I've used R only very little — but for both Python and R the chances are that the API is very similar, if not exactly the same, as that used for C, for example. There might be differences in how it handles typing, which might be hidden from you, particularly in Python, which doesn't really have the concept of strict typing, but underneath it will be doing the same thing; I suspect they're essentially bindings to the C code, in reality.

OK, so Nico asks a good question: in the gather operation, how can you ensure that a particular process will place the data exactly where you want it in the master process? Simply, Nico, you tell it. So — and this is one thing about collective communications which will come up again when we actually look at them in the library itself — they all rely on you providing, on the root process, the buffer where you want those things put. And this is achieved because, when you're calling a collective communication, every single process has to call that collective communication. Say you want to do a broadcast: it's not just a case of calling broadcast from your root process and then the others receive that message; they all have to call broadcast as well, and they get told whether or not they are the root process, so they know whether to do a send or a receive. For the gather collective, there will be an argument to that function call, that library call, that says: this is the buffer, the memory address, that I want the gathered data put into. That's how you ensure where it goes, which is nice and simple as long as you remember it — and remember the order of the arguments. You're welcome.

OK, well, thank you everyone for your questions; it's good to be able to get through those as well. I'm actually quite glad that we finished a little bit early, but it seems like that's it for now, in which case I propose we finish up for the time being and return at half three, while we go and get a coffee. Thank you everyone, and see you at half three.

It's half three — welcome back, hopefully you all had a good break. Next we're going to be talking about MPI itself, as opposed to message passing programming, the programming model. On that note, I thought I'd quickly bring us back to this slide from the previous talk. There is a bit of blurring of lines, even in that first lecture which was meant to be about message passing programming concepts, between those concepts and MPI itself, and there is a reason for that. Whereas in serial programming there are many different languages that implement essentially the same concepts — not all of them, but many of them have very similar ideas about variables, arrays, objects, all these kinds of things — in message passing programming there's not just MPI; there are other message passing libraries, but MPI is by far the most widely used one. It's one of those situations where one library has essentially come to dominate, which is actually quite convenient in many ways, because it means you only have to think about learning one message passing library — although it is not the only one; these days there are other options available, and indeed there are things like tasking frameworks which utilise message passing concepts, although many of
them may in fact use MPI underneath. But that's why we're focusing so much on MPI here, and the actual library itself is what we're going to look at next. So this is the introduction to MPI.

OK, so what is it? A message passing programming library. The MPI Forum is the body which governs MPI: essentially, MPI itself is a standard, defined in a document produced by the MPI Forum; the first one was produced in '93. The Forum itself consists of around 60 people from 40 different organisations — including EPCC, I should say — with users and vendors represented from across the world, and it's a two-year process of proposals, meetings and review. So the MPI standard is quite carefully curated, always with an eye to ensuring performance is never regressed and that the new ideas and new things introduced into the standard are sensible: they don't want to introduce a whole load of functions that no one will ever use, so they try to focus on things which are useful. That goes back to the point earlier about many scientific codes having very similar communication patterns, which is quite handy from the MPI Forum's point of view, because it lets them focus on the areas where they can achieve the most benefit simply by improving particular capabilities of MPI.

MPI, as I said, is a library of function calls; it's not a language. The main languages we'll look at today for calling MPI libraries — two, in reality — are C and Fortran, but as we noted in the discussion earlier there's a Python interface, which is quite mature now, as well as an R interface, so other languages do support calling MPI libraries. But it's a library, not a language: there's no such thing as an MPI compiler. However, just to confuse you, most vendors provide a wrapper for your compiler, whatever it may be, that links the MPI library for you, simply to avoid you having to personally provide those link options to your compiler — it simplifies things. So you will call something like mpicc or mpif90, but that is not actually a compiler itself; it is probably calling gcc or gfortran underneath, for example. The compiler doesn't know or care that MPI is there; it just looks like any other library you might link to your executable at compile time. And MPI handles the interface to things like hardware: for example, if you have a Cray Aries interconnect on your supercomputer, the implementation deals with getting whatever message you're trying to send onto that interconnect and delivering it at the other end. Those are all implementation details that make no difference to the top-level API that you, the applications developer, will call.

The goals and scope of MPI: as I mentioned before, they are very focused on performance and efficiency, but also portability. Ideally it shouldn't matter what machine you're trying to compile your MPI code on. In reality, of course, it matters quite a lot, but it's the implementation — or the implementor, I should say, whoever is writing the MPI library you'll be calling — that bears the brunt of that and has to make a lot of changes based on the hardware. You, the applications developer, don't have to make any changes to your source code; it just calls the top-level API and it should just work. The idea is to allow for efficient implementation by separating concerns between the actual implementation and the goals of the application developer: the application developer says "I would like to send this message", and the library implementer deals with how best to do that on a particular type of hardware.
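(To make the wrapper point concrete: the exact names depend on the implementation and the machine, but with a typical installation — Open MPI, say — compiling and launching look something like this; note it is the launcher, not the compiler, that decides how many copies run.)

    mpicc  -o hello hello.c      # C wrapper: calls gcc/icc/... and links the MPI library
    mpif90 -o hello hello.f90    # Fortran wrapper, likewise
    mpirun -n 4 ./hello          # launcher: start 4 copies of the same executable
                                 # (mpiexec -n 4 ./hello is the standard spelling)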
MPI also offers a great deal of functionality, including support for heterogeneous parallel architectures — again, that's largely implementation dependent, how well or not that actually works. I should say what heterogeneous means here: it means, for example, that your computers, your nodes, do not have to be the same type of computer at all. It does not mean heterogeneous in the sense of GPUs and FPGAs and things like that, just yet — there is stuff on the horizon, including possibly some of the work of EPiGRAM-HS, for example, looking at how you could make use of MPI on a GPU, or to a GPU even — but when it says heterogeneous parallel architectures it means more traditional computers of different kinds, and MPI doesn't necessarily care about that.

So, the header files. From our little survey earlier, the relevant languages here are Fortran 90 and C/C++. For C/C++ it's a standard #include <mpi.h> — C++ developers may note that that is a C header; more on that later — and it should be in a standard directory unless you've built your own; we assume you've installed it in some central location. For Fortran 90 it's a use mpi declaration. The function format is the same.

Just in the chat, Matthew has asked: can we offload work to a GPU whilst using MPI? The answer is yes, but not using MPI — yet. Generally speaking GPUs don't have their own rank in most MPI implementations, but you can use standard sort of CUDA-type things to push work off to them. This is, I should say, in my experience and as far as I understand it, so don't take it as gospel truth — someone may know better — but there's nothing to stop you using standard CUDA at the same time, and then you're into the realm of hybrid parallel programming, which can come with its own set of gotchas.

So, the MPI function format. In C, a call will return an error code, which you can also ignore — and often will, unless you want to do explicit error handling — and the format is MPI underscore, then the function name with a capital first letter. In Fortran it's just all caps, and the main difference really is that you have this ierror parameter at the end; that's because the Fortran MPI routines don't return any values — they're not functions, they are subroutines — so you have to include an output variable for your error code, which is just some integer. That's optional in Fortran 2008, but otherwise essential, so basically just always include it; there's no reason not to. MPI controls its own internal data structures, so in terms of C that means each type gets its own struct and so on, and it gives you handles to those; all the Fortran handles are integers, for Fortran reasons, but they all have standard names, which makes things a little bit easier.

One very important thing is that you must always initialise MPI yourself. It does not need to be the very first thing you call in your code — you can do any manner of setup and then call MPI_Init — however, it does need to be the very first MPI procedure that is called: you cannot use any other MPI function before MPI has been initialised. This is true across all languages. Do note that multiple processes are already running
before MPI_Init is called. Recall that we said before that your MPI job is run by a process launcher: you say to your launcher, I would like 100 MPI processes please, and it's doing that from the beginning of your executable. So before MPI_Init is called there are already 100 copies of your code running, and whatever variables you declare before that are still available after MPI_Init is called. MPI_Init does not itself fork or create the processes — they're all already running — which is more convenient than it is inconvenient, but any data that you declare before MPI_Init is still replicated across every process.

OK, so here's a little example. In C there are two ways — well, one way — to call it, using MPI_Init; however, you don't have to provide it with argc and argv, they're optional, and you can just pass NULL in instead. I often do, because I don't often write mains that really require any inputs from the command line. You can provide them, though. What it does with them is implementation dependent; there's nothing in the standard that says "if they provide these options, do this". A particular MPI implementation may choose to do something with a certain type of input parameter, or may not, so there's a good chance that calling MPI_Init with argc and argv will do nothing different whatsoever from MPI_Init(NULL, NULL). For Fortran it's slightly simpler: it's just call MPI_INIT(ierror), and you need your ierror — and I note that there is a syntax error on this slide, because IERROR is all caps in the call and it should be lower case. So do declare a variable before you provide it to the MPI routine as a return variable.
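(A minimal C sketch of what was just described, with both initialisation styles shown and one of them commented out:)

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        /* Pass the command-line arguments through to the library... */
        MPI_Init(&argc, &argv);

        /* ...or, equivalently for most implementations:
           MPI_Init(NULL, NULL);                           */

        /* all other MPI calls go here, after MPI_Init */

        MPI_Finalize();    /* covered shortly: MPI must also be finalised */
        return 0;
    }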
The next concept we want to explore is communicators. But before that — I see Alessandro is asking in the chat: why would we need MPI_Init, then, if multiple copies are already running even before the call? The answer is that it initialises all of the sending and receiving infrastructure. All the process launcher does is launch multiple copies of your code, but to actually do any communication you do need to have initialised the rest of MPI. You could use an MPI launcher to run 100 copies of identical code that didn't have any MPI inside it, and that would be fine, but they would all be doing exactly the same thing and there would be no opportunity to branch that code based on process ID or do any communication between them — just 100 copies of exactly the same executable all running at once, which is a fairly niche use case, if indeed it exists at all. So to do any communication between those processes and use any of the features of MPI, you do need to have initialised it; under the hood it does a fair amount of work, like setting up the communication and the links to the hardware and things like that. "Why doesn't the launcher do the initialisation itself — is there an advantage in separating the two things?" If it were all one call it would not make a difference for you; it's an interesting question. I imagine there is some benefit; largely it's just what's expected. OK, and Basel points out that you could provide different parameters, and that likely was a concern of the MPI Forum, since they have this mode in the C interface where you can provide input parameters. As I said, most implementations don't really do anything with those, but they could — that option exists, and it exists because those things are separated — so I think a large part of it is simply to ensure those concerns are separated. There may be other technical reasons why it's important, but that's a good question.

So: before, in the message passing programming lecture, I talked about groups of processes quite abstractly. In the MPI library this is defined more concretely as a communicator, and there is always at least one communicator available, the default communicator, which is MPI_COMM_WORLD. MPI_COMM_WORLD, as you might expect, simply contains every single process that you have launched. You can create sub-communicators from that, but you cannot generally create a communicator that is larger than world once your processes are launched, because that would require the launcher to fire off extra processes, and that's generally speaking not allowed.

So you've got your communicator that contains every single process — but how do you know which process you are? In C there's a function MPI_Comm_rank, which has two inputs: an MPI_Comm — as we mentioned earlier, MPI does a lot of type definitions for you, and MPI_Comm is a type that represents an MPI communicator — and the address of, or a pointer to, an integer; that is the integer which will be overwritten with your rank. It's like a Fortran-style output variable, essentially, and hence you need to provide its address rather than the integer itself. Meanwhile in Fortran it's MPI_COMM_RANK, all caps, and you need to provide the communicator, an output variable which is the rank, and ierror, and all of them are integers. In both languages MPI defines MPI_COMM_WORLD for you and you can always provide that; it just says: use the communicator that contains every process which has been launched.

The rank is not a physical processor number — it has absolutely nothing to do with the underlying hardware. There are things you can do that mean processes near each other in rank might end up near each other on your actual machine, but that's not a library-defined or standard-defined thing. The ranks are essentially arbitrary, except that they always begin at zero and go to N minus one, where N is the number of processes you have launched in MPI_COMM_WORLD. Obviously if you're using a smaller communicator that you've created, it will have fewer ranks, but it's still numbered from zero. Process zero is often treated as really special, although it's worth remembering that it is not — but of course, if you have to have one special process, it is best to make it zero, because you know that zero will always exist no matter how many processes you're running on: if you run one process, process zero exists; if you run 100 processes, process zero exists. That isn't true of any other rank. And "rank" is the general term used in MPI for the identifying number of a process.

So here we have a slightly longer example of the rank being found. In C, MPI_Comm_rank is called using MPI_COMM_WORLD — which MPI provides for you — as input, giving it the address of rank, and it returns that particular process's ID. And it's a very similar story in Fortran: it's call MPI_COMM_RANK, and you need to remember ierror.
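(In code, the C version of that slide example looks roughly like this — a small fragment, assuming it sits between MPI_Init and MPI_Finalize:)

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank runs from 0 to size-1 */

    if (rank == 0) {
        /* work for the controller process */
    } else {
        /* work for everybody else */
    }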
OK, so the other important question you might well have is: how many processes are contained within a communicator? Now, you might think, well, I launched 12 processes, so 12 — that should be easy. And that's true, but there's not necessarily a way for your executable to know that by itself, because the launcher and the library itself are completely separate. So MPI provides a function for you called MPI_Comm_size, which simply returns the size of any given communicator. You can ask for MPI_COMM_WORLD and it should return the number of processes that you launched. As you can see, the format is more or less what you'd expect: MPI_Comm_size, you need to provide a communicator, and then again the address of an integer to be overwritten with the answer. And the same is basically true for the Fortran.

And the other important step: you've initialised MPI, so you must also finalise it. As with the initialise call, it does not need to be the very last thing you call in your code — you're welcome to do other things afterwards.

So Basel is asking in the chat: does MPI_Comm_size return the number of active processes or all subscribers? It will return the total number of processes in a particular communicator, which I guess is what you're calling all subscribers. So it will not necessarily be all of the processes that were launched; it will only be all of them if you provide MPI_COMM_WORLD as the communicator. If one of them has failed, then you're in trouble — it will still return the total number. MPI doesn't, in general, internally have any way to let you know that a process has died; that's generally not an expected situation. Your launcher will generally notice, or you'll have a situation where all of your processes except one have called MPI_Finalize and nothing has happened for a while, and at that point your launcher may notice and kill the whole thing. There are extensions to MPI that allow for resilience, but in general it's not there, so MPI_Comm_size is just the number of processes in the communicator. But good question, thank you.

So yes, you must always finalise MPI. It doesn't have to be the last thing you call, but it does need to be the last MPI thing you call. Jen has just asked in the chat: can MPI_Init be called more than once? I assume that means can it be initialised more than once — the answer is no. You must have one initialise and one finalise within a single program. It's not like OpenMP, for example, where you might open multiple parallel regions through your code and close them; MPI must be initialised exactly once and finalised exactly once on every process. That's another important point: every MPI process that is launched must call MPI_Init, not just one of them — don't put "if rank equals zero, MPI_Init"; that will not work, bad things will happen — and the same is true of MPI_Finalize: every process needs to call it. They don't need to call it at the same time, so it's quite OK for your processes to go out of step, as long as they don't try to do anything with MPI after they've called MPI_Finalize or before they've called MPI_Init. Everything needs to happen in between those two. You're welcome.

So the syntax is, again, I think, what you would expect: MPI_Finalize returns an integer, which is just the error code in C, but again you don't need to assign that integer to anything if you don't want to. And for Fortran it's call MPI_FINALIZE(ierror), and you do need the ierror integer to be there — except in Fortran 2008, but since it's only in Fortran 2008 that you can omit it, you might as well just have it there.
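(Putting those pieces together, the classic minimal MPI program in C looks something like this — compile it with the wrapper, e.g. mpicc, and launch several copies with your implementation's launcher:)

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID, 0..size-1 */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes launched */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                         /* last MPI call on every process */
        return 0;
    }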
OK, so another thing that the MPI standard says should be there is some way of checking what machine you're actually on. This can be useful: the main way to identify an individual process is by rank, and that's what you will use 99% of the time, but you might want to know whether a rank, or a set of ranks, happens to be located on the same machine, and MPI_Get_processor_name will allow you to do that. How useful that actually is in practice is debatable — it can be used as a debug feature to confirm that your mapping is something sensible for your particular application — but keep in mind that you may very well, and on ARCHER certainly will, have multiple processes per individual machine. What it calls the "processor name" is a bit misleading, because MPI was defined, I recall, in the early 90s, when it was a bit more likely that you might have a machine with a single processor in it; that is almost certainly not the case these days. But you can do this with MPI_Get_processor_name, and the important thing here is to remember to provide it with an integer which gives you the name length.

Oh, Claire's done some digging — thank you, Claire. She's spoken to one of our other MPI gurus, and they point out that there are indeed scenarios where it can be useful to pass parameters to MPI_Init; also that merging the launch and the initialisation would mandate MPI allocating memory at the beginning rather than when it's actually used — that's a good point — and it might restrict more advanced functionality. Thank you very much, Claire, for that, and thank you Adrian Jackson for passing it on. And thank you again for the question, Alessandro.

So yes, for the processor name you need some kind of character array in C, and an integer n, which is going to be the length of that name, and then there's a similar call in Fortran. But as noted, it's of somewhat debatable usefulness — it can be interesting as a debugging thing, or to check the mapping is something sensible, but it's not a call you would use very often.

So we've covered some basic MPI calls, but there's no explicit message passing yet. However, you can still, in principle, write useful programs — a task farm, say: if there's no communication involved, you could already launch your jobs and fork the code based on the different ranks, so it would actually be possible to do something useful. Compilation and launching of parallel jobs are not specified by the MPI standard — these are implementation details — and we'll start talking about how to actually run this stuff for this course.

A question about the practical situations where you might need to use the f08 interface instead of f90: any time you need to use Fortran 2008 functionality, and therefore need to compile against the f08 interface. Sometimes people have legacy Fortran codes — I'm not even sure Fortran 90 counts as legacy at this stage — so if your code is Fortran 90 then use the f90 style, and if your code is Fortran 2008, use the Fortran 2008 style. If you mean in terms of why always provide ierror when it's optional in Fortran 2008: well, if the code has to be Fortran 2008 anyway, you make a good point — you don't need to, you can just omit it; that's entirely up to you. I would say the only minor difficulty is that if you're in the habit of omitting it,
Chris Stuart asks: can some processes continue using MPI messages whilst other processes have called finalize? The answer is yes, with the caveat that they cannot communicate with the one that has finalized. You can envisage a situation where one process had a lot less work to do than all the others — that's bad load balancing, but if it happens, that's fine. That process might reach finalize, and as long as it doesn't need to communicate with anything else, you're fine; your code is still correct, and the others will continue doing their thing until they too hit finalize. You're only in trouble if you try to communicate with a process that has finalized, because it presumably won't be ready to receive — or, indeed, if your other processes are expecting a message that will never be sent, because send is never called on the process that has already reached finalize. But there's nothing to stop individual processes doing their work and finishing early if there's no communication involved. Thank you for that question as well.

Now I'll talk a little about MPI on Cirrus specifically, though much of it applies quite generally. The first thing is access. Hopefully you've seen the suggestions for how to access Cirrus on the training page, but just to run through them again: on Linux this is pretty easy — you have ssh. Here we suggest you use -X or -Y (I'm not sure why both; perhaps an SSH guru can answer that one for me — as I understand it you should only need one of them, but either is fine). That gives you an X server connection, which lets you use graphical applications; the reason we recommend it is so that, if you wish, you can use gedit to modify files. It's not strictly necessary — certainly there's nothing graphical in this week's practical — so you're quite welcome just to ssh plainly. The command is along the lines of ssh -Y user@cirrus-msc.epcc.ed.ac.uk (check the exact address on the training page),
where user is the username you've already been provided — hopefully, if you've signed up for a guest account. On Linux, that's really all you need to know.

On a Mac it's marginally more complicated because, as I found out recently, Apple apparently don't include an X server by default any more. Just for clarity, an X server is the thing that lets you display graphical applications, or a graphical interface, locally; you need to be running one on your own machine in order to have graphics sent to you from the computer you've SSHed into — Cirrus, in this case. So on a Mac you can install a program called XQuartz, which provides an X server and will display any graphical applications.

On Windows you need to install an X server program such as MobaXterm. MobaXterm is quite a nice SSH tool in general and is what we used to recommend, but there's still nothing to stop you using PuTTY and Xming if that's your preference, or indeed just PuTTY if you don't need the X server. The other option is that Windows 10 has OpenSSH built in; I'll be using that myself, through a terminal emulator that lets me set a nice colour scheme. All of these options are completely fine — if you're not sure which to pick, just follow the recommendations on our page, and if you do have any problems connecting to Cirrus, let us know.

There is a file called mpp-templates.tar available that contains some useful files for building and running applications, on Cirrus especially; it contains example job submission scripts among other things. And thank you, Claire — Claire has pointed out that, if you're a Windows user, MobaXterm also comes in a portable version that does not need admin rights to run, which is helpful if you're on a managed desktop. I've used it before; it's very nice.

Matthew asks what batch system Cirrus uses: it is, I believe, PBS Pro — yes, confirmed, it definitely is. For those of you who are already familiar with batch systems, you probably don't need the PBS scripts in mpp-templates.tar, although it may still be useful to check the Cirrus documentation on PBS scripts. One feature of Cirrus worth pointing out is that the login nodes are identical to the compute nodes in terms of hardware, so you don't actually need to submit batch jobs — you can run things interactively on the login node, and that's OK. Claire has very helpfully provided a link to the Cirrus documentation too — thank you, Claire — and I recommend looking through that if you get stuck, as it has answers to most questions. So yes, you really can run things interactively on the login node, as long as they're not huge — please don't take over the entire login node; again, we will notice — but for small things, especially the sort of four-process runs we're doing here, that's completely fine; each node has 36 cores, so that's no issue. There are useful things in mpp-templates.tar,
and there are other sheets on the course web pages too.

In terms of setting up your Cirrus environment to actually run things, you need to load the message-passing toolkit, MPT, with module load mpt — modules are the way we manage environments on our machines, and this simply loads the correct libraries for MPI — and also module load intel-compilers-17. We recommend the Intel compilers for this course simply because they're the ones suggested and tested by HPE, who make MPT, the MPI library being used here, so those are the most likely to work nicely together; we're not doing anything especially fancy, so it should all just work. You can add these commands to your bash profile if you wish. You'll notice gedit mentioned there — that's the graphical text editor I referred to for anyone unfamiliar with it; if your preference is vim or emacs, that's fine, it's entirely up to you. I'll personally be using vim — for anyone who doesn't like that, the reason is simply that it means I can share one application with you and you'll be able to see what I'm doing. Whatever text editor you're most comfortable with is fine.

Now, compiling. This assumes you've already loaded mpt and intel-compilers-17 — that's important, you do need those loaded first; you may forget, which is a good reason to add them to your bash profile, but you don't have to, you can just remember to do it every time. If you're compiling C code it's mpicc, the standard wrapper MPT provides for C compiles. By default — and these are quirks specific to Cirrus — it will just use gcc, even though HPE recommend using the Intel compilers with MPT, so you need to add -cc=icc to tell it to use the Intel compiler. For C++ programmers it's mpicxx, and similarly you need -cxx=icpc to actually use the Intel compilers. For Fortran programmers it's just mpif90, and you don't need to tell it which compiler to use because it picks up ifort by default rather than gfortran — I don't know why, it's just a quirk of Cirrus. There are some provided makefiles in mpp-templates.tar, which again is available from the training page for the online MPI course.

For running interactively: I wouldn't use the login node for profiling, because lots of people are on there all doing exactly the same sort of thing, but you can use mpirun — mpirun is the launcher — with -n 4, where 4 is the number of processes; it can be anything, but you probably don't want to go above 36, because there are only 36 cores, and possibly not even that high if you're just running interactively on a shared login node. The notes point out that your output might be buffered and you may need to explicitly flush prints to the screen; I was speaking to David Henty the other day — he often runs this course — and he thinks that's actually not true any more. So if you find no output appearing at all, that may be the issue, or your program may have deadlocked or crashed, which could also cause it; remember you can always use Ctrl-C to kill it if that happens. For batch jobs, as mentioned, we have PBS Pro on there, and there is a standard batch script
that's set up in a slightly fancy way: you give it the name of your executable, so if your executable is called hello you call your batch script hello.pbs, and as long as it's in the same directory the script will pick the executable up when you submit hello.pbs. (Apologies, I just realised — no, we're OK, I take it back; I thought I was running late, but it's just the start of the official practical time, so we're still good.) So yes, you can make a copy of that script and do it that way; you can also have a look at the Cirrus docs and use their standard PBS script if you prefer — it depends how familiar you are with batch systems, and it's entirely up to you. Your job may sit in a queue for a while, I'll be honest — Cirrus is often quite busy. I don't think we have a reservation in this case, so ignore the part about using a reserved queue during live sessions; because this is an online course you're welcome to work at any time, but if you submit a batch job it may take a little while to run — you can always check back on it later. When it does run you'll get an output file of the form your-batch-script-name.o followed by some identifying number, and you can follow progress using qstat, another batch-system tool; qstat -u followed by your username is probably the most useful invocation.

Ah — someone says they've hit an issue trying to use the budget code. In fact this is at the bottom of this slide, so that's well timed: the default PBS script we provided has the budget d167 in it, which is our MSc budget. You're not part of our MSc, so you don't have access to that one, but you can use tc04. If you open the .pbs file with the text editor of your choice there is a line that reads #PBS -A d167; change the d167 to tc04. That's just how we keep track of how much time everybody is using, for accounting purposes, and you've all been given access to the budget for this course, so that's fine. Claire is also putting this information in the chat — thank you, Claire — but do let us know if you have any issues; I'll actually be going through how to do all of this in a minute as well.

Also, yes: by default those wrappers are not in your path, so you do need to load MPT first. And importantly, if you're running interactively use mpirun, but if you're submitting a batch job you need to use mpiexec_mpt instead. And remember to module load the Intel compilers and to specify the Intel compilers for C, as above.

An important note for all the C++ people out there: once upon a time there was an attempt at making a C++ MPI interface, and it was essentially abandoned, so the C++ interface is not supported. You may occasionally still find remnants of it around, but it's not recommended to use them — you should just call the C library instead. (Claire's providing more information about the budgets in the chat — thank you, Claire.) So C++ users should just use the C interface, although you do of course still need to compile with mpicxx so that compiling and linking happen correctly. And you don't need to wrap your MPI calls in — oh, what is it again, I forget the exact syntax — the thing you often need to do when importing a C library into
your C++ code. I believe you don't need to do that for MPI — you can just use the C interface in its bare form and that's fine. The main point to take away is that if you see the old C++ interface referred to anywhere, stay away from it: it has now been removed and is not supported. Ah, yes — thank you, Chris Stuart — extern "C" is the thing I was trying to remember; you don't need that, you just include mpi.h and it all works.

The MPI standard is available online. It's very long and probably not that helpful, but it may be of interest, and you can buy a printed copy of it as a book if that's your thing. The man pages are available on Cirrus and, as you might expect, on the internet — you can type man followed by the MPI function name — and there are a couple of online versions too. There's also a book that talks about MPI in a lot of detail, which may be useful if you're interested in learning more.

Hopefully you've all found the exercise sheet as well, which is again linked from the main training page — let me just find it myself. The point of this first practical is really just to get everyone on Cirrus, up and running, and able to compile and run some very simple code, so what I'll do now is basically a live demonstration of that. Hopefully this works. Good question from Chris, which Claire has already answered — thank you, Claire: why does mpirun's -n need to be 36 or less? Because there are only 36 cores on each node. Strictly it doesn't have to be less than that, but it should be, because it's much better to have no more than one process per core.

So, I'm going to begin. There are some questions coming up about how multi-node runs work. The point is that mpirun here is just on the login node, so you probably shouldn't be doing multi-node stuff with it. Yes — someone has pointed out, correctly, that it's because we're on the login node: in a batch job you're more than welcome to run on multiple nodes, just not on the login node. Chris has asked essentially the same thing, so yes: on the login node it's limited to 36 because it's a single node, but if you submit a batch job that can run across many nodes, and that's fine. And Matthew also makes the point that one process per CPU core is the better approach, certainly on a single node.

OK — I've connected; I imagine you're all far ahead of me at this point. Did I put anything here yet? No. So, if you're wondering how to get files onto Cirrus easily, there is an answer. Say, for example, that what you want is a copy of mpp-templates.tar:
what you can do is go to the website, copy the link, and then do a wget — hopefully this works — and paste it into your terminal. How you paste differs depending on what you're using: I did a right-click there; Shift+Insert is a popular one for Windows command shells; MobaXterm, I think, lets you do a drag-and-drop copy across, just like any file browser, although then you'd have to download the file to your own computer first. One easy way is, once you're on Cirrus, to use wget to pull the file straight down right there — and as you can see, there it is. If we do tar -xf and have a look inside, there are some example hello-world programs, some example batch scripts and some makefiles.

So I'll just pull up the C example first of all and we'll have a quick look at what it looks like. As you can see, there's stdio.h and stdlib.h — these are just standard libraries (stdlib.h is actually not strictly necessary here, but it's there anyway) — and then the important one: include mpi.h. You do need that. Chris is asking which file in mpp-templates.tar is C and which is C++: hello.c is the C one — let me just quickly check for you — and hello.cc is the C++ one. Claire has pointed out that, to begin with, it's best just to run some interactive jobs, and once you're sure your code is doing what you want, then try submitting it as a batch job across multiple nodes; it may take a while to run, though, so you may come back tomorrow and discover the whole thing failed immediately as it began, which is very annoying. And Sam is asking about using mpiexec_mpt in the PBS script: yes, basically the compute nodes don't use mpirun, and mpiexec_mpt is simply the equivalent for the compute nodes as opposed to the login node. (Ah, apologies — but thank you for pointing that out, that's helpful. See, I don't need to do anything; you can all answer each other.)

Great. So what we have here is a fairly standard hello-world-type program. It's not bothering with argc and argv — it doesn't need them — so we're supplying NULL and NULL to MPI_Init; it prints hello world; and it finalizes. Generally the compiler will work out that not returning anything from main is the same as returning zero; I like to be explicit, so I'm going to add the return back in, but you don't need to do that.
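For anyone following along later, the hello.c in the templates is roughly along these lines — this is my reconstruction of what's on screen rather than a verbatim copy, so treat the exact layout as approximate:

```c
#include <stdio.h>
#include <stdlib.h>   /* included in the template, though not strictly needed here */
#include <mpi.h>

int main(void)
{
    /* We aren't using argc/argv, so NULL is fine for both MPI_Init arguments. */
    MPI_Init(NULL, NULL);

    printf("Hello World!\n");

    MPI_Finalize();

    return 0;   /* the compiler would assume this anyway; I just like being explicit */
}
```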
What I'd forgotten to do, as you might have noticed, is load the modules, so let's do that: module load mpt and module load intel-compilers-17. You can tab-complete those, but it usually takes a while on Cirrus, so it's quicker just to type them in — top tip. Then we compile with mpicc -cc=icc — telling mpicc to please use the Intel compilers we've just loaded — and -o hello hello.c, i.e. create a binary called hello from hello.c. (What have I done... ah, I'd missed something — there we go, you saw nothing.)

We can run hello by itself, without the launcher — it does need at least one process to be launched for it, but it will run as if it were a serial program. Now, it is definitely possible to write MPI programs that will not run on a single process, but it's much nicer, if you can, to make your MPI programs independent of the number of processes. That's simply not always possible, but as far as you can, it's best to avoid writing code that requires a specific number of processes to run — a bit of general advice. So now let's try this across four processes — and as you can see, we get hello world printed four times.

Let's quickly do the same with the C++ code, hello.cc. Here we're including iostream instead of stdio.h, and it's still mpi.h (I believe that could be in angle brackets; it doesn't need to be, it doesn't really matter), and using namespace std, which brings things like cout up into the main namespace. Again it doesn't really need the explicit return, but I like it — and this also makes the point that MPI_Finalize doesn't need to be the very last thing in your code; that's OK. In fact, let's modify it slightly: int a = 3, and print that as well. This time we compile with mpicxx -cxx=icpc hello.cc -o hello. Another thing that's interesting to look at quickly: if I leave that flag off, you can see that by default it's g++ — which again just makes the point that it really is the standard compiler underneath. Now if I run it — and I notice the notes say to use -np; -n does exactly the same thing — you can see each process has its own independent copy of a, all initialized with the same value, and again we just get four copies of the same code running. So far so good.

And let's try it with the f90 version — hope you're ready for a good chuckle here, because I'm more of a C person than a Fortran person, but I think I can get the hang of this one. Here is the Fortran equivalent of our includes in C and C++, then implicit none — which, as I recall, is always a good thing to do, because it stops Fortran making assumptions about what you're trying to do — and this code just does the same thing as all the others. Note that it supplies ierror here; in our C version we didn't bother collecting the returned error code.

Someone is asking how to check programmatically whether you're running on a single process. You could check MPI_Comm_size, which tells you the total number of processes that have been launched; MPI_Comm_rank will still work too, and it will tell you that you are rank zero — if only a single process has been launched, that will always be zero, which is why I suggested earlier that if you need one special process it's best to choose rank zero, because that one is guaranteed to exist in all situations.

OK — I've successfully compiled and run some very simple Fortran code, and it didn't laugh at me. That's really all we want from this first week: just for people to get set up on Cirrus and comfortable with the idea of launching and running an MPI job. Feel free to experiment with the batch scripts as well and submit a larger job across a few nodes. And if you feel like extending yourself a little — in fact, if you look at the exercise sheet, this is what I ask you to do — have the program print out which rank it is as well. I won't go through that just now; I'll leave it as an exercise, but I will go through it at the start of next week's session.
The same person asks again about running on a single process: if your code might use MPI to run on more than one process, then it still needs all of the MPI machinery in it — it needs MPI_Init and MPI_Finalize, and it needs to be launched with an MPI launcher such as mpiexec_mpt or mpirun — but, as I'll demonstrate again here, you just give 1 as the number of processes to launch and it effectively runs in serial.

You can use debuggers on MPI code. You can use things like GDB, for example, although it is more difficult because there may be multiple processes that you need to attach to; you can also run things like Valgrind to detect memory errors in your MPI code. There are also specific MPI debuggers available, which will tell you more about the communications themselves. So yes, there are special debuggers, but you can also just use the standard ones; the standard ones will have some difficulties — they won't be able to tell you much about what the communications are doing, they'll just see multiple processes running — but there's nothing to stop you attaching MPI processes to a debugger such as GDB. It is true that it's a little bit trickier to debug than a serial code; failing that, you may well resort to print statements that include the rank — whether that's a good or a bad thing is a matter of opinion, but it is often a useful approach.

I realise I've managed to run slightly over time, so apologies for that, but hopefully it was useful to run through getting things going on Cirrus. Feel free to email me if you do have any questions, but for now I'll leave you to it and let you get back to your lives. Thank you very much for attending, hopefully you found it useful today, and I'll see you this time next week. Thank you very much, Claire, and thank you everyone else.

OK, it's 2:00, so I'll begin. Hello and welcome back, everyone, to this ARCHER online MPI course. Today we're going to start by talking about point-to-point communications, but first I wanted to quickly run through the first exercise with you. One thing I should make clear: the exercise sheet provided on the training page is actually the exercise sheet for our entire MSc course on message-passing programming, so don't feel you need to get through all of it — certainly in this first week we were really only looking for people to have a go at number one — but it's all on there, and if you want to do more you're obviously completely welcome to.

I'm going to quickly show you one way of doing just this first part. If anyone had any problems with the exercise, do let me know in the chat and we'll see what went wrong, if I can — and if not, I can try to help later. So, I'm already logged into Cirrus, and as this is essentially live coding it's very likely to go wrong — apologies in advance. I'm going to do it in C, because that's simplest for me, but whatever language you choose to do this in is completely fine. This is just the example that was provided in mpp-templates, and the aim of this first exercise was simply to split, or fork, the code based on the process ID — the MPI rank of each process —
and see how that works. So I'm going to modify this. It's just a simple hello world — it prints hello on every rank if you run it across multiple ones — but we want to make it a little more sophisticated than that. (I'm adding the explicit return in, because leaving it out always annoys me.) So far I've just modified the printf statement to include the new information: what we want is for each rank to print "Hello World from rank <my rank> of <however many there are in total>". To achieve that we first need to declare some integers to actually hold the rank and size values.

Then it's MPI_Comm_rank. The first argument to this MPI function — which we talked about last week — is the communicator; we'll use MPI_COMM_WORLD, the default all-of-the-ranks communicator. You could also declare a variable of type MPI_Comm — hopefully that's the right data type, let me just check... ah, of course, there we go — set it to MPI_COMM_WORLD and pass that instead, and that would also work fine, but for now I'm just going to put MPI_COMM_WORLD directly into MPI_Comm_rank. So: MPI_COMM_WORLD, and then the address of an integer which will be overwritten with the rank value. Now, this is mainly a foible of C, but it's important: you might think it's OK just to supply an int pointer there, and that will compile, because it meets the signature of the function, but because it's an uninitialized pointer it won't be pointing at any actual space — you'd have to malloc some heap space the size of one integer, or point it at an actual integer that's already been defined elsewhere, to make that work. The simplest thing is just to declare an integer and supply its address. That's really just a C thing, but it's worth remembering, because it's quite common for people to do that accidentally, and it will not go well. The size call is very similar: it takes a communicator — again we'll just use the default — and the address of some integer to overwrite. And that, fingers crossed, should be it, if I haven't forgotten anything.

What I haven't done yet is load the modules — I made the fatal mistake of trying to tab-complete on Cirrus, which always takes a little while; I should just have typed it out myself, that's what I get for being lazy. There we go. (I had a feeling I was doing something wrong there, but we'll find out very shortly, I'm sure... yes — I'm on the front end, so it is mpirun.) Let's try it on two processes first of all. OK, so what I'd done there was miss a newline. And there we can see "Hello World from rank 1 of 2" and "Hello World from rank 0 of 2".

It's worth noting that these are not necessarily in the order you might expect them to be. There is no guarantee in MPI whatsoever that things will happen in a specific order unless you include synchronization explicitly, and it's even less likely, when it comes to flushing standard output, that things will appear in order. That can be kind of a pain, to be honest — there are ways around it, but it's something to be aware of: the ordering of output from different processes is never guaranteed. (Hello to the person waving in the chat.) But we can run this for any number of processes and get exactly the same thing.
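Written out in full, the little program from that demo looks roughly like this — again a reconstruction of the live coding rather than a copy of it, so the exact wording of the print is approximate:

```c
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int rank, size;

    MPI_Init(NULL, NULL);

    /* Each process asks the default communicator for its own rank and for
       the total number of processes that were launched. */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello World from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```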
Hopefully you all managed to get somewhere with that. Do let me know if you had any problems, either in the chat or by email, and I'll see if I can help. And with that, we'll get on to the first of the lectures for this week — the good stuff: messages. (One day I'll learn how to use PowerPoint.)

So, messages in MPI. A message in general contains some number of elements of some particular data type, and for that reason MPI defines its own set of data types. There are basic types, which are the sort of things you find in most languages, and derived types, which are essentially custom data types — in C, for example, you could define your struct as a derived type built up from basic types. We're not going to cover that in this course, but it's a thing that can be done. The basic types differ between the languages, for obvious reasons.

One important point: because a message contains a number of elements of some particular data type, you need to tell MPI what that data type is. There's no obvious way, in any of the languages we're looking at, to determine how to interpret a message that's been received — bear in mind it's just a stream of bytes that gets transported between processes. How to interpret it depends on what the data type is, and the receiving side needs to be told by the sender; that's done by simply sending the data type information along as well, because otherwise the receiver just gets a stream of bytes and has no idea what to do with it.

The basic data types in C are essentially all of the basic data types in C, but with MPI_ in front. How likely you are to use each of these correlates pretty strongly with how likely you are to use the corresponding basic type in C: I would imagine ints and doubles, possibly floats, are quite common, char maybe as well, and the rest much less so. One thing worth noticing is that there's an MPI_BYTE type, which just says "don't bother trying to interpret this, treat it as a stream of bytes". Quite often people just use MPI_CHAR instead, since a character is one byte — whether that's a good idea depends on what you're sending, I suppose. You can attempt to recast the receive buffer at the other end to whatever you like after sending a set of bytes, although obviously if a suitable basic data type is available it's a much better idea just to use it, and if you have your own data type it's better to create an MPI derived data type instead — that's a much cleaner way, and it makes debugging much easier than sending byte streams around and hoping for the best. So I'd generally recommend against MPI_BYTE unless you have some extremely good reason. The point is that you need to tell MPI: "I'm sending 5 ints from here to there."

Similarly, Fortran has its own set of basic data types; again I'd expect double precision and integer to be quite common, and complex rather more likely than in C. Fortran programmers might note that MPI_REAL — all of them, really — is a little bit vague as a definition, because kinds exist; the MPI standard was written before kinds existed, but it does support them. We're not going to go into that in this course particularly, but if you have a look around online you can see how MPI deals with kinds. Basically, if you're using the default kinds you're OK; if you're using something a bit more specialist you'll probably need to create a derived data type, and there's support for doing all of that sort of thing,
and otherwise you just do what you would expect. Obviously it's machine dependent as well, to an extent. As ever, if you have any questions just put them in the chat and I'll respond — don't be shy about it.

So: point-to-point communications. This is the most basic communication mode, or form of communication, available in MPI — and also the most commonly used, as it's the simplest way of doing a lot of things. As you might expect, this is communication explicitly between two processes: some source process sends a message to some destination process, anywhere within a communicator. Communication takes place within the communicator — we'll talk a little in the next lecture about defining different communicators, but the default one, of course, is always MPI_COMM_WORLD, which simply contains every rank that has been launched by the MPI launcher; you can define your own communicators, and there are some good reasons for doing so. The destination process — and indeed the sending process — is always identified by its rank in the communicator that you provide. You always have to provide a communicator, even if it's just MPI_COMM_WORLD, and that determines which process the rank you give refers to; this may differ between communicators. So a rank is not a universal thing — it's specific to its communicator — and you always have to explicitly provide a communicator in basically all MPI calls.

The process is that the sender calls a send routine and has to specify the data to be sent, unsurprisingly; this is called the send buffer, and that's how we'll refer to it. In C, at least, it's essentially a memory address for the start of your data, whether that's a single integer, an array of them, or a derived data type. The receiver calls a receive routine, again specifying where the incoming data should be stored; we call that the receive buffer, and again it's just a memory address. You do need to make sure — we'll talk about this in more detail shortly — that the receive buffer is large enough and points at properly allocated memory, or you will run into problems. As well as the data coming from the send buffer and going into the receive buffer, the message also contains some metadata. As the programmer you don't need to worry about that: it's received into separate storage handled by the MPI library itself, called the status, and we'll talk a little about why that's useful.

The real focus of this — in fact, of the next lecture — will be communication modes in a lot of detail, but we're going to start that journey here. MPI distinguishes between communication modes, which are synchronous and asynchronous, and communication forms, which are blocking and non-blocking — hopefully I've got that the right way round; yes, I have. These are slightly different concepts in MPI, and as I say we'll go into a lot of detail; later in the course we'll discuss non-blocking sends, but for today we're only going to be looking at blocking communications. What I mean by that is that the function will not return until the send buffer is available again for reuse — it blocks the process that's running it until the buffer can be safely freed or reused.
Whether the communication is synchronous or asynchronous is a separate question: there's a blocking and a non-blocking form of both, and we'll talk more about non-blocking later in the course; for now we're looking only at blocking, but at both synchronous and asynchronous.

A synchronous send, as the name suggests, only completes when the receive has completed, so it synchronizes the two processes involved in the point-to-point: one has to have posted its receive at the same time as the sender has posted its send. A buffered send is an asynchronous communication: it completes immediately — always, unless an error occurs — no matter what the receiver is doing; it just sends the message, throws it out there. So the synchronous send is more like a phone call, where nothing happens until the other person picks up, whereas the buffered send is more like sending a letter: you drop it in the post box and then carry on about your day.

There is also — and this is somewhat unfortunate — a thing called a standard send, which is either synchronous or buffered. It was meant as a convenience but, in my opinion, is quite unhelpful — and not just my opinion: since before I first took this course myself, many moons ago, we've always recommended that you don't use MPI_Send, the standard send, for reasons I'll explain in more detail later. And although we'll show it off a bit today, the same recommendation kind of holds for the buffered send too, again for reasons I'll explain.

For now, receives. Receives are simple: again we're only considering blocking, and a receive is always synchronous — there is no asynchronous receive. When a process posts its MPI_Recv it will wait, and it will not move on from that function call until it has received the message.

So here are the calls for the various modes: the standard send, MPI_Send — don't use that; the synchronous send, MPI_Ssend — do use that; the buffered send, MPI_Bsend, which is asynchronous but still blocking; and MPI_Recv. The call names are the same in both languages, but the signatures differ a little: in C, MPI defines its own data types, such as MPI_Comm for the communicator, whereas in Fortran they're all just integers, and as ever there's an ierror argument, which you should always supply; the equivalent in C is the returned integer, which is just an error code that you can do something with or completely ignore, as you see fit. The "<type> buff" shown for the Fortran signature is obviously not a real Fortran type; the point is that in Fortran everything is passed by reference rather than by value, so the buffer can be of whatever type you like, while in C the corresponding argument in the call signature is just a void pointer to your buffer.
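For reference, the C prototypes of those four calls look like this (as given in the MPI standard; note that from MPI-3 onwards the send buffers are const-qualified, while older versions just use a plain void pointer):

```c
int MPI_Send (const void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm);        /* standard send - avoid it     */
int MPI_Ssend(const void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm);        /* synchronous send             */
int MPI_Bsend(const void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm);        /* buffered (asynchronous) send */
int MPI_Recv (void *buf, int count, MPI_Datatype datatype,
              int source, int tag, MPI_Comm comm, MPI_Status *status);
```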
So we're going to try a bit of interaction here. Suppose I want to send data from rank 1 to rank 3 and I write some code that looks a little bit like this — can anyone tell me whether it's going to do what I want? Let me go back a second and point out what's in these call signatures: you have the buffer, which points to the memory you want to send from — the send buffer; the count, which is the number of elements; the data type, which says what those elements are and from which MPI can work out the size of the buffer it's going to send; then an int, dest, which is just the rank of the destination process; then a tag, which we'll talk more about later and in the next lecture, but which is essentially just a number that should match the one on the receive; and the communicator.

So here I'm attempting to send data from rank 1 to rank 3. Do people think that code looking like this would work? Can you spot what's missing? Feel free to pile into the chat with suggestions — and Conor J has absolutely nailed it: yes, if you tried to run code that looked like this, it would deadlock.

Alexander asks: does it work without the ierror argument? Yes — because this is C-like code, the error code is just the return value, and you don't need to do anything with it; in Fortran (unless you're using the f08 interface) you would need to provide ierror. But to go back: Conor has it completely correct. If you ran the code like this, every process would be trying to do a synchronous send to rank 3 — including rank 3 itself — and no receive is ever posted, so it will deadlock. A deadlock just looks like your code hanging forever.

Sheran asks whether this would be flagged by a debugger. If you have an MPI-specific debugger, maybe — probably not, though, is the answer; I'm not sure how sophisticated the available tools are. Actually, I've a feeling some debuggers can do this — we do have a few of the more sophisticated ones on ARCHER — so they can examine the communication patterns, I think. As a general rule, though, no, because this code is correct as far as the compiler is concerned — it won't be flagged by the compiler, which will look at it and decide it all looks fine. What will happen is that you'll try to run your code and it will simply hang, because everybody will be waiting for rank 3 to post a receive, including rank 3 itself. And there is no timeout in MPI: it will just wait forever. You can implement timeouts of your own, but it's horrible and doesn't really work well, so MPI just doesn't do it; it relies on you writing correct code. That's part of why we strongly recommend, especially to begin with, that you write your code using synchronous sends: they're more likely to throw up an issue like this, and if your code works with synchronous sends then it is correct. As we'll see later on, that's not necessarily the case for asynchronous communications.

To jump on a little: yes, the slide is highlighting that the destination rank is 3 and the tag is 0, so that's fine, and here's the answer for how to get this to run as we would want — you have to fork on the rank, and say: if rank equals 1, then perform the synchronous send. And yes, it is unfortunate that there's no easy way to debug MPI calls in terms of the communication patterns; but rest assured, if you do have code that deadlocks, you are far from the first and certainly won't be the last person to have that issue.
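To make the deadlock concrete, here's the broken version written out as a complete (deliberately wrong) program — my reconstruction of the slide, not the course's own source. Every rank executes the same synchronous send to rank 3 and nobody ever posts a receive, so it hangs; the corrected, rank-forked version appears in a sketch a little further down.

```c
#include <mpi.h>

int main(void)
{
    int x[10] = {0};

    MPI_Init(NULL, NULL);

    /* Every process executes this line -- including rank 3 itself -- and no
       MPI_Recv is ever posted anywhere, so every MPI_Ssend waits forever. */
    MPI_Ssend(x, 10, MPI_INT, 3, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```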
One more thing to point out: in that example we were sending an array of 10 integers, so we had count equal to 10 and the data type MPI_INT. If we want to send a scalar, we still do just the same thing — we explicitly pass the address of x (again just a C thing), but the count is simply 1. It's as simple as that for sending a single value rather than an array. The only other thing is that on rank 3 we would need an "if rank equals 3, post an MPI_Recv" for this to actually not deadlock. And here's the Fortran equivalent — including the ierror argument, which is back now — but otherwise the principle is very much the same, and it does just the same thing.

So, for the receive, the function signature is really quite similar: again we have some buffer that we receive into, a count and a data type — so that MPI can work out how many bytes it should be writing, in this case — then a source, so you say "I'm expecting a message from rank 1" or whoever; a tag it should have; the communicator it's all within; and then, additionally, an MPI status. In Fortran this is an array of integers, I believe, and in C it's the MPI_Status data type — a struct, in fact. You can choose to completely ignore the status and never do anything with it; that's fine. It's more useful for non-blocking communications than blocking ones, but there are a couple of uses for it that we'll go into here, or possibly in the next lecture. It does need to be there regardless of whether or not you're going to use it, though, so you can just declare a status somewhere in your code and provide that — overwriting it repeatedly is completely fine, especially if you're never going to look at it; you don't need a different MPI_Status declared for each receive you post, it just needs to be some piece of memory of the right size that MPI can write into.

So here we're looking at receiving data from rank 1 on rank 3. Again the source and the tag are highlighted — the tag is also 0, so that all matches — and you can see we've created an MPI status at the top of the code that we can just overwrite. We're receiving into the buffer y, which has been created large enough for 10 integers, from rank 1, with tag 0, and we need that "if rank equals 3" to make sure, again, that our code doesn't deadlock — because receive is another blocking function, so if a process goes into MPI_Recv and never receives the message it's expecting, it will just keep waiting forever and ever, until you kill the process or your MPI job. Very similar rules apply for receiving a single number.
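Putting the two halves together, here's a sketch of the corrected pattern the slides build up to — rank 1 synchronously sends ten integers and rank 3 receives them. This is my own reconstruction for illustration (run it with at least four processes so that ranks 1 and 3 both exist):

```c
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int rank, i;
    int x[10], y[10];
    MPI_Status status;          /* MPI writes the message metadata in here */

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1)
    {
        for (i = 0; i < 10; i++) x[i] = i;
        /* Synchronous send of 10 ints to rank 3, tag 0. */
        MPI_Ssend(x, 10, MPI_INT, 3, 0, MPI_COMM_WORLD);
    }
    else if (rank == 3)
    {
        /* Matching receive: y is large enough for the 10 incoming ints. */
        MPI_Recv(y, 10, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("Rank 3 received %d ... %d from rank 1\n", y[0], y[9]);
    }

    MPI_Finalize();
    return 0;
}
```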
However, one thing that's important to note here is what exactly the count means, because there is a subtle difference between the count that you send and the count that you receive. So what do you think would happen — again, feel free to post in the chat — if the message we received had more than 10 integers in it, i.e. it was too large for the buffer we'd allocated? Someone guesses it would be truncated — a sensible guess, and it is implementation dependent, so you might be right, it could well just be truncated; but it's more likely that MPI would exit with an error and bring down the entire job. Quite often it will simply say "nope, something is wrong here" and give up.

On the other hand, if the message that's received is smaller than the buffer, that's completely OK. So, in the finest traditions of statically allocated languages, you can simply define a buffer that's much larger than you will ever need and use the status afterwards to determine how big the message actually was. Equally, on the send side, the buffer you're sending from can be longer than the message you actually send: if you just want to send the first 10 elements of an array, you can do that simply by providing the address of the start of the array and saying count equals 10, and it will just take the first 10 — the fact that the array is longer doesn't matter. If it's the other way round, of course, and the buffer is shorter than your count, that will be bad. The same holds at the receive end: the receiving buffer can be larger than, but cannot be smaller than, the message you receive, or you'll run into problems. Ideally, of course, the counts will always just match up.

Someone has asked for a bit more information about the tag. We'll talk about it more later and in the next lecture, but broadly speaking it's an identifier for a particular message. If, say, an MPI_Recv is waiting for a message with tag 0 and a message arrives with a tag of 1, it will ignore it — it won't do anything with it — and continue waiting for a message sent with tag 0. So it's a way of matching up sends and receives. Whether that matters depends entirely on your use case; quite often you're happy just to receive whatever message is being sent by that process, and in that case the best thing to do is just to use 0. You can also wildcard it — we'll talk about that later; it has some drawbacks — but hopefully that's enough for now, and I promise there'll be more discussion later on.

Martin asks: what if the count on the receive is smaller than the count on the send, but the buffer is actually large enough for the message? Ah, I see — no, MPI will most likely still abort: MPI believes you when you tell it how big the receive buffer is, so it will not try to write beyond that. And does the receive function allocate memory? No — someone asked whether MPI_Recv allocates physical memory in advance, and it very much does not. You need to have allocated the correct size yourself, and MPI_Recv will believe whatever size you tell it and won't try to write more than that; you need to create, on the stack or the heap, enough space for the message you want to receive.

Josh asks whether a mismatch of data types automatically causes an error. No, actually, that would be OK, provided the total buffer you described was large enough. The basic point is that what MPI does is somewhat separate from whatever else is going on in your code: if, for example, you created an array of 10 floating-point numbers in your code, you could receive into it something you told MPI was 5 doubles, and that would work, because those two things have the same size in bytes; your code would then interpret the result as a floating-point array. MPI doesn't check the data types — it just uses them to work out how many bytes it should be writing.
That said, there's no benefit to lying to MPI like that. If you really want raw bytes you can use MPI_BYTE — in effect you're then just providing the raw size in bytes rather than a count of typed elements, and MPI will assume you've done the maths yourself and that the buffer is large enough — but there's no benefit to you as a developer in misleading MPI about what's going on. I hope that makes sense. And none of this is about allocation: the allocation should already have been done by the time MPI gets involved. The count and data type are simply there so MPI knows how much to write, because all you give it is a memory address; saying "10" on its own isn't meaningful unless it's explicitly bytes, because different data types have different sizes — and it could be a derived data type, an array of whatever-sized structs you've created for yourself. So it's there to let MPI do the arithmetic; the memory allocation should already have happened elsewhere in the code, and MPI will not do it for you.

So this is MPI_Recv: buffer, count, data type, source, tag, communicator, status in C, and in Fortran it's all that plus an ierror on the end for good measure. In Fortran you can see the status is defined as an integer array of dimension MPI_STATUS_SIZE, which is a convenience definition so you don't have to worry about how big it needs to be — just use MPI_STATUS_SIZE.

Now, synchronous blocking message passing — the Ssends. As mentioned before, the processes synchronize, and what I mean by that is that you can tell what point your sending and receiving processes have reached in the code, because one has posted its send and one has posted its receive, and neither will get past that point until they're both there. Until the receive has completed, they will both sit in that function and wait; if one side of the pair is missing, they will sit and wait forever and the code will deadlock — there is no timeout, it will just wait forever. If you're running on, say, ARCHER or Cirrus, it will run until the wall time you specified has passed and then your job gets killed. Both processes wait until the transaction is completed, so it blocks.

I've already said most of this, but: for the communication to succeed, the sender must specify a valid destination rank — you can't say "send this to rank 100" when you've only launched four processes; that will fail — and equally the receiver must specify a valid source rank. The indexing of ranks starts at zero, as is right and proper if you're a C programmer, and ranks are never negative — just an index from zero upwards. The communicators have to be the same, and that's an important point, because ranks are not the same across communicators: each communicator has its own list of ranks for its processes, and a rank is only meaningful within that communicator. If the communicators are different, the rank means nothing — in fact there's no communication between different communicators at all — so the communicator must be the same for these point-to-point communications. The tags must match — as I said, that's just another way of matching up messages.
The message types must match as well — so apparently I was wrong about that a moment ago: you can't get away with lying to MPI, apologies; that's probably for the best, I think. And the receiver's buffer must be large enough, as we discussed. The only way you can get MPI essentially to ignore the data type is by using MPI_BYTE, in which case it will just say "OK, I'm looking for this many bytes".

The receive can be wildcarded: you can say MPI_ANY_SOURCE to receive from any other rank, and MPI_ANY_TAG to receive with any tag. It's better to use these only if you absolutely have to — it's always more efficient if you can specify the source and tag, and safer in terms of making sure your code is correct. There are good reasons to use MPI_ANY_SOURCE; however, if the only reason is that your code deadlocks and then, when you put in MPI_ANY_SOURCE, it doesn't deadlock any more, then in all likelihood that means your code could be reordered so that the wildcard isn't necessary — and it's better to do that than to wildcard it. Equally with MPI_ANY_TAG: it might seem the obvious thing to do if you just don't care about tags and aren't using them for anything, so why not? That's fine, except it's actually just a little bit slower; if you don't care, just set the tag to zero on both sides — that's more efficient and has essentially the same effect.

When you do wildcard, you can check the actual source and the actual tag by examining the status parameter — in C that's a struct, in Fortran an integer array. This metadata is kind of like the envelope in which the message is contained: the send and receive functions carry all of the information an envelope would hold if you were sending a letter, like they did in the old days — the sender's address, who it's for — and inside it there's the data. That envelope information is returned in the status, and it includes the source, the tag and a count.

Martin asks why you would use wildcards at all, saying it seems like a high risk of running into an error — and that is absolutely correct, which is why we recommend you don't. However, there are reasons why you might: if you have a more complex communication pattern which isn't predictable at a certain point in your code, or indeed if you're doing non-blocking communications. It might be, for example, that you have a controller/worker pattern — like a task farm — where the controller is just waiting to hear that a task has been completed before it sends another one out to any given process; there you would post a receive that says "I don't mind where this comes from, but when it arrives, do this". That might be an instance where you need to wildcard, but you're right: it's always better, if you can, to avoid wildcards.
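As a purely illustrative sketch of that controller-style pattern — the MPI_ANY_SOURCE/MPI_ANY_TAG wildcards and the status.MPI_SOURCE/status.MPI_TAG fields are standard MPI, while the surrounding program is my own invention — the controller below accepts one result from each worker in whatever order they arrive and uses the status to see who actually sent it:

```c
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int rank, size, i, result;
    MPI_Status status;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
    {
        /* Controller: take one message from every other rank, in whatever
           order they happen to finish, then inspect the envelope. */
        for (i = 1; i < size; i++)
        {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("Got %d from rank %d (tag %d)\n",
                   result, status.MPI_SOURCE, status.MPI_TAG);
        }
    }
    else
    {
        int work_done = rank * 10;   /* stand-in for a real result */
        MPI_Ssend(&work_done, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```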
There are a lot of things MPI will let you do that, for the most part, you shouldn't, because it's designed to be very flexible. What's happened over time is that new functionality has been added as new use cases have been discovered, but you'll also find that the features which are more commonly used are the ones the MPI Forum focuses on improving. So if you're using a more niche feature of MPI, it's less likely to be well supported and well optimised; if you stick to the mainstream things, those are where you get the best performance. Whether that's a good or a bad thing I can't really say, but that's the way it is. So, as I mentioned earlier, it's quite all right to create a receive buffer that's larger than you need, and then you can get the real number of things that arrived using MPI_Get_count, supplying the status. That relies on the count the sender specified, and the sender can't lie about that either, because MPI will only read that much memory in, so you're fine. The message count call is MPI_Get_count in C: you supply the status for the message, the datatype, and the address of an integer which gets overwritten; the Fortran call is the same, and it tells you the real number of things that arrived. Chris is asking what the difference is between the result of MPI_Get_count and status.count. Let me go back and see; I can't find the actual layout of an MPI_Status struct, and I suspect it isn't meant to be relied upon. The important point, though, is that MPI_Get_count uses the datatype. Ah yes, Chris has asked whether it's byte count versus datatype count, and the answer is yes: if there is a count field in the status it will just be the raw byte count, whereas MPI_Get_count uses the datatype you provide to tell you how many whatevers arrived. Someone has found the Microsoft Docs for MPI, amazing, thank you for the link; whenever I look things up I end up at the MPICH documentation, one of the various library vendors, and it's quite terse. But yes, almost certainly the status just holds raw byte counts, and the MPI_Get_count function uses that plus the datatype you provide to give you more useful information.
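Here is a small sketch of the oversized-buffer idea in C: the receive buffer has room for 100 ints, the sender only sends 5, and MPI_Get_count with the MPI_INT datatype recovers the real number of elements that arrived rather than a raw byte count. The sizes are made up for the example.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, nreceived, recvbuf[100], sendbuf[5] = {1, 2, 3, 4, 5};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Ssend(sendbuf, 5, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* the count here (100) is only the capacity of the buffer */
        MPI_Recv(recvbuf, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        /* ask how many MPI_INTs actually arrived */
        MPI_Get_count(&status, MPI_INT, &nreceived);
        printf("actually received %d ints\n", nreceived);   /* prints 5 */
    }

    MPI_Finalize();
    return 0;
}
```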
Okay, another important MPI feature, one of the things written into the standard, is message order preservation: messages do not overtake each other, even for non-synchronous sends. There is an important caveat, though: this is only true between the same two processes. With other traffic on the network anything can happen, but for any pair of processes, messages between them will never overtake one another, whether they're synchronous or asynchronous. So let's look at some examples of message matching. In the first example, rank 0 sends two messages synchronously to rank 1, with tag 1 and then tag 2, and rank 1 has posted two receives, into buffer 1 and buffer 2, both from rank 0, with tag 1 and tag 2. This is fine: the first synchronous send is posted, the first receive is posted, they match because they have the right source, the right destination and the same tag (and we're assuming the datatype information and buffer sizes are all correct), so that send and that receive complete; then the second send and the second receive are posted and they both complete successfully. Everything's great. Option two: this code will deadlock, because the synchronous send from rank 0 has tag 1, but the first receive posted on rank 1 has tag 2. Because these are synchronous, it doesn't matter that the next receive has the correct tag: rank 1 is never going to come out of that first receive, it will just sit waiting forever for a tag-2 message to arrive and completely ignore the tag-1 message, since it hasn't posted a receive for it yet. Meanwhile the first synchronous send will never complete because it hasn't been received, so this code deadlocks. The solution is simply to reverse the order of one of the pairs, and it doesn't matter which, which is why we say synchronous sends help you get correct code and give you a better starting point. In the next lecture we'll go into more detail on send modes, so there will be plenty of discussion of the relative merits of buffered sends then. Message matching, example three: here we have asynchronous buffered sends where the tags match, both tag 1. We'll talk more about what exactly a buffered send is later, but for now all we need to know is that it's asynchronous: it sends and then returns immediately, whether or not a receive has completed. Any ideas how this one works out? Bas says it looks fine, and I agree. The messages have the same tags, and what's key here is that they're matched in order; the order in which they arrive is the tiebreaker, so message 1 is received into buffer 1 and message 2 into buffer 2. This code runs happily with no deadlock, everything is fine and dandy, and note that the second Bsend will probably have begun before the first receive has completed, which is fine in this case because it's an asynchronous send. How about this one: asynchronous sends with tag 1 then tag 2, and the receives posted with tag 2 first. Alexandra says it should be okay too, and that's correct, this one is also fine. What's different is that because the first Bsend is sending with tag 1, that message will not be received until after the second message has been sent, and that can happen here because it's a Bsend. So the tag-2 message gets sent and received into buffer 2, which is the first thing that happens at rank 1; it ignores the presence of the tag-1 message until the second receive is posted, and then message 1 is received into buffer 1. So if you're using asynchronous sends you can use tags to match up particular sends with particular receives, and that's probably desirable, because it helps you know what exactly you've just received. The thing to note is that there's no real advantage to doing it this way round; I'll save the reasons for the next lecture rather than launching into Bsend just yet. Here the matching is on tags, not on order, but the messages still don't overtake each other even though they're asynchronous; it's just that the receives are posted out of order compared to the sends, and in this case that doesn't deadlock.
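To make the deadlocking case (the second matching example) concrete, here is a rough sketch in C: rank 0 synchronously sends tag 1 then tag 2, but rank 1 posts its receives in the opposite tag order, so neither side can make progress. The payload values are invented, and if you run it on two or more processes it will simply hang, which is the point; swapping the order of either pair fixes it.

```c
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, a = 1, b = 2, buf1, buf2;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Ssend(&a, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);   /* tag 1 first */
        MPI_Ssend(&b, 1, MPI_INT, 1, 2, MPI_COMM_WORLD);   /* then tag 2  */
    } else if (rank == 1) {
        /* DEADLOCK: waits forever for tag 2, while rank 0 is stuck inside
           its tag-1 Ssend, which will never be received                   */
        MPI_Recv(&buf2, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&buf1, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* reversing either the two sends or the two receives makes it work */
    }

    MPI_Finalize();
    return 0;
}
```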
But just switching to asynchronous communications is not a good way out of a deadlock; writing your code correctly is. And what about here? Where do we think the various messages will end up in this situation? Several guesses there, and Chris Stuart has it right: message 1 will be received into buffer 1 and message 2 into buffer 2, because they don't overtake one another. You're right, though, that you only know that by looking at the code this way; nothing programmatic will tell you, you'd have to check the status to find out the actual tag values. They are, however, guaranteed to match in send order. Alexandra is asking if I would ever code like this last example. I sincerely hope not, Alexandra. Marta is asking why the previous one worked: it worked, Marta, because Bsend is asynchronous, so it doesn't block; it sends and then moves on immediately. We'll go into more detail about what it actually does in the next lecture, but for now it's enough to know that it returns straight away without waiting for the receive to complete, which means the second message also gets sent, and once the second message has been sent, the first receive can complete. And yes, it's correct that they don't overtake: message 1 arrives at rank 1 first but nothing happens to it yet (this will become a bit clearer in the next lecture when we look at what Bsend actually does), and the code works correctly because the second send is able to happen. Someone is asking about code which presumably had overlapping messages, with the tag wildcarded, implemented using the standard send. The standard send is basically either synchronous or buffered, so there anything could happen; again, this is pretty much the focus of the next lecture, and we strongly recommend against using standard send for exactly that reason. Yes, you'd have to check what the tag was if it has been wildcarded and they're using standard sends. Someone else is asking whether the receiver is reading messages into a buffer rather than only when you receive them. Kind of, yes; I might as well say this now. What the buffered send actually does is either send synchronously or, if there is enough room in the local buffer, copy the message into that buffer. What's special about the buffered send is that it makes the send buffer reusable straight away: you can overwrite or free message 1 and that's fine. But it doesn't mean the message has actually been sent; it may just have been copied into MPI's own buffer space, or buffer space you have provided (we'll talk about that in a bit). Message 1 is reusable straight away, but that doesn't mean it has gone anywhere, because this isn't synchronous. What actually happens is that there is some amount of communication between MPI processes that MPI does for you: the buffered send will work out that there is no receive yet and hold on to that message until it's told there is somewhere to receive it on rank 1. It's a slightly messy situation. Hopefully that clears it up a bit, and pretty much the whole of the next
lecture is about exactly this sort of issue. So, Alexandra has asked which is more efficient, Send or Bsend, and Marta is promising more questions for the next lecture. For Alexandra's question: Ssend, simply because it doesn't copy the data into another buffer. It sends from the buffer you provide, because it knows there had better be a receive, or there will be soon; MPI will probably hold the message in its own local memory until that receive arrives, if it hasn't already been posted. Josh is asking if it's possible to have Bsends piling up in buffers forever if the receiving rank doesn't ask for what's being sent. Josh has already hit upon one of the major issues with Bsend: messages will pile up in the buffer until you run out of buffer space, and then it will crash your entire program. So I think we're now fully loaded with all the spoilers for the next lecture. That's why we say Bsends are bad, and standard sends are bad because they might turn into a Bsend. It is possible to have code that runs correctly on one machine and not on another in exactly that way, because buffer sizes can differ and communication patterns differ on different machines, for reasons we'll go into; but essentially, yes, you can just keep shoving Bsends in until the buffer is full and then everything goes down, which is quite unpleasant. Ssends are much safer. And if what you actually want from asynchronous communication is to overlap communication and computation, there is a set of non-blocking communications that do a much better job of that, which we'll be discussing next week, I think, or the week after. Basil is asking if you can add code to clear the message buffer. No, because it's internal to MPI, and it's also never a good idea to remove messages: at some point you assume a receive will be posted for any given message, or what was the point in sending it? It would never arrive if you'd already cleared it away. And it's not a great idea to just fling a message at another process in the hope that buffer space will appear; for that sort of problem you end up piling stuff into some unknown buffer. Yes, Bsend is asynchronous, always. Josh is asking if the buffer memory is protected, or whether a rank can access its own buffer. No, or at least the developer shouldn't touch it. We'll see this in the next lecture: there's a buffer-attach function you call where you allocate some space and provide it to MPI, saying, okay, shove the messages here, and if it's large enough that's great; if it's not, you're in trouble. You shouldn't do anything with it except through MPI after you've provided it. Okay, so this was the example we looked at with the wildcard tagging: here the messages will just be received in order, so buffer 1 gets message 1 and buffer 2 gets message 2, and you can check the status to find the actual tag values and work out what is likely to be in buffer 1 and buffer 2. If a receive matches multiple messages in the "inbox", the messages will be received in the order they were sent, and that is only relevant for multiple messages from the same source, because message order is only preserved between a particular pair of processes, not
across the network as a whole. Okay, that's it for now; we'll come back at half three. I'm going to go and get another coffee, and then at half three I'll quickly introduce the calculation of pi and we will launch into the next lecture, which is on communicators, tags and modes, and which will contain a great deal of me saying that Bsends are bad. Okay, welcome back; I hope everyone's had a chance to go and get more caffeine. As I said, you're welcome to do the exercises at any time you like; I'm just going to briefly introduce an exercise to you now and then we'll move on with the lecture. Exercise two is the calculation of an approximate value for pi, and the point is that this time you're moving one step up: you'll need to actually use point-to-point communications to calculate it. You can find the exercise sheet on the main online MPI training page; it's linked from there, and, as I mentioned earlier, that page has all of the exercises for our entire message-passing programming MSc course on it, so you're welcome to try as many of them as you like. Ah, Clare has linked it, thank you very much, Clare. If you're keeping pace with this course, the next one to try is exercise two. A couple of small pointers to get you started. The value of N in the expansion of pi is not the same as the number of processes; in fact it might be more useful for me to show you the actual equation right now. So that value of N is not the number of processes, it is just a large number, and the larger it is, the closer your approximation will be. It is, however, helpful for the implementation if N is divisible by the number of processes, so we suggest 840 as a good starting value, because conveniently that's divisible by 2, 3, 4, 5, 6, 7 and 8. You should get the same answer, more or less, independent of the number of processes; it should depend mainly on N. One thing to note for C programmers like me is that the summation goes from 1, not 0, and it includes N, so it runs from 1 up to and including N, not "less than N". Ideally you should be able to run the same code on any number of processes, so you may have to think a little bit about that. Remember they are all running the same code (you don't fork it), so you don't need separate variable names for pi for each process; they can all just use pi, or a partial pi, and you probably want to break the iterations up across the processes. We strongly recommend just using MPI_Ssend and MPI_Recv. The final bit is about MPI_Wtime. If you get that far, you can also try to make it so that the number of processes does not need to be an even divisor of N; it's not as simple as you might think, and basically involves having a slightly imbalanced load, but it's up to you how much time you want to spend on it. If you get as far as the part that asks you to time it (and timing is also important for the ping-pong exercise), MPI helpfully just gives you a timing function that returns the time in seconds as a double precision number, which is great if you've ever had to deal with CLOCK_MONOTONIC and all that business in C. It's very handy to have MPI_Wtime available, and the W stands for wall-clock time, so it just gives you elapsed seconds, which is nice.
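By way of a starting point for the exercise, here is a rough sketch of the recommended Ssend/Recv pattern plus MPI_Wtime. It assumes the usual midpoint-rule form of the expansion, pi approximately equal to (1/N) times the sum over i = 1..N of 4 / (1 + ((i - 0.5)/N)^2), which is the common version of this exercise, but do check the exercise sheet; it also assumes the number of processes divides N, as 840 conveniently does for 2 to 8 processes.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    const int N = 840;                 /* divisible by 2..8, as suggested */
    int rank, size;
    double partial = 0.0, pi, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    t0 = MPI_Wtime();                  /* wall-clock time in seconds */

    /* each rank sums its own contiguous block of i = 1..N
       (assumes size divides N exactly)                      */
    int chunk = N / size;
    int first = rank * chunk + 1;
    int last  = first + chunk - 1;
    for (int i = first; i <= last; i++) {
        double x = (i - 0.5) / N;
        partial += 4.0 / (1.0 + x * x);
    }

    if (rank == 0) {
        pi = partial;
        /* collect the other partial sums with plain receives */
        for (int src = 1; src < size; src++) {
            double p;
            MPI_Recv(&p, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            pi += p;
        }
        pi /= N;
        t1 = MPI_Wtime();
        printf("pi ~ %.12f, time %f s\n", pi, t1 - t0);
    } else {
        MPI_Ssend(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```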
Okay, if you have any problems with that, then as ever feel free to get in contact and I'll see what I can do. So, in the next lecture we'll be talking about modes, tags and communicators. We're going to cover an explanation of the different MPI send modes in a bit more detail, possibly slightly superfluous at this point, but we're still going to do it, plus the meaning and use of message tags, and the rationale for MPI communicators. These are fairly basic units of MPI's functionality, and it's useful to understand them properly and know exactly what they're doing. In particular, the use of different communicators is not immediately obvious, but they are quite handy, as is also true of tags, although with tags it's often for asynchronous communication that they become important, because with synchronous communications, like everything else, the code had better be correct or it's never going to work anyway. So, the three modes we've discussed so far; and again these are all the blocking forms. It's in the next lecture that we'll discuss non-blocking communications in MPI, but for now we're all about blocking. These functions will not return until the buffer you have supplied is safe to reuse, at which point the send buffer can be safely reused or freed. The synchronous send has, by the time it returns, actually sent your message: it has copied the data out of the buffer (the buffer itself is untouched, the data is still there) and sent it to the destination rank, the destination rank has received it, and only then does the Ssend complete; the routine will not return until then. The buffered send, by contrast, is asynchronous, but it takes a full copy of the send buffer before returning: it copies the data into a buffer and sends it later, when it's ready to be received, so it returns before the message is delivered. It's an asynchronous communication that is still blocking, in the sense that when it returns, that send buffer can be freed or reused. And then, unfortunately, we have the standard send. The MPI Forum were trying to be helpful: the standard send will be either synchronous or asynchronous, basically depending on whether or not there is enough buffer space available, although what exactly that means is entirely implementation dependent, which also makes it machine dependent, which means your code might run perfectly fine on your own computer and crash horribly on Archer, for a number of reasons. So we never really recommend it; we just suggest you stay away from MPI_Send because, to be honest, the mere fact that it's not guaranteed what it will do on different platforms is enough to put me right off personally, and beyond that, if you're going to be communicating synchronously or asynchronously, it's better to know for sure which of those you are doing, so MPI_Send is often not very helpful for that reason. Let's consider the humble Ssend to begin with, then. Process A calls Ssend with send buffer X, sending to process B. Process B does its own thing for a while (time in this diagram runs from top to bottom), running some other non-MPI code, and during this time process A is just waiting around, still sitting inside the Ssend, because it's synchronous. Then process B posts its receive: okay, I'm expecting a message from process A, please put it into buffer Y. The data is transferred,
as if by magic, through the network; then the Ssend returns and the receive returns, so the two processes are synchronised at that point, and then they continue about their day and most likely become desynchronised once again, but in that one moment in time they are both exiting the send and receive functions. It's nice: you know exactly what's happening, and if your code works correctly with Ssends, your code is correct from that point on, because both calls are blocking. Because it's a blocking send, X can now be overwritten by A: it can reuse it, free it, delete it, whatever. And Y is safe to read by process B, because the receive is always blocking (it's always blocking even if you're doing asynchronous sends, but that's for next week); the point is that once the receive exits, Y is safe to read, it has been completely filled. Now the buffered send. What happens here? Process A is doing its thing and it calls Bsend, again a message for process B from buffer X, while process B is going about doing whatever. Because there's no receive posted on process B yet, what process A does is copy X into some buffer space, a different bit of memory, and then carry on. That means the variable X can now be overwritten by A, or freed; the developer is free to do whatever they like with it, which is nice, and Bsend returns immediately, so process A carries on. Meanwhile process B eventually posts its receive, into buffer Y from process A; the data is transferred, the receive returns, and the buffer space over on process A is freed again and can be reused. If the receive had already been posted before the Bsend, that transfer would just have taken place straight away; it's not going to take a copy if there's a receive waiting. But in this situation, where the receive hasn't yet been posted (and synchronisation between processes is only guaranteed at synchronous communications, so anything could be happening), the Bsend just copies the data out, says you can reuse that buffer, and the message sits on process A for the time being. An obvious issue with this is that it means taking copies, all the time, of the things you are sending. And MPI requires, if you're going to do this, that you attach buffer space: if you use Bsend you should call MPI_Buffer_attach first, with the address of some allocated memory, and you also need to remember to detach it later on. Yes, it is a memory address for attach: you tell it the size in bytes and the starting address. And that buffer space had better be large enough for all of the Bsends that might end up in it. How many might end up in it depends entirely on your code and may not be easily predictable. In particular, you might write some code that uses Bsend and runs perfectly well on your laptop but dies on a system like Archer or Cirrus, not because you've allocated badly and not because there's less memory on those machines in general, but because if you're scaling your code up to run on more processes, as you presumably would be unless you have a very large laptop, there will naturally be more communications in flight, because there are more processes to communicate with, and that means a buffer which was large enough for a much smaller communication network is no longer large enough.
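For completeness, here is a rough sketch of what attaching your own buffer for Bsend looks like in C, noting again that we don't recommend this pattern. The buffer is sized for a single double-sized message plus MPI_BSEND_OVERHEAD, the per-message bookkeeping constant MPI defines; the payload value is made up for the example.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, bufsize;
    double x = 3.14, y;
    void *buffer;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* attach enough space for one double-sized message plus overhead;
           it had better be big enough for every Bsend in flight at once  */
        bufsize = sizeof(double) + MPI_BSEND_OVERHEAD;
        buffer = malloc(bufsize);
        MPI_Buffer_attach(buffer, bufsize);

        MPI_Bsend(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        /* x is reusable immediately, but the message may only have been
           copied into the attached buffer, not actually delivered yet   */

        /* detach at the end of the code, not after every send; this call
           waits until any buffered messages have been transmitted        */
        MPI_Buffer_detach(&buffer, &bufsize);
        free(buffer);
    } else if (rank == 1) {
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```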
It's also, to an extent, machine dependent, if you look at the standard send: what standard send does is make use of a default buffer that MPI provides, but that is implementation dependent and machine dependent, and I'll be talking about that next. Okay, so this is where standard sends come in: with the standard send you don't need to call buffer-attach, because there is a small default buffer, but it varies from machine to machine, and the standard send will act as a synchronous send if the buffer is not large enough. You don't know what that buffer size is, it changes from machine to machine, and so, presumably, does the scaling of your code, so it's all just a bad idea. Just use Ssends when it comes to blocking communications, because in terms of creating overlap between communication and computation, the non-blocking sends provide a much safer and much more usable way of doing that than Bsend does, and one that doesn't involve taking a copy of your data all the time; with Bsend you're always doubling up your memory usage. And here it asks: where does the buffer space come from? The user provides it, and it's difficult to know ahead of time how big it needs to be. There are situations where you can gauge it completely and everything will be okay, and in that case fine, but as a general rule it's best not to rely on it. Indeed, if you're writing code from scratch for the first time, it's better to start with it being synchronous, because then you know that all your receives and sends line up and are correct; once the code functions correctly as a synchronous code, you can start thinking about how to make it asynchronous to get overlap of computation and communication and make it more efficient. At least then you have a working version that is absolutely right, a nice baseline to begin with, compared to MPI_Send, where you don't know what's happening, or the buffered send, where the code may be correct but poorly designed: if your receives and sends are out of order and it only works because they're buffered sends, you could just change the order and it would be more efficient. The receive is always synchronous, yes; process B would have waited until the send arrived. And yes: Bsend will not just fail, MPI will abort, if you overflow an attached buffer. It will only try to copy into the buffer in the first place if it can't send immediately, but it will never silently turn into a synchronous operation. It is always blocking in the sense we discussed, but it will never just become synchronous: if the receive has already been posted it will send straight away, and then it looks rather like a synchronous operation, but only because the receive happened to be there already, and there is no way to guarantee that. If there isn't a receive waiting, it takes a copy, and if the buffer is full, it fails and MPI aborts, which is very undesirable behaviour. It is not able to simply say "okay, I'll just wait", because it is a strictly asynchronous function as defined in the MPI standard.
Chris suggested: what if Bsend blocked until something had been removed from the buffer? Even if it could do that instead, I don't know that it would really be desirable behaviour, because it has the same issue as the standard send: you don't really know what is going to happen on any given machine, or on any given run of the code. Unless you're always running on exactly the same number of processes on exactly the same machine with your data sizes fixed all the time, it's not reproducible, which isn't nice; and indeed you could deadlock. You can write code that should deadlock but doesn't, because the sends are asynchronous, and if Bsend did what you're suggesting it would then deadlock when it switched to waiting, as indeed the standard send does. Right, yes; Chris says that makes sense, there's no reason for it to try to handle that situation rather than simply forcing you to provide the right-sized buffer, and that's correct, because other sending modes are available. If you really want to go down this route, the MPI position is: okay, then do it right. Ah, here we go, here's where we talk some smack about MPI_Send. So Send runs a risk of deadlock, only it's not really a risk if your code is correct. Someone is asking whether Bsend works without issuing MPI_Buffer_attach. That's somewhat implementation dependent: it might, if the implementation defines a default buffer size, and it may not give you any way of finding out how big that is. In general I would expect it not to; I would expect it simply to fail if no buffer has been attached. Another point: if you are using Bsends, there is an MPI_Buffer_detach that you must also call, but you don't do attach, Bsend, detach around every send; you attach at the start of your code and detach at the end. So the questioner is saying that Bsend works without a buffer attach with pi. Is that with the pi example or with a Raspberry Pi? Ah, with the example, okay. Whichever implementation you're running is clearly providing some default buffer, and you're sending a double, 8 bytes, which I can see is probably within the realm of a default buffer size. Fair enough. So yes, it can work fine, but that's dependent on both the machine and the MPI library you happen to be using, and what that default size is, and it's not necessarily scalable, because, as I say, when you scale your code up onto many more processes there are likely to be many more communications happening than with fewer processes, and it becomes more likely that you'll start piling more and more messages into the buffer space. Then you get this difficult-to-debug failure, especially if you use Bsends in multiple places throughout your code, where one rank has filled its buffer space, and the question is: on which send, and why? That depends on the state of the network, which is not easy to capture. So we recommend Ssends. As for Ssend: it runs a risk of deadlock, but you can always mitigate that risk by having your receives in the right order, except in certain specific use cases, like task-farm-type patterns, where you genuinely don't know who is going to be communicating precisely when. Bsend avoids that deadlock, but you have to supply buffer space, and it will fail if the
buffering is exhausted; and, as we'll discover next week, better options are available for asynchronous communication. MPI_Send tries to solve the problem by just doing the right thing, except that's not actually very helpful, because you can end up in a situation where it can do neither of those things and then it fails. Yes, it could cause your program to deadlock if buffering runs out, because when it finds the buffer full it will just sit and wait, as Chris suggested Bsend might do, and that's when you find out that one of your receives is in the wrong order or one of your tags is incorrect, and you're sitting staring at a deadlocked program that worked fine before, which is difficult to debug. So Ssend is the way to go. Here's an example of some code that may or may not work. One difficulty with Ssends is that for a single point-to-point communication they're fine, but having to pair up all of your communications is not necessarily sustainable or scalable. In that case, tags can be a good way to make sure that messages are matched correctly without you explicitly having to pair up sends and receives so they always happen in the correct order; and for the more general solution we can use non-blocking communications, which we'll talk about next week. I was not joking earlier when I said this lecture would mostly be about reasons not to use Bsend or Send. But for simple examples like the ping-pong you can just make sure the sends and receives match to make Ssend work, and the same goes for the pi example. Now, MPI does allow you to check whether any messages have arrived, with something called MPI_Probe. It has the same syntax as a receive except there's no buffer, because it isn't saying "receive this message"; it's saying "I think there might be a message for me, can you just check?". You can then examine the status it returns to find out the size and all that kind of thing, and if there is indeed a message waiting for you, you can post a receive to get it. Obviously this is again more useful in the kind of situation where your communication pattern is not fixed and you're not necessarily sure when things will or won't be arriving. Do be careful with wildcards here: you should use a specific source in the subsequent receive to guarantee matching the same message, and you get that source from the status, because MPI_Probe may have seen that there's a message, but if more than one is waiting it will only have picked up the first one, or the one that matches your tag, so your MPI_Recv should make sure it's definitely looking at the same message, assuming you want to know exactly what it is. Every message can have a tag; it's non-negative, and there is an actual maximum value. One thing that's useful for debugging, for example, is to set the sender's rank as the tag on every message, because when something goes wrong you can ask what the tag on this message was. Be aware, though, that that's not a scalable approach, because there is a maximum value for the tag: the standard only requires it to be at least 32,767. On Archer it's around 100,000 or so, I think, and on Cirrus it will also be quite large, but it's not infinite.
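If you'd rather check what the maximum allowed tag is on a given machine than guess, the standard exposes it as the MPI_TAG_UB attribute on a communicator. A small sketch, assuming I have the attribute call right (the attribute value comes back as a pointer to the integer):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int *tag_ub;   /* MPI hands back a pointer to the stored value */
    int flag;

    MPI_Init(&argc, &argv);

    /* query this implementation's upper bound on tag values */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
    if (flag) {
        printf("largest usable tag on this system: %d\n", *tag_ub);
    }

    MPI_Finalize();
    return 0;
}
```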
So do be careful about that. Often we just set the tag to zero, but it can be useful if you want to receive only messages with a certain tag at a certain point. This can be used when you have multiple messages incoming from the same source: you're not sure exactly when, but you do need the receives to happen in a certain order so you know which buffer to put them in, and you can do that just with tags. You can also just use wildcards and check the status for the actual tag value; that's fine too, and indeed that's how you would do this business of receiving into the right buffer. Communicators: so far we've only looked at MPI_COMM_WORLD, which is always guaranteed to be there, provided by MPI under that name. There's also MPI_COMM_SELF, which is arguably less useful; it contains just one process, the process you're running on. There are good reasons to communicate with yourself at times, maybe to help you write more general code, but that's quite niche, I suspect; generally people use MPI_COMM_WORLD. One situation in which you should definitely not just use MPI_COMM_WORLD is if you are writing a library, the reason being that you want to be very careful that your library's MPI messages stay separate. Ah, Marta is asking why you would check for messages with MPI_Probe. The answer is that, again, it suits some sort of collector or task-farm type pattern, where you're not exactly sure when you will receive a message; one thing you can do is probe for a message, and if there is one there, the probe gives you enough information about it to allocate the right space, or to put it in the right buffer. It's for communication patterns that are a lot less fixed than we usually find in scientific codes, where it will often just be something like a halo swap on a domain decomposition, a fairly rigid communication pattern. That's not always the case, and then it can be useful to ask, is there a message coming in? It is blocking, so it will wait until there is something, but it gives you the information about what exactly that something is. You know something should be coming, and you know where from, but you don't know what exactly that message will be. It can also be useful if you don't want to do the thing I mentioned earlier of just creating a receive buffer that's "large enough": you can use MPI_Probe to find out how big a specific message is and then allocate a buffer to receive it into, which saves you having to create a quote-unquote large enough buffer. So there may be situations where you have dynamically sized things flying around, and this can help alleviate the pain. It's certainly a more advanced way of using MPI than what we're really considering in this course, but hopefully that makes sense; let me know if it doesn't. Let me just check for you: I'm fairly sure MPI_Probe is blocking and will just wait, like a receive, for something to turn up. Yes, and there is a separate non-blocking probe routine, so MPI_Probe is the blocking one: it will wait, but it won't do the actual receive part, it just finds out the metadata about the message that's coming in.
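Here is a sketch of the probe-then-allocate pattern just described: probe (blocking) for a message from a specific source, use the status to find out how big it is, allocate a buffer of exactly that size, and then post the matching receive. The message length of 37 ints is made up for the example.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, count;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int msg[37];                       /* a size the receiver doesn't know */
        for (int i = 0; i < 37; i++) msg[i] = i;
        MPI_Ssend(msg, 37, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocks until a matching message exists, but does not receive it */
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);

        /* now we know exactly how much space we need */
        int *buf = malloc(count * sizeof(int));
        MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d ints\n", count);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```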
Communicators, then. As we saw earlier, all communications require you to provide a communicator of some sort, which is fundamentally just a group of processes, in no particular order; the ranks are not assigned based on anything in particular, they're just divvied out. MPI_COMM_WORLD always exists, and a message can only be received within the same communicator from which it was sent; it is not possible to wildcard the communicator. However, you can create new communicators that contain just specific ranks. Each process is given a new rank within each sub-communicator, and the ranks are always numbered starting from zero, so rank 0 is guaranteed to exist in any communicator, because the minimum communicator size is one. This is useful if you're writing a library, because if you create a communicator that belongs just to your library, you can guarantee that your library's MPI messages will not interact with the user's MPI messages. That's important, because the last thing you want is to start accidentally receiving, in your library's receive calls, messages the user was trying to send in their own code, or you will have some fairly upset users as they try to work out why all their messages are being swallowed by your library. There are other uses for creating different communicators. You could attempt a similar sort of splitting based simply on tags, saying if my rank is even I'll use this tag, and if my rank is odd I'll use that tag; that's one way to do it, but MPI_Comm_split guarantees that no messages will pass between the different groups of processes. MPI_Comm_split lets you divide MPI_COMM_WORLD up, and MPI_COMM_WORLD still exists afterwards; there's no way to remove it, it is always guaranteed to exist. So, to answer Marta's question about whether MPI_COMM_WORLD still exists after you split it into sub-communicators: it does. Alexandra is asking whether the split preserves the original relative order of the ranks. Let me look up MPI_Comm_split; what's missing on this slide is the call signature, which I think is coming up. There is a "key" argument you can provide to MPI_Comm_split that determines the ordering of the new ranks, so you do have some say over what the new rank is; if every process passes the same key, ties are broken by the old rank, so the original relative order is kept. More generally, the way the standard is written often gives a lot of freedom to implementers, to let them optimise or change things and still comply with the standard, so really detailed behaviour is sometimes left open, although the common-sense solution is usually the one that gets chosen. Marta is asking whether that means a process can belong to several communicators. Yes: there's nothing to stop any one process being in multiple communicators; it will not necessarily have the same rank in each of them, but it can be involved in several. It's worth noting, though, that having many, many communicators is not a pattern MPI expects you to use: the idea is that there might be several communicators, not
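As a small sketch of the even/odd splitting idea done with a communicator rather than with tags: every process calls MPI_Comm_split with a "colour" (here rank % 2), processes with the same colour end up together in a new communicator, and each gets a fresh rank numbered from zero within it. The choice of colour and key here is just for illustration.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int world_rank, sub_rank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* colour = rank % 2 puts even and odd ranks in separate communicators;
       the key (here the world rank) controls the ordering of the new ranks */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);

    MPI_Comm_rank(subcomm, &sub_rank);
    printf("world rank %d has rank %d in its sub-communicator\n",
           world_rank, sub_rank);

    MPI_Comm_free(&subcomm);   /* MPI_COMM_WORLD itself still exists */
    MPI_Finalize();
    return 0;
}
```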
thousands of them. So the MPI standard says these facilities for creating communicators should exist, but it doesn't say they have to be particularly efficient or fast, because you're expected to use them a few times near the start of the code, just to set everything up. One common pattern in scientific codes is a domain decomposition with a halo swap: you have some grid or array, you divide it up amongst the processes, and the processes need to communicate with their nearest neighbours for the boundary conditions. You would not create a communicator for every process containing all its neighbours, because that would not be a scalable approach: it could require you to create many thousands of communicators, and that is not the way to go. On the other hand, if you have two very discrete sets of tasks that need completing, you might simply split MPI_COMM_WORLD in two and say communicator zero is going to do this and communicator one is going to do that, and divide things up that way. But you have to be sure they never need to communicate with one another; they can still do that through MPI_COMM_WORLD, but that will be based on their COMM_WORLD ranks, not their sub-communicator ranks, if that all makes sense. You can also make a copy of MPI_COMM_WORLD through the MPI_Comm_dup routine; it contains all the same processes, but in a new communicator. Why is that useful? As I mentioned earlier, if you're writing a library it's a very good idea to make sure your communications are completely separate from the end user's, and simply duplicating COMM_WORLD is a very simple way of making sure you still have access to every process that's been launched, but in a way where you're not going to trample over the end user's messages, nor they over your library's; it's just much cleaner and much safer. You might think, "I'll just set my tag to something like 117,000, no one else is ever going to use that, so it'll be fine", but because tag wildcarding exists, that can still cause you problems. In fact I've spent a day or two in the past debugging code where, as it turned out, the problem was that the library routines were using a special tag to identify themselves rather than a separate communicator, and my MPI_ANY_TAG receives were swallowing those messages and everything was going very wrong; it was an absolute pain. So don't do that. Just use a different communicator.
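As a sketch of that library case (the function names mylib_init and mylib_finalize are hypothetical, just for illustration), the idea is simply to duplicate whatever communicator the user hands you and do all of the library's communication on the duplicate, so its messages can never be matched by the user's receives:

```c
#include <mpi.h>

/* Hypothetical library initialisation: keep a private communicator so the
   library's sends and receives can never interact with the user's code.  */
static MPI_Comm mylib_comm = MPI_COMM_NULL;

void mylib_init(MPI_Comm user_comm) {
    /* same group of processes, but a completely separate message space */
    MPI_Comm_dup(user_comm, &mylib_comm);
}

void mylib_finalize(void) {
    if (mylib_comm != MPI_COMM_NULL) {
        MPI_Comm_free(&mylib_comm);
    }
}
```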
Okay, so why should we bother with all these send modes? Well, hopefully I've made the point: basically, use anything but Bsend. The non-blocking routines are different; they are much more useful, and we'll discuss them next week. Those are how we suggest you do asynchronous communication, because they don't guarantee that the buffer is reusable when they exit, but they do provide a way for you to test whether or not that has happened, so they give you the best of both worlds: you're not copying the data, you're basically starting the sending process in the background, and you can be sure when it has completed, rather than just assuming it has at some point, as with Bsend. The standard send is also not helpful, because it has essentially the same drawbacks as Bsend and might also deadlock, so it actually gives you the worst of both worlds. Ssend gives you correct code and should always be your starting point, if nothing else. What should you use for a tag? Potentially nothing at all, just set them all to zero. That's better than wildcarding because it's more efficient: your receive isn't checking for whatever comes first and you don't have to check afterwards what the actual tag was; from the implementation side it's a little bit quicker because it doesn't have to consider all the potential tags there might be, it just says, I'm looking for something with tag zero. Tags can also be useful for debugging, but do be a bit careful that if you're running a very large job that might break, because there is an upper limit on the tag value. Can I just use MPI_COMM_WORLD? Yes, and you probably will; there's a good chance you'll never need to create new communicators. It's arguably bad practice to write MPI_COMM_WORLD explicitly everywhere, just in case you do need to change it later on, in much the same way that hard-coding any value is a potentially bad idea; it's always potentially a good idea to create a variable instead that is simply set equal to MPI_COMM_WORLD but could be changed. One other case in which creating communicators is useful is that MPI actually has a concept of a communicator topology. Let me have a quick look at the schedule... no, we're not going to come on to it in this course, but you can create something called a Cartesian communicator, for example. MPI ranks are just numbers arbitrarily assigned to each process, but with a Cartesian communicator you can tell MPI that your processes are actually going to correspond to different parts of a 2D grid, say, or a 3D array, because you're doing a domain decomposition, and then it also provides a way of finding your neighbours. You can do that yourself, simply by doing a bit of integer arithmetic based on the array sizes and so on, but MPI provides a convenience function to do it for you, given that you've told it how the processes are going to be laid out. In theory it also lets the implementation do something a bit smarter underneath: when you create the Cartesian communicator you can say "if you need to change the ranks around, go for it", and what it can then do is make sure that ranks that are nearest neighbours in the grid are physically located near one another as well, trying to put them on the same node, or the same board, or whatever. It doesn't have to do that, there's no guarantee, but it's at least a possibility if you give it that information. And that's another pattern the MPI standard follows: it likes to create ways for the user to provide useful information to the implementation, saying "this is what I'm actually going to be doing, so please make it as efficient as possible", and that helps the implementers a lot. That's probably the most common reason for users to create their own communicators: they're specifying something about the topology, saying "I actually want these processes to be considered as next to one another".
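A rough sketch of that Cartesian topology idea, for a periodic two-dimensional decomposition: MPI_Dims_create picks a sensible process grid, MPI_Cart_create builds the communicator (with reorder set to 1 so the implementation is allowed to renumber ranks to match the hardware), and MPI_Cart_shift then hands you your neighbours without any integer arithmetic on your part. The choice of two dimensions and periodic boundaries is just for the example.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int size, rank, up, down, left, right;
    int dims[2] = {0, 0};          /* let MPI choose the grid shape */
    int periods[2] = {1, 1};       /* periodic in both directions   */
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Dims_create(size, 2, dims);            /* e.g. 12 -> 4 x 3 */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                    1 /* reordering allowed */, &cart);

    MPI_Comm_rank(cart, &rank);
    /* neighbours one step either way in dimension 0 and dimension 1 */
    MPI_Cart_shift(cart, 0, 1, &up, &down);
    MPI_Cart_shift(cart, 1, 1, &left, &right);

    printf("rank %d: neighbours %d %d %d %d\n", rank, up, down, left, right);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```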
That was a bit of a diversion, and that's us at the end; as I've spoken far too quickly, does anyone have any questions? Ah, there's a question about the send-receive command. That's another one on the "just don't" list. (There's actually yet another mode as well, which we've also ignored, called the ready send, and the less said about that the better.) Let me just look at send-receive now so I get this right. MPI_Sendrecv is basically a buffer-swap type operation: it combines the sending and receiving into one single call, and the point is that you send from your buffer and expect the destination rank to have also called MPI_Sendrecv, so whatever is in the destination rank's send buffer is received at your end and vice versa; it swaps the two. It's a synchronous call, equivalent to doing a send on process A and a receive on process B, followed immediately by a send on process B and a receive on process A. It can work great, but you do have to be a little careful because it is a synchronous operation, and it's there for a certain specific use case, which it does well. It's also the kind of thing you might want to introduce once you already have a working version of the code and you notice, "this is always just swapping pretty much the same thing"; think of it almost as an optimisation. I wouldn't recommend it as something to start out with, so never make it your first version, but you might later see that something could just be done with a send-receive and that might be better. It's best to write it with simple synchronous sends first. Good. Marta is asking about message order preservation: say you call two Bsends, they land in the buffer in the right order and wait for their respective receives, but if a receive with MPI_ANY_TAG comes along, is it the order that decides? Yes, that's exactly right, Marta. Let me just go back to my slides; this was the situation, I think, that you're describing, and yes, in that situation it is precisely the arrival order that determines which message ends up in which buffer, because both receives are wildcarded with MPI_ANY_TAG. The dangerous thing here is that the first MPI_ANY_TAG receive is going to pick up the tag-1 message, because that was the first one sent, which is fine; but if the second receive, instead of MPI_ANY_TAG, had specified tag 1, you would be in trouble, and that code would then deadlock. So one thing that's important above all else is to be consistent when you're doing this kind of thing: if you've chosen to do it a certain way, make sure you stick to it, or you may get yourself into a deadlock situation. And if you are wildcarding like this, you can check the status to find out the actual tag value received, which may be important for determining what exactly you expect to find in a given buffer. Marta says: so, in a way, the messages can overtake, because you can receive them in a different order. You can receive them in a different order, yes, but they haven't overtaken one another in the network; I see what you're saying. The point is that MPI guarantees that once they're in flight between those two processes, the order is fixed.
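Going back to that send-receive question, here is a rough sketch of the buffer-swap use in C: ranks 0 and 1 each send their own value and receive their partner's in the one call, which is what you would otherwise write as a carefully ordered send/receive pair on one side and receive/send on the other. The values exchanged are invented for the example.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, partner, mine, theirs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ranks 0 and 1 swap values: each sends its own rank and receives the
       partner's; no ordering headaches because the send and receive are
       combined into a single call                                        */
    if (rank < 2) {
        partner = 1 - rank;
        mine = rank;
        MPI_Sendrecv(&mine,   1, MPI_INT, partner, 0,
                     &theirs, 1, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d got %d\n", rank, theirs);
    }

    MPI_Finalize();
    return 0;
}
```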
Yes, and because what Bsend actually does is copy the message into a buffer as well, they haven't really overtaken one another: the first one simply hasn't been sent yet. But I see what you mean. Just to go back to an earlier question, I'm looking at a documentation page for MPI_Probe, and it confirms that it is a blocking call that returns only after a matching message has been found; there is an asynchronous alternative. Chris J is asking whether I have any recommendations on how to get started developing and running MPI software on your own machine, for example choosing which MPI implementation and version to use, any good online guides, and how to set up an MPI cluster. Actually, that's a lot easier than you might expect. It depends a little on what your computer is, but, as is often the case with these sorts of development tools, your life will be a lot easier if you do your development on a Linux machine, or on a virtual machine running some kind of Linux, because then you can simply install Open MPI from your friendly local repository. Open MPI is probably the go-to for local work, because a lot of the other implementations are either commercial, so you have to pay for them, or very specific to certain types of hardware. Open MPI is the more experimental library, but it still has a stable version, so it's the obvious choice for simply installing on a local cluster or even on your own computer. On that note, you can launch any number of processes on any computer; there are specific things on a service like Cirrus to stop you abusing that, but if nobody has configured such limits you can do whatever you like, so you can merrily run, and I often do for testing purposes, say four MPI processes on a two-core VM on my laptop; that's not a problem at all. Chris is also asking about Open MPI versions: his friendly local package repository offers a choice of Open MPI 2 and Open MPI 3, so is there any reason not to use the more recent version? No. MPI is a standard, and the process for changing it is careful: yes, there are more experimental implementations, but the things that make it into the standard itself are very carefully vetted and usually heavily debated, so unless you're using a specific experimental or development branch of an implementation, you're unlikely to run across things that don't work, and I can't see any particular reason why you would here. I don't know off the top of my head what the differences are between Open MPI 2.x and 3, but everything we've talked about so far in this course will certainly work fine in version 3, and there will be some new and exciting things in there too. MPICH is the other sort of standard MPI implementation; MPICH is a little bit special because they are funded by the US government to maintain a stable and efficient MPI library. This is a somewhat esoteric analogy, but it's a bit like Debian versus Ubuntu, looking at MPICH versus Open MPI: one is more focused on stability and the long term, the other is more likely to have newer things in it and maybe be a bit less stable, but still basically works, and both are free. And, looking at our own notes, there is a version of MPI for Windows; I can't say I've ever tried it, I'm just looking at it now, and I'm a little bit
dubious, mostly because I know that C support is not great on Windows (they would really rather you used C++, and MPI itself no longer has a supported C++ interface), but I see no reason why it wouldn't work if you are a Windows developer, and I imagine that is your main choice in terms of free MPI implementations on Windows. By its nature, because it's designed for distributed computing, MPI is mostly of interest to supercomputing centres, who are unlikely to want to fork out for Windows Server licences, so Linux is better supported. For Apple, I think it's the same story as Linux: you can just get Open MPI. Okay, as always with my waffling I've managed to go slightly over time, apologies for that. If you have any more questions, feel free to just email me, likewise if you have any problems with the exercises. Otherwise I'll see you next week, when we'll have another quiz on Socrative, talk about solutions to calculating pi, and discuss non-blocking communication, the thing I've been dangling in front of you quite a lot today as a better way of doing asynchronous communication than Bsend. Until then, have a good week, and goodbye from me. Okay, so, as you've hopefully all seen, we're actually going to begin today with another Socrative quiz, a look behind the curtain here. You should all have been sent a link earlier on today, so when you have it, head over to Socrative. This quiz is not a test, but it is a way to check your understanding of what we've done so far, and hopefully you'll do quite well on the first question. We're going to run through this fairly quickly; it's just a way to see where everyone is and hopefully to bring up some new questions, some of which have probably already been answered after you asked about them last week. It looks like everyone who's here so far has answered that one, so let's see how we did; you should be able to see this on my screen. Okay, well, I'll allow that one: yes, MPI is a way of doing distributed-memory parallel programming, that's correct, and it also stands for the Message Passing Interface, so both of those answers are in fact correct. Well done, everyone; let's see what the explanation has to say. Ah, and this is the point to mention that some questions have more than one answer, and when they do you can select multiple options. All right, let's move on to the next question, which might be a little bit tougher: compiling and running an MPI program requires which of these four options? To compile an MPI program requires special libraries, not special compilers; I'll show the explanation here as well. MPI is just a library, a library for doing message-passing programming, but it is only a library. What's tricky is that MPI provides wrappers to compilers that let you compile MPI programs more easily, but all those wrappers actually do is supply all the link instructions for the MPI library, so you don't have to put those in yourself. So mpicc is really just, say, the GNU compiler, or on Cirrus the Intel compiler; it is not a special MPI compiler, just your standard system compiler with additional arguments that link in the MPI library, provided for you. And MPI itself is a standard that defines the interface to a library, and MPI implementations are those
libraries. Okay — and as you'll all have realised, you don't need a special computer or a special operating system: you can run MPI on any computer. Let's move on.

After launching an MPI program with mpirun -n 4, what does the call to MPI_Init do? Start when you're ready — okay, this one is tougher. The correct answer is that it enables the four independent programs subsequently to communicate with each other. MPI_Init sets up the resources MPI requires to allow communication, and all other MPI calls, to be made — but it doesn't do anything more than that. Creating the four parallel processes is actually the job of the MPI launcher. I'll show the explanation, which hopefully agrees with me: the launcher creates the parallel processes, and those processes are running from the very start of your code — from the beginning of main, if it's C or C++ — in exactly the way any executable runs, except there are four copies of it. What MPI_Init does is allow MPI calls to start being made; equally, MPI_Finalize frees all the resources required by the MPI library, and after that point you can no longer make MPI calls. But the program is running from the start of main, which will certainly be before MPI_Init, possibly some way before — there's nothing to stop me putting code above MPI_Init, as long as it isn't an MPI call. Program execution starts at the start of main, so you can happily run four completely independent copies of a program, with no MPI in it at all, simply by using the launcher; what MPI_Init adds is the ability to communicate. And it's good that everyone avoided the threads option: MPI deals with processes; lower-level mechanisms deal with threads.

Next: if you call MPI_Recv and there is no incoming message, what happens? Well done, all those who answered that the receive waits until a message arrives, potentially waiting forever. "It eventually fails with an error" will not happen — which is unfortunate, because that would actually be nicer — but the fact is there are no timeouts in MPI. You can implement your own timeout system — I've had to do it before in a project, and it was an absolute nightmare: there is so much you need to configure separately for different machines that it's very difficult to make useful, and that's one of the reasons MPI just doesn't bother. MPI makes the assumption that you have written correct code — whether that's a safe assumption is neither here nor there, it makes it — and if you haven't, it will just keep on waiting: it will deadlock. What that assumption does allow is optimisation by the library implementers. Part of the rationale is that the library can optimise on the basis that the code is always correct, rather than constantly having to poll, check, and eventually throw an abort because a message might never arrive — all of which consumes resources that fundamentally aren't needed. Well done.
Next: if you call MPI_Ssend, the synchronous send, and there is no receive posted — let's see how we did. Again, it won't fail with an error; it will just deadlock. It waits, assuming a receive will eventually be posted, and it will wait forever. The message is not stored, either: it just sits in the send buffer, the send never returns, and that buffer never becomes reusable. It's a bit like making a phone call that no one answers — except that phone calls do eventually time out, and MPI_Ssend will not; it just waits, and it's not always easy to know that this has happened. On a machine like ARCHER or Cirrus, where you've gone through the batch queue, the job will eventually be killed when you run out of wall-clock time.

Someone is asking about labProbe from MATLAB's Parallel Computing Toolbox, and pointing out that it's asynchronous. Do be careful about the distinction here: there are two separate properties — blocking versus non-blocking, and synchronous versus asynchronous — and we'll be covering this in more depth in the next lecture, later this afternoon, when we talk about non-blocking communication, so hopefully it will become a lot clearer. From last week: Ssends are synchronous; Bsends are asynchronous, because they free the send buffer immediately; and MPI_Send could be either. Synchronous versus asynchronous is about when the communication completes — whether the buffer you're sending from is available again only once the matching receive has happened. There is only the one kind of receive, and yes, the routines we've seen so far are all blocking functions: as soon as they return, the buffer is reusable, because they block until then. But some of them are synchronous and some asynchronous, and this week we'll look at non-blocking functions, which again come in synchronous and asynchronous forms. More on that later.

Okay, on to the next, related question: if you call the asynchronous buffered send, MPI_Bsend, and there is no receive posted, which of the following are possible outcomes? Hopefully the answer will make what I've just said a little clearer. Let's see how we did. The possible outcomes with a buffered send: the message won't disappear; the send may fail with an error; it will not wait until the receive is posted — it's an asynchronous communication — so it won't deadlock, which is nice; the message is stored and delivered later on, which is its whole purpose and what happens when everything goes well; the sending process continues execution regardless of whether the message has been received, because it's asynchronous; and it will not time out. The reason we discourage the use of Bsend is that first possibility: the send fails with an error if your buffer space is full. If you're relying on a system buffer, whether that happens is obviously machine dependent; attaching your own buffer is a better idea, and in many cases — if you're using Bsend at all — you actually have to attach your own buffer. (MPI_Send, for comparison, will do either a buffered or a synchronous send depending on whether there is space.) But Bsend, if there is not enough space in your buffer, will fail, and that may only start happening when you scale your program up and there are more communications in flight. It can also hide other errors in the code, and it's difficult enough to debug as it is, so it's best to use synchronous sends to begin
with. But yes, this is the main reason we discourage the use of Bsend: the possibility of it failing with an error. What Bsend does is copy the send buffer into its own buffer space — the space you've attached — and then keep checking whether a receive has been posted; there is a certain amount of communication going on behind the scenes in MPI. It won't actually try to transfer anything until the receive is posted; it just holds your message in the buffer space. Obviously, if you never post the receive, the message just sits in that buffer space forever, and your program might still complete fine, depending on how it's written — it may actually be okay as long as you don't run out of buffer space. But if you do run out of buffer space — and you will, if you scale the program up — it will simply fail and kill the entire job. It's asynchronous in the way that posting a letter is: you shove it in the postbox and it's gone, but if there's nowhere for it to be delivered, it just sits in the system forever. And you have to supply your own buffer space with MPI_Buffer_attach. The real point, as we'll discover later this afternoon, is that if you want asynchronous communication in order to do useful things like overlapping computation and communication, the non-blocking synchronous sends provide a much safer and much better way of doing that than Bsend.
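Just to pin down what that attach-your-own-buffer pattern looks like, here's a rough sketch of a buffered send; dest and tag stand in for whatever destination rank and tag your code uses, so treat it as an illustration rather than a recommendation.

```c
/* Sketch of a buffered send with our own attached buffer space.
 * Needs <stdlib.h> for malloc/free. If the attached buffer fills up,
 * MPI_Bsend fails with an error - exactly why we tend to discourage it. */
int data[100];
int bufsize = sizeof(data) + MPI_BSEND_OVERHEAD;
void *buf = malloc(bufsize);

MPI_Buffer_attach(buf, bufsize);

/* Returns as soon as data has been copied into the attached buffer;
 * actual delivery only happens once a matching receive is posted.      */
MPI_Bsend(data, 100, MPI_INT, dest, tag, MPI_COMM_WORLD);

/* Detach blocks until any buffered messages have been delivered.       */
MPI_Buffer_detach(&buf, &bufsize);
free(buf);
```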
I may have given some spoilers for this next question: if you call a standard send, MPI_Send, and there is no matching receive, which of the following are possible outcomes? The message won't disappear. It is possible that the send will wait until the receive is posted, potentially waiting forever — that is exactly what happens if the buffer space is full, because in that situation MPI_Send acts as a synchronous send, equivalent to MPI_Ssend: it blocks and it waits. Alternatively, the message may be stored and delivered later on, "if possible": if there is enough buffer space available — I believe it only uses the system buffer, though I'd have to check whether an attached buffer counts — it just takes a copy, allowing you to reuse the send buffer straight away, much like a Bsend does. Nothing ever times out. And the program continues execution without caring whether the message has actually been received: as soon as the send buffer is available again, it carries on. MPI_Send will not fail with an error, because it won't try to put the message into a buffer if there isn't enough space. The original motivation — and I'll open up the full explanation so you can see it — was this: you had the synchronous send, which can deadlock, and the buffered send, which can run out of space. MPI_Send was meant to be the best of both worlds: it checks whether there is enough buffer space, and if there is, it sends asynchronously, which is nice for performance; if there isn't, it just becomes a synchronous send and works anyway. In reality, because you never know exactly which it is going to do, and it's machine dependent, it is not a safe choice for most programs — when you scale up your code you can end up with deadlocks you weren't expecting — and it's much better to use something whose behaviour you know exactly. So MPI_Send is another one we recommend against.

Next: the MPI_Recv routine has a parameter called count — what does this mean? This one is trickier to get your head around. In reality, MPI is just shipping byte streams from one place to another. However, in languages like C and Fortran, "eight bytes" or whatever isn't particularly meaningful on its own, because you need to know how to interpret them — so you always give a datatype as well. And although the first option is true in most cases — you would normally expect the buffer you've reserved to be the same size as the incoming message — it doesn't have to be. What it does have to be is large enough for the incoming message; if it's not, you'll run into problems. So what MPI wants from count is the size of the buffer you have reserved, expressed in items of the datatype you've said is coming in. MPI never talks in bytes unless you explicitly use the MPI_BYTE datatype: although bytes are what's being shuttled back and forth in the background, that isn't a well-formed way of describing things in C or Fortran, where we need datatypes to interpret the byte streams at each end. So counts are always numbers of items. Importantly, although in most situations the buffer will be the size of the incoming message, it can be larger. One reason you might want that is to receive into the middle of an array: you can point at the middle of an array, say "the buffer from here is this big", and MPI will write the incoming message into the first however-many elements from that point. And you can create your own MPI derived datatypes too, which may imply different buffer sizes. So for a receive, count is the size of the local receive buffer, even though you'd usually expect it to equal the message size.

Josh is asking: if you've put a load of asynchronous messages into the buffer with Bsend, then another send goes synchronous because there isn't enough room, and a receive is then posted at the other end — which message is received first? Assuming this is point-to-point between the same two ranks — all the sends target the same receiver, and that's the receiver posting the receive — it would be the first one that went into the buffer, because messages do not overtake one another between the same pair of MPI ranks. If different ranks are involved, it's another matter altogether, but between the same two ranks I would expect all those messages to arrive in order.
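As a small illustration of count being a buffer size in items rather than a message size, here's a sketch of receiving a short message into the middle of a larger array; the source rank and tag of 0 are just placeholders.

```c
/* count describes the receive buffer in items of the given datatype,
 * not in bytes, and it only needs to be at least as big as the message. */
int row[100];
MPI_Status status;

/* Receive up to 10 integers into the middle of the array, starting at
 * element 40; the source rank 0 and tag 0 are placeholder values.       */
MPI_Recv(&row[40], 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
```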
Again, I may have given the game away a little earlier: what happens if the incoming message is larger than count? And again, the point of these questions is not to catch anyone out — it's not a test — it's that the answers aren't always obvious or what you'd expect, so it's worth going back over them; hopefully it helps you understand exactly what MPI does when you call these functions. Why it does it is a much deeper question, and often the answer is "it seemed like a good idea at the time", where "the time" was twenty or more years ago. As I think I mentioned before, the MPI Forum tends to focus on making it possible for implementers to make these functions optimal and performant, and that comes at the cost of flexibility about the correctness of your code: the two major assumptions MPI makes are that your code is correct, and that the system is completely infallible and will never fail — so it's also not really able to deal with failures in the actual hardware, although there is ongoing work in that area.

Let's have a look at how we did on this one. If the incoming message is larger than count — larger than the receive buffer — unfortunately it will not just accept the first count items; it will fail entirely. You can get away with sending less than the receive buffer can hold, and you can get away with a count smaller than the send buffer on the sending side (it will just take the first count items from that array), but you cannot get away with sending a message that is too large for the receive buffer: that results in an error. Part of the reason is that MPI actually checks — as I mentioned, there is a certain amount of communication going on in the background — and it looks at whether the receive buffer is large enough, because if it didn't, the program would just segfault anyway. And the standard behaviour on any error is for the entire MPI program to abort, killing the whole job — which is nice in some ways, and very annoying when you've made one small error somewhere in your code and only find out a day later. But there it is.

I've done it again: what happens if the incoming message, of size n, is smaller than count? Let me just check what the next question is before I start talking. If the incoming message is smaller than count, that's okay — that one works. Your receive says "I have a buffer that can take 100 integers", the send says "I'm sending 10 integers", the receive says "I do have space for at least 10", and it simply accepts the entire byte stream from the sending rank and deposits it starting from the address you specified in the receive buffer, checking the sizes as it goes. The last option — zeroing out the rest of the buffer — might seem like a useful thing to do, but, as I sort of mentioned earlier, if you're receiving, say, a row or column into the middle of your array — sending a vector from one process and receiving it as part of a matrix — you really would not want the rest of that storage zeroed. MPI assumes you know what you're doing, and it won't do anything beyond what you tell it. As the explanation points out, often all these sizes are known in advance; but one thing you can of course do is create a buffer that is, quote-unquote, "large enough" and always receive into that, because you don't know exactly how many items will arrive in any particular message, only that it will be no more than some maximum. MPI is happy to let you do that, and then you can use the status to find out the actual count of the received message.
An important point, as I said, is that the answers are not necessarily obvious — which is exactly why it's worth going over them. So: how is the actual size of the incoming message reported? Let's see how we did — it's stored in the status parameter; someone guessed it. Let's take the last option first: why not the associated tag? The tag is something we set as application developers, and it's used to match messages — to match sends with receives — it doesn't tell you anything about the message size. And no, the count in the receive is just the size of the buffer you have already allocated — whether that's on the heap or the stack is up to you — expressed as a number of items of whatever datatype. The actual count that gets received is stored in the message status parameter, which is a struct in C, an array of integers in Fortran before 2008, and a derived type (whatever the Fortran name for that is) in Fortran 2008. You do have to pass it through the helper routine MPI_Get_count, for the reason I mentioned: in the background MPI is just sending byte streams from one rank to another, so "how many ints did I receive?" is not a well-formed question unless you also tell it the type — it needs to work out the datatype separately — and MPI_Get_count does exactly that. Its arguments are just the status from the receive and the datatype. And that's the lot — well done, and hopefully that was useful.
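Putting those last two answers together, a sketch of the oversized-buffer-plus-MPI_Get_count pattern might look like this; the buffer size and datatype are arbitrary choices for illustration.

```c
/* Receive into a buffer that is deliberately "large enough", then ask
 * the status how many items actually arrived.                          */
#define MAXITEMS 1000

double buffer[MAXITEMS];
MPI_Status status;
int nreceived;

MPI_Recv(buffer, MAXITEMS, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

/* The count is only meaningful for a given datatype, so pass it in.    */
MPI_Get_count(&status, MPI_DOUBLE, &nreceived);
printf("received %d doubles from rank %d\n", nreceived, status.MPI_SOURCE);
```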
What we're going to do now is have a look — not necessarily in full — at the solutions to last week's exercise, the one about approximating pi. We'll be doing this on Cirrus, and I already have the MPP solutions directory there. This is not one of the setups we suggested you use — we suggested MobaXterm, for example — but this is just my work laptop; I don't generally need an X server, although from the fact that there's a picture on here I can see I was testing one out, and I don't actually use xterm in general. This is actually just Windows PowerShell underneath — Windows does now include SSH tools by default — but I'm running it through something called ConEmu, a console emulator, and all that really does is let me set a different colour scheme for PowerShell, because the default directory colours you get back from a Linux machine (this MPP-solutions listing, for example) are completely unreadable against either the PowerShell or Command Prompt backgrounds, and it's a pain to find something that works. There's nothing magic about the program I'm using to SSH.

So let's go and have an actual look at the pi solution. Be prepared for things to go horribly wrong — so far so good — let's just check that make works. It does not. Excellent; let's sort that out first. Ah, what I've forgotten is module load mpt and module load intel-compilers — I told you something was missing. Let's try make again; there we go, we're good. Now, mpirun — someone asked me by email the other day what the difference is between mpirun and mpiexec_mpt. Because we've loaded the MPT module, mpirun is essentially just calling mpiexec_mpt underneath; it's a convenience wrapper, because mpirun is easier to type. On the login nodes there's an alias in bash that says mpirun means whichever launcher goes with the module you've loaded; on the back end you need to use mpiexec_mpt, because that alias doesn't exist there — which is less of a problem, because you'll just be writing it into your PBS script. So mpirun and mpiexec_mpt are, in this case, the same thing; it's just that one of them only exists on the login nodes. That said, different MPI libraries and implementations have different launchers: mpirun is the typical name for OpenMPI's, mpiexec_mpt is the HPE MPT library's primary launcher, and mpiexec is, I want to say, MPICH's standard one — so it may vary, and you may have to use a specific launcher depending on which implementation of the MPI library you're actually using. On Cirrus, mpirun is fine on the login nodes and you need mpiexec_mpt on the back end, assuming you're using MPT MPI; I think there's also another MPI library you can choose on there.

Okay, so let's try running this pi example. As you can see: 3.141593 — not bad, I'll take that. That's just on four processes. What I should also do is have a quick look at the exercise sheet again, just to remind ourselves what the point of the exercise was: we have a numerical approximation for calculating pi, and the aim is simply to split that sum up across MPI processes. Now let's open up the C version — I'm going to use the C example, primarily because I prefer C. There's nothing particularly special at the top: here's the MPI include, which is just a standard include as you'd find in any C or C++ code, and here N is defined to be 840.
One thing worth noting is that this N is, of course, not the same thing as the number of processes you run on: it's the upper limit of the summation. It does not need to equal the process count; it is slightly simpler if the number of processes divides it evenly, but that's not absolutely necessary, and you can definitely write the code so it isn't needed. Otherwise this just looks like standard C code. The communicator has its own datatype in C and C++, MPI_Comm, and there's MPI_Status — these are actually just structs underneath; in the older versions of Fortran they're integers or integer arrays, and in Fortran 2008, which has derived types, they become proper types of their own. You can also see an istart and an istop here, which are quite important — we'll see why in a minute. This solution, by the way, is just in the solutions tar file; if you want to have a look and compare it with your own, it's slightly sneakily hidden on the course page where it says "here is a file containing simple solutions" — that's where I've got it from, so if you wish to Blue Peter it, you can.

Here we set comm = MPI_COMM_WORLD. You don't have to do this — it's just easier than writing MPI_COMM_WORLD everywhere yourself, and it means you could change the communicator later; you can always just write MPI_COMM_WORLD directly if you prefer. Then MPI_Init(NULL, NULL) — you can also pass &argc and &argv, although it's quite rare for them to actually do anything. These next calls, which you should know from the week one exercise, just ask for the size of the communicator and which rank this particular process is. Bear in mind that four copies of this code get launched — and, to come back to an earlier point from the quiz, note that MPI_Init is down here somewhere and the start of the code is up here: all of these declarations happen on every process, however many are launched — four in the previous run — even though MPI_Init doesn't happen until further down. MPI_Init does nothing but enable communication, but you do need it before you can call MPI_Comm_size and MPI_Comm_rank.

And here we have the first fork in the code based on rank. Let me just show this: if I comment these lines out — oh, I need to recompile first — what you can see is that all four ranks have now printed this line, because it's no longer forked on the rank ID; they all print exactly the same thing. There are a few things to think about here. First, I can put something up here, before MPI_Init, just to labour the point a bit more — apologies for my total lack of typing ability — there we go: you can see that "hello" is printed four times, even though it's before MPI_Init, because all four processes are running exactly the same code up to that point. Another thing to note — let's find a good place to do this — it says "Computing approximation..." here, so let's change it so everyone reports "running on however many processes, I am process so-and-so", and have each rank report itself.
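To make the forking point concrete, here's a minimal sketch — not the exact demo code — of the difference between a statement every rank executes and one guarded by the rank.

```c
/* Every process runs the same code, so an unguarded print appears once
 * per rank; guarding on the rank is how you get a single message.      */
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

printf("hello\n");                             /* printed once per rank  */

if (rank == 0)
{
    printf("Running on %d processes\n", size); /* printed exactly once   */
}

printf("I am process %d\n", rank);             /* once per rank, in no
                                                  fixed order            */
```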
Another thing worth noting about that output is that it comes out in a fairly random order. There's no particular order in which ranks will print — print statements will never come out in any particular order, generally speaking, unless you do something to enforce it; it can be whatever order each process happens to reach that point in. Here we have two then three, but you can also see they're at completely different points in the code: process 2 has printed its hello and already gone through its calculation before process 3's print appears. That doesn't even necessarily mean process 2 ran ahead of process 3 — because of the way print buffering works on various machines, there's really no guarantee about the ordering of these things when you're running multiple processes, except that within a single process — say, within rank 0 — I would expect every statement printed by rank 0 to appear in order. That order relative to any other process has absolutely no meaning whatsoever, and indeed may change: if I run exactly the same thing again, I get a completely different order, and this time it happens to be 0, 1, 2, 3. So the ordering of print statements is not meaningful and should not be relied on, which is kind of annoying.

And with that I see we're just about up to three o'clock, and we do have time to carry on looking through this pi solution after the break, so I propose we begin the break now and meet back here at half three. We'll go a bit further into the pi solution and talk about a couple of things worth noting about it, and then we'll discuss both synchronous and asynchronous non-blocking communications — which I'm sure everyone will be interested in, since a lot of the questions have already leaned that way — and hopefully it will clear up how exactly you should use asynchronous communication. So, see you back here at half three; have a good break, I'm going to go and get some coffee.

Okay, hello everyone and welcome back to online MPI. We're going to start this session by going back through the solution to the pi problem from last week and pointing out a few interesting features of the sample solution. As I said, the reason I'm using ConEmu to wrap this SSH session into Cirrus is that it lets me actually configure a colour scheme — using Command Prompt or PowerShell directly you can't, and things like comments, which vi always colours blue, are completely unreadable. Even with this it can be tricky; it's a little brighter on my screen than it comes through on Collaborate, so thank you for pointing out that it was a little difficult to read. You can of course also find these solutions on the website if you'd rather read through them on your own computer.

Okay, so I'll carry on from where I'd got to — ah yes, we were looking at the prints (you can see my attempts to work out why vi insists on starting in replace mode). The order will always be different for prints; there's no real guarantee except that, within a single process, they come out in order, and you shouldn't put any stock in the timing. There's no cheap way to coordinate output that wouldn't involve a lot of overhead, so if you do require specific output from specific ranks you need to include synchronisation through the MPI routines, or you need
to fork as well. So if you just want one rank to print, you use an if on the rank, like this. Another thing to note: we've consistently used rank 0, and for this kind of forking — where we just want one statement printed — there is absolutely nothing special about rank 0, and there's no ordering to the way MPI assigns ranks generally speaking. The only thing that is special about rank 0 is that it's the only rank guaranteed to exist for any number of MPI processes: if you run just one process, it is rank 0, because ranks are zero-indexed. So rank 0 is always there, whereas any other rank might not be, depending on how many processes you launch — which is why it's often the one used to print things out.

Now, on to the actual calculation. Going quickly back to the exercise sheet, it's just implementing this summation, but one thing that's key is that we don't need to run the entire summation on every single process: because addition is associative, we can calculate individual sections of the sum on different processes and then add them all together. However, rather than saying "if I'm rank 0, do this part of the loop; if I'm rank 1, do that part", a much smarter way is to define the starting and stopping indices of the loop based on the rank, which is what's been done here — and you can see why this depends on both the number of processes launched and the value of N, because of the N divided by size. Part of the exercise towards the end, if you got that far, was to look at how you could set this up so that exact divisibility isn't necessary. One simple solution is to round this number down and then add whatever is left over to the stop condition of one rank — rank 0, for example. The general way of solving that issue is just to have one of your processes do a little more work; usually that's fine. In certain situations it could lead to a large load imbalance, but if it's just finishing off the sum, it's quite all right simply to have rank 0 do a little more than everybody else.

Then we have the same for loop on every rank, because each has set its istart and istop based on its rank, and inside is just the summation. For C programmers like me there are two things to watch out for: the initial value of the loop index matters, and it's one, not zero; and the loop runs to N inclusive. These are two things that often catch me out — you may simply be smarter than me and avoid the trap, but there have been many a time when I've been trying to work out why a calculation is wrong, and it's because I was summing from 0 to N-1 when it should have been 1 to N. Also note that you need to cast these to doubles in C, but that's not too much of an issue. Then we simply print out everyone's partial value.
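Here's a sketch of that decomposition; it assumes N divides evenly by the number of processes, and it uses the usual midpoint form of the approximation, pi ≈ (1/N) × sum over i of 4/(1 + ((i - 0.5)/N)²), which I believe matches the exercise sheet, but do check against your own copy.

```c
/* Each rank computes its own chunk of the sum i = 1..N by setting
 * istart/istop from its rank, rather than branching inside the loop.
 * This sketch assumes N is exactly divisible by size.               */
int istart = rank * (N / size) + 1;
int istop  = istart + (N / size) - 1;

double sum = 0.0;
for (int i = istart; i <= istop; i++)
{
    double x = ((double)i - 0.5) / (double)N;   /* note the casts    */
    sum += 4.0 / (1.0 + x * x);
}
double partialpi = sum / (double)N;             /* this rank's share */
```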
Then we just use point-to-point communication to get all the answers back to rank 0. The sheet does say this would be more efficiently done using MPI_Reduce — that's true, and we'll be looking at collective communications such as MPI_Reduce in the last week, which is actually not next week, as unfortunately I'm away at the Big Bang Fair in Birmingham, but the week after. For now we just use point-to-point. If I'm rank 0, I essentially start from my own partial value of pi, then set up a receive from every other rank in turn and add each value in. If I'm not rank 0 — this else completes that if — I send my value synchronously to rank 0. One thing worth noting here is that we could replace the source with MPI_ANY_SOURCE and this would still work just fine, because the tags match; it will happily match the incoming messages in whichever order they arrive, each time taking the first one that came in, and it still gets perfectly reasonable results. However, if you do need to constrain the specific source — if things need to arrive in a certain order — you can change it back to the explicit source, like so. Just a reminder that the receive needs an address, even though it's receiving into a variable declared on the stack, and the address of the status goes in there too; and if we were using MPI_ANY_SOURCE and wanted to check the actual source, we could find that out from the status. (We could also replace that with a plus-equals, but there we go.) It's all synchronous sends, and everything goes to rank 0, so I hope it all makes a reasonable amount of sense — do feel free to ask questions, either in the chat or by email later on. At the end we just calculate pi separately so we can print out the error.

I'll also quickly show you the other language solutions. I thought there might be a C++ one — let me check I have the most recent version of the MPP solutions — apparently not, so I don't have a C++ version available, but essentially it's the same. Unfortunately, if you are a C++ programmer, when it comes to writing MPI you're essentially back to doing C with std::cout instead of printf; there was at some point a proposed C++ interface, but it is no longer supported. What I will do quickly is show off the Fortran solution for the Fortran programmers amongst you. It's more or less the same: N is 840, and it still sets a separate istart and istop. The main differences are really these: first, there's ierror, which is important because it's needed for all MPI routines in Fortran; and second, for Fortran before 2008, the status is an integer array of size MPI_STATUS_SIZE, whereas with the Fortran 2008 interface it's a derived type. Then it all looks a lot more similar to the C — you can also skip the ierror with the newer interface — and the rest is very much the same. You can look through it yourself; it's in the FMPI directory under that tar file instead, and there's really nothing different other than the datatypes used for the internal MPI things, so comm is just an integer rather than being a defined datatype as it is in C.
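Before we move on, here's a compact sketch in C of what that collection onto rank 0 boils down to — a sketch along the lines of the sample solution rather than the file itself, assuming each rank has its partial sum in partialpi.

```c
/* Collect the partial sums onto rank 0 with point-to-point messages;
 * a real code would normally use MPI_Reduce for this.                   */
if (rank == 0)
{
    double pi = partialpi;
    double recvpi;
    MPI_Status status;

    for (int source = 1; source < size; source++)
    {
        /* MPI_ANY_SOURCE would also work here, since addition commutes. */
        MPI_Recv(&recvpi, 1, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, &status);
        pi += recvpi;
    }
    printf("pi = %.10f\n", pi);
}
else
{
    MPI_Ssend(&partialpi, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
}
```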
Okay — now for the part you've all been waiting for: we're going to look at non-blocking communications. I know we've already covered some of this, so there will probably be some repetition, but that's not necessarily a bad thing. From previous weeks: if I have a round-the-ring communication pattern to implement, one of the dangers is that doing it naively with synchronous sends and receives will deadlock. The reason it deadlocks is that everybody is trying to send at the same time, when somebody needs to be receiving. There are a few solutions to this. One is to follow some pattern in the ordering of your sends and receives — often called a red-black communication pattern — where, for example, every odd rank sends and then receives, and every even rank receives and then sends. But those kinds of solutions are not necessarily scalable, so this week we're going to look at a much more general solution.

First, though, we're going to revisit — because it can be confusing — the two different ways in which an MPI communication routine is characterised: its mode and its form. The mode determines when it completes, and that means when it completes at both ends of the communication: it's either synchronous or asynchronous. If it's synchronous, then both parties — all parties — to the communication complete at the same time: a synchronous send needs to match with a receive and they both complete together, and because the whole communication has completed, the buffer is naturally reusable thereafter. For an asynchronous communication, one end, or all ends, may complete separately. The buffered send is the clearest example: it simply copies the data out of the send buffer and stashes it somewhere else, and then it has finished — the Bsend completes and returns, and the buffer is reusable as soon as it does — but it hasn't actually completed the communication; nothing has been sent, and nothing will be until a receive is ready at the other end.

The form, on the other hand, is blocking versus non-blocking, and that determines when the procedure returns. A blocking communication — which is every routine we've looked at so far, the buffered send and the synchronous send alike (the buffered send is also blocking, because you can be sure the buffer is reusable as soon as it returns) — doesn't return until the operation is done. A non-blocking send, which is what we'll be looking at today, returns control to the user program but does not promise that the buffer is reusable: you're allowed to continue with whatever you're doing, but you don't know for sure that the send or receive buffer can be used again until you check — and we're going to look at exactly what that means. So again: blocking operations return when the operation has completed — your Ssend, your Recv, and also your Bsend. Apologies, by the way: because I haven't got the PowerPoint there would have been animations here, but I'm sure I can still make the point. A non-blocking operation, by contrast, returns straight away and you can continue to perform other work,
and then you can either test or wait for completion of the non-blocking operation. The little diagram illustrating this uses a fax machine — something I'm sure we've all got a lot of experience with. You put your message in, and then you go away and get back to churning butter, or whatever it is this stick figure represents, while the message sends — so you can overlap your communication and your computation. Then, a little while later, you come back and check that the send has actually completed and been received at the other end, so that you can reuse the buffer (the buffer really is the point there). It lets you do useful work while communication is happening.

Importantly, all non-blocking operations should have matching wait operations: some systems cannot free their resources until the wait has been called. A non-blocking operation immediately followed by its matching wait is equivalent to the same blocking operation. Going back to the slide that says the program can later test or wait: we'll look precisely at the difference between those two in a bit, but for now let's just think about wait. What a wait does is wait for the operation to complete — "I've started a non-blocking communication and now I need it to actually finish; I need that receive to be confirmed before I can do anything more." Non-blocking operations are not the same as sequential subroutine calls, because the operation continues after the call has returned. What I mean is: in standard serial code, with a single process and a single thread, when the code enters a function you expect it not to continue through your program until whatever was inside that function has been done — it won't return and carry on until everything inside is complete. Non-blocking operations don't work like that: they return, but whatever they started carries on in the background — in this case it will always be a communication — happening at the same time as you do other things. I hope it's clear why that's useful, but you then need to explicitly complete the operation, and the way you do that is by waiting.

So you can separate a non-blocking communication into three phases: you initiate it, beginning the send or receive; you do some work, which may or may not involve other communications — you can do whatever you like; and then you wait for the non-blocking communication to complete — and you must always wait. So, non-blocking sends: here we have an example of one, sending from rank 2 to rank 0. There are both non-blocking sends and non-blocking receives, and to a large extent these follow all the same rules as the blocking versions, in terms of matching, messages not overtaking one another, and so on. The important point is that, while you can make both ends non-blocking, it may be simpler to do just one at a time, at least to begin with. Essentially, they do what you expect: the send begins the sending process,
it needs a matching receive of some sort, and you have to wait for it to finish to be sure the receive has happened. Equally, you can initiate a non-blocking receive: it will return immediately, but at some point you have to wait to make sure the receive actually has happened, and there still needs to be a matching send somewhere in the system, or you will run into problems — and by problems I mean deadlock.

The handles used are broadly the same: you still need an MPI datatype, you still need a communicator — communications only happen within a communicator. The additional one is the request: an MPI_Request datatype in C and C++, or simply an integer in Fortran. The request handle is filled in when the communication is initiated: when you begin the non-blocking communication, you get a handle back, and that is what you wait on later — "I want to wait for this communication to finish", or "I want to test whether this communication has finished or not."

Here's the actual syntax for C and Fortran. You'll notice it's MPI_Issend — immediate synchronous send. The naming convention for the non-blocking routines in MPI is "immediate": MPI_I-something is the non-blocking version of that function. It may seem counter-intuitive that something is non-blocking but synchronous: it's synchronous because the send buffer is not free for reuse until the request has completed, but it's non-blocking because you can do other things while the communication is going on. And there's MPI_Wait. At the very end of MPI_Issend's argument list there's the address of an MPI_Request, and you need to supply that same request to MPI_Wait at some point in the future, along with an MPI_Status. The request tells it which communication you want to wait for; the status is exactly the same as the status everywhere else in MPI code — it's about returning some useful information once the operation has completed, which you may choose to entirely ignore. It's perfectly valid to simply declare a single MPI_Status somewhere at the top of your code and always pass its address throughout, if you know you're never going to look at it.

Chris is asking whether the request pointer points to something you've allocated memory for, or whether it comes from MPI. It's something you have allocated — most likely it will simply be stack memory, so you don't have to do a malloc, although you could. In a lot of cases you'll only need one request, or a couple; if for whatever reason you need a lot of requests, you could malloc an array of them, and I have certainly seen arrays of requests created for various reasons — that's quite common. But generally it's exactly the same as, say, sending a single integer: you declare a variable in your code, it gets allocated on the stack, and you supply its address — you say MPI_Request request; and then pass &request to the function.
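So the sender side, in a sketch, looks like this; dest and tag stand in for whatever destination rank and tag your code actually uses.

```c
/* Non-blocking synchronous send: initiate, overlap some work, then wait. */
int x[10];
MPI_Request request;
MPI_Status  status;

/* fill x, then start the send; dest and tag are assumed from context     */
MPI_Issend(x, 10, MPI_INT, dest, tag, MPI_COMM_WORLD, &request);

/* ... do work here that does not touch x ...                             */

MPI_Wait(&request, &status);
/* only now is it safe to modify (or rely on having finished sending) x   */
```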
In Fortran everything is passed by reference anyway, so you don't need to worry about explicitly taking an address, and — at least in the old-style interface — everything is an integer; you do need to remember the ierror argument, but the format, the signature, of these calls is, I hope, recognisably close to the blocking ones, with only small differences. Equally, for the receive it's MPI_Irecv, an immediate receive, and it looks pretty much the same as a receive, only with the request argument at the very end, so that later on you can match it with a wait. And as with the send, you must always, always match it with a wait: if you don't, the code will either deadlock or possibly, eventually, fail with an error — it depends; MPI_Finalize may cope — but the resources set aside for those requests won't be freed until MPI_Wait has been called, so you should always make sure you call it.

I should also say explicitly: in both the send case and the receive case, it is not safe to access the buffer you supplied for the send or receive until after MPI_Wait has been called. You should not modify it, and you should not read from it, until the wait has happened, because the communication is essentially ongoing until MPI_Wait comes back. MPI_Wait is a blocking function, even though the immediate send or receive before it is not; once the wait completes, you can be certain the buffer is safe to reuse or to read — but not until then. Anything you do between those two points is fine, as long as it doesn't touch the buffer you're using for the send or receive: you can do whatever you like, you just have to leave that part of memory alone until afterwards.

You can use a blocking send with a non-blocking receive, and vice versa — and, as I mentioned briefly earlier, I would actually recommend that. You may find you have a good reason for wanting non-blocking at both ends, but in many instances it's fine to have just one side non-blocking and make the other blocking, and then, if you find there's a good reason to do otherwise later, try that. For a first pass at a code I would always recommend beginning with simple synchronous sends, and only making things non-blocking later if you think there's something you can overlap. But you don't have to match blocking with blocking: a blocking send can pair with a non-blocking receive, and a non-blocking send with a blocking receive. And a non-blocking send can use any mode, synchronous or standard, because the mode affects completion, not initiation. This is the same point again: whether it's synchronous or not determines when the communication has completed — meaning the two processes (or however many are involved, if it's collective) are synchronised at that point, because they complete the communication at the same time — whereas in the asynchronous case the data is copied into a buffer first.
Another point: yes, you can do non-blocking asynchronous communication — there is an MPI_Ibsend — but really all you're overlapping there is the time it takes to copy your send buffer into the attached buffer space. That might be worthwhile, but it's a niche use case, I imagine, where there's actually much point to it. Issends are much more commonly used, and it's much clearer why they're useful, because the communication itself can take some time. But both synchronous and asynchronous non-blocking communications exist — and indeed there is an MPI_Isend, which is the same thing as MPI_Send but non-blocking, and which may do either an asynchronous or a synchronous communication. What blocking versus non-blocking tells you is that the initiation and the completion of that communication are separated, with the caveat that the buffer is not safe to reuse in the time in between.

So here's a quick run-down of them all: the Isend, the Issend, the Ibsend — again, the Ibsend is the one where it's hardest to see the point, because all you're waiting for in between is the buffer copy; it might be very large and take a while, but then if it is very large, why would you want to take a copy of it and keep it on the same node? You'd be as well just doing an Issend rather than doubling the space required to hold it — and then the Irecv.

Waiting versus testing: I mentioned testing earlier but didn't really tell you much about it — not because it's mysterious, but because it's a slightly more niche use case; it's a bit like probing, really. MPI_Wait will block until the communication has completed — as mentioned, an immediate communication followed by a wait is equivalent to the blocking version of that communication. MPI_Test will not block: it simply returns immediately and overwrites a flag you provide, which is true if the communication has completed and false otherwise. The syntax is similar, in the sense that you have an MPI_Request and an MPI_Status, with the addition of this flag you need to check. One thing that's important to note is that testing and waiting are not interchangeable: while the test keeps returning false, the communication hasn't completed, so you still need the wait (or a later successful test) before the buffer is safe; once a test returns true, the operation has completed and the request is done, so any wait on it would return immediately. MPI_Test is a way of knowing whether the communication has finished without actually blocking. So if you have a communication that you suspect will take a very long time, you might put in an MPI_Test to check, and if it returns false, say "okay, then I'll do this other bit first". That is about the only use case I can think of; generally it's better just to put the MPI_Wait immediately before the point in your code where you actually need that buffer again, and say "this is where it needs to have happened, so I'll wait for it here" — if the communication has already completed, the wait returns pretty much immediately and is very quick. But MPI_Test is a way of doing more advanced communication patterns: you can check whether your non-blocking communication has completed before explicitly waiting for it.
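A sketch of the kind of polling pattern MPI_Test allows — do_other_useful_work is a hypothetical placeholder for whatever your code could be getting on with, and buffer, count, source and tag are assumed to be set up as usual.

```c
/* Polling a non-blocking receive with MPI_Test instead of blocking.     */
int flag = 0;
MPI_Request request;
MPI_Status  status;

MPI_Irecv(buffer, count, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, &request);

while (!flag)
{
    MPI_Test(&request, &flag, &status);   /* returns immediately         */
    if (!flag)
    {
        do_other_useful_work();           /* hypothetical placeholder    */
    }
}
/* flag is true: the receive has completed and buffer is safe to read.   */
```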
The Fortran interface is, as ever, basically the same as the C one, except there are no pointers and there's an ierror argument, because these are subroutines with no return value.

A little example here: a ten-integer buffer being sent with MPI_Issend. In between, you could in principle be doing something useful — here we just call a do_something_else function (which is missing an underscore, I note) — followed by an MPI_Wait. Equally, at the other end we start a receive with MPI_Irecv, do something else while that's ongoing, and follow it with an MPI_Wait. An example of where this is genuinely useful in real-world applications is a code that does a domain decomposition — its input is some grid that it divides up amongst the processes — and has to do halo swaps. While that exchange is in flight, you can work on the parts of your grid that are not in the halo or near the edges, so aren't affected by the boundary data, and then come back to the communication later to actually bring in your new boundary values. That's a common use case, and I hope it's reasonably clear why these things are useful. I gathered from the questions last week that you're well up for non-blocking communications already; most of the call signature is the same as for the blocking routines — it really is just this extra request argument. (And again, the request here was declared somewhere above, which is fine.)

One thing you should be a little wary of is making sure your code doesn't overwrite requests by accident: if you have multiple Issends in flight, they each need their own request, which is one reason you sometimes see arrays of requests — if, for example, you're doing a halo swap, you might need four requests at the same time on each iteration, so you create an array. You can overwrite a request again on the next iteration, that's fine; but while a non-blocking communication is ongoing, you should leave its request alone, because you will need it to supply to the wait, and you'll run into problems if you've stopped yourself from being able to identify a communication that's been started. There are also routines for handling multiple requests at once: MPI_Waitall and MPI_Testall, to which you simply supply an array of requests; MPI_Waitany and MPI_Testany, which deal with any one of them; and I believe there are also "some" variants that complete as many as possible — and I think you can have inactive requests in the array, but do check that. In terms of testing multiple non-blocking communications, you can think of your process as having multiple in-trays, each of which you can look at to see whether anything has arrived.
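A halo-swap-shaped sketch of the request-array idea; all the buffer names, the neighbour ranks lo and hi, and the update_interior/update_boundaries functions are hypothetical stand-ins for whatever your decomposition actually uses.

```c
/* Four non-blocking operations in flight at once, each with its own slot
 * in a request array, completed together with MPI_Waitall.              */
MPI_Request requests[4];
MPI_Status  statuses[4];

MPI_Irecv (halo_lo_in,  n, MPI_DOUBLE, lo, 0, MPI_COMM_WORLD, &requests[0]);
MPI_Irecv (halo_hi_in,  n, MPI_DOUBLE, hi, 0, MPI_COMM_WORLD, &requests[1]);
MPI_Issend(halo_lo_out, n, MPI_DOUBLE, lo, 0, MPI_COMM_WORLD, &requests[2]);
MPI_Issend(halo_hi_out, n, MPI_DOUBLE, hi, 0, MPI_COMM_WORLD, &requests[3]);

update_interior();                    /* work that needs no halo data     */

MPI_Waitall(4, requests, statuses);   /* or pass MPI_STATUSES_IGNORE      */

update_boundaries();                  /* halo buffers are now valid       */
```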
Ah, okay, and someone did ask me about this the other day as well: there is another, slightly more advanced call, a send-receive, which essentially swaps buffers between two different ranks. You supply both a send buffer and a receive buffer; essentially the call signature is that of a send plus that of a receive, with only one MPI status between them, and the idea is that both operations happen at effectively the same time. You do need to make sure that your send buffer and your receive buffer are different; don't think you can just put the same buffer in for both, that will cause problems. It's another way of avoiding deadlock in something like the message-around-the-ring example from the very beginning of this talk. Send-receive is fine, it works well, but it's not as generally applicable because it is pairwise, whereas non-blocking sends and receives can happen with the same range of options as any other sends and receives, so they're a lot more generally useful. But if you do have a simple pairwise communication pattern, there's nothing to stop you using send-receive.

Okay, so that brings us to the end of this lecture, really. Just before I move on to introducing the next exercise, are there any questions? Mara is asking if the request is what helps identify the wait. Yes: the request identifies a particular asynchronous communication, a particular non-blocking communication I should say, and it is unique to that non-blocking communication. You need to keep hold of it until it is supplied to the MPI wait; it tells the wait which non-blocking communication it should be waiting for (or tells a test which non-blocking communication it should be testing), it is unique to that communication, and you must not lose it. However, once you've called the wait, the request is safe for reuse, as is the send or receive buffer: the wait completes the communication, and all the resources are free for reuse again. All the matching rules are the same as for blocking communications, including things like messages not overtaking one another, but you can use this type of communication to break deadlock as well.

Sam is asking if that means the Isend or Irecv assigns the request. Yes, it writes into the request: the MPI_Request data type in C, or the request integer in Fortran. It's not like the tag, in the sense that the tag is used for matching communications across the network (MPI uses the tag to actually match up sends and receives), whereas the request just matches things up on the side that initiated the non-blocking communication; it matches the wait with that non-blocking communication. There is stuff going on in the background: MPI will send some messages in order to determine when the communication has completed at both ends, so that's how it knows overall, but the request is a thing that exists locally to a process. Okay, great, and thank you for the good questions as always.
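Since send-receive came up, here is a rough sketch of what an MPI_Sendrecv looks like in a ring setting. This is my own illustration rather than anything from the course materials, and the variable names are placeholders: each rank sends its own rank to the right and receives from the left in one call, with separate send and receive buffers.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank, size, left, right, sendval, recvval;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  left  = (rank - 1 + size) % size;   /* periodic neighbours */
  right = (rank + 1) % size;

  sendval = rank;

  /* Send to the right and receive from the left in a single call;
     no deadlock, and note the two distinct buffers. */
  MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
               &recvval, 1, MPI_INT, left,  0,
               MPI_COMM_WORLD, &status);

  printf("rank %d received %d from rank %d\n", rank, recvval, left);

  MPI_Finalize();
  return 0;
}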
Okay, so we'll have a little look at the exercise for next week. Again, don't worry if you're not on to this exercise yet, or if you're way ahead; it's completely fine, and you're welcome to try these exercises, or not, in your own time as you see fit. The full exercise sheet is available on the course website, which I'll just show you here, under week one, just there. As I mentioned in the first week, this is actually the exercise sheet for our entire message-passing programming course, so don't feel that you need to get to the end of it.

There is a little bit in there about timing routines. One thing that's quite useful, and I mentioned this earlier as well, is that MPI provides MPI_Wtime, a routine which just returns a double-precision number representing elapsed wall-clock time in seconds. I know some of you already looked at this last week. It's quite useful because it is portable: wherever there is an MPI library, you can use MPI_Wtime. Some of you may or may not have had the joy of trying to do precision timing in C before, but it is quite a pain, so MPI_Wtime is useful. That's a small note about that.

The main exercise that we're interested in next is rotating information around a ring, and this is the thing I mentioned at the start of the lecture that would deadlock with blocking communications used naively. Possible solutions: I believe I have some slides for this, so we should go after them... got it, yes. You can do the pairwise red-black matching, where you simply say odd ranks send while even ranks receive, and then vice versa in the next step. However, it's more interesting to use non-blocking communications, and you could do it either way around; indeed, you can do both sides non-blocking. As I've mentioned a couple of times, having both sides non-blocking is less common, because it's often not useful: essentially, you can look at your code and say that at a certain point I need to reuse either the send or the receive buffer, so on the non-blocking side you can simply put the wait immediately before you need to reuse the buffer and not worry about it in between, which gives you the maximum possible overlap; and at the other end you know that at a certain point you're going to need that information too, so you might as well just have a blocking receive, for example. Now, that said, if you suspect or know that the receive is taking a very long time to complete, because of the large size of the message being sent, then it may be worth doing both in a non-blocking way, but those sorts of use cases are quite unusual, although not completely unknown.

A couple of notes to give you some hints for this next exercise: your neighbours do not change (you always pass in the same direction around the ring), and you don't alter the data you receive. The point here, and perhaps I should go back to the exercise sheet, is that we're just calculating the sum of as many numbers as we have processes, and we want the sum on every single rank. So you're not actually modifying these buffers every time they come in: you can just keep a local tally, a local sum even, and simply add in the buffer that you receive each time, and that will work perfectly well; you pass the data along unchanged. And, again, I've mentioned this a couple of times already, but you must not access the buffer in between: while your non-blocking communication is in flight, do not touch the buffer, just leave it alone until you have administered a wait and can be sure that it's safe, and that is true for both sends and receives.
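As a quick aside, coming back to that timing routine for a moment, here is a minimal sketch of how MPI_Wtime is typically wrapped around a region you want to time; the loop in the middle is just a stand-in for real work.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  double tstart, tstop, sum = 0.0;
  long i, n = 100000000;

  MPI_Init(&argc, &argv);

  tstart = MPI_Wtime();                       /* elapsed wall-clock time in seconds */
  for (i = 0; i < n; i++) sum += (double)i;   /* the region being timed */
  tstop = MPI_Wtime();

  printf("loop took %f seconds (sum = %f)\n", tstop - tstart, sum);

  MPI_Finalize();
  return 0;
}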
As usual, if anyone does have any questions, feel free to email me. And as an extra thing before I go: this was in the email we sent earlier, but it's worth a look. At EPCC we run a blog (my colleague Mario is forever trying to get people to write things for it), and this article was recently posted by my colleague Dan Holmes. Dan is actually on the MPI Forum; in fact he's there this week, and when I leave today I'll need to go and finish some slides I need to send him for a project we're working on. Exciting background for you there. He's written this very nice article on what MPI non-blocking is for, which, if you're interested, I would encourage you to go and have a look through; it probably explains some bits better than I can. He's on the MPI Forum and he's extremely knowledgeable about these sorts of things, and it's very on-topic for this week's lecture, so do go and check it out. Feel free also to have a look through the rest of the blog. And that's really it for me for this week; otherwise, thank you very much, and have a good rest of your day and rest of your week.

Okay, hello and welcome back to the final session of online MPI. What we're going to begin with today is a little look at the exercise from last week. It's okay if you haven't done it, although we are about to run through the solutions. Let's begin by looking at the rotating-information-around-a-ring solution. The exercise here was simply to calculate a cumulative sum across all processes by passing the rank of each process to every other process. We're going to do this in the usual way, logging in to Cirrus and checking the solution on there; all manner of things could, and probably will, go wrong, because such is the nature of these things. If you have any issues reading the terminal, let me know in the chat and I'll endeavour to do something about it. As usual, I'm going to cheat and pull out one we made earlier: rotate, there we go.

Our aim here, as it notes at the top, is to form a global sum of data stored on each process by rotating each piece of data all the way around the ring; at each iteration a process will receive some data from the left and then pass it on to the right. It's a fairly simple setup, and the point really is to look at non-blocking communications. We won't take terribly long on this example, because it is reasonably short and also because I want to spend a decent amount of time going through collective communications, but I just want to quickly highlight the use of start and stop here to abstract away the loop indexes (or loop indices, I should say), and also comm = MPI_COMM_WORLD: not strictly necessary, but it's good practice to set the communicator at the top, as well as setting a default tag, 1 in this case. All these things are worth doing. We also calculate, and remember that every process is doing this independently, each rank's left and right neighbours somewhere near the top, so they simply have that information stored. Calculating neighbours is always a challenge for many MPI codes that need to do things like halo swapping; in 1D, of course, it's relatively simple, we have two neighbours, but once you start moving up to multiple dimensions it can get more challenging. We won't cover it in this course, but there are special types of communicators you can construct which give MPI some information about the topology, and it then has functions which will help calculate a rank's neighbours for you, which is very worthwhile, but we won't worry too much about that here.
Here we simply have a neighbour before us and a neighbour after us, to the left and to the right, and of course we need to make sure the periodic boundary conditions are respected; this is a fairly brute-force way of doing that, but it works. We initialise the sum to zero as well. Here we're going to be using (rank + 1) times (rank + 1); that is simply because of this part of the sheet, the second part of the exercise, computing the sum of (rank + 1) squared. There's nothing special here; the first version of the exercise simply uses the rank, and it doesn't matter too much either way.

And here is the solution. As you can see, it's not super complicated, but what is important is that this (apologies, that was clunky) is an immediate send, an immediate synchronous send, so a non-blocking communication; hence it has a request at the end. The highlighting is not particularly useful in this colour scheme, but here we have a request, which we can then later test or wait on to ensure the communication has completed, and in between we've performed the receive. This means that all of the communications around the ring happen more or less simultaneously, rather than having to stagger sends and receives to avoid deadlock, as you would using blocking communications.

The only overlap achieved here is from putting the receive in between the non-blocking send and the wait. Can anyone suggest in the chat how we could rearrange or refactor this loop to achieve a little bit more overlap? Not a huge amount, I must say, because this is a fairly simple example, but it can be done. No guesses; it's all quiet in the chat. Okay, I'll give you the answer then. There are only two other things in this loop that you could possibly move, and the one that would be safe to move into the overlap region is the summation line; that could go there. I just want to take a minute to examine why that is the case. What's key here is that the MPI receive is still a blocking communication, a blocking call, so addon is safe to use as soon as that receive exits, and that means we can do sum = sum + addon safely, because we know addon is fine. On the other hand, passon, which we've sent: we don't know that that buffer is safe to reuse until after the MPI wait has completed. When the wait exits, we know the non-blocking communication has completed successfully and the buffer is safe to reuse, so passon is then reusable. The line passon = addon therefore has to come after the wait, but the line sum = sum + addon (sum += addon) can come before it, because it doesn't care about the passon buffer; it simply needs the receive to have completed, and that's a blocking call. So you could actually move that line up into the gap, and that would be fine. I hope that's clear, and I think that's really all I want to say about this example.
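For reference, here is roughly the shape of that loop as a self-contained sketch. The variable names passon, addon and sum are the ones we've been talking about, but treat this as my paraphrase of the idea rather than the exact course solution; note the summation sitting in the overlap region, before the wait.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank, size, i, left, right, tag = 1;
  int passon, addon, sum;
  MPI_Comm comm = MPI_COMM_WORLD;
  MPI_Request request;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  left  = (rank - 1 + size) % size;     /* periodic neighbours */
  right = (rank + 1) % size;

  sum = 0;
  passon = (rank + 1) * (rank + 1);     /* extra part: sum of (rank+1) squared */

  for (i = 0; i < size; i++) {
    MPI_Issend(&passon, 1, MPI_INT, right, tag, comm, &request);
    MPI_Recv(&addon, 1, MPI_INT, left, tag, comm, &status);
    sum = sum + addon;                  /* safe here: the blocking receive has completed */
    MPI_Wait(&request, &status);        /* now passon is safe to overwrite */
    passon = addon;
  }

  printf("rank %d: sum = %d\n", rank, sum);

  MPI_Finalize();
  return 0;
}

Compiled with mpicc and run with, say, mpirun -n 4, every rank should print the same total.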
Just to check that all I'm saying is true, let's try compiling and running it. Is the makefile going to behave... I might need to modify the makefile a little; actually, let's not bother, let's just try it. Okay, and mpirun... okay. Let me check against the formula: I believe that has worked fine, 4 times 5 is 20, times 9 is 180... yes, that's all worked perfectly correctly, even though I've moved that last part up a little.

Now, in this example that will make almost no difference to the runtime whatsoever, but you can see that, generally speaking, the more you can get in between the non-blocking call and the wait, the better. What matters is whether or not you need the buffer that you're either sending or receiving: if you don't, then you can put the work in the overlap region, and you should, because if the communication has already completed then the MPI wait will take almost no time at all. It will simply say "yep, you're good", MPI will free the resources it has held, and the program will continue. So it's always a good idea to fit as much as you can in between, in order to create efficient MPI programs.

The other big reason to use non-blocking communications, and I'm not sure I mentioned this last week although I should have, is to prevent serialization of your code. What I mean by that is that, if you're not careful, you can write perfectly valid synchronous, blocking MPI code that, although distributed, is effectively running in serial: one process will do a thing, the next process will do a thing, the next process will do a thing, but they won't be doing anything in parallel, and you're losing most of the benefit of using MPI; you're just running a distributed serial code, which isn't really advantageous. Using non-blocking communications can help make sure that's not the case, by overlapping regions of the program.

Okay, with that said, let's come back out of that and move on to today's actual lecture, which is collective communications. What we've looked at so far is point-to-point communication, simply sending from one process to another process, in both its blocking and non-blocking forms. What we're going to look at today, collective communications, are communications that involve an entire communicator. I was careful to say an entire communicator there, because these operations always act on a communicator, but it does not have to be MPI_COMM_WORLD. That said, I believe I mentioned last week, or two weeks ago now, that MPI does not expect you to operate in such a way that you create many, many sub-communicators: it expects a few, or a handful, of communicators to exist, not thousands, so do be a bit wary of that. But collective communications exist in order to do communication across an entire communicator, and there are several. Much as in previous lectures, what we're going to do here is simply go through each of the different types, hopefully quite slowly and carefully, and as I do, stop me if you have any questions.

So: communication involving a group of processes, called by all processes in a communicator. Examples, and the ones we're going to go through in fact, are barrier synchronization, which is really the simplest; broadcasts, scatters and gathers; and then things like global sums, which are different types of reduction operation (more on what that means in a minute). One thing I will say about collectives is that, as a rule, it is good to avoid having to do collective communication in any program.
Because it is a communication across an entire communicator, it's fairly heavyweight, and especially if you're looking at a program running hundreds or even thousands of MPI processes, if they all have to communicate it will take a while; it may well be a bottleneck in your code. That said, if your algorithm, whatever you're implementing, requires that you do this, then do use the MPI collective communications to do it. Don't be tempted to implement your own point-to-point solution instead, because it will not be better. At one time, when collective communications were very new, it was possible they were just implemented using point-to-point communications underneath; now that is certainly not the case. They're quite well optimized for what they're doing, and they'll often use specific topologies for the communication, so do use these functions if your code demands it. Don't try to get around it by doing things that seem clever with point-to-point communications, mostly because they probably won't scale in the way you'd like them to, or in the same way.

With that said, the characteristics of collective communications. They act on a communicator, and all the processes must communicate: you cannot call a collective communication from only a subset of the processes in your communicator. That code will deadlock, because those processes will wait for the others to reach a collective that never comes. It is over the entire communicator, always, and indeed that's an important thing to watch out for: you must make sure that all processes call the collective communication in order for it to happen, or your code will deadlock.

Now, synchronization may or may not occur; having just said all that, I appreciate this is confusing. What it really means is that some collective communications do synchronize. In the case of the barrier, that is its entire purpose: to synchronize your processes. But although all processes have to call the collective communication, that does not mean they are synchronized. For example, consider the case of a reduction onto a single process: every process has to send some value to, say, rank zero, and rank zero will add them all up, so it's a reduction operation, a global sum of some value, onto a single process. In that instance every process has to call the collective communication, but they don't necessarily all do it at the same time, and the only one that has to wait for them all to have done it is the root process. The process actually gathering all the data and then summing it cares about what everyone else is doing, but the others, individually, can move on once they've sent their part. So in that case they are not synchronized, although they have all still been involved in the communication and done their part; only one process has to wait for the entire collective to finish, and that is the root in that instance. It varies from collective communication to collective communication. I'll try to remember to point out which ones do and don't synchronize, and hopefully I'll be correct when I do so.

Standard collective operations are blocking; non-blocking versions were, as it says here, recently introduced in MPI 3.0. I'm just going to briefly break out of the presentation for a second to put that into context. Here you can see the MPI Forum's website, mpi-forum.org, and 3.0 was standardized and fully implemented as of 2014.
Josh is asking in the chat whether the collection is all done on one node, for one of the reduction operations. Yes, that is the case; there are a few different kinds, but for a simple MPI reduce operation, yes, the aim is to collect a global sum onto one node, one process. There is also a different function called MPI_Allreduce, which gives the global sum to every node, every process, in the communicator, and we will talk about both in a bit. So I guess the answer is: it depends.

So, as you can see, although it says here that non-blocking collectives were recently introduced, "recently" means they were fully implemented across almost all libraries five years ago. The most recent standard that is complete and fully implemented is actually 3.1, and work is ongoing on MPI 4.0. And just to toot EPCC's horn a little, I thought I'd show off this: this is the MPI Forum's GitHub, which they use to monitor issues, and recently persistent collective communications passed their final vote. Persistent communications are for when you know you're going to be doing the same communication again and again and again throughout your code: you can set it up at the start, and MPI, instead of ditching all the information about that communication after every send and receive, will retain it, and you can say "okay, do this communication again with this buffer", if it's always the same two processes involved, for example. Persistent collectives are simply the extension of that idea to collective communications. As you can see, work started on this in 2015, and it was started by Tony Skjellum, who is from the University of Tennessee at Chattanooga, I believe; I hope that's correct. But you can also see that Daniel Holmes, one of my colleagues, was heavily involved in this, and it was actually last week that he was away at the MPI Forum, where this thing he has been working on with Tony Skjellum passed its final vote; it will be in the MPI 4.0 standard. So MPI, although it's stable and has been around a while now, quite a mature library, is constantly growing, and this is just one of the new things that will be in it soon, and EPCC is involved in that. You might recognise Dan if you read the blog post I highlighted the other week about what MPI non-blocking is for; very timely.

Okay, that was a brief divergence; thank you for indulging me. So, non-blocking versions of the collective operations also exist. That said, they're not really new any more: that slide was written when MPI 3.0 was recent, but I think this part is possibly still true, in that they're not yet commonly employed. The advantages of doing non-blocking collective communications are less obvious, because many of them synchronize anyway, although the advantages certainly still exist. I think the main thing is probably that, for performance reasons, people typically avoid algorithms that require heavy use of collective operations in the first place; because they're avoided anyway, there's less incentive to try to optimize them further, since they appear minimally in your code and there are probably other things you can focus on. But that doesn't mean don't use them: do, if you have a use case, or if you have something you can overlap with the collective operation, I don't see why not. The extension is, as you'd expect, simply an extra request parameter, as there is for non-blocking point-to-point communications.
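Just to illustrate the shape, here is a small sketch using the non-blocking variant of an allreduce (MPI_Iallreduce, which needs an MPI 3.0 library). We'll meet the blocking allreduce properly later, so treat this purely as an illustration of the extra request argument and the overlap-then-wait pattern; the local value is arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank, localval, globalsum;
  MPI_Request request;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  localval = rank + 1;

  /* Start the collective, overlap independent work, then wait. */
  MPI_Iallreduce(&localval, &globalsum, 1, MPI_INT, MPI_SUM,
                 MPI_COMM_WORLD, &request);

  /* ... work that does not depend on globalsum could go here ... */

  MPI_Wait(&request, MPI_STATUS_IGNORE);

  printf("rank %d: global sum = %d\n", rank, globalsum);

  MPI_Finalize();
  return 0;
}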
Collective communications don't have tags, and I hope the reason is fairly obvious: every process in the communicator must be involved, so there's no reason to match against tags, because all processes are going to be getting in on it; you don't need to worry about matching. And receive buffers must be exactly the right size. Eagle-eyed viewers may have noticed that I have, quite transparently, a whole load of MPI man pages open in my browser right now. That is largely because this count parameter is important and is slightly different between the different collective communications, so to avoid tripping myself, or you, up, I've simply opened the man pages so I can refer to them as we discuss. But the receive buffer must always be exactly the right size, and you must remember which process it therefore needs to be on, or simply create it at the right size everywhere; there's nothing to stop you doing that, though for some of the collective communications not every process will necessarily need a receive buffer. I'll just keep an eye on the time; we're going until three, I believe.

So, the first collective communication I want to introduce is the barrier. A barrier does basically exactly what you might think it does: no process may proceed past the barrier, so it will block until every process has reached it. It only takes one argument, the communicator across which you want to carry out the barrier. This is generally speaking frowned upon: I wouldn't expect a real production code to be making use of barriers as a general rule, unless it was for profiling purposes. That's the one place where you really might need it, if you want to time a particular region of the code. Fundamentally, stopping all the processes until some criterion has been reached, and making them all wait, is not a useful thing to do, and for performance and scalability this is obviously a very costly operation, synchronizing all your processes, so you need to have a good reason for doing so. There's often a temptation, especially when writing the first version of a parallel code, to insert lots of barriers to effectively serialize your code, because it makes it easier to understand, but it also makes it run horribly and is not a good way to do these things. So the MPI barrier certainly has its uses, but it's not good for efficiency or performance, and I wouldn't expect to see it in a proper code that isn't just using it for timing, for example. Onwards, then, and we move on.
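Since profiling is the one place a barrier tends to be defensible, here is a minimal sketch of that pattern: barriers either side of the region of interest, timed with MPI_Wtime. The empty region in the middle is a placeholder.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank;
  double tstart, tstop;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Barrier(MPI_COMM_WORLD);      /* line everyone up before starting the clock */
  tstart = MPI_Wtime();

  /* ... the region of code being profiled ... */

  MPI_Barrier(MPI_COMM_WORLD);      /* make sure everyone has finished */
  tstop = MPI_Wtime();

  if (rank == 0)
    printf("region took %f seconds\n", tstop - tstart);

  MPI_Finalize();
  return 0;
}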
The next one, the one that is somewhat more useful, is the MPI broadcast. The broadcast (again, hopefully it's reasonably clear what I'm about to say) simply sends some buffer to every single process. What's key here, and a common extra argument you'll find in collective communications, is this one: the root. That's the process from which the data will actually be sent. The broadcast is sending some data across the entire communicator, but obviously at least one process needs to already have that data, and this is the root process, from which all the data is sent. Then there's your communicator; the MPI datatype, which works exactly as it does in point-to-point, as does the buffer (the buffer is simply the send buffer here); and the count does exactly what it does in point-to-point communications: there should be count of datatype in that buffer, so ten doubles, three integers, and so on, or even just one character, for example.

Now, what you might be wondering is: every process, every rank, needs to call the broadcast, so what do they put in the buffer argument? If you've just declared your buffers somewhere near the top of your code, without forking the code in any way, then every rank will have access to a buffer that looks like the right thing and can simply put that in there; there's no reason to fork the code and write it differently for the root process versus every other one, because every process does need to call the broadcast, not just the root. In fact, in the broadcast case, I've just realised (and some of you may have raced ahead of me) that that's absolutely what you should do, because this buffer will be the receive buffer for all the other processes. So I apologise; I take back what I implied about the argument not being significant, as that applies to a different type of collective communication. On the root process this is the send buffer, and on every other process it is the receive buffer. It should have space for count of datatype if it's a receive, and should contain count of datatype if it is a send, and root specifies from where the data is being sent. The Fortran version is much like Fortran always is: it has the IERROR argument at the end, the equivalent of the integer return from the C function. And if we just switch over to the man page, here you can see that count in MPI_Bcast is the number of entries in the buffer; there will be more on why I'm highlighting that shortly. So that's MPI broadcast; apologies for mixing it up a little there.

Chris Stuart is asking in the text chat: is there a requirement for every receiver to receive the full buffer sent by the root? The answer is yes, in the case of the broadcast. What we're about to look at is a slightly different collective communication where that is not the case, so a good question: the answer is yes for broadcast, but there are other operations for doing what you're thinking about.
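Here is a minimal broadcast sketch. The data values are arbitrary, and I'm using root 0, but as discussed the root can be any rank; the key point is that every rank makes the identical call.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank, i;
  double data[10];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0)                       /* only the root's contents matter going in */
    for (i = 0; i < 10; i++) data[i] = 1.5 * i;

  /* Every rank calls this; root 0 sends, everyone else receives into data. */
  MPI_Bcast(data, 10, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  printf("rank %d: data[9] = %f\n", rank, data[9]);

  MPI_Finalize();
  return 0;
}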
Actually, another thing I should say about broadcast is that it's another one of the operations we would generally suggest avoiding: because it is sending some amount of data to every single process, it will have an impact on performance, and often you don't need to do a full broadcast. You'd be replicating the data across every process, and that's often not actually useful, in particular compared to the next thing, which is the MPI scatter. This is what does what you were thinking of, Chris: it takes some array, some set of data, on a root process and scatters it across the entire communicator. In the example here we have a character array with a, b, c, d, e, and that is split across our five processes: rank 0 has a, rank 1 has b, rank 2 has c, rank 3 has d, rank 4 has e. Got there in the end. One thing that's important is that the scattering process includes the root: it will essentially copy one part of that array from the send buffer to the receive buffer even on the root process, which is fine and certainly what you want, so there is a small amount of duplication of data in that case, but otherwise the data is scattered to all the other processes. And this is a much more useful pattern than broadcasting, generally speaking: half the point of using message-passing parallelism is often that you can split your problem up across many machines, so it's not necessarily useful to simply replicate the data exactly across all of them; instead, here, you can send parts of it around the communicator.

The call signature is somewhat more complicated: we have a send buffer, a send count, an MPI datatype for the send type, and then a receive buffer, a receive count, and an MPI datatype for the receive type; the root, again, is important, and then the communicator across which the collective communication is happening. You might wonder why you need to specify the datatype twice, since it should be the same. I think it's basically just so that this mirrors the way point-to-point communications operate: you specify it for both, which in principle allows you to be flexible about it, although doing so would of course be risky and may cause errors, as we discussed previously. But yes, they certainly should just be the same.

Now, here's where I'm going to check the man page. The send count is the number of elements sent to each process, and the receive count is the number of elements in the receive buffer. What is important is that your send buffer contains, or is large enough to contain, send count of the datatype for each process: you need to make sure the size of your send buffer is equal to the send count times the number of processes. If it's not, it will segfault, because MPI is expecting you to have done the right thing. If you tell it you're sending three integers to each process and there are ten processes, it assumes your send buffer has thirty integers in it, or at least thirty, because of course, if you have some very large array, you're welcome to pass an address in the middle and it will simply read from that point. The same conditions that applied to point-to-point communications apply here, but it's important that your send count reflects the per-process size of the send buffer. Equally, ideally the counts should match: if you're sending three integers, you should say you're expecting to receive three integers. You can have more (a receive buffer with ten integers' worth of space would be fine), but less would be bad, much like with point-to-point communications. I hope that makes sense; I appreciate that even for me, remembering which count refers to which thing can be tricky, but it's important to make sure everything matches up, or you will run into problems. I hope that's all reasonably clear, and do let me know if it's not.
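Here is a small sketch of that a-b-c-d-e picture, generalised to however many processes are launched; the 'a' + i initialisation is just my stand-in data. Note that both counts are per process (one character each), while the root's send buffer holds one character for every rank, including itself.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int rank, size, i;
  char *sendbuf = NULL;   /* only needs real contents on the root */
  char recvchar;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0) {
    /* send buffer must hold sendcount elements per process: size * 1 here */
    sendbuf = malloc(size * sizeof(char));
    for (i = 0; i < size; i++) sendbuf[i] = 'a' + i;
  }

  /* sendcount and recvcount are both 1: one char goes to each process */
  MPI_Scatter(sendbuf, 1, MPI_CHAR, &recvchar, 1, MPI_CHAR, 0, MPI_COMM_WORLD);

  printf("rank %d got '%c'\n", rank, recvchar);

  if (rank == 0) free(sendbuf);
  MPI_Finalize();
  return 0;
}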
Next I want to talk a little bit about the inverse operation, which is a gather. Here, again as you might expect from the name, we're assuming there is some root process, and I should say (because I definitely mentioned this last time when a question was asked about it) that this would often be rank 0, but it doesn't need to be at all; no rank is special in MPI. The only thing that is slightly special about rank 0 is that it's the only one guaranteed to exist no matter how many processes are launched. So if you're writing code that scales across any number of processes, and you want it to run on one process, you have to make sure that rank 0 is the one specified for doing certain tasks, because rank 1 simply won't exist; for any greater number, though, these scatter and gather operations can in principle take any rank as their root, and indeed it may be necessary for your code that that's the case. The gather operation will collect some amount of data from every process and store it in a single receive buffer on the root process; the syntax of the call shouldn't be overly surprising.

Chris has asked a question in the text chat (it was covered in previous sessions): are the ranks in a communicator always numbered from zero, or do sub-communicators have ranks not necessarily starting from zero? The answer is that they always start from zero. If you create a sub-communicator, it will relabel the ranks accordingly, because the rank number only applies within a communicator; outside of a particular communicator, the rank ID is meaningless. MPI_COMM_WORLD contains every process that is launched, numbered from 0 up to n minus 1, where n is the number of processes launched; any sub-communicator you create will also go from 0 up to n minus 1, where n is the number of processes in that sub-communicator, so zero will still be there. You can create a sub-communicator with only one process, and that process will be rank 0 within that sub-communicator.

Mara has also asked a question: gather collects onto a specific process, not necessarily zero; is that true for scatter and broadcast too, in that it's not necessarily the rank 0 process? The answer is yes, absolutely. In both cases they have root as an argument (I'm pointing at my screen, which is not very helpful), and that is the process from which the broadcast takes place; scatter likewise has root, and that is the place from which the scatter takes place. People sometimes have it in their heads that for broadcast and scatter the root is rank 0, and that's perfectly reasonable: often it would be rank 0, for the very reason that rank 0 always exists, but it doesn't have to be. You can specify the root as whichever rank suits, although very often it would be zero. The only one that doesn't have any kind of root argument is the barrier, and that's because it simply doesn't matter: all processes in the communicator will be stopping and waiting there, and they're not sending or receiving anything, so it's irrelevant. Everything else has a root that specifies where the data is coming from, or going to, in the case of gather.

So here we have a send buffer, a send count, a datatype for the send, and a datatype for the receive as well; again, they could be different, but probably never should be (I can think of ways you could make use of that, but none that you should). There's a receive buffer and a receive count, and the receive count should match the send count; it's not necessarily the size of the receive buffer, as we'll see in a second. Then there's the root process and the communicator to which all this applies. The Fortran syntax is almost the same, but has IERROR.
Again, and I hope this isn't too annoying for people, I'm going to jump out and just highlight the manual here: send count is the number of elements in the send buffer, and receive count is the number of elements for any single receive, and the reason that's important... okay, I had it the wrong way round a moment ago. So: if you are gathering 10 integers into an array of 10, say with 10 processes, then the send count would be one, because you're sending one integer from each process, and on the root the receive count would also be one, but your receive buffer must be 10 integers large, or you will run into problems. Whereas before it mattered that your send buffer was the correct size (well, it should still be the correct size, but MPI is only going to look at send count's worth of it), here the receive buffer is simply trusted to be large enough. So if you have 10 processes in your communicator, in MPI_COMM_WORLD, and you're sending one integer from each, the receive count will also be one, because it's receiving one integer from every process, but the receive buffer needs to have space for all 10. And do keep in mind that it always includes the root: here we're doing the inverse of the previous scatter problem with a, b, c, d, e, and we still need space for the root's own element, even though that data is already on the root process. I hope that's clear. It does take some care to make sure the sizes are all correct here; generally speaking, the golden rule is that send and receive counts should match, but it is a little bit counterintuitive that the count is no longer necessarily the size of your buffer, and that the receive buffer needs to be significantly larger.
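Here is that one-integer-per-rank gather as a small sketch; the values themselves are arbitrary. The point to notice is that both counts are 1 while the root's receive buffer has space for one entry per process.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int rank, size, i, sendval;
  int *recvbuf = NULL;    /* only the root needs the full-size receive buffer */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  sendval = rank * rank;              /* one integer contributed by every rank */

  if (rank == 0)
    recvbuf = malloc(size * sizeof(int));   /* space for one int from each rank */

  /* sendcount = 1 and recvcount = 1: both describe what comes from each process */
  MPI_Gather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    for (i = 0; i < size; i++) printf("recvbuf[%d] = %d\n", i, recvbuf[i]);
    free(recvbuf);
  }

  MPI_Finalize();
  return 0;
}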
Okay, so next we're going to come on to global reduction operations. So far we've looked at the barrier, which simply pauses all processes in the communicator and says "wait here until everyone has reached this point, then continue"; broadcast, which simply sends some data to every single process; scatter, which is a slightly smarter approach of sending a piece of some data to every process; and gather, which is its inverse. Those are all reasonably useful, but probably the most useful, and the most widely used, collective communications are reductions, because it is a very common pattern to need to calculate, for example, a global sum, or a global maximum, or a global product, of a set of data distributed across our entire machine, or across our entire communicator. That's what the reduction operations are for: they're used to compute a result involving data distributed over a group of processes.

One thing that's key here is what this allows. You could do this simply by gathering all that data and then, on a single process, computing whatever property of it you needed, but a reduction lets you combine both operations into one communication, and because you're telling MPI what you plan to do, the implementation has some freedom to try to optimize that process. For example, if you are computing a global sum across a distributed network of computers, the most efficient approach is not actually to send everything to one process and then have that process sum it all up; that's essentially the worst possible approach you can take. What you can do instead is calculate partial sums at intermediate points in the network, because of a property of the operation that we'll speak about in a minute, so you have a kind of tree-like pattern: two ranks, for example, communicate with one another and add up their bit, that partial result is sent on to a different process which receives a couple of these partial sums and adds them up, and so on, fanning in that way to complete the partial sums into the whole. The MPI library is able to take advantage of approaches like that, which are much more efficient than simply doing a gather and then a sum. So, again, if you do need to do an operation that looks like this, and it is a common one, do make use of the collective operations to do it: they are optimized, and there is more going on under the hood than simply doing the obvious thing.

Actually, and apologies, I'm going off-piste slightly here again, there is even research ongoing, as part of one of the projects I'm involved in (EPiGRAM-HS, in fact), on doing compute in the network. For the approach I just described, where different pairs of nodes calculate partial sums, you would still often require the involvement of the central processing units of a particular node: assuming we have some distributed machine like Archer, for example, or indeed Cirrus, you still need to wake up the main processor to do that summation. One of the things EPiGRAM-HS is looking into is doing compute in the network, performing those summations using the networking hardware instead, and avoiding having to wake up the main CPU on a particular node. But okay, the point is that these reduction operations are optimized and you should make use of them.

How do they work? Well, there is a predefined set of reduction operations, which are things that are generally useful and commonly used: calculating the maximum, the minimum, summation, product (simply multiplying all the numbers together), logical and bitwise AND, logical and bitwise OR, various logical and bitwise operations, and the maximum and minimum location; sorry, that's the maximum and minimum and where they're located, so, assuming you have some distributed array, it will tell you both what the maximum value is and where it is in that array. So there's a set of predefined ones that just exist for you to make use of; if you require something fancier, we'll come on to that in a minute.

First, let's have a little look at the MPI reduce syntax. This is the simpler case, where I simply want the result on a single node, or a single process I should say, or a single rank, even. I'm performing some reduction across my entire communicator, adding up every rank's local value, but I want the result on just one single rank; this is where MPI reduce comes in. It has a send buffer and a receive buffer, and every rank needs to specify both of those, although (let me check I'm getting this right this time) they aren't necessarily equally important on all ranks, because the result is being received onto a single root process. There's also the count, and here you'll notice that it's actually unified this time: there is just one.
That is because the size of the thing you're reducing will be the same for both the send and the receive. You can't, for example, reduce ten integers down into a single integer in one call: if you need the total of arrays of ten integers, you have to reduce them with this operation into a ten-integer array, and then calculate the sum of that array separately afterwards. So the send buffer and the receive buffer need to be the same size here, and if the count is more than one, that is, if you're not simply reducing a scalar, what you'll get is the reduction of each array element. I hope that makes sense. As usual we have the MPI datatype; we also then have the operator, so this would be MPI_SUM for a summation, MPI_PROD for the product, and so on and so forth; the root, the process where you're expecting the result to actually end up; and the communicator you're reducing across. As ever, the Fortran syntax is almost exactly the same. What's key here is that this single count applies to both the sending and receiving buffers, and that's because, if you have an array there instead of a single value, it will calculate the element-wise reduction: you'll end up with an array where each position of the receive buffer holds the reduction of that element across your entire communicator.

I'm just conscious of time again; let me check... yes, plenty of time. So just before we go to the break, I'll show this diagram illustrating that. Here we have arrays of four elements on each process. It doesn't matter what the values are, but a, e, i and m (I'm pointing unhelpfully at the screen again) will be reduced into the first element of the receive buffer on the root process, and only the root process: nothing will be written into the receive buffers of the other ranks, and the root doesn't have to be zero, it could be any process. Equally, b, f, j and n will be written, or rather reduced, into the second element on the root process, and so on.

I think I'll start the break there; I'm throwing quite a lot at you. We'll continue and finish off this lecture afterwards, and as usual I'll run through the practical quickly after that; you're also free to simply email me at a later date if there is anything you would like to know.
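To make that diagram concrete, here is a small sketch reducing four-element arrays element-wise onto rank 0. The 10 * rank + i values are just mine, chosen so you can see which contributions went where; the important thing is that count = 4 describes both buffers, and result is only meaningful on the root afterwards.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank, i;
  int local[4], result[4];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < 4; i++) local[i] = 10 * rank + i;   /* four values per rank */

  /* count = 4: the reduction is element-wise, so result[i] on the root
     is the sum of local[i] over every rank; other ranks' result is untouched */
  MPI_Reduce(local, result, 4, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    for (i = 0; i < 4; i++) printf("result[%d] = %d\n", i, result[i]);

  MPI_Finalize();
  return 0;
}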
Okay, and with that: hello, and welcome back to this, the final session of online MPI. We're going to be carrying on with the collective communications lecture, but in the break Mara and Chris asked two very good questions, so I'll respond to those first.

Mara asks: what is the count in MPI reduce? Is it the count of what every single sender is sending towards the reduce operation? I apologise if I made this less clear rather than more clear. MPI reduce is the one that has only one count, and that count is the size, in number of the datatype (so number of integers if it's integers, or doubles if it's doubles, and so on), of both the send and receive buffers. What actually gets sent is in fact that number of items from each process as well, because in reduce's case, if you have some array of, say, ten integers (the usual example I've been going with; an array, I should say, not a vector), then what you end up with on your root process is also an array of ten integers, and each element in that array will be the reduction, for that element, across all processes. So the count is the number of elements in that array; if it's a single value, it's just one. If you're calculating the sum of a single number across all processes, the count would be one; if you want the sum of ten different integer values, or ten different double values, across all processes, then it would be ten. I hope that answers it, and it doesn't need to relate to the number of processes in any way, I should say as well.

Chris then asked: for MPI reduce, since we haven't given the root process a buffer to hold all the data it receives at once, for storing things like intermediate values, where does all that data end up when it gets fired at the root? This is a very good question as well. The point here is that we have a single root process gathering up data from across the entire communicator, and all of those individual communications are hidden from you. I believe the answer is that yes, those values will go into some MPI buffer space that the library sets aside for this, before each one gets added in, or reduced, into the receive buffer, because that can only happen one at a time, and obviously if multiple processes all send at the same time there could be some sort of collision. However, one important thing to remember is that MPI is almost certainly not doing the naive thing, which would be to have every single process send its local value straight to the root process. If you have a thousand processes on the go, 999 of them are not all suddenly going to try to message process zero; instead, MPI will use some kind of network topology to split the reduction up throughout the network, precisely to avoid contention, so that you don't just flood the MPI buffer space on a single process. Those sorts of optimizations are actually very important, and they're one of the reasons why it's better to use the collective operations than to simply implement this yourself; it means you won't have every process suddenly trying to send to the same one. So data will end up in an MPI buffer, but not for long, as it should quickly be reduced into the receive buffer; under the hood, MPI just makes use of its own buffer space. Hopefully that all makes sense as well.

As I say, the counts are different for each of the collective operations, and keeping track of them can be tricky. My advice is not to try to simply remember them and get it right every time; unless you're using these functions regularly, just check the man pages. The main tricky ones are scatter and gather, because there, unlike many of the others, you have a receive buffer (or a send buffer) that is a different size from the actual things being sent and received; that's where the mismatch comes in and why it can be confusing. The golden rule is that MPI is interested in what is being sent and received, more so than in the buffers: it assumes the buffers are large enough, and if they're not, that's your problem and it will fail. So another reasonable question you might ask, then, is: why does MPI never check these things for you?
The short answer is scalability. To an extent we've skirted around this issue in the previous lectures as well, looking at exactly the same thing, the counts, sizes and datatypes you need to specify for MPI: why can't it just check these things to make sure they're correct? The simple reason is that, when you scale up to many, many processes, you would spend so much time checking everything that you would actually struggle to get the useful bit done. It would kill performance at high numbers of processes if the library always had to check that whatever you were sending into was large enough, and indeed that there was a receive posted, and all of those things it could in principle do for you; it simply doesn't, and it relies on your code being correct. It's all to do with allowing performance and scalability, which are at the core of MPI's mission as a standard.

Okay, so we'll have a look at an example global reduction here, just using MPI reduce: it performs a reduction of a single integer onto rank zero. Let me get this right: the send buffer is x, and the receive buffer is result, which makes sense. Both result and x do need to exist, they need to have been declared on every rank, for this to work. However, by the time this operation completes, result will still be undefined on every rank except the root (zero, in this case), so there's no point inspecting result elsewhere and seeing what's in it, because the answer will be nothing, or should be nothing, unless you've already given it a value somewhere else; it's just an empty buffer, essentially. On the other hand, the send buffer should have something in it on every single rank, including the root, because the root is also taking part in the reduction: its role isn't only to collect the data, it's also there to contribute its bit to the result. In this case the reduction operation being performed on the single integer is a summation, so MPI_SUM is used, and it's across the entire set of processes, so MPI_COMM_WORLD, but result is only given a value on rank zero. I hope that makes sense.
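Written out in full, that example looks roughly like this. I've given x an arbitrary per-rank value so there's something to sum; the key points are that every rank calls MPI_Reduce, every rank contributes its x, and only rank 0 should look at result afterwards.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank, x, result;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  x = rank + 1;        /* every rank contributes something, including the root */
  result = 0;          /* only meaningful on rank 0 after the call */

  MPI_Reduce(&x, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("global sum = %d\n", result);

  MPI_Finalize();
  return 0;
}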
Now we're going to talk about user-defined reduction operators, which I briefly mentioned earlier. You're not limited, and I'll scroll back up a bit to here, to only the predefined reduction operations: you can define your own, and indeed, if you've used your own MPI derived datatype, then you will essentially have to implement your own reduction operation, even if it is just a summation, because MPI, or in fact C or Fortran, won't know what to do with that datatype otherwise. But it might also just be that you're doing a more complicated reduction, and that's fine: you can define your own arbitrary operator, which we're going to refer to as o henceforth.

In C, this user-defined function is a function of type MPI_User_function, and it has a particular signature, particular formal parameters (I believe that's true of the Fortran one as well): a void pointer which is its input vector; another void pointer which is an in-and-out vector, so one that gets overwritten; an integer which is the length of those vectors, and they should be the same length. That length is the equivalent of the count when you're performing the reduction, because the reduction can always be over some array, and it will just perform the reduction on every element of that array, so that count is what goes into the length here and it should match up. And then, as ever for all MPI things, the datatype. The Fortran version is an external subprogram and has essentially exactly the same signature, with an input vector, an in-out vector which will be overwritten but also contains some of the data, and then the length and the datatype, except that the datatype is, as ever in Fortran, an integer handle rather than a type in and of itself.

There are some rules about how you can define that reduction operator. First of all, the thing I already mentioned: keep in mind that your reduction can always take place over an array of the given datatype, so the function should be able to act in that way, and it overwrites the in-out vector, which is probably exactly what you expect if you're used to Fortran but is a little more unusual for C, where normally there'd be a return value and you'd just have inputs in the signature; here, you're going to be overwriting that buffer. It should be the case that the new value for each element of that input-output buffer is simply the binary operation applied to its current element and the same element from the other input vector. I hope the pseudocode we put up here makes sense; feel free to ask questions if it doesn't.

The other important rule about this operator is that it needs to be associative. I appreciate that not everyone is a recovering physicist like myself, so I went and looked this up earlier so I could explain it correctly. The associative property, and this is the mathematical associative property, not associativity from caches, for example, is that the result is the same no matter how the sequence of operations is grouped. If I'm adding numbers together, addition is associative: it doesn't matter how you group the additions, you'll get the same result, and the same is true for the product of real numbers; those are associative operations. The important point is that the grouping doesn't matter, and the reason that matters for the MPI reduction operator is the thing I mentioned earlier: under the hood, MPI can use much smarter network patterns to perform the reduction across the entire communicator, and that only works because the operation is associative. If it mattered how those partial operations were grouped, it wouldn't be possible to do the reduction that way, because the answer would differ depending on how the network decided to split up the problem on a given day. So that's why it's important that it's associative.
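As a concrete, and entirely my own illustrative, example of such an operator, here is an element-wise "largest magnitude" reduction for doubles. The function follows the MPI_User_function signature just described, with the inout[i] = in[i] op inout[i] pattern, and it is registered with MPI_Op_create, which is exactly what we come to next; the input values in main are arbitrary.

#include <mpi.h>
#include <stdio.h>
#include <math.h>

/* User-defined operator: element-wise "largest magnitude" for doubles.
   The signature is fixed by MPI_User_function; we assume MPI_DOUBLE data. */
void maxabs(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
  double *in    = (double *) invec;
  double *inout = (double *) inoutvec;
  int i;

  for (i = 0; i < *len; i++)
    if (fabs(in[i]) > fabs(inout[i]))
      inout[i] = in[i];                /* inout[i] = in[i] op inout[i] */
}

int main(int argc, char *argv[])
{
  int rank;
  double x, biggest;
  MPI_Op op_maxabs;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Op_create(maxabs, 1, &op_maxabs);   /* 1: this operation commutes */

  x = (rank % 2 == 0) ? -1.0 * rank : 0.5 * rank;
  MPI_Reduce(&x, &biggest, 1, MPI_DOUBLE, op_maxabs, 0, MPI_COMM_WORLD);

  if (rank == 0) printf("largest magnitude value: %f\n", biggest);

  MPI_Op_free(&op_maxabs);
  MPI_Finalize();
  return 0;
}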
The slide also says that the operator does not need to commute. Commutativity is the property that the operation gives the same result if the operands are in different orders — or is it the operations? Sorry, I had to bring the definition up to check which one is which. Here we go: the order in which the operations are performed doesn't matter as long as the sequence of the operands is not changed — that's associativity. Commutativity is that changing the order of the operands does not change the result. So commutativity is a slightly stronger version: it says you can do these calculations in any order at all and it doesn't matter, whereas associativity simply says that as long as the sequence of operands is preserved, it's okay. Why MPI cares about associativity here but not commutativity is, I suspect, largely down to the way computers represent numbers — floating-point arithmetic — but I'll come back to that. Hopefully that explanation has helped and not made things worse.

The main thing to remember is this: whatever your reduction operator is, as long as the global order of the operands is preserved, it must not matter if the work is chunked up in different ways. So I can add 1 and 2 over here and 3 and 4 over there, and then just add the partial results together, and I still get the same answer, as long as 1, 2, 3 and 4 are still in that order. It does not need to be true that if I add 1 and 3, and 2 and 4, I still get the same result — that's not required. For most common operations that you can think of, and that you would want to do in a reduction, both will be true, but just be aware that formally this is the requirement. In my past life as a physicist I spent a lot of time dealing with maths that does not commute — multiply two things in different orders and you do not get the same result — and it is every bit as painful as it sounds, so I won't go through it here.

In terms of registering those user-defined operators, the process is very similar to registering your own user-defined datatypes. In C there's a specific type defined, MPI_Op; in Fortran it's an integer — or rather, the handle to it is. You call MPI_Op_create in order to register the function with MPI that you're going to use as your reduction operator. You pass it a handle to the function itself; an integer (a logical in Fortran) which indicates whether or not it is a commutative operation — which, again, will affect the underlying assumptions that MPI is able to make; and then the handle, which you should have already declared and which MPI_Op_create will now assign. So you have your function pointer, a flag to say whether or not it commutes, and the handle that you're going to use to refer to the operator — and of course an ierror argument in Fortran. If I just quickly scroll back: that MPI_Op handle — op in the Fortran notation — is what fits into MPI_Reduce where MPI_SUM was. That's what you supply to the MPI_Reduce call to use your custom registered function: the MPI_Op handle that you created and assigned using MPI_Op_create.
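Putting those pieces together, the registration and use might look something like this in C — again a sketch of my own rather than the course's reference code, reusing the hypothetical max_abs function from above:

```c
#include <mpi.h>
#include <stdio.h>

/* The user-defined operator sketched earlier: element-wise largest magnitude. */
void max_abs(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype);

int main(int argc, char *argv[])
{
    int rank;
    double x, biggest;
    MPI_Op my_op;   /* the handle we will pass to MPI_Reduce */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Register the function: function pointer, commute flag (1 = commutative),
       and the handle that MPI_Op_create fills in. */
    MPI_Op_create(max_abs, 1, &my_op);

    x = (rank % 2 == 0) ? -1.5 * rank : 0.5 * rank;

    /* The custom operator goes exactly where MPI_SUM would normally go. */
    MPI_Reduce(&x, &biggest, 1, MPI_DOUBLE, my_op, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Value with the largest magnitude = %g\n", biggest);
    }

    MPI_Op_free(&my_op);   /* release the handle once it is no longer needed */
    MPI_Finalize();
    return 0;
}
```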
So there are variants of MPI_Reduce, but before we get to those, Connor J is asking the tough questions: what sets the ordering if the operator is not commutative? Have a little think about this. If it's commutative, then you can change the order of the operands, and in that case MPI is a lot more free to choose the order in which those processes communicate with each other, and in which all the partial reductions are calculated. If it's not commutative — Chris's suggestion is probably right — I would expect that it's just based on the rank, on the assumption that that's an order it can fix and always be sure of. Fundamentally, the MPI library wants to optimize as much as it can, so if you tell it this is not a commutative operation, it has to pick an order in which to do the reduction, and the one it can always rely on is simply the order of the ranks. If, however, you tell it that it is a commutative operation, it no longer has to worry about keeping the same order, and it can look at things like where processes are actually physically located. So, on Archer — well, let's do Cirrus actually, because that's what you've been given access to — on Cirrus you have 36 physical cores on each node, so you might well elect to launch 72 MPI processes across two nodes. If your operation is commutative, then MPI is at liberty to calculate the partial reduction on each node independently and then put them together, because it doesn't matter how those ranks are actually distributed across the nodes. If, on the other hand, it's not commutative, then it will have to stick to, for example, rank ordering — and that might still lead to a situation where it calculates the partial reduction on each node independently and then communicates across nodes, but it's not necessarily the case. So yes, you're right, Chris. I hope that makes sense, Connor — I appreciate this is not the simplest thing to get into in detail, but it's a good question. I don't know for sure that it would simply be based on the rank, but that's the most sensible thing I can think of for it to do. As ever, the aim is for MPI to let you give it the information it needs to make potentially dangerous optimizations if it can. You're welcome, Connor.

The reason I've tried to go more slowly through this collective communications lecture, and through some of the previous ones, is that things about ordering are confusing, and the counts are confusing — or at least I think so; you might be finding this all easy — but hopefully I've managed to make a certain amount of sense of it.

So, there are variants of MPI_Reduce available. There is MPI_Allreduce, in which there is no root process. An equivalent thing to an allreduce would be to do an MPI_Reduce followed by an MPI_Bcast of the result — but don't do that, because it's much less efficient; MPI_Allreduce does that for you. It's equivalent in the same way that a non-blocking communication followed immediately by a wait is equivalent to a blocking communication. With MPI_Allreduce, every process gets the reduced result. MPI_Reduce_scatter is an even more niche one, in which the result of the reduction is scattered across the processes — okay, allreduce isn't really that niche.
One more thing I should say about allreduce: there it's even more important that the MPI library is able to do some kind of optimization of how the reduction happens, because you need to complete the partial reductions everywhere but also share those results back out, and there are more efficient ways of doing that than the naive case of simply performing a reduction and then broadcasting. And again, it will make a real difference whether or not your operator is commutative. Actually, ignore what I was saying earlier about floating point: I think MPI doesn't care at that level. It will probably just assume that MPI_SUM, for example, is always commutative, and that MPI_PROD, the global product, is always commutative, and won't worry about the implications of floating-point arithmetic there, where the result can differ, but only very slightly.

As I said, MPI_Reduce_scatter will calculate the reduction and then scatter the result back across the processes — it does exactly what it says on the tin, really. Then there's MPI_Scan, where the explainer on the slide is "parallel prefix". I'll be honest with you, I had to look up what that meant: it turns out "prefix" and "scan" are both terms used in computer science to refer to what I would call a cumulative sum. So that's what it is: it calculates a cumulative sum across processes, in rank order, of some particular variable or array. We'll have a look at some diagrams and some syntax for these.

So, MPI_Allreduce: as I explained, it's going to calculate the reduction, but whereas with our single MPI_Reduce only the one root process got the result filled in, now they all do. All the same rules apply: all processes have to call it, and they all have to have buffers set aside — the difference is that now they will all actually be used. And Chris has been over to the MPI Forum and handily searched the standard for us, and has found a bit that says: "Users may define operations that are assumed to be associative, but not commutative. The canonical evaluation order of a reduction is determined by the ranks of the processes in the group. However, the implementation can take advantage of associativity, or associativity and commutativity, in order to change the order of evaluation." Thank you very much for taking the time to look that up, Chris — that's helpful, and I think it means I was saying the right thing, which is nice for me. So essentially, if your operator is commutative, the MPI implementation is free to take advantage of any optimization it likes in the ordering of those partial reduction operations; if it's not commutative, then it will just do it in rank order. That may be important for reproducibility reasons as well, now that I think about it more. For example, if you're the sort of person who is interested in bit reproducibility, the only way to ensure that with a reduction operation of any kind would be to make sure the reduction always happened in the same order, so you would have to register it as a non-commuting operator to make that happen. But thank you very much, Chris, for looking that up.

Great. So, MPI_Allreduce itself is more or less as you'd expect: a send buffer, a receive buffer and a count — the count should be the size of both the send buffer and the receive buffer here, because again it will just do an element-wise reduction of the array, if it is an array, and every process will end up with something in its receive buffer — then the MPI datatype, the operator for the reduction, and the communicator across which the reduction is taking place. All more or less as you'd expect; the only difference between this and MPI_Reduce is that there's no root process specified.
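As a minimal sketch in C — again my own illustration, not the course's exercise code — an all-reduce over a small array might look like this; note there is no root argument, and every rank gets back the full, element-wise reduced array:

```c
#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char *argv[])
{
    int rank;
    double sendbuf[N], recvbuf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank fills its own send buffer; the reduction is element-wise,
       so recvbuf[i] ends up holding the sum of sendbuf[i] over all ranks. */
    for (int i = 0; i < N; i++) {
        sendbuf[i] = rank + 0.1 * i;
    }

    /* Semantically like MPI_Reduce followed by MPI_Bcast, but done as one
       (more efficient) operation by the library. */
    MPI_Allreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("Rank %d: recvbuf[0] = %g\n", rank, recvbuf[0]);

    MPI_Finalize();
    return 0;
}
```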
Then there's MPI_Scan, which is the cumulative sum. Here it's a little bit trickier to think about the difference between doing this with a scalar — a single value — and with an array of values. It will calculate the cumulative sum in rank order, but of every element in that array: not the sum across the array, but the sum of the individual elements of that array across ranks. So rank 0 will simply have its initial set of values, rank 1 will have the sum of rank 0's and rank 1's values in each element, and so on. The lettering down the middle of this diagram indicates the reduction at each stage for just the first element: let's assume the operator is addition, for simplicity, so rank 0 has a, rank 1 has a + e, rank 2 has a + e + i, and rank 3 has a + e + i + n. The scan itself is just a call with some operation: it has a send buffer, a receive buffer and a count, all of which are as they are for the other reduction operations, plus a datatype, an operator and a communicator. It has all the things you'd expect; it's just doing something slightly different. And again, this obviously has a certain order to it, but there are optimizations that can be made, and by the time you get up to many, many processes I'd expect the MPI library to make them.
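A minimal sketch of MPI_Scan in C — once more my own illustration rather than course material — where each rank contributes one integer and receives the running total up to and including its own rank:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, x, running;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 contributes 1, rank 1 contributes 2, and so on. */
    x = rank + 1;

    /* Inclusive prefix sum in rank order: rank r receives 1 + 2 + ... + (r+1).
       With a count greater than 1 the scan is done element-wise, just like
       the other reduction operations. */
    MPI_Scan(&x, &running, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Rank %d: cumulative sum = %d\n", rank, running);

    MPI_Finalize();
    return 0;
}
```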
Okay, and that brings us to the end of the lecture material. We'll quickly have a look at exercise 5. As I said before, you're not under any particular obligation to complete these exercises, with the possible exception that if you want or need a certificate of attendance we will ask you to submit a solution based on the pi calculation, just to show that you were here for at least some of the course. Otherwise it's completely up to you whether or not you pay any attention to them, and you're free to do them on Cirrus as well as on your own laptop or computer. Your Cirrus accounts should stay open for a couple of weeks after the end of this course — so after today — and you'll be warned before they are removed, but you should transfer your data off before that happens if you wish to keep any of it. Exercise 5 on the sheet is simply the ring problem that we've looked at already, but now done with an MPI reduction operation rather than point-to-point. One thing that's certainly interesting to look at is how the execution time varies with the number of processes and how it compares to the point-to-point implementation. We won't have a chance to go through this separately, because this is the last session, but what I would expect to see is that the point-to-point implementation does increasingly badly compared with the reduction operation as the number of processes goes up — I hope to see that, otherwise we have an issue with our collective communications. There are a few extra things you can look at if you wish, and of course there's even more: I briefly mentioned earlier that there are special communicators you can create which will allow you to do things like calculate neighbors easily, one of those being a Cartesian communicator, so there's a little bit of extra material in this exercise sheet, because it's the same one we use for our master's course. You may wish to have a little look at that as well if you're interested. Other than that, we're now into the general Q&A. Thank you all very much for attending, and I hope you've enjoyed the course and that it's been useful.