Transcript for:
Understanding Camera Motion and Homography

So, what this actually means is that we now have to show where those equations that you saw earlier come from: x_t = a x_s and y_t = a y_s. That was the notation that we had used, and we had simply assumed that there is some a. Now we want to show what camera motion and what plane normal, that is, how the plane should be oriented with respect to the camera, will yield a motion like that. For that, we will look at the situation where it is a fronto-parallel plane, so n is (0, 0, 1). And you have a translation vector for the camera, because the camera has to move. The camera translates only along the optical axis, so T_x is 0, T_y is 0, and you have only a T_z component. It is like saying that you have a camera looking at a fronto-parallel plane, because n is taken to be (0, 0, 1), and you are translating along the optical axis. There is no sideways translation; it is only going along the optical axis. And R will be identity, because we are not going to rotate anything. Under these conditions, we will show that we end up with those equations that we saw earlier. Until now, by the way, I have been assuming f to be 1 all along. I do not know whether I explicitly told you that I have assumed the focal length to be 1. In reality, and I will talk about that a little later on, that matrix is actually a little more complex than what I have said here.
We will not go into the full details, but I just wanted to touch upon the fact that the intrinsic camera matrix is a little more complicated than what I have assumed till now. Just for the sake of understanding, I have kept that away for the time being. So, our K is identity right now, and all that we are worried about is R + (1/d) T n^T. The R part is the identity, [1 0 0; 0 1 0; 0 0 1]. Now, T is (0, 0, T_z), there is only a z component, and n^T is (0, 0, 1) because it is a fronto-parallel plane. So (1/d) T n^T is [0 0 0; 0 0 0; 0 0 T_z/d]. Add the two up and you will get H = [1 0 0; 0 1 0; 0 0 1 + T_z/d]. Now, act this on some (x_s, y_s, 1) in order to know where it goes. What you get is (x_s, y_s, 1 + T_z/d). This is still in homogeneous coordinates, so in order to get the actual image coordinates I should scale: as I told you already, you have to divide by the third component, which is 1 + T_z/d. So x_t = x_s / (1 + T_z/d), and on the image grid y_t = y_s / (1 + T_z/d). Now let a = 1 / (1 + T_z/d). Then we are back to the equations that we talked about: x_t = a x_s and y_t = a y_s, with a = 1 / (1 + T_z/d).
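As a quick numerical check of this derivation, here is a minimal Python sketch, assuming NumPy is available. The function names planar_homography and apply_homography, and the values d = 4 and T_z = -2, are made up for illustration; K is taken to be identity, as in the lecture.

```python
import numpy as np

def planar_homography(R, T, n, d):
    """Return H = R + (1/d) * T n^T, with K assumed to be identity (f = 1)."""
    T = np.asarray(T, dtype=float).reshape(3, 1)
    n = np.asarray(n, dtype=float).reshape(3, 1)
    return np.asarray(R, dtype=float) + (T @ n.T) / d

def apply_homography(H, x, y):
    """Map (x, y) through H and divide by the third homogeneous component."""
    v = H @ np.array([x, y, 1.0])
    return v[0] / v[2], v[1] / v[2]

# Hypothetical numbers: plane at distance d = 4, camera moves 2 units towards it.
d = 4.0
Tz = -2.0
H = planar_homography(np.eye(3), [0.0, 0.0, Tz], [0.0, 0.0, 1.0], d)

a = 1.0 / (1.0 + Tz / d)              # scale factor from the derivation
xs, ys = 0.3, -0.1                    # an arbitrary source point
xt, yt = apply_homography(H, xs, ys)

print(H)                              # identity except for the (3, 3) entry
print(a, (xt, yt), (a * xs, a * ys))  # the mapped point equals a * (xs, ys)
```

With these made-up numbers, 1 + T_z/d is 0.5, so a is 2 and every source coordinate is doubled, which is the zoom-in case.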
If T_z is less than 0, that is, if it is negative, that means you are coming closer to the plane, and then clearly a will be greater than 1, which means you will be zooming in. Of course, the most that T_z can go on the negative side is up to d in magnitude, because the plane is d away; after that you will actually hit the plane itself. On the positive side, if T_z is greater than 0, you can go as far away as you like. You are going away from the plane, and you get a less than 1, which is a zoom out.
So the idea was that at that time we just assumed that there is an a, and we did not really bother about where that a was coming from, what will influence a, and so on. But now we know what the camera motion is and how the plane should be oriented in order for you to see something like a scaling happening. This is a small example.
Now I am going to leave a small homework for you: show what camera motion and what plane orientation will yield shear. Suppose you take it to be [1 k; 0 1], which is the x shear. We also saw that shear is a special case of affine. Affine is [a b t_x; c d t_y; 0 0 1], but if a, b, c, d turn out to be special, like 1, k, 0, 1, then it is an x shear. I am asking you because this is exactly the example that I gave in class last time. You should be able to show this by simply imagining the case where you have a car.
Imagine that this is your x axis, and you are looking out of the car, so the camera is pointed this way: the camera optical axis, z, is this way, and y is down. Imagine that you are moving this way, so the camera is translating along x, and it is looking at the ground plane. The ground plane is below the camera and you are imaging the ground. So I just want you to write down the correct normal, the correct T, and R as well. I do not want to give away the answer; it is a small little answer and you can figure it out. So, what camera motion and what plane orientation will yield a shear of the kind [1 k 0; 0 1 0; 0 0 1]? This is again going from the homogeneous case to the things that you saw when you considered very simplistic cases.
Now, what we have not talked about until now is how you really arrive at this homography. Here, this is assuming that somebody gives you all this. In reality, if I give you two images taken from the same camera and ask you to relate the two views, you will have to find out what homography takes you from one to the other. Nobody sits there and tells you that this point is going there and that point is going there and so on. If somebody told you, of course it would be easy, but in general you cannot sit and do that. So, let us do solving for the homography. There is also a special homography, the rotation homography; what we are looking at is really a planar homography. The rotation homography I will do later; for the time being I will stay away from that. And one more thing, let me just mention it at this point of time.
I will also mention the intrinsic camera parameter matrix. See, this K that we had taken, we had said that it is [f 0 0; 0 f 0; 0 0 1]. This is actually a very simple situation. The real one is called the intrinsic camera matrix, intrinsic because it is intrinsic to that camera; it stays unchanged. The other one that you had, which involved R and T, which was the camera motion, is called the extrinsic camera matrix, because it is not inherent to the camera. You can take a camera and move it the way you want, so that is extrinsic, and this is intrinsic. In reality, what you really have is something like [f_x s o_x; 0 f_y o_y; 0 0 1]. But this is only for information. What this means is that, in a real camera, when you put [f 0 0; 0 f 0; 0 0 1], you are assuming that the optical axis coincides exactly with the image plane center. It need not. When you put 0, 0 in place of o_x, o_y, you are assuming that the optical axis of the lens, when it passes through the image plane, has no deviation; it meets the center of the image plane exactly. In case it does not, and there is a drift, then these o_x, o_y account for it. Now, these are again things that nobody may be able to give you, and they are supposed to be estimated. That is why there is what is called a calibration procedure. We are not going to go into that. But typically, if you want to use a camera, then you should know these parameters. If you are just doing a homography, you do not need any of this. That is also something that I want to emphasize: if you are just solving for a homography between two images, you do not care about all these intrinsic parameters.
All that you need is that matrix that is going to take you from one image to the other. But if you want to do some higher-level tasks, for example, if you want to take a camera and estimate 3D depth, then you need this intrinsic camera information. Right now we are assuming that K is of the simple form, with f equal to 1 and so on. So, o_x, o_y is the deviation of the optical axis from the center of the image plane. We typically assume that it coincides, but it need not. There could be some errors when they actually manufacture the camera; you could get a small shift, and that you have to account for. Then there is f_x and f_y. This is to do with the aspect ratio. Normally we will assume that it is f, f, but when they write f_x and f_y, what they really mean is that if there is a difference in the aspect ratio, then f_x and f_y will take care of it. And this s is something that takes care of deviation from what you call rectangularity. This parameter s is something like a shear, exactly like what we had when we were doing the homography. So s is playing a similar role, except that this is a hardware thing, inside the sensor. If there is no deviation from rectangularity, s will be 0. But if there is, then it needs to be accounted for. So this is deviation from rectangularity.
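To make the role of these parameters concrete, the following minimal Python sketch (assuming NumPy) builds K for some hypothetical values and applies it to one point. The helper name intrinsic_matrix and all the numbers are invented for illustration; they do not come from a calibrated camera.

```python
import numpy as np

def intrinsic_matrix(fx, fy, s, ox, oy):
    """Upper triangular intrinsic matrix [fx s ox; 0 fy oy; 0 0 1]."""
    return np.array([[fx,  s,   ox],
                     [0.0, fy,  oy],
                     [0.0, 0.0, 1.0]], dtype=float)

# Simplified case used so far in the lecture: f = 1, no skew, no offset.
K_simple = intrinsic_matrix(1.0, 1.0, 0.0, 0.0, 0.0)

# Hypothetical "real" camera: unequal fx and fy, a small skew s,
# and an optical axis that misses the image center by (ox, oy).
K_real = intrinsic_matrix(fx=800.0, fy=820.0, s=0.5, ox=4.0, oy=-3.0)

p = np.array([0.1, 0.2, 1.0])  # a point in homogeneous coordinates
u_simple = K_simple @ p        # unchanged, since K_simple is the identity
u_real = K_real @ p            # scaled by fx, fy, sheared by s, shifted by ox, oy

print(u_simple)
print(u_real)
```

The first coordinate picks up contributions from f_x, s and o_x, while the second only sees f_y and o_y, which is why s behaves like a shear.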
So if you look at K, typically it will be of this form, with five unknowns sitting inside it. But as far as we are concerned, you see that this is an upper triangular matrix, and you know that an upper triangular matrix is invertible as long as the diagonal elements are non-zero. And f_x, f_y you can relate to the focal length and the aspect ratio, so none of the diagonal entries is zero. So this matrix is always invertible. That K inverse that I kept writing, I kept writing it for the [f 0 0; 0 f 0; 0 0 1] situation, but the same thing is also valid even if you take a real intrinsic camera matrix where there are some other entries that you need to account for. There is no issue: it is an upper triangular matrix with non-zero diagonal entries, so it is always invertible.
In fact, this is also behind the way people do camera calibration. Have you heard about the QR decomposition of a matrix? What does it do? It gives you a matrix as a product of an upper triangular matrix and an orthogonal matrix (with the triangular factor first, as here, this variant is usually called the RQ decomposition). So that upper triangular factor is this K, that is the way they do the camera calibration, and the orthogonal matrix is your R; R is an orthogonal matrix. I am just saying that there are nice ways to do all this camera calibration.
Now let us come back. We said that we have x', which is the homogeneous coordinate of your target, and we are saying that H acts on x, where x is the source.
So x is (x_s, y_s, 1). I am just going to write (x, y, 1), because I do not want to carry this subscript s always. So this is like applying the homography on the source grid, and it takes you to a target grid, which is x'. In fact, I will even multiply x' by lambda, just to indicate the fact that when you do H x, the third entry need not always be 1. You can end up with some lambda, which you will have to use to divide the first two entries. So you have lambda x' = H x. This is what you have now. It could also be the case that you have an intrinsic matrix sitting there whose values you do not know. You do not know any of these parameters; let us say you have this most general case. It is still valid, because all that we are saying is that H = K (R + (1/d) T n^T) K^{-1}. So let that K be there, whatever it is. Finally, I just need to map this image onto that one. As far as the homography is concerned, I am not worried about the intrinsic parameters, because all that will get absorbed in my H. But if you want some other task, you cannot always ignore them and move on. You may have to go back and look at what your camera is and what its intrinsic parameters are. But for a homography, it does not matter.
So, let this H be [h11 h12 h13; h21 h22 h23; h31 h32 h33]. We said that H has 8 unknowns; you can only estimate it up to a scale factor, which is anyway being shown here by that lambda. Since there are 8 unknowns, you need at least 4 feature correspondences. In order to see that, let us start writing. Let us say that I have a source image and a target image, and suppose somebody says that there is one point here that goes there. It matches in the sense that, imagine this is a corner of a table or something, and I can identify that same corner in the second image. Now, this is not typically done manually. Typically you use what are called feature descriptors, and there are several of them. In fact, people have spent a whole lifetime trying to find out what kind of features are good to match. A lot of work has gone into trying to match features, because that is one of the fundamental things that you need in several applications, especially when you are trying to solve for a homography. Because if you see what is happening here, you have 8 unknowns sitting in that homography, and if you had 4 feature correspondences, they will yield you 8 equations. I will show you why. Let us say that you have lambda (x', y', 1)^T = H (x, y, 1)^T. If you just expand this, the first row gives you lambda x' = h11 x + h12 y + h13.
That is what is sitting in the first row of this H. Then lambda y' = h21 x + h22 y + h23, and finally lambda = h31 x + h32 y + h33. This means that I can do lambda x' by lambda, which will give me my x': x' = (h11 x + h12 y + h13) / (h31 x + h32 y + h33). Similarly, lambda y' by lambda gives y' = (h21 x + h22 y + h23) / (h31 x + h32 y + h33). If you cross multiply, from the first one you will get h31 x x' + h32 y x' + h33 x' = h11 x + h12 y + h13, and from the other one you will get h31 x y' + h32 y y' + h33 y' = h21 x + h22 y + h23. Now, the unknowns are these h11 to h33. So we can in turn write this in matrix form; let me go with the standard notation. I am trying to get all the terms on one side. Assume that you have a matrix multiplying the vector (h11, h12, h13, h21, h22, h23, h31, h32, h33)^T. You put all your unknowns into this vector, and we know that this vector needs to be found only up to a scale factor. The first row of the matrix is -x, -y, -1, 0, 0, 0, x x', y x', x'. If you see, h11 x, when you pull it to the left side, becomes -x; h12 is getting multiplied by -y; h13 is getting multiplied by -1; then for h21, h22, h23 there is nothing, so 0, 0, 0; and then h31 is multiplied by x x', h32 by y x', and h33 by x'.
And similarly, what you will get for the other one is: for h11, h12, h13 there is nothing, so you have 0, 0, 0; then for h21, h22, h23 you have -x, -y, -1; and then h31 is multiplied by x y' (this is x into y'; these are all coordinates that you are multiplying), h32 by y y', and h33 by y'. So the second row is 0, 0, 0, -x, -y, -1, x y', y y', y'. This whole thing is equal to 0.
Now, the point is that all these entries are known quantities. When will they be known? When somebody tells me that this point is going there, because I have assumed that somebody has given me a correspondence between x and x'. Somebody has said that if you apply H on (x, y, 1), it will take you to (x', y', 1). This means that with just one point correspondence, I have two equations. And since I have eight unknowns, I will need four point correspondences. So somebody has to give me not just one, but at least four point correspondences, so that I can solve for H, because H has 8 unknowns and is known up to a scale factor.
So you can now add an index. For the first feature point you have the rows -x1, -y1, -1, 0, 0, 0, x1 x1', y1 x1', x1' and 0, 0, 0, -x1, -y1, -1, x1 y1', y1 y1', y1'. This is followed by -x2, -y2, -1, 0, 0, 0, and so on, and you go on. You will have points 1, 2, 3, 4, so you will have 8 rows minimum; you can of course use more. This whole thing we can write as A h = 0, where 0 is actually a vector. With four correspondences, A is 8 x 9, h is 9 x 1, and 0 is 8 x 1; it is a vector, not a scalar. Solving A h = 0 is also called a homogeneous least squares problem. Now, you might ask who will give me those point correspondences. I was just talking about that.
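The stacking of two rows per correspondence can be sketched in Python as follows, assuming NumPy. The function names correspondence_rows and build_A are hypothetical, and the four correspondences are made up: they are the corners of a unit square mapped by a pure scaling of 2, so the true h is known in advance.

```python
import numpy as np

def correspondence_rows(x, y, xd, yd):
    """Two rows of A for one correspondence (x, y) -> (xd, yd), where xd, yd are x', y'."""
    return np.array([
        [-x, -y, -1, 0, 0, 0, x * xd, y * xd, xd],
        [0, 0, 0, -x, -y, -1, x * yd, y * yd, yd],
    ], dtype=float)

def build_A(src, dst):
    """Stack the rows for all correspondences; N points give a 2N x 9 matrix."""
    rows = [correspondence_rows(x, y, xd, yd)
            for (x, y), (xd, yd) in zip(src, dst)]
    return np.vstack(rows)

# Made-up example: unit square corners and their images under a scaling by 2.
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(0, 0), (2, 0), (2, 2), (0, 2)]

A = build_A(src, dst)

# The scaling homography [2 0 0; 0 2 0; 0 0 1], flattened row by row.
h_true = np.array([2, 0, 0, 0, 2, 0, 0, 0, 1], dtype=float)
residual = A @ h_true

print(A.shape)   # (8, 9)
print(residual)  # all zeros: the true h satisfies A h = 0
```

Using more than four correspondences only adds rows to A; the vector of unknowns stays 9 x 1.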
Basically, people have spent a lot of time trying to get these correspondences, and there are various features that people have come up with. One is the Harris detector, which is a corner detector. Then there is something called SIFT, which stands for scale-invariant feature transform. Then there is something called SURF, which stands for speeded-up robust features, and there are many more. Among these, SIFT is the one that is most common. It was proposed by David Lowe. The nice thing about it is that it is translation, rotation and scale invariant, and it is partially invariant to illumination and pose. What this means is that even if you have a pose change, it will still be able to reasonably identify that it is the same point across the two images that you have. SIFT code is available; in fact, when you do the assignment, we will give you that code. In this course, I do not go through the entire details of how SIFT works; when we do a computer vision course, there we do. But here, just keep in mind that there are many such things that allow you to identify feature correspondences across images. SIFT is one of them, it is pretty fast, and it is used all over. But these are all called handcrafted features, by the way. These are not your deep learning kind of features. Somebody figured out that if you do a Gaussian scale space, some scale-space processing, you will be able to identify features.
So they are saying that if you had a scene captured under some illumination, and you capture the same scene after some time from a different viewpoint, even with slightly different illumination, you can still match those correspondences. And you will still be able to tell how the two are aligned, what geometric transformation can bring one to the other. So we will assume that there is something like SIFT available with us, which you can just run. We will give you that code; when you run it, you will see that it will not just pick three or four points, it will pick many.
The point is, for us it looks like if I had just 4 feature point correspondences I am done, because A is 8 x 9, and as long as you make sure that no 3 points are collinear, you can be assured that A has rank 8. Because A is 8 x 9, there is at least one non-zero vector that is mapped to 0 by A, and since the rank is 8, that null space is one-dimensional. So one way to do it is a simple SVD. If you use MATLAB or something, you will ask for null of A. That gives you a vector in that one-dimensional space in which h lies; all scaled versions of it are okay. When you do null of A, typically what MATLAB does is an SVD on A, and it picks the right singular vector associated with the smallest singular value, which in this case is in fact a zero singular value. You know this for sure because we are assuming that you do not have 3 points lying on a line and so on.
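The null-space step can be sketched in Python with NumPy's SVD, used here as a stand-in for MATLAB's null. The homography H_true, the four source points, and the helper names build_A and null_vector are all made up for this illustration; the target points are generated from H_true so that the recovered h can be checked against it.

```python
import numpy as np

def build_A(src, dst):
    """Two rows per correspondence (x, y) -> (xd, yd), as derived in the lecture."""
    rows = []
    for (x, y), (xd, yd) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, x * xd, y * xd, xd])
        rows.append([0, 0, 0, -x, -y, -1, x * yd, y * yd, yd])
    return np.array(rows, dtype=float)

def null_vector(A):
    """Last right singular vector of A; it spans the null space when rank(A) = 8."""
    _, _, Vt = np.linalg.svd(A)  # Vt is 9 x 9 for an 8 x 9 input
    return Vt[-1]

# Hypothetical ground-truth homography, used only to generate test data.
H_true = np.array([[1.2, 0.1, 0.5],
                   [-0.2, 0.9, 0.3],
                   [0.05, 0.02, 1.0]])

# Four source points with no three collinear, and their images under H_true.
src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
mapped = np.hstack([src, np.ones((4, 1))]) @ H_true.T
dst = mapped[:, :2] / mapped[:, 2:3]

A = build_A(src, dst)
h = null_vector(A)                 # unit-norm solution, defined up to scale
H_est = (h / h[-1]).reshape(3, 3)  # rescale so the (3, 3) entry is 1

print(np.linalg.matrix_rank(A))    # 8
print(H_est)                       # matches H_true up to rounding
```

Dividing by the last entry is just one convenient way to fix the scale for comparison; it assumes h33 is non-zero, which holds for this example.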
You just have to avoid those configurations, but otherwise you can pick any 4 such points, set up A h = 0, and ask for null of A. That will return an h for you which is non-zero and which, when acted upon by A, will yield the 0 vector. You can also clearly see that any scaled version of h will also go to 0, because A (alpha h) is nothing but alpha (A h), and A h we know is 0, so this is alpha times 0, which is 0. So all scaled versions of h are solutions, and you can only estimate h up to a scale factor. That is fine, because as we said originally, we have only eight unknowns in H anyway, and we are okay with estimating H up to a scale factor. So whatever h is returned is what you can directly use, and then you can align the images. You take that h and put it back in your H matrix as h11, h12, h13, h21, h22, h23, h31, h32, h33. Then if you multiply it with any (x, y, 1), you will get a 3-dimensional vector whose third entry need not be 1. So you will have to scale the first two entries by the third, and that will give you the image coordinates.
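This last step, reshaping h into H and dividing by the third entry, can be sketched in Python as below, assuming NumPy. The function name warp_points and the example h are hypothetical; the example also checks that scaling h leaves the mapped image coordinates unchanged.

```python
import numpy as np

def warp_points(h, pts):
    """Reshape the 9-vector h into H, apply it to (x, y, 1), divide by the third entry."""
    H = np.asarray(h, dtype=float).reshape(3, 3)
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])  # rows are (x, y, 1)
    mapped = homog @ H.T                              # rows are lambda * (x', y', 1)
    return mapped[:, :2] / mapped[:, 2:3]             # scale by the third entry

# Made-up h with a non-zero h31, so the third entry depends on x.
h = np.array([1, 0, 0,
              0, 1, 0,
              0.5, 0, 1], dtype=float)
pts = [(2, 4), (0, 3)]

out = warp_points(h, pts)             # (2, 4) has third entry 2, giving (1, 2)
out_scaled = warp_points(5 * h, pts)  # any scaled h gives the same coordinates

print(out)
print(out_scaled)
```

The same function can be applied to every pixel coordinate of the source grid when aligning one image to the other.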