Transcript for:
Neural Radiance Fields (NeRFs) - Lecture Notes

Okay, welcome to the afternoon session. My name is Torsten, and I'm going to give you a brief introduction to neural radiance fields, or NeRFs for short. If you ever search for those things on the internet, don't just type "Nerf" — you'll end up with a lot of toy guns, but not the thing you're interested in. If you have any questions at any point in time, feel free to stop me and we'll talk about it directly, rather than waiting until the end.

Good. My background is in computer vision, and I'm interested in building 3D models, or 3D reconstructions, of scenes from images — and that's precisely what NeRFs are doing, to some degree. The input is a set of images with known intrinsic calibrations and camera poses, and we would like a faithful reconstruction of the scene. Traditionally, these things have been done using explicit scene representations: when you reconstruct a scene, you try to build a point cloud or a 3D mesh, which is what, for example, multi-view stereo methods give you. They work well; the problem is that they are quite memory intensive, and you make the assumption that you have Lambertian surfaces, meaning the color of each part of the scene doesn't change depending on the direction you look at it from.

An alternative to explicit scene representations are implicit scene representations. The best-known one is probably the signed distance function: if you have a simple shape, for example a circle, the signed distance on the surface is zero; as you move away from the surface the signed distance to that surface increases, and if you move into the surface it decreases — you get negative values. These have been used for 3D reconstruction for some time. A few years ago we did live reconstruction using such an implicit scene representation, a signed distance function, on a mobile device: you use the camera to compute depth maps, these are integrated into the signed distance function, and once you have the signed distance function you can extract a mesh — which is rendered on the large part of the screen here — by essentially finding the zero crossing of the signed distance function. The way this is traditionally implemented is that you voxelize the scene, that is, you compute a discrete subdivision of the scene, and in each voxel you store the corresponding signed distance value. As I said, you get the geometry, the 3D mesh, by finding the zero crossing of the signed distance function, which by definition is exactly the surface of the scene. The issue here is, again, that because you need to store a voxel grid this becomes quite memory intensive, and again you make the assumption that you have Lambertian surfaces.

And this is where NeRFs come in. What you see on the left is actually not a real image but a rendering from a neural radiance field, and on the right you see the geometry that the NeRF learns as part of the training process. As you can see, you're able to faithfully reconstruct the scene in high detail, both in terms of color and in terms of geometry — so you can get high-quality geometry out of these things. At the same time you can model complex illumination and complex interactions between light and the scene, such as reflections, refractions, and so on. Again, these are not real images, these are renderings, so you're actually able to handle non-Lambertian surfaces quite well.
At the same time, these neural radiance fields can be rather compact. What you see here is a 3D reconstruction system that takes color and depth images as input and then trains a neural radiance field on the fly to represent the scene, and I think the scene representation they use is about a megabyte for a room — so you end up with something very, very compact.

Good. I hope that motivated you that NeRFs are something useful; let's look into what they are and how we train them. They build on something that was invented in the '80s — the paper is slightly younger than me — and that is volume rendering. The idea is that you have some representation of the volume. If you have an image that you want to render and you want to compute the color for a pixel, you essentially shoot a ray into the scene, you sample 3D points along the ray, and at each sample point you evaluate your volume: you get a color and some volume density. You then accumulate all these samples — you combine the colors in a linear fashion — and that gives you the final color for that pixel. Back then, I think, someone said that in ten years all rendering will be volumetric rendering; we weren't there in the '90s, but we might get there now with neural radiance fields.

So how do we actually get the color? We have this ray that we shoot for each pixel, and we have samples along it. For each sample we have a color, which is RGB, and a volume density, which essentially says whether this part of space is occupied, or close to some surface, or not. If it's not occupied you get basically zero; otherwise you get a large value — the value doesn't have to be between zero and one, it can be larger. The final color, as I said, is a linear combination of these individual colors.

More precisely, the way you compute it is this: there is a visibility term that is used to weight each color, and this visibility term consists of two parts. The first part looks only at the current sample. It uses the distance between this sample and the next one, assuming that volume density and color remain constant in between. If you are in an occupied part of space, meaning sigma_i is a large value, the exponential becomes small and the whole term is close to one, so you give a high weight to this sample. If you're in an unoccupied part of space, sigma_i is close to zero, the exponential is close to one, things cancel out, and you get very little weight. The T_i term looks at all previous samples, by summing up the individual distances weighted by the corresponding volume densities. If all of the previous samples were in empty space, that sum is essentially zero, so you get a weight of one; if one of the previous samples was occupied, you get a larger sum, the exponential becomes small, and you give less weight to that color.
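To make the weighting concrete, here is a minimal NumPy sketch of the compositing step just described. The densities, colors, and sample spacings are made-up toy inputs, and the variable names are mine, not from the lecture.

```python
import numpy as np

def composite(colors, sigmas, deltas):
    """Alpha-composite the samples along one ray.

    colors: (N, 3) RGB per sample, sigmas: (N,) volume densities,
    deltas: (N,) distances between consecutive samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)         # per-sample term: ~1 where density is high
    trans = np.cumprod(1.0 - alphas + 1e-10)        # accumulated transmittance
    T = np.concatenate([[1.0], trans[:-1]])         # T_i: product over the *previous* samples only
    weights = T * alphas                            # visibility weight per sample
    return (weights[:, None] * colors).sum(axis=0)  # final pixel color

# toy ray: empty space, then a high-density "red surface", then empty space again
sigmas = np.array([0.0, 0.0, 8.0, 0.1, 0.0])
colors = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]], dtype=float)
deltas = np.full(5, 0.25)
print(composite(colors, sigmas, deltas))            # dominated by the high-density red sample
```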
Let me try to illustrate this a bit. We'll call the per-sample term beta_i, and let's look at a distribution of volume densities along the ray — assume this is the ground-truth volume density. At the first sample we haven't encountered anything yet, so the accumulated density is zero, meaning T_1 is one; and since we're far away from any high volume density, sigma is zero there, so the beta part becomes zero as well. As we approach the region of high volume density, the T_i term stays at or close to one, because we haven't seen any notable volume density yet. As we hit the peak, T_i is still one — none of the samples before had any density — but the beta term now becomes large: sigma_i times the distance is large, the exponential becomes small, so beta is close to one. At the next sample, which is again in a region with no volume density, the beta part is zero, but the T_i part is also close to zero, because we have already encountered one sample with high volume density: the accumulated term is large, so the exponential is small. And so on. Hopefully that gives you an idea of what volume rendering does: it essentially takes the individual color samples and weights them by some sort of visibility — is this part visible, or should it be visible, or not?

Good. So far we assumed that we have some volume function that gives us these density and color values. What you should note is that this whole function is very simple and fully differentiable, and this is where NeRFs come in. NeRFs essentially say: rather than assuming that I have captured these volume densities and colors with some special equipment, or that they are given, I'm going to train them from images — I want to learn a volumetric representation. What you do is train a neural network that takes five parameters as input: three of them are the position of the 3D point, and the other two are angles describing from which direction you are looking at that 3D point. That way you can model view-dependent effects: you keep looking at the same 3D point from different viewpoints, and the color output by the network can change according to the change in viewpoint. You end up with a scene representation that is continuous, compared to the discrete representations you had with voxel volumes.

So when you render an image with a NeRF, you do volume rendering: you shoot a ray through the pixel and sample 3D points along the ray; those points, together with the direction of the ray in 3D space, are the input to the neural network, and the network gives you a color and a density for each of these samples. So you end up with a color for each pixel, computed using volume rendering. How do you train the NeRF? You assume you have a set of images with known poses. During training you generate rays by shooting them through pixels of those images; you have the color of that pixel in the actual image, and you have the color predicted by the neural radiance field, and you want the predicted color to be as similar as possible to the color in the actual image. A very simple training objective.
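Putting those pieces together, here is a hedged sketch of rendering one pixel and computing that photometric loss. It assumes some differentiable field `nerf_mlp(points, directions) -> (rgb, sigma)` exists; the network itself, the near/far values, and the sample counts are placeholders, not the original implementation.

```python
import torch

def render_pixel(nerf_mlp, origin, direction, t_near=2.0, t_far=6.0, n_samples=64):
    """Render one pixel by querying a NeRF-style field along a ray and compositing."""
    t = torch.linspace(t_near, t_far, n_samples)                # sample depths along the ray
    points = origin + t[:, None] * direction                    # (N, 3) sample positions
    dirs = direction.expand(n_samples, 3)                       # same viewing direction for every sample
    rgb, sigma = nerf_mlp(points, dirs)                         # (N, 3) colors, (N,) densities
    deltas = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])  # spacing to the next sample
    alpha = 1.0 - torch.exp(-sigma * deltas)
    T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = T * alpha
    return (weights[:, None] * rgb).sum(dim=0)                  # predicted pixel color

def photometric_loss(pred_rgb, gt_rgb):
    """The training objective: squared difference to the observed pixel color."""
    return ((pred_rgb - gt_rgb) ** 2).sum()
```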
In order to make things work, you need to do a few tricks. The first one is to be a bit careful about how you generate the samples.

Yes — what about the density? We don't have any ground truth for the densities. From images alone you don't know where exactly along the ray the volume is — where along the ray the surfaces are — so that is something the network needs to figure out. It's a free parameter, so to speak, that you need to estimate. If you had, say, depth images, then you could add a supervision term that says: I would like my volume density to be high close to the depth values that I observe, and zero otherwise.

When you sample along the ray, you typically assume that you roughly know how far away the camera is from the scene, so you can define a near plane and a far plane that say: I'm only interested in sampling values in between. You subdivide the distance between the near and far plane into equal intervals, and during training, rather than sampling at fixed positions, you randomly select a point within each interval. The reason is that if you sample at fixed positions, you only train the network at those fixed positions; if you sample uniformly at random during training, you cover the space more continuously.

With this you hope to hit something close to the surface, but you're not really guaranteed to draw many samples at the surface. However, if you evaluate the network at these sample points, you get an idea of how the volume density is distributed along the ray, because each sample gives you a density estimate. So you can fit some sort of curve to it and do a second sampling step, where you draw samples according to the volume density you observed before. The idea of this fine network is that rather than uniformly guessing where the density is, you get a reasonable guess from the coarse stage and then do educated sampling, and that typically improves performance quite a lot. When rendering you then take the samples from both the coarse and the fine network to compute the final color. At least in the original NeRF implementation you actually train two NeRFs, one based on coarse sampling and one based on fine sampling, and you simply add a second cost term, comparing the ground-truth color with both the color predicted by the fine network and the color predicted by the coarse network.
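A minimal sketch of those two sampling steps — stratified samples between the near and far plane, then extra samples drawn according to the coarse weights. The weights would come from the compositing step sketched earlier; the bin handling and sample counts here are illustrative, not the exact original scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def stratified_samples(t_near, t_far, n_samples):
    """One random sample per equal-sized interval between the near and far plane."""
    edges = np.linspace(t_near, t_far, n_samples + 1)
    return edges[:-1] + rng.random(n_samples) * (edges[1:] - edges[:-1])

def importance_samples(t_coarse, weights, n_fine):
    """Draw extra samples where the coarse pass saw high compositing weights
    (inverse-transform sampling of a piecewise-constant distribution)."""
    pdf = weights / (weights.sum() + 1e-10)
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.random(n_fine)
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(t_coarse) - 2)
    lo, hi = t_coarse[idx], t_coarse[idx + 1]
    return lo + rng.random(n_fine) * (hi - lo)       # uniform within the chosen coarse bin

t_coarse = stratified_samples(2.0, 6.0, 64)
# 'weights' would come from compositing the coarse network's densities; faked here
weights = np.exp(-0.5 * ((t_coarse - 4.0) / 0.1) ** 2)
t_fine = np.sort(np.concatenate([t_coarse, importance_samples(t_coarse, weights, 128)]))
```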
That's not all you need to do to train these things properly. One important detail is that you don't actually have one single neural network, but two. The first one takes only the position of the 3D point as input and outputs a feature vector and the volume density; the idea is that the density of space — where the surfaces are — should not depend on the direction you're looking from, that's something that's fixed in the scene. You then take this feature vector and the viewing direction of the ray as input to a second network, which predicts the color.

If you do all these tricks and try to train a NeRF, you'll get something that looks like this, which is not very appealing, right? So what has gone wrong here? It turns out that neural networks are not really good at predicting fine details if you give them just raw coordinates. But there's a simple trick: rather than giving the network the original xyz coordinates, you lift them — you encode them into some higher-dimensional space — and then feed them into the neural network, and the same for the direction of the ray. One simple way of doing this input encoding is to look at each coordinate and compute various sine and cosine terms, some sort of frequency encoding similar to what people do for Transformers, for example, and it has been shown that this helps networks recover fine details that you would otherwise lose if you just give them xyz. So this is the difference between just feeding in xyz and the viewing direction of the ray versus doing an input encoding for each of the five values, without changing anything else — it makes a significant difference.
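A small sketch of such a frequency encoding: each input coordinate is mapped to a stack of sines and cosines at increasing frequencies. The number of frequency bands and the exact scaling below are one common convention, not necessarily the original settings; positions and directions are usually encoded with different numbers of bands.

```python
import numpy as np

def positional_encoding(x, n_freqs=6):
    """Lift each coordinate of x to [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..n_freqs-1."""
    x = np.asarray(x, dtype=float)
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi        # increasing frequencies
    angles = x[..., None] * freqs                      # (..., dim, n_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)              # (..., dim * 2 * n_freqs)

print(positional_encoding(np.array([0.3, -0.7, 1.2])).shape)   # a 3D point becomes 36 numbers
```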
Let's look at some results. These are synthetic scenes: on the left you see a previous method, on the right the output of the NeRF, and this is the nearest training view — the training view that is closest to the viewpoint we're currently rendering from. As you can see, you're able to recover fine details while also taking view-dependent effects into account, such as the reflections on this watery surface. Again, you see that you're able to reproduce fine details, such as the individual links between the parts of the instrument. That was synthetic data; this is on real scenes — not very complex scenes or complex motion, but you can see that you're able to faithfully reproduce fine details. Any questions so far?

[Question about what the density values mean.] If you have solid objects, then ideally you have zero everywhere in free space and some high value on the surface and inside the object. But the value doesn't really have an intuitive meaning, at least in my experience; it essentially says "is there something here, or are we close to something, or not". That actually causes problems if you want to extract the surface, because the surface is then not uniquely defined — we'll get to that later. Any other questions?

So we get wonderfully nice images, everything is good, right? Research done, we can sell this to some company or whatever. There is one disadvantage, though: NeRFs, at least in the original formulation, are horribly slow, because for each point you need to query and evaluate this neural network, and depending on how complex your neural network is, that can mean a few days of training for such a simple scene, and rendering an image can take seconds to minutes.

So how do we make this faster? One question to ask is: do we really need a neural network, or can we do things in a much simpler way? It turns out that you don't necessarily need a neural network. There is a paper called Plenoxels which showed that you can get competitive results without any deep learning, just using differentiable volume rendering and a slightly different representation of colors. The idea is that rather than training a single neural network that represents the scene, you go back to a more explicit representation, where you subdivide the scene into voxels, and at the corners of each voxel you store a volume density value and spherical harmonics coefficients that you can combine into a color. If you want to evaluate some 3D point along the ray that falls into a voxel, you compute the corresponding spherical harmonics coefficients by trilinear interpolation of the coefficients stored at the corners, where the interpolation is based on the spatial position of that point inside the voxel. That gives you a color and a volume density, you can do volumetric rendering, and you can optimize essentially the same reconstruction loss that we had before. The idea is that rather than optimizing the weights of a neural network, you optimize these spherical harmonics coefficients, and the promise is that because you have a much shallower representation — instead of a deep neural network you have, I think, 32 or so parameters that you optimize per vertex — this is much faster to evaluate, and thus much faster to render and train.

One issue: if you take this formulation as is, you end up with renderings like this, with a lot of noise. What happens is that you have so many degrees of freedom that you can overfit to the individual colors without enforcing any smoothness. So what the paper actually proposes is that rather than just using the reconstruction loss, you also add a regularization term which says: if possible, I would like to have smooth transitions between voxels. If you add that regularization, you get much more meaningful results. It's not perfect — here there should be a lot of ferns that are not reconstructed — but the advantage is that it's significantly faster than training a neural network: on the left you see the current state of a neural network, and while that is still trying to figure out what's happening, the Plenoxels model has already been trained, and everything afterwards just refines quality; even after a few minutes you already have a result. Here are some results on simple real-world scenes, where again you can see that you don't necessarily need to learn a neural network that represents a radiance field — you can parameterize it directly in a much simpler way, which I think is a very nice result showing that you don't always need deep learning. Here are results for a somewhat more complex scene; you can see that things are much blurrier, which comes from the fact that you have maybe a couple of hundred images of the scene, so you observe each scene part in a few images from potentially a few meters away, and there isn't that much detail to recover. Still, you can see that this non-neural representation is able to model view-dependent effects quite well.

So what's happening here — why does this work? It essentially works because you have a dense enough voxel grid over the scene, and you actually need that: when you compute volume densities you do trilinear interpolation, meaning you assume that within a voxel the volume density changes essentially linearly. That means you need quite a detailed subdivision of the scene for this linear approximation to hold, which in turn means that what we have actually done is make training and rendering much faster at the cost of higher memory: we have this detailed voxel grid, and for each voxel we need to store multiple features — something like a 30-dimensional vector — and the memory for this quickly grows.
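The smoothness regularizer mentioned above can be as simple as penalizing differences between neighboring grid vertices. Here is a hedged, total-variation-style sketch on a dense feature grid; the exact weighting, and details such as applying it only to occupied voxels, differ in the actual Plenoxels implementation.

```python
import numpy as np

def tv_regularizer(grid):
    """Total-variation-style smoothness term on an (X, Y, Z, C) grid of per-vertex
    values (density plus spherical-harmonics coefficients): penalize squared
    differences between neighboring vertices along each axis."""
    dx = grid[1:] - grid[:-1]
    dy = grid[:, 1:] - grid[:, :-1]
    dz = grid[:, :, 1:] - grid[:, :, :-1]
    return (dx ** 2).mean() + (dy ** 2).mean() + (dz ** 2).mean()

# total loss would be something like: reconstruction_loss + lambda_tv * tv_regularizer(grid)
grid = np.random.randn(32, 32, 32, 28)    # e.g. 1 density + 27 SH coefficients per vertex
print(tv_regularizer(grid))
```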
How can we reduce this, so that we get both fast rendering and training and only a reasonable increase in memory? The idea is to take a step back. Plenoxels essentially had trainable features associated with each vertex of the grid, and rather than computing a color from them directly, you could take the trilinear interpolation of the features to get a feature for a point on the ray, add some additional information such as the viewing direction of the ray, and then feed this into a small neural network. The idea is that rather than having to encode everything explicitly in the features, and needing a fine enough subdivision for linear interpolation to make sense, you have a neural network that can model nonlinearities, which hopefully means you can get away with a coarser subdivision combined with a small neural network — and you jointly train the features and the network.

Here's an example. On the left you see the output of a standard NeRF, where the neural network has about 440,000 parameters and training to this quality takes around 14 minutes. Next to it you see a multi-resolution grid approach, based on voxelizing the scene: you have a very small neural network with only about 10,000 parameters, but you also have 16 million parameters that come from the features stored at the corners of the voxels. So you have increased memory, but you have decreased training time to about one and a half minutes, because in each iteration you only need to update a few parameters of the neural network and a few features — you never need to touch all 16 million of them. Still, you see a decrease in quality, which naturally leads to the question: can we improve quality while also reducing memory requirements? It turns out that yes, you can, by doing a multi-resolution hash encoding of your features.

The idea is pretty simple. You have multiple grids that subdivide the scene at different resolutions; for each grid corner you hash its position into an entry of a hash map, and you store a short feature vector at that entry. To get the value for a point you want to evaluate, you take the features you get from the hash map and do trilinear interpolation within the voxel, at each resolution; you concatenate all of this, potentially add some extra information such as the viewing direction, and run it through a neural network. The difference to the previous multi-resolution approach is that you reduce the number of parameters by hashing 3D points into fixed-size hash maps. You use the same hash table size for each resolution, which at coarse resolutions actually gives you an injective mapping — each corner is mapped to a unique entry in the hash map, so you don't have any collisions — but you will get collisions once you move to finer subdivisions. These are not explicitly handled: multiple points can map to the same entry, and you never really try to resolve this. Rather, during training, multiple individual points get mapped to the same entry, and you get multiple gradients for updating the feature of that entry, coming from different 3D points. The idea — the intuition — is that in many cases only a few of those gradients are large and will actually help you improve the reconstruction; the rest are small because they come from sampling regions of space that are empty. So the network will effectively handle collisions by itself, because the large gradients contribute much more to the update of the feature.
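Here is a hedged sketch of one level of such a hashed feature grid: hash the eight surrounding corners, fetch their feature vectors, interpolate trilinearly, and concatenate across levels. The prime constants are my recollection of the paper's spatial hash, and the table sizes, feature widths, and resolutions are purely illustrative.

```python
import numpy as np

PRIMES = (1, 2_654_435_761, 805_459_861)   # spatial-hash constants as I recall them from the paper

def hash_corner(ijk, table_size):
    """Hash an integer grid corner (i, j, k) into a fixed-size table (XOR of coordinate * prime)."""
    h = 0
    for c, p in zip(ijk, PRIMES):
        h ^= int(c) * p
    return h % table_size

def encode_point(x, resolution, table):
    """Trilinearly interpolate hashed corner features for one 3D point at one grid level.
    x: point in [0, 1]^3, table: (table_size, F) array of trainable feature vectors."""
    scaled = np.asarray(x, dtype=float) * resolution
    base = np.floor(scaled).astype(int)
    frac = scaled - base
    feat = np.zeros(table.shape[1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((frac[0] if dx else 1 - frac[0]) *
                     (frac[1] if dy else 1 - frac[1]) *
                     (frac[2] if dz else 1 - frac[2]))
                idx = hash_corner(base + np.array([dx, dy, dz]), len(table))
                feat += w * table[idx]
    return feat

# multi-resolution encoding: concatenate the interpolated features from every level
tables = [np.random.randn(2 ** 14, 2) for _ in range(8)]        # 8 levels, 2 features per entry
resolutions = [16 * 2 ** level for level in range(8)]
point_feature = np.concatenate([encode_point([0.3, 0.5, 0.7], r, t)
                                for r, t in zip(resolutions, tables)])
print(point_feature.shape)   # this vector (plus viewing direction) would go into a tiny MLP
```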
Here is an example of this approach: using a hash table, you go from 60 million parameters down to about 500,000, you keep training time roughly the same, but you get significantly increased quality. And if you combine this with a highly efficient implementation of the neural networks, you get something where you can train NeRFs within seconds — you might have heard about it, the so-called Instant Neural Graphics Primitives, or instant-ngp, a very nice software package that NVIDIA released.

So let's have a quick look at this. What you see here is the neural radiance field being trained, and you directly see its output. Here you can see the training loss, which is decreasing, and as the training loss decreases, the quality of the renderings increases; after a couple of seconds you already have something meaningful, and after a bit more time you start seeing more and more details. You can also visualize the cameras from which the images were taken — this is a video sequence, essentially me walking around the statue, so here you see the positions of the cameras for the images extracted from the video. After a few seconds you can see that it is modeling some shadow that is a bit view dependent; and there is a more subtle effect — you only see it after a few seconds of training — that from this side you actually get some details, whereas if you look from the other direction everything gets saturated, because we had direct sunlight. So you can see that it's able to model view-dependent effects as part of the neural network.

And — this is a bit hard to see if you don't know to look for it — you can not only estimate the entries of your neural radiance field, you can also update the intrinsic and extrinsic parameters of your cameras. You see that it slightly starts shifting the camera poses, and that should add a bit of detail to the representation, because it's able to optimize the camera poses so that you get crisper renderings. Typically this works within a few minutes — you get something reasonable, at least for small scenes.

[Question about changing illumination and moving people in the scene.] Yes — because it has this degree of freedom, that you can model colors depending on which direction you're looking from, that helps to some degree; but it does not help with moving people, because you can't just model those as a view-dependent effect. We're actually going to look into occluders and so on in the next slides. Is there any other question?

Yeah — about the hash table: it maps 3D point positions to some feature, typically a two-dimensional feature, that encodes some information about color and density for that point, and you do this at multiple resolutions. For a query point you get the corresponding features by trilinear interpolation based on the position of the point in the voxel, then you concatenate the different interpolated features, feed them into the network, and ask the network to give you color and density. It's essentially a learned encoding, if you want to call it that.
[Another question, partly inaudible, about what these features represent.] So the input to the whole training process is still just images with known poses and intrinsics. In the spherical harmonics example, you essentially said: I know that I need to represent colors, and I represent colors as coefficients of a spherical harmonics function, and you evaluate the color based on position and viewing direction. The same happens here, only that you say: I don't want to define how to model it — I would like the training process to figure that out. So it's just a bunch of numbers; they are not necessarily humanly interpretable — it's not that the first number is color and the last number is something else — you just get a vector representation that you can throw into your network, and something meaningful comes out.

[Question about which quantities are inputs and which are optimized.] The y would be the input, which is this concatenated feature vector, and theta would be the parameters of the network that you want to optimize. You get a color as output, you compare the color that you get from the network to the color of the pixel, and based on the difference in color you update the parameters of the network, and you update the input to the network — and since the input depends on the features, you actually update the features that are stored at the corners of the voxels. Just think of it as some vector in a high-dimensional space; as part of the training process you learn the network and you learn to structure that space in a manner that makes sense to the network. Any other questions?

Yes — can you get depth or geometry information out of this? What you can do, rather than accumulating color, is to look at the occupancy values you get along the ray and figure out, based on those, where you think the depth — the geometry — should be, and that gives you depth maps. This is the same neural network as before, but rather than showing colors I'm showing depth: darker colors, except black, mean something close, and white means something far away. Admittedly, the way depth is visualized here is not really easy to see — I'll run a second framework later where you can see this better — but yes, you can get geometry out of this.
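Concretely, the same per-sample weights used to composite the color can be reused to read off an expected depth per ray. A small sketch under that assumption, with the sample depths and weights faked:

```python
import numpy as np

def expected_depth(t, weights):
    """Depth estimate for one ray: the weight-averaged sample depth, where the
    weights are the same T_i * alpha_i terms used to composite the color."""
    acc = weights.sum() + 1e-10          # total opacity along the ray
    return (weights * t).sum() / acc     # weighted mean of the sample depths

t = np.linspace(2.0, 6.0, 64)                        # sample depths along the ray
weights = np.exp(-0.5 * ((t - 3.2) / 0.05) ** 2)     # fake weights, peaked at the surface
weights /= weights.sum()
print(expected_depth(t, weights))                    # ~3.2, where the density peak sits
```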
[Question about adding depth supervision.] Yes — if you have depth, you could add an additional loss function that essentially says: I would like to have high volume density around the depth values that I get from my depth camera.

[Question about transparent objects.] If the object is transparent — and I'll have an example of this — you don't necessarily get depth where you would expect a meaningful surface. You can probably reformulate this a bit, but the problem is that with a translucent surface there is nothing that forces the network to say: there is one jump in volume density when the ray enters, say, this bottle, then it drops to zero inside, and there is another jump when it leaves. You could just as well have some constant volume density throughout the whole thing, which makes it not fully opaque but translucent enough that you can look through it. There is nothing that forces the network to put one peak here, one peak there, and nothing in between, and that makes it hard to get proper depth for translucent things. You can model the appearance, because it's view dependent and the network has enough capacity to model it, but the geometry behind it is a bit murky. What is typically a bit better to look at is ambient occlusion, which gives a more humanly readable rendering, and there you can see that it manages to learn the geometry of the scene quite well.

Okay. One thing you typically have, if you look at images downloaded from the internet or if you capture things over time, is that you see illumination changes — from day to night, cloudy days, sunny days — and there are transient objects, humans walking in front of the scene. It turns out that you can model these effects and still train a NeRF for the static part of the scene. The idea is that rather than just training one network that predicts density and one that predicts color, you additionally train what is called an appearance embedding — think of it as a feature in a relatively low-dimensional space that encodes what the appearance of that image is — and a transient embedding, again a vector in a relatively low-dimensional space, that says where in the image used for training there are objects that are not part of the static scene. At training time you predict the static part of the scene, and you predict color and density for the transient objects. It looks something like this: your inputs are the viewpoint and an appearance embedding that you learn, and based on this you can render the static part of the scene; then you have the transient embedding, together with the viewpoint, which allows you to predict colors for transient objects per image — not for the full scene. Then you combine the static rendering and the transient rendering, you get a reconstruction of the image, and you can compare it against the training image. Obviously you get something blurry for the transient objects, because you most likely observe them in only one image, but you are also predicting an uncertainty that helps the training process say: I don't believe these regions are really worth optimizing, so I focus on the rest.

What's cool, once you have learned these appearance vectors — here you see the depth that the network predicts, and the color — is that you can take a novel viewpoint and render the scene from different viewpoints with different appearance vectors, and that allows you to essentially switch illumination conditions. This is the scene rendered from the same viewpoint, or from a small trajectory, while varying the appearance vector, so you get different appearances for the scene. So based on having training data captured under different conditions, you can relight the scene in an implicit way, by interpolating between those vectors.
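One way to picture this conditioning: every training image gets its own trainable embedding vectors, and the color and transient heads take them as extra inputs. A heavily simplified PyTorch-style sketch with made-up dimensions and module names — it only shows how the embeddings enter the model, not the full architecture.

```python
import torch
import torch.nn as nn

class ConditionedHeads(nn.Module):
    """Color/transient heads conditioned on per-image appearance and transient embeddings."""

    def __init__(self, n_images, feat_dim=256, appearance_dim=48, transient_dim=16):
        super().__init__()
        self.appearance = nn.Embedding(n_images, appearance_dim)  # one learned vector per training image
        self.transient = nn.Embedding(n_images, transient_dim)
        self.static_rgb = nn.Sequential(
            nn.Linear(feat_dim + 3 + appearance_dim, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid())
        # the transient branch also predicts its own density and an uncertainty per sample
        self.transient_head = nn.Sequential(
            nn.Linear(feat_dim + transient_dim, 128), nn.ReLU(), nn.Linear(128, 5))

    def forward(self, feat, view_dir, image_idx):
        a = self.appearance(image_idx)                  # (B, appearance_dim)
        t = self.transient(image_idx)                   # (B, transient_dim)
        rgb_static = self.static_rgb(torch.cat([feat, view_dir, a], dim=-1))
        rgb_t, sigma_t, beta = self.transient_head(
            torch.cat([feat, t], dim=-1)).split([3, 1, 1], dim=-1)
        return rgb_static, rgb_t, sigma_t, beta         # composited together during rendering

# At novel-view time you can render with any appearance vector, e.g. an interpolation
# between two images' embeddings, which is what produces the relighting effect above.
```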
So far we have mostly looked at small scenes. You can also scale the whole thing up by representing the scene not with a single NeRF but with multiple NeRFs, each trained on the images within a certain radius; if you want to predict a new image, you potentially get contributions from multiple NeRFs. This is a paper by Waymo released last year, which showed that they could take Google Street View imagery and train neural networks that represent downtown San Francisco, not just small scenes, and since they have data captured under different conditions, they can again relight the scene and simulate its appearance under different conditions. Any questions so far?

Now we actually come to your earlier question about the geometry of this: what happens if we want to get geometry out of neural radiance fields? One idea would be to say: I believe that a certain density value corresponds to my surface, so I extract the geometry as a level set of this density function. If you do this, you get something that looks like this: it gives you a rough idea of what you're looking at — some skulls, some cans in front of the background, some statue, some apples here — but it's noisy, and the reason is that defining geometry as a level set of some density value is not a well-defined way of getting geometry. Think of it this way: you have your ray and the volume density that you have estimated, and this here is essentially your surface, so you would like to have a peak: when the ray enters the surface, the volume density inside stays high, and then it drops once you leave. The network might learn this volume density representation, but it might also learn this other one and compensate for the lower density by adjusting the color. Think of it this way: here you have zero volume density, so it doesn't matter which color you have; here you have non-zero density, but you might find a color value such that you don't see any difference at all. So there is a priori no definition of what the right volume density is, and as part of this your density might end up spread out rather than peaked, the way you would like it to be when extracting geometry. In general, the issue is that this volume density is not really a good basis for 3D geometry or surfaces.

Luckily there is a relatively simple solution, which is to say: rather than predicting volume density, I actually estimate the signed distance function, and when doing volume rendering I have a way to translate from signed distance to volume density, so that I can still do volumetric rendering. This helps because for a signed distance function the geometry is well defined, essentially as the zero level set of the signed distance function. Once I've trained such a model, I can run something like a marching cubes algorithm, after I subdivide the scene into voxels.
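The key ingredient is the mapping from signed distance to density. Here is a hedged sketch of one such mapping in the spirit of VolSDF, where the density is a scaled Laplace CDF of the negative signed distance; the alpha and beta values are placeholders, and other methods (NeuS, for example) use different transforms. The geometry itself is then just the zero level set of the learned SDF, which is what marching cubes extracts.

```python
import numpy as np

def sdf_to_density(sdf, alpha=10.0, beta=0.1):
    """Map signed distances to volume densities (VolSDF-style): a scaled Laplace CDF
    of -sdf, so density is ~0 far outside the surface, rises sharply across it,
    and saturates at alpha inside."""
    s = -np.asarray(sdf, dtype=float)
    cdf = np.where(s <= 0, 0.5 * np.exp(s / beta), 1.0 - 0.5 * np.exp(-s / beta))
    return alpha * cdf

# far outside -> near zero, on the surface -> alpha / 2, well inside -> close to alpha
print(sdf_to_density(np.array([1.0, 0.1, 0.0, -0.1, -1.0])))
```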
If you do this, you get significantly better geometry when you try to extract it. Here you see that there are two cans stacked on top of each other in front of — this is a milk carton, and there is probably some bag of chips; you get a highly detailed 3D model of the skull here, you get details in the statue, you get the different fruits here. The corresponding method is called VolSDF, for volume rendering of neural implicit surfaces.

This works quite well on small toy datasets where you have many views of the scene. It starts getting murkier in more complex scenes, where "more complex" means that each part of the scene is observed by only a few cameras. Here, this is the same skull as before, only we tried to reconstruct it from three views rather than 64. This is an indoor scene, reconstructed from quite a lot of images, but those images have to cover a larger part of the scene, so each small part is covered by only a few cameras, and then it's hard to train the network properly. And here you see another room, trained from roughly 300 images. The issue is that the measurements you have for each part of the space are relatively sparse, and the problem with neural radiance fields, or with VolSDF, is that you're training a network that has a lot of capacity, meaning there are many ways to overfit to the scene. Here's an example where this network is trained from three views, and you can see that it's able to perfectly explain the individual views, but the interpolated results don't make any sense. If you add regularization, you can get something much more meaningful, but there are still a lot of artifacts in these scenes. So the issue, as I said, is that the general problem of reconstructing volumetric representations from a few images is rather under-constrained: there are many configurations that explain the input images but do not lead to valid scene geometry. One way to get more constraints is to not just look at images.

[Question about how that regularization works.] The best thing is to look at the paper, but the idea, if I remember it correctly, is to enforce some form of consistency from novel viewpoints: even though you don't have actual images there, you can still enforce some consistency on the feature representation learned by the NeRF. But I would need to look into the paper again; I haven't looked at it in a year or so. There is a lot of work on trying to regularize this training process in one way or another; I decided not to go into detail here, but if you're interested I can try to find pointers.

The thing you can actually do to get more observations is to not just look at the individual images, but also get some idea about the depth of the scene and the surface normals. You might not always observe these directly, but nowadays there are actually good neural networks that can predict depth and normals from a single image, and they are meaningful enough. These are monocular predictions, rather than measured depth maps: the measurements are noisy, and depth is defined only up to some arbitrary scale, but you get an idea of where scene geometry could potentially be and what the surfaces are. That is what we used in a paper we published last year. The idea is that we do the standard training of a neural radiance field, or of a neural volumetric SDF representation — we sample points along the rays, we evaluate them, we get a distribution of density and color along the ray, and we combine them into a single color based on volume rendering — but we are not only comparing against the image: we are also comparing the predicted depth against the depth we get from a neural network evaluated on a single image, as well as against the normals we get from the same single-image prediction. To some degree that helps regularize the scene, because you get more observations, more data, and a better idea of where the surface should be. Here you see the difference between taking the original VolSDF formulation and training only on images versus adding these additional monocular cues, to a degree where you actually get something meaningful from three views only — you get much cleaner geometry here, and here as well.
Here are some results showing what we reconstruct in the scene, using the same number of images, compared against a standard, classical multi-view stereo pipeline that uses no learning, against the VolSDF method I explained before, and against another approach that also trains a neural volumetric representation but additionally tries to reason about orthogonal surfaces, or planes. This simple idea of adding monocular priors works quite well. Here's another example where we are able to reconstruct the scene quite well from relatively few images; even compared to a learned multi-view stereo approach, which is this one, we're doing quite well. It turns out that depth and normals together give you the best results, but even just having access to depth or to normals by itself is quite useful: this is without any cues, this is what you get with just depth, this is what you get with depth and normals, and this is what the ground truth looks like. We are not using depth cameras here; we are using normal cameras, but we have a neural network that, for each image independently, predicts depth and normals per pixel.

What's a bit hidden here is that this depth is defined only up to some scaling factor, and if you take two images the depths might not be consistent. So as part of the training process we are actually learning consistent scaling factors for the views, and based on those we are able to compare the depth predicted by the network with the monocularly predicted depth values. And we are not taking those predictions as fixed — we are not saying this is the precise position of where the point should be. During the first part of training we give a higher weight to the depth loss, and as training progresses we decrease the importance of the depth consistency more and more, because we trust that the network is getting better and better at predicting depth, and at some point you don't want to force the network to fit overly noisy predictions.

Yes, those monocular predictors are learned — we didn't train those networks ourselves. I think the paper was released in 2021; they started training these things at a massive scale, and all of a sudden you actually got meaningful results. Before that, the single-view depth predictors I had seen looked good on a few examples but didn't work well in general, whereas this seems to work really nicely. They are trained on massive training datasets, and we assume they generalize in a reasonable way to the data we're using.
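To make the scale issue concrete, here is a hedged sketch of comparing rendered depth to a monocular prediction after solving for a scale and shift by least squares; the closed-form alignment is a standard trick, but the actual loss and weighting schedule used in the paper differ in detail.

```python
import numpy as np

def scale_shift_aligned_depth_loss(rendered, mono):
    """Compare rendered depth to a monocular prediction that is only defined up to
    scale and shift: solve min_{w,q} || w * rendered + q - mono ||^2 in closed form,
    then penalize the remaining residual."""
    A = np.stack([rendered, np.ones_like(rendered)], axis=1)   # (N, 2) design matrix
    (w, q), *_ = np.linalg.lstsq(A, mono, rcond=None)          # least-squares scale and shift
    return np.mean((w * rendered + q - mono) ** 2)

# toy check: a monocular depth that is a scaled/shifted version of the rendered one, plus noise
rendered = np.linspace(1.0, 5.0, 100)
mono = 0.5 * rendered - 0.2 + 0.01 * np.random.randn(100)
print(scale_shift_aligned_depth_loss(rendered, mono))          # small residual after alignment
```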
Now we're going to have a look at what's called Nerfstudio, which is a very nice project that combines different NeRF implementations with nice tools and a very nice viewer — this includes, for example, the instant-ngp formulation, and I think also Plenoxels. We're going to look at a nasty example, a very reflective surface which we want to model. The workflow is this: you start with images or a video; if you captured a video, you extract images from it; you run the images through structure from motion — for example COLMAP or RealityCapture — to get camera intrinsics and extrinsics; you convert them into some format that is useful for the software package you're using to estimate the NeRF; and then you start training.

So let's look at this. I'm using RealityCapture here, which is commercial software, simply because it's much faster than COLMAP, which would be the open-source alternative. I start RealityCapture, find the right folder, and it imports the images; here I essentially told RealityCapture that all images should have the same intrinsics. I start the alignment process, which will extract features, match features, and then estimate the camera poses.

[Question about whether a point cloud could be used as the representation.] Yes, I think that makes sense. There is a paper — I don't have the precise reference at hand — that uses a point-based representation of the scene: they train a network that takes point positions with features and then predicts color and density from them. So you start with a point cloud, you have some features attached to those points, you shoot rays, and for each point that you sample along the ray you find the nearest points, interpolate their features, and use that for rendering; and as part of the training process they try to densify the point cloud in order to improve quality. So that would be one example in exactly that direction.

You can see what RealityCapture estimated: this is the center of the pool, and there is basically nothing there; it has reconstructed something along these parts, which have relatively matte, non-glossy surfaces. We're going to export the alignment, which is relatively efficient, and then we process this in a way that Nerfstudio can handle, essentially by calling ns-process-data, telling it that this is RealityCapture data, this is the directory where all my images are, this is the file I just generated which contains the intrinsics and extrinsics, and I want all my output in this folder. This will take a bit, because it's reading in the images and the poses, and it's also creating downscaled versions of those images at multiple scales, which is actually the most time-consuming part of the process: here it copied the original images and is creating the downscaled versions.

What it will also generate is a file called transforms.json, which essentially stores the data in a way that Nerfstudio likes to read. There is a camera model — it says I would like to use OpenCV's camera model — the height and width of the image, the corresponding image name, focal length values, the principal point, radial distortion parameters, and then the transformation matrix that goes either from world to camera or from camera to world, I don't remember which. You essentially have this for each individual image, and you're actually allowed to have different intrinsic calibrations per image.
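For orientation, here is a hedged sketch of reading such a file. The field names below (camera_model, fl_x, cx, transform_matrix, frames, ...) follow the convention I believe Nerfstudio uses, but they are an assumption here and may differ between versions; the authoritative reference is the Nerfstudio documentation.

```python
import json
import numpy as np

# Assumed layout of transforms.json (per-frame intrinsics allowed); the key names are
# my recollection of the Nerfstudio convention, not guaranteed to match your version.
with open("transforms.json") as f:
    meta = json.load(f)

print(meta.get("camera_model"))                           # e.g. "OPENCV"
for frame in meta["frames"]:
    fx = frame.get("fl_x", meta.get("fl_x"))              # intrinsics may be global or per frame
    fy = frame.get("fl_y", meta.get("fl_y"))
    cx = frame.get("cx", meta.get("cx"))
    cy = frame.get("cy", meta.get("cy"))
    K = np.array([[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]])
    pose = np.array(frame["transform_matrix"])            # 4x4 camera pose matrix
    print(frame["file_path"], K[0, 0], pose[:3, 3])       # image name, focal length, camera center
```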
Once you have processed the data, you can start the training process: I want to train a model called nerfacto, with the data in the folder that I just generated, and I want to write all my output to some output directory. It will load the data and start training; it connects to some local port and creates a web page which you can open, where you can visualize the training process. Here you see a visualization of the trained radiance field, which gets updated as training progresses.

No, I think this is just a redirect — I don't think the training process is running anywhere other than my machine. I haven't looked at the code, but my understanding is that they are not sending data to some external servers; the site is there, but it redirects the data loading to your local process. I think it used to have a function to show the original images — it worked yesterday.

So here you can see the neural radiance field. You can see that it's able to capture view-dependent effects — this house, for example, becomes reflected in the surface — but we have currently only done about 5% of the training. I have a version of this that I trained for half an hour, and you can see that after training for half an hour you get something that is much more detailed. It's still blocky, because what happens is that as you move the camera it renders low-resolution images in order to keep rendering fast enough, and when you stop it starts rendering at higher and higher resolutions, up to 1024 pixels on the side. I'm not saying this is perfect, but given that it's a quite nasty scene, with something that doesn't have a properly defined geometry, it's doing quite well. You can also look at the predicted depth values, and you can see that it's able to model these structures that are not reflective, but it doesn't really have a clearly peaked surface on the reflective and refractive part; rather, it distributes the density somewhere in there, which is enough to give you realistic-looking renderings but not enough to get geometry.

In that sense you can compare it to the dense reconstruction you get from RealityCapture, which looks like this — here is the textured model. It does reasonably well on the parts of the scene that are not reflective, but the reflective parts are actually missing. With the neural radiance field you at least get something that renders realistic-looking images, but, as I mentioned, you don't get geometry there either — it's similar in that sense, there is no clear geometry there. You can also see that you're able to train different models.
Depending on which model you use, you might have fewer parameters that need to be evaluated, or a shallower network, which will make rendering faster. The Plenoxels approach makes rendering faster because you don't need to evaluate a neural network at all. In terms of shallow neural networks, you could also subdivide the scene into small parts and then have, for each part, a very shallow — a few layers — neural network that is efficient to evaluate. There is also a recent paper that combines traditional mesh rendering with a radiance field: the idea is that you first render the mesh — you render features distributed on the mesh — and then feed those features into a small network that gives you a view-dependent color, where the idea is that the network is able to account for the fact that your mesh is not perfect, so it offers some sort of correction term close to the surface. These are the types of things you can do to make rendering faster. You can also cache radiance field outputs to some degree: especially if you're rendering a trajectory, you keep evaluating 3D points that are close to each other, so you can cache some outputs of the radiance field so that you don't need to run everything completely from scratch.

[Question about the computational requirements — would you expect this to run on, say, an iPhone?] I would expect that at some point you can run this on an iPhone, given that all these companies are investing in chips that can do tensor operations very efficiently. My hope would be that, with all this specialized hardware, you might be able to run this on mobile devices at some point.

[Question about DreamFusion.] Yes — for those who might not have heard about it, there is this thing called DreamFusion, which essentially takes a text prompt as input and then tries to create a 3D model; I think it's a NeurIPS 2022 paper or something like that — I think it was accepted at some point. So this is part of the text input, and the corresponding 3D model is shown here. The way this works — I read the paper a while ago — is the following. What they don't have is training data that links text and 3D models; that is typically something that is hard to get. What does exist is a lot of data that links text and images — for example, Stable Diffusion and all these other generative models are trained on that type of data. So they start with something like Stable Diffusion, which for a given prompt gives them an image, and then they try to translate this into a 3D model. The way they do this, essentially, is that rather than learning an explicit 3D representation, they learn a translation into a radiance field that they can then render, and they have some regularization terms: if you look at all the models they show, they are actually quite symmetric, so I think they have some prior that says I would like to create models that are reasonably symmetric, as a way to stabilize the training process. What else they do to make this train well I don't remember — I haven't read the paper in half a year or so — but that's the gist of it: you essentially try to go from images to neural radiance field parameters, and then you can render those neural radiance fields.
So I hope I convinced you that implicit neural scene representations such as NeRFs are interesting, in the sense that they can handle a lot of complex effects, produce accurate 3D models — at least under certain conditions — and can potentially give you compact scene representations that are learned from images. They are currently a very hot topic, but the original NeRF paper was published less than two years ago, or about two years ago, so they are still very much in their infancy. There are a lot of things that still need work: scalability, reducing training times — we are far away from 3D reconstruction algorithms that, like the classical algorithms, can run in real time on mobile devices. But we're getting there, and the nice thing is that we're getting more and more software: I showed you NVIDIA's instant-ngp, I showed you Nerfstudio, and there is also something called SDFStudio, which does the same thing as Nerfstudio, just with signed-distance-function-based representations.

One issue with the whole topic is that there is a very large body of literature. A couple of years ago GANs used to be a very, very popular topic, with a couple of GAN papers published on arXiv each day; we're at about the same point with NeRFs, so it's very hard to keep up with the literature. There are some good starting points, though: there are blog posts by Frank Dellaert — I think three of them — and there are a couple of survey papers that are reasonably good if you want to get into the topic. But yes, the downside of looking into a very hot topic is that keeping up with the literature is close to impossible.

Okay, that brings me to the end of it. Thanks for listening; if you have any more questions, either ask them now or bump into me later and ask. Thank you. [Applause]