Transcript for:
Lecture Notes on RDKit

In this video I introduce RDKit, a Python library that's very useful when you deal with chemical data. RDKit is not installed by default in Google Colab, so you have to install it using these lines of code. The way to get started is to look at the link in the video description, which takes you to this notebook here. It gives you the installation instructions and tells you what you need to import for a lot of things.

So the idea is: you click on the link in the video description, which should take you to this page, and then you need to make a copy of it so that you have it in your own notebook. I'm going to retitle this "rdkit intro" and then I'm going to run it.

So, Run all, and as it says here in the comments, installing RDKit is going to take two or three minutes, so you have to be a little bit patient. I'm going to speed it up so that it takes considerably less time here in the video, and we're done. So we can scroll down here and get started.
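As a rough sketch of what the setup cell amounts to: the notebook's actual install lines are not reproduced in this transcript, so the `pip` command in the comment below is an assumption about a typical Colab setup, and the imports are the ones commonly needed for the steps that follow.

```python
# In a Colab cell, installation is usually a shell command prefixed with "!", e.g.:
#   !pip install rdkit
# (assumption: the notebook's own install lines are not shown in the transcript)

import rdkit
from rdkit import Chem                      # core molecule handling
from rdkit.Chem import Draw, Descriptors    # drawing and simple descriptors

# Printing the version is a quick check that the install worked.
print(rdkit.__version__)
```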

So, RDKit can do a lot of things. I'll just scratch the surface in this video, but I can really recommend this page called "Getting Started with RDKit in Python". So just google that and have a look.

Okay, so one thing RDKit can do is convert SMILES strings, which I'll talk about in a later video, into a molecule or molecular representation. So one thing you'll do very often is convert SMILES strings to a molecule. SMILES strings are just a text-based way of representing molecules.

So for example, this is methane, this is ethane, this is propane, and this is normal butane. And you can see this in a notebook here simply by calling, oh, I've misspelled that, MolFromSmiles, and so you can see that you can get a picture of this quite easily. In general you get SMILES either from the data set or, if you want to input your own SMILES, google cheminfo.org, where you can draw molecules and you'll get the SMILES string down here right away. Right, so as I update this, the SMILES string gets updated.
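A minimal sketch of the SMILES-to-molecule step described above. The SMILES strings for the four alkanes are standard; the variable names (`alkanes`, `mols`) are just illustrative and not taken from the video.

```python
from rdkit import Chem

# SMILES for the four alkanes mentioned in the video
alkanes = {
    "methane": "C",
    "ethane": "CC",
    "propane": "CCC",
    "n-butane": "CCCC",
}

# Convert each SMILES string into an RDKit molecule object
mols = {name: Chem.MolFromSmiles(smi) for name, smi in alkanes.items()}

for name, mol in mols.items():
    # GetNumAtoms() counts heavy atoms only; hydrogens are implicit
    print(name, mol.GetNumAtoms())

mol = mols["n-butane"]
# In a notebook, ending a cell with the molecule object typically renders a
# picture (this may require: from rdkit.Chem.Draw import IPythonConsole).
mol
```

Note that `Chem.MolFromSmiles` returns `None` for a string it cannot parse, so it is worth checking for that when reading SMILES from a data set.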

Okay, if I have a molecule, I can also convert it back. So if I type smiles = Chem.MolToSmiles, I can give it my mol object, my molecule object here, and you'll see the SMILES string here again. So if you're ever wondering what molecule you're dealing with, that's another way to check it. And if you have the molecule, you can now compute various interesting things.
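The round trip back to a SMILES string might look like this. It is only a sketch; the second example is an addition of mine to show that RDKit writes its own canonical form, which need not be the same text you typed in.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCCC")   # n-butane
smiles = Chem.MolToSmiles(mol)     # convert the molecule back to SMILES
print(smiles)

# The output is RDKit's canonical SMILES, so a differently written input
# for the same molecule (here ethanol) comes back in a standard form.
ethanol = Chem.MolToSmiles(Chem.MolFromSmiles("OCC"))
print(ethanol)
```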

For example, the molecular weight. Right, so you can do that like so. That's the molecular weight of normal butane. Another thing you can do is find substructures. And so to demonstrate that, I'm going to generate a SMILES list.
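For the molecular weight step mentioned above, the transcript does not say which function is called. One common way, assumed here, is `Descriptors.MolWt`:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CCCC")   # n-butane, C4H10

# Average molecular weight in g/mol (assumption: the video may have used
# a different but equivalent call)
mw = Descriptors.MolWt(mol)
print(round(mw, 2))
```

For C4H10 this should come out at roughly 58.1 g/mol.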

So a list of SMILES here, and I'm going to use some amino acids. So for example, if I google glycine and go to the wiki page, I can come down here. Where is it? SMILES here, show it, right, and I can copy this and put it in here. So that's my glycine. I have to put it in as a string, and I'll now do the same for phenylalanine, histidine, and cysteine. So through the magic of editing this will be done just like this. Okay, so here is my SMILES list.

So if I want to display these, I can now make a list for my molecules and, for example, type "for smiles in smiles_list". Then I can convert it from SMILES to mol like so, and then append it. And then if I now want to show all of them at once, I can use this handy function, like so.

Oh, I have to show the image. Okay, and if I want everything on one row, I can do it like this. So here you have your glycine, your phenylalanine, histidine, and cysteine. So for example, one thing I can do is do a substructure search.
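A sketch of the list-building and display steps. The SMILES strings below are simplified (no stereochemistry) and written by me, so they will not match the Wikipedia strings copied in the video character for character. `Draw.MolsToGridImage` is assumed to be the "handy function" referred to.

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Assumed, simplified SMILES for the four amino acids (no stereochemistry)
smiles_list = [
    "NCC(=O)O",               # glycine
    "NC(Cc1ccccc1)C(=O)O",    # phenylalanine
    "NC(Cc1cnc[nH]1)C(=O)O",  # histidine
    "NC(CS)C(=O)O",           # cysteine
]

# Convert each SMILES string to a molecule and collect them in a list
mols = []
for smiles in smiles_list:
    mol = Chem.MolFromSmiles(smiles)
    mols.append(mol)

# Draw all molecules in one image; molsPerRow=4 puts everything on one row
img = Draw.MolsToGridImage(mols, molsPerRow=4)
img  # as the last expression in a notebook cell, this displays the image
```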

So I can search for a pattern, which is often very useful. So for example, I can ask which one of these has a sulfur in it. So here I'm just using the SMILES string for sulfur.

And then I can go through my molecules in the list and print mol.HasSubstructMatch, like so, with this pattern. All right, so we can see that in this case only the last molecule, molecule number four here, has a sulfur in it. So that's True. I can also see, for example, which ones have a carboxyl group in them. And I'll leave a link below to a blog post that tells you how to get more familiar with these SMILES strings.

But let's see which ones have a carboxyl group. Of course all of them do.
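The sulfur and carboxyl searches can be sketched as follows. The amino acid SMILES are my simplified versions, and `C(=O)O` is an assumed way of writing the carboxyl pattern; the video's exact pattern string is not in the transcript.

```python
from rdkit import Chem

# Assumed, simplified SMILES: glycine, phenylalanine, histidine, cysteine
smiles_list = [
    "NCC(=O)O",
    "NC(Cc1ccccc1)C(=O)O",
    "NC(Cc1cnc[nH]1)C(=O)O",
    "NC(CS)C(=O)O",
]
mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]

# Patterns defined with plain SMILES
sulfur = Chem.MolFromSmiles("S")
carboxyl = Chem.MolFromSmiles("C(=O)O")

# HasSubstructMatch returns True if the pattern occurs in the molecule
has_sulfur = [mol.HasSubstructMatch(sulfur) for mol in mols]
has_carboxyl = [mol.HasSubstructMatch(carboxyl) for mol in mols]

print(has_sulfur)
print(has_carboxyl)
```

Only cysteine, the last entry, should report a sulfur, while all four should report a carboxyl group.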

Let's see who has this substructure. Okay. So all of them except this one, right?

So basically what you're searching for is a nitrogen here that's attached to a carbon and another carbon, right? So if you're in doubt what this looks like, right, you can just go up here, right? So it's searching for a substructure like this, and glycine doesn't have it here, right, because this carbon here only has one other bond.
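The exact pattern typed in the video is not captured in the transcript. As a purely hypothetical stand-in, the pattern `NC(C)C` (a carbon bonded to a nitrogen and to two other carbons) behaves the same way: it matches every amino acid in the list except glycine, whose central carbon has only one carbon neighbor. The amino acid SMILES are again my simplified versions.

```python
from rdkit import Chem

# Assumed, simplified SMILES: glycine, phenylalanine, histidine, cysteine
smiles_list = [
    "NCC(=O)O",
    "NC(Cc1ccccc1)C(=O)O",
    "NC(Cc1cnc[nH]1)C(=O)O",
    "NC(CS)C(=O)O",
]
mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]

# Hypothetical pattern (not necessarily the one used in the video):
# a carbon attached to a nitrogen and to two further carbons
pattern = Chem.MolFromSmiles("NC(C)C")

matches = [mol.HasSubstructMatch(pattern) for mol in mols]
print(matches)

# Looking at the pattern as a molecule helps if you are unsure what it means;
# in a notebook, a cell ending in `pattern` would draw it.
print(Chem.MolToSmiles(pattern))
```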

You can also search for more general things, so you don't necessarily have to define your pattern using SMILES. You can also use something called SMARTS. Oops, so let me copy that. And so SMARTS is a generalization of SMILES. So for example, one thing that's a little difficult to do with SMILES is to ask, well, how many of these molecules have a ring?

Right? So that's quite difficult here because these two rings are actually different. So you can't just search for a benzene ring or an imidazole ring. But with SMARTS, you can simply use this notation here, which tells you whether or not the molecule includes a ring, right?

And so only the two middle molecules include a ring. And you can get even fancier and say, for example, which of these molecules contains a five-membered ring, okay? And so that's only the histidine here.
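The ring searches can be sketched with SMARTS like this. The transcript does not show the notation used on screen; `[R]` (any ring atom) and `[r5]` (an atom whose smallest ring has five members) are standard SMARTS primitives that I am assuming correspond to it. The amino acid SMILES are my simplified versions.

```python
from rdkit import Chem

# Assumed, simplified SMILES: glycine, phenylalanine, histidine, cysteine
smiles_list = [
    "NCC(=O)O",
    "NC(Cc1ccccc1)C(=O)O",
    "NC(Cc1cnc[nH]1)C(=O)O",
    "NC(CS)C(=O)O",
]
mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]

# SMARTS patterns are parsed with MolFromSmarts rather than MolFromSmiles
any_ring = Chem.MolFromSmarts("[R]")     # any atom that is in a ring
five_ring = Chem.MolFromSmarts("[r5]")   # atom in a five-membered ring

has_ring = [mol.HasSubstructMatch(any_ring) for mol in mols]
has_five_ring = [mol.HasSubstructMatch(five_ring) for mol in mols]

print(has_ring)
print(has_five_ring)
```

With this list, the two middle molecules (phenylalanine and histidine) should report a ring, and only histidine should report a five-membered ring.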