Transcript for:
FastQC Report Interpretation for RNA Data

[Music] Assalamu alaikum, welcome to my YouTube channel, Science for Everyone. Today's video is related to RNA sequencing data analysis, and the title is Understanding the FastQC Report. It will be an end-to-end explanation. As you all know, I have started a video series on RNA sequencing data analysis, and this is part 11 of that series. Today's video is a bit theoretical. We will cover the modules of the FastQC report: basic statistics, per base sequence quality, per tile sequence quality, per sequence quality scores, and the rest. So stay tuned till the end, because this is very important: once you understand the FastQC report, you can easily pre-process your data using different pre-processing tools. If you understand it well, you can pre-process your data well.

In the previous video, part 10 of this series, I ran FastQC on our files and showed you practically how to do it. If you haven't watched that video, I recommend watching it, because you should know how to run FastQC on a Linux operating system. Once you are familiar with it, first run FastQC on your own sequencing files, and then interpret the result. Today is about the interpretation.

Before proceeding to the explanation, I want to tell you that in part four and part five of this series I briefly discussed the structure and quality of sequencing files, and I also told you the basic reasons that compromise the quality of our files. So I recommend watching part four and part five first, which will give you a brief background on the quality of sequencing files, then part 10, and then proceeding with part 11. The only
reason is that you will stay in the flow and will not miss a single point. So I recommend watching those videos and then starting this one.

Now we will start with the FastQC report. FastQC uses three colors: green means passed, yellow means warning, and red means failed. When a module is green, it is okay. When it is a warning, it is also okay: you can pre-process your data to clear the warning and make it pass, but if you are not worried about it, you can proceed with your data. If a module has failed, then you have to pre-process your data according to your understanding of the problem. So the main purpose of this video is to make you understand what these FastQC reports tell us. When you interpret them well, you can easily pre-process your data, remove the failed or warning regions, and make them green. This is very important.

The first module is basic statistics, and it is always green. It shows the details of your file. For example, my file name is SRR 721 7471 476, it is the first read of the paired-end reads, and it is a fastq.
gz file, so this is my file name. My file type is conventional base calls, so it is a sequencing file, a FASTQ file. The encoding is Sanger / Illumina 1.9, and the total sequences are 48 lakh 34,426 (4,834,426), so that is the total number of reads. The total bases are 57.9 million bases, and the bases are adenine, guanine, cytosine, and thymine. After that comes sequences flagged as poor quality, and my file does not have any. When you are dealing with raw files, raw data, they may contain some poor-quality sequences, but if the data has already been pre-processed by the company, there will be none.

Now the most important thing is the sequence length. Here my shortest read is 60 and my longest is 150. That is because I pre-processed this data, and this is the FastQC report after that pre-processing. The GC content is 49%. It should be about 50%, so 49% is okay. So these are all the statistics of my FastQC report, and this module is always green.

Now we will proceed with the per base sequence quality. Before that, a very small concept that I already covered in the previous videos, but here is a short reminder. A FASTQ file consists of sequencing reads, and each read takes four lines: the first line is a header, the second line is the sequence, the read itself, the third line is a spacer, and the fourth line holds the quality of each base. For example, the quality of this first C is given by this first symbol. The symbols follow the ASCII rules. So when you see this symbol, this
means that the base has a particular probability of error. I have already discussed this in the previous video; this is just a reminder, because we will be dealing with these results.

So the first plot is the per base sequence quality, and how to interpret it is very important. The plot has three regions: the green region, the warning region, and the error region. If our bases fall in the red region, the module gives an error. These are box plots. On the x-axis is the position in the read in base pairs, and on the y-axis is the quality, the Phred score. It starts from zero and goes up into the 30s, and a higher score means higher quality.

I will explain a single box. Say we are looking at the quality of the first base. These are my sequence reads, and this is the quality of each read. It is an arbitrary example, but I want to show you how FastQC reads it. The first box plot revolves around the first base of every read. Each read here has 32 bases, if we count them, so the plot gives us the quality at each of the 32 positions. These are my first bases, and I want to check their quality, so it is shown as a box plot: this is the box plot for the first base, and it represents the quality of that position across all the reads. I think you are all familiar with box plots; a box plot shows you a range of values. For example, this box plot shows a range of quality for the first position, and this yellow
box shows the quality range that a specific share of the reads lies in, and this is the whisker. We can say that about 50% of our data is in this region. For example, the first bases of the first four reads have quality in this range, say 32 and 33, and for the last one the range lies between 29 and 32. So overall the quality of my first base is good: it lies between 29 and 33, which is in the green region. You can interpret the other positions in the same way. This red line is the median, and the blue line I will explain in a moment.

That is how FastQC checks our reads and reports on each and every position collectively: the quality of the first base in every read is represented as one box plot. I think you get the point now. The same is done for each position. Just look at position 32, or position 29, for example. At position 29 the quality is very low: in most reads this base has a low score, so the box falls almost in the error region. That is how it is calculated.

The central red line is the median value, and the yellow box represents the interquartile range, the 25th to 75th percentile. That is what I meant when I said about 50% of your data is here: 50% of the reads have their first base in this good-quality range. The upper and lower whiskers represent
the 10th and 90th percentiles. So these are the whiskers, an upper whisker and a lower whisker. But here, at this position, almost all the quality values sit on this red line, so no whisker is drawn. If there were a read with quality 34 at this position, there would be a whisker. But here almost all the quality values are around 33, so that is why it does not have a whisker. Over here the whisker is a little longer, so we can say that some reads have quality between 29 and 32. So that is when there is a whisker and when there is not: if the whisker is missing, we can say that almost all the reads sit around the median. And the blue line represents the mean quality.

I hope you get my point. Once you know how to interpret this plot, you can see here that the bases from position 28 to 32 have compromised quality, and then you can deal with it. I will show you how to deal with it in a later video, part 12 of this series. Here I just want you to understand the results.

Now we will proceed with the per sequence quality scores. It is very easy. This module checks our sequences as a whole. As I showed you in the statistics, we had 48 lakh 34,426 sequences. On this axis is the quality, the Phred score, and on this one is the number of sequences, which runs from 5 lakh up to 25 lakh. So almost 25 lakh or more of our sequencing reads have an average quality between 31 and 35, where that is the quality of the entire read. So it means
that the sequences in our FASTQ file have good quality overall. It should look like this. So this is the per sequence quality score, and the distribution of average read quality should be tight, in the upper range of the plot.

What about this one? This is also a per sequence quality score plot, from another report, but now some sequences have an overall quality between 10 and 20, which is not good. About 15,000 of the reads have a quality between 11 and 20, so this is an issue and you have to deal with it. And about 60,000 of the reads have a quality between 24 and 33, which is also good, but the perfect case was the previous one. So you have to deal with this type of thing.

The third one is the per base sequence content. Before proceeding, I want to tell you that across our reads, the bases adenine, guanine, cytosine, and thymine should all appear in roughly equal numbers. If thymine appears four times, adenine should appear four times; if cytosine appears four times, guanine should also appear four times. So what if the bases are not in equal proportion? Then there is an issue. This report goes through positions 1 to n of all the sequencing reads, checks guanine, thymine, adenine, and cytosine at each position, and reports whether their numbers are equal or not. Just look here: at the first position, adenine, guanine, cytosine, and thymine are not equal. Thymine is the red line, and cytosine, adenine, and guanine are the other lines. Up to the 9th, or we can say up to the 12th base, the proportions are not
equal, and I will tell you the reason behind it. I discussed it in part four and part five of this series, but I will also tell you here. So here the proportions of the bases are not equal, although after about the 15th base of the read all the bases are equal. So this is the per base sequence content.

Why are positions 1 to 9 not equal? In part two and part three of this series I explained the sequencing technology and adapters. When we sequence our reads, there are adapters at the beginning and at the end of them. Adapters are human-made sequences, meant to recognize our read and to start the sequencing. Because they are human-made, they do not follow the equal proportion of bases; they are only meant to identify the sequence. So an adapter may consist of plenty of adenine and no cytosine or guanine. When adapters are sequenced and show up in your FASTQ file, you get this type of result. What do you do then? You have to remove the adapters and other such contamination, and after that you will see that this pattern no longer appears. I will show you how to pre-process it, but just keep the concept in mind.

Now the per sequence GC content. I have told you that GC content should also be about 50%, because if AT content is 50%, GC content should also be 50%. This module shows you the percentage of your GC content. It consists of a theoretical distribution and the GC content per read, shown like this. If there is an issue, this pink line will deviate from this dark blue line. So 50% is
important, or 40%. As I showed you at the start of this video in the statistics of my file, it was 49%, and that is good.

Now the per base N content, which reports the percentage of N at each position. Before explaining it: in the previous videos I showed you that a sequence can contain an ambiguous call. This N is an ambiguous call, and I told you in the previous videos why these are generated. The sequencer does not know what to report, whether the base is adenine, thymine, cytosine, or guanine. So to mark this ambiguity it gives the base as N: I have sequenced it, but I was not sure what it was. These are the ambiguous calls, and it is very important to remove them from our sequencing reads. If your reads have no ambiguous calls, this line will look just like this: from position 1 to 150 there is no N content. When N content is more than 5% it is a warning, and when it is more than 20% it is an error. Keep this in mind.

Now the sequence length distribution, which is also very important. At the start of this video I showed you that my sequence length was between 60 and 150. The length of the sequencing reads also matters: the longer the better, so it is better for it to be near 150. On this axis is the number of sequencing reads, and most of the reads have a length between 150 and 151. There are reads with lengths around 58, 59, 64, 65, but they are fewer in number. So this is not a
perfect result; it is a warning, but it is fine. It depends on you. When you perform pre-processing, you can tell your fastp tool how much length you require. You can set a threshold, for example that you want only reads with a length between 100 and 150, or between 140 and 150. You have to decide how much length is required, but the usual range is between 70 and 150, so I recommend choosing 70 to 150.

Now the sequence duplication levels. This is very important. There is an acceptable threshold of duplication, but if an entire sequence is repeated again and again, then it is an issue. Duplication means that one specific read, the entire read, is sequenced again and again, and the reason is the polymerase chain reaction. Before proceeding with next-generation sequencing, you increase the number of your fragments with PCR. PCR amplifies the sequences, and there is no hard and fast rule that it amplifies every read exactly once or twice. It can amplify one read 100 times and a second read 10 times; it depends on the PCR. So when you perform PCR and end up with specific reads that have been amplified about a thousand times, they will all be sequenced in the sequencer, and then the report will show you an issue. Then you have to remove the duplication, because when you do alignment and gene counts, a gene will appear highly expressed although its reads were only duplicated. So you have to remove the duplicates. Just look here: more than 10,000 of the
specific reads are duplicated, so you have to remove the duplication. The plot title gives the percent of sequences remaining if deduplicated, and there is also a threshold range for duplication. Here the file does not meet the threshold, so the module shows an error. In pre-processing there is a command that removes the duplication entirely when we use it. But you have to understand what duplication is: duplication is the repetition of an entire read. For example, if you have read one and it is duplicated 10,000 times, that is an issue, and you have to resolve it.

Now the overrepresented sequences. Most people get confused between duplication and overrepresented sequences, so here I will clear it up. Overrepresented sequences are short sequences that are present within the reads and that occur again and again. For example, if you have 10 reads and one specific region is repeated in all 10 of them, that is overrepresentation. The software detects it because a specific sequence should not normally be repeated again and again, so it points to an issue. Take adapters, for example. Adapters are specific sequences attached to our sequence reads, and they get sequenced along with them. Adapters are present with every read and they are man-made, so the same adapter is sequenced again and again. When FastQC scans the reads, it comes upon a specific portion that is overrepresented, present in each and every read, and we can say that it should not be there. Just look here at the first row: this is a sequence that is present in 66 lakh 35,000 counts. So this portion is overrepresented,
so we have to remove it, because there is no hard and fast rule that every gene should contain one specific portion in common. For example, if you have 10 genes, there is no specific portion repeated in every one of them, because gene sequences differ from each other. So this is overrepresentation, and it should be removed. It is due to adapters or other contamination.

So, duplication versus overrepresentation: sequence duplication is the repeated occurrence of entire reads, and overrepresentation is the repeated occurrence of a specific sequence or motif, a smaller portion of the read. Here the entire read is duplicated, and here a portion of sequence inside the read is repeated. If it is repeated in each and every read, it is overrepresentation. I think now you can differentiate between duplication and overrepresentation. You can read this slide.

After that comes adapter content, the last part of the FastQC report. I have discussed in great detail what adapters are, so watch part three of this series; I will share the link in the description so you can explore the concept. Adapters are man-made, just to recognize our sequences, so they should be removed. An adapter is not an actual read and does not code for anything, so when it is present in our sequences the report should denote an issue. In my FastQC report there are no adapters, so the module is green. If adapters are present in more than 5% of the reads it gives a warning, and if in more than 10% it gives an error. So this
was about adapters. I think I have tried my best to explain each and everything. To understand these results, I recommend watching the previous videos and then this one. Then I am pretty sure you will be comfortable with the quality of your files. In the next part of this series I will show you how to pre-process your data if, for example, you have this type of condition, where the bases in the range of 28 to 32 have lower quality: how to deal with it and how to clean your data. I will briefly show you the commands for pre-processing. Then just play around with your data: download it, as I have shown you, pre-process it, and clean up your reads.

So this was all about today's video. If you have any questions, do let me know and I will answer in the comment box. You can also email me your queries, and if you have any issue with your final year projects related to bioinformatics, do contact me and I will help you out. Thank you very much for today, see you in the next part of this video series. God bless you.
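
As a small companion to the explanation above, here is a minimal Python sketch of how the quality line of a FASTQ read can be turned into Phred scores and summarized per position, which is the idea behind the per base sequence quality plot (red line for the median, blue line for the mean). It assumes the Sanger / Illumina 1.9 encoding, that is, an ASCII offset of 33. The function names and the toy quality lines are hypothetical, made up for illustration only, and this is not FastQC's own code. The N content thresholds are the ones quoted in the video.

```python
from statistics import mean, median

# Assumption: Sanger / Illumina 1.9 encoding, where score = ASCII code - 33.
PHRED_OFFSET = 33


def phred_scores(quality_line, offset=PHRED_OFFSET):
    """Decode the fourth line of a FASTQ record into one Phred score per base."""
    return [ord(symbol) - offset for symbol in quality_line]


def error_probability(score):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


def per_position_scores(quality_lines, offset=PHRED_OFFSET):
    """Collect scores column by column: entry i holds base i of every read that long."""
    columns = []
    for line in quality_lines:
        for position, score in enumerate(phred_scores(line, offset)):
            if position == len(columns):
                columns.append([])  # first read that reaches this position
            columns[position].append(score)
    return columns


def position_summary(quality_lines, offset=PHRED_OFFSET):
    """Return (median, mean) per position, like the red and blue lines in the plot."""
    columns = per_position_scores(quality_lines, offset)
    return [(median(column), mean(column)) for column in columns]


def n_content_flag(percent_n):
    """Thresholds quoted in the video: above 5% is a warning, above 20% is a failure."""
    if percent_n > 20:
        return "fail"
    if percent_n > 5:
        return "warn"
    return "pass"


if __name__ == "__main__":
    # Hypothetical quality lines for three reads of four bases each.
    toy_quality_lines = ["IIII", "IIIF", "II5#"]
    for position, (med, avg) in enumerate(position_summary(toy_quality_lines), start=1):
        print(f"position {position}: median={med} mean={avg:.1f}")
    print("Q20 error probability:", error_probability(20))
    print("N content 6% ->", n_content_flag(6))
```

Grouping the scores by position instead of by read is the design choice that matters here: each box in the plot describes one position across all reads, not one read across all its positions.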