Transcript for:
Understanding RPKM, FPKM, and TPM in RNA-seq

StatQuest! StatQuest! StatQuest! StatQuest! Hello, and welcome to StatQuest.

StatQuest is brought to you by the friendly folks in the Genetics Department at the University of North Carolina at Chapel Hill. Today we're going to be talking about RPKM vs. FPKM vs. TPM. In previous StatQuests, we've talked about topics that are broadly applicable to a variety of fields. These subjects, however, really only relate to high-throughput sequencing of RNA.

So if that's what you're interested in, quest on! There's a new RNA-seq metric on the block. We used to report RPKM, or Reads Per Kilobase Million, or FPKM, Fragments Per Kilobase Million.

These metrics normalize read counts for sequencing depth; that's the million part. That's because sequencing runs with more depth will have more reads mapping to each gene than sequencing runs with less depth, and we don't want that to bias our analysis. The other thing that these metrics normalize for is the length of the genes.

That's the kilobase part. Longer genes will naturally have more reads mapping to them than shorter genes, and it's important to remove that bias from the analysis as well.

However, nowadays they want us to use TPM, which stands for transcripts per million. So today we're going to talk about these three things and how they're related and how they're different and how I think that using TPM is actually a really good idea.

Let's start with an example. To understand the differences between TPM and RPKM and FPKM, we'll work through the math using an imaginary RNA-seq dataset with three replicates, Rep1, Rep2, and Rep3, for a genome with only four genes, A, B, C, and D. On the left, we see the names of our genes and the lengths of each gene.

To the right, we see the read counts for each replicate. We see that replicate 3 has way more reads than the other replicates, regardless of the gene. This means it had higher sequencing depth than the other replicates. We're going to normalize for that. We also see that gene B is twice as long as gene A, and that might explain why it always gets twice as many reads, regardless of the replicate.

We're going to normalize for this, too. First, we're going to normalize the data using the familiar RPKM metric. With RPKM, the first step is to normalize for read depth. Here we've calculated the total number of reads in each replicate. For the purpose of this four-gene example, we're going to scale the total read counts by 10 instead of 1 million.

This will make the numbers easier to read, but in future slides consider tens of reads or millions of reads to be interchangeable just for this example. Originally, 1 million was picked just to make the numbers look nice so they wouldn't require too many decimal places. Thus, these are our per million scaling factors for each replicate. And by scaling, I mean we're just going to divide the read counts for each gene by the appropriate scaling factor for that replicate.

And using those per million scaling factors, we can calculate the reads per million for each replicate. The second step for normalizing by RPKM is to normalize for gene length. Here we've got the gene lengths on the left side.

All we have to do now is scale per kilobase. After dividing by the length of the genes, the reads are scaled for depth, M, and for gene length, K. So now we have RPKM.
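The two RPKM steps above can be sketched in a few lines of Python. The transcript doesn't list the numbers in the table, so the gene lengths and read counts below are assumptions, picked to fit what's described (gene B twice as long as gene A, replicate 3 with far more reads, and gene D only picking up a read in replicate 3). The names `GENE_LENGTHS_KB`, `COUNTS`, `SCALE`, and `rpkm` are made up for this illustration.

```python
# Hypothetical toy data: assumed values, not taken from the transcript.
GENE_LENGTHS_KB = {"A": 2, "B": 4, "C": 1, "D": 10}
COUNTS = {
    "rep1": {"A": 10, "B": 20, "C": 5, "D": 0},
    "rep2": {"A": 12, "B": 25, "C": 8, "D": 0},
    "rep3": {"A": 30, "B": 60, "C": 15, "D": 1},
}
SCALE = 10  # stands in for 1,000,000 in this four-gene example


def rpkm(counts, lengths_kb, scale=SCALE):
    """Normalize one replicate: sequencing depth first, then gene length."""
    # Step 1: the "per million" scaling factor is total reads / scale.
    depth_factor = sum(counts.values()) / scale
    reads_per_scale = {gene: n / depth_factor for gene, n in counts.items()}
    # Step 2: divide by gene length in kilobases.
    return {gene: rpm / lengths_kb[gene] for gene, rpm in reads_per_scale.items()}


rpkm_table = {rep: rpkm(counts, GENE_LENGTHS_KB) for rep, counts in COUNTS.items()}
for rep, values in rpkm_table.items():
    print(rep, {gene: round(v, 3) for gene, v in values.items()})
```

With real data you would set `SCALE` to 1,000,000; the 10 here only keeps the toy numbers readable.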

To summarize RPKM, we have our before data, which isn't normalized for depth or gene length, but then we normalized for differences in sequencing depth and differences in gene size. And so here's our after data, our RPKM values for each replicate and each gene. Now, RPKM and FPKM are two very closely related terms.

However, there's a lot of confusion related to them, so let's clear that up right now. First, RPKM, as you know, stands for Reads Per Kilobase Million. FPKM stands for Fragments Per Kilobase Million. The only difference is that RPKM is for single-end RNA-seq and FPKM is for paired-end RNA-seq. To illustrate why it's necessary to make this distinction, consider a fragment that is to be sequenced.

With single-end sequencing, there is only one read sequenced per fragment. That read is either on one end of the fragment or the other end of the fragment. In contrast, with paired-end sequencing, both ends can map, giving you two reads per fragment, or sometimes only one end of the pair has a quality read and maps. In this case, you only get one read mapping to a fragment.
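Here's a minimal sketch of the difference between counting reads and counting fragments for paired-end data. The `(fragment_id, mate)` records and their IDs are hypothetical, made up purely for illustration; a real pipeline would take them from an alignment file.

```python
# Hypothetical mapped reads, each tagged with the fragment it came from
# and which end (mate 1 or mate 2) it represents.
mapped_reads = [
    ("frag1", 1), ("frag1", 2),  # both ends of the fragment mapped
    ("frag2", 1),                # only one end mapped
    ("frag3", 1), ("frag3", 2),  # both ends mapped
]

# Counting reads counts a fragment twice when both of its ends map.
read_count = len(mapped_reads)

# Counting fragments counts each fragment once, however many ends mapped.
fragment_count = len({frag_id for frag_id, _mate in mapped_reads})

print(read_count, fragment_count)  # prints "5 3"
```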

All FPKM does is keep track of the fragments so that one fragment with two reads mapping to it is not counted twice. Now that we know what RPKM and FPKM are, let's talk about TPM, or transcripts per million. TPM is like RPKM and FPKM, except the order of the operations is switched. Let's look at an example.

For TPM, the first step is to normalize for gene length. So after dividing each read count by the gene length, we have RPK, or reads per kilobase. The second step for TPM is to normalize for sequencing depth.

We do this by adding up the read counts that have already been normalized for gene length and get the total for each replicate. We then divide this total by some number, usually that's a million, but for this four-gene example we're just going to divide by 10. This gives us our scaling factors. Now we divide the read counts that have already been normalized for gene length by our new scaling factors. And that gives us TPM.
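The TPM steps can be sketched the same way. As before, the gene lengths and read counts are assumptions (the transcript doesn't list the table), chosen so that gene A comes out at about 3.33 in replicate 1 and 3.326 in replicate 3, the values mentioned later on. The names `GENE_LENGTHS_KB`, `COUNTS`, `SCALE`, and `tpm` are hypothetical.

```python
# Hypothetical toy data: assumed values, not taken from the transcript.
GENE_LENGTHS_KB = {"A": 2, "B": 4, "C": 1, "D": 10}
COUNTS = {
    "rep1": {"A": 10, "B": 20, "C": 5, "D": 0},
    "rep2": {"A": 12, "B": 25, "C": 8, "D": 0},
    "rep3": {"A": 30, "B": 60, "C": 15, "D": 1},
}
SCALE = 10  # stands in for 1,000,000 in this four-gene example


def tpm(counts, lengths_kb, scale=SCALE):
    """Normalize one replicate: gene length first, then sequencing depth."""
    # Step 1: reads per kilobase (RPK).
    rpk = {gene: n / lengths_kb[gene] for gene, n in counts.items()}
    # Step 2: the scaling factor is the sum of the RPK values / scale.
    depth_factor = sum(rpk.values()) / scale
    return {gene: value / depth_factor for gene, value in rpk.items()}


tpm_table = {rep: tpm(counts, GENE_LENGTHS_KB) for rep, counts in COUNTS.items()}
for rep, values in tpm_table.items():
    print(rep, {gene: round(v, 3) for gene, v in values.items()})
```

The only difference from an RPKM calculation is that the depth scaling factor is computed from the length-normalized counts rather than from the raw counts.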

That's all there is to it. We did the same things we did for RPKM and FPKM, except in this case we just did them in a different order. However, this will have profound effects on the results.

Here's a comparison of the same original data sets scaled for RPKM and TPM. Above we have RPKM and below we have TPM. Both TPM and RPKM correct for biases in gene length and sequencing depth, but the sums of total normalized reads in each column are very different.
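To check the column sums, here's a small sketch that computes both metrics on the same toy data and adds up each replicate. The data values are again assumptions chosen to fit the description, not numbers taken from the transcript, and the helper names `rpkm`, `tpm`, `rpkm_sums`, and `tpm_sums` are hypothetical.

```python
# Hypothetical toy data: assumed values, not taken from the transcript.
GENE_LENGTHS_KB = {"A": 2, "B": 4, "C": 1, "D": 10}
COUNTS = {
    "rep1": {"A": 10, "B": 20, "C": 5, "D": 0},
    "rep2": {"A": 12, "B": 25, "C": 8, "D": 0},
    "rep3": {"A": 30, "B": 60, "C": 15, "D": 1},
}
SCALE = 10  # stands in for 1,000,000 in this four-gene example


def rpkm(counts, lengths_kb, scale=SCALE):
    """Depth first (using raw totals), then gene length."""
    depth_factor = sum(counts.values()) / scale
    return {g: n / depth_factor / lengths_kb[g] for g, n in counts.items()}


def tpm(counts, lengths_kb, scale=SCALE):
    """Gene length first, then depth (using length-normalized totals)."""
    rpk = {g: n / lengths_kb[g] for g, n in counts.items()}
    depth_factor = sum(rpk.values()) / scale
    return {g: v / depth_factor for g, v in rpk.items()}


rpkm_sums = {rep: sum(rpkm(c, GENE_LENGTHS_KB).values()) for rep, c in COUNTS.items()}
tpm_sums = {rep: sum(tpm(c, GENE_LENGTHS_KB).values()) for rep, c in COUNTS.items()}

print("RPKM column sums:", {rep: round(s, 3) for rep, s in rpkm_sums.items()})
print("TPM column sums: ", {rep: round(s, 3) for rep, s in tpm_sums.items()})
```

On this assumed data the RPKM columns add up to roughly 4.29, 4.5, and 4.25, while every TPM column adds up to 10 (the stand-in for a million).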

With RPKM, we get a different value for each column. With TPM, we get the same value for each column. Why is this important?

I'll show you. Now consider three pies, each the same size. In this case, each pie is size 10. The TPM values in each replicate represent slices in these pies, and we can tell from each slice what proportion of the total reads in that replicate went to each gene. For example, for gene A in replicate 1, we see that the size of that slice is 3.33.

This slice is just a little larger than the slice in replicate 3, which is size 3.326. What this tells us is that, of all the reads that mapped in replicate 1, a larger proportion of them mapped to gene A, whereas of all the reads that mapped in replicate 3, a slightly smaller proportion of them mapped to gene A. This is because some of those reads also mapped to gene D in replicate 3, and none of the reads mapped to gene D in replicate 1.

With RPKM, it's harder to compare the proportion of total reads that map to each gene because each replicate has a different total overall.

That is to say each pie has a different size. That makes it difficult to compare the slices between each one. The main point? With TPM everyone gets the same sized pie.

Alright, in all seriousness, folks are using TPM because the numbers can clearly tell you what proportion of reads map to what gene in each sample. And since RNA-seq is all about comparing relative proportions of reads, this metric seems more appropriate. The end.

I hope this helps you understand the difference between these different metrics. Tune in next time for another exciting StatQuest.