This is level one of the CFA program, the topic on quantitative methods and the reading on organizing, visualizing, and describing data. What we've done for you is divided this reading into two parts. This will be part one.
You should also note that this is a new reading written in 2020 by two researchers, both of whom have PhDs, and one of whom has written a fantastic article on Dutch auction repurchases in which she uses several of the organizing, visualizing, and describing data techniques that we'll talk about here in this reading. Let's take a look at some of these learning objectives.
Notice that there are interprets and describes and identifies. And as you look through these, you might think that these are relatively straightforward. And in fact, they are.
But what I want to do is give you a bit of a warning that what the Institute is trying to do with this new reading is to prepare you to be able to collect the data, to be able to manage the data, and then be able to use the data in some type of a useful framework after which you can make a reasonable decision. So let's go ahead and look at that first one, identify and compare data types.
I'm willing to bet that most of you guys know these kinds of things. Data is a collection of facts, and it can be a whole bunch of different things there, including audio and video. Of course, me being an old dude, you know, audio and video and images, I didn't grow up with that stuff, but my children, they don't even think twice about it. I'm guessing you're in that category as well. We have a couple of slides on the classification of data.
There are three different ways to differentiate between and among data. So let's go ahead and take some time to work through some examples here. Numerical data, of course, can be measured or counted, and this can be divided into two categories.
Continuous data is measured on an infinite scale. The example that we have there is measures of temperature. How about if we take a look at some other continuous data examples? As you watch my videos going through level one, you'll note that I like to make lots and lots of sports analogies. So for continuous data on the professional golf tour, we can say something like: all right, these professional golfers, sometimes they shoot a 59, sometimes they shoot, oh, I don't know, how high do you want to go, 85, 89? You'll surely hardly ever see 90. But the scores can be measured on an infinite scale within some range. That's continuous data.
Discrete data, on the other hand, is data that can be counted and has a finite number of possible values. Notice that first example we have there: of course, there are just seven days in a week, unless you are a Beatles fan. How about another sports analogy?
Suppose we're watching NFL football on Sunday, and we're watching Kansas City Chiefs and Patrick Mahomes, and we want to know how many touchdown passes is he going to throw this weekend. And so, well, I guess the minimum is zero, right? So you can count to zero. And then I think Peyton Manning has the record for most touchdown passes on a Sunday. And I think that's six.
So you can count one, two, three, four, five, or six. Although I wouldn't put it past Mahomes to throw seven touchdown passes one of these weekends. How about categorical data?
So this is data representing qualitative outcomes. Ah, so there's a quality to it or a characteristic, like investment grade bonds, which are BBB or higher, or speculative bonds, which some people call junk bonds, which are BB or lower.
You can have nominal data, which has really no inherent value from a numerical standpoint, but it has value from a categorical standpoint. And so what you do is you just assign numbers to represent certain outcomes, like one for bankruptcy and two for no bankruptcy. Ordinal data: you can rank this data according to some characteristic. And of course, my children like to rank the quality of my parenting skills. And so some days I get a zero, some days I get a one, sometimes I get a five, and sometimes I get anything in between there.
This is a super important slide here because what we're preparing you for, what the Institute is preparing you for, is to be able to handle some regression analysis. So when you try to collect data on a whole bunch of variables, and we'll talk about variables here in just a second, it's important to see if there's any statistical relationship between and among those variables.
So what we'll do, both in level one and in level two, is examine cross-sectional data and time series data. And as I was saying earlier about that Dutch auction repurchase article written by one of the authors, there's an example of cross-sectional data in which the author collected a sample of firms that used a Dutch auction repurchase to change their capital structure quickly.
And so this is a set of observations taken at a single point in time. Stock returns earned by Microsoft, IBM, and Samsung for the year ended on December 31st, 2021: there's an example of cross-sectional data. Now it also includes this Dutch auction repurchase, which was a single point in time in which there was an announcement of a Dutch auction repurchase, and that was different for every firm, but then the stock returns were collected around that single point in time.
So notice we have that in bold. And that's super important to remember about cross-sectional data. But we can also ask ourselves the question, well, how do particular variables change over time?
And this is time series data. And so we can say something like, hey, we have this valuable asset. Maybe it's the Pink Panther diamond. And what we want to do is see how the value of that Pink Panther diamond changes on a monthly basis. Now, of course, those of you who watch the old Pink Panther movies know that that's really kind of a silly question because it's priceless.
So if it's priceless here and priceless there and priceless there, time series data really is not too relevant. However, suppose the asset is something that can be measured and has volatility in prices, like a barrel of oil. So the barrel of oil, one week it goes up, one week it goes down.
So we can see that in time series data, and from both time series data and cross-sectional data we want to extract the important relationships so that, once again, we can make some kind of a reasonable decision. And then, of course, as you might suspect, you can combine those two. That's called panel data. Let's get back to my use of the word variable.
A variable is a characteristic or quantity that can be measured, counted, or categorized and is subject to change. Let's look at the examples that we have: stock price, market capitalization, dividends. And of course, a sports analogy.
Here's a great one for you. Number of putts during the course of a round of golf for a professional golfer. What would you suspect?
You would suspect that over time, when the golfer has fewer putts, that golfer is going to have a lower score. An observation, on the other hand, would be an observation of, OK, today Tiger Woods had 22 putts. And of course, if you knew he had 22 putts, you would be thinking, oh, I bet he shot a 62. And that's probably accurate. On the other hand, if you were evaluating me, Jim, Jim had 41 putts and you would say, well, I don't even care what Jim shot that day because he stinks at putting.
How about structured versus unstructured data? I'm envisioning a question in which you're given some data and the question says, OK, is this structured or unstructured? So let's make sure we can identify and compare here, but also make sure we can differentiate. So structured data: this is exactly what you might suspect.
It's organized in a predefined manner. And what you'll find watching my recordings is that I regularly refer you to a visualization of an Excel spreadsheet. And that's why we put rows and columns in there. Easy to store, easy to process, and easy to access. And the examples really are limitless.
You could put the number of putts in a column in an Excel spreadsheet, or you could put dividends, you could put cash flows, you could even put some kind of data that is qualitative, like the skill set of the executives. Unstructured data, on the other hand, is not organized, and notice what we have in bold over there: difficult to handle and understand. So the Institute is not saying it's difficult to handle and understand, therefore we're going to throw it over there and not worry about it. No, the question then becomes: OK, if it's difficult to handle and understand, how can we wrap our arms around it and bring it over here inside of our arms so that we can put together some kind of a useful data set so that we can make decisions?
How are they generated? Boy, social media posts. I'm completely, completely ignorant about those kinds of things. But credit card transactions, that makes perfect sense.
Regulatory filings, sensory images from satellites, foot traffic, mobile devices, etc, etc. How about if we take a quick quiz? I'll give you five of these examples of data sets.
And you go ahead and give me the answer of what kind of a data type would these things describe. Let me start off with an easy one there because we've already given that one to you. So that's cross-sectional data.
How about the price change of a stock? Do I need to pause to let you process that through your brain? How fast is your cognitive function? That's continuous data. Number of students in a class.
That's going to be discrete data. Ooh, color of a smartphone. Where's my phone around here?
I don't think I have a color on mine. That would be nominal data. Grades of a student in a quiz. Boy, if they all make zero, I'm not quite sure what that means, but that's ordinal data. How can we organize this data for quantitative analysis?
All right, so this is what I have been saying here and hinting at: what the Institute wants you to be able to do is to make a decision. So that's what they call quantitative analysis, putting this stuff together.
So it jumps out of the spreadsheet or jumps out of the page at you and says, Hey, Jim, there you go. There you go. There's the answer. And the answer is, oh, let's invest in these firms that pay these kinds of dividends, that have this amount of cash flow and, you know, whatever else it is that you want to put at the end of that sentence.
So there are two pictures of a one-dimensional array. Think of it this way. When I give examples on my exams, what I'll do in that, boy, is that a light red? In that top column, I'll just give years, like year 1, 2, 3, 4, 5, 6, 7. And then underneath it, I'll have, you know, like some operating cash flows, and they'll have to compute net present value. So that would be one dimensional, right? And then two dimensional would be the Excel spreadsheet.
We go over this way and then we go down this way and that could be time series or cross-sectional or any kind of way that you figure it out. Let's go ahead and talk about distributions. And I want you to think about distributions really as the range of a data set. And so if we go from, let's go back to my Tiger Woods example, you know, what's the least number of putts that Tiger has had during the course of a round. I wouldn't know.
I bet that's probably relatively easy to find. I mean, it's not 18, right? Because there are 18 holes, but it's something in the low 20s, right?
And what's the most putts that he's ever had? Oh, I don't know. Maybe he's had a really bad putting day. I think he had an 85 in the British Open one year when there was a tremendous wind and rain and everyone was shooting high scores. Maybe he had 35 putts that day.
I don't know. You know, so you have this range. And so let's think about the frequency.
You know, so how many times did Tiger have 22 putts? How many times do you have 25 putts? How many times did he have 32 putts?
Right. And so this is a frequency distribution. And so think about it: the closer you get to the average, the more likely it is that that event will occur. So notice what we've written in that frequency distribution bullet point. Simplify the analysis.
So we're going to divide these into groups or intervals. So let's go ahead and make up an example here. Let's suppose that we are trying to evaluate hedge fund managers.
And for some reason, we have complete data on 30 hedge fund managers. And what we've decided is that the minimum number of assets invested in a hedge fund is 67. The maximum number of assets invested in a hedge fund is 125. And remember, the hedge fund universe includes things like, you know, stocks and bonds, but also includes things like investments in gold and oil wells and private equity and vintage cars and anything you can throw into that hedge fund universe. Now, let's also suppose that within that range between 67 and 125 we've decided that there are five classes. So we're going to say this is a five-class distribution.
And there's probably some artistry involved in identifying and selecting the appropriate number of classes, but it could be based on some pretty reasonable and simple things like, you know, maybe size. So let's go ahead and try to establish and craft a frequency distribution and see what this means. I want you to think about this: this is all the information in the question stem, so where do we go from there? How do we craft this distribution, and how can we interpret it?
All right, a couple of steps that we're going to take. Notice that we've written up there that five would be an appropriate number of classes. What we do is start with the minimum level, that 67, so let's put 67 down there as the minimum level. And then we need to figure out how wide these classes are going to be. Notice the simple math that we've done up above the table: 125 minus 67, divided by 5, is equal to 11.6, and we're always going to round up. The reading tells us to always round up, and that's to make sure that the bottom number is included in the table. So all we're going to do then is go from 67, add 12 to it, which gets us down to 79, and subtract 1, which gets us to 78. Notice that each of those differences is just 12, all the way down to 126. So let's just verify this, all the way back.
There we go. The maximum of 125 is included in this class limit, which goes all the way up to 126. And of course, it's not going to be perfect because we have some rounding. So then what we do, after we've established those class limits, is go ahead and tally them.
All right. So we're going to go and count how many of these hedge funds have between 67 and 78 securities or assets in their portfolio. So we got three, we got five, eight, nine, and five.
So there's our total of 30. Do you see how this is kind of taking this huge data set, right? We have 30 hedge funds, and we have all of these assets, and now we're trying to organize them, right? Trying to squish them down so that we can make some sense of them.
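If you wanted to check those class limits outside of a slide, here is a minimal Python sketch of the steps we just walked through. The function name `class_limits` is my own, not something from the reading, and the tallies are simply the counts read off the table above. It assumes whole-number data, and note that if the range divides evenly by the number of classes, rounding up alone would not widen the classes enough to capture the maximum.

```python
import math

def class_limits(minimum, maximum, k):
    """Build k integer classes starting at the minimum (hypothetical helper)."""
    # (125 - 67) / 5 = 11.6, and we always round up, so the width is 12.
    width = math.ceil((maximum - minimum) / k)
    limits = []
    lower = minimum
    for _ in range(k):
        upper = lower + width - 1  # add the width, then subtract 1
        limits.append((lower, upper))
        lower = upper + 1          # the next class starts right after
    return width, limits

width, limits = class_limits(67, 125, 5)
tallies = [3, 5, 8, 9, 5]  # absolute frequencies read off in the lecture

for (lower, upper), count in zip(limits, tallies):
    print(f"{lower:>3} to {upper:<3}  {count}")
print("total", sum(tallies))
```

The last class runs up to 126, so the maximum of 125 is captured, which is the whole point of rounding up.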
All right, so let's keep going here. We're not done with that table, but let me define relative frequency, which is just absolute over total frequency. And let's go ahead and make sure you understand how to craft this table. All right, so here in the red, we've already done that.
So our relative frequency is going to be the absolute 3. So 3 divided by 30, there's 10%, right? And then you do all of those relative frequencies going down. Of course, they have to sum to 100%.
And then you do the same way going across there. So you get the cumulative frequency. So 10 plus 16 gives us 26. Then plus 26 gives us the 53. Plus 30 gives us the 83. So now we have relative frequencies and now we have cumulative relative frequencies. So what we can do is we have taken this huge data set on hedge funds and we've narrowed it down, right? We've kind of chopped it.
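Here is a small sketch of that arithmetic in Python, assuming nothing beyond the five tallies from the table (3, 5, 8, 9, and 5). The variable names are mine. The cumulative relative column is computed from the running count rather than by adding rounded percentages, which is why the figures come out a touch higher than the 26, 53, and 83 I read off.

```python
from itertools import accumulate

absolute = [3, 5, 8, 9, 5]   # absolute frequencies from the lecture table
total = sum(absolute)        # 30 hedge funds

# Relative frequency is absolute frequency over total frequency.
relative = [count / total for count in absolute]

# Cumulative frequencies are running totals down the table.
cumulative_absolute = list(accumulate(absolute))
cumulative_relative = [c / total for c in cumulative_absolute]

for a, r, ca, cr in zip(absolute, relative, cumulative_absolute, cumulative_relative):
    print(f"{a:>2}  {r:6.1%}  {ca:>2}  {cr:6.1%}")
```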
We've compartmentalized it so that we can identify strengths and weaknesses and maybe make some recommendations about the performance of that hedge fund. Let's move on to what the Institute calls a contingency table, a tabular representation of categorical data.
All right. So we're going to show some frequencies here for particular combinations. How about if we use this example? Let's suppose that we're trying, as a good financial analyst, to estimate growth rates in the economies of three African countries. And let's suppose that we identify these three African countries because they all geographically surround Lake Victoria, and maybe Lake Victoria has some interesting part about it that we think is going to be able to increase the economic growth in these three countries, maybe compared to other countries in Africa or maybe other countries throughout the world.
Interesting to note that one of these countries, I think it's Uganda, has a couple of hydroelectric plants coming out of Lake Victoria. So what we're doing is we're saying, all right, we want to estimate economic growth, and one of the important variables that we determine going into this study is the education of those individuals that live in each of those countries. And so we have middle school, high school, bachelor's, or master's degrees.
Notice that we have 40 in each country, so that gives us a total of 120 of these individuals, and maybe these are groups of individuals, and maybe we've left off, you know, a bunch of zeros in each one of these numbers so that we don't have an unwieldy type of a matrix. All right, so let's talk about joint and marginal frequencies. A joint frequency is simply defined as a combination of two conditions happening and occurring at the same time. So Kenya has five middle school or lower individuals, Uganda has five middle school or lower individuals, and Tanzania has 30. And so each one of those cells, if you will, in an Excel spreadsheet is a joint frequency because you go down and you go across.
But then what we can do is we can establish a marginal frequency. So notice that we have boxed in master's degrees 15 and 5 and 0. So that sums to 20. That is the marginal frequency of master's degrees in these three African countries. Now, of course, these are absolute values, so we want to know what relative values are.
So let's go ahead and do relative frequencies and frequency distributions. And so this is really just a matter of dividing. So that 4% and the 13%, let's just go back here.
What's that 4%? So 5 divided by the 120, I'm guessing that's 4%.
Is that right there? Yep. And then 5 divided by the 40, I'm guessing that's going to be the 13%. Actually, they come out to be a little bit higher or a little bit lower, so there's a lot of rounding going on in this table.
But note what we've done. We've taken some data that we've collected, we put it in this two-dimensional table, and we're now determining frequencies: joint frequency, marginal frequency, and then percentage frequencies. So this is going to give us an idea that is going to help us decide which of these three countries, or maybe all three of them, are going to lead economic growth in the area. Of course, this is just a part of a larger global macroeconomic study, but you can see the importance of taking one variable, education, and relating it to economic output.
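To make joint and marginal frequencies concrete, here is a rough Python sketch of that contingency table. Only the middle school row (5, 5, 30), the master's row (15, 5, 0), and the 40 per country come from what I read off the slide. The high school and bachelor's rows are hypothetical placeholders that I picked only so that each country still sums to 40, so please don't treat them as the reading's numbers.

```python
countries = ["Kenya", "Uganda", "Tanzania"]

# Rows are education levels, columns follow the order in `countries`.
table = {
    "Middle school or lower": [5, 5, 30],   # from the lecture
    "High school":            [10, 15, 5],  # hypothetical placeholder
    "Bachelor's":             [10, 15, 5],  # hypothetical placeholder
    "Master's":               [15, 5, 0],   # from the lecture
}

# Marginal frequencies: sum across a row or down a column.
row_totals = {level: sum(counts) for level, counts in table.items()}
column_totals = [sum(column) for column in zip(*table.values())]
grand_total = sum(column_totals)

# Joint frequency: one cell, for example Kenya and middle school or lower.
joint = table["Middle school or lower"][countries.index("Kenya")]
share_of_total = joint / grand_total          # 5 / 120
share_of_country = joint / column_totals[0]   # 5 / 40

print(f"Master's marginal frequency: {row_totals[chr(77) + 'aster' + chr(39) + 's']}")
print(f"Joint share of total:   {share_of_total:.1%}")
print(f"Joint share of country: {share_of_country:.1%}")
```

The two printed shares are the unrounded versions of the 4% and the 13% on the slide.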
And then you might have a handful of other variables that you want to do similar analyses with, and then in the end you combine all these things and you say something like, oh yeah, Uganda is clearly going to lead in economic output over the next one year or six months or five years or whatever it is that you're trying to study.
Now one use of a contingency table is referred to as a confusion matrix, which I always thought was an interesting term. This really is a summary table of the difference between what you think is going to happen versus what actually happens. So notice what we have in bold there, actual versus predicted values. So read across the top, stock market negative return of 20% or more.
These are actual values. Go down the left hand column, stock market negative return of 20% or more. These are predicted values. So where the two meet, right?
If the stock market was predicted to have a loss of 20%, a negative return, right? And it actually did. That was 460. If we did just the opposite, if the answer is no to both of those questions, that's 190. So these are called true outcomes, right?
The predicted outcome matched the actual outcome. But then let's suppose we're wrong. And notice what we've done in bolded red: a type 1 error and a type 2 error.
So these are the outcomes predicted incorrectly by the model, falsely predicted by the model. So type 1 and type 2. And we're going to spend a lot, a lot of time on type 1 and type 2 errors. But this is a confusion matrix, which will tell us more about the relationship between what we thought was going to happen and what actually happened.
Now the final part of these learning outcome statements is a series of graphs, and those of you who are familiar with Excel, which I'm guessing is all of you, won't be surprised to see these. So there's a histogram, which is just going to show a distribution of numerical data in the form of a graph. Notice there's frequency going up the vertical axis, and then there's some measure of rates of return going across the horizontal axis, and those have been divided into class intervals like we did before. And so notice that first blue block, right, that kind of looks like a square. So what happens? We go down, then we go up, then we go up, go up, we go down, then we go way down. And so what we can do is put that into some kind of a polygon. This is called a frequency polygon, and that's why I went through that up, down, up, up, up.
And so you can see these are midpoints. And there's a good formula up there to determine midpoints. And by the way, there's probably going to be an LOS in level one or level two that asks you to compute the midpoint of a class interval. That's just a little bit of a warning there. But for this reading, we don't need to worry about computing it.
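Just so you have it, here is a minimal sketch of a midpoint calculation in Python. I'm assuming the usual definition, the lower class limit plus half the distance to the upper class limit, and I'm reusing the hedge fund class limits and tallies from earlier rather than the return intervals on this slide, since those aren't spelled out here. The function name `midpoint` is my own.

```python
def midpoint(lower, upper):
    """Midpoint of a class interval: lower limit plus half the interval."""
    return lower + (upper - lower) / 2

# Class limits and tallies reused from the hedge fund example.
limits = [(67, 78), (79, 90), (91, 102), (103, 114), (115, 126)]
counts = [3, 5, 8, 9, 5]

# A frequency polygon plots each class midpoint against its frequency.
polygon_points = [(midpoint(lower, upper), n) for (lower, upper), n in zip(limits, counts)]
for x, y in polygon_points:
    print(x, y)
```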
But you can see how it's done. Bar charts, of course, can go up like this or they can go over like that. And that gives you a sense of verticality and horizontality.
Is that a word? And so you can just visualize. I mean, you don't have to be any kind of a genius to say and conclude that, oh, yeah, that tall green line or that longest green line tells us something more important about the ratings for these hockey teams.
How about a Pareto chart? Categories are ordered by frequency, in descending order. So you can see those categories. So of course, the big one is going to be the very first one. So sales by client A, that's a huge one.
And then they go all the way down. And of course, you can do a relative frequency line that starts somewhere over there and then goes all the way up to 100 percent.
Grouped bar charts are grouped by different categories. Here we have different cities, right? So these are revenues in different quarters. So you can get a sense of what's going on in each quarter. That's blue, right, for the first quarter, and you can see the different colors for the different cities.
Stacked bar chart: the colors go up and down. And as I go from, well, you guys don't care about this one. I can see this one pretty clearly, but when I go to this, oh, all the pistons in my eyes and all the rods in my eyes, they explode. I see that. I almost get vertigo.
But clearly, you can see that these are stacked on top of each other. And so all you have to do is go over and see, does the orange, does it increase? Does it decrease? Tree maps are really interesting things, displaying hierarchical data in rectangular shapes.
And so, you know, let's do a tree map for the number of major championships won by professional golfers. So if this were a tree map for professional golfers, you'd have a big blue rectangle with Jack Nicklaus.
And you'd have a little bit smaller one. You know, Jack has 18 majors. Tiger has 15 majors. So you have a little bit of smaller one.
And then you have to go all the way down to, you know, the next group of people, which includes, you know, Phil Mickelson, all those other great golfers. So the size of the rectangle determines the contribution as a proportion to the overall variable that is being measured.
And then you can do this in a word cloud, and you see word clouds all the time. So if this were a word cloud, you would see Jack Nicklaus, you know, it would be the big bold letters in Jack, and then Tiger would be just a little smaller, and then you'd see all these other names in there.
Heat maps: these are a system of color coding to represent different values. So notice you have low to high on the horizontal, low to high on the vertical. And so, of course, you know, way, way up there to the right, that's super hot.
Way, way down to the left, that's super cold. And that could be, you know, all sorts of things. My first thought is to reflect my son's intense interest in fantasy football: ah, you know, these hot players over there, let's try to trade for them and get them on our team. The cold players down the bottom left, we want to dump them. We don't want them to be a part of our team.
Heat map, line chart, and bubble chart: these are super common in Excel, scatter plots as well. Let me call your attention to some of these scatter plots, because I'm going to give you a little bit of a warning about what's super important when we get to quantitative analysis. Notice that those first two have upward slopes. The lines are upward sloping, so there's a positive relationship between the two variables. But in that first one, notice what we have written underneath: strong positive correlation. Oh my gosh, we're going to have to compute correlation coefficients between and among lots and lots of variables in level one, level two, and level three.
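To put a number on what a strong positive correlation looks like, here is a minimal Python sketch of the sample correlation coefficient, written out by hand. The putts and scores below are hypothetical numbers I made up to echo the golf analogy, not real data, and `pearson_correlation` is just my own name for the helper.

```python
import math

def pearson_correlation(xs, ys):
    """Correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    squares_x = sum((x - mean_x) ** 2 for x in xs)
    squares_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(squares_x * squares_y)

putts = [22, 26, 28, 30, 33, 36]    # hypothetical putts per round
scores = [62, 66, 69, 70, 74, 78]   # hypothetical scores for those rounds

r = pearson_correlation(putts, scores)
print(f"correlation between putts and score: {r:.3f}")
```

A value close to plus one lines up with the tight, upward-sloping scatter plot. A value close to minus one would be the downward-sloping version.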
So correlation is really important, which means that a scatter plot between two variables has to be super important. And then this last LOS, describe how to select among visualization types. Regularly, I tell students to get out their phones and take a picture of this one so that you have it, you can store it, and you can consult it so that you can address these issues on an exam.
But we've covered all of these visualization types, and I can envision a question. If I were writing questions, I would give you some data in the question stem, here, here, here, and here, like four of them, and I would say, okay, which one is which? And then I could ask a series of questions. Of course, in level one, you don't have that kind of a question set, but you get the sense that you'll have to decide between and among these.
And that takes us through the learning objectives. I think these are relatively simple. Hopefully this was a good video. I like to try to emphasize one or a handful of LOSs at the end of each recording, but I don't know that I can do this here. Although identify and compare data types, that first one there, when you have mastery of that LOS, then the other ones, you know, kind of fall