Transcript for:
Understanding Structural Equation Models

In the first two videos on structural equation models I covered some of the conceptual background, the history, and some of the key ideas. In this video we move on to the applications, to some of the actual model fitting that goes on in structural equation modeling, and this focuses particularly on confirmatory factor analysis. So in this video I'm going to talk about the general idea of how we measure concepts using latent variables, and I'm going to contrast two approaches to doing that. The first is the more conventional, historically the main, way of doing this, exploratory factor analysis, and I'm going to contrast this with the more modern approach of confirmatory factor analysis. I'll then move on to some of the ways that we go about actually fitting and estimating confirmatory factor models and some of the important procedures involved. I'm going to finish by talking about some of the extensions of CFA: notably, modeling the means of latent variables as well as their relationships or associations; the difference between formative and reflective indicators; a procedure called item parceling; and the situation, which we may sometimes be interested in, of fitting a factor model to variables which are themselves latent variables rather than to observed variables, which is the usual case, and that would be called a higher-order factor model.

In the first video I gave a pithy definition of structural equation modeling as path analysis with latent variables. We can also think of this as a distinction between two stages or two parts of the modeling process. The first is where we want to get good measures of our concepts or constructs, and the second part is looking at the relationships between those measured constructs. So the emphasis is firstly on measurement, on measurement accuracy and adequacy, and secondly on the structural relationships between the constructs that we've measured.

As we saw in the first video, any time we want to measure something in science, and particularly in social science, the measurements contain various kinds of error; that error can be random and/or systematic. What we want to do in our statistical approach to the data is to isolate the true score in a variable and remove the error, and this is really what we're trying to do when we use latent variables for measurement. We want to decompose our x variables, where x is what we've actually measured, into T and E components: T is the true score and E is the error, and we need some kind of model to enable us to split x into these T and E components.

One quite straightforward and useful way of doing this is simply to add the scores across a number of different x variables. If we have, say, four variables which are all measuring the same underlying concept, then we could just add those up and take a sum score. This has some benefits, because the random error in each of those measurements will tend to cancel out as we add items together, but it's a rather unsophisticated approach, and in particular it gives equal weight to each item in the construction of the true score, which is often something that we don't want to do. So another approach is to actually estimate some kind of latent variable model.
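As a minimal sketch of that idea, the snippet below simulates a true score T and four items of the form x = T + E, then shows that an equal-weight sum score tracks the true score more closely than any single item does. The data and values are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# True score T for each respondent, and four observed items, each of the
# form X_i = T + E_i with independent random error E_i.
T = rng.normal(0, 1, n)
X = np.column_stack([T + rng.normal(0, 1, n) for _ in range(4)])

# Simple sum score: every item gets equal weight.
sum_score = X.sum(axis=1)

# The random errors partly cancel, so the sum score tracks the true score
# more closely than any single item does.
print(round(np.corrcoef(X[:, 0], T)[0, 1], 2))    # one item vs. true score
print(round(np.corrcoef(sum_score, T)[0, 1], 2))  # sum score vs. true score
```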
In understanding the ways that we do this in SEM, it's useful to go back in history, if you like, and think about an earlier approach to estimating latent variables. This isn't to say that exploratory factor analysis is no longer used, of course it is, but the more modern procedure of confirmatory factor analysis has some attractive properties compared to EFA.

The exploratory factor model is also referred to as the unrestricted factor model, or unrestricted factor analysis, because, as we'll see when we get to CFA, CFA places restrictions on the variance-covariance matrix whereas EFA doesn't. EFA, or principal components analysis, which is a similar technique, finds the factor loadings which best reproduce the correlations observed between the observed variables in our model. So let's say that we have six questionnaire items that all measure more or less the same thing; they're intended to measure some concept that we're interested in. An EFA will simply reorder the data in a way which best accounts for the observed correlations between those variables. It does this by producing a number of factors which, in an EFA, is equal to the number of observed variables that we have. So this is really just a reordering of the observed data, and we end up with the same number of factors as we have observed variables. At this point, with just this reordering, the EFA hasn't done very much in the way of summarizing or simplifying, which is often what we're trying to do with a latent variable model: we have as many factors as observed variables, and all the observed variables in our model are allowed to be correlated with all of the factors.

We now need to get from this point of having the same number of factors as observed variables to retaining a smaller number, so that we're doing some job of summarizing rather than just transforming the observed relationships. There are different rules for doing this. One kind of heuristic judgment would be to retain a number of factors, smaller than the number of observed variables, that explains some satisfactory amount of the observed variance; so we might say we'll retain as many factors as are needed to explain 70% of the variability, or of the correlations between the observed variables. Something else that we have to do, in addition to summarizing, is to understand what the factors produced by the factor analysis mean, what they are measuring. We do this by looking at the pattern of factor loadings between each factor and the observed variables, so we work it out in an inductive way: we figure out what the factors are by looking at how they are related to the observed variables. Another thing about exploratory factor analysis is that there is no unique solution where we have more than one factor, so we can rotate the axes of our solution in ways that can help us to see what the underlying structure is, and rotation of axes in exploratory factor analysis is quite common.

To give an example of what I mean by some of those previous points, here's some made-up data. We have nine observed items; these are, if you like, knowledge quiz items that have been administered to a sample of children, and what we're measuring is some construct like intelligence or cognitive ability. If we were to apply an EFA or a principal components analysis to this data, then we would initially have nine components or factors, the same number as the observed items.
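As an aside, that first extraction-and-retention step can be sketched in a few lines of numpy; the correlation matrix below is invented simply to make the sketch self-contained.

```python
import numpy as np

# Toy correlation matrix for six questionnaire items (invented values,
# just to make the sketch self-contained).
R = np.array([
    [1.0, 0.6, 0.5, 0.1, 0.1, 0.1],
    [0.6, 1.0, 0.5, 0.1, 0.1, 0.1],
    [0.5, 0.5, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.6, 0.5],
    [0.1, 0.1, 0.1, 0.6, 1.0, 0.5],
    [0.1, 0.1, 0.1, 0.5, 0.5, 1.0],
])

# Principal components: as many eigenvalue/eigenvector pairs as items.
eigenvalues, _ = np.linalg.eigh(R)
eigenvalues = np.sort(eigenvalues)[::-1]        # largest first

# Proportion of total variance each component explains, plus a heuristic
# retention rule: keep enough components to account for, say, 70%.
explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)
n_retain = int(np.argmax(cumulative >= 0.70) + 1)
print(explained.round(3), n_retain)
```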
The first thing that we would need to do, then, is to apply some judgment about how many factors to retain. In this case you can see that three factors have been retained in this model, and that may have been based on one of these heuristic guides around the amount of variance explained, or on some kind of plot like a scree plot. Once we've done that, we want to know what each of these three factors is actually measuring, and we do that by looking at the pattern of correlations, which is what's in the rows and columns of this table, between each factor and each of the items. If we look first at factor one, we can see that the factor loadings, or correlations, are high between factor one and the observed items which are measuring mathematical ability. This is saying that the higher your score on factor one, the more likely you are to get the item Math 1 correct: there's a high correlation between your score on the factor and your score on the item. For factor two there are high loadings on the visuospatial items and low loadings on the other items, and for factor three we see the remaining pattern, where it's the verbal items that have high loadings and the other items have low loadings. So we do this inductive process of figuring out what the factors are measuring by looking at the correlations between the factors and the observed variables, once we've retained a smaller number that we think is in some way satisfactory.

This is a very useful procedure and has been widely used in social science for many decades, but it does have some limitations. Firstly, EFA is an inductive, rather atheoretical procedure, and that is something which in general we are less happy with in terms of the way that we build theory in quantitative social science. We've got a situation where the data is telling us what our theory should be, when generally we would prefer to do that the other way round: we would have a theory and test it against the data. Another unattractive property of EFA and similar techniques is that it relies on subjective judgments and heuristic rules about what counts as a large amount of variability to explain, and so on, so there is a lot of room for subjectivity in determining what our model should be. And of course, when we are analyzing data of this nature, where we have indicators of underlying concepts, it's rarely the case that we have no theory at all about which concepts the different indicators are actually measuring; we've usually written the questionnaire with the specific intention of measuring particular concepts. So the more realistic and accurate assessment of what's going on here is that we're starting with a theory and then assessing it against the data that we've collected; the idea that we are going from the data to the theory is not generally an accurate representation of how this procedure actually works. Given that we do have a theory about how the indicators are related to the concepts, it's better to be explicit about that from the outset and then use statistical tests of those theories of measurement against the sample data that we've collected.
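Before moving on to the confirmatory approach, here is that inductive labelling step from the quiz-item example in a few lines of pandas; the loading values and item names are invented for illustration.

```python
import pandas as pd

# A made-up loading pattern from an EFA of nine quiz items: three factors
# retained, items and values invented purely to illustrate the labelling step.
items = ["math1", "math2", "math3", "vis1", "vis2", "vis3",
         "verb1", "verb2", "verb3"]
loadings = pd.DataFrame(
    [[.78, .08, .10], [.74, .12, .05], [.70, .09, .11],
     [.10, .76, .08], [.07, .72, .12], [.11, .69, .06],
     [.09, .10, .77], [.12, .06, .73], [.05, .11, .70]],
    index=items, columns=["factor1", "factor2", "factor3"],
)

# Group each item with the factor it loads on most strongly; the analyst
# then names the factors (maths, visuospatial, verbal) from these groupings.
print(loadings.abs().idxmax(axis=1))
```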
So we can compare this approach of exploratory factor analysis with a confirmatory approach. Confirmatory factor analysis is also referred to as the restricted factor model because, unlike EFA, it places restrictions on the parameters of the model. The solution therefore can't be rotated; there is only one unique solution for the CFA. The key difference between CFA and EFA is that we specify our measurement model before we've looked at our data, and this is sometimes referred to as the "no peeking" rule: if we have a theory about how the indicators are related to our concepts, then we should set that down a priori as our theory and test it against the data, rather than tweaking our theory as a function of the particular sample data that we happen to have.

When we do things in this confirmatory way, the key kinds of questions that we have to answer are: which indicators measure, or are caused by, which factors; and, importantly, and this is the real distinction with EFA, which indicators are unrelated to which factors. Remember that in an EFA every variable is allowed to correlate with every factor; in CFA that isn't the case. We will say that the correlations or covariances between some of the indicators and some of the factors are zero, and we'll impose that as a parameter restriction. We will also need to answer questions about the correlations between the factors, rather than leaving that as a default assumption in the model.

Here we have six observed variables, x1 to x6. In an EFA the first stage of the model would have produced six factors or components, but at this point we've already retained just the two factors that we think explain enough of the variability between our observed variables. What you also see here, though, is that there is a single-headed arrow running from each of the two latent variables, eta 1 and eta 2, to all six of the observed variables, so we are estimating a correlation between each factor and each of the observed variables. What we would be looking for in this kind of situation is that some of those loadings would be large and some of them would be close to zero. If we look at eta 1, for example, we might in an EFA context expect that the loadings between eta 1 and x1 to x3 would be high, say 0.7 or above in standardized form, and that the loadings that run from eta 1 to x4 to x6 would be close to zero, and the opposite would apply for eta 2. So what we're doing there is, as I say, estimating all of those relationships and expecting some pattern of high and low loadings between them.

By way of contrast, here are the same variables and the same two factors, now in the form of a confirmatory factor model. Rather than having estimates for all of the relationships between eta 1 and x1 to x6 and between eta 2 and x1 to x6, we say that there is no relationship between eta 1 and x4 to x6: there is no arrow pointing from eta 1 to any of those observed variables, and the same for eta 2, there are no arrows pointing at x1 to x3. The fact that there isn't an arrow there means that in our model we are constraining those paths to zero. We're not just estimating them and asking whether they are nearly zero; we are specifying our model a priori to say that those paths are indeed zero. Those are the kinds of parameter constraints and restrictions that I was referring to in video 2. It's quite unusual, in the other branches of statistics that we use in social science, to make these constraints and fix parameters to particular values, but that's why we call the confirmatory model the restricted factor model: because we place restrictions on the loadings.
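A minimal sketch of how that two-factor CFA could be specified in lavaan-style syntax, here using the third-party semopy package; the package choice, the file name, and the column names are assumptions, not something taken from the video.

```python
import pandas as pd
import semopy   # third-party SEM package; using it here is an assumption

# Two-factor CFA: eta1 is measured only by x1-x3 and eta2 only by x4-x6.
# Leaving an indicator off a factor's line is what fixes that cross-loading
# to zero -- the parameter restriction described above.
model_desc = """
eta1 =~ x1 + x2 + x3
eta2 =~ x4 + x5 + x6
eta1 ~~ eta2
"""

data = pd.read_csv("items.csv")   # hypothetical file with columns x1-x6
model = semopy.Model(model_desc)
model.fit(data)
print(model.inspect())            # loadings, factor covariance, residual variances
```

The `eta1 ~~ eta2` line makes explicit that the covariance between the two factors is estimated rather than fixed, which is one of the specification decisions mentioned above.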
Sometimes, as in the example I just gave, we fix particular parameters to zero for indicators that do not measure, are not influenced by, a particular latent variable. The important thing to understand is that our theory of the measurement of our concepts, how we think the concepts are related to the indicators that we've selected and written, if they're questionnaire items, is expressed in the constraints that we place on the model. We're not just estimating everything; we are placing restrictions on the values that parameters can take, and those restrictions, that fixing of parameters, over-identify the model. We are placing restrictions which give us more degrees of freedom in our model, and this in turn enables us to test the fit of our model against the matrix that we've actually observed, the sample variance-covariance matrix.

Another way that we apply restrictions to the parameters in a confirmatory factor model is to give the latent variables a metric. What I mean by that is that if we have a measured variable, we will have specified some kind of scale for the respondents to answer on: maybe "strongly agree" is the value one and "strongly disagree" is the value five, so the scale is 1 to 5 for that measured variable. A latent variable doesn't have any metric; it's an unobserved, hypothetical variable, so it doesn't have a metric of its own and we have to give it one. There are two ways that this can be done. The first is to essentially produce a standardized solution, so that all variables are measured in standard deviation units; this can be done by constraining the variance of the latent variable to one. This has some benefits, but the downside is that we no longer have an unstandardized solution: if we require all latent variables to be measured in standard deviation units, then there is no retention of the unstandardized metric that they could have been given. The second approach is to constrain one of the factor loadings to take the value one. By doing this we take the scale from that particular item, which we call the reference item. So if we fix the factor loading of a particular item to one, that will be the reference item and the latent variable will have the same scale as that item: if it's measured on a one-to-five scale of strongly agree to strongly disagree, the latent variable will be on a scale of 1 to 5; if it's a 1 to 10 scale, the latent variable will be on that same scale. This is generally preferred to the first approach of producing a fully standardized solution, because we can also obtain a standardized solution when using the second approach of fixing one loading to the value one.
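To make the over-identification point concrete, here is a back-of-the-envelope degrees-of-freedom count, in plain Python, for the two-factor model above with the metric set by the reference-indicator approach; it just applies the standard counting rule rather than anything specific to the video.

```python
# Over-identification check for the two-factor CFA above, with the metric
# set by fixing one loading per factor to 1 (the reference-indicator approach).
p = 6                                  # observed variables x1-x6
sample_moments = p * (p + 1) // 2      # unique variances and covariances = 21

free_loadings = 4        # six loadings minus the two fixed to 1
factor_variances = 2
factor_covariance = 1
residual_variances = 6
free_parameters = (free_loadings + factor_variances
                   + factor_covariance + residual_variances)

df = sample_moments - free_parameters
print(df)   # 21 - 13 = 8 degrees of freedom, so the model's fit can be tested
```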
In confirmatory factor analysis we are interested in building good measures of the key constructs, the concepts, in our theories, and in the next stage we usually move on to look at the relationships between those measured concepts. Conventional SEM is focused on the structural model, the relationships between concepts, so we are not usually so interested in the means of the observed or the latent variables; in the conventional way of doing things the focus is on covariances and correlations, the relationships between the variables. But there are occasions within an SEM context where we would be interested in the means of latent variables. There are two main areas where we would want to estimate latent means. The first is where we want to see whether there are differences between groups on a latent variable, and the second is where we're interested in change over time: if we've got a longitudinal data set, we might want to estimate the mean of the latent variable and see whether it is changing over time.

When we introduce means into our CFA, we do this by adding a constant to the model. Actually, when you fit models in modern SEM software this isn't a choice that the analyst has to make, it's done under the hood, but the process that is actually implemented is to add a constant which has the same value, one, for all cases in the model. The regression of a variable on the constant gives us the mean of that variable as the unstandardized coefficient of that regression, and, more generally, the mean of an observed variable is the total effect of the constant on that variable. The total effect, as we saw in video one, is the sum of the indirect and the direct effects. So if we introduce a constant, which in path diagrammatic notation is represented as a triangle, here with the number one inside the triangle to indicate that the constant is one, then in this path diagram we have a Y variable and an X variable; we have a direct effect from the constant to Y, which has the coefficient a; a direct effect from the constant to X, which is b; and a direct effect from X to Y, which is c. The indirect effect of the constant on Y is therefore the product of b and c. By adding in this constant we can estimate the mean of X, which is simply the coefficient b, and we can estimate the mean of Y by taking the sum of a and the product of b and c: that's the total effect, the sum of the direct and the indirect effects.

So that's how we introduce means into our model. If we've added a mean structure, we will require some additional identification restrictions, because we're now trying to estimate more unknown parameters, namely the latent means. There is then a question about how we estimate and compare one mean to another, and the way we do this is by having multiple groups. Where we have more than one group in our sample, we can fix the mean of a latent variable in one of those groups to be zero, and the means of the remaining groups on that latent variable are then estimated as differences from the reference group. So with mean models in CFA, one of the groups always has to have its mean restricted to zero, and the other groups are interpreted in terms of differences from that reference group.
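To make the arithmetic of the constant concrete, here is a tiny sketch with invented coefficient values, checking the "mean as total effect of the constant" rule by simulation.

```python
import numpy as np

# Path coefficients from the diagram (made-up values for illustration):
# a = constant -> Y, b = constant -> X, c = X -> Y.
a, b, c = 2.0, 3.0, 0.5

mean_x = b              # mean of X = direct effect of the constant on X
mean_y = a + b * c      # mean of Y = total effect = direct (a) + indirect (b*c)

# Quick simulation check of the same rule.
rng = np.random.default_rng(0)
x = b + rng.normal(0, 1, 100_000)             # X with mean b
y = a + c * x + rng.normal(0, 1, 100_000)     # Y generated from X plus error
print(mean_x, mean_y)                          # 3.0 and 3.5
print(round(x.mean(), 2), round(y.mean(), 2))  # close to 3.0 and 3.5
```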
When we've looked at path diagrams and thought about the relationship between concepts and indicators, between latent variables and observed variables, the arrow has pointed from the latent variable to the observed indicator. What this is saying in theoretical terms is that the latent variable causes the indicators; that's why the arrow points in that direction. We can think of that as meaning, if we're trying to measure, let's say, someone's social capital and we've asked lots of questions in a questionnaire, that what's actually causing their answers to those questions is their underlying level of social capital: the causal arrow points from the latent variable to the observed indicators. For many concepts that direction of causality makes sense; in other contexts, the idea that causality flows from the latent variable to the indicator doesn't really make sense. Let's think of an example where we want to measure socioeconomic status, and we're going to use indicators of someone's level of education, what kind of occupation they have, their earnings, and so on, and we want to combine these somehow into a latent variable that measures their socioeconomic status. What's problematic about this in the reflective-indicators framework is that it doesn't really make sense to say that I have some underlying socioeconomic status and that, if it were to change, my educational level, my earnings, or my occupation would change; actually the causality is flowing in the other direction, if there is any causality going on here at all. Someone's level of education influences their socioeconomic status, as do their earnings. So now we're in a situation where it makes more sense for the causality to flow from the indicator to the latent variable. The key question is: if we could somehow change someone's score on the latent variable, would it make sense for their scores on the observed indicators to change? For some concepts that makes sense; for others it doesn't. In the case where it doesn't make sense, we essentially turn the arrows round and make them point from the indicators to the latent variable, and in this context we've now got what we call formative indicators rather than reflective indicators. As I said, it's a different sort of latent variable that we're now dealing with: it's essentially a weighted index of the observed indicators, and it doesn't have a disturbance term, there's no error in it, so it's not the same kind of variable as we would have with reflective indicators. The key thing is that in the path diagram the arrows point from the indicators to the latent variable rather than the other way round. There are, of course, some quite different procedures for estimating this kind of model, but for now the concern is to understand the conceptual difference and the fact that the indicators are related differently to the latent variables.

Another common procedure in confirmatory factor analysis arises when a researcher has a very large number of indicators for a latent construct, or for a number of latent constructs. This is quite often the case in psychology, where there are quite complex latent variables and each one may have 10, 12, or more indicators. One of the problems that researchers run into with this kind of data is that the model can become extremely complex very quickly, and there are lots of difficulties with estimation, interpretation, and so on, simply because there are so many relationships in the observed data, given such a large number of indicators and latent variables. This is often combined with sometimes quite small sample sizes, which can add to the problem. When they are in this situation, researchers will sometimes use an approach called item parceling, in which a first stage is to add up the scores for subsets, subgroups, of the items, and those parcelled subgroups of items, or sub-scales, then act as the observed indicators for the latent variables. So this is a parsimonious way of treating rather complex data. It does rely on some assumptions about the unidimensionality of the items within each parcel, but it is an approach that researchers in that context, with many indicators for their latent variables and large numbers of latent variables, can use.
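A minimal sketch of that first, parcel-building stage in pandas; the file name and the item names q1 to q12 are hypothetical.

```python
import pandas as pd

# Hypothetical data file with twelve items, q1-q12, all measuring
# one rather complex construct.
df = pd.read_csv("survey.csv")

# First stage of item parceling: add up the scores for subsets of the items.
# The three parcel scores then act as the observed indicators of the latent
# variable in place of the twelve raw items.
parcels = pd.DataFrame({
    "parcel1": df[["q1", "q2", "q3", "q4"]].sum(axis=1),
    "parcel2": df[["q5", "q6", "q7", "q8"]].sum(axis=1),
    "parcel3": df[["q9", "q10", "q11", "q12"]].sum(axis=1),
})
print(parcels.head())

# Note: this presumes the items within each parcel are unidimensional,
# i.e. they all measure the same single underlying construct.
```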
Lastly, I'm going to talk about a kind of confirmatory factor model where the latent variables are not measured by observed indicators but are themselves measured by latent variables. We have a sort of hierarchical structure in which a first set of latent variables is measured using observed indicators, we have to have observed indicators at some point in the model, but once that first set of latent variables has been measured, a higher-order factor can be added which is a function of the first-stage latent variables. This is an approach which is often useful when our theories are not so much about the relationships between variables as about the dimensional structure of the data. For example, in psychology there are debates about the number of personality dimensions, and similarly for belief systems and so on; it's important to understand how many different dimensions there are, in addition to how those dimensions might be related to other variables. So for intelligence, personality, and so on, higher-order factor models can be useful, and they can also be applied in a longitudinal context. Here's what a path diagram for a confirmatory factor model with a higher-order structure looks like: at the bottom of the diagram we now have the observed variables in rectangles, there are nine of those, and each set of three measures a latent variable; the highest-level variable, eta 1, is then measured as a function of those three latent variables.

So in this third video I've looked at some of the important issues in confirmatory factor analysis. I started off with the general idea of using latent variables to measure the concepts in our theories. I've contrasted the historical, conventional approach of exploratory factor analysis, the unrestricted factor model, with the more modern confirmatory factor model, the restricted factor model. We've looked at how we give a metric, a scale, to latent variables by fixing one of the indicators to take the value 1 and therefore taking the scale from that reference item. We've thought about how we can analyze means within a confirmatory factor model: usually we're mainly focused on associations and correlations, but we can also estimate means. And we've looked at some special cases: where we have formative indicators rather than reflective indicators; where we have a first stage of item parceling when there are many indicators and a large number of latent variables; and we finished with the special case of a higher-order factor model, where a latent variable is measured not by observed items but by lower-level latent variables.
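As a closing sketch, the higher-order model just described could be written out in lavaan-style syntax, shown here with the third-party semopy package; the package choice and the variable names are assumptions, not something taken from the video.

```python
import semopy   # third-party package; its use and the names below are assumptions

# Higher-order CFA matching the path diagram described above: nine observed
# items in rectangles, each set of three measuring a first-order latent
# variable, and a single higher-order factor measured by those three.
model_desc = """
eta1 =~ x1 + x2 + x3
eta2 =~ x4 + x5 + x6
eta3 =~ x7 + x8 + x9
g    =~ eta1 + eta2 + eta3
"""

model = semopy.Model(model_desc)
# model.fit(data) would then estimate the loadings at both levels, with the
# metric set by reference indicators in the usual way.
```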