
Using Canonical Correspondence Analysis on matrices with missing data




I have a matrix of sites where not all the environmental variables I want to assess were sampled. In other words, there are sites with the whole set of variables sampled, and there are other sites where just some variables were sampled. Does Canonical Correspondence Analysis work with missing data for the environmental variables? If it does, what would be the effect of not including the missing values?


First, with NA values you cannot fully analyze the pair-wise correlations between your environmental variables, and therefore cannot fully rule out covariation among them. If covarying variables are present, you will not be able to tell which of them is responsible for any trends in your data.

Second, I don't believe CCA will work with NA values -- you will either have to eliminate the observations containing those missing values or fill them in with column averages. However, both of these methods will have an impact on your results, so move forward cautiously.
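A minimal sketch of the two workarounds, assuming a community matrix `spp` and a data frame `env` of numeric environmental variables (hypothetical names):

```r
library(vegan)

# Option 1: drop sites with any missing environmental value (listwise deletion).
# The species matrix must be subset to the same rows.
complete <- complete.cases(env)
mod_drop <- cca(spp[complete, ] ~ ., data = env[complete, , drop = FALSE])

# Option 2: replace each missing value with its column mean.
# This shrinks variance and flattens gradients, so interpret with care.
env_imp <- as.data.frame(lapply(env, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}))
mod_imp <- cca(spp ~ ., data = env_imp)
```

Either way, compare the two fits: if they disagree strongly, the missingness itself is influencing your conclusions.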

Third, I wonder if CCA is even the way you want to go. nMDS (non-metric multidimensional scaling) is much less constrained than CCA. Additionally, it doesn't suffer from as many assumptions/limitations as CCA.
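If you go the unconstrained route, a sketch with vegan's metaMDS (same hypothetical `spp` and `env` objects): environmental vectors can then be overlaid post hoc with envfit, which drops NA rows per variable via `na.rm = TRUE`.

```r
library(vegan)

# Unconstrained ordination of the community data alone
ord <- metaMDS(spp, distance = "bray", k = 2, trymax = 50)

# Overlay environmental vectors after the fact; NA rows are
# removed variable-by-variable rather than site-by-site
fit <- envfit(ord, env, permutations = 999, na.rm = TRUE)
plot(ord)
plot(fit)
```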

From McCune & Grace (2002):

The following two questions can be used to decide whether CCA is appropriate: (1) Are you interested only in community structure that is related to your measured environmental variables? (2) Is a unimodal model of species responses to environment reasonable? If, for a specific problem, you answer yes to both of these, then CCA might be appropriate.

However, missing environmental data is still a problem in nMDS.


Biplot scores from canonical correspondence analysis

I'm using the R package vegan to perform canonical correspondence analysis (CCA). As input we have two matrices, one being (sites)x(species) and the other being (sites)x(conditions).

Sample data (and source of plot) are here.

Species loadings are easily accessed with summary(cca_model)$species. What I'm trying to find is the loadings for the explanatory variables, the conditions. The only summary I can find is biplot scores. Looking through the documentation for vegan I can't find any description of how they are computed. Can I sum them across CCA components to get an idea of how much they influence the data?
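In vegan the biplot scores can be pulled out directly with `scores()` rather than dug out of `summary()`; a sketch assuming a fitted model `cca_model`:

```r
library(vegan)

# Biplot scores for the constraining (environmental) variables,
# one row per variable, one column per constrained axis
bp <- scores(cca_model, display = "bp")

# Intraset correlations (constraints vs. linear-combination site
# scores), another per-axis measure of variable importance
ic <- intersetcor(cca_model)
```

Note that biplot scores give direction and per-axis association, not a decomposition of explained inertia, so summing them across components is not a meaningful importance measure; permutation tests (anova.cca with by = "terms") are the usual tool for that question.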

This is a biplot of two CCA components. The scores are used as coordinates for the arrows.

What are biplot scores in the context of CCA?

Can biplot scores be used to determine how much of an effect conditions have on the response variables?


Phylogenetic Tools for Comparative Biology

I just posted a new function to do phylogenetic canonical correlation analysis. Canonical correlation is a procedure whereby, given two sets of variables (say, a set of Xs and a set of Ys), we identify the orthogonal linear combinations of each that maximize the correlations between the sets. This type of analysis is most naturally used in an evolutionary study to analyze, say, a set of morphological variables and a set of environmental or ecological variables.

The phylogenetic version of this analysis takes the phylogeny and an (explicit or implicit) model of evolution into account to find the linear combinations of Xs and Ys that maximize the evolutionary correlations (that is, the inferred correlations of evolutionary changes) between the two sets (Revell & Harrison 2008).

The program is very simple. Direct link to the code is here. To use the function, first load the source:

Here, tree is a phylogenetic tree and X & Y are two data matrices containing values for one or multiple characters in columns and species in rows. Rows should be named by species.
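A hedged usage sketch: the function described here is phyl.cca(), which is now distributed in the phytools package, and `tree`, `X`, and `Y` are placeholders for your own objects.

```r
library(phytools)

# tree: an object of class "phylo"
# X, Y: numeric matrices, species in rows (row names matching tree tip labels)
result <- phyl.cca(tree, X, Y)

result$cor     # canonical correlations
result$xcoef   # canonical coefficients for the X set
result$ycoef   # canonical coefficients for the Y set
result$p       # p-values for the ith and all subsequent correlations
```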

The results are returned as a list with the following elements:

> result
$cor
[1] 0.3764753 0.1852836 0.1054606
$xcoef
CA1 CA2 CA3
[1,] 0.04497549 -0.09956576 -0.45926364
[2,] -0.18997199 0.46065246 -0.07810429
[3,] -0.42425815 -0.16063677 -0.18791902
[4,] 0.25374826 0.29822455 -0.06176255
$ycoef
CA1 CA2 CA3
[1,] -0.2704762 -0.3841450 0.1029158
[2,] -0.1048448 0.2502089 0.5860655
[3,] 0.3736474 -0.2580132 0.2743137
$xscores
CA1 CA2 CA3
1 0.27821077 -0.33344726 0.94985154
2 -0.23088044 0.78905936 0.26050453
3 -1.44525534 -0.22803129 -0.64071476
.
$yscores
CA1 CA2 CA3
1 0.55710619 -0.850905958 0.300282830
2 1.41482268 1.237829442 -0.446763906
3 -1.40453596 0.227361557 -0.964307876
.
$chisq
[1] 8.9531203 2.0752824 0.5032912
$p
[1] 0.7069293 0.9126462 0.7775202

Here, $cor is the set of canonical correlations; $xcoef & $ycoef are the canonical coefficients; $xscores & $yscores are the canonical scores, in terms of the original species; and $chisq & $p are χ² statistics with corresponding p-values. The p-values are properly interpreted as the probability that the ith and all subsequent correlations are zero.

A few years ago I released a C program that does more or less the same thing; however, there are a few differences.

1) My C program globally optimizes the λ parameter. I will add this to the present function promptly.

2) My C program first transforms the data into a phylogeny-free space, and then computes the canonical correlations. This means that, although the correlations are the same in both methods, the scores are no longer in terms of species and will differ from those returned by this function.


Description of the data

For our analysis example, we are going to expand example 1 about investigating the associations between psychological measures and academic achievement measures.

We have a data file, mmreg.dta, with 600 observations on eight variables. The psychological variables are locus_of_control, self_concept and motivation. The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science). Additionally, the variable female is a zero-one indicator variable, with one indicating a female student.


See Also

This help page describes two constrained ordination functions, cca and rda. A related method, distance-based redundancy analysis (dbRDA), is described separately (capscale). All these functions return similar objects (described in cca.object). There are numerous support functions that can be used to access the result object. In the list below, functions of type cca will handle all three constrained ordination objects, and functions of type rda only handle rda and capscale results.

The main plotting functions are plot.cca for all methods, and biplot.rda for RDA and dbRDA. However, generic vegan plotting functions can also handle the results. The scores can be accessed and scaled with scores.cca, and summarized with summary.cca. The eigenvalues can be accessed with eigenvals.cca and the regression coefficients for constraints with coef.cca. The eigenvalues can be plotted with screeplot.cca, and the (adjusted) R-squared can be found with RsquareAdj.rda. The scores can also be calculated for new data sets with predict.cca, which allows adding points to ordinations. The values of constraints can be inferred from ordination and community composition with calibrate.cca.

Diagnostic statistics can be found with goodness.cca, inertcomp, spenvcor, intersetcor, tolerance.cca, and vif.cca. Function as.mlm.cca refits the result object as a multiple lm object, and this allows finding influence statistics (lm.influence, cooks.distance, etc.).

Permutation-based significance for the overall model, single constraining variables, or axes can be found with anova.cca. Automatic model building with the R step function is possible with deviance.cca, add1.cca and drop1.cca. Functions ordistep and ordiR2step (for RDA) are special functions for constrained ordination. Randomized data sets can be generated with simulate.cca.
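A sketch of the usual testing-and-selection workflow built from these functions, with `spp` and `env` as hypothetical community and environmental data:

```r
library(vegan)

mod <- cca(spp ~ ., data = env)

anova(mod)                    # permutation test of the overall model
anova(mod, by = "terms")      # test each constraining variable sequentially
anova(mod, by = "axis")       # test each constrained axis

# Forward selection from the null model by permutation tests
mod0 <- cca(spp ~ 1, data = env)
sel  <- ordistep(mod0, scope = formula(mod))

# Adjusted R-squared for cca is estimated by permutation
RsquareAdj(mod, permutations = 99)
```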

Separate methods based on constrained ordination model are principal response curves ( prc ) and variance partitioning between several components ( varpart ).

Design decisions are explained in vignette on “Design decisions” which can be accessed with browseVignettes("vegan") .

Package ade4 provides alternative constrained ordination function pcaiv .


Scaling: Correspondence Analysis

5 Extensions

Although the primary application of CA is to a two-way contingency table, the method is regularly applied to analyze multiway tables, tables of preferences, ratings, as well as measurement data on ratio- or interval-level scales. For multiway tables there are two approaches. The first approach is to convert the table to a flat two-way table which is appropriate to the problem at hand. Thus, if a third variable is introduced into the example above, say ‘sex of respondent,’ then an appropriate way to flatten the three-way table would be to interactively code ‘country’ and ‘sex’ as a new row variable, with 23×2=46 categories, crosstabulated against the question responses. For each country there would now be a male and a female point and one could compare sexes and countries in this richer map. This process of interactive coding of the variables can continue as long as the data do not become too fragmented into interactive categories of very low frequency.

Another approach to multiway data, called multiple correspondence analysis (MCA), applies when there are several categorical variables skirting the same issue, often called ‘items.’ MCA is usually defined as the CA algorithm applied to an indicator matrix Z with the rows being the respondents or other sampling units, and the columns being dummy variables for each of the categories of all the variables. The data are zeros and ones, with the ones indicating the chosen categories for each respondent. The resultant map shows each category as a point and, in principle, the position of each respondent as well. Alternatively, one can set up what is called the Burt matrix (named after the psychologist Sir Cyril Burt), B = ZᵀZ, the square symmetric table of all two-way crosstabulations of the variables, including the crosstabulation of each variable with itself. The Burt matrix is reminiscent of a covariance matrix, and the CA of the Burt matrix can be likened to a PCA of a covariance matrix. The analyses of the indicator matrix Z and the Burt matrix B give equivalent standard coordinates for the category points, but slightly different scalings in the principal coordinates, since the principal inertias of B are the squares of those of Z.
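The construction of Z and B can be checked directly in R; a minimal sketch with two hypothetical categorical variables measured on the same respondents:

```r
# Two categorical variables on four respondents
df <- data.frame(q1 = factor(c("a", "b", "a", "c")),
                 q2 = factor(c("x", "x", "y", "y")))

# Indicator (dummy) matrix Z: one 0/1 column per category of every
# variable; disabling contrasts keeps all categories, not k - 1
Z <- model.matrix(~ q1 + q2 - 1, data = df,
                  contrasts.arg = lapply(df, contrasts, contrasts = FALSE))

# Burt matrix: all two-way crosstabulations, including each
# variable crosstabulated with itself on the diagonal blocks
B <- t(Z) %*% Z
```

Applying the CA algorithm to Z gives MCA; applying it to B gives the equivalent standard coordinates with squared principal inertias, as described above.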

A variant of MCA called joint correspondence analysis (JCA) avoids the fitting of the tables on the diagonal of the Burt matrix, which is analogous to least-squares factor analysis.

As far as other types of data are concerned, namely rankings, ratings, paired comparisons, ratio-scale, and interval-scale measurements, the key idea is to recode the data in a form which justifies the basic constructs of CA, namely profile, mass, and chi-squared distance. For example, in the analysis of rankings, or preferences, applying the CA algorithm to the original rankings of a set of objects by a sample of subjects is difficult to justify, because there is no reason why weight should be accorded to an object in proportion to its average ranking. A practice called doubling resolves the issue by adding either an ‘anti-object’ for each ranked object or an ‘anti-subject’ for each responding subject, in both cases with rankings in the reverse order. This addition of apparently redundant data leads to CA effectively performing different variants of principal components analysis on the original rankings.

A recent finding by Carroll et al. ( 1997 ) is that CA can be applied to a square symmetric matrix of squared distances, transformed by subtracting each squared distance from a constant which is substantially larger than the largest squared distance in the table. This yields a solution which approximates the classical scaling solution of the distance matrix.

All these extensions of CA conform closely to Benzécri's original conception of CA as a universal technique for exploring many different types of data through operations such as doubling or other judicious transformations of the data.

The latest developments on the subject, including discussions of sampling properties of CA solutions and a comprehensive reference list, may be found in the volumes edited by Greenacre and Blasius ( 1994 ) and Blasius and Greenacre ( 1998 ).


Quality statistics in canonical correspondence analysis

Canonical correspondence analysis is an important multivariate tool in ecology. A key aspect of the analysis is the representation of species optima, where these optima are estimated by the weighted averages of the species with respect to environmental variables. This article shows that, strictly speaking, canonical correspondence analysis does not optimize the representation of the species optima but the inertia of the abundance matrix under linear constraints. It is argued that the eigenvalues obtained in the analysis, usually reported in applied studies, are a measure of the quality of the display of the abundance matrix, and only indicate the quality of representation of the species optima when environmental variables are uncorrelated. In practice, environmental variables are often correlated. Thus, additional quality statistics are needed to express how well the species optima are represented. In this article we derive quality statistics for the representation of the species optima and the environmental variables, and use artificial and empirical data to illustrate their use. Copyright © 2001 John Wiley & Sons, Ltd.


Using Canonical Correspondence Analysis on matrices with missing data - Biology

One alternative would be to use a similar approach but to replace the calculation of the correlation matrix with something more suitable, and then to project the matrix to lower dimensions. This idea has led to one of the most productive and widely-used methods in the history of multivariate analysis in ecology --- canonical correspondence analysis or CCA. Just as RDA relates to PCA, CCA relates to CA. That is, (1) start with a Chi-square vegetation matrix [(actual - predicted)/sqrt(predicted)], (2) regress the differences from expectation on environmental variables to get fitted values, using a weighted regression where total abundance by plots is used as the weights, and (3) calculate the Euclidean distance of the fitted vegetation matrix and project by eigen-analysis. The importance of specific environmental variables is then assessed by their correlation to the projected scatter diagram.
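The three steps can be sketched by hand (in practice one calls vegan's cca()); `Y` is a hypothetical sites-by-species abundance matrix and `X` a sites-by-environment matrix, and the sketch omits some refinements such as weighted standardization of the predictors:

```r
# Step 1: Chi-square standardized matrix of departures from expectation
P    <- Y / sum(Y)
r    <- rowSums(P)                 # row (site) masses
c    <- colSums(P)                 # column (species) masses
Qbar <- (P - r %o% c) / sqrt(r %o% c)

# Step 2: weighted regression of Qbar on the environmental variables,
# with row weights given by the site masses
Xs  <- scale(X)
H   <- sqrt(r) * Xs                              # weight the rows
fit <- H %*% solve(t(H) %*% H, t(H) %*% Qbar)    # fitted values

# Step 3: eigen-analysis (via SVD) of the fitted matrix gives the
# constrained axes; squared singular values are the eigenvalues
dec <- svd(fit)
eig <- dec$d^2
```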

As with CA, there are several algorithms available to calculate CCA. The approach outlined above follows the Legendre and Legendre (1988) approach. Ter Braak (19xx) describes an algorithm based on reciprocal averaging that is employed by the popular program CANOCO. The result is the same either way.

In addition, there is also more than one S-Plus/R algorithm to compute CCA. Stephane Dray contributed CAIV, while Jari Oksanen contributed a cca() function as part of his vegan package (version 1-3.9 or later). The two differ slightly in the conventions for scaling the results. Because the vegan cca() function returns results identical to CANOCO, and because we already load the vegan library, we will use the vegan cca() function. However, to keep the plots produced by cca() more comparable to those we have produced from other programs, we will replace the plotting routines supplied with the vegan cca() function with others.

Running cca()

To calculate a CCA, select those environmental variables you have reason to believe are important, and enter them into the cca() function in formula notation, just like we did for GLMs and GAMS. The full taxon matrix goes on the left-hand side of the equation, with selected environmental variables on the right.
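With hypothetical objects `veg` (the full taxon matrix) and `site` (a data frame holding the three environmental variables used below), the call looks like:

```r
library(vegan)

# Full taxon matrix on the left, selected environmental
# variables on the right, just as in glm()/gam() formulas
mod <- cca(veg ~ elev + slope + av, data = site)

mod           # prints total, constrained, and unconstrained inertia
summary(mod)  # eigenvalues, species/site scores, and biplot scores
```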

In this particular example, the CCA was not very successful. Only 0.6975/10.8656 or 0.064 of the total variability was captured in the CCA. Clearly, the weighted regression step was not very successful at capturing the variability in vegetation composition, but after glm() and gam() we should not be too surprised.

The next set of lines gives the eigenvalues associated with the projection. The top line gives the "constrained" eigenvalues. Because we only had three variables in our environmental dataframe, we can only have three constrained eigenvalues; the three values sum to 0.69755.

Plotting the CCA

As for CA, the species are shown as red crosses and samples as black circles. In this analysis, the first axis is associated with increasing elevation, while the second axis is associated with decreasing slope and increasing aspect value (av).

As you can see, the species are pretty well condensed in the center of the ordination. To get a better look, we can specify "scaling=1" to mean "samples as weighted averages of the species."
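A sketch of replotting under the alternative scaling, assuming the fitted model object `mod` from above:

```r
# Default is scaling = 2 (species scores scaled); scaling = 1 scales
# sites as weighted averages of the species, spreading the samples out
plot(mod, scaling = 1)

# The same scaling applies when extracting scores directly
scores(mod, scaling = 1, display = c("sites", "species"))
```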

Package vegan supplies a number of graphing functions for ordiplots, including points() and identify(). We can use the identify() function to identify specific samples or species. Depending on whether you want a clearer picture of samples or species, you can plot using the appropriate scaling, and then use the identification functions with the same scaling.

Adding Categorical Variables to the Analysis

Notice how different this plot is from the first. While the total variability explained did not increase very much (and it can't go down with an increase in degrees of freedom), regressing the vegetation against topographic position in addition to the other variables results in a quite different perspective on the variability. Each possible topographic position is plotted at the centroid of the samples of that type, shown as an "X". To find out which one is which, look at the last element of the summary of the cca object.

Discussion

Ancillary Functions

"The functions find statistics that resemble ‘deviance’ and ‘AIC’ in constrained ordination. Actually, constrained ordination methods do not have a log-Likelihood, which means that they cannot have AIC and deviance. Therefore you should not use these functions, and if you use them, you should not trust them. If you use these functions, it remains as your responsibility to check the adequacy of the result."

The function below does not make use of log-likelihood directly, but rather employs a rather brutish permutation approach and tests whether adding a variable explains more inertia than expected at random. Nonetheless, I'm sure Jari disapproves and I include it here for whatever good it might serve.
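The function itself is not reproduced here, but the permutation idea can be sketched as follows (hypothetical `spp`/`env` data with a candidate variable `newvar`; in current vegan, `anova(mod, by = "terms")` or add1.cca do this job properly):

```r
library(vegan)

# Observed extra inertia explained by adding "newvar"
base <- cca(spp ~ elev + slope, data = env)
full <- cca(spp ~ elev + slope + newvar, data = env)
obs  <- full$CCA$tot.chi - base$CCA$tot.chi

# Null distribution: shuffle newvar across sites and refit
perm <- replicate(999, {
  env2 <- env
  env2$newvar <- sample(env2$newvar)
  f2 <- cca(spp ~ elev + slope + newvar, data = env2)
  f2$CCA$tot.chi - base$CCA$tot.chi
})

# One-sided permutation p-value
p <- (sum(perm >= obs) + 1) / (999 + 1)
```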


Canonical correspondence analysis is a technique developed, I believe, by the community ecology people. A founding paper is Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis by Cajo J.F. Ter Braak (1986). The method involves a canonical correlation analysis and a direct gradient analysis. The idea is to relate the prevalences of a set of species to a collection of environmental variables.

Traditionally CCA (correlation) seeks to find that linear combination of the X variables and that linear combination of the Y variables that have the greatest correlation with each other. It relies on the eigendecomposition of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, where the $\Sigma$ matrices are correlation matrices of the variables. See Mardia, Kent and Bibby (Multivariate Analysis).

CCA thus assumes a linear relationship between the two sets of variables. The correspondence analysis assumes a different relationship: the species have a Gaussian distribution along a direction determined by the environmental factors.
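The linear canonical correlation step can be reproduced with base R's cancor(); a sketch on random placeholder data:

```r
set.seed(1)
X <- matrix(rnorm(50 * 3), 50, 3)   # e.g., morphological variables
Y <- matrix(rnorm(50 * 2), 50, 2)   # e.g., environmental variables

cc <- cancor(X, Y)
cc$cor     # canonical correlations, one per pair of canonical variates
cc$xcoef   # linear combinations of the X variables
cc$ycoef   # linear combinations of the Y variables
```

Note the symmetry: swapping X and Y returns the same correlations, which is exactly the property that canonical correspondence analysis gives up in order to explain species in terms of environment.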

Note that CCA is symmetric in the X variables and the Y variables. Correspondence analysis presumes no symmetry, since we want to explain the species in terms of their environment - not the other way around.


Integrative analysis of two data sets

One-table dimension reduction methods have been extended to the EDA of two matrices and can simultaneously decompose and integrate a pair of matrices that measure different variables on the same observations (Table 3). Methods include generalized SVD [42], Co-Inertia Analysis (CIA) [43, 44], sparse or penalized extensions of Partial Least Squares (PLS), Canonical Correspondence Analysis (CCA) and Canonical Correlation Analysis (CCA) [36, 45-47]. Note that both canonical correspondence analysis and canonical correlation analysis are referred to by the acronym CCA. Canonical correspondence analysis is a constrained form of CA that is widely used in ecological statistics [46]; however, it has yet to be adopted by the genomics community in the analysis of pairs of omics data. By contrast, several groups have applied extensions of canonical correlation analysis to omics data integration. Therefore, in this review, we use CCA to describe canonical correlation analysis.


