Sparse Selection Indices (SSI) using the SFSI R-package

In this section, we use data from the Wheat-small data set used in the manuscript to illustrate how to fit Sparse Selection Indices using the SFSI R-package (Lopez-Cruz et al. 2020).

Installing the package from GitHub

The following snippet shows how to install the package from GitHub.

rm(list = ls())     

# Install devtools package first
install.packages('devtools', repos='https://cran.r-project.org/')
  
# Install SFSI package from GitHub
devtools::install_git('https://github.com/MarcooLopez/SFSI')       

# Install BGLR package (needed to download the data)
install.packages('BGLR', repos='https://cran.r-project.org/')

Data preparation

To illustrate the use of the software we will use data from the Wheat-small data set which is available with the BGLR R-package (Perez and Campos 2014). The following code shows how to prepare data for environment 1; all the analyses hereinafter are based on this data.

library(SFSI)
library(BGLR)    # Load the packages
data(wheat, package="BGLR")     # Load data from the BGLR package

# Select the environment 1 to work with     
y <- as.vector(scale(wheat.Y[,1]))     

# Calculate G matrix       
G <- tcrossprod(scale(wheat.X))/ncol(wheat.X)      

# Save data
save(y, G, file="geno_pheno.RData")      

Heritability and variance components

Implementing the SSI requires an estimate of the heritability. We obtain this using a G-BLUP model \(y_i=\mu+u_i+\varepsilon_i\) with \(\varepsilon_i \overset{iid}{\sim} N(0,\sigma_\varepsilon^2)\) and \(\boldsymbol{u}\sim N(\textbf{0},\sigma_u^2 \boldsymbol{G})\). This model can be fitted with the fitBLUP function included in the SFSI R-package. The BGLR R-package can be also used to fit a Bayesian version of the model. The code below illustrates how to estimate heritability using the fitBLUP function.

load("geno_pheno.RData") # Load data 

# Fit model
fm0 <- fitBLUP(y, K=G) 
fm0$h2 <- fm0$varU/(fm0$varU+fm0$varE)  # Estimate heritability
c(fm0$varU,fm0$varE,fm0$h2)             # Variance components (varU,varE,h2)

save(fm0, file="varComps.RData")

Training-testing partitions

The code below produces training (trn, \(70\%\)) and testing (tst, \(30\%\)) partitions. The parameter nPart defines the number of partitions. The output is a matrix with nPart columns containing 1’s and 2’s indexing the observations that are assigned to the training and testing sets, respectively. The object is saved in the file partitions.RData and will be used in later analyses.

nPart <- 10                       # Number of partitions
load("geno_pheno.RData")          # Load data 

nTST <- ceiling(0.3*length(y))    # Number of elements in TST set
partitions <- matrix(1,nrow=length(y),ncol=nPart)   # Matrix to store partitions
seeds <- round(seq(1E3, .Machine$integer.max/10, length=nPart))

for(k in 1:nPart)
{   set.seed(seeds[k])
    partitions[sample(1:length(y), nTST),k] <- 2
}
save(partitions, file="partitions.RData")    # Save partitions

Accuracy of the G-BLUP and of the Sparse SI

The following script shows how to derive SSIs using the partitions above created. The weights of the SSI are computed using the SSI function for nLambda=100 values of \(\lambda\). The G-BLUP model is fitted for comparison using the fitBLUP function. Estimates of \(\mu\) and \(h^2\) are computed internally in the SSI function when these are not provided. These estimates obtained from the G-BLUP model will be passed to the SSI function to save time. Indices denoting training and testing sets are passed through the trn and tst parameters, respectively. The accuracy of the G-BLUP and SSI models are stored in the object accSSI, and saved in the file results_accuracy.RData.

# Load data
load("geno_pheno.RData"); load("varComps.RData"); load("partitions.RData")

accSSI <- mu <- h2 <- c()       # Objects to store results

for(k in 1:ncol(partitions))
{ cat("  partition = ",k,"\n")
  trn <- which(partitions[,k] == 1)
  tst <- which(partitions[,k] == 2)
  yNA <- y;   yNA[tst] <- NA
    
  # G-BLUP model
  fm1 <- fitBLUP(yNA, K=G)
  mu[k] <- fm1$b        # Retrieve mu estimate
  h2[k] <- fm1$h2       # Retrieve h2 estimate

  # Sparse SI
  fm2 <- SSI(y,K=G,b=mu[k],h2=h2[k],trn=trn,tst=tst,mc.cores=1,nLambda=100)
  fm3 <- summary(fm2)   # Useful function to get results
        
  accuracy <- c(GBLUP=cor(fm1$u[tst],y[tst]), fm3$accuracy)/sqrt(fm0$h2)
  lambda <- c(min(fm3$lambda),fm3$lambda)
  df <- c(max(fm3$df),fm3$df)
  accSSI <- rbind(accSSI,data.frame(rep=k,SSI=names(accuracy),accuracy,lambda,df))
}
save(mu,h2,accSSI,file="results_accuracy.RData")

Displaying Results

The following code creates a plot (as in Figure 1 in the manuscript) showing the estimated genetic prediction accuracy by values of the penalty parameter (in logarithmic scale). The rightmost point in the plot corresponds to the G-BLUP model (obtained when \(\lambda=0\)). The point at the peak denotes the maximum accuracy that was obtained by the SSI.

load("results_accuracy.RData")

dat <- data.frame(do.call(rbind,lapply(split(accSSI,accSSI$SSI),
         function(x) apply(x[,-c(1:2)],2,mean))))
dat$Model <- unlist(lapply(strsplit(rownames(dat),"\\."),function(x)x[1]))
 
dat2 <- do.call(rbind,lapply(split(dat,dat$Mod),function(x)x[which.max(x$acc),]))
ggplot(dat[dat$df>1,],aes(-log(lambda),accuracy)) + 
   geom_hline(yintercept=dat["GBLUP",]$accuracy, linetype="dashed") + 
   geom_line(aes(color=Model),size=1.1) + theme_bw() +
   geom_point(data=dat2,aes(color=Model),size=2.5)
alt text here

Cross-validating to obtain an optimal penalization

The snippet below can be used to perform, within each trn-tst partition, \(k\)-folds CV to get an ‘optimal’ value of \(\lambda\) within the training data, and then used to fit an SSI for the testing set. The CV is implemented using the SSI_CV function from the SFSI R-package for one 5-folds CV, this can be set by changing the nCV and nFolds parameters.

load("geno_pheno.RData");   load("varComps.RData")
load("partitions.RData");   load("results_accuracy.RData")

lambdaCV <- accSSI_CV <- dfCV <- c()   # Objects to store results

for(k in 1:ncol(partitions))
{   cat("  partition = ",k,"\n")
    trn <- which(partitions[,k] == 1)
    tst <- which(partitions[,k] == 2)
    
    # Cross-validating the training set  
    fm1 <- SSI_CV(y,K=G,trn.CV=trn,nLambda=100,mc.cores=1,nFolds=5,nCV=1)
    lambdaCV[k] <- summary(fm1)$optCOR["mean","lambda"]
      
    # Fit a SSI with the estimated lambda
    fm2 <- SSI(y,K=G,b=mu[k],h2=h2[k],trn=trn,tst=tst,lambda=lambdaCV[k])     
    
    accSSI_CV[k] <- summary(fm2)$accuracy/sqrt(fm0$h2)
    dfCV <- cbind(dfCV, fm2$df) 
}
save(accSSI_CV,lambdaCV,dfCV,file="results_accuracyCV.RData")

After running the above analysis, the following snippet can be run to create a plot (as in Figure 2 in the manuscript) comparing partition-wise the accuracy of the optimal SSI with that of the G-BLUP. The average accuracies are also shown in the plot.

load("results_accuracy.RData"); load("results_accuracyCV.RData")

dat <- data.frame(GBLUP=accSSI[accSSI$SSI=="GBLUP",]$acc,SSI=accSSI_CV)
rg <- range(dat)
tmp <- c(mean(rg),diff(rg)*0.4)

ggplot(dat,aes(GBLUP,SSI)) + geom_abline(slope=1,linetype="dotted") +
  geom_point(shape=21,color="orange") + xlim(rg) + ylim(rg) +
  annotate("text",tmp[1],tmp[1]-tmp[2],label=round(mean(dat$GBLUP),3)) +
  annotate("text",tmp[1]-tmp[2],tmp[1],label=round(mean(dat$SSI),3))
alt text here

The code below creates a plot (as in Figure 3 in the manuscript) showing the distribution of the number of points in the support set for the SSI, across all partitions.

load("results_accuracyCV.RData")

dat <- data.frame(df=as.vector(dfCV))

bw <- round(diff(range(dat$df))/40)
ggplot(data=dat,aes(df,stat(count)/length(dfCV))) +  theme_bw() +
    geom_histogram(color="gray20",fill="lightblue",binwidth=bw) +
    labs(x=bquote("Support set size(" *n[sup]*")"),y="Frequency")
alt text here

Subject-specific training sets

The next script can be used to create a plot (as in Figure 4 in the manuscript) showing (for a single trn-tst partition) the subset of points in the support set, for each individual being predicted. This plot can be made through the plotNet function from the SFSI package.

# Load data
load("geno_pheno.RData"); load("partitions.RData"); load("results_accuracyCV.RData")

part <- 1      # Choose any partition from 1,…,nPart
trn <- partitions[,part] == 1 
tst <- partitions[,part] == 2          

# Fit SSI with lambda previously estimated using CV
fm <- SSI(y,K=G,trn=trn,tst=tst,lambda=lambdaCV[part])

plotNet(fm,K=G,tst=fm$tst[1:25],single=FALSE,title=NULL,bg.col="white")  
alt text here

References

Lopez-Cruz, Marco, Eric Olson, Gabriel Rovere, Jose Crossa, Susanne Dreisigacker, Mondal Suchismita, Ravi Singh, and Gustavo de los Campos. 2020. Regularized selection indices for breeding value prediction using hyper-spectral image data.” Scientific Reports 10 (8195).
Perez, Paulino, and Gustavo de los Campos. 2014. Genome-wide regression and prediction with the BGLR statistical package.” Genetics 198 (2): 483–95.