Explaining Missing Heritability Using Gaussian Process Regression (Reader’s Digest)


The paper “Explaining Missing Heritability Using Gaussian Process Regression” by Sharp et al. tackles the problem of missing heritability and the detection of higher-order interaction effects through Gaussian process regression, a technique widely used in the machine learning community. The authors obtained estimates of broad-sense heritability for a number of mouse and yeast phenotypes using an RBF kernel that models higher-order interactions, and found these estimates to be significantly larger than the narrow-sense heritability of the same phenotypes. The authors also detected several loci displaying interaction effects.



In genetics, phenotypes are modeled by the following equation:

$$y_i = f(x_i) + u_i + \epsilon_i$$

where $y_i$ is the phenotype measurement of the i-th individual, $x_i$ the genotype vector, $u_i$ a random effect term that captures relatedness among individuals, and $\epsilon_i$ the environmental noise. Here, $f$ is a function that maps the genotype vector to a real number. Under this model, heritability is defined as the proportion of variance in $y_i$ that is due to variation of $f(x_i)$:

$$\text{heritability} = \frac{\mathrm{Var}[f(x_i)]}{\mathrm{Var}[y_i]}$$
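To make the definition concrete, here is a minimal simulation sketch of heritability as the share of phenotype variance explained by the genetic function. The genetic function, effect sizes, allele frequency, and sample sizes are all hypothetical choices made for illustration, and the relatedness term is left out for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 10  # individuals and SNPs (illustrative sizes)

# Genotypes coded as 0/1/2 minor-allele counts (assumed allele frequency 0.3).
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)

# Hypothetical genetic function: additive effects plus one pairwise interaction.
beta = rng.normal(0.0, 0.5, size=p)
f = X @ beta + 0.8 * X[:, 0] * X[:, 1]

# Environmental noise; the relatedness random effect is omitted in this sketch.
eps = rng.normal(0.0, 1.0, size=n)
y = f + eps

# Heritability: variance of the genetic component over variance of the phenotype.
h2 = f.var() / y.var()
print(f"simulated heritability: {h2:.3f}")
```

Because the simulated function contains an interaction term, this ratio corresponds to the broad-sense quantity; a purely linear function would give the narrow-sense one.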

Different flavors of heritability exist based on the complexity of the function $f$ and the input that goes into $f$. In general, geneticists work with four types of heritability, as listed below.

  • Broad-sense heritability $H^2$: Broad-sense heritability is the proportion of variance in phenotypes that is due to all genetic variation, including both additive and epistatic effects. For $H^2$, the function $f$ can be any function that incorporates any order of interactions between genetic variants. This is the most general definition of heritability.
  • Narrow-sense heritability $h^2$: Narrow-sense heritability is the proportion of variance in phenotypes that is due to all additive genetic effects. For $h^2$, the function $f$ is a linear function that takes in first-order terms.
  • SNP heritability $h^2_{\mathrm{SNP}}$: SNP heritability is the proportion of variance in phenotypes that is due to additive genetic effects of a given set of SNPs. For $h^2_{\mathrm{SNP}}$, the function $f$ is a linear function that takes in a fixed set of SNPs.
  • GWAS heritability $h^2_{\mathrm{GWAS}}$: GWAS heritability is the proportion of variance in phenotypes that is due to additive genetic effects of GWAS hits. For $h^2_{\mathrm{GWAS}}$, the function $f$ is a linear function that takes in GWAS hits only.

Based on the definitions of the four flavors of heritability, it follows that $h^2_{\mathrm{GWAS}} \le h^2_{\mathrm{SNP}} \le h^2 \le H^2$. The missing heritability problem often refers to the gap between $h^2_{\mathrm{GWAS}}$ and the narrow- or broad-sense heritability.

Gaussian Process Regression

Parametric regression problems often involve a function $f$, governed by a set of parameters $\theta$, that maps each input $x_i$ to a response. For example, in Poisson regression with $f(x_i) = \exp(\theta^\top x_i)$, the distribution of the response variable is characterized by the mean parameter $f(x_i)$ and the probability mass function of the Poisson distribution:

$$p(y_i \mid x_i, \theta) = \frac{f(x_i)^{y_i} \, e^{-f(x_i)}}{y_i!}$$

Gaussian Process Regression is different from parametric regression in that one does not assume any parametric form for the function $f$. Instead, a Gaussian Process prior assumes that the function values $(f(x_1), \dots, f(x_n))$, for a number of inputs $x_1, \dots, x_n$, follow a multivariate normal distribution $\mathcal{N}(0, K)$, where $K$ is the kernel matrix with entries $K_{ij} = k(x_i, x_j)$. The kernel matrix measures the similarity between samples and constrains the possible space of $f$. Because the only constraint on the kernel function $k(\cdot, \cdot)$ is that the covariance matrix $K$ is positive semidefinite, Gaussian Process Regression can model a broad range of functions.
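As a rough illustration of what a Gaussian Process prior means, the sketch below draws a few random functions from a zero-mean prior over a toy one-dimensional input. The RBF kernel, the length scale, the grid of inputs, and the jitter value are assumptions chosen for the example, not settings from the paper.

```python
import numpy as np

def rbf_kernel(X, length_scale=1.0):
    """Kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * length_scale^2))."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

rng = np.random.default_rng(1)
X = np.linspace(0.0, 5.0, 50)[:, None]  # 50 toy inputs on a line
K = rbf_kernel(X)

# A small jitter on the diagonal keeps the Cholesky factorization stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(X)))

# Each column is one draw of (f(x_1), ..., f(x_n)) from N(0, K).
f_samples = L @ rng.standard_normal((len(X), 3))
print(f_samples.shape)
```

Nearby inputs receive similar function values because their kernel entries are close to 1, which is how the kernel restricts the space of plausible functions.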

The following is a list of kernel functions that are widely used (credit to Wikipedia),

  • Linear kernel: $k(x, x') = x^\top x'$
  • Polynomial kernel: $k(x, x') = (x^\top x' + c)^d$
  • RBF kernel: $k(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$
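The three kernels above can be sketched in a few lines of NumPy. The function names, default parameter values, and the simulated genotype matrix are illustrative assumptions; the eigenvalue check simply confirms numerically that each kernel matrix is positive semidefinite, up to rounding error.

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def polynomial_kernel(X, Z, c=1.0, d=2):
    return (X @ Z.T + c) ** d

def rbf_kernel(X, Z, sigma=1.0):
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

rng = np.random.default_rng(2)
X = rng.binomial(2, 0.4, size=(20, 5)).astype(float)  # 20 individuals, 5 SNPs

min_eigs = {}
for name, kern in [("linear", linear_kernel),
                   ("polynomial", polynomial_kernel),
                   ("rbf", rbf_kernel)]:
    K = kern(X, X)
    # eigvalsh is appropriate here because K is symmetric.
    min_eigs[name] = np.linalg.eigvalsh(K).min()
    print(f"{name:10s} smallest eigenvalue: {min_eigs[name]:.2e}")
```

The linear kernel corresponds to a purely additive model, while the polynomial and RBF kernels also capture interactions between SNPs.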

Applying Gaussian Process Regression

The Kernel Function

Specifying the kernel function is a fundamental step of Gaussian Process Regression. An appropriate kernel allows one to model interactions of any order among genetic variants. In the Sharp et al. paper, the authors proposed a generalized version of the RBF kernel to measure similarity between two individuals $x_i$ and $x_j$ across the genotypes of $p$ SNPs,

$$k(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{m=1}^{p} \frac{(x_{im} - x_{jm})^2}{\ell_m^2}\right)$$

where $\sigma_f$ is a parameter that governs the overall similarity between $x_i$ and $x_j$, and $\ell_m$ reflects the contribution of SNP $m$ to the variation of the phenotype: a large $\ell_m$ suggests that SNP $m$ contributes little to the variation of the phenotype, and a small $\ell_m$ implies a significant contribution. By examining the magnitude of the hyperparameters $\ell_m$, one can infer whether a genetic locus contributes significantly to the trait.
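The role of the per-SNP hyperparameters can be seen in a small sketch. It assumes the kernel takes the common per-feature length-scale form $\sigma_f^2 \exp(-\tfrac{1}{2}\sum_m (x_{im} - x_{jm})^2 / \ell_m^2)$; the function name `ard_rbf_kernel`, the genotypes, and the length-scale values are hypothetical.

```python
import numpy as np

def ard_rbf_kernel(X, Z, sigma_f, length_scales):
    """RBF kernel with one length scale per SNP (assumed form, see lead-in)."""
    Xs = X / length_scales
    Zs = Z / length_scales
    sq_dists = ((Xs[:, None, :] - Zs[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dists)

x_a = np.array([[0.0, 0.0]])
x_b = np.array([[0.0, 2.0]])  # differs from x_a only at SNP 2
x_c = np.array([[2.0, 0.0]])  # differs from x_a only at SNP 1

# Small length scale for SNP 1 (relevant), large for SNP 2 (nearly ignored).
length_scales = np.array([0.5, 100.0])

k_ab = ard_rbf_kernel(x_a, x_b, 1.0, length_scales)[0, 0]
k_ac = ard_rbf_kernel(x_a, x_c, 1.0, length_scales)[0, 0]
print(f"similarity when only SNP 2 differs: {k_ab:.4f}")
print(f"similarity when only SNP 1 differs: {k_ac:.4f}")
```

Two individuals who differ only at the SNP with the large length scale remain almost perfectly similar, so that SNP has almost no influence on the predicted phenotype.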

Sparsity-Inducing Priors

Overfitting may occur when the number of parameters to estimate is larger than the number of observations. To avoid overfitting and improve the parsimony of the model, the authors imposed a Gamma prior over the inverse of $\ell_m$, $\tau_m = 1/\ell_m$. The Gamma prior has density function

$$p(\tau_m \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \, \tau_m^{\alpha - 1} e^{-\beta \tau_m}$$

Setting $\alpha \le 1$ removes any interior mode in the density function, resulting in a monotonically decreasing function with most of its mass concentrated around 0 (see figure below), which pushes most of the $\tau_m$ toward zero.

[Figure: density of the Gamma distribution]
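The shape of the prior can be checked numerically. This sketch evaluates the standard Gamma density with shape $\alpha$ and rate $\beta$; the specific values ($\alpha = 0.5$ for the sparse case, $\alpha = 3$ for contrast, $\beta = 1$) are assumptions for illustration and are not taken from the paper.

```python
import math
import numpy as np

def gamma_pdf(t, alpha, beta):
    """Gamma density with shape alpha and rate beta, computed in log space."""
    log_p = (alpha * math.log(beta) - math.lgamma(alpha)
             + (alpha - 1.0) * np.log(t) - beta * t)
    return np.exp(log_p)

t = np.linspace(0.01, 5.0, 500)
dens_sparse = gamma_pdf(t, alpha=0.5, beta=1.0)  # shape below 1: no interior mode
dens_mode = gamma_pdf(t, alpha=3.0, beta=1.0)    # shape above 1: mode at (alpha - 1) / beta

print("shape 0.5 is monotonically decreasing:", bool(np.all(np.diff(dens_sparse) < 0)))
print("shape 3.0 peaks near t =", round(float(t[np.argmax(dens_mode)]), 2))
```

With a shape parameter below 1 the density keeps rising as the argument approaches zero, which is what favors values near zero for most SNPs.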

Posterior Distribution of the Parameters

A Gaussian Process prior allows one to integrate analytically over the space of $f$, resulting in a posterior for the parameters $\theta$,

$$p(\theta \mid y, X) \propto p(\theta) \int p(y \mid f, \theta) \, p(f \mid X, \theta) \, df$$

where $p(\theta)$ incorporates the sparsity-inducing priors. The integration step effectively averages over all possible $f(\cdot)$, removing the need to estimate each instance of $f$ separately. This step also increases power to detect loci that contribute to phenotypes.
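A minimal sketch of the resulting unnormalized log posterior is shown below. It assumes Gaussian noise, so that integrating out $f$ gives a multivariate normal marginal likelihood, a per-SNP RBF kernel parameterized by inverse length scales, and a Gamma prior on those inverse length scales. The relatedness term is omitted, and all function names, prior settings, and simulated data are hypothetical.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """log N(y; 0, K + noise_var * I), i.e. f integrated out analytically."""
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ a - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2.0 * np.pi)

def log_gamma_prior(tau, alpha=0.5, beta=1.0):
    """Gamma log density on inverse length scales, up to an additive constant."""
    return np.sum((alpha - 1.0) * np.log(tau) - beta * tau)

def log_posterior(tau, sigma_f, noise_var, X, y):
    Xs = X * tau  # scaling by 1 / length scale
    sq_dists = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(axis=-1)
    K = sigma_f**2 * np.exp(-0.5 * sq_dists)
    return log_marginal_likelihood(K, y, noise_var) + log_gamma_prior(tau)

rng = np.random.default_rng(3)
X = rng.binomial(2, 0.4, size=(50, 4)).astype(float)
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.5, size=50)  # only SNPs 1 and 2 matter
y = y - y.mean()

# Same set of values, assigned to the causal SNPs or to the irrelevant ones.
lp_relevant = log_posterior(np.array([1.0, 1.0, 0.01, 0.01]), 1.0, 0.25, X, y)
lp_wrong = log_posterior(np.array([0.01, 0.01, 1.0, 1.0]), 1.0, 0.25, X, y)
print(f"causal SNPs switched on:     {lp_relevant:.1f}")
print(f"irrelevant SNPs switched on: {lp_wrong:.1f}")
```

Both settings have the same prior value, so any difference in the log posterior comes from the marginal likelihood alone.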

There is no analytical solution for the posterior mode or mean of θ. However, a sampling-based approach (e.g. MCMC) can start from an initial point and move toward regions of high posterior probability, yielding samples from the posterior. In the Sharp et al. paper, Hybrid Monte Carlo (also known as Hamiltonian Monte Carlo), which simulates a particle’s trajectory, was used to make inference over θ.
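To show the mechanics of the sampler, here is a generic Hybrid Monte Carlo sketch with leapfrog integration and a Metropolis accept step. It is run on a toy two-dimensional standard normal target rather than the actual posterior over θ, and the step size, trajectory length, and function name `hmc_sample` are assumptions, not the authors' settings.

```python
import numpy as np

def hmc_sample(log_prob, grad_log_prob, theta0, n_samples=2000,
               step=0.15, n_leapfrog=10, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_samples, theta.size))
    for s in range(n_samples):
        r0 = rng.standard_normal(theta.size)  # fresh momentum for the particle
        q, r = theta.copy(), r0.copy()
        # Leapfrog integration of the particle's trajectory.
        r = r + 0.5 * step * grad_log_prob(q)
        for i in range(n_leapfrog):
            q = q + step * r
            if i < n_leapfrog - 1:
                r = r + step * grad_log_prob(q)
        r = r + 0.5 * step * grad_log_prob(q)
        # Metropolis correction based on the change in total energy.
        h_old = -log_prob(theta) + 0.5 * r0 @ r0
        h_new = -log_prob(q) + 0.5 * r @ r
        if np.log(rng.uniform()) < h_old - h_new:
            theta = q
        samples[s] = theta
    return samples

def log_prob(theta):
    return -0.5 * theta @ theta  # toy target: standard normal

def grad_log_prob(theta):
    return -theta

samples = hmc_sample(log_prob, grad_log_prob, theta0=np.array([3.0, -3.0]))
kept = samples[200:]  # discard the early draws as burn-in
print("posterior mean estimate:", kept.mean(axis=0))
print("posterior std estimate: ", kept.std(axis=0))
```

In the setting of the paper, `log_prob` would be the unnormalized log posterior of the kernel hyperparameters and `grad_log_prob` its gradient.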

Estimating Broad-Sense Heritability

Once the parameters $\theta$ are estimated, one can use these estimates to quantify broad-sense heritability from the Gaussian Process Regression model. The basic idea is as follows:

  • For each sample in the training data, one first predicts its phenotype using the estimated parameters.
  • The variance of the predicted phenotype can be found analytically using the conditional distribution of multivariate normal.
  • The ratio between the sum of each individual’s variance and the phenotype variance gives the broad-sense heritability.
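The steps above can be sketched as follows. This is one plausible reading, not necessarily the authors' exact estimator: it computes the Gaussian Process posterior of $f$ at the training individuals, takes the posterior expectation of the sample variance of $f$ as the genetic variance, and divides by the phenotype variance. The kernel form, the hand-fixed hyperparameters (standing in for posterior estimates), and the simulated data are all assumptions.

```python
import numpy as np

def ard_rbf_kernel(X, sigma_f, length_scales):
    Xs = X / length_scales
    sq_dists = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dists)

def posterior_of_f(K, y, noise_var):
    """Conditional multivariate normal of f at the training inputs given y."""
    n = len(y)
    S = np.linalg.solve(K + noise_var * np.eye(n), np.column_stack([y, K]))
    mean = K @ S[:, 0]      # K (K + noise_var I)^-1 y
    cov = K - K @ S[:, 1:]  # K - K (K + noise_var I)^-1 K
    return mean, cov

rng = np.random.default_rng(4)
n, p = 200, 5
X = rng.binomial(2, 0.4, size=(n, p)).astype(float)
f_true = X[:, 0] * X[:, 1] - X[:, 2]  # hypothetical genetic function
y = f_true + rng.normal(0.0, 1.0, size=n)
y = y - y.mean()

# Hyperparameters fixed by hand here, in place of estimated values.
K = ard_rbf_kernel(X, sigma_f=1.0,
                   length_scales=np.array([1.0, 1.0, 1.0, 50.0, 50.0]))
mean, cov = posterior_of_f(K, y, noise_var=1.0)

# Posterior expectation of the sample variance of f across individuals.
genetic_var = mean.var() + (np.trace(cov) - cov.sum() / n) / n
H2 = genetic_var / y.var()
print(f"estimated broad-sense heritability: {H2:.3f}")
print(f"simulated value:                    {f_true.var() / (f_true + (y - f_true)).var():.3f}")
```

Because the kernel can represent interactions, the estimate includes the variance contributed by the interaction between the first two SNPs, which a linear model would miss.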