{"title": "Beyond normality: Learning sparse probabilistic graphical models in the non-Gaussian setting", "book": "Advances in Neural Information Processing Systems", "page_first": 2359, "page_last": 2369, "abstract": "We present an algorithm to identify sparse dependence structure in continuous and non-Gaussian probability distributions, given a corresponding set of data. The conditional independence structure of an arbitrary distribution can be represented as an undirected graph (or Markov random field), but most algorithms for learning this structure are restricted to the discrete or Gaussian cases. Our new approach allows for more realistic and accurate descriptions of the distribution in question, and in turn better estimates of its sparse Markov structure. Sparsity in the graph is of interest as it can accelerate inference, improve sampling methods, and reveal important dependencies between variables. The algorithm relies on exploiting the connection between the sparsity of the graph and the sparsity of transport maps, which deterministically couple one probability measure to another.", "full_text": "Beyond normality: Learning sparse probabilistic\n\ngraphical models in the non-Gaussian setting\n\nRebecca E. Morrison\n\nMIT\n\nRicardo Baptista\n\nMIT\n\nYoussef Marzouk\n\nMIT\n\nrmorriso@mit.edu\n\nrsb@mit.edu\n\nymarz@mit.edu\n\nAbstract\n\nWe present an algorithm to identify sparse dependence structure in continuous\nand non-Gaussian probability distributions, given a corresponding set of data. The\nconditional independence structure of an arbitrary distribution can be represented\nas an undirected graph (or Markov random \ufb01eld), but most algorithms for learning\nthis structure are restricted to the discrete or Gaussian cases. Our new approach\nallows for more realistic and accurate descriptions of the distribution in question,\nand in turn better estimates of its sparse Markov structure. 
Sparsity in the graph is of interest as it can accelerate inference, improve sampling methods, and reveal important dependencies between variables. The algorithm relies on exploiting the connection between the sparsity of the graph and the sparsity of transport maps, which deterministically couple one probability measure to another.

1 Undirected probabilistic graphical models

Given n samples from the joint probability distribution of some random variables X1, . . . , Xp, a common goal is to discover the underlying Markov random field. This field is specified by an undirected graph G, comprising the set of vertices V = {1, . . . , p} and the set of edges E. The edge set E encodes the conditional independence structure of the distribution, i.e., ejk ∉ E ⟺ Xj ⊥⊥ Xk | XV∖{jk}. Finding the edges of the graph is useful for several reasons: knowledge of conditional independence relations can accelerate inference and improve sampling methods, as well as illuminate structure underlying the process that generated the data samples. This problem—identifying an undirected graph given samples—is quite well studied for Gaussian or discrete distributions. In the Gaussian setting, the inverse covariance, or precision, matrix precisely encodes the sparsity of the graph. That is, a zero appears in the jk-th entry of the precision if and only if variables Xj and Xk are conditionally independent given the rest. Estimation of the support of the precision matrix is often achieved using a maximum likelihood estimate with ℓ1 penalties. Coordinate descent (glasso) [4] and neighborhood selection [14] algorithms can be consistent for sparse recovery even with few samples, i.e., p > n.
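In the Gaussian case, the correspondence between missing edges and zeros of the precision matrix is easy to verify numerically. The following sketch is illustrative only (the chain-structured precision, sample size, and seed are our choices, not part of the paper): it samples a three-variable Gaussian chain X1–X2–X3 and inverts the empirical covariance.

```python
import numpy as np

# Chain graph X1 - X2 - X3: tridiagonal precision, so the (1,3) entry is zero,
# i.e., X1 and X3 are conditionally independent given X2.
precision = np.array([[2.0, 1.0, 0.0],
                      [1.0, 2.0, 1.0],
                      [0.0, 1.0, 2.0]])
cov = np.linalg.inv(precision)

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(3), cov, size=20000)

# Empirical precision: inverse of the sample covariance.
emp_precision = np.linalg.inv(np.cov(samples.T))

# The (0, 2) entry is near zero, while the sub/super-diagonal entries are not.
print(np.round(emp_precision, 2))
```

With enough samples, thresholding the small entries of the empirical precision recovers the chain; the non-Gaussian setting below generalizes exactly this idea.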
In the discrete setting, [12] showed that for certain graph structures, the support of a generalized covariance matrix encodes the conditional independence structure of the graph, while [21] employed sparse ℓ1-penalized logistic regression to identify Ising Markov random fields.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Many physical processes, however, generate data that are continuous but non-Gaussian. One example is satellite images of cloud cover formation, which may greatly impact weather conditions and climate [25, 20]. Other examples include biological processes such as bacteria growth [5], heartbeat behavior [19], and transport in biological tissues [9]. Normality assumptions about the data may prevent the detection of important conditional dependencies. Some approaches have allowed for non-Gaussianity, such as the nonparanormal approach of [11, 10], which uses copula functions to estimate a joint non-Gaussian density while preserving conditional independence. However, this approach is still restricted by the choice of copula function, and as far as we know, no fully general approach is yet available. Our goal in this work is to consistently estimate graph structure when the underlying data-generating process is non-Gaussian. To do so, we propose the algorithm SING (Sparsity Identification in Non-Gaussian distributions). SING uses the framework of transport maps to characterize arbitrary continuous distributions, as described in §3. Our representations of transport maps employ polynomial expansions of degree β. Setting β = 1 (i.e., linear maps) is equivalent to assuming that the data are well approximated by a multivariate Gaussian.
With β > 1, SING acts as a generalization of Gaussian-based algorithms, allowing for arbitrarily rich parameterizations of the underlying data-generating distribution without additional assumptions on its structure or class.

2 Learning conditional independence

Let X1, . . . , Xp have a smooth and strictly positive density π on Rp. A recent result in [26] shows that conditional independence of the random variables Xj and Xk can be determined as follows:

    Xj ⊥⊥ Xk | XV∖{jk} ⟺ ∂jk log π(x) = 0, ∀ x ∈ Rp,   (1)

where ∂jk(·) denotes the jk-th mixed partial derivative. Here, we define the generalized precision as the matrix Ω, where Ωjk = Eπ[|∂jk log π(x)|]. Note that Ωjk = 0 is a sufficient condition for variables Xj and Xk to be conditionally independent. Thus, finding the zeros in the matrix Ω is equivalent to finding the graphical structure underlying the density π. Note that the zeros of the precision matrix in a Gaussian setting encode the same information—the lack of an edge—as the zeros in the generalized precision introduced here.

Our approach identifies the zeros of Ω, and thus the edge set E, in an iterative fashion via the algorithm SING, outlined in §4. Note that computation of an entry of the generalized precision relies on an expression for the density π. We represent π and also estimate Ω using the notion of a transport map between probability distributions. This map is estimated from independent samples x(i) ∼ π, i = 1, . . . , n, as described in §3.
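When an (unnormalized) log-density happens to be available in closed form, the generalized precision can be probed directly. The sketch below is a finite-difference illustration of (1) on a hypothetical non-Gaussian chain X1–X2–X3 of our own choosing; it is not the paper's estimator, which works from samples via transport maps.

```python
import numpy as np

# Unnormalized log-density of a non-Gaussian chain X1 - X2 - X3.
# The normalizing constant vanishes under mixed differentiation, so it is irrelevant.
def log_pi(x):
    x1, x2, x3 = x
    return -0.25 * x1**4 - 0.5 * (x1 - x2)**2 - 0.5 * (x2 - x3)**2

def mixed_partial(f, x, j, k, h=1e-3):
    """Central finite-difference estimate of the mixed partial d^2 f / dx_j dx_k."""
    def shift(sj, sk):
        y = x.copy()
        y[j] += sj * h
        y[k] += sk * h
        return f(y)
    return (shift(1, 1) - shift(1, -1) - shift(-1, 1) + shift(-1, -1)) / (4 * h**2)

rng = np.random.default_rng(1)
points = rng.normal(size=(200, 3))

# Monte Carlo estimate of Omega_jk = E[|d_jk log pi|] over test points.
omega_12 = np.mean([abs(mixed_partial(log_pi, x, 0, 1)) for x in points])
omega_13 = np.mean([abs(mixed_partial(log_pi, x, 0, 2)) for x in points])

print(omega_12, omega_13)  # omega_13 vanishes: X1, X3 conditionally independent given X2
```

Here ∂12 log π = 1 everywhere while ∂13 log π = 0, so the estimated Ω has exactly the sparsity of the chain.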
Using a map of the particular form described below offers several advantages: (1) computing the generalized precision given the map is efficient (e.g., the result of a convex optimization problem); (2) the transport map itself enjoys a notion of sparsity that directly relates to the Markov structure of the data-generating distribution; (3) a coarse map may capture these Markov properties without perfectly estimating the high-dimensional density π.

Let us first summarize our objective and proposed approach. We aim to solve the following graph recovery problem:

    Given samples {x(i)}, i = 1, . . . , n, from some unknown distribution, find the dependence graph of this distribution and the associated Markov properties.

Our proposed approach loosely follows these steps:

• Estimate a transport map from samples
• Given an estimate of the map, compute the generalized precision Ω
• Threshold Ω to identify a (sparse) graph
• Given a candidate graphical structure, re-estimate the map. Iterate. . .

The final step—re-estimating the map, given a candidate graphical structure—makes use of a strong connection between the sparsity of the graph and the sparsity of the transport map (as shown by [26] and described in §3.2). Sparsity in the graph allows for sparsity in the map, and a sparser map allows for improved estimates of Ω. This is the motivation behind an iterative algorithm.

3 Transport maps

The first step of SING is to estimate a transport map from samples [13]. A transport map induces a deterministic coupling of two probability distributions [22, 15, 18, 26]. Here, we build a map between the distribution generating the samples (i.e., X ∼ π) and a standard normal distribution η = N(0, Ip).
As described in [28, 2], given two distributions with smooth and strictly positive densities (π, η),¹ there exists a monotone map S : Rp → Rp such that S♯π = η and S♯η = π, where

    S♯π(y) = π ∘ S⁻¹(y) det(∇S⁻¹(y)),   (2)
    S♯η(x) = η ∘ S(x) det(∇S(x)).   (3)

¹Regularity assumptions on π, η can be substantially relaxed, though (2) and (3) may need modification [2].

We say η is the pushforward density of π by the map S, and similarly, π is the pullback of η by S. Many possible transport maps satisfy the measure transformation conditions above. In this work, we restrict our attention to lower triangular monotone increasing maps. [22, 7, 2] show that, under the conditions above, there exists a unique lower triangular map S of the form

    S(x) = (S1(x1), S2(x1, x2), S3(x1, x2, x3), . . . , Sp(x1, . . . , xp))ᵀ,

with ∂kSk > 0. The qualifier "lower triangular" refers to the property that each component of the map Sk only depends on variables x1, . . . , xk. The space of such maps is denoted S∆.

As an example, consider a normal random variable X ∼ N(0, Σ). Taking the Cholesky decomposition of the covariance, KKᵀ = Σ, then K⁻¹ is a linear operator that maps (in distribution) X to a random variable Y ∼ N(0, Ip), and similarly, K maps Y to X. In this example, we associate the map K⁻¹ with S, since it maps the more exotic distribution to the standard normal:

    S(X) = K⁻¹X =d Y,   S⁻¹(Y) = KY =d X,

where =d denotes equality in distribution. In general, however, the map S may be nonlinear. This is exactly what allows us to represent and capture arbitrary non-Gaussianity in the samples.
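The Gaussian special case above can be reproduced in a few lines. This sketch (our illustration, with an arbitrary 2×2 covariance) builds the linear map S = K⁻¹ from a Cholesky factor and checks that it whitens correlated Gaussian samples:

```python
import numpy as np

rng = np.random.default_rng(2)

# Target covariance and its Cholesky factor: Sigma = K K^T.
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
K = np.linalg.cholesky(sigma)

# X ~ N(0, Sigma); the lower triangular linear map S(x) = K^{-1} x whitens X.
X = rng.multivariate_normal(np.zeros(2), sigma, size=50000)
Y = np.linalg.solve(K, X.T).T

print(np.round(np.cov(Y.T), 2))  # approximately the 2x2 identity
```

Since K is lower triangular, so is K⁻¹, making this linear map an instance of the triangular family above.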
The monotonicity of each component of the map (that is, ∂kSk > 0) can be enforced by the following parameterization:

    Sk(x1, . . . , xk) = ck(x1, . . . , xk−1) + ∫0^{xk} exp{hk(x1, . . . , xk−1, t)} dt,

with functions ck : Rk−1 → R and hk : Rk → R. Next, a particular form for ck and hk is specified; in this work, we use a linear expansion with Hermite polynomials for ck and Hermite functions for hk. An important choice is the maximum degree β of the polynomials. With higher degree, the computational difficulty of the algorithm increases by requiring the estimation of more coefficients in the expansion. This trade-off between higher degree (which captures more possible nonlinearity) and computational expense is a topic of current research [1]. The space of lower triangular maps, parameterized in this way, is denoted Sβ∆. Computations in the transport map framework are performed using the software TransportMaps [27].

3.1 Optimization of map coefficients is an MLE problem

Let α ∈ Rnα be the vector of coefficients that parameterize the functions ck and hk, which in turn define a particular instantiation of the transport map Sα ∈ Sβ∆. (We include the subscript in this subsection to emphasize that the map depends on its particular parameterization, but later drop it for notational efficiency.)
To complete the estimation of Sα, it remains to optimize for the coefficients α. This optimization is achieved by minimizing the Kullback–Leibler divergence between the density in question, π, and the pullback of the standard normal η by the map Sα [27]:

    α* = argmin_α DKL(π || S♯α η)   (4)
       = argmin_α Eπ[log π − log S♯α η]   (5)
       ≈ argmax_α (1/n) Σi=1..n log S♯α η(x(i))   (6)
       = α̂.

As shown in [13, 17], for standard Gaussian η and lower triangular S, this optimization problem is convex and separable across dimensions 1, . . . , p. Moreover, by line (6), the solution to the optimization problem is a maximum likelihood estimate α̂. Given that the n samples are random, α̂ converges in distribution as n → ∞ to a normal random variable whose mean is the exact minimizer α*, and whose variance is I⁻¹(α*)/n, where I(α) is the Fisher information matrix. That is:

    α̂ ∼ N(α*, (1/n) I⁻¹(α*)),   as n → ∞.   (7)

Figure 1: (a) A sparse graph with an optimal ordering; (b) Suboptimal ordering induces extra edges.

Optimizing for the map coefficients yields a representation of the density π as S♯α η.
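The objective in (6) can be made concrete in one dimension. For a linear map Sα(x) = a + bx with b > 0 (a deliberately simplified stand-in for the paper's Hermite parameterization), the sample objective has a closed-form maximizer, which this sketch verifies numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=3.0, size=5000)  # samples from a 1D "target"

def objective(a, b, x):
    """(1/n) sum_i log pullback density at x_i for S(x) = a + b x, eta = N(0,1):
    log S^sharp eta(x) = log eta(S(x)) + log S'(x)."""
    z = a + b * x
    return np.mean(-0.5 * z**2 - 0.5 * np.log(2 * np.pi) + np.log(b))

# Closed-form MLE: b = 1/std(x), a = -mean(x)/std(x), i.e., S standardizes x.
s = x.std()
a_hat, b_hat = -x.mean() / s, 1.0 / s

# The MLE beats nearby perturbations of the coefficients (objective is concave).
best = objective(a_hat, b_hat, x)
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert objective(a_hat + da, b_hat + db, x) < best
```

The same structure, maximizing an average log-pullback density, carries over to the triangular polynomial maps, where the problem remains convex and separable across components.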
Thus, it is now possible to compute the conditional independence scores with the generalized precision:

    Ωjk = Eπ[|∂jk log π(x)|] = Eπ[|∂jk log S♯α η(x)|]   (8)
        ≈ (1/n) Σi=1..n |∂jk log S♯α η(x(i))| = Ω̂jk.   (9)

The next step is to threshold Ω̂. First, however, we explain the connection between the two notions of sparsity—one of the graph and the other of the map.

3.2 Sparsity and ordering of the transport map

Because the transport maps are lower triangular, they are in some sense already sparse. However, it may be possible to prescribe more sparsity in the form of the map. [26] showed that the Markov structure associated with the density π yields tight lower bounds on the sparsity pattern IS, where the latter is defined as the set of all pairs (j, k), j < k, such that the kth component of the map does not depend on the jth variable: IS := {(j, k) : j < k, ∂jSk = 0}. The variables associated with the complement of this set are called active. Moreover, these sparsity bounds can be identified by simple graph operations; see §5 in [26] for details. Essentially these operations amount to identifying the intermediate graphs produced by the variable elimination algorithm, but they do not involve actually performing variable elimination or marginalization. The process starts with node p, creates a clique among all its neighbors, and then "removes" it. The process continues in the same way with nodes p − 1, p − 2, and so on until node 1. The edges in the resulting (induced) graph determine the sparsity pattern of the map IS. In general, the induced graph will be more highly connected unless the original graph is chordal.
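These graph operations are straightforward to implement. The sketch below (our illustration) computes the elimination-induced graph and the resulting sparsity pattern IS for a five-node graph consistent with Figure 1a and the pattern in (10): a triangle {1, 2, 3} attached to the chain 3–4–5.

```python
import itertools

def induced_graph(edges, order):
    """Simulate variable elimination: process nodes in reverse `order`
    (last node first); connect all not-yet-eliminated neighbors of the node
    being removed. Returns the original edges plus any fill-in."""
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    filled = {frozenset(e) for e in edges}
    eliminated = set()
    for v in reversed(order):
        nbrs = [u for u in adj[v] if u not in eliminated]
        for a, b in itertools.combinations(nbrs, 2):
            if frozenset((a, b)) not in filled:
                filled.add(frozenset((a, b)))
                adj[a].add(b)
                adj[b].add(a)
        eliminated.add(v)
    return filled

def sparsity_pattern(edges, order):
    """I_S: pairs (j, k) with j before k in the ordering and no edge in the
    induced graph, so component S^k need not depend on x_j."""
    filled = induced_graph(edges, order)
    return {(j, k) for j, k in itertools.combinations(order, 2)
            if frozenset((j, k)) not in filled}

# Triangle {1,2,3} plus chain 3-4-5 (consistent with Figure 1a / eq. (10)).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
print(sorted(sparsity_pattern(edges, [1, 2, 3, 4, 5])))
# -> [(1, 4), (1, 5), (2, 4), (2, 5), (3, 5)]  (no fill-in: this ordering is good)
```

Running the same routine with a poor ordering introduces fill-in and shrinks IS, which is exactly the effect the ordering heuristics below try to avoid.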
Since the set of added edges, or fill-in, depends on the ordering of the nodes, it is beneficial to identify an ordering that minimizes it. For example, consider the graph in Figure 1a. The corresponding map has a nontrivial sparsity pattern, and is thus sparser than a dense lower triangular map:

    S(x) = (S1(x1), S2(x1, x2), S3(x1, x2, x3), S4(x3, x4), S5(x4, x5))ᵀ,   IS = {(1, 4), (2, 4), (1, 5), (2, 5), (3, 5)}.   (10)

Now consider Figure 1b. Because of the suboptimal ordering, edges must be added to the induced graph, shown in dashed lines. The resulting map is then less sparse than in Figure 1a: IS = {(1, 5), (2, 5)}. An ordering of the variables is equivalent to a permutation ϕ, but the problem of finding an optimal permutation is NP-hard, and so we turn to heuristics. Possible schemes include the so-called min-degree and min-fill heuristics [8]. Another that we have found successful in practice is reverse Cholesky, i.e., the reverse of a good ordering for sparse Cholesky factorization [24]. We use this in the examples below.

The critical point here is that sparsity in the graph implies sparsity in the map. The space of maps that respect this sparsity pattern is denoted SβI. A sparser map can be described by fewer coefficients α, which in turn decreases their total variance when found via MLE. This improves the subsequent estimate of Ω. Numerical results supporting this claim are shown in Figure 2 for a Gaussian grid graph, p = 16. The plots show three levels of sparsity: "under," corresponding to a dense lower triangular map; "exact," in which the map includes only the necessary active variables; and "over," corresponding to a diagonal map. In each case, the variance decreases with increasing sample size, and the sparser the map, the lower the variance.
However, non-negligible bias is incurred when the map is over-sparsified; see Figure 2b. Ideally, the algorithm would move from the under-sparsified level to the exact level.

Figure 2: (a) Variance of Ω̂jk decreases with fewer coefficients and/or more samples; (b) Bias in Ω̂jk occurs with oversparsification. The bias and variance of Ω̂ are computed using the Frobenius norm.

4 Algorithm: SING

We now present the full algorithm. Note that the ending condition is controlled by a variable DECREASING, which is set to true until the size of the recovered edge set is no longer decreasing. The final ingredient is the thresholding step, explained in §4.1. Subscripts l in the algorithm refer to the given quantity at that iteration.

Algorithm 1: Sparsity Identification in Non-Gaussian distributions (SING)
input: n i.i.d. samples {x(i)}, i = 1, . . . , n, ∼ π; maximum polynomial degree β
output: sparse edge set Ê
define: IS1 = {∅}, l = 1, |Ê0| = p(p − 1)/2, DECREASING = TRUE
1  while DECREASING = TRUE do
2      Estimate transport map Sl ∈ SβIl, where Sl♯π = η
3      Compute (Ω̂l)jk = (1/n) Σi=1..n |∂jk log S♯α η(x(i))|
4      Threshold Ω̂l
5      Compute |Êl| (the number of edges in the thresholded graph)
6      if |Êl| < |Êl−1| then
7          Find appropriate permutation of variables ϕl (for example, reverse Cholesky ordering)
8          Identify sparsity pattern of subsequent map ISl+1
9          l ← l + 1
10     else
11         DECREASING = FALSE

SING is not a strictly greedy algorithm—neither for the sparsity pattern of the map nor for the edge removal of the graph.
First, the process of identifying the induced graph may involve fill-in, and the extent of this fill-in might be larger than optimal due to the ordering heuristics. Second, the estimate of the generalized precision is noisy due to finite sample size, and this noise can add randomness to a thresholding decision. As a result, a variable that is set as inactive may be reactivated in subsequent iterations. However, we have found that oscillation in the set of active variables is a rare occurrence. Thus, checking that the total number of edges is nondecreasing (as a global measure of sparsity) works well as a practical stopping criterion.

4.1 Thresholding the generalized precision

An important component of this algorithm is a thresholding of the generalized precision. Based on the literature [3] and numerical results, we model the threshold as τjk = δρjk, where δ is a tuning parameter and ρjk = [V(Ω̂jk)]^{1/2} (where V denotes variance). Note that a threshold τjk is computed at each iteration and for every off-diagonal entry of Ω. More motivation for this choice is given in the scaling analysis of the following section. The expression (7) yields an estimate of the variances of the map coefficients α̂, but this uncertainty must still be propagated to the entries of Ω in order to compute ρjk.
This is possible using the delta method [16], which states that if a sequence of one-dimensional random variables satisfies

    √n |X(n) − θ| →d N(µ, σ²),

then for a function g(θ),

    √n |g(X(n)) − g(θ)| →d N(g(µ), σ²|g′(θ)|²).

The MLE result also states that the coefficients are normally distributed as n → ∞. Thus, generalizing this method to vector-valued random variables gives an estimate for the variance in the entries of Ω, as a function of α, evaluated at the true minimizer α*:

    ρ²jk ≈ (∇αΩjk)ᵀ ((1/n) I⁻¹(α)) (∇αΩjk) |α=α*.   (11)

5 Scaling analysis

In this section, we derive an estimate for the number of samples needed to recover the exact graph with some given probability. We consider a one-step version of the algorithm; in other words: what is the probability that the correct graph will be returned after a single step of SING? We also assume a particular instantiation of the transport map, and that κ, the minimum non-zero edge weight in the true generalized precision, is given. That is, κ = min over j ≠ k with Ωjk ≠ 0 of Ωjk.

There are two possibilities for each pair (j, k), j < k: the edge ejk does exist in the true edge set E (case 1), or it does not (case 2). In case 1, the estimated value should be greater than its variance, up to some level of confidence, reflected in the choice of δ: Ωjk > δρjk. In the worst case, Ωjk = κ, so it must be that κ > δρjk.
On the other hand, in case 2, in which the edge does not exist, we similarly need κ − δρjk > 0. If ρjk < κ/δ, then by equation (11), we have

    (1/n) (∇αΩjk)ᵀ I⁻¹(α) (∇αΩjk) < (κ/δ)²,   (12)

and so it must be that the number of samples satisfies

    n > (∇αΩjk)ᵀ I⁻¹(α) (∇αΩjk) (δ/κ)².   (13)

Let us define the RHS above as n*jk and set n* = max over j ≠ k of n*jk.

Recall that the estimate in line (9) contains the absolute value of a normally distributed quantity, known as a folded normal distribution. In case 1, the mean is bounded away from zero, and with small enough variance, the folded part of this distribution is negligible. In case 2, the mean (before taking the absolute value) is zero, and so this estimate takes the form of a half-normal distribution. Let us now relate the level of confidence, as reflected in δ, to the probability z that an edge is correctly estimated. We define a function for the standard normal (in case 1) ϕ1 : R+ → (0, 1) such that ϕ1(δ1) = z1, with inverse δ1 = ϕ1⁻¹(z1), and similarly for the half-normal with ϕ2, δ2, and z2. Consider the event Bjk that edge ejk is estimated incorrectly:

    Bjk = {(ejk ∈ E) ∩ (êjk ∉ Ê)} ∪ {(ejk ∉ E) ∩ (êjk ∈ Ê)}.

In case 1,

    δ1ρjk < κ ⟹ P(Bjk) < (1/2)(1 − z1),

where the factor of 1/2 appears because this event only occurs when the estimate is below κ (and not when the estimate is high).
In case 2, we have

    δ2ρjk < κ ⟹ P(Bjk) < (1 − z2).

To unify these two cases, let us define z such that 1 − z = (1 − z1)/2, and set z = z2. Finally, we have P(Bjk) < (1 − z), j < k. Now we bound the probability that at least one edge is incorrect with a union bound:

    P(⋃j<k Bjk) ≤ Σj<k P(Bjk) < (1/2) p(p − 1)(1 − z).   (14)

The probability that every edge is estimated correctly is thus greater than 1 − (1/2) p(p − 1)(1 − z); requiring this to be at least m gives z ≥ 1 − 2(1 − m)/(p(p − 1)). Let

    δ* = max[δ1, δ2] = max[ ϕ1⁻¹(1 − 2(1 − m)/(p(p − 1))), ϕ2⁻¹(1 − 2(1 − m)/(p(p − 1))) ].   (15)

Therefore, to recover the correct graph with probability m we need at least n* samples, where

    n* = max over j ≠ k of { (∇αΩjk)ᵀ I⁻¹(α) (∇αΩjk) (δ*/κ)² }.   (16)

6 Examples

6.1 Modified Rademacher

Consider r pairs of random variables (X, Y), where

    X ∼ N(0, 1),   (17)
    Y = W X, with W ∼ N(0, 1).   (18)

(A common example illustrating that two random variables can be uncorrelated but not independent uses draws for W from a Rademacher distribution, which are −1 and 1 with equal probability.) When r = 5, the corresponding graphical model and support of the generalized precision are shown in Figure 3. The same figure also shows the one- and two-dimensional marginal distributions for one pair (X, Y). Each one-dimensional marginal is symmetric and unimodal, but the two-dimensional marginal is quite non-Gaussian.

Figures 4a–4c show the progression of the identified graph over the iterations of the algorithm, with n = 2000, δ = 2, and maximum degree β = 2. The variables are initially permuted to demonstrate that the algorithm is able to find a good ordering.
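The pair defined in (17)–(18) is easy to simulate, and a quick check (our illustration) shows why a linear, correlation-based analysis sees nothing: X and Y are uncorrelated, yet their squares are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000

# One pair of the modified Rademacher construction: X ~ N(0,1), Y = W X.
X = rng.normal(size=n)
W = rng.normal(size=n)
Y = W * X

corr = lambda u, v: np.corrcoef(u, v)[0, 1]
print(corr(X, Y))        # near 0: X and Y are uncorrelated
print(corr(X**2, Y**2))  # near 0.5: the pair is clearly dependent
```

Analytically, E[XY] = E[W]E[X²] = 0, while cov(X², Y²) = E[W²](E[X⁴] − E[X²]²) = 2, so the dependence lives entirely in higher moments.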
After the first iteration, one extra edge remains. After the second, the erroneous edge is removed and the graph is correct. After the third, the sparsity of the graph has not changed, and the recovered graph is returned as is. Importantly, an assumption of normality on the data returns the incorrect graph, displayed in Figure 4d. (This assumption can be enforced by using a linear transport map, or β = 1.) In fact, not only is the graph incorrect: the use of a linear map fails to detect any edges at all and deems the ten variables to be independent.

Figure 3: (a) The undirected graphical model; (b) One- and two-dimensional marginal distributions for one pair (X, Y); (c) Adjacency matrix of true graph (dark blue corresponds to an edge, off-white to no edge).

Figure 4: (a) Adjacency matrix of original graph under random variable permutation; (b) Iteration 1; (c) Iterations 2 and 3 are identical: correct graph recovered via SING with β = 2; (d) Recovered graph, using SING with β = 1.

6.2 Stochastic volatility

As a second example, we consider data generated from a stochastic volatility model of a financial asset [23, 6]. The log-volatility of the asset is modeled as an autoregressive process at times t = 1, . . . , T. In particular, the state at time t + 1 is given by

    Zt+1 = µ + φ(Zt − µ) + εt,   εt ∼ N(0, 1),   (19)

where

    Z0 | µ, φ ∼ N(µ, 1/(1 − φ²)),   µ ∼ N(0, 1),   (20)
    φ = 2 e^{φ*}/(1 + e^{φ*}) − 1,   φ* ∼ N(3, 1).   (21)

The corresponding graph is depicted in Figure 5. With T = 6, samples were generated from the posterior distribution of the state Z1:6 and hyperparameters µ and φ, given noisy measurements of the state.
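The generative model (19)–(21) can be simulated forward in a few lines. The sketch below draws from the prior only (the paper's experiments instead use posterior samples given noisy observations of the state):

```python
import numpy as np

def sample_sv_path(T=6, rng=None):
    """Draw (mu, phi, Z_1..Z_T) from the stochastic volatility prior (19)-(21)."""
    rng = rng or np.random.default_rng()
    mu = rng.normal(0.0, 1.0)
    phi_star = rng.normal(3.0, 1.0)
    phi = 2.0 * np.exp(phi_star) / (1.0 + np.exp(phi_star)) - 1.0  # lies in (-1, 1)
    z = np.empty(T)
    z_prev = rng.normal(mu, 1.0 / np.sqrt(1.0 - phi**2))  # stationary Z_0, eq. (20)
    for t in range(T):
        z_prev = mu + phi * (z_prev - mu) + rng.normal()   # AR(1) step, eq. (19)
        z[t] = z_prev
    return mu, phi, z

mu, phi, z = sample_sv_path(rng=np.random.default_rng(5))
print(phi, z.shape)  # |phi| < 1 by construction; T log-volatility states
```

The logistic transform in (21) keeps φ inside (−1, 1), so the autoregressive process is stationary; each Zt+1 depends on the past only through Zt, which is exactly the chain structure of Figure 5.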
Using a relatively large number of samples, n = 15000, with δ = 1.5 and β = 2, the correct graph is recovered, shown in Figure 6a. With the same amount of data, a linear map returns an incorrect graph, having both missing and spurious additional edges. The large number of samples is required because the edges between hyperparameters and state variables are quite weak.

Figure 5: (a) The graph of the stochastic volatility model; (b) Adjacency matrix of true graph.

Figure 6: Recovered graphs using: (a) SING, β = 2, n = 15000; (b) SING, β = 1; (c) GLASSO.

Magnitudes of the entries of the generalized precision (scaled to have maximum value 1) are displayed in Figure 7a. The stronger edges may be recovered with a much smaller number of samples (n = 2000), however; see Figure 7b. This example illustrates the interplay between the minimum edge weight κ and the number of samples needed, as seen in the previous section. In some cases, it may be more reasonable to expect that, given a fixed number of samples, SING could recover edges with edge weight above some κmin, but would not reliably discover edges below that cutoff. Strong edges could also be discovered using fewer samples and a modified SING algorithm with ℓ1 penalties (a modification to the algorithm currently under development).

For comparison, Figure 6c shows the graph produced by assuming that the data are Gaussian and using the GLASSO algorithm [4]. Results were generated for 40 different values of the tuning parameter λ ∈ (10⁻⁶, 1). The result shown here was chosen such that the sparsity level is locally constant with respect to λ, specifically at λ = 0.15.
Here we see that using a Gaussian assumption with non-Gaussian data overestimates edges among state variables and underestimates edges between state variables and the hyperparameter φ.

Figure 7: (a) The scaled generalized precision matrix Ω̂; (b) Strong edges recovered via SING, n = 2000.

7 Discussion

The scaling analysis presented here depends on a particular representation of the transport map. An interesting open question is: what is the information-theoretic (representation-independent) lower bound on the number of samples needed to identify edges in the non-Gaussian setting? This question relates to the notion of an information gap: any undirected graph satisfies the Markov properties of an infinite number of distributions, and thus identification of the graph should require less information than identification of the distribution. Formalizing these notions is an important topic of future work.

Acknowledgments

This work has been supported in part by the AFOSR MURI on "Managing multiple information sources of multi-physics systems," program officer Jean-Luc Cambier, award FA9550-15-1-0038. We would also like to thank Daniele Bigoni for generous help with code implementation and execution.

References

[1] D. Bigoni, A. Spantini, and Y. Marzouk. On the computation of monotone transports. In preparation.

[2] V. I. Bogachev, A. V. Kolesnikov, and K. V. Medvedev. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309, 2005.

[3] T. Cai and W. Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672–684, 2011.

[4] J.
Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[5] S. K. Ghosh, A. G. Cherstvy, D. S. Grebenkov, and R. Metzler. Anomalous, non-Gaussian tracer diffusion in crowded two-dimensional environments. New Journal of Physics, 18(1):013027, 2016.

[6] S. Kim, N. Shephard, and S. Chib. Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies, 65(3):361–393, 1998.

[7] H. Knothe. Contributions to the theory of convex bodies. The Michigan Mathematical Journal, 4(1):39–52, 1957.

[8] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[9] C. Liu, R. Bammer, B. Acar, and M. E. Moseley. Characterizing non-Gaussian diffusion by using generalized diffusion tensors. Magnetic Resonance in Medicine, 51(5):924–937, 2004.

[10] H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, 2012.

[11] H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10:2295–2328, 2009.

[12] P.-L. Loh and M. J. Wainwright. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. In NIPS, pages 2096–2104, 2012.

[13] Y. Marzouk, T. Moselhy, M. Parno, and A. Spantini. Sampling via measure transport: An introduction. In R. Ghanem, D. Higdon, and H. Owhadi, editors, Handbook of Uncertainty Quantification. Springer, 2016.

[14] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462, 2006.

[15] T. A. Moselhy and Y. M. Marzouk. Bayesian inference with optimal maps.
Journal of Computational Physics, 231(23):7815–7850, 2012.

[16] G. W. Oehlert. A note on the delta method. The American Statistician, 46(1):27–29, 1992.

[17] M. Parno and Y. Marzouk. Transport map accelerated Markov chain Monte Carlo. arXiv preprint arXiv:1412.5492, 2014.

[18] M. Parno, T. Moselhy, and Y. M. Marzouk. A multiscale strategy for Bayesian inference using transport maps. SIAM/ASA Journal on Uncertainty Quantification, 4(1):1160–1190, 2016.

[19] C.-K. Peng, J. Mietus, J. Hausdorff, S. Havlin, H. E. Stanley, and A. L. Goldberger. Long-range anticorrelations and non-Gaussian behavior of the heartbeat. Physical Review Letters, 70(9):1343, 1993.

[20] M. Perron and P. Sura. Climatology of non-Gaussian atmospheric statistics. Journal of Climate, 26(3):1063–1083, 2013.

[21] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.

[22] M. Rosenblatt. Remarks on a multivariate transformation. The Annals of Mathematical Statistics, 23(3):470–472, 1952.

[23] H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press, 2005.

[24] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2003.

[25] A. Sengupta, N. Cressie, B. H. Kahn, and R. Frey. Predictive inference for big, spatial, non-Gaussian data: MODIS cloud data and its change-of-support. Australian & New Zealand Journal of Statistics, 58(1):15–45, 2016.

[26] A. Spantini, D. Bigoni, and Y. Marzouk. Inference via low-dimensional couplings. arXiv preprint arXiv:1703.06131, 2017.

[27] T. M. Team. TransportMaps v1.0. http://transportmaps.mit.edu.

[28] C. Villani. Optimal Transport: Old and New, volume 338.
Springer, 2008.