Supplementary Note
Concordance between an inferred axis of variation and a true axis of variation
Let N be the number of samples and M be the number of markers. According to Theorem 4 of the report of D. Paul, the squared correlation between an inferred axis of variation and a true axis of variation is asymptotically equal to $[1 - \gamma/(l_1 - 1)^2] / [1 + \gamma/(l_1 - 1)]$, where $\gamma = N/M$ and $l_1$ is the leading eigenvalue of the true covariance matrix. From Equation (11) of Patterson et al., we have $l_1 - 1 = F_{ST} N$, thus $\gamma/(l_1 - 1) = 1/(F_{ST} M)$. (We caution that Patterson et al. use a different definition of $\gamma$ than the definition of D. Paul that we follow here.) Assuming that $N \gg 1/F_{ST}$, the squared correlation is approximately equal to $1/[1 + \gamma/(l_1 - 1)] = x/(1 + x)$, where $x = F_{ST} M$.
The above result is subject to several assumptions. First, it assumes that $\gamma < (l_1 - 1)^2$, i.e. that $NM > 1/F_{ST}^2$. This corresponds to the BBP phase change of Patterson et al. (the threshold at which population structure is detectable). Because in the current context the aim is not only to detect population structure but also to infer it accurately, this condition will easily be satisfied in applications involving panels of informative markers such as those described in this paper. Second, it assumes that N < M. More generally, if N > M, the above formula provides a lower bound on the squared correlation between inferred and true axes of variation. Third, Equation (11) of Patterson et al. assumes two subpopulations of equal size; however, the theory can be extended to two subpopulations of size $\alpha N$ and $(1 - \alpha) N$. In this case, the above results hold if we replace $F_{ST}$ with $4\alpha(1 - \alpha) F_{ST}$.
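The formula and its first assumption can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function name and the parameter name `alpha` (the fraction of samples in the first subpopulation) are hypothetical.

```python
def expected_squared_correlation(fst, n_samples, n_markers, alpha=0.5):
    """Asymptotic squared correlation between inferred and true axes.

    Returns (full, approx), or None when the detection threshold is not met.
    alpha is a hypothetical name for the fraction of samples in the first
    subpopulation; alpha = 0.5 corresponds to equal-sized subpopulations.
    """
    fst_eff = 4.0 * alpha * (1.0 - alpha) * fst  # unequal-size adjustment
    gamma = n_samples / n_markers                # gamma = N/M (Paul's convention)
    l1_minus_1 = fst_eff * n_samples             # l1 - 1 = FST * N
    if gamma >= l1_minus_1 ** 2:
        # Equivalent to N*M <= 1/FST^2: the formula does not apply.
        return None
    full = (1.0 - gamma / l1_minus_1 ** 2) / (1.0 + gamma / l1_minus_1)
    x = fst_eff * n_markers
    approx = x / (1.0 + x)                       # valid when N >> 1/FST
    return full, approx
```

For illustrative values FST = 0.01, N = 200 and M = 400, the full expression gives about 0.70 and the approximation x/(1+x) gives 0.80. The two values converge as N grows relative to 1/FST.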
Stratification score method: using disease status to estimate odds of disease
The stratification score method of Epstein et al. computes regression coefficients that describe how genotypes of non-candidate markers predict the odds of disease, and uses those regression coefficients to estimate the odds of disease of each sample conditional on its genotypes at non-candidate markers. The odds of disease is intended to reflect differences in disease risk due to ancestry, as inferred from non-candidate markers.
The genotype data and disease status of sample j can either be included in the regressions whose coefficients are used to estimate the odds of disease of sample j (in-sample approach), or excluded from these regressions (out-of-sample approach). Let N be the number of samples and M be the number of markers. If N >> M, then the impact of one sample on the regressions will be negligible, so that the distinction between the in-sample and out-of-sample approaches is of little consequence. However, if N ~ M (or N < M) and the in-sample approach is used, the combined impact of one sample on the M regression coefficients used to estimate the odds of disease of that sample will be substantial, so that the odds of disease will simply overfit the actual disease status, an effect that has nothing to do with ancestry. This will lead to a large loss in power when stratifying the association analysis by the odds of disease. For this reason, we consider the in-sample approach used by Epstein et al. to be inappropriate.
We demonstrate this via a simple reimplementation of the Epstein et al. method, which we show yields extremely similar results. In detail, in the height data set using the original 178 markers, we computed for each marker an odds ratio (the relative risk of being tall per copy of the reference allele) as $(p_{tall}/p_{short}) / ((1 - p_{tall})/(1 - p_{short}))$, where $p_{tall}$ and $p_{short}$ are the frequencies of the reference allele in tall and short samples. Then, for each sample, we computed the overall odds of disease as the product of relative risks using the number of copies of the reference allele at each marker. Following Epstein et al., we call this the stratification score, and we binned the stratification score into five strata.
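The computation described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the analysis: all function names are hypothetical, genotypes are assumed to be coded as 0, 1 or 2 copies of the reference allele, and the rank-based binning into near-equal strata is an assumption, since the text does not specify the binning rule.

```python
def allele_freq(genotypes):
    """Reference-allele frequency from copy counts (0, 1 or 2 per sample)."""
    return sum(genotypes) / (2.0 * len(genotypes))

def marker_odds_ratios(genotypes, is_tall):
    """genotypes[j][m] = copies of the reference allele for sample j, marker m.

    Assumes every allele frequency lies strictly between 0 and 1.
    """
    ratios = []
    for m in range(len(genotypes[0])):
        tall = [g[m] for g, t in zip(genotypes, is_tall) if t]
        short = [g[m] for g, t in zip(genotypes, is_tall) if not t]
        p_tall, p_short = allele_freq(tall), allele_freq(short)
        ratios.append((p_tall / p_short) / ((1 - p_tall) / (1 - p_short)))
    return ratios

def stratification_score(sample_genotype, ratios):
    """Product over markers of (odds ratio) ** (copies of reference allele)."""
    score = 1.0
    for copies, ratio in zip(sample_genotype, ratios):
        score *= ratio ** copies
    return score

def assign_strata(scores, n_strata=5):
    """Rank-based bins of near-equal size (binning rule is an assumption)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    strata = [0] * len(scores)
    for rank, j in enumerate(order):
        strata[j] = rank * n_strata // len(scores)
    return strata
```

Working on the log scale (summing `copies * log(ratio)`) would be numerically safer with many markers; the product form is kept here to mirror the description in the text.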
We repeated this computation either including (in-sample approach) or excluding (out-of-sample approach) the genotype and height status of sample j in the computation of relative risks used to compute the stratification score of sample j. Results are displayed in Supp. Table 2. We see that under the in-sample approach, the strata are heavily stratified by height, closely matching the results of Epstein et al., but under the out-of-sample approach, the strata are not significantly stratified by height. Thus, the out-of-sample approach does not effectively stratify to avoid a spurious association in the height study, presumably because the original 178 markers are not sufficiently informative for ancestry.
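The difference between the two approaches can be made concrete with a small sketch. The function names and the toy data in the usage note are hypothetical; the out-of-sample variant is implemented here as a leave-one-out recomputation of the odds ratios, and allele frequencies are assumed to stay strictly between 0 and 1 after a sample is removed.

```python
def odds_ratio(genos, status):
    """Per-copy odds ratio at one marker from copy counts and 0/1 status."""
    def freq(group):
        return sum(group) / (2.0 * len(group))
    p1 = freq([g for g, s in zip(genos, status) if s])
    p0 = freq([g for g, s in zip(genos, status) if not s])
    return (p1 / p0) / ((1 - p1) / (1 - p0))

def stratification_scores(genotypes, status, in_sample):
    """Score each sample j, including (in_sample=True) or excluding
    (in_sample=False) sample j when estimating the odds ratios."""
    n, m = len(genotypes), len(genotypes[0])
    result = []
    for j in range(n):
        keep = [i for i in range(n) if in_sample or i != j]
        score = 1.0
        for k in range(m):
            ratio = odds_ratio([genotypes[i][k] for i in keep],
                               [status[i] for i in keep])
            score *= ratio ** genotypes[j][k]
        result.append(score)
    return result
```

With six toy samples at one marker, `genotypes = [[2], [1], [1], [0], [1], [1]]` and `status = [1, 1, 1, 0, 0, 0]`, the first sample scores 16 in-sample but 4 out-of-sample: its own genotype and status inflate the odds ratio used to score it.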
We performed the same analysis using gender instead of height as a phenotype. We note that gender is essentially a random phenotype, as there is no correlation between gender and ancestry in this sample set. Results are displayed in Supp. Table 3. We again see that under the in-sample approach (but not under the out-of-sample approach), the strata are heavily stratified by phenotype. Stratifying an association analysis via a CMH analysis using these five strata, as recommended by Epstein et al., implies an effective sample size of 242 samples (the sum of $4/(1/N_{male} + 1/N_{female})$ over the columns), implying a loss in power equivalent to a reduction in sample size of 34%, even though there is no underlying ancestry effect with respect to this phenotype. Experiments using randomly generated phenotypes gave very similar results.
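The effective sample size calculation is simple enough to state as code. The sketch below uses a hypothetical function name, and the counts in the usage note are invented for illustration; they are not the values of Supp. Table 3.

```python
def effective_sample_size(strata_counts):
    """Sum over strata of 4 / (1/N_a + 1/N_b).

    strata_counts is a list of (N_a, N_b) pairs, one per stratum, e.g.
    (number of males, number of females). A stratum with an empty group
    contributes zero, which is the limit of the expression.
    """
    total = 0.0
    for n_a, n_b in strata_counts:
        if n_a > 0 and n_b > 0:
            total += 4.0 / (1.0 / n_a + 1.0 / n_b)
    return total
```

For example, 200 samples split into two balanced strata, `[(50, 50), (50, 50)]`, give an effective sample size of 200, whereas the same 200 samples split into two heavily skewed strata, `[(90, 10), (10, 90)]`, give only 72.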
Thus, although its correction for stratification in the height study is intriguing, the in-sample approach used by Epstein et al. will evidently lead to a large loss in power in any data set with 368 samples and 178 markers. Epstein et al. reported that their approach appears to have a negligible effect on power when there is no confounding; however, the number of samples was 10 times larger than the number of markers in all of their simulations, a scenario in which the in-sample approach would be expected to differ little from the theoretically appropriate out-of-sample approach. In summary, informative marker panels are still needed to provide a fully powered correction for stratification in targeted studies such as the height study. (However, the method of Epstein et al. may prove useful in future data sets involving informative marker panels, particularly if N >> M so that the distinction between the in-sample approach and the out-of-sample approach is unimportant.)
Using the panel of markers to match cases and controls prior to a genome-wide association study
We caution that if one of the ancestry-informative markers is linked to a causal variant, use of these markers to match cases and controls prior to a genome-wide association study could lead to a loss in power. A solution to this problem is as follows: at the time of initial genotyping of all cases and controls at the ancestry-informative markers, each of these markers is tested for disease association (correcting for possible stratification using all other ancestry-informative markers). Any marker showing a suggestive association (significance beyond some nominal threshold) is then removed from the set of ancestry-informative markers used for matching. A subset of cases and controls is then matched using the remaining ancestry-informative markers, and genotyped in the genome-wide association study.
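The marker-filtering step of this procedure might look like the following sketch. It assumes the stratification-corrected association p-values have already been computed by some other means; the function name and the marker labels are hypothetical, and the threshold is left as a required argument because the text does not fix its value.

```python
def markers_for_matching(p_values, threshold):
    """Keep ancestry-informative markers showing no suggestive association.

    p_values maps a marker label to its stratification-corrected
    association p-value; markers with p < threshold are removed.
    """
    return [marker for marker, p in p_values.items() if p >= threshold]
```

For example, `markers_for_matching({"m1": 0.5, "m2": 0.001, "m3": 0.2}, 0.01)` drops `m2` and keeps the other two markers for matching.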