## Supplemental Material for Paris, Servin, and Boitard, 2019


posted on 09.10.2019 by Cyriel Paris, Bertrand Servin, Simon Boitard

File S1: Approximating moments of Wright-Fisher process with a Taylor expansion.

Table S1: All significant SNPs found with the HMM approach in the chicken data. The CHR column indicates the chromosome where the significant SNP is found, position indicates its position on this chromosome, log10 pval - is the p-value found at this locus in the pHu- line, log10 pval + is the p-value found at this locus in the pHu+ line, and the Significant line column indicates the line for which this locus is significant.

Figure S1: Some examples of WF approximations, for Ne = 100, Ne s = 10, starting from a frequency of 0.1 or 0.5, during 1, 5 or 20 generations. Continuous lines correspond to continuous densities and crosses correspond to discrete probabilities. Wright-Fisher distributions are represented in orange, Gaussian distributions in blue, Beta distributions in red and Beta with spikes distributions in green.
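The discrete Wright-Fisher probabilities shown in Figure S1 can be illustrated with a minimal sketch. It is not the authors' code: it assumes a haploid model with genic selection (fitness 1 + s for the selected allele), a population of `n` gene copies with `n` taken equal to Ne, and binomial resampling each generation. The function names `wf_transition_row` and `wf_distribution` are hypothetical.

```python
from math import comb


def wf_transition_row(i, n, s):
    """Return P(X_{t+1} = j | X_t = i) for j = 0..n.

    Assumed model: selection shifts the frequency x = i / n to
    p = x (1 + s) / (1 + s x), then n copies are drawn binomially.
    """
    x = i / n
    p = x * (1 + s) / (1 + s * x)
    return [comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(n + 1)]


def wf_distribution(x0, n, s, generations):
    """Distribution of the allele count after the given number of generations."""
    rows = [wf_transition_row(i, n, s) for i in range(n + 1)]
    dist = [0.0] * (n + 1)
    dist[round(x0 * n)] = 1.0  # start from a known frequency x0
    for _ in range(generations):
        new = [0.0] * (n + 1)
        for i, mass in enumerate(dist):
            if mass == 0.0:
                continue
            row = rows[i]
            for j in range(n + 1):
                new[j] += mass * row[j]
        dist = new
    return dist


def mean_frequency(dist):
    """Mean allele frequency of a distribution over counts 0..n."""
    n = len(dist) - 1
    return sum(j / n * p for j, p in enumerate(dist))


# Ne = 100 and Ne s = 10 (so s = 0.1), starting frequency 0.1, 5 generations.
dist = wf_distribution(0.1, 100, 0.1, 5)
```

Under these assumptions, the Gaussian, Beta and Beta with spikes curves of the figure would be continuous approximations to `dist`.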

Figure S2: Evolution of mean allele frequency over time (in T/Ne units) for an initial frequency of 0.1 (left) or 0.5 (right). Each color stands for a different scaled selection intensity (ranging from N s = 0 to N s = 100). Vertical dotted lines indicate the total sampling range corresponding to the simulation scenarios analyzed with the HMM.

Figure S3: Wasserstein distances to Wright-Fisher transition for four continuous approximations (columns) using true moments of the Wright-Fisher process and for varying starting allele frequencies (x-axis) and number of generations (y-axis). Top panel: Neutral evolution (Ne s = 0). Middle panel: Mild selection (Ne s = 10). Bottom panel: Strong selection (Ne s = 100).
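As a rough illustration of the distance used in Figure S3, the sketch below computes a 1-Wasserstein distance between two distributions defined on the same equally spaced frequency grid, using the identity that this distance equals the integral of the absolute difference between the two cumulative distribution functions. The order of the Wasserstein distance and the way continuous approximations are discretized are assumptions here, not details taken from the paper; `wasserstein1` is a hypothetical name.

```python
def wasserstein1(p, q, spacing):
    """1-Wasserstein distance between two probability vectors on a common grid.

    p and q hold the probabilities at grid points 0, spacing, 2 * spacing, ...
    The distance is the sum of |CDF_p - CDF_q| over the grid intervals,
    multiplied by the interval width.
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same grid")
    cumulative_gap = 0.0
    total = 0.0
    # The last grid point is skipped: both CDFs equal 1 there.
    for p_k, q_k in zip(p[:-1], q[:-1]):
        cumulative_gap += p_k - q_k
        total += abs(cumulative_gap)
    return total * spacing


# Two point masses at frequencies 0 and 1 on the grid {0, 0.5, 1}.
distance = wasserstein1([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], 0.5)
```

For allele counts 0..n, the grid spacing would be 1/n, so that distances are expressed on the allele frequency scale.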

Figure S4: Likelihood at s = 0 for each approximation versus the likelihood at s = 0 for the Wright-Fisher model. Each quadruplet of panel columns corresponds to a value of T/Ne, while panel rows correspond to different values of the scaled selection coefficient Ne s and of the initial allele frequency x1.

Figure S5: Log likelihood ratio for each approximation versus log likelihood ratio for the Wright-Fisher model. Each quadruplet of panel columns corresponds to a value of T/Ne, while panel rows correspond to different values of the scaled selection coefficient Ne s and of the initial allele frequency x1.

Figure S6: Maximum likelihood estimator ŝ for each approximation versus the maximum likelihood estimator ŝ for the Wright-Fisher model. Each quadruplet of panel columns corresponds to a value of T/Ne, while panel rows correspond to different values of the scaled selection coefficient Ne s and of the initial allele frequency x1.

Figure S7: Likelihood at s = ŝ for each approximation versus the likelihood at s = ŝ for the Wright-Fisher model. Each quadruplet of panel columns corresponds to a value of T/Ne, while panel rows correspond to different values of the scaled selection coefficient Ne s and of the initial allele frequency x1.

Figure S8: Calibration of the log likelihood ratio λ(y) under the null hypothesis (s = 0) for all models (see colors), different values of T/Ne (columns) and different values of x1 (rows).

Figure S9: Calibration of λ(y) under the null: comparison of empirical vs theoretical χ²(1) quantiles for different initial allele frequencies (rows) and inter-sample times (columns and colors), for the NG model.
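The comparison in Figure S9 can be sketched with the Python standard library alone. The example below is an illustration under stated assumptions, not the authors' pipeline: the null statistics are placeholder values drawn as squared standard normal variates (which are exactly χ²(1) distributed) rather than likelihood ratio statistics computed from simulated trajectories, and the function names are hypothetical.

```python
import random
from math import erfc, sqrt
from statistics import NormalDist


def chi2_1_sf(x):
    """Upper tail probability P(X > x) for X following a chi-square(1) law."""
    return erfc(sqrt(x / 2.0))


def chi2_1_quantile(q):
    """Chi-square(1) quantile: the squared normal quantile at (1 + q) / 2."""
    return NormalDist().inv_cdf((1.0 + q) / 2.0) ** 2


def qq_pairs(lr_stats):
    """Pair each sorted statistic with the theoretical quantile at (k + 0.5) / n."""
    ordered = sorted(lr_stats)
    n = len(ordered)
    return [(chi2_1_quantile((k + 0.5) / n), value) for k, value in enumerate(ordered)]


# Placeholder null statistics (assumption: stand-ins for simulated values of the statistic).
rng = random.Random(0)
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(2000)]
pairs = qq_pairs(null_stats)

# p-value for one observed statistic, valid if the chi-square(1) calibration holds.
p_value = chi2_1_sf(3.84)
```

When the statistic is well calibrated, the points in `pairs` lie close to the diagonal; systematic departures indicate that χ²(1) p-values would be too liberal or too conservative.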

Figure S10: Estimation error distribution in different scenarios for x1 = 0.1, using the NG model. Each column stands for a scaled time parameter T/Ne ∈ {0.09, 0.9, 1.8}. The first row shows the absolute error distribution in each scenario and the second row shows, in each case, the proportion of rejected trajectories (red lines for fixations at 1 and blue lines for fixations at 0).

Figure S11: Power of the likelihood ratio test under different simulated scenarios, using the NG model. (a) Power as a function of Ne s, (b) Power as a function of T/Ne.

Figure S12: Empirical distribution of the estimation error ŝ − s in different scenarios. Each panel corresponds to a fixed value of the scaled time range T/Ne (rows), the scaled selection parameter Ne s and the initial allele frequency x1 (columns). Each panel shows results for different values of the population size Ne (x-axis).

Figure S13: Comparison of the results obtained at each SNP in the chicken experiment using Beta with spikes vs Wright-Fisher transitions. Each panel corresponds to a different likelihood-based statistic (see subplot titles) calculated in each selected line (pHu+ and pHu-). (a) Likelihood at s = 0, (b) Maximum likelihood estimates (ŝ), (c) Likelihood at s = ŝ, (d) Likelihood-ratio statistic.

Figure S14: Distribution of p-values along the genome for the chicken data analysis. The y-axis shows the −log10 p-value in each selected line (colored in purple and yellow) and the x-axis shows the position on the chromosome. Each panel corresponds to one chromosome. Significant SNPs found using the threshold described in the article are highlighted in blue for the pHu+ line and in orange for the pHu- line.

Figure S15: Allele frequency evolution of significant SNPs, grouped by regions, for the chicken data analysis. At a given position, the blue line is the observed allele frequency evolution in the pHu+ line while the orange line is the observed allele frequency evolution in the pHu- line. Dashed lines correspond to trajectories not found significant in the corresponding line while solid lines correspond to significant trajectories in the corresponding line.

Figure S16: Examples of significant SNPs found in the chicken data analysis both by hapFLK and time series (CHR 1), only by time series (CHR 3) and only by hapFLK (CHR 26). See Figure S15 for panel interpretation.