## Supplemental Material for Garcia, Zoller and Anisimova, 2018

dataset

posted on 15.08.2018 by Victor Garcia, Stefan Zoller, Maria Anisimova

**Figure S1** Average relative frameshifted sequence length, l, across relative position within the gene. Averages of the length of the frameshifted sequence after a slippery site are computed for 20 bins of equal width. The bins span the entire gene.
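The binning described for Figure S1 can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function name `binned_means`, the assumption that positions are already scaled to the interval [0, 1], and the choice to report empty bins as `None` are all assumptions made here.

```python
from statistics import mean


def binned_means(positions, values, n_bins=20):
    """Average `values` within equal-width bins of relative position.

    `positions` are assumed to be relative within-gene positions in [0, 1];
    `values` are the quantities to average (e.g. relative frameshifted
    sequence lengths). Both names are hypothetical.
    """
    bins = [[] for _ in range(n_bins)]
    for x, v in zip(positions, values):
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"relative position out of range: {x}")
        # A position of exactly 1.0 is placed in the last bin.
        idx = min(int(x * n_bins), n_bins - 1)
        bins[idx].append(v)
    # Empty bins have no defined average and are reported as None.
    return [mean(b) if b else None for b in bins]


if __name__ == "__main__":
    pos = [0.01, 0.03, 0.52, 0.99, 1.0]
    val = [0.9, 0.7, 0.4, 0.1, 0.3]
    for i, avg in enumerate(binned_means(pos, val)):
        if avg is not None:
            print(f"bin {i}: {avg:.3f}")
```

Returning `None` for empty bins keeps the output aligned with the bin index, so sparsely populated regions of the gene remain visible rather than being silently dropped.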

**Figure S2** Frequency distribution of the relative position of slippery sites, l/L, for all genes with -1 PRF signals in the PRFdb dataset. The mean of the distribution and its 95% confidence interval both lie below 0.5, the expectation under randomly distributed slippery site locations.
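A comparison of this kind can be sketched as below. The caption does not state how the confidence interval was obtained, so the normal approximation used here (mean plus or minus 1.96 standard errors) is an assumption, and the function names `mean_ci95` and `below_midpoint` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_ci95(samples):
    """Return (mean, lower, upper) using a normal-approximation 95% interval.

    Assumption: the interval is mean +/- 1.96 * standard error of the mean.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are needed")
    m = mean(samples)
    half_width = 1.96 * stdev(samples) / sqrt(n)
    return m, m - half_width, m + half_width


def below_midpoint(samples, midpoint=0.5):
    """True if the whole 95% interval lies below `midpoint`."""
    _, _, upper = mean_ci95(samples)
    return upper < midpoint


if __name__ == "__main__":
    relative_positions = [0.1, 0.2, 0.3, 0.4]  # made-up illustrative values
    m, lo, hi = mean_ci95(relative_positions)
    print(f"mean={m:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
    print("entirely below 0.5:", below_midpoint(relative_positions))
```

For small samples a t-based interval would be wider than this approximation; with the large number of sites in a genome-wide data set the difference is typically negligible.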

**Figure S3** Testing the cost of -1 PRF mechanism maintenance. A) Number of slippery sites per gene and per nucleotide across protein expression levels (in molecules per cell) for genes from the integrated PaxDB data set. The red dotted line is the average number of slippery sites per gene. The blue line is a regression line through the data set. The text in the panels gives i) the Spearman correlation coefficient r and the p-value for the null hypothesis that the correlation is zero, and ii) the slope of the line and the p-value of the t-test for a non-zero slope value. B) The frequency distribution of the within-gene positions of the slippery sites, relative to the length of the gene, l/L. To ensure comparability, only genes from the von der Haar data set are considered. The mean of the distribution and its 95% confidence interval both lie below 0.5, the expectation in the absence of selective pressure. C) Slippery site positions relative to gene length across protein expression levels in the von der Haar data set. The red line is the average slippery site position (computed across 20 bins of equal width on a logarithmic scale) and the widgets are the uncertainty (1.96 standard errors of the mean) around the average estimate. Averages with large uncertainties are in violet. Analogously to A), a regression line with the corresponding correlation coefficient r (with p-value) and slope (with p-value) is added.

**Figure S4** Codon Adaptation Index (CAI) frequency distributions of genes without -1 PRF signals (CAI of ORF) and with such signals (CAI of -1 PRF mRNA). The p-value given in the figure is from the two-sample Kolmogorov-Smirnov test comparing the two distributions.

**Figure S5** Spearman rank correlation values between the corrections of CAI for -1 PRF presence (Î_c, I_a) and protein expression levels P (von der Haar, 2008) for different -1 PRF efficiencies.