. Relate Documentation

Estimate coalescence rates and effective population sizes


Estimating effective population sizes

You can find a bash script under

PATH_TO_RELATE/scripts/EstimatePopulationSize/EstimatePopulationSize.sh
This script simultaneously
  1. estimates population sizes
  2. re-estimates branch lengths using the estimated population sizes
  3. estimates an average mutation rate.
To run the script, all you need are the inferred .anc/.mut files and a .poplabels file (see here for the file format).
You can test the code on the toy data set in example/data. Note however that the data set is too small (in terms of sequence length, not sample size) to obtain reliable estimates.
Do not move the script - we are extracting the path to the binaries from the location of the script!

PATH_TO_RELATE/scripts/EstimatePopulationSize/EstimatePopulationSize.sh \
              -i example \
              -m 1.25e-8 \
              --poplabels data/example.poplabels \
              --seed 1 \
              -o example_popsize
Input arguments
-i,--input
Filename of the .anc and .mut files, excluding the file extensions.
-o,--output
Filename of output files without file extensions.
-m,--mutation_rate
Mutation rate per base per generation.
--poplabels
Filename of a .poplabels file. The file format is described here. This file needs to always contain all samples.
In addition, the following optional arguments can be specified:
--pop_of_interest
Optional: Specifies populations for which population size is estimated. The second column of the .poplabels file is used for population labels. You can specify multiple populations separated by commas, e.g. pop1,pop2. When using this option, it is recommended (but NOT necessary), to run Relate with --annot. This can also be done after running Relate without --annot, see here.
--first_chr & --last_chr
Optional: Allow for multiple input files, where filenames are assumed to be numbered by chromosome index.
Example: -i example --first_chr 1 --last_chr 2 assumes example_chr1.anc, example_chr1.mut, example_chr2.anc, example_chr2.mut as input.
--threads
Optional: Maximum number of threads. Default is 1.
--num_bins
Optional: Specifies the number of bins between 1,000ybp and 10,000,000 ybp. Default is 30.
--years_per_gen
Optional: Years per generation. Default: 28.0
--num_iter
Optional: Number of iterations of the algorithm. Default is 5.
--coal
Optional: Initial estimate of coalescence rates treating all haplotypes as one population. Ignored if num_bins is specified. Default: Estimated from input files.
--threshold
Optional: Integer which is the number of SNPs a tree needs to not be dropped. This option serves two purposes: First, it removes bad/uninformative trees. Second it speeds up the inference. As we increase this value, we remove more trees. Default: number of haplotypes.
--seed
Optional: Seed for the random number generator used in the MCMC algorithm for estimating branch lengths.

Output

example_popsize.pdf A plot showing the estimated population size.
example_popsize.anc.gz, example_popsize.mut.gz Genealogy with updated branch lengths. By default, --threshold is set to a value larger than 0, such that these files will include only a part of the genealogy.
You can reestimate branch lengths for the entire genealogy using
PATH_TO_RELATE/scripts/EstimatePopulationSize/EstimatePopulationSize.sh \
                -i example \
                -m 1.25e-8 \
                --poplabels data/example.poplabels \
                --threshold 0 \
                --coal example_popsize.coal \
                --num_iter 1 \
                -o example_whole_genome
example_popsize.dist Contains distances between SNPs (see here for file format).
example_popsize.coal Contains coalescence rates and cross-coalescence rates for populations specified in the .poplabels file.(see below for file format).
example_popsize.bin Binary file containing coalescence rates. Can be used by RelateCoalescenceRates --mode FinalizePopulationSize to extract coalescence rates (e.g., for a different .poplabels file).
example_popsize_avg.rate Contains average mutation rate through time.

File formats

.coal
  • The first line lists populations
  • The second line lists the epochs (in generations).
  • Each subsequent line contains the coalescence rates between two populations. Each line starts with two integers indicating the populations considered.
  • These are "raw" coalescence rates - to obtain diploid effective population size estimates, use 0.5/(coalescence rate).
group1 group2
0 35.7143 49.6248 ...
0 0 8.62951e-05 0.000104302 7.33554e-05 ...
0 1 8.62327e-05 0.000115472 9.85662e-05 ...
1 0 8.62327e-05 0.000115472 9.85662e-05 ...
1 1 0.000114249 8.37037e-05 7.23175e-05 ...
.rate
  • The first column lists the epochs (in generations).
  • The second column lists the mutation rate in this epoch.
0 1.23894e-08
35.7143 1.26274e-08
49.6248 1.2925e-08
68.9535 1.23841e-08

More details

Below, we describe details about the functions used in the script. The general idea is to do the following until convergence:
  1. Estimate average mutation rate
  2. Estimate coalescence rate
  3. Re-estimate branch lengths
If you have multiple chromosomes, you can parallelize step 3.

Calculate coalescence rates through time

Calculates the average pairwise coalescence rate through time.
-i,--input
Filename of the .anc and .mut files, excluding the file extensions.
-o,--output
Filename of output files without file extensions.
-m,--mutation_rate
Mutation rate per base per generation.
--poplabels
Optional: Filename of a .poplabels file. The file format is described here.
--first_chr & --last_chr
Optional: Allow for multiple input files, where filenames are assumed to be numbered by chromosome index.
Example: -i example --first_chr 1 --last_chr 2 assumes example_chr1.anc, example_chr1.mut, example_chr2.anc, example_chr2.mut as input.
--num_bins
Optional: Specifies the number of bins between 1,000ybp and 10,000,000 ybp. Default is 30.
--years_per_gen
Optional: Years per generation. Default: 28.0
 PATH_TO_RELATE/bin/RelateCoalescenceRate \
                 --mode EstimatePopulationSize\
                 -i example \
                 -o example 
Output: example.coal, example.bin.

Sometimes we want to obtain output with and without specifying the option --poplabels, or we want to specify different poplabels. For this, we can extract coalescence rates from example.bin using

 PATH_TO_RELATE/bin/RelateCoalescenceRate \
                 --mode FinalizePopulationSize\
                 --poplabels example.poplabels \
                 -i example \
                 -o example 
Here, --poplabels is optional. The .poplabels file needs to contain all samples.

Reestimate branch lengths

Once the average mutation rate and coalescence rates have been calculated, we can re-estimate the branch lengths.

--anc
Filename of the .anc file
--mut
Filename of the .mut file
-o,--output
Filename of output files without file extensions.
-m,--mutation_rate
Mutation rate per base per generation.
--mrate
Specifies the file containing the average mutation rate. Generated using this function.
--coal
Specifies the file containing coalescent rates. Generated using this function (do not specify option --poplabels for this).
--dist
Optional: Specifies distances between SNPs. If unspecified, we use the distances given in the .mut file. This option is only needed if some SNPs (and trees) have been removed using, e.g., this function. The .dist file can be generated using this function.
--seed
Optional: Specifies a seed for the random number generator used in the MCMC algorithm for estimating branch lengths.

 PATH_TO_RELATE/bin/RelateCoalescenceRate \
                 --mode ReEstimateBranchLengths\
                 --anc example.anc \
                 --mut example.mut \
                 --mrate example_avg.rate \
                 --coal example.coal \
                 -m 1.25e-8 \
                 -o example_updated 
Output: example_updated.anc, example_updated.mut.
With the updated branch lengths, we can then re-estimate the average mutation and coalescence rates. We can iterate these steps until convergence to obtain an algorithm that simulatneously estimates mutation and coalescence rates and branch lengths.

Estimate mutation rates


Calculate the average mutation rate through time

Calculates the average mutation rate.
-i,--input
Filename of the .anc and .mut files, excluding the file extensions.
-o,--output
Filename of output files without file extensions.
--first_chr & --last_chr
Optional: Allow for multiple input files, where filenames are assumed to be numbered by chromosome index.
Example: -i example --first_chr 1 --last_chr 2 assumes example_chr1.anc, example_chr1.mut, example_chr2.anc, example_chr2.mut as input.
--num_bins
Optional: Specifies the number of bins between 1,000ybp and 10,000,000 ybp. Default is 30.
--years_per_gen
Optional: Years per generation. Default: 28.0
--dist
Optional: Specifies .dist file. The .dist file stores distances in bp between SNPs. If unspecified, we use the distances given in the .mut file. This option is only needed if some SNPs (and trees) have been removed using, e.g., this function. The .dist file can be generated using this function. If specified together with --first_chr, --last_chr, please specify the filename without file extension.
 PATH_TO_RELATE/bin/RelateMutationRate \
                 --mode Avg\
                 -i example \
                 -o example 
Output: example_avg.rate.

Calculate the mutation rates grouped by nucleotide in sequence

Calculates the mutation rate for 96 groups of mutations, where each group corresponds to a triplet mutation (e.g. TCC to TTC mutation).
It is necessary to have appended SNP annotation with --annot to use this function (This can also be done after inferring .anc,.mut files, see ).
-i,--input
Filename of the .anc and .mut files, excluding the file extensions.
-o,--output
Filename of output files without file extensions.
--ancestor
Fasta file containing ancestral genome. This is case insensitive.
--mask
Fasta file of same length as the ancestral genome contining a genomic mask. Loci passing the mask are denoted by P, loci not passing the mask are denoted by N. This is case insensitive.
--num_bins
Optional: Specifies the number of bins between 1,000ybp and 10,000,000 ybp. Default is 30.
--years_per_gen
Optional: Years per generation. Default: 28.0
--dist
Optional: Specifies .dist file. The .dist file stores distances in bp between SNPs. If unspecified, we use the distances given in the .mut file. This option is only needed if some SNPs (and trees) have been removed using, e.g., this function. The .dist file can be generated using this function. If specified together with --first_chr, --last_chr, please specify the filename without file extension.
--first_chr & --last_chr
Currently not supported for mode WithContextForChromosome (because we need to specify separate mask, ancestor files for each chromosome.)
You can however apply mode WithContextForChromosome to each chromosome individually. You can then use mode Finalize (see below) with options --first_chr and --last_chr to obtain a genome-wide estimate. Filenames are assumed to be indexed by chromosome as before.
Example: -i example --first_chr 1 --last_chr 2 assumes example_chr1.anc, example_chr1.mut, example_chr2.anc, example_chr2.mut as input.
 PATH_TO_RELATE/bin/RelateMutationRate \
                 --mode WithContextForChromosome \
                 --mask genomic_mask.fa \
                 --ancestor ancestor.fa \
                 -i example_chr1 \
                 -o example_chr1 
Output: example_chr1.bin.

Next, to extract the rates from example.bin, use
 PATH_TO_RELATE/bin/RelateMutationRate \
                 --mode Finalize \
                 -i example_chr1 \
                 -o example_chr1 
Output: example_chr1.rate.

If you applied mode WithContextForChromosome to multiple chromosome, use the following to obtain a .bin file summarizing estimates genome-wide. Input filenames are assumed to be example_chr1.bin, example_chr2.bin, etc.
 PATH_TO_RELATE/bin/RelateMutationRate \
                 --mode Finalize \
                 --first_chr 1 \
                 --last_chr 22 \
                 -i example \
                 -o example 
Output: example.bin, example.rate.

Detect positive selection


Detect selection

Calculates p-values for selection evidence from genealogy.

We recommend that you first estimate the population size history using this module, with setting --threshold 0. This is so that the branch lengths in all trees are updated for the estimated population size history. Alternatively, if the region you are analysing is short, you can specify a .coal file in this module and the script reestimates branch lengths for the trees of interest before calculating selection p-values.

P-values can only be calculated for mutations with derived allele frequency of at least 3. Mutations with a smaller DAF are filtered out if the DAFs are appended to the .mut file using RelateFileFormats --mode GenerateSNPAnnotations, otherwise we set the -log10 p-value column to 1.

-i,--input
Filename of .anc and .mut files without file extension. Needs to contain all SNPs in haps/sample input file.
-o,--output
File of the output files without file extensions.
-m,--mu
Mutation rate, e.g., 1.25e-8.
--first_bp
Optional: First bp position.
--last_bp
Optional: Last bp position.
--years_per_gen
Optional: Years per generation. Default: 28.0
--coal
Optional: Estimate of coalescent rate, treating all haplotypes as one population. Not necessary if .anc/.mut files were obtained using EstimatePopulationSize.sh, otherwise strongly recommended.
--seed
Optional: Random seed for branch lengths estimation. Only used if --coal is specified.

 PATH_TO_RELATE/scripts/DetectSelection/DetectSelection.sh \
                 -i example \
                 -o example_selection \
                 -m 1.25e-8 \
                 --first_bp 1000000 \
                 --last_bp 1000000 \
                 --years_per_gen 28 \
                 --coal popsize.coal 
Output: example_selection.lin, example_selection.freq, example_selection.sele, (example_selection.anc, example_selection.mut, example_selection_dist if coal is specified).

File formats

.freq Records the frequency of the derived allele at generations genN .. gen1 (specified in the header)

Columns: pos rs_id genN genN-1 ... gen1 0
.lin Records the number of lineages in the tree at generations genN .. gen1 (specified in the header), as well as the number of lineages when the mutation had frequency 2

Columns: pos rs_id genN genN-1 ... gen1 0 when_mutation_has_freq2
.sele Records the log10 p-value for selection evidence at generations genN .. gen1 (specified in the header), as well as the log10 p-value when the mutation had frequency 2. Log10 p-value is set to 1 if mutation had frequency <= 1 at a generation.
The "when_mutation_has_freq2" column contains the p-value for selection evidence over the lifetime of this mutation. All other p-values condition on the frequency at the specified time.

Columns: pos rs_id genN genN-1 ... gen1 0 when_mutation_has_freq2

More details

Below, we describe details about the functions used to calculate the selection p-value.

Get selection pvalues

We first calculate, for each SNP, its frequency through time. This information is stored in two files: the .freq file and the .lin file. The former stores the frequency of a SNP through time, the latter stores the number of lineages in the corresponding tree through time.

 PATH_TO_RELATE/bin/RelateSelection \
                 --mode Frequency \
                 -i example \
                 -o example 
Output: example.freq, example.lin.

We then use these two files to calculate a p-value for selection evidence for each SNP.

 PATH_TO_RELATE/bin/RelateSelection \
                 --mode Selection \
                 -i example \
                 -o example 
Output: example.sele.

Sample branch lengths (e.g., for CLUES)


Sample branch lengths

Samples branch lengths using MCMC.
The output file format resembles that of ARGweaver, which is the input file format required by CLUES for inference of selection and inferring allele frequency trajectories.

Documentation for using Relate with CLUES is under construction.

In Relate, we map mutations to branches with approximately the same descendants as carriers of the derived allele to be robust to errors. Therefore, we output an additional .sites file in which only the descendants of the branch to which the mutation is mapped are marked as derived, the rest are ancestral. Please also note that some SNPs don't map to a unique branch in the tree and such SNPs are left out of this .sites file.

-i,--input
Filename of .anc and .mut files without file extension. Needs to contain all SNPs in input haps/sample files. If not, see --dist option.
-o,--output
File of the output files without file extensions.
-m,--mu
Mutation rate, e.g., 1.25e-8.
--coal
Estimated coalescent rate, treating all haplotypes as one population.
--num_samples
Number of times branch lengths are sampled.
--num_proposals
Optional: Number of MCMC proposals between samples. Default is max(10*N, 1000), where N is the number of haplotypes. If num_proposals=0, all samples will be identical to the input.
--first_bp
Optional: First bp position.
--last_bp
Optional: Last bp position.
--dist
Optional: Filename containing bp positions of SNPs and distances between adjacent SNPs (file format). E.g., obtained using RelateExtract --mode AncMutForSubregion. Required if anc/mut file only covers a subregion of input haps/sample files.
--seed
Optional: Random seed for branch lengths estimation.

 PATH_TO_RELATE/scripts/SampleBranchLengths/SampleBranchLengths.sh \
                 -i example \
                 -o example_sub \
                 -m 1.25e-8 \
                 --coal popsize.coal \
                 --num_samples 5 \
                 --first_bp 100000 \
                 --last_bp 200000 \
                 --seed 1 
Output: example_sub.newick, example_sub.sites.

File formats

.newick A tab-delimited file with columns
chrom chromStart chromEnd MCMC_sample tree
  • chrom: Chromosome id.
  • chromStart: bp position at which tree starts.
  • chromEnd: bp position at which tree ends.
  • MCMC_sample: integer specifying the MCMC sample.
  • tree: Newick string of the tree.
.sites A tab-delimited file with
  • First line: Lists names of haplotypes.
  • Second line: Shows genomic region covered by this file.
  • Remaining lines: bp position followed by haplotype pattern at this site.

Plot trees of interest


TreeView

Plots a tree of interest using R.
Requires libraries ggplot2, grid, gridExtra.

--anc
Filename of the .anc file.
--mut
Filename of the .mut file.
--haps
Filename of haps file.
--sample
Filename of sample file.
--poplabels
Filename of a .poplabels file. The file format is described here. This file needs to always contain all samples.
--bp_of_interest
Integer specfifying the BP position of the SNP of interest
--years_per_gen
Years per generation. Default: 28.0
-o,--output
File of the output files without file extensions.

 PATH_TO_RELATE/scripts/TreeView/TreeView.sh \
                 --haps data/example.haps \
                 --sample data/example.sample \
                 --anc example.anc \
                 --mut example.mut \
                 --poplabels data/example.poplabels \
                 --bp_of_interest 3000000 \
                 --years_per_gen 28 \
                 -o example 
Output: example.pdf.

You can change plotting parameters by editing PATH_TO_RELATE/scripts/TreeView/treeview.R. Specifically, lines 9 - 22 directly influence the plotting parameters.

TreeViewMutation

Plots a tree of interest using R and highlights lineages carrying derived allele.
Requires libraries ggplot2, grid, gridExtra.
Takes same arguments as the TreeView.sh script, however bp_of_interest has to specify a SNP in the input.

 PATH_TO_RELATE/scripts/TreeView/TreeViewMutation.sh \
									 --haps data/example.haps \
									 --sample data/example.sample \
									 --anc example.anc \
									 --mut example.mut \
									 --poplabels data/example.poplabels \
									 --bp_of_interest 3000000 \
									 --years_per_gen 28 \
									 -o example 
Output: example.pdf.

Extract subtrees, trees in region of interest, and trees in newick format.


Extract subtrees corresponding to a subpopulation

Extracts subtrees such that all tips belong to the populations of interest.

--anc
Filename of the .anc file.
--mut
Filename of the .mut file.
--poplabels
Filename of a .poplabels file. The file format is described here. This file needs to always contain all samples.
--pop_of_interest
Specifies populations for which population size is estimated. The second column of the .poplabels file is used for population labels. You can specify multiple populations separated by commas, e.g. pop1,pop2.
-o,--output
File of the output files without file extensions.
 PATH_TO_RELATE/bin/RelateExtract\
                 --mode SubTreesForSubpopulation\
                 --anc example.anc \
                 --mut example.mut \
                 --poplabels example.poplabels \
                 --pop_of_interest group1 \
                 -o example_subpop 
Output: example_subpop.anc, example_subpop.mut, example_subpop.poplabels.

Extract trees corresponding to a subregion

Extracts trees in region of interest.

Please always apply this on the anc/mut files inferred using Relate (from haps/sample files), i.e., do not apply repeatedly to get "subregions of subregions". If you do, ignore the output .dist file and use the .dist file obtained from the original anc/mut files instead.

--anc
Filename of the .anc file.
--mut
Filename of the .mut file.
--first_bp
BP of start of region.
--last_bp
BP of end of region.
-o,--output
File of the output files without file extensions.
 PATH_TO_RELATE/bin/RelateExtract\
                 --mode AncMutForSubregion\
                 --anc example.anc \
                 --mut example.mut \
                 --first_bp 0 \
                 --last_bp 10000 \
                 -o example_subregion 
Output: example_subregion.anc, example_subregion.mut, example_subregion.dist.

Extract tree at a SNP of interest in newick format

Extracts a tree at a SNP of interest in newick format.

--anc
Filename of the .anc file.
--mut
Filename of the .mut file.
--bp_of_interest
Integer specfifying the BP position of interest
-o,--output
File of the output files without file extensions.
 PATH_TO_RELATE/bin/RelateExtract\
                 --mode TreeAtSNPAsNewick \
                 --anc example.anc \
                 --mut example.mut \
                 --bp_of_interest 3000000 \
                 -o example 
Output: example_at_.newick.

Remove trees with few mutations

Removes trees and associated SNPs from the .anc/.mut files onto which less than a given number of SNPs are mapped.

--anc
Filename of the .anc file.
--mut
Filename of the .mut file.
--threshold
Integer which is the number of SNPs a tree needs to not be dropped. This option serves two purposes: First, it removes bad/uninformative trees. Second it speeds up the inference. As we increase this value, we remove more trees. Default: number of haplotypes.
-o,--output
File of the output files without file extensions.
 PATH_TO_RELATE/bin/RelateExtract\
                 --mode RemoveTreesWithFewMutations \
                 --anc example.anc \
                 --mut example.mut \
                 --threshold 100 \
                 -o example_subset 
Output: example_subset.anc, example_subset.mut, example.dist.

Extract .dist file from .mut file.

Extracts distances between SNPs from the .mut file (3rd and 4th column). The .dist file is needed in some applications, when the .anc/.mut files have been subsetted using, e.g., this function In these cases, the .dist file needs to be generated using the not subsetted .mut file.

--mut
Filename of the .mut file.
-o,--output
File of the output files without file extensions.
 PATH_TO_RELATE/bin/RelateExtract\
                 --mode ExtractDistFromMut \
                 --mut example.mut \
                 -o example 
Output: example.dist.

Divide the .anc/.mut files into smaller files to allow multi-threading.

Dividing the .anc/.mut files into smaller files to allow for multi-threading.

threads
Number of threads that will be used. This slightly varies the number of chunks into which the .anc/.mut files will get divided into.
--anc
Filename of the .anc file.
--mut
Filename of the .mut file.
-o,--output
File of the output files without file extensions.
 PATH_TO_RELATE/bin/RelateExtract\
                 --mode DivideAncMut \
                 --threads 5 \
                 --anc example.anc \
                 --mut example.mut \
                 -o example 
Output: example.param, example_chr1.anc, example_chr1.mut, example_chr2.anc, example_chr2.mut, etc.

Combine the .anc/.mut files into one file.

Combining the .anc/.mut files into one file. This should be used after dividing .anc/.mut files using this function.

-o,--output
File of the output files without file extensions.

In addition, we assume that the corresponding .param file is in the same directory. You can also manually create this file, which has the header "NUM_HAPLOTYPES NUM_SNPS NUM_TREES NUM_CHUNKS", followed by the corresponding entries on the next line.

 PATH_TO_RELATE/bin/RelateExtract\
                 --mode CombineAncMut \
                 -o example 
Output: example.anc.gz, example.mut.gz.

File formats

.dist
  • The first line is a header
  • Each subsequent line contains the base pair position of the SNP and the distance (in bp) to the next SNP.
#pos dist
51479 22
51675 20
51872 42
51914 21