Relate Documentation

Overview

This page lists add-on modules that can be applied to the Relate output for inference of coalescnece rates, mutation rates, and selection.

Estimate coalescence rates and effective population sizes

Estimating effective population sizes

You can find a bash script under

PATH_TO_RELATE/scripts/EstimatePopulationSize/EstimatePopulationSize.sh

This script simultaneously

estimates population sizes
re-estimates branch lengths using the estimated population sizes
estimates an average mutation rate.

To run the script, all you need are the inferred .anc/.mut files and a .poplabels file (see here for the file format).

Please note that two versions of coalescence rates/population sizes are outputted:

output.coal: Treats all samples as one group.
*.pairwise.coal/.bin, reflects grouping in poplabels file. You can rerun RelateCoalescentRates --mode FinalizePopulationSize to extract coalescence rates for any .poplabels file.

You can test the code on the toy data set in example/data.
Do not move the script - we are extracting the path to the binaries from the location of the script!

PATH_TO_RELATE/scripts/EstimatePopulationSize/EstimatePopulationSize.sh \
              -i example \
              -m 1.25e-8 \
              --poplabels data/example.poplabels \
              --seed 1 \
              -o example_popsize

Input arguments

-i,--input: Filename of the .anc and .mut files, excluding the file extensions.
-o,--output: Filename of output files without file extensions.
-m,--mutation_rate: Mutation rate per base per generation.
--poplabels: Filename of a .poplabels file. The file format is described here. This file needs to always contain all samples.

In addition, the following optional arguments can be specified:

--noanc: Optional: Flag specifying whether to reinfer branch lengths at the end. 0: reinfer, 1: don't reinfer. Default: 0.
--pop_of_interest: Optional: Specifies populations for which population size is estimated. The second column of the .poplabels file is used for population labels. You can specify multiple populations separated by commas, e.g. pop1,pop2. When using this option, it is recommended (but NOT necessary), to run Relate with --annot. This can also be done after running Relate without --annot, see here.
--first_chr & --last_chr: Optional: Allow for multiple input files, where filenames are assumed to be numbered by chromosome index.
Example: -i example --first_chr 1 --last_chr 2 assumes example_chr1.anc, example_chr1.mut, example_chr2.anc, example_chr2.mut as input.
--chr: Optional: File containing chromosome names (one per line). Overrides --first_chr and --last_chr.
Example: -i example --chr chr.txt assumes example_chr*.anc, example_chr*.mut as input.
--threads: Optional: Maximum number of threads. Default is 1.
--bins: Optional: Specify epoch bins. Format: lower, upper, stepsize for function c(0,10^seq(lower, upper, stepsize)).
--years_per_gen: Optional: Years per generation. Default: 28.0
--num_iter: Optional: Number of iterations of the algorithm. Default is 5.
--threshold: Optional: Float in [0,1] which is the fraction of trees to be dropped, ranked by number of mutations mapping. This option serves two purposes: First, it removes bad/uninformative trees. Second it speeds up the inference. As we increase this value, we remove more trees. Default: 0.5.
--seed: Optional: Seed for the random number generator used in the MCMC algorithm for estimating branch lengths.
--noplot: Optional: Takes no value. If specified, do not plot the results using R.

Output

example_popsize.pdf	A plot showing the estimated population size.
example_popsize.anc.gz, example_popsize.mut.gz, example_popsize.dist	Not outputted if --noanc 1. Genealogy with updated branch lengths. Will contain all trees regardless of threshold value. If --noanc 1 is specified, you can separately reestimate branch lengths using the ReEstimateBranchLengths.sh script.
example_popsize.coal	Contains coalescence rates and cross-coalescence rates, treating all samples as one population (see below for file format).
example_popsize.pairwise.coal and *.bin	Coalescence rate file and corresponding binary file containing coalescence rates between pairs of samples. Can be used by RelateCoalescentRates --mode FinalizePopulationSize to extract coalescence rates for any .poplabels file.
example_popsize_avg.rate	Contains average mutation rate through time. Please note that any trends in the average mutation rate are confounded by coalescence rate changes - Relate can only infer mutation rate changes in mutation subcategories (see estimate mutation rate section).

File formats

.coal

The first line lists populations
The second line lists the epochs (in generations).
Each subsequent line contains the coalescence rates between two populations. Each line starts with two integers indicating the populations considered.
These are "raw" coalescence rates - to obtain diploid effective population size estimates, use 0.5/(coalescence rate).

group1 group2
0 35.7143 49.6248 ...
0 0 8.62951e-05 0.000104302 7.33554e-05 ...
0 1 8.62327e-05 0.000115472 9.85662e-05 ...
1 0 8.62327e-05 0.000115472 9.85662e-05 ...
1 1 0.000114249 8.37037e-05 7.23175e-05 ...

.rate

The first column lists the epochs (in generations).
The second column lists the mutation rate in this epoch.

0 1.23894e-08
35.7143 1.26274e-08
49.6248 1.2925e-08
68.9535 1.23841e-08

Reestimate branch lengths

Reestimate branch lengths using MCMC given coal file.

For a given coal file, this script will output anc/mut files with posterior mean branch lengths given coalescence rates.

-i,--input: Filename of .anc and .mut files without file extension. Needs to contain all SNPs in input haps/sample files. If not, see --dist option.
-o,--output: File of the output files without file extensions.
-m,--mu: Mutation rate, e.g., 1.25e-8.
--coal: Estimated coalescent rate, treating all haplotypes as one population.
--threads: Optional: Maximum number of threads. Default is 1.
--first_bp: Optional: First bp position.
--last_bp: Optional: Last bp position.
--dist: Optional: Filename containing bp positions of SNPs and distances between adjacent SNPs (file format). E.g., obtained using RelateExtract --mode AncMutForSubregion. Required if anc/mut file only covers a subregion of input haps/sample files.
--seed: Optional: Random seed for branch lengths estimation.

 PATH_TO_RELATE/scripts/SampleBranchLengths/ReEstimateBranchLengths.sh \
                 -i example \
                 -o example_sub \
                 -m 1.25e-8 \
                 --coal popsize.coal \
                 --first_bp 100000 \
                 --last_bp 200000 \
                 --seed 1

Output: example_sub.anc/mut.

More details

Below, we describe details about the functions used in the script. The general idea is to do the following until convergence:

Estimate average mutation rate
Estimate coalescence rate
Re-estimate branch lengths

If you have multiple chromosomes, you can parallelize step 3.

Calculate coalescence rates through time

Calculates the average pairwise coalescence rate through time.

-i,--input: Filename of the .anc and .mut files, excluding the file extensions.
-o,--output: Filename of output files without file extensions.
-m,--mutation_rate: Mutation rate per base per generation.
--poplabels: Optional: Filename of a .poplabels file. The file format is described here.
--first_chr & --last_chr: Optional: Allow for multiple input files, where filenames are assumed to be numbered by chromosome index.
Example: -i example --first_chr 1 --last_chr 2 assumes example_chr1.anc, example_chr1.mut, example_chr2.anc, example_chr2.mut as input.
--chr: Optional: File containing chromosome names (one per line). Overrides --first_chr and --last_chr.
Example: -i example --chr chr.txt assumes example_chr*.anc, example_chr*.mut as input.
--bins: Optional: Specify epoch bins. Format: lower, upper, stepsize for function c(0,10^seq(lower, upper, stepsize)).
--years_per_gen: Optional: Years per generation. Default: 28.0

 PATH_TO_RELATE/bin/RelateCoalescenceRate \
                 --mode EstimatePopulationSize\
                 -i example \
                 -o example

Output: example.coal, example.bin.

Sometimes we want to obtain output with and without specifying the option --poplabels, or we want to specify different poplabels. For this, we can extract coalescence rates from example.bin using

 PATH_TO_RELATE/bin/RelateCoalescenceRate \
                 --mode FinalizePopulationSize\
                 --poplabels example.poplabels \
                 -i example \
                 -o example

Here, --poplabels is optional. The .poplabels file needs to contain all samples.

Reestimate branch lengths

Once the average mutation rate and coalescence rates have been calculated, we can re-estimate the branch lengths.

--anc: Filename of the .anc file
--mut: Filename of the .mut file
-o,--output: Filename of output files without file extensions.
-m,--mutation_rate: Mutation rate per base per generation.
--mrate: Specifies the file containing the average mutation rate. Generated using this function.
--coal: Specifies the file containing coalescent rates. Generated using this function (do not specify option --poplabels for this).
--dist: Optional: Specifies distances between SNPs. If unspecified, we use the distances given in the .mut file. This option is only needed if some SNPs (and trees) have been removed using, e.g., this function. The .dist file can be generated using this function.
--seed: Optional: Specifies a seed for the random number generator used in the MCMC algorithm for estimating branch lengths.

 PATH_TO_RELATE/bin/RelateCoalescenceRate \
                 --mode ReEstimateBranchLengths\
                 --anc example.anc \
                 --mut example.mut \
                 --mrate example_avg.rate \
                 --coal example.coal \
                 -m 1.25e-8 \
                 -o example_updated

Output: example_updated.anc, example_updated.mut.
With the updated branch lengths, we can then re-estimate the average mutation and coalescence rates. We can iterate these steps until convergence to obtain an algorithm that simulatneously estimates mutation and coalescence rates and branch lengths.

Estimate mutation rates

Calculate the average mutation rate through time

Calculates the average mutation rate.

-i,--input: Filename of the .anc and .mut files, excluding the file extensions.
-o,--output: Filename of output files without file extensions.
--first_chr & --last_chr: Optional: Allow for multiple input files, where filenames are assumed to be numbered by chromosome index.
Example: -i example --first_chr 1 --last_chr 2 assumes example_chr1.anc, example_chr1.mut, example_chr2.anc, example_chr2.mut as input.
--chr: Optional: File containing chromosome names (one per line). Overrides --first_chr and --last_chr.
Example: -i example --chr chr.txt assumes example_chr*.anc, example_chr*.mut as input.
--bins: Optional: Specify epoch bins. Format: lower, upper, stepsize for function c(0,10^seq(lower, upper, stepsize)).
--years_per_gen: Optional: Years per generation. Default: 28.0
--dist: Optional: Specifies .dist file. The .dist file stores distances in bp between SNPs. If unspecified, we use the distances given in the .mut file. This option is only needed if some SNPs (and trees) have been removed using, e.g., this function. The .dist file can be generated using this function. If specified together with --first_chr, --last_chr, please specify the filename without file extension.

 PATH_TO_RELATE/bin/RelateMutationRate \
                 --mode Avg\
                 -i example \
                 -o example

Output: example_avg.rate.

Calculate the mutation rates grouped by nucleotide in sequence

Calculates the mutation rate for 96 groups of mutations, where each group corresponds to a triplet mutation (e.g. TCC to TTC mutation).
It is necessary to have appended SNP annotation with --annot to use this function (This can also be done after inferring .anc,.mut files, see ).

-i,--input: Filename of the .anc and .mut files, excluding the file extensions.
-o,--output: Filename of output files without file extensions.
--ancestor: Fasta file containing ancestral genome. This is case insensitive.
--mask: Fasta file of same length as the ancestral genome contining a genomic mask. Loci passing the mask are denoted by P, loci not passing the mask are denoted by N. This is case insensitive.
--bins: Optional: Specify epoch bins. Format: lower, upper, stepsize for function c(0,10^seq(lower, upper, stepsize)).
--years_per_gen: Optional: Years per generation. Default: 28.0
--dist: Optional: Specifies .dist file. The .dist file stores distances in bp between SNPs. If unspecified, we use the distances given in the .mut file. This option is only needed if some SNPs (and trees) have been removed using, e.g., this function. The .dist file can be generated using this function. If specified together with --first_chr, --last_chr, please specify the filename without file extension.
--first_chr & --last_chr: Currently not supported for mode WithContextForChromosome (because we need to specify separate mask, ancestor files for each chromosome.)
You can however apply mode WithContextForChromosome to each chromosome individually. You can then use mode Finalize (see below) with options --first_chr and --last_chr to obtain a genome-wide estimate. Filenames are assumed to be indexed by chromosome as before.
Example: -i example --first_chr 1 --last_chr 2 assumes example_chr1.anc, example_chr1.mut, example_chr2.anc, example_chr2.mut as input.
--chr: Optional: File containing chromosome names (one per line). Overrides --first_chr and --last_chr.
Example: -i example --chr chr.txt assumes example_chr*.anc, example_chr*.mut as input.

 PATH_TO_RELATE/bin/RelateMutationRate \
                 --mode WithContextForChromosome \
                 --mask genomic_mask.fa \
                 --ancestor ancestor.fa \
                 -i example_chr1 \
                 -o example_chr1

Output: example_chr1.bin.

Next, to extract the rates from example.bin, use

 PATH_TO_RELATE/bin/RelateMutationRate \
                 --mode Finalize \
                 -i example_chr1 \
                 -o example_chr1

Output: example_chr1.rate.

If you applied mode WithContextForChromosome to multiple chromosome, use the following to obtain a .bin file summarizing estimates genome-wide. Input filenames are assumed to be example_chr1.bin, example_chr2.bin, etc.

 PATH_TO_RELATE/bin/RelateMutationRate \
                 --mode Finalize \
                 --first_chr 1 \
                 --last_chr 22 \
                 -i example \
                 -o example

Output: example.bin, example.rate.

Detect positive selection

Detect selection

Calculates p-values for selection evidence from genealogy.

We recommend that you first estimate the population size history using this module, with setting --threshold 0. This is so that the branch lengths in all trees are updated for the estimated population size history. Alternatively, if the region you are analysing is short, you can specify a .coal file in this module and the script reestimates branch lengths for the trees of interest before calculating selection p-values.

P-values can only be calculated for mutations with derived allele frequency of at least 3. Mutations with a smaller DAF are filtered out if the DAFs are appended to the .mut file using RelateFileFormats --mode GenerateSNPAnnotations, otherwise we set the -log10 p-value column to 1.

-i,--input: Filename of .anc and .mut files without file extension. Needs to contain all SNPs in haps/sample input file.
-o,--output: File of the output files without file extensions.
-m,--mu: Mutation rate, e.g., 1.25e-8.
--first_bp: Optional: First bp position.
--last_bp: Optional: Last bp position.
--years_per_gen: Optional: Years per generation. Default: 28.0
--coal: Optional: Estimate of coalescent rate, treating all haplotypes as one population. Not necessary if .anc/.mut files were obtained using EstimatePopulationSize.sh, otherwise strongly recommended.
--seed: Optional: Random seed for branch lengths estimation. Only used if --coal is specified.

 PATH_TO_RELATE/scripts/DetectSelection/DetectSelection.sh \
                 -i example \
                 -o example_selection \
                 -m 1.25e-8 \
                 --first_bp 1000000 \
                 --last_bp 1000000 \
                 --years_per_gen 28 \
                 --coal popsize.coal

Output: example_selection.lin, example_selection.freq, example_selection.sele, (example_selection.anc, example_selection.mut, example_selection_dist if coal is specified).

File formats

.freq	Records the frequency of the derived allele at generations genN .. gen1 (specified in the header) Columns: pos rs_id genN genN-1 ... gen1 0
.lin	Records the number of lineages in the tree at generations genN .. gen1 (specified in the header), as well as the number of lineages when the mutation had frequency 2 Columns: pos rs_id genN genN-1 ... gen1 0 when_mutation_has_freq2
.sele	Records the log10 p-value for selection evidence at generations genN .. gen1 (specified in the header), as well as the log10 p-value when the mutation had frequency 2. Log10 p-value is set to 1 if mutation had frequency <= 1 at a generation. The "when_mutation_has_freq2" column contains the p-value for selection evidence over the lifetime of this mutation. All other p-values condition on the frequency at the specified time. Columns: pos rs_id genN genN-1 ... gen1 0 when_mutation_has_freq2

More details

Below, we describe details about the functions used to calculate the selection p-value.

Get selection pvalues

We first calculate, for each SNP, its frequency through time. This information is stored in two files: the .freq file and the .lin file. The former stores the frequency of a SNP through time, the latter stores the number of lineages in the corresponding tree through time.

 PATH_TO_RELATE/bin/RelateSelection \
                 --mode Frequency \
                 -i example \
                 -o example

Output: example.freq, example.lin.

We then use these two files to calculate a p-value for selection evidence for each SNP.

 PATH_TO_RELATE/bin/RelateSelection \
                 --mode Selection \
                 -i example \
                 -o example

Output: example.sele.

Sample branch lengths (e.g., for CLUES)

This module is used in CLUES and PALM, two methods for inference of selection strength on single loci and polygenic traits. Please refer to the CLUES documentation or the PALM documentation for details.

Sample branch lengths

Samples branch lengths using MCMC given a coal file.
The output file format is either anc/mut (with multiple branch lengths stored in the anc file), or input file formats of CLUES or PALM.
The output file format for CLUES resembles that of ARGweaver, which is the input file format required by CLUES for inference of selection and inferring allele frequency trajectories.

In Relate, we map mutations to branches with approximately the same descendants as carriers of the derived allele to be robust to errors. Therefore, we output an additional .sites file in which only the descendants of the branch to which the mutation is mapped are marked as derived, the rest are ancestral. Please also note that some SNPs don't map to a unique branch in the tree and such SNPs are left out of this .sites file.

-i,--input: Filename of .anc and .mut files without file extension. Needs to contain all SNPs in input haps/sample files. If not, see --dist option.
-o,--output: File of the output files without file extensions.
-m,--mu: Mutation rate, e.g., 1.25e-8.
--coal: Estimated coalescent rate, treating all haplotypes as one population.
--format: Output file format. 'a': anc/mut, 'n': newick/sites (used by CLUES), 'b': binary (used by CLUES and PALM). Default: 'a'. Note that anc/mut is modified to store multiple branch lengths for any one branch.
--num_samples: Number of times branch lengths are sampled. Integer >= 1.
--num_proposals: Optional: Number of MCMC proposals between samples. Default is max(10*N, 1000), where N is the number of haplotypes. If num_proposals=0, all samples will be identical to the input.
--first_bp: Optional: First bp position.
--last_bp: Optional: Last bp position.
--dist: Optional: Filename containing bp positions of SNPs and distances between adjacent SNPs (file format). E.g., obtained using RelateExtract --mode AncMutForSubregion. Required if anc/mut file only covers a subregion of input haps/sample files.
--seed: Optional: Random seed for branch lengths estimation.

 PATH_TO_RELATE/scripts/SampleBranchLengths/SampleBranchLengths.sh \
                 -i example \
                 -o example_sub \
                 -m 1.25e-8 \
                 --coal popsize.coal \
                 --format a \
                 --num_samples 5 \
                 --first_bp 100000 \
                 --last_bp 200000 \
                 --seed 1

Output: example_sub.anc/mut or example_sub.newick/.sites or example_sub.timeb.

File formats

.newick	A tab-delimited file with columns chrom chromStart chromEnd MCMC_sample tree chrom: Chromosome id. chromStart: bp position at which tree starts. chromEnd: bp position at which tree ends. MCMC_sample: integer specifying the MCMC sample. tree: Newick string of the tree.
.sites	A tab-delimited file with First line: Lists names of haplotypes. Second line: Shows genomic region covered by this file. Remaining lines: bp position followed by haplotype pattern at this site.

Plot trees of interest

TreeView

Plots a tree of interest using R.
Requires libraries ggplot2, cowplot.

--anc: Filename of the .anc file.
--mut: Filename of the .mut file.
--haps: Filename of haps file.
--sample: Filename of sample file.
--poplabels: Filename of a .poplabels file. The file format is described here. This file needs to always contain all samples.
--bp_of_interest: Integer specfifying the BP position of the SNP of interest
--years_per_gen: Years per generation. Default: 28.0
-o,--output: File of the output files without file extensions.

 PATH_TO_RELATE/scripts/TreeView/TreeView.sh \
                 --haps data/example.haps \
                 --sample data/example.sample \
                 --anc example.anc \
                 --mut example.mut \
                 --poplabels data/example.poplabels \
                 --bp_of_interest 3000000 \
                 --years_per_gen 28 \
                 -o example

Output: example.pdf.

You can change plotting parameters by editing PATH_TO_RELATE/scripts/TreeView/treeview.R. Specifically, lines 9 - 22 directly influence the plotting parameters.

TreeViewMutation

Plots a tree of interest using R and highlights lineages carrying derived allele.
Requires libraries ggplot2, cowplot.
Takes same arguments as the TreeView.sh script, however bp_of_interest has to specify a SNP in the input.

 PATH_TO_RELATE/scripts/TreeView/TreeViewMutation.sh \                   
                   --haps data/example.haps
                   --sample data/example.sample \
                   --anc example.anc \
                   --mut example.mut \
                   --poplabels data/example.poplabels \
                   --bp_of_interest 3000000 \
                   --years_per_gen 28 \
                   -o example

Output: example.pdf.

TreeViewSamples

Plots a tree of interest using R for an anc/mut file containing multiple sampled branch length (obtained using the SampleBranchLengths module).
Requires libraries ggplot2, cowplot.
Takes same arguments as the TreeView.sh script.

 PATH_TOcowplot/TreeView/TreeViewSamples.sh \                   
                --haps data/example.haps
                --sample data/example.sample \
                --anc example.anc \
                --mut example.mut \
                --poplabels data/example.poplabels \
                --bp_of_interest 3000000 \
                --years_per_gen 28 \
                -o example

Output: example.pdf.

Extract subtrees, trees in region of interest, and trees in newick format.

Extract subtrees corresponding to a subpopulation

Extracts subtrees such that all tips belong to the populations of interest.

--anc: Filename of the .anc file.
--mut: Filename of the .mut file.
--poplabels: Filename of a .poplabels file. The file format is described here. This file needs to always contain all samples.
--pop_of_interest: Specifies populations for which population size is estimated. The second column of the .poplabels file is used for population labels. You can specify multiple populations separated by commas, e.g. pop1,pop2.
-o,--output: File of the output files without file extensions.

 PATH_TO_RELATE/bin/RelateExtract\
                 --mode SubTreesForSubpopulation\
                 --anc example.anc \
                 --mut example.mut \
                 --poplabels example.poplabels \
                 --pop_of_interest group1 \
                 -o example_subpop

Output: example_subpop.anc, example_subpop.mut, example_subpop.poplabels.

Extract trees corresponding to a subregion

Extracts trees in region of interest.

Please always apply this on the anc/mut files inferred using Relate (from haps/sample files), i.e., do not apply repeatedly to get "subregions of subregions". If you do, ignore the output .dist file and use the .dist file obtained from the original anc/mut files instead.

--anc: Filename of the .anc file.
--mut: Filename of the .mut file.
--first_bp: BP of start of region.
--last_bp: BP of end of region.
-o,--output: File of the output files without file extensions.

 PATH_TO_RELATE/bin/RelateExtract\
                 --mode AncMutForSubregion\
                 --anc example.anc \
                 --mut example.mut \
                 --first_bp 0 \
                 --last_bp 10000 \
                 -o example_subregion

Output: example_subregion.anc, example_subregion.mut, example_subregion.dist.

Extract tree in region of interest in newick format

Extracts trees in region of interest in newick format.

--anc: Filename of the .anc file.
--mut: Filename of the .mut file.
-o,--output: File of the output files without file extensions.
--bp_of_interest: Integer specfifying the BP position of interest
--first_bp: Integer specfifying the first BP position of interest
--last_bp: Integer specfifying the last BP position of interest

 PATH_TO_RELATE/bin/RelateExtract\
                 --mode AncToNewick \
                 --anc example.anc \
                 --mut example.mut \
                 --first_bp 3000000 \
                 --last_bp 4000000 \
                 -o example

Output: example.newick, example.pos. The latter file stores bp start position of each tree.

Remove trees with few mutations

Removes trees and associated SNPs from the .anc/.mut files onto which less than a given number of SNPs are mapped.

--anc: Filename of the .anc file.
--mut: Filename of the .mut file.
--threshold: Float in [0,1] which is the fraction of trees to be dropped, ranked by number of mutations mapping. This option serves two purposes: First, it removes bad/uninformative trees. Second it speeds up the inference. As we increase this value, we remove more trees. Default: 0.5.
-o,--output: File of the output files without file extensions.

 PATH_TO_RELATE/bin/RelateExtract\
                 --mode RemoveTreesWithFewMutations \
                 --anc example.anc \
                 --mut example.mut \
                 --threshold 0.5 \
                 -o example_subset

Output: example_subset.anc, example_subset.mut, example.dist.

Extract .dist file from .mut file.

Extracts distances between SNPs from the .mut file (3rd and 4th column). The .dist file is needed in some applications, when the .anc/.mut files have been subsetted using, e.g., this function In these cases, the .dist file needs to be generated using the not subsetted .mut file.

--mut: Filename of the .mut file.
-o,--output: File of the output files without file extensions.

 PATH_TO_RELATE/bin/RelateExtract\
                 --mode ExtractDistFromMut \
                 --mut example.mut \
                 -o example

Output: example.dist.

Divide the .anc/.mut files into smaller files to allow multi-threading.

Dividing the .anc/.mut files into smaller files to allow for multi-threading.

threads: Number of threads that will be used. This slightly varies the number of chunks into which the .anc/.mut files will get divided into.
--anc: Filename of the .anc file.
--mut: Filename of the .mut file.
-o,--output: File of the output files without file extensions.

 PATH_TO_RELATE/bin/RelateExtract\
                 --mode DivideAncMut \
                 --threads 5 \
                 --anc example.anc \
                 --mut example.mut \
                 -o example

Output: example.param, example_chr1.anc, example_chr1.mut, example_chr2.anc, example_chr2.mut, etc.

Combine the .anc/.mut files into one file.

Combining the .anc/.mut files into one file. This should be used after dividing .anc/.mut files using this function.

-o,--output: File of the output files without file extensions.

In addition, we assume that the corresponding .param file is in the same directory. You can also manually create this file, which has the header "NUM_HAPLOTYPES NUM_SNPS NUM_TREES NUM_CHUNKS", followed by the corresponding entries on the next line.

 PATH_TO_RELATE/bin/RelateExtract\
                 --mode CombineAncMut \
                 -o example

Output: example.anc.gz, example.mut.gz.

File formats

.dist

The first line is a header
Each subsequent line contains the base pair position of the SNP and the distance (in bp) to the next SNP.