Relate Documentation

Software to estimate genome-wide genealogies for thousands of samples

Relate estimates genome-wide genealogies in the form of trees that adapt to changes in local ancestry caused by recombination. The method, which is scalable to thousands of samples, is described in the following paper. Please cite this paper if you use our software in your study.

Citations:

(original Relate paper) Leo Speidel, Marie Forest, Sinan Shi, Simon Myers.
A method for estimating genome-wide genealogies for thousands of samples. Nature Genetics 51: 1321-1329, 2019.
(update, v1.1.*) Leo Speidel, Lara Cassidy, Robert W. Davies, Garrett Hellenthal, Pontus Skoglund, Simon R. Myers.
Inferring population histories for ancient genomes using genome-wide genealogies. Molecular Biology and Evolution 38: 3497–3511, 2021.

Contact: leo.speidel@outlook.com
Website: https://leospeidel.com

Download

Relate is available for academic use. To see rules for non-academic use, please read the LICENCE file, which is included with each software download.

Pre-compiled binaries (last updated: 23/07/2024)

GitHub repository

Alternatively, you can compile your own version by downloading the source code from this GitHub repository.

In the downloaded directory, we have included a toy data set. You can try out Relate using this toy data set by following the instructions on our getting started page.

If you have any problems getting the program to work on your machine or would like to request an executable for a platform not shown here, please send a message to leo.speidel [at] outlook [dot] com.

We document changes to previous versions in a change-log.

What's new?

Versions from v1.1.0 onwards include a few new features such as:

Allowing for non-contemporary samples, e.g., ancient DNA: Simply add the --sample_ages option.
Allowing for haploid organisms: set the ID_2 column in the sample file to NA and Relate will work for haploid organisms.
Improved coalescence rate estimation: The algorithm has been improved and behaviour has changed slightly. The output now always contains all trees, and the threshold parameter only affects iterations leading up to the last one. The threshold parameter is now a float in [0,1], specifying the proportion of trees to throw away (Default = 0.5).
Arbitrary strings as chromosome IDs: in scripts such as EstimatePopulationSize.sh, use the --chr option to specify a file containing chromosome names.
More flexible epochs for coalescence rate estimation: use the option --bins to specify epoch boundaries in the format 10^seq(x,y,stepsize).
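
As a rough illustration of the --bins format, the following shell sketch expands 10^seq(x,y,stepsize) into explicit boundary values, in the way R's seq() would generate the exponents. The values x=3, y=5 and stepsize=0.5 are arbitrary choices for this example, not Relate defaults, and the snippet does not call Relate itself.

```shell
# Illustrative values only; these are not Relate defaults.
x=3; y=5; stepsize=0.5

# Compute 10^(x + i*stepsize) for every exponent between x and y.
boundaries=$(awk -v x="$x" -v y="$y" -v s="$stepsize" 'BEGIN {
  n = int((y - x) / s + 1e-9)   # number of whole steps that fit between x and y
  for (i = 0; i <= n; i++)
    printf "%s%.0f", (i ? " " : ""), 10 ^ (x + i * s)
  printf "\n"
}')

echo "$boundaries"
```

With these values the boundaries are 1000, 3162, 10000, 31623 and 100000 (rounded to integers for display).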

1000 Genomes coalescence rates, allele ages, and selection p-values

We deposited Relate-estimated coalescence rates, allele ages, and p-values for evidence of positive selection for the 1000 Genomes Project here.
These were obtained by estimating the joint genealogy of all 1000GP populations and then extracting the embedded genealogy for each population. For the genealogy of each population, we jointly estimated the population size history and branch lengths. Variants segregating in more than one population therefore have correlated but different allele ages in each population.

Inferred trees

You can download inferred trees from the links below:

Relater: R package for handling Relate output files

I made an R package for handling Relate output files (MIT license). It's still under development but the main functionality is working.

You can install this package, e.g., using

library(devtools)
devtools::install_github("leospeidel/relater")

relate_lib: Conversion to tskit format

This github repo contains functions to convert between Relate and tskit format (MIT license).

It doubles as a C++ library you can link to if you want to use some of our Relate functions in C++.

relate_lib/bin/Convert --mode ConvertFromTreeSequence \
		--anc example.anc.gz \
		--mut example.mut.gz \
		-i example

relate_lib/bin/Convert --mode ConvertToTreeSequence \
		--anc example.anc.gz \
		--mut example.mut.gz \
		-o example

Thanks to Nathaniel S. Pope, you can now also specify an argument that compresses these Relate-converted tree sequences by assigning the same age to nodes with identical descendant sets across adjacent trees.

relate_lib/bin/Convert --mode ConvertToTreeSequence \
		--compress \
		--anc example.anc.gz \
		--mut example.mut.gz \
		-o example

Add-on modules

Preparing input files

The input file format of Relate is the haps/sample file format used as an output format of ShapeIt. We provide code to convert files from the hap/legend/sample and vcf file formats. Relate also comes with tools for determining the ancestral allele using an ancestral genome and tools for filtering SNPs using a genomic mask.
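
For orientation, here is a minimal sketch of what a sample file for haploid organisms might look like. It assumes the usual ShapeIt layout of a header line followed by a line of zeros, with the ID_2 column set to NA as described under "What's new?". The sample names hapA, hapB and hapC are made up, and the file is written to a temporary location.

```shell
# Write a hypothetical haploid sample file to a temporary path.
sample_file=$(mktemp)
cat > "$sample_file" <<'EOF'
ID_1 ID_2 missing
0 0 0
hapA NA 0
hapB NA 0
hapC NA 0
EOF

# Count samples whose ID_2 is NA, skipping the two header lines.
n_haploid=$(awk 'NR > 2 && $2 == "NA" { n++ } END { print n + 0 }' "$sample_file")
echo "haploid samples: $n_haploid"
```

A quick check like this can catch a sample file in which only some rows were switched to NA before running Relate.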

Extracting subtrees and trees of interest

We provide code for converting individual trees of interest into .newick format. This file format is convenient, for instance, when visualising a tree. We also provide code for extracting genealogies corresponding to a subpopulation.
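
To give a feel for the .newick format, the sketch below stores a small hand-written tree as a string and lists its leaf labels. The tree, its labels (0 to 3) and its branch lengths are invented for illustration and are not Relate output; the labelling Relate uses in its own .newick files may differ.

```shell
# Hypothetical tree for four haplotypes; the number after each ':' is a branch length.
newick='((0:120.5,1:120.5):300,(2:80,3:80):340.5);'

# Leaf labels are the digits that directly follow '(' or ','.
leaves=$(printf '%s\n' "$newick" | grep -oE '[(,][0-9]+' | tr -d '(,' | xargs)
echo "leaves: $leaves"
```

Nested parentheses encode the grouping, so haplotypes 0 and 1 coalesce with each other before joining the clade formed by 2 and 3.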

Plot marginal trees

Relate can be used to plot a marginal tree of interest. This tree corresponds to the LCT region, where a mutation at SNP rs4988235 is believed to be responsible for lactose tolerance in Europeans (here GBR). We can see that the derived allele at this SNP has spread rapidly in GBR, which is indicative of strong positive selection.

Estimating population sizes and separation histories

Relate can be used to estimate population sizes and separation histories between populations. The figure shows the separation history between FIN and GBR in the 1000GP data set. The inset shows the matrix of coalescence rates between pairs of haplotypes 9,000 years ago. Rows and columns are sorted by population labels of haplotypes, as indicated by the colour on the left of the matrix.

Estimating mutation rates

Relate can be used to estimate mutation rates through time. The figure shows TCC to TTC mutation rates for all 26 populations in the 1000GP data set. Trends shared between mutation categories are eliminated by dividing by the overall average mutation rate. For each population, the mutation rate is normalized such that its mean over time equals one. Consistent with previous estimates, we see an increase in the mutation rate of TCC to TTC mutations about 10,000 to 30,000 years ago in Europeans and South Asians.
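
The two normalization steps described above can be sketched as follows. This is only an illustration with made-up numbers, not the code used to produce the figure; in particular, it assumes an unweighted mean over epochs, which may differ from how epochs are weighted in practice.

```shell
# Columns: epoch index, category rate (e.g. TCC to TTC), overall average rate.
# All numbers are invented for illustration.
rates='1 2 2
2 4 2
3 12 4'

normalised=$(printf '%s\n' "$rates" | awk '
  { ratio[NR] = $2 / $3; sum += ratio[NR] }   # step 1: divide by the overall average rate
  END {
    mean = sum / NR                            # unweighted mean over epochs (assumption)
    for (i = 1; i <= NR; i++)                  # step 2: rescale so the mean equals one
      printf "%s%g", (i > 1 ? " " : ""), ratio[i] / mean
    printf "\n"
  }')

echo "$normalised"
```

Here the ratios are 1, 2 and 3, so after rescaling the values are 0.5, 1 and 1.5, which average to one.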

Detecting positive selection

Relate can be used to detect evidence for positive selection. We calculate a p-value for selection evidence that quantifies how quickly a mutation has spread in the population. The figure shows a Manhattan plot for GBR which indicates clear peaks in the LCT and MHC regions, which are known targets of positive selection.

Infer genealogies for ancient samples

Relate can infer genealogies for non-contemporary samples, such as high-coverage ancient genomes or time-stamped samples of bacteria or viruses. This plot was generated using the TreeViewSamples.sh script, using 100 sampled branch lengths. The tree represents the posterior mean times, and "error bars" indicate the 0.025 and 0.975 quantiles of the posterior density of coalescence ages.
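
To make the summary statistics concrete, the sketch below computes a mean and the 0.025 and 0.975 quantiles from 100 values. The values 1 to 100 are a stand-in for sampled coalescence ages of a single node, and the nearest-rank quantile definition is an assumption made for this example; TreeViewSamples.sh may compute its quantiles differently.

```shell
# Placeholder for 100 sampled coalescence ages of one node.
summary=$(seq 1 100 | sort -n | awk '
  { v[NR] = $1; sum += $1 }
  END {
    lo = int(0.025 * NR); if (lo < 0.025 * NR) lo++   # nearest rank: ceil(0.025 * n)
    hi = int(0.975 * NR); if (hi < 0.975 * NR) hi++   # nearest rank: ceil(0.975 * n)
    printf "mean=%g q0.025=%g q0.975=%g\n", sum / NR, v[lo], v[hi]
  }')

echo "$summary"
```

For the placeholder values this reports a mean of 50.5 with quantiles 3 and 98, i.e. the interval spanned by the "error bars" for that node.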

Sample branch lengths

Using the SampleBranchLengths.sh script, Relate can sample branch lengths from the posterior, which can then be used, e.g., by CLUES to infer allele frequency trajectories and selection coefficients using an importance sampling scheme. See the CLUES repo for more details.

Change-log

v1.2.2

Date: 23 July 2024

Fixed a bug introduced in v1.2.1 that led to execution errors when running RelateSelection.

v1.2.1

Date: 18 September 2023

Fixed a bug introduced in v1.2.0 that led to bad topologies for larger sample sizes.
Fixed a bug introduced in v1.2.0 in EstimatePopulationSize.sh that caused an error when used with multiple chromosomes.

v1.2.0

Date: 12 August 2023

Increased consistency between adjacent trees and improved memory footprint.
Added an option to force tree building every x bases.
RelateSelection --mode Frequency now works with ancient genomes.

v1.1.*

Pre-compiled binaries (last updated: 14/06/2022)

v1.1.9

Date: 14 June 2022

Fixed bug in TreeView where singletons were not plotted.

v1.1.8

Date: 7 November 2021

Made ConvertToTreeSequence consistent with relate_lib version.
Bug fix in SummarizeMutationRates (assumed float instead of double, leading to Segfault).

v1.1.7

Date: 16 June 2021

Updated initialisation of branch lengths (bl) with ancient samples.
Small bug fixes.

v1.1.6

Date: 17 Feb 2021

Added option to exclude transitions in branch length estimation (useful for ancient genomes).

v1.1.5

Date: 14 Dec 2020

Modified tree builder when ancient samples are specified (see Colate paper for details).

v1.1.4

Date: 30 Oct 2020

Fixed a bug in ConvertFromVcf that caused empty haps/sample files to be output.
Added a function in RelateExtract to get branch frequencies for every SNP.

v1.1.3

Date: 7 Oct 2020

Fixed a major bug (v1.1.* onwards) in EstimatePopulationSize.sh when setting seed (not affected when seed is not set).
Added RelateSlurm.sh and RelateLSF.sh
Added a function to retrospectively map a mutation to trees in RelateExtract

v1.1.2

Date: 12 June 2020

Fixed bug in PrepareInputFiles, when downsampling individuals (introduced in v1.1.0).
Changed parameters for branch lengths estimation for aDNA, to improve mixing properties of MCMC.
Changed function for anc to newick conversion, to allow multiple trees to be extracted.

v1.1.1

Date: 25 May 2020

Changed filename used for CLUES and PALM from *.palm to *.timeb.
Fixed bug in EstimatePopulationSize.sh

v1.1.0

Date: 20 May 2020

Allowing for non-contemporary samples, e.g., ancient DNA: Simply add the --sample_ages option.
Allowing for haploid organisms: set the ID_2 column in the sample file to NA and Relate will work for haploid organisms.
Improved coalescence rate estimation: The algorithm has been improved and behaviour has changed slightly. The output now always contains all trees, and the threshold parameter only affects iterations leading up to the last one. The threshold parameter is now a float in [0,1], specifying the proportion of trees to throw away (Default = 0.5).
Arbitrary strings as chromosome IDs: in scripts such as EstimatePopulationSize.sh, use the --chr option to specify a file containing chromosome names.
More flexible epochs for coalescence rate estimation: use the option --bins to specify epoch boundaries in the format 10^seq(x,y,stepsize).

v1.0.*

Pre-compiled binaries (last updated: 10/01/2020)

v1.0.17

Date: 10th January 2020

Input files can now have CHR, RSID, ANC, DER columns of type strings (up to 1024 chars).
Added a script named TreeViewMutation.sh, which takes the same input as TreeView.sh but produces a different visualisation (bp_of_interest has to be a SNP).

v1.0.16

Date: 2nd September 2019

Fixed a bug in ConvertToTreeSequence that meant internal nodes were flagged as sample nodes.

v1.0.15

Date: 13th August 2019

Fixed a small bug that meant memory was misallocated when the --memory value was set too large.

v1.0.14

Date: 24th July 2019

Fixed a bug in PrepareInputFiles.sh that meant the script terminated before filtering SNPs using a genomic mask when input was not gzipped.
Changed the default for y axis scale in TreeView to not be a log-scale.

v1.0.13

Date: 27th May 2019

Modified default criteria to terminate population size estimation. Now terminating when the iteration number is at least 2 and "mean absolute error"/mu is less than 0.1.
Modified memory allocation in Relate, where we were allocating too much memory for very small sample sizes (N less than 10).
Modified error message when poplabels file is misspecified in FinalizePopulationSize - you can just rerun this step again at the end.
Included the script RelateSGE.sh for running Relate on a SGE cluster.

v1.0.12

Date: 16th May 2019

Fixed a bug in FinalizePopulationSize --mode EstimatePopulationSize, where I had forgotten to close a file leading to an error on some platforms.
Allowing .poplabels file to be tab separated.
Fixed a bug in TreeView which threw an error when completely fixed SNPs were included in the haps/sample file.

v1.0.11

Date: 14th May 2019

Fixed a bug in FinalizePopulationSize introduced in v1.0.9, which threw an error message for correct input files.
Fixed a bug in SampleBranchLengths which was introducing small rounding errors.

v1.0.10

Date: 13th May 2019

Fixed a bug in SampleBranchLengths.sh, where a filename was hard-coded by mistake.

v1.0.9

Date: 3rd May 2019

Implemented a module which samples branch lengths from the posterior given tree topologies and effective population size histories.
Added some more detail to documentation for RelateSelection.
Added a more meaningful error message when .poplabels file is misspecified in EstimatePopulationSize.sh.

v1.0.8

Date: 15th March 2019

Now requiring different input and output names for DetectSelection.sh when it overwrites files otherwise.
DetectSelection.sh now works also if additional columns are not appended to the .mut file using ./RelateFileFormats --mode GenerateSNPAnnotations. However, code is more efficient if these columns are appended.
Switched to using R package cowplot in TreeView.sh.

v1.0.7

Date: 25th February 2019

Fixed bug in the painting which affects tree topologies sometimes, but has only a small effect (not visible in distance measures when comparing to truth in simulations).

v1.0.6

Date: 6th February 2019

I added a function for converting anc/mut output files to tree sequence file format (tskit). Currently, some information is lost by this conversion.

v1.0.5

Date: 18th October 2018

I had introduced a bug in update v1.0.4 in the function RelateExtract --mode TreeAtSNPAsNewick which has been fixed.

v1.0.4

Date: 13th October 2018

Substituted zcat with gunzip -c to fix a bug on Macs.
Added requirement of R version >= 3.3.1 for TreeView because of a known bug with grid.draw() in older versions.
Fixed bug in RelateExtract --mode TreeAtSNPAsNewick, which outputted the first tree when snp_of_interest was not a SNP in the data set. Changed option snp_of_interest to pos_of_interest and this function now prints the tree at the position of interest. In addition, output filename was not using the option -o, which has been corrected.

v1.0.3

Date: 30th August 2018

Implemented pipeline for calculating p-values for selection evidence. Updated corresponding entry in the documentation.
Bug fix in RelateSelection --mode Frequency: Previous version had a bug whenever two internal nodes had same age.
Fixed function RelateFileFormats --mode ConvertFromVcf which previously parsed the vcf incorrectly in some cases.

v1.0.2

Date: 16th July 2018

Implemented parallelization of module EstimatePopulationSize.

v1.0.1

Date: 30th June 2018

Bug fix in parsing function of haps/sample.

v1.0.0

Date: 4th June 2018

Initial release.