Preparing the data
You can find a toy data set in the subdirectory example/.
Relate uses the haps/sample file format (output file format of SHAPEIT2) as input.
- Please separate your data by chromosome.
- The data needs to be phased. This can be done, for instance, using SHAPEIT2.
- You can convert from vcf to haps/sample using this function and from hap/legend/sample to haps/sample using this function.
- We provide a script to prepare your data under
(see here for how to use the script)
This script will
- make sure that, for every SNP, the ancestral allele is denoted by 0 and the derived allele is denoted by 1.
- make sure that every SNP is biallelic.
- Optional: filter SNPs (and adjust distances between SNPs) using a genomic mask.
- Optional: subset the samples in the data set.
- Optional: generate additional annotation of SNPs that can be appended to the output of Relate.
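As a rough sanity check of the 0/1 allele coding described above, the sketch below scans a haps file for entries that are neither 0 nor 1. It assumes the usual SHAPEIT2 haps layout (five leading columns for chromosome, SNP ID, position and the two alleles, followed by one column per haplotype); the toy file and its contents are made up for illustration and are not part of Relate.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Hypothetical toy haps file: chrom, SNP ID, position, allele 0, allele 1,
# then one 0/1 column per haplotype. The third line contains an invalid "2".
cat > "$workdir/toy.haps" <<'EOF'
1 snp1 100 A G 0 1 0 0
1 snp2 250 C T 1 1 0 1
1 snp3 400 G A 0 0 2 1
EOF

# Print the line number of every SNP with a haplotype entry other than 0 or 1.
bad_lines=$(awk '{
  for (i = 6; i <= NF; i++) {
    if ($i != "0" && $i != "1") { print NR; next }
  }
}' "$workdir/toy.haps")

echo "Lines with non-binary entries: $bad_lines"
rm -r "$workdir"
```

This only checks the coding of the haplotype columns; it cannot tell whether 0 really is the ancestral allele, which is what the preparation script takes care of.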
Relate creates its temporary files and directories in a directory named after the output prefix.
Several runs of Relate can now share the same working directory, provided their output filenames are different.
To unpack the downloaded files on a Linux or Mac computer, you can use
tar -zxvf filename
where you have to replace filename by the name of the downloaded file.
In the subdirectory example/data/, you should find the following four files
|example.haps||A file containing haplotype information (more info)|
|example.sample||A file containing information about individuals in the data set (more info)|
|genetic_map.txt||A file containing recombination rates and distances (more info)|
|example.annot||A file containing additional annotation of SNPs (Optional. This is recommended but not necessary for e.g., population size estimation. See here for how to generate this file).|
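Before running Relate it can help to confirm that the haps and sample files describe the same number of sequences. The sketch below assumes the SHAPEIT2 conventions (five leading columns in the haps file, two header lines in the sample file) and diploid individuals, so that each individual contributes two haplotype columns. The toy files are hypothetical stand-ins for example.haps and example.sample.

```shell
workdir=$(mktemp -d)

# Hypothetical toy inputs: 2 diploid individuals, hence 4 haplotype columns.
cat > "$workdir/toy.haps" <<'EOF'
1 snp1 100 A G 0 1 0 0
1 snp2 250 C T 1 1 0 1
EOF
cat > "$workdir/toy.sample" <<'EOF'
ID_1 ID_2 missing
0 0 0
ind1 ind1 0
ind2 ind2 0
EOF

# Haplotype columns = total columns minus the five leading columns.
n_hap_cols=$(awk 'NR == 1 { print NF - 5 }' "$workdir/toy.haps")
# Individuals = lines in the sample file after the two header lines.
n_individuals=$(awk 'NR > 2 { n++ } END { print n + 0 }' "$workdir/toy.sample")

if [ "$n_hap_cols" -eq $((2 * n_individuals)) ]; then
  consistent=yes
else
  consistent=no
fi

echo "haplotype columns: $n_hap_cols, individuals: $n_individuals, consistent: $consistent"
rm -r "$workdir"
```

To run the same check on the toy data set, point the two awk commands at data/example.haps and data/example.sample instead.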
Please change into the directory example/ and execute
PATH_TO_RELATE/bin/Relate \
    --mode All \
    -m 1.25e-8 \
    -N 30000 \
    --haps data/example.haps \
    --sample data/example.sample \
    --map data/genetic_map.txt \
    --annot data/example.annot \
    --seed 1 \
    -o example
Output files are example.anc and example.mut.
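Because the data has to be separated by chromosome, and runs in the same directory need distinct output names, one possible pattern is a loop that builds one command per chromosome. The sketch below is a dry run: it only prints the commands. The per-chromosome file names (chr1.haps, genetic_map_chr1.txt and so on) and the output prefixes are hypothetical, while the flags are the ones used in the example command.

```shell
# Dry run: print one Relate command per chromosome without executing it.
relate_bin="PATH_TO_RELATE/bin/Relate"
n_commands=0

for chr in 1 2 3; do
  # Hypothetical file naming scheme; adjust to your own data layout.
  cmd="$relate_bin --mode All -m 1.25e-8 -N 30000"
  cmd="$cmd --haps data/chr${chr}.haps --sample data/chr${chr}.sample"
  cmd="$cmd --map data/genetic_map_chr${chr}.txt --seed 1 -o example_chr${chr}"
  echo "$cmd"
  n_commands=$((n_commands + 1))
done

echo "Prepared $n_commands commands"
```

Once the printed commands look right, they can be run one after another or handed to a job scheduler.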
We are specifying the following arguments:
- --mode: Mode in which to run Relate. Mode "All" executes all stages of the algorithm; other modes execute individual stages (see here). This is useful, for instance, when parallelizing Relate.
- -m: Mutation rate per base per generation.
- -N (--effectiveN): Effective population size of haplotypes, NOT of individuals. To get the population size of haplotypes, multiply the effective population size of individuals by 2.
- --haps: Filename of haps file.
- --sample: Filename of sample file.
- --map: Filename of genetic map.
- -o: Filename of output files without file extension.
- File containing sample ages (one per line). Please specify an age for every haploid sequence, i.e. 2N for diploid organisms.
- --annot: Add columns to the .mut file. This can also be done after running Relate (a tool that generates a .annot file and/or appends columns to the .mut file can be found here). The additional columns are needed, for instance, when estimating population size.
- Specifies distances (in BP) between SNPs. This is to correct for regions with low mappability in the genomic mask (a tool that takes a genomic mask and generates a .dist file can be found here).
- Specifies variable coalescence rates through time in the form of a .coal file. See here for the file format. Specifying this option overrides the --effectiveN option.
- Specifies the approximate memory allowance (in GB) used for storing the distance matrices in memory. Relate will exceed this amount, as it also stores trees and other information in memory; however, the option lets you control memory usage approximately. The default is 5 GB. Relate can become more efficient (in runtime and hard disk usage) with more memory, particularly for large sample sizes.
- --seed: Optional. Seed for the random number generator used for branch length estimation.
- other arguments
- Arguments --chunk_index, --first_section, --last_section are only relevant when executing Relate in a mode other than All. This is needed for parallelising Relate (more info).
- Programme options can also be viewed by executing
- Required arguments may differ for each mode. A list of required and optional arguments for mode All can be viewed by executing
PATH_TO_RELATE/bin/Relate --mode All
- All input files can be gzipped.
- If something goes wrong and Relate terminates with an error, it may leave behind some temporary files and directories. The directory can be cleaned up using
PATH_TO_RELATE/bin/Relate \
    --mode Clean \
    -o example
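Two of the arguments above are counted in haplotypes rather than individuals: the effective population size passed via -N, and the number of lines in a sample ages file. The arithmetic for diploid organisms can be sketched as follows; the input numbers are purely illustrative assumptions.

```shell
# Hypothetical inputs for a diploid organism.
diploid_Ne=15000     # effective population size of individuals
n_individuals=4      # number of diploid samples in the data set

# Relate expects the effective population size of haplotypes.
haploid_N=$((2 * diploid_Ne))

# A sample ages file needs one line per haploid sequence.
n_age_lines=$((2 * n_individuals))

echo "Value to pass via -N: $haploid_N"
echo "Lines needed in a sample ages file: $n_age_lines"
```

With these inputs, the value for -N is 30000 and a sample ages file would need 8 lines.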
Output and analysis
Relate outputs two files.
|.anc||This file contains the trees. The first line shows the number of haplotypes. If sample ages are provided, this line will list sample ages. The second line shows the number of trees. Each subsequent line represents one tree.|
|.mut||This file contains detailed information about each SNP. The columns are described in the header.
Without .annot file, the columns are
Using the output
We provide a number of add-on modules to analyse the genealogy (see here for details).
You can find a bash script under
The script calculates a p-value for selection evidence using this module. The p-value quantifies whether the derived allele has spread surprisingly quickly under the standard coalescent model.