Preparing the data

You can find a toy data set in the subdirectory example/.

Relate uses the haps/sample file format (output file format of SHAPEIT2) as input.

  • Please separate your data by chromosome.
  • The data needs to be phased. This can be done, for instance, using SHAPEIT2.
  • You can convert from vcf to haps/sample using this function and from hap/legend/sample to haps/sample using this function.
  • We provide a script to prepare your data under
     PATH_TO_RELATE/scripts/PrepareInputFiles/PrepareInputFiles.sh 
    (see here for how to use the script)
    This script will
    1. make sure that, for every SNP, the ancestral allele is denoted by 0 and the derived allele is denoted by 1.
    2. make sure that every SNP is biallelic.
    3. Optional: filter SNPs (and adjust distances between SNPs) using a genomic mask.
    4. Optional: subset the samples in the data set.
    5. Optional: generate additional annotation of SNPs that can be appended to the output of Relate.

Getting Started

Warning!
Relate creates temporary files and directories.
DO NOT RUN MORE THAN ONE INSTANCE OF RELATE IN THE SAME DIRECTORY.
(You can run Relate in different subdirectories.)
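
One way to follow this rule when processing several chromosomes is to give every run its own working directory. The sketch below only prints the commands rather than executing them. The directory layout (chr1/, chr2/, ...) and the input file names (chrN.haps, chrN.sample, chrN.map) are hypothetical, and the values for -m and -N are copied from the example command in this section as placeholders.

```shell
# print_relate_jobs RELATE_BIN DATA_DIR CHR...
# Prints one command per chromosome, each running in its own subdirectory so
# that temporary files of different runs never share a directory.
# Hypothetical layout: inputs are DATA_DIR/chrN.haps, chrN.sample, chrN.map.
# DATA_DIR should be an absolute path because each command changes directory.
print_relate_jobs() {
  relate_bin=$1
  data_dir=$2
  shift 2
  for chr in "$@"; do
    printf 'mkdir -p chr%s && (cd chr%s && %s --mode All -m 1.25e-8 -N 30000 --haps %s/chr%s.haps --sample %s/chr%s.sample --map %s/chr%s.map -o chr%s)\n' \
      "$chr" "$chr" "$relate_bin" "$data_dir" "$chr" "$data_dir" "$chr" "$data_dir" "$chr" "$chr"
  done
}
```

For example, print_relate_jobs PATH_TO_RELATE/bin/Relate /path/to/data 1 2 3 prints three lines that can be inspected and then passed to a shell or a job scheduler.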

To unpack the downloaded files on a Linux or Mac computer, you can use

 tar -zxvf filename 
where you have to replace filename by the name of the downloaded file.

In the subdirectory example/data/, you should find the following four files

example.haps A file containing haplotype information (more info)
example.sample A file containing information about individuals in the data set (more info)
genetic_map.txt A file containing recombination rates and distances (more info)
example.annot A file containing additional annotation of SNPs (Optional. This is needed for, e.g., population size estimation. See here for how to generate this file).

Please change into the directory example/ and execute

 PATH_TO_RELATE/bin/Relate \
      --mode All \
      -m 1.25e-8 \
      -N 30000 \
      --haps data/example.haps \
      --sample data/example.sample \
      --map data/genetic_map.txt \
      --annot data/example.annot \
      --seed 1 \
      -o example
Output files are example.anc and example.mut.

We are specifying the following arguments:

--mode
Mode in which to run Relate. Mode "All" will execute all stages of the algorithm and other modes can be used to execute individual stages (see here). This is useful, for instance, when parallelizing Relate.
-m,--mutation_rate
Mutation rate per base per generation.
-N,--effective_size
Effective population size of haplotypes. (NOT of individuals! To get the population size of haplotypes, multiply the effective population size of individuals by 2)
--haps
Filename of haps file.
--sample
Filename of sample file.
--map
Filename of genetic map.
-o,--output
Filename of output files without file extension.
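
As a worked example of the -N convention: an effective population size of 15000 diploid individuals corresponds to 30000 haplotypes, which is the value used in the command above. The variable names below are illustrative only.

```shell
# Convert an effective population size of diploid individuals into the
# haplotype-based value expected by -N. Variable names are illustrative.
ne_individuals=15000
ne_haplotypes=$((2 * ne_individuals))
echo "$ne_haplotypes"   # prints 30000
```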
Relate has a few more optional arguments.
--annot
Add columns to the .mut file. This can also be done after running Relate (a tool that generates a .annot file and/or appends columns to the .mut file can be found here). The additional columns are needed, for instance, when estimating population size.
--dist
Specifies distances (in bp) between SNPs. This corrects for regions with low mappability in the genomic mask (a tool that takes a genomic mask and generates a .dist file can be found here).
--coal
Specifies variable coalescence rates through time in the form of a .coal file. See here for the file format. Specifying this option overrides the -N,--effective_size option.
--memory
Specifies the approximate memory allowance (in GB) for storing the distance matrices in memory. Relate will exceed this amount, as it also stores trees and other information in memory; however, the option gives approximate control over memory usage. Default is 5GB. Relate can become more efficient (in runtime and hard disk usage) with more memory, particularly for large sample sizes.
--seed
Optional: Seed for the random number generator used for branch length estimation.
other arguments
Arguments --chunk_index, --first_section, --last_section are only relevant when executing Relate in a mode other than All. This is needed for parallelising Relate (more info).
  • Programme options can also be viewed by executing
     PATH_TO_RELATE/bin/Relate
  • Required arguments may differ for each mode. A list of required and optional arguments for mode All can be viewed by executing
     PATH_TO_RELATE/bin/Relate --mode All
  • All input files can be gzipped.
  • If something goes wrong and Relate terminates with an error, it may leave behind some temporary files and directories. The directory can be cleaned up using
     PATH_TO_RELATE/bin/Relate \
            --mode Clean \
            -o example
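
The clean-up step can be tied to the run itself. The following wrapper is a minimal sketch, not part of Relate: the function name run_relate_with_cleanup is hypothetical, and it assumes that Relate returns a non-zero exit status when it terminates with an error. It uses only the --mode Clean and -o options shown above.

```shell
# run_relate_with_cleanup RELATE_BIN OUTPUT_PREFIX [RELATE_ARGS...]
# Hypothetical wrapper: runs Relate with the given arguments and, if it exits
# with a non-zero status (assumed to signal an error), calls the documented
# Clean mode for the same output prefix. Returns the status of the main run.
run_relate_with_cleanup() {
  relate_bin=$1
  prefix=$2
  shift 2
  "$relate_bin" "$@" -o "$prefix"
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "Relate exited with status $status; removing temporary files" >&2
    "$relate_bin" --mode Clean -o "$prefix"
  fi
  return "$status"
}
```

A call would look like run_relate_with_cleanup PATH_TO_RELATE/bin/Relate example --mode All -m 1.25e-8 -N 30000 --haps data/example.haps --sample data/example.sample --map data/genetic_map.txt, with the output prefix passed as the second argument instead of via -o.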

Output and analysis

Relate outputs two files.

.anc This file contains the trees. The first line shows the number of haplotypes. The second line shows the number of trees.
Each subsequent line represents one tree.
  • Every line starts with an integer specifying at which SNP the tree begins.
  • Then, the i'th space-separated entry shows the parent of the i'th internal node. In brackets we specify the branch length, the number of mutations on this branch, the SNP at which the branch appears, and the SNP at which the branch disappears.
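
A quick consistency check on an .anc file is to compare the declared number of trees with the number of tree lines. The sketch below is a hypothetical helper (anc_summary is not a Relate tool) and assumes that each of the two header lines ends with its count as the last whitespace-separated field, and that the file is uncompressed.

```shell
# anc_summary FILE.anc
# Hypothetical helper. Assumes the haplotype and tree counts are the last
# whitespace-separated field on the first and second line, respectively.
# Returns non-zero if the number of tree lines differs from the declared count.
anc_summary() {
  awk '
    NR == 1 { haps = $NF; next }
    NR == 2 { trees = $NF; next }
    NF > 0  { lines++ }
    END {
      printf "haplotypes=%s trees_declared=%s tree_lines=%d\n", haps, trees, lines
      exit (lines == trees ? 0 : 1)
    }
  ' "$1"
}
```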
.mut This file contains detailed information about each SNP. The columns are described in the header.
Without .annot file, the columns are
snp
Index of SNP in .haps file (int)
pos_of_snp
Position of SNP in bp (int)
dist
Distance to next SNP in bp (int). If --dist option was specified, this may be different to the actual distance to the next SNP.
rs-id
rs-id of SNP (string)
tree_index
Index of tree, starting from 0, in the .anc file (int)
branch_indices
Index of branch on tree. If a mutation cannot be mapped to a unique branch, multiple branch indices are given (space-separated ints)
is_not_mapping
0 means SNP could be mapped to a unique branch, 1 means SNP could not be mapped to a unique branch (bool)
is_flipped
0 means SNP is not flipped, 1 means SNP is flipped (bool)
age_begin
Age of lower end of branch in generations (dbl)
age_end
Age of upper end of branch in generations (dbl)
ancestral_allele/alternative_allele
Ancestral and alternative alleles at this locus (string)
With .annot file, the following columns are appended
upstream_allele
Allele adjacent to the SNP in the 5' direction
downstream_allele
Allele adjacent to the SNP in the 3' direction
...
Number of carriers of the derived allele in populations
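
The columns above lend themselves to quick summaries with standard tools. The sketch below counts SNPs that could not be mapped to a unique branch and SNPs that were flipped. It is a hypothetical helper (mut_summary is not a Relate tool) and assumes that the .mut file is semicolon-delimited, uncompressed, and that its first line is a header with the column names listed above; columns are looked up by name, not by position.

```shell
# mut_summary FILE.mut
# Hypothetical helper. Assumption: the file is semicolon-delimited and its
# first line is a header containing the column names is_not_mapping and
# is_flipped. Columns are located by name.
mut_summary() {
  awk -F';' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("is_not_mapping" in col) || !("is_flipped" in col)) {
        print "expected columns not found in header" > "/dev/stderr"
        err = 1
        exit 1
      }
      next
    }
    NF > 0 {
      total++
      if ($(col["is_not_mapping"]) == 1) unmapped++
      if ($(col["is_flipped"]) == 1) flipped++
    }
    END {
      if (err) exit 1
      printf "snps=%d not_mapping=%d flipped=%d\n", total, unmapped, flipped
    }
  ' "$1"
}
```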

Using the output

We provide a number of add-on modules to analyse the genealogy (see here for details).

You can find a bash script under

PATH_TO_RELATE/scripts/TreeView/TreeView.sh
The script will produce a plot of the marginal tree at a bp position of interest, with tips labelled by their assigned population. Details about how to run the script can be found here.

You can find a bash script under

PATH_TO_RELATE/scripts/EstimatePopulationSize/EstimatePopulationSize.sh
To run the script, all you need are the inferred .anc/.mut files and a .poplabels file (see here for file format). Details about how to run the script can be found here.

You can estimate mutation rates through time using this module. This can be used to calculate average mutation rates, or mutation rates of triplets.

You can find a bash script under

PATH_TO_RELATE/scripts/DetectSelection/DetectSelection.sh

The script calculates a p-value for selection evidence using this module. The p-value quantifies whether the derived allele has spread surprisingly quickly under the standard coalescent model.

You can extract trees in .newick format, or genealogies for subpopulations using this module.

You can convert the output anc/mut files inferred by Relate to other file formats using this module.