Interface to the Human Genome Reference (reference)

exception gepyto.reference.InvalidMapping(value)[source]

Exception representing an invalid mapping that we can’t fix automatically.

This can happen if the provided allele is incorrect for non-SNP variants. In this case we don’t know if the locus is bad or if the sequence is bad. Since this is ambiguous, we raise this exception for the user to fix.

class gepyto.reference.Reference(remote=False)[source]

Interface to the human genome reference file.

This class uses pyfaidx to parse the genome reference file referenced by settings.REFERENCE_PATH.

This can only be a single plain fasta file.

Also note that if the path is not in the ~/.gtconfig/gtrc.ini file, gepyto will look for an environment variable named REFERENCE_PATH.

If the genome file can’t be found, this class falls back to the Ensembl remote API to get the sequences.

This behaviour can also be forced by using the remote=True argument.
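The lookup order described above (configuration file, then environment variable, then remote fallback) can be sketched as follows. This is only an illustration and not gepyto's actual code: the function name `resolve_reference_path` and the `section` and `option` defaults are hypothetical, and returning None merely stands in for "use the remote API".

```python
import configparser
import os


def resolve_reference_path(ini_path, environ=None,
                           section="settings", option="REFERENCE_PATH"):
    """Return a path to an existing reference file, or None for remote mode.

    The section and option names are assumptions made for this sketch;
    they are not taken from the gepyto documentation.
    """
    environ = os.environ if environ is None else environ

    parser = configparser.ConfigParser()
    # ConfigParser.read() silently skips files that do not exist.
    parser.read(os.path.expanduser(ini_path))

    # 1. Configuration file, 2. environment variable.
    path = parser.get(section, option, fallback=None)
    if path is None:
        path = environ.get("REFERENCE_PATH")

    # 3. No usable file: signal that the remote API should be used.
    if path is not None and os.path.isfile(os.path.expanduser(path)):
        return os.path.expanduser(path)
    return None
```

A caller could treat a None result the same way as passing remote=True.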

check_variant_reference(variant, flip=False)[source]

Given a variant, makes sure that the ‘ref’ allele is consistent with the human genome reference.

Parameters:
  • variant (gepyto.structures.variants.Variant subclass) – The variant to verify.
  • flip (bool) – If True, incorrect (ref, alt) pairs will be flipped (Default: False).
Returns:

If flip is True, returns the corrected variant, or raises a ValueError if the variant cannot be salvaged. If flip is False, returns a bool.

get_nucleotide(chrom, pos)[source]

Get the nucleotide at the given genomic position.

get_sequence(chrom, start, end=None, length=None)[source]

Get the nucleotide sequence at the given genomic locus.

Parameters:
  • chrom (str) – The chromosome.
  • start (int) – The start position of the locus.
  • end (int) – The end position.
  • length (int) – The length of the sequence to fetch.

Either an end or a length parameter has to be provided.

The ranges are inclusive: both the start and end positions are included in the returned sequence.
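To make the inclusive-range convention concrete, here is a toy stand-in that works on a plain dictionary of strings instead of a genome file. It is a sketch and not the library's implementation: it assumes 1-based positions (the usual genomic convention, not stated in this section), and the function name `toy_get_sequence` and the `genome` argument are hypothetical.

```python
def toy_get_sequence(genome, chrom, start, end=None, length=None):
    """Return the bases from start to end, both included.

    genome is a dict mapping chromosome names to sequence strings.
    Positions are assumed to be 1-based.
    """
    if end is None and length is None:
        raise ValueError("Either an end or a length has to be provided.")
    if end is None:
        # A locus of n bases starting at start ends at start + n - 1.
        end = start + length - 1
    # Convert the 1-based inclusive range to a Python slice.
    return genome[chrom][start - 1:end]


genome = {"1": "ACGTACGT"}
by_end = toy_get_sequence(genome, "1", 2, end=4)
by_length = toy_get_sequence(genome, "1", 2, length=3)
```

Both calls select positions 2, 3 and 4, so the two ways of specifying the locus give the same three bases.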

gepyto.reference.check_indel_reference(indel, ref, fix)[source]

Check and/or fix alleles for Indels.

Parameters: ref (Reference) – A reference object.

In fix mode, this function will try to standardise the alleles for the given indel. This means that the VCF format will be enforced: “-” alleles are not allowed.

e.g. ref: ‘TC’, alt: ‘-’ will become ref: ‘CTC’, alt: ‘C’ given that the previous nucleotide in the reference is a ‘C’. The position will be adjusted accordingly.

In the regular mode, the only test is that the ref allele is consistent with the reference. That is, the sequence given as the ref allele must equal the reference sequence of the same length starting at pos in the genome.
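The two modes can be illustrated on a plain string standing in for one chromosome. This sketch is not gepyto's code: the helper names `ref_matches` and `standardise_deletion` are hypothetical, positions are assumed to be 1-based, and only the deletion case from the example above (alt given as “-”) is handled.

```python
def ref_matches(sequence, pos, ref):
    """Regular mode: does ref equal the reference bases starting at pos?"""
    return sequence[pos - 1:pos - 1 + len(ref)] == ref


def standardise_deletion(sequence, pos, ref, alt):
    """Fix mode for a deletion written with a '-' alt allele.

    Prepends the preceding reference base to both alleles and moves the
    position back by one, as VCF expects.
    """
    if alt != "-":
        raise ValueError("This sketch only handles alt == '-'.")
    if pos < 2:
        raise ValueError("No preceding base to anchor the deletion on.")
    if not ref_matches(sequence, pos, ref):
        raise ValueError("The ref allele disagrees with the reference.")
    anchor = sequence[pos - 2]  # base at position pos - 1
    return pos - 1, anchor + ref, anchor


# Toy chromosome: position 3 is 'C', positions 4-5 are 'TC'.
chromosome = "AACTCGG"
fixed = standardise_deletion(chromosome, 4, "TC", "-")
```

Here `fixed` holds the new position followed by the VCF-style ref and alt alleles, mirroring the ‘TC’ to ‘CTC’ example above.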

gepyto.reference.check_snp_reference(snp, ref, flip)[source]

Utility function to check if a snp has the correct reference allele.

Parameters:
Returns:

Either a gepyto.structures.variants.SNP with flipped alleles or a bool.

This is used internally by Reference. It is also available to users, who need to provide a pyfaidx Fasta object.
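The check-and-flip behaviour can be sketched without gepyto or pyfaidx by using a plain string as the reference sequence. Everything here is illustrative: the `ToySNP` namedtuple and the `toy_check_snp` function are hypothetical stand-ins, positions are assumed to be 1-based, and the return convention follows the description of check_variant_reference above.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's SNP class.
ToySNP = namedtuple("ToySNP", ["chrom", "pos", "ref", "alt"])


def toy_check_snp(sequence, snp, flip=False):
    """Check the ref allele of snp against sequence (one chromosome).

    flip=False: return a bool.
    flip=True: return a SNP with a correct ref allele, swapping ref and
    alt if needed, or raise ValueError if neither allele matches.
    """
    base = sequence[snp.pos - 1].upper()
    if not flip:
        return base == snp.ref.upper()
    if base == snp.ref.upper():
        return snp
    if base == snp.alt.upper():
        return snp._replace(ref=snp.alt, alt=snp.ref)
    raise ValueError("Neither allele matches the reference base.")


sequence = "ACGT"  # position 2 is 'C'
swapped = ToySNP("1", 2, "T", "C")
flipped = toy_check_snp(sequence, swapped, flip=True)
```

With flip=False the swapped SNP simply fails the check. With flip=True its alleles are exchanged so that ref agrees with the reference base.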