Python Objects (structures)

Genes

class gepyto.structures.genes.Gene(**kwargs)[source]

Python object representing a gene.

Store the following information:

Required

  • build: The genome build.
  • chrom: The chromosome.
  • start and end: The genomic positions for the gene.
  • strand: The strand (either 1 or -1).
  • xrefs: A dict of id mappings to multiple databases.
  • transcripts: A list of Transcript objects.

Optional

  • symbol: An HGNC symbol.
  • desc: A short description.
  • exons: A list of pairs of positions for exons.

Gene objects can only be built from keyword arguments. This makes for more readable code and avoids mistakes.

classmethod factory_ensembl_id(ensembl_id, xrefs=None, build='GRCh37')[source]

Builds a gene object from its Ensembl ID.

Parameters:ensembl_id (str) – The Ensembl ID.
Returns:The Gene object.
Return type:Gene
classmethod factory_symbol(symbol, build='GRCh37')[source]

Builds a gene object from its HGNC symbol.

Parameters:symbol (str) – The HGNC symbol.
Returns:The Gene object.
Return type:Gene
get_ortholog_sequences()[source]

Queries Ensembl to get Sequence objects representing orthologs.

Returns:A list of gepyto.structures.sequences.Sequence
Return type:list
get_paralog_sequences()[source]

Queries Ensembl to get Sequence objects representing paralogs.

Returns:A list of gepyto.structures.sequences.Sequence
Return type:list
classmethod get_xrefs(field, query, build='GRCh37')[source]
Queries the HGNC (HUGO Gene Nomenclature Committee) service to get a
gene ID for other databases.
Parameters:
  • field (str) – A searchable field.
  • query (str) – The query.
Returns:

A dict representing information on the gene.

Return type:

dict

If no gene matching the query can be found, None is returned.

classmethod get_xrefs_from_ensembl_id(ensembl_id, build='GRCh37')[source]
Queries the HGNC (HUGO Gene Nomenclature Committee) service to get a
gene ID for other databases.
Parameters:ensembl_id (str) – The gene Ensembl ID to query.
Returns:A dict representing information on the gene.
Return type:dict

If no gene with this Ensembl ID can be found, None is returned.

classmethod get_xrefs_from_symbol(symbol, build='GRCh37')[source]
Queries the HGNC (HUGO Gene Nomenclature Committee) service to get a
gene ID for other databases.
Parameters:symbol (str) – The gene symbol to query.
Returns:A dict representing information on the gene.
Return type:dict

If no gene with this symbol can be found, None is returned.

region

Lazily loads the Region object for this Gene.

class gepyto.structures.genes.Transcript(**kwargs)[source]

Python object representing a transcript.

Store the following information:

Required

  • build: The genome build.
  • chrom: The chromosome.
  • start and end: The genomic positions for the transcript.
  • enst: The corresponding Ensembl transcript id.

Optional

  • appris_cat: The APPRIS category.
  • parent: The corresponding Gene object.
  • biotype: The biotype as given by Ensembl.
classmethod factory_position(region, build='GRCh37')[source]

Gets a list of transcripts overlapping with the given position.

Parameters:region (str) – A genomic region of the form chr2:12345-12347.
Returns:A list of Transcript
Return type:list

This method uses the Ensembl API.

get_sequence(seq_type='genomic')[source]

Build a Sequence object representing the transcript.

Parameters:seq_type (str) – This can be either genomic, cds, cdna or protein.
Returns:A Sequence object representing the feature.
Return type:gepyto.structures.sequences.Sequence
region

Lazily loads the Region object for this Transcript.

Variants

class gepyto.structures.variants.SNP(*args, **kwargs)[source]

Class representing a Single Nucleotide Polymorphism (SNP).

Instances can be created in two ways: either by providing ordered fields: chrom, pos, rs, ref, alt or by using named parameters.

classmethod from_ensembl_api(rs, build='GRCh37')[source]

Gets the information from the Ensembl REST API.

Parameters:
  • rs – The rs number for the variant of interest.
  • build – The genome build (e.g. GRCh37 or GRCh38).
classmethod from_str(s, rs=None)[source]

Parses a variant object from a str of the form chrXX:YYYY_R/A.

Parameters:
  • s – The string to parse the SNP from (Format: chrXX:YYYY_R/A).
  • rs – An optional parameter specifying the rs number.
Returns:

A list of SNP objects.

If it is a multi-allelic locus, the list will contain one SNP per alternative allele.

If it is not, this will be a list of length 1.
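The string format above can be illustrated with a small standalone parser. This is only a sketch and not the library's implementation: the name parse_snp_string, the removal of the chr prefix and the use of commas to separate several alternative alleles are all assumptions made for the example.

```python
import re

# Hypothetical pattern for strings such as "chr2:12345_A/T".
# Assumption: several alternative alleles are separated by commas.
_SNP_PATTERN = re.compile(
    r"^(?:chr)?(?P<chrom>[^:]+):(?P<pos>\d+)_"
    r"(?P<ref>[ACGT])/(?P<alt>[ACGT](?:,[ACGT])*)$",
    re.IGNORECASE,
)


def parse_snp_string(s, rs=None):
    """Return one (chrom, pos, rs, ref, alt) tuple per alternative allele."""
    match = _SNP_PATTERN.match(s.strip())
    if match is None:
        raise ValueError("Could not parse a SNP from {!r}".format(s))
    chrom = match.group("chrom")
    pos = int(match.group("pos"))
    ref = match.group("ref").upper()
    alts = match.group("alt").upper().split(",")
    # One entry per alternative allele, so a biallelic site gives a list
    # of length 1.
    return [(chrom, pos, rs, ref, alt) for alt in alts]
```

The tuples follow the ordered fields accepted by the SNP constructor (chrom, pos, rs, ref, alt).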

class gepyto.structures.variants.Indel(*args, **kwargs)[source]

Class representing short insertions/deletions (Indels).

Either initialize with the parameters corresponding to: chrom, pos, rs, ref, alt or by using the corresponding named parameters.

The notation we are using is consistent with the VCF format, but it differs from that of some APIs (e.g. Ensembl). We will try to standardize everything to comply with the VCF format.

The VCF format represents deletions by using the preceding nucleotide for the reference.

e.g. If the genomic sequence is AAGAA -> AAAA (deletion of the G) the indel will be represented as follows:

  • Start: 2
  • Ref: AG
  • Alt: A
  • length: 1 (difference between allele lengths)

e.g. If the genomic sequence is AAGAA -> AAGCAA (insertion of the C) the indel will be represented as follows:

  • Start: 3
  • Ref: G
  • Alt: GC
  • length: 1
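The two examples above can be reproduced with a short standalone sketch. The helper names vcf_deletion and vcf_insertion are hypothetical and are not part of gepyto. Positions are assumed to be 1-based.

```python
def vcf_deletion(reference, start, n_deleted):
    """Describe the deletion of `n_deleted` bases beginning at the 1-based
    position `start`, anchored on the preceding nucleotide."""
    if start < 2:
        raise ValueError("this sketch needs a preceding base to anchor on")
    anchor = start - 1  # 1-based position of the preceding nucleotide
    ref = reference[anchor - 1:anchor - 1 + 1 + n_deleted]
    alt = reference[anchor - 1]
    return {"start": anchor, "ref": ref, "alt": alt,
            "length": abs(len(ref) - len(alt))}


def vcf_insertion(reference, after, inserted):
    """Describe the insertion of `inserted` right after the 1-based
    position `after`."""
    anchor_base = reference[after - 1]
    ref, alt = anchor_base, anchor_base + inserted
    return {"start": after, "ref": ref, "alt": alt,
            "length": abs(len(ref) - len(alt))}
```

For the sequence AAGAA, vcf_deletion("AAGAA", 3, 1) describes the deletion of the G and vcf_insertion("AAGAA", 3, "C") describes the insertion of the C after it.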
classmethod from_ensembl_api(rs, build='GRCh37')[source]

Gets the information from the Ensembl REST API.

Parameters:
  • rs – The rs number for the variant of interest.
  • build – The genome build (e.g. GRCh37 or GRCh38).
length

Computes the size of the indel.

This is the difference between the length of the alleles.

gepyto.structures.variants.variant_list_to_dataframe(variants)[source]

Converts a regular Python list of Variant objects to a Pandas dataframe.

Parameters:variants (list) – A list of Variant objects.
Returns:A Pandas dataframe with genomic information as well as extra fields defined in the _info parameter.
Return type:DataFrame

This function will add annotations from the _info dict if such a parameter is set for the considered variants. This dict has to be comparable between all elements of the list or an Exception will be raised.

This functionality can be very useful to flexibly add annotations to variants and to write them to a CSV file.
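The flattening step can be pictured with plain dicts. This is a rough sketch under stated assumptions, not the library's code: variants are modelled here as (fields, info) pairs, the function name variants_to_rows is hypothetical, and "comparable" is interpreted as "having the same set of keys".

```python
def variants_to_rows(variants):
    """Flatten (fields, info) pairs into a list of row dicts.

    `fields` holds the genomic information and `info` the optional
    annotations. All `info` dicts must share the same keys.
    """
    rows = []
    expected_keys = None
    for fields, info in variants:
        info = info or {}
        if expected_keys is None:
            expected_keys = set(info)
        elif set(info) != expected_keys:
            raise ValueError("annotation keys differ between variants")
        row = dict(fields)   # genomic columns first
        row.update(info)     # then one column per annotation
        rows.append(row)
    return rows
```

A list of row dicts like this one can be passed directly to pandas.DataFrame, and the resulting frame can be written out with its to_csv method.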

Sequences

class gepyto.structures.sequences.Sequence(uid, s, seq_type, info=None)[source]

Object to represent biological sequences.

Parameters:
  • uid (str) – The identifier for this sequence.
  • s (str) – The actual sequence.
  • seq_type (str) – The sequence type (DNA, RNA or AA).
  • info (dict) – A python dict of extra parameters (optional).

Common examples for the info attributes:

  • species: Homo sapiens
  • species_ncbi_tax_id: 9606
  • description: dystroglycan 1
  • db_name: RefSeq
  • db_acc: NM_004393
base_base_correlation(k=10, alphabet=None)[source]

Compute the base base correlation (BBC) for the sequence.

Parameters:
  • k (int) – k is a parameter of the BBC. Intuitively, it represents the maximum distance to observe correlation between bases.
  • alphabet (iterable) – List of possible characters. This can be used to avoid autodetection of the alphabet in the case where sequences with missing letters are to be compared.
Returns:

A 16 dimensional vector representing the BBC.

Return type:

np.ndarray

A description of the method can be found here: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4272582

Liu, Zhi-Hua, et al. “Base-Base Correlation a Novel Sequence Feature and its Applications.” Bioinformatics and Biomedical Engineering, 2007. ICBBE 2007. The 1st International Conference on. IEEE, 2007.

This implementation is generalized for any sequence type.
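As a rough illustration of the idea, the sketch below computes, for every ordered pair of characters (i, j), the sum over distances 1..k of P_ij(l) * log2(P_ij(l) / (P_i * P_j)). This follows one reading of the cited paper and is an assumption; the library's implementation may use a different normalization or ordering of the vector, and it returns an np.ndarray rather than a list.

```python
from math import log2


def base_base_correlation(seq, k=10, alphabet="ACGT"):
    """Return a flat list of len(alphabet) ** 2 correlation values."""
    seq = seq.upper()
    index = {base: i for i, base in enumerate(alphabet)}
    size = len(alphabet)

    # Single-character probabilities P_i (characters outside the
    # alphabet are ignored).
    counts = [0] * size
    for base in seq:
        if base in index:
            counts[index[base]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * (size * size)
    p = [c / total for c in counts]

    bbc = [0.0] * (size * size)
    for gap in range(1, k + 1):
        # Joint probabilities P_ij(gap) for characters `gap` positions apart.
        pair_counts = [0] * (size * size)
        n_pairs = 0
        for a, b in zip(seq, seq[gap:]):
            if a in index and b in index:
                pair_counts[index[a] * size + index[b]] += 1
                n_pairs += 1
        if n_pairs == 0:
            continue
        for cell, count in enumerate(pair_counts):
            if count == 0:
                continue  # treat 0 * log(0) as 0
            i, j = divmod(cell, size)
            p_ij = count / n_pairs
            bbc[cell] += p_ij * log2(p_ij / (p[i] * p[j]))
    return bbc
```

With the default four-letter alphabet the result has 16 entries, one per ordered pair of bases. Passing the alphabet explicitly keeps vectors comparable between sequences that lack some letters.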

bbc(k=10, alphabet=None)[source]

Shortcut to base_base_correlation.

find_coding_sequences(cpu=6)[source]

Tries all the ORFs and translates every possible protein.

Returns:A tuple describing the coding sequence: (ORF, start, end, sequence).
Return type:tuple

Warning

This is currently untested.

find_translations(cpu=6)[source]

Finds and translates peptides from any ORF in the sequence.

classmethod from_reference(chrom, start, end=None, length=None)[source]

Create a Sequence object from a given locus.

gc_content()[source]

Computes the GC content for the sequence.
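A minimal standalone sketch of the computation, assuming the GC content is reported as the fraction of G and C characters over the full sequence length (the library may treat ambiguous bases or the output scale differently):

```python
def gc_content(seq):
    """Fraction of characters in `seq` that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```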

get_annotations()[source]

Return a list of bound SequenceAnnotation objects.

Returns:A list of annotations for the sequence representing different kinds of information about sub-sequences, such as protein domains.
Return type:list
reverse_complement()[source]

Reverse complement the sequence (compatible with IUPAC codes).
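One common way to support IUPAC ambiguity codes is a translation table built with str.maketrans. The sketch below is a standalone illustration and is not taken from the library's source.

```python
# Complement table covering the IUPAC nucleotide codes in both cases.
# For example R (A or G) pairs with Y (C or T), and N maps to itself.
_IUPAC_COMPLEMENT = str.maketrans(
    "ACGTURYSWKMBDHVNacgturyswkmbdhvn",
    "TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn",
)


def reverse_complement(seq):
    """Complement every character, then reverse the string."""
    return seq.translate(_IUPAC_COMPLEMENT)[::-1]
```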

to_fasta(line_len=80, full_header=False)[source]

Converts the sequence to a valid fasta string.

Parameters:
  • line_len (int) – The maximum line length for the sequence.
  • full_header (bool) – Add the contents of the info field to the header. (default: False).
Returns:

A fasta string.

Return type:

str
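The line wrapping controlled by line_len can be sketched as follows. This standalone function takes the identifier and sequence as arguments and ignores the full_header option; the exact header layout used by the library is not shown here and is an assumption.

```python
def to_fasta(uid, seq, line_len=80):
    """Format `seq` as a FASTA record with lines of at most `line_len`."""
    lines = [">" + uid]
    lines.extend(seq[i:i + line_len] for i in range(0, len(seq), line_len))
    return "\n".join(lines) + "\n"
```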

translate(no_check=False)[source]

Use the genetic code to translate a DNA or RNA sequence into an amino acid sequence.
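A minimal sketch of translation with the standard genetic code. The handling of RNA input, of codons containing ambiguous bases (rendered as X) and of trailing incomplete codons (dropped) are assumptions made for this example; the effect of the no_check parameter is not modelled.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(seq):
    """Translate a DNA or RNA string codon by codon ("*" marks a stop)."""
    seq = seq.upper().replace("U", "T")  # treat RNA like DNA
    usable = len(seq) - len(seq) % 3     # ignore an incomplete last codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )
```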

gepyto.structures.sequences.maketrans()

Return a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.

gepyto.structures.sequences.smith_waterman(seq1, seq2, penalties=None, output='sequences')[source]

Compute a pairwise local sequence alignment using the Smith-Waterman algorithm.

The output parameter determines how results will be represented:

If “sequences” is chosen, the two aligned sequences will be returned with gaps represented by dashes (“-”).

If “alignment” is chosen, a single encoded string will be returned where “M” represents matches, “I” represents insertions, “D” represents deletions and “X” represents mismatches. This is done with respect to the first sequence.

In any mode, the first returned element is always the raw similarity score. Note that because this is local alignment, gaps at both ends won’t be penalized. Also, only one of the potentially many best alignments will be output.

Warning

This implementation is not heavily optimized and is not written in a low-level language. It can be used for small sequences or for a low number of comparisons, but should not be used in large-scale products.

Note

Some functionality, such as affine gap penalties and substitution matrices, is not implemented.

The default penalty scheme is the following:

{
    "match": 2,
    "mismatch": -1,
    "gap": -1
}

You can follow this pattern to set your own penalty scores.
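To make the scoring scheme concrete, here is a compact standalone sketch of local alignment with a linear gap penalty that uses the same default scores. It is not the library's code: the name local_align is hypothetical, only the gapped-sequences output is produced, and ties are broken in an arbitrary but fixed order.

```python
DEFAULT_PENALTIES = {"match": 2, "mismatch": -1, "gap": -1}


def local_align(seq1, seq2, penalties=None):
    """Return (score, aligned_seq1, aligned_seq2) for one best alignment."""
    scores = dict(DEFAULT_PENALTIES)
    scores.update(penalties or {})

    rows, cols = len(seq1) + 1, len(seq2) + 1
    matrix = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix. Cells never go below zero, which makes the
    # alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq1[i - 1] == seq2[j - 1]
            diag = matrix[i - 1][j - 1] + scores["match" if same else "mismatch"]
            up = matrix[i - 1][j] + scores["gap"]
            left = matrix[i][j - 1] + scores["gap"]
            matrix[i][j] = max(0, diag, up, left)
            if matrix[i][j] > best:
                best, best_pos = matrix[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out1, out2 = [], []
    while i > 0 and j > 0 and matrix[i][j] > 0:
        same = seq1[i - 1] == seq2[j - 1]
        step = scores["match" if same else "mismatch"]
        if matrix[i][j] == matrix[i - 1][j - 1] + step:
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif matrix[i][j] == matrix[i - 1][j] + scores["gap"]:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))
```

Passing a partial dict such as {"match": 1} overrides only that score and keeps the other defaults.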

Region

class gepyto.structures.region.Region(chrom, start, end)[source]

Region object to represent a part of the genome.

Parameters:
  • chrom (str) – The chromosome.
  • start (int) – The start of the region.
  • end (int) – The end of the region.

This can either represent a contiguous region or a fragmented region with multiple non-overlapping segments. This object can easily be converted to a Sequence using the Region.sequence property. It is also easy to test overlap with another region or to test if an object is contained within this region using the in operator.

as_range(iterator=False)[source]

Get a range corresponding to all the nucleotide positions.

distance_to(region)[source]

Computes the distance to the given Region.

static from_str(s)[source]

Parses the region object (contiguous only) from a string.

The expected format is chrX:START-END.
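The expected format can be parsed with a short regular expression. The sketch below is standalone and hypothetical (parse_region is not a gepyto function); dropping the chr prefix and rejecting regions whose start exceeds their end are assumptions.

```python
import re

# Matches strings such as "chr2:12345-12347" (the "chr" prefix is optional).
_REGION_PATTERN = re.compile(
    r"^(?:chr)?(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_region(s):
    """Return (chrom, start, end) parsed from a chrX:START-END string."""
    match = _REGION_PATTERN.match(s.strip())
    if match is None:
        raise ValueError("Could not parse a region from {!r}".format(s))
    start, end = int(match.group("start")), int(match.group("end"))
    if start > end:
        raise ValueError("start must not be greater than end")
    return match.group("chrom"), start, end
```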

overlaps_with(region)[source]

Tests overlap with another region.

sequence

Builds a Sequence object representing the region.

If the region is non-contiguous, a tuple of sequences is returned.

union(region)[source]

Primary method to create non-contiguous regions.

This will create a region represented by the union of the current Region and the provided Region. This means that overlapping segments will be merged to avoid redundancy.
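The merging behaviour can be sketched on plain (start, end) pairs from a single chromosome. This is an illustration under assumptions rather than the library's code: coordinates are taken to be inclusive, and segments that are merely adjacent are left separate.

```python
def merge_segments(segments):
    """Merge overlapping (start, end) pairs into a sorted, minimal list."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous segment: extend it instead of adding
            # a redundant entry.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

The union of two regions then corresponds to merging the concatenation of their segment lists.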