Python Objects (structures)¶
Genes¶
-
class
gepyto.structures.genes.
Gene
(**kwargs)[source]¶ Python object representing a gene.
Store the following information:
Required
- build: The genome build.
- chrom: The chromosome.
- start and end: The genomic positions for the gene.
- strand: The strand (either 1 or -1).
- xrefs: A dict of id mappings to multiple databases.
- transcripts: A list of Transcript objects.
Optional
- symbol: An HGNC symbol.
- desc: A short description.
- exons: A list of pairs of positions for exons.
Genes can only be built from keyword arguments (kwargs). This makes for more readable code and avoids mistakes.
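The keyword-only construction described above can be illustrated with a small stand-in class. This is a minimal sketch, not gepyto's implementation: `GeneSketch` is a hypothetical name, the validation rules are assumptions, and the field values are made up for illustration. Only the field names come from the lists above.

```python
class GeneSketch:
    """Hypothetical stand-in showing keyword-only construction (not gepyto's Gene)."""

    REQUIRED = ("build", "chrom", "start", "end", "strand", "xrefs", "transcripts")
    OPTIONAL = ("symbol", "desc", "exons")

    def __init__(self, **kwargs):
        # Assumption: missing required fields and unknown fields are errors.
        missing = [name for name in self.REQUIRED if name not in kwargs]
        if missing:
            raise TypeError("Missing required fields: " + ", ".join(missing))
        unknown = set(kwargs) - set(self.REQUIRED) - set(self.OPTIONAL)
        if unknown:
            raise TypeError("Unknown fields: " + ", ".join(sorted(unknown)))

        for name in self.REQUIRED:
            setattr(self, name, kwargs[name])
        # Optional fields default to None when they are not provided.
        for name in self.OPTIONAL:
            setattr(self, name, kwargs.get(name))


# Illustrative values only; they do not describe a real gene.
gene = GeneSketch(
    build="GRCh37", chrom="1", start=1000, end=2000, strand=1,
    xrefs={}, transcripts=[], symbol="EXAMPLE1",
)
```

Because the constructor accepts only `**kwargs`, a positional call such as `GeneSketch("GRCh37")` raises a `TypeError`, so every value at the call site is labeled with its field name.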
-
classmethod
factory_ensembl_id
(ensembl_id, xrefs=None, build='GRCh37')[source]¶ Builds a gene object from its Ensembl ID.
Parameters: ensembl_id (str) – The Ensembl ID. Returns: The Gene object. Return type: Gene
-
classmethod
factory_symbol
(symbol, build='GRCh37')[source]¶ Builds a gene object from its HGNC symbol.
Parameters: symbol (str) – The HGNC symbol. Returns: The Gene object. Return type: Gene
-
get_ortholog_sequences
()[source]¶ Queries Ensembl to get Sequence objects representing orthologs.
Returns: A list of gepyto.structures.sequences.Sequence
Return type: list
-
get_paralog_sequences
()[source]¶ Queries Ensembl to get Sequence objects representing paralogs.
Returns: A list of gepyto.structures.sequences.Sequence
Return type: list
-
classmethod
get_xrefs
(field, query, build='GRCh37')[source]¶ Queries the HGNC (HUGO Gene Nomenclature Committee) service to get gene IDs for other databases.
Parameters: Returns: A dict representing information on the gene.
Return type: dict If no matching gene can be found, None is returned.
-
classmethod
get_xrefs_from_ensembl_id
(ensembl_id, build='GRCh37')[source]¶ Queries the HGNC (HUGO Gene Nomenclature Committee) service to get gene IDs for other databases.
Parameters: ensembl_id (str) – The gene Ensembl ID to query. Returns: A dict representing information on the gene. Return type: dict If no gene with this Ensembl ID can be found, None is returned.
-
classmethod
get_xrefs_from_symbol
(symbol, build='GRCh37')[source]¶ Queries the HGNC (HUGO Gene Nomenclature Committee) service to get gene IDs for other databases.
Parameters: symbol (str) – The gene symbol to query. Returns: A dict representing information on the gene. Return type: dict If no gene with this symbol can be found, None is returned.
-
region
¶ Lazily loads the Region object for this Gene.
-
class
gepyto.structures.genes.
Transcript
(**kwargs)[source]¶ Python object representing a transcript.
Store the following information:
Required
- build: The genome build.
- chrom: The chromosome.
- start and end: The genomic positions for the transcript.
- enst: The corresponding Ensembl transcript id.
Optional
- appris_cat: The APPRIS category.
- parent: The corresponding Gene object.
- biotype: The biotype as given by Ensembl.
-
classmethod
factory_position
(region, build='GRCh37')[source]¶ Gets a list of transcripts overlapping with the given position.
Parameters: pos (str) – A genomic position of the form chr2:12345-12347. Returns: A list of Transcript
Return type: list This method uses the Ensembl API.
-
get_sequence
(seq_type='genomic')[source]¶ Build a Sequence object representing the transcript.
Parameters: seq_type (str) – This can be either genomic, cds, cdna or protein. Returns: A Sequence object representing the feature. Return type: gepyto.structures.sequences.Sequence
-
region
¶ Lazily loads the Region object for this Transcript.
Variants¶
-
class
gepyto.structures.variants.
SNP
(*args, **kwargs)[source]¶ Class representing a Single Nucleotide Polymorphism (SNP).
Instances can be created in two ways: either by providing ordered fields:
chrom, pos, rs, ref, alt
or by using named parameters.
-
classmethod
from_ensembl_api
(rs, build='GRCh37')[source]¶ Gets the information from the Ensembl REST API.
Parameters: - rs – The rs number for the variant of interest.
- build – The genome build (e.g. GRCh37 or GRCh38).
-
classmethod
from_str
(s, rs=None)[source]¶ Parses a variant object from a str of the form chrXX:YYYY_R/A.
Parameters: - s – The string to parse the SNP from (Format: chrXX:YYY_R/A).
- rs – An optional parameter specifying the rs number.
Returns: A list of SNP objects.
If it is a multi-allelic locus, the list will contain one SNP per alternative allele.
If it is not, this will be a list of length 1.
-
classmethod
-
class
gepyto.structures.variants.
Indel
(*args, **kwargs)[source]¶ Class representing short insertions/deletions (Indels).
Either initialize with the parameters corresponding to:
chrom, pos, rs, ref, alt
or by using the corresponding named parameters. The notation we are using is consistent with the VCF format, but it is different from some APIs (e.g. Ensembl). We will try to standardize everything to comply with the VCF format.
The latter represents deletions by using the preceding nucleotide for the reference.
e.g. If the genomic sequence is AAGAA -> AAAA (deletion of the G) the indel will be represented as follows:
- Start: 2
- Ref: AG
- Alt: A
- length: 1 (difference between allele lengths)
e.g. If the genomic sequence is AAGAA -> AAGCAA (insertion of the C) the indel will be represented as follows:
- Start: 3
- Ref: G
- Alt: GC
- length: 1
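The two examples above can be reproduced with a short sketch. The helper names (`vcf_deletion`, `vcf_insertion`, `indel_length`) are hypothetical and are not part of gepyto. Positions are assumed to be 1-based, and the length is assumed to be the absolute difference between allele lengths, which matches both examples.

```python
def vcf_deletion(reference, del_start, del_length=1):
    """Describe a deletion VCF-style, anchored on the preceding base.

    del_start is the 1-based position of the first deleted base (hypothetical helper).
    """
    if del_start < 2:
        raise ValueError("A preceding base is needed to anchor the deletion.")
    anchor = del_start - 1                       # 1-based position of the anchor base
    ref = reference[anchor - 1:anchor + del_length]
    alt = reference[anchor - 1]
    return anchor, ref, alt


def vcf_insertion(reference, after_pos, inserted):
    """Describe an insertion placed right after the 1-based position after_pos."""
    anchor_base = reference[after_pos - 1]
    return after_pos, anchor_base, anchor_base + inserted


def indel_length(ref, alt):
    """Assumed definition: absolute difference between the allele lengths."""
    return abs(len(ref) - len(alt))


deletion = vcf_deletion("AAGAA", 3)          # AAGAA -> AAAA (deletion of the G)
insertion = vcf_insertion("AAGAA", 3, "C")   # AAGAA -> AAGCAA (insertion of the C)
```

Here `deletion` is `(2, "AG", "A")` and `insertion` is `(3, "G", "GC")`, matching the Start, Ref and Alt values listed above.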
-
classmethod
from_ensembl_api
(rs, build='GRCh37')[source]¶ Gets the information from the Ensembl REST API.
Parameters: - rs – The rs number for the variant of interest.
- build – The genome build (e.g. GRCh37 or GRCh38).
-
length
¶ Computes the size of the indel.
This is the difference between the length of the alleles.
-
gepyto.structures.variants.
variant_list_to_dataframe
(variants)[source]¶ Converts a regular Python list of Variant objects to a Pandas dataframe.
Parameters: variants (list) – A list of Variant objects. Returns: A Pandas dataframe with genomic information as well as extra fields defined in the _info parameter. Return type: DataFrame This function will add annotations from the _info dict if such a parameter is set for the considered variants. This dict has to be comparable between all elements of the list or an Exception will be raised.
This functionality can be very useful to flexibly add annotations to variants and to write them to a CSV file.
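The following sketch illustrates the idea of merging per-variant `_info` annotations into tabular rows. It is not gepyto's code: `FakeVariant` and `variants_to_rows` are hypothetical names, and "comparable" is interpreted here as "all `_info` dicts share the same keys", which is an assumption. The values are made up for illustration.

```python
class FakeVariant:
    """Hypothetical stand-in for a gepyto variant object."""

    def __init__(self, chrom, pos, rs, ref, alt, info=None):
        self.chrom, self.pos, self.rs = chrom, pos, rs
        self.ref, self.alt = ref, alt
        self._info = info or {}


def variants_to_rows(variants):
    """Build one dict per variant, adding the _info annotations as columns."""
    rows = []
    expected_keys = None
    for variant in variants:
        keys = set(variant._info)
        if expected_keys is None:
            expected_keys = keys
        elif keys != expected_keys:
            # Assumption: mismatching annotation keys are treated as an error.
            raise ValueError("All variants must share the same _info keys.")
        row = {
            "chrom": variant.chrom, "pos": variant.pos, "rs": variant.rs,
            "ref": variant.ref, "alt": variant.alt,
        }
        row.update(variant._info)
        rows.append(row)
    return rows


rows = variants_to_rows([
    FakeVariant("1", 100, "rs1", "A", "T", info={"maf": 0.1}),
    FakeVariant("2", 200, "rs2", "G", "C", info={"maf": 0.3}),
])
```

A list of dicts like `rows` can be passed to `pandas.DataFrame(rows)` and then written out with `DataFrame.to_csv`.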
Sequences¶
-
class
gepyto.structures.sequences.
Sequence
(uid, s, seq_type, info=None)[source]¶ Object to represent biological sequences.
Parameters: Common examples for the info attributes:
- species: Homo sapiens
- species_ncbi_tax_id: 9606
- description: dystroglycan 1
- db_name: RefSeq
- db_acc: NM_004393
-
base_base_correlation
(k=10, alphabet=None)[source]¶ Compute the base base correlation (BBC) for the sequence.
Parameters: - k (int) – k is a parameter of the BBC. Intuitively, it represents the maximum distance to observe correlation between bases.
- alphabet (iterable) – List of possible characters. This can be used to avoid autodetection of the alphabet in the case where sequences with missing letters are to be compared.
Returns: A 16 dimensional vector representing the BBC.
Return type: np.ndarray
A description of the method can be found here: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4272582
Liu, Zhi-Hua, et al. “Base-Base Correlation a Novel Sequence Feature and its Applications.” Bioinformatics and Biomedical Engineering, 2007. ICBBE 2007. The 1st International Conference on. IEEE, 2007.
This implementation is generalized for any sequence type.
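As a rough illustration, the sketch below computes a BBC-like vector using the commonly cited form T(a, b) = Σ_{l=1..k} P_ab(l) · log2(P_ab(l) / (P_a · P_b)), where P_ab(l) is the frequency of base a followed by base b at distance l. This is a sketch under that assumption: the function name `bbc_sketch` is hypothetical, and gepyto's own normalization, ordering and return type (np.ndarray) may differ.

```python
import math
from itertools import product


def bbc_sketch(seq, k=10, alphabet="ACGT"):
    """Return one BBC-like value per ordered pair of letters (16 for DNA)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("The sequence is empty.")

    # Frequency of each single letter in the sequence.
    freq = {a: seq.count(a) / n for a in alphabet}

    vector = []
    for a, b in product(alphabet, repeat=2):
        total = 0.0
        for dist in range(1, k + 1):
            n_pairs = n - dist
            if n_pairs <= 0:
                break
            # Joint frequency of a at position i and b at position i + dist.
            count = sum(
                1 for i in range(n_pairs)
                if seq[i] == a and seq[i + dist] == b
            )
            p_ab = count / n_pairs
            if p_ab > 0:  # pairs that never occur contribute nothing
                total += p_ab * math.log2(p_ab / (freq[a] * freq[b]))
        vector.append(total)
    return vector


vector = bbc_sketch("ACGTACGTAC", k=2)
```

Passing a fixed `alphabet` keeps the vector length and ordering identical across sequences, which is what makes vectors from sequences with missing letters comparable.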
-
find_coding_sequences
(cpu=6)[source]¶ Tries all the ORFs and translates every possible protein.
Returns: A tuple containing the information of the coding sequence. (ORF, start, end, sequence) Return type: tuple
Warning
This is currently untested.
-
classmethod
from_reference
(chrom, start, end=None, length=None)[source]¶ Create a Sequence object from a given locus.
-
get_annotations
()[source]¶ Return a list of bound SequenceAnnotation objects.
Returns: A list of annotations for the sequence representing different kind of information about sub-sequences like protein domains. Return type: list
-
gepyto.structures.sequences.
maketrans
()¶ Return a translation table usable for str.translate().
If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.
-
gepyto.structures.sequences.
smith_waterman
(seq1, seq2, penalties=None, output='sequences')[source]¶ Compute a pairwise local sequence alignment using the Smith Waterman algorithm.
The output parameter determines how results will be represented:
If “sequences” is chosen, the two aligned sequences will be returned with gaps represented by dashes (“-”).
If “alignment” is chosen, a single encoded string will be returned where “M” represents matches, “I” represents insertions, “D” represents deletions and “X” represents mismatches. This is done with respect to the first sequence.
In any mode, the first returned element is always the raw similarity score. Note that because this is local alignment, gaps at both ends won’t be penalized. Also, only one of the potentially many best alignments will be output.
Warning
This implementation is not optimized and is not written in a low-level language. It can be used for small sequences or for a small number of comparisons, but it should not be used in large-scale projects.
Note
Some functionality, such as affine gap penalties and substitution matrices, is not implemented.
The default penalty scheme is the following:
{ "match": 2, "mismatch": -1, "gap": -1 }
You can follow this pattern to set your own penalty scores.
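For readers unfamiliar with the algorithm, here is a minimal, self-contained sketch of Smith-Waterman local alignment with a linear gap penalty, using the same penalty dictionary layout as above. It is not gepyto's implementation: `smith_waterman_sketch` is a hypothetical name, only a "sequences"-style output is produced, and tie-breaking between equally good alignments may differ from the library.

```python
def smith_waterman_sketch(seq1, seq2, penalties=None):
    """Return (score, aligned_seq1, aligned_seq2) for one best local alignment."""
    if penalties is None:
        penalties = {"match": 2, "mismatch": -1, "gap": -1}
    gap = penalties["gap"]

    def substitution(a, b):
        return penalties["match"] if a == b else penalties["mismatch"]

    # Fill the score matrix; local alignment never lets a cell drop below 0.
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution(seq1[i - 1], seq2[j - 1]),
                score[i - 1][j] + gap,   # gap in seq2
                score[i][j - 1] + gap,   # gap in seq1
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    aligned1, aligned2 = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(seq1[i - 1], seq2[j - 1]):
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(aligned1)), "".join(reversed(aligned2))


result = smith_waterman_sketch("GGACGTGG", "TTACGTTT")
```

With the default penalties, the shared ACGT core scores 4 matches × 2 = 8, and the unmatched flanks are left out because gaps and mismatches at the ends of a local alignment are not penalized.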
Region¶
-
class
gepyto.structures.region.
Region
(chrom, start, end)[source]¶ Region object to represent a part of the genome.
Parameters: This can either represent a contiguous region or a fragmented region with multiple non-overlapping segments. This object can easily be converted to a Sequence using the
Region.sequence()
property. It is also easy to test overlap with another region or to test if an object is contained within this region using the in operator.
-
static
from_str
(s)[source]¶ Parses the region object (contiguous only) from a string.
The expected format is chrX:START-END.
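Parsing this format can be sketched with a regular expression. The function name `parse_region` is hypothetical, and two details are assumptions rather than documented gepyto behavior: the "chr" prefix is treated as optional and stripped, and a start greater than the end is rejected.

```python
import re

# chrX:START-END, with an optional "chr" prefix (assumption).
_REGION_RE = re.compile(r"^(?:chr)?([^:]+):(\d+)-(\d+)$")


def parse_region(s):
    """Parse a contiguous region string into (chrom, start, end)."""
    match = _REGION_RE.match(s.strip())
    if match is None:
        raise ValueError("Expected a region of the form chrX:START-END.")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("The start position must not exceed the end position.")
    return chrom, start, end


parsed = parse_region("chr2:12345-12347")
```

The resulting tuple matches the order of the `Region(chrom, start, end)` constructor arguments shown above.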
-
sequence
¶ Builds a Sequence object representing the region.
If the region is non contiguous, a tuple of sequences is returned.
-
static