Database module (db)¶
Ensembl¶
Ensembl provides a very useful REST API. We added an interface that does some throttling and manages the JSON response from the API.
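As an illustration of what throttling a REST client involves, here is a minimal sketch of a rate limiter that enforces a minimum delay between calls. It is not gepyto's implementation: the `Throttle` class, its parameters and the injectable `clock` and `sleep` hooks are all hypothetical names chosen for this example.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls (hypothetical helper)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the rate limit, then record the call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

A client would call `wait()` right before each HTTP request. Injecting the clock and the sleep function keeps the class testable without real delays.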
Appris¶
The Appris database aims to annotate alternative splice isoforms. We provide an interface to this database that can annotate gepyto.structures.genes.Transcript objects.
-
gepyto.db.appris.get_category_for_transcript(enst)
Fetches the annotation for a transcript (ENST).
Parameters: enst (str) – The Ensembl transcript id.
Returns: The APPRIS category for this transcript.
Return type: str
-
gepyto.db.appris.get_main_transcripts(ensg)
Gets the main Ensembl transcript id for the provided gene based on the APPRIS annotation.
Parameters: ensg (str) – The Ensembl gene number (ENSG000000).
Returns: The “main” transcript (ENST). If there is an appris_principal annotation, it is returned. Otherwise, the order of priority is: appris_candidate_longest_ccds, appris_candidate_ccds, appris_candidate_longest_seq, appris_candidate.
Return type:
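The priority order described above can be sketched as a small ranking function. This is an illustration only, not gepyto's code: `pick_main_transcript` and its `categories` argument (a dict mapping ENST ids to APPRIS category strings) are hypothetical, and the real function may handle ties or return several ids differently.

```python
# Priority order quoted from the description above, highest priority first.
APPRIS_PRIORITY = [
    "appris_principal",
    "appris_candidate_longest_ccds",
    "appris_candidate_ccds",
    "appris_candidate_longest_seq",
    "appris_candidate",
]


def pick_main_transcript(categories):
    """Return the ENST id with the best-ranked APPRIS category, or None.

    `categories` is a hypothetical dict mapping ENST ids to category strings.
    Ties are broken alphabetically by id, which is an arbitrary choice.
    """
    rank = {name: i for i, name in enumerate(APPRIS_PRIORITY)}
    annotated = [
        (rank[category], enst)
        for enst, category in categories.items()
        if category in rank
    ]
    if not annotated:
        return None
    return min(annotated)[1]
```

For example, given one transcript annotated appris_candidate and another annotated appris_candidate_ccds, the second one is selected.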
Index¶
Index is a lower-level implementation of an indexing data structure. It was designed to handle any delimited text file with a chromosome column and a position column.
Indexing is fast because the file is not read in full: jumps of a fixed size are used. The jump size is estimated from the index_rate parameter, which is an estimated coverage of the indexed file. Lower rates take less time to index and create smaller index files, but result in slower lookups.
The structure of the index is a pickled Python dictionary using chromosomes as keys and lists of (position, file seek) as values.
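To make the idea of a sparse, jump-based index concrete, here is a minimal sketch that samples a file by seeking forward a fixed number of bytes and recording the next complete line. It is not gepyto's implementation: `build_sparse_index` and its `jump` parameter are hypothetical, and the file is assumed to be opened in binary mode so that byte offsets can be used for seeking.

```python
import io


def build_sparse_index(f, chrom_col, pos_col, delimiter=b"\t", jump=64):
    """Map each chromosome to a list of (position, seek offset) pairs.

    Only a sample of the lines is indexed: after each indexed line the
    cursor jumps `jump` bytes ahead and resumes at the next full line.
    """
    index = {}
    size = f.seek(0, io.SEEK_END)
    target = 0
    while target < size:
        f.seek(target)
        if target > 0:
            f.readline()  # discard the line we landed in (usually partial)
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        fields = line.rstrip(b"\n").split(delimiter)
        chrom = fields[chrom_col].decode()
        index.setdefault(chrom, []).append((int(fields[pos_col]), offset))
        target = offset + jump
    return index
```

The resulting dictionary could be stored with `pickle.dump`. Note that a sampled index like this one can skip the first line of a chromosome, so a real builder needs some extra handling of chromosome boundaries.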
-
gepyto.db.index.build_index(fn, chrom_col, pos_col, delimiter='\t', skip_lines=0, index_rate=0.2, ignore_startswith=None)
Build an index for the given file.
Parameters: - fn (str) – The filename
- chrom_col (int) – The column representing the chromosome (0 based).
- pos_col (int) – The column for the position on the chromosome (0 based).
- delimiter (str) – The delimiter for the columns (default tab).
- skip_lines (int) – Number of header lines to skip.
- index_rate (float) – The approximate rate of line indexing. As an example, a file with 1000 lines and the default index_rate of 0.2 will have an index with ~200 entries.
- ignore_startswith (str) – Ignore lines that start with a given string. This can be used to skip headers, but will not be used to parse the rest of the file.
Returns: The index filename.
Return type: str
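The arithmetic behind index_rate can be sketched as follows. The first function restates the example above (1000 lines at a rate of 0.2 gives about 200 entries). The second is an assumption about how a byte jump could be derived from the rate for lines of roughly uniform length; it is not taken from gepyto's source, and both function names are hypothetical.

```python
def expected_index_entries(n_lines, index_rate):
    """Approximate number of index entries for a file with `n_lines` lines."""
    return round(n_lines * index_rate)


def estimate_jump_size(mean_line_bytes, index_rate):
    """Bytes between indexed lines so that about `index_rate` of lines are kept.

    Assumption: lines have roughly the same length, so indexing one line
    out of every 1 / index_rate means advancing mean_line_bytes / index_rate.
    """
    return round(mean_line_bytes / index_rate)
```

With lines of about 50 bytes and the default rate of 0.2, this gives a jump of about 250 bytes, that is, roughly one indexed line out of five.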
-
gepyto.db.index.get_index(fn)
Restores the index for a given file or builds it if the index was not previously created.
Parameters: fn (str) – The filename of the file to index.
Returns: The numpy array representing the actual index.
Return type: numpy.ndarray
-
gepyto.db.index.goto(f, index, chrom, pos)
Given a file, a locus and the index, go to the genomic coordinates.
Parameters: - f (file) – An open file.
- index (tuple) – A tuple whose first element is an information dict containing the delimiter, chromosome column and position column. The second element is a numpy matrix containing the encoded loci and the “tell” positions.
- chrom – The queried chromosome.
- pos – The queried position on the chromosome.
Returns: True if the position was found and the cursor moved, False if the queried chromosome and position were not found.
Return type: bool
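A lookup against a sparse index can be sketched with a binary search followed by a short linear scan. This is a simplified illustration, not gepyto's goto: it uses a plain dict of sorted (position, seek offset) lists instead of the tuple described above, and it assumes a binary-mode file sorted by position within each chromosome, with the first line of every chromosome present in the index.

```python
import bisect
import io


def goto(f, index, chrom, pos, chrom_col=0, pos_col=1, delimiter=b"\t"):
    """Move `f` to the first line matching (chrom, pos); return True if found."""
    entries = index.get(chrom)
    if not entries:
        return False
    start = f.tell()
    # Last indexed entry located strictly before `pos`, or the first entry.
    i = bisect.bisect_left(entries, (pos,)) - 1
    f.seek(entries[max(i, 0)][1])
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        fields = line.rstrip(b"\n").split(delimiter)
        if fields[chrom_col].decode() != chrom:
            break
        here = int(fields[pos_col])
        if here == pos:
            f.seek(offset)  # leave the cursor at the start of the matching line
            return True
        if here > pos:
            break
    f.seek(start)  # not found: put the cursor back where it was
    return False
```

After a successful call, `f.readline()` returns the matching line. The sparser the index, the longer the linear scan, which is the trade-off controlled by index_rate.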
UCSC¶
-
class gepyto.db.ucsc.UCSC(db=None)
Provides raw access to the UCSC MySQL database.
The database will be set to the db parameter or to the BUILD as defined in gepyto's settings.
Later versions could implement features like throttling, but for now this is a very simple interface.
-
gepyto.db.ucsc.get_centromere(chromosome)
Returns a contiguous region representing the centromere of a chromosome.
Parameters: chromosome (str) – The chromosome, e.g. “3”.
Returns: A region corresponding to the centromere.
Return type: Region
This is done by connecting to the UCSC MySQL server.
-
gepyto.db.ucsc.get_phylop_100_way(region)
Get a vector of phyloP conservation scores for a given region.
Scores represent the -log(p-value) under a null hypothesis (H0) of neutral evolution. Positive values represent conservation and negative values represent fast-evolving bases.
The UCSC MySQL database only contains aggregate scores for chunks of 1024 bases. We return the results for the subset of the required region that is fully contained in the UCSC bins.
Because UCSC uses 0-based indexing, we adjust the gepyto region before querying. This means that the user should use 1-based indexing, as usual when creating the Region object.
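The coordinate adjustment and the "fully contained" rule can be sketched as below. This is not the function's actual code: the helper names are hypothetical, and the bins in the usage note are made-up aligned intervals, whereas real UCSC rows need not start at multiples of 1024. The sketch assumes bins are given as 0-based, half-open (start, end) pairs, the convention used by UCSC tables.

```python
def to_zero_based_half_open(start, end):
    """Convert a 1-based inclusive interval to a 0-based half-open one."""
    return start - 1, end


def bins_fully_contained(start, end, bins):
    """Return the bins that lie entirely inside the 1-based region [start, end].

    `bins` is a list of 0-based, half-open (bin_start, bin_end) pairs.
    """
    q_start, q_end = to_zero_based_half_open(start, end)
    return [
        (b_start, b_end)
        for b_start, b_end in bins
        if q_start <= b_start and b_end <= q_end
    ]
```

With bins (0, 1024), (1024, 2048) and (2048, 3072), a 1-based region spanning 1025 to 3000 keeps only (1024, 2048): the last bin extends past the region, so it is dropped rather than partially reported.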
Warning
This function has a fairly low resolution. You should download the raw data (e.g. from goldenpath) if you need scores for each base. Also note that gepyto can’t parse bigWig, but it can parse Wig files.
Warning
This function is untested.
-
gepyto.db.ucsc.get_telomere(chromosome)
Returns a noncontiguous region representing the telomeres of a chromosome.
Parameters: chromosome (str) – The chromosome, e.g. “3”.
Returns: A region corresponding to the telomeres.
Return type: Region
This is done by connecting to the UCSC MySQL server.