Database module (db)

Ensembl

Ensembl provides a very useful REST API. We provide an interface that handles throttling and parses the JSON responses from the API.

gepyto.db.ensembl.query_ensembl(url)[source]

Query the given (Ensembl REST API) URL and get a JSON response.

Parameters:url (str) – The API URL to query.
Returns:A Python object loaded from the JSON response from the server.
gepyto.db.ensembl.get_url_prefix(build)[source]

Generate an Ensembl REST API URL prefix for the given build.
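
For example, a minimal usage sketch (the build string, host and endpoint below are illustrative, not documented gepyto behaviour):

    from gepyto.db import ensembl

    # get_url_prefix("GRCh37") could be used to generate the host part of the
    # URL; the build string and the lookup endpoint are assumptions here.
    url = ("https://grch37.rest.ensembl.org/lookup/id/ENSG00000157764"
           "?content-type=application/json")

    # query_ensembl throttles the request and parses the JSON response.
    info = ensembl.query_ensembl(url)
    print(info)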

Appris

The Appris database aims to annotate alternative splice isoforms. We provide an interface to this database that can annotate gepyto.structures.genes.Transcript objects.

gepyto.db.appris.get_category_for_transcript(enst)[source]

Fetches the annotation for a transcript (ENST).

Parameters:enst (str) – The Ensembl transcript id.
Returns:The APPRIS category for this transcript.
Return type:str
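
For example (the transcript id is purely illustrative):

    from gepyto.db import appris

    # ENST00000288602 is an arbitrary example transcript id.
    category = appris.get_category_for_transcript("ENST00000288602")
    print(category)  # an APPRIS label such as an "appris_principal" variant
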
gepyto.db.appris.get_main_transcripts(ensg)[source]

Gets the main Ensembl transcript id for the provided gene based on the APPRIS annotation.

Parameters:ensg (str) – The Ensembl gene number (ENSG000000).
Returns:

The “main” transcript (ENST). If there is an appris_principal annotation, it is returned. Otherwise, the order of priority is the following: appris_candidate_longest_ccds, appris_candidate_ccds, appris_candidate_longest_seq, appris_candidate.

Return type:

str
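
A quick sketch of the documented behaviour, with an illustrative gene id:

    from gepyto.db import appris

    # Returns the single "main" ENST id according to the priority order above.
    main_enst = appris.get_main_transcripts("ENSG00000157764")
    print(main_enst)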

gepyto.db.appris.get_transcripts_for_gene(ensg)[source]

Fetches the transcripts and their annotation for a given gene (ENSG).

Parameters:ensg (str) – The Ensembl gene id.
Returns:A list of transcript IDs and their categories (tuples).
Return type:tuple
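
A short sketch assuming the documented return value, a list of (ENST, category) tuples (the gene id is illustrative):

    from gepyto.db import appris

    for enst, category in appris.get_transcripts_for_gene("ENSG00000157764"):
        print(enst, category)
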
gepyto.db.appris.init_db()[source]

This is an initialization method for the database.

We use this to lazily load the database, i.e. only when one of the module's functions is actually called.

Index

Index is a lower level implementation of an indexing data structure. It was designed to handle any text-delimited file with a chromosome column and a position column.

Indexing is fast because the whole file is not read: jumps of a fixed size are used instead. The jump size is estimated from the index_rate parameter, which is an estimated coverage of the indexed file. Lower rates will take less time to index and create smaller index files, but will result in slower lookups.

The structure of the index is a pickled Python dictionary using chromosomes as keys and lists of (position, file seek) tuples as values.
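
Purely as an illustration of that layout (all values are made up), the unpickled object would look like this:

    # Chromosome names map to lists of (position, file offset) pairs sampled
    # from the indexed file. The numbers below are invented for illustration.
    example_index = {
        "1": [(10177, 0), (1500000, 524288)],
        "X": [(60001, 2097152)],
    }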

gepyto.db.index.build_index(fn, chrom_col, pos_col, delimiter='\t', skip_lines=0, index_rate=0.2, ignore_startswith=None)[source]

Build an index for the given file.

Parameters:
  • fn (str) – The filename.
  • chrom_col (int) – The column representing the chromosome (0 based).
  • pos_col (int) – The column for the position on the chromosome (0 based).
  • delimiter (str) – The delimiter for the columns (default tab).
  • skip_lines (int) – Number of header lines to skip.
  • index_rate (float) – The approximate rate of line indexing. As an example, a file with 1000 lines and the default index_rate of 0.2 will have an index with ~200 entries.
  • ignore_startswith (str) – Ignore lines that start with a given string. This can be used to skip headers, but will not be used to parse the rest of the file.
Returns:

The index filename.

Return type:

str
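
A hedged sketch for a hypothetical tab-delimited file with the chromosome in the first column and the position in the second (the file name and column layout are assumptions):

    from gepyto.db import index

    # "variants.txt" and its column layout are assumptions for this example.
    idx_fn = index.build_index(
        "variants.txt",
        chrom_col=0,
        pos_col=1,
        delimiter="\t",
        skip_lines=1,            # one header line
        index_rate=0.2,
        ignore_startswith="#",   # skip comment lines
    )
    print(idx_fn)  # path to the created index file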

gepyto.db.index.get_index(fn)[source]

Restores the index for a given file, or builds it if the index was not previously created.

Parameters:fn (str) – The filename of the file to index.
Returns:The numpy array representing the actual index.
Return type:numpy.ndarray
gepyto.db.index.goto(f, index, chrom, pos)[source]

Given a file, a locus and the index, go to the genomic coordinates.

Parameters:
  • f (file) – An open file.
  • index (tuple) – The first element is an information dict containing the delimiter, chromosome column and position column. The second element is a numpy matrix containing the encoded loci and the “tell” positions.
  • chrom – The queried chromosome.
  • pos – The queried position on the chromosome.
Returns:

True if the position was found and the cursor moved, False if the queried chromosome and position weren’t found.

Return type:

bool
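
A sketch combining get_index and goto, assuming the object returned by get_index can be passed directly as the index argument (the file name and locus are illustrative):

    from gepyto.db import index

    # Builds the index on first use, restores it on subsequent calls.
    idx = index.get_index("variants.txt")

    with open("variants.txt") as f:
        # Move the file cursor to the queried locus, if it is indexed.
        if index.goto(f, idx, "3", 123456):
            print(f.readline().rstrip())
        else:
            print("chromosome/position not found")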

UCSC

class gepyto.db.ucsc.UCSC(db=None)[source]

Provides raw access to the UCSC MySQL database.

The database will be set to the db parameter or to the BUILD as defined in gepyto's settings.

Later versions could implement features like throttling, but for now this is a very simple interface.

query_gap_table(chromosome, ucsc_type)[source]

Get either the “telomere” or “centromere” of a given chromosome.

raw_sql(sql, params)[source]

Execute a raw SQL query.
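
A hedged sketch of the raw access; the database name, the gap table schema and the "%s" parameter style are assumptions about UCSC and the underlying MySQL driver, not documented gepyto behaviour:

    from gepyto.db import ucsc

    db = ucsc.UCSC(db="hg19")  # the database name is an assumption
    rows = db.raw_sql(
        "SELECT chrom, chromStart, chromEnd FROM gap WHERE chrom = %s",
        ("chr3",),
    )
    print(rows)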

gepyto.db.ucsc.get_centromere(chromosome)[source]

Returns a contiguous region representing the centromere of a chromosome.

Parameters:chromosome (str) – The chromosome, e.g. “3”.
Returns:A region corresponding to the centromere.
Return type:Region

This is done by connecting to the UCSC MySQL server.

gepyto.db.ucsc.get_phylop_100_way(region)[source]

Get a vector of phyloP conservation scores for a given region.

Scores represent the -log(p-value) under an H0 of neutral evolution. Positive values represent conservation and negative values represent fast-evolving bases.

The UCSC MySQL database only contains aggregate scores for chunks of 1024 bases. We return the results for the subset of the requested region that is fully contained in the UCSC bins.

Because UCSC uses 0-based indexing, we adjust the gepyto region before querying. This means that the user should use 1-based indexing, as usual when creating the Region object.

Warning

This function has a fairly low resolution. You should download the raw data (e.g. from goldenPath) if you need scores for each base. Also note that gepyto can’t parse bigWig, but it can parse Wig files.

Warning

This function is untested.
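
With the warnings above in mind, a minimal sketch that avoids building a Region by hand by reusing one returned by get_centromere (purely illustrative):

    from gepyto.db import ucsc

    # Any Region works; here we simply reuse the centromere of chromosome 3.
    region = ucsc.get_centromere("3")
    scores = ucsc.get_phylop_100_way(region)
    print(len(scores))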

gepyto.db.ucsc.get_telomere(chromosome)[source]

Returns a non-contiguous region representing the telomeres of a chromosome.

Parameters:chromosome (str) – The chromosome, e.g. “3”.
Returns:A region corresponding to the telomeres.
Return type:Region

This is done by connecting to the UCSC MySQL server.
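
Both helpers connect to the UCSC MySQL server, so network access is required. A minimal sketch with an illustrative chromosome:

    from gepyto.db import ucsc

    centromere = ucsc.get_centromere("3")  # contiguous Region
    telomeres = ucsc.get_telomere("3")     # region covering both telomeres
    print(centromere)
    print(telomeres)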