Parsing common bioinformatics file formats (formats)

Impute2

class gepyto.formats.impute2.Impute2File(fn, mode='line', **kwargs)[source]

Class representing an Impute2File.

This is used either to generate a dosage matrix, where columns represent variants and rows represent samples, or to read the file line by line by iterating over it.

This also implements the context manager interface.

Usage:

from gepyto.formats.impute2 import Impute2File

# Read as probabilities (Line tuples).
with Impute2File(fn) as f:
    for line in f:
        # line has name, chrom, pos, a1, a2, probabilities
        print(line)

# Read as dosage.
with Impute2File(fn, "dosage") as f:
    for dosage_vector, info in f:
        pass

# Read as a matrix.
with Impute2File(fn) as f:
    # 1 row per sample and 1 column per variant. Values between 0 and 2.
    # as_matrix() returns a tuple: the matrix and a dataframe of variant info.
    m, variant_info = f.as_matrix()

If you use the dosage mode, you can also add additional arguments:

  • prob_threshold: Genotype probability cutoff for no call values (NaN).

  • is_chr23: Not implemented yet, but dosage is computed differently for sex chromosomes in males (hemizygous).

  • sex_vector: Not implemented yet, but this is a vector representing the sex of every sample (for dosage computation on sex chromosomes).
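To make the role of prob_threshold concrete, here is a minimal sketch of one common way to turn genotype probabilities into dosages. It is not gepyto's implementation: the function name dosage_from_probabilities is hypothetical, and it assumes three probabilities per sample (a1/a1, a1/a2, a2/a2), that dosage counts copies of a2, and that a sample becomes NaN when its highest genotype probability falls below the threshold.

```python
import math


def dosage_from_probabilities(probs, prob_threshold=0.0):
    """Convert a flat list of genotype probabilities into per-sample dosages.

    Assumptions (not taken from gepyto): probs holds three values per
    sample, P(a1/a1), P(a1/a2), P(a2/a2); dosage counts copies of a2; a
    sample whose best genotype probability is below prob_threshold is
    reported as NaN (no call).
    """
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")

    dosages = []
    for i in range(0, len(probs), 3):
        p_hom_a1, p_het, p_hom_a2 = probs[i:i + 3]
        if max(p_hom_a1, p_het, p_hom_a2) < prob_threshold:
            dosages.append(math.nan)  # no call
        else:
            # Expected number of a2 alleles, between 0 and 2.
            dosages.append(p_het + 2 * p_hom_a2)
    return dosages


# Three samples: confident a1/a1, confident a2/a2, and an uncertain call.
example = dosage_from_probabilities(
    [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.4, 0.3, 0.3],
    prob_threshold=0.9,
)
print(example)
```

With a threshold of 0.9, the third sample has no genotype probability high enough to be called, so its dosage is NaN rather than a value between 0 and 2.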

Warning

Be careful with the Impute2File.as_matrix() function as it will try to load the WHOLE Impute2 file in memory.

as_matrix()[source]

Creates a numpy dosage matrix from this file.

Returns: A numpy matrix in which each column holds the dosages (between 0 and 2) for one variant, and a dataframe describing the variants (major, minor, maf).
Type: tuple

Warning

This will attempt to load the whole file in memory.

readline()[source]

Read a single line from the Impute2File.

This will return either a Line including the genotype probabilities or a dosage vector. This depends on the mode (the second argument given to the file when it was opened).

Available modes are dosage and line.
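As an illustration of what the line mode yields, the sketch below parses one line of text into a tuple with the fields listed in the usage example (name, chrom, pos, a1, a2, probabilities). It is a stand-in, not gepyto's code: Impute2Line and parse_impute2_line are hypothetical names, and the column order in the file (chrom, name, pos, a1, a2, then three probabilities per sample) is an assumption that may not match every file.

```python
from collections import namedtuple

# Hypothetical stand-in for the Line tuple described in the documentation.
Impute2Line = namedtuple(
    "Impute2Line", ["name", "chrom", "pos", "a1", "a2", "probabilities"]
)


def parse_impute2_line(text):
    """Parse one whitespace-separated line into an Impute2Line.

    Assumption: the first five columns are chrom, name, pos, a1, a2 and
    every following group of three columns is one sample's probabilities.
    """
    fields = text.split()
    if len(fields) < 5 or (len(fields) - 5) % 3 != 0:
        raise ValueError("expected 5 columns plus 3 probabilities per sample")

    chrom, name, pos, a1, a2 = fields[:5]
    probabilities = [float(value) for value in fields[5:]]
    return Impute2Line(name, chrom, int(pos), a1, a2, probabilities)


# Made-up line with two samples.
line = parse_impute2_line("1 rs12345 1231415 A G 1 0 0 0.2 0.3 0.5")
print(line.name, line.pos, len(line.probabilities) // 3)
```

Reading a file this way, one line at a time, keeps memory use independent of the number of variants, which is the reason to prefer iteration over as_matrix() for large files.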

SeqXML

class gepyto.formats.seqxml.SeqXML(fn)[source]

Parses the SeqXML format representing sequence data.

Parameters: fn (str) – The filename of the SeqXML file.

The format description is available at orthoxml.org (visited Nov. 2014).

The returned object will have a list of entries which are Sequence objects.

get_seq(uid)[source]

Get a sequence from its unique identifier.

Parameters: uid (str) – The sequence id.

GTF/GFF

gepyto.formats.gtf.GFFFile

alias of GTFFile

class gepyto.formats.gtf.GTFFile(fn)[source]

Parser for GTF files.

This implementation was based on the format specification as described here: http://www.sanger.ac.uk/resources/software/gff/spec.html.

You can use this parser on both local files (gzip-compressed or not) and remote files (on an HTTP server).

For every line, this class will return a named tuple with the following fields:

  • seqname
  • source
  • features
  • start
  • end
  • score
  • strand
  • frame
  • attributes
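The mapping from a tab-separated line to these nine fields can be sketched as follows. This is an illustration under assumptions, not the parser's actual code: GFFLine and parse_gff_line are hypothetical names, a "." column is taken to mean a missing value, and the attributes column is assumed to use key=value pairs separated by semicolons (GTF files that use a key "value" syntax would need different handling).

```python
from collections import namedtuple

# Hypothetical tuple mirroring the field names listed above.
GFFLine = namedtuple(
    "GFFLine",
    ["seqname", "source", "features", "start", "end",
     "score", "strand", "frame", "attributes"],
)


def _none_if_dot(value, convert=str):
    """Return None for the '.' placeholder, else the converted value."""
    return None if value == "." else convert(value)


def parse_gff_line(text):
    """Parse one tab-separated line into a GFFLine (key=value attributes assumed)."""
    fields = text.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")

    seqname, source, feature, start, end, score, strand, frame, raw = fields

    attributes = {}
    if raw != ".":
        for pair in raw.split(";"):
            pair = pair.strip()
            if pair:
                key, _, value = pair.partition("=")
                attributes[key] = value

    return GFFLine(
        seqname, source, feature, int(start), int(end),
        _none_if_dot(score, float), _none_if_dot(strand),
        _none_if_dot(frame, int), attributes,
    )


# Illustrative input, reconstructed by hand rather than copied from a real file.
example = "O60503\tUniProtKB\tChain\t1\t1353\t.\t.\t.\tID=PRO_0000195708;Note=Adenylate cyclase type 9"
print(parse_gff_line(example))
```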

Example usage:

>>> import gepyto.formats.gtf
>>> url = "http://www.uniprot.org/uniprot/O60503.gff"
>>> gtf = gepyto.formats.gtf.GTFFile(url)
>>> gtf
<gepyto.formats.gtf.GTFFile object at 0x1006dd590>
>>> gtf.readline()
_Line(seqname=u'O60503', source=u'UniProtKB', features=u'Chain', start=1, end=1353, score=None, strand=None, frame=None, attributes={u'Note': u'Adenylate cyclase type 9', u'ID': u'PRO_0000195708'})

Line

alias of _Line

Wiggle (fixedStep)

Parser for Wiggle Track Format files.

class gepyto.formats.wig.WiggleFile(stream)[source]

Parser for WIG files.

This class parses the file into a pandas dataframe with one row per position. In the process, the inherent compactness of the Wiggle format is lost in exchange for a representation that is easier to manage, so more efficient parsers should be used for large files.

This implementation is based on the specification from: http://genome.ucsc.edu/goldenpath/help/wiggle.html

Warning

fixedStep is the only implemented mode for now. Future releases might improve this parser to be more flexible.

To access the parsed information, use the WiggleFile.as_dataframe() function.

Usage (given a file on disk):

>>> import gepyto.formats.wig
>>> with gepyto.formats.wig.WiggleFile("my_file.wig") as f:
...     df = f.as_dataframe()
...
>>> df
  chrom     pos  value
0  chr3  400601     11
1  chr3  400701     22
2  chr3  400801     33
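The rows above are what one would expect from a fixedStep block declared with chrom=chr3, start=400601 and step=100, followed by the values 11, 22 and 33. The sketch below shows that expansion in plain Python. It is not the WiggleFile implementation: expand_fixed_step is a hypothetical helper, it returns a list of tuples instead of a dataframe, and it ignores the optional span parameter.

```python
def expand_fixed_step(lines):
    """Expand fixedStep Wiggle text into (chrom, pos, value) rows.

    Assumptions: only fixedStep blocks are present, each declaration gives
    chrom, start and step, and the optional span parameter is ignored.
    """
    rows = []
    chrom = pos = step = None

    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue

        if line.startswith("fixedStep"):
            # Declaration line: key=value tokens after the keyword.
            params = dict(token.split("=", 1) for token in line.split()[1:])
            chrom = params["chrom"]
            pos = int(params["start"])
            step = int(params["step"])
            continue

        if chrom is None:
            raise ValueError("data line found before a fixedStep declaration")

        # Each data line is one value; the position advances by step.
        rows.append((chrom, pos, float(line)))
        pos += step

    return rows


wig_text = [
    "fixedStep chrom=chr3 start=400601 step=100",
    "11",
    "22",
    "33",
]
for row in expand_fixed_step(wig_text):
    print(row)
```

Each value is stored with its explicit position, which is why the expanded form is much larger than the original file for long tracks.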