Parsing common bioinformatics file formats (formats)

Impute2

class gepyto.formats.impute2.Impute2File(fn, mode='line', **kwargs)[source]

Class representing an IMPUTE2 file.

Use it either to generate a dosage matrix (one column per variant, one row per sample) or to read the file line by line using the generator syntax.

This also implements the context manager interface.

Usage:

from gepyto.formats.impute2 import Impute2File

# Read as probabilities (Line tuples).
with Impute2File(fn) as f:
    for line in f:
        # line has name, chrom, pos, a1, a2, probabilities
        print(line)

# Read as dosage (the mode is the second argument of the constructor).
with Impute2File(fn, "dosage") as f:
    for dosage_vector, info in f:
        pass

# Read as a matrix.
with Impute2File(fn) as f:
    # 1 row per sample and 1 column per variant. Values between 0 and 2.
    # as_matrix() returns a tuple: the matrix and a variant description.
    m, variant_info = f.as_matrix()

In dosage mode, you can also pass additional keyword arguments:

  • prob_threshold: Genotype probability cutoff for no call values (NaN).

  • is_chr23: Not implemented yet. Dosage is computed differently for the sex chromosomes in males, who are hemizygous.

  • sex_vector: Not implemented yet. A vector giving the sex of every sample (for dosage computation on sex chromosomes).
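As a rough illustration of what prob_threshold controls, the sketch below computes per-sample dosages from a flat list of genotype probabilities in pure Python. It assumes three probabilities per sample (homozygous a1, heterozygous, homozygous a2), that the dosage counts copies of a2, and that a sample becomes a no call (NaN) when its largest probability falls below the cutoff. The function name dosage_from_probabilities is hypothetical, and gepyto's own computation may differ, for instance in which allele it counts.

```python
import math


def dosage_from_probabilities(probs, prob_threshold=0.0):
    """Convert a flat list of genotype probabilities into dosages.

    Assumption: probs holds triplets (P(a1/a1), P(a1/a2), P(a2/a2)),
    one triplet per sample, and the dosage counts copies of a2.
    """
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")

    dosages = []
    for i in range(0, len(probs), 3):
        p_hom1, p_het, p_hom2 = probs[i:i + 3]
        if max(p_hom1, p_het, p_hom2) < prob_threshold:
            # Assumed no-call rule: no genotype is confident enough.
            dosages.append(math.nan)
        else:
            # Expected number of a2 alleles, between 0 and 2.
            dosages.append(p_het + 2 * p_hom2)
    return dosages


# Made-up probabilities for three samples.
print(dosage_from_probabilities([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.4, 0.3, 0.3],
                                prob_threshold=0.9))
```

With a cutoff of 0.9, the third sample in this made-up input has no genotype probability above the threshold, so under the assumed rule it is reported as NaN rather than as a dosage.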

Warning

Be careful with the Impute2File.as_matrix() function as it will try to load the WHOLE Impute2 file in memory.

as_matrix()[source]

Creates a numpy dosage matrix from this file.

Returns: A numpy matrix where columns represent variant dosage between 0 and 2 and a dataframe describing the variants (major, minor, maf).
Type: tuple

Warning

This will attempt to load the whole file in memory.
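To make the orientation of the matrix concrete, here is a minimal pure-Python sketch that turns per-variant dosage vectors into a layout with one row per sample and one column per variant. It uses nested lists instead of numpy, the helper name dosage_matrix is hypothetical, and it is not how gepyto builds its matrix. It also shows why memory use grows with the number of samples times the number of variants: every dosage is held at once.

```python
def dosage_matrix(variant_dosages):
    """Transpose per-variant dosage vectors into a samples x variants layout."""
    if not variant_dosages:
        return []
    n_samples = len(variant_dosages[0])
    if any(len(v) != n_samples for v in variant_dosages):
        raise ValueError("all variants must have the same number of samples")
    # zip(*...) turns "one vector per variant" into "one row per sample".
    return [list(row) for row in zip(*variant_dosages)]


# Two variants typed on three samples (made-up values).
per_variant = [
    [0.0, 1.0, 2.0],  # variant 1
    [2.0, 1.9, 0.1],  # variant 2
]
matrix = dosage_matrix(per_variant)
print(len(matrix), "samples x", len(matrix[0]), "variants")
```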

readline()[source]

Read a single line from the Impute2File.

Depending on the mode (the second argument given when the file was opened), this returns either a Line including the genotype probabilities or a dosage vector.

Available modes are dosage and line.
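For readers unfamiliar with the layout of an IMPUTE2 line, the following sketch parses one line into a tuple with the fields listed in the usage example (name, chrom, pos, a1, a2, probabilities). It is a stand-alone approximation, not gepyto's parser: the function name parse_impute2_line is hypothetical, and the column order (chromosome, variant name, position, two alleles, then probability triplets) is an assumption about the input file.

```python
from collections import namedtuple

# Mirrors the fields named in the usage example; this is not gepyto's class.
Line = namedtuple("Line", ["name", "chrom", "pos", "a1", "a2", "probabilities"])


def parse_impute2_line(text):
    """Parse one whitespace-delimited IMPUTE2 line into a Line tuple."""
    fields = text.split()
    if len(fields) < 5 or (len(fields) - 5) % 3 != 0:
        raise ValueError("expected 5 columns followed by probability triplets")
    # Assumed column order: chrom, name, pos, a1, a2.
    chrom, name, pos, a1, a2 = fields[:5]
    probabilities = [float(x) for x in fields[5:]]
    return Line(name, chrom, int(pos), a1, a2, probabilities)


# Made-up line describing one variant and two samples.
example = parse_impute2_line("1 rs123 1000 A G 1 0 0 0.1 0.8 0.1")
print(example.name, example.pos, len(example.probabilities) // 3, "samples")
```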

SeqXML

class gepyto.formats.seqxml.SeqXML(fn)[source]

Parses the SeqXML format representing sequence data.

Parameters: fn (str) – The filename of the SeqXML file. The format description is available at orthoxml.org (visited Nov. 2014).

The returned object will have a list of entries which are Sequence objects.

get_seq(uid)[source]

Get a sequence from its unique identifier.

Parameters: uid (str) – The sequence id.
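The lookup by identifier can be pictured with the standard library alone. The sketch below indexes the entries of a small, made-up SeqXML document by their id attribute and retrieves one of them. It assumes the usual SeqXML layout (entry elements carrying an id, with the sequence in a DNAseq, RNAseq or AAseq child) and no XML namespace. The names index_entries and lookup are hypothetical and do not reflect how gepyto's SeqXML class is implemented.

```python
import xml.etree.ElementTree as ET

# Made-up document for illustration only.
SAMPLE = """<seqXML seqXMLversion="0.4">
  <entry id="seq1">
    <description>example protein</description>
    <AAseq>MKV</AAseq>
  </entry>
  <entry id="seq2">
    <DNAseq>ACGT</DNAseq>
  </entry>
</seqXML>"""

SEQUENCE_TAGS = ("DNAseq", "RNAseq", "AAseq")


def index_entries(xml_text):
    """Map each entry id to a (sequence type, sequence) pair."""
    root = ET.fromstring(xml_text)
    index = {}
    for entry in root.iter("entry"):
        for tag in SEQUENCE_TAGS:
            node = entry.find(tag)
            if node is not None:
                index[entry.get("id")] = (tag, (node.text or "").strip())
                break
    return index


def lookup(index, uid):
    """Return the (sequence type, sequence) pair for a unique identifier."""
    return index[uid]


entries = index_entries(SAMPLE)
print(lookup(entries, "seq1"))
```

Building a dictionary once keeps each lookup by identifier cheap, at the cost of holding every sequence in memory.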