Bioinformatics common file formats parsing (formats)
Impute2
class gepyto.formats.impute2.Impute2File(fn, mode='line', **kwargs)
Class representing an Impute2File.
This is used to either generate a dosage matrix where columns represent variants and rows represent samples or to read the file line by line using the generator syntax.
This also implements the context manager interface.
Usage:
    from gepyto.formats.impute2 import Impute2File

    # Read as probabilities (Line tuples).
    with Impute2File(fn) as f:
        for line in f:
            # line has name, chrom, pos, a1, a2, probabilities
            print(line)

    # Read as dosage.
    with Impute2File(fn, "dosage") as f:
        for dosage_vector, info in f:
            pass

    # Read as a matrix.
    with Impute2File(fn) as f:
        # 1 row per sample and 1 column per variant. Values between 0 and 2
        m = f.as_matrix()
If you use the dosage mode, you can also pass additional arguments:
- prob_threshold: Genotype probability cutoff for no call values (NaN).
- is_chr23: Not implemented yet, but dosage is computed differently on sex chromosomes for males, who are hemizygous.
- sex_vector: Not implemented yet, but this is a vector representing the sex of every sample (for dosage computation on sex chromosomes).
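To make the dosage mode and the prob_threshold cutoff more concrete, here is a minimal standalone sketch that does not use gepyto. It assumes the usual Impute2 layout of five leading fields followed by three genotype probabilities (AA, AB, BB) per sample. The helper names parse_impute2_line and dosage_vector are hypothetical. Counting the second allele and applying the cutoff to the largest of the three probabilities are also assumptions; the library may make different choices.

```python
import math


def parse_impute2_line(line):
    """Split one Impute2 line into its five leading fields and the
    per-sample probability triples (assumed order: AA, AB, BB)."""
    fields = line.split()
    first, second, pos, a1, a2 = fields[:5]
    values = [float(x) for x in fields[5:]]
    if len(values) % 3 != 0:
        raise ValueError("expected three probabilities per sample")
    triples = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return (first, second, int(pos), a1, a2), triples


def dosage_vector(triples, prob_threshold=0.0):
    """Expected count of the second allele for every sample.

    Assumption: a sample becomes a no call (NaN) when its most likely
    genotype has a probability below prob_threshold.
    """
    dosages = []
    for p_aa, p_ab, p_bb in triples:
        if max(p_aa, p_ab, p_bb) < prob_threshold:
            dosages.append(math.nan)
        else:
            dosages.append(p_ab + 2 * p_bb)
    return dosages


# Made-up line with three samples, for illustration only.
line = "1 rs12345 1000 A G 1 0 0 0.1 0.8 0.1 0.4 0.3 0.3"
info, triples = parse_impute2_line(line)
print(info)
print(dosage_vector(triples, prob_threshold=0.9))
```

With a cutoff of 0.9, only the first sample in this made-up line keeps a numeric dosage; the other two become NaN because no genotype reaches the threshold.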
Warning
Be careful with the Impute2File.as_matrix() function as it will try to load the WHOLE Impute2 file in memory.
SeqXML
class gepyto.formats.seqxml.SeqXML(fn)
Parses the SeqXML format representing sequence data.
Parameters: fn (str) – The filename of the SeqXML file.
The format description is available at orthoxml.org (visited Nov. 2014). The returned object will have a list of entries which are Sequence objects.
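For readers unfamiliar with the format, the sketch below shows roughly what "a list of entries" means, using only the Python standard library rather than gepyto. It assumes a SeqXML document is a seqXML root holding entry elements, each with an id attribute, an optional description, and one of AAseq, DNAseq or RNAseq. The sample document and the read_entries helper are hypothetical. The plain dictionaries stand in for the library's Sequence objects, whose attributes are not described here.

```python
import xml.etree.ElementTree as ET

# Tiny made-up SeqXML document, for illustration only.
SAMPLE = """<seqXML seqXMLversion="0.4">
  <entry id="P1">
    <description>hypothetical protein</description>
    <AAseq>MKTAYIAK</AAseq>
  </entry>
  <entry id="D1">
    <DNAseq>ACGTACGT</DNAseq>
  </entry>
</seqXML>
"""

# Element name -> sequence type label used in this sketch.
SEQ_TAGS = {"AAseq": "protein", "DNAseq": "dna", "RNAseq": "rna"}


def read_entries(xml_text):
    """Return one dictionary per entry element found in the document."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("entry"):
        record = {
            "id": entry.get("id"),
            "description": entry.findtext("description"),
            "type": None,
            "sequence": None,
        }
        # Keep whichever sequence element is present in this entry.
        for tag, kind in SEQ_TAGS.items():
            seq = entry.findtext(tag)
            if seq is not None:
                record["type"] = kind
                record["sequence"] = seq.strip()
        entries.append(record)
    return entries


for record in read_entries(SAMPLE):
    print(record["id"], record["type"], record["sequence"])
```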