Bioinformatics common file formats parsing (formats)
Impute2
class gepyto.formats.impute2.Impute2File(fn, mode='line', **kwargs)
Class representing an Impute2File.
This class is used either to generate a dosage matrix, where columns represent variants and rows represent samples, or to read the file line by line using the generator syntax.
This also implements the context manager interface.
Usage:
    from gepyto.formats.impute2 import Impute2File

    # Read as probabilities (Line tuples).
    with Impute2File(fn) as f:
        for line in f:
            # line has name, chrom, pos, a1, a2, probabilities
            print(line)

    # Read as dosage.
    with Impute2File(fn, "dosage") as f:
        for dosage_vector, info in f:
            pass

    # Read as a matrix.
    with Impute2File(fn) as f:
        # 1 row per sample and 1 column per variant. Values between 0 and 2.
        m = f.as_matrix()
If you use the dosage mode, you can also pass additional arguments:
- prob_threshold: Genotype probability cutoff for no call values (NaN).
- is_chr23: Not implemented yet, but dosage is computed differently for sex chromosomes in men (hemizygous).
- sex_vector: Not implemented yet, but this is a vector representing the sex of every sample (for dosage computation on sex chromosomes).
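To make the dosage mode and the prob_threshold cutoff more concrete, here is a minimal sketch of how a dosage vector could be derived from one line of genotype probabilities. It is not gepyto's implementation. The helper names (dosage_from_probs, parse_impute2_line), the column order (taken from the usage comment above), the choice to count copies of the second allele, and the rule "no call when the best probability is below the cutoff" are all assumptions made for illustration.

```python
import math


def dosage_from_probs(probs, prob_threshold=0.0):
    """Turn a flat list of (P(AA), P(AB), P(BB)) triplets into dosages.

    Assumption: dosage counts expected copies of the second allele,
    so it ranges from 0 to 2. Samples whose highest genotype
    probability is below prob_threshold become NaN (no call).
    """
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")

    dosages = []
    for i in range(0, len(probs), 3):
        p_aa, p_ab, p_bb = probs[i:i + 3]
        if max(p_aa, p_ab, p_bb) < prob_threshold:
            dosages.append(math.nan)
        else:
            dosages.append(p_ab + 2 * p_bb)
    return dosages


def parse_impute2_line(line, prob_threshold=0.0):
    """Split one whitespace-delimited line into (dosage_vector, info).

    Assumption: the first five columns are name, chrom, pos, a1, a2,
    followed by three probabilities per sample.
    """
    fields = line.split()
    name, chrom, pos, a1, a2 = fields[:5]
    probs = [float(x) for x in fields[5:]]
    info = {"name": name, "chrom": chrom, "pos": int(pos), "a1": a1, "a2": a2}
    return dosage_from_probs(probs, prob_threshold), info
```

Stacking one such vector per variant as columns would give the samples-by-variants layout described for the matrix form.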
Warning
Be careful with the Impute2File.as_matrix() function, as it will try to load the WHOLE Impute2 file into memory.
SeqXML
class gepyto.formats.seqxml.SeqXML(fn)
Parses the SeqXML format representing sequence data.
Parameters: fn (str) – The filename of the SeqXML file.
The format description is available at orthoxml.org (visited Nov. 2014). The returned object will have a list of entries, which are Sequence objects.
GTF/GFF
class gepyto.formats.gtf.GTFFile(fn)
Parser for GTF files.
This implementation was based on the format specification as described here: http://www.sanger.ac.uk/resources/software/gff/spec.html.
You can use this parser on both local files (gzip-compressed or not) and remote files (on an HTTP server).
For every line, this class will return a named tuple with the following fields:
- seqname
- source
- features
- start
- end
- score
- strand
- frame
- attributes
Example usage:
>>> import gepyto.formats.gtf
>>> url = "http://www.uniprot.org/uniprot/O60503.gff"
>>> gtf = gepyto.formats.gtf.GTFFile(url)
>>> gtf
<gepyto.formats.gtf.GTFFile object at 0x1006dd590>
>>> gtf.readline()
_Line(seqname=u'O60503', source=u'UniProtKB', features=u'Chain', start=1, end=1353, score=None, strand=None, frame=None, attributes={u'Note': u'Adenylate cyclase type 9', u'ID': u'PRO_0000195708'})
Line
alias of _Line
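As an illustration of how one tab-delimited record maps onto the nine fields listed above, the following is a minimal sketch and not the GTFFile implementation. The function names (parse_attributes, parse_line) are hypothetical, and the handling of "." as a missing value and the conversion of score to float are assumptions.

```python
from collections import namedtuple

# Field names follow the list given in this section.
Line = namedtuple(
    "Line",
    ["seqname", "source", "features", "start", "end",
     "score", "strand", "frame", "attributes"],
)


def parse_attributes(text):
    """Parse the attributes column into a dict.

    Handles both GTF style (key "value";) and GFF3 style (key=value;).
    Assumption: values do not themselves contain semicolons.
    """
    attributes = {}
    for chunk in text.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, sep, value = chunk.partition("=")
        if not sep or " " in key.strip():
            # No usable "=": treat as GTF style, split on the first space.
            key, _, value = chunk.partition(" ")
        attributes[key.strip()] = value.strip().strip('"')
    return attributes


def parse_line(line):
    """Convert one tab-delimited GTF/GFF record into a Line tuple."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))

    seqname, source, features, start, end, score, strand, frame, attrs = fields
    return Line(
        seqname=seqname,
        source=source,
        features=features,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=None if strand == "." else strand,
        frame=None if frame == "." else frame,
        attributes=parse_attributes(attrs),
    )
```

Returning a named tuple keeps each record immutable and lets callers use either attribute access (record.start) or positional unpacking.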
Wiggle (fixedStep)
Parser for Wiggle Track Format files.
class gepyto.formats.wig.WiggleFile(stream)
Parser for WIG files.
This returns a pandas dataframe with all the necessary information. In the process, all the inherent compactness of the Wiggle format is lost in exchange for an easier to manage representation. This means that more efficient parsers should be used for large chunks of data.
This implementation is based on the specification from: http://genome.ucsc.edu/goldenpath/help/wiggle.html
Warning
fixedStep is the only implemented mode for now. Future releases might improve this parser to be more flexible.
To access the parsed information, use the WiggleFile.as_dataframe() function.
Usage (given a file on disk):
>>> import gepyto.formats.wig
>>> with gepyto.formats.wig.WiggleFile("my_file.wig") as f:
...     df = f.as_dataframe()
...
>>> df
  chrom     pos  value
0  chr3  400601     11
1  chr3  400701     22
2  chr3  400801     33
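The loss of compactness mentioned above comes from expanding each fixedStep declaration into one explicit row per value. The sketch below shows that expansion in plain Python. It is not the WiggleFile implementation: the function name parse_fixed_step is hypothetical, a default step of 1 is an assumption, and the optional span parameter is ignored.

```python
def parse_fixed_step(lines):
    """Expand fixedStep Wiggle lines into (chrom, pos, value) rows.

    Assumptions: "track", "browser" and comment lines are skipped,
    step defaults to 1 when absent, and span is not used.
    """
    rows = []
    chrom = None
    pos = None
    step = 1

    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue

        if line.startswith("variableStep"):
            raise NotImplementedError("only fixedStep is handled in this sketch")

        if line.startswith("fixedStep"):
            # Declaration line, e.g. "fixedStep chrom=chr3 start=400601 step=100".
            params = dict(token.split("=", 1) for token in line.split()[1:])
            chrom = params["chrom"]
            pos = int(params["start"])
            step = int(params.get("step", 1))
            continue

        if chrom is None:
            raise ValueError("data line found before any fixedStep declaration")

        # Each data line holds one value; the position is implicit.
        rows.append((chrom, pos, float(line)))
        pos += step

    return rows
```

The resulting rows can be handed to a table constructor with columns chrom, pos and value, which matches the shape of the dataframe shown in the usage example.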