Biology
GFF3
petlx.bio.gff3.fromgff3(filename, region=None)

Extract feature rows from a GFF3 file, e.g.:
>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = etl.fromgff3('fixture/sample.gff')
>>> table1.look(truncate=30)
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
| seqid        | source  | type          | start | end     | score | strand | phase | attributes                     |
+==============+=========+===============+=======+=========+=======+========+=======+================================+
| 'apidb|MAL1' | 'ApiDB' | 'supercontig' |     1 |  643292 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL2' | 'ApiDB' | 'supercontig' |     1 |  947102 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL3' | 'ApiDB' | 'supercontig' |     1 | 1060087 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL4' | 'ApiDB' | 'supercontig' |     1 | 1204112 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL5' | 'ApiDB' | 'supercontig' |     1 | 1343552 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
...
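The attributes column in the output above is displayed as a dict. In the GFF3 format the ninth column is a semicolon-separated list of tag=value pairs with percent-encoded special characters, so a conversion along the following lines is needed somewhere. This is only a sketch under that assumption: `parse_gff3_attributes` is a hypothetical helper, not part of petlx, and the sample string is made up from values visible in the tables on this page. The real parser may differ, for instance in how it treats multi-valued tags.

```python
from urllib.parse import unquote


def parse_gff3_attributes(text):
    """Turn a GFF3 attributes string into a dict (illustrative only)."""
    attributes = {}
    for pair in text.strip().split(';'):
        if not pair:
            continue  # tolerate trailing or doubled semicolons
        key, sep, value = pair.partition('=')
        if not sep:
            continue  # skip malformed entries that have no '='
        # GFF3 percent-encodes reserved characters; decode them here
        attributes[unquote(key.strip())] = unquote(value.strip())
    return attributes


# Made-up attributes string for illustration
attrs = parse_gff3_attributes('ID=apidb|MAL5_18S;Name=18S%20rRNA;size=2092')
print(attrs)
```

Values stay as strings in this sketch, which matches the `'size': '2092'` entry shown in the region query output.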
A region query string of the form ‘[seqid]’ or ‘[seqid]:[start]-[end]’ may be given for the region argument. If a region is given, the GFF3 file must be position sorted, bgzipped and tabix indexed, and pysam must be installed. E.g.:
>>> # extract from a specific genome region via tabix
... table2 = etl.fromgff3('fixture/sample.sorted.gff.gz',
...                       region='apidb|MAL5:1289593-1289595')
>>> table2.look(truncate=30)
+--------------+---------+---------------+---------+---------+-------+--------+-------+--------------------------------+
| seqid        | source  | type          | start   | end     | score | strand | phase | attributes                     |
+==============+=========+===============+=========+=========+=======+========+=======+================================+
| 'apidb|MAL5' | 'ApiDB' | 'supercontig' |       1 | 1343552 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+---------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL5' | 'ApiDB' | 'exon'        | 1289594 | 1291685 | '.'   | '+'    | '.'   | {'size': '2092', 'Parent': 'ap |
+--------------+---------+---------------+---------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL5' | 'ApiDB' | 'gene'        | 1289594 | 1291685 | '.'   | '+'    | '.'   | {'ID': 'apidb|MAL5_18S', 'web_ |
+--------------+---------+---------------+---------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL5' | 'ApiDB' | 'rRNA'        | 1289594 | 1291685 | '.'   | '+'    | '.'   | {'ID': 'apidb|rna_MAL5_18S-1', |
+--------------+---------+---------------+---------+---------+-------+--------+-------+--------------------------------+
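To make the two accepted region forms concrete, the sketch below splits a region string into its parts. `parse_region` is a hypothetical helper written for this illustration; it is not part of petlx or pysam, and it assumes the sequence identifier itself contains no colon when a coordinate range is absent.

```python
def parse_region(region):
    """Split '[seqid]' or '[seqid]:[start]-[end]' into (seqid, start, end)."""
    # split on the last colon so identifiers such as 'apidb|MAL5' stay intact
    seqid, sep, span = region.rpartition(':')
    if not sep:
        # no colon: the whole string is the sequence identifier
        return region, None, None
    start, _, end = span.partition('-')
    # int() raises ValueError if either coordinate is missing or malformed
    return seqid, int(start), int(end)


print(parse_region('apidb|MAL5'))
print(parse_region('apidb|MAL5:1289593-1289595'))
```

Note that the query output above includes every feature overlapping the requested interval, including the supercontig that spans the whole sequence, not only features that lie entirely inside it.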
Tabix (pysam)
petlx.bio.tabix.fromtabix(filename, reference=None, start=None, stop=None, region=None, header=None)

Extract rows from a tabix indexed file, e.g.:
>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = etl.fromtabix('fixture/test.bed.gz',
...                        region='Pf3D7_02_v3')
>>> table1
+---------------+----------+----------+-----------------------------+
| #chrom        | start    | end      | region                      |
+===============+==========+==========+=============================+
| 'Pf3D7_02_v3' | '0'      | '23100'  | 'SubtelomericRepeat'        |
+---------------+----------+----------+-----------------------------+
| 'Pf3D7_02_v3' | '23100'  | '105800' | 'SubtelomericHypervariable' |
+---------------+----------+----------+-----------------------------+
| 'Pf3D7_02_v3' | '105800' | '447300' | 'Core'                      |
+---------------+----------+----------+-----------------------------+
| 'Pf3D7_02_v3' | '447300' | '450450' | 'Centromere'                |
+---------------+----------+----------+-----------------------------+
| 'Pf3D7_02_v3' | '450450' | '862500' | 'Core'                      |
+---------------+----------+----------+-----------------------------+
...

>>> table2 = etl.fromtabix('fixture/test.bed.gz',
...                        region='Pf3D7_02_v3:110000-120000')
>>> table2
+---------------+----------+----------+--------+
| #chrom        | start    | end      | region |
+===============+==========+==========+========+
| 'Pf3D7_02_v3' | '105800' | '447300' | 'Core' |
+---------------+----------+----------+--------+
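The second query above returns the one row whose interval overlaps the requested range, and every field comes back as a string. The plain-Python sketch below reproduces that selection on the rows shown, without tabix or pysam. It assumes BED coordinates are 0-based half-open and the region string is 1-based inclusive; `overlapping` is a hypothetical helper, not a petlx function.

```python
# Rows copied from the first fromtabix example; all values are strings
rows = [
    ('Pf3D7_02_v3', '0', '23100', 'SubtelomericRepeat'),
    ('Pf3D7_02_v3', '23100', '105800', 'SubtelomericHypervariable'),
    ('Pf3D7_02_v3', '105800', '447300', 'Core'),
    ('Pf3D7_02_v3', '447300', '450450', 'Centromere'),
    ('Pf3D7_02_v3', '450450', '862500', 'Core'),
]


def overlapping(rows, chrom, start, end):
    """Yield rows whose interval overlaps the 1-based inclusive query."""
    # convert the query to 0-based half-open to match BED coordinates
    q_start, q_end = start - 1, end
    for row in rows:
        row_chrom, row_start, row_end = row[0], int(row[1]), int(row[2])
        if row_chrom == chrom and row_start < q_end and row_end > q_start:
            yield row


hits = list(overlapping(rows, 'Pf3D7_02_v3', 110000, 120000))
print(hits)
```

A linear scan like this reads every row; the point of the tabix index is to avoid that on large files by jumping directly to the relevant compressed blocks.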
Variant call format (PyVCF)
petlx.bio.vcf.fromvcf(filename, chrom=None, start=None, stop=None, samples=True)

Returns a table providing access to data from a variant call file (VCF). E.g.:
>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = etl.fromvcf('fixture/sample.vcf')
>>> table1.look(truncate=20)
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
| CHROM | POS     | ID          | REF | ALT    | QUAL | FILTER  | INFO                 | NA00001              | NA00002              | NA00003              |
+=======+=========+=============+=====+========+======+=========+======================+======================+======================+======================+
| '19'  |     111 | None        | 'A' | [C]    |  9.6 | None    | {}                   | Call(sample=NA00001, | Call(sample=NA00002, | Call(sample=NA00003, |
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
| '19'  |     112 | None        | 'A' | [G]    |   10 | None    | {}                   | Call(sample=NA00001, | Call(sample=NA00002, | Call(sample=NA00003, |
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
| '20'  |   14370 | 'rs6054257' | 'G' | [A]    |   29 | []      | {'DP': 14, 'H2': Tru | Call(sample=NA00001, | Call(sample=NA00002, | Call(sample=NA00003, |
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
| '20'  |   17330 | None        | 'T' | [A]    |    3 | ['q10'] | {'DP': 11, 'NS': 3,  | Call(sample=NA00001, | Call(sample=NA00002, | Call(sample=NA00003, |
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
| '20'  | 1110696 | 'rs6040355' | 'A' | [G, T] |   67 | []      | {'DP': 10, 'AA': 'T' | Call(sample=NA00001, | Call(sample=NA00002, | Call(sample=NA00003, |
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
...
petlx.bio.vcf.vcfunpackinfo(table, *keys)

Unpack the INFO field into separate fields. E.g.:
>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = (
...     etl
...     .fromvcf('fixture/sample.vcf', samples=None)
...     .vcfunpackinfo()
... )
>>> table1
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
| CHROM | POS     | ID          | REF | ALT    | QUAL | FILTER  | AA   | AC   | AF             | AN   | DB   | DP   | H2   | NS   |
+=======+=========+=============+=====+========+======+=========+======+======+================+======+======+======+======+======+
| '19'  |     111 | None        | 'A' | [C]    |  9.6 | None    | None | None | None           | None | None | None | None | None |
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
| '19'  |     112 | None        | 'A' | [G]    |   10 | None    | None | None | None           | None | None | None | None | None |
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
| '20'  |   14370 | 'rs6054257' | 'G' | [A]    |   29 | []      | None | None | [0.5]          | None | True |   14 | True |    3 |
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
| '20'  |   17330 | None        | 'T' | [A]    |    3 | ['q10'] | None | None | [0.017]        | None | None |   11 | None |    3 |
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
| '20'  | 1110696 | 'rs6040355' | 'A' | [G, T] |   67 | []      | 'T'  | None | [0.333, 0.667] | None | True |   10 | None |    2 |
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
...
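The transformation shown above can be pictured as replacing the INFO dict with one column per key, filling in None where a record lacks that key. The sketch below does this on plain dicts with a trimmed subset of the values from the table. `unpack_info` is a hypothetical stand-in, not the petlx implementation. In particular, the output above has AC and AN columns although none of the displayed rows carries those keys, which suggests the real function gets its default key list from somewhere other than the rows themselves (possibly the VCF header); the sketch simply takes the union of the keys it sees.

```python
def unpack_info(records, keys=None):
    """Spread each record's 'INFO' dict across one column per key."""
    if not keys:
        # assumption: default to every key seen in any record, sorted
        keys = sorted({k for rec in records for k in rec['INFO']})
    unpacked = []
    for rec in records:
        # keep every field except INFO itself
        row = {name: value for name, value in rec.items() if name != 'INFO'}
        for key in keys:
            row[key] = rec['INFO'].get(key)  # None when the key is absent
        unpacked.append(row)
    return unpacked


# Trimmed records based on the fromvcf example output
records = [
    {'CHROM': '19', 'POS': 111, 'INFO': {}},
    {'CHROM': '20', 'POS': 14370, 'INFO': {'DP': 14, 'H2': True, 'NS': 3}},
]
result = unpack_info(records)
for row in result:
    print(row)
```

Passing explicit keys, as in `unpack_info(records, ['DP'])`, restricts the new columns to those keys, which mirrors the role of the `*keys` argument in the signature above.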
petlx.bio.vcf.vcfmeltsamples(table, *samples)

Melt the samples columns. E.g.:
>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = (
...     etl
...     .fromvcf('fixture/sample.vcf')
...     .vcfmeltsamples()
... )
>>> table1
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
| CHROM | POS | ID   | REF | ALT | QUAL | FILTER | INFO | SAMPLE    | CALL                                                |
+=======+=====+======+=====+=====+======+========+======+===========+=====================================================+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00001' | Call(sample=NA00001, CallData(GT=0|0, HQ=[10, 10])) |
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00002' | Call(sample=NA00002, CallData(GT=0|0, HQ=[10, 10])) |
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00003' | Call(sample=NA00003, CallData(GT=0/1, HQ=[3, 3]))   |
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
| '19'  | 112 | None | 'A' | [G] |   10 | None   | {}   | 'NA00001' | Call(sample=NA00001, CallData(GT=0|0, HQ=[10, 10])) |
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
| '19'  | 112 | None | 'A' | [G] |   10 | None   | {}   | 'NA00002' | Call(sample=NA00002, CallData(GT=0|0, HQ=[10, 10])) |
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
...
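Melting turns the wide layout (one column per sample) into a long layout with one row per variant and sample, as the output above shows. The sketch below illustrates the reshaping on plain dicts. `melt_samples` is a hypothetical helper, not the petlx implementation, and plain genotype strings stand in for the PyVCF Call objects that the real table holds.

```python
def melt_samples(records, sample_names):
    """Yield one row per (variant, sample) pair."""
    for rec in records:
        # fields shared by every sample of this variant
        fixed = {k: v for k, v in rec.items() if k not in sample_names}
        for name in sample_names:
            # copy the shared fields and attach this sample's call
            yield dict(fixed, SAMPLE=name, CALL=rec[name])


# One wide record; genotype strings stand in for Call objects
records = [
    {'CHROM': '19', 'POS': 111,
     'NA00001': '0|0', 'NA00002': '0|0', 'NA00003': '0/1'},
]
melted = list(melt_samples(records, ['NA00001', 'NA00002', 'NA00003']))
for row in melted:
    print(row)
```

A table with N variants and M samples therefore yields N × M rows, which is worth keeping in mind before melting a VCF with many samples.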
petlx.bio.vcf.vcfunpackcall(table, *keys)

Unpack the call column. E.g.:
>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = (
...     etl
...     .fromvcf('fixture/sample.vcf')
...     .vcfmeltsamples()
...     .vcfunpackcall()
... )
>>> table1
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
| CHROM | POS | ID   | REF | ALT | QUAL | FILTER | INFO | SAMPLE    | DP   | GQ   | GT    | HQ       |
+=======+=====+======+=====+=====+======+========+======+===========+======+======+=======+==========+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00001' | None | None | '0|0' | [10, 10] |
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00002' | None | None | '0|0' | [10, 10] |
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00003' | None | None | '0/1' | [3, 3]   |
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
| '19'  | 112 | None | 'A' | [G] |   10 | None   | {}   | 'NA00001' | None | None | '0|0' | [10, 10] |
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
| '19'  | 112 | None | 'A' | [G] |   10 | None   | {}   | 'NA00002' | None | None | '0|0' | [10, 10] |
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
...