Biology

GFF3

petlx.bio.gff3.fromgff3(filename, region=None)

Extract feature rows from a GFF3 file, e.g.:

>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = etl.fromgff3('fixture/sample.gff')
>>> table1.look(truncate=30)
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
| seqid        | source  | type          | start | end     | score | strand | phase | attributes                     |
+==============+=========+===============+=======+=========+=======+========+=======+================================+
| 'apidb|MAL1' | 'ApiDB' | 'supercontig' |     1 |  643292 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL2' | 'ApiDB' | 'supercontig' |     1 |  947102 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL3' | 'ApiDB' | 'supercontig' |     1 | 1060087 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL4' | 'ApiDB' | 'supercontig' |     1 | 1204112 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL5' | 'ApiDB' | 'supercontig' |     1 | 1343552 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+-------+---------+-------+--------+-------+--------------------------------+
...
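
The nine fields in the table above correspond to the nine tab-delimited columns of a GFF3 feature line, with the last column parsed into a dictionary. The following is a minimal standard-library sketch of that mapping, not the petlx implementation; the helper names and the sample line (including its `size` attribute) are made up for illustration.

```python
from urllib.parse import unquote

GFF3_FIELDS = ('seqid', 'source', 'type', 'start', 'end',
               'score', 'strand', 'phase', 'attributes')


def parse_gff3_attributes(text):
    """Parse the ninth GFF3 column ('key=value;key=value') into a dict."""
    attributes = {}
    for pair in text.strip().split(';'):
        if not pair:
            continue  # tolerate a trailing ';'
        key, _, value = pair.partition('=')
        # GFF3 percent-encodes reserved characters in keys and values
        attributes[unquote(key)] = unquote(value)
    return attributes


def parse_gff3_line(line):
    """Split one tab-delimited GFF3 feature line into a nine-field row."""
    (seqid, source, ftype, start, end,
     score, strand, phase, attrs) = line.rstrip('\n').split('\t')
    # start and end become integers; score, strand and phase stay as text
    return (seqid, source, ftype, int(start), int(end),
            score, strand, phase, parse_gff3_attributes(attrs))


# hypothetical feature line, modelled on the first row of the table above
line = ('apidb|MAL1\tApiDB\tsupercontig\t1\t643292\t.\t+\t.\t'
        'localization=nuclear;size=643292\n')
row = parse_gff3_line(line)
record = dict(zip(GFF3_FIELDS, row))
```

Here `record['attributes']` is an ordinary dictionary, so individual attributes can be looked up by key.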

A region query string of the form ‘[seqid]’ or ‘[seqid]:[start]-[end]’ may be given as the region argument. In that case the GFF3 file must be position sorted, bgzipped and tabix indexed, and pysam must be installed. E.g.:

>>> # extract from a specific genome region via tabix
... table2 = etl.fromgff3('fixture/sample.sorted.gff.gz',
...                       region='apidb|MAL5:1289593-1289595')
>>> table2.look(truncate=30)
+--------------+---------+---------------+---------+---------+-------+--------+-------+--------------------------------+
| seqid        | source  | type          | start   | end     | score | strand | phase | attributes                     |
+==============+=========+===============+=========+=========+=======+========+=======+================================+
| 'apidb|MAL5' | 'ApiDB' | 'supercontig' |       1 | 1343552 | '.'   | '+'    | '.'   | {'localization': 'nuclear', 'o |
+--------------+---------+---------------+---------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL5' | 'ApiDB' | 'exon'        | 1289594 | 1291685 | '.'   | '+'    | '.'   | {'size': '2092', 'Parent': 'ap |
+--------------+---------+---------------+---------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL5' | 'ApiDB' | 'gene'        | 1289594 | 1291685 | '.'   | '+'    | '.'   | {'ID': 'apidb|MAL5_18S', 'web_ |
+--------------+---------+---------------+---------+---------+-------+--------+-------+--------------------------------+
| 'apidb|MAL5' | 'ApiDB' | 'rRNA'        | 1289594 | 1291685 | '.'   | '+'    | '.'   | {'ID': 'apidb|rna_MAL5_18S-1', |
+--------------+---------+---------------+---------+---------+-------+--------+-------+--------------------------------+
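
To make the two region forms concrete, here is a small sketch that splits a region string into its parts. The `parse_region` helper is hypothetical and is not how petlx or tabix parse regions internally; it splits on the last colon so that sequence identifiers containing other punctuation, such as `apidb|MAL5`, pass through unchanged.

```python
def parse_region(region):
    """Split 'seqid' or 'seqid:start-end' into (seqid, start, end)."""
    seqid, sep, span = region.rpartition(':')
    if not sep:
        # no colon at all: the whole string is the sequence identifier
        return region, None, None
    start, dash, end = span.partition('-')
    if not (dash and start.isdigit() and end.isdigit()):
        # no numeric range after the last colon: treat it all as a seqid
        return region, None, None
    return seqid, int(start), int(end)


whole = parse_region('apidb|MAL5')
window = parse_region('apidb|MAL5:1289593-1289595')
```

With the region used above, `window` is `('apidb|MAL5', 1289593, 1289595)`, while `whole` carries `None` for both coordinates.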

Tabix (pysam)

Note

The pysam package is required, e.g.:

$ pip install pysam

petlx.bio.tabix.fromtabix(filename, reference=None, start=None, stop=None, region=None, header=None)

Extract rows from a tabix indexed file, e.g.:

>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = etl.fromtabix('fixture/test.bed.gz',
...                        region='Pf3D7_02_v3')
>>> table1
+---------------+----------+----------+-----------------------------+
| #chrom        | start    | end      | region                      |
+===============+==========+==========+=============================+
| 'Pf3D7_02_v3' | '0'      | '23100'  | 'SubtelomericRepeat'        |
+---------------+----------+----------+-----------------------------+
| 'Pf3D7_02_v3' | '23100'  | '105800' | 'SubtelomericHypervariable' |
+---------------+----------+----------+-----------------------------+
| 'Pf3D7_02_v3' | '105800' | '447300' | 'Core'                      |
+---------------+----------+----------+-----------------------------+
| 'Pf3D7_02_v3' | '447300' | '450450' | 'Centromere'                |
+---------------+----------+----------+-----------------------------+
| 'Pf3D7_02_v3' | '450450' | '862500' | 'Core'                      |
+---------------+----------+----------+-----------------------------+
...

>>> table2 = etl.fromtabix('fixture/test.bed.gz',
...                        region='Pf3D7_02_v3:110000-120000')
>>> table2
+---------------+----------+----------+--------+
| #chrom        | start    | end      | region |
+===============+==========+==========+========+
| 'Pf3D7_02_v3' | '105800' | '447300' | 'Core' |
+---------------+----------+----------+--------+
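
The second query returns the single row whose interval overlaps the requested window. The sketch below reproduces that selection in plain Python over the rows shown in the first example. It assumes the usual conventions, namely that BED intervals are 0-based and half-open while region strings are 1-based and inclusive; the `overlaps` helper is illustrative and does not use the tabix index.

```python
def overlaps(bed_start, bed_end, query_start, query_end):
    """True if the 0-based half-open interval [bed_start, bed_end)
    overlaps the 1-based inclusive range [query_start, query_end]."""
    return bed_start < query_end and bed_end >= query_start


# rows copied from the first example; fromtabix yields the values as strings
rows = [
    ('Pf3D7_02_v3', '0', '23100', 'SubtelomericRepeat'),
    ('Pf3D7_02_v3', '23100', '105800', 'SubtelomericHypervariable'),
    ('Pf3D7_02_v3', '105800', '447300', 'Core'),
    ('Pf3D7_02_v3', '447300', '450450', 'Centromere'),
    ('Pf3D7_02_v3', '450450', '862500', 'Core'),
]

# same window as region='Pf3D7_02_v3:110000-120000'
hits = [row for row in rows
        if row[0] == 'Pf3D7_02_v3'
        and overlaps(int(row[1]), int(row[2]), 110000, 120000)]
```

A linear scan like this reads every row; the point of the tabix index is to avoid that on large files.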

Variant call format (PyVCF)

Note

The pyvcf package is required, e.g.:

$ pip install pyvcf

petlx.bio.vcf.fromvcf(filename, chrom=None, start=None, stop=None, samples=True)

Returns a table providing access to data from a variant call file (VCF). E.g.:

>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = etl.fromvcf('fixture/sample.vcf')
>>> table1.look(truncate=20)
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
| CHROM | POS     | ID          | REF | ALT    | QUAL | FILTER  | INFO                 | NA00001              | NA00002              | NA00003              |
+=======+=========+=============+=====+========+======+=========+======================+======================+======================+======================+
| '19'  |     111 | None        | 'A' | [C]    |  9.6 | None    | {}                   | Call(sample=NA00001, | Call(sample=NA00002, | Call(sample=NA00003, |
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
| '19'  |     112 | None        | 'A' | [G]    |   10 | None    | {}                   | Call(sample=NA00001, | Call(sample=NA00002, | Call(sample=NA00003, |
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
| '20'  |   14370 | 'rs6054257' | 'G' | [A]    |   29 | []      | {'DP': 14, 'H2': Tru | Call(sample=NA00001, | Call(sample=NA00002, | Call(sample=NA00003, |
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
| '20'  |   17330 | None        | 'T' | [A]    |    3 | ['q10'] | {'DP': 11, 'NS': 3,  | Call(sample=NA00001, | Call(sample=NA00002, | Call(sample=NA00003, |
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
| '20'  | 1110696 | 'rs6040355' | 'A' | [G, T] |   67 | []      | {'DP': 10, 'AA': 'T' | Call(sample=NA00001, | Call(sample=NA00002, | Call(sample=NA00003, |
+-------+---------+-------------+-----+--------+------+---------+----------------------+----------------------+----------------------+----------------------+
...

petlx.bio.vcf.vcfunpackinfo(table, *keys)

Unpack the INFO field into separate fields. E.g.:

>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = (
...     etl
...     .fromvcf('fixture/sample.vcf', samples=None)
...     .vcfunpackinfo()
... )
>>> table1
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
| CHROM | POS     | ID          | REF | ALT    | QUAL | FILTER  | AA   | AC   | AF             | AN   | DB   | DP   | H2   | NS   |
+=======+=========+=============+=====+========+======+=========+======+======+================+======+======+======+======+======+
| '19'  |     111 | None        | 'A' | [C]    |  9.6 | None    | None | None | None           | None | None | None | None | None |
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
| '19'  |     112 | None        | 'A' | [G]    |   10 | None    | None | None | None           | None | None | None | None | None |
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
| '20'  |   14370 | 'rs6054257' | 'G' | [A]    |   29 | []      | None | None | [0.5]          | None | True |   14 | True |    3 |
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
| '20'  |   17330 | None        | 'T' | [A]    |    3 | ['q10'] | None | None | [0.017]        | None | None |   11 | None |    3 |
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
| '20'  | 1110696 | 'rs6040355' | 'A' | [G, T] |   67 | []      | 'T'  | None | [0.333, 0.667] | None | True |   10 | None |    2 |
+-------+---------+-------------+-----+--------+------+---------+------+------+----------------+------+------+------+------+------+
...
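
The effect of unpacking can be sketched on a plain list of tuples, with each INFO dictionary spread over one column per key and `None` filling the gaps. This is an illustration of the transformation only, not the petlx code. In the table above, AC and AN appear as columns although every row shown holds `None`, so the library evidently does not take its keys from these rows alone; the hypothetical `unpack_info` below accepts explicit keys and otherwise falls back to the keys present in the data.

```python
def unpack_info(table, keys=None):
    """Return a new table with the INFO dict spread over one column per key."""
    header, *rows = table
    info_index = header.index('INFO')
    if keys is None:
        # fallback (an assumption of this sketch): sorted union of seen keys
        keys = sorted({key for row in rows for key in row[info_index]})
    out = [tuple(header[:info_index]) + tuple(keys)
           + tuple(header[info_index + 1:])]
    for row in rows:
        info = row[info_index]
        values = tuple(info.get(key) for key in keys)  # None when absent
        out.append(tuple(row[:info_index]) + values
                   + tuple(row[info_index + 1:]))
    return out


# two variants from the example; the INFO values for 20:14370 are
# read off the unpacked table above
table = [
    ('CHROM', 'POS', 'INFO'),
    ('19', 111, {}),
    ('20', 14370, {'NS': 3, 'DP': 14, 'AF': [0.5], 'DB': True, 'H2': True}),
]

unpacked = unpack_info(table)
subset = unpack_info(table, keys=['DP', 'AA'])
```

Passing `keys` explicitly fixes both the selection and the order of the new columns, which keeps the layout stable across files.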

petlx.bio.vcf.vcfmeltsamples(table, *samples)

Melt the sample columns into a pair of SAMPLE and CALL fields, giving one row per variant per sample. E.g.:

>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = (
...     etl
...     .fromvcf('fixture/sample.vcf')
...     .vcfmeltsamples()
... )
>>> table1
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
| CHROM | POS | ID   | REF | ALT | QUAL | FILTER | INFO | SAMPLE    | CALL                                                |
+=======+=====+======+=====+=====+======+========+======+===========+=====================================================+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00001' | Call(sample=NA00001, CallData(GT=0|0, HQ=[10, 10])) |
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00002' | Call(sample=NA00002, CallData(GT=0|0, HQ=[10, 10])) |
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00003' | Call(sample=NA00003, CallData(GT=0/1, HQ=[3, 3]))   |
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
| '19'  | 112 | None | 'A' | [G] |   10 | None   | {}   | 'NA00001' | Call(sample=NA00001, CallData(GT=0|0, HQ=[10, 10])) |
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
| '19'  | 112 | None | 'A' | [G] |   10 | None   | {}   | 'NA00002' | Call(sample=NA00002, CallData(GT=0|0, HQ=[10, 10])) |
+-------+-----+------+-----+-----+------+--------+------+-----------+-----------------------------------------------------+
...
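
Melting turns a wide table with one column per sample into a long table with one row per variant and sample. Here is a minimal sketch of that reshaping, assuming a list-of-tuples table and using plain dictionaries in place of PyVCF Call objects; `melt_samples` is a hypothetical helper, not the library function.

```python
def melt_samples(table, samples):
    """Turn one column per sample into (SAMPLE, CALL) rows."""
    header, *rows = table
    # columns that are not sample columns are repeated on every output row
    fixed = [i for i, name in enumerate(header) if name not in samples]
    out = [tuple(header[i] for i in fixed) + ('SAMPLE', 'CALL')]
    for row in rows:
        for sample in samples:
            call = row[header.index(sample)]
            out.append(tuple(row[i] for i in fixed) + (sample, call))
    return out


# the first variant of the example, with call data as plain dicts
wide = [
    ('CHROM', 'POS', 'NA00001', 'NA00002', 'NA00003'),
    ('19', 111,
     {'GT': '0|0', 'HQ': [10, 10]},
     {'GT': '0|0', 'HQ': [10, 10]},
     {'GT': '0/1', 'HQ': [3, 3]}),
]

melted = melt_samples(wide, ['NA00001', 'NA00002', 'NA00003'])
```

The long layout is convenient for per-sample filtering, since each call sits in a single CALL column.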

petlx.bio.vcf.vcfunpackcall(table, *keys)

Unpack the CALL field into separate fields, one per call data key (DP, GQ, GT and HQ in the example below). E.g.:

>>> import petl as etl
>>> # activate bio extensions
... import petlx.bio
>>> table1 = (
...     etl
...     .fromvcf('fixture/sample.vcf')
...     .vcfmeltsamples()
...     .vcfunpackcall()
... )
>>> table1
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
| CHROM | POS | ID   | REF | ALT | QUAL | FILTER | INFO | SAMPLE    | DP   | GQ   | GT    | HQ       |
+=======+=====+======+=====+=====+======+========+======+===========+======+======+=======+==========+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00001' | None | None | '0|0' | [10, 10] |
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00002' | None | None | '0|0' | [10, 10] |
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
| '19'  | 111 | None | 'A' | [C] |  9.6 | None   | {}   | 'NA00003' | None | None | '0/1' | [3, 3]   |
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
| '19'  | 112 | None | 'A' | [G] |   10 | None   | {}   | 'NA00001' | None | None | '0|0' | [10, 10] |
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
| '19'  | 112 | None | 'A' | [G] |   10 | None   | {}   | 'NA00002' | None | None | '0|0' | [10, 10] |
+-------+-----+------+-----+-----+------+--------+------+-----------+------+------+-------+----------+
...
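
In the output above, DP and GQ appear as columns even though the calls shown carry only GT and HQ, so the set of keys is not taken from these rows alone. The sketch below therefore takes the keys as a required argument. It is an illustration on plain dictionaries rather than PyVCF Call objects, and `unpack_call` is a hypothetical name.

```python
def unpack_call(table, keys):
    """Replace the trailing CALL column with one column per call data key."""
    header, *rows = table
    if header[-1] != 'CALL':
        raise ValueError('expected CALL as the last column')
    out = [tuple(header[:-1]) + tuple(keys)]
    for row in rows:
        call = row[-1]
        # keys missing from a call come out as None
        out.append(tuple(row[:-1]) + tuple(call.get(key) for key in keys))
    return out


# two melted rows from the example, with call data as plain dicts
melted_rows = [
    ('CHROM', 'POS', 'SAMPLE', 'CALL'),
    ('19', 111, 'NA00001', {'GT': '0|0', 'HQ': [10, 10]}),
    ('19', 111, 'NA00003', {'GT': '0/1', 'HQ': [3, 3]}),
]

calls = unpack_call(melted_rows, ['DP', 'GQ', 'GT', 'HQ'])
```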