Transcriptome

class methods

class GRATIOSA.Transcriptome.Transcriptome(gen)

The Transcriptome class is the key class for loading expression data and differential expression data for analysis and comparison. Typical data are data obtained with high throughput methods such as RNASeq and processed with common tools such as DeSeq2.

Each Transcriptome instance has to be initialized with a genome object

>>> gen = Genome.Genome("dickeya")
>>> tr = Transcriptome.Transcriptome(gen)

load_expression()

loads gene-wise expression data (such as log2rpkm) from files that are in the /expression/ directory.

Creates:

self.genes[locus].expression (dict.): new attribute of Gene: instances related to the Transcriptome instance given as argument. Dictionary of shape {condition: expression level (float.)}
self.genes_valid_expr(dict.): Dictionary of shape: {condition: list of genes having an expression value}

Note

The data importation requires an expression.info file, containing column indices of each information in the data file and some additional information, in the following order: [0] Condition [1] Filename [2] Locus_tag column [3] Expression column [4] is Log ? (boolean) [5] Separator In the case of a multi-chromosome genome, the main genome object as well as each single-chromosome genome objects get the “genes” attribute. Each gene is primarily assigned to the genome object of their chromosome/plasmid.

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method with the default “sequence.gff3” file is computed. To use another annotation, please load an annotation to your Transcriptome instance with the following commands before using this method: >>> from GRATIOSA import Genome, Transcriptome >>> tr = Transcriptome.Transcriptome(“ecoli”) >>> g = Genome.Genome(tr.name) >>> g.load_annotation(annot_file=chosen_file) >>> tr.genes = g.genes

Example

>>> from GRATIOSA import Transcriptome
>>> tr = Transcriptome.Transcriptome("dickeya")
>>> tr.load_expression()
>>> tr.genes["Dda3937_00004"].expression
{'WT_nov_stat(E10)_rpkm': 66.6413789332929,
'WT_stat(E3)': 6.980245227157392,
'WT_PGA_stat(E13)_rpkm': 13.9428053948966}

compute_fc_from_expr(ctrls, conds, condname)

Compute_fc_from_expr computes, for each gene:

the mean of expression values (such as log2rpkm) in a list of controls replicates (ctrls)
the mean of expression values in a list of tests replicates (conds)
the fold-change ie the difference between the mean control and the mean test values
the pvalue of the Student-test for these two means

Creates:

self.genes[locus].fc_pval[condname] (tuple)
new attribute of Gene instances related to the Transcriptome instance given as argument.Tuple of shape (fold-change,pvalue).
self.genes_valid_fc[condname] (list.)
list of genes having a fold-change and a pvalue corresponding to the new condition

Parameters:

conds (list of str.) – list of expression conditions to use as test
ctrls (list of str.) – list of expression conditions to use as control
condname (str.) – condition name to use for the new fold-changes and p-values

Note

The t-test is computed with scipy.stats.ttest_ind

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method with the default “sequence.gff3” file is computed. To use another annotation, please load an annotation to your Transcriptome instance with the following commands before using this method: >>> from GRATIOSA import Genome, Transcriptome >>> tr = Transcriptome.Transcriptome(“ecoli”) >>> g = Genome.Genome(tr.name) >>> g.load_annotation(annot_file=chosen_file) >>> tr.genes = g.genes

Example

>>> from GRATIOSA import Transcriptome
>>> tr = Transcriptome.Transcriptome("ecoli")
>>> tr.compute_fc_from_expr(["Ctrl1","Ctrl2"],
                            ["Test1","Test2"],
                            ["FC_Test12_Ctrl12"])
>>> tr.genes["b0002"].fc_pval["FC_Test12_Ctrl12"]
(-1.786, 0.003125)

load_fc_pval()

loads Fold-changes and p-values (if available) from data files that are in the /fold_changes/ directory.

Creates:

self.genes[locus].fc_pval (dict.): new attribute of Gene instances
related to the Transcriptome instance given as argument. Dictionary of shape {condition: (fold-change,pvalue)}. If no p-value is available in the data file, 0 is set as p-value.
self.genes_valid_fc (dict.): Dictionary of shape
{condition: list of genes having at least a fold-change}

Note

The data importation requires a fc.info file, containing column indices of each information in the data file and some additional information, in the following order: [0] Condition [1] Filename [2] Locus_tag column [3] FC column [4] Separator [5] Startline [6] P-value column

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method with the default “sequence.gff3” file is computed. To use another annotation, please load an annotation to your Transcriptome instance with the following commands before using this method: >>> from GRATIOSA import Genome, Transcriptome >>> tr = Transcriptome.Transcriptome(“ecoli”) >>> g = Genome.Genome(tr.name) >>> g.load_annotation(annot_file=chosen_file) >>> tr.genes = g.genes

Example

>>> from GRATIOSA import Transcriptome
>>> tr = Transcriptome.Transcriptome("ecoli")
>>> tr.load_fc_pval()
>>> tr.genes["b0002"].fc_pval
{'WT': (-1.786, 0.003125),
 'osmotic': (-1.81603942323042, 0.0),
 'heat': (1.62203513077067, 0.0)}

compute_state_from_fc(thresh_pval=0.05, thresh_fc=0)

Loads Fold-changes and p-values data and computes genes state from these data. If no p-values is available in the data file, the default value of 0 is assigned to each gene’s pvalue.

A gene is considered:

activated if its FC is above the FC threshold given as argument and its pvalue is below the pvalue threshold given as argument ie. FC > thresh_FC and pval < thresh_pval
repressed if its FC is below the opposite of the FC threshold and its pvalue is below the pvalue threshold ie. FC < -thresh_FC and pval < thresh_pval
not affected either if its pvalue is above the threshold, or if its FC is between the - thresh_FC and + thresh_FC

Creates:

self.statesFC (dict. of dict.)
New attribute of the Transcriptome instance. Dictionary containing one subdictionary per condition listed in fc.info. Each subdictionary contains the list of genes corresponding to each state (‘act’, ‘rep’, ‘non’ or ‘null’).

self.genes[locus].fc_pval (dict.)
new attribute of Gene instances related to the Transcriptome instance given as argument. Dictionary of shape {condition: (FC,pvalue)}.

self.genes[locus].state (dict.)
new attribute of Gene instances related to the Transcriptome instance given as argument. Dictionary of shape {condition: state} with state either * act if the gene is activated * rep if the gene is repressed * non if the gene is not affected * null if the gene is not present in the data

Parameters:

thresh_pval (Float) – pvalue threshold used for the genes classification
thresh_fc (Float) – fold-change threshold used for the genes classification

Note

The FC and pvalues data importation requires an FC.info file, in the fold_changes directory, containing column indices of each information in the data file, in the following order: [0] Condition [1] Filename [2] Locus_tag column [3] FC columns [4] Separator [5] File start line [6] P-value column See load_fc_pval for more details about the data importation.

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method with the default “sequence.gff3” file is computed. To use another annotation, please load an annotation to your Transcriptome instance with the following commands before using this method: >>> from GRATIOSA import Transcriptome, Genome >>> tr = Transcriptome.Transcriptome(“ecoli”) >>> g = Genome.Genome(tr.name) >>> g.load_annotation(annot_file=chosen_file) >>> tr.genes = g.genes

Example

>>> from GRATIOSA import Transcriptome
>>> tr = Transcriptome.Transcriptome("ecoli")
>>> tr.compute_state_from_fc()
>>> tr.genes['b0001'].state
{'osmotic': 'rep', 'acidic_1mn': 'act'}
>>> tr.genes['b0001'].fc_pval
{'osmotic': (-1.73717046009437, 0.0), 'acidic_1mn': (1.73, 0.0)}
>>> tr.statesFC['osmotic']['rep']
['b0001', 'b0002', 'b0003', 'b0004',...]

load_rnaseq_cov(cond='all', compute_from_bam=False, rev=False)

test load_rnaseq_cov loads a RNASeq coverage to a Transcriptome instance either from:

.npz files (described in a cov.info file and computed when new data are loaded for the first time) which are in the /rnaseq_cov/ directory,

coverage files which are in the /rnaseq_cov/ directory and are described in cov_txt.info file, with ONE LINE per genomic position!

.bam reads files which are in the /rnaseq_reads/ directory and are treated with useful_functions_transcriptome.cov_from_reads function.

Creates 2 new attributes of the Transcriptome instance:

self.rnaseq_cov_pos:
dictionary of shape {condition: cov+} with cov+ an array containing one signal data per genomic position for the + strand (forward coverage)

self.rnaseq_cov_neg:
idem for the - strand (reverse coverage)

Note: this function is only designed for single-chromosome genomes.

Parameters:

cond (Optional [list of str.]) – selection of one or several conditions (1 condition corresponds to 1 data file). By default: cond =’all’ ie all available coverages are loaded.
compute_from_bam (Boolean.) – if True, computes, using useful_functions_transcriptome.cov_from_reads, coverage from .bam reads files that are, with a bam_files.info file, in the /rnaseq_reads/ directory.
rev (Boolean) – reverses the + and - strands when computing the coverage. This applies to stranded sequencing on the opposite strand. Only applied when compute_from_bam is True.

Note

To use directly new coverages data, both forward and reverse coverage data files are required. These files contain the coverage for each genomic position (one line = one genomic position, no position can be ommited), therefore if you use bedtools genomecov, please use the -d option. The importation of these new data requires a cov_txt.info TSV file, in the /rnaseq_cov/ directory, containing the following information: [0] Condition [1] Reverse coverage filename [2] Forward coverage filename [3] Startline (has to be the same for both coverage files) [4] Column containing the coverage data [5] Column containing the chromosome name (default 0), must match the annotation

Coverages from the reverse file will be added to the Transcriptome instance as “rnaseq_cov_neg” attribute and coverages from the forward file will be the “rnaseq_cov_pos”attribute.

Note

To compute coverages data from reads files, the data files and a bam_files.info file have to be in the /rnaseq_reads/ directory. The bam_files.info files contains: [0] Condition [1] Reads filename

Note

During the first importation of new coverages data, a .npy file containing the data is created and will enable a faster data loading for the next data importation. A cov.info file containing information for this fast importation is also in the /rnaseq_cov/ directory and contains in the following columns: [0] Condition [1] Coverage filename This cov.info file automatically completes itself when the data are loaded for the first time. WARNING: for multi-chromosome species, the npz files contain arrays with chromosome coordinates stacked behind each other!

Example

>>> from GRATIOSA import Transcriptome
>>> tr = Transcriptome.Transcriptome("ecoli")
# To load only the coverage for one condition named "WT"
# in the cov.info or cov_txt.info file
>>> tr.load_rnaseq_cov(["WT"])
>>> tr.rnaseq_cov_neg["WT"]
array([0., 0., 0., ..., 0., 0., 0.])
>>> tr.rnaseq_cov_pos["WT"]
array([0., 0., 0., ..., 0., 0., 0.])
# To compute the coverage from a new .bam files obtained from
# paired-end data for example with: tophat2 genome/MG1655
# Test_1.fastq.gz Test_2.fastq.gz
>>> tr = Transcriptome.Transcriptome("ecoli")
>>> tr.load_rnaseq_cov(compute_from_bam=True)
>>> tr.rnaseq_cov_pos["Test"]
array([10., 10., 10., ..., 0., 0., 0.])
>>> tr.rnaseq_cov_pos["Test"]
array([0., 0., 0., ..., 20., 20., 20.])

load_cov_start_end(compute_from_bam=False)

Useful to locate TSS and TTS, load_cov_start_end loads the density of RNA fragment starts and ends from .npz files described in cov_start_stop.info, in the /rnaseq_cov/ directory. .npz files and .info files are obtained using the useful_functions_transcriptome.cov_start_stop_from_reads

Creates 2 new attributes of the Transcriptome instance:

self.cov_start (Dict. of dict.)
dictionary containing one subdictionary per condition. Each subdictionary has 2 keys: “0” for the start positions on the - strand and “1” for the start positions on the + strand and the subdictionary values are arrays containing all start positions on the corresponding strand
self.cov_end (Dict. of dict.)
idem for the end positions

Parameters:: compute_from_bam (Boolean) – if True, computes, using useful_functions_transcriptome.cov_start_stop_from_reads, coverage from .bam reads files that are, with a bam_files.info file, in the /rnaseq_reads/ directory.

Note

To compute new coverages data from .bam reads files, the data files and a bam_files.info file have to be in the /rnaseq_reads/ directory and the input argument “compute_from_bam” has to be set to “True”.The bam_files.info files contains: [0] Condition [1] Reads filename During the importation of these new data, a corresponding .npz files is created and the cov_start_stop.info file automatically completes itself. For the next importations of these data, only the npz. and the cov_start_stop.info will be loaded, thus allowing faster loading.

Warning

This function is only implemented for single contig compute_from_bam works with paired-end or single-end .bam files, but the first read of each file is used to determine if PE or SE. i.e., if you are using PE sequencing, please ensure that the first read of each file is paired, through prior filtering of unpaired reads. For single-end reads, the coverage is computed from sequenced reads, without extension of fragments (i.e., it is the read rather than fragment coverage).

Example

>>> from GRATIOSA import Transcriptome
>>> tr = Transcriptome.Transcriptome("ecoli")
>>> tr.load_cov_start_end(compute_from_bam=True)
>>> tr.cov_end["WT_paired"]
{0: array([0, 0, 0, ..., 0, 0, 0]),
 1: array([0, 0, 0, ..., 0, 0, 0])}

compute_rpkm_from_cov(cond='all', before=100)

Compute_rpkm_from_cov method:

loads the RNASeq coverage (using load_rnaseq_cov method)
computes and loads RPKM value on the Gene instances (using Gene.add_single_rpkm )
saves the RPKM data in a new .csv file (1 file for all conditions, with 1 column per condition)
adds the new .csv file informations in the expression.info file in the /expression/ directory

Creates:

self.genes[locus].rpkm (dict.):
new attribute of Gene instances related to the Transcriptome instance given as argument. Dictionary of shape {condition: RPKM value}, containing the RPKM for each computed condition.
log2rpkm_from_cov_{datetime.now()()}.csv: csv file containing the
log2(RPKM) values (1 row per gene, 1 column per condition). If a RPKM is equal to 0, log2RPKM is set to 1. This file is saved in the /expression/ directory and is automatically added in the expression.info file.

Parameters:

cond (list of str.) – selection of one or several RNASeq coverage condition By default: cond =’all’ ie all available coverages are converted. All selected conditions have to be listed in cov.info file. If one of the chosen condition is already registered in the expression.info file in the expression/ directory, this condition is ignored.
before (int.) – number of bps before the gene start to take into account

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method with the default “sequence.gff3” file is computed. To use another annotation, please load an annotation to your Transcriptome instance with the following commands before using this method: >>> from GRATIOSA import Genome, Transcriptome >>> tr = Transcriptome.Transcriptome(“ecoli”) >>> g = Genome.Genome(tr.name) >>> g.load_annotation(annot_file=chosen_file) >>> .tr.genes = g.genes

Example

>>> from GRATIOSA import Transcriptome
>>> tr = Transcriptome.Transcriptome("ecoli")
>>> tr.compute_log2rpkm_from_cov(cond="WT")
>>> tr.genes["b0002"].rpkm["WT"]
142

useful functions

Functions called by Transcriptome methods

GRATIOSA.useful_functions_transcriptome.add_expression_to_genes(genes_dict, cond, filename, tag_col, expr_col, is_log, separator)

Called by load_expression, adds expression data to Gene objects by parsing a file with one column containing the locus tags of the genes and one containing the expression data.

Creates a new attribute “expression” of the Gene objects contained in the genes_dict given as argument. This attribute is a dictionary of shape {condition: expression level [float.]}

Parameters:

genes_dict (dict.) – dictionary of shape {locus: Gene object}
cond (str.) – condition name
filename (str.) – path to the file containing the expression data
tag_col (int.) – index of the column containing the locus tags in the file
expr_col (int.) – index of the column containing the expression data
is_log (bool.) – True if the data are already in log2, such as log2RPKM. False if the data are RPKM.
separartor (str.) – file separator

Returns:

Dict. of shape {‘locus’:Gene object} with in key the locus: tags associated to an expression data and in value the gene object associated to each of these locus tags.

Return type:

Dictionary

Note

Column numbering starts at 0.

GRATIOSA.useful_functions_transcriptome.load_fc_pval_cond(genes_dict, filename, condition, tag_col, fc_col, separator, start_line, p_val_col=None)

Called by load_fc_pval, allows expression data to be loaded by specifying a file containing the following information (one column for each type of information): locus tags of the genes, log2(Fold-Changes) and p-value.

Parameters:

genes_dict (dict.) – dictionary of shape {locus: Gene object}
filename (str.) – path to the file containing the expression data
cond (str.) – condition name
tag_col (int.) – index of the column containing the locus tags in the file
fc_col (int.) – index of the column containing the log2FC
separartor (str.) – file separator
start_line (int.) – file start line
p_val_col (int.) – index of the column containing the p-values in the file. If no index is indicated (default: None), the p-values will be assigned to 0 for all genes.

Returns:

locus tags of all genes having a valid log2FC

Return type:

List

Note

Column numbering starts at 0.

GRATIOSA.useful_functions_transcriptome.process_bam(tr)

Called by load_cov_start_end and load_rnaseq_cov, process_bam converts .bam files in .npy files containing the fragments. These files will enable a faster data loading for the next data importation with load_cov_start_end or load_rnaseq_cov.

Creates:

For each condition listed in the bam_files.info file, this function creates a npz file containing the coverage data on both DNA strands (Rpos.npy and Rneg.npy). These files are saved in the /rnaseq_reads/ directory
A reads.info file containing the information required to associate each .npz file to a condition name is also created (or completed if this file already exists) in the /rnaseq_reads/ directory. This file contains the following columns: [0] Condition [1] Reads filename [2] BAM filename

Parameters:: tr – Transcriptome instance

Note

REQUIRES the pySam library! The .bam reads files are, with a bam_files.info file, in the /rnaseq_reads/ directory. The bam_files.info file contains the following information: [0] Condition [1] Reads filename for this condition

Warning

The first read of the .bam file is used to determine if paired-end or single-end. For paired-end files, the real fragment coverage is computed. For single-end files, the READ coverage is computed (i.e., without fragment extension). CAUTION for multi-chromosome species: the file is generated with chromosome coordinates stacked behind each other. This is then propagated for the computation of coverage files. So, for a species with one 2-Mb chromosome and one 1-Mb chromosome, the files will have a 3-Mb long array. If the chromosome length or ordering is changed, re-compute all files.

GRATIOSA.useful_functions_transcriptome.cov_from_reads(tr, rev=False)

Called by load_rnaseq_cov, cov_from_reads computes the coverage from reads previously converted to .npz format (containing one .npy per strand: Rpos.npy and Rneg.npy) with the process_bam function.

Creates:

For each condition listed in the reads.info file, this function creates a npz file containing the coverage data on both DNA strands: cov_pos.npy and cov_neg.npy. These files are saved in the /rnaseq_cov/ directory and will enable a faster data loading for the next data importation with load_rnaseq_cov.
A cov.info file containing the information required to associate each .npz file to a condition name is also created (or completed if this file already exists) in the /rnaseq_cov/ directory. This file contains the following columns: [0] Condition [1] Coverage filename

Parameters:

tr – Transcriptome instance
rev (False) – for special cases where the reads were mapped on reverse strand (equivalent to -s reverse in htseq-count). All reads are strand-reversed before computing the coverage. This is never called by the main functions, but can be called “by hand” after deleting the previous coverage files that were on the wrong strand.

Note

The .npz files have to be, with a reads.info file (also created by the process_bam function), in the /rnaseq_reads/ directory. This reads.info file contains at least the following information: [0] Condition [1] Reads filename

CAUTION for multi-chromosome species: the file is generated with chromosome coordinates stacked behind each other. E.g., for a species with one 2-Mb chromosome and one 1-Mb chromosome, the files will have a 3-Mb long array. If the chromosome length or ordering is changed, re-compute all files.

GRATIOSA.useful_functions_transcriptome.cov_start_stop_from_reads(tr)

Called by load_cov_start_end, cov_start_stop_from_reads computes the density of RNA fragment starts and ends from reads previously converted to .npz format (containing one .npy per strand: Rpos.npy and Rneg.npy) with the process_bam function.

Creates:

For each condition listed in the reads.info file, this function creates a npz file containing the density of RNA fragment starts and ends on both DNA strands: cov_start_pos.npy, cov_start_neg.npy, cov_end_pos.npy and cov_end_neg.npy. These files are saved in the /rnaseq_cov/ directory and will enable a faster data loading for the next data importation with the cov_start_stop_from_reads function.
A cov_start_stop.info file containing the information required to associate each .npz file to a condition name is also created (or completed if this file already exists) in the /rnaseq_cov/ directory. This file contains the following columns: [0] Condition [1] Coverage filename

Parameters:: tr – Transcriptome instance

Note

The .npz files have to be, with a reads.info file (also created by process_bam function), in the /rnaseq_reads/ directory. This reads.info file contains at least the following information: [0] Condition [1] Reads filename