Chipseq

class methods

class GRATIOSA.Chipseq.Chipseq(gen)

The Chipseq class is mainly used to collect coverage data (ChIP-seq signals) along a genome. The associated methods and functions allow these signals to be binned, smoothed and averaged. The positions of enrichment peaks can also be loaded as an attribute of this class. Note: this class is only implemented for single-chromosome genomes with an annotation.

Each Chipseq instance has to be initialized with an organism name.

Example

>>> from GRATIOSA import Genome, Chipseq
>>> gen = Genome.Genome("dickeya")
>>> ch = Chipseq.Chipseq(gen)

load_signal(cond='all')

load_signal imports a 1D distribution along the chromosome (typically a ChIP-seq signal) from a data file (typically a .bedGraph file obtained with bamCompare) containing: (1) bin start positions, (2) bin end positions and (3) the signal in each bin.

Creates a new attribute of the Chipseq instance:
  • self.signal (dict.)

    Dictionary of shape {condition: array containing one signal value per genomic position}

Parameters:

cond (list of str.) – Selection of one or several conditions (1 condition corresponds to 1 data file). By default, cond='all', i.e. all available signals are loaded. All selected conditions have to be listed in the signal.info file.

Note

The data importation requires a signal.info file that contains the column positions of each piece of information in the data file, in the following order: [0] condition, [1] filename, [2] separator used in the data file, [3] bin_start, [4] bin_end, [5] signal, [6] chromosome name (def 0, not used for single-chromosome genomes).
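
The conversion from binned rows (start, end, signal) to one value per genomic position can be illustrated with a minimal sketch. This is not the GRATIOSA implementation: the function name expand_bedgraph, its parameters and the default column indices are hypothetical, and the sketch assumes the usual bedGraph convention of 0-based, half-open intervals.

```python
import io
import numpy as np

def expand_bedgraph(handle, genome_length, sep="\t",
                    start_col=1, end_col=2, signal_col=3):
    """Return one signal value per genomic position from binned rows.

    Hypothetical helper: column indices would normally come from signal.info.
    """
    signal = np.zeros(genome_length)
    for line in handle:
        line = line.strip()
        # Skip empty lines and bedGraph header lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split(sep)
        start = int(fields[start_col])
        end = int(fields[end_col])
        # Assumption: intervals are 0-based and half-open, [start, end)
        signal[start:end] = float(fields[signal_col])
    return signal

# Toy data standing in for a .bedGraph file
demo = io.StringIO("chr\t0\t3\t100\nchr\t3\t5\t0\nchr\t5\t8\t10\n")
per_position = expand_bedgraph(demo, genome_length=8)
print(per_position)
```

Positions not covered by any row keep the value 0 in this sketch; whether the library does the same is not stated here.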

Example

>>> from GRATIOSA import Chipseq
>>> ch = Chipseq.Chipseq("ecoli")
>>> ch.load_signal()
>>> ch.signal['Signal_Test']
array([100,   0,   0, ..., 100, 100,   0])

load_binned_signal(binsize, cond='all')

If a file containing the data for the chosen condition and bin size already exists, load_binned_signal loads these data using the load_signal method. Otherwise, load_binned_signal performs the following steps:
  1 - imports a 1D distribution along the chromosome (typically a ChIP-seq signal) using the load_signal method,
  2 - performs the binning at the chosen bin size using the useful_functions_Chipseq.binning function,
  3 - saves the binned data in a file.

Creates or adds items to 2 Chipseq instance attributes:
  • self.signal (dict.)

    Dictionary of shape {condition: array containing one signal value per genomic position (before binning)}

  • self.binned_signal (dict.)

    Dictionary of shape {cond_bin: array containing one binned signal value per genomic position} with cond_bin the condition name merged with the bin size (example: WT_bin200b).

Parameters:
  • binsize (int.) – bin size.

  • cond (list of str.) – Selection of one or several conditions (1 condition corresponds to 1 data file). By default, cond='all', i.e. all available signals are loaded. All selected conditions have to be listed in the signal.info file.

Note

See the Chipseq.load_signal method for the data requirements.
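
As an illustration of the binning step, here is a minimal sketch that replaces each position by the mean of the bin it falls in, so that the result still holds one value per genomic position. It is an assumption about what such a binning could look like, not the code of useful_functions_Chipseq.binning; the function name bin_signal is hypothetical. The dictionary key follows the naming pattern given above (condition name merged with the bin size).

```python
import numpy as np

def bin_signal(signal, binsize):
    """Replace each position by the mean of the bin it falls in (hypothetical helper)."""
    signal = np.asarray(signal, dtype=float)
    binned = np.empty_like(signal)
    for start in range(0, len(signal), binsize):
        # The last bin may be shorter than binsize
        end = min(start + binsize, len(signal))
        binned[start:end] = signal[start:end].mean()
    return binned

raw = np.array([100, 0, 50, 10, 10, 40, 7])
binsize = 3
# Key built as in the documented example "WT_bin200b"
binned_signal = {f"WT_bin{binsize}b": bin_signal(raw, binsize)}
print(binned_signal["WT_bin3b"])
```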

Example

>>> from GRATIOSA import Chipseq
>>> ch = Chipseq.Chipseq("ecoli")
>>> ch.load_binned_signal(binsize=100,cond='Signal_Test')
>>> ch.binned_signal["Signal_Test_bin100b"]
array([100,   100,   100, ..., 10, 10, 10])

load_smoothed_signal(window, cond='all', *args, **kwargs)

If a data file containing the data for the chosen condition and smoothing window already exists, load_smoothed_signal loads these data using the load_signal method. Otherwise, load_smoothed_signal performs the following steps:
  1 - imports a 1D distribution along the chromosome (typically a ChIP-seq signal) using the load_signal method,
  2 - performs the smoothing (moving average with the chosen window size) using the useful_functions_Chipseq.smoothing function,
  3 - saves the smoothed data in a file.

Creates or adds items to 2 Chipseq instance attributes:
  • self.signal (dict.)

    Dictionary of shape {condition: array containing one signal value per genomic position (before smoothing)}

  • self.smoothed_signal (dict.)

    Dictionary of shape {cond_smoo: array containing one smoothed signal value per genomic position} with cond_smoo the condition name merged with the smoothing window size (example: WT_smooth200b).

Parameters:
  • window (int.) – window size. The value of the smoothed signal of a position p is equal to the average of the signal between p - window/2 and p + window/2.

  • cond (list of str.) – Selection of one or several conditions (1 condition corresponds to 1 data file). By default, cond='all', i.e. all available signals are loaded. All selected conditions have to be listed in the signal.info file.

Note

See Chipseq.load_signal method for the data requirements.
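
The moving average described for the window parameter (mean of the signal between p - window/2 and p + window/2) can be sketched as follows. This is only an illustration, not the code of useful_functions_Chipseq.smoothing: the function name smooth_signal is hypothetical, and truncating the window at both ends of the array (rather than wrapping around a circular chromosome) is an assumption.

```python
import numpy as np

def smooth_signal(signal, window):
    """Centered moving average over [p - window//2, p + window//2] (hypothetical helper)."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    half = window // 2
    # Prefix sums: csum[i] is the sum of signal[:i]
    csum = np.concatenate(([0.0], np.cumsum(signal)))
    positions = np.arange(n)
    # Assumption: the window is truncated at the array ends
    lo = np.clip(positions - half, 0, n)
    hi = np.clip(positions + half + 1, 0, n)
    return (csum[hi] - csum[lo]) / (hi - lo)

raw = np.array([0, 0, 6, 0, 0, 0])
smoothed = smooth_signal(raw, window=2)
print(smoothed)
```

Prefix sums keep the cost linear in the genome length, whatever the window size.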

Example

>>> from GRATIOSA import Chipseq
>>> ch = Chipseq.Chipseq("ecoli")
>>> ch.load_smoothed_signal(window=100,cond='Signal_Test')
>>> ch.smoothed_signal["Signal_Test_smooth100b"]
array([100.1,   99.8,   98.8, ..., 10.1, 11.1, 10.8])

load_signals_average(list_cond, average_name, *args, **kwargs)

load_signals_average computes and loads the average of signal replicates. First, the method loads the signals (which must be listed in the signal.info file in the /chipseq/signals/ folder) using the load_signal method. It can then process the data using the load_binned_signal or load_smoothed_signal methods. Finally, the average of these signals is calculated at each genomic position. This average signal is assigned to the Chipseq instance as the signals_average attribute:

  • self.signals_average (dict.):

    Dictionary of shape {average_name: array containing one averaged signal value per genomic position}.

Parameters:
  • list_cond (list of str.) – Selection of conditions that will be averaged. All selected conditions have to be listed in signal.info file.

  • average_name (str.) – Name of the obtained signal

  • data_treatment (Optional [str.] "binning", "smoothing" or None) – Treatment to be applied to the different signals before averaging them (None by default).

  • window (Optional [int.]) – window size, used only if data_treatment = “smoothing”. The value of the smoothed signal of a position p is equal to the average of the signal between p - window/2 and p + window/2.

  • binsize (Optional [int.]) – bin size, used only if data_treatment = “binning”.

Note

See Chipseq.load_signal method for the data requirements
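
The final averaging step amounts to a position-wise mean over the selected replicates. A minimal sketch with toy arrays is given below; the helper name average_signals is hypothetical and the sketch assumes that all replicates cover the same genome and therefore have the same length.

```python
import numpy as np

def average_signals(signals, list_cond):
    """Position-wise mean of the selected conditions (hypothetical helper)."""
    # One row per replicate, one column per genomic position
    stacked = np.vstack([signals[cond] for cond in list_cond])
    return stacked.mean(axis=0)

# Toy replicates standing in for loaded signals
signals = {
    "Signal1": np.array([100.0, 80.0, 10.0]),
    "Signal2": np.array([90.0, 100.0, 20.0]),
}
signals_average = {"Mean_signal": average_signals(signals, ["Signal1", "Signal2"])}
print(signals_average["Mean_signal"])
```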

Example

>>> from GRATIOSA import Chipseq
>>> ch = Chipseq.Chipseq("ecoli")
>>> ch.load_signals_average(list_cond=["Signal1","Signal2"],
...                         average_name="Mean_signal",
...                         data_treatment = "smoothing",
...                         window=500)
>>> ch.signals_average["Mean_signal"]
array([100.1,   99.8,   98.8, ..., 10.1, 11.1, 10.8])

get_all_signals()

get_all_signals groups all loaded signals into a single attribute:

  • self.all_signals (dict.):

    Dictionary of shape {condition: array containing one signal value per genomic position}

load_signal_per_genes(cond='all', window=0)

load_signal_per_genes computes the mean signal for each gene (i.e. the mean signal between the gene start and the gene end). Genomic signals have to be loaded (using the load_signal, load_binned_signal, load_smoothed_signal or load_signals_average methods) before using load_signal_per_genes.

Creates:
  • self.genes[locus].signal (float.):

    New attribute of the Gene instances related to the Chipseq instance. Contains the mean signal of the gene.

  • self.signals_gene (dict. of dict.):

    Dictionary containing one subdictionary per condition given in input. Each subdictionary contains the signal (value) for each gene (locus_tag as key).

Parameters:
  • cond (Optional [list of str.]) – Selection of one or several conditions (1 condition corresponds to 1 data file). By default, cond='all', i.e. all loaded signals are used.

  • window (Optional [int.]) – Number of positions to include on each side of the gene. The mean signal is computed between gene start - window and gene end + window. By default, window = 0.
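
The effect of the window parameter can be illustrated with a minimal sketch that averages a per-position signal over each gene, optionally extended on both sides. It is not the library's code: the helper name mean_signal_per_gene is hypothetical, and the coordinate convention (1-based inclusive positions, with start possibly greater than end for genes on the reverse strand) is an assumption.

```python
import numpy as np

def mean_signal_per_gene(signal, genes, window=0):
    """Mean signal per gene; genes maps locus_tag -> (start, end) (hypothetical helper)."""
    n = len(signal)
    result = {}
    for locus, (start, end) in genes.items():
        # Assumption: start may be greater than end for reverse-strand genes
        left, right = min(start, end), max(start, end)
        # Convert 1-based inclusive coordinates to a 0-based slice, clipped to the genome
        lo = max(left - window - 1, 0)
        hi = min(right + window, n)
        result[locus] = float(np.mean(signal[lo:hi]))
    return result

signal = np.array([0, 0, 4, 4, 4, 4, 0, 0, 0, 0], dtype=float)
genes = {"geneA": (3, 6), "geneB": (9, 8)}
per_gene = mean_signal_per_gene(signal, genes)
per_gene_flank = mean_signal_per_gene(signal, genes, window=2)
print(per_gene, per_gene_flank)
```

With window=2, the flanking positions without signal lower the mean of geneA, while geneB picks up part of the neighbouring signal.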

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method is called with the default “sequence.gff3” file. To use another annotation, please load it into your Genome instance with the following commands before using this method:

>>> from GRATIOSA import Genome, Chipseq
>>> g = Genome.Genome(ch.name)
>>> g.load_annotation(annot_file=chosen_file)
>>> ch = Chipseq.Chipseq(g)

Example

>>> from GRATIOSA import Chipseq
>>> ch = Chipseq.Chipseq(g)
>>> ch.load_signal()
>>> ch.load_signal_per_genes()
>>> ch.signals_gene["WT"]["b0984"]
0.40348639903919425
>>> ch.genes["b0984"].signal
{'WT': 0.40348639903919425,
 'Signal_Test': 0.210505235030067}

load_peaks()

load_peaks imports a list of peaks from a data file (typically a .BED file of peaks obtained with MACS2).

Creates:
  • self.peaks (dict. of dict.)

    New attribute of the Chipseq instance. Dictionary containing one subdictionary per condition with the shape {(start,end):value}. The key of each subdictionary is a tuple containing the start and end positions of one peak. The value can for example be the peak score or the peak height.

Note

The data importation requires a peaks.info file that contains the column indices of each piece of information in the data file and some additional information, in the following order: [0] Condition, [1] Filename, [2] Startline, [3] Separator, [4] StartCol, [5] StopCol, [6] Peak value, [7] Chromosome name (def 0). The peaks.info file and the data files have to be in the /chipseq/peaks/ directory.
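
To show how the peaks.info fields (start line, separator, start, stop and value columns) map onto the {(start,end):value} dictionary, here is a minimal sketch on toy data. The function name read_peaks, its parameters and the default column indices are hypothetical and do not describe the GRATIOSA implementation.

```python
import io

def read_peaks(handle, startline=0, sep="\t",
               start_col=1, stop_col=2, value_col=4):
    """Build {(start, end): value} from a peak table (hypothetical helper)."""
    peaks = {}
    for i, line in enumerate(handle):
        # Skip header lines before startline and empty lines
        if i < startline or not line.strip():
            continue
        fields = line.rstrip("\n").split(sep)
        key = (int(fields[start_col]), int(fields[stop_col]))
        peaks[key] = float(fields[value_col])
    return peaks

# Toy table standing in for a .BED file of peaks, with one header line
demo = io.StringIO(
    "chrom\tstart\tend\tname\tscore\n"
    "chr\t840081\t840400\tpeak_1\t4.89872\n"
    "chr\t919419\t919785\tpeak_2\t5.85158\n"
)
peaks = read_peaks(demo, startline=1)
print(peaks)
```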

Example

>>> from GRATIOSA import Chipseq
>>> ch = Chipseq.Chipseq("ecoli")
>>> ch.load_peaks()
>>> ch.peaks['test']
{(840081, 840400): 4.89872,
(919419, 919785): 5.85158,
(937220, 937483): 4.87632,
...}

useful functions