HiC

class methods

class GRATIOSA.HiC.HiC(gen)

The HiC class is used to gather spatial information obtained from HiC data analysis with tools such as Chromosight. The two patterns analyzed here are borders (between two topological domains) and loops (more specifically the positions of the two regions that come into contact).

Each HiC instance has to be initialized with an organism name.

Example

>>> from GRATIOSA import Genome, HiC
>>> gen = Genome.Genome("dickeya")
>>> hc = HiC.HiC(gen)

load_hic_borders(cond='all')

load_hic_borders imports a list of borders from a data file (typically a borders.tsv file obtained with Chromosight) containing, for each border, at least its position (bin number) and optionally its identification score (Pearson correlation coefficient between the border kernel and the detected pattern), p-value and q-value. Creates 2 new attributes of the HiC instance:

self.borders (dict. of dict.)
dictionary containing one subdictionary per condition with the shape self.borders[cond] = {bin_nb: {"score": score, "pval": pvalue, "qval": qvalue, "binsize": binsize}}. N.B.: If data are binned at 2kb, the bin with number 10 corresponds to data between 20000 and 22000.
self.borders_pos (dict. of dict.)
dictionary containing one subdictionary per condition, with the shape self.borders_pos[cond] = {'borders': list of positions that are in a border, 'no_borders': list of positions that are not in a border}
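
The bin-numbering convention stated in the N.B. can be sketched as follows. `bin_to_interval` is a hypothetical helper written for illustration, not a GRATIOSA function, and treating the end coordinate as exclusive is an assumption.

```python
def bin_to_interval(bin_nb, binsize):
    """Return the (start, end) genomic coordinates covered by a bin.

    Hypothetical helper, not part of GRATIOSA. The end coordinate is
    treated as exclusive, which is an assumption of this sketch.
    """
    start = bin_nb * binsize
    return start, start + binsize


# Bin number 10 with data binned at 2 kb
print(bin_to_interval(10, 2000))  # (20000, 22000)
```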

Parameters:

cond (Optional [list of str.]) – selection of one or several conditions (1 condition corresponds to 1 data file and each condition has to be listed in the borders.info file). By default: cond='all', i.e. all available conditions are loaded.

Note

Importing the data requires a borders.info file that gives the column index of each piece of information in the data file, plus some additional information, in the following order:

(required) [0] Condition, [1] Filename, [2] Startline, [3] Separator, [4] Bin, [5] Binsize (in b)
(optional) [6] Score, [7] Pvalue,[8] Qvalue
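
As an illustration of the field order in the note above, the sketch below maps one line of a borders.info-like file to a dictionary. The helper name, the tab-separated layout of the .info file, and the example file name are all assumptions made for this sketch; only the order of the fields comes from the note.

```python
# Field order taken from the note above. The tab-separated layout of the
# .info file itself is an assumption made for this sketch.
REQUIRED = ["condition", "filename", "startline", "separator", "bin", "binsize"]
OPTIONAL = ["score", "pval", "qval"]
TEXT_FIELDS = ("condition", "filename", "separator")


def parse_borders_info_line(line):
    """Map one line of a borders.info-like file to a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(REQUIRED):
        raise ValueError("a borders.info line needs at least 6 fields")
    entry = dict(zip(REQUIRED + OPTIONAL, fields))
    # Every non-text field is an integer: a line number, a column index
    # or the bin size in bases.
    for key in list(entry):
        if key not in TEXT_FIELDS:
            entry[key] = int(entry[key])
    return entry


# Made-up line: condition "WT", data start at line 1, comma-separated data
# file, bin number in column 0, 2 kb bins, score in column 1.
example = "WT\tborders_WT.tsv\t1\t,\t0\t2000\t1"
print(parse_borders_info_line(example))
```

Optional fields that are absent from the line are simply left out of the resulting dictionary.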

Note

borders.info and data file have to be in the /HiC/Borders/ directory

Note

See Chromosight documentation for more details about the input format.

Warning

Make sure that the version of the genome is the same as the version used in the HiC analysis workflow

Example

>>> from GRATIOSA import HiC
>>> hc = HiC.HiC("dickeya")
>>> hc.load_hic_borders("WT")
>>> hc.borders_pos['WT']['no_borders'][0:5]
[2001,2002,2003,2004,2005]

load_hic_loops(cond='all')

load_hic_loops imports a list of loops from a data file (typically a loops.tsv file obtained with Chromosight) containing, for each loop, at least the loop coordinates (2 bin numbers) and optionally the loop score (Pearson correlation coefficient between the loop kernel and the detected pattern), the p-value and the q-value.

Creates 2 new attributes of the HiC instance:

self.loops (dict. of dict.)
dictionary containing one subdictionary per condition with the shape self.loops[cond] = {(bin1_nb, bin2_nb): {"score": score, "pval": pvalue, "qval": qvalue, "binsize": binsize}}. N.B.: If data are binned at 2kb, the bin with number 10 corresponds to data between 20000 and 22000.

self.loops_pos (dict. of dict.)
dictionary containing one subdictionary per condition, with the shape self.loops_pos[cond] = {'loops': list of positions that are in a loop, 'no_loops': list of positions that are not in a loop}
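
To make the relation between the two attributes concrete, here is a minimal sketch of how a 'loops' / 'no_loops' split could be derived from a dictionary shaped like self.loops[cond]. It is not the GRATIOSA implementation: `split_positions` is a hypothetical name, positions are assumed to be 0-based, and bin n is assumed to cover positions n*binsize to (n+1)*binsize - 1.

```python
def split_positions(loops, genome_length):
    """Split genomic positions into 'loops' and 'no_loops' lists.

    Hypothetical sketch. `loops` has the shape
    {(bin1_nb, bin2_nb): {"binsize": binsize}}; positions are assumed to
    be 0-based and bin n is assumed to start at n * binsize.
    """
    in_loop = set()
    for (bin1, bin2), info in loops.items():
        size = info["binsize"]
        for bin_nb in (bin1, bin2):
            start = bin_nb * size
            # Clip the last bin so that it never runs past the genome end
            in_loop.update(range(start, min(start + size, genome_length)))
    return {
        "loops": sorted(in_loop),
        "no_loops": [p for p in range(genome_length) if p not in in_loop],
    }


# Toy example: one loop between bins 1 and 3, bins of 2 b, genome of 10 b
toy_loops = {(1, 3): {"binsize": 2}}
print(split_positions(toy_loops, 10))
# {'loops': [2, 3, 6, 7], 'no_loops': [0, 1, 4, 5, 8, 9]}
```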

Parameters:

self (HiC instance) –
cond (Optional [list of str.]) – selection of one or several conditions (1 condition corresponds to 1 data file and has to be listed in the loops.info file). By default: cond='all', i.e. all available data are loaded.

Note

Importing the data requires a loops.info file that gives the column index of each piece of information in the data file, plus some additional information, in the following order:

(required) [0] Condition, [1] Filename, [2] Startline,
[3] Separator, [4] Bin1, [5] Bin2, [6] Binsize (in b),
(optional) [7] Score, [8] Pvalue,[9] Qvalue

Note

loops.info and data file have to be in the /HiC/Loops/ directory

Note

See Chromosight documentation for more details about the input format.

Warning

Make sure that the version of the genome is the same as the version used in the HiC analysis workflow

Example

>>> from GRATIOSA import HiC
>>> hc = HiC.HiC("dickeya")
>>> hc.load_hic_loops()
>>> hc.loops_pos['WT']['no_loops'][0:5]
[2001,2002,2003,2004,2005]
>>> hc.loops['WT1']
{(42000, 80000): {'binsize': 2000},
 (238000, 252000): {'binsize': 2000},
 ...}

load_loops_genes(cond='all', window=0)

load_loops_genes imports a list of loops determined with HiC from a data file (typically a loops.tsv file obtained with Chromosight) containing at least the loop coordinates (2 bin numbers) and determines the list of genes overlapping one of the loop positions.

Creates:

self.loops_pos (dict. of dict.): New attribute of the HiC instance. Dictionary containing one dictionary per condition listed in loops.info. Each subdictionary has 2 keys: "loops" and "no_loops" and contains the corresponding lists of genomic positions.
self.loops_genes (dict. of dict.): New attribute of the HiC instance. Dictionary containing one dictionary per condition listed in loops.info. The key of this dictionary contains the condition name and the chosen window size (for example "cond_w100b" for a 100b window). Each subdictionary has 2 keys: "loops" and "no_loops" and contains the corresponding lists of genes. For example, self.loops_genes["cond_w0b"]["loops"] returns the list of genes that overlap the loop positions.
self.genes[locus].is_loop (dict.): New attribute of Gene instances related to the Genome instance given as argument. Dictionary of shape {condition: boolean}. The boolean is True if the gene overlaps the loop positions (window included).

Parameters:

cond (Optional [list of str.]) – selection of one or several conditions (1 condition corresponds to 1 data file and has to be listed in the loops.info file). By default: cond='all', i.e. all available data are loaded.
per_genes (Optional [Bool.]) – if True, determines the list of genes overlapping the loop positions and creates the outputs self.loops_genes and self.genes[locus].is_loop described above.
window (Optional [int.]) – window around the loop positions (i.e. around the 2 bins) used when searching for overlapping genes (Default: 0). All genes overlapping any position in one of the following intervals are considered "loops" genes:
* loops_start_bin1 - window to loops_end_bin1 + window
* loops_start_bin2 - window to loops_end_bin2 + window
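
The window rule for the two loop anchors can be sketched as below. This is an illustration under stated assumptions, not the GRATIOSA implementation: `overlaps_loop` is a hypothetical function, gene coordinates are taken as inclusive, and bin n is assumed to cover positions n*binsize to (n+1)*binsize - 1.

```python
def overlaps_loop(gene_start, gene_end, bin1, bin2, binsize, window=0):
    """Return True if a gene overlaps either loop anchor, window included.

    Hypothetical sketch. Gene coordinates are inclusive with
    gene_start <= gene_end; bin n is assumed to cover
    n * binsize .. (n + 1) * binsize - 1.
    """
    for bin_nb in (bin1, bin2):
        lo = bin_nb * binsize - window
        hi = (bin_nb + 1) * binsize - 1 + window
        # Two inclusive intervals overlap when each starts before the other ends
        if gene_start <= hi and gene_end >= lo:
            return True
    return False


# Loop between bins 10 and 40 with data binned at 2 kb
print(overlaps_loop(19900, 19950, 10, 40, 2000))              # False
print(overlaps_loop(19900, 19950, 10, 40, 2000, window=100))  # True
```

A gene lying just upstream of an anchor is therefore only reported as a "loops" gene once the window is large enough to reach it.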

Note

Importing the data requires a loops.info file that gives the column index of each piece of information in the data file, plus some additional information, in the following order:

(required) [0] Condition, [1] Filename, [2] Startline, [3] Separator, [4] Bin1, [5] Bin2, [6] Binsize (in b),
(optional) [7] Score, [8] Pvalue,[9] Qvalue

Note

loops.info and data file have to be in the /HiC/Loops/ directory

Note

The position (pos) is the bin number. If data are binned at 2kb, the bin with number 10 corresponds to data between 20000 and 22000.

Warning

This method needs a genomic sequence and a genomic annotation. If no annotation is loaded, the load_annotation method is run with the default "sequence.gff3" file. If no sequence is loaded, the load_seq method is run with the default "sequence.fasta" file. To use another annotation or sequence, load them into your HiC instance with the following commands before using this method:

>>> from GRATIOSA import Genome, HiC
>>> HC = HiC.HiC("ecoli")
>>> g = Genome.Genome(HC.name)
>>> g.load_annotation(annot_file=chosen_file)
>>> g.load_seq(filename=chosen_file2)
>>> HC.genes = g.genes
>>> HC.length = g.length

Warning

Make sure that the version of the genome is the same as the version used in the HiC analysis workflow

Example

>>> from GRATIOSA import HiC
>>> hc = HiC.HiC("dickeya")
>>> hc.load_loops_genes()
>>> hc.loops_pos['WT2kb_bin2000b']['loops']
[254000,254001,254002,254003,...]
>>> hc.loops_genes['WT2kb_bin2000b_w0b']['loops']
['Dda3937_00221','Dda3937_04438','Dda3937_03673',...]

load_borders_genes(cond='all', window=0)

load_borders_genes imports a list of borders from a data file (typically a borders.tsv file obtained with Chromosight) containing at least the border position (bin number) and determines the list of genes overlapping one of the borders.

Creates:

self.borders_genes (dict. of dict.)
created only if per_genes = True. New attribute of the HiC instance. Dictionary containing one dictionary per condition listed in borders.info. The key of this dictionary contains the condition name and the chosen window size (for example "cond_w100b" for a 100b window). Each subdictionary has 2 keys: "borders" and "no_borders" and contains the corresponding lists of genes. For example, self.borders_genes["cond_w0b"]["borders"] returns the list of genes that overlap the border positions.

self.borders_pos (dict. of dict.)
new attribute of the HiC instance, dictionary containing one subdictionary per condition, with the shape self.borders_pos[cond] = {'borders': list of positions that are in a border, 'no_borders': list of positions that are not in a border}

self.genes[locus].is_border (dict.)
created only if per_genes = True. New attribute of Gene instances related to the Genome instance given as argument. Dictionary of shape {condition: boolean}. The boolean is True if the gene overlaps the border positions (window included).

Parameters:

cond (Optional [list of str.]) – selection of one or several conditions (1 condition corresponds to 1 data file and has to be listed in the borders.info file). By default: cond='all', i.e. all available data are loaded.
per_genes (Optional [Bool.]) – if True, determines the list of genes overlapping the border positions and creates the outputs self.borders_genes and self.genes[locus].is_border described above.
window (Optional [int.]) – window around the border positions (i.e. around the bin) used when searching for overlapping genes (Default: 0). All genes overlapping any position between border_start_bin - window and border_end_bin + window are considered "borders" genes.

Note

Importing the data requires a borders.info file that gives the column index of each piece of information in the data file, plus some additional information, in the following order:

(required) [0] Condition, [1] Filename, [2] Startline, [3] Separator, [4] Bin, [5] Binsize (in b)
(optional) [6] Score, [7] Pvalue,[8] Qvalue

Note

borders.info and data file have to be in the /HiC/Borders/ directory

Note

The position (pos) is the bin number. If data are binned at 2kb, the bin with number 10 corresponds to data between 20000 and 22000.

Warning

This method needs a genomic sequence and a genomic annotation. If no annotation is loaded, the load_annotation method is run with the default "sequence.gff3" file. If no sequence is loaded, the load_seq method is run with the default "sequence.fasta" file. To use another annotation or sequence, load them into your HiC instance with the following commands before using this method:

>>> from GRATIOSA import Genome, HiC
>>> HC = HiC.HiC("ecoli")
>>> g = Genome.Genome(HC.name)
>>> g.load_annotation(annot_file=chosen_file)
>>> g.load_seq(filename=chosen_file2)
>>> HC.genes = g.genes
>>> HC.length = g.length

Warning

Make sure that the version of the genome is the same as the version used in the HiC analysis workflow

Example

>>> from GRATIOSA import HiC
>>> hc = HiC.HiC("dickeya")
>>> hc.load_borders_genes()
>>> hc.borders_pos["WT2_2kb_bin2000b"]['borders']
[20000,20001,20002,20003,20004,...]
>>> hc.borders_genes['acid1kb_bin1000b_w0b']['borders']
['Dda3937_00158','Dda3937_01107','Dda3937_01108',...]
>>> hc.genes['Dda3937_00221'].is_border
{'acid1kb_bin1000b': False,'WT2_2kb_bin2000b': False}

useful functions

Functions called by HiC methods

GRATIOSA.useful_functions_HiC.load_HiC_site_cond(site_type, path2file, startline, sep, bin1_col, binsize, bin2_col=None, score_col=None, pval_col=None, qval_col=None)

Called by load_hic_borders and load_hic_loops. Loads borders or loops data from a given file, using one column index per type of information to locate it in the file.

Parameters:

site_type (str.) – HiC detected motif type ("borders" or "loops")
path2file (str.) – path to the file containing the data
startline (int.) – file start line
sep (str.) – file separator
bin1_col (int.) – index of the column containing the position of the first bin
binsize (int.) – binsize (in b)
bin2_col (Optional [int.]) – index of the column containing the position of the second bin
score_col (Optional [int.]) – index of the column containing the score
pval_col (Optional [int.]) – index of the column containing the p-value
qval_col (Optional [int.]) – index of the column containing the q-value

Note

Column numbering starts at 0.
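
The column-index approach described by these parameters can be illustrated with the standard csv module. The sketch below is not the body of load_HiC_site_cond: `read_sites` is a hypothetical function, the toy data are made up, and interpreting startline as the 0-based index of the first data line is an assumption.

```python
import csv
import io


def read_sites(handle, startline, sep, bin1_col, binsize,
               bin2_col=None, score_col=None):
    """Read border or loop records from a delimited text stream.

    Hypothetical sketch. `startline` is taken to be the 0-based index of
    the first data line, which is an assumption.
    """
    sites = {}
    for i, row in enumerate(csv.reader(handle, delimiter=sep)):
        if i < startline or not row:
            continue  # skip header lines and empty lines
        key = int(row[bin1_col])
        if bin2_col is not None:
            # Loops are keyed by a pair of bins, borders by a single bin
            key = (key, int(row[bin2_col]))
        sites[key] = {"binsize": binsize}
        if score_col is not None:
            sites[key]["score"] = float(row[score_col])
    return sites


# Made-up loops table with one header line
data = io.StringIO("bin1\tbin2\tscore\n21\t40\t0.61\n119\t126\t0.48\n")
print(read_sites(data, startline=1, sep="\t", bin1_col=0, binsize=2000,
                 bin2_col=1, score_col=2))
# {(21, 40): {'binsize': 2000, 'score': 0.61}, (119, 126): {'binsize': 2000, 'score': 0.48}}
```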