statistical analysis

statistical tests

This module classifies and statistically analyzes (enrichment test, proportion test, Student's t-test) omics and spatial data loaded into GRATIOSA objects, in particular those created with the Genome, Transcriptome, Chipseq and HiC classes.

GRATIOSA.stat_analysis.data_classification(data_x, data_y, class_nb, *args, **kwargs)

Classifies data into fractions according to defined thresholds (thresholds argument), according to class sizes (class_sizes argument), or into classes of equal size (if neither thresholds nor class_sizes is specified).

Parameters:

data_x (list) – list of elements such as positions or gene names
data_y (list) – list of data associated with each element such as signal coverage or gene expression
class_nb (int.) – number of classes (fractions) to create
class_names (Optional [list]) – list of names to give to each class. By default, classes are numbered from 0 to class_nb - 1.
thresholds (Optional [list of ints.]) – thresholds used for the classification. If neither thresholds nor class_sizes is given, the data are distributed into fractions of equal size.
class_sizes (Optional [list of ints.]) – number of elements to put in each class. The total must equal the number of elements in data_x and in data_y.

Returns:

2 dictionaries:

dict. of shape {class_name:list of elements}
dict. of shape {class_name: list of data associated with each element}

Return type:

tuple

Example

>>> from GRATIOSA import stat_analysis
>>> stat_analysis.data_classification(["a","b","d","c","e"],[1,2,4,3,5],
...                                   class_nb=3,thresholds = [1,5])
({0: ['a'], 1: ['b', 'c', 'd'], 2: []}, {0: [1], 1: [2, 3, 4], 2: []})
>>> stat_analysis.data_classification(["a","b","d","c","e"],[1,2,4,3,5],
...                                   class_nb=3,class_sizes=[1,1,3])
({0: ['a'], 1: ['b'], 2: ['c', 'd', 'e']}, {0: [1], 1: [2], 2: [3, 4, 5]})
>>> stat_analysis.data_classification(["a","b","d","c","e","f"],[1,2,4,3,5,6],
...                                   class_nb=3)
({0: ['a', 'b'], 1: ['c', 'd'], 2: ['e', 'f']}, {0: [1, 2], 1: [3, 4], 2: [5, 6]})
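
The equal-size case (neither thresholds nor class_sizes given) can be pictured with the following minimal sketch. It is not the GRATIOSA implementation: the helper name classify_equal_sizes is hypothetical, and the rule for distributing leftover elements when the data do not divide evenly is an assumption.

```python
def classify_equal_sizes(data_x, data_y, class_nb):
    """Split elements into class_nb classes of (nearly) equal size, ordered by data_y."""
    # Sort (value, element) pairs by value so that class 0 holds the lowest values.
    pairs = sorted(zip(data_y, data_x))
    base, extra = divmod(len(pairs), class_nb)
    elements, values = {}, {}
    start = 0
    for k in range(class_nb):
        # Assumption: when the split is uneven, the first classes get one extra element.
        size = base + (1 if k < extra else 0)
        chunk = pairs[start:start + size]
        elements[k] = [x for _, x in chunk]
        values[k] = [y for y, _ in chunk]
        start += size
    return elements, values


classes, data = classify_equal_sizes(["a", "b", "d", "c", "e", "f"],
                                     [1, 2, 4, 3, 5, 6], 3)
print(classes)
print(data)
```

On the six-element input of the last example above, this sketch yields the same two dictionaries.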

GRATIOSA.stat_analysis.proportion_test(dict_cats, dict_features, targ_features, all_features='all', cats='all', alt_hyp='one-sided', output_dir='/home/usr/documents/resdir/proportion_test/', output_file='prop_test2025-01-21 14:20:59.356778')

Compares the proportion targ_features/all_features between categories. The proportion test is a normal (z) test computed with the statsmodels.stats.proportion.proportions_ztest function.

Parameters:

dict_cats (dict.) – classification of each element in a dictionary of shape {category:[elements]}
dict_features (dict.) – feature corresponding to each element in a dictionary of shape {feature:[elements]}
targ_features (list or string) – targeted feature(s)
all_features (Optional [list or str.]) – features that will be used to calculate the proportion, including targ_features (default: all keys of the dict_features dictionary)
cats (Optional [list]) – list of categories to compare (default: all keys of dict_cats)
alt_hyp (Optional ["two-sided" or "one-sided"]) – alternative hypothesis. If “one-sided” is chosen, both one-sided tests will be performed with the statsmodels.stats.proportion.proportions_ztest function and the smaller p-value will be kept. See statsmodels documentation for more details. (default: “one-sided”)
output_dir (Optional [str.]) – output directory
output_file (Optional [str.]) – output filename

Returns:

dict. of shape {“categories”:cats, “proportions”:prop, “confidence intervals”:(ci0,ci1), “p-values”:{(cat1,cat2):pval}}

Return type:

Dictionary

Note

Test results are also saved in the output_file.

Example

>>> from GRATIOSA import stat_analysis
>>> dict_cats = {"borders":["A","B","G","K","L"],
...              "loops":["C","D","F","H","J"],
...              "None":["E","I"]}
>>> data = {"act":["A","B","C","F","H"],
...         "rep":["D","G","L","I"],
...         "None":["J","K"],"NA":["E"]}
>>> res = stat_analysis.proportion_test(dict_cats,data,"act",
...                                     all_features=["act","rep"],
...                                     alt_hyp="two-sided")
>>> res["categories"]
['borders', 'loops', 'None']
>>> res["proportions"]
[0.5, 0.75, 0.0]
>>> res["confidence intervals"]
[(0.010009, 0.98999),(0.32565, 1.0),(0.0, 0.0)]
>>> res["p-values"]
{'borders-loops': 0.4652088184521418,
 'borders-None': 0.3613104285261787,
 'loops-None': 0.17090352023079747}
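
The numbers in this example can be checked by hand with a pooled two-proportion z-test and a normal-approximation confidence interval. The sketch below is a pure-Python illustration under that assumption, not the library code; the helper names two_proportion_ztest and wald_interval are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(count1, nobs1, count2, nobs2):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    pooled = (count1 + count2) / (nobs1 + nobs2)
    # Standard error under the null hypothesis p1 == p2.
    # It is zero when the pooled proportion is 0 or 1; that case is not handled here.
    se = math.sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    return z, math.erfc(abs(z) / math.sqrt(2))


def wald_interval(count, nobs, level=0.95):
    """Normal-approximation confidence interval, clipped to [0, 1]."""
    p = count / nobs
    half = NormalDist().inv_cdf(0.5 + level / 2) * math.sqrt(p * (1 - p) / nobs)
    return max(0.0, p - half), min(1.0, p + half)


# "borders": 2 "act" elements out of 4 "act" or "rep" elements; "loops": 3 out of 4.
z, pval = two_proportion_ztest(2, 4, 3, 4)
low, high = wald_interval(2, 4)
print(round(pval, 4))  # 0.4652
```

With the counts above, the p-value and the interval agree with the 'borders-loops' entry and the first confidence interval of the example.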

GRATIOSA.stat_analysis.enrichment_test(dict_cats, dict_features, targ_features, all_features='all', targ_cats='all', all_cats='all', min_nb_elements=4, output_dir='/home/usr/documents/resdir/enrichment_test/', output_file='enrich_test2025-01-21 14:20:59.356795')

Computes enrichment tests (hypergeometric test) of a target in a category.

Parameters:

dict_cats (dict.) – classification of each element in a dictionary of shape {category:[elements]} N.B.: the same element can be associated with multiple categories
dict_features (dict.) – feature corresponding to each element in a dictionary of shape {feature:[elements]}
targ_features (list or string) – targeted feature(s)
all_features (Optional[list or str.]) – features that will be used to calculate the proportion, including targ_features (default: all keys of the dict_features dictionary)
targ_cats (Optional[list]) – list of categories. The enrichment test is performed for each category. (default: all keys of dict_cats)
all_cats (Optional[list]) – list of categories used to compute the global proportion and the expected number in the selection, including targ_cats. (default: all keys of dict_cats)
min_nb_elements (Optional[int.]) – minimum number of elements required in a category. If a category contains strictly fewer than min_nb_elements elements, its result is not considered relevant and is therefore neither returned nor reported in the output file. (default: 4)
output_dir (Optional[str.]) – output directory
output_file (Optional[str.]) – output filename

Returns:

DataFrame containing the following columns:

’Category’ (str.): category
’Selected_gene_nb’ (int.): number of elements of the category that carry one of the targeted features
’Expected_selected_nb’ (float): number of such elements expected from the global proportion (Total_gene_nb × Global_proportion)
’Total_gene_nb’ (int.): number of elements of the category that carry one of the features listed in all_features
’Proportion’ (float): ratio between Selected_gene_nb and Total_gene_nb
’Prop_conf_int’ (np.array): 95% confidence interval with equal areas around the proportion
’p-value’ (float): p-value obtained with the enrichment test
’Adj p-value (FDR)’ (float): p-value corrected for false discovery rate
’Global_proportion’ (float): proportion of elements carrying a targeted feature among all elements of the all_cats categories

Return type:

DataFrame

Note

This function performs a p-value correction for false discovery rate using statsmodels.stats.multitest.fdrcorrection
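
As an illustration of what this correction does, here is a minimal pure-Python sketch of the Benjamini-Hochberg procedure, which is the default method of fdrcorrection. The helper name benjamini_hochberg is hypothetical; the two input p-values are taken from the example output of this function.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted in increasing order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.16563, 0.90867])
print(adjusted)
```

With two tests, the smaller p-value is doubled and the larger one is unchanged, which matches the 'Adj p-value (FDR)' column of the example.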

Note

The resulting DataFrame, sorted by adjusted p-value, is written to the output_file.

Example

>>> from GRATIOSA import stat_analysis
>>> dict_features = {"act": ["B", "D", "E", "H", "I", "M", "P", "Q", "R", "S", "T", "W"],
...                  "rep": ["C", "F", "G", "U", "X"], "None": ["A"], "NA": ["J"]}
>>> dict_cats = {"GOterm1": ["A", "B", "D", "E", "F", "P", "Q", "R", "S", "T", "U"],
...              "GOterm2": ["C", "E"],
...              "GOterm3": ["A", "B", "F", "G", "H", "I", "M", "U", "V", "W", "X"],
...              "GOterm4": ["C", "F", "G", "J"]}
>>> stat_analysis.enrichment_test(dict_cats,
...                           dict_features,
...                           targ_features=["act", "None"],
...                           all_features=["act", "None", "rep", "NA"],
...                           targ_cats=["GOterm1", "GOterm2", "GOterm3"],
...                           min_nb_elements=3,
...                           output_file="test")
  Category  Selected_gene_nb  Total_gene_nb  Proportion     Prop_conf_int
0  GOterm1                9          11    0.818182  [0.6818, 0.9545]
1  GOterm3                6          10    0.600000        [0.4, 0.8]
  p-value  Adj p-value (FDR)  Global_proportion  Expected_selected_nb
0 0.16563            0.33126        0.684211        7.526316
1 0.90867            0.90867        0.684211        6.84210
# GOterm2 was ignored because it has fewer than 3 elements.
# GOterm4 was ignored because it is not listed in targ_cats.
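
The enrichment p-value is the upper tail of a hypergeometric distribution. The sketch below recomputes it for GOterm1 from the counts in this example: 19 elements carry one of the all_features, 13 of them carry a targeted feature, and GOterm1 holds 11 elements of which 9 are targeted. It is a pure-Python illustration, not the library code, and the helper name hypergeom_enrichment_pvalue is hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(k, n, K, N):
    """P(X >= k) when n elements are drawn without replacement from N, K of which are targets."""
    total = comb(N, n)
    # Sum the probabilities of observing k, k+1, ..., min(n, K) targets in the draw.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# GOterm1: 9 targeted elements among 11, with 13 targeted elements among 19 overall.
p_goterm1 = hypergeom_enrichment_pvalue(9, 11, 13, 19)
# GOterm3: 6 targeted elements among 10.
p_goterm3 = hypergeom_enrichment_pvalue(6, 10, 13, 19)
print(round(p_goterm1, 5), round(p_goterm3, 5))
```

These values agree with the p-values 0.16563 and 0.90867 reported for a global proportion of 13/19 = 0.684211.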

GRATIOSA.stat_analysis.quantitative_data_student_test(dict_data, cats='all', method='student', alt_hyp='one-sided', output_dir='/home/usr/documents/resdir/student_test/', output_file='student_test2025-01-21 14:20:59.356798')

Computes Student's t-test (or the non-parametric Wilcoxon-Mann-Whitney test) to compare the means of independent categories.

Parameters:

dict_data (dict.) – datapoints corresponding to each category in a dictionary of shape {category:list of datapoints}
cats (Optional [list]) – list of categories to compare (default: all keys of dict_data)
method (Optional ["student" or "wilcoxon"]) – type of test to carry out: Student’s t-test (assumes normally distributed data) or Wilcoxon’s rank test. (default: “student”)
alt_hyp (Optional ["two-sided" or "one-sided"]) – alternative hypothesis. If “one-sided” is chosen, both one-sided tests will be performed with the scipy.stats.ttest_ind function and the smaller p-value will be kept. See scipy documentation for more details. (default: “one-sided”)
output_dir (Optional[str.]) – output directory
output_file (Optional[str.]) – output filename

Returns:

dict. of shape {“categories”:cats, “means”:means, “size”: number of values, “confidence intervals”:(ci0,ci1), “p-values”:{(cat1,cat2):pval}}

Return type:

Dictionary

Note

Test results are also reported in the output_file.

Example

>>> from GRATIOSA import stat_analysis
>>> dict_data = {'a':[1,2,5,6,19], 'b':[10,24,4,15]}
>>> stat_analysis.quantitative_data_student_test(dict_data)
{'categories': ['a', 'b'],
 'means': [6.6, 13.25],
 'confidence intervals': [(0.2610995224360444, 12.938900477563955),
                          (4.958672795504613, 21.54132720449539)],
 'p-values': {('a', 'b'): 0.12169488735182109}}
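
The means and confidence intervals in this example are numerically consistent with a normal-approximation interval, mean ± 1.96 × standard error. This is inferred from the printed values, not from the library source. The sketch below reproduces them in pure Python and also computes the pooled-variance t statistic; the helper names are hypothetical. Turning the statistic into a p-value requires the t distribution, for which the function relies on scipy.stats.ttest_ind.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, level=0.95):
    """Mean and normal-approximation confidence interval of a sample."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    m = mean(values)
    # Standard error of the mean, using the sample standard deviation (n - 1).
    sem = stdev(values) / math.sqrt(len(values))
    return m, (m - z * sem, m + z * sem)


def pooled_t_statistic(a, b):
    """Student's t statistic for two independent samples with equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))


sample_a, sample_b = [1, 2, 5, 6, 19], [10, 24, 4, 15]
mean_a, ci_a = mean_confidence_interval(sample_a)
mean_b, ci_b = mean_confidence_interval(sample_b)
t_stat = pooled_t_statistic(sample_a, sample_b)
print(mean_a, ci_a)
print(mean_b, ci_b)
print(t_stat)
```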

graphical representation

This module draws barplots of the statistical analyses performed with the stat_analysis module.

GRATIOSA.plot_stat_analysis.significance(pval)

Converts p-values into star annotations.
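
The thresholds used by significance are not stated here. The sketch below only illustrates the idea with a widely used convention (0.05, 0.01 and 0.001); both the thresholds and the helper name pvalue_to_stars are assumptions, not the library's definition.

```python
def pvalue_to_stars(pval):
    """Map a p-value to a star annotation (assumed conventional thresholds)."""
    if pval <= 0.001:
        return "***"
    if pval <= 0.01:
        return "**"
    if pval <= 0.05:
        return "*"
    # Assumption: non-significant comparisons are labelled "ns".
    return "ns"


for p in (0.0004, 0.008, 0.03, 0.4652):
    print(p, pvalue_to_stars(p))
```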

GRATIOSA.plot_stat_analysis.barplot_annotate_brackets(categories, y_up, dict_pval, *args, **kwargs)

Annotates a barplot with p-values, using the significance function to convert them into star annotations.

Parameters:

categories (list) – list of the plotted categories, in the order of their position on the plot
y_up (list) – list of the maximal y position of each bar (taking into account the confidence intervals)
dict_pval (dict.) – dictionary of shape {“cat1-cat2”:pvalue} with cat1 and cat2 contained in the categories list given as argument
linewidth (Optional [float.]) – Linewidth of the brackets. (default: 1.)
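
To show how the three main arguments relate to one another, the following sketch computes where a bracket could be placed for each pair found in dict_pval. It is a simplified illustration, not the plotting code: the helper name bracket_positions and the gap parameter are hypothetical, and overlapping brackets are not stacked.

```python
def bracket_positions(categories, y_up, dict_pval, gap=0.05):
    """Return (left_index, right_index, height, pvalue) for each annotated pair."""
    brackets = []
    for left, cat1 in enumerate(categories):
        for right in range(left + 1, len(categories)):
            cat2 = categories[right]
            # Keys are "cat1-cat2"; accept either order of the pair.
            pval = dict_pval.get(f"{cat1}-{cat2}", dict_pval.get(f"{cat2}-{cat1}"))
            if pval is None:
                continue
            # The bracket must clear every bar between the two categories.
            height = max(y_up[left:right + 1]) + gap
            brackets.append((left, right, height, pval))
    return brackets


categories = ["borders", "loops", "None"]
y_up = [0.98999, 1.0, 0.0]  # upper bounds of the confidence intervals
dict_pval = {"borders-loops": 0.4652, "borders-None": 0.3613, "loops-None": 0.1709}
brackets = bracket_positions(categories, y_up, dict_pval)
for bracket in brackets:
    print(bracket)
```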

GRATIOSA.plot_stat_analysis.plot_proportion_test(dict_cats, dict_features, targ_features, all_features='all', cats='all', alt_hyp='one-sided', output_dir='/home/usr/documents/resdir/proportion_test/', output_file='prop_test2025-01-21 14:21:03.732290', file_extension='.pdf', xlabel='', ylabel='Proportion', title='', annot_brackets=True, *args, **kwargs)

Barplots of the proportion test comparing targ_features/all_features between categories. The proportion test, based on a normal (z) test, is computed with stat_analysis.proportion_test (see its documentation for more details).

Parameters:

dict_cats (dict.) – classification of each element in a dictionary of shape {category:[elements]} Example: {"border":["GeneA","GeneB"],"None":["GeneC","GeneD"]}
dict_features (dict.) – feature corresponding to each element in a dictionary of shape {feature:[elements]} Example: {"act":["GeneA","GeneC","GeneD"],"rep":["GeneB"]}
targ_features (list or string) – targeted feature(s)
all_features (Optional [list or string]) – features that will be used to calculate the proportion, including targ_features (default: all keys of the dict_features dictionary)
cats (Optional [list]) – list of categories to compare (default: all keys of dict_cats)
alt_hyp (Optional ["two-sided" or "one-sided"]) – alternative hypothesis. If “one-sided” is chosen, both one-sided tests will be performed with the statsmodels.stats.proportion.proportions_ztest function and the smaller p-value will be kept. See statsmodels documentation for more details. (default: “one-sided”)
output_dir (Optional [str.]) – output directory
output_file (Optional [str.]) – output filename for the proportion test data (.txt) and the plot
file_extension (Optional [str.]) – Graphic file extension type (.pdf by default)
xlabel (Optional [str.]) – label for the x-axis (default: empty)
ylabel (Optional [str.]) – label for the y-axis (default: “Proportion”)
title (Optional [str.]) – general title for the figure
annot_brackets (Optional [bool.]) – if True, the barplot will be annotated according to the p-values using star annotations (default: True)
brackets_linewidth (Optional [float.]) – Linewidth of the brackets. (default: 1.)
ymin (Optional [float]) – y-axis bottom limit
ymax (Optional [float]) – y-axis top limit
figsize (Optional [(float,float)]) – width and height in inches (by default: (w,2.2) with w dependent on the number of categories)
xticks_rotation (Optional [int.]) – x-ticks labels rotation in degrees
xticks_labels (Optional [list.]) – x-ticks labels (by default: cats)
err_capsize (Optional [float.]) – Length of the error bar caps in points
bar_linewidth (Optional [float.]) – Width of the bars edge. (default: 1.)
bar_width (Optional [float.]) – Width of the bars. (default dependent on the number of categories. If less than 5 cats: 0.7)

Example

>>> import numpy as np
>>> from GRATIOSA import plot_stat_analysis
>>> dict_cat = {"cat1":np.arange(100,154),"cat2":np.arange(1,100),
...             "cat3":np.arange(154,180),"cat4":np.arange(180,230)}
>>> dict_features ={"act":list(np.arange(1,90))+list(np.arange(100,120))
...                +list(np.arange(154,158))+list(np.arange(180,220)),
...        "rep":list(np.arange(90,94))+list(np.arange(120,150))
...                +list(np.arange(158,177))+list(np.arange(220,225)),
...        "None":list(np.arange(94,100))+list(np.arange(150,154))
...                +list(np.arange(177,180))+list(np.arange(225,230))}
>>> plot_stat_analysis.plot_proportion_test(dict_cat,dict_features,"act",
...                all_features=["act","rep"],alt_hyp="two-sided",output_file="test")

GRATIOSA.plot_stat_analysis.plot_enrichment_test(dict_cats, dict_features, targ_features, all_features='all', targ_cats='all', all_cats='all', min_nb_elements=4, output_dir='/home/usr/documents/resdir/enrichment_test/', output_file='enrich_test2025-01-21 14:21:03.732303', file_extension='.pdf', xlabel='', ylabel='Proportion', title='', legend_text='Global\nproportion', legend_loc='best', annot_star=True, *args, **kwargs)

Barplots of enrichment tests (hypergeometric test) of targeted features in each category. The test is performed with stat_analysis.enrichment_test (see its documentation for more details).

Parameters:

dict_cats (dict.) – classification of each element in a dictionary of shape {category:[elements]} N.B.: the same element can be associated with multiple categories. Example: {"GOterm1":["GeneA","GeneB"],"GOterm2":["GeneA","GeneC"]}
dict_features (dict.) – feature corresponding to each element in a dictionary of shape {feature:[elements]} Example: {"act":["GeneA","GeneC","GeneD"],"rep":["GeneB"]}
targ_features (list or string) – targeted feature(s)
all_features (Optional [list or str.]) – features that will be used to calculate the proportion, including targ_features (default: all keys of the dict_features dictionary)
targ_cats (Optional [list]) – list of categories. The enrichment test is performed for each category. (default: all keys of dict_cats)
all_cats (Optional [list]) – list of categories used to compute the global proportion and the expected number in the selection, including targ_cats. (default: all keys of dict_cats)
min_nb_elements (Optional [int.]) – minimum number of elements required in a category. If a category contains strictly fewer than min_nb_elements elements, its result is not considered relevant and is therefore neither returned nor reported in the output file. (default: 4)
output_dir (Optional [str.]) – output directory
output_file (Optional [str.]) – output filename for the enrichment test data (.txt) and the plot
file_extension (Optional [str.]) – Graphic file extension type (.pdf by default)
xlabel (Optional [str.]) – label for the x-axis (default: empty)
ylabel (Optional [str.]) – label for the y-axis (default: “Proportion”)
title (Optional [str.]) – general title for the figure
annot_star (Optional [bool.]) – if True, the barplot will be annotated according to the p-values using star annotations (default: True)
ymin (Optional [float]) – y-axis bottom limit
ymax (Optional [float]) – y-axis top limit
legend_text (Optional [str.]) – Legend text (default: “Global proportion”) If set to None, no legend will be plotted.
legend_loc (Optional [str.]) – Location of the legend such as ‘upper right’, ‘upper left’, ‘lower right’, ‘lower left’ and ‘best’ (default: ‘best’). See matplotlib.pyplot.legend for more options.
figsize (Optional [(float,float)]) – width and height in inches (by default: (w,2.2) with w dependent on the number of categories)
xticks_rotation (Optional [int.]) – x-ticks labels rotation in degrees
xticks_labels (Optional [list.]) – x-ticks labels (by default: targ_cats)
err_capsize (Optional [float]) – Length of the error bar caps in points
bar_linewidth (Optional [float]) – Width of the bars edge. (default: 1.5)
bar_width (Optional [float]) – Width of the bars. (default dependent on the number of categories. If less than 5 cats: 0.7)

Example

>>> from GRATIOSA import plot_stat_analysis
>>> dict_cats = {"GOterm1":["A","B","D","E","F"],
...              "GOterm2":["C","E"],
...              "GOterm3":["A","B","F","G","H","I","M"],
...              "GOterm4":["C","F","G","J"]}
>>> # feature dictionary reused from the stat_analysis.enrichment_test example
>>> dict_features = {"act": ["B", "D", "E", "H", "I", "M", "P", "Q", "R", "S", "T", "W"],
...                  "rep": ["C", "F", "G", "U", "X"], "None": ["A"], "NA": ["J"]}
>>> plot_stat_analysis.plot_enrichment_test(
...                         dict_cats,dict_features,
...                         targ_features=["act","None"],
...                         all_features=["act","None","rep","NA"],
...                         targ_cats=["GOterm1","GOterm2","GOterm3"],
...                         min_nb_elements=3,output_file="test1")

GRATIOSA.plot_stat_analysis.plot_student_test(dict_data, cats='all', style='bar', method='student', alt_hyp='one-sided', output_dir='/home/usr/documents/resdir/student_test/', output_file='student_test2025-01-21 14:21:03.732305', file_extension='.pdf', xlabel='', ylabel='Mean(data)', title='', annot_brackets=True, *args, **kwargs)

Barplots of the Student test computed with stat_analysis.quantitative_data_student_test (see its documentation for more details).

Parameters:

dict_data (dict.) – datapoints corresponding to each category in a dictionary of shape {category:list of datapoints}
cats (Optional [list]) – list of categories to compare (default: all keys of dict_data)
method (Optional ["student" or "wilcoxon"]) – test used to compute the p-values: Student’s t-test or the non-parametric Wilcoxon test. (default: “student”)
alt_hyp (Optional ["two-sided" or "one-sided"]) – alternative hypothesis. If “one-sided” is chosen, both one-sided tests will be performed with the scipy.stats.ttest_ind function and the smaller p-value will be kept. See scipy documentation for more details. (default: “one-sided”)
output_dir (Optional [str.]) – output directory
output_file (Optional [str.]) – output filename for the student test data (.txt) and the plot
file_extension (Optional [str.]) – Graphic file extension type (.pdf by default)
xlabel (Optional [str.]) – label for the x-axis (default: empty)
ylabel (Optional [str.]) – label for the y-axis (default: “Mean(data)”)
title (Optional [str.]) – general title for the figure
annot_brackets (Optional [bool.]) – if True, the barplot will be annotated according to the p-values using star annotations (default: True)
brackets_linewidth (Optional [float.]) – Linewidth of the brackets. (default: 1.)
ymin (Optional [float]) – y-axis bottom limit
ymax (Optional [float]) – y-axis top limit
figsize (Optional [(float,float)]) – width and height in inches (by default: (w,2.2) with w dependent on the number of categories)
xticks_rotation (Optional [int.]) – x-ticks labels rotation in degrees
xticks_labels (Optional [list]) – x-ticks labels (by default: cats)
err_capsize (Optional [float]) – Length of the error bar caps in points
bar_linewidth (Optional [float]) – Width of the bars edge. (default: 2.0)
bar_width (Optional [float]) – Width of the bars. (default: 0.6) (default dependent on the number of categories. If less than 5 cats: 0.7)

Example

>>> from GRATIOSA import plot_stat_analysis
>>> dict_data = {'a':[1,2,5,6,19], 'b':[10,24,4,15]}
>>> plot_stat_analysis.plot_student_test(dict_data)