Genome

class methods

class GRATIOSA.Genome.Genome(name)

The Genome class is the primary class of this package. It gathers all the attributes of a genome such as the sequence, the set of genes (with their functional annotations and their orientation) but also all the annotations of TSS, TU and TTS.

Each Genome instance has to be initialized with an organism name:
>>> from GRATIOSA import Genome
>>> g = Genome.Genome("ecoli")
create_database(NCBI_accession=None)

Creates the hierarchy of directories for the genome. If an NCBI GenBank accession number is given (e.g., GCA_000147055.1), the method tries to get the reference sequence and genomic annotation from the NCBI database, using two methods in turn: (1) fetching them with the “datasets” command of the NCBI command-line tool (if previously installed); (2) otherwise, downloading the relevant files directly from the NCBI server. The user can then add data manually in the directories.

Parameters:

NCBI_accession (Optional [str.]) – NCBI GenBank accession number of the requested genome. Example: GCA_000147055.1

Note

If the NCBI command line tool is not installed, the software tries to open/download the “accession_number_complete.txt” file in the base directory. This is the list of all NCBI complete genomes (~ 20 Mb). If it is already present, the software does not try to update it. For an update of this file from the NCBI server, please run update_NCBI_genomes().

load_seq(filename='sequence.fasta')

load_seq loads the DNA sequence from a .fasta file present in the main directory of the organism, using the useful_functions_genome.read_seq function. It adds this sequence, its complement, and its length to the Genome instance. Note: if the fasta file indicates a multi-chromosome/plasmid genome, the sequence is a list, and several genome objects are created (one per chromosome), each associated with a sequence (string).

For a single-chromosome species, creates 3 new attributes of the Genome instance:
  • seq (str.): genomic sequence composed of A, T, G and C

  • seqcompl (str.): complement sequence to seq

  • length (int.): length of the genomic sequence

For a multi-chromosome (or plasmid) species, the attribute seq is a list, and the genome object of each chromosome has previous attributes.

Parameters:

filename (Optional [str.]) – name of the file containing the DNA sequence in FASTA format. Default: “sequence.fasta”

Example

>>> from GRATIOSA import Genome
>>> g = Genome.Genome("dickeya")
>>> g.load_seq()
>>> g.seq[100:151]
'AATGTCGATCTTCAACATATCGCCGATCCGACGGGCACCCAGATCCTGCAG'
>>> g.seqcompl[100:151]
'TTACAGCTAGAAGTTGTATAGCGGCTAGGCTGCCCGTGGGTCTAGGACGTC'
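
The relation between seq, seqcompl and length can be illustrated with a minimal sketch. The complement helper below is hypothetical (it is not part of GRATIOSA) and only assumes that seqcompl is the base-by-base complement of seq in the same orientation, as the example output above suggests.

```python
# Hypothetical helper, not part of GRATIOSA: complement a DNA string base by base.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(seq):
    """Return the complement of seq (same orientation, not reversed)."""
    return seq.translate(COMPLEMENT_TABLE)

# Fragment taken from the documented example (g.seq[100:151]).
seq = "AATGTCGATCTTCAACATATCGCCGATCCGACGGGCACCCAGATCCTGCAG"
seqcompl = complement(seq)
length = len(seq)
print(seqcompl)
print(length)
```
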
load_annotation(annot_file='sequence.gff3', features=['gene'])

load_annotation loads a gene annotation (coordinates, length, name…) from a file present in the /annotation/ directory. If the genome is multi-chromosome, the genes are associated both to the main genome object and the genome object of each chromosome.

Creates a new attribute of the Genome instance:
  • self.genes (dict.)

    self.genes is a dictionary of shape {locus_tags: Gene object}. Each Gene object is initialized with the following attributes: locus_tag, ID, name, strand, left, right, start, end, middle, length, ASAP.

Parameters:

annot_file (Optional [str.]) – name of the file containing the genomic annotation. Default: “sequence.gff3”

Note

If the file is a .gff3 or .gff, information will be loaded using useful_functions_genome.load_gff. Otherwise (LEGACY), the information importation requires an annotation.info file, containing column indices of each information in the data file and some additional information, in the following order: [0] Filename [1] Separator [2] Locus_tag column [3] Name column [4] ID column [5] Strand column [6] Left coordinate column [7] Right coordinate column [8] File start line [9] Chromosome name

Example

>>> from GRATIOSA import Genome
>>> g = Genome.Genome("dickeya")
>>> g.load_annotation()
>>> g.genes["Dda3937_00005"].name
'guaA'
load_genes_per_pos(window=0)

load_genes_per_pos associates each position with a list of genes overlapping this and its surrounding positions delimited by the window size given as an argument.

Creates a new attribute of the Genome instance:
  • self.genes_per_pos (dict.)

    self.genes_per_pos is a dictionary of shape {position: list of genes}. It contains, for each position p, the list of genes overlapping any position between p-window/2 and p+window/2 (inclusive)

Note: if the genome object has multiple chromosome/plasmids, the attribute is created only for each individual genome object of each chromosome!

Parameters:

window (Optional [int.]) – window size in b. load_genes_per_pos finds genes between pos-window/2 and pos+window/2 (inclusive). Default: window = 0

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method is called with the default “sequence.gff3” file. To use another annotation, please load an annotation before using the load_genes_per_pos method.

Example

>>> from GRATIOSA import Genome
>>> g = Genome.Genome("dickeya")
>>> g.load_annotation(annot_file="sequence.gff3")
>>> g.load_genes_per_pos(window=1000)
>>> g.genes_per_pos[120]
['Dda3937_00155', 'Dda3937_00156', 'Dda3937_00154']
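
The window logic described above can be sketched on toy data. This is only an illustration under stated assumptions: the function name genes_overlapping, the {locus: (left, right)} input shape and the toy coordinates are hypothetical, and the sketch ignores genome circularity.

```python
def genes_overlapping(genes, pos, window=0):
    """List loci whose [left, right] interval overlaps [pos - window/2, pos + window/2]."""
    lo = pos - window / 2
    hi = pos + window / 2
    return [locus for locus, (left, right) in genes.items()
            if left <= hi and right >= lo]

# Toy annotation with made-up loci and coordinates.
toy_genes = {"geneA": (10, 50), "geneB": (60, 120), "geneC": (400, 480)}

# Same shape as the documented attribute: {position: list of genes}.
genes_per_pos = {p: genes_overlapping(toy_genes, p, window=20) for p in range(1, 131)}
print(genes_per_pos[55])
```

With window=0, only genes covering the position itself are returned, so an intergenic position maps to an empty list.
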
load_neighbor_all()

For each gene and each position, load_neighbor_all finds the nearest neighbors (left and right) on the genome, whatever their strand.

Creates 2 new attributes of Gene instances:
  • self.genes[locus].left_neighbor

    locus of the nearest left-side neighbor gene

  • self.genes[locus].right_neighbor

    locus of the nearest right-side neighbor gene

Note: if the genome object has multiple chromosome/plasmids, the attributes are created only for each individual genome object of each chromosome!

and 4 new attributes of Genome instance:
  • self.genomic_situation (dict.)

    dict of shape {position: situation} with situation either “intergenic” or “intragenic”.

  • self.left_neighbor (dict.)

    dict of shape {position: locus of the nearest left-side neighbor gene}

  • self.right_neighbor (dict.)

    dict of shape {position: locus of the nearest right-side neighbor gene}

  • self.gene (dict.)

    dict of shape {position: gene} with gene = “NA” if the position is intergenic.

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method is called with the default “sequence.gff3” file. To use another annotation, please load an annotation before using the load_neighbor_all method.

Example

>>> from GRATIOSA import Genome
>>> g = Genome.Genome("dickeya")
>>> g.load_annotation(annot_file="sequence.gff3")
>>> g.load_neighbor_all()
>>> g.genes["Dda3937_00005"].left_neighbor
'Dda3937_00006'
>>> g.genes["Dda3937_00005"].right_neighbor
'Dda3937_00004'
>>> g.genomic_situation[569]
'intragenic'
>>> g.gene[569]
'Dda3937_00156'
>>> g.left_neighbor[569]
'Dda3937_00155'
>>> g.right_neighbor[569]
'Dda3937_00157'
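
As a rough illustration of the per-position attributes, the sketch below classifies one position and finds its nearest neighbors on toy data. The function name and the {locus: (left, right)} input are hypothetical, the linear scan is chosen for readability rather than speed, and wrap-around on a circular genome is not handled.

```python
def situation_and_neighbors(genes, pos):
    """Return (situation, gene, left_neighbor, right_neighbor) for one position."""
    gene = "NA"
    left_nb = right_nb = None
    for locus, (left, right) in genes.items():
        if left <= pos <= right:
            gene = locus                      # position lies inside this gene
        elif right < pos:
            # candidate left neighbor: keep the one ending closest to pos
            if left_nb is None or right > genes[left_nb][1]:
                left_nb = locus
        else:
            # candidate right neighbor: keep the one starting closest to pos
            if right_nb is None or left < genes[right_nb][0]:
                right_nb = locus
    situation = "intragenic" if gene != "NA" else "intergenic"
    return situation, gene, left_nb, right_nb

# Toy annotation with made-up loci and coordinates.
toy_genes = {"g1": (100, 300), "g2": (500, 700), "g3": (900, 1200)}
print(situation_and_neighbors(toy_genes, 569))
print(situation_and_neighbors(toy_genes, 400))
```
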
load_gene_orientation(couple=3, max_dist=5000)

Compute gene orientation with the following criteria:

  • If couple = 3, gene is considered:

    • divergent if left neighbor on - strand and right neighbor on + strand,

    • convergent if left neighbor on + strand and right neighbor on - strand,

    • tandem if left and right neighbors on same strand (whatever the strand of the given gene is),

    • isolated if the distance between neighbors is higher than the maximal distance given as argument.

  • If couple = 2, gene is considered:

    • tandem if predecessor (left neighbor for gene on + strand, right neighbor for gene on - strand) is on same strand,

    • divergent if the predecessor is on opposite strand.

Creates new attributes:
  • self.orientation (dict.)

    new attribute of the Genome instance. self.orientation is a dictionary of shape {orientation: list of genes}. It contains the list of genes for each orientation (tandem, divergent, convergent, and isolated if couple=3, tandem and divergent if couple=2)

  • self.genes[locus].orientation (str.)

    new attribute of Gene instances related to the Genome instance given as argument

Note: if the genome object has multiple chromosome/plasmids, the first attribute is created only for each individual genome object of each chromosome!

Parameters:
  • couple (int.) –

    number of genes to consider in a “couple”.

    • If couple = 2: computes the orientation of a gene relative to its predecessor

    • If couple = 3: computes the orientation of a gene relative to its two neighbors

    Default: 3

  • max_dist (Optional [int.]) – maximal distance between 2 genes start positions for seeking neighbor (Default: 5kb)

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method is called with the default “sequence.gff3” file. To use another annotation, please load an annotation before using the load_gene_orientation method.

Example

>>> from GRATIOSA import Genome
>>> g = Genome.Genome("dickeya")
>>> g.load_annotation(annot_file="sequence.gff3")
>>> g.load_gene_orientation()
>>> g.genes["Dda3937_00005"].orientation
'tandem'
>>> g.orientation["isolated"]
['Dda3937_02360', 'Dda3937_01107', 'Dda3937_03898', 'Dda3937_04216',
'Dda3937_00244', 'Dda3937_04441', 'Dda3937_01704', 'Dda3937_01705',
...
'Dda3937_02126', 'Dda3937_01530', 'Dda3937_04419', 'Dda3937_02081']
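
The classification criteria listed above can be written as a small decision function. This is a sketch, not the package's code: strands are assumed to be booleans (True for +, False for -), the distance between neighbors is assumed to be computed beforehand, and testing the "isolated" criterion first is an assumption about precedence.

```python
def orientation_couple3(left_strand, right_strand, distance, max_dist=5000):
    """Orientation of a gene relative to its two neighbors (couple = 3)."""
    if distance > max_dist:
        return "isolated"
    if left_strand == right_strand:
        return "tandem"
    if not left_strand and right_strand:
        return "divergent"      # left neighbor on -, right neighbor on +
    return "convergent"         # left neighbor on +, right neighbor on -

def orientation_couple2(gene_strand, left_strand, right_strand):
    """Orientation of a gene relative to its predecessor (couple = 2)."""
    # Predecessor: left neighbor for a + gene, right neighbor for a - gene.
    predecessor = left_strand if gene_strand else right_strand
    return "tandem" if predecessor == gene_strand else "divergent"

print(orientation_couple3(False, True, distance=1000))
print(orientation_couple2(True, left_strand=False, right_strand=True))
```
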
load_pos_orientation(max_dist=5000)

Defines an orientation for each genomic position, based on the neighboring genes, with the following criteria:

  • divergent if left neighbor on - strand and right neighbor on + strand,

  • convergent if left neighbor on + strand and right neighbor on - strand,

  • tandem if left and right neighbors on same strand

  • isolated if the distance between neighbors is higher than the maximal distance given as argument

Creates 2 new attributes of the Genome instance:
  • self.pos_orientation (dict of dict)

    dictionary containing 2 subdictionaries: one for “intergenic” positions and one for “intragenic” positions. Each subdictionary contains the list of positions for each orientation: {“intergenic”: {orientation: list of positions}, “intragenic”: {orientation: list of positions}} with orientation in [“divergent”, “convergent”, “tandem”, “isolated”]

  • self.orientation_per_pos (dict.)

    dictionary of shape {position: orientation}.

Note: if the genome object has multiple chromosome/plasmids, the attributes are created only for each individual genome object of each chromosome!

Parameters:

max_dist (Optional [int.]) – maximal distance between 2 genes start positions for seeking neighbor (Default: 5kb)

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method is called with the default “sequence.gff3” file. To use another annotation, please load an annotation before using this method.

Example

>>> from GRATIOSA import Genome
>>> g = Genome.Genome("dickeya")
>>> g.load_annotation(annot_file="sequence.gff3")
>>> g.load_pos_orientation()
>>> g.orientation_per_pos[30]
'tandem'
>>> g.pos_orientation["intragenic"]["isolated"]
[11534,11535,11536,11537,11538,11539,11540,11541,11542,11543,...]
load_TSS()

load_TSS loads a TSS annotation from a file present in the /TSS/ directory.

Creates:
  • self.TSSs (dict. of dict.)

    New attribute of the Genome instance. self.TSSs is a dictionary of shape {Condition: {TSS position: TSS object}}. One subdictionary is created for each condition listed in TSS.info file. Each TSS object is initialized with the following attributes: pos, genes, promoter, score, strand. The promoter attribute is a dictionary containing, for each sigma factor (key), a subdictionary (value). The first created key of this subdictionary is “sites” and the associated value is a tuple containing the positions of promoter elements. See __init__ and add_promoter in the TSS class for more details about each attribute.

  • self.TSSs['all_TSSs'] (subdictionary of self.TSSs)

    additional subdictionary of self.TSSs, of shape: self.TSSs['all_TSSs'] = {TSSpos: [TSScond]} with [TSScond] the list of TSS conditions where this TSS was found.

Note

This function is designed for single-chromosome genomes only. The information importation requires a TSS.info file, containing column indices of each information in the data file and some additional information, in the following order: [0] Condition [1] Filename [2] Locus tag column [3] TSS position [4] File start line [5] Separator [6] Strand column [7] Sigma factor column [8] Sites column [9] Score column

Note

If some of the data types are missing (locus tags, sigma factors, scores or sites), an empty space can be left in the .info file.

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method is called with the default “sequence.gff3” file. To use another annotation, please load an annotation before using this method.

Example

>>> from GRATIOSA import Genome
>>> g = Genome.Genome("dickeya")
>>> g.load_TSS()
>>> g.TSSs["dickeya-btss"][4707030].promoter
{'sigma70': {'sites': (4707039, 4707044, 4707060, 4707065)},
 'sigma32': {'sites': (4707037, 4707046, 4707062, 4707068)}}
>>> g.TSSs["dickeya-btss"][4707030].strand
False
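
The shape of the 'all_TSSs' subdictionary can be illustrated by building it from per-condition dictionaries. The helper name build_all_tss and the toy conditions are hypothetical; plain strings stand in for TSS objects.

```python
def build_all_tss(tss_by_condition):
    """Map each TSS position to the list of conditions in which it was found."""
    all_tss = {}
    for cond, tss_dict in tss_by_condition.items():
        if cond == "all_TSSs":
            continue  # skip the summary entry if it is already present
        for pos in tss_dict:
            all_tss.setdefault(pos, []).append(cond)
    return all_tss

# Toy input of shape {Condition: {TSS position: TSS object}}.
toy_tss = {
    "condA": {120: "TSS object", 450: "TSS object"},
    "condB": {450: "TSS object"},
}
print(build_all_tss(toy_tss))  # {120: ['condA'], 450: ['condA', 'condB']}
```
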
load_prom_elements(shift=0, prom_region=[0, 0])

load_prom_elements extracts sequences of the different promoter elements (spacer, -10, -35, discriminator, region around TSS) based on -35 and -10 coordinates loaded with load_TSS method, for all TSS conditions. For each sigma factor associated to a TSS annotation, creates a subdictionary in self.TSSs[condTSS][TSS].promoter with the shape promoter[sigma] = {element: sequence of the element} with element in [“spacer”, “minus10”, “minus35”, “discriminator”, “region”].

Parameters:
  • shift (Optional [int.]) – number of nucleotides to include beyond each region on either side (Default: 0nt)

  • prom_region (Optional [int.,int.]) – region upstream and downstream TSSs to extract. Argument with the shape: [length before TSS, length after TSS]. Default: [0,0] ie no sequence will be extracted around TSS.

Note

See load_TSS description to understand the structure of the subdictionary self.TSSs[condTSS][TSS].promoter

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method is called with the default “sequence.gff3” file. To use another annotation, please load an annotation before using this method.

Example

>>> from GRATIOSA import Genome
>>> g = Genome.Genome("dickeya")
>>> g.load_prom_elements()
>>> g.TSSs["dickeya-btss"][4707030].promoter
{'sigma70': {'sites': (4707039, 4707044, 4707060, 4707065),
             'spacer': 'TCGCCCACCCTCAAT',
             'minus10': 'CATCAT',
             'minus35': 'TACCCC',
             'discriminator': 'GAATAACC'},
 'sigma32': {'sites': (4707037, 4707046, 4707062, 4707068),
             'spacer': 'CCTCGCCCACCCTCA',
             'minus10': 'ATCATCATGA',
             'minus35': 'CCGTACC',
             'discriminator': 'ATAACC'}}
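
How the elements may relate to the “sites” tuple can be sketched as follows. This is an illustration under several assumptions, not the package's implementation: coordinates are taken as 1-based and inclusive, the spacer is the interval between the -35 and -10 elements, the discriminator is the interval between the -10 element and the TSS, and reverse-strand elements are read as reverse complements. The interval lengths obtained this way agree with the example above (for sigma70, a 15-nt spacer and an 8-nt discriminator). The shift and prom_region arguments are not covered.

```python
# Hypothetical helpers (not GRATIOSA code). Coordinates are 1-based and inclusive.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def read_region(seq, left, right, forward):
    """Return seq[left..right] as read on the requested strand."""
    sub = seq[left - 1:right]
    return sub if forward else sub.translate(COMPLEMENT_TABLE)[::-1]

def promoter_elements(seq, tss, forward, sites):
    """sites = (-10 left, -10 right, -35 left, -35 right), as in the 'sites' tuple."""
    m10l, m10r, m35l, m35r = sites
    if forward:
        spacer = (m35r + 1, m10l - 1)         # between -35 and -10
        discriminator = (m10r + 1, tss - 1)   # between -10 and the TSS
    else:
        spacer = (m10r + 1, m35l - 1)
        discriminator = (tss + 1, m10l - 1)
    return {
        "minus35": read_region(seq, m35l, m35r, forward),
        "spacer": read_region(seq, *spacer, forward),
        "minus10": read_region(seq, m10l, m10r, forward),
        "discriminator": read_region(seq, *discriminator, forward),
    }

# Toy forward-strand promoter: -35 at 3..8, -10 at 26..31, TSS at 38.
toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CCCCCC" + "A" + "GG"
elements = promoter_elements(toy, tss=38, forward=True, sites=(26, 31, 3, 8))
print(elements)
```
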
load_TU()

load_TU loads a TU annotation from a file present in the /TU/ directory.

Creates a new attribute of the Genome instance:
  • self.TUs (dict. of dict.)

    self.TUs is a dictionary of shape {Condition: {TU start pos: TU object}}. One subdictionary is created for each condition listed in TU.info file. Each TU object is initialized with the following attributes: start, stop, orientation, genes, left, right, expression. See __init__ in the TU class for more details about each attribute.

Note

The information importation requires a TU.info file, containing column indices of each information in the data file and some additional information, in the following order: [0] Condition [1] Filename [2] TU ID [3] Start column [4] Stop column [5] Strand column [6] File start line [7] Separator [8] Gene column (optional) [9] Expression (optional). See the load_TU_cond function in useful_functions_genome for more details.

Example

>>> from GRATIOSA import Genome
>>> g = Genome.Genome("dickeya")
>>> g.load_TU()
>>> g.TUs["TU_Forquet"][2336702].right
2339864
>>> g.TUs["TU_Forquet"][2336702].strand
True
load_TTS()

load_TTS loads a TTS annotation from a file present in the /TTS/ directory.

Creates:
  • self.TTSs (dict. of dict.)

    New attribute of the Genome instance. self.TTSs is a dictionary of shape {Condition: {TTS position: TTS object}}. One subdictionary is created for each condition listed in TTS.info file. Each TTS object is initialized with the following attributes: left, right, start, end, strand, rho_dpdt, genes, seq, score. If the data do not contain information about associated genes, sequence, or rho dependency, the corresponding attributes will be initialized as “None”. See __init__ in the TTS class for more details about each attribute.

  • self.TTSs['all_TTSs'] (subdictionary of self.TTSs)

    additional subdictionary of self.TTSs, of shape: self.TTSs['all_TTSs'] = {TTSpos: [TTScond]} with [TTScond] the list of TTS conditions where this TTS was found.

Note

The information importation requires a TTS.info file, containing column indices of each information in the data file and some additional information, in the following order: [0] Condition [1] Filename [2] Left coordinate column [3] Right coordinate column [4] Strand column [5] Startline [6] Separator [7] Sequence column and optionally: [8] Score column [9] Genes column [10] Rho dependency column

Note

See the load_TTS_cond function in useful_functions_genome for more details.

Example

>>> from GRATIOSA import Genome
>>> g = Genome.Genome("dickeya")
>>> g.load_TTS()
>>> g.TTSs["RhoTerm"][2791688].left
2791688
load_GO()

load_GO loads file specified in GO.info to assign GO terms to genes.

Creates:
  • self.GO (dict. of dict)

    new attribute of the Genome instance. self.GO is a dictionary of shape {annot_syst: {GOterm: list of genes}} i.e. one subdictionary is created for each annotation system (such as GOc, COG or domain) listed in GO.info file.

  • self.genes[locus].GO (list)

    new attribute of Gene instances related to the Genome instance given as argument. List of terms (such as GO terms) associated to the gene.

Note

The information importation requires a GO.info file, containing column indices of each information in the data file and some additional information, in the following order: [0] Annotation system [1] Filename [2] Tag Type (locus tag (default), ASAP tag, gene name…) [3] Locus tag column [4] GOterm column [5] Separator [6] Start line. GO.info and the data files have to be in the /GO/ directory.

Note

Other annotation systems such as COG, or domain assignment can also be used.

Warning

This method needs a genomic annotation. If no annotation is loaded, the load_annotation method is called with the default “sequence.gff3” file. To use another annotation, please load an annotation before using this method.

Example

>>> from GRATIOSA import Genome
>>> g = Genome.Genome("dickeya")
>>> g.load_GO()
>>> g.GO['GO']['GO:0000100']
['Dda3937_01944', 'Dda3937_03618']
>>> g.genes["Dda3937_00004"].GO
{'GO': ['GO:0005829','GO:0055114','GO:0046872','GO:0042802',
        'GO:0000166','GO:0006177','GO:0003938','GO:0005829',
        'GO:0055114','GO:0046872','GO:0042802','GO:0000166',
        'GO:0006177','GO:0003938']}
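
The two views described above (genes per term and terms per gene) are inverses of each other. The sketch below derives one from the other for a single annotation system; the helper name terms_per_gene and the toy loci are hypothetical.

```python
def terms_per_gene(term_to_genes):
    """Invert {term: list of genes} into {gene: list of terms}."""
    per_gene = {}
    for term, loci in term_to_genes.items():
        for locus in loci:
            per_gene.setdefault(locus, []).append(term)
    return per_gene

# Toy content for one annotation system, shape {GOterm: list of genes}.
toy_go = {
    "GO:0000100": ["geneA", "geneB"],
    "GO:0005829": ["geneB"],
}
print(terms_per_gene(toy_go))
```
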
load_sites(cond='all')

load_sites imports a list of sites from a csv data file containing, for each site, its chromosome start and end, and optionally a score and chromosome name.

Creates new attributes of the Genome instance:
  • self.sites (dict. of numpy arrays): self.sites[cond] is a list of (start, end) tuples

  • self.sites_scores (if scores are provided): list of scores of the same size

Parameters:

cond (Optional [list of str.]) – selection of one or several conditions (1 condition corresponds to 1 data file and each condition has to be listed in the sites.info file). By default: cond='all', ie all available conditions are loaded.

Note

The data importation requires a sites.info file that contains the column indices of each information in the data file and some additional information, in the following order:
  • (required) [0] Condition, [1] Filename, [2] Startline, [3] Separator, [4] Start, [5] End

  • (optional) [6] Score

Note

sites.info and data file have to be in the /sites/ directory

Warning

The function is designed only for genomes with a single chromosome.

Example

>>> g.load_sites("borders_SRR10394904")
>>> g.sites['borders_SRR10394904'][0:5]

useful functions

Functions called by Genome methods

GRATIOSA.useful_functions_genome.read_seq(filename)

Called by load_seq, read_seq allows name(s) and genomic sequence(s) to be loaded from a .fasta file

Parameters:

filename (str.) – name of the file containing the DNA sequence in FASTA format.

Returns:

tuple with list of names (str.) and list of genomic sequences (str.)
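
A reader with the same return shape (list of names, list of sequences) can be sketched in a few lines. This is not the package's read_seq: the name read_fasta is hypothetical, it takes an open handle rather than a filename, and keeping only the first word of each header as the name is an assumption.

```python
import io

def read_fasta(handle):
    """Return (names, sequences) from a FASTA-formatted text handle."""
    names, seqs, chunks = [], [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if names:
                seqs.append("".join(chunks))   # close the previous record
            header = line[1:].strip()
            names.append(header.split()[0] if header else "")
            chunks = []
        else:
            chunks.append(line.upper())
    if names:
        seqs.append("".join(chunks))           # close the last record
    return names, seqs

# Toy two-record file (one chromosome, one plasmid).
toy_fasta = ">chr1 toy chromosome\nACGT\nacgt\n>plasmid1\nGGCC\n"
names, seqs = read_fasta(io.StringIO(toy_fasta))
print(names, seqs)
```
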

GRATIOSA.useful_functions_genome.update_NCBI_genomes()

Tries to download the current list of NCBI complete bacterial genomes and stores it in the file “accession_number_complete.txt” in the base directory. This is useful if you want to create a new organism and download the annotation automatically from NCBI, and you do not have the NCBI command-line tool installed. Caution: this requires downloading a large file (~ 1 Gb). The stored file contains only the complete genomes and is much lighter (~ 20 Mb).

GRATIOSA.useful_functions_genome.load_gff(annotations_filename, genome_dict, features=['gene'])

Called by load_annotation, load_gff allows genes to be loaded from a .gff or .gff3 annotation file, using the defined list of features and using the “locus_tag” as primary field to name the genes, with “gene” being the secondary field if locus_tag does not exist. For each gene of the annotation, one Gene object is created and initialized with the following attributes: locus_tag, ID, name, left (left coordinate), right (right coordinate), middle, length, strand, start, end, ASAP ID (legacy), and genome.

Parameters:
  • annotations_filename (str.) – name of the annotation file.

  • genome_dict (dict.) – dictionary giving the genome object associated to each chromosome name found in the gff file

  • features (list of str.) – list of attributes (column 3) to be imported: usually ["gene"], etc. A feature will be retained if it contains a

Returns:

Dictionary of shape {locus: Gene object} with each Gene object initialized with the following attributes:

  • locus_tag: “locus_tag” recorded in the annotation,

  • ID: “ID” or “locus_tag” if no “ID” is associated to this gene in the annotation,

  • name: “gene” or “locus_tag” if no “gene” is associated to this gene in the annotation,

  • ASAP: “ASAP” name (By default: ‘’).

  • strand: gene strand,

  • left, right: gene coordinates (does not take into account the strand, ie right > left)

  • start, middle, end: positions of the beginning, the middle and the end of the gene

  • length: gene length (=right-left)

  • genome: genome object where this gene belongs (useful only for genomes with multiple chromosomes)

Return type:

Dictionary

Warning

ASAP name is found only if it is noted on the same line as the “name” line or the line next to it (legacy of an annotation file of Dickeya dadantii)

GRATIOSA.useful_functions_genome.load_annot_general(annotations_filename, separator, tag_column, name_column, ID_column, genome_column, strand_column, left_column, right_column, ASAP_column, start_line, features=['gene'])

Legacy function, rarely needed. Called by load_annotation, load_annot_general allows genes to be loaded from any csv annotation file that does not follow the gff format. This is almost never useful, as annotation is usually loaded from gff files (using the load_gff function). The annotation file has to contain one row per gene and the following columns: locus_tag, ID, name, ASAP, left coordinate, right coordinate and strand. For each gene of the annotation, one Gene object is created and initialized with the following attributes: locus_tag, ID, name, left (left coordinate), right (right coordinate), middle, length, strand, start, end and ASAP.

Parameters:
  • annotations_filename (str.) – name of the annotation file

  • separator (str.) – file separator

  • tag_column (int.) – index of the column containing the locus tags

  • name_column (int.) – index of the column containing the names

  • ID_column (int.) – index of the column containing the ID

  • strand_column (int.) – index of the column containing the strand

  • left_column (int.) – index of the column containing the left coordinates

  • right_column (int.) – index of the column containing the right coordinates

  • ASAP_column (int.) – index of the column containing the ASAP names

  • start_line (int.) – file start line

Returns:

Dict. of shape {locus: Gene object} with each Gene object initialized with the following attributes: locus_tag, ID, name, left (left coordinate), right (right coordinate), middle, length, strand, start, end and ASAP.

Return type:

Dictionary

Note

Column numbering starts at 0.

Strands can be noted with one of the following writings:

  • Forward: “forward”, “1”, “+”, “true”, “plus” (case insensitive)

  • Reverse: “complement”, “-1”, “-”, “false”, “minus” (case insensitive)
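
The accepted strand notations listed above could be normalized along these lines. The function name parse_strand is hypothetical; returning a boolean (True for forward, False for reverse) is an assumption based on the strand values shown in the examples of this page.

```python
# Accepted notations, compared in lower case (case insensitive).
FORWARD = {"forward", "1", "+", "true", "plus"}
REVERSE = {"complement", "-1", "-", "false", "minus"}

def parse_strand(value):
    """Return True for a forward strand, False for a reverse strand."""
    token = str(value).strip().lower()
    if token in FORWARD:
        return True
    if token in REVERSE:
        return False
    raise ValueError(f"Unrecognized strand notation: {value!r}")

print(parse_strand("Plus"), parse_strand("-1"))
```

Raising an error on unknown notations is a design choice of this sketch; it makes malformed data files visible early.
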

GRATIOSA.useful_functions_genome.load_TSS_cond(genes_dict, filename, TSS_column, start_line, separator, strandcol, genescol=None, sigcol=None, sitescol=None, scorecol=None)

Called by load_TSS, load_TSS_cond allows TSS data to be loaded from any file with one row per TSS and at least the following columns (separated by a separator that is not a comma): TSS position and DNA strand. Additional columns can be locus tags, sigma factors, sites positions and score. The sites positions column is divided into 4 parts separated by commas: -10 element left coordinate, -10 element right coordinate, -35 element left coordinate and -35 element right coordinate. For each TSS of the annotation, one TSS object is created and initialized with the attributes available in the data file (at least pos and strand).

Parameters:
  • genes_dict (dict.) – dictionary of shape {locus tag: Gene object}

  • filename (str.) – name of the annotation file

  • TSS_column (int.) – index of the column containing the TSS positions

  • start_line (int.) – file start line

  • separator (str.) – file separator (must not be a comma)

  • strandcol (int.) – index of the column containing the strands in the file

  • genescol (int.) – index of the column containing the locus tags of the genes associated to each TSS in the file. By default: None (ie not in the file)

  • sigcol (int.) – index of the column containing the name of the sigma factor associated to each TSS in the file. By default: None (ie not on file)

  • sitescol (int.) – index of the column containing the sites positions (ie the coordinates of the -10 and the -35 elements) associated to each TSS in the file. By default: None (ie not on file)

  • scorecol (int.) – index of the TSS scores column in the file. By default: None (ie not on file)

Returns:

dict. of shape {TSS position: TSS object} with each TSS object initialized with, at least, the following attributes: pos and strand. Depending on the data present in the file and passed as arguments, the following attributes may also have been added: genes, score and promoter. The promoter attribute is a dictionary containing, for each sigma factor (keys) a subdictionary (value). The first created key of this subdictionary is “sites” and the associated value is a tuple containing the positions of promoter elements (-10l,-10r,-35l,-35r) with l = left coordinate and r = right coordinate.

Return type:

Dictionary

Note

Column numbering starts at 0.

Strands can be noted with one of the following writings:

  • Forward: “forward”, “1”, “+”, “true”, “plus” (case insensitive)

  • Reverse: “complement”, “-1”, “-”, “false”, “minus” (case insensitive)
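
The row layout described for load_TSS_cond (a non-comma separator between columns, commas inside the sites column) can be illustrated with a small parser. This is a sketch only: parse_tss_row is a hypothetical name, it returns a plain dictionary instead of a TSS object, and it leaves the strand as the raw string found in the file.

```python
def parse_tss_row(line, separator, tss_column, strandcol, sitescol=None):
    """Parse one TSS row; column indices start at 0."""
    fields = line.rstrip("\n").split(separator)
    record = {"pos": int(fields[tss_column]), "strand": fields[strandcol]}
    if sitescol is not None and fields[sitescol].strip():
        # Four comma-separated coordinates: -10 left, -10 right, -35 left, -35 right.
        record["sites"] = tuple(int(x) for x in fields[sitescol].split(","))
    return record

# Toy tab-separated row reusing the coordinates of the load_TSS example.
row = "4707030\t-\t4707039,4707044,4707060,4707065\n"
print(parse_tss_row(row, "\t", tss_column=0, strandcol=1, sitescol=2))
```
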

GRATIOSA.useful_functions_genome.load_TTS_cond(filename, separator, start_line, leftcol, rightcol, strandcol, rhocol, seqcol=None, scorecol=None, genescol=None, *args, **kwargs)

Called by load_TTS, load_TTS_cond allows TTS data to be loaded from any file with one row per TTS and at least the following columns: left coordinate, right coordinate, strand and rho_dpdt. Additional columns can be genes locus tags, TTS sequence and TTS score. For each TTS of the annotation, one TTS object is created and initialized with the attributes available in the data file (at least left, right, rho_dpdt and strand).

Parameters:
  • filename (str.) – name of the annotation file

  • separator (str.) – file separator

  • start_line (int.) – file start line

  • leftcol (int.) – index of the column containing the TTS left coordinates

  • rightcol (int.) – index of the column containing the TTS right coordinates

  • strandcol (int.) – index of the column containing the TTS strands

  • rhocol (int.) – index of the column containing the information about the rho dependency of the TTS

  • seqcol (int.) – index of the column containing the TTS sequences. By default: None (ie not on file)

  • scorecol (int.) – index of the TTS scores column in the file. By default: None (ie not on file)

  • genescol (int.) – index of the column containing the locus tags of the genes associated to each TTS in the file. By default: None (ie not on file)

Returns:

Dict. of shape {TTS position: TTS object} with each TTS object initialized with, at least, the following attributes: left, right, start, end, strand and rho_dpdt. Depending on the data present in the file and passed as arguments, the following attributes may also have been added: genes, score and seq.

Return type:

Dictionary

Note

Column numbering starts at 0.

Strands can be noted with one of the following writings:

  • Forward: “forward”, “1”, “+”, “true”, “plus” (case insensitive)

  • Reverse: “complement”, “-1”, “-”, “false”, “minus” (case insensitive)

Rho dependent TTS can be noted with one of the following writings: “True”, “1” (case insensitive)

GRATIOSA.useful_functions_genome.load_TU_cond(filename, IDcol, startcol, endcol, strandcol, startline, separator, genescol=None, exprcol=None, TSScol=None, TTScol=None)

Called by load_TU, load_TU_cond allows TU data to be loaded from any file with one row per TU and the following columns: start position, end position, strand and genes locus tags. For each TU of the annotation, one TU object is created and initialized with the attributes start, end, strand, genes and expression.

Parameters:
  • filename (str.) – name of the annotation file

  • IDcol (int.) – index of the column containing the TU ID

  • startcol (int.) – index of the column containing the TU start positions

  • endcol (int.) – index of the column containing the TU end positions

  • strandcol (int.) – index of the column containing the TU strands

  • startline (int.) – file start line

  • separator (str.) – file separator

  • genescol (int.) – index of the column containing the locus tags of the genes associated to each TU in the file, separated by commas. By default: None (ie not on file)

  • exprcol (int.) – index of the column containing the TU expression (def. None)

  • TSScol (int.) – index of the column with TSS position

  • TTScol (int.) – index of the column with TTS position

Returns:

dict. of shape {TU start: TU object} with each TU object initialized with the attributes start, end, strand and genes.

Return type:

Dictionary

Note

Column numbering starts at 0.

Strands can be noted with one of the following writings:

  • Forward: “forward”, “1”, “+”, “true”, “plus” (case insensitive)

  • Reverse: “complement”, “-1”, “-”, “false”, “minus” (case insensitive)