RISCI - 'Repeat Induced Sequence Changes Identifier' - Readme
'RISCI' (Repeat Induced Sequence Changes Identifier) is a comprehensive, comparative genomics-based, in silico subtractive hybridization pipeline. It uses specific alignment signatures to identify differential transposon insertions and the associated subtle sequence changes, such as target site duplications (TSD), 3' and 5' flank transductions, insertion mediated deletions, recombination mediated deletions and transposon induced polymorphism. It emulates subtractive hybridization in the sense that it reports the sequence changes associated with a transposon insertion only when the locus is differential, i.e. the change is exclusive to one of the two genomes under comparison.

RISCI picks up the repeat locus from a given genome (Main or Reference genome), zooms into the orthologous locus in one or more genomes (Comparative genomes) of the same or a closely related species, and reports whether the insertion is differential or not. When it is differential, RISCI also reports the additional subtle sequence changes brought about by the transposon insertion in the main or comparative genome(s), which may then be studied for their downstream effects. When different genomes of the same species are compared, all identified differential insertions represent polymorphic sites.

RISCI also integrates the genomic context (genic or intergenic; if genic, exonic or intronic) of the repeat locus in the main genome and of the identified orthologous locus in the comparative genome(s), provided the annotation files are available.
----------------------------------------------------------------------
RISCI offers two types of analysis.

1. WHOLE GENOME ANALYSIS
All occurrences of a particular repeat type (e.g. L1HS) or repeat class (e.g. L1) in the main genome are analyzed. Note that whole genome analysis is done chromosome-wise, and the user may restrict the analysis to one or more chromosomes even when whole genome analysis is selected.

2. SPECIFIC REGION ANALYSIS
A specific region of the main genome (e.g. chromosome 2, 11000 - 20000) is analyzed for a particular repeat type (e.g. L1HS) or repeat class (e.g. L1) input by the user.
----------------------------------------------------------------------
There are two inbuilt filters in RISCI to refine the user query.

1. LENGTH FILTER (FULL LENGTH OR TRUNCATED OR BOTH)
Allows filtering on the basis of the length of the repeat (full length or truncated). It should not be used for repeats of variable length, e.g. SVAs, whose length varies because of the variable number tandem repeats they contain.

2. GENE FILTER (GENIC OR INTERGENIC OR BOTH)
The gene filter is subject to the availability of annotation files (.gbs or .gbk files).
----------------------------------------------------------------------
1. Local installation of standalone Legacy Blast.
2. Blast databases of the main and comparative genomes.
*** Please note, RISCI uses fastacmd to retrieve sequences from the main and comparative genomes. Therefore, while using formatdb to create the blast databases, ensure that the -o option is set to 'T'. All blast databases of the main and comparative genomes should be stored in the same folder, the path to which is required as input.
*** The names of the blast databases should correspond to the names of the main and comparative genomes input by the user.
3. Local installation of RepeatMasker, if RISCI_BLAST is used as the repeat mining option or the confirmation modules are run.
4. RepeatMasker files of the main genome, if RISCI_RM is used for repeat mining.
5. Local installation of EMBOSS modules.
6. Local installation of Perl.
Optional - Annotation files (gbs or gbk) of the main genome for the gene filter.
----------------------------------------------------------------------
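As an illustration of the formatdb requirement, the sketch below assembles the legacy BLAST command lines one might use to build a database with parsed sequence IDs (-o T) and then to retrieve a sub-sequence with fastacmd. It only builds the argument lists and does not run anything. The file name chr1.fa, the database name Human and the helper names are assumptions made for this example; they are not taken from RISCI itself.

```python
from typing import List


def formatdb_command(fasta_path: str, db_name: str) -> List[str]:
    """Build a formatdb call for a nucleotide database with parsed SeqIds.

    -p F marks the input as nucleotide, -o T parses sequence IDs so that
    fastacmd can later retrieve sequences by identifier, -n names the database.
    """
    return ["formatdb", "-i", fasta_path, "-p", "F", "-o", "T", "-n", db_name]


def fastacmd_command(db_name: str, seq_id: str, start: int, end: int,
                     minus_strand: bool = False) -> List[str]:
    """Build a fastacmd call that retrieves seq_id[start..end] from db_name.

    -L takes 'start,stop'; -S selects the strand (1 = top, 2 = bottom).
    """
    strand = "2" if minus_strand else "1"
    return ["fastacmd", "-d", db_name, "-s", seq_id,
            "-L", f"{start},{end}", "-S", strand]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(" ".join(formatdb_command("chr1.fa", "Human")))
    print(" ".join(fastacmd_command("Human", "NC_000001", 5277318, 5283348)))
```

The argument lists can be passed to subprocess.run once the legacy BLAST binaries are on the PATH.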
RISCI can be downloaded from http://www.ccmb.res.in/rakeshmishra/tools.html as a compressed file (RISCI.tar.gz). Unzip the file after downloading. A sample run of RISCI is also available (L1HS.tar.gz) to give access to the data mentioned in the manuscript and to familiarize users with the RISCI directory structure. The result files may be opened in WordPad or Excel.
----------------------------------------------------------------------
RISCI can be executed by typing perl risci.pl on the terminal, in the directory where it has been unzipped, and feeding the required inputs.
----------------------------------------------------------------------
RISCI inputs vary slightly with the type of analysis. Most inputs are common as mentioned below.
_____________________________________________________________________________________________________________________ |
1. NAME OF THE PROJECT :
(Main directory formed by RISCI carries this name.)
______________________________________________________________________
RISCI offers three different modules for repeat mining from the main genome.

a) RISCI_RM - The repeat coordinates are read directly from the files premasked by RepeatMasker. If this option is chosen, the path to the RepeatMasker files of the main genome also needs to be input.
----------------------------------------------------------------------
b) RISCI_BLAST - If premasked files from RepeatMasker are not available, this option may be used for repeat mining.

If the choice is 1 or 2,
11. NAME OF THE REPEAT AS IDENTIFIED IN RepeatMasker :
(All repeats starting with the same prefix as the name input will be picked up. Thus if 'L1HS' is input, all L1HS repeats will be picked up for analysis. On the other hand, if 'L1' is input, all repeats starting with L1 (e.g. L1HS, L1PA1, L1PA3 etc.) will be picked up.)
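The prefix rule for repeat names can be pictured with a short sketch. This is an illustration of the described behaviour, not RISCI's own code; the function name select_repeats and the sample list are assumptions for the example.

```python
from typing import Iterable, List


def select_repeats(repeat_names: Iterable[str], query: str) -> List[str]:
    """Return every repeat name that starts with the user-supplied query."""
    return [name for name in repeat_names if name.startswith(query)]


if __name__ == "__main__":
    names = ["L1HS", "L1PA1", "L1PA3", "AluYa5"]
    print(select_repeats(names, "L1HS"))  # only the L1HS repeat type
    print(select_repeats(names, "L1"))    # the whole L1 repeat class
```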
----------------------------------------------------------------------
RISCI_BLAST requires prior installation of RepeatMasker and additional inputs, which include -
i) INPUT SIGNATURE REPEAT SEQUENCE : a 19 to 22 bp oligomer specific to the repeat, preferably towards the 3' end of the repeat.
ii) UPSTREAM SEQUENCE RANGE SO AS TO RETRIEVE ENTIRE REPEAT AND SIZABLE FLANK (5KB) : the input depends on the relative position of the signature sequence on the transposon, e.g. if the signature sequence is located at the extreme 3' end of L1 (full length 6 kb), then to extract 5000 bp of sequence upstream of a full length L1, at least 11000 bp (6 kb of L1 plus 5 kb of flank) will have to be extracted; a value such as 12000 bp leaves some margin.
iii) DOWNSTREAM SEQUENCE RANGE TO RETRIEVE ENTIRE REPEAT AND SIZABLE FLANK : as above, the input depends on the relative position of the signature sequence in the repeat.
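The arithmetic behind the upstream and downstream ranges can be sketched as follows. The sketch assumes that both ranges are measured from the ends of the signature sequence and that the repeat is full length; the function and parameter names are hypothetical and the numbers are only an example for a 6 kb L1 with a 22 bp signature at its extreme 3' end.

```python
from typing import Tuple


def retrieval_ranges(repeat_length: int, signature_start: int,
                     signature_length: int, flank: int = 5000) -> Tuple[int, int]:
    """Estimate upstream and downstream ranges (bp) around a signature sequence.

    signature_start is the 1-based position of the signature within the repeat.
    Upstream range   = repeat bases 5' of the signature + desired flank.
    Downstream range = repeat bases 3' of the signature + desired flank.
    """
    signature_end = signature_start + signature_length - 1
    if signature_start < 1 or signature_end > repeat_length:
        raise ValueError("signature must lie within the repeat")
    upstream = (signature_start - 1) + flank
    downstream = (repeat_length - signature_end) + flank
    return upstream, downstream


if __name__ == "__main__":
    # 22 bp signature occupying the last 22 bases of a 6000 bp repeat.
    print(retrieval_ranges(6000, 5979, 22))
```

In practice a somewhat larger value than the computed minimum may be entered to leave a margin.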
----------------------------------------------------------------------
c) RISCI_NONRM - The user may directly input the repeat coordinates list using this option. The format of the coordinates list is specified below -
Unique Repeat ID (space) Contig (space) Repeat start coordinate (space) Repeat end coordinate (space) Repeat orientation (+/C)
e.g.
----------------------------------------------------------------------
L1HS_1_1 NC_000001 5277318 5283348 +
L1HS_1_2 NC_000001 6451604 6457635 C
AluYa5_1_3 NC_000001 1000000 1000300 +
----------------------------------------------------------------------
The coordinate list should be named after the project, i.e. the name of the project as input in 1 (see above) followed by "_COORDINATES", e.g. if the name of the project (input 1) is RANDOM, the coordinate list would be named "RANDOM_COORDINATES". Finally, place the list in the appropriate chromosome directory within the COORDINATE directory in the main genome directory, e.g. if the name of the project is "RANDOM", the main genome is "Human", and the repeat coordinates are from chromosome 1, place the "RANDOM_COORDINATES" file in the following path -
/RANDOM/Human/RANDOM_COORDINATES/chr1.RISCI/
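Before handing a coordinate list to RISCI_NONRM it can be useful to check that every line follows the five-field format described above. The following is a minimal validation sketch, not part of RISCI; the class and function names are assumptions made for the example.

```python
from typing import List, NamedTuple


class RepeatCoordinate(NamedTuple):
    repeat_id: str
    contig: str
    start: int
    end: int
    orientation: str  # '+' or 'C'


def parse_coordinates(text: str) -> List[RepeatCoordinate]:
    """Parse space-separated coordinate lines, raising ValueError on bad input."""
    records = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 5:
            raise ValueError(f"line {number}: expected 5 fields, got {len(fields)}")
        repeat_id, contig, start, end, orientation = fields
        if orientation not in ("+", "C"):
            raise ValueError(f"line {number}: orientation must be '+' or 'C'")
        start_i, end_i = int(start), int(end)
        if start_i > end_i:
            raise ValueError(f"line {number}: start is greater than end")
        records.append(RepeatCoordinate(repeat_id, contig, start_i, end_i, orientation))
    return records


if __name__ == "__main__":
    sample = (
        "L1HS_1_1 NC_000001 5277318 5283348 +\n"
        "L1HS_1_2 NC_000001 6451604 6457635 C\n"
        "AluYa5_1_3 NC_000001 1000000 1000300 +\n"
    )
    for record in parse_coordinates(sample):
        print(record.repeat_id, record.end - record.start + 1)
```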
______________________________________________________________________
12. ANALYZE FULL LENGTH OR TRUNCATED or BOTH (1/2/3) :
If the input above is 1 or 2, additional input is required as mentioned below.
i) CUT OFF LENGTH FOR FULL LENGTH AND TRUNCATED :
______________________________________________________________________
13. FLANK SEQUENCE LENGTH (default 5000) :
(Length of the flanks to be considered for identifying the orthologous locus in the comparative genome.)
______________________________________________________________________
14. PREFIX FOR CHROMOSOME FILES e.g. chr or ref_chr :
(Ensure that the RepeatMasker files of the main genome and the gbs files of the main and comparative genomes have the same prefix, e.g. chr1.fa.out for the RepeatMasker file and chr1.gbs for the GenBank summary file.)
______________________________________________________________________
15. NAME(s) OF CHROMOSOME(s) separated by commas (e.g. 1,2,3,4, .... X,Y) :
If the main genome sequence consists of assembled chromosomes, additional input is required as follows
i) ACCESSION NUMBERS FOR CHROMOSOME FILES OF THE MAIN GENOME corresponding to the chromosomes input above :
e.g. NC_000001,NC_000002 ... NC_000024 for chromosome 1, chromosome 2 ... chromosome Y respectively of the reference human genome.
Please note that the input here varies according to the nature of the sequence files. If the sequence files are downloaded from NCBI, they will carry an accession number, e.g. the sequence file for chromosome 1 from NCBI (NCBI Build ...) carries the header
>gi|89161185|ref|NC_000001.9|NC_000001
However, if the sequence is downloaded from the UCSC genome browser (Hg ..), the input will be chr1,chr2 ...
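To find the identifier to enter for a chromosome file, one can look at its FASTA header. The sketch below pulls the unversioned accession out of an NCBI-style header and falls back to the first word for UCSC-style headers. It is an illustrative helper with an assumed name, not something RISCI provides.

```python
def accession_from_header(header: str) -> str:
    """Return the identifier RISCI would expect for a FASTA header line.

    NCBI style: '>gi|89161185|ref|NC_000001.9|NC_000001' gives 'NC_000001'.
    UCSC style: '>chr1' gives 'chr1'.
    """
    text = header.lstrip(">").strip()
    if not text:
        raise ValueError("empty FASTA header")
    fields = text.split("|")
    if "ref" in fields:
        position = fields.index("ref")
        if position + 1 < len(fields):
            # Drop the version suffix (e.g. '.9') from the accession.
            return fields[position + 1].split(".")[0]
    return text.split()[0]


if __name__ == "__main__":
    print(accession_from_header(">gi|89161185|ref|NC_000001.9|NC_000001"))
    print(accession_from_header(">chr1"))
```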
______________________________________________________________________________________________________________________
Allows tagging the upstream query sequence with a minimum user defined length of non repeat sequence at the 5' end, if possible.
______________________________________________________________________
If the input is "Yes", additional input is required
i) MAXIMUM SEPARATION ALLOWED FOR HITS TO BE MERGED (default 50) :
(Merges BLAST hits in the comparative genome if they are separated from each other by less than the length input, both in terms of query coordinates and in terms of subject coordinates.)
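The merging rule can be illustrated with a small sketch. It assumes each hit is a tuple of (query start, query end, subject start, subject end) on the same strand and merges neighbouring hits whose separation is below the threshold in both coordinate systems. This is one possible reading of the rule, not RISCI's actual implementation, and it does not check colinearity.

```python
from typing import List, Tuple

Hit = Tuple[int, int, int, int]  # (q_start, q_end, s_start, s_end)


def interval_gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance between two intervals; 0 when they touch or overlap."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def merge_hits(hits: List[Hit], max_separation: int = 50) -> List[Hit]:
    """Merge hits that are close in both query and subject coordinates."""
    merged: List[Hit] = []
    for hit in sorted(hits):  # sorted by query start
        if merged:
            prev = merged[-1]
            q_gap = interval_gap(prev[0], prev[1], hit[0], hit[1])
            s_gap = interval_gap(prev[2], prev[3], hit[2], hit[3])
            if q_gap < max_separation and s_gap < max_separation:
                merged[-1] = (min(prev[0], hit[0]), max(prev[1], hit[1]),
                              min(prev[2], hit[2]), max(prev[3], hit[3]))
                continue
        merged.append(hit)
    return merged


if __name__ == "__main__":
    hits = [(1, 100, 1001, 1100), (120, 200, 1130, 1210), (400, 500, 5000, 5100)]
    print(merge_hits(hits))
```

In the example the first two hits are 20 bp apart on the query and 30 bp apart on the subject, so they are merged; the third hit is too far away and stays separate.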
______________________________________________________________________
FAST - blast -v option set to 2, a maximum of 100 blast hits (in each orientation) compared.
MEDIUM - blast -v option set to 3, a maximum of 500 blast hits (in each orientation) compared.
SLOW - blast -v option set to 5, a maximum of 10,000 blast hits (in each orientation) compared.
21. STOP AT FIRST MATCH OR ALL AGAINST ALL COMPARISONS (1/2) :
If the first option (stop at first match) is selected, the comparison of upstream and downstream hits stops the moment a match to any of the alignment signatures in RISCI is found. To avoid orientation bias, the control shifts from plus to minus every 15 hits. However, in such a case duplications cannot be identified and the scoring scheme becomes redundant.
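One way to picture the shift between orientations every 15 hits is an iterator that hands out plus-strand and minus-strand hits in alternating blocks. The sketch below is only an interpretation of that sentence; the block handling, the function name and the data layout are assumptions, and RISCI's own control flow may differ.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def alternate_orientations(plus_hits: Iterable[T], minus_hits: Iterable[T],
                           block: int = 15) -> Iterator[Tuple[str, T]]:
    """Yield hits in alternating blocks: `block` plus hits, `block` minus hits, ..."""
    sources = [("+", iter(plus_hits)), ("-", iter(minus_hits))]
    exhausted = set()
    turn = 0
    while len(exhausted) < len(sources):
        strand, source = sources[turn % 2]
        chunk = list(islice(source, block))
        if len(chunk) < block:
            exhausted.add(strand)  # nothing more to read on this strand
        for hit in chunk:
            yield strand, hit
        turn += 1


if __name__ == "__main__":
    ordering = list(alternate_orientations(range(20), range(100, 103)))
    print(ordering[:3], ordering[15:19])
```

A "stop at first match" search would then walk this ordering and return at the first hit that satisfies an alignment signature, so neither orientation is examined exhaustively before the other.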
______________________________________________________________________
22. INPUT RUN CONFIRMATION MODULES (YES/NO) (default YES) :
(Runs the confirmation modules for 3' and 5' flank transductions, 3' flank transductions concurrent with insertion mediated deletions, and INDELS.)
----------------------------------------------------------------------
RISCI output is saved in a directory (folder) with the project name (e.g. L1HS), as input by the user. The main folder (L1HS) contains one folder each for the main genome and the comparative genomes (HuRef, Chimp and Celera here). Each comparative genome folder contains a *_RESULTS folder (L1HS_FL_RESULTS), which in turn contains a *.RISCI folder (chr1.RISCI ... chr4.RISCI) for each chromosome analyzed, holding the individual result files for that chromosome ('RESULTS LOG', 'FINAL RESULTS' and 'NO MATCH FOUND'), and a RESULTS folder, which contains the compiled result files for all chromosomes. The details of these files are given below.

1. RESULTS_LOG_* - This is the raw result file. The file is named 'RESULTS_LOG_' suffixed with the name of the comparative genome, e.g. RESULTS_LOG_Chimp. The file is tab separated and may be opened in Excel. It has several columns, each defining a RISCI parameter. The details of each column are as follows -
Column A - Repeat name
Column B - RISCI annotation
Column C - RISCI-score
Column D - Contig in the main genome (in case of assembled chromosome data - NA, Not Applicable)
Columns E and F - start and end coordinates of the repeat in the main genome
Columns G-I - genic annotation (genic/intergenic; if genic, exonic or intronic) of the repeat locus in the main genome
Columns J-L - genic annotation of the identified orthologous locus in the comparative genome
(Please note that the exonic or intronic location of the repeat in the main genome, or of the identified orthologous locus in the comparative genome, is based on the CDS coordinates in the annotation files.)
Column M - repeat length
Column N - percentage repeat content in the specified length of flanks
Column O - chromosomal location of the orthologous locus in the comparative genome
Columns P and Q - A and AT scores
Column R - orientation of the orthologous locus in the comparative genome
Column S - orthologous locus contig in the comparative genome
Columns T-Z - Blast details of the upstream (5') flank: E-value (column T), query length (column U), non repeat tag for the flank (column V), start and end query coordinates (columns W and X), start and end subject coordinates (columns Y and Z)
Columns AA-AF - Blast details of the downstream (3') flank: E-value (column AA), query length (column AB), start and end query coordinates (columns AC and AD), start and end subject coordinates (columns AE and AF)
Column AG - length of the target site duplication, or of the sequence separating the upstream and downstream flanks at the orthologous locus
Columns AH and AI - 5' and 3' TSD sequence
Column AJ - TSD sequence in the comparative genome (Please note that REFER_INDEL_FOLDER refers to the INDEL folder (L1HS_FL_INDEL_SEQUENCES) in the comparative genome folders, in which the retrieved INDEL, OCCUPIED or RECOMBINED sequence is saved with the repeat name.)
Columns AK and AL - N-score and N-positions in the retrieved INDEL, OCCUPIED or RECOMBINED sequence
Abbreviations: CDS_NA - CDS not available, NC - Not comparable, NMF - No match found, TSD - Target site duplication, SFC - Subject first coordinate, SLC - Subject last coordinate, QFC - Query first coordinate, QLC - Query last coordinate.

2. RESULTS_* - This file contains the parsed results from the RESULTS_LOG file. The names of all fragments of a disrupted repeat are concatenated and marked by a "!" suffix. Likewise, the constituents of a twin priming event are concatenated and suffixed by "*". The true RISCI annotation of each element, with the appropriate upstream and downstream blast coordinates, is mentioned. The frequency of elements in each RISCI class is given as a footnote.

3. RESULTS_SUMMARY_* - This file is not as descriptive as the RESULTS LOG or RESULTS file, and the order of the columns is shuffled so as to highlight the findings. The first column gives the name of the repeat, the next fifteen columns describe the nature of the identified orthologous locus in the comparative genome, and the last six columns describe the nature of the transposon locus in the main genome.

If the confirmation modules are run, the following additional files are formed -

4. 3PTS_RESULTS - tabulates the results of the 3' flank transduction confirmation module. The putative transduced sequence (PTS) is displayed in embl format, followed by the RepeatMasker annotation of the PTS, the number of Blast hits obtained in the main and the comparative genome, the most probable target and source loci in the main genome, and the prospective source loci in the comparative genome.

5. 5PTS_RESULTS - tabulates the results of the 5' flank transduction confirmation module. The file is similar to COMPILED_3PTS_RESULTS.

6. INDEL_3PTS_RESULTS - tabulates the results for 3' flank transductions predicted by RISCI to occur concurrently with insertion mediated deletions, for loci with an N-score less than 10. The file is similar to COMPILED_3PTS_RESULTS.

7. INDEL BLAST2 RESULTS WITH REPEATMASKER ANNOTATION - tabulates the results of the pairwise alignment between the repeat sequence in the main genome and the INDEL sequence retrieved from the comparative genome, together with the RepeatMasker annotation of the retrieved sequence.
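Because the RESULTS_LOG columns are described by spreadsheet letters, a small helper that converts a letter to a zero-based field index makes the tab separated file easier to read programmatically. The sketch below is a hypothetical reader, not shipped with RISCI: the field labels are made up for the example, only a few columns are mapped, and any header line in the file is assumed to be handled by the caller.

```python
import csv
from typing import Dict, Iterable, List


def column_index(letters: str) -> int:
    """Convert a spreadsheet column label (A, B, ..., Z, AA, ...) to a 0-based index."""
    index = 0
    for char in letters.upper():
        if not "A" <= char <= "Z":
            raise ValueError(f"invalid column label: {letters!r}")
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


# A few columns from the description above; the labels are illustrative.
COLUMNS = {
    "repeat_name": "A",
    "risci_annotation": "B",
    "risci_score": "C",
    "repeat_start": "E",
    "repeat_end": "F",
    "repeat_length": "M",
    "tsd_length": "AG",
}


def read_results_log(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Read tab separated lines and pick out the mapped columns from each row."""
    rows = []
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields:
            continue
        row = {}
        for name, letter in COLUMNS.items():
            position = column_index(letter)
            row[name] = fields[position] if position < len(fields) else ""
        rows.append(row)
    return rows
```

An open file handle can be passed directly, e.g. read_results_log(open("RESULTS_LOG_Chimp")), since csv.reader accepts any iterable of lines.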
----------------------------------------------------------------------
The inputs for SPECIFIC REGION ANALYSIS are similar to those for WHOLE GENOME ANALYSIS. Specific inputs required for this module include
1. CONTIG NAME (if a contigwise assembly is used, or the accession number of the chromosome file if assembled chromosomes are used)
2. START COORDINATE
3. END COORDINATE
----------------------------------------------------------------------
Apart from the speed options inbuilt in RISCI, processing speed can also be increased up to n times by running RISCI in parallel for each comparative genome (n = number of comparative genomes). For parallel processing,
1. Open 'run.pl' and comment out lines 28, 44, 57 and 70 (add a # symbol in front of each line).
2. Run risci.pl (type perl risci.pl and press enter). The user inputs to RISCI are written to a file 'INPUTFILE' tagged with a number of at least 6 digits. The input file name (e.g. INPUTFILE_123456) is displayed at the beginning and end of the RISCI inputs. INPUTFILE_123456 contains the names of all comparative genomes (e.g. Chimpanzee, Celera and HuRef).
After risci.pl has been run,
3. Remove from INPUTFILE_123456 all lines starting with COMPGENOME, except the one for the comparative genome for which RISCI is to be run in the current terminal.
4. Open tsd.pl, add the line $inputfile="INPUTFILE_123456"; below the comment line "Collect arguements from the input file", and execute the program by entering perl tsd.pl.
Open as many terminals as the number of comparative genomes and repeat steps 3 and 4 in each.
----------------------------------------------------------------------
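Step 3 can be done by hand, or the per-genome variants of the input file can be prepared with a short script. The sketch below assumes only what is stated above, namely that comparative genome lines start with COMPGENOME; the rest of the INPUTFILE layout shown in the example is invented for illustration, and the function name is hypothetical.

```python
from typing import List


def split_input_by_genome(lines: List[str], marker: str = "COMPGENOME") -> List[List[str]]:
    """Return one copy of the input lines per comparative genome.

    Each copy keeps every non-COMPGENOME line plus exactly one COMPGENOME line,
    in the original order.
    """
    genome_positions = [i for i, line in enumerate(lines) if line.startswith(marker)]
    variants = []
    for keep in genome_positions:
        variant = [line for i, line in enumerate(lines)
                   if not line.startswith(marker) or i == keep]
        variants.append(variant)
    return variants


if __name__ == "__main__":
    # Invented file content; only the COMPGENOME prefix comes from the README.
    example = [
        "PROJECT L1HS",
        "COMPGENOME Chimpanzee",
        "COMPGENOME Celera",
        "COMPGENOME HuRef",
        "FLANK 5000",
    ]
    for variant in split_input_by_genome(example):
        print(variant)
```

Each variant could then be written to its own copy of the input file and referenced from tsd.pl in a separate terminal, as in step 4.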
For any queries, suggestions, bugs or problems regarding RISCI, contact vipin@ccmb.res.in or ashvip@gmail.com
----------------------------------------------------------------------
Thanks for visiting RISCI.