Readme

RISCI - 'Repeat Induced Sequence Changes Identifier' - Readme
1. Introduction
2. Types of Analysis
3. RISCI filters
4. RISCI prerequisites
5. Installing RISCI
6. Executing RISCI
7. RISCI INPUTS
   a. WHOLE GENOME ANALYSIS
   b. SPECIFIC REGION ANALYSIS
8. RISCI output
9. RISCI ANNOTATIONS
10. RISCI - SPEED OPTIONS
11. RISCI - Parallel processing
12. PROBLEMS - contact us
Introduction

'RISCI' - Repeat Induced Sequence Changes Identifier - is a comprehensive, comparative genomics-based, in silico subtractive hybridization pipeline. It uses specific alignment signatures to identify differential insertions and the associated subtle sequence changes, such as target site duplications (TSD), 3' and 5' flank transductions, insertion mediated deletions, recombination mediated deletions and polymorphism induced by transposons. It emulates subtractive hybridization in the sense that it reports the sequence changes associated with a transposon insertion only when the locus is differential, i.e. the change is exclusive to one of the two genomes under comparison.

RISCI picks up the repeat locus from a given genome (Main or Reference genome), zooms into the orthologous locus in one or more genomes (Comparative genomes) of the same or a closely related species, and reports whether the insertion is differential or not. When it is differential, RISCI also reports the additional subtle sequence changes brought about by the transposon insertion in the main or comparative genome(s), which may then be studied for their downstream effects. When different genomes of the same species are compared, all identified differential insertions represent polymorphic sites.

RISCI also integrates the genomic context (genic or intergenic; if genic, exonic or intronic) of the repeat locus in the main genome and of the identified orthologous locus in the comparative genome(s), provided the annotation files are available.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Types of analysis

RISCI offers two types of analysis.

1. WHOLE GENOME ANALYSIS - All occurrences of a particular repeat type (e.g. L1HS) or repeat class (e.g. L1) in the main genome are analyzed. The whole genome analysis is done chromosome-wise, and the user may restrict it to one or more chromosomes even when whole genome analysis is selected.

2. SPECIFIC REGION ANALYSIS - A specific region of the main genome (e.g. chromosome 2, 11000 - 20000) is analyzed for a particular repeat type (e.g. L1HS) or repeat class (e.g. L1) input by the user.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
RISCI Filters

There are two inbuilt filters in RISCI to refine the user query.

1. LENGTH FILTER (FULL LENGTH OR TRUNCATED OR BOTH) - Allows filtering on the basis of the length of the repeat (full length or truncated). It should not be used for repeats of variable length, e.g. SVAs, whose length varies because of their inherent variable number tandem repeats.

2. GENE FILTER (GENIC OR INTERGENIC OR BOTH) - The gene filter is subject to the availability of annotation files (.gbs or .gbk files).

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
RISCI Prerequisites

1. Local installation of standalone Legacy Blast.

2. Blast databases of the main and comparative genomes.
* RISCI uses fastacmd to retrieve sequence from the main and comparative genomes. Therefore, while using formatdb to create the blast databases, ensure that the -o option is set to 'T'. All blast databases of the main and comparative genomes should be stored in the same folder, the path of which is required as input.
* The names of the blast databases should correspond to the names of the main and comparative genomes input by the user.

3. Local installation of RepeatMasker, if RISCI_BLAST is used as the repeat mining option or the confirmation modules are run.

4. RepeatMasker files of the main genome, if RISCI_RM is used for repeat mining.

5. Local installation of EMBOSS modules.

6. Local installation of Perl.

Optional - Annotation files (gbs or gbk) of the main genome for the gene filter.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
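As an illustration of prerequisite 2, the sketch below builds (but does not run) the legacy Blast command lines for creating a nucleotide database with parsed sequence IDs and for fetching a sub-sequence from it. This is not part of RISCI; the file name Human.fa, the database name Human and the helper function names are assumptions made for the example, and the exact fastacmd call that RISCI issues internally is not documented here.

```python
import shlex


def formatdb_command(fasta_path, db_name):
    """Build a legacy-Blast formatdb command for a nucleotide genome.

    -p F : the input is nucleotide sequence
    -o T : parse sequence IDs, needed so fastacmd can retrieve by identifier
    -n   : database name (should match the genome name given to RISCI)
    """
    return ["formatdb", "-i", fasta_path, "-p", "F", "-o", "T", "-n", db_name]


def fastacmd_command(db_name, seq_id, start, end):
    """Build a fastacmd command that fetches seq_id[start..end] from db_name."""
    return ["fastacmd", "-d", db_name, "-s", seq_id, "-L", f"{start},{end}"]


if __name__ == "__main__":
    # Hypothetical file and database names, for illustration only.
    print(shlex.join(formatdb_command("Human.fa", "Human")))
    print(shlex.join(fastacmd_command("Human", "NC_000001", 5277318, 5283348)))
```

The commands are returned as argument lists so they can be passed to subprocess.run without shell quoting issues.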

Installing RISCI

RISCI can be downloaded from http://www.ccmb.res.in/rakeshmishra/tools.html as a compressed file (RISCI.tar.gz). Unzip the file after downloading. A sample run of RISCI is also available (L1HS.tar.gz); it gives access to the data mentioned in the manuscript and familiarizes users with the RISCI directory structure. The result files may be opened in WordPad or Excel.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Executing RISCI	RISCI can be executed by typing perl risci.pl on the terminal and feeding the required inputs, in the directory where it has been unzipped. -------------------------------------------------------------------------------------------------------------------------------------------------------------------
RISCI INPUTS

RISCI inputs vary slightly with the type of analysis. Most inputs are common to both types, as listed below.
WHOLE GENOME ANALYSIS

1. NAME OF THE PROJECT : The main directory formed by RISCI carries this name.

2. DIRECTORY PATH FOR DIRECTORY AND SUBDIRECTORIES : The directory in which the RISCI directories and subdirectories will be formed.

3. ANALYSIS TYPE - 1 for WHOLE GENOME, 2 for SPECIFIC REGION ANALYSIS (default 1) :

4. NAME OF THE MAIN GENOME : The name should correspond to the blast database name of the genome. Only one main genome is allowed per run.

5. NATURE OF MAIN GENOME SEQUENCE 1- ASSEMBLED CHROMOSOMES 2- CONTIG WISE (default 1) :

6. PATH FOR gbs FILES OF THE MAIN GENOME if available else press enter :

7. NAME OF THE COMPARATIVE GENOME(s) (separated by commas if more than one) : The names should correspond to the blast database name(s).

8. PATH FOR gbs FILES OF THE COMPARATIVE GENOME if available else press enter : Appears as many times as the number of comparative genomes input.

9. REGION TO BE ANALYZED - GENIC / INTERGENIC / BOTH (1/2/3) : Asked only if the path for the gbs files of the main genome has been input earlier.

10. REPEAT MINING OPTION - RISCI_RM OR RISCI_BLAST OR RISCI_NONRM 1/2/3 (default 1) : RISCI offers three different modules for repeat mining from the main genome.

a) RISCI_RM - The repeat coordinates are read directly from the files premasked by RepeatMasker. If chosen, the path for the RepeatMasker files of the main genome also needs to be input.

b) RISCI_BLAST - If premasked files from RepeatMasker are not available, this option may be used for repeat mining. RISCI_BLAST requires prior installation of RepeatMasker and additional inputs, which include -

i) INPUT SIGNATURE REPEAT SEQUENCE : A 19 to 22 bp oligomer specific to the repeat, preferably towards the 3' end of the repeat.

ii) UPSTREAM SEQUENCE RANGE SO AS TO RETRIEVE ENTIRE REPEAT AND SIZABLE FLANK (5KB) : The input depends on the relative position of the signature sequence on the transposon. E.g. if the signature sequence is located at the extreme 3' end of L1 (full length 6 kb), then to extract 5000 bp of upstream flank for a full length L1, the range must cover both the 6 kb element and the 5 kb flank; an input of 12000 bp does so.

iii) DOWNSTREAM SEQUENCE RANGE TO RETRIEVE ENTIRE REPEAT AND SIZABLE FLANK : As above, the input depends on the relative position of the signature sequence in the repeat.

c) RISCI_NONRM - The user may directly input the list of repeat coordinates using this option. The format of the coordinates list is -

Unique Repeat ID (space) Contig (space) Repeat start coordinate (space) Repeat end coordinate (space) Repeat orientation (+/C)

e.g.

L1HS_1_1 NC_000001 5277318 5283348 +
L1HS_1_2 NC_000001 6451604 6457635 C
AluYa5_1_3 NC_000001 1000000 1000300 +

The coordinate list should be named "<name of the project as input in 1>_COORDINATES", e.g. if the name of the project (input 1) is RANDOM, the coordinate list is named "RANDOM_COORDINATES". Finally, place the list in the appropriate chromosome directory within the COORDINATES directory in the main genome directory, e.g. if the name of the project is "RANDOM", the main genome is "Human" and the repeat coordinates are from chromosome 1, place the "RANDOM_COORDINATES" file in the following path - /RANDOM/Human/RANDOM_COORDINATES/chr1.RISCI/

11. NAME OF THE REPEAT AS IDENTIFIED IN RepeatMasker : Asked if the repeat mining choice is 1 or 2. All repeats starting with the same prefix as the name input will be picked up. Thus if 'L1HS' is input, all L1HS repeats will be picked up for analysis. On the other hand, if 'L1' is input, all repeats starting with L1 (e.g. L1HS, L1PA1, L1PA3 etc.) will be picked up.

12. ANALYZE FULL LENGTH OR TRUNCATED or BOTH (1/2/3) : If the input is 1 or 2, an additional input is required -

i) CUT OFF LENGTH FOR FULL LENGTH AND TRUNCATED :

13. FLANK SEQUENCE LENGTH (default 5000) : Length of the flanks to be considered for identifying the orthologous locus in the comparative genome.

14. PREFIX FOR CHROMOSOME FILES e.g. chr or ref_chr : Ensure that the RepeatMasker files of the main genome and the gbs files of the main and comparative genomes have the same prefix, e.g. chr1.fa.out for the RepeatMasker file and chr1.gbs for the Genbank summary file.

15. NAME(s) OF CHROMOSOME(s) separated by commas (e.g. 1,2,3,4, .... X,Y) : If the nature of the main genome sequence is assembled chromosomes, an additional input is required -

i) ACCESSION NUMBERS FOR CHROMOSOME FILES OF THE MAIN GENOME corresponding to the chromosomes input above : e.g. NC_000001,NC_000002 ... NC_000024 for chromosome 1, chromosome 2 ... chromosome Y respectively of the reference human genome. Please note that the input here varies according to the nature of the sequence files. If the sequence files are downloaded from NCBI, they carry an accession number, e.g. the sequence file for chromosome 1 from NCBI (NCBI Build ...) carries the header >gi|89161185|ref|NC_000001.9|NC_000001. However, if the sequence is downloaded from the UCSC genome browser (Hg ..), the input will be chr1,chr2 ...

16. PATH FOR BLAST DATABASE OF GENOMES :

17. MAXIMUM TARGET SITE DUPLICATION SIZE (default 60) :

18. MINIMUM LENGTH OF NON REPEAT TAG AT THE 5' END OF UPSTREAM QUERY SEQUENCE (default 500) : Allows tagging the upstream query sequence with a user defined minimum length of non repeat sequence at the 5' end, if possible.

19. MERGE HITS (Yes/No) (default Yes) : Since the genomes under comparison may have diverged from each other over a substantial time period, the overall local similarity may be retained while the similarity in certain regions falls below the threshold, resulting in small alignment breaks. Such alignment breaks may be merged using the merger option. If the input is "Yes", an additional input is required -

i) MAXIMUM SEPARATION ALLOWED FOR HITS TO BE MERGED (default 50) : Merges BLAST hits in the comparative genome if they are separated from each other by less than the length input, both in query coordinates and in subject coordinates.

20. INPUT SPEED - FAST, MEDIUM, SLOW - 1/2/3 (default 3) : RISCI offers three speed options.
FAST - blast -v option set to 2, a maximum of 100 blast hits (in each orientation) compared.
MEDIUM - blast -v option set to 3, a maximum of 500 blast hits (in each orientation) compared.
SLOW - blast -v option set to 5, a maximum of 10,000 blast hits (in each orientation) compared.

21. STOP AT FIRST MATCH OR ALL AGAINST ALL COMPARISONS (1/2) : If the first option (stop at first match) is selected, the comparison of upstream and downstream hits stops the moment a match to any of the alignment signatures in RISCI is found. To avoid orientation bias, the control shifts between plus and minus hits every 15 hits. However, since only one match is considered with this option, duplications cannot be identified and the scoring scheme becomes redundant.

22. INPUT RUN CONFIRMATION MODULES (YES/NO) (default YES) : Runs the confirmation modules for 3' and 5' flank transductions, 3' flank transductions concurrent with insertion mediated deletions, and INDELS.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
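The MERGE HITS rule described in the inputs above (hits are merged when they are separated by less than the given length in both query and subject coordinates) can be sketched as follows. This is a minimal illustration and not RISCI's own code: the Hit class and merge_hits function are hypothetical names, and the sketch assumes plus-strand hits that do not overlap.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit, plus-strand only in this sketch (hypothetical structure)."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def merge_hits(hits, max_gap=50):
    """Merge neighbouring hits whose query gap AND subject gap are both < max_gap."""
    merged = []
    for hit in sorted(hits, key=lambda h: h.q_start):
        if merged:
            last = merged[-1]
            q_gap = hit.q_start - last.q_end
            s_gap = hit.s_start - last.s_end
            # Assumption: only non-overlapping, co-linear hits are merged.
            if 0 <= q_gap < max_gap and 0 <= s_gap < max_gap:
                merged[-1] = Hit(last.q_start, hit.q_end, last.s_start, hit.s_end)
                continue
        merged.append(hit)
    return merged


if __name__ == "__main__":
    hits = [Hit(1, 100, 1001, 1100), Hit(120, 300, 1125, 1305), Hit(900, 1000, 5000, 5100)]
    for h in merge_hits(hits):
        print(h)
```

In the example, the first two hits are separated by 20 bp on the query and 25 bp on the subject, so they merge under the default of 50; the third hit is too far away and stays separate.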

RISCI output

RISCI output is saved in a directory (folder) named after the project (e.g. L1HS), as input by the user. The main folder (L1HS) contains one folder for the main genome and one for each comparative genome (HuRef, Chimp, Celera here). Each comparative genome folder contains a *_RESULTS folder (e.g. L1HS_FL_RESULTS), which in turn contains a *.RISCI folder for each chromosome analyzed (chr1.RISCI ... chr4.RISCI), holding the individual result files for that chromosome ('RESULTS LOG', 'FINAL RESULTS' and 'NO MATCH FOUND'), and a RESULTS folder, which contains the compiled result files for all chromosomes. The details of these files are given below.
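Given the folder layout described above, the raw result files can be collected with a short script. The sketch below is not part of RISCI; it assumes only that the raw result files are named with the RESULTS_LOG_ prefix, and the function name find_result_logs is hypothetical.

```python
from pathlib import Path


def find_result_logs(project_dir):
    """Return every file named RESULTS_LOG_* anywhere under project_dir, sorted."""
    return sorted(p for p in Path(project_dir).rglob("RESULTS_LOG_*") if p.is_file())


if __name__ == "__main__":
    # "L1HS" is the project folder of the sample run; adjust the path as needed.
    for path in find_result_logs("L1HS"):
        print(path)
```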

1. RESULTS_LOG_* - This is the raw result file. The file is named 'RESULTS_LOG_' suffixed with the name of the comparative genome, e.g. RESULTS_LOG_Chimp. The file is tab separated and may be opened in Excel.

This file has several columns, each column defining a RISCI parameter. The details of each column are as follows -

Column A - Repeat name
Column B - RISCI annotation
Column C - RISCI score
Column D - Contig in the main genome (NA, Not Applicable, in case of assembled chromosome data)
Columns E and F - Start and end coordinates of the repeat in the main genome
Columns G-I - Genic annotation (genic/intergenic; if genic, exonic or intronic) of the repeat locus in the main genome
Columns J-L - Genic annotation of the identified orthologous locus in the comparative genome
Column M - Repeat length
Column N - Percentage repeat content in the specified length of flanks
Column O - Chromosomal location of the orthologous locus in the comparative genome
Columns P and Q - A and AT scores
Column R - Orientation of the orthologous locus in the comparative genome
Column S - Orthologous locus contig in the comparative genome
Columns T-Z - Blast details of the upstream (5') flank: E-value (column T), query length (column U), non repeat tag for the flank (column V), start and end query coordinates (columns W and X), start and end subject coordinates (columns Y and Z)
Columns AA-AF - Blast details of the downstream (3') flank: E-value (column AA), query length (column AB), start and end query coordinates (columns AC and AD), start and end subject coordinates (columns AE and AF)
Column AG - Length of the target site duplication, or of the sequence separating the upstream and downstream flanks at the orthologous locus
Columns AH and AI - 5' and 3' TSD sequence
Column AJ - TSD sequence in the comparative genome
Columns AK and AL - N-score and N-positions in the retrieved INDEL, OCCUPIED or RECOMBINED sequence

Please note that the exonic or intronic location of the repeat in the main genome, or of the identified orthologous locus in the comparative genome, is based on the CDS coordinates in the annotation files.

Please note that REFER_INDEL_FOLDER refers to the INDEL folder (L1HS_FL_INDEL_SEQUENCES) in the comparative genome folders, in which the retrieved INDEL, OCCUPIED or RECOMBINED sequence is saved with the repeat name.

Abbreviations: CDS_NA - CDS not available, NC - Not comparable, NMF - No match found, TSD - Target site duplication, SFC - Subject first coordinate, SLC - Subject last coordinate, QFC - Query first coordinate, QLC - Query last coordinate.
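Because the columns of the RESULTS_LOG file are described by spreadsheet letters, a script that reads the tab separated file directly needs to convert those letters into positions. The following is a minimal sketch, not part of RISCI; the field names in FIELDS are hypothetical labels chosen for the example, and whether the file starts with a header row is not stated in this readme, so the reader returns every non-empty row as it is.

```python
import csv


def column_index(letters):
    """Convert a spreadsheet column label (A, ..., Z, AA, AB, ...) to a 0-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


# Hypothetical field names mapped to the column letters described above.
FIELDS = {
    "repeat_name": "A",
    "risci_annotation": "B",
    "risci_score": "C",
    "repeat_start": "E",
    "repeat_end": "F",
    "repeat_length": "M",
    "tsd_length": "AG",
}


def pick_fields(row, fields=FIELDS):
    """Extract the selected columns from one row; missing columns become ''."""
    picked = {}
    for name, letters in fields.items():
        i = column_index(letters)
        picked[name] = row[i] if i < len(row) else ""
    return picked


def read_results_log(path):
    """Yield one dict per non-empty row of a tab separated RESULTS_LOG file."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row:
                yield pick_fields(row)
```

For example, column_index("AG") gives the position of the TSD length column, so the same mapping can be reused with any other tab separated reader.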

2. RESULTS_* - This file contains parsed results from the RESULTS_LOG file. The names of all fragments of a disrupted repeat are concatenated and marked by a "!" suffix. Likewise, the constituents of a twin priming event are concatenated and suffixed by "*". The true RISCI annotation of each element is given, with the appropriate upstream and downstream blast coordinates. The frequency of elements in each RISCI class is given as a footnote.

3. RESULTS_SUMMARY_* - This file is not as descriptive as the RESULTS LOG or RESULTS file. The order of the columns is also changed so as to highlight the findings. Thus, the first column gives the name of the repeat, the next fifteen columns describe the nature of the identified orthologous locus in the comparative genome, and the last six columns describe the nature of the transposon locus in the main genome.

If the confirmation modules are run, the following additional files are formed -

4. 3PTS_RESULTS - tabulates the results of the 3' flank transduction confirmation module. The putative transduced sequence (PTS) is displayed in EMBL format, followed by the RepeatMasker annotation of the PTS, the number of Blast hits obtained in the main and the comparative genome, the most probable target and source loci in the main genome, and the prospective source loci in the comparative genome.

5. 5PTS_RESULTS - tabulates the results of 5' flank transduction confirmation module. The file is similar to COMPILED_3PTS_RESULTS.

6. INDEL_3PTS_RESULTS - tabulates the results of 3' flank transduction predicted by RISCI to occur concurrently with insertion mediated deletions for loci with N-score less than 10. The file is similar to COMPILED_3PTS_RESULTS.

7. INDEL BLAST2 RESULTS WITH REPEATMASKER ANNOTATION - tabulates the results of pairwise alignment between the repeat sequence in the main genome and the INDEL sequence retrieved from the comparative genome, and the RepeatMasker annotation of the retrieved sequence.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------

SPECIFIC REGION ANALYSIS

The inputs are similar to WHOLE GENOME ANALYSIS. Specific inputs required for this module include

1. CONTIG NAME (if a contig-wise assembly is used), or the accession number of the chromosome file (if assembled chromosomes are used)

2.START COORDINATE

3.END COORDINATE
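To illustrate what the three region inputs select, the sketch below parses repeat records in the space separated format used for RISCI_NONRM coordinate lists (repeat ID, contig, start, end, orientation) and keeps those lying within a contig and coordinate range. It is an illustration only: RISCI does this selection itself, and whether it requires a repeat to lie fully inside the region or merely to overlap it is not stated in this readme, so full containment is an assumption here. The names Repeat, parse_coordinate_line and in_region are hypothetical.

```python
from typing import NamedTuple


class Repeat(NamedTuple):
    repeat_id: str
    contig: str
    start: int
    end: int
    orientation: str  # "+" or "C"


def parse_coordinate_line(line):
    """Parse one 'ID contig start end orientation' record."""
    repeat_id, contig, start, end, orientation = line.split()
    return Repeat(repeat_id, contig, int(start), int(end), orientation)


def in_region(repeat, contig, region_start, region_end):
    """True if the repeat lies entirely within contig:region_start-region_end.

    Assumption: full containment is required; this is not confirmed by the readme.
    """
    return (
        repeat.contig == contig
        and region_start <= repeat.start
        and repeat.end <= region_end
    )


if __name__ == "__main__":
    records = [
        "L1HS_1_1 NC_000001 5277318 5283348 +",
        "L1HS_1_2 NC_000001 6451604 6457635 C",
    ]
    for record in records:
        repeat = parse_coordinate_line(record)
        print(repeat.repeat_id, in_region(repeat, "NC_000001", 5000000, 6000000))
```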

--------------------------------------------------------------------------------------------------------------------------------------------------------------------

RISCI ANNOTATIONS

Table 1 - RISCI ANNOTATIONS - RISCI annotates the transposon locus in the main genome or the ortholog in the comparative genome into several classes based on specific alignment signatures.
CLASS	RISCI ANNOTATIONS	DETAILS
SHARED ANCESTRY	OCCUPIED	The orthologous locus is occupied in the comparative genome suggesting shared ancestry (transposon insertion ancestral to divergence of sequences under comparison)
POST INSERTION CHANGES	C_DISRUPTED_M_INTER_RMD	There is either disruption of the repeat in the comparative genome or two elements in the main genome recombine, deleting the intervening sequence and a copy of the homologous sequence.
	C_INTER_RMD_M_DISRUPTED	Two elements in the comparative genome recombine, deleting the intervening sequence and a copy of the homologous sequence, or the repeat element in the main genome is disrupted.
	M_INTRA_RMD	Recombination within the element in the main genome
	C_INTRA_RMD	Recombination within the element in the comparative genome
ORTHOLOGOUS LOCUS EMPTY	CAN	The orthologous locus is empty with exclusive mobilization of the transposon sequence in the main genome (canonical transposition)
	PAC	The orthologous locus is empty with exclusive mobilization of the transposon sequence in the main genome. The 3' end is, however, misannotated. (canonical transposition)
	PTS	The orthologous locus is empty and the transposition in the main genome is non canonical with a 3' flank transduced (non canonical transposition)
INSERTIONS - DELETIONS	INDEL_CAN	The insertion of transposon in the main genome results in loss of some sequence. There is exclusive mobilization of the transposon sequence.
	INDEL_PAC	The insertion of transposon in the main genome results in loss of some sequence. There is exclusive mobilization of the transposon sequence. The 3' end is, however, misannotated.
	INDEL_PTS	The insertion of transposon in the main genome results in loss of some sequence in the main genome concurrent with 3' flank transduction.
NO MATCH FOUND	NMF/NC	The ortholog in the comparative genome is not identified.
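When post-processing result files, the table above can be held as a simple lookup from annotation to class, for example to tally how many elements fall into each class. The sketch below is not part of RISCI; the dictionary is transcribed from Table 1, while the names RISCI_CLASSES and class_counts and the "UNKNOWN" fallback label are assumptions made for the example.

```python
from collections import Counter

# Annotation -> class, transcribed from Table 1.
RISCI_CLASSES = {
    "OCCUPIED": "SHARED ANCESTRY",
    "C_DISRUPTED_M_INTER_RMD": "POST INSERTION CHANGES",
    "C_INTER_RMD_M_DISRUPTED": "POST INSERTION CHANGES",
    "M_INTRA_RMD": "POST INSERTION CHANGES",
    "C_INTRA_RMD": "POST INSERTION CHANGES",
    "CAN": "ORTHOLOGOUS LOCUS EMPTY",
    "PAC": "ORTHOLOGOUS LOCUS EMPTY",
    "PTS": "ORTHOLOGOUS LOCUS EMPTY",
    "INDEL_CAN": "INSERTIONS - DELETIONS",
    "INDEL_PAC": "INSERTIONS - DELETIONS",
    "INDEL_PTS": "INSERTIONS - DELETIONS",
    "NMF": "NO MATCH FOUND",
    "NC": "NO MATCH FOUND",
}


def class_counts(annotations):
    """Count annotations per class; anything not in Table 1 is counted as UNKNOWN."""
    return Counter(RISCI_CLASSES.get(a, "UNKNOWN") for a in annotations)


if __name__ == "__main__":
    print(class_counts(["CAN", "PTS", "INDEL_CAN", "OCCUPIED", "NMF"]))
```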

--------------------------------------------------------------------------------------------------------------------------------------------------------------------

RISCI - SPEED OPTIONS

Speed options in RISCI
Parameters	Fast	Medium	Slow
Blast -v	2	3	5
Maximum no of Blast HSPs compared	100 (in each orientation)	500 (in each orientation)	10000 (in each orientation)
Pros and cons	Fastest, least accurate	Fast, reasonably accurate	Most accurate

For each of the speed options (Fast, Medium and Slow), speed may be further enhanced by selecting for "STOP AT FIRST MATCH (SFM)" as opposed to "ALL AGAINST ALL COMPARISONS". SFM option stops further comparisons as soon as the first match conforming to any of the RISCI alignment signatures is found. To avoid orientation bias, the control shifts between plus and minus hits every 15 hits. The scoring scheme becomes redundant since only one match is allowed, and hence duplications cannot be identified with SFM option. Medium speed option with SFM off and merger on is recommended.
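The order in which hits are visited under the SFM option, switching between plus and minus orientation every 15 hits and stopping at the first match, can be sketched as below. This shows the visiting order only and is not RISCI's own code: the actual comparison pairs upstream and downstream hits against the alignment signatures, which is represented here by a caller-supplied is_match function. All names are hypothetical.

```python
from itertools import islice


def alternate_orientations(plus_hits, minus_hits, block=15):
    """Yield hits in blocks of `block`, switching between plus and minus each block."""
    iterators = [iter(plus_hits), iter(minus_hits)]
    exhausted = [False, False]
    turn = 0  # 0 = plus, 1 = minus
    while not all(exhausted):
        chunk = list(islice(iterators[turn], block))
        if len(chunk) < block:
            exhausted[turn] = True
        yield from chunk
        turn = 1 - turn


def first_match(hits, is_match):
    """Return the first hit accepted by is_match, or None (stop at first match)."""
    return next((hit for hit in hits if is_match(hit)), None)


if __name__ == "__main__":
    plus = [("+", i) for i in range(20)]
    minus = [("-", i) for i in range(5)]
    order = alternate_orientations(plus, minus)
    print(first_match(order, lambda hit: hit == ("-", 2)))
```

Because the generator is lazy, hits after the first match are never examined, which is where the speed gain of SFM comes from.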

--------------------------------------------------------------------------------------------------------------------------------------------------------------------

RISCI - parallel processing

Apart from the speed options inbuilt in RISCI, processing speed can also be increased n times by running RISCI in parallel for each comparative genome (n = number of comparative genomes).

For parallel processing,

1. Open 'run.pl' and comment out lines 28, 44, 57 and 70 (add a # symbol at the start of each line).

2. Run risci.pl (type perl risci.pl and press Enter).

The user inputs to RISCI are written to a file named 'INPUTFILE' tagged with a number of at least 6 digits. The input file name (e.g. INPUTFILE_123456) is displayed at the beginning and at the end of the RISCI inputs.

The INPUTFILE_123456 contains names of all comparative genomes (e.g. Chimpanzee, Celera and HuRef).

After risci.pl has been run,

3. Remove all lines starting with COMPGENOME except the one for which RISCI would be run in the current terminal, from the INPUTFILE_123456.

4. Open tsd.pl and add line -

$inputfile="INPUTFILE_123456";

below the comment line "Collect arguements from the input file"

and execute the program tsd.pl by entering perl tsd.pl

Open as many terminals as the number of comparative genomes and repeat steps 3 and 4 in each, keeping a different comparative genome every time.
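Step 3 can also be done with a small script instead of editing the file by hand. The sketch below keeps every line of the input file except the COMPGENOME lines that do not name the chosen genome. It is not part of RISCI, and it rests on an assumption: the exact layout of a COMPGENOME line is not shown in this readme, so the sketch only supposes that the genome name appears on the line as a separate word. The function name and the example lines are hypothetical.

```python
import re


def keep_one_comparative_genome(lines, genome):
    """Drop COMPGENOME lines that do not mention `genome`; keep all other lines."""
    kept = []
    for line in lines:
        if line.startswith("COMPGENOME"):
            # Assumption: the genome name is a whole word somewhere on the line.
            words = re.split(r"\W+", line.strip())
            if genome not in words:
                continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    # Hypothetical file content; the real INPUTFILE layout may differ.
    example = [
        "PROJECT L1HS\n",
        "COMPGENOME Chimpanzee\n",
        "COMPGENOME Celera\n",
        "COMPGENOME HuRef\n",
    ]
    print("".join(keep_one_comparative_genome(example, "Celera")), end="")
```

Work on a copy of INPUTFILE_123456 for each terminal, so that the original list of comparative genomes is still available for the next one.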

--------------------------------------------------------------------------------------------------------------------------------------------------------------------

Contact

For any queries, suggestions, bugs or problems regarding RISCI, contact vipin@ccmb.res.in or ashvip@gmail.com

--------------------------------------------------------------------------------------------------------------------------------------------------------------------

Thanks for visiting RISCI.