
SALANTO – SHUFFLING ALIGNMENT ANALYSIS TOOL

NINA SCHÜRMANN AND CHRISTIAN BENDER

CONTENTS

Background
Installation
Getting started with Salanto
Detailed Manual
    Generation of input data files
    Loading of input files
    Analysis of chimeric sequences
    Type assignment
    Precise type assignment
    Mutations
    Save data
References

BACKGROUND

The Salanto program was developed for the analysis of chimeric DNA or protein sequences created via (DNA) family shuffling. This high-throughput technology is based on the controlled fragmentation of multiple related genes, followed by re-assembly of full-length sequences based on partial (>50%) homologies via a self-priming PCR reaction (see reference Cohen 2001). Different applications in Salanto enable a fast, accurate and automated analysis of many chimeric sequences in parallel (e.g. from shuffled libraries). The program assigns chimeric sequences to their corresponding parental origin, thereby providing a basis for further investigating characteristics of single chimeras such as composition, sequence homology or crossover frequency. Salanto is thus a useful tool both for evaluating the quality of a DNA family shuffling approach (e.g. for shuffled libraries) and for the functional interpretation of interesting chimeras based on their sequence.

INSTALLATION

The platform-independent jar file requires a current Java runtime environment on the computer (Windows, Linux, or Mac) and is directly executable, so no separate installation is required. Simply download the jar file from the website (click ‘current Release’ on https://bitbucket.org/benderc/salanto/wiki/Home) and double-click the file.

GETTING STARTED WITH SALANTO

Below is the general workflow for quickly running the program:

1) Loading files

To load the fasta input file (see section ‘Generation of input data files’), press the button ‘browse for fasta files’. Make sure that the reference sequence pattern is defined in the text box ‘Reference sequences pattern’. If the program complains about missing reference sequences, check this input field again and make sure that the pattern shown is contained in the names of the sequences that are meant as references (see section ‘Loading of input files’ for more information).


2) Showing type assignments

The main purpose of Salanto is to generate the reference type assignments for each sequence position of all chimeric sequences. The assignment is computed automatically when a fasta file is loaded. To show the assignment, just click either the ‘Type assignment’ or the ‘Precise type assignment’ button. Details on the assignment types can be found in the corresponding sections in this manual.

3) Additional tools

Several additional statistics can be computed based on the different types of assignments. Simply click the buttons below ‘Type assignment’ or ‘Precise type assignment’ to generate statistics on clone composition, crossover frequencies or reference sequence similarities. The different methods are described in the type assignment sections later in this manual.

4) Exporting data

For each analysis, the output can be exported to a tab-delimited file by clicking the ‘Save display’ button. The resulting file can be opened in any spreadsheet application (for example Excel) for further analysis.


DETAILED MANUAL

GENERATION OF INPUT DATA FILES:

For the analysis of the generated chimeras with Salanto, sequences have to be aligned to the parental reference sequences using any alignment tool (e.g. ClustalX). Although these programs are quite accurate by now, it is important to check that all sequences are aligned correctly and to adjust misaligned parts manually. This is crucial for the correct analysis of chimeric sequences by Salanto, as an incorrect alignment might result in serious misinterpretation of the data. It is therefore also advisable to reassess the sequence alignment if you obtain strange or unexpected results using Salanto.

The alignment, saved in fasta format, serves as the input data file for Salanto. If you correct the alignment within the fasta file, make sure to keep the alignment frame (i.e. all sequences keep the same length, and positions not present in one of the aligned sequences are marked with a hyphen). An important requirement for the sequence names is that parental reference sequences can be distinguished correctly from chimeric sequences via a name pattern. For instance, parental Ago sequences all contain the prefix ‘Ago’ in their names, while the chimeric sequences contain the prefix ‘chim’. During the analysis in Salanto, the reference sequence pattern ‘Ago’ has to be given as an input parameter and is used to select the correct reference and chimeric sequences. See section ‘Loading of input files’ for more details.
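The alignment-frame requirement (equal length of all aligned sequences, hyphens for missing positions) can be checked before a file is loaded into Salanto. The following is a minimal Python sketch and not part of Salanto; the function names `read_fasta` and `check_alignment_frame` and the example sequences are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict {sequence name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()      # header line: start a new record
            records[name] = ""
        elif name is not None:
            records[name] += line        # sequence line: append to current record
    return records


def check_alignment_frame(records):
    """Return the names of sequences whose length differs from the first one."""
    if not records:
        return []
    lengths = {name: len(seq) for name, seq in records.items()}
    expected = next(iter(lengths.values()))
    return [name for name, n in lengths.items() if n != expected]


# Hypothetical example: 'chim1' lost one position during manual editing.
example = """>Ago1
CACC-TCGA
>Ago2
CAGCATCAA
>chim1
CACC-TCA
"""
print(check_alignment_frame(read_fasta(example)))
```

An empty result means that all sequences share the same length; any name that is listed should be re-checked in the alignment editor.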

In some cases it might be beneficial to exclude single clones to obtain a more accurate analysis of the remaining chimeras (e.g. if a single clone contains a long insertion or a disproportionately high mutation rate). Please keep in mind that removing the unwanted sequence from the fasta file is usually not sufficient, as the sequence might have had a major impact on the sequence alignment itself (e.g. a chimera with a large insertion not present in any of the parental sequences increases the homology of the other chimeras to their parental sequences, as these positions are identical in the remaining sequences). Thus, the remaining chimeric sequences must be realigned and saved as a new fasta file. An example showing the influence of large insertions on the analysis is given in section ‘Composition of clones’.


LOADING OF INPUT FILES

Before loading your fasta file containing the aligned chimeric and parental sequences, you must specify the ‘Reference sequences pattern’ (text box on the right side) to enable the program to distinguish between parental and chimeric sequences. For this purpose, all reference sequence names must include a common text pattern (e.g. a prefix or suffix) which is NOT present in the names of the chimeric sequences (e.g. ‘Ago’, if the parental sequences are named ‘Ago1’ and ‘Ago2’). To load the sequences, press the button ‘browse for fasta files’ and select the corresponding file, or type the path to the file into the text box to the left of the button. If you do not know the reference sequence pattern or made a mistake while typing it, Salanto will still load the file but assign the sequences incorrectly (or complain about not finding any reference or chimeric sequences). In this case, simply change the pattern specification and hit the ‘Reload fasta file’ button.
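To illustrate how a name pattern separates references from chimeras, here is a small Python sketch. It assumes plain, case-sensitive substring matching; how Salanto matches the pattern internally is not specified in this manual, and the function name `split_by_pattern` is hypothetical.

```python
def split_by_pattern(names, pattern):
    """Split sequence names into references (contain the pattern) and chimeras."""
    references = [n for n in names if pattern in n]
    chimeras = [n for n in names if pattern not in n]
    if not references:
        raise ValueError(f"no sequence name contains {pattern!r}")
    if not chimeras:
        raise ValueError(f"every sequence name contains {pattern!r}")
    return references, chimeras


names = ["Ago1", "Ago2", "chim1", "chim2"]
print(split_by_pattern(names, "Ago"))
```

A mistyped pattern such as ‘Agx’ leaves no references at all and raises an error in this sketch, which corresponds to the situation in which the pattern has to be corrected and the file reloaded.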

After the input file has been selected, the type assignments are performed automatically. Salanto also allows reversing the complete alignment by enabling the ‘reverse alignment’ check box. Thus, sequences can be analyzed in the forward as well as in the reverse direction. A different reading direction will have an effect on the ‘Type assignment’ as well as on the two additional applications ‘Composition of clones’ and ‘Distribution of ref.seq. per position’. In general, forward and reverse results will not differ much. The influence of the reading direction, however, increases with the length of so-called crossover regions (see section ‘Type assignment’ for an example).

ANALYSIS OF CHIMERIC SEQUENCES

On the right side of the Salanto window you can find different options to analyze your alignment data. After you have chosen one of the applications, the result is displayed in tabular form in the display region of the main window. To export this table, press the button ‘Save display’. The resulting file can be opened in any spreadsheet application (for example Excel) for further analysis (e.g. coloring of reference sequences, graphical illustration, calculations…).

The first task in analyzing chimeras is the assignment of single positions to the parental sequences. Salanto offers two different options (‘Type assignment’ and ‘Precise type assignment’), which are run automatically on loading an input file and which are described below. Briefly, the ‘Type assignment’ gives an indication of the most probable parental origin of each single position within a chimera and thereby makes it possible to evaluate the shuffling quality. In contrast, the ‘Precise type assignment’ illustrates the overall homology of a chimera to its parents, regardless of the primary origin of the sequence.

Based on these two different assignment types, further characteristics of single chimeras or the complete shuffling approach can be studied by additional applications.


TYPE ASSIGNMENT

Efficient shuffling approaches usually generate chimeras with several crossovers, meaning switches between two parental sequences (depicted in Figure 1). Typically, the precise location of the crossover cannot be determined, as it occurs during the reassembly PCR somewhere in the identical region between two non-identical domains (crossover region). For the ‘Type assignment’, we define the crossover location as the first position not identical to the preceding reference sequence.

parent 1    C G C G G G T C A C C T C A A C T A
parent 2    G G C G G G T C A G C T C A A C G C
chimera     C G C G G G T C A G C T C A A C G C

(Annotations in the original figure: ‘Crossover region’, ‘Crossover location as defined’, ‘identical regions’.)

FIGURE 1: EXAMPLE FOR CROSSOVER AFTER DNA FAMILY SHUFFLING

Positions are assigned the type of one parent until a crossover occurs. Notably, crossover regions are counted to different reference sequences depending on the reading direction (forward or reverse). Thus, the type assignment result changes slightly if the reversed alignment is used (see also the example in Figure 2). In our experience, the lengths of crossover regions were quite small compared to the lengths of regions between two crossovers, leading to rather small variations in the type assignments. However, longer crossover regions (e.g. present in highly homologous sequences) increase the influence of the reading direction on the type assignment.

parent 1    C A C C T C G A C T A C T G T G G G C T C G T C T A G C T A G C T T C G C C G G T G T G C
parent 2    C A G C T C A A C G C C T G T C G G C T A G T C T T T A T A G C T A C G C C G G T T T G T
chimera     C A C C T C G A C G C C T G T C G G C T A G T C T A G C T A G C T T C G C C G G T G T G T

(The original figure additionally marks the assignments obtained in ‘forward’ and ‘reverse’ direction.)

FIGURE 2: ILLUSTRATION OF THE EFFECT OF REVERSING THE DIRECTION DURING TYPE ASSIGNMENT. REGIONS ASSIGNED TO DIFFERENT PARENTS DEPENDING ON THE READING DIRECTION ARE HIGHLIGHTED WITH DASHED FRAMES.


TYPE ASSIGNMENT ALGORITHM

The details of the algorithm used for the assignment of a reference type (parental sequence) to each sequence position of each chimera are shown below in Figure 3.

Input:

    Chimeras:    C = { c_i : i ∈ 1…N },   c_i = { c_ik : k ∈ 1…K }
    References:  R = { r_j : j ∈ 1…M },   r_j = { r_jk : k ∈ 1…K }

Algorithm:

    Start with i = 0, k = 0.
    While i <= N: i = i+1 (select the chimera)
        While k <= K: k = k+1 (select position in sequence)
            hits = which(c_ik = r_jk)
                (find out which references j match chimera i at position k)
            If |hits| = 0 (no reference matches):
                types_i,k = mut
            Else if |hits| = 1 (exactly one reference matches: unambiguous hit):
                types_i,k = r_jk
            Else (|hits| != 1):
                If types_i,k-1 != (mut OR indet) AND types_i,k-1 = c_ik
                (type at previous position is neither mutation nor indetermined and matches the chimera):
                    types_i,k = r_j,k-1
                Else forwardSearch
                (look ahead and check if the first upcoming unambiguous state assignment matches the chimera):
                    If forwardSearch returns a Next Best State (NBS): types_i,k = NBS
                    Else: types_i,k = indet
    When k > K and i > N: FixIndet (‘backward search’)
        For each indet block: start at the end of the block and go to the beginning; for each position, replace the indet if its sequence matches the type immediately before the block.

FIGURE 3: TYPE ASSIGNMENT ALGORITHM FLOWCHART.


Assignments can be made unambiguously where the chimeric sequence originates from exactly one reference sequence. For ambiguous positions, a simple parsimonious strategy is used that tries to minimize the number of type switches (crossovers). In a forward traversal, all positions are inspected and the types are assigned where possible. Identical regions are assigned to the preceding reference type. After that, in a backward pass, every block of indetermined types (regions where several parents are possible, but the sequence is different from the preceding one) is inspected again in an attempt to resolve some of the indetermined states. Sequences which cannot be assigned to one parent are designated ‘indet’; sequences non-identical to any parent are termed ‘mut’ (for mutation).
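The forward traversal described above can be sketched in Python as follows. This is an illustrative reading of the rules in this manual, not Salanto's actual source code: the function names are hypothetical, the backward ‘FixIndet’ pass is omitted, and gap characters are treated like any other symbol. The example data are the sequences from Figure 2.

```python
from itertools import groupby

MUT, INDET = "mut", "indet"


def assign_types(chimera, references):
    """Forward pass: one type per alignment position of the chimera.

    references: dict {name: aligned sequence}, each as long as the chimera.
    """
    # hits[k]: all references identical to the chimera at position k
    hits = [[name for name, seq in references.items() if seq[k] == c]
            for k, c in enumerate(chimera)]
    types = []
    for k, h in enumerate(hits):
        if not h:                            # no reference matches
            types.append(MUT)
        elif len(h) == 1:                    # unambiguous hit
            types.append(h[0])
        elif k > 0 and types[k - 1] in h:    # continue the preceding type
            types.append(types[k - 1])
        else:                                # forward search for the next
            nbs = next((x[0] for x in hits[k + 1:] if len(x) == 1), None)
            types.append(nbs if nbs in h else INDET)   # unambiguous type
    return types


def runs(types):
    """Collapse a type list into (type, run length) pairs."""
    return [(t, len(list(g))) for t, g in groupby(types)]


refs = {
    "parent 1": "CACCTCGACTACTGTGGGCTCGTCTAGCTAGCTTCGCCGGTGTGC",
    "parent 2": "CAGCTCAACGCCTGTCGGCTAGTCTTTATAGCTACGCCGGTTTGT",
}
chimera = "CACCTCGACGCCTGTCGGCTAGTCTAGCTAGCTTCGCCGGTGTGT"

forward = assign_types(chimera, refs)
# Reverse reading direction: reverse all sequences, assign, then flip back.
reverse = assign_types(chimera[::-1],
                       {n: s[::-1] for n, s in refs.items()})[::-1]
print(runs(forward))
print(runs(reverse))
```

Under these assumptions, the forward direction yields runs of 9, 16, 19 and 1 positions (parent 1, parent 2, parent 1, parent 2), whereas the reverse direction yields runs of 7, 14, 21 and 3 positions. The number of crossovers is the same, but the identical crossover regions are attributed to the other parent.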

The result is shown in the display area (top: original alignment containing the sequence information of all references and chimeras; bottom: corresponding assignment). This table can be exported (‘Save display’) and opened in a spreadsheet application (for example Excel) for further analysis (e.g. coloring of reference sequences to better visualize the composition of single chimeras).

Since the type assignment describes the probable parental origin of single positions in chimeras, this analysis is well suited to evaluate the quality of the shuffling approach. Based on this analysis, several parameters important to measure shuffling efficiency can be calculated, for example the number of crossovers, the incorporation of parental sequences over all positions and the mutation rate. A detailed description of the applications mentioned is given below.

COMPOSITION OF CLONES

This application calculates the proportion of each chimera that is derived from each parental sequence (reference sequence), based on the type assignment. The result is shown in tabular form: the names of the analyzed sequences are listed in the first column, and the proportion of each parental sequence present in each chimera is indicated in the following columns. To examine whether the general distribution of parental sequences in chimeras meets the expectation, the mean value of incorporated parental sequence over all chimeras is calculated and displayed in the last row. As already mentioned above, chimeras containing large insertions affect the sequence alignment and thus have an impact on most Salanto analyses. An example showing the influence on the result of the application ‘Composition of clones’ is shown in Figure 4.


Alignment 1: insertion present in chimera 2

parent 1    C A C C T C G A C T A C T G T G - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - G G C T C G T C T A G C T A G C T T C G C C G G T G T G C
parent 2    C A G C T C A A C G C C T G T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - G G C T A G T C T T T A T A G C T A C G C C G G T T T G T
chimera 1   C A C C T C G A C G C C T G T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - G G C T A G T C T A G C T A G C T T C G C C G G T G T G T
chimera 2   C A C C T C G A C T A C T G T G A G C A A A T T A T G A T G T G T A A A A T T C T C T G T G T G G C T A G T C T T T A T A G C T T C G C C G G T G T G C

Alignment 2: chimera 2 removed, residual sequences realigned

parent 1    C A C C T C G A C T A C T G T G G G C T C G T C T A G C T A G C T T C G C C G G T G T G C
parent 2    C A G C T C A A C G C C T G T C G G C T A G T C T T T A T A G C T A C G C C G G T T T G T
chimera 1   C A C C T C G A C G C C T G T C G G C T A G T C T A G C T A G C T T C G C C G G T G T G T

Composition of chimera 1 calculated with different alignments:

                Alignment 1     Alignment 2
parent 1        37%             62%
parent 2        63%             38%

FIGURE 4: INFLUENCE OF LARGE SEQUENCE INSERTIONS ON THE CALCULATED COMPOSITION OF CHIMERAS.
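The percentages in Figure 4 can be retraced with a short Python sketch. The run lengths below were derived by hand from the forward type-assignment rules, under the assumption that the 31 gap columns of alignment 1 count as ordinary positions and continue the preceding type (parent 2); the function name `composition` is hypothetical.

```python
from collections import Counter


def composition(types, reference_names):
    """Percentage of alignment positions assigned to each reference."""
    counts = Counter(types)
    return {name: 100.0 * counts[name] / len(types)
            for name in reference_names}


names = ["parent 1", "parent 2"]

# Alignment 2 (45 positions): runs of 9, 16, 19 and 1 positions.
alignment_2 = (["parent 1"] * 9 + ["parent 2"] * 16
               + ["parent 1"] * 19 + ["parent 2"] * 1)

# Alignment 1 (76 positions): the 31 gap columns extend the parent 2 run.
alignment_1 = (["parent 1"] * 9 + ["parent 2"] * (16 + 31)
               + ["parent 1"] * 19 + ["parent 2"] * 1)

for label, types in [("Alignment 1", alignment_1), ("Alignment 2", alignment_2)]:
    shares = composition(types, names)
    print(label, {n: round(p) for n, p in shares.items()})
```

In both alignments, 28 positions of chimera 1 are assigned to parent 1; only the denominator changes from 45 to 76 positions, which is why the proportion of parent 1 drops from about 62% to about 37%.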

DISTRIBUTION OF REFERENCE SEQUENCES PER POSITION

This analysis is of particular interest for the quality control of shuffled libraries as local biases in the incorporation of parental sequences can be detected. The application uses the type assignment to calculate the parental distribution at each alignment position over all chimeras. The first column refers to the position within the alignment; the following columns indicate the parental distribution.
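As an illustration of the calculation, the following Python sketch computes the parental distribution per alignment position from a set of type assignments. The input data and the function name are hypothetical, and whether Salanto reports fractions, percentages or counts is not specified here.

```python
from collections import Counter


def distribution_per_position(assignments, reference_names):
    """assignments: one type list per chimera, all of equal length.

    Returns one row per alignment position:
    [position, fraction of reference 1, fraction of reference 2, ...].
    """
    rows = []
    for pos, column in enumerate(zip(*assignments), start=1):
        counts = Counter(column)
        rows.append([pos] + [counts[n] / len(column) for n in reference_names])
    return rows


# Hypothetical type assignments of four chimeras over four positions.
assignments = [
    ["p1", "p1", "p2", "p2"],
    ["p1", "p2", "p2", "mut"],
    ["p1", "p1", "p1", "p2"],
    ["p2", "p2", "p2", "p2"],
]
for row in distribution_per_position(assignments, ["p1", "p2"]):
    print(row)
```

A strong deviation from the expected share at individual positions (here, for example, position 1 with three of four chimeras assigned to ‘p1’) would point to a local bias in the library.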

NUMBER OF CROSSOVERS PER CLONE

This application reports the number of crossovers in single clones based on the type assignment. For an explanation of how a crossover is determined during the type assignment process, see the section ‘Type assignment’ and the description of the algorithm therein. Switches from a parental-derived sequence to a mutation are not taken into account. Switches to regions not clearly determined (indet) are counted analogously to switches between parental sequences. In some cases, indetermined sequences might contain an additional internal crossover. In our analyses, these events seemed to be very rare and were therefore neglected. However, the number of regions not clearly determined that contain a potential additional crossover generally increases with the number of parental sequences used in the shuffling approach. Therefore, in some cases it might be interesting to reexamine the crossover number by analyzing the alignment manually. Please keep in mind that the crossover number of a chimera on the DNA level usually exceeds the number calculated with the protein sequence, as more crossovers are detectable.
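The counting rules can be sketched in Python as shown below. This is one possible reading and not Salanto's implementation: positions labelled ‘mut’ are skipped before counting, and both entering and leaving an ‘indet’ region count as a switch. The manual does not state how leaving an ‘indet’ region is handled, so that part is an assumption; the function name is hypothetical.

```python
def count_crossovers(types, mut="mut"):
    """Count type switches along a chimera, ignoring mutation positions."""
    labels = [t for t in types if t != mut]
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


# A mutation inside a parent 1 region does not create a crossover.
print(count_crossovers(["p1", "p1", "mut", "p1", "p2"]))
# An indetermined region between two parents counts on entry and on exit.
print(count_crossovers(["p1", "indet", "p2"]))
```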

PRECISE TYPE ASSIGNMENT

The ‘Precise type assignment’ shows the general homology of a chimera to its parental sequences. Every position within a chimera is compared to the parental sequences and all possible parental ancestors are listed (in contrast to the ‘Type assignment’, where the most probable parental origin is chosen; see the corresponding section above). The original alignment containing the sequence information of all references and chimeras is shown in the upper part of the display region, whereas the corresponding assignment of each position within a chimera to the reference sequences is shown in the lower part.
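A minimal Python sketch of this idea is given below: for every position, all references identical to the chimera are collected. The sequences and the function name are hypothetical, and treating a gap as an ordinary symbol is an assumption.

```python
def precise_types(chimera, references):
    """For each alignment position, list all references matching the chimera."""
    return [[name for name, seq in references.items() if seq[k] == c]
            for k, c in enumerate(chimera)]


references = {"Ago1": "ACGT-A", "Ago2": "ACCTTA"}
for pos, hits in enumerate(precise_types("ACCTGA", references), start=1):
    print(pos, hits)
```

Positions with an empty list match no parent (a mutation), positions with one entry are unambiguous, and positions listing several parents lie in regions that are identical between those parents.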

SIMILARITY TO REFERENCE SEQUENCES

This application calculates the homology of a chimera to the parental reference sequences. If the parental sequences are highly identical, it might be interesting to focus only on non-identical positions for a better visualization of differences between single chimeras (see the next paragraph).

SIMILARITY TO REFERENCE SEQUENCES (ONLY DIFFERING POSITIONS)

In this similarity analysis, all alignment positions at which all sequences are identical are disregarded. In this way, the analysis focuses only on positions that differ in at least one of the sequences. Notably, the program does not discriminate between parental and chimeric sequences. Thus, if a chimera contains a mutation, this position will not be excluded from the analysis even though all other chimeric and parental sequences might be completely identical. In extreme cases (e.g. many mutations in one chimera due to a frame shift), this might lead to values resembling the results from the ‘Similarity to reference sequences’ analysis, as almost no position is excluded from the calculation. The number of identical and non-identical positions can be calculated with the ‘Number of differing positions’ application (see description in the next paragraph). It will give you an idea of whether it might be beneficial to repeat the analysis without the highly mutated chimera. In general, this analysis will highlight the differences between chimeras better than the ‘normal’ similarity analysis, especially if the parental sequences share a high sequence homology.
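The difference between the two similarity analyses can be illustrated with the sequences from Figure 2. The Python sketch below is an approximation under stated assumptions: similarity is taken as the percentage of identical alignment positions, gaps are treated as ordinary symbols, and the function names are hypothetical.

```python
def differing_positions(sequences):
    """Indices of alignment columns that are not identical in all sequences."""
    return [k for k, column in enumerate(zip(*sequences))
            if len(set(column)) > 1]


def similarity(chimera, reference, positions=None):
    """Percentage of the given positions at which both sequences agree."""
    if positions is None:
        positions = range(len(chimera))      # default: complete alignment
    positions = list(positions)
    same = sum(chimera[k] == reference[k] for k in positions)
    return 100.0 * same / len(positions)


parent_1 = "CACCTCGACTACTGTGGGCTCGTCTAGCTAGCTTCGCCGGTGTGC"
parent_2 = "CAGCTCAACGCCTGTCGGCTAGTCTTTATAGCTACGCCGGTTTGT"
chimera = "CACCTCGACGCCTGTCGGCTAGTCTAGCTAGCTTCGCCGGTGTGT"

diff = differing_positions([parent_1, parent_2, chimera])
for name, ref in [("parent 1", parent_1), ("parent 2", parent_2)]:
    print(name,
          round(similarity(chimera, ref), 1),        # all positions
          round(similarity(chimera, ref, diff), 1))  # only differing positions
```

For this example, 12 of the 45 alignment positions differ. Over all positions the chimera is about 88.9% and 84.4% identical to parent 1 and parent 2, whereas restricting the comparison to the differing positions spreads the values to about 58.3% and 41.7%.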

NUMBER OF DIFFERING POSITIONS

Based on the ‘Precise type assignment’, the number of identical and non-identical positions in the complete alignment (parental and chimeric sequences) is calculated. As already mentioned above, positions are also counted as non-identical if a chimera contains a mutation at this specific location. Thereby, the analysis gives a good indication of whether the similarity analyses (for all or only differing positions) will lead to different results. Please keep in mind that the number of positions in an alignment can be higher than the real number of bases or amino acids in a chimera. Thus, the similarity calculation based on the complete alignment might differ slightly from results obtained by direct comparison of a chimera to one of its parents.

DIFFERENCES BETWEEN SINGLE CHIMERAS

Another parameter for assessing the quality of the shuffling approach is the homology of the shuffled chimeras among each other. Based on the ‘Precise type assignment’, chimeras are compared to each other in pairs and the percentage of non-identical positions is shown in a matrix. Please keep in mind that this number might be higher than the real number of differing nucleotides or amino acids in the two chimeras, as alignments often contain insertions due to regions which cannot be aligned properly to each other. However, this analysis will identify identical or highly homologous clones in libraries and give a good impression of the variety.
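A pairwise comparison of this kind can be sketched in Python as follows. Note that this sketch compares the aligned sequences directly rather than the ‘Precise type assignment’, which is a simplification; the sequences and the function name are hypothetical.

```python
from itertools import combinations


def pairwise_differences(chimeras):
    """Symmetric matrix with the percentage of non-identical positions."""
    names = list(chimeras)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        seq_a, seq_b = chimeras[a], chimeras[b]
        differing = sum(x != y for x, y in zip(seq_a, seq_b))
        matrix[a][b] = matrix[b][a] = 100.0 * differing / len(seq_a)
    return matrix


chimeras = {
    "chim1": "ACGTACGT",
    "chim2": "ACGTACGA",
    "chim3": "ACGTACGT",
}
matrix = pairwise_differences(chimeras)
for name, row in matrix.items():
    print(name, row)
```

A value of 0 off the diagonal (here for ‘chim1’ and ‘chim3’) indicates identical clones in the library.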

MUTATIONS

This application lists all mutations found in the chimeric sequences. The positions of mutations identified in single chimeras are denoted in the respective column underneath the chimera name. Please keep in mind that the denoted number reflects the position of the mutation in the alignment, which might differ from the real location in the chimera (e.g. if the alignment contains shifts).
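The relation between the alignment position and the real position in the chimera can be made explicit with a small Python sketch. It assumes that a mutation is a position at which the chimera matches none of the references and that gaps are written as hyphens; the sequences and function names are hypothetical.

```python
def mutations(chimera, references):
    """1-based alignment positions at which the chimera matches no reference."""
    return [k + 1 for k, c in enumerate(chimera)
            if all(ref[k] != c for ref in references)]


def position_in_chimera(chimera, alignment_position, gap="-"):
    """Convert a 1-based alignment position to the ungapped chimera position."""
    return len(chimera[:alignment_position].replace(gap, ""))


references = ["AC--GTA", "AC--GTT"]
chimera = "AC--CTA"
for pos in mutations(chimera, references):
    print("alignment position", pos,
          "-> position in chimera", position_in_chimera(chimera, pos))
```

In this example the mutation is reported at alignment position 5, which corresponds to the third base of the ungapped chimera.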


SAVE DATA

All analyses performed with the Salanto program can be saved using the ‘Save display’ button. The data is exported as a tab-delimited text file and can be opened in any spreadsheet application (e.g. Excel, Gnumeric, LibreOffice, etc.).

REFERENCES

Jon Cohen: How DNA Shuffling Works. Science 13 July 2001: 293 (5528), 237. [DOI:10.1126/science.293.5528.237]

ACKNOWLEDGMENTS

We would like to thank Dirk Grimm, Stefanie Grosse and all lab members of the AG Grimm for helpful discussions and suggestions.
