Version 1.5.4

Preparing for sequence analysis

The sequence analysis programs in the Bsoft package require aligned sequences. Bsoft has no sequence alignment capability, so the alignment must be done beforehand with another program such as clustalw (see http://www.expasy.ch for extensive proteomics tools).

The sequence formats the Bsoft programs support are EMBL, PIR and Fasta. The recognition of the format is based on the file name extension: ".embl", ".pir" and ".fasta".
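
The extension rule can be sketched in a few lines of Python. This is only an illustration of the mapping described above, not Bsoft code: the function name sequence_format is hypothetical, and treating the extension case-insensitively is an assumption.

```python
from pathlib import Path

# Extension-to-format table taken from the text above.
FORMATS = {".embl": "EMBL", ".pir": "PIR", ".fasta": "Fasta"}


def sequence_format(filename):
    """Return the sequence format implied by the file name extension."""
    # Assumption: the extension is compared case-insensitively.
    extension = Path(filename).suffix.lower()
    if extension not in FORMATS:
        raise ValueError(f"unsupported sequence file extension: {extension!r}")
    return FORMATS[extension]


print(sequence_format("vp23.pir"))
```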

An example aligned sequence file is provided:

vp23.pir

Sequence identity

The "overlap" between two aligned sequences is defined as the number of positions in the alignment where both sequences have residues.
The "identity" between two aligned sequences is defined as the number of identical residues divided by the overlap, and is thus a fraction.
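
The two definitions can be written out as a minimal Python sketch. It is not the Bsoft implementation: the function name aligned_identity and the gap symbols "-" and "." are assumptions made for illustration.

```python
GAP_CHARS = "-."  # assumption: symbols that mark a gap in the alignment


def aligned_identity(seq1, seq2):
    """Return (identity, n_identical, overlap) for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    overlap = 0
    n_identical = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # at least one sequence has no residue here
        overlap += 1
        if a == b:
            n_identical += 1
    # Identity is the fraction of overlapping positions that are identical.
    identity = n_identical / overlap if overlap else 0.0
    return identity, n_identical, overlap


identity, n_identical, overlap = aligned_identity("MKT-LLV", "MRTALL-")
print(f"{identity:.3f} {n_identical} {overlap}")  # prints "0.800 4 5"
```

The three returned values correspond to the Identity, nID and Overlap columns in the bseq output: for example, 293 identical residues over an overlap of 318 gives an identity of 0.921.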

Example:

bseq -verbose 7 -identity vp23.pir

Part of the output:

Aligned identity analysis:

Seq1 Seq2 Identity nID Overlap Name1      Name2
2    1    0.921    293 318     vp23_hsv2h VP23_HSV11
3    1    0.427    134 314     VP23_VZVD  VP23_HSV11
3    2    0.417    131 314     VP23_VZVD  vp23_hsv2h
4    1    0.438    137 313     VP23_HSVEB VP23_HSV11
4    2    0.435    136 313     VP23_HSVEB vp23_hsv2h
4    3    0.527    164 311     VP23_HSVEB VP23_VZVD
5    1    0.431    135 313     vp23_ehv4  VP23_HSV11
5    2    0.428    134 313     vp23_ehv4  vp23_hsv2h
5    3    0.527    164 311     vp23_ehv4  VP23_VZVD
5    4    0.946    297 314     vp23_ehv4  VP23_HSVEB
6    1    0.463    146 315     vp23_bhv1  VP23_HSV11
6    2    0.460    145 315     vp23_bhv1  vp23_hsv2h
...

Average identical residues:  81.4238 (54.6525)
Average overlap:   297.99 (7.73172)

The last two lines give the averages and, in parentheses, the standard deviations of the number of identical residues and of the overlap over all pairwise comparisons.

Sequence similarity

The "similarity" between two aligned sequences is defined as the sum of residue similarities divided by the overlap. The similarity between two residues is taken from a residue substitution matrix. The default substitution matrix in Bsoft is BLOSUM62.
The "fraction similarity" is defined as the number of aligned residue pairs whose similarity is above a given threshold divided by the overlap, and is thus a fraction comparable to the identity defined above.
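
A minimal Python sketch of both quantities follows. It is an illustration, not Bsoft code. The small TOY_SCORES table stands in for a full substitution matrix such as BLOSUM62, unlisted pairs scoring 0 is an assumption, and so is reading "above the threshold" as strictly greater than the threshold.

```python
# Toy substitution scores for illustration only; a real analysis would use
# a complete matrix such as BLOSUM62.
TOY_SCORES = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4,
    ("K", "K"): 5, ("R", "R"): 5,
    ("I", "L"): 2, ("K", "R"): 2,
}


def score(a, b):
    """Look up a symmetric score; unlisted pairs score 0 (assumption)."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0))


def aligned_similarity(seq1, seq2, threshold, gaps="-."):
    """Return (similarity, fraction_similarity, overlap)."""
    overlap = 0
    total = 0
    n_similar = 0
    for a, b in zip(seq1, seq2):
        if a in gaps or b in gaps:
            continue  # only positions with residues in both sequences count
        overlap += 1
        s = score(a, b)
        total += s
        if s > threshold:  # assumption: the threshold itself is excluded
            n_similar += 1
    if overlap == 0:
        return 0.0, 0.0, 0
    return total / overlap, n_similar / overlap, overlap


print(aligned_similarity("ALKI-", "ALRLK", threshold=2))  # (3.0, 0.5, 4)
```

In this toy case the four overlapping pairs score 4, 4, 2 and 2, so the similarity is 12/4 = 3.0 and two of the four pairs exceed the threshold of 2.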

Example:

bseq -verbose 7 -similarity 2 vp23.pir

Part of the output:

Aligned similarity analysis:

Similar residue threshold:      2

Seq1 Seq2 Sim   fracSim Overlap Name1      Name2
2    1    4.701 0.934   318     vp23_hsv2h VP23_HSV11
3    1    2.140 0.535   314     VP23_VZVD  VP23_HSV11
3    2    2.099 0.525   314     VP23_VZVD  vp23_hsv2h
4    1    2.326 0.556   313     VP23_HSVEB VP23_HSV11
4    2    2.300 0.550   313     VP23_HSVEB vp23_hsv2h
4    3    2.859 0.650   311     VP23_HSVEB VP23_VZVD
5    1    2.275 0.550   313     vp23_ehv4  VP23_HSV11
5    2    2.243 0.543   313     vp23_ehv4  vp23_hsv2h
5    3    2.836 0.640   311     vp23_ehv4  VP23_VZVD
5    4    4.783 0.955   314     vp23_ehv4  VP23_HSVEB
6    1    2.248 0.546   315     vp23_bhv1  VP23_HSV11
6    2    2.232 0.537   315     vp23_bhv1  vp23_hsv2h
6    3    2.700 0.629   313     vp23_bhv1  VP23_VZVD
...
 

Hydrophobicity analysis

The average hydrophobicity is calculated at each position in the alignment, and a periodicity analysis is done with a frequency of 4 to detect helical regions. The default hydrophobicity scale is the GES scale.
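
The idea can be sketched in Python as below. This is a generic illustration under several assumptions and not the Bsoft algorithm: TOY_SCALE holds made-up values rather than the GES scale, residues missing from the scale count as 0, the periodicity is measured as the amplitude of a single Fourier component over a window, and the "frequency of 4" is read here as a period of 4 residues.

```python
import cmath

# Hypothetical hydrophobicity values for illustration; NOT the GES scale.
TOY_SCALE = {"L": 1.0, "I": 1.0, "V": 0.8, "A": 0.5, "S": -0.3, "K": -1.0}


def average_hydrophobicity(alignment, scale=TOY_SCALE, gaps="-."):
    """Average hydrophobicity at each alignment position, ignoring gaps."""
    profile = []
    for column in zip(*alignment):
        # Assumption: residues absent from the scale contribute 0.
        values = [scale.get(r, 0.0) for r in column if r not in gaps]
        profile.append(sum(values) / len(values) if values else 0.0)
    return profile


def periodic_amplitude(profile, start, window, period=4.0):
    """Amplitude of the component with the given period in one window."""
    segment = profile[start:start + window]
    if not segment:
        return 0.0
    component = sum(
        h * cmath.exp(2j * cmath.pi * k / period) for k, h in enumerate(segment)
    )
    return abs(component) / len(segment)


profile = average_hydrophobicity(["LGKGLGKG", "LGKGLGKG"])
print(profile)
print(periodic_amplitude(profile, start=0, window=8))
```

A profile that alternates with a period of 4 positions gives a large amplitude, while a flat profile gives an amplitude close to zero.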

A typical command line is:

bseq -verbose 7 -hydrophobicity 0.5 -Postscript vp23_hp.ps vp23.pir

The "-Postscript" option outputs three plots to a postscript file.
 

Information content analysis

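The information content of each position in an alignment is calculated as:

information = log2(n) + sum(p_i * log2(p_i))
p_i = f_i / sum(f_i)

where f_i is the frequency of residue i at this alignment position, and n = sum(f_i) if sum(f_i) < 20, otherwise n = 20. Because every term p_i * log2(p_i) is zero or negative, the information is largest, log2(n), at a fully conserved position and decreases as the residues at a position become more varied. A moving average of the information is calculated over a given window to smooth the resultant data.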
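
As a rough Python sketch of this calculation, the information at a position is taken here as log2(n) minus the Shannon entropy of the residue distribution, so that conserved positions score highest. The function names are hypothetical, gaps are assumed to be "-" or ".", and shortening the averaging window at the ends of the alignment is an assumption about edge handling.

```python
import math
from collections import Counter


def column_information(column, gaps="-.", max_n=20):
    """Information content of one alignment column."""
    counts = Counter(r for r in column.upper() if r not in gaps)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    n = min(total, max_n)  # n = sum(f_i) if below 20, otherwise 20
    entropy = -sum((f / total) * math.log2(f / total) for f in counts.values())
    return math.log2(n) - entropy


def moving_average(values, window):
    """Smooth values; the window is truncated at both ends (assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


alignment = ["AAA", "AAC", "ACD", "ACE"]
info = [column_information("".join(col)) for col in zip(*alignment)]
print(info)                     # [2.0, 1.0, 0.0]
print(moving_average(info, 3))  # [1.5, 1.0, 0.5]
```

With four sequences, a fully conserved column scores log2(4) = 2 bits and a column with four different residues scores 0.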

A typical command line is:

bseq -verbose 7 -info -Postscript vp23_info.ps vp23.pir

The "-Postscript" option outputs three plots and a sequence logo representation to a postscript file. The sequence logo displays the occurrence of every residue type at every position in the alignment, where the combined height at each position is the information content, a measure of conservation. Here is the output file in both postscript and pdf (converted from the postscript file with ps2pdf):

vp23_info.ps   vp23_info.pdf

Correlated mutation analysis

The correlated mutation analysis follows the method set out in Göbel, Sander, Schneider & Valencia (1994) Proteins 18, 309-317, with a few minor differences.

The mutational correlation between two positions i and j in the alignment is defined as:

          sum(w(k,l) * (s(i,k,l) - <s(i)>) * (s(j,k,l) - <s(j)>))
r(i,j) = ---------------------------------------------------------
                          m^2 * o(i) * o(j)

where:
  m:         number of sequences
  o(i):      standard deviation of similarities at alignment position i
  w(k,l):    weight for sequences k and l
             (1 - fractional identity: see function seq_aligned_identity)
  s(i,k,l):  similarity for alignment position i between sequences k and l
  <s(i)>:    average similarity at alignment position i
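
The formula can be followed step by step in a small Python sketch. It is not the bcormut implementation and makes several assumptions: residue similarity is reduced to identity scoring (1 or 0) in place of a substitution matrix, the sum runs over the unordered sequence pairs k < l, pairs with a gap at either position are skipped, and the averages and standard deviations are taken over the pairs that remain. All function names are hypothetical.

```python
import math
from itertools import combinations


def similarity(a, b):
    # Assumption: identity scoring stands in for a substitution matrix.
    return 1.0 if a == b else 0.0


def fractional_identity(seq1, seq2, gaps="-."):
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a not in gaps and b not in gaps]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def correlation(alignment, i, j, gaps="-."):
    """Return (r(i,j), number of sequence pairs compared)."""
    m = len(alignment)
    terms = []  # one (w, s_i, s_j) tuple per usable sequence pair
    for k, l in combinations(range(m), 2):
        sk, sl = alignment[k], alignment[l]
        if any(c in gaps for c in (sk[i], sl[i], sk[j], sl[j])):
            continue
        w = 1.0 - fractional_identity(sk, sl, gaps)
        terms.append((w, similarity(sk[i], sl[i]), similarity(sk[j], sl[j])))
    n = len(terms)
    if n == 0:
        return 0.0, 0
    mean_i = sum(t[1] for t in terms) / n
    mean_j = sum(t[2] for t in terms) / n
    sd_i = math.sqrt(sum((t[1] - mean_i) ** 2 for t in terms) / n)
    sd_j = math.sqrt(sum((t[2] - mean_j) ** 2 for t in terms) / n)
    if sd_i == 0.0 or sd_j == 0.0:
        return 0.0, n  # a position without variation carries no signal
    total = sum(w * (si - mean_i) * (sj - mean_j) for w, si, sj in terms)
    return total / (m * m * sd_i * sd_j), n


alignment = ["AKL", "AKV", "GRL", "GRI"]
print(correlation(alignment, 0, 1))  # positions 0 and 1 change together
print(correlation(alignment, 0, 2))  # position 2 varies independently
```

In this toy alignment the first two positions always change together and give a positive coefficient, while the pairing with the third position gives a negative one. The second returned value is the number of comparisons, at most m*(m-1)/2.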
 

 

Example:

bcormut -verbose 7 -datatype b -image vp23.jpg -cutoff 0.6 vp23.pir

Output with high-scoring correlations:

Res1 Num1 Res2 Num2 Total Corr
T 9 I 17 210  0.631
TAIIIVVVIVVVIVIIIIIII
IILLLLLLLLLLLLLLLLLLL
T 104 D 115 210  0.610
TTTTTTTTTTTTTKVAVVVKT
DDDDDDDDDDDDDGTSTSTID
Q 26 S 136 210  0.623
QQQQQQQQTSCCCQQQQQQQQ
SSSSSSSSLVLLLSSSSSSSS
L 44 S 136 210  0.602
LLLLLLLLHSSSNVIILLLLV
SSSSSSSSLVLLLSSSSSSSS
S 136 I 230 210  0.610
SSSSSSSSLVLLLSSSSSSSS
IIIIIIIIASAAALVIIILLV
Correlations reported:  5

 

Each high-scoring correlation (above the threshold of 0.6 given with the "-cutoff" option) generates three output lines. The first line contains 6 values, the first 4 giving the types and alignment positions of the correlated residues. The residue types are those in the reference sequence (typically the first sequence in the alignment, but this can be set on the command line). The next value is the number of comparisons made: at most m*(m-1)/2, which is 210 for the 21 sequences in this example. The last number is the correlation coefficient.
The next two lines give the corresponding residues at the two alignment positions for all the sequences, allowing the user to see on what basis this is a high correlation. Often, poorly represented pairs of positions score high, and these should not be used to reach any conclusions.

Output image:

The image, "vp23.jpg", generated in this example represents all the correlation coefficients calculated for all the positional pairs in the alignment:

Correlated mutation matrix

The line along the diagonal is the comparison of each alignment position with itself (i.e., i = j). The homogeneous band towards the right represents a part of the alignment with large gaps in most of the sequences.