Halimeda tuna haplotype analysis

This R code describes our analyses of haplotype distributions in Mediterranean Halimeda tuna.

Prepare R for analyses

Clean up history and load dependencies:

rm(list=ls(all=TRUE))

library(ape)
library(pegas)
library(RColorBrewer)
library(plotrix)
library(rworldmap)

Show the version of R and packages:

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rworldmap_1.3-6    sp_1.3-1           plotrix_3.7-6     
## [4] RColorBrewer_1.1-2 pegas_0.11         adegenet_2.1.1    
## [7] ade4_1.7-13        ape_5.3           
## 
## loaded via a namespace (and not attached):
##  [1] maps_3.3.0         splines_3.6.1      dotCall64_1.0-0   
##  [4] gtools_3.8.1       shiny_1.3.2        assertthat_0.2.1  
##  [7] expm_0.999-4       yaml_2.2.0         LearnBayes_2.15.1 
## [10] pillar_1.4.2       lattice_0.20-38    glue_1.3.1        
## [13] digest_0.6.21      promises_1.0.1     colorspace_1.4-1  
## [16] htmltools_0.3.6    httpuv_1.5.2       Matrix_1.2-17     
## [19] plyr_1.8.4         pkgconfig_2.0.2    gmodels_2.18.1    
## [22] purrr_0.3.2        xtable_1.8-4       scales_1.0.0      
## [25] gdata_2.18.0       later_0.8.0        tibble_2.1.3      
## [28] mgcv_1.8-28        ggplot2_3.2.1      lazyeval_0.2.2    
## [31] magrittr_1.5       crayon_1.3.4       mime_0.7          
## [34] deldir_0.1-23      maptools_0.9-5     evaluate_0.14     
## [37] nlme_3.1-140       MASS_7.3-51.4      foreign_0.8-71    
## [40] class_7.3-15       vegan_2.5-6        tools_3.6.1       
## [43] stringr_1.4.0      munsell_0.5.0      cluster_2.1.0     
## [46] compiler_3.6.1     e1071_1.7-2        rlang_0.4.0       
## [49] classInt_0.4-1     units_0.6-4        grid_3.6.1        
## [52] spam_2.3-0         igraph_1.2.4.1     rmarkdown_1.15    
## [55] boot_1.3-22        gtable_0.3.0       DBI_1.0.0         
## [58] reshape2_1.4.3     R6_2.4.0           knitr_1.25        
## [61] dplyr_0.8.3        seqinr_3.6-1       spdep_1.1-3       
## [64] KernSmooth_2.23-15 permute_0.9-5      stringi_1.4.3     
## [67] parallel_3.6.1     Rcpp_1.0.2         fields_9.8-6      
## [70] sf_0.8-0           spData_0.3.2       tidyselect_0.2.5  
## [73] xfun_0.9           coda_0.19-3

Analyse the tufA dataset

Load the tufA dataset and calculate the haplotypes contained in it. Note that the sequence file was prepared to have the sequence labels refer to the region where the sample was from. For example:
>Malta
ATGATCACTGGAGCGGC…
>Sardinia
ATGATCACTGGAGCGGC…
>Sardinia
ATGATCACTGGAGCGGC…
>Cataluna
ATGATCACTGGAGCGGC…

tufA = read.dna("tufA_loc.fas",format="fasta")
tufAhaps = haplotype(tufA)
tufAhaps

## 
## Haplotypes extracted from: tufA 
## 
##     Number of haplotypes: 7 
##          Sequence length: 822 
## 
## Haplotype labels and frequencies:
## 
##   I  II III  IV   V  VI VII 
##   5  23   3   6   1   2   1

Okay, so there are seven different haplotypes in the dataset. Now let’s calculate the haplotype network and show some basic information about it.

tufAnet = haploNet(tufAhaps)
tufAnet

## Haplotype network with:
##   7 haplotypes
##   8 links
##   link lengths between 1 and 1 steps
## 
## Use print.default() to display all elements.

And let’s build a table that indicates where the different haplotypes are found.

tufA.ind.hap = with(stack(setNames(attr(tufAhaps,"index"), rownames(tufAhaps))),table(hap=ind, individuals=rownames(tufA)[values]))
tufA.ind.hap

##      individuals
## hap   Adriatic Cataluna ItalyNW.France ItalyW Malta Murcia Sardinia
##   I          1        0              2      0     2      0        0
##   II         7        7              3      4     0      1        1
##   III        2        1              0      0     0      0        0
##   IV         1        0              0      0     0      3        2
##   V          0        0              0      1     0      0        0
##   VI         0        0              0      0     0      0        0
##   VII        0        0              0      1     0      0        0
##      individuals
## hap   Sicilia
##   I         0
##   II        0
##   III       0
##   IV        0
##   V         0
##   VI        2
##   VII       0

Okay, time to plot the network. First we will choose a set of 8 colours from RColorBrewer (one per geographic area in the dataset), then plot the haplotype network, showing pie charts indicating in which geographic areas the haplotypes are found, and add a legend.

colors=brewer.pal(8, "Set1")
plot(tufAnet,size=attr(tufAnet,"freq"),scale.ratio=2,cex=0.8,pie=tufA.ind.hap,bg=colors,show.mutation=2,threshold=0)
legend("bottomleft", colnames(tufA.ind.hap), fill=colors)

Analyse the original ribosomal proteins dataset

We will do two analyses of the data from the ribosomal proteins cluster. First, we will analyse the original dataset, which doesn’t take the length polymorphism of the simple sequence repeats into account. All code is the same as for tufA above, so we’ll just run through and produce a haplotype network plot.

rp = read.dna("rp_loc.fas",format="fasta")
rphaps = haplotype(rp)
rphaps

## 
## Haplotypes extracted from: rp 
## 
##     Number of haplotypes: 4 
##          Sequence length: 1499 
## 
## Haplotype labels and frequencies:
## 
##   I  II III  IV 
##   4  10  30   1

rpnet = haploNet(rphaps)

rp.ind.hap<-with(stack(setNames(attr(rphaps,"index"), rownames(rphaps))),table(hap=ind, individuals=rownames(rp)[values]))
rp.ind.hap

##      individuals
## hap   Adriatic Cataluna ItalyNW.France ItalyW Malta Murcia Sardinia Sicily
##   I          0        0              0      0     0      0        4      0
##   II        10        0              0      0     0      0        0      0
##   III        0        7              6      8     3      3        0      3
##   IV         0        0              0      0     0      1        0      0

colors=brewer.pal(8, "Set1")
plot(rpnet,size=attr(rpnet,"freq"),scale.ratio=2,cex=0.8,pie=rp.ind.hap,bg=colors,show.mutation=2,threshold=0)
legend("topleft", colnames(rp.ind.hap), fill=colors)

Analyse the re-coded ribosomal proteins dataset

Because the ribosomal proteins dataset contains several simple repeats, these repeats were manually re-coded so that each additional repeat would count as a single point mutation. We will now analyse the resulting dataset.

rp = read.dna("rp_loc_recoded.fas",format="fasta")
rphaps = haplotype(rp)
rphaps

## 
## Haplotypes extracted from: rp 
## 
##     Number of haplotypes: 10 
##          Sequence length: 1514 
## 
## Haplotype labels and frequencies:
## 
##    I   II  III   IV    V   VI  VII VIII   IX    X 
##    4   10    3    8    8    1    1    8    1    1

So instead of the 4 haplotypes from above, with the re-coding we now get 10 haplotypes from the ribosomal proteins data. Let’s continue the analysis and plot the network.

rpnet = haploNet(rphaps)
rp.ind.hap<-with(stack(setNames(attr(rphaps,"index"), rownames(rphaps))),table(hap=ind, individuals=rownames(rp)[values]))
rp.ind.hap

##       individuals
## hap    Adriatic Cataluna ItalyNW.France ItalyW Malta Murcia Sardinia
##   I           0        0              0      0     0      0        4
##   II         10        0              0      0     0      0        0
##   III         0        1              0      0     0      0        0
##   IV          0        0              1      3     3      0        0
##   V           0        1              3      2     0      2        0
##   VI          0        0              0      1     0      0        0
##   VII         0        0              0      0     0      1        0
##   VIII        0        4              2      1     0      1        0
##   IX          0        0              0      1     0      0        0
##   X           0        1              0      0     0      0        0
##       individuals
## hap    Sicily
##   I         0
##   II        0
##   III       2
##   IV        1
##   V         0
##   VI        0
##   VII       0
##   VIII      0
##   IX        0
##   X         0

colors=brewer.pal(8, "Set1")
plot(rpnet,size=attr(rpnet,"freq"),scale.ratio=2,cex=0.8,pie=rp.ind.hap,bg=colors,show.mutation=2,threshold=0)
legend("topleft", colnames(rp.ind.hap), fill=colors)

Plot a haplotype map

Okay, we will now plot the haplotype information on a map of the western Mediterranean Sea, using the re-coded ribosomal proteins data as that seems to show the finest resolution among all datasets analysed. We’ll start by plotting the map itself.

worldmap <- getMap(resolution = "low")
plot(worldmap,xlim=c(-6,20),ylim=c(38,40),col="gray95")

Now we need some latitude/longitude coordinates for the different geographic areas. This was prepared as a table that we’ll import here as a data frame.

df = read.table(file="regions_coordinates,txt",header=T,sep="\t")
df

##           region  long  lat
## 1       Adriatic 15.60 42.4
## 2       Cataluna  2.90 41.6
## 3 ItalyNW.France  7.80 43.7
## 4         ItalyW 12.70 41.1
## 5          Malta 14.30 35.4
## 6         Murcia -0.73 37.8
## 7       Sardinia  9.00 39.5
## 8         Sicily 14.00 38.2

Let’s plot the map again with some text labels to indicate these regions.

plot(worldmap,xlim=c(-10,20),ylim=c(38,41),col="gray95")
text(x=df$long,y=df$lat,labels=df$region)

For downstream analyses, it’s important to double-check that the order of regions in imported table matches that in the rp.ind.hap variable, so we’ll check that here.

names(rp.ind.hap[1,]) == df$region

## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Looks good. Now we will create a matrix from the haplotype frequency information that was calculated earlier and stored in rp.ind.hap.

sec = NULL
for (i in 1:nrow(rp.ind.hap)) {
  sec = cbind(sec,rp.ind.hap[i,])
}
sec

##                [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## Adriatic          0   10    0    0    0    0    0    0    0     0
## Cataluna          0    0    1    0    1    0    0    4    0     1
## ItalyNW.France    0    0    0    1    3    0    0    2    0     0
## ItalyW            0    0    0    3    2    1    0    1    1     0
## Malta             0    0    0    3    0    0    0    0    0     0
## Murcia            0    0    0    0    2    0    1    1    0     0
## Sardinia          4    0    0    0    0    0    0    0    0     0
## Sicily            0    0    2    1    0    0    0    0    0     0

Okay, with all of that in place we can now plot the pie charts onto the map. First, we’ll pick 10 colors (corresponding to the 10 haplotypes in the dataset). Then we’ll plot the map and text labels again, and loop through each region to plot a pie chart and add the number of samples sequenced in that region.

colors=brewer.pal(10, "Set3")
plot(worldmap,xlim=c(-6,20),ylim=c(38,40),col="gray95")
text(x=df$long,y=df$lat+1.1,labels=df$region)
for (i in 1:nrow(df)) {
  r = sec[i,]
  names(r) = colors
  r=sort(r,decreasing=T)
  floating.pie(df$long[i],df$lat[i],r,radius=1,col=names(r))
  text(df$long[i],df$lat[i]-1,cex=.8,labels=paste0("(n=",sum(sec[i,]),")"))
}

Last thing to do is relate the colours used in the map to the haplotypes in the network, so we’ll re-plot the network with those colours.

plot(rpnet,size=attr(rpnet,"freq"),fast=FALSE,show.mutation=2,threshold=0,bg=colors)