Summarizing phylogenetic data

Last updated on 2023-07-11

Estimated time 12 minutes

Overview

Questions

What biological information is stored in phylogenetic Hill numbers?
How do you import and manipulate phylogeny data in R?
What are the common phylogeny data formats?
How do you visualize phylogenies in R?
How do you calculate Hill numbers with phylogenetic data?

Objectives

After following this episode, participants will be able to:

Identify key features of the Newick format
Import and manipulate phylogenetic data in the R environment
Visualize phylogenies
Calculate and interpret phylogenetic Hill numbers

Introduction to phylogenetic data

In this episode, we will explore how to extract information from phylogenetic trees in order to complement our hypotheses and inferences about processes shaping biodiversity patterns. Phylogenetic trees have information about how the species in a taxonomic group are related to each other and how much relative evolutionary change has accumulated among them. Since local communities differ in their phylogenetic composition, this information can give insights on why communities are how they are.

The relative phylogenetic distance among species in a community as well as the distribution of the amount of evolutionary history (represented by the length of the branches in a phylogeny) are a result of different factors such as the time since the initial formation of the community and the rate of macroevolutionary processes such as speciation and extinction. For instance, young communities that are dominated by closely related species and show very short branch lengths may suggest a short history with few colonization events and high rates of local speciation; alternatively, if the same young communities harbor distantly related species with longer branch lengths, it may suggest that most of the local diversity was generated by speciation elsewhere followed by colonization events involving distantly related species. Coupled with information on ecological traits and rates of macroevolutionary processes, these patterns also allow us to test hypotheses regarding, for instance, ecological filtering or niche conservatism.

Summarizing this phylogenetic information (i.e., phylogenetic distance and distribution of branch lengths) is therefore important for inference. As we have seen in previous episodes, the use of Hill numbers is an informative approach to summarize biodiversity. In this episode, we will see how phylogenetic Hill numbers in different orders are able to capture information about 1) the total amount of evolutionary history in different communities; 2) how this history is relatively distributed across the community.

Working with phylogenetic data in R

Importing phylogenetic data

Several file formats exist to store phylogenetic information. The most common are the Newick and Nexus formats. Both are plain text files storing different levels of information about the relationships among taxa and their evolutionary history. Newick files are the standard for representing trees in a computer-readable form, as they can be extremely simple and therefore do not take up much memory. Nexus files, on the other hand, are composed of different blocks holding different types of information and can be used to store DNA alignments, phylogenetic trees, pre-defined groupings of taxa, or everything at once. Since they are a step ahead in complexity, we will stick with Newick files for now.

Newick files store the information about the clades in a tree by representing each clade within a set of parentheses. Sister clades are separated by ,. The notation also requires us to add the symbol ; to represent the end of the information for that phylogenetic tree.

The basic structure of a tree in a Newick format is therefore as follows:

((A,B),C);

The notation above indicates that: 1. we have three taxa in our tree, named A, B and C; 2. A and B form one clade (A,B); 3. the (A,B) clade is sister to the C clade (we represent that by adding another set of parentheses and a , separating (A,B) from C).

In addition, Newick files can also store the length of the branch leading to each tip and node. We do that by adding : followed by the branch length after each tip/node.

((A:0.5,B:0.5):0.5,C:1);

The notation above indicates that: 1. the branches containing A and B have each a length = 0.5; 2. the branch that leads to the node connecting A and B also has length = 0.5; 3. the branch leading to C has length = 1.

The notation above is what we import into R to start working with and manipulating our phylogenetic tree. For that goal, we will use the ape package. Below, we are also loading a few other packages we’ll be using later on.

R

library(ape)
library(tidyr)
library(hillR)

To import our tree, we will be using the function read.tree() from the ape package. For simple trees like the one above, we can create them directly within R by passing that notation as a character value to this function, using the text argument, as shown below:

R

example_tree <- read.tree(text = '((A:0.5,B:0.5):0.5,C:1);')

Now, we can visually inspect our tree using the plot() function:

R

plot(example_tree)

Can you visualize the text notation in that image? We can see the same information: A is more closely related to B than to C, and the branches leading to A and B have half the length of the branch leading to C.
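The same relationships can also be checked numerically. The lines below are a minimal sketch, assuming the cophenetic() method that ape provides for phylo objects (it is not used elsewhere in this episode); it returns the distance between every pair of tips measured along the branches of the tree.

```r
library(ape)

# Same toy tree as above
example_tree <- read.tree(text = '((A:0.5,B:0.5):0.5,C:1);')

# Pairwise distances between tips, measured along the branches
tip_dist <- cophenetic(example_tree)

# A and B are separated by less branch length than A and C
tip_dist["A", "B"]
tip_dist["A", "C"]
```

A smaller distance between two tips means they share a more recent common ancestor, which is what the plot shows for A and B.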

The read.tree() function creates an object of class phylo. We can further investigate this object by calling it in our console:

R

example_tree

OUTPUT


Phylogenetic tree with 3 tips and 2 internal nodes.

Tip labels:
  A, B, C

Rooted; includes branch lengths.

The printed information shows us that we have a phylogenetic tree with 3 tips and 2 internal nodes, where the tip labels are “A, B, C”. We also are informed that this tree is rooted and has branch lengths.

One way to access the components of this object and better explore it is to use $ after the object name. Here, it will be important for us to know a little bit more about where the information about tip labels and branch lengths is stored in that phylo object. Easy enough, we can access that by calling tip.label and edge.length after $.

R

example_tree$tip.label

OUTPUT

[1] "A" "B" "C"

R

example_tree$edge.length

OUTPUT

[1] 0.5 0.5 0.5 1.0
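The four values in edge.length follow the row order of another component of the phylo object, edge, a two-column matrix where each row is one branch running from a parent node to a child node. As a sketch of how the pieces fit together (branch_table is a name made up for this illustration), we can line them up in a data frame:

```r
library(ape)

example_tree <- read.tree(text = '((A:0.5,B:0.5):0.5,C:1);')

# Each row of $edge is one branch: parent node -> child node.
# Tips are numbered 1..Ntip; larger numbers are internal nodes.
n_tips <- Ntip(example_tree)
child <- example_tree$edge[, 2]

branch_table <- data.frame(
    parent = example_tree$edge[, 1],
    child  = child,
    # Label branches that end in a tip; internal branches get NA
    tip    = ifelse(child <= n_tips, example_tree$tip.label[child], NA),
    length = example_tree$edge.length
)
branch_table
```

Rows with a tip label are the branches leading to A, B and C; the remaining row is the internal branch leading to the node that joins A and B.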

Cleaning and filtering phylogenetic data

Now that we have learned how to import and visualize trees in R, let’s bring in the phylogeny for the communities we are working with in this workshop. Our data so far consists of abundances and traits of several taxa of arthropods collected across three islands in the Hawaiian archipelago. Let’s work through importing phylogenetic information for these species.

Two common approaches to retrieving a phylogeny for a focal group are 1) relying on a published phylogeny for the group, or 2) surveying public phylogenetic databases based on your taxa list. A common option for the latter is the Open Tree of Life Taxonomy, a public database that synthesizes taxonomic information from different sources. You can even interact with this database using the R package rotl, and several tutorials on how to do so exist online. Using a public database is a good approach when working with taxonomic groups that are not heavily investigated regarding their phylogenetic relationships (the well-known Darwinian shortfall). In such cases, databases like OTL will give you a summary phylogeny already filtered for the taxa you have in hand and cross-checked for synonyms and misspellings.

For this workshop, since we are using simulated data, we will work with the first option: a “published” arthropod phylogeny. Let’s load this phylogeny into R using the function read.tree we learned earlier.

R

arthro_tree <- read.tree('https://raw.githubusercontent.com/role-model/multidim-biodiv-data/main/episodes/data/phylo_raw.nwk')

class(arthro_tree)

OUTPUT

[1] "phylo"

This new phylo object is way larger than the previous one, being a “real” phylogeny and all. You can inspect it again by directly calling the object arthro_tree. To plot it, we will use the type argument to modify how our tree will be displayed. Here, we use the option 'fan' to display a circular phylogeny (slightly better for showing such a large phylogeny on the screen). We also set the show.tip.label argument to FALSE.

R

plot(arthro_tree, type = 'fan', show.tip.label = FALSE)

How do we combine all this information with the community datasets we have so far for our three islands? First, we will have to perform some name checking and filtering.

Cleaning and checking phylogeny taxa

The first thing we want to do is to check the tip labels in our tree. Since this is a “published” arthropod phylogeny, we will likely not have any misspelling in the tip names of the object. However, it is always good practice to check contents to see if anything weird stands out.

R

arthro_tree$tip.label

OUTPUT

 [1] "Leptogryllus_fusconotatus"   "Hylaeus_facilis"            
 [3] "Laupala_pruna"               "Eurynogaster_vittata"       
 [5] "Cydia_gypsograpta"           "Toxeuma_hawaiiensis"        
 [7] "Proterhinus_punctipennis"    "Drosophila_quinqueramosa"   
 [9] "Ectemnius_mandibularis"      "Nesodynerus_mimus"          
[11] "Proterhinus_xanthoxyli"      "Nesiomiris_lineatus"        
[13] "Aeletes_nepos"               "Scaptomyza_vagabunda"       
[15] "Agrotis_chersotoides"        "Kauaiina_alakaii"           
[17] "Atelothrus_depressus"        "Metrothorax_deverilli"      
[19] "Scaptomyza_villosa"          "Hylaeus_sphecodoides"       
[21] "Lucilia_graphita"            "Xyletobius_collingei"       
[23] "Hyposmocoma_sagittata"       "Cis_signatus"               
[25] "Hyposmocoma_scolopax"        "Dryophthorus_insignoides"   
[27] "Eudonia_lycopodiae"          "Chrysotus_parthenus"        
[29] "Limonia_sabroskyana"         "Hyposmocoma_marginenotata"  
[31] "Mecyclothorax_longulus"      "Deinomimesa_haleakalae"     
[33] "Trigonidium_paranoe"         "Eudonia_geraea"             
[35] "Drosophila_furva"            "Hyposmocoma_geminella"      
[37] "Drosophila_obscuricornis"    "Campsicnemus_nigricollis"   
[39] "Odynerus_erythrostactes"     "Phaenopria_soror"           
[41] "Gonioryctus_suavis"          "Laupala_vespertina"         
[43] "Acanthia_procellaris"        "Odynerus_caenosus"          
[45] "Elmoia_lanceolata"           "Nesodynerus_molokaiensis"   
[47] "Sierola_celeris"             "Nysius_lichenicola"         
[49] "Parandrita_molokaiae"        "Agonismus_argentiferus"     
[51] "Cephalops_proditus"          "Nesomicromus_haleakalae"    
[53] "Lispocephala_dentata"        "Agrion_nigrohamatum"        
[55] "Plagithmysus_ilicis_ekeanus" "Scatella_clavipes"          
[57] "Hedylepta_accepta"           "Cis_bimaculatus"            
[59] "Hydriomena_roseata"          "Spolas_solitaria"

And indeed we find something: even though there are probably no misspellings, the genus and species names in this tree are separated by an underscore symbol _. Since the names in our site-by-species matrix do not have that underscore, the two sets of names will not match unless we fix this.

One useful function for this fix is gsub(). This function allows you to look for a specific character pattern inside character objects and replace it with any other pattern you may want. In our case, we have a vector of 60 character values containing the names of our tips. We want to find the _ character inside each character value and replace it with a space, so it becomes equal to what we have in our site-by-species matrix. We do so by providing to the gsub() function: 1) the pattern we want to replace; 2) the new pattern we want to replace it with; 3) the character object or vector containing the values to be searched. Finally, we assign the output of that function back to the tip.label slot in the arthro_tree object.

R

arthro_tree$tip.label <- gsub('_',' ',arthro_tree$tip.label)

# A quick check to see if it worked
arthro_tree$tip.label

OUTPUT

 [1] "Leptogryllus fusconotatus"   "Hylaeus facilis"            
 [3] "Laupala pruna"               "Eurynogaster vittata"       
 [5] "Cydia gypsograpta"           "Toxeuma hawaiiensis"        
 [7] "Proterhinus punctipennis"    "Drosophila quinqueramosa"   
 [9] "Ectemnius mandibularis"      "Nesodynerus mimus"          
[11] "Proterhinus xanthoxyli"      "Nesiomiris lineatus"        
[13] "Aeletes nepos"               "Scaptomyza vagabunda"       
[15] "Agrotis chersotoides"        "Kauaiina alakaii"           
[17] "Atelothrus depressus"        "Metrothorax deverilli"      
[19] "Scaptomyza villosa"          "Hylaeus sphecodoides"       
[21] "Lucilia graphita"            "Xyletobius collingei"       
[23] "Hyposmocoma sagittata"       "Cis signatus"               
[25] "Hyposmocoma scolopax"        "Dryophthorus insignoides"   
[27] "Eudonia lycopodiae"          "Chrysotus parthenus"        
[29] "Limonia sabroskyana"         "Hyposmocoma marginenotata"  
[31] "Mecyclothorax longulus"      "Deinomimesa haleakalae"     
[33] "Trigonidium paranoe"         "Eudonia geraea"             
[35] "Drosophila furva"            "Hyposmocoma geminella"      
[37] "Drosophila obscuricornis"    "Campsicnemus nigricollis"   
[39] "Odynerus erythrostactes"     "Phaenopria soror"           
[41] "Gonioryctus suavis"          "Laupala vespertina"         
[43] "Acanthia procellaris"        "Odynerus caenosus"          
[45] "Elmoia lanceolata"           "Nesodynerus molokaiensis"   
[47] "Sierola celeris"             "Nysius lichenicola"         
[49] "Parandrita molokaiae"        "Agonismus argentiferus"     
[51] "Cephalops proditus"          "Nesomicromus haleakalae"    
[53] "Lispocephala dentata"        "Agrion nigrohamatum"        
[55] "Plagithmysus ilicis ekeanus" "Scatella clavipes"          
[57] "Hedylepta accepta"           "Cis bimaculatus"            
[59] "Hydriomena roseata"          "Spolas solitaria"
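Note that one of the labels above, Plagithmysus ilicis ekeanus, has three parts and therefore had two underscores. The short base R sketch below shows why gsub() is the right tool here: its sibling sub() replaces only the first match in each string, while gsub() replaces all of them. The argument fixed = TRUE is an optional addition that tells R to treat the pattern as literal text rather than a regular expression.

```r
# A three-part name taken from the original tip labels
label <- "Plagithmysus_ilicis_ekeanus"

# sub() replaces only the first match
first_only <- sub("_", " ", label, fixed = TRUE)

# gsub() replaces every match
all_matches <- gsub("_", " ", label, fixed = TRUE)

first_only
all_matches
```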

Now that we have fixed this first obvious issue, we can start looking for others. Since we want to calculate phylogenetic diversity for each of our communities, our main concern here is to make sure that all taxa present in our communities can be found in this phylogeny. One important issue that may arise is the use of different names for the same taxa across the two datasets (i.e., synonyms). This is especially important since we previously performed a synonym check and cleaning in our abundance dataset; we need to make sure the names in our tree follow the same nomenclature decisions.

Instructor Note

Learners should have the object abundances already in their environment, from the previous episode. If not, use the line below.

R

abundances <- read.csv("https://raw.githubusercontent.com/role-model/multidim-biodiv-data/main/episodes/data/abundances_resolved.csv")

To see if there are any mismatches, let’s first retrieve a list of the names in our abundances dataset. Since this dataset has repeated instances of the same species when it shows up in different islands, we wrap the vector of taxa names in the function unique() to return each species name only once.

R

all_names <- unique(abundances$final_name)

To cross-check this list against the list of names in our phylogeny, we can use the Boolean operator %in% coupled with !. This will allow us to check for names present in all_names that are not included in the arthro_tree$tip.label. In summary, the expression A %in% B would return whether each element of vector A is present in vector B. This is returned as a Boolean vector: if TRUE, the element of that position in A exists in B; if FALSE, it does not. We add the ! (NOT) operator to return the opposite of that expression, in a way that !(A %in% B) will return whether each element of vector A is NOT present in vector B. In this case, every time we see TRUE, it means the element in that position is NOT in vector B.

R

not_found <- !(all_names %in% (arthro_tree$tip.label))
not_found

OUTPUT

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Checking the vector not_found, we can see it is a collection of TRUEs and FALSEs. We can use that vector to perform bracket subsetting in the vector all_names. Doing so, we are retrieving from all_names only the elements in the position where not_found is TRUE.

R

all_names[not_found]

OUTPUT

[1] "Peridroma chersotoides"
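The same question can be asked in a single step with the base R function setdiff(), which returns the elements of its first argument that are absent from its second. The sketch below uses small made-up vectors (community_names and tree_names are hypothetical stand-ins for all_names and arthro_tree$tip.label) so that it runs on its own:

```r
# Hypothetical toy vectors standing in for all_names and arthro_tree$tip.label
community_names <- c("Laupala pruna", "Peridroma chersotoides", "Cis signatus")
tree_names <- c("Laupala pruna", "Agrotis chersotoides",
                "Cis signatus", "Eudonia geraea")

# Names in the community data that are missing from the tree
missing_bool <- community_names[!(community_names %in% tree_names)]
missing_set <- setdiff(community_names, tree_names)

# Both approaches give the same answer
identical(missing_bool, missing_set)

# The reverse direction lists tree tips absent from the community data
extra_tips <- setdiff(tree_names, community_names)
extra_tips
```

The order of the arguments matters: swapping them answers a different question, namely which tips would be dropped when pruning the tree to the community.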

The expression above gives us the one element of all_names that is not present in arthro_tree$tip.label. As we expected, it is a synonym that we corrected in our abundances episode. Our tree still has the old name Agrotis chersotoides whereas our site-by-species matrix has the updated name Peridroma chersotoides. To correct that, we need to modify the tip label in our tree to match the new name. How can we do that?

Challenge

Modify the old name Agrotis chersotoides in our phylogenetic tree, replacing it with Peridroma chersotoides. To do so, you will need to 1) find the position of the old name among the tip labels of our tree; 2) assign the new name to that position.

Hint: a similar task was performed before in the abundances.

Instructor Note

This challenge is supposed to assess the learner’s knowledge in indexing, subsetting and replacing values in a vector. The solution below uses Boolean matching to find the old name and replace it with the new name, but other solutions can include checking visually for the position of the old name or using grep() to find the position.

Show me the solution

We need to perform an assignment operation in the position where the old name is located inside arthro_tree$tip.label. To find that position, we can use Boolean matching to ask which element of arthro_tree$tip.label equals the old name.

R

arthro_tree$tip.label[arthro_tree$tip.label == 'Agrotis chersotoides'] <- "Peridroma chersotoides"
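Other routes lead to the same position. The sketch below compares three base R options on a small made-up vector (labels is a hypothetical stand-in for arthro_tree$tip.label). Keep in mind that grep() matches substrings, so it can return more than one position when the search text also occurs inside longer names.

```r
# Hypothetical toy vector standing in for arthro_tree$tip.label
labels <- c("Laupala pruna", "Agrotis chersotoides", "Cis signatus")

# which() turns the logical comparison into a position
pos_which <- which(labels == "Agrotis chersotoides")

# match() returns the position of the first exact match (NA if absent)
pos_match <- match("Agrotis chersotoides", labels)

# grep() searches by pattern; fixed = TRUE treats the pattern as literal text
pos_grep <- grep("Agrotis chersotoides", labels, fixed = TRUE)

# Replace by position
labels[pos_match] <- "Peridroma chersotoides"
labels
```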

A good practice after correcting the name is to re-check if all names in our abundance dataset match the names in the phylogeny. This time, we expect all elements in the vector not_found to be FALSE. We can use the function which() to ask “which elements in not_found are TRUE?” We expect the answer to be an empty vector, indicating “no elements in not_found are TRUE”.

R

not_found <- !(all_names %in% (arthro_tree$tip.label))
which(not_found)

OUTPUT

integer(0)

Pruning our phylogeny

Now that we have a modified tree that contains every taxon name in our abundances dataset, we can work on pruning our phylogeny to the taxa present in our communities. This is especially important when working with published phylogenies as these are usually large files containing several taxa that may not exist in the local community.

We can prune our phylogeny using the function keep.tip() from the ape package. For this function, we provide the entire phylogeny plus the names of the tips we want to keep. Here, we will retrieve those names from the abundances dataset.

R

arthro_tree_pruned <- keep.tip(arthro_tree,abundances$final_name)
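To see what pruning does on a tree small enough to inspect by eye, here is a minimal sketch with a made-up four-taxon tree (toy_tree and community_taxa are hypothetical names used only for this illustration). The taxa vector contains a repeated name, mimicking a long-format abundance table, so we pass it through unique() and then confirm that the pruned tree holds exactly the taxa we asked for.

```r
library(ape)

# Hypothetical four-taxon tree and a community containing three of its taxa
toy_tree <- read.tree(text = '(((A:1,B:1):1,C:2):1,D:3);')
community_taxa <- c("A", "C", "D", "A")  # repeated names, as in a long-format table

toy_pruned <- keep.tip(toy_tree, unique(community_taxa))

# The pruned tree should hold exactly the community taxa
Ntip(toy_pruned)
setequal(toy_pruned$tip.label, unique(community_taxa))
```

The same two checks can be run on arthro_tree_pruned and all_names before moving on.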

With a pruned phylogeny and a site-by-species matrix, we have the two bits of information we need to summarize phylogenetic diversity per community. As we saw in previous episodes, we can use such an abundance matrix to calculate Hill numbers and help us compare patterns across the different communities. In the traits episode, we saw how we can combine abundance and trait data to extract Hill numbers as values that summarize the distribution of trait variation across communities. Here we will be following the same approach to summarize the amount and the distribution of evolutionary history across the species in our community.

Summarizing with Hill numbers

In this section, we will extract some summary statistics about the pattern of phylogenetic diversity (PD) in our communities. As we discussed in the intro of this episode, the relative phylogenetic distance among species and the distribution of this distance can give insights into processes of community assembly. Here, we will make use of Hill numbers to extract summaries of phylogenetic distances, i.e. the length of the branches in the phylogenetic tree leading to the taxa present in each community. Phylogenetic Hill numbers incorporate information on both the phylogenetic structure of a system and the abundances of different species. In order to get an intuition for how these different components influence phylogenetic Hill numbers, we’ll set our Hawaiian data aside for a moment and first explore the behavior of this summary statistic using a few simplified example datasets.

Example 1. Trees with different branch lengths

For our examples, we’ll assume we have one single community with eight taxa: A through H. Let’s create two site-by-species matrices for this community: one denoting even abundance across species, and another one with uneven abundance. For the even community, we create a vector of the value 1 repeated 8 times; for the uneven community, we’ll create a vector where abundance goes up from species A to species H (for simplicity, we’ll just use values from 1 to 8). We then transform them into data frames, and name the columns with the names of the species.

R

even_comm <- data.frame(rbind(rep(1,8))) # Abundance = 1 for all species
uneven_comm <- data.frame(rbind(seq(1,8))) # Abundance equals 1 for species A and goes up to 8 for species H

# We name the columns with the species names
colnames(even_comm) <- colnames(uneven_comm) <- c('A','B','C','D','E','F','G','H')

Now let’s create two different possible trees for these communities: one with short branch lengths and another with longer branch lengths. Remember: branch length = amount of evolutionary change

R

short_tree <- read.tree(text='(((A:3,B:3):1,(C:3,D:3):1):1,((E:3,F:3):1,(G:3,H:3):1):1);')
long_tree <- read.tree(text='(((A:6,B:6):1,(C:6,D:6):1):1,((E:6,F:6):1,(G:6,H:6):1):1);')

If we plot both trees…

R

plot(short_tree)

R

plot(long_tree)

…we can see that the branches leading to extant taxa are longer for long_tree, as we intended. This suggests that a greater amount of evolutionary change is happening in these recent branches of the longer tree when compared to the shorter tree.

Now, we will calculate phylogenetic Hill numbers for both trees using both even and uneven communities. To store the calculated values, we’ll create a data.frame called even_comm_short_tree. The first column will hold the Hill numbers to be calculated; the second column will be the Hill number order from 0 to 3, to see how the order affects the values; the third and fourth columns will describe the community and tree used.

R

even_comm_short_tree <- data.frame(
    hill_nb = NA,
    q = 0:3,
    comm = "even",
    tree = "short"
)

Now we will use a for loop to calculate phylogenetic Hill numbers using the hill_phylo function from the hillR package. This function takes in a site-by-species matrix and a phylogeny, and returns phylogenetic Hill numbers for each site based on which species are present there. We provide to the function the site-by-species matrix, the phylogenetic tree and the order.

R

for(i in 1:nrow(even_comm_short_tree)) {
    even_comm_short_tree$hill_nb[i] <- hill_phylo(even_comm, short_tree, q = even_comm_short_tree$q[i])
}
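Because the same "create a table, then loop over orders" pattern is repeated for every combination of community and tree in this episode, it can be convenient to wrap it in a function. The sketch below is one possible way to do so, not part of hillR: hill_phylo_table() is a hypothetical helper that assumes a one-site community (a single row) and calls hill_phylo() with the same arguments used in the loop above.

```r
library(ape)
library(hillR)

# Hypothetical helper: phylogenetic Hill numbers for a one-site community
# across several orders, returned in the same layout as the tables above
hill_phylo_table <- function(comm, tree, comm_label, tree_label, orders = 0:3) {
    hill_values <- vapply(
        orders,
        # One site only, so we keep the first (and only) value returned
        function(q) as.numeric(hill_phylo(comm, tree, q = q))[1],
        numeric(1)
    )
    data.frame(
        hill_nb = hill_values,
        q = orders,
        comm = comm_label,
        tree = tree_label
    )
}

# Same even community and short tree as in this example
even_comm <- data.frame(rbind(rep(1, 8)))
colnames(even_comm) <- c('A','B','C','D','E','F','G','H')
short_tree <- read.tree(text = '(((A:3,B:3):1,(C:3,D:3):1):1,((E:3,F:3):1,(G:3,H:3):1):1);')

helper_result <- hill_phylo_table(even_comm, short_tree, "even", "short")
helper_result
```

The explicit loops in this episode are kept on purpose, since they make each step visible; the helper is just a shortcut once the logic is clear.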

Let’s repeat this process for the longer tree:

R

even_comm_long_tree <- data.frame(
    hill_nb = NA,
    q = 0:3,
    comm = "even",
    tree = "long"
)

for(i in 1:nrow(even_comm_long_tree)) {
    even_comm_long_tree$hill_nb[i] <- hill_phylo(even_comm, long_tree, q = even_comm_long_tree$q[i])
}

We can combine both dataframes and plot the values for comparison:

R

even_comm_nb <- data.frame(rbind(even_comm_short_tree,even_comm_long_tree))

plot(even_comm_nb$q[even_comm_nb$tree=='short'],
     even_comm_nb$hill_nb[even_comm_nb$tree=='short'],
     type='b',col='red',
     xlab = 'Order',ylab='Hill values',
     xlim = range(even_comm_nb$q),
     ylim = range(even_comm_nb$hill_nb))

lines(even_comm_nb$q[even_comm_nb$tree=='long'],
     even_comm_nb$hill_nb[even_comm_nb$tree=='long'],
     type='b',col='darkred')

This figure clearly shows that the tree with longer branches (dark red line) harbors more evolutionary history, and therefore higher PD, as calculated by Hill numbers. It also shows that the Hill number value decreases as the order goes up, since higher orders focus on branch lengths that are more common.
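Since the phylogenetic Hill number of order 0 reflects the sum of branch lengths, the gap between the two lines at q = 0 can be checked directly from the trees. A quick sketch using only the edge.length component we met earlier:

```r
library(ape)

short_tree <- read.tree(text = '(((A:3,B:3):1,(C:3,D:3):1):1,((E:3,F:3):1,(G:3,H:3):1):1);')
long_tree <- read.tree(text = '(((A:6,B:6):1,(C:6,D:6):1):1,((E:6,F:6):1,(G:6,H:6):1):1);')

# Total branch length of each tree
total_short <- sum(short_tree$edge.length)
total_long <- sum(long_tree$edge.length)

total_short
total_long
```

The two trees differ only in their eight terminal branches (length 3 versus 6), which accounts for the whole difference between the totals.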

What would happen if the abundance of species in our community was uneven (a more realistic scenario)? In this case, both branch lengths and how abundant each branch is will have an effect on the calculated value. To visualize, let’s repeat the calculations above for the uneven community.

R

# Uneven comm with short tree
uneven_comm_short_tree <- data.frame(
    hill_nb = NA,
    q = 0:3,
    comm = "uneven",
    tree = "short"
)

for(i in 1:nrow(uneven_comm_short_tree)) {
    uneven_comm_short_tree$hill_nb[i] <- hill_phylo(uneven_comm, short_tree, q = uneven_comm_short_tree$q[i])
}

# Uneven comm with long tree
uneven_comm_long_tree <- data.frame(
    hill_nb = NA,
    q = 0:3,
    comm = "uneven",
    tree = "long"
)

for(i in 1:nrow(uneven_comm_long_tree)) {
    uneven_comm_long_tree$hill_nb[i] <- hill_phylo(uneven_comm, long_tree, q = uneven_comm_long_tree$q[i])
}

# Combining results
uneven_comm_nb <- data.frame(rbind(uneven_comm_short_tree,uneven_comm_long_tree))

Let’s plot all results together, using red colors for even communities and blue colors for uneven communities. Lighter colors will represent short trees whereas darker colors will represent long trees.

R

plot(even_comm_nb$q[even_comm_nb$tree=='short'],
     even_comm_nb$hill_nb[even_comm_nb$tree=='short'],
     type='b',col='red',
     xlab = 'Order',ylab='Hill values',
     xlim = range(even_comm_nb$q),
     ylim = range(min(uneven_comm_nb$hill_nb),max(even_comm_nb$hill_nb)))

lines(even_comm_nb$q[even_comm_nb$tree=='long'],
     even_comm_nb$hill_nb[even_comm_nb$tree=='long'],
     type='b',col='darkred')

lines(uneven_comm_nb$q[uneven_comm_nb$tree=='short'],
     uneven_comm_nb$hill_nb[uneven_comm_nb$tree=='short'],
     type='b',col='lightblue')

lines(uneven_comm_nb$q[uneven_comm_nb$tree=='long'],
     uneven_comm_nb$hill_nb[uneven_comm_nb$tree=='long'],
     type='b',col='darkblue')

From this picture, we can draw a few insights:

Longer branches still yield higher Hill numbers, regardless of the evenness in the community abundance.
The Hill number for q = 0 remains the same, regardless of the evenness.
As already observed, the value of the Hill numbers drops as the order goes up. In the even community, since species have the same abundance, this decrease reflects higher orders focusing on more common values of branch lengths. In the uneven community, where species have different abundances, this decrease reflects higher orders focusing less and less on branches leading to rare taxa.
For q = 1 to 3, Hill numbers are always lower for uneven communities. This suggests that although branch lengths are the same (e.g., both dark red and dark blue lines represent the long tree), these branches are unevenly represented in the community due to uneven abundance of species. This unevenness is reflected in the lower Hill number values at higher orders. In other words, some branches in the tree “dominate” the community more than others. We can summarize this by saying that the higher the unevenness, the lower the value of the Hill number will be.
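To make the weighting behind these observations concrete, the sketch below computes phylogenetic Hill numbers by hand for the short tree. It assumes the formulation of Chao et al. (2010) for an ultrametric tree, in which each branch is weighted by its length and by the summed relative abundance of the tips descending from it. manual_phylo_hill() is a hypothetical helper written for this illustration, not hillR code; hillR's internals may differ in details, so agreement with hill_phylo() is something to check rather than assume.

```r
library(ape)

# Hypothetical helper, not part of hillR: phylogenetic Hill number of order q
# for one community (named abundance vector) on an ultrametric tree
manual_phylo_hill <- function(abund, tree, q) {
    tree <- reorder(tree, "postorder")
    n_tips <- Ntip(tree)

    # Relative abundance of each tip, aligned with the tip order in the tree
    rel_abund <- abund[tree$tip.label] / sum(abund)

    # Abundance of a node = summed abundance of the tips descending from it.
    # Postorder ensures a child is complete before it is added to its parent.
    node_abund <- numeric(n_tips + tree$Nnode)
    node_abund[seq_len(n_tips)] <- rel_abund
    for (i in seq_len(nrow(tree$edge))) {
        parent <- tree$edge[i, 1]
        child <- tree$edge[i, 2]
        node_abund[parent] <- node_abund[parent] + node_abund[child]
    }

    # One abundance and one length per branch; drop branches with no abundance
    branch_abund <- node_abund[tree$edge[, 2]]
    branch_len <- tree$edge.length
    present <- branch_abund > 0
    branch_abund <- branch_abund[present]
    branch_len <- branch_len[present]

    # Root-to-tip distance (the same for all tips in an ultrametric tree)
    tree_height <- max(node.depth.edgelength(tree))

    if (q == 1) {
        # Limit of the general formula as q approaches 1
        tree_height * exp(-sum(branch_len / tree_height * branch_abund * log(branch_abund)))
    } else {
        tree_height * sum(branch_len / tree_height * branch_abund^q)^(1 / (1 - q))
    }
}

short_tree <- read.tree(text = '(((A:3,B:3):1,(C:3,D:3):1):1,((E:3,F:3):1,(G:3,H:3):1):1);')
even_abund <- setNames(rep(1, 8), LETTERS[1:8])
uneven_abund <- setNames(1:8, LETTERS[1:8])

manual_even <- sapply(0:3, function(q) manual_phylo_hill(even_abund, short_tree, q))
manual_uneven <- sapply(0:3, function(q) manual_phylo_hill(uneven_abund, short_tree, q))

manual_even
manual_uneven
```

At q = 0 the abundance term disappears and only the branch lengths remain, which is why both communities share the same value there. For q > 0 the abundance term pulls the uneven community below the even one.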

Keypoints

hillR calculates phylogenetic Hill numbers given a phylogeny and a site-by-species matrix.
In trees with a similar topology, phylogenetic Hill numbers of order 0 reflect the sum of branch lengths. Orders of 1 and higher reflect the sum of branch lengths weighted by the relative abundance of different species.

Example 2. Balanced vs unbalanced trees

So far, we have learned that Hill numbers are affected by both the sum of branch lengths and the relative representation of each branch (in terms of species abundance) in the community. For example 1, we have used a perfectly balanced tree, i.e., all extant taxa have equal branch length. In example 2, we will explore the effects of an uneven distribution of cladogenesis events along the tree, leading to different phylogeny structures.

Let’s create a totally balanced tree…

R

balanced_tree <- read.tree(text='(((A:1,B:1):1,(C:1,D:1):1):1,((E:1,F:1):1,(G:1,H:1):1):1);')

… and a totally unbalanced tree.

R

unbalanced_tree <- read.tree(text='(A:7,(B:6,(C:5,(D:4,(E:3,(F:2,(G:1,H:1):1):1):1):1):1):1);')

Let’s plot both trees for comparison:

R

plot(balanced_tree)

R

plot(unbalanced_tree)

Notice that here the difference between the trees resides in the fate of each new lineage at a node. In the unbalanced tree, at each diversification event one of the lineages always persists until the present without splitting again while the other undergoes another round of diversification. In the balanced tree, both lineages from each node undergo a new split. The consequence is that in the balanced tree, all extant species result from recent diversification (i.e., they have a short evolutionary history before coalescing into their ancestor), whereas in the unbalanced tree we have a mix of old and recent lineages. This means that the phylogenetic history itself is creating an uneven representation of branch lengths across the community (even before we account for species abundance).
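This "mix of old and recent lineages" can be read straight from the trees by listing the length of the branch leading to each tip. The sketch below uses a hypothetical helper, terminal_lengths(), that relies on the fact that tips are numbered 1 to Ntip in the edge matrix of a phylo object.

```r
library(ape)

balanced_tree <- read.tree(text = '(((A:1,B:1):1,(C:1,D:1):1):1,((E:1,F:1):1,(G:1,H:1):1):1);')
unbalanced_tree <- read.tree(text = '(A:7,(B:6,(C:5,(D:4,(E:3,(F:2,(G:1,H:1):1):1):1):1):1):1);')

# Hypothetical helper: length of the branch leading to each tip
terminal_lengths <- function(tree) {
    # Branches whose child node is a tip (tips are numbered 1..Ntip)
    is_terminal <- tree$edge[, 2] <= Ntip(tree)
    setNames(tree$edge.length[is_terminal],
             tree$tip.label[tree$edge[is_terminal, 2]])
}

balanced_tips <- terminal_lengths(balanced_tree)
unbalanced_tips <- terminal_lengths(unbalanced_tree)

balanced_tips
unbalanced_tips
```

In the balanced tree every tip sits on a branch of the same length, whereas in the unbalanced tree terminal branches range from very short to as long as the whole tree.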

To see how such phylogenetic structure influences Hill numbers, let’s repeat the calculations from example 1 with these new trees. First, let’s focus on the even community (i.e., not introducing the relative species abundance factor yet):

R

even_comm_balanced_tree <- data.frame(
    hill_nb = NA,
    q = 0:3,
    comm = "even",
    tree = "balanced"
)

for(i in 1:nrow(even_comm_balanced_tree)) {
    even_comm_balanced_tree$hill_nb[i] <- hill_phylo(even_comm, balanced_tree, q = even_comm_balanced_tree$q[i])
}

even_comm_unbalanced_tree <- data.frame(
    hill_nb = NA,
    q = 0:3,
    comm = "even",
    tree = "unbalanced"
)

for(i in 1:nrow(even_comm_unbalanced_tree)) {
    even_comm_unbalanced_tree$hill_nb[i] <- hill_phylo(even_comm, unbalanced_tree, q = even_comm_unbalanced_tree$q[i])
}

even_comm_nb <- data.frame(rbind(even_comm_balanced_tree,even_comm_unbalanced_tree))

plot(even_comm_nb$q[even_comm_nb$tree=='balanced'],
     even_comm_nb$hill_nb[even_comm_nb$tree=='balanced'],
     type='b',col='red',
     xlab = 'Order',ylab='Hill values',
     xlim = range(even_comm_nb$q),
     ylim = range(even_comm_nb$hill_nb))

lines(even_comm_nb$q[even_comm_nb$tree=='unbalanced'],
     even_comm_nb$hill_nb[even_comm_nb$tree=='unbalanced'],
     type='b',col='darkred')

This plot is similar to example 1 in two ways: 1) one of the trees has higher Hill number values, in this case the unbalanced tree. This suggests that the unbalanced structure of the tree accounts for a deeper evolutionary history (i.e., lineages have overall longer branches); 2) the value of the Hill numbers drops as the order goes up. This happens because higher orders give less and less weight to branch lengths that are not so common (like, for instance, the short branch lengths in the unbalanced tree).

Something new that we observe here is a more evident difference between trees in the rate of change of the Hill number value as we increase the order. The unbalanced tree (darker red) shows a steeper drop than the balanced tree (lighter red). This is similar to what we observed in the abundances episode: uneven communities show a steeper drop in Hill numbers as order increases. In this example, instead of species abundance, the unevenness of the community is represented by the structure of the phylogeny: a completely unbalanced tree combining long and short branches generates an unevenness in the distribution of branch lengths. Notice that this difference in rate of change was barely observed in example 1: when plotting two trees with the same structure (short vs long trees, but both fully balanced), the lines are very similar. In example 1, the difference in the rate of change only becomes evident when we incorporate unevenness from the species abundance, but here in example 2 we see this difference deriving already from the different structures of the trees.

If the tree structure is already introducing some unevenness in our community, how would the pattern differ when species relative abundance is included?

In this case, both the structure of the tree and the species abundance interact to generate a pattern of unevenness. Specifically, the overall value of PD will be the sum of branch lengths weighted by their abundance in the community as informed by the species abundance distribution (instead of by the frequency of the branch length in the tree). This added information brings new insights. It may be the case, for instance, that in the unbalanced tree, even though long branches are overrepresented, the species with long branches are actually very rare in our community, and the short-branch species are very abundant. As you can probably infer, this difference suggests something about the evolutionary history of our community, in this case that the most abundant species have a very recent evolutionary history. Similarly, even though branch lengths are the same in the balanced tree, if some species are more abundant than others, it suggests that the evolutionary history of the tree is unevenly represented in our community.
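To make the weighting concrete, here is a minimal base R sketch of the branch-based formulation described by Chao, Chiu and Jost (2010), applied to a made-up balanced four-tip tree. The function `toy_phylo_hill`, the helper `branch_abund`, the tree, and the abundances are all hypothetical illustrations; whether `hill_phylo` computes exactly this quantity internally is an assumption.

```r
# Sketch of a branch-based phylogenetic Hill number (assumption: ultrametric tree).
# branch_len:   length of every branch in the tree
# branch_abund: summed relative abundance of the tips descending from each branch
# tree_height:  root-to-tip distance
toy_phylo_hill <- function(branch_len, branch_abund, tree_height, q) {
    keep <- branch_abund > 0              # branches absent from the community drop out
    w <- branch_len[keep] / tree_height   # branch length as a share of tree height
    a <- branch_abund[keep]
    if (q == 1) {
        mean_div <- exp(-sum(w * a * log(a)))   # limit of the general formula at q = 1
    } else {
        mean_div <- sum(w * a^q)^(1 / (1 - q))
    }
    tree_height * mean_div
}

# Hypothetical balanced tree ((A:1,B:1):1,(C:1,D:1):1): six branches of length 1
branch_len <- rep(1, 6)

# Branch abundances: the four tips, then the two internal branches (A+B and C+D)
branch_abund <- function(p) c(p, p[1] + p[2], p[3] + p[4])

even_p   <- c(0.25, 0.25, 0.25, 0.25)   # made-up even community
uneven_p <- c(0.70, 0.10, 0.10, 0.10)   # made-up uneven community

orders <- 0:3
toy_even <- sapply(orders, function(q)
    toy_phylo_hill(branch_len, branch_abund(even_p), tree_height = 2, q = q))
toy_uneven <- sapply(orders, function(q)
    toy_phylo_hill(branch_len, branch_abund(uneven_p), tree_height = 2, q = q))

rbind(even = toy_even, uneven = toy_uneven)
```

In this toy case both communities share the same order 0 value (the plain sum of branch lengths), while every higher order is lower for the uneven community, because the branches leading to rare species contribute less and less.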

To visualize this interaction between tree structure and species relative abundance, let’s redo the calculation for Hill numbers with balanced and unbalanced trees, this time using the uneven community. We’ll plot all final values together for comparison.

R

uneven_comm_balanced_tree <- data.frame(
    hill_nb = NA,
    q = 0:3,
    comm = "uneven",
    tree = "balanced"
)

for(i in 1:nrow(uneven_comm_balanced_tree)) {
    uneven_comm_balanced_tree$hill_nb[i] <- hill_phylo(uneven_comm, balanced_tree, q = uneven_comm_balanced_tree$q[i])
}

uneven_comm_unbalanced_tree <- data.frame(
    hill_nb = NA,
    q = 0:3,
    comm = "uneven",
    tree = "unbalanced"
)

for(i in 1:nrow(uneven_comm_unbalanced_tree)) {
    uneven_comm_unbalanced_tree$hill_nb[i] <- hill_phylo(uneven_comm, unbalanced_tree, q = uneven_comm_unbalanced_tree$q[i])
}

uneven_comm_nb <- data.frame(rbind(uneven_comm_balanced_tree,uneven_comm_unbalanced_tree))

plot(even_comm_nb$q[even_comm_nb$tree=='balanced'],
     even_comm_nb$hill_nb[even_comm_nb$tree=='balanced'],
     type='b',col='red',
     xlab = 'Order',ylab='Hill values',
     xlim = range(even_comm_nb$q),
     ylim = range(c(even_comm_nb$hill_nb, uneven_comm_nb$hill_nb)))

lines(even_comm_nb$q[even_comm_nb$tree=='unbalanced'],
     even_comm_nb$hill_nb[even_comm_nb$tree=='unbalanced'],
     type='b',col='darkred')

lines(uneven_comm_nb$q[uneven_comm_nb$tree=='balanced'],
     uneven_comm_nb$hill_nb[uneven_comm_nb$tree=='balanced'],
     type='b',col='lightblue')

lines(uneven_comm_nb$q[uneven_comm_nb$tree=='unbalanced'],
     uneven_comm_nb$hill_nb[uneven_comm_nb$tree=='unbalanced'],
     type='b',col='darkblue')

Reminders: 1) light colors represent balanced trees, whereas darker colors represent unbalanced trees; 2) red lines represent the even community, whereas blue lines represent the uneven community.

From this second plot, we notice that:

Hill number values for the unbalanced tree (darker colors) are still higher than those for the balanced tree (lighter colors). This makes sense since the relative abundance is the same for both trees, so the overall pattern (i.e., longer branches in the unbalanced tree) remains;
the Hill number of order 0 is the same regardless of whether species differ in relative abundance. This was also the case for example 1. Order 0 is simply the sum of all branch lengths (i.e., it does not account for the relative abundance of branch lengths or taxa);
the inclusion of an uneven community accentuates the rate of decrease in the Hill number value as we increase the order. This is because higher orders are now affected by the unevenness of the distribution of branch lengths as well as by the unevenness of species abundance. This is a good example of how calculating Hill numbers of different orders while incorporating species abundance gives us a few summary statistics, accounting for both evolutionary history and the species abundance distribution, that are comparable across different regions.

Keypoints

The phylogenetic structure of the community (represented by the topology of the tree) influences the evenness of the branch length distribution and contributes to different rates of decrease in Hill number values as the order increases.
The inclusion of community structure (relative abundance of different species) further accentuates this difference in rate of change.
Phylogenetic Hill numbers represent the sum of branch lengths. Higher orders weight that sum by the distribution of branch lengths and the species relative abundance.

Now that we have built some intuition for the information phylogenetic Hill numbers can give us, let’s move on to our actual data.

Moving on to our island communities

For this final section, we will work with the data for the three island communities to calculate Hill numbers of orders 0 to 3. We will use our pruned phylogeny and the site-by-species matrix we created in the abundances episode. Since we covered the workflow for calculating phylogenetic Hill numbers in the previous section, we will leave this activity as a challenge.

Challenge

For this challenge, you should calculate the phylogenetic Hill numbers of orders 0 to 3 for the three islands we are working with throughout the workshop. The key objects here will be the phylogeny stored in arthro_tree_pruned and the site-by-species matrix stored in abundances_wide. You should also plot the Hill numbers for each island and discuss what the calculated values allow you to infer regarding the history of each community.

Instructor Note

Learners should already have the object abundances_wide in their environment from the previous episode. If not, use the lines below.

R

abundances_wide <- read.csv("https://raw.githubusercontent.com/role-model/multidim-biodiv-data/main/episodes/data/abundances_wide.csv")

# Note that reading from the URL introduces a "." in the column name. Replace using gsub
colnames(abundances_wide) <- gsub('\\.',' ',colnames(abundances_wide))

# It also got rid of sites as rownames.
rownames(abundances_wide) <- c('BI_01','MA_01','KA_01')
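As a side note on where the "." comes from: by default read.csv() passes column names through a syntactic-name check, and the argument check.names = FALSE switches that off. The sketch below uses a made-up two-column CSV (the column names and values are hypothetical, not taken from abundances_wide.csv) to compare the two routes.

```r
# Hypothetical CSV content with spaces in the header
csv_text <- "Genus species,Other taxon\n3,5\n"

# Default behaviour: spaces in column names are replaced by "."
default_names <- colnames(read.csv(text = csv_text))

# check.names = FALSE keeps the names exactly as written in the file
kept_names <- colnames(read.csv(text = csv_text, check.names = FALSE))

default_names
kept_names

# The gsub() fix used above recovers the same names from the default import
fixed_names <- gsub('\\.', ' ', default_names)
identical(fixed_names, kept_names)
```

Note that the gsub() route would also alter any "." that was part of an original name, so check.names = FALSE may be the safer choice when names can contain periods.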

Show me the solution

We can directly calculate Hill numbers using the hill_phylo function along with the objects arthro_tree_pruned and abundances_wide. Since we have three sites, the function will return a vector with three values, one Hill number for each site of the order we requested. Let’s create an empty list to store a vector for each order, and use a for loop to calculate orders 0 to 3.

R

hill_values <- vector('list', length = 4)
for (i in 0:3) {
    hill_values[[i + 1]] <- hill_phylo(abundances_wide, arthro_tree_pruned, 
                                       q = i)
}
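The same loop-and-store pattern can be written more compactly with sapply(), which returns a sites-by-orders matrix directly. The sketch below is self-contained, so it uses a hypothetical stand-in function, `toy_hill` (a plain, non-phylogenetic Hill number), and a made-up community matrix instead of hill_phylo and our island data.

```r
# Hypothetical stand-in for hill_phylo(): non-phylogenetic Hill number of
# order q for each row (site) of a site-by-species matrix
toy_hill <- function(comm, q) {
    apply(comm, 1, function(n) {
        p <- n[n > 0] / sum(n)   # relative abundances of the species present
        if (q == 1) exp(-sum(p * log(p))) else sum(p^q)^(1 / (1 - q))
    })
}

# Made-up abundances for three sites and three species
toy_comm <- matrix(c(10, 10, 10,
                     25,  4,  1,
                      5,  5,  0),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("site_a", "site_b", "site_c"),
                                   c("sp1", "sp2", "sp3")))

# One column per order, one row per site
orders <- 0:3
toy_values <- sapply(orders, function(q) toy_hill(toy_comm, q = q))
colnames(toy_values) <- paste0("hill_", orders)
toy_values
```

With the real objects, the inner call would presumably become hill_phylo(abundances_wide, arthro_tree_pruned, q = q), keeping the argument order used in the loop above.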

Now, let’s create a data.frame with islands as rows and Hill numbers of different orders as columns. For that, we will use the function do.call to collapse our list hill_values column by column with cbind.data.frame.

R

hill_values <- do.call(cbind.data.frame, hill_values)
colnames(hill_values) <- paste0('hill_phylo_', 0:3)
hill_phylo_nbs <- data.frame(site = rownames(hill_values),
                             hill_values)

rownames(hill_phylo_nbs) <- NULL
hill_phylo_nbs

OUTPUT

   site hill_phylo_0 hill_phylo_1 hill_phylo_2 hill_phylo_3
1 BI_01     59.09643     18.71926     13.41881     11.65893
2 MA_01     64.21265     40.30308     26.89875     20.36594
3 KA_01     54.49430     28.10458     20.00878     16.29594

Finally, we can plot using code similar to what we used in our examples. We set xlim to go from 0 to 3, and ylim to go from the lowest to the highest value in the object hill_values. We also add a legend using the legend() function like we did in the traits episode.

R

# Rows of hill_phylo_nbs: 1 = BI_01 (Big Island), 2 = MA_01 (Maui), 3 = KA_01 (Kauai)
# unlist() turns each one-row data.frame slice into a plain numeric vector
plot(seq(0,3),unlist(hill_phylo_nbs[3,2:5]),
     type='b',col="#440154FF",
     xlab = 'Order',ylab='Hill values',
     xlim = c(0,3),
     ylim = range(min(hill_values),max(hill_values)))

lines(seq(0,3),unlist(hill_phylo_nbs[2,2:5]),
      type='b',col="#21908CFF")

lines(seq(0,3),unlist(hill_phylo_nbs[1,2:5]),
      type='b',col="#FDE725FF")

legend(
    "topright",
    legend = c("Kauai","Maui","Big Island"),
    pch = 19,
    col = c("#440154FF", "#21908CFF", "#FDE725FF")
)

Discussion

After calculating and plotting Hill numbers, what inferences can you make about the history of these communities, in terms of the rates of local speciation and colonization?

Are there any further visualizations we haven’t done so far that you could make to further investigate the phylogenetic history of each community?

Instructor Note

A few points of discussion:
1. Kauai has the lowest absolute sum of branch lengths, whereas Maui has the highest. The Big Island falls in between.
2. The Big Island, however, has the most uneven distribution of branch lengths. Even though branches are overall longer on the Big Island than on Kauai, its lower Hill numbers at higher orders suggest they are more unevenly distributed in the Big Island community.
3. Lower values for the Big Island at higher orders, coupled with a slightly higher value for order 0, suggest that the evolutionary branches on this island are longer than on Kauai but unevenly represented, probably by a few very abundant species. This suggests a young history of colonization for this island: the community is dominated by a few good colonizers which, having evolved for a while on the mainland, bring their long branches to the community when they arrive. The absence of evenly distributed shorter branches further suggests that little speciation has occurred on this island, indicating a young community mostly composed of colonizers.
4. Kauai has the lowest sum of branch lengths, but its evenness seems to fall between the other two islands. This suggests that this island harbors more closely related taxa (lowest sum of branches) than the other islands, but that these values are evenly distributed among species. A longer history may be inferred here, with a lot of local speciation (accounting for closely related taxa) and probably competitive coexistence dictating similar abundances across species (accounting for evenness of branch representation).
5. Maui has the highest PD overall at all orders and a higher evenness in the distribution of branch lengths. This suggests an intermediate history: both speciation and immigration account for a lot of the evolutionary history on this island, while enough time may have elapsed to produce a more even distribution of species abundance and phylogenetic history.
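One rough way to put a number on the evenness comparisons above is the ratio of the order 2 value to the order 0 value for each island. This is only a sketch: the values are copied from the OUTPUT table above, and treating this ratio as an evenness summary is an assumption made here for illustration, not something computed elsewhere in the lesson.

```r
# Values copied from the OUTPUT table in the solution above
island_hill <- data.frame(
    site = c("BI_01", "MA_01", "KA_01"),
    hill_phylo_0 = c(59.09643, 64.21265, 54.49430),
    hill_phylo_2 = c(13.41881, 26.89875, 20.00878)
)

# Share of the order 0 value that remains at order 2;
# values closer to 1 indicate a more even representation of branch lengths
island_hill$ratio_2_0 <- island_hill$hill_phylo_2 / island_hill$hill_phylo_0

# Sites sorted from least to most even under this measure
island_ranked <- island_hill[order(island_hill$ratio_2_0), c("site", "ratio_2_0")]
island_ranked
```

Under this measure the Big Island comes out as the least even and Maui as the most even, with Kauai in between, which matches points 2, 4 and 5.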

Further visualization of the trees for each community can help us corroborate these ideas:

R

BI_tree <- keep.tip(arthro_tree_pruned,abundances$final_name[ abundances$island == 'BigIsland' ])
MA_tree <- keep.tip(arthro_tree_pruned,abundances$final_name[ abundances$island == 'Maui' ])
KA_tree <- keep.tip(arthro_tree_pruned,abundances$final_name[ abundances$island == 'Kauai' ])

plot(BI_tree)

R

plot(MA_tree)

R

plot(KA_tree)

The interesting thing here is that we could infer the history of the communities from the numbers alone, before looking at the tree for each community. This speaks to the power of Hill numbers as summary statistics of biodiversity patterns and to how useful they can be for simulation-based inference.