Into the palaeoverse

A community-driven R package

Development team

Funders

  • European Union’s Horizon 2020 research and innovation program (MAPAS project)
    • Grant number: 947921
  • The Royal Society
    • Grant numbers: RF_ERE_210013, RGF_R1_180020, and RGF_EA_180318
  • Juan de la Cierva-formación fellowship
    • FJC2020-044836-I / MCIN / AEI / 10.13039/501100011033
  • ETH+ Research Grant (BECCY project)
  • FAPESP Postdoctoral fellowship
    • Grant number: 2022/05697-9
  • Population Biology Program of Excellence Postdoctoral Fellowship (the University of Nebraska-Lincoln)
  • Lerner-Gray Postdoctoral Research Fellowship (American Museum of Natural History)

Introduction

The long and the short of it 📏

What is Palaeoverse?



Palaeoverse is a project that aims to bring the palaeobiology community together.

What is the palaeoverse R package?

palaeoverse provides auxiliary functions to support data preparation and exploration.

Improve code readability, reusability and reproducibility.

What makes palaeoverse different?

  • Community-informed development
    • Authors (n = 13)
    • Survey participants (n = 35)
  • Well-documented & peer-reviewed code
    • Formal review process

Functionality

A whistle-stop tour of palaeoverse 🚋

What’s available?

  • axis_geo
  • bin_lat
  • bin_time
  • data
  • group_apply
  • lat_bins
  • look_up
  • palaeorotate
  • phylo_check
  • tax_check
  • tax_expand_lat
  • tax_expand_time
  • tax_range_space
  • tax_range_time
  • tax_unique
  • time_bins

Expected input

A lot of data, a lot of sources, and a lot of unique features.






Data structure, not source.

occdf \(\rightarrow\) function(x) \(\rightarrow\) df

Occurrence dataframe*
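
As a rough illustration of "data structure, not source", an occurrence dataframe can be assembled by hand from any database or spreadsheet. The sketch below uses only base R. The column names (lng, lat, max_ma, min_ma) mirror those of the bundled example datasets, while the taxon names and values are invented, and the list of "required" columns is an assumption that will differ between functions.

```r
# Hypothetical occurrences: names and values are invented for illustration
occdf <- data.frame(
  genus  = c("TaxonA", "TaxonB", "TaxonC"),
  lng    = c(2.0, -103.0, -66.0),
  lat    = c(46.0, 35.0, -7.0),
  max_ma = c(100.5, 145.0, 201.3),
  min_ma = c(93.9, 100.5, 190.8)
)

# Columns assumed to be needed downstream (varies by function)
required <- c("lng", "lat", "max_ma", "min_ma")
missing_cols <- setdiff(required, colnames(occdf))

# Basic sanity checks on ages and coordinates
ages_ok <- all(occdf$max_ma >= occdf$min_ma)
coords_ok <- all(abs(occdf$lat) <= 90 & abs(occdf$lng) <= 180)
```

Checks like these are cheap to run before handing the dataframe to any binning or range function.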

Getting started

Let’s dive in 🤿…

Installation

palaeoverse can be installed from CRAN using:

install.packages("palaeoverse")


The development version can be installed using devtools:

devtools::install_github("palaeoverse-community/palaeoverse")


Once installed, load the package in the usual manner:

library(palaeoverse)

Example datasets

Two example occurrence datasets are available.

Carboniferous–Early Triassic tetrapods (n = 5270, Paleobiology Database).

# Get details on dataset
?tetrapods
# Load dataset
data("tetrapods")
# Available variables
colnames(tetrapods)
##  [1] "occurrence_no"     "collection_no"     "identified_name"  
##  [4] "identified_rank"   "accepted_name"     "accepted_rank"    
##  [7] "early_interval"    "late_interval"     "max_ma"           
## [10] "min_ma"            "phylum"            "class"            
## [13] "order"             "family"            "genus"            
## [16] "abund_value"       "abund_unit"        "lng"              
## [19] "lat"               "collection_name"   "cc"               
## [22] "formation"         "stratgroup"        "member"           
## [25] "zone"              "lithology1"        "environment"      
## [28] "pres_mode"         "taxon_environment" "motility"         
## [31] "life_habit"        "diet"

Phanerozoic reef occurrences (n = 4363, PaleoReefs Database).

# Get details on dataset
?reefs
# Load dataset
data("reefs")
# Available variables
colnames(reefs)
##  [1] "r_number"   "name"       "formation"  "system"     "series"    
##  [6] "interval"   "biota_main" "biota_sec"  "lng"        "lat"       
## [11] "country"    "authors"    "title"      "year"

Reference datasets

Two reference datasets are available.

Geological Time Scale 2012 & 2020 (Gradstein et al. 2012; 2020).

# Get details on dataset
?GTS2012
?GTS2020
# Load dataset
data("GTS2012")
data("GTS2020")
# Increase output width
options(width = 120)
# Print first few rows
head(GTS2012, n = 3)
##   interval_number      interval_name  rank max_ma mid_ma min_ma duration_myr  font  colour abbr
## 1               1           Holocene stage 0.0117 0.0059 0.0000       0.0117 black #FDEDEC <NA>
## 2               2  Upper Pleistocene stage 0.1260 0.0688 0.0117       0.1143 black #FFF2D3 <NA>
## 3               3 Middle Pleistocene stage 0.7810 0.4535 0.1260       0.6550 black #FFF2C7 <NA>
head(GTS2020, n = 3)
##   interval_number interval_name  rank max_ma  mid_ma min_ma duration_myr  font  colour abbr
## 1               1    Meghalayan stage 0.0042 0.00210 0.0000       0.0042 black #FDEDEC <NA>
## 2               2 Northgrippian stage 0.0082 0.00620 0.0042       0.0040 black #FDECE4 <NA>
## 3               3  Greenlandian stage 0.0117 0.00995 0.0082       0.0035 black #FEECDB <NA>

Stratigraphic time bins

# Get stage-level time bins
bins <- time_bins(interval = "Phanerozoic", rank = "stage", plot = TRUE)
# Get first few rows
head(bins, n = 3)
##   bin interval_name  rank max_ma mid_ma min_ma duration_myr abbr  colour  font
## 1   1     Fortunian stage    541  535.0    529           12   Fo #99B575 black
## 2   2       Stage 2 stage    529  525.0    521            8   S2 #A6BA80 black
## 3   3       Stage 3 stage    521  517.5    514            7   S3 #A6C583 black

Macrostrat time bins

# Get North American Land Mammal Ages
bins <- time_bins(scale = "North American Land Mammal Ages", plot = TRUE)
# Get first few rows
head(bins, n = 3)
##   bin interval_name                            rank max_ma mid_ma min_ma duration_myr abbr  colour  font
## 1   1       Puercan North American Land Mammal Ages  66.00 65.375  64.75         1.25    P #FDB469 black
## 2   2   Torrejonian North American Land Mammal Ages  64.75 63.500  62.25         2.50   To #FEBA64 black
## 3   3     Tiffanian North American Land Mammal Ages  62.25 59.875  57.50         4.75   Ti #FEBF6A black

Near-equal-length time bins

# Get near-equal-length time bins (~15 Myr) built from stages
bins <- time_bins(interval = "Phanerozoic", rank = "stage", size = 15, plot = TRUE)
# Get first few rows
head(bins, n = 3)
##   bin max_ma mid_ma min_ma duration_myr grouping_rank                 intervals  colour  font
## 1   1    541 535.00  529.0         12.0         stage                 Fortunian #80cdc1 black
## 2   2    529 521.50  514.0         15.0         stage          Stage 3, Stage 2 #80cdc1 black
## 3   3    514 507.25  500.5         13.5         stage Drumian, Wuliuan, Stage 4 #80cdc1 black

Temporal occurrence binning

Five temporal binning methods for age range data:

# Use tetrapod example data
occdf <- tetrapods

# Get stage-level time bins
bins <- time_bins(interval = "Phanerozoic", rank = "stage")

# Assign via midpoint age of fossil occurrence data
ex1 <- bin_time(occdf = occdf, bins = bins, method = "mid")

# Assign to all bins that age range covers
ex2 <- bin_time(occdf = occdf, bins = bins, method = "all")

# Assign via majority overlap based on fossil occurrence age range
ex3 <- bin_time(occdf = occdf, bins = bins, method = "majority")

# Randomly assign to overlapping bins based on fossil occurrence age range
ex4 <- bin_time(occdf = occdf, bins = bins, method = "random", reps = 10)

# Randomly assign point estimates (e.g. uniform distribution) based on fossil occurrence age range
ex5 <- bin_time(occdf = occdf, bins = bins, method = "point", reps = 10)
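
To make the difference between the deterministic methods concrete, here is a minimal base R sketch of the "mid", "all" and "majority" rules. It is a simplified illustration, not the code behind bin_time(): the three bins are copied from the time_bins() output shown earlier, the occurrences are hypothetical, and the handling of ages that fall exactly on a bin boundary is an assumption.

```r
# Three Cambrian stage bins, values copied from the time_bins() output above
bins <- data.frame(bin = 1:3,
                   max_ma = c(541, 529, 521),
                   min_ma = c(529, 521, 514))

# Hypothetical occurrences with age ranges in Ma
occ <- data.frame(taxon = c("A", "B", "C"),
                  max_ma = c(540, 530, 520),
                  min_ma = c(530, 515, 516))

# Overlap (in Myr) between every occurrence (rows) and every bin (columns)
overlap <- outer(seq_len(nrow(occ)), seq_len(nrow(bins)), function(i, j) {
  pmax(0, pmin(occ$max_ma[i], bins$max_ma[j]) -
          pmax(occ$min_ma[i], bins$min_ma[j]))
})

# "mid": the bin containing the midpoint of the age range
mid_ma <- (occ$max_ma + occ$min_ma) / 2
bin_mid <- vapply(mid_ma, function(m) {
  bins$bin[m <= bins$max_ma & m > bins$min_ma][1]
}, integer(1))

# "all": every bin the age range overlaps (a list, one element per occurrence)
bin_all <- apply(overlap > 0, 1, function(x) bins$bin[x])

# "majority": the bin with the largest overlap
bin_majority <- bins$bin[apply(overlap, 1, which.max)]
```

Occurrence B is the informative case: its range touches all three bins, so "all" returns three assignments, whereas "mid" and "majority" each return one.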

Latitudinal occurrence binning

Generate and bin latitudinal data:

# Generate latitudinal bins
bins <- lat_bins(size = 10, plot = TRUE)
# Use reef example data
occdf <- reefs
# Bin occurrences
occdf <- bin_lat(occdf = occdf, bins = bins, lat = "lat")
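
The arithmetic behind equal-width latitudinal binning can be sketched in base R. This is an illustration under stated assumptions rather than the implementation of bin_lat(): bins are assumed to be numbered from 90° southwards (as the tax_expand_lat() output later in this deck suggests), the latitudes are hypothetical, and the treatment of values lying exactly on a boundary may differ from the package.

```r
# Hypothetical latitudes in decimal degrees
lat <- c(55, 12.5, -33, -90)
size <- 10

# Bin 1 is assumed to span 90 to 80 degrees, counting southwards
n_bins <- 180 / size
bin <- pmin(floor((90 - lat) / size) + 1, n_bins)

# Upper and lower limits of the assigned bins
bin_max <- 90 - (bin - 1) * size
bin_min <- bin_max - size
binned <- data.frame(lat, bin, bin_max, bin_min)
```

The pmin() call keeps a latitude of exactly -90 in the southernmost bin instead of creating a nineteenth bin.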

Spatial occurrence binning

Generate and bin spatial data:

# Get reef data
occdf <- reefs[1:500, ]
# Bin data using a hexagonal equal-area grid
occdf <- bin_space(occdf = occdf, spacing = 250, return = TRUE)
# Plot world and grid using ggplot2
library(ggplot2)
library(rnaturalearth)
world <- ne_countries(scale = "small", returnclass = "sf")
ggplot() +
  geom_sf(data = world, colour = "black", fill = "lightgrey") + 
  geom_sf(data = occdf$grid, fill = "orange", colour = "black") + 
  theme_void()

Palaeogeographic reconstruction

Palaeorotate fossil occurrences (multiple models available):

# Example with a few occurrences
occdf <- data.frame(lng = c(2, -103, -66),
                    lat = c(46, 35, -7),
                    age = c(88, 125, 200))

# Estimate palaeocoordinates using the GPlates API
ex1 <- palaeorotate(occdf = occdf, method = "point")

# Estimate palaeocoordinates using reconstruction files
ex2 <- palaeorotate(occdf = occdf, method = "grid")

# Estimate palaeocoordinates and uncertainty using reconstruction files
ex3 <- palaeorotate(occdf = occdf, method = "grid", uncertainty = TRUE)

# Increase output width
options(width = 400)
# Get first few rows
head(ex3)
##    lng lat age   rot_model rot_age rot_lng rot_lat    p_lng    p_lat
## 1    2  46  88 MERDITH2021      88    1.80   46.42  13.0134  37.6406
## 2 -103  35 125 MERDITH2021     127 -102.61   34.63 -41.8928  35.0437
## 3  -66  -7 200 MERDITH2021     200  -65.52   -6.95 -22.5209 -16.7714

Taxonomic spell check

Identify and count potential spelling variations of the same taxon:

# Load occurrence data
data("tetrapods")
# Check taxon names alphabetically
ex1 <- tax_check(taxdf = tetrapods, name = "genus", dis = 0.05, verbose = FALSE)
# Get first few rows
head(ex1)
##   group     greater     lesser count_greater count_lesser
## 1     D Dvinosaurus Dinosaurus            23            2
## 2     V   Varanopus   Varanops             5            3
# Check taxon names by group
ex2 <- tax_check(taxdf = tetrapods, name = "genus", group = "family", dis = 0.05, verbose = FALSE)
# Get first few rows
head(ex2)
## NULL

In this example dataset:

  • Dinosaurus belongs to the Phthinosuchidae
  • Dvinosaurus belongs to the Dvinosauridae
  • Varanops belongs to the Varanopidae
  • Varanopus belongs to the Captorhinidae
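
The idea can be reproduced on these four names with base R alone. Note the swap: tax_check() applies its own string dissimilarity measure, controlled by dis, whereas the sketch below uses Levenshtein edit distance from utils::adist() purely as a stand-in. It flags name pairs that differ by a single edit and then checks whether the two names share a family, which is why the grouped check above returns nothing.

```r
# Genera and families as listed above
taxa <- data.frame(
  genus  = c("Dinosaurus", "Dvinosaurus", "Varanops", "Varanopus"),
  family = c("Phthinosuchidae", "Dvinosauridae", "Varanopidae", "Captorhinidae")
)

# Pairwise Levenshtein distance (stand-in metric, not the one in tax_check)
d <- utils::adist(taxa$genus)

# Pairs of names that differ by a single edit, each pair counted once
idx <- which(d == 1 & upper.tri(d), arr.ind = TRUE)
flagged <- data.frame(
  name_1 = taxa$genus[idx[, "row"]],
  name_2 = taxa$genus[idx[, "col"]],
  same_family = taxa$family[idx[, "row"]] == taxa$family[idx[, "col"]]
)
flagged
```

Both flagged pairs span two families, so they are more likely to be distinct valid genera than misspellings of one another.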

Unique taxa

Identifying unique taxa:

# Create dataframe
occdf <- data.frame(species = c("rex", "aegyptiacus", NA),
                    genus = c("Tyrannosaurus", "Spinosaurus", NA),
                    family = c("Tyrannosauridae", "Spinosauridae", "Diplodocidae"))
# Retain unique taxa
dinosaur_species <- tax_unique(occdf = occdf,
                               species = "species",
                               genus = "genus",
                               family = "family",
                               resolution = "species")
head(dinosaur_species)
##            family         genus           genus_species             unique_name
## 1   Spinosauridae   Spinosaurus Spinosaurus aegyptiacus Spinosaurus aegyptiacus
## 2 Tyrannosauridae Tyrannosaurus       Tyrannosaurus rex       Tyrannosaurus rex
## 3    Diplodocidae          <NA>                    <NA>     Diplodocidae indet.

Temporal range

Calculate and plot temporal range of taxa:

# Grab tetrapod data
occdf <- tetrapods
# Remove NAs
occdf <- subset(occdf, !is.na(order))
# Temporal range
ex <- tax_range_time(occdf = occdf, name = "order", plot = TRUE)
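
Conceptually, a temporal range is the oldest maximum age and the youngest minimum age across a taxon's occurrences. The base R sketch below shows that calculation on hypothetical data; it is not the code used by tax_range_time(), and the output column names are chosen for illustration.

```r
# Hypothetical occurrences with age ranges in Ma
occ <- data.frame(taxon  = c("X", "X", "Y", "Y", "Y"),
                  max_ma = c(300, 280, 260, 255, 252),
                  min_ma = c(290, 270, 255, 250, 247))

# Oldest maximum age and youngest minimum age per taxon
ranges <- data.frame(
  taxon  = sort(unique(occ$taxon)),
  max_ma = as.numeric(tapply(occ$max_ma, occ$taxon, max)),
  min_ma = as.numeric(tapply(occ$min_ma, occ$taxon, min))
)

# Duration of each range in millions of years
ranges$range_myr <- ranges$max_ma - ranges$min_ma
```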

Geographic range

Four approaches to calculate geographic range of taxa:

# Grab tetrapod data
occdf <- tetrapods
# Remove NAs
occdf <- subset(occdf, !is.na(genus))
# Convex hull
ex1 <- tax_range_space(occdf = occdf, name = "genus", method = "con")
# Latitudinal range
ex2 <- tax_range_space(occdf = occdf, name = "genus", method = "lat")
# Great Circle Distance
ex3 <- tax_range_space(occdf = occdf, name = "genus", method = "gcd")
# Occupied grid cells
ex4 <- tax_range_space(occdf = occdf, name = "genus", method = "occ", spacing = 250)
# See first few rows
head(ex2)
##                taxon taxon_id max_lat min_lat range_lat
## 1           Abajudon        1 -10.624 -16.524       5.9
## 2          Abdalodon        2 -31.925 -31.925       0.0
## 3        Abyssomedon        3  34.776  34.776       0.0
## 4   Acanthostomatops        4  51.000  51.000       0.0
## 5          Acerastea        5 -24.833 -24.833       0.0
## 6 Acerosodontosaurus        6 -24.000 -24.000       0.0
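
Two of these measures are simple enough to sketch in base R. The example below computes a latitudinal range and a maximum pairwise great-circle distance for one hypothetical taxon. The haversine formula and the Earth radius of 6371 km are assumptions made for this sketch; tax_range_space() may compute distances differently.

```r
# Haversine great-circle distance in km between points in decimal degrees
haversine_km <- function(lng1, lat1, lng2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical occurrences of a single taxon
occ <- data.frame(lng = c(0, 90, 180), lat = c(0, 0, 0))

# Latitudinal range in degrees
range_lat <- max(occ$lat) - min(occ$lat)

# Maximum great-circle distance over all pairs of occurrences
pairs <- utils::combn(nrow(occ), 2)
gcd <- max(haversine_km(occ$lng[pairs[1, ]], occ$lat[pairs[1, ]],
                        occ$lng[pairs[2, ]], occ$lat[pairs[2, ]]))
```

The two measures can disagree sharply: here the taxon spans half the globe in longitude yet has a latitudinal range of zero.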

Temporal pseudo-occurrences

Convert range data to bin-level pseudo-occurrences:

# Generate example df
taxdf <- data.frame(name = c("A", "B", "C"),
                    max_age = c(150, 60, 30),
                    min_age = c(110, 20, 0))
# Generate pseudo-occurrences
ex1 <- tax_expand_time(taxdf = taxdf, max_ma = "max_age", min_ma = "min_age")
# Increase output width
options(width = 200)
# See first few rows
head(ex1)
##   name max_age min_age   ext  orig interval_number     interval_name  rank max_ma  mid_ma min_ma duration_myr  font  colour abbr
## 1    C      30       0 FALSE FALSE               1        Meghalayan stage 0.0042 0.00210 0.0000       0.0042 black #FDEDEC <NA>
## 2    C      30       0 FALSE FALSE               2     Northgrippian stage 0.0082 0.00620 0.0042       0.0040 black #FDECE4 <NA>
## 3    C      30       0 FALSE FALSE               3      Greenlandian stage 0.0117 0.00995 0.0082       0.0035 black #FEECDB <NA>
## 4    C      30       0 FALSE FALSE               4 Upper Pleistocene stage 0.1290 0.07035 0.0117       0.1173 black #FFF2D3 <NA>
## 5    C      30       0 FALSE FALSE               5         Chibanian stage 0.7740 0.45150 0.1290       0.6450 black #FFF2C7 <NA>
## 6    C      30       0 FALSE FALSE               6         Calabrian stage 1.8000 1.28700 0.7740       1.0260 black #FFF2BA <NA>

Latitudinal pseudo-occurrences

Convert range data to bin-level pseudo-occurrences:

# Generate latitudinal bins
bins <- lat_bins()
# Generate example df
taxdf <- data.frame(name = c("A", "B", "C"),
                    max_lat = c(60, 20, -10),
                    min_lat = c(20, -40, -60))
# Generate pseudo-occurrences
ex1 <- tax_expand_lat(taxdf = taxdf, bins = bins)
# See first few rows
head(ex1)
##   name max_lat min_lat bin max mid min
## 1    A      60      20   4  60  55  50
## 2    A      60      20   5  50  45  40
## 3    A      60      20   6  40  35  30
## 4    A      60      20   7  30  25  20
## 5    B      20     -40   8  20  15  10
## 6    B      20     -40   9  10   5   0
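
The expansion step amounts to keeping every bin that a taxon's range overlaps. Here is a minimal base R sketch for taxon A, with the ten-degree bins rebuilt by hand to match the numbering in the output above. It illustrates the idea only and makes an assumption about boundaries: a bin that merely touches the range at its edge is excluded.

```r
# Ten-degree bins, numbered from north to south as in the output above
bins <- data.frame(bin = 1:18,
                   max = seq(90, -80, by = -10),
                   min = seq(80, -90, by = -10))

# Taxon A from the example above
taxon <- data.frame(name = "A", max_lat = 60, min_lat = 20)

# Keep every bin that the latitudinal range overlaps
hits <- bins[bins$max > taxon$min_lat & bins$min < taxon$max_lat, ]

# One pseudo-occurrence row per overlapped bin
pseudo <- data.frame(taxon[rep(1, nrow(hits)), ], hits, row.names = NULL)
pseudo
```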

Phylogeny wrangling

Compare a list of taxonomic names to tip names in a user-provided phylogeny:

# Read in example tree of ceratopsians from paleotree
library(paleotree)
data(RaiaCopesRule)
# Set smaller margins for plotting
par(mar = rep(0, 4))
# Plot tree
plot(ceratopsianTreeRaia)

# Specify list of names
dinosaurs <- c("Nasutoceratops_titusi", 
               "Diabloceratops_eatoni",
               "Zuniceratops_christopheri",
               "Psittacosaurus_major")

# Table of taxon names in list, tree or both
ex1 <- phylo_check(tree = ceratopsianTreeRaia,
                   list = dinosaurs)
# Get first few rows
head(ex1)
##                   taxon_name present_in_tree present_in_list
## 8      Diabloceratops_eatoni            TRUE            TRUE
## 33      Psittacosaurus_major            TRUE            TRUE
## 38     Nasutoceratops_titusi           FALSE            TRUE
## 39 Zuniceratops_christopheri           FALSE            TRUE
## 1       Centrosaurus_apertus            TRUE           FALSE
## 2  Styracosaurus_albertensis            TRUE           FALSE
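
The comparison itself is a set operation. The base R sketch below shows the logic on a short character vector standing in for tree$tip.label; that vector is a hypothetical subset chosen for illustration, and the table is built by hand, not by phylo_check().

```r
# Hypothetical tip labels standing in for tree$tip.label
tips <- c("Diabloceratops_eatoni", "Psittacosaurus_major",
          "Centrosaurus_apertus")

# Names of interest
taxa <- c("Nasutoceratops_titusi", "Diabloceratops_eatoni",
          "Psittacosaurus_major")

# Presence of every name in the tree and in the list
all_names <- union(taxa, tips)
check <- data.frame(taxon_name = all_names,
                    present_in_tree = all_names %in% tips,
                    present_in_list = all_names %in% taxa)

# Names that would need adding to, or pruning from, the tree
missing_from_tree <- setdiff(taxa, tips)
missing_from_list <- setdiff(tips, taxa)
```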

Interval linking

Link and match interval names to the Geological Time Scale:

## Link numeric age values
# Create example df
occdf <- data.frame(name = c("A", "B", "C"),
                    early_interval = c("Maastrichtian",
                                       "Campanian",
                                       "Sinemurian"),
                    late_interval = c("Maastrichtian",
                                      "Campanian",
                                      "Bartonian"))
# Assign stages and numerical ages
occdf <- look_up(occdf)

## Use the example interval key
# Get internal reef data
occdf <- reefs
# Get internal interval key
int_key <- interval_key
# Assign stages and numerical ages
occdf <- look_up(occdf,
                early_interval = "interval",
                late_interval = "interval",
                int_key = int_key)

Plotting

Add Geological Time Scale to plots:

# Plot data
plot(x = 541:0,
     xlab = "Time (Ma)", ylab = "User-variable",
     xlim = c(541, 0), xaxt = "n", type = "l", lwd = 5)

# Add Geological Time Scale
axis_geo(side = 1, intervals = "periods")

Wrapper

Run functions over groups of data:

# Get tetrapod data
occdf <- tetrapods

# Count number of occurrences from each country
ex1 <- group_apply(occdf = occdf, group = "cc", fun = nrow)

# Remove NA data
occdf <- subset(occdf, !is.na(genus))

# Unique genera per collection with group_apply and input arguments
ex2 <- group_apply(occdf = occdf,
                   group = c("collection_no"),
                   fun = tax_unique,
                   genus = "genus",
                   family = "family",
                   order = "order",
                   class = "class",
                   resolution = "genus")

# Use multiple variables (number of occurrences per collection & formation)
ex3 <- group_apply(occdf = occdf,
                   group = c("collection_no", "formation"),
                   fun = nrow)
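
The pattern behind a grouping wrapper is split, apply, combine. A base R sketch of the first example (occurrences per country) is given below using hypothetical data; it shows the general idea and is not the implementation of group_apply().

```r
# Hypothetical occurrences
occ <- data.frame(cc = c("US", "US", "ZA", "RU"),
                  genus = c("Dimetrodon", "Eryops",
                            "Lystrosaurus", "Dvinosaurus"))

# Split by group and apply a function to each piece
pieces <- split(occ, occ$cc)
counts <- vapply(pieces, nrow, integer(1))

# Combine the results into a single dataframe
result <- data.frame(cc = names(counts), nrow = unname(counts))
result
```

A wrapper saves writing this boilerplate for every function and every combination of grouping variables.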

What’s next?

Onwards and upwards 🏔️

  • Palaeobiology CRAN Task View
  • Shiny App
  • Workshops
  • Hackathon
  • Funding
  • Your involvement!

Thank you / Merci / Gracias / Danke / Obrigado / Grazie / Ευχαριστώ