A Global Names Architecture adventure supported by the R package taxize
Our use case is text-mining the Proceedings of the Academy of Natural Sciences of Philadelphia (ANSP) to identify species names. This is a first step toward identifying species occurrences (species observation + date + location). We know that Global Names can find taxonomic names in the ANSP because every item page in the Biodiversity Heritage Library lists the taxonomic names that were identified on it using Global Names. See, for example, this page, at the bottom left.
What is GNRD? It is described by the Global Names Architecture as: “Global Names Recognition and Discovery: a name discovery service that accepts Microsoft Office documents, PDFs, images, and other files, performs OCR as required, then uses TaxonFinder and NetiNeti name discovery algorithms. It has several configuration options including passing found names to Global Names Resolver.”
This tool is important because, rather than trying to assemble a list of every possible taxonomic name and then searching for those names in a text (via named entity recognition or some other method), we want to use community-accepted tools and taxonomies. So, the exercise here is to learn how to use the GNRD API, get it working on our corpus (ANSP), and get comfortable working with the inputs (text) and outputs (taxa).
install.packages("taxize")
Load the libraries we need.
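Something like this covers everything used in this post (plus install.packages("rbhl") if you don't already have it):

library(taxize)    # scrapenames()
library(rbhl)      # bhl_gettitlemetadata(), bhl_getitemmetadata()
library(tibble)    # as_tibble()
library(dplyr)     # %>% and bind_rows()
library(rmarkdown) # paged_table()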
From the taxize documentation we see how the scrapenames function works. The input can be a URL, a file, or text, with various options for what you're looking for (e.g., unique names only (no repeats), or all names found). Here is how it looks:
scrapenames(
url = NULL,
file = NULL,
text = NULL,
engine = NULL,
unique = NULL,
verbatim = NULL,
detect_language = NULL,
all_data_sources = NULL,
data_source_ids = NULL,
return_content = FALSE,
  ...  # other options not described here
)
And here it is in use on a Wikipedia page:
taxa <- scrapenames(
url = 'https://en.wikipedia.org/wiki/Animalia', # a web page with taxa mentioned
engine = 0, # uses both name finding engines (IDK what the differences are, tbh)
unique = TRUE, # just show the unique names as a start
verbatim = FALSE, # default
detect_language = TRUE, # default
with_verification = TRUE, # can't seem to get this to work
all_data_sources = TRUE,
return_content = FALSE
)
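A quick str() first, just to get a feel for the shape of what came back:

str(taxa, max.level = 1) # a list of 2: $meta and $data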
This returns a list of length 2: the metadata (the search query settings) and the results (the data). Let’s look! Here is what the metadata looks like:
taxa[["meta"]]
$token_url
[1] "http://gnrd.globalnames.org/name_finder.json?token=65gr6gcpd7"
$input_url
[1] "https://en.wikipedia.org/wiki/Animalia"
$file
NULL
$status
[1] 200
$engine
[1] "gnfinder"
$unique
[1] TRUE
$verbatim
[1] FALSE
$parameters
$parameters$return_content
[1] FALSE
$parameters$with_verification
[1] FALSE
$parameters$preferred_data_sources
list()
$parameters$detect_language
[1] FALSE
$parameters$engine
[1] 0
$parameters$no_bayes
[1] FALSE
$language_used
[1] "eng"
$execution_time
$execution_time$text_preparation_duration
[1] 2.482378
$execution_time$find_names_duration
[1] 0.05572677
$execution_time$total_duration
[1] 2.541807
$total
[1] 272
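As an aside, the token_url above is the GNRD endpoint that scrapenames() wraps. If you ever want to hit it outside of R (or debug a failing call), something like the httr sketch below should be close; the query parameters are my guess, mirroring the scrapenames() arguments rather than anything from the GNRD docs, and the service may hand you a token like the one above to poll for results.

library(httr)
library(jsonlite)
# hit the GNRD name-finding endpoint directly (parameter names assumed)
resp <- GET("http://gnrd.globalnames.org/name_finder.json",
            query = list(url = "https://en.wikipedia.org/wiki/Animalia", unique = "true"))
found <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))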
And here is a list of the taxonomic names that GNRD found in the Wikipedia page.
# max_len <- max(lengths(taxa[["data"]]))
# df <- do.call(cbind.data.frame, c(lapply(taxa[["data"]], function(x)
# c(x, rep('', max_len - length(x)))), stringsAsFactors = FALSE))
# knitr::kable(df)
taxa[["data"]][["scientificname"]]
[1] "Animalia" "Eukaryota"
[3] "Unikonta" "Opisthokonta"
[5] "Holozoa" "Metazoa"
[7] "Bilateria" "Protozoa"
[9] "Ecdysozoa" "Spiralia"
[11] "Latin animalis" "Ctenophora"
[13] "Ficedula superciliaris" "Tiktaalik"
[15] "Balaenoptera musculus" "Loxodonta africana"
[17] "Argentinosaurus" "Myxozoa"
[19] "Myxobolus" "Bryozoa"
[21] "Chordates" "Nematodes"
[23] "Platyhelminthes" "Dickinsonia costata"
[25] "Dickinsonia" "Anomalocaris canadensis"
[27] "Anomalocaris" "Gromia sphaerica"
[29] "Choanoflagellata" "Choanozoa"
[31] "Porifera" "Placozoa"
[33] "Eumetazoa" "Xenacoelomorpha"
[35] "Nephrozoa" "Deuterostomia"
[37] "Chordata" "Ambulacraria"
[39] "Protostomia" "Scalidophora"
[41] "Panarthropoda" "Nematoida"
[43] "Gnathifera" "Rotifera"
[45] "Chaetognatha" "Lophotrochozoa"
[47] "Mollusca" "Annelida"
[49] "Echinodermata" "Hemichordata"
[51] "Arthropoda" "Onychophora"
[53] "Tardigrada" "Nematoda"
[55] "Nematomorpha" "Kinorhyncha"
[57] "Priapulida" "Loricifera"
[59] "Vermes" "Insecta"
[61] "Amphibia" "Mammalia"
[63] "Protista" "Drosophila melanogaster"
[65] "Cecidomyiidae" "Angela"
[67] "Mossman" "Marisa"
[69] "Rita" "Alison"
[71] "Myxosporea" "Bivalvulida"
[73] "Melissa" "Sina"
[75] "Georgina" "Crustacea"
[77] "Malacostraca" "Trichomycteridae"
[79] "Vandelliinae" "Stegophilinae"
[81] "Tricladida" "Terricola"
[83] "Ediacara" "Verena"
[85] "Tamara" "Luca"
[87] "Bilaterian-specific genes" "Agatha"
[89] "Safra" "Xenoturbella"
[91] "Xenoturbella bocki" "Sabrina"
[93] "Gnathostomulida" "Cycliophora"
[95] "Dendrobatidae" "Archaea"
[97] "Hacrobia" "Alveolata"
[99] "Rhizaria" "Excavata"
[101] "Amoebozoa" "Micrognathozoa"
[103] "Limnognathia" "Syndermata"
[105] "Acanthocephala" "Mesozoa"
[107] "Orthonectida" "Dicyemida"
[109] "Rhombozoa" "Monoblastozoa"
[111] "Lophophorata" "Entoprocta"
[113] "Kamptozoa" "Ectoprocta"
[115] "Brachiozoa" "Brachiopoda"
[117] "Phoronida" "Chromadorea"
[119] "Enoplea" "Secernentea"
[121] "Turbellaria" "Trematoda"
[123] "Monogenea" "Cestoda"
[125] "Phylactolaemata" "Stenolaemata"
[127] "Gymnolaemata" "Clitellata"
[129] "Sipuncula" "Archaeplastida"
[131] "Glaucophyta" "Rhodophyta"
[133] "Picozoa" "Plantae"
[135] "Chlorophyta" "Streptophyta"
[137] "Chlorokybophyceae" "Mesostigmatophyceae"
[139] "Spirotaenia" "Cryptista"
[141] "Endohelea" "Katablepharidophyta"
[143] "Cryptophyta" "Haptista"
[145] "Centroheliozoa" "Haptophyta"
[147] "Halvaria" "Colponemidia"
[149] "Chromerida" "Colpodellida"
[151] "Sporozoa" "Perkinsozoa"
[153] "Dinophyta" "Stramenopiles"
[155] "Developea" "Hyphochytrea"
[157] "Ochrophyta" "Opalinata"
[159] "Sagenista" "Filosa"
[161] "Retaria" "Endomyxa"
[163] "Telonemia" "Discoba"
[165] "Eolouka" "Jakobea"
[167] "Tsukubea" "Discicristata"
[169] "Euglenozoa" "Percolozoa"
[171] "Loukozoa" "Malawimonadea"
[173] "Metamonada" "Anaeromonada"
[175] "Trichozoa" "Conosa"
[177] "Archamoebae" "Lobosa"
[179] "Cutosea" "Discosea"
[181] "Tubulinea" "Apusomonadida"
[183] "Breviatea" "Holomycota"
[185] "Cristidiscoidea" "Ichthyosporea"
[187] "Syssomonas" "Corallochytrea"
[189] "Filasterea" "Collodictyonidae"
[191] "Mantamonadidae" "Rigifilida"
[193] "Hemimastigophora" "Spironemidae"
[195] "Incertae sedis" "Eukaryota flora"
[197] "Prokaryotes archaea bacteria" "Acidobacteria"
[199] "Actinobacteria" "Aquificae"
[201] "Bacteroidetes" "Chlamydiae"
[203] "Chlorobi" "Chloroflexi"
[205] "Chrysiogenetes" "Cyanobacteria"
[207] "Deferribacteres" "Dictyoglomi"
[209] "Fibrobacteres" "Firmicutes"
[211] "Fusobacteria" "Gemmatimonadetes"
[213] "Nitrospirae" "Planctomycetes"
[215] "Proteobacteria" "Spirochaetes"
[217] "Thermodesulfobacteria" "Thermomicrobia"
[219] "Thermotogae" "Verrucomicrobia"
[221] "Crenarchaeota" "Euryarchaeota"
[223] "Korarchaeota" "Nanoarchaeota"
[225] "Chromista" "Heterokontophyta"
[227] "Ciliophora" "Apicomplexa"
[229] "Dinoflagellata" "Foraminifera"
[231] "Cercozoa" "Chytridiomycota"
[233] "Blastocladiomycota" "Neocallimastigomycota"
[235] "Glomeromycota" "Zygomycota"
[237] "Ascomycota" "Basidiomycota"
[239] "Charophyta" "Marchantiophyta"
[241] "Anthocerotophyta" "Lycopodiophyta"
[243] "Pteridophyta" "Cycadophyta"
[245] "Ginkgophyta" "Pinophyta"
[247] "Gnetophyta" "Nanobacterium"
[249] "Riboviria" "Unassigned"
[251] "Alphasatellitidae" "Ampullaviridae"
[253] "Anelloviridae" "Avsunviroidae"
[255] "Bicaudaviridae" "Clavaviridae"
[257] "Fuselloviridae" "Globuloviridae"
[259] "Guttaviridae" "Ovaliviridae"
[261] "Plasmaviridae" "Polydnaviridae"
[263] "Portogloboviridae" "Pospiviroidae"
[265] "Spiraviridae" "Tolecusatellitidae"
[267] "Hela" "Europaea"
[269] "Basa" "Franca"
[271] "Runa" "Sunda"
I did a test of using scrapenames directly on a page at BHL (and this page has only text on it!), but it failed.
taxa <- scrapenames('https://www.biodiversitylibrary.org/pagetext/1694302')
#FAIL!
The error says, “Error: Invalid SSL Certificate (Cloudflare) (HTTP 526)” - BOO!
This wasn’t all that unexpected, and it means that we’ll have to combine the querying functionality of the rbhl package with the scraping of the scrapenames function in taxize. Recall in my post “Using the BHL API” that we can pull text directly from BHL using the functions in the rbhl package. Let’s try to pull the OCR text from the same page we tried to query above.
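One setup note if you're following along: as I recall from the rbhl documentation, the package looks for your BHL API key in the BHL_KEY environment variable, so set that before making any calls.

Sys.setenv(BHL_KEY = "your-bhl-api-key") # or put BHL_KEY=... in your .Renviron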
# Get a list of items from the result of the title search, and put them in a List called `anspitems`
anspitems <- bhl_gettitlemetadata(6885, items = TRUE, as='list')$Result[[1]]$Items
Item-level data is in nested lists. That is, anspitems is a list where each item within anspitems is itself a list with metadata regarding each volume of the Proceedings. Let’s pull that metadata out into a tibble.
len <- length(anspitems) # how many items for the loop
dat <- as_tibble(anspitems[[1]]) # create a tibble for the metadata to go into
for (i in 2:len){
  newrow <- as_tibble(anspitems[[i]]) # create the new row of data
  dat <- dat %>% bind_rows(newrow)
}
rm(newrow, i) # clean up your work space
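As an aside, the loop above can be collapsed into a single line with the same result, assuming every element of anspitems coerces cleanly with as_tibble():

dat <- bind_rows(lapply(anspitems, as_tibble)) # one row of metadata per item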
unique_vols <- dat[!duplicated(dat$Volume), ] # remove duplicate items in ANSP metadata
# Need info from page 189 of volume 7. Let's find it!
volmeta <- as_tibble(bhl_getitemmetadata(unique_vols$ItemID[7], TRUE, ocr = TRUE)) # Volume 7
colnames(volmeta) # OCR info is in a table called "Pages"
[1] "ItemID" "TitleID" "ThumbnailPageID"
[4] "Source" "SourceIdentifier" "IsVirtual"
[7] "Volume" "Year" "EndYear"
[10] "HoldingInstitution" "RightsHolder" "Sponsor"
[13] "Language" "CopyrightStatus" "ItemUrl"
[16] "TitleUrl" "ItemThumbUrl" "ItemTextUrl"
[19] "ItemPDFUrl" "ItemImagesUrl" "CreationDate"
[22] "Pages"
ocr <- as_tibble(volmeta["Pages"][[1]][[1]][["OcrText"]])
ocr[211,] # just look at printed page 189 (row 211 of the OCR table)
# A tibble: 1 x 1
value
<chr>
1 "1854.] 189 \n\n\n\nCATALOGUE OF AMERICAN TESTUDINATA. \n\n\n\nChel…
Feed the text from the page we are interested in (row 211 of the OCR table = page 189 of Vol. 7) into scrapenames, put the data into a tibble, and show the tibble.
snames <- scrapenames(text=ocr[211,])
df.snames <- as_tibble(snames[["data"]])
paged_table(df.snames)
The “verbatim” column is the OCR text we fed into the scraper, and “scientificname” is the matched scientific name (GNRD is able to do “fuzzy” matching to account for misspellings and such).
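To see where that fuzzy matching actually kicked in, you can compare the two columns directly (this assumes the columns are named exactly as quoted above):

df.snames %>% filter(verbatim != scientificname) # rows where the OCR string was corrected to a valid name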
We need to verify this list of taxa against … something. I couldn’t get the “with_verification” option to work in the API context (it works on the web version). The taxize package has options for this, so I’ll get to that tomorrow.
For attribution, please cite this work as
Whitmire (2021, Aug. 3). Seaside Librarian: Using the GNRD API to find taxonomic names. Retrieved from https://amandawhitmire.github.io/blog/posts/2021-08-03-using-the-gnrd-api-to-find-taxonomic-names/
BibTeX citation
@misc{whitmire2021using,
  author = {Whitmire, Amanda},
  title = {Seaside Librarian: Using the GNRD API to find taxonomic names},
  url = {https://amandawhitmire.github.io/blog/posts/2021-08-03-using-the-gnrd-api-to-find-taxonomic-names/},
  year = {2021}
}