A Global Names Architecture adventure supported by the R package taxize
Our use case is text-mining the Proceedings of the Academy of Natural Sciences of Philadelphia (ANSP) to identify species names. This is a first step toward identifying species occurrences (species observation + date + location). We know that Global Names can find taxonomic names in the ANSP because every item page in the Biodiversity Heritage Library lists the taxonomic names that were identified on it using Global Names. See, for example, this page, at the bottom left.
What is GNRD? It is described by the Global Names Architecture as: “Global Names Recognition and Discovery: a name discovery service that accepts Microsoft Office documents, PDFs, images, and other files, performs OCR as required, then uses TaxonFinder and NetiNeti name discovery algorithms. It has several configuration options including passing found names to Global Names Resolver.”
This tool is important because, rather than trying to assemble a list of every possible taxonomic name and then searching for those names in a text (via named entity recognition or some other method), we want to use community-accepted tools and taxonomies. So, the exercise here is to learn how to use the GNRD API, get it working on our corpus (ANSP), and get comfortable working with the inputs (text) and outputs (taxa).
install.packages("taxize")
Load the libraries we need.
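Something like this covers everything used in this post (plus install.packages("rbhl") if you don't already have it):

library(taxize)    # scrapenames()
library(rbhl)      # bhl_gettitlemetadata(), bhl_getitemmetadata()
library(tibble)    # as_tibble()
library(dplyr)     # %>% and bind_rows()
library(rmarkdown) # paged_table()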
From the taxize documentation we see how the scrapenames function works. The input can be a URL, a file, or text, with various options for what you're looking for (e.g., unique names only (no repeats), or all names found). Here is how it looks:
scrapenames(
url = NULL,
file = NULL,
text = NULL,
engine = NULL,
unique = NULL,
verbatim = NULL,
detect_language = NULL,
all_data_sources = NULL,
data_source_ids = NULL,
return_content = FALSE,
  ...  # other options not described here
)
And here it is in use on a Wikipedia page:
taxa <- scrapenames(
url = 'https://en.wikipedia.org/wiki/Animalia', # a web page with taxa mentioned
engine = 0, # uses both name finding engines (IDK what the differences are, tbh)
unique = TRUE, # just show the unique names as a start
verbatim = FALSE, # default
detect_language = TRUE, # default
with_verification = TRUE, # can't seem to get this to work
all_data_sources = TRUE,
return_content = FALSE
)
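A quick str() first, just to get a feel for the shape of what came back:

str(taxa, max.level = 1) # a list of 2: $meta and $data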
This returns a list of length 2: the metadata (the search query settings) and the results (the data). Let’s look! Here is what the metadata looks like:
taxa[["meta"]]
$token_url
[1] "http://gnrd.globalnames.org/name_finder.json?token=65gr6gcpd7"
$input_url
[1] "https://en.wikipedia.org/wiki/Animalia"
$file
NULL
$status
[1] 200
$engine
[1] "gnfinder"
$unique
[1] TRUE
$verbatim
[1] FALSE
$parameters
$parameters$return_content
[1] FALSE
$parameters$with_verification
[1] FALSE
$parameters$preferred_data_sources
list()
$parameters$detect_language
[1] FALSE
$parameters$engine
[1] 0
$parameters$no_bayes
[1] FALSE
$language_used
[1] "eng"
$execution_time
$execution_time$text_preparation_duration
[1] 2.482378
$execution_time$find_names_duration
[1] 0.05572677
$execution_time$total_duration
[1] 2.541807
$total
[1] 272
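As an aside, the token_url above is the GNRD endpoint that scrapenames() wraps. If you ever want to hit it outside of R (or debug a failing call), something like the httr sketch below should be close; the query parameters are my guess, mirroring the scrapenames() arguments rather than anything from the GNRD docs, and the service may hand you a token like the one above to poll for results.

library(httr)
library(jsonlite)
# hit the GNRD name-finding endpoint directly (parameter names assumed)
resp <- GET("http://gnrd.globalnames.org/name_finder.json",
            query = list(url = "https://en.wikipedia.org/wiki/Animalia", unique = "true"))
found <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))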
And here is a list of the taxonomic names that GNRD found in the Wikipedia page.
# max_len <- max(lengths(taxa[["data"]]))
# df <- do.call(cbind.data.frame, c(lapply(taxa[["data"]], function(x)
# c(x, rep('', max_len - length(x)))), stringsAsFactors = FALSE))
# knitr::kable(df)
taxa[["data"]][["scientificname"]]
[1] "Animalia" "Eukaryota"
[3] "Unikonta" "Opisthokonta"
[5] "Holozoa" "Metazoa"
[7] "Bilateria" "Protozoa"
[9] "Ecdysozoa" "Spiralia"
[11] "Latin animalis" "Ctenophora"
[13] "Ficedula superciliaris" "Tiktaalik"
[15] "Balaenoptera musculus" "Loxodonta africana"
[17] "Argentinosaurus" "Myxozoa"
[19] "Myxobolus" "Bryozoa"
[21] "Chordates" "Nematodes"
[23] "Platyhelminthes" "Dickinsonia costata"
[25] "Dickinsonia" "Anomalocaris canadensis"
[27] "Anomalocaris" "Gromia sphaerica"
[29] "Choanoflagellata" "Choanozoa"
[31] "Porifera" "Placozoa"
[33] "Eumetazoa" "Xenacoelomorpha"
[35] "Nephrozoa" "Deuterostomia"
[37] "Chordata" "Ambulacraria"
[39] "Protostomia" "Scalidophora"
[41] "Panarthropoda" "Nematoida"
[43] "Gnathifera" "Rotifera"
[45] "Chaetognatha" "Lophotrochozoa"
[47] "Mollusca" "Annelida"
[49] "Echinodermata" "Hemichordata"
[51] "Arthropoda" "Onychophora"
[53] "Tardigrada" "Nematoda"
[55] "Nematomorpha" "Kinorhyncha"
[57] "Priapulida" "Loricifera"
[59] "Vermes" "Insecta"
[61] "Amphibia" "Mammalia"
[63] "Protista" "Drosophila melanogaster"
[65] "Cecidomyiidae" "Angela"
[67] "Mossman" "Marisa"
[69] "Rita" "Alison"
[71] "Myxosporea" "Bivalvulida"
[73] "Melissa" "Sina"
[75] "Georgina" "Crustacea"
[77] "Malacostraca" "Trichomycteridae"
[79] "Vandelliinae" "Stegophilinae"
[81] "Tricladida" "Terricola"
[83] "Ediacara" "Verena"
[85] "Tamara" "Luca"
[87] "Bilaterian-specific genes" "Agatha"
[89] "Safra" "Xenoturbella"
[91] "Xenoturbella bocki" "Sabrina"
[93] "Gnathostomulida" "Cycliophora"
[95] "Dendrobatidae" "Archaea"
[97] "Hacrobia" "Alveolata"
[99] "Rhizaria" "Excavata"
[101] "Amoebozoa" "Micrognathozoa"
[103] "Limnognathia" "Syndermata"
[105] "Acanthocephala" "Mesozoa"
[107] "Orthonectida" "Dicyemida"
[109] "Rhombozoa" "Monoblastozoa"
[111] "Lophophorata" "Entoprocta"
[113] "Kamptozoa" "Ectoprocta"
[115] "Brachiozoa" "Brachiopoda"
[117] "Phoronida" "Chromadorea"
[119] "Enoplea" "Secernentea"
[121] "Turbellaria" "Trematoda"
[123] "Monogenea" "Cestoda"
[125] "Phylactolaemata" "Stenolaemata"
[127] "Gymnolaemata" "Clitellata"
[129] "Sipuncula" "Archaeplastida"
[131] "Glaucophyta" "Rhodophyta"
[133] "Picozoa" "Plantae"
[135] "Chlorophyta" "Streptophyta"
[137] "Chlorokybophyceae" "Mesostigmatophyceae"
[139] "Spirotaenia" "Cryptista"
[141] "Endohelea" "Katablepharidophyta"
[143] "Cryptophyta" "Haptista"
[145] "Centroheliozoa" "Haptophyta"
[147] "Halvaria" "Colponemidia"
[149] "Chromerida" "Colpodellida"
[151] "Sporozoa" "Perkinsozoa"
[153] "Dinophyta" "Stramenopiles"
[155] "Developea" "Hyphochytrea"
[157] "Ochrophyta" "Opalinata"
[159] "Sagenista" "Filosa"
[161] "Retaria" "Endomyxa"
[163] "Telonemia" "Discoba"
[165] "Eolouka" "Jakobea"
[167] "Tsukubea" "Discicristata"
[169] "Euglenozoa" "Percolozoa"
[171] "Loukozoa" "Malawimonadea"
[173] "Metamonada" "Anaeromonada"
[175] "Trichozoa" "Conosa"
[177] "Archamoebae" "Lobosa"
[179] "Cutosea" "Discosea"
[181] "Tubulinea" "Apusomonadida"
[183] "Breviatea" "Holomycota"
[185] "Cristidiscoidea" "Ichthyosporea"
[187] "Syssomonas" "Corallochytrea"
[189] "Filasterea" "Collodictyonidae"
[191] "Mantamonadidae" "Rigifilida"
[193] "Hemimastigophora" "Spironemidae"
[195] "Incertae sedis" "Eukaryota flora"
[197] "Prokaryotes archaea bacteria" "Acidobacteria"
[199] "Actinobacteria" "Aquificae"
[201] "Bacteroidetes" "Chlamydiae"
[203] "Chlorobi" "Chloroflexi"
[205] "Chrysiogenetes" "Cyanobacteria"
[207] "Deferribacteres" "Dictyoglomi"
[209] "Fibrobacteres" "Firmicutes"
[211] "Fusobacteria" "Gemmatimonadetes"
[213] "Nitrospirae" "Planctomycetes"
[215] "Proteobacteria" "Spirochaetes"
[217] "Thermodesulfobacteria" "Thermomicrobia"
[219] "Thermotogae" "Verrucomicrobia"
[221] "Crenarchaeota" "Euryarchaeota"
[223] "Korarchaeota" "Nanoarchaeota"
[225] "Chromista" "Heterokontophyta"
[227] "Ciliophora" "Apicomplexa"
[229] "Dinoflagellata" "Foraminifera"
[231] "Cercozoa" "Chytridiomycota"
[233] "Blastocladiomycota" "Neocallimastigomycota"
[235] "Glomeromycota" "Zygomycota"
[237] "Ascomycota" "Basidiomycota"
[239] "Charophyta" "Marchantiophyta"
[241] "Anthocerotophyta" "Lycopodiophyta"
[243] "Pteridophyta" "Cycadophyta"
[245] "Ginkgophyta" "Pinophyta"
[247] "Gnetophyta" "Nanobacterium"
[249] "Riboviria" "Unassigned"
[251] "Alphasatellitidae" "Ampullaviridae"
[253] "Anelloviridae" "Avsunviroidae"
[255] "Bicaudaviridae" "Clavaviridae"
[257] "Fuselloviridae" "Globuloviridae"
[259] "Guttaviridae" "Ovaliviridae"
[261] "Plasmaviridae" "Polydnaviridae"
[263] "Portogloboviridae" "Pospiviroidae"
[265] "Spiraviridae" "Tolecusatellitidae"
[267] "Hela" "Europaea"
[269] "Basa" "Franca"
[271] "Runa" "Sunda"
I did a test of using scrapenames directly on a page at BHL (and this page has only text on it!), but it failed.
taxa <- scrapenames('https://www.biodiversitylibrary.org/pagetext/1694302')
#FAIL!
The error says, “Error: Invalid SSL Certificate (Cloudflare) (HTTP 526)” - BOO!
This wasn’t all that unexpected, and it means that we’ll have to combine the querying functionality of the rbhl package with the scraping of the scrapenames function in taxize. Recall in my post “Using the BHL API” that we can pull text directly from BHL using the functions in the rbhl package. Let’s try to pull the OCR text from the same page we tried to query above.
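One setup note if you're following along: as I recall from the rbhl documentation, the package looks for your BHL API key in the BHL_KEY environment variable, so set that before making any calls.

Sys.setenv(BHL_KEY = "your-bhl-api-key") # or put BHL_KEY=... in your .Renviron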
# Get a list of items from the result of the title search, and put them in a List called `anspitems`
anspitems <- bhl_gettitlemetadata(6885, items = TRUE, as='list')$Result[[1]]$Items
Item-level data is in nested lists. That is, anspitems is a list where each item within anspitems is itself a list with metadata regarding each volume of the Proceedings. Let’s pull that metadata out into a tibble.
len <- length(anspitems) # how many items for the loop
dat <- as_tibble(anspitems[[1]]) # create a tibble for the metadata to go into
for (i in 2:len){
  newrow <- as_tibble(anspitems[[i]]) # create the new row of data
  dat <- dat %>% bind_rows(newrow)
}
rm(newrow, i) # clean up your work space
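As an aside, the loop above can be collapsed into a single line with the same result, assuming every element of anspitems coerces cleanly with as_tibble():

dat <- bind_rows(lapply(anspitems, as_tibble)) # one row of metadata per item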
unique_vols <- dat[!duplicated(dat$Volume), ] # remove duplicate items in ANSP metadata
# Need info from page 189 of volume 7. Let's find it!
volmeta <- as_tibble(bhl_getitemmetadata(unique_vols$ItemID[7], TRUE, ocr = TRUE)) # Volume 7
colnames(volmeta) # OCR info is in a table called "Pages"
[1] "ItemID" "TitleID" "ThumbnailPageID"
[4] "Source" "SourceIdentifier" "IsVirtual"
[7] "Volume" "Year" "EndYear"
[10] "HoldingInstitution" "RightsHolder" "Sponsor"
[13] "Language" "CopyrightStatus" "ItemUrl"
[16] "TitleUrl" "ItemThumbUrl" "ItemTextUrl"
[19] "ItemPDFUrl" "ItemImagesUrl" "CreationDate"
[22] "Pages"
ocr <- as_tibble(volmeta["Pages"][[1]][[1]][["OcrText"]])
ocr[211,] # just look at printed page 189 (row 211 of the OCR table)
# A tibble: 1 x 1
value
<chr>
1 "1854.] 189 \n\n\n\nCATALOGUE OF AMERICAN TESTUDINATA. \n\n\n\nChel…
Feed the text from the page we are interested in (row 211 of the OCR table = page 189 of Vol. 7) into scrapenames, put the data into a tibble, and show the tibble.
snames <- scrapenames(text=ocr[211,])
df.snames <- as_tibble(snames[["data"]])
paged_table(df.snames)
The “verbatim” column is the OCR text we fed into the scraper, and “scientificname” is the matched scientific name (GNRD is able to do “fuzzy” matching to account for misspellings and such).
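To see where that fuzzy matching actually kicked in, you can compare the two columns directly (this assumes the columns are named exactly as quoted above):

df.snames %>% filter(verbatim != scientificname) # rows where the OCR string was corrected to a valid name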
We need to verify this list of taxa against … something. I couldn’t get the “with_verification” option to work in the API context (it works on the web version). The taxize package has options for this, so I’ll get to that tomorrow.
For attribution, please cite this work as
Whitmire (2021, Aug. 3). Seaside Librarian: Using the GNRD API to find taxonomic names. Retrieved from https://amandawhitmire.github.io/blog/posts/2021-08-03-using-the-gnrd-api-to-find-taxonomic-names/
BibTeX citation
@misc{whitmire2021using,
  author = {Whitmire, Amanda},
  title = {Seaside Librarian: Using the GNRD API to find taxonomic names},
  url = {https://amandawhitmire.github.io/blog/posts/2021-08-03-using-the-gnrd-api-to-find-taxonomic-names/},
  year = {2021}
}