Using the BHL API

A notebook to access the Biodiversity Heritage Library API available through the rOpenSci package rbhl.

Amanda Whitmire https://amandawhitmire.github.io/ (Stanford Libraries & Hopkins Marine Station)
06-23-2021

v.0 2021-06-16, Amanda L. Whitmire, ORCID: https://orcid.org/0000-0003-2429-8879

Purpose: I want to explore the corpus of the Proceedings of the Academy of Natural Sciences of Philadelphia (1841 - 1922), available at the Biodiversity Heritage Library (BHL) at https://www.biodiversitylibrary.org/bibliography/6885. I am interested in how this corpus may be leveraged for historical species occurrence data. Step one is accessing the plain text for each volume. Species names have already been identified in the text via the Global Names Parser, but which of these could be an occurrence record ("I saw this thing, at this place, at this time.")? Let’s find out!

Note: I already received an API token from BHL and have it in my .Renviron file.
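If you want to confirm that R can actually see the token before calling the API, here is a minimal sketch. It assumes the key is stored under the environment variable name BHL_KEY, which is the name I understand rbhl looks for; the helper name `has_bhl_key` is made up for this example.

```r
# Hypothetical helper: TRUE when a non-empty value is stored in the given
# environment variable (assumed here to be BHL_KEY).
has_bhl_key <- function(var = "BHL_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (!has_bhl_key()) {
  message("No BHL_KEY found. Add a line like BHL_KEY=your-key to ~/.Renviron and restart R.")
}
```

Note that .Renviron is only read at startup, so a key added mid-session will not show up until R is restarted.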


Getting the metadata out of BHL

Going to use the rbhl package to interface with the BHL API. See https://docs.ropensci.org/rbhl/

If you don’t have the package installed, uncomment the line below to install it.

# install.packages("rbhl") 

Load the Libraries we need. Good ol’ Tidyverse!!
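The library-loading chunk is not shown here, so the following is a sketch of what it likely needs rather than the original code. The package list is inferred from the functions used later in this post: rbhl for the API calls, tidyverse for `as_tibble()`, `%>%` and `bind_rows()`, knitr for `kable()`, and rmarkdown for `paged_table()`.

```r
# Packages inferred from the functions used in this notebook (an assumption).
pkgs <- c("rbhl", "tidyverse", "knitr", "rmarkdown")

# Check which of them are not installed before trying to attach anything.
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach every package, equivalent to calling library() on each one.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```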

Let’s test out an rbhl function by doing an author search.

bhl_authorsearch(name='Doubleday, Nellie')
# A tibble: 1 x 4
  AuthorID Name                  Dates   CreatorUrl                   
  <chr>    <chr>                 <chr>   <chr>                        
1 11872    Doubleday, Nellie Bl… 1865-1… https://www.biodiversitylibr…

It works!

Okay, now search the BHL catalog for ‘Proceedings of the Academy of Natural Sciences of Philadelphia.’ Note: of course I’ve been to BHL and I know the URL for this collection. But for the sake of learning how to do this programmatically, we proceed thusly…

#creates a tibble named `anspsearch`
anspsearch <- bhl_publicationsearch('Proceedings of the Academy of Natural Sciences of Philadelphia')

#display the tibble (uses `rmarkdown` package for rendering)
paged_table(anspsearch)

What we want is in the first row. The TitleID is ‘6885’, so let’s see where we can go from there.

Get a list of items from the result of the title search, and put them in a list called anspitems.

anspitems <- bhl_gettitlemetadata(6885, items = TRUE, as='list')$Result[[1]]$Items 

Look at the first item in the list:

anspitems[[1]]
$ItemID
[1] 17886

$Volume
[1] "v.1 (1841-1843)"

$Year
[1] "1841"

$EndYear
[1] "1843"

$ItemUrl
[1] "https://www.biodiversitylibrary.org/item/17886"

Okay, so we can see now that the item-level data is in nested lists. That is, anspitems is a list where each item within anspitems is itself a list with metadata regarding each volume of the Proceedings. Let’s pull that metadata out into a tibble.

len <- length(anspitems) # how many items for the loop
dat <- as_tibble(anspitems[[1]]) # create a tibble for the metadata to go into
for (i in 2:len){
  newrow <- as_tibble(anspitems[[i]]) #create the new row of data
  dat <- dat %>% bind_rows(newrow) # not sure why I couldn't use 'add_row' function, but this works I guess
}
rm(newrow, i) # clean up your workspace
kable(head(dat)) # look at your data
ItemID  Volume           Year  EndYear  ItemUrl
17886   v.1 (1841-1843)  1841  1843     https://www.biodiversitylibrary.org/item/17886
84736   v.1 (1841-1843)  1841  1843     https://www.biodiversitylibrary.org/item/84736
30491   v.2 (1844-1845)  1844  1845     https://www.biodiversitylibrary.org/item/30491
84725   v.2 (1844-1845)  1844  1845     https://www.biodiversitylibrary.org/item/84725
17669   v.3 (1846-1847)  1846  1847     https://www.biodiversitylibrary.org/item/17669
84755   v.3 (1846-1847)  1846  1847     https://www.biodiversitylibrary.org/item/84755
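The loop works, but growing a table one row at a time is not the only way to flatten a list of records. Here is a loop-free sketch in base R. The `items` list is a mock stand-in for `anspitems`, built from the first two rows of the table above, and `dat_alt` is a name made up for this example.

```r
# Mock stand-in for `anspitems`: two records copied from the table above.
items <- list(
  list(ItemID = 17886L, Volume = "v.1 (1841-1843)", Year = "1841", EndYear = "1843",
       ItemUrl = "https://www.biodiversitylibrary.org/item/17886"),
  list(ItemID = 84736L, Volume = "v.1 (1841-1843)", Year = "1841", EndYear = "1843",
       ItemUrl = "https://www.biodiversitylibrary.org/item/84736")
)

# Turn each record into a one-row data frame, then stack all rows in one call.
rows <- lapply(items, as.data.frame, stringsAsFactors = FALSE)
dat_alt <- do.call(rbind, rows)

nrow(dat_alt)   # one row per item
names(dat_alt)  # the same five columns as the loop version
```

This assumes every item carries the same set of fields; `rbind()` will complain if one record is missing a column, which the row-by-row `bind_rows()` approach tolerates by filling in NA.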

OH CRIPES - there are duplicate volumes. DANG. We’re going to have to de-duplicate the tibble based on … maybe the Volume column? ::shrugs:: This will not be a specific de-dup - like, I don’t know how to pick which of the two volumes to keep, so I guess we’ll keep whatever the function decides for us. If I had to guess, it’ll keep the first instance of each volume.

unique_vols <- dat[!duplicated(dat$Volume), ] # keep only the first row for each Volume
kable(head(unique_vols))
ItemID  Volume           Year  EndYear  ItemUrl
17886   v.1 (1841-1843)  1841  1843     https://www.biodiversitylibrary.org/item/17886
30491   v.2 (1844-1845)  1844  1845     https://www.biodiversitylibrary.org/item/30491
17669   v.3 (1846-1847)  1846  1847     https://www.biodiversitylibrary.org/item/17669
28006   v.4 (1848-1849)  1848  1849     https://www.biodiversitylibrary.org/item/28006
17633   v.5 (1850-1851)  1850  1851     https://www.biodiversitylibrary.org/item/17633
17888   v.6 (1852-1853)  1852  1853     https://www.biodiversitylibrary.org/item/17888

BOOM!! We have a list of each volume of the Proceedings with the year and the URL for each. This is the first step - DONE. Looking at this table, yes, the first instance of each volume is kept.
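The "first instance wins" behavior is not luck; it is how `duplicated()` is defined. A quick check on a small hand-built table, using the first four rows of `dat` shown earlier (the names `toy` and `toy_unique` are made up for this example):

```r
# Small hand-built table: ItemID and Volume for the first four rows of `dat`.
toy <- data.frame(
  ItemID = c(17886, 84736, 30491, 84725),
  Volume = c("v.1 (1841-1843)", "v.1 (1841-1843)",
             "v.2 (1844-1845)", "v.2 (1844-1845)"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences, never the first one.
duplicated(toy$Volume)
# [1] FALSE  TRUE FALSE  TRUE

# Negating the flags therefore keeps the first scan of each volume.
toy_unique <- toy[!duplicated(toy$Volume), ]
toy_unique$ItemID
# [1] 17886 30491
```

With the tidyverse loaded, `distinct(dat, Volume, .keep_all = TRUE)` should give the same result, since it also keeps the first row for each distinct value.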


Downloading each volume of the Proceedings

Now let’s get into the business of trying to access the OCR text of the proceedings. Since I have a list of URLs, we could probably also do this in Python, but this ‘rbhl’ R-package makes it easy to query and pull the text we need. After I build out our corpus, we can shift over to Python …

# USAGE 
# bhl_getitemmetadata(
#   itemid = NULL,
#   pages = TRUE,
#   ocr = FALSE,
#   parts = FALSE,
#   as = "table",
#   key = NULL,
#   ...
# )

# volmeta <- bhl_getitemmetadata(as_tibble(unique_vols$ItemID), TRUE, ocr = TRUE)
volmeta <- as_tibble(bhl_getitemmetadata(unique_vols$ItemID[1], TRUE, ocr = TRUE))
colnames(volmeta) # OCR info is in a table tucked into the column #21 called "Pages"
 [1] "ItemID"             "TitleID"            "ThumbnailPageID"   
 [4] "Source"             "SourceIdentifier"   "IsVirtual"         
 [7] "Volume"             "Year"               "EndYear"           
[10] "HoldingInstitution" "Sponsor"            "Language"          
[13] "CopyrightStatus"    "ItemUrl"            "TitleUrl"          
[16] "ItemThumbUrl"       "ItemTextUrl"        "ItemPDFUrl"        
[19] "ItemImagesUrl"      "CreationDate"       "Pages"             
ocr <- as_tibble(volmeta[[21]][[1]]["OcrText"])
print(ocr[10,]) # just look at the 10th page of text
# A tibble: 1 x 1
  OcrText                                                             
  <chr>                                                               
1 "IV. \n\n\n\nINDEX. \n\n\n\nCalcutta Journal of Nat. History, don. …

What we have now is a tibble where each row is a page of OCR text from the volume of the proceedings that we’ve pulled from the API. Let’s concatenate the rows into a single block of text, effectively rendering a complete volume of the Proceedings into one block.

vol <- paste(ocr$OcrText, collapse = "\n") # concatenates the pages into one string
write(vol, file = "test.txt") # write it out as a text file
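The same steps could be repeated for every volume to build out the corpus. The sketch below is one way to do that, not something I have run against the full collection. The helper names `collapse_pages` and `save_volume_text`, the `corpus` output folder, and the one-second pause are all assumptions for illustration; the `$Pages[[1]]$OcrText` path mirrors the structure seen in `volmeta` above.

```r
# Join a character vector of page texts into one string, dropping missing pages.
collapse_pages <- function(pages, sep = "\n") {
  paste(pages[!is.na(pages)], collapse = sep)
}

# Hypothetical helper: fetch one item's OCR pages through rbhl and write them
# to <out_dir>/<ItemID>.txt. Requires rbhl, a BHL API key, and network access.
save_volume_text <- function(item_id, out_dir = "corpus") {
  meta <- rbhl::bhl_getitemmetadata(item_id, pages = TRUE, ocr = TRUE)
  pages <- meta$Pages[[1]]$OcrText # assumed to match the `volmeta` structure
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(out_dir, paste0(item_id, ".txt"))
  writeLines(collapse_pages(pages), path)
  invisible(path)
}

# Not run here: loop over the de-duplicated volumes, pausing between requests.
# for (id in unique_vols$ItemID) {
#   save_volume_text(id)
#   Sys.sleep(1)
# }
```

Naming each file by ItemID keeps the link back to the metadata table, so the year and volume can be joined on later from `unique_vols`.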

Citation

For attribution, please cite this work as

Whitmire (2021, June 23). Seaside Librarian: Using the BHL API. Retrieved from https://amandawhitmire.github.io/blog/posts/2021-06-23-using-the-bhl-api/

BibTeX citation

@misc{whitmire2021using,
  author = {Whitmire, Amanda},
  title = {Seaside Librarian: Using the BHL API},
  url = {https://amandawhitmire.github.io/blog/posts/2021-06-23-using-the-bhl-api/},
  year = {2021}
}