This vignette discusses some advanced functions for analyzing PubMed records using easyPubMed, as well as new functions that were introduced in the latest version of the library. If you are looking for a tutorial about how to get started with easyPubMed, please start by reading the Retrieving and Processing PubMed Records using easyPubMed vignette. More information is available at the following URL: https://www.data-pulse.com/dev_site/easypubmed/.
In this vignette, we make use of some pre-processed PubMed records that are included in the easyPubMed package. You can access them using the utils::data() function.
library(easyPubMed)
library(dplyr)
library(kableExtra)
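For reference, the bundled sample records mentioned above can be loaded as follows. This is a minimal sketch: EPMsamples is the dataset used in the keyword-extraction section below, and names() merely lists its elements.
# Load the sample PubMed records bundled with easyPubMed
utils::data("EPMsamples", package = "easyPubMed")
# List the sample datasets that are included
names(EPMsamples)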
The following code is aimed at downloading and pre-processing a short list of PubMed records in small batches. In real-world applications, you may want to download large numbers of records in batches of about 1000 records or more, depending on your specific needs (a sketch of this approach follows the next code chunk).
# Query pubmed and fetch many results
my_query <- 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2016"[PDAT]:"2018"[PDAT]'
my_query <- get_pubmed_ids(my_query)
# Fetch data
my_abstracts_xml <- fetch_pubmed_data(my_query)
# Store Pubmed Records as elements of a list
all_xml <- articles_to_list(my_abstracts_xml)
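As a sketch of the large-scale approach mentioned above, batch_pubmed_download() (demonstrated in more detail later in this vignette) can write records to disk in batches. Here, my_large_query and the "bigbatch_" file prefix are illustrative placeholders, not part of the code above.
# Sketch: download a large set of records, 1000 records per batch;
# each batch is written to its own XML file and the file names are returned
my_large_query <- 'Bladder[TIAB] AND "2016"[PDAT]:"2018"[PDAT]'
batch_files <- batch_pubmed_download(pubmed_query_string = my_large_query,
                                     dest_file_prefix = "bigbatch_",
                                     batch_size = 1000)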
The following code illustrates the use of article_to_df() with getAuthors = FALSE for fast extraction of PubMed record titles and abstracts. This function processes PubMed records quickly and returns all record-level data, but no author information. Here, ~100 records were processed in less than 2 sec.
# Starting time: record
t.start <- Sys.time()
# Perform operation (use lapply here, no further parameters)
final_df <- do.call(rbind, lapply(all_xml, article_to_df,
max_chars = -1, getAuthors = FALSE))
# Final time: record
t.stop <- Sys.time()
# How long did it take?
print(t.stop - t.start)
## Time difference of 1.251275 secs
# Show an excerpt of the results
final_df[,c("pmid", "year", "abstract")] %>%
head() %>% kable() %>% kable_styling(bootstrap_options = 'striped')
pmid | year | abstract |
---|---|---|
31200845 | 2019 | While urothelial carcinom… |
31100250 | 2018 | Perioperative and long-te… |
30652586 | 2018 | Bladder cancer is initial… |
30580006 | 2019 | "To describe managem… |
30510971 | 2018 | The partial bladder outle… |
30479364 | 2018 | Chronic prostatitis/Chron… |
# If interested in specific information,
# you can subset the dataframe and save the
# desired columns/features
id_abst_df <- final_df[,c("pmid", "abstract")]
id_abst_df %>%
head(n=4) %>% kable() %>% kable_styling(bootstrap_options = 'striped')
pmid | abstract |
---|---|
31200845 | While urothelial carcinom… |
31100250 | Perioperative and long-te… |
30652586 | Bladder cancer is initial… |
30580006 | "To describe managem… |
The following code illustrates the use of article_to_df() with getKeywords = TRUE for recursive extraction of PubMed record info, including keywords. Author info extraction is a time-consuming process, but easyPubMed can handle this task in an efficient fashion. Here, we extract info from the ~100 PubMed records stored in the NUBL_1618 element of the bundled EPMsamples dataset. The processing time was less than 30 sec.
# Starting time: record
t.start <- Sys.time()
# Perform operation (use lapply here, no further parameters)
NUBL_records <- easyPubMed::EPMsamples$NUBL_1618$rec_lst
keyword_df <- do.call(rbind, lapply(NUBL_records,
article_to_df, autofill = TRUE,
max_chars = 100, getKeywords = TRUE))
# Final time: record
t.stop <- Sys.time()
# How long did it take?
print(t.stop - t.start)
## Time difference of 19.96102 secs
# Visualize Keywords extracted from PubMed records
# Keyword and MeSH Concepts are separated by semicolons
data.frame(keywords = keyword_df$keywords[seq(1, 100, by = 10)]) %>%
kable() %>% kable_styling(bootstrap_options = 'striped')
keywords |
---|
NA |
NA |
Bladder; VSOP; bladder obstruc… |
NA |
Biomarker; Bladder cancer; Cla… |
Nutrition; albumin; bladder ca… |
NA |
NA |
NA |
cluster analysis; diagnosis-re… |
# Show an excerpt of the results
keyword_df[seq(1, 100, by = 10), c("lastname", "firstname", "keywords")] %>%
kable() %>% kable_styling(bootstrap_options = 'striped')
 | lastname | firstname | keywords |
---|---|---|---|
1 | Glaser | Alexander P | NA |
11 | Nadler | Robert B | NA |
21 | Kochan | Kirsten S | Bladder; VSOP; bladder obstruc… |
31 | Wetterlin | Jessica | NA |
41 | Meeks | Joshua J | Biomarker; Bladder cancer; Cla… |
51 | Tollefson | Matthew K | Nutrition; albumin; bladder ca… |
61 | Marra | Angelo | NA |
71 | Vargas | Carlos E | NA |
81 | Larson | Gary L | NA |
91 | Weinfurt | Kevin P | cluster analysis; diagnosis-re… |
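Since keywords and MeSH concepts are returned as a single semicolon-delimited string, they can be split into a character vector using base R. A minimal sketch (kw_string is just an illustrative variable name):
# Take the first non-missing keyword string and split it at the semicolons
kw_string <- keyword_df$keywords[!is.na(keyword_df$keywords)][1]
trimws(strsplit(kw_string, split = ";")[[1]])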
The following code illustrates the use of article_to_df() in conjunction with parallelization. If multiple cores are available, splitting the job into multiple tasks can support faster info extraction from a large number of records. Here, ~100 records (as before) were processed in ~11 sec using 3 cores.
# Load required packages (available from CRAN).
# This will work on UNIX/LINUX systems.
# Windows systems may not support the following code.
library(parallel)
library(foreach)
library(doParallel)
# Starting time: record
t.start <- Sys.time()
# Start a cluster with 3 cores
cl <- makeCluster(3)
registerDoParallel(cl)
# Perform operation (use foreach)
# The .combine argument guides result aggregation
fullDF <- tryCatch(
{foreach(x=NUBL_records,
.packages = 'easyPubMed',
.combine = rbind) %dopar% article_to_df(pubmedArticle = x,
autofill = TRUE,
max_chars = 500,
getKeywords = TRUE,
getAuthors = TRUE)},
error = function(e) {NULL},
finally = {stopCluster(cl)})
# Final time: record
t.stop <- Sys.time()
# How long did it take?
print(t.stop - t.start)
## Time difference of 11.19819 secs
# Show an excerpt of the results
fullDF[seq(1, 100, by = 10), c("lastname", "keywords", "abstract")] %>%
kable() %>% kable_styling(bootstrap_options = 'striped')
 | lastname | keywords | abstract |
---|---|---|---|
1 | Glaser | NA | Bladder cancer is in… |
11 | Nadler | NA | "To describe ma… |
21 | Kochan | Bladder; VSOP; … | The partial bladder … |
31 | Wetterlin | NA | Cystectomy is the re… |
41 | Meeks | Biomarker; Blad… | Bladder cancer is th… |
51 | Tollefson | Nutrition; albu… | There are conflictin… |
61 | Marra | NA | Central cord syndrom… |
71 | Vargas | NA | Randomized evidence … |
81 | Larson | NA | Randomized evidence … |
91 | Weinfurt | cluster analysi… | Women with lower uri… |
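As noted in the code comments above, the foreach/doParallel approach may not work on Windows. A portable alternative is sketched below using parallel::parLapply() with a PSOCK cluster; this sketch is an assumption on our part and is not part of the timed code above.
# Sketch: a Windows-compatible alternative based on parLapply()
cl <- parallel::makeCluster(3)
# Make easyPubMed available on each worker
invisible(parallel::clusterEvalQ(cl, library(easyPubMed)))
# Process the records in parallel, then combine the per-record data frames
df_list <- parallel::parLapply(cl, NUBL_records, article_to_df,
                               autofill = TRUE, max_chars = 500,
                               getKeywords = TRUE, getAuthors = TRUE)
parallel::stopCluster(cl)
fullDF_alt <- do.call(rbind, df_list)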
The following code illustrates the use of the api_key argument, which was introduced in version 2.11. E-utils users are now limited to 3 requests/second if an API key is not provided; however, users can obtain an NCBI/Entrez API key to increase the e-utils limit to 10 requests/second. For more information, visit [https://www.ncbi.nlm.nih.gov/account/settings/](https://www.ncbi.nlm.nih.gov/account/settings/). Two easyPubMed functions accept an api_key argument: get_pubmed_ids() and batch_pubmed_download(). Requests submitted by the latter function are automatically paced, therefore the use of a key may speed up the queries if records are retrieved in small batches. Please use your own API key, as the one shown in this vignette has been replaced and is no longer valid.
# define a PubMed Query: this should return 40 results
my_query <- '"immune checkpoint" AND 2010[DP]:2012[DP]'
# Monitor time, and proceed with record download -- USING API_KEY!
t_key1 <- Sys.time()
set_01 <- batch_pubmed_download(my_query,
api_key = "NNNNNNNNNNe9108aee96ace507af23a4eb09",
batch_size = 2, dest_file_prefix = "TMP_api_")
t_key2 <- Sys.time()
# Monitor time, and proceed with record download -- DO NOT USE API_KEY!
t_nok1 <- Sys.time()
set_02 <- batch_pubmed_download(my_query,
batch_size = 2, dest_file_prefix = "TMP_no_")
t_nok2 <- Sys.time()
# Compute time differences
# The use of a key makes the process faster
print(paste("With key:", t_key2 - t_key1))
## [1] "With key: 19.9697108268738"
print(paste("W/o key:", t_nok2 - t_nok1))
## [1] "W/o key: 26.1454644203186"
Here, we demo get_pubmed_ids_by_fulltitle(), a new function included in version 2.11 of easyPubMed, and we compare its results with those of get_pubmed_ids(). Querying PubMed using full-length titles may be troublesome because of stopwords included in the title. To circumvent this problem, get_pubmed_ids_by_fulltitle() attempts a PubMed query after stopword removal whenever the original query returns no results.
# Define the query string and the query filter to apply
my_query <- "Body mass index and cancer risk among Chinese patients with type 2 diabetes mellitus"
my_field <- "[Title]"
# Standard query
res_01 <- get_pubmed_ids(paste("\"", my_query, "\"", my_field, sep = ""))
# Improved query (designed to query titles)
res_02 <- get_pubmed_ids_by_fulltitle(my_query, field = my_field)
## Display and compare the results
# Num results standard query
print(as.numeric(res_01$Count))
## [1] 0
# Num results title-specific query
print(as.numeric(res_02$Count))
## [1] 1
# Pubmed Record ID returned
print(as.numeric(res_02$IdList$Id[1]))
## [1] 30081866
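As a follow-up sketch (not part of the original output), the matching record can be fetched and its title inspected to confirm the hit. This assumes that res_02 can be passed to fetch_pubmed_data() like a regular get_pubmed_ids() result:
# Sketch: retrieve the matched record and check its title
rec_xml <- fetch_pubmed_data(res_02)
rec_df <- article_to_df(articles_to_list(rec_xml)[[1]], getAuthors = FALSE)
rec_df$title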
Thank you very much for using easyPubMed and/or reading this vignette. Please feel free to contact me (author/maintainer) for feedback, questions and suggestions: my email is <damiano.fantini(at)gmail(dot)com>. More info about easyPubMed is available at the following URL: www.data-pulse.com.
easyPubMed Copyright (C) 2017-2019 Damiano Fantini. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
!!Note!! If you are using easyPubMed for a scientific publication, please name the package in the Materials and Methods section of the paper. Thanks! Also, I am always open to collaborations. If you have an idea you would like to discuss or develop based on what you read in this Vignette, feel free to contact me via email. Thank you.
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] doParallel_1.0.15 iterators_1.0.12 foreach_1.4.7 kableExtra_1.1.0
## [5] dplyr_0.8.0.1 easyPubMed_2.17
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 pillar_1.3.1 compiler_3.5.2
## [4] highr_0.8 tools_3.5.2 digest_0.6.18
## [7] evaluate_0.13 tibble_2.1.1 viridisLite_0.3.0
## [10] pkgconfig_2.0.2 rlang_0.3.2 rstudioapi_0.10
## [13] yaml_2.2.0 xfun_0.5 stringr_1.4.0
## [16] httr_1.4.0 knitr_1.22 xml2_1.2.0
## [19] hms_0.4.2 webshot_0.5.1 tidyselect_0.2.5
## [22] glue_1.3.1 R6_2.4.0 rmarkdown_1.12
## [25] readr_1.3.1 purrr_0.3.2 magrittr_1.5
## [28] codetools_0.2-16 scales_1.0.0 htmltools_0.3.6
## [31] assertthat_0.2.1 rvest_0.3.2 colorspace_1.4-1
## [34] stringi_1.4.3 munsell_0.5.0 crayon_1.3.4