This vignette discusses some advanced functions for analyzing PubMed records using easyPubMed, as well as new functions that were introduced in the latest version of the library. If you are looking for a tutorial about how to get started with easyPubMed, please start by reading the Retrieving and Processing PubMed Records using easyPubMed vignette. More information is available at the following URL: https://www.data-pulse.com/dev_site/easypubmed/.
In this vignette, we make use of some pre-processed PubMed records that are included in the easyPubMed package. You can access them using the utils::data() function.
library(easyPubMed)
library(dplyr)
library(kableExtra)
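For reference, the bundled sample records mentioned above can be loaded as follows. This is a minimal sketch: EPMsamples is the dataset used in the keyword-extraction section below, and names() merely lists its elements.
# Load the sample PubMed records bundled with easyPubMed
utils::data("EPMsamples", package = "easyPubMed")
# List the sample datasets that are included
names(EPMsamples)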
The following code is aimed at downloading and pre-processing a short list of PubMed records in small batches. In real-world applications, you may want to download large numbers of records in batches of about 1000 records or more, depending on your specific needs (a sketch of this approach follows the next code chunk).
# Query pubmed and fetch many results
my_query <- 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2016"[PDAT]:"2018"[PDAT]'
my_query <- get_pubmed_ids(my_query)
# Fetch data
my_abstracts_xml <- fetch_pubmed_data(my_query)
# Store Pubmed Records as elements of a list
all_xml <- articles_to_list(my_abstracts_xml)
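As a sketch of the large-scale approach mentioned above, batch_pubmed_download() (demonstrated in more detail later in this vignette) can write records to disk in batches. Here, my_large_query and the "bigbatch_" file prefix are illustrative placeholders, not part of the code above.
# Sketch: download a large set of records, 1000 records per batch;
# each batch is written to its own XML file and the file names are returned
my_large_query <- 'Bladder[TIAB] AND "2016"[PDAT]:"2018"[PDAT]'
batch_files <- batch_pubmed_download(pubmed_query_string = my_large_query,
                                     dest_file_prefix = "bigbatch_",
                                     batch_size = 1000)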
The following code illustrates the use of article_to_df() with getAuthors = FALSE for fast extraction of PubMed record titles and abstracts. This function processes PubMed records quickly and returns all record-level data, but no author information. Here, ~100 records were processed in less than 2 sec.
# Starting time: record
t.start <- Sys.time()
# Perform operation (use lapply here, no further parameters)
final_df <- do.call(rbind, lapply(all_xml, article_to_df,
max_chars = -1, getAuthors = FALSE))
# Final time: record
t.stop <- Sys.time()
# How long did it take?
print(t.stop - t.start)
## Time difference of 1.251275 secs
# Show an excerpt of the results
final_df[,c("pmid", "year", "abstract")] %>%
head() %>% kable() %>% kable_styling(bootstrap_options = 'striped')
pmid | year | abstract |
---|---|---|
31200845 | 2019 | While urothelial carcinom… |
31100250 | 2018 | Perioperative and long-te… |
30652586 | 2018 | Bladder cancer is initial… |
30580006 | 2019 | "To describe managem… |
30510971 | 2018 | The partial bladder outle… |
30479364 | 2018 | Chronic prostatitis/Chron… |
# If interested in specific information,
# you can subset the dataframe and save the
# desired columns/features
id_abst_df <- final_df[,c("pmid", "abstract")]
id_abst_df %>%
head(n=4) %>% kable() %>% kable_styling(bootstrap_options = 'striped')
pmid | abstract |
---|---|
31200845 | While urothelial carcinom… |
31100250 | Perioperative and long-te… |
30652586 | Bladder cancer is initial… |
30580006 | "To describe managem… |
The following code illustrates the use of article_to_df() with getKeywords = TRUE for recursive extraction of PubMed record info, including keywords. Author info extraction is a time-consuming process, but easyPubMed can handle this task in an efficient fashion. Here, we extract info from the ~100 PubMed records stored in the NUBL_1618 element of the bundled EPMsamples dataset. The processing time was less than 30 sec.
# Starting time: record
t.start <- Sys.time()
# Perform operation (use lapply here, no further parameters)
NUBL_records <- easyPubMed::EPMsamples$NUBL_1618$rec_lst
keyword_df <- do.call(rbind, lapply(NUBL_records,
article_to_df, autofill = TRUE,
max_chars = 100, getKeywords = TRUE))
# Final time: record
t.stop <- Sys.time()
# How long did it take?
print(t.stop - t.start)
## Time difference of 19.96102 secs
# Visualize Keywords extracted from PubMed records
# Keyword and MeSH Concepts are separated by semicolons
data.frame(keywords = keyword_df$keywords[seq(1, 100, by = 10)]) %>%
kable() %>% kable_styling(bootstrap_options = 'striped')
keywords |
---|
NA |
NA |
Bladder; VSOP; bladder obstruc… |
NA |
Biomarker; Bladder cancer; Cla… |
Nutrition; albumin; bladder ca… |
NA |
NA |
NA |
cluster analysis; diagnosis-re… |
# Show an excerpt of the results
keyword_df[seq(1, 100, by = 10), c("lastname", "firstname", "keywords")] %>%
kable() %>% kable_styling(bootstrap_options = 'striped')
 | lastname | firstname | keywords |
---|---|---|---|
1 | Glaser | Alexander P | NA |
11 | Nadler | Robert B | NA |
21 | Kochan | Kirsten S | Bladder; VSOP; bladder obstruc… |
31 | Wetterlin | Jessica | NA |
41 | Meeks | Joshua J | Biomarker; Bladder cancer; Cla… |
51 | Tollefson | Matthew K | Nutrition; albumin; bladder ca… |
61 | Marra | Angelo | NA |
71 | Vargas | Carlos E | NA |
81 | Larson | Gary L | NA |
91 | Weinfurt | Kevin P | cluster analysis; diagnosis-re… |
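Since keywords and MeSH concepts are returned as a single semicolon-delimited string, they can be split into a character vector using base R. A minimal sketch (kw_string is just an illustrative variable name):
# Take the first non-missing keyword string and split it at the semicolons
kw_string <- keyword_df$keywords[!is.na(keyword_df$keywords)][1]
trimws(strsplit(kw_string, split = ";")[[1]])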
The following code illustrates the use of article_to_df() in conjunction with parallelization. If multiple cores are available, splitting the job into multiple tasks can support faster info extraction from a large number of records. Here, ~100 records (as before) were processed in ~11 sec using 3 cores.
# Load required packages (available from CRAN).
# This will work on UNIX/LINUX systems.
# Windows systems may not support the following code.
library(parallel)
library(foreach)
library(doParallel)
# Starting time: record
t.start <- Sys.time()
# Start a cluster with 3 cores
cl <- makeCluster(3)
registerDoParallel(cl)
# Perform operation (use foreach)
# The .combine argument guides result aggregation
fullDF <- tryCatch(
{foreach(x=NUBL_records,
.packages = 'easyPubMed',
.combine = rbind) %dopar% article_to_df(pubmedArticle = x,
autofill = TRUE,
max_chars = 500,
getKeywords = TRUE,
getAuthors = TRUE)},
error = function(e) {NULL},
finally = {stopCluster(cl)})
# Final time: record
t.stop <- Sys.time()
# How long did it take?
print(t.stop - t.start)
## Time difference of 11.19819 secs
# Show an excerpt of the results
fullDF[seq(1, 100, by = 10), c("lastname", "keywords", "abstract")] %>%
kable() %>% kable_styling(bootstrap_options = 'striped')
 | lastname | keywords | abstract |
---|---|---|---|
1 | Glaser | NA | Bladder cancer is in… |
11 | Nadler | NA | "To describe ma… |
21 | Kochan | Bladder; VSOP; … | The partial bladder … |
31 | Wetterlin | NA | Cystectomy is the re… |
41 | Meeks | Biomarker; Blad… | Bladder cancer is th… |
51 | Tollefson | Nutrition; albu… | There are conflictin… |
61 | Marra | NA | Central cord syndrom… |
71 | Vargas | NA | Randomized evidence … |
81 | Larson | NA | Randomized evidence … |
91 | Weinfurt | cluster analysi… | Women with lower uri… |
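As noted in the code comments above, the foreach/doParallel approach may not work on Windows. A portable alternative is sketched below using parallel::parLapply() with a PSOCK cluster; this sketch is an assumption on our part and is not part of the timed code above.
# Sketch: a Windows-compatible alternative based on parLapply()
cl <- parallel::makeCluster(3)
# Make easyPubMed available on each worker
invisible(parallel::clusterEvalQ(cl, library(easyPubMed)))
# Process the records in parallel, then combine the per-record data frames
df_list <- parallel::parLapply(cl, NUBL_records, article_to_df,
                               autofill = TRUE, max_chars = 500,
                               getKeywords = TRUE, getAuthors = TRUE)
parallel::stopCluster(cl)
fullDF_alt <- do.call(rbind, df_list)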
The following code illustrates the use of the api_key argument, which was introduced in version 2.11. E-utils users are now limited to 3 requests/second if an API key is not provided; however, users can obtain an NCBI/Entrez API key to increase the e-utils limit to 10 requests/second. For more information, visit [https://www.ncbi.nlm.nih.gov/account/settings/](https://www.ncbi.nlm.nih.gov/account/settings/). Two easyPubMed functions accept an api_key argument: get_pubmed_ids() and batch_pubmed_download(). Requests submitted by the latter function are automatically paced, therefore the use of a key may speed up the queries if records are retrieved in small batches. Please use your own API key, as the one shown in this vignette has been replaced and is no longer valid.
# define a PubMed Query: this should return 40 results
my_query <- '"immune checkpoint" AND 2010[DP]:2012[DP]'
# Monitor time, and proceed with record download -- USING API_KEY!
t_key1 <- Sys.time()
set_01 <- batch_pubmed_download(my_query,
api_key = "NNNNNNNNNNe9108aee96ace507af23a4eb09",
batch_size = 2, dest_file_prefix = "TMP_api_")
t_key2 <- Sys.time()
# Monitor time, and proceed with record download -- DO NOT USE API_KEY!
t_nok1 <- Sys.time()
set_02 <- batch_pubmed_download(my_query,
batch_size = 2, dest_file_prefix = "TMP_no_")
t_nok2 <- Sys.time()
# Compute time differences
# The use of a key makes the process faster
print(paste("With key:", t_key2 - t_key1))
## [1] "With key: 19.9697108268738"
print(paste("W/o key:", t_nok2 - t_nok1))
## [1] "W/o key: 26.1454644203186"
Here, we demo get_pubmed_ids_by_fulltitle(), a new function included in version 2.11 of easyPubMed, and we compare its results with those of get_pubmed_ids(). Querying PubMed using full-length titles may be troublesome because of stopwords included in the title. To circumvent this problem, get_pubmed_ids_by_fulltitle() attempts a PubMed query after stopword removal whenever the original query returns no results.
# Define the query string and the query filter to apply
my_query <- "Body mass index and cancer risk among Chinese patients with type 2 diabetes mellitus"
my_field <- "[Title]"
# Standard query
res_01 <- get_pubmed_ids(paste("\"", my_query, "\"", my_field, sep = ""))
# Improved query (designed to query titles)
res_02 <- get_pubmed_ids_by_fulltitle(my_query, field = my_field)
## Display and compare the results
# Num results standard query
print(as.numeric(res_01$Count))
## [1] 0
# Num results title-specific query
print(as.numeric(res_02$Count))
## [1] 1
# Pubmed Record ID returned
print(as.numeric(res_02$IdList$Id[1]))
## [1] 30081866
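As a follow-up sketch (not part of the original output), the matching record can be fetched and its title inspected to confirm the hit. This assumes that res_02 can be passed to fetch_pubmed_data() like a regular get_pubmed_ids() result:
# Sketch: retrieve the matched record and check its title
rec_xml <- fetch_pubmed_data(res_02)
rec_df <- article_to_df(articles_to_list(rec_xml)[[1]], getAuthors = FALSE)
rec_df$title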
Thank you very much for using easyPubMed and/or reading this vignette. Please feel free to contact me (author/maintainer) for feedback, questions and suggestions: my email is <damiano.fantini(at)gmail(dot)com>. More info about easyPubMed is available at the following URL: www.data-pulse.com.
easyPubMed Copyright (C) 2017-2019 Damiano Fantini. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
!!Note!! If you are using easyPubMed for a scientific publication, please name the package in the Materials and Methods section of the paper. Thanks! Also, I am always open to collaborations. If you have an idea you would like to discuss or develop based on what you read in this Vignette, feel free to contact me via email. Thank you.
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] doParallel_1.0.15 iterators_1.0.12 foreach_1.4.7 kableExtra_1.1.0
## [5] dplyr_0.8.0.1 easyPubMed_2.17
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 pillar_1.3.1 compiler_3.5.2
## [4] highr_0.8 tools_3.5.2 digest_0.6.18
## [7] evaluate_0.13 tibble_2.1.1 viridisLite_0.3.0
## [10] pkgconfig_2.0.2 rlang_0.3.2 rstudioapi_0.10
## [13] yaml_2.2.0 xfun_0.5 stringr_1.4.0
## [16] httr_1.4.0 knitr_1.22 xml2_1.2.0
## [19] hms_0.4.2 webshot_0.5.1 tidyselect_0.2.5
## [22] glue_1.3.1 R6_2.4.0 rmarkdown_1.12
## [25] readr_1.3.1 purrr_0.3.2 magrittr_1.5
## [28] codetools_0.2-16 scales_1.0.0 htmltools_0.3.6
## [31] assertthat_0.2.1 rvest_0.3.2 colorspace_1.4-1
## [34] stringi_1.4.3 munsell_0.5.0 crayon_1.3.4