The EPMsamples dataset is a list of four examples showing how to download and analyze records from PubMed by using easyPubMed. Each element of the EPMsamples list corresponds to a different query and/or analysis, and is itself a list that stores intermediate results and notes about the analysis. These elements are:
qry_st: string used for querying PubMed
pm_ids: results returned by a PubMed query that used $qry_st as input, and was performed via get_pubmed_ids()
pm_res: results returned by an Entrez query that used $pm_ids as input, and was performed via fetch_pubmed_data() or batch_pubmed_download()
rec_lst: list of records returned by articles_to_list()
notes: notes about the arguments used to query Entrez and process the output
The code to re-build the full dataset is shown below.
library(easyPubMed)
library(dplyr)
library(kableExtra)
# Initialize the list and four sub-lists
EPMsamples <- list()
EPMsamples$DF_papers_abs <- list()
EPMsamples$DF_papers_std <- list()
EPMsamples$NUBL_dw18 <- list()
EPMsamples$NUBL_1618 <- list()
# Query #1
qry_st <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
pm_ids <- get_pubmed_ids(qry_st)
pm_res <- fetch_pubmed_data(pm_ids, format = "abstract")
notes <- "fetch_pubmed_data__arguments: format=\"abstract\""
EPMsamples$DF_papers_abs <- list()
EPMsamples$DF_papers_abs$qry_st <- qry_st
EPMsamples$DF_papers_abs$pm_ids <- pm_ids
EPMsamples$DF_papers_abs$pm_res <- pm_res
EPMsamples$DF_papers_abs$notes <- notes
# Query #2
qry_st <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
pm_ids <- get_pubmed_ids(qry_st)
pm_res <- fetch_pubmed_data(pm_ids)
notes <- ""
EPMsamples$DF_papers_std <- list()
EPMsamples$DF_papers_std$qry_st <- qry_st
EPMsamples$DF_papers_std$pm_ids <- pm_ids
EPMsamples$DF_papers_std$pm_res <- pm_res
EPMsamples$DF_papers_std$notes <- notes
# Query #3
qry_st <- 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2018"[PDAT]'
pm_res <- batch_pubmed_download(pubmed_query_string = qry_st,
                                format = "xml",
                                batch_size = 20,
                                dest_file_prefix = "easyPM_example",
                                encoding = "ASCII")
notes <- "batch_pubmed_download__arguments: format = \"xml\", batch_size = 20, dest_file_prefix = \"easyPM_example\", encoding = \"ASCII\""
EPMsamples$NUBL_dw18 <- list()
EPMsamples$NUBL_dw18$qry_st <- qry_st
EPMsamples$NUBL_dw18$pm_res <- pm_res
EPMsamples$NUBL_dw18$notes <- notes
# Query #4
qry_st <- 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2016"[PDAT]:"2018"[PDAT]'
pm_ids <- get_pubmed_ids(qry_st)
pm_res <- fetch_pubmed_data(pm_ids)
rec_lst <- articles_to_list(pm_res, simplify = FALSE)
notes <- "articles_to_list__arguments: simplify = FALSE"
EPMsamples$NUBL_1618 <- list()
EPMsamples$NUBL_1618$qry_st <- qry_st
EPMsamples$NUBL_1618$pm_ids <- pm_ids
EPMsamples$NUBL_1618$pm_res <- pm_res
EPMsamples$NUBL_1618$rec_lst <- rec_lst
EPMsamples$NUBL_1618$notes <- notes
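The assignments above repeat the same pattern for each query. As a minimal sketch (not part of easyPubMed), a small helper can assemble each sub-list by name, so that nothing depends on the position of an element in the list. The helper `make_sample()`, the `samples` list and all placeholder values are hypothetical; in the real build the values come from get_pubmed_ids(), fetch_pubmed_data(), batch_pubmed_download() and articles_to_list().

```r
# Hypothetical helper (not part of easyPubMed): collect the pieces of one
# query into a named list, dropping any piece that was not produced.
make_sample <- function(qry_st, pm_ids = NULL, pm_res = NULL,
                        rec_lst = NULL, notes = "") {
  out <- list(qry_st = qry_st, pm_ids = pm_ids, pm_res = pm_res,
              rec_lst = rec_lst, notes = notes)
  out[!vapply(out, is.null, logical(1))]
}

# Placeholder values stand in for real query results
samples <- list()
samples$DF_papers_std <- make_sample(
  qry_st = 'Damiano Fantini[AU] AND "2018"[PDAT]',
  pm_ids = list(Count = "0"),       # placeholder, not a real result
  pm_res = "<PubmedArticleSet/>"    # placeholder, not a real result
)
samples$NUBL_dw18 <- make_sample(
  qry_st = 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2018"[PDAT]',
  pm_res = "easyPM_example01.txt",  # placeholder file name
  notes  = "batch_pubmed_download__arguments: format = \"xml\""
)

# Which fields does each element carry?
fields <- c("qry_st", "pm_ids", "pm_res", "rec_lst", "notes")
has_field <- sapply(samples, function(s) fields %in% names(s))
rownames(has_field) <- fields
print(has_field)
```

Because every element is addressed by name, adding or reordering queries does not change which sub-list receives a given field.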
Note that you can load the dataset using the following lines of code.
library("easyPubMed")
data("EPMsamples")
The following code shows how to further process the data included in the EPMsamples dataset. A single record can be processed using the article_to_df() function. This will extract useful information and populate a data.frame.
# Process record num 6 - NUBL_1618
xx <- article_to_df(EPMsamples$NUBL_1618$rec_lst[[6]])
KEEP <- c("lastname", "firstname", "year", "month", "day", "pmid")
xx[, KEEP] %>% kable() %>% kable_styling(bootstrap_options = 'striped')
lastname | firstname | year | month | day | pmid |
---|---|---|---|---|---|
Fantini | Damiano | 2018 | 11 | 17 | 30446446 |
Seiler | Roland | 2018 | 11 | 17 | 30446446 |
Meeks | Joshua J | 2018 | 11 | 17 | 30446446 |
It is possible to process all the records iteratively using a loop, or using an apply-family function. Here, we are using lapply(). This returns a list of data.frames, which we can aggregate by using do.call(). For more info about these functions, try ?lapply and ?do.call.
# Iterative extraction of info from the records (~15 sec)
xx <- lapply(EPMsamples$NUBL_1618$rec_lst, article_to_df)
xx <- do.call(rbind, xx)
# Total number of rows in the resulting data.frame
print(nrow(xx))
## [1] 708
# Show an excerpt of the results
KEEP <- c("lastname", "firstname", "year", "month", "pmid")
xx[seq(1, 100, by = 10), KEEP] %>% kable() %>%
  kable_styling(bootstrap_options = 'striped')
row | lastname | firstname | year | month | pmid |
---|---|---|---|---|---|
1 | Glaser | Alexander P | 2019 | 01 | 30652586 |
11 | Nadler | Robert B | 2019 | 01 | 30580006 |
21 | Kochan | Kirsten S | 2018 | 12 | 30510971 |
31 | Wetterlin | Jessica | 2018 | 12 | 30475828 |
41 | Meeks | Joshua J | 2018 | 11 | 30421072 |
51 | Tollefson | Matthew K | 2018 | 12 | 30417048 |
61 | Marra | Angelo | 2018 | 11 | 30346310 |
71 | Vargas | Carlos E | 2018 | 11 | 30202801 |
81 | Larson | Gary L | 2018 | 11 | 30202801 |
91 | Wiseman | Jonathan B | 2018 | 12 | 29990467 |
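
The lapply() plus do.call(rbind, ...) pattern can be tried without easyPubMed on toy data. The sketch below assumes that each record is a plain list with a `pmid` and a vector of author last names; `toy_to_df()` is a hypothetical stand-in for article_to_df(), which parses real PubMed records. It shows how one data.frame per record, with one row per author, is stacked into a single data.frame.

```r
# Toy records (made-up values, not PubMed data)
toy_records <- list(
  list(pmid = "101", lastname = c("Smith", "Jones")),
  list(pmid = "102", lastname = c("Lee")),
  list(pmid = "103", lastname = c("Garcia", "Chen", "Patel"))
)

# Hypothetical stand-in for article_to_df(): one row per author,
# with the record-level field (pmid) repeated on every row
toy_to_df <- function(rec) {
  data.frame(pmid = rec$pmid, lastname = rec$lastname,
             stringsAsFactors = FALSE)
}

df_list <- lapply(toy_records, toy_to_df)  # list of 3 data.frames
toy_df  <- do.call(rbind, df_list)         # one data.frame, 6 rows
print(toy_df)
```

do.call() passes every element of the list as a separate argument to rbind(), which is why the result is a single data.frame regardless of how many records were processed.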
Now, we can query the xx data.frame using an exact-match search or a regular expression. For example, we can identify and return records of interest as shown below.
# Here, we search for PubMed records published by the *easyPubMed* maintainer, Damiano Fantini.
idx <- which(xx$lastname == "Fantini" & grepl("^D", xx$firstname))
# Show results
KEEP <- c("lastname", "firstname", "year", "month", "day", "pmid")
xx[idx, KEEP] %>% kable() %>%
  kable_styling(bootstrap_options = 'striped')
row | lastname | firstname | year | month | day | pmid |
---|---|---|---|---|---|---|
37 | Fantini | Damiano | 2018 | 11 | 17 | 30446446 |
40 | Fantini | Damiano | 2018 | 11 | 13 | 30421072 |
84 | Fantini | Damiano | 2018 | 11 | 14 | 30035181 |
218 | Fantini | Damiano | 2018 | 11 | 13 | 29435122 |
245 | Fantini | Damiano | 2018 | 11 | 13 | 29367767 |
429 | Fantini | Damiano | 2017 | 06 | 03 | 28169993 |
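
Exact matching and regular expressions can be combined in other ways. The sketch below uses a small made-up data.frame, `authors`, with some of the same column names as xx (the values are placeholders, not PubMed data) to contrast `==`, which requires the whole string to match, with grepl(), which tests a pattern.

```r
# Made-up author table (placeholder values, not PubMed data)
authors <- data.frame(
  lastname  = c("Fantini", "Fantini", "Meeks", "Fantino"),
  firstname = c("Damiano", "Elena", "Joshua J", "Dario"),
  year      = c("2018", "2017", "2018", "2018"),
  stringsAsFactors = FALSE
)

# Exact match on last name, regular expression on first name
idx_exact <- which(authors$lastname == "Fantini" &
                   grepl("^D", authors$firstname))

# Regular expression on last name: any last name starting with "Fantin"
idx_regex <- which(grepl("^Fantin", authors$lastname))

# Case-insensitive full-string pattern, restricted to one year
idx_2018 <- which(grepl("^fantini$", authors$lastname, ignore.case = TRUE) &
                  authors$year == "2018")

print(authors[idx_exact, ])
```

Anchoring the pattern with `^` and `$` keeps a regular expression from matching longer names that merely contain the search string.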
easyPubMed official website including news, vignettes, and further information https://www.data-pulse.com/dev_site/easypubmed/
Sayers, E. A General Introduction to the E-utilities (NCBI) https://www.ncbi.nlm.nih.gov/books/NBK25497/
PubMed Help (NCBI) https://www.ncbi.nlm.nih.gov/books/NBK3827/
Howto: basic usage of easyPubMed - an example Tutorial/Blog Post
Howto: using easyPubMed for a targeting campaign Tutorial/Blog Post
Dev version of easyPubMed on GitHub Website
Thank you very much for using easyPubMed and/or reading this vignette. Please feel free to contact me (author/maintainer) for feedback, questions and suggestions: my email is <damiano.fantini(at)gmail(dot)com>. More info about easyPubMed is available at the following URL: www.data-pulse.com.
easyPubMed Copyright (C) 2017-2019 Damiano Fantini. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
!!Note!! If you are using easyPubMed for a scientific publication, please name the package in the Materials and Methods section of the paper. Thanks! Also, I am always open to collaborations. If you have an idea you would like to discuss or develop based on what you read in this Vignette, feel free to contact me via email. Thank you.
SessionInfo
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.1.0 dplyr_0.8.0.1 easyPubMed_2.12
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 rstudioapi_0.10 xml2_1.2.0
## [4] knitr_1.22 magrittr_1.5 hms_0.4.2
## [7] munsell_0.5.0 rvest_0.3.2 tidyselect_0.2.5
## [10] viridisLite_0.3.0 colorspace_1.4-1 R6_2.4.0
## [13] rlang_0.3.2 highr_0.8 httr_1.4.0
## [16] stringr_1.4.0 tools_3.5.2 webshot_0.5.1
## [19] xfun_0.5 htmltools_0.3.6 yaml_2.2.0
## [22] assertthat_0.2.1 digest_0.6.18 tibble_2.1.1
## [25] crayon_1.3.4 purrr_0.3.2 readr_1.3.1
## [28] glue_1.3.1 evaluate_0.13 rmarkdown_1.12
## [31] stringi_1.4.3 compiler_3.5.2 pillar_1.3.1
## [34] scales_1.0.0 pkgconfig_2.0.2