Building the EPMsamples dataset

The EPMsamples dataset is a list of four examples that show how to download and analyze PubMed records with easyPubMed. Each element of EPMsamples corresponds to a different query and/or analysis, and is itself a list that stores intermediates and notes about the analysis. Depending on the example, a sub-list includes some or all of the following elements:

qry_st: string used for querying PubMed
pm_ids: results returned by a PubMed query that used $qry_st as input, and was performed via get_pubmed_ids()
pm_res: results returned by an Entrez query that used $pm_ids as input, and was performed via fetch_pubmed_data() or batch_pubmed_download()
rec_lst: list of records returned by articles_to_list()
notes: notes about the arguments used to query Entrez and process the output

The code to re-build the full dataset is shown below.

Building the dataset

library(easyPubMed)
library(dplyr)
library(kableExtra)

# Initialize the list and four sub-lists
EPMsamples <- list()

EPMsamples$DF_papers_abs <- list()
EPMsamples$DF_papers_std <- list()
EPMsamples$NUBL_dw18 <- list()
EPMsamples$NUBL_1618 <- list()

# Query #1
qry_st <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
pm_ids <- get_pubmed_ids(qry_st)
pm_res <- fetch_pubmed_data(pm_ids, format = "abstract")
notes <- "fetch_pubmed_data__arguments: format=\"abstract\""

EPMsamples$DF_papers_abs <- list()
EPMsamples$DF_papers_abs$qry_st <- qry_st
EPMsamples$DF_papers_abs$pm_ids <- pm_ids
EPMsamples$DF_papers_abs$pm_res <- pm_res
EPMsamples$DF_papers_abs$notes <- notes

# Query #2
qry_st <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
pm_ids <- get_pubmed_ids(qry_st)
pm_res <- fetch_pubmed_data(pm_ids)
notes <- ""

EPMsamples$DF_papers_std <- list()
EPMsamples$DF_papers_std$qry_st <- qry_st
EPMsamples$DF_papers_std$pm_ids <- pm_ids
EPMsamples$DF_papers_std$pm_res <- pm_res
EPMsamples$DF_papers_std$notes <- notes

# Query #3
qry_st <- 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2018"[PDAT]' 
pm_res <- batch_pubmed_download(pubmed_query_string = qry_st, 
                                format = "xml", 
                                batch_size = 20,
                                dest_file_prefix = "easyPM_example",
                                encoding = "ASCII")
notes <- "batch_pubmed_download__arguments: format = \"xml\", batch_size = 20, dest_file_prefix = \"easyPM_example\", encoding = \"ASCII\""

EPMsamples$NUBL_dw18 <- list()
EPMsamples$NUBL_dw18$qry_st <- qry_st
EPMsamples$NUBL_dw18$pm_res <- pm_res
EPMsamples$NUBL_dw18$notes <- notes

# Query #4
qry_st <- 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2016"[PDAT]:"2018"[PDAT]' 
pm_ids <- get_pubmed_ids(qry_st) 
pm_res <- fetch_pubmed_data(pm_ids)
rec_lst <- articles_to_list(pm_res, simplify = FALSE) 
notes <- "articles_to_list__arguments: simplify = FALSE"

EPMsamples$NUBL_1618 <- list()
EPMsamples$NUBL_1618$qry_st <- qry_st
EPMsamples$NUBL_1618$pm_ids <- pm_ids
EPMsamples$NUBL_1618$pm_res <- pm_res
EPMsamples$NUBL_1618$rec_lst <- rec_lst
EPMsamples$NUBL_1618$notes <- notes
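
The assembly steps above repeat the same pattern for each query. As a minimal sketch of how this could be condensed, the snippet below wraps the pattern in a helper. Note that make_sample() is a hypothetical function written for illustration (it is not part of easyPubMed), and that the values passed to it are placeholders rather than real query results.

```r
# Hypothetical helper (not part of easyPubMed): collect the intermediates
# of one query into a named list, skipping elements that were not produced
make_sample <- function(qry_st, pm_ids = NULL, pm_res = NULL,
                        rec_lst = NULL, notes = "") {
  out <- list(qry_st = qry_st, pm_ids = pm_ids, pm_res = pm_res,
              rec_lst = rec_lst, notes = notes)
  # Drop the elements that were left as NULL
  out[!vapply(out, is.null, logical(1))]
}

# Placeholder strings stand in for real query results
toy <- list()
toy[["DF_papers_std"]] <- make_sample(
  qry_st = 'Damiano Fantini[AU] AND "2018"[PDAT]',
  pm_ids = "placeholder_ids",
  pm_res = "placeholder_xml"
)

# Elements stored for this example (rec_lst was not supplied)
names(toy$DF_papers_std)
```

Assigning each sub-list by name, rather than by position, keeps the result independent of the order in which the sub-lists were created.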

Loading and using the data

The dataset is included in easyPubMed and can be loaded with the following lines of code.

library("easyPubMed")
data("EPMsamples")
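
Not every example stores all five elements (for instance, Query #3 has no pm_ids and only Query #4 has rec_lst). The following sketch shows one way to summarize which elements are present in each sub-list. To keep it runnable without the package, it uses a stand-in list with the same names as in the build code and NA placeholders instead of real results; element_overview() is a hypothetical helper, not an easyPubMed function.

```r
# Stand-in for EPMsamples: same sub-list and element names as in the build
# code, with placeholder values (NA) instead of real PubMed results
toy_samples <- list(
  DF_papers_abs = list(qry_st = NA, pm_ids = NA, pm_res = NA, notes = NA),
  DF_papers_std = list(qry_st = NA, pm_ids = NA, pm_res = NA, notes = NA),
  NUBL_dw18     = list(qry_st = NA, pm_res = NA, notes = NA),
  NUBL_1618     = list(qry_st = NA, pm_ids = NA, pm_res = NA,
                       rec_lst = NA, notes = NA)
)

# Hypothetical helper: report which of the five documented elements
# each sub-list contains (rows = sub-lists, columns = elements)
element_overview <- function(x) {
  fields <- c("qry_st", "pm_ids", "pm_res", "rec_lst", "notes")
  t(vapply(x,
           function(el) setNames(fields %in% names(el), fields),
           setNames(logical(length(fields)), fields)))
}

ov <- element_overview(toy_samples)
ov
```

After loading the real dataset, the same helper could be applied as element_overview(EPMsamples).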

Further analyses

The following code shows how to further process the data included in the EPMsamples dataset. A single record can be processed using the article_to_df() function. This will extract useful information and populate a data.frame.

# Process record num 6 - NUBL_1618
xx <- article_to_df(EPMsamples$NUBL_1618$rec_lst[[6]])
KEEP <- c("lastname", "firstname", "year", "month", "day", "pmid")
xx[, KEEP] %>% kable() %>% kable_styling(bootstrap_options = 'striped')

lastname	firstname	year	month	day	pmid
Fantini	Damiano	2018	11	17	30446446
Seiler	Roland	2018	11	17	30446446
Meeks	Joshua J	2018	11	17	30446446

It is possible to process all the records iteratively, using either a loop or an apply-family function. Here, we use lapply(), which returns a list of data.frames (one per record). We then combine them into a single data.frame by using do.call() together with rbind(). For more info about these functions, try ?lapply and ?do.call.

# Iterative extraction of info from the records (~15 sec)
xx <- lapply(EPMsamples$NUBL_1618$rec_lst, article_to_df)
xx <- do.call(rbind, xx)

# Total number of rows in the resulting data.frame
print(nrow(xx))

## [1] 708

# Show an excerpt of the results
KEEP <- c("lastname", "firstname", "year", "month", "pmid")
xx[seq(1, 100, by = 10), KEEP] %>% kable() %>% 
  kable_styling(bootstrap_options = 'striped')

	lastname	firstname	year	month	pmid
1	Glaser	Alexander P	2019	01	30652586
11	Nadler	Robert B	2019	01	30580006
21	Kochan	Kirsten S	2018	12	30510971
31	Wetterlin	Jessica	2018	12	30475828
41	Meeks	Joshua J	2018	11	30421072
51	Tollefson	Matthew K	2018	12	30417048
61	Marra	Angelo	2018	11	30346310
71	Vargas	Carlos E	2018	11	30202801
81	Larson	Gary L	2018	11	30202801
91	Wiseman	Jonathan B	2018	12	29990467
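
The lapply() and do.call(rbind, ...) pattern used above can be illustrated on toy data. This is only a sketch: the records are placeholders, and record_to_df() is a hypothetical stand-in that, like article_to_df(), returns one row per author.

```r
# Toy records standing in for PubMed records (placeholder values)
records <- list(
  list(pmid = "001", authors = c("Smith", "Jones")),
  list(pmid = "002", authors = "Lee")
)

# Hypothetical per-record parser: one row per author
record_to_df <- function(rec) {
  data.frame(pmid = rec$pmid, lastname = rec$authors,
             stringsAsFactors = FALSE)
}

# lapply() returns a list of data.frames, one per record
df_list <- lapply(records, record_to_df)

# do.call(rbind, df_list) is equivalent to rbind(df_list[[1]], df_list[[2]])
all_rows <- do.call(rbind, df_list)

# Three rows in total: two authors from the first record, one from the second
nrow(all_rows)
```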

Now, we can query the xx data.frame using a perfect-match search or a regular expression. For example, we can identify and return records of interest as shown below.

# Here, we search for PubMed records published by the *easyPubMed* maintainer, Damiano Fantini.
idx <- which(xx$lastname == "Fantini" & grepl("^D", xx$firstname)) 

# Show results
KEEP <- c("lastname", "firstname", "year", "month", "day", "pmid")
xx[idx, KEEP] %>% kable() %>% 
  kable_styling(bootstrap_options = 'striped')

	lastname	firstname	year	month	day	pmid
37	Fantini	Damiano	2018	11	17	30446446
40	Fantini	Damiano	2018	11	13	30421072
84	Fantini	Damiano	2018	11	14	30035181
218	Fantini	Damiano	2018	11	13	29435122
245	Fantini	Damiano	2018	11	13	29367767
429	Fantini	Damiano	2017	06	03	28169993
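
To make the difference between the two search styles explicit, here is a small sketch on a toy author table (the names are placeholders chosen for illustration, not PubMed data). A perfect-match search with == requires the whole string to be identical, whereas grepl() tests a regular expression against each value.

```r
# Toy author table (placeholder names, not real PubMed data)
authors <- data.frame(
  lastname  = c("Fantini", "Fantini", "Fantin", "Meeks"),
  firstname = c("Damiano", "Anna", "Dario", "Joshua J"),
  stringsAsFactors = FALSE
)

# Perfect match: the whole string must be identical to "Fantini"
exact_idx <- which(authors$lastname == "Fantini")

# Regular expression: "^D" matches any first name starting with "D"
regex_idx <- which(grepl("^D", authors$firstname))

# Combining both conditions keeps only the rows that satisfy each of them
both_idx <- which(authors$lastname == "Fantini" &
                    grepl("^D", authors$firstname))
authors[both_idx, ]
```

Combining the two conditions with & is what restricts the result to a single author; either condition alone would also return unrelated rows.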

References

easyPubMed official website including news, vignettes, and further information https://www.data-pulse.com/dev_site/easypubmed/
Sayers, E. A General Introduction to the E-utilities (NCBI) https://www.ncbi.nlm.nih.gov/books/NBK25497/
PubMed Help (NCBI) https://www.ncbi.nlm.nih.gov/books/NBK3827/
Howto: basic usage of easyPubMed - an example Tutorial/Blog Post
Howto: using easyPubMed for a targeting campaign Tutorial/Blog Post
Dev version of easyPubMed on GitHub Website

Feedback and Citation

Thank you very much for using easyPubMed and/or reading this vignette. Please feel free to contact me (author/maintainer) with feedback, questions, and suggestions: my email is <damiano.fantini(at)gmail(dot)com>. More info about easyPubMed is available at the following URL: www.data-pulse.com.

easyPubMed Copyright (C) 2017-2019 Damiano Fantini. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

!!Note!! If you are using easyPubMed for a scientific publication, please name the package in the Materials and Methods section of the paper. Thanks! Also, I am always open to collaborations. If you have an idea you would like to discuss or develop based on what you read in this Vignette, feel free to contact me via email. Thank you.

SessionInfo

sessionInfo()

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kableExtra_1.1.0 dplyr_0.8.0.1    easyPubMed_2.12 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1        rstudioapi_0.10   xml2_1.2.0       
##  [4] knitr_1.22        magrittr_1.5      hms_0.4.2        
##  [7] munsell_0.5.0     rvest_0.3.2       tidyselect_0.2.5 
## [10] viridisLite_0.3.0 colorspace_1.4-1  R6_2.4.0         
## [13] rlang_0.3.2       highr_0.8         httr_1.4.0       
## [16] stringr_1.4.0     tools_3.5.2       webshot_0.5.1    
## [19] xfun_0.5          htmltools_0.3.6   yaml_2.2.0       
## [22] assertthat_0.2.1  digest_0.6.18     tibble_2.1.1     
## [25] crayon_1.3.4      purrr_0.3.2       readr_1.3.1      
## [28] glue_1.3.1        evaluate_0.13     rmarkdown_1.12   
## [31] stringi_1.4.3     compiler_3.5.2    pillar_1.3.1     
## [34] scales_1.0.0      pkgconfig_2.0.2