The EPMsamples dataset is a list of four examples showing how to download and analyze records from PubMed using easyPubMed. Each element of EPMsamples corresponds to a different query and/or analysis, and is itself a list holding intermediate results and notes about the analysis. The four elements are DF_papers_abs, DF_papers_std, NUBL_dw18, and NUBL_1618.

The code to re-build the full dataset is shown below.

Building the dataset

library(easyPubMed)
library(dplyr)
library(kableExtra)

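# Initialize the list and its four sub-lists, in the order of the queries below
EPMsamples <- list()

EPMsamples$DF_papers_abs <- list()
EPMsamples$DF_papers_std <- list()
EPMsamples$NUBL_dw18 <- list()
EPMsamples$NUBL_1618 <- list()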

# Query #1
qry_st <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
pm_ids <- get_pubmed_ids(qry_st)
pm_res <- fetch_pubmed_data(pm_ids, format = "abstract")
notes <- "fetch_pubmed_data__arguments: format=\"abstract\""

EPMsamples$DF_papers_abs <- list()
EPMsamples$DF_papers_abs$qry_st <- qry_st
EPMsamples$DF_papers_abs$pm_ids <- pm_ids
EPMsamples$DF_papers_abs$pm_res <- pm_res
EPMsamples$DF_papers_abs$notes <- notes

# Query #2
qry_st <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
pm_ids <- get_pubmed_ids(qry_st)
pm_res <- fetch_pubmed_data(pm_ids)
notes <- ""

EPMsamples$DF_papers_std <- list()
EPMsamples$DF_papers_std$qry_st <- qry_st
EPMsamples$DF_papers_std$pm_ids <- pm_ids
EPMsamples$DF_papers_std$pm_res <- pm_res
EPMsamples$DF_papers_std$notes <- notes

# Query #3
qry_st <- 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2018"[PDAT]' 
pm_res <- batch_pubmed_download(pubmed_query_string = qry_st, 
                                format = "xml", 
                                batch_size = 20,
                                dest_file_prefix = "easyPM_example",
                                encoding = "ASCII")
notes <- "batch_pubmed_download__arguments: format = \"xml\", batch_size = 20, dest_file_prefix = \"easyPM_example\", encoding = \"ASCII\""

EPMsamples$NUBL_dw18 <- list()
EPMsamples$NUBL_dw18$qry_st <- qry_st
EPMsamples$NUBL_dw18$pm_res <- pm_res
EPMsamples$NUBL_dw18$notes <- notes

# Query #4
qry_st <- 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2016"[PDAT]:"2018"[PDAT]' 
pm_ids <- get_pubmed_ids(qry_st) 
pm_res <- fetch_pubmed_data(pm_ids)
rec_lst <- articles_to_list(pm_res, simplify = FALSE) 
notes <- "articles_to_list__arguments: simplify = FALSE"

EPMsamples$NUBL_1618 <- list()
EPMsamples$NUBL_1618$qry_st <- qry_st
EPMsamples$NUBL_1618$pm_ids <- pm_ids
EPMsamples$NUBL_1618$pm_res <- pm_res
EPMsamples$NUBL_1618$rec_lst <- rec_lst
EPMsamples$NUBL_1618$notes <- notes
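
When a nested list such as EPMsamples is assembled step by step, each element takes its position from the order in which its name is first assigned. Positional indices like [[2]] therefore depend on that order, while names do not. The following base-R sketch illustrates the difference with a toy list; the names toy, sample_a and sample_b are placeholders and are not part of easyPubMed.

```r
# Toy list: positions follow the order of first assignment
toy <- list()
toy$sample_b <- list()   # created first, so it sits at position 1
toy$sample_a <- list()   # created second, so it sits at position 2

print(names(toy))

# Name-based assignment always reaches the intended element
toy$sample_a$notes <- "note for sample_a"

# Position-based assignment reaches whatever sits at that index
toy[[1]]$notes <- "note stored at position 1"

# Position 1 holds sample_b, not sample_a
print(toy$sample_b$notes)
print(toy$sample_a$notes)
```

Assigning by name is usually the safer choice when a list is built incrementally, because reordering the initialization code does not change which element is updated.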

Loading and using the data

Note that you can load the dataset using the following lines of code.

library("easyPubMed")
data("EPMsamples")
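
Once loaded, the structure of the dataset can be inspected with base R functions. The sketch below assumes that easyPubMed is installed and that the installed version ships the EPMsamples dataset (as version 2.12, used in this vignette, does); other versions may differ, so the code checks before printing anything.

```r
# Load the dataset only if the package is available
if (requireNamespace("easyPubMed", quietly = TRUE)) {
  data("EPMsamples", package = "easyPubMed")
}

# Inspect the dataset only if it was actually loaded
if (exists("EPMsamples")) {
  # Names of the top-level elements (one per example)
  print(names(EPMsamples))

  # Names of the objects stored inside each example
  print(lapply(EPMsamples, names))

  # Query string used for one of the examples
  print(EPMsamples$NUBL_1618$qry_st)
}
```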

Further analyses

The following code shows how to further process the data included in the EPMsamples dataset. A single record can be processed using the article_to_df() function. This will extract useful information and populate a data.frame.

# Process record num 6 - NUBL_1618
xx <- article_to_df(EPMsamples$NUBL_1618$rec_lst[[6]])
KEEP <- c("lastname", "firstname", "year", "month", "day", "pmid")
xx[, KEEP] %>% kable() %>% kable_styling(bootstrap_options = 'striped')
lastname  firstname  year  month  day  pmid
Fantini   Damiano    2018  11     17   30446446
Seiler    Roland     2018  11     17   30446446
Meeks     Joshua J   2018  11     17   30446446

It is possible to process the records iteratively using a loop or an apply-family function. Here, we use lapply(), which returns a list of data.frames that we then combine into a single data.frame with do.call() and rbind(). For more information about these functions, try ?lapply and ?do.call.

# Iterative extraction of info from the records (~15 sec)
xx <- lapply(EPMsamples$NUBL_1618$rec_lst, article_to_df)
xx <- do.call(rbind, xx)

# Total number of rows in the resulting data.frame
print(nrow(xx))
## [1] 708
# Show an excerpt of the results
KEEP <- c("lastname", "firstname", "year", "month", "pmid")
xx[seq(1, 100, by = 10), KEEP] %>% kable() %>% 
  kable_styling(bootstrap_options = 'striped')
     lastname   firstname    year  month  pmid
1    Glaser     Alexander P  2019  01     30652586
11   Nadler     Robert B     2019  01     30580006
21   Kochan     Kirsten S    2018  12     30510971
31   Wetterlin  Jessica      2018  12     30475828
41   Meeks      Joshua J     2018  11     30421072
51   Tollefson  Matthew K    2018  12     30417048
61   Marra      Angelo       2018  11     30346310
71   Vargas     Carlos E     2018  11     30202801
81   Larson     Gary L       2018  11     30202801
91   Wiseman    Jonathan B   2018  12     29990467
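
The lapply() and do.call(rbind, ...) pattern used above can be tried without any PubMed data. In the sketch below, record_to_df() is a hypothetical stand-in for article_to_df(), and the records are made-up values chosen only to show how per-record data.frames are stacked into one.

```r
# Hypothetical stand-in for article_to_df(): one row per author of a record
record_to_df <- function(rec) {
  data.frame(pmid = rec$pmid,
             lastname = rec$authors,
             stringsAsFactors = FALSE)
}

# Toy records (made-up values, not real PubMed data)
records <- list(
  list(pmid = "001", authors = c("Smith", "Jones")),
  list(pmid = "002", authors = "Lee")
)

# lapply() returns a list with one data.frame per record
df_list <- lapply(records, record_to_df)

# do.call(rbind, ...) stacks those data.frames row-wise
all_rows <- do.call(rbind, df_list)
print(all_rows)
```

Because each record contributes one row per author, the number of rows in the combined data.frame is the total number of author entries, not the number of records.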

Now, we can query the xx data.frame using an exact-match search or a regular expression. For example, we can identify and return records of interest as shown below.

# Here, we search for PubMed records published by the *easyPubMed* maintainer, Damiano Fantini.
idx <- which(xx$lastname == "Fantini" & grepl("^D", xx$firstname)) 

# Show results
KEEP <- c("lastname", "firstname", "year", "month", "day", "pmid")
xx[idx, KEEP] %>% kable() %>% 
  kable_styling(bootstrap_options = 'striped')
     lastname  firstname  year  month  day  pmid
37   Fantini   Damiano    2018  11     17   30446446
40   Fantini   Damiano    2018  11     13   30421072
84   Fantini   Damiano    2018  11     14   30035181
218  Fantini   Damiano    2018  11     13   29435122
245  Fantini   Damiano    2018  11     13   29367767
429  Fantini   Damiano    2017  06     03   28169993
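
The same combination of an exact match and a regular expression can be tested on a small, made-up data.frame. All names and PMIDs below are fictional placeholders; only which(), grepl() and unique() from base R are used.

```r
# Toy author table (fictional values)
authors <- data.frame(
  lastname  = c("Rossi", "Bianchi", "Rossi", "Verdi"),
  firstname = c("Daniela", "Luca", "Marco", "Dario"),
  pmid      = c("101", "101", "102", "103"),
  stringsAsFactors = FALSE
)

# Exact match on the last name, combined with a regular expression
# on the first name ("^D" matches first names starting with D)
idx <- which(authors$lastname == "Rossi" & grepl("^D", authors$firstname))
print(authors[idx, ])

# PMIDs of the matching records, without duplicates
print(unique(authors$pmid[idx]))
```

Combining both conditions excludes rows that satisfy only one of them, such as a different author sharing the same last name or the same initial.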

Feedback and Citation

Thank you very much for using easyPubMed and/or reading this vignette. Please feel free to contact me (author/maintainer) with feedback, questions, and suggestions: my email is <damiano.fantini(at)gmail(dot)com>. More information about easyPubMed is available at the following URL: www.data-pulse.com.

easyPubMed Copyright (C) 2017-2019 Damiano Fantini. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

!!Note!! If you are using easyPubMed for a scientific publication, please name the package in the Materials and Methods section of the paper. Thanks! Also, I am always open to collaborations. If you have an idea you would like to discuss or develop based on what you read in this Vignette, feel free to contact me via email. Thank you.

SessionInfo

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kableExtra_1.1.0 dplyr_0.8.0.1    easyPubMed_2.12 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1        rstudioapi_0.10   xml2_1.2.0       
##  [4] knitr_1.22        magrittr_1.5      hms_0.4.2        
##  [7] munsell_0.5.0     rvest_0.3.2       tidyselect_0.2.5 
## [10] viridisLite_0.3.0 colorspace_1.4-1  R6_2.4.0         
## [13] rlang_0.3.2       highr_0.8         httr_1.4.0       
## [16] stringr_1.4.0     tools_3.5.2       webshot_0.5.1    
## [19] xfun_0.5          htmltools_0.3.6   yaml_2.2.0       
## [22] assertthat_0.2.1  digest_0.6.18     tibble_2.1.1     
## [25] crayon_1.3.4      purrr_0.3.2       readr_1.3.1      
## [28] glue_1.3.1        evaluate_0.13     rmarkdown_1.12   
## [31] stringi_1.4.3     compiler_3.5.2    pillar_1.3.1     
## [34] scales_1.0.0      pkgconfig_2.0.2