Data Analytics Practicum (Jeremy Gerdes)

Created Problem Statement

  • If an enterprise must support a fixed number of packages in its own miniCRAN-like collection of approved packages, which packages should we require?
  • We could minimize the package selection to:
    • reduce approval cost
    • minimize risk (exposure)
  • We could maximize the package selection to cover:
    • the most popular packages
      • those users are likely to need
      • could measure reverse dependencies (popularity among other authors)
      • Can we predict the popularity of new packages based on the popularity data of existing packages?
    • OR restrict the selection to the minimum needed to perform specific tasks
      • CRAN Task Views provide curated lists of packages grouped by task.
      • Need to know the dependency graph for each required package.
  • Additionally, we could identify the packages most likely to be needed and pre-install them in the default NMCI install.
    • define "acceptable" in terms of targets such as < N GB in size and < T minutes install time.
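
As a rough illustration of the "dependency graph for each required package" point above, the sketch below computes the full set of packages that would have to be approved for a given list of required packages. It uses tools::package_dependencies() from base R. The toy_db matrix and the install_closure() helper are hypothetical names made up for this example; in practice db would come from utils::available.packages().

```r
# Hypothetical stand-in for utils::available.packages(): a small character
# matrix with the same column names, so the sketch runs offline.
toy_db <- rbind(
  c("pkgA", "pkgB", "pkgC (>= 1.0)", NA),
  c("pkgB", NA,     "pkgD",          NA),
  c("pkgC", NA,     NA,              "pkgD"),
  c("pkgD", NA,     NA,              NA),
  c("pkgE", "pkgA", NA,              NA)
)
colnames(toy_db) <- c("Package", "Depends", "Imports", "LinkingTo")

# Return the required packages plus everything they need at install time
# (recursive Depends, Imports and LinkingTo).
install_closure <- function(required, db) {
  deps <- tools::package_dependencies(
    required,
    db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  sort(unique(c(required, unlist(deps, use.names = FALSE))))
}

closure <- install_closure("pkgA", toy_db)
print(closure)
```

The length of the closure is the number of packages that would need approval. pkgE is left out here because nothing in the required set depends on it.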

My Process

install.packages("devtools")
library(devtools)
install_github("seakintruth/evalCranMeta")
  • This package can build reports for any list of CRAN packages
cran.meta.dependancies.graph()
  • Generates reports for all currently installed packages
  • The repo can also gather file data from the installed libraries (14,631 libraries on Windows desktop: CP1)
cran.meta.gather.file.data()
  • File statistics can provide some insight on how others build packages.
    • Installing all packages takes only ~20 GB; the entire mirror is ~220 GB.
    • With a file list for every package, we can perform code review using several packages designed for this, and test for unit tests, function count, and other attributes.
cran.meta.generate.reports()
  • Produces reports and graphics that explain the data and provide insight toward answering the problem statement
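
To make the file-statistics step more concrete, here is a minimal base-R sketch of per-package file counts and sizes. It is not the evalCranMeta implementation; package_file_stats() and the demoPkg directory are hypothetical names used only for this example.

```r
# Summarise the files under one package directory: file count and total size.
package_file_stats <- function(pkg_dir) {
  files <- list.files(pkg_dir, recursive = TRUE, full.names = TRUE)
  sizes <- file.info(files)$size
  data.frame(
    package     = basename(pkg_dir),
    n_files     = length(files),
    total_bytes = sum(sizes),
    stringsAsFactors = FALSE
  )
}

# Demo on a throwaway directory laid out like a tiny package (hypothetical).
demo_dir <- file.path(tempdir(), "demoPkg")
dir.create(file.path(demo_dir, "R"), recursive = TRUE, showWarnings = FALSE)
writeLines("f <- function(x) x + 1", file.path(demo_dir, "R", "f.R"))
writeLines("Package: demoPkg", file.path(demo_dir, "DESCRIPTION"))

stats <- package_file_stats(demo_dir)
print(stats)
```

To cover a whole library, the same helper could be applied to each directory returned by list.dirs(.libPaths()[1], recursive = FALSE) and the rows combined with rbind().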

Results

  1. The sum of Reverse Dependencies and Reverse Imports (RDRI) could indicate how popular a package is among other authors who use or rely on it. (74% have 0 RDRIs)
    • Still need to identify packages that use ‘packrat’ methods to install static versions of other packages, and count those as RDRI when measuring author popularity.
      • packrat dependencies are in {PACKAGE}/packrat/src/
      • or read {PACKAGE}/packrat/packrat.lock
    • Installing all packages that have >=2 RDRI consumes (_???_ GB)
  2. Identify those packages that are licensed with ‘Restricts Use’, and their reverse imports and dependencies. These license files need to be reviewed prior to inclusion.
| Package | License | Title | Reverse.depends | Reverse.imports |
|---|---|---|---|---|
| DATforDCEMRI | CC BY-NC-SA 3.0 | Deconvolution Analysis Tool for Dynamic Contrast Enhanced MRI | NA | NA |
| molaR | ACM | Dental Surface Complexity Measurement Tools | NA | NA |
| spikeSlabGAM | CC BY-NC-SA 4.0 | Bayesian Variable Selection and Model Choice for Generalized Additive Mixed Models | NA | NA |
| rangeBuilder | ACM | Occurrence Filtering, Geographic and Taxonomic Standardization and Generation of Species Range Polygons | NA | NA |
| alphahull | file LICENSE | Generalization of the Convex Hull of a Sample of Points in the Plane | molaR | rangeBuilder, SLICER |
| akima | ACM \| file LICENSE | Interpolation of Irregularly and Regularly Spaced Data | DATforDCEMRI | spikeSlabGAM |
| asypow | ACM \| file LICENSE | Calculate Power Utilizing Asymptotic Likelihood Ratio Methods | NA | NA |
| gpclib | file LICENSE | General Polygon Clipping Library for R | NA | NA |
| momr | Artistic-2.0 | Mining Metaomics Data (MetaOMineR) | NA | NA |
| PredictiveRegression | file LICENSE | Prediction Intervals for Three Basic Statistical Models | NA | NA |
| regtest | file LICENSE | Regression testing | NA | NA |
| rngwell19937 | file LICENSE | Random number generator WELL19937a with 53 or 32 bit output | NA | NA |
| sfdct | CC BY-NC-SA 4.0 | Constrained Triangulation for Simple Features | NA | NA |
| sigQC | file LICENSE | Quality Control Metrics for Gene Signatures | NA | NA |
| tripack | ACM \| file LICENSE | Triangulation of Irregularly Spaced Data | NA | alphahull |
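
Both results above can be sketched with tools::package_dependencies(..., reverse = TRUE). The example counts RDRI per package and then lists the packages that depend on or import a restricted-license package. The toy_db matrix is hypothetical sample data. Using the License_restricts_use column (one of the fields reported by utils::available.packages()) as the "Restricts Use" flag is an assumption about how the table was produced.

```r
# Hypothetical stand-in for utils::available.packages().
toy_db <- rbind(
  c("pkgA", "pkgB", "pkgC", "GPL-3",              NA),
  c("pkgB", NA,     "pkgD", "MIT + file LICENSE", NA),
  c("pkgC", NA,     NA,     "ACM | file LICENSE", "yes"),
  c("pkgD", NA,     NA,     "GPL-2",              NA),
  c("pkgE", "pkgD", "pkgC", "GPL-2",              NA)
)
colnames(toy_db) <- c("Package", "Depends", "Imports",
                      "License", "License_restricts_use")
pkgs <- toy_db[, "Package"]

# Reverse Depends and reverse Imports, one list element per package.
rev_dep <- tools::package_dependencies(pkgs, db = toy_db,
                                       which = "Depends", reverse = TRUE)
rev_imp <- tools::package_dependencies(pkgs, db = toy_db,
                                       which = "Imports", reverse = TRUE)

# RDRI = number of reverse Depends + number of reverse Imports.
rdri <- lengths(rev_dep[pkgs]) + lengths(rev_imp[pkgs])
share_zero <- mean(rdri == 0)

# Packages flagged as restricting use, and everything that builds on them
# (assumption: the flag is the License_restricts_use field).
restricted <- pkgs[which(toy_db[, "License_restricts_use"] == "yes")]
needs_review <- sort(unique(unlist(
  c(rev_dep[restricted], rev_imp[restricted]), use.names = FALSE
)))

print(rdri)
print(share_zero)
print(needs_review)
```

A package that lists the same dependency under both Depends and Imports would be counted twice here, which matches the "sum" wording but may not be the intended measure.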

TODO

  • Identify packages that are no longer maintained: ORPHANED (TRUE=1, FALSE=0)
  • The sum of Reverse Dependencies and Reverse Imports could indicate how popular a package is among other authors who use or rely on it. (74% have 0 RDRIs)
  • Continue munging data
  • machine learning
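
For the ORPHANED item in the list above, one possible encoding is sketched below. It relies on the CRAN convention that orphaned packages carry "ORPHANED" in the Maintainer field. The meta data frame is hypothetical sample data; real metadata with a Maintainer column could come from tools::CRAN_package_db().

```r
# Hypothetical sample of package metadata.
meta <- data.frame(
  Package    = c("pkgA", "pkgB", "pkgC"),
  Maintainer = c("Jane Doe <jane@example.org>",
                 "ORPHANED",
                 "John Roe <john@example.org>"),
  stringsAsFactors = FALSE
)

# Encode the flag as 1/0 so it can be used directly as a model feature.
meta$orphaned <- as.integer(trimws(meta$Maintainer) == "ORPHANED")
print(meta[, c("Package", "orphaned")])
```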

Course Resources (v3):

Data Management

A good Data Management plan will account for:

  • repeatability
  • auditability
  • sharing

https://science.energy.gov/funding-opportunities/digital-data-management/suggested-elements-for-a-dmp/

Resources:

  • DAMA-DMBOK Model and Basic Data Management in R
  • Read the PDF of V2 here on dama.org:
