R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Repeated measures can improve estimation when we only care about a single endpoint

Tue, 10/12/2019 - 01:00

[This article was first published on ouR data generation, and kindly contributed to R-bloggers.]

I’ve been participating in the design of a new study that will evaluate interventions aimed at reducing both pain and opioid use for patients on dialysis. This study is likely to be somewhat complicated, involving multiple clusters, three interventions, a sequential and adaptive randomization scheme, and a composite binary outcome. I’m not going into any of that here.

There was one issue that came up that should be fairly generalizable to other studies. In this case, individual measures will be collected repeatedly over time, but the primary outcome of interest will be the measure collected during the last follow-up period. I wanted to explore what, if anything, can be gained by analyzing all of the available data rather than focusing only on the final endpoint.

Data generation

In this simulation scenario, there will be 200 subjects randomized at the individual level to one of two treatment arms, intervention (\(rx = 1\)) and control (\(rx = 0\)). Each person will be followed for 5 months, with a binary outcome measure collected at the end of each month. In the data, period 0 is the first month, and period 4 is the final month.

library(simstudy)

set.seed(281726)

dx <- genData(200)
dx <- trtAssign(dx, grpName = "rx")
dx <- addPeriods(dx, nPeriods = 5)

Here are the data for a single individual:

dx[id == 142]

##     id period rx timeID
## 1: 142      0  1    706
## 2: 142      1  1    707
## 3: 142      2  1    708
## 4: 142      3  1    709
## 5: 142      4  1    710

The probabilities of the five binary outcomes for each individual are a function of time and intervention status.

defP <- defDataAdd(varname = "p", formula = "-2 + 0.2*period + 0.5*rx",
                   dist = "nonrandom", link = "logit")

dx <- addColumns(defP, dx)

The outcomes for a particular individual are correlated, with outcomes in two adjacent periods more highly correlated than outcomes collected further apart. (I use an auto-regressive correlation structure to generate these data.)

dx <- addCorGen(dtOld = dx, idvar = "id", nvars = 5, rho = 0.6, corstr = "ar1",
                dist = "binary", param1 = "p", method = "ep",
                formSpec = "-2 + 0.2*period + 0.5*rx", cnames = "y")

dx[id == 142]

##     id period rx timeID    p y
## 1: 142      0  1    706 0.18 0
## 2: 142      1  1    707 0.21 0
## 3: 142      2  1    708 0.25 1
## 4: 142      3  1    709 0.29 0
## 5: 142      4  1    710 0.33 0

In the real world, there will be loss to follow-up – not everyone will be observed until the end. In the first case, I will be assuming the data are missing completely at random (MCAR), where missingness is independent of all observed and unobserved variables. (I have mused on missingness before.)

MCAR <- defMiss(varname = "y", formula = "-2.6",
                logit.link = TRUE, monotonic = TRUE)

dm <- genMiss(dx, MCAR, "id", repeated = TRUE, periodvar = "period")
dObs <- genObs(dx, dm, idvars = "id")

dObs[id == 142]

##     id period rx timeID    p  y
## 1: 142      0  1    706 0.18  0
## 2: 142      1  1    707 0.21  0
## 3: 142      2  1    708 0.25  1
## 4: 142      3  1    709 0.29 NA
## 5: 142      4  1    710 0.33 NA

In this data set, only about 70% of the total sample is still observed in the final period – and by chance the dropout differs between the two treatment arms:

dObs[period == 4, .(prop.missing = mean(is.na(y))), keyby = rx]

##    rx prop.missing
## 1:  0         0.28
## 2:  1         0.38

Estimating the intervention effect

If we are really only interested in the probability of a successful outcome in the final period, we could go ahead and estimate the treatment effect with a simple logistic regression fit to the individuals who were still observed at the end of the study. The true value is 0.5 (on the log-odds scale), and the estimate here is close to 1.0 with a standard error just under 0.4:

fit.l <- glm(y ~ rx, data = dObs[period == 4], family = binomial)
coef(summary(fit.l))

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -1.25       0.28    -4.4  9.9e-06
## rx              0.99       0.38     2.6  9.3e-03

But can we do better? Fitting a longitudinal model might provide a more stable and possibly less biased estimate, particularly if the specified model is the correct one. In this case, I suspect it will be an improvement, since the data were generated by a process that is amenable to a GEE (generalized estimating equation) model.

library(geepack)

fit.m <- geeglm(y ~ period + rx, id = id, family = binomial,
                data = dObs, corstr = "ar1")
coef(summary(fit.m))

##             Estimate Std.err Wald Pr(>|W|)
## (Intercept)    -2.33   0.259   81  0.00000
## period          0.30   0.072   17  0.00003
## rx              0.83   0.263   10  0.00152

And finally, it is reasonable to expect that a model based on a data set without any missing values would provide the most efficient estimate. That does seem to be the case if we look at the standard error of the effect estimate.

fit.f <- geeglm(y ~ period + rx, id = id, family = binomial,
                data = dx, corstr = "ar1")
coef(summary(fit.f))

##             Estimate Std.err Wald Pr(>|W|)
## (Intercept)    -2.15   0.227 89.2  0.0e+00
## period          0.30   0.062 23.1  1.5e-06
## rx              0.54   0.233  5.4  2.1e-02

Of course, we can’t really learn much of anything from a single simulated data set. Below is a plot of the mean estimate under each modeling scenario (along with the blue line that represents \(\pm 2\) sd) based on 2500 simulated data sets with missingness completely at random. (The code for these replications is included in the addendum.)

It is readily apparent that under an assumption of MCAR, all estimation models yield unbiased estimates (the true effect size is 0.5), though using the last period only is inherently more variable (given that there are fewer observations to work with).
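The figure itself is not reproduced in this text version. As a rough sketch of my own (not from the original post) of how those summaries could be computed and plotted from the replications generated in the addendum below – assuming the resMCAR data.table created there – one could do something like:

library(data.table)
library(ggplot2)

# reshape the 2500 replicated effect estimates into long format
plotMCAR <- melt(resMCAR, measure.vars = c("full", "miss", "last"),
                 variable.name = "model", value.name = "estimate")

# mean estimate and +/- 2 sd for each estimation approach
sumMCAR <- plotMCAR[, .(mu = mean(estimate), sd = sd(estimate)), keyby = model]

ggplot(sumMCAR, aes(x = model, y = mu)) +
  geom_errorbar(aes(ymin = mu - 2 * sd, ymax = mu + 2 * sd),
                width = 0.1, color = "blue") +
  geom_point(size = 2) +
  geom_hline(yintercept = 0.5, lty = 2) +  # true effect size
  labs(x = NULL, y = "estimated rx effect (log odds)")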

Missing at random

When the data are MAR (missing at random), using the last period only no longer provides an unbiased estimate of the effect size. In this case, the probability of missingness is a function of time, intervention status, and the outcome from the prior period, all of which are observed. This is how I’ve defined the MAR process:

MAR <- defMiss(varname = "y", formula = "-2.9 + 0.2*period - 2*rx*LAG(y)",
               logit.link = TRUE, monotonic = TRUE)

The mean plots based on 2500 iterations reveal the bias of the last-period-only approach. It is interesting to see that the GEE model is not biased, because we have captured all of the relevant covariates in the model. (It is well known that likelihood methods can yield unbiased estimates under MAR, and while GEE is not technically a likelihood method, it is a quasi-likelihood method.)

Missing not at random

When missingness depends on unobserved data, such as the outcome itself, then GEE estimates are also biased. For the last set of simulations, I defined missingness of \(y\) in any particular time period to be a function of itself. Specifically, if the outcome was successful and the subject was in the intervention, the subject would be more likely to be observed:

NMAR <- defMiss(varname = "y", formula = "-2.9 + 0.2*period - 2*rx*y",
                logit.link = TRUE, monotonic = TRUE)

Under the assumption of missingness not at random (NMAR), both estimation approaches based on the observed data set with missing values yield a biased estimate, though using all of the data appears to reduce the bias somewhat:

Addendum: generating replications

iter <- function(n, np, defM) {

  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx")
  dx <- addPeriods(dx, nPeriods = np)

  defP <- defDataAdd(varname = "p", formula = "-2 + 0.2*period + .5*rx",
                     dist = "nonrandom", link = "logit")

  dx <- addColumns(defP, dx)

  dx <- addCorGen(dtOld = dx, idvar = "id", nvars = np, rho = .6, corstr = "ar1",
                  dist = "binary", param1 = "p", method = "ep",
                  formSpec = "-2 + 0.2*period + .5*rx", cnames = "y")

  dm <- genMiss(dx, defM, "id", repeated = TRUE, periodvar = "period")
  dObs <- genObs(dx, dm, idvars = "id")

  fit.f <- geeglm(y ~ period + rx, id = id, family = binomial,
                  data = dx, corstr = "ar1")

  fit.m <- geeglm(y ~ period + rx, id = id, family = binomial,
                  data = dObs, corstr = "ar1")

  fit.l <- glm(y ~ rx, data = dObs[period == (np - 1)], family = binomial)

  return(data.table(full = coef(fit.f)["rx"],
                    miss = coef(fit.m)["rx"],
                    last = coef(fit.l)["rx"]))
}

## defM

MCAR <- defMiss(varname = "y", formula = "-2.6",
                logit.link = TRUE, monotonic = TRUE)

MAR <- defMiss(varname = "y", formula = "-2.9 + 0.2*period - 2*rx*LAG(y)",
               logit.link = TRUE, monotonic = TRUE)

NMAR <- defMiss(varname = "y", formula = "-2.9 + 0.2*period - 2*rx*y",
                logit.link = TRUE, monotonic = TRUE)

##

library(parallel)

niter <- 2500

resMCAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, MCAR)))
resMAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, MAR)))
resNMAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, NMAR)))

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

Reproduce economic indicators from ‘The Economist’

Tue, 10/12/2019 - 00:00

[This article was first published on Macroeconomic Observatory - R, and kindly contributed to R-bloggers.]

Economic data (% change on year ago)

Country | GDP latest | GDP quarter* | GDP 2019† | GDP 2020† | Ind. prod. latest | CPI latest | CPI 2019† | Unemployment rate, % latest
United States | +2.1 Q3 | +2.1 | +2.4 | +2.1 | -1.1 Oct | +1.8 Oct | +1.8 | +3.6 Oct
China | +0.1 Q3 | +0.2 | +6.1 | +5.8 | +5.4 Nov‡ | +3.8 Oct | +2.3 | +3.6 Q3
Japan | +1.4 Q3 | +0.2 | +0.9 | +0.5 | -6.0 Oct | +0.2 Oct | +1.0 | +2.4 Oct
Britain | +1.0 Q3 | +1.2 | +1.2 | +1.4 | -1.4 Sep | +1.5 Oct | +1.8 | +3.8 Aug
Canada | +1.7 Q3 | +1.3 | +1.5 | +1.8 | -1.8 Sep | +1.9 Oct | +2.0 | +5.5 Oct
Euro area | +1.2 Q3 | +0.9 | +1.2 | +1.4 | -1.9 Sep | +0.7 Oct | +1.2 | +7.5 Oct
Austria | +1.5 Q3 | +0.5 | +1.6 | +1.7 | +0.3 Aug | +1.1 Oct | +1.5 | +4.6 Oct
Belgium | +1.6 Q3 | +1.7 | +1.2 | +1.3 | +6.0 Sep | +0.4 Nov | +1.5 | +5.6 Oct
France | +1.4 Q3 | +1.1 | +1.2 | +1.3 | +0.1 Sep | +0.8 Oct | +1.2 | +8.5 Oct
Germany | +0.5 Q3 | +0.3 | +0.5 | +1.2 | -5.1 Sep | +1.1 Oct | +1.5 | +3.1 Oct
Greece | +1.9 Q2 | +3.4 | +2.0 | +2.2 | +0.8 Sep | -0.7 Oct | +0.6 | +16.7 Aug
Italy | +0.3 Q3 | +0.2 | +0.0 | +0.5 | -2.1 Sep | +0.2 Oct | +0.7 | +9.7 Oct
Netherlands | +1.8 Q3 | +1.8 | +1.8 | +1.6 | +0.3 Sep | +2.7 Oct | +2.5 | +3.5 Oct
Spain | +2.0 Q3 | +1.7 | +2.2 | +1.8 | +0.8 Sep | +0.1 Oct | +0.7 | +14.2 Oct
Czech Republic | +2.5 Q3 | +1.5 | +2.5 | +2.6 | +0.1 Aug | +2.7 Oct | +2.6 | +2.2 Oct
Denmark | +2.2 Q3 | +1.3 | +1.7 | +1.9 | +4.2 Sep | +0.6 Oct | +1.3 | +5.3 Oct
Norway | +0.6 Q3 | +0.1 | +1.9 | +2.4 | -8.0 Sep | +1.8 Oct | +2.3 | +3.9 Sep
Poland | +4.0 Q3 | +5.3 | +4.0 | +3.1 | +3.5 Oct | +2.5 Oct | +2.4 | +3.2 Oct
Russia | +0.8 Q2 | +0.6 | +1.1 | +1.9 | +2.6 Sep | +3.8 Oct | +4.7 | +4.6 Q3
Sweden | +1.7 Q3 | +1.1 | +0.9 | +1.5 | +1.6 Sep | +1.6 Oct | +1.7 | +6.6 Oct
Switzerland | +1.0 Q3 | +1.6 | +0.8 | +1.3 | +5.4 Q4‡ | -0.3 Oct | +0.6 | +2.4 May
Turkey | +0.5 Q3 | +1.7 | +0.2 | +3.0 | +2.8 Sep | +8.6 Oct | +15.7 | +14.2 Aug
Australia | +1.7 Q3 | +1.8 | +1.7 | +2.3 | +2.7 Q3 | +1.7 Q3 | +1.6 | +5.3 Oct
Hong Kong | +0.5 Q2 | -1.7 | +0.3 | +1.5 | +0.4 Q2 | +3.3 Sep | +3.0 | +2.8 Q1
India | +4.7 Q3 | +4.3 | +6.1 | +7.0 | +2.6 Dec‡ | +7.6 Oct | +3.4 | +2.6 Year‡
Indonesia | +5.1 Q3 | +5.0 | +5.0 | +5.1 | -3.7 Apr | +3.0 Nov | +3.2 | +4.6 Q3‡
Malaysia | +4.7 Q4‡ | +14.7 | +4.5 | +4.4 | +3.1 Mar | +1.5 Aug | +1.0 | +3.3 Q1
Pakistan | +5.5 Year‡ | NA | +3.3 | +2.4 | -7.0 Aug | +8.2 Feb | +7.3 | +4.4 Q2‡
Philippines | +6.1 Q4‡ | +6.4 | +5.7 | +6.2 | -8.2 Jul | +3.3 Mar | +2.5 | +2.2 Q4‡
Singapore | +3.9 Q2‡ | +7.8 | +0.5 | +1.0 | +0.1 Sep | +0.4 Oct | +0.7 | +3.0 Q1
South Korea | +2.0 Q3 | +1.6 | +2.0 | +2.2 | -2.6 Oct | -0.4 Sep | +0.5 | +3.5 Oct
Taiwan | +2.9 Q3 | NA | +2.0 | +1.9 | NA | +0.4 Q3 | +0.8 | +3.7 Q2
Thailand | +2.3 Q2 | +2.4 | +2.9 | +3.0 | -1.2 Q1 | +0.1 Oct | +0.9 | +0.7 Q4‡
Argentina | -0.0 Q2 | -0.0 | -3.1 | -1.3 | +4.4 Q3§ | +50.5 Oct | +54.4 | +9.8 Q1
Brazil | +1.2 Q3 | +2.5 | +0.9 | +2.0 | +0.6 Sep | +2.5 Oct | +3.8 | +8.0 Nov
Chile | +2.8 Q3 | +3.0 | +2.5 | +3.0 | -3.7 Oct | +2.7 Oct | +2.2 | +6.9 Aug
Colombia | +3.3 Q3 | +2.3 | +3.4 | +3.6 | -1.1 Dec‡ | +3.9 Oct | +3.6 | +10.7 Sep
Mexico | -0.2 Q3 | +0.1 | +0.4 | +1.3 | -2.9 Jun | +3.0 Oct | +3.8 | +3.6 Oct
Peru | +2.1 Q1§ | -16.9 | +2.6 | +3.6 | +20.3 Apr‡ | +2.2 Mar | +2.2 | +6.2 Q2
Egypt | +5.3 Year‡ | NA | +5.5 | +5.9 | +6.2 Mar‡ | +15.7 Nov‡ | +13.9 | +11.8 Q4§
Israel | +3.3 Q3 | +4.1 | +3.1 | +3.1 | +4.4 Sep | +0.4 Oct | +1.0 | +3.7 Sep
Saudi Arabia | +0.5 Q2 | -10.4 | +0.2 | +2.2 | +1.6 Q3§ | -0.3 Oct | -1.1 | +6.0 Year‡
South Africa | +1.0 Q2 | +3.1 | +0.7 | +1.1 | +1.3 Aug‡ | +3.7 Oct | +4.4 | +28.8 Q3

Source: DBnomics (Eurostat, ILO, IMF, OECD and national sources). Click on the figures in the `latest` columns to see the full time series.
* % change on previous quarter, annual rate   † IMF estimation/forecast   ‡ 2018   § 2017

The aim of this blog post is to reproduce part of the economic indicators table from ‘The Economist’ using only free tools. We take data directly from DBnomics. The DBnomics API can be accessed through R with the rdbnomics package. All the following code is written in R, thanks to the R Core Team (2016) and the RStudio Team (2016). To update the table, just download the code here and re-run it.

if (!"pacman" %in% installed.packages()[,"Package"]) install.packages("pacman", repos='http://cran.r-project.org')
pacman::p_load(tidyverse,rdbnomics,magrittr,zoo,lubridate,knitr,kableExtra,formattable)

opts_chunk$set(fig.align="center", message=FALSE, warning=FALSE)

currentyear <- year(Sys.Date())
lastyear <- currentyear-1
beforelastyear <- currentyear-2

CountryList <- c("United States","China","Japan","Britain","Canada", "Euro area","Austria","Belgium","France","Germany","Greece","Italy","Netherlands","Spain", "Czech Republic","Denmark","Norway","Poland","Russia","Sweden","Switzerland","Turkey", "Australia","Hong Kong","India","Indonesia","Malaysia","Pakistan","Philippines","Singapore","South Korea","Taiwan","Thailand", "Argentina","Brazil","Chile","Colombia","Mexico","Peru", "Egypt","Israel","Saudi Arabia","South Africa")

Download

gdp <- rdb("OECD","MEI",ids=".NAEXKP01.GPSA+GYSA.Q")

hongkong_philippines_thailand_gdp <- rdb("IMF","IFS",mask="Q.HK+PH+TH.NGDP_R_PC_CP_A_SA_PT+NGDP_R_PC_PP_SA_PT") %>% rename(Country=`Reference Area`) %>% mutate(Country=case_when(Country=="Hong Kong, China" ~ "Hong Kong", TRUE ~ Country), MEASURE=case_when(INDICATOR=="NGDP_R_PC_CP_A_SA_PT" ~ "GYSA", INDICATOR=="NGDP_R_PC_PP_SA_PT" ~ "GPSA"))

malaysia_peru_saudi_singapore_gdp <- rdb("IMF","IFS",mask="Q.MY+PE+SA+SG.NGDP_R_PC_CP_A_PT+NGDP_R_PC_PP_PT") %>% rename(Country=`Reference Area`) %>% mutate(MEASURE=case_when(INDICATOR=="NGDP_R_PC_CP_A_PT" ~ "GYSA", INDICATOR=="NGDP_R_PC_PP_PT" ~ "GPSA"))

taiwan_gdp <- rdb("BI/TABEL9_1/17.Q") %>% mutate(Country="Taiwan", MEASURE="GYSA")

egypt_pakistan_gdp <- rdb("IMF","WEO",mask="EGY+PAK.NGDP_RPCH") %>% rename(Country=`WEO Country`) %>% mutate(MEASURE="GYSA") %>% filter(year(period)<currentyear)

china_gdp_level <- rdb(ids="OECD/MEI/CHN.NAEXCP01.STSA.Q")
gdp_qoq_china <- china_gdp_level %>% arrange(period) %>% mutate(value=value/lag(value)-1, MEASURE="GPSA")
gdp_yoy_china <- china_gdp_level %>% arrange(period) %>% mutate(quarter=quarter(period)) %>% group_by(quarter) %>% mutate(value=value/lag(value)-1, MEASURE="GYSA")

argentina_gdp_level <- rdb(ids="Eurostat/naidq_10_gdp/Q.SCA.KP_I10.B1GQ.AR") %>% rename(Country=`Geopolitical entity (reporting)`)
gdp_qoq_argentina <- argentina_gdp_level %>% arrange(period) %>% mutate(value=value/lag(value)-1, MEASURE="GPSA")
gdp_yoy_argentina <- argentina_gdp_level %>% arrange(period) %>% mutate(quarter=quarter(period)) %>% group_by(quarter) %>% mutate(value=value/lag(value)-1, MEASURE="GYSA")

gdp <- bind_rows(gdp,hongkong_philippines_thailand_gdp,malaysia_peru_saudi_singapore_gdp,taiwan_gdp,egypt_pakistan_gdp,gdp_yoy_china,gdp_qoq_china,gdp_yoy_argentina,gdp_qoq_argentina)

indprod <- rdb("OECD","MEI",ids=".PRINTO01.GYSA.M")
australia_swiss_indprod <- rdb("OECD","MEI","AUS+CHE.PRINTO01.GYSA.Q")
china_egypt_mexico_malaysia_indprod <- rdb("IMF","IFS",mask="M.CN+EG+MX+MY.AIP_PC_CP_A_PT") %>% rename(Country=`Reference Area`)
indonesia_pakistan_peru_philippines_singapore_southafrica_indprod <- rdb("IMF","IFS",mask="M.ID+PK+PE+PH+SG+ZA.AIPMA_PC_CP_A_PT") %>% rename(Country=`Reference Area`)
argentina_hongkong_saudiarabia_thailand_indprod <- rdb("IMF","IFS",mask="Q.AR+HK+SA+TH.AIPMA_PC_CP_A_PT") %>% rename(Country=`Reference Area`) %>% mutate(Country=case_when(Country=="Hong Kong, China" ~ "Hong Kong", TRUE ~ Country))

indprod <- bind_rows(indprod,australia_swiss_indprod,china_egypt_mexico_malaysia_indprod,indonesia_pakistan_peru_philippines_singapore_southafrica_indprod,argentina_hongkong_saudiarabia_thailand_indprod)

cpi <- rdb("OECD","MEI",ids=".CPALTT01.GY.M")
australia_cpi <- rdb("OECD","MEI",ids="AUS.CPALTT01.GY.Q")
taiwan_cpi <- rdb("BI/TABEL9_2/17.Q") %>% mutate(Country="Taiwan")
other_cpi <- rdb("IMF","IFS",mask="M.EG+HK+MY+PE+PH+PK+SG+TH.PCPI_PC_CP_A_PT") %>% rename(Country=`Reference Area`) %>% mutate(Country=case_when(Country=="Hong Kong, China" ~ "Hong Kong", TRUE ~ Country))

cpi <- bind_rows(cpi,australia_cpi,taiwan_cpi,other_cpi)

unemp <- rdb("OECD","MEI",ids=".LRHUTTTT.STSA.M")
swiss_unemp <- rdb("OECD","MEI",mask="CHE.LMUNRRTT.STSA.M")
brazil_unemp <- rdb("OECD","MEI",mask="BRA.LRUNTTTT.STSA.M")
southafrica_russia_unemp <- rdb("OECD","MEI",mask="ZAF+RUS.LRUNTTTT.STSA.Q")
china_unemp <- rdb(ids="BUBA/BBXL3/Q.CN.N.UNEH.TOTAL0.NAT.URAR.RAT.I00") %>% mutate(Country="China")
saudiarabia_unemp <- rdb(ids="ILO/UNE_DEAP_SEX_AGE_RT/SAU.BA_627.AGE_AGGREGATE_TOTAL.SEX_T.A") %>% rename(Country=`Reference area`) %>% filter(year(period)<currentyear)
india_unemp <- rdb(ids="ILO/UNE_2EAP_NOC_RT/IND.XA_1976.A") %>% rename(Country=`Reference area`) %>% filter(year(period)<currentyear)
indonesia_pakistan_unemp <- rdb("ILO","UNE_DEAP_SEX_AGE_EDU_RT",mask="IDN+PAK..AGE_AGGREGATE_TOTAL.EDU_AGGREGATE_TOTAL.SEX_T.Q") %>% rename(Country=`Reference area`)
other_unemp <- rdb("ILO","UNE_DEA1_SEX_AGE_RT",mask="ARG+EGY+HKG+MYS+PER+PHL+SGP+THA+TWN..AGE_YTHADULT_YGE15.SEX_T.Q") %>% rename(Country=`Reference area`) %>% mutate(Country=case_when(Country=="Hong Kong, China" ~ "Hong Kong", Country=="Taiwan, China" ~ "Taiwan", TRUE ~ Country))

unemp <- bind_rows(unemp,brazil_unemp,southafrica_russia_unemp,swiss_unemp,china_unemp,saudiarabia_unemp,india_unemp,indonesia_pakistan_unemp,other_unemp)

forecast_gdp_cpi_ea <- rdb("IMF","WEOAGG",mask="163.NGDP_RPCH+PCPIPCH") %>% rename(`WEO Country`=`WEO Countries group`)

forecast_gdp_cpi <- rdb("IMF","WEO",mask=".NGDP_RPCH+PCPIPCH") %>% bind_rows(forecast_gdp_cpi_ea) %>% transmute(Country=`WEO Country`, var=`WEO Subject`, value, period) %>% mutate(Country=str_trim(Country), var=str_trim(var)) %>% mutate(Country=case_when(Country=="United Kingdom" ~ "Britain", Country=="Hong Kong SAR" ~ "Hong Kong", Country=="Korea" ~ "South Korea", Country=="Taiwan Province of China" ~ "Taiwan", TRUE ~ Country), var=case_when(var=="Gross domestic product, constant prices - Percent change" ~ "GDP", var=="Inflation, average consumer prices - Percent change" ~ "CPI", TRUE ~ var))

forecast_gdp_cpi <- left_join(data.frame(Country=CountryList),forecast_gdp_cpi,by="Country")

Transform

gdp_yoy_latest_period <- gdp %>% filter(MEASURE=="GYSA") %>% filter(!is.na(value)) %>% group_by(Country) %>% summarise(period=max(period))
gdp_yoy_latest <- gdp %>% filter(MEASURE=="GYSA") %>% inner_join(gdp_yoy_latest_period) %>% mutate(var="GDP",measure="latest")

gdp_qoq_latest_period <- gdp %>% filter(MEASURE=="GPSA") %>% filter(!is.na(value)) %>% group_by(Country) %>% summarise(period=max(period))
gdp_qoq_latest <- gdp %>% filter(MEASURE=="GPSA") %>% inner_join(gdp_qoq_latest_period) %>% mutate(value=((1+value/100)^4-1)*100, var="GDP", measure="quarter")

gdp_2019_2020 <- forecast_gdp_cpi %>% filter(var=="GDP" & (period=="2019-01-01" | period=="2020-01-01")) %>% mutate(measure=as.character(year(period)))

indprod_latest_period <- indprod %>% filter(!is.na(value)) %>% group_by(Country) %>% summarise(period=max(period))
indprod_latest <- indprod %>% inner_join(indprod_latest_period) %>% mutate(var="indprod",measure="latest")

cpi_latest_period <- cpi %>% filter(!is.na(value)) %>% group_by(Country) %>% summarise(period=max(period))
cpi_latest <- cpi %>% inner_join(cpi_latest_period) %>% mutate(var="CPI",measure="latest")
cpi_2019 <- forecast_gdp_cpi %>% filter(var=="CPI" & period=="2019-01-01") %>% mutate(measure="2019")

unemp_latest_period <- unemp %>% filter(!is.na(value)) %>% group_by(Country) %>% summarise(period=max(period))
unemp_latest <- unemp %>% inner_join(unemp_latest_period) %>% mutate(var="unemp",measure="latest")

Merge

df_all <- bind_rows(gdp_yoy_latest,gdp_qoq_latest,gdp_2019_2020,indprod_latest,cpi_latest,cpi_2019,unemp_latest) %>% mutate(value=ifelse(value>=0, paste0("+",sprintf("%.1f",round(value,1))), sprintf("%.1f",round(value,1)))) %>% unite(measure,c(var,measure))

df_latest <- df_all %>% filter(measure %in% c("GDP_latest","indprod_latest","CPI_latest","unemp_latest")) %>% mutate(value=case_when(`@frequency`=="quarterly" ~ paste(value," Q",quarter(period),sep=""), `@frequency`=="monthly" ~ paste(value," ",month(period,label = TRUE, abbr = TRUE, locale = "en_US.utf8"),sep=""), `@frequency`=="annual" ~ paste(value," Year",sep=""), TRUE ~ value)) %>% mutate(value=text_spec(ifelse(year(period)==lastyear,paste0(value,footnote_marker_symbol(3)), ifelse(year(period)==beforelastyear,paste0(value,footnote_marker_symbol(4)),value)), link = paste("https://db.nomics.world",provider_code,dataset_code,series_code,sep = "/"), color = "#333333",escape = F,extra_css="text-decoration:none"))

df_final <- df_all %>% filter(measure %in% c("GDP_quarter","GDP_2019","GDP_2020","CPI_2019")) %>% bind_rows(df_latest) %>% mutate(Country=case_when(Country=="United Kingdom" ~ "Britain", Country=="Euro area (19 countries)" ~ "Euro area", Country=="China (People's Republic of)" ~ "China", Country=="Korea" ~ "South Korea", TRUE ~ Country)) %>% select(Country,value,measure) %>% spread(measure,value) %>% select(Country,GDP_latest,GDP_quarter,GDP_2019,GDP_2020,indprod_latest,CPI_latest,CPI_2019,unemp_latest)

df_final <- left_join(data.frame(Country=CountryList),df_final,by="Country")

Display

names(df_final)[1] <- ""
names(df_final)[2] <- "latest"
names(df_final)[3] <- paste0("quarter",footnote_marker_symbol(1))
names(df_final)[4] <- paste0("2019",footnote_marker_symbol(2))
names(df_final)[5] <- paste0("2020",footnote_marker_symbol(2))
names(df_final)[6] <- "latest"
names(df_final)[7] <- "latest"
names(df_final)[8] <- paste0("2019",footnote_marker_symbol(2))
names(df_final)[9] <- "latest"

df_final %>% kable(row.names = F,escape = F,align = c("l",rep("c",8)),caption = "Economic data (% change on year ago)") %>% kable_styling(bootstrap_options = c("striped", "hover","responsive"), fixed_thead = T, font_size = 13) %>% add_header_above(c(" " = 1, "Gross domestic product" = 4, "Industrial production " = 1, "Consumer prices"= 2, "Unemployment rate, %"=1)) %>% column_spec(1, bold = T) %>% row_spec(seq(1,nrow(df_final),by=2), background = "#D5E4EB") %>% row_spec(c(5,14,22,33,39),extra_css = "border-bottom: 1.2px solid") %>% footnote(general = "DBnomics (Eurostat, ILO, IMF, OECD and national sources). Click on the figures in the `latest` columns to see the full time series.", general_title = "Source: ", footnote_as_chunk = T, symbol = c("% change on previous quarter, annual rate ", "IMF estimation/forecast", paste0(lastyear),paste0(beforelastyear)))

Bibliography

R Core Team.
R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria, 2016.
URL: https://www.R-project.org.

RStudio Team.
RStudio: Integrated Development Environment for R.
RStudio, Inc., Boston, MA, 2016.
URL: http://www.rstudio.com/.


To leave a comment for the author, please follow the link and comment on their blog: Macroeconomic Observatory - R.

5 Data Science Technologies for 2020 (and Beyond)

Mon, 09/12/2019 - 07:30

[This article was first published on business-science.io, and kindly contributed to R-bloggers.]

Moving into 2020, three things are clear – organizations want Data Science, Cloud, and Apps. Here are the top 5 essential skills for Data Scientists who need to build and deploy applications in 2020 and beyond.

This is part of a series of articles on Data Science key skills for 2020:

Top 20 Tech Skills 2014-2019

Indeed, the popular employment-related search engine, released an article showing changing trends from 2015 to 2019 in “Technology-Related Job Postings,” examining the 5-year change in the most requested technology skills.

Top 20 Tech Skills 2014-2019
Source: Indeed Hiring Lab.

I’m generally not a big fan of these reports because the technology landscape changes so quickly. But I was pleasantly surprised at the time span of the analysis – Indeed looked at changes over a 5-year period, which gives a much better sense of the long-term trends.

Why No R, Shiny, Tableau, PowerBI, or Alteryx?

The skills reported are not “Data Science”-specific (which is why you don’t see R, Tableau, PowerBI, or Alteryx on the list).

However, we can glean insights based on the technologies present…

Cloud, Machine Learning, Apps Driving Growth

From the technology growth, it’s clear that Businesses need Cloud + ML + Apps.

Technologies Driving Tech Skill Growth

My Takeaway

This assessment has led me to my key technologies for Data Scientists heading into 2020. I focus on key technologies related to Cloud + ML + Apps.

Top 5 Data Science Technologies for Cloud + ML + Apps

These are the technologies Data Scientists should learn for 2020 and beyond – they are geared towards the business demands: Cloud + ML + Apps. In other words, businesses need data-science- and machine-learning-powered web applications deployed into production via the cloud.

Here’s what you need to learn to build ML-Powered Web Applications and deploy in the Cloud.

*Note that R and Python are skills that you should be learning before you jump into these.

5 Key Data Science Technologies for Cloud + Machine Learning + Applications

1. AWS Cloud Services

AWS is the most popular cloud service provider. EC2 is a staple for hosting apps, running Jupyter/RStudio in the cloud, and leveraging cloud resources rather than investing in expensive computers and servers.

AWS Resource: AWS for Data Science Apps – 14% Share, 400% Growth

2. Shiny Web Apps

A comprehensive web framework designed for data scientists with a rich ecosystem of extension libraries (dubbed the “shinyverse”).

Shiny Resource (Coming Soon): Shiny Data Science Web Applications
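To give a sense of what this looks like, here is a minimal sketch of my own (not taken from the article) – a Shiny app is just a UI definition plus a server function:

library(shiny)

# minimal app: a slider that controls the number of histogram bins
ui <- fluidPage(
  titlePanel("Histogram demo"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(mtcars$mpg, breaks = input$bins, col = "steelblue", main = NULL)
  })
}

shinyApp(ui, server)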

3. H2O Machine Learning

H2O is an automated machine learning library available in Python and R. It works well on structured data (the format of 95% of business problems), and the automation drastically increases productivity in machine learning.

H2O Resource (Coming Soon): H2O Automated Machine Learning (AutoML)
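As a rough sketch of the R interface (my own illustration – the dataset and time budget are placeholders, not recommendations from the article):

library(h2o)

h2o.init()               # start a local H2O cluster

train <- as.h2o(mtcars)  # push an R data frame into H2O

# run AutoML for up to 60 seconds, predicting mpg from all other columns
aml <- h2o.automl(y = "mpg", training_frame = train, max_runtime_secs = 60)

aml@leaderboard                 # ranked models
h2o.predict(aml@leader, train)  # predictions from the best model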

4. Docker for Web Apps

Creating docker environments drastically reduces the risk of software incompatibility in production. DockerHub makes it easy to share your environment with other Data Scientists or DevOps. Further, Docker and DockerHub make it easy to deploy applications into production.

Docker Resource: Docker for Data Science Apps – 4000% Growth

5. Git Version Control

Git and GitHub are staples for reproducible research and web application development. Git tracks past versions and enables software upgrades to be performed on branches. GitHub makes it easy to share your research and/or web applications with other Data Scientists, DevOps, or Data Engineering. Further, Git and GitHub make it easy to deploy changes to apps in production.

Git Resource (Coming Soon): Git Version Control for Data Science Apps

Other Technologies Worth Mentioning
  1. dbplyr for SQL – For data scientists who need to create complex SQL queries but don’t have time to deal with messy SQL, dbplyr is a massive productivity booster. It converts R (dplyr) code to SQL and can be used for 95% of SQL queries (a short sketch follows this list).

  2. Bootstrap – For data scientists that build apps, Bootstrap is a Front-End web framework that Shiny is built on top of and it powers much of the web (e.g. Twitter’s app). Bootstrap makes it easy to control the User Interface (UI) of your application.

  3. MongoDB – For data scientists that build apps, MongoDB is a NoSQL database that is useful for storing complex user information of your application in one table. Much easier than creating a multi-table SQL database.
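A minimal sketch of the dbplyr workflow mentioned in the first item (my own example, using an in-memory SQLite database purely for illustration):

library(dplyr)
library(dbplyr)

# toy database: copy mtcars into an in-memory SQLite connection
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "mtcars")

# write dplyr, let dbplyr translate it to SQL
query <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

show_query(query)  # inspect the generated SQL
collect(query)     # execute and pull the results back into R

DBI::dbDisconnect(con)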

Real Shiny App + AWS + Docker Case Example

In my Shiny Developer with AWS Course (NEW), you use the following application architecture, which relies on AWS EC2 to create an Ubuntu Linux server that hosts a Shiny app called Stock Analyzer in the cloud.

Data Science Web Application Architecture

From Shiny Developer with AWS Course

You use AWS EC2 to build a server to run your Stock Analyzer application along with several other web apps.

AWS EC2 Instance used for Cloud Deployment

From Shiny Developer with AWS Course

Next, you use a DockerFile to containerize the application’s software environment.

DockerFile for Stock Analyzer App

From Shiny Developer with AWS Course

You then deploy your “Stock Analyzer” application so it’s accessible anywhere via the AWS Cloud.

DockerFile for Stock Analyzer App

From Shiny Developer with AWS Course

If you are ready to learn how to build and deploy Shiny Applications in the cloud using AWS, then I recommend my NEW 4-Course R-Track System.

I look forward to providing you the best data science for business education.

Matt Dancho

Founder, Business Science

Lead Data Science Instructor, Business Science University


To leave a comment for the author, please follow the link and comment on their blog: business-science.io.

Training courses in San Francisco

Sun, 08/12/2019 - 14:13

[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers.]

Jumping Rivers are coming to San Francisco in January 2020! We’ll be running a number of R training courses with Paradigm Data. You can find the booking links and more details over at our courses page. Don’t be afraid to get in contact if you have any questions!

22nd January – Intro to R

This is a one-day intensive course on R and assumes no prior knowledge. By the end of the course, participants will be able to import, summarise and plot their data. At each step, we avoid using "magic code", and stress the importance of understanding what R is doing.

23rd January – Getting to Grips with the Tidyverse

The tidyverse is essential for any data scientist who deals with data on a day-to-day basis. By focusing on small key tasks, the tidyverse suite of packages removes the pain of data manipulation. This training course covers key aspects of the tidyverse, including dplyr, lubridate, tidyr and tibbles.

24th January – Advanced Graphics with R

The ggplot2 package can create advanced and informative graphics. This training course stresses understanding – not just one off R scripts. By the end of the session, participants will be familiar with themes, scales and facets, as well as the wider ggplot2 world of packages.

The post Training courses in San Francisco appeared first on Jumping Rivers.


To leave a comment for the author, please follow the link and comment on their blog: r – Jumping Rivers.

Advent of Code 2019 challenge with R

Sun, 08/12/2019 - 10:38

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers.]

I have decided to tackle this year’s Advent Of Code using R (more or less). I know there are more preferred languages, such as Python, C#, Java, JavaScript, Go, Kotlin, C++, Elixir, but it was worth trying.

Eight days into the competition, at the time of writing this blog post, I have had few problems using R. There was a puzzle on day two that initialized an array of arrays with 0-based indexing, and since R uses 1-based indexing, I had some fun (and aggravation) tracking down the problem. In the end, I rewrote everything in Python. The same style of instructions continued on days 5 and 6, but R should be just as good as any other language.

On day 3, the Manhattan distance (similar to Manhattan wiring) from the starting point to the closest intersection of two different wires needed to be calculated. That was fun, of course, but an extra little treat was to visualize both wires in 2-D space.

The following dataset is the dataset I received – yours will be different. The code is just the product of 10 minutes of writing (hence, not really optimized).

My dataset is presented as:

library(ggplot2) # read instructions for both wires d <- c("R1007","D949","R640","D225","R390","D41","R257","D180","L372","U62","L454","U594","L427","U561","R844","D435","L730","U964","L164","U342","R293","D490","L246","U323","L14","D366","L549","U312","L851","U959","L255","U947","L179","U109","R850","D703","L310","U175","L665","U515","R23","D633","L212","U650","R477","U131","L838","D445","R999","D229","L772","U716","L137","U355","R51","D565","L410","D493","L312","U623","L846","D283","R980","U804","L791","U918","L641","U258","R301","U727","L307","U970","L748","U229","L225","U997","L134","D707","L655","D168","L931","D6","R36","D617","L211","D453","L969","U577","R299","D804","R910","D898","R553","U298","L309","D912","R757","U581","R228","U586","L331","D865","R606","D163","R425","U670","R156","U814","L168","D777","R674","D970","L64","U840","L688","U144","L101","U281","L615","D393","R277","U990","L9","U619","L904","D967","L166","U839","L132","U216","R988","U834","R342","U197","L717","U167","L524","U747","L222","U736","L149","D156","L265","U657","L72","D728","L966","U896","R45","D985","R297","U38","R6","D390","L65","D367","R806","U999","L840","D583","R646","U43","L731","D929","L941","D165","R663","U645","L753","U619","R60","D14","L811","D622","L835","U127","L475","D494","R466","U695","R809","U446","R523","D403","R843","U715","L486","D661","L584","U818","L377","D857","L220","U309","R192","U601","R253","D13","L95","U32","L646","D983","R13","U821","L1","U309","L425","U993","L785","U804","L663","U699","L286","U280","R237","U388","L170","D222","L900","U204","R68","D453","R721","U326","L629","D44","R925","D347","R264","D767","L785","U249","R989","D469","L446","D384","L914","U444","R741","U90","R424","U107","R98","U20","R302","U464","L808","D615","R837","U405","L191","D26","R661","D758","L866","D640","L675","U135","R288","D357","R316","D127","R599","U411","R664","D584","L979","D432","R887","D104","R275","D825","L338","D739","R568","D625","L829","D393","L997","D291","L448","D947","L728","U181","L137","D572","L16","U358","R331","D966","R887","D122","L334","D560","R938","D159","R178","D29","L832","D58","R37") d2 <- 
c("L993","U121","L882","U500","L740","D222","R574","U947","L541","U949","L219","D492","R108","D621","L875","D715","R274","D858","R510","U668","R677","U327","L284","U537","L371","U810","L360","U333","L926","D144","R162","U750","L741","D360","R792","D256","L44","D893","R969","D996","L905","D524","R538","U141","R70","U347","L383","U74","R893","D560","L39","U447","L205","D783","L244","D40","R374","U507","L946","D934","R962","D138","L584","U562","L624","U69","L77","D137","L441","U671","L849","D283","L742","D459","R105","D265","R312","D734","R47","D369","R676","D429","R160","D814","L881","D830","R395","U598","L413","U817","R855","D377","L338","D413","L294","U321","L714","D217","L15","U341","R342","D480","R660","D11","L192","U518","L654","U13","L984","D866","R877","U801","R413","U66","R269","D750","R294","D143","R929","D786","R606","U816","L562","U938","R484","U32","R136","U30","L393","U209","L838","U451","L387","U413","R518","D9","L847","D605","L8","D805","R348","D174","R865","U962","R926","U401","R445","U720","L843","U785","R287","D656","L489","D465","L192","U68","L738","U962","R384","U288","L517","U396","L955","U556","R707","U329","L589","U604","L583","U457","R545","D504","L521","U711","L232","D329","L110","U167","R311","D234","R284","D984","L778","D295","R603","U349","R942","U81","R972","D505","L301","U422","R840","U689","R225","D780","R379","D200","R57","D781","R166","U245","L865","U790","R654","D127","R125","D363","L989","D976","R993","U702","L461","U165","L747","U656","R617","D115","L783","U187","L462","U838","R854","D516","L978","U846","R203","D46","R833","U393","L322","D17","L160","D278","R919","U611","L59","U709","L472","U871","L377","U111","L612","D177","R712","U628","R858","D54","L612","D303","R205","U430","R494","D306","L474","U848","R816","D104","L967","U886","L866","D366","L120","D735","R694","D335","R399","D198","R132","D787","L749","D612","R525","U163","R660","U316","R482","D412","L376","U170","R891","D202","R408","D333","R842","U965","R955","U440","L26","U747","R447","D8","R319","D188","L532","D39","L863","D599","R307","U253","R22")

I create two empty data frames (one for each wire) and a helper function to decode the instructions: R1007, for example, should be understood as “move right 1007 steps”, and the coordinates change accordingly.

gCoord2 <- function(pos, prevX, prevY){
  direction <- substring(pos, 1, 1)
  val <- as.integer(substring(pos, 2, nchar(pos)))
  #a <- c(0,0)
  if (direction == "R") { a <- c(prevX, prevY + val); return(a) }
  if (direction == "L") { a <- c(prevX, prevY - val); return(a) }
  if (direction == "U") { a <- c(prevX + val, prevY); return(a) }
  if (direction == "D") { a <- c(prevX - val, prevY); return(a) }
}

And now to iterate through both datasets and bind them into a single one:

df <- data.frame(x = c(0), y = c(0), group = 1)
df2 <- data.frame(x = c(0), y = c(0), group = 2)

for (i in 1:length(d)){
  ii <- d[i]
  print(ii)
  # get last value from df
  X <- tail(df$x, 1)
  Y <- tail(df$y, 1)
  x1 <- gCoord2(d[i], X, Y)[1]
  y1 <- gCoord2(d[i], X, Y)[2]
  df <- rbind(df, c(x = x1, y = y1, group = 1))
}

for (i in 1:length(d2)){
  ii <- d2[i]
  print(ii)
  # get last value from df
  X <- tail(df2$x, 1)
  Y <- tail(df2$y, 1)
  x1 <- gCoord2(d2[i], X, Y)[1]
  y1 <- gCoord2(d2[i], X, Y)[2]
  df2 <- rbind(df2, c(x = x1, y = y1, group = 2))
}

df3 <- rbind(df, df2)
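For part one of the day-3 puzzle, the Manhattan distance from the origin to the closest intersection also has to be computed. The original post does not show that step; a rough sketch of my own (not the author's code) that walks each wire cell by cell and intersects the visited coordinates could look like this, reusing the instruction vectors d and d2 from above:

# enumerate every grid cell a wire visits, given its instruction vector
walk_points <- function(instr) {
  x <- 0; y <- 0
  pts <- vector("list", length(instr))
  for (i in seq_along(instr)) {
    dir <- substring(instr[i], 1, 1)
    n <- as.integer(substring(instr[i], 2, nchar(instr[i])))
    dx <- switch(dir, R = 1, L = -1, U = 0, D = 0)
    dy <- switch(dir, U = 1, D = -1, R = 0, L = 0)
    xs <- x + dx * seq_len(n)
    ys <- y + dy * seq_len(n)
    pts[[i]] <- paste(xs, ys)
    x <- xs[n]
    y <- ys[n]
  }
  unlist(pts)
}

crossings <- intersect(walk_points(d), walk_points(d2))
xy <- do.call(rbind, lapply(strsplit(crossings, " "), as.integer))
min(abs(xy[, 1]) + abs(xy[, 2]))  # Manhattan distance to the nearest intersection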

Finally, a touch of ggplot2:

ggplot(df3, aes(x = x, y = y, group = group, colour = group)) +
  geom_path(size = 0.75, show.legend = FALSE) +
  # theme_void() +
  ggtitle('Advent Of Code 2019; Day 3 - Graph; starting point(0,0)')

And the graph with both wires:

Happy R-coding!


To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

Advent of Code 2019-08 with R & JavaScript

Sun, 08/12/2019 - 01:00

[This article was first published on Colin Fay, and kindly contributed to R-bloggers.]

Solving Advent of Code 2019-08 with R and JavaScript.

[Disclaimer] Obviously, this post contains a big spoiler about Advent
of Code, as it gives solutions for solving day 8.

About the JavaScript code

The JavaScript code has been written in the same RMarkdown as the R
code. It runs thanks to the {bubble} package:
https://github.com/ColinFay/bubble

Instructions

Find the instructions at: https://adventofcode.com/2019/day/8

R solution

Part one

library(magrittr)
library(purrr)

ipt <- read.delim("input8.txt", header = FALSE, colClasses = "character")$V1
ipt <- strsplit(ipt, "")[[1]] %>% as.numeric()

layers_size <- 6 * 25

l <- list()

for (i in 1:(length(ipt)/layers_size)){
  l[[i]] <- ipt[1:150]
  ipt <- ipt[151:length(ipt)]
}

mn <- l %>%
  lapply(table) %>%
  map_dbl("0") %>%
  which.min()

l[[mn]] %>% table()
## .
##   0   1   2
##   7  14 129

14 * 129
## [1] 1806

Part two

v <- c()

for (i in seq_len(layers_size)){
  idx <- map_dbl(l, i)
  v[i] <- idx[idx %in% c(0,1)][1]
}

library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)

matrix(v, ncol = 6) %>%
  as.data.frame() %>%
  rowid_to_column() %>%
  gather(key = key, value = value, V1:V6) %>%
  mutate(key = gsub("V(.)", "\\1", key) %>% as.numeric()) %>%
  ggplot(aes(rowid, key, fill = as.factor(value))) +
  geom_tile() +
  coord_fixed() +
  scale_fill_viridis_d() +
  scale_y_reverse()

JS solution

var ipt = fs.readFileSync("input8.txt", 'utf8').split("").filter(x => x.length != 0 & x != '\n').map(x => parseInt(x));

var layers_size = 6 * 25;
var layer_n = ipt.length / layers_size;
var res = [];

function table(vec){
  var tbl = {};
  vec.map(function(x){
    if (tbl[x]){
      tbl[x] = tbl[x] + 1;
    } else {
      tbl[x] = 1;
    }
  })
  return tbl;
}

for (var i = 0; i < layer_n; i ++){
  res[i] = ipt.splice(0, layers_size);
}

var res_b = res.map(x => table(x));
var minim = Math.min.apply(Math, res_b.map(x => x['0']));
var smallest = res_b.filter(x => x['0'] == minim);

smallest[0]["1"] * smallest[0]["2"];
## 1806

var v = [];

for (var i = 0; i < layers_size; i ++){
  var idx = res.map(x => x[i]);
  v[i] = idx.find(z => z== 0 | z == 1);
}

var nn = [];

for (var i = 0; i < 6; i ++){
  nn[i] = v.splice(0, 25).join(" ").replace(/0/g, " ");
}

nn
## [ ' 1 1 1 1 1 1 1 1 1 1 1 1 1 ',
##   ' 1 1 1 1 1 1 1 1 ',
##   ' 1 1 1 1 1 1 1 1 1 1 ',
##   ' 1 1 1 1 1 1 1 1 1 1 1 1 1 ',
##   '1 1 1 1 1 1 1 1 1 ',
##   ' 1 1 1 1 1 1 1 1 1 ' ]

To leave a comment for the author, please follow the link and comment on their blog: Colin Fay.

Practical Tidy Evaluation

Sun, 08/12/2019 - 01:00

[This article was first published on jessecambon-R, and kindly contributed to R-bloggers.]

Tidy evaluation is a framework for controlling how expressions and
variables in your code are evaluated by
tidyverse functions. This framework,
housed in the rlang package, is a powerful
tool for writing more efficient and elegant code. In particular, you’ll
find it useful for passing variable names as inputs to functions that
use tidyverse packages like dplyr and
ggplot2.

The goal of this post is to offer accessible examples and intuition for
putting tidy evaluation to work in your own code. Because of this I will
keep conceptual explanations brief, but for more comprehensive
documentation you can refer to dplyr’s
website
, rlang’s
website
, the ‘Tidy Evaluation’
book
by Lionel Henry and Hadley
Wickham, and the Metaprogramming Section of the ‘Advanced R’
book
by Hadley Wickham.

Motivating Example

To begin, let’s consider a simple example of calculating summary
statistics with the mtcars
dataset
.
Below we calculate maximum and minimum horsepower (hp) by the number of
cylinders (cyl) using the
group_by and
summarize
functions from dplyr.

library(dplyr)

hp_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(min_hp = min(hp),
            max_hp = max(hp))

  cyl  min_hp  max_hp
    4      52     113
    6     105     175
    8     150     335

Now let’s say we wanted to repeat this calculation multiple times while
changing which variable we group by. A brute force method to accomplish
this would be to copy and paste our code as many times as necessary and
modify the group by variable in each iteration. However, this is
inefficient especially if our code gets more complicated, requires many
iterations, or requires further development.

To avoid this inelegant solution you might think to store the name of a
variable inside of another variable like this groupby_var <- "vs".
Then you could attempt to use your newly created “groupby_var” variable
in your code: group_by(groupby_var). However, if you try this you will
find it doesn’t work. The “group_by” function expects the name of the
variable you want to group by as an input, not the name of a variable
that contains the name of the variable you want to group by.
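To make the failure concrete, here is a quick sketch of the naive attempt (my own illustration; the exact behavior depends on your dplyr version, but it will not group by the vs column):

groupby_var <- "vs"

# Naive attempt: group_by() looks for a column literally named groupby_var,
# does not find one, and (depending on the dplyr version) either errors or
# falls back to the environment value "vs" as a constant -- so every row
# lands in a single group. It never groups by the vs column itself.
mtcars %>%
  group_by(groupby_var) %>%
  summarize(min_hp = min(hp), max_hp = max(hp))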

This is the kind of headache that tidy evaluation can help you solve. In
the example below we use the
quo function and the
“bang-bang” !!
operator to set “vs” (engine shape, 0 = V-shaped, 1 = straight) as our
group by variable. The “quo” function allows us to store the variable
name in our “groupby_var” variable and “!!” extracts the stored
variable name.

groupby_var <- quo(vs)

hp_by_vs <- mtcars %>%
  group_by(!!groupby_var) %>%
  summarize(min_hp = min(hp),
            max_hp = max(hp))

  vs  min_hp  max_hp
   0      91     335
   1      52     123

The code above provides a method for setting the group by variable by
modifying the input to the “quo” function when we define “groupby_var”.
This can be useful, particularly if we intend to reference the group by
variable multiple times. However, if we want to use code like this
repeatedly in a script then we should consider packaging it into a
function. This is what we will do next.

Making Functions with Tidy Evaluation

To use tidy evaluation in a function, we will still use the “!!”
operator as we did above, but instead of “quo” we will use the
enquo function. Our
new function below takes the group by variable and the measurement
variable as inputs so that we can now calculate maximum and minimum
values of any variable we want. Also note two new features I have
introduced in this function:

  • The as_label
    function extracts the string value of the “measure_var” variable
    (“hp” in this case). We use this to set the value of the
    “measure_var” column.
  • The “walrus operator”
    :=
    is used to create a column named after the variable name stored in
    the “measure_var” argument (“hp” in the example). The walrus
    operator allows you to use strings and evaluated variables (such as
    “measure_var” in our example) on the left hand side of an
    assignment operation (where there would normally be a “=” operator)
    in functions such as “mutate” and “summarize”.

Below we define our function and use it to group by “am” (transmission
type, 0 = automatic, 1 = manual) and calculate summary statistics with
the “hp” (horsepower) variable.

car_stats <- function(groupby_var, measure_var) {
  groupby_var <- enquo(groupby_var)
  measure_var <- enquo(measure_var)
  return(mtcars %>%
           group_by(!!groupby_var) %>%
           summarize(min = min(!!measure_var),
                     max = max(!!measure_var)) %>%
           mutate(measure_var = as_label(measure_var),
                  !!measure_var := NA))
}

hp_by_am <- car_stats(am, hp)

  am  min  max  measure_var  hp
   0   62  245  hp           NA
   1   52  335  hp           NA

We now have a flexible function that contains a dplyr workflow. You can
experiment with modifying this function for your own purposes.
Additionally, as you might suspect, you could use the same tidy
evaluation functions we just used with tidyverse packages other than
dplyr.

As an example, below I’ve defined a function that builds a scatter plot
with ggplot2. The function takes a
dataset and two variable names as inputs. You will notice that the
dataset argument “df” needs no tidy evaluation. The
as_label function is
used to extract our variable names as strings to create a plot title
with the “ggtitle” function.

library(ggplot2)
library(stringr)  # provides str_c(), used to build the plot title below

scatter_plot <- function(df, x_var, y_var) {
  x_var <- enquo(x_var)
  y_var <- enquo(y_var)
  return(ggplot(data = df, aes(x = !!x_var, y = !!y_var)) +
           geom_point() +
           theme_bw() +
           theme(plot.title = element_text(lineheight = 1, face = "bold", hjust = 0.5)) +
           geom_smooth() +
           ggtitle(str_c(as_label(y_var), " vs. ", as_label(x_var))))
}

scatter_plot(mtcars, disp, hp)

As you can see, we’ve plotted the “hp” (horsepower) variable against
“disp” (displacement) and added a regression line. Now, instead of
copying and pasting ggplot code to create the same plot with different
datasets and variables, we can just call our function.
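For instance (my own quick check, not from the original post), the same function works unchanged on another built-in data frame:

# reuse the function with a different dataset and pair of variables
scatter_plot(iris, Sepal.Length, Petal.Length)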

The “Curly-Curly” Shortcut and Passing Multiple Variables

To wrap things up, I’ll cover a few additional tricks and shortcuts for
your tidy evaluation toolbox.

  • The “curly-curly” {{
    }}

    operator directly extracts a stored variable name from
    “measure_var” in the example below. In the prior example we
    needed both “enquo” and “!!” to evaluate a variable like this so
    the “curly-curly” operator is a convenient shortcut. However, note
    that if you want to extract the string variable name with the
    “as_label” function, you will still need to use “enquo” and
    “!!” as we have done below with “measure_name”.
  • The syms function and
    the “!!!” operator are used for passing a list of variables as a
    function argument. In prior examples “!!” was used to evaluate a
    single group by variable; we now use “!!!” to evaluate a list of
    group by variables. One quirk is that to use the “syms” function we
    will need to pass the variable names in quotes.
  • The walrus operator “:=” is again used to create new columns, but
    now the column names are defined with a combination of a variable
    name stored in a function argument and another string (“_min” and
    “_max” below). We use the “enquo” and “as_label” functions to
    extract the string variable name from “measure_var” and store it in
    “measure_name” and then use the “str_c” function from
    stringr to combine strings. You
    can use similar code to build your own column names from variable
    name inputs and strings.

Our new function is defined below and is first called to group by the
“cyl” variable and then called to group by the “am” and “vs”
variables. Note that the “!!!” operator and “syms” function can be
used with either a list of strings or a single string.

get_stats <- function(data, groupby_vars, measure_var) {
  groupby_vars <- syms(groupby_vars)
  measure_name <- as_label(enquo(measure_var))
  return(data %>%
           group_by(!!!groupby_vars) %>%
           summarize(!!str_c(measure_name, "_min") := min({{measure_var}}),
                     !!str_c(measure_name, "_max") := max({{measure_var}})))
}

cyl_hp_stats <- mtcars %>% get_stats("cyl", mpg)
gear_stats <- mtcars %>% get_stats(c("am", "vs"), gear)

  cyl  mpg_min  mpg_max
    4     21.4     33.9
    6     17.8     21.4
    8     10.4     19.2

  am  vs  gear_min  gear_max
   0   0         3         3
   0   1         3         4
   1   0         4         5
   1   1         4         5

This concludes my introduction to tidy evaluation. Hopefully this serves
as a useful starting point for using these concepts in your own code.


To leave a comment for the author, please follow the link and comment on their blog: jessecambon-R.

How to precompute package vignettes or pkgdown articles

Sun, 08/12/2019 - 01:00

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.]

As of earlier this year, we are now automatically building binaries and pkgdown documentation for all rOpenSci packages. One issue we encountered is that some packages include vignettes that require some special tools/data/credentials, which are unavailable on generic build servers.

This post explains how to include such vignettes and articles in your package.

On package vignettes

By default, R automatically recreates vignettes during R CMD check or when generating pkgdown sites by running all R code. This is useful because it provides some extra testing of your code and ensures that documentation is reproducible. However, sometimes it is not a good idea to run the code on every build server, every time. For example:

  • The vignette examples require some special local software or private data.
  • The code connects to a web service that requires authentication or has limits.
  • You don’t want to hammer web services for every CMD check.
  • The vignette code takes very long to execute.

In such cases it is better to execute the rmarkdown code locally, and ship a vignette in the package which already contains the rendered R output.

The solution: locally knitting rmarkdown

Suppose you have a vignette called longexample.Rmd. To pre-compute the vignette, rename the input file to something that R does not recognize as rmarkdown, such as longexample.Rmd.orig. Then run knitr in the package directory to execute the R code and replace it with its output in the rmarkdown:

# Execute the code from the vignette
knitr::knit("vignettes/longexample.Rmd.orig", output = "vignettes/longexample.Rmd")

The new output file longexample.Rmd now contains markdown with the R output already embedded. It can be treated as a regular vignette, but R can convert it to html instantaneously without having to re-run the R code from the rmarkdown.

The jsonlite package shows a real world example. In this case I pre-computed vignettes that access web APIs to prevent services from getting hammered (and potentially banning the check servers).

Saving vignette figures

One gotcha with this trick is that if the vignette output includes figures, you need to store the images in the vignettes folder. It is also a good idea to explicitly name your rmarkdown knitr chunks, so that the images have sensible filenames.
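For instance, a setup chunk in the .Rmd.orig along these lines keeps the generated figures with predictable names built from the chunk labels (the fig.path value is only an illustration – the right path depends on where you run knit() from and where the final vignette expects to find its images):

# in a setup chunk of longexample.Rmd.orig
knitr::opts_chunk$set(
  fig.path = "",        # write figures next to the knitted file; adjust as needed
  collapse = TRUE,
  comment = "#>"
)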

Our recently onboarded package eia by Matt Leonawicz is a good example. This package provides an R client for the US Energy Information Administration Open Data API. The eia documentation gets automatically generated for each commit on the rOpenSci docs server, even though the code in the vignettes actually requires an API key (which the docs server does not have).

The eia vignettes directory contains the Rmd.orig input files and the .Rmd files as pre-computed by the package author. Also note the vignettes directory contains a handy script precompile.R that makes it easy for the package author to refresh the output vignettes locally.

Don’t forget to update

The drawback of this approach is that documents no longer automatically update when the package changes. Therefore you should only pre-compute the vignettes and articles that are problematic, and make a note for yourself to re-knit the vignette occasionally, e.g. before a package release. Adding a script to your vignette folders that does so can be a helpful reminder.
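
As a rough sketch, such a helper script could look like the following (the file and vignette names here are placeholders, not taken from any particular package):

# vignettes/precompile.R: refresh the pre-computed vignettes from their .orig sources.
# Run this manually (e.g. before a release); it is not executed on build servers.
knitr::knit("vignettes/longexample.Rmd.orig", output = "vignettes/longexample.Rmd")
# Repeat for any other pre-computed vignettes, then commit the regenerated .Rmd
# files (and any figures they wrote to the vignettes folder) together with the sources.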


Inset maps with ggplot2

Sun, 08/12/2019 - 01:00

[This article was first published on the Geocomputation with R website, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Inset maps enable multiple places to be shown in the same geographic data visualisation, as described in the Inset maps section (8.2.7) of our open source book Geocomputation with R.
The topic of inset maps has gained attention and recently Enrico Spinielli asked how inset maps could be created for data in unusual coordinate systems:

Speaking of insets, do you know of any ggplot2 examples with an inset for showing where the (bbox of the) map is in an orthographic/satellite proj? It is usually done to provide geographic context… NYT/WP have fantastic maps like that

— espinielli (@espinielli) November 4, 2019

R’s flexibility allows inset maps to be created in various ways, using different approaches and packages.
However, the main idea stays the same: we need to create at least two maps: a larger one, called the main map, that shows the central story and a smaller one, called the inset map, that puts the main map in context.

This blog post shows how to create inset maps with ggplot2 for visualization.
The approach also uses the sf package for spatial data reading and handling, cowplot to arrange inset maps, and rcartocolor for additional color palettes.
To reproduce the results on your own computer, install these packages and then attach them as follows:

library(sf) library(ggplot2) library(cowplot) library(rcartocolor)

Basic inset map

Let’s start by creating a basic inset map.

Data preparation

The first step is to read and prepare the data we want to visualize.
We use the us_states data from the spData package as the source of the inset map, and north_carolina from the sf package as the source of the main map.

library(spData) data("us_states", package = "spData") north_carolina = read_sf(system.file("shape/nc.shp", package = "sf"))

Both objects should have the same coordinate reference system (crs).
Here, we use crs = 2163, which represents the US National Atlas Equal Area projection.

us_states_2163 = st_transform(us_states, crs = 2163) north_carolina_2163 = st_transform(north_carolina, crs = 2163)

We also need to have the borders of the area we want to highlight (use in the main map).
This can be done by extracting the bounding box of our north_carolina_2163 object.

north_carolina_2163_bb = st_as_sfc(st_bbox(north_carolina_2163))

Maps creation

The second step is to create both inset and main maps independently.
The inset map should show the context (larger area) and highlight the area of interest.

ggm1 = ggplot() + geom_sf(data = us_states_2163, fill = "white") + geom_sf(data = north_carolina_2163_bb, fill = NA, color = "red", size = 1.2) + theme_void() ggm1

The main map’s role is to tell the story.
Here we show the number of births between 1974 and 1978 in the North Carolina counties (the BIR74 variable) using the Mint color palette from the rcartocolor package.
We also customize the legend position and size – this way, the legend is a part of the map, instead of being somewhere outside the map frame.

ggm2 = ggplot() + geom_sf(data = north_carolina_2163, aes(fill = BIR74)) + scale_fill_carto_c(palette = "Mint") + theme_void() + theme(legend.position = c(0.4, 0.05), legend.direction = "horizontal", legend.key.width = unit(10, "mm")) ggm2

Maps joining

The final step is to join two maps.
This can be done using functions from the cowplot package.
We create an empty ggplot layer using ggdraw(), fill it with our main map (draw_plot(ggm2)), and add the inset map by specifying its position and size:

gg_inset_map1 = ggdraw() + draw_plot(ggm2) + draw_plot(ggm1, x = 0.05, y = 0.65, width = 0.3, height = 0.3) gg_inset_map1

The final map can be saved using the ggsave() function.

ggsave(filename = "01_gg_inset_map.png", plot = gg_inset_map1, width = 8, height = 4, dpi = 150) Advanced inset map

Let’s expand the idea of the inset map in ggplot2 based on the previous example.

Data preparation

This map will use the US states borders (states()) as the source of the inset map and the Kentucky Senate legislative districts (state_legislative_districts()) as the main map.

library(tigris) us_states = states(cb = FALSE, class = "sf") ky_districts = state_legislative_districts("KY", house = "upper", cb = FALSE, class = "sf")

The states() function, in addition to the 50 states, also returns the District of Columbia, Puerto Rico, American Samoa, the Commonwealth of the Northern Mariana Islands, Guam, and the US Virgin Islands.
For our purpose, we are interested in the continental 48 states and the District of Columbia only; therefore, we remove the rest of the divisions using subset().

us_states = subset(us_states, !NAME %in% c( "United States Virgin Islands", "Commonwealth of the Northern Mariana Islands", "Guam", "American Samoa", "Puerto Rico", "Alaska", "Hawaii" ))

As in the example above, we transform both objects to the same projection.

ky_districts_2163 = st_transform(ky_districts, crs = 2163) us_states_2163 = st_transform(us_states, crs = 2163)

We also extract the bounding box of the main object here.
However, instead of using it directly, we add a buffer of 10,000 meters around it.
This output will be handy in both inset and main maps.

ky_districts_2163_bb = st_as_sfc(st_bbox(ky_districts_2163)) ky_districts_2163_bb = st_buffer(ky_districts_2163_bb, dist = 10000)

The ky_districts_2163 object does not have any interesting variables to visualize, so we create some random values here.
However, we could also join the districts’ data with another dataset in this step.

ky_districts_2163$values = runif(nrow(ky_districts_2163))

Map creation

The inset map should be as clear and simple as possible.

ggm3 = ggplot() + geom_sf(data = us_states_2163, fill = "white", size = 0.2) + geom_sf(data = ky_districts_2163_bb, fill = NA, color = "blue", size = 1.2) + theme_void() ggm3

On the other hand, the main map looks better when we provide some additional context to our data.
One of the ways to achieve it is to add the borders of the neighboring states.

Importantly, we also need to limit the extent of our main map to the range of the frame in the inset map.
This can be done with the coord_sf() function.

ggm4 = ggplot() + geom_sf(data = us_states_2163, fill = "#F5F5DC") + geom_sf(data = ky_districts_2163, aes(fill = values)) + scale_fill_carto_c(palette = "Sunset") + theme_void() + theme(legend.position = c(0.5, 0.07), legend.direction = "horizontal", legend.key.width = unit(10, "mm"), plot.background = element_rect(fill = "#BFD5E3")) + coord_sf(xlim = st_bbox(ky_districts_2163_bb)[c(1, 3)], ylim = st_bbox(ky_districts_2163_bb)[c(2, 4)]) ggm4

Finally, we draw two maps together, trying to find the best location and size for the inset map.

gg_inset_map2 = ggdraw() + draw_plot(ggm4) + draw_plot(ggm3, x = 0.02, y = 0.65, width = 0.35, height = 0.35) gg_inset_map2

The final map can be saved using the ggsave() function.

ggsave(filename = "02_gg_inset_map.png", plot = gg_inset_map2, width = 7.05, height = 4, dpi = 150) Summary

The above examples can be adjusted to any spatial data and location.
It is also possible to put more context on the map, including adding main cities’ names, neighboring states’ names, and annotations (using geom_text(), geom_label()).
The main map can also be enhanced with a north arrow and scale bar using the ggsn package.
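
For instance, a minimal sketch of labelling the counties on the main map from the first example, using the sf-aware geom_sf_text() and assuming the NAME column of the North Carolina data holds the county names:

ggm2 + geom_sf_text(data = north_carolina_2163, aes(label = NAME), size = 2)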

As always with R, there are many possible options to create inset maps.
You can find two examples of inset maps created using the tmap package in the Geocomputation with R book.
The second example is a classic map of the United States, which consists of the contiguous United States, Hawaii, and Alaska.
However, Hawaii and Alaska are displayed at different geographic scales than the main map there.
This problem can also be solved in R, which you can see in the Making maps of the USA with R: alternative layout blogpost and the Alternative layout for maps of the United States repository.

The presented approaches also apply to other areas.
For example, you can find three ways on how to create an inset map of Spain in the Alternative layout for maps of Spain repository.
Other examples of inset maps with ggplot2 can be found in the Inset Maps vignette by Ryan Peek and the blog post Drawing beautiful maps programmatically with R, sf and ggplot2 by Mel Moreno and Mathieu Basille.

The decision of which option to use depends on the expected map type, preferred R packages, etc.
Try different approaches on your own data and decide what works best for you!


Building a statistical model for field goal kicker accuracy

Sun, 08/12/2019 - 01:00

[This article was first published on R on Jacob Long, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is the methodological companion to my post on
my proposed method for evaluating NFL kickers. To get all the results and
some extra info about the data, check it out.

When you have a hammer, everything looks like a nail, right? Well I’m a big
fan of multilevel models and especially the ability of MCMC estimation to
fit these models with many, often sparse, groups. Great software implementations
like Stan and my favorite R interface to it,
brms make doing applied
work pretty straightforward and even fun.

As I spent time lamenting the disappointing season my Chicago Bears have been
having, I tried to think about how seriously I should take the shakiness of
their new kicker, Eddy Pineiro. Just watching casually, it can be hard to really
know whether a kicker has had a very difficult set of kicks or not and what
an acceptable FG% would be. This got me thinking about how I could use what
I know to satisfy my curiosity.

Previous efforts

Of course, I’m not exactly the first person to want to do something to account
for those differences. Chase Stuart over
at Football Perspectives
used field goal distance to adjust FG% for difficulty (as well as comparing
kickers across eras by factoring in generational differences in kicking success).
Football Outsiders
does something similar — adjusting for distance — when grading kickers.
Generally speaking, the evidence suggests that just dealing with kick distance
gets you very far along the path to identifying the best kickers.

Chris Clement at the
Passes and Patterns blog
provides a nice review of the statistical and theoretical approaches to the
issue. There are several things that are clear from previous efforts, besides
the centrality of kick distance. Statistically, methods based
on the logistic regression model are appropriate and the most popular —
logistic regression is a statistical method designed to predict binary events
(e.g., making/missing a field goal) using multiple sources of information. And
besides kick distance, there are certainly elements of the environment that
matter — wind, temperature, elevation among them — although just how much
and how easily they can be measured is a matter of debate.

There has also been a lot of interest in game situations, especially clutch
kicking. Do kickers perform worse under pressure, like when their kicks will
tie the game or give their team the lead late in the game? Does “icing” the
kicker, by calling a timeout before the kick, make the kicker less likely to
be successful? Do they perform worse in playoff games?

On icing, Moskowitz and Wertheim (2011),
Stats, Inc.,
Clark, Johnson, and Stimpson (2013),
and LeDoux (2016)
do not find compelling evidence that icing the kicker is effective. On the other
hand,
Berry and Wood (2004),
Goldschmied, Nankin, and Cafri (2010),
and Carney (2016)
do find some evidence that icing hurts the kicker. All these have some limitations,
including which situations qualify as “icing” and whether we can identify
those situations in archival data. In general, to the extent there may be an
effect, it looks quite small.

Most important in this prior work is the establishment of a few approaches to
quantification. A useful way to think about comparing kickers is to know what
their expected FG% (eFG%) is. That is, given the difficulty of their kicks,
how
would some hypothetical average kicker have fared? Once we have an expected FG%,
we can more sensibly look at the kicker’s actual FG%. If we have two kickers
with an actual FG% of 80%, and kicker A had an eFG% of 75% while
kicker B had an eFG% of 80%, we can say kicker A is doing better because he
clearly had a more difficult set of kicks and still made them at the same rate.

Likewise, once we have eFG%, we can compute points above average (PAA).
This is fairly straightforward since we’re basically just going to take the
eFG% and FG% and weight them by the number of kicks. This allows us to
appreciate the kickers who accumulate the most impressive (or unimpressive)
kicks over the long haul. And since coaches generally won’t try kicks they
expect to be missed, it rewards kickers who win the trust of their coaches and
get more opportunities to kick.
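
As a rough sketch of that calculation (variable names here are just for illustration, and assuming 3 points per made field goal):

# points above average: actual makes minus expected makes, times 3 points each
paa <- 3 * attempts * (fg_pct - efg_pct)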

Extensions of these include replacement FG% and points above replacement, which use replacement level as a reference point rather than
average. This is useful because if you want to know whether a kicker is playing
badly enough to be fired, you need some idea of who the competition is.
PAA and eFG% are more useful when you’re talking about greatness and who
deserves a pay raise.

Statistical innovations

The most important — in my view — entries into the “evaluating kickers with
statistical models” literature are papers by
Pasteur and Cunningham-Rhoads (2014) as well as
Osborne and Levine (2017).

Pasteur and Cunningham-Rhoads — I’ll refer to them as PC-R for short —
gathered more data than most predecessors, particularly in terms of auxiliary
environmental info. They have wind, temperature, and presence/absence of
precipitation. They show fairly convincingly that while modeling kick distance
is the most important thing, these other factors are important as well. PC-R
also find the cardinal direction of every NFL stadium (i.e., does it run
north-south, east-west, etc.) and use this information along with wind direction
data to assess the presence of cross-winds, which are perhaps the trickiest for
kickers to deal with. They can’t know about headwinds/tailwinds because as far
as they (and I) can tell, nobody bothers to record which end zone teams defend
at the game’s coin toss, so we don’t know without looking at video which
direction the kick is going. They ultimately combine the total wind and the
cross wind, suggesting they have some meaningful measurement error that makes
them not accurately capture all the cross-winds. Using
their logistic regressions that factor for these several factors, they calculate
an eFG% and use it and its derivatives to rank the kickers.

PC-R
include some predictors that, while empirically justifiable based on their
results, I don’t care to include. These are especially indicators of defense
quality, because I don’t think this should logically effect the success of a
kick and is probably related to the selection bias inherent to the coach’s
decision to try a kick or not. They also include a “kicker fatigue” variable
that appears to show that kickers attempting 5+ kicks in a game are less
successful than expected. I don’t think this makes sense and so I’m not
going to include it for my purposes.

They put some effort into defining a
“replacement-level” kicker which I think is sensible in spite of some limitations
they acknowledge. In my own efforts, I decided to do something fairly similar
by using circumstantial evidence to classify a given kicker in a given
situation as a replacement or not.

PC-R note that their model seems to overestimate the probability of very long
kicks, which is not surprising from a statistical standpoint given that there
are rather few such kicks, they are especially likely to only be taken by those
with an above-average likelihood of making them, and the statistical assumption
of linearity is most likely to break down on the fringes like this. They also
mention it would be nice to be able to account for kickers having different
leg strengths and not just differing in their accuracy.

Osborne and Levine (I’ll call them OL) take an important step in trying to
improve upon some of these limitations. Although they don’t use this phrasing,
they are basically proposing to use multilevel models, which treat each kicker
as his own group and thereby accounting for the possibility — I’d say it’s a
certainty — that kickers differ from one another in skill.

A multilevel
model has several positive attributes, especially that it not only adjusts for
the apparent differences in kickers but also that it looks skeptically upon
small sample sizes. A guy who makes a 55-yard kick in his first career attempt
won’t be dubbed the presumptive best kicker of all time because the model will
by design account for the fact that a single kick isn’t very informative. This
means we can simultaneously improve the prediction accuracy on kicks, but also
use the model’s estimates of kicker ability without over-interpreting
small sample sizes. They also attempt to use a quadratic term for kick
distance, which could better capture the extent to which the marginal
difference of a few extra yards of distance is a lot different when you’re at
30 vs. 40 vs. 50 yards. OL are unsure about whether the model justifies
including the quadratic term but I think on theoretical grounds it makes a lot
of sense.

OL also discuss using a clog-log link rather than the logistic link, showing
that it has better predictive accuracy under some conditions. I am going to
ignore that advice for a few reasons, most importantly because the advantage is
small and also because the clog-log link is computationally intractable with
the software I’m using.

Model

Code and data for reproducing these analyses can be found on Github.

My tool is a multilevel logistic regression fit via
MCMC using the wonderful
brms R package.
I actually considered several models for model selection.

In all cases, I have random intercepts for kicker and stadium. I also use
random slopes for both kick distance and wind at the stadium level. Using
random wind slopes at the stadium level will hopefully capture the prevailing
winds at that stadium. If they tend to be helpful, it’ll have a larger absolute
slope. Some stadiums may have swirling winds and this helps capture that as
well. The random slope for distance hopefully captures some other things, like
elevation. I also include interaction terms for wind and kick distance as well
as temperature and kick distance, since the elements may only affect longer
kicks.

There are indicators for whether the kick was “clutch” — game-tying or
go-ahead in the 4th quarter — whether the kicker was “iced,” and whether
the kick occurred in the playoffs. There is an interaction term between
clutch kicks and icing to capture the ideal icing situation as well.

I have a binary variable indicating whether the kicker was, at the time, a
replacement. In the main post, I describe the
decision rules involved in that. I have interaction terms for replacement
kickers and kick distance as well as replacement kickers and clutch kicks.

I have two random slopes at the kicker level:

  • Kick distance (allowing some kickers to have stronger legs)
  • Season (allowing kickers to have a career trajectory)

Season is modeled with a quadratic term so that kickers can decline over
time — it also helps with the over-time ebb and flow of NFL FG%. It would
probably be better to use a GAM for this to be more flexible, but they are a
pain.
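
To make that structure concrete, here is a rough sketch of what a brms call along these lines might look like. The column names (success, distance, wind, temp, clutch, iced, playoffs, replacement, season, kicker, stadium) are assumptions for illustration, not the actual data set, and the polynomial and interaction terms are simplified relative to the variants discussed below:

library(brms)
fit <- brm(
  success ~ poly(distance, 3) + wind + temp +
    wind:distance + temp:distance +            # weather may only matter on longer kicks
    clutch * iced + playoffs +                 # game-situation indicators
    replacement + replacement:distance + replacement:clutch +
    poly(season, 2) +                          # career / era trajectory
    (1 + distance + season | kicker) +         # kicker-specific leg strength and trajectory
    (1 + distance + wind | stadium),           # stadium-specific winds, elevation, etc.
  data = kicks, family = bernoulli(),
  chains = 4, cores = 4
)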

All I’ve disclosed so far is enough to have one model. But I also explore
the form of kick distance using polynomials. OL used a quadratic term, but I’m
not sure even that is enough. I compare 2nd, 3rd, and 4th degree polynomials
for kick distance to try to improve the prediction of long kicks in particular.
Of course, going down the road of polynomials can put you on a glide path
towards the land of overfitting.

I fit several models, with combinations of the following:

  • 2nd, 3rd, or 4th degree polynomial
  • brms default improper priors on the fixed and random effects or
    weakly informative normal priors on the fixed and random effects
  • Interactions with kick distance with either all polynomial terms or just
    the first and second degree terms

That last category is one that I suspected — and later confirmed — could
cause some weird results. Allowing all these things to interact with a 3rd
and 4th degree polynomial term made for some odd predictions on the fringes,
like replacement-level kickers having a predicted FG% of 70% at 70 yards out.

Model selection

I looked at several criteria to compare models.

A major one was
approximate leave-one-out cross-validation.
I will show the LOOIC, which is interpreted like AIC/BIC/DIC/WAIC in terms of
lower numbers being better. This produces the same ordering as the ELPD, which
has the opposite interpretation in that higher numbers are better.
Another thing I looked at was generating prediction weights for the models via
Bayesian model stacking.
I also calculated Brier scores,
which are a standard tool for looking at prediction accuracy for binary
outcomes and are simply the mean squared prediction error. Last among the
quantitative measures is the
AUC
(area under the curve), which is another standard tool in the evaluation
of binary prediction models.
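
As a sketch of how these metrics can be computed for brms fits (object names such as fit1, fit2, kicks and the success column are hypothetical):

loo1 <- loo(fit1)   # approximate leave-one-out CV; LOOIC and ELPD are in loo1$estimates
loo2 <- loo(fit2)
# Bayesian stacking weights from the pointwise log-likelihood matrices
loo::loo_model_weights(list(log_lik(fit1), log_lik(fit2)), method = "stacking")
# Brier score: mean squared prediction error of the fitted probabilities
p1 <- fitted(fit1)[, "Estimate"]
mean((p1 - kicks$success)^2)
# AUC, for example via the pROC package
pROC::auc(kicks$success, p1)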

Beyond these, I also plotted predictions in areas of interest where I’d like
the model to perform well (like on long kicks) and checked certain cases
where external information not shown directly to the model gives me a relatively
strong prior. Chief among these was whether it separated the strong-legged
kickers well.

Below I've summarized the model comparison results. For some metrics lower numbers are better, for others higher numbers are better. The row marked with an asterisk is the model I used.

Polynomial degree  Interaction degree  Priors    LOOIC     Model weight  Brier score  AUC
3                  2                   Proper    8030.265  0.5083        0.1165       0.7786
4                  2                   Proper    8032.332  0.1533        0.1164       0.7792
3                  2                   Improper  8032.726  0.0763        0.1160       0.7811  *
3                  3                   Proper    8033.879  0.0001        0.1165       0.7787
3                  3                   Improper  8035.616  0.2042        0.1158       0.7815
4                  2                   Improper  8036.614  0.0001        0.1158       0.7816
4                  4                   Proper    8038.266  0.0000        0.1163       0.7793
2                  2                   Proper    8043.221  0.0000        0.1169       0.7778
2                  2                   Improper  8043.450  0.0577        0.1165       0.7798
4                  4                   Improper  8043.741  0.0000        0.1156       0.7825

So why did I choose that model? The approximate LOO-CV procedure picks it as
the third model, although there’s enough uncertainty around those estimates
that it could easily be the best — or not as good as some ranked below it.
It’s not clear that the 4th degree polynomial does a lot of good in the
models that have it and it increases the risk of overfitting. It seems to
reduce the predicted probability of very long kicks, but as I’ve thought about
it more I’m not sure it’s a virtue.

Compared to the top two models, which distinguish themselves from the chosen
model by their use of proper priors, the model I chose does better on the
in-sample prediction accuracy metrics without looking much different on the
approximate out-of-sample ones. It doesn’t get much weight because it doesn’t
have much unique information compared to the slightly-preferred version with
proper priors. But as I looked at the models’ predictions, it appeared to me
that the regularization with the normal priors was a little too aggressive
and wasn’t picking up on the differences among kickers in leg strength.

That being said, the choices among these top few models are not very important
at all when it comes to the basics of who are the top and bottom ranked kickers.

Notes on predicted success

I initially resisted including things like replacement status and anything else
that is a fixed characteristic of a kicker (at least within a season) or
kicker-specific slopes in the
model because I planned to extract the random intercepts and use that as my
metric. Adding those things would make the random intercepts less interpretable;
if a kicker is bad and there’s no “replacement” variable, then the intercept
will be negative, but with the “replacement” variable the kicker may not have
a negative intercept after the adjustment for replacement status.

Instead, I decided to focus on model predictions. Generating the expected
FG% and replacement FG% was pretty straightforward. For eFG%, take all kicks
attempted and set replacement = 0. For rFG%, take all kicks and set
replacement = 1.
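
In code, that could look roughly like this (object and column names are hypothetical, replacement is assumed to be coded 0/1, and re_formula = NA drops all group-level effects to mimic a generic kicker):

d_avg <- kicks; d_avg$replacement <- 0    # every kick taken by a non-replacement kicker
d_rep <- kicks; d_rep$replacement <- 1    # every kick taken by a replacement-level kicker
eFG <- mean(fitted(fit, newdata = d_avg, re_formula = NA)[, "Estimate"])
rFG <- mean(fitted(fit, newdata = d_rep, re_formula = NA)[, "Estimate"])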

To generate kicker-specific probabilities, though, I had to decide how to
incorporate this information. I’d clearly overrate new, replacement-level
kickers. My solution to this was to, before generating predictions on
hypothetical data, set each kicker’s replacement variable to his career
average.

For season, on the other hand, I could eliminate the kicker-specific
aspect of this by intentionally zeroing these effects out in the
predictions. If I wanted to predict success in a specific season, of course,
I could include this.


Le Monde puzzle [#1119]

Sun, 08/12/2019 - 00:19

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A digit puzzle as Le weekly Monde current mathematical puzzle that sounds close to some earlier versions:

Perfect squares are pairs (a²,b²) with the same number of digits such that a²b² is itself a square. What is the pair providing a²b² less than 10⁶? Is there a solution with both integers enjoying ten digits?

The run of a brute force R code like

# check whether a^2 and b^2 have the same number of digits and whether their
# "concatenation" a^2 * 10^(n+1) + b^2 is itself a perfect square
cek <- function(a, b){
  u <- trunc
  if ((n <- u(log(a^2, base = 10))) == u(log(b^2, base = 10)) &
      (u(sqrt(a^2*10^(n+1) + b^2))^2 == (a^2*10^(n+1) + b^2)))
    print(c(a, b))
}

provides solutions to the first question.

[1] 2 3 [1] 4 9 [1] 12 20 [1] 15 25 [1] 18 30 [1] 49 99 [1] 126 155 [1] 154 300 [1] 159 281 [1] 177 277 [1] 228 100 [1] 252 310 [1] 285 125

with the (demonstrable) conclusion that the only pairs with an even number of digits are of the form (49…9²,9…9²), as for instance (49999²,99999²) with ten digits each.
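
A quick sanity check (in exact integer range) of one of the listed pairs, (49, 99):

a <- 49; b <- 99
n <- trunc(log(b^2, base = 10))   # both squares have n + 1 = 4 digits
x <- a^2 * 10^(n + 1) + b^2       # "concatenation" of the two squares: 24019801
sqrt(x)                           # 4901, so x = 4901^2 is indeed a perfect square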


What is new for rquery December 2019

Sat, 07/12/2019 - 20:25

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Our goal has been to make rquery the best query generation system for R (and to make data_algebra the best query generator for Python).

Let's see what rquery is good at, and what new features are making rquery better.

The idea is: the query is a first class citizen that we can use to design and optimize queries prior to translating them into a data transform action via data.table, SQL, Pandas, or other realizations.

For quite a while rquery has had query-narrowing. Columns that are not used in the calculation are eliminated early. Here is an example.

library(rquery) ops <- mk_td( "d", c("col1", "col2", "col3")) %.>% extend(., sum23 := col2 + col3) %.>% select_columns(., 'sum23') cat(format(ops)) ## mk_td("d", c( ## "col1", ## "col2", ## "col3")) %.>% ## extend(., ## sum23 := col2 + col3) %.>% ## select_columns(., ## c('sum23'))

The above query (or operator DAG) represents working with a table that has columns col1, col2, col3. The example is specifying adding a new derived column named sum23 and then limiting down to only this new column. We’ve tried to use operator names that evoke operator names used by Codd.

An important point is: the query is bound to a description of a data frame (or a schema), not bound to any one data frame. Thus we can re-use the query on new data.

The record-keeping in the query knows that only columns col2 and col3 are used.

columns_used(ops) ## $d ## [1] "col2" "col3"

This allows “query narrowing” where the unused columns are not specified in intermediate queries. This is easiest to see if we convert the query to SQL.

ops %.>% to_sql( ., rquery::rquery_default_db_info()) %.>% cat(.) ## SELECT ## "sum23" ## FROM ( ## SELECT ## "col2" + "col3" AS "sum23" ## FROM ( ## SELECT ## "col2", ## "col3" ## FROM ## "d" ## ) tsql_76973382323412881950_0000000000 ## ) tsql_76973382323412881950_0000000001

Notice col1 is never referred to. This can be handy when working with tables with hundreds of columns.

And, using rqdatatable we can use data.table as another data action implementation.

library(rqdatatable) data.frame(col1 = 1, col2 = 2, col3 = 3) %.>% ops %.>% knitr::kable(.) sum23 5

rquery now also has query-shortening. Some dead-values can be eliminated during query construction, before any calculations are attempted.

ops <- mk_td( "example_table", c("col1", "col2", "col3")) %.>% extend(., sum23 := col2 + col3) %.>% extend(., x := 1) %.>% extend(., x := 2) %.>% extend(., x := 3) %.>% extend(., x := 4) %.>% extend(., x := 5) %.>% select_columns(., c('x', 'sum23')) cat(format(ops)) ## mk_td("example_table", c( ## "col1", ## "col2", ## "col3")) %.>% ## extend(., ## sum23 := col2 + col3, ## x := 5) %.>% ## select_columns(., ## c('x', 'sum23'))

Obviously nobody would construct such a bad query, but it is nice that some of the “ick” is optimized automatically.

Both of the above optimizations are deliberately conservative. They are implemented to be correct (not give incorrect results), but are not necessarily super aggressive in eliminating all redundancy.

The underlying theory is a bit long and technical. But both of these optimizations are easy due to the use of category theoretic ideas in the design of the rquery and data_algebra packages (I am working on some notes on this here).

The short form is: the rquery/data_algebra operators have an interpretation in a nice category over table schemas. The schema objects give us pre-condition and post-condition record keeping which enforces correct query composition and query narrowing. The generality of arrow composition gives us the freedom to place optimizations in the composition step. This gives us more options than systems that are restricted to list-concatenation or function composition/abstraction as their notion of composition. It also lets us enforce and check conditions early.

rquery performs most of its checking during query construction. This can catch errors early and save a lot of development time.

ops_bad <- mk_td( "example_table", c("col1", "col2", "col3")) %.>% extend(., sum23 := col2_MISSPELLED + col3) ## Error in check_have_cols(src_columns, required_cols, "rquery::extend"): rquery::extend unknown columns col2_MISSPELLED

Notice an error was raised during query construction. We didn’t have to wait to supply data or translate to SQL.

Let’s take a look at the SQL translation of our final example query.

ops %.>% to_sql( ., rquery::rquery_default_db_info()) %.>% cat(.) ## SELECT ## "x", ## "sum23" ## FROM ( ## SELECT ## "col2" + "col3" AS "sum23", ## 5 AS "x" ## FROM ( ## SELECT ## "col2", ## "col3" ## FROM ## "example_table" ## ) tsql_28722584463189084716_0000000000 ## ) tsql_28722584463189084716_0000000001

There are some more things we would wish optimized away, such as both the inner and outer select. But the SQL is reasonably short, due to the intermediate stages that were optimized out of the original query. Later versions of the system will pick these up, and likely these are also easy for downstream SQL optimizers to eliminate.

An important point: optimizations performed during query construction are shared among all back-ends: data.table, SQL, and Pandas.

Please consider giving rquery a try.

Appendix

We often get asked “why bother with rquery, given dplyr was first.” I’d say: if you are happy with dplyr don’t worry about rquery. Though I would add: you really owe it to yourself to check out data.table, it is by far the best data manipulation system in R.

However, let’s take a look how dbplyr generates a similar SQL query.

library(dplyr) ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union packageVersion("dplyr") ## [1] '0.8.3' library(dbplyr) ## ## Attaching package: 'dbplyr' ## The following objects are masked from 'package:dplyr': ## ## ident, sql packageVersion("dbplyr") ## [1] '1.4.2' con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") copy_to( con, data.frame(col1 = 1, col2 = 2, col3 = 3), name = 'd') tbl(con, 'd') %>% mutate(sum23 := col2 + col3) %>% mutate(x := 1) %>% mutate(x := 2) %>% mutate(x := 3) %>% mutate(x := 4) %>% mutate(x := 5) %>% select(x, sum23) %>% show_query() ## ## SELECT 5.0 AS `x`, `sum23` ## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 4.0 AS `x` ## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 3.0 AS `x` ## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 2.0 AS `x` ## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 1.0 AS `x` ## FROM (SELECT `col1`, `col2`, `col3`, `col2` + `col3` AS `sum23` ## FROM `d`)))))

The dplyr SQL query appears to have neither query narrowing nor query shortening. Again, a downstream SQL optimizer may be able to eliminate these steps (or it may not). However, it also would be desirable to have these sort of eliminations available when using data.table through dtplyr.

Also, dbplyr does not seem to catch errors until compute() or print() are called.

tbl(con, 'd') %>% mutate(sum23 := col2_MISSPELLED + col3) %>% show_query() ## ## SELECT `col1`, `col2`, `col3`, `col2_MISSPELLED` + `col3` AS `sum23` ## FROM `d`

The above SQL refers to a non-existent column col2_MISSPELLED. The query construction and SQL generation steps did not signal any error. Depending on how many queries and steps are before this, this could delay finding this mistake by quite a while (especially when using a high latency SQL engine such as Apache Spark).

DBI::dbDisconnect(con)

RcppArmadillo 0.9.800.3.0

Sat, 07/12/2019 - 17:38

[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A small Armadillo bugfix upstream update 9.800.3 came out a few days ago. The changes, summarised by Conrad in email to me (and for once not yet on the arma site), are fixes for matrix row iterators, better detection of non-hermitian matrices by eig_sym(), inv_sympd(), chol(), expmat_sym() and miscellaneous minor fixes. It also contains a bug fix by Christian Gunning to his sample() implementation.

Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 679 other packages on CRAN.

Changes in RcppArmadillo version 0.9.800.3.0 (2019-12-04)
  • Upgraded to Armadillo release 9.800.3 (Horizon Scraper)

  • The sample function passes the prob vector as const allowing subsequent calls (Christian Gunning in #276 fixing #275)

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


RDieHarder 0.2.1

Sat, 07/12/2019 - 17:26

[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A new version, now at 0.2.1, of the random-number generator tester RDieHarder (based on the DieHarder suite developed / maintained by Robert Brown with contributions by David Bauer and myself) is now on CRAN.

This version has only internal changes. Brian Ripley, tireless as always, is testing the impact of gcc 10 on CRAN code and found that the 'to-be-default' option -fno-common throws off a few (older) C code bases, this one (which is indeed old) included. So in a nutshell, we declared all global variables extern and defined them once and only once in a new file, globals.c. Needless to say, this affects the buildability options. In the past we used to rely on an external library libdieharder (which e.g. I had put together for Debian) but we now just build everything internally in the package.

Which builds on the changes in RDieHarder 0.2.0 which I apparently had not blogged about when it came out on December 21 last year. I had refactored the package to use either the until-then-required-but-now-optional external library, or the included library code. Doing so meant more builds on more systems including Windows.

This (very old) package has no NEWS.Rd file to take a summary from, but the ChangeLog file has all the details.

Thanks to CRANberries, you can also look at a diff from 0.2.1 to 0.2.0, or the older diff from 0.2.0 to 0.1.4.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


From one Regression to Hundreds Within Seconds: A Shiny Specification Curve

Sat, 07/12/2019 - 01:00

[This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Online appendices detailing the robustness of empirical analyses are
paramount but they never let readers explore all reasonable researcher
degrees of freedom. Simonsohn, Simmons and Nelson suggest
a ‘specification curve’ that allows readers to eyeball how a main coefficient
of interest varies across a wide range of specifications. I build on this idea
by making it interactive: A shiny-based web app enables readers to explore the
robustness of findings in detail along the whole curve.

Following up on two blog articles
that introduced the in-development
‘rdfanalysis’ package, the app is new extension of this package.
In essence, it let’s you change the research design choices that you want
to display and then redraws the curve on the fly.

In its simple version, it just needs a data frame with each row containing
an estimate and its choices. In most cases, you also want to include lower and
upper bounds of the estimate as well so that the specification curve can
display a nice confidence interval ribbon. As an example, the first few rows
of data used in the example below look as follows:

# devtools::install_github("joachim-gassen/rdfanalysis") library(rdfanalysis) load(url("https://joachim-gassen.github.io/data/rdf_ests.RData")) kable(head(ests), format.args = list(digits = 3)) na.omit idvs outlier_tment_style outlier_cutoff model_type feffect cluster est lb ub yes gdp_only win 0 level-level none none 0.362 0.3466 0.3783 no gdp_only win 0 level-level none none 0.386 0.3737 0.3983 yes gdp_school win 0 level-level none none 0.131 0.1159 0.1452 no gdp_school win 0 level-level none none 0.075 0.0644 0.0855 yes gdp_ue win 0 level-level none none 0.370 0.3543 0.3860 no gdp_ue win 0 level-level none none 0.325 0.3116 0.3378

This is the type of data that exhaust_design() from the rdfanalysis package
will generate. If you create your own data, you need to inform
plot_rdf_spec_curve() which columns in your data frame contain choices. You
do this by setting an attribute on the data frame. In the case above, choices are
included in columns 1 to 7, so you would set the attribute as follows:

attr(ests, "choices") <- 1:7

Once you have such a data frame, you can plot your specification curve:

attr(ests, "choices") <- 1:7 plot_rdf_spec_curve(ests, "est", "lb", "ub")

Nice. But how does one create the interactive display? Easy. Just call
shiny_rdf_spec_curve(), giving your data and the additional parameters that
you would hand over to plot_rdf_spec_curve() as a list:

shiny_rdf_spec_curve(ests, list("est", "lb", "ub"))

You will see that the app will take a while to display the initial specification
curve. This is because it is based on 11,264 specifications. Once you start to
drill down, the app will become more responsive.

When you focus on only a few specifications you might think “Hey this is nice but
I would rather like to see the actual regression results for these cases”.
This can be done! You can use the workflow of the rdfanalysis package so that
the app will present the actual model results as soon as you have zoomed in on a handful
of specifications. While you're at it, you can also specify your preferred
specification (e.g., the one that you presented in your paper).

design <- define_design(steps = c("read_data", "select_idvs", "treat_extreme_obs", "specify_model", "est_model"), rel_dir = "vignettes/case_study_code") shiny_rdf_spec_curve( ests, list("est", "lb", "ub"), design, "vignettes/case_study_code", "https://joachim-gassen.github.io/data/wb_new.csv", default_choices = list(na.omit = "no", idvs = "full", outlier_tment_style ="win", model_type = "level-log", outlier_cutoff = 0, feffect = "ctryyear", cluster = "ctryyear"))

Please note that the code above will only run when you have forked the
rdfanalysis repo and set the working directory to its root.

Finally, you can add a title and a short info text by setting the title and
abstract parameters and, voilà: your interactive and exhaustive robustness section.

Kudos to Nate Breznau for bringing up the idea to use shiny to visualize the
specification curve. See my former post for more detail on the case and on how to drill deeper into the findings. Feel free to use the in-development 'rdfanalysis' package to exhaust the researcher degrees of freedom in your own projects. If you have remarks about this project, I would love to hear from you. Use the comment section below or reach out via email or twitter.

Enjoy!


Choropleth Map in ggplot2

Fri, 06/12/2019 - 18:33

[This article was first published on R – data technik, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Creating a map in ggplot2 can be surprisingly easy! This tutorial maps the US by state. The dataset is from 1970 and includes several US statistics: population, income, life expectancy, and illiteracy.

I love making maps. While predictive statistics provide great insight, map making was one thing that really got me interested in data science. I'm also glad R provides a great way to make them.

I’d also recommend plotly package where you can make it interactive as you scroll over. All within R!

Here is the first map we will make:

This is the population by state in the US in 1970.

library(ggplot2) library(dplyr) states<-as.data.frame(state.x77) states$region <- tolower(rownames(states)) states_map <- map_data("state") fact_join <- left_join(states_map, states, by = "region") ggplot(fact_join, aes(long, lat, group = group))+ geom_polygon(aes(fill = Population), color = "white")+ scale_fill_viridis_c(option = "C")+ theme_classic()

For the next graph the code will be mostly similar but I will change the fill = option.

Let’s try income:

This is great. Income has a much more normal distribution and seems to vary among the states.

ggplot(fact_join, aes(long, lat, group = group))+ geom_polygon(aes(fill = Income), color = "white")+ scale_fill_viridis_c(option = "C")+ theme_classic()

Last one we’ll make is life expectancy:

Great info here! Life expectancy by state in the 1970s! That particular variable needed a little extra coding; see below:

fact_join$`Life Exp` <- as.numeric(fact_join$`Life Exp`) ggplot(fact_join, aes(long, lat, group = group))+ geom_polygon(aes(fill = `Life Exp`), color = "white")+ scale_fill_viridis_c(option = "C")+ theme_classic()
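
As mentioned above, the plotly package can turn any of these maps into an interactive version. A minimal sketch, assuming plotly is installed:

p <- ggplot(fact_join, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = `Life Exp`), color = "white") +
  scale_fill_viridis_c(option = "C") +
  theme_classic()
plotly::ggplotly(p)   # hover over a state to see its value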

Enjoy your maps! Also, this dataset is publicly available, so feel free to recreate these maps yourself.


R Package for @racently

Fri, 06/12/2019 - 03:00

[This article was first published on R | datawookie, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I recently wrote about an API for @racently. The next logical step was to build a package which wraps the API so that the data can easily be pulled into R.

The package is available here. It is still very much a work in progress: the API only exposes two endpoints, but both of them are wrapped in the package.

Installation

Install using {devtools}.

devtools::install_github("datawookie/racently") library(racently)

Searching for Athletes

The search_athlete() function makes it possible to search for an athlete by name using regular expressions for pattern matching.

Suppose I wanted to find athletes with the surname “Mann”.

search_athlete("Mann$") ## id slug name gender nationality twitter parkrun strava ## 1 016d0a83-1fc4-4613-81cb-ad3f09abac5e Garry Mann M NA NA ## 2 311aeaa2-155f-4566-a9fd-cf30afd67b16 Alec Mann M ZA NA NA ## 3 3207bed9-3dd4-42e7-aee5-c0a0347295c8 Michael Mann M NA NA ## 4 3c22a177-d03e-41c7-be31-a2bc77b2c61c Kevin Mann M ZA NA NA ## 5 52d681ea-8229-47ce-99ab-a4425b4dbcf0 Thomas Mann M NA NA ## 6 7ebe5ce2-6067-48d5-9c68-5f8167d6ab6e Julie Mann F NA NA ## 7 8437fcc7-3a38-4ecb-8708-82d148ea3368 Angie Mann F NA NA ## 8 94d8f1e6-4850-4708-8f70-3210b45d3bde Helen Mann F NA NA ## 9 bdd0716a-cbda-4c2e-8ef1-d5baab63ecfd Patrick Mann M NA NA ## 10 bebc042f-ad26-40f3-99d2-61a65e0296a2 Zeldi Mann F NA NA ## 11 d4cf93a6-69b4-40c4-949e-8444da403e17 Peter Mann M ZA NA NA ## 12 d676cb99-f272-4e9a-be04-fb6ec0aa7245 Steve Mann M ZA NA NA ## 13 d9897020-6e62-47c4-8481-ecbd19ddad12 Nicola Mann F NA NA ## 14 d9cc374d-521a-48bb-86dd-3ec6f05ef59f stuart-mann Stuart Mann M runningmann100 NA NA ## 15 df571960-0451-4cba-8483-17ccb4704680 Kim Mann F NA NA ## 16 e3c1dcc1-c783-4629-891b-f2e5a21daa87 Joanne Mann F NA NA ## 17 e4417318-66ef-42e3-b9f7-6a241b550e74 Jack Mann M NA NA ## 18 f2e9074c-ca75-4a99-b89d-07d377836c21 Roy Mann M ZA NA NA ## 19 f753bf50-4637-4cfb-89f6-efde8563ceff Luke Mann M ZA NA NA

We see that the somewhat legendary @runningmann100, Stuart Mann, is among the list of results.

Athlete Details

The athlete() function allows you to find the results for a specific athlete, specified by id (from the results above).

athlete("d9cc374d-521a-48bb-86dd-3ec6f05ef59f") ## $name ## [1] "Stuart Mann" ## ## $gender ## [1] "M" ## ## $results ## date race distance time club license ## 1 2019-06-09 Comrades 86.8 km 09:53:23 Fourways Road Runners NA ## 2 2019-02-09 Klerksdorp 42.2 km 03:54:09 NA NA ## 3 2018-11-24 Josiah Gumede 42.2 km 04:20:21 Fourways Road Runners 3388 ## 4 2018-10-28 Sapphire Coast 42.2 km 04:27:23 Fourways Road Runners 3388 ## 5 2018-09-02 Vaal River City 42.2 km 04:21:54 Fourways Road Runners 3388 ## 6 2018-06-10 Comrades 90.2 km 10:15:52 Fourways Road Runners 3388 ## 7 2018-05-20 RAC 10.0 km 01:14:21 Fourways Road Runners 3388 ## 8 2018-03-25 Umgeni Marathon 42.2 km 04:20:04 Fourways Road Runners 3388 ## 9 2018-03-10 Kosmos 3-in-1 42.2 km 04:18:51 NA NA ## 10 2018-02-25 Maritzburg 42.2 km 04:31:38 Fourways Road Runners 3388 ## 11 2017-12-03 Heroes Marathon 42.2 km 04:28:06 Fourways Road Runners 3388 ## 12 2017-11-05 Soweto Marathon 42.2 km 04:41:24 Fourways Road Runners 3388 ## 13 2017-09-17 Cape Town Marathon 42.2 km 04:25:06 Fourways Road Runners 3388 ## 14 2017-06-04 Comrades 86.7 km 10:40:59 Fourways Road Runners 3388 ## 15 2017-04-01 Arthur Cresswell Memorial 52.0 km 05:27:19 Fourways Road Runners 3388 ## 16 2017-03-26 Gaterite Challenge 42.2 km 04:14:54 Fourways Road Runners 3388 ## 17 2016-05-29 Comrades 89.2 km 10:19:06 Fourways Road Runners 9120 ## 18 2016-05-01 Deloitte Challenge 42.2 km 04:15:53 Team Vitality NA ## 19 2016-04-24 Chatsworth Freedom 52.0 km 05:36:06 Fourways Road Runners 9120 ## 20 2015-08-30 Mandela Day 42.2 km 04:48:16 Fourways Road Runners NA ## 21 2015-01-25 Johnson Crane Hire 42.2 km 04:28:35 Fourways Road Runners NA ## 22 2009-08-09 Mtunzini Bush 16.0 km 01:31:07 Fourways Road Runners 9120 ## 23 2009-05-24 Comrades 89.2 km 08:45:43 Fourways Road Runners 9120 ## 24 2008-11-28 Sani Stagger 42.2 km 05:25:19 Fourways Road Runners NA

The results include name, gender and the details of all of the relevant results in the system. As Stuart has pointed out, there are many results missing from his profile (he ticks off a marathon pretty much every weekend), but hey, like I said, this is a work in progress.
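
Since the results component is a regular data frame, it is easy to summarise further. A quick sketch, reusing the id from above:

stuart <- athlete("d9cc374d-521a-48bb-86dd-3ec6f05ef59f")
# count races by listed distance, e.g. how many of the recorded results are standard marathons
table(stuart$results$distance)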

Future

We’ll be working hard to add more race results over the coming months (and cleaning up some of the existing data). There are also plans to expose more functionality on the API, which in turn will filter through to the R package.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R | datawookie.

Practical Data Science with R 2nd Edition now in stock at Amazon.com!

Thu, 05/12/2019 - 19:35

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].

Practical Data Science with R 2nd Edition is now in stock at Amazon.com!

Buy it for your favorite data scientist in time for the holidays!


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

lintools 0.1.3 is on CRAN

Thu, 05/12/2019 - 14:28

[This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers].

Version 0.1.3 of the lintools package was accepted on CRAN today.

This version brings a few internal improvements and switches the test suite to the tinytest infrastructure.

lintools provides basic manipulations of linear systems of equalities and inequalities, including: variable elimination (Gaussian and Fourier-Motzkin elimination), the Moore-Penrose pseudoinverse, reduction to reduced row echelon form, value substitution, projecting a vector onto the convex polytope described by a system of (in)equations, simplifying systems by removing spurious columns and rows and collapsing implied equalities, testing whether a matrix is totally unimodular, and computing variable ranges implied by linear (in)equalities.
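To give a flavour of the interface, here is a minimal sketch. It assumes the usual lintools representation of a system as A x (=, <=) b, where the first neq rows are equalities and the next nleq rows are non-strict inequalities; the function names below are taken from the package, but check the documentation for the exact argument defaults.

library(lintools)

# A tiny system: x + y == 2 (equality) and x - y <= 0 (inequality).
A <- matrix(c(1,  1,
              1, -1), nrow = 2, byrow = TRUE)
b <- c(2, 0)

# Eliminate the first variable via Gaussian/Fourier-Motzkin elimination.
eliminate(A = A, b = b, neq = 1, nleq = 1, variable = 1)

# Project the point (3, 3) onto the region defined by the system.
project(x = c(3, 3), A = A, b = b, neq = 1)

# Test whether the coefficient matrix is totally unimodular.
is_totally_unimodular(A)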


To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo.

Sponsorship: SatRdays and useR Groups

Thu, 05/12/2019 - 11:49

[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers].

SatRdays

SatRdays are great: low-cost R events held around the world. What’s not to love?

For the last year, we have been offering automatic sponsorship for all SatRday events. All the organisers have to do is complete a quick questionnaire and the money is sent on its way. So far we have sponsored seven events!

So if you are organising a SatRday, feel free to tap us for sponsorship! We’re particularly proud of how frictionless we’ve made the process.

useR / R-Ladies Groups

Now that we’ve tested the water with SatRday sponsorship, we thought we would do the same with useR groups. Due to the huge number of useR groups (there are over 400), we can’t open this up to the whole world (sorry). To start with, we plan on offering sponsorship to twenty groups from Europe. Hopefully we can increase this number.

So if you want sponsorship for your group, just complete this quick form.

Jumping Rivers are RStudio Certified Full Service Partners. If you need help installing or running RStudio Pro products, give us a shout. We might even be able to do it for free!

The post Sponsorship: SatRdays and useR Groups appeared first on Jumping Rivers.


To leave a comment for the author, please follow the link and comment on their blog: r – Jumping Rivers.
