R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Machine Learning in R: Start with an End-to-End Test

Wed, 13/11/2019 - 10:45

[This article was first published on R – David's blog, and kindly contributed to R-bloggers.]

As a data scientist, you will likely be asked one day to automate your analysis and port your models to production environments. When that happens you cross the blurry line between data science and software engineering, and become a machine learning engineer. I’d like to share a few tips on how to make that transition as successful as possible.

Let’s first discuss testing, and let’s assume without loss of generality that you develop your machine learning application in R. Just like any other software system, your application needs to be thoroughly tested before being deployed. But how do you ensure that your application will perform as expected when dealing with real data? If you’ve structured your application as an R package you already know that you can write unit tests that will run automatically when you check your package; but how do you test the application as a whole?

Testing an application as a whole locally, known as integration testing, should normally not be part of your suite of unit tests: integration tests are frequently much slower than unit tests and would break your regular build-test workflow. (Besides, if you’ll forgive some pedantry, an integration test is by definition not a unit test, since it tests the whole system.)

Nevertheless, integration tests serve a useful purpose in your workflow. They’re often the last test you’ll run locally before releasing your application to more ‘serious’ testing. But that doesn’t mean you should write them last; in this article I’m going to argue that writing them first, before even writing a line of code, tends to produce a better, clearer, and more testable design.

Let’s illustrate with an example. You work for a travel agency whose CEO would like a system that predicts the survival probability of its customers, should their cruise ship hit an iceberg. You are told that the agency has extensive data on shipwreck survivors going back to 1912, when the Titanic sank, and that your application should apply machine learning to a dataset with the same structure as the famous Titanic survivor dataset, although you cannot access the full dataset directly.

You are relatively free to choose the output format of your application, because it will be picked up downstream by other teams. You decide to start with a CSV file for now.

Well, where to begin? From an earlier analysis (ideally documented in a vignette, a notebook, or another report) you know that a random forest is the best model for this problem. So it would make sense to verify that your application does indeed train a random forest from the training data. How would we know that? The simplest way is to save the trained model along with the other output, and check that it is indeed a random forest, trained with the right number of training samples. And this is definitely something we know how to test.

I’m going to assume you have the basic structure of an R package in place, such as the one created by default by RStudio when you start a new project. We’ll call it app:

app
├── DESCRIPTION
├── NAMESPACE
├── R
│   └── hello.R
├── app.Rproj
└── man
    └── hello.Rd

Besides the testthat package, which has become the standard library for writing unit tests in R, I recommend using testhis, which facilitates creating and running integration tests:

usethis::use_testthat()
testthis::use_integration_tests()

This sets up the basic structure for running unit tests and other integration tests:

app
├── DESCRIPTION
├── NAMESPACE
├── R
│   └── hello.R
├── app.Rproj
├── man
│   └── hello.Rd
└── tests
    ├── testthat
    │   └── integration_tests
    └── testthat.R

Any test you create under tests/testthat/integration_tests will be run when you call testthis::test_integration(), but not when running your regular unit tests.

I like to begin with a failing test to make sure that everything is set up correctly. Let’s go ahead and write our end-to-end test, calling it tests/testthat/integration_tests/test-end_to_end.R:

# integrations_test-end_to_end.R test_that("this test fails", { expect_true(FALSE) })

As expected, we can now run the “integration” test and see it fail:

> testthis::test_integration()
Loading app
| OK F W S | Context
|  0 1     | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:2: failure: this test fails
FALSE isn't true.
───────────────────────────────────────────────────
══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 0
Skipped:  0

Good. Let’s now write test code as if a model had been serialized, and check its number of training samples.

# integrations_test-end_to_end.R test_that("a model is output with the right number of training samples", { app::main() })

Hold it right there. This test will, of course, not even run because the main() function doesn’t exist. Let’s create R/main.R with the following contents before we continue:

# main.R
main <- function() {}

The integration test now runs, so we can complete it. But first we will need some training data, namely the Titanic survival dataset. We don’t want main() to read the data from a file, but from an interface that is as similar as possible to the actual production environment.

We’ve been told that the application will read from a PostgreSQL relational database. We could certainly have our test code set up a local PostgreSQL instance, and load it with the training dataset; this will sometimes be the right strategy, but in this case I think it is overkill. Instead, we will exploit the fact that most database systems can be abstracted away through ODBC drivers. R code that talks to PostgreSQL through its ODBC driver should work just as well against any other database system. In particular, we’ll use the in-memory SQLite database for integration testing.

So our integration test code will look like this, in pseudo-code:

# integrations_test-end_to_end.R test_that("a model is output with the right number of training samples", { app::main() })

Well, let’s do it. Create the tests/testthat/testdata-raw folder (or just call testthis::use_testdata_raw()), and copy into it the train.csv dataset downloaded from https://www.kaggle.com/c/titanic/data, renamed to titanic.csv.

We’ll proceed with turning our pseudo-code into real code, refactoring as we go along:

# integrations_test-end_to_end.R test_that("a model is output with the right number of training samples", { # Load Titanic training set titanic <- readr::read_csv("../testdata-raw/titanic.csv") # Copy it to SQLite con <- odbc::dbConnect( odbc::odbc(), driver = "SQLite", database = "file::memory:?cache=shared" ) dplyr::copy_to(con, titanic, temporary = FALSE) # Call main() app::main() # Read serialized model model <- readRDS("output/model.rds") })

Notice the file::memory:?cache=shared URI used to instantiate the in-memory SQLite database. If you simply pass the usual :memory: filename you won’t be able to connect to it from more than one place in your program; but we want the test code to populate the same database that the main application will access, so we must pass this special file::memory:?cache=shared URI.

Notice also the temporary = FALSE argument to dplyr::copy_to(). This is required for the table to be accessible from another connection than the current one.
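As a side note, here is a minimal sketch (my own, assuming an SQLite ODBC driver registered under the name "SQLite") of why the shared-cache URI matters: a table written through one connection becomes visible through a second one, which is exactly what the test code and main() rely on.

con1 <- odbc::dbConnect(odbc::odbc(), driver = "SQLite", database = "file::memory:?cache=shared")
con2 <- odbc::dbConnect(odbc::odbc(), driver = "SQLite", database = "file::memory:?cache=shared")

DBI::dbWriteTable(con1, "titanic", data.frame(Survived = c(0, 1, 1)))
DBI::dbReadTable(con2, "titanic")  # visible from the second connection; with plain ":memory:" it would not be

DBI::dbDisconnect(con1)
DBI::dbDisconnect(con2)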

Calling testthis::test_integration() should now fail:

> testthis::test_integration()
Loading app
| OK F W S | Context
|  0 1 1   | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:14: warning: a model is output with the right number of training samples
cannot open compressed file 'output/model.rds', probable reason 'No such file or directory'

test-test_end-to-end.R:14: error: a model is output with the right number of training samples
cannot open the connection
1: readRDS("output/model.rds") at /Users/dlindelof/Work/app/tests/testthat/integration_tests/test-test_end-to-end.R:14
2: gzfile(file, "rb")
───────────────────────────────────────────────────
══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 1
Skipped:  0

It fails indeed, so we stop writing test code at this point and fix the failure. It fails because no model is serialized yet; we don’t want to over-engineer things at this point, so we fix it simply by instantiating a fake model and serializing it:

# main.R
main <- function() {
  model <- lm(~ 0)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}

The tests pass now, so we keep on going:

# integrations_test-end_to_end.R test_that("a model is output with the right number of training samples", { # Load Titanic training set titanic <- readr::read_csv("../testdata-raw/titanic.csv") # Copy it to SQLite con <- odbc::dbConnect( odbc::odbc(), driver = "SQLite", database = "file::memory:?cache=shared" ) dplyr::copy_to(con, titanic, temporary = FALSE) # Call main() main() # Read serialized model model <- readRDS("output/model.rds") # Verify number of training samples expect_equal(length(model$fitted.values), nrow(titanic)) # (1) }) # (1) check if the model has been trained with the right number of samples

There are several issues with that test, but it is complete (for now) and, as expected, fails:

> testthis::test_integration()
Loading app
| OK F W S | Context
|  0 1     | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:16: failure: a model is output with the right number of training samples
length(model$fitted.values) not equal to nrow(titanic).
1/1 mismatches
[1] 0 - 891 == -891
───────────────────────────────────────────────────
══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 0
Skipped:  0

To pass the test, the main() function needs to open an ODBC connection to SQLite, read the training dataset, train (some) model with it, and serialize the model. But we run into a problem: how does our application know where to find the database it should connect to? And how does it know which driver to load? A reasonable solution is to use the config package with a configuration file. When we run integration tests we’ll use the test settings, and when we run in production we’ll use the default settings.

So here’s a first attempt at config.yml:

# config.yml
test:
  driver: "SQLite"
  database: "file::memory:?cache=shared"

default:
  driver: "PostgreSQL"
  database: "TBD"

We should now be able to read the titanic dataset from main():

# main.R
main <- function(env = "default") {
  connection_params <- config::get(config = env)
  con <- do.call(odbc::dbConnect, c(odbc::odbc(), connection_params))
  train <- dplyr::tbl(con, "titanic") %>% dplyr::collect()
  model <- lm(~ 0, data = train)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}

Running the tests, we see that the dataset gets read without errors by main(), but the number of training samples in the model remains 0:

> testthis::test_integration()
Loading app
| OK F W S | Context
|  0 1     | test_end-to-end [0.1 s]
───────────────────────────────────────────────────
test-test_end-to-end.R:16: failure: a model is output with the right number of training samples
length(model$fitted.values) not equal to nrow(titanic).
1/1 mismatches
[1] 0 - 891 == -891
───────────────────────────────────────────────────
══ Results ════════════════════════════════════════
Duration: 0.1 s

OK:       0
Failed:   1
Warnings: 0
Skipped:  0

That is simply because we have not specified any target variable in our call to lm(). Let’s fix that:

# main.R
main <- function(env = "default") {
  connection_params <- config::get(config = env)
  con <- do.call(odbc::dbConnect, c(odbc::odbc(), connection_params))
  train <- dplyr::tbl(con, "titanic") %>% dplyr::collect()
  model <- lm(Survived ~ 0, data = train)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}

And the tests now pass:

> testthis::test_integration()
Loading app
| OK F W S | Context
|  1       | test_end-to-end
══ Results ════════════════════════════════════════
OK:       1
Failed:   0
Warnings: 0
Skipped:  0

Let’s now recap what we have achieved. We have an integration test that verifies that a linear model (or any model, really, provided it exposes a fitted.values attribute) gets trained by our application, using a dataset read over an ODBC connection we provide. But more importantly, we have written an automated test that exercises the whole application, and will keep on doing so as we implement the missing pieces.

It doesn’t feel like we have accomplished much; after all, our model is certainly not the best one for the classification task at hand, and we don’t even train it in a sensible manner. But consider this: before you began the project, chances are the customer (or the boss) didn’t really know what they wanted. Now you are in a position to hold the following conversation:

you: We have decided that the first user story should prove that the application, when run, trains a model on the Titanic dataset. This is what this first integration test demonstrates.
boss: How good is that model?
you: Its accuracy is 62%, if that’s what you mean.
boss: Really? That’s pretty impressive for a first shot. How did you achieve that?
you: It predicts that everybody dies. And 62% of passengers in the training dataset did indeed die.
boss: (shifts uncomfortably) Hm, I see. Well, I guess we should think of alternative metrics. We’d rather avoid false positives, even at the cost of some accuracy. Can you do something about it?

So by getting a woefully incomplete, but working, application out so soon, you’ve been able to help your stakeholders understand what really matters to them. Better yet, you are even in a position to write the next item on your backlog as a precise user story:

As a customer, I want to know if I'm going to die on this cruise,
but a death prediction had better be right 90% of the time.
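For the record, here is a small sketch (mine, not from the original post) of where a figure like 62% comes from: a model that predicts death for every passenger is right exactly as often as the share of non-survivors in the training data, which is also why the 90%-correct requirement in the user story is a much tougher target.

titanic <- readr::read_csv("tests/testthat/testdata-raw/titanic.csv")

# Accuracy of the trivial "everybody dies" model:
# the proportion of passengers in the training data who did not survive.
mean(titanic$Survived == 0)
#> about 0.62 on the Kaggle training set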

Time to write a new integration test and our first unit tests.


Durban EDGE DataQuest

Wed, 13/11/2019 - 03:00

[This article was first published on R | datawookie, and kindly contributed to R-bloggers.]

The Durban EDGE (Economic Development and Growth in eThekwini) DataQuest was held at UKZN (Westville Campus) on 13 November 2019. Participants were tasked with creating something interesting and useful with the civic data on the new Durban EDGE Open Data Portal developed by Open Data Durban.

These datasets were available:

  • eThekwini Water and Sanitation
  • Durban Skills Audit 2016
  • eThekwini Financial Statistics Survey
  • eThekwini Rate Collection and Valuation Roll
  • eThekwini Business Licensing
  • eThekwini DMOSS (Durban Metropolitan Open Space System)
  • Rentable Office Data
  • eThekwini Labour Force
  • eThekwini Building Plans
  • Durban Film Sector Data
  • KZN Formal Education – Current
  • eThekwini Electricity Usage and Access, and
  • eThekwini Ward Maps.

None of the participants had prior experience with R, but most had used Excel. I’d hoped to get at least a few of them to try using R. To make this more accessible I introduced them to RStudio Cloud, which is such a phenomenal tool for this sort of gig since it requires zero setup on the participants’ machines. I also put together a couple of starter scripts:

Let’s take a quick look at them.

Electricity Usage

This script loads the electricity consumption data, does some simple wrangling (mostly just fixing the year column) and then creates a few plots.
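The starter scripts themselves are not reproduced in this feed; a minimal sketch along those lines might look as follows, where the file name and the year, consumers and consumption columns are hypothetical stand-ins for the portal's actual layout.

library(readr)
library(dplyr)
library(ggplot2)

# Hypothetical file and column names
electricity <- read_csv("ethekwini-electricity-usage.csv") %>%
  mutate(year = as.integer(substr(year, 1, 4)))  # "fix the year column": keep the first year of labels like "2007/08"

# Number of (formal) electricity consumers over time
ggplot(electricity, aes(year, consumers)) +
  geom_line() +
  labs(x = "Year", y = "Number of consumers")

# Average energy consumed per customer
electricity <- mutate(electricity, per_customer = consumption / consumers)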

The first plot shows how the number of (formal) electricity consumers has increased over time.

We see that there is a systematic increase in the number of consumers, which makes sense in terms of population growth and urbanisation.

How much energy is being consumed?

Again there is a systematic growth in energy consumption. But something clearly happens in 2007: the introduction of load shedding.

With these two pieces of information we can also assess the average power consumed per customer.

Distribution of Drivers’ Licenses

This script merges data from two sources:

  • a KML file giving ward boundaries and
  • a skills survey.

Although there’s a wealth of informative data in the survey, to keep things simple I used a single Boolean column: whether or not the respondent had a driver’s license.

Mashing these two datasets together created the map below: the proportion of people with drivers’ licenses broken down by ward.
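As with the first script, the code is not reproduced here; a sketch of the merge, with hypothetical file and column names, could look like this (sf reads the KML, dplyr aggregates the survey, ggplot2 draws the choropleth):

library(sf)
library(readr)
library(dplyr)
library(ggplot2)

wards  <- st_read("ethekwini-ward-boundaries.kml")   # hypothetical file name
survey <- read_csv("durban-skills-audit-2016.csv")   # hypothetical file name

licences_by_ward <- survey %>%
  group_by(ward) %>%                                 # hypothetical column names
  summarise(prop_licensed = mean(has_licence))

wards %>%
  left_join(licences_by_ward, by = c("Name" = "ward")) %>%
  ggplot() +
  geom_sf(aes(fill = prop_licensed)) +
  labs(fill = "Proportion with a driver's licence")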

Both of these scripts provide potentially interesting starting points for a deeper analysis. The main motivation for them though was simply to show how such an analysis can be done in R.


The Colour of Everything

Wed, 13/11/2019 - 01:00

[This article was first published on Data Imaginist, and kindly contributed to R-bloggers.]

I’m happy to announce that farver 2.0 has landed on CRAN. This is a big release, comprising a rewrite of much of the internals along with a range of new functions and improvements. Read on to find out what this is all about.

The case for farver

The first version of farver really came out of necessity, as I had identified a major performance bottleneck in gganimate related to converting colours into Lab colour space and back when tweening them. This was a result of grDevices::convertColor() not being meant for use with millions of colour values. I built farver in order to address this very specific need, which in turn led Brodie Gaslam to look into speeding up the grDevices function. The bottom line is that, while farver is still the fastest at converting between colour spaces, grDevices is now so fast that I probably wouldn’t have bothered to build farver in the first place had it been like this all along. I find this a prime example of fruitful open-source competition and couldn’t be happier that Brodie took it upon himself.

So why a shiny new version? As part of removing compiled code from scales, we decided to adopt farver for colour interpolation, and the code could use a brush-up. I’ve become much more practised at writing compiled code, and there were some shortcomings in the original implementation that needed to be addressed if scales (and thus ggplot2) were to depend on it. Further, I usually work on larger frameworks, and there is a certain joy in picking a niche area that you care about and going ridiculously overboard on tooling, without worrying about whether it benefits anyone other than yourself (ambient is another example of such indulgence).

The new old

The former version of farver was quite limited in functionality. It had two functions, convert_colour() and compare_colour(), which did colour space conversion and colour distance calculations respectively. No outward changes have been made to these functions, but internally a lot has happened. The old versions had no input validation, so passing in colours with NA, NaN, Inf, or -Inf would give you some truly weird results back. Further, the input and output were not capped to the range of the given colour space, so you could in theory end up with negative RGB values if you converted from a colour space with a larger gamut than sRGB. Both of these issues have been rectified in the new version. Any non-finite value in any channel will result in NA in all channels in the output (for conversion) or an NA distance (for comparison).

library(farver)

colours <- cbind(r = c(0, NA, 255), g = c(55, 165, 20), b = c(-Inf, 120, 200))
colours
##        r   g    b
## [1,]   0  55 -Inf
## [2,]  NA 165  120
## [3,] 255  20  200

convert_colour(colours, from = 'rgb', to = 'yxy')
##            y1        x        y2
## [1,]       NA       NA        NA
## [2,]       NA       NA        NA
## [3,] 25.93626 0.385264 0.1924651

Further, input is now capped to the channel range (if any) before conversion, and output is capped again before returning the result. The latter means that convert_colour() is only symmetric (ignoring rounding errors) if the colours are within gamut in both colour spaces.

# Equal output because values are capped between 0 and 255
colours <- cbind(r = c(1000, 255), g = 55, b = 120)
convert_colour(colours, 'rgb', 'lab')
##             l        a        b
## [1,] 57.41976 76.10097 12.44826
## [2,] 57.41976 76.10097 12.44826

Lastly, a new colour space has been added: CIELch(uv) (hcl in farver), a cousin of CIELch(ab) (lch). Both are polar transformations, but the former is based on Luv values and the latter on Lab. The two colour spaces are often used interchangeably (though they are not equivalent), and as the grDevices::hcl() function is based on the Luv space it made sense to provide an equivalent in farver.

The new new

The new functionality mainly revolves around the encoding of colours in text strings. In many programming languages colours can be encoded into strings as #RRGGBB, where each channel is given as hexadecimal digits. This is also how colours are mostly passed around in R (R also has a list of recognised colour names that can be given as aliases instead of the hex string – see grDevices::colours() for the list). The encoding is convenient because it allows colours to be stored in vectors, and thus in data frame columns or arrays, but it means that if you need to perform operations on a colour you first have to decode the string into channels, potentially convert it into the required colour space, do the manipulation, convert back to sRGB, and encode it into a string again. Encoding and decoding have been supported in grDevices with rgb() and col2rgb() respectively, both of which are pretty fast. col2rgb() has a quirk in that the output has the channels in the rows instead of the columns, contrary to how decoded colours are presented everywhere else:

grDevices::col2rgb(c('#56fec2', 'red'))
##       [,1] [,2]
## red     86  255
## green  254    0
## blue   194    0

farver sports two new functions that, besides providing consistency in the output format, also eliminate some steps in the workflow described above:

# Decode strings with decode_colour
colours <- decode_colour(c('#56fec2', 'red'))
colours
##        r   g   b
## [1,]  86 254 194
## [2,] 255   0   0

# Encode with encode_colour
encode_colour(colours)
## [1] "#56FEC2" "#FF0000"

Besides the basic use shown above, both functions allow input and output in colour spaces other than sRGB. That means that if you need to manipulate some colour in Lab space, you can simply decode directly into that space, do the manipulation, and encode directly back. The functionality is baked into the compiled code, meaning that a lot of memory allocation is spared, making this substantially faster than a grDevices-based workflow:

library(ggplot2)

# Create some random colour strings
colour_strings <- sample(grDevices::colours(), 5000, replace = TRUE)

# Get Lab values from a string
timing <- bench::mark(
  farver = decode_colour(colour_strings, to = 'lab'),
  grDevices = convertColor(t(col2rgb(colour_strings)), 'sRGB', 'Lab', scale.in = 255),
  check = FALSE,
  min_iterations = 100
)

plot(timing, type = 'ridge') +
  theme_minimal() +
  labs(x = NULL, y = NULL)

Can we do better than this? If the purpose is simply to manipulate a single channel in a colour encoded as a string, we may forgo the encoding and decoding completely and do it all in compiled code. farver provides a family of functions for doing channel manipulation on string-encoded colours. The channel can be any channel in any colour space supported by farver, and the decoding, manipulation and encoding are done in one pass. If you have a lot of colours and need to, for example, increase their darkness, this can save a lot of memory allocation:

# a lot of colours
colour_strings <- sample(grDevices::colours(), 500000, replace = TRUE)

darken <- function(colour, by) {
  colour <- t(col2rgb(colour))
  colour <- convertColor(colour, from = 'sRGB', to = 'Lab', scale.in = 255)
  colour[, 'L'] <- colour[, 'L'] * by
  colour <- convertColor(colour, from = 'Lab', to = 'sRGB')
  rgb(colour)
}

timing <- bench::mark(
  farver = multiply_channel(colour_strings, channel = 'l', value = 1.2, space = 'lab'),
  grDevices = darken(colour_strings, 1.2),
  check = FALSE,
  min_iterations = 100
)

plot(timing, type = 'ridge') +
  theme_minimal() +
  labs(x = NULL, y = NULL)

The bottom line

The new release of farver provides invisible improvements to the existing functions and a range of new functionality for working efficiently with string-encoded colours. You will be using it indirectly following the next release of scales if you are plotting with ggplot2, but you shouldn’t be able to tell. If you somehow end up having to manipulate millions of colours, then farver is still the king of the hill by a large margin when it comes to performance, and I personally believe that it also provides a much cleaner API than any of the alternatives.


Automating update of a fiscal database for the Euro Area

Wed, 13/11/2019 - 00:00

[This article was first published on Macroeconomic Observatory - R, and kindly contributed to R-bloggers.]

Our purpose is to write a program that automatically updates a quarterly fiscal database for the Euro Area. The main difficulty of this exercise is to build long series that go back as far as the 1980s.

We use two sources to build the database: the historical database developed by Paredes et al. (2014) (hereafter PPP), which stops in 2013, and the latest Eurostat data. Throughout this article, we explain how we chained each PPP series with the Eurostat data.

Both databases are taken without any seasonal adjustment. At the end of the post, the chained series are seasonally adjusted with the X-13 methodology, using the seasonal package developed by Sax (2016).

To keep the update automatic, the recent observations are taken from DBnomics using the rdbnomics package. All the code is written in R, thanks to R Core Team (2016) and RStudio Team (2016).

The database will contain the following series:

  • Direct taxes
  • Indirect taxes
  • Social security contributions by employees
  • Social security contributions by employers
  • Government consumption
  • Government investment
  • Government transfers
  • Government subsidies
  • Government compensation of employees
  • Unemployment benefits
  • Government debt
  • Interest payments
  • Total revenues
  • Total expenditures
Historical data

First we get the historical series built by Paredes et al. (2014). One problem is that the series are not all directly usable: the series of social contributions by contributor do not exist before 1991.

url <- "PPP_raw.xls" ppp <- read_excel(url, skip = 1) ppp %<>% transmute(period = as.Date(as.yearqtr(`MILL. EURO, RAW DATA, NON-SEAS. ADJUSTED, SMOOTHED ESTIMATES`, format="%YQ%q")), totexp = TOE, # Total expenditures pubcons = GCN, # General government consumption expenditures pubinves = GIN, # General government investment tfs = THN, # Social payments unemp = `of which UNB`, # Unemployment benefits (among social payments) salaries = COE, # Compensation of employees subs = SIN, # Subsidies intpay = INP, # General government interest payments totrev = TOR, # Total revenue indirtax = TIN, # Total indirect taxes dirtax = DTX, # Total direct taxes scr = as.numeric(SCR), # Social contribution by employers sce = as.numeric(SCE), # Social contribution by employees and self-employed sct = SCT, # Total social security contributions debt = MAL) %>% # Euro area general government debt filter(!is.na(period))

Assuming that the split of social contributions between employers and households remains stable before 1991, we can reconstruct the contributions of employees and of employers from the series of total contributions. Using this technique we infer the series of social contributions by contributor before 1991.

# We calculate the ratio of social contribution by employers for the first point in our series
prcent <- transmute(ppp, scr_sct = scr/sct) %>%
  na.omit() %>%
  first() %>%
  as.numeric()

# Using the ratio, we reconstruct earlier social contribution by contributor
scr_sce_before91 <- filter(ppp, is.na(scr)) %>%
  select(period, sct, scr, sce) %>%
  transmute(period,
            scr = prcent*sct,
            sce = sct - scr) %>%
  gather(var, value, -period)

# We reinject the constructed series in the ppp database
ppp %<>%
  select(-sct) %>%
  gather(var, value, -period, na.rm = TRUE) %>%
  bind_rows(scr_sce_before91) %>%
  arrange(var, period)

maxDate <- ppp %>%
  group_by(var) %>%
  summarize(maxdate = max(period)) %>%
  arrange(maxdate)
kable(maxDate)

var       maxdate
--------  ----------
debt      2013-10-01
dirtax    2013-10-01
indirtax  2013-10-01
intpay    2013-10-01
pubcons   2013-10-01
pubinves  2013-10-01
salaries  2013-10-01
sce       2013-10-01
scr       2013-10-01
subs      2013-10-01
tfs       2013-10-01
totexp    2013-10-01
totrev    2013-10-01
unemp     2013-10-01

Recent data

The historical data series stop in 2013. For the latest points, we use DBnomics to get Eurostat data. Eurostat's series on social contributions and on unemployment benefits present difficulties as well. We thus download the series from DBnomics in three steps:

  1. We take the series on social contributions and we treat them in order to build quarterly series by contributors.
  2. We take the series on unemployment benefits and we treat them in order to build quarterly series.
  3. We take the series that do not present any problems.
Special case: social contributions

Download annual data

var_taken <- c('D613', # Annual Households' actual social contributions (D613) for general govt only (S13)
               'D612', # Annual Employers' imputed social contributions
               'D611') # Annual Employers' actual social contributions (D611) for general govt only (S13)

url_variables <- paste0(var_taken, collapse = "+")
filter <- paste0('A.MIO_EUR.S13.', url_variables, '.EA19')
df <- rdb("Eurostat", "gov_10a_taxag", mask = filter)

data_1 <- df %>%
  select(period, var = na_item, value) %>%
  spread(var, value) %>%
  mutate(sce = D613 + D612,
         scr = D611) %>%
  select(-D611, -D612, -D613) %>%
  gather(var, value, -period) %>%
  mutate(year = year(period))

The series of actual social contributions present two problems: (i) they are not quarterly; and (ii) they are only available up to 2018. We fix this by using two quarterly series: net total social contributions, and employers' contributions for the total economy.

From annual to quarterly data

# Quarterly Net social contributions, receivable (D61REC) for general govt only (S13)
df <- rdb("Eurostat", "gov_10q_ggnfa", mask = "Q.MIO_EUR.NSA.S13.D61REC.EA19")

qsct <- df %>%
  transmute(period,
            var = 'sct',
            value) %>%
  mutate(year = year(period))

maxtime <- summarize(qsct, maxdate = max(period))

To turn the two annual series of social contributions into quarterly series, we use the series of quarterly total net social contributions to calculate the share of each contributor for each year. Using this share and the quarterly value of the total net social contributions, we can deduce the quarterly value of the net social contributions of each contributor.

# Calculate total amount of sct by year
qsct_a <- qsct %>%
  group_by(year) %>%
  summarise(value_a = sum(value))

qsct %<>% left_join(qsct_a, by = "year")

# Convert data from annual to quarterly
qsce_uncomplete <- filter(data_1, var == "sce") %>%
  full_join(qsct, by = "year") %>%
  transmute(period = period.y,
            var = var.x,
            value = value.y*value.x/value_a) %>%
  filter(!is.na(value))

# Convert data from annual to quarterly
qscr_uncomplete <- filter(data_1, var == "scr") %>%
  full_join(qsct, by = "year") %>%
  transmute(period = period.y,
            var = var.x,
            value = value.y*value.x/value_a) %>%
  filter(!is.na(value))

We plot the series to compare the constructed quarterly series with the original annual series.

plot_treatment <- bind_rows(qscr_uncomplete, qsce_uncomplete) %>%
  mutate(Origin = "Deduced quarterly series",
         value = 4*value) %>%  # We multiply by 4 because on the plot we compare quarterly levels with annual levels
  bind_rows(mutate(data_1, Origin = "Original annual series")) %>%
  mutate(var = ifelse(var == "sce", "Social contribution of households", "Social contribution of employers")) %>%
  select(-year)

ggplot(plot_treatment, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~var, ncol = 2, scales = "free_y") +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(strip.text = element_text(size = 12), axis.text = element_text(size = 8)) +
  theme(legend.title = element_blank())

Most recent data

Now that we have the quarterly values, we use the series of total employers’ contributions for the total economy, along with the share of each contributor in total contributions, to deduce the latest points of the contributions of households and employers.

# Quarterly Employers SSC for total economy
df <- rdb("Eurostat", "namq_10_gdp", mask = "Q.CP_MEUR.NSA.D12.EA19")

qscr_toteco <- df %>%
  transmute(period,
            var = 'scr',
            value) %>%
  mutate(year = year(period))

# Using recent data on employers total contribution we chain forward the social contribution of employers
qscr <- chain(to_rebase = qscr_toteco,
              basis = qscr_uncomplete,
              date_chain = max(qscr_uncomplete$period),
              is_basis_the_recent_data = FALSE) %>%
  arrange(period)

# Assuming the ratio of social contribution by contributors remains constant over time,
# we deduce social contribution of households
qsce <- bind_rows(qsce_uncomplete,
                  select(qsct, period, value, var),
                  qscr) %>%
  filter(period <= maxtime$maxdate) %>%
  spread(var, value, fill = 0) %>%
  transmute(period,
            sce = ifelse(period <= max(qsce_uncomplete$period), sce, sct - scr)) %>%
  gather(var, value, -period) %>%
  arrange(period)

The two series of employers’ contributions differ in level: we are interested in the social contributions of employers received by the general government only, not those for the total economy. But the patterns of the two series are very similar, so by chaining them we take the variations of the total-economy series and apply them to the level of the actual social contributions received by the general government.

plot_treatment <- bind_rows(qscr_uncomplete, qsce_uncomplete) %>%
  mutate(Origin = "Quarterly series") %>%
  bind_rows(mutate(qscr_toteco, Origin = "Original quarterly series (for total economy)"),
            mutate(bind_rows(qsce, qscr), Origin = "Chained series")) %>%
  mutate(var = ifelse(var == "sce", "Contribution of households", "Contribution of employers")) %>%
  select(-year)

ggplot(plot_treatment, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~var, ncol = 2, scales = "free_y") +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(strip.text = element_text(size = 12), axis.text = element_text(size = 8)) +
  theme(legend.title = element_blank()) +
  ggtitle("Social contribution forward chaining")

Special case: unemployment benefits

We retrieve government social expenditures and compute their quarterly share for each year.

socialexp <- rdb("Eurostat", "gov_10q_ggnfa", mask = "Q.MIO_EUR.NSA.S13.D62PAY.EA19") %>%
  mutate(year = year(period)) %>%
  select(period, value, year) %>%
  group_by(year) %>%
  mutate(sum = sum(value),
         ratio = value/sum) %>%
  ungroup() %>%
  select(-value, -year, -sum)

Then we retrieve the latest annual data on unemployment benefits, put them in a quarterly table and use the previous ratio of quarterly social expenditures to compute quarterly unemployment benefits.

df <- rdb("Eurostat","gov_10a_exp",mask = "A.MIO_EUR.S13.GF1005.TE.EA19") recent_unemp <- df %>% mutate(year=year(period)) %>% select(period,value,year) recent_unemp_q <- tibble(period=seq(min(recent_unemp$period), length.out=nrow(recent_unemp)*4, by = "quarter"), year=year(period)) %>% left_join(recent_unemp,by="year") %>% select(-period.y,-year) %>% rename(period=period.x) unemp_q <- recent_unemp_q %>% inner_join(socialexp,by="period") %>% mutate(value=value*ratio, var="unemp") %>% select(-ratio)

We compare the historical data with the new quarterly series.

data_plot <- ppp %>%
  filter(var == "unemp") %>%
  mutate(var = "From PPP") %>%
  bind_rows(mutate(unemp_q, var = "New measure")) %>%
  filter(year(period) >= 2007)

ggplot(data_plot, aes(period, value, colour = var)) +
  geom_line() +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(strip.text = element_text(size = 12), axis.text = element_text(size = 8)) +
  theme(legend.title = element_blank()) +
  ggtitle("Unemployment benefits series comparison")

Chaining recent data with historical

We now fetch the remaining series from DBnomics; none of them has to be treated before it can be used. We then add to the resulting data frame the series of social contributions by contributor and of unemployment benefits.

# List of var that can be taken on the first dataset
var_taken <- c('P3',     # Public consumption
               'P51G',   # Public gross fixed capital formation
               'D62PAY', # Social transfers
               'D1PAY',  # Compensation of employees
               'D3PAY',  # Subsidies
               'D2REC',  # Indirect taxes on production
               'D5REC',  # Direct taxation on income and wealth
               'D41PAY', # Interest payments
               'TE',     # Total expenditures
               'TR')     # Total revenue

# We build the URL, fetch the data and convert it to a data frame
url_variables <- paste0(var_taken, collapse = "+")
filter <- paste0('Q.MIO_EUR.NSA.S13.', url_variables, '.EA19')
data_1 <- rdb("Eurostat", "gov_10q_ggnfa", mask = filter)

# Government consolidated gross debt is in a different dataset so we make a second call to DBnomics
data_2 <- rdb("Eurostat", "gov_10q_ggdebt", mask = "Q.GD.S13.MIO_EUR.EA19")

# We then bind the two data frames fetched on DBnomics
recent_data <- bind_rows(data_1, data_2) %>%
  transmute(value,
            period,
            var = as.factor(na_item))

# Used to harmonize DBnomics series var names with PPP
var_names <- c('pubcons', 'pubinves', 'tfs', 'salaries', 'subs',
               'indirtax', 'dirtax', 'intpay', 'totexp', 'totrev', 'debt')
recent_data$var <- plyr::mapvalues(recent_data$var, c(var_taken, 'GD'), var_names)

# We include the series of social contributions
var_names <- c(var_names, "sce", "scr", "unemp")
recent_data %<>% bind_rows(qsce, qscr, unemp_q)

We can check the last available date for each series.
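The resulting table is not reproduced in this feed; the check itself simply mirrors the maxDate computation used earlier on the historical data, for example:

recent_data %>%
  group_by(var) %>%
  summarize(maxdate = max(period)) %>%
  arrange(maxdate) %>%
  kable()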

All that is left to do is to chain the data frame of recent series with the historical database of Paredes et al. (2014). Once the data are chained, we use the seasonal package to remove the seasonal component of each series. Below, we present the treatment of each variable so we can check graphically that what we obtain is consistent.

chained_data <- recent_data %>%
  chain(basis = .,
        to_rebase = filter(ppp, var %in% var_names),
        date_chain = "2007-01-01") %>%
  arrange(var, period) %>%
  select(period, var, value) %>%
  mutate(Origin = "Chained")

to_deseason <- chained_data %>%
  select(-Origin) %>%
  spread(var, value)

deseasoned <- bind_rows(lapply(unique(chained_data$var),
                               function(var) deseason(var_arrange = var, source_df = to_deseason))) %>%
  mutate(Origin = "Final series")

ppp %<>% mutate(Origin = "Historical data")

to_plot <- bind_rows(chained_data, deseasoned, ppp)

Total revenue and expenditures

plot_totrev <- to_plot %>%
  filter(var == "totrev", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "totrev", Origin != "Final series"), ind2 = "1 - Chain series"))

plot_totexp <- to_plot %>%
  filter(var == "totexp", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "totexp", Origin != "Final series"), ind2 = "1 - Chain series"))

p1 <- ggplot(plot_totrev, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) + ggtitle("Total revenue") +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16))

p2 <- ggplot(plot_totexp, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) + ggtitle("Total expenditures") +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16))

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Public direct spending

The chained series of public consumption strongly resembles the historical series. Here our manipulation of the series allows us to create a long series without much loss.

On the chained series of investment there is a (visually) significant difference in level with respect to the historical one. The chaining method allows us to build a reasonable proxy for the series of public investment, but at a certain loss.

plot_cons <- to_plot %>%
  filter(var == "pubcons", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "pubcons", Origin != "Final series"), ind2 = "1 - Chain series"))

plot_inves <- to_plot %>%
  filter(var == "pubinves", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "pubinves", Origin != "Final series"), ind2 = "1 - Chain series"))

p1 <- ggplot(plot_cons, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) + ggtitle("Public consumption") +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16))

p2 <- ggplot(plot_inves, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) + ggtitle("Public Investment") +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16))

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Specific spending

Both chainings seem to be consistent with the historical series. Here our manipulation does not entail much loss.

plot_salaries <- to_plot %>%
  filter(var == "salaries", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "salaries", Origin != "Final series"), ind2 = "1 - Chain series"))

plot_subs <- to_plot %>%
  filter(var == "subs", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "subs", Origin != "Final series"), ind2 = "1 - Chain series"))

p1 <- ggplot(plot_salaries, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) + ggtitle("Compensation of employees") +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16))

p2 <- ggplot(plot_subs, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) + ggtitle("Subsidies") +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16))

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Taxes

Both chainings seem to be consistent with the historical series. Here our manipulation does not entail much loss.

plot_indir <- to_plot %>%
  filter(var == "indirtax", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "indirtax", Origin != "Final series"), ind2 = "1 - Chain series"))

plot_dir <- to_plot %>%
  filter(var == "dirtax", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "dirtax", Origin != "Final series"), ind2 = "1 - Chain series"))

p1 <- ggplot(plot_indir, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16)) +
  ggtitle("Indirect taxation")

p2 <- ggplot(plot_dir, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16)) +
  ggtitle("Direct taxation")

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Debt and interest payments

The chained series of general government debt deviates slightly from the historical one, but the deviation is very small and both chainings seem consistent. Here the seasonality of both series is weaker, and the final series strongly resemble the chained ones.

plot_debt <- to_plot %>%
  filter(var == "debt", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "debt", Origin != "Final series"), ind2 = "1 - Chain series"))

plot_intpay <- to_plot %>%
  filter(var == "intpay", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "intpay", Origin != "Final series"), ind2 = "1 - Chain series"))

p1 <- ggplot(plot_intpay, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16)) +
  ggtitle("Interest payments")

p2 <- ggplot(plot_debt, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16)) +
  ggtitle("General government debt")

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Total social transfers and unemployment benefits

plot_unemp <- to_plot %>%
  filter(var == "unemp", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "unemp", Origin != "Final series"), ind2 = "1 - Chain series"))

plot_transf <- to_plot %>%
  filter(var == "tfs", Origin != "Historical data") %>%
  mutate(ind2 = "2 - Remove seasonal component") %>%
  bind_rows(data.frame(filter(to_plot, var == "tfs", Origin != "Final series"), ind2 = "1 - Chain series"))

p1 <- ggplot(plot_transf, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16)) +
  ggtitle("Social transfers")

p2 <- ggplot(plot_unemp, aes(period, value, colour = Origin)) +
  geom_line() +
  facet_wrap(~ind2, scales = "free_y", ncol = 1) +
  scale_x_date(expand = c(0.01, 0.01)) +
  theme + xlab(NULL) + ylab(NULL) +
  theme(legend.title = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16)) +
  ggtitle("Unemployment benefits")

grid.arrange(arrangeGrob(p1, p2, ncol = 2))

Building the final database

Comparing the obtained series with PPP

We want to check that the final database we created resembles the seasonally adjusted one of Paredes et al. (2014).

url <- "PPP_seasonal.xls" pppSA <- read_excel(url, skip = 1) pppSA %<>% transmute(period =as.Date(as.yearqtr(`MILL. EURO, RAW DATA, SEASONALLY ADJUSTED, SMOOTHED ESTIMATES`, format="%YQ%q")), totexp =TOE, # Total expenditures pubcons =GCN, # General government consumption expenditure pubinves =GIN, # General government investment tfs =THN, # Social payments salaries =COE, # Compensation of employees subs =SIN, # Subsidies unemp =`of whichUNB`, # Unemployment benefits (among social payments) intpay =INP, # General government interest payments totrev =TOR, # Total revenue indirtax =TIN, # Total indirect taxes dirtax =DTX, # Total direct taxes scr =as.numeric(SCR), # Social contribution by employers sce =as.numeric(SCE), # Social contribution by employees and self-employed debt =MAL) %>% # Euro area general government debt filter(!is.na(period)) plot_compare <- gather(pppSA, var, value, -period, convert= TRUE) %>% na.omit() %>% mutate(Origin="ppp") %>% bind_rows(deseasoned) %>% mutate(var=as.factor(var)) xlab_plot <- c('Euro Area government debt', 'Direct taxes', 'Indirect taxes', 'Interest payments', 'Public consumption', 'Public investment', 'Compensation of employees', "Households' social contribution", "Employers' social contribution", 'Subsidies', 'Social transfers', 'Total expenditures', 'Total revenues', 'Unemployment benefits') plot_compare$var <- plyr::mapvalues(plot_compare$var,levels(plot_compare$var),xlab_plot) ggplot(plot_compare,aes(period,value,colour=Origin))+ geom_line()+ facet_wrap(~var,ncol=3,scales = "free_y")+ scale_x_date(expand = c(0.01,0.01)) + theme + xlab(NULL) + ylab(NULL)+ theme(legend.title=element_blank()) + theme(strip.text=element_text(size=10), axis.text=element_text(size=9))+ ggtitle("Fiscal data for the Euro Area")

Final fiscal database for the Euro area

We eventually want to build a database close to that of Paredes et al. (2014). You can download all the raw series here.

EA_Fipu_rawdata <- deseasoned %>%
  select(-Origin) %>%
  spread(var, value)

EA_Fipu_rawdata %>%
  write.csv(file = "EA_Fipu_rawdata.csv", row.names = FALSE)

The data are then normalized by population and prices where needed, using series built to reproduce Smets and Wouters (2003).

sw03 <- read.csv("http://shiny.nomics.world/data/EA_SW_rawdata.csv") %>%
  mutate(period = ymd(period)) %>%
  filter(period >= "1980-01-01")

EA_Fipu_data <- EA_Fipu_rawdata %>%
  inner_join(sw03, by = "period") %>%
  transmute(period = period,
            pubcons_rpc  = 100*1e+6*pubcons/(defgdp*pop*1000),
            pubinves_rpc = 100*1e+6*pubinves/(defgdp*pop*1000),
            salaries_rpc = 100*1e+6*salaries/(defgdp*pop*1000),
            subs_rpc     = 100*1e+6*subs/(defgdp*pop*1000),
            dirtax_rpc   = 100*1e+6*dirtax/(defgdp*pop*1000),
            indirtax_rpc = 100*1e+6*indirtax/(defgdp*pop*1000),
            tfs_rpc      = 100*1e+6*tfs/(defgdp*pop*1000),
            sce_rpc      = 100*1e+6*sce/(defgdp*pop*1000),
            scr_rpc      = 100*1e+6*scr/(defgdp*pop*1000),
            debt_rpc     = 100*1e+6*debt/(defgdp*pop*1000))

EA_Fipu_data %>%
  write.csv(file = "EA_Fipu_data.csv", row.names = FALSE)

You can download ready-to-use (normalized) data for the estimation here.

Appendix

Chaining function

To chain two datasets, we build a chain function whose inputs must be two data frames with three standard columns (period, var, value). It returns a data frame of chained values, i.e. the data frame “to rebase” is chained onto the “basis” data frame.

More specifically, the function:

  • computes the growth rates of value in the data frame passed as the first argument,
  • multiplies them by the reference value taken from value in the data frame passed as the second argument,
  • at the date specified by the third argument.
chain <- function(to_rebase, basis, date_chain = "2000-01-01", is_basis_the_recent_data = TRUE) {

  date_chain <- as.Date(date_chain, "%Y-%m-%d")

  valref <- basis %>%
    filter(period == date_chain) %>%
    transmute(var, value_ref = value)

  # If chain is to update old values to match recent values
  if (is_basis_the_recent_data) {
    res <- to_rebase %>%
      filter(period <= date_chain) %>%
      arrange(desc(period)) %>%
      group_by(var) %>%
      mutate(growth_rate = c(1, value[-1]/lag(x = value)[-1])) %>%
      full_join(valref, by = c("var")) %>%
      group_by(var) %>%
      transmute(period, value = cumprod(growth_rate)*value_ref) %>%
      ungroup() %>%
      bind_rows(filter(basis, period > date_chain))
  } else {
    # If chain is to update recent values to match old values
    res <- to_rebase %>%
      filter(period >= date_chain) %>%
      arrange(period) %>%
      group_by(var) %>%
      mutate(growth_rate = c(1, value[-1]/lag(x = value, n = 1)[-1])) %>%
      full_join(valref, by = c("var")) %>%
      group_by(var) %>%
      transmute(period, value = cumprod(growth_rate)*value_ref) %>%
      ungroup() %>%
      bind_rows(filter(basis, period < date_chain))
  }

  return(res)
}

Seasonal adjustment

For the seasonal adjustment we simply use a function that turns each series into a time series, applies the seas() function from the seasonal package of Sax (2016), and converts the result back into a data frame.

deseason <- function(source_df = data_large, var_arrange = "pubcons", ...) {

  local_data <- source_df

  detrend <- local_data[var_arrange] %>%
    ts(start = c(1980, 1), frequency = 4) %>%
    seas(na.action = na.exclude)

  res <- as.numeric(final(detrend))

  res <- data_frame(value = res,
                    period = local_data$period,
                    var = var_arrange) %>%
    arrange(period) %>%
    transmute(period, var, value)
}

Bibliography

Joan Paredes, Diego J. Pedregal, and Javier J. Pérez. Fiscal policy analysis in the euro area: expanding the toolkit. Journal of Policy Modeling, 36(5):800-823, 2014. URL: http://www.sciencedirect.com/science/article/pii/S0161893814000702, doi:10.1016/j.jpolmod.2014.07.003.

Christoph Sax. seasonal: R Interface to X-13-ARIMA-SEATS. 2016. R package version 1.2.1. URL: https://CRAN.R-project.org/package=seasonal.

F. Smets and R. Wouters. An estimated dynamic stochastic general equilibrium model of the euro area. Journal of the European Economic Association, 2003.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016. URL: https://www.R-project.org.

RStudio Team. RStudio: Integrated Development Environment for R. RStudio, Inc., Boston, MA, 2016. URL: http://www.rstudio.com/.


When Cross-Validation is More Powerful than Regularization

Tue, 12/11/2019 - 20:45

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or, in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that smaller coefficients are less sensitive to idiosyncrasies in the training data, and hence less likely to overfit.
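As a concrete illustration (mine, not from the original post), ridge and lasso differ from ordinary least squares only through a penalty term, and in glmnet the penalty strength lambda is itself usually chosen by cross-validation; here X is assumed to be a numeric predictor matrix and y the outcome vector.

library(glmnet)

# alpha = 0 gives ridge, alpha = 1 gives lasso; lambda controls the penalty strength
fit_ridge <- glmnet(X, y, alpha = 0)
fit_lasso <- glmnet(X, y, alpha = 1)

# Even a regularized fit typically relies on cross-validation to pick lambda
cvfit <- cv.glmnet(X, y, alpha = 0)
coef(cvfit, s = "lambda.min")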

Cross-validation is a way to safely reuse training data in nested model situations. This includes both the case of setting hyperparameters before fitting a model, and the case of fitting models (let’s call them base learners) that are then used as variables in downstream models, as shown in Figure 1. In either situation, using the same data twice can lead to models that are overtuned to idiosyncrasies in the training data, and more likely to overfit.

Figure 1 Properly nesting models with cross-validation

In general, if any stage of your modeling pipeline involves looking at the outcome (we’ll call that a y-aware stage), you cannot directly use the same data in the following stage of the pipeline. If you have enough data, you can use separate data in each stage of the modeling process (for example, one set of data to learn hyperparameters, another set of data to train the model that uses those hyperparameters). Otherwise, you should use cross-validation to reduce the nested model bias.

Cross-validation is relatively computationally expensive; regularization is relatively cheap. Can you mitigate nested model bias by using regularization techniques instead of cross-validation?

The short answer: no, you shouldn’t. But as we’ve written before, demonstrating this is more memorable than simply saying “Don’t do that.”

A simple example

Suppose you have a system with two categorical variables. The variable x_s has 10 levels, and the variable x_n has 100 levels. The outcome y is a function of x_s, but not of x_n (but you, the analyst building the model, don’t know this). Here’s the head of the data.

##     x_s  x_n           y
## 2  s_10 n_72  0.34228110
## 3  s_01 n_09 -0.03805102
## 4  s_03 n_18 -0.92145960
## 9  s_08 n_43  1.77069352
## 10 s_08 n_17  0.51992928
## 11 s_01 n_78  1.04714355

With most modeling techniques, a categorical variable with K levels is equivalent to K or K-1 numerical (indicator or dummy) variables, so this system actually has around 110 variables. In real life situations where a data scientist is working with high-cardinality categorical variables, or with a lot of categorical variables, the number of actual variables can begin to swamp the size of training data, and/or bog down the machine learning algorithm.

One way to deal with these issues is to represent each categorical variable by a single variable model (or base learner), and then use the predictions of those base learners as the inputs to a bigger model. So instead of fitting a model with 110 indicator variables, you can fit a model with two numerical variables. This is a simple example of nested models.

Figure 2 Impact coding as an example of nested model

We refer to this procedure as “impact coding,” and it is one of the data treatments available in the vtreat package, specifically for dealing with high-cardinality categorical variables. But for now, let’s go back to the original problem.

The naive way

For this simple example, you might try representing each variable as the expected value of y - mean(y) in the training data, conditioned on the variable’s level. So the ith “coefficient” of the one-variable model would be given by:

v_i = E[y | x = s_i] − E[y]

where s_i is the ith level. Let’s show this with the variable x_s (the code for all the examples in this article is here):

##     x_s      meany      coeff
## 1  s_01  0.7998263  0.8503282
## 2  s_02 -1.3815640 -1.3310621
## 3  s_03 -0.7928449 -0.7423430
## 4  s_04 -0.8245088 -0.7740069
## 5  s_05  0.7547054  0.8052073
## 6  s_06  0.1564710  0.2069728
## 7  s_07 -1.1747557 -1.1242539
## 8  s_08  1.3520153  1.4025171
## 9  s_09  1.5789785  1.6294804
## 10 s_10 -0.7313895 -0.6808876
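The full code for the post is linked above. As a rough sketch only (not the author's code, and assuming the training data sits in a data frame dtrain with columns x_s and y), such a coefficient table could be built with dplyr:

library(dplyr)

grandmean <- mean(dtrain$y)            # E[y] over the training data
impact_xs <- dtrain %>%
  group_by(x_s) %>%
  summarize(meany = mean(y)) %>%       # E[y | x_s = s_i]
  mutate(coeff = meany - grandmean)    # v_i = E[y | x_s = s_i] - E[y]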

In other words, whenever the value of x_s is s_01, the one variable model vs returns the value 0.8503282, and so on. If you do this for both variables, you get a training set that looks like this:

##     x_s  x_n           y         vs         vn
## 2  s_10 n_72  0.34228110 -0.6808876 0.64754957
## 3  s_01 n_09 -0.03805102  0.8503282 0.54991135
## 4  s_03 n_18 -0.92145960 -0.7423430 0.01923877
## 9  s_08 n_43  1.77069352  1.4025171 1.90394159
## 10 s_08 n_17  0.51992928  1.4025171 0.26448341
## 11 s_01 n_78  1.04714355  0.8503282 0.70342961

Now fit a linear model for y as a function of vs and vn.

model_raw = lm(y ~ vs + vn, data = dtrain_treated)
summary(model_raw)

##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.33068 -0.57106  0.00342  0.52488  2.25472
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05050    0.05597  -0.902    0.368
## vs           0.77259    0.05940  13.006   <2e-16 ***
## vn           0.61201    0.06906   8.862   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8761 on 242 degrees of freedom
## Multiple R-squared:  0.6382, Adjusted R-squared:  0.6352
## F-statistic: 213.5 on 2 and 242 DF,  p-value: < 2.2e-16

Note that this model gives significant coefficients to both vs and vn, even though y is not a function of x_n (or vn). Because you used the same data to fit the one variable base learners and to fit the larger model, you have overfit.

The right way: cross-validation

The correct way to impact code (or to nest models in general) is to use cross-validation techniques. Impact coding with cross-validation is already implemented in vtreat; note the similarity between this diagram and Figure 1 above.

Figure 3 Cross-validated data preparation with vtreat

The training data is used both to fit the base learners (as we did above) and also to create a data frame of cross-validated base learner predictions (called a cross-frame in vtreat). This cross-frame is used to train the overall model. Let’s fit the correct nested model, using vtreat.

library(vtreat)
library(wrapr)

xframeResults = mkCrossFrameNExperiment(dtrain,
                                        qc(x_s, x_n), "y",
                                        codeRestriction = qc(catN),
                                        verbose = FALSE)
# the plan uses the one-variable models to treat data
treatmentPlan = xframeResults$treatments
# the cross-frame
dtrain_treated = xframeResults$crossFrame

head(dtrain_treated)

##     x_s_catN   x_n_catN           y
## 1 -0.6337889 0.91241547  0.34228110
## 2  0.8342227 0.82874089 -0.03805102
## 3 -0.7020597 0.18198634 -0.92145960
## 4  1.3983175 1.99197404  1.77069352
## 5  1.3983175 0.11679580  0.51992928
## 6  0.8342227 0.06421659  1.04714355

variables = setdiff(colnames(dtrain_treated), "y")

model_X = lm(mk_formula("y", variables), data = dtrain_treated)
summary(model_X)

##
## Call:
## lm(formula = mk_formula("y", variables), data = dtrain_treated)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2157 -0.7343  0.0225  0.7483  2.9639
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04169    0.06745  -0.618    0.537
## x_s_catN     0.92968    0.06344  14.656   <2e-16 ***
## x_n_catN     0.10204    0.06654   1.533    0.126
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 242 degrees of freedom
## Multiple R-squared:  0.4753, Adjusted R-squared:  0.471
## F-statistic: 109.6 on 2 and 242 DF,  p-value: < 2.2e-16

This model correctly determines that x_n (and its one-variable model x_n_catN) do not affect the outcome. We can compare the performance of this model to the naive model on holdout data.

                   rmse  rsquared
ypred_naive    1.303778 0.2311538
ypred_crossval 1.093955 0.4587089

The correct model has a much smaller root-mean-squared error and a much larger R-squared than the naive model when applied to new data.

An attempted alternative: regularized models.

But cross-validation is so complicated. Can’t we just regularize? As we’ll show in the appendix of this article, for a one-variable model, L2-regularization is simply Laplace smoothing. Again, we’ll represent each “coefficient” of the one-variable model as the Laplace smoothed value minus the grand mean.

v_i = ( ∑_{x_j = s_i} y_j ) / (count_i + λ) − E[y]

where count_i is the frequency of s_i in the training data, and λ is the smoothing parameter (usually 1). If λ = 1, the first term on the right just adds one to the frequency of the level before taking the “adjusted conditional mean” of y.

Again, let’s show this for the variable x_s.

##     x_s      sum_y count_y   grandmean         vs
## 1  s_01  20.795484      26 -0.05050187  0.8207050
## 2  s_02 -37.302227      27 -0.05050187 -1.2817205
## 3  s_03 -22.199656      28 -0.05050187 -0.7150035
## 4  s_04 -14.016649      17 -0.05050187 -0.7282009
## 5  s_05  19.622340      26 -0.05050187  0.7772552
## 6  s_06   3.129419      20 -0.05050187  0.1995218
## 7  s_07 -35.242672      30 -0.05050187 -1.0863585
## 8  s_08  36.504412      27 -0.05050187  1.3542309
## 9  s_09  33.158549      21 -0.05050187  1.5577086
## 10 s_10 -16.821957      23 -0.05050187 -0.6504130
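Again as a rough sketch (using the same hypothetical dtrain data frame as before), the Laplace-smoothed coefficients could be computed as:

lambda <- 1                              # the smoothing parameter
grandmean <- mean(dtrain$y)
impact_xs_reg <- dtrain %>%
  group_by(x_s) %>%
  summarize(sum_y = sum(y), count_y = n()) %>%
  mutate(vs = sum_y / (count_y + lambda) - grandmean)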

After applying the one variable models for x_s and x_n to the data, the head of the resulting treated data looks like this:

##     x_s  x_n           y         vs         vn
## 2  s_10 n_72  0.34228110 -0.6504130 0.44853367
## 3  s_01 n_09 -0.03805102  0.8207050 0.42505898
## 4  s_03 n_18 -0.92145960 -0.7150035 0.02370493
## 9  s_08 n_43  1.77069352  1.3542309 1.28612835
## 10 s_08 n_17  0.51992928  1.3542309 0.21098803
## 11 s_01 n_78  1.04714355  0.8207050 0.61015422

Now fit the overall model:

##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.30354 -0.57688 -0.02224  0.56799  2.25723
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06665    0.05637  -1.182    0.238
## vs           0.81142    0.06203  13.082  < 2e-16 ***
## vn           0.85393    0.09905   8.621  8.8e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8819 on 242 degrees of freedom
## Multiple R-squared:  0.6334, Adjusted R-squared:  0.6304
## F-statistic: 209.1 on 2 and 242 DF,  p-value: < 2.2e-16

Again, both variables look significant. Even with regularization, the model is still overfit. Comparing the performance of the models on holdout data, you see that the regularized model does a little better than the naive model, but not as well as the correctly cross-validated model.

                   rmse  rsquared
ypred_naive    1.303778 0.2311538
ypred_crossval 1.093955 0.4587089
ypred_reg      1.267648 0.2731756

The Moral of the Story

Unfortunately, regularization is not enough to overcome nested model bias. Whenever you apply a y-aware process to your data, you have to use cross-validation methods (or a separate data set) at the next stage of your modeling pipeline.

Appendix: Derivation of Laplace Smoothing as L2-Regularization

Without regularization, the optimal one-variable model for y in terms of a categorical variable with K levels {s_j} is the set of K coefficients v such that

∑_{i=1}^{N} (y_i − v_{j(i)})²

is minimized, where N is the number of data points and j(i) indexes the level taken by row i. L2-regularization adds a penalty on the magnitude of v, so that the goal becomes to minimize

∑_{i=1}^{N} (y_i − v_{j(i)})² + λ ∑_{j=1}^{K} v_j²

where λ is a known smoothing hyperparameter, usually set (in this case) to 1.

To minimize the above expression for a single coefficient v_j, take the derivative with respect to v_j and set it to zero:

−2 ∑_{i: x_i = s_j} (y_i − v_j) + 2 λ v_j = 0
⇒ ∑_{i: x_i = s_j} y_i − count_j · v_j − λ v_j = 0

where count_j is the number of times the level s_j appears in the training data. Now solve for v_j:

v_j = ( ∑_{i: x_i = s_j} y_i ) / (count_j + λ)

This is Laplace smoothing. Note that it is also the one-variable equivalent of ridge regression.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Logistic Regression in R: A Classification Technique to Predict Credit Card Default

Tue, 12/11/2019 - 20:00

[This article was first published on R Programming - Data Science Blog | AI, ML, big data analytics , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Logistic regression is one of the statistical techniques in machine learning used to form prediction models. It is one of the most popular classification algorithms, mostly used for binary classification problems (problems with two class values; some variants can deal with multiple classes as well). It’s used for various research and industrial problems, so it is essential to have a good grasp of the logistic regression algorithm. This tutorial is a sneak peek at many of Data Science Dojo’s hands-on exercises from their 5-day data science bootcamp; you will learn how logistic regression fits a dataset to make predictions, as well as when and why to use it.

In short, Logistic Regression is used when the dependent variable (target) is categorical. For example:

  • To predict whether an email is spam (1) or not spam (0)
  • Whether the tumor is malignant (1) or not (0)

It is named ‘Logistic Regression’ because its underlying technique is quite similar to Linear Regression. However, there are structural differences in how linear and logistic regression operate, which is why linear regression isn’t suitable for classification problems. This link explains in detail why linear regression isn’t the right approach for classification.

Its name is derived from one of the core functions behind its implementation: the logistic function, also called the sigmoid function. It’s an S-shaped curve that can take any real-valued number and map it to a value between 0 and 1, but never exactly at those limits.

The hypothesis function of logistic regression can be seen below where the function g(z) is also shown.

The hypothesis for logistic regression now becomes:

Here θ (theta) is a vector of parameters that our model will calculate to fit our classifier.

After calculations from the above equations, the cost function is now as follows:

Here m is the number of training examples. Like Linear Regression, we will use gradient descent to minimize our cost function and calculate the vector θ (theta).
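The formula images from the original post are not reproduced in this feed; as a reminder, the standard sigmoid and cross-entropy cost function can be written in R as follows (the function and variable names are ours, not the tutorial's):

sigmoid <- function(z) 1 / (1 + exp(-z))

# X: design matrix with an intercept column, theta: parameter vector, y: 0/1 outcome
cost <- function(theta, X, y) {
  h <- sigmoid(X %*% theta)                    # hypothesis h_theta(x)
  -mean(y * log(h) + (1 - y) * log(1 - h))     # average cross-entropy over the m examples
}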

This tutorial will follow the format below to provide you with hands-on practice with Logistic Regression:

  1. Importing Libraries
  2. Importing Datasets
  3. Exploratory Data Analysis
  4. Feature Engineering
  5. Pre-processing
  6. Model Development
  7. Prediction
  8. Evaluation
THE SCENARIO

In this tutorial, we will be working with the Default of Credit Card Clients Data Set. This data set has 30000 rows and 24 columns and can be used to estimate the probability of default payment by credit card clients. The attributes relate to various details about a customer, their past payment information, and bill statements. It is hosted in Data Science Dojo’s repository.

Think of yourself as a lead data scientist employed at a large bank. You have been assigned to predict whether a particular customer will default on their payment next month. The result is an extremely valuable piece of information for the bank when making decisions about offering credit to its customers, and it could massively affect the bank’s revenue. Therefore, your task is very critical. You will learn to use logistic regression to solve this problem.

The dataset is a tricky one, as it has a mix of categorical and continuous variables. Moreover, you will also get a chance to practice these concepts through short assignments given at the end of a few sub-modules. Feel free to change the parameters in the given methods once you have been through the entire notebook.

Download Exercise Files

1) Importing Libraries

We’ll begin by importing the dependencies we require. The following packages are popularly used for data wrangling operations and visualizations. We encourage you to have a look at their documentation.

library(knitr)
library(tidyverse)
library(ggplot2)
library(mice)
library(lattice)
library(reshape2)
# install.packages("DataExplorer") if the following package is not available
library(DataExplorer)

2) Importing Datasets

The dataset is available at Data Science Dojo’s repository at the following link. We’ll use the head method to view the first few rows.

## Need to fetch the csv file
path <- "https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Default%20of%20Credit%20Card%20Clients/default%20of%20credit%20card%20clients.csv"
data <- read.csv(file = path, header = TRUE)
head(data)

Since the header names are in the first row of the dataset, we’ll use the code below to first assign the headers to be the ones from the first row and then delete that row from the dataset. This way we get our desired form.

colnames(data) <- as.character(unlist(data[1, ]))
data = data[-1, ]
head(data)

To avoid any complications ahead, we’ll rename our target variable “default payment next month” to a name without spaces using the code below.

colnames(data)[colnames(data) == "default payment next month"] <- "default_payment"
head(data)

3) Exploratory Data Analysis

Data exploration is one of the most significant portions of the machine learning process. Clean data ensures a notable increase in the accuracy of our model. No matter how powerful our model is, it cannot function well unless the data we provide it has been thoroughly processed. This section will briefly take you through this step and help you visualize your data, find relations between variables, deal with missing values and outliers, and gain a fundamental understanding of each variable we’ll use. Moreover, this step will also enable us to figure out the most important attributes to feed our model and discard those that have no relevance.

We will start with using the dim function to print out the dimensionality of our dataframe.

dim(data)

30000 25

The str method allows us to see the data type of each variable. We’ll transform the columns to numeric, since that will be handier for the functions ahead.

str(data)

'data.frame': 30000 obs. of 25 variables:
 $ ID             : Factor w/ 30001 levels "1","10","100",..: 1 11112 22223 23335 24446 25557 26668 27779 28890 2 ...
 $ LIMIT_BAL      : Factor w/ 82 levels "10000","100000",..: 14 5 81 48 48 48 49 2 7 14 ...
 $ SEX            : Factor w/ 3 levels "1","2","SEX": 2 2 2 2 1 1 1 2 2 1 ...
 $ EDUCATION      : Factor w/ 8 levels "0","1","2","3",..: 3 3 3 3 3 2 2 3 4 4 ...
 $ MARRIAGE       : Factor w/ 5 levels "0","1","2","3",..: 2 3 3 2 2 3 3 3 2 3 ...
 $ AGE            : Factor w/ 57 levels "21","22","23",..: 4 6 14 17 37 17 9 3 8 15 ...
 $ PAY_0          : Factor w/ 12 levels "-1","-2","0",..: 5 1 3 3 1 3 3 3 3 2 ...
 $ PAY_2          : Factor w/ 12 levels "-1","-2","0",..: 5 5 3 3 3 3 3 1 3 2 ...
 $ PAY_3          : Factor w/ 12 levels "-1","-2","0",..: 1 3 3 3 1 3 3 1 5 2 ...
 $ PAY_4          : Factor w/ 12 levels "-1","-2","0",..: 1 3 3 3 3 3 3 3 3 2 ...
 $ PAY_5          : Factor w/ 11 levels "-1","-2","0",..: 2 3 3 3 3 3 3 3 3 1 ...
 $ PAY_6          : Factor w/ 11 levels "-1","-2","0",..: 2 4 3 3 3 3 3 1 3 1 ...
 $ BILL_AMT1      : Factor w/ 22724 levels "-1","-10","-100",..: 13345 10030 10924 15026 21268 18423 12835 1993 1518 307 ...
 $ BILL_AMT2      : Factor w/ 22347 levels "-1","-10","-100",..: 11404 5552 3482 15171 16961 17010 13627 12949 3530 348 ...
 $ BILL_AMT3      : Factor w/ 22027 levels "-1","-10","-100",..: 18440 9759 3105 15397 12421 16866 14184 17258 2072 365 ...
 $ BILL_AMT4      : Factor w/ 21549 levels "-1","-10","-100",..: 378 11833 3620 10318 7717 6809 16081 8147 2129 378 ...
 $ BILL_AMT5      : Factor w/ 21011 levels "-1","-10","-100",..: 385 11971 3950 10407 6477 6841 14580 76 1796 2638 ...
 $ BILL_AMT6      : Factor w/ 20605 levels "-1","-10","-100",..: 415 11339 4234 10458 6345 7002 14057 15748 12215 3230 ...
 $ PAY_AMT1       : Factor w/ 7944 levels "0","1","10","100",..: 1 1 1495 2416 2416 3160 5871 4578 4128 1 ...
 $ PAY_AMT2       : Factor w/ 7900 levels "0","1","10","100",..: 6671 5 1477 2536 4508 2142 4778 6189 1 1 ...
 $ PAY_AMT3       : Factor w/ 7519 levels "0","1","10","100",..: 1 5 5 646 6 6163 4292 1 4731 1 ...
 $ PAY_AMT4       : Factor w/ 6938 levels "0","1","10","100",..: 1 5 5 337 6620 5 2077 5286 5 813 ...
 $ PAY_AMT5       : Factor w/ 6898 levels "0","1","10","100",..: 1 1 5 263 5777 5 950 1502 5 408 ...
 $ PAY_AMT6       : Factor w/ 6940 levels "0","1","10","100",..: 1 2003 4751 5 5796 6293 963 1267 5 1 ...
 $ default_payment: Factor w/ 3 levels "0","1","default payment next month": 2 2 1 1 1 1 1 1 1 1 ...

data[, 1:25] <- sapply(data[, 1:25], as.character)

We have included an intermediate step of converting the data to character first. We need to use as.character before as.numeric because factors are stored internally as integers, with a table giving the factor level labels. Using as.numeric alone would only return the internal integer codes.
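A tiny illustration of that point (the values below are made up, not taken from the dataset):

f <- factor(c("20000", "120000", "90000"))
as.numeric(f)                 # 2 1 3 -- the internal level codes, not the amounts
as.numeric(as.character(f))   # 20000 120000 90000 -- the actual values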

data[, 1:25] <- sapply(data[, 1:25], as.numeric)
str(data)

'data.frame': 30000 obs. of 25 variables:
 $ ID             : num 1 2 3 4 5 6 7 8 9 10 ...
 $ LIMIT_BAL      : num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
 $ SEX            : num 2 2 2 2 1 1 1 2 2 1 ...
 $ EDUCATION      : num 2 2 2 2 2 1 1 2 3 3 ...
 $ MARRIAGE       : num 1 2 2 1 1 2 2 2 1 2 ...
 $ AGE            : num 24 26 34 37 57 37 29 23 28 35 ...
 $ PAY_0          : num 2 -1 0 0 -1 0 0 0 0 -2 ...
 $ PAY_2          : num 2 2 0 0 0 0 0 -1 0 -2 ...
 $ PAY_3          : num -1 0 0 0 -1 0 0 -1 2 -2 ...
 $ PAY_4          : num -1 0 0 0 0 0 0 0 0 -2 ...
 $ PAY_5          : num -2 0 0 0 0 0 0 0 0 -1 ...
 $ PAY_6          : num -2 2 0 0 0 0 0 -1 0 -1 ...
 $ BILL_AMT1      : num 3913 2682 29239 46990 8617 ...
 $ BILL_AMT2      : num 3102 1725 14027 48233 5670 ...
 $ BILL_AMT3      : num 689 2682 13559 49291 35835 ...
 $ BILL_AMT4      : num 0 3272 14331 28314 20940 ...
 $ BILL_AMT5      : num 0 3455 14948 28959 19146 ...
 $ BILL_AMT6      : num 0 3261 15549 29547 19131 ...
 $ PAY_AMT1       : num 0 0 1518 2000 2000 ...
 $ PAY_AMT2       : num 689 1000 1500 2019 36681 ...
 $ PAY_AMT3       : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
 $ PAY_AMT4       : num 0 1000 1000 1100 9000 ...
 $ PAY_AMT5       : num 0 0 1000 1069 689 ...
 $ PAY_AMT6       : num 0 2000 5000 1000 679 ...
 $ default_payment: num 1 1 0 0 0 0 0 0 0 0 ...

When applied to a data frame, the summary() function is essentially applied to each column, and the results for all columns are shown together. For a continuous (numeric) variable like AGE, it returns descriptive statistics: the minimum, first quartile, median, mean, third quartile, and maximum.

summary(data)

       ID          LIMIT_BAL            SEX          EDUCATION
 Min.   :    1   Min.   :  10000   Min.   :1.000   Min.   :0.000
 1st Qu.: 7501   1st Qu.:  50000   1st Qu.:1.000   1st Qu.:1.000
 Median :15000   Median : 140000   Median :2.000   Median :2.000
 Mean   :15000   Mean   : 167484   Mean   :1.604   Mean   :1.853
 3rd Qu.:22500   3rd Qu.: 240000   3rd Qu.:2.000   3rd Qu.:2.000
 Max.   :30000   Max.   :1000000   Max.   :2.000   Max.   :6.000

    MARRIAGE          AGE            PAY_0             PAY_2
 Min.   :0.000   Min.   :21.00   Min.   :-2.0000   Min.   :-2.0000
 1st Qu.:1.000   1st Qu.:28.00   1st Qu.:-1.0000   1st Qu.:-1.0000
 Median :2.000   Median :34.00   Median : 0.0000   Median : 0.0000
 Mean   :1.552   Mean   :35.49   Mean   :-0.0167   Mean   :-0.1338
 3rd Qu.:2.000   3rd Qu.:41.00   3rd Qu.: 0.0000   3rd Qu.: 0.0000
 Max.   :3.000   Max.   :79.00   Max.   : 8.0000   Max.   : 8.0000

     PAY_3             PAY_4             PAY_5             PAY_6
 Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000
 1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000
 Mean   :-0.1662   Mean   :-0.2207   Mean   :-0.2662   Mean   :-0.2911
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000
 Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000

   BILL_AMT1         BILL_AMT2        BILL_AMT3          BILL_AMT4
 Min.   :-165580   Min.   :-69777   Min.   :-157264   Min.   :-170000
 1st Qu.:   3559   1st Qu.:  2985   1st Qu.:   2666   1st Qu.:   2327
 Median :  22382   Median : 21200   Median :  20089   Median :  19052
 Mean   :  51223   Mean   : 49179   Mean   :  47013   Mean   :  43263
 3rd Qu.:  67091   3rd Qu.: 64006   3rd Qu.:  60165   3rd Qu.:  54506
 Max.   : 964511   Max.   :983931   Max.   :1664089   Max.   : 891586

   BILL_AMT5        BILL_AMT6          PAY_AMT1         PAY_AMT2
 Min.   :-81334   Min.   :-339603   Min.   :     0   Min.   :      0
 1st Qu.:  1763   1st Qu.:   1256   1st Qu.:  1000   1st Qu.:    833
 Median : 18105   Median :  17071   Median :  2100   Median :   2009
 Mean   : 40311   Mean   :  38872   Mean   :  5664   Mean   :   5921
 3rd Qu.: 50191   3rd Qu.:  49198   3rd Qu.:  5006   3rd Qu.:   5000
 Max.   :927171   Max.   : 961664   Max.   :873552   Max.   :1684259

    PAY_AMT3         PAY_AMT4         PAY_AMT5           PAY_AMT6
 Min.   :     0   Min.   :     0   Min.   :     0.0   Min.   :     0.0
 1st Qu.:   390   1st Qu.:   296   1st Qu.:   252.5   1st Qu.:   117.8
 Median :  1800   Median :  1500   Median :  1500.0   Median :  1500.0
 Mean   :  5226   Mean   :  4826   Mean   :  4799.4   Mean   :  5215.5
 3rd Qu.:  4505   3rd Qu.:  4013   3rd Qu.:  4031.5   3rd Qu.:  4000.0
 Max.   :896040   Max.   :621000   Max.   :426529.0   Max.   :528666.0

 default_payment
 Min.   :0.0000
 1st Qu.:0.0000
 Median :0.0000
 Mean   :0.2212
 3rd Qu.:0.0000
 Max.   :1.0000

Using the introduce method, we can get the basic information about the dataframe, including the number of missing values in each variable.

introduce(data)

As we can observe, there are no missing values in the dataframe.

The information in the summary above gives a sense of the continuous and categorical features in our dataset. However, evaluating these details against the data description shows that categorical variables such as EDUCATION and MARRIAGE have categories beyond those given in the data dictionary. We’ll find these extra categories using the count function.

count(data, vars = EDUCATION)

vars      n
   0     14
   1  10585
   2  14030
   3   4917
   4    123
   5    280
   6     51

The data dictionary defines the following categories for EDUCATION: “Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)”. However, we also observe 0 along with numbers greater than 4, i.e. 5 and 6. Since we don’t have any further details, we can assume 0 represents someone with no education experience, and 0, 5, and 6 can be placed in the ‘others’ category along with 4.

count(data, vars = MARRIAGE)

vars      n
   0     54
   1  13659
   2  15964
   3    323

The data dictionary defines the following categories for MARRIAGE: “Marital status (1 = married; 2 = single; 3 = others)”. Since the category 0 hasn’t been defined anywhere in the data dictionary, we can include it in the ‘others’ category marked as 3.

# recode undocumented levels: 0, 5 and 6 -> 4 (others) for EDUCATION; 0 -> 3 (others) for MARRIAGE
data$EDUCATION[data$EDUCATION == 0] <- 4
data$EDUCATION[data$EDUCATION == 5] <- 4
data$EDUCATION[data$EDUCATION == 6] <- 4
data$MARRIAGE[data$MARRIAGE == 0] <- 3

count(data, vars = MARRIAGE)
count(data, vars = EDUCATION)

vars      n
   1  13659
   2  15964
   3    377

vars      n
   1  10585
   2  14030
   3   4917
   4    468

We’ll now move on to multi-variate analysis of our variables and draw a correlation heat map from the DataExplorer library. The heatmap will enable us to find the correlation between each pair of variables. We are more interested in finding out the correlation between our predictor attributes and the target attribute default_payment. The color scheme depicts the strength of correlation between two variables.

This will be a simple way to quickly find out how much of an impact a variable has on our final outcome. There are other ways as well to figure this out.

plot_correlation(na.omit(data), maxcat = 5L)

We can observe the weak correlation of AGE, BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, and BILL_AMT6 with our target variable.

Now let’s do a univariate analysis of our variables. We’ll start with the categorical variables and take a quick look at the frequency distribution of their categories. The code below produces the required graphs. We’ll first draw distributions for all PAY variables.

plot_histogram(data)

We can make a few observations from the above histogram. The distribution shows that nearly all PAY attributes are right-skewed.

4) Feature Engineering

This step can be more important than the actual model used because a machine learning algorithm only learns from the data we give it, and creating features that are relevant to a task is absolutely crucial.

Analyzing our data above, we noted the extremely weak correlation of some variables with the final target variable. The following are the ones with significantly low correlation values: AGE, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6.

# deleting columns
data_new <- select(data, -one_of('ID', 'AGE', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'))
head(data_new)

5) Pre-processing

Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing (non-constant) features by their standard deviation. After standardizing, each feature has mean zero and standard deviation one. It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression, and linear discriminant analysis. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

In the code below, we’ll use the scale method to transform our dataset.

data_new[, 1:17] <- scale(data_new[, 1:17])
head(data_new)

The next task is to split the data into training and test sets, since we’ll use the test data to evaluate our model. We’ll reserve 30% of the dataset for testing and use the remaining 70% for training. The rows are sampled at random before splitting.

# create a list of random row numbers covering 70% of the actual data,
# to be used as the training set
data2 = sort(sample(nrow(data_new), nrow(data_new) * .7))

# creating the training data set by selecting the sampled rows
train <- data_new[data2, ]

# creating the test data set from the remaining rows
test <- data_new[-data2, ]

Let us print the dimensions of all these variables using the dim method. You can notice the 70-30% split.

dim(train)
dim(test)

21000 18

9000 18

6) Model Development

We will now move on to the most important step: developing our logistic regression model. With a few lines of code, we’ll fit a logistic regression model using R’s built-in glm function and store it in a variable named log.model.

Following this, we’ll train our model on the train data frame, which contains 70% of our dataset. This will be a binary classification model.

## fit a logistic regression model with the training dataset
log.model <- glm(default_payment ~ ., data = train, family = binomial(link = "logit"))
summary(log.model)

Call:
glm(formula = default_payment ~ ., family = binomial(link = "logit"),
    data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.1171  -0.6998  -0.5473  -0.2946   3.4915

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.465097   0.019825 -73.900  < 2e-16 ***
LIMIT_BAL   -0.083475   0.023905  -3.492 0.000480 ***
SEX         -0.082986   0.017717  -4.684 2.81e-06 ***
EDUCATION   -0.059851   0.019178  -3.121 0.001803 **
MARRIAGE    -0.107322   0.018350  -5.849 4.95e-09 ***
PAY_0        0.661918   0.023605  28.041  < 2e-16 ***
PAY_2        0.069704   0.028842   2.417 0.015660 *
PAY_3        0.090691   0.031982   2.836 0.004573 **
PAY_4        0.074336   0.034612   2.148 0.031738 *
PAY_5        0.018469   0.036430   0.507 0.612178
PAY_6        0.006314   0.030235   0.209 0.834584
BILL_AMT1   -0.123582   0.023558  -5.246 1.56e-07 ***
PAY_AMT1    -0.136745   0.037549  -3.642 0.000271 ***
PAY_AMT2    -0.246634   0.056432  -4.370 1.24e-05 ***
PAY_AMT3    -0.014662   0.028012  -0.523 0.600677
PAY_AMT4    -0.087782   0.031484  -2.788 0.005300 **
PAY_AMT5    -0.084533   0.030917  -2.734 0.006254 **
PAY_AMT6    -0.027355   0.025707  -1.064 0.287277
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22176  on 20999  degrees of freedom
Residual deviance: 19535  on 20982  degrees of freedom
AIC: 19571

Number of Fisher Scoring iterations: 6

7) Prediction

Below we’ll use the predict method to obtain the predictions made by our logistic regression model. We will store the predicted probabilities in log.predictions and print the first 10 rows of our test data set. Following this, we will print the predicted probabilities of the corresponding rows, which can be compared against the actual labels in the test set.

test[1:10, ]

## to predict using the logistic regression model; probabilities obtained
log.predictions <- predict(log.model, test, type = "response")

## Look at probability output
head(log.predictions, 10)
 2   0.539623162720197
 7   0.232835137994762
10   0.25988780274953
11   0.0556716133560243
15   0.422481223473459
22   0.165384552048511
25   0.0494775267027534
26   0.238225423596718
31   0.248366972046479
37   0.111907725985513

Below we assign labels using the decision rule that if the predicted probability is greater than 0.5 we assign 1, else 0.

log.prediction.rd <- ifelse(log.predictions > 0.5, 1, 0)
head(log.prediction.rd, 10)

 2   1
 7   0
10   0
11   0
15   0
22   0
25   0
26   0
31   0
37   0
8) Evaluation

We’ll now discuss a few evaluation metrics to measure the performance of our machine learning model here. This part has significant relevance since it will allow us to understand the most important characteristics that led to our model development.

We will output the confusion matrix. It is a handy presentation of the accuracy of a model with two or more classes.

The table presents the predictions in the rows and the actual outcomes in the columns. The cells of the table contain the number of predictions made by the machine learning algorithm that fall into each combination.

According to an article the entries in the confusion matrix have the following meaning in the context of our study:

[[a b]
 [c d]]

  • a is the number of correct predictions that an instance is negative,
  • b is the number of incorrect predictions that an instance is positive,
  • c is the number of incorrect predictions that an instance is negative, and
  • d is the number of correct predictions that an instance is positive.
table(log.prediction.rd, test[,18])

log.prediction.rd    0    1
                0 6832 1517
                1  170  481

We’ll compute the accuracy below with a simple calculation.

accuracy <- table(log.prediction.rd, test[,18])
sum(diag(accuracy)) / sum(accuracy)

0.812555555555556
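Accuracy alone can be misleading here, since only about 22% of customers default. As an optional follow-up not in the original tutorial, precision and recall for the "default" class can be read off the same confusion matrix:

cm <- table(log.prediction.rd, test[,18])    # rows = predictions, columns = actual labels
precision <- cm["1", "1"] / sum(cm["1", ])   # TP / (TP + FP)
recall    <- cm["1", "1"] / sum(cm[, "1"])   # TP / (TP + FN)
precision
recall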

Conclusion

This tutorial has given you a brief and concise overview of the logistic regression algorithm and all the steps involved in achieving better results from our model. This notebook has also highlighted a few methods related to exploratory data analysis, pre-processing, and evaluation; however, there are several other methods that we encourage you to explore on our blog or video tutorials.

If you want to take a deeper dive into several data science techniques, join our 5-day hands-on data science bootcamp preferred by working professionals, where we cover the following topics:

  • Fundamentals of Data Mining
  • Machine Learning Fundamentals
  • Introduction to R
  • Introduction to Azure Machine Learning Studio
  • Data Exploration, Visualization, and Feature Engineering
  • Decision Tree Learning
  • Ensemble Methods: Bagging, Boosting, and Random Forest
  • Regression: Cost Functions, Gradient Descent, Regularization
  • Unsupervised Learning
  • Recommendation Systems
  • Metrics and Methods for Evaluating Predictive Models
  • Introduction to Online Experimentation and A/B Testing
  • Fundamentals of Big Data Engineering
  • Hadoop and Hive
  • Message Queues and Real-time Analytics
  • NoSQL Databases and HBase
  • Hack Project: Creating a Real-time IoT Pipeline
  • Naive Bayes
  • Logistic Regression
  • Time Series Forecasting

This post was originally sponsored on What’s The Big Data.


To leave a comment for the author, please follow the link and comment on their blog: R Programming - Data Science Blog | AI, ML, big data analytics . R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

AzureR updates: AzureStor, AzureVM, AzureGraph, AzureContainers

Tue, 12/11/2019 - 19:30

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Some major updates to AzureR packages this week! As well as last week's AzureRMR update, there are changes to AzureStor, AzureVM, AzureGraph and AzureContainers. All of these are live on CRAN.

AzureStor 3.0.0

There are substantial enhancements to multiple-file transfers (up and down). You can supply a vector of pathnames to storage_upload/download as the source and destination arguments. Alternatively if you specify a wildcard source, there is now the ability to recurse through subdirectories; the directory structure in the source will be reproduced at the destination.
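As a hedged sketch of the vectorised transfer described above (the storage account, key, and file paths are placeholders, not from the post):

library(AzureStor)

endp <- blob_endpoint("https://myaccount.blob.core.windows.net", key = "<access key>")
cont <- blob_container(endp, "mycontainer")

# upload a vector of source files to a vector of destinations in one call
storage_upload(cont,
               src  = c("data/file1.csv", "data/file2.csv"),
               dest = c("backup/file1.csv", "backup/file2.csv"))
# per the post, a wildcard source (e.g. "data/*") can also recurse through
# subdirectories and reproduce the local directory structure at the destination;
# see ?storage_upload for the exact argument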

There are revamped methods for getting storage properties and (user-defined) metadata for storage objects.

You can now use Azurite with AzureStor, by creating a service-specific endpoint (file_endpoint, blob_endpoint, adls_endpoint) with the Azurite URL. AzureStor will print a warning, but create the endpoint anyway.
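For example, a minimal sketch (the URL and account name are Azurite's defaults; the key is a placeholder for Azurite's documented account key):

azurite_endp <- blob_endpoint("http://127.0.0.1:10000/devstoreaccount1",
                              key = "<Azurite well-known account key>")
# AzureStor warns that this is not a recognised Azure URL, but creates the endpoint anyway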

For other changes, see the NEWS.md file.

AzureVM 2.1.0

You can now create VM scalesets with attached data disks. In addition, you can specify the disk type (Standard_LRS, StandardSSD_LRS, or Premium_LRS) for the OS disk and, for a Linux Data Science Virtual Machine, the supplied data disk. This enables using VM sizes that don't support Premium storage.

AzureGraph 1.1.0 and AzureContainers 1.1.2

These packages have been updated to use the new Microsoft Graph operations, introduced last week, for managing app passwords. As a security measure, app passwords can no longer be manually specified; instead they are auto-generated on the server using a cryptographically secure PRNG.

In AzureGraph, the az_app$update_password() method is defunct; in its place are add_password() and remove_password(). Similarly, in AzureContainers the aks$update_service_password() and aks$update_aad_password() methods no longer accept a manual password as an input.
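A hedged sketch of the new flow (the tenant, app ID, and the exact argument of remove_password() are placeholders or assumptions; see the AzureGraph documentation):

library(AzureGraph)

gr  <- get_graph_login("mytenant")
app <- gr$get_app(app_id = "00000000-0000-0000-0000-000000000000")

app$add_password()            # a new password is generated server-side and returned
# app$remove_password(<id>)   # drop an old credential by its ID when rotating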

If you use Graph, and in particular if you use AzureContainers to deploy Azure Kubernetes Service clusters, you should update as soon as possible, as the old versions are incompatible with the new Graph API.

If you run into any problems, please open an issue at the GitHub repo in question.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Azure AI and Machine Learning talk series

Tue, 12/11/2019 - 18:04

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

At last week's Microsoft Ignite conference in Orlando, our team delivered a series of 6 talks about AI and machine learning applications with Azure. The videos from each talk are linked below, and you can watch every talk from the conference online (no registration necessary). Each of our talks also comes with a companion Github repository, where you can find all of the code and scripts behind the demonstrations, so you can deploy and run them yourself.

If you'd like to see these talks live, they will also be presented in 31 cities around the world over the next six months, starting with Paris this week. Check the website for Microsoft Ignite the Tour for event dates and further information.

AIML10 Making sense of your unstructured data with AI

Tailwind Traders has a lot of legacy data that they’d like their developers to leverage in their apps – from various sources, both structured and unstructured, and including images, forms, and several others. In this session, learn how the team used Azure Cognitive Search to make sense of this data in a short amount of time and with amazing success. We discuss tons of AI concepts, like the ingest-enrich-explore pattern, search skillsets, cognitive skills, natural language processing, computer vision, and beyond.

Github repository here. Video here.

AIML20 Using pre-built AI to solve business challenges

As a data-driven company, Tailwind Traders understands the importance of using artificial intelligence to improve business processes and delight customers. Before investing in an AI team, their existing developers were able to demonstrate some quick wins using pre-built AI technologies.

In this session, we show how you can use Azure Cognitive Services to extract insights from retail data and go into the neural networks behind computer vision. Learn how it works and how to augment the pre-built AI with your own images for custom image recognition applications.

Github repository here. Video here.

AIML21 Developers guide to AI: A data story

In this theater session, we show the data science process and how to apply it. From exploration of datasets to deployment of services – all applied to an interesting data story. We also take you on a very brief tour of the Azure AI platform.

Github repository here. Video here.

AIML30: Start building machine learning models faster than you think

Tailwind Traders uses custom machine learning models to fix their inventory issues – without changing their software development life cycle! How? Azure Machine Learning Visual Interface.

In this session, learn the data science process that Tailwind Traders’ uses and get an introduction to Azure Machine Learning Visual Interface. See how to find, import, and prepare data, select a machine learning algorithm, train and test the model, and deploy a complete model to an API. Get the tips, best practices, and resources you and your development team need to continue your machine learning journey, build your first model, and more.

Github repository here. Video here.

AIML40 Taking models to the next level with Azure Machine Learning best practices

Tailwind Traders’ data science team uses natural language processing (NLP), and recently discovered how to fine-tune and build baseline models with Automated ML.

In this session, learn what Automated ML is and why it’s so powerful, then dive into how to improve upon baseline models using examples from the NLP best practices repository. We highlight Azure Machine Learning key features and how you can apply them to your organization, including: low priority compute instances, distributed training with auto scale, hyperparameter optimization, collaboration, logging, and deployment.

Github repository here. Video here.

AIML50 Machine learning operations: Applying DevOps to data science

Many companies have adopted DevOps practices to improve their software delivery, but these same techniques are rarely applied to machine learning projects. Collaboration between developers and data scientists can be limited and deploying models to production in a consistent, trustworthy way is often a pipe dream.

In this session, learn how Tailwind Traders applied DevOps practices to their machine learning projects using Azure DevOps and Azure Machine Learning Service. We show automated training, scoring, and storage of versioned models, wrap the models in Docker containers, and deploy them to Azure Container Instances or Azure Kubernetes Service. We even collect continuous feedback on model behavior so we know when to retrain.

Github repository here. Video here.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

My AP Statistics Class First R Programming Assignment Using RStudio

Tue, 12/11/2019 - 15:49

[This article was first published on R – Saturn Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

My AP Stats class has started their first R programming assignment this week. I gave them the code for them to type in and play with. This will give them some experience with RStudio and basic function commands.

I have a total of six assignments for them to complete over the next few months. All my students have a laptop to use so there should be fewer issues getting up and running. The first thing we did was to download R and RStudio. Everyone was able to get RStudio up and running. I then went over some rules for coding and how the assignments will be submitted.

Some of the rules I went over were:

  • Comment out your code with the # sign. Every function should have a comment.
  • Create white space. It’s easier for me to read and debug.
  • If you are stuck: a) ask a classmate b) copy/paste error code into Google, c) ask Mr. Smith
  • Make sure your name and all relevant information are at the top of your code.
  • Expect to get stuck sometimes. Google and Stackoverflow are your friends. You will learn a lot.

After we all had RStudio working, I showed them how to install the tidyverse package. This is the greatest package ever and allows me to teach the students with all kinds of larger data sets. I’ll go into more detail next lesson on using dplyr to filter and select from a data frame.
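For reference, the install-and-load step looks like this (a one-time install, then a library() call in each session):

install.packages("tidyverse")   # one-time install; pulls in dplyr, ggplot2 and friends
library(tidyverse)              # load it for the current session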

For this first assignment, I’m using the data from our book on page 35.

Here is the code for the first assignment and the output.

# My name is _________________________________
# This is my first programming assignment for AP Stats and I will copy and see if everything runs properly
# November 11, 2019
# I need to comment out everything using the "#"

# This lesson is from my website at saturnscience.com
# web link here to see details
# http://www.saturnscience.com/category/r/

#################################### Assignment 1---students type in the data ############

# Everything here works for the latest version of R and RStudio

## The general form of a command will look like this:  ## note to myself
## myGraph <- ggplot(myData, aes(variable for x axis, variable for y axis)) + geom()
## You can also use =, it's the same as <-
## NOTE: DO NOT make variable names with a space, use one word or two connected with a period "."

## Here I enter the data from page 35
## The "c" function combines the data into a vector

##### Please load dplyr and ggplot2 now. ####

foreigh.born=c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
               6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
               10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
               3.9,10.1,12.4,1.2,4.4,2.7)

summary(foreigh.born)      # Gives the five number summary.
str(foreigh.born)          # the str function shows me the type of structure of the data.
fivenum(foreigh.born)      # gives the five number summary
mean(foreigh.born)         # just shows the mean
head(foreigh.born, n=12)   # shows the first 12, pick n. Used with large data files.
tail(foreigh.born)         # shows the end of the data. You can pick n or leave it alone.

plot(foreigh.born)         # this is R's generic scatter plot function and only shows basic information.
                           # we will use this more later.
hist(foreigh.born)         # This is base R's basic histogram function.

# Below is ggplot's better graphing abilities
ggplot() +
  aes(foreigh.born)+
  geom_histogram(binwidth = 2.5)

# I change the variable name so I don't confuse it with the prior graphs
foreign.born3=c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
                6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
                10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
                3.9,10.1,12.4,1.2,4.4,2.7)

# This is a histogram with base R
hist(foreign.born3, breaks = 10, main = "Histogram with Base Graphics", ylim = c(0,15))

# check the structure
str(foreign.born3)

# make sure it's a data frame by changing to a data.frame.
fb3=as.data.frame(foreign.born3)

# I check to see the structure of fb3
str(fb3)

# I use ggplot to make a histogram similar to the book's histogram
ggplot(fb3,aes(x=foreign.born3))+
  geom_histogram(color="black",fill="orange",binwidth = 3)+
  labs(x="Percent of foreign born residents",y="Number of States")+
  geom_density()

# I can add a density curve to the histogram
ggplot(fb3,aes(x=foreign.born3))+
  geom_histogram(aes(y=..density..),color="black",fill="orange",binwidth = 3)+
  labs(x="Percent of foreign born residents",y="Density of States")+
  geom_density(alpha=0.2,fill="#FF6666")

# Same histogram but I just change the colors a bit.
ggplot(fb3, aes(x=foreign.born3)) +
  geom_histogram(aes(y=..density..), binwidth=3, colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666")

# use control-l to clear the console

Some of the output:

> ##### Please load dplyr and ggplot2 now. ####
>
> foreigh.born=c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
+                6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
+                10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
+                3.9,10.1,12.4,1.2,4.4,2.7)
>
> summary(foreigh.born)      # Gives the five number summary.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.200   3.800   6.100   8.316  12.350  27.200
>
> str(foreigh.born)          # the str function shows me the type of structure of the data.
 num [1:50] 2.8 7 15.1 3.8 27.2 10.3 12.9 8.1 18.9 9.2 ...
>
> fivenum(foreigh.born)      # gives the five number summary
[1]  1.2  3.8  6.1 12.4 27.2
>
> mean(foreigh.born)         # just shows the mean
[1] 8.316
>
> head(foreigh.born, n=12)   # shows the first 12, pick n. Used with large data files.
 [1]  2.8  7.0 15.1  3.8 27.2 10.3 12.9  8.1 18.9  9.2 16.3  5.6
>
> tail(foreigh.born)         # shows the end of the data. You can pick n or leave it alone.
[1]  3.9 10.1 12.4  1.2  4.4  2.7

Here are some of the plots using ggplot2



We have completed Unit 4 and will start Unit 5 next week. We are where we need to be at this time of the year. At this rate, we’ll finish the class on time and have a few weeks to review for the exam in May 2020.


To leave a comment for the author, please follow the link and comment on their blog: R – Saturn Science. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

RcppAnnoy 0.0.14

Tue, 12/11/2019 - 13:13

[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A new minor release of RcppAnnoy is now on CRAN, following the previous 0.0.13 release in September.

RcppAnnoy is the Rcpp-based R integration of the nifty Annoy library by Erik Bernhardsson. Annoy is a small and lightweight C++ template header library for very fast approximate nearest neighbours—originally developed to drive the famous Spotify music discovery algorithm.

This release once again allows compilation on older compilers. The 0.0.13 release in September brought very efficient 512-bit AVX instructions to accelerate computations. However, these could not be compiled on older machines, so we caught up once more with upstream and updated to conditional code which falls back to either 128-bit AVX or no AVX, ensuring buildability “everywhere”.

Detailed changes follow below.

Changes in version 0.0.14 (2019-11-11)
  • RcppAnnoy again synchronized with upstream to ensure builds with older compilers without AVX512 instructions (Dirk #53).

  • The cleanup script only uses /bin/sh.

Courtesy of CRANberries, there is also a diffstat report for this release.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

dplyr and Oracle database with odbc on windows

Tue, 12/11/2019 - 13:05

[This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

RStudio makes Oracle accessibility from R easier via odbc and connections Pane1.
Personally, I find it’s not so easy.
As it finally works for me, I will detail some snippets here.

After dozens of tries, it seems good to share some tricks2. This blog post is also a notepad for me.

Oracle and R configuration is a step where we potentially waste a lot of time.

Many things can cause Oracle and R not to work at all:

  • it depends on which client is installed (32b, 64b ?)
  • whether the odbc driver is correctly installed or not
  • you have to dissect tnsnames.ora
  • investigate many ORA errors
  • maybe try a clean install of the Oracle client

Often ROracle is used and it works well, sometimes it doesn’t (some oci.dll not found3, etc.). But it doesn’t work with dplyr/dbplyr at the moment.

After several years with ROracle, I’m happy to have both possibilities for query writing and collecting (SQL, and now dplyr)

Here we are:

RStudio connection Pane

From the connections Pane we take the Oracle odbc driver name; we have two here, for two Oracle client versions:

And then:

We now have a big component of the connection string.

32b or 64b

If your Oracle client is 32-bit, you have to switch to 32-bit R, otherwise it doesn’t work (at least for me).

Connection string

Then stackoverflow history helped me4 to structure the entire string:

library(odbc)
library(dplyr)
library(dbplyr)
library(lubridate)

my_oracle <- dbConnect(odbc::odbc(),
                       .connection_string = "Driver={Oracle dans OraClient10g_home1};DBQ=host:port/db_name;UID=woo;PWD=hoo",
                       timeout = 10)

You will find all this information in tnsnames.ora. The port is probably 1521.

Some dplyr/dbplyr statements

Simple one

dplyr::tbl(my_oracle, dbplyr::in_schema('SCHEMA_ONE', 'TB_ONE'))

<SQL> SELECT * FROM SCHEMA_ONE.TB_ONE

dplyr and dblink

If you have another oracle database with dblinks it may also works like this:

dplyr::tbl(my_oracle, dbplyr::in_schema('SCHEMA_B', 'TC_TWO@MYDBTWOLINK'))

<SQL> SELECT * FROM SCHEMA_B.TC_TWO@MYDBTWOLINK

List dblinks

DBI::dbGetQuery(my_oracle, "SELECT * FROM ALL_DB_LINKS")

<SQL> SELECT * FROM ALL_DB_LINKS

Catalog of all columns5

<SQL> SELECT * FROM ALL_TAB_COLUMNS

Decomposing the connection string

In order to ask for the password interactively, we split the connection string into its parts:

library(odbc)
library(dplyr)
library(dbplyr)
library(lubridate)

my_oracle <- dbConnect(odbc::odbc(),
                       Driver = "Oracle dans OraClient10g_home1",
                       DBQ = "host:port/db_name",
                       SVC = "DB_SCHEMA",   # schema when connection opens
                       UID = "woo",
                       PWD = "hoo")

And then:

library(odbc)
library(dplyr)
library(dbplyr)
library(lubridate)

my_oracle <- dbConnect(odbc::odbc(),
                       Driver = "Oracle dans OraClient10g_home1",
                       DBQ = "host:port/db_name",
                       SVC = "DB_SCHEMA",
                       UID = rstudioapi::askForPassword('woo (username)'),
                       PWD = rstudioapi::askForPassword('Open, Sesame (password)'))
  1. RStudio documentation for Oracle connections: https://db.rstudio.com/databases/oracle/ 

  2. see here for a readme in a repo on github: https://github.com/GuillaumePressiat/oracle_odbc_connection_template_for_R 

  3. see here for ROracle difficulties: https://technology.amis.nl/2017/08/23/r-and-the-oracle-database-using-dplyr-dbplyr-with-roracle-on-windows-10/ 

  4. how to make a connection string for oracle that includes hostname, instance name, user id, password using system.data.oracleclient? stackoverflow 

  5. for Oracle catalogs, see here: https://docs.oracle.com/pls/db92/db92.catalog_views?remark=homepage 


To leave a comment for the author, please follow the link and comment on their blog: Guillaume Pressiat. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Teach R to see by Borrowing a Brain

Tue, 12/11/2019 - 09:00

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


It has been an old dream to teach a computer to see, i.e. to hold something in front of a camera and have the computer tell you what it sees. For decades it remained exactly that: a dream – because we as human beings are able to see, but we don't know how we do it, let alone precisely enough to put it into algorithmic form.

Enter machine learning!

As we have seen in Understanding the Magic of Neural Networks, we can use neural networks for that. We have to show the network thousands of readily tagged pics (= supervised learning), and after many cycles the network will have internalized all the important features of the pictures shown to it. The problem is that it often takes a lot of computing power and time to train a neural network from scratch.

The solution: a pre-trained neural network which you can just use out of the box! In the following we will build a system where you can point your webcam in any direction or hold items in front of it and R will tell you what it sees: a banana, some toilet paper, a sliding door, a bottle of water and so on. Sounds impressive, right?

For the following code to work you first have to go through the following steps:

  1. Install Python through the Anaconda distribution: https://www.anaconda.com
  2. Install the R interface to Keras (a high-level neural networks API): https://keras.rstudio.com
  3. Load the keras package and the pre-trained ResNet-50 neural network (based on https://keras.rstudio.com/reference/application_resnet50.html):
  4. library(keras)
     # instantiate the model
     resnet50 <- application_resnet50(weights = 'imagenet')
  5. Build a function which takes a picture as input and makes a prediction on what can be seen in it:
  6. predict_resnet50 <- function(img_path) {
       # load the image
       img <- image_load(img_path, target_size = c(224, 224))
       x <- image_to_array(img)
       # ensure we have a 4d tensor with a single element in the batch dimension,
       # then preprocess the input for prediction using resnet50
       x <- array_reshape(x, c(1, dim(x)))
       x <- imagenet_preprocess_input(x)
       # make predictions, then decode and print them
       preds <- predict(resnet50, x)
       imagenet_decode_predictions(preds, top = 3)[[1]]
     }
  7. Start the webcam and set its timer to take a picture every 2 seconds (how to do that depends on your webcam's software), then start taking pics.
  8. Let the following code run and put different items in front of the camera… Have fun!
  9. img_path <- "C:/Users/.../Pictures/Camera Roll" # change path appropriately
     while (TRUE) {
       files <- list.files(path = img_path, full.names = TRUE)
       img <- files[which.max(file.mtime(files))] # grab latest pic
       cat("\014") # clear console
       print(predict_resnet50(img))
       Sys.sleep(1)
     }
  10. When done click the Stop button in RStudio and stop taking pics.
  11. Optional: delete saved pics – you can also do this with the following command:
  12. unlink(paste0(img_path, "/*")) # delete all pics in folder

Here are a few examples of my experiments with my own crappy webcam:

class_name class_description score 1 n07753592 banana 9.999869e-01 2 n01945685 slug 5.599981e-06 3 n01924916 flatworm 3.798145e-06

class_name class_description score 1 n07749582 lemon 0.9924537539 2 n07860988 dough 0.0062746629 3 n07747607 orange 0.0003545524

class_name class_description score 1 n07753275 pineapple 0.9992571473 2 n07760859 custard_apple 0.0002387811 3 n04423845 thimble 0.0001032234

class_name class_description score 1 n04548362 wallet 0.51329690 2 n04026417 purse 0.33063501 3 n02840245 binder 0.02906101

class_name class_description score 1 n04355933 sunglass 5.837566e-01 2 n04356056 sunglasses 4.157162e-01 3 n02883205 bow_tie 9.142305e-05

So far, all of the pics were on a white background, what happens in a more chaotic setting?

class_name class_description score 1 n03691459 loudspeaker 0.62559783 2 n03180011 desktop_computer 0.17671309 3 n03782006 monitor 0.04467739

class_name class_description score 1 n03899768 patio 0.65015656 2 n03930313 picket_fence 0.04702349 3 n03495258 harp 0.04476695

class_name class_description score 1 n02870880 bookcase 0.5205195 2 n03661043 library 0.3582534 3 n02871525 bookshop 0.1167464

Quite impressive for such a small amount of work, isn't it?

Another way to make use of pre-trained models is to take them as a basis for building new nets that can, e.g., recognize things the original net was not able to. You don't have to start from scratch but can reuse, e.g., only the lower layers, which hold the needed building blocks, while retraining the higher layers (another possibility would be to add additional layers on top of the pre-trained model).

This method is called Transfer Learning; an example would be to reuse a net that is able to differentiate between male and female persons for recognizing their age or their mood. The main advantage obviously is that you get results much faster this way; one disadvantage may be that a net trained from scratch might yield better results. As so often in the area of machine learning, there is a trade-off…
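
As a rough sketch of how this could look with the keras R interface (this is not code from this post; the two-class head, the layer choices and the data are assumptions for illustration only):

library(keras)

# Reuse ResNet-50 as a frozen feature extractor (its weights are not retrained)
base <- application_resnet50(weights = "imagenet", include_top = FALSE,
                             input_shape = c(224, 224, 3))
freeze_weights(base)

# Add a small trainable "head" on top, here for a hypothetical 2-class problem
preds <- base$output %>%
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 2, activation = "softmax")

model <- keras_model(inputs = base$input, outputs = preds)
model %>% compile(optimizer = "adam",
                  loss = "categorical_crossentropy",
                  metrics = c("accuracy"))
# model %>% fit(...) would then train only the new head on your own images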

Hope this post gave you an even deeper insight into the fascinating area of neural networks which is still one of the hottest areas of machine learning research.


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

What can we really expect to learn from a pilot study?

Tue, 12/11/2019 - 01:00

[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I am involved with a very interesting project – the NIA IMPACT Collaboratory – where a primary goal is to fund a large group of pragmatic pilot studies to investigate promising interventions to improve health care and quality of life for people living with Alzheimer’s disease and related dementias. One of my roles on the project team is to advise potential applicants on the development of their proposals. In order to provide helpful advice, it is important that we understand what we should actually expect to learn from a relatively small pilot study of a new intervention.

There is a rich literature on this topic. For example, these papers by Lancaster et al and Leon et al provide nice discussions about how pilot studies should fit into the context of larger randomized trials. The key point made by both groups of authors is that pilot studies are important sources of information about the feasibility of conducting a larger, more informative study: Can the intervention actually be implemented well enough to study it? Will it be possible to recruit and retain patients? How difficult will it be to measure the primary outcome? Indeed, what is the most appropriate outcome to be measuring?

Another thing the authors agree on is that the pilot study is not generally well-equipped to provide an estimate of the treatment effect. Because pilot studies are limited in resources (both time and money), sample sizes tend to be quite small. As a result, any estimate of the treatment effect is going to be quite noisy. If we accept the notion that there is some true underlying treatment effect for a particular intervention and population of interest, the pilot study estimate may very well fall relatively far from that true value. As a result, if we use that effect size estimate (rather than the true value) to estimate sample size requirements for the larger randomized trial, we run a substantial risk of designing an RCT that is too small, which may lead us to miss identifying a true effect. (Likewise, we may end up with a study that is too large, using up precious resources.)

My goal here is to use simulations to see how a small pilot study could potentially lead to poor design decisions with respect to sample size.

A small, two-arm pilot study

In these simulations, I will assume a two-arm study (intervention and control) with a true intervention effect \(\Delta = 50\). The outcome is a continuous measure with a within-arm standard deviation \(\sigma = 100\). In some fields of research, the effect size would be standardized as \(d = \Delta / \sigma\). (This is also known as Cohen’s \(d\).) So, in this case the true standardized effect size \(d=0.5\).

If we knew the true effect size and variance, we could skip the pilot study and proceed directly to estimate the sample size required for 80% power and Type I error rate \(\alpha = 0.05\). Using the pwr.t.test function in the pwr library, we specify the treatment effect (as \(d\)), significance level \(\alpha\), and power to get the number of subjects needed for each study arm. In this case, it would be 64 (for a total of 128):

library(pwr)
pwr.t.test(n = NULL, d = 50/100, sig.level = 0.05, power = 0.80, type = "two.sample")
##
##      Two-sample t test power calculation
##
##               n = 64
##               d = 0.5
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
##
## NOTE: n is number in *each* group

If we do not have an estimate of \(d\) or even of the individual components \(\Delta\) and \(\sigma\), we may decide to do a small pilot study. I simulate a single study with 30 subjects in each arm (for a total study sample size of 60). First, I generate the data set (representing this one version of the hypothetical study) with a treatment indicator \(rx\) and an outcome \(y\):

library(simstudy)

defd <- defDataAdd(varname = "y", formula = "rx * 50", variance = 100^2)
ss <- 30

set.seed(22821)

dd <- genData(n = ss*2)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(defd, dd)
head(dd)
##    id rx    y
## 1:  1  0 -150
## 2:  2  1   48
## 3:  3  0 -230
## 4:  4  1  116
## 5:  5  1   91
## 6:  6  1  105

Once we have collected the data from the pilot study, we probably would try to get sample size requirements for the larger RCT. The question is, what information can we use to inform \(d\)? We have a couple of options. In the first case, we can estimate both \(\Delta\) and \(\sigma\) from the data and use those results directly in power calculations:

lmfit <- lm(y ~ rx, data = dd)
Delta <- coef(lmfit)["rx"]
Delta
## rx
## 78

sd.rx <- dd[rx==1, sd(y)]
sd.ctl <- dd[rx==0, sd(y)]
pool.sd <- sqrt( (sd.rx^2 + sd.ctl^2) / 2 )
pool.sd
## [1] 94

The estimated standard deviation (94) is less than the true value, and the effect size is inflated (78), so that the estimated \(\hat{d}\) is also too large, close to 0.83. This is going to lead us to recruit fewer participants (24 in each group) than the number we actually require (64 in each group):

pwr.t.test(n = NULL, d = Delta/pool.sd, sig.level = 0.05, power = 0.80, type = "two.sample")
##
##      Two-sample t test power calculation
##
##               n = 24
##               d = 0.83
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
##
## NOTE: n is number in *each* group

Alternatively, if we had external information that provided some insight into the true effect size, or, absent that, we use a minimally clinically significant effect size, we might get a better result. In this case, we are quite fortunate to use an effect size of 50. However, we will continue to use the variance estimate from the pilot study. Using this approach, the resulting sample size (56) happens to be much closer to the required value (64):

pwr.t.test(n = NULL, d = 50/pool.sd, sig.level = 0.05, power = 0.80, type = "two.sample")
##
##      Two-sample t test power calculation
##
##               n = 56
##               d = 0.53
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
##
## NOTE: n is number in *each* group

Speak truth to power

Now the question becomes: what is the true expected power of the RCT based on the sample size estimated from the pilot study? To estimate this true power, we use the true effect size and the true variance (i.e. the true \(d\)).

In the first case, where we actually used the true \(d\) to get the sample size estimate, we just recover the 80% power estimate. No surprise there:

pwr.t.test(n = 64, d = 0.50, sig.level = 0.05, type = "two.sample")$power
## [1] 0.8

In the second case, where we used \(\hat{d} = \hat{\Delta} / \hat{\sigma}\) to get the sample size \(n=24\), the true power of the larger RCT would be 40%:

pwr.t.test(n = 24, d = 0.50, sig.level = 0.05, type = "two.sample")$power
## [1] 0.4

And if we had used \(\hat{d} = 50 / \hat{\sigma}\) to get the sample size estimate \(n=56\), the true power would have been 75%:

pwr.t.test(n = 56, d = 0.50, sig.level = 0.05, type = "two.sample")$power
## [1] 0.75

Conservative estimate of standard deviation

While the two papers I cited earlier suggest that it is not appropriate to use effect sizes estimated from a pilot study (and more on that in the next and last section), this 1995 paper by R.H. Browne presents the idea that we can use the estimated standard deviation from the pilot study. Or rather, to be conservative, we can use the upper limit of a one-sided confidence interval for the standard deviation estimated from the pilot study.

The confidence interval for the standard deviation is not routinely provided in R. Another paper analyzes one-sided confidence intervals quite generally under different conditions, and provides a formula in the most straightforward case under assumptions of normality to estimate the \(\gamma*100\%\) one-sided confidence interval for \(\sigma^2\):

\[
\left( 0,\frac{(N-2)s_{pooled}^2}{\chi^2_{N-2;\gamma}} \right)
\]

where \(\chi^2_{N-2;\gamma}\) is determined by \(P(\chi^2_{N-2} > \chi^2_{N-2;\gamma}) = \gamma\). So, if \(\gamma = 0.95\) then we can get a one-sided 95% confidence interval for the standard deviation using that formulation:

gamma <- 0.95
qchi <- qchisq(gamma, df = 2*ss - 2, lower.tail = FALSE)
ucl <- sqrt( ( (2*ss - 2) * pool.sd^2 ) / qchi )
ucl
## [1] 111

The point estimate \(\hat{\sigma}\) is 94, and the one-sided 95% confidence interval is \((0, 111)\). (I’m happy to provide a simulation to demonstrate that this is in fact the case, but won’t do it here in the interest of space.)

If we use \(\hat{\sigma}_{ucl} = 111\) to estimate the sample size, we get a more conservative sample size requirement (78) than if we used the point estimate \(\hat{\sigma} = 94\) (where the sample size requirement was 56):

pwr.t.test(n = NULL, d = 50/ucl, sig.level = 0.05, power = 0.80, type = "two.sample")
##
##      Two-sample t test power calculation
##
##               n = 78
##               d = 0.45
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
##
## NOTE: n is number in *each* group

Ultimately, using \(\gamma = 0.95\) might be too conservative, in that it might lead to an excessively large sample size requirement. Browne’s paper uses simulation to evaluate a range of \(\gamma\)’s, from 0.5 to 0.9, which I also do in the next section.

Simulation of different approaches

At this point, we need to generate multiple iterations to see how the various approaches perform over repeated pilot studies based on the same data generating process, rather than looking at a single instance as I did in the simulations above.

As Browne does in his paper, I would like to evaluate the distribution of power estimates that arise from the various approaches. I compare using an external source or minimally clinically meaningful effect size to estimate \(\Delta\) (in the figures below, this would be the columns labeled ‘truth’) with using the effect size point estimate from the pilot (labeled pilot). I also compare using a point estimate of \(\sigma\) from the pilot (where \(\gamma=0\)), with using the upper limit of a one-sided confidence interval defined by \(\gamma\). In these simulations I compare three levels of \(\gamma\): \(\gamma \in (0.5, 0.7, 0.9)\).

In each of the simulations, I assume 30 subjects per arm, and evaluate true effect sizes of 30 and 75. In all cases, the true standard deviation \(\sigma = 100\), so that the true \(d\) is 0.30 or 0.75.

The box plots in the figure represent the distribution of power estimates for the larger RCT under different scenarios. Each scenario was simulated 5000 times each. Ideally, the power estimates should cluster close to 80%, the targeted level of power. In the figure, the percentage next to each box plot reports the percent of simulations with power estimates at or above the target of 80%.

Two things jump out at me. First, using the true effect size in the power calculation gives us a much better chance of designing an RCT with close to 80% power, even when a point estimate is used for \(\hat{\sigma}\). In Browne’s paper, the focus is on the fact that even when using the true effect size, there is a high probability of power falling below 80%. This may be the case, but it may be more important to note that when power is lower than the target, it is actually likely to fall relatively close to the 80% target. If the researcher is very concerned about falling below that threshold, perhaps using \(\gamma\) higher than 0.6 or 0.7 might provide an adequate cushion.

Second, it appears that using the effect size estimate from the pilot as the basis for an RCT power analysis is risky. The box plots labeled as pilot exhibit much more variation than the ‘true’ box plots. As a result, there is a high probability that the true power will fall considerably below 80%. And in many other cases, the true power will be unnecessarily high, because those studies have been designed to be larger than they need to be.

The situation improves somewhat with larger pilot studies, as shown below with 60 patients per arm, where variation seems to be reduced. Still, an argument can be made that using effect sizes from pilot studies is too risky, leading to an under-powered or overpowered study, neither of which is ideal.

A question remains about how best to determine what effect size to use for the power calculation if using the estimate from the pilot is risky. I think a principled approach, such as drawing effect size estimates from the existing literature or using clinically meaningful effect sizes, is a much better way to go. And the pilot study should focus on other important feasibility issues that can help improve the design of the RCT.


References:

Lancaster, G.A., Dodd, S. and Williamson, P.R., 2004. Design and analysis of pilot studies: recommendations for good practice. Journal of evaluation in clinical practice, 10(2), pp.307-312.

Leon, A.C., Davis, L.L. and Kraemer, H.C., 2011. The role and interpretation of pilot studies in clinical research. Journal of psychiatric research, 45(5), pp.626-629.

Browne, R.H., 1995. On the use of a pilot sample for sample size determination. Statistics in medicine, 14(17), pp.1933-1940.

Cojbasic, V. and Loncar, D., 2011. One-sided confidence intervals for population variances of skewed distributions. Journal of Statistical Planning and Inference, 141(5), pp.1667-1672.

 

Support:

This research is supported by the National Institutes of Health National Institute on Aging U54AG063546. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.

 

Addendum

Below is the code I used to run the simulations and generate the plots

library(data.table) # for data.table(), rbindlist(), and := (not loaded in the code shown above)
library(ggplot2)    # for the plot

getPower <- function(ssize, esize, gamma = 0, use.est = FALSE) {

  estring <- paste0("rx * ", esize)
  defd <- defDataAdd(varname = "y", formula = estring, variance = 100^2)

  N <- ssize * 2

  dd <- genData(n = N)
  dd <- trtAssign(dd, grpName = "rx")
  dd <- addColumns(defd, dd)

  lmfit <- lm(y ~ rx, data = dd)

  sd.rx <- dd[rx == 1, sd(y)]
  sd.ctl <- dd[rx == 0, sd(y)]
  pool.sd <- sqrt( (sd.rx^2 + sd.ctl^2) / 2 )

  qchi <- qchisq(gamma, df = N - 2, lower.tail = FALSE)
  ucl <- sqrt( ( (N - 2) * pool.sd^2 ) / qchi )

  # use the pooled point estimate of the SD when gamma = 0, otherwise the upper confidence limit
  p.sd <- pool.sd * (gamma == 0) + ucl * (gamma > 0)
  p.eff <- esize * (use.est == FALSE) + coef(lmfit)["rx"] * (use.est == TRUE)

  if (abs(p.eff/p.sd) < 0.0002) p.eff <- sign(p.eff) * .0002 * p.sd

  nstar <- round(pwr.t.test(n = NULL, d = p.eff/p.sd, sig.level = 0.05,
                            power = 0.80, type = "two.sample")$n, 0)

  power <- pwr.t.test(n = nstar, d = esize/100, sig.level = 0.05,
                      type = "two.sample")

  return(data.table(ssize, esize, gamma, use.est,
                    estsd = pool.sd, ucl = ucl, nstar, power = power$power,
                    est = coef(lmfit)["rx"],
                    lcl.est = confint(lmfit)["rx", 1],
                    ucl.est = confint(lmfit)["rx", 2]))
}

dres <- data.table()

for (i in c(30, 60)) {
  for (j in c(30, 75)) {
    for (k in c(0, .5, .7)) {
      for (l in c(FALSE, TRUE)) {
        dd <- rbindlist(lapply(1:5000,
                function(x) getPower(ssize = i, esize = j, gamma = k, use.est = l)))
        dres <- rbind(dres, dd)
      }}}}

above80 <- dres[, .(x80 = mean(power >= 0.80)), keyby = .(ssize, esize, gamma, use.est)]
above80[, l80 := scales::percent(x80, accuracy = 1)]

g_labeller <- function(value) {
  paste("\U03B3", "=", value) # unicode for gamma
}

e_labeller <- function(value) {
  paste("\U0394", "=", value) # unicode for Delta
}

ggplot(data = dres[ssize == 30],
       aes(x = factor(use.est, labels = c("'truth'", "pilot")), y = power)) +
  geom_hline(yintercept = 0.8, color = "white") +
  geom_boxplot(outlier.shape = NA, fill = "#9ba1cf", width = .4) +
  theme(panel.grid = element_blank(),
        panel.background = element_rect(fill = "grey92"),
        axis.ticks = element_blank(),
        plot.title = element_text(size = 9, face = "bold")) +
  facet_grid(esize ~ gamma,
             labeller = labeller(gamma = g_labeller, esize = e_labeller)) +
  scale_x_discrete(name = "\n source of effect size used for power calculation") +
  scale_y_continuous(limits = c(0, 1), breaks = c(0, .8),
                     name = "distribution of power estimates \n") +
  ggtitle("Distribution of power estimates (n = 30 per treatment arm)") +
  geom_text(data = above80[ssize == 30], aes(label = l80),
            x = rep(c(0.63, 1.59), 6), y = 0.95, size = 2.5)

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Using R and H2O Isolation Forest For Data Quality

Mon, 11/11/2019 - 19:56

[This article was first published on R-Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction:
We will identify anomalous patterns in data. This process is useful not only for finding inconsistencies and errors, but also for finding abnormal data behavior; it can even help detect cyber attacks on organizations.
This article provides more background as a reference: Data Quality and Anomaly Detection Thoughts For Web Analytics

Before starting, we need the following software installed and working:

  • R language
  • H2O open source framework
  • Java 8 (for H2O). OpenJDK: https://github.com/ojdkbuild/contrib_jdk8u-ci/releases
  • RStudio

About the data used in this article:

# I am using https://www.kaggle.com/bradklassen/pga-tour-20102018-data
#
# The version I have is not the most up-to-date, but a newer version may be used.
#
# The file I am using is a 950 MB csv file with 9,720,530 records, including the header.
#
# One very important thing: instead of getting lost in more than 9 million records,
# we will end up looking at just 158 records with anomalies for the analysed variable,
# which makes it much easier to inspect the data.

Let's start coding:

# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) )
# For interactive plotting
suppressWarnings( suppressMessages( library( dygraphs ) ) )
suppressWarnings( suppressMessages( library( dplyr ) ) )
suppressWarnings( suppressMessages( library( DT ) ) )

# Start a single-node instance of H2O using all available processor cores and reserve 5GB of memory
h2oServer = h2o.init( ip = "localhost", port = 54321, max_mem_size = "5g", nthreads = -1 ) ##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /tmp/RtmpC1pHJS/h2o_ckassab_started_from_r.out
## /tmp/RtmpC1pHJS/h2o_ckassab_started_from_r.err
##
##
## Starting H2O JVM and connecting: . Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 395 milliseconds
## H2O cluster timezone: America/Mexico_City
## H2O data parsing timezone: UTC
## H2O cluster version: 3.26.0.6
## H2O cluster version age: 1 month and 8 days
## H2O cluster name: H2O_started_from_R_ckassab_aat507
## H2O cluster total nodes: 1
## H2O cluster total memory: 4.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 3.6.1 (2019-07-05)

h2o.removeAll() # Removes all data from h2o cluster, ensuring it is clean.
h2o.no_progress() # Turn off progress bars for notebook readability

# Setting H2O timezone for proper date data type handling
#h2o.getTimezone() ===>>> UTC
#h2o.listTimezones() # We can see all H2O timezones
h2o.setTimezone("US/Central") ## [1] "US/Central" # Note. I am using Ubuntu 19.10, using /tmp directory
# Every time I boot my computer, I need to copy the data file again to /tmp
# directory.

# Importing data file and setting data types accordingly.
allData = read.csv( "/tmp/PGA_Tour_Golf_Data_2019_Kaggle.csv", sep = ",", header = T )

# When using as.Posixct H2O is not importing data, so we are using as.Date.
allData$Date = as.Date( allData$Date )
allData$Value = as.numeric(allData$Value)

# Convert dataset to H2O format.
allData_hex = as.h2o( allData )

# Build an Isolation forest model
startTime <- Sys.time()
startTime ## [1] "2019-11-10 20:10:30 CST" trainingModel = h2o.isolationForest( training_frame = allData_hex
, sample_rate = 0.1
, max_depth = 32
, ntrees = 100
) ## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Stopping tolerance is ignored for _stopping_rounds=0.. Sys.time() ## [1] "2019-11-10 20:20:15 CST" Sys.time() - startTime ## Time difference of 9.756691 mins # According to H2O doc:
# http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html
#
# Isolation Forest is similar in principle to Random Forest and is built on
# the basis of decision trees.

# Isolation Forest creates multiple decision trees to isolate observations.
#
# Trees are split randomly, The assumption is that:
#
# IF ONE UNIT MEASUREMENTS ARE SIMILAR TO OTHERS,
# IT WILL TAKE MORE RANDOM SPLITS TO ISOLATE IT.
#
# The less splits needed, the unit is more likely to be anomalous.
#
# The average number of splits is then used as a score.

# Calculate score for all data.
startTime <- Sys.time()
startTime ## [1] "2019-11-10 20:20:15 CST" score = h2o.predict( trainingModel, allData_hex )
result_pred = as.vector( score$predict )
Sys.time() ## [1] "2019-11-10 20:23:18 CST" Sys.time() - startTime ## Time difference of 3.056829 mins ################################################################################
# Setting threshold value for anomaly detection.
################################################################################

# Setting desired threshold percentage.
threshold = .999 # Let's say we want the 0.1% of the data that is most different from the rest.

# Using this threshold to get score limit to filter data anomalies.
scoreLimit = round( quantile( result_pred, threshold ), 4 )

# Add row score at the beginning of dataset
allData = cbind( RowScore = round( result_pred, 4 ), allData )

# Get data anomalies by filtering all data.
anomalies = allData[ allData$RowScore > scoreLimit, ]

# As we can see in the summary:
summary(anomalies) ## RowScore Player.Name Date
## Min. :0.9540 Jonas Blixt : 231 Min. :2019-07-07
## 1st Qu.:0.9565 Jordan Spieth : 231 1st Qu.:2019-08-25
## Median :0.9614 Julian Etulain: 221 Median :2019-08-25
## Mean :0.9640 Johnson Wagner: 213 Mean :2019-08-24
## 3rd Qu.:0.9701 John Chin : 209 3rd Qu.:2019-08-25
## Max. :1.0000 Keegan Bradley: 209 Max. :2019-08-25
## (Other) :8325
## Statistic
## Club Head Speed : 234
## Driving Pct. 300-320 (Measured): 193
## Carry Efficiency : 163
## First Tee Early Lowest Round : 161
## First Tee Late Lowest Round : 160
## GIR Percentage - 100+ yards : 158
## (Other) :8570
## Variable
## First Tee Early Lowest Round - (LOW RND) : 103
## First Tee Late Lowest Round - (LOW RND) : 96
## First Tee Late Lowest Round - (ROUNDS) : 64
## Driving Pct. 300-320 (Measured) - (TOTAL DRVS - OVERALL): 61
## GIR Percentage - 175-200 yards - (%) : 61
## First Tee Early Lowest Round - (ROUNDS) : 58
## (Other) :9196
## Value
## Min. : 1268
## 1st Qu.: 53058
## Median : 87088
## Mean :111716
## 3rd Qu.:184278
## Max. :220583
## # The Statistic: GIR Percentage - 100+ yards is one of the most important values
# Filtering all anomalies within this Statistic value
statisticFilter = "GIR Percentage - 100+ yards"

specificVar = anomalies %>%
filter(Statistic==statisticFilter)

cat( statisticFilter, ": ", dim(specificVar)[1] )
## GIR Percentage - 100+ yards :  158

if( dim(specificVar)[1] > 0 ) {

# We want to know the relation between Players and the filtered Statistic ("GIR Percentage - 100+ yards")
# So, in order to get a chart, we assign a code to each player
# Since factors in R are really integer values, we do this to get the codes:
specificVar$PlayerCode = as.integer(specificVar$Player.Name)

# To sort our dataset we convert the date to numeric
specificVar$DateAsNum = as.numeric( paste0( substr(specificVar$Date,1,4)
, substr(specificVar$Date,6,7)
, substr(specificVar$Date,9,10) ) )
# And sort the data frame.
specificVar = specificVar[order(specificVar$DateAsNum),]
# Set records num using a sequence.
rownames(specificVar) = seq(1:dim(specificVar)[1])

colNamesFinalTable = c( "PlayerCode", "Player.Name", "Date", "Variable", "Value" )
specificVar = specificVar[, colNamesFinalTable]
specificVar$PlayerCode = as.factor(specificVar$PlayerCode)

# Creating our final dataframe for our chart.
specificVarChartData = data.frame( SeqNum = as.integer( rownames(specificVar) )
, PlayerCode = specificVar$PlayerCode
, Value = specificVar$Value
)



AnomaliesGraph = dygraph( specificVarChartData, main = ''
, xlab = paste(statisticFilter,"Anomaly Number."), ylab = "Player Code." ) %>%
dyAxis("y", label = "Player Code.") %>%
dyAxis("y2", label = "Value.", independentTicks = TRUE) %>%
dySeries( name = "PlayerCode", label = "Player Code.", drawPoints = TRUE, pointShape = "dot"
, color = "blue", pointSize = 2 ) %>%
dySeries( name = "Value", label = "Value.", drawPoints = TRUE, pointShape = "dot"
, color = "green", pointSize = 2, axis = 'y2' ) %>%
dyRangeSelector()
dyOptions( AnomaliesGraph, digitsAfterDecimal = 0 )
}
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo

Sample chart with the anomalies found:

Sample data table with the anomalies found:

    Player Code  Player Name       Date        Variable                                 Value
 1  686          Josh Teater       2019-08-25  GIR Percentage – 100+ yards – (ROUNDS)   198471
 2  655          Johnson Wagner    2019-08-25  GIR Percentage – 100+ yards – (ROUNDS)   186658
 3  618          Jim Furyk         2019-08-25  GIR Percentage – 100+ yards – (ROUNDS)   198471
 4  723          Keegan Bradley    2019-08-25  GIR Percentage – 100+ yards – (ROUNDS)   211362
 5  213          Cameron Tringale  2019-08-25  GIR Percentage – 100+ yards – (ROUNDS)   198471
 6  712          Justin Thomas     2019-08-25  GIR Percentage – 100+ yards – (ROUNDS)   199671
 7  520          Hunter Mahan      2019-08-25  GIR Percentage – 100+ yards – (ROUNDS)   178096
 8  587          Jason Day         2019-08-25  GIR Percentage – 100+ yards – (ROUNDS)   189755
 9  539          J.J. Henry        2019-08-25  GIR Percentage – 100+ yards – (ROUNDS)   177431
10  657          Jon Rahm          2019-08-25  GIR Percentage – 100+ yards – (ROUNDS)   199671
(Showing 1 to 10 of 158 entries)

Here is the code on github, including the full working html demo:
https://github.com/LaranIkal/DataQuality

Enjoy it!

Carlos Kassab https://www.linkedin.com/in/carlos-kassab-48b40743/
More information about R: https://www.r-bloggers.com

To leave a comment for the author, please follow the link and comment on their blog: R-Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Free Training: Mastering Data Structures in R

Mon, 11/11/2019 - 18:00

[This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Next week I will be delivering a free online R training. This is a new course I’ve created called Mastering Data Structures in R. This course is for you if:

  • You are new to R, and want a rigorous introduction to R as a programming language
  • You know how to analyze data in R, but want to take the next step and learn to program in it too
  • You already know R, but find yourself frequently confused by its “idiosyncrasies”

Register for Mastering Data Structures in R
R as a Tool for Data Analysis

When clients reach out to me for training, they normally want help learning R as a tool for Data Analysis. Most of my students are already experts at Excel. They want to learn R because they have heard that it is more powerful than Excel. For these clients I normally propose a two-day course that focuses on the Tidyverse and R Markdown:

  • Day 1: learn to use ggplot2 and dplyr on datasets that have already been cleaned.
  • Day 2: in the morning, learn to import and clean data using the Tidyverse. In the afternoon, learn to use R Markdown to do reproducible research.

In general, this sort of course helps people learn to use R to do much of what they were already doing in Excel.

R as a Programming Language

The biggest problem with my 2-day workshop is this: while I intend it to be a starting point for learning R, many of my students think that it’s all that they need to know! In fact, though, I only consider it to be a starting point.

For example, when I was working at an online Real Estate company, I needed to analyze our website’s sales lead data. I started by using R as a data analysis tool. I used packages like ggplot2 to explore the ebb and flow of our sales leads over time for each metropolitan area. But I eventually hit a wall. What I really wanted to do was map our data at the ZIP code level, and mash it up with data from the Census Bureau. Of course, no package existed to do this: it’s too specific. But since I knew R as a programming language, I was able to create functions (and eventually a package) to answer the exact question I had.

And this is the level that I want all my students to get to. Every analyst has a unique problem. Learning to use R as a programming language allows you to answer the exact question you have.


Register for Mastering Data Structures in R
Why Data Structures?

Most of my students struggle to learn the “programming language” aspect of R  because they never formally studied Computer Science. I decided to address this by creating a course that resembles an introductory Computer Science course but uses R as the language of instruction.

My undergraduate Computer Science curriculum focused on Data Structures and Algorithms. This is why Mastering Data Structures in R provides a rigorous introduction to the basic data structures in R. The course contains dozens of exercises, and will increase your fluency at the console. 

I recently gave a pilot version of this course that was well received. To celebrate, I will be running the course online, for free, next week. Here is the syllabus:

  • Monday 11/18: Data Types
  • Tuesday 11/19: Vectors
  • Wednesday 11/20: Factors and Lists
  • Friday 11/21: Data Frames

All sessions will start at 10am PT and last approximately 90 minutes.

If you are interested in the course but cannot attend live, then you should still register: I will be sending recordings to everyone who registers.


Register for Mastering Data Structures in R

The post Free Training: Mastering Data Structures in R appeared first on AriLamstein.com.


To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Scraping Machinery Parts

Mon, 11/11/2019 - 03:00

[This article was first published on R | datawookie, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve been exploring the feasibility of aggregating data on prices of replacement parts for heavy machinery. There are a number of websites which list this sort of data. I’m focusing on the static sites for the moment.

I’m using R with {rvest} (and a few other Tidyverse packages thrown in for good measure).

library(glue)
library(dplyr)
library(purrr)
library(stringr)
library(rvest)

The data are paginated. Fortunately the URL includes the page number as a GET parameter, so stepping through the pages is simple. I defined a global variable, URL, with a {glue} placeholder for the page number.
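
The actual site isn't named in the post, so the URL and CSS selector below are placeholders, but the pattern looks something like this:

# Placeholder site and selector -- substitute the real values for the site being scraped.
URL <- "https://www.example.com/parts?page={page}"  # {page} is the glue placeholder
SELECTOR <- ".product-item"                         # CSS selector for one listed part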

This is how I’m looping over the pages. The page number, page, is set to one initially and incremented after each page of results is scraped. When it gets to a page without any results, the loop stops.

page = 1

while (TRUE) {
  items <- read_html(glue(URL)) %>%
    html_nodes(SELECTOR)
  #
  if (length(items) == 0) break # Check if gone past last page.
  #
  # Extract data for each item here...
  #
  page <- page + 1              # Advance to next page.
}

That’s the mechanics. Within the loop I then used map_dfr() from {purrr} to iterate over items, delving into each item to extract its name and price.

map_dfr(items, function(item) {
  tibble(
    part = item %>% html_node("p") %>% html_text(),
    price = item %>% html_node(".product-item--price small") %>% html_text()
  )
})

The results from each page are appended to a list and finally concatenated using bind_rows().
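
Putting the pieces together, the accumulation step might look like this (a sketch reusing the placeholder URL and SELECTOR from above):

results <- list()
page <- 1
while (TRUE) {
  items <- read_html(glue(URL)) %>% html_nodes(SELECTOR)
  if (length(items) == 0) break  # Past the last page.
  results[[page]] <- map_dfr(items, function(item) {
    tibble(
      part  = item %>% html_node("p") %>% html_text(),
      price = item %>% html_node(".product-item--price small") %>% html_text()
    )
  })
  page <- page + 1
}
parts <- bind_rows(results)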

> dim(parts) [1] 987 2

Scraping a single category yields 987 parts. Let’s take a look at the first few.

> head(parts)
  part                                                                                  price
  <chr>                                                                                 <chr>
1 R986110000 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 240-7732 $1,534.00
2 R986110001 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 213-5268 $1,854.00
3 R986110002 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 266-8034 $1,374.00
4 R986110003 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 296-6728 $1,754.00
5 R986110004 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 136-8869 $1,494.00
6 R986110005 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 255-6805 $1,534.00

That’s looking pretty good already. There’s one final niggle: the data in the price column are strings. Ideally we’d want those to be numeric. But to do that we have to strip off some punctuation. Not a problem thanks to functions from {stringr}.

parts <- parts %>%
  mutate(
    price = str_replace(price, "^\\$", ""),    # Strip off leading "$"
    price = str_replace_all(price, ",", ""),   # Strip out comma separators
    price = as.numeric(price)
  )

Success!

> head(parts)
  part                                                                                  price
  <chr>                                                                                 <dbl>
1 R986110000 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 240-7732 1534
2 R986110001 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 213-5268 1854
3 R986110002 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 266-8034 1374
4 R986110003 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 296-6728 1754
5 R986110004 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 136-8869 1494
6 R986110005 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 255-6805 1534

The great thing about scripting a scraper is that, provided that the website has not been dramatically restructured, the scraper can be run at any time to gather an updated set of results.


To leave a comment for the author, please follow the link and comment on their blog: R | datawookie. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Statistical uncertainty with R and pdqr

Mon, 11/11/2019 - 01:00

[This article was first published on QuestionFlow , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

CRAN has accepted my ‘pdqr’ package. Here are important examples of how it can be used to describe and evaluate statistical uncertainty.

Prologue

I am glad to announce that my latest, long written R package ‘pdqr’ is accepted to CRAN. It provides tools for creating, transforming and summarizing custom random variables with distribution functions (as base R ‘p*()’, ‘d*()’, ‘q*()’, and ‘r*()’ functions). You can read a brief overview in one of my previous posts.

We will need the following setup:

library(pdqr)
library(magrittr)

# For the sake of reproducibility
set.seed(20191111)

Statistical uncertainty

General description

Statistical estimation usually has the following setup. There is a sample (observed, usually randomly chosen, set of values of measurable quantities) from some general population (whole set of values of the same measurable quantities). We need to make conclusions about the general population based on a sample. This is done by computing summary values (called statistics) of a sample, and making reasonable assumptions (with process usually called inference) about how these values are close to values that potentially can be computed based on whole general population. Thus, summary value based on a sample (sample statistic) is an estimation of potential summary value based on a general population (true value).

How can we make inference about quality of this estimation? This question itself describes statistical uncertainty and can be unfolded into a deep philosophical question about probability, nature, and life in general. Basically, the answer depends on assumptions about the relation between sample, general population, and statistic.

For me, the most beautiful inferential approach is bootstrap. It has the following key assumption: process of producing samples from general population can be simulated by doing random sampling with replacement from present sample. In other words, we agree (and practice often agrees with us) that random sampling with replacement from current sample (sometimes called bootstrap sampling) has a “close enough” behavior to the “true nature” of how initial sample was created. Numerical estimation of “how close” is also an interesting problem, but it is a more complicated topic.

Computation with pdqr

The natural way of computing bootstrap quantities is straightforward: produce \(B\) random bootstrap samples, compute the value of the statistic in question for each one, and summarize the sample of statistic values with a numerical quantity (usually some measure of center and spread).
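
For instance, a bare-bones version of this recipe in base R (a sketch without ‘pdqr’, bootstrapping the mean of mtcars$mpg) would be:

# Ordinary bootstrap of the sample mean, B = 10000 resamples
x <- mtcars$mpg
B <- 10000
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))
# Summarize the bootstrap distribution of the statistic
c(center = mean(boot_means), spread = sd(boot_means))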

There are many ways of performing bootstrap in R, like boot::boot(), rsample::bootstraps(), and others. In turn, ‘pdqr’ offers its own way of describing and doing bootstrap inference for one-dimensional numeric sample(s):

  • Create a random variable (in the form of pdqr-function with new_*() family) based on initial sample. This random variable already describes a general population with “bootstrap assumption”: it will produce values based on initial sample. Type of this variable determines the type of bootstrap:
    • Type "discrete" describes ordinary bootstrap. Only values from initial sample can be produced.
    • Type "continuous" describes smooth bootstrap. Initial sample is smoothed by doing kernel density estimation with density() function and random variable produces values from distribution with that density.
  • Transform created random variable into one that produces statistic values obtained with bootstrap. Sometimes this can be done with basic mathematical operations like +, min, etc. But usually this is done with form_estimate() function: it creates many (10000 by default) bootstrap samples, for each computes statistic value, and creates its own random variable in the form of pdqr-function (class and type are preserved from supplied random variable, but this can be adjusted). It needs at least three arguments:
    • f: pdqr-function representing random variable. In described setup it is created as a result of “Create” step.
    • stat: statistic function that accepts numeric vector of size sample_size and returns single numeric or logical output.
    • sample_size: Size of a sample that each bootstrap draw should produce. In described setup it should be equal to number of elements in initial sample.
  • Summarize distribution of statistic. Usually this is point measure of center or spread, or interval.
Example 1: single numerical estimate

The mean value of the ‘mpg’ variable in the mtcars dataset is 20.090625. However, keeping statistical uncertainty in mind, we can ask how precise this estimation is. This can, and should, be reformulated as the following question: if we repeat sampling sets of 32 cars from the general population of all cars, how close will their ‘mpg’ sample means be to each other? This can be answered by computing the bootstrap distribution of sample means (the pipe %>% function from the ‘magrittr’ package is used to simplify notation):

# Using ordinary bootstrap
d_mpg_dis_mean <- mtcars$mpg %>%
  new_d(type = "discrete") %>%
  form_estimate(stat = mean, sample_size = nrow(mtcars))

# Spread of this bootstrap distribution describes the precision of estimation:
# bigger values indicate lower precision
summ_sd(d_mpg_dis_mean)
## [1] 1.04067

# This discrete distribution has the following d-function
plot(
  d_mpg_dis_mean,
  main = "Ordinary bootstrap distribution of 'mpg' sample mean"
)

If the modeling assumption about the continuous nature of the ‘mpg’ variable is reasonable (which it seems to be), you can use “smooth bootstrap” by changing the type of the initial pdqr-function:

# Using smooth bootstrap with `type = "continuous"`
d_mpg_con_mean <- mtcars$mpg %>%
  new_d(type = "continuous") %>%
  form_estimate(stat = mean, sample_size = nrow(mtcars))

# Spread is higher in this case because kernel density estimation with
# `density()` function extends support during creation of pdqr-function on the
# bootstrap step
summ_sd(d_mpg_con_mean)
## [1] 1.153957

plot(
  d_mpg_con_mean,
  main = "Smooth bootstrap distribution of 'mpg' sample mean"
)

One can also do ordinary bootstrap but represent bootstrap distribution of sample mean with continuous random variable:

# Using ordinary bootstrap, but treating sample mean as continuous
d_mpg_con_mean_2 <- mtcars$mpg %>%
  new_d(type = "discrete") %>%
  form_estimate(
    stat = mean, sample_size = nrow(mtcars),
    # Create continuous pdqr-function from bootstrap sample means
    args_new = list(type = "continuous")
  )

summ_sd(d_mpg_con_mean_2)
## [1] 1.063524

plot(
  d_mpg_con_mean_2,
  main = "Ordinary bootstrap distribution of 'mpg' continuous sample mean"
)

In this case, the sample mean has a standard deviation between 1.04067 and 1.1539572 (depending on assumptions about the data generating process).

Example 2: single logical estimate

The share of 4-cylinder cars in mtcars is equal to 0.34375. However, it might happen that we don’t care about the actual value, but only about whether it is bigger than 0.3 or not. In the present data it is bigger, but how sure can we be about that? In other words: if we repeat sampling sets of 32 cars from the general population of all cars, in which part of them will the share of 4-cylinder cars be bigger than 0.3? Here is the way of computing that with ‘pdqr’:

# If statistic returns logical value (indicating presence of some feature in
# sample), output estimate pdqr-function is "boolean": "discrete" type function
# with elements being exactly 0 (indicating `FALSE`) and 1 (indicating `TRUE`).
d_cyl_lgl <- mtcars$cyl %>%
  new_d(type = "discrete") %>%
  form_estimate(
    stat = function(x) {mean(x == 4) > 0.3},
    sample_size = nrow(mtcars)
  )

d_cyl_lgl
## Probability mass function of discrete type
## Support: [0, 1] (2 elements, probability of 1: 0.7113)

# To extract certain probability from boolean pdqr-function, use
# `summ_prob_*()` functions
summ_prob_true(d_cyl_lgl)
## [1] 0.7113
summ_prob_false(d_cyl_lgl)
## [1] 0.2887

In this case, the estimated probability that the share of 4-cylinder cars in the general population is more than 0.3 is 0.7113.

Example 3: comparison of estimates

In mtcars there are 19 cars with automatic transmission (‘am’ variable is 0) and 13 with manual (‘am’ variable is 1). We might be concerned with the following question: are cars with automatic transmission heavier than cars with manual transmission? This is an example of question where reformulating is very crucial, because it leads to completely different methodologies. Basically, it is all about dealing with statistical uncertainty and how to measure that one numerical set is bigger than the other.

A first, rather verbose, way of expanding this question is this one: if we randomly choose a car with automatic transmission (uniformly from the set of all cars with automatic transmission) and a car with manual transmission (uniformly from the set of all cars with manual transmission), what is the probability that the weight of the first one is bigger than that of the second one? With ‘pdqr’ this can be computed straightforwardly by comparing two random variables (which is implemented exactly like the question above; read more here):

# Seems reasonable to treat weight as continuous random variable. Note that this
# means use of kernel density estimation, which can lead to random variable that
# returns negative values. As weight can be only positive, it is a good idea to
# ensure that. Package 'pdqr' has `form_resupport()` function for that.
d_wt_am0 <- mtcars$wt[mtcars$am == 0] %>%
  new_d(type = "continuous") %>%
  # Ensure that returned values are only positive
  form_resupport(c(0, NA))

d_wt_am1 <- mtcars$wt[mtcars$am == 1] %>%
  new_d(type = "continuous") %>%
  form_resupport(c(0, NA))

# Comparing two pdqr-functions with `>=` results into boolean pdqr-function
summ_prob_true(d_wt_am0 >= d_wt_am1)
## [1] 0.9209063

So in this case the answer is that probability of “automatic” cars being heavier than “manual” ones is around 0.921.

A second way of understanding the comparison question is the following: is the average weight of “automatic” cars bigger than that of “manual” cars? This type of question is more widespread in statistical practice. Having to deal with statistical uncertainty, it should be reformulated: if we repeat sampling (in parallel pairs) sets of 19 “automatic” cars and of 13 “manual” cars, in which part of the set pairs will the mean weight of “automatic” cars be bigger? This question implies creating bootstrap distributions of sample means for “automatic” and “manual” cars, and then comparing them:

d_wt_am0_mean <- d_wt_am0 %>%
  form_estimate(stat = mean, sample_size = sum(mtcars$am == 0)) %>%
  # Ensure "positiveness" of random variable
  form_resupport(c(0, NA))

d_wt_am1_mean <- d_wt_am1 %>%
  form_estimate(stat = mean, sample_size = sum(mtcars$am == 1)) %>%
  form_resupport(c(0, NA))

# Comparing two random variables representing sample means
summ_prob_true(d_wt_am0_mean >= d_wt_am1_mean)
## [1] 1

So in this case the answer is that probability of “automatic” cars being heavier than “manual” ones is 1.

The computed results can lead to decisively different conclusions. If the researcher sets a standard 0.95 rule, the first variant would imply that the conclusion ‘“automatic” cars are heavier than “manual” ones’ isn’t significant, while the second would imply otherwise.

Epilogue
  • Basic knowledge about statistical uncertainty is crucial to understand the process of statistical inference.
  • One of the most popular methodologies for doing statistical inference is bootstrap. There are at least two kinds of it: ordinary and smooth.
  • Package ‘pdqr’ offers extensive functionality for describing and estimating statistical uncertainty. Core functions here are new_*() family, form_estimate(), and comparison operators.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] magrittr_1.5 pdqr_0.2.0
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2      bookdown_0.13   crayon_1.3.4    digest_0.6.21
##  [5] evaluate_0.14   blogdown_0.15   pillar_1.4.2    rlang_0.4.0
##  [9] stringi_1.4.3   rmarkdown_1.15  tools_3.6.1     stringr_1.4.0
## [13] xfun_0.9        yaml_2.2.0      compiler_3.6.1  htmltools_0.3.6
## [17] knitr_1.25


To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow.

A comparison of methods for predicting clothing classes using the Fashion MNIST dataset in RStudio and Python (Part 1)

Mon, 11/11/2019 - 01:00

[This article was first published on R Views, and kindly contributed to R-bloggers].

Florianne Verkroost is a PhD candidate at Nuffield College at the University of Oxford. With a passion for data science and a background in mathematics and econometrics, she applies her interdisciplinary knowledge to computationally address societal problems of inequality.

In this series of blog posts, I will compare different machine and deep learning methods to predict clothing categories from images using the Fashion MNIST data. In this first blog of the series, we will explore and prepare the data for analysis. I will also show you how to predict the clothing categories of the Fashion MNIST data using my go-to model: an artificial neural network. To show you how to use one of RStudio’s incredible features to run Python from RStudio, I build my neural network in Python using the code in this Python script or this Jupyter notebook on my Github. In the second blog post, we will experiment with tree-based methods (single tree, random forests and boosting) and support vector machines to see whether we can beat the neural network in terms of performance. As Python cannot be run in this blog post, I will walk you through the results from this script produced earlier, but if you would also like to see how to embed Python code and results in R Markdown files, check out this Markdown file on my Github! The R code used for this blog is also included on my Github.

To start, we first set our seed to make sure the results are reproducible.

set.seed(1234)

Importing and exploring the data

The keras package contains the Fashion MNIST data, so we can easily import the data into RStudio directly from this package after installing and loading it.

library(devtools)
install.packages("keras")
# devtools::install_github("rstudio/keras")
library(keras)
install_keras()
fashion_mnist <- keras::dataset_fashion_mnist()

The resulting object named fashion_mnist is a nested list, consisting of lists train and test. Each of these lists in turn consists of arrays x and y. To look at the dimensions of these elements, we recursively apply the dim() function to the fashion_mnist list.

rapply(fashion_mnist, dim)
train.x1 train.x2 train.x3  train.y  test.x1  test.x2  test.x3   test.y
   60000       28       28    60000    10000       28       28    10000

From the result, we observe that the x array in the training data has dimensions 60000 by 28 by 28: it contains 60000 images of 28 by 28 pixels each. The y array in the training data contains 60000 labels, one for each image in the x array. The test data has a similar structure but contains only 10000 images rather than 60000. For simplicity, we rename these list elements to something more intuitive (where x now represents images and y represents labels):

c(train.images, train.labels) %<-% fashion_mnist$train
c(test.images, test.labels) %<-% fashion_mnist$test

Every image is captured by a 28 by 28 matrix, where entry [i, j] represents the opacity of that pixel on an integer scale from 0 (white) to 255 (black). The labels consist of integers between zero and nine, each representing a unique clothing category. As the category names are not contained in the data itself, we have to store and add them manually. Note that the categories are evenly distributed in the data.

cloth_cats <- data.frame(
  category = c('Top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Boot'),
  label = seq(0, 9)
)
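A quick way to check that claim of even distribution is to tabulate the raw integer labels against the category names. This is only a minimal sketch; it assumes train.labels is still the integer vector extracted above, in which each category appears 6,000 times in the training set.

table(factor(train.labels, levels = cloth_cats$label, labels = cloth_cats$category))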

To get an idea of what the data entail and look like, we plot the first ten images of the test data. To do so, we first need to reshape the data slightly so that they become compatible with ggplot2. We select the first ten test images, convert them to data frames, rename the columns into digits 1 to 28, create a variable named y with digits 1 to 28 and then melt by variable y. We need package reshape2 to access the melt() function. This results in a data frame of 28 times 28 equals 784 rows and 3 columns (y pixel (= y), x pixel (= variable) and opacity (= value)) per image. We bind these all together by rows using the rbind.fill() function from the plyr package and add a variable Image, which is a unique string repeated 784 times for each of the ten images, containing the image number and corresponding test set label.

library(reshape2)
library(plyr)

subarray <- apply(test.images[1:10, , ], 1, as.data.frame)
subarray <- lapply(subarray, function(df){
  colnames(df) <- seq_len(ncol(df))
  df['y'] <- seq_len(nrow(df))
  df <- melt(df, id = 'y')
  return(df)
})
plotdf <- rbind.fill(subarray)
first_ten_labels <- cloth_cats$category[match(test.labels[1:10], cloth_cats$label)]
first_ten_categories <- paste0('Image ', 1:10, ': ', first_ten_labels)
plotdf['Image'] <- factor(rep(first_ten_categories, unlist(lapply(subarray, nrow))),
                          levels = unique(first_ten_categories))

We then plot these first ten test images using package ggplot2. Note that we reverse the scale of the y-axis because the original dataset contains the images upside-down. We further remove the legend and axis labels and change the tick labels.

library(ggplot2)
ggplot() +
  geom_raster(data = plotdf, aes(x = variable, y = y, fill = value)) +
  facet_wrap(~ Image, nrow = 2, ncol = 5) +
  scale_fill_gradient(low = "white", high = "black", na.value = NA) +
  theme(aspect.ratio = 1, legend.position = "none") +
  labs(x = NULL, y = NULL) +
  scale_x_discrete(breaks = seq(0, 28, 7), expand = c(0, 0)) +
  scale_y_reverse(breaks = seq(0, 28, 7), expand = c(0, 0))

Data Preparation

Next, it’s time to start the more technical work of predicting the labels from the image data. First, we need to reshape our data to convert it from a multidimensional array into a two-dimensional matrix. To do so, we vectorize each 28 by 28 matrix into a vector of length 784; apply() collects these vectors as the columns of a 784 by 60000 matrix, which we then transpose. This way, we convert the 60000 by 28 by 28 array into a 60000 by 784 matrix. We also normalize the data by dividing by the maximum opacity of 255.

train.images <- data.frame(t(apply(train.images, 1, c))) / max(fashion_mnist$train$x)
test.images <- data.frame(t(apply(test.images, 1, c))) / max(fashion_mnist$train$x)

We also create two data frames that include all training and test data (images and labels), respectively.

pixs <- 1:ncol(fashion_mnist$train$x)^2
names(train.images) <- names(test.images) <- paste0('pixel', pixs)
train.labels <- data.frame(label = factor(train.labels))
test.labels <- data.frame(label = factor(test.labels))
train.data <- cbind(train.labels, train.images)
test.data <- cbind(test.labels, test.images)

Artificial Neural Network

Now, let’s continue by building a simple neural network model to predict our clothing categories. Neural networks are artificial computing systems that were built with human neural networks in mind. Neural networks contain nodes, which transmit signals amongst one another. Usually the input at each node is a number, which is transformed according to a non-linear function of the input and weights, the latter being the parameters that are tuned while training the model. Sets of neurons are collected in different layers; neural networks are referred to as ‘deep’ when they contain at least two hidden layers. If you’re not familiar with artificial neural networks, then this free online book is a good source to start learning about them.

In this post, I will show you how artificial neural networks with different numbers of hidden layers compare, and I will also compare these networks to a convolutional network, which often performs better in the case of visual imagery. I will show you some basic models and how to code these, but will not spend too much time on tuning neural networks, for example when it comes to choosing the right number of hidden layers or the number of nodes in each hidden layer. In essence, these parameters largely depend on the structure, size and complexity of your data. The more hidden layers one adds, the more complex non-linear relationships can be modelled. Often, in my experience, adding hidden layers to a neural network increases performance up to a certain number of layers, after which the gains become negligible while the computational cost and the difficulty of interpretation keep growing. It is up to you to play around a bit with your specific data and test how this trade-off works.

Although neural networks can easily be built in RStudio using TensorFlow and Keras, I really want to show you one of the incredible features of RStudio where you can run Python in RStudio. This can be done in two ways: either we choose “Terminal” on the top of the output console in RStudio and run Python via Terminal, or we use the base system2() function to run Python in RStudio.

For the second option, to use the system2() command, it’s important to first check which version of Python should be used. You can check which versions of Python are installed on your machine by running python --version in Terminal. Note that with RStudio 1.1 (1.1.383 or higher), you can run Terminal directly from RStudio on the “Terminal” tab. You can also run python3 --version to check if you have Python version 3 installed. On my machine, python --version and python3 --version return Python 2.7.16 and Python 3.7.0, respectively. You can then run which python (or which python3 if you have Python version 3 installed) in Terminal, which will return the path where Python is installed. In my case, these respective commands return /usr/bin/python and /Library/Frameworks/Python.framework/Versions/3.7/bin/python3. As I will make use of Python version 3, I specify the latter as the path to Python in the use_python() function from the reticulate package. We can check whether the desired version of Python is used by using the sys package from Python. Just make sure to change the path in the code below to the version of Python you want to use and where that version is installed.

library(reticulate)
use_python(python = '/Library/Frameworks/Python.framework/Versions/3.7/bin/python3')
sys <- import("sys")
sys$version

Now that we’ve specified the correct version of Python to be used, we can run our Python script from RStudio using the system2() function. This function also takes an argument for the version of Python used, which in my case is Python version 3. If you are using an older version of Python, make sure to change "python3" in the command below to "python2".

python_file <- "simple_neural_network_fashion_mnist.py"
system2("python3", args = c(python_file), stdout = NULL, stderr = "")

The source code used to build and fit the neural networks from the above script can be found in this Python script or this Jupyter notebook on my Github. In this post, I will walk you through the results from this script produced earlier, but if you would also like to see how to embed Python code and results in R Markdown files, check out this file on my Github!

I will now guide you step by step through the script called in the command above. First, we load the required packages in Python and set the session seed for replicability.

We then load the fashion MNIST data from keras and normalize the data by dividing by the maximum opacity of 255.

We start by building a simple neural network containing one hidden layer. Note that since we use the untransformed (but normalized) data here, we need to flatten the 28 by 28 pixel input first. We add one hidden densely-connected layer which performs output = relu(dot(input, kernel) + bias), where the rectified linear unit (relu) activation function has been proven to work well. We set the number of nodes equal to 128, because this seems to work well in our case. The number of nodes could essentially be any of the numbers 32, 64, 128, 256 and 512, as these are powers of two lying between the number of nodes in the output (= 10) and input (= 784) layers. The softmax layer then assigns predicted probabilities to each of the ten clothing categories, which is also why there are ten nodes in this layer.
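The author’s actual model lives in the Python script linked above; purely as an illustration, an equivalent architecture written with the R keras interface might look like the sketch below (not the author’s code; train.x is an assumed name for the original 60000 × 28 × 28 array scaled to [0, 1]).

library(keras)

model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%        # flatten each 28 x 28 image to a 784-vector
  layer_dense(units = 128, activation = "relu") %>% # one hidden densely-connected layer
  layer_dense(units = 10, activation = "softmax")   # predicted probabilities for the 10 categories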

After building the neural network, we compile it. We specify sparse_categorical_crossentropy as the loss function, which is suitable for categorical multi-class responses. The optimizer controls the learning rate; adam (adaptive moment estimation) is similar to classical stochastic gradient descent and usually a safe choice for the optimizer. We set our metric of interest to be the accuracy, or the percentage of correctly classified images. Hereafter, we fit the model onto our training data set using ten iterations through the training data (“epochs”). Here, 70% is used for training and 30% is used for validation.
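Continuing the hedged R sketch from above, compiling and fitting with the settings just described could look roughly like this (the object names train.x and train.y are assumptions; train.y holds the integer labels 0–9):

model %>% compile(
  loss = "sparse_categorical_crossentropy",  # suitable for integer-coded multi-class labels
  optimizer = "adam",
  metrics = "accuracy"
)

history <- model %>% fit(
  x = train.x, y = train.y,
  epochs = 10,             # ten iterations through the training data
  validation_split = 0.3   # 70% used for training, 30% for validation
)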

Next, we print the results of the model in terms of training and testing loss and accuracy.

We can see that the neural network with one hidden layer already performs relatively well with a test accuracy of 87.09%. However, it seems like we are slightly overfitting (i.e. the model is fitted too closely to a particular data set and therefore does not extend well to other data sets), as the training set accuracy (88.15%) is slightly higher than the test set accuracy. There are several ways to avoid overfitting in neural networks, such as simplifying our model by reducing the number of hidden layers and neurons, adding dropout layers that randomly remove some of the connections between layers, and stopping early when the validation loss starts to increase. Later on in this post, I will demonstrate some of these methods to you. For further reading, I personally like this and this post showing how to avoid overfitting when building neural networks using keras. For now, to see whether a deep neural network performs better at predicting clothing categories, we build a neural network with three hidden layers in a similar way as before.



It seems like the model with two additional layers does not perform better than the previous one with only one hidden layer, given that both the training (87.42%) and test set (86.03%) accuracies are lower and the loss (38.49) is higher. Let’s try whether adding another five hidden layers improves model performance, or whether we can conclude that increasing model complexity does not improve performance.



The model with eight hidden layers performs best in terms of training (88.21%) and test (87.58%) accuracy as well as loss (36.12). Nevertheless, the difference in performance between the first model with one hidden layer and the current model with eight hidden layers is quite small. Although it seems that with so many hidden layers we can model additional complexity that improves the accuracy of the model, we must ask ourselves whether increasing model complexity at the cost of interpretability and computational feasibility is worth this slight improvement in accuracy and loss.

Now that we have seen how the number of hidden layers affects model performance, let’s try and see whether increasing the number of epochs (i.e. the number of times the model iterates through the training data) from ten to fifty improves the performance of our first neural network with one hidden layer.



The model with one hidden layer trained with fifty epochs has the highest train (89.32%) and test (88.68%) accuracies we have seen so far. However, the loss (54.73) is also about a third larger than we have seen before. Additionally, the model is also less time-efficient, given that the increase in accuracy is not substantial but the model takes significantly longer to fit. To better understand the trade-off between minimizing loss and maximizing accuracy, we plot model loss and accuracy over the number of epochs for the training and cross-validation data.

We observe that for the training data, loss decreases to zero while accuracy increases to one, as a result of overfitting. This is why we also check how the model performs on the cross-validation data, for which we observe that loss increases with the number of epochs while accuracy remains relatively stable. Using this figure, we can select an “optimal” number of epochs such that accuracy is maximized while loss is minimized. Looking at the cross-validation data accuracy, we see that the accuracy peak lies at around 20 epochs, for which loss is approximately 0.4. However, similar accuracies but much lower losses and modelling time are achieved with around 6 and 12 epochs, and so we might rather choose to train our model with around 6 or 20 epochs.

Regarding the model output, the predictions returned are probabilities per class or clothing category. We can calculate the majority vote by taking, for each image, the class with the maximum predicted probability. We can print the first ten elements of the majority_vote dictionary, which we can obtain as follows:
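In R, the same majority vote could be computed from a matrix of predicted class probabilities along these lines (a sketch only; pred_probs, test.x and test.y are assumed names that continue the earlier sketch):

pred_probs <- model %>% predict(test.x)               # one row per image, one column per category
majority_vote <- apply(pred_probs, 1, which.max) - 1  # subtract 1 because labels run from 0 to 9

head(majority_vote, 10)  # first ten predicted categories
head(test.y, 10)         # first ten actual labels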

All except the fifth (number 4) prediction are correct. In the fifth prediction, a shirt (category 6) is being misclassified as a top (category 0).

Convolutional Neural Network

I also wanted to show you how to build a convolutional neural network and compare its performance to the neural networks presented earlier, mostly because convolutional neural networks have generally been shown to perform better on visual image data. Essentially, what happens in a convolutional neural network is that a smaller matrix (the “filter matrix” or “kernel”) slides over the full image matrix, moving pixel by pixel; it multiplies the filter matrix with the part of the full image matrix it currently covers, sums up these values, and then repeats this until the full image matrix has been covered. For a more extensive explanation of how convolutional neural networks work, I refer you to this page or this page.

As we need to prepare our data slightly differently for a convolutional neural network, we reload the data and reshape the images by adding an explicit channel dimension. The last “1” in the reshape dimensions stands for greyscale, as we have images on a black-to-white scale. If we had RGB images, we would change the “1” into a “3”.

We make sure that the values of the pixels, ranging from zero to 255, are of the float type and then we normalize the values as before.
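The corresponding preparation in the R keras interface might look roughly like this (a sketch, not the author’s Python code; array_reshape() comes with keras/reticulate and the variable names are assumptions):

c(c(train.x, train.y), c(test.x, test.y)) %<-% dataset_fashion_mnist()

# Add an explicit channel dimension: 1 = greyscale (use 3 for RGB images)
train.x <- array_reshape(train.x, c(nrow(train.x), 28, 28, 1))
test.x <- array_reshape(test.x, c(nrow(test.x), 28, 28, 1))

# Normalize by the maximum opacity of 255
train.x <- train.x / 255
test.x <- test.x / 255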


The convolutional neural network cannot deal with categorical labels. Therefore, we transform the labels to binary vectors, where all vectors have length ten (as there are ten categories), a “1” at the index of the category and zeros elsewhere. For example, category 3 and 8 would be coded as [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] and [0, 0, 0, 0, 0, 0, 0, 0, 1, 0], respectively. This transformation is referred to as “one hot encoding”.
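keras ships with to_categorical() for exactly this transformation; a minimal sketch (again with assumed variable names):

train.y.onehot <- to_categorical(train.y, num_classes = 10)  # e.g. label 3 -> c(0,0,0,1,0,0,0,0,0,0)
test.y.onehot <- to_categorical(test.y, num_classes = 10)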

Now, we can start building our convolutional neural network. The first layer Conv2D is a convolutional layer that takes a 2-dimensional matrix of 28 by 28 pixels in greyscale (1) as input. As before, we use 128 nodes in this layer, as the size of the data is not extremely large and we want to avoid making our model unnecessarily complex. The filter matrix is of size 3 by 3, which is quite standard. As before, we use the rectified linear (“relu”) activation function. The MaxPooling2D layer reduces the dimensionality (and thus the required computational power) by outputting the maximum of the part of the input covered by its pooling window. The Flatten layer simply flattens the result from the previous layer into a vector. As we saw before, the softmax layer then assigns predicted probabilities to each of the ten clothing categories. Note that we use the same optimizer and metric as before, but that we now use “categorical_crossentropy” as the loss function instead of “sparse_categorical_crossentropy”. The reason for this is that the former works for one-hot encoded labels, whereas the latter works for integer-coded (categorical) labels.

We fit our model to the training data, where we set the batch_size argument equal to the number of neurons in the convolutional layers (= 128).
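Putting the pieces described above together, a hedged R keras version of this convolutional model might look like the following. This is a sketch only: the pooling size and number of epochs are not stated in the text and are assumptions, as are the variable names.

cnn <- keras_model_sequential() %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%   # 28 x 28 greyscale input, 3 x 3 filter matrix
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%   # reduce dimensionality via max-pooling
  layer_flatten() %>%                             # flatten the feature maps into a vector
  layer_dense(units = 10, activation = "softmax") # predicted probabilities per clothing category

cnn %>% compile(
  loss = "categorical_crossentropy",  # works with one-hot encoded labels
  optimizer = "adam",
  metrics = "accuracy"
)

cnn %>% fit(
  train.x, train.y.onehot,
  epochs = 10, batch_size = 128, validation_split = 0.3
)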

Although we are still overfitting, we observe that the convolutional neural network performs better than the neural networks we saw earlier, achieving a training set accuracy of 95.16%, a test set accuracy of 90.39%, and a lower loss of 28.70. This was to be expected, because convolutional neural networks have previously been shown to perform well on visual imagery data. Let’s see if we can reduce overfitting by reducing the number of neurons from 128 to 64, adding dropout layers and enabling early stopping. Note that the rate in the Dropout layer is the percentage of connections between layers that are being removed. The SpatialDropout2D layer is a special kind of dropout layer for convolutional neural networks: instead of dropping individual values, it drops entire feature maps produced by the convolutional filters before they are pooled.

When fitting our model, we also enable early stopping to reduce overfitting. Instead of going through all of the specified epochs, early stopping automatically stops the iterations once it notices that the validation loss starts to increase.
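A hedged R sketch of this smaller, regularized model could look as follows (the dropout rates and the early-stopping patience are not given in the text, so the values below are illustrative assumptions):

cnn_small <- keras_model_sequential() %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_spatial_dropout_2d(rate = 0.2) %>%        # drop whole feature maps (rate assumed)
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dropout(rate = 0.2) %>%                   # drop 20% of the connections (rate assumed)
  layer_dense(units = 10, activation = "softmax")

cnn_small %>% compile(loss = "categorical_crossentropy", optimizer = "adam",
                      metrics = "accuracy")

cnn_small %>% fit(
  train.x, train.y.onehot,
  epochs = 50, batch_size = 64, validation_split = 0.3,
  # Stop once validation loss stops improving; the patience value is an assumption
  callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 3))
)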

From the results, we observe that although the training and test accuracies have decreased, they are now much more similar than before. The test accuracy has not decreased substantially, but the training accuracy has, which means that overfitting is much less of a problem than before. Next, we can print the first ten predictions from the model and the first ten actual labels and compare them.

Comparing these predictions to the first ten labels in the data set, we observe that the first ten predictions are correct!

Next up…

Next up in this series of blog posts, I will experiment with tree-based methods and support vector machines to see if they perform as well as the neural network in predicting clothing categories in the Fashion MNIST data. Let’s go!



To leave a comment for the author, please follow the link and comment on their blog: R Views.

future 1.15.0 – Lazy Futures are Now Launched if Queried

Sat, 09/11/2019 - 20:00

[This article was first published on JottR on R, and kindly contributed to R-bloggers].


No dogs were harmed while making this release

future 1.15.0 is now on CRAN, accompanied by a recent, related update of future.callr 0.5.0. The main update is a change to the Future API:


resolved() will now also launch lazy futures

Although this change may not look like much to the world, I’d like to think of it as part of a young person slowly finding themselves. This change in behavior helps us in cases where we create lazy futures upfront;

fs <- lapply(X, future, lazy = TRUE)

Such futures remain dormant until we call value() on them, or, as of this release, when we call resolved() on them. Contrary to value(), resolved() is a non-blocking function that allows us to check in on one or more futures to see if they are resolved or not. So, we can now do:

while (!all(resolved(fs))) { do_something_else() }

to run that loop until all futures are resolved. Any lazy future that is still dormant will be launched when queried the first time. Previously, we would have had to write specialized code for the lazy=TRUE case to trigger lazy futures to launch. If not, the above loop would have run forever. This change means that the above design pattern works the same regardless of whether we use lazy=TRUE or lazy=FALSE (default). There is now one less thing to worry about when working with futures. Less mental friction should be good.

What else?

The Future API now guarantees that value() relays the “visibility” of a future’s value. For example,

> f <- future(invisible(42))
> value(f)
> v <- value(f)
> v
[1] 42

Other than that, I have fixed several non-critical bugs and improved some documentation. See news(package="future") or NEWS for all updates.

What’s next?
  • I’ll be talking about futures at rstudio::conf 2020 (San Francisco, CA, USA) at the end of January 2020. Please come and say hi – I am keen to hear your R story.

  • I will wrap up the deliverables for the project Future Minimal API: Specification with Backend Conformance Test Suite sponsored by the R Consortium. This project helps to robustify the future ecosystem and validate that all backends fulfill the Future API specification. It also serves to refine the Future API specifications. For example, the above change to resolved() resulted from this project.

  • The maintainers of foreach plan to harmonize how foreach() identifies global variables with how the future framework identifies them. The idea is to migrate foreach to use the same approach as future, which relies on the globals package. If you’re curious, you can find out more about this over at the foreach issue tracker. Yeah, the foreach issue tracker is a fairly recent thing – it’s a great addition.

  • The progressr package (GitHub only) is a proof-of-concept and a working prototype showing how to signal progress updates when doing parallel processing. It works out of the box with the core Future API and higher-level Future APIs such as future.apply, foreach with doFuture, furrr, and plyr – regardless of what parallel backend is being used. It should also work with all known non-parallel map-reduce frameworks, including base lapply() and purrr. For parallel processing, the “granularity” of progress updates varies with the type of parallel worker used. Right now, you will get live updates for sequential processing, whereas for parallel processing the updates will come in chunks along with the value whenever it is collected for a particular future. I’m working on adding support for “live” progress updates also for some parallel backends including when running on local and remote workers.
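For a flavour of what that looks like in practice, here is a minimal sketch of signalling progress from future.apply workers with progressr (not taken from this post; the sleep time and element count are placeholders):

library(future)
library(future.apply)
library(progressr)
plan(multisession)

slow_sqrt <- function(xs) {
  p <- progressor(along = xs)     # one progress step per element
  future_lapply(xs, function(x) {
    Sys.sleep(0.5)                # stand-in for slow work
    p()                           # signal progress from the worker
    sqrt(x)
  })
}

with_progress(y <- slow_sqrt(1:20))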

Happy futuring!


To leave a comment for the author, please follow the link and comment on their blog: JottR on R.

Reading in Data

Sat, 09/11/2019 - 18:35

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers].

Here’s a common situation: you have a folder full of similarly-formatted CSV or otherwise structured text files that you want to get into R quickly and easily. Reading data into R is one of those tasks that can be a real source of frustration for beginners, so I like collecting real-life examples of the many ways it’s become much easier.

This week in class I was working with country-level historical mortality rate estimates. These are available from mortality.org, a fabulous resource. They have a variety of data available but I was interested in the 1×1 year estimates of mortality for all available countries. By “1×1” I mean that the tables show (for men, women, and in total) age-specific mortality rate estimates for yearly ages from 0 to 110 and above, for every available historical year (e.g. from 1850 to 2016 or what have you). So you can have an estimate of the mortality rate for, say, 28 year olds in France in 1935.

Downloading this data gives me a folder of text files, one for each country. (Or rather, country-like unit: there are separate series for, e.g., East Germany, West Germany, and Germany as a whole, along with some countries where sub-populations are broken out historically.) The names of the files are consistently formatted, as is the data inside them, and they have a .txt extension. What I wanted to do was get each one of these files into R, ideally putting them all into a single big table that could be the jumping-off point for subsetting and further analysis.

I know from the documentation provided by mortality.org that the files all have the same basic format, which of course makes things much easier. The data is already clean. It’s just a matter of loading it all in efficiently, or “ingesting” it, to use the charming image that seems to be preferred at present.

Here we go. First, some libraries.

library(tidyverse)
library(janitor)
library(here)
## here() starts at /Users/kjhealy/Source/demog

We get a list of the filenames in our raw data folder, along with their full paths. Then we take a look at them.

filenames <- dir(path = here("rawdata"), pattern = "*.txt", full.names = TRUE)

filenames
## [1] "/Users/kjhealy/Source/demog/rawdata/AUS.Mx_1x1.txt"
## [2] "/Users/kjhealy/Source/demog/rawdata/AUT.Mx_1x1.txt"
## [3] "/Users/kjhealy/Source/demog/rawdata/BEL.Mx_1x1.txt"
## [4] "/Users/kjhealy/Source/demog/rawdata/BGR.Mx_1x1.txt"
## [5] "/Users/kjhealy/Source/demog/rawdata/BLR.Mx_1x1.txt"
## [6] "/Users/kjhealy/Source/demog/rawdata/CAN.Mx_1x1.txt"
## [7] "/Users/kjhealy/Source/demog/rawdata/CHE.Mx_1x1.txt"
## [8] "/Users/kjhealy/Source/demog/rawdata/CHL.Mx_1x1.txt"
## [9] "/Users/kjhealy/Source/demog/rawdata/CZE.Mx_1x1.txt"
## [10] "/Users/kjhealy/Source/demog/rawdata/DEUTE.Mx_1x1.txt"
## [11] "/Users/kjhealy/Source/demog/rawdata/DEUTNP.Mx_1x1.txt"
## [12] "/Users/kjhealy/Source/demog/rawdata/DEUTW.Mx_1x1.txt"
## [13] "/Users/kjhealy/Source/demog/rawdata/DNK.Mx_1x1.txt"
## [14] "/Users/kjhealy/Source/demog/rawdata/ESP.Mx_1x1.txt"
## [15] "/Users/kjhealy/Source/demog/rawdata/EST.Mx_1x1.txt"
## [16] "/Users/kjhealy/Source/demog/rawdata/FIN.Mx_1x1.txt"
## [17] "/Users/kjhealy/Source/demog/rawdata/FRACNP.Mx_1x1.txt"
## [18] "/Users/kjhealy/Source/demog/rawdata/FRATNP.Mx_1x1.txt"
## [19] "/Users/kjhealy/Source/demog/rawdata/GBR_NIR.Mx_1x1.txt"
## [20] "/Users/kjhealy/Source/demog/rawdata/GBR_NP.Mx_1x1.txt"
## [21] "/Users/kjhealy/Source/demog/rawdata/GBR_SCO.Mx_1x1.txt"
## [22] "/Users/kjhealy/Source/demog/rawdata/GBRCENW.Mx_1x1.txt"
## [23] "/Users/kjhealy/Source/demog/rawdata/GBRTENW.Mx_1x1.txt"
## [24] "/Users/kjhealy/Source/demog/rawdata/GRC.Mx_1x1.txt"
## [25] "/Users/kjhealy/Source/demog/rawdata/HRV.Mx_1x1.txt"
## [26] "/Users/kjhealy/Source/demog/rawdata/HUN.Mx_1x1.txt"
## [27] "/Users/kjhealy/Source/demog/rawdata/IRL.Mx_1x1.txt"
## [28] "/Users/kjhealy/Source/demog/rawdata/ISL.Mx_1x1.txt"
## [29] "/Users/kjhealy/Source/demog/rawdata/ISR.Mx_1x1.txt"
## [30] "/Users/kjhealy/Source/demog/rawdata/ITA.Mx_1x1.txt"
## [31] "/Users/kjhealy/Source/demog/rawdata/JPN.Mx_1x1.txt"
## [32] "/Users/kjhealy/Source/demog/rawdata/KOR.Mx_1x1.txt"
## [33] "/Users/kjhealy/Source/demog/rawdata/LTU.Mx_1x1.txt"
## [34] "/Users/kjhealy/Source/demog/rawdata/LUX.Mx_1x1.txt"
## [35] "/Users/kjhealy/Source/demog/rawdata/LVA.Mx_1x1.txt"
## [36] "/Users/kjhealy/Source/demog/rawdata/NLD.Mx_1x1.txt"
## [37] "/Users/kjhealy/Source/demog/rawdata/NOR.Mx_1x1.txt"
## [38] "/Users/kjhealy/Source/demog/rawdata/NZL_MA.Mx_1x1.txt"
## [39] "/Users/kjhealy/Source/demog/rawdata/NZL_NM.Mx_1x1.txt"
## [40] "/Users/kjhealy/Source/demog/rawdata/NZL_NP.Mx_1x1.txt"
## [41] "/Users/kjhealy/Source/demog/rawdata/POL.Mx_1x1.txt"
## [42] "/Users/kjhealy/Source/demog/rawdata/PRT.Mx_1x1.txt"
## [43] "/Users/kjhealy/Source/demog/rawdata/RUS.Mx_1x1.txt"
## [44] "/Users/kjhealy/Source/demog/rawdata/SVK.Mx_1x1.txt"
## [45] "/Users/kjhealy/Source/demog/rawdata/SVN.Mx_1x1.txt"
## [46] "/Users/kjhealy/Source/demog/rawdata/SWE.Mx_1x1.txt"
## [47] "/Users/kjhealy/Source/demog/rawdata/TWN.Mx_1x1.txt"
## [48] "/Users/kjhealy/Source/demog/rawdata/UKR.Mx_1x1.txt"
## [49] "/Users/kjhealy/Source/demog/rawdata/USA.Mx_1x1.txt"

What does each of these files look like? Let’s take a look at the first one, using read_lines() to show us the top of the file.

read_lines(filenames[1], n_max = 5)
## [1] "Australia, Death rates (period 1x1), \tLast modified: 26 Sep 2017; Methods Protocol: v6 (2017)"
## [2] ""
## [3] " Year Age Female Male Total"
## [4] " 1921 0 0.059987 0.076533 0.068444"
## [5] " 1921 1 0.012064 0.014339 0.013225"

All the files have a header section like this. When we read the data in we’ll want to ignore that and go straight to the data. But seeing as it’s there, we can make use of it to grab the name of the country. It saves us typing it ourselves. Let’s say we’d also like to have a code-friendly version of those names (i.e., in lower-case with underscores instead of spaces). And finally—while we’re at it—let’s grab those all-caps country codes used in the file names, too. We write three functions:

  • get_country_name() grabs the first word or words on the first line of each file, up to the first comma. That’s our country name.
  • shorten_name() makes the names lower-case and replaces spaces with underscores, and also shortens “The United States of America” to “USA”.
  • make_ccode() wraps a regular expression that finds and extracts the capitalized country codes in the file names.
get_country_name <- function(x){
  read_lines(x, n_max = 1) %>%
    str_extract(".+?,") %>%
    str_remove(",")
}

shorten_name <- function(x){
  str_replace_all(x, " -- ", " ") %>%
    str_replace("The United States of America", "USA") %>%
    snakecase::to_any_case()
}

make_ccode <- function(x){
  str_extract(x, "[:upper:]+((?=\\.))")
}

Now we create a tibble of summary information by mapping the functions to the filenames.

countries <- tibble(country = map_chr(filenames, get_country_name),
                    cname = map_chr(country, shorten_name),
                    ccode = map_chr(filenames, make_ccode),
                    path = filenames)

countries
## # A tibble: 49 x 4
##    country     cname       ccode path
##
##  1 Australia   australia   AUS   /Users/kjhealy/Source/demog/rawdata/AUS.M…
##  2 Austria     austria     AUT   /Users/kjhealy/Source/demog/rawdata/AUT.M…
##  3 Belgium     belgium     BEL   /Users/kjhealy/Source/demog/rawdata/BEL.M…
##  4 Bulgaria    bulgaria    BGR   /Users/kjhealy/Source/demog/rawdata/BGR.M…
##  5 Belarus     belarus     BLR   /Users/kjhealy/Source/demog/rawdata/BLR.M…
##  6 Canada      canada      CAN   /Users/kjhealy/Source/demog/rawdata/CAN.M…
##  7 Switzerland switzerland CHE   /Users/kjhealy/Source/demog/rawdata/CHE.M…
##  8 Chile       chile       CHL   /Users/kjhealy/Source/demog/rawdata/CHL.M…
##  9 Czechia     czechia     CZE   /Users/kjhealy/Source/demog/rawdata/CZE.M…
## 10 East Germa… east_germa… DEUTE /Users/kjhealy/Source/demog/rawdata/DEUTE…
## # … with 39 more rows

Nice. We could have written each of those operations as anonymous functions directly inside of map_chr(). This would have been more compact. But often it can be useful to break out the steps as shown here, for clarity—especially if map() operations have a tendency to break your brain, as they do mine.
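For comparison, the anonymous-function version might look roughly like this sketch, which builds the same columns inline (tibble() lets later columns refer to earlier ones):

countries <- tibble(
  country = map_chr(filenames, ~ read_lines(.x, n_max = 1) %>%
                      str_extract(".+?,") %>%
                      str_remove(",")),
  cname = map_chr(country, ~ str_replace_all(.x, " -- ", " ") %>%
                    str_replace("The United States of America", "USA") %>%
                    snakecase::to_any_case()),
  ccode = map_chr(filenames, ~ str_extract(.x, "[:upper:]+((?=\\.))")),
  path = filenames)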

We still haven’t touched the actual data files, of course. But now we can just use this countries table as the basis for reading in, I mean ingesting, everything in the files. We’re going to just add a list column named data to the end of the table and put the data for each country in it. We’ll temporarily unnest it to clean the column names and recode the age variable, then drop the file paths column and nest the data again.

The hard work is done by the map() call. This time we will use ~ formula notation inside map() to write what we want to do. We’re going to feed every filename in path to read_table(), one at a time. We tell read_table() to skip the first two lines of every file it reads, and also tell it that in these files missing data are represented by a . character. Everything read in ends up in a new list column named data.

mortality <- countries %>%
  mutate(data = map(path, ~ read_table(., skip = 2, na = "."))) %>%
  unnest(cols = c(data)) %>%
  clean_names() %>%
  mutate(age = as.integer(recode(age, "110+" = "110"))) %>%
  select(-path) %>%
  nest(data = c(year:total))

mortality
## # A tibble: 49 x 4
##    country      cname        ccode data
##
##  1 Australia    australia    AUS   [10,434 × 5]
##  2 Austria      austria      AUT   [7,881 × 5]
##  3 Belgium      belgium      BEL   [19,425 × 5]
##  4 Bulgaria     bulgaria     BGR   [7,104 × 5]
##  5 Belarus      belarus      BLR   [6,438 × 5]
##  6 Canada       canada       CAN   [10,101 × 5]
##  7 Switzerland  switzerland  CHE   [15,651 × 5]
##  8 Chile        chile        CHL   [1,887 × 5]
##  9 Czechia      czechia      CZE   [7,437 × 5]
## 10 East Germany east_germany DEUTE [6,660 × 5]
## # … with 39 more rows

And we’re done. Forty nine tables of data smoothly imported and bundled together. Each of the country-level data tables is a row in data that we can take a look at as we like:

mortality %>%
  filter(country == "Austria") %>%
  unnest(cols = c(data))
## # A tibble: 7,881 x 8
##    country cname   ccode  year   age   female     male    total
##
##  1 Austria austria AUT    1947     0 0.0798   0.0994   0.0899
##  2 Austria austria AUT    1947     1 0.00657  0.00845  0.00753
##  3 Austria austria AUT    1947     2 0.00425  0.00469  0.00447
##  4 Austria austria AUT    1947     3 0.00337  0.00340  0.00339
##  5 Austria austria AUT    1947     4 0.00235  0.00270  0.00253
##  6 Austria austria AUT    1947     5 0.00174  0.00195  0.00184
##  7 Austria austria AUT    1947     6 0.00131  0.00152  0.00142
##  8 Austria austria AUT    1947     7 0.00132  0.00169  0.00151
##  9 Austria austria AUT    1947     8 0.00115  0.00149  0.00132
## 10 Austria austria AUT    1947     9 0.000836 0.000997 0.000918
## # … with 7,871 more rows

Now you can get on with the actual analysis.

There isn’t anything especially unusual in the steps shown here. It’s just a pretty common operation that’s worth knowing how to do cleanly. One nice thing about this approach is that it’s immediately applicable to, say, a folder containing the 5-year mortality estimates rather than the 1 year estimates. You don’t have to do anything new, and there’s no mucking around with manually naming files and so on.


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.
