Cellar: Open Datasets in Insurance

Improving the discoverability and accessibility of publicly available datasets.

Kevin Kuo (RStudio and Kasa AI)
08-13-2020

We’re pleased to announce Cellar, a repository of publicly available datasets for insurance analytics research and engineering.

Motivation

One of the most frequently cited obstacles to reproducible research and open source software development in insurance is the scarcity of publicly available datasets. Compared to, say, image or language research, insurance does lag behind in this respect. However, we believe that an interesting collection of datasets already exists on the Web today and is underutilized. These datasets are found in a variety of places, from websites (of researchers, actuarial societies, and governmental organizations) to self-hosted R packages, and they come in an even greater variety of file formats and levels of documentation. While folks “in the know” may be familiar with pockets of these resources, discoverability is lacking in the broader community. Through Cellar, we hope to encourage and accelerate knowledge building in the insurance analytics space.

Quickstart

The project website is the recommended entry point for getting started. Practitioners can skim the dataset listing and check out the descriptions, data dictionaries (see Figure 1), and variable statistics (see Figure 2) of datasets that may be of interest.

Figure 1: Dataset column descriptions.

Figure 2: Dataset column statistics.

Then, with a single line of code, they can download the desired data to their R session:


Rows: 77,900
Columns: 16
$ lob                   <chr> "private_passenger_auto", "private_pa…
$ group_code            <chr> "43", "43", "43", "43", "43", "43", "…
$ group_name            <chr> "IDS Property Cas Ins Co", "IDS Prope…
$ accident_year         <int> 1988, 1989, 1990, 1991, 1992, 1993, 1…
$ development_year      <int> 1988, 1989, 1990, 1991, 1992, 1993, 1…
$ development_lag       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ incurred_loss         <dbl> 607, 2254, 5843, 11422, 19933, 24604,…
$ cumulative_paid_loss  <dbl> 133, 934, 2030, 4537, 7564, 8343, 125…
$ bulk_loss             <dbl> 226, 495, 1669, 2941, 4885, 7823, 168…
$ earned_premium_direct <dbl> 957, 3695, 6138, 17533, 29341, 37194,…
$ earned_premium_ceded  <dbl> 62, 288, 249, 749, 1694, 2056, 3490, …
$ earned_premium_net    <dbl> 895, 3407, 5889, 16784, 27647, 35138,…
$ single                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ posted_reserve_97     <dbl> 73044, 73044, 73044, 73044, 73044, 73…
$ calendar_year         <dbl> 1988, 1989, 1990, 1991, 1992, 1993, 1…
$ incremental_paid_loss <dbl> 133, 934, 2030, 4537, 7564, 8343, 125…
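
To make the last two columns in the listing more concrete, here is a minimal sketch of how fields like calendar_year and incremental_paid_loss could be derived from the others. It assumes that calendar_year equals accident_year + development_lag - 1 and that incremental paid loss is the first difference of cumulative paid loss within each accident year; both are assumptions suggested by the lag-1 rows shown above, not something documented here. The toy data frame is made up and does not come from Cellar.

```r
# Toy triangle: made-up numbers, NOT taken from any Cellar dataset.
toy <- data.frame(
  accident_year        = c(1988, 1988, 1988, 1989, 1989),
  development_lag      = c(1, 2, 3, 1, 2),
  cumulative_paid_loss = c(100, 250, 300, 120, 310)
)

# Assumption: the calendar year is the accident year shifted by the
# development lag, with lag 1 falling in the accident year itself.
toy$calendar_year <- toy$accident_year + toy$development_lag - 1

# Sort so that differences run along each accident year's development.
toy <- toy[order(toy$accident_year, toy$development_lag), ]

# Assumption: incremental paid loss is the change in cumulative paid loss
# within an accident year; the first lag keeps its cumulative value.
toy$incremental_paid_loss <- ave(
  toy$cumulative_paid_loss,
  toy$accident_year,
  FUN = function(x) c(x[1], diff(x))
)

print(toy)
```

With real data, the grouping would also need to include the company and line-of-business columns (group_code and lob in the listing above), since each combination carries its own triangle.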

“Technical” details

For those interested, here is a concise list of Cellar’s implementation details

Collaboration

Cellar is meant to be a community-curated repository, and we welcome both contributions and requests to include new datasets. If there is a dataset you’d like to share, please stop by our Slack, open a GitHub issue, or reach out via email.