Improving the discoverability and accessibility of publicly available datasets.
We’re pleased to announce Cellar, a repository of publicly available datasets for insurance analytics research and engineering.
One of the most frequently cited issues in insurance-related reproducible research and open-source software development is the scarcity of publicly available datasets. Compared to, say, image or language research, insurance does indeed fall behind in this respect. However, we believe that an interesting collection of datasets already exists on the Web today and is underutilized. These datasets are found in a variety of places, from websites (of researchers, actuarial societies, and governmental organizations) to self-hosted R packages, and they appear in an even greater variety of file formats and levels of documentation. While folks “in the know” may be familiar with pockets of these resources, discoverability is lacking in the broader community. Through Cellar, we hope to encourage and accelerate knowledge building in the insurance analytics space.
The project website is the recommended entry point for getting started. Practitioners can skim through the dataset listing and check out the descriptions, data dictionaries (see Figure 1), and variable statistics (see Figure 2) of datasets that may be of interest.
Then, with a single line of code, they can download the desired data to their R session:
Rows: 77,900
Columns: 16
$ lob <chr> "private_passenger_auto", "private_pa…
$ group_code <chr> "43", "43", "43", "43", "43", "43", "…
$ group_name <chr> "IDS Property Cas Ins Co", "IDS Prope…
$ accident_year <int> 1988, 1989, 1990, 1991, 1992, 1993, 1…
$ development_year <int> 1988, 1989, 1990, 1991, 1992, 1993, 1…
$ development_lag <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ incurred_loss <dbl> 607, 2254, 5843, 11422, 19933, 24604,…
$ cumulative_paid_loss <dbl> 133, 934, 2030, 4537, 7564, 8343, 125…
$ bulk_loss <dbl> 226, 495, 1669, 2941, 4885, 7823, 168…
$ earned_premium_direct <dbl> 957, 3695, 6138, 17533, 29341, 37194,…
$ earned_premium_ceded <dbl> 62, 288, 249, 749, 1694, 2056, 3490, …
$ earned_premium_net <dbl> 895, 3407, 5889, 16784, 27647, 35138,…
$ single <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ posted_reserve_97 <dbl> 73044, 73044, 73044, 73044, 73044, 73…
$ calendar_year <dbl> 1988, 1989, 1990, 1991, 1992, 1993, 1…
$ incremental_paid_loss <dbl> 133, 934, 2030, 4537, 7564, 8343, 125…
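To get a feel for how the columns in this output relate to one another, here is a minimal sketch in base R. It does not call Cellar at all: the small data frame is made-up toy data that only borrows the column names, and the two relationships it encodes (development year = accident year + lag - 1, and incremental paid loss = lag-over-lag difference of cumulative paid loss) are assumptions that happen to be consistent with the lag-1 rows shown above, not documented guarantees.

```r
# Toy data for illustration only; these values are NOT from Cellar.
triangle_long <- data.frame(
  accident_year        = c(1988, 1988, 1988, 1989, 1989, 1990),
  development_lag      = c(1, 2, 3, 1, 2, 1),
  cumulative_paid_loss = c(100, 180, 220, 120, 210, 150)
)

# Order by accident year and lag so differences are taken in sequence.
triangle_long <- triangle_long[
  order(triangle_long$accident_year, triangle_long$development_lag),
]

# Assumption: development_year = accident_year + development_lag - 1.
triangle_long$development_year <-
  triangle_long$accident_year + triangle_long$development_lag - 1

# Assumption: incremental paid loss is the change in cumulative paid loss
# from one lag to the next, with the first lag equal to the cumulative value.
triangle_long$incremental_paid_loss <- ave(
  triangle_long$cumulative_paid_loss,
  triangle_long$accident_year,
  FUN = function(x) c(x[1], diff(x))
)

# Pivot to a wide triangle: rows are accident years, columns are lags.
# Cells that have not been observed yet stay NA.
triangle_wide <- with(
  triangle_long,
  tapply(cumulative_paid_loss, list(accident_year, development_lag), sum)
)

print(triangle_long)
print(triangle_wide)
```

The long, one-row-per-cell layout is convenient for filtering and modeling, while the wide matrix is the shape most reserving practitioners are used to reading.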
For those interested, here is a concise list of Cellar’s implementation details
Cellar is meant to be a community-curated repository, and we welcome contributions and requests to include new datasets. If there is a dataset you’d like to share, please stop by our Slack, open a GitHub issue, or reach out via email.