Kasa AI Blog: Cellar: Open Datasets in Insurance

We’re pleased to announce Cellar, a repository of publicly available datasets for insurance analytics research and engineering.

Motivation

One of the most frequently cited obstacles to reproducible research and open source software development in insurance is the scarcity of publicly available datasets. Compared to, say, image or language research, insurance does indeed fall behind in this respect. However, we believe that an interesting collection of datasets already exists on the Web today and is underutilized. These datasets are found in a variety of places, from websites (of researchers, actuarial societies, and governmental organizations) to self-hosted R packages, and they come in an even greater variety of file formats and levels of documentation. While folks “in the know” may be familiar with pockets of these resources, discoverability is lacking in the broader community. Through Cellar, we hope to encourage and accelerate knowledge building in the insurance analytics space.

Quickstart

The project website is the recommended entry point. Practitioners can skim the dataset listing and check out descriptions, data dictionaries (see Figure 1), and variable statistics (see Figure 2) for datasets that may be of interest.

Figure 1: Dataset column descriptions.

Figure 2: Dataset column statistics.

Then, with a single line of code, they can download the desired data to their R session. Glimpsing the downloaded data frame gives output like this:

Rows: 77,900
Columns: 16
$ lob                   <chr> "private_passenger_auto", "private_pa…
$ group_code            <chr> "43", "43", "43", "43", "43", "43", "…
$ group_name            <chr> "IDS Property Cas Ins Co", "IDS Prope…
$ accident_year         <int> 1988, 1989, 1990, 1991, 1992, 1993, 1…
$ development_year      <int> 1988, 1989, 1990, 1991, 1992, 1993, 1…
$ development_lag       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ incurred_loss         <dbl> 607, 2254, 5843, 11422, 19933, 24604,…
$ cumulative_paid_loss  <dbl> 133, 934, 2030, 4537, 7564, 8343, 125…
$ bulk_loss             <dbl> 226, 495, 1669, 2941, 4885, 7823, 168…
$ earned_premium_direct <dbl> 957, 3695, 6138, 17533, 29341, 37194,…
$ earned_premium_ceded  <dbl> 62, 288, 249, 749, 1694, 2056, 3490, …
$ earned_premium_net    <dbl> 895, 3407, 5889, 16784, 27647, 35138,…
$ single                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ posted_reserve_97     <dbl> 73044, 73044, 73044, 73044, 73044, 73…
$ calendar_year         <dbl> 1988, 1989, 1990, 1991, 1992, 1993, 1…
$ incremental_paid_loss <dbl> 133, 934, 2030, 4537, 7564, 8343, 125…
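
To make the relationship between some of these columns concrete, here is a minimal base R sketch. It assumes, based on the column names and the usual actuarial convention, that incremental_paid_loss is the first difference of cumulative_paid_loss within an accident year and that calendar_year equals accident_year + development_lag - 1. Both assumptions are consistent with the lag-1 rows shown above, but they are not stated in the dataset documentation quoted here. The numbers below are made up and do not come from Cellar.

```r
# Toy loss triangle with made-up numbers (not taken from Cellar).
triangle <- data.frame(
  accident_year        = c(1988, 1988, 1988, 1989, 1989),
  development_lag      = c(1, 2, 3, 1, 2),
  cumulative_paid_loss = c(100, 180, 210, 120, 200)
)

# Sort so differences are taken in development order within each accident year.
triangle <- triangle[order(triangle$accident_year, triangle$development_lag), ]

# Assumed definition: incremental paid loss is the first difference of the
# cumulative series per accident year; at lag 1 it equals the cumulative amount.
triangle$incremental_paid_loss <- ave(
  triangle$cumulative_paid_loss,
  triangle$accident_year,
  FUN = function(x) c(x[1], diff(x))
)

# Assumed definition: calendar year in which each development period falls.
triangle$calendar_year <- triangle$accident_year + triangle$development_lag - 1

print(triangle)
```

With a real Cellar dataset, the grouping would presumably also need to include the company and line of business (group_code and lob in the listing above), since each combination has its own triangle.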

“Technical” details

For those interested, here is a concise list of Cellar’s implementation details:

The data repository is implemented as a pins board.
The data currently lives in AWS S3; for those familiar with prior Cellar work, we have migrated from using GitHub releases to host files.
Datasets are processed to be tidy and have consistently formatted column names where possible.
Versioning (see Vintage information in the spec sheets) is supported, so users can document the exact versions of the datasets used in manuscripts or software.
A Python library to access Cellar is on the roadmap and will be available later this year.
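
For readers unfamiliar with pins, the following sketch shows how a versioned board behaves in general. It is not Cellar's actual configuration: the S3 bucket and pin names are not given in this post, so the example uses a temporary local board from the pins R package as a stand-in, with a hypothetical dataset name ("toy_losses") and made-up values.

```r
library(pins)

# Stand-in board: a temporary local folder with versioning switched on.
# Cellar's real board lives in S3 and is not reproduced here.
board <- board_temp(versioned = TRUE)

# Hypothetical dataset name and made-up values.
v1_data <- data.frame(accident_year = c(1988, 1989), incurred_loss = c(10, 20))
pin_write(board, v1_data, name = "toy_losses", type = "rds")

# Only one version exists at this point, so record its identifier.
first_version <- pin_versions(board, "toy_losses")$version[1]

# Pause briefly so the two versions get distinct timestamps.
Sys.sleep(1)

# Publish a revised copy of the same dataset as a second version.
v2_data <- transform(v1_data, incurred_loss = incurred_loss * 1.1)
pin_write(board, v2_data, name = "toy_losses", type = "rds")

# List every stored version of the pin.
versions <- pin_versions(board, "toy_losses")
print(versions)

# Pin an analysis to the exact version recorded earlier.
original <- pin_read(board, "toy_losses", version = first_version)
```

Recording a version identifier like first_version alongside a manuscript or package is what lets others retrieve exactly the same data later, even after the dataset has been updated.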

Collaboration

Cellar is meant to be a community-curated repository, and we welcome contributions and requests to include new datasets. If there is a dataset you’d like to share, please stop by our Slack, open a GitHub issue, or reach out via email.