Kasa AI Blog: simulationmachine: Synthesizing Individual Claims Data

Today we’re excited to announce the simulationmachine package, based on the paper An Individual Claims History Simulation Machine (Gabrielli and Wüthrich 2018). We created this package for practitioners and researchers who need realistic data for testing reserving or claims analytics modeling workflows.

The package allows you to simulate individual claim characteristics along with their cash flow histories. We first show a quick example before delving into details of the algorithm.

simulationmachine

You can download simulationmachine from GitHub with

# install.packages("remotes")
remotes::install_github("kasaai/simulationmachine")

First, specify the desired characteristics of the simulation by scribing a charm (which we’ll use to do some conjuring later!):

library(simulationmachine)

charm <- simulation_machine(
  num_claims = 50000, 
  lob_distribution = c(0.25, 0.25, 0.30, 0.20), 
  inflation = c(0.01, 0.01, 0.01, 0.01), 
  sd_claim = 0.85, 
  sd_recovery = 0.75
)
charm

A simulation charm for `simulation_machine`

Each record is:
 - A snapshot of a claim's incremental paid loss and claim status
   at a development year.

Specs:
 - Expected number of claims: 50,000
 - LOB distribution: 0.25, 0.25, 0.3, 0.2
 - Inflation: 0.01, 0.01, 0.01, 0.01
 - SD of claim sizes: 0.85,
 - SD of recovery sizes: 0.75

The charm encodes the details of the simulation we want to create. In this case, the simulation contains the following:

Approximately 50,000 claims (note that this is an expected value, as claim incidence is also simulated)
The distribution among the four lines of business is 25%, 25%, 30%, and 20%
Inflation is a constant 1% for all lines of business
Claims are lognormally distributed with a standard deviation of 0.85
Recoveries are lognormally distributed with a standard deviation of 0.75

Once we have the specification, we can perform the actual simulation using the conjure() function:

library(tidyverse)
records <- conjure(charm, seed = 100)
glimpse(records)

Rows: 603,324
Columns: 11
$ claim_id          <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1…
$ accident_year     <int> 1994, 1994, 1994, 1994, 1994, 1994, 1994, …
$ development_year  <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1…
$ accident_quarter  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, …
$ report_delay      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ lob               <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3…
$ cc                <chr> "42", "42", "42", "42", "42", "42", "42", …
$ age               <int> 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65…
$ injured_part      <chr> "51", "51", "51", "51", "51", "51", "51", …
$ paid_loss         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4913, …
$ claim_status_open <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …

In the output data frame, each row is a snapshot of a claim at one development year. The 603,324 rows correspond to 50,277 claims times 12 development years. At this point, the dataset is ready for use in a claim-level exercise. If you’re looking for aggregate triangle data instead, a few transformations will get you there. Say we’re interested in a cumulative paid loss triangle as of calendar year 1999 (so it’ll fit in the margins of this blog post); we can use the following snippet:

triangle <- records %>% 
  filter(accident_year + development_year <= 1999) %>% 
  # aggregate to AY-dev cells
  group_by(accident_year, development_year) %>% 
  summarize(paid_loss = sum(paid_loss)) %>% 
  group_by(accident_year) %>% 
  # calculate cumulative losses
  mutate(cumulative_paid_loss = cumsum(paid_loss)) %>% 
  select(accident_year, development_year, cumulative_paid_loss) %>% 
  # reshape the data
  pivot_wider(
    names_from = "development_year", 
    values_from = "cumulative_paid_loss"
  )

triangle

accident_year	0	1	2	3	4	5
1994	4,252,450	6,669,347	7,852,028	8,372,107	8,682,360	8,875,666
1995	3,493,776	5,438,160	5,923,379	6,157,112	6,319,793
1996	4,703,297	7,321,589	8,345,325	8,933,265
1997	4,797,499	7,796,097	8,996,412
1998	3,980,551	6,687,173
1999	3,557,440
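To make the mechanics of the snippet above easier to follow, here is a minimal base R sketch of the same three steps (filter to the observed calendar years, aggregate to accident year by development year cells, accumulate along development years). The `records_toy` data frame and its numbers are made up for illustration and are not output of simulationmachine.

```r
# Toy claim-level records (made-up numbers): one row per claim per
# development year, holding incremental paid losses.
records_toy <- data.frame(
  claim_id         = rep(c("a", "b", "c"), each = 3),
  accident_year    = rep(c(1997L, 1997L, 1998L), each = 3),
  development_year = rep(0:2, times = 3),
  paid_loss        = c(100, 50, 25, 200, 0, 10, 300, 60, 30)
)

# Keep only the cells observable as of calendar year 1999
observed <- subset(records_toy, accident_year + development_year <= 1999)

# Aggregate incremental payments to AY-by-development cells;
# combinations with no observations come back as NA
incremental <- tapply(
  observed$paid_loss,
  list(
    accident_year    = observed$accident_year,
    development_year = observed$development_year
  ),
  sum
)

# Accumulate along development years within each accident year;
# apply() returns the result transposed, so flip it back with t()
cumulative <- t(apply(incremental, 1, cumsum))
cumulative
```

The unobserved lower-right cell stays `NA`, which is what gives the result its triangle shape.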

Motivations & Technical Details

The aim of the Simulation Machine (Gabrielli and Wüthrich 2018) is to develop a stochastic simulation machine that generates individual claims histories of non-life insurance claims. Recently, loss reserving research has been very active, leading to many new ideas and techniques to estimate the claims reserves required to cover future payments for claims that have occurred in the past. In particular, a lot of effort has gone into developing claims reserving methods based on individual claims histories. However, a main shortcoming in this field of research is that there is no publicly available individual claims history data, as insurance companies are often unwilling or unable to publish their data for confidentiality reasons. Therefore, there is no way to back-test the newly developed claims reserving methods. By developing the algorithm and the simulation package, we hope to provide a common ground for the actuarial community in the form of publicly available data for individual claims reserving.

Of course, the simulated individual claims histories should be as realistic as possible to reflect a real insurance claims portfolio. Therefore, the simulation machine was calibrated to real experience data graciously provided by the Swiss National Accident Insurance Fund (SUVA). This dataset consisted of \(n \approx 10\) million individual claims that occurred between 1994 and 2005, and for each of these individual claims we had full information of 12 years of claims development as well as of the relevant feature information, which we discuss below.

Internally, the algorithm consists of two main simulation stages. In the first stage, a portfolio of individual claims characteristics is generated, and each claim contains the following features:

The line of business (lob), of which there are four
The claims code (cc), which denotes the labor sector of the claimant, with 51 possible values
The accident year (accident_year)
The accident quarter (accident_quarter)
The age of the injured (age), in 5-year age buckets, which can take on values 15, 20, ..., 70
The injured body part (injured_part), with 46 possible values
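To give a feel for the shape of the first-stage output, the sketch below draws a small table of static features in base R. It is only an illustration of the feature space listed above: every feature is sampled independently, whereas the actual machine is calibrated to real data, and the integer codes used for `cc` and `injured_part` are placeholders rather than the labels the package emits. The accident year range 1994 to 2005 is taken from the calibration data described earlier.

```r
set.seed(1)  # make this toy example reproducible
n <- 1000

# Independent draws for illustration only; the real machine does not
# sample features this way.
features_toy <- data.frame(
  lob              = sample(1:4, n, replace = TRUE,
                            prob = c(0.25, 0.25, 0.30, 0.20)),
  cc               = sample(1:51, n, replace = TRUE),  # placeholder codes
  accident_year    = sample(1994:2005, n, replace = TRUE),
  accident_quarter = sample(1:4, n, replace = TRUE),
  age              = sample(seq(15, 70, by = 5), n, replace = TRUE),
  injured_part     = sample(1:46, n, replace = TRUE)   # placeholder codes
)

str(features_toy)
```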

Once the static claim characteristics are generated, we use them in the second simulation stage to generate:

The (incremental) individual claims cash flow, paid_loss, for development_year = 0, ..., 11
The claim status process claim_status_open (again for development_year = 0, ..., 11) which is an indicator variable denoting whether the considered claim is open at the end of the accounting year accident_year + development_year

We remark that we only consider yearly payments; that is, multiple payments and recovery payments within each calendar year are aggregated into a single annual payment. This single annual payment can be either positive or negative, depending on whether claim payments or recovery payments are higher in that year. The sum over all yearly payments of any given claim is constrained to be nonnegative because recoveries cannot exceed payments.
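The yearly aggregation can be illustrated with a small base R example. The `transactions` data frame is hypothetical: positive amounts stand for claim payments and negative amounts for recoveries on a single claim.

```r
# Hypothetical transactions for one claim
# (positive = payment, negative = recovery)
transactions <- data.frame(
  calendar_year = c(1994, 1994, 1995, 1995, 1996),
  amount        = c(1000, 500, 200, -600, -100)
)

# Collapse to a single net payment per calendar year
annual <- aggregate(amount ~ calendar_year, data = transactions, FUN = sum)
annual
# 1994: 1500, 1995: -400, 1996: -100

# Individual years may be negative, but the total over all years is not
sum(annual$amount)  # 1000
```

Here 1995 and 1996 are net recovery years, yet the claim total remains nonnegative, matching the constraint described above.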

The simulation of the cash flow process and the claim status process is based on neural networks to incorporate individual claims feature information. In recent years, neural networks have proved to be very powerful tools in classification and regression problems. Their drawbacks are that they are rather difficult to calibrate and, once calibrated, they act almost like black boxes between inputs and outputs. Of course, this is a major disadvantage for interpretation and for gaining deeper insight. However, the lack of interpretability is not necessarily a disadvantage in this simulation machine because it implies, in back-testing other methods, that the true data generating mechanism cannot easily be guessed.

Summarizing, simulationmachine, with the algorithm it implements, provides a way to generate a synthetic insurance dataset consisting of individual claims. These claims contain a rich set of feature information, along with time series of cash flows and claims status indicators. In particular, we can simulate claims from 12 different accident years, and for every individual claim we are provided with 12 years of claims development.
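As a quick sanity check on the record layout (one row per claim per development year), the base R sketch below builds an empty skeleton for a handful of claims and confirms the row count. The `layout_toy` object is purely illustrative.

```r
n_claims <- 3

# One row for each claim at each of the 12 development years (0 to 11)
layout_toy <- expand.grid(
  development_year = 0:11,
  claim_id         = seq_len(n_claims)
)

nrow(layout_toy) == n_claims * 12  # TRUE

# The same arithmetic for the run shown earlier: 50,277 claims
50277 * 12  # 603324
```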

Consistent Interfaces for Simulation

The workflow of generating data for research and backtesting is shared by researchers and practitioners alike. We believe that it is worthwhile to optimize this workflow by streamlining the process of generating data using different simulators. The simulationmachine package implements the conjuror interface, which includes a set of conventions for simulator package authors to consider. These conventions are currently under active development, and we encourage you to participate in shaping them.
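One plausible way to picture such a shared interface is an S3 generic that each simulator package extends with a method for its own specification class. The sketch below is an assumption for illustration only and is not the conjuror source: `conjure_sketch()`, `toy_charm()`, and the lognormal toy simulator are all hypothetical names and choices.

```r
# Hypothetical generic: each simulator package would supply a method
# for its own charm class.
conjure_sketch <- function(charm, seed = NULL, ...) {
  UseMethod("conjure_sketch")
}

# Hypothetical specification object for a toy simulator
toy_charm <- function(num_claims, sd_claim) {
  structure(
    list(num_claims = num_claims, sd_claim = sd_claim),
    class = "toy_charm"
  )
}

# Method for the toy simulator: draws lognormal claim sizes
conjure_sketch.toy_charm <- function(charm, seed = NULL, ...) {
  if (!is.null(seed)) set.seed(seed)
  data.frame(
    claim_id  = as.character(seq_len(charm$num_claims)),
    paid_loss = rlnorm(charm$num_claims, meanlog = 0, sdlog = charm$sd_claim)
  )
}

spec <- toy_charm(num_claims = 5, sd_claim = 0.85)
sim_a <- conjure_sketch(spec, seed = 100)
sim_b <- conjure_sketch(spec, seed = 100)
identical(sim_a, sim_b)  # TRUE: the same seed reproduces the same data
```

Separating the specification (the charm) from the act of simulating keeps the calling code identical across simulators; only the constructor changes.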

Acknowledgments

We would like to thank Ryan Thomas, Hadrien Dykiel, and Frankie Logan for their helpful comments.

References

Gabrielli, Andrea, and Mario V. Wüthrich. 2018. “An Individual Claims History Simulation Machine.” Risks 6 (2): 29. https://doi.org/10.3390/risks6020029.
