SynthETIC: A Simulator of Synthetic Experience Tracking Insurance Claims

Simulation of individual claims (and payments) that mirror the user’s actual, complex claim experience.

Benjamin Avanzi (University of Melbourne) , Greg Taylor (UNSW Sydney) , Melantha Wang (UNSW Sydney) , Bernard Wong (UNSW Sydney)
09-14-2020

In this blog we introduce the individual claim simulator SynthETIC(Avanzi et al. 2020b) (Synthetic Experience Tracking Insurance Claims), recently released on the Comprehensive R Archive Network (CRAN). An associated paper can be found here(Avanzi et al. 2020a). SynthETIC generates various features of non-life insurance claims on an individual claim level, including occurrence, notification, the timing and magnitude of individual partial payments, and closure.

Development of an individual non-life insurance claim.

Figure 1: Development of an individual non-life insurance claim.

Motivation

Recent years have seen rapid increase in the application of machine learning (ML) to insurance loss reserving. These ML methods yield most value when applied to large data sets, such as individual claim, even transactional, data sets, whose volume implies that a naked-eye view of the data features is impossible.

Unfortunately, such large data sets are in short supply in the actuarial literature, and so there is a shortfall in the material required to test ML methods. SynthETIC fills this gap and offers a starting point for the testing of proposed ML methods by providing synthetic data that reflects complex features commonly observed in real data.

The default version of SynthETIC is parameterized to loosely resemble the experience of a specific (but anonymous) Auto Bodily Injury portfolio (“reference portfolio”). However, the general structure is suitable for most lines of business, with some possible amendment on the choice of parameters and/or the structure of the sampling distribution(s). In short, SynthETIC gives the user full control of the mechanics of the evolution of an individual claim, hence the flexibility to create claims data of any level of complexity.

Indeed, the user may generate a collection of data sets that provide a spectrum of complexity, and the collection may be used to present a model under test with a steadily increasing challenge.

Architecture of SynthETIC

SynthETIC is composed of 8 modelling steps (“modules”), any one or more of which may be re-designed according to the user’s requirements.

  1. Claim occurrence: claim frequency, claim occurrence times;
  2. Claim size: claim size in constant dollar values i.e. without inflation;
  3. Claim notification: notification delay (delay from occurrence to notification);
  4. Claim closure: settlement delay (delay from notification to closure);
  5. Claim payment count: number of partial payments;
  6. Claim payment size: sizes of partial payments in constant dollar values i.e. without inflation;
  7. Claim payment time: inter-partial-payment delays, partial payment times;
  8. Claim inflation: sizes of inflated partial payments (includes both base inflation and superimposed inflation, with the latter operating with respect to both claim occurrence time and payment time).

As mentioned earlier, SynthETIC is designed to be very flexible in modelling choices. The user can adjust the parameters in each simulation step (e.g. claim notification) without modifying the underlying algebraic structure of the simulator (Weibull distribution with parameters dependence on claim size and occurrence period). Alternatively, in cases where a change of structure is deemed necessary, the user may easily unplug any one of the default module and replace it with a version modified to their own purpose.

In either case, the module sequence - specified in Section 4.1 of the paper(Avanzi et al. 2020a) and reproduced above - should be maintained so that the assumed dependencies between the selected modules still hold. In particular, SynthETIC currently incorporates the following dependencies (refer to the paper(Avanzi et al. 2020a) for a more comprehensive overview of the data features):

The presence of such dependencies can often cause awkwardness in modelling (e.g. incompatibility with chain-ladder reserving), but is necessary for the effective testing of proposed alternative models as it reflects the complicated claim experience seen in real data.

Starter Demo

To install the CRAN version of the package, do


install.packages("SynthETIC")

To install the development version of the package from the GitHub repository (which can occasionally be more updated), do


if (!require(remotes)) install.packages("remotes")
remotes::install_github("agi-lab/SynthETIC/SynthETIC", build_vignettes = TRUE)

After the installation, run


library(SynthETIC)

to load the package.

For a quickstart, here we present the code that produces the built-in test data set, which has detailed records of 3,624 individual claims (and 18,983 partial payments) over 10 years of exposure. We will skip the modelling specifications, which are fully described in Section 4 of the paper(Avanzi et al. 2020a). Readers can also view a full code demonstration by running


RShowDoc("SynthETIC-demo", package = "SynthETIC")

after installation of the package. The demo document shows in detail the behind-the-scenes simulation steps along with instructions on how to modify those to tailor for alternative claim experience.

To get set up for SynthETIC simulation, users are required to define at the top of their code a time_unit and a reference value ref_claim via the set_parameter() function. The default implementation works with a time_unit of 1/4 (given as a fraction of a year), which means that time is measured in quarters (e.g. \(t = 1\) represents the end of the first calendar quarter). Similarly, all payment amounts require a monetary unit to be specified. In the default setting all amounts are in AUD. However, both monetary and time units may require variation from one application to another.

The two package-wise global parameters, time_unit and ref_claim, enable the application of SynthETIC in contexts different from the default. For instance, letting time_unit = 1/12 sets the simulator up to work with calendar months. The reference claim size ref_claim can be seen as a scaling parameter. In SynthETIC, the default distributional parameters are specified as multiples of ref_claim instead of hard-coded raw numbers. This allows users to easily simulate a synthetic portfolio with similar claim pattern but in a different currency, e.g. the reference claim size might be 1,000 USD or 1,000,000 KRW. Alternatively, some schemes of insurance define entitlements as multiples of a regulated reference claim size, and this latter may be used as the SynthETIC global parameter. The two global parameters extend the application of SynthETIC.


library(SynthETIC)
set.seed(20200131)
set_parameters(ref_claim = 200000, time_unit = 1/4)

#########################################################################
##                      Simulation starts here                         ##
#########################################################################
# Module 1: Claim occurrence
n_vector <- claim_frequency(I = 40, E = 12000, freq = 0.03)
occurrence_times <- claim_occurrence(n_vector)
# Module 2: Claim size
claim_sizes <- claim_size(n_vector)
# Module 3: Claim notification
notidel <- claim_notification(n_vector, claim_sizes)
# Module 4: Claim settlement
setldel <- claim_closure(n_vector, claim_sizes)
# Module 5: Claim payment count
no_payments <- claim_payment_no(n_vector, claim_sizes)
# Module 6: Claim payment size
payment_sizes <- claim_payment_size(n_vector, claim_sizes, no_payments)
# Module 7: Claim payment time
payment_delays <- claim_payment_delay(
  n_vector, claim_sizes, no_payments, setldel)
payment_times <- claim_payment_time(
  n_vector, occurrence_times, notidel, payment_delays)
# Module 8: Claim inflation
demo_rate <- (1 + 0.02)^(1/4) - 1
base_inflation_past <- rep(demo_rate, times = 40)
base_inflation_future <- rep(demo_rate, times = 40)
base_inflation_vector <- c(base_inflation_past, base_inflation_future)
payment_inflated <- claim_payment_inflation(
    n_vector, payment_sizes, payment_times, occurrence_times, claim_sizes,
    base_inflation_vector)

The simulated quantities can be summarised on an individual claim level, or on a transactional level:


# claim level data
claim_dataset <- generate_claim_dataset(
  frequency_vector = n_vector,
  occurrence_list = occurrence_times,
  claim_size_list = claim_sizes,
  notification_list = notidel,
  settlement_list = setldel,
  no_payments_list = no_payments
)
str(claim_dataset)

'data.frame':   3624 obs. of  7 variables:
 $ claim_no         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ occurrence_period: num  1 1 1 1 1 1 1 1 1 1 ...
 $ occurrence_time  : num  0.624 0.121 0.222 0.454 0.591 ...
 $ claim_size       : num  783769.1 214480.6 30902.2 49.9 14326 ...
 $ notidel          : num  2.63 1.53 3.05 7.6 3.41 ...
 $ setldel          : num  29.89 25.77 6.266 0.901 7.962 ...
 $ no_payment       : num  6 4 4 2 2 11 1 2 9 2 ...

# transactional level data
all_claims <- claims(
  frequency_vector = n_vector,
  occurrence_list = occurrence_times,
  claim_size_list = claim_sizes,
  notification_list = notidel,
  settlement_list = setldel,
  no_payments_list = no_payments,
  payment_size_list = payment_sizes,
  payment_delay_list = payment_delays,
  payment_time_list = payment_times,
  payment_inflated_list = payment_inflated
)
transaction_dataset <- generate_transaction_dataset(
  all_claims,
  adjust = FALSE # keeps the original (potentially out-of-bound) simulated payment times
)
str(transaction_dataset)

'data.frame':   19025 obs. of  12 variables:
 $ claim_no         : int  1 1 1 1 1 1 2 2 2 2 ...
 $ pmt_no           : num  1 2 3 4 5 6 1 2 3 4 ...
 $ occurrence_period: num  1 1 1 1 1 1 1 1 1 1 ...
 $ occurrence_time  : num  0.624 0.624 0.624 0.624 0.624 ...
 $ claim_size       : num  783769 783769 783769 783769 783769 ...
 $ notidel          : num  2.63 2.63 2.63 2.63 2.63 ...
 $ setldel          : num  29.9 29.9 29.9 29.9 29.9 ...
 $ payment_time     : num  7.31 9.06 16.38 23.71 31.54 ...
 $ payment_period   : num  8 10 17 24 32 34 9 19 26 28 ...
 $ payment_size     : num  21738 18532 20968 20642 588433 ...
 $ payment_inflated : num  22538 19382 22739 23213 687887 ...
 $ payment_delay    : num  4.06 1.75 7.32 7.33 7.83 ...

The resulting data sets are now ready for testing of proposed loss reserving models.