Machine Learning in Forecasting Motor Insurance Claims: Comparison
Please note this is a comparison between Version 1 by Periklis Gogas and Version 2 by Sirius Huang.

Accurate forecasting of insurance claims is of the utmost importance for insurance activity as the evolution of claims determines cash outflows and the pricing, and thus the profitability, of the underlying insurance coverage. These are used as inputs when the insurance company drafts its business plan and determines its risk appetite, and the respective solvency capital required (by the regulators) to absorb the assumed risks. The conventional claim forecasting methods attempt to fit (each of) the claims frequency and severity with a known probability distribution function and use it to project future claims.

  • insurance
  • claims
  • forecasting
  • machine learning
  • budget
  • SVM
  • Decision Trees
  • Random Forest
  • Boosting

1. Introduction

Insurance is the activity by which an individual or enterprise exchanges an uncertain (financial) loss for a certain (financial) loss. The former is the outcome of an event for which the insured individual or enterprise has received coverage via an insurance policy; the latter is the premium that the insured has to pay to receive this coverage. When such an event occurs, the insured may formally request coverage (monetary or in-kind) in line with the policy terms and conditions, which constitutes the insurance claim.
It is therefore clear that claims are key components of the insurance activity as they essentially comprise the realization of the insurance product/service. Due to the uncertainty of (future) claims occurrence, it is in the interest of the insurers to carefully frame their claims expectations and provisions. Consequently, they pursue claims forecasting. The accurate forecasting of insurance claims is important for several reasons.
First, claims constitute the basis of pricing. In insurance, contrary to other services, the validity of the pricing is confirmed, and the adequacy of the premium is proved, only after the experience has been recorded. Traditional pricing is based on historical data; however, it is the occurrence of incidents in the future that determines whether the estimated burning cost was correct or not. Hence, if the claims experience has not been properly embedded in the pricing models, the (pure) premium may be too low to cover the total claims (incurred or paid), which leads to a loss-making activity. Conversely, a premium that is too high could result in the loss of customers.
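As a minimal sketch of the pricing check described above, the burning cost can be computed as incurred claims divided by exposure and then compared with the pure premium that was charged. The function names and the portfolio figures below are hypothetical and serve only as an illustration.

```python
def burning_cost(incurred_claims, exposure_years):
    """Burning cost: total incurred claims per unit of exposure."""
    if exposure_years <= 0:
        raise ValueError("exposure must be positive")
    return sum(incurred_claims) / exposure_years


def premium_margin(pure_premium_charged, incurred_claims, exposure_years):
    """Charged pure premium minus realized burning cost.

    A negative value means the premium did not cover the claims (under-pricing);
    a large positive value may indicate a premium that is uncompetitively high.
    """
    return pure_premium_charged - burning_cost(incurred_claims, exposure_years)


# Hypothetical portfolio: three claims recorded over 50 policy-years of exposure.
claims = [1200.0, 800.0, 3000.0]
exposure = 50.0

bc = burning_cost(claims, exposure)            # realized cost per policy-year
margin = premium_margin(90.0, claims, exposure)  # premium of 90 per policy-year
```

In this toy case the premium turns out to be lower than the realized burning cost, which is exactly the situation that can only be confirmed once the claims experience has been recorded.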
Second, future claims occurrence is important for the compilation of the business plan as claims affect the future profitability of the company. In fact, the claims experience is probably the most significant determinant of the operational profitability of the insurance company. This is because, when compiling the business plan, an insurance company projects the future premia and the future claims over a period of years. Future premia are based primarily on sales forecasts, the evolution of inflation (ideally the one related to the insurance coverage under examination), as well as the projected claims experience. Expected future claims are based on the historical claims experience as well as on assumptions on the development of claims, which may be decomposed into the development of claims frequency and severity.
Finally, a forward look at claims is a prerequisite of the insurer's risk and solvency assessment process and report, which depicts the risk appetite of the insurer and thus the capital required for its solvency. As a matter of fact, it usually requires (one of) the biggest portions of capital (allocations). Indeed, insurers assume the risks that individuals and enterprises want to transfer, hedge, or mitigate. A claim is filed when a covered event (the assumed risk) has occurred. A higher risk appetite indicates the assumption of higher risk and thus higher claim anticipation. This leads to higher (economic) capital required for the absorption of this risk.
The conventional forecasting approaches attempt to either repeat the historical (growth) pattern of claims in the future (with potential seasonality and respective premia considered) or match the claims frequency and severity experience of the insurance company with a known probability distribution function. Smaller claims exhibit higher frequency, whereas large claims have a (much) smaller frequency. To improve the precision of the forecasting, large claims are pooled separately from the small claims and different probability distribution functions are used to best fit the claims frequency and severity of the two pools of claims.
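The conventional frequency-severity approach can be sketched as follows. The choice of a Poisson distribution for frequency and a lognormal distribution for severity is an assumption made here for illustration (the text only says "a known probability distribution function"), and the data are hypothetical. In practice the same fit would be repeated separately for the small-claims and large-claims pools.

```python
import math
import statistics


def fit_poisson(counts):
    """Maximum-likelihood Poisson rate: the sample mean of the claim counts."""
    return statistics.fmean(counts)


def fit_lognormal(severities):
    """Maximum-likelihood lognormal parameters (mu, sigma) from claim amounts."""
    logs = [math.log(x) for x in severities]
    return statistics.fmean(logs), statistics.pstdev(logs)


def lognormal_mean(mu, sigma):
    """Expected claim amount implied by the fitted lognormal."""
    return math.exp(mu + 0.5 * sigma ** 2)


def expected_aggregate_claims(counts, severities, n_policies):
    """Projected total claims = policies x expected frequency x expected severity."""
    lam = fit_poisson(counts)
    mu, sigma = fit_lognormal(severities)
    return n_policies * lam * lognormal_mean(mu, sigma)


# Hypothetical history: claims per policy-year and the observed claim amounts.
counts_per_policy = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
claim_amounts = [500.0, 1200.0, 800.0, 2500.0]

lam = fit_poisson(counts_per_policy)
mu, sigma = fit_lognormal(claim_amounts)
projected = expected_aggregate_claims(counts_per_policy, claim_amounts, n_policies=1000)
```

The projection multiplies expected frequency by expected severity, which relies on the usual simplifying assumption that claim counts and claim amounts are independent.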
Machine learning (ML) approaches offer an alternative route to claims forecasting. The contribution of ML (artificial intelligence, AI) to insurance in general, and to claims prediction specifically, has been recognized by practitioners, who have spotted a wide range of ML applications in insurance spreading over almost all its processes, such as claims processing, claims fraud detection, claims adjudication, claim volume forecasting, automated underwriting, submission intake, pricing and risk management, policy servicing, insurance distribution, product recommendation/personalized offers, assessor assistance, property (damage) analysis, automated inspections, customer lifetime value prediction/customer retention/lapse management, speech analytics, customer segmentation, workstream balancing for agents, and self-servicing for policy management (Seely 2018; Somani 2021). A report from the Organization for Economic Cooperation and Development (OECD 2020) subscribes to this point of view as it identifies the increasing number of ML (AI) applications in insurance, which are enabled through the widespread collection of big data and their analysis. The report pinpoints marketing, distribution and sales, claims (verification and fraud), pricing, and risk classification as broader areas of ML utilization. It further addresses some attention points, such as policy and regulation with regard to the use of ML in insurance, with emphasis, among others, on privacy and data protection, market structure, risk classification, and explainability of ML. The implementation of ML (AI) methods in these sectors of the insurance operations, along with the related concerns about ethical and societal challenges, has been recorded by Grize et al. (2020), Banks (2020), Ekin (2020), and Paruchuri (2020). The reports of Deloitte (2017), SCOR (2018), Keller et al. (2018), and Balasubramanian et al. (2021) identify similar applications of ML as they pave the future of insurance.

2. Machine Learning in Actuarial and Risk Management

The bulk of the literature on the applications of machine learning in insurance is relatively recent (post 2019), and although it covers a wide range of topics relevant to the insurance activity, there is ample room for further research. The main literature strands focus on claims, reserving, pricing, capital requirements–solvency, coverage ratio, acquisition, and retention. Researchers group them into two main categories: actuarial and risk management, which incorporates the first four (claims, reserving, pricing, and capital requirements–solvency), and customer management, which incorporates the last three (coverage ratio, acquisition, and retention). The second category will not be presented in detail herein. The interested reader may look at Mueller et al. (2018) for the coverage ratio; Boodhun and Jayabalan (2018) and Qazi et al. (2020) for acquisition; and Grize et al. (2020) and Guillen et al. (2021) for retention. The literature that is relevant to actuarial and risk management issues addresses the main functions of the insurance activity and is thus related to actuarial science and risk management. In fact, insurance is the assumption and management of risks that individuals or enterprises wish to transfer or mitigate. These functions entail the monitoring of the claims/risks evolution, the determination of the required reserves, the estimation of the appropriate tariff rates, as well as the calculation of the capital that is required to ensure the solvency of the insurer. The analysis of these literature strands follows.

2.1. Claims/Risks

Fauzan and Murfi (2018) focus on the forecasting of motor insurance accident claims via ML methods with an emphasis on missing data. Rustam and Ariantari (2018) use ML approaches to predict the occurrence of motor insurance claims based on claim history (with data stemming from an Indonesian motor insurer). Pesantez-Narvaez et al. (2019) attempt to predict the existence of accident claims with the use of ML techniques on telematics data (coming from an insurance company) with an emphasis on driving patterns (total annual distance driven and percentage of distance driven in urban areas). Qazvini (2019) employs ML methods to predict the number of zero claims (i.e., claims that have not been reported) based on telematics data (on French motor third-party liability). Bermúdez et al. (2020) apply ML approaches to model insurance claim counts with an emphasis on the overdispersion and the excess number of zero claims, which may be the outcome of unobserved heterogeneity. Bärtl and Krummaker (2020) attempt to predict the occurrence and the magnitude of export credit insurance claims with the use of ML techniques. The models employed produce satisfactory results for the former but less satisfactory results for the actual claim ratios; accuracy, Cohen’s κ, and R² were used to assess model performance. Knighton et al. (2020) focused on forecasting flood insurance claims with ML models that applied hydrologic and social demographic data, and found that the incorporation of such data can improve flood claim prediction. Hanafy and Ming (2021) apply ML approaches to predict the occurrence of motor insurance claims (over the portfolio of Porto Seguro, a large Brazilian motor insurer). Selvakumar et al. (2021) concentrated on the prediction of the third-party liability (motor insurance) claim amount for different types of vehicles with ML models (on a dataset derived from Indian public insurance companies).
Some recent articles utilize the data collected through telematics. More specifically, Duval et al. (2022) used ML models to come up with a method that indicates the amount of information (collected via telematics with regard to the policyholders’ driving behavior) that needs to be (optimally) retained by insurers to (successfully) perform motor insurance claim classification. Reig Torra et al. (2023) also capitalized on the data provided by telematics and used the Poisson model, along with some weather data, to forecast the expected motor insurance claim frequency over time. They found that weather conditions do affect the risk of an accident. Masello et al. (2023) used the information collected via telematics and employed ML methods to assess the ability of driving contexts (such as road type, weather, and traffic) to predict driving risks/safety (such as near-misses, speeding, and distraction events), which, in turn, affect the exposure to/occurrence of accidents and thus motor insurance claims. Pesantez-Narvaez et al. (2021) compared the ability of ML models to detect rare events (on a third-party liability motor insurance dataset) and found that RiskLogitboost regression exhibits a superior performance over other methods. Shi and Shi (2022) employed ML approaches on property insurance claims to develop rating classes and estimate rating relativities for a single insurance risk; perform predictive modeling for multivariate insurance risks and unveil the impact of tail-risk dependence; and price new products.
In a different direction, that of fraud detection, Pérez et al. (2005) applied ML approaches (on a motor insurance portfolio) to detect fraudulent claims in motor insurance by properly classifying suspicious claims. Kose et al. (2015) employed ML approaches for the detection of fraudulent claims or abusive behavior in healthcare insurance via an interactive framework that incorporates all the interested parties and materials involved in the healthcare insurance (claim) process. On the same topic, Roy and George (2017) used ML methods to detect fraudulent claims in motor insurance. Wang and Xu (2018) employed ML models that incorporate the (accident) information embedded in the text of the claims to detect potential claim fraud in motor insurance. Dhieb et al. (2019, 2020) applied ML techniques to automatically identify motor insurance fraudulent claims and sort them into different fraud categories with minimal human intervention, along with alerts for suspicious claims.
A series of papers implemented ML approaches in health management/insurance. Bauder et al. (2016) introduced ML approaches to spot physicians who exhibit potentially anomalous behavior (pointing to misuse, fraud, or ignorance of the billing procedures) in health (medical) insurance claims (with data taken from the USA Medicare system) and for whom additional investigation may be necessary. Hehner et al. (2017) highlighted the merits of the introduction of ML (AI) in hospital claims management, which can be summarized as savings for both the insurers and the insured, as ML algorithms result in increased efficiency and well-informed decision-making to the benefit of all interested parties. Rawat et al. (2021) applied ML methods to analyze claims and identify a set of factors that facilitate claim filing and acceptance. Cummings and Hartman (2022) propose a series of ML models that provide insurers the ability to forecast Long Term Care Insurance (LTCI) claim rates and thus better their capacity to operate as LTCI providers.
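The overdispersion and excess zeros that Bermúdez et al. (2020) emphasize in claim counts can be checked with two simple diagnostics before a count model is chosen. The sketch below is only an illustration with hypothetical data and function names; it is not the method of any of the cited papers. It compares the variance-to-mean ratio with the value of 1 implied by a Poisson model, and the observed share of zero counts with the share that a Poisson model with the same mean would predict.

```python
import math
import statistics


def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 under Poisson, above 1 signals overdispersion."""
    mean = statistics.fmean(counts)
    if mean == 0:
        raise ValueError("dispersion index is undefined when all counts are zero")
    return statistics.pvariance(counts) / mean


def excess_zero_share(counts):
    """Observed share of zero counts minus the Poisson-implied share exp(-mean)."""
    mean = statistics.fmean(counts)
    observed = sum(1 for c in counts if c == 0) / len(counts)
    return observed - math.exp(-mean)


# Hypothetical claim counts per policy: mostly zeros plus one heavy claimant.
claim_counts = [0, 0, 0, 0, 0, 0, 0, 0, 1, 5]

di = dispersion_index(claim_counts)
ez = excess_zero_share(claim_counts)
```

A dispersion index well above 1 together with a positive excess-zero share would suggest that a plain Poisson model is too restrictive for the portfolio at hand.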

2.2. Reserving

Baudry and Robert (2019) developed an ML method to estimate claims reserves with the use of all policy and policyholder covariates, along with the information pertaining to a claim from the moment it has been reported, and compared their results with those generated via chain ladder. Elpidorou et al. (2019) employed ML techniques to introduce a novel Bornhuetter–Ferguson method as a variant of the traditional chain ladder method used for reserving in non-life (general) insurance, through which the actuary can adjust the relative ultimate reserves with the use of externally estimated relative ultimate reserves. In the same direction, Bischofberger (2020) utilized ML methods to extend the chain ladder method via the estimated hazard rate for the estimation of non-life claims reserves. The outperformance (in 4 out of 5 lines of business studied) of ML algorithms over traditional actuarial approaches in estimating loss reserves (future customer claims) is evidenced by the work of Ding et al. (2020). Similarly, Gabrielli et al. (2020) explore the merit of the introduction of ML approaches to traditional actuarial techniques in improving the non-life insurance claims reserving (prediction).
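For orientation, the traditional chain ladder method that these studies use as a benchmark or extend can be sketched in a few lines. This is the textbook volume-weighted version applied to a hypothetical cumulative run-off triangle, not the ML variants proposed in the cited papers; the function names are illustrative.

```python
def development_factors(triangle):
    """Volume-weighted age-to-age factors from a cumulative run-off triangle.

    `triangle` is a list of rows (one per origin year, oldest first); each row
    holds the cumulative claims observed so far, so later rows are shorter.
    """
    n_dev = len(triangle[0])
    factors = []
    for j in range(n_dev - 1):
        # Use only origin years where both development periods j and j+1 are observed.
        rows = [row for row in triangle if len(row) > j + 1]
        factors.append(sum(r[j + 1] for r in rows) / sum(r[j] for r in rows))
    return factors


def project_ultimates(triangle, factors):
    """Roll each origin year's latest cumulative value forward to ultimate."""
    ultimates = []
    for row in triangle:
        value = row[-1]
        for f in factors[len(row) - 1:]:
            value *= f
        ultimates.append(value)
    return ultimates


def reserves(triangle, ultimates):
    """Reserve per origin year = projected ultimate minus latest observed cumulative."""
    return [u - row[-1] for row, u in zip(triangle, ultimates)]


# Hypothetical cumulative paid claims for three origin years.
triangle = [
    [100.0, 150.0, 165.0],
    [110.0, 165.0],
    [120.0],
]

factors = development_factors(triangle)
ultimates = project_ultimates(triangle, factors)
outstanding = reserves(triangle, ultimates)
```

The oldest origin year is treated as fully developed, so its reserve is zero; no tail factor is applied in this sketch.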

2.3. Pricing

Gan (2013), in a comparatively early work, priced the guarantees (i.e., found the market value and the Greeks) of a large portfolio of variable annuity policies (generated by the author) via ML techniques. Assa et al. (2019) used ML approaches to study the correct pricing of deposit insurance by improving the implied volatility calibration to avoid mispricing due to arbitrage. Grize et al. (2020) unveiled the role of ML algorithms in (online) motor liability insurance pricing and, at the same time, raised the issue of interpretability. Henckaerts et al. (2021) capitalized on ML methods to price non-life insurance products based on the frequency and severity of claims; their results are superior to the ones produced by the traditionally employed generalized linear models (GLMs). Kuo and Lupton (2020) explained that the wider adoption of ML techniques (over GLMs) in property and casualty insurance pricing is held back largely by their reduced (perceived) transparency. They recommend increased interpretability to overcome this hurdle. These concerns are also addressed in Grize et al. (2020). Blier-Wong et al. (2020) performed a literature review on the application of ML methods to property and casualty insurance actuarial tasks in pricing and reserving. They outlined potential future applications and research in the field and noted three main challenges: interpretability, prediction uncertainty, and potential discrimination. Some practitioner best practices have already been reported in the literature. AXA, for example, has applied ML methods to forecast large-loss car accidents to achieve optimal motor insurance pricing (Sato 2017; Ekin 2020).

2.4. Capital Requirements–Solvency

Díaz et al. (2005), early compared to other studies, applied ML approaches to a set of financial ratios to predict the insolvency of Spanish non-life insurance companies. Krah et al. (2020) focused on the derivation of the solvency capital requirement that life insurers need to honor under the Solvency II directive in the European Union with the use of ML methods, as an alternative to the approximation techniques that insurance companies use. Finally, Wüthrich and Merz (2023), in their book, presented the (entire) array of traditional actuarial and modern machine learning techniques that can be applied to address insurance-related problems. They explained how these techniques can be applied by actuaries on real datasets and how the derived results may be interpreted.