A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

Page 1

A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling P ENG S HI Division of Statistics Northern Illinois University DeKalb, Illinois 60115 email: pshi@niu.edu

E MILIANO A. VALDEZ Department of Mathematics University of Connecticut Storrs, Connecticut 06269-3009 email: emiliano.valdez@uconn.edu

August 31, 2010

Abstract In this article, we provide an alternative evidence of asymmetric information in automobile insurance based on a copula model. We use the Frank’s copula to jointly model the type of policy coverage chosen and the number of accidents, with the dependence parameter providing for evidence of the relationship between the choice of coverage and the frequency of accidents. This dependence therefore provides an indication of the presence (or absence) of asymmetric information. The type of coverage is in some sense ordered so that coverage with higher ordinals indicate the most comprehensive coverage. Henceforth, a positive relationship would indicate that more coverage is chosen for those with higher frequency of accidents, and vice versa. This presence of asymmetric information could be due to either adverse selection or moral hazard, a distinction often made in the economics of insurance literature, or both. We calibrated our copula model using a one-year cross-sectional observation of claims arising from a major automobile insurer in Singapore. Our estimation results indicate a strong evidence of the presence of asymmetric information, and we further used our estimated model for other possible actuarial applications. In particular, we are able to demonstrate the effect of coverage choice on the incidence of accidents, and based on which, a posterior pure premium is derived. In general, a positive margin is observed when compared with a prior premium available in our empirical data base. Keywords: Adverse selection, Moral hazard, Insurance, Copula models, Predictive models.

1


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

2

1 Background and motivation In contract theory, asymmetric information is defined to be the situation when one party of an economic transaction possesses information that is not available to the other contractual party. A clear example can be found in the market for used items or goods such as those found on eBay where there is transfer of ownership from one person to another. In such transactions, there are variables, usually observable only during the period of ownership of the goods and oftentimes difficult to trace, to help assist the buyer in assessing the quality of the goods being purchased. As a consequence, the original owner knows the history of the performance of the goods during its period of wear and tear which maybe unavailable to the buyer. A classic example can also be found in the market for “used cars”. Indeed, in the seminal work of Akerlof (1970), the economist George A. Akerlof explained the problems arising from information asymmetry based on the “used car” market and refered to defective used cars as “lemons”. The presence of asymmetric information is no stranger in the insurance market, and this is true in all forms of insurance (e.g. life, medical, dental, homeowners, and automobile). There is additional information about the probability of a loss unavailable to the insurance company, preventing it from creating a portfolio of homogeneous risks and thus making it more difficult for the insurer to effectively price-discriminate buyers of insurance policies. The economics literature distinguishes asymmetric information due to adverse selection and due to moral hazard. Within the context of insurance, there is presence of adverse selection when an insurer does not have enough information to assess those buyers who are considered ‘high risks’ and who are more likely to purchase insurance. On the other hand, there is presence of moral hazard when the behavior of policyholders is altered because they have insurance coverage. Prominent examples of this are in medical insurance when policyholders tend to be less careful of their health and in automobile insurance when policyholders tend to behave recklessly while driving on the road. In this article, because our data makes it difficult to make the distinction, our model for investigating the presence of asymmetric information does not allow for such distinction. The implications of the presence of asymmetric information can have terrifying effects on the insurance market such as possible deterioration of premium adequacy which may lead to insurer insolvency and subsequent market collapse. In combating for asymmetric information, insurers have tended to modify insurance policy designs within the limits of the law. Some of these popular modifications are the introduction of policy deductibles, coinsurance, and policy limits; all tend to have the characteristics that result in the sharing of the risks between the insured and the insurer. In the pioneering work of Rothschild and Stiglitz (1976), the authors demonstrated that when an insurance company offers a basket of products with varying levels of coverage, a separating equilibrium results, in which optimally, as a result, the higher risk chooses a higher level of coverage and pays a higher premium while the lower risk chooses a lower level of coverage and pays a lower premium. As discussed earlier, this phenomenon is called adverse selection. A possible implication of the work of Rothschild and Stiglitz (1976) is that the existence of adverse selection could be tested empirically by examining whether there is a positive relationship between the risk of policyholders and their subsequent selection of the level of insurance coverage. In this article, we focus on the automobile insurance market primarily motivated by the available data we have for empirical investigation and by the fact that there have been several empirical investigations along this line making it easy to compare our results with existing ones. Although theory generally predicts for the presence of asymmetric information, earlier empirical investigations, however, have rather been inconclusive in the sense that some studies have found strong evidence while others have suggested no evidence at all. Several of these studies share many model characteristics


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

3

such as the use of a policyholder’s claim history to measure the risk level of the driver in conjunction with the effects of the choice of policy coverage (the presence of deductibles, coinsurance and/or policy limits). In addition, several of the empirical models control for certain observable characteristics that allow the insurer to classify the risk level of the policyholders. However, several of these studies are distinct in many aspects. First is the choice of the risk level of the driver: some treat the likelihood of the event that an accident, or subsequently a claim, occurs while others investigate the number of times an insured has made a claim as the primary variable of interest. Next of course are the differences in the empirical data used in the investigation itself where other controlling effects make the comparison non-uniform. Finally and obviously, the model choice, which ranges from using logit or probit models for the probability of a claim to using count data regression models when using the number of accidents or claims. To illustrate, using an ordered logit model, Puelz and Snow (1994) found a strong positive correlation between the number of times an insured has made a claim together with the level of coverage purchased accounting for the risk class to which the insured has been classified. Based on the notion of conditional dependence, Dionne et al. (2001) suggested that the presence of adverse selection vanishes when “the nonlinearity of the risk classification variables” are taken into account. Using a portfolio of contracts from a French insurer, Chiappori and Salani´ e (2000) also concluded no evidence of asymmetric information in the automobile insurance market. The authors proposed the use of a bivariate probit regression to jointly model the choice of a better coverage and the occurrence of an accident. The rationale is that the influence between the occurrence of accidents and the policy selection could go in either direction. Such method has also been employed by Cohen (2005) using data from an insurer that operates in Israel, and by Saito (2006) using data from a Japanese automobile insurer, to test the asymmetric information in the automobile insurance market. These articles surprisingly have opposing results; Cohen (2005) concluded with evidence of informational asymmetry and Saito (2006) found no adverse selection or moral hazard in this market. As already alluded earlier, the adverse selection effect implies that a higher risk, one with a higher chance of accident occurrence, ends up with a more comprehensive and better coverage. However, the occurrence of accident could be the result of a moral hazard effect, that is, the selection of a more comprehensive and better coverage could provide a motivation for a careless or reckless driving leading back to a higher chance of getting into an accident. Consequently, in the literature, the coverage-occurrence relation has been attributed to a combined effect of adverse selection and moral hazard. Recently, Kim et al. (2009) pointed out that dichotomous indicators might not completely reveal the policyholder’s selection decision. Henceforth, a multinomial measure was suggested for the policyholder’s coverage selection. Similar to Richaudeau (1999), these authors considered a count data regression model to examine the effect of policy selection on the occurrence of accidents. Another factor that may affect the coverage-occurrence relationship is the degree of risk aversion of policyholders. An individual with a higher risk aversion is more likely to purchase better coverage and even drive more carefully. In contrast with adverse selection and moral hazard, this observation implies a negative relationship between occurrence of accident and choice of coverage. Such negative correlation could offset the positive correlation implied by the traditional adverse selection. Research work along this line include de Meza and Webb (2001) and Chiappori et al. (2006), where the authors argue that the absence of the positive correlation between the level of risk and the choice of coverage could be consistent with the presence of asymmetric information. Thus, the empirical results in our study represent a combined effect of various types of asymmetric information. Motivated by examining the asymmetric information in the automobile insurance market, we propose to use a copula function to jointly model the policyholder’s choice of coverage and occurrence of accident. Our approach differs from existing studies in several respects. Unlike linear


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

4

correlation models used in previous studies, the copula approach allows to capture both linear and non-linear coverage-occurrence relationships. As a consequence, our method avoids the potential endogeneity issue that possibly arises when examining the association between a multinomial coverage selection measure and the number of accidents. We note a common limitation shared by our method and other empirical studies, that the coverage-occurrence relationship represents a combined effect of various types of asymmetric information. As pointed out by Abbring et al. (2003), a longitudinal observation is required to distinguish between adverse selection and moral hazard. We leave this investigation for further work. Although we find the proposed method is motivated to empirically examine asymmetric information, it is found useful for other actuarial applications. As we additionally demonstrate in the paper, the copula model could be used to calculate conditional probabilities of accident, to examine the effects of covariates on accident occurrence, as well as to derive posterior pure premium. These posterior pure premiums can further assist the insurer to investigate premium adequacy. For the rest of the article, we have organized it as follows. Section 2 introduces the bivariate copula model for testing asymmetric information. Section 3 describes the empirical data and calibrates the model using this data. In addition, the section discusses procedures used to examine model goodness-of-fit and its implications. A robust test of model implications is provided in Appendix. Two actuarial applications are demonstrated in Section 4: the calculations of conditional accident probability applying Bayes’ rule and the derivation of posterior pure premiums. Section 5 concludes the paper with a discussion of additional further work.

2 Specification of the copula model In examining the presence of asymmetric information in the automobile insurance market, we fole (2000) by considering the joint distribution low a similar argument made in Chiappori and Salani´ model of some multinomial measure of coverage selection and the number of accidents. This motivates us to exploit the inherent flexibility of copula models in this work. We demonstrate the suitability of these types of models for testing the presence (or absence) of asymmetric information for a portfolio of automobile insurance contracts and subsequently, for other possible actuarial applications in predictive modeling. The copula approach used here has been inspired by the work of Zimmer and Trivedi (2006), although in a different context, where a trivariate copula was employed to jointly model demands for family health care and decisions regarding arrangements or purchase of health insurance. Our bivariate random vector clearly involves discrete data for which there is the additional difficulty of interpreting and applying copulas because of its non-uniqueness according to Sklar’s representation. This type of model has been explored and applied to count data in the a (2007). Other works that explore the use of copulas actuarial literature; see Genest and Neˇslehov´ for discrete data have appeared in Prieger (2002) and Smith (2003). Various types of coverage are typically available to the consumer in an automobile insurance market. An policyholder’s coverage selection is commonly described by a dichotomous measure, such as responsabilit´e civile v.s. assurance tours risques, based on the French car insurance data e (2000). However, as pointed out by Kim et al. (2009), a multinomial used in Chiappori and Salani´ indicator of coverage choice provides extra information for a policyholder’s decision. For the same argument then, we consider a multinomial measure and examine three types of coverage, ordinally ranking them from lowest to highest: third party only, third party fire and theft, and comprehensive. ‘Third party only’ refers to third party liability coverage giving protection in the event of an accident where the policyholder is at fault and this is usually the minimum liability coverage required by law in many jurisdictions for automobile insurance. ‘Third party fire and theft’ provides for additional


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

5

coverage in case of fire and theft. Finally, a ‘comprehensive’ provides for further additional coverage of damaged property in the event of an accident. With a cross-sectional set of observations, we begin by letting yi1 and yi2 indicate the choice of coverage and the number of accidents, respectively, for policyholder i. Here, yi1 , with possible values of 1, 2, or 3, represents the choice of third party, third party fire and theft, and comprehensive coverages, respectively. Note that yi1 and yi2 are the observed variables whose values will be ∗ and y ∗ , respectively. One determined according to the corresponding latent variables defined by yi1 i2 ∗ ∗ could view yi1 as the policyholder’s preferred policy coverage and yi2 as the inherent risk level of the policyholder. As explained in Section 1, a policyholder’s optimal coverage choice and the inherent degree of risk are to be jointly determined. Thus, estimating the two outcomes simultaneously should deliver more and better information compared with a separate estimation. A linear relationship can be assumed for both underlying latent variables according to the following specification: ′

∗ yi1 = xi β + νi + εi1

and ′

∗ yi2 = z i γ + νi + εi2 .

Here, xi and z i denote the vectors of covariates that explain the policy selection and the underlying risk of the ith policyholder, respectively. The coefficient vectors β and γ are the corresponding parameters to be estimated. The common term νi represents the unobserved characteristics of the ith policyholder that could possibly jointly affect the preferred coverage choice and degree of risk, and thus introduces the association between the two processes. The error terms εi1 and εi2 are assumed to be independent with νi . The above model could be calibrated by specifying the distributions of νi and error terms, or by specifying the covariance structure of the latent variables. As an alternative, we choose to model the observable variables yi1 and yi2 with a parametric copula to be denoted by C(·, ·). Then the joint probability mass function of yi1 and yi2 could be expressed as: fi (yi1 , yi2 ) = Prob(Yi1 = yi1 , Yi2 = yi2 ) = C(Fi1 (yi1 ), Fi2 (yi2 )) − C(Fi1 (yi1 − 1), Fi2 (yi2 )) − C(Fi1 (yi1 ), Fi2 (yi2 − 1)) + C(Fi1 (yi1 − 1), Fi2 (yi2 − 1)),

(1)

where Fi1 and Fi2 are the cumulative distribution functions of yi1 and yi2 , respectively. Due to the parametric feature of the copula model, one needs specifications of the distribution functions Fi1 and Fi2 for model identification. The coverage choice is measured on an ordinal scale. Thus an ordered multinomial model is used to describe the relationship between the response yi1 ∗: and the latent variable yi1  ∗   1, if yi1 ≤ α1 ∗ ≤α (2) yi1 = 2, if α1 < yi1 2   3, if y ∗ > α i1

2

where α1 and α2 are unknown thresholds to be additionally estimated. The distribution function of yi1 can thus be expressed as:  ′   π(α1 − xi β), if yi1 = 1 ′ Fi1 (yi1 ) = Prob(Yi1 ≤ yi1 ) = π(α2 − xi β), if yi1 = 2 (3)   1, if y = 3 i1


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

6

Commonly used regression techniques for an ordinal response include ordered probit model with π(a) = Φ(a), where Φ(·) is the cumulative distribution function of a standard normal random variable, and ordered logit model with π(a) = 1/[1 + exp(−a)]. These two methods typically provide consistent results and the selection between the two rests on the user’s preference. For our purposes, we considered an ordered logistic regression model in the estimation. Henceforth, we use  1  , if yi1 = 1  1+exp(−(α ′  1 −xi β))  1 (4) Fi1 (yi1 ) = Prob(Yi1 ≤ yi1 ) = , if yi1 = 2 ′ 1+exp(−(α2 −xi β))    1, if yi1 = 3 The number of accidents yi2 is specified using a Negative Binomial regression model. More specifically, its probability mass function is expressed as ψ yi2 Γ(yi2 + ψ) ψ λi fi2 (yi2 ) = Prob(Yi2 = yi2 ) = , (5) Γ(ψ)Γ(yi2+1 ) λi + ψ λi + ψ

where a log link function used for its conditional mean is given by ′

E(yi2 |z i ) = λi = ωi exp(z i γ) with ωi the weight parameter for policyholder i. This weight parameter has the interpretation of the length of time to which the policyholder is covered by the policy for the calendar year in consideration. The dispersion parameter ψ in the conditional variance expressed as Var(yi2 |z i ) = λi + λ2i /ψ provides additional flexibility to fit the claims data, especially being able to accommodate either overdispersion or underdispersion often encountered with claims count data. See, for example, Cameron and Trivedi (1986). The model specified in this section by its nature is fully parametric and can therefore be easily estimated using likelihood-based methods. The log of the likelihood function can be derived by adding up the logarithm of the joint probability function expressed in (1) for all policyholders with our set of observable data given by (yi1 , yi2 , xi , z i ) for each policyholder i. In the log-likelihood function, Fi1 (yi1 ) is replaced with expression (4), and P Fi2 (yi2 ) = fi2 (yi2 ), with fi2 (yi2 ) specified by (5). To accommodate the fact that the choice of coverage and the frequency of accidents could possibly be either positively or negatively associated, we consider the Frank copula which permits such flexibility: (e−θu1 − 1)(e−θu2 − 1) 1 , (6) C(u1 , u2 ; θ) = − log 1 + θ e−θ − 1 where θ is the dependence parameter that captures the association between the two responses. The flexibility of allowing for either direction of association has been one of the primary reason for its popularity in applications in insurance, finance and medical statistics. It is rather straightforward to show that when θ → 0, the Frank copula yields to the product copula, indicating the case of independence. Furthermore, the case where θ > 0 indicates a positive association between the two responses; the reverse is true when θ < 0. These relationships allow us to understand the type of association that we (empirically) derive from our data. Additional statistical properties of the Frank’s family of copulas in (6) have been explored in Genest (1987). We refer the reader to this very interesting article to further understand this family of copulas.


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

7

3 Calibrating the model 3.1 Data characteristics Data used to calibrate the model specified in Section 2 was drawn from a portfolio of automobile insurance policies of a major insurer in Singapore. In particular, we use the observations in calendar year 2001 for this insurer where we have a total of 44,226 policies that were recorded in the portfolio. Similar to several jurisdictions worldwide, Singapore requires drivers to have, at the minimum, a third party liability coverage to be able to drive a vehicle on the road, and at the same time, drivers have the liberty to choose beyond this minimum level of coverage. Our dataset indeed comes from a subsample of studies done in Frees and Valdez (2008) and in Frees et al. (2009) where the source of the data was also revealed. Table 1 provides a summary of the frequency statistics for our two primary variables of interest: policy choice and claim count for the year. As indicated in this table, the majority of the policyholders, roughly 72% (31,648 contracts) in this case, choose for a comprehensive coverage and roughly about 11% (4,896 contracts) of the policyholders choose for a third party only coverage. In some sense, this is not at all surprising to find: owning and driving a car in Singapore is quite expensive. The government has put up measures such as vehicle quotas to be on the road to manage the level of car ownership. This expensive ownership could have possibly led car drivers to have insurance protection that also additionally covers for property damage, as a comprehensive coverage would. In terms of the frequency of accidents, as we normally would expect in portfolios of automobile insurance policies, most policyholders do not incur an accident within a one-year time frame. For our data, roughly 8% of the policyholders had a single accident and less than one percent had 2, 3 or 4 accidents. No single policyholder in our portfolio incurred more than 4 accidents during the year. It is worth pointing out here the difference between an accident and when an accident becomes potentially a claim. The reporting of an accident to the insurer that could eventually becomes a claim may be influenced by the behavior of the policyholder. The policyholder might reluctantly report an accident to the insurer under certain circumstances. For instance, for fear of increase in future premiums, the policyholder may find it more economically beneficial not to report an accident, especially if the loss amount from the damage or injury is small. Thus, in the records of insurance companies, the number of accidents reported may actually be lower due to the presence of underreporting. Because we use the data that was reported to the insurance company, we measure accident frequency based on the number of claims recorded. Further analysis reveals that among the 11% policyholders who selected the lowest coverage (third party), 3.57% had at least one accident, among the 17% policyholders who selected third party fire and theft, 4.69% occurred at least one accidents, and among 72% policyholders who selected the highest coverage (comprehensive), 10.22% occurred at least one accidents. The above results already foreshadow some presence of moral hazard, i.e., it is more likely to have accidents for policyholders who purchase a higher coverage, though this positive correlation is largely due to the common exogenous variables. This coverage-occurrence relationship empirically revealed could also be explained by the existence of adverse selection, i.e., those who have more accidents purchase higher coverage because they believe themselves to be worse drivers. In fact, among 91% policyholders who did not have any accident, 11.67% purchased third party, 18.10% purchased third party fire and theft, and 70.23% purchased comprehensive. In contrast, among 9% policyholders who had accidents, 4.64% purchased third party, 9.55% purchased third party fire and theft, and 85.81% purchased comprehensive. The covariates (or explanatory variables) used in our study can be found in Table 2. Each of


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

8

Table 1: Frequency statistics of policy choice and claim counts Variables Value Frequency (%) Policy Choice 1 third party only 11.07 2 third party fire and theft 17.37 3 comprehensive (reference level) 71.56 Claim Count

0 1 2 3 4

91.48 7.79 0.66 0.07 0.00

these variables can be categorized as either a driver or a vehicle characteristic. Driver characteristics include the age of the insured, gender, marital status, years of driving experience, and the No Claims Discount (NCD) enjoyed by the insured. Vehicle characteristics include the age of the vehicle in years, the purpose or class (whether for private, business or other purposes), the size capacity, and the brand of the vehicle. Several of these variables are categorical. For categorical variables, the descriptive statistics shows the proportion of a value within each category. For variables with continuous values, we provide both the mean and its standard deviation. Only two of these variables (experience and vage) are continuous. Note that there are variables that could be treated as either categorical or continuous, for instance, a policyholder’s age. Both specifications are found in the literature. We selectively categorize some continuous variables to reduce the heterogeneity among policyholders and capture the potential non-linear effects. An important explanatory variable noteworthy is the NCD factor. In the empirical test of asymmetric information, it is crucial to control for the policyholder’s past driving history. As pointed out in Boyer and Dionne (1989), a policyholder’s past driving record affects his/her current propensity for accidents, and thus is a significant risk factor in the determination of pure premium. Because driving records are typically public information to insurance companies, failing to control for such heterogeneity could overestimate the level of asymmetric information by treating public information as private. The data set available to us contains an NCD that reflects past accidents. The no claims discount scheme is introduced to encourage good driving. An NCD increases 10% after a year with claims; the maximum NCD level is 50%. If the NCD is at 50% or 40%, a claim during one year will reduce it to 30% or 20%, respectively. Two or more claims in one year can cause a complete lost of the policyholder’s current level of NCD. If the NCD is at 30% or lower levels, the entire discount is forfeited in case of accidents. Because clearly the NCD reveals the policyholder’s past driving records, we use it as a proxy to measure accident history. According to these descriptive statistics, we may conclude that a typical observation in our data would be one who is middle aged (between 36 and 45) and a male driver who is married with roughly 12 years of driving experience and zero claims discount. The typical vehicle driven in our portfolio is roughly 8 years old, a private car with medium capacity whose brand is either a Toyota or something else other than those listed in the table.


= 1 if the policyholder is female, 0 if male = 1 if the policyholder is married, 0 if single length of driving experience of the insured (in years) No Claims Discount enjoyed by the insured = 1 if 0 percent = 2 if 10 percent = 3 if 20 percent = 4 if 30 percent = 5 if 40 percent = 6 if 50 percent (reference level)

sexinsured

marital

experience

NCD

the age of the insured vehicle (in years) = 1 if the vehicle is a private car = 2 if the vehicle is a goods vehicle = 3 if others (reference level) = 1 if petty cars (cubic capacity is less than 1000 or tonnage is less than 1) = 2 if small cars (cubic capacity is between 1000 and 1500 or tonnage is between 1 and 2) = 3 if medium cars (cubic capacity is between 1500 and 2000 or tonnage is between 2 and 3) = 4 if large cars (cubic capacity is greater than 2000 or tonnage is greater than 3) (reference level) = 1 if Toyota = 2 if Honda = 3 if Nissan = 4 if Misubishi = 5 if Mazada = 6 if other Japanese car = 7 if Korean car = 8 if European car = 9 if others (reference level)

vage

vehicleclass

capacityclass

brandclass

Vehicle characteristics

= 1 if insured age is less than 25 = 2 if insured age is between 26 and 35 = 3 if insured age is between 36 and 45 = 4 if insured age is between 46 and 55 = 5 if insured age is between 56 and 65 = 6 if insured age is greater than 65 (reference level)

ageclass

18.91% 13.77% 16.88% 10.34% 4.71% 4.10% 6.87% 19.92%

11.27% 33.25% 48.96%

86.29% 13.13%

7.55

31.94% 14.89% 11.17% 7.20% 6.23%

11.96

83.90%

17.23%

2.97% 31.90% 35.33% 21.89% 6.61%

Mean

Table 2: Descriptive statistics of the explanatory variables used in the model calibration

Variable Value/Description Driver characteristics

6.48

8.13

StdDev

A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling 9


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

10

3.2 Estimation results and discussion The resulting (maximum likelihood) estimates for the copula model described in Section 2 are presented in Table 3. Here in this table, we separate the effects of the marginals, with Choice referring to the coverage selection and Risk referring to the number of claims in the year. As shown on the table, the resulting effects of most of the covariates on both Choice and Risk are rather intuitive. To illustrate, the insured age shows a nonlinear effect on coverage choice. Generally, we believe that the younger driver is typically less risk averse and thus tends to more likely choose a lesser coverage. However, the driving experience typically accumulates as drivers age. This might additionally explain the nonlinear relationship between the age and policy choice. With respect to the effect of age on the level of risk, we observe a statistically significant effect of the driver’s age on the number of accidents. Expectedly, we also observe some nonlinear pattern in the relationship. Generally, younger drivers tend to have more accidents than older ones. On the other hand, despite several more years of driving experience, the older drivers tend to have slightly more accidents when compared to medium age drivers. This phenomenon is not surprising at all with older drivers generally believed to be more frail and more incapacitated when it comes to receptiveness during accidents. Although the effects are not statistically significant according to our estimates, the female and married drivers tend to have preference for better (that is, more comprehensive) coverage. At the same time, both groups of drivers tend to have fewer accidents and are therefore considered safer drivers. This could possibly be well explained by the higher degree of risk aversion for female and married individuals as many statistical studies indicate. In examining the effect of vehicle characteristics, we find that the age of the vehicle exhibits significant effect on both policy choice and accident occurrence. As a vehicle depreciates over time, a driver have the tendency to purchase less coverage. This phenomenon may well even be better explained in Singapore. A major component of the price or value of a vehicle in Singapore is what is referred to as the Certificate of Entitlement (COE) which as the name implies, entitles the driver to purchase a vehicle. This entitlement is limited to about 10 years and hence its value even more rapidly depreciates over time. On the other hand, the effect of vehicle age on the number of accidents appears to be nonlinear. On the average, we observe a small negative effect. One possible explanation could be that an older vehicle is owned by an older and possibly more experienced driver. The coefficient of the driver’s experience in the cumulative logit model is -0.0019, indicating then that policyholders with more years of driving experience tend to buy less coverage for their vehicles. This observation confirms the nonlinear effect of the insured’s age as earlier. Intuitively, the more experienced driver is less likely to have accidents. Another explanatory variable that is worth making an observation is the NCD factor. First, there is a significant effect of NCD on policy choice in the sense that a policyholder with a high NCD tends to purchase better insurance coverage on its vehicle. This is because for the same level of discount, the policyholder has the potential to even save more on a more expensive comprehensive coverage. On the other hand, the NCD is a good indicator of a policyholder’s accident history and it reflects the propensity for accidents and thus the policyholder’s risk type. Consistently, a driver with a lower NCD tends to have more accidents.


CHOICE INT1 CHOICE INT2 CHOICE AGECLASS1 CHOICE AGECLASS2 CHOICE AGECLASS3 CHOICE AGECLASS4 CHOICE AGECLASS5 CHOICE SEXINSUREDF CHOICE MARITALM CHOICE VAGE CHOICE VEHICLECLASS1 CHOICE VEHICLECLASS2 CHOICE CAPACITYCLASS1 CHOICE CAPACITYCLASS2 CHOICE CAPACITYCLASS3 CHOICE BRANDCLASS1 CHOICE BRANDCLASS2 CHOICE BRANDCLASS3 CHOICE BRANDCLASS4 CHOICE BRANDCLASS5 CHOICE BRANDCLASS6 CHOICE BRANDCLASS7 CHOICE BRANDCLASS8 CHOICE EXPERIENCE CHOICE NCD0 CHOICE NCD10 CHOICE NCD20 CHOICE NCD30 CHOICE NCD40 DEPENDENCE Log-likelihood AIC Spearmans Rho Chi-square test

Choice - Cumulative Logit Estimate StdErr p-value -4.1937 0.2293 0.0000 -0.8475 0.2269 0.0002 0.6671 0.1510 0.0000 0.9527 0.1309 0.0000 0.8321 0.1270 0.0000 0.6806 0.1269 0.0000 0.4300 0.1342 0.0014 0.0523 0.0433 0.2268 0.0484 0.0456 0.2895 -0.4732 0.0041 0.0000 3.6921 0.1730 0.0000 3.0326 0.1757 0.0000 0.3666 0.0769 0.0000 0.1792 0.0686 0.0089 0.5342 0.0665 0.0000 -0.0980 0.0834 0.2400 0.0475 0.0850 0.5757 0.1401 0.0867 0.1063 -0.1286 0.0916 0.1604 -0.1348 0.1080 0.2123 -0.3339 0.1021 0.0011 0.2131 0.1146 0.0629 0.3527 0.0874 0.0000 -0.0019 0.0022 0.3852 -0.6256 0.0452 0.0000 -0.5822 0.0522 0.0000 -0.3925 0.0570 0.0000 -0.0686 0.0688 0.3191 0.0568 0.0770 0.4609 1.4457 0.1437 0.0000 -29026.14 58170.28 0.23 108.3 RISK INT RISK AGECLASS1 RISK AGECLASS2 RISK AGECLASS3 RISK AGECLASS4 RISK AGECLASS5 RISK SEXINSUREDF RISK MARITALM RISK VAGE RISK VEHICLECLASS1 RISK VEHICLECLASS2 RISK CAPACITYCLASS1 RISK CAPACITYCLASS2 RISK CAPACITYCLASS3 RISK BRANDCLASS1 RISK BRANDCLASS2 RISK BRANDCLASS3 RISK BRANDCLASS4 RISK BRANDCLASS5 RISK BRANDCLASS6 RISK BRANDCLASS7 RISK BRANDCLASS8 RISK EXPERIENCE RISK NCD0 RISK NCD10 RISK NCD20 RISK NCD30 RISK NCD40 DISPERSION

Table 3: Resulting estimates of the copula model Risk - Negative Binomial Estimate StdErr p-value -3.5866 0.4701 0.0000 0.7033 0.1999 0.0004 0.4080 0.1803 0.0236 0.2059 0.1773 0.2456 0.2449 0.1775 0.1677 0.2963 0.1850 0.1092 -0.0665 0.0428 0.1204 -0.0462 0.0466 0.3222 -0.0520 0.0031 0.0000 1.3981 0.4254 0.0010 1.3491 0.4269 0.0016 -0.1069 0.0914 0.2423 0.0771 0.0750 0.3039 0.1662 0.0717 0.0205 -0.0284 0.0917 0.7568 0.2164 0.0922 0.0189 0.0210 0.0902 0.8163 0.0877 0.0942 0.3518 0.0329 0.1079 0.7605 -0.1263 0.1251 0.3127 0.0545 0.0995 0.5840 0.1191 0.0912 0.1916 -0.0056 0.0025 0.0247 0.3734 0.0484 0.0000 0.2984 0.0554 0.0000 0.1485 0.0609 0.0147 0.1972 0.0675 0.0035 0.1889 0.0711 0.0079 2.0422 0.3417 0.0000

A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling 11


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

12

3.3 Quality of fit tests Goodness-of-fit tests are performed for the marginals as well as for the copula. For marginal distributions, we exhibit the observed and predicted frequencies for both the policy choice and the number of accidents in Table 4. Because of the possible heterogeneity among subjects, the mass probabilities for both discrete random variables are different across policyholders. We calculate the probabilities at yi1 = 1, 2, 3 and at yi2 = 0, 1, 2, 3, 4 for each policyholder and present the average probabilities over all policyholders in the table. The consistency between the actual and predicted frequencies suggests very satisfactory fit for both marginals.

Value 1 2 3

Table 4: Goodness-of-fit tests of the marginals Choice Risk Observed Predicted Value Observed Predicted 11.07% 10.72% 0 91.48% 91.49% 17.37% 16.61% 1 7.79% 7.77% 71.56% 72.67% 2 0.66% 0.68% 3 0.07% 0.06% 4 0.00% 0.01%

According to Table 3, the estimate for the dependence parameter in the Frank copula is roughly 1.5, which translates to a Spearman’s rho coefficient of roughly 23%. This provides an evidence of the positive association between the policy choice and level of risk of the policyholder. In effect, we find that drivers with better coverage tend to be more prone to accidents. As already alluded in the previous section, the Frank copula allows for both positive and negative association. This advantage also makes the goodness-of-fit test rather straightforward because the statistical test of the dependence parameter avoids the problem often encountered with boundary hypothesis. A t-statistic of 10.06 strongly supports that the dependence parameter is statistically significantly different from zero, the case of independence, implying a significant positive association between the two variables. In addition, one could perform a likelihood ratio test to examine the quality of fit of the Frank copula. Under the standard set-up, the chi-square statistics follows a χ2 distribution which in this case has one degree of freedom. The observed value of the Chi-square statistic and its corresponding p-value are presented at the bottom of Table 3. In agreement with the t-statistic, we find that after controlling for the effects of the information on risk factors (some driver and vehicle characteristics), a strong evidence of private asymmetric information (the positive relationship between policy choice and risk) can be observed. A final examination of the copula model is a robustness test. The prior knowledge about the coverage-occurrence relationship suggests either a positive or a negative relationship between the policy choice and number of claims. Because of such a priori information, our initial instinct suggests to fit a model based on the Frank copula which allows for this flexibility. Due to the non-linear optimization required in the evaluation of the likelihood function, we are interested in the robustness of the estimation results by examining other copula specifications. In doing so, we re-calibrated the copula model under two other customarily used Archimedean-type copulas, the Gumbel copula and the Clayton copula with the following specifications: i h (7) Gumbel: C(u1 , u2 ; θ) = exp −((− log u1 )θ + (− log u2 )θ )1/θ , θ ≥ 1 −1/θ −θ Clayton: C(u1 , u2 ; θ) = u−θ + u − 1 , θ>0 1 2

(8)


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

13

where θ represents the dependence parameter in both copulas. Note that the Gumbel and Clayton copulas could not accommodate negative association. The estimated positive relationship that we already observed based on the Frank copula between policy choice and risk in Section 3 suggests both copulas are eligible to test for possible robustness. It is easy to see that we have the case of independence when θ = 1 in the case of the Gumbel copula, and when θ = 0 in the case of the Clayton copula. The parameter estimates for the Gumbel and Clayton copula models are displayed in Appendix A.1. Compared with Table 3, we find the coefficient estimates and p-values for all covariates to be in agreement with the Frank copula model. The dependence parameter of the Gumbel copula is 1.13 that translates to a Spearman’s rho of 0.17. The dependence parameter of the Clayton copula is 0.32 that corresponds to a Spearman’s rho of 0.21. Both models suggest a positive association between the policy selection and risk level of the policyholder. Although the t-test for the dependence parameter in both models involve boundary hypothesis, the small standard errors suggest that the dependence parameter of the Clayton copula is significantly greater than 0, and the dependence parameter of the Gumbel copula is significantly greater than 1. The statistical significance is further confirmed by a likelihood ratio test, with a Chi-square statistic of 94.74 and 110.02 for the Clayton and Clayton copula model, respectively.

4 Applications to predictive modeling Based on the empirical data on automobile insurance from a Singapore insurer, we calibrated and analyzed the proposed copula model, and found a strong evidence of a significant positive association between the policy choice and the level of risk of the policyholder as measured by the number of times a claim has been made during a year. This positive association implies the existence of some type of asymmetric information, although our findings are unable to distinguish between adverse selection and moral hazard. A possible application of the copula model constructed in this study is to further examine this asymmetric information in the automobile insurance market. However, as pointed out in Section 1, a longitudinal data is needed to separate the effects of the two types of asymmetric information, which is outside the scope of this study. Instead, we show that our copula model could also be used in several predictive modeling contexts. Two actuarial applications, calculating accident probabilities and pure premiums, are presented in this section.

4.1 Accident probability According to the copula model, we can additionally develop a predictive model that will analyze the effect of a policyholder’s choice of coverage on the number of accidents. More specifically, consider the probability distribution of Yi2 , conditional on Yi1 , which can easily be derived by applying Bayes’ rule: Prob(Yi2 = yi2 |Yi1 = yi1 ) = fi2|1 (yi2 |yi1 , x, z) =

fi (yi1 , yi2 |x, z) . fi1 (yi1 |x)

(9)

This equation allows one to predict the likelihood of the number of claims, given the policy choice. On the right hand side of equation (9), we can calculate the probability in the numerator according to the copula expression discussed in Section 2 and the probability in the denominator based on the marginal distribution of Yi1 . Using our entire dataset, this conditional probability is computed for each policyholder and then aggregated, given each type of coverage considered in this study. The


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

14

results of these calculations are summarized in Table 5. Due to the heterogeneity across the policyholders in the portfolio, we choose to present the mean and standard deviation of the conditional probability by the type of coverage choice. According to the table, the largest posterior accident probability is observed for the third party coverage and the lowest is observed for the comprehensive coverage. Table 5: Conditional probability of the number of accidents, given the policy coverage Number of Third Party Only Fire and Theft Comprehensive Accidents Probability StdDev Probability StdDev Probability StdDev 0 0.9564 0.0254 0.9478 0.0301 0.9035 0.0519 1 0.0401 0.0221 0.0482 0.0268 0.0883 0.0446 2 0.0032 0.0031 0.0037 0.0033 0.0075 0.0070 3+ 0.0003 0.0005 0.0003 0.0005 0.0007 0.0011 Additionally, using formula (9), we are able to derive the conditional expected number of claims E(Yi2 |Yi1 ) for each policyholder. To examine the effect of policy selection on the number of accidents, we perform a pairwise comparison for the three types of coverage. Specifically, we examine the quantities of the difference in expectations: E(Yi2 |Yi1 = 2) − E(Yi2 |Yi1 = 1) and E(Yi2 |Yi1 = 3) − E(Yi2 |Yi1 = 2). Each difference in expectations provides a measure of the average difference in the number of claims between a particular level of coverage and another coverage that is considered worse. Descriptive statistics of these two quantities are summarized in Table 6. Consistently, the expected number of accidents increases with better insurance coverage. Table 6: The effect of coverage selection on the number of accidents Mean StdDev 1st Quartile 2nd Quartile 3rd Quartile E(Yi2 |Yi1 = 2) − E(Yi2 |Yi1 = 1) 0.009 0.015 0.000 0.002 0.011 E(Yi2 |Yi1 = 3) − E(Yi2 |Yi1 = 2) 0.049 0.031 0.022 0.049 0.071 Similarly, one could investigate the effects of covariates on the accident probability without and with the information on the coverage choice for each single policyholder. Using the covariate NCD as an example, we are able to compare a prior and a posterior accident probability for a given NCD level. To be more specific, we will examine the following two probabilities: Prior: Prob(Yi2 = yi2 ) = fi2 (yi2 |yi1 , z, NCD = k)

(10)

Posterior: Prob(Yi2 = yi2 |Yi1 = yi1 ) = fi2|1 (yi2 |yi1 , x, z, NCD = k)

(11)

To illustrate, we consider a 30-year old female policyholder who drives a 5-year small car made by Toyota. The driver has 5 years driving experience and enjoys a no claim discount of 20% currently. We calculate the prior and posterior accident probabilities for this particular policyholder. The prior accident probability is calculated without the knowledge of the coverage selected by the policyholder. The posterior accident probability is the one incorporating the information of coverage choice. In principle, the probabilities (10) and (11) could be derived for all covariates. Here, we choose to demonstrate the effects only of some important explanatory variables to illustrate predictive modeling capabilities of our copula approach. Specifically, we examine the effects of age class, gender, marital status, and the level of NCD. For each of these covariates, the accident probabilities


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

0 1 2 3+ 0 1 2 3+ 0 1 2 3+ 0 1 2 3+ 0 1 2 3+ 0 1 2 3+

15

Table 7: Accident probability for different age classes Prior Posterior Third Party Only Fire and Theft Comprehensive Age ≤ 25 0.8008 0.8970 0.8930 0.7951 0.1686 0.0890 0.0924 0.1733 0.0264 0.0121 0.0126 0.0273 0.0042 0.0019 0.0020 0.0044 25<Age ≤ 35 0.8457 0.9229 0.9206 0.8422 0.1361 0.0689 0.0709 0.1391 0.0163 0.0074 0.0076 0.0167 0.0019 0.0009 0.0009 0.0020 36<Age ≤ 45 0.8712 0.9368 0.9347 0.8678 0.1162 0.0575 0.0594 0.1192 0.0115 0.0052 0.0054 0.0119 0.0011 0.0005 0.0005 0.0011 46<Age ≤ 55 0.8666 0.9343 0.9317 0.8626 0.1199 0.0596 0.0619 0.1234 0.0123 0.0056 0.0058 0.0127 0.0012 0.0006 0.0006 0.0013 56<Age ≤ 65 0.8603 0.9309 0.9274 0.8550 0.1248 0.0624 0.0655 0.1295 0.0135 0.0061 0.0064 0.0140 0.0014 0.0006 0.0007 0.0015 Age > 65 0.8932 0.9484 0.9444 0.8870 0.0982 0.0477 0.0514 0.1038 0.0080 0.0036 0.0039 0.0085 0.0006 0.0003 0.0003 0.0007

are calculated for each hypothetical category holding the other covariates equal to the observed value. The comparison results are displayed in Tables 7, 8, and 9, respectively. The same observation about the effect of coverage choice on the accident probability can be said from the results summarized in these tables. The prior probability is the same for three types of coverage, because such information is not incorporated into the calculation of the accident probability. After observing the choice of the policyholder, the accident probability is updated based on the coverage selection. The probability of no accident is the largest for the highest coverage (comprehensive) and the smallest for the lowest coverage (third party). The prior probability is somewhere in between. As discussed before, this could be explained by the presence of additional private information held only by the policyholder. A policyholder may decide to choose a high coverage because she believes she is a risky driver, or that she drives less carefully after purchasing a high coverage policy. The effects of these covariates on the posterior probability appear to be intuitively


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

0 1 2 3+

0 1 2 3+

16

Table 8: Accident probability for gender and marital status Gender Prior Posterior Third Party Fire and Theft Comprehensive Female Male Female Male Female Male Female Male 0.8457 0.8364 0.9229 0.9177 0.9206 0.9151 0.8422 0.8326 0.1361 0.1431 0.0689 0.0730 0.0709 0.0753 0.1391 0.1464 0.0163 0.0182 0.0074 0.0083 0.0076 0.0085 0.0167 0.0187 0.0019 0.0023 0.0009 0.0010 0.0009 0.0011 0.0020 0.0024 Marital Status Prior Posterior y1=1 y1=2 y1=3 Single Married Single Married Single Married Single Married 0.8393 0.8457 0.9193 0.9229 0.9168 0.9206 0.8355 0.8422 0.1409 0.1361 0.0717 0.0689 0.0740 0.0709 0.1442 0.1391 0.0176 0.0163 0.0080 0.0074 0.0083 0.0076 0.0181 0.0167 0.0022 0.0019 0.0010 0.0009 0.0010 0.0009 0.0022 0.0020

predictable. The effect of age on accident probability is non-linear; see, for example, Frees et al. (2009). The accident probability is the highest for young drivers (age ≤ 25), and then decreases within older age groups. Probability of accidents slightly rises for the old drivers (age > 65). Fixing other factors equal, a male driver is more likely to incur an accident than a female driver, and a married policyholder appears to drive more carefully than a single driver. The effect of NCD is somehow non-linear. On one hand, the NCD represents the past driving record and thus is an indicator of the risk level of the policyholder. On the other hand, a high NCD might give the policyholder economic incentive to purchase higher coverage which might result in less than careful driving.

4.2 Pure Premium An additional important actuarial application of our copula model is to calculate the posterior pure premium. In this section, we show how an actuary can evaluate the pure premium by incorporating the information on coverage choice after observing the accidents for the year. In automobile insurance, a common approach for price determination is the two-part model, where the accident probability (typically called claims frequency) and the amount of claims (typically called severity) are modeled separately. The so-called pure premium is then set to be the expected loss amount that is equal to the product of the expected number of accidents and the average claims for each accident, assuming the number of accidents and the amount of losses are independent. The independence assumption between frequency and severity is commonplace in the literature and in practice. As demonstrated in the previous section, the proposed copula model can be used to calculate the accident probability. To derive the pure premium, one needs to fit a model for the severity component. For simplicity, we focus on the average claims during the observation year. To accommodate the typical long-tail nature of insurance claims data, a Generalized Beta distribution of the second kind (GB2) regression model is employed to model the average claims for each of the three types of coverage. The GB2 is a four-parameter distributional family with density function: fGB2 (c; µ, σ, p, q) =

exp (pz) , c>0 c|σ|B(p, q)[1 + exp(z)]p+q

(12)


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

0 1 2 3+ 0 1 2 3+ 0 1 2 3+ 0 1 2 3+ 0 1 2 3+ 0 1 2 3+

17

Table 9: Accident probability for different NCDs Prior Posterior Third Party Only Fire and Theft Comprehensive NCD=0 0.8352 0.9170 0.9144 0.8314 0.1440 0.0736 0.0758 0.1473 0.0185 0.0084 0.0087 0.0190 0.0024 0.0011 0.0011 0.0024 NCD=10 0.8457 0.9229 0.9206 0.8422 0.1361 0.0689 0.0709 0.1391 0.0163 0.0074 0.0076 0.0167 0.0019 0.0009 0.0009 0.0020 NCD=20 0.8650 0.9335 0.9318 0.8624 0.1211 0.0602 0.0617 0.1234 0.0126 0.0057 0.0059 0.0129 0.0013 0.0006 0.0006 0.0013 NCD=30 0.8589 0.9302 0.9290 0.8570 0.1259 0.0629 0.0641 0.1276 0.0137 0.0062 0.0063 0.0139 0.0015 0.0007 0.0007 0.0015 NCD=40 0.8600 0.9308 0.9297 0.8583 0.1250 0.0624 0.0634 0.1266 0.0135 0.0061 0.0062 0.0137 0.0014 0.0006 0.0006 0.0014 NCD=50 0.8819 0.9426 0.9416 0.8804 0.1075 0.0526 0.0536 0.1089 0.0098 0.0044 0.0045 0.0099 0.0008 0.0004 0.0004 0.0009

where z = (ln c − µ)/σ and B(p, q) is the usual Euler beta function. The GB2 is a member of the log location-scale family with the parameters µ as the location parameter, σ as the scale parameter, and p and q as shape parameters. The GB2 model is believed to be first introduced in a seminal work by McDonald (1984) to understand income distribution. Additional applications have further been found in the economics literature, including McDonald (1987) and McDonald and Butler (1987), and in the insurance literature such as Cummins et al. (1990). With four parameters, the GB2 provides great flexibility to model heavy-tailed and skewed data. Many long-tailed distributions are nested as special or limiting cases of the GB2 distribution (see McDonald and Xu (1995)). Despite the flexibility of GB2 in modeling heavy-tailed data, its application in a regression context has rarely been found in the literature. McDonald and Butler (1990) first employed the GB2 regression to investigate the duration of welfare spells. More recently in the actuarial literature, Sun et al. (2008) applied the GB2 in the


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

18

prediction of nursing home utilization with longitudinal data. Additionally, Frees and Valdez (2008) and Frees et al. (2009) used the GB2 to capture the long-tail nature of automobile insurance data in a hierarchical insurance claims model. Following Sun et al. (2008), we reparameterize the location parameter µ as a linear function of the covariates expressed as ′

µi = li η. Here, li represents the vector of explanatory variables for the ith policyholder while the vector η denotes the corresponding coefficients to be estimated. In our context, a GB2 regression model is calibrated for the average claim amount for third party, third party fire and theft, and comprehensive policies, respectively. Hence we have in total three sets of parameters (η (j) , σ (j) , p(j) , q (j) ) to be estimated, where j = 1, 2, 3 indicates the three different types of coverage. The likelihood-based estimation method is used to estimate the three sets of parameters, and the estimation results are summarized in Appendix A.2 for convenience. To demonstrate the quality of the fit of the models, we exhibit the qq-plots of the resulting residuals from the three GB2 regression models in Figure 1. Here we define the residual for the ith policyholder in the GB2 regression model as ri = exp(ln ci − li η)/σ. Straightforward calculations show that ri follows i.i.d from GB2(·; 1, 1, p, q). These qq-plots strongly support the adequacy of the GB2 regression models for all three types of coverage. Fire and Theft

Comprehensive

−20

−10

0

10

−2.5 −3.0 −3.5 −4.5 −5.0

−10 −30

−4.0

Theoretical Quantile of GB2

−2 −4 −6 −8

Theoretical Quantile of GB2

0 −5 −10 −15 −20

Theoretical Quantile of GB2

5

0

−2.0

10

Third Party

−10

Empirical Quantile

−8

−6

−4

−2

0

Empirical Quantile

−5.0

−4.0

−3.0

−2.0

Empirical Quantile

Figure 1: GB2 qq-plots of regression residuals. Combining the frequency part and the severity part, one could then derive the posterior pure premium for the ith policyholder given his/her coverage choice yi1 using Pure Premiumi =E(Yi2 |Yi1 = yi1 ) × E(Ci |Yi1 = yi1 ) =

+∞ X

yi2 fi2|1 (y2i |y1i , x, z) × exp(li η (yi1 ) )B(p(yi1 ) + σ (yi1 ) , q (yi1 ) − σ (yi1 ) )/B(p(yi1 ) , q (yi1 ) ) (13)

yi2 =0

where B(p, q) = Γ(p)Γ(q)/Γ(p + q) and Γ(·) is the Gamma function.


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

19

Using the above formula (13), the posterior pure premium is calculated for each policyholder in our sample. It is interesting to compare the posterior premiums with the (a priori) gross premiums available in the original data set. Table 10 displays the correlation coefficients between the actual gross premium and the estimated (posterior) premium for each type of coverage. The strong positive correlation suggests that the posterior calculation is in line with the premium actually charged for each policyholder. In addition, we summarize descriptive statistics of the actual and posterior premiums for each type of coverage as shown in Table 11. The positive margin as indicated by the difference between the two premiums is anticipated because the gross premium actually charged to the policyholder typically includes margins to cover expenses, profits, taxes and other possible contingencies. This positive margin also indicates the adequacy of the premium assessed in the face of asymmetric information. Table 10: Correlation between actual and posterior premiums Correlation p-value Third Party 0.58282 <0.0001 Third Party Fire and Theft 0.62215 <0.0001 Comprehensive 0.80632 <0.0001

Table 11: Comparison of actual and posterior premiums Third Party Fire and Theft Comprehensive Mean StdDev Mean StdDev Mean StdDev Actual 204.29 197.87 272.21 234.17 492.22 442.15 Posterior 129.54 106.76 208.69 147.40 420.80 263.08

5 Concluding remarks This paper is primarily motivated to extend the literature in examining and identifying the presence of asymmetric information in the automobile insurance market. We propose a bivariate copula regression method to jointly examine the policyholder’s coverage choice and the level of risk. This allows us to examine information asymmetry controlling for the presence of policyholder and vehicle characteristics important to incorporate the heterogeneity of the policyholders in an insurance portfolio. To test for the presence of asymmetric information, the policyholder’s coverage selection was measured by an ordinal categorical variable and the degree of risk was approximated by an ex post risk measure, the number of times a policyholder has claimed in a calendar year. Existing methods have difficulty in modeling the two discrete variables simultaneously and thus are subject to the endogeneity criticisms often made with models to test information asymmetry. Our proposed copula model renders flexibility to partially address these criticisms while at the same time, captures both linear and non-linear relationships between the policyholder’s coverage choice and his/her level of risk, something often missing in the literature on information asymmetry. To calibrate the copula model, we used a cross-sectional empirical observation of an insurance portfolio from a major automobile insurer in Singapore. After controlling for the risk factors (policyholder and vehicle characteristics) observed by the insurer, we found evidence of a strong positive coverage-occurrence association, which suggests the possible existence of private information by the policyholders. Beyond testing the asymmetric information, our copula model found extensive applications in


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

20

the predictive modeling context. To demonstrate, we showed the calculation of posterior accident probability and pure premium based on the results of our model estimates. The results re-affirmed the effect of coverage choice on the likelihood of incurring accidents. The availability of the gross premium in the original data base allowed us to compare the posterior calculation with the a priori. We observed a positive margin that can be attributed to the covering of expenses, profits, taxes, among other things. A limitation of our analysis is that we focused on the policyholder’s behavior over a crosssectional set of observation. A repeated observation over time that is believed to be more informative in understanding the policyholder’s behavior is not given attention in this article. Thus, the evidence of asymmetric information is limited to a combined effect of adverse selection and moral hazard. As discussed in the text, the availability of longitudinal data provides the possibility of distinguishing the two types of private information. We leave this as one direction of future research we intend to pursue. The primary focus of this paper is on the ingenuity of the copula construction to examine asymmetric information, and ancillary to the model construction is the set of additional extensive applications in predictive modeling.

References Jaap H. Abbring, James J. Heckman, Pierre-Andr´ e Chiappori, and Jean Pinquet. Adverse selection and moral hazard in insurance: can dynamic data help to distinguish? Journal of the European Economic Association, 1(2-3):512–521, 2003. George A. Akerlof. The market for ‘lemons’: quality uncertainty and the market mechanism. The Quarterly Journal of Economics, 84(3):488–500, 1970. Marcel Boyer and Georges Dionne. An empirical analysis of moral hazard and experience rating. Review of Economics and Statistics, 71(1):128–134, 1989. A. Colin Cameron and Pravin K. Trivedi. Econometric models based on count data: comparisons and applications of some estimators. Journal of Applied Econometrics, 1(1):29–53, 1986. Pierre-Andr´ e Chiappori and Bernard Salani´ e. Testing for asymmetric information in insurance market. Journal of Political Economy, 108(1):56–78, 2000. Pierre-Andr´ e Chiappori, Bruno Jullien, Bernard Salani´ e, and Franc¸ois Salani´ e. Asymmetric information in insurance: general testable implications. RAND Journal of Economics, 37(4):783–798, 2006. Alma Cohen. Asymmetric information and learning: evidence from the automobile insurance market. The Review of Economics and Statistics, 87(2):197–207, 2005. J. David Cummins, Georges Dionne, James B. McDonald, and B. Michael Pritchett. Applications of the gb2 family of distributions in the modeling insurance loss processes. Insurance: Mathematics and Economics, 9(4):257–272, 1990. David de Meza and David C. Webb. Advantageous selection in insurance markets. RAND Journal of Economics, 32(2):249–262, 2001. Georges Dionne, Christian Gouri´eroux, and Charles Vanasse. Testing for evidence of adverse selection in the automobile insurance market: a comment. Journal of Political Economy, 109(2): 444–453, 2001.


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

21

Edward W. Frees and Emiliano A. Valdez. Hierarchical insurance claims modeling. Journal of the American Statistical Association, 103(484):1457–1469, 2008. Edward W. Frees, Peng Shi, and Emiliano A. Valdez. Actuarial applications of a hierarchical claims model. ASTIN Bulletin, 39(1):165–197, 2009. Christian Genest. Frank’s family of bivariate distributions. Biometrika, 74(3):549–555, 1987. Christian Genest and Johanna Neˇslehov´ a. A primer on copulas for count data. ASTIN Bulletin, 37 (2):475–515, 2007. Hyojoung Kim, Doyoung Kim, Subin Im, and James W. Hardin. Evidence of asymmetric information in the automobile insurance market: dichotomous versus multinomial measurement of insurance coverage. The Journal of Risk and Insurance, 76(2):343–366, 2009. James B. McDonald. Some generalized functions for the size distribution of income. Econometrica, 52(3):647–663, 1984. James B. McDonald. Model selection: some generalized distributions. Communications in Statistics, 16(4):1049–1074, 1987. James B. McDonald and Richard J. Butler. Some generalized mixture distributions with an application to unemployment duration. Review of Economics and Statistics, 69(2):232–240, 1987. James B. McDonald and Richard J. Butler. Regression models for positive random variables. Journal of Econometrics, 43(1-2):227–251, 1990. James B. McDonald and Yexiao J. Xu. A generalization of the beta distribution with applications. Journal of Econometrics, 66(1-2):133–152, 1995. James E. Prieger. A flexible parametric selection model for non-normal data with application to health care usage. Journal of Applied Econometrics, 17(4):367–392, 2002. Robert Puelz and Arthur Snow. Evidence on adverse selection: equilibrium signaling and crosssubsidization in the insurance market. Journal of Political Economy, 102(2):236–257, 1994. Didier Richaudeau. Automobile insurance contracts and risk of accident: an empirical test using french individual data. The Geneva Papers on Risk and Insurance - Theory, 24(1):97–114, 1999. Michael D. Rothschild and Joseph E. Stiglitz. Equilibrium in competitive insurance markets: the economics of markets with imperfect information. The Quarterly Journal of Economics, 90(4): 629–650, 1976. Kuniyoshi Saito. Testing for asymmetric information in the automobile insurance market under rate regulation. The Journal of Risk and Insurance, 73(2):335–356, 2006. Murray D. Smith. Modelling sample selection using archimedean copulas. The Econometrics Journal, 6(1):99–123, 2003. Jiafeng Sun, Edward W. Frees, and Marjorie A. Rosenberg. Heavy-tailed longitudinal data modeling using copulas. Insurance Mathematics and Economics, 42(2):817–830, 2008. David M. Zimmer and Pravin K. Trivedi. Using trivariate copulas to model sample selection and treatment effects. Journal of Business and Economic Statistics, 24(1):63–76, 2006.


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

Appendix: Additional calibration results A.1. Estimates of the bivariate model using alternative copulas

CHOICE INT1 CHOICE INT2 CHOICE AGECLASS1 CHOICE AGECLASS2 CHOICE AGECLASS3 CHOICE AGECLASS4 CHOICE AGECLASS5 CHOICE SEXINSUREDF CHOICE MARITALM CHOICE VAGE CHOICE VEHICLECLASS1 CHOICE VEHICLECLASS2 CHOICE CAPACITYCLASS1 CHOICE CAPACITYCLASS2 CHOICE CAPACITYCLASS3 CHOICE BRANDCLASS1 CHOICE BRANDCLASS2 CHOICE BRANDCLASS3 CHOICE BRANDCLASS4 CHOICE BRANDCLASS5 CHOICE BRANDCLASS6 CHOICE BRANDCLASS7 CHOICE BRANDCLASS8 CHOICE EXPERIENCE CHOICE NCD0 CHOICE NCD10 CHOICE NCD20 CHOICE NCD30 CHOICE NCD40 RISK INT RISK AGECLASS1 RISK AGECLASS2 RISK AGECLASS3 RISK AGECLASS4 RISK AGECLASS5 RISK SEXINSUREDF RISK MARITALM RISK VAGE RISK VEHICLECLASS1 RISK VEHICLECLASS2 RISK CAPACITYCLASS1 RISK CAPACITYCLASS2 RISK CAPACITYCLASS3 RISK BRANDCLASS1 RISK BRANDCLASS2 RISK BRANDCLASS3 RISK BRANDCLASS4 RISK BRANDCLASS5 RISK BRANDCLASS6 RISK BRANDCLASS7 RISK BRANDCLASS8 RISK EXPERIENCE RISK NCD0 RISK NCD10 RISK NCD20 RISK NCD30 RISK NCD40 RISK DISPERSION DEPENDENCE

Estimate -4.2001 -0.8588 0.6587 0.9532 0.8314 0.6826 0.4247 0.0541 0.0479 -0.4735 3.6852 3.0209 0.3710 0.1829 0.5368 -0.0958 0.0476 0.1402 -0.1286 -0.1333 -0.3334 0.2114 0.3546 -0.0020 -0.6273 -0.5827 -0.3950 -0.0751 0.0528 -3.4658 0.6939 0.4002 0.2008 0.2428 0.2952 -0.0650 -0.0480 -0.0517 1.2856 1.2321 -0.1033 0.0754 0.1626 -0.0371 0.2008 0.0180 0.0825 0.0260 -0.1229 0.0521 0.1133 -0.0055 0.3779 0.2980 0.1505 0.1971 0.1897 2.0226 0.3243

Clayton StdError 0.2357 0.2333 0.1547 0.1341 0.1299 0.1297 0.1370 0.0433 0.0457 0.0041 0.1721 0.1747 0.0765 0.0683 0.0663 0.0840 0.0855 0.0873 0.0923 0.1085 0.1025 0.1146 0.0881 0.0022 0.0452 0.0522 0.0571 0.0689 0.0770 0.4286 0.1899 0.1701 0.1672 0.1675 0.1737 0.0429 0.0464 0.0031 0.4124 0.4149 0.0901 0.0745 0.0713 0.0923 0.0927 0.0908 0.0947 0.1084 0.1259 0.1001 0.0913 0.0025 0.0481 0.0554 0.0609 0.0676 0.0711 0.3323 0.0402

p-value 0.0000 0.0002 0.0000 0.0000 0.0000 0.0000 0.0019 0.2116 0.2946 0.0000 0.0000 0.0000 0.0000 0.0074 0.0000 0.2546 0.5781 0.1083 0.1634 0.2194 0.0011 0.0650 0.0001 0.3638 0.0000 0.0000 0.0000 0.2759 0.4927 0.0000 0.0003 0.0186 0.2299 0.1472 0.0893 0.1300 0.3016 0.0000 0.0018 0.0030 0.2514 0.3111 0.0227 0.6878 0.0304 0.8424 0.3837 0.8108 0.3292 0.6030 0.2147 0.0277 0.0000 0.0000 0.0135 0.0036 0.0076 0.0000 0.0000

Estimate -4.1606 -0.8200 0.6852 0.9712 0.8473 0.6981 0.4501 0.0530 0.0482 -0.4719 3.6974 3.0445 0.3583 0.1714 0.5267 -0.1017 0.0435 0.1360 -0.1306 -0.1405 -0.3371 0.2133 0.3466 -0.0018 -0.6206 -0.5800 -0.3901 -0.0679 0.0580 -4.0221 0.7244 0.4271 0.2265 0.2634 0.3116 -0.0647 -0.0430 -0.0479 1.8009 1.7646 -0.1183 0.0621 0.1538 -0.0279 0.2222 0.0287 0.0913 0.0425 -0.1443 0.0578 0.1182 -0.0054 0.3774 0.3015 0.1529 0.1995 0.1888 1.9656 1.1339

Gumbel StdError 0.2248 0.2223 0.1500 0.1299 0.1260 0.1259 0.1333 0.0432 0.0455 0.0041 0.1722 0.1749 0.0764 0.0681 0.0661 0.0821 0.0837 0.0855 0.0905 0.1068 0.1007 0.1133 0.0859 0.0022 0.0451 0.0521 0.0569 0.0687 0.0769 0.5086 0.1948 0.1754 0.1722 0.1722 0.1794 0.0426 0.0464 0.0031 0.4749 0.4767 0.0900 0.0739 0.0709 0.0903 0.0907 0.0889 0.0928 0.1065 0.1244 0.0984 0.0892 0.0025 0.0481 0.0552 0.0606 0.0672 0.0710 0.3113 0.0152

p-value 0.0000 0.0002 0.0000 0.0000 0.0000 0.0000 0.0007 0.2198 0.2900 0.0000 0.0000 0.0000 0.0000 0.0119 0.0000 0.2153 0.6031 0.1116 0.1488 0.1883 0.0008 0.0597 0.0001 0.4160 0.0000 0.0000 0.0000 0.3230 0.4506 0.0000 0.0002 0.0149 0.1886 0.1261 0.0823 0.1292 0.3538 0.0000 0.0001 0.0002 0.1886 0.4013 0.0300 0.7573 0.0143 0.7472 0.3255 0.6900 0.2459 0.5571 0.1854 0.0307 0.0000 0.0000 0.0116 0.0030 0.0078 0.0000 0.0000

22


A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

A.2. Model estimates of the GB2 regressions for the three types of coverage

INT AGECLASS1 AGECLASS2 AGECLASS3 AGECLASS4 AGECLASS5 SEXINSUREDF MARITALM VAGE VEHICLECLASS1 VEHICLECLASS2 CAPACITYCLASS1 CAPACITYCLASS2 CAPACITYCLASS3 BRANDCLASS1 BRANDCLASS2 BRANDCLASS3 BRANDCLASS4 BRANDCLASS5 BRANDCLASS6 BRANDCLASS7 BRANDCLASS8 EXPERIENCE NCD0 NCD10 NCD20 NCD30 NCD40 σ p q

Third Party Estimate StdError 4.67758 20.22671 0.08522 0.80147 0.44681 0.6633 0.18414 0.59053 0.40905 0.56369 -0.28227 0.53849 -0.13257 0.24567 -0.01563 0.28268 -0.00048 0.01643 2.71887 26.06902 2.40576 26.08081 -1.01233 0.3859 -0.6866 0.34172 -0.6488 0.33846 1.59989 8.6066 1.3008 8.61842 1.24008 8.60906 1.25225 8.59816 1.36969 8.61839 0.26607 8.61422 2.41992 8.63265 1.29999 8.59116 0.01878 0.0107 -0.45909 0.30802 0.05255 0.35219 -0.67075 0.32677 -0.98917 0.35421 -0.37407 0.42586 2.06419 1.02818 0.27252 0.29498 0.49532 0.63701

Fire and Theft Estimate StdError 5.04693 7.25633 0.30522 0.70918 0.36807 0.67441 -0.11574 0.6555 0.19679 0.64948 -0.02156 0.68233 0.15353 0.14397 -0.02837 0.15677 -0.0117 0.01479 3.17122 7.32979 3.30577 7.32863 0.41478 0.27417 0.276 0.22547 0.23297 0.2436 0.68757 0.4083 0.91015 0.40603 0.78001 0.40809 0.47597 0.42724 1.30815 0.44762 -0.46972 0.58085 0.65263 0.61413 1.08444 0.43066 0.01852 0.00917 0.0243 0.21019 -0.08583 0.2274 0.0245 0.26651 0.14669 0.33584 -0.04255 0.35143 1.17141 0.2744 0.79842 0.45819 6.1853 12.48712

Comprehensive Estimate StdError 16.6576 1.88469 -0.26675 0.23958 -0.28167 0.20914 -0.30778 0.2063 -0.14535 0.20668 -0.1464 0.21601 0.07577 0.04895 -0.06854 0.05506 -0.00886 0.00642 -0.19153 0.4419 -0.31469 0.44478 -0.146 0.11073 -0.09755 0.08905 -0.01115 0.08472 -0.18176 0.10178 -0.04009 0.10291 -0.13743 0.09939 -0.00862 0.10427 -0.01205 0.1224 -0.25945 0.14099 -0.40162 0.1098 -0.06013 0.10041 -0.00574 0.00289 0.06888 0.05652 0.11249 0.06363 0.08566 0.06818 0.08696 0.0763 -0.01861 0.07809 0.61004 0.02914 6.74983 1.30931 134.5066 92.77842

23


Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.