6. Post-estimation diagnostics#

The class ModifiedCausalForest provides you with several diagnostic tools to analyse the estimated \(\text{IATE's}\). They cover

  • descriptive statistics

  • a correlation analysis

  • \(k\)-means clustering

  • a feature importance analysis

To conduct any post-estimation diagnostics, the parameter post_est_stats of the class ModifiedCausalForest needs to be set to True. Once you have estimated your \(\text{IATE's}\) using the predict() method, you can conduct the post-estimation diagnostics with the analyse() method:

from mcf import ModifiedCausalForest
from mcf import McfOptPolReport

my_mcf = ModifiedCausalForest(
        var_y_name="y",
        var_d_name="d",
        var_x_name_ord=["x1", "x2"],
        var_x_name_unord=["female"],
        # Enable post-estimation diagnostics
        post_est_stats=True
    )

my_mcf.train(df)
results = my_mcf.predict(df)

post_estimation_diagnostics = my_mcf.analyse(results)

The easiest way to to inspect the results of the post-estimation diagnostics, is to read the PDF-report that can be generated using the class McfOptPolReport:

mcf_report = McfOptPolReport(mcf=my_mcf, outputfile='Modified-Causal-Forest_Report')
mcf_report.report()

You can additionally specify the reference group for the \(\text{IATE's}\) with the parameter post_relative_to_first_group_only. If post_relative_to_first_group_only is True, the comparison group will be the first treatment state. This is the default. If False, all possible treatment combinations are compared with each other. The confidence level in the post-estimation diagnostics is specified through the parameter p_ci_level.

6.1. Descriptive statistics#

With post_est_stats set to True, the distribution of the estimated \(\text{IATE's}\) will be presented. The produced plots are also available in the output folder that is produced by the mcf package. You can find the location of this folder by accessing the “outpath” entry of the gen_dict attribute of your Modified Causal Forest:

my_mcf.gen_dict["outpath"]

You can also specify this path through the gen_outpath parameter of the class ModifiedCausalForest(). The output folder will contain the jpeg/pdf-files of the plots as well as csv-files of the underlying data in the subfolder ate_iate.

6.2. Correlation analysis#

The correlation analysis estimates the dependencies between the different \(\text{IATE's}\), between the \(\text{IATE's}\) and the potential outcomes, and between the \(\text{IATE's}\) and the features. You can activate the correlation analysis by setting the parameter post_bin_corr_yes to True. Note that the correlation coefficients are only displayed if their absolute values exceeds the threshold specified by the parameter post_bin_corr_threshold.

6.3. \(k\)-means clustering#

To analyze heterogeneity in different groups (clusters), you can conduct \(k\)-means clustering by setting the parameter post_kmeans_yes to True. The mcf package uses the k-means++ algorithm from scikit-learn to build clusters based on the \(\text{IATE's}\).

from mcf import ModifiedCausalForest
from mcf import McfOptPolReport

my_mcf = ModifiedCausalForest(
        var_y_name="y",
        var_d_name="d",
        var_x_name_ord=["x1", "x2"],
        var_x_name_unord=["female"],
        post_est_stats=True,
        # Perform k-means clustering
        post_kmeans_yes=True
    )

my_mcf.train(df)
results = my_mcf.predict(df)

post_estimation_diagnostics = my_mcf.analyse(results)

The report obtained through the class McfOptPolReport will contain descriptive statistics of the \(\text{IATE's}\), the potential outcomes and the features for each cluster.

mcf_report = McfOptPolReport(mcf=my_mcf, outputfile='Modified-Causal-Forest_Report')
mcf_report.report()

If you wish to analyse the clusters yourself, you can access the cluster membership of each observation through the “iate_data_df” entry of the dictionary returned by the analyse() method. The cluster membership is stored in the column IATE_Cluster of the DataFrame.

post_estimation_diagnostics["iate_data_df"]

You can define a range for the number of clusters through the parameter post_kmeans_no_of_groups. The final number of clusters is chosen via silhouette analysis. To guard against getting stuck at local extrema, the number of replications with different random start centers can be defined through the parameter post_kmeans_replications. The parameter post_kmeans_max_tries sets the maximum number of iterations in each replication to achieve convergence.

6.4. Feature importance#

If you are interested in learning which of your features have a lot of predictive power for the estimated \(\text{IATE's}\) you can activate the feature importance procedure by setting the parameter post_random_forest_vi to True. This procedure will build a predictive random forest to determine which features influence the \(\text{IATE's}\) most. The feature importance statistics are presented in percentage points of the coefficient of determination, \(R^2\), that is lost when the respective feature is randomly permuted. The \(R^2\) statistics are obtained through the RandomForestRegressor provided by scikit-learn.

6.5. Parameter overview#

Below is an overview of the above mentioned parameters related to post-estimation diagnostics in the class ModifiedCausalForest:

Parameter

Description

post_est_stats

If True, post-estimation diagnostics are conducted. Default: True.

post_relative_to_first_group_only

If True, post-estimation diagnostics will only be conducted for \(\text{IATE's}\) relative to the first treatment state. If False, the diagnostics cover the \(\text{IATE's}\) of all possible treatment combinations. Default: True.

p_ci_level

Confidence level for plots, including the post-estimation diagnostic plots. Default: 0.9.

post_bin_corr_yes

If True, the binary correlation analysis is conducted. Default: True.

post_bin_corr_threshold

If post_bin_corr_yes is True, correlations are only displayed if their absolute value is at least post_bin_corr_threshold. Default: 0.1.

post_kmeans_yes

If True, \(k\)-means clustering is conducted to build clusters based on the \(\text{IATE's}\). Default: True.

post_kmeans_no_of_groups

Only relevant if post_kmeans_yes is True. Determines the number of clusters for \(k\)-means clustering. Should be specified as a list of values. Default: See the API.

post_kmeans_max_tries

Only relevant if post_kmeans_yes is True. Determines the maximum number of iterations to achieve convergence in each \(k\)-means clustering replication. Default: 1000.

post_kmeans_replications

Only relevant if post_kmeans_yes is True. Determines the number of replications for \(k\)-means clustering. Default: 10.

post_random_forest_vi

If True, the feature importance analysis is conduced. Default: True.

post_plots

If True, post-estimation diagnostic plots are printed during runtime. Default: True.

Please consult the API for more details.

6.6. Example#

from mcf import ModifiedCausalForest
from mcf import McfOptPolReport

my_mcf = ModifiedCausalForest(
        var_y_name="y",
        var_d_name="d",
        var_x_name_ord=["x1", "x2"],
        var_x_name_unord=["female"],
        p_ci_level=0.95,
        # Parameters for post-estimation diagnostics
        post_est_stats=True,
        post_est_stats=True
        post_relative_to_first_group_only=True,
        post_bin_corr_yes=True,
        post_bin_corr_threshold=0.1,
        post_kmeans_yes=True,
        post_kmeans_no_of_groups=[3, 4, 5, 6, 7],
        post_kmeans_max_tries=1000,
        post_kmeans_replications=10,
        post_random_forest_vi=True,
        post_plots=True
    )

my_mcf.train(df)
results = my_mcf.predict(df)

# Compute the post-estimation diagnostics
post_estimation_diagnostics = my_mcf.analyse(results)

# Access cluster memberships (column 'IATE_Cluster')
post_estimation_diagnostics["iate_data_df"]

# Produce a PDF-report with the results, including post-estimation diagnostics
mcf_report = McfOptPolReport(mcf=my_mcf, outputfile='Modified-Causal-Forest_Report')
mcf_report.report()