Getting started#

This guide will walk you through using the mcf package to

  • estimate heterogeneous treatment effects using the Modified Causal Forest

  • learn an optimal policy rule based on a Policy Tree

Simulating data#

The mcf package provides an example_data() function that generates synthetic training (training_df) and prediction (prediction_df) DataFrames with a specified number of observations, features, and treatments. It allows for different heterogeneity types ('linear', 'nonlinear', 'quadratic', 'WagerAthey') and also returns name_dict, a dictionary containing the names of the variable groups.

In this guide, however, we will create our own synthetic data to showcase the functionality of the mcf package. Our example involves a scenario with three possible treatments, represented by the values 0, 1, and 2.

import numpy as np
import pandas as pd
from mcf import ModifiedCausalForest
from mcf import OptimalPolicy
from mcf import McfOptPolReport

def simulate_data(n: int, seed: int) -> pd.DataFrame:
    """
    Simulate data with treatment 'd', outcome 'y', an unordered control
    variable 'female' and two ordered controls 'x1', 'x2'.

    Parameters:
    - n (int): Number of observations in the simulated data.
    - seed (int): Seed for the random number generator.

    Returns:
    pd.DataFrame: Simulated data in a Pandas DataFrame.

    """
    rng = np.random.default_rng(seed)

    d = rng.integers(low=0, high=2, size=n, endpoint=True)
    female = rng.integers(low=0, high=1, size=n, endpoint=True)
    x_ordered = rng.normal(size=(n, 2))
    y = (x_ordered[:, 0] +
        x_ordered[:, 1] * (d == 1) +
        x_ordered[:, 1] * (d == 2) +
        0.5 * female +
        rng.normal(size=n))

    data = {"y": y, "d": d, "female": female}

    for i in range(x_ordered.shape[1]):
        data["x" + str(i + 1)] = x_ordered[:, i]

    return pd.DataFrame(data)

df = simulate_data(n=1000, seed=1234)

To estimate both a Modified Causal Forest and an Optimal Policy Tree, we will use a simple sample splitting approach, dividing the simulated data into three equally sized parts:

  1. train_mcf_df: Used to train the Modified Causal Forest.

  2. pred_mcf_train_pt_df: Used to predict the heterogeneous treatment effects and to train the Optimal Policy Tree.

  3. evaluate_pt_df: Used to evaluate the Optimal Policy Tree.

indices = np.array_split(df.index, 3)
train_mcf_df, pred_mcf_train_pt_df, evaluate_pt_df = (df.iloc[ind] for ind in indices)

Estimating heterogeneous treatment effects#

To estimate a Modified Causal Forest, we use the ModifiedCausalForest class of the mcf package. To create an instance of this class, we need to specify the name of

  • at least one outcome variable through the var_y_name parameter

  • the treatment variable through the var_d_name parameter

  • ordered features through var_x_name_ord and/or unordered features through var_x_name_unord

as follows:

my_mcf = ModifiedCausalForest(
    var_y_name="y",
    var_d_name="d",
    var_x_name_ord=["x1", "x2"],
    var_x_name_unord=["female"],
    _int_show_plots=False # Suppress the display of diagnostic plots during estimation
)

Accessing and customizing the output location#

The mcf package generates a number of standard outputs for your convenience. After initializing a Modified Causal Forest, the package will create an output folder where these results will be stored. You can find the location of this folder by accessing the “outpath” entry of the gen_dict attribute of your Modified Causal Forest:

my_mcf.gen_dict["outpath"]

You can also specify the location of this folder manually using the gen_outpath parameter of the class ModifiedCausalForest.
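
For example, you could direct all outputs to a folder of your choice at initialization. The instance name and folder name below are placeholders for illustration only:

my_mcf_custom = ModifiedCausalForest(
    var_y_name="y",
    var_d_name="d",
    var_x_name_ord=["x1", "x2"],
    var_x_name_unord=["female"],
    gen_outpath="mcf_results",  # placeholder folder; the standard outputs are written here
    _int_show_plots=False
)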

Below is a selection of optional parameters that are often used when initializing a Modified Causal Forest. For a more detailed description of these parameters, please refer to the documentation of ModifiedCausalForest.

Commonly used optional parameters

Parameter                     Description
cf_boot                       Number of Causal Trees. Default: 1000.
p_atet                        If True, \(\textrm{ATE's}\) are also computed by treatment status (\(\textrm{ATET's}\)). Default: False.
var_z_name_list               Ordered feature(s) with many values used for \(\textrm{GATE}\) estimation.
var_z_name_ord                Ordered feature(s) with few values used for \(\textrm{GATE}\) estimation.
var_z_name_unord              Unordered feature(s) used for \(\textrm{GATE}\) estimation.
p_gatet                       If True, \(\textrm{GATE's}\) are also computed by treatment status (\(\textrm{GATET's}\)). Default: False.
var_x_name_always_in_ord      Ordered feature(s) always used in the splitting decision.
var_x_name_always_in_unord    Unordered feature(s) always used in the splitting decision.
var_y_tree_name               Outcome used to build the trees. If not specified, the first outcome in var_y_name is used.
var_id_name                   Individual identifier.
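
As an illustration, several of these options could be combined for the simulated data above. The instance below is only a sketch to show how the parameters are passed; it is not used in the remainder of this guide:

my_mcf_extended = ModifiedCausalForest(
    var_y_name="y",
    var_d_name="d",
    var_x_name_ord=["x1", "x2"],
    var_x_name_unord=["female"],
    cf_boot=500,                  # fewer Causal Trees than the default of 1000
    p_atet=True,                  # also compute ATETs by treatment status
    var_z_name_list=["x1"],       # ordered feature with many values for GATE estimation
    var_z_name_unord=["female"],  # unordered feature for GATE estimation
    _int_show_plots=False
)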

Training a Modified Causal Forest#

Next we will train the Modified Causal Forest on the train_mcf_df data using the train() method:

my_mcf.train(train_mcf_df)

Now we are ready to estimate heterogeneous treatment effects on the pred_mcf_train_pt_df data using the predict() method.

results = my_mcf.predict(pred_mcf_train_pt_df)

Results#

The easiest way to get an overview of your results is to read the PDF report that can be generated using the McfOptPolReport class:

mcf_report = McfOptPolReport(mcf=my_mcf, outputfile='Modified-Causal-Forest_Report')
mcf_report.report()

Next, we describe ways to access the results programmatically:

The predict() method returns a dictionary containing the estimation results. To gain an overview, have a look at the keys of the dictionary:

print(results.keys())

By default, the average treatment effects (\(\textrm{ATE's}\)) as well as the individualized average treatment effects (\(\textrm{IATE's}\)) are estimated. If these terms do not sound familiar, the documentation explains the different kinds of heterogeneous treatment effects in more detail.

In the multiple treatment setting there is more than one average treatment effect to consider. The following entry of the results dictionary lists the estimated treatment contrasts:

results["ate effect_list"]

An entry [1, 0] for instance specifies the treatment contrast between treatment level 1 and treatment level 0. These contrasts are aligned with the estimated \(\textrm{ATE's}\) and their standard errors, which you can access using:

results["ate"]
results["ate_se"]

The estimated \(\textrm{IATE's}\), together with the predicted potential outcomes, are stored as a Pandas DataFrame in the following entry of the results dictionary:

results["iate_data_df"]

Please refer to the documentation of the predict() method for a more detailed description of the contents of the results dictionary.

Post-estimation#

You can use the analyse() method to investigate a number of post-estimation plots. These plots are also exported to the previously created output folder:

my_mcf.analyse(results)

Finally, for out-of-sample evaluation, apply the predict() method to the data held out for evaluation:

oos_results = my_mcf.predict(evaluate_pt_df)

Learning an optimal policy rule#

Let’s explore how to learn an optimal policy rule using the OptimalPolicy class of the mcf package. To get started we need a Pandas DataFrame that holds the estimated potential outcomes (also called policy scores), the treatment variable and the features on which we want to base the decision tree.

As you may recall, we estimated the potential outcomes in the previous section. They are stored as columns in the “iate_data_df” entry of the results dictionary:

print(results["iate_data_df"].head())

The column names are explained in the iate_names_dic entry of the results dictionary. The uncentered potential outcomes are stored in columns with the suffix _un_lc_pot.

print(results["iate_names_dic"])

Now that we understand this, we are ready to build an Optimal Policy Tree. To do so, we need to create an instance of class OptimalPolicy where we set the gen_method parameter to “policy tree” and provide the names of

  • the treatment through the var_d_name parameter

  • the potential outcomes through the var_polscore_name parameter

  • ordered and/or unordered features used to build the policy tree through the var_x_name_ord and var_x_name_unord parameters, respectively

as follows:

my_policy_tree = OptimalPolicy(
    var_d_name="d",
    var_polscore_name=["Y_LC0_un_lc_pot", "Y_LC1_un_lc_pot", "Y_LC2_un_lc_pot"],
    var_x_name_ord=["x1", "x2"],
    var_x_name_unord=["female"],
    gen_method="policy tree",
    pt_depth_tree_1=2
    )

Note that the pt_depth_tree_1 parameter specifies the depth of the (first) policy tree. For demonstration purposes we set it to 2. In practice, you will usually want a larger value, although this increases the computational burden. See the User Guide and the Algorithm Reference for more detailed explanations.

After initializing an Optimal Policy Tree, the mcf package will automatically create an output folder. This folder will contain a number of standard outputs for your convenience. You can find the location of this folder in your console output. Alternatively, you can manually specify the folder location using the gen_outpath parameter.

Fit an Optimal Policy Tree#

To find the Optimal Policy Tree, we use the solve() method, where we need to supply the pandas DataFrame holding the potential outcomes, treatment variable and the features:

train_pt_df = results["iate_data_df"]
alloc_df = my_policy_tree.solve(train_pt_df)

The returned DataFrame contains the optimal allocation rule for the training data.

print(alloc_df.head())
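
To see how the rule distributes observations across the three treatments, you can tabulate the allocation columns. Since the exact column names may differ across mcf versions, the sketch below simply tabulates every column of the returned DataFrame:

print(alloc_df.apply(pd.Series.value_counts))  # frequency of each allocated treatment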

Next, we can use the evaluate() method to evaluate this allocation rule. This will return a dictionary holding the results of the evaluation. As a side effect, the DataFrame with the optimal allocation is augmented with columns that contain the observed treatment and a random allocation of treatments.

pt_eval = my_policy_tree.evaluate(alloc_df, train_pt_df)

print(pt_eval)
print(alloc_df.head())

A great way to get an overview of the results is to read the PDF report, which can again be generated using the McfOptPolReport class:

policy_tree_report = McfOptPolReport(
    optpol = my_policy_tree,
    outputfile = 'Optimal-Policy_Report'
    )
policy_tree_report.report()

Finally, it is straightforward to apply our Optimal Policy Tree to new data. To do so, we simply apply the allocate() method to the DataFrame holding the potential outcomes, treatment variable and the features for the data that was held out for evaluation:

oos_df = oos_results["iate_data_df"]
oos_alloc_df = my_policy_tree.allocate(oos_df)

To evaluate this allocation rule, apply the evaluate() method again, just as above.

oos_eval = my_policy_tree.evaluate(oos_alloc_df, oos_df)

print(oos_eval)
print(oos_alloc_df.head())

Next steps#

The following are great sources to learn even more about the mcf package:

  • The User Guide offers explanations on additional features of the mcf package and provides several example scripts.

  • Check out the API for details on interacting with the mcf package.

  • The Algorithm Reference provides a technical description of the methods used in the package.