3. Sampling weights and clustering#

3.1. Sampling weights#

You can provide sampling weights for each observation in your data set. To estimate a Modified Causal Forest with sampling weights, you need to set the gen_weighted parameter to True and provide the name of the variable containing the sampling weights in the var_w_name parameter.

3.2. Clustering#

If your data set contains clusters, you can provide the name of the variable containing the cluster identifier through the var_cluster_name parameter.

In case your data has a panel structure, your data set is also clustered, namely at the level of the individual. In this case you can provide the name of the variable containing the individual identifier through the var_cluster_name parameter.

The clusters are by default used to draw the random samples when growing the forest. You can control this behaviour through the gen_panel_in_rf parameter. To compute clustered standard errors, you need to set the gen_panel_data parameter to True.

3.3. Parameter overview#

The following table summarizes the parameters related to sampling weights and clustering in the class ModifiedCausalForest:

Parameter	Description
`var_w_name`	Name of the variable holding the sampling weight of each observation.
`gen_weighted`	If True, sampling weights from `var_w_name` will be used. Default: False.
`var_cluster_name`	Name of the variable holding the cluster identifier.
`gen_panel_data`	If True, clustered standard errors based on `var_cluster_name` are computed. Default: False.
`gen_panel_in_rf`	If True, clusters are used to draw the random samples when building the forest. Default: True. Only relevant if `gen_panel_data` is True.

Please consult the API for more details.

3.4. Examples#

from mcf import ModifiedCausalForest

ModifiedCausalForest(
    var_y_name="y",
    var_d_name="d",
    var_x_name_ord=["x1", "x2"],
    # Parameters for sampling weights:
    var_w_name="sampling_weight",
    gen_weighted=True
)

ModifiedCausalForest(
    var_y_name="y",
    var_d_name="d",
    var_x_name_ord=["x1", "x2"],
    # Parameters for clustering:
    var_cluster_name="cluster_id",
    gen_panel_data=True,
    gen_panel_in_rf=True
)