I’ve recently been reading a lot of technical papers and thought it would be nice to summarize them in less formal language than academic papers. I may or may not do this more than once.
This paper is about trying to do causal analysis using Random Forests (RF). RF are very popular for building classification or regression predictive models, but it’s not trivial to make classical statistical claims about the results. For instance, what are your confidence intervals? How do you get p-values?
Furthermore, this paper wants to make claims on causal impact, or the effect of a treatment. For instance, what’s the impact of college on income? This is hard to do for many reasons, but most fundamentally you don’t have the data you exactly need, which is data for each individual on what happened when they both went to college and did not. This is impossible of course, because in reality the individual either went to college or they did not and you don’t know what would have happened if the other situation occurred — i.e. the counterfactual. This way of framing the problem is called the “potential outcomes framework” and essentially supposes that each person has multiple potential outcomes depending on whether or not they received the treatment.
A key assumption in this paper, and causal estimation techniques of this type generally, is one of unconfoundedness. This means that once you control for the variables of interest, whether or not a person received the treatment or not is random. This enables us to treat nearby points as mini-randomized experiments. In the case of a drug trial you can randomly assign treatment, so this isn’t a problem, but if you’re analyzing observational data the treatments are already assigned — a person went to college or they didn’t.
For unconfoundedness to be upheld you would need to choose and measure all of the variables that would affect the treatment assignment. For the causal impact of a 4-year degree on income, one of the covariates you’d likely want to choose is family income since whether or not someone goes to college is probably correlated with their family’s income. Age might be another one since an 18 year-old is a lot more likely to go to college than a 50 year-old. The idea is that when you look at two neighbor points that are plotted in your family-income/age plot, the decision of whether or not to go to college should be random. This is valuable because then you can take the income of the no-college-person and subtract it from the college-person and you have an estimate of the impact of college at that point in the family-income/age space.
But this is hard, because you might forget an important variable, or maybe you just don’t have the data for it. So it’s an assumption that surely isn’t exactly true, but it might be true enough to give useful answers.
But assuming you have the right covariates, the authors use a Random Forest to split the data into self-similar groups. The Random Forest is an adaptive nearest-neighbor method in that it decides which portions of the space are similar for you, whereas most nearest-neighbor techniques tend to treat all distances equal. They then add constraints that the resultant leaves have a minimum of both treatment and no-treatment classes and then the causal impact for each leaf can be calculated by subtracting the averages as if it were a randomized experiment. Once complete for the individual trees, the estimates from each tree can be averaged. They call this implementation a Causal Forest.
II. What’s new here
As mentioned above, to make robust statistical claims in traditional statistics, you need things like p-values and confidence intervals. This requires knowing the asymptotic sampling distribution of the statistic. In this case, that means we need to know what the distribution would look like if we sampled the average treatment effect from the Random Forest estimator over an infinite number of trees/data. If we have that, then we can say things like: “the average treatment is 0.3 and the likelihood that this came from random chance is less than 0.1%”
The built-in assumption here is that we can derive properties for the asymptotic/infinite data case and apply them to our real-world case of finite samples, but that’s often the case in traditional statistics.
Prior work has been done to enable asymptotic analysis of Random Forests, but this paper establishes the constraints needed to apply them to their Causal Forests. One of the constraints requires “honest trees”, which they present two algorithms for growing.
The full proof is very complicated and I won’t attempt to recreate it here, but I’ll briefly outline some constraints and what they enable.
First, an asymptotic theory is recreated from previous work for traditional Random Forests that shows that it is an asymptotically Gaussian estimator with a mean of zero, meaning it’s unbiased. They also mention a technique called the infinitesimal jackknife for estimating the variance, which includes a finite-sample correction.
The authors are able to leverage this previous work by including the concept of the “honest tree”. The basic idea is that you cannot use the outcome variable to both do the splitting and estimate the average impact — you have to choose one or the other. They present two ways to do this. The first is the double-sample tree where you split your data in two: half for estimating the impact and the other for placing the splits. The splitting criteria is to minimize the MSE for the outcome variable. In the double-sample case it might seem you’re throwing away half your data, but this condition is for a single tree and the Random Forest is sampling new training sets for each tree, so you’ll end up using all of the data for both splitting and estimation.
The other method is by growing “propensity trees” which are classification trees that aim to predict the treatment class instead of the outcome, and then the outcome variable is only used for estimating the impact within each leaf. They impose a stopping criteria such that you stop splitting to maintain a minimum of each treatment class in any leaf. This is necessary so that you have outcomes to compare to estimate the effect.
By using honest trees and relying on assumptions of unconfoundedness and treatment class overlap within leaves, they’re able to slightly modify the traditional treatment to give the same unbiased gaussian asymptotic results.
IV. Experimental Results
Their baseline is using the k-NN algorithm. They create some simulation experiments with known conditions that mimic common problems and then apply the Causal Forest and k-NN methods.
The first experiment holds the true treatment effect at zero for all x, but establishes a correlation between the outcome and the treatment assignment, thereby testing the ability of the algorithm to correct the covariates to eliminate the bias. This is like having the algorithm automatically figure out that age and family are important as well as how to split them up. They run the experiment many times at a varying training data dimension. They reported MSE and the coverage, which was how often the true value was within the 95% confidence interval of the estimator. The Causal Forest had an order of magnitude improvement over 10-NN and a factor of 5 improvement over 100-NN. CF maintained ~0.95 coverage up to 10 dimensions and then began to degrade. 10-NN maintained reasonable coverage in the 0.9 range and 100-NN performed very poorly. It’s worth noting the confidence intervals were much wider for k-NN than CF, so the improved coverage is that much more impressive.
The 2nd experiment had constant main effect and propensity, but had the true treatment effect depend on only two covariates. They then scaled the number of irrelevant covariates to understand the ability of the algorithm to find this heterogeneous treatment effect in the presence of irrelevant covariates. Surprisingly, CF did better at higher dimension than low dimension. They explain this by noting the variance of the forest depends on the correlation between trees and suggest that the correlation between trees and therefore ensemble variance is reduced at the higher dimension. The results are similar to experiment 1 in that MSE is much better or at least on par with more consistent coverage that scales with dimension.
It was noted that the confidence intervals’ coverage begins to degrade at the edge of the feature space, particularly for high dimension. This is explained as being a situation dominated by bias that would disappear in the asymptotic limit of infinite data. It is noted that bias at the boundaries is typical of trees specifically, and nearest-neighbor non-parametric estimators generally.
Although causal analysis is not my expertise, it seems this is a nice advancement for nearest-neighbor methods with the assumption of unconfoundedness. The dramatic improvement in MSE while maintaining nominal coverage is impressive.
I found several aspects of the paper confusing, however. Specifically, those related to splitting criteria of the trees. In the case of propensity trees, they’re training a classifier to separate treatment classes, but they’re conversely requiring a constraint of heterogeneity of classes in each leaf, which is directly opposed to the split criteria.
Similarly in the double-sample framework, they’re splitting to minimize the outcome MSE, which groups points with similar outcome values. But the entire point is that after separating the points by treatment classification, the outcomes are different and that difference is the average treatment effect. Once again the splitting criteria seems opposed to the end-goal. To this end they reference a paper (Athey and Imbens ) that may contain clarification.
Finally, there’s a remark that I don’t understand, but sounds troubling.
Remark 4. (testing at many points) We note that it is not in general possible to construct causal trees that are regular in the sense of Definition 4b for all x simultaneously….In practice, if we want to build a causal tree that can be used to predict at many test points, we may need to assign different trees to be valid for different test points. Then, when predicting at a specific x, we treat the set of tees that were assigned to be valid at that x as the relevant forest and apply Theorem 11 to it. (pg 19)
I’m not sure if this is operational overhead, or something more fundamental.
What do you think?