I was thinking about stacking ensembles (stacked generalization) in the sauna.
Stacked ensembles can overfit, so we need to regularize.
Generally, we use cross-validation to ensure that the meta model is fit on out-of-fold predictions.
This is to avoid data leakage, but we could say it has a “regularizing” effect.
For reference, here’s how that works via claude sonnet 3.5 (light editing from me):
Let me explain how the meta-model in stacked generalization is fit using cross-validation:
The key idea is to train the meta-model on out-of-fold predictions from the base models to avoid data leakage. Here’s the process:
First, the training data is split into K folds (typically 5 or 10).
For each fold k:
- Train each base model on the data from all other K-1 folds
- Use these trained base models to make predictions on fold k
- Store these out-of-fold predictions along with the true target values
After doing this for all K folds, we have:
- A matrix of out-of-fold predictions from each base model for every training instance
- The corresponding true target values
The meta-model is then trained on this dataset where:
- Features are the out-of-fold predictions from each base model
- Target is the true target values
Finally, to make predictions on new data:
- Train each base model on the full training dataset
- Get their predictions on the new data
- Feed these predictions into the trained meta-model
This cross-validation approach ensures the meta-model learns how to combine base model predictions without overfitting, since it never sees predictions from base models that were trained on the same data used to generate those predictions.
Me again.
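Here’s a minimal sketch of that out-of-fold step with scikit-learn, where cross_val_predict handles the K-fold loop (the base models here are just placeholders; the prototype further down does the full version, including refitting the base models on all the training data for inference):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
base_models = [KNeighborsClassifier(), DecisionTreeClassifier(random_state=0)]

# One column per base model: its out-of-fold probability of the positive class
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-model only ever sees predictions made on held-out folds
meta_model = LogisticRegression().fit(meta_X, y)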
I was thinking, can we use dropout when learning how to combine the predictions from lower level models in the meta model?
It probably makes the most sense when the meta model is a linear model, where we could randomly zero out the predictions or coefficients during each training update.
Nah, simpler: just gather the out-of-fold predictions and apply a dropout mask to that “dataset” before fitting the meta model.
If you recall, dropout is a regularization method used in neural nets.
It involves zeroing out signals from a prior layer during forward and back propagation.
Here’s a summary from claude:
Dropout randomly deactivates neurons during training with probability p (typically 0.5 for hidden layers, 0.2 for the input layer).
Key aspects:
In training:
- For each batch, randomly zero out different neurons
- Scale remaining activations by 1/(1-p) to maintain expected sum
- Backpropagate through non-dropped neurons only
In inference:
- Use all neurons (no dropout)
- No need to scale since using full network
This prevents co-adaptation of neurons, reduces overfitting, and approximates model averaging of neural networks. The random dropping creates robust features useful in many contexts rather than features specific to training examples.
Me again.
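A quick numpy sketch of that mechanism (this is the “inverted dropout” form, where the training-time scaling means inference needs no adjustment):

import numpy as np

def dropout(activations, p=0.5, training=True):
    # Inference: use all units, no scaling needed
    if not training or p == 0.0:
        return activations
    # Training: zero each unit with probability p, scale survivors by 1/(1-p)
    mask = np.random.rand(*activations.shape) > p
    return activations * mask / (1.0 - p)

h = np.random.randn(4, 8)             # a batch of hidden activations
h_train = dropout(h, p=0.5)           # roughly half the units zeroed, rest scaled up
h_infer = dropout(h, training=False)  # unchanged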
Can we do the same kind of thing when learning how to combine predictions in a stacked ensemble?
And do we get benefits like model robustness and less reliance on specific models?
Reminds me of ablation studies on ensemble members.
Here’s what claude thinks about the idea:
Dropout in stacking’s meta-model follows these steps:
During meta-model training:
- Apply dropout to base model predictions (features) with probability p
- Scale remaining predictions by 1/(1-p)
- Train meta-model on these masked features
During inference:
- Use all base model predictions without dropout
- Meta-model uses full feature set
This prevents meta-model from overly relying on specific base models and helps learn robust combinations of predictions.
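Me again. In code, that masking step on the out-of-fold prediction matrix could look something like this (a sketch using the 1/(1-p) scaling claude describes; the prototype below swaps in 0.5 for dropped values instead):

import numpy as np

def mask_meta_features(meta_X, p=0.2, rng=None):
    # meta_X: (n_samples, n_base_models) matrix of out-of-fold predictions
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(meta_X.shape) > p   # True = keep this base model's prediction
    return meta_X * keep / (1.0 - p)      # scale survivors to preserve the expected value

# Fit the meta-model on the masked matrix; predict on the unmasked one at test time
# meta_model.fit(mask_meta_features(meta_X, p=0.2), y)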
Okay, so thinking out loud here, I guess we could train the system to predict “0” for “no prediction” or “pass” on a classification task, then 1 for the first label, 2 for the second, and so on.
Or predict a probability (0-1) and 0.5 for a dropout “pass” prediction.
Then compare the result to the same setup with a linear model under L1, under L2, under L1+L2, and under no regularization.
Does it do better/worse/the same?
Might be fun!
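The regularization baselines for that comparison are easy to spell out (a sketch; the C values, l1_ratio, and solver choices are just placeholders):

from sklearn.linear_model import LogisticRegression

meta_model_variants = {
    "none":  LogisticRegression(penalty=None),   # penalty=None needs scikit-learn >= 1.2
    "L1":    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    "L2":    LogisticRegression(penalty="l2", C=1.0),
    "L1+L2": LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0),
}

Each of these could be swapped in as the meta_model in the prototype below, with and without the dropout mask.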
Okay, so here’s a tiny prototype of the idea (developed with gpt4o, tweaked by me, then updated with gpto1):
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import accuracy_score
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# -- 1. Create synthetic dataset ------------------------------------------------
X, y = make_classification(
n_samples=1000,
n_features=10,
n_informative=8,
n_redundant=2,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
# -- 2. Define simple base models -----------------------------------------------
base_models = [
("knn", KNeighborsClassifier(n_neighbors=5)),
("svc", SVC(probability=True, random_state=42)),
("dt", DecisionTreeClassifier(random_state=42))
]
# -- 3. Define the meta-model ---------------------------------------------------
meta_model = LogisticRegression(random_state=42)
# -- 4. Custom StackingWithDropout class ----------------------------------------
class StackingWithDropout(BaseEstimator, ClassifierMixin):
"""
Manually stack base model predictions as features for a meta-model,
with an added 'dropout' mechanism on the meta-features.
"""
def __init__(self, base_models, meta_model, dropout_rate=0.2):
self.base_models = base_models
self.meta_model = meta_model
self.dropout_rate = dropout_rate
def fit(self, X, y):
# Fit each base model on the entire training set
# Collect out-of-fold predictions (cross_val_predict) for meta-features
self.base_models_ = []
self.meta_X_ = np.zeros((X.shape[0], len(self.base_models)))
for i, (name, model) in enumerate(self.base_models):
# Fit the model
model.fit(X, y)
self.base_models_.append((name, model))
# For meta-features, use cross_val_predict with probabilities
# (method="predict_proba"), then take the probability of the positive class
preds = cross_val_predict(
model, X, y, cv=5, method="predict_proba"
)[:, 1]
self.meta_X_[:, i] = preds
# Apply dropout to meta-features
meta_X_dropped = self._apply_dropout(self.meta_X_)
# Now train meta-model on the dropout-augmented meta-features
self.meta_model.fit(meta_X_dropped, y)
return self
def predict(self, X):
# For prediction, get the base-model predictions
meta_X_test = np.column_stack([
model.predict_proba(X)[:, 1] for _, model in self.base_models_
])
# Meta-model predicts the final output
return self.meta_model.predict(meta_X_test)
def _apply_dropout(self, meta_X):
"""
Apply dropout to the meta-features:
- With probability 'dropout_rate', replace each feature with 0.5.
- With probability '1 - dropout_rate', leave it as is.
"""
# Create a mask of True/False indicating which entries are "kept"
mask = np.random.rand(*meta_X.shape) > self.dropout_rate
# Where mask is False, assign 0.5
meta_X_dropped = meta_X * mask + (1 - mask) * 0.5
return meta_X_dropped
# -- 5. Compare models ----------------------------------------------------------
# Create two versions of stacking: one without dropout, one with dropout.
models = {
"Stacking Without Dropout": StackingWithDropout(base_models, meta_model, dropout_rate=0.0),
"Stacking With Dropout": StackingWithDropout(base_models, meta_model, dropout_rate=0.2),
# For reference, train the meta-model directly on X (not recommended in practice)
"MetaModel (Direct LR)": LogisticRegression(random_state=42),
# For reference, each base model individually
"KNN": base_models[0][1],
"SVC": base_models[1][1],
"DecisionTree": base_models[2][1]
}
results = {}
for name, model in models.items():
# Fit the model
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
results[name] = accuracy
print(f"{name}: {accuracy:.4f}")
# -- 6. Display results ---------------------------------------------------------
results_df = pd.DataFrame(list(results.items()), columns=["Model", "Accuracy"])
results_df.sort_values(by="Accuracy", inplace=True, ascending=False)
print("\nResults:")
print(results_df.to_string(index=False))
Here’s a sample output:
Stacking Without Dropout: 0.9000
Stacking With Dropout: 0.9100
MetaModel (Direct LR): 0.6750
KNN: 0.8800
SVC: 0.9000
DecisionTree: 0.7300
Results:
Model Accuracy
Stacking With Dropout 0.910
Stacking Without Dropout 0.900
SVC 0.900
KNN 0.880
DecisionTree 0.730
MetaModel (Direct LR) 0.675
Looks like it is off to a good start!
Are there any papers on this kind of thing?
A quick search only turns up the application of stacking to MOOC dropout rates.
A search on perplexity finds papers on school dropout or the use of dropout in neural net ensembles.
Maybe this is all too obvious to warrant a paper.