Adversarial validation as density ratio estimation
- Adversarial validation
- Density ratio
- Relationship between adversarial validation and density ratio
- Conclusion
Adversarial validation
Adversarial validation is a technique mainly used in Kaggle competitions.
When the distribution of the train set differs from that of the public test set, a validation set drawn at random from the train set correlates poorly with the public leaderboard. Adversarial validation instead selects the train examples that lie in high-density regions of the public test distribution, which yields a validation set whose scores correlate well with the public test set.
The steps of adversarial validation are as follows:
- Assign the pseudo negative label -1 to every train example and the pseudo positive label +1 to every test example.
- Train any probabilistic binary classifier to discriminate the train set from the test set, using cross-validation.
- Score the train set with this classifier and take the top-N examples with the highest score for class +1 as the validation set.
(Ref)
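The three steps above can be sketched as follows. This is a minimal illustration, not the referenced implementation: it assumes numeric feature matrices, uses scikit-learn's `LogisticRegression` as the probabilistic classifier (any classifier with `predict_proba` would do), and the function name `adversarial_validation_split` and the parameter `n_valid` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


def adversarial_validation_split(X_train, X_test, n_valid):
    """Return (train_idx, valid_idx), both indexing rows of X_train."""
    # Step 1: pseudo labels, -1 for train rows and +1 for test rows.
    X = np.vstack([X_train, X_test])
    y = np.concatenate([-np.ones(len(X_train)), np.ones(len(X_test))])

    # Step 2: out-of-fold probabilities from a train-vs-test classifier.
    clf = LogisticRegression(max_iter=1000)
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")

    # Classes are sorted as [-1, +1], so column 1 is P(y = +1 | x).
    q_train = proba[: len(X_train), 1]

    # Step 3: the top-N train rows by class +1 score become the validation set.
    order = np.argsort(-q_train)
    return order[n_valid:], order[:n_valid]
```

Out-of-fold predictions are used so that each train row is scored by a model that never saw it, which is one reasonable reading of "using cross-validation" in the steps above.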
Density ratio
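Here, the "density ratio" between the test distribution and the train distribution is introduced:

$$r(x) = \frac{p_{\mathrm{test}}(x)}{p_{\mathrm{train}}(x)}$$

With the pseudo labels above, $p_{\mathrm{test}}(x) = p(x \mid y = +1)$ and $p_{\mathrm{train}}(x) = p(x \mid y = -1)$, so the density ratio is rewritten using Bayes' rule as:

$$r(x) = \frac{p(x \mid y = +1)}{p(x \mid y = -1)} = \frac{p(y = -1)}{p(y = +1)} \cdot \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)}$$

$p(y = +1 \mid x)$ and $p(y = -1 \mid x)$ can be approximated by a binary classifier that discriminates the train set from the test set. Also, $p(y = -1)$ and $p(y = +1)$ are constants, and their ratio can be approximated by the ratio between the number of train examples and the number of test examples.
Thus, the density ratio is:

$$r(x) \approx \frac{N}{M} \cdot \frac{q(x)}{1 - q(x)}$$

where $N$ and $M$ are the number of train examples and the number of test examples, and $q(x)$ is the probability of class +1 (the target set) that the classifier outputs for $x$.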
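As a small worked example of this estimate, the sketch below turns a classifier output into a density ratio. It assumes `q` is the predicted probability of class +1 (test), `n_train` and `n_test` are the sample counts N and M, and the density ratio is test density over train density. The function name and the `eps` clipping are illustrative additions, not part of the original derivation.

```python
def density_ratio(q, n_train, n_test, eps=1e-12):
    """Estimate p_test(x) / p_train(x) from q = P(y = +1 | x)."""
    # Clip q away from 0 and 1 so the odds q / (1 - q) stay finite.
    q = min(max(q, eps), 1.0 - eps)
    # Prior correction (N / M) times the classifier's posterior odds.
    return (n_train / n_test) * q / (1.0 - q)


# Balanced sets: a score of 0.75 means the point is 3x denser in test.
print(density_ratio(0.75, n_train=100, n_test=100))
# Twice as many train rows: a score of 0.5 already implies a ratio of 2.
print(density_ratio(0.5, n_train=200, n_test=100))
```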
Importantly, density ratio estimation can be boiled down to a binary classification problem. This is the same procedure as adversarial validation. (Recall that adversarial validation trains a binary classifier between the train set and the test set!)
(Introduction to density ratio estimation) Density Ratio Estimation for KL Divergence Minimization between Implicit Distributions | Louis Tiao
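The density ratio $r(x)$ is a monotonically increasing function of the classifier output $q(x)$, the predicted probability of class +1. For simplicity, $N = M$ is assumed in the following discussion.
Under this assumption the density ratio is $q(x) / (1 - q(x))$: it is 0 at $q(x) = 0$, equals 1 at $q(x) = 0.5$, and grows without bound as $q(x)$ approaches 1.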
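One consequence of monotonicity is that ranking examples by classifier score and ranking them by density ratio give the same order, so a top-N selection is the same either way. The sketch below checks this on a handful of made-up scores, assuming equal train and test sizes so that the ratio is simply the odds of the score. The scores and the helper name `ratio` are illustrative only.

```python
def ratio(q):
    """Density ratio for equal train and test sizes: the odds of q."""
    return q / (1.0 - q)


# Hypothetical classifier scores P(y = +1 | x) for five train rows.
scores = [0.62, 0.08, 0.95, 0.50, 0.31]

# Row indices sorted from highest to lowest, by score and by ratio.
by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
by_ratio = sorted(range(len(scores)), key=lambda i: ratio(scores[i]), reverse=True)

print(by_score == by_ratio)  # True: both rankings pick the same rows
```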
Relationship between adversarial validation and density ratio
As shown above, adversarial validation chooses the train examples that have a high density ratio value.
Data with density ratio $r(x) > 1$ have high density in the test data and low density in the train set. Such data are very useful for validation because they are rare in the train set but appear frequently in the test data.
Data with $r(x)$ around 1 are difficult to classify as belonging to the train set or the test set. These are also useful as validation data.
Data with $r(x) < 1$ have high density in the train data and low density in the test set. It is dangerous to use this kind of data for validation because they are rare in the test set.
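The three cases above can be expressed as a simple labelling rule on the classifier score. This is only a sketch: it assumes equal train and test sizes, and the width of the "around 1" region (the hypothetical `band` parameter) is an arbitrary choice that the discussion above does not specify.

```python
def validation_suitability(q, band=0.1):
    """Label a train row from its score q = P(y = +1 | x), assuming N = M."""
    # Density ratio (test over train) as the odds of the classifier score.
    r = q / (1.0 - q) if q < 1.0 else float("inf")
    if r > 1.0 + band:
        return "test-like"   # dense in test, rare in train: useful for validation
    if r < 1.0 - band:
        return "train-like"  # rare in test: risky to validate on
    return "ambiguous"       # hard to tell apart: also useful for validation


for q in (0.9, 0.5, 0.1):
    print(q, validation_suitability(q))
```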
Conclusion
In this article, I introduced the relationship between the adversarial validation technique and density ratio estimation. Adversarial validation is theoretically equivalent to density ratio estimation between the train set and the test set.