I am evaluating two models on a test set. Both models are designed to return a prediction for a test instance only when there is enough evidence that the prediction will be accurate; otherwise, they abstain and return no prediction for that instance.
This means the two models return prediction vectors of different lengths. For example:
- Model 1 returns a prediction for 20% of the test set.
- Model 2 returns a prediction for 60% of the test set.
How can I perform a t-test to compare the means of both approaches?
One solution I have considered is to run the t-test only on the instances that both models managed to predict (the overlap).
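As a minimal sketch of the overlap idea, the snippet below assumes each model's output is stored as a dictionary mapping test-instance ids to a per-instance error (e.g., geolocation error in km); the instance ids and error values are made up for illustration. Since the overlap gives the same instances for both models, a paired t-test on the per-instance differences is the natural choice:

```python
import math
import statistics

# Hypothetical per-instance errors (e.g., geolocation error in km) for the
# subset of the test set each model chose to predict. Keys are instance ids.
errors_model1 = {1: 12.0, 3: 8.5, 4: 30.2, 7: 5.1, 9: 14.8, 12: 9.9}
errors_model2 = {1: 10.5, 2: 40.0, 3: 9.0, 4: 25.0,
                 7: 6.2, 9: 13.1, 11: 7.7, 12: 11.4}

# Restrict the comparison to instances both models predicted (the overlap).
overlap = sorted(errors_model1.keys() & errors_model2.keys())

# Paired t-test on the per-instance differences, df = n - 1.
diffs = [errors_model1[i] - errors_model2[i] for i in overlap]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)           # sample standard deviation
t_stat = mean_d / (sd_d / math.sqrt(n))
print(n, t_stat)
```

One caveat with this approach: the overlap is not a random sample of the test set, since it contains only the instances both models were confident about, so any conclusion strictly applies to that subset rather than to the full test distribution.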
Another solution would be to have a model return a random prediction whenever there is not enough evidence, but I find this somewhat misleading given the nature of the task (predicting a geolocation).