Towards Reliable Assessments of Demographic Disparities in Multi-Label Image Classifiers — arXiv2