publications

2025

  1. ICLR 2025
    BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
    In International Conference on Learning Representations 2025

2024

  1. arXiv
    GPS-SSL: Guided Positive Sampling to Inject Prior Into Self-Supervised Learning
    arXiv preprint arXiv:2401.01990 2024

2022

  1. DataPerf WS, ICML 2022
    Revisiting Hotels-50K and Hotel-ID
    arXiv preprint arXiv:2207.10200 2022

2020

  1. EMNLP 2020
    Structure Aware Negative Sampling in Knowledge Graphs
    arXiv preprint arXiv:2009.11355 2020