The Center for Data Innovation had a chat with a leading data scientist at DrivenData, an organization based in Denver, Colorado, that runs data science competitions to create AI solutions for social good. We talked about how DrivenData has helped develop models to identify endangered species and how privacy-enhancing techniques can help unlock sensitive data for social good. The interview has been edited.
Hodan Omaar: Tell me about DrivenData’s machine learning competitions.
Jay Qi: In DrivenData’s online machine learning competitions, data scientists from around the world compete to build the best algorithms for impactful real-world problems. Each solution’s performance is evaluated automatically and displayed on a live leaderboard, a structure that has been shown to raise both the peak performance and the level of engagement achieved on machine learning problems. Our focus is on social good applications, and we’ve run over 65 competitions spanning areas such as sustainability, health, social media moderation, and more. The winning models’ code and documentation are open-sourced to serve as accessible resources.
Omaar: Can you share your most interesting real-world impacts?
Qi: We’ve tackled a wide range of problems, from identifying hateful content on social media, predicting cancer relapse from skin melanoma images, and estimating fresh water in snowpacks for water management to analyzing geochemical data collected by rovers on Mars. One of my favorites was our “Where’s Whale-do” challenge to identify individual beluga whales from photographs, an important task for endangered population research.
Omaar: How do you help organizations see the potential of their data?
Qi: Organizations are excited about AI and large language models (LLMs) but need help applying their data effectively. We work to address their specific needs and guide them through the process of succeeding with data science.
Omaar: If there were a type of data you could unlock to best serve social good, what would it be?
Qi: Sensitive data about people holds a lot of potential for helping others but is often locked due to privacy and safety concerns. We’re following the development of privacy-enhancing technologies like differential privacy, federated learning, and homomorphic encryption to address this issue.
Omaar: How important is open source to your work?
Qi: By making winning models and competition data open source, we maximize our impact and contribute to a growing body of open-source data science tools. We also release and maintain open-source tools that we think can be generally useful, such as our data science project template and a Python library for cloud file storage. We also publish blog posts and learning resources.