Introducing the OpenAI Codex: a platform for sharing examples and training data in artificial intelligence

A blog to help people share their data in order to improve research.

OpenAI is a non-profit AI research company, discovering and enacting the path to safe artificial general intelligence.

Today we are releasing a platform for sharing examples and training data in artificial intelligence: https://openai.com/codex/. In the coming months, we’ll be building a collection of datasets that represent interesting tasks in AI. We hope it will help people to share their data and increase collaboration.

Motivation

Today, machine learning systems are trained using data from an ever-increasing range of domains. Many of these datasets are unpublished or difficult to access. We believe that sharing this data will improve research and make it easier for researchers to build on each other’s work.

In our research on text generation, we found that many promising approaches were described in papers but not accompanied by open-source code, making it hard to replicate results or compare new models with existing ones. For example, at the time we were unable to find a single implementation of the Transformer model’s decoder (the part responsible for generating text), despite this being the most common model architecture used in modern language generation tasks.

In addition, many ML projects have very little data or training code associated with them — making it hard to find relevant code examples or test new ideas quickly.

This blog post introduces the OpenAI Codex, a platform for sharing examples and training data in artificial intelligence.

We’ve created this tool because we have found that large sets of examples can be extremely useful for developing AI. For example, whether we were demonstrating deep reinforcement learning that plays Atari games from raw pixels, or working with MNIST handwritten digits or ImageNet images, we had to train on millions of examples.

The examples from our experiments are not always immediately useful to others: the data is in a different format, the dataset contains private information, or there isn’t enough compute available. But sometimes it would be useful if others could access these datasets and use them for their own research or for training models for deployment. For example:

1) Researchers at OpenAI developed new algorithms by using the game replays generated by human players in the Dota 2 competition as inputs to their learning algorithms.

2) Our image classifier generates more accurate results when trained on additional data from humans labeling images as containing robots.

3) When building chatbots that respond to text input, it can be helpful to train an LSTM model on a large corpus of text that was written in English by humans (such as Wikipedia).

4) A potential application of GANs is

The OpenAI Codex is a platform for sharing datasets and training data in artificial intelligence. We believe that making this kind of material available to the community will help with both research and practice. We hope that this type of sharing will become common in AI.

This is an early release, with some of our own examples to start things off, and we hope creators will add more examples in the coming months. Examples include:

OpenAI has released the OpenAI Codex, a platform for sharing data in AI. We believe that increased sharing can help AI researchers train better models on existing datasets, and build new datasets more quickly.

Codex is designed to be simple and flexible. It lets you upload any data you want, including both text and images. You specify a schema, which can include lists of names and numbers, dates, or images. Users can then search for specific examples using those fields.

For example, if you’re training a model to recognize faces in photos and find a photo with multiple people in it, you could upload your photo with a schema that includes name (text) and headshot (image). Then someone else who wants to train their model on photos of Elon Musk could search for “Elon Musk,” find your photo among the results, and grab the headshot to use as training data.
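
The schema-and-search idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Codex implementation: the `Dataset` class, its `upload` and `search` methods, the set of field types, and the file paths are all hypothetical, and images are stood in for by placeholder paths.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical field types, loosely based on the ones the post mentions.
ALLOWED_TYPES = {"text", "number", "date", "image"}


@dataclass
class Dataset:
    """A toy in-memory dataset: a schema plus the examples that follow it."""
    schema: Dict[str, str]  # field name -> field type
    examples: List[Dict[str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject schemas that use a field type we do not know about.
        unknown = set(self.schema.values()) - ALLOWED_TYPES
        if unknown:
            raise ValueError(f"unsupported field types: {sorted(unknown)}")

    def upload(self, example: Dict[str, str]) -> None:
        """Accept an example only if its fields match the schema exactly."""
        if set(example) != set(self.schema):
            raise ValueError("example fields do not match the schema")
        self.examples.append(example)

    def search(self, field_name: str, query: str) -> List[Dict[str, str]]:
        """Return examples whose field contains the query (case-insensitive)."""
        if field_name not in self.schema:
            raise KeyError(field_name)
        q = query.lower()
        return [e for e in self.examples if q in str(e[field_name]).lower()]


# One uploader describes a photo with a name (text) and a headshot (image).
faces = Dataset(schema={"name": "text", "headshot": "image"})
faces.upload({"name": "Elon Musk", "headshot": "photos/group_01_crop_a.png"})
faces.upload({"name": "Ada Lovelace", "headshot": "photos/group_01_crop_b.png"})

# Someone else searches by name and collects the matching headshots.
matches = faces.search("name", "elon musk")
headshots = [m["headshot"] for m in matches]
```

Matching on a substring of a single field keeps the sketch small; a real service would presumably need indexing and access controls, which the post does not describe.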

We’re releasing Codex with two datasets: one text dataset (Reddit comments), and one image dataset (ImageNet bounding boxes). We hope it’ll be useful for research into artificial general intelligence (AGI), where finite amounts of labeled data are especially limiting. We expect this project to take many years of effort from the community.

We are excited to announce the launch of the OpenAI Codex: a platform for sharing examples that can be used to train AI systems. The Codex is inspired by the long history of machine learning research built on shared datasets — such as MNIST, CIFAR-10, ImageNet, and WordNet. These datasets have enabled researchers to train new models and move the field forward.

The Codex is an attempt to build a complementary platform for sharing training data in text-based domains like translation, summarization, conversations, and question answering. We hope it will enable researchers in academia and industry to share training data more easily, accelerate research in these domains, and help us find new applications for existing models.

The Codex is structured as a GitHub repository with over 1,000 curated problems from many different sources, including our own research and others we’ve found online. Problems come in three different forms:

1) Raw text for translation or summarization

2) Dialogues for chatbots or dialogue agents

3) Questions and answers for question answering
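
As a rough illustration of the three forms listed above, here is a minimal Python sketch that validates one record of each kind. The post does not specify a file format, so the JSON layout, the `kind` field, the key names, and the `parse_problem` helper are all assumptions made for this example, and the sample records are invented.

```python
import json
from collections import Counter

# Hypothetical record layout: one required-key set per problem form.
REQUIRED_KEYS = {
    "text": {"source"},            # raw text for translation or summarization
    "dialogue": {"turns"},         # dialogues for chatbots or dialogue agents
    "qa": {"question", "answer"},  # questions and answers
}


def parse_problem(line: str) -> dict:
    """Parse one JSON record and check it has the keys its kind requires."""
    record = json.loads(line)
    kind = record.get("kind")
    if kind not in REQUIRED_KEYS:
        raise ValueError(f"unknown problem kind: {kind!r}")
    missing = REQUIRED_KEYS[kind] - set(record)
    if missing:
        raise ValueError(f"{kind} problem is missing {sorted(missing)}")
    return record


# Invented sample records, one per form.
sample_lines = [
    '{"kind": "text", "source": "A paragraph to summarize."}',
    '{"kind": "dialogue", "turns": ["Hi!", "Hello, how can I help?"]}',
    '{"kind": "qa", "question": "What is 2 + 2?", "answer": "4"}',
]
problems = [parse_problem(line) for line in sample_lines]
counts = Counter(p["kind"] for p in problems)
```

Tagging each record with its kind lets one loader handle all three forms while still rejecting malformed entries early.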

At OpenAI, we’re motivated by a single goal: to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return. As we work towards this goal, we’ve realized the importance of sharing our research and tools with others. Since our founding in December 2015, we’ve released many leading-edge generative models and reinforcement learning algorithms for free in the hopes of improving communication and collaboration within the AI community.

Today, we’re excited to announce another important step towards this goal: the release of the OpenAI Codex. The Codex is a platform for researchers and developers to share machine learning models and training data. Why do this?

First, researchers are continually looking for new datasets on which to test their models. This has been especially true of deep learning research over the last five years; groups have often evaluated their work on the same few datasets, which makes it hard for readers to judge how results carry over to different tasks. We hope that by making new datasets available through the Codex, researchers will be able to better evaluate their own work against prior efforts without having to spend time collecting new data themselves.

Second, as machine learning moves into increasingly complex domains like images or audio, it’s harder than ever for us