One of the latest breakthroughs in natural language processing (NLP) is the rise of large language models (LLMs), which are trained on massive text datasets. Several LLMs are publicly available, including Google’s BERT and OpenAI’s GPT-2 and GPT-3. Generative models such as GPT-3 can produce a wide range of content, from basic essays to complex financial models.
Image generated on USP.ai
AI startups such as OpenAI, Hugging Face, Cohere, and AI21 Labs are at the forefront of LLM development, training models with billions of parameters and pushing the boundaries of what these models can achieve.
Among the notable applications of LLMs are AI-based code generators. Here are five examples of code generators that use large language models to produce code:
1. OpenAI Codex
OpenAI Codex, a model built upon GPT-3, is the engine behind GitHub Copilot, a tool developed by GitHub that generates code inside popular development environments such as VS Code, Neovim, and JetBrains IDEs, as well as cloud-based platforms such as GitHub Codespaces. Codex can generate code in many programming languages, including JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, and Bash. The model was trained on billions of lines of publicly available code, including code from GitHub repositories.
OpenAI made the model available through a private beta so that developers and platform companies could build tools and integrations.
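To make the workflow concrete, here is a hand-written illustration (not actual Codex or Copilot output) of the comment-driven pattern such tools support: the developer writes a descriptive comment and a function signature, and the assistant proposes a body. The function name `is_palindrome` and its behavior are hypothetical choices made for this sketch.

```python
# Prompt a developer might type: a descriptive comment plus a signature.
# Check whether a string is a palindrome, ignoring case and
# any characters that are not letters or digits.
def is_palindrome(text: str) -> bool:
    # The kind of body an assistant could suggest
    # (written by hand here, purely for illustration).
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]
```

A suggestion like this still needs review: the developer decides whether the proposed body matches the intent stated in the comment.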
2. Tabnine
Tabnine is not a complete code generator; instead, it enhances the auto-completion functionality of integrated development environments (IDEs). Jacob Jackson originally wrote it in Rust while he was a student at the University of Waterloo, and it has since evolved into a comprehensive AI-powered code completion tool.
Tabnine supports more than 20 programming languages and 15 editors, including popular IDEs such as VS Code, IntelliJ, and Android Studio, as well as Vim. A plan for a team of three developers costs $432 per year.
3. CodeT5
Salesforce researchers developed CodeT5, an open-source programming language model built on Google’s T5 (Text-to-Text Transfer Transformer) framework. CodeT5 was trained on more than 8.35 million code instances, including user-written comments, from publicly accessible GitHub repositories. The team primarily used the CodeSearchNet dataset, which covers Ruby, JavaScript, Go, Python, Java, and PHP. Two additional datasets for C and C# were extracted from BigQuery.
CodeT5 brings three capabilities to software development:
- Text-to-code generation: The ability to generate code based on natural language descriptions.
- Code autocompletion: Completion of entire code functions given the target function name.
- Code summarization: Generating natural language summaries for functions.
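Because CodeT5 builds on T5's text-to-text design, each of the three capabilities above can be framed as mapping one string to another. The sketch below shows that framing with hand-written pairs. The `Example` class, the task labels, and the sample strings are illustrative assumptions, not CodeT5's actual input format or model output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    task: str    # hypothetical task label, for illustration only
    source: str  # text given to the model
    target: str  # text the model is expected to produce


# One small function, viewed through each of the three tasks.
FUNCTION_HEADER = "def larger(a, b):"
FUNCTION_BODY = "    return a if a > b else b"
DESCRIPTION = "Return the larger of two numbers."

EXAMPLES = [
    # Natural-language description in, full function out.
    Example("text-to-code", DESCRIPTION, f"{FUNCTION_HEADER}\n{FUNCTION_BODY}"),
    # Function header in, function body out.
    Example("autocompletion", FUNCTION_HEADER, FUNCTION_BODY),
    # Full function in, natural-language summary out.
    Example("summarization", f"{FUNCTION_HEADER}\n{FUNCTION_BODY}", DESCRIPTION),
]

if __name__ == "__main__":
    for ex in EXAMPLES:
        print(f"[{ex.task}]")
        print("  input: ", repr(ex.source))
        print("  output:", repr(ex.target))
```

Seen this way, text-to-code generation and code summarization are mirror images: the same pair with input and output swapped.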
4. PolyCoder
PolyCoder, created by researchers at Carnegie Mellon University, is an open-source alternative to OpenAI’s Codex. Based on OpenAI’s GPT-2 architecture, it was trained on a 249 GB codebase spanning 12 programming languages. Its authors claim that the model writes C code more accurately than other models, including Codex.
While most code generators are not open source, PolyCoder stands out as one of the earliest open-source models for code generation.
5. Cogram
Cogram, a Berlin-based startup backed by Y Combinator, offers a code generation tool designed for data scientists and Python programmers who work with SQL queries and Jupyter Notebooks. Data scientists can write queries in plain English, which Cogram translates into SQL queries that can include joins and grouping. It supports databases such as SQLite, PostgreSQL, MySQL, and Amazon Redshift.
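As an illustration of the English-to-SQL idea, the sketch below pairs a plain-English request with a hand-written query of the kind described above (a join plus grouping) and runs it against an in-memory SQLite database. The query is not actual Cogram output, and the `customers` and `orders` tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3

# Plain-English request a data scientist might type.
REQUEST = "Show the total order amount per customer, highest first."

# Hand-written SQL of the kind such a tool might produce (join + grouping).
QUERY = """
SELECT c.name, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY total DESC
"""


def run_demo() -> list:
    """Build a tiny in-memory database and run QUERY against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Linus');
        INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 40.0);
        """
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(REQUEST)
    for name, total in run_demo():
        print(name, total)
```

Running a generated query against a small throwaway database like this is one simple way to check that it does what the English request asked for before pointing it at production data.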
Cogram integrates with Jupyter Notebooks, allowing Python and Julia developers to generate code automatically. It produces contextual code tailored to specific tasks, taking comments and requirements into account. Data scientists can also use Cogram to generate visualizations with popular Python modules such as Matplotlib, Plotly, or Seaborn.