OpenCoder: A Fully Open-Source Code LLM Among the Top Performers
OpenCoder is a fully open-source code LLM with top-tier performance, aimed at research use and released with comprehensive, reproducible data processing and training details.
Code large language models (CodeLLMs) have become essential in areas such as code generation, reasoning tasks, and intelligent agent systems.
Although open-source CodeLLMs are gradually approaching the performance of proprietary models, high-quality CodeLLMs suitable for scientific research remain scarce, especially ones with fully reproducible processes for data cleaning, synthetic data generation, and model training.
This scarcity stems from several challenges, including resource limitations, ethical considerations, and the need to maintain competitive advantages.
To bridge this gap, the research team has released OpenCoder, a series of CodeLLMs that match the performance of leading models while giving the research community comprehensive details of how they were built.
Unlike most previous work, OpenCoder not only releases model weights and inference code but also offers reproducible training data, complete data processing procedures, rigorous experimental ablation results, and detailed training information, providing comprehensive resources for scientific research.
The research team identified the key factors in building a high-quality CodeLLM as follows:
Data quality is crucial: code pre-training data requires refined heuristic rules for cleaning, along with deduplication at the file level.
The pre-training data should incorporate code-related text from web sources.
High-quality synthetic data should be used during the annealing and supervised fine-tuning stages.
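The first factor, heuristic cleaning plus file-level deduplication, can be illustrated with a minimal sketch. This is not OpenCoder's actual pipeline: the function names, the two filter rules, and their thresholds (`max_line_len`, `min_alnum_fraction`) are hypothetical and chosen only for illustration, and the deduplication shown is exact matching on a hash of lightly normalized file content.

```python
import hashlib


def normalize(source: str) -> str:
    """Strip trailing whitespace per line and surrounding blank lines."""
    return "\n".join(line.rstrip() for line in source.strip().splitlines())


def passes_heuristics(source: str, max_line_len: int = 1000,
                      min_alnum_fraction: float = 0.25) -> bool:
    """Hypothetical quality filters; the thresholds are illustrative only."""
    if not source.strip():
        return False  # empty or whitespace-only file
    if max(len(line) for line in source.splitlines()) > max_line_len:
        return False  # very long lines often indicate minified or generated code
    alnum = sum(ch.isalnum() for ch in source)
    return alnum / len(source) >= min_alnum_fraction


def clean_and_dedup(files: dict[str, str]) -> dict[str, str]:
    """Filter files, then keep the first file seen for each distinct content."""
    seen: set[str] = set()
    kept: dict[str, str] = {}
    for path, source in files.items():
        if not passes_heuristics(source):
            continue
        digest = hashlib.sha256(normalize(source).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate at the file level
        seen.add(digest)
        kept[path] = source
    return kept
```

Calling `clean_and_dedup` on a mapping from file paths to file contents returns only the files that pass the filters and are not duplicates of an earlier file. Hashing whole files keeps the comparison at the file level; catching near-duplicates would need a fuzzy method on top of this, which the sketch does not attempt.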
The OpenCoder team aims to promote an in-depth understanding of CodeLLMs through greater openness. The goal is for OpenCoder to serve not just as a powerful model but as an open foundational platform that accelerates research progress, promotes reproducibility in code AI, and narrows the gap between the open-source community and industry.