Now, TikTok's parent company, ByteDance, has also launched a reasoning AI: Seed-Thinking-v1.5! The trend originally started with the announcement of OpenAI's o1 model in September 2024, but it really took off with the launch of DeepSeek R1 in January 2025.
Today, most of the major AI model vendors and trainers appear to be in a new race to deliver better, faster, and cheaper "reasoning" AI language models – that is, models that may take longer to respond to a human user, but ideally give better, more comprehensive, and more logically reasoned answers. These models achieve this through "chain-of-thought" reasoning, i.e., reflecting on their own conclusions and verifying their accuracy before answering.
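As a rough illustration of what "reason, verify, then answer" means in practice (this is a generic sketch, not ByteDance's actual prompt or output format), a reasoning model's reply can be thought of as a chain of intermediate steps plus a self-check before the final answer:

```python
# Illustrative sketch only: a generic "reason, then verify, then answer" structure,
# not ByteDance's actual prompt or response format.
prompt = (
    "Solve the problem step by step. "
    "Before giving the final answer, re-check each step for mistakes.\n\n"
    "Problem: A train travels 180 km in 2.5 hours. What is its average speed?"
)

# The kind of output a chain-of-thought reasoning model aims to produce:
reasoning_trace = [
    "Average speed = distance / time.",
    "distance = 180 km, time = 2.5 h, so speed = 180 / 2.5 = 72 km/h.",
    "Check: 72 km/h * 2.5 h = 180 km, which matches the given distance.",
]
final_answer = "72 km/h"
```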
ByteDance, the Chinese web media giant and TikTok's parent company, has recently joined the ranks by unveiling a technical paper underpinning the upcoming launch of its large language model (LLM) Seed-Thinking-v1.5. The model is designed to improve reasoning performance in science, technology, engineering, and mathematics (STEM) fields as well as general-purpose domains.
Currently, the model is not available for download or use, and its license terms are unclear – whether it will be proprietary/closed-source, open-source/free for everyone to use and modify at will, or somewhere in between. However, the technical paper contains some important details worth knowing in advance.
Built on the increasingly popular Mixture-of-Experts (MoE) architecture Like Meta's new Llama 4 and Mistral's earlier Mixtral, Seed-Thinking-v1.5 uses a Mixture-of-Experts (MoE) architecture.
This architecture is designed to make models more efficient, essentially combining the capabilities of multiple models into one, with each specializing in a different domain. Here, the MoE design means Seed-Thinking-v1.5 uses only 20 billion of its 200 billion parameters at any one time.
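To make the routing idea concrete, here is a minimal sketch of top-k expert selection in the spirit of an MoE layer; the layer sizes, the top-k value, and the use of plain NumPy are illustrative assumptions, not details from the Seed-Thinking-v1.5 paper:

```python
import numpy as np

# Minimal sketch of Mixture-of-Experts routing: only the top-k experts chosen by a
# router are run for each token, so only a fraction of all parameters is active.
# Sizes and top_k are illustrative assumptions, not Seed-Thinking-v1.5's real config.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

router_w = rng.normal(size=(d_model, n_experts))               # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) token representation -> (d_model,) output."""
    logits = x @ router_w                                       # score each expert
    top = np.argsort(logits)[-top_k:]                           # pick the top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over the chosen experts
    # Combine only the selected experts' outputs; the other experts stay idle.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.normal(size=d_model))
print(out.shape)  # (64,)
```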
In its technical paper published on GitHub, ByteDance says that Seed-Thinking-v1.5 prioritizes structured reasoning and thoughtful answer generation.
The results speak for themselves: across numerous third-party benchmarks, Seed-Thinking-v1.5 not only outperforms DeepSeek R1 but also approaches Google's newly released Gemini 2.5 Pro and OpenAI's o3-mini-high reasoner in reasoning performance. It even surpasses both of those models on the ARC-AGI benchmark, a test widely seen as a milestone toward artificial general intelligence, the "holy grail" of AI, which OpenAI defines as a system that outperforms humans at most economically valuable tasks.
Positioned as a compact yet powerful alternative to larger state-of-the-art models, Seed-Thinking-v1.5 achieves competitive benchmark results and introduces innovations in reinforcement learning (RL), training data curation, and AI infrastructure.
Performance Benchmarks & Model Highlights Seed-Thinking-v1.5 performed well across a range of challenging tasks: 86.7% on AIME 2024, 55.0% pass@8 on Codeforces, and 77.3% on the GPQA science benchmark. These results place it close to, or on par with, OpenAI's o3-mini-high and Google's Gemini 2.5 Pro on specific reasoning metrics.
On non-reasoning tasks, the model achieved an 8.0% higher win rate than DeepSeek R1 in human preference comparisons, suggesting that its advantages are not limited to logic- or math-heavy challenges.
In response to the saturation of standard benchmarks such as AIME, ByteDance introduced BeyondAIME, a new and harder math benchmark with carefully curated problems designed to resist rote memorization and better differentiate model performance. Both BeyondAIME and the Codeforces evaluation set are expected to be released publicly to support future research.
Data strategy Training data played a central role in the model's development. For supervised fine-tuning (SFT), the team curated 400,000 samples, including 300,000 verifiable questions (covering STEM, logic, and programming tasks) and 100,000 non-verifiable questions, such as creative writing and role-playing.
For reinforcement learning, the data is divided into two categories: Verifiable problems: 100,000 carefully screened STEM questions and logic puzzles with standard answers, drawn from elite competitions and expert review; Non-verifiable tasks: human-preference datasets emphasizing open-ended prompts, evaluated with a pairwise reward model.
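A hedged sketch of how such a split might be represented and routed to different reward signals; the field names and the `check_answer` / `preference_score` helpers are hypothetical placeholders, not ByteDance's actual schema:

```python
# Hypothetical sketch of routing RL training prompts to different reward signals.
# Field names and helper functions are placeholders, not ByteDance's real schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RLSample:
    prompt: str
    reference_answer: Optional[str] = None   # present only for verifiable problems

def check_answer(response: str, reference: str) -> float:
    """Toy verifiable reward: 1.0 if the response contains the reference answer."""
    return 1.0 if reference.strip() in response else 0.0

def preference_score(prompt: str, response: str) -> float:
    """Placeholder for a pairwise/preference reward model for open-ended tasks."""
    return 0.5  # a real system would call a learned reward model here

def reward(sample: RLSample, response: str) -> float:
    if sample.reference_answer is not None:               # verifiable: compare to a standard answer
        return check_answer(response, sample.reference_answer)
    return preference_score(sample.prompt, response)      # non-verifiable: preference model

print(reward(RLSample("What is 2+2?", "4"), "The answer is 4"))  # 1.0
```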
Within the verifiable set, the STEM data leans heavily on advanced mathematics, which accounts for over 80% of the problem set. Additional logic data includes Sudoku and 24-point puzzles, whose difficulty can be flexibly adjusted as the model improves.
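As a toy illustration of difficulty-adjustable logic data (the generator below is an assumption for illustration only, not the paper's method), one can produce 24-point-style arithmetic puzzles whose difficulty is tuned by how many numbers must be combined:

```python
import itertools, operator, random

# Toy generator for "24-point"-style arithmetic puzzles with an adjustable difficulty
# knob (how many numbers must be combined). Purely illustrative; not the paper's method.
OPS = [operator.add, operator.sub, operator.mul]   # division omitted to keep arithmetic exact

def solvable(numbers, target):
    """Brute-force check: can the numbers, in some order, be combined to hit the target?"""
    for perm in itertools.permutations(numbers):
        for ops in itertools.product(OPS, repeat=len(numbers) - 1):
            value = perm[0]
            for op, n in zip(ops, perm[1:]):       # left-to-right evaluation, for simplicity
                value = op(value, n)
            if value == target:
                return True
    return False

def make_puzzle(n_numbers=4, target=24, seed=0):
    """More numbers -> larger search space -> harder puzzle."""
    rng = random.Random(seed)
    while True:
        nums = [rng.randint(1, 9) for _ in range(n_numbers)]
        if solvable(nums, target):
            return nums

print(make_puzzle(n_numbers=4))  # four digits that can be combined to reach 24
```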
Reinforcement Learning Methods For reinforcement learning, Seed-Thinking-v1.5 adopts custom actor-critic (VAPO) and policy-gradient (DAPO) frameworks, both developed to address instability in RL training. These techniques mitigate reward-signal sparsity and improve training stability, especially in long chain-of-thought (CoT) settings.
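The exact VAPO and DAPO objectives are described in the paper; purely as a generic illustration of the kind of update such frameworks stabilize (a clipped policy-gradient step driven by an advantage signal), here is a NumPy sketch. It is not VAPO or DAPO itself:

```python
import numpy as np

# Generic clipped policy-gradient objective (PPO-style), shown only to illustrate the
# kind of update that frameworks like VAPO/DAPO stabilize. This is NOT their exact loss.
def clipped_pg_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """logp_new/logp_old: per-token log-probs under the new/old policy; advantages: reward signal."""
    ratio = np.exp(logp_new - logp_old)                        # importance ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))            # maximize the clipped surrogate

logp_old = np.array([-1.2, -0.8, -2.0])
logp_new = np.array([-1.0, -0.9, -1.5])
advantages = np.array([0.5, -0.2, 1.0])                        # e.g. verifier reward minus a value baseline
print(clipped_pg_loss(logp_new, logp_old, advantages))
```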
Reward modeling plays a key role in supervising RL outputs. ByteDance introduced two important tools: Seed-Verifier, a rule-based LLM that checks whether a generated answer is mathematically equivalent to the reference answer; and Seed-Thinking-Verifier, a step-by-step reasoning-based evaluator designed to improve judgment consistency and prevent reward hacking.
This two-tier reward system enables nuanced evaluation of both straightforward and complex tasks.
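A hedged sketch of how a two-tier verifier might be wired together; the `fast_equivalence_check` and `llm_judge` functions below are hypothetical stand-ins, not the actual Seed-Verifier / Seed-Thinking-Verifier implementations:

```python
# Hypothetical two-tier reward sketch: a cheap equivalence check first, with a slower
# reasoning-based judge as the fallback. The helpers are stand-ins, not the real
# Seed-Verifier / Seed-Thinking-Verifier.
def fast_equivalence_check(answer: str, reference: str) -> bool:
    """Tier 1: cheap normalization + comparison (stand-in for a Seed-Verifier-style check)."""
    norm = lambda s: s.strip().lower().replace(" ", "")
    return norm(answer) == norm(reference)

def llm_judge(question: str, answer: str, reference: str) -> float:
    """Tier 2: placeholder for a step-by-step reasoning evaluator (Seed-Thinking-Verifier-style)."""
    return 1.0 if reference in answer else 0.0   # a real system would call an LLM here

def reward(question: str, answer: str, reference: str) -> float:
    if fast_equivalence_check(answer, reference):    # easy cases settled cheaply
        return 1.0
    return llm_judge(question, answer, reference)    # ambiguous cases get the deeper check

print(reward("2+2?", "Four ", "four"))  # 1.0 via the fast check
```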
Infrastructure & Scalability To support efficient large-scale training, ByteDance built a system on its HybridFlow framework, with execution handled by Ray clusters and the training and inference processes co-located to reduce GPU idle time.
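The exact HybridFlow/Ray setup is not spelled out publicly in detail; the sketch below only illustrates the general pattern of co-locating rollout (generation) and training work on the same Ray actor so the hardware is not idle between phases. The class and method names are assumptions, not ByteDance's actual API:

```python
import ray

# Illustrative-only sketch of co-locating rollout and training on the same worker,
# in the spirit of the HybridFlow/Ray setup described in the paper.
ray.init(ignore_reinit_error=True)

@ray.remote
class ColocatedWorker:
    def __init__(self):
        self.step = 0

    def rollout(self, prompts):
        # In a real system this would run model inference to generate responses.
        return [f"response to: {p}" for p in prompts]

    def train(self, rollouts):
        # In a real system this would run a gradient step on the same GPU,
        # so the hardware is not idle while waiting for a separate trainer.
        self.step += 1
        return self.step

worker = ColocatedWorker.remote()
rollouts = ray.get(worker.rollout.remote(["prompt A", "prompt B"]))
print(ray.get(worker.train.remote(rollouts)))  # 1
```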
The Streaming Rollout System (SRS) is a notable innovation that accelerates iteration by decoupling model evolution from runtime execution and asynchronously managing parts of the generation process across model versions. This architecture is claimed to deliver up to 3x faster reinforcement learning cycles.
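As a rough sketch of the decoupling idea (generation keeps streaming into a queue while the trainer consumes whatever is ready, regardless of which model version produced each sample), here is a minimal producer/consumer illustration; it is not the actual SRS implementation:

```python
import queue, threading

# Minimal producer/consumer illustration of the "streaming rollout" idea: generation is
# decoupled from training, so samples from older model versions can still be consumed.
# This is a toy sketch, not ByteDance's SRS implementation.
rollout_queue = queue.Queue()

def generator(model_version: int, n: int):
    for i in range(n):
        # Each item records which model version produced it.
        rollout_queue.put({"version": model_version, "sample": f"rollout {i}"})

def trainer(total: int):
    for _ in range(total):
        item = rollout_queue.get()   # consume whatever is ready, from any version
        print(f"training on {item['sample']} (from model v{item['version']})")

threads = [threading.Thread(target=generator, args=(v, 3)) for v in (1, 2)]
for t in threads:
    t.start()
trainer(total=6)
for t in threads:
    t.join()
```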
In addition, other infrastructure techniques include: - Mixed-precision (FP8) training to save memory; - Expert parallelism and kernel auto-tuning to improve MoE efficiency; - ByteCheckpoint for robust and flexible checkpointing; - AutoTuner for optimizing parallelism and memory configurations.
Human Evaluation & Real-World Impact To assess alignment with human-centered preferences, ByteDance conducted human evaluations across a range of domains, including creative writing, humanities knowledge, and everyday conversation.
Across all of these sessions, Seed-Thinking-v1.5 consistently outperformed DeepSeek R1, further supporting its applicability to real-world user needs.
The development team noted that reasoning models trained primarily on verifiable tasks also generalized well to creative domains, an ability they attribute to the structure and rigor embedded in the mathematical training workflows.
What this means for technology leaders, data engineers, and enterprise decision-makers For technology leaders who manage the entire lifecycle of large language models, from data curation to deployment, Seed-Thinking-v1.5 offers an opportunity to rethink how reasoning capabilities are integrated into the enterprise AI stack.
Its modular training process, which combines verifiable reasoning datasets with multi-stage reinforcement learning, is particularly appealing to teams that want to scale LLM development while retaining fine-grained control.
ByteDance's Seed-Verifier and Seed-Thinking-Verifier can be seen as more trustworthy reward modeling mechanisms, which are especially critical when deploying models in customer-facing or regulated environments.
For teams operating under tight deadlines and limited resources, the stability Seed-Thinking-v1.5 demonstrates under reinforcement learning (thanks to innovations such as VAPO and dynamic sampling) promises to shorten iteration cycles and streamline fine-tuning for specific tasks.
From an orchestration and deployment perspective, the model's hybrid infrastructure approach, including the Streaming Rollout System (SRS) and FP8 optimization support, points to significant gains in training throughput and hardware utilization, which is valuable for engineers tasked with scaling LLM operations across cloud and on-premises systems.
In addition, Seed-Thinking-v1.5 employs mechanisms for dynamically adjusting reward feedback at runtime during training, which directly addresses the challenges of managing heterogeneous data pipelines and maintaining consistency across domains.
For teams tasked with ensuring reliability, reproducibility, and the continuous integration of new tools, the system-level design of Seed-Thinking-v1.5 can serve as a blueprint for building robust multimodal orchestration systems.
For data engineering professionals, this structured approach to training data—including rigorous filtering, data augmentation, and expert validation—further reinforces the importance of data quality as a model performance multiplier and may inspire a more intentional dataset development and validation process.
Future Outlook Seed-Thinking-v1.5 is the result of collaboration within ByteDance's Seed LLM Systems team, led by Yonghui Wu, with long-time AI contributor Haibin Lin as its public representative.
The project also builds on previous efforts, such as Doubao 1.5 Pro, and incorporates shared techniques in RLHF and data curation.
The team plans to keep refining its reinforcement learning techniques, with a focus on training efficiency and reward modeling for non-verifiable tasks. It also plans to publicly release internal benchmarks such as BeyondAIME to drive broader progress in reasoning-focused AI research.