How the DeepSeek-R1 AI model was taught to teach itself to reason | Explained

The story so far: For many decades, one of the great challenges in artificial intelligence (AI) has been teaching machines to reason. Reasoning goes beyond memorising facts or completing sentences. It’s the ability to follow steps, reflect on mistakes, and adjust strategies until the right answer is found.

Humans use reasoning for everything from solving maths problems to writing computer programmes, from negotiating their daily lives to deciding whom to vote for. Large language models (LLMs) such as GPT-4 or DeepSeek-V3 have surprised scientists by showing signs of reasoning when scaled to large sizes. Another method, called chain-of-thought prompting, where the model is nudged to “think step by step”, has also boosted performance.

But both these approaches come with limits. Training models to reason usually demands human-made examples: people show an AI model how to solve problems and the AI learns to copy the method. This is slow, expensive, and introduces human biases. It also caps the AI’s creativity because the model can’t explore problem-solving methods that humans didn’t think of.

In a paper published in Nature on September 17, the DeepSeek-AI team reported that it was able to teach its model, called just R1, to reason by asking an ambitious question: what if we allowed the model to teach itself to reason without showing it human examples first? That is, they found that R1 could develop new forms of reasoning using reinforcement learning, a method of trial and error guided only by rewards for correct answers.

What is reinforcement learning?

The team’s aim was to make the model smarter at maths and coding as well as to uncover how reasoning behaviours might emerge naturally when a machine is given the proper incentives.

DeepSeek researchers began with V3 Base, a large language model similar to other state-of-the-art systems. Instead of using the usual supervised fine-tuning, where humans provide the reasoning steps, they applied ‘group relative policy optimisation’, a reinforcement learning method designed for efficiency.
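The "group relative" idea can be illustrated with a small sketch. The model samples several answers to the same question, and each answer is scored against the average of its own group rather than against a separately trained judge. The function name, the use of the population standard deviation, and the small `eps` term are assumptions made for illustration; the full method also includes a policy update step that is not shown here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other answers in its group.

    Illustrative sketch only: subtract the group mean and divide by the
    group spread, so above-average answers get a positive score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps (hypothetical) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four attempts at one problem: two correct (1.0), two wrong (0.0)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Here the two correct attempts receive a score of roughly +1 and the two wrong ones roughly -1. If every attempt in a group earns the same reward, all scores are zero and that group gives the model nothing to learn from.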

In this setup, the model, called R1-Zero at first, was asked to solve mathematical and algorithmic problems. For each attempt, it had to produce two parts: a reasoning process inside `<think>…</think>` tags and a final answer inside `<answer>…</answer>` tags. The only reward came from whether the final answer was correct, judged by rule-based systems like answer keys or code compilers. No one told the model how its reasoning should look.
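A rule-based reward of this kind might look roughly like the sketch below. The function name, the exact-match comparison, and the choice to give zero reward to malformed output are assumptions for illustration, not the team's actual implementation. Note that the reasoning text is extracted but never scored.

```python
import re

# Expected layout: <think>reasoning</think> followed by <answer>final answer</answer>
PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the answer key, else 0.0."""
    match = PATTERN.fullmatch(completion.strip())
    if match is None:
        return 0.0  # assumption: malformed output earns nothing
    answer = match.group(2).strip()
    # Only the final answer is compared; the reasoning is ignored
    return 1.0 if answer == reference.strip() else 0.0


good = rule_based_reward("<think>2 + 2 is 4</think><answer>4</answer>", "4")
bad = rule_based_reward("<think>2 + 2 is 5</think><answer>5</answer>", "4")
```

In this example `good` is 1.0 and `bad` is 0.0, regardless of what either reasoning section says.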

Over thousands of training steps, the model learned by trial and error. If an answer was wrong, the path that led there was discouraged; if it was right, the path was reinforced. Importantly, the researchers also tracked how the model’s thinking time, i.e. the number of tokens it used in its reasoning section, changed. Strikingly, the model began writing longer and more reflective reasoning chains on its own, sometimes including phrases like “wait” or “let’s try again”, revealing an ability to self-correct.
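Tracking of this sort can be approximated with a few lines of code. The sketch below is a simplification under stated assumptions: it counts whitespace-separated words as a stand-in for tokens (a real tokenizer would give different numbers), and the function name and the returned keys are hypothetical.

```python
import re


def reasoning_stats(completion: str) -> dict:
    """Measure the length of the reasoning section and count the word 'wait'."""
    match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    reasoning = match.group(1) if match else ""
    return {
        # assumption: whitespace-separated words approximate tokens
        "length": len(reasoning.split()),
        "wait_count": len(re.findall(r"\bwait\b", reasoning.lower())),
    }


sample = (
    "<think>Try x = 3. Wait, that fails. Let's try again with x = 4.</think>"
    "<answer>4</answer>"
)
stats = reasoning_stats(sample)
```

Logging these two numbers at each training step would show whether the reasoning sections are growing longer and whether self-correcting phrases are becoming more frequent.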

Was there human intervention?

To address weaknesses such as poor readability and mixing English with Chinese, the team built R1 from R1-Zero. This process included adding incentives for consistently using one language and supervised fine-tuning with both reasoning and non-reasoning data. The final model thus inherited the raw reasoning power of R1-Zero while also becoming easier to use and safer.

The results were striking. On the American Invitational Mathematics Examination (AIME) 2024, a tough competition usually attempted by the smartest high-school students, R1-Zero’s accuracy jumped from just 15.6% at the start of training to 77.9% by the end. With more tuning, it reached 86.7%, surpassing the average performance of human students.

At a certain stage, R1-Zero began using the word “wait” more often in its reasoning, much as a human might on spotting a mistake. The researchers said this meant the model wasn’t blindly following a path but actively rethinking steps when something seemed off. In effect, reinforcement learning had coaxed the AI into behaviours that resembled reflection and verification, both elements of reasoning.

The final R1 model was even stronger: it performed well in maths and coding as well as on benchmarks for general knowledge, answering questions, and following instructions. Compared to its predecessors, R1 was also more consistent with its choice of language and better aligned with human preferences for helpfulness and safety. When evaluated with frameworks like AlpacaEval 2.0 and Arena-Hard, which test how well a model follows instructions, R1 improved by 25% and 17%, respectively, gains that are considered large.

What’re the pros and cons of reasoning?

Many large language models, including widely used systems like ChatGPT, often demand large amounts of computational resources during testing. R1, on the other hand, could adapt how much it “thought” depending on the task’s difficulty. Simple problems were met with short reasoning chains while harder ones led to longer, more elaborate chains. This dynamic allocation avoided spending computing power on questions that didn’t warrant it. However, reinforcement learning itself is energy-intensive.

Taken together, the findings confirm that reinforcement learning alone (with the right design) could produce reasoning behaviours that were previously thought to require human examples. This could change the way we think about how intelligence might grow in artificial systems. For instance, in future, researchers could build verifiers that check answers and let the model figure out its own strategies. If the answer to a maths problem, a computer programme or a factual question can be reliably checked, then reinforcement learning can do the rest. This could speed up progress while reducing human labour and bias.
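The appeal of a verifier is that checking an answer is often far easier than finding one. As a toy illustration (the task and the function name are chosen here for the example and do not come from the paper), the sketch below checks a proposed factorisation of a number without knowing or caring how the factors were found.

```python
from math import prod


def verify_factorisation(n: int, factors: list[int]) -> bool:
    """Accept only a genuine split of n into at least two factors above 1."""
    return (
        len(factors) >= 2
        and all(f > 1 for f in factors)
        and prod(factors) == n
    )


accepted = verify_factorisation(91, [7, 13])
rejected = verify_factorisation(91, [7, 12])
```

A check like this can serve directly as the reward signal: the model is free to try any strategy, and only strategies that end in a verified answer are reinforced.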

Indeed, traditional LLM training pipelines bank heavily on large human-labelled datasets: people writing question-answer pairs, reasoning steps, preference judgments, etc. These datasets are expensive and often assembled under exploitative labour conditions. If machines can be taught to reason using reinforcement learning alone, the demand for human-annotated data can shrink, thus also reducing pressure to source cheap labour worldwide. However, the paper also acknowledges that tasks without a clear ground truth still rely on human-labelled data for reward models. So human input is not eliminated; only its scope may shrink to areas where no reliable verifier can be built.

A model that learns to reason will also demand better reward signals for open-ended tasks like writing, which is difficult, as well as stronger safeguards as it becomes capable of generating dangerous or manipulative content. In fact, watching a machine develop reflective behaviour (pausing, checking, revising, etc.) raises questions about how far such systems can go. If reasoning emerges from incentives rather than instructions, could creativity or deeper forms of understanding emerge in the same way?

Time will tell — unless DeepSeek-R1 figures it out first.

Published – September 17, 2025 08:30 pm IST
