diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md
new file mode 100644
index 0000000..62e4f11
--- /dev/null
+++ b/Understanding-DeepSeek-R1.md
@@ -0,0 +1,92 @@
+
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 especially exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
+The model is also extremely cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 uses two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
+2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a <think> tag before answering with a final summary.
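
For illustration, here is a minimal Python sketch (not from the paper) of how such a response could be split into its reasoning and final summary, assuming the <think>...</think> tag format:

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split an R1-style response into (chain_of_thought, final_answer),
    assuming the reasoning is wrapped in <think>...</think> tags."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()  # no reasoning block found
    return match.group(1).strip(), response[match.end():].strip()

# Made-up example response:
cot, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
print(cot)     # 2 + 2 = 4
print(answer)  # The answer is 4.
```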
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
+R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by adding limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training strategy: pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
+R1-Zero: Pretrained → RL
+R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
+First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
+Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples (see the sketch after this list).
+Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general abilities.
+Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
+They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
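
To make the rejection-sampling step a bit more concrete, here is a rough Python sketch of the idea. The `generate` and `is_correct` callables are hypothetical stand-ins for the RL checkpoint and a rule-based verifier; the actual DeepSeek pipeline is more involved (it also filters for readability and mixes in non-reasoning data):

```python
def rejection_sample(prompts, generate, is_correct, samples_per_prompt=16):
    """Build SFT data by keeping only verified-correct completions.

    generate(prompt, n)        -> list of n sampled completions (hypothetical)
    is_correct(prompt, answer) -> bool, e.g. exact match against a reference (hypothetical)
    """
    sft_data = []
    for prompt in prompts:
        completions = generate(prompt, samples_per_prompt)
        kept = [c for c in completions if is_correct(prompt, c)]
        if kept:
            # Keep the shortest correct completion to favor clean traces.
            sft_data.append({"prompt": prompt, "completion": min(kept, key=len)})
    return sft_data
```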
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
+The teacher is typically a larger model than the student.
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
+They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
+Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their method particularly interesting is its reliance on straightforward, rule-based reward functions.
+Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected <think>/<answer> formatting, and if the language of the answer matches that of the prompt.
+Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
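
As an illustration of what such rule-based rewards might look like (the weights and heuristics below are made up, not DeepSeek's):

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy reward combining accuracy, formatting, and language consistency."""
    score = 0.0

    # Accuracy reward: does the part after the reasoning block contain the reference answer?
    answer_part = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL)
    if reference_answer.strip() in answer_part:
        score += 1.0

    # Format reward: did the model wrap its reasoning in the expected tags?
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        score += 0.2

    # Language-consistency reward (crude heuristic: the share of non-ASCII
    # characters should roughly match between prompt and completion).
    def non_ascii_ratio(text: str) -> float:
        return sum(ord(ch) > 127 for ch in text) / max(len(text), 1)

    if abs(non_ascii_ratio(prompt) - non_ascii_ratio(completion)) < 0.3:
        score += 0.2

    return score
```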
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates a group of different responses.
+2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
+3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
+4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes minor adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior (steps 3 and 4 are sketched below).
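
Here is a rough PyTorch sketch of steps 3 and 4, ignoring per-token bookkeeping and the KL term against the reference model:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: standardize each reward against the
    mean/std of its own group (one group of responses per prompt).

    rewards: tensor of shape (num_prompts, group_size)
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-4)

def clipped_policy_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped surrogate objective, as used by GRPO
    (the KL penalty against a reference policy is omitted here)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```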
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a reward when the model correctly uses the <think> syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
+Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
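
For reference, a minimal GRPO run with TRL looks roughly like the following (adapted from TRL's documented quickstart at the time of writing; argument names may differ between versions, so treat this as a sketch rather than a recipe):

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any dataset with a "prompt" column works; this one comes from TRL's docs.
dataset = load_dataset("trl-lib/tldr", split="train")

# A toy rule-based reward: prefer completions close to 20 characters long.
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```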
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
+In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
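
One common way to quantify whether the correct answer is already "in" the model's distribution is pass@k: sample n completions, count the c correct ones, and estimate the chance that at least one of k draws is correct. This is not from the R1 paper, it's the standard unbiased estimator from the Codex/HumanEval work, but it makes the "boosting the correct response from TopK" framing concrete:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: P(at least one of k samples is correct),
    given n generated samples of which c were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that solves 3 of 16 samples for some problem:
print(pass_at_k(16, 3, 1))  # 0.1875 -- roughly what single-shot accuracy looks like
print(pass_at_k(16, 3, 8))  # 0.9    -- the correct answer is usually already reachable
```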
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
+Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
+The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this configuration.
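
For what it's worth, the same partial-offload setup can also be expressed through the llama-cpp-python bindings; a minimal sketch (the model path is a placeholder for Unsloth's GGUF files, and I ran my actual experiments through the llama.cpp CLI):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder: first shard of the downloaded GGUF
    n_gpu_layers=29,  # offload 29 layers to the GPU, keep the rest on CPU
    n_ctx=8192,       # context window; adjust to available memory
)

out = llm("Why is the sky blue?", max_tokens=256)
print(out["choices"][0]["text"])
```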
+
Performance:
+
A r/localllama user explained that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU on their local gaming setup.
+Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher.
+We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
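
If you want to script against it rather than use the CLI, the ollama Python package exposes the same local model; a small sketch, assuming the model has already been pulled (the tag deepseek-r1:70b is an assumption, use whatever tag you pulled):

```python
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",  # assumed tag; match it to what `ollama list` shows
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)
print(response["message"]["content"])  # for R1 models this includes the <think> block
```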
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
+[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
+DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
+The Illustrated DeepSeek-R1 - by Jay Alammar.
+Explainer: What's R1 & Everything Else? - Tim Kellogg.
DeepSeek R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
+GitHub - deepseek-ai/DeepSeek-R1.
+deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
+DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
+DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
+DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
+DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
+DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
+DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25).
+- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
+- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file