The Pile: An 800GB Dataset of Diverse Text for Language Modeling

13 points
10 months ago
by tosh

Comments


tetete

Might be useful to those who have not been in the space, but this is old news: it dates from 2020.

10 months ago

tosh

I tried to find out how many "tokens" (I know: it depends on the tokenizer) "The Pile" has, but couldn't find a figure.

As far as I understand, RedPajama has 1.2T tokens (https://github.com/togethercomputer/RedPajama-Data), and its README has a table listing the main parts and how many tokens each contains.
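
For a rough figure, here is a minimal sketch of how one might count tokens over a sample, assuming the dataset is still hosted on the Hugging Face Hub under the id "EleutherAI/pile" (that id, and the 10,000-document sample size, are assumptions on my part), using the GPT-2 BPE tokenizer as one reasonable choice:

    # Rough sketch: stream a sample of The Pile and count GPT-2 BPE tokens.
    # The dataset id "EleutherAI/pile" is an assumption and may have moved.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    ds = load_dataset("EleutherAI/pile", split="train", streaming=True)

    n_docs = 10_000  # sample only; a full pass over ~800GB takes far longer
    total_tokens = 0
    for i, example in enumerate(ds):
        if i >= n_docs:
            break
        total_tokens += len(tokenizer(example["text"])["input_ids"])

    print(f"GPT-2 tokens in first {n_docs:,} documents: {total_tokens:,}")

Extrapolating from a sample like this only gives a ballpark, since the bytes-per-token ratio shifts across the Pile's components (code, prose, non-English text).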

10 months ago

perone

I wrote an article about recent data pipelines as well, if anyone is interested: https://blog.christianperone.com/2023/06/appreciating-llms-d...

10 months ago

startupsfail

Arguably, the “The Pile” approach is the bane of open-source and small-player LLMs: models are shaped by their data, and there is little beauty in “The Pile”.

— anonymous shaper

10 months ago