The Bloom Open-Access Multilingual Language Model By BigScience

This is part of my live-learning series! I will be updating this post as I continue through my journey. I apologize for any grammatical errors or incoherent thoughts. This is a practice to help me share things that are valuable without falling apart from the pressure of perfection.


– 176-billion-parameter large language model

– 46 natural languages and 13 programming languages

– Available to download, run and study from Hugging Face

– Responsible AI and environmental impact

– Continued improvements with community involvement
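To get a feel for what "176 billion parameters" means for anyone hoping to download and run the model, here is a rough back-of-the-envelope sketch. It only multiplies the parameter count by the storage size of common number formats. The byte sizes are the standard ones for those formats, not figures taken from the BLOOM documentation, and the helper name `weight_memory_gb` is made up for this illustration. It also ignores everything beyond the raw weights (activations, caches, framework overhead), so treat the results as a lower bound rather than a hardware requirement.

```python
# Back-of-the-envelope estimate of the memory needed just to hold the weights.
PARAMETERS = 176_000_000_000  # "176 billion parameters" from the list above

# Bytes per parameter for common numeric formats (standard sizes, assumed here;
# this is not a statement about how BLOOM is actually distributed).
BYTES_PER_PARAM = {"float32": 4, "bfloat16/float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Return the size of the raw weights in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for fmt, nbytes in BYTES_PER_PARAM.items():
        print(f"{fmt:>18}: {weight_memory_gb(PARAMETERS, nbytes):,.0f} GB")
```

Even at one byte per parameter, the weights alone would not fit on a typical laptop or a single consumer GPU, which is part of why a hosted way to try the model matters for people without large servers.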

Resources

BLOOM
bigscience/bloom · Hugging Face
IDRIS – Jean Zay: Introduction
IDRIS – The Missions and Objectives of IDRIS
The BigScience RAIL License
A year in the making, BigScience’s AI language model is finally available | TechCrunch
Okay, the GPT-3 hype seems pretty reasonable | TechCrunch
Hugging Face reaches $2 billion valuation to build the GitHub of machine learning | TechCrunch
The emerging types of language models and why they matter | TechCrunch
Here are a few ways GPT-3 can go wrong | TechCrunch
Genci grand équipement national de calcul intensif
IDRIS – Pierre-François Lavallée
AI21 Labs
2004.08900.pdf
Open source NLP is fueling a new wave of startups | VentureBeat
2107.03451.pdf
Study shows AI-generated fake reports fool experts
Home | Cohere
OpenAI
BigScience Research Workshop (@BigscienceW) / Twitter
BigScience Research Workshop on Twitter: “BLOOM is here. The largest open-access multilingual language model ever. Read more about it or get it at https://t.co/mE013I62In https://t.co/KrBRVklXLf https://t.co/onmQu6MxJc” / Twitter
Michelle Manafy on Twitter: “Inside a radical new project to democratize AI A group of over 1,000 AI researchers has created a multilingual large language model bigger than GPT-3—and they’re giving it out for free. https://t.co/EKybqTP3wr @techreview @BigScienceLLM” / Twitter
BLOOM: Inside the radical new project to democratize AI | MIT Technology Review
Code Projected Over Woman · Free Stock Photo

YouTube Video

Automated Transcription

Hello, hello, Tyler Bryden here. Hope everything's going well today. I want to talk about BLOOM, and this is very exciting. It is the world's largest open multilingual language model. Just to set the stage a little bit: there are some companies right now who are building large language models, and they're basically scraping the vast web of all human-generated text across the Internet. Then they are enabling you to interact with the model that they've built. The companies doing that, famously, are OpenAI and Cohere.

Google has its own model of this. I'm guessing Amazon and Microsoft are working towards this, and the challenge here is that these models are very expensive to train. As a great article here by TechCrunch puts it, it's very expensive, $1,000,000 plus, to train these large language models, and then generally they are held by private companies with private incentives, and that limits the access that people can have to them. It can be very costly: if you look at tapping into the OpenAI GPT-3 API, it's expensive, and they're trying to figure out how to price DALL-E right now. I'm sure that it will be somewhat expensive. Really, all this does is create a bit of conflict in the actual training of the model: the language that it's trained on, the goals of the model, and the output that it has, and then again the responsibility, the ethics that are built into these foundational language models, which are just early in their impact on the world today. This will continue to grow.

There's more and more work going into these, and so, in this example, BLOOM was a joint initiative by Hugging Face, GENCI and IDRIS. They came together understanding this larger context: the need to make these large language models more accessible, knowing the cost. They used $7 million of public funds and grants to make this model actually come together, and then released it in an open source manner so that researchers could download it, run it and study it. It's accessible right on Hugging Face, and there's even a hosted inference API here, so that if you don't have access to large servers and computing, you can still interact with the system. So this is a huge release and announcement. I'm really interested to see how this impacts companies like OpenAI and Cohere, both from a business perspective and also, I think, a perception perspective.

These are private institutions with investment, with goals to return that investment, and that has an impact on the decisions that are made and on how the business model is structured. Whereas this is a much more open-science approach, allowing you to interact with the system and understand it better, and I think we're going to see a lot of innovation come out of it. The other thing that really stuck out to me, which I think is fantastic, is that it was trained on many different languages. So 46, I believe. Yeah, 46 languages. There is a full printout in this thread here, which is awesome to see. There are some questions: why not Dutch? Where is the German? But then some other people are excited about Tamil and some more underserved languages.

Traditionally underserved languages are actually available in this model, which then allows you to generate text in those languages, which is very exciting. We're seeing a push towards this: Facebook released automatic translation and understanding for 200-plus languages. There's more and more work being done to serve underserved languages that were not necessarily available in these models and are not necessarily part of the more private models. So there is something really exciting about this. It's a massive language model.

At 176 billion parameters, it is about the size of the GPT model that OpenAI has released, and overall a lot of work went into this: a year, a thousand volunteer researchers, ethicists, philosophers and legal scholars all came together to bring this to life. And this is where it's actually going live now, which is really exciting. Then there's the comparison with OpenAI and DeepMind, and the work across these multiple languages. The other part that stuck out as really fascinating to me is this idea of responsible AI: building in ethics and thinking through the consequences. There have been a lot of challenges with large language models because of the data that gets pulled off the web. The toxicity, the hate, all the bad things that are part of the Internet, along with all the good things, are pulled into these models. So then, if I'm generating text, you need to think about what the possible outputs are.

So private companies are trying to understand how to deal with that, and it seems like there was a much more intentional approach here, just because of the fundamental collaboration behind BLOOM, that has made it come from a more ethical starting point. It's really cool: here you can see the languages that are included. And there's this idea of what you can use it for: what is direct use, what is out-of-scope use, and what is misuse. These are things that are really important. I have experience with GPT-3 and with DALL-E right now, and have noticed some very hard stops on the keywords they let you use to generate images with DALL-E, for example. I think Hugging Face and everyone who is part of this will continue to monitor and figure out: how are people using this? What are they studying? What are the possible outputs that could be dangerous for the world?

Dangerous for humanity? I think overall this is an important part of this work, because we've almost opened up a Pandora's box. What does this mean for us as people who are impacted by the consequences of multilingual language generation across the entire Internet, at speeds that humans are not capable of? There are huge consequences to this if it's not done right. So there are some really interesting things here about the intended users; almost anyone can access this. There was one other part that I wanted to touch on, which I thought was really interesting and worthwhile: the environmental impact. The training supercomputer they used is Jean Zay. I think I have this website open, but I'll open it again.

Just a sec. Yeah, the supercomputer here was used, and again this took a long time to train. I forget exactly the length, but it was several months of training after a year plus of work. And, OK, there we go: started in March, ended in July. So it's very recent that this has come to life and been published. They're also reporting on the impact: the supercomputer uses mostly nuclear energy, and they list the heat generated, the estimated carbon emissions and the estimated electricity usage. I think this is something that we're going to see more of.

That's true from a responsible AI standpoint, as we look at the carbon impact of training these models, and I think generally it's a good trend to have as we monitor and understand the impact on the environment that we have as we're building these large language models, and technology in general. As everything moves towards AI and machine learning, there are a lot of things that are really valuable and a lot of things that are not necessarily valuable, and all of those are consuming vast amounts of energy. That is something for us to consider. So I'll go back to the original announcement, right on the BigScience subdomain of huggingface.co. What I also like and am excited about is that this is only the beginning. First of all, I love the name BLOOM. It's wonderful; one of the songs that I listen to as a morning meditation song is called Bloom, so I enjoy that. And what I think is really exciting is that Hugging Face has been very collaborative with the development community, and this is not just a one-time model.

It's not going to sit there static. It's the seed of a living family of models that they intend to grow, and the community will be able to support that. You'll be able to see where positive things are happening and where negative things are happening, and then use all that information to make this model more powerful and safer. Those are things that I'm grateful to see organizations working on. Again, there will still be lots of problems that come from this, but overall it seems the intentions are really good. Yeah, you can see the final run of 117 days there. So I touched on a couple of parts here: how large it is in comparison to the OpenAI models, DeepMind, et cetera; the natural languages and the programming languages that are included; and the fact that it's available to download, run and study.

I have the link here, and it will be included in the resources. Then there's the responsible AI vision and intention that they have, the environmental impact and the statistics that they're reporting, and the continued improvements with community involvement, which I think is fantastic. The interest in this stuff, as it comes to life and becomes more practical, is at an all-time high, and again we're at such an early stage. I almost feel like we're the cave people.

We're making sparks, and maybe we've just got a very, very small fire. That's where it seems we are in this space. This is just getting started, and now open source, open-access versions are going to accelerate that even further, just like DALL-E Mini (now Craiyon) has done, just as GPT-3 has and just as DALL-E has. It's great to see different organizations coming from different vehicles, nonprofit, open source and private, all converge on this common goal for large language models: an understanding of human language, of intelligence, and the incredible classification and generation that is possible. One thing I would love to see, and I think this will come: it sounds like BLOOM is not made for image generation. It's made for text generation. But a lot of the practices are there, and the team they have seems very capable of building an image generation version. So I expect that to come at some point.

You will see a high-resolution image generator rivaling DALL-E, created in the same way that BLOOM was. That's a small prediction there, I guess. There are probably many things that I have glossed over, possibly even gotten wrong, during this video, so if you feel that way, I encourage you to send me a message. I would love to learn. I'm digesting this information as it comes, but thought this was really exciting for anyone who's interested in natural language processing, AI, technology, nonprofits and open source: all this wonderful stuff coming together with a bunch of amazing organizations and talented people trying to do their best to figure out how to bring this to life in a responsible way. So this has been a video on BLOOM, the world's largest open source, open-access multilingual language model. Man, that's a bit of a mouthful, but the final name is good, and I love the use of the emoji here on the end.

If you liked this, please feel encouraged to send me a message, or to like, comment and subscribe, all those good things. I have so much fun covering this stuff, and now I'm learning from people who are commenting and sending me messages. So overall, I'm really grateful to spend a couple of minutes each day delving into these topics, learning myself and then learning from you. Hopefully we're all learning together. So thank you very much. This has been Tyler Bryden. Hope you have a great rest of your day. Bye bye.
