What Is CogVideo Text-to-Video Generation?

This is part of my live-learning series! I will be updating this post as I continue through my journey. I apologize for any grammatical errors or incoherent thoughts. This is a practice to help me share things that are valuable without falling apart from the pressure of perfection. 


Episode Summary

– Text-to-video generation
– 9 billion parameter transformer
– Inherits the pre-trained text-to-image model CogView2
– Multi-frame-rate hierarchical training strategy
– One of the first open-source large-scale pretrained text-to-video models
– Outperforms all publicly available models in machine and human evaluations


Resources

nightmareai/cogvideo – Text-to-video generation – Replicate

GitHub – THUDM/CogVideo: Text-to-video generation.
CogVideo Demo Site
CogVideo – a Hugging Face Space by THUDM
GitHub – gradio-app/gradio: Create UIs for your machine learning model in Python in 3 minutes
[2205.15868] CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
This AI Can Create Video From Text Prompt | by Jim Clyde Monge | Aug, 2022 | Better Programming
CogVideo: Text To Video Generation | allainews.com
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | DeepAI
Text-to-Video Generation via Transformers
CogVideo: Large-Scale Pretraining for Text-to-Video Generation via Transformers | Hacker News
Text-to-Video Generation | Papers With Code
CogVideo: New Method for Generating GIFs from Text Input
80lv
Telegram: Contact @LevelEightyNews
80 Level (@eighty_level) • Instagram photos and videos
80 LEVEL (@80Level) / Twitter
AK on Twitter: “CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers github: https://t.co/1JuOHU7puc https://t.co/Wilcq2Xxb9” / Twitter
(PDF) Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
rkhamilton/vqgan-clip-generator: Implements VQGAN+CLIP for image and video generation, and style transfers, based on text and image prompts. Emphasis on ease-of-use, documentation, and smooth video creation.
VQGAN-CLIP-animations.ipynb – Colaboratory
AI Video Generator — Kapwing
Our AI Art Generator Isn’t Being Used To Generate Art

YouTube Video

 

Automated Transcription

Hello hello, Tyler Bryden here, hope everything's going well. Today I want to talk about an extension of something that I've been talking about a lot on this channel, which is image generation. These are large models that use data scraped from the internet to build an understanding of language, imagery, and culture; you put a text prompt into that engine and you create an image with it, and the potential of this seems endless, the boundaries just unimaginable as we continue to progress with this technology. Most people are familiar now with DALL·E, OpenAI's version. There's Midjourney, and there are more and more of these models emerging that allow people to do image generation.

I think the natural progression from this, which makes a lot of sense, is text-to-video generation, and I think this is just an early step into a world we're still in the early stages of fathoming. Now that these systems have this understanding, why do we have to stop at rendering images? Why can't we render videos? Why can't we render entire virtual worlds?

With a couple of recent developments, it seems that future might not be as far off as we thought. So what I wanted to introduce today, and the question I want to answer for myself, is: what is CogVideo? I'm actually trying to find, and I'll share it if I can, the original way I found this. I stumble across a lot of these things online, and I believe it was a Reddit thread, so if I can track it down I'll post it as a resource and a link so you can check it out. What I wanted to share is CogVideo, and it looks like it's very early, with the latest version pushed a week ago. I'd be interested to know when the first version was actually published. This is then hosted, and you can see in this case the prompt is a black cap, and there is an output here, and generally what seems to be happening is they're taking a series of images, and I have the exact terminology somewhere.

Alright, I will find it in one of the pieces here. Yes, it's a multi-frame-rate hierarchical training strategy. Basically it's trying to align text and clips: it generates a series of images, just like what a video is, and then pushes those together to simulate movement, to simulate a video (a rough sketch of that two-stage idea follows below). And you can see not only is there this page, there is a dedicated GitHub page, so that will be a link for you to check out if you want, and it has a bunch of samples in it.
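As a rough illustration of that hierarchy, here is a sketch of the two-stage idea the paper describes: a sequential stage generates a few key frames at a low frame rate, and a recursive interpolation stage fills in frames between them. The function names below are hypothetical placeholders, not the actual CogVideo code.

```python
# A rough sketch of the two-stage, multi-frame-rate idea described in the paper.
# generate_keyframes() and interpolate_between() are hypothetical placeholders,
# NOT the real CogVideo API; they just stand in for the two transformer stages.

def generate_keyframes(prompt, n, frame_rate):
    # Placeholder: the real sequential stage would generate n key frames
    # conditioned on the text prompt and a frame-rate token.
    return [f"<key frame {i} for '{prompt}' @ {frame_rate} fps>" for i in range(n)]

def interpolate_between(prompt, left, right):
    # Placeholder: the real recursive stage would synthesize an in-between frame.
    return f"<in-between frame for '{prompt}'>"

def generate_video(prompt, num_keyframes=5, interpolation_rounds=2):
    # Stage 1: a few key frames at a low frame rate.
    frames = generate_keyframes(prompt, n=num_keyframes, frame_rate=1)
    # Stage 2: recursively insert frames between neighbours, roughly doubling
    # the effective frame rate each round.
    for _ in range(interpolation_rounds):
        denser = []
        for left, right in zip(frames, frames[1:]):
            denser.extend([left, interpolate_between(prompt, left, right)])
        denser.append(frames[-1])
        frames = denser
    return frames

print(len(generate_video("a clown is running through water")))  # 5 -> 9 -> 17 frames
```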

I'm actually not sure, I don't think I've watched all of these prompts that have been created through the system, and they have a bunch of papers. I've linked all of these, and obviously if you just get this link itself you're going to find them, and then they have a bunch of generated samples. It seems like a lot of the text input, and a lot of the system, is built around Chinese, but this is a four-second clip of 32 frames, and here they sampled 9 frames for display purposes, which are then combined together. Really, again, a video is just a bunch of frames put together (a minimal sketch of stitching frames into a clip follows below), so you can see it's very, very interesting, and it extends the original DALL·E style of image generation, where only a single image is generated.
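Since "a video is just a bunch of frames put together," here is a minimal sketch of stitching frames into a clip like the one on the GitHub page: 32 frames at 8 frames per second gives a four-second GIF. This assumes you already have frames saved as frame_00.png through frame_31.png (hypothetical filenames) and that imageio is installed; the exact keyword for frame timing can vary between imageio versions.

```python
# Minimal sketch: stitch 32 still frames into a 4-second GIF at 8 fps.
# Assumes frame_00.png ... frame_31.png exist; the filenames are hypothetical.
import imageio.v2 as imageio

frames = [imageio.imread(f"frame_{i:02d}.png") for i in range(32)]
imageio.mimsave("clip.gif", frames, fps=8)  # 32 frames / 8 fps = 4 seconds
```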

That brings me to a couple of other things: there is the ability to actually run this model. They talk about it being available on Hugging Face Spaces, so let me type in CogVideo and see if that pops up. There we go, here we go, and that takes me to an input text box and then the translated text, so it only accepts Chinese in the current iteration right now, and then there's the output video.
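If you would rather call a hosted model programmatically than click through the Space, the nightmareai/cogvideo listing on Replicate (linked in the resources above) can be run from Python. This is a hedged sketch only: it assumes the replicate client is installed and a REPLICATE_API_TOKEN is set, the input field name "prompt" depends on that model's schema, and depending on the client version you may need to pin a specific model version hash.

```python
# Hedged sketch of calling the hosted nightmareai/cogvideo model on Replicate.
# Assumes `pip install replicate` and a REPLICATE_API_TOKEN environment variable;
# the input field name ("prompt") may differ from the model's actual schema.
import replicate

output = replicate.run(
    "nightmareai/cogvideo",
    input={"prompt": "A clown is running through water"},
)
print(output)  # typically a URL to the generated clip (often a GIF)
```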

With that, the thing that I'm seeing, and I think this is one of the challenges of why we are where we are, is that this takes a long time to generate. DALL·E is relatively quick; sometimes you get results back in 15-30 seconds if you're lucky, and they're doing many variations of those images, but in the end you're getting one return back. This one is not only seemingly creating multiple images within that same structure and model, it's then combining them into a video, rendering that video, etcetera.

If I inspect one, and I don't know if I can inspect one here, the final output seems to be a GIF, which is a video format but not necessarily the best resolution, so we're starting to see the limitations of the current model. This is a 9 billion parameter model, so it's not like it's small, but the amount of processing required, the amount of data required, the time required to do it, and then extending that to high-resolution, full-resolution video seems like even more of a challenge, one that even with the advancements in computer processing today is not necessarily easy, or if it is easy, it's going to be very expensive. There are some options to run it yourself, and they even talk about what GPU to use. There is the demo here, and this is where I'll pop it up again, I think I have this link a couple of times. You can see it's basically sort of what DALL·E did.
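To put "9 billion parameters" in perspective, here is a back-of-envelope memory estimate. It is a rough calculation only, not CogVideo's actual requirements, since the released checkpoints also bundle the interpolation stage and the inherited CogView2 weights, which is part of why the download runs much larger.

```python
# Back-of-envelope memory math for a 9-billion-parameter transformer.
# Rough estimate only; CogVideo's real footprint includes additional stages.
params = 9e9

for label, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{label}: ~{gib:.0f} GiB just to hold the weights")
# fp32: ~34 GiB, fp16: ~17 GiB, int8: ~8 GiB
```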

There are these little demos that are almost like presets, preformatted, so you've got 480 by 480, four seconds, at a frame rate of only 8 frames per second, so it does look a little bit clunky given that we're generally used to 30 frames per second (The Hobbit was shot at 48 frames per second, which was somewhat disturbing). So there are some challenges to this, but you can see sort of the iteration of it. I'm surprised, and I'm confused about why this is actually taking so long. I don't know if this is genuinely creating this video, or if it's just a preview, just a placeholder.

There we go, so we've got a woman on the beach taking a walk. Let's do singing instead, and then that will render there. Oh my God, I actually got it back, very, very exciting. That took about 10 minutes; I've been recording this video for six minutes, so that sort of lines up, since I started it just before. And we've got a clown running through water.

It's low resolution and very short, but there are some shapes and things that, because I know the text prompt, I have some understanding of what it is. You can see that the perception or understanding of water might not be as good as we might hope, but that's OK. Then I'm looking at stage two generating frames, frame rate 2. I believe this is the count in seconds of how long it took, so the first frame took 38 seconds, and we're now moving through this stage, so this might continue to refine; 80% of predictions complete within 33 minutes. So you can start to see that the setup time can be long, and I hadn't seen this before.

63 gigabytes, this is big, big, big, and again you're not guaranteed the output, so I think this speaks to the idea that we're in a stage of a lot of experimentation and innovation, technology doing incredible stuff, and the challenge, in this infancy, is just how early we are in this. Is this practical or useful at all? I would argue that DALL·E and Midjourney, with the image generation and the high resolution they can do, the customizability, and the relatively quick response time, have more use cases than this, but I think this is a world that we're heading into.

We've also got, if you're interested, a Y Combinator Hacker News thread on this, where I find the audience generally has some pretty fascinating insights on advancements in technology. These are generally pretty savvy people, talking about how a short-term application might be online advertising, something meant to grab attention, so that's one of the first pieces there. The second mention is pornography, which seems to be a leader in many cases for technology

enhancements, so maybe we'll see that. I'm just looking to see if there's anything else to pick out from this forum here, and again I'll drop this in. This person is asking: do we get to generating video games or business application source code from text? I believe so, yeah. I believe virtual worlds definitely, VR and AR, though I guess AR not as much as VR worlds. Why can't you describe an environment that you want to be in and then it automatically renders there? I don't see

any reason why that's not in our future. It's 2022 and AI is generating videos, yet we're still unable to properly embed a video into an HTML page. There are a couple of other pieces on here, but generally this seems very recent, like it has just come to life. I should get the exact published date, but I'm a dumb guy and I'm probably not going to. And it looks like they have, yes, here is a Reddit page, and there's a Telegram channel.

Instagram, Twitter. I'm not going to do all of that. 80 Level, I don't know if this is the same people who actually published this or just people who are writing about it, I think it's the latter, but I've got some tweets, a couple of other resources, and then there are some papers on this which I had actually stumbled across. I know I'm coming up on time here, and you don't know that, but it's a self-imposed time limit. This is, from my understanding, basically VQGAN and CLIP used to generate images, but I've seen a version, when I was first

exploring this, that was video generation. So while CogVideo is saying it's one of the first, it might not actually be. I'm going to pull this up and put this into a couple of links here too. Yes, and even Kapwing used it, I remember this case, Kapwing even used this, so it's not actually the first version. This is

a bigger one that has come from CLIP, and a lot of things have come from this, but you can see Kapwing actually allows you to do this, and there's a gallery. OK, well, the gallery is not working very well; I remember it had a lot of problems when I first tested it too. So they were not the first, and there is even a Google Colab version of this. This will all be linked, and again, I think everyone's rushing to claim they're the first to do it, but I had come across this before, and they all seem to be building off each other and off different technology stacks.

Interesting, so Kapwing had it and then they took it off. OK, well then, that's what I've got today; if you're interested in this, take a look. I'm not going to read through this entire thing, but I've got all these links ready below the YouTube video on my website, and I just think this is super fascinating stuff. I've got the one video that came back; it looks like it's in its final state and I can download it, which is pretty cool. I will go ahead and download that and share it. There's a Discord channel, there are ways to upload an image, and it looks like there are different ways to modify this, which goes back to this prompt

engineering video that I created, and something that I've been studying here just to understand more about how people are going to get better at creating the videos, creating the images, and using these technologies to their full power and capabilities. Especially if these get released privately and there's a cost to it, that's going to drive it toward more business use cases, and then I think if it's open source there will continue to be a lot of experimentation, people playing around, debugging, learning what works best, and I'm really excited to follow that journey. I hope to be able to continue sharing insights along the way, and if you like this kind of stuff, feel encouraged, shoot me messages. I've learned stuff from people who are watching my videos; they were the ones who originally showed me Midjourney and a couple of the other libraries that are DALL·E alternatives. I'm learning from you

as much as I hope you're learning from me. Maybe you're not learning anything from me, maybe you're just listening to me elaborate, and maybe there's one useful thing now and then, but either way I love doing it. I love connecting with you, so feel encouraged: send me a message, like the video, subscribe, it helps with the algorithms, all that good stuff. I'm Tyler Bryden, I hope you have a fantastic rest of your day. Look forward to sharing more videos on this soon. Bye bye.

 
