Daily Standup 12
- Web scraping reaffirmed as legal in US Court
- A win for researchers, journalists, developers, and marketers
- Problematic data scraping from companies like Clearview AI
- The rise of synthesizing web data for insights and content
#webscraping #scraping #clearview #clearviewai #beautifulsoup #webscrape #googletrends #explodingtopics
https://library.speakai.co/speak-ai-office-hours-c0acf2cf4f55/explore (password is officehours)
All right, hello, Tyler Bryden here. You know, last week I said you would catch me in a video not wearing these glasses, and literally the next day I made a video not wearing these glasses. I'm trying to protect these puppies, so if you see me not wearing these glasses, call me out and let me know. This is problematic and I've got to get some protection going here. So thank you again to anyone who tunes in and watches this. Got a little bit of a story popping up here, which is a continuation of a story from 2019 about web scraping: is web scraping legal, or is it illegal? There was a challenge from LinkedIn. I believe people were scraping profiles from LinkedIn, and basically what happened was it went to court, and it was sort of a moment where data that is publicly available and not copyrighted was ruled fair game for web crawlers. Then this was reaffirmed this week, or a couple of weeks ago: LinkedIn has lost another legal battle, and scraping public data from LinkedIn is legal. So I feel like LinkedIn.
It's a little bit upset right now, but the main piece here is that web scraping is legal. There was a reaffirmation of it, and people are saying this is a win for things like archivists, researchers, and journalists. I see this as a win also for developers, for marketers, and for private companies who are scraping this information and generating insights; I'll talk about that a little later. But there are very problematic cases of web scraping too. We've seen one of them, Clearview AI, which went and took, I believe, pictures from Facebook profiles all across the world, and then police forces and military institutions were using it. And I believe there were crimes where people's faces were matched against that data. Some of that was right, but a lot of it was wrong, and so there are definitely challenges to this idea of web scraping and its applications. But again, there is now a little more authority behind the idea that this is legal, and I think the stories circulating around this are going to continue to drive companies to increase their web scraping abilities.
There are open source libraries and tools like Beautiful Soup and lots of different ways you can actually scrape information online. A lot of this information is extremely valuable, again, not just to private companies, but to political parties and nonprofit institutions. So I think we're not going to see this go away.
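To make the Beautiful Soup mention concrete, here is a minimal sketch of the kind of extraction it enables. The HTML snippet, class names, and field names are all made up for illustration; a real scraper would fetch a live page (and should respect the site's terms and robots.txt), but the parsing step looks the same:

```python
from bs4 import BeautifulSoup

# A small inline HTML snippet standing in for a fetched page
# (a real scraper would download this with something like requests).
html = """
<html><body>
  <div class="profile">
    <span class="name">Ada Lovelace</span>
    <span class="title">Engineer</span>
  </div>
  <div class="profile">
    <span class="name">Alan Turing</span>
    <span class="title">Researcher</span>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull each profile into a plain dict using CSS selectors.
profiles = [
    {
        "name": div.select_one(".name").get_text(strip=True),
        "title": div.select_one(".title").get_text(strip=True),
    }
    for div in soup.select("div.profile")
]

print(profiles)
```

The point is how little code stands between a public page and structured records you can store or analyze, which is exactly why this ruling matters to so many businesses.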
Next, I want to touch on a big trend that I'm continuing to see, which is: how can you take large libraries of publicly available information and distill that down into insights? I've pulled up two examples, one of them being the most classic, which is Google Trends. They really do have access to the world's information and they're giving that information back to us, so they sort of have the right to mine it and then surface things that are insightful about it. So we see things like Google Trends and its real-time trends: what's being searched at this moment. That's one way to synthesize all this information, all this data flowing through the system, and distill it down into something actionable and usable. I think this will be a drive that continues to take place. I'm also seeing this applied in more of a content curation fashion. Another way this has been done well, and I love following this company, is Exploding Topics, which looks specifically at searches per month and the growth in those. So if you're interested in a specific category, you can go back, I don't know if it's going to let me do this, there we go, you go back one month. You see: is there something trending up? Is there something trending down? And their engine is doing the work:
Finding these searches, quantifying them, and then giving you that insight so you can distill it and take it away. Overall, what I'm seeing is that this is a really important service to, I would say, the public. I've seen it applied in several spaces now, and one of the companies I've had the pleasure to work with is an amazing company, Kanatak, a personalized nutrition research firm. They are trying to capture all the information related to personalized nutrition and distill it down into newsletters, into insights, into reports that are available to the general public, but also to the companies they're working with. So a company can go to them and say, hey, we're exploring product development for something to aid with the keto diet. How do we best go about this? How can we figure out how to differentiate from the different leaders? Is this even a worthwhile endeavor? That's where you need these web scraping tools. Can you look at the data sources that are available? A couple that stick out in my mind are App Store reviews, which we've done in the past, and reviews on websites like G2. And then you've got other kinds of sources: articles, forums, social media.
You can grab every tweet with a hashtag related to a specific subject, and now all of a sudden you need to be able to take all that information, structure it, and present it in a way you can query, navigate through, and derive insights from. If you do that in the right way and then curate the right content, you're helping people navigate their own journey of self-learning and discovery. This is something I'm very passionate about; it's something I struggled with, and it even shaped some of the journey around Speak AI, because I just felt there was this overwhelming amount of information being generated at all times. Again, there are challenges here with web scraping, and I think many of us have this sort of fear: why am I being scraped? What am I being scraped for? Where is this going? What database is this going to end up in? Who's going to see this? Who's going to use this? What is the purpose of this? I empathize with that, and I've had that experience myself. I think anyone who has posted anything online over the last few years has these concerns about the consequences, and when consequences are unknown, it can sometimes be a scary thing.
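The "grab every tweet with a hashtag and make it queryable" idea above can be sketched very simply. The posts here are made up stand-ins for scraped social media content; the structuring step is just hashtag extraction and counting with the standard library:

```python
import re
from collections import Counter

# Made-up posts standing in for tweets scraped by hashtag.
posts = [
    "Loving my #keto meal prep this week",
    "Is #keto actually sustainable long term?",
    "Tried a #paleo recipe today, not bad",
    "Day 30 of #keto and feeling great",
]

# Extract hashtags and count them, turning a pile of raw text
# into a structure you can query and rank.
tag_counts = Counter(
    tag.lower() for post in posts for tag in re.findall(r"#(\w+)", post)
)

print(tag_counts.most_common())
```

Counting is the most rudimentary form of "structuring"; real pipelines layer on deduplication, timestamps, and language processing, but the shape of the problem, raw text in, queryable records out, is the same.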
You know, overall, this is an interesting reaffirmation. I've done lots of web scraping in my own life, more on a hobby basis, but it's also now being applied professionally because of some of the data sources people are looking for: research papers, podcasts you hook into via an RSS feed, all these different things. Overall, I think the process of distilling insights from information is a really good one, but it's also relatively abstract and difficult.
I really don't try to come on here and actually promote Speak AI; to be honest, this is more of a content piece. What I'm interested in is exploring a way that we've done this, and I apologize if there are any swear words in it, but I wanted to show a little bit of it. It's still not perfect; this is such a rudimentary version of what we're actually trying to accomplish, but we've at least had some success with it. What you're seeing here, and this isn't even necessarily promoting Speak, is just a walkthrough of how web scraping and an application of it could work, and maybe this helps reveal some insights for you too. We have many different names for this: a shareable media library.
A research repository, call it what you want. What we've done is take all the office hours videos we've done as a team on YouTube, scrape those, and dump them into basically a folder within Speak, which then generates this library. What's happening here is that we're running natural language processing. First of all, we're transcribing the audio and video; we can do audio, video, and text within the system, but in this case it's video, with audio as part of that. So you transcribe it, and then you put a layer of natural language processing on top of that. With natural language processing, there's this idea of named entity recognition: you have these default categories it tries to group things into automatically. If I go here, I can see brands, and I can hit apply filter, and in a second, sorry, it's querying through a lot of information, it will load all the brands that are there and display them both in a word cloud and in a bar chart, something a little more practical too. And then I can see: OK, wow, interesting, Google has been mentioned.
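The entity-counting step described above can be sketched as follows. A real pipeline would use a trained named entity recognition model (spaCy is a common choice); here a toy brand lexicon and made-up transcript snippets stand in for that step, just to show how recognized entities become the counts behind a word cloud or bar chart:

```python
from collections import Counter

# Made-up snippets standing in for transcribed office-hours audio.
transcripts = [
    "We set up a negative keyword list in Google Ads",
    "LinkedIn lost another ruling on scraping",
    "Google Trends shows what is being searched right now",
]

# A trained NER model would detect entities automatically;
# this toy lexicon is a stand-in for that step.
BRANDS = {"google", "linkedin", "facebook", "twitter"}

# Tally brand mentions across all transcripts.
mentions = Counter()
for text in transcripts:
    for word in text.lower().replace(",", " ").split():
        if word in BRANDS:
            mentions[word] += 1

# These counts are what the word cloud / bar chart view renders.
print(mentions.most_common())
```

The interesting part is that the chart is trivial once the recognition is done; all the hard work lives in the NER layer, which is why the default categories feel generic until you train custom ones.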
I can then jump in, and it will load all the instances where Google has been mentioned across the office hours. I can go there and see, oh, it's a negative keyword list in Google Ads, and I can jump to that exact moment and it will play back. So it's a really powerful tool as a display and navigation element. What we're finding, though, and this is where some of the underlying problems lie with the abstractness of web-scraped data, is that what might be valuable to you might not be valuable to someone else. And depending on where you're positioning this, or what other value you're trying to create, this could be too much raw data, or too general purpose. Those are some of the challenges we're seeing. Even so, it's great: you can filter by a specific media type.
All the different sorts of categories you're interested in are there. Of course, there are ways to build more custom categories specifically focused on the use case you're looking at, but overall there's a little more work to do here. Still, it's great: I can see brands, I can see geopolitical entities, I can see all the locations that are mentioned. But entities are just one layer of this; you need to be able to go deeper to actually extract the value.
In some cases, this might be enough. It might be enough to just visualize: oh wow, out of all these conversations, Toronto has been said the most. Then you've at least got one insight and can say, OK, this is interesting, why did this pop up? And now I can see the exact moment where it appeared.
But again, this speaks to a challenge: once you scrape all this data, and honestly, data scraping alone is complex, pages are built differently, there are different functions, and some pages are protected, once you get all that, you can pull it into a database, or into something that is queryable. Then how do you display insights? That's a challenge we're trying to figure out, and I think a lot of people, with this reaffirmation and the articles floating around about web scraping, are going to continue to take this process on. What I'm really interested in is how we can go from this raw, unprocessed information to something refined, almost like an oil refinery?
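The "pull it into something queryable" step above can be sketched with Python's built-in sqlite3. The sources and snippets below are invented examples; the point is that once scraped text lands in even the simplest database, "show me every mention of X" becomes a one-line query:

```python
import sqlite3

# Made-up scraped snippets from mixed sources.
rows = [
    ("app_store_review", "The keto tracking feature is fantastic"),
    ("g2_review", "Support was slow but the reports are useful"),
    ("article", "New research questions long-term keto adherence"),
]

conn = sqlite3.connect(":memory:")  # in-memory DB, enough for a sketch
conn.execute("CREATE TABLE snippets (source TEXT, text TEXT)")
conn.executemany("INSERT INTO snippets VALUES (?, ?)", rows)

# Query across every source at once for a topic of interest.
hits = conn.execute(
    "SELECT source FROM snippets WHERE text LIKE '%keto%' ORDER BY source"
).fetchall()

print(hits)
```

Displaying insights on top of this, the hard part discussed above, is a separate problem, but the queryable store is the prerequisite for all of it.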
It's a tough comparison in the climate we're in today, but how can you distill that down to oil that's as clean as it can be, I guess? These are the fun pieces, the threads that I'm following. And imagine this ruling had gone a little differently and said, hey, this is completely illegal: no data scraping, no publicly available data sets are available to you. All of a sudden there are business models that are affected, because this isn't just something I'm interested in; many, many companies are scraping information, or at least monitoring information, at all times.
I'll continue to sort of follow this story. LinkedIn has said this is a preliminary ruling and that they're going to keep at it; they're continuing to push on. And this is obviously a topic, as we talk about the ethics of technology and AI: what should be available and what shouldn't? This will be a conversation that continues, and I'm excited to continue it with you, because this is fascinating to me and obviously impacts the world we're living in every day. So thank you so much for checking this out. Hopefully you got something interesting from this.
I'll drop a couple of links with the video. And if there's anything else you think would be interesting to hear my weird brain reflect on or go through, please let me know; I'd appreciate that very much. Basically, every day I wake up and ask, what am I interested in today? Sometimes that inspiration comes from within; sometimes it comes from an article I read the previous day; sometimes it comes from a Twitter thread I read in the morning, even though I'm trying to stop reading Twitter threads and go running instead. So thank you again. I've gone on too long. I appreciate this very much. Hope you have a great rest of the day. Bye bye.