Apple, Nvidia, Anthropic used thousands of YouTube videos to train AI
Plus: OpenAI nears "reasoning"-capable AI

Hello and welcome to this week's newsletter!
I wanted to let you know that this will be our final newsletter for the month. I'll be taking a short break, but it’s not just for relaxation—I’ll be attending courses, visiting the Nvidia office, and gathering even more amazing content to share with you all!
For those of you in our OG community, I'll still be sharing exciting updates on our private community app. So, stay tuned for some exclusive insights! For everyone else, I'll see you in mid-August with fresh, new content.
Have a fantastic summer, everyone!
Now, let’s dive right in:
A note from our sponsor Zapier
Zapier AI is here to redefine your productivity game. Imagine a world where your apps seamlessly talk to each other, automating the grunt work and letting you focus on what matters. From streamlining everyday tasks to constructing intricate workflows, Zapier AI has got you covered. Say goodbye to manual work and hello to efficiency.
Dive into the future of automation with Zapier AI. 👉 Explore Now
Europe’s fast-tracked AI regulations: fostering innovation or stifling it?

Key concerns…
Impact on startups:
Andreas Cleve, CEO of Danish healthcare startup Corti, fears the EU’s new AI Act will burden startups with high compliance costs, potentially stifling innovation. He supports regulation for safety but worries it will make it difficult for small tech companies to thrive.
AI Act overview:
The AI Act categorizes AI systems by risk level: minimal, limited, high, and unacceptable. High-risk systems face stringent regulations to ensure transparency, data quality, and human oversight.
The act, effective from August 2024, aims to build public trust in AI but may impose significant costs on companies.
Implementation challenges:
Critics argue the AI Act is vague and lacks clarity on key issues like intellectual property rights and compliance guidelines.
There’s concern about the bureaucratic complexity of enforcing the regulations across EU member states, potentially leading to inconsistent implementation.
Global context:
Europe’s AI Act is part of a broader international effort to regulate AI, with similar initiatives in the US, UK, and other regions.
The EU aims to set global standards, but critics fear the act could hinder Europe’s competitiveness in the AI sector.
Industry response:
DigitalEurope and other industry bodies warn that the AI Act could create barriers to innovation.
Entrepreneurs like Cleve advocate for a balanced approach that fosters innovation without excessive regulatory burdens.
Europe’s AI Act seeks to regulate artificial intelligence to protect citizens and build trust in AI technologies. However, the challenge lies in ensuring that the regulations do not stifle innovation and hinder Europe’s competitive edge in the global AI race. As the act comes into force, the focus will be on refining details and balancing regulation with the need for technological advancement.
Apple, Nvidia, Anthropic used thousands of YouTube videos to train AI

Uncovering the data dilemma…
Massive data use: Tech giants Apple, Nvidia, Anthropic, and Salesforce used subtitles from 173,536 YouTube videos, sourced from over 48,000 channels, to train AI models, often without creators' knowledge.
Impacted creators: Influential YouTubers like MrBeast, Marques Brownlee, and PewDiePie had hundreds of videos utilized. Educational channels such as Khan Academy, MIT, and Harvard, along with media outlets like The Wall Street Journal, NPR, and the BBC, were also affected.
Creators’ reactions: David Pakman, with 2 million subscribers, criticized the unauthorized use of his content for AI training, highlighting the effort and resources invested in content creation. Dave Wiskus of Nebula, a streaming service, called the practice “disrespectful” and exploitative.
The dataset: YouTube Subtitles…
About the dataset: Created by EleutherAI, YouTube Subtitles is part of the Pile dataset, which includes various data sources like Wikipedia and Enron emails. Subtitles were extracted from educational and popular channels without including video imagery.
Corporate use: Companies like Apple, Nvidia, and Salesforce have confirmed using the Pile dataset to train AI models. Salesforce’s AI model, built using the Pile, has been downloaded 86,000 times.
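How is it even possible to tell which documents came from YouTube? The Pile is distributed as newline-delimited JSON, and each record tags its source in a meta field; the YouTube captions carry the label "YoutubeSubtitles". Here is a minimal sketch of filtering those documents out of a locally downloaded shard (the file path below is hypothetical, and this is an illustration of the record format rather than any company's actual pipeline):

```python
import json

# Hypothetical path to a locally downloaded Pile shard; the Pile ships as
# newline-delimited JSON, one document per line.
PILE_SHARD = "pile/train/00.jsonl"

def youtube_subtitle_docs(path):
    """Yield only the documents sourced from the YouTube Subtitles subset."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Each record tags its source under meta.pile_set_name;
            # "YoutubeSubtitles" marks the scraped-caption subset.
            if record.get("meta", {}).get("pile_set_name") == "YoutubeSubtitles":
                yield record["text"]

# Preview the first few transcripts (200 characters each).
for i, doc in enumerate(youtube_subtitle_docs(PILE_SHARD)):
    print(doc[:200])
    if i == 2:
        break
```

The same filter works for any other subset (e.g., "Wikipedia (en)" or "Enron Emails"), which is why researchers could enumerate exactly which channels and videos ended up in the training data.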
Legal and ethical concerns…
Terms of Service violations: Using YouTube data without permission breaches YouTube’s terms, which prohibit automated data scraping. Google has taken measures to prevent such scraping, yet some scripts remain active on platforms like GitHub.
Creators’ rights: Creators demand compensation for the use of their content, arguing it’s essential for their livelihood. The issue raises broader questions about copyright, fair use, and the ethical use of digital content for AI training.
Potential consequences: AI-generated content risks creating fake or misleading information, exemplified by a fake Tucker Carlson clip using David Pakman’s script. Content like Einstein Parrot’s mimicking videos adds a layer of complexity regarding voice and identity replication.
Looking forward…
Regulatory and legal actions: Creators and legal experts are advocating for regulations and fair compensation mechanisms. Lawsuits against AI companies for unauthorized use of data are increasing, seeking to clarify legal boundaries and protect creators’ rights.
The role of platforms: Platforms like YouTube and GitHub need to enforce stricter controls to prevent unauthorized data use. Transparency from AI companies regarding data sources and consent is crucial to maintain trust and ethical standards in AI development.
OpenAI nears "reasoning"-capable AI
OpenAI researchers believe they are close to developing AI capable of human-level "reasoning," according to reports from Bloomberg and Reuters.

Why it matters: Creating AI that can reason like humans would be a major milestone. However, experts are divided on whether today's AI models can evolve to understand and adapt to the world like humans.
Key developments:
Five levels of AGI: OpenAI has sketched a five-tier progression toward AGI:
Chatbots: Conversational language AI.
Reasoners: Human-level problem-solving AI.
Agents: AI that can take actions.
Innovators: AI aiding in invention.
Organizations: AI performing organizational tasks.
Current progress: OpenAI leaders indicated their systems are at level 1 but nearing level 2.
Project "Strawberry": An initiative aimed at developing AI with human-level reasoning, planning, and problem-solving.
Context: Prior to a failed board attempt to oust CEO Sam Altman, rumors of significant AI breakthroughs had circulated. Altman hinted at major advancements in AI during a public event.
Current products: Despite these ambitions, OpenAI's latest model, GPT-4o, improves text, audio, and vision capabilities but has not yet achieved human-level reasoning.
Mission and skepticism: OpenAI aims to ensure AGI benefits humanity, but definitions of AGI are vague. Optimists believe AGI will evolve from current models, while skeptics think LLMs will hit a limit before achieving human-level reasoning.
Comparisons: OpenAI's five-level AGI framework is similar to autonomous vehicle progress models and Google's AI development framework.
OpenAI, along with other tech giants, is investing heavily in achieving AGI. Clear definitions and achievable milestones are crucial as the journey towards true AGI continues.