Work-Bench is an enterprise software VC firm. If you are, or know, anyone thinking about founding an enterprise software startup, we’d love to meet! Please reach out to chat. 📩
As the race for AI dominance continues, many are wondering whether we have the resources to keep producing LLMs at breakneck speed. While many tools and processes go into training and deploying LLMs, ranging from data pre-processing to model serving, there is a key dependency that could impact model innovation: access to data.
Earlier this week, the NYTimes reported that AI’s biggest players were feeling the effects of limited data availability, leading them to more “creative” means of acquiring data.
OpenAI: Back when OpenAI trained GPT-4, it exhausted all publicly available text data and created its speech recognition model, Whisper, as a means of transcribing publicly available YouTube videos into text data.
Meta: Meta explored using copyrighted material like long-form essays and books, even if it meant breaking a few copyright laws by using works without authors' explicit permission.
Google: In 2023, the company broadened its terms of service, allowing internal AI/ML teams to access users’ Google Docs, Maps, restaurant reviews, and other online material to use as training data.
While current data acquisition methods obviously walk a fine line legally, we’re left wondering what the future of LLMs will look like. Here are a few questions the Work-Bench team is actively thinking about:
How will the data crunch lead to more data-efficient model architectures and training methods?
What is the ceiling for synthetic data? Will this be a true tailwind for companies like Gretel?
Will the data shortage lead to the rise of Small Language Models trained on fewer, ultra-high-quality data points?
If you’re also thinking about any of these questions, feel free to send us a note to chat.
On the Work-Bench front, we published some of our latest research:
Danny writes about how every company is transforming into an AI company and falls somewhere on the “LLM Adoption Curve,” with AI-native companies facing a big decision from day one about their target markets, product positioning, and wedge use cases.
Priyanka writes about the evolving landscape of data processing and analytics platforms, how DataFusion distinguishes itself through its seamless integration with the Apache Arrow format, and the top developer tools to know.
🖥️ Join more Work-Bench events:
Tuesday, April 16: NY Enterprise Tech Meetup on “Product Inflection Points from 0 to IPO” with Braze's 5th employee and now Chief Product Officer
Tuesday, April 23: Next NYC Product Marketing Night with the GTM Product Marketing Manager Lead at Pinecone and a Senior Product Marketing Manager at Temporal
📚 Read more news:
Axios: Enterprise Tech 30
Forbes: $10 Billion Productivity Startup Notion Wants To Build Your AI Everything App
Fortune: The Founder Of A $4.4B Unicorn Had A Stroke At Age 34
The Information: OpenAI Researchers, Including Ally of Sutskever, Fired for Alleged Leaking
WSJ: Amazon CEO Touts AI Revolution While Committing to Cost Cuts
📚 Read more (th)reads:
Company: Autokitteh
Role: Junior Engineer
Technology: Workflow Automation Platform
Funding: Unannounced
🌟 Company of the Week 🌟
Val Town raises $5.5M led by Accel
Infrastructure / Dev Tools • Seed • New York, NY
Founders: Steve Krouse (CEO), Tom MacWright (CTO)
Scavenger AI raises $1.2M led by HTGF
Data / AI / Machine Learning • Pre-Seed • Berlin, Germany
Veremark raises $3M led by Samaipata and Stage 2 Capital
HR Tech • Series B • London, United Kingdom
PeerDB raises $3.6M led by 8VC, Y Combinator and Others
Infrastructure / Dev Tools • Seed • San Francisco, CA
Libretto raises $3.7M led by XYZ VC and The General Partnership
Data / AI / Machine Learning • Seed • New York, NY
Patlytics raises $4.5M led by Gradient Ventures
Future of Work • Seed • San Francisco, CA
Datafy raises $6M led by Insight Partners
Infrastructure / Dev Tools • Seed • Tel Aviv, Israel
Tabs raises $7M led by Lightspeed
Future of Work • Seed • New York, NY
GTM Buddy raises $8M led by Archerman Capital and Leo Capital
Sales / Marketing • Series A • Durham, NC
PVML raises $8M led by NFX
Data / AI / Machine Learning • Seed • Tel Aviv, Israel
Summer raises $9M led by Rebalance Capital and SemperVirens
HR Tech • Venture • New York, NY
Simbian raises $10M led by Cota Capital, Rain Capital and Others
Risk / Security • Seed • Mountain View, CA
StrikeReady raises $12M led by 33N Ventures
Risk / Security • Series A • Dallas, TX
Andesite raises $15.3M led by Red Cell Partners and General Catalyst
Risk / Security • Series A • McLean, VA
Mimo raises $19.4M led by Northzone
Future of Work • Series A • London, UK
Cariloop raises $20M led by ABS Capital
HR Tech • Series C • Richardson, TX
Sprinto raises $20M led by Accel, Elevation Capital, and Blume Ventures
Risk / Security • Series B • San Francisco, CA
Onum raises $28M led by Dawn Capital
Data / AI / Machine Learning • Series A • Madrid, Spain
Novidea raises $30M led by HarbourVest Partners
Future of Work • Series C • Netanya, Israel
Symbolica raises $33M led by Khosla Ventures
Data / AI / Machine Learning • Series A • San Francisco, CA
FloQast raises $100M led by ICONIQ Growth
Future of Work • Series E • Los Angeles, CA
Cyera raises $300M led by Coatue
Data / AI / Machine Learning • Series C • New York, NY
Vista Equity Partners acquires Model N for $1.3B
Future of Work • Acquisition • Redwood Shores, CA
CData acquires Data Virtuality for an Undisclosed Amount
Data / AI / Machine Learning • Acquisition • Leipzig, Germany
Cloudflare acquires Baselime for an Undisclosed Amount
Infrastructure / Dev Tools • Acquisition • London, UK