In 2018, Google released the AI model BERT, forever changing how machines understand context in language. BERT, short for Bidirectional Encoder Representations from Transformers, solved a long-standing problem in natural language understanding. Before BERT, researchers needed multiple bespoke models (and datasets) to capture the different contextual meanings of human language. BERT demonstrated that a single model could process contextual meaning, and even do so across multiple languages (via mBERT).
While BERT became a fundamental building block in natural language processing (NLP), its impact on how we interact with computers came from its application. That change did not come from shoehorning the technology into existing solutions. It stemmed from an applied understanding of how BERT functioned, how it would evolve and how it could solve domain-specific challenges. I know because my team at Google used BERT to create responsive search ads. Our work transformed online text advertising.
Our success in applying BERT wasn’t simply because we had TPUs or more resources than our competitors. (Those things helped, undoubtedly.) Our advantage was my team’s domain expertise and close collaboration with the Google Research team, who created the model. Collectively, we could envision how to apply BERT to fundamentally reshape advertising because we understood:
- How the model operates, including its strengths, weaknesses and dependencies.
- The specific industry problems and operational challenges we were solving for.
- The way we’d tune the model and create a system around it, at scale.
This framework remains pertinent today, as business leaders seek to understand how to build and deliver scaled impact with large language models (LLMs).
Weighing the pros and cons of models
As critics and pundits debate the efficacy of generative AI, numerous studies underscore a similar finding: people are unsure of how to use LLMs. To address these uncertainties, leaders need to ensure that their organization’s decision-makers understand, at a high level, how these models function and can be applied.
That technical understanding mattered when we applied BERT. It matters even more now, because while BERT required domain expertise to deploy effectively, today’s LLMs make it dangerously easy to deploy them poorly without realizing it. That’s likely why so many AI projects never proceed past the pilot stage, as McKinsey has reported.
BERT’s success underscored the power of pre-training and fine-tuning a model on a large dataset to capture token-level semantic context in NLP. But where BERT zigged, OpenAI’s Generative Pre-trained Transformer (GPT) zagged. Unlike BERT’s encoder-only architecture, GPT models use a decoder-only architecture, trained to predict the next token in a sequence. BERT was trained on billions of tokens; today’s GPTs are trained on trillions. The more tokens these LLMs are trained on, the more capabilities they gain.
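To make the contrast concrete, here is a minimal sketch using the open-source Hugging Face transformers library and two small public checkpoints (bert-base-uncased and gpt2). It is not how either production system works, but it shows the two prediction tasks side by side: an encoder fills in a masked token using context from both directions, while a decoder continues a sequence left to right.

```python
# Minimal encoder-vs-decoder illustration using Hugging Face transformers
# (pip install transformers torch). Checkpoints are small public models,
# chosen for illustration only.
from transformers import pipeline

# BERT (encoder-only): reads the whole sentence at once, in both directions,
# and predicts a masked token from its full context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The contract renews on the anniversary [MASK].")[0]["token_str"])

# GPT-2 (decoder-only): reads left to right and predicts the next token,
# which is what lets it generate open-ended continuations.
generate = pipeline("text-generation", model="gpt2")
print(generate("The contract renews on", max_new_tokens=10)[0]["generated_text"])
```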
Early applications of these LLMs have centered on their ability to generate creative, fluent and coherent writing, code and imagery. This “creativity” reflects the probabilistic patterns these models have learned from their massive training datasets. But that same probabilistic generation becomes a liability when the models must operate in factual, deterministic environments.
Deep learning researchers have long argued that imposing hard rules, in the tradition of symbolic reasoning, would inhibit a model’s abilities. Structured to predict an output even when uncertain, LLMs have a demonstrated tendency to hallucinate. Hallucinations are not a bug but a feature: they’re inherent to how these models operate, and without guardrails they should limit where you deploy these models. After all, mistakes in consequential industries like healthcare, finance and legal can be catastrophic.
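One common guardrail pattern is a grounding check: accept a model’s extracted answer only if it can be verified against the source text, and route everything else to a human. A minimal sketch, with a placeholder llm_extract function standing in for whatever model call your stack actually uses:

```python
# A minimal grounding guardrail: accept a model's answer only if it can be
# verified verbatim against the source document. `llm_extract` is a
# hypothetical stand-in for a real model call.
def llm_extract(document: str, question: str) -> str:
    # Placeholder: in practice this would call your LLM of choice.
    return "February 1, 2026"

def grounded_extract(document: str, question: str) -> str | None:
    answer = llm_extract(document, question)
    # Reject anything the model "remembered" rather than read: a hallucinated
    # value will not appear in the source text.
    if answer.lower() in document.lower():
        return answer
    return None  # Route to human review instead of passing the output along.

doc = "This agreement renews automatically on February 1, 2026."
print(grounded_extract(doc, "When does the agreement renew?"))
```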
This historical comparison, while surface-level, still illustrates how the model’s underlying functions impact its outputs and applications. Understanding these architectures solves half the problem. The other half is building a strategy to collect and scale data within your specific domain.
AI’s scaling problem is a domain problem
Healthcare, finance and legal are all industries with ample data and capital to spend. While each sector presents distinct challenges, each has also found success with AI.
Hospitals epitomize the opportunities and obstacles within the healthcare industry. The average hospital generates 50 petabytes of data annually, enough tokens to train a sophisticated model. But 97 percent of this data goes unused: it consists of unstructured clinical notes and radiology reports, documents redacted for HIPAA compliance, and data siloed or managed under regulatory scrutiny. When you can access it cleanly, as illustrated by AI detection of tumors in radiology images, measurable impact is possible. When you can’t, your pilot stalls.
Finance presents different tradeoffs. Transaction data is generally well-structured and high-volume, enabling novel applications in fraud detection and customer service automation. But these applications are limited by LLMs’ high error rates on multi-step numerical reasoning, a fundamental misalignment for calculation-intensive work.
The pervasiveness of hallucinations spotted in trial briefs and court filings has sown distrust in AI within the legal industry. The ABA’s 2024 Legal Technology Survey found that 75% of attorneys cite accuracy concerns as a primary barrier to AI adoption. But outside the courtroom, AI has many applications that could radically reshape legal work, including contract management, compliance and risk assessments, and intellectual property protection. These use cases play to LLMs’ strengths: handling unstructured data, recognizing patterns, extracting information and analyzing document structure.
Your development framework to scale data
The distinction between courtroom risk and contract risk is exactly the advantage we identified at Ironclad. Every business function generates contracts containing renewal dates, payment terms, obligations and counterparty details. Training models on this data is a strategy that leverages LLMs’ strengths with minimal risk.
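To illustrate the pattern (a hypothetical sketch, not Ironclad’s implementation), a contract-extraction step might ask the model for a fixed JSON schema and validate the result before trusting it. The call_llm function below is a canned stand-in for a real model call, and the field names are illustrative:

```python
# A hypothetical sketch of contract-field extraction: request a fixed JSON
# schema from the model, then validate the result before using it.
import json

PROMPT = """Extract from the contract below, as JSON with exactly these keys:
renewal_date, payment_terms, counterparty. Use null for anything not stated.

Contract:
{contract}"""

REQUIRED_KEYS = {"renewal_date", "payment_terms", "counterparty"}

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; returns a canned response here.
    return ('{"renewal_date": "2026-02-01", '
            '"payment_terms": "Net 30", "counterparty": "Acme Corp"}')

def extract_fields(contract: str) -> dict:
    raw = call_llm(PROMPT.format(contract=contract))
    fields = json.loads(raw)  # Malformed JSON raises here: retry or escalate.
    if set(fields) != REQUIRED_KEYS:
        raise ValueError(f"Unexpected keys: {set(fields)}")
    return fields

print(extract_fields("Renews 2026-02-01. Payment due net 30. Party: Acme Corp."))
```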
Compared to legal briefs and documents, where incorrect AI summaries have massive implications, the risk of applying AI to contracts is lower. Our approach at Ironclad is modeled after the automotive industry’s safety framework for deploying autonomous vehicles, the SAE J3016. This standard distinguishes between systems in which humans retain responsibility (Levels 1-2, driver assistance) and those in which machines become accountable (Levels 3-5, automated driving).
Applied to enterprise AI, this risk-based framework clarifies deployment roadmaps and boundaries. We chose to develop our Intake Agent, which extracts contract data from third-party paper, and our Conversational Search Agents, which enable natural language querying of documents, because we saw them as “Level 1-2” applications with low adoption barriers and low associated risks. Autonomous contract verification, while plausible with today’s LLMs, is far riskier than it might seem. There’s a plausible scenario in which an agent overlooks active litigation between the contracting parties and autonomously renews a contract because it can’t access the litigation records.
Using a risk-based framework can help determine where and how to build your AI applications by answering one question: what’s the likelihood we’ll get this right, and what happens if we’re wrong? This calculation, not competitive pressure, should determine your deployment sequencing.
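One way to make that question operational is to score each candidate application by expected risk (the probability of an error times its cost) and let the score set the permitted autonomy level. A minimal sketch; the thresholds, dollar figures and error rates below are illustrative assumptions, not published numbers:

```python
# A sketch of the "likelihood x consequence" calculus as a deployment gate.
# All numbers and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Application:
    name: str
    p_error: float        # Estimated probability the system gets a task wrong
    cost_of_error: float  # Estimated business cost of one wrong outcome ($)

def deployment_level(app: Application) -> str:
    expected_risk = app.p_error * app.cost_of_error
    if expected_risk < 50:
        return "Level 1-2: deploy with human review"
    if expected_risk < 5_000:
        return "pilot only: a human approves every output"
    return "do not automate yet"

for app in [
    Application("conversational search", p_error=0.05, cost_of_error=10),
    Application("autonomous contract renewal", p_error=0.02, cost_of_error=1_000_000),
]:
    print(app.name, "->", deployment_level(app))
```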
Build a foundation, create transformation
During my twenty-plus years of working with AI, I’ve found that the most significant determinant of a technology’s impact on a business is the organization’s fluency with it.
As planning ensues, organizations trying to determine how and where to apply generative AI to their business first need to look inward and ask themselves:
- Do you understand the technology’s actual mechanisms? Not the marketing promises, but the underlying architecture. LLMs perform statistical pattern matching, not logical reasoning. They require extensive general pre-training followed by domain-specific fine-tuning. Without this understanding, you’ll deploy them for tasks they cannot perform.
- How does the industry you’re serving share, collect and store data? What data, if any, will you use to fine-tune your models? LLMs are called foundation models because they require sophisticated pre-training, followed by fine-tuning, to generate meaningful, domain-specific outputs.
- Do you have a framework, such as a risk-based model, to prioritize, deploy and assess which products to develop? Start with low-risk, human-supervised applications. Conduct evaluations and build feedback loops (a minimal sketch follows this list). Expand systematically as reliability proves out.
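As a deliberately simplified illustration of that last point, an evaluation loop can score the system against labeled cases and gate any expansion of scope on measured reliability. The gold cases, threshold and placeholder system below are all hypothetical:

```python
# A minimal evaluation loop: measure accuracy on labeled cases and gate
# expansion of scope on the result. All values are illustrative.
GOLD_CASES = [  # Hypothetical labeled examples: (input, expected output)
    ("Renews Feb 1, 2026", "2026-02-01"),
    ("No renewal clause", None),
]

ACCURACY_GATE = 0.95  # Illustrative threshold; set one per risk level.

def evaluate(system) -> float:
    correct = sum(1 for text, expected in GOLD_CASES if system(text) == expected)
    return correct / len(GOLD_CASES)

def naive_system(text: str):
    # Placeholder for the real model pipeline under evaluation.
    return "2026-02-01" if "2026" in text else None

accuracy = evaluate(naive_system)
print(f"accuracy={accuracy:.0%}; expand scope: {accuracy >= ACCURACY_GATE}")
```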
Over my career, I’ve watched people overestimate what AI can do this quarter while underestimating what it will do this decade. Technology evolves. Industries are reshaped. But transformations are created by businesses that do the foundational work.
This article is published as part of the Foundry Expert Contributor Network.