Expert LLM Fine-Tuning and Testing from Enabled Intelligence

Enabled Intelligence helps top government and commercial customers take Large Language Model (LLM) fine-tuning and testing to the next level.

If your organization wants to get more out of LLMs, we can help. Combining best-of-breed AI technology and an expert-in-the-loop Reinforcement Learning from Human Feedback (RLHF) approach, Enabled Intelligence reduces hallucinations and improves LLM accuracy, relevance, and contextual understanding.

Why Enabled Intelligence?

  • Leading provider of AI data annotation and AI model testing solutions
  • Highly trained Native Language Analysts—not “Gig Workers”
  • Best-of-breed technology approach—no vendor lock-in
  • Hyper-focused on quality
  • Expertise working with secure data:
    • Classified materials up to TS/SCI
    • Personally Identifiable Information (PII)
    • Health Insurance Portability and Accountability Act (HIPAA) compliance

Strategy & Approach

We apply our deep expertise in machine learning and AI to fine-tune your LLMs for specific use cases and evaluate LLMs for hallucinations, bias, reasoning, generation quality, and model mechanics.

We gain a thorough understanding of the specific domain or application area for which your LLM is being fine-tuned. This helps us select relevant data and interpret the model’s outputs accurately.

Our cross-functional teams of native English speakers collaborate and communicate throughout the training process to ensure that the fine-tuned model meets your objectives and requirements.


Tactics & Execution

Our background in data science and engineering enables us to streamline data collection, preprocessing, and management of large datasets. This includes skills in data cleaning, annotation, and augmentation to ensure high-quality training data.
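As a simple illustration of the kind of cleaning involved (not our production pipeline), the sketch below assumes a hypothetical CSV of prompt/response pairs and shows three routine steps: normalizing whitespace, dropping empty rows, and removing duplicate pairs. File and column names are placeholders.

    # A minimal cleaning sketch, assuming a hypothetical prompt/response CSV.
    # The file name and the "prompt"/"response" column names are illustrative only.
    import pandas as pd

    df = pd.read_csv("raw_prompt_response_pairs.csv")  # hypothetical input file

    # Normalize whitespace and drop rows with empty prompts or responses.
    for col in ("prompt", "response"):
        df[col] = df[col].astype(str).str.strip().str.replace(r"\s+", " ", regex=True)
    df = df[(df["prompt"] != "") & (df["response"] != "")]

    # Remove exact duplicate pairs so the model does not over-weight repeated text.
    df = df.drop_duplicates(subset=["prompt", "response"])

    df.to_csv("clean_prompt_response_pairs.csv", index=False)
    print(f"{len(df)} cleaned prompt/response pairs")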

We evaluate responses with native English speakers, ensuring:

  • Consistency: calibrating annotators using sample data

  • Objectivity: strictly adhering to predefined criteria

  • Confidentiality: ensuring all of your data is handled securely

Response annotation guidelines include relevance, accuracy, coherence, fluency, completeness, and appropriateness. Additionally, our team of experts can annotate sentiment, identify errors, and match intent.
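To make these criteria concrete, here is a minimal sketch of how a single response-annotation record might be structured. The field names, the 1–5 scale, and the example values are illustrative assumptions, not our internal schema.

    # Illustrative annotation record for one prompt/response pair (Python 3.9+).
    # Field names and the 1-5 scale are assumptions made for this sketch.
    from dataclasses import dataclass, field

    @dataclass
    class ResponseAnnotation:
        prompt_id: str
        relevance: int        # 1-5: does the response address the prompt?
        accuracy: int         # 1-5: are the stated facts correct?
        coherence: int        # 1-5: is the response logically organized?
        fluency: int          # 1-5: is the language natural and grammatical?
        completeness: int     # 1-5: does it cover everything the prompt asked for?
        appropriateness: int  # 1-5: are tone and content suitable for the use case?
        sentiment: str = "neutral"                              # e.g., "positive", "negative"
        errors_found: list[str] = field(default_factory=list)   # free-text error notes
        intent_matched: bool = True                              # does it match the prompt's intent?

    example = ResponseAnnotation(
        prompt_id="prompt-0001",
        relevance=5, accuracy=4, coherence=5,
        fluency=5, completeness=3, appropriateness=5,
        errors_found=["omits the requested date range"],
    )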

Quality

Our teams are proficient in modern software engineering practices and a variety of programming languages. We have expertise in machine learning frameworks such as TensorFlow, PyTorch, and Hugging Face’s Transformers library.

We evaluate and validate model performance using metrics such as accuracy, precision, recall, and F1 score, together with validation techniques such as cross-validation and A/B testing.
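For example, on a classification-style evaluation task (such as labeling responses acceptable or unacceptable), these metrics can be computed with scikit-learn. The labels below are made up purely for illustration.

    # Minimal metrics sketch using scikit-learn; labels are illustrative only.
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # 1 = response judged acceptable, 0 = unacceptable (hypothetical reviewer labels)
    y_true = [1, 0, 1, 1, 0, 1, 0, 1]   # gold labels from senior reviewers
    y_pred = [1, 0, 1, 0, 0, 1, 1, 1]   # labels produced by the model under test

    print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")
    print(f"precision: {precision_score(y_true, y_pred):.2f}")
    print(f"recall:    {recall_score(y_true, y_pred):.2f}")
    print(f"F1 score:  {f1_score(y_true, y_pred):.2f}")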

We apply optimization algorithms and hyperparameter-tuning techniques, adjusting settings such as learning rate, batch size, and number of training epochs, to achieve the best performance.
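As one illustration of where these hyperparameters appear in practice, the sketch below fine-tunes a small causal language model with Hugging Face’s Trainer. The base model ("gpt2"), the dataset name ("your_org/internal_corpus"), and the hyperparameter values are placeholder assumptions, not a recommended configuration.

    # Minimal fine-tuning sketch with Hugging Face Transformers.
    # Model name, dataset name, and hyperparameter values are placeholders.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    model_name = "gpt2"  # placeholder; substitute the LLM being fine-tuned
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # "your_org/internal_corpus" is a hypothetical dataset with a "text" column.
    dataset = load_dataset("your_org/internal_corpus", split="train")
    tokenized = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=dataset.column_names,
    )

    args = TrainingArguments(
        output_dir="finetuned-llm",
        learning_rate=5e-5,              # the hyperparameters named above;
        per_device_train_batch_size=8,   # these values are illustrative and in
        num_train_epochs=3,              # practice are tuned per use case
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()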

Our multi-tier review process provides rigorous quality assurance and a continuous feedback loop. Initial annotations are reviewed by senior annotators or team leads, and we regularly measure and improve agreement between annotators.
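One common way to measure that agreement is Cohen’s kappa. The sketch below compares two hypothetical annotators’ labels on the same sample of responses; the label set and values are made up for illustration.

    # Inter-annotator agreement sketch using Cohen's kappa; labels are illustrative.
    from sklearn.metrics import cohen_kappa_score

    # Labels from two annotators on the same ten responses
    # ("good" / "fair" / "poor" is an example label set).
    annotator_a = ["good", "good", "fair", "poor", "good", "fair", "good", "poor", "fair", "good"]
    annotator_b = ["good", "fair", "fair", "poor", "good", "fair", "good", "fair", "fair", "good"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # values near 1.0 indicate strong agreement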

Quality Assurance

Is a Large Language Model right for your business? Well, don’t just ask it. Test it.

With all the hype around LLMs, many of our large business and government customers want to know what is real and whether LLMs can reliably help their missions and their margins. As a result, Enabled Intelligence has seen a sharp uptick in requests from companies asking for help testing and evaluating Large Language Models (LLMs). And while LLMs show promise and are an impressive early-stage technology, truly evaluating them is still an evolving and complex process.

LLMs’ flexibility and creativity make them hard to test with automated tools

LLMs’ ability to interact with people using “natural” language and to create (generate) text like summaries, essays, reports, and stories can revolutionize how we interact with and use computers and software. LLMs hold the promise of analyzing and organizing information buried in pages of text and millions of sources, and of responding in “human-sounding” paragraphs, lists, and even song lyrics or poems. However, this diversity, naturalness, and creativity of responses also creates LLMs’ greatest weaknesses: hallucinations; unsafe language; incorrect or made-up facts; and responses that don’t follow prompt instructions. And because LLMs give responses in convincing, human-like language, it is difficult to quickly, comprehensively, and accurately identify these errors.

Testing LLMs is much more complicated than testing computer vision models, as tone, context, emotional impact, and other factors are all part of the assessment. This requires technology AND human testers. Enabled Intelligence’s team of skilled LLM testers has developed some quick tips to avoid the common pitfalls we see in current LLM testing methods:

  • Fine-tune the LLM to specific use cases and jargon: LLMs are foundational, generalized language models trained on vast publicly available data sets like social media feeds, public websites, and other open-source data. They struggle when presented with specific tasks using more focused data and jargon. However, a generalized LLM can be fine-tuned using your own internal data to better focus it on specific uses. A first step of LLM evaluation is to identify these weaknesses and determine which internal data would best fine-tune and retrain the model for your use case.
  • Evaluate with native language speakers: It may seem cost-effective to offshore LLM evaluation (reading and validating prompt/response pairs is time-consuming and tedious). However, this short-sighted economic calculus ends up being much more expensive in the long run. Evaluating an LLM requires a strong native understanding of the LLM’s language. Many of our new customers previously used offshore gig workers to assess English LLM results; the inaccuracy from those teams forced these companies to redo all the work, adding costs and extensive development delays.
  • Use diverse evaluators: Detecting and correcting bias is a major focus of LLM evaluation. Detecting bias is more than identifying racist or misogynistic language; it requires diversity of understanding, thought, and experience. LLM evaluation teams often confuse quantity with diversity: having seven evaluators of the same background is not equivalent to a truly diverse team. That’s why Enabled Intelligence employs a diverse staff, including neurodiverse professionals, to evaluate LLMs and label data.
  • Employ a useful and comprehensive evaluation ontology: Most LLM evaluations use a basic ontology assessing “Helpfulness” (does the response follow the prompt instructions and provide an appropriate response?), “Truthfulness” (does the response contain any factual errors or hallucinations?), and “Safety” (does the response contain harmful, racist, hateful, pornographic, or violent language or direction?). This is a great start to an ontology but is not enough. Ontologies need to capture edge cases (would a response detailing reproductive health options be considered “sexual” or “pornographic”? would a response providing a history of the Jim Crow era be classified as “racist”?) and other concepts like tone, style, verbosity, and naturalness of language. Is the LLM designed to sound “human-like,” or should it present itself as a technical assistant and respond more like a machine?
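As a concrete illustration, here is a minimal sketch of how such an ontology might be encoded for a single prompt/response pair. The rating scale, field names, and example values are assumptions made for this sketch, not a standard schema; the categories themselves (Helpfulness, Truthfulness, Safety, plus tone, verbosity, naturalness, and edge-case notes) follow the list above.

    # Illustrative encoding of an LLM evaluation ontology (Python 3.9+).
    # Rating scale and field names are a sketch, not a standard schema.
    from dataclasses import dataclass, field
    from enum import Enum

    class Rating(Enum):
        PASS = "pass"
        BORDERLINE = "borderline"   # edge cases get flagged for adjudication
        FAIL = "fail"

    @dataclass
    class LLMEvaluation:
        prompt_id: str
        helpfulness: Rating   # does the response follow the prompt's instructions?
        truthfulness: Rating  # any factual errors or hallucinations?
        safety: Rating        # harmful, hateful, pornographic, or violent content?
        tone: str = ""        # e.g., "technical assistant" vs. "conversational"
        verbosity: str = ""   # e.g., "concise", "appropriate", "rambling"
        naturalness: str = "" # does the language sound human-like, if it should?
        edge_case_notes: list[str] = field(default_factory=list)

    example = LLMEvaluation(
        prompt_id="prompt-0042",
        helpfulness=Rating.PASS,
        truthfulness=Rating.PASS,
        safety=Rating.BORDERLINE,
        edge_case_notes=["health-education content; route to reviewer adjudication"],
    )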



Contact Us to Learn More

Enabled Intelligence, Inc.’s diverse team of native language speakers partners with top LLM companies and enterprise users of LLMs to work through these issues and truly test LLM performance.