Nat Taylor
Blog, AI, Product Management & Tinkering

Test Drive: fastdata


Today I’m test driving fastdata, “a minimal library for generating synthetic data for training deep learning models.” I recently fine-tuned a model to be spooky, so today’s task is generating similar spooky descriptions of terms. The process is simple:

  1. pip install python-fastdata
  2. Set your ANTHROPIC_API_KEY
  3. Define a data model
  4. Prepare your inputs
  5. Generate data!

fastdata currently works only with Claude and is built on Claudette, which uses tool calling to get back structured data. Under the hood, a call to generate() produces a request like the one below. They have a handy critique example, too, for evaluating the generated data. It does a nice job formatting the output as well. Overall, it’s a great little tool!

{
    'method': 'post',
    'url': '/v1/messages',
    'timeout': 600,
    'files': None,
    'json_data': {
        'max_tokens': 4096,
        'messages': [{
            'role': 'user',
            'content': [{
                'type': 'text',
                'text': 'Generate a spooky description of the item with adjectives like spooky and haunting.  Compare it to ghouls, ghosts or witches.  The topic is:\n<topic>work</topic>\n'
            }]
        }],
        'model': 'claude-3-haiku-20240307',
        'system': '',
        'temperature': 1.0,
        'tool_choice': {
            'type': 'any'
        },
        'tools': [{
            'name': 'Spook',
            'description': 'Generate a spooky description of the item with adjectives like spooky and haunting.  Compare it to ghouls, ghosts or witches.',
            'input_schema': {
                'type': 'object',
                'properties': {
                    'topic': {
                        'type': 'string',
                        'description': ''
                    },
                    'spookify': {
                        'type': 'string',
                        'description': ''
                    }
                },
                'required': ['topic', 'spookify']
            }
        }]
    }
}
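
The 'tools' entry in that request lines up closely with the data model: the tool name matches the class name, the description matches the docstring, and the properties match the typed constructor arguments. Here is a minimal sketch of how such a mapping could be produced with the standard library. This is my own illustration, not Claudette's actual code; the helper name class_to_tool and the JSON_TYPES table are hypothetical.

```python
import inspect

# Assumed mapping from Python annotations to JSON Schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

class Spook:
    "Generate a spooky description of the item."
    def __init__(self, topic: str, spookify: str):
        self.topic, self.spookify = topic, spookify

def class_to_tool(cls):
    """Hypothetical helper: build a tool definition from a class's
    docstring and the signature of its __init__ method."""
    params = inspect.signature(cls.__init__).parameters
    fields = [(name, p) for name, p in params.items() if name != "self"]
    return {
        "name": cls.__name__,
        "description": inspect.getdoc(cls),
        "input_schema": {
            "type": "object",
            "properties": {
                name: {"type": JSON_TYPES[p.annotation], "description": ""}
                for name, p in fields
            },
            # Arguments without a default value are treated as required.
            "required": [name for name, p in fields
                         if p.default is inspect.Parameter.empty],
        },
    }

tool = class_to_tool(Spook)
print(tool)
```

Because 'tool_choice' is set to 'any', the model has to answer with a tool call, so the reply arrives as arguments matching this schema rather than as free text.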

Here’s my implementation:

%env ANTHROPIC_API_KEY=sk-ant-api03-foo
%env ANTHROPIC_LOG=debug

import os
from textwrap import dedent
import logging
import requests

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

url = 'https://gist.githubusercontent.com/creikey/42d23d1eec6d764e8a1d9fe7e56915c6/raw/b07de0068850166378bc3b008f9b655ef169d354/top-1000-nouns.txt'
words = requests.get(url).text.split("\n")

from fastcore.utils import *
from fastdata.core import FastData

class Spook():
    "Generate a spooky description of the item with adjectives like spooky and haunting.  Compare it to ghouls, ghosts or witches."
    def __init__(self, topic: str, spookify: str): store_attr()
    def __repr__(self): return f"{self.topic} ➡ *{self.spookify}*"

prompt_template = """\
Generate a spooky description of the item with adjectives like spooky and haunting.  Compare it to ghouls, ghosts or witches.  The topic is:
<topic>{topic}</topic>
"""

inputs = [{"topic":topic} for topic in words[:5]]

fast_data = FastData(model="claude-3-haiku-20240307")
spooks = fast_data.generate(
    prompt_template=prompt_template,
    inputs=inputs,
    schema=Spook,
)

from IPython.display import Markdown

Markdown("\n".join(f'- {t}' for t in spooks))

def to_md(ss): return '\n'.join(f'- {s}' for s in ss)
def show(ss): return Markdown(to_md(ss))

class SpookCritique():
    "A critique of the spook."
    def __init__(self, critique: str, spookiness: str): store_attr()
    def __repr__(self): return f"\t- **Critique:** {self.critique}\n\t- **Spookiness:** {self.spookiness}"

sp = "You will help critique synthetic data of spooky passages."
critique_template = dedent("""\
Below is an extract of a spook. Evaluate its spookiness as a Halloween enthusiast would, considering its suitability for spooktacular use:

- EXTREME if it would spook an adult
- HIGH if it would spook a teenager
- MEDIUM if it would spook a child
- LOW if it is not very spooky

{spook}

After examining the spook:

- Briefly justify your spookiness rating in a sentence
- Rate the spookiness as one of: EXTREME, HIGH, MEDIUM, LOW

""")


fast_data = FastData(model="claude-3-5-sonnet-20240620")
critiques = fast_data.generate(
    prompt_template=critique_template,
    inputs=[{"spook": f"{t.topic} -> {t.spookify}"} for t in spooks],
    schema=SpookCritique,
    sp=sp
)

show(f'{t}\n\n{c}' for t, c in zip(spooks, critiques))

The output looks like this:
