DataLLM - create data out of nothing 🔮

Welcome to DataLLM, your go-to service for creating data out of nothing.

DataLLM allows you to efficiently tap into the vast power of LLMs to...

  1. create mock data that fits your needs, as well as
  2. enrich datasets with world knowledge.

Get started

  1. Sign in and retrieve your API key here.
  2. Install the DataLLM Python client.
    pip install datallm
    
  3. Instantiate a client instance with your API key.
    from datallm import DataLLM
    datallm = DataLLM(api_key='INSERT_API_KEY')
    
  4. Start creating data out of nothing.
    # mock 100 customers of a US fashion shop
    df = datallm.mock(
      n=100, 
      data_description="Customers of a US Fashion Shop",
      columns={
          "name": {"prompt": "full name of the customer"},
          "date_of_birth": {"prompt": "the date of birth of that customer", "dtype": "date"},
          "gender": {"categories": ["male", "female", "non-binary", "n/a"]},
          "member_level": {"prompt": "a random number between 1 and 6", "dtype": "integer"},
          "state": {"prompt": "the 2-letter code for the US state of residence"},
          "email": {"prompt": "the customer's email address", "regex": "([a-z\\.]+)(@foo\\.bar)"},
      },
      temperature=0.7
    )
    df
    
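The `regex` option constrains generated strings to a pattern. A pattern like the email one above can be sanity-checked locally with Python's built-in `re` module before sending it to the service:

```python
import re

# the email pattern used in the mock() call above
pattern = r"([a-z\.]+)(@foo\.bar)"

# a value inside the pattern is accepted
assert re.fullmatch(pattern, "jane.doe@foo.bar")

# a value outside the pattern is rejected (uppercase is not in the class)
assert re.fullmatch(pattern, "Jane.Doe@foo.bar") is None
```

Note that DataLLM applies the pattern on the server side; this check only helps verify the pattern itself behaves as intended.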

Resources

  • Hugging Face: Find the instruction dataset and models on our Hugging Face page.
  • GitHub: Access the project, source code, and comprehensive documentation on GitHub.

Usage Examples

Enrich an existing DataFrame with new columns

  1. Sign in and retrieve your API key here.
  2. Install the latest version of the DataLLM Python client.
    pip install -U datallm
    
  3. Instantiate a client with your retrieved API key.
    from datallm import DataLLM
    datallm = DataLLM(api_key='INSERT_API_KEY', base_url='https://data.mostly.ai')
    
  4. Enrich an existing dataset with new columns that are coherent with the columns already present.
import pandas as pd
from datallm import DataLLM

datallm = DataLLM(api_key='INSERT_API_KEY')

df = pd.DataFrame({
    "age in years": [5, 10, 13, 19, 30, 40, 50, 60, 70, 80],
    "gender": ["m", "f", "m", "f", "m", "f", "m", "f", "m", "f"],
    "country code": ["AT", "DE", "FR", "IT", "ES", "PT", "GR", "UK", "SE", "FI"],
})

# enrich the DataFrame with a new column containing the official country name
df["country"] = datallm.enrich(df, prompt="official name of the country")

# enrich the DataFrame with first name and last name
df["first name"] = datallm.enrich(df, prompt="the first name of that person")
df["last name"] = datallm.enrich(df, prompt="the last name of that person")

# enrich the DataFrame with a categorical
df["age group"] = datallm.enrich(
    df, prompt="age group", categories=["kid", "teen", "adult", "elderly"]
)

# enrich with a boolean value and an integer value
df["isMale"] = datallm.enrich(df, prompt="is Male?", dtype="boolean")
df["body height"] = datallm.enrich(df, prompt="the body height in cm", dtype="integer")
df["body weight"] = datallm.enrich(df, prompt="the body weight in kg", dtype="integer")

df

Summarize data records

import pandas as pd
from datallm import DataLLM

datallm = DataLLM(api_key='INSERT_API_KEY')

df = datallm.mock(
    n=100,
    data_description="fake customers of an Austrian bank",
    columns={
        "first name": {"prompt": "a first name starting with S"},
        "last name": {"prompt": "a double-barrelled last name"},
        "age": {"prompt": "the customer's age", "dtype": "integer"},
        "balance": {"prompt": "the customer's current balance in EUR", "dtype": "float"},
        "has credit card": {"prompt": "does the customer have a credit card", "dtype": "boolean"},
        "customer since": {"prompt": "date that the customer has joined", "dtype": "date"},
        "state": {"prompt": "the state of residence", 
                  "categories": ["Vienna", "Lower Austria", "Upper Austria", "Carinthia", "Styria", "Tyrol", "Vorarlberg", "Burgenland", "Salzburg"]},
        "ZIP": {"prompt": "Austrian ZIP code for that customer", "regex": "(A-)([1-9][0-9]{3})"},
    },
    temperature=0.5,
    progress_bar=False,
)
df

Note: For this to work well, it is advised to use a powerful, yet well-balanced underlying LLM.
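Because generation is stochastic, it can be worth validating the returned frame locally. A minimal sketch of such checks, run here on a hand-built sample frame (actual mock() output varies per run):

```python
import pandas as pd

# stand-in for a datallm.mock() result (values here are made up for illustration)
df = pd.DataFrame({
    "age": [34, 57],
    "state": ["Vienna", "Styria"],
    "ZIP": ["A-1010", "A-8010"],
})

allowed_states = ["Vienna", "Lower Austria", "Upper Austria", "Carinthia",
                  "Styria", "Tyrol", "Vorarlberg", "Burgenland", "Salzburg"]

# categorical columns should only contain the allowed values
assert df["state"].isin(allowed_states).all()

# regex-constrained columns should match the requested pattern
assert df["ZIP"].str.fullmatch(r"(A-)([1-9][0-9]{3})").all()
```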

More Use Cases

This is just the beginning. We are curious to learn more about your use cases and how DataLLM can help you.

Architecture

DataLLM leverages fine-tuned foundation models. These are served via vLLM to a Python-based server instance, which exposes its service as a REST API. The Python client wraps this API, making it easy to interact with the service.
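As an illustration of that wrapper pattern, a client call essentially reduces to assembling a JSON payload and POSTing it to the server. The endpoint path, payload fields, and header names below are hypothetical, not the documented API; the real datallm-client defines the actual schema:

```python
# Hypothetical sketch of the client-server interaction.
BASE_URL = "https://data.mostly.ai"  # base_url from the examples above

def build_enrich_request(api_key, rows, prompt, dtype="string"):
    """Assemble an (assumed) REST request for an enrich call."""
    return {
        "url": f"{BASE_URL}/enrich",                       # hypothetical endpoint
        "headers": {"Authorization": f"Bearer {api_key}"}, # assumed auth scheme
        "json": {"rows": rows, "prompt": prompt, "dtype": dtype},
    }

req = build_enrich_request("INSERT_API_KEY", [{"age": 30}], "the first name")
# the actual HTTP POST (e.g. via requests.post(**req)) is left out here
```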

These are the core components, all of which are open-sourced and available on GitHub:

  • Server Component datallm-server: Exposes the REST API for the service.
  • Engine Component datallm-engine: Runs on top of vLLM and handles the actual prompts.
  • Python Client datallm-client: A Python wrapper for interacting with the service.
  • Utility Scripts datallm-utils: A set of utility scripts for fine-tuning new DataLLM models.

A fine-tuned model, as well as its corresponding instruction dataset, can be found on Hugging Face.

Python API docs

datallm.enrich(data, prompt, ...)

Creates a new pd.Series given the context of a pd.DataFrame. This makes it easy to enrich a DataFrame with new values generated by DataLLM.

datallm.enrich(
    data: Union[pd.DataFrame, pd.Series],
    prompt: str,
    data_description: Optional[str] = None,
    dtype: Union[str, DtypeEnum] = None,
    regex: Optional[str] = None,
    categories: Optional[list[str]] = None,
    max_tokens: Optional[int] = 16,
    temperature: Optional[float] = 0.7,
    top_p: Optional[float] = 1.0,
    model: Optional[str] = None,
    progress_bar: bool = True,
) -> pd.Series:
  • data. The existing values used as context for the newly generated values. The returned values will be of the same length and in the same order as the provided values.
  • prompt. The prompt for generating the returned values.
  • data_description. Additional information regarding the context of the provided values.
  • dtype. The dtype of the returned values. One of string, category, integer, float, boolean, date or datetime.
  • regex. A regex used to limit the generated values.
  • categories. The allowed values to be sampled from. If provided, then the dtype is set to category.
  • max_tokens. The maximum number of tokens to generate. Only applicable for string dtype.
  • temperature. The temperature used for sampling.
  • top_p. The top_p used for nucleus sampling.
  • model. The model used for generating new values. Check available models with datallm.models(). The default model is the first model in that list.
  • progress_bar. Whether to show a progress bar.

datallm.mock(n, columns, ...)

Create a pd.DataFrame from scratch using DataLLM. This creates one column after the other, for as many rows as requested. Note that rows are sampled independently of each other and thus may contain duplicates.

datallm.mock(
    n: int,
    data_description: Optional[str] = None,
    columns: Union[List[str], Dict[str, Any]] = None,
    temperature: Optional[float] = 0.7,
    top_p: Optional[float] = 1.0,
    model: Optional[str] = None,
    progress_bar: bool = True,
) -> pd.DataFrame:
  • n. The number of generated rows.
  • data_description. Additional information regarding the context of the provided values.
  • columns. Either a list of column names, or a dict with column names as keys and sampling parameters as values. The parameters may include prompt, dtype, regex, categories, max_tokens, temperature, and top_p.
  • temperature. The temperature used for sampling. Can be overridden at column level.
  • top_p. The top_p used for nucleus sampling. Can be overridden at column level.
  • progress_bar. Whether to show a progress bar.
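Since mock() samples rows independently, the returned frame may contain duplicate rows. If uniqueness matters, duplicates can be dropped afterwards with plain pandas (shown here on a hand-built stand-in frame):

```python
import pandas as pd

# stand-in for a datallm.mock() result that happens to contain a duplicate row
df = pd.DataFrame({
    "name": ["Anna Gruber", "Max Leitner", "Anna Gruber"],
    "age": [34, 57, 34],
})

# keep only the first occurrence of each row and renumber the index
unique = df.drop_duplicates().reset_index(drop=True)
print(len(unique))  # 2 rows remain
```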

Contribute

We're committed to making DataLLM better every day. Your feedback and contributions are not just welcome—they're essential. Join our community and help shape the future of tabular data generation!

About MOSTLY AI

MOSTLY AI is a pioneer and leader in GenAI for tabular data. Our mission is to enable organizations to unlock the full potential of their data while preserving privacy and compliance. We are a team of data scientists, engineers, and privacy experts dedicated to making data, and thus information, more accessible. We are proud to be a trusted partner for leading organizations across industries and geographies.

If you like DataLLM, also check out app.mostly.ai for our Synthetic Data Platform, which allows you to easily train a generative AI model on top of your own original data. These models can then be used to generate synthetic data at any volume that is statistically similar to the original, yet free of any personal information.