Welcome to DataLLM, your go-to service for creating data out of nothing.
DataLLM allows you to efficiently tap into the vast power of LLMs to...
pip install datallm
from datallm import DataLLM
datallm = DataLLM(api_key='INSERT_API_KEY')
# mock 100 customers of a US fashion shop
df = datallm.mock(
n=100,
data_description="Customers of a US Fashion Shop",
columns={
"name": {"prompt": "full name of the customer"},
"date_of_birth": {"prompt": "the date of birth of that customer", "dtype": "date"},
"gender": {"categories": ["male", "female", "non-binary", "n/a"]},
"member_level": {"prompt": "a random number between 1 and 6", "dtype": "integer"},
"state": {"prompt": "the 2-letter code for the US state of residence"},
"email": {"prompt": "the customers email address", "regex": "([a-z|\\.]+)(@foo\\.bar)"},
},
temperature=0.7
)
df
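The regex parameter constrains generated values to a pattern. As a sanity check on the client side, the same pattern can be re-applied with Python's standard re module; the sketch below uses hand-written stand-in values rather than actual DataLLM output:

```python
import re

# The same pattern passed to datallm.mock for the "email" column.
EMAIL_PATTERN = re.compile(r"([a-z|\.]+)(@foo\.bar)")

# Hypothetical values standing in for a generated "email" column.
emails = ["jane.doe@foo.bar", "j|smith@foo.bar", "Jane.Doe@foo.bar"]

# fullmatch ensures the whole string conforms, not just a prefix.
valid = [e for e in emails if EMAIL_PATTERN.fullmatch(e)]
invalid = [e for e in emails if not EMAIL_PATTERN.fullmatch(e)]
```

Note that a `|` inside a character class is a literal pipe, not alternation, so this particular pattern also admits pipes in the local part.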
pip install -U datallm
from datallm import DataLLM
datallm = DataLLM(api_key='INSERT_API_KEY', base_url='https://data.mostly.ai')
import pandas as pd
from datallm import DataLLM
datallm = DataLLM(api_key='INSERT_API_KEY')
df = pd.DataFrame({
"age in years": [5, 10, 13, 19, 30, 40, 50, 60, 70, 80],
"gender": ["m", "f", "m", "f", "m", "f", "m", "f", "m", "f"],
"country code": ["AT", "DE", "FR", "IT", "ES", "PT", "GR", "UK", "SE", "FI"],
})
# enrich the DataFrame with a new column containing the official country name
df["country"] = datallm.enrich(df, prompt="official name of the country")
# enrich the DataFrame with first name and last name
df["first name"] = datallm.enrich(df, prompt="the first name of that person")
df["last name"] = datallm.enrich(df, prompt="the last name of that person")
# enrich the DataFrame with a categorical
df["age group"] = datallm.enrich(
df, prompt="age group", categories=["kid", "teen", "adult", "elderly"]
)
# enrich with boolean and integer values
df["isMale"] = datallm.enrich(df, prompt="is Male?", dtype="boolean")
df["body height"] = datallm.enrich(df, prompt="the body height in cm", dtype="integer")
df["body weight"] = datallm.enrich(df, prompt="the body weight in kg", dtype="integer")
df
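Since each enrichment call generates one column at a time, cross-column consistency is not guaranteed. A minimal post-hoc sanity check between `gender` and the derived `isMale` flag might look like this (the frame below uses hand-written stand-in values, not actual DataLLM output):

```python
import pandas as pd

# Stand-in data mimicking the enriched frame; real values come from datallm.enrich.
df = pd.DataFrame({
    "gender": ["m", "f", "m", "f"],
    "isMale": [True, False, False, False],  # one deliberate mismatch
})

# Flag rows where the boolean enrichment disagrees with the source column.
mismatch = df[(df["gender"] == "m") != df["isMale"]]
agreement = 1 - len(mismatch) / len(df)
```

Rows flagged this way can be dropped or regenerated, depending on how strict the downstream use case is.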
import pandas as pd
from datallm import DataLLM
datallm = DataLLM(api_key='INSERT_API_KEY')
df = datallm.mock(
n=100,
data_description="fake customers of an Austrian bank",
columns={
"full name": {"prompt": "a first name with S"},
"last name": {"prompt": "a double-barrelled last name"},
"age": {"prompt": "the customers age", "dtype": "integer"},
"balance": {"prompt": "the customers current balance in EUR", "dtype": "float"},
"has credit card": {"prompt": "does the customer have a credit card", "dtype": "boolean"},
"customer since": {"prompt": "date that the customer has joined", "dtype": "date"},
"state": {"prompt": "the state of residence",
"categories": ["Vienna", "Lower Austria", "Upper Austria", "Carynthia", "Styria", "Tyrol", "Vorarlberg", "Burgenland", "Salzburg"]},
"ZIP": {"prompt": "austrian zip code for that customer", "regex": "(A-)([1-9][0-9]{3})"},
},
temperature=0.5,
progress_bar=False,
)
df
Note: For this to work well, it is advised to use a powerful yet well-balanced underlying LLM.
This is just the beginning. We are curious to learn more about your use cases and how DataLLM can help you.
DataLLM leverages fine-tuned foundation models. These are served via vLLM through a Python-based server instance that exposes the service as a REST API. The Python client is a wrapper around this API, making it easy to interact with the service.
These are the core components, all of which are open-sourced and available on GitHub:
- datallm-server: Exposes the REST API for the service.
- datallm-engine: Runs on top of vLLM and handles the actual prompts.
- datallm-client: A Python wrapper for interacting with the service.
- datallm-utils: A set of utility scripts for fine-tuning new DataLLM models.
A fine-tuned model, as well as its corresponding instruction dataset, can be found on HuggingFace.
datallm.enrich(data, prompt, ...)
Creates a new pd.Series given the context of a pd.DataFrame. This makes it easy to enrich a DataFrame with new values generated by DataLLM.
datallm.enrich(
data: Union[pd.DataFrame, pd.Series],
prompt: str,
data_description: Optional[str] = None,
dtype: Union[str, DtypeEnum] = None,
regex: Optional[str] = None,
categories: Optional[list[str]] = None,
max_tokens: Optional[int] = 16,
temperature: Optional[float] = 0.7,
top_p: Optional[float] = 1.0,
model: Optional[str] = None,
progress_bar: bool = True,
) -> pd.Series:
The dtype argument accepts string, category, integer, float, boolean, date or datetime. If categories are provided, the dtype defaults to category. The available models can be listed via datallm.models(); the default model is the first model in that list.
datallm.mock(n, columns, ...)
Create a pd.DataFrame from scratch using DataLLM. This will create one column after the other, for as many rows as requested. Note that rows are sampled independently of each other and may therefore contain duplicates.
datallm.mock(
n: int,
data_description: Optional[str] = None,
columns: Union[List[str], Dict[str, Any]] = None,
temperature: Optional[float] = 0.7,
top_p: Optional[float] = 1.0,
model: Optional[str] = None,
progress_bar: bool = True,
) -> pd.DataFrame:
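Because rows are sampled independently, a mocked frame may contain duplicates. When uniqueness matters, a standard pandas drop_duplicates pass can trim them, sketched below on hand-written stand-in data rather than actual DataLLM output:

```python
import pandas as pd

# Stand-in for a mocked frame; real data comes from datallm.mock.
df = pd.DataFrame({
    "name": ["Anna Bauer", "Anna Bauer", "Max Gruber"],
    "state": ["Vienna", "Vienna", "Tyrol"],
})

# Drop fully identical rows; pass subset=[...] to dedupe on key columns only.
unique_df = df.drop_duplicates().reset_index(drop=True)
```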
When columns is provided as a dictionary, each column configuration supports the keys prompt, dtype, regex, categories, max_tokens, temperature and top_p.
We're committed to making DataLLM better every day. Your feedback and contributions are not just welcome, they're essential. Join our community and help shape the future of tabular data generation!
MOSTLY AI is a pioneer and leader in GenAI for tabular data. Our mission is to enable organizations to unlock the full potential of their data while preserving privacy and compliance. We are a team of data scientists, engineers, and privacy experts, dedicated to making data, and thus information, more accessible. We are proud to be a trusted partner for leading organizations across industries and geographies.
If you like DataLLM, then also check out app.mostly.ai for our Synthetic Data Platform, which allows you to easily train a Generative AI on top of your own original data. These models can then be used to generate synthetic data at any volume that is statistically similar to the original, yet free of any personal information.