{"id":119896,"date":"2026-07-03T13:32:03","date_gmt":"2026-07-03T08:02:03","guid":{"rendered":"https:\/\/www.guvi.in\/blog\/?p=119896"},"modified":"2026-07-03T13:32:04","modified_gmt":"2026-07-03T08:02:04","slug":"bentoml-tutorial","status":"publish","type":"post","link":"https:\/\/www.guvi.in\/blog\/bentoml-tutorial\/","title":{"rendered":"BentoML Tutorial: From Model to Production API"},"content":{"rendered":"\n<p>Most ML teams don\u2019t struggle with building models, they struggle with shipping them. A notebook that scores 95% accuracy is still useless to anyone outside the data science team until it\u2019s wrapped in an API, packaged with the right dependencies, and running somewhere reliable.&nbsp;<\/p>\n\n\n\n<p>That\u2019s the gap BentoML is built to close. Instead of hand-rolling a Flask app, writing a Dockerfile from scratch, and hoping the production environment matches your laptop, BentoML turns a Python class into a versioned, containerized, deployable service with a handful of decorators.<\/p>\n\n\n\n<p>This tutorial walks through that full path: saving a model into BentoML\u2019s model store, defining a Service with validated APIs, adding async and batched endpoints, wiring multiple Services together, configuring GPU resources, and finally building and deploying the whole thing.<\/p>\n\n\n\n<p><strong>TL;DR<\/strong><\/p>\n\n\n\n<ul>\n<li>BentoML separates model storage from service code, so nothing is hardcoded to a file path<br><\/li>\n\n\n\n<li>@bentoml.service turns a Python class into a deployable unit; @bentoml.api turns a method into an HTTP endpoint<br><\/li>\n\n\n\n<li>Pydantic types on your API methods give you free request validation and OpenAPI docs<br><\/li>\n\n\n\n<li>batchable=True boosts throughput, async boosts concurrency, and you can combine both<br><\/li>\n\n\n\n<li>A Bento bundles code, model references, and environment specs into one versioned artifact<br><\/li>\n\n\n\n<li>bentoml containerize and bentoml deploy take that 
artifact straight to Docker or BentoCloud<br><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What is BentoML?<\/strong><\/h2>\n\n\n\n<p>BentoML is an open-source Python framework for packaging, serving, and deploying machine learning models as production-ready APIs. It helps developers save versioned model artifacts, define inference Services, expose prediction logic through REST endpoints, and containerize the full deployment environment.<\/p>\n\n\n\n<div style=\"background-color: #099f4e; border: 3px solid #110053; border-radius: 12px; padding: 18px 22px; color: #FFFFFF; font-size: 18px; font-family: Montserrat, Helvetica, sans-serif; line-height: 1.6; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15); max-width: 750px;\">\n  <strong style=\"font-size: 22px; color: #FFFFFF;\">\ud83d\udca1 Did You Know?<\/strong> \n  <br \/><br \/> \n  The word <strong style=\"color: #FFFFFF;\">\u201cBento\u201d<\/strong> in BentoML is a direct nod to the Japanese <strong style=\"color: #FFFFFF;\">bento box<\/strong>, since a Bento packages your model, code, and environment into one neat, portable container, much like the meal itself.\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>BentoML Tutorial: From Model to Production API<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>1. Installation<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install bentoml scikit-learn<\/code><\/pre>\n\n\n\n<p>Verify the install:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>bentoml --version<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>2. Train and save a model<\/strong><\/h3>\n\n\n\n<p>BentoML has a model store that versions and tracks artifacts separately from your service code. 
Save a model into it using the framework-specific integration here, bentoml.sklearn:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># train.py\n\nimport bentoml\n\nfrom sklearn import svm, datasets\n\niris = datasets.load_iris()\n\nX, y = iris.data, iris.target\n\nclf = svm.SVC(gamma=\"scale\")\n\nclf.fit(X, y)\n\n# Saves the model into BentoML's local model store with a version tag\n\nsaved_model = bentoml.sklearn.save_model(\"iris_clf\", clf)\n\nprint(f\"Model saved: {saved_model}\")<\/code><\/pre>\n\n\n\n<p>Run it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>python train.py<\/code><\/pre>\n\n\n\n<p>You can confirm it landed in the store:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>bentoml models list<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>3. Define a Service<\/strong><\/h3>\n\n\n\n<p>BentoML Services are plain <a href=\"https:\/\/www.guvi.in\/blog\/python-for-data-science\/\" target=\"_blank\" rel=\"noreferrer noopener\">Python<\/a> classes decorated with @bentoml.service. Each method exposed with @bentoml.api becomes an HTTP endpoint. 
Create service.py:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># service.py\n\nfrom __future__ import annotations\n\nimport numpy as np\n\nimport bentoml\n\nfrom pydantic import BaseModel, Field\n\nclass IrisFeatures(BaseModel):\n\n&nbsp;&nbsp;&nbsp;&nbsp;sepal_length: float = Field(..., ge=0)\n\n&nbsp;&nbsp;&nbsp;&nbsp;sepal_width: float = Field(..., ge=0)\n\n&nbsp;&nbsp;&nbsp;&nbsp;petal_length: float = Field(..., ge=0)\n\n&nbsp;&nbsp;&nbsp;&nbsp;petal_width: float = Field(..., ge=0)\n\n@bentoml.service(\n\n&nbsp;&nbsp;&nbsp;&nbsp;name=\"iris_classifier\",\n\n&nbsp;&nbsp;&nbsp;&nbsp;resources={\"cpu\": \"1\"},\n\n&nbsp;&nbsp;&nbsp;&nbsp;traffic={\"timeout\": 10},\n\n)\n\nclass IrisClassifier:\n\n&nbsp;&nbsp;&nbsp;&nbsp;# Reference the saved model by tag; BentoML resolves the path at runtime\n\n&nbsp;&nbsp;&nbsp;&nbsp;bento_model = bentoml.models.BentoModel(\"iris_clf:latest\")\n\n&nbsp;&nbsp;&nbsp;&nbsp;def __init__(self) -&gt; None:\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;import joblib\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.model = joblib.load(self.bento_model.path_of(\"saved_model.pkl\"))\n\n&nbsp;&nbsp;&nbsp;&nbsp;@bentoml.api\n\n&nbsp;&nbsp;&nbsp;&nbsp;def predict(self, input_data: IrisFeatures) -&gt; dict:\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;input_series = np.array(&#91;&#91;\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;input_data.sepal_length,\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;input_data.sepal_width,\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;input_data.petal_length,\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;input_data.petal_width,\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;]])\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;prediction = self.model.predict(input_series)\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return {\"class\": 
int(prediction&#91;0])}<\/code><\/pre>\n\n\n\n<p>A few details that matter here, not cosmetic ones:<\/p>\n\n\n\n<ul>\n<li>bentoml.models.BentoModel is a reference to an entry in the model store, resolved at service startup, not a raw file path. This is what makes the Bento, the final build artifact, reproducible across machines.<\/li>\n\n\n\n<li>Pydantic models as input types give you automatic request validation and a generated <a href=\"https:\/\/www.guvi.in\/blog\/what-is-openrouter-api\/\" target=\"_blank\" rel=\"noreferrer noopener\">OpenAPI<\/a> schema, visible at \/docs once the service is running. A malformed request body returns a 422 before your function body ever runs.<\/li>\n\n\n\n<li>resources and traffic in the decorator are Service-level config, not per-request config. They control how many CPU\/GPU resources the runtime allocates and what the request timeout is, and they get baked into the Bento at build time.<\/li>\n<\/ul>\n\n\n\n<p><em>Build production-ready Python skills with HCL GUVI\u2019s <\/em><a href=\"https:\/\/www.guvi.in\/courses\/programming\/python\/?utm_source=blog&amp;utm_medium=hyperlink&amp;utm_campaign=bentoml-tutorial-from-model-to-production-api\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Python Course<\/em><\/a><em>, certified by IITM Pravartak. Learn Python from the basics to advanced features through 17 hours of recorded content, 4 modules with certifications, and best practices used by real employers. Start learning for free and pay for the certificate later.<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>4. Run it locally<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>bentoml serve service:IrisClassifier<\/code><\/pre>\n\n\n\n<p>By default this binds to http:\/\/localhost:3000. 
BentoML auto-generates an interactive Swagger UI at http:\/\/localhost:3000 where you can call \/predict directly. The route name is derived from the method name unless you override it.<\/p>\n\n\n\n<p>Test it from another terminal with a sample set of measurements (the values are illustrative):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>curl -X POST http:\/\/localhost:3000\/predict \\\n\n-H \"Content-Type: application\/json\" \\\n\n-d '{\"input_data\": {\"sepal_length\": 5.1, \"sepal_width\": 3.5, \"petal_length\": 1.4, \"petal_width\": 0.2}}'<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>5. Async <\/strong><a href=\"https:\/\/www.guvi.in\/hub\/network-programming-with-python\/understanding-apis\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>APIs<\/strong><\/a><strong> and batching<\/strong><\/h3>\n\n\n\n<p>For IO-bound work (calling another service, reading from disk, hitting a vector DB), async methods let BentoML interleave requests instead of blocking a worker thread per request:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>@bentoml.api\n\nasync def predict_async(self, input_data: IrisFeatures) -&gt; dict:\n\n&nbsp;&nbsp;&nbsp;&nbsp;# _run_inference stands in for your own awaitable helper; it is not a BentoML API\n\n&nbsp;&nbsp;&nbsp;&nbsp;result = await self._run_inference(input_data)\n\n&nbsp;&nbsp;&nbsp;&nbsp;return result<\/code><\/pre>\n\n\n\n<p>For CPU\/GPU-bound inference where you want the runtime to group concurrent requests into a single forward pass, mark the API as batchable:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>@bentoml.api(batchable=True)\n\ndef predict_batch(self, input_series: list&#91;list&#91;float]]) -&gt; list&#91;int]:\n\n&nbsp;&nbsp;&nbsp;&nbsp;arr = np.array(input_series)\n\n&nbsp;&nbsp;&nbsp;&nbsp;return self.model.predict(arr).tolist()<\/code><\/pre>\n\n\n\n<p>Batching is a runtime-level optimization, not something your function implements manually: BentoML accumulates concurrent calls up to a configurable max batch size and latency window, then dispatches one combined call to your function.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>6. Composing multiple Services<\/strong><\/h3>\n\n\n\n<p>Real systems are rarely one model. 
BentoML lets you wire Services together with bentoml.depends(), so one Service can call another\u2019s API as if it were a local method:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Assumes IrisFeatures and IrisClassifier from service.py are defined or imported in this module\n\n@bentoml.service\n\nclass Preprocessing:\n\n&nbsp;&nbsp;&nbsp;&nbsp;@bentoml.api\n\n&nbsp;&nbsp;&nbsp;&nbsp;def clean(self, raw: dict) -&gt; IrisFeatures:\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return IrisFeatures(**raw)\n\n@bentoml.service\n\nclass Pipeline:\n\n&nbsp;&nbsp;&nbsp;&nbsp;preprocessing = bentoml.depends(Preprocessing)\n\n&nbsp;&nbsp;&nbsp;&nbsp;classifier = bentoml.depends(IrisClassifier)\n\n&nbsp;&nbsp;&nbsp;&nbsp;@bentoml.api\n\n&nbsp;&nbsp;&nbsp;&nbsp;def predict(self, raw: dict) -&gt; dict:\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;features = self.preprocessing.clean(raw)\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return self.classifier.predict(features)<\/code><\/pre>\n\n\n\n<p>Each Service in the dependency graph can be scaled and resourced independently when deployed, which matters when one stage is CPU-light preprocessing and another needs GPU inference.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>7. 
GPU resources<\/strong><\/h3>\n\n\n\n<p>For a model that needs a GPU, the resources block changes scheduling behavior; it is not just a label:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>@bentoml.service(\n\n&nbsp;&nbsp;&nbsp;&nbsp;resources={\"gpu\": 1, \"gpu_type\": \"nvidia-l4\"},\n\n&nbsp;&nbsp;&nbsp;&nbsp;traffic={\"timeout\": 60},\n\n)\n\nclass GPUService:\n\n&nbsp;&nbsp;&nbsp;&nbsp;model = bentoml.models.HuggingFaceModel(\"meta-llama\/Meta-Llama-3.1-8B-Instruct\")\n\n&nbsp;&nbsp;&nbsp;&nbsp;@bentoml.api\n\n&nbsp;&nbsp;&nbsp;&nbsp;async def generate(self, prompt: str, max_tokens: int = 256) -&gt; str:\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...<\/code><\/pre>\n\n\n\n<p>bentoml.models.HuggingFaceModel pulls directly from the Hugging Face Hub and caches it through BentoML\u2019s model store, so it\u2019s versioned the same way as a locally trained model.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>8. Building a Bento<\/strong><\/h3>\n\n\n\n<p>A Bento is the packaged, versioned unit BentoML deploys: code, model references, and the <a href=\"https:\/\/www.guvi.in\/hub\/python-tutorial\/getting-started-with-python\/\" target=\"_blank\" rel=\"noreferrer noopener\">Python environment<\/a> spec, bundled together. Define the environment in the image field of the decorator, or in a separate bentofile.yaml:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># bentofile.yaml\n\nservice: \"service:IrisClassifier\"\n\nlabels:\n\n&nbsp;&nbsp;owner: ml-team\n\ninclude:\n\n&nbsp;&nbsp;- \"service.py\"\n\npython:\n\n&nbsp;&nbsp;packages:\n\n&nbsp;&nbsp;&nbsp;&nbsp;- scikit-learn\n\n&nbsp;&nbsp;&nbsp;&nbsp;- pydantic<\/code><\/pre>\n\n\n\n<p>Build it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>bentoml build<\/code><\/pre>\n\n\n\n<p>This produces a versioned Bento in the local store; bentoml list shows it.<\/p>\n\n\n\n
<p>Containerize it into a Docker image without writing a Dockerfile:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>bentoml containerize iris_classifier:latest<\/code><\/pre>\n\n\n\n<p>This generates a Docker image with the exact Python version, system packages, and pip dependencies your Service declared, so \u201cworks on my machine\u201d and \u201cworks in the container\u201d stop being two different claims.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>9. Deploying<\/strong><\/h3>\n\n\n\n<p>For BentoCloud-managed deployment:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>bentoml cloud login\nbentoml deploy\n<\/code><\/pre>\n\n\n\n<p>bentoml deploy, run from the project directory, picks up your bentofile.yaml, builds the Bento, and ships it to BentoCloud with autoscaling and GPU provisioning handled for you.<\/p>\n\n\n\n<p>For self-managed infrastructure, take the image from bentoml containerize and run it on whatever you already use (Kubernetes, ECS, plain Docker), since it\u2019s a standard OCI image at that point.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>Once you\u2019ve built one BentoML Service, the pattern holds for almost anything you\u2019ll deploy next, whether that\u2019s an sklearn model, a Hugging Face pipeline, or a multi-stage RAG system chained together with bentoml.depends(). The framework\u2019s real value isn\u2019t the decorators themselves; it\u2019s that the thing you tested locally with bentoml serve is the exact same thing that ends up in production, environment and all. That guarantee is what usually takes teams the longest to build by hand, and BentoML gives it to you by default.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>FAQs<\/strong><\/h2>\n\n\n<div id=\"rank-math-faq\" class=\"rank-math-block\">\n<div class=\"rank-math-list \">\n<div id=\"faq-question-1782907777161\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>Do I need Docker knowledge to use BentoML?<\/strong>\u00a0<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>No. 
bentoml containerize generates the Docker image for you based on your Service\u2019s declared dependencies. Docker knowledge helps if you\u2019re debugging a deployment issue, but it\u2019s not a prerequisite to get a working container.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1782907790028\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>Can BentoML serve models from frameworks other than scikit-learn?<\/strong>\u00a0<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Yes. BentoML has built-in integrations for PyTorch, TensorFlow, Hugging Face Transformers, XGBoost, and others, plus a generic path for any custom Python inference code.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1782907813413\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>What\u2019s the difference between bentoml serve and a deployed Bento?<\/strong>\u00a0<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>bentoml serve runs your Service directly from source for local testing. A deployed Bento is a built, versioned artifact with its environment locked in, which is what should actually receive production traffic.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1782907824629\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>Do I need BentoCloud to deploy a BentoML service?<\/strong>\u00a0<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>No. bentoml deploy is the fastest path if you\u2019re using BentoCloud, but bentoml containerize produces a standard OCI image you can run on any infrastructure, including Kubernetes or plain Docker.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1782907846095\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>How does BentoML handle multiple models in one pipeline?<\/strong>\u00a0<\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Through bentoml.depends(), which lets one Service call another Service\u2019s API as if it were a local method. 
Each Service in that chain can be scaled and resourced independently, so a lightweight preprocessing step doesn\u2019t need the same GPU as the model it feeds into.<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>Most ML teams don\u2019t struggle with building models, they struggle with shipping them. A notebook that scores 95% accuracy is still useless to anyone outside the data science team until it\u2019s wrapped in an API, packaged with the right dependencies, and running somewhere reliable.&nbsp; That\u2019s the gap BentoML is built to close. Instead of hand-rolling [&hellip;]<\/p>\n","protected":false},"author":60,"featured_media":120526,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[933],"tags":[],"views":"33","authorinfo":{"name":"Vaishali","url":"https:\/\/www.guvi.in\/blog\/author\/vaishali\/"},"thumbnailURL":"https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/07\/BentoML-Tutorial-300x116.webp","_links":{"self":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts\/119896"}],"collection":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/users\/60"}],"replies":[{"embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/comments?post=119896"}],"version-history":[{"count":4,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts\/119896\/revisions"}],"predecessor-version":[{"id":120528,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts\/119896\/revisions\/120528"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/media\/120526"}],"wp:attachment":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/media?parent=119896"}],"wp:term":[{"taxonomy":"category","embeddable":true,"hr
ef":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/categories?post=119896"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/tags?post=119896"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}