Milvus — Storage Model Part One, Collections and Partitions

Storage Model
Milvus employs several components to manage and process data efficiently. Key among these are collections, partitions, and segments. In this article, I will delve into these three components, providing a comprehensive understanding of their roles and functionalities within Milvus.

Collections in Milvus
A collection in Milvus is similar to a table in a relational database, serving as the primary data container. Each collection is created with a specific schema that defines the fields it will contain, much like the columns in a relational database table.
Before creating a collection, you need to prepare the collection schema and a field schema for every field. The collection schema encapsulates the field definitions, the collection’s description, and whether automatic primary key allocation is enabled. The field schema, in turn, specifies the name of the field, its data type, and other properties associated with the field.

Each collection in Milvus comprises one or more partitions, with data housed as segments within these partitions.
Every Milvus collection necessitates a unique name and a collection schema. The collection schema contains the description of the collection, an array of fields that constitute the collection, and a parameter referred to as ‘auto_id’. This parameter facilitates automatic primary key allocation.
Each field specified within the collection schema should possess a corresponding field schema. This field schema incorporates the name of the field, its datatype, and an attribute called ‘is_primary’. This attribute determines whether the particular field will serve as the primary key. Additionally, based on the datatypes, specifications such as dimensions or maximum length can be established.
Data Types
Now, let’s explore the data types supported by Milvus. A field within Milvus can be classified into one of three categories: a primary key field, a scalar field, or a vector field.
A primary key field in Milvus supports the 64-bit integer and variable-length string (VARCHAR) data types.
A scalar field can be of various types, including boolean and 8-, 16-, 32-, or 64-bit integers. It can also be a floating-point number, a double, or a variable-length string (VARCHAR).
When it comes to vector fields, Milvus offers support for vectors of binary data type and floating-point data type.
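To make the two vector types concrete, here is a minimal sketch of how many bytes one raw vector occupies. It assumes that float vectors store each dimension as a 32-bit float and that binary vectors pack eight dimensions into each byte, which is why a binary vector's dimension is expected to be a multiple of 8. The helper names and the most-significant-bit-first packing order are illustrative assumptions, not part of pymilvus.

```python
from typing import List


def float_vector_bytes(dim: int) -> int:
    """Raw size of one float vector, assuming 4 bytes (float32) per dimension."""
    return dim * 4


def binary_vector_bytes(dim: int) -> int:
    """Raw size of one binary vector, assuming 8 dimensions packed per byte."""
    if dim % 8 != 0:
        raise ValueError("binary vector dimension must be a multiple of 8")
    return dim // 8


def pack_bits(bits: List[int]) -> bytes:
    """Pack a list of 0/1 values into bytes (assumed order: first bit is the most significant)."""
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    out = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | (1 if bit else 0)
        out.append(byte)
    return bytes(out)
```

Under these assumptions, a 128-dimensional binary vector takes 16 bytes, while a float vector of the same dimension takes 512 bytes.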

It is important to note that a collection in Milvus must define one primary key field and at least one vector field.
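The rule can be illustrated with a small, self-contained check. This is a sketch only: the Field class and validate_fields function are hypothetical stand-ins written for this example, not the pymilvus classes, and the sketch assumes a collection has exactly one primary key field.

```python
from dataclasses import dataclass
from typing import List

# Data type names treated as vector types in this sketch.
VECTOR_TYPES = {"FLOAT_VECTOR", "BINARY_VECTOR"}


@dataclass
class Field:
    """Hypothetical stand-in for a field schema: name, data type, primary flag."""
    name: str
    dtype: str
    is_primary: bool = False


def validate_fields(fields: List[Field]) -> List[str]:
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    primary_count = sum(1 for f in fields if f.is_primary)
    if primary_count != 1:
        problems.append(
            f"expected exactly one primary key field, found {primary_count}"
        )
    if not any(f.dtype in VECTOR_TYPES for f in fields):
        problems.append("expected at least one vector field")
    return problems
```

For example, a field list with an INT64 primary key and one FLOAT_VECTOR field passes, while a list containing only a VARCHAR field reports both problems.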
Milvus Collections Demo
Let’s walk through an example of common collection operations in Milvus.
Importing Modules
First, we install pymilvus and import the classes we need from the module:
from pymilvus import (
    connections,
    utility,
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
)
Define Connection
# Define connection
connections.connect("default", host="localhost", port="19530")
- connections.connect: A method from the pymilvus library that creates a connection between your Python application and the Milvus server. pymilvus uses this connection to perform operations such as data insertion, search, deletion, and index creation.
- "default": The alias for the connection. In pymilvus, you can create multiple connections and give each one a unique alias. Using "default" as the alias is common practice for the primary or sole connection.
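Because connections are identified by alias, it can be handy to connect only when the alias is not already in use. The following is a minimal sketch; ensure_connection is a hypothetical helper written for this example, and it takes the pymilvus connections object as a parameter so that the logic can be exercised without a running server.

```python
def ensure_connection(connections, alias="default", host="localhost", port="19530"):
    """Connect under `alias` only if no connection with that alias exists yet.

    `connections` is expected to be the object imported from pymilvus
    (from pymilvus import connections); it is passed in rather than
    imported here so the helper stays testable in isolation.
    """
    if not connections.has_connection(alias):
        connections.connect(alias, host=host, port=port)
    return alias


# Usage with a running Milvus server (not executed here):
# from pymilvus import connections
# ensure_connection(connections, alias="default")
```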
Define FieldSchema
# Define Field Schema
song_name = FieldSchema(
    name="song_name",
    dtype=DataType.VARCHAR,
    description="name of the song",
    max_length=200,
)
song_id = FieldSchema(
    name="song_id", dtype=DataType.INT64, description="id of the song", is_primary=True
)
play_count = FieldSchema(
    name="play_count", dtype=DataType.INT64, description="play count of the song"
)
song_vector = FieldSchema(
    name="song_vector",
    dtype=DataType.FLOAT_VECTOR,
    dim=10,
    description="vector of the song",
)
In the above example, we define the field schemas for a Milvus collection, with fields for song names, song IDs, play counts, and song vectors. These schemas are essential for organizing and querying data within Milvus.
- FieldSchema: An object that represents the schema for a field (column) in a collection. Each field has a specific data type and certain attributes.
- song_name = FieldSchema(...): This line defines a field for storing the name of a song.
- dtype=DataType.VARCHAR: The data type of the song_name field is set to VARCHAR, suitable for storing variable-length strings.
- max_length=200: This specifies the maximum length of the string that can be stored in this field.
- description="name of the song": A human-readable description of what this field represents.
- is_primary=True: This indicates that song_id is the primary key for the collection.
- dim=10: This specifies the dimensionality of the vector. In this case, each vector has 10 elements.
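The max_length and dim attributes act as constraints on the data you later insert. As a rough illustration, the sketch below checks a row on the client side before insertion. The check_row helper and its constants are hypothetical and simply mirror the values used above; it counts characters, whereas the server may measure the limit differently, so treat that as an assumption.

```python
from typing import List

MAX_NAME_LENGTH = 200  # mirrors max_length=200 on the song_name field
VECTOR_DIM = 10        # mirrors dim=10 on the song_vector field


def check_row(song_name: str, song_vector: List[float]) -> List[str]:
    """Return a list of constraint violations for one row (empty if none)."""
    problems = []
    if len(song_name) > MAX_NAME_LENGTH:
        problems.append(
            f"song_name has {len(song_name)} characters, limit is {MAX_NAME_LENGTH}"
        )
    if len(song_vector) != VECTOR_DIM:
        problems.append(
            f"song_vector has {len(song_vector)} elements, expected {VECTOR_DIM}"
        )
    return problems
```

A row such as check_row("Imagine", [0.1] * 10) returns an empty list, while a name longer than 200 characters or a vector with 9 elements is reported.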
Define Collection Schema
# Define collection schema
collection_schema = CollectionSchema(
    fields=[song_name, song_id, play_count, song_vector],
    description="collection schema of songs",
)
The above snippet combines the fields into a single collection schema. It covers both traditional attributes (like names and counts) and more complex data types (like vectors for machine learning applications).
- CollectionSchema: A class in pymilvus that defines the schema of a collection in Milvus. A collection in Milvus is similar to a table in a relational database and is used to store and manage data.
- fields=[song_name, song_id, play_count, song_vector]: This line specifies the fields that make up the collection. Each field is defined by a FieldSchema object created in the previous step.
Create Collection
# Create collection
collection = Collection(name="Songs", schema=collection_schema, using="default")
print(utility.list_collections())
The snippet creates a new collection named “Songs” in a Milvus database, based on a predefined schema (collection_schema). After creating the collection, the list_collections function is used to display all collections in the database, which is useful for confirmation and debugging purposes.
- Collection(name="Songs", schema=collection_schema, using="default"): This line creates a new collection in the Milvus database.
- utility.list_collections(): This function lists all collections currently in the Milvus database. It's a way to verify that the new collection "Songs" has been successfully created and exists in the database.
Modify Collection
# Rename collection
utility.rename_collection("Songs", "Songs_new")
print(utility.list_collections())
# Drop collection
utility.drop_collection("Songs_new")
print(utility.list_collections())
You can use utility.rename_collection() and utility.drop_collection() to rename or drop an existing collection.
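Since dropping a collection is destructive, you may want to check that it exists first. The following is a minimal sketch; drop_if_exists is a hypothetical helper written for this example, and it receives the pymilvus utility module as a parameter so the logic can be tried without a running server.

```python
def drop_if_exists(utility, name, using="default"):
    """Drop the collection `name` if it exists; return True if it was dropped.

    `utility` is expected to be the module imported from pymilvus
    (from pymilvus import utility).
    """
    if utility.has_collection(name, using=using):
        utility.drop_collection(name, using=using)
        return True
    return False


# Usage with a running Milvus server (not executed here):
# from pymilvus import utility
# drop_if_exists(utility, "Songs_new")
```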
Partitions
Once we have a collection, we can define partitions within it. Every collection starts with a default partition, and additional partitions are created explicitly:
- Creating partitions: You can create partitions using the create_partition API provided by the Milvus client SDK, and you can assign data to a specific partition during insertion.
- Indexes are separate: The create_index API builds an index on a field of the collection to speed up search. It does not create partitions or decide which partition the data belongs to.
For example:
# Create collection
collection = Collection(name="Songs", schema=collection_schema, using="default")
print(utility.list_collections())
# Create a partition
collection.create_partition("albam1")
print(collection.has_partition("albam1"))
This code creates a collection named “Songs” with the specified schema over the "default" connection. It then creates a partition called “albam1” within the collection and checks whether the partition exists.
- collection.create_partition("albam1"): This line creates a partition named "albam1" within the "Songs" collection. Partitions are logical divisions within a collection that can be used to organize and manage data.
- print(collection.has_partition("albam1")): This line checks whether the "albam1" partition exists within the "Songs" collection using the has_partition method. It prints either True or False to indicate the existence of the partition.
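Once a partition exists, data can be routed to it at insertion time through the partition_name argument of the collection's insert method. The sketch below is illustrative: build_song_columns and insert_into_partition are hypothetical helpers, the generated values are random placeholder data, and the columns are ordered to match the schema defined earlier (song_name, song_id, play_count, song_vector).

```python
import random
from typing import Any, List


def build_song_columns(n: int, dim: int = 10, seed: int = 0) -> List[list]:
    """Build column-ordered placeholder data for n songs."""
    rng = random.Random(seed)
    names = [f"song_{i}" for i in range(n)]
    ids = list(range(n))
    play_counts = [rng.randint(0, 1000) for _ in range(n)]
    vectors = [[rng.random() for _ in range(dim)] for _ in range(n)]
    # Column order must match the field order in the collection schema.
    return [names, ids, play_counts, vectors]


def insert_into_partition(collection: Any, partition_name: str, columns: List[list]):
    """Insert the columns into the named partition of a pymilvus Collection."""
    return collection.insert(columns, partition_name=partition_name)


# Usage with the collection created above (needs a running Milvus server):
# insert_into_partition(collection, "albam1", build_song_columns(5))
```

Keeping the data generation separate from the insert call makes it easy to swap in real song data while leaving the partition routing unchanged.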
To drop a partition:
# Drop a partition
collection.drop_partition("albam1")
print(collection.has_partition("albam1"))