Nov 1, 2022
We regularly invite ML practitioners and industry leaders to share their experiences with our Community. Want to ask our next guest a question? Join the BentoML Community Slack.
We recently invited Frank Liu, Director of Operations at Zilliz, a leading provider of vector database and AI technologies. The Zilliz team are also the engineers and scientists who created LF AI Milvus®, the world’s most popular open-source vector database.
Vector databases are great for anybody who wants to analyze unstructured data, e.g. images, video, audio, and text. One example application would be searching for images using other images. Vector databases do this by performing large-scale vector search. There are quite a few libraries that do this, such as FAISS and Hnswlib, but these don’t quite have the database features that you’d want or need.
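To make "vector search" a little more concrete, here is a minimal sketch of the core computation: a brute-force cosine-similarity search over a small set of vectors using NumPy. The function name `top_k_cosine`, the random "image embeddings", and the sizes are assumptions chosen for illustration. Libraries such as FAISS and Hnswlib, and vector databases built on similar ideas, use far more efficient index structures than this exhaustive scan.

```python
import numpy as np


def top_k_cosine(query, vectors, k=3):
    """Return the indices and scores of the k vectors most similar to `query`.

    Illustrative brute-force search; real systems use approximate indexes.
    """
    # Normalize both sides so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    # Sort by descending similarity and keep the first k entries.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy stand-ins for image embeddings (assumption: in practice these would
# come from a trained model, not a random number generator).
rng = np.random.default_rng(0)
image_vectors = rng.normal(size=(1000, 128))

# The query is a slightly perturbed copy of image 42, mimicking a
# "find images that look like this one" request.
query_vector = image_vectors[42] + 0.01 * rng.normal(size=128)

indices, scores = top_k_cosine(query_vector, image_vectors, k=3)
print(indices, scores)
```

The exhaustive scan compares the query against every stored vector, which is fine for a thousand vectors but becomes the bottleneck at larger scales.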
Great question. Feature stores and vector databases are ultimately two very different pieces of infrastructure. Feature stores tend to be a repository for curated features, while vector databases are for large-scale similarity searches. On that topic, I don’t see vector databases replacing traditional (relational/NoSQL) databases either, but I do see a day where these two types of databases will be used together.
As for how I got involved with Milvus, I had actually heard about it from a former coworker a while back. We were doing large-scale similarity search across images in the early days of deep learning, and there wasn’t any great infrastructure for doing so back then. Seeing Milvus was a huge a-ha moment for me, and the rest is history.
Definitely! Vector databases are used for product recommendation, video analysis, targeted advertising, personalized search, drug discovery, and so many more that I can’t think of off the top of my head! Using product recommendation as an example, a vector database lets me recommend products to users based on images or videos of the product at hand, rather than just tags or keywords.
I think we’re going in a direction where more and more data is being turned into vectors - everything from images to user profiles to geospatial data. Not only that, but a lot of these vectors will cross modalities. For example, images and text can be embedded into the same vector space. Today, I can already search for an image that corresponds to a description using a vector database, e.g. “a photo of a German Shepherd running across the hills” would find the most relevant image in the vector database. I envision a day where “everything” will be embedded into a common space, and that’ll really revolutionize the way search is done.
For now, vector databases are meant to work with production-ready models, but these are still early days. Perhaps in the future, vector databases could be used to distill large datasets and enable the training of smaller, more efficient models. Exciting times are ahead!
Generally, you’ll use a fully trained ML model to generate the embeddings - you can then insert these embeddings into Milvus. We also have a sister open-source project called Towhee (https://towhee.io) which is specifically meant to help with this.
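The embed-then-insert workflow described above can be sketched roughly as follows. Everything here is a hypothetical stand-in: `toy_embed` replaces a trained model with a deterministic hash-based vector (it carries no semantic meaning, so only identical text maps to the same vector), and `InMemoryVectorStore` is a toy class, not the Milvus or Towhee API. The sketch only shows the shape of the pipeline: embed, insert, then search with an embedded query.

```python
import hashlib

import numpy as np

DIM = 16  # Assumed embedding size, chosen only for illustration.


def toy_embed(text):
    """Hypothetical stand-in for a trained model: text -> unit vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    vec = np.random.default_rng(seed).normal(size=DIM)
    return vec / np.linalg.norm(vec)


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database collection."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []
        self.vectors = np.empty((0, dim))

    def insert(self, ids, vectors):
        vectors = np.asarray(vectors, dtype=float).reshape(-1, self.dim)
        if len(ids) != len(vectors):
            raise ValueError("ids and vectors must have the same length")
        self.ids.extend(ids)
        self.vectors = np.vstack([self.vectors, vectors])

    def search(self, query, k=1):
        # Assumes stored vectors and the query are unit length, so the
        # dot product is the cosine similarity.
        scores = self.vectors @ query
        order = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in order]


docs = {
    "doc-1": "a photo of a dog",
    "doc-2": "quarterly sales report",
    "doc-3": "a recipe for soup",
}

# Step 1: generate embeddings. Step 2: insert them into the store.
store = InMemoryVectorStore(DIM)
store.insert(list(docs), [toy_embed(text) for text in docs.values()])

# Step 3: embed the query the same way and search.
hits = store.search(toy_embed("quarterly sales report"), k=1)
print(hits)
```

With a real model in place of `toy_embed`, similar (not just identical) inputs would land near each other, which is what makes the search step useful.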
We have started a series called “Vector Database 101” on our website (here’s a link to the first one: https://zilliz.com/blog/introduction-to-unstructured-data). We’ll be releasing a new article every week, talking about everything in and around vector databases - use cases, vector indexes, caching, time travel, etc. For folks who are interested, I highly recommend checking it out! We also have some demos here: https://milvus.io/milvus-demos/
There are tons of technical challenges related to vector databases, but I think the biggest challenge is actually getting folks in other communities to understand what a vector database is, and why it’s needed. That means being an advocate for a new type of technology. Technology-wise, getting vector search to run at a billion+ scale entails quite a few architectural challenges.
Because embeddings are being used almost everywhere now, ML teams would have to resort to a vector search library if they don’t plan to use a vector database. Don’t get me wrong, vector search libraries are great, but they are only meant for the core “computation” component. A vector database provides a number of database features, such as replication, scaling, caching, and failover, along with a unified interface and numerous SDKs (Python, Go, Node, Java). Simply put, it helps take the “data infrastructure component” out of production AI/ML.
That’s actually a tough question! The Milvus team and community have definitely come a long way since day one. 10k stars was definitely a huge milestone for us, but I am most proud of some of the work that we did with less techy folks, like the Cleveland Museum of Art (https://www.clevelandart.org/artlens-ai). Trying to democratize data infra is tough, but we’re all getting there bit by bit. As for future product features and milestones, we’ll continue to improve stability and performance. We’ll also take steps toward heterogeneous computing. Vector databases, being the intersection of data and ML, are a very compute-heavy type of database, and getting things to run quickly on GPUs and FPGAs will be critical. We’re also working on disk-based vector indexing. Most vector databases today store the entire index in memory, which can be pretty expensive. Having an SSD-based solution will definitely help bring costs down.
My main focus for next year is helping grow the Zilliz team, both engineering and GTM. Building a solid team is crucial to any company’s success. In terms of a personal goal, I hope to get better at written communication, both in professional settings and in blogs, articles, and papers.
* The discussion was lightly edited for better readability.