AMA With Frank Liu: The Future of Vector Databases and the Embedding of Everything

Nov 1, 2022

We regularly invite ML practitioners and industry leaders to share their experiences with our Community. Want to ask our next guest a question? Join the BentoML Community Slack.

We recently invited Frank Liu, Director of Operations at Zilliz, a leading provider of vector database and AI technologies. The Zilliz team are also the engineers and scientists who created LF AI Milvus®, the world’s most popular open-source vector database.

Key Takeaways:

  • Vector databases and their use cases
  • The future of vector databases and the “embedding everything” era
  • How to get started with Milvus

Can you tell us what a vector database is?

Vector databases are great for anybody who wants to analyze unstructured data, e.g. images, video, audio, text, etc. One example application is searching for images using other images. Vector databases do this by performing large-scale vector search. There are quite a few libraries that do this, such as FAISS and Hnswlib, but they don’t have the database features that you’d want or need.
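
To make "vector search" concrete, here is a minimal sketch of a brute-force nearest-neighbor lookup. It is an illustration only: the file names and three-dimensional embeddings are made up, and real systems (libraries such as FAISS or Hnswlib, or a vector database like Milvus) use approximate indexes over much higher-dimensional vectors rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical image embeddings; a real model would produce these.
image_embeddings = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "ocean.jpg": [0.8, 0.2, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the query image.
query_embedding = [1.0, 0.1, 0.0]

results = top_k(query_embedding, image_embeddings, k=2)
print(results)  # ['beach.jpg', 'ocean.jpg']
```

Searching "images with images" then amounts to embedding the query image with the same model and returning the stored images whose vectors score highest.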

One of the components in the current MLOps ecosystem is the feature store. Do I still need a vector DB? Does a vector DB replace the feature store? And what is the relationship between them?

Great question. Feature stores and vector databases are ultimately two very different pieces of infrastructure. Feature stores tend to be repositories for curated features, while vector databases are for large-scale similarity search. On that topic, I don’t see vector databases replacing traditional (relational/NoSQL) databases either, but I do see a day when the two kinds of database will be used together.

Could you tell us a bit about how you got involved with the Milvus project and what’s your advice for someone getting started with contributing to open source projects?

As for how I got involved with Milvus, I had actually heard about it from a former coworker a while back. We were doing large-scale similarity search across images in the early days of deep learning, and there wasn’t any great infrastructure for doing so back then. Seeing Milvus was a huge a-ha moment for me, and the rest is history.

If I’m not doing semantic search, is a vector database suitable for other ML projects?

Definitely! Vector databases are used for product recommendation, video analysis, targeted advertising, personalized search, drug discovery, and so many more use cases that I can’t think of off the top of my head! Using product recommendation as an example, I can recommend products to users based on images or videos of the product at hand using a vector database, rather than just tags or keywords.

I was reading about Milvus and saw the term “embedding everything” era. Can you tell us what that future looks like?

I think we’re going in a direction where more and more data is being turned into vectors - everything from images to user profiles to geospatial data. Not only that, but a lot of these vectors will cross modalities. For example, images and text can be embedded into the same vector space. Today, I can already search for an image that corresponds to a description using a vector database, e.g. “a photo of a German Shepherd running across the hills” would find the most relevant image in the vector database. I envision a day where “everything” will be embedded into a common space, and that’ll really revolutionize the way search is done.

From what I am learning, a vector DB helps a lot in finding similar data fast. Can it also help us train new ML models? If so, how?

For now, vector databases are meant to work with production-ready models, but these are still early days. Perhaps in the future, vector databases could be used to distill large datasets and enable the training of smaller, more efficient models. Exciting times are ahead!

If I have a DB full of data, how do I put that data into a vector DB? How do I generate embedding vectors?

Generally, you’ll use a fully trained ML model to generate the embeddings - you can then insert these embeddings into Milvus. We also have a sister open-source project called Towhee (https://towhee.io), which is specifically meant to help with this.
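
The flow described above (read rows, embed them, insert the embeddings, then search) can be sketched roughly as follows. Everything here is a stand-in chosen so the example runs on its own: `toy_embed` is a letter-frequency function used in place of a trained model, `InMemoryVectorStore` is a hypothetical class used in place of a real vector database, and the `products` table is made up. None of these names come from Milvus or Towhee.

```python
import math
import sqlite3


def toy_embed(text):
    """Stand-in for a trained model: unit-length letter-frequency vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database collection."""

    def __init__(self):
        self.rows = {}

    def insert(self, ids, embeddings):
        for row_id, embedding in zip(ids, embeddings):
            self.rows[row_id] = embedding

    def search(self, query, k=1):
        # Vectors are unit length, so the dot product equals cosine similarity.
        def score(item):
            return sum(q * x for q, x in zip(query, item[1]))

        ranked = sorted(self.rows.items(), key=score, reverse=True)
        return [row_id for row_id, _ in ranked[:k]]


# 1. An existing database full of data (made-up example rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [
        (1, "red running shoes"),
        (2, "blue denim jacket"),
        (3, "stainless steel kettle"),
    ],
)
rows = conn.execute("SELECT id, description FROM products ORDER BY id").fetchall()

# 2. Generate one embedding per row, then 3. insert them into the vector store.
store = InMemoryVectorStore()
store.insert([row[0] for row in rows], [toy_embed(row[1]) for row in rows])

# 4. Embed the query the same way and search.
hits = store.search(toy_embed("denim jacket"), k=1)
print(hits)  # [2]
```

In a real pipeline, step 2 would call the trained model and step 3 would call the vector database's client SDK; the overall shape of the pipeline stays the same.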

For those new to the concept of a vector database, what are the best resources to learn about it?

We have started a series called “Vector Database 101” on our website (here’s a link to the first one: https://zilliz.com/blog/introduction-to-unstructured-data). We’ll be releasing a new article every week, talking about everything in and around vector databases - use cases, vector indexes, caching, time travel, etc. For folks who are interested, I highly recommend checking it out! We also have some demos here: https://milvus.io/milvus-demos/

Were there any particular challenges you found interesting while working on a vector database? Or anything interesting that you discovered while working on it?

There are tons of technical challenges related to vector databases, but I think the biggest challenge is actually getting folks in other communities to understand what a vector database is and why it’s needed, in other words, being an advocate for a new type of technology. Technology-wise, getting vector search to run at a billion+ scale entails quite a few architectural challenges.

What do ML teams do without a vector database? And why is using a vector database a better solution?

Because embeddings are being used almost everywhere now, ML teams would have to resort to a vector search library if they don’t plan to use a vector database. Don’t get me wrong, vector search libraries are great, but they are only meant for the core “computation” component. A vector database provides a number of database features, such as replication, scaling, caching, and failover, as well as a unified interface with numerous SDKs (Python, Go, Node, Java). Simply put, it helps take the “data infrastructure component” out of production AI/ML.

What’s your proudest moment with Milvus? After hitting a huge milestone of more than 10k stars on GitHub, what are some future product features to hit the next milestone?

That’s actually a tough question! The Milvus team and community have definitely come a long way since day one. 10k stars was definitely a huge milestone for us, but I am most proud of some of the work that we did with less techy folks, like the Cleveland Museum of Art (https://www.clevelandart.org/artlens-ai). Trying to democratize data infra is tough, but we’re all getting there bit by bit. As for future product features and milestones, we’ll continue to improve stability and performance. We’ll also take steps toward heterogeneous computing. Vector databases, being the intersection of data and ML, are a very compute-heavy type of database, and getting things to run quickly on GPUs and FPGAs will be critical. We’re also working on disk-based vector indexing. Most vector databases today store the entire index in memory, which can be pretty expensive. Having an SSD-based solution will definitely reduce cost.
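
To give a rough sense of why a fully in-memory index gets expensive at scale, here is a back-of-the-envelope calculation. The numbers are assumptions for illustration, not figures from the interview: one billion vectors, 128 dimensions, and 4-byte float32 values. Index overhead is ignored, so actual memory use would be higher.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Bytes needed to hold the raw vectors alone (no index overhead)."""
    return num_vectors * dim * bytes_per_value


# Assumed workload: 1 billion float32 vectors with 128 dimensions each.
total_bytes = raw_vector_bytes(1_000_000_000, 128)
total_gib = total_bytes / 2**30

print(total_bytes)          # 512000000000
print(round(total_gib, 1))  # 476.8
```

Under these assumptions the raw vectors alone take about 512 GB (roughly 477 GiB) of RAM, which is the kind of footprint that makes keeping most of the data on SSD attractive.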

What is your main focus for next year? What is your personal goal?

My main focus for next year is helping grow the Zilliz team, both engineering and GTM. Building a solid team is crucial to any company’s success. In terms of a personal goal, I hope to be better at written communication, both in a professional setting and in blogs, articles, and papers.

* The discussion was lightly edited for better readability.