A text embedding is a numerical representation of a sentence or passage as a vector of real-valued numbers. By converting sentences to vectors of numbers, comparisons between sentences become mathematical operations, which computers can perform quickly and reliably.
When an embedding model creates a vector representation of a sentence, it assigns values that capture the semantic meaning of the sentence, and it positions the vector within a multidimensional space based on those values. The number of dimensions varies by model, so the exact vector values vary as well. However, all models position the vectors such that sentences with similar meanings are nearer to one another.
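Nearness in the embedding space is typically measured with a similarity metric such as cosine similarity. The following sketch uses fictional 4-dimensional vectors to show how the comparison reduces to simple arithmetic; real models produce vectors with hundreds or thousands of values:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity of two vectors: values near 1.0 mean the vectors
    # point in nearly the same direction (similar meaning).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Fictional embeddings for two sentences with similar meanings.
v1 = [0.8, 0.1, 0.3, 0.5]
v2 = [0.7, 0.2, 0.4, 0.5]
print(cosine_similarity(v1, v2))
```

Two sentences with similar meanings yield a similarity close to 1.0, while unrelated sentences score much lower.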
Most embedding models generate vectors with so many dimensions, ranging from hundreds to thousands, that they are impossible to visualize. If an embedding model were to generate a 3-dimensional vector, it might look as follows. Note that the vector values shown in the image are fictional and are included only to illustrate this hypothetical scenario.
The image shows that sentences with shared keywords and shared subjects have vectors with similar values, which places them nearer to one another within the three-dimensional space. The following sentences are positioned based on their vector values:
- The Degas reproduction is hanging in the den
- Jan bought a painting of dogs playing cards
- I took my dogs for a walk
The first two sentences, which are both about artwork, and the last two sentences, which share the keyword dogs, are nearer to one another than the first and third sentences, which share no common words or meanings.
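This grouping can be reproduced with fictional 3-dimensional vectors like those in the image. The values below are invented for illustration; the point is only that the distances between the vectors mirror the relationships among the sentences:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two points in the embedding space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Fictional 3-dimensional vectors for the three example sentences.
degas = [0.9, 0.8, 0.1]     # "The Degas reproduction is hanging in the den"
painting = [0.7, 0.6, 0.5]  # "Jan bought a painting of dogs playing cards"
walk = [0.2, 0.3, 0.9]      # "I took my dogs for a walk"

print(euclidean_distance(degas, painting))  # artwork pair: small distance
print(euclidean_distance(painting, walk))   # dogs pair: small distance
print(euclidean_distance(degas, walk))      # unrelated pair: largest distance
```

With these values, the artwork pair and the dogs pair are each closer together than the first and third sentences are to each other.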
You can store generated vectors in a vector database. When the same embedding model converts all of the sentences in the database, the vector store can use the groupings and relationships that exist among the sentences, based on their vector values, to return relevant search results quickly.
Unlike traditional indexes that store text and rely on keyword search for information retrieval, vector stores support semantic searches that retrieve information that is similar in meaning. For example, where a keyword search checks only whether the keyword is present, a semantic search weighs the context in which the keyword is used, which typically produces better search results.
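The retrieval step can be sketched as a toy in-memory vector store searched by similarity. The store, the vectors, and the query embedding below are all fictional; a real vector database indexes the vectors (for example, with approximate nearest-neighbor structures) so it can search millions of entries quickly:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: higher values mean more similar meaning.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vector store: sentence -> fictional embedding from the same model.
store = {
    "The Degas reproduction is hanging in the den": [0.9, 0.8, 0.1],
    "Jan bought a painting of dogs playing cards": [0.7, 0.6, 0.5],
    "I took my dogs for a walk": [0.2, 0.3, 0.9],
}

def search(query_vector, k=2):
    # Rank stored sentences by similarity to the query embedding and
    # return the k most similar ones.
    ranked = sorted(store.items(),
                    key=lambda item: cosine_similarity(query_vector, item[1]),
                    reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

# Fictional embedding for a query about artwork, such as "Where is the art?"
print(search([0.8, 0.7, 0.2]))
```

Even though the query shares no keywords with "Jan bought a painting of dogs playing cards", the two artwork sentences rank highest because their vectors are nearest to the query's vector.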
Parent topic: Vectorizing text