Searched and Found: A Quest Through Search Engines

Explore the core mechanics of effective information retrieval, transcending specific search engines


14 min read

Since time immemorial, the human spirit has been fueled by an insatiable curiosity, propelling us to explore the unknown, to uncover the secrets of existence. We have always been in "search of something" - be it enlightenment or just looking for the best fish and chips in town.

Long before the internet's advent, we were seekers of wisdom, pondering the vastness of the cosmos and the mysteries hidden beyond the stars. Yet, in the blink of an eye (or a tap of a keyboard), our quest for enlightenment has morphed into a curious carousel of contemporary inquiries. Our human journey is marked by such paradoxes, where we oscillate between unravelling the grand tapestry of history and obsessing over which avocado toast recipe is most Instagram-worthy.

Divided by Quest, United by Search!

In all my experience working as an engineer (however brief it has been so far - there's a lot left to explore), the act of searching has always been at the core. I don't just mean a traditional search box with a cliché placeholder - any kind of information retrieval that presents results to the user in some relevant order has a search system working at the backend. Even the bombardment of ads we experience while browsing the "modern" internet is just someone leveraging search: "GET the most relevant advertisements for the current User WHERE context=(the current context)"

💡
I am just talking about the act of delivering advertisements and not the act of selecting which advertisement to serve. That's the job of some kind of recommendation engine.

Evolution of Search Engines

So YES - we have established the omnipresence of search systems. They have been around for quite a while and have gone through an evolution of their own.

Way before we could throw Solr or Elasticsearch at such use cases, we already had a search engine - SQL.

RDBMSs could power the early use cases of search; the SQL syntax was flexible enough to handle them. To this day, people are happy using the LIKE '%...%' query to satisfy some, if not all, of their free-text search needs.
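
To make this concrete, here's a minimal sketch of that LIKE-style search using Python's built-in sqlite3 module (the articles table and its contents are hypothetical):

```python
import sqlite3

# A hypothetical in-memory table of articles to search over.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [
        ("Fish and chips guide", "Where to find the best fish and chips in town."),
        ("Avocado toast", "An Instagram-worthy avocado toast recipe."),
    ],
)

# "Free text search", RDBMS style: a substring scan over every row.
# It works, but there is no ranking or stemming, and the leading
# wildcard defeats ordinary indexes - part of why dedicated engines emerged.
rows = conn.execute(
    "SELECT id, title FROM articles WHERE body LIKE ?",
    ("%fish and chips%",),
).fetchall()
print(rows)  # [(1, 'Fish and chips guide')]
```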

🤔
Fun fact - it's not that SQL came before enterprise search systems like Solr and that this drove its adoption. Search engines, or the act of retrieving information as free text, existed before we introduced RDBMSs. While search engine and RDBMS technologies coexisted, their respective strengths were applied to different types of applications based on the prevailing technological landscape and requirements of the time - which favoured RDBMSs, as data was generally structured.

But eventually we had to move away from RDBMSs towards more specialised systems, which led to the emergence of dedicated search engines. The shift was primarily driven by:

Unstructured Data's Triumph

In the good old days, DBMS were our go-to for structured data, where every piece of information had a pre-defined place in tables. But then, unstructured and semi-structured data happened. Suddenly, we were dealing with text documents, images, videos, and JSON files. The rigid structure of traditional DBMS just couldn't handle the fluidity of this new data landscape.

The Quest for Complex Queries

As our questions became more intricate, traditional DBMS started showing their limitations. Complex queries involving text search, ranking, aggregation, and geospatial elements pushed them to their limits.

The Scale Conundrum

Data doesn't sit still - it grows. Rapidly. And with growth comes the challenge of scalability. Traditional DBMS might excel at vertical scaling, throwing more resources at a single machine, but horizontal scaling - spreading the data across multiple machines - was a challenge.

Real-Time Realities

In a world that moves at the speed of light, real-time data updates and indexing became paramount. Traditional DBMS were accustomed to structured data updates, but when it came to the dynamic nature of unstructured content, they struggled to keep up.

Speeding Up Search

In a generation accustomed to instant gratification, search responsiveness is non-negotiable. Traditional DBMS, while reliable, might not deliver the lightning-fast results modern applications demand.

Going Beyond the Table

Perhaps the most intriguing aspect of these complicated search systems is their specialized features. Faceted search for effortless filtering, geospatial search for location-based queries, and dynamic highlighting of search terms within results - all tailored to enhance user experience and engagement. These features elevate search from a functional necessity to an artful experience.

The Torch Bearer and What Followed!

The first modern-day search engine that closely resembles systems like Apache Solr is "WAIS" (Wide Area Information Server), introduced in the early 1990s.

Like Solr and other modern search engines, WAIS offered full-text indexing and relevance ranking of search results. It supported structured queries and provided a way to search and retrieve documents across a network of interconnected servers.

Around the same time, Alan Emtage created Archie, which indexed FTP sites by file name. It was a precursor to search engines, paving the way for information discovery.

Apache Solr: Open-Source Versatility - 2004

Apache Solr was born from the Apache Lucene project. It's built on the powerful Lucene library, offering full-text search, faceted navigation, and robust indexing capabilities.

Elasticsearch: Distributed Search at Scale - 2010

Elasticsearch, another offspring of Lucene, brings distributed architecture to the forefront. Scalability is its hallmark: it is capable of managing vast volumes of data across clusters of machines. It shines in real-time search and analytics, making it indispensable for applications requiring immediate insights.

Vespa: Real-Time Personalization - 2017

Vespa, a creation of Yahoo engineers, places real-time personalization at the core of its design. With a focus on dynamic content and personalized experiences, Vespa thrives in use cases that demand real-time updates, recommendation systems, and personalized search results.

Are you also confused?

If you are, then don't fret - when I encountered these engines for the first time I was confused myself, but I was intrigued. I cannot even remember how many times I went around to the seniors on my team asking why we use Solr. Why not Elasticsearch? I mostly got the response that both are Lucene-based, so essentially the same (at least this is the part I could comprehend or remember 😅)

They won't stop Innovating...

Search engines tailored for specific niche use cases will keep emerging, ceaselessly reshaping how we navigate information and redefining the industries they serve.

So what do you do when, tomorrow, someone comes up with a new search engine and you're tasked with evaluating whether it suits your needs or fits the use case? To overcome that FOMO, and to avoid circling the same seniors, I wanted to come up with some vectors that could, to some extent, universally evaluate any search engine.

💡
This is not at all exhaustive - just something I could come up with after some digging around. Feel free to share more in the comment section!

Latency - How real-time is it?

This vector judges how quickly data becomes available for querying from the moment it is indexed. Both Elasticsearch and Vespa are "near" real-time engines - meaning the data is available for querying almost immediately after the indexing operation.

Engines like Solr, on the other hand, can take up to a few seconds before the data is available for querying. This is mostly down to the "commit strategy" they employ, or to deployments using a leader-replica model where all reads go to replicas, so it takes a while for a write to reach every replica.

💡
Solr has introduced a "soft commit" strategy which reduces this latency - but soft-committed data is susceptible to being lost in case of a server crash.

This metric is a fine balance between indexing throughput and how quickly new data becomes visible to queries. Most engines tend to buffer changes in memory before flushing them to persistent storage on disk, because I/O is an expensive operation. If you flush data on every index operation, write throughput takes a big hit; if you hold off flushing for too long, data freshness takes the hit instead. Both Elasticsearch and Vespa have their own implementations for tackling and fine-tuning this balance.
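
As one concrete illustration of tuning that balance, Elasticsearch exposes a per-index refresh interval. Here's a hedged sketch using the official Python client (the index name "logs" and the local endpoint are assumptions, and the exact keyword arguments can vary across client versions):

```python
from elasticsearch import Elasticsearch

# Assumes a locally running cluster and an existing index named "logs".
es = Elasticsearch("http://localhost:9200")

# Trade freshness for indexing throughput: make newly indexed documents
# searchable every 30 seconds instead of the default 1 second.
es.indices.put_settings(
    index="logs",
    settings={"index": {"refresh_interval": "30s"}},
)

# During a bulk backfill you can disable refresh entirely ("-1"),
# then restore it once the load finishes.
es.indices.put_settings(
    index="logs",
    settings={"index": {"refresh_interval": "-1"}},
)
```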

Scaling Strategies

As the sheer volume of online information continues to grow exponentially, search engines must adapt to handle the increasing influx of data.

Efficiently organizing, indexing, and retrieving relevant content from this vast ocean of information requires sophisticated algorithms and powerful computing resources. The ability to scale effectively ensures that search engines can maintain optimal performance, delivering accurate results to users across the globe in real time.

Moreover, search engines incorporate an amplification factor into their operations. They not only process raw data but also refine it through algorithms and analytics, creating processed data that enhances indexing and query capabilities. This dynamic interplay between raw and processed data fuels the search engine's ability to provide users with richer insights and more refined search results, shaping the way we access and interact with information online.

At a broad level we essentially have 2 scaling strategies:

  1. Vertical Scaling: Throwing more resources at the same machine so it can handle more load.

  2. Horizontal Scaling: Spinning up more machines of the same kind and load-balancing between them.

Vertical scaling has a ceiling, which generally has its origins in Amdahl's and Moore's laws. In theory you can vertically scale up to a certain point - but even before you hit that limit, the cost of reaching that scale stops being justified by the value it creates. Horizontal scaling came into the world precisely because of these constraints, though it brought its own baggage of issues. In our case, sustaining a high query load alongside high write throughput is difficult with vertical scaling alone.

On the search engine spectrum, Solr can traditionally only be scaled vertically; ultimately it employs a leader-replica model to serve the scale we want it to handle. You can horizontally scale it using sharding in some sense - but it's not trivial to achieve and requires a lot of configuration to be kept just right for things to work normally.

Elasticsearch and Vespa, on the other hand, were built with "distributed architecture" thinking from day zero. Some folks even go as far as saying Elasticsearch is just distributed Solr - please take this statement with a "spoonful" of salt. Hence, when it comes to scaling easily with increasing demand, one can assume that Vespa and Elasticsearch are better suited.

💡
Scaling reasonably with Elasticsearch still requires getting one core configuration right - how many shards do we allocate to the index? The Elastic team wrote a good piece on this.
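
For illustration, here's a minimal sketch of fixing the shard count at index-creation time with the Python client (the index name and the numbers are assumptions - the right values depend entirely on your data volume and query load):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The primary shard count is fixed at creation time and cannot be
# changed later without reindexing - which is why this decision matters.
es.indices.create(
    index="products",
    settings={
        "number_of_shards": 3,    # how the index is split across nodes
        "number_of_replicas": 1,  # redundant copies for reads and failover
    },
)
```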

Support for different Ranking Methodologies

The core of any search engine is information retrieval, but what happens after that? We cannot just dump that data on the user; fast retrieval is one cog in the process of providing users of a service with relevant results.

This can vary from e-commerce, where you provide someone with a sorted and relevant list of products matching their query, to ride-hailing apps, which perform a geospatial search to find the cabs nearest to your location.

Ranking "documents" or "entities" which persist in the search engine has many algorithms at play. The oldest ones like TF/IDF essentially rank basis on the presence of a particular term within a document and its rareness across documents. BM25 is like TF/IDF but just makes it better by considering document length and removing that bias. Other models include representing these documents as vectors in a vector space which essentially helps us achieve "semantic" matching rather than "syntactic" matching like the previously mentioned algorithms.

LMIR (Language Model for Information Retrieval) is like having a conversation with a document. Imagine you're searching for information about penguins. The LMIR algorithm calculates how likely it is that the document you're looking at could generate the words in your query "penguins." If the document contains terms related to penguins like "Antarctica", "waddle" and "flightless" LMIR recognizes these as signals that the document is well-versed in the language of penguins. In a nutshell, LMIR uses language patterns to determine how much a document understands and speaks the language of your query, ensuring that the most "conversational" documents rise to the top of the search results.
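
In formula form, this query-likelihood view scores a document by the probability that its language model generates the query, smoothed against the whole collection C so that one missing term doesn't zero everything out (Jelinek-Mercer smoothing shown as one common choice):

```latex
% Probability that document D's language model generates query Q
P(Q \mid D) = \prod_{q \in Q} \left[ (1 - \lambda)\,\frac{f(q, D)}{|D|}
    + \lambda\, P(q \mid C) \right]
```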

Other than that, there are more ML-driven approaches like LTR (Learning to Rank), which open up scope for making searches and results more personalised.

Solr:

  • Quite Traditional: Solr offers native support for traditional ranking algorithms like TF/IDF and BM25, making it a solid choice for applications focusing on classic relevance.

  • Text-Centric Strength: Solr's emphasis lies in providing versatile options for text-based search queries and retrieval.

Elasticsearch:

  • Traditional and Beyond: Elasticsearch supports TF/IDF and BM25 natively while introducing advanced vector search capabilities, enabling similarity searches on dense vectors.

  • Vector Search Power: Its vector search features are particularly valuable for recommendation systems and use cases where similarity-based queries are crucial.

Vespa.ai:

  • Traditional and Vector: Vespa.ai not only supports traditional algorithms like TF/IDF but also offers native vector search capabilities for similarity-based queries and recommendation systems.

  • AI-Driven Relevance: Its standout feature is the integration of machine learning models and AI-driven relevance, enabling advanced ranking strategies like Learning to Rank (LTR).

  • Holistic Modernity: Vespa.ai stands as a versatile choice, encompassing both traditional and cutting-edge search requirements, making it suitable for a wide spectrum of applications.

In summary, while Solr and Elasticsearch offer strong foundations with their traditional algorithmic support, Vespa.ai takes a step further by embracing both traditional and modern search paradigms. Its integration of machine learning models and AI-driven ranking strategies, along with native vector search capabilities, positions Vespa.ai as a holistic solution catering to the evolving needs of modern search applications.

Read and Write Anatomy

Each system generally has its own way of working, usually designed to optimise a certain use case or simply to improve on its predecessors. In the same way, when it comes to search engines, how their read and write flows happen can dictate to a big extent how they are going to respond when you throw your use case at them.

Solr provides us with the ability to use sharding, which not only helps you scale but also optimises read latencies. By providing the same "sharding key" during the write and read process, one can help Solr isolate the amount of data it needs to scan to produce the result. This key identifies which "node" you ask for data, so rather than asking all the nodes as in the non-sharded case, we can ask just that one node and cut the latency of gathering and processing data from multiple nodes. Elasticsearch provides the same thing with the concept of a "routing key", which dictates which shard is responsible for persisting the data; the same key is supplied during the read phase.

Imagine isolating a particular user's data by providing "User-Id" as the routing or sharding key - the read latencies will improve by miles.
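
Here's a hedged sketch of that pattern with the Python Elasticsearch client (the index, IDs, and field names are hypothetical; Solr's sharding key works analogously through its own APIs):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Write: route this order document to the shard that owns user-42.
es.index(
    index="orders",
    id="order-1001",
    routing="user-42",  # same key on write...
    document={"user_id": "user-42", "total": 499},
)

# Read: pass the same routing key so only that one shard is queried,
# instead of fanning the search out to every shard in the index.
resp = es.search(
    index="orders",
    routing="user-42",  # ...and on read
    query={"term": {"user_id": "user-42"}},
)
print(resp["hits"]["total"])
```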

Elasticsearch is known for crunching and helping analyse vast amounts of data with very low read latencies. This is achieved by caching data at the individual filter level. For instance, in an e-commerce scenario, when users search for products with specific filters like category, price range, and availability, Elasticsearch smartly caches results for frequently used filter combinations. Subsequent similar queries benefit from this caching by swiftly retrieving pre-computed results, dramatically reducing retrieval time.

This approach optimizes resource usage and contributes to Elasticsearch's ability to efficiently handle extensive data analysis tasks with exceptional speed.
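
Filters expressed in a bool query's filter context are exactly the kind of clauses Elasticsearch can cache and reuse, since they are yes/no matches that don't affect scoring. A sketch (index and field names are hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The "filter" clauses don't contribute to relevance scoring, so their
# results can be cached as bitsets and reused by any later query that
# shares the same filter combination.
resp = es.search(
    index="products",
    query={
        "bool": {
            "must": [{"match": {"name": "running shoes"}}],  # scored
            "filter": [                                      # cacheable
                {"term": {"category": "footwear"}},
                {"range": {"price": {"lte": 100}}},
                {"term": {"in_stock": True}},
            ],
        }
    },
)
```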

Vespa's strategic use of phased ranking shows how it optimises read queries and computational efficiency. Take a news recommendation system: Vespa employs phased ranking to first retrieve candidate articles matching user preferences, then performs the resource-intensive machine learning computations only on the most relevant records, steering clear of unnecessary processing. This ensures compute usage is optimised while still delivering accurate, personalised results - enhancing both query performance and user experience.

Vespa sets itself apart by offering genuine partial updates, a feature distinct from Solr and Elasticsearch, which necessitate re-indexing the entire document. This distinction becomes pivotal when your application involves frequent updates to the same document. Vespa's approach enables modifications to specific fields without undergoing the resource-intensive process of re-indexing the whole document, resulting in significant efficiency gains. It's important to note, however, that if your data predominantly follows an append-only pattern, such as time-series data, Elasticsearch might excel due to its optimization for such scenarios.
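
A minimal sketch of such a partial update against Vespa's Document V1 API (the namespace, document type, and field are hypothetical, and the endpoint assumes a local deployment):

```python
import requests

# Update a single field of an existing document in place. Vespa applies
# the "assign" operation to that field without re-indexing the rest of
# the document.
resp = requests.put(
    "http://localhost:8080/document/v1/mynamespace/product/docid/sku-123",
    json={"fields": {"price": {"assign": 449}}},
)
resp.raise_for_status()
print(resp.json())
```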

Both Vespa and Elasticsearch stand out for their flexible data modeling, enabling changes to be seamlessly introduced without disrupting operations. This attribute empowers agility and innovation in data management. In contrast, Solr's fixed data model may pose challenges when altering the schema, necessitating careful planning to prevent complications.

A fun fact about Elasticsearch's flexibility is that while it accommodates changes, if not managed well it can sometimes lead to unintended consequences due to Elasticsearch's eager interpretation of dynamic mappings. This serves as a reminder that while adaptability is a strength, prudent management remains essential for harnessing its full potential.
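
One common guardrail is to turn off dynamic mapping at index creation, so that a stray field in a document fails loudly instead of silently growing the mapping. A sketch (index and field names are hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "dynamic": "strict" rejects documents containing unmapped fields,
# instead of letting Elasticsearch guess a type and quietly expand
# the mapping with every new field it encounters.
es.indices.create(
    index="events",
    mappings={
        "dynamic": "strict",
        "properties": {
            "user_id": {"type": "keyword"},
            "timestamp": {"type": "date"},
        },
    },
)
```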

Concluding Thoughts...

In the ever-evolving world of search engines, understanding the fundamental principles behind each system is key to harnessing their true value. From the inception of structured databases to the emergence of powerful search engines like Solr, Elasticsearch, and Vespa, our journey to retrieve information has been marked by constant innovation and adaptation. As data landscapes expand and user expectations evolve, these engines have responded with specialized capabilities, scaling strategies, and ranking methodologies.

While Solr stands as a stalwart with its text-centric strength and traditional ranking algorithms, Elasticsearch introduces distributed architecture and real-time capabilities that cater to modern demands for speed and scalability. Vespa, with its focus on real-time personalization, reshapes how we engage with search results by delivering tailored experiences.

In this ever-changing landscape, it's crucial to remember that no solution is a one-size-fits-all. The true power of these search engines is harnessed when we delve into their nuances, optimizing them according to the unique requirements of our applications. So whether it's understanding how data is indexed, optimizing read and write workflows, or choosing the right ranking methodology, the journey starts with building a strong foundation on the underlying concepts. By doing so, we're poised to unlock the full potential of these systems and offer users unparalleled search experiences in our rapidly evolving digital world.
