
Juicebox recruits Amazon OpenSearch Service’s vector database for improved talent search


This post is co-written by Ishan Gupta, Co-Founder and Chief Technology Officer, Juicebox.

Juicebox is an AI-powered talent sourcing search engine that uses advanced natural language models to help recruiters identify the best candidates from a massive dataset of over 800 million profiles. At the core of this functionality is Amazon OpenSearch Service, which provides the backbone for Juicebox’s search infrastructure, enabling a seamless combination of traditional full-text search methods with modern semantic search capabilities.

In this post, we share how Juicebox uses OpenSearch Service for improved search.

Challenges in recruiting search

Recruiting search engines have traditionally relied on simple Boolean or keyword-based searches. These methods aren’t effective at capturing the nuance and intent behind complex queries, often producing large volumes of irrelevant results. Recruiters spend unnecessary time filtering through these results, a process that is both time-consuming and inefficient.

In addition, recruiting search engines often struggle to scale with large datasets, creating latency issues and performance bottlenecks as more data is indexed. At Juicebox, with a database growing to more than 1 billion documents and millions of profiles being searched per minute, we needed a solution that could not only handle massive-scale data ingestion and querying, but also support contextual understanding of complex queries.

Solution overview

The following diagram illustrates the solution architecture.

OpenSearch Service securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like application monitoring, log analytics, observability, and website search. You send documents to OpenSearch Service and retrieve them with search queries that match text and vector embeddings for fast, relevant results.
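As a quick orientation, the following minimal sketch (not Juicebox’s production code) shows how an application might connect to an OpenSearch Service domain with the opensearch-py client and run a basic query. The domain endpoint, region, index name, and field are placeholders.

# Minimal connection sketch using opensearch-py with SigV4 auth.
# Endpoint, region, index, and field names are placeholders.
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"  # placeholder region
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "es")

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],  # placeholder endpoint
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Index a document, then retrieve it with a simple full-text query.
client.index(index="profiles", id="1", body={"headline": "Senior data scientist, NLP"})
response = client.search(index="profiles", body={"query": {"match": {"headline": "data scientist"}}})
print(response["hits"]["total"])

The later sketches in this post reuse this client object.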

At Juicebox, we solved five challenges with Amazon OpenSearch Service, which we discuss in the following sections.

Problem 1: High latency in candidate search

Initially, we faced significant delays in returning search results because of the scale of our dataset, especially for complex semantic queries that require deep contextual understanding. Other full-text search engines couldn’t meet our requirements for speed or relevance when it came to understanding the recruiter intent behind each search.

Solution: BM25 for fast, accurate full-text search

The OpenSearch Service BM25 algorithm quickly proved invaluable by allowing Juicebox to optimize full-text search performance while maintaining accuracy. Through keyword relevance scoring, BM25 ranks profiles based on the likelihood that they match the recruiter’s query. This optimization lowered our average query latency from around 700 milliseconds to 250 milliseconds, allowing recruiters to retrieve relevant profiles much faster than with our previous search implementation.

With BM25, we saw a nearly threefold reduction in latency for keyword-based searches, improving the overall search experience for our users.
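For illustration, here is a minimal sketch of the kind of BM25-scored full-text query described above, reusing the client from the earlier sketch. BM25 is the default similarity in OpenSearch, so an ordinary multi_match query is scored with it; the index and field names are assumptions, not our actual mapping.

# Hypothetical "profiles" index and fields; BM25 scoring is the default.
query = {
    "size": 25,
    "query": {
        "multi_match": {
            "query": "data scientist with NLP experience",
            "fields": ["headline^2", "summary", "skills"],  # boost the headline field
        }
    },
}
response = client.search(index="profiles", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("headline"))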

Problem 2: Matching intent, not just keywords

In recruiting, exact keyword matching can often mean missing out on qualified candidates. A recruiter searching for “data scientists with NLP experience” might miss candidates with “machine learning” in their profiles, even though they have the right expertise.

Solution: k-NN-powered vector search for semantic understanding

To address this, Juicebox uses k-nearest neighbor (k-NN) vector search. Vector embeddings allow the system to understand the context behind recruiter queries and match candidates based on semantic meaning, not just keyword matches. We maintain a billion-scale vector search index capable of low-latency k-NN search, thanks to OpenSearch Service optimizations such as product quantization. The neural search capability allowed us to build a Retrieval Augmented Generation (RAG) pipeline for embedding natural language queries before searching. OpenSearch Service also lets us tune Hierarchical Navigable Small World (HNSW) hyperparameters such as m, ef_search, and ef_construction, which enabled us to hit our target latency, recall, and cost goals.

Semantic search, powered by k-NN, allowed us to surface 35% more relevant candidates compared to keyword-only searches for complex queries. These searches remained fast and accurate, with vectorized queries achieving 0.9+ recall.
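The sketch below shows what a knn_vector index mapping with HNSW parameters and an approximate k-NN query might look like. The dimension, parameter values, field names, and the embed() call are assumptions for illustration (product quantization is omitted for brevity), not Juicebox’s production configuration.

# Hypothetical knn_vector mapping; m, ef_construction, ef_search, and
# dimension are illustrative values only.
index_body = {
    "settings": {"index": {"knn": True, "knn.algo_param.ef_search": 256}},
    "mappings": {
        "properties": {
            "profile_embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"m": 16, "ef_construction": 256},
                },
            }
        }
    },
}
client.indices.create(index="profiles-vectors", body=index_body)

# Approximate k-NN query against the embedded recruiter query.
query_vector = embed("data scientists with NLP experience")  # embed() is a hypothetical embedding call
knn_query = {
    "size": 25,
    "query": {"knn": {"profile_embedding": {"vector": query_vector, "k": 25}}},
}
response = client.search(index="profiles-vectors", body=knn_query)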

Problem 3: Difficulty in benchmarking machine learning models

Several key performance indicators (KPIs) measure the success of your search. When you use vector embeddings, you have many choices to make when selecting a model, fine-tuning it, and choosing hyperparameters. It’s essential to benchmark your solution to make sure you’re getting the right latency, cost, and especially accuracy. Benchmarking machine learning (ML) models for recall and performance is challenging because of the vast number of fast-evolving models available (such as those on the MTEB leaderboard on Hugging Face). We faced difficulties selecting and measuring models accurately while making sure they performed well across large-scale datasets.

Solution: Exact k-NN with scoring script in OpenSearch Service

Juicebox used the exact k-NN with scoring script feature to address these challenges. This feature allows for precise benchmarking by executing brute-force nearest neighbor searches and applying filters to a subset of vectors, making sure that recall metrics are accurate. Model testing was streamlined using the wide range of pre-trained models and ML connectors (integrated with Amazon Bedrock and Amazon SageMaker) provided by OpenSearch Service. The flexibility of applying filtering and custom scoring scripts helped us evaluate multiple models across high-dimensional datasets with confidence.

Juicebox was able to measure model performance with fine-grained control, achieving 0.9+ recall. Using exact k-NN allowed Juicebox to benchmark faster and more reliably, even on billion-scale data, providing the confidence needed for model selection.
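As a rough sketch of the approach, the exact k-NN scoring script can compute brute-force neighbors over a filtered subset, which can then serve as ground truth when measuring recall of approximate k-NN results. The field, filter, and query vector below are illustrative, and the client and embed() call are the same assumptions as in the earlier sketches.

# Exact (brute-force) k-NN over a filtered subset via the k-NN scoring script,
# useful for computing ground-truth neighbors when benchmarking recall.
query_vector = embed("data scientists with NLP experience")  # hypothetical embedding call
exact_query = {
    "size": 100,
    "query": {
        "script_score": {
            "query": {"term": {"country": "US"}},  # illustrative filter on a subset of vectors
            "script": {
                "source": "knn_score",
                "lang": "knn",
                "params": {
                    "field": "profile_embedding",
                    "query_value": query_vector,
                    "space_type": "cosinesimil",
                },
            },
        }
    },
}
ground_truth = client.search(index="profiles-vectors", body=exact_query)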

Problem 4: Lack of data-driven insights

Recruiters need to not only find candidates, but also gain insights into broader talent industry trends. Analyzing hundreds of millions of profiles to identify trends in skills, geographies, and industries was computationally intensive. Most other search engines that support full-text search or k-NN search didn’t support aggregations.

Solution: Advanced aggregations with OpenSearch Service

The powerful aggregation features of OpenSearch Service allowed us to build Talent Insights, a feature that provides recruiters with actionable insights from aggregated data. By performing large-scale aggregations across millions of profiles, we identified key skills and hiring trends, and helped clients adjust their sourcing strategies.

Aggregation queries now run over 100 million profiles and return results in under 800 milliseconds, allowing recruiters to generate insights instantly.
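A minimal sketch of the kind of terms aggregation behind such rollups is shown below, reusing the client from the earlier sketches. The field names and keyword sub-fields are assumptions about the mapping, not our actual schema.

# Terms aggregations for skills and locations; size: 0 skips the hits and
# returns only aggregation buckets. Field names are illustrative.
agg_query = {
    "size": 0,
    "query": {"match": {"skills": "machine learning"}},
    "aggs": {
        "top_skills": {"terms": {"field": "skills.keyword", "size": 20}},
        "by_location": {"terms": {"field": "location.keyword", "size": 10}},
    },
}
response = client.search(index="profiles", body=agg_query)
for bucket in response["aggregations"]["top_skills"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])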

Problem 5: Streamlining data ingestion and indexing

Juicebox ingests data continuously from multiple sources across the web, reaching terabytes of new data per month. We needed a robust data pipeline to ingest, index, and query this data at scale without performance degradation.

Solution: Scalable data ingestion with Amazon OpenSearch Ingestion pipelines

Using Amazon OpenSearch Ingestion, we implemented scalable pipelines. This allowed us to efficiently process and index hundreds of millions of profiles every month without worrying about pipeline failures or system bottlenecks. We used AWS Glue to preprocess data from multiple sources, chunk it for optimal processing, and feed it into our indexing pipeline.
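OpenSearch Ingestion pipelines themselves are defined as managed pipeline configurations rather than application code, so the following is only a simplified, illustrative sketch of chunked bulk indexing with the opensearch-py helpers, not a representation of our Glue preprocessing or Ingestion setup. load_profiles() is a hypothetical generator over already-preprocessed records.

# Simplified sketch of chunked bulk indexing only; it does not model the
# managed OpenSearch Ingestion pipeline or the AWS Glue preprocessing above.
from opensearchpy import helpers

def actions(profiles):
    # Turn preprocessed profile records into bulk index actions.
    for profile in profiles:
        yield {
            "_op_type": "index",
            "_index": "profiles",
            "_id": profile["id"],
            "_source": profile,
        }

# load_profiles() is a hypothetical source of preprocessed records.
success, errors = helpers.bulk(
    client, actions(load_profiles()), chunk_size=500, raise_on_error=False
)
print(f"indexed {success} documents, {len(errors)} errors")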

Conclusion

In this post, we shared how Juicebox uses OpenSearch Service for improved search. We can now index hundreds of millions of profiles per month, keeping our data fresh and up to date, while maintaining real-time availability for searches.


About the authors

Ishan Gupta is the Co-Founder and CTO of Juicebox, an AI-powered recruiting software startup backed by top Silicon Valley investors including Y Combinator, Nat Friedman, and Daniel Gross. He has built search products used by thousands of customers to recruit talent for their teams.

Jon Handler is the Director of Solutions Architecture for Search Services at Amazon Web Services, based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads for OpenSearch. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of Arts from the University of Pennsylvania, and a Master of Science and a Ph.D. in Computer Science and Artificial Intelligence from Northwestern University.
