
“ChatGPT for your Data” for Customer Support

15 min read · Apr 12, 2024

With the emergence of OpenAI’s ChatGPT, customer care centers have a new frontier to explore. Generative AI has opened up possibilities for customer support systems to be augmented with innovative solutions. Two of the most popular systems are Turing by DevRev and SiteGPT.

I am interested in exploring both of these systems to understand how they work in detail. Although both are built on GPT, I want to know what sets them apart, particularly from a technical perspective.


Exploratory Analysis

I spent some time researching two different systems — SiteGPT and DevRev’s Turing. While SiteGPT focuses only on integrating chatbots into websites, DevRev’s Turing is part of a larger platform that includes a chatbot solution along with customer relationship management (CRM) software called OneCRM.

After reading the documentation, I wanted to compare Turing and SiteGPT directly, since they are similar solutions. As a first step, I analyzed how each performs on its own platform.

I devised 100 questions/prompts that both chatbots should be able to answer about their websites. Subsequently, I collected and manually annotated each chatbot’s responses to these questions, categorizing them as accurate, inaccurate, or no response.

Here’s what each category encodes:

  • Accurate: Responses that provide factually correct answers, including those that may be incomplete but still address the question to some extent.
  • Inaccurate: Responses that contain irrelevant or incorrect information that does not address the intended query.
  • No Response: Instances where the chatbot declines to provide an answer.

This categorization process ensures an objective evaluation of the chatbots’ performance, free from subjective biases or preferences. Through this data-driven approach, we can uncover valuable insights into the strengths and weaknesses of each chatbot solution on its native platform, enabling informed decision-making.
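As an illustration, here is a minimal sketch of how such an annotation file could be tallied; the file name and column names are hypothetical, not part of the actual evaluation setup.

```python
# Hypothetical tally of the manually annotated responses. Assumes a CSV named
# annotations.csv with columns: question, chatbot, label, where label is one of
# {"accurate", "inaccurate", "no_response"}.
import csv
from collections import Counter

def summarize(path: str) -> dict:
    per_bot = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            per_bot.setdefault(row["chatbot"], Counter())[row["label"]] += 1
    return per_bot

for bot, counts in summarize("annotations.csv").items():
    total = sum(counts.values())
    print(bot, {label: f"{n / total:.0%}" for label, n in counts.items()})
```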

SiteGPT and Turing performance on their native platforms

Data for Comparative Analysis.

While Turing's response ratio is only 45%, the two systems' inaccuracy rates are roughly the same. There is clear room for Turing to reduce its refusal behavior and improve response accuracy.

DevRev offers a broader solution, with Turing at the heart of it! This platform facilitates the seamless connection between the voice of the customer from the front office — encompassing sales, customer success, and support — with product management, engineering, and design. By leveraging Turing’s capabilities alongside this comprehensive solution, organizations can enhance customer experience and team productivity by bridging the gap between the front office and back office operations.

Addressing the challenges Turing faces in reducing refusal behavior and minimizing inaccuracies could prove instrumental in unlocking its full potential as a valuable component of DevRev’s integrated platform.

So, let’s get into the details to see how we can do just that!

Fine-Tuning or Retrieval Augmented Generation?

People often point to fine-tuning (further training) as the way to add your own data on top of a pretrained language model. However, fine-tuning an LLM comes with some common drawbacks:

  • Factual correctness and traceability: it is hard to trace where an answer comes from.
  • Access control: it is impossible to restrict certain documents to specific users or groups.
  • Costs: every batch of new documents requires retraining the model, on top of the cost of hosting it.

These drawbacks make fine-tuning extremely hard, close to impossible, to use for the purpose of Question Answering (QA) over your own data.

To ensure that users receive accurate answers, we need to separate our language model from our knowledge base. This allows us to leverage the semantic understanding of our language model while also providing our users with the most relevant information. All of this happens in real-time, and no model training is required.

It might seem like a good idea to feed all documents to the model at run-time, but this isn’t feasible due to the limit on how much text (measured in tokens) the model can process at once.
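To make the limit concrete, here is a small sketch that counts tokens with OpenAI’s tiktoken library; the documents are placeholders, and the 4,096-token figure refers to the original gpt-3.5-turbo context window.

```python
# Why we cannot stuff every document into one prompt: context windows are
# measured in tokens, not characters. Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo / gpt-4

documents = ["...full text of doc 1...", "...full text of doc 2..."]  # placeholders
total_tokens = sum(len(enc.encode(doc)) for doc in documents)

print(f"{total_tokens} tokens vs. a 4,096-token context window")
```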

The approach for this would be as follows:

  1. User asks a question
  2. Application finds the most relevant text that (most likely) contains the answer
  3. A concise prompt with relevant document text is sent to the LLM
  4. User will receive an answer or ‘No answer found’ response

This approach is often referred to as grounding the model or Retrieval Augmented Generation (RAG). The application will provide additional context to the language model, to be able to answer the question based on relevant resources.
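A minimal sketch of this four-step flow is shown below; the `vector_store.search()` retriever is a hypothetical stand-in, and the OpenAI chat API is used only as one possible generation backend, not as a claim about Turing’s actual implementation.

```python
# Minimal RAG loop: retrieve relevant chunks, then ask the LLM to answer
# strictly from that context. Requires: pip install openai (>= 1.0)
from openai import OpenAI

client = OpenAI()

def answer(question: str, vector_store, k: int = 3) -> str:
    # Step 2: find the most relevant text for the user's question
    chunks = vector_store.search(question, k=k)  # hypothetical retriever
    context = "\n\n".join(chunks)

    # Step 3: send a concise prompt with the relevant document text to the LLM
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, reply 'No answer found'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Step 4: the user receives an answer or a 'No answer found' response
    return response.choices[0].message.content
```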

Turing is a RAG application.

Simplified diagram of Turing’s flow

RAG Stack

Current RAG Stack for building a QA System

Chatbots that operate over unstructured data are commonly built on the same key stack: load data from a data source and query it to retrieve the necessary information. But this alone isn’t quite enough!

Challenges with “Naive” RAG

  1. Bad Retrieval

Low Precision: Not all chunks retrieved are relevant — leads to ‘hallucinations’ and ‘lost in the middle’ problems.

Low Recall: Not all relevant chunks are retrieved — lacks enough context for the LLM to synthesize an answer.

Outdated Information: The data is redundant or out of date.

  2. Bad Response Generation

Hallucination: Model makes up an answer that isn’t in the context.

Irrelevance: Model makes up an answer that doesn’t answer the question.

Toxicity/Bias: Model makes up an answer that is harmful/offensive.

How to Improve the Performance of RAG Applications like Turing?

  • Data: Can we store additional information beyond raw text chunks?
  • Embeddings: Can we optimize our embedding representation?
  • Retrieval: Can we do better than top-k embedding lookup?
  • Synthesis: Can we use LLM for more than generation?

But before all this, we need to measure performance.

Evaluating RAGs

To improve the solution, we first need to identify the issues in the current implementation. While the end user might only see an incorrect result, we need to identify the source of this error.

Evaluation on Retrieval

To assess the retrieved chunks based on a user query, we need to generate a dataset that includes queries as input and the relevant documents as output. We can create this dataset manually by annotating the data or generating it synthetically. Once we have the dataset, we can measure ranking metrics such as success rate, hit rate, MRR, NDCG, and many more to evaluate the retrieval system’s performance.
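As a sketch, two of these metrics can be computed as follows, assuming each query in the dataset is annotated with the id of its single relevant document (the data below is made up).

```python
# Hit rate: fraction of queries whose relevant document appears in the top-k.
# MRR: mean reciprocal rank of the first relevant document (0 if never retrieved).
def hit_rate(results, relevant, k=5):
    return sum(rel in ranked[:k] for ranked, rel in zip(results, relevant)) / len(relevant)

def mrr(results, relevant):
    total = 0.0
    for ranked, rel in zip(results, relevant):
        if rel in ranked:
            total += 1.0 / (ranked.index(rel) + 1)
    return total / len(relevant)

# Ranked doc ids returned per query vs. the annotated relevant doc id
retrieved = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
gold = ["d1", "d5"]
print(hit_rate(retrieved, gold, k=3))  # 0.5
print(mrr(retrieved, gold))            # 0.25
```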

Evaluation E2E

Evaluation of the final response given the input requires a dataset with queries and their corresponding ground-truth answers. We then run the full RAG pipeline we built and apply LLM-based evaluations, both with labels and label-free.

To identify the cause of an error and the appropriate fix, we need to classify our responses into one of the following quadrants.

Slide by Colin Jarvis (Solutions Architect — OpenAI) presented on Microsoft Ignite Switzerland

To do this, I used an external Knowledge Base (Wikipedia articles) and prompted Turing with about 1500 questions based on the Knowledge Base.

Turing’s results for 1500 queries based on Wikipedia articles

This is just a preliminary analysis with a few articles and 1500 prompts. To get an accurate result for these categories, we would want to analyze them extensively over a broader Knowledge Base with more questions.
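To make the quadrant assignment concrete, here is a rough sketch under the assumption that the quadrants are spanned by two booleans: whether retrieval surfaced the right context, and whether the generated answer was correct. The labels are mine, not taken from the slide.

```python
# Assign each evaluated query to a quadrant so we know whether to fix
# retrieval or generation first.
def quadrant(retrieval_ok: bool, answer_ok: bool) -> str:
    if retrieval_ok and answer_ok:
        return "good retrieval, good answer"   # working as intended
    if retrieval_ok and not answer_ok:
        return "good retrieval, bad answer"    # generation / prompting issue
    if not retrieval_ok and answer_ok:
        return "bad retrieval, good answer"    # model answered from its own memory
    return "bad retrieval, bad answer"         # fix retrieval first

print(quadrant(retrieval_ok=True, answer_ok=False))
```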

Optimizing RAGs

When we want to improve a RAG system, there are countless things we could try. Rather than jumping straight to the hardest techniques, we should start with the basics and work our way up.

From Simple to Advanced RAGs

Strategies to Optimize RAGs

Tune your chunking

To generate accurate and comprehensive answers to user questions, it’s important to understand the context in which the question is being asked. Sometimes, a single page or document may provide enough context, but in other cases, multiple pages or documents may be required to provide a complete understanding of the question.

It is also important to understand the data. We should consider whether our data can be broken down into smaller parts without losing context or if specific preprocessing steps are needed to obtain metadata, such as the (sub)chapter title in a dense document.

Libraries like LangChain offer pre-built document transformers, but we can also create our own preprocessing logic. In most cases, a simple approach using a sliding window to chunk your data will suffice. However, for more specific cases, additional logic may be beneficial. After analyzing a few Wikipedia articles, we found that the existing chunking strategy worked well. However, it might not be suitable for the native data.
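For illustration, a dependency-free sliding-window chunker could look like the sketch below; LangChain’s text splitters implement the same idea with more options, and the chunk size and overlap values here are arbitrary.

```python
# Simple sliding-window chunking: fixed-size windows with a small overlap so
# that each boundary keeps some surrounding context.
def chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

article = "..."  # e.g. the plain text of a Wikipedia article
pieces = chunk(article, chunk_size=500, overlap=50)
```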

Hybrid Search

Hybrid search is a method of using multiple search algorithms to achieve more precise and relevant search results. It combines the advantages of keyword-based and vector search techniques, resulting in a better search experience.

Keyword-based search algorithms work by matching exact keywords or phrases to search terms. This can be useful for finding exact matches of specific terms, such as “PLuG” in the documentation. However, keyword-based searches can miss relevant results that use synonyms or different phrasing.

Example of how Turing’s responses differ for nearly identical queries, even though the same document is retrieved.

Vector-based search algorithms, on the other hand, consider the context and meaning of the words used in the search query. This can be useful for finding related results that might not include the exact search terms but are still relevant to the query. However, vector-based searches can sometimes miss exact matches of specific terms.

Hybrid Search Data Flow (Simplified)

Hybrid search provides a more comprehensive and accurate search experience by combining both keyword-based and vector-based search techniques. It can identify exact matches of specific terms while also considering the context and meaning of the words used in the search query. This results in more relevant search results and thus provides a more accurate context to your LLM.
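One simple way to fuse the two result lists is reciprocal rank fusion; the sketch below assumes each retriever returns an ordered list of document ids, best first (the ids are made up).

```python
# Reciprocal rank fusion (RRF): a document's fused score is the sum of
# 1 / (k + rank) over every ranked list it appears in, so documents ranked
# highly by either the keyword or the vector retriever float to the top.
def rrf(keyword_results: list, vector_results: list, k: int = 60) -> list:
    scores = {}
    for ranking in (keyword_results, vector_results):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf(["plug-docs", "billing", "sso"], ["sso", "plug-docs", "webhooks"])
print(fused)  # documents ranked by fused score
```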

Query parsing and metadata

Users can be unpredictable, as they may send lengthy stories or combine multiple questions into one message. It can help to filter out any extraneous information and identify the user’s main intent.

We can invest time in building a solid search index with metadata. By pre-processing the user’s query and applying relevant filters to our index, we can significantly improve the quality of search results.

This pattern involves a multi-step retrieval process, which helps improve the quality of the context provided to the LLM. First, the user’s query is parsed to retrieve the relevant filters. Then the search index is queried, with these filters specified. The retrieved documents will be provided to your LLM to generate a response.

The latest versions of gpt-3.5-turbo and gpt-4 have been fine-tuned to work with functions and are able to both determine when and how a function should be called. We can implement the pattern described above by using function calling, to provide a native way for these models to formulate API calls and structure data outputs.

Example of a complex query (L) Simplified RAG flow with metadata (R)
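Below is a sketch of this pattern using OpenAI function calling to turn a messy user message into a structured query plus filters; the `product` and `doc_type` fields are made-up examples of metadata a search index might expose.

```python
# Use function calling to extract a clean query and metadata filters from a
# long, noisy user message, then query the search index with those filters.
# Requires: pip install openai (>= 1.0)
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search the support docs with optional metadata filters",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The user's core question"},
                "product": {"type": "string", "description": "Product the question is about"},
                "doc_type": {"type": "string", "enum": ["how-to", "reference", "troubleshooting"]},
            },
            "required": ["query"],
        },
    },
}]

message = "Hey! Long story, but after trying a bunch of things... how do I embed PLuG on my site?"
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": message}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "search_knowledge_base"}},
)
args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
# args now holds the distilled query plus any filters to apply to the index
```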

Hypothetical Document Embeddings (HyDE)

In our previous methods, we generated embeddings for our questions and compared them to our documents to find relevant answers. But does it make sense to compare a question with an answer? With the advent of large language models, we have new possibilities: we can now generate an answer (which may not be factually correct) and compare it to the factual content in our knowledge base.

Hypothetical Document Embeddings (HyDE) is a design pattern that leverages LLMs to create hypothetical documents based on the user’s query. These hypothetical documents serve as a proxy for the ideal answer and can be used to improve the retrieval process. The idea is that by comparing the embeddings of the hypothetical documents to those of the knowledge base, it becomes easier to identify the most relevant pieces of information to provide as context for the LLM.

HyDE flow (Simplified)
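A minimal HyDE sketch under the same assumptions as before (a hypothetical `vector_store` with a `search_by_embedding()` method, and OpenAI models as one possible choice) might look like this:

```python
# HyDE: draft a hypothetical answer, embed the draft instead of the question,
# and retrieve real chunks whose embeddings are closest to the draft's.
from openai import OpenAI

client = OpenAI()

def hyde_retrieve(question: str, vector_store, k: int = 3) -> list:
    # 1. Generate a hypothetical (possibly incorrect) answer
    draft = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Write a short passage answering: {question}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical answer rather than the question itself
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=draft,
    ).data[0].embedding

    # 3. Retrieve knowledge-base chunks nearest to the draft's embedding
    return vector_store.search_by_embedding(embedding, k=k)  # hypothetical method
```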

Other options

Some other approaches that might improve RAGs are:

  • Tune the vector search algorithm and parameters: find the right balance between accuracy and latency
  • “Read Retrieve Read” / ReAct: Iteratively evaluate the question for missing information and formulate a response once all information is available.
  • Parent Document Retriever: Fetch small chunks during retrieval to better capture semantic meaning, provide larger chunks with more context to your LLM.
  • Structural Index: One effective method for enhancing information retrieval is to establish a hierarchical structure for the documents.
  • Hierarchical index structure: Files are arranged in parent-child relationships, with chunks linked to them.
  • Knowledge Graph index: Using a knowledge graph (KG) to construct the hierarchical structure of documents helps maintain consistency.
  • Query Expansion: Expanding a single query into multiple queries enriches its content and provides additional context, which helps capture nuances the original query may lack and improves the relevance of the generated answers.
  • Multi-Query: By employing prompt engineering to expand queries via LLMs, these queries can then be executed in parallel.
  • Sub-Query: Sub-question planning generates the sub-questions needed to contextualize and, when combined, fully answer the original question. Adding relevant context this way is, in principle, similar to query expansion.
  • Chain-of-Verification (CoVe): The expanded queries are validated by the LLM to reduce hallucinations; validated expanded queries typically exhibit higher reliability.
  • Query Transformation: The core concept is to retrieve chunks based on a transformed query instead of the user’s original query.
  • Query Rewrite: The original queries are not always optimal for LLM retrieval, especially in real-world scenarios, so we can prompt the LLM to rewrite them.
  • Fine-tuning Embedding Model: In instances where the context significantly deviates from pre-training corpus, particularly within highly specialized disciplines such as healthcare, legal practice, and other sectors replete with proprietary jargon, fine-tuning the embedding model on your own domain dataset becomes essential to mitigate such discrepancies.
  • Reranking: Reranking reorders retrieved document chunks so the most pertinent results come first, effectively reducing the overall document pool. It serves a dual purpose in information retrieval, acting as both an enhancer and a filter, and delivers refined inputs for more precise language model processing (see the reranking sketch after this list).
  • Context Selection/Compression: A common misconception in the RAG process is the belief that retrieving as many relevant documents as possible and concatenating them to form a lengthy retrieval prompt is beneficial. However, excessive context can introduce more noise, diminishing the LLM’s perception of key information.
  • Iterative Retrieval: The knowledge base is repeatedly searched based on the initial query and the text generated so far, providing more comprehensive knowledge.
  • Recursive Retrieval: Used in information retrieval and NLP to improve the depth and relevance of search results. The process involves iteratively refining search queries based on the results obtained from previous searches.
  • Adaptive Retrieval: Enables LLMs to actively determine the optimal moments and content for retrieval, thus enhancing the efficiency and relevance of the information sourced.
  • Emotional Stimuli: Add EmotionPrompt-style emotional stimuli to the prompt to enhance performance.
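As one example from the list above, here is a sketch of the reranking step using a cross-encoder from the sentence-transformers library; the model name and candidate chunks are illustrative, and this is only one of many possible rerankers.

```python
# Rerank retrieved chunks with a cross-encoder that scores each (query, chunk)
# pair jointly, then keep only the best few as context for the LLM.
# Requires: pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list, top_n: int = 3) -> list:
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]

candidates = ["chunk about billing", "chunk about PLuG widget setup", "chunk about SSO"]
print(rerank("How do I install the PLuG widget?", candidates, top_n=2))
```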

References

OpenAI (n.d.). What are tokens and how to count them? https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., Kiela, D., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (2021), arXiv:2005.11401

Gao, L., Ma, X., Lin, J., Callan, J., “Precise Zero-Shot Dense Retrieval without Relevance Labels” (2022), arXiv:2212.10496

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y. “ReAct: Synergizing Reasoning and Acting in Language Models” (2023), arXiv:2210.03629

Karpas, E., Abend, O., Belinkov, Y., Lenz, B., Lieber, O., Ratner, N., Shoham, Y., Bata, H., Levine, Y., Leyton-Brown, K., Muhlgay, D., Rozen, N., Schwartz, E., Shachaf, G., Shalev-Shwartz, S., Shashua, A., Tenenholtz, M. “MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning” (2022), arXiv:2205.00445

Harrison Chase, “Parent Document Retriever”. August 2023, https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever

Patrick von Platen, “How to generate text: using different decoding methods for language generation with Transformers”. July 2023, https://huggingface.co/blog/how-to-generate

Embeddings — OpenAI API. March 2023, https://platform.openai.com/docs/guides/embeddings

Introduction — OpenAI API. March 2023, https://platform.openai.com/docs/introduction/prompts

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., Zhang, Y. “Sparks of Artificial General Intelligence: Early experiments with GPT-4” (2023), arXiv:2303.12712

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., Scialom, T. “Toolformer: Language Models Can Teach Themselves to Use Tools” (2023), arXiv:2302.04761

Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., Grave, E., LeCun, Y., Scialom, T. “Augmented Language Models: a Survey” (2023), arXiv:2302.07842

X. Ma, Y. Gong, P. He, H. Zhao, and N. Duan, “Query rewriting for retrieval-augmented large language models,” arXiv preprint arXiv:2305.14283, 2023.

H. S. Zheng, S. Mishra, X. Chen, H.-T. Cheng, E. H. Chi, Q. V. Le, and D. Zhou, “Take a step back: Evoking reasoning via abstraction in large language models,” arXiv preprint arXiv:2310.06117, 2023.

R. Teja, “Evaluating the ideal chunk size for a RAG system using LlamaIndex,” https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5, 2023.

Langchain, “Recursively split by character,” https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter, 2023.

S. Yang, “Advanced RAG 01: Small-to-Big Retrieval,” https://towardsdatascience.com/advanced-rag-01-small-to-big-retrieval-172181b396d4, 2023.

Y. Wang, N. Lipka, R. A. Rossi, A. Siu, R. Zhang, and T. Derr, “Knowledge graph prompting for multi-document question answering,” arXiv preprint arXiv:2308.11730, 2023.

D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le et al., “Least-to-most prompting enables complex reasoning in large language models,” arXiv preprint arXiv:2205.10625, 2022.

S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston, “Chain-of-verification reduces hallucination in large language models,” arXiv preprint arXiv:2309.11495, 2023.

S. Zhuang, B. Liu, B. Koopman, and G. Zuccon, “Open-source large language models are strong zero-shot query likelihood models for document ranking,” arXiv preprint arXiv:2310.13243, 2023.

F. Xu, W. Shi, and E. Choi, “Recomp: Improving retrieval-augmented lms with compression and selective augmentation,” arXiv preprint arXiv:2310.04408, 2023.

N. Anderson, C. Wilson, and S. D. Richardson, “Lingua: Addressing scenarios for live interpretation and automatic dubbing,” in Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track), J. Campbell, S. Larocca, J. Marciano, K. Savenkov, and A. Yanishevsky, Eds. Orlando, USA: Association for Machine Translation in the Americas, Sep. 2022, pp. 202–209. [Online]. Available: https://aclanthology.org/2022.amta-upg.14

H. Jiang, Q. Wu, X. Luo, D. Li, C.-Y. Lin, Y. Yang, and L. Qiu, “Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression,” arXiv preprint arXiv:2310.06839, 2023.

V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih, “Dense passage retrieval for open-domain question answering,” arXiv preprint arXiv:2004.04906, 2020.

Y. Ma, Y. Cao, Y. Hong, and A. Sun, “Large language model is not a good few-shot information extractor, but a good reranker for hard samples!” ArXiv, vol. abs/2303.08559, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:257532405

J. Cui, Z. Li, Y. Yan, B. Chen, and L. Yuan, “Chatlaw: Open-source legal large language model with integrated external knowledge bases,” arXiv preprint arXiv:2306.16092, 2023.

O. Yoran, T. Wolfson, O. Ram, and J. Berant, “Making retrievalaugmented language models robust to irrelevant context,” arXiv preprint arXiv:2310.01558, 2023.

X. Li, R. Zhao, Y. K. Chia, B. Ding, L. Bing, S. Joty, and S. Poria, “Chain of knowledge: A framework for grounding large language models with structured knowledge bases,” arXiv preprint arXiv:2305.13269, 2023.

H. Yang, S. Yue, and Y. He, “Auto-gpt for online decision making: Benchmarks and additional opinions,” arXiv preprint arXiv:2306.02224, 2023.

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” arXiv preprint arXiv:2302.04761, 2023.

J. Zhang, “Graph-toolformer: To empower llms with graph reasoning ability via prompt augmented by chatgpt,” arXiv preprint arXiv:2304.11116, 2023.

R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders et al., “Webgpt: Browserassisted question-answering with human feedback,” arXiv preprint arXiv:2112.09332, 2021.

Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig, “Active retrieval augmented generation,” arXiv preprint arXiv:2305.06983, 2023.

A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” arXiv preprint arXiv:2310.11511, 2023.

G. Kim, S. Kim, B. Jeon, J. Park, and J. Kang, “Tree of clarifications: Answering ambiguous questions with retrieval-augmented large language models,” arXiv preprint arXiv:2310.14696, 2023.

H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, “Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions,” arXiv preprint arXiv:2212.10509, 2022.

Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and W. Chen, “Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy,” arXiv preprint arXiv:2305.15294, 2023.

Li, C., Wang, J., Zhang, Y., Zhu, K., Hou, W., Lian, J., Luo, F., Yang, Q., & Xie, X. (2023). Large Language Models Understand and Can be Enhanced by Emotional Stimuli. ArXiv. /abs/2307.11760

Written by Shrusti Ghela

MSDS @ University of Washington