AI Agents at work: real-time platform insights in Slack

Andrei Hryhoryeu

monday.com is a large platform comprising hundreds of microservices that interact with one another to serve millions of users daily. Naturally, platform resilience is crucial for us. A couple of years ago, we shared the solution monday.com built to protect the platform from spikes in anomalous traffic in a blog post.

Later on, we leveraged the data we were collecting by building a tool to visualize it, thus providing valuable insights to our users, monday.com developers. My colleague has described it in another blog post.

However, there were still missing pieces in the system. One of the biggest gaps we identified was the effort required of our developers to access the data, be it plotting a chart of requests or simply checking whether any accounts are currently blocked in a given service. We have always wanted to close this gap by moving the data closer to its consumers – ideally, it should be at their fingertips. And, like many developers in recent years, we felt we could do that by leveraging LLM agents and connecting them to our data. Better yet, we could bring the data to the interface everyone already uses: Slack messages!

Today, I’ll tell you how we developed an AI Slack bot that allows our colleagues to grasp the current state of the platform and quickly identify any outstanding cases that need to be addressed.

Architecture: Bringing data to the devs

We started by specifying the use case we wanted to cover in the first iteration and outlining the pieces of the solution design. We wanted our bot to be able to answer a question like, “What is the status of service my-service in the US region?” We already knew we wanted the user-facing interface to be a Slack bot because it is the interface all monday.com developers use and know.

Then, we knew we needed a new service that would be responsible for connecting to Slack. It would also need to talk to the LLM that would be the backbone of our AI Agent.

Providing the data to the model is most commonly done through the Model Context Protocol (MCP). You can think of MCP as a USB port for an AI model – a standardized way to communicate. The standard is widely used and can be integrated with different models. For us, this meant building an MCP server on top of our usual endpoints.
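To make this concrete, here is a minimal sketch of what such a server could look like using the official MCP Python SDK’s FastMCP helper. The tool name, parameters, and the fetch_lock_status stub are hypothetical, not our production code.

```python
# A minimal sketch of an MCP server exposing lock data as a tool.
# The tool name, parameters, and fetch_lock_status() are hypothetical.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("platform-insights")

def fetch_lock_status(service: str, region: str, minutes: int) -> dict:
    # Placeholder for a call to the existing internal endpoint.
    return {"service": service, "region": region, "blocked_accounts": 0, "window_minutes": minutes}

@mcp.tool()
def get_lock_status(service: str, region: str, minutes: int = 60) -> dict:
    """Return the current lock status for a service in a region over the last `minutes` minutes."""
    return fetch_lock_status(service=service, region=region, minutes=minutes)

if __name__ == "__main__":
    mcp.run()
```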

The part closest to AI engineering is building the actual agent logic. Choosing an LLM provider and a foundation model can significantly affect results, and, spoiler alert, we had to make it configurable to test various models and their parameters. For the actual logic of interaction with the bot, common solutions are the open-source frameworks LangChain and LangGraph. Both provide Python and TypeScript implementations, but the former exposes higher-level abstractions that allow for quicker development, so we opted for it to move quickly while building the prototype.
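As a rough sketch of that configurability (the environment variable names and default model here are hypothetical), LangChain’s init_chat_model helper lets the provider and model be swapped without code changes:

```python
# A minimal sketch of a configurable model choice via LangChain.
# Environment variable names and the default model are hypothetical.
import os

from langchain.chat_models import init_chat_model

llm = init_chat_model(
    os.getenv("AGENT_MODEL", "claude-3-5-sonnet-latest"),
    model_provider=os.getenv("AGENT_MODEL_PROVIDER", "anthropic"),
)

print(llm.invoke("Summarize the purpose of a rate limiter in one sentence.").content)
```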

To build more complex agents with various capabilities, the frameworks offered us more advanced patterns. Combining several distinct abilities and merging their results can be achieved through the Supervisor Agent pattern, in which a separate agent governs the work of the task-specific agents. To involve users in the agent workflow and obtain additional data or confirmation, we can use a human-in-the-loop pattern. Neither of these approaches was needed for our MVP phase, but they came in handy as the product developed.

Overall, we ended up with the following design and started implementing it:

Diagram showing a Slack bot architecture where a Slack bot backend communicates with a LangChain agent, which connects to a large language model and MCP tools for system time and data storage within a microservice.

A typical message processing flow involves the following steps (a minimal sketch of the entry point follows the list):

  • Whenever our Slack bot receives a message, Slack forwards it to our service over WebSockets.
  • We enrich it with some context, for example, by querying additional thread messages from the last hour.
  • The resulting request is then sent to the Agent defined in our code, which has a set of instructions called the system prompt. In it, we tell the model about its role, what we expect from it, and which tools it can use to fulfill the user’s request.
  • The agent can query the provided tools – MCP endpoints in our case – and get real data on the current state of the locks for a given service and time frame.
  • After getting the data, the agent returns a formatted Slack message that we then send back to the user through the Slack API.
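Putting the first and last steps together, here is a minimal sketch of the Slack-facing entry point, assuming Slack’s Bolt for Python with Socket Mode; the token variable names and the run_agent stub are hypothetical stand-ins for our internal code.

```python
# A minimal sketch of the Slack entry point using Bolt's Socket Mode.
# SLACK_BOT_TOKEN / SLACK_APP_TOKEN and run_agent() are hypothetical stand-ins.
import os

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])

def run_agent(question: str, thread_context: list[str]) -> str:
    # Placeholder: enrich the question with thread context, call the
    # LangChain agent (system prompt + MCP tools), return formatted text.
    return f"Echo: {question}"

@app.event("app_mention")
def handle_mention(event, say, client):
    # Pull recent thread messages as extra context for the agent.
    thread_ts = event.get("thread_ts", event["ts"])
    replies = client.conversations_replies(channel=event["channel"], ts=thread_ts)
    context = [msg.get("text", "") for msg in replies.get("messages", [])]

    answer = run_agent(event["text"], context)
    say(text=answer, thread_ts=thread_ts)

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```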

The limitation: Why LLMs shouldn’t count

However, even simple-looking projects often hide unforeseen issues. This one was no exception, especially since it was part of a new realm: AI engineering.

First of all, AI development has a key difference: it’s not deterministic. For the same input, an LLM can yield a different result almost every time! To control this randomness, LLM providers expose a set of sampling parameters that may vary across models and providers. The most common are temperature, top-p, and top-k. They control how the next token is picked from the probabilities the model generates (a toy sketch after the list illustrates the mechanics):

  • Temperature affects how probabilities are assigned to the candidates for the next token, thus controlling the creativity of the model.
  • Top-p and top-k control how many candidates the model chooses from, either by cumulative probability (top-p) or by a fixed count of the most likely next tokens (top-k).
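Here is a toy, self-contained sketch of those mechanics (not any provider’s actual implementation): softmax with temperature, followed by top-p filtering over the candidate tokens.

```python
# A toy illustration of how temperature and top-p shape next-token sampling
# from raw logits. Not any provider's actual implementation.
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 1.0, top_p: float = 1.0) -> str:
    if temperature == 0:
        # Greedy decoding: always pick the most likely token.
        return max(logits, key=logits.get)

    # Softmax with temperature: lower temperature sharpens the distribution.
    scaled = {tok: math.exp(score / temperature) for tok, score in logits.items()}
    total = sum(scaled.values())
    probs = {tok: p / total for tok, p in scaled.items()}

    # Top-p (nucleus) filtering: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then sample from that set.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, weights = zip(*kept.items())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token({"Up.": 2.0, "Down!": 0.5, "🍂": 0.1}, temperature=0.7, top_p=0.9))
```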

For our analytic workload, we opted to reduce creativity and set the temperature to 0, while still leaving some room by keeping top-p and top-k at their defaults. That allowed us to mostly avoid hallucinations without making the output feel hard-coded.

The next thing to keep in mind is that, as engineers, it’s on us to provide the model with data in a way it can interpret. At first, we thought we could simply throw the arrays of objects we use in our API at it and let it derive the statistics, grouping by account, user, timeframe, or vector. Readers who remember the famous “How many Rs are there in the word strawberry?” test have probably already realized that this may yield some very creative results that are not grounded in reality.

In our case, the model wasn’t grouping by account as we expected; instead, it hallucinated the numbers per group. So, it made more sense to pre-compute some of the groupings on the MCP server side to ensure the model serves credible data. Lesson learned: don’t ask the LLM to calculate for you; instead, ask it to fetch pre-computed data.
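As an illustration of that lesson, here is a hedged sketch of server-side pre-aggregation (the event shape and field names are invented): the MCP server counts locks per account itself, and the model only has to quote the result.

```python
# A sketch of pre-computing groupings server-side instead of asking the
# LLM to count. The record shape and field names are hypothetical.
from collections import Counter

def summarize_locks(raw_events: list[dict]) -> dict:
    """Aggregate raw lock events into per-account counts the model can quote verbatim."""
    per_account = Counter(event["account_id"] for event in raw_events)
    return {
        "total_locks": sum(per_account.values()),
        "locked_accounts": len(per_account),
        "top_accounts": per_account.most_common(5),
    }

events = [
    {"account_id": "acct-1", "vector": "ip"},
    {"account_id": "acct-1", "vector": "user"},
    {"account_id": "acct-2", "vector": "ip"},
]
print(summarize_locks(events))
# {'total_locks': 3, 'locked_accounts': 2, 'top_accounts': [('acct-1', 2), ('acct-2', 1)]}
```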

Another AI-specific caveat is the need to restrict what the model can do. The problem is well-known and hard to solve, but we needed to at least try to limit the topics and types of questions the bot handles. This requires a delicate balance: describe the scope carefully, but without elaborating so much that you spend too many tokens and shrink the effective context window!

One thing we invented was a “haiku test”: we asked the model to generate a haiku on topics ranging from general service resilience to the status of a specific service. It sometimes yielded true gems, for example:

In the server room,
Packets dance like autumn leaves,
SREs sigh, “Up.” 🍂

By tuning the prompt and explicitly listing the model’s capabilities, we reached a point where it is hard to make the agent perform tasks outside its intended scope.
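For illustration, here is an invented example of the kind of scope-limiting instructions we mean; it is not our production prompt, just a sketch of the shape:

```python
# An invented example of scope-limiting system-prompt text, kept short to
# save tokens; not the actual production prompt.
SYSTEM_PROMPT = """
You are a Slack assistant for monday.com platform-resilience data.
You can ONLY:
1. Report current or historical lock/block status for a service, region, and time frame.
2. Summarize pre-computed lock statistics returned by your tools.
Always fetch numbers from the tools; never estimate or compute them yourself.
If a request is outside this scope, politely decline and explain what you can do.
"""
```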

Results and Next steps

Now, we have a bot that helps our team quickly answer repetitive questions like, “Were any users blocked in service X in the last hour?” Previously, answering them required going to one of our internal tools or digging through automated Slack notifications. The bot saves time while shifting ownership closer to the developers writing the code. In fact, I’ll let the bot speak for itself:

Screenshot of a Slack conversation where a bot named “Sir Locks-A-Lot” explains the types of questions it can answer about lock analysis and lock suppression.

After rolling out the initial version, we identified other low-hanging fruit: our team is occasionally asked to suppress blocks for certain accounts that are being migrated. A desire to automate this potentially dangerous data manipulation led us to employ the human-in-the-loop pattern, so that a human approves every change, and to merge two separate agent capabilities using the Supervisor Agent pattern. The agent now shows the data it is going to use to call an MCP tool, and after approval, that data goes directly to our microservice, eliminating the risk of the model hallucinating along the way.
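As a rough sketch of that approval step, assuming Slack Block Kit buttons handled by Bolt (the action IDs and the apply_suppression stub are hypothetical):

```python
# A sketch of the human-in-the-loop approval step using Slack Block Kit
# buttons with Bolt. Action IDs and apply_suppression() are hypothetical.
import os

from slack_bolt import App

app = App(token=os.environ["SLACK_BOT_TOKEN"])

def apply_suppression(account_id: str) -> None:
    # Placeholder: call the microservice directly with the approved payload.
    print(f"Suppressing locks for {account_id}")

def approval_blocks(account_id: str, service: str) -> list[dict]:
    """Block Kit message asking a human to approve or reject the change."""
    return [
        {"type": "section",
         "text": {"type": "mrkdwn",
                  "text": f"*Lock Suppression Approval Required*\nAccount `{account_id}` in `{service}`"}},
        {"type": "actions", "elements": [
            {"type": "button", "style": "primary", "action_id": "approve_suppression",
             "text": {"type": "plain_text", "text": "Approve"}, "value": account_id},
            {"type": "button", "style": "danger", "action_id": "reject_suppression",
             "text": {"type": "plain_text", "text": "Reject"}, "value": account_id},
        ]},
    ]

@app.action("approve_suppression")
def on_approve(ack, body, say):
    ack()  # acknowledge the button click within Slack's response window
    account_id = body["actions"][0]["value"]
    # The payload proposed by the agent is applied only after a human approves.
    apply_suppression(account_id)
    say(f"Suppression for `{account_id}` applied ✅")
```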

So now, we have a platform to perform data modifications safely:

Screenshot of a Slack conversation where a user requests account consolidation lock suppression and a bot responds with a “Lock Suppression Approval Required” message showing account details and approve/reject buttons.

The experience gave us a lot of fresh ideas. In the future, we want to connect the model to our raw data and enable it to execute SQL queries to respond to user requests. Then, we want to feed the AI our raw data and processing rules so it can help us fine-tune them and potentially identify new, unseen patterns in the traffic.

Leveraging AI models for internal tools has great potential. By starting with an MVP with a clearly defined scope, we managed to get the bot working quickly and start collecting feedback. Although the tool is in its early adoption phase, it has already saved our team hours of work on the chores and simple requests that our users can now handle on their own. On top of that, having a bot opened up new opportunities to empower developers with more capabilities, which we now want to pursue.