The Deep Tech Behind Estimating Food Preparation Time

Have you ever wondered why we, as humans, have always gravitated towards things which can be done instantly? Why do instant noodles, instant coffee or ready-to-eat packed food find a place in our kitchens?

We as a species simply hate to wait. We believe the same goes for food. A couple of questions that always crop up in our minds are ‘Where is my food?’ or ‘When will my food arrive?’

For Zomato, from the moment a customer opens the app until their food arrives at their doorstep, it is important for us to provide accurate information on when their food will be delivered. An estimate that is higher than the actual time can deter customers from ordering. An estimate that is lower than the actual delivery time does the same, and can also increase the inflow to our customer support.

Hence an accurate time estimation not only results in better customer experience but can also reduce the burden on our customer support teams.

What happens once an order is placed?

As showcased above, in the food delivery ecosystem, multiple handshakes happen once a customer places an order.

Each of these steps has a time component associated with it, i.e., the time it will take for the restaurant to prepare the food (Food Preparation Time, FPT), the time it will take for our Delivery Partner (DP) to reach the restaurant (DP pick up time), and the time it will take for our DP to reach the customer’s address (DP drop time).

All of these, plus a few other time components (predictable and unpredictable), play out together to give the time from order placement to final delivery, which is then shown to our customers.

How does a better FPT prediction help?

Zomato’s online food delivery platform has restaurants across 500+ cities with 400+ cuisines in India, which clearly tells us that one can find diversity in every nook and corner of this nation.

This scale of business demands better FPT prediction, which in turn leads to better delivery time estimates, better allocation of DPs during order assignment and more efficient delivery of orders. It also helps us engage better with our Restaurant Partners when monitoring FPT breaches and compliance.

What factors contribute to FPT prediction?

There are multiple factors that affect the FPT for a particular dish. Say a customer orders Chicken Biryani (D1) from two restaurants (R1 and R2) –

  • R1 is a restaurant which specializes in making Biryani.
  • R2 is a multi-cuisine restaurant with Biryani as one of the dishes, in addition to others.

All else being equal, one would expect the FPT of Chicken Biryani from R1 to be lower than that from R2 since –

  • R1 specializes in Biryanis, so one expects its kitchen capacity and food preparation to be geared towards Biryanis. It might also have optimized its process for minimum Biryani preparation time.
  • R2 caters to a wide range of food options, so it probably won’t have processes optimized for a particular item, as the same kitchen is shared across multiple dish preparations.

But there might be certain additional factors at play here –

  • Queued orders – These are the number of orders already in queue for each restaurant. It’s possible that one restaurant has a long queue and another restaurant has no active orders.
  • Fine dining restaurants vs delivery kitchens – If one restaurant happens to be a fine dining establishment and another a delivery kitchen, the latter is expected to have a shorter FPT.
  • Opening hours – This refers to how long the restaurant has been open for delivery. Has it just opened, or has it been open for a while? These are soft parameters that convey whether the kitchen is in full flow or not.
  • Other items in order – In addition to the Chicken Biryani, what other items are in the order? Can they be prepared in parallel, or must they be prepared sequentially? Do those individual items have a higher or lower FPT?

That’s a lot of components to keep in mind!

How did we represent these components?

Given the nature of the problem, we divided it into two major components –

  1. Item Level Information, i.e., the item composition of the order
    • As suggested earlier, different items would have a different preparation time.
    • Higher quantity orders may take a little more time.
  2. Restaurant Level Information, i.e., the inherent characteristics and nature of a Restaurant with respect to food preparation time
    • Fast food or delivery only restaurants would have a different behaviour compared to fine dining.
    • The kitchen capacity of each restaurant may vary.

1. Encoding Item Level Information

We notice that item-level information is usually in text format. To use text in machine learning models, the most common methods are Bag-of-Words, Tf-Idf and Word2Vec embeddings. The first two fail at our scale because they produce One Hot style encodings, in which every distinct entry in the vocabulary gets its own column. Given that the number of distinct dishes on our platform is ~3.5m, this would have added millions of columns to our data.

We discarded those two approaches because –

  • The data keeps growing with the number of dishes (new dishes are added to our platform every day).
  • High dimensionality – It would require a lot of storage and might create latency issues during model serving.
  • The resulting sparse inputs would make the subsequent ML models less robust.
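To make the dimensionality concern concrete, here is a minimal sketch of one hot encoding for dish names. The dish names and the `one_hot` helper are made up for illustration and are not taken from our pipeline. The only point is that the length of each vector equals the number of distinct dishes.

```python
# Toy menu; the dish names are illustrative, not real platform data.
dishes = ["chicken biryani", "veg biryani", "mango milkshake", "butter chicken"]

# Assign one column per distinct dish.
column_of = {dish: i for i, dish in enumerate(sorted(set(dishes)))}


def one_hot(dish):
    """Return a vector with a single 1 in the column assigned to `dish`."""
    vec = [0] * len(column_of)
    vec[column_of[dish]] = 1
    return vec


encoded = one_hot("chicken biryani")
# The vector is as long as the number of distinct dishes and has one non-zero entry.
print(len(encoded), sum(encoded))
```

With ~3.5m distinct dishes, each such vector would have ~3.5m entries, almost all of them zero, and two related dishes such as the two Biryanis above would share no information.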

For us, Word2Vec embedding became the preferred choice because –

  • It allows us to embed the item-level information in less memory.
  • It allows the model to learn the behaviour of similar items in terms of cuisine and preparation style.

The above image is a visualisation and subsequent clustering of menu item vectors trained using Word2Vec. One can see how different clusters are being formed. For example, all types of Biryanis are together, but are far off from Milkshakes, which is expected as they are fundamentally different dishes.
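The notion of dishes being "together" or "far off" can be sketched with cosine similarity between item vectors. The three-dimensional vectors below are hand-made toy values, not real Word2Vec output, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical item vectors; real embeddings would be learned by Word2Vec
# and would typically have many more dimensions.
item_vectors = {
    "chicken biryani": [0.9, 0.8, 0.1],
    "veg biryani": [0.8, 0.9, 0.2],
    "mango milkshake": [0.1, 0.2, 0.9],
}

sim_biryanis = cosine_similarity(
    item_vectors["chicken biryani"], item_vectors["veg biryani"]
)
sim_cross = cosine_similarity(
    item_vectors["chicken biryani"], item_vectors["mango milkshake"]
)
# The two Biryanis score much higher than the Biryani and Milkshake pair.
print(round(sim_biryanis, 2), round(sim_cross, 2))
```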

An order seldom contains only one item. In such a scenario, we take the quantity- and cost-weighted average of the item vectors to get the final menu representation, as shown below –

Let’s take, for example, an order containing N items.

The final order representation is the average of the N item vectors, weighted by the cost of each item and the quantity ordered.
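A minimal sketch of this weighted average follows. The exact weighting is not spelled out in the text, so the sketch assumes that the weight of an item is its quantity multiplied by its cost. The `order_vector` function and the toy two-dimensional vectors are hypothetical.

```python
def order_vector(items):
    """Weighted average of item vectors.

    `items` is a list of (vector, quantity, cost) tuples.
    Assumption: the weight of each item is quantity * cost.
    """
    if not items:
        raise ValueError("an order needs at least one item")
    total_weight = sum(quantity * cost for _, quantity, cost in items)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    dim = len(items[0][0])
    result = [0.0] * dim
    for vec, quantity, cost in items:
        weight = (quantity * cost) / total_weight
        for i in range(dim):
            result[i] += weight * vec[i]
    return result


# Toy order: two items with equal total weight (2 * 100 == 1 * 200).
order = [
    ([1.0, 0.0], 2, 100.0),
    ([0.0, 1.0], 1, 200.0),
]
representation = order_vector(order)
print(representation)
```

Because the weights are normalised by their sum, the order vector stays in the same space and scale as the item vectors, regardless of how many items the order contains.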

2. Encoding Restaurant Level Information

Given that food is ordered from 150k+ restaurants in a month, understanding how a restaurant can be represented numerically for a machine learning model becomes the most essential part of this puzzle.

In our case, a restaurant is represented by categorical data. Categorical data is very common in business datasets. For example, users are typically described by country, gender, age groups, etc. Products are often described by product type, manufacturer, seller etc.

The most used category representations are One Hot Encoding, Encoding Categories with Dataset Statistics, or Encoding Categories as Cluster labels.

Categorical data is extremely convenient for comprehension but very hard for most machine learning algorithms, due to these reasons –

  • High cardinality – categorical variables may have a large number of levels (e.g., city or restaurants), where most of the levels appear in a relatively small number of instances.
  • Many machine learning models (e.g., SVM) are algebraic, thus their input must be numerical. Using these models, categories must be transformed into numbers first before we can apply the learning algorithm.

To get around this, we use entity embeddings. The basic premise is that we let a neural network learn the best representation of a restaurant by itself. An entity embedding is a vector (a list of real numbers) that represents an entity, which is a restaurant in this case.

The above image is a t-SNE plot (commonly used to visualise high dimensional data) of the restaurants with the most orders in Bangalore, where restaurants serving similar cuisines and dishes are clubbed together.

X = {Current Order Level Information, Order Vector, Restaurant Vector}

Y = Food Preparation Time

We initialise an embedding matrix representing each restaurant with ‘m’ dimensions. Each column of the embedding matrix represents one restaurant. Then using various features related to an order, the X-Vector is passed through a neural network. Through backpropagation, the restaurant representations get updated with each iteration along with the weights.
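The idea of updating restaurant representations through backpropagation can be sketched without any deep learning library. This is a deliberately tiny stand-in, not our production network: the prediction is a single linear layer on top of the restaurant vector, the training data is invented, and the embedding matrix is stored with one row per restaurant for convenience.

```python
import random

random.seed(0)

# Hypothetical training data: (restaurant index, observed FPT in minutes).
samples = [(0, 12.0), (0, 14.0), (1, 25.0), (1, 27.0), (2, 13.0), (2, 11.0)]
num_restaurants, m = 3, 2  # m = number of embedding dimensions

# Embedding matrix, randomly initialised (one m-dimensional vector per restaurant).
embeddings = [
    [random.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(num_restaurants)
]
weights = [random.uniform(-0.1, 0.1) for _ in range(m)]
bias = 0.0


def predict(restaurant):
    """Linear layer on top of the restaurant's embedding vector."""
    return sum(w * e for w, e in zip(weights, embeddings[restaurant])) + bias


def mean_squared_error():
    return sum((predict(r) - y) ** 2 for r, y in samples) / len(samples)


initial_loss = mean_squared_error()
learning_rate = 0.01
for _ in range(2000):
    for r, y in samples:
        err = predict(r) - y
        for i in range(m):
            grad_w = err * embeddings[r][i]  # gradient w.r.t. the layer weight
            grad_e = err * weights[i]        # gradient w.r.t. the embedding entry
            weights[i] -= learning_rate * grad_w
            embeddings[r][i] -= learning_rate * grad_e
        bias -= learning_rate * err
final_loss = mean_squared_error()
print(initial_loss > final_loss)
```

Only the embedding of the restaurant that appears in a given sample receives a gradient, which is how each restaurant ends up with its own learned representation.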

Read more for information on Categorical Embedding.

How did we train our model?


Through the embedding matrix, we get the final restaurant representation. We then pass the same X-Vector as in the entity embedding architecture to an XGBoost Regressor model.
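As a rough sketch, one row of X can be assembled by concatenating its three parts. The feature names in the comment, the toy values and the `build_feature_row` helper are all hypothetical. Fitting the regressor itself (for example with the `XGBRegressor` class from the xgboost package) is not shown here.

```python
def build_feature_row(order_features, order_vec, restaurant_vec):
    """Concatenate the three parts of X into one flat feature row."""
    return list(order_features) + list(order_vec) + list(restaurant_vec)


# Hypothetical order-level features: queued orders, delivery-kitchen flag, hour of day.
order_features = [3, 1, 18]
order_vec = [0.5, 0.5]               # weighted average of item vectors
restaurant_vec = [0.2, -0.7, 0.1]    # looked up from the trained embedding matrix

row = build_feature_row(order_features, order_vec, restaurant_vec)
print(len(row))
```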

Deep Learning Architecture

Our previous model architecture couldn’t take into account the sequence of previous orders that came to the restaurant, both the completed orders and the currently running orders.

One expects that if the ‘previously completed orders’ include an order of Butter Chicken, then the predicted FPT of a subsequent Butter Chicken order should be close to that past value. Passing information sequentially helps the model better understand the kitchen capacity and behaviour at time t. The FPT of a restaurant can also be viewed as a time series whose amplitude (the FPT of each order) depends on the item being cooked. Hence, we settled on a sequential architecture to better represent a restaurant’s kitchen.

Both running orders (at most 5 orders running at time t) and completed orders (the last 5 completed orders) are passed through a stacked LSTM layer. The resulting column vector is concatenated with the present order features and the Restaurant Embedding Vector.

The resulting column vector is passed through a 2 layer dense network and regressed on FPT.
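Sequence models are usually fed fixed-length inputs, so the running and completed orders need to be cut to 5 entries each. The sketch below shows one way to do that. How restaurants with fewer than 5 orders are handled is not described above, so the zero padding here is an assumption, and `to_fixed_length` is a hypothetical helper.

```python
MAX_ORDERS = 5


def to_fixed_length(order_vectors, dim, max_len=MAX_ORDERS):
    """Keep the most recent `max_len` order vectors, left-padded with zero vectors.

    Assumption: `order_vectors` is sorted from oldest to newest.
    """
    recent = [list(vec) for vec in order_vectors[-max_len:]]
    padding = [[0.0] * dim for _ in range(max_len - len(recent))]
    return padding + recent


# Seven completed orders with toy 2-dimensional vectors: only the last 5 are kept.
completed = [[float(i), float(i)] for i in range(7)]
completed_seq = to_fixed_length(completed, dim=2)

# Two running orders: three zero vectors are added in front.
running = [[1.0, 2.0], [3.0, 4.0]]
running_seq = to_fixed_length(running, dim=2)
print(len(completed_seq), len(running_seq))
```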

Through this, we were able to reduce our mean absolute error from 4.64 mins to 4.13 mins and mean squared error from 32 to 28.
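For reference, the two error metrics quoted above can be computed as follows. The FPT values in the example are invented for illustration and are not our evaluation data.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference, in the same unit as the inputs (minutes)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared difference, in squared units (minutes squared)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Toy FPT values in minutes.
actual = [10.0, 20.0, 15.0]
predicted = [12.0, 17.0, 15.0]

mae = mean_absolute_error(actual, predicted)  # (2 + 3 + 0) / 3
mse = mean_squared_error(actual, predicted)   # (4 + 9 + 0) / 3
print(mae, mse)
```

The squared error penalises large misses more heavily, which is why it is worth tracking both numbers.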

Enhancements to our model and next steps

In addition to the encoding of data across restaurants and dishes, we were further able to enhance the model with a restaurant level information input of preparation time.

Previously, we used to calculate FPT as the difference between the restaurant accepted order timestamp and DP order pick up timestamp. This didn’t result in true FPT as the behaviour of a particular DP during order pick up became a part of the equation. This ideally shouldn’t be the case as FPT is a restaurant phenomenon. In order to correct this, we introduced a Food Order Ready (FOR) button in the Restaurant Partner app.
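The difference between the two definitions of FPT can be shown with a few timestamps. The timestamps and variable names below are invented for illustration.

```python
from datetime import datetime


def minutes_between(start, end):
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60.0


# Hypothetical timestamps for a single order.
accepted_at = datetime(2020, 1, 1, 13, 0)     # restaurant accepts the order
food_ready_at = datetime(2020, 1, 1, 13, 18)  # restaurant marks Food Order Ready
picked_up_at = datetime(2020, 1, 1, 13, 25)   # DP picks up the order

old_fpt = minutes_between(accepted_at, picked_up_at)   # includes the DP's pick up delay
new_fpt = minutes_between(accepted_at, food_ready_at)  # restaurant-only preparation time
print(old_fpt, new_fpt)
```

In this toy order, the old definition overstates the preparation time by the 7 minutes it took the DP to collect the food after it was ready.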

Restaurant Partners can now mark this whenever the food items are prepared and ready for pick up. In our initial results, we saw a 9 percent improvement in the within-5-minutes accuracy of our predictions. As compliance with FOR increases, our predictions become even more accurate.
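One plausible reading of this accuracy metric is the share of orders whose predicted FPT is within 5 minutes of the actual FPT. The sketch below follows that reading, which is an assumption, and uses invented numbers.

```python
def within_tolerance_accuracy(actual, predicted, tolerance_minutes=5.0):
    """Fraction of predictions whose absolute error is within the tolerance."""
    hits = sum(
        1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance_minutes
    )
    return hits / len(actual)


# Toy FPT values in minutes; the absolute errors are 2, 7, 4 and 0.
actual = [10.0, 20.0, 30.0, 15.0]
predicted = [12.0, 27.0, 26.0, 15.0]

accuracy = within_tolerance_accuracy(actual, predicted)
print(accuracy)
```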

We are also moving towards the newest and most exciting paradigm in the world of data science – Reinforcement Learning, i.e., a self-learning system, which updates weights as per real-time errors observed at a restaurant level.

Given that food preparation time represents real-time behaviour, making such a system will be a more elegant solution for this problem statement, ensuring a smoother order tracking experience for our customers.