
Minerva: AI Companion for Math Challenges

By Roopa Dharshini

Minerva is built on the same principles and architecture as other NLP models such as Google’s BERT and ChatGPT, but with a subtle difference: it is trained on documents containing sophisticated mathematical expressions, which enables the model to preserve those expressions in its output.

To understand Minerva, we need to understand the basics of transformers and their components. In this blog, we will look into the architecture and implementation behind Minerva.

Table of contents


  1. What is Minerva?
  2. Working of Transformers: key processes and components
  3. Minerva's training dataset
  4. Inference-time techniques
    • Few-shot prompting
    • Chain-Prompting or Chain-of-thought Prompting
    • Majority Voting
  5. Efficiency after applying different inference techniques.
  6. Examples of Minerva with 'right' predictions
  7. Examples of Minerva with 'wrong' predictions
  8. Conclusion

What is Minerva?

Minerva is developed by Google. It is a transformer-based natural language processing model that solves quantitative reasoning problems, such as those in mathematics and science, from the problem statement alone, without being given the underlying formulation. Minerva can currently solve university-level problems without first breaking them down into mathematical expressions to feed into other tools.

Besides successfully solving problems, Minerva can also lay out clear information about each step of its solution. At its core, Minerva is an NLP model built for text prediction: it predicts the most probable next word.

Ultimately, it is not just a tool but a toolbox capable of solving a plethora of problems, invaluable for students as a learning aid and for researchers who want to delegate routine tasks.

Working of Transformers: key processes and components

  • A transformer contains a Transformer encoder and a Transformer decoder, both of which are fundamental neural network components.
  • The encoder and the decoder pass the input through a forward pass of several layers before handing the result to whatever layers are relevant to the task at hand.
  • The forward pass contains an embedding layer and self-attention layers. The self-attention layer captures long-range dependencies, that is, it can attend to words far back in the sequence, which is an advantage over recurrent neural networks; it also captures the semantic meaning of words by mapping words that appear in similar contexts to similar vectors.
  • The encoder also adds positional encoding to the input embedding matrix. Before embedding, the text corpus goes through preprocessing, which includes removing punctuation, digits, and stop words, lowercasing, and other steps required by the task at hand.
  • The text is then passed through a Word2Vec-style embedding layer. Word2Vec assigns lower-valued identifiers to more frequent words and higher values to less frequent words, which helps avoid numerical instability and overly large gradient steps.
  • The identifiers are associated with embedding vectors through an embedding matrix; the embedding vectors are adjusted by backpropagation and gradient descent to reduce the cost function (a minimal sketch of this lookup appears after this list).
  • The embedding vectors are also referred to as dense vectors because most of their components are non-zero, unlike sparse one-hot representations.
  • At the end of the embedding layer, the resulting embedding matrix is passed to the other layers. One of these essential layers is the self-attention layer.
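
To make the embedding step concrete, here is a minimal Python sketch of the idea: tokens are mapped to integer identifiers (more frequent words first) and then looked up in an embedding matrix of dense vectors. The toy corpus, vocabulary, and dimensions are illustrative assumptions, not Minerva's actual preprocessing or vocabulary.

```python
import numpy as np
from collections import Counter

# Toy corpus and preprocessing: lowercase and strip punctuation
# (illustrative only; real preprocessing depends on the task).
corpus = "Solve for x: 2x + 3 = 7. Solve carefully."
cleaned = corpus.lower()
for ch in ":.+=":
    cleaned = cleaned.replace(ch, "")
tokens = cleaned.split()

# More frequent words get smaller integer identifiers.
freq = Counter(tokens)
vocab = {word: idx for idx, (word, _) in enumerate(freq.most_common())}

# Embedding matrix: one dense vector per vocabulary entry.
# In a real model these vectors are learned by backpropagation;
# here they are randomly initialized just to show the lookup.
embedding_dim = 8
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

# Embedding lookup: map each token's identifier to its dense vector.
token_ids = [vocab[t] for t in tokens]
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (sequence_length, embedding_dim)
```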


  • The self-attention layer uses three matrices, Query (Q), Key (K), and Value (V), to produce an attention output, or weight, matrix.
  • The weights for the Query, Key, and Value projections are initialized randomly, and each of the three matrices is produced by multiplying the embedding matrix by its projection weights.
  • To produce the attention weight matrix, we take the dot product of Q and K and divide it by the square root of the key dimension. This scaling keeps the scores in a reasonable range and prevents overly large values from causing gradient problems.
  • Then we apply the softmax function to the attention scores so that each row sums to 1; in effect, we normalize the scores into a probability distribution over the input sequence. The normalized score matrix is then multiplied by the Value matrix and returned as the output of the self-attention layer (see the sketch after this list).
  • The output of the layer is passed through several other layers, such as residual connections and feed-forward neural networks, which help in predicting the output label.
  • If the predicted label differs from the target, the weights of the self-attention layer and the additional layers are updated by backpropagation to find the optimal weights.
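
The self-attention computation described above can be written as softmax(QKᵀ / √d_k) · V. Below is a minimal NumPy sketch of a single attention head following that formula; the weight initialization, dimensions, and input are illustrative assumptions, not Minerva's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, d_k=16, seed=0):
    """Single-head scaled dot-product self-attention.
    X is the embedded input of shape (seq_len, d_model)."""
    rng = np.random.default_rng(seed)
    d_model = X.shape[1]
    # Projection weights are random here; in a real model they are learned.
    W_q = rng.normal(size=(d_model, d_k))
    W_k = rng.normal(size=(d_model, d_k))
    W_v = rng.normal(size=(d_model, d_k))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Scaled dot-product: softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row is a distribution over the sequence
    return weights @ V

X = np.random.default_rng(1).normal(size=(5, 32))  # 5 tokens, d_model = 32
print(self_attention(X).shape)  # (5, 16)
```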

Are you interested in learning more about transformers? Enroll in Guvi’s IITM Pravartak certified Artificial Intelligence and Machine Learning Course. It covers all the important concepts of artificial intelligence, from basics such as the history of AI and Python programming to time series analysis, deep learning, and image processing techniques, with hands-on projects.

Recent research shows that scaling up the model improves its predictive performance.


Minerva uses a different word-to-vector algorithm, FastText, which is similar to Word2Vec, but FastText breaks each word into character substrings (n-grams) rather than treating the whole word as a single unit (a small sketch of this follows the list below). Why?

  • It decreases the computational cost.
  • In quantitative reasoning problems, many words are redundant; the words carrying numerical information are the valuable ones.
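
As an illustration of the subword idea, the sketch below breaks a word into character n-gram substrings in the style of FastText. The n-gram range and the angle-bracket boundary markers follow the FastText convention; this is not Minerva's actual tokenizer.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Break a word into character n-gram substrings, FastText-style.
    Angle brackets mark the word boundaries, as in the FastText convention."""
    word = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(char_ngrams("integral"))
# e.g. ['<in', 'int', 'nte', 'teg', ..., 'egral', 'gral>']
```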

Minerva’s training dataset

The Minerva team collected documents rich in mathematical notation to train the model and help it preserve that notation, all without changing the internal architecture of the transformer-based NLP model.

The model was trained on 175 GB of mathematical text. The documents are primarily collected from the following sources.

  • ArXiv – arxiv.org is a free distribution service and an open-access archive for 2 million+ scholarly articles in the fields of physics, mathematics, and computer science.
  • Webpages with mathematical texts.

Inference-time techniques

Inference time is when a trained model is used to make predictions on unseen data. Remember, the main purpose of an NLP model is to generalize to new, unseen data.

Few-shot prompting

Few-shot prompting is a technique in which the model is given a few worked examples before being asked to solve a problem. The examples do not need to involve complex mathematics; showing the model a direct, worked equation helps it solve the required problem in the same manner. This approach allows the model to generalize to new problems from just a few examples, making it more efficient than traditional methods that require extensive training on large datasets.

Few-shot prompting example shown to Minerva AI model
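
As a rough illustration, a few-shot prompt can be assembled by prepending a handful of worked examples to the new problem. The example problems and the formatting below are hypothetical and are not the prompts used in the Minerva paper.

```python
# Hypothetical worked examples; the real prompts used with Minerva differ.
few_shot_examples = [
    ("Problem: What is 15% of 200?",
     "Solution: 15% of 200 is 0.15 * 200 = 30. The answer is 30."),
    ("Problem: Solve 2x + 3 = 7 for x.",
     "Solution: Subtracting 3 gives 2x = 4, so x = 2. The answer is 2."),
]

new_problem = "Problem: Solve 3x - 5 = 10 for x."

# Prepend the worked examples, then ask the model to continue after "Solution:".
prompt = ""
for problem, solution in few_shot_examples:
    prompt += f"{problem}\n{solution}\n\n"
prompt += f"{new_problem}\nSolution:"

print(prompt)  # this string would then be sent to the model
```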

Chain-Prompting or Chain-of-thought Prompting

Chain-Prompting, or chain-of-thought prompting, is a technique in which the model is given a prompt containing a chain of related statements, spelling out the reasoning for a complex problem over multiple lines. The model learns from these examples and tries to predict the output for unseen data. This technique can be used to solve complex problems that require multiple steps or involve multiple variables: by breaking the problem down into smaller steps or sub-problems, the model can learn to solve the entire problem more reliably.
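
A chain-of-thought prompt looks much like a few-shot prompt, except that each exemplar spells out its intermediate reasoning before the final answer. The wording below is a hypothetical illustration, not an actual Minerva prompt.

```python
# Hypothetical chain-of-thought exemplar; the wording is illustrative only.
cot_example = (
    "Problem: A shirt costs $20 and is discounted by 25%. "
    "What is the final price?\n"
    "Solution: The discount is 25% of 20, which is 0.25 * 20 = 5. "
    "The final price is 20 - 5 = 15. The answer is 15.\n\n"
)

new_problem = (
    "Problem: A book costs $40 and is discounted by 10%. What is the final price?\n"
    "Solution:"
)

# The exemplar's intermediate steps nudge the model to reason step by step
# before stating its final answer.
prompt = cot_example + new_problem
print(prompt)
```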


Majority Voting

Minerva assigns probabilities to the different possible outputs. While answering a question, it stochastically samples many candidate solutions to the same problem. Majority voting then selects the answer that appears most often among those samples, so a single wrong sample is far less likely to become the final answer.
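
A minimal sketch of majority voting over sampled answers might look like this; the sampled answers below are hypothetical.

```python
from collections import Counter

def majority_vote(sampled_answers):
    """Return the answer that appears most often among the sampled solutions."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Hypothetical final answers extracted from k sampled solutions to one question.
samples = ["42", "42", "41", "42", "40", "42", "41", "42"]
print(majority_vote(samples))  # "42" wins the vote even though some samples disagree
```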

Efficiency after applying different inference techniques.

  • Minerva 540B is the model evaluated without majority voting but with the other inference-time techniques applied.
  • Minerva 540B maj1@k is the same model evaluated after majority voting is applied on top of the other techniques.
  • Accuracy increases by roughly 10% when majority voting is performed.

Examples of Minerva with ‘right’ predictions


Examples of Minerva with ‘wrong’ predictions


An analysis of false positives reveals that about 8% of the answers the model claims as correct are in fact false positives.

Conclusion

Minerva is a model used to solve quantitative problems. It was trained on 175 GB of mathematical text and is able to preserve mathematical notation unaltered. It builds on the pre-trained PaLM model, which has 540B parameters, and it solves problems efficiently when provided with an example.


