Phone: 2766-6863 (service hours)
Online Form
Contact your Faculty Librarians for in-depth research questions
Research involves managing vast amounts of information, and GenAI offers new ways to handle it effectively. This guide introduces GenAI tools that can support different stages of the research process.
With the overwhelming number of GenAI tools, it can be challenging to determine which ones to use. Start by considering the following three factors to guide your decision:
While GenAI offers clear advantages for information searching, such as speed, accessibility, and the ability to generate diverse perspectives, it is essential to critically assess its output, as AI-generated information can be incorrect and may mislead users.
The CRAAP test is a simple tool for evaluating information sources, including AI-generated content. It involves asking yourself questions across five key aspects to determine whether a source is suitable for your research or decision-making. Below are some suggested questions focused specifically on evaluating AI-generated information.
Criteria | Description | Questions
---|---|---
C - Currency | Timeliness of information |
R - Relevance | Contextual fit |
A - Authority | Source credibility |
A - Accuracy | Reliability of content |
P - Purpose | Reason for existence |
Adapted from Evaluating Information - Applying the CRAAP Test by Meriam Library, California State University, Chico
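If it helps to make the checklist concrete, the short Python sketch below shows one possible way to record a CRAAP review of an AI-generated answer. The criteria and descriptions come from the table above; the checklist structure, the helper name, and the pass/fail scoring are illustrative assumptions, not part of the CRAAP test itself.

```python
# Illustrative sketch: recording a CRAAP review of an AI-generated answer.
# Criteria and descriptions are taken from the table above; the pass/fail
# scoring scheme is an assumption made for this example.

CRAAP_CRITERIA = {
    "Currency": "Timeliness of information",
    "Relevance": "Contextual fit",
    "Authority": "Source credibility",
    "Accuracy": "Reliability of content",
    "Purpose": "Reason for existence",
}

def craap_summary(assessment: dict) -> str:
    """Summarise which CRAAP criteria an AI-generated output satisfies."""
    passed = [name for name, ok in assessment.items() if ok]
    needs_review = [name for name in CRAAP_CRITERIA if name not in passed]
    return (f"Passed {len(passed)}/{len(CRAAP_CRITERIA)} criteria; "
            f"review further: {', '.join(needs_review) or 'none'}")

# Example: an answer that cites current, credible sources but whose factual
# accuracy has not yet been verified against the original references.
print(craap_summary({
    "Currency": True,
    "Relevance": True,
    "Authority": True,
    "Accuracy": False,
    "Purpose": True,
}))
```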
Understanding how well an LLM performs across different functionalities enables you to select the most appropriate tool for your specific research needs. IBM describes LLM benchmarks as standardized frameworks for assessing the performance of large language models (LLMs). These benchmarks facilitate the evaluation of LLM skills in different areas, such as coding, common sense, reasoning, natural language processing, and machine translation.
The table below consolidates selected LLM benchmark scores* for the models available in PolyU GenAI. You can compare the scores to determine the most suitable GenAI tool for your work; a short comparison sketch follows the remarks below.
*Data retrieved from llm-stats.com
Model | MMLU | MMLU-Pro | GPQA | SimpleQA | AIME 2024 | MATH | MGSM | HumanEval
---|---|---|---|---|---|---|---|---
DeepSeek-R1 | 90.8% | 84.0% | 71.5% | 30.1% | 79.8% | - | - | - |
Llama-3.3-70B-Instruct | 86.0% | 68.9% | 50.5% | - | - | 77.0% | 91.1% | 88.4% |
Mistral | 84.0% | - | - | - | - | - | - | 92.0% |
GPT-o1 | 91.8% | - | 78.0% | 47.0% | 83.3% | 96.4% | 89.3% | 88.1% |
GPT-4o | 88.0% | 74.7% | 53.6% | 38.2% | 13.4% | - | - | - |
GPT-4o-mini | 82.0% | - | 40.2% | - | - | 70.2% | 87.0% | 87.2% |
GPT-o3-mini | 86.9% | - | 79.7% | 15.0% | 87.3% | 97.9% | 92.0% | - |
Qwen2.5-72B-Instruct | - | 71.1% | 49.0% | - | - | 83.1% | - | 86.6% |
Remarks:
Knowledge & Reasoning benchmarks: MMLU, MMLU-Pro, GPQA, SimpleQA
MMLU: Knowledge and reasoning across science, math, and humanities.
MMLU-Pro: Advanced version of MMLU with more complex reasoning tests.
GPQA: 448 "Google-proof" questions in biology, physics, and chemistry.
SimpleQA: 4,326 fact-seeking short questions for specific answers.
Math benchmarks: AIME 2024, MATH, MGSM
AIME 2024: Challenging problems from the American Invitational Mathematics Examination, a high school mathematics competition.
MATH: A dataset of competition-level math problems across 5 levels & 7 disciplines.
MGSM: 250 grade-school math problems translated into multiple languages (Multilingual Grade School Math).
Coding benchmark: HumanEval
HumanEval: Assesses code generation capabilities through programming challenges.
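As a rough illustration of how these scores might be compared, the Python sketch below ranks the listed models on a chosen benchmark. The figures are copied from the table above, with missing scores ("-") stored as None; the column order follows the benchmark list in the remarks, and the helper rank_by and data layout are illustrative assumptions rather than an official comparison method.

```python
# Illustrative sketch: shortlisting models by one benchmark score.
# Scores are copied from the table above; "-" entries become None and are skipped.

BENCHMARKS = ["MMLU", "MMLU-Pro", "GPQA", "SimpleQA",
              "AIME 2024", "MATH", "MGSM", "HumanEval"]

SCORES = {
    "DeepSeek-R1":            [90.8, 84.0, 71.5, 30.1, 79.8, None, None, None],
    "Llama-3.3-70B-Instruct": [86.0, 68.9, 50.5, None, None, 77.0, 91.1, 88.4],
    "Mistral":                [84.0, None, None, None, None, None, None, 92.0],
    "GPT-o1":                 [91.8, None, 78.0, 47.0, 83.3, 96.4, 89.3, 88.1],
    "GPT-4o":                 [88.0, 74.7, 53.6, 38.2, 13.4, None, None, None],
    "GPT-4o-mini":            [82.0, None, 40.2, None, None, 70.2, 87.0, 87.2],
    "GPT-o3-mini":            [86.9, None, 79.7, 15.0, 87.3, 97.9, 92.0, None],
    "Qwen2.5-72B-Instruct":   [None, 71.1, 49.0, None, None, 83.1, None, 86.6],
}

def rank_by(benchmark: str):
    """Return (model, score) pairs with a reported score on `benchmark`, best first."""
    col = BENCHMARKS.index(benchmark)
    scored = [(model, row[col]) for model, row in SCORES.items() if row[col] is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Example: shortlist models for reasoning-heavy research questions.
for model, score in rank_by("GPQA"):
    print(f"{model}: {score}%")
```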
Other LLM Benchmarking Websites