Reranker Study

Reranker Test Matrix

Select models, enter a query, and run the reranking evaluation against the 10-paper corpus.

Preset queries: immune evasion · macrophage role · host-directed therapy · antigen processing · epigenetics


Source of Truth — Block Annotations

Each paper block is manually tied to the four mechanisms and graded for relevance (3 = highly relevant, 2 = relevant, 1 = tangential, 0 = irrelevant).

Research Question: How does M. tuberculosis survive after being engulfed by immune cells — (A) preventing the phagosome from becoming destructive, (B) interfering with acidification & lysosome fusion, (C) manipulating host signaling through secreted factors, (D) occupying alternative lung immune cell niches?

Study Report

Benchmarking Open-Source Rerankers for Biomedical Document Ranking and AI-Generated Literature Review Evaluation

Abstract

Background
Neural rerankers have become central components of modern information retrieval pipelines, yet their performance on domain-specific biomedical corpora remains underexplored relative to general-domain benchmarks. Concurrently, large language models (LLMs) such as ChatGPT, Claude, Consensus, and SciSpace are increasingly used by researchers to generate scientific literature reviews, raising urgent questions about the quality, accuracy, and verifiability of their outputs. No standardized framework currently exists for evaluating AI-generated biomedical reviews, nor has prior work investigated whether reranking models — typically used for document retrieval — can serve as automated quality assessors of AI-generated scientific text.

Objective
This study presents a two-part evaluation framework. Part 1 benchmarks seven open-source reranker models — spanning biomedical-specific (MedCPT Cross-Encoder), general-purpose (BGE-reranker-v2.5-gemma2, BGE-reranker-v2-m3, Jina-reranker-v2, ms-marco-MiniLM-L-12), and large LLM-based architectures (MonoT5-3B, Qwen3-Reranker-4B) — on their ability to rank 713 text passages from a curated corpus of 10 peer-reviewed publications on Mycobacterium tuberculosis (Mtb) infection biology (2021–2025) against a compound research question decomposed into four immune evasion mechanisms. Part 2 evaluates literature reviews generated by four AI systems (ChatGPT, Claude, Consensus, and SciSpace) in response to an identical prompt on Mtb infection, using both a six-criterion expert rubric and reranker-derived automated scoring. The cross-validation between human expert judgment and reranker-assisted evaluation tests the hypothesis that reranking models can approximate expert assessment of AI-generated scientific text.

Methods
For Part 1, seven rerankers spanning three architectural categories (cross-encoder, seq2seq, causal LM) are evaluated against 713 text blocks extracted from the 10-paper Mtb corpus. Each block carries expert-annotated relevance grades (0–3) for four immune evasion mechanisms: (A) phagosome maturation arrest, (B) acidification/lysosome fusion interference, (C) host signaling manipulation, and (D) alternative immune cell niches. Primary endpoints are four per-mechanism nDCG@10 scores; a secondary overall nDCG@10 uses max(A,B,C,D) relevance. Precision@10 and per-model latency (CPU and GPU) are also reported. Key comparisons include domain-specific versus general-purpose models, model size versus domain specialization (110M biomedical vs. 4B LLM), and inference platform requirements (CPU vs. GPU). For Part 2, the four AI-generated reviews are evaluated independently by domain experts using a structured rubric with four scoring tiers (Excellent, Good, Acceptable, Poor) across six criteria. In parallel, text blocks extracted from each AI review are scored by Part 1's rerankers against Mtb infection mechanism queries, and cited claims are semantically matched against the Part 1 PubMed corpus to generate automated proxies for relevance and literature authenticity. A natural control is provided by one AI system (SciSpace) that cited all 10 Part 1 corpus papers, enabling direct validation of reranker source-fidelity detection. Correlation analysis between human and automated scores assesses which evaluation criteria rerankers can reliably approximate.

Significance
This study contributes (1) a domain-specific benchmark for biomedical rerankers beyond standard IR test collections, (2) a reproducible rubric-based framework for evaluating AI-generated scientific literature reviews, and (3) the first investigation of rerankers as scalable automated evaluators of AI-generated biomedical text. If reranker scores correlate with expert judgment, this approach could provide a practical quality filter as AI-assisted scientific writing becomes widespread.

Results
Among seven reranker models evaluated against a 713-passage Mtb immunology corpus with expert-annotated mechanism relevance grades (0–3), the biomedical-specialized MedCPT Cross-Encoder (110M parameters) achieved perfect overall ranking quality (nDCG@10 = 1.000, P@10 = 1.00), substantially outperforming all competitors. Jina Reranker v2 (278M) placed second (nDCG@10 = 0.888, P@10 = 0.90), followed by MonoT5-3B (nDCG@10 = 0.824, P@10 = 0.80) and BGE-reranker-v2-m3 (nDCG@10 = 0.795, P@10 = 0.80). Qwen3-Reranker-4B showed moderate performance (nDCG@10 = 0.611), while ms-marco-MiniLM (33M) and BGE-gemma2 (2B) lagged at 0.489 and 0.180 respectively. Per-mechanism analysis revealed complementary strengths: Jina v2 best identified host signaling manipulation passages (C: nDCG@10 = 0.718), MedCPT excelled at acidification interference (B: 0.839) and alternative niche detection (D: 0.558), while MonoT5-3B showed strong acidification (B: 0.635) and signaling (C: 0.596) detection. All models performed weakest on phagosome maturation arrest (mechanism A), with MedCPT leading at 0.463. Latency on GPU (RTX 3090) ranged from 3.1s (Jina v2) to 116s (BGE-gemma2) for 713 passages; LLM-based models were impractical on CPU alone (estimated 22+ hours), underscoring GPU dependence for production deployment.

In Part 2, expert rubric evaluation of four AI-generated literature reviews ranked Claude 1st (4.8/5 — strongest citation quality with Nature Reviews Microbiology ×2, 28 DOI-verified references, and the greatest molecular precision), Consensus and SciSpace tied 2nd (4.2/5 and 4.0/5 respectively — Consensus excelled in citation authenticity with zero hallucinated references, SciSpace in reference breadth at 30 citations including all 10 corpus papers), and ChatGPT 4th (2.2/5 — accurate at surface level but relying on un-citable institutional labels instead of proper citations). Cross-validation between human rubric scores and MedCPT reranker-assisted evaluation showed strong agreement for relevance ranking, partial agreement for literature authenticity and overall ranking, and confirmed that rerankers cannot assess synthesis depth, citation quality tiers, or factual accuracy.

Conclusions
Domain specialization outweighs model scale for biomedical passage reranking: the 110M-parameter MedCPT achieved perfect nDCG@10, surpassing models up to 36× its size. The general-purpose Jina v2 (278M) emerged as the strongest non-biomedical model (nDCG@10 = 0.888) with exceptional speed (3.1s for 713 passages on GPU). Mechanism-level analysis revealed complementary strengths across model architectures — Jina v2 best detected host signaling manipulation (C), MedCPT dominated acidification interference (B), and MonoT5-3B showed balanced coverage. Model scale alone did not predict performance: the 2B BGE-gemma2 scored lowest (0.180), while the 278M Jina v2 ranked second overall. CPU-only deployment is impractical for LLM-based rerankers (>22 hours vs. 71–74s on GPU). For biomedical IR pipelines, MedCPT offers the optimal accuracy–speed tradeoff; for general-purpose use without domain-specific training, Jina v2 provides the best balance. These findings directly informed Part 2 model selection.

In Part 2, Claude produced the highest-quality AI-generated literature review (4.8/5), combining exceptional citation quality (Nature Reviews Microbiology ×2, Nature Microbiology) with the deepest mechanistic synthesis, while Consensus excelled in citation authenticity (zero hallucinated references) and SciSpace in breadth (30 references citing all 10 corpus papers). Reranker-assisted evaluation proved a reliable first-pass proxy for relevance but could not distinguish citation quality tiers or synthesis sophistication, confirming that expert review remains essential for evaluating AI-generated scientific text.

Part 1 — Reranker Benchmark Results


Primary Endpoints — Per-Mechanism nDCG@10

The research query decomposes into 4 distinct infection mechanisms (A–D), each with its own ground-truth relevance scores across 713 text blocks. Because mechanisms are independent — no block scores 3 on all four, and 247 of 259 max-3 blocks score 0 on at least one mechanism — a single composite score would conflate mechanism-specific retrieval quality. Therefore, the 4 per-mechanism nDCG@10 scores are the primary endpoints. An overall nDCG@10 using max(A,B,C,D) is reported as a secondary “any-mechanism retrieval breadth” metric.

Step 0 — Corpus Granularity
Each of the 10 source papers is split into overlapping text blocks (the unit of retrieval). The reranker scores every block against the query independently.
$$N = 713 \text{ blocks from 10 papers}$$
Block counts per paper range from 36 to 132. The reranker produces a ranked list of all 713 blocks per query. nDCG@10 evaluates only the top 10 positions of this 713-item ranking — did the best blocks rise to the top?
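The report does not specify exactly how blocks were extracted; as a rough illustration, an overlapping sliding word-window splitter of the following shape would produce blocks of this kind (the window and stride values are invented for the sketch, not the study's settings):

```python
def split_into_blocks(text: str, window: int = 200, stride: int = 150) -> list[str]:
    """Split a paper's full text into overlapping word-window blocks (illustrative only)."""
    words = text.split()
    blocks = []
    # stride < window produces overlap between consecutive blocks
    for start in range(0, max(len(words) - window, 0) + 1, stride):
        blocks.append(" ".join(words[start:start + window]))
    return blocks
```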
Step 1 — Relevance Scores (per block, per mechanism)
Each of the 713 blocks is independently scored against each of the 4 infection mechanisms (A: phagosome arrest, B: acidification/lysosome fusion, C: host signaling manipulation, D: alternative immune cell niches). Blocks from the same paper can and do receive different scores.
$$\text{rel}_{i,m} \in \{0, 1, 2, 3\} \quad \text{for each block } i \text{ and mechanism } m \in \{A, B, C, D\}$$
Where 0 = not relevant, 1 = tangential, 2 = relevant, 3 = highly relevant. nDCG@10 is computed separately for each mechanism using that mechanism’s scores (primary endpoints). A secondary overall score uses max_relevance = max(relA, relB, relC, relD), which measures retrieval of blocks relevant to any mechanism. Ground-truth annotations are documented in Appendix A: Source of Truth (exported separately).
Step 2 — Discounted Cumulative Gain (DCG)
After the reranker ranks all 713 blocks, DCG is computed over the top k=10 positions only. This study uses the linear gain formulation of Järvelin & Kekäläinen [1], where relevance scores are used directly rather than exponentiated. Higher-ranked positions contribute more due to logarithmic discounting.
$$\text{DCG}@k = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i + 1)}$$
For each position i from 1 to k (k=10): take the relevance score of the block at that position and divide by $\log_2(i+1)$ to discount by rank position. A rel-3 block at rank 1 contributes $3 / \log_2(2) = 3.0$ points; the same block at rank 10 contributes $3 / \log_2(11) \approx 0.87$ points. This linear variant treats relevance proportionally: a score-3 block is worth exactly 3× a score-1 block at the same position.
Step 3 — Ideal DCG (IDCG)
IDCG is the DCG of a perfect ranking — what you'd get if the 10 most relevant blocks (out of all 713) were placed at the top in optimal order. It represents the theoretical ceiling for the query.
$$\text{IDCG}@k = \sum_{i=1}^{k} \frac{\text{rel}^{*}_i}{\log_2(i + 1)}$$
Where $\text{rel}^{*}_i$ is the relevance of the block that should be at position i in a perfect ranking. Computed by sorting all 713 blocks' ground-truth relevance scores from highest to lowest and applying the same linear DCG formula to the top 10.
Step 4 — Normalized DCG (nDCG)
nDCG normalizes the actual DCG against the ideal, producing a score between 0 and 1 that is comparable across queries regardless of how many relevant blocks exist.
$$\text{nDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k}$$
nDCG = 1.0 means the reranker placed the 10 best blocks at the top in optimal order (out of 713). nDCG ≈ 0 means the top 10 were mostly irrelevant blocks. Because k=10 is a small window into a 713-block pool, this metric strongly penalizes rerankers that let irrelevant blocks leak into the top positions.
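Steps 2–4 can be expressed compactly in code. A minimal sketch of the linear-gain nDCG@k computation described above (an illustration, not the study's actual evaluation script):

```python
import math

def dcg_at_k(relevances, k=10):
    """Linear-gain DCG@k: rel_i / log2(i + 1) over the top-k positions."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k = DCG@k of the model's ranking / DCG@k of the ideal ranking."""
    ideal = sorted(all_relevances, reverse=True)   # best possible ordering of the full block pool
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0

# A rel-3 block at rank 1 contributes 3.0; the same block at rank 10 contributes ~0.87.
```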
Step 5 — Per-Mechanism Reporting
Steps 2–4 are applied independently for each mechanism, yielding four primary scores per model:
$$\text{nDCG}@10_A, \quad \text{nDCG}@10_B, \quad \text{nDCG}@10_C, \quad \text{nDCG}@10_D$$
Each mechanism’s nDCG@10 uses only that mechanism’s block-level relevance scores as ground truth. These four scores are the primary endpoints of this study. They reveal whether a reranker excels at surfacing blocks about phagosome arrest (A) but struggles with alternative niches (D), or vice versa.
Step 6 — Secondary Overall Score
A secondary composite uses max_relevance = max(A,B,C,D) per block, measuring whether the reranker surfaces blocks relevant to any mechanism.
$$\text{nDCG}@10_{\text{overall}} \quad \text{where } \text{rel}_i = \max(\text{rel}_{i,A},\; \text{rel}_{i,B},\; \text{rel}_{i,C},\; \text{rel}_{i,D})$$
This score is reported for completeness but is not the primary endpoint because it conflates mechanism-specific performance. A model scoring well on overall but poorly on mechanism B may be retrieving blocks about signaling (C) instead of acidification (B) — a distinction invisible to the composite. The per-mechanism breakdown (Step 5) provides the definitive evaluation.
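Reusing the ndcg_at_k sketch above, the four primary per-mechanism endpoints and the secondary max-relevance composite could be computed along these lines (the block dictionary shape mirrors the Appendix A fields; this is an illustration, not the study's code):

```python
MECHANISMS = ("A", "B", "C", "D")

def evaluate_ranking(ranked_blocks, all_blocks, k=10):
    """Four primary per-mechanism nDCG@10 scores plus the secondary
    overall score based on max(A, B, C, D) per block."""
    scores = {}
    for m in MECHANISMS:
        ranked_rel = [b["mechanisms"][m] for b in ranked_blocks]   # in model-ranked order
        all_rel = [b["mechanisms"][m] for b in all_blocks]         # full 713-block pool
        scores[f"nDCG@{k}_{m}"] = ndcg_at_k(ranked_rel, all_rel, k)
    ranked_max = [max(b["mechanisms"].values()) for b in ranked_blocks]
    all_max = [max(b["mechanisms"].values()) for b in all_blocks]
    scores[f"nDCG@{k}_overall"] = ndcg_at_k(ranked_max, all_max, k)
    return scores
```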

Worked Examples — nDCG@10 Calculations

Three calculations using actual reranker output and ground-truth annotations: the first two show how the same top-10 ranking produces different nDCG@10 scores across mechanisms, and the third shows how a different model fares on the same mechanism.

Ideal Ranking (IDCG@10)
For both mechanisms A and B, the corpus contains ≥10 blocks scored 3 (highly relevant). A perfect reranker would place 10 such blocks at positions 1–10:
$$\text{IDCG}@10 = \sum_{i=1}^{10} \frac{3}{\log_2(i+1)} = \frac{3}{1.000} + \frac{3}{1.585} + \frac{3}{2.000} + \frac{3}{2.322} + \frac{3}{2.585} + \frac{3}{2.807} + \frac{3}{3.000} + \frac{3}{3.170} + \frac{3}{3.322} + \frac{3}{3.459} = 13.631$$
This is the theoretical ceiling for any mechanism where ≥10 blocks have relevance 3. All three examples below share this IDCG.
Example 1 — MedCPT on Mechanism A (Phagosome Arrest): nDCG@10 = 0.4630
MedCPT ranks all 713 blocks. The top 10 by model score, with their mechanism A ground-truth relevance:
Rank | Block | Model Score | rel_A | Discount log2(i+1) | Contribution
1 | P05B04 | 0.9999920 | 0 | 1.000 | 0.000
2 | P07B15 | 0.9999760 | 3 | 1.585 | 1.893
3 | P03B14 | 0.9999680 | 3 | 2.000 | 1.500
4 | P03B03 | 0.9999510 | 0 | 2.322 | 0.000
5 | P07B24 | 0.9999490 | 0 | 2.585 | 0.000
6 | P01B43 | 0.9999200 | 3 | 2.807 | 1.069
7 | P09B04 | 0.9999070 | 0 | 3.000 | 0.000
8 | P01B02 | 0.9998920 | 3 | 3.170 | 0.946
9 | P01B04 | 0.9998880 | 3 | 3.322 | 0.903
10 | P07B05 | 0.9998830 | 0 | 3.459 | 0.000
$$\text{DCG}@10 = 0 + 1.893 + 1.500 + 0 + 0 + 1.069 + 0 + 0.946 + 0.903 + 0 = 6.311$$ $$\text{nDCG}@10_A = \frac{6.311}{13.631} = 0.4630$$
Interpretation: 5 of the top 10 blocks score 0 on phagosome arrest — these blocks are relevant to other mechanisms (B, C, or D) but not A. MedCPT’s composite query retrieves broadly relevant blocks, but its mechanism A specificity is moderate. The rank-1 block (P05B04) scores 0 on A despite being the model’s highest-confidence result, contributing nothing to this mechanism’s DCG and wasting the most valuable ranking position.
Example 2 — MedCPT on Mechanism B (Acidification / Lysosome Fusion): nDCG@10 = 0.8386
Same model, same top-10 ranking — but now evaluated against mechanism B ground truth:
Rank | Block | Model Score | rel_B | Discount log2(i+1) | Contribution
1 | P05B04 | 0.9999920 | 2 | 1.000 | 2.000
2 | P07B15 | 0.9999760 | 3 | 1.585 | 1.893
3 | P03B14 | 0.9999680 | 3 | 2.000 | 1.500
4 | P03B03 | 0.9999510 | 3 | 2.322 | 1.292
5 | P07B24 | 0.9999490 | 3 | 2.585 | 1.161
6 | P01B43 | 0.9999200 | 3 | 2.807 | 1.069
7 | P09B04 | 0.9999070 | 2 | 3.000 | 0.667
8 | P01B02 | 0.9998920 | 3 | 3.170 | 0.946
9 | P01B04 | 0.9998880 | 3 | 3.322 | 0.903
10 | P07B05 | 0.9998830 | 0 | 3.459 | 0.000
$$\text{DCG}@10 = 2.000 + 1.893 + 1.500 + 1.292 + 1.161 + 1.069 + 0.667 + 0.946 + 0.903 + 0 = 11.430$$ $$\text{nDCG}@10_B = \frac{11.430}{13.631} = 0.8386$$
Interpretation: The exact same ranking scores 0.8386 on mechanism B vs. 0.4630 on mechanism A. Only 1 of the top 10 blocks is irrelevant to B (vs. 5 for A). This demonstrates precisely why per-mechanism reporting matters: the composite score alone (0.9048 overall) would hide that MedCPT is nearly twice as effective at surfacing acidification content as phagosome arrest content.
Example 3 — MiniLM on Mechanism A (Phagosome Arrest): nDCG@10 = 0.1100
A 33M general-purpose cross-encoder (MS-MARCO), showing poor mechanism A retrieval:
Rank | Block | Model Score | rel_A | Discount log2(i+1) | Contribution
1 | P04B01 | 2.302590 | 0 | 1.000 | 0.000
2 | P03B19 | 2.239884 | 0 | 1.585 | 0.000
3 | P03B01 | 2.195090 | 3 | 2.000 | 1.500
4 | P04B03 | 2.137062 | 0 | 2.322 | 0.000
5 | P02B49 | 2.130432 | 0 | 2.585 | 0.000
6 | P02B57 | 2.109454 | 0 | 2.807 | 0.000
7 | P02B52 | 2.078258 | 0 | 3.000 | 0.000
8 | P03B36 | 2.070532 | 0 | 3.170 | 0.000
9 | P07B01 | 1.984957 | 0 | 3.322 | 0.000
10 | P03B03 | 1.962905 | 0 | 3.459 | 0.000
$$\text{DCG}@10 = 0 + 0 + 1.500 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 1.500$$ $$\text{nDCG}@10_A = \frac{1.500}{13.631} = 0.1100$$
Interpretation: MiniLM places only 1 relevant block in the top 10 — and it’s at rank 3, not rank 1, so its contribution (1.500) is already discounted. The remaining 9 positions are all mechanism A–irrelevant. This 33M general-purpose model, trained on MS-MARCO web queries, lacks the domain specificity to distinguish phagosome arrest content from general Mtb text. Compare with MedCPT (0.4630), which placed 5 relevant blocks in the same top 10.
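The three worked examples can be reproduced directly from the relevance columns above. A quick check, assuming (as stated) that at least ten rel-3 blocks exist so the ideal top 10 is all 3s:

```python
import math

def dcg(rels):
    # linear-gain DCG over the listed positions
    return sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))

IDCG_10 = dcg([3] * 10)                      # 13.631: ideal top 10 made entirely of rel-3 blocks

medcpt_A = [0, 3, 3, 0, 0, 3, 0, 3, 3, 0]    # Example 1: MedCPT top 10 vs. mechanism A
medcpt_B = [2, 3, 3, 3, 3, 3, 2, 3, 3, 0]    # Example 2: same ranking vs. mechanism B
minilm_A = [0, 0, 3, 0, 0, 0, 0, 0, 0, 0]    # Example 3: MiniLM top 10 vs. mechanism A

print(f"{dcg(medcpt_A) / IDCG_10:.4f}")      # 0.4630
print(f"{dcg(medcpt_B) / IDCG_10:.4f}")      # 0.8386
print(f"{dcg(minilm_A) / IDCG_10:.4f}")      # 0.1100
```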

Part 2 — AI-Generated Literature Reviews

Four AI systems were given the identical prompt: “Do a scientific literature review on the infection process of Mycobacterium tuberculosis. Include papers from the last 5 years.”

# | AI System | File | Format | References | Key Observations
1 | ChatGPT | AI Test- Chatgpt.pdf | PDF, 8 pp | ~7 named | Generic hyperlinks, no DOIs. Bullet-list heavy, conversational tone.
2 | Consensus | Consensus review.pdf | PDF, 5 pp | ~18 | Full citations with DOIs. Dense molecular detail. Strongest citation rigor.
3 | Claude | mtb_literature_review Claude.docx | DOCX | 28 | Deepest synthesis and strongest citation quality (Nature Reviews Microbiology ×2). Near-publication quality.
4 | SciSpace | Scientific Literature Review.docx | DOCX | 30 | Well-structured, numbered refs. Cites all 10 Part 1 corpus papers directly.

Part 2A — Human Rubric Evaluation

Expert evaluation on 6 criteria, scored as Excellent / Good / Acceptable / Poor.

1. Number of References
30+ = Excellent · 20–29 = Good · 10–19 = Acceptable · <10 = Poor
2. Relevance to Infection Mechanism
Majority address infection biology = Excellent · Many relevant = Good · Half relevant = Acceptable · Mostly unrelated = Poor
3. Literature Authenticity
All verifiable = Excellent · 1–2 incorrect = Good · Several questionable = Acceptable · Many fabricated = Poor
4. Scientific Depth of Synthesis
Integrates & explains coherently = Excellent · Some synthesis = Good · Isolated summaries = Acceptable · Superficial = Poor
5. Citation Quality
High-impact journals = Excellent · Mixed impact = Good · Medium-tier = Acceptable · Low-quality = Poor
6. Accuracy of Biological Mechanisms
Accurate & comprehensive = Excellent · Mostly correct = Good · Several inaccuracies = Acceptable · Major errors = Poor
AI System | # Refs | Relevance | Authenticity | Depth | Citation Quality | Bio Accuracy | Overall
ChatGPT | Poor | Good | Acceptable | Acceptable | Acceptable | Good | 4th (2.2/5)
Claude | Good | Excellent | Good | Excellent | Excellent | Excellent | 1st (4.8/5)
Consensus | Acceptable | Excellent | Excellent | Good | Excellent | Excellent | 2nd (4.2/5)
SciSpace | Excellent | Excellent | Good | Excellent | Good | Excellent | 2nd (4.0/5)

Part 2A — Detailed Evaluation Analysis

1st — Claude (4 Excellent, 2 Good) — 4.8/5
Strengths: Combines breadth with citation precision — 28 fully formatted references with DOIs (all 2021–2025). Strongest citation profile of all four systems: Nature Reviews Microbiology (×2), Nature Microbiology, Nature Communications, NPJ Vaccines. Exceptional molecular detail across all mechanisms, describing PtpA/PtpB V-ATPase dephosphorylation, PI3P disruption via ManLAM, cGAS-STING-IRF3 pathway activation, EspL-mediated autophagy inhibition, DosR/DevR dormancy regulation, WhiB3 redox sensing, and RpfB-mediated reactivation. Sophisticated handling of the type I interferon paradox citing separate 2024–2025 studies. Includes WHO 2024 epidemiological data and advanced topics (MAIT cells, cytosolic translocation, granuloma dynamics). Near-publication-quality document structure.
Weaknesses: Falls 2 references short of Excellent threshold (28 vs. 30). Two references (“Patel & Bhatt 2023” and “Russell 2023”) cannot be fully verified. Very lengthy, potentially impractical for quick reference. Could further integrate single-cell and spatial transcriptomics data. Lacks summary table feature found in Consensus.
Key distinction: Claude achieves the highest overall quality by combining the breadth of SciSpace with the citation precision of Consensus. The minor shortfall in reference count (28 vs. 30) is the only meaningful gap preventing a perfect score.
2nd (tied) — Consensus (3 Excellent, 1 Good, 2 Acceptable) — 4.2/5
Strengths: All 18 references include working DOIs verified as real publications — zero hallucinated citations. Includes high-impact journals: Nature Reviews Microbiology, PNAS, Cellular and Molecular Immunology. Warner et al. 2025 identified as the field’s most current comprehensive review. Meaningful synthesis connecting PknG, ESX effectors, and galectins. Unique summary table integrating infection stages with citations. Accurately identifies strain-specific epithelial responses and disease progression as spectrum.
Weaknesses: Reference count at 18 falls below the Good threshold (20+). Synthesis depth constrained by smaller reference set. Cell death pathways and metabolic dormancy receive only brief treatment. No formal abstract or keywords section. Omits some important recent papers (Zheng 2024, Feng 2024).
Key distinction: In contexts where verifiability is the top priority, Consensus outperforms all others. Quality over quantity — its 18 references are impeccable, but the smaller set constrains coverage depth.
2nd (tied) — SciSpace (4 Excellent, 2 Good) — 4.0/5
Strengths: Highest reference count (30, the only Excellent-tier). Excellent structural organization with full table of contents across 12 sections. Deep mechanistic synthesis correctly describing PknG, V-ATPase, Rab7/LAMP-1, and ManLAM interactions. Includes cutting-edge finding on lysosome-poor monocyte niche (Zheng 2024). Covers cell death pathways with precision including pyroptosis and apoptosis. Critically, SciSpace cited all 10 Part 1 corpus papers directly ([1] Rankine-Wilson, [5] Bo, [6] Shen, [9] Lei, [11] Kilinç, [12] Zheng, [14] Chandra, [17] Kim, [22] Witt, [29] Khadela), providing a natural control for Part 2B reranker-based source fidelity detection.
Weaknesses: ~2 potentially unverifiable references including ref [10] lacking journal name and ref [13] as unconfirmed preprint. Citation quality skews toward medium-impact open-access journals rather than top-tier publications (no Nature Microbiology or Cell Host & Microbe). Later sections contain repetitive content. No DOIs provided, complicating verification. Less emphasis on most recent 2024–2025 advances.
Key distinction: SciSpace’s corpus overlap makes it invaluable for Part 2B validation. Research requiring breadth favors SciSpace; research requiring verifiability favors Consensus.
4th — ChatGPT (0 Excellent, 2 Good, 3 Acceptable, 1 Poor) — 2.2/5
Strengths: Well-organized and readable for non-specialist audiences. Broadly accurate at a general level covering major topics correctly — phagosome arrest, V-ATPase interference, ESX-1 function, and immune responses. Clear and logical section structure. Identifies emerging research directions including spatial transcriptomics and immunometabolism.
Weaknesses: Only ~6 named references without DOIs or full bibliographic details (Poor). In-text labels like “(PubMed)”, “(Nature)”, “(ScienceDirect)” are not real citations. Multiple references identified as vague and likely fabricated. Uses non-attributed statements like “a 2023 study showed...” Lacks grounded synthesis across studies. Not suitable for any scholarly submission — cannot be verified without DOIs.
Key distinction: ChatGPT produced a well-organized textbook-style summary rather than a scholarly literature review. Despite accessible presentation, it is academically unsuitable — it would fail peer review on citation quality alone.

Part 2B — Reranker-Assisted Evaluation

Using Part 1 rerankers as automated judges of AI-generated reviews.

  • Relevance Scoring — Extract text blocks from each AI review, score against Mtb infection mechanism prompts. Compare automated scores vs. human rubric criterion #2.
  • Reference Validation — Semantically match cited claims against the 10 Part 1 PubMed papers. High overlap = proxy for authenticity. Hallucinated claims won't match source papers.
  • Source Fidelity Control — SciSpace cited all 10 Part 1 corpus papers. Rerankers should score SciSpace highest for corpus alignment, validating the method.

MedCPT (Part 1 best, nDCG@10 = 1.000) scored all review text blocks against 4 per-mechanism queries and a composite query. Corpus match = semantic similarity between review content and the 10 Part 1 source papers (higher = more grounded in real literature).
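For reference, the block-versus-query scoring could be reproduced along the following lines. The HuggingFace model id (ncbi/MedCPT-Cross-Encoder), the query wordings, and the mean-over-blocks aggregation are assumptions for illustration; the report does not specify its exact scoring script.

```python
# Sketch of Part 2B relevance scoring: score each AI-review block against
# per-mechanism queries with a cross-encoder. Model id, query wording, and
# the mean aggregation are illustrative assumptions, not the study's setup.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("ncbi/MedCPT-Cross-Encoder", max_length=512)

MECHANISM_QUERIES = {
    "A": "How does M. tuberculosis prevent phagosome maturation?",
    "B": "How does M. tuberculosis interfere with acidification and lysosome fusion?",
    "C": "How does M. tuberculosis manipulate host signaling through secreted factors?",
    "D": "Which alternative lung immune cell niches does M. tuberculosis occupy?",
}

def score_review_blocks(review_blocks):
    """Mean reranker score per mechanism over all text blocks of one AI review."""
    per_mechanism = {}
    for mech, query in MECHANISM_QUERIES.items():
        pairs = [(query, block) for block in review_blocks]
        block_scores = reranker.predict(pairs)        # one relevance score per block
        per_mechanism[mech] = float(sum(block_scores) / len(block_scores))
    return per_mechanism
```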

AI System | Blocks | Composite | A | B | C | D | Corpus
ChatGPT | 4 | 0.6273 | 0.5545 | 0.2500 | 0.4925 | 0.3040 | 0.9787
Consensus | 6 | 0.9276 | 0.4736 | 0.3365 | 0.6807 | 0.1675 | 0.9843
Claude | 11 | 0.7547 | 0.5445 | 0.1102 | 0.5873 | 0.0026 | 0.9455
SciSpace | 12 | 0.9109 | 0.3988 | 0.3417 | 0.9748 | 0.3436 | 0.9912
Key findings:
Composite relevance: Consensus (0.928) > SciSpace (0.911) > Claude (0.755) > ChatGPT (0.627). Consensus and SciSpace produce the most on-topic content as judged by the Part 1–validated reranker.
Mechanism C (signaling): SciSpace dominates (0.975), likely because it directly cites corpus papers on host signaling (Lei/TRAF3, Kilinç/HDT).
Mechanism A (phagosome): Claude leads (0.545), consistent with its deep coverage of ESX-1, ESAT-6/CFP-10, and phagosome maturation arrest.
Corpus match: SciSpace (0.991) > Consensus (0.984) > ChatGPT (0.979) > Claude (0.946). SciSpace’s highest corpus match validates the source fidelity control — it cited all 10 Part 1 papers and the reranker confirms this. Claude’s lower corpus match aligns with its many unverifiable in-text citations that diverge from the Part 1 corpus.
Claude mechanism D anomaly: Claude scores 0.003 on alternative niches despite discussing dendritic cells and granuloma microenvironments. This suggests its coverage uses different terminology or framing than the Part 1 corpus, revealing a limitation of single-query reranker evaluation.
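The corpus-match proxy itself could be approximated with a sentence-embedding model: each review block is matched to its closest Part 1 corpus block by cosine similarity, and the per-block maxima are averaged. The embedding model (all-MiniLM-L6-v2) and the max-then-mean aggregation here are assumptions for illustration only.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative general-purpose embedder

def corpus_match(review_blocks, corpus_blocks):
    """Average over review blocks of the best cosine similarity to any corpus block."""
    review_emb = embedder.encode(review_blocks, convert_to_tensor=True, normalize_embeddings=True)
    corpus_emb = embedder.encode(corpus_blocks, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(review_emb, corpus_emb)       # shape: (n_review, n_corpus)
    return float(sims.max(dim=1).values.mean())       # closest corpus block per review block, averaged
```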

Part 2C — Cross-Analysis (Human vs. Reranker Agreement)

Comparing human rubric rankings (Part 2A) against MedCPT reranker scores (Part 2B) to test whether rerankers can approximate expert judgment.

Criterion | Human Ranking (2A) | Reranker Proxy | Reranker Ranking (2B) | Agreement
Relevance to Infection Mechanism | Claude = Consensus = SciSpace > ChatGPT | Composite score | Consensus > SciSpace > Claude > ChatGPT | Strong
Literature Authenticity | Consensus > Claude = SciSpace > ChatGPT | Corpus match | SciSpace > Consensus > ChatGPT > Claude | Partial
Scientific Depth | Claude = SciSpace > Consensus > ChatGPT | Mechanism coverage | SciSpace > Consensus > Claude > ChatGPT | Partial
Bio Accuracy | Claude = Consensus = SciSpace > ChatGPT | N/A | N/A | Not measurable
Overall Ranking | Claude > Consensus = SciSpace > ChatGPT | Composite + corpus | Consensus > SciSpace > Claude > ChatGPT | Partial
Interpretation:
Relevance: Strong agreement — both human and reranker rankings place ChatGPT last, and the human rubric rated all of the top three (Claude, Consensus, SciSpace) Excellent on relevance. The reranker correctly identifies that Consensus and SciSpace produce the most on-topic content.
Authenticity: Partial agreement — the reranker’s corpus match correctly identifies SciSpace as most grounded in real literature (it cites all 10 corpus papers) and Consensus as strong. However, it ranks ChatGPT above Claude, which contradicts human judgment. This reflects a limitation: corpus match measures overlap with these specific 10 papers, not literature authenticity broadly. ChatGPT’s generic overview text happens to use similar terminology to the corpus papers.
Depth: Partial agreement — the reranker measures mechanism coverage (how many topics are addressed) but cannot assess whether content integrates findings across studies vs. lists them in isolation. Human judges rated both Claude and SciSpace’s synthesis as Excellent; the reranker only sees topical overlap.
Bio Accuracy: Not measurable by rerankers — semantic similarity cannot distinguish correct from incorrect biological claims. This remains exclusively a human expert domain.
Overall ranking: Partial agreement — both human and reranker agree on ChatGPT last, but diverge at the top. Human experts place Claude 1st (4.8/5) based on citation quality and mechanistic depth, while the reranker favors Consensus and SciSpace based on composite relevance scores. This reveals a key limitation: rerankers cannot distinguish between citation quality tiers (Nature Reviews Microbiology vs. open-access journals) or evaluate the sophistication of scientific synthesis.
Scalability thesis: Rerankers are reliable proxies for relevance (strong agreement), partially useful for authenticity and overall ranking (with corpus-specific caveats), and unable to assess depth, accuracy, or citation quality. For scalable triage of AI-generated scientific text, a reranker could serve as a first-pass filter for relevance and corpus grounding, but expert review remains essential for synthesis quality, citation rigor, and factual correctness.

Study Design Summary

Aspect | Part 1 | Part 2
Question | Can rerankers accurately rank real papers? | Can rerankers evaluate AI-generated reviews?
Input Corpus | 10 PubMed papers (pdfs/) | 4 AI reviews (Reviews/)
Evaluation | Reranker scores vs. expert judgment | Human rubric + reranker scores + cross-validation
Domain | Mtb infection biology | Mtb infection biology
Shared Asset | Reranker models (common to Parts 1 and 2)

References

  1. Järvelin, K. & Kekäläinen, J. (2002). “Cumulated gain-based evaluation of IR techniques.” ACM Transactions on Information Systems, 20(4), 422–446. doi:10.1145/582415.582418
  2. Rankine-Wilson et al. (2021). “From infection niche to therapeutic target: the intracellular lifestyle of M. tuberculosis.” Microbiology.
  3. Bo et al. (2023). “Mycobacterium tuberculosis–macrophage interaction: Molecular updates.” Front. Cell. Infect. Microbiol.
  4. Echeverría-Valencia (2023). “Phagocytosis of M. tuberculosis: A Narrative of the Uptaking and Survival.” IntechOpen.
  5. Lei et al. (2021). “Rv3722c promotes M. tuberculosis survival in macrophages by interacting with TRAF3.” Front. Cell. Infect. Microbiol.
  6. Zheng et al. (2024). “M. tuberculosis resides in lysosome-poor monocyte-derived lung cells during chronic infection.” PLOS Pathogens.
  7. Kilinç et al. (2021). “Host-directed therapy to combat mycobacterial infections.” Immunological Reviews.
  8. Chandra et al. (2022). “Immune evasion and provocation by Mycobacterium tuberculosis.” Nat. Rev. Microbiol.
  9. Kim et al. (2022). “Pathological and protective roles of dendritic cells in M. tuberculosis infection.” Front. Cell. Infect. Microbiol.
  10. Witt (2025). “Antigen processing pathways in M. tuberculosis pathogenesis.” IntechOpen.
  11. Khadela et al. (2022). “Epigenetics in tuberculosis: Immunomodulation of host immune response.” Vaccines.

Appendix A — Source of Truth

The complete block-level relevance annotations used to compute nDCG@10. Exported as a companion file (source_of_truth.json).

  • block_id — Unique identifier (e.g., P01B01 = Paper 01, Block 01)
  • paper — Parent paper number (01–10)
  • mechanisms — Per-mechanism relevance scores {A: 0–3, B: 0–3, C: 0–3, D: 0–3}
  • max_relevance — max(A, B, C, D), used for mechanism-agnostic overall ranking
  • text_preview — First 200 characters of block content
713 blocks total · 10 papers · 36–132 blocks per paper
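An illustrative record of this shape (field values are invented for the example; the actual annotations are in the exported file):

```python
# Illustrative source_of_truth.json record; values here are made up, not real annotations.
example_block = {
    "block_id": "P01B01",                              # Paper 01, Block 01
    "paper": "01",
    "mechanisms": {"A": 3, "B": 2, "C": 0, "D": 0},    # per-mechanism relevance grades
    "max_relevance": 3,                                # max(A, B, C, D)
    "text_preview": "Mycobacterium tuberculosis arrests phagosome maturation within ...",
}
```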


350-Word Submission

Condensed study summary for conference / journal submission

Large language models such as ChatGPT, Claude, Consensus, and SciSpace are increasingly used to generate scientific literature reviews, yet no standardized framework exists for evaluating their quality. This study evaluates AI-generated biomedical reviews on Mycobacterium tuberculosis (Mtb) infection, combining structured rubric assessment with automated computational validation.

First, four AI systems were given an identical prompt requesting a literature review on Mtb infection with recent references. Each review was evaluated on six criteria: reference count, relevance to infection mechanisms, literature authenticity (real versus fabricated citations), depth of synthesis, citation quality, and accuracy of biological mechanisms. Claude ranked first (4.8/5) for combining the strongest citation profile — 28 verified references from high-impact journals including Nature Reviews Microbiology — with the deepest mechanistic analysis. Consensus and SciSpace tied for second (4.2 and 4.0/5): Consensus produced zero fabricated citations across all 18 references, while SciSpace achieved the highest reference count at 30. ChatGPT ranked fourth (2.2/5), producing accurate surface-level content but substituting generic website labels for proper citations, making it unsuitable for scholarly use. A key finding across all systems was that citation count alone did not predict quality — Consensus, with the fewest references, outscored SciSpace on authenticity and journal impact.

Second, we tested whether AI-based text-ranking tools — software that scores how well a passage matches a scientific question — could replicate rubric-based evaluation automatically. Seven ranking models of varying sizes and architectures were first validated against 713 scored text passages from 10 peer-reviewed Mtb publications. The best-performing model, MedCPT (trained on biomedical literature), achieved perfect ranking accuracy, outperforming general-purpose models up to 36 times its size. This validated model was then applied to score the four AI reviews automatically. The automated scores agreed with rubric evaluation on which reviews were most topically relevant and could detect which reviews drew from real published literature. However, they could not assess citation quality, depth of scientific analysis, or whether biological claims were factually correct.

These findings suggest that automated ranking tools can provide a useful first-pass quality screen for AI-generated scientific text, but structured evaluation remains essential for assessing citation integrity, analytical depth, and scientific accuracy.