ContractNLI-based NDA Risk Analyzer using RoBERTa + Chunking – Looking for Feedback

Hi everyone,

I’m working on a student project using RoBERTa / LegalBERT fine-tuned on ContractNLI to analyze NDAs (entailment / contradiction / neutral for standard legal hypotheses like sharing with employees, survival of obligations, retention after termination, etc.).

On the test split, my models get about 70% accuracy, but when I run them on real NDAs, some predictions are clearly wrong or over-confident (especially on clauses about sharing confidential information (CI), survival, and retention). I also use chunking (section-based, max 512 tokens) to handle long contracts.

I’d really appreciate concrete advice on:

  • Choosing between roberta-base / roberta-large / LegalBERT / DeBERTa for legal NLI

  • How to improve real-world accuracy (better chunking, training tricks, calibration, etc.)

  • Any best practices or example repos for long-document NLI in the legal domain

Thanks a lot for any pointers :folded_hands:


I’m not sure if it’s the best, but there seem to be some established practices.

Thank you again for the detailed answer — it was super helpful and gave me a lot of ideas.

To make sure I’m going in the right direction, could you help me turn this into a concrete, realistic pipeline for my project?

Right now I have:

  • RoBERTa-base fine-tuned on ContractNLI (~70% test accuracy),

  • 512-token chunking at inference,

  • But poor / over-confident behavior on my own NDA PDFs (especially survival, retention, CI sharing).

For my project, I also need to build my own model (not just use off-the-shelf models), so I’m looking for:

  1. A clear minimal pipeline you’d recommend (e.g., backbone choice, clause-level segmentation + retrieval, aggregation, calibration) that a student can realistically implement.

  2. Which extra datasets you’d prioritize in practice beyond ContractNLI (e.g., CUAD, DocNLI, ACORD, LegalBench, Atticus contracts…) both for:

    • domain-adaptive pretraining / extra training data for “my own” model, and

    • more realistic evaluation on contracts.

  3. If you had to define a “minimum serious version” of an NDA risk analyzer for a student project, what components would you definitely include, and what would you treat as optional?

Any concrete “do A → B → C” guidance and 1–2 recommended datasets to focus on would help me a lot to structure the final months of the project.

Thanks again for your time and generosity.


Seems on the right track?

Thanks, but I have to present the project tomorrow. What do you suggest I do now?


Well… that’s a bit much… :sweat_smile:

It’s safer to assume that modules you haven’t built yet are still uncertain in terms of actual functionality. You’ll just have to structure the discussion around the parts you’ve already built and the parts you fully understand.


Treat it as:
1 hour for results + examples →
2 hours for slides →
30–40 minutes for practice.

No new modeling, just packaging what you already have.


1. Lock your story (10–15 minutes)

Write this on a page or in a doc:

  1. One-sentence project description
    “I fine-tuned RoBERTa on the ContractNLI dataset to analyze NDAs as an NLI task (entailed / contradicted / neutral) for hypotheses like survival, retention, and CI sharing, and evaluated how well it works on real NDAs.”

  2. What you did

    • Used ContractNLI as the main training+test dataset. (datasets-benchmarks-proceedings.neurips.cc)
    • Fine-tuned RoBERTa-base for 3-way NLI.
    • Handled long NDAs with 512-token chunking.
    • Tested on some real NDAs to see how it behaves.
  3. Main finding

    • On ContractNLI: OK performance (~70%).
    • On your NDAs: noticeable errors and over-confidence for survival / retention / CI sharing.
  4. Main idea for improvement (future work)

    • Move from raw chunking to clause-level retrieval + NLI and add calibration.

If you have this clearly in front of you, making slides is easy.


2. Get one clean baseline result + 3 example NDAs (60 minutes)

2.1 One clean number on ContractNLI (20–30 minutes)

Run your current code once and write down:

  • Accuracy (or macro-F1) on ContractNLI dev/test.

That’s your benchmark number. Don’t overthink it.
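
If it helps, here is a minimal sketch for turning that into one number, assuming you can dump gold and predicted labels from your existing evaluation loop (the lists below are placeholders):

```python
# Minimal sketch: accuracy and macro-F1 from lists of gold and predicted labels
# (e.g. dumped from your existing evaluation loop). The lists below are placeholders.
from sklearn.metrics import accuracy_score, f1_score

gold  = ["entailment", "neutral", "contradiction", "entailment"]   # placeholder
preds = ["entailment", "neutral", "neutral", "entailment"]         # placeholder

print(f"Accuracy: {accuracy_score(gold, preds):.3f}")
print(f"Macro-F1: {f1_score(gold, preds, average='macro'):.3f}")
```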

Guides on ML project presentations all emphasize: you must be able to say clearly
“Here is the metric, on which dataset, and how I got it” – not necessarily achieve SOTA.

2.2 3–4 “story” examples from your NDAs (30–40 minutes)

Pick just a few NDAs and focus on the three key hypotheses:

  • Survival
  • Retention
  • CI sharing

For each hypothesis, find:

  • 1 NDA where the model works.
  • 1 NDA where it fails clearly (wrong label or clearly over-confident).

For each chosen example, write down:

  • Hypothesis text.
  • The relevant NDA clause (copy the main paragraph; highlight the key phrase in bold).
  • Ground truth label (your judgment).
  • Model prediction (E/C/N + confidence/probability).
  • One simple sentence: “Why this is wrong / interesting.”

That’s enough material to show concrete strengths/weaknesses and to “bring the model to life,” which is exactly what good technical presentations do.
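
For the “model prediction (E/C/N + confidence)” column, one quick way is to score each (clause, hypothesis) pair directly with your fine-tuned checkpoint and record the softmax probability. A rough sketch, assuming a standard transformers sequence-classification checkpoint saved at a placeholder path (check `model.config.id2label` for your actual label order):

```python
# Rough sketch: get label + confidence for one (clause, hypothesis) pair.
# "./my-contractnli-roberta" is a placeholder for your fine-tuned model path;
# check model.config.id2label to confirm the label names and order.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "./my-contractnli-roberta"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

clause = "All confidentiality obligations shall survive termination of this Agreement for three (3) years."
hypothesis = "Some obligations of the Agreement may survive termination."

inputs = tokenizer(clause, hypothesis, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

pred_id = int(probs.argmax())
print(model.config.id2label[pred_id], f"(confidence {probs[pred_id]:.2f})")
```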


3. Build a very simple slide deck (about 10–12 slides) (90 minutes)

You don’t need fancy design. Just clear structure and minimal text, as common technical-presentation advice recommends.

Slide skeleton

  1. Title

    • Title, your name, course, date.
  2. Motivation

    • One slide: “Why NDAs?” (short bullet list: frequent, risk, boring to read manually).
  3. Task = NLI

    • One slide: premise = NDA text, hypothesis = a fixed legal statement (e.g., survival, retention, CI sharing), label = entailed / contradicted / neutral.
  4. Data

    • ContractNLI: NDAs from EDGAR, fixed hypotheses, E/C/N + evidence spans.
    • Your NDAs: X documents you annotated for survival / retention / CI sharing.
  5. Model & baseline

    • “RoBERTa-base fine-tuned on ContractNLI.”
    • “Long NDAs handled with 512-token chunking (split contract, run NLI per chunk, take max score).”
  6. Results: numbers

    • Small table:

      Dataset       Metric     Value
      ContractNLI   Accuracy   0.xx
      My NDAs       Accuracy   0.yy
    • One bullet: “Works reasonably on benchmark, struggles more on my NDAs.”

7–9. Example slides (most important part)

For each:

  • Slide title: “Example – Survival (Failure)”.

  • Show:

    • Hypothesis (1 line).
    • Clause (3–6 lines, with key words bold).
    • Ground truth vs model prediction (and confidence).
  • One or two bullets explaining what went wrong (e.g., clause in another section, exception language, over-confidence).

Do the same for retention and CI sharing.
(If time is short, 2 examples are enough.)

  10. Why chunking isn’t enough

    • 2–3 bullets:

      • Splits clauses across chunks.
      • Model sees irrelevant text and misses the one critical clause.
      • No explicit evidence, just a label.
  11. Proposed better pipeline (future work; see the rough code sketch after the slide list)

    • Simple boxes:

      NDA PDF → sections & clauses → retrieve relevant clauses → NLI per clause → aggregate → calibration.

    • One bullet: “Inspired by ContractNLI’s span-based evidence and clause retrieval datasets like CUAD/ACORD.” (datasets-benchmarks-proceedings.neurips.cc)

  12. Conclusion

    • 3 bullets:

      • “NDA review can be framed as NLI using ContractNLI.”
      • “RoBERTa + simple chunking gives okay benchmark performance but fails on real NDAs in specific ways (survival, retention, CI sharing).”
      • “A clause-level retrieval + NLI + calibration pipeline is a more realistic path forward.”

That’s it. You don’t need more slides.
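
Not something to build tonight, but for the “proposed better pipeline” slide, here is a rough code sketch of the retrieve → NLI per clause → aggregate → calibrate idea. It assumes you have already split the NDA into clause strings, that `nli_model`/`tokenizer` are your fine-tuned checkpoint (as in the earlier sketch), that a general-purpose sentence-transformers encoder is good enough for retrieval, and that the temperature would be fitted on a dev set; all names and values are placeholders:

```python
# Rough sketch of: clauses -> retrieve top-k -> NLI per clause -> aggregate -> calibrate.
# Assumes `nli_model` and `tokenizer` are your fine-tuned ContractNLI checkpoint and
# TEMPERATURE has been fitted on a held-out dev set; everything here is a placeholder.
import torch
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder
TEMPERATURE = 1.5  # placeholder; fit with temperature scaling on dev data


def analyze_hypothesis(clauses, hypothesis, nli_model, tokenizer, top_k=5):
    # 1) Retrieve the clauses most similar to the hypothesis.
    clause_emb = retriever.encode(clauses, convert_to_tensor=True)
    hyp_emb = retriever.encode(hypothesis, convert_to_tensor=True)
    top = util.cos_sim(hyp_emb, clause_emb)[0].topk(min(top_k, len(clauses)))

    # 2) Run NLI on each retrieved clause with temperature-scaled probabilities.
    best_probs, best_clause = None, None
    for idx in top.indices.tolist():
        inputs = tokenizer(clauses[idx], hypothesis, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = nli_model(**inputs).logits / TEMPERATURE
        probs = torch.softmax(logits, dim=-1)[0]
        # 3) Aggregate: keep the clause where the model is most confident overall
        #    (one simple choice; you could instead take the max non-neutral probability).
        if best_probs is None or probs.max() > best_probs.max():
            best_probs, best_clause = probs, clauses[idx]

    label = nli_model.config.id2label[int(best_probs.argmax())]
    return label, float(best_probs.max()), best_clause  # prediction, confidence, evidence
```

The “keep the most confident clause” aggregation is just one simple choice; the point is that the model only ever sees a handful of relevant clauses instead of arbitrary 512-token chunks, and the returned clause doubles as the evidence you can show on a slide.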


4. Quick rehearsal and logistics (30–40 minutes)

Follow very basic day-before advice from standard presentation guides: focus on clarity, not perfection.

  1. Run through your slides once out loud

    • Aim to say 1–2 sentences per bullet.
    • If you go way over time, remove text or merge slides (don’t cram more bullets).
  2. Check your environment

    • Slides saved in PDF and PPTX/Keynote.
    • Model/demos: only if you really need them; otherwise rely on screenshots/text you already copied.
  3. Pick 2–3 likely questions and think of 1–2 sentence answers

    • “Why does performance drop on real NDAs?”
    • “Why is 512-token chunking not ideal?”
    • “What is the most important next step?”

Simple, honest answers are fine.


Super-short version (if you’re really tired)

If you want an even shorter checklist:

  1. Get one clean metric on ContractNLI and one on your NDA set.

  2. Choose 3 good examples (correct, wrong, over-confident) for survival/retention/CI sharing.

  3. Make ~10 slides:

    • Task, data, model, numbers, examples, what goes wrong, future pipeline, conclusion.
  4. Practice once end-to-end.

If you do just that, you will have a clear, coherent presentation that shows:

  • you understand the task,
  • you built and evaluated a real model,
  • you identified real failure cases, and
  • you know the next steps to improve it.

There are tools that help you experiment with different parameters/configs for your RAG pipeline (Grid optimization for RAGs). One strategy is to use them to figure out the right chunking setup: you basically run 50 versions of your RAG pipeline (each version with a different config) and evaluate each outcome. Then you see which configuration of your pipeline led to the best results (eval metrics).
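
For example, a toy sketch of that loop, assuming a hypothetical `evaluate(config)` function that runs your chunking + NLI pipeline on a small labelled dev set and returns a metric such as macro-F1 (all values are placeholders):

```python
# Toy sketch of the grid idea: enumerate chunking configs, score each on a
# small labelled dev set, keep the best. evaluate() is a placeholder for a run
# of your own chunking + NLI pipeline that returns a metric such as macro-F1.
import random
from itertools import product

grid = {
    "chunk_size": [256, 384, 512],
    "overlap": [0, 64, 128],
    "top_k_clauses": [3, 5, 10],
}

def evaluate(config):
    # Placeholder: replace with your pipeline run on the dev set.
    return random.random()

results = []
for values in product(*grid.values()):
    config = dict(zip(grid, values))
    results.append((evaluate(config), config))

best_score, best_config = max(results, key=lambda r: r[0])
print(best_score, best_config)
```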
