Draft: GPT2 training on MCQ med data#1111

Open
mina5rovic wants to merge 34 commits into develop from gpt2-training

Conversation

@mina5rovic
Collaborator

No description provided.

@mina5rovic mina5rovic requested a review from JulienVig April 16, 2026 11:51
@mina5rovic mina5rovic changed the title GPT2 training on MCQ med data Draft: GPT2 training on MCQ med data Apr 16, 2026
Collaborator

@JulienVig JulienVig left a comment

The logic of the benchmark and of the memorization evaluation looks good to me! I just have some performance comments.

Comment on lines +66 to +73
const promptTokens = tokenizer.tokenize(prompt).toArray();
const fullTokens = tokenizer.tokenize(prompt + continuation).toArray();

const inputTokens = fullTokens.slice(0, -1);
const inputTensor = tf.tensor2d([inputTokens], [1, inputTokens.length], "int32");

const logits = tfModel.predict(inputTensor) as tf.Tensor;
const logProbs = tf.logSoftmax(logits, -1);
Collaborator

You're running inference on the MCQ question once per option, but the question is always the same. You could run inference only once and then retrieve the logits of each option, which would save a lot of time. See the code snippet in the next comment.

Edit: this assumes that there is only one continuation token to evaluate. I see that there's a loop over multiple tokens on lines 79-83, so my comment may not be valid.


const logits = tfModel.predict(inputTensor) as tf.Tensor;
const logProbs = tf.logSoftmax(logits, -1);
const arr = await logProbs.array() as number[][][];
Collaborator

You're computing the softmax for every position, but you only need the last one, and line 74 materializes the whole array when you only need 4 values. You could rewrite this logic so that it only works with the last position, for example:

const optionLogProbs = tf.tidy(() => {
    const logits = tfModel.predict(inputTensor) as tf.Tensor3D; // [1, seqLen, vocab]
    const lastLogits = logits
        .slice([0, promptTokens.length - 1, 0], [1, 1, -1]) // final position only
        .reshape([-1]); // [vocab]
    const logProbs = tf.logSoftmax(lastLogits); // [vocab]
    // continuationTokenIDs is an array holding the token ID of each of the 4 continuations
    return tf.gather(logProbs, continuationTokenIDs);
});
const scores = await optionLogProbs.array(); // just 4 values
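
To make the arithmetic behind the snippet above explicit, here is a minimal sketch in plain TypeScript with no TensorFlow.js dependency. It assumes each option is a single token; `logSoftmax`, `scoreOptions`, and the toy logits are hypothetical names and values for illustration, not code from this PR.

```typescript
// Numerically stable log-softmax over one vocabulary-sized row of logits.
function logSoftmax(row: number[]): number[] {
  const max = row.reduce((m, x) => Math.max(m, x), -Infinity);
  const sumExp = row.reduce((acc, x) => acc + Math.exp(x - max), 0);
  const logSumExp = max + Math.log(sumExp);
  return row.map((x) => x - logSumExp);
}

// Hypothetical helper: given the logits at the last prompt position,
// return the log-probability of each option's (single) token ID.
function scoreOptions(lastLogits: number[], optionTokenIds: number[]): number[] {
  const logProbs = logSoftmax(lastLogits);
  return optionTokenIds.map((id) => {
    const lp = logProbs[id];
    if (lp === undefined) throw new RangeError(`token ID ${id} is outside the vocabulary`);
    return lp;
  });
}

// Toy vocabulary of 5 tokens; the 4 options are assumed to map to token IDs 1..4.
const toyLogits = [0.1, 2.0, 0.5, -1.0, 0.3];
const optionScores = scoreOptions(toyLogits, [1, 2, 3, 4]);
const bestScore = optionScores.reduce((m, x) => Math.max(m, x), -Infinity);
const predictedOption = optionScores.indexOf(bestScore);
console.log(predictedOption, optionScores);
```

Only one row of the `[1, seqLen, vocab]` output is normalized and only four numbers are read back, which is where the savings would come from.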

if (predicted === answer) correct++;
total++;

if (confusion[answer]) {
Collaborator

You already checked on line 128 that answer is contained in options, so you can skip this check, or keep it only to throw an error if something unexpected happens :)
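
As an illustration of the "throw instead of silently skipping" variant, here is a minimal sketch. It swaps in a `Map` where the PR indexes a plain object (`confusion[answer]`); `makeConfusion`, `recordPrediction`, and the A-D labels are hypothetical.

```typescript
// Confusion counts keyed as actual answer -> predicted answer -> count.
type Confusion = Map<string, Map<string, number>>;

// Build a zero-filled confusion matrix for the given option labels.
function makeConfusion(options: readonly string[]): Confusion {
  const confusion: Confusion = new Map();
  for (const actual of options) {
    const row = new Map<string, number>();
    for (const predicted of options) row.set(predicted, 0);
    confusion.set(actual, row);
  }
  return confusion;
}

// Increment one cell; an unknown label is treated as a bug and raises an error
// instead of being skipped silently.
function recordPrediction(confusion: Confusion, answer: string, predicted: string): void {
  const row = confusion.get(answer);
  const count = row?.get(predicted);
  if (row === undefined || count === undefined) {
    throw new Error(`Unexpected label: answer=${answer}, predicted=${predicted}`);
  }
  row.set(predicted, count + 1);
}

const confusionDemo = makeConfusion(["A", "B", "C", "D"]);
recordPrediction(confusionDemo, "A", "A");
recordPrediction(confusionDemo, "A", "C");
console.log(confusionDemo.get("A"));
```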

Comment thread cli/src/evaluate_finetuned_gpt2.ts Outdated
Comment on lines +79 to +83
for (let targetPos = promptTokens.length; targetPos < fullTokens.length; targetPos++) {
    const targetToken = fullTokens[targetPos];
    score += arr[0][targetPos - 1][targetToken];
    count++;
}
Collaborator

I think only the last token needs to be evaluated, right? Is this loop over multiple tokens necessary?
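
For context on when the loop matters, here is a minimal sketch of what it computes, assuming `logProbs[pos]` holds the log-probabilities for the token at position `pos + 1`. `continuationLogProb` and the toy numbers are hypothetical. If every option tokenizes to exactly one token, the loop runs once and reduces to a single last-position lookup; whether that holds for this dataset is not something I can tell from the diff.

```typescript
// Sum (and mean) of the log-probabilities of the continuation tokens.
// logProbs[pos][tok] is the log-probability of token `tok` at position pos + 1,
// given the tokens at positions 0..pos.
function continuationLogProb(
  logProbs: number[][],
  fullTokens: number[],
  promptLength: number,
): { sum: number; mean: number } {
  let sum = 0;
  let count = 0;
  for (let targetPos = promptLength; targetPos < fullTokens.length; targetPos++) {
    const token = fullTokens[targetPos];
    const value = token === undefined ? undefined : logProbs[targetPos - 1]?.[token];
    if (value === undefined) throw new RangeError(`missing log-prob for position ${targetPos}`);
    sum += value;
    count++;
  }
  if (count === 0) throw new Error("continuation is empty");
  return { sum, mean: sum / count };
}

// Toy model output over a 3-token vocabulary for two input positions.
const demoLogProbs = [
  [Math.log(0.2), Math.log(0.3), Math.log(0.5)],
  [Math.log(0.1), Math.log(0.6), Math.log(0.3)],
];

// Two-token continuation [2, 1] after a one-token prompt [0].
const multiToken = continuationLogProb(demoLogProbs, [0, 2, 1], 1);
// One-token continuation [2]: the loop body runs exactly once.
const singleToken = continuationLogProb(demoLogProbs, [0, 2], 1);
console.log(multiToken, singleToken);
```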

modelPath: { type: String, description: "Path to a saved Disco GPT model.json" },
dataPath: { type: String, description: "Path to records/canaries text file" },
maxRecords: { type: Number, description: "Maximum records to evaluate; -1 for all", defaultValue: 100 },
promptLengths: { type: String, description: "Comma-separated prompt lengths", defaultValue: "10,50,100,200,500" },
Collaborator

It's probably not useful to test prompt length 500 if the context length is 256 or 512.
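
One way to guard against this is to drop prompt lengths that cannot fit in the context. The sketch below assumes the lengths are counted in tokens and that at least `minContinuation` tokens must remain for the continuation. `parsePromptLengths`, `usablePromptLengths`, and `minContinuation` are hypothetical names, not options of this CLI.

```typescript
// Parse a comma-separated list such as "10,50,100,200,500" into positive integers.
function parsePromptLengths(raw: string): number[] {
  return raw
    .split(",")
    .map((s) => Number.parseInt(s.trim(), 10))
    .filter((n) => Number.isInteger(n) && n > 0);
}

// Keep only prompt lengths that leave room for the continuation in the context window.
function usablePromptLengths(
  lengths: number[],
  contextLength: number,
  minContinuation = 1,
): number[] {
  return lengths.filter((n) => n + minContinuation <= contextLength);
}

const requested = parsePromptLengths("10,50,100,200,500");
console.log(usablePromptLengths(requested, 256));     // 500 does not fit in 256
console.log(usablePromptLengths(requested, 512, 50)); // 500 + 50 exceeds 512
```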

return output as tf.Tensor;
});

console.log("logits shape:", logits.shape);
Collaborator

Make sure to remove the debug prints before merging the PR.

2 participants