CSCI 1390, Spring 2025: Written HW 1

Due Date: Thursday, February 27, 6 PM. You CANNOT use late hours on the written homework.

Read the following paper excerpts and answer the questions below. Submit your answers as a single PDF on Gradescope. TAs won’t be answering questions about the writeup in office hours; you should be able to answer the questions below from a careful reading of the papers alone.

Reading

PipeDream – All Sections.
Megatron-LM – All Sections.
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM – Sections 1, 3, Skim Evaluation.

Questions

The total length of your response should be about 600-700 words, with approximate breakdowns specified below. Please adhere to the breakdowns; you will be penalized if, for example, your summarization is too long and your other answers are too short. Additionally, we expect you to cite specific examples and evidence from the papers when answering the questions.

Summarization (300-400 words total)

What were the key challenges that the PipeDream paper solved to get pipeline parallelism working well in practice (150 words)?
What are the advantages of tensor model parallelism over pipeline model parallelism, and why was it chosen to train transformers (150 words)?
Answer for both the PipeDream and Megatron-LM papers (100 words): What are the key strengths and weaknesses of each paper? You can discuss strengths and weaknesses in how they framed the problem, or in the solution and its applicability.

Comprehension (200 words)

Why does the third paper explore combining parallelism strategies? What is different about this setting such that no single strategy is sufficient on its own? Why is their method more effective than any single parallelism strategy?

Synthesis (100-200 words)

Consider an inference workload rather than a training workload. The weights of most large transformer models will not fit on a single GPU. What parallelism strategy would you deploy, and why?