Can use unlabeled data to transfer knowledge, but using the same training data seems to work well in practice.
Use softmax with temperature, values from 1-10 seem to work well, depending on the problem.
The MNIST networks learn to recognize digits without ever having seen base, solely based on the "errors" that the teacher network makes. (Bias needs to be adjusted)