- The training set has 12,000 samples whereas the test set has 4,000 samples. Intuitively, either the training set should be weighted ×1/3 or the test set should be weighted ×3. How do I implement this in Gluon? Is using only 4,000 samples from the training set considered a valid way of “re-weighting” the data?
- The textbook says we need to get function f. How do I obtain this function once I’m done training the classifier? Is it some attribute of the net instance?
- It says “Use the scores to compute weights on the training set”. What do you mean by the “scores”? My understanding is that once I have f I can compute \exp(f(x_i)), which is multiplied by `loss(net(X), y)`, so why do these “scores” even matter?
- According to the textbook it is better to use \min(\exp(f(x_i)), c). What is c?

# HW5 Question 3

**kyle**#1

**ryantheisen**#2

I’m not totally sure I understand your first question, but for 2, f should simply be the output of your network (i.e. `net(x)`), and I believe the “scores” just refer to the outputs f(x_i). For the last question, c is just some constant, which you use because you don’t want the loss function to become unbounded, which could happen when f(x_i) outputs extremely large values. This can occur as training progresses and the network gets better at separating the classes.
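The scores-to-weights step described above can be sketched in plain NumPy (the array `scores` stands in for the classifier outputs f(x_i), and the value of `c` here is made up for illustration):

```python
import numpy as np

# Hypothetical classifier scores f(x_i) on four samples
scores = np.array([-1.2, 0.3, 2.5, 8.0])

c = 20.0  # clipping constant so no single weight explodes
weights = np.minimum(np.exp(scores), c)

print(weights)  # exp(8.0) ≈ 2981 is clipped down to 20
```

Without the clip, the last sample's weight would be roughly 150 times larger than every other weight combined, and it would dominate the re-weighted loss.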

**jamesli**#3

I think the first question is asking about the hint in part 2, where it says we need to weight the data before training the binary classifier.

**kyle**#4

The first question was about how to weigh the data so that a sample from the test set matters much more than a sample from the training set, since the training set is three times the size of the test set. My gut tells me that when I compute `loss(net(X), y)`, I need to multiply this by 3 if `y` is 1 (i.e. the sample is from the test set). Is this the right approach?
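If the multiply-by-3 idea is on the right track, the per-sample weighting could be sketched like this (framework-agnostic NumPy; the array contents are made up, and in Gluon you could similarly multiply the elementwise loss before averaging — I believe Gluon's built-in loss classes also accept a `sample_weight` argument that does the same thing):

```python
import numpy as np

# Hypothetical per-sample losses and domain labels
# (y = 1 for test-set samples, y = -1 for training-set samples)
per_sample_loss = np.array([0.7, 0.2, 1.1, 0.5])
y = np.array([1, -1, -1, 1])

# Weight test-set samples 3x, since the training set is 3x larger
weights = np.where(y == 1, 3.0, 1.0)
weighted_loss = (per_sample_loss * weights).mean()
print(weighted_loss)
```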

Also, for question 4, how do I compute c?

**ryantheisen**#5

The re-weighting occurs when you train the classifier that distinguishes the training set from the test set. From slide 47 of the lecture, we defined the distribution:

r(x,y) = \frac{1}{2}[p(x)\delta(y,1) + q(x)\delta(y,-1)]= \frac{1}{2}p(x)\delta(y,1) + \frac{1}{2}q(x)\delta(y,-1)

where the \frac{1}{2} comes from the assumption that the training and test sets are the same size. If they aren’t the same size, how would you want to re-weight this data distribution?
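One way to make that hint concrete (a sketch, not necessarily the intended derivation): if there are n_p samples from p and n_q samples from q, lumping all samples together gives the empirical mixture

r(x,y) = \frac{n_p}{n_p+n_q}p(x)\delta(y,1) + \frac{n_q}{n_p+n_q}q(x)\delta(y,-1)

so to recover the balanced \frac{1}{2}/\frac{1}{2} mixture you would weight each sample from p by \frac{n_p+n_q}{2n_p} and each sample from q by \frac{n_p+n_q}{2n_q}. With n_p = 12{,}000 and n_q = 4{,}000 that gives weights \frac{2}{3} and 2 respectively — a 1:3 ratio, consistent with the ×3 intuition in the original question.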

You can choose c somewhat arbitrarily… but think about what happens when you choose c very large or very small.
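A quick way to see the trade-off numerically (the scores here are synthetic, with a larger spread than you would likely see in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(0.0, 3.0, size=1000)  # hypothetical f(x_i) values

for c in (0.1, 10.0, 1e6):
    w = np.minimum(np.exp(scores), c)
    w = w / w.sum()  # normalize to compare how concentrated the weights are
    print(f"c={c:g}: largest normalized weight = {w.max():.4f}")
```

With a tiny c almost every weight is clipped to the same value, so the weighting is nearly uniform and the shift correction is mostly lost; with a huge c a handful of samples can dominate the loss.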