Is network design for Neural Networks as trial-and-errory as it seems...

Is network design for Neural Networks as trial-and-errory as it seems? Are there any tips/tricks I should know to do this better?

It feels like I'm just throwing shit at a wall until something works.

Attached: 1512976334066.jpg (800x1133, 254K)

Cute picture of me

No, you can read all the papers you want but all they'll tell you is "we tried out a bunch of shit and here's the one we found had the highest cross validated accuracy."

>as trial-and-errory
The word you're looking for is "tinkering", or "experimental"

this
neural network architecture is just mixing and matching aspects of other architectures and seeing how well it works.
Every once in a while you get a genius like Ian Goodfellow crafting the Columbus Egg and the field explodes all over again.

Aren't there at least techniques to make the trial-and-error part not brute force? The hypest, most promising application of CS wouldn't happen to be trivial, would it?

>not brute force
It isn't brute force and it never was brute force.
Stochastic methods aren't brute force, and even research into [redacted] yeah nah nevermind

What a cute pic

>Aren't there at least techniques to make the trial-and-error part not brute force?
Understanding what your data is like, the kind of relationships you're looking for, and where.

Explain...

What IDE or language do you guys use for neural nets?

Not that guy, but gradient descent is a pretty efficient way of optimizing a function with millions of parameters. Brute force would be more like a million nested for loops, which would basically never terminate.
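To make the contrast concrete, here's a minimal sketch (plain NumPy, a toy one-parameter problem, not anyone's actual training code) of why a gradient step is directed rather than exhaustive -- each update moves downhill instead of enumerating the parameter space:

```python
import numpy as np

# Toy example: fit y = w*x with a single parameter by gradient descent.
# The same update rule scales to millions of parameters; each step moves
# downhill along the gradient, it never enumerates candidate weights.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x  # true weight is 3

w = 0.0
lr = 0.1
for _ in range(100):
    grad = np.mean(2 * (w * x - y) * x)  # d/dw of mean squared error
    w -= lr * grad

# w ends up near 3 after ~100 steps; a brute-force grid search over a
# million-parameter space would need exponentially many evaluations.
```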

Jupyter for feature engineering, pycharm for production code

I've been running a bunch of iterated tests on MNIST to see if I can draw up some equations relating network size, learning rate, and other hyperparameters to eventual performance.
The ideal would be to be able to describe how a network will train given its architecture. I've got a feeling that the form of the equations will probably be about the same across datasets, and then we might be able to assign each dataset some parameters that describe what it will take to model it / what the optimal architecture is.
I've got a batch of 2000 training runs with learning rate / middle layer size / activation function that I need to write visualizations for...
Does anyone here have links to any papers in this area? I'm a bit surprised that I haven't seen any yet.

>Top is information gain (bits) vs network size (784-X-10), bottom is 1/time to converge vs same parameter
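For reference, one plausible way to compute an "information gain (bits)" metric like the one in that caption: the baseline entropy of guessing among 10 balanced classes, log2(10), minus the model's average test-set cross-entropy in bits. This is a guess at the poster's metric, not their actual code:

```python
import numpy as np

# Hypothetical sketch: information gain in bits over a uniform guesser.
# probs: (n, 10) predicted class probabilities; labels: (n,) int classes.
def info_gain_bits(probs, labels):
    # average cross-entropy of the model's predictions, in bits
    ce_bits = -np.mean(np.log2(probs[np.arange(len(labels)), labels]))
    # baseline entropy of 10 balanced classes minus the model's cross-entropy
    return np.log2(10) - ce_bits

# Sanity check: a uniform predictor gains zero bits.
uniform = np.full((5, 10), 0.1)
labels = np.array([0, 1, 2, 3, 4])
```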

Missing image

Attached: 1532866090896.png (823x851, 50K)

My husbando

Right now I'm using Keras and Vim.

>Is network design for Neural Networks as trial-and-errory as it seems
Work on it for a while and you'll develop insights (that mostly can't be put into words) on how to choose hyperparameters and designs

Decided to write the first of those visualizations

>you'll develop insights
Yeah, but insightful artisan hyperparameters don't scale

MNIST dataset, [784-10-10] neurons, 3000 steps, batch size 32, linear activation.
Geometric learning rates:
> 0.000122, 0.000244, 0.000488, 0.000977, 0.001953, 0.003906, 0.007812, 0.015625, 0.03125, 0.0625, 0.125, 0.25, 0.5
(Red is larger)
Test set info gain vs training steps

I ran it with smaller learning rates, but they didn't train. It's only with the linear activation that you get these nice staggered curves -- sigmoid and relu are a lot less noisy, but they go all over the place. Seems like I should explore the area between 0.25 and 1 a bit more, which is honestly higher than I was expecting.

Next test is probably variance over many training runs on the same architecture.

Attached: 1532434184030.png (757x752, 86K)
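Side note: those "geometric learning rates" are just consecutive powers of two, 2^-13 through 2^-1, which you can generate in one line (a sketch, not the poster's actual script):

```python
# The thirteen learning rates from the post are 2^-13 doubling up to 2^-1.
learning_rates = [2.0 ** -k for k in range(13, 0, -1)]
print([round(lr, 6) for lr in learning_rates])
# -> [0.000122, 0.000244, ..., 0.25, 0.5]
# Each run then trains the 784-10-10 linear model for 3000 steps at
# batch size 32 with one of these rates.
```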

>Yeah, but insightful artisan hyperparameters don't scale
You won't develop those insights experimenting with toy dataset like MNIST.

What should I use?

CIFAR10 is a good start after MNIST, and you'll need a semi-decent GPU

To add, people have found that network architecture designs that are good on CIFAR-10 (~100MB dataset) are almost automatically good on ImageNet (~100GB dataset), so the insights definitely do scale

1060 is fine?

It's surprisingly hard to find useful rules of thumb for this stuff, but here's what I found in my limited experience. A lot of it is pretty obvious in hindsight.

>the deeper your network, the more higher-order terms you get. A shallow network will tend to produce piecewise linear or quadratic functions; go deeper for weirder ones
>the size of a layer limits the information that can flow through it, so 128 -> 2 -> 64 is a waste
>going deeper than like 6 layers PROBABLY won't help unless your function is really weird
>make your model bigger. The more weights, the less random your performance will be, because you seem less likely to get caught in shitty local minima
>if you overfit too much, consider just slapping in regularization or dropout instead of shrinking the model

YMMV I'm hardly an expert at this shit.
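To illustrate the dropout tip, here's a minimal sketch of inverted dropout in NumPy (a hypothetical helper, not from any particular library): zero a random fraction of activations at train time and rescale the survivors, so test-time code needs no change.

```python
import numpy as np

# Inverted dropout: randomly zero a fraction `rate` of activations and
# scale the survivors by 1/(1-rate), keeping the expected activation
# unchanged. At test time you just skip this function entirely.
def dropout(activations, rate, rng):
    keep = (rng.random(activations.shape) >= rate).astype(activations.dtype)
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones((4, 8))
h_dropped = dropout(h, rate=0.5, rng=rng)
# Roughly half the units are zeroed; the survivors become 2.0, so the
# expected value of each unit is still 1.0.
```

In a Keras model this is what slapping a `Dropout` layer between your `Dense` layers does for you, without changing the model's capacity at test time.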

If it's the 3GB version: consider getting a better one.
If it's the 6GB version: barely OK.