At the outset, my insight was based on supposed truths the internet provided:
I was curious why. After a few evenings of googling I found out a bit more and ended up giving a short talk about it.
[ slides - pdf ] [ slides - beamer/latex src ]
From my limited understanding it seems that gradient descent or its modifications (stochastic GD, momentum-based methods) are immensely popular for neural network training. From my studies I remember that gradient descent was definitely not the only iterative extremum-search algorithm, nor the most powerful one. Why not Newton's method or the conjugate gradient method? I am curious why and will gather interesting information below.
It seems that at least the conjugate gradient method sporadically receives some credit.
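To make the comparison concrete, here is a minimal Python sketch (my own illustration, not from the slides) of the two update rules on a toy quadratic loss; the matrix A, vector b and the step size lr are arbitrary choices of mine. The Newton step needs the Hessian, which for a network with millions of parameters is a huge matrix to form and solve against.

```python
import numpy as np

# Toy quadratic loss f(w) = 0.5 * w^T A w - b^T w (illustrative values)
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])   # positive definite -> unique minimum at A^-1 b
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b          # gradient of the quadratic

def hessian(w):
    return A                  # constant Hessian for a quadratic

w_gd = np.zeros(2)
w_newton = np.zeros(2)
lr = 0.1                      # hand-picked step size for gradient descent

for _ in range(100):
    # gradient descent: only first-order information, many small steps
    w_gd = w_gd - lr * grad(w_gd)

# Newton's method: uses the Hessian; on a quadratic a single step lands
# exactly on the minimum
w_newton = w_newton - np.linalg.solve(hessian(w_newton), grad(w_newton))

print(w_gd, w_newton, np.linalg.solve(A, b))
```

On this toy problem Newton wins outright; for non-quadratic, very high-dimensional network losses the trade-off between cheap first-order steps and expensive second-order steps is exactly what the question above is about.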
Theory about neural networks is usually fine up to the point where overfitting is mentioned. After that, some ugly ad-hoc empirical band-aid is applied (dropout, regularization). As usual, I am curious what a proper (more precise) treatment of generalization would look like.
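For concreteness, here is a minimal sketch (my own, with made-up function names and an arbitrary coefficient lam) of the L2-regularization band-aid: a penalty on weight size is simply added to the training loss.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # plain mean-squared-error training loss
    return np.mean((y_pred - y_true) ** 2)

def regularized_loss(y_pred, y_true, weights, lam=1e-3):
    # L2 ("weight decay") penalty: discourages large weights and is
    # observed empirically to reduce overfitting; lam is hand-tuned.
    penalty = lam * sum(np.sum(w ** 2) for w in weights)
    return mse_loss(y_pred, y_true) + penalty
```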
A neural network is a non-linear function R^n -> R^m,
and supervised learning might then be understood as a search for the closest approximation of the function mapping your input data -> your output data (on the training set).
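A minimal sketch of that view, assuming (my choice, purely illustrative) a single hidden layer with tanh activation and mean squared error as the measure of "closeness" on the training set:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 3, 5, 2                       # input dim, hidden width, output dim
W1, b1 = rng.normal(size=(h, n)), np.zeros(h)
W2, b2 = rng.normal(size=(m, h)), np.zeros(m)

def net(x):
    # the network viewed as a plain function R^n -> R^m
    return W2 @ np.tanh(W1 @ x + b1) + b2

def training_loss(X, Y):
    # supervised learning = make net(x_i) close to y_i on the training set
    return np.mean([np.sum((net(x) - y) ** 2) for x, y in zip(X, Y)])

X = rng.normal(size=(10, n))            # dummy training inputs
Y = rng.normal(size=(10, m))            # dummy training targets
print(training_loss(X, Y))
```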
In other, kind-of-similar approximation problems the approach is different: classical methods such as Taylor series or Fourier series use addition ("+") to aggregate primitives (simple functions), while neural networks use function composition.
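As an illustration of aggregation by addition, a truncated polynomial fit builds its approximation as a weighted sum of the fixed primitives 1, x, x^2, ...; the target function and degree below are arbitrary choices of mine.

```python
import numpy as np

x = np.linspace(-1, 1, 50)
y = np.sin(3 * x)                       # some target function to approximate

coeffs = np.polyfit(x, y, deg=5)        # least-squares fit of a degree-5 polynomial
approx = np.polyval(coeffs, x)          # approximation = weighted *sum* of the
                                        # fixed primitives 1, x, ..., x^5
print(np.max(np.abs(approx - y)))       # worst-case error on the sample points
```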
Would it be possible to use a composition of, e.g., linear functions to obtain any interesting result?
y = w_2 * (w_1 * x_1 + b_1) + b_2 = (w_2 * w_1) * x_1 + (w_2 * b_1 + b_2), i.e. the composition collapses into just another linear (affine) function of x_1.
In other words - is "the magic" of neural networks hidden in their activation functions (sigmoid, hyperbolic tangent, ...) or in their structure (function composition instead of linear combination)?
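A quick numerical check of the example above: composing two affine functions collapses into a single affine function, whereas inserting a non-linearity between them (tanh here, standing in for any activation) does not.

```python
import numpy as np

w1, b1 = 2.0, 1.0                       # arbitrary illustrative weights
w2, b2 = -3.0, 0.5

def composed_linear(x):
    # w2 * (w1 * x + b1) + b2
    return w2 * (w1 * x + b1) + b2

def collapsed(x):
    # the same thing written as a single affine function
    return (w2 * w1) * x + (w2 * b1 + b2)

def with_activation(x):
    # a non-linearity between the layers prevents the collapse
    return w2 * np.tanh(w1 * x + b1) + b2

xs = np.linspace(-2, 2, 5)
print(np.allclose(composed_linear(xs), collapsed(xs)))   # True: composition bought nothing
print(with_activation(xs))                               # genuinely non-linear in x
```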
Neural networks, and machine learning in general, seem like the hot new tech. Is that really so?