# Machine learning explorations


At the outset my understanding was based on supposed truths the internet provided:

- GPUs are used for neural networks nowadays
- NVidia Titan is the best

I was curious why. After a few evenings of googling I found out a bit more and ended up giving a short talk about it.

[ slides - pdf ] [ slides - beamer/latex src ]

From my limited understanding it seems that gradient descent or its modifications (stochastic GD, momentum-based methods) are immensely popular for neural network training. From my studies I remember that gradient descent was definitely not the only iterative extremum-searching algorithm, nor the most powerful. Why not Newton's method or the conjugate gradient method? I am curious why and will gather interesting info below.

It seems that at least the conjugate gradient method sporadically receives some credit:

- Comparison of Second Order Algorithms for Function Approximation with Neural Networks; E Boutalbi, LA Gougam, F Mekideche-Chafa; Acta Physica Polonica A; 2015
- Deep learning via Hessian-free optimization; Martens, James; University of Toronto; 2010
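To make the contrast concrete, here is a minimal numpy sketch comparing the two methods on a toy quadratic (the matrix `A` and vector `b` are made-up illustration values, not from any real model): gradient descent needs many small steps, while Newton's method, which uses the Hessian, lands on the minimum of a quadratic in a single step.

```python
import numpy as np

# Minimize f(x) = 0.5 x^T A x - b^T x; the exact minimum is x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)

def grad(x):
    return A @ x - b

# Gradient descent: many small steps along the negative gradient.
x = np.zeros(2)
for _ in range(100):
    x = x - 0.1 * grad(x)
gd_error = np.linalg.norm(x - x_star)

# Newton's method: one step using the Hessian (constant here, equal to A).
x = np.zeros(2)
x = x - np.linalg.solve(A, grad(x))
newton_error = np.linalg.norm(x - x_star)

print(gd_error, newton_error)  # Newton hits the exact minimum in one step
```

Of course this is the best case for Newton: for a quadratic the Hessian is constant, whereas for a deep network it is huge and expensive to even apply, which is part of why first-order methods dominate in practice.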

Theory about neural networks is usually fine up to the point where overfitting is mentioned. After that some ugly ad-hoc empirical band-aid is applied (dropout, regularization). As usual, I am curious what a proper (more precise) treatment of generalization would look like.
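For reference, the band-aid itself is tiny. A minimal numpy sketch of inverted dropout (the drop probability and layer size are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5):
    """Inverted dropout: randomly zero units, rescale the survivors.

    Applied only during training; at test time the layer is left untouched.
    """
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones(10000)
h_dropped = dropout(h, p=0.5)
# Roughly half the units are zeroed; survivors are scaled by 1/(1-p) = 2,
# so the expected activation stays unchanged.
print(h_dropped.mean())
```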

A neural network is a non-linear function `R^n -> R^m`,

and supervised learning might then be understood as a search for the closest approximation of the function mapping your input data -> your output data (on the training set).

In other, kind-of-similar approximation problems the approach is different.

- You choose discrete regions of your input parameter space (and approximate the solution region by region).
- You choose a (usually very limited) sub-space of all possible functions on that region (e.g. polynomials up to multi-quadratic).
- The method gives you a linear combination of some basis of your chosen sub-space.

- You choose some finite-dimensional subspace of all conceivable functions.
- The method gives you the coefficients of a linear combination of the basis elements of your subspace.
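As a concrete instance of the second recipe, a minimal numpy sketch: approximate `sin(x)` on `[0, pi]` by a linear combination of the monomial basis {1, x, x², x³} using ordinary least squares (the target function and the degree are arbitrary choices for illustration).

```python
import numpy as np

xs = np.linspace(0.0, np.pi, 50)
ys = np.sin(xs)

# Design matrix: one column per basis function (1, x, x^2, x^3).
basis = np.vander(xs, N=4, increasing=True)

# The "method" here is ordinary least squares: its output is exactly
# the coefficients of the linear combination of basis elements.
coeffs, *_ = np.linalg.lstsq(basis, ys, rcond=None)
approx = basis @ coeffs
max_error = np.max(np.abs(approx - ys))
print(max_error)  # small: a cubic already tracks sin(x) well on [0, pi]
```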

Both of these methods use addition ("+") as the way to aggregate primitives (simple functions), while neural networks use function composition.

Would it be possible to use composition of, e.g., linear functions to obtain any interesting result?

```
y = w_2 * (w_1 * x_1 + b_1) + b_2
```

In other words, is "the magic" of neural networks hidden in their activation functions (sigmoid, hyperbolic tangent, ...) or in their structure (function composition instead of linear combination)?
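For the linear case the answer is known: composing affine maps yields another affine map, so stacking linear layers without a non-linearity adds no expressive power. A quick numpy check (the layer sizes and random weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two "layers" with no activation function; weights are random
# illustration values.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Composition of two affine maps...
y_composed = W2 @ (W1 @ x + b1) + b2

# ...collapses into a single affine map with W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
y_single = W @ x + b

print(np.allclose(y_composed, y_single))  # True: depth buys nothing here
```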

Neural networks, and machine learning in general, seem like the hot new tech. Is it really so?

- gradient descent - 1847
  - Méthode générale pour la résolution des systèmes d'équations simultanées; Cauchy, Augustin; 1847
- perceptron - 1958
  - Rosenblatt, Frank; 1958
- logistic regression - 1958
  - The Regression Analysis of Binary Sequences; Cox, D. R.; Journal of the Royal Statistical Society, Series B; 1958
- support vector machine - 1963
  - Pattern Recognition Using Generalized Portrait Method; Vapnik, V., Lerner, A.; Automation and Remote Control; 1963
- convolutional neural network - 1980
  - Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position; Fukushima, Kunihiko; Biological Cybernetics 36; 1980
- network pruning (regularization) - 1989
  - Optimal Brain Damage; LeCun, Yann, Denker, J. S., Solla, S. A.; NIPS'89 Proceedings; 1989