# Machine learning explorations

author: the_wetware @ kjx.cz
warning: Only theoretical musings about fundamentals follow, nothing of direct practical value.
disclaimer: I am a curious outsider with a background in very loosely related disciplines, just asking stupid questions.

## What is status quo of hardware for neural networks?

At the outset my insight was based on supposed truths the internet provided:

• GPUs are used for neural networks nowadays
• NVidia Titan is the best

I was curious why. After a few evenings of googling I found out a bit more and ended up giving a short talk about it.

[ slides - pdf ] [ slides - beamer/latex src ]

From my limited understanding it seems that gradient descent and its modifications (stochastic GD, momentum-based methods) are immensely popular for neural network training. From my studies I remember that gradient descent was definitely not the only iterative algorithm for finding extrema, nor the most powerful one. Why not Newton's method or the conjugate gradient method? I am curious why and will gather interesting info below.
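To make the comparison concrete, here is a minimal sketch that counts how many iterations plain gradient descent and Newton's method need on a small quadratic. The matrix `A`, the vector `b`, the learning rate and all function names are made up for illustration; nothing here comes from a real training setup.

```python
import math

# Hypothetical test problem: f(x) = 0.5 * x^T A x - b^T x with a fixed
# symmetric positive definite 2x2 matrix A, so the minimum solves A x = b.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def gradient(x):
    """Gradient of f at x, which is A x - b."""
    return [A[0][0] * x[0] + A[0][1] * x[1] - b[0],
            A[1][0] * x[0] + A[1][1] * x[1] - b[1]]


def solve_2x2(m, r):
    """Solve m d = r for a 2x2 matrix m using Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(m[1][1] * r[0] - m[0][1] * r[1]) / det,
            (m[0][0] * r[1] - m[1][0] * r[0]) / det]


def gradient_descent_step(x, lr=0.1):
    """Move against the gradient by a fixed (assumed) learning rate."""
    g = gradient(x)
    return [x[0] - lr * g[0], x[1] - lr * g[1]]


def newton_step(x):
    """Newton step: solve (Hessian) d = gradient; the Hessian of f is A."""
    d = solve_2x2(A, gradient(x))
    return [x[0] - d[0], x[1] - d[1]]


def steps_to_converge(step, x0, tol=1e-8, max_steps=10_000):
    """Apply `step` until the gradient norm drops below `tol`."""
    x = list(x0)
    for n in range(max_steps + 1):
        g = gradient(x)
        if math.hypot(g[0], g[1]) < tol:
            return n, x
        x = step(x)
    return max_steps, x


if __name__ == "__main__":
    print("gradient descent:", steps_to_converge(gradient_descent_step, [0.0, 0.0]))
    print("newton:          ", steps_to_converge(newton_step, [0.0, 0.0]))
```

On a quadratic, Newton's method lands on the minimum in a single step, while gradient descent needs many. The price is the linear solve with the Hessian, which for a model with n parameters is an n x n matrix. That cost is one commonly cited reason why first-order methods are preferred for large networks, though it may not be the whole story.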

## What do we really expect when talking about generalization and overfitting?

Theory about neural networks is usually fine up to the point where overfitting is mentioned. After that, some ugly ad-hoc empirical band-aid is applied (dropout, regularization). As usual, I am curious what a proper (more precise) treatment of generalization would look like.
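One standard starting point for a more precise treatment is to compare the expected error on unseen data with the average error on the training set. The notation below is introduced purely for illustration and is not from the text above: the loss $\ell$, the data distribution $\mathcal{D}$, the training-set size $N$ and the model $f$.

```latex
R(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \ell(f(x), y) \big], \qquad
\hat{R}_N(f) = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i), y_i), \qquad
\mathrm{gap}(f) = R(f) - \hat{R}_N(f)
```

In this language, overfitting is a small training error $\hat{R}_N(f)$ together with a large gap.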

## Where is the neural network magic hidden?

A neural network is a non-linear function `R^n -> R^m`, and supervised learning can then be understood as a search for the closest approximation of the function mapping your input data to your output data (on the training set).
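Written out, "closest approximation on the training set" could look like the following. This is a sketch under assumptions: the squared Euclidean distance is just one possible choice of "closeness", and the symbols $f_\theta$ (network with parameters $\theta$), $N$ (number of training pairs) and $(x_i, y_i)$ are notation introduced here.

```latex
\theta^{*} = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N}
  \big\| f_{\theta}(x_i) - y_i \big\|^{2},
\qquad f_{\theta} : \mathbb{R}^{n} \to \mathbb{R}^{m}
```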

In other kind-of-similar problems the approach is different.

### Finite element method (for solving differential equations)

• You choose discrete regions of your input parameters (and approximate the solution region by region).
• You choose a (usually very limited) subspace of all possible functions on that region (e.g. polynomials up to multi-quadratic).
• The method gives you a linear combination of some basis of your chosen subspace.
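The three points above can be sketched in one dimension with piecewise-linear "hat" functions as the basis. This is only an illustration of the representation: the coefficients are simply taken as the values of a known function at the nodes (interpolation), whereas a real finite element solver would obtain them from a linear system derived from the differential equation. The names `hat` and `approximate` are hypothetical.

```python
def hat(i, nodes, x):
    """Piecewise-linear basis function: 1 at nodes[i], 0 at every other node."""
    center = nodes[i]
    if i > 0 and nodes[i - 1] <= x <= center:
        return (x - nodes[i - 1]) / (center - nodes[i - 1])
    if i < len(nodes) - 1 and center <= x <= nodes[i + 1]:
        return (nodes[i + 1] - x) / (nodes[i + 1] - center)
    return 0.0


def approximate(coefficients, nodes, x):
    """Linear combination of the hat functions, evaluated at x."""
    return sum(c * hat(i, nodes, x) for i, c in enumerate(coefficients))


def target(x):
    """Hypothetical function to approximate, chosen only for the demo."""
    return x * x


# Discrete regions: four elements on [0, 1], i.e. five nodes.
nodes = [k / 4 for k in range(5)]
# Assumption for the demo: coefficients are the nodal values of the target.
coefficients = [target(n) for n in nodes]

if __name__ == "__main__":
    for x in (0.0, 0.125, 0.5, 0.9):
        print(x, target(x), approximate(coefficients, nodes, x))
```

Note that the primitives are aggregated by a weighted sum only; no function is ever fed into another one.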

### Function approximation (Taylor series, Fourier analysis)

• You choose some finite-dimensional subspace of all conceivable functions.
• The method gives you the coefficients of a linear combination of the basis elements of your subspace.
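As a small illustration of these two points, a truncated Taylor series picks the monomials `1, x, x^2, ...` as the basis and returns one coefficient per monomial. The sketch below does this for `exp` around zero, where the coefficients are `1/k!`; the function names are made up for the example.

```python
import math


def taylor_coefficients_exp(degree):
    """Coefficients of 1, x, x^2, ..., x^degree in the Taylor series of exp at 0."""
    return [1.0 / math.factorial(k) for k in range(degree + 1)]


def evaluate(coefficients, x):
    """Evaluate the linear combination sum_k c_k * x^k."""
    return sum(c * x ** k for k, c in enumerate(coefficients))


if __name__ == "__main__":
    coefficients = taylor_coefficients_exp(10)
    print(evaluate(coefficients, 1.0), math.exp(1.0))
```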

Both of these methods use addition ("+") to aggregate primitives (simple functions), while neural networks use function composition.

Would it be possible to use a composition of e.g. linear functions to obtain any interesting result?

``` y = w_2 * (w_1 * x_1 + b_1) + b_2 ```

In other words - is "the magic" of neural networks hidden in their activation functions (sigmoid, hyperbolic tangent, ...) or in their structure (function composition instead of linear combination)?
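A quick experiment hints at an answer for the purely linear case. Expanding the expression gives `y = (w_2 * w_1) * x_1 + (w_2 * b_1 + b_2)`, which is again a single affine function, so stacking linear layers alone adds nothing. The sketch below checks this numerically and shows that inserting a `tanh` between the layers breaks the collapse. The weight values and helper names are arbitrary choices for illustration.

```python
import math


def affine(w, b):
    """Return the scalar affine map x -> w * x + b."""
    return lambda x: w * x + b


def compose(outer, inner):
    """Return the composition x -> outer(inner(x))."""
    return lambda x: outer(inner(x))


def midpoint_gap(f, a, b):
    """Zero for any affine f: the value at the midpoint equals the mean of the ends."""
    return abs(f((a + b) / 2) - (f(a) + f(b)) / 2)


# Arbitrary example weights (assumptions, not from any trained model).
w_1, b_1 = 2.0, -1.0
w_2, b_2 = 0.5, 3.0

two_layers = compose(affine(w_2, b_2), affine(w_1, b_1))
one_layer = affine(w_2 * w_1, w_2 * b_1 + b_2)
with_tanh = compose(affine(w_2, b_2), compose(math.tanh, affine(w_1, b_1)))

if __name__ == "__main__":
    print("two linear layers:", midpoint_gap(two_layers, 0.0, 2.0))
    print("with tanh between:", midpoint_gap(with_tanh, 0.0, 2.0))
```

So composition by itself is not enough, and neither is a non-linearity used only inside a linear combination necessarily the same thing; the interesting behaviour seems to need both together.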

## How old are machine learning ideas actually?

Neural networks and machine learning in general seem like the hot new tech. Is it really so?

• gradient descent - 1847: Méthode générale pour la résolution des systèmes d'équations simultanées; Cauchy, Augustin; 1847
• perceptron - 1958: Rosenblatt, Frank; 1958
• logistic regression - 1958: The Regression Analysis of Binary Sequences; Cox, D. R.; Journal of the Royal Statistical Society, Series B; 1958
• support vector machine - 1963: Pattern recognition using generalized portrait method; Vapnik, V., Lerner, A.; Automation and Remote Control; 1963
• convolutional neural network - 1980: Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position; Fukushima, Kunihiko; Biological Cybernetics 36; 1980
• pruning (regularization, in the same spirit as dropout) - 1989: Optimal Brain Damage; Le Cun, Yann, Denker, J. S., Solla, S. A.; NIPS'89 Proceedings; 1989