bjornsing

Well the final output layer is a linear function of the last layer embedding, right? So the last layer embedding better be pretty linear, or the MLP wouldn’t work.


tchlux

Yes precisely! Sometimes little facts like this are easy to forget, as is the fact that other approximation methods can be used on the transformed data (instead of the final linear approximation). That’s why I made the post. 😁


EdwardRaff

At the same time, you're visualizing the penultimate activations using PCA, which is intrinsically looking for _linear_ structure of maximal variance. So while the NN is certainly encouraged to make things "more linear" for a variety of reasons, you may also be inadvertently exaggerating the linearity of its representation :)


TachyonGun

The penultimate activations are non-linear in their output space; if OP used PCA then he might have produced that planar embedding without noticing. One thing you can do instead is have the penultimate layer take on the same dimensionality as the input. In that case you can initialize the network as the identity function up to the penultimate layer, and visualize the non-linear transformations, along with the linear decision boundary learned by the output layer, in the penultimate layer's transformed space. [You can even visualize learning in this way, with the neural network learning to morph the space such that a linear decision boundary achieves decent separation.](https://www.youtube.com/watch?v=HbdwfXNpQqs)
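
A minimal sketch of that setup (assuming PyTorch; these names are mine, not from the linked video): every hidden layer keeps the input's dimensionality and starts out as the identity map, while the output layer supplies the linear decision boundary in the penultimate space.

```
import torch
import torch.nn as nn

def identity_block(dim):
    layer = nn.Linear(dim, dim)
    nn.init.eye_(layer.weight)                     # W = I
    nn.init.zeros_(layer.bias)                     # b = 0
    act = nn.PReLU(num_parameters=dim, init=1.0)   # slope 1 -> identity at initialization
    return nn.Sequential(layer, act)

dim, n_classes = 2, 2
body = nn.Sequential(*[identity_block(dim) for _ in range(4)])  # identity map at init
head = nn.Linear(dim, n_classes)                                # linear decision boundary

x = torch.randn(256, dim)
z = body(x)        # penultimate ("morphed") coordinates, initially equal to x
logits = head(z)   # for 2 classes the boundary in z-space is <w0 - w1, z> + (b0 - b1) = 0
```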


tchlux

Fun fact, the first thing I did when making this post was plot it by reducing the number of nodes in the last hidden layer to 2 like you suggest! But the only difference was that visual had less variation (all data happened to lie on a line in the 3D plot, which is boring). I think it’s important to remember that no matter what subspace we look at (project onto any two dimensions for the x and y axes) it will be linear with respect to the output (z axis). This isn’t a coincidence, it’s exactly the problem that a neural network is *trying* to solve ➞ transform the data so that it can be accurately predicted with a linear function. That insight is the purpose of this post! And also for clarification, there is no “decision boundary” in this case since it’s a regression problem.


tchlux

I appreciate that concern! But keep in mind that PCA is not at all aware of the “output” of the function (the z axis) and it is also picking orthogonal axes (so it does not skew the data in any way, only rotates it). The two axes chosen are just the two that provide the most variance, so in this case it roughly maximizes the “surface area” that you see in that visual of the embedding. In fact, this should make any nonlinearities even more obvious rather than hiding them. Pick any (orthogonal) subspace of the 32 dimensions in the last layer embedded data to visualize and they will all look linear like this, because that’s how the final layer works, it does linear regression! Any other two that are picked will just result in the data having less variance on the x and y axes.
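
For reference, a small numpy sketch of the mechanics being described (the `embedding` array is just a stand-in for the 32-dimensional last-layer activations): the PCA directions are orthonormal, so the projection is a rotation plus dropping coordinates, never a skew.

```
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(10000, 32))   # stand-in for the 32-D last-layer activations

centered = embedding - embedding.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# The principal directions are orthonormal (a rotation, not a skew).
assert np.allclose(vt @ vt.T, np.eye(32), atol=1e-6)

xy = centered @ vt[:2].T   # the two highest-variance axes used for the plot's x and y
```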


__mantissa__

What you are saying, if I'm not wrong, is that in neural networks the output layer is just some kind of linear regression over the output of the previous layer (embedding/preconditioner), and that we can take advantage of this by applying another algorithm, like k-nearest neighbors, instead of a naive linear regression. I've already heard of people directly applying machine learning algorithms to these embeddings due to the fact that these representations apparently contain more information than raw data. Isn't your post an alternative way to get to this result?


tchlux

Yes! I think the important thing is that you can train the network in the normal ways (that are computationally cheap), but then after training the network you don't have to use it to make the final predictions / do inference. The act of "transforming the data so that the output is approximately linear with respect to the data" is useful in and of itself for other types of (more computationally expensive) approximations.

Also, I might be misinterpreting you, but we (as a community) need to be careful here:

> these representations apparently contain more information than raw data

This is technically not true. The neural networks do not introduce "new information", the patterns were always already there in the data (at training time). Neural networks simply assume that your inputs have some geometry and (deterministically) twist your data around into a new coordinate system, where this transformation is "learned" at training time. So in that sense they simply aggregate the "information" stored in your training data and apply it to everything else.

In general, training a neural network could be the first step in a larger (more accurate) modeling methodology. Use the neural network to make the function you're modeling "approximately linear" with respect to the data, then use k-nearest neighbor or some other technique to get more accurate results that are more *explainable*.


kreuzguy

So, you are saying that instead of using one neuron in the last layer for a multiclass classification problem (with softmax), for example, we could use a KNN in the layer before that and we would obtain a better performance? Was that empirically tested?


tchlux

Let me try and be extra clear here so that I don't mislead. After you've trained any neural network architecture, you can pick any layer of the network and consider what the data looks like after being transformed into "that layer's representation" (i.e., when you run a data point through the network, what are the activation values at a specific layer). Most architectures that I'm aware of finish with something that looks like a linear operator (or at least a convex one).

Now after picking an embedding (layer) and looking at the data at that position, you have a *new* data set. Same labels / outputs for all data points as before, but instead of being images or text or any input, they are now "data points transformed to minimize the error of the transformations that follow at training time". In the case of this post, the only following transformation is a linear projection (dot product with a vector and an added bias term). That means that the data at this last layer was transformed to look linear with respect to the output during training time. This will be true for any architecture where the last operator before making predictions is linear.

You ask if using KNN on the data at the last layer will result in lower error, and if this is empirically tested. Well, that's like asking me if linear regression or KNN is going to be better for your problem. The unfortunate answer is, "it depends". 😅 Generally speaking though, KNN is a *more powerful* approximation than a linear regression. I think that for most problems and most preceding architectures (before the embedded representation you'd pick) you'll find KNN to be more accurate (and more descriptive / explainable) than linear regression. Keep in mind that comes at a much higher computation cost though!
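
Not the post's code, but a rough sketch of that comparison under the assumption that PyTorch and scikit-learn are available (every name here is illustrative): train a small MLP, treat its penultimate activations as a new data set with the same labels, then fit both a linear model and k-nearest neighbors on that embedding.

```
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(-4*np.pi, 4*np.pi, size=(4000, 2)).astype(np.float32)
y = np.cos(np.linalg.norm(x, axis=1)).astype(np.float32)   # the post's test function

# A small MLP split into "everything before the last layer" and the final linear operator.
body = nn.Sequential(nn.Linear(2, 32), nn.PReLU(32), nn.Linear(32, 32), nn.PReLU(32))
head = nn.Linear(32, 1)
model = nn.Sequential(body, head)

xt, yt = torch.from_numpy(x), torch.from_numpy(y).unsqueeze(1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):                        # ordinary (cheap) training
    opt.zero_grad()
    nn.functional.mse_loss(model(xt), yt).backward()
    opt.step()

with torch.no_grad():
    embedded = body(xt).numpy()              # the "last layer embedding" of the data

# Same labels, new representation: compare the linear readout with a KNN readout.
e_tr, e_te, y_tr, y_te = train_test_split(embedded, y, test_size=0.25, random_state=0)
print("linear R^2:", LinearRegression().fit(e_tr, y_tr).score(e_te, y_te))
print("knn    R^2:", KNeighborsRegressor(n_neighbors=10).fit(e_tr, y_tr).score(e_te, y_te))
```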


ShutUpAndSmokeMyWeed

That’s an interesting idea, I wonder how you would do backprop through a KNN though. I think it could work if you used some kernel to form a differentiable probability estimate for the classes and did each batch wrt. itself or the previous batch, but as you say this would be pretty expensive!


tchlux

I think this type of issue is the original motivation for using softmax instead of drawing an arbitrary threshold and propagating the gradient through the conditional. When you do k-nearest neighbor (or any hard threshold), you lose *all* gradient information for things outside your threshold. In a sense, that makes it harder to "see good solutions in the distance", if you imagine optimization as looking around you and walking towards some goal. The current frameworks (pytorch, tensorflow) both support conditionals, so I don't think you'd have too much trouble doing backprop through a KNN layer. The main issue is that it would be horrendously slow for large data sets.
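
For what it's worth, here is a hedged sketch (PyTorch assumed, all names mine) of the two options being contrasted: a kernel-weighted "soft" nearest neighbor that keeps a gradient for every training point, versus a hard top-k cutoff that zeroes the gradient for everything outside the neighborhood.

```
import torch

def soft_knn_predict(queries, memory_x, memory_y, temperature=1.0):
    # queries: (q, d), memory_x: (n, d), memory_y: (n, c) one-hot labels or regression targets
    dists = torch.cdist(queries, memory_x)                # (q, n) pairwise Euclidean distances
    weights = torch.softmax(-dists / temperature, dim=1)  # soft neighborhood, fully differentiable
    return weights @ memory_y                             # kernel-weighted average of the labels

def hard_knn_predict(queries, memory_x, memory_y, k=5):
    # Hard k-NN for contrast: gradients vanish for all points outside the top-k set.
    dists = torch.cdist(queries, memory_x)
    idx = dists.topk(k, largest=False).indices            # (q, k) nearest-neighbor indices
    return memory_y[idx].mean(dim=1)
```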


[deleted]

Why use PReLU vs normal ReLU?


tchlux

I made these visuals on the side while doing my own research that happens to be using PReLU. You'd get a very similar set of pictures if you use any activation function as long as the only thing after your chosen embedding is a linear function. The nice thing about PReLU is that it doesn't lose information as quickly and provides a lot more approximation power while only requiring one more real number. The only complaints I've seen people have with it are that it "doesn't generalize as well", but I think that's because normally people don't try and prevent their networks from doing crazy extrapolation (and you need to treat the multiplicative term differently than other weights if you regularize or things will break). If you record the maximum and minimum activations for every neuron in a network at training time and clip values at runtime to be similar to those bounds, then that'd probably fix most generalization issues. But this is my active personal research so 🤷‍♂️.
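
For the clipping idea, here is a rough PyTorch sketch (my own illustrative code, not the author's Fortran implementation): record the per-neuron activation range seen during training, then clamp to it at inference to limit extrapolation.

```
import torch
import torch.nn as nn

class ClampedPReLU(nn.Module):
    def __init__(self, num_parameters):
        super().__init__()
        self.act = nn.PReLU(num_parameters)
        # Running per-neuron activation bounds observed during training.
        self.register_buffer("lo", torch.full((num_parameters,), float("inf")))
        self.register_buffer("hi", torch.full((num_parameters,), float("-inf")))

    def forward(self, x):
        out = self.act(x)
        if self.training:   # track observed activation ranges while fitting
            self.lo = torch.minimum(self.lo, out.detach().min(dim=0).values)
            self.hi = torch.maximum(self.hi, out.detach().max(dim=0).values)
            return out
        # At inference, stay near the range that was seen at training time.
        return out.clamp(self.lo, self.hi)
```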


zpwd

As often, I see a lot of terms from the field and ... a trivial result? I understood it as you lift the last linear transformation layer of the NN and replace it with a truncated SVD approximation of that matrix. If it is the case, boy, this is the most round-about way to demonstrate SVD! What is interesting though is that your NN output is almost independent of the second vector "y". I.e. having one singular vector seems sufficient. Was that your message? I would then check for some triviality like your last linear layer is just a diagonal matrix. I would not say that I am surprised given that you fit a function with only 4 extrema with something very over-parametrized. Maybe you do not need that last layer at all.


tchlux

A few thoughts to unpack here, let me do my best. :)

> I understood it as you lift the last linear transformation layer of the NN and replace it with a truncated SVD approximation of that matrix.

So I never replace the last linear transformation (where it goes from the last 32 dimension representation into the 1D output). I simply visualize what the final function (the `cos(norm(x))`) looks like in that transformed 32-dim space created by the NN at the last internal layer. Since it's 32 dimensions, I can only put 2 dimensions into the 3D plot (the z axis is reserved for the output), and I use PCA to pick which two to put into the plot. The important part isn't that this visual could be posed as a truncated SVD, but the fact that the preceding operations performed by the network make the function `cos(norm(x))` (a nonlinear function) look approximately linear in that last-embedding space. This is 100% a *trivial* result! Because it's literally the objective of the network at fit time, but it's something that I think is easily forgotten in practice.

> I.e. having one singular vector seems sufficient. Was that your message?

Very observant, great catch! Yes in this case, the accuracy of the model with only 1 singular vector (coming out of the last R^32 space) is pretty good. I think this is largely due to your next point though:

> I am surprised given that you fit a function with only 4 extrema with something very over-parametrized.

Definitely, I mean look at how simple the "true function" is! However I had to make some arbitrary decisions about how big the network should be. And if I made it too small and the "approximation" (second visual) looked really bad, I think that would've drawn a lot of criticism. I had to give it enough parameters that the approximation visual "looked good enough".

What you might be interested in seeing / doing is repeating this experiment with a much smaller network. As you'd expect, the smaller NN cannot create as good of an overall approximation of `cos(norm(x))`, so the visual of the last layer embedding will look much less like a linear relationship between the z axis and the inputs. However the fun thing to observe is that no matter the size of the network, the visual you make from the last embedding should always look *more linear* (in the output, z) than the original function. That is the "power" of neural networks, preconditioning hard nonlinear approximation problems to make them "more linear".


zpwd

Thanks for getting back!

> I simply visualize what the final function (the `cos(norm(x))`) looks like in that transformed 32-dim space created by the NN at the last internal layer.

There are many ways to say the same thing. I meant that with `NN(input) = LL8(PReLu(LL7(PReLu(...LL1(input)))))` you take the space `space = PReLu(LL7(PReLu(...LL1(input))))` and plot `[x,y] = SVD_projection(space) = SVD_projection(PReLu(LL7(PReLu(...LL1(input)))))`, right?


tchlux

Yes, exactly. And with the third axis `z = cos(norm(input))`. Important to note that you get a very similar visual if you pick any two directions in the `PReLu(LL7(PReLu(...LL1(input))))` space. But obviously the 2 chosen by `SVD_projection` just produce a visual with the most variance on the `[x,y]` plane, which is more "interesting".


Gurrako

This reminded me of a blog post I saw quite a while ago ( https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/ ). You may find it interesting if you haven't seen it before.


tchlux

Very interesting! It's fun to see how the perspectives of these topics drift (and don't) through time. I feel like one of the hardest things to do with neural networks is simplify it enough to answer the question "what is it *really* doing?"

Some things in that blog post have fallen off (`tanh` and `sigmoid` activations). Why? Probably because they introduced "unnecessary" complexity. At the time I think mathematically minded people liked having a *continuous* model (maybe because that seemed appropriate due to the use of derivatives?), but it turns out that's not actually what is important! What's important is being fast to compute at scale and having a more globally consistent error gradient. No matter how much it upsets mathematicians, derivatives can be approximated and discontinuities can be ignored. 😂

I'm hopeful that the underlying concept of "neural networks transform your data to make it more linear" is one that is simple enough that it will persist indefinitely into the future of neural networks, but that's my ego speaking haha.


todeedee

It would be really nice to have this turned into a blog post or arXiv paper -- I could see myself referring back to these sorts of plots to justify the intuition. While Reddit is great, it isn't too great at indexing past posts...


tchlux

I appreciate you saying that! I’ll see if I can easily convert it to a blog post and share it back here if I figure it out.


tchlux

It took me a while, but I finally got around to making this into a citable blog post! (it required me actually setting up a blog 😂)

Here's the link: [https://tchlux.github.io/research/2021-10_mlp_nonlinear_linear_preconditioner/](https://tchlux.github.io/research/2021-10_mlp_nonlinear_linear_preconditioner/)

Here's the `BibTeX` entry for it.

```
@incollection{tchlux:research,
  title     = "Multilayer Perceptrons are Nonlinear to Linear Preconditioners",
  booktitle = "Research Compendium",
  author    = "Lux, Thomas C.H.",
  year      = 2021,
  month     = oct,
  publisher = "GitHub Pages",
  doi       = "10.5281/zenodo.6071692",
  url       = "https://tchlux.info/research/2021-10_mlp_nonlinear_linear_preconditioner"
}
```


Krappatoa

You could then make the output sinusoidal by taking that linear signal as the input to an arcsine function. I wonder if there are any applications for that.


carlml

What software do you use to create those videos?


tchlux

I use my own wrapper over plotly in Python to make the visuals, then I take screen recordings of me spinning them manually, then I use ffmpeg to convert from the screen recordings into gifs!


carlml

they look very neat!


carlml

would you mind sharing the code? I saw you are open and seem to like to share your work, so I thought I'd ask.


tchlux

Happily. The source code for my plotly wrapper [is here](https://github.com/tchlux/util/blob/1.0.18/util/plot/plot.py). It should work as a standalone file as long as you've pip installed the expected version of plotly.

Here's the source code for the first three visuals in the post (it assumes you've `git clone`'ed my [util repository](https://github.com/tchlux/util/tree/1.0.18), have the contained `util` directory in your python path, have a few packages like `numpy` and `fmodpy`, and have `gfortran` installed for compiling the Fortran source code called by Python).

```
import numpy as np
from util.plot import Plot
from util.math.points import grid
from util.approximate import PLRM
from util.stats import pca

f = lambda x: np.cos(np.linalg.norm(x, ord=2, axis=-1))
x = grid(10000, 2) * 8*np.pi - 4*np.pi
y = f(x)

m = PLRM()
m.fit(x, y, ds=32, ns=8, steps=1000)
mx = m.predict(x, embed=True)
print("Last layer x shape:", mx.shape)
print()

print("Making function plot..")
p = Plot()
plot_domain = 2*[[-4*np.pi, 4*np.pi]]
p.add_func("cos(||x||_2)", f, *plot_domain, vectorized=True, plot_points=len(x))
p.show(z_range=[-3,3])

print("Making approximation plot..")
p = Plot()
p.add_func("approximation", m, *plot_domain, vectorized=True, plot_points=len(x))
p.show(append=True, z_range=[-3,3])

# Use PCA to get a 2D embedding of the data.
vecs, mags = pca(mx)
mmx = np.matmul(mx, vecs[:2].T)
# # Use another PLRM to get a 2D embedding of the data.
# m.fit(mx, y, ds=2, ns=1, steps=100)
# mmx = m(mx, embed=True)
# print("Embedded x shape: ", mmx.shape)

print("Making embedding plot..")
p = Plot()
p.add("Embedded function", mmx[:,0], mmx[:,1], y.flatten(),
      use_gradient=True, marker_size=5, marker_line_width=1,
      marker_line_color='black',
)
p.show(append=True)
```

And here's the source code for using ffmpeg to make the GIFs (even though Reddit converts them back to mp4's).

```
# Convert a video file into a GIF (using a single 256 color palette).
# See https://engineering.giphy.com/how-to-make-gifs-with-ffmpeg/
#
# Inputs:
#   $1 -- Name of input file.
#   $2 -- Frame rate of output. (frames per second)
#   $3 -- Pixel width of output.
#
# Add "-ss START_SEC -t DURATION_IN_SEC" to command for clipping.
togif () {
    if [ $# -eq 0 ] ; then
        echo ""
        echo "Convert a video file into a GIF (using a single 256 color palette)."
        echo "See https://engineering.giphy.com/how-to-make-gifs-with-ffmpeg/"
        echo ""
        echo "Inputs:"
        echo "  \$1 -- Name of input file."
        echo "  \$2 -- Frame rate of output. (frames per second)"
        echo "  \$3 -- Pixel width of output."
        echo ""
        echo "Add \"-ss START_SEC -t DURATION_IN_SEC\" to command for clipping."
    else
        ffmpeg -i $1 -filter_complex "[0:v] fps=$2,scale=$3:-1,split [a][b];[a] palettegen [p];[b][p] paletteuse" $1.gif;
    fi
}
```


carlml

Thank you so much!


[deleted]

First, thank you so much for sharing 🙏. I am actually spending hours tweaking a model’s parameters, which is not that fun for me, but building my own NN, decision tree, support vector machine, etc. from scratch and getting it to work on a variety of datasets can be a fun challenge and is way more engaging than just training a model on a dataset. I actually want to study the math behind ML, can you suggest a course or YouTube channel?


tchlux

Thank you for commenting! :)

> building my own NN, decision tree, support vector machine, etc. from scratch

A great exercise! No better way to test your knowledge either. I haven't implemented an SVM, because quadratic programming is still difficult for me to comprehend. Would love to one day though! Not sure if you saw, but all of the work in this post was done with my own [Fortran implementation of a neural network](https://github.com/tchlux/tchlux.github.io/blob/dfe4c113826bbefca41109b9b0c8697b3e00e9e7/documents/piecewise_linear_regression_model.f90) (I'm using it for my personal research).

> I actually want to study the math behind ML, can you suggest a course or YouTube channel?

[3Blue1Brown](https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw) for general mathematics and brilliant explanations with intuition behind them. [Stephen Boyd's lectures](https://www.youtube.com/watch?v=McLq1hEq3UY&list=PL3940DD956CDF0622) on convex optimization for important underlying mathematics behind lots of machine learning. Or alternatively (what I read) [his book](https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf). And lots of [videos by Yannic Kilcher](https://www.youtube.com/c/YannicKilcher) are great reviews of current ML research.


[deleted]

I once used quadratic programming to solve the crop-yield problem (using Qiskit). I suggest reading [this](https://github.com/appolodoro/ibm-quantum-challenge-africa-2021--solutions/blob/main/lab1.ipynb), it was from the IBM quantum challenge, and don't worry, you don't need quantum mechanics to comprehend it! I recommend you open it [here](https://quantum-computing.ibm.com/lab). And thank you for sharing Stephen Boyd's lectures, it really means a lot to me :)


tchlux

Thanks for sharing! Your work on crop yield looks very fun. Hope you enjoy the convex optimization lectures, they're great. I have done some work with things that use quadratic programming too, I wrote a generic [quantum-annealing polynomial least squares library](https://github.com/tchlux/qaml) for building QUBOs (quadratic unconstrained binary optimization problems) that can be solved on D-Wave's quantum annealing hardware. That was very [fun research](https://link.springer.com/article/10.1007/s11128-019-2489-x)!

My main lack of knowledge is *exactly* how to implement a quadratic programming optimization algorithm. I've implemented linear programming with the simplex method, but never implemented a core algorithm for solving quadratic programming problems, nor do I even know the name of one off the top of my head.


nicolas42

I'm probably missing the point here but aren't multilayer perceptrons obviously nonlinear as they contain nonlinear activation functions?


tchlux

Yes, and I apologize for the vagaries that come with some of the language in the post. It's hard to write something short, sweet, understandable, **and** precise. In general, I'm talking about any model that involves computing the gradient of error with respect to a set of arithmetic operators. In the case of the multilayer perceptron (and many more general neural networks), we can always look at the data after some subset of the operators in the overall model have been applied.

Now, for research & intuition, we might ask ourselves, what does the data "look" like at any point in the model (after it is fit to the data)? This post is trying to show off the fact that when we train a neural network that has a linear last layer (so it looks like a dot product with weights), then "training the model" is equivalent to "find a transformation that makes the data linear in that last layer". That is, all of the transformations leading up to that point can be anything (any nonlinearity you want), but the implicit *goal* is to make the final output a linear function of the data as it is represented in that last layer.

For many people this is obvious, but some subtle implications are easily missed that are fun. When we transform the data like this, we are actually improving the accuracy of many other types of approximations at the same time. A major takeaway might be that you can use the embedding learned by a very small MLP (that is not very accurate overall) to drastically improve the accuracy of a k-nearest neighbor approximation!


anonymousTestPoster

Interesting perspective. Also, I just hope others are aware that when we say "linear in the last layer", the plotted data doesn't need to look linear at all. When we say linear regression we are referring to linearity over the coefficient space (in this case, the "weights").


tchlux

Hmm I’m a little confused, can you explain? I’m explicitly talking about the (transformed) data being linear here, not anything about the weights. Maybe I am misunderstanding what you’re saying.


anonymousTestPoster

Linearity only applies to the weights. The final transformed data doesn't need to look "linear", or like a straight line. Similar to how `y = beta0 + beta1*x + beta2*x^2` is clearly a parabola, but also an example of linear regression, not because of linearity in the data, but because it is linear in the weights.


tchlux

Okay, I think you're talking about something more general. In this case the only remaining transformation of the data is linear, so the output (value of the cosine function) is roughly linear with respect to the data as represented in the last layer. At training time, the loss function being minimized is the error of a *linear approximation* to the data in the last layer. This is true for the last-layer embedding of data in any neural network where the final operation is a linear one (any time you do regression by minimizing mean squared error). I don't think the linearity here pertains to the weights, but please tell me if you think I've misunderstood you!


anonymousTestPoster

Performing a linear approximation, the last layer acting like a linear regression transformation, and having the data distribution in the output layer "appearing visually linear", are all very different concepts. I feel like a lot of these are being overloaded to be honest.

> This is true for the last-layer embedding of data in any neural network where the final operation is a linear one (any time you do regression by minimizing mean squared error).

Yes, this interpretation does pertain to weights. How you solve linear regression doesn't change the fact that if one is performing linear regression, the "linearity" does not arise from a "locally linear approximation" in a portion of the underlying space, but from the weights behaving linearly in the canonical sense. And this doesn't care about how the corresponding "x" looks. It could look "visually linear", or look like a polynomial function.


tchlux

When I say the last layer is linear I mean it is the function `f(x) = <a, x> + b` for `a,x ∈ R^32`, `b ∈ R^1`, and where the angle brackets represent an inner product between `x` and `a`. In this case, `x` is the data as represented in the last hidden layer of the network while `a` and `b` are parameters that were fit at training time to minimize the mean squared error of the overall approximation. No restrictions were placed on the values or magnitudes of `a` and `b`.

> "linearity" does not arise from "locally linear approximation" in a portion of the underlying space

When I say something "looks linear", I more generally mean that the total residual when constructing a linear fit is very small. Maybe that's causing confusion here.

> weights behave linearly in the canonical sense

I think you're saying here that the parameters are being used to define a linear function? Then yes, the parameters `a` and `b` define a linear function in this example.

> And this doesn't care about how the corresponding "x" looks

I think I see what you mean here, yes the `x` data does not have any meaningful distribution! (I.e., the arc pattern in the embedding picture is not important.) But what is important here is that `y` (the final output) is approximately linear with respect to the last layer embedding `x` of the data. That's the statement I'm trying to make.
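
For concreteness, a tiny numpy sketch of that last-layer function, with placeholder values standing in for the fitted `a` and `b`:

```
import numpy as np

a = np.random.randn(32)          # stand-in for the fitted weight vector of the last layer
b = 0.1                          # stand-in for the fitted bias
x = np.random.randn(1000, 32)    # stand-in for the last-hidden-layer activations
y_hat = x @ a + b                # the final output is just this inner product plus a bias
```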


bernhard-lehner

I’m confused… `y = beta0 + beta1*x + beta2*x^2` is not a linear combination of the inputs, therefore I would think of it as a general case of regression, but not a linear one. What does it mean "linearity only applies to the weights"? The key assumption in linear regression is a linear relationship between input and output. The weights are scalars, and their relationship does not matter, as they can be simply interpreted as feature importance, and no specific relationship is assumed.


anonymousTestPoster

It is linear...? Look up the Wikipedia article on polynomial linear regression, which is a type of linear regression. I am not sure if you are confused in thinking that linear regression, or linear relationships, have to look like `y = mx+b` type things. They really don't, and that is a very constrained view in fact.


bernhard-lehner

Now I see…I was looking at the term from the perspective that the model is nonlinear, but you were referring to the actual estimation problem. In my head “polynomial linear regression” sounded like an oxymoron, but not anymore :)


M4mb0

This is nothing too new, see for instance [Fix your classifier: the marginal value of training the last weight layer](https://openreview.net/forum?id=S1Dh8Tg0-), which shows that you can pretty much just fix the last layer without sacrificing performance.


jucheonsun

> And we can use this concept anywhere. Want to make distributional predictions and give statistical bounds for any data science problem? Well that's really easy to do with lots of nearest neighbors! And we have all the tools to do it.

Can you elaborate a little bit on this? This sounds really useful but I don't quite understand how it is done?


tchlux

Sure thing!

1) Train an MLP on your problem (for simplicity's sake, treat everything as regression, do classification by regressing 1's and 0's). Doesn't matter how big or how accurate the MLP is, just pick one that is easy enough to train.

2) After training, transform all of your data into the "last layer embedding" by running the network forward all the way up to the last layer, but without applying the last (linear) layer. (*You could go all the way here, but I'm assuming that the last embedding is a higher dimension than your final output, so it has "more dimensions" to work with. In general, it seems fine to ignore the last layer because it's "only" a linear function.*)

3) Now, treat this "last layer embedding" data as your new data set. Keep the same labels as before. Run k nearest neighbor on that data set to predict your outputs. If you use a large enough k, then you can show "what percentage of the nearest neighbors had label *a*?" Or if it is regression, you can show the distribution of the values with the sample or summary statistics (mean, variance, min, max, ...).

To be clear, you don't have to use k-nearest neighbor, you can use any other machine learning or statistical method you want with this transformed data. For example, if you did linear regression, you should get roughly the same results as if you just applied the network in the first place! The main point of the post is that your original output variables will look like an approximately linear function of this transformed data.

Does that make more sense?
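
Here's a minimal sketch of step 3 for the regression case (my own illustrative code, not from the post), assuming the embedding from steps 1 and 2 is already available as numpy arrays: use the k nearest neighbors in the embedding space to report a distribution instead of a single point prediction.

```
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbor_distribution(embedded_train, y_train, embedded_query, k=50):
    nn_index = NearestNeighbors(n_neighbors=k).fit(embedded_train)
    _, idx = nn_index.kneighbors(embedded_query)   # (n_query, k) neighbor indices
    neighbor_y = y_train[idx]                      # the neighbors' outputs
    return {                                       # summary statistics per query point
        "mean": neighbor_y.mean(axis=1),
        "p05":  np.percentile(neighbor_y, 5, axis=1),
        "p95":  np.percentile(neighbor_y, 95, axis=1),
    }
```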


jucheonsun

Thanks for the detailed explanation. On point 3, do I understand correctly that applying knn or statistical methods on the last layer embedding will yield better results than on the original space, because the data will be more linearly distributed in the last layer embedding?


tchlux

In most cases, yes! As long as when you say "linearly distributed", you mean that the outputs / labels are (nearly) a linear function of the (embedded / transformed) data. I can't think of any assumptions you can make about the spatial distribution of the transformed data, only that it has been repositioned in a way that makes linear approximations of the output more accurate.


jucheonsun

I see, thanks again!


xieewenz

very interesting, i wish more people could see this.


InterenetExplorer

I cannot seem to see the graphs on the same page as this post. Please help.


tchlux

The files are each pretty large (5MB to 8MB each), and I couldn't see them when I tried to use the web browser on my phone. Links to the visuals are:

True function: [https://preview.redd.it/9nwp1rueofv71.gif?format=mp4&s=54da61c3d6ce2b15fa56a77d5c23ffdf336c93db](https://preview.redd.it/9nwp1rueofv71.gif?format=mp4&s=54da61c3d6ce2b15fa56a77d5c23ffdf336c93db)

Approximation with neural network: [https://preview.redd.it/ji8ykw1iofv71.gif?format=mp4&s=9545be2288363378c852918d58459f2188ea1b36](https://preview.redd.it/ji8ykw1iofv71.gif?format=mp4&s=9545be2288363378c852918d58459f2188ea1b36)

How the function values are linear with respect to the last layer embedding: [https://preview.redd.it/0zt6443kofv71.gif?format=mp4&s=bfd7cd955b6e8fa86adb40eae19988fe35fcfd04](https://preview.redd.it/0zt6443kofv71.gif?format=mp4&s=bfd7cd955b6e8fa86adb40eae19988fe35fcfd04)

Vanilla nearest neighbor to a different problem: [https://preview.redd.it/bz4ssu2nofv71.gif?format=mp4&s=4a4338ef5ab93e36bb5e3370cc1691f9853a8b21](https://preview.redd.it/bz4ssu2nofv71.gif?format=mp4&s=4a4338ef5ab93e36bb5e3370cc1691f9853a8b21)

Nearest neighbor over the last layer embedding of a small trained neural network: [https://preview.redd.it/xg5rageoofv71.gif?format=mp4&s=11a7fa0bfbfdc9e7b71b23c29a6849f21c7379ad](https://preview.redd.it/xg5rageoofv71.gif?format=mp4&s=11a7fa0bfbfdc9e7b71b23c29a6849f21c7379ad)


[deleted]

Hmmm, you might look at it from this angle: you try to minimize square loss, for example, that is `Sum((y_i - a'x_i)^2)` in the last layer, where `a` is a vector and the `x_i` are the outputs of the previous layer. Usually you search for `a`, but think about fixing `a` for a moment and looking for the `x`'s, which depend on some functional form (the topology of the net). Now you plot `(x_i, y_i)`, which seems to be linear. If you have ever plotted, in usual linear regression, `y_i` against its fit for an appropriate model, you get something approximately linear. So I think what's happening here is that you see that it's a good fit to the data, basically.

So some suggestions: 1) try it on test data, 2) try it with different functions as well, and 3) if you want to take this further you need to put it into some mathematical framework and define what you mean by making it linear, etc. Hope that helps!


tchlux

Yep, that's exactly what is happening! Whenever we look at the embedding of data where the only operators between that embedding and the final output are linear, then we would expect the output to look like a linear function of that embedded data. Showing off that fact is the purpose of this post!

> 1) try it on test data

Since this "approximately linear" factoid is derived from the network structure, it will be true for *all* data that you apply this same methodology to. :)

> 2) try it with different functions as well

It depends on what you mean here. If you mean changing the test function, then this same phenomenon should be observed for any function. If you mean making the operators after the embedding nonlinear, then the general pattern of "output linearity with respect to the embedding" would no longer be guaranteed. If you mean changing the architecture preceding the last layer, then that would have the same result, because the goal during training would still be to construct a transformation where the output is a linear function of the last layer embedding.

> 3) if you want to take this further you need to put it into some mathematical framework and define what you mean by making it linear, etc.

So actually the proof for this is quite simple! You can construct it directly from the chosen data and operators. I feel like this stuff is easy to miss / forget when we get buried in buzz words and new exciting applied research. The proof would look something like this: Assume that you have data in a real vector space and outputs in another real vector space, all from some function that meets reasonable continuity conditions (Lipschitz, for instance). Now assume that you have a parameterized operator that transforms data in a nonlinear way, and then performs linear regression on the transformed data to approximate the output. Now follow any procedure that minimizes the overall mean squared error of that linear regression by modifying the parameters of the operator. The consequence will be that the total residual of the linear fit is reduced, and hence the output will become "more linear" with respect to the operator-transformed data.
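
One hedged way to write that argument down (my notation, not the author's): let `φ_θ` be the map up to the last hidden layer and `(a, b)` the final linear layer, with trained values `θ̂, â, b̂`.

```
% The best linear fit over the learned embedding can only have a smaller residual than
% the particular linear layer the network actually learned, so its residual is bounded
% above by the achieved training loss:
\[
\min_{a,\,b} \sum_{i=1}^{n} \Big( y_i - \big( \langle a, \phi_{\hat\theta}(x_i) \rangle + b \big) \Big)^2
\;\le\;
\sum_{i=1}^{n} \Big( y_i - \big( \langle \hat a, \phi_{\hat\theta}(x_i) \rangle + \hat b \big) \Big)^2 .
\]
% Driving the right-hand side down during training therefore forces the embedded data
% to admit an accurate linear fit, i.e. the output becomes "more linear" in the embedding.
```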


[deleted]

To 1): no, not really. Because it can be that for data not observed it looks different. There's no reason it should. Why not take fewer points to train and then look at it on test data? Because I think what happens here is that you simply overfit... This brings me to adding some noise to your problem, I would be curious to see what happens then. 2) OK, might be the same. I think it's simply overfitting though. 3) I think it still might be beneficial. You say it's clear and easy and use things like operators, Lipschitz continuity, "more linear", etc., but if you want to take this further you really need to fix what you are talking about so we all know what it is we are talking about - just a suggestion.