Skip to content

Quick update on research for NN

This guy has a good explanation on Neural Networks and the direction that they are taking, and this might be critical to the understanding of how to apply skyline within this neural network.

http://www.youtube.com/watch?v=AyzOUbkUf3M

R-tree heuristic for trimming regions for convex skyline, application to image data, and neural networks -> skyline

If we know that the skyline is convex, then there should be one thing that is true most of the time – the skyline should be contained in the regions which are closest to the axes defined by the origin in a min case.  If we use this with the R-tree, we trim down on the regions necessary to use this considerably.

Perhaps an application of skyline query is that we can split a dataset into different areas, and when we calculate the skyline for each of those areas, we should be able to tell something about the general gradient of the data – for instance, when we have image data, it is broken up into (x, y, r, g, b) (or (x, y, c, m, y, k) in case of print data) and we can analyze the channels separately in three 3d graphs which we can partition and analyze to see what the simulated density is in these areas, and interpolate from there.

Finally, another direction that seems only to have been taken from the BPNN approach is getting skyline queries in a Neural Network.  The paper is also not especially accessible unless you have money.  That would require, though, that we learn about Neural Networks.  I need to have a direction today, though, so I can narrow my research.

On using DR-trees for k-dim queries

So, using the indexing method that I was thinking about earlier – assume the database is using a DR-tree to store the data within.  A query to find the minimum value on one axis will query the boxes nearest to the origin.  If that’s the case, won’t returning those regions be the best way to find candidates for minima?  Same for maxima – just pick an origin which is the extents of the top node.  This selects the areas which we will be comparing.  After this, how do we partition the areas for the P&F algorithm?  Quick area trimming is a good thing for i/o – less reads necessary.

Thoughts on using discrete calculus to analyze properties of a skyline curve

Sort of copying some stuff off of the board that I wrote in class today…

If you work with a few points in 2-d, like for instance, (1,10), (3,6), and (9,1), you find that the slope of the points for the outer points is -9/8, and for inner points, it’s -2 and -5/6 respectively (delta y/delta x in all cases). If you plot those on a graph showing the relative slope of the line as a derivative, you get the lines which correspond to those values, but on a discrete basis. What does this tell you? Since we are working with plain old lines, the slope is about the most useful information that we can get out of them. Another derivative does not give us anything about the shape in terms of concavity as if we were to use it on a polynomial problem. Therefore, calculating concavity is another discrete problem, since our derivative is discrete.

Thus, in our example,

a ……………. n ……b
S m dx >  ..SUM …S m ..dx
b ..k …….. i=k+1 ..a …i

and it shows that the (sub)skyline curve is convex.

a ……………. n ……b
S m dx =  ..SUM …S m ..dx
b ..k …….. i=k+1 ..a …i

means that the (sub)skyline curve is a normal line.

a ……………. n ……b
S m dx <  ..SUM …S m ..dx
b ..k …….. i=k+1 ..a …i

means that the (sub)skyline curve is concave.

As the number of points increases, in 2d, we get something that looks like n(n+1)/2 number of slopes.  I hypothesize that for k-dimensions, the number of slopes should be something of a O(n^k) complexity.

More thoughts, further direction

After talking with Professor Makki and Jeff, I think I have a better idea of the problem, and perhaps a direction.

It seems that my intuition and concept of the idea is not exactly what a skyline is all about. It’s possible that a skyline may not be convex, so my technique needs to be generalized a little more. Mainly, that instead of making a line, the points will make a rectangle within which all points which are less than the value of the maximum point will be housed within. This allows the area that is a bar graph to be cut out of the area for making the skyline. The formulation of the nearest neighbor query from that box, however, is one that can still be approached without an issue (ie, NN search then repeat from the new point). Dr. Makki’s P&F algorithm, though, seems like it might be more useful in practice still.

One manner that I might work on research is the interpolation of data on the curve from this skyline query, and what it tells me about the data – if it is concave, it tells me that there’s a hole in the data that may need to be filled, or that things are good – ie the existence of a strong middle class in economics. With a convex curve, it means that for purchasing something, people get a good aggregate of price vs. features. As you analyze it, the analysis is recursive, where the derivative is 0, there’s a change in concavity.

I’m sure this is stuff that has all been done before, but perhaps not as much as I might think. Reading some stuff by Pareto, and so far, that has proved to be a mindtrip on one side, while providing a heuristic that much anti-correlated data follows.

Index method with NN – necessities and Problems with the Skyline Definition

when thinking about the “Index” method, there was one shortcoming in how to trim the dataset in a simple fashion. Once you get the minimum value from one axis of the tree, and the minimum value from the other subsequent axes, how do you correlate the data to figure out which are past the line that is created by the extremum? Also, what kind of methods can you use in >2 dimensions? Do you just do projections of points for that means?

This problem is effectively the same as the convex hull problem, only truncated by the external most data points on the skyline.

When I took a look at the skyline definition, it isn’t intuitively obvious what the formal definition is getting at. Effectively, it states that a point Pi dominates point Pj if Pi is as good or better than Pj on all dimensions, or is better than any point in Set S -> (P1, P2, P3, … Pn) on one dimension. Fuzzy definition is fuzzy. However, if you were to set up a set of data in a grid format, the skyline candidate points would be the points which are closest to the origin on each grid level. It also implies a worst-fit contour map of the bottom of the data. So, how do I simplify this into easy terms? One idea is using the slope idea as presented before and using a function f(x,y,…q) = x*y*…*q, where candidates for skyline will have the max value is less than the slope of the line attained by the outside points This is also, unfortunately, ambiguous.

Further clarification of combination method, and another approach

After thinking a little more, and reading some things about partitioning and having a seminar on R-trees, I think I’ve come up with a way to hone the algorithm a bit more.

1. Find the closest point on each axis respectively to the origin (wherever it happens to be). If there are two or more points on the same axis, choose the closer of the two. Oh bother, what if there’s more than two axes?
2. Connect each of the points together to form a line/plane/k-surface, and eliminate all the points above that surface. One method of doing that is by taking the midpoint of the close points, and multiplying the value of those coordinates all together. This represents a maximum value. Then, each of the points may have their coordinates multiplied together in the form f(x,y) = x*y et.al.
3. Follow the minimum coords of each of the points to find a new origin. Apply the nearest neighbor algorithm to the remaining points from that point. If there are no points, or the nearest neighbor is one of the end points, you have found your skyline.
Else, go to 2 with your new data slopes problem.

Now, to pull back the data, if you don’t have an indexing scheme, you would have to allocate as many points that are in memory for the first run. If there’s some sort of a tree that easily returns heavy things to the middle, that would be helpful.

Another idea, is the idea of parallelising the process, so splitting the work up into which axis, and each of the axes would be on its own thread for the first access. It seems that then the skyline would be compositional, and there might be a bit of overlap in that case.

Other reading atm:

http://en.wikipedia.org/wiki/Response_surface_methodology

http://en.wikipedia.org/wiki/Multiobjective_optimization

Initial Approach & Paper References

A method that first came to mind after looking at the problem is basically as follows:

  1. Find the closest points on any given axis (in the case of R >= 2). These will be our outside bounding points.
  2. Make a line/plane/n-surface that serves as a boundary in which nothing greater than this will be within the skyline bounds.
  3. Apply NN approach or P&F to the remaining points.

The other thing is that this reminds me of one of the approaches to the traveling salesman problem, the shell approach – except that one does not need to find better paths, only to trim the part of the shell which is not dominant. This can be found using the closest points method shown above.

Also, here’s a couple of papers that were given to me, and one that I found.

http://www.cse.unsw.edu.au/~luoyi/paper/Yi/masterthesis.pdf
“An Efficient Algorithm for Skyline Queries” – Jing Yu & S. Kami Makki
“Parallelizing Skyline Queries for Scalable Distribution” – Ping Wu et. al.
“An Optimal and Progressive Algorithm for Skyline Queries” – Dimitris Papadias et. al.

First Post

I’m opening this website for the logging portion of the project at Lamar University called the “REU Program.”  My topic of study is Skyline Queries and efficient computation for a skyline query.  The overview for the topic is effectively as follows:

There are 3 main ways to approach the Skyline Query problem.  The first is Nested Loop, which just compares all of the points to each other.  Then there’s Nearest Neighbor, which goes radially to the first closest point, and calls it the first skyline point.  Finally, there’s a branch and bound method which implements an R-tree for trimming and quick finding.  All good methods, though, seem to implement a number of these points.  The method that the professor at this university came up with is called “Partition and Filter”, which basically does a nearest neighbor approach, but with splitting it into boxes first, where many data points will not even be compared.

The only problem I see with most of these algorithms is that they tend to like to use a lot of memory.  If there’s a simple way of employing some statistical analysis to it using characteristics, maybe we can get a quick result.  Also, the first thing that comes to mind is calculus, but I don’t know how scalable that will be efficiently.

Just musings, there will be more posted here, as well as links to papers that I’ll be using for this.