Assignment-3_Paper-1_Stavros


Cache-conscious Frequent Pattern Mining on a Modern Processor

The majority of reviewers commented on the section where the authors discuss exploiting SMT. I agree that this section could give more details about the idea. I will comment on the weak points, since they are the only points worth discussing.

Adrian:

I agree that the data sets and the chosen supports are not well explained and that, in general, all graphs have the same trends. As such, the experimental section is not very solid.

True, but the paper focuses on a specific domain of data mining to show that data-mining algorithms should take cache locality into account.

Cansu:

Valid point, but this might not be straightforward.

True. In general, the section where the authors exploit multithreading is not well written.

Danica:

The motivation of this work is to reduce the memory stalls that the processor faces during the execution of a cache-unconscious data-mining algorithm.

The difference between future processors and the processor that the authors use is the size of the last-level on-chip cache. The authors exploit spatial locality by improving cache-line utilization, and temporal locality by splitting the tree into multiple tiles, each of which fits in the cache. So, on a future processor, the authors would simply need a different tile size.
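To make the two localities concrete, here is a minimal C sketch of the idea. The node layout, the 64-byte cache line, and the 2 MB LLC size are my own illustrative assumptions, not values from the paper:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical compact prefix-tree node: small fixed-size fields so
     * that several nodes pack into one 64-byte line (spatial locality). */
    typedef struct {
        uint32_t item;         /* item id stored at this node */
        uint32_t count;        /* support count */
        uint32_t first_child;  /* index of first child in the tile's array */
        uint32_t next_sibling; /* index of next sibling, 0 if none */
    } node_t;                  /* 16 bytes: 4 nodes per 64-byte line */

    #define LLC_BYTES (2u * 1024 * 1024)  /* illustrative 2 MB LLC */

    /* Temporal locality: size each tile so a whole tile fits in the
     * last-level cache. For a future processor with a larger LLC,
     * only LLC_BYTES changes. */
    static size_t nodes_per_tile(void) {
        return LLC_BYTES / sizeof(node_t);
    }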

In fact, in a chip multiprocessor each thread will have its own L1 cache, so threads will not compete for a shared L1. They will, however, compete for the shared LLC. In order to see whether the current implementation can be used on a chip multiprocessor, the authors should show how many memory stalls SMT removes at each level of the cache hierarchy. The algorithm might then need small modifications that take into account the fact that cores benefit from sharing the LLC. There are many ways to exploit multiple cores; the major contribution of the paper is that people should take the CPU architecture into consideration.
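A rough sketch of one possible adaptation, assuming hypothetical tile_t, tiles, and process_tile names (none of them from the paper). The idea it illustrates is that tiles are sized so the threads' combined working sets fit the shared LLC, and each thread processes disjoint tiles:

    #include <pthread.h>
    #include <stddef.h>

    #define NUM_THREADS 4
    #define LLC_BYTES (8u * 1024 * 1024)  /* illustrative shared LLC */

    typedef struct { char *base; size_t bytes; } tile_t; /* hypothetical */

    extern tile_t tiles[];               /* tiles from the tiling step */
    extern size_t n_tiles;
    extern void process_tile(tile_t *t); /* hypothetical mining kernel */

    /* Each core has a private L1, so threads never compete there; they
     * share the LLC, so each tile is sized to LLC_BYTES / NUM_THREADS
     * and threads grab disjoint tiles in a strided fashion. */
    static void *worker(void *arg) {
        size_t id = (size_t)arg;
        for (size_t i = id; i < n_tiles; i += NUM_THREADS)
            process_tile(&tiles[i]);
        return NULL;
    }

    int run_parallel(void) {
        pthread_t t[NUM_THREADS];
        for (size_t i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (size_t i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }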

Djordje:

The authors indeed do not talk about the overhead of constructing the cache-conscious prefix tree. However, this cost is included in the total execution time, and the results show that the benefits of using this tree outweigh the overhead. Of course, a quantitative analysis should be done.

I think that the methodology is clear. They use hardware counters to measure the CPI. Even if the numbers are rounded, the measured CPI is so large that rounding does not change the conclusion: it is very far from the base CPI of this processor. Still, I totally agree that they should show non-rounded numbers, in case they didn't.
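For concreteness, a minimal sketch of this measurement on modern Linux using perf_event_open; this is not the tooling the authors used, just an illustration of the counter-based methodology:

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <string.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdint.h>

    /* Open one hardware counter for the calling thread. */
    static int open_counter(uint64_t config) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void) {
        int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
        int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);

        ioctl(cyc, PERF_EVENT_IOC_RESET, 0);
        ioctl(ins, PERF_EVENT_IOC_RESET, 0);
        ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

        /* ... run the mining kernel here ... */

        ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t cycles = 0, instrs = 0;
        read(cyc, &cycles, sizeof(cycles));
        read(ins, &instrs, sizeof(instrs));
        printf("CPI = %.3f\n", (double)cycles / (double)instrs);
        return 0;
    }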

For a CPU-bound algorithm you would expect this to be linear. The graph shows the readers that the performance of a memory-bound algorithm is very far from optimal.

Regarding the CPU utilization, I totally agree that it's misleading. The authors meant to say useful computation.

Farhan:

In my opinion, simulation would be preferable if the authors wanted to explore various hardware architectures.

This might be the case on a multicore processor, where we can have false sharing. But not only do the authors use a uniprocessor, the operations on the tree are also read-only (after it is created), and hence I don't see any negative effect from placing more nodes in one cache line.
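A small sketch of the point, reusing the illustrative node layout from above: because the array is only read after construction, false sharing (which needs at least one writer) cannot occur, so packing four nodes per aligned 64-byte line can only improve spatial locality:

    #include <stdlib.h>
    #include <stdint.h>

    #define CACHE_LINE 64

    typedef struct {
        uint32_t item, count, first_child, next_sibling;
    } node_t; /* 16 bytes: 4 nodes share one 64-byte line */

    /* Align the node array to line boundaries so nodes never straddle
     * a line; with a read-only tree this sharing has no downside. */
    node_t *alloc_nodes(size_t n) {
        void *p = NULL;
        if (posix_memalign(&p, CACHE_LINE, n * sizeof(node_t)) != 0)
            return NULL;
        return (node_t *)p;
    }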

Assuming a tree that does not fit in main memory is orthogonal to the cache optimizations that the authors implement.

Ioannis:

I agree that the authors should report how much time the transformation to the cache-conscious tree takes. However, this is included in the total execution time, and hence we can see that it's clearly outweighed by the benefits.

True. It would be nice to use real datasets.

True. Personally, I needed to go through the references to understand how the algorithm works.

Manos:

See the comments above.

Mutaz:

See the comments on Djordje's review.

I agree that the algorithms were not well explained and that readers who are not familiar with the area need to go through the seminal work.

Indeed, the authors justify not using larger data sets by saying that the FIMI implementations cannot handle them. Although this is not a good excuse, I don't think that larger datasets would change anything.

Onur:

Actually, the FP-tree is memory-resident and is much smaller than the dataset; thus, the dataset itself does not need to fit in memory. The purpose of this paper is the cache performance of data-mining algorithms. As such, I consider examining a non-memory-resident FP-tree to be beyond the scope of the paper.

I think that the main contribution of this paper is that the authors are the first in the data-mining community to take the underlying hardware architecture into consideration. As such, I am sure that other data-mining algorithms can benefit from this contribution.

Indeed, specialized hardware for data mining can be very effective. However, evaluating it would imply using architectural simulators.

Pinar:

The overhead of rebuilding the prefix tree is outweighed by the benefits of exploiting data locality.

I don't consider complexity a weak point. Indeed, it takes more time to build a cache-conscious algorithm or to exploit multithreading. However, we cannot expect to get more performance without exploiting the cache, even if this requires more programming effort.

Renata:

This is true, but if you want to optimize CPU or cache performance you cannot be I/O-bound. Having a non-memory-resident tree is an orthogonal issue.

Frequent pattern mining is an important data-mining problem. The contribution of this paper is not to give an implementation of frequent pattern mining that exploits cache locality, but to show that cache consciousness is the right way to increase the performance of data-mining algorithms.

Sotiria:

True. The FPGrowth algorithm was not well explained.

True. Prefetching does not significantly improve performance, but a simple sequential or stride prefetcher already exists in modern processors and hence adds no additional cost.
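To illustrate why this is free: a sequential scan over a tile's node array is exactly the pattern a hardware sequential/stride prefetcher detects with no code changes. The explicit hint below (GCC/Clang's real __builtin_prefetch, with an illustrative lookahead distance of my choosing) merely mimics what the hardware already does:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint32_t item, count, first_child, next_sibling;
    } node_t;

    /* Sequential tile scan; the prefetch hint is shown for illustration
     * only, since a hardware stride prefetcher covers this pattern. */
    uint64_t scan_tile(const node_t *nodes, size_t n) {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&nodes[i + 16], 0 /* read */, 1);
            total += nodes[i].count;
        }
        return total;
    }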
