Recently, I began working on a demo for our log analysis tool, LogDelta, using your Hadoop dataset. While building the demo, I grew increasingly suspicious of certain labels in the Hadoop data. As a result, what started as a simple demo evolved into a label investigation that required far more effort than initially anticipated.
I focused solely on the PageRank application, meaning that the WordCount application might still contain additional incorrect labels. Below are the identified incorrect labels along with their corresponding fixes:
| ID | Original Label | Fixed Label |
| --- | --- | --- |
| 1445144423722_0024 | Normal | Disk Full |
| 1445182159119_0017 | Machine Down | Normal |
| 1445062781478_0020 | Machine Down | Normal |
| 1445182151478_0015 | Machine Down | Disk Full |
| 1445182159119_0013 | Disk Full | Machine Down |
| 1445182159119_0011 | Disk Full | Machine Down |
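
For anyone who wants to apply these fixes programmatically, here is a minimal Python sketch. It assumes the labels are already loaded into a plain dict keyed by the IDs exactly as written in the table; that in-memory format, the label spellings, and the names `CORRECTIONS` and `apply_corrections` are my own illustration, not part of the dataset or of LogDelta.

```python
# Corrections from the table above: ID -> (original label, fixed label).
# Label spellings follow the table; adjust them if your label file differs.
CORRECTIONS = {
    "1445144423722_0024": ("Normal", "Disk Full"),
    "1445182159119_0017": ("Machine Down", "Normal"),
    "1445062781478_0020": ("Machine Down", "Normal"),
    "1445182151478_0015": ("Machine Down", "Disk Full"),
    "1445182159119_0013": ("Disk Full", "Machine Down"),
    "1445182159119_0011": ("Disk Full", "Machine Down"),
}


def apply_corrections(labels):
    """Return a copy of `labels` (hypothetical dict: ID -> label) with the fixes applied.

    IDs missing from `labels` are skipped. If an ID's current label is not
    the expected original label, a ValueError is raised so that a mismatch
    is noticed instead of being silently overwritten.
    """
    fixed = dict(labels)
    for app_id, (original, corrected) in CORRECTIONS.items():
        if app_id not in fixed:
            continue
        if fixed[app_id] != original:
            raise ValueError(
                f"{app_id}: expected {original!r}, found {fixed[app_id]!r}"
            )
        fixed[app_id] = corrected
    return fixed
```

The check against the original label is deliberate: it makes the function safe to run only against the unmodified labels, so applying it twice fails loudly rather than producing a mixed state.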
If you're curious about how I reached these conclusions, the process is documented in a YouTube playlist.
- The key part of the label correction is covered in the final video.
- The earlier videos provide details on how the suspicions began to arise.
- I have also shared the text script of the video, which includes some visuals.