Chukwa is an open source data collection system for managing large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and MapReduce framework and inherits Hadoop’s scalability and robustness. It also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
Chukwa uses an end-to-end delivery model that can leverage local on-disk log files for reliability. This approach also eases integration with legacy systems.
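To make the reliability idea concrete, here is a minimal sketch of how a sender can rely on a local on-disk log file: it remembers the last byte offset the receiver acknowledged and, after a failure, re-reads the file from that offset instead of buffering data in memory. This is not Chukwa's actual API. The names `FlakyCollector`, `ship_log`, and `demo`, the chunk size, and the retry limit are all assumptions made for illustration.

```python
import os
import tempfile


class FlakyCollector:
    """Hypothetical receiver that fails on its first upload attempt."""

    def __init__(self):
        self.received = b""
        self.calls = 0

    def upload(self, offset, data):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("simulated network failure")
        # Accept only data that continues exactly where the last upload ended.
        if offset != len(self.received):
            raise ValueError("unexpected offset")
        self.received += data
        return len(self.received)  # acknowledged byte offset


def ship_log(path, collector, chunk_size=16, max_attempts=5):
    """Send a local log file, resuming from the last acknowledged offset."""
    acked = 0  # bytes the collector has confirmed so far
    size = os.path.getsize(path)
    attempts = 0
    while acked < size:
        # The file on disk is the source of truth, so a retry just re-reads it.
        with open(path, "rb") as f:
            f.seek(acked)
            data = f.read(chunk_size)
        try:
            acked = collector.upload(acked, data)
            attempts = 0
        except ConnectionError:
            attempts += 1
            if attempts >= max_attempts:
                raise
    return acked


def demo():
    """Write a small log file, ship it through the flaky collector, return what arrived."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"line one\nline two\nline three\n")
        path = f.name
    try:
        collector = FlakyCollector()
        ship_log(path, collector)
        return collector.received
    finally:
        os.remove(path)
```

Because the log file stays on disk, the sender only needs to persist one number (the acknowledged offset) to recover from a failure. The same property is what makes it easy to attach to legacy systems that already write plain log files.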
Use cases of Chukwa
- Monitoring and debugging systems.
- Anomaly detection, which machine learning is getting increasingly good at doing automatically.
- Web log analysis, which is key to many businesses.
Why use Chukwa?
Use Chukwa to monitor multiple clusters of several thousand hosts, potentially generating several terabytes of data per day. The goals in designing Chukwa are based on a survey of cluster users’ functional requirements and performance demands. Chukwa is meant to be used by four different (though overlapping) constituencies: Hadoop users, cluster operators, cluster managers, and Hadoop developers.
How does Chukwa work?
Chukwa works by separating collection from processing:
- Adaptors (on each node) output chunks of data, with some minimal metadata.
- The framework uploads this data to a small number of collectors, which write to “sink” files in HDFS.
- Periodic MapReduce jobs organize and analyze the collected data.
- The results are dumped to structured storage for visualization and ad-hoc querying.
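The separation of collection from processing described above can be sketched as a toy, in-memory model. This is a simplified illustration, not Chukwa's real classes: the `Chunk` fields, the `Collector` list that stands in for an HDFS sink file, and the `organize` function that stands in for a periodic MapReduce job are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    """A block of raw data plus minimal metadata (field names are illustrative)."""
    source: str    # host that produced the data
    datatype: str  # kind of data, e.g. "syslog"
    seq_id: int    # offset of the chunk's end in the source stream
    data: str


def adaptor(source, datatype, lines):
    """Runs on each node: wrap raw lines into chunks with metadata."""
    offset = 0
    for line in lines:
        offset += len(line)
        yield Chunk(source, datatype, offset, line)


class Collector:
    """Receives chunks from many nodes and appends them to one sink."""

    def __init__(self):
        self.sink = []  # stands in for a sink file in HDFS

    def add(self, chunk):
        self.sink.append(chunk)


def organize(sink):
    """Stands in for the periodic job: group by (datatype, source), order by seq_id."""
    grouped = defaultdict(list)
    for chunk in sink:  # "map" step: key each chunk
        grouped[(chunk.datatype, chunk.source)].append(chunk)
    return {  # "reduce" step: order the chunks within each group
        key: [c.data for c in sorted(chunks, key=lambda c: c.seq_id)]
        for key, chunks in grouped.items()
    }


def run_pipeline():
    """Collect from two hypothetical hosts, then organize the sink contents."""
    collector = Collector()
    streams = [
        adaptor("host-a", "syslog", ["boot\n", "login\n"]),
        adaptor("host-b", "syslog", ["boot\n"]),
    ]
    for stream in streams:
        for chunk in stream:
            collector.add(chunk)
    # In a real deployment this result would go to structured storage.
    return organize(collector.sink)
```

In this sketch the collector does no processing at all: it only appends. All grouping and ordering happens later in `organize`, so the processing logic can change without touching the code that runs on every node.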