Wednesday, 5 December 2012

My first "in depth" look at the PureData for Analytics System


Hosts
The primary interfaces to the PureData for Analytics system are high-performance Linux hosts. External tools and applications (reporting, backup and recovery, and so on) interact with the host via standardized interfaces such as JDBC. The host compiles SQL queries into executable code snippets, creates optimized query plans and distributes the snippets to the massively parallel processing nodes for execution. The hosts run in an active-passive high availability cluster configuration: the active host mirrors data to the standby host, which monitors the primary host and takes over in case of a failure.
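To make the "compile, distribute, combine" idea concrete, here is a minimal scatter-gather sketch in Python. It is not how the appliance is implemented; the data slices, `run_snippet` and `host_query` are hypothetical names invented for illustration, and threads stand in for the processing nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data slices, one per processing node (illustrative data only).
SLICES = [
    [("widget", 3), ("gadget", 5)],
    [("widget", 7)],
    [("gadget", 1), ("widget", 2)],
]


def run_snippet(rows, product):
    """Work done on one node: a partial SUM over its local slice."""
    return sum(qty for name, qty in rows if name == product)


def host_query(product):
    """Host side: fan the snippet out to every node, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=len(SLICES)) as pool:
        partials = list(pool.map(lambda rows: run_snippet(rows, product), SLICES))
    return sum(partials)


print(host_query("widget"))  # prints 12
```

The point of the sketch is only that each node works on its own slice of the data and the host merges small partial results, rather than pulling all rows to one place.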

S-blades
The bulk of the analytics workload processing occurs on intelligent massively parallel processing nodes called S-blades. S-blades are optimized for processing analytics workloads at massive scale. They contain multi-core CPUs, multi-engine Field-Programmable Gate Arrays (FPGAs), and gigabytes of RAM, all optimized to work together to deliver peak performance. Continuous availability is made possible by the systems management software, which monitors the S-blades (including memory), automatically takes a failed S-blade out of service and moves the processing load to a spare one.
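The failover behaviour described above can be pictured with a small toy model. This is only a sketch under my own assumptions: `BladePool`, the blade names and the "slices" are hypothetical, and the real systems management software is certainly more involved.

```python
from dataclasses import dataclass, field


@dataclass
class BladePool:
    """Toy model of active S-blades plus spares (all names are hypothetical)."""

    active: dict                      # blade name -> list of data slices it processes
    spares: list = field(default_factory=list)

    def fail(self, blade):
        """Take a failed blade out of service and hand its load to a spare."""
        if blade not in self.active:
            raise KeyError(f"{blade} is not an active blade")
        if not self.spares:
            raise RuntimeError("no spare blade available")
        workload = self.active.pop(blade)
        replacement = self.spares.pop(0)
        self.active[replacement] = workload
        return replacement


pool = BladePool(
    active={"blade1": ["slice1", "slice2"], "blade2": ["slice3", "slice4"]},
    spares=["spare1"],
)
replacement = pool.fail("blade2")
print(replacement, pool.active[replacement])  # spare1 ['slice3', 'slice4']
```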

Disks
The S-blades are connected to disk enclosures via a high-speed interconnect that streams data to the S-blade memory at the fastest rate possible. The disk enclosures contain high-density, high-performance disks. Redundancy is built into the data path from each S-blade to the disks. Each drive is mirrored in a RAID 1 configuration, so should a disk fail, the storage subsystem simply redirects I/O processing to the mirror without interruption of service. Spare drives are included, allowing the system to replace failed drives and regenerate their content to restore full redundancy.
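As a rough illustration of RAID 1 mirroring with a spare, the following sketch models one mirrored pair in Python. `MirroredPair` and its methods are hypothetical and say nothing about the actual storage subsystem; they only show writes going to both drives, reads being redirected to the surviving drive, and a spare being filled from the survivor.

```python
class MirroredPair:
    """Toy RAID 1 pair: every write goes to both healthy drives (hypothetical model)."""

    def __init__(self):
        self.drives = {"primary": {}, "mirror": {}}
        self.failed = set()

    def write(self, block, data):
        for name, drive in self.drives.items():
            if name not in self.failed:
                drive[block] = data

    def read(self, block):
        # Serve the read from the first healthy drive; a failure just shifts I/O.
        for name, drive in self.drives.items():
            if name not in self.failed:
                return drive[block]
        raise IOError("both drives in the pair have failed")

    def fail(self, name):
        self.failed.add(name)

    def rebuild_from_spare(self, name):
        """Swap in an empty spare for a failed drive and copy the survivor's content."""
        survivors = [d for n, d in self.drives.items() if n not in self.failed]
        if not survivors:
            raise IOError("no surviving drive to rebuild from")
        self.drives[name] = dict(survivors[0])
        self.failed.discard(name)


pair = MirroredPair()
pair.write(0, "a")
pair.fail("primary")
print(pair.read(0))            # still readable from the mirror
pair.write(1, "b")             # only the mirror receives this write
pair.rebuild_from_spare("primary")
print(pair.drives["primary"])  # the spare now holds both blocks
```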

Network
Communication in the MPP grid occurs on an optimized IP-based network designed for high-volume data warehousing traffic patterns. It allows maximum utilization of the network bandwidth without overloading it, thereby allowing predictable performance close to the data transmission speed of the network. There are two completely independent networks for redundancy. The data network is also completely separate from the management network, which enables the system to assess the health of its components even when there are data network problems.
