Hadoop.Operations(2012.9) Eric.Sammer

One of the most common types of scripts is one that uses a CSV file of machine-to-rack mappings. See Examples 5-19 and 5-20.

Example 5-19. Python Hadoop rack topology script (/etc/hadoop/conf/topology.py)

    #!/usr/bin/python

    import sys

    class RackTopology:

        # Make sure you include the absolute path to topology.csv.
        DEFAULT_TOPOLOGY_FILE = '/etc/hadoop/conf/topology.csv'
        DEFAULT_RACK = '/default-rack'

        def __init__(self, filename = DEFAULT_TOPOLOGY_FILE):
            self._filename = filename
            self._mapping = dict()
            self._load_topology(filename)

        def _load_topology(self, filename):
            '''
            Load a CSV-ish mapping file. Should be two columns with the first being the
            hostname or IP and the second the rack name. If a line isn't well formed,
            it's discarded. Each field is stripped of any leading or trailing space. If
            the file fails to load for any reason, all hosts will be in DEFAULT_RACK.
            '''
            try:
                with open(filename, 'r') as f:
                    for line in f:
                        fields = line.split(',')
                        if len(fields) == 2:
                            self._mapping[fields[0].strip()] = fields[1].strip()
            except Exception:
                # Leave the mapping as loaded so far; unknown hosts fall back to DEFAULT_RACK.
                pass

        def rack_of(self, host):
            '''
            Look up a hostname or IP address in the mapping and return its rack.
            '''
            return self._mapping.get(host, RackTopology.DEFAULT_RACK)

    if __name__ == '__main__':
        app = RackTopology()
        # Hadoop passes one or more hostnames or IPs; print one rack per argument.
        for node in sys.argv[1:]:
            print(app.rack_of(node))

Example 5-20. Rack topology mapping file (/etc/hadoop/conf/topology.csv)

    10.1.1.160,/rack1
    10.1.1.161,/rack1
    10.1.1.162,/rack2
    10.1.1.163,/rack2
    10.1.1.164,/rack2

With our script (Example 5-19) and mapping file (Example 5-20) defined, we only need to tell Hadoop the location of the script to enable rack awareness. To do this, set the parameter topology.script.file.name in core-site.xml to the absolute path of the script. The script should be executable and require no arguments other than the hostnames or IP addresses. Hadoop will invoke this script, as needed, to discover the node-to-rack mapping information.

You can verify that Hadoop is using your script by running the command hadoop dfsadmin
-report as the HDFS superuser. If everything is working, you should see the proper rack name next to each machine. The name of the machine shown in this report (minus the port) is also the name that is passed to the topology script to look up rack information.

    [esammer@hadoop01 ~]$ sudo -u hdfs hadoop dfsadmin -report
    Configured Capacity: 19010409390080 (17.29 TB)
    Present Capacity: 18228294160384 (16.58 TB)
    DFS Remaining: 5514620928000 (5.02 TB)
    DFS Used: 12713673232384 (11.56 TB)
    DFS Used%: 69.75%
    Under replicated blocks: 181
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    -------------------------------------------------
    Datanodes available: 5 (5 total, 0 dead)

    Name: 10.1.1.164:50010
    Rack: /rack1
    Decommission Status : Normal
    Configured Capacity: 3802081878016 (3.46 TB)
    DFS Used: 2559709347840 (2.33 TB)
    Non DFS Used: 156356984832 (145.62 GB)
    DFS Remaining: 1086015545344(1011.43 GB)
    DFS Used%: 67.32%
    DFS Remaining%: 28.56%
    Last contact: Sun Mar 11 18:45:47 PDT 2012

The naming convention of the racks is a slash-separated pseudo-hierarchy, exactly the same as absolute Linux paths. Although today rack topologies are single level (that is, machines are either in the same rack or not; there is no true hierarchy), it is possible that Hadoop will develop to understand multiple levels of locality. For instance, it is not currently possible to model multinode chassis systems with multiple racks. If a chassis holds two discrete servers and the cluster spans multiple racks, it is possible that two replicas of a block could land in a single chassis, which is less than ideal. This situation is significantly more likely in the case of highly dense blade systems, although they have other problems as well (see "Blades, SANs, and Virtualization" on page 52).

Some users see rack topology as a way to span a Hadoop cluster across data centers by creating two large racks, each of which encompasses all the nodes in one data center. Hadoop will not berate you with errors if you
were to try this (at least not initially), but rest assured it will not work in practice. It seems as though, with multiple replicas and the ability to influence how replicas are placed, you'd have everything you need. The devil, as they say, is in the details. If you were to configure a cluster this way, you'd almost immediately hit the network bottleneck between data centers. Remember that while rack topology doesn't necessarily prohibit non-local reads, it only reduces the chances, and either way, all writes would always span data centers (because replicas are written synchronously).
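
Returning to the mapping file: the script in Example 5-19 silently discards malformed lines and falls back to /default-rack, so a typo in topology.csv can go unnoticed until the dfsadmin report shows a node in the wrong rack. The following is a minimal sketch of a check that could be run before deploying the file. The function name check_topology_lines and the rules it applies (exactly two fields, a rack name that begins with a slash, no host listed twice) are assumptions made for illustration; they are not part of Hadoop or of the script above.

```python
def check_topology_lines(lines):
    """Return a list of (line_number, problem) tuples for a rack mapping.

    Assumed rules (illustrative only, not enforced by Hadoop):
      - exactly two comma-separated fields: host, rack
      - the host field is not empty
      - the rack name starts with '/' and is longer than '/'
      - a host appears at most once
    """
    problems = []
    seen = {}  # host -> line number where it was first mapped
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines carry no mapping, so skip them
        fields = [field.strip() for field in line.split(',')]
        if len(fields) != 2:
            problems.append((number, 'expected 2 fields, found %d' % len(fields)))
            continue
        host, rack = fields
        if not host:
            problems.append((number, 'empty host'))
        if not rack.startswith('/') or len(rack) < 2:
            problems.append((number, 'rack %r is not of the form /name' % rack))
        if host in seen:
            problems.append(
                (number, 'host %r already mapped on line %d' % (host, seen[host])))
        elif host:
            seen[host] = number
    return problems


if __name__ == '__main__':
    # Hypothetical sample data with three deliberate mistakes.
    sample = [
        '10.1.1.160,/rack1',
        '10.1.1.161, /rack1',
        '10.1.1.162 /rack2',   # missing comma
        '10.1.1.163,rack2',    # missing leading slash
        '10.1.1.160,/rack2',   # host already mapped above
    ]
    for number, problem in check_topology_lines(sample):
        print('line %d: %s' % (number, problem))
```

In practice the lines would come from the mapping file itself, for example by passing an open file object for /etc/hadoop/conf/topology.csv to the function, since iterating over a file yields its lines.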
