Cray re-soldering Titan’s connectors; supercomputer testing could be done in April


Jeff Nichols, associate director for computing and computational sciences at Oak Ridge National Laboratory, in front of Titan, the world’s fastest supercomputer. (Photos courtesy of ORNL)

Hundreds of connectors are being re-soldered each week, and the Titan supercomputer at Oak Ridge National Laboratory—the world’s fastest machine—could be in regular production by May, a lab official said Wednesday.

Jeff Nichols, ORNL associate lab director for computing and computational sciences, said connectors on the $100 million computer’s motherboards had too much gold, and solder was interacting with the gold on connector pins, making the solder unstable and leading to cracks.

There are about 20,000 of the pencil-sized connectors, which link central and graphics processing units, or CPUs and GPUs. Each connector has about 100 pins.

Motherboards from Titan’s 200 closet-sized cabinets are being shipped back to Cray Inc., and the company is removing the connectors, laying down new ones with the right amount of gold, and re-soldering them, Nichols said.

Titan is a Cray XK7 system.

ORNL had hoped to complete acceptance testing on Titan, allowing it to be put into production with full-scale user operations, by the end of 2012, Nichols said. But that was an aggressive target and assumed that everything went well, he said.

Lab officials now plan to have all the components back in service by April 6, and they plan to run the acceptance test one more time. It includes a 14-day stability test that will ensure Titan is finishing problems, producing the right answers, and performing appropriately.

The acceptance testing could be complete by the end of April.

The testing was almost completed once before, but workers noticed a degradation in communications between the CPUs and GPUs.

While repairs are being made, research is continuing on Titan. The machine’s GPUs give it much of its power, but it can still be used with its CPUs.

“Right now, the users are on it, but they’re not able to take advantage of the full system in the way that they could in the future,” Nichols said.

Titan has 24 pizza box-sized metal “blades” in each of its 200 cabinets. There are four connectors per blade, or about 100 connectors per cabinet. Nichols said Cray is repairing connectors in about 12 to 16 cabinets per week.
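The connector counts can be checked with a little arithmetic. The sketch below uses only figures quoted in this story (24 blades per cabinet, four connectors per blade, 200 cabinets, about 100 pins per connector, 12 to 16 cabinets repaired per week). The variable and function names are illustrative, and the week estimate assumes, hypothetically, that every cabinet needs repair and that the pace stays constant.

```python
# Figures quoted in the article; all names here are illustrative.
BLADES_PER_CABINET = 24
CONNECTORS_PER_BLADE = 4
CABINETS = 200
PINS_PER_CONNECTOR = 100  # "about 100 pins"

connectors_per_cabinet = BLADES_PER_CABINET * CONNECTORS_PER_BLADE
total_connectors = connectors_per_cabinet * CABINETS
total_pins = total_connectors * PINS_PER_CONNECTOR


def weeks_to_repair(cabinets: int, cabinets_per_week: int) -> float:
    """Weeks needed at a constant pace (assumption: every cabinet is repaired)."""
    return cabinets / cabinets_per_week


print(connectors_per_cabinet)  # 96, consistent with "about 100"
print(total_connectors)        # 19200, consistent with "about 20,000"
print(total_pins)              # 1920000
print(weeks_to_repair(CABINETS, 16), weeks_to_repair(CABINETS, 12))
```

Under those assumptions, a full pass through all 200 cabinets would take roughly 12.5 to 16.7 weeks. The story does not say how many cabinets had already been repaired, so this is not an estimate of the remaining schedule.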

He said the lab is not assigning blame for the solder problems on the big, cutting-edge machine. The solder started to crack as Titan heated up and cooled down, and blades were moved in and out of cabinets.

“We have the biggest machine on the planet,” Nichols said. The setbacks are part of “life on the leading edge,” he said.

He said Cray is bearing the cost of the repairs, and the company won’t get all of its money until the machine is accepted.

Titan received a first-place ranking in a semiannual Top500 list that was released in November at the SC12 supercomputing conference in Salt Lake City, Utah. A test showed Titan is capable of reaching a speed of 17.59 petaflops, or more than 17,000 trillion calculations per second. It had an even higher theoretical capability of 27 petaflops.
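The speed figures can be converted the same way. This minimal sketch takes the two numbers quoted above (17.59 petaflops measured, 27 petaflops theoretical) and expresses the measured speed in trillions of calculations per second and as a share of the theoretical peak. The function and variable names are illustrative.

```python
PETA = 10**15
TERA = 10**12  # one trillion

measured_pflops = 17.59    # benchmark result quoted in the article
theoretical_pflops = 27.0  # theoretical capability quoted in the article


def pflops_to_trillions_per_second(pflops: float) -> float:
    """Convert petaflops to trillions of calculations per second."""
    return pflops * PETA / TERA


measured_trillions = pflops_to_trillions_per_second(measured_pflops)
fraction_of_peak = measured_pflops / theoretical_pflops

print(f"{measured_trillions:,.0f} trillion calculations per second")
print(f"{fraction_of_peak:.0%} of theoretical peak")
```

The measured result works out to about 17,590 trillion calculations per second, or roughly 65 percent of the theoretical figure.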

As big as a basketball court, Titan is 10 times faster than Jaguar, the computer system it replaced.



  • TJ Garland

    (WMR) - Sources in the U.S. intelligence community report that the National Security Agency (NSA) is establishing a major Internet surveillance facility at the Oak Ridge National Laboratory in Tennessee that is dedicated to decrypting encoded communications, including file transfers, over the Internet and within private networks such as those used by banks, foreign governments, and multinational corporations.

    The new multi-structure NSA facility is called the Multiprogram Research Facility and will work in tandem with the massive NSA data center being built in Bluffdale, Utah, on acreage that is part of the Camp Williams Utah National Guard base. The Utah center will store massive amounts of intercepted data and the Oak Ridge facility will rely on new generation supercomputers to decode encrypted data stored at the Bluffdale facility and that which is captured in real time from NSA’s worldwide signals intelligence system.

    The Oak Ridge center will also complement the NSA’s High Performance Computing Center at Fort Meade, currently under construction and scheduled for completion in 2015. It is rumored that the building at Fort Meade will contain the world’s fastest computer, the speed of which will be measured in exaFLOPS. FLOPS means FLoating point OPerations per Second, and exa means a 1 followed by 18 zeros.

    WMR was first alerted to the Oak Ridge facility when NSA began transferring personnel from its Fort Meade, Maryland, headquarters to Oak Ridge. The NSA has, for the past decade, been spreading out its functions to facilities around the country, including at regional facilities at Fort Gordon, Georgia (NSA Georgia); San Antonio, Texas (NSA Texas); and Kunia, Hawaii (NSA Hawaii). The Utah and Tennessee facilities represent a further expansion of NSA’s “Big Brother” surveillance network to additional regions of the United States.

    And with the new NSA facility at Oak Ridge comes the expansion of the Obama administration’s already-massive counter-intelligence (anti-whistle blowing) operations into Tennessee. The FBI field office in Knoxville, working with NSA’s Q Group, has been active in ensuring that NSA employees and contractors refrain from discussing the new Oak Ridge center with uncleared individuals, including members of the media.

    • Charlie Jernigan

      What is wmr?

      • Johnny Beck
        • Charlie Jernigan

          Thanks for that. He doesn’t seem very reliable.

          • TJ Garland

            What makes you think that? His credentials are far greater than mine and yours combined. He has been an investigative reporter after crooked politicians for years, mostly in Chicago, which continues to be fertile ground.
            How many books have you written? Ask around town about Building 5700.

          • Charlie Jernigan

            Well… Once I knew what I was looking for, I found it about fifth down in the Google search results, whereas an equivalent search for AP gives us pretty much a whole page of Associated Press links. That is an indicator of overall trust.

            A little more searching gives assessments like “Madsen writes seemingly anything and asks you to take him either at his word or at the word of his many, never-named sources. True or not, he can be damn entertaining to read.” This fits the quote you provided.

            He also seems to get easily sidetracked if he can fit gay sex or Presidential sex into his speculations. There may be many more categories that hold special interest for him, but I had grown tired of checking by then.

          • TJ Garland

            Interesting theory why we will rarely get the truth from the MSM.
            http://lewrockwell.com/north/north1270.html

    • Dave Smith

      I do not believe your (WMR) intrigue has anything at all to do with the Titan supercomputer in general or this story in particular. That said, it would be entirely appropriate for the moderator to delete your comment.

      Also, there is no supercomputer in Building 5700 at ORNL, spy machine or otherwise.