Q:
Can Computing Fabrics really operate over the WAN and
broadband?
A: Broadband will serve as an interconnect for "loosely coupling" cells, clusters of nodes that are themselves local and tightly coupled. Top performance will require predictable latencies, but that can be achieved through QoS measures debuting over the next 5 years. Systems will not use hardware approaches to tightly couple over broadband connections. It is the essence of fabrics that they employ both types of coupling, tight and loose, but in a more flexible manner than current networks and clusters.
Q:
Won't the additional cost of Computing Fabric nodes
and interconnects, compared with desktop PCs and Ethernet,
slow their adoption?
A: Employees throughout corporate America may have little say about which desktop gets purchased for them – even less if the functional equivalent, a fabric node, no longer sits on their desk! TCO was picked up and embraced quickly, and if Computing Fabrics bear out, the cost savings they accrue through more effective use of ambient cycles will justify purchasing the machines that save the most money in the final analysis, not just the cheapest machines.
Besides this, some cells
will be formed using Automated Transparent Object Distribution
(the strategy behind Microsoft's Millennium), requiring
little more than today's desktops.
Q:
How will security be handled on Computing Fabrics?
A: This is a big issue
with many sub-issues. For example, what does it mean
to secure a remote memory read or write? How would you
deal with the overhead and still maintain ultra-low
latencies and preserve cache coherence? However, there
are plenty of benefits here to justify the R&D to
attack these problems, as well as sufficient time and
dollars to do so. Tight coupling using software at the
object level will incur fewer security issues than tight
coupling with hardware at the memory page level.
Q:
Will Computing Fabrics support legacy compatibility?
A: Existing multithreaded applications should run on Computing Fabrics, even if they do not fully exploit the fabrics' unique properties of dynamic reconfigurability and fluid system boundaries. Beyond this there are issues such as incorporating into the fabric equipment currently owned or purchased over the next several years (which will be legacy within Computing Fabrics' evolutionary time frame). Cellular IRIX from SGI will support both CORBA and DCOM, making fabrics employing IRIX interoperable with most legacy systems.
Q:
The enterprise seems a likely place for Computing Fabrics
to take hold, but are fabrics really likely to take
off in the consumer space?
A: The enterprise is where fabrics will initially take off, as an outgrowth of the current interest in clustering, especially since the problems clustering targets have been so poorly addressed by most vendors.
Moving beyond the enterprise into the consumer realm, one finds what telcos are planning for residential neighborhoods: distributing massive processing power through them, heavily utilizing distributed services.
Now add to that certain
technologies that impact the human interface for consumers,
including 3D, media integration, and the convergence
of devices such as game consoles, set tops, and PCs.
These will be a strong motivator for Computing Fabrics
outside of the enterprise because they significantly
advance the user experience, and they eat up lots of
computational resources, many of which need to be near
the end user because of the speed of light.
Visualize the progression of Computing Fabrics as beginning with many smaller fabrics,
first in the enterprise, that in time join up to become
fewer, larger fabrics. Neighborhoods of processors and
memory within these fabrics exhibit a single system
image through tight coupling. These neighborhoods of
tight coupling are themselves loosely coupled with each
other. And due to the distributed OSes and interconnects
these fabrics will employ, the boundaries of the neighborhoods
will not be rigid but fluid. Feeding and supporting
all this will be economies of scale and reuse far greater
than today’s. These same principles will in time apply
to processors distributed throughout residential areas.
Q:
Are Computing Fabrics just about scalability?
A: Computing Fabrics address
scalability but that’s only one small slice of a wide
panorama. Improvements in scalability are largely a
quantitative change. Computing Fabrics will be a qualitative
change (as well as quantitative) in that we’re no longer
talking about networking fixed-architecture systems
but seeing system architecture converge with network
architecture, with architecture itself ultimately becoming
a dynamic variable - Architecture On Demand.
Q:
Are Computing Fabrics really "in sight"?
A: When is a new era of
technology within sight? Technologies often first enter
into classified "black" projects long before the public
even gets a whiff of them. Then they enter unclassified
military usage, then on into academic research, on into
the high-end of the commercial marketplace, and only
after many years do they enter the mainstream. At which
point are they "near us", "within sight", or "upon us"?
The intelligence community would say the next era is
upon us while many in education are nowhere near client/server!
The technologies of Computing Fabrics have been and
continue to be proven and are also being scaled. Within
two years they will be implemented with mass-market
microprocessors.
Q:
Isn't it a well established fact that NUMA machines
have higher latencies than SMP machines do, and always
will?
A: This is not just an architectural issue; it also involves engineering and implementation factors. The normative latency of some SMPs exceeds the worst-case latencies of some NUMA implementations. The team responsible for Craylink at SGI/Cray Research has a project to extend Craylink with three major milestones. Within 2 years they will expand the system size that Craylink can support by extending the reach of Craylink cables to encompass a very large room. In 3 years (from the present, late 1998) they will expand the bandwidth of Craylink. And in 5 years SGI will extend the range again, this time dramatically, to support distribution throughout a building and beyond.
One possible implementation
to achieve this is to add a cache coherency protocol
on top of the follow-on to SuperHIPPI, but that is just
one direction being considered. These three extensions
to Craylink are being pursued in parallel by SGI.
Another approach, being
pursued by Microsoft in their Millennium research project,
automates the distribution of COM+ objects around a
network (unlike today where programmers must decide
the location of client and server objects). Furthermore,
they are layering DCOM over VI Architecture links to
achieve very low latencies for remote method invocations.
HIPPI-6400 will support distances up to 10km. With VIA
running over HIPPI-6400 and DCOM over VIA, and Millennium's
Continuum automating the distribution of COM+ objects,
a distributed object space (as contrasted with distributed
shared memory) could reach out over far more than a
quarter of a mile.
Q:
Will special programming tools or paradigms be required,
like MPI or PVM?
A: Programming a Computing
Fabric will most likely resemble programming a distributed
shared memory machine. This involves great thread packages and great compilers, just as programming centralized shared memory machines (SMPs) does, although there are differences in how they are applied on these two architectures.
MPI (Message Passing Interface)
and PVM (Parallel Virtual Machine) are not required.
MPI and PVM are used by programmers of massively parallel
machines (and on networks of workstations where supported
by a utility) to obtain portability between parallel
architectures and implementations. They explicitly support
programming distributed "Non-Shared" memory computing,
where each processor has its own memory with its own
address space and message passing is used to coordinate
function invocations, reads, writes, etc. amongst the
ensemble. PVM was developed at Oak Ridge National Lab
in ‘89 to run across a network of UNIX boxes. MPI began
at a workshop on message passing in ’92 and the first
version was published a little over a year later, making
its big debut at Supercomputing ’93 in November of that
year.
Computing Fabrics are not network computing, which primarily means distributed file systems and, more recently, distributed objects.
What we’re moving into is a convergence of the loosely
coupled programming model of networks and MPP (distributed
processing) with the more abstract, transparent programming
of SMPs, where the programmer is shielded from many
of the details of the underlying distribution, since
all memory is shared. It should be pointed out that
the SGI Origin 2000, a precursor of a Computing Fabric,
can also be programmed with a parallel library, such
as MPI, when message passing best suits the needs of
the developer.
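To make the contrast concrete, here is a minimal, purely illustrative sketch (in C with MPI, not code from any fabric implementation) of the explicit message-passing style that MPI supports: each process owns its own address space, works only on its own slice of the data, and partial results must be combined through explicit communication. Under a shared memory model, the same accumulation would simply be performed by threads reading and writing a single address space.

    /* Illustrative only: a minimal MPI sketch of the explicit
     * message-passing model, where each process owns its own address
     * space and data must be combined through explicit communication. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double local_sum = 0.0, global_sum = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many in total?  */

        /* Each process works only on the slice of data it owns. */
        for (int i = rank; i < 1000; i += size)
            local_sum += (double)i;

        /* Partial results are combined via explicit communication. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }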
Q:
I haven't heard anyone talk about dynamic reconfiguration, especially of interconnects, since that could disrupt the adaptive routing algorithms used by these types of systems, causing all sorts of hotspots and contention, especially if the applications didn't have the same requirements, which they most probably wouldn't.
A: The architecture of
the SGI Origin2000 and its routing system was designed
so that hardware makes routing decisions while software
can reconfigure the network. Right now this reconfiguration
is limited to avoiding faults and better using resources,
but this will be changing. Microsoft is also working
on this as part of the Millennium project at Microsoft
Research, very much focused on creating fluid system
boundaries. These elements differentiate Computing Fabrics
from mere networks of bus- and switch-based multiprocessor
systems with their inherently rigid system boundaries.
Q:
Don't Beowulf systems have a great deal of similarity
to Computing Fabrics? I understand these systems have
even enabled screensavers that can detect idle times
automatically and log your machine into the "fabric"
and begin work on a shared problem.
A: Beowulf, which is focused on Linux, is a great project, but like Linux it is not commercially backed, which limits its applicability in the enterprise. There is even a 1,000 processor Beowulf system under construction using Alphas. However, Beowulf systems do not utilize a low-latency link between systems, because they are intended to exploit technology that is commodity "today". Also, software tends to follow hardware, often by significantly long stretches. Today we're seeing the beginnings of the hardware for Computing Fabrics; it will motivate the development of the software. Beowulf is a software solution that provides some measure of distributed processing on today's commodity hardware.
Q:
How much will it cost to put a scalable interconnect
on each desktop in order to create a Computing Fabric
across an enterprise? If it is cheaper to use a non-scalable
one I suspect that one will win. And do we really need
a hypercube on every desk?
A: Despite the fact that in the mid-eighties Danny Hillis suspected the Connection Machine would spawn desktop hypercubes within 10 years, no, we do not need a hypercube on the desktop; that would be a perversion of the direction technology is headed. Rather, the machines that replace desktops will participate in a fabric that has hypercube regions to its topology. At first, say 3-5 years off, it will
indeed be pricey to use cache coherent scalable interconnects.
Now, although NCs haven't been a big hit (and rightly so), TCO has caught on, and so too will TCC, Total Cost
of Cycles. This could well make widespread use of modularly
scalable interconnects very attractive as a way of exploiting
an organization’s cycles. Besides this, there are several
human interface directions at work that mandate significant
power (processing cycles and cache) out near the users
but not dedicated to a specific user 24x7x365. These
will weigh heavily as motivation towards distributed
shared processing and Computing Fabrics.
Q:
Is all this really necessary to run Word?
A: Ever increasing cycles
will be needed to enhance the human-computer interface,
and that power better be "near" the user though it needn’t
be on their desktop. Desktops themselves may disappear (literally the desk tops; the computers will depart with them, as more workers become peripatetic and mobile, working from home and in the field). The
first place fabrics will catch on is in the heart of
the enterprise, an evolution of server farms. But financial
pressure will likely cause the assimilation of whatever
succeeds the desktop. Two kinds of connectivity will
be present (at least). Information appliances and personal
interface devices will connect in using wireless RF
and scattered IR (and other technologies from DARPA
projects). These will certainly not support cache coherent
SSI but only loosely coupled distributed processing.
It's what's behind the walls that's likely to become
tightly coupled but distributed so as to remain relatively
close to the users.
Q:
Will Computing Fabrics create a totally homogeneous
computer architecture from the bottom of the industry
to the top?
A: There will still be
layering in the industry and technological innovations
specific to the realm of problems being solved. But
the similarities will be greater than the differences, in that fabric clusters in academia operating at teraflops will be roughly equivalent to small fabrics in the enterprise; the superfabrics will use processor variants with bigger caches, perhaps extended superscalar architectures, and the latest, fastest SuperDuperHIPPI, while small fabrics at the department level will use last year's
variants. The point is that for the first time these
technologies are variants of one another, not altogether
different beasts. It means that one year’s superfabric
technology can directly become a commodity fabric technology
within years, not decades.
Q:
How do Computing Fabrics in residential neighborhoods
help manufacturers get closer to the consumer?
A: When the consumer is
literally embedded in massive processing many new things
become possible. The heart and soul of a company, caught
in its 3D multimedia knowledgebase, can be locally instantiated
for the consumer, allowing them to configure the products
and services based on the production and support capabilities
of the vendors or a cooperative of vendors. Such vendors
will virtually "fuse" their demo, warehousing, and production
spaces with the consumers’ space, sometimes multiple
vendors at a time in a comparison shopping and bid situation.
This is "getting close" to the consumer and demands
lots of power. Do you need this power to send the customer
an invoice? No, but that’s not the kind of cozying up
being considered.
Q:
Why should business begin making plans now for Computing
Fabrics when they’re still years away?
A: First, infrastructure, tooling, training, corporate structure, and capital investment cost big, very big, and can chain a company so tightly to the past that it can't get free of it. Second, businesses should consider fabrics in their business models (analyses and planning, not taking immediate action), which do look out beyond next Christmas. Technology
vendors will be the first who will need to come to grips
with fabrics so that they can exploit the trend rather
than become a casualty of it.
In retrospect, would organizations
have wanted to become aware of the PC back in the 70’s,
would minicomputer vendors have wanted to know about
distributed processing based on commodity microprocessors,
and would businesses have wanted an advance warning
of the coming web? The answers are trivially easy: yes,
yes, and yes.
Q:
Won't Computing Fabrics run into memory locality problems
and contention that their distributed shared memory
architecture can't address?
A: The Computing Fabrics
landscape will likely have very large ensembles of processors,
probably making today’s 64 and 128 processor machines
seem diminutive. So in this future let’s compare a cluster
of 16 Sun Starfire SMP servers, each with 64 processors,
to a single SGI Origin with 1,024 processors. If you’ve
got a problem that fits in the address space of a single
one of the Sun servers then you can claim uniform latencies.
But many of these huge arrays will handle huge problems
that only work nicely in a contiguous address space.
First, such problems are not going to run on the Sun cluster as is; the problem must be broken up into parts that can be distributed to each node (of 64 processors) in the cluster.
So let’s say you do just that. What about the latency
of travelling from a processor on one SMP to a processor
on a different SMP in the cluster? Is this latency going
to be uniform with the latency across the crossbar switch
in a single Sun server? No, it won't. In fact, it's likely
to far exceed the end-to-end latency in the 1,024 processor
Origin. So, if we say this is bad for the Origin then
we’ll have to say it’s doubly bad for the Sun, which
carries over to all loosely coupled clusters of SMPs.
All architectures make design decisions that involve tradeoffs. The question is not whether a particular architecture has deficiencies (they all do), but whether it makes wise tradeoffs given the kinds of applications it is known in advance to be used for, as well as the many areas to which it will ultimately be applied. In
pursuit of modularly extendable fabrics SMP just doesn’t
cut it, except at the nodes of the architecture, with
hardware and software used to maintain cache coherence
between these SMPs. This is fine because today's clusters of small SMPs are likely to evolve with fabric technology to become exactly what the SGI machines are becoming – commodity processors and all, but lots cheaper. That's one of the main points here: as the Origin architecture goes Intel, similar functionality will come to clusters of commodity machines. This is big news for the future of computing, as clusters become systems and systems become clusters, and the distinction disappears. What's the source of this revolution? It's the three key technologies identified herein – the distributed shared memory architecture, the rich modularly scalable interconnect, and the Cellular OS – on their way into Intel space.
Concerning non-parallelized code, if you mean code that does not take advantage of a multithreaded package, it won't take advantage of the multiple processors of an SMP either. Effective multithreading is all that's "minimally"
required to utilize an SMP, a CC-NUMA machine, or a
Computing Fabric. Now if by parallelization you mean
instead rearchitecting software to explicitly use MIMD,
SIMD, or vector parallelization of loops – that’s what
almost everyone is trying to avoid and why a single
address space, whether in an SMP or in CC-NUMA, is so
desired. Programming to a message passing model is vastly
different than programming to a shared memory model,
whether that memory is in reality centralized or distributed.
Programs based on message passing can be run sub-optimally
on an SMP, run less sub-optimally on some NUMA architectures,
and run very well on clusters and MPPs. Programs explicitly
written to an SMP model can run sub-optimally on CC-NUMA
but not at all on a cluster or MPP. Since programs such
as RDBMSes are for the most part (but not entirely)
programmed to the SMP model they will be better supported
and more easily ported to a fabric that supports distributed
shared memory (e.g., ccNUMA) than a message passing
cluster or MPP (note Oracle’s multi-year painful experience
in porting to the nCUBE2 MPP from a code base optimized
for Sequent SMP). Lastly, on contention and hotspotting,
these are minimized by the Cellular OS as well as the
hardware that provides cache coherency. To reiterate,
no architecture solves all problems free of side effects.
SMPs don’t. ccNUMA doesn’t. MPP sure doesn’t. Moving
to a loosely coupled approach using distributed objects
doesn't either; in fact it incurs multiple passes through the entire protocol stack unless layered on ST or VIA – talk about contention!
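For contrast with message passing, here is a minimal, purely illustrative sketch of the same kind of reduction written to the shared memory model using POSIX threads (an assumed example, not from any vendor's code). The threads simply read and write one address space, with a mutex around the shared accumulator, which is why effectively multithreaded code carries over to an SMP, a CC-NUMA machine, or a distributed shared memory fabric, while a message-passing port requires rearchitecting.

    /* Illustrative only: the same reduction written to the shared-memory
     * model with POSIX threads. All threads see one address space; the
     * only explicit coordination is a mutex around the accumulator. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static double total = 0.0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        double local = 0.0;

        /* Each thread handles a strided slice, but reads and writes
         * the same global memory as every other thread. */
        for (int i = (int)id; i < 1000; i += NTHREADS)
            local += (double)i;

        pthread_mutex_lock(&lock);
        total += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        printf("sum = %f\n", total);
        return 0;
    }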
Q:
How does the SGI Spider chip enable Craylink to function
without contention and without running into an N-squared
increase in system cost and complexity?
A: The Spider chip itself is non-blocking. Since each of its 6 ports supports 4 virtual channels, there can be contention in that a packet of recent age may wait while the on-chip arbiter enables an older packet to progress through the crossbar. This is a design decision, and alternatives are being explored for the next version of the Spider chip, which will also offer an increased number of ports (exact increase not yet disclosed). The fact that a totally non-blocking NxN switch requires on the order of N-squared Spider chips is not relevant to the design of the SGI Origin, as it does not use this architecture - it employs a variant of a hypercube.
The cost of the system
does scale close to linearly with a single discontinuity
when expanding from 64 to 128 processors, then the number
of Spider chips required resumes a linear increase.
In terms of backplane connections between Spider chips
these too increase in a direct linear relationship to
the number of processors. Only the number of Craylink
cables between Spider chips departs from strict linearity,
but only slightly so, in that doubling the number of
processors with this architecture from 128 to 256 ups
the required Craylink count from 112 to 236 rather than
224 – a minor departure. While bandwidth per processor
remains constant as the system is grown it is true that
latency increases between widely separated processors
– this is still NUMA after all and that is a design
tradeoff. However, actual measured latencies for widely separated processors in a large Origin 2000 are less than the "constant" latency between processors and memory in many SMPs.
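As a quick arithmetic check, taking the figures quoted above at face value (they are not independently verified here), the Craylink cables per processor rise only modestly when the machine doubles from 128 to 256 processors:

    /* Quick arithmetic check of the Craylink figures quoted above
     * (taken at face value from the text): cable count per processor
     * rises only modestly when the system doubles in size. */
    #include <stdio.h>

    int main(void)
    {
        int procs_small = 128, cables_small = 112;  /* figures from the text */
        int procs_large = 256, cables_large = 236;

        printf("128 procs: %.3f cables/processor\n",
               (double)cables_small / procs_small);        /* 0.875 */
        printf("256 procs: %.3f cables/processor\n",
               (double)cables_large / procs_large);        /* 0.922 */
        printf("departure from strict doubling: %.1f%%\n",
               100.0 * (cables_large - 2 * cables_small)
                     / (2.0 * cables_small));              /* ~5.4% */
        return 0;
    }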
Q:
Don’t hypercubes of the SGI Origin design decrease system
bandwidth with increasing size, meaning a fabric based
on this design cannot adequately scale?
A: The hierarchical fat
hypercube topology does not "scale down" as nodes are
added, provided that the metarouters used by the system
between hypercubes increase in dimension as the system
grows – precisely the pattern followed by SGI engineers.
For example, taking a 1,024 processor system built with
this architecture and cutting it in two (bisection)
yields 8 processors per connection, exactly the same
bisection bandwidth as a similarly connected system
of 16 or 32 processors. The key is to increase the dimensionality
of the metarouter as the system grows. For example,
a 128 processor Origin uses 8 2D metarouters (essentially
rings), but a 1,024 processor Origin will use 8 5D Hypercubical
metarouters. It should be added that this use of metarouters will be advantageous when the interconnects are distributed through a building, though that's 5 years off.
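To see why the bisection bandwidth per processor stays constant, here is a simplified, illustrative calculation that models the machine as a plain hypercube of routers with a fixed bristle factor of 4, ignoring the metarouter hierarchy and the change in bristle factor discussed below; the real Origin topology differs, but the constancy argument is the same.

    /* Illustrative simplification, not SGI's actual topology: model the
     * machine as a plain hypercube of routers, each "bristled" with 4
     * processors. Bisecting a d-dimensional hypercube of 2^d routers
     * cuts 2^(d-1) links, so processors per bisection link stays at
     * 2 * bristle (here 8) no matter how large the system grows. */
    #include <stdio.h>

    int main(void)
    {
        const int bristle = 4;                   /* processors per router  */

        for (int d = 2; d <= 8; d++) {           /* hypercube dimension    */
            int routers         = 1 << d;
            int processors      = routers * bristle;
            int bisection_links = 1 << (d - 1);  /* links cut by bisection */

            printf("%5d processors: %4d bisection links, %d procs per link\n",
                   processors, bisection_links, processors / bisection_links);
        }
        return 0;
    }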
As for bristling (or the
similar topology of cube connected cycles) this does
indeed impact bandwidth per processor, hence it is wise
to decide on the bristling factor before making design
decisions on the network. It appears that SGI has indeed
done their homework here. The bristle factor (the number
of processors supported by each router) begins at 4
processors per router for configurations of the Origin2000
up to and including a 64 processor configuration. But
beginning with 128 processors (and up once they begin
shipping) the bristling decreases to on average 2 processors
per router. Why? Because the metarouters proportionately
expand, with Spider chips in the metarouter joining
up with Spider chips on node boards to form virtual
super routers of the fat hierarchical hypercube. Bottom
line: bandwidth scales linearly with system size (with
one discontinuity), latency is non-uniform by architectural
choice, and cost scales so close to linearly with size
that it deserves to be called linear scaling.