Parallel programming model


In computer science, a parallel programming model is an abstraction of parallel computer architecture, with which it is convenient to express algorithms and their composition in programs. The value of a programming model can be judged on its generality: how well a range of different problems can be expressed for a variety of different architectures, and on its performance: how efficiently the compiled programs can execute[1]. A parallel programming model may be implemented as a library invoked from a "sequential" programming language, as an extension to an existing language, or as an entirely new language.

Consensus around a particular programming model is important because it allows different parallel computers to be built that support the model, thereby facilitating the portability of software. In this sense, programming models are referred to as bridging models between hardware and software[2].

Classification of parallel programming models

Classifications of parallel programming models can be divided broadly into two areas: process interaction and problem decomposition[3][4][5].

Process interaction

Process interaction relates to the mechanisms by which parallel processes are able to communicate with each other. The most common forms of interaction are shared memory and message passing.

Shared memory

Shared memory is an efficient means of passing data between processes. In a shared-memory model, parallel processes share a global address space that they read from and write to asynchronously. Asynchronous concurrent access can lead to race conditions, and mechanisms such as semaphores and monitors are used to avoid them. Conventional multi-core processors directly support shared memory, and many parallel programming languages and libraries are designed to exploit it, such as Cilk, OpenMP and Threading Building Blocks, which adopt the fork–join model[6].
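
The fork–join, shared-memory style can be illustrated with a minimal OpenMP sketch (assuming an OpenMP-capable compiler, e.g. built with -fopenmp); here the reduction clause is what avoids the race condition that naive concurrent updates of sum would otherwise cause:

    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main() {
        std::vector<double> a(1000000, 1.0);
        double sum = 0.0;

        // Fork: the runtime spawns a team of threads, each handling a chunk of
        // the iteration space. Join: all threads synchronize at the implicit
        // barrier at the end of the parallel region.
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < (long)a.size(); ++i)
            sum += a[i];  // per-thread partial sums, combined at the join

        std::printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
        return 0;
    }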

Message passing

In a message-passing model, parallel processes exchange data by passing messages to one another. These communications can be asynchronous, where a message can be sent before the receiver is ready, or synchronous, where the receiver must be ready before the message is sent. The communicating sequential processes (CSP) formalisation of message passing uses synchronous communication channels to connect processes, and led to important languages such as Occam, Limbo and Go. In contrast, the actor model uses asynchronous message passing and has been employed in the design of languages such as D, Scala and SALSA[7].
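
As a rough illustration of the asynchronous style (not tied to any of the languages named above), the sketch below builds a toy actor-like mailbox with standard C++ threads; the sender never blocks waiting for the receiver, which is the defining contrast with a synchronous CSP rendezvous:

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // A toy unbounded mailbox: senders enqueue and return immediately
    // (asynchronous); the receiving "actor" drains messages in its own thread.
    class Mailbox {
        std::queue<std::string> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void send(std::string msg) {                 // never blocks the sender
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(msg)); }
            cv_.notify_one();
        }
        std::string receive() {                      // blocks until a message arrives
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty(); });
            std::string msg = std::move(q_.front());
            q_.pop();
            return msg;
        }
    };

    int main() {
        Mailbox box;
        std::thread actor([&] {                      // the receiving process
            for (int i = 0; i < 2; ++i)
                std::cout << "got: " << box.receive() << '\n';
        });
        box.send("hello");                           // fire-and-forget sends
        box.send("world");
        actor.join();
        return 0;
    }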

Distributed memory

Distributed memory refers to a class of multiprocessor computer systems in which each processor has its own private memory, so that computational tasks can only operate on local data; if remote data are required, a task must communicate with one or more remote processors. The key issue in programming distributed-memory systems is how to distribute the data over the memories: depending on the problem being solved, data can be distributed statically or moved between nodes, and data can be moved on demand or pushed to new nodes in advance.

MPI specifies a communication protocol for distributed-memory systems and supports both point-to-point and collective communication. MPI is also a message-passing API, together with protocol and semantic specifications of how its features must behave in any implementation[8]. MPI's goals are high performance, scalability and portability, and it remains the dominant model in high-performance computing today[9]. In addition, the partitioned global address space model supports one-sided communication.
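
A minimal point-to-point MPI sketch (assuming an MPI implementation such as MPICH or Open MPI, compiled with mpicxx and launched with mpirun) might look as follows; rank 0 sends a value that rank 1 receives:

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's id
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes

        if (rank == 0 && size > 1) {
            int payload = 42;
            // Blocking point-to-point send to rank 1, message tag 0.
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int payload = 0;
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %d from rank 0\n", payload);
        }

        MPI_Finalize();
        return 0;
    }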

Problem decomposition

A parallel program is composed of simultaneously executing processes. Problem decomposition relates to the way in which the constituent processes are formulated[10][11].

Data parallelism

A data-parallel model focuses on the data set on which the operations are performed, typically a regularly structured array. A set of tasks operates on these data, but independently, each on a disjoint partition. Data parallelism usually corresponds to the SPMD programming model[12]; the corresponding execution models in Flynn's taxonomy are SIMD (e.g. the AVX extensions) or MIMD (e.g. the Xeon Phi)[13], as well as the SIMT (single instruction, multiple threads) model adopted by GPGPUs[14] (e.g. NVIDIA Tesla).
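
The SPMD idea of running the same code over disjoint partitions of an array can be sketched with plain C++ threads (the block decomposition here is hand-written for illustration; a real SPMD framework would handle the distribution itself):

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t N = 1 << 20;
        const unsigned p = std::max(1u, std::thread::hardware_concurrency());
        std::vector<float> data(N, 2.0f);

        // Same program, multiple data: every worker runs the identical loop body
        // on its own disjoint block of roughly N/p elements.
        auto worker = [&](std::size_t begin, std::size_t end) {
            for (std::size_t i = begin; i < end; ++i)
                data[i] = data[i] * data[i];
        };

        std::vector<std::thread> team;
        for (unsigned t = 0; t < p; ++t) {
            std::size_t begin = N * t / p, end = N * (t + 1) / p;
            team.emplace_back(worker, begin, end);
        }
        for (auto& th : team) th.join();

        std::cout << "sum = " << std::accumulate(data.begin(), data.end(), 0.0) << '\n';
        return 0;
    }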

Task parallelism

A task-parallel model focuses on processes, or threads of execution. These processes are often behaviourally distinct, which emphasises the need for communication. Task parallelism is a natural way to express message-passing communication. Task parallelism usually corresponds to the MPMD programming model, which differs from SPMD in the kinds of problems it suits rather than in the execution model[15].
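
A small sketch of task parallelism with two behaviourally distinct tasks, using C++ std::async as a toy stand-in for a real task-parallel runtime:

    #include <algorithm>
    #include <functional>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Two distinct tasks that can run concurrently and are later combined.
    static double sum_of(const std::vector<double>& v) {
        return std::accumulate(v.begin(), v.end(), 0.0);
    }
    static double max_of(const std::vector<double>& v) {
        double m = v.front();
        for (double x : v) m = std::max(m, x);
        return m;
    }

    int main() {
        std::vector<double> v(1000000);
        std::iota(v.begin(), v.end(), 1.0);

        // Launch each task on its own thread of execution.
        auto total = std::async(std::launch::async, sum_of, std::cref(v));
        auto peak  = std::async(std::launch::async, max_of, std::cref(v));

        // "Communication" here is simply retrieving each task's result.
        std::cout << "sum = " << total.get() << ", max = " << peak.get() << '\n';
        return 0;
    }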

Implicit parallelism

In contrast to the explicit parallelism above, implicit parallelism exposes nothing to the programmer: the compiler, the runtime system or the hardware is responsible for the parallelism. For example, in compilers, automatic parallelization is the process of transforming sequential code into parallel code; in computer architecture, superscalar execution is a mechanism that exploits instruction-level parallelism to perform operations in parallel, and because of its inherent limits it is exploited in practice in the form of simultaneous multithreading / hyper-threading.

With implicit parallelism, process interaction is not visible to the programmer at all; instead, the compiler and/or runtime system is responsible for performing it. Because functional programming languages have no side effects, they allow non-dependent functions to execute in parallel, and a great deal of research has experimented with this[16]. The class of dataflow programming languages represented by SISAL and SAC are functional languages capable of efficient implicit parallelism; their key characteristics are single assignment and array orientation.

Terminology

Parallel models of computation form a broad category of models of computation, including: cellular automata, the PRAM machine, the LogP machine, Petri nets, Kahn process networks, interaction nets and others. A model of computation is an abstraction used to analyse the cost of computational processes; analysing the performance of parallel algorithms with one makes the analysis independent of the variations particular to specific implementations and technologies. Parallel algorithms are generally written for a particular model of computation; pseudocode written for a PRAM machine usually adopts some for-loop-style parallel programming construct[17].
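
For instance, the PRAM-style "for ... pardo" assignment A(i) := B(i) quoted in reference [17] maps naturally onto a parallel loop in a real shared-memory API; a minimal OpenMP rendering, for illustration only:

    #include <vector>

    // PRAM pseudocode:  for Pi, 1 <= i <= n pardo: A(i) := B(i)
    // One possible concrete rendering as an OpenMP parallel loop.
    void copy_parallel(std::vector<int>& A, const std::vector<int>& B) {
        #pragma omp parallel for
        for (long i = 0; i < (long)A.size(); ++i)
            A[i] = B[i];   // each iteration is independent, as in the PRAM model
    }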

A programming model refers to a style of programming in which execution is invoked through what appear to be ordinary calls. Examples include the POSIX Pthreads library and MapReduce in Apache Hadoop. In both cases, the execution model follows the syntax of the language in which the library is used, but cannot be understood in terms of that language's semantics alone. Unlike a model of computation, a programming model specifically implies practical considerations of the hardware or software implementation[18].

In parallel computing, the execution model often must expose features of the hardware in order to achieve high performance. The large variety of parallel hardware leads to a need for a similarly large number of parallel execution models. It is impractical to create a new language for every execution model, so a common practice is to invoke the behaviour of a parallel execution model through an API. A parallel programming language may be based on one programming model or on a combination of them. For example, High Performance Fortran is based on shared-memory interaction and data-parallel problem decomposition, while Go provides both shared-memory interaction and message-passing interaction.

Parallel programming models

The programming models listed here are abstract models of a computer that can be called bridging models[2]: each provides a bridge between the physical implementation of a machine and the abstraction of that machine available to the programmer; in other words, it is intended to provide a common level of understanding between hardware and software engineers. A successful programming model can be implemented effectively on real machines and effectively targeted by programmers; in particular, it should be possible for a compiler of a typical high-level language to generate good code. From the programmer's point of view, such a bridging parallel programming model typically sits above Pthreads, IPC, MPI and the like, and below OpenMP, OpenACC and the like.

Name | Interaction class | Decomposition class | Example implementations
Fork–join model | Shared memory | Data | Cilk, OpenMP, Threading Building Blocks[6]
Bulk synchronous parallel model | Shared memory | Data[19] | BSPlib[20] (MulticoreBSP[21], BSPonMPI[22]), Apache Giraph[23], Apache Hama[24]
Partitioned global address space model | Distributed memory | Data | SHMEM[25], Coarray Fortran, Chapel, X10
Communicating sequential processes | Synchronous message passing | Task | Occam, Ada, Go
Actor model | Asynchronous message passing | Task | Erlang[26], D, Scala, and many framework implementations
Heterogeneous computing frameworks | Shared memory | Data | OpenCL, CUDA[27], SYCL, Pocl[28]
Stream processing / dataflow paradigm | Pipeline | Data[1] | Brook[29], Apache Flink, RaftLib, TensorFlow, Lustre
High-level synthesis (electronic design automation) | Channels | Task | C to HDL, Handel-C, SystemC

OpenCL views a computing system as consisting of a number of "compute devices", which may be CPUs or "accelerators" such as GPUs. It defines a C-like language for writing programs. Functions executed on an OpenCL device are called "kernels"[30]:17. A single compute device typically consists of several "compute units", which in turn comprise many "processing elements" (PEs). A single kernel execution can run in parallel on all or many of the PEs. OpenCL defines an API that allows programs running on the host to launch kernels on the compute devices and to manage device memory, which is, at least conceptually, separate from host memory. Programs written in the OpenCL language are intended to be compiled just-in-time, so that applications using OpenCL are portable across implementations for a variety of devices[31].
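
A heavily abridged host-side sketch of this flow (device discovery, JIT compilation of a kernel string, buffer setup, kernel launch), written against the OpenCL 1.x C API with all error handling omitted for brevity; a real program would check every returned status code:

    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    // OpenCL C source for the kernel, compiled at run time (JIT) by the driver.
    static const char* kSrc =
        "__kernel void vec_add(__global const float* a, __global const float* b,"
        "                      __global float* c) {"
        "    size_t i = get_global_id(0);"
        "    c[i] = a[i] + b[i];"
        "}";

    int main() {
        const size_t n = 1024, bytes = n * sizeof(float);
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        cl_platform_id platform; cl_device_id device; cl_int err;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

        cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        // Device memory is managed explicitly, separately from host memory.
        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, a.data(), &err);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, b.data(), &err);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, nullptr, &err);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, &err);
        clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);   // JIT compile
        cl_kernel k = clCreateKernel(prog, "vec_add", &err);

        clSetKernelArg(k, 0, sizeof(cl_mem), &da);
        clSetKernelArg(k, 1, sizeof(cl_mem), &db);
        clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

        // One work-item per element; the runtime maps them onto the PEs.
        size_t global = n;
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c.data(), 0, nullptr, nullptr);

        std::printf("c[0] = %f\n", c[0]);   // expect 3.0
        return 0;
    }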

FPGAs can be used to solve any computable problem, which is easily proven by the fact that an FPGA can implement a soft microprocessor. Their advantage is that they are significantly faster for some applications, because of their parallel nature and their optimisation of the number of logic gates used for a particular process[32]. In recent years, OpenCL programming has begun to be used to exploit the performance and power efficiency that FPGAs provide. OpenCL allows programmers to write code in the C language and to target FPGA functions as OpenCL compute kernels using OpenCL constructs[33].

MapReduce is a programming model and associated implementation for processing and generating big data sets in the form of key/value pairs with a parallel, distributed algorithm on a cluster[34]. In Apache Hadoop it is implemented separately from HDFS[35]. MapReduce is inspired by the map and reduce functions commonly used in functional programming[36], but their purpose in the MapReduce framework is not the same as in their original forms[37].
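
A toy, in-memory word-count sketch of the model's two phases (illustrative only, not the Hadoop API): map emits (word, 1) pairs, shuffling groups them by key, and reduce merges each group:

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<std::string> documents = {"a rose is a rose", "is a rose"};

        // Map phase: emit an intermediate (key, value) pair per word.
        std::vector<std::pair<std::string, int>> intermediate;
        for (const auto& doc : documents) {
            std::istringstream words(doc);
            std::string w;
            while (words >> w) intermediate.emplace_back(w, 1);
        }

        // Shuffle phase: group intermediate values by key.
        std::map<std::string, std::vector<int>> groups;
        for (const auto& kv : intermediate) groups[kv.first].push_back(kv.second);

        // Reduce phase: merge all values that share a key.
        for (const auto& g : groups) {
            int count = 0;
            for (int v : g.second) count += v;
            std::cout << g.first << ": " << count << '\n';
        }
        return 0;
    }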

There are many other parallel programming models. For example, based on the PRAM model of computation, the University of Maryland, College Park built an explicit multi-threading (XMT) multiprocessor computer exploiting instruction-level parallelism, together with the XMTC programming language[38], realising the spawn–join paradigm[39].

See also

References

  1. ^ 1.0 1.1 Skillicorn, David B., "Models for practical parallel computation", International Journal of Parallel Programming, 20.2 133–158 (1991), https://www.ida.liu.se/~chrke55/papers/modelsurvey.pdf (archived at the Internet Archive)
  2. ^ 2.0 2.1 Leslie G. Valiant. A bridging model for parallel computation (PDF). Communications of the ACM. August 1990, 33 (8): 103–111 [2019-12-05]. (Archived (PDF) from the original on 2019-08-11.)
  3. ^ John E. Savage, Models of Computation: Exploring the Power of Computing, 2008, Chapter 7 (Parallel Computation), http://cs.brown.edu/~jes/book/ (archived at the Internet Archive)
  4. ^ Ian Foster, Designing and Building Parallel Programs, 1995, Section 1.3, "A Parallel Programming Model", http://www.mcs.anl.gov/~itf/dbpp/text/node9.html (archived at the Internet Archive)
  5. ^ Blaise Barney, Introduction to Parallel Computing, "Models", 2015, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#Models (archived at the Internet Archive)
  6. ^ 6.0 6.1 Michael McCool; James Reinders; Arch Robison. Structured Parallel Programming: Patterns for Efficient Computation (PDF). Elsevier. 2013 [2019-12-12]. (Archived (PDF) from the original on 2018-11-23.) Threading Building Blocks (TBB) supports fork–join with a work-stealing load balancer as its basic model. In contrast with Cilk Plus, TBB is a portable ISO C++ library, not a compiler extension. Because TBB is not integrated into the compiler, its fork–join implementation is not quite as efficient as Cilk Plus, and it cannot directly generate vectorized code. However, TBB also provides implementations of several patterns not available directly in Cilk Plus, such as pipelines.
  7. ^ SALSA (archived at the Internet Archive)
  8. ^ Gropp, William; Lusk, Ewing; Skjellum, Anthony. A High-Performance, Portable Implementation of the MPI Message Passing Interface. Parallel Computing. 1996, 22 (6): 789–828. doi:10.1016/0167-8191(96)00024-5. 
  9. ^ Sur, Sayantan; Koop, Matthew J.; Panda, Dhabaleswar K. MPI and communication---High-performance and scalable MPI over InfiniBand with reduced memory usage. High-performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-depth Performance Analysis. ACM. 4 August 2017: 105. ISBN 978-0769527000. doi:10.1145/1188455.1188565. 
  10. ^ Ian Foster, Designing and Building Parallel Programs, 1995, Section 2.2, "Partitioning", http://www.mcs.anl.gov/~itf/dbpp/text/node16.html (archived at the Internet Archive)
  11. ^ Blaise Barney, Introduction to Parallel Computing, "Partitioning", 2015, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#DesignPartitioning (archived at the Internet Archive)
  12. ^ Blaise Barney. Introduction to Parallel Computing. [2019-12-02]. (Archived from the original on 2019-12-24.) SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid. 
  13. ^ Flynn, Michael J. Some Computer Organizations and Their Effectiveness (PDF). IEEE Transactions on Computers. September 1972, C–21 (9): 948–960 [2019-12-05]. doi:10.1109/TC.1972.5009071. (Archived (PDF) from the original on 2022-04-11.)
    2) The single-instruction stream-multiple-data stream (SIMD), which includes most array processes, including Solomon[2] (Illiac IV).
    3) Multiple-instruction stream-single-data stream type organizations (MISD), which include specialized streaming organizations using multiple-instruction streams on a single sequence of data and the derivatives thereof. The plug-board machines of a bygone era were a degenerate form of MISD wherein the instruction streams were single instructions, and a derived datum (SD) passed from program step i to program step i+1 (MI).
    4) Multiple-Instruction stream-multiple-data stream (MIMD), which include organizations referred to as "multiprocessor." Univac[3], among other corporations, has proposed various MIMD structures.
     
  14. ^ Michael McCool; James Reinders; Arch Robison. 2.4.3 Flynn's Characterization. Structured Parallel Programming: Patterns for Efficient Computation (PDF). Elsevier. 2013: 52 [2019-12-12]. (Archived (PDF) from the original on 2018-11-23.) There is another related classification used especially by GPU vendors: Single Instruction, Multiple Threads (SIMT). This corresponds to a tiled SIMD architecture consisting of multiple SIMD processors, where each SIMD processor emulates multiple "threads" (fibers in our terminology) using masking. SIMT processors may appear to have thousands of threads, but in fact blocks of these share a control processor, and divergent control flow can significantly reduce efficiency within a block. On the other hand, synchronization between fibers is basically free, because when control flow is emulated with masking the fibers are always running synchronously. 
    Parallel Thread Execution ISA Version 6.5. https://docs.nvidia.com/. [2019-12-18]. (Archived from the original on 2019-12-01.)
    On architectures prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
    Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another.
     
  15. ^ Blaise Barney. Introduction to Parallel Computing. [2019-12-02]. (Archived from the original on 2019-12-24.) MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid. ……MPMD applications are not as common as SPMD applications, but may be better suited for certain types of problems, particularly those that lend themselves better to functional decomposition than domain decomposition (discussed later under Partitioning). 
  16. ^ McBurney, D. L., and M. Ronan Sleep. "Transputer-based experiments with the ZAPP architecture." PARLE Parallel Architectures and Languages Europe. Springer Berlin Heidelberg, 1987.
    Hammond, Kevin. Parallel functional programming: An introduction. In International Symposium on Parallel Symbolic Computation, p. 46. 1994. "The Alfalfa project implemented serial combinators for the Intel iPSC[44]. …… Buckwheat re-implemented this model for the Encore Multimax, a shared-memory multiprocessor. ……[45]."
  17. ^ Blelloch, Guy. Programming Parallel Algorithms (PDF). Communications of the ACM. 1996, 39 (3): 85–97 [2019-12-12]. doi:10.1145/227234.227246. (Archived (PDF) from the original on 2016-03-04.) An important advance in parallel computing was the introduction of the notion of virtual models. …… Virtual models can be taken a step further and used to define performance in more abstract measures than just running time on a particular machine. A pair of such measures are work and depth: Work is defined as the total number of operations executed by a computation, and depth is defined as the longest chain of sequential dependencies in the computation. 
    Siddhartha Chatterjee, Jan Prins. Parallel and Distributed Computing PRAM Algorithms (PDF). Spring 2002 [2019-12-06]. (Archived (PDF) from the original on 2019-12-06.) The barebones PRAM model is low-level and cumbersome, and writing anything other than trivial algorithms in this model is a nightmare. We will therefore switch to an equivalent but higher-level abstraction called the Work-Time (WT) paradigm to be independent of these details. …… In our algorithmic notation, we will use the forall construct to denote such concurrent operations, and we drop explicit mention of the processor id and p, the number of processors. In fact the forall construct is the only construct that distinguishes a WT algorithm from a sequential algorithm. 
    Vishkin, Uzi, Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 104 pages (PDF), Class notes of courses on parallel algorithms taught since 1992 at the University of Maryland, College Park, Tel Aviv University and the Technion, 2009 [2019-12-07], (Archived (PDF) from the original on 2017-12-15.) For concreteness, we proceed to an example of a PRAM algorithm. However, before doing this we present the pardo "programming construct", which is heavily used in these notes to express operations that are performed in parallel:
        for Pi , 1 ≤ i ≤ n pardo
            A(i) := B(i)
    This means that the following n operations are performed concurrently: processor P1 assigns B(1) into A(1), processor P2 assigns B(2) into A(2), and so on. …… An alternative model which is actually an alternative presentation mode, called Work-Depth, …… Strictly speaking, WD actually defines a slightly different model of computation. Consider an instruction of the form
        for i , 1 ≤ i ≤ α pardo
          A(i) := B(C(i))
    where the time unit under consideration, consists of a sequence of α concurrent instructions, for some positive integer α.
     
  18. ^ Skillicorn, David B., and Domenico Talia, Models and languages for parallel computation, ACM Computing Surveys, 30.2 123–169 (1998), https://www.cs.utexas.edu/users/browne/CS392Cf2000/papers/ModelsOfParallelComputation-Skillicorn.pdf (archived at the Internet Archive)
  19. ^ Jonathan Hill, Bill McColl, Dan Stefanescu, Mark Goudreau, Kevin Lang, Satish Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling. BSP Programming Library (PDF). Parallel Computing. December 1998, 24 (14): 1947–1980 [2019-12-05]. (Archived (PDF) from the original on 2019-11-04.) Like many other communications libraries, BSPlib adopts a Single Program Multiple Data (SPMD) programming model. The task of writing an SPMD program will typically involve mapping a problem that manipulates a data structure of size N into p instances of a program that each manipulate an N/p sized block of the original domain. The role of BSPlib is to provide the infrastructure required for the user to take care of the data distribution, and any implied communication necessary to manipulate parts of the data structure that are on a remote process. 
  20. ^ BSPlib (archived at the Internet Archive)
  21. ^ MulticoreBSP (archived at the Internet Archive)
  22. ^ BSPonMPI (archived at the Internet Archive)
  23. ^ Introduction to Apache Giraph. [2019-12-09]. (Archived from the original on 2019-12-19.) Apache Giraph is an iterative graph processing framework, built on top of Apache Hadoop. The input to a Giraph computation is a graph composed of vertices and directed edges, …… Computation proceeds as a sequence of iterations, called supersteps in BSP. ……There is a barrier between consecutive supersteps. 
  24. ^ Apache Hama. [2019-12-09]. (Archived from the original on 2019-12-18.) Apache Hama(TM) is a framework for Big Data analytics which uses the Bulk Synchronous Parallel (BSP) computing model, ……It provides not only pure BSP programming model but also vertex and neuron centric programming models, inspired by Google's Pregel and DistBelief. 
  25. ^ OpenSHMEM. [2019-12-09]. (Archived from the original on 2019-12-09.) The Programming Models and Languages team is focused on developing the OpenSHMEM programming model for extreme scale systems. ……Currently, the team partners with NVIDIA, University of Tennessee, Knoxville, Florida State University and Paratools. …… UCX provides communication interfaces, and protocols for efficiently implementing parallel programming models such as MPI, OpenSHMEM, and task-based models. 
  26. ^ Armstrong, Joe. Erlang. Communications of the ACM. September 2010, 53 (9): 68–75 [2020-05-07]. doi:10.1145/1810891.1810910. (Archived from the original on 2020-06-09.) Erlang is conceptually similar to the occam programming language, though it recasts the ideas of CSP in a functional framework and uses asynchronous message passing instead of the synchronous message passing in CSP. 
  27. ^ CUDA Programming Guide 1.0 (PDF). 6/23/2007 [2019-12-09]. (Archived (PDF) from the original on 2018-04-17.) CUDA stands for Compute Unified Device Architecture and is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device without the need of mapping them to a graphics API. 
    Apple to Ship Mac OS X Snow Leopard on August 28. August 24, 2009 [2019-12-09]. (Archived from the original on 2019-12-09.) OpenCL, a C-based open standard, allows developers to tap the incredible power of the graphics processing unit for tasks that go beyond graphics. 
    September 2009 OpenCL Public Downloads. [2019-12-09]. (Archived from the original on 2019-12-09.) Note: OpenCL drivers have been included in all publicly available NVIDIA drivers since October 2009. The CUDA Toolkit now includes the Visual Profiler for OpenCL as well as the OpenCL Programming Guide, Best Practices Guide, and other developer documentation. 
    MPI Solutions for GPUs. [2019-12-09]. (Archived from the original on 2019-12-09.) MPI is fully compatible with CUDA, CUDA Fortran, and OpenACC, all of which are designed for parallel computing on a single computer or node. 
    OpenSHMEM Communication on GPUs. [2019-12-09]. (Archived from the original on 2019-12-09.) NVSHMEM implements the OpenSHMEM standard for GPU memory, with extensions for improved performance on GPUs. 
  28. ^ Pocl (archived at the Internet Archive)
  29. ^ Brook Language. [2019-12-09]. (Archived from the original on 2019-11-19.) Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language. The general computational model, referred to as streaming, ……A stream is a new data type addition which represents a collection of data which can be operated on in parallel. ……Kernels are special functions that operate on streams. A kernel is a parallel function applied to every element of the input streams. 
  30. ^ Howes, Lee. The OpenCL Specification Version: 2.1 Document Revision: 23 (PDF). Khronos OpenCL Working Group. November 11, 2015 [November 16, 2015]. (Archived (PDF) from the original on 2015-11-18.)
  31. ^ Stone, John E.; Gohara, David; Shi, Guochin. OpenCL: a parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering. 2010, 12 (3): 66–73. Bibcode:2010CSE....12c..66S. PMC 2964860 (free full text). PMID 21037981. doi:10.1109/MCSE.2010.69. 
  32. ^ Xilinx Inc, Form 8-K, Current Report, Filing Date Apr 26, 2006. secdatabase.com. [May 6, 2018]. (Archived from the original on 2018-05-07.)
  33. ^ Why use OpenCL on FPGAs?. StreamComputing. 2014-09-16 [2019-12-06]. (Archived from the original on 2017-01-01.)
  34. ^ MapReduce: Simplified Data Processing on Large Clusters (PDF). googleusercontent.com. [2019-12-05]. (Archived (PDF) from the original on 2020-06-15.) MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. 
  35. ^ MapReduce Tutorial. Apache Hadoop. [3 July 2019]. (Archived from the original on 2019-12-14.)
  36. ^ MapReduce: Simplified Data Processing on Large Clusters (PDF). googleusercontent.com. [2019-12-05]. (Archived (PDF) from the original on 2020-06-15.) Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance. 
  37. ^ Lämmel, R. Google's MapReduce programming model — Revisited (PDF). Science of Computer Programming. 2008, 70: 1–30 [2019-12-07]. doi:10.1016/j.scico.2007.07.001. (Archived (PDF) from the original on 2019-12-07.) The following overview lists more detailed questions and summarizes our findings:
    Is MAP essentially the map combinator? NO.
    Is MAP essentially an application of the map combinator? NO.
    Does MAP essentially serve as the argument of map? YES.
    Is REDUCE essentially the reduce combinator? NO.
    Is REDUCE essentially an application of the reduce combinator? TYPICALLY.
    Does REDUCE essentially serve as the argument of reduce? NO.
    Does REDUCE essentially serve as the argument of map? YES.
    Hence, there is no trivial correspondence between MapReduce's MAP&REDUCE and what is normally called map&reduce in functional programming.
     
  38. ^ Vishkin, Uzi, Using simple abstraction to reinvent computing for parallelism (PDF), Communications of the ACM, 2011, 54: 75–85, doi:10.1145/1866739.1866757, The array exchange serial algorithm serially iterates the standard exchange algorithm n times. Its pseudo-code follows.
    For i = 0 to n−1 do
        X := A(i); A(i) := B(i); B(i) := X
    …… A parallel array exchange algorithm uses an auxiliary array X[0..n-1] of size n, the parallel algorithm applies concurrently the iterations of the above serial algorithm, each exchanging A(i) with B(i) for a different value of i. Note the new pardo command in the following pseudo-code.
    For i = 0 to n−1 pardo
        X(i) := A(i); A(i) := B(i); B(i) := X(i)
    …… Consider the following example of a small XMTC program for the parallel exchange algorithm of the previous section:
    spawn(0, n−1) {
        var x
        x := A($); A($) := B($); B($) := x
    }
    The program simply spawns a concurrent thread for each of the depth-3 serial exchange iterations, using a local variable x. Note that the join command is implicit, and implied by the right parenthesis at the end of the above program. …… Commitments to silicon of XMT include a 64-processor, 75MHz computer based on field-programmable gate array (FPGA) technology [29], and 64-processor ASIC 10mm X 10mm chip using IBM's 90nm technology, pictured in Figure 5. A basic yet stable compiler has also been developed. …… XMT is easy to build. A single graduate student, with no prior design experience, completed the XMT hardware description (in Verilog) in just over 2 year.
     
  39. ^ Vishkin, Uzi; Dascal, Shlomit; Berkovich, Efraim; Nuzman, Joseph, Explicit Multi-Threading (XMT) bridging models for instruction parallelism, Proc. 1998 ACM Symposium on Parallel Algorithms and Architectures (SPAA): 140–151, 1998 [2019-12-12], (Archived from the original on 2017-09-21.) This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm …… . …… We actually suggest two alternative bridging models: (1) Spawn-based multi-threading (Spawn-MT): This is based on (asynchronous, or synchronous) nesting-free Spawn-Join commands. This is the main model for the presentation in this paper. (2) Elastic multi-threading (EMT): This is more involved model concept enables nesting of Spawn-Join commands (similar to NESL) and relies on the ability of the hardware to suspend, and later reactive, threads. 

Further reading