divide and conquer with multiprocessing

rasbt · rasbt · commit 1d6dcf8d52c9 · 2016-06-23T20:14:08.000-04:00
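
The notebook cells added below split the list once and run `majority_ele_lin` on each half. As a point of comparison, here is a minimal sketch of a fully recursive variant, assuming positive integers so that -1 can serve as the "no majority" sentinel. The name `majority_ele_rec` and its `lo`/`hi` parameters are hypothetical and not part of this commit.

```python
def majority_ele_rec(lst, lo=0, hi=None):
    """Return the majority element of lst[lo:hi], or -1 if there is none.

    Hypothetical helper for illustration; assumes -1 never occurs in lst.
    """
    if hi is None:
        hi = len(lst)
    n = hi - lo
    if n == 0:
        return -1
    if n == 1:
        return lst[lo]

    # Divide: solve each half of the current range recursively.
    mid = lo + n // 2
    left = majority_ele_rec(lst, lo, mid)
    right = majority_ele_rec(lst, mid, hi)

    # Combine: a majority element of the whole range must be the majority
    # element of at least one half, so only two candidates need counting.
    for cand in (left, right):
        if cand != -1:
            count = sum(1 for i in range(lo, hi) if lst[i] == cand)
            if count > n // 2:
                return cand
    return -1


if __name__ == "__main__":
    for lst in ([], [1, 2, 4, 4, 4, 5], [4, 2, 4, 4, 4, 5], [0, 0, 2, 2, 2]):
        print(lst, '->', majority_ele_rec(lst))
```

Each level of the recursion recounts its two candidates over the current range, so this sketch does O(n log n) work rather than the single counting pass of `majority_ele_lin`.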
diff --git a/ipython_nbs/essentials/divide-and-conquer-algorithm-intro.ipynb b/ipython_nbs/essentials/divide-and-conquer-algorithm-intro.ipynb
@@ -192,6 +192,290 @@
     "    print(binary_search(lst=lst, item=k))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example 2 -- Finding the Majority Element"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\"Finding the Majority Element\" is a problem where we want to find an element in an array of positive integers of length *n* that occurs more than *n/2* times in that array. For example, if we have an array $a = [1, 2, 3, 3, 3]$, $3$ would be the majority element. In another array, $b = [1, 2, 3, 3]$, there is no majority element, since $2$ (the count of element $3$) is not greater than $n / 2 = 2$.\n",
+    "\n",
+    "Let's start with a simple implementation where we count how often each unique element occurs in the array. Then, we return the element that meets the criterion \"$\\text{occurrences } > n / 2$\"; if no such element exists, we return -1. Note that the function returns a tuple of three items, (element, number_occurrences, count_dictionary), which we will use later. If there is no majority element, the tuple is (-1, -1, count_dictionary)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[] -> -1\n",
+      "[1, 2, 3, 4, 4, 5] -> -1\n",
+      "[1, 2, 4, 4, 4, 5] -> -1\n",
+      "[4, 2, 4, 4, 4, 5] -> 4\n",
+      "[5, 4, 4, 4, 2, 4] -> 4\n",
+      "[2, 3, 9, 2, 2] -> 2\n",
+      "[2, 2, 9, 3, 2] -> 2\n",
+      "[0, 0, 2, 2, 2] -> 2\n",
+      "[2, 2, 2, 0, 0] -> 2\n"
+     ]
+    }
+   ],
+   "source": [
+    "def majority_ele_lin(lst):  \n",
+    "    cnt = {}\n",
+    "    for ele in lst:\n",
+    "        if ele not in cnt:\n",
+    "            cnt[ele] = 1\n",
+    "        else:\n",
+    "            cnt[ele] += 1\n",
+    "    for ele, c in cnt.items():\n",
+    "        if c > (len(lst) // 2):\n",
+    "            return (ele, c, cnt)\n",
+    "    return (-1, -1, cnt)\n",
+    "\n",
+    "###################################################\n",
+    "\n",
+    "lst0 = []\n",
+    "print(lst0, '->', majority_ele_lin(lst=lst0)[0])\n",
+    "\n",
+    "lst1 = [1, 2, 3, 4, 4, 5]\n",
+    "print(lst1, '->', majority_ele_lin(lst=lst1)[0])\n",
+    "\n",
+    "lst2 = [1, 2, 4, 4, 4, 5]\n",
+    "print(lst2, '->', majority_ele_lin(lst=lst2)[0])\n",
+    "\n",
+    "lst3 = [4, 2, 4, 4, 4, 5]\n",
+    "print(lst3, '->', majority_ele_lin(lst=lst3)[0])\n",
+    "print(lst3[::-1], '->', majority_ele_lin(lst=lst3[::-1])[0])\n",
+    "\n",
+    "lst4 = [2, 3, 9, 2, 2]\n",
+    "print(lst4, '->',majority_ele_lin(lst=lst4)[0])\n",
+    "print(lst4[::-1], '->', majority_ele_lin(lst=lst4[::-1])[0])\n",
+    "\n",
+    "lst5 = [0, 0, 2, 2, 2]\n",
+    "print(lst5, '->',majority_ele_lin(lst=lst5)[0])\n",
+    "print(lst5[::-1], '->', majority_ele_lin(lst=lst5[::-1])[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, \"finding the majority element\" is a nice task for a Divide and Conquer algorithm. Here, we use the fact that if a list has a majority element, that element is also the majority element of at least one of the two sublists we get when we split the list into 2 halves.\n",
+    "\n",
+    "More concretely, what we do is:\n",
+    "\n",
+    "1. Split the array into 2 halves\n",
+    "2. Run the majority element search on each of the two halves\n",
+    "3. Combine the 2 subresults\n",
+    "  1. Neither of the 2 sub-arrays has a majority element; thus, the combined list can't have a majority element, and we return -1.\n",
+    "  2. The right sub-array has a majority element, whereas the left sub-array doesn't. Now, we need to take the count of this \"right\" majority element, add the number of times it occurs in the left sub-array, and check if the combined count satisfies the \"$\\text{occurrences} > \\frac{n}{2}$\" criterion.\n",
+    "  3. Same as above but with \"left\" and \"right\" sub-array swapped in the comparison.\n",
+    "  4. Both sub-arrays have a majority element. Compute the combined count of each of the elements as before and check whether one of these elements satisfies the \"$\\text{occurrences} > \\frac{n}{2}$\" criterion."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[] -> -1\n",
+      "[1, 2, 3, 4, 4, 5] -> -1\n",
+      "[1, 2, 4, 4, 4, 5] -> -1\n",
+      "[4, 2, 4, 4, 4, 5] -> 4\n",
+      "[5, 4, 4, 4, 2, 4] -> 4\n",
+      "[2, 3, 9, 2, 2] -> 2\n",
+      "[2, 2, 9, 3, 2] -> 2\n",
+      "[0, 0, 2, 2, 2] -> 2\n",
+      "[2, 2, 2, 0, 0] -> 2\n"
+     ]
+    }
+   ],
+   "source": [
+    "def majority_ele_dac(lst):  \n",
+    "    \n",
+    "    n = len(lst)\n",
+    "    left = lst[:n // 2]\n",
+    "    right = lst[n // 2:]\n",
+    "    \n",
+    "    l_maj = majority_ele_lin(left)\n",
+    "    r_maj = majority_ele_lin(right)\n",
+    "    \n",
+    "    # case 3A: neither half has a majority element\n",
+    "    if l_maj[0] == -1 and r_maj[0] == -1:\n",
+    "        return -1\n",
+    "    \n",
+    "    # case 3B: only the right half has a majority element\n",
+    "    elif l_maj[0] == -1 and r_maj[0] > -1:\n",
+    "        cnt = r_maj[1]\n",
+    "        if r_maj[0] in l_maj[2]:\n",
+    "            cnt += l_maj[2][r_maj[0]]\n",
+    "        if cnt > n // 2:\n",
+    "            return r_maj[0]\n",
+    " \n",
+    "    # case 3C: only the left half has a majority element\n",
+    "    elif r_maj[0] == -1 and l_maj[0] > -1:\n",
+    "        cnt = l_maj[1]\n",
+    "        if l_maj[0] in r_maj[2]:\n",
+    "            cnt += r_maj[2][l_maj[0]]\n",
+    "        if cnt > n // 2:\n",
+    "            return l_maj[0]\n",
+    "        \n",
+    "    # case 3D: both halves have a majority element\n",
+    "    else: \n",
+    "        c1, c2 = l_maj[1], r_maj[1]\n",
+    "        if l_maj[0] in r_maj[2]:\n",
+    "            c1 = l_maj[1] + r_maj[2][l_maj[0]]\n",
+    "        if r_maj[0] in l_maj[2]:\n",
+    "            c2 = r_maj[1] + l_maj[2][r_maj[0]]\n",
+    "        m = max(c1, c2)\n",
+    "        if m > n // 2:\n",
+    "            return l_maj[0] if c1 >= c2 else r_maj[0]  # the element, not its count\n",
+    "    return -1\n",
+    "\n",
+    "###################################################\n",
+    "\n",
+    "lst0 = []\n",
+    "print(lst0, '->', majority_ele_dac(lst=lst0))\n",
+    "\n",
+    "lst1 = [1, 2, 3, 4, 4, 5]\n",
+    "print(lst1, '->', majority_ele_dac(lst=lst1))\n",
+    "\n",
+    "lst2 = [1, 2, 4, 4, 4, 5]\n",
+    "print(lst2, '->', majority_ele_dac(lst=lst2))\n",
+    "\n",
+    "lst3 = [4, 2, 4, 4, 4, 5]\n",
+    "print(lst3, '->', majority_ele_dac(lst=lst3))\n",
+    "print(lst3[::-1], '->', majority_ele_dac(lst=lst3[::-1]))\n",
+    "\n",
+    "lst4 = [2, 3, 9, 2, 2]\n",
+    "print(lst4, '->',majority_ele_dac(lst=lst4))\n",
+    "print(lst4[::-1], '->', majority_ele_dac(lst=lst4[::-1]))\n",
+    "\n",
+    "lst5 = [0, 0, 2, 2, 2]\n",
+    "print(lst5, '->',majority_ele_dac(lst=lst5))\n",
+    "print(lst5[::-1], '->', majority_ele_dac(lst=lst5[::-1]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Adding multiprocessing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Our Divide and Conquer approach above is actually a good candidate for multi-processing, since we can parallelize the majority element search in the two sub-lists. So, let's make a simple modification and use Python's `multiprocessing` module for that. Here, we use the `apply_async` method of the `Pool` class, which submits a task without blocking and immediately returns an `AsyncResult` object (in contrast to the `apply` method, which blocks until the result is ready). The two tasks may finish in any order, but since we call `get()` on the result objects in the order in which they were submitted, the assignment `l_maj, r_maj = [p.get() for p in results]` always binds `l_maj` to the left sublist and `r_maj` to the right sublist."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[] -> -1\n",
+      "[1, 2, 3, 4, 4, 5] -> -1\n",
+      "[1, 2, 4, 4, 4, 5] -> -1\n",
+      "[4, 2, 4, 4, 4, 5] -> 4\n",
+      "[5, 4, 4, 4, 2, 4] -> 4\n",
+      "[2, 3, 9, 2, 2] -> 2\n",
+      "[2, 2, 9, 3, 2] -> 2\n",
+      "[0, 0, 2, 2, 2] -> 2\n",
+      "[2, 2, 2, 0, 0] -> 2\n"
+     ]
+    }
+   ],
+   "source": [
+    "import multiprocessing as mp\n",
+    "\n",
+    "def majority_ele_dac_mp(lst):  \n",
+    "    n = len(lst)\n",
+    "    left = lst[:n // 2]\n",
+    "    right = lst[n // 2:]\n",
+    "    \n",
+    "    with mp.Pool(processes=2) as pool:  # two workers, one per half\n",
+    "        results = [pool.apply_async(majority_ele_lin, args=(x,))\n",
+    "                   for x in (left, right)]  # submit both before collecting\n",
+    "        l_maj, r_maj = [p.get() for p in results]\n",
+    "    \n",
+    "    if l_maj[0] == -1 and r_maj[0] == -1:\n",
+    "        return -1\n",
+    "    \n",
+    "    elif l_maj[0] == -1 and r_maj[0] > -1:\n",
+    "        cnt = r_maj[1]\n",
+    "        if r_maj[0] in l_maj[2]:\n",
+    "            cnt += l_maj[2][r_maj[0]]\n",
+    "        if cnt > n // 2:\n",
+    "            return r_maj[0]\n",
+    "    \n",
+    "    elif r_maj[0] == -1 and l_maj[0] > -1:\n",
+    "        cnt = l_maj[1]\n",
+    "        if l_maj[0] in r_maj[2]:\n",
+    "            cnt += r_maj[2][l_maj[0]]\n",
+    "        if cnt > n // 2:\n",
+    "            return l_maj[0]\n",
+    "        \n",
+    "    else: \n",
+    "        c1, c2 = l_maj[1], r_maj[1]\n",
+    "        if l_maj[0] in r_maj[2]:\n",
+    "            c1 = l_maj[1] + r_maj[2][l_maj[0]]\n",
+    "        if r_maj[0] in l_maj[2]:\n",
+    "            c2 = r_maj[1] + l_maj[2][r_maj[0]]\n",
+    "        m = max(c1, c2)\n",
+    "        if m > n // 2:\n",
+    "            return l_maj[0] if c1 >= c2 else r_maj[0]  # the element, not its count\n",
+    "    return -1\n",
+    "\n",
+    "###################################################\n",
+    "\n",
+    "lst0 = []\n",
+    "print(lst0, '->', majority_ele_dac_mp(lst=lst0))\n",
+    "\n",
+    "lst1 = [1, 2, 3, 4, 4, 5]\n",
+    "print(lst1, '->', majority_ele_dac_mp(lst=lst1))\n",
+    "\n",
+    "lst2 = [1, 2, 4, 4, 4, 5]\n",
+    "print(lst2, '->', majority_ele_dac_mp(lst=lst2))\n",
+    "\n",
+    "lst3 = [4, 2, 4, 4, 4, 5]\n",
+    "print(lst3, '->', majority_ele_dac_mp(lst=lst3))\n",
+    "print(lst3[::-1], '->', majority_ele_dac_mp(lst=lst3[::-1]))\n",
+    "\n",
+    "lst4 = [2, 3, 9, 2, 2]\n",
+    "print(lst4, '->', majority_ele_dac_mp(lst=lst4))\n",
+    "print(lst4[::-1], '->', majority_ele_dac_mp(lst=lst4[::-1]))\n",
+    "\n",
+    "lst5 = [0, 0, 2, 2, 2]\n",
+    "print(lst5, '->', majority_ele_dac_mp(lst=lst5))\n",
+    "print(lst5[::-1], '->', majority_ele_dac_mp(lst=lst5[::-1]))"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},