2.13: Algorithms and Data Structures - Biology

Having learned so many programming concepts—variables and the data to which they refer, functions, loops, and so on—it would be a shame not to talk about some of the surprising and elegant methods they enable. Computer science students spend years learning about these topics, and the themes in this chapter underlie much of bioinformatics.

We’ll start with algorithms, which according to a classic book on the topic—Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein—are:

any well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output.

With such a broad definition, all of the Python coding we’ve done could accurately be categorized as practice in algorithms. The word “algorithm,” by the way, derives from the name of the ninth-century Persian mathematician al-Khwārizmī, who developed systematic approaches for solving equations (among other accomplishments). Although most functioning code may be characterized as an algorithm, algorithms usually involve more than just collating data and are paired with well-defined problems, providing a specification for what valid inputs are and what the corresponding outputs should be. Some examples of problems include:

  1. Given a list of numbers in arbitrary order, return it or a copy of it in sorted order. (The “sorting” problem.)
  2. Given a list of numbers in sorted order and a query number q, return True if q is present in the list and False if it is not. (The “searching” problem.)
  3. Given two strings of length m and n, line them up against each other (inserting dashes where necessary to make them the same length) to maximize a score based on the character matches. (The “string alignment” problem.)

Clearly, some problems, like the string alignment problem, are of special interest for life scientists. Others, like sorting and searching, are ubiquitous. No matter who cares about the problem, a good algorithm for solving it should do so efficiently. Usually, “efficiently” means “in a short amount of time,” though with the large data sets produced by modern DNA-sequencing technologies, it often also means “using a small amount of memory.”

Most bioinformatics problems like string alignment require rather sophisticated methods built from simpler concepts, so we’ll start with the usual introductory topic for algorithms design and analysis: sorting. After all, someone had to write Python’s sorted() function for lists, and it does form a good basis for study.

Consider a small list of five numbers, in unsorted order.

What code might we write to sort these numbers? Whatever it is, it might make sense to put it into a function that takes an unsorted list, and returns a copy of it in sorted order (to stay consistent with Python’s use of the principle of command-query separation). To start, we’ll make a copy of the list, and then we’ll have the function merely “fix up” some of the out-of-order numbers in this copy by looking at a sliding window of size 2—pairs of adjacent numbers—and switching their placement if they are out of order.

Just fixing up a handful of pairs in this way won’t result in the list being sorted, but it should be obvious that because the pairs of fix-ups overlap, the largest number will have found its way to the last position. In a sense, it “bubbled” to the top. If we were to repeat this process, the second-to-largest number must also find its way to the correct spot. In fact, if we repeat this process as many times as there are numbers, the list will be fully sorted.
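
In Python, this strategy might be sketched as follows (an illustration consistent with the analysis below, which refers to the c_nums copy and the swap variables; the five example numbers are hypothetical):

    import copy

    def bubblesort(nums):
        """Return a sorted copy of nums by repeatedly swapping adjacent out-of-order pairs."""
        c_nums = copy.copy(nums)                      # work on a copy; the input list stays untouched
        for counter in range(0, len(c_nums)):         # repeat the fix-up pass once per element
            for index in range(0, len(c_nums) - 1):   # slide a window of size 2 across the list
                if c_nums[index + 1] < c_nums[index]:
                    leftnum = c_nums[index]
                    rightnum = c_nums[index + 1]
                    c_nums[index] = rightnum          # swap the out-of-order pair
                    c_nums[index + 1] = leftnum
        return c_nums

    print(bubblesort([24, 18, 42, 5, 27]))            # prints [5, 18, 24, 27, 42]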

This algorithm for sorting is affectionately known as “bubblesort.” In a rough sense, how many operations will this algorithm perform, in the worst case, for a list of size n? Suppose the if-statement always finds True, and numbers need to be swapped. In this case, each operation of the inner loop requires five or so “time steps” (four assignments and an if check). The inner loop runs n − 1 times (if there are n numbers), which is itself repeated n times in the outer loop. The copy step to produce the new c_nums list also uses n steps, so the total number of basic operations is, more or less, 5(n − 1)n + n = 5n² − 4n.

When analyzing the running time of an algorithm, however, we only care about the “big stuff.”

In computing an algorithm’s run time, we only care about the highest-order terms, and we tend to ignore constants that aren’t related to the “size” of the input. We say that the run time is order of the highest-order term (without constants), which we denote with an uppercase O: 5n² − 4n is O(n²).

We read this as “five n squared minus four n is order n squared” or “five n squared minus four n is big-O of n squared.” Because this run time talks about a particular algorithm, we might also say “bubblesort is order n squared.” Big-O notation implies that, roughly, an algorithm will run in time less than or equal to the equation interpreted as a function of the input size, in the worst case (ignoring constants and small terms).

Worst-case analysis assumes that we get unlucky with every input, but in many cases the algorithm might run faster in practice. So, why do we do worst-case analysis? First, it gives the user of an algorithm a guarantee—nobody wants to hear that software might use a certain amount of time. Second, while it is sometimes possible to do average-case analysis, it requires knowledge of the distribution of input data in practice (which is rare) or more sophisticated analysis techniques.

Why do we use order notation to characterize algorithm run time, dropping small terms and constants? First, although we stated previously that each basic operation is one “computer step,” that’s not really true: the line c_nums[index + 1] = leftnum requires both an addition and an assignment. But no one wants to count each CPU cycle individually, especially if the result won’t change how we compare two algorithms for large data sets, which is the primary purpose of order notation. Dropping off constant terms and lower-order terms makes good mathematical sense in these cases. For large enough input (i.e., all n greater than some size n₀), even terms that seem like they would make a big difference turn out not to matter, as a hypothetical comparison between two run-time equations demonstrates.

(Technically, writing an equality like 5n² − 4n = O(n²) is an abuse of notation; we should say that 5n² − 4n is O(n²), based on the definition of the notation.) In that hypothetical comparison, the lower-order equation turns out to be the smaller of the two whenever n is larger than 400,024,000 (which is that example’s n₀).

So, although order notation seems like a fuzzy way of looking at the efficiency of an algorithm, it’s a rigorously defined and reasonable way to do so.[1] Because some programming languages are faster than others—but only by a constant factor (perhaps compiled C code is 100 times faster than Python)—a good algorithm implemented in a slow language often beats a mediocre algorithm implemented in a fast language!

Now, a run time of O(n²) isn’t very good: the time taken to sort a list of n numbers will grow quadratically with the size of the list.

O(n²) isn’t great, but for the sorting problem we can do better, and we’ll return to the study of algorithms and sorting a bit later. First, let’s detour and talk about interesting ways we can organize data, using variables and the objects to which they refer.

A data structure is an organization for a collection of data that ideally allows for fast access and certain operations on the data. We’ve seen a couple of data structures already in Python: lists and dictionaries. Given such a structure, we might ask questions like:

  • Does the data structure keep its elements in a specific order (such as sorted order)? If we use .append() to add elements to lists, lists are kept in the order items are added. If we wanted the list to stay sorted, we’d have to ensure new elements are inserted in the right spot.
  • How long does it take to add a new element? For Python lists, the .append() method operates in time O(1), which is to say a single element can be added to the end, and the time taken does not depend on how big the list is. Unfortunately, this won’t be true if we need to keep the list sorted, because inserting into the middle requires rearranging the way the data are stored.
  • How long does it take to find the smallest element? Because using .append() adds elements to the end of the list, unless we keep the list sorted, we need to search the entire list for the smallest element. This operation takes time O(n), where n is the length of the list. If the list is sorted, though, it takes only O(1) time to look at the front of it.

There are trade-offs that we can make when thinking about how to use data structures. Suppose, for example, that we wanted to quickly find the smallest element of a list, even as items are added to it. One possibility would be to sort the list after each addition; if we did that, then the smallest element would always be at index 0 for quick retrieval. But this is a lot of work, and the total time spent sorting would start to outweigh the benefits of keeping the list sorted in the first place! Even better, we could write a function that (1) inserts the new item at the end and then (2) bubbles it backward to its correct location; only one such bubbling would be needed for each insertion, but each is an O(n) operation (unless we could somehow guarantee that only large elements are added).
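
A sketch of that idea, as a hypothetical helper function (not from the text) that keeps an already-sorted Python list sorted:

    def insert_sorted(sorted_list, item):
        """Append item, then bubble it backward until it reaches its sorted position."""
        sorted_list.append(item)                      # quick O(1) append to the end
        index = len(sorted_list) - 1
        while index > 0 and sorted_list[index] < sorted_list[index - 1]:
            # swap the new item backward one spot at a time
            sorted_list[index], sorted_list[index - 1] = sorted_list[index - 1], sorted_list[index]
            index = index - 1

    sorted_nums = [4, 7, 9]
    insert_sorted(sorted_nums, 5)
    print(sorted_nums)                                # prints [4, 5, 7, 9]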

Finding the smallest item is easy, as we know it is always present at index 0.

Instead, let’s do something much more creative, and build our own data structure from scratch. While this data structure is a lot of trouble to create for only a little benefit, it does have its uses and serves as a building block for many other sophisticated structures.

Our data structure, called a “sorted linked list,” will keep a collection of items in sorted order, and will do the work of inserting new items in the correct location in that order. These items could be integers, floats, or strings: anything that can be compared with the < operator (which, recall, compares strings in lexicographic order; as discussed in previous chapters, we can even define our own object types that are comparable with < by implementing .__lt__(), .__eq__(), and similar methods).

To accomplish this task, we’re going to make heavy use of classes and objects, and the fact that a variable (even an instance variable) is a name that refers to an object.

The strategy will be to have two types of objects: the first, which we’ll call a LinkedList, will be the main type of object with which our programs will interact (much like we interacted with Chromosome objects in previous chapters, while SNP objects were dealt with by the Chromosome objects). The LinkedList object will have methods like .insert_item() and .get_smallest(). The other objects will be of type Node, and each of these will have an instance variable self.item that will refer to the item stored by an individual Node. So, drawing just the objects in RAM, our sorted linked list of three items (4, 7, and 9) might look something like this:

(The objects are unlikely to be neatly organized this way in RAM—we’re showing them in a line for illustration only.) The arrows in this figure indicate that after we create a new LinkedList object called itemlist, this variable is a name that refers to an object, and each Node object has a self.item instance variable that refers to a “data” object of some type, like an integer or string.[2]

Now, if this collection of objects were to exist as shown, it wouldn’t be that useful. Our program would only be able to interact with the itemlist object, and in fact there are no variables that refer to the individual Node objects, so they would be deleted via garbage collection.

Here’s the trick: if an instance variable is just a variable (and hence a reference to an object), we can give each Node object a self.next_n instance variable that will refer to the next node in line. Similarly, the LinkedList object will have a self.first_n instance variable that will refer to the first one.

The last Node object’s self.next_n refers to None, a placeholder that we can use to allow a variable to reference “nothing here.” Actually, None will be the initial value for self.next_n when a new node is created, and we’ll have to add methods like .get_next() and .set_next() that allow us to get or change a Node’s next_n variable at will. The LinkedList’s first_n variable will similarly be initialized as None in the constructor.

Suppose that we have this data structure in place, and we want to add the number 2; this would be the new “smallest” item. To do this we’d need to run itemlist.insert_item(2), and this method would consider the following questions to handle all of the possible cases for inserting an item (by using if-statements):

  1. Is self.first_n equal to None? If so, then the new item is the only item, so create a new Node holding the new item and set self.first_n to that node.
  2. If self.first_n is not equal to None:
    1. Is the new item smaller than self.first_n’s item? If so, then (1) create a new Node holding the new item, (2) set its next_n to self.first_n, and then (3) set self.first_n to the new node. Here’s an illustration for this case:
    2. Otherwise, the new node does not go between the LinkedList object and the first Node object. In this case, we could treat the self.first_n object as though it were itself a LinkedList, if only it had an .insert_item() method.

This case (b) is really the heart of the linked list strategy: each Node object will also have an .insert_item() method. Further, each node’s insert_item() will follow the same logic as above: if self.next_n is None, the new node goes after that node. If not, the node needs to determine if the new node should go between itself and the next node, or if it should “pass the buck” to the node next in line.

Now we can turn to the code. First, consider the code for the LinkedList class.
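
A sketch consistent with the description (it includes the .is_item_present() method discussed a bit further on, and relies on the Node class shown just below):

    class LinkedList:
        def __init__(self):
            self.first_n = None                       # an empty list has no first node

        def get_smallest(self):
            if self.first_n == None:
                return None                           # nothing stored yet
            else:
                return self.first_n.get_item()        # the smallest item is always in the first node

        def is_item_present(self, query):
            if self.first_n == None:
                return False                          # an empty list can't contain the query
            else:
                return self.first_n.is_item_present(query)

        def insert_item(self, item):
            if self.first_n == None:
                self.first_n = Node(item)             # case 1: the new item is the only item
            else:
                if item < self.first_n.get_item():    # case 2a: the new item goes before the first node
                    newnode = Node(item)
                    newnode.set_next(self.first_n)    # link the new node to the old first node...
                    self.first_n = newnode            # ...and only then re-point first_n
                else:
                    self.first_n.insert_item(item)    # case 2b: pass the buck to the first node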

The lines corresponding to step 2a above are crucial; in particular, the order in which the various variables are set makes all the difference. What would happen if self.first_n = newnode was called before newnode.set_next(self.first_n)? We would lose all references to the rest of the list: itemlist.first_n would refer to newnode, the new node’s .next_n would end up referring to newnode itself rather than to the rest of the list, and no variable would refer to the node holding 4—it would be lost in RAM and eventually garbage collected.

As mentioned above, the class for a Node is quite similar. In fact, it is possible to build linked lists with only one object type, but this would necessitate a “dummy” node to represent an empty list anyway (perhaps storing None in its self.item).
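
A corresponding sketch of the Node class (again, an illustration rather than the original listing):

    class Node:
        def __init__(self, item):
            self.item = item                          # the data held by this node
            self.next_n = None                        # new nodes initially have no next node

        def get_item(self):
            return self.item

        def get_next(self):
            return self.next_n

        def set_next(self, newnext):
            self.next_n = newnext

        def is_item_present(self, query):
            if self.item == query:
                return True                           # this node holds the query item
            elif self.next_n == None:
                return False                          # the buck stopped here with no match
            else:
                return self.next_n.is_item_present(query)   # pass the buck down the line

        def insert_item(self, item):
            if self.next_n == None:
                self.next_n = Node(item)              # the new item goes at the end, after this node
            else:
                if item < self.next_n.get_item():     # the new item goes between this node and the next
                    newnode = Node(item)
                    newnode.set_next(self.next_n)
                    self.next_n = newnode
                else:
                    self.next_n.insert_item(item)     # pass the buck to the next node in line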

Our new data structure is relatively easy to use, and it keeps itself sorted nicely:
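
For example, assuming the sketches above (the items are hypothetical):

    itemlist = LinkedList()
    itemlist.insert_item(9)
    itemlist.insert_item(3)
    itemlist.insert_item(7)
    print(itemlist.get_smallest())                    # prints 3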

This idea of “passing the buck” between nodes is pretty clever, and it allows us to write sophisticated queries on our data structure with ease. Suppose we wanted to ask whether a given item is already present in the list.

To solve a problem like this, we can think of each node as implementing a decision procedure (using a method, like .is_item_present(query)). The LinkedList interface object would return False if its self.first_n is None (to indicate the list is empty, so the query item can’t be present). If its self.first_n is not None, it calls self.first_n.is_item_present(query), expecting that node to either return True or False.

For a node, the decision procedure is only slightly more complex:

  1. Check whether self.item is equal to the query. If so, a True can safely be returned.
  2. Otherwise:
    1. If self.next_n is None, then False can be returned, because if the buck got passed to the end of the list, no node has matched the query.
    2. If self.next_n does exist, on the other hand, just pass the buck down the line, and rely on the answer to come back, which can be returned.

Here is a quick demonstration of the usage (the whole script can be found in the file linkedlist.py):
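
Continuing the hypothetical example above, such a demonstration might look like this:

    print(itemlist.is_item_present(7))                # prints True
    print(itemlist.is_item_present(12))               # prints False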

Notice the similarity in all of these methods: each node first determines whether it can answer the problem—if so, it computes and returns the answer. If not, it checks for a node to pass the problem on to, and if one exists, the buck is passed. Notice that at each buck pass the method being called is the same—it’s just being called for a different object. And each time the overall “problem” gets smaller as the number of nodes left to pass the buck to decreases.

How much time does it take to insert an item into a list that is already of length n? Because the new item might have to go at the end of the list, the buck might need to be passed n times, meaning an insertion is O(n). What about getting the smallest element? In the LinkedList’s .get_smallest() method, it only needs to determine whether self.first_n is None, and if not, it returns the element stored in that node. Because there is no buck passing, the time is O(1).

The creation of the sorted linked list structure didn’t get us much over a more straightforward list kept in sorted order via bubbling, but the ideas implemented here pave the way for much more sophisticated solutions.

Exercises

  1. How much time would it take to insert n sequences into a Python list, and then at the end sort it with bubblesort in the worst-case scenario (using order notation)?
  2. How much time would it take to insert n elements into a sorted linked list that starts empty, in the worst-case scenario (using order notation)? (Note that the first insertion is quick, but the second item might take two buck-passes, the third may take three, and so on.)
  3. Add “pass the buck” methods to the LinkedList and Node classes that result in each item being printed in order.
  4. Write “pass the buck” methods that cause the list of items to be printed, but in reverse order.
  5. Add methods to the LinkedList and Node classes so that the linked list can be converted into a normal list (in any order, though reverse order is most natural). For example, print(itemlist.collect_to_list()) should print something like ['9', '3', '7'].

So far, both the algorithm (bubblesort) and the data structure (sorted linked list) we’ve studied have been linear in nature. Here, we’ll see how these ideas can be extended in “bifurcating” ways.

Let’s consider the sorted linked list from the last section, which was defined by a “controlling” class (the LinkedList) and a number of nearly identical Nodes, each with a reference to a “next” Node object in the line. So long as certain rules were followed (e.g., that the list was kept in sorted order), this allowed each node to make local decisions that resulted in global answers to questions.

What if we gave each node a bit more power? Rather than a single self.next_n instance variable, what if there were two: a self.left_n and a self.right_n? We will need a corresponding rule to keep things organized: smaller items go toward the left, and larger (or equal-sized) items go toward the right. This data structure is the well-known binary tree.

The illustration above looks quite a bit more complicated. But if we inspect this structure closely, it’s quite similar to the linked list:[3] there is a controlling class of BinaryTree, and instead of a self.first_n instance variable, it has an instance variable called self.root_n, because the node it references is the “root” of the tree. Before any items have been inserted, self.root_n will be None. Upon an item insertion, if self.root_n is None, the new item goes there; otherwise, the buck is necessarily passed to self.root_n. We’ll see why in a moment.

Now, for the Node class, we’ll need a constructor, as well as “get” and “set” methods for both left_n and right_n, which initially are set to None.

What about a node’s .insert_item() method? What sort of decision-making process needs to happen? The process is even simpler than for a sorted linked list. If each node always follows the rule that smaller items can be found to the left and larger or equal items can always be found to the right, then new items can always be inserted at the bottom of the tree. In the tree above, for example, a node holding 8 would be placed to the right of (and “below”) the node holding 7. The decision process for a node is thus as follows:

  1. Is the new item to insert less than our self.item? If so, the new item goes to the left:
    1. Is self.left_n equal to None? If so, then we need to create a new node holding the new item, and set self.left_n to that node.
    2. If not, we can pass the buck to self.left_n.
  2. Otherwise, the item must be larger than or equal to self.item, so it needs to go to the right:
    1. Is self.right_n equal to None? If so, then we need to create a new node holding the new item, and set self.right_n to that node.
    2. If not, we can pass the buck to self.right_n.

In the previous figure of a tree, 8 would go to the right of 7, 6 would go to the left of 7, 18 would go to the right of 11, and so on. The logic for inserting a new item is thus quite simple, from a node’s point of view; it appears in the binarytree.py sketch following this discussion.

The remaining method to consider is the tree’s .get_smallest(). In the case of a linked list, the smallest item (if present) was the first node in the list, and so could be accessed with no buck passing. But in the case of a binary tree, this isn’t true: the smallest item can be found all the way to the left. The code for .get_smallest() in the BinaryTree class and the corresponding Node class (also in the sketch below) reflects this.

In the case of a node, if self.left_n is None, then that node’s item must be the smallest one, because it can assume the message was only ever passed toward it to the left. Similar usage code as for the linked list demonstrates that this wonderful structure (binarytree.py) really does work:
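
A self-contained sketch of binarytree.py along the lines described above, with the accessor methods mentioned earlier and a small hypothetical demonstration at the end:

    class Node:
        def __init__(self, item):
            self.item = item                          # the data held by this node
            self.left_n = None                        # smaller items will go down this side
            self.right_n = None                       # larger or equal items will go down this side

        def get_item(self):
            return self.item

        def get_left(self):
            return self.left_n

        def set_left(self, newleft):
            self.left_n = newleft

        def get_right(self):
            return self.right_n

        def set_right(self, newright):
            self.right_n = newright

        def insert_item(self, item):
            if item < self.item:                      # smaller items go toward the left
                if self.left_n == None:
                    self.left_n = Node(item)          # an empty spot: the new node goes here
                else:
                    self.left_n.insert_item(item)     # pass the buck to the left child
            else:                                     # larger or equal items go toward the right
                if self.right_n == None:
                    self.right_n = Node(item)
                else:
                    self.right_n.insert_item(item)    # pass the buck to the right child

        def get_smallest(self):
            if self.left_n == None:
                return self.item                      # nothing further left: this is the smallest
            else:
                return self.left_n.get_smallest()     # keep walking down the left-hand side

    class BinaryTree:
        def __init__(self):
            self.root_n = None                        # an empty tree has no root node

        def insert_item(self, item):
            if self.root_n == None:
                self.root_n = Node(item)              # the first item becomes the root
            else:
                self.root_n.insert_item(item)         # otherwise, pass the buck to the root

        def get_smallest(self):
            if self.root_n == None:
                return None                           # nothing stored yet
            else:
                return self.root_n.get_smallest()     # the smallest item is all the way to the left

    itemtree = BinaryTree()
    itemtree.insert_item(7)
    itemtree.insert_item(9)
    itemtree.insert_item(3)
    itemtree.insert_item(11)
    print(itemtree.get_smallest())                    # prints 3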

The most interesting question is: how much time does it take to insert a new item into a tree that already contains n items? The answer depends on the shape of the tree. Suppose that the tree is nice and “bushy,” meaning that all nodes except those at the bottom have nodes to their left and right.

The time taken to insert an item is the number of times the buck needs to be passed to reach the bottom of the tree. In this case, at each node, the total number of nodes in consideration is reduced by half; first n, then n/2, then n/4, and so on, until there is only a single place the new item could go. How many times can a number n be divided in half until reaching a value of 1 (or smaller)? The formula is log₂(n). It takes the same amount of time to find the smallest item for a bushy tree, because the length down the left-hand side is the same as any other path to a “leaf” in the tree.

In general, the logarithm of n is much smaller than n itself, so a binary tree trades off some speed in finding the smallest element for speed in insertion.

Note, however, that the shape of a tree depends on the order in which the items are inserted; for example, if 10 is inserted into an empty tree, followed by 9, the 9 will go to the left. Further inserting 8 will put it all the way to the left of 9. Thus, it is possible that a tree isn’t in fact bushy, but rather very unbalanced. For an extreme example, if the numbers from n down to 1 were inserted in that order, the tree would look like so:

In this case, the tree has degenerated into a reverse-sorted linked list. So, the insertion time (for 1, for example) is O(n), and finding the smallest item is also O(n), because the tree is heavily unbalanced in a leftward direction. Unfortunately, in practice, we can’t guarantee the order in which data will be inserted, and such runs of consecutive insertions aren’t uncommon in real-world data.

More sophisticated structures called “balanced” binary trees have modifications to their insertion methods that ensure the tree stays bushy, no matter what order items are inserted. This is tricky, as it requires manipulating the structure of the tree after insertions, but the structure manipulation itself can’t take too long or the benefit of insertion is lost. Examples of balanced binary trees include so-called red-black trees, and AVL trees (named after Georgy Adelson-Velsky and Evgenii Landis, who first described them).

In practice, we don’t make a distinction between “bushy” and “un-bushy” binary trees: simple binary-search trees are O(n) for both .insert_item() and .get_smallest(), because we cannot guarantee bushiness (recall that we assume worst-case performance when analyzing an algorithm). Real applications thus use AVL trees, red-black trees, or other more sophisticated data structures.

Exercises

  1. Add “pass the buck” methods to BinaryTree and its Node class for .print_in_order() and .print_reverse_order(), causing the items to be printed in sorted and reverse-sorted order, respectively.
  2. Add .count_nodes() methods that return the total number of items stored in the tree. How long does it take, in order notation? Does this depend on whether the tree is bushy or not? If so, what would be the run time for a bushy versus un-bushy tree?
  3. Add .count_leaves() methods that return the total number of “leaves” (nodes with None in left_n and right_n).
  4. Binary search trees are so called because they can easily and efficiently determine whether a query item is present. Add .is_item_present() methods that return True if a query item exists in the tree and False otherwise (similar to the LinkedList’s .is_item_present()). How long does it take, in order notation? Does this depend on whether the tree is bushy or not? If so, what would be the run time for a bushy versus un-bushy tree?
  5. Modify the binary tree code so that duplicate items can’t be stored in separate nodes.

Back to Sorting

We previously left the topic of sorting having developed bubblesort, an O(n²) method for sorting a simple list of items. We can certainly do better.

One of the interesting features of the insert_item() method used by nodes in both the tree and linked list is that this method, for any given node, calls itself, but in another node. In reality, there aren’t multiple copies of the method stored in RAM; rather, a single method is shared between them, and only the self parameter is really changing. So, this method (which is just a function associated with a class) is actually calling itself.

In a related way, it turns out that any function (not associated with an object) can call itself. Suppose we wanted to compute the factorial function, defined as factorial(n) = n × (n − 1) × (n − 2) × … × 2 × 1.

One of the most interesting features of the factorial function is that it can be defined in terms of itself: factorial(n) = n × factorial(n − 1), with factorial(1) = 1.

If we wanted to compute factorial(7), a logical way to think would be: “first, I’ll compute the factorial of 6, then multiply it by 7.” This reduces the problem to computing factorial(6), which we can logically solve in the same way. Eventually we’ll want to compute factorial(1), and realize that is just 1. The code follows this logic impeccably:
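
A sketch of that code (the subanswer variable matches the discussion below):

    def factorial(n):
        """Compute n! recursively."""
        if n == 1:
            return 1                                  # base case: factorial(1) is just 1
        subanswer = factorial(n - 1)                  # recursive case: compute (n - 1)! first...
        answer = subanswer * n                        # ...then multiply it by n
        return answer

    print(factorial(7))                               # prints 5040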

As surprising as it might be, this bit of code really works.[4] The reason is that the parameter n is a local variable, and so in each call of the function it is independent of any other n variable that might exist.[5] The call to factorial(7) has an n equal to 7, which calls factorial(6), which in turn gets its own n equal to 6, and so on. Each call waits at the subanswer = factorial(n-1) line, and only when factorial(1) is reached do the returns start percolating back up the chain of calls. Because calling a function is a quick operation (O(1)), the time taken to compute factorial(n) is O(n): one call and one multiplication computed at each level.

This strategy—a function that calls itself—is called recursion. There are usually at least two cases considered by a recursive function: (1) the base case, which returns an immediate answer if the data are simple enough, and (2) the recursive case, which computes one or more subanswers and modifies them to return the final answer. In the recursive case, the data on which to operate must get “closer” to the base case. If they do not, then the chain of calls will never finish.

Applying recursion to sorting items is relatively straightforward. The more difficult part will be determining how fast it runs for a list of size n. We’ll illustrate the general strategy with an algorithm called quicksort, first described in 1960 by Tony Hoare.

For an overall strategy, we’ll implement the recursive sort in a function called quicksort(). The first thing to check is whether the list is of length 1 or 0: if so, we can simply return it since it is already sorted. (This is the base case of the recursive method.) If not, we’ll pick a “pivot” element from the input list; usually, this will be a random element, but we’ll use the first element as the pivot to start with. Next, we’ll break the input list into three lists: lt, containing the elements less than the pivot; eq, containing the elements equal to the pivot; and gt, containing elements greater than the pivot. Next, we’ll sort lt and gt to produce lt_sorted and gt_sorted. The answer, then, is a new list containing first the elements from lt_sorted, then the elements from eq, and finally the elements from gt_sorted.

The interesting question in the code below is: how do we sort lt and gt to produce lt_sorted and gt_sorted?

We could use Python’s built-in sorted() function, but that’s clearly cheating. We could use bubblesort, but doing so would result in our function suffering the same O(n²) time bound as for bubblesort (we say “time bound” because the order notation essentially provides an upper bound on time usage). The answer is to use recursion: because lt and gt must both be smaller than the input list, as subproblems they are getting closer to the base case, and we can call quicksort() on them!
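
Putting the pieces together, a sketch of the whole function (an illustration consistent with the description; the example list at the end is hypothetical):

    def quicksort(nums):
        """Return a sorted copy of nums."""
        if len(nums) <= 1:
            return nums                               # base case: a list of 0 or 1 elements is already sorted
        pivot = nums[0]                               # use the first element as the pivot (for now)
        lt = []                                       # elements less than the pivot
        eq = []                                       # elements equal to the pivot
        gt = []                                       # elements greater than the pivot
        for element in nums:
            if element < pivot:
                lt.append(element)
            elif element == pivot:
                eq.append(element)
            else:
                gt.append(element)
        lt_sorted = quicksort(lt)                     # recursive calls: lt and gt are strictly smaller lists
        gt_sorted = quicksort(gt)
        return lt_sorted + eq + gt_sorted             # assemble the final answer

    print(quicksort([24, 18, 42, 5, 27, 18]))         # prints [5, 18, 18, 24, 27, 42]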

Let’s attempt to analyze the running time of quicksort()—will it be better than, equal to, or worse than bubblesort? As with the binary trees covered previously, it turns out the answer is “it depends.”

First, we’ll consider how long the function takes to run—not counting the subcalls to quicksort()—on a list of size n. The first step is to pick a pivot, which is quick. Next, we split the input list into three sublists. Because appending to a list is quick, but all n elements need to be appended to one of the three, the time spent in this section is O(n). After the subcalls, the sorted versions of the lists need to be appended to the answer list, and because there are n things to append, the time there is also O(n). Thus, not counting the recursive subcalls, the run time is n plus n, which is O(n).

But that’s not the end of the story—we can’t just ignore the time taken to sort the sublists, even though they are smaller. For the sake of argument, let’s suppose we always get “lucky,” and the pivot happens to be the median element of the input list, so len(lt) and len(gt) are both approximately n/2. We can visualize the execution of the function calls as a sort of “call tree,” where the size of the nodes represents how big the input list is.

Here, the top “node” represents the work done by the first call; before it can finish, it must call to sort the lt list on the second layer, which must again call to sort its own lt list, and so on, until the bottom layer, where a base case is reached. The path of execution can be traced in the figure along the line. To analyze the run time of the algorithm, we can note that the amount of work done at each layer is O(n), so the total amount of work is this value times the number of layers. In the case where the pivots split the lists roughly in half, the result is the same as the number of layers in the bushy binary tree: log₂(n). Thus, in this idealized scenario, the total run time is O(n log n), which is much better than the O(n²) of bubblesort.[6]

This equation assumes that the pivots split the input lists more or less in half, which is likely to be the case (on average), if we choose the pivots at random. But we didn’t choose at random; we used nums[0]. What would happen if the input list happened to be already sorted? In this case, the pivot would always be the first element, lt would always be empty, and gt would have one fewer element. The “work tree” would also be different.

In this “unlucky” case, the work on each level is approximately n, then n − 1, then n − 2, and so on, so the total work is n + (n − 1) + (n − 2) + … + 1, which is O(n²). Does this mean that in the worst case quicksort() is as slow as bubblesort? Unfortunately, yes. But some very smart people have analyzed quicksort() (assuming the pivot element is chosen at random, rather than using the first element) and proved that in the average case the run time is O(n log n), and the chances of significantly worse performance are astronomically small. Further, a similar method known as “mergesort” operates by guaranteeing an even 50/50 split (of a different kind).

Using random pivot selection, quicksort() is fast in practice, though mergesort and similar algorithms are frequently used as well. (Python’s .sort() and sorted() use a variant of mergesort called “Timsort.”) Although, as mentioned above, worst-case analysis is the most prevalent kind in algorithms analysis, quicksort() is one of the few exceptions to this rule.

These discussions of algorithms and data structures might seem esoteric, but they should illustrate the beauty and creativeness possible in programming. Further, recursively defined methods and sophisticated data structures underlie many methods in bioinformatics, including pairwise sequence alignment, reference-guided alignment, and Hidden Markov Models.

Exercises

  1. The first step for writing mergesort() is to write a function called merge(); it should take two sorted lists (together comprising n elements) and return a merged sorted version. For example, merge([1, 2, 4, 5, 8, 9, 10, 15], [2, 3, 6, 11, 12, 13]) should return the list [1, 2, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15], and it should do so in time O(n), where n is the total number of elements from both lists. (Note that .append() on a Python list is time O(1), as are mathematical expressions like c = a + b, but other list operations like .insert() are not.)

    The mergesort() function should first split the input list into two almost-equal-sized pieces (e.g., first_half = input_list[0:len(input_list)//2]); these can then be sorted recursively with mergesort(), and finally the merge() function can be used to combine the sorted pieces into a single answer. If all steps in the function (not counting the recursive calls) are O(n), then the total time will be O(n log n).

    Implement merge() and mergesort().

  2. The Fibonacci numbers (1, 1, 2, 3, 5, 8, 13, etc.) are, like the factorials, recursively defined: fib(n) = fib(n − 1) + fib(n − 2), with fib(1) = 1 and fib(2) = 1.

    Write a recursive function fib() that returns the nth Fibonacci number (fib(1) should return 1, fib(3) should return 2, fib(10) should return 55).

  3. Next, write a function called fib_loop() that returns the nth Fibonacci number by using a simple loop. What is the run time, in order notation, in terms of n? Compare how long it takes to compute fib(35) versus fib_loop(35), and then try fib(40) versus fib_loop(40). Why do you think fib() takes so much longer? Try drawing the “call trees” for fib(1), fib(2), fib(3), fib(4), fib(5), and fib(6). Could you make a guess as to what the run time of this function is in order notation? Can you imagine any ways it could be sped up?


69443 - ALGORITHMS AND DATA STRUCTURES FOR COMPUTATIONAL BIOLOGY

At the end of the cycle of lessons, the student has basic knowledge of the methodologies and issues involved in the definition and complexity analysis of correct and efficient algorithms and data structures. In particular, the student has basic knowledge of abstract data types and is able to design scalable and arbitrarily complex data structures suited to the algorithmic requirements. The student will be able to create a functional synergy between the design of correct and efficient algorithms and the corresponding well-designed data structures for common computational tasks, under the optimization criteria of both time and space complexity for algorithm execution. The algorithmic methodologies will be presented with concrete examples and applications, including iterative and recursive programming, divide-and-conquer, greedy, and dynamic programming techniques. The student will be able to analyze the complexity of existing algorithms and data structures, using analysis methodologies for algorithmic complexity evaluation, and will be trained in the design and analysis of new algorithms and data structures for computational biology.




Data Structure and Algorithm

Searching involves deciding whether a search key is present in the data, for example, looking up a name in a phone book or address book. The searching algorithms include:

  • Linear Search: See "Linear Search".
  • Recursive Binary Search for sorted list
  • Binary Tree Search

Linear Search

Complexity

The worst-case and average-case time complexity is O(n). The best-case is O(1).

Binary Search

A binary search (or half-interval search) is applicable only to a sorted array. It compares the search key with the middle element. If there is a match, it returns the element's index. If the search key is less than the middle element, repeat the search on the left half; otherwise, search the right half. If the number of remaining elements to be searched is zero, return "not found".
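
For illustration, a recursive Python sketch of this procedure (the example list and keys are hypothetical):

    def binary_search(sorted_list, key, low, high):
        """Search sorted_list[low..high] for key; return its index, or -1 for 'not found'."""
        if low > high:
            return -1                                 # nothing left to search
        mid = (low + high) // 2
        if sorted_list[mid] == key:
            return mid                                # match: return the element's index
        elif key < sorted_list[mid]:
            return binary_search(sorted_list, key, low, mid - 1)    # search the left half
        else:
            return binary_search(sorted_list, key, mid + 1, high)   # search the right half

    nums = [2, 5, 7, 11, 14, 20]
    print(binary_search(nums, 11, 0, len(nums) - 1))  # prints 3
    print(binary_search(nums, 9, 0, len(nums) - 1))   # prints -1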

Complexity

The worst-case and average-case time complexity for binary search is O(log n). The best-case is O(1).

Sorting

Sorting involves arranging data in ascending or descending order, according to a certain collating sequence (or sorting sequence). The sorting algorithms include:

  • Insertion Sort: See "Insertion Sort".
  • Selection Sort: See "Selection Sort".
  • Bubble Sort: See "Bubble Sort"
  • Merge Sort (Recursive Top-Down or Iterative Bottom-up)
  • Quick Sort (Recursive)
  • Bucket Sort
  • Heap Sort
  • Binary Tree Sort

Bubble Sort

Complexity

The average-case and worst-case time complexity is O(n²).

Insertion Sort

Complexity

The average-case and worst-case time complexity is O(n²).

Selection Sort

Complexity

The average-case and worst-case time complexity is O(n²).

Merge Sort

Recursive Top-Down Merge Sort

The algorithm is as follows:

  1. Recursively divide the list into 2 sublists.
  2. When the sublists contain 1 element (a list of 1 element is sorted), merge two sublists in the right order. Unwind the merging recursively.
Iterative Bottom-up Merge Sort

Treat the list as sublists of length 1, and iteratively merge pairs of sublists bottom-up, until there is only one sublist.

Complexity

The worst-case and average-case time complexity is O(n log n). The best-case is typically O(n log n). However, merge sort requires O(n) additional space for carrying out the merging.

Quick Sort

Quick sort is a divide and conquer algorithm. It picks a pivot and divides the list into two sub-lists: the low elements and the high elements, and recursively sorts the sub-lists.

  1. Pick an element, called the pivot, from the list.
  2. Partition the list so that the smaller elements are before the pivot, and the larger elements after the pivot.
  3. Recursively sort the sub-lists.

Partition: The partition procedure rearranges the list in place around the pivot, as sketched below.

Choosing the pivot: If the list is already sorted, choosing the first or last element as the pivot results in worst-case performance of O(n²). We choose the middle element and swap it with the element at the right end, so that it does not interfere with the partition process.
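
A Python sketch of such a partition step (a standard in-place scheme consistent with the description above; an illustration rather than the QuickSort.cpp listing referenced below):

    def partition(arr, low, high):
        """Partition arr[low..high] in place around a pivot; return the pivot's final index."""
        mid = (low + high) // 2
        arr[mid], arr[high] = arr[high], arr[mid]     # move the middle-element pivot to the right end
        pivot = arr[high]
        boundary = low                                # everything to the left of boundary is < pivot
        for index in range(low, high):
            if arr[index] < pivot:
                arr[index], arr[boundary] = arr[boundary], arr[index]
                boundary = boundary + 1
        arr[boundary], arr[high] = arr[high], arr[boundary]   # place the pivot between the two regions
        return boundary

    nums = [24, 18, 7, 42, 5]
    p = partition(nums, 0, len(nums) - 1)
    print(nums, p)                                    # prints [5, 7, 24, 42, 18] 1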

QuickSort.cpp
Complexity

The worst-case time complexity is O(n²). The average-case (typical) and best-case are O(n log n). In-place sorting can be achieved without additional space requirements.

Bucket Sort

Bucket sort (or bin sort) works by distributing the elements into a number of buckets, and each bucket is then sorted individually. Bucket sort is a distribution sort, and is a cousin of radix sort, in which the sorting begins at the most significant digit and proceeds down to the less significant ones. Bucket sort is not a comparison sort like bubble sort, insertion sort, or selection sort.

The algorithm is as follows:

  1. Set up buckets, initially empty.
  2. Scatter: place each element into an appropriate bucket.
  3. Sort each non-empty bucket.
  4. Gather: Gather elements from the buckets and put them back into the original array.
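
A minimal Python sketch of these steps, assuming for illustration that the values fall in the range [0, 1):

    def bucket_sort(nums, num_buckets=10):
        """Scatter nums into buckets, sort each bucket, then gather the results."""
        buckets = [[] for _ in range(num_buckets)]    # step 1: set up empty buckets
        for value in nums:
            index = int(value * num_buckets)          # step 2: scatter into an appropriate bucket
            buckets[index].append(value)
        result = []
        for bucket in buckets:
            result.extend(sorted(bucket))             # steps 3 and 4: sort each bucket and gather
        return result

    print(bucket_sort([0.42, 0.31, 0.87, 0.05, 0.33]))    # prints [0.05, 0.31, 0.33, 0.42, 0.87]
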
Radix Sort (Recursive)

Use 10 buckets to sort from the most-significant down to the least-significant digit.

  • Need to implement the buckets using dynamic arrays (such as vector), as their sizes are unknown and too expensive to set to the maximum.

Data Structures

The built-in array has many limitations. It is fixed in size and cannot grow or shrink during execution. The size has to be decided at the time of declaration.

Many applications require dynamic data structures, that can grow and shrink during execution. The commonly used data structures include:

  • List: Single linked list, double linked list, etc.
  • Queue: FIFO queue, priority queue, etc.
  • Stack: LIFO queue
  • Tree:
  • Map or Associative Array:

Single Linked List

Node.h
List.h
TestList.cpp

Double Linked List

DoubleLinkedNode.h
DoubleLinkedList.h
TestDoubleLinkedList.cpp

Stack (LIFO Queue)

Stack.h
TestStack.cpp

A tree has a root node. Each parent node could have child nodes. A node without children is called a leaf node. A tree with only the root node is called a null tree. The depth of a tree is the length of the path from the root to the deepest node in the tree. A null tree has a depth of zero.

In a binary tree, a parent node could have up to two child nodes: a left child and a right child (two children of the same parent are called siblings). They are the roots of the left subtree and right subtree, respectively.

Depth-First Search (DFS)

Start at the root and explore as far as possible along each branch before backtracking. There are 3 types of depth-first search:

  1. Pre-order: visit the root, traverse the left subtree, then the right subtree. E.g., 6 -> 5 -> 4 -> 10 -> 7 -> 9 -> 15.
  2. In-order: traverse the left subtree, visit the root, then the right subtree. E.g., 4 -> 5 -> 6 -> 7 -> 9 -> 10 -> 15.
  3. Post-order: traverse the left subtree, the right subtree, then visit the root. E.g., 4 -> 5 -> 9 -> 7 -> 15 -> 10 -> 6.

Pre-, in- and post- refer to the order of visiting the root.

Breadth-First Search (BFS)

Begin at the root, visit all its child nodes. Then for each of the child nodes visited, visit their child nodes in turn. E.g., 6 -> 5 -> 10 -> 4 -> 7 -> 15 -> 9.

Binary Search Tree

A binary search tree, without duplicate elements, has these properties:

  1. All values in the left subtree are smaller than the parent node.
  2. All values in the right subtree are larger than the parent node.

The above diagram illustrates a binary search tree. You can retrieve the sorted list or perform searching via in-order depth-first traversal. Take note that the actual shape of the tree depends on the order of insertion.

Node.h
BinaryTree.h
TestBinaryTree.cpp

[TODO] Breadth-First Search: need a FIFO queue to keep the child nodes.



Week 2

We consider two fundamental data types for storing collections of objects: the stack and the queue. We implement each using either a singly-linked list or a resizing array. We introduce two advanced Java features—generics and iterators—that simplify client code. Finally, we consider various applications of stacks and queues ranging from parsing arithmetic expressions to simulating queueing systems.





MODELS IN BIOLOGY: SCALES AND COMPLEXITY

As already pointed out, any natural phenomenon can be observed at different scales; thus, in describing the phenomenon by conceptual and quantitative models, one needs to choose the appropriate scale to describe the experimental data available.

However, in almost all complex natural phenomena there are aspects that cannot even be observed at just one scale of description (either temporal or spatial). To study these specific facets of reality, multiscale models that represent objects and relationships on different levels of abstraction are required.

Choosing a scale depends on which aspects of the phenomena, from ‘micro’ to ‘macro’, one is interested in analyzing. In physics this is already a well-defined approach that originates from different research areas, and the distinction between different scales is based upon the characteristic lengths of objects and the characteristic times of the phenomena under investigation. For instance, microphysics refers to areas of physics that study phenomena taking place on the microscopic scale (length scales <1 mm), such as molecular physics, atomic physics, nuclear physics and particle physics.

In the life sciences the definition of a scale is a bit more ambiguous. A basic unit available for defining a scale is the ‘cell’, with no regard to its physical dimension. Starting from this, one can define different scales: the ‘sub-cellular’ or ‘intra-cellular’ scale; the ‘cellular’, mesoscopic, or ‘inter-cellular’ scale; the ‘macroscopic’ scale; and the ‘population’ scale.

Models developed at the sub-cellular scale deal with the evolution of the physical and biochemical state of a single cell. This scale involves genes, proteins and signals in the cell nucleus and on its surface, which regulate the evolution of the cell and any signal-processing operations of the cells enabling cell crosstalk.

Modeling the overall activity of a single cell is a very hard problem as many biological details of this activity are unknown.

Biologists and modelers have joined forces to develop and use mathematical and computer science techniques in modeling sub-cellular phenomena. Interested readers can find plenty of references in the scientific literature [ 13].

At the cellular scale, one is interested in describing the evolution of a system consisting of a large number of different interacting cells and molecules. Cell interactions are regulated by signals emitted and received by cells through complex recognition processes. The cellular scale is thus highly connected with the sub-cellular scale but, modeling at this scale, one may forget the details of single-cell models and consider them as BBM. The areas of mathematical methods and tools involved in this description refer to statistical mechanics, cellular automata, lattice gases and other similar approaches.

The ‘macroscopic scale’ includes tissues, organs (i.e. a collection of tissues joined in a structural unit to serve a common function), systems (i.e. a group of organs working together to perform a certain task) and organisms.

At this scale, one is interested in describing the dynamical behavior of observable quantities, in most cases the concentrations of various entities (cells or molecules). Tissues are usually described using techniques originating from physical continuous systems, i.e. ordinary or partial differential equations or moments of kinetic equations. In describing organs, a model is required to describe both the main tissue and the sporadic tissues but also, and most importantly, the biological function. To model a system, one is required to consider a network of organs that perform a specific task. Depending on the modeling goal, the model of a biological system can be arranged with different levels of detail. Organs can be fully described in their components or simply as BBM performing a given task. Connections between organs (like lymphatic vessels) can be described physically (a dynamical description of the fluid motion in the vessels) or simply by considering the flux and the time required to move portions of fluid between different organs, i.e. through laws of transport.

Finally, at the population scale, one is interested in describing the dynamics of populations with respect to one or more characteristics. Models of epidemics or population control are well-known models at this scale. Population dynamics is extremely complex because the effects of all previously mentioned scales and the effects of the environment on a single organism can modify the overall dynamics of the populations. In this class of models, a single organism can or cannot be described in detail, according to the size of the populations one is required to describe. In both cases, changes in the major characteristics of a single organism must be taken into account. For example, to describe the response of a population to large-scale vaccinations (as required in influenza epidemics), one does not describe any single organism in detail, but it may be required to consider the age structure of the population and the effects of the environment. In contrast, for small populations, like modeling the effect of a new vaccine in a small trial, a sufficiently detailed description of the organism and the effect of the vaccine on each organism is required. A variety of different techniques are available for these classes of models. In the former case, one uses either ordinary or stochastic differential equations to describe the population dynamics, or agent-based models (simple agents representing a single organism) to study the resulting complex phenomena. In the latter case, a more detailed description of the organism is required and the population dynamics may eventually be extracted from the dynamic response of each organism.

Complexity and multiscale models

Living organisms are complex systems. Using a ‘classical’ definition, a complex system is a system composed of different interconnected parts that, as a whole, exhibit one or more properties which do not obviously arise from the properties of the individual parts. System complexity may be either a ‘disorganized complexity’ or an ‘organized complexity’. In the former case complexity arises from a very large number of parts, whereas in the latter case complexity is intrinsic to the system, eventually with a limited number of parts, and its connection rules.

In living organisms both situations occur. A living organism is formed by a collection of different parts which are, each of them, organized complex systems. Cells, organs and systems of the human body are each complex systems.

As an example, the immune system is one of the most complex, where complexity arises both from a very large number of parts (organs) and constituents (cells and molecules) and from the rules hierarchically connecting the different scales of the parts.

Models including many scales of a phenomenon are now requested both for knowledge discovery and drug discovery. In the life sciences, not only an entire living organism but also parts of the organism are too complex to be represented in a single, precise, multiscale model. The resulting model would certainly not be computable. Thus, one is forced to break the conceptual model into a set of models describing only part of the phenomenon (like a single organ, or a definite scale) and connect their outcomes [ 16]. Linking models at different scales is not an easy task. Phenomena occurring at different scales usually have different characteristic time scales, and the models’ outputs should be properly fitted. Interested readers are referred to another study [ 3] in this issue.









