Investigating the efficiency of using string prefixes to detect duplicate genes
Explaining the source code
Alish
def compare_gene_prefix_alish(gene: Gene, other_gene: Gene) -> bool:
"""Compare the prefix of the gene name with the prefix of another gene name."""
# extract the prefix from the gene name
# compare the prefix to the prefix of the other gene name
# return a boolean value that indicates whether or not the prefixes are equal
# refer to the function specification for more details about this function
return (
gene.gene_name[: gene.gene_name_prefix]== other_gene.gene_name[: other_gene.gene_name_prefix]
)
def detect_duplicates_gene_alish(data: List[Gene]) -> bool:
"""Detect whether or not there are duplicate values in a list of Gene values."""
# provide an implementation of this function that can determine
# whether or not the provided list of Gene values contains duplicates
if not data:
return False
for i in range(len(data)):
for j in range(i + 1, len(data)):
if compare_gene_prefix_alish(data[i], data[j]):
return True
return False
The function detect_duplicates_gene_alish
is designed to determine whether there are duplicate gene names in a given list of Gene objects. It achieves this by iterating through the list and comparing each gene’s prefix with every other gene’s prefix using the compare_gene_prefix_alish
function. If a pair of genes has the same prefix, it indicates a potential duplicate, and the function returns True
. If no duplicates are found after comparing all pairs, it returns False
.
The time complexity of detect_duplicates_gene_alish
heavily depends on the time complexity of compare_gene_prefix_alish
. In the worst-case scenario, compare_gene_prefix_alish
compares the prefixes of every pair of genes in the list, resulting in a nested loop structure. With the introduction of a new understanding of the time complexity for compare_gene_prefix_alish
being \(O(m)\), where \(m\) is the length of the prefix, the overall time complexity of detect_duplicates_gene_alish
becomes \(O(n^3)\), where \(n\) represents the number of genes in the input list.
Miles
def detect_duplicates_int(data: list) -> bool:
"""Detect whether or not there are duplicate values in a list of integer values."""
if len(data) == 0:
return False
= len(data)
n for i in range(1, n):
for j in range(i):
if data[i] == data[j]:
return False
return False
def detect_duplicates_gene_miles(data: list) -> bool:
"""Detect whether or not there are duplicate values in a list of Gene values."""
# provide an implementation of this function that can determine
# whether or not the provided list of Gene values contains duplicates
= detect_duplicates_int_miles(data)
boolean if boolean:
return True
else:
return False
When looking at my code, I assumed that it would have a time complexity of at least \(O(n^2)\) because it has two for loops. The results from the benchmark framework show that it took \(4.05999890062958e-05\) seconds. Based off of the results I believe my analytical evaluation was correct. My implementation of the problem was one of the slowest out of my group, that’s because I used two for
loops in my function while the others used \(1\) or none.
Keller
def detect_duplicates_gene(data: List[Gene]) -> bool:
"""Detect whether or not there are duplicate values in a list of Gene values."""
if not data:
return False
= set()
seen_prefixes for gene in data:
= gene.gene_name[: gene.gene_name_prefix]
gene_prefix if gene_prefix in seen_prefixes:
return True
seen_prefixes.add(gene_prefix)return False
This function will loop through each gene in the data set using one for loop which makes this function’s time complexity O(n)
. This function first determines whether data is present or not using an if statement that will return False
if no data is present. The seen_prefixes
keeps track of the prefixes of the gene names that have been seen so far. The for loop looks at each gene in the data and gets its prefix. This prefix will be added to the set. If a prefix that has been previously added to the set is the same as a prefix that is being added to the set then it will return True
indicating that there is a duplicate. If no prefixes are duplicates then the data will be looped through and return False
indicating there are no duplicates in the data.
Tugi
def detect_duplicates_gene(data):
"""Detect whether or not there are duplicate values in a list of Gene values."""
= {}
gene_map
for gene_object in data:
if gene_object.gene_name_prefix in gene_map:
return True
else:
= 1
gene_map[gene_object.gene_name_prefix]
return False
I think that my function is \(O(n)\) due to the use of one for
loop when iterating the data. This code above utilizes a dictionary and counts the number of the genes it finds. If there is more than one found, then it will immediately give True
!
Team Four
The following code block contains implementations of various functions to detect duplicate values within gene objects. To empirically validate their efficiency, the code utilizes the timeit module. Specifically, lambda functions are employed to repeatedly execute each detection function with a shared data set, thereby measuring their execution time. This empirical analysis serves to validate their performance against theoretical time complexities and offers valuable insights into their efficiency.
By observing how the execution times align with the expected theoretical complexities, we gain a deeper understanding of algorithmic efficiency. This process aids in discerning which approaches are more effective and scalable, thereby facilitating the selection of optimal algorithms for various problem scenarios. Additionally, these measured execution times provide practical feedback on the scalability and efficiency of the algorithms. Such experimentation fosters an enhanced comprehension of algorithmic complexity and informs the design of more efficient solutions across diverse problem domains.
"""Team Four: Algorithm Analysis for Question 4 of the Midterm Examination."""
import timeit
from typing import List
# Part (a) {{{
def detect_duplicates_int_nic(data: List[int]) -> bool:
"""Detect whether or not there are duplicate values in a list of integer values."""
# If matrix is empty, then there are no duplicate values
if len(data) == 0:
return False
# Initialize a list of seen values
= []
seen_values # Check if each value is in the list of seen values
for value in data:
# If it is, then return True
if value in seen_values:
return True
# If not, add it to the list of seen values
seen_values.append(value)# If there are no duplicates, return False
return False
def detect_duplicates_int_alish(data: str) -> bool:
"""Detect whether or not there are duplicate values in a list of integer values."""
return len(set(data)) != len(data)
def detect_duplicates_int_miles(data: list) -> bool:
"""Detect whether or not there are duplicate values in a list of integer values."""
if len(data) == 0:
return False
= len(data)
n for i in range(1, n):
for j in range(i):
if data[i] == data[j]:
return False # changed to false
return False
# }}}
# Part (b) {{{
def detect_duplicates_str(data: List[str]) -> bool:
"""Detect whether or not there are duplicate values in a list of string values."""
return len(set(data)) != len(data)
# }}}
# Part (c) {{{
class Gene:
"""Represent a gene with a name, prefix amount, and a flexible description."""
def __init__(self, gene_name: str, gene_prefix: int, gene_description: str) -> None:
"""Initialize a gene with an agreed-on name and a flexible description."""
self.gene_name = gene_name
self.gene_name_prefix = gene_prefix
self.gene_description = gene_description
def compare_gene_prefix_alish(gene: Gene, other_gene: Gene) -> bool:
"""Compare the prefix of the gene name with the prefix of another gene name."""
# extract the prefix from the gene name
# compare the prefix to the prefix of the other gene name
# return a boolean value that indicates whether or not the prefixes are equal
# refer to the function specification for more details about this function
return (
gene.gene_name[: gene.gene_name_prefix]== other_gene.gene_name[: other_gene.gene_name_prefix]
)
def detect_duplicates_gene_alish(data: List[Gene]) -> bool:
"""Detect whether or not there are duplicate values in a list of Gene values."""
# provide an implementation of this function that can determine
# whether or not the provided list of Gene values contains duplicates
if not data:
return False
for i in range(len(data)):
for j in range(i + 1, len(data)):
if compare_gene_prefix_alish(data[i], data[j]):
return True
return False
def detect_duplicates_gene_keller(data: List[Gene]) -> bool:
"""Detect whether or not there are duplicate values in a list of Gene values."""
if not data:
return False
= set()
seen_prefixes for gene in data:
= gene.gene_name[: gene.gene_name_prefix]
gene_prefix if gene_prefix in seen_prefixes:
return True
seen_prefixes.add(gene_prefix)return False
def detect_duplicates_gene_miles(data: list) -> bool:
"""Detect whether or not there are duplicate values in a list of Gene values."""
# provide an implementation of this function that can determine
# whether or not the provided list of Gene values contains duplicates
= detect_duplicates_int_miles(data)
boolean if boolean:
return True
else:
return False
def compare_gene_prefix_ochirsaikhan(gene: Gene, other_gene: Gene) -> bool:
"""Compare the prefix of the gene name with the prefix of another gene name."""
# extract the prefix from the gene name
= gene.gene_name[: gene.gene_name_prefix]
prefix = other_gene.gene_name[: other_gene.gene_name_prefix]
other_prefix # compare the prefix to the prefix of the other gene name
if prefix == other_prefix:
return True
# return a boolean value that indicates whether or not the prefixes are equal
# refer to the function specification for more details about this function
return False
def detect_duplicates_gene_ochirsaikhan(data: List[Gene]) -> bool:
"""Detect whether or not there are duplicate values in a list of Gene values."""
# provide an implementation of this function that can determine
# whether or not the provided list of Gene values contains duplicates
if data:
for i in range(len(data) - 1):
= data[i]
curr_gene = data[i + 1]
next_gene if compare_gene_prefix_ochirsaikhan(curr_gene, next_gene):
return True
return False
def detect_duplicates_gene_tugi(data: List[Gene]) -> bool:
"""Detect whether or not there are duplicate values in a list of Gene values."""
# : provide an implementation of this function that can determine
# whether or not the provided list of Gene values contains duplicates
map = {}
for gene_object in data:
if gene_object.gene_name_prefix not in map:
map[gene_object.gene_name_prefix] = 1
else:
return True
return False
def detect_duplicates_gene_nic(data: List[Gene]) -> bool:
"""Detect whether or not there are duplicate values in a list of Gene values."""
# Create a list of prefixes
= []
prefixes for gene in data:
prefixes.append(gene.gene_name_prefix)# Return result of checking for duplicate values
return detect_duplicates_int_nic(prefixes)
# All Hands {{{
# The following section contains code that will execute functions from the team's
# four members and return a value representing the time it took the program to execute.
# This investigation aims to understand the complexity of our programs better and grasp
# what makes an algorithm more efficient, even if they share the same complexity.
= [
input_gene_data "AADACL3", 5, "Arylacetamide deacetylase-like 3"),
Gene("FPGT02", 4, "Fucose-1-phosphate guanylyltransferase"),
Gene("AADACL4", 5, "Arylacetamide deacetylase-like 4"),
Gene("AADACL3", 5, "Arylacetamide deacetylase-like 3"),
Gene("AADACL4", 5, "Arylacetamide deacetylase-like 4"),
Gene(
]
= [
timing_results "detect_duplicates_gene_alish", timeit.timeit(lambda: detect_duplicates_gene_alish(input_gene_data), number=10)),
("detect_duplicates_gene_keller", timeit.timeit(lambda: detect_duplicates_gene_keller(input_gene_data), number=10)),
("detect_duplicates_gene_miles", timeit.timeit(lambda: detect_duplicates_gene_miles(input_gene_data), number=10)),
("detect_duplicates_gene_ochirsaikhan", timeit.timeit(lambda: detect_duplicates_gene_ochirsaikhan(input_gene_data), number=10)),
("detect_duplicates_gene_tugi", timeit.timeit(lambda: detect_duplicates_gene_tugi(input_gene_data), number=10)),
("detect_duplicates_gene_nic", timeit.timeit(lambda: detect_duplicates_gene_nic(input_gene_data), number=10))
(
]
= sorted(timing_results, key=lambda x: x[1])
sorted_results
for result in sorted_results:
print(f"Timing {result[0]}: {result[1]}")
# }}}
After multiple runs of the experiment, Tugi and Nic consistently emerged as the fastest performers, with Keller occasionally shifting between second and third place. Meanwhile, Ochirsaikhan, Miles, and Alish showed varying performance, typically falling in the fourth, fifth, and sixth positions. Notably, Tugi demonstrated remarkably short execution times, closely followed by Nic, indicating the efficiency of their algorithms. Keller exhibited commendable performance, albeit slightly slower. Conversely, Ochirsaikhan, Miles, and Alish consistently showed longer execution times. These findings offer valuable insights into the efficiency and scalability of the detection algorithms, guiding potential optimizations for future implementations.
Example of the output:
Timing detect_duplicates_gene_tugi: 5.300999873725232e-06
Timing detect_duplicates_gene_nic: 8.33000012789853e-06
Timing detect_duplicates_gene_keller: 9.926000529958401e-06
Timing detect_duplicates_gene_ochirsaikhan: 1.4033000297786202e-05
Timing detect_duplicates_gene_miles: 1.4749999536434188e-05
Timing detect_duplicates_gene_alish: 1.749400053085992e-05