My data looks as follows:
ID  my_val  db_val
a   X       X
a   X       X
a   Y       X
b   X       Y
b   Y       Y
b   Y       Y
c   Z       X
c   X       X
c   Z       X
Expected result:
ID  my_val   db_val  match
a   X:2;Y:1  X       full_match
b   Y:2;X:1  Y       full_match
c   Z:2;X:1  X       partial_match
A full_match is when db_val matches the most abundant my_val; a partial_match is when db_val occurs among the other values but doesn't match the top one.
My current approach consists of grouping by ID, counting values into a separate column, concatenating each value with its count, then aggregating all values into one row per ID.
This is how I aggregate the columns:
from functools import reduce
import pandas as pd

def all_hits_aggregate_df(df, columns=['my_val']):
    grouped = df.groupby('ID')
    l = []
    for c in columns:
        # Count each value per ID, most frequent first
        res = grouped[c].value_counts(ascending=False).to_frame('count_' + c).reset_index(level=1)
        # Build "value:count" strings
        res[c] = res[c].astype(str) + ':' + res['count_' + c].astype(str)
        # Join all "value:count" pairs into one row per ID
        l.append(res.groupby('ID').agg(lambda x: ';'.join(x)))
    return reduce(lambda x, y: pd.merge(x, y, on='ID'), l)
For the comparison phase, I loop through each row, parse the my_val column back into lists, and then compare.
I'm sure the way I do the comparison step is extremely inefficient, but I don't see how to do it before aggregation so I can avoid parsing the generated string later in the process.
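One way to avoid re-parsing the strings is to compute the top value and the set of observed values on the counted (pre-concatenation) data, and classify each ID from those, so the "val:count" string is built purely for display. A sketch of that idea, assuming db_val is constant within each ID and using a hypothetical no_match label for values absent entirely:

```python
import pandas as pd

df = pd.DataFrame({
    'ID':     ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'my_val': ['X', 'X', 'Y', 'X', 'Y', 'Y', 'Z', 'X', 'Z'],
    'db_val': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'X', 'X', 'X'],
})

# Count my_val per ID, then order within each ID by descending count
counts = (df.groupby(['ID', 'my_val']).size()
            .reset_index(name='n')
            .sort_values(['ID', 'n'], ascending=[True, False]))

# Build the "value:count" display string and keep the raw pieces around
counts['pair'] = counts['my_val'] + ':' + counts['n'].astype(str)
agg = counts.groupby('ID').agg(
    my_val=('pair', ';'.join),              # e.g. "X:2;Y:1"
    top=('my_val', 'first'),                # most abundant value (sorted first)
    values=('my_val', lambda s: set(s)),    # all observed values, for membership
)

# One db_val per ID (assumption: it never varies within a group)
agg['db_val'] = df.groupby('ID')['db_val'].first()

def classify(row):
    if row['db_val'] == row['top']:
        return 'full_match'
    if row['db_val'] in row['values']:
        return 'partial_match'
    return 'no_match'   # hypothetical label, not in the original spec

agg['match'] = agg.apply(classify, axis=1)
result = agg[['my_val', 'db_val', 'match']].reset_index()
```

The classification never touches the concatenated string: it works off the intermediate counts, so the expensive parse-back step disappears.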