我们可以将其设置为约束优化问题,它有四个部分:
- 创建变量:我们将用布尔向量表示我们对行选择的选择。第 k 个条目中的 True 意味着我们选择了第 k 行,而 False 则意味着我们没有选择。
- 指定约束:我们需要确保目标行的总和大于阈值。通过计算目标列和所选行向量之间的点积来完成。
- 指定目标函数。这里的目标函数是所选行的成本之和,即成本列与所选行向量的点积。
- 在这种情况下求解是最小化受约束的目标函数。
有几个 Python 运筹学库,即或图书馆 https://www.xiang.dev/python-or/用于解决此类问题。该解决方案使用Google OR 工具 https://developers.google.com/optimization这是一个“用于优化的开源软件套件”。
我们表明,使用优化包进行求解比对所有可能的行选择执行详尽搜索的替代解决方案要快得多。穷举搜索的计算复杂度呈指数级,O(2^nrows),因此仅适用于少量行(即
Code
import numpy as np
import pandas as pd
# Google or-tools solver
from ortools.sat.python import cp_model
import timeit
def solve(df, threshold):
'''
Uses or-tools module to solve optimization
'''
weights = df['Target']
cost = df['Cost']
names = df['Names']
# Creates the model.
model = cp_model.CpModel()
# Step 1: Create the variables
# array containing row selection flags i.e. True if row k is selected, False otherwise
# Note: treated as 1/0 in arithmeetic expressions
row_selection = [model.NewBoolVar(f'{i}') for i in range(df.shape[0])]
# Step 2: Define the constraints
# The sum of the weights for the selected rows should be >= threshold
model.Add(weights.dot(row_selection) >= threshold)
# Step 3: Define the objective function
# Minimize the total cost (based upon rows selected)
model.Minimize(cost.dot(row_selection))
# Step 4: Creates the solver and solve.
solver = cp_model.CpSolver()
solver.Solve(model)
# Get the rows selected
rows = [row for row in range(df.shape[0]) if solver.Value(row_selection[row])]
return names[rows]
# Setup
df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd', 'e', 'f'],
'Target': [35, 15, 12, 8, 7, 5],
'Cost': [15, 40, 30, 30, 25, 10]})
print(solve(df, 40))
# Output:
0 a
5 f
Name: Names, dtype: object
表现
当前解决方案(基于 OR-Tools)
%timeit main(df, 40)
3.13 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
与穷举搜索算法相比,例如斯科特·波士顿解决方案 https://stackoverflow.com/questions/69473985/pandas-find-subset-of-rows-minimizing-the-sum-of-a-column-under-other-column-con/69474411#69474411.
from itertools import combinations, chain
df = pd.DataFrame(
{
"Names": ["a", "b", "c", "d", "e", "f"],
"Target": [35, 15, 12, 8, 7, 5],
"Cost": [15, 40, 30, 30, 25, 10],
}
)
def powerset(iterable):
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
%timeit min( (df.loc[list(i)] for i in powerset(df.index) if df.loc[list(i), "Target"].sum() >= 40), key=lambda x: x["Cost"].sum(),)
64.4 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
因此,使用 OR-Tools 比穷举搜索快约 20 倍(即 3.13 毫秒与 64.4 毫秒)