MIT 6.5830(6.830) Lab3

简介

Lab3主要实现查询优化,在本实验中,主要思想是为每一个表进行统计,其中包括了扫描耗费、对不同条件的查询耗费,然后综合来判断一个查询计划中的连接耗费,对连接次序进行调整,找到最小耗费的连接次序。在这个过程中,我们需要分别完成对不同选择条件的耗费估计、为每个表建立统计状态、对连接的耗费估计以及最后对连接次序重排。如下图所示,IntHistogramStringHistogram对不同类型的Field选择进行估计,然后在TableStats中保存每张表的统计数据,之后在JoinOptimizer中利用统计数据对计划进行排序。

img

Exercise

Exercise 1 IntHistogram

在练习一中实现对于int类型数据的选择直方图,其实可以理解为,对于不同的操作和比较数,返回满足条件的元组占总元组的比例,在文档中也对怎么实现进行了描述。对于给定的输入流,计算它的最大最小值,然后根据bucket的数目,对输入流进行平均划分。然后给定一个值和操作符,比如给定值b和操作符>,那就如图中所示,计算阴影部分柱形图面积占总面积的比例,即估算的查询耗费。

img

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
   public void addValue(int v) {
// TODO: some code goes here
this.tuples ++;
//System.out.println(this);
buckets[(v - min) / range] ++;
}
public double estimateSelectivity(Predicate.Op op, int v) {

// TODO: some code goes here

if (op == Predicate.Op.EQUALS) {
if (v < min || v > max) return 0;
int index = (v - min) / range;
int h = buckets[index];
return (h * 1.0 / range) / tuples;
}
if (op == Predicate.Op.NOT_EQUALS) {

return 1.0 - estimateSelectivity(Predicate.Op.EQUALS, v);
}
if (op == Predicate.Op.GREATER_THAN) {
if (v < min) return 1;
if (v > max) return 0;
int index = (v - min) / range;
int h = buckets[index];
int sum = 0;
for (int i = index + 1; i < buckets.length; i++) {
if (i == buckets.length - 1) {
sum += (max - i * range - min + 1) * buckets[i];
}
else sum += buckets[i] * range;
}
return (((index + 1) * range + min - 1 - v) * buckets[index] + sum + 0.0) / tuples;
}
if (op == Predicate.Op.LESS_THAN) {
if (v < min) return 0;
if (v > max) return 1;
int index = (v - min) / range;
int h = buckets[index];
int sum = 0;
for (int i = 0; i < index; i++) {
sum += buckets[i] * range;
}
return ((v - range * index - min) * buckets[index] + sum + 0.0) / tuples;
}
if (op == Predicate.Op.GREATER_THAN_OR_EQ) {
return estimateSelectivity(Predicate.Op.GREATER_THAN, v ) + estimateSelectivity(Predicate.Op.EQUALS, v);
}
if (op == Predicate.Op.LESS_THAN_OR_EQ) {
return estimateSelectivity(Predicate.Op.LESS_THAN, v) + estimateSelectivity(Predicate.Op.EQUALS, v);
}
return -1.0;
}

Ecercise 2 TableStats

这个练习主要实现对每一张表的各种统计数据的保存,这个类需要实现的最主要方法就是给定筛选条件中的field op value,需要返回这个比较操作的估计耗费,这就需要用到我们刚才实现的类。为了初始化IntHistogramStringHistogram,我们需要数据流的最大值和最小值,然后不断向其中添加元素,所以这里就需要对表中的元组进行两次遍历,第一次统计出最大值最小值,并为每一个Field初始化直方图,然后第二遍遍历向其中添加元素。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
   static final int NUM_HIST_BINS = 1000;
private int tableId;
private int ioCostPerPage;
private Map<Integer, List<Integer>> maxAndMin;
private Map<Integer, IntHistogram> intMap;
private Map<Integer, StringHistogram> strMap;
private int pageNum;
private int tupleNum;
private TupleDesc td;
public TableStats(int tableid, int ioCostPerPage) {
// TODO: some code goes here
this.tableId = tableid;
this.ioCostPerPage = ioCostPerPage;
this.maxAndMin = new HashMap<>();
this.intMap = new HashMap<>();
this.strMap = new HashMap<>();
HeapFile heapFile =(HeapFile) Database.getCatalog().getDatabaseFile(tableid);
this.pageNum = heapFile.numPages();
this.tupleNum = 0;
DbFileIterator iterator = heapFile.iterator(new TransactionId());

try {
iterator.open();
while (iterator.hasNext()) {
Tuple next = iterator.next();
this.tupleNum ++;
TupleDesc tupleDesc = next.getTupleDesc();
if (this.td == null) td = tupleDesc;
for (int i = 0; i < tupleDesc.numFields(); i++) {
if (tupleDesc.getFieldType(i) == Type.INT_TYPE) {
IntField field = (IntField) next.getField(i);
if (!maxAndMin.containsKey(i)) {
List<Integer> list = new ArrayList<>();
list.add(field.getValue());list.add(field.getValue());
maxAndMin.put(i, list);
} else {
List<Integer> list = maxAndMin.get(i);
list.set(0, Math.min(list.get(0), field.getValue()));
list.set(1, Math.max(list.get(1), field.getValue()));
maxAndMin.put(i, list);
}
}
}
}
for (Map.Entry<Integer, List<Integer>> entry : maxAndMin.entrySet()) {
int filedId = entry.getKey();
List<Integer> list = entry.getValue();
if (td.getFieldType(filedId) == Type.INT_TYPE) {
intMap.put(filedId, new IntHistogram(NUM_HIST_BINS, list.get(0), list.get(1)));
} else {
strMap.put(filedId, new StringHistogram(NUM_HIST_BINS));
}
}
iterator.rewind();
while (iterator.hasNext()) {
Tuple next = iterator.next();
for (int i = 0; i < td.numFields(); i++) {
if (td.getFieldType(i) == Type.INT_TYPE) {
IntField field = (IntField) next.getField(i);
intMap.get(i).addValue(field.getValue());
} else {
StringField field = (StringField) next.getField(i);
strMap.get(i).addValue(field.getValue());
}
}
}
iterator.close();
} catch (DbException | TransactionAbortedException e) {
throw new RuntimeException(e);
}
}

Exercise 3 Join Cost Estimation

对两个连接所产生表的耗费和数量进行估计,产生表的数量主要遵从以下三个原则

  • 对于等值连接,如果连接属性是一张表的主键,那么最后的数量不能超过另一张表的元组数量;
  • 还是等值连接,如果连接属性没有主键,那么最后的数量可取两表元组数量中的最大值;
  • 对于非等值连接,可以产生各种数量的表,从0到两表数量相乘都可以,所以加入一个因子,可以是0.3进行平均化。

而连接的耗费则在文档中也有说明

1
2
joincost(t1 join t2) = scancost(t1) + ntups(t1) x scancost(t2) //IO cost
+ ntups(t1) x ntups(t2) //CPU cost
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
public double estimateJoinCost(LogicalJoinNode j, int card1, int card2,
double cost1, double cost2) {
if (j instanceof LogicalSubplanJoinNode) {
// A LogicalSubplanJoinNode represents a subquery.
// You do not need to implement proper support for these for Lab 3.
return card1 + cost1 + cost2;
} else {
// Insert your code here.
// HINT: You may need to use the variable "j" if you implemented
// a join algorithm that's more complicated than a basic
// nested-loops join.
return cost1 + cost2 * card1 + card1 * card2;
}
}
public static int estimateTableJoinCardinality(Predicate.Op joinOp,
String table1Alias, String table2Alias, String field1PureName,
String field2PureName, int card1, int card2, boolean t1pkey,
boolean t2pkey, Map<String, TableStats> stats,
Map<String, Integer> tableAliasToId) {
int card = 1;
if (joinOp == Predicate.Op.EQUALS) {
if (t1pkey) {
card = card2;
}
else if (t2pkey) {
card = card1;
}
else {
card = Math.max(card1, card2);
}
} else {
card = (int) (0.3 * card1 * card2);
}
// TODO: some code goes here
return card <= 0 ? 1 : card;
}

Exercise 4 Join Ordering

在完成了之前的练习之后,可以实现对连接顺序的优化。实验中主要实现的是称为Selinger algorithm的方法,伪代码如下

1
2
3
4
5
6
7
8
9
10
11
1. j = set of join nodes
2. for (i in 1...|j|):
3. for s in {all length i subsets of j}
4. bestPlan = {}
5. for s' in {all length d-1 subsets of s}
6. subplan = optjoin(s')
7. plan = best way to join (s-s') to subplan
8. if (cost(plan) < cost(bestPlan))
9. bestPlan = plan
10. optjoin(s) = bestPlan
11. return optjoin(j)

对于要连接的节点集合,生成它所有长度的子集,然后对于从长到短的这些节点集合进行下一步判断。记录一个最佳即耗费最小的计划,然后计算如下操作的耗费:将集合S中一个连接节点S’与子计划{S - S’}进行连接成本,如果这个耗费更小的话,就更新最佳耗费变量。同时伪代码中的optjoin即连接缓存,将每一个子连接的最佳耗费进行记录,并在最终产生完整查询的最佳顺序。

在代码框架中已经给定了产生所有子集和计算子计划连接成本的函数,即伪代码第3行和6-7行代码所完成的操作。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
public List<LogicalJoinNode> orderJoins(
Map<String, TableStats> stats,
Map<String, Double> filterSelectivities, boolean explain)
throws ParsingException {
// Not necessary for labs 1 and 2.
// TODO: some code goes here
PlanCache planCache = new PlanCache();
Set<Set<LogicalJoinNode>> sets = null;
for (int i = 1; i < joins.size() + 1; i++) {
sets = enumerateSubsets(joins, i);
for (Set<LogicalJoinNode> lset : sets) {
CostCard bestCostSoFar = new CostCard();
bestCostSoFar.cost = Double.MAX_VALUE;
for (LogicalJoinNode ljn : lset) {
CostCard costCard =
computeCostAndCardOfSubplan(stats, filterSelectivities, ljn, lset, bestCostSoFar.cost, planCache);
if (costCard !=null && costCard.cost < bestCostSoFar.cost) {
bestCostSoFar = costCard;
}
}
planCache.addPlan(lset, bestCostSoFar.cost, bestCostSoFar.card, bestCostSoFar.plan);
}
}
Set<LogicalJoinNode> max = null;
int size = 0;
for (Set<LogicalJoinNode> ll : sets) {
if (ll.size() > size) {
size = ll.size();
max = ll;
}
}
if (explain) printJoins(planCache.getOrder(max), planCache, stats, filterSelectivities);
return planCache.getOrder(max);
}

测试

所有测试全部通过,测试顺序为

1
2
3
4
5
ant runtest -Dtest=IntHistogramTest
TableStatsTest
JoinOptimizerTest
ant runsystest -Dtest=QueryTest

时间记录

开始时间:2022.12.2

结束时间:2022.12.3

耗时:2天