应用竞赛树结构模拟实现外排序（C++实现）

2023-05-16

基本要求

（1）设计实现最小输者树结构 A D T ADT ADT， A D T ADT ADT 中应包括初始化、返回赢者、重构等基本操作。
（2）设计实现外排序，外部排序中的生成最初归并串以及 K K K 路归并都用最小输者树结构实现；
（3）随机创建一个较长的文件；设置归并路数以及缓冲区的大小；获得外排序的访问磁盘的次数并进行分析,可采用小文件来模拟磁盘块。

问题分析

（1）设计并实现最小输者树
（2）应用最小输者树结构实现生成初始顺串
（3）应用最小输者树结构实现 K K K 路归并
（4）随机生成较长文件，设置归并路数以及缓冲区大小，对以上操作进行分析验证。

设计思路

（1）调整输者树
将新进入输者树的节点与其父节点进行比较，把败者所对应的索引存在父节点中，赢者再与上一层的父节点进行比较，直到比较到根结点，最后将败者存在 t r e e [ 1 ] tree[1] tree[1] 中，赢者存在 t r e e [ 0 ] tree[0] tree[0] 中。
（2）构建输者树（初始化）
将初始化输者树的过程转换为调整输者树的过程，假设外部结点为 l e a v e [ 0 , … , k ] leave[0,…,k] leave[0,…,k]，先指定 k k k 为 m i n min min 即设置 l e a v e [ k ] leave[k] leave[k] 为比赛的胜者，然后将所有的内部结点置为 k k k，最后从 0 0 0 到 k − 1 k-1 k−1，重构输者树，即可得到初始输者树。
（3）利用输者树生成初始顺串
假设输者树的结点个数为 k k k，从输入集合中输入前 k k k 个元素并初始化这些元素的顺串号均为 1 1 1，建立这 k k k 个选手的最小输者树，并重复以下过程，直至所有元素都输出到对应的顺串中。
(i) 将赢者 W W W 移入它顺串号所对应的顺串中；
(ii) 若输入集合中有下一个输入元素，则 N = N= N= 下一个输入元素，否则 N = ∞ N=∞ N=∞
(iii) 如果 N N N 的元素值 ≥ W ≥W ≥W 的元素值，则 N N N 的顺串号 = W = W =W 的顺串号，否则 N N N 的顺串号 = W = W =W 的顺串号 + 1 +1 +1
(iiii) N N N 代替 W W W，重构输者树。
（4）利用输者树进行 K K K 路归并
从外部文件中读取 K ∗ B U F _ S I Z E K*BUF\_SIZE K∗BUF_SIZE 个元素，分别放入 K K K个输入缓冲区中，以每个缓冲区中的第一个元素构建输者树，每次将赢者移至输出缓冲区中，然后调整输者树。当输出缓冲区满时，将数据块写入外存，当输入缓冲区空时，再从外存中读入一个数据块，重复上述过程，直至读取到文件结束，即完成K路归并。

测试结果

以随机生成一个数据量为100000的文件进行测试为例：
在这里插入图片描述
利用内排序生成测试文件test.txt

#include <bits/stdc++.h>
using namespace std;
int main()
{
	freopen("input.txt","r",stdin);
	freopen("test.txt","w",stdout);
	int a[100010];
	int idx=0,n;
	while(cin>>n){
		a[idx++]=n;
	}
	sort(a,a+idx);
	for(int i=0;i<idx;i++){
		cout<<a[i]<<" ";
	}
}

比较temp34.txt与test.txt：
在这里插入图片描述
之后又测试了数据量分别为1000，10000，1000000的文件，结果均正确。

分析与探讨

为了进一步研究输者树应用在外排序中的优势，又进行了以下实验：
在这里插入图片描述
缓冲区大小均为1024的情况下输者树与快速排序生成的初始顺串个数对比：

外排序所需要的时间由内部排序所需要的时间、外存信息读写所需要的时间、内部归并所需要的时间三部分组成，其中减少外存信息的读写次数是提高外部排序效率的关键。而对于同一个文件来说，进行外排序所需读写外存的次数与归并趟数有关，为了减少归并趟数，可以从减少初始顺串的个数、增加归并的路数两个方面着手。

生成初始顺串
在缓冲区大小相同的条件下，利用输者树生成的初始顺串个数明显少于快速排序，这是因为快速排序得到的顺串的编号与容量是固定的，而输者树根据相应的赢者规则，每次取出胜者后，对树重构，新加入的元素的顺串号由其元素值与胜者元素值的大小关系所决定，从而尽可能通过动态比较增加了每个初始顺串的容量，使得初始顺串个数减少。
K路归并
假设归并路数为 k k k，外部文件中的数据个数为 n n n，如果只是简单地进行线性比较，取 k k k 路中的最小，那么时间复杂度为 O ( n ∗ k ) O(n*k) O(n∗k)。若采用输者树，初始化时，首先将内部节点初始化为 k k k，时间复杂度为 O ( k ) O(k) O(k)， n n n 次移入和替代赢者共需耗时 O ( n ∗ l o g k ) O(n*logk) O(n∗logk)，故总的时间复杂度为 O ( k + n l o g k ) O(k+nlogk) O(k+nlogk)，从而提高了 k k k 路归并的性能。
增加归并路数
分析以上实验数据可得，在数据量和缓冲区大小一定时，适当地增加归并路数，会减少初始顺串的数量与访问外存的次数，从而加快外排序的速度；但是当归并路数k无限增大时，输入缓冲区的个数也会随之增多，在内存短缺的情况下，每个缓冲区的容量会相应的减少，从而会使得内外存交换数据的次数增加（加载顺串到输入缓冲区的次数增多），因此 k k k 值的设定要根据实际内存的大小而定，过多地增加 k k k 值，虽然会减少归并次数，但会使程序的运行效率降低。

代码实现

#include <iostream>
#include <cstdio>
#include <cstdlib> 
#include <algorithm>
#include <cstring>
#include <string>
#include <climits>
#include <fstream>
using namespace std;

int Count;		//待排序文件的数据个数 
int BUF_PAGES;  //缓冲区有多少个page 
int PAGE_SIZE; //page的大小 
int BUF_SIZE;//BUF_PAGES*PAGE_SIZE; //缓冲区大小
char* input_file;//输入文件名  

//缓冲区 
struct Page{
	int *arr; 
	int current; //当前顺串扫描的位置 
	Page(){
		current=0;
	}
	Page(int bufsize){
		arr=new int[bufsize+1];
		current=0;
	}
};

//选手 
struct player{
	int id;//编号 
	int element;//元素 
	//赢者规则 
	friend bool operator<(const player &s1,const player &s2){
		if(s1.id!=s2.id)
			return s1.id<s2.id;
			
		else
			return s1.element<s2.element;
	}
	//重载运算符 = 
	player& operator=(const player &s){
		id=s.id;
		element=s.element;
		return *this;
	} 
}; 

//输者树 
class LoserTree{
	public:
		LoserTree(){}
		~LoserTree(){}
		//初始化 
		void Initialize(int kk);
		//返回赢者 
		int Winner(){return tree[0];}
		//重构 
		void Adjust(int s);
		void Adjust(int *,player *,int ,int);
		//快排生成顺串 
		void Generate_1(); 
		//输者树生成顺串 
		void Generate_2(); 
		//K路归并 
		void K_Merge(int start,int k);
		//外排序
		void ExternSort(int k);  
		//读数据块
		int Read_Data_Block(FILE* file, int a[], int n);
		//写数据块
		void Write_Data_Block(FILE* file, int a[], int n); 
		//获取临时文件名称 
		char* Get_Filename(int index); 
	private: 
    	int *tree;	//内部结点 
		int *leave; //外部结点          
    	int k;      //选手个数  
		int file_count;	//临时文件数目	 	
};

//获取临时文件名称 
char* LoserTree::Get_Filename(int index){
    char *tempname=new char[100];
    sprintf(tempname,"temp%d.txt",index);
    return tempname;
}

//从外存中读数据块 
int LoserTree::Read_Data_Block(FILE* file, int a[], int n){
    int i=0;
    while(i<n&&(fscanf(file,"%d",&a[i]))!=EOF)
        i++;
    return i;
}

//往外存中写数据块 
void LoserTree::Write_Data_Block(FILE* file, int a[], int n){
    for(int i=0;i<n;i++)
        fprintf(file,"%d ",a[i]);
    fprintf(file,"%d",INT_MAX);
}

//初始化 
void LoserTree::Initialize(int kk){
	k=kk;
	leave[k]=INT_MIN;
	for(int i=0;i<k;i++)
		tree[i]=k;
	for(int i=0;i<k;i++)
		Adjust(i);
}

//重构输者树 
void LoserTree::Adjust(int s){
	//父节点 
	int t=(s+k)>>1;
	//向上调整 
	while(t>0){
		if(leave[s]>leave[tree[t]]){ 
			s^=tree[t];
			tree[t]^=s;
			s^=tree[t];
		}
		t>>=1;
	}
	tree[0]=s;
}

//利用快速排序生成顺串 
void LoserTree::Generate_1(){
    FILE* fin=fopen(input_file,"rt");
    int n=0;
	file_count=0;
	//内部缓冲区 
    int* array=new int[BUF_SIZE/2];
    while ((n=Read_Data_Block(fin, array, BUF_SIZE/2))>0){
        sort(array, array+BUF_SIZE/2);
        char* filename=Get_Filename(file_count++);
        FILE* tempfile=fopen(filename,"w");
        free(filename);
        Write_Data_Block(tempfile,array,n);
        fclose(tempfile);
    }
    delete[] array;
    fclose(fin);
	cout<<file_count<<endl;
}

//重构输者树 
void LoserTree::Adjust(int *tree,player *leave,int s,int k){
	int t=(s+k)>>1;
	while(t>0){
		if(leave[tree[t]]<leave[s]){
			s^=tree[t];
			tree[t]^=s;
			s^=tree[t];
		}
		t>>=1;
	}
	tree[0]=s;	
}

//输者树生成顺串 
void LoserTree::Generate_2(){
	file_count=0;
	int K=BUF_SIZE/2;
	int Tree[K];
	player Leave[K+1];
	int OutputNum=Count/BUF_SIZE+10;
	//文件输出流 
	ofstream out[OutputNum]; 
	for(int i=0;i<OutputNum;i++){
		char* tmp=Get_Filename(i);
		out[i].open(tmp,ios::app);
	}
	//文件输入流 
	ifstream in;
	in.open(input_file);
	if(!in.is_open()){
		cout<<"错误：当前目录下该文件不存在！"<<endl;
		exit(0);
	}
	for(int i=0;i<K;i++){
		in>>Leave[i].element;
		Leave[i].id=0;		
	}  
	//初始化输者树 
	Leave[K].element=INT_MIN;
	Leave[K].id=-1;
	for(int i=0;i<K;i++)
		Tree[i]=K;
	for(int i=0;i<K;i++)
		Adjust(Tree,Leave,i,K);
	int maxid=0;
	//输者树排序 
	while (Leave[Tree[0]].element!=INT_MAX){
		int p=Tree[0];
		file_count=max(file_count,Leave[p].id);
		out[Leave[p].id]<<Leave[p].element<<" ";
		//读到文件尾 
		if(in.eof()){
			Leave[p].element=INT_MAX;
			Leave[p].id=INT_MAX;
			in.close();
		} 
		//读入新元素并确定顺串号 
		else{
			int element;
			in>>element;
			if(element<Leave[p].element){
				Leave[p].id++;
			}
			Leave[p].element=element;
		}
		//重构输者树 
		Adjust(Tree,Leave,p,K);
    }
	file_count++;		
	//添加 ∞作为终止符 
    for(int i=0;i<file_count;i++){
		out[i]<<INT_MAX;
		out[i].close();    	
	}		
}

//K路合并 
void LoserTree::K_Merge(int start,int k){
	int count=file_count-start;
	if(count<=1)
		return;
	int runs=count/k;
	int cur=file_count; 
	if(count%k!=0)
		runs++;
	int cnt=runs;
	int j=start;
	int K=k;
	//执行runs次k路归并 
	while(cnt--){
		k=min(K,cur-j);
		ifstream in[k];
		//读取当前要归并的临时文件 
		for (int i=0;i<k;i++){
			char* tmp=Get_Filename(j++);
			in[i].open(tmp);
		}		
		//输入缓冲区 
		Page Buffer[k];
		for(int i=0;i<k;i++){
			Buffer[i]=Page(BUF_SIZE);
		}
		//从外存读入缓冲区 
		for(int i=0;i<k;i++){
			int num=0;
		    while(num<PAGE_SIZE&&!in[i].eof()){
				in[i]>>Buffer[i].arr[num];
				num++;
			}
			Buffer[i].current=0;
		}
		//读取外部节点 
		for(int i=0;i<k;i++){
			leave[i]=Buffer[i].arr[0];
			Buffer[i].current++;
		}
		//初始化输者树 
		Initialize(k);
		int p;
		//输出缓冲区 
		int res[PAGE_SIZE];
		int idx=0;
		char* outputfile=Get_Filename(file_count);
		ofstream out(outputfile);	
		//输者树排序	
		while (leave[tree[0]] != INT_MAX){
			p=tree[0];  //p为当前最小关键字所在的顺串 
			res[idx++]=leave[p];
			//输出缓冲区已满，将数据块写入外存 
			if(idx==PAGE_SIZE){
				for(idx=0;idx<PAGE_SIZE;idx++){
					out<<res[idx]<<" ";
				} 
				idx=0;
			}
			//输入缓冲区已空，从外存读入数据块 
			if(Buffer[p].current>=PAGE_SIZE){
				int num=0;
			    while (num<PAGE_SIZE&&!in[p].eof()){
					in[p]>>Buffer[p].arr[num];
					num++;
				}
				Buffer[p].current=0;
				leave[p]=Buffer[p].arr[Buffer[p].current];
				Buffer[p].current++;				
			}
			//继续读取缓冲区 
			else{
				leave[p]=Buffer[p].arr[Buffer[p].current];
				Buffer[p].current++;
			}
			//重构输者树 
			Adjust(p);
        }
        //输出缓冲区剩余内容写入外存 
		for(int oidx=0;oidx<idx;oidx++){
			out<<res[oidx]<<" ";
		}
		//终止符 
		out<<INT_MAX;
		//关闭文件流 
		for (int i=0;i<k;i++){
			in[i].close();
		}
		out.close();
		file_count++;
	}
	//进行下一轮归并 
	K_Merge(j,K);
}

//外排序 
void LoserTree::ExternSort(int K){
		double time=0;  
		double counts=0;  
		LARGE_INTEGER nFreq;  
		LARGE_INTEGER nBeginTime;  
		LARGE_INTEGER nEndTime;  
		QueryPerformanceFrequency(&nFreq);  
		QueryPerformanceCounter(&nBeginTime);//开始计时 
		//生成初始顺串 
        Generate_2();	
        k=K;
        tree = new int[k];
        leave = new int[k+1];
		//K路归并		
		K_Merge(0,K);
		cout<<endl; 
		cout<<"外排序成功！"<<endl;
		cout<<"结果存于temp"<<file_count-1<<".txt"<<endl;
		QueryPerformanceCounter(&nEndTime);//停止计时  
		time=(double)(nEndTime.QuadPart-nBeginTime.QuadPart)/(double)nFreq.QuadPart;//计算程序执行时间单位为s  
		cout<<"运行时间："<<time*1000<<"ms"<<endl;
		//释放内存空间 
        delete[] tree;
        delete[] leave;	
        
}

//主程序 
int main()
{
	//创建输者树 
	LoserTree a;
	//指定输入文件
	cout<<"请输入要进行外排序的文件名称：";
	string Input_File;
	cin>>Input_File;
	input_file=(char*)Input_File.c_str();
	//指定待排序的文件大小
	cout<<"请输入待排序文件的大小（数据个数）：";
	cin>>Count;
	cout<<"请输入缓冲区大小（上限1,000,000）：";
	cin>>BUF_SIZE;
	if(BUF_SIZE<0||BUF_SIZE>1000000){
		cout<<"错误：输入缓冲区不合法！"<<endl;
		exit(0);
	}
	//指定归并路数 
	cout<<"请输入归并路数：";
	int ways;
	cin>>ways;
	BUF_PAGES=ways+1;
	PAGE_SIZE=BUF_SIZE/BUF_PAGES;
	cout<<endl;
	//执行外排序 
	a.ExternSort(ways);
	return 0;
}

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)